Computer vision and dance are natural partners in interactive setups involving either professional performers or an "amateur" audience. Computer vision is the science of teaching computers to see. But computers are machines, and any analogy to the human perceptual apparatus will always be just that: an analogy.[1]
My experiments involving realtime video analysis of human motion all revolve around one question: what does the machine see? In my interactive setups, I apply well-known computer vision techniques to a video signal recording human actors, trying to give a glimpse into the underlying processes, visualizing basic recognition methods and producing imagery that exemplifies the computer's "perception".
Any input for a computer has to be digitized, converting continuous streams of data into discrete units that are easily digestible for algorithms. The bit, the minimal unit of information, is the basic unit of any digital process. It is combined with other bits (usually a great many of them) to form higher-level abstractions. So how, exactly, do we get from just zeros and ones to a stream of video?
It's not actually that hard. Start with a digital image: it is made up of pixels, tiny square colored parts of the rectangle. The resolution (number of pixels) has to be high enough that our own (human) apparatus can ignore this discreteness and convert the quanta of light entering our eyes into what we perceive as an "image".
Every pixel usually consists of three components: individual numeric values describing the amount of red, green and blue in the pixel's color.[2] These numeric values are what a computer can work with, since numbers can easily be expressed in bits.[3]
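To make the step from color to bits concrete, here is a minimal sketch in Python. It assumes eight bits per color component, which is a common choice but not something stated above; the function names `pack_rgb` and `unpack_rgb` are made up for this illustration.

```python
def pack_rgb(red, green, blue):
    """Pack three 8-bit components (0..255) into one 24-bit integer."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError("components must be in the range 0..255")
    return (red << 16) | (green << 8) | blue


def unpack_rgb(packed):
    """Recover the (red, green, blue) components from a 24-bit integer."""
    return (packed >> 16) & 0xFF, (packed >> 8) & 0xFF, packed & 0xFF


# A strong orange: full red, half green, no blue.
orange = pack_rgb(255, 128, 0)
print(format(orange, "024b"))  # 111111111000000000000000
print(unpack_rgb(orange))      # (255, 128, 0)
```

The printed string of zeros and ones is all the machine ever holds; "orange" exists only in our reading of it.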
To get video, we arrange a number of those images in rapid succession. A common rate is twenty-five frames (individual images) per second, as that seems to fit the "refresh rate" of our eyes. With far fewer (say, only five images per second), we perceive the motion as "choppy".
So we can say that computers, at least in the context of video, handle both space and time "discretely". The (two-dimensional) space of a video image is made up of individual pixels (usually already at the camera), and individual frames are images at certain points of time. There is nothing in between those pixels, or in between those frames - and that's why it's called discrete (as opposed to "continuous").
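One way to picture this double discreteness is a nested list indexed by frame, row and column. The sketch below is only an illustration: the resolution of 640 by 480 pixels and the eight bits per component are assumptions, not values taken from my setups, and both function names are hypothetical.

```python
def blank_video(frame_count, width, height):
    """Build a tiny all-black video as video[frame][y][x] = (r, g, b)."""
    return [
        [[(0, 0, 0) for _ in range(width)] for _ in range(height)]
        for _ in range(frame_count)
    ]


def raw_bits_per_second(width, height, fps, bits_per_component=8, components=3):
    """Number of bits needed for one second of uncompressed video."""
    return width * height * components * bits_per_component * fps


video = blank_video(frame_count=2, width=4, height=3)
# Only whole-number indices exist: there is no frame 0.5 and no pixel 1.5.
print(video[1][2][3])

# Assumed example: 640 x 480 pixels at 25 frames per second.
print(raw_bits_per_second(640, 480, 25))  # 184320000
```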
We find that the computer does not, per se, "understand" any of the concepts that seem so natural to us (image, motion, color); it's all just bits to it. But we can create software that enables the computer to build higher-level abstractions, so that it not only records and plays some video, but also performs at least a basic analysis of the content of the video stream.
Most of my experiments start from the same process: producing a difference image. The aim is to differentiate between the foreground and the background of a video stream. The process is mathematically surprisingly simple: we take a single image of the background, without any object in the foreground. Then, for each individual video frame, we compare each pixel to the pixel at the same position within the background image. The "comparison" is nothing more than the absolute value of a subtraction of the color values of the corresponding pixels. The result is, of course, again a number, and if we have the computer interpret that number as a brightness value of some new image, we get a grayscale image that "describes" the difference between the analyzed frame and the background image: pixels that are very similar to the background turn out black, and those that are very different become white. Thus, the white parts of this image represent the object in the foreground, which is (usually) what we are really interested in. We can further simplify that representation by applying a threshold function, so that the result is a binary difference image in which every pixel is either completely black (if below the threshold value) or completely white (if above). To this image we can apply higher-level analysis.
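The two steps, subtraction and threshold, can be sketched in a few lines of Python. This is a simplified illustration, not my actual software: it assumes grayscale images stored as lists of rows with values from 0 to 255, and the threshold of 30 is an arbitrary example value.

```python
def difference_image(frame, background):
    """Absolute per-pixel difference between a frame and the background."""
    return [
        [abs(pixel - reference) for pixel, reference in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def threshold(image, level):
    """Turn a grayscale image into a binary one: 255 above level, else 0."""
    return [[255 if value > level else 0 for value in row] for row in image]


# An empty scene, and a frame with a bright object in the middle.
background = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
frame = [
    [10, 12, 10, 10],
    [10, 200, 180, 10],
    [10, 190, 10, 9],
]

diff = difference_image(frame, background)
binary = threshold(diff, 30)
for row in binary:
    print(row)
```

Note how the small deviations (12 instead of 10, 9 instead of 10), which in practice come from camera noise, disappear once the threshold is applied.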
One of the simplest analysis procedures is connected components analysis: it identifies "blobs", that is, groups of white pixels that are connected to each other by neighborhood. While not as simple as producing the difference image itself, the process is still relatively easy to implement: you could start from the upper left pixel and search, left to right, line by line, for the first white pixel. From there, you would traverse all neighbours that are also white, and from these their white neighbours in turn. In the process, you would turn the analyzed pixels black (so they won't be recognized again) and accumulate some basic information about the "component": the number of pixels that neighbour each other defines its weight, and the positions of the contained pixels define its extent and position in space. That is already quite a bit of information, which one can use for various purposes.
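The procedure described above can be sketched as follows. The code assumes a binary image stored as a list of rows (0 for black, anything else for white) and counts only the four direct neighbours of a pixel as connected; both are choices made for this illustration, as are the dictionary keys used for the result.

```python
def connected_components(binary):
    """Find blobs of white pixels; return weight, bounding box, centroid."""
    image = [row[:] for row in binary]  # work on a copy, we paint it black
    height = len(image)
    width = len(image[0]) if height else 0
    blobs = []
    for y in range(height):            # line by line ...
        for x in range(width):         # ... left to right
            if image[y][x] == 0:
                continue
            image[y][x] = 0            # turn black so it is not found again
            stack = [(x, y)]
            xs, ys = [], []
            while stack:
                cx, cy = stack.pop()
                xs.append(cx)
                ys.append(cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < width and 0 <= ny < height and image[ny][nx]:
                        image[ny][nx] = 0
                        stack.append((nx, ny))
            blobs.append({
                "weight": len(xs),                             # pixel count
                "bbox": (min(xs), min(ys), max(xs), max(ys)),  # extent
                "centroid": (sum(xs) / len(xs), sum(ys) / len(ys)),
            })
    return blobs


binary = [
    [0, 255, 255, 0, 0],
    [0, 255, 0, 0, 0],
    [0, 0, 0, 0, 255],
    [0, 0, 0, 255, 255],
]
for blob in connected_components(binary):
    print(blob["weight"], blob["bbox"])
```

An explicit stack is used instead of recursion so that large blobs do not run into Python's recursion limit.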
A different process is contour analysis. It is rather similar to connected components analysis, but instead of traversing all pixels of a component, we are only interested in the outermost pixels, the ones that define the edge of the component. This edge is represented as a list of vectors that together describe the contour around the component. We can apply geometric analysis to the detected contours, or use them for some simple visualization effects.
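As a rough illustration of what "outermost pixels" means, the following sketch collects every white pixel that touches at least one black pixel or the image border. It is a deliberate simplification: it returns an unordered list of edge pixels and does not chain them into a sequence of vectors, as a full contour tracer would. The function name is hypothetical.

```python
def contour_pixels(binary):
    """Return (x, y) of white pixels with a non-white 4-neighbour."""
    height = len(binary)
    width = len(binary[0]) if height else 0

    def is_white(x, y):
        # Positions outside the image count as "not white".
        return 0 <= x < width and 0 <= y < height and binary[y][x] != 0

    edge = []
    for y in range(height):
        for x in range(width):
            if not is_white(x, y):
                continue
            neighbours = ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
            if not all(is_white(nx, ny) for nx, ny in neighbours):
                edge.append((x, y))
    return edge


# A filled 3 x 3 square: only its centre pixel is not part of the edge.
binary = [
    [0, 0, 0, 0, 0],
    [0, 255, 255, 255, 0],
    [0, 255, 255, 255, 0],
    [0, 255, 255, 255, 0],
    [0, 0, 0, 0, 0],
]
edge = contour_pixels(binary)
print(len(edge))
```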
The analysis described above mostly applies to single frames of a video stream; the exception is the initial difference image, which compares the current frame to an image taken at some earlier time.
If we now do the exact same comparison (i.e., subtraction), not between the current frame and an image of the empty scene, but between the current frame and the previous image in the succession of frames, we get a very different result: still a grayscale image, but the white areas now describe not the presence of some object in the scene, but motion. Where there is a white pixel, something has moved. It does not tell us anything about the direction or kind of motion; it simply expresses "something has changed". This simplicity is exactly what the machine needs to do its analysis.
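A minimal sketch of this frame-to-frame comparison, again assuming grayscale frames stored as lists of rows and an arbitrary example threshold; the generator name `motion_images` is made up for this illustration.

```python
def motion_images(frames, level):
    """Yield a binary motion image for each frame after the first one."""
    previous = None
    for frame in frames:
        if previous is not None:
            yield [
                [255 if abs(now - before) > level else 0
                 for now, before in zip(frame_row, prev_row)]
                for frame_row, prev_row in zip(frame, previous)
            ]
        previous = frame


# A bright pixel moves one step to the right, then stays where it is.
frames = [
    [[0, 200, 0, 0]],
    [[0, 0, 200, 0]],
    [[0, 0, 200, 0]],
]
for motion in motion_images(frames, level=30):
    print(motion)
```

The first motion image is white both where the object was and where it now is; the second is entirely black, because an object that stands still is invisible to this method.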
We can apply the same analysis processes (connected components and contour analysis) to the motion image, too, which gives us quite different information. Or we can do even simpler things: for example, aggregate the pixel values over a certain rectangular area and trigger some process when there are enough white pixels in it. This basically forms an "invisible button", a sensor that can be triggered by waving an object in the right area of the video image.
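Such an "invisible button" could look like the sketch below. The rectangle is given by its left and top edges (inclusive) and its right and bottom edges (exclusive); this convention, the parameter names and the minimum count are assumptions chosen for the example.

```python
def button_pressed(motion, left, top, right, bottom, min_white):
    """True if enough white pixels fall inside the given rectangle."""
    white = sum(
        1
        for row in motion[top:bottom]
        for value in row[left:right]
        if value
    )
    return white >= min_white


# Motion in the upper right corner of a 4 x 4 motion image.
motion = [
    [0, 0, 255, 255],
    [0, 0, 255, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(button_pressed(motion, left=2, top=0, right=4, bottom=2, min_white=3))
print(button_pressed(motion, left=0, top=2, right=2, bottom=4, min_white=1))
```

The first call reports a press, the second does not: nothing moved in the lower left corner.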
Motion analysis as described above already relates to time. But we can do more: a simple tracker would use the output of connected components analysis and compare the positions and sizes of components in consecutive frames. In all likelihood, a similarly sized component at a similar position will be the same object that has moved a bit. If we repeat this process over time, we have our tracker: it tracks the position of objects in space, over time. We can derive speed and acceleration from this information, opening up a plethora of possible visualizations and playful experiments with the data gained.
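The matching step might be sketched like this. Blobs are assumed to be dictionaries with a "centroid" and a "weight" of at least one pixel; the greedy nearest-neighbour strategy and the limits `max_distance` and `max_weight_ratio` are illustrative assumptions rather than a description of my own tracker.

```python
import math


def match_blobs(previous, current, max_distance=20.0, max_weight_ratio=1.5):
    """Map the index of each current blob to the closest similar previous one."""
    matches = {}
    used = set()
    for ci, cur in enumerate(current):
        best, best_distance = None, max_distance
        for pi, prev in enumerate(previous):
            if pi in used:
                continue
            ratio = (max(cur["weight"], prev["weight"])
                     / min(cur["weight"], prev["weight"]))
            if ratio > max_weight_ratio:
                continue  # too different in size to be the same object
            distance = math.hypot(cur["centroid"][0] - prev["centroid"][0],
                                  cur["centroid"][1] - prev["centroid"][1])
            if distance <= best_distance:
                best, best_distance = pi, distance
        if best is not None:
            used.add(best)
            matches[ci] = best
    return matches


def velocity(prev, cur, fps):
    """Speed in pixels per second along x and y between two frames."""
    (px, py), (cx, cy) = prev["centroid"], cur["centroid"]
    return ((cx - px) * fps, (cy - py) * fps)


previous = [{"centroid": (10, 10), "weight": 50},
            {"centroid": (40, 12), "weight": 20}]
current = [{"centroid": (42, 12), "weight": 22},
           {"centroid": (13, 14), "weight": 48}]

for ci, pi in match_blobs(previous, current).items():
    print(pi, "->", ci, velocity(previous[pi], current[ci], fps=25))
```

Acceleration follows in the same way, by comparing the velocities obtained from two consecutive pairs of frames.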
As the computer is relatively indifferent to what kind of data a certain set of bits expresses, we can also accumulate a period of time in a single video frame. We can show the output of contour analysis just as above, but draw not only the detected contours of a single video frame but also those spanning a few seconds in time. The individual image becomes a snapshot of time, but an ever-changing one, as it is continuously updated to represent the very last few seconds.
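One possible way to keep "the very last few seconds" is a buffer of fixed length that silently drops its oldest entry whenever a new one arrives. The sketch below uses Python's `collections.deque` for this; the class name `ContourTrail` and its parameters are hypothetical, and contours are assumed to arrive as lists of (x, y) pixel positions.

```python
from collections import deque


class ContourTrail:
    """Accumulate the contours of the last few seconds in one image."""

    def __init__(self, width, height, seconds=3, fps=25):
        self.width = width
        self.height = height
        # Once full, appending a new frame discards the oldest one.
        self.history = deque(maxlen=int(seconds * fps))

    def add(self, contour_points):
        """Store the contour pixels detected in the newest frame."""
        self.history.append(list(contour_points))

    def render(self):
        """Draw all remembered contours into a single binary image."""
        image = [[0] * self.width for _ in range(self.height)]
        for points in self.history:
            for x, y in points:
                if 0 <= x < self.width and 0 <= y < self.height:
                    image[y][x] = 255
        return image


# A trail that remembers only two frames, fed with a point moving right.
trail = ContourTrail(width=4, height=1, seconds=1, fps=2)
for x in range(3):
    trail.add([(x, 0)])
print(trail.render())  # [[0, 255, 255, 0]]
```

The oldest position has already been forgotten, while the two most recent ones share a single image.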
These few effects constitute the basic possibilities of my software. There is more to be explored, of course, and over the last few years I have increasingly tried to simplify my results, trying to exhaust individual properties of how the computer handles video. It has turned out that, at least sometimes, the technically simplest effects can provide the most powerful tools for the work of performers, and this is exactly what I want to work on with BADco[4] this year.