Virtual Videographer

We propose to build a collaborative system that automatically edits video of a lecture, taken with a small number of static cameras (typically one), into a presentation that includes changes of viewpoint and close-ups. The system would allow a human editor to review the presentation and give the system tips on how to edit portions of the video. Ideally, the system would do a good enough job that such human collaboration would be unnecessary when the expectations for the presentation are not too high.

A video of a lecture taken with a single stationary camera is not an effective way to present material. To hold a viewer's interest and to direct the viewer's attention at various points in a presentation, different viewpoints and close-ups are needed. We propose a system that uses gesture recognition to decide when to use different shots, and image processing to manipulate the images in various ways, including changing the perspective and creating tighter shots.

Traditional filmmaking would use a number of cameras to record the presentation from different angles, and would require a director to determine when to zoom in. An editor would then cut the film together from the available angles.

Our system assumes a presenter in front of a static scene such as a blackboard. This allows the system to make assumptions that greatly simplify the image analysis. The lone presenter can be isolated, and the image of the presenter can be manipulated. If the presenter is obscuring part of the blackboard, the image of the presenter could be partially dissolved so that the information on the board can be seen. Further, since the system would have the entire presentation to work with, it would be able to capture information written on the board in later frames and display it earlier.
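The partial dissolve and the use of later board content could be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: frames are small grayscale grids, the presenter mask is assumed to come from a separate segmentation step, and the names `dissolve_presenter`, `Frame`, `Mask`, and `opacity` are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # grayscale image: rows of intensities in [0, 1]
Mask = List[List[bool]]    # True where the presenter covers the pixel (assumed given)


def dissolve_presenter(frame: Frame, board: Frame, mask: Mask,
                       opacity: float = 0.4) -> Frame:
    """Blend the presenter over a clean image of the board.

    Pixels outside the mask are copied from `board`. Pixels inside the
    mask mix the current frame with the board, weighted by `opacity`
    (1.0 keeps the presenter solid, 0.0 removes the presenter).
    """
    if not 0.0 <= opacity <= 1.0:
        raise ValueError("opacity must be in [0, 1]")
    out: Frame = []
    for frame_row, board_row, mask_row in zip(frame, board, mask):
        out.append([
            opacity * f + (1.0 - opacity) * b if m else b
            for f, b, m in zip(frame_row, board_row, mask_row)
        ])
    return out
```

In this sketch, passing a board image taken from a later frame makes writing appear on the board before the presenter has produced it.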
For example, if a presenter is deriving a proof, the complete proof taken from a future frame could be shown on the board. This would enable the viewer to see the complete proof while watching the presenter step through its derivation.

In cases where the presenter is not referring to information on the blackboard, a tighter shot of the presenter or the presenter's face may be appropriate, for instance when the presenter faces the viewer for an extended period and is not gesturing at the board. Simple face detection, such as color classification [1], would allow the system to avoid close-ups of the back of the presenter's head.

In addition to digitally zooming in on items of interest, the system can alter the viewpoint so that different perspectives add variety to the presentation. Shots that contain only the board can be shown from a straight-on perspective. When the presenter's face is turned away from the viewer, the presenter could be partially dissolved, under the assumption that the presenter may be obscuring information on the board.

If the presenter is pointing at an example on the board, a close-up of that example might be appropriate. This would involve identifying that the presenter is pointing, identifying what is being pointed at, and manipulating the images to zoom in on or highlight that area, as well as changing the perspective to show a straight-on view.

The system would edit a complete video, including an audio track. Once the system has done its automatic editing, the presentation could be used as is or, if needed, reviewed by a person, who could give the system tips on close-ups to add or remove, or viewpoint changes to make.

The development of the system is, appropriately, a systems effort, requiring the integration of a number of pieces.
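As an illustration of one such piece, the close-up rule described above (face toward the viewer for an extended period, no gesturing at the board) might be sketched as follows. This is a minimal sketch under assumptions: the per-frame cues are taken as given by the vision stage, and the names `FrameObservation`, `close_up_frames`, and `min_run` are hypothetical. Because the whole recording is available, the rule can look at complete runs of frames rather than deciding frame by frame.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameObservation:
    """Per-frame cues, assumed to be supplied by the vision stage."""
    face_visible: bool  # e.g. a face was found by color classification
    pointing: bool      # the presenter is gesturing at the board


def close_up_frames(observations: List[FrameObservation],
                    min_run: int = 3) -> List[bool]:
    """Mark frames that lie in a run of at least `min_run` consecutive
    frames where the face is visible and the presenter is not pointing."""
    eligible = [o.face_visible and not o.pointing for o in observations]
    marks = [False] * len(eligible)
    start = 0
    while start < len(eligible):
        if not eligible[start]:
            start += 1
            continue
        # Find the end of the current run of eligible frames.
        end = start
        while end < len(eligible) and eligible[end]:
            end += 1
        if end - start >= min_run:
            for i in range(start, end):
                marks[i] = True
        start = end
    return marks
```

The `min_run` threshold stands in for "an extended period"; a real value would depend on the frame rate and on what the manual proof of concept shows to be watchable.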
Our goal is to get something working for each of the pieces (a user interface, the ability to handle video and audio, vision, shot selection, etc.). The system will then provide a platform for exploring issues in computer vision, automatic editing, presentation, and special effects for presentation. Depending on how the project evolves, we may draw more on computer vision to improve the identification of where cuts should be made, or more on educational science to improve our theory of which cuts are most appropriate.

The graphics group is building a toolkit to support video-based applications. The toolkit, together with video digitizing hardware and a set of lecture tapes, both of which the graphics group already possesses, will be used in the development of our system.

The following is an outline of the project:

1. Create a proof of concept, where a human does the work of our proposed system. This will help us understand what can be achieved with limited video material and help us understand the cinematography issues involved.
2. Create a demo of some of the video editing techniques as determined by the virtual videographer, showing that what was done manually in step one can be automated.
3. Build tools for displaying and modifying the data that controls the edits made by the virtual videographer.
4. Bring all the pieces together as a first pass at the complete system.
5. As time allows, revisit the pieces one at a time (these are not ordered):
   - focus on intelligence (making better shot selection choices)
   - focus on the output (special effects, view generation)
   - focus on the input (segmentation, tracking, "pointing" detection)

The fifth step is entirely open-ended. Assuming time allows work on it, which part or parts of it are worked on will depend on what is learned in the other four steps of the project.

[1] M. Hunke and A. Waibel. Face locating and tracking for human-computer interaction.
In "Proceedings of the Twenty-Eighth ACSSC 94", 1994.