MIDSHIP (Managing Image Data sets with Scalable HIgh Performance)

MIDSHIP is an NSF-supported CISE Research Infrastructure Grant to the Computer Sciences Department at the University of Wisconsin. Over the period 1996-2001, this grant will fund the purchase of infrastructure to support research in the management of image data.

For more information, please contact the project directors:
Jeff Naughton (naughton@cs.wisc.edu)
James Larus (larus@cs.wisc.edu)

Overview

Today the technical and non-technical literature is full of descriptions of applications that require the storage, management, and processing of huge image data sets. In the near future, these applications will grow dramatically in both size and number. Unfortunately, current technology cannot handle them; MIDSHIP seeks to develop the technological underpinnings they need. The goal of this project is to use NSF funding, along with institutional matching and corporate donations, to purchase the infrastructure for research focused on developing technology to support a wide range of large-scale image applications.

Perhaps the most visible example of such an application is the NASA EOSDIS project. EOSDIS starts with a number of satellites, each of which will collect and transmit a huge amount of image data--projections call for more than a terabyte per day. Current technology can only dump these images to tape and warehouse the tapes. If this happens, earth scientists will not be able to query or process more than a tiny fraction of this data set. The result would be an effectively write-only data set, with its wealth of expensive information going virtually untapped.

Although EOSDIS is the premier example of a large image data set application, it is far from the only one. In fact, much of the motivation for this project arose from our university, which is awash in images collected by a diverse set of researchers.

While the full scope of how such a system could be used is hard to predict, it is clear that such systems will have to be scalable. Demands for scalability arise in at least three ways:

  1. Data size.

    Already NASA has proposed to build and use a petabyte image database; even if not many petabyte databases are built in the near future, many terabyte image data sets will be. These huge data sets will arise from scientific, medical, and commercial applications.

  2. Number of users.

    Some image data sets will be valuable to large numbers of users. A hint of the potential workload on such systems is given by the current workload on popular web sites (for example, the NCSA web site reports over 40 connections per second). Even if these users are running simple queries to access images, their aggregate computational demands will be huge.

  3. Complexity of user requests.

    For queries that require complex processing of images, and especially those that require comparative analysis of large sets of images, satisfying even a single user's query will require vast computational resources.

Clearly, image management is a broad area that encompasses far more research than one team of researchers can hope to cover. Accordingly, in the MIDSHIP project we have chosen to focus on the areas of image management to which we bring the greatest expertise, with a primary emphasis on scalability.

Obviously, a system that supports scalable image management must have scalable hardware resources. In our opinion (an opinion backed by the majority of parallel hardware vendors today), the best hardware option for managing such large data sets is a cluster of high-end SMP servers. A cluster of SMPs presents a hybrid programming model (shared memory within an SMP, message passing between SMPs). A significant aspect of the MIDSHIP project will be the development of hardware and system software to provide an efficient unified programming environment on top of such a hybrid system.
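
As a rough illustration of the hybrid model described above, the following Python sketch simulates it inside a single process. It is not MIDSHIP software, and every name in it (node_sum, cluster_sum, tiles, outbox) is hypothetical. Worker threads inside one simulated "node" update a shared total under a lock (shared memory within an SMP), while nodes hand their results to the coordinator only through a queue (message passing between SMPs). Summing integers stands in for real per-image work.

```python
import queue
import threading


def node_sum(node_id, tiles, outbox, workers=2):
    """One simulated SMP node: worker threads share `total` under a lock."""
    total = 0
    lock = threading.Lock()

    def work(chunk):
        nonlocal total
        partial = sum(chunk)      # stand-in for per-tile image processing
        with lock:                # shared-memory update inside the node
            total += partial

    threads = [threading.Thread(target=work, args=(tiles[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    outbox.put((node_id, total))  # the only communication between nodes


def cluster_sum(tiles, nodes=3):
    """Split tiles across simulated nodes and combine their messages."""
    outbox = queue.Queue()
    node_threads = [
        threading.Thread(target=node_sum, args=(n, tiles[n::nodes], outbox))
        for n in range(nodes)
    ]
    for t in node_threads:
        t.start()
    for t in node_threads:
        t.join()
    # Exactly one message arrives per node; add up the partial totals.
    return sum(outbox.get()[1] for _ in range(nodes))
```

The programmer has to manage two mechanisms at once here, a lock inside each node and a queue between nodes. That double bookkeeping is the kind of burden a unified programming environment would aim to hide.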

We are developing the Paradise database system as part of NASA's effort to manage its immense EOSDIS data sets. Many users will interact with Paradise directly, using it to store and query their data sets. One focus of our work for MIDSHIP will be to extend the Paradise query language to better support image applications and queries. The other focus will be to explore how parallel database systems in general, and image database systems in particular, should take advantage of hybrid cluster-of-SMPs hardware.

In addition to the Paradise portion of the project, there are two other database-related parts to MIDSHIP. The first is research into quality-controlled image compression and support for processing data directly from tertiary storage; the second is research into query by image content. We also envision that other users will use Paradise as a server of image data for their image processing programs. To represent this class of user, MIDSHIP encompasses three efforts that take very different approaches to extracting information from images. Mangasarian leads research into applying mathematical programming to a particularly important medical image analysis problem -- cancer diagnosis and prognosis. Shavlik is exploring the use of machine learning techniques to classify images automatically. Finally, Dyer is applying computer vision technology to interpret images and sequences of images. All of these applications will need scalable hardware to solve large problem instances and will use the cluster of SMPs.

Finally, a thread running through this entire project is performance. Parallel performance is notoriously difficult to measure and analyze; for this reason we have included the Paradyn parallel program analysis effort, led by Miller, in MIDSHIP.

NSF Progress Reports

Year 1 Report

Modified by: James Larus
Tue Apr 8 17:08:46 CDT 1997