International Integrated Microdata Access System

Steven Ruggles, Director, Minnesota Population Center, University of Minnesota
Catherine Fitch, Research Assistant, Minnesota Population Center, and IPUMS Coordinator

This paper describes a new NSF infrastructure project to create and disseminate an integrated international census database composed of high-precision, high-density samples of individuals and households. Our project has two components. First, we propose to collect data that will support broad-based investigations into the most important scientific questions facing social and behavioral science. Second, we will create a web-based data dissemination system that will incorporate innovative capabilities for worldwide access to both metadata and microdata.

Large machine-readable census microdata samples exist for many countries around the world, but access to these data is highly limited and the documentation is often inadequate. Even where such microdata are available for scholarly research, comparisons across countries or time periods are difficult because of inconsistencies in both data and documentation. This project will provide basic infrastructure for the social sciences by making the samples publicly available, converting them to a consistent format, supplying comprehensive documentation, and developing new web-based tools for disseminating the microdata and documentation over the Internet.

The Internet is transforming the nature of electronic data dissemination. At the same time, the proliferation of fast personal computers and UNIX workstations has slashed the cost of large-scale data analysis. This project capitalizes on both of these developments by creating a population database of unprecedented size and power, and by providing tools to make it readily available for analysis on desktop machines.

The project builds on our experience with the Integrated Public Use Microdata Series (IPUMS). The IPUMS is a coherent series of individual-level U.S.
census data drawn from thirteen census years between 1850 and 1990. By putting all the census samples in a compatible format with consistent variable codes and integrating their documentation, the IPUMS greatly simplifies the use of multiple census years. Just as important, we have developed methods of electronic dissemination that have democratized access to these resources (http://www.ipums.umn.edu).

The original IPUMS project includes 22 samples drawn from one country, the United States. It contains 65 million records totaling 25 gigabytes when uncompressed. Although this is one of the world’s largest public-use databases, it is modest by comparison with our new endeavor: we plan to build a database with some 650 samples drawn from 21 countries on six continents, and it will include about 550 million records requiring some 250 gigabytes in uncompressed form. We will need to write the equivalent of about 14,000 pages of documentation, compared with 3,000 pages in the current IPUMS. This increase in scale will necessitate a proportionate increase in complexity, so we will develop new navigation and extraction tools to keep access to the data and documentation simple.

Our task is not merely to convert hundreds of additional samples into IPUMS format. Because of international variation in census concepts such as "group quarters," and cultural concepts such as race and marital status, we will need to design the database from the ground up. This design process will be undertaken in close collaboration with international and domestic microdata experts. The basic design goals, however, remain the same as in the original IPUMS: we will create a system that simplifies use of the data and at the same time loses no meaningful information except when necessary to protect respondent confidentiality.

The project will incorporate domestic and international data from a variety of sources. We will start with the U.S.
census samples for the period 1850 through 1990 in the current IPUMS. We will then add further domestic samples to allow detailed study of the U.S. population in the late twentieth and early twenty-first centuries. Specifically, we will incorporate 528 monthly samples of the Current Population Survey (CPS) for the period 1964 through 2008, the 2000 Census Public Use Microdata Sample (PUMS), and the American Community Surveys (ACS) for the period 2000 through 2008. With these additions, the database will have a much stronger contemporary focus than the current IPUMS, and will be especially useful for national and local studies addressing policy questions.

The international component of the database falls into two categories. For some countries, we will incorporate public-use census or survey samples that already exist, just as we have done for the United States. These data are generally well documented, but they will pose complexities we have not previously encountered because of national differences in census concepts, cultural practices, and language. For other countries, no public-use census files presently exist. In these instances, we will create new anonymized samples drawn from surviving census tapes that were used to construct census tabulations for publication. In collaboration with the statistical offices of the countries concerned, we will explore new techniques to ensure full respondent confidentiality while maximizing detail. These data files are often poorly documented, and we will require extensive assistance from the statistical offices and experts of each country to ensure that we interpret them correctly.

The development of metadata is central to the project and poses even greater challenges than the manipulation of the microdata. For every census and country we aim to provide comprehensive documentation that meets or exceeds the standards of the U.S. Census Bureau. The metadata will not be confined to codebooks and census questionnaires.
As in the case of the existing IPUMS, we will provide a wide variety of ancillary information to aid in the interpretation of the data, including full detail on sample designs and sampling errors, procedural histories of each dataset, full documentation of error correction and other post-enumeration processing, and analyses of data quality.

Both the data and the documentation will be distributed through an integrated data access system on the Internet. Users will extract customized subsets of both data and documentation tailored to their particular research questions. This will not, however, simply be a data extraction system. Rather, it will be a set of tools for navigating documentation, defining datasets, constructing customized variables, and adding contextual information. The most difficult task will be to provide a system whereby users can easily gauge the comparability of a particular variable in any sample to its counterpart variable in any other sample. Given the large number of samples, this level of documentation would be so unwieldy as to be virtually unusable in printed form. Accordingly, we will develop software that will construct electronic documentation customized for the needs of each user.
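
To make the idea of customized documentation concrete, the following Python sketch shows one way such software might work. It is an illustration under stated assumptions, not a description of the system we will build: the record layout (`VariableNote`), the function names, the sample identifiers, and all metadata entries are hypothetical and invented for this example. The sketch stores one documentation entry per variable per sample, keeps only the entries a user has requested, and flags a simple comparability problem (a differing universe) between two samples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableNote:
    """One documentation entry: how a variable was recorded in one sample."""
    sample: str     # hypothetical sample identifier
    variable: str   # hypothetical harmonized variable name
    universe: str   # population to which the census question applied
    note: str       # free-text comparability caveat


# Made-up metadata for illustration only; not actual census documentation.
NOTES = [
    VariableNote("country_a_1980", "MARST", "persons aged 15+",
                 "Consensual unions coded separately."),
    VariableNote("country_b_1980", "MARST", "persons aged 12+",
                 "Consensual unions coded as married."),
    VariableNote("country_a_1980", "AGE", "all persons",
                 "Single years of age."),
    VariableNote("country_b_1980", "AGE", "all persons",
                 "Single years of age."),
    VariableNote("country_c_1990", "MARST", "persons aged 15+",
                 "Widowed and divorced combined."),
]


def select_notes(notes, samples, variables):
    """Keep only the entries for the samples and variables a user requested."""
    samples, variables = set(samples), set(variables)
    return [n for n in notes if n.sample in samples and n.variable in variables]


def universe_differs(notes, variable, sample_a, sample_b):
    """Report whether a variable's universe differs between two samples."""
    by_sample = {n.sample: n for n in notes if n.variable == variable}
    return by_sample[sample_a].universe != by_sample[sample_b].universe


def render(notes):
    """Format entries as plain-text documentation, grouped by variable."""
    lines = []
    for variable in sorted({n.variable for n in notes}):
        lines.append(variable)
        matching = sorted((n for n in notes if n.variable == variable),
                          key=lambda n: n.sample)
        for n in matching:
            lines.append(f"  {n.sample}: universe = {n.universe}; {n.note}")
    return "\n".join(lines)


if __name__ == "__main__":
    chosen = select_notes(NOTES, ["country_a_1980", "country_b_1980"], ["MARST"])
    print(render(chosen))
```

A user interested only in marital status in two samples would thus receive two documentation entries rather than the full set, which is the property that matters as the number of samples grows into the hundreds. A production system would need far richer metadata (code-by-code crosswalks, questionnaire text, sampling information) than this sketch models.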