This document is currently being revised (May 3, 2001).

DATA LIBRARY-- INTRODUCTION TO DATA HANDLING

Select a topic if you wish to ZOOM directly to a particular section within this document

[ INTRODUCTION ] [ BASIC CONCEPTS ]
[ READING AND USING A CODEBOOK]
[ EXTRACTION - RECTANGULAR ] [ EXTRACTION - MULTIPLE OBS. ]
[ EXTRACTION - HIERARCHICAL ] [ EXTRACTION - MERGING ]
[ TRANSPORTING DATA ] [ OTHER STUFF ]

NOTE: Example SAS, SPSS and STATA command files are located at: ftp://ftp.src.uchicago.edu/data/programs.

Please feel free to copy these files for your own use.


INTRODUCTION

For most, the "Introduction to Data" occurs in the context of an introductory or advanced course in statistical analysis. Typically, the data you have used has been preformatted by a TA or RA for use with a particular statistical package such as SPSS, SAS or STATA. Once you embark upon your own research, you will rapidly discover that few datasets come prepared for immediate access by your favorite application. In fact, you will find that getting your data in an appropriate form for analysis is oftentimes more involved than the most complex statistical analyses you will employ!

This handout is intended to introduce you to the basics required to converting "raw" data into a dataset to be used by a statistical application, specifically SPSS or SAS. It illuminates topics such as reading a codebook, identifying data structures, and developing programs for reading raw data into a statistical application.

As a starting point, assume that you have identified a dataset containing information you would like to analyze. These data consist of a number of measured attributes--called variables--each describing a set of observations. In the case of a survey, the observations are typically individual respondents and the variables are responses solicited from questions about attitudes, behaviors and traits.

Further assume that you have received this data in a "raw" format--this simply means that the data are in a text file that may look like this:

H000000118560010080999700 63001402000005009300001520000141112111520006
P000000100000182000180000000221000000000228020000110939999256001002000
P000000101100177000180000000221040000000228031000110939999256001002000
H000000218560010080999700 57002201000002004222100000000531112111721001
P000000200100180100210000000103040000000203056000105020999100000002000
H000000318560010080999700 07002204000002009122180000000251113111821004
P000000300000135000220000000221000000000495017000117032051256001002000
P000000301100132000220000000221020000000495017000115032142256001002000
P000000302000107400180000111220101000000495031000203032999256001001625
P000000302000109400140000111220101000000495046000204032999256001002000

Your task will be to map these numbers and characters to variables and observations such that a statistical application such as SAS or SPSS can read and analyze the data.


BASIC CONCEPTS

Before getting started, it is essential that you understand the technical language of data. The most basic concepts are the record and the field. A record is simply a line. In some cases, the record contains all of the information about an observation, but as is noted below that isn't always true. A field is a column or columns containing the data for a specific variable. This assumes, of course, that your data is in a fixed column format--meaning that the values for a particular variable are in the same column(s) for all observations--which is true for 90% of all data files you will encounter. The alternative is a variable format in which the variables a found in a specified sequence within the file but are not in the same locations for each observation.

The relationship between records and fields constitutes a "data structure". The four most common structures are as follows:

  1. Logical Record or Rectangular structure.
    This means that each line of data or record contain all of the variables for a single observation.
  2. Multiple Record or Card-image Structures.
    This means that several lines of data contain all of the variables for a particular observation. Card-image structures are those in which there is a fixed record length of 80 characters.
  3. Hierarchical Structure.
    This type of data contains multiple levels of related records within the same data file. For example, a file containing both household records and household member records.
  4. Relational Structure.
    This refers to multiple files that can be merged on the basis of a predefined structure or variable--the relationship. Examples include data collected on the same population at different time periods (the relation is the individual) or file containing data on students and a file containing information about the student's school.

Typically, you won't want all of the variables in a file, and in some instances you won't want all of the observations. The process of reducing the number of variables is called an extraction; reducing the number of observations is called subsetting a dataset.


READING AND USING A CODEBOOK

[ Example Codebook ] [ Example Data Dictionary] [ Info You Need from the Codebook]

The place to start is with a CODEBOOK which is more or less a manual describing a particular study or data collection. While the content and format of codebooks vary considerably between data collections, the typical codebook contains the following information:

  1. A description of how the data was collected including sampling design;
  2. The variables contained in the data;
  3. In the case of surveys, the survey instrument or questionairre used to solicit responses from the respondent and the coded values of each question;
  4. The location and format of the variable within the raw data file;
  5. Meaning of the coded values for each variable
  6. .

Most codebooks have at least two major sections: the data dictionary which lists the variables and column locations and the data collection instrument. In a number of cases, there is a section describing how to read the codebook!


Example Codebook Page

ICPSR 6067
American National Election Study 1992

 

0          VAR 0062      R INTREST-POL CAMPGN                MD=0 OR GE  8
           REF 0062         LOC  151 WIDTH  1
0
              In this interview I will be talking with you about the
              recent elections, as well as a number of other things.
              First, I have some questions about the political campaigns
              that took place this election year.

              Q.A1.  Some people don't pay much attention to political
              campaigns.  How about you?   Would you say that you were
              VERY MUCH INTERESTED, SOMEWHAT INTERESTED, or NOT MUCH
              INTERESTED in following the political campaigns this year?
              ----------------------------------------------------------

             304  1.  VERY MUCH INTERESTED
             635  3.  SOMEWHAT INTERESTED
             419  5.  NOT MUCH INTERESTED

                  8.  DK
               1  9.  NA
            1126  0.  INAP, 1992 cross section
0          ...............................................................

The top section, beginning with "VAR 0062", contains the basic information required for your extraction code. "VAR 0062" is the variable name (you would probably list this as "V0062" or "V62"), "R INTREST-POL CAMPGN" is the variable label describing what "V0062" is, "LOC 151" is the column location of the variable within the record, and "WIDTH 1" indicates that the variable is in a single column. The next section, beginning with "In this interview" is the survey question text. The section beginning with "304 1. VERY MUCH INTERESTED" provides information about the coded values and labels of V62, e.g. a value of 1 means "VERY MUCH INTERESTED".


Example Data Dictionary

1990 Public Use Microdata Sample. A Hierarchical File

 

HOUSING UNIT RECORD
DATA               SIZE              BEGIN
D  RECTYPE         1                 1
         Record Type
V          H  .Housing Record
D  SERIALNO        7                 2
V   0000000..
     9999999  .Housing unit/GQ person serial number unique
              .identifier assigned within state or state group
D  SAMPLE          1                 9
         Sample Identifier
V          1  .5% sample
V          2  .1% sample
V          3  .Elderly
............................................................................
PERSON RECORD

DATA               SIZE              BEGIN
D  RECTYPE         1                 1
         Record Type
V          P  .Person Record
D  SERIALNO        7                 2
V  0000000..
V    9999999  .Housing unit/GQ person serial number unique
V             .identifier assigned within state or state group
D  RELAT1          2                 9
         Relationship
V         00  .Householder
V         01  .Husband/wife
V         02  .Son/daughter
V         03  .Stepson/stepdaughter
V         04  .Brother/sister
V         05  .Father/mother
V         06  .Grandchild
V         07  .Other relative
V         08  .Roomer/boarder/foster child
V         09  .Housemate/roommate
V         10  .Unmarried partner
V         11  .Other nonrelative
V         12  .Institutionalized person
V         13  .Other persons in group quarters
===================================================

The lines beginning with "D RECTYPE", "D SERIALNO", and "D RELAT1" indicate the Variable Name, e.g. "RECTYPE". The Label "Record Type" is beneath the variable information. The "BEGIN Column contains the starting column location of the variable and "SIZE" indicates the width of the variable. The lines beginning with "V" contain the coded values of the variable and a description of these codes.


Information You Need from the Codebook

In almost all cases, the codebook will contain the information you will need to begin thinking about and writing syntax to extract the variables and cases you need from the raw data.

The basics pieces of information that you will need are:

  1. The data structure--rectangular, hierarchical, etc.
  2. The variables that you are interested in: including their column location(s), variable type (alpha or numeric); additional formatting information (number of decimals, whether there are blanks in the field).
  3. Information about other key variables such as unique case identifiers, weights, and the like that are necessary for using the data correctly;
  4. Preparing labels to identify the variables and values on your output.
  5. You should also note some baseline marginals to test. For example, if the codebook lists the number of cases for by geographic area, you might want to consider comparing your extraction results to this table.

Once you've decided which variables you need, you'll need to decide upon an application to use. The application you decide to use should be one that is suited for the types of analyses that you expect to conduct; however, there isn't always an obvious choice. Some applications are well-suited to specific types of analysis; however, they are poor choices as "extraction engines" because they either don't handle data manipulation efficiently or they cannot work with complex data structures. This is true of MATLAB, S Plus, Excel, ect. This document focuses on SAS, SPSS and STATA.

Section VIII below discusses strategies for moving your data to the PC or to another UNIX based application. If you want to use a different application, you should identify the following types of commands:

 

  1. How to reference a raw file stored on disk.
  2. What commands are used read in raw data.
  3. How to save out a file formatted for use with the application.


EXAMPLE EXTRACTION ON A RECTANGULAR FILE

[ SAS EXAMPLE] [ SPSS EXAMPLE] [ STATA EXAMPLE ]

This section describes the basic SPSS and SAS syntax for extracting variables from a raw, ASCII file in which all of the data for an observation is contained in the same record. The example data set is the AMERICAN NATIONAL ELECTION STUDY, 1992. You can view the abridged codebook by clicking here


SAS Code to extract data from a rectangular file

/* =====================================================================        
*   ANES92.SAS
*      Example SAS Syntax for extracting data from an ASCII file
*=======================================================================
*/;

******************  LIBNAME AND FILENAME STATEMENTS **********************
*  File Name Statement defines an alias or pseudonym for the raw data file.
*  LIBNAME assigns an alias to a directory where a SAS data set will be 
*  stored;

FILENAME raw '/DL/polbehav/anes/1992/da6067';
LIBNAME mydata '/tmp';


************ SAS DATA STEP  ***********************************************
*    1) The DATA line contains the name of the SAS dataset to be saved where
*         mydata is the directory location and "anes92" is the name of the SAS
*         file.
*    2)  The INFILE line tells SAS where to locate the raw data file and how
*         many characters of data to read each time.  MISSOVER indicates that
*         blank columns should be treated as system missing data.
*    3)  The INPUT statement indicates what data to read in.  The basic 
*         structure is variable name, character variable ($) designation,
*         column locations and additional formating data (e.g. .4 for four
*         decimal places.
****************************************************************************;

DATA mydata.anes92;
INFILE raw LRECL=3621 MISSOVER;

INPUT   V4 7-10  
        V9 $ 37-40 
        V68 157-157  
        V3008 1752-1757 .4;
   
************  LABEL Statement creates Variable Labels;

LABEL
  V4="CASE ID NUMBER"
  V9="STATE/CONGRESSINAL DIST"
  V68="DISCSS POLTCS W/FAM,FRND"
  V3008="WEIGHT VARIABLE";
  ;

* Recode V68;

if (V68=0 or V68=8 or V68=9) then V68=.;
RUN;

****** Get a listing of variables in your new data set!;
PROC CONTENTS DATA=mydata.anes92;

****** Do a frequency on a key variable;
PROC FREQ DATA=mydata.anes92 (keep=v68);
RUN;
PROC MEANS DATA=mydata.anes92(keep=v3008);
RUN;

SPSS Code for Extraction

* =================================================================
* ANES92.SPS
*     SPSS syntax for extracting variables from an ASCII data file
*     Shows how to setup Data List Statement
*     Formatting
* =================================================

***********   File Handle Statement *******************************
* The File Handle Statment sets up an arbitrary alias--a pseudonym--
* for the data file.  RAW is the pseudonym, the actual file is da6067 which 
* is located in the /DL/polbehav/anes/1992 directory.  
* LRECL is the abbreviation for "Logical Record Length"--the maximum 
* length of a line to be read by SPSS.  By default, SPSS only reads the 
* first 1024 columns.

FILE HANDLE raw /NAME='/DL/polbehav/anes/1992/da6067' LRECL=3621.

DATA LIST FILE=raw
  /   V4 7-10  
      V9 37-40 (A)
      V68 157-157  
      V3008 1752-1757 (4)
.
   
************  VAR LAB Statement creates Variable Labels;

VAR LAB 
  V4 "CASE ID NUMBER"
  / V9 "STATE/CONGRESSINAL DIST"
  / V68 "DISCSS POLTCS W/FAM,FRND"
  / V3008="WEIGHT VARIABLE".

********   Value Labels .

VAL LAB V68 1 'YES' 5 'NO' 8 'DK' 9 'NA' 0 'INAP'.

****** Do a frequency on a key variable;
FRE VAR V68.

***** MISSING VALUES -- Exclude from the analysis .
missing values v68 (8,9,0).
FRE VAR V68.

save outfile='anes92.sav'.

STATA Code for Extraction

Start by creating a dictionary file...

infix dictionary using "/DL/polbehav/anes/1992/da6067." {
         v4 7-10
    str4 v9 37-40
         v68 157-157
         v3008 1752-1757
}
and here's the corresponding STATA DO file

* =======================================================================
* ANES92.DO
*      Example STATA Syntax for extracting data from an ASCII file
*=======================================================================
*
*
log using anes92.log, replace
*
* STATA has several methods for reading in FIXED ASCII 
*   INFIX is the most intuitive (infile is the other)
*   It's best to keep the "dictionary" 
* FIRST, STATA IS A MEMORY HOG -- MAKE SURE YOU ALLOCATE ENOUGH
*
*
set memory 20m
* READ IN THE DATA USING A DICTIONARY
*
*
infix using /home/spcbooke/data101/stata/anes92.dct         
*
*
*
* ***********  LABEL Statement creates Variable Labels;
label variable v4 "CASE ID NUMBER"
label variable v9 "STATE/CONGRESSINAL DIST"
label variable v68 "DISCSS POLTCS W/FAM,FRND"
label variable v3008 "WEIGHT VARIABLE"
*
*ADJUST IMPLIED DECIMAL PLACES for V3008
replace v3008=v3008/10000
*
* Recode V68
replace v68=. if v68==0 | v68==8 | v68==0
*
****** Do a frequency on a key variable;
tab v68


EXTRACTION ON A MULTIPLE RECORD/OBSERVATION FILE

[ SAS EXAMPLE] [ SPSS EXAMPLE] STATA EXAMPLE ]

Extracting data from multiple record/observation files builds upon the example discussed above. The basic idea is to construct command syntax that will extract the variables you need from each record and rectangularize the saved application data set. Visually,

            P1R1
            P1R2      ========>                    P1.........
            P2R1                                   P2.........
            P2R2

where P1, P2 indicate individuals, R1 and R2 are records (in this case 2 records per observation). The strategy in this case is to identify the records that contain the variables of interest and to write the syntax that tells your application which records to process and how many records per observation there are.


Example SAS Code

/* =========================================================================
   SN9112.SAS
        Example SAS syntax for extracting data from an ASCII file in which
           there are multiple lines of data for each observation
        Sample data are the National Youth Survey, Wave 5
   =========================================================================
*/;

***************  Filename and Libname Statements **************************;

FILENAME raw '/DL/soc/youth/nys/wave5/da9112.card';
libname mydata '/tmp';

DATA mydata.nys5;
INFILE raw lrecl=80 missover;
INPUT 
   #1 y5V1 1 y5v2 2 y5v6 9-10 y5v7 11-12 y5v39 45 y5v40 46-47 y5v45 55 
      y5v50 59 hhid 69-72 respid 73-74
   #2 y5v57 1 y5v65 10 
   #3 y5v181 35 y5v182 36 y5v183 37 y5v184 38
   #9 y5v435 4 y5v437 8  y5v439 12 y5v441 16 y5443 20 y5v445 24 y5v447 28 
    y5v463 60 
;

* LABELS ....;
* FORMAT.....;

PROC CONTENTS;

Example SPSS Code

*===================================================
* SN9112.SPS
*       Example SPSS Syntax for reading in data from an ASCII file in which
*       there are more than one record per observation.
*       Data for this example are National Youth Survey, Wave 5
*        These data have 15 records/observation.  
*        Total records = 25875.  Total cases=1725.
* =============================================

FILE HANDLE raw /NAME='/DL/soc/youth/nys/wave5/da9112.card' LRECL=80.

******************  Data List Command  
*  The data list command differs in two ways.  
*     First, the inclusion of the "records" subcommand.  This is the
*            number of records per observation.
*      Second, the "/n" that tells SPSS which record to read data from.

DATA LIST FILE=raw RECORDS=15
  /1 y5V1 1 y5v2 2 y5v6 9-10 y5v7 11-12 y5v39 45 y5v40 46-47 y5v45 55 
     y5v50 59 hhid 69-72 respid 73-74
  /2 y5v57 1 y5v65 10 
  /4 y5v181 35 y5v182 36 y5v183 37 y5v184 38
  /9 y5v435 4 y5v437 8  y5v439 12 y5v441 16 y5443 20 y5v445 24 
     y5v447 28 y5v463 60 
 
*  Variable Labels ....
*  Value Labels ....

* Do a frequency command to test your results.

FREQUENCIES VARIABLES=y5v1 y5v463

* Save your results to an SPSS system file.

SAVE OUTFILE='/tmp/nys5.sys'

Example STATA Code

Here's the STATA dictionary

infix dictionary using "/DL/soc/youth/nys/wave5/da9112.card" {
   15 lines
   1: 
    y5V1 1 y5v2 2 y5v6 9-10 y5v7 11-12 y5v39 45 y5v40 46-47 y5v45 55
      y5v50 59 hhid 69-72 respid 73-74
   2:
     y5v57 1 y5v65 10
   3: y5v181 35 y5v182 36 y5v183 37 y5v184 38
   9: y5v435 4 y5v437 8  y5v439 12 y5v441 16 y5443 20 y5v445 24 y5v447 28
    y5v463 60
}

and here's the STATA DO File

* =========================================================================
*   SN9112.do 
*       Example STATA syntax for extracting data from an ASCII file in which
*          there are multiple lines of data for each observation
*       Sample data are the National Youth Survey, Wave 5
*  =========================================================================
*
set memory 20m
*
*
infix using sn9112


EXTRACTION ON A HIERARCHICAL FILE

[ SAS EXAMPLE] [ SPSS EXAMPLE]

Extracting data from a hierarchical file involves identifying all distinct data levels within the file and the variable that differentiates each level. The general idea is to extract the required data from each level of the hierarchy to generate a rectangular file. This is done by forcing the application to process each record type differently. Visually,

 

            H1
            P1                                     H1P1
            P2                                     H1P2
            P3                    =====>           H1P3
            H2                                     H2P1
            P1                                     H2P2
            P2

where H1 and H2 are records for separate households, and P records indicate records for persons within each respective household.

The examples below use the 1990 Public Use Microdata for the state of Wyoming. Note: STATA cannot process hierarchical data files, thus there is no STATA example.


Example SAS Code.

/*====================================================
*   PUMS90.SAS
*   Example SAS code for Intro to Data Handling
*     ***  Extraction on a hierarchical dataset. 
*     ***  Example selection of cases
* ===================================================
*/ ;

*****************  SAS Library Set Up Statements ****************** ;
LIBNAME  pumout '.' ;
FILENAME pumsin '/DL/cens/pums/1990/5pct/wy56' ; 

*****************  Data Step ************************************** ;
*        1)  RETAIN Statment.  This tells SAS to Hold on to the values of 
*            specified variables across "input statements".  In this case,
*            the household variables.
*        2)  First INPUT statement extracts the variable identifying the 
*            record type and the "@" is used to hold that record in the 
*            input buffer for further processing.
*        3)  IF--THEN sequences.  The IF statement is used to process cases 
*            by the value of the rectype.  The DO-- END structure executes 
*            multiple commands.
*        4)  DELETE tells SAS NOT to write the Household record to the 
*            output file.
;

DATA pumout.wy56 ;
INFILE pumsin MISSOVER LRECL=231 ;
RETAIN houseid region state puma areatype persons rfaminc rhhinc rwrkr89 
       rhhfamtp ;
INPUT rectype $ 1 @;

IF rectype='H' THEN
   DO;
   INPUT houseid 2-8 region 10 state 11-12 puma 13-17 areatype 18-19 
         persons 33-34 rfaminc 134-140 rhhinc 141-147 rwrkr89 148 
         rhhfamtp 151-152 ;
   END;

IF rectype='P' THEN INPUT
   persid 2-8 relat1 10-11 sex 11 race 12-14 age 15-16 pwgt1 18-21 rpob 29-30
   hispanic 38-40 poverty 41-43 citizen 47 immigr 48-49 yearsch 51-52 
   mobility 59 migpums 62-66 yearwrk 114 industry 115-117 occup 118
   work89 122 rearning 127-132 rpincome 133-138 ;

**********  Selection of cases in which Age > 18 and Age < 65 ;

IF (17 < age < 65) ;

********** End of Data Step **************************************;
PROC FREQ DATA=pumout.wy56(KEEP=state sex race age) ;

Example SPSS Code.

* ===================================================
*  PUMS90.SPS
*  Example SPSS syntax for extracting data from a hierarchical file
*  Also, example selection of cases
* ===================================================

************* File Handle Setup Statment ************************** 
FILE HANDLE pumsin /NAME='/DL/cens/pums/1990/1ptc/wy56' LRECL=231

*************  File Type Command  ***********************************
****  Requires that the record identifier be at the beginning of the record!
****  Hierarcical files are type nested
****  Record subcommand tells SPSS where to find the record id variable

FILE TYPE NESTED FILE=pumsin RECORD=rectype 1 (a)
RECORD TYPE 'H'
DATA LIST
 /houseid 2-8 region 10 state 11-12 puma 13-17 areatype 18-19 
  persons 33-34 rfaminc 134-140 rhhinc 141-147 rwrkr89 148 rhhfamtp 151-152
RECORD TYPE 'P'
DATA LIST
 /persid 2-8 sex 11 race 12-14 age 15-16 pwgt1 18-21 rpob 29-30
  hispanic 38-40 poverty 41-43 citizen 47 immigr 48-49 yearsch 51-52 
  mobility 59 migpums 62-66 yearwrk 114 industry 115-117 occup 118
  work89 122 rearning 127-132 rpincome 133-138 
END FILE TYPE

*************  Example Selection Statement ******************
SELECT IF (age > 17 AND age <= 65)
FREQUENCIES VARIABLES=sex race age

*************  Save out a SPSS system file *******************

SAVE OUTFILE='wy56.sys'


EXTRACTION AND MERGES ON RELATIONAL FILES

[ Concatenating Files ] [ Same Observations (1-to-1) ]
[ Different Levels of Analysis ]

As noted earlier, relational files are those that can be linked or merged on the basis of a relationship or value(s) of a specified variable(s). The key to working with relational files is determining the relationship between the file AND making sure that you have the appropriate variables to use as a basis for combining the files. At the outset, one can think of two basic types of operations--the first is to add cases from two files to create a much larger data set. The second operation would involve merging data from two data sets. For example, you may have two files containing the same individuals but at different points in time (for example, responendents in a panel study) or files that are related on the basis of a set of variables such as students and schools.

CONCATENATING FILES-- ADDING OBSERVATIONS

[ SAS EXAMPLE] [ SPSS EXAMPLE]

In most cases this is a fairly straightforward operation and comes up when you are working with raw files that are divided on the basis of geography or some other factor. For example, the Public Use Microdata sample from the 1990 Census is subdivided into 51 files. After you've run an extraction on two or more states, you might want to combine them into a single data set.

Example SAS Code

/* =====================================================
   PUMSADD.SAS 
   Example file of appending cases from two files with same variables to
   create a new file
=================================================
*/;

LIBNAME add2 '.';

DATA add2.pums2;
  SET add2.wy56 add2.hi15;
PROC CONTENTS;

Example SPSS Code

*=====================================================
* PUMSADD.SPS
*    Example SPSS syntax for concatenating two SPSS files with same variables
*    Example data are WY56.SYS and HI15.SYS
*====================================================

ADD FILES  FILE='wy56.sys' /FILE='hi15.sys'
SAVE OUTFILE='pums.sys' /COMPRESSED

Example STATA Code


USE /home/spcbooke/data101/stata/il17
APPEND USING /home/spcbooke/data101/stata/in18


MERGING DATASETS WITH THE SAME OBSERVATIONS (1-to-1)

[ SAS EXAMPLE] [ SPSS EXAMPLE] [A HREF=#exmerb3" > STATA EXAMPLE ]

Whether you're working with Panel data, you've forgotten to extact a variable of vital importance to your research, or you need to add newly constructed variables from another file, merging data sets is an essential tool to learn. Most merges require that you have a unique or common variable between the two datasets--such as a unique caseid-- that allow you to match them together.

Two points are worth noting here. First, in order to merge files, they must be sorted on the basis of the unique identifying variable(s). Second, the order in which you specify the files to be merged makes a difference. The first listed file is the "master file". The most important implication of this structure is that if both files to be merged have variables with the same name, ONLY the variable in the first file will be saved. In other words, if you have recoded or in any other way modified a variable in the second data set, you should either rename it or drop the variable from the first dataset. (See "Useful" hints below).

Example SAS Code

/*==================================================== 
*   PUMSMATCH.SAS
*   Example of case-to-case match.          
*
*=====================================================
*/;

LIBNAME match1 '.';
FILENAME pumin '/DL/cens/pums/1990/5pct/wy56';

* First step is to read in a new variable from the data file;
* Note that I don't save this as a permanent data set!;

DATA a1;
  INFILE pumin MISSOVER LRECL=231;
  INPUT rectype $ 1 @
  IF rectype="P" THEN INPUT personid 2-8 relat1 9 asex 185;
  IF rectype="P";

*******  Sort both data sets ****************************;
SORT DATA=work.a1;
  BY personid relat1;

SORT DATA=match1.wy56;
  BY personid relat1;

*******  Merge the files ********************************;

DATA match1.wy56;
  MERGE match1.wy56 work.a1;
     BY personid relat1;

Example SPSS Code

*=====================================================
* PUMSMATCH.SPS
*       Merges variables from two data sets by Personid 
*       Example SPSS Syntax for Intro to Data Handling
*
* ====================================================

FILE HANDLE pumin /NAME='/DL/cens/pums/1990/5pct/wy56' LRECL=231      

*****  Extract sample data.

FILE TYPE NESTED  FILE=pumsin RECORD rectype 1 (A)
RECORD TYPE 'P'
DATA LIST /persid 2-8 relat1 9-10 asex 185
END FILE TYPE

******  Sort cases by identifying variables:  persid and relat1.
*****   If you don't sort, SPSS will crash!!  
******  Note that you need to save the sorted data!

SORT CASES BY persid relat1
SAVE OUTFILE='/tmp/wy56a'

GET FILE='wy56.sys'
SORT CASES BY persid relat1  
SAVE OUTFILE='/tmp/wy56sort.sys'

****  Matching the files together.
****  The order in which you specify files makes a difference~!
****  
****  In this first example, wy56sort.sys has fewer cases than w56a.sort

MATCH FILES  FILE='/tmp/wy56sort.sys' /FILE='/tmp/wy56a.sys' /BY persid relat1

****  In this first example, wy56sort.sys has more cases than w56a.sort
MATCH FILES  FILE='/tmp/wy56a.sys' /FILE='/tmp/wy56sort.sys' /BY persid relat1

Example STATA Code

We'll have to pretend that we extracted the data using either SAS or SPSS and we now have two STATA Data Sets. Here's the STATA syntax that's analygous to the programs above.
use /home/spcbooke/data101/file2
sort caseid
save /home/spcbooke/data101/file2, replace
use /home/spcbooke/data101/file1 
sort caseid
merge caseid using /home/spcbooke/data101/file2, 
*  Note, stata produces a nifty little variable named _merge
*  Coded values of _merge are:
*      1 - Observation found only in the “master file” - file1 above
*      2 - Observation found only in the “using” file - file 2 above
*      3 - Observation found in both files


MERGING DATA FROM DIFFERENT LEVELS OF ANALYSIS

[ SAS EXAMPLE] [ SPSS EXAMPLE] [STATA EXAMPLE ]

This section describes a broad array of instances in which your research design calls for merging data from different sources or different levels of analysis. For example, you might want to add census tract level variables to a survey data collection or school characteristics with students.

The command files below demonstrate how to go about merging such files. As in all cases of merging, you MUST a common variable or variables between the two files and each file MUST be sorted on the basis of the common variable(s).

Example SAS Code

/*=================================================
*   PUMSMERGE.SAS
*   Example of a one-to-many merge.  
*   In this case, we're merging a household variable to person records.
*====================================================
*/;

LIBNAME  merge1 '.';
FILENAME pumsin '/ec1/spcdata/cens/pums/1990/5pct/wy56';

DATA m2;
  INFILE pumsin MISSOVER LRECL=231;
  INPUT rectype $ 1;
  IF rectype="H" THEN INPUT housid 2-8 state 9-10 rgrapi 121-122;
  IF rectype="H";

SORT DATA=work.m2;
  BY housid;

SORT DATA=merge1.wy56;
  BY houseid;

DATA merge1.newwy56;
  MERGE merge1.wy56 work.m2;
    BY housid;

Example SPSS Code

*====================================================
* PUMSMERGE.SPS
*      Example SPSS syntax for merging one to many
*      In this case, merging a household variable to person records from
*         the 1990 Public Use Microdata
* =====================================================

FILE HANDLE pumin /name='/ec1/spcdata/cens/pums/1990/5pct/wy56' LRECL=231


*****  Extract sample data.

FILE TYPE NESTED FILE=pumsin record rectype 1 (A)
RECORD TYPE 'H'
DATA LIST /housid 2-8 state 9-10 rgrapi 185-186 
END FILE TYPE

******  Sort cases by identifying variables:  housid.
*****   If you don't sort, SPSS will crash!!  
******  Note that you need to save the sorted data!

SORT CASES BY housid 
SAVE OUTFILE='/tmp/wy56h'

GET FILE='wy56.sys'
SORT CASES BY housid   

SAVE OUTFILE='/tmp/wy56hsort.sys'

****  Matching the files together.
****  Note the TABLE subcommand.
****  

MATCH FILES  FILE='/tmp/wy56hsort.sys' /TABLE='/tmp/wy56h.sys' /BY housid


PORTING DATA TO OTHER APPLICATIONS/PLATFORMS

[ SAS/UNIX <--> SAS/PC ] [ SPSS/Unix <--> SPSS/WIN ]
[ SAS <--> SPSS ] [ SAS --> STATA ]

During the course of a research project, it is not uncommon to discover that another application does an analysis better than the one that you're currently using. For example, SAS might be able to do something that SPSS cannot. You might also want to do some analyses on a UNIX system and others on a desktop. In all of these cases, you will need to "massage" your data to make it readable by another application. This section discusses some of the common cases. If your particular case is not discussed here, feel free to contact a site assistant or a data library staff member to discuss the best strategy AND/OR read the user manual for the application that you want to move to--it often has a lot of information on this subject!


SAS/UNIX <--> SAS/PC

This section describes how to transfer a SAS data set from UNIX to the PC. The same steps work the other way as well!

Step 1: SAS/Unix Command file

* =================================================
* SASTRAN.SAS
*    Command file used to create a SAS Transport File
* =================================================

OPTIONS REPLACE;   /* This is needed, otherwise SAS won't create the 
                      transport file */
LIBNAME in1 '/u7/spcbooke';
LIBNAME out1 XPORT '/u7/spcbooke/sastran.xpt';  /* THIS IS A FILE!!!! 
                                                   NOT DIRECTORY */
PROC COPY IN=in1 OUT=out1;
SELECT FILE;   /* This is the name of the file you want included in 
                  the transport file */
RUN;

Step 2: Transfer the Transport file to the PC.

From a PC, ftp the file from UNIX to the local hard drive.

ftp johnjohn
login:  your-login-id
password: your-passwd
ftp> binary
ftp> get sastran.xpt
ftp> exit

Step 3: Read the file into SAS/WIN. Here are the commands.

*=============================================
* SASIN.SAS
*   Command file used to read transport file into SAS
* =============================================

OPTIONS REPLACE;

LIBNAME in1 XPORT 'C:\USERS\sastran.xpt';    /* This is a file */
LIBNAME out1  'C:\USERS';     /* This is where you want the sas data set 
                                 to be stored */

PROC COPY IN=in1 OUT=out1;

SPSS/Unix <--> SPSS/WIN

This section describes the creation of an SPSS portable file from an SPSS System file that can be read by any version of SPSS on any other platform.

Step 1: Create the SPSS Portable File

* =========================================
* SPSSPOR.CMD
*   Commands used to create a Portable File
* =========================================

GET FILE='/u7/spcbooke/file.sys'.
EXPORT OUTFILE='/u7/spcbooke/file.por'.

Step 2: FTP the file to the PC

ftp johnjohn
login:  your-login-id
password: your-passwd
ftp> get file.por
ftp> exit

Step 3: Read the File into SPSS/Windows Method 1: Using the menus.

Since SPSS/Win is a graphical interface, you can read the file by making the appropriate menu selections. Basically, select File from the main menu bar, OPEN, DATA. You should now be in the "OPEN DATA FILE" window. The trick is to select the correct File Type (SPSS Portable) [ Lower Left corner of the window ]. Select your file, and SPSS will magically read it in. Remember to SAVE it when you're done!

Method 2: Writing Commands.

The SPSS Menus basically write the commands for you. Here's what was done:

IMPORT FILE='c:\users\file.por'.
SAVE OUTFILE='c:\users\file.sav'.

SAS <--> SPSS

As of SPSS version 5.0 on UNIX and version 6.0 on Windows and SAS version 6.09 on all platforms (UNIX and PC), it has become fairly easy to move data sets between SAS and SPSS. Both applications require that the files be in a transport/portable format as described in VIII.A.Step_1 and VIII.B.Step_1 above.

SAS --> SPSS

The command to read a SAS Transport file into SPSS is:

GETSAS DATA='/u29/spcbooke/file.xpt'

NOTE: if you're using SPSS/WIN, you'll have to type this command into a syntax window! This is not currently available as a menu option.

SPSS --> SAS

Getting SAS to read a SPSS file is a little more involved.

LIBNAME here '/u29/spcbooke';  /* This is the directory */
LIBNAME porfile SPSS '/u29/spcbooke/file.por'; /* This is the SPSS 
                                                  portable file */

DATA here.file;
SET porfile.dummy;  /* the .dummy part is because SAS assumes a file 
                       name when referenced with a library */

SAS --> STATA

It is now possible to transfer SAS data sets to STATA under UNIX, with the following caveats: the SAS dataset must not contain more than 2000 variables or 4000 observations!!!

% sas2stata file

This will convert SAS data set, file.ss01, into a STATA dataset, file.dta.


OTHER STUFF

[Working with Compressed ASCII Data ] [Selecting a subset of variables ]
[Random Sampling a subset of observations]

This section discusses other useful things not discussed elsewhere. This section is still under construction!

Working with Compressed ASCII Data.

One way to efficiently work with LARGE data collections (those over 30 MB uncompressed) is to trick your application into reading compressed data. This only works with SAS although rumor has it that SPSS will begin supporting "named pipes" soon......

The basic idea of a "named pipe" is to create a conduit file that will store small chunks of data from the compressed file and then pass it on to SAS when SAS asks for data. Here's how it works:

              % mknod mypipe p &
              % gzcat -c file.gz > mypipe &

mknod is used to create a "named pipe file"--mypipe in this example. This requires that you have /usr/etc in your UNIX path. The gzcat line is used to uncompress the data to the pipe. NOTE: both of these commands should be run in the background (denoted by "&"). NOTE: The gzcat won't begin executing until you start your sasjob.

In your file containing the SAS data step, change your FILENAME line as follows:

         FILENAME raw 'mypipe';

Selecting a subset of variables.

One way to work "smartly" is to select only the variables you want to work with. This makes your working dataset much smaller and easier for the system to process. If you construct new variables, you can merge these back with the original dataset using the "update" commands.

One of the most commonly overlooked features of all statistical applications is that you can choose to use only those variables that are germane to the statistical tasks at hand. This can be done by either keeping the variables of interest or dropping those that are not. Example syntax for SPSS and SAS are presented below:

SPSS:

                 GET FILE='mydata.sys' /KEEP=v3 v4.
                 COMPUTE v5=v3+v4.
                 UPDATE FILE='mydata.sys' /FILE=*.
                 SAVE OUTFILE='newdata.sys'.

SAS:

                 DATA a;
                 SET mydata.junk (KEEP=v3 v4);
                 v5=v3+v4;

                 DATA mydata.junk1;
                 UPDATE mydata.junk a;

Random Sampling a subset of observations.

Another strategy for working with data, particularly at the early stages of a process is to select a random sample of observations to analyze. This has the advantage of allowing you to debug your progams without consuming vast resources. It can also point out potential flaws with your analysis prior to working with the full data collection.

SPSS Example:

              GET FILE='mydata.sys'.
              SAMPLE .10.    {This samples 10% of the observations}

SAS Examples:

If your data are already in a "random" format (aka haven't been sorted), you can specify the number of observations to work with using the observations option:

              OPTIONS OBS=100;

If you need to select a random sample on a file that has been sorted, you can do it as follows:

              DATA;
              SET mydata.junk;
              IF ranuni(2) < .10;  /* This selects 10% of the observations */


Return to the Data Library home page
For further information, E-mail: datahelp@spc.uchicago.edu

Updated: 24JAN96