|
Select a topic if you wish to ZOOM directly to a particular section within this document [ INTRODUCTION ] [ BASIC
CONCEPTS ] |
NOTE: Example SAS, SPSS and STATA command files are located at: ftp://ftp.src.uchicago.edu/data/programs.
Please feel free to copy these files for your own use.
For most, the "Introduction to Data" occurs in the context of an introductory or advanced course in statistical analysis. Typically, the data you have used has been preformatted by a TA or RA for use with a particular statistical package such as SPSS, SAS or STATA. Once you embark upon your own research, you will rapidly discover that few datasets come prepared for immediate access by your favorite application. In fact, you will find that getting your data in an appropriate form for analysis is oftentimes more involved than the most complex statistical analyses you will employ!
This handout is intended to introduce you to the basics required to converting "raw" data into a dataset to be used by a statistical application, specifically SPSS or SAS. It illuminates topics such as reading a codebook, identifying data structures, and developing programs for reading raw data into a statistical application.
As a starting point, assume that you have identified a dataset containing information you would like to analyze. These data consist of a number of measured attributes--called variables--each describing a set of observations. In the case of a survey, the observations are typically individual respondents and the variables are responses solicited from questions about attitudes, behaviors and traits.
Further assume that you have received this data in a "raw" format--this simply means that the data are in a text file that may look like this:
H000000118560010080999700 63001402000005009300001520000141112111520006 P000000100000182000180000000221000000000228020000110939999256001002000 P000000101100177000180000000221040000000228031000110939999256001002000 H000000218560010080999700 57002201000002004222100000000531112111721001 P000000200100180100210000000103040000000203056000105020999100000002000 H000000318560010080999700 07002204000002009122180000000251113111821004 P000000300000135000220000000221000000000495017000117032051256001002000 P000000301100132000220000000221020000000495017000115032142256001002000 P000000302000107400180000111220101000000495031000203032999256001001625 P000000302000109400140000111220101000000495046000204032999256001002000
Your task will be to map these numbers and characters to variables and observations such that a statistical application such as SAS or SPSS can read and analyze the data.
Before getting started, it is essential that you understand the technical language of data. The most basic concepts are the record and the field. A record is simply a line. In some cases, the record contains all of the information about an observation, but as is noted below that isn't always true. A field is a column or columns containing the data for a specific variable. This assumes, of course, that your data is in a fixed column format--meaning that the values for a particular variable are in the same column(s) for all observations--which is true for 90% of all data files you will encounter. The alternative is a variable format in which the variables a found in a specified sequence within the file but are not in the same locations for each observation.
The relationship between records and fields constitutes a "data structure". The four most common structures are as follows:
Typically, you won't want all of the variables in a file, and in some instances you won't want all of the observations. The process of reducing the number of variables is called an extraction; reducing the number of observations is called subsetting a dataset.
| [ Example Codebook ] [ Example Data Dictionary] [ Info You Need from the Codebook] |
The place to start is with a CODEBOOK which is more or less a manual describing a particular study or data collection. While the content and format of codebooks vary considerably between data collections, the typical codebook contains the following information:
Most codebooks have at least two major sections: the data dictionary which lists the variables and column locations and the data collection instrument. In a number of cases, there is a section describing how to read the codebook!
ICPSR 6067
American National Election Study 1992
0 VAR 0062 R INTREST-POL CAMPGN MD=0 OR GE 8
REF 0062 LOC 151 WIDTH 1
0
In this interview I will be talking with you about the
recent elections, as well as a number of other things.
First, I have some questions about the political campaigns
that took place this election year.
Q.A1. Some people don't pay much attention to political
campaigns. How about you? Would you say that you were
VERY MUCH INTERESTED, SOMEWHAT INTERESTED, or NOT MUCH
INTERESTED in following the political campaigns this year?
----------------------------------------------------------
304 1. VERY MUCH INTERESTED
635 3. SOMEWHAT INTERESTED
419 5. NOT MUCH INTERESTED
8. DK
1 9. NA
1126 0. INAP, 1992 cross section
0 ...............................................................
The top section, beginning with "VAR 0062", contains the basic information required for your extraction code. "VAR 0062" is the variable name (you would probably list this as "V0062" or "V62"), "R INTREST-POL CAMPGN" is the variable label describing what "V0062" is, "LOC 151" is the column location of the variable within the record, and "WIDTH 1" indicates that the variable is in a single column. The next section, beginning with "In this interview" is the survey question text. The section beginning with "304 1. VERY MUCH INTERESTED" provides information about the coded values and labels of V62, e.g. a value of 1 means "VERY MUCH INTERESTED".
1990 Public Use Microdata Sample. A Hierarchical File
HOUSING UNIT RECORD
DATA SIZE BEGIN
D RECTYPE 1 1
Record Type
V H .Housing Record
D SERIALNO 7 2
V 0000000..
9999999 .Housing unit/GQ person serial number unique
.identifier assigned within state or state group
D SAMPLE 1 9
Sample Identifier
V 1 .5% sample
V 2 .1% sample
V 3 .Elderly
............................................................................
PERSON RECORD
DATA SIZE BEGIN
D RECTYPE 1 1
Record Type
V P .Person Record
D SERIALNO 7 2
V 0000000..
V 9999999 .Housing unit/GQ person serial number unique
V .identifier assigned within state or state group
D RELAT1 2 9
Relationship
V 00 .Householder
V 01 .Husband/wife
V 02 .Son/daughter
V 03 .Stepson/stepdaughter
V 04 .Brother/sister
V 05 .Father/mother
V 06 .Grandchild
V 07 .Other relative
V 08 .Roomer/boarder/foster child
V 09 .Housemate/roommate
V 10 .Unmarried partner
V 11 .Other nonrelative
V 12 .Institutionalized person
V 13 .Other persons in group quarters
===================================================
The lines beginning with "D RECTYPE", "D SERIALNO", and "D RELAT1" indicate the Variable Name, e.g. "RECTYPE". The Label "Record Type" is beneath the variable information. The "BEGIN Column contains the starting column location of the variable and "SIZE" indicates the width of the variable. The lines beginning with "V" contain the coded values of the variable and a description of these codes.
In almost all cases, the codebook will contain the information you will need to begin thinking about and writing syntax to extract the variables and cases you need from the raw data.
The basics pieces of information that you will need are:
Once you've decided which variables you need, you'll need to decide upon an application to use. The application you decide to use should be one that is suited for the types of analyses that you expect to conduct; however, there isn't always an obvious choice. Some applications are well-suited to specific types of analysis; however, they are poor choices as "extraction engines" because they either don't handle data manipulation efficiently or they cannot work with complex data structures. This is true of MATLAB, S Plus, Excel, ect. This document focuses on SAS, SPSS and STATA.
Section VIII below discusses strategies for moving your data to the PC or to another UNIX based application. If you want to use a different application, you should identify the following types of commands:
| [ SAS EXAMPLE] [ SPSS EXAMPLE] [ STATA EXAMPLE ] |
This section describes the basic SPSS and SAS syntax for extracting variables from a raw, ASCII file in which all of the data for an observation is contained in the same record. The example data set is the AMERICAN NATIONAL ELECTION STUDY, 1992. You can view the abridged codebook by clicking here
/* =====================================================================
* ANES92.SAS
* Example SAS Syntax for extracting data from an ASCII file
*=======================================================================
*/;
****************** LIBNAME AND FILENAME STATEMENTS **********************
* File Name Statement defines an alias or pseudonym for the raw data file.
* LIBNAME assigns an alias to a directory where a SAS data set will be
* stored;
FILENAME raw '/DL/polbehav/anes/1992/da6067';
LIBNAME mydata '/tmp';
************ SAS DATA STEP ***********************************************
* 1) The DATA line contains the name of the SAS dataset to be saved where
* mydata is the directory location and "anes92" is the name of the SAS
* file.
* 2) The INFILE line tells SAS where to locate the raw data file and how
* many characters of data to read each time. MISSOVER indicates that
* blank columns should be treated as system missing data.
* 3) The INPUT statement indicates what data to read in. The basic
* structure is variable name, character variable ($) designation,
* column locations and additional formating data (e.g. .4 for four
* decimal places.
****************************************************************************;
DATA mydata.anes92;
INFILE raw LRECL=3621 MISSOVER;
INPUT V4 7-10
V9 $ 37-40
V68 157-157
V3008 1752-1757 .4;
************ LABEL Statement creates Variable Labels;
LABEL
V4="CASE ID NUMBER"
V9="STATE/CONGRESSINAL DIST"
V68="DISCSS POLTCS W/FAM,FRND"
V3008="WEIGHT VARIABLE";
;
* Recode V68;
if (V68=0 or V68=8 or V68=9) then V68=.;
RUN;
****** Get a listing of variables in your new data set!;
PROC CONTENTS DATA=mydata.anes92;
****** Do a frequency on a key variable;
PROC FREQ DATA=mydata.anes92 (keep=v68);
RUN;
PROC MEANS DATA=mydata.anes92(keep=v3008);
RUN;
* =================================================================
* ANES92.SPS
* SPSS syntax for extracting variables from an ASCII data file
* Shows how to setup Data List Statement
* Formatting
* =================================================
*********** File Handle Statement *******************************
* The File Handle Statment sets up an arbitrary alias--a pseudonym--
* for the data file. RAW is the pseudonym, the actual file is da6067 which
* is located in the /DL/polbehav/anes/1992 directory.
* LRECL is the abbreviation for "Logical Record Length"--the maximum
* length of a line to be read by SPSS. By default, SPSS only reads the
* first 1024 columns.
FILE HANDLE raw /NAME='/DL/polbehav/anes/1992/da6067' LRECL=3621.
DATA LIST FILE=raw
/ V4 7-10
V9 37-40 (A)
V68 157-157
V3008 1752-1757 (4)
.
************ VAR LAB Statement creates Variable Labels;
VAR LAB
V4 "CASE ID NUMBER"
/ V9 "STATE/CONGRESSINAL DIST"
/ V68 "DISCSS POLTCS W/FAM,FRND"
/ V3008="WEIGHT VARIABLE".
******** Value Labels .
VAL LAB V68 1 'YES' 5 'NO' 8 'DK' 9 'NA' 0 'INAP'.
****** Do a frequency on a key variable;
FRE VAR V68.
***** MISSING VALUES -- Exclude from the analysis .
missing values v68 (8,9,0).
FRE VAR V68.
save outfile='anes92.sav'.
infix dictionary using "/DL/polbehav/anes/1992/da6067." {
v4 7-10
str4 v9 37-40
v68 157-157
v3008 1752-1757
}
and here's the corresponding STATA DO file
* ======================================================================= * ANES92.DO * Example STATA Syntax for extracting data from an ASCII file *======================================================================= * * log using anes92.log, replace * * STATA has several methods for reading in FIXED ASCII * INFIX is the most intuitive (infile is the other) * It's best to keep the "dictionary" * FIRST, STATA IS A MEMORY HOG -- MAKE SURE YOU ALLOCATE ENOUGH * * set memory 20m * READ IN THE DATA USING A DICTIONARY * * infix using /home/spcbooke/data101/stata/anes92.dct * * * * *********** LABEL Statement creates Variable Labels; label variable v4 "CASE ID NUMBER" label variable v9 "STATE/CONGRESSINAL DIST" label variable v68 "DISCSS POLTCS W/FAM,FRND" label variable v3008 "WEIGHT VARIABLE" * *ADJUST IMPLIED DECIMAL PLACES for V3008 replace v3008=v3008/10000 * * Recode V68 replace v68=. if v68==0 | v68==8 | v68==0 * ****** Do a frequency on a key variable; tab v68
| [ SAS EXAMPLE] [ SPSS EXAMPLE] STATA EXAMPLE ] |
Extracting data from multiple record/observation files builds upon the example discussed above. The basic idea is to construct command syntax that will extract the variables you need from each record and rectangularize the saved application data set. Visually,
P1R1
P1R2 ========> P1.........
P2R1 P2.........
P2R2
where P1, P2 indicate individuals, R1 and R2 are records (in this case 2 records per observation). The strategy in this case is to identify the records that contain the variables of interest and to write the syntax that tells your application which records to process and how many records per observation there are.
/* =========================================================================
SN9112.SAS
Example SAS syntax for extracting data from an ASCII file in which
there are multiple lines of data for each observation
Sample data are the National Youth Survey, Wave 5
=========================================================================
*/;
*************** Filename and Libname Statements **************************;
FILENAME raw '/DL/soc/youth/nys/wave5/da9112.card';
libname mydata '/tmp';
DATA mydata.nys5;
INFILE raw lrecl=80 missover;
INPUT
#1 y5V1 1 y5v2 2 y5v6 9-10 y5v7 11-12 y5v39 45 y5v40 46-47 y5v45 55
y5v50 59 hhid 69-72 respid 73-74
#2 y5v57 1 y5v65 10
#3 y5v181 35 y5v182 36 y5v183 37 y5v184 38
#9 y5v435 4 y5v437 8 y5v439 12 y5v441 16 y5443 20 y5v445 24 y5v447 28
y5v463 60
;
* LABELS ....;
* FORMAT.....;
PROC CONTENTS;
*===================================================
* SN9112.SPS
* Example SPSS Syntax for reading in data from an ASCII file in which
* there are more than one record per observation.
* Data for this example are National Youth Survey, Wave 5
* These data have 15 records/observation.
* Total records = 25875. Total cases=1725.
* =============================================
FILE HANDLE raw /NAME='/DL/soc/youth/nys/wave5/da9112.card' LRECL=80.
****************** Data List Command
* The data list command differs in two ways.
* First, the inclusion of the "records" subcommand. This is the
* number of records per observation.
* Second, the "/n" that tells SPSS which record to read data from.
DATA LIST FILE=raw RECORDS=15
/1 y5V1 1 y5v2 2 y5v6 9-10 y5v7 11-12 y5v39 45 y5v40 46-47 y5v45 55
y5v50 59 hhid 69-72 respid 73-74
/2 y5v57 1 y5v65 10
/4 y5v181 35 y5v182 36 y5v183 37 y5v184 38
/9 y5v435 4 y5v437 8 y5v439 12 y5v441 16 y5443 20 y5v445 24
y5v447 28 y5v463 60
* Variable Labels ....
* Value Labels ....
* Do a frequency command to test your results.
FREQUENCIES VARIABLES=y5v1 y5v463
* Save your results to an SPSS system file.
SAVE OUTFILE='/tmp/nys5.sys'
Here's the STATA dictionary
infix dictionary using "/DL/soc/youth/nys/wave5/da9112.card" {
15 lines
1:
y5V1 1 y5v2 2 y5v6 9-10 y5v7 11-12 y5v39 45 y5v40 46-47 y5v45 55
y5v50 59 hhid 69-72 respid 73-74
2:
y5v57 1 y5v65 10
3: y5v181 35 y5v182 36 y5v183 37 y5v184 38
9: y5v435 4 y5v437 8 y5v439 12 y5v441 16 y5443 20 y5v445 24 y5v447 28
y5v463 60
}
and here's the STATA DO File
* ========================================================================= * SN9112.do * Example STATA syntax for extracting data from an ASCII file in which * there are multiple lines of data for each observation * Sample data are the National Youth Survey, Wave 5 * ========================================================================= * set memory 20m * * infix using sn9112
| [ SAS EXAMPLE] [ SPSS EXAMPLE] |
Extracting data from a hierarchical file involves identifying all distinct data levels within the file and the variable that differentiates each level. The general idea is to extract the required data from each level of the hierarchy to generate a rectangular file. This is done by forcing the application to process each record type differently. Visually,
H1
P1 H1P1
P2 H1P2
P3 =====> H1P3
H2 H2P1
P1 H2P2
P2
where H1 and H2 are records for separate households, and P records indicate records for persons within each respective household.
The examples below use the 1990 Public Use Microdata for the state of Wyoming. Note: STATA cannot process hierarchical data files, thus there is no STATA example.
/*====================================================
* PUMS90.SAS
* Example SAS code for Intro to Data Handling
* *** Extraction on a hierarchical dataset.
* *** Example selection of cases
* ===================================================
*/ ;
***************** SAS Library Set Up Statements ****************** ;
LIBNAME pumout '.' ;
FILENAME pumsin '/DL/cens/pums/1990/5pct/wy56' ;
***************** Data Step ************************************** ;
* 1) RETAIN Statment. This tells SAS to Hold on to the values of
* specified variables across "input statements". In this case,
* the household variables.
* 2) First INPUT statement extracts the variable identifying the
* record type and the "@" is used to hold that record in the
* input buffer for further processing.
* 3) IF--THEN sequences. The IF statement is used to process cases
* by the value of the rectype. The DO-- END structure executes
* multiple commands.
* 4) DELETE tells SAS NOT to write the Household record to the
* output file.
;
DATA pumout.wy56 ;
INFILE pumsin MISSOVER LRECL=231 ;
RETAIN houseid region state puma areatype persons rfaminc rhhinc rwrkr89
rhhfamtp ;
INPUT rectype $ 1 @;
IF rectype='H' THEN
DO;
INPUT houseid 2-8 region 10 state 11-12 puma 13-17 areatype 18-19
persons 33-34 rfaminc 134-140 rhhinc 141-147 rwrkr89 148
rhhfamtp 151-152 ;
END;
IF rectype='P' THEN INPUT
persid 2-8 relat1 10-11 sex 11 race 12-14 age 15-16 pwgt1 18-21 rpob 29-30
hispanic 38-40 poverty 41-43 citizen 47 immigr 48-49 yearsch 51-52
mobility 59 migpums 62-66 yearwrk 114 industry 115-117 occup 118
work89 122 rearning 127-132 rpincome 133-138 ;
********** Selection of cases in which Age > 18 and Age < 65 ;
IF (17 < age < 65) ;
********** End of Data Step **************************************;
PROC FREQ DATA=pumout.wy56(KEEP=state sex race age) ;
* =================================================== * PUMS90.SPS * Example SPSS syntax for extracting data from a hierarchical file * Also, example selection of cases * =================================================== ************* File Handle Setup Statment ************************** FILE HANDLE pumsin /NAME='/DL/cens/pums/1990/1ptc/wy56' LRECL=231 ************* File Type Command *********************************** **** Requires that the record identifier be at the beginning of the record! **** Hierarcical files are type nested **** Record subcommand tells SPSS where to find the record id variable FILE TYPE NESTED FILE=pumsin RECORD=rectype 1 (a) RECORD TYPE 'H' DATA LIST /houseid 2-8 region 10 state 11-12 puma 13-17 areatype 18-19 persons 33-34 rfaminc 134-140 rhhinc 141-147 rwrkr89 148 rhhfamtp 151-152 RECORD TYPE 'P' DATA LIST /persid 2-8 sex 11 race 12-14 age 15-16 pwgt1 18-21 rpob 29-30 hispanic 38-40 poverty 41-43 citizen 47 immigr 48-49 yearsch 51-52 mobility 59 migpums 62-66 yearwrk 114 industry 115-117 occup 118 work89 122 rearning 127-132 rpincome 133-138 END FILE TYPE ************* Example Selection Statement ****************** SELECT IF (age > 17 AND age <= 65) FREQUENCIES VARIABLES=sex race age ************* Save out a SPSS system file ******************* SAVE OUTFILE='wy56.sys'
|
[ Concatenating Files ] [
Same Observations (1-to-1) ] [ Different Levels of Analysis ] |
As noted earlier, relational files are those that can be linked or merged on the basis of a relationship or value(s) of a specified variable(s). The key to working with relational files is determining the relationship between the file AND making sure that you have the appropriate variables to use as a basis for combining the files. At the outset, one can think of two basic types of operations--the first is to add cases from two files to create a much larger data set. The second operation would involve merging data from two data sets. For example, you may have two files containing the same individuals but at different points in time (for example, responendents in a panel study) or files that are related on the basis of a set of variables such as students and schools.
| [ SAS EXAMPLE] [ SPSS EXAMPLE] |
In most cases this is a fairly straightforward operation and comes up when you are working with raw files that are divided on the basis of geography or some other factor. For example, the Public Use Microdata sample from the 1990 Census is subdivided into 51 files. After you've run an extraction on two or more states, you might want to combine them into a single data set.
/* ===================================================== PUMSADD.SAS Example file of appending cases from two files with same variables to create a new file ================================================= */; LIBNAME add2 '.'; DATA add2.pums2; SET add2.wy56 add2.hi15; PROC CONTENTS;
*===================================================== * PUMSADD.SPS * Example SPSS syntax for concatenating two SPSS files with same variables * Example data are WY56.SYS and HI15.SYS *==================================================== ADD FILES FILE='wy56.sys' /FILE='hi15.sys' SAVE OUTFILE='pums.sys' /COMPRESSED
USE /home/spcbooke/data101/stata/il17 APPEND USING /home/spcbooke/data101/stata/in18
| [ SAS EXAMPLE] [ SPSS EXAMPLE] [A HREF=#exmerb3" > STATA EXAMPLE ] |
Whether you're working with Panel data, you've forgotten to extact a variable of vital importance to your research, or you need to add newly constructed variables from another file, merging data sets is an essential tool to learn. Most merges require that you have a unique or common variable between the two datasets--such as a unique caseid-- that allow you to match them together.
Two points are worth noting here. First, in order to merge files, they must be sorted on the basis of the unique identifying variable(s). Second, the order in which you specify the files to be merged makes a difference. The first listed file is the "master file". The most important implication of this structure is that if both files to be merged have variables with the same name, ONLY the variable in the first file will be saved. In other words, if you have recoded or in any other way modified a variable in the second data set, you should either rename it or drop the variable from the first dataset. (See "Useful" hints below).
/*====================================================
* PUMSMATCH.SAS
* Example of case-to-case match.
*
*=====================================================
*/;
LIBNAME match1 '.';
FILENAME pumin '/DL/cens/pums/1990/5pct/wy56';
* First step is to read in a new variable from the data file;
* Note that I don't save this as a permanent data set!;
DATA a1;
INFILE pumin MISSOVER LRECL=231;
INPUT rectype $ 1 @
IF rectype="P" THEN INPUT personid 2-8 relat1 9 asex 185;
IF rectype="P";
******* Sort both data sets ****************************;
SORT DATA=work.a1;
BY personid relat1;
SORT DATA=match1.wy56;
BY personid relat1;
******* Merge the files ********************************;
DATA match1.wy56;
MERGE match1.wy56 work.a1;
BY personid relat1;
*===================================================== * PUMSMATCH.SPS * Merges variables from two data sets by Personid * Example SPSS Syntax for Intro to Data Handling * * ==================================================== FILE HANDLE pumin /NAME='/DL/cens/pums/1990/5pct/wy56' LRECL=231 ***** Extract sample data. FILE TYPE NESTED FILE=pumsin RECORD rectype 1 (A) RECORD TYPE 'P' DATA LIST /persid 2-8 relat1 9-10 asex 185 END FILE TYPE ****** Sort cases by identifying variables: persid and relat1. ***** If you don't sort, SPSS will crash!! ****** Note that you need to save the sorted data! SORT CASES BY persid relat1 SAVE OUTFILE='/tmp/wy56a' GET FILE='wy56.sys' SORT CASES BY persid relat1 SAVE OUTFILE='/tmp/wy56sort.sys' **** Matching the files together. **** The order in which you specify files makes a difference~! **** **** In this first example, wy56sort.sys has fewer cases than w56a.sort MATCH FILES FILE='/tmp/wy56sort.sys' /FILE='/tmp/wy56a.sys' /BY persid relat1 **** In this first example, wy56sort.sys has more cases than w56a.sort MATCH FILES FILE='/tmp/wy56a.sys' /FILE='/tmp/wy56sort.sys' /BY persid relat1
use /home/spcbooke/data101/file2 sort caseid save /home/spcbooke/data101/file2, replace use /home/spcbooke/data101/file1 sort caseid merge caseid using /home/spcbooke/data101/file2, * Note, stata produces a nifty little variable named _merge * Coded values of _merge are: * 1 - Observation found only in the “master file” - file1 above * 2 - Observation found only in the “using” file - file 2 above * 3 - Observation found in both files
| [ SAS EXAMPLE] [ SPSS EXAMPLE] [STATA EXAMPLE ] |
This section describes a broad array of instances in which your research design calls for merging data from different sources or different levels of analysis. For example, you might want to add census tract level variables to a survey data collection or school characteristics with students.
The command files below demonstrate how to go about merging such files. As in all cases of merging, you MUST a common variable or variables between the two files and each file MUST be sorted on the basis of the common variable(s).
/*=================================================
* PUMSMERGE.SAS
* Example of a one-to-many merge.
* In this case, we're merging a household variable to person records.
*====================================================
*/;
LIBNAME merge1 '.';
FILENAME pumsin '/ec1/spcdata/cens/pums/1990/5pct/wy56';
DATA m2;
INFILE pumsin MISSOVER LRECL=231;
INPUT rectype $ 1;
IF rectype="H" THEN INPUT housid 2-8 state 9-10 rgrapi 121-122;
IF rectype="H";
SORT DATA=work.m2;
BY housid;
SORT DATA=merge1.wy56;
BY houseid;
DATA merge1.newwy56;
MERGE merge1.wy56 work.m2;
BY housid;
*==================================================== * PUMSMERGE.SPS * Example SPSS syntax for merging one to many * In this case, merging a household variable to person records from * the 1990 Public Use Microdata * ===================================================== FILE HANDLE pumin /name='/ec1/spcdata/cens/pums/1990/5pct/wy56' LRECL=231 ***** Extract sample data. FILE TYPE NESTED FILE=pumsin record rectype 1 (A) RECORD TYPE 'H' DATA LIST /housid 2-8 state 9-10 rgrapi 185-186 END FILE TYPE ****** Sort cases by identifying variables: housid. ***** If you don't sort, SPSS will crash!! ****** Note that you need to save the sorted data! SORT CASES BY housid SAVE OUTFILE='/tmp/wy56h' GET FILE='wy56.sys' SORT CASES BY housid SAVE OUTFILE='/tmp/wy56hsort.sys' **** Matching the files together. **** Note the TABLE subcommand. **** MATCH FILES FILE='/tmp/wy56hsort.sys' /TABLE='/tmp/wy56h.sys' /BY housid
|
[ SAS/UNIX <--> SAS/PC ] [
SPSS/Unix <--> SPSS/WIN ] [ SAS <--> SPSS ] [ SAS --> STATA ] |
During the course of a research project, it is not uncommon to discover that another application does an analysis better than the one that you're currently using. For example, SAS might be able to do something that SPSS cannot. You might also want to do some analyses on a UNIX system and others on a desktop. In all of these cases, you will need to "massage" your data to make it readable by another application. This section discusses some of the common cases. If your particular case is not discussed here, feel free to contact a site assistant or a data library staff member to discuss the best strategy AND/OR read the user manual for the application that you want to move to--it often has a lot of information on this subject!
This section describes how to transfer a SAS data set from UNIX to the PC. The same steps work the other way as well!
Step 1: SAS/Unix Command file
* =================================================
* SASTRAN.SAS
* Command file used to create a SAS Transport File
* =================================================
OPTIONS REPLACE; /* This is needed, otherwise SAS won't create the
transport file */
LIBNAME in1 '/u7/spcbooke';
LIBNAME out1 XPORT '/u7/spcbooke/sastran.xpt'; /* THIS IS A FILE!!!!
NOT DIRECTORY */
PROC COPY IN=in1 OUT=out1;
SELECT FILE; /* This is the name of the file you want included in
the transport file */
RUN;
Step 2: Transfer the Transport file to the PC.
From a PC, ftp the file from UNIX to the local hard drive.
ftp johnjohn login: your-login-id password: your-passwd ftp> binary ftp> get sastran.xpt ftp> exit
Step 3: Read the file into SAS/WIN. Here are the commands.
*=============================================
* SASIN.SAS
* Command file used to read transport file into SAS
* =============================================
OPTIONS REPLACE;
LIBNAME in1 XPORT 'C:\USERS\sastran.xpt'; /* This is a file */
LIBNAME out1 'C:\USERS'; /* This is where you want the sas data set
to be stored */
PROC COPY IN=in1 OUT=out1;
This section describes the creation of an SPSS portable file from an SPSS System file that can be read by any version of SPSS on any other platform.
Step 1: Create the SPSS Portable File
* ========================================= * SPSSPOR.CMD * Commands used to create a Portable File * ========================================= GET FILE='/u7/spcbooke/file.sys'. EXPORT OUTFILE='/u7/spcbooke/file.por'.
Step 2: FTP the file to the PC
ftp johnjohn login: your-login-id password: your-passwd ftp> get file.por ftp> exit
Step 3: Read the File into SPSS/Windows Method 1: Using the menus.
Since SPSS/Win is a graphical interface, you can read the file by making the appropriate menu selections. Basically, select File from the main menu bar, OPEN, DATA. You should now be in the "OPEN DATA FILE" window. The trick is to select the correct File Type (SPSS Portable) [ Lower Left corner of the window ]. Select your file, and SPSS will magically read it in. Remember to SAVE it when you're done!
Method 2: Writing Commands.
The SPSS Menus basically write the commands for you. Here's what was done:
IMPORT FILE='c:\users\file.por'. SAVE OUTFILE='c:\users\file.sav'.
As of SPSS version 5.0 on UNIX and version 6.0 on Windows and SAS version 6.09 on all platforms (UNIX and PC), it has become fairly easy to move data sets between SAS and SPSS. Both applications require that the files be in a transport/portable format as described in VIII.A.Step_1 and VIII.B.Step_1 above.
SAS --> SPSS
The command to read a SAS Transport file into SPSS is:
GETSAS DATA='/u29/spcbooke/file.xpt'
NOTE: if you're using SPSS/WIN, you'll have to type this command into a syntax window! This is not currently available as a menu option.
SPSS --> SAS
Getting SAS to read a SPSS file is a little more involved.
LIBNAME here '/u29/spcbooke'; /* This is the directory */
LIBNAME porfile SPSS '/u29/spcbooke/file.por'; /* This is the SPSS
portable file */
DATA here.file;
SET porfile.dummy; /* the .dummy part is because SAS assumes a file
name when referenced with a library */
It is now possible to transfer SAS data sets to STATA under UNIX, with the following caveats: the SAS dataset must not contain more than 2000 variables or 4000 observations!!!
% sas2stata file
This will convert SAS data set, file.ss01, into a STATA dataset, file.dta.
|
[Working with Compressed ASCII Data ] [Selecting a subset of variables ] [Random Sampling a subset of observations] |
This section discusses other useful things not discussed elsewhere. This section is still under construction!
One way to efficiently work with LARGE data collections (those over 30 MB uncompressed) is to trick your application into reading compressed data. This only works with SAS although rumor has it that SPSS will begin supporting "named pipes" soon......
The basic idea of a "named pipe" is to create a conduit file that will store small chunks of data from the compressed file and then pass it on to SAS when SAS asks for data. Here's how it works:
% mknod mypipe p &
% gzcat -c file.gz > mypipe &
mknod is used to create a "named pipe file"--mypipe in this example. This requires that you have /usr/etc in your UNIX path. The gzcat line is used to uncompress the data to the pipe. NOTE: both of these commands should be run in the background (denoted by "&"). NOTE: The gzcat won't begin executing until you start your sasjob.
In your file containing the SAS data step, change your FILENAME line as follows:
FILENAME raw 'mypipe';
One way to work "smartly" is to select only the variables you want to work with. This makes your working dataset much smaller and easier for the system to process. If you construct new variables, you can merge these back with the original dataset using the "update" commands.
One of the most commonly overlooked features of all statistical applications is that you can choose to use only those variables that are germane to the statistical tasks at hand. This can be done by either keeping the variables of interest or dropping those that are not. Example syntax for SPSS and SAS are presented below:
SPSS:
GET FILE='mydata.sys' /KEEP=v3 v4.
COMPUTE v5=v3+v4.
UPDATE FILE='mydata.sys' /FILE=*.
SAVE OUTFILE='newdata.sys'.
SAS:
DATA a;
SET mydata.junk (KEEP=v3 v4);
v5=v3+v4;
DATA mydata.junk1;
UPDATE mydata.junk a;
Another strategy for working with data, particularly at the early stages of a process is to select a random sample of observations to analyze. This has the advantage of allowing you to debug your progams without consuming vast resources. It can also point out potential flaws with your analysis prior to working with the full data collection.
SPSS Example:
GET FILE='mydata.sys'.
SAMPLE .10. {This samples 10% of the observations}
SAS Examples:
If your data are already in a "random" format (aka haven't been sorted), you can specify the number of observations to work with using the observations option:
OPTIONS OBS=100;
If you need to select a random sample on a file that has been sorted, you can do it as follows:
DATA;
SET mydata.junk;
IF ranuni(2) < .10; /* This selects 10% of the observations */
Return to the Data Library home page