Reading Delimited Text Files Into SAS®

Much has changed since your college days when you used SAS 6 to read CSV files. Back then you needed to preprocess the files to fix the consecutive commas so that SAS could understand that there were missing values between them. There was also the whole problem of quotes around the text strings. Things are much better now. Read on for a sample of how much easier things have gotten.

Using the INFILE and INPUT statements there are several options that make life much easier. Here's the short list:

Delimiter Sensitive Data (DSD)

Several things happen by default. First, SAS is expecting a comma delimiter to separate data values. Second, if the INPUT statement finds a delimiter inside a quoted string we understand that the delimiter is part of the data and do not try to split the string into multiple variables. Third, we strip off the quotes in the resulting variable. Fourth, if we find consecutive delimiters, SAS understands that there is an implied missing value between them.

You can change the default delimiter for the DSD option with the DELIMITER= option.

DELIMITER= (DLM=)/*设置分隔符*/

This option enables you to tell SAS what single character is used as a delimiter in the file you are reading. Some common delimiters are comma (,), vertical pipe (|), semi-colon (;), and tab. At this time the TAB needs to be by its hexadecimal value. For the ASCII systems (UNIX, Windows, and Linux), the value is 09. For EBCDIC systems (z/OS and MVS), the value is 05. The syntax looks like this for the pipe: DLM='|'. It looks like this for a TAB on ASCII systems: DLM='09'x.

TRUNCOVER

By default, if the INPUT statement reads past the end of a record without finding enough data to populate the variables listed, it continues to read data from the next record. This action is called flowover. The TRUNCOVER option tells SAS to use whatever data it finds to populate as many of the variables as it can.

LRECL=/*设置读入数据的maximum record length*/

This option should be used when the length of the records in a file are greater than 256 bytes (on ASCII platforms). The input buffer is 256 bytes by default, and records that are greater than this length will be truncated when they are read. Setting the LRECL= option to a greater length assures that the input buffer is long enough to read the whole record. In SAS 9.2 you can set a system option to make the default LRECL larger.

FIRSTOBS=/*设置从第几行开始读数*/

This option indicates that you want to start reading the input file at the record number specified, rather than beginning with the first record. This option is helpful when reading files that contain a header record, because a user can specify FIRSTOBS=2 and skip the header record entirely.

The following is a typical INFILE statement:

INFILE "C:\sasfiles\testfile.csv" DLM='09'x DSD LRECL=1024 TRUNCOVER FIRSTOBS=2;

 

The INPUT statement is used to list the variables you want to read from the file, for example:

DATA A; INFILE "C:\ sasfiles\testfile.csv"; INPUT VAR1 VAR2 VAR3; RUN;

 

Inputting Data

If your file contains numeric variables or character variables with a length of eight characters or less, then you don't have to worry about using informats to read in the data. Here's an example:

DATA A; INFILE 'C:\sas\example1.csv' DLM='|' DSD TRUNCOVER; INPUT FNAME $ LNAME $ AGE; RUN;

 

In the example above, the variables FNAME and LNAME are character, so they need a $ after each variable name. The variable AGE is numeric, so nothing else is needed. All three variables will have a length of 8 bytes, because no informats were used.

If you have anything else in your data, you will almost certainly want to use informats. For instance if you have a Date or a Time value or names or addresses longer than eight characters, you can add an INFORMAT statement or you can add the informat to the INPUT statement. If you choose the latter, add a colon(:) in front of the informat like this:

DATA A; INFILE 'C:\sas\example2.csv' DLM='09'X DSD TRUNCOVER; INPUT FNAME :$20. LNAME :$30. ADDRESS1 :$50. ADDRESS2 :$50. CITY :$40. STATE :$2. ZIP PHONE :$12.; RUN;

 

Troubleshooting

If you are having trouble reading a delimited file and aren't sure which of the options above you might need, the SAS log might hold the clues to solving the problem.

Problem 1

If a record was longer than the size of the input buffer and was truncated, there are two notes in the log that tell the user that this action occurred. The first note deals with the maximum record length for the file, and the second note tells the user that one or more records were truncated as follows:

The maximum record length was 256. One or more lines were truncated.

 

By default, the size of the input buffer on UNIX and PCs is 256 bytes, and records that are greater than this length will be truncated when they are read into SAS. If you see these notes in the log, increasing the input buffer with the LRECL option will correct the problem.

Problem 2

If a record does not contain enough bytes to satisfy the INPUT statement, and SAS went to a new record, there are two notes in the log that tell a user this action occurred:

NOTE: LOST CARD. NOTE: SAS went to a new line when INPUT statement reached past the end of a line.

 

If you see either one or both of these notes in the log, adding the TRUNCOVER option to the INFILE statement will correct the problem. While the second note will always be written to the log in the above scenario, the first note will only be written to the log when SAS has read the last record in the file and tries to read another record to satisfy the INPUT statement.

Hopefully these tips will make reading delimited text files a lot clearer!