Reading Overlapping Text Segments

Tuesday, 19 May, 2009

⨳ 2 minute read ⨳ 263 words ⨳ quatchsasweb scraping

if you’re in the same boat as me, you’ve got sas 9.1.3 sp4 installed under windows. the thing about sas 9.1.3 sp4 running under windows is the lrecl is limited to 32,767 characters. (even though the sas kb says 1mb, that doesn’t appear to be true. at least not for me.)

if you’re trying to parse a long string of text (approaching 140,000 characters before a carriage return is encountered), things break.

here’s a nifty little trick for the input statement:

data _null_;
  infile datalines;
  format seg $8.;
  input seg $8. @@ +(-2);
  put seg;
datalines;
abcdefghijklmnopqrstuvwxyz
run;

so i read in 8 characters, back up 2 characters, read in 8 characters, back up 2 characters, …. the result is nice overlapping character segments ripe for parsing.

1093  data _null_;
1094    infile datalines;
1095    format seg $8.;
1096    input seg $8. @@ +(-2);
1097    put seg;
1098  datalines;

abcdefgh
ghijklmn
mnopqrst
stuvwxyz
yz
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      Memory                            151k


1100  run;

the log shows something about sas reading over to the next line, but, whatever.

i may be missing something very obvious here. if you notice, would you please let me know in the comments.

this post is related to the recent change to the sas training course list which broke my training parser.


Reading Overlapping Text Segments - May 19, 2009 - Richard Koopmann