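Study notes on 《SAS编程与数据挖掘商业案例》 (SAS Programming and Data Mining Business Cases), part 19
Continuing the study notes on 《SAS编程与数据挖掘商业案例》, this post focuses on data-processing practice: hash objects, user-defined formats, and the powerful regular expressions.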
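1: Hash objects
A hash object, also called a hash table, is a data structure that is accessed directly by key value.
SAS provides two classes for working with hash tables: hash, which stores the data, and hiter, which iterates over it. The hash class provides methods for lookup, insertion, update, and deletion; the hiter class provides positioning and traversal methods such as first and next.
Advantages: key lookups happen in memory, which improves performance;
            entries can be added, updated, or deleted dynamically while the DATA step runs;
            data in the hash table can be located quickly, reducing the number of lookups.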
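Common methods:
definekey: defines the key variables
definedata: defines the data variables
definedone: completes the definition; if a data set was named in the declaration, it is loaded at this point
add: adds a key/data pair; if the key already exists in the hash table, nothing is added and a nonzero return code is returned
replace: replaces the data if the key exists in the hash table, otherwise adds the key/data pair
remove: removes the entry for the key
find: looks up the key; if it exists, writes the stored data into the corresponding variables
check: looks up the key; if it exists, returns rc=0 without changing the current variable values
output: writes the hash table to a data set
clear: removes all entries from the hash table without deleting the object
equal: determines whether two hash objects are equal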
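The return-code behavior of the methods listed above can be seen in a small sketch. The variable names code and label, and the values stored in them, are made up for illustration; the table is built in memory so no input data set is needed.

```sas
data _null_;
   length code $ 3 label $ 20;   /* hypothetical key and data variables */
   declare hash h();
   h.definekey('code');
   h.definedata('label');
   h.definedone();

   code = 'A01'; label = 'first';
   rc_add1 = h.add();            /* key is new: rc = 0 */
   label = 'second';
   rc_add2 = h.add();            /* key exists: nonzero rc, stored entry unchanged */
   rc_rep  = h.replace();        /* overwrites the stored label with 'second' */

   label = 'scratch';
   rc_chk  = h.check();          /* rc = 0, label is left as 'scratch' */
   rc_find = h.find();           /* rc = 0, label is overwritten from the table */

   rc_rem  = h.remove();         /* deletes the entry for key 'A01' */
   rc_chk2 = h.check();          /* key is gone: nonzero rc */
   n_items = h.num_items;        /* number of entries left in the table */
   put rc_add1= rc_add2= rc_rep= rc_chk= rc_find= label= rc_rem= rc_chk2= n_items=;
run;
```

Assigning each method call to a variable matters here: when the return code of a failing call is not captured, SAS writes an error to the log instead.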
Example of the find method:
libname chapt12 'f:\data_model\book_data\chapt12';
data results;
   /* never executed; only adds the variables of participants to the PDV at compile time */
   if _n_ = 0 then set chapt12.participants;
   if _n_ = 1 then do;
      /* load the lookup table into memory once */
      declare hash h(dataset:'chapt12.participants');
      h.definekey('name');
      h.definedata('gender', 'treatment');
      h.definedone();
   end;
   set chapt12.weight;
   /* keep only rows whose name is found; find() fills gender and treatment */
   if h.find() = 0 then
      output;
run;
Example of the hiter object:
data patients;
   length patient_id $ 16 discharge 8;
   input patient_id discharge:date9.;
datalines;
smith-4123 15mar2004
hagen-2834 23apr2004
smith-2437 15jan2004
flinn-2940 12feb2004
;
data _null_;
   /* never executed; only defines patient_id and discharge in the PDV */
   if _n_=0 then set patients;
   declare hash ht(dataset:"patients",ordered:"ascending");
   ht.definekey("patient_id");
   ht.definedata("patient_id", "discharge");
   ht.definedone();
   declare hiter iter("ht");
   rc = iter.first();
   do while (rc=0);
      put patient_id discharge:date9.;
      rc = iter.next();
   end;
   /* the SET above never runs, so end the step explicitly after one pass */
   stop;
run;
The statement declare hiter iter("ht"); defines an iterator iter over the hash table ht. The first method then positions the iterator on the first entry of the hash table, and next steps through the remaining entries, writing each one to the log.
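Besides first and next, the hiter class also has last and prev, which walk an ordered table from the end. The following is a minimal sketch of a reverse traversal; it fills the table by hand with two of the rows from the example above rather than loading a data set.

```sas
data _null_;
   length patient_id $ 16 discharge 8;
   declare hash ht(ordered:"ascending");
   ht.definekey("patient_id");
   ht.definedata("patient_id", "discharge");
   ht.definedone();

   /* two rows taken from the patients example */
   patient_id = "smith-4123"; discharge = '15mar2004'd; rc = ht.add();
   patient_id = "hagen-2834"; discharge = '23apr2004'd; rc = ht.add();

   declare hiter iter("ht");
   rc = iter.last();             /* position on the last key in sort order */
   do while (rc = 0);
      put patient_id discharge :date9.;
      rc = iter.prev();          /* step backwards; nonzero rc at the start */
   end;
run;
```

Because the table is declared with ordered:"ascending", walking it with last and prev writes the keys in descending order.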
Business case: merging two data sets:
data both1(drop=rc);
   declare hash plan ();
   rc = plan.definekey ('plan_id');
   rc = plan.definedata ('plan_desc');
   rc = plan.definedone ();
   /* load every plan into the hash table */
   do until (eof1);
      set chapt12.plans end = eof1;
      rc = plan.add ();
   end;
   /* look up the plan description for every member */
   do until (eof2);
      set chapt12.members end = eof2;
      call missing(plan_desc);   /* clear the value left over from the previous row */
      rc = plan.find ();
      output;
   end;
   stop;
run;
The program above can be simplified to:
data both2(drop=rc);
   length plan_id $3 plan_desc $20;
   if _n_ = 1 then do;
      /* let the declaration load chapt12.plans instead of looping over it */
      declare hash h(dataset:'chapt12.plans');
      h.definekey('plan_id');
      h.definedata('plan_desc');
      h.definedone();
      call missing(plan_desc);
   end;
   set chapt12.members;
   rc=h.find();
run;
2: Formats
User-defined formats:
Proc Format;
   Value $Sex_Fmt
      'F' = 'Female'
      'M' = 'Male'
      Other = 'Unknown';
   Value Age_Dur
      Low-10  = "10 and under"
      11-13   = "11-13"
      14-<15  = "14 to under 15"
      15-High = "15 and over";
Run;
Usage:
Data test;
   Set sashelp.class(keep=sex age);
   /* a format name in put() must end with a period */
   x = put(sex, $sex_fmt.);
   y = put(age, age_dur.);
Run;
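3: Regular expressions:
/.../  delimiters marking the start and end of a regular expression
|  alternation between items (the "or" operator)
()  grouping; marks the start and end of a subexpression
.  any character except a newline
\w  any word character: digits, upper- and lowercase letters, and the underscore
\W  any non-word character
\s  any whitespace character, including space, tab, newline, and carriage return
\S  any non-whitespace character
\d  any digit 0-9
\D  any non-digit character
[...]  any one of the characters listed in the brackets
[^...]  any one character not listed in the brackets
[a-z]  any character from a to z
[^a-z]  any character not in the range a to z
^  matches the start of the input string
$  matches the end of the input string
\b  a word boundary (the start or end of a word)
\B  a non-word boundary
*  matches zero or more times
+  matches one or more times
?  matches zero or one time
{n}  matches exactly n times
{n,}  matches at least n times
{n,m}  matches n to m times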
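As a small illustration of how anchors, character classes, and quantifiers combine, the sketch below checks whether a value has the shape "three digits, a hyphen, four digits" and nothing else. The test strings are made up. Note the \s* before $: SAS character variables are padded with trailing blanks, so a bare $ right after the last digit would fail on a padded value.

```sas
data _null_;
   length s $ 20;
   /* ^ and $ anchor the whole value; \s* absorbs the trailing blanks */
   re = prxparse('/^\d{3}-\d{4}\s*$/');
   do s = '555-1234', '5555-1234', '555-12345', 'abc-1234';
      matched = (prxmatch(re, s) > 0);   /* 1 if the pattern matches, else 0 */
      put s= matched=;
   end;
run;
```

Only the first value should match: the second has four leading digits, the third has five trailing digits, and the fourth starts with letters.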
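Common functions:
prxparse: compiles a regular expression and returns its identifier
prxmatch: returns the position of the first match of the pattern
call prxsubstr: returns the start position and length of the match in the target string
prxposn: returns the text matched by a capture group (subexpression)
call prxposn: returns the start position and length of a capture group's match
call prxnext: returns the position and length of successive matches in the target string
prxchange: replaces the text that matches the pattern (function form)
call prxchange: replaces the text that matches the pattern (call routine form)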
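Of the functions listed above, prxchange in its function form is the quickest to try, because it accepts the substitution pattern directly as a string. The following sketch swaps "Last, First" into "First Last" using the capture groups $1 and $2; the sample name is only for illustration.

```sas
data _null_;
   length name swapped $ 32;
   name = 'Jones, Fred';
   /* s/pattern/replacement/ ; -1 means replace every match */
   swapped = prxchange('s/(\w+), (\w+)/$2 $1/', -1, name);
   put name= swapped=;
run;
```

The pattern is written in single quotes so that it is passed to the regular expression engine exactly as typed.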
Eg1: locating a pattern
data _null_;
   /* Perl-style patterns need the PRX functions, not the older RX functions */
   if _n_ = 1 then pattern_num = prxparse("/cat/");

   retain pattern_num;
   input string $30.;
   position = prxmatch(pattern_num,string);
   file print;
   put pattern_num= string= position=;
datalines;
there is a cat in this line.
does not match cat
cat in the beginning
at the end, a cat
cat
;
run;
Eg2: data validation
data match_phone;
   set chapt12.phone_numbers;
   if _n_ = 1 then pattern = prxparse("/\(\d\d\d\) ?\d\d\d-\d{4}/");
   retain pattern;
   if prxmatch(pattern,phone) gt 0 then output;
run;
Find the phone numbers that do not match:
data unmatch_phone;
   set chapt12.phone_numbers;
   where not prxmatch("/\(\d\d\d\) ?\d\d\d-\d{4}/",phone);
run;
Eg3: extracting a string that matches a pattern
data extract;
   if _n_ = 1 then do;
      pattern = prxparse("/\(\d\d\d\) ?\d\d\d-\d{4}/");
      if missing(pattern) then do;
         put "error in compiling regular expression";
         stop;
      end;
   end;
   retain pattern;
   length number $ 15;
   input string $char80.;
   /* start and length describe the first match; start is 0 when there is none */
   call prxsubstr(pattern,string,start,length);
   if start gt 0 then do;
      number = substr(string,start,length);
      number = compress(number," ");
      output;
   end;
   keep number;
datalines;
this line does not have any phone numbers on it
this line does: (123)345-4567 la di la di la
also valid (123) 999-9999
two numbers here (333)444-5555 and (800)123-4567
;
run;
Eg4: extracting names
data ReversedNames;
   input name & $32.;
   datalines;
Jones, Fred
Kavich, Kate
Turley, Ron
Dulix, Yolanda
;
data FirstLastNames;
   length first last $ 16;
   keep first last;
   retain re;
   if _N_ = 1 then
      re = prxparse('/(\w+), (\w+)/');
   set ReversedNames;
   if prxmatch(re, name) then
      do;
         last = prxposn(re, 1, name);
         first = prxposn(re, 2, name);
      end;
run;
Note: 1 and 2 refer to the first and second capture groups in the regular expression.
Eg5: extracting names that follow a given pattern
data old;
   input name $60.;
   datalines;
Judith S Reaveley
Ralph F. Morgan
Jess Ennis
Carol Echols
Kelly Hansen Huff
Judith
Nick
Jones
;
data new;
   length first middle last test $ 40;
   /* re1: first name, optional middle name, last name (three groups) */
   re1 = prxparse('/(\S+)\s+([^\s]+\s+)?(\S+)/o');
   /* re2: same pattern with the separating whitespace captured too (four groups) */
   re2 = prxparse('/(\S+)(\s+)([^\s]+\s+)?(\S+)/o');
   set old;
   id1=prxmatch(re1, name);
   id2=prxmatch(re2, name);
   if id1 then
      do;
         first = prxposn(re1, 1, name);
         middle = prxposn(re1, 2, name);
         last = prxposn(re1, 3, name);
      end;
   /* group 4 exists only in re2, where it holds the last name */
   if id2 then test=prxposn(re2, 4, name);
   put test=;
run;
Eg6: returning the positions of multiple matches
data _null_;
   expressionid = prxparse('/[crb]at/');
   text = 'the woods have a bat, cat, and a rat!';
   start = 1;
   stop = length(text);
   call prxnext(expressionid, start, stop, text, position, length);
   do while (position > 0);
      found = substr(text, position, length);
      put found= position= length=;
      call prxnext(expressionid, start, stop, text, position, length);
   end;
run;
Note: the first call prxnext returns a position. The loop then extracts the matching substring and runs call prxnext again, which returns the position of the next match; the loop ends when no further match is found and position is 0.
Eg7: replacing text
data cat_and_mouse;
   input text $char40.;
   length new_text $ 80;
   if _n_ = 1 then match = prxparse("s/[Cc]at/mouse/");
   retain match;
   /* -1 replaces every match; trunc is 1 if new_text was too short for the result */
   call prxchange(match,-1,text,new_text,len,trunc,num);
   if trunc then put "note: new_text was truncated";
datalines;
the Cat in the hat
there are two cat cats in this line
here is no replacement
;
run;