pig的基本语法以及高级语法
生活随笔
收集整理的這篇文章主要介紹了
pig的基本语法以及高级语法
小編覺得挺不錯的,現在分享給大家,幫大家做個參考.
pig簡介
pig是hadoop上層的衍生架構,與hive類似。對比hive(hive類似sql,是一種聲明式的語言),pig是一種過程語言,類似于存儲過程一步一步得進行數據轉化。
pig簡單操作
?? ? ?1.從文件導入數據
? ?? ?? ? 1)Mysql (Mysql需要先創建表).
? ?? ?? ?? ? CREATE TABLE TMP_TABLE(USER VARCHAR(32),AGE INT,IS_MALE BOOLEAN);
? ?? ?? ?? ? CREATE TABLE TMP_TABLE_2(AGE INT,OPTIONS VARCHAR(50));? ?-- 用于Join
? ?? ?? ?? ? LOAD DATA LOCAL INFILE '/tmp/data_file_1'??INTO TABLE TMP_TABLE ;
? ?? ?? ?? ? LOAD DATA LOCAL INFILE '/tmp/data_file_2'??INTO TABLE TMP_TABLE_2;
? ?? ?? ? 2)Pig
? ?? ?? ?? ???tmp_table = LOAD '/tmp/data_file_1' USING PigStorage('\t') AS (user:chararray, age:int,is_male:int);
? ?? ?? ?? ???tmp_table_2= LOAD '/tmp/data_file_2' USING PigStorage('\t') AS (age:int,options:chararray);
? ? ? ?2.查詢整張表
? ?? ?? ?1)Mysql
? ?? ?? ?? ?SELECT * FROM TMP_TABLE;
? ?? ?? ?2)Pig
? ?? ?? ?? ?DUMP tmp_table;
? ?? ??3. 查詢前50行
? ?? ?? ?1)Mysql
? ?? ?? ?? ?SELECT * FROM TMP_TABLE LIMIT 50;
? ?? ?? ?2)Pig
? ?? ?? ?? ? tmp_table_limit = LIMIT tmp_table 50;
? ?? ?? ?? ? DUMP tmp_table_limit;?
? ? 4.查詢某些列
? ?? ???1)Mysql
? ?? ?? ?? ?SELECT USER FROM TMP_TABLE;
? ?? ???2)Pig
? ?? ?? ?? ?tmp_table_user = FOREACH tmp_table GENERATE user;
? ?? ?? ?? ?DUMP tmp_table_user;
? ?? 5. 給列取別名
? ?? ???1)Mysql
? ?? ?? ???SELECT USER AS USER_NAME,AGE AS USER_AGE FROM TMP_TABLE;
? ?? ???2)Pig
? ?? ?? ???tmp_table_column_alias = FOREACH tmp_table GENERATE user AS user_name,age AS user_age;
? ?? ?? ???DUMP tmp_table_column_alias;?
? ?? 6.排序
? ?? ???1)Mysql
? ?? ?? ???SELECT * FROM TMP_TABLE ORDER BY AGE;
? ?? ???2)Pig
? ?? ?? ?? ?tmp_table_order = ORDER tmp_table BY age ASC;
? ?? ?? ?? ?DUMP tmp_table_order;
?? ? ?7.條件查詢
? ?? ???1)Mysql
? ?? ?? ?? ?SELECT * FROM TMP_TABLE WHERE AGE>20;
? ?? ???2) Pig
? ?? ?? ?? ?tmp_table_where = FILTER tmp_table by age > 20;
? ?? ?? ?? ?DUMP tmp_table_where;
? ? ??8.內連接Inner Join
? ?? ???1)Mysql
? ?? ?? ???SELECT * FROM TMP_TABLE A JOIN TMP_TABLE_2 B ON A.AGE=B.AGE;
? ?? ???2)Pig
? ?? ?? ?? ?tmp_table_inner_join = JOIN tmp_table BY age,tmp_table_2 BY age;
? ?? ?? ?? ?DUMP tmp_table_inner_join;
? ? 9.左連接Left??Join
? ?? ? 1)Mysql
? ?? ?? ???SELECT * FROM TMP_TABLE A LEFT JOIN TMP_TABLE_2 B ON A.AGE=B.AGE;
? ?? ? 2)Pig
? ?? ?? ? tmp_table_left_join = JOIN tmp_table BY age LEFT OUTER,tmp_table_2 BY age;
? ?? ?? ? DUMP tmp_table_left_join;
? ? ?10.右連接Right Join
? ?? ?? ?1)Mysql
? ?? ?? ?? ?SELECT * FROM TMP_TABLE A RIGHT JOIN TMP_TABLE_2 B ON A.AGE=B.AGE;
? ?? ?? ?2)Pig
? ?? ?? ?? ?tmp_table_right_join = JOIN tmp_table BY age RIGHT OUTER,tmp_table_2 BY age;
? ?? ?? ?? ?DUMP tmp_table_right_join;
?? ??11.全連接Full Join
? ?? ?? ?1)Mysql
? ?? ?? ?? ?SELECT * FROM TMP_TABLE A??JOIN TMP_TABLE_2 B ON A.AGE=B.AGE
? ?? ?? ?? ?? ? UNION SELECT * FROM TMP_TABLE A LEFT JOIN TMP_TABLE_2 B ON A.AGE=B.AGE
? ?? ?? ?? ?? ? UNION SELECT * FROM TMP_TABLE A RIGHT JOIN TMP_TABLE_2 B ON A.AGE=B.AGE;
? ?? ?? ?2)Pig
? ?? ?? ?? ?tmp_table_full_join = JOIN tmp_table BY age FULL OUTER,tmp_table_2 BY age;
? ?? ?? ?? ?DUMP tmp_table_full_join;
? ? ??12.同時對多張表交叉查詢
? ?? ?? ? 1)Mysql
? ?? ?? ?? ? SELECT * FROM TMP_TABLE,TMP_TABLE_2;
? ?? ?? ? 2)Pig
? ?? ?? ?? ? tmp_table_cross = CROSS tmp_table,tmp_table_2;
? ?? ?? ?? ? DUMP tmp_table_cross;
? ? ??13.分組GROUP BY
? ?? ?? ?1)Mysql
? ?? ?? ?? ?SELECT * FROM TMP_TABLE GROUP BY IS_MALE;
? ?? ?? ?2)Pig
? ?? ?? ?? ?tmp_table_group = GROUP tmp_table BY is_male;
? ?? ?? ?? ?DUMP tmp_table_group;
? ??? ?14.分組并統計
? ?? ?? ???1)Mysql
? ?? ?? ?? ? SELECT IS_MALE,COUNT(*) FROM TMP_TABLE GROUP BY IS_MALE;
? ?? ?? ???2)Pig
? ?? ?? ?? ???tmp_table_group_count = GROUP tmp_table BY is_male;
? ?? ?? ?? ???tmp_table_group_count = FOREACH tmp_table_group_count GENERATE group,COUNT($1);
? ?? ?? ?? ???DUMP tmp_table_group_count;
? ??? ?15.查詢去重DISTINCT
? ?? ?? ???1)MYSQL
? ?? ?? ?? ???SELECT DISTINCT IS_MALE FROM TMP_TABLE;
? ?? ?? ???2)Pig
? ?? ?? ?? ???tmp_table_distinct = FOREACH tmp_table GENERATE is_male;
? ?? ?? ?? ???tmp_table_distinct = DISTINCT tmp_table_distinct;
? ?? ?? ?? ???DUMP??tmp_table_distinct;
---------------------------------------------------------------------------------------------------------------------------------------------------
上面簡單操作,下面需要進一步了解pig支持的內容:
pig支持數據類型
double > float > long > int > bytearray
tuple|bag|map|chararray > bytearray
double float long int chararray bytearray都相當于pig的基本類型
tuple相當于數組 ,但是可以類型不一,舉例('dirkzhang','dallas',41)
Bag相當于tuple的一個集合,舉例{('dirk',41),('kedde',2),('terre',31)},在group的時候會生成bag
Map相當于哈希表,key為chararray,value為任意類型,例如['name'#dirk,'age'#36,'num'#41
nulls 表示的不只是數據不存在,他更表示數據是unkown
pig latin語法
1:load
LOAD 'data' [USING function] [AS schema];
? ?? ? 例如:
? ?? ?load = LOAD 'sql://{SELECT MONTH_ID,DAY_ID,PROV_ID FROM zb_d_bidwmb05009_010}'? ? USING com.xxxx.dataplatform.bbdp.geniuspig.VerticaLoader('oracle','192.168.6.5','dev','1522','vbap','vbap','1') AS (MONTH_ID:chararray,DAY_ID:chararray,PROV_ID:chararray);
Table = load ‘url’ as (id,name…..);? ? //table和load之間除了等號外 還必須有個空格 不然會出錯,url一定要帶引號,且只能是單引號。
2:filter
? ?? ? alias = FILTER alias BY expression;
? ?? ? Table = filter Table1 by + A; //A可以是 id > 10;not name matches ‘’,is not null 等,可以用and??和or連接各條件
? ?? ? 例如:
? ?? ? filter = filter load20 by ( MONTH_ID == '1210' and??DAY_ID == '18' and??PROV_ID == '010' );
3:group
alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
? ?? ?? ? pig的分組,不僅是數據上的分組,在數據的schema形式上也進行分組為groupcolumn:bag
? ?? ?? ?Table3 = group Table2 by id;也可以Table3 = group Table2 by (id,name);括號必須加
? ?? ?? ?可以使用ALL實現對所有字段的分組
4:foreach
alias = FOREACH alias GENERATE expression [AS schema] [expression [AS schema]….];
alias = FOREACH nested_alias {
alias = {nested_op | nested_exp}; [{alias = {nested_op | nested_exp}; …]
GENERATE expression [AS schema] [expression [AS schema]….]
};
一般跟generate一塊使用
? ?? ?? ?Table = foreach Table generate (id,name);括號可加可不加。
avg = foreach Table generate group, AVG(age);??MAX ,MIN..
在進行數據過濾時,建議盡早使用foreach generate將多余的數據過濾掉,減少數據交換
5:join
Inner??join Syntax
alias = JOIN alias BY {expression|'('expression [, expression …]')'} (, alias BY {expression|'('expression [, expression …]')'} …) [USING 'replicated' | 'skewed' | 'merge' | 'merge-sparse'] [PARTITION BY partitioner] [PARALLEL n];
Outer join Syntax
alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column [USING 'replicated' | 'skewed' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
? ???join/left join / right join
daily = load 'A' as (id,name, sex);
divs??= load 'B' as (id,name, sex);
join
jnd? ?= join daily by (id, name), divs by (id, name);? ?? ??
left join
jnd? ?= join daily by (id, name) left outer, divs by (id, name);
也可以同時多個變量,但只用于inner join
A = load 'input1' as (x, y);
B = load 'input2' as (u, v);
C = load 'input3' as (e, f);
alpha = join A by x, B by u, C by e;
6: union
alias = UNION [ONSCHEMA] alias, alias [, alias …];
union 相當與sql中的union,但與sql不通的是pig中的union可以針對兩個不同模式的變量:如果兩個變量模式相同,那么union后的變量模式與 變量的模式一樣;如果一個變量的模式可以由另一各變量的模式強制類型轉換,那么union后的變量模式與轉換后的變量模式相同;否則,union后的變量 沒有模式。
A = load 'input1' as (x:int, y:float);
B = load 'input2' as (x:int, y:float);
C = union A, B;
describe C;
C: {x: int,y: float}
A = load 'input1' as (x:double, y:float);
B = load 'input2' as (x:int, y:double);
C = union A, B;
describe C;
C: {x: double,y: double}
A = load 'input1' as (x:int, y:float);
B = load 'input2' as (x:int, y:chararray);
C = union A, B;
describe C;
Schema for C unknown.
注意:在pig 1.0中 執行不了最后一種union。
如果需要對兩個具有不通列名的變量union的話,可以使用onschema關鍵字
A = load 'input1' as (w: chararray, x:int, y:float);
B = load 'input2' as (x:int, y:double, z:chararray);
C = union onschema A, B;
describe C;
C: {w: chararray,x: int,y: double,z: chararray}
join和union之后alias的別名會變
7:Dump
? ???dump alias
用于在屏幕上顯示數據。
8:Order by
alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [PARALLEL n];
? ?? ?? ?A = order Table by id desc;
9:distinct
? ?? ?? ?A = distinct alias;
10:limit
? ?? ?? ?A = limit alias 10;
11:sample
SAMPLE alias size;
隨機抽取指定比例(0到1)的數據。
some = sample divs 0.1;
13:cross
alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];
將多個數據集中的數據按照字段名進行同值組合,形成笛卡爾積。
--cross.pig
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,date:chararray, open:float, high:float, low:float,
close:float, volume:int, adj_close:float);
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,date:chararray, dividends:float);
tonsodata = cross daily, divs parallel 10;
15:split
Syntax
SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression …] [, alias OTHERWISE];
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
DUMP Z;
(1,2,3)
(7,8,9)
16:store
? ?? ?? ?Store??… into … Using…
pig在別名維護上:
1、join
如e = join d by name,b by name;
? ? g = foreach e generate $0 as one:chararray, $1 as two:int, $2 as? ?? ?three:chararray,$3 asfour:int;
? ? 他生成的schemal:
? ?? ???e: {d::name: chararray,d::position: int,b::name: chararray,b::age: int}
g: {one: chararray,two: int,three: chararray,four: int}
2、group
? ?B = GROUP A BY age;
----------------------------------------------------------------------
| B? ???| group: int | A: bag({name: chararray,age: int,gpa: float}) |
----------------------------------------------------------------------
|? ?? ? | 18? ?? ?? ?| {(John, 18, 4.0), (Joe, 18, 3.8)}? ?? ?? ?? ? |
|? ?? ? | 20? ?? ?? ?| {(Bill, 20, 3.9)}? ?? ?? ?? ?? ?? ?? ?? ?? ???|
----------------------------------------------------------------------
(18,{(John,18,4.0F),(Joe,18,3.8F)})
pig udf自定義
pig支持嵌入user defined function,一個簡單的udf 繼承于evalFunc,通常用在filter,foreach中
----------------------------------------------------------------------------------------------------------------------------------------------------
如果我們知道nosql和傳統數據庫是有差別的,那么他們的都支持什么運算符和函數
?
?
4.6 Retional operators
首先編寫兩個數據文件A:
0,1,2
1,3,4
數據文件B:
0,5,2
1,7,8
運行pig:
xuqiang@ubuntu:~/hadoop/src/pig/pig-0.8.1/tutorial/pigtmp$ pig -x local
2011-06-05 18:46:54,039 [main] INFO??org.apache.pig.Main - Logging error messages to: /home/xuqiang/hadoop/src/pig/pig-0.8.1/tutorial/pigtmp/pig_1307324814030.log
2011-06-05 18:46:54,324 [main] INFO??org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>?
加載數據A:
grunt> a = load 'A' using PigStorage(',') as (a1:int, a2:int, a3:int);
加載數據B:
grunt> b = load 'B' using PigStorage(',') as (b1:int, b2:int, b3:int);
求a,b的并集:
grunt> c = union a, b;
grunt> dump c;? ??
(0,5,2)
(1,7,8)
(0,1,2)
(1,3,4)
將c分割為d和e,其中d的第一列數據值為0,e的第一列的數據為1($0表示數據集的第一列):
grunt> split c into d if $0 == 0, e if $0 == 1;
查看d:
grunt> dump d;? ?
(0,1,2)
(0,5,2)
查看e:
(1,3,4)
(1,7,8)
選擇c中的一部分數據:
grunt> f = filter c by $1 > 3;
查看數據f:
grunt> dump f;
(0,5,2)
(1,7,8)
對數據進行分組:
grunt> g = group c by $2;
查看g:
grunt> dump g;
(2,{(0,1,2),(0,5,2)})
(4,{(1,3,4)})
(8,{(1,7,8)})
當然也能夠將所有的元素集合到一起:
grunt> h = group c all;
grunt> dump h;??
(all,{(0,1,2),(1,3,4),(0,5,2),(1,7,8)})
查看h中元素個數:
grunt> i = foreach h generate COUNT($1);
查看元素個數:
grunt> dump i;
這里可能出現Could not resolve counter using imported: [, org.apache.pig.built in., org.apache.pig.impl.builtin. ]的情況,這是需要使用register命令來注冊pig對應的jar版本。
接下倆試一下jon操作:
grunt> j = join a by $2, b by $2;
該操作類似于sql中的連表查詢,這是的條件是$2 == $2。
取出c的第二列$1和$1 * $2,將這兩列保存在k中:
grunt> k = foreach c generate $1, $1 * $2;
查看k的內容:
grunt> dump k;?
(5,10)
(7,56)
(1,2)
(3,12)
pig是hadoop上層的衍生架構,與hive類似。對比hive(hive類似sql,是一種聲明式的語言),pig是一種過程語言,類似于存儲過程一步一步得進行數據轉化。
pig簡單操作
?? ? ?1.從文件導入數據
? ?? ?? ? 1)Mysql (Mysql需要先創建表).
? ?? ?? ?? ? CREATE TABLE TMP_TABLE(USER VARCHAR(32),AGE INT,IS_MALE BOOLEAN);
? ?? ?? ?? ? CREATE TABLE TMP_TABLE_2(AGE INT,OPTIONS VARCHAR(50));? ?-- 用于Join
? ?? ?? ?? ? LOAD DATA LOCAL INFILE '/tmp/data_file_1'??INTO TABLE TMP_TABLE ;
? ?? ?? ?? ? LOAD DATA LOCAL INFILE '/tmp/data_file_2'??INTO TABLE TMP_TABLE_2;
? ?? ?? ? 2)Pig
? ?? ?? ?? ???tmp_table = LOAD '/tmp/data_file_1' USING PigStorage('\t') AS (user:chararray, age:int,is_male:int);
? ?? ?? ?? ???tmp_table_2= LOAD '/tmp/data_file_2' USING PigStorage('\t') AS (age:int,options:chararray);
? ? ? ?2.查詢整張表
? ?? ?? ?1)Mysql
? ?? ?? ?? ?SELECT * FROM TMP_TABLE;
? ?? ?? ?2)Pig
? ?? ?? ?? ?DUMP tmp_table;
? ?? ??3. 查詢前50行
? ?? ?? ?1)Mysql
? ?? ?? ?? ?SELECT * FROM TMP_TABLE LIMIT 50;
? ?? ?? ?2)Pig
? ?? ?? ?? ? tmp_table_limit = LIMIT tmp_table 50;
? ?? ?? ?? ? DUMP tmp_table_limit;?
? ? 4.查詢某些列
? ?? ???1)Mysql
? ?? ?? ?? ?SELECT USER FROM TMP_TABLE;
? ?? ???2)Pig
? ?? ?? ?? ?tmp_table_user = FOREACH tmp_table GENERATE user;
? ?? ?? ?? ?DUMP tmp_table_user;
? ?? 5. 給列取別名
? ?? ???1)Mysql
? ?? ?? ???SELECT USER AS USER_NAME,AGE AS USER_AGE FROM TMP_TABLE;
? ?? ???2)Pig
? ?? ?? ???tmp_table_column_alias = FOREACH tmp_table GENERATE user AS user_name,age AS user_age;
? ?? ?? ???DUMP tmp_table_column_alias;?
? ?? 6.排序
? ?? ???1)Mysql
? ?? ?? ???SELECT * FROM TMP_TABLE ORDER BY AGE;
? ?? ???2)Pig
? ?? ?? ?? ?tmp_table_order = ORDER tmp_table BY age ASC;
? ?? ?? ?? ?DUMP tmp_table_order;
?? ? ?7.條件查詢
? ?? ???1)Mysql
? ?? ?? ?? ?SELECT * FROM TMP_TABLE WHERE AGE>20;
? ?? ???2) Pig
? ?? ?? ?? ?tmp_table_where = FILTER tmp_table by age > 20;
? ?? ?? ?? ?DUMP tmp_table_where;
? ? ??8.內連接Inner Join
? ?? ???1)Mysql
? ?? ?? ???SELECT * FROM TMP_TABLE A JOIN TMP_TABLE_2 B ON A.AGE=B.AGE;
? ?? ???2)Pig
? ?? ?? ?? ?tmp_table_inner_join = JOIN tmp_table BY age,tmp_table_2 BY age;
? ?? ?? ?? ?DUMP tmp_table_inner_join;
? ? 9.左連接Left??Join
? ?? ? 1)Mysql
? ?? ?? ???SELECT * FROM TMP_TABLE A LEFT JOIN TMP_TABLE_2 B ON A.AGE=B.AGE;
? ?? ? 2)Pig
? ?? ?? ? tmp_table_left_join = JOIN tmp_table BY age LEFT OUTER,tmp_table_2 BY age;
? ?? ?? ? DUMP tmp_table_left_join;
? ? ?10.右連接Right Join
? ?? ?? ?1)Mysql
? ?? ?? ?? ?SELECT * FROM TMP_TABLE A RIGHT JOIN TMP_TABLE_2 B ON A.AGE=B.AGE;
? ?? ?? ?2)Pig
? ?? ?? ?? ?tmp_table_right_join = JOIN tmp_table BY age RIGHT OUTER,tmp_table_2 BY age;
? ?? ?? ?? ?DUMP tmp_table_right_join;
?? ??11.全連接Full Join
? ?? ?? ?1)Mysql
? ?? ?? ?? ?SELECT * FROM TMP_TABLE A??JOIN TMP_TABLE_2 B ON A.AGE=B.AGE
? ?? ?? ?? ?? ? UNION SELECT * FROM TMP_TABLE A LEFT JOIN TMP_TABLE_2 B ON A.AGE=B.AGE
? ?? ?? ?? ?? ? UNION SELECT * FROM TMP_TABLE A RIGHT JOIN TMP_TABLE_2 B ON A.AGE=B.AGE;
? ?? ?? ?2)Pig
? ?? ?? ?? ?tmp_table_full_join = JOIN tmp_table BY age FULL OUTER,tmp_table_2 BY age;
? ?? ?? ?? ?DUMP tmp_table_full_join;
? ? ??12.同時對多張表交叉查詢
? ?? ?? ? 1)Mysql
? ?? ?? ?? ? SELECT * FROM TMP_TABLE,TMP_TABLE_2;
? ?? ?? ? 2)Pig
? ?? ?? ?? ? tmp_table_cross = CROSS tmp_table,tmp_table_2;
? ?? ?? ?? ? DUMP tmp_table_cross;
? ? ??13.分組GROUP BY
? ?? ?? ?1)Mysql
? ?? ?? ?? ?SELECT * FROM TMP_TABLE GROUP BY IS_MALE;
? ?? ?? ?2)Pig
? ?? ?? ?? ?tmp_table_group = GROUP tmp_table BY is_male;
? ?? ?? ?? ?DUMP tmp_table_group;
? ??? ?14.分組并統計
? ?? ?? ???1)Mysql
? ?? ?? ?? ? SELECT IS_MALE,COUNT(*) FROM TMP_TABLE GROUP BY IS_MALE;
? ?? ?? ???2)Pig
? ?? ?? ?? ???tmp_table_group_count = GROUP tmp_table BY is_male;
? ?? ?? ?? ???tmp_table_group_count = FOREACH tmp_table_group_count GENERATE group,COUNT($1);
? ?? ?? ?? ???DUMP tmp_table_group_count;
? ??? ?15.查詢去重DISTINCT
? ?? ?? ???1)MYSQL
? ?? ?? ?? ???SELECT DISTINCT IS_MALE FROM TMP_TABLE;
? ?? ?? ???2)Pig
? ?? ?? ?? ???tmp_table_distinct = FOREACH tmp_table GENERATE is_male;
? ?? ?? ?? ???tmp_table_distinct = DISTINCT tmp_table_distinct;
? ?? ?? ?? ???DUMP??tmp_table_distinct;
---------------------------------------------------------------------------------------------------------------------------------------------------
上面簡單操作,下面需要進一步了解pig支持的內容:
pig支持數據類型
double > float > long > int > bytearray
tuple|bag|map|chararray > bytearray
double float long int chararray bytearray都相當于pig的基本類型
tuple相當于數組 ,但是可以類型不一,舉例('dirkzhang','dallas',41)
Bag相當于tuple的一個集合,舉例{('dirk',41),('kedde',2),('terre',31)},在group的時候會生成bag
Map相當于哈希表,key為chararray,value為任意類型,例如['name'#dirk,'age'#36,'num'#41
nulls 表示的不只是數據不存在,他更表示數據是unkown
pig latin語法
1:load
LOAD 'data' [USING function] [AS schema];
? ?? ? 例如:
? ?? ?load = LOAD 'sql://{SELECT MONTH_ID,DAY_ID,PROV_ID FROM zb_d_bidwmb05009_010}'? ? USING com.xxxx.dataplatform.bbdp.geniuspig.VerticaLoader('oracle','192.168.6.5','dev','1522','vbap','vbap','1') AS (MONTH_ID:chararray,DAY_ID:chararray,PROV_ID:chararray);
Table = load ‘url’ as (id,name…..);? ? //table和load之間除了等號外 還必須有個空格 不然會出錯,url一定要帶引號,且只能是單引號。
2:filter
? ?? ? alias = FILTER alias BY expression;
? ?? ? Table = filter Table1 by + A; //A可以是 id > 10;not name matches ‘’,is not null 等,可以用and??和or連接各條件
? ?? ? 例如:
? ?? ? filter = filter load20 by ( MONTH_ID == '1210' and??DAY_ID == '18' and??PROV_ID == '010' );
3:group
alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
? ?? ?? ? pig的分組,不僅是數據上的分組,在數據的schema形式上也進行分組為groupcolumn:bag
? ?? ?? ?Table3 = group Table2 by id;也可以Table3 = group Table2 by (id,name);括號必須加
? ?? ?? ?可以使用ALL實現對所有字段的分組
4:foreach
alias = FOREACH alias GENERATE expression [AS schema] [expression [AS schema]….];
alias = FOREACH nested_alias {
alias = {nested_op | nested_exp}; [{alias = {nested_op | nested_exp}; …]
GENERATE expression [AS schema] [expression [AS schema]….]
};
一般跟generate一塊使用
? ?? ?? ?Table = foreach Table generate (id,name);括號可加可不加。
avg = foreach Table generate group, AVG(age);??MAX ,MIN..
在進行數據過濾時,建議盡早使用foreach generate將多余的數據過濾掉,減少數據交換
5:join
Inner??join Syntax
alias = JOIN alias BY {expression|'('expression [, expression …]')'} (, alias BY {expression|'('expression [, expression …]')'} …) [USING 'replicated' | 'skewed' | 'merge' | 'merge-sparse'] [PARTITION BY partitioner] [PARALLEL n];
Outer join Syntax
alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column [USING 'replicated' | 'skewed' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
? ???join/left join / right join
daily = load 'A' as (id,name, sex);
divs??= load 'B' as (id,name, sex);
join
jnd? ?= join daily by (id, name), divs by (id, name);? ?? ??
left join
jnd? ?= join daily by (id, name) left outer, divs by (id, name);
也可以同時多個變量,但只用于inner join
A = load 'input1' as (x, y);
B = load 'input2' as (u, v);
C = load 'input3' as (e, f);
alpha = join A by x, B by u, C by e;
6: union
alias = UNION [ONSCHEMA] alias, alias [, alias …];
union 相當與sql中的union,但與sql不通的是pig中的union可以針對兩個不同模式的變量:如果兩個變量模式相同,那么union后的變量模式與 變量的模式一樣;如果一個變量的模式可以由另一各變量的模式強制類型轉換,那么union后的變量模式與轉換后的變量模式相同;否則,union后的變量 沒有模式。
A = load 'input1' as (x:int, y:float);
B = load 'input2' as (x:int, y:float);
C = union A, B;
describe C;
C: {x: int,y: float}
A = load 'input1' as (x:double, y:float);
B = load 'input2' as (x:int, y:double);
C = union A, B;
describe C;
C: {x: double,y: double}
A = load 'input1' as (x:int, y:float);
B = load 'input2' as (x:int, y:chararray);
C = union A, B;
describe C;
Schema for C unknown.
注意:在pig 1.0中 執行不了最后一種union。
如果需要對兩個具有不通列名的變量union的話,可以使用onschema關鍵字
A = load 'input1' as (w: chararray, x:int, y:float);
B = load 'input2' as (x:int, y:double, z:chararray);
C = union onschema A, B;
describe C;
C: {w: chararray,x: int,y: double,z: chararray}
join和union之后alias的別名會變
7:Dump
? ???dump alias
用于在屏幕上顯示數據。
8:Order by
alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [PARALLEL n];
? ?? ?? ?A = order Table by id desc;
9:distinct
? ?? ?? ?A = distinct alias;
10:limit
? ?? ?? ?A = limit alias 10;
11:sample
SAMPLE alias size;
隨機抽取指定比例(0到1)的數據。
some = sample divs 0.1;
13:cross
alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];
將多個數據集中的數據按照字段名進行同值組合,形成笛卡爾積。
--cross.pig
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,date:chararray, open:float, high:float, low:float,
close:float, volume:int, adj_close:float);
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,date:chararray, dividends:float);
tonsodata = cross daily, divs parallel 10;
15:split
Syntax
SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression …] [, alias OTHERWISE];
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
DUMP Z;
(1,2,3)
(7,8,9)
16:store
? ?? ?? ?Store??… into … Using…
pig在別名維護上:
1、join
如e = join d by name,b by name;
? ? g = foreach e generate $0 as one:chararray, $1 as two:int, $2 as? ?? ?three:chararray,$3 asfour:int;
? ? 他生成的schemal:
? ?? ???e: {d::name: chararray,d::position: int,b::name: chararray,b::age: int}
g: {one: chararray,two: int,three: chararray,four: int}
2、group
? ?B = GROUP A BY age;
----------------------------------------------------------------------
| B? ???| group: int | A: bag({name: chararray,age: int,gpa: float}) |
----------------------------------------------------------------------
|? ?? ? | 18? ?? ?? ?| {(John, 18, 4.0), (Joe, 18, 3.8)}? ?? ?? ?? ? |
|? ?? ? | 20? ?? ?? ?| {(Bill, 20, 3.9)}? ?? ?? ?? ?? ?? ?? ?? ?? ???|
----------------------------------------------------------------------
(18,{(John,18,4.0F),(Joe,18,3.8F)})
pig udf自定義
pig支持嵌入user defined function,一個簡單的udf 繼承于evalFunc,通常用在filter,foreach中
----------------------------------------------------------------------------------------------------------------------------------------------------
如果我們知道nosql和傳統數據庫是有差別的,那么他們的都支持什么運算符和函數
?
?
4.6 Retional operators
首先編寫兩個數據文件A:
0,1,2
1,3,4
數據文件B:
0,5,2
1,7,8
運行pig:
xuqiang@ubuntu:~/hadoop/src/pig/pig-0.8.1/tutorial/pigtmp$ pig -x local
2011-06-05 18:46:54,039 [main] INFO??org.apache.pig.Main - Logging error messages to: /home/xuqiang/hadoop/src/pig/pig-0.8.1/tutorial/pigtmp/pig_1307324814030.log
2011-06-05 18:46:54,324 [main] INFO??org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>?
加載數據A:
grunt> a = load 'A' using PigStorage(',') as (a1:int, a2:int, a3:int);
加載數據B:
grunt> b = load 'B' using PigStorage(',') as (b1:int, b2:int, b3:int);
求a,b的并集:
grunt> c = union a, b;
grunt> dump c;? ??
(0,5,2)
(1,7,8)
(0,1,2)
(1,3,4)
將c分割為d和e,其中d的第一列數據值為0,e的第一列的數據為1($0表示數據集的第一列):
grunt> split c into d if $0 == 0, e if $0 == 1;
查看d:
grunt> dump d;? ?
(0,1,2)
(0,5,2)
查看e:
(1,3,4)
(1,7,8)
選擇c中的一部分數據:
grunt> f = filter c by $1 > 3;
查看數據f:
grunt> dump f;
(0,5,2)
(1,7,8)
對數據進行分組:
grunt> g = group c by $2;
查看g:
grunt> dump g;
(2,{(0,1,2),(0,5,2)})
(4,{(1,3,4)})
(8,{(1,7,8)})
當然也能夠將所有的元素集合到一起:
grunt> h = group c all;
grunt> dump h;??
(all,{(0,1,2),(1,3,4),(0,5,2),(1,7,8)})
查看h中元素個數:
grunt> i = foreach h generate COUNT($1);
查看元素個數:
grunt> dump i;
這里可能出現Could not resolve counter using imported: [, org.apache.pig.built in., org.apache.pig.impl.builtin. ]的情況,這是需要使用register命令來注冊pig對應的jar版本。
接下倆試一下jon操作:
grunt> j = join a by $2, b by $2;
該操作類似于sql中的連表查詢,這是的條件是$2 == $2。
取出c的第二列$1和$1 * $2,將這兩列保存在k中:
grunt> k = foreach c generate $1, $1 * $2;
查看k的內容:
grunt> dump k;?
(5,10)
(7,56)
(1,2)
(3,12)
總結
以上是生活随笔為你收集整理的pig的基本语法以及高级语法的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 《高级无线网络—4G技术》——2.2 码
- 下一篇: python开发视频播放器_Python