Pig : Data types and Operators
Data types:
simple data types:
---------------------
int     --> 32-bit integer.
long    --> 64-bit integer.
float   --> 32-bit float [ not available in latest version ].
double  --> 64-bit float.
boolean --> true/false.
-------------------------
complex data types:
-------------------
tuple
bag

example schema of an outer bag 'profiles':
name     : chararray   [ before version 0.7 the max length was 2 gb;
                         later versions allow 4 gb ]
                       ----> variable length.
age      : int
sal      : double
sex      : chararray
wife     : tuple       e.g. (rani,24,hyd)
children : bag         e.g. {(sony,4,m),(tony,2,f)}

sample tuples of the outer bag 'profiles':
-------------------------------
(Ravi, 26, M, (rani,24,hyd), {(sony,4,m),(tony,2,f)})
:
:
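a minimal load sketch for such a schema (the file name 'profiles.txt' and its field layout are assumptions for illustration; it only shows pig's as-clause syntax for tuple and bag fields):
grunt> profiles = load 'profiles.txt'
                  as (name:chararray, age:int, sal:double, sex:chararray,
                      wife:tuple(wname:chararray, wage:int, city:chararray),
                      children:bag{child:tuple(cname:chararray, cage:int, csex:chararray)});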
------------------------------------------
pig latin statement:
-----------------------
structure: <relation> = <operator> <expressions> ;
the expressions change from one operator to another.
1) load
2) describe
3) dump
4) store
5) foreach
6) filter
7) limit
8) sample
9) group
10) cogroup
11) union
12) join
13) left outer join
14) right outer join
15) full outer join
16) cross
17) pig
18) exec
19) run
20) illustrate
load:
------
to load data from a file into a pig relation. [ logical load ]
A = load 'file1' using PigStorage(',')
as (a:int, b:int, c:int);
here A is the alias of the relation.
standard: use capital letters for relation names.
the input file can be an hdfs or local file;
that depends on the start-up mode of pig.
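for reference, the start-up mode is chosen when launching pig; these are the standard launch options (shown here only as a reminder):
$ pig -x local        # local mode : paths resolve on the local file system
$ pig -x mapreduce    # mapreduce mode (the default) : paths resolve on hdfs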
------------------
describe:
---------
--> to get the schema of a relation.
grunt> describe emp;
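for example, assuming emp was loaded with a schema like (name:chararray, age:int, sal:double, sex:chararray), the output looks roughly like:
grunt> describe emp;
emp: {name: chararray,age: int,sal: double,sex: chararray}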
dump:
-----
to execute a data flow.
A = load 'file1' ......
B = foreach A generate ...
C = foreach B generate ...
D = group C by ....
E = foreach D generate ....
grunt> dump E;
---> the flow is executed starting from the root relation, and the output is written to the console.
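a minimal end-to-end sketch (the file name 'emp.txt' and its comma-separated layout are assumptions):
grunt> emp = load 'emp.txt' using PigStorage(',')
             as (name:chararray, age:int, sal:double, sex:chararray);
grunt> dump emp;     -- executes the whole flow and prints the tuples to the console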
store:
------
to execute a data flow and write the output into a file.
the file can be local/hdfs, depending on the start-up mode.
grunt> store E into '/user/cloudera/myresults';
---> myresults will be the output directory.
the files inside are prefixed with
'part-m-' --> output written by a mapper, or
'part-r-' --> output written by a reducer.
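store can also take a storage function with an explicit delimiter; a sketch (the output path is just an example):
grunt> store E into '/user/cloudera/myresults_tab' using PigStorage('\t');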
limit:
------------
---> to get the first n tuples.
grunt> X = limit A 3;
--> to get the last n tuples (there is no direct operator),
two solutions:
i) udf
ii) joins.
-----------------------------------
Sample:
-------
to get random samples.
two types of sampling techniques:
i) sampling without replacement
---> different sample sets don't have
common elements (tuples).
solution: Hive bucketing.
ii) sampling with replacement
---> different sample sets can have
common tuples.
solution: Pig sampling.
grunt> s1 = sample products 0.05;
grunt> s2 = sample products 0.05;
grunt> s3 = sample products 0.05;
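a rough way to check how big one sample turned out (uses group ... all and the built-in COUNT; s1 comes from the lines above):
grunt> g   = group s1 all;
grunt> cnt = foreach g generate COUNT(s1);
grunt> dump cnt;     -- roughly 5% of the tuple count of products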
filter:
to filter tuples based on a given criterion.
males = filter emp by (sex == 'm');
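filter conditions can be combined with and / or / not; a sketch using the assumed emp fields:
grunt> rich_males = filter emp by (sex == 'm') and (sal > 50000.0);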
3 ways to create subsets:
i)limit ii) sample iii)filter
foreach:
--------
to process each tuple of a relation.
i) to filter fields.
ii) to copy data from one relation to another.
iii) to change field orders
iv) to rename the fields.
v) changing field data types.
vi) performing transformations with given expressions.
vii) conditional transformations (see the combined sketch below).
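a combined sketch of several of these uses on the assumed emp relation (field names as in the earlier examples):
grunt> B = foreach emp generate
              name as emp_name,                            -- rename a field
              (double)age as age_d,                        -- change the data type
              sal * 1.1 as new_sal,                        -- transformation with an expression
              (sex == 'm' ? 'male' : 'female') as gender;  -- conditional transformation (bincond)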
ETL:
extract, transform, load.
extracting from databases,
performing transformations,
loading into target systems.
---> the above is a bad approach for big data.
ELT is recommended for big data:
E --> extract from rdbms, using sqoop.
L --> load into hdfs.
T --> transform using Pig/Hive/MR/Spark.