Load Operator
-------------
The load operator loads data from a file into a relation.

Create a tab-delimited file samp1 and copy it into the piglab directory in HDFS:

[cloudera@quickstart ~]$ cat > samp1
100	200	300
400	500	900
100	120	23
123	900	800
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal samp1 piglab
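As a quick sanity check before loading, the uploaded file can be read back from HDFS; the output is simply the file just created:

[cloudera@quickstart ~]$ hadoop fs -cat piglab/samp1
100	200	300
400	500	900
100	120	23
123	900	800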
grunt> s1 = load 'piglab/samp1' using PigStorage('\t')
>>     as (a:int, b:int, c:int);
grunt> s2 = load 'piglab/samp1' using PigStorage()
>>     as (a:int, b:int, c:int);
grunt> s3 = load 'piglab/samp1'
>>     as (a:int, b:int, c:int);
grunt> dump s3
(100,200,300)
(400,500,900)
(100,120,23)
(123,900,800)
grunt> dump s2
(100,200,300)
(400,500,900)
(100,120,23)
(123,900,800)
grunt> dump s1
(100,200,300)
(400,500,900)
(100,120,23)
(123,900,800)

The outputs of s1, s2, and s3 are identical. In s2, PigStorage() falls back to its default delimiter, tab. In s3, no load function is named at all; of the built-in load functions (PigStorage() and BinStorage()), PigStorage() is applied by default, again with the tab delimiter. So s1, s2, and s3 mean exactly the same thing.
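The as clause is optional as well. A minimal sketch (the relation name s6 is chosen just for illustration): without a schema the fields carry no names or declared types, describe reports the schema as unknown, and fields would have to be addressed positionally as $0, $1, $2.

grunt> s6 = load 'piglab/samp1';
grunt> dump s6
(100,200,300)
(400,500,900)
(100,120,23)
(123,900,800)
grunt> describe s6
Schema for s6 unknown.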
grunt> s4 = load 'piglab/samp1' as (a:int, b:int, c:int, d:int);

The first 3 fields of the file are mapped to a, b, and c. There is no 4th field in the file, so d becomes null (shown as the empty value after the last comma):

grunt> dump s4
(100,200,300,)
(400,500,900,)
(100,120,23,)
(123,900,800,)

-- The following skips trailing fields:
grunt> s5 = load 'piglab/samp1'
>>     as (a:int, b:int)
>>     ;
grunt> illustrate s5
--------------------------------
| s5     | a:int    | b:int    |
--------------------------------
|        | 100      | 120      |
--------------------------------
-- But to skip middle fields, take the help of the foreach operator [covered later]; see the sketch below.
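As a preview of foreach, a minimal sketch that keeps only the first and third columns of s3 and skips the middle field b (the relation name t is chosen just for this example):

grunt> t = foreach s3 generate a, c;
grunt> dump t
(100,300)
(400,900)
(100,23)
(123,800)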
Loading non-tab-delimited files into a Pig relation
---------------------------------------------------

[cloudera@quickstart ~]$ cat > samp2
100,10,1
2,200,20
3,30,300
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal samp2 piglab

grunt> ss1 = load 'piglab/samp2' as (a:int, b:int, c:int);
grunt> dump ss1
(,,)
(,,)
(,,)

Here load expects the tab delimiter, but the file contains no tabs, so each entire line becomes a single string field. That string is mapped to the first field of the relation, a, but it cannot be cast to int, so a becomes null. The file has no 2nd or 3rd fields, which is why b and c also become null.

grunt> ss2 = load 'piglab/samp2' as (a:chararray, b:int, c:int);
grunt> dump ss2
(100,10,1,,)
(2,200,20,,)
(3,30,300,,)

With a declared as chararray, the whole line survives as the single value of a; the two trailing commas in each output tuple are the null b and c.

grunt> ss3 = load 'piglab/samp2' using PigStorage(',') as (a:int, b:int, c:int);
grunt> dump ss3
(100,10,1)
(2,200,20)
(3,30,300)

grunt> cat piglab/emp
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11

grunt> emp = load 'piglab/emp'
>>     using PigStorage(',')
>>     as (id:int, name:chararray, sal:int, sex:chararray, dno:int);
grunt> illustrate emp
----------------------------------------------------------------------------------------
| emp   | id:int   | name:chararray   | sal:int   | sex:chararray   | dno:int   |
----------------------------------------------------------------------------------------
|       | 104      | dd               | 90000     | f                | 13       |
----------------------------------------------------------------------------------------
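PigStorage plays the same role in reverse: the delimiter argument also applies when a relation is written back to HDFS with the store operator. A minimal sketch, assuming the output directory piglab/emp_tab does not already exist (store fails if it does):

grunt> store emp into 'piglab/emp_tab' using PigStorage('\t');
grunt> cat piglab/emp_tab
101	aaaa	40000	m	11
102	bbbbbb	50000	f	12
...

store writes one line per tuple, with fields joined by the given delimiter.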