Pig: Load Operator



Load Operator:
--------------
 The load operator loads data from a file into a relation.
 [cloudera@quickstart ~]$ cat > samp1
100 200 300
400 500 900
100 120 23
123 900 800
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal samp1  piglab
[cloudera@quickstart ~]$

grunt> s1 = load 'piglab/samp1' using PigStorage('\t')
>>          as (a:int, b:int, c:int);
grunt> s2 = load 'piglab/samp1' using PigStorage()
>>          as (a:int, b:int, c:int);
grunt> s3 = load 'piglab/samp1'
>>          as (a:int, b:int, c:int);
grunt> dump s3
(100,200,300)
(400,500,900)
(100,120,23)
(123,900,800)
grunt> dump s2
(100,200,300)
(400,500,900)
(100,120,23)
(123,900,800)
grunt> dump s1
(100,200,300)
(400,500,900)
(100,120,23)
(123,900,800)
The outputs of s1, s2, and s3 are the same.
 In s2, PigStorage() is used with its default \t delimiter.
 In s3, no load function is named; between PigStorage() and BinStorage(),
   PigStorage() is applied by default, again with the \t delimiter.

 So s1, s2, and s3 all mean the same thing.
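To double-check that the three relations carry the same schema, describe can be used; the output below is what a typical Pig install prints, though formatting may vary slightly by version.
grunt> describe s1
s1: {a: int,b: int,c: int}
grunt> describe s3
s3: {a: int,b: int,c: int}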
grunt> s4 = load 'piglab/samp1'
>>      as (a:int, b:int, c:int, d:int);
 The first 3 fields of the file are mapped to a, b, and c;
   there is no 4th field in the file,
  so d becomes null.
grunt> dump s4
(100,200,300,)
(400,500,900,)
(100,120,23,)
(123,900,800,)
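As a quick check that d really is null (and not, say, an empty string), a filter on 'd is null' should keep every row; the alias s4n below is just illustrative.
grunt> s4n = filter s4 by d is null;
grunt> dump s4n
(100,200,300,)
(400,500,900,)
(100,120,23,)
(123,900,800,)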
-- the following skips trailing fields:
grunt> s5 = load 'piglab/samp1'
>>    as (a:int, b:int)
>> ;
grunt> illustrate s5
--------------------------------
| s5     | a:int    | b:int    |
--------------------------------
|        | 100      | 120      |
--------------------------------
-- but to skip middle fields, take the help of the foreach operator [covered later]; a small preview follows.
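For instance, dropping the middle column b of s3 (the alias s6 is illustrative):
grunt> s6 = foreach s3 generate a, c;
grunt> dump s6
(100,300)
(400,900)
(100,23)
(123,800)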
Loading non-tab-delimited files into a Pig relation
---------------------------------------------------

[cloudera@quickstart ~]$ cat > samp2
100,10,1
2,200,20
3,30,300
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal samp2 piglab
grunt> ss1 = load 'piglab/samp2' as (a:int, b:int, c:int);
grunt> dump ss1
(,,)
(,,)
(,,)
 Here load expects the \t delimiter,
  but the file has no tabs,
   so the entire line becomes one field, which is a string.
  That string has to be mapped to the relation's first field, a, but as an int;
  the cast fails, so a became null. The file has no 2nd or 3rd field, which is why b and c also became null.
grunt> ss2 = load 'piglab/samp2'
>>       as (a:chararray, b:int, c:int);
grunt> dump ss2
(100,10,1,,)
(2,200,20,,)
(3,30,300,,)
 Here the whole line fits into the chararray field a, while b and c stay null;
 the commas inside a make each tuple appear to have five fields.
grunt> ss3 = load 'piglab/samp2'
>>      using PigStorage(',')
>>      as (a:int, b:int, c:int);
grunt> dump ss3
(100,10,1)
(2,200,20)
(3,30,300)
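An alternative when the delimiter is not fixed is to load each line as a single chararray and split it with the builtin STRSPLIT; this is only a sketch, and the aliases raw and parts are illustrative. STRSPLIT returns a tuple, FLATTEN turns that tuple into fields, and the resulting fields are chararrays that would still need explicit casts such as (int)a.
grunt> raw = load 'piglab/samp2' as (line:chararray);
grunt> parts = foreach raw generate FLATTEN(STRSPLIT(line, ',')) as (a:chararray, b:chararray, c:chararray);
grunt> dump parts
(100,10,1)
(2,200,20)
(3,30,300)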
grunt> cat piglab/emp
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
grunt> emp = load 'piglab/emp'
>>     using PigStorage(',')
>>    as (id:int, name:chararray, sal:int, sex:chararray, dno:int);
grunt> illustrate emp
----------------------------------------------------------------------------------------
| emp     | id:int    | name:chararray    | sal:int    | sex:chararray    | dno:int    |
----------------------------------------------------------------------------------------
|         | 104       | dd                | 90000      | f                | 13         |
----------------------------------------------------------------------------------------
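The schema that load attached to emp can be confirmed with describe; the exact formatting may differ slightly between Pig versions.
grunt> describe emp
emp: {id: int,name: chararray,sal: int,sex: chararray,dno: int}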