I am having difficulty loading data with an Apache Pig script:
cat data15.txt
1,(2,3)
2,(3,4)
grunt>a = load 'nikhil/data15.txt' using PigStorage(',') as (x:int, y:tuple(y1:int,y2:int));
grunt>dump a;
(1,)
(2,)
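I know it's too late to answer this.
The problem is that the tuple and the outer fields use the same delimiter, ','. PigStorage splits each line on every comma, including the one inside the tuple, so the second field arrives as '(2' and Pig fails to convert it to a tuple, leaving it null.
You need to change the delimiter between the top-level fields so that it differs from the one inside the tuple. For example, with input like this: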
1:(5,7,7)
3:(7,9,4)
5:(5,9,7)
Then run the Pig script as:
A = load 'file.txt' using PigStorage(':') as (t1:int,t2:tuple(x:int,y:int,z:int));
dump A;
The output is:
(1,(5,7,7))
(3,(7,9,4))
(5,(5,9,7))
You can change the delimiter in the input file with a sed command and then load the file.
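If sed is not convenient, the same preprocessing can be sketched in Python. This is only an illustration, assuming that the first comma on each line is the top-level field separator (true for the sample data above, not in general); the helper name retag_first_delimiter is made up for this example.

```python
def retag_first_delimiter(line, old=",", new=":"):
    """Replace only the first occurrence of `old`, leaving the commas inside the tuple alone."""
    return line.replace(old, new, 1)


# Sample records from the question (contents of data15.txt).
lines = ["1,(2,3)", "2,(3,4)"]

# After conversion each record can be loaded with PigStorage(':').
converted = [retag_first_delimiter(line) for line in lines]

for record in converted:
    print(record)
```

The converted lines can then be written back to a file and loaded with PigStorage(':') as shown in the answer.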
How do I solve the issue of extra commas or entries in the tuple?
ab = load "/path/file1.txt" USING PigStorage(',') AS (id1:chararray, id2:chararray, dt:chararray, qty:int);
Current output:-
(F1,S9,12/09/2011,2,,,)
Expected Output:-
(F1,S9,12/09/2011,2)
Should I change the text in my file.txt, or something else?
Write the path between single quotes ('') in the LOAD statement.
Example:
ab = load '/path/file1.txt' USING PigStorage(',') AS (id1:chararray, id2:chararray, dt:chararray, qty:int);
In Pig, when we load a CSV file with a LOAD statement without specifying a schema and with the default PigStorage delimiter ('\t'), what happens? Will the load work, and can we dump the data? Or will it throw an error because the file uses ',' while PigStorage expects '\t'? Please advise.
When you load a CSV file without defining a schema using PigStorage('\t'), there are no tabs in any line of the input file, so each whole line is treated as a single field. You will not be able to access the individual values in the line.
Example:
Input file:
john,smith,nyu,NY
jim,young,osu,OH
robert,cernera,mu,NJ
a = LOAD 'input' USING PigStorage('\t');
dump a;
OUTPUT:
(john,smith,nyu,NY)
(jim,young,osu,OH)
(robert,cernera,mu,NJ)
b = foreach a generate $0, $1, $2;
dump b;
(john,smith,nyu,NY,,)
(jim,young,osu,OH,,)
(robert,cernera,mu,NJ,,)
Ideally, b should have been:
(john,smith,nyu)
(jim,young,osu)
(robert,cernera,mu)
if the delimiter were a comma. But since the delimiter was a tab and no tab exists in the input records, the whole line was treated as one field. Pig does not complain if a field is null; it just outputs nothing for a null. Hence you see only the trailing commas when you dump b.
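The behaviour described above can be mimicked outside Pig. The following Python sketch is not Pig code and makes simplifying assumptions: a line is split on a single delimiter, and a position that does not exist is treated as null and printed as nothing. The function names are invented for the illustration.

```python
def split_fields(line, delimiter):
    """Split a record on a single-character delimiter, as a plain-text loader would."""
    return line.split(delimiter)


def project(fields, positions):
    """Pick fields by position; a missing position becomes None (standing in for null)."""
    return [fields[i] if i < len(fields) else None for i in positions]


def render(row):
    """Format a row the way dump prints it, with nulls shown as empty."""
    return "(" + ",".join("" if value is None else value for value in row) + ")"


line = "john,smith,nyu,NY"

# Tab delimiter: no tab in the line, so the whole line is field 0 and fields 1 and 2 are null.
tab_row = render(project(split_fields(line, "\t"), [0, 1, 2]))

# Comma delimiter: the first three values are picked out as intended.
comma_row = render(project(split_fields(line, ","), [0, 1, 2]))

print(tab_row)
print(comma_row)
```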
Hope that was useful.
I have the data like this:
$ cat samp.txt
Ramesh,[city#Bangalore],123
Arun,[city#Anantapur],345
Pranith,[city#US],456
I have written the following pig query:
A = load 'samp.txt' using PigStorage(',')
as(name:chararray,addr:map[chararray,chararray],empno:int);
When I execute the above code in pig I am getting the following error:
error: mismatched input ',' expecting RIGHT_BRACKET Details at logfile: /home/training/pig_1471586597209.log
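Can anyone help me resolve this error?
A = load 'pdemo/samp' using PigStorage(',') as (name:chararray,add:map[],empno:int);
Now it will work. Map keys in Pig are always chararray, so the schema takes at most the value type: map[] or map[chararray]. map[chararray,chararray] is not valid syntax, which is why the parser complains about the ','.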
Hi everyone, I have a problem loading data with Apache Pig. The file format is like this:
"1","2","xx,yy","a,sd","3"
So I want to load it using the multi-character delimiter "," (two double quotes with a comma between them), like this:
A = LOAD 'file.csv' USING PigStorage('","') AS (f1,f2,f3,f4,f5);
But PigStorage doesn't accept the multi-character delimiter ",". How can I do it? Thank you very much!
PigStorage takes a single character as its delimiter. You will have to use the loader functions from PiggyBank. Download piggybank.jar, save it in the same folder as your Pig script, and register the jar in the script.
REGISTER piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD 'test1.txt' USING CSVLoader() AS (f1:int,f2:int,f3:chararray,f4:chararray,f5:int);
B = FOREACH A GENERATE f1,f2,f3,f4,f5;
DUMP B;
An alternative is to load each record as a single line and then use STRSPLIT:
A = LOAD 'test1.txt' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '","'));
DUMP B;
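One thing to watch with the split approach: splitting on '","' leaves the opening quote on the first field and the closing quote on the last. The Python sketch below only mimics the split to make this visible; it is not Pig code, and it assumes every record starts and ends with a double quote, as in the sample line.

```python
line = '"1","2","xx,yy","a,sd","3"'

# Plain split on the three-character delimiter: the outer quotes survive.
parts = line.split('","')

# Drop the outer quotes first, then split, to get clean values.
if line.startswith('"') and line.endswith('"'):
    inner = line[1:-1]
else:
    inner = line
cleaned = inner.split('","')

print(parts)
print(cleaned)
```

In Pig, the leftover quotes would need a further cleanup step on the first and last fields after the split.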
Can I create variables in Pig and concatenate them when one of them is dynamic, like the current time?
I need a file name to be created based on the current time.
%declare FILE_PREFIX file;
%declare FILE_POSTFIX date +%Y-%m-%d-%s;
Can I do something like:
file_name = '$FILE_PREFIX$FILE_POSTFIX';
In my experience, the following works.
Pass the file name prefix and the date to the Pig script as parameters on the command line:
pig -f myscript.pig --param file="india_" --param nw=$(date +"%Y-%m-%d-%s")
In the Pig script:
%declare FILE_PREFIX '$file$nw';
A = load '/user/root/$FILE_PREFIX' USING PigStorage(',') as (id1, name1);
dump A;
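If the script is launched from a wrapper program rather than directly from the shell, the same parameters can be assembled there. The Python sketch below only builds the argument list and does not run Pig; the function names are made up for the example, and the script name, prefix, and flags are copied from the command above.

```python
from datetime import datetime, timezone


def build_timestamp(now):
    """Date plus epoch seconds, similar in shape to `date +%Y-%m-%d-%s`."""
    return f"{now.strftime('%Y-%m-%d')}-{int(now.timestamp())}"


def build_pig_command(script, prefix, now):
    """Assemble the argument list for the pig invocation shown in the answer."""
    return [
        "pig", "-f", script,
        "--param", f"file={prefix}",
        "--param", f"nw={build_timestamp(now)}",
    ]


# A fixed, timezone-aware moment keeps the example reproducible.
moment = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
stamp = build_timestamp(moment)
command = build_pig_command("myscript.pig", "india_", moment)

print(" ".join(command))
```

The resulting list could be handed to subprocess.run on a machine where pig is installed; in practice datetime.now() would replace the fixed moment.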