How to use apache pig filter to find '.PDF'

I have a file /pigmix.txt in HDFS that contains a list of file names in different formats, such as .PDF, .DOC, .PPT, etc. I want to keep only the .PDF entries. How can I use the Apache Pig FILTER operator for this?

Try the filter below.
Input:
file1.txt
file2.PDF
file3.doc
file4.ppt
file5.pdf
Pig script:
A = LOAD 'input' USING PigStorage() AS (filename:chararray);
B = FILTER A BY filename matches '.*\\.(pdf|PDF)$';
DUMP B;
Output:
(file2.PDF)
(file5.pdf)
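
The pattern above lists both spellings explicitly, so a mixed-case name such as file6.Pdf would not match. A minimal sketch of a case-insensitive variant follows, assuming the same single-column input; it lowercases the name with the builtin LOWER before matching. Because matches tests the whole string against the pattern, the trailing $ is optional.

```pig
-- Same input as above: one file name per line.
A = LOAD 'input' USING PigStorage() AS (filename:chararray);
-- Lowercase the name first so .pdf, .PDF and .Pdf are all kept.
B = FILTER A BY LOWER(filename) matches '.*\\.pdf';
DUMP B;
```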

Related

How to remove extra entries from the tuple in apache pig

How do I get rid of the extra commas (empty entries) in the tuple?
ab = load "/path/file1.txt" USING PigStorage(',') AS (id1:chararray, id2:chararray, dt:chararray, qty:int);
Current output:
(F1,S9,12/09/2011,2,,,)
Expected output:
(F1,S9,12/09/2011,2)
Should I change the text in my file1.txt, or do something else?

Put the path between single quotes ('') in the LOAD statement.
Example:
ab = load '/path/file1.txt' USING PigStorage(',') AS (id1:chararray, id2:chararray, dt:chararray, qty:int);
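
If the empty entries are still present after fixing the quotes, one option is to project only the declared fields. This is a sketch that assumes the input lines end with extra commas, as the output in the question suggests; the alias trimmed is hypothetical.

```pig
ab = LOAD '/path/file1.txt' USING PigStorage(',') AS (id1:chararray, id2:chararray, dt:chararray, qty:int);
-- Keep only the four named fields; anything after qty is not projected.
trimmed = FOREACH ab GENERATE id1, id2, dt, qty;
DUMP trimmed;
```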

Unable to load data using Apache Pig

I have CSV data in the following format:
id,name,price,information
12,Pants,50.00,{Clothes & Shoes: 5}
And here is my Pig script:
grunt> sample = LOAD 'data.csv' USING PigStorage (',') AS (id:int, name:chararray, price:double, information:chararray);
The problem is that when I load information as a chararray, I can't access the category or the quantity individually. I tried something like:
information:tuple(category:chararray, quantity:int)
But it didn't work.
What is the best way to load information so that I can access both the category and the quantity?
Thanks

What you have is a bag, not a tuple. In Pig's notation:
( ) A tuple is enclosed in parentheses ( ).
{ } An inner bag is enclosed in curly brackets { }.
You can load it like this:
sample = LOAD 'data.csv' USING PigStorage (',') AS (id:int, name:chararray, price:double, information:bag{});
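
If the bag schema does not give usable fields for a value like {Clothes & Shoes: 5} (it contains no parenthesized tuples), an alternative is to keep information as a chararray and pull the two parts out with the builtin REGEX_EXTRACT_ALL. The following is only a sketch: the regular expression assumes every value has the form {category: quantity}, and the aliases items, parsed and typed are hypothetical.

```pig
-- Load the last column as plain text.
items = LOAD 'data.csv' USING PigStorage(',') AS (id:int, name:chararray, price:double, information:chararray);
-- Capture the text before the colon and the digits after it.
parsed = FOREACH items GENERATE id, name, price,
         FLATTEN(REGEX_EXTRACT_ALL(information, '\\{(.*):\\s*(\\d+)\\}')) AS (category:chararray, quantity:chararray);
-- Cast the captured digits to int so quantity can be used numerically.
typed = FOREACH parsed GENERATE id, name, price, category, (int)quantity AS quantity;
DUMP typed;
```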

apache pig load data with multiple delimiters

Hi everyone, I have a problem loading data with Apache Pig. The file format is like this:
"1","2","xx,yy","a,sd","3"
So I want to load it using the multi-character delimiter "," (two double quotes with a comma between them), like:
A = LOAD 'file.csv' USING PigStorage('","') AS (f1,f2,f3,f4,f5);
But PigStorage doesn't accept the multi-character delimiter ",". How can I do it? Thank you very much!

PigStorage takes a single character as its delimiter. You will have to use a loader from PiggyBank instead. Download piggybank.jar, save it in the same folder as your Pig script, and register the jar in the script.
REGISTER piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD 'test1.txt' USING CSVLoader(',') AS (f1:int,f2:int,f3:chararray,f4:chararray,f5:int);
B = FOREACH A GENERATE f1,f2,f3,f4,f5;
DUMP B;
An alternative is to load each record as a single line and then split it with STRSPLIT:
A = LOAD 'test1.txt' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '","'));
DUMP B;
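
Note that splitting on "," alone leaves the opening quote on the first field and the closing quote on the last one. A minimal sketch that strips those outer quotes first with the builtin REPLACE (which takes a regular expression) is shown below; the aliases stripped and C are hypothetical.

```pig
A = LOAD 'test1.txt' USING TextLoader() AS (line:chararray);
-- Remove the double quote at the start and at the end of each line.
B = FOREACH A GENERATE REPLACE(line, '^"|"$', '') AS stripped;
-- Split on the three-character sequence quote, comma, quote.
C = FOREACH B GENERATE FLATTEN(STRSPLIT(stripped, '","'));
DUMP C;
```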

Loading selected files from a directory in Pig script

I would like to know how to load only some of the files in a directory in a Pig script.
Let's say a directory holds 4 files for January, named as follows:
2016-01-01.txt
2016-01-02.txt
2016-01-03.txt
2016-01-04.txt
My requirement is to read the files from 2016-01-01 to 2016-01-03, that is, the first 3 files of January 2016.
My Pig script:
This line works:
rec = LOAD '/home/dir/{2016-01-01*,2016-01-02*,2016-01-03*}' USING PigStorage(',');
This line does not work:
rec = LOAD '/home/dir/{2016-01-{01*-03*}}' USING PigStorage(',');
I get the error below. I am using Pig 0.14 on a MapR cluster.
N/A file_records MAP_ONLY Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input Pattern maprfs:///home/dir/{2016-01-{01*-03*}} matches 0 files. Paths with components .*, _* were skipped.
0 additional path filters were applied
Could somebody explain what happened and how to resolve it?

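Possible duplicate of "Load multiple files over a date range in PIG". Braces in the path glob list comma-separated alternatives; they do not express a range, so {01*-03*} is treated as one literal pattern and matches no files. Use one of the following instead: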
rec = LOAD '/home/dir/{2016-01-0{1,2,3}*}' USING PigStorage(',');
or
rec = LOAD '/home/dir/{2016-01-{01,02,03}*}' USING PigStorage(',');
or
rec = LOAD '/home/dir/{2016-01-0[1-3]*}' USING PigStorage(',');

Error loading pig script

I am having difficulty loading data with an Apache Pig script:
cat data15.txt
1,(2,3)
2,(3,4)
grunt>a = load 'nikhil/data15.txt' using PigStorage(',') as (x:int, y:tuple(y1:int,y2:int));
grunt>dump a;
(1,)
(2,)

I know it's too late to answer this.
The problem is that the tuple's fields and the top-level fields use the same delimiter, ','. PigStorage splits each line on every comma, so the tuple is broken apart and the schema conversion fails.
You need to change the top-level delimiter. For example, with input like this:
1:(5,7,7)
3:(7,9,4)
5:(5,9,7)
run the Pig script as:
A = load 'file.txt' using PigStorage(':') as (t1:int,t2:tuple(x:int,y:int,z:int));
dump A;
The output is:
(1,(5,7,7))
(3,(7,9,4))
(5,(5,9,7))
You can change the delimiter in the input file with the sed command and then load the file.
