Use regular expression in Pig load command - apache-pig

I am wondering if it's possible to use a regular expression in the Pig LOAD command to load all files that end with a given suffix. I have seen some examples for the prefix case. For example, if I want to load all the files that start with "Prefix", I can do something like:
LOAD '/path/to/dir/Prefix*' USING AvroStorage();
To load all files that end with a certain suffix, I am trying to do something like:
LOAD '/path/to/dir/*suffix' USING AvroStorage();
But it's not working. Can someone please suggest what's wrong? Thanks a lot for your help!
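Hadoop path globs behave much like shell globs, so one way to sanity-check a pattern is to try it against the file names you expect it to match. The sketch below uses Python's fnmatch as a rough stand-in for Hadoop's matcher (an approximation, not Pig's actual implementation), and the file names are hypothetical since the question does not show the real ones. It illustrates one possible cause: if the files carry an extension such as .avro, '*suffix' no longer matches them, while a pattern that spells out the extension does.

```python
from fnmatch import fnmatchcase

# Hypothetical file names; the question does not show the real ones.
names = ["Prefix_a.avro", "b_suffix.avro", "c_suffix"]

def matching(pattern, candidates):
    """Return the candidates that a shell-style glob pattern matches."""
    return [n for n in candidates if fnmatchcase(n, pattern)]

print(matching("Prefix*", names))       # prefix glob
print(matching("*suffix", names))       # misses names that have an extension
print(matching("*suffix.avro", names))  # extension spelled out
```

If the real names do end in an extension, a pattern such as '/path/to/dir/*suffix.avro' may be worth trying in the LOAD statement.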

Related

AzureSynapse Lookup UserErrorFileNotFound with Wildcard path

I am facing an odd issue where my lookup returns a file-not-found error when I use a wildcard path. If I specify an exact file path, the lookup runs without error. However, if I replace the file name with a *, I get a file-not-found error.
The file is Data_643.json, located in my Azure Data Lake Storage Gen2, under the labournavigatorfile system. The exact file path is:
labournavigatorfile/raw_data/Scraped/HeadHunter/Saudi_Arabia/Data_643.json.
If I put this exact path into the Integration dataset configuration, the pipeline runs without issue. However, as soon as I replace 'Data_643.json' with a '*', the pipeline fails with a file-not-found error.
What am I doing wrong? Many thanks for any support. This must be something very simple that I am missing.
Exact path works:
Wildcard path throws error:
I have 3 files in my container: file1.json, file2.json and file3.json.
I configured my dataset to read them with a wildcard, using the same configuration as in the question.
When I used this dataset in a lookup, I got the same error.
To overcome this, go to your lookup activity. When you want to use wildcards to read one or more files, check the wildcard file path option. Then specify the folder structure and use a wildcard where required.
The following is the debug output when I run the pipeline (Each of my files had 10 rows):

How to prevent Apache pig from outputting empty files?

I have a pig script that reads data from a directory on HDFS. The data are stored as avro files. The file structure looks like:
DIR--
--Subdir1
--Subdir2
--Subdir3
--Subdir4
In the pig script I am simply doing a load, filter and store. It looks like:
items = LOAD path USING AvroStorage();
items = FILTER items BY some property;
STORE items INTO outputDirectory USING AvroStorage();
The problem right now is that pig is outputting many empty files in the output directory. I am wondering if there's a way to remove those files? Thanks!
For pig version 0.13 and later, you can set pig.output.lazy=true to avoid creating empty files. (https://issues.apache.org/jira/browse/PIG-3299)

How should the data in my text file be formatted for the following script?

I want to know how the data in my text file should be laid out for the following script.
How does Pig tell the delimiters apart for this script?
Could you please give me one sample row of input?
A = LOAD 'mydata.txt' AS (P:int, T1:tuple(f1:int, f2:int), B:{T2:(t1:int,t2:int)}, M:[] );
First, see this document:
Load/Store Functions
Also see this related question:
Apache Pig - Not able to read the bag
Sample data:
30|(1,2)|{(3,4)}|[]
Sample code:
A = LOAD 'mydata.txt' USING PigStorage('|') AS (P:int, T1:tuple(f1:int, f2:int), B:{T2:(t1:int,t2:int)}, M:[] );
DUMP A;
It seems PigStorage cannot handle commas inside a bag when the comma is also the field delimiter. I guess it's a bug.
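To make the sample row concrete, here is a small Python illustration of how the '|' delimiter separates the four top-level fields declared in the schema. It only mimics the top-level split and is not how PigStorage parses tuples, bags, or maps internally.

```python
# Sample row from the answer above.
row = "30|(1,2)|{(3,4)}|[]"

# With '|' as the delimiter, the commas only ever appear inside the
# tuple and the bag, so they cannot be mistaken for field separators.
fields = row.split("|")

# Field names taken from the AS clause of the LOAD statement.
names = ["P", "T1", "B", "M"]
for name, value in zip(names, fields):
    print(f"{name} = {value}")
```

Choosing a delimiter that never occurs inside the complex values is what keeps the split unambiguous.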

Storing from wildcard input path

I’m having issues using wildcard input paths in Pig.
If I run the following commands:
A = load '/something/*.csv' using PigStorage(',');
dump A;
I see the output from all csv files in the something folder printed to my console after the job is run.
If, however, I run a store instead:
A = load '/something/*.csv' using PigStorage(',');
store A into 'somedestination';
The job fails with the following error message:
Input(s):
Failed to read data from "/something/*.csv"
It looks like the store is attempting to load from the literal path instead of globbing using the wildcard, but if that’s the case then why does it work during the dump? Is there another way to accomplish this?
You may not have the permission to write to that folder.
The dump essentially writes to the tmp folder (or another folder if the configuration is different) and then prints that to the screen.
Do a dump. Look at the log. It should say something like:
Input(s):
Successfully read 0 records from: "‘/something/*.csv’"
Output(s):
Successfully stored 0 records in: "file:/tmp/temp1865628879/tmp-1573237939"
Next, try to store to the folder that you saw in the log when you did the dump. If that works fine, then you have a permissions problem.

WebHCat & Pig - how to pass a parameter file to the job?

I am using HCatalog's WebHCat API to run Pig jobs, such as documented here:
https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference+Pig
I have no problem running a simple job, but I would like to attach a parameters file to the job, as one can do with the pig command line's --param_file option.
I assume this is possible through the request's arg parameter, so I tried multiple things, such as passing:
'arg': '-param_file /path/to/param.file'
or:
'arg': {'param_file': '/path/to/param.file'}
Neither seems to work, and the error stack traces don't say much.
I would love to know if this is possible, and if so, how to correctly achieve this.
Many thanks
Correct usage:
'arg': ['-param_file', '/path/to/param.file']
Explanation:
When the value is passed to arg as a dict,
'arg': {'-param_file': '/path/to/param.file'}
WebHCat generates only "-param_file" for the command line,
and Pig throws the following error:
ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Can not create a Path from a null string
Using a comma instead of the colon (that is, a list instead of a dict) passes the path to the file as a second argument,
so WebHCat generates "-param_file" "/path/to/param.file".
P.S.: I am using the Requests library for Python to make the REST calls.
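The difference between the list and the dict form can be seen in the form body itself. The sketch below uses only the standard library's urlencode with doseq=True as a stand-in for the HTTP client's form encoding; that Requests encodes these values the same way is an assumption here, and form_body is a hypothetical helper name.

```python
from urllib.parse import urlencode

def form_body(arg):
    """Encode an 'arg' value as a form body; doseq repeats the key per item."""
    return urlencode({"arg": arg}, doseq=True)

# A list yields one 'arg' entry per element, so both tokens are sent.
as_list = form_body(["-param_file", "/path/to/param.file"])

# Iterating a dict yields only its keys, so the path is silently dropped.
as_dict = form_body({"-param_file": "/path/to/param.file"})

print(as_list)
print(as_dict)
```

With the list, the body carries two arg entries, which matches the two command-line tokens described above. With the dict, only "-param_file" is sent, which is consistent with the null-path error.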