Storing from wildcard input path - apache-pig

I’m having issues using wildcard input paths in Pig.
If I run the following commands:
A = load '/something/*.csv' using PigStorage(',');
dump A;
I see the output from all csv files in the something folder printed to my console after the job is run.
If, however, I run a store instead:
A = load '/something/*.csv' using PigStorage(',');
store A into ‘somedestination’;
The job fails with the following error message:
Input(s):
Failed to read data from "/something/*.csv"
It looks like the store is attempting to load from the literal path instead of globbing using the wildcard, but if that’s the case then why does it work during the dump? Is there another way to accomplish this?

You may not have the permission to write to that folder.
The dump essentially writes to the tmp folder (or another folder if the configuration is different) and then prints that to the screen.
Do a dump. Look at the log. It should say something like:
Input(s):
Successfully read 0 records from: "/something/*.csv"
Output(s):
Successfully stored 0 records in: "file:/tmp/temp1865628879/tmp-1573237939"
Next time, try storing to the folder that you saw when you did the dump. If that works fine, then you have a permissions problem.
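As a quick check, you could also point the store at a path your user definitely owns, for example somewhere under your HDFS home directory. A minimal sketch, where the destination name permission_test_output is made up:
A = load '/something/*.csv' using PigStorage(',');
-- a relative destination resolves under the HDFS home directory of the running user
store A into 'permission_test_output';
If that succeeds, the wildcard is being globbed fine and the original failure is most likely about write access to the destination.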

Related

Archiving files using Pentaho PDI

I need to archive a txt file using Pentaho PDI by giving it a dynamic timestamp and appending a variable to the output filename. I used the Get System Info step, which automatically assigns the variable as well as its value, so my job was Start -> Get System Info -> Zip File. In the Zip File component, I tried calling the variable in the output filename with ${Variable}, but the output filename does not come out properly. It should be of the form filename__timestamp__variable. Can someone please help me with this?

AzureSynapse Lookup UserErrorFileNotFound with Wildcard path

I am facing an odd issue where my lookup returns a FileNotFound error when I use a wildcard path. If I specify an exact file path, the lookup runs without error. However, if I replace the filename with a *, I get a FileNotFound error.
The file is Data_643.json, located in my Azure Data Lake Storage Gen2, under the labournavigatorfile system. The exact file path is:
labournavigatorfile/raw_data/Scraped/HeadHunter/Saudi_Arabia/Data_643.json.
If I put this exact path into the Integration dataset configuration, the pipeline runs without issue. However, as soon as I replace 'Data_643.json' with a '*', the pipeline crashes with a FileNotFound error.
What am I doing wrong? Many thanks for any support. This must be something very simple that I am missing.
Exact path works:
Wildcard path throws an error:
I have 3 files in my container, named file1.json, file2.json, and file3.json, as shown below:
The following is how I configured my dataset to read using a wildcard, with the same configuration as in the image provided in the question.
When I used this in the lookup, I got the same error:
To overcome this, go to your Lookup activity. When you want to use wildcards to read a file or files, check the Wildcard file path option, then specify the folder structure and use a wildcard where required. The following is an image for reference.
The following is the debug output when I run the pipeline (Each of my files had 10 rows):

Internal error while loading to Bigquery table

I ran this command to load 11 files into a BigQuery table:
bq load --project_id=ardent-course-601 --source_format=NEWLINE_DELIMITED_JSON dw_test.rome_defaults_20140819_test gs://sm-uk-hadoop/queries/logsToBq_transformLogs/rome_defaults/20140819/23af7218-617d-42e8-884e-f213a583094a/part* /opt/sm-analytics/projects/logsTobqMR/jsonschema/rome_defaultsSchema.txt
I got this error:
Waiting on bqjob_r46f38146351d545_00000147ef890755_1 ... (11s) Current status: DONE
BigQuery error in load operation: Error processing job 'ardent-course-601:bqjob_r46f38146351d545_00000147ef890755_1': Too many errors encountered. Limit is: 0.
Failure details:
- File: 5: Unexpected. Please try again.
I tried many times after that and still got the same error.
To debug what went wrong, I instead loaded each file one by one into the BigQuery table. For example:
/usr/local/bin/bq load --project_id=ardent-course-601 --source_format=NEWLINE_DELIMITED_JSON dw_test.rome_defaults_20140819_test gs://sm-uk-hadoop/queries/logsToBq_transformLogs/rome_defaults/20140819/23af7218-617d-42e8-884e-f213a583094a/part-m-00011.gz /opt/sm-analytics/projects/logsTobqMR/jsonschema/rome_defaultsSchema.txt
There are 11 files in total and each loaded fine.
Could someone please help? Is this a bug on the BigQuery side?
Thank you.
There was an error reading one of the files: gs://...part-m-00005.gz
Looking at the import logs, it appears that the gzip reader encountered an error decompressing the file.
It looks like that file may not actually be compressed. BigQuery samples the header of the first file in the list to determine whether it is dealing with compressed or uncompressed files and to determine the compression type. When you import all of the files at once, it only samples the first file.
When you run the files individually, BigQuery reads the header of the file and determines that it isn't actually compressed (despite having the suffix '.gz'), so it imports it as a normal flat file.
If you run a load that doesn't mix compressed and uncompressed files, it should work successfully.
Please let me know if you think this is not the case and I'll dig in some more.

Is it possible to manage a NO FILE error in Pig?

I'm trying to load a simple file:
log = load 'file_1.gz' using TextLoader AS (line:chararray);
dump log;
And I get an error:
2014-04-08 11:46:19,471 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input Pattern hdfs://hadoop1:8020/pko/file*gz matches 0 files
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:288)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1054)
Is it possible to manage such a situation before the error appears?
Input Pattern hdfs://hadoop1:8020/pko/file*gz matches 0 files
The error means the input file doesn't exist at the given HDFS path.
log = load 'file_1.gz' using TextLoader AS (line:chararray);
Since you haven't given the absolute path of file_1.gz, Pig will look in the HDFS home directory of the user running the Pig script.
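For example, with the cluster from the error message above, an explicit absolute path would look something like this (the path is illustrative only):
log = load 'hdfs://hadoop1:8020/pko/file_1.gz' using TextLoader AS (line:chararray);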
Unfortunately, in the current version of Pig (0.15.0) it is impossible to manage these errors without using UDFs.
I suggest creating a Java or Python script using try and catch to take care of this.
Here's a good website that might be of some use to you: https://wiki.apache.org/pig/PigErrorHandlingInScripts
Good luck learning Pig!
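As a rough sketch of that idea, the Pig script can be embedded in a Python (Jython) driver and the outcome checked with try/except. This assumes an installation that supports Pig's Python embedding (run it with something like pig script.py); the output path file_1_out is made up:
from org.apache.pig.scripting import Pig
# compile the Pig Latin that may fail when the glob matches no files
P = Pig.compile("""
log = LOAD 'file_1.gz' USING TextLoader AS (line:chararray);
STORE log INTO 'file_1_out';
""")
try:
    stats = P.bind().runSingle()
    if not stats.isSuccessful():
        # e.g. ERROR 2118: Input Pattern ... matches 0 files
        print 'Load failed: %s' % stats.getErrorMessage()
except Exception, e:
    print 'Could not run the script: %s' % e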
I'm facing this issue as well. My load command is:
DATA = LOAD '${qurwf_folder_input}/data/*/' AS (...);
I want to load all files from the data subfolders, but the data folder was empty and I got the same error as you. What I did in my particular case was create an empty folder inside the data directory, so the LOAD returned an empty dataset and the script did not fail.
By the way, I'm using an Oozie workflow to run the scripts, and in the prepare step I create the empty folders.
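If you are not running under Oozie, a similar placeholder can be created from the Pig script itself before the load. A sketch, where the folder name placeholder and the schema are made up, and -mkdir -p assumes a Hadoop version that supports the -p flag:
-- make sure the glob has at least one directory to match
fs -mkdir -p ${qurwf_folder_input}/data/placeholder
DATA = LOAD '${qurwf_folder_input}/data/*/' AS (line:chararray);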

How to force STORE (overwrite) to HDFS in Pig?

When developing Pig scripts that use the STORE command, I have to delete the output directory before every run or the script stops with:
2012-06-19 19:22:49,680 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000: Output Location Validation Failed for: 'hdfs://[server]/user/[user]/foo/bar More info to follow:
Output directory hdfs://[server]/user/[user]/foo/bar already exists
So I'm searching for an in-Pig solution to automatically remove the directory, also one that doesn't choke if the directory is non-existent at call time.
In the Pig Latin Reference I found the shell command invoker fs. Unfortunately the Pig script breaks whenever anything produces an error. So I can't use
fs -rmr foo/bar
(i.e. remove recursively) since it breaks if the directory doesn't exist. For a moment I thought I might use
fs -test -e foo/bar
which is a test and shouldn't break, or so I thought. However, Pig again interprets test's return code on a non-existent directory as a failure code and breaks.
There is a JIRA ticket for the Pig project addressing my problem and suggesting an optional parameter OVERWRITE or FORCE_WRITE for the STORE command. Anyway, I'm using Pig 0.8.1 out of necessity and there is no such parameter.
At last I found a solution on grokbase. Since finding the solution took too long I will reproduce it here and add to it.
Suppose you want to store your output using the statement
STORE Relation INTO 'foo/bar';
Then, in order to delete the directory, you can call at the start of the script
rmf foo/bar
No ";" or quotations required since it is a shell command.
I cannot reproduce it now, but at some point I got an error message (something about missing files) where I can only assume that rmf interfered with map/reduce. So I recommend putting the call before any relation declaration; placing it after SETs, REGISTERs and %default statements should be fine.
Example:
SET mapred.fairscheduler.pool 'inhouse';
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
%default name 'foobar'
rmf foo/bar
Rel = LOAD 'something.tsv';
STORE Rel INTO 'foo/bar';
Once you use the fs command, there are a lot of ways to do this. For an individual file, I wound up adding this to the beginning of my scripts:
-- Delete file (won't work for output, which will be a directory,
-- but will work for a file that gets copied or moved during the
-- script.)
fs -touchz top_100
rm top_100
For a directory
-- Delete dir
fs -rm -r out