pig error: Job in state DEFINE instead of RUNNING - Generic solution - apache-pig

A typical Pig error that occurs without much useful information is the following:
Job in state DEFINE instead of RUNNING
Often found in a line like this:
Caused by: java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
I have seen some examples of this error, but would like to have the generic solution for this problem.

So far, on each occasion where I have encountered this error, it has been because Pig fails to load files. The error in the question is printed to the stderr log, and you will not find anything useful there.
However, if you were to look in the stdout log, you would expect to find the following:
Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input Pattern hdfs://x.x.x.x:x/locationOnHDFS/* matches 0 files
Typically followed by:
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://x.x.x.x:x/locationOnHDFS/* matches 0 files
At this point the most likely suspects are:
There are no files in the specified folder (though the folder exists)
The user that is running the script does not have the rights to access the relevant files
All files are empty (not sure about this one)
Note that it is a commonly known difficulty that Pig will error out if you try to read an empty directory (rather than just processing the alias with 0 lines).
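A quick way to rule these suspects out before rerunning the script is to inspect the input location directly. A minimal sketch, with /locationOnHDFS standing in for your actual input path:
hdfs dfs -ls /locationOnHDFS
hdfs dfs -count /locationOnHDFS
hdfs dfs -ls -d /locationOnHDFS
The first two show whether the folder contains any (non-empty) files, and the last shows the ownership and permissions on the folder itself, which covers the access-rights suspect.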

Related

AzureSynapse Lookup UserErrorFileNotFound with Wildcard path

I am facing an odd issue where my lookup is returning a filenotfound error when I use a wildcard path. If I specify an exact file path, the lookup runs without error. However, if I replace the filename with a *, I get a filenotfound error.
The file is Data_643.json, located in my Azure Data Lake Storage Gen2, under the labournavigatorfile system. The exact file path is:
labournavigatorfile/raw_data/Scraped/HeadHunter/Saudi_Arabia/Data_643.json.
If I put this exact path into the Integration dataset configuration, the pipeline runs without issue. However, as soon as I replace the 'Data_643.json' with a '*', the pipeline crashes with a filenotfound error.
What am I doing wrong? Many thanks for any support. This must be something very simple that I am missing.
Exact path works:
Wildcard path throws error:
I have 3 files in my container as file1.json, file2.json, file3.json as shown below:
The following is how I configured my dataset to read using wildcard with configuration same as in the image provided in the question.
When I used this in lookup I got the same error:
To overcome this, go to your lookup activity. When you want to use wildcards to read a file or files, check the wildcard file path option, then specify the folder structure and use wildcards where required. The following is an image for reference.
The following is the debug output when I run the pipeline (Each of my files had 10 rows):

Run a binary CGI file on the command line with GET params

How can I run a binary cgi file on the command line and provide GET parameters to it?
I understand this task may be straightforward for perl or php files, but I've got a binary cgi file and no documentation for it. I'd like to run it without a web server so that I can evaluate certain problems on some co-workers' machines.
I've tried the following, but to no avail:
QUERY_STRING="foo=bar" ./myfile.cgi
foo=bar ./myfile.cgi
./myfile.cgi foo=bar
./myfile.cgi <<< "foo=bar"
In each case, the script outputs Error in form found<br>Missing foo<br><b></b><br>. (When executed through apache on our server, it returns no error message, only the intended results.)
Specifying the environment variable REQUEST_METHOD=GET in addition to QUERY_STRING=... makes the difference.
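For example, to reproduce the GET request above from the shell:
REQUEST_METHOD=GET QUERY_STRING="foo=bar" ./myfile.cgi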
Among the tokens output by strings myfile.cgi, there were a number of CGI-related variables which the web server might be expected to set, such as REMOTE_ADDR, SERVER_SOFTWARE, and SERVER_PROTOCOL, but the two aforementioned variables proved enough to get this executable to produce non-error output.
(For a POST, I've read that the body/params are read from stdin, but I haven't substantiated that personally.)
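Should you want to try the POST route, an untested sketch along the same lines (the CGI convention is that the body arrives on stdin with its length in CONTENT_LENGTH) would be:
echo -n "foo=bar" | REQUEST_METHOD=POST CONTENT_TYPE=application/x-www-form-urlencoded CONTENT_LENGTH=7 ./myfile.cgi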

DATASET_CANT_CLOSE error number 32 "Broken Pipe"

I experienced an error in SAP ABAP which says DATASET_CANT_CLOSE with error number 32 (Broken Pipe). The question is: what triggers this kind of error?
As far as I know, this error was triggered by:
CLOSE DATASET dset
But I can't reproduce the error, since I don't know exactly what triggers it.
This is the code I use:
method GENERATE_TXT_FILE.
  DATA:
    lwa_data TYPE t_line,
    lv_param TYPE sxpgcolist-parameters.

  "Upload File to Server
*Open Dataset
  OPEN DATASET im_file_name FILTER 'dos2ux'
    FOR OUTPUT IN TEXT MODE ENCODING DEFAULT.
  CLEAR lwa_data.
  LOOP AT it_data INTO lwa_data.
    CATCH SYSTEM-EXCEPTIONS file_access_errors = 4
                            OTHERS = 8.
      TRANSFER lwa_data-lines TO im_file_name.
    ENDCATCH.
    IF sy-subrc <> 0.
      CLEAR lwa_data.
      EXIT.
    ENDIF.
    CLEAR lwa_data.
  ENDLOOP.
*Close Dataset
  CLOSE DATASET im_file_name.
As I investigated through the background job log, it seems that the server running the background job had not yet been mapped to the text file folder. The solution is to re-map the server to the text file folder.
You are using the FILTER extension to OPEN DATASET - which can be a HUGE security issue as well as raise loads of portability issues unless you know what you're doing, but that's not what the question is about. From the documentation:
When the statement OPEN DATASET is executed, a process is started in the operating system for the specified statement. When the file is opened for reading, a channel (pipe) is linked with STDOUT of the process, from which the data is read during file reading. The file itself is linked with STDIN of the process. When the file is opened for writing, a channel (pipe) is linked to STDIN of the process, to which data is passed when writing. The output of the process is diverted to this file.
In your case, the filter command probably decided to bail out - see this answer among many. Why it did is hard to investigate - you may have to go through various system logs to find out. If the problem really is some unmapped network folder, you could try switching to UNC paths.
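In shell terms, the write path described above behaves roughly like the following analogy (not literally what the kernel runs; abap_writer stands in for the ABAP side of the pipe, and the path is a placeholder):
abap_writer | dos2ux > /path/to/im_file_name
If the filter process on the right-hand side exits or never starts (for example because the target folder is not reachable from the application server running the job), the writer is left writing into a pipe with no reader, which is exactly what the operating system reports as error 32, "Broken pipe".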

Is it possible to manage NO FILE error in Pig?

I'm trying to load a simple file:
log = load 'file_1.gz' using TextLoader AS (line:chararray);
dump log
And I get an error:
2014-04-08 11:46:19,471 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input Pattern hdfs://hadoop1:8020/pko/file*gz matches 0 files
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:288)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1054)
Is it possible to manage such a situation before the error appears?
Input Pattern hdfs://hadoop1:8020/pko/file*gz matches 0 files
The error means the input file doesn't exist at the given HDFS path.
log = load 'file_1.gz' using TextLoader AS (line:chararray);
Since you haven't given the absolute path of file_1.gz, it will be resolved against the HDFS home directory of the user running your Pig script.
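In other words, a relative load like that ends up pointing at something like hdfs://hadoop1:8020/user/<your-user>/file_1.gz (<your-user> being a placeholder for the actual account). You can check what is actually there with:
hdfs dfs -ls /user/<your-user>
or pass an absolute path (or a full hdfs:// URI) to LOAD instead.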
Unfortunately, in the current version of Pig (0.15.0) it is impossible to manage these errors without using UDFs.
I suggest creating a Java or Python script using try and catch to take care of this.
Here's a good website that might be of some use to you: https://wiki.apache.org/pig/PigErrorHandlingInScripts
Good luck learning Pig!
I'm facing this issue as well. My load command is:
DATA = LOAD '${qurwf_folder_input}/data/*/' AS (...);
I want to load all files from the data subfolders, but the data folder was empty and I got the same error as you. What I did, in my particular case, was to create an empty folder inside the data directory, so the LOAD returned an empty dataset and the script did not fail.
By the way, I'm using an Oozie workflow to run the scripts, and in the prepare step I create the empty folders.
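For reference, the prepare step only needs to make sure the glob matches at least one directory; a minimal sketch (the path is a placeholder for the expanded ${qurwf_folder_input}):
hdfs dfs -mkdir -p /path/to/workflow/data/placeholder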

How to force STORE (overwrite) to HDFS in Pig?

When developing Pig scripts that use the STORE command, I have to delete the output directory for every run or the script stops with:
2012-06-19 19:22:49,680 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000: Output Location Validation Failed for: 'hdfs://[server]/user/[user]/foo/bar More info to follow:
Output directory hdfs://[server]/user/[user]/foo/bar already exists
So I'm searching for an in-Pig solution to automatically remove the directory, also one that doesn't choke if the directory is non-existent at call time.
In the Pig Latin Reference I found the shell command invoker fs. Unfortunately, a Pig script breaks whenever anything produces an error, so I can't use
fs -rmr foo/bar
(i.e. remove recursively) since it breaks if the directory doesn't exist. For a moment I thought I might use
fs -test -e foo/bar
which is a test and shouldn't break, or so I thought. However, Pig again interprets test's return code on a non-existing directory as a failure code and breaks.
There is a JIRA ticket for the Pig project addressing my problem and suggesting an optional parameter OVERWRITE or FORCE_WRITE for the STORE command. Anyway, I'm using Pig 0.8.1 out of necessity and there is no such parameter.
At last I found a solution on grokbase. Since finding the solution took too long I will reproduce it here and add to it.
Suppose you want to store your output using the statement
STORE Relation INTO 'foo/bar';
Then, in order to delete the directory, you can call at the start of the script
rmf foo/bar
No ";" or quotations required since it is a shell command.
I cannot reproduce it now, but at some point I got an error message (something about missing files), and I can only assume that rmf interfered with map/reduce. So I recommend putting the call before any relation declaration; placing it after SETs, REGISTERs and %default statements should be fine.
Example:
SET mapred.fairscheduler.pool 'inhouse';
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
%default name 'foobar'
rmf foo/bar
Rel = LOAD 'something.tsv';
STORE Rel INTO 'foo/bar';
Once you use the fs command, there are a lot of ways to do this. For an individual file, I wound up adding this to the beginning of my scripts:
-- Delete a file (won't work for output, which will be a directory,
-- but will work for a file that gets copied or moved during
-- the script.)
fs -touchz top_100
rm top_100
For a directory
-- Delete dir
fs -rm -r out
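Depending on your Hadoop version, the FsShell -f flag may also save you the touchz/rm dance, since it suppresses the error when the target does not exist (a sketch, assuming a reasonably recent Hadoop):
fs -rm -r -f out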