Pig: Attempt to access non-existent field - apache-pig

Problem:
Dumping filtered output throws an error and prints incorrect output with warnings:
Error: attempt to access a non-existent field in the input
Steps:
Loaded a tab-delimited file into relation a:
a = LOAD '/user/a6000518-a/AdobeHourlySampleHit/hit_data.tsv' USING PigStorage('\t');
This file contains 952 columns.
I want to list the values in the 374th column. I did a null check and generated the 374th column values.
b = FILTER a BY $373 is not null;
c = FOREACH b GENERATE $373;
DUMP c;
Dumping the results produces the expected output but also prints this warning message:
2015-08-20 16:50:53,179 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject(ACCESSING_NON_EXISTENT_FIELD): Attempt to access field which was not found in the input
Could you please let me know where I could've made a mistake?
Thanks!
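One possible explanation, offered as an assumption rather than a diagnosis: Pig raises ACCESSING_NON_EXISTENT_FIELD when a positional reference such as $373 points past the end of a tuple, which would happen if some rows in the file have fewer than 374 tab-separated fields. The sketch below (Python, not Pig) counts such rows in a local sample of the data. The function name and the sample content are hypothetical.

```python
import io

def count_short_rows(lines, index, delimiter="\t"):
    """Return (total_rows, short_rows); a short row has no field at `index`."""
    total = 0
    short = 0
    for line in lines:
        total += 1
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) <= index:
            short += 1
    return total, short

# Tiny stand-in for hit_data.tsv: the second row is missing its third field.
sample = io.StringIO("a\tb\tc\nd\te\nf\tg\th\n")
result = count_short_rows(sample, index=2)
print(result)
```

For the real file, the call would use index=373 on an open file handle. A non-zero second number would suggest the warning comes from short rows rather than from the script itself.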

Related

Python - Compare two CSV files based on a column

I am trying to compare two CSV files. Most of the time they will contain the same data, but the rows will not be in the same order. For example:
csv file1
AAA,111,A1A1
BBB,222,B2B2
CCC,333,C3C3
CSV File2
CCC,333,C3C3
BBB,212,B2B2
AAA,111,A1A1
So I want to use the third column as the primary key to compare the other values and report the differences. Is it possible to do this in Robot Framework or pandas?
If you are using Robot Framework, you need to do the following:
Install robotframework-csvlib
Use the built-in Collections library
Input from your question
csv file1
AAA,111,A1A1
BBB,222,B2B2
CCC,333,C3C3
csv file2
CCC,333,C3C3
BBB,212,B2B2
AAA,111,A1A1
My Solution
In the approach below, we first read both CSV files into lists of lists, then compare them with the Collections keyword List Should Contain Sub List. Note the argument values=True, which makes the failure message list the values that were not found.
Code that compares 2 csv files
*** Settings ***
Library    CSVLib
Library    Collections

*** Test Cases ***
Test CSV
    ${list1}=    read csv as list    csv1.csv
    log to console    ${list1}
    ${list2}=    read csv as list    csv2.csv
    log to console    ${list2}
    List Should Contain Sub List    ${list1}    ${list2}    values=True
OUTPUT
(rf1) C:\Users\kgurupra>robot s1.robot
==============================================================================
S1
==============================================================================
Test CSV .[['C1,C2,C3'], ['AAA,111,A1A1'], ['BBB,222,B2B2'], ['CCC,333,C3C3']]
..[['C1,C2,C3'], ['CCC,333,C3C3'], ['BBB,212,B2B2'], ['AAA,111,A1A1']]
Test CSV | FAIL |
Following values were not found from first list: ['BBB,212,B2B2']
------------------------------------------------------------------------------
S1 | FAIL |
1 critical test, 0 passed, 1 failed
1 test total, 0 passed, 1 failed
==============================================================================
Output: C:\Users\kgurupra\output.xml
Log: C:\Users\kgurupra\log.html
Report: C:\Users\kgurupra\report.html
Assuming you've imported your CSV files as pandas DataFrames you can do the following to merge the two while retaining fundamental differences:
df = csv1.merge(csv2, on='<insert name primary key column here>',how='outer')
Adding the suffixes option allows you to more clearly differentiate between identically named columns from each file:
df = csv1.merge(csv2, on='<insert name>',how='outer',suffixes=['_csv1','_csv2'])
After that it depends on what kind of differences you are looking to spot but perhaps a starting point is:
df['difference_1'] = df['column1_csv1'] == df['column1_csv2']
This creates a boolean column that is True where the two values match and False otherwise.
But there are nearly endless options for comparison.
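As one concrete starting point, here is a minimal sketch that uses only the standard csv module instead of pandas or Robot Framework. It assumes the third column is a unique key in both files, as the question describes. The function names are hypothetical, and the sample data is the input from the question.

```python
import csv
import io

def index_by_key(text, key_col=2):
    """Map each row's key column value to the full row."""
    return {row[key_col]: row for row in csv.reader(io.StringIO(text)) if row}

def diff_by_key(text1, text2, key_col=2):
    """Return (key, row_in_file1, row_in_file2) for every key whose rows differ.

    A key present in only one file is reported with None for the other side.
    """
    rows1 = index_by_key(text1, key_col)
    rows2 = index_by_key(text2, key_col)
    differences = []
    for key in sorted(rows1.keys() | rows2.keys()):
        if rows1.get(key) != rows2.get(key):
            differences.append((key, rows1.get(key), rows2.get(key)))
    return differences

file1 = "AAA,111,A1A1\nBBB,222,B2B2\nCCC,333,C3C3\n"
file2 = "CCC,333,C3C3\nBBB,212,B2B2\nAAA,111,A1A1\n"
differences = diff_by_key(file1, file2)
for key, left, right in differences:
    print(key, left, right)
```

Because rows are looked up by key, the order of rows in either file does not matter. Only the B2B2 row is reported here, since its second column differs.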

"Error while reading data" error received when uploading CSV file into BigQuery via console UI

I need to upload a CSV file to BigQuery via the UI. After selecting the file from my local drive, I tell BigQuery to detect the schema automatically and run the job. It fails with the following message:
"Error while reading data, error message: CSV table encountered too
many errors, giving up. Rows: 2; errors: 1. Please look into the
errors[] collection for more details."
I have tried removing the comma in the last column, and tried changing options in the advanced section but it always results in the same error.
The error log does not help me understand where the problem is. This is an example of an error log entry:
2019-04-03 23:03:50.261 CLST Bigquery jobcompleted
bquxjob_6b9eae1_169e6166db0 frank#xxxxxxxxx.nn INVALID_ARGUMENT
and:
"Error while reading data, error message: CSV table encountered too
many errors, giving up. Rows: 2; errors: 1. Please look into the
errors[] collection for more details."
and:
"Error while reading data, error message: Error detected while parsing
row starting at position: 46. Error: Data between close double quote
(") and field separator."
The strange thing is that the sample CSV data contains no double quotes at all:
2019-01-02 00:00:00,326,1,,292,0,,294,0,,-28,0,,262,0,,109,0,,372,0,,453,0,,536,0,,136,0,,2609,0,,1450,0,,352,0,,-123,0,,17852,0,,8528,0
2019-01-02 00:02:29,289,1,,402,0,,165,0,,-218,0,,150,0,,90,0,,263,0,,327,0,,275,0,,67,0,,4863,0,,2808,0,,124,0,,454,0,,21880,0,,6410,0
2019-01-02 00:07:29,622,1,,135,0,,228,0,,-147,0,,130,0,,51,0,,381,0,,428,0,,276,0,,67,0,,2672,0,,1623,0,,346,0,,-140,0,,23962,0,,10759,0
2019-01-02 00:12:29,206,1,,118,0,,431,0,,106,0,,133,0,,50,0,,380,0,,426,0,,272,0,,63,0,,1224,0,,740,0,,371,0,,-127,0,,27758,0,,12187,0
2019-01-02 00:17:29,174,1,,119,0,,363,0,,59,0,,157,0,,67,0,,381,0,,426,0,,344,0,,161,0,,923,0,,595,0,,372,0,,-128,0,,22249,0,,9278,0
2019-01-02 00:22:29,175,1,,119,0,,301,0,,7,0,,124,0,,46,0,,382,0,,425,0,,431,0,,339,0,,1622,0,,1344,0,,379,0,,-126,0,,23888,0,,8963,0
I shared an example of a few lines of CSV data. I expect BigQuery to be able to detect the schema and load the data into a new table.
Using the new BigQuery web UI and your input data, I did the following:
Selected a dataset
Clicked on create a table
Filled in the create table form
The table was created and I was able to SELECT 6 rows as expected:
SELECT * FROM projectId.datasetId.SO LIMIT 1000
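Since the sample rows load cleanly, the offending bytes may sit in a part of the file that was not posted. The sketch below is one way to check a local copy. It assumes the "position" in the error message is a byte offset from the start of the file, which the error text does not confirm. The function names and the sample bytes are hypothetical.

```python
def row_at_offset(data, offset):
    """Return the row that starts at byte `offset`, decoded for inspection."""
    end = data.find(b"\n", offset)
    if end == -1:
        end = len(data)
    return data[offset:end].decode("utf-8", errors="replace")

def rows_with_quotes(data):
    """Return 1-based line numbers of rows containing a double quote character."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if b'"' in line
    ]

# Hypothetical file content: the second row starts at byte 8 and has a stray quote.
sample = b'a,b,c,d\n1,"x"y,3\n4,5,6\n'
suspect = row_at_offset(sample, 8)
quoted = rows_with_quotes(sample)
print(repr(suspect), quoted)
```

On the real file, the bytes would come from open(path, "rb").read() and the offset would be 46. An empty list from rows_with_quotes would point to a different cause, such as a byte order mark or another encoding issue.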

Adding date to new columns returns error

I'm trying to add a new column to my file. I want to add the date to each row of my file.
Filename is: 2016-06-15.txt
The schema of my file is:
A B C
7 8 13
I want to obtain:
Date A B C
2016-06-15 7 8 13
For that I'm using Pig with following script:
A = LOAD 'user/cloudera/Analytics/source/file.txt' using PigStorage(' ','-tagPath');
DUMP A; -- ERROR
STORE A INTO 'user/cloudera/Analytics/source/file.txt' USING PigStorage(' '); -- ERROR
But I'm getting an error and I don't have any log available :( Can anyone help? Many thanks!
You will have to use the -tagFile option to get the file name as the first field.
Before that, check the following:
Make sure the file path is correct. It looks like a forward slash is missing at the beginning of your file path.
Make sure you are using the correct delimiter in PigStorage. It seems the columns are separated by a tab or multiple spaces.
Choose a different folder to store the new file, or you will get a "file exists" error.
A = LOAD '/user/cloudera/Analytics/source/2016-06-15.txt' using PigStorage(' ','-tagFile');
STORE A INTO '/user/cloudera/Analytics/NEW_source/2016-06-15.txt' USING PigStorage(' ');
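Note that -tagFile prepends the file name, so the first field would presumably be 2016-06-15.txt rather than the bare date, and the extension would still have to be stripped. The Python sketch below illustrates the intended transformation outside Pig. The function name is hypothetical, and the single-space delimiter is an assumption taken from the script above.

```python
import os

def tag_rows_with_file_date(path, lines, delimiter=" "):
    """Prepend the file's base name (without extension) to every row."""
    date = os.path.splitext(os.path.basename(path))[0]
    return [delimiter.join([date, line.rstrip("\n")]) for line in lines]

rows = tag_rows_with_file_date(
    "/user/cloudera/Analytics/source/2016-06-15.txt", ["7 8 13\n"]
)
print(rows)
```

In Pig, the same clean-up could be applied to the tagged field in a FOREACH step after loading.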

unable to specify db2 import parameters on bluemix?

I subscribed to a free SQLDB service on Bluemix and tried to import data from a CSV file into this database instance.
For certain columns the data consists purely of spaces, and some columns should be filled with their default values. I can import this data with the following command on my local DB2:
db2 'import from MY_DATA.csv of del modified by usedefaults keepblanks timestampformat="MM/DD/YYYY HH:MM:SS" skipcount 1 insert into MY_TABLE'
On bluemix, I can only assign date / time / timestamp format and skip 1st row. How can I add the "modified by usedefaults keepblanks" part on bluemix to complete the import?
Also, when the import fails, I only receive the following message:
BaseException message: [Routine "SYSPROC.ADMIN_CMD" execution has completed, but at least one error, "_0911", was encountered during the execution. More information is available.. _CODE=20397, _STATE=01H52, DRIVER=3.66.46]
Where can I get the detailed error log that I can see on my local DB, such as:
SQL3125W The character data in row "2" and column "32" was truncated because
the data is longer than the target database column.
SQL3148W A row from the input file was not inserted into the table. SQLCODE
"-181" was returned.
SQL0181N The string representation of a datetime value is out of range.
SQLSTATE=22007
SQL3185W The previous error occurred while processing data from row "2" of
the input file.
SQL3110N The utility has completed processing. "2" rows were read from the
input file.
SQL3221W ...Begin COMMIT WORK. Input Record Count = "2".
SQL3222W ...COMMIT of any database changes was successful.
SQL3149N "2" rows were processed from the input file. "0" rows were
successfully inserted into the table. "1" rows were rejected.
Number of rows read = 2
Number of rows skipped = 1
Number of rows inserted = 0
Number of rows updated = 0
Number of rows rejected = 1
Number of rows committed = 2
In the same quick load page (load complete in step 4), there should be a link to view the logs for this load. Hopefully it'll reveal more details about the error message.
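Also note that, according to the DB2 documentation linked below, keepblanks is listed only among the modifiers for the DEL (delimited ASCII) file format. It is not listed under the modifiers shared by the ASCII file formats (ASC/DEL) or under those for the ASC (non-delimited ASCII) file format.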
http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.sql.rtn.doc/doc/r0023577.html?cp=SSEPGG_10.5.0%2F3-6-1-3-0-0-12&lang=en

Split JSON file using apache PIG

I have a JSON input file that needs to be split into multiple files based on a keyword and the output should also retain the same JSON format.
Example:
The keyword here is the value of the object EVT.NAME. Depending on that value, each record should be routed to the matching output.
The input has three different values (KEYPRESS, TUNE, TRICK), so three different output files should be created.
Input:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672866844,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TUNE","ETS":1402672867117,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402672868600,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672868888,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402673179313,"VALUE":{"KEY":"FAST_FORWARD"}},"HOST":"XXX"}
Output1:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672866844,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672868888,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
Output 2:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TUNE","ETS":1402672867117,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
Output 3:
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402672868600,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}
{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TRICK","ETS":1402673179313,"VALUE":{"KEY":"FAST_FORWARD"}},"HOST":"XXX"}
You can use JsonLoader and JsonStorage. See this article - http://joshualande.com/read-write-json-apache-pig.
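-- The schema must describe the fields of each record; KEYPRESS, TUNE and TRICK are values of EVT.NAME, not field names.
table = LOAD 'file.json'
    USING JsonLoader('PV:chararray, DEV:(DEV_ID:chararray), EVT:(NAME:chararray, ETS:long, VALUE:(KEY:chararray)), HOST:chararray');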
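To make the intended routing concrete, here is a minimal sketch in Python rather than Pig. It groups the raw lines by EVT.NAME and keeps each line byte-for-byte, so the output retains the original JSON format. The function name is hypothetical, and the sample records are taken from the question.

```python
import json
from collections import defaultdict

def split_by_event_name(lines):
    """Group raw JSON lines by the value of EVT.NAME, keeping each line as-is."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        groups[record["EVT"]["NAME"]].append(line)
    return dict(groups)

sample = [
    '{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672866844,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}',
    '{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"TUNE","ETS":1402672867117,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}',
    '{"PV":"1.0","DEV":{"DEV_ID":"P0100011103"},"EVT":{"NAME":"KEYPRESS","ETS":1402672868888,"VALUE":{"KEY":"PLAY"}},"HOST":"XXX"}',
]
groups = split_by_event_name(sample)
for name, records in groups.items():
    # Writing each group to its own file (for example KEYPRESS.json) would
    # complete the split; printing keeps this sketch free of side effects.
    print(name, len(records))
```

In Pig, the equivalent would be to filter or SPLIT the loaded relation on EVT.NAME and STORE each branch with JsonStorage.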