Apache PIG, validate input - error-handling

How do I handle bad records in an Apache Pig script? In my case I'm processing a comma-separated file which usually has 14 fields in every row.
But sometimes a row contains a \n, the record is split across two lines, and my Pig script fails to insert this record and all records after it into HBase.
The problem is that the length of the map within the UDF is always 3, probably because of the schema defined in the Pig script. How can I determine whether a record has the same number of fields as the schema?
PIG
REGISTER 'files.py' USING jython AS myfuncs;
A = LOAD '/etl/incoming/test.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
B = FOREACH A {
    GENERATE
        myfuncs.checkFormat(TOTUPLE(*)) AS fields;
}
DUMP B;
UDF
import org.apache.pig.data.DataType as DataType
import org.apache.pig.impl.logicalLayer.schema.SchemaUtil as SchemaUtil
@outputSchema("record:map[]")
def checkFormat(record):
    print(type(record))
    print(record)
    record = list(record)
    print("length: %d" % len(record))  # always returns 3
    return record

You can write validations as Pig UDFs in a variety of languages.
I usually return the same schema with an additional field that signifies validity, and then filter the results twice: once to write the invalid records to an error log and once to continue processing the valid ones.
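As a rough sketch of that idea (not the answerer's actual code), the check can be written in plain Python. It assumes each row is read as one raw line, for example through a single chararray field, so that the declared schema cannot mask the real field count. The names check_format, partition and EXPECTED_FIELDS are made up for illustration.

```python
EXPECTED_FIELDS = 3  # assumption: matches the (name, age, gpa) schema in the question


def check_format(line, delimiter=",", expected=EXPECTED_FIELDS):
    """Return (fields, is_valid) for one raw input line."""
    fields = line.rstrip("\n").split(delimiter)
    return fields, len(fields) == expected


def partition(lines):
    """Split lines into (good, bad) lists based on the validity flag."""
    good, bad = [], []
    for line in lines:
        fields, is_valid = check_format(line)
        (good if is_valid else bad).append(fields)
    return good, bad


if __name__ == "__main__":
    # Hypothetical sample: the second record was broken in two by a stray newline.
    sample = ["alice,30,3.5", "bob,25", "3.1"]
    good, bad = partition(sample)
    print("valid:", good)
    print("rejected:", bad)
```

In a Pig script the same flag would be one extra output field of the UDF, followed by two FILTER statements, one per branch.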

Related

Can't figure out how to insert keys and values of nested JSON data into SQL rows with NiFi

I'm working on a personal project and very new (learning as I go) to JSON, NiFi, SQL, etc., so forgive any confusing language used here or a potentially really obvious solution. I can clarify as needed.
I need to take the JSON output from a website's API call and insert it into a table in my MariaDB local server that I've set up. The issue is that the JSON data is nested, and two of the key pieces of data that I need to insert are used as variable key objects rather than values, so I don't know how to extract it and put it in the database table. Essentially, I think I need to identify different pieces of the JSON expression and insert them as values, but I'm clueless how to do so.
I've played around with the EvaluateJsonPath, SplitJson, and FlattenJson processors in particular, but I can't make it work. All I can ever do is get the result of the whole expression, rather than each piece of it.
{"5381":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":74.0,"tm_def_snp":63.0,"temperature":58.0,"st_snp":8.0,"punts":4.0,"punt_yds":178.0,"punt_lng":55.0,"punt_in_20":1.0,"punt_avg":44.5,"humidity":47.0,"gp":1.0,"gms_active":1.0},
"1023":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":82.0,"tm_def_snp":56.0,"temperature":74.0,"off_snp":82.0,"humidity":66.0,"gs":1.0,"gp":1.0,"gms_active":1.0},
"5300":{"wind_speed":17.0,"tm_st_snp":27.0,"tm_off_snp":80.0,"tm_def_snp":64.0,"temperature":64.0,"st_snp":21.0,"pts_std":9.0,"pts_ppr":9.0,"pts_half_ppr":9.0,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl":4.0,"idp_sack":1.0,"idp_qb_hit":2.0,"humidity":100.0,"gp":1.0,"gms_active":1.0,"def_snp":23.0},
"608":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":53.0,"tm_def_snp":79.0,"temperature":88.0,"st_snp":4.0,"pts_std":5.5,"pts_ppr":5.5,"pts_half_ppr":5.5,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl_ast":1.0,"idp_tkl":5.0,"humidity":78.0,"gs":1.0,"gp":1.0,"gms_active":1.0,"def_snp":56.0},
"3396":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":60.0,"tm_def_snp":70.0,"temperature":63.0,"st_snp":19.0,"off_snp":13.0,"humidity":100.0,"gp":1.0,"gms_active":1.0}}
This is a snapshot of an output with a couple thousand lines. Each of the numeric keys that you see above (5381, 1023, 5300, etc.) is a player ID for the stats that follow it. I have a table set up with three columns: Player ID, Stat ID, and Stat Value. For example, I need that first snippet to be inserted into my table as such:
Player ID   Stat ID      Stat Value
5381        wind_speed   4.0
5381        tm_st_snp    26.0
5381        tm_off_snp   74.0
And so on, for each piece of data. But I don't know how to have NiFi select the right pieces of data to insert into the right columns.

I believe it's possible to use Jolt to transform your JSON into this format:
[
  {"playerId":"5381", "statId":"wind_speed", "statValue": 0.123},
  {"playerId":"5381", "statId":"tm_st_snp", "statValue": 0.456},
  ...
]
and then use PutDatabaseRecord with a JSON reader.
Another approach is to use the ExecuteGroovyScript processor.
Add a new property named SQL.mydb to it and link it to your DBCP controller service.
Then use the following script as the Script Body property:
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder
def ff = session.get()
if (!ff) return
//read flow file content and parse it
def body = ff.read().withReader("UTF-8") { reader ->
    new JsonSlurper().parse(reader)
}
def results = []
//use defined sql connection to create a batch
SQL.mydb.withTransaction {
    def cmd = 'insert into mytable(playerId, statId, statValue) values(?,?,?)'
    results = SQL.mydb.withBatch(100, cmd) { statement ->
        //run through all keys/subkeys in flow file body
        body.each { pid, keys ->
            keys.each { k, v ->
                statement.addBatch(pid, k, v)
            }
        }
    }
}
//write results as a new flow file content
ff.write("UTF-8") { writer ->
    new JsonBuilder(results).writeTo(writer)
}
//transfer to success
REL_SUCCESS << ff
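For anyone who wants to check the reshaping outside NiFi first, here is a minimal Python sketch of the same nested loop: every (player ID, stat ID, value) triple becomes one row. The function name flatten_stats and the shortened sample are assumptions for illustration only. The actual insert would still be done by PutDatabaseRecord or the Groovy script.

```python
import json


def flatten_stats(payload):
    """Turn {player_id: {stat_id: value}} into a list of row tuples."""
    rows = []
    for player_id, stats in payload.items():
        for stat_id, value in stats.items():
            rows.append((player_id, stat_id, value))
    return rows


if __name__ == "__main__":
    # Shortened sample with the same shape as the API output in the question.
    text = '{"5381": {"wind_speed": 4.0, "tm_st_snp": 26.0}, "1023": {"gp": 1.0}}'
    for row in flatten_stats(json.loads(text)):
        print(row)
```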

How to ignore "," in data fields

I am trying to generate the following ...
Input
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124440112951296,"00:00 #MAW",WesleyBitton
A = LOAD '/user/root/data/tweets.csv' USING PigStorage(',') as (users:chararray, tweets:chararray);
B = FILTER A by users == '396124436476092416';
Output (truncated)
(396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift)
Expected output
(396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse")
I do not want to read each row as a single line.

You can use CSVLoader to load the data.
However, if you do not wish to do that, here is a workaround in Apache Pig itself:
-- Load your data
A = LOAD 'your/path/users.csv' USING TextLoader() AS (unparsed:chararray);
-- Replace each " with | so that the tweet text is separated from the other fields
B = FOREACH A GENERATE REPLACE(unparsed, '\\"', '|') AS parsed:chararray;
-- Store the temporary parsed data
STORE B INTO 'your/path/parsed_users.csv' USING PigStorage('|');
-- Load the parsed data
C = LOAD 'your/path/parsed_users.csv' USING PigStorage('|') AS (users:chararray, tweets:chararray);
-- Dump the data. The fields will still contain one extra comma (,), which you can remove with another REPLACE call.
DUMP C;

This fits the CSV standard, so you just need to use CSVLoader, which
supports double-quoted fields that contain commas and other
double-quotes escaped with backslashes.
This is how to use it:
register file:/home/hadoop/lib/pig/piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
A = LOAD '/user/root/data/tweets.csv' USING CSVLoader AS (users:chararray, tweets:chararray);
B = FILTER A by users == '396124436476092416';
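To see why a quote-aware loader matters, here is a small Python sketch that compares a naive split on ',' with the standard csv module. It only illustrates the parsing rule and is not what CSVLoader does internally. The sample row is a made-up, shortened stand-in for the tweets file.

```python
import csv
import io


def parse_rows(text):
    """Parse CSV text, honoring double-quoted fields that contain commas."""
    return list(csv.reader(io.StringIO(text)))


if __name__ == "__main__":
    line = '1,"a gift, but also a curse",some_user\n'  # hypothetical row
    print(line.strip().split(","))  # naive split cuts the quoted field in two
    print(parse_rows(line)[0])      # quote-aware parse keeps it whole
```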

REGEX_EXTRACT error in PIG

I have a CSV file with 3 columns: tweetid, tweet, and Userid. However, the tweet column itself contains commas.
Example of one row of data:
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
I want to extract all 3 fields individually, but REGEX_EXTRACT is giving me an error with this code:
a = LOAD tweets USING PigStorage(',') AS (f1,f2,f3);
b = FILTER a BY REGEX_EXTRACT(f1,'(.*)\\"(.*)',1);
The error is:
error: Filter's condition must evaluate to boolean.

In the use case shared, reading the data with PigStorage(',') loses the last field value (savava143), because the comma inside the tweet is treated as a field separator.
A = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(',') AS (f1,f2,f3);
DUMP A;
Output of A (note that the last field value is missing):
(396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.")
To extract all the values from a CSV file whose field values contain ',', we can use either CSVExcelStorage or CSVLoader.
Approach 1: Using CSVExcelStorage
Ref: http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
Input: a.csv
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
Pig script:
REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (f1,f2,f3);
DUMP A;
Output of A:
(396124437168537600,I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.,savava143)
Approach 2: Using CSVLoader
Ref: http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
The script below uses CSVLoader(); DUMP A produces the same output as above.
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (f1,f2,f3);

The error is that you do not want to FILTER based on a regex but to GENERATE new fields based on a regex. A filter needs to know whether each line should be kept, hence the boolean requirement.
Therefore, you have to use:
b = FOREACH a GENERATE REGEX_EXTRACT(FIELD, REGEX, GROUP_INDEX_TO_RETURN);
However, as @Murali Rao said, your values are not just comma-separated but CSV (think about how you would handle a comma in a tweet: it is not a field separator, just content).
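The difference between extracting a value and testing a condition can be shown with a short Python sketch. The pattern and the function names extract_fields and is_well_formed are assumptions for illustration, based on the id,"tweet",user layout from the question. A real CSV parser is still the more robust choice.

```python
import re

# Assumption: each line looks like id,"tweet text",user
PATTERN = re.compile(r'^(\d+),"(.*)",(\w+)$')


def extract_fields(line):
    """Return (tweet_id, tweet, user), or None when the line does not match."""
    match = PATTERN.match(line)
    return match.groups() if match else None


def is_well_formed(line):
    """Boolean predicate: the kind of expression a filter step requires."""
    return PATTERN.match(line) is not None


if __name__ == "__main__":
    line = '396124437168537600,"I really wish, honestly",savava143'  # shortened sample
    print(extract_fields(line))
    print(is_well_formed(line))
```

The first function plays the role of REGEX_EXTRACT inside a GENERATE, and the second plays the role of a FILTER condition.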

How to filter 'NaN' in Pig

I have data that has some rows that look like this:
(1655,var0,var1,NaN)
The first column is an ID, the second and third come from the correlation. The fourth column is the correlation value (from using the COR function).
I would like to filter these rows.
From the Apache Pig documentation, I was under the impression that NaN is equivalent to a null. Therefore I added this to my code:
filter_corr = filter correlation by (corr IS NOT NULL);
This obviously did not work since apparently Pig does not treat null and NaN in the same way.
I would like to know what is the correct way to filter NaN since it is not clear from the Pig documentation.
Thanks!

Alternatively, you could declare the column as chararray in your schema and filter with not matches 'NaN'.
Or, if you want to replace the NaNs with something else, declare the column as chararray as above and then:
Data = FOREACH Data GENERATE ..., (correlation matches 'NaN' ? 0.0 : (double) correlation), ...
I hope this helps, good luck ;)

You could read in the data as one chararray line and then use a UDF to parse the rows. I made a dataset that looks like this:
1665,var0,var1,NaN
1453,var2,var3,5.432
3452,var4,var5,7.654
8765,var6,var7,NaN
Create UDF
#!/usr/bin/env python
# -*- coding: utf-8 -*-
### name of file: udf.py ###
@outputSchema("id:int, col2:chararray, col3:chararray, corr:float")
def format_input(line):
    parsed = line.split(',')
    # Replace a trailing 'NaN' with None so that Pig sees a null
    if parsed[len(parsed) - 1] == 'NaN':
        parsed.pop()
        parsed.append(None)
    return tuple(parsed)
Then in the pig shell
$ pig -x local
grunt>
/* register udf */
register 'udf.py' using jython as udf;
data = load 'file' as (line:chararray);
A = foreach data generate FLATTEN(udf.format_input(line));
filtered = filter A by corr is not null;
dump filtered;
output
(1453,var2,var3,5.432)
(3452,var4,var5,7.654)
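The same parse-and-filter logic can be tried in plain Python before registering anything in Pig. This is only a sketch: the helper names parse_corr and drop_nan_rows are made up, and it converts the correlation to a float and checks it with math.isnan instead of comparing strings.

```python
import math


def parse_corr(text):
    """Convert a correlation field to float, mapping NaN to None."""
    value = float(text)  # float("NaN") parses to a real NaN value
    return None if math.isnan(value) else value


def drop_nan_rows(lines):
    """Keep only rows whose last field is a usable number."""
    rows = []
    for line in lines:
        *head, corr = line.split(",")
        value = parse_corr(corr)
        if value is not None:
            rows.append((*head, value))
    return rows


if __name__ == "__main__":
    data = [
        "1665,var0,var1,NaN",
        "1453,var2,var3,5.432",
        "3452,var4,var5,7.654",
        "8765,var6,var7,NaN",
    ]
    for row in drop_nan_rows(data):
        print(row)
```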

I've gone with this solution:
filter_corr = filter data by (corr != 'NaN');
data1 = foreach filter_corr generate ID, (double)corr as double_corr;
I renamed the column and changed its data type from chararray to double.
I appreciate the responses, but I cannot use UDFs during prototyping due to a limitation in the UI that I am using (Cloudera).

Apache Pig: Merging list of attributes into a single tuple

I receive data in the form
id1|attribute1a,attribute1b|attribute2a|attribute3a,attribute3b,attribute3c....
id2||attribute2b,attribute2c|..
I'm trying to merge it all into a form where I just have a bag of tuples of an id field followed by a tuple containing a list of all my other fields merged together.
(id1,(attribute1a,attribute1b,attribute2a,attribute3a,attribute3b,attribute3c...))
(id2,(attribute2b,attribute2c...))
Currently I fetch it like
my_data = load '$input' USING PigStorage('|') as
(id:chararray, attribute1:chararray, attribute2:chararray)...
then I've tried all combinations of FLATTEN, TOKENIZE, GENERATE, TOTUPLE, BagConcat, etc. to massage it into the form I want, but I'm new to Pig and just can't figure it out. Can anyone help? Any open source UDF libraries are fair game.

Load each line as an entire string, and then use the built-in STRSPLIT UDF to achieve the desired result. This relies on there being no tabs in your list of attributes, and assumes that | and , are not to be treated any differently when separating out the attributes. Also, I modified your input a little bit to show more edge cases.
input.txt:
id1|attribute1a,attribute1b|attribute2a|,|attribute3a,attribute3b,attribute3c
id2||attribute2b,attribute2c,|attribute4a|,attribute5a
test.pig:
my_data = LOAD '$input' AS (str:chararray);
split1 = FOREACH my_data GENERATE FLATTEN(STRSPLIT(str, '\\|', 2)) AS (id:chararray, attr:chararray);
split2 = FOREACH split1 GENERATE id, STRSPLIT(attr, '[,|]') AS attributes;
DUMP split2;
Output of pig -x local -p input=input.txt test.pig:
(id1,(attribute1a,attribute1b,attribute2a,,,attribute3a,attribute3b,attribute3c))
(id2,(,attribute2b,attribute2c,,attribute4a,,attribute5a))
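The two-step split can also be sketched in Python to check the expected result on the sample lines: first cut off the id at the first |, then split the remainder on either , or |. The function name merge_attributes is an assumption for illustration and is not part of Pig.

```python
import re


def merge_attributes(line):
    """Split off the id, then split the remainder on ',' or '|'."""
    record_id, _, rest = line.partition("|")
    return record_id, tuple(re.split(r"[,|]", rest))


if __name__ == "__main__":
    lines = [
        "id1|attribute1a,attribute1b|attribute2a|,|attribute3a,attribute3b,attribute3c",
        "id2||attribute2b,attribute2c,|attribute4a|,attribute5a",
    ]
    for line in lines:
        print(merge_attributes(line))
```

Empty strings in the result correspond to the empty slots visible in the Pig output above. They can be filtered out afterwards if they are not wanted.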
