Passing a parameter with a white space - apache-pig

When I run my script with the command shown below, with the police_force parameter set to "Surrey Police", it gives me the error:
"ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. File not found: Police"
If I pass the value as "Surrey_Police" instead, the script runs fine but returns nothing.
-- knownvalues: dataset presented
-- date1: one date of comparison
-- date2: 2nd date of comparison
-- police_force: value matched against the fallswithin field
-- crime_type
-- Usage: exec -param knownvalues='/user/cw/input/all.txt' -param date1='2017-05' -param date2='2017-06' -param police_force="Surrey Police" /home/xiaorui/CW/compare_crime.pig
knownvalues = LOAD '$knownvalues' USING PigStorage(',') AS (crimeid:chararray,month:chararray,reportedby:chararray,fallswithin:chararray,longitude:float,latitude:float,location:chararray,lsoacode:chararray,lsoaname:chararray,crimetype:chararray,lastoutcome:chararray,context:chararray);
knownvalues = SAMPLE knownvalues 0.00001;
location = FILTER knownvalues BY (fallswithin MATCHES $police_force);
first_date = FILTER location BY (month MATCHES '$date1');
second_date = FILTER location BY (month MATCHES '$date2');
DUMP first_date;
If I use the line below instead, the code works as intended:
location = FILTER knownvalues BY (fallswithin MATCHES 'Surrey Police');

I got it working with the steps below.
a.) First, the police_force parameter in the FILTER statement should be enclosed in single quotes, like this:
location = FILTER knownvalues BY (fallswithin MATCHES '$police_force');
b.) Second, the space in the value must be escaped with a backslash (\) in addition to enclosing it in either single or double quotes in the execution command:
pig -x local -param knownvalues='/home/ec2-user/data' -param police_force="Surrey\ Police" /home/ec2-user/test.pig
or
pig -x local -param knownvalues='/home/ec2-user/data' -param police_force='Surrey\ Police' /home/ec2-user/test.pig
Below is my test code and the commands I used.
Pig input data file (cat data):
mary,19
john,18
joe,18
Surrey Police,20
Pig sample code (cat test.pig):
knownvalues = LOAD '$knownvalues' USING PigStorage(',') AS (name:chararray,age:int);
dump knownvalues;
describe knownvalues;
location = FILTER knownvalues BY (name MATCHES '$police_force');
dump location;
describe location;
Output:
After load:
(mary,19)
(john,18)
(joe,18)
(Surrey Police,20)
knownvalues: {name: chararray,age: int}
After filter:
(Surrey Police,20)
location: {name: chararray,age: int}
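To see why the escape matters, here is a rough illustration using Python's shlex module, which follows shell-style word-splitting rules similar to how the command line gets tokenized before Pig ever sees the parameter. This is an analogy only, not Pig's actual parameter parser:

```python
import shlex

# Without escaping, the space splits the value into two tokens, so
# "Police" arrives as a stray argument (hence "File not found: Police").
print(shlex.split('pig -param police_force=Surrey Police script.pig'))
# ['pig', '-param', 'police_force=Surrey', 'Police', 'script.pig']

# With the space escaped, the value survives as a single token.
print(shlex.split(r'pig -param police_force=Surrey\ Police script.pig'))
# ['pig', '-param', 'police_force=Surrey Police', 'script.pig']
```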

Related

Pig Load with Schema giving error

I have a file called data_tuple_bag.txt on hdfs with the following content:
10,{(1,2),(2,3)}
11,{(4,5),(6,7)}
I am creating a relation as below:
D = LOAD '/user/pig_demo/data_tuple_bag.txt' AS (f1:int,B:{T:(t1:int,t2:int)});
When I DUMP it, I get ACCESSING_NON_EXISTENT_FIELD 2 time(s) as well as FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s), and an empty output.
I changed the relation to:
D = LOAD '/user/pig_demo/data_tuple_bag.txt' USING PigStorage(',') AS (f1:int,B:{T:(t1:int,t2:int)});
Now it only gives FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s), and the output is:
(10,)
(11,)
I have another file, data_only_bag.txt, with the following in it:
{(1,2),(2,3)}
{(4,5),(6,7)}
The relation is defined as:
A = LOAD '/user/pig_demo/data_only_bag.txt' AS (B:{T:(t1:int,t2:int)});
And it works.
Now I update data_only_bag.txt as below:
10,{(1,2),(2,3)}
11,{(4,5),(6,7)}
And the relation is:
A = LOAD '/user/pig_demo/data_only_bag.txt' AS (f1:int,B:{T:(t1:int,t2:int)});
I am getting:
(,)
(,)
When I DUMP it, I again get ACCESSING_NON_EXISTENT_FIELD 2 time(s) as well as FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s), and an empty output.
Now I update the relation to:
A = LOAD '/user/pig_demo/data_only_bag.txt' USING PigStorage(',') AS (f1:int,B:{T:(t1:int,t2:int)});
Again it only gives FIELD_DISCARDED_TYPE_CONVERSION_FAILED 2 time(s), with the output:
(10,)
(11,)
Same as before.
Can anybody tell me what I am doing wrong here?
Thanks in advance.
Pig failed to parse the input with the provided schema.
Try this instead:
D = LOAD '/user/pig_demo/data_tuple_bag.txt' USING PigStorage(',')
AS (f1:int, B: {T1: (t1:int, t2:int),T2: (t1:int, t2:int)});
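The underlying issue can be seen with a plain string split. A comma delimiter does not respect the braces of the bag literal, so the line gets cut inside the bag before any type conversion happens. This is a rough Python illustration of the tokenization, not Pig's actual parser:

```python
line = '10,{(1,2),(2,3)}'

# Naive comma splitting, analogous to PigStorage(',') tokenizing the line
# before type conversion: the commas inside the bag also split it apart.
fields = line.split(',')
print(fields)  # ['10', '{(1', '2)', '(2', '3)}'] - the bag literal is broken

# With a tab delimiter (PigStorage's default), the bag stays in one field,
# which is why the bag-only file loaded fine without PigStorage(',').
tab_line = '10\t{(1,2),(2,3)}'
print(tab_line.split('\t'))  # ['10', '{(1,2),(2,3)}']
```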

Apache Pig reading name value pairs in data file

I have a sample Pig script that reads a CSV file and dumps it to the screen; however, my data contains name-value pairs. How can I read in a line of name-value pairs and split each pair, using the name for the field and the value for the value?
data:
1,Smith,Bob,Business Development
2,Doe,John,Developer
3,Jane,Sally,Tester
script:
data = LOAD 'example-data.txt' USING PigStorage(',')
AS (id:chararray, last_name:chararray,
first_name:chararray, role:chararray);
DESCRIBE data;
DUMP data;
output:
data: {id: chararray,last_name: chararray,first_name: chararray,role: chararray}
(1,Smith,Bob,Business Development)
(2,Doe,John,Developer)
(3,Jane,Sally,Tester)
However, given the following input (as name-value pairs), how could I process the data to get the same "data object"?
id=1,last_name=Smith,first_name=Bob,role=Business Development
id=2,last_name=Doe,first_name=John,role=Developer
id=3,last_name=Jane,first_name=Sally,role=Tester
Refer to STRSPLIT:
A = LOAD 'example-data.txt' USING PigStorage(',') AS (f1:chararray,f2:chararray,f3:chararray, f4:chararray);
B = FOREACH A GENERATE
FLATTEN(STRSPLIT(f1,'=',2)) as (n1:chararray,v1:chararray),
FLATTEN(STRSPLIT(f2,'=',2)) as (n2:chararray,v2:chararray),
FLATTEN(STRSPLIT(f3,'=',2)) as (n3:chararray,v3:chararray),
FLATTEN(STRSPLIT(f4,'=',2)) as (n4:chararray,v4:chararray);
C = FOREACH B GENERATE v1,v2,v3,v4;
DUMP C;
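The same idea in plain Python, splitting each pair on the first '=' only, mirroring what STRSPLIT(field, '=', 2) does (illustrative only, not the Pig execution):

```python
line = 'id=1,last_name=Smith,first_name=Bob,role=Business Development'

# Split the record on commas, then each pair on the first '=' only,
# so a value containing '=' would still survive intact.
pairs = dict(field.split('=', 1) for field in line.split(','))
print(pairs['last_name'], pairs['role'])  # Smith Business Development
```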

I have a seemingly simple Pig generate and then filter issue

I am trying to run a simple Pig script on a simple csv file and I can not get FILTER to do what I want. I have a test.csv file that looks like this:
john,12,44,,0
bob,14,56,5,7
dave,13,40,5,5
jill,8,,,6
Here is my script that does not work:
people = LOAD 'hdfs:/whatever/test.csv' using PigStorage(',');
data = FOREACH people GENERATE $0 AS name:chararray, $1 AS first:int, $4 AS second:int;
filtered = FILTER data BY first == 13;
DUMP filtered;
When I DUMP data, everything looks good: I get the name and the first and last integers as expected. When I DESCRIBE data, everything looks good too:
data: {name: bytearray,first: int,second: int}
When I try to filter the data by the first value being 13, I get nothing: DUMP filtered simply returns nothing. Oddly enough, if I change the condition to first > 13, then all "rows" print out.
However, this script works:
peopletwo = LOAD 'hdfs:/whatever/test.csv' using PigStorage(',') AS (f1:chararray,f2:int,f3:int,f4:int,f5:int);
datatwo = FOREACH peopletwo GENERATE $0 AS name:chararray, $1 AS first:int, $4 AS second:int;
filteredtwo = FILTER datatwo BY first == 13;
DUMP filteredtwo;
What is the difference between filteredtwo and filtered (or between data and datatwo, for that matter)? I want to know why the relation obtained using GENERATE (i.e. data) won't filter in the first script as one would expect.
Specify the datatype in the LOAD itself. See below:
people = LOAD 'test5.csv' USING PigStorage(',') as (f1:chararray,f2:int,f3:int,f4:int,f5:int);
filtered = FILTER people BY f2 == 13;
DUMP filtered;
Output
Changing the filter to use > gives
filtered = FILTER people BY f2 > 13;
Output
EDIT
When converting from bytearray you will have to explicitly cast the value of the fields in the FOREACH. This works:
people = LOAD 'test5.csv' USING PigStorage(',');
data = FOREACH people GENERATE $0 AS name:chararray,(int)$1 AS f1,(int)$4 AS f2;
filtered = FILTER data BY f1 == 13;
DUMP filtered;
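The failure mode resembles comparing across types in other languages. Here is a loose Python analogy (Pig's bytearray comparison semantics differ in detail, so treat this only as intuition for why the uncast field never equals 13):

```python
# Without a declared schema, the loaded field effectively holds raw text.
raw = "13"

# Equality across types never matches, so the == 13 filter drops every row.
print(raw == 13)       # False

# Once the value is explicitly cast, the comparison behaves as expected.
print(int(raw) == 13)  # True
```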

In Pig Latin, I am not able to load data as multiple tuples, please advise

I am not able to load the data as multiple tuples, and I am not sure what mistake I am making. Please advise.
data.txt
vineet 1 pass Govt
hisham 2 pass Prvt
raj 3 fail Prvt
I want to load them as 2 tuples:
A = LOAD 'data.txt' USING PigStorage('\t') AS (T1:tuple(name:bytearray, no:int), T2:tuple(result:chararray, school:chararray));
OR
A = LOAD 'data.txt' USING PigStorage('\t') AS (T1:(name:bytearray, no:int), T2:(result:chararray, school:chararray));
dump A;
The data below is displayed as empty tuples; I don't know why I am not able to read the actual data from data.txt.
(,)
(,)
(,)
As the input data is not stored as tuples, we won't be able to read it directly into a tuple.
One feasible approach is to read the data and then form tuples from the required fields.
Pig script:
A = LOAD 'a.csv' USING PigStorage('\t') AS (name:chararray,no:int,result:chararray,school:chararray);
B = FOREACH A GENERATE (name,no) AS T1:tuple(name:chararray, no:int), (result,school) AS T2:tuple(result:chararray, school:chararray);
DUMP B;
Input: a.csv
vineet 1 pass Govt
hisham 2 pass Prvt
raj 3 fail Prvt
Output of DUMP B:
((vineet,1),(pass,Govt))
((hisham,2),(pass,Prvt))
((raj,3),(fail,Prvt))
Output of DESCRIBE B:
B: {T1: (name: chararray,no: int),T2: (result: chararray,school: chararray)}
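The reshaping step can be sketched in plain Python: read the four flat fields first, then regroup them into the two tuples, mirroring GENERATE (name,no) AS T1, (result,school) AS T2 (illustrative only, not Pig itself):

```python
line = 'vineet\t1\tpass\tGovt'

# Split the tab-delimited record into flat fields, then nest them
# into (T1, T2) the way the FOREACH ... GENERATE above does.
name, no, result, school = line.split('\t')
record = ((name, int(no)), (result, school))
print(record)  # (('vineet', 1), ('pass', 'Govt'))
```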

How can I ignore " (double quotes) while loading a file in Pig?

I have the following data in a file:
"a","b","1","2"
"a","b","4","3"
"a","b","3","1"
I am reading this file using the command below:
File1 = LOAD '/path' using PigStorage (',') as (f1:chararray,f2:chararray,f3:int,f4:int)
But it is ignoring the data in fields 3 and 4.
How can I read this file correctly, or make Pig skip the '"' characters?
Additional information: I am using Apache Pig version 0.10.0.
You may use the REPLACE function (it won't be done in one pass, though):
file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate $0 as f1:chararray, $1 as f2:chararray, (int)REPLACE($2, '\\"', '') as f3, (int)REPLACE($3, '\\"', '') as f4;
You may also use regexes with REGEX_EXTRACT:
file1 = load 'your.csv' using PigStorage(',');
data = foreach file1 generate $0, $1, REGEX_EXTRACT($2, '([0-9]+)', 1), REGEX_EXTRACT($3, '([0-9]+)', 1);
Of course, you could strip the " from f1 and f2 the same way.
Try the below (no need to escape or replace double quotes):
using org.apache.pig.piggybank.storage.CSVExcelStorage()
If you have Jython installed, you can deploy a simple UDF to accomplish the job.
Python UDF:
#!/usr/bin/env python
'''
udf.py
'''
# @outputSchema tells Pig the schema of the UDF's return value
@outputSchema("out:chararray")
def formatter(item):
    chars = 'abcdefghijklmnopqrstuvwxyz'
    nums = '1234567890'
    output = None
    # strip the surrounding double quotes
    new_item = item.split('"')[1]
    if new_item in chars:
        output = str(new_item)
    elif new_item in nums:
        output = int(new_item)
    return output
Pig script:
REGISTER 'udf.py' USING jython as udf;
data = load 'file' USING PigStorage(',') AS (col1:chararray, col2:chararray,
col3:chararray, col4:chararray);
out = foreach data generate udf.formatter(col1) as a, udf.formatter(col3) as b;
dump out;
Output:
(a,1)
(a,4)
(a,3)
How about using REPLACE, if the case is this simple?
data = LOAD 'YOUR_DATA' Using PigStorage(',') AS (a:chararray, b:chararray, c:chararray, d:chararray) ;
new_data = foreach data generate
REPLACE(a, '"', '') AS a,
REPLACE(b, '"', '') AS b,
(int)REPLACE(c, '"', '') AS c:int,
(int)REPLACE(d, '"', '') AS d:int;
One more tip: if you are loading a CSV file, setting a correct number format in an Excel-like tool might also help.
You can use the CSVExcelStorage loader from Pig.
The double quotes in the data are handled by this loader.
You have to register the piggybank jar to use this loader:
Register ${jar_location}/piggybank-0.15.0.jar;
load_data = load '${data_location}' using
org.apache.pig.piggybank.storage.CSVExcelStorage(',');
Hope this helps.
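For intuition, Python's csv module performs the same kind of quote handling that CSVExcelStorage provides: the surrounding double quotes are consumed during parsing, so no REPLACE pass is needed afterwards (illustrative analogy only, not the Pig loader itself):

```python
import csv
import io

raw = '"a","b","1","2"\n"a","b","4","3"\n'

# csv.reader strips the enclosing double quotes while splitting fields,
# much as CSVExcelStorage does when loading the file in Pig.
rows = list(csv.reader(io.StringIO(raw)))
print(rows)  # [['a', 'b', '1', '2'], ['a', 'b', '4', '3']]
```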