SUM function in PIG - apache-pig

I'm starting to learn Pig Latin scripting and am stuck on the issue below. I have gone through similar questions on the same topic without any luck. I want to find the SUM of all the age fields.
DUMP X;
(22)
(19)
grunt> DESCRIBE X;
X: {age: int}
I tried several options, such as:
Y = FOREACH ( group X all ) GENERATE SUM(X.age);
But I get the exception below:
Invalid field projection. Projected field [age] does not exist in schema: group:chararray,X:bag{:tuple(age:int)}.
Thanks for your time and help.

I think the Y projection should work as you wrote it. Here's my small example doing the same thing, and it works fine for me.
X = LOAD 'SO/sum_age.txt' USING PigStorage('\t') AS (age:int);
DESCRIBE X;
Y = FOREACH ( group X all ) GENERATE
SUM(X.age);
DESCRIBE Y;
DUMP Y;
So your problem looks strange. I used the following input data:
-bash-4.1$ cat sum_age.txt
22
19
Can you try the script I posted here on the same data?

Related

Extracting Values from Array in Redshift SQL

I have some arrays stored in Redshift table "transactions" in the following format:
id, total, breakdown
1, 100, [50,50]
2, 200, [150,50]
3, 125, [15, 110]
...
n, 10000, [100,900]
Since this format is useless to me, I need to do some processing on this to get the values out. I've tried using regex to extract it.
SELECT regexp_substr(breakdown, '\[([0-9]+),([0-9]+)\]')
FROM transactions
but I get an error returned that says
Unmatched ( or \(
Detail:
-----------------------------------------------
error: Unmatched ( or \(
code: 8002
context: T_regexp_init
query: 8946413
location: funcs_expr.cpp:130
process: query3_40 [pid=17533]
--------------------------------------------
Ideally I would like to get x and y as their own columns so I can do the appropriate math. I know I can do this fairly easily in Python or PHP or the like, but I'm interested in a pure SQL solution, partially because I'm using an online SQL editor (Mode Analytics) to plot it easily as a dashboard.
Thanks for your help!
If breakdown really is an array you can do this:
select id, total, breakdown[1] as x, breakdown[2] as y
from transactions;
If breakdown is not an array but e.g. a varchar column, you can cast it into an array if you replace the square brackets with curly braces:
select id, total,
(translate(breakdown, '[]', '{}')::integer[])[1] as x,
(translate(breakdown, '[]', '{}')::integer[])[2] as y
from transactions;
You can try this:
SELECT REPLACE(SPLIT_PART(breakdown, ',', 1), '[', '') AS x,
       REPLACE(SPLIT_PART(breakdown, ',', 2), ']', '') AS y
FROM transactions;
I tried this on a Redshift database and it worked for me.
Detailed Explanation:
SPLIT_PART(breakdown,',',1) will give you [50.
SPLIT_PART(breakdown,',',2) will give you 50].
REPLACE(SPLIT_PART(breakdown,',',1),'[','') will replace the [ and will give just 50.
REPLACE(SPLIT_PART(breakdown,',',2),']','') will replace the ] and will give just 50.
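The same two steps can be traced outside the database. This is a minimal Python sketch, not Redshift code: `split_part` and `breakdown_to_xy` are hypothetical helpers that only imitate SPLIT_PART and REPLACE so the intermediate values are visible. The final `int()` conversion is an addition of this sketch; the SQL above returns x and y as text.

```python
def split_part(text, delimiter, position):
    """Imitate SQL SPLIT_PART: return the 1-based field, or '' if out of range."""
    parts = text.split(delimiter)
    return parts[position - 1] if 1 <= position <= len(parts) else ""


def breakdown_to_xy(breakdown):
    """Turn a string such as '[50,50]' into a pair of integers."""
    x_raw = split_part(breakdown, ",", 1)   # e.g. '[50'
    y_raw = split_part(breakdown, ",", 2)   # e.g. '50]'
    x_text = x_raw.replace("[", "")         # strip the opening bracket
    y_text = y_raw.replace("]", "")         # strip the closing bracket
    # int() tolerates surrounding spaces, so '[15, 110]' also works.
    return int(x_text), int(y_text)


if __name__ == "__main__":
    for sample in ["[50,50]", "[150,50]", "[15, 110]"]:
        print(sample, breakdown_to_xy(sample))
```

If the numeric values are needed in SQL as well, the text results would have to be cast in the query itself.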
I know it's an old post, but in case someone needs a much easier way:
select json_extract_array_element_text('[100,101,102]', 2);
Output: 102
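Note that the index in that call is zero-based, which is why position 2 returns the third element. A small Python sketch of the same lookup follows; `json_array_element_text` is a hypothetical helper written for illustration, and returning `None` for an out-of-range index is a choice of this sketch rather than a statement about Redshift.

```python
import json


def json_array_element_text(json_text, index):
    """Return the element at a zero-based index of a JSON array, as text."""
    items = json.loads(json_text)
    if 0 <= index < len(items):
        return str(items[index])
    return None  # assumption of this sketch for out-of-range indexes


if __name__ == "__main__":
    print(json_array_element_text("[100,101,102]", 2))
    # For the breakdown column in the question, x is index 0 and y is index 1.
    print(json_array_element_text("[50,50]", 0), json_array_element_text("[50,50]", 1))
```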

Pig: Cast error while grouping data

This is the code that I am trying to run. Steps:
Take an input (there is a .pig_schema file in the input folder)
Take only two fields (chararray) from it and remove duplicates
Group on one of those fields
The code is as follows:
x = LOAD '$input' USING PigStorage('\t'); --The input is tab separated
x = LIMIT x 25;
DESCRIBE x;
-- Output of DESCRIBE x:
-- x: {id: chararray,keywords: chararray,score: chararray,time: long}
distinctCounts = FOREACH x GENERATE keywords, id; -- generate two fields
distinctCounts = DISTINCT distinctCounts; -- remove duplicates
DESCRIBE distinctCounts;
-- Output of DESCRIBE distinctCounts;
-- distinctCounts: {keywords: chararray,id: chararray}
grouped = GROUP distinctCounts BY keywords; --group by keywords
DESCRIBE grouped; --THIS IS WHERE IT GIVES AN ERROR
DUMP grouped;
When I run the GROUP step, it gives the following error:
ERROR org.apache.pig.tools.pigstats.SimplePigStats -
ERROR: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String
keywords is a chararray and Pig should be able to group on a chararray. Any ideas?
EDIT:
Input file:
0000010000014743 call for midwife 23 1425761139
0000010000062069 naruto 1 56 1425780386
0000010000079919 the following 98 1425788874
0000010000081650 planes 2 76 1425721945
0000010000118785 law and order 21 1425763899
0000010000136965 family guy 12 1425766338
0000010000136100 american dad 19 1425766702
.pig_schema file
{"fields":[{"name":"id","type":55},{"name":"keywords","type":55},{"name":"score","type":55},{"name":"time","type":15}]}
Pig is not able to identify the value of keywords as a chararray. It's better to name the fields during the initial load; that way the field types are stated explicitly.
x = LOAD '$input' USING PigStorage('\t') AS (id:chararray,keywords:chararray,score: chararray,time: long);
UPDATE:
I updated the .pig_schema file to include score, used '\t' as the separator, and ran the steps below on the input you shared.
x = LOAD 'a.csv' USING PigStorage('\t');
distinctCounts = FOREACH x GENERATE keywords, id;
distinctCounts = DISTINCT distinctCounts;
grouped = GROUP distinctCounts BY keywords;
DUMP grouped;
I would suggest using unique alias names for better readability and maintainability.
Output:
(naruto 1,{(naruto 1,0000010000062069)})
(planes 2,{(planes 2,0000010000081650)})
(family guy,{(family guy,0000010000136965)})
(american dad,{(american dad,0000010000136100)})
(law and order,{(law and order,0000010000118785)})
(the following,{(the following,0000010000079919)})
(call for midwife,{(call for midwife,0000010000014743)})

Casting the output from Flatten and Strsplit in Pig

I am trying to parse a log extract with multiple delimiters using Pig. Sample data is below:
CEF:0|NetScreen|Firewall/VPN||traffic:1|Permit|Low| eventId=5
msg=start_time\="2015-05-20 09:41:38" duration\=0 policy_id\=64
My code is as below:
A = LOAD '/user/cef.csv' USING PigStorage(' ') as
(a:chararray,b:chararray,c:chararray,d:chararray,e:chararray,f:chararray,g:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2),STRSPLIT(d,'=',2),STRSPLIT(e,'=',2),STRSPLIT(f,'=',2),STRSPLIT(g,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2),FLATTEN($3),FLATTEN($4),FLATTEN($5);
D = FOREACH C GENERATE $2,flatten(STRSPLIT($4,'"',2)),flatten(STRSPLIT($5,'"',2)),$7,$9;
E = FOREACH D GENERATE (int)$0,(chararray)$2,(chararray)$3,(int)$5,(int)$6 as (a:int,b:chararray,c:chararray,D:int,E:int);
Now when I dump E, I get the error:
grunt> 2015-05-25 04:06:48,092 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 1031: Incompatable schema: left is
"a:int,b:chararray,c:chararray,D:int,E:int", right is ":int"
I am trying to cast the output of my flatten and strsplit operations into chararray and int.
Please let me know whether this can be done.
Thank you for the help!
Your problem is how you use the AS clause. Because you place the AS after the fifth (last) expression, Pig assumes you are specifying that schema for that expression only. You are therefore assigning a schema of five fields to a single field, hence the error.
Do it like this:
E = FOREACH D GENERATE (int)$0 as a:int,(chararray)$2 as b,(chararray)$3 as c,(int)$5 as d,(int)$6 as e;
However, you are casting 09:41:38" to an int, so it will give you another error once you change it. You need to check again how you are splitting the data.
In my humble opinion, you should try to split the files by their delimiter before processing them in Pig, then load each part with its own delimiter and perform a union. If your data is too large, forget this idea. But your code is going to get too messy if you have several delimiters in the same file.
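As one possible way to pre-split such a line before loading it into Pig, here is a minimal Python sketch. It assumes the two sample lines from the question form a single record and that every field of interest has the form key\=value, with values either double-quoted or free of spaces. `PAIR` and `parse_msg_fields` are hypothetical names, and this is an illustration, not a complete CEF parser.

```python
import re

# key\=value, where value is a double-quoted string or a run of non-space characters.
PAIR = re.compile(r'(\w+)\\=("[^"]*"|\S+)')


def parse_msg_fields(line):
    """Collect the key\\=value pairs of one log line into a dict."""
    fields = {}
    for key, raw in PAIR.findall(line):
        value = raw.strip('"')  # drop the surrounding quotes, if any
        # Convert purely numeric values to int; leave everything else as text.
        fields[key] = int(value) if value.isdigit() else value
    return fields


if __name__ == "__main__":
    sample = (r'CEF:0|NetScreen|Firewall/VPN||traffic:1|Permit|Low| eventId=5 '
              r'msg=start_time\="2015-05-20 09:41:38" duration\=0 policy_id\=64')
    print(parse_msg_fields(sample))
```

Writing the parsed values out with a single delimiter would let Pig load them with one PigStorage call and a typed schema.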

How to find two strings in a CLOB column?

I've tried many queries to find just one word, and I can't even manage that.
It's a DB2 database; I'm using com.ibm.db2.jcc.DB2Driver.
This query returns data:
select *
from JL_ENR
where id_ws = '002'
and dc_dy_bsn = '2014-08-25'
and ai_trn = 2331
JL_TPE is the CLOB column in which I want to find two strings, within the rows returned by that query (the ones with dc_dy_bsn = '2014-08-25' and ai_trn = 2331).
So first I tried with one:
select
dbms_lob.substr(clob_column,dbms_lob_instr(JL_TPE,'CEMENTO'),1)
from
JL_ENR
where
dbms_lob.instr(JL_TPE,'CEMENTO')>0;
That didn't work.
SELECT * FROM JL_ENR WHERE dbms_lob.instr(JL_TPE,'CEMENTO')>0
and ai_trn = 2331
and dc_dy_bsn = '2014-08-25'
That didn't work.
Select *
From JL_ENR
Where NOT
DBMS_LOB.INSTR(JL_TPE, 'CEMENTO', 1, 1) = 0;
That didn't work either.
Could someone explain to me how to find two strings, please?
Or a tutorial link where it is explained how to make it work...
Thanks.
Can you provide some sample data and the version you are using? Your example should work (tested on v10.5.0.1):
db2 "create table test ( x int, y clob(1M) )"
db2 "insert into test (x,y) values (1,cast('The string to find is CEMENTO, how do we do that?' as clob))"
db2 "insert into test (x,y) values (2,cast('The string to find is CEMENT, how do we do that?' as clob))"
db2 "select x, DBMS_LOB.INSTR(y, 'CEMENTO', 1) from test where DBMS_LOB.INSTR(y, 'CEMENTO', 1) > 0"
X 2
----------- -----------
1 23
1 record(s) selected.
I had to search for a specific value in the where clause. I used TEXTBLOB LIKE '%Search value%' and it worked! This was for db2 in a CLOB(536870912) column.

Error 1045 on sum function in pig latin with an int

The following pig latin script:
data = load 'access_log_Jul95' using PigStorage(' ') as (ip:chararray, dash1:chararray, dash2:chararray, date:chararray, date1:chararray, getRequset:chararray, location:chararray, http:chararray, code:int, size:int);
splitDate = foreach data generate size as size:int , ip as ip, FLATTEN(STRSPLIT(date, ':')) as h;
groupedIp = group splitDate by h.$1;
a = foreach groupedIp{
added = foreach splitDate generate SUM(size); --
generate added;
};
describe a;
gives me the error:
ERROR 1045:
<file 3.pig, line 10, column 39> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
This error makes me think I need to cast size as an int, but if I describe my groupedIp relation, I get the following schema.
groupedIp: {group: bytearray,splitDate: {(size: int,ip: chararray,h: bytearray)}} which indicates that size is an int, and should be able to be used by the sum function.
Am I calling the sum function incorrectly? Let me know if you would like to see anything else, such as the input file.
SUM operates on a bag as input, but you pass it the field 'size'.
Try to eliminate the nested foreach and use:
a = foreach groupedIp generate SUM(splitDate.size);
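To make the "bag" idea concrete, here is a small Python model of what GROUP followed by SUM(bag.field) computes. It is only an illustration: `group_by`, `sum_field`, and the sample rows are hypothetical, and `None` stands in for a Pig null, which this sketch skips in a way similar to how Pig's SUM ignores nulls.

```python
from collections import defaultdict


def group_by(rows, key):
    """Model of GROUP ... BY: map each key value to the bag (list) of its rows."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key]].append(row)
    return dict(bags)


def sum_field(bag, field):
    """Model of SUM(bag.field): add up one column of a bag, skipping nulls."""
    return sum(row[field] for row in bag if row[field] is not None)


# Hypothetical rows: h is the hour taken from the date, size the response size.
rows = [
    {"h": "00", "size": 100},
    {"h": "00", "size": 250},
    {"h": "01", "size": None},
    {"h": "01", "size": 40},
]

totals = {hour: sum_field(bag, "size") for hour, bag in group_by(rows, "h").items()}

if __name__ == "__main__":
    print(totals)
```

The aggregate receives the whole list of rows for a key, never a single row's value, which is the distinction the ERROR 1045 message is pointing at.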
Do some dumps of your data. I'll bet some of the values in the size column are non-integer, and Pig runs into them and dies. You could also write your own isInteger UDF to check this before the rest of your processing, and throw out any values that aren't integers.
SUM, AVG and COUNT are functions that always work on a bag, so group the data first and then apply the function to the grouped bag, as below:
A = load 'nyse_data.txt' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
G = group A by symbol;
C = foreach G generate group, SUM(A.open);