Group by expression in pig - apache-pig

Consider I have a dataset with tuples (f1, f2). I want to get my data in two bags: one where fi is null and the other where f1 values are not null. I try:
raw = LOAD 'somedata' USING PigStorage() AS (f1:chararray, f2:chararray);
raw_group = GROUP raw BY f1 is null;
raw_count = FOREACH raw_group GENERATE group, COUNT_STAR(raw);
I expect to get two groups with keys true and false. When I run it in grunt I get the following:
2013-12-26 14:56:10,958 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1200: <line 1046, column 25> Syntax error, unexpected symbol at or near 'f1'
I can do a workaround:
raw_group = GROUP raw BY (f1 is null)?0:1;
, but I really like to understand what's going on here, as I just started to learn Pig. According to Pig documentation I can use expressions as a grouping key. Do I miss something here or nulls are treated differently in Pig?

The boolean datatype was introduced in Pig 0.10. The expression f1 is null is a boolean, so it can't appear as a field in a relation, which it would do if it were the value of group. Prior to Pig 0.10, booleans could only be used in FILTER statements or in the ternary operator, as you showed in your workaround.
While I haven't tried this out, presumably if you were to attempt the same thing in Pig 0.10 or later, your original attempt would succeed.

Related

How to resolve this sql error of schema_of_json

I need to find out the schema of a given JSON file, I see sql has schema_of_json function
and something like this works flawlessly
> SELECT schema_of_json('[{"col":0}]');
ARRAY<STRUCT<`col`: BIGINT>>
But if I query for my table name, it gives me the following error
>SELECT schema_of_json(Transaction) as json_data from table_name;
Error in SQL statement: AnalysisException: cannot resolve 'schemaofjson(`Transaction`)' due to data type mismatch: The input json should be a string literal and not null; however, got `Transaction`.; line 1 pos 7;
The Transaction is one of the columns in my table and after checking it manually I can attest that it is of String type(json).
The SQL statement has it to give me the schema of the JSON, how to do it?
after looking further into the documentation that it is clear that the word foldable means that of the static one, and a column from a table JSON won't work
for minimal reroducible example here you go:
SELECT schema_of_json(CAST('{ "a": "b" }' AS STRING))
As soon as the cast is introduced in the above statement, the schema_of_json will fail......... It needs a static JSON as it's input

Regex comparison in Oracle between 2 varchar columns (from different tables)

I am trying to find a way to capture relevant errors from oracle alertlog. I have one table (ORA_BLACKLIST) with column values as below (these are the values which I want to ignore from
V$DIAG_ALERT_EXT)
Below are sample data in ORA_BLACKLIST table. This table can grow based on additional error to ignore from alertlog.
ORA-07445%[kkqctdrvJPPD
ORA-07445%[kxsPurgeCursor
ORA-01013%
ORA-27037%
ORA-01110
ORA-2154
V$DIAG_ALERT_EXT contains a MESSAGE_TEXT column which contains sample text like below.
ORA-01013: user requested cancel of current operation
ORA-07445: exception encountered: core dump [kxtogboh()+22] [SIGSEGV] [ADDR:0x87] [PC:0x12292A56]
ORA-07445: exception encountered: core dump [java_util_HashMap__get()] [SIGSEGV]
ORA-00600: internal error code arguments: [qercoRopRowsets:anumrows]
I want to write a query something like below to ignore the black listed errors and only capture relevant info like below.
select
dae.instance_id,
dae.container_name,
err_count,
dae.message_level
from
ORA_BLACKLIST ob,
V$DIAG_ALERT_EXT dae
where
group by .....;
Can someone suggest a way or sample code to achieve it?
I should have provided the exact contents of blacklist table. It currently contains some regex (perl) and I want to convert it to oracle like regex and compare with v$diag_alert_ext message_text column. Below are sample perl regex in my blacklist table.
ORA-0(,|$| )
ORA-48913
ORA-00060
ORA-609(,|$| )
ORA-65011
ORA-65020 ORA-31(,|$| )
ORA-7452 ORA-959(,|$| )
ORA-3136(,|)|$| )
ORA-07445.[kkqctdrvJPPD
ORA-07445.[kxsPurgeCursor –
Your blacklist table looks like like patterns, not regular expressions.
You can write a query like this:
select dae.* -- or whatever columns you want
from V$DIAG_ALERT_EXT dae
where not exists (select 1
from ORA_BLACKLIST ob
where dae.message_text like ob.<column name>
);
This will not have particularly good performance if the tables are large.

Passing Optional List argument from Django to filter with in Raw SQL

When using primitive types such as Integer, I can without any problems do a query like this:
with connection.cursor() as cursor:
cursor.execute(sql='''SELECT count(*) FROM account
WHERE %(pk)s ISNULL OR id %(pk)s''', params={'pk': 1})
Which would either return row with id = 1 or it would return all rows if pk parameter was equal to None.
However, when trying to use similar approach to pass a list/tuple of IDs, I always produce a SQL syntax error when passing empty/None tuple, e.g. trying:
with connection.cursor() as cursor:
cursor.execute(sql='''SELECT count(*) FROM account
WHERE %(ids)s ISNULL OR id IN %(ids)s''', params={'ids': (1,2,3)})
works, but passing () produces SQL syntax error:
psycopg2.ProgrammingError: syntax error at or near ")"
LINE 1: SELECT count(*) FROM account WHERE () ISNULL OR id IN ()
Or if I pass None I get:
django.db.utils.ProgrammingError: syntax error at or near "NULL"
LINE 1: ...LECT count(*) FROM account WHERE NULL ISNULL OR id IN NULL
I tried putting the argument in SQL in () - (%(ids)s) - but that always breaks one or the other condition. I also tried playing around with pg_typeof or casting the argument, but with no results.
Notes:
the actual SQL is much more complex, this one here is a simplification for illustrative purposes
as a last resort - I could alter the SQL in Python based on the argument, but I really wanted to avoid that.)
At first I had an idea of using just 1 argument, but replacing it with a dummy value [-1] and then using it like
cursor.execute(sql='''SELECT ... WHERE -1 = any(%(ids)s) OR id = ANY(%(ids)s)''', params={'ids': ids if ids else [-1]})
but this did a Full table scan for non empty lists, which was unfortunate, so a no go.
Then I thought I could do a little preprocessing in python and send 2 arguments instead of just the single list- the actual list and an empty list boolean indicator. That is
cursor.execute(sql='''SELECT ... WHERE %(empty_ids)s = TRUE OR id = ANY(%(ids)s)''', params={'empty_ids': not ids, 'ids': ids})
Not the most elegant solution, but it performs quite well (Index scan for non empty list, Full table scan for empty list - but that returns the whole table anyway, so it's ok)
And finally I came up with the simplest solution and quite elegant:
cursor.execute(sql='''SELECT ... WHERE '{}' = %(ids)s OR id = ANY(%(ids)s)''', params={'ids': ids})
This one also performs Index scan for non empty lists, so it's quite fast.
From the psycopg2 docs:
Note You can use a Python list as the argument of the IN operator using the PostgreSQL ANY operator.
ids = [10, 20, 30]
cur.execute("SELECT * FROM data WHERE id = ANY(%s);", (ids,))
Furthermore ANY can also work with empty lists, whereas IN () is a SQL syntax error.

PIg scalar is bigger than 0

I have the following code
Data1 = LOAD '/user/cloudera/Class Ex 2/Data 1' USING PigStorage(',') as (Name:chararray,ID:chararray,text_1:chararray,Grade_1:int,Grade_2:int,Grade_3:int,Grade_4:int);
Data2 = LOAD '/user/cloudera/Class Ex 2/Data 2' USING PigStorage(',') as (Name:chararray,ID:chararray,text_2:chararray,Grade_5:int,Grade_6:int,Grade_7:int,Grade_8:int);
Data_3 = JOIN Data1 BY Data1.ID,Data2 BY Data2.ID;
Data_4 = FOREACH Data_3 GENERATE $0,$1,$2,$3,$4,$5,$6,$9,$10,$11,$12,$13;
Data_5 = FOREACH Data_4 GENERATE
Name,
ID,
text_1,
SIZE(text_1),
REPLACE(text_1,'or',''),
SIZE(REPLACE(text_1,'or','')),
SIZE(text_1)-SIZE(REPLACE(text_1,'or','')),
text_2,
SIZE(text_2),
REPLACE(text_2,'or',''),
SIZE(REPLACE(text_2,'or','')),
SIZE(text_2)-SIZE(REPLACE(text_2,'or','')),
($3+$4+$5+$6+$8+$9+$10+$11)/8;
DESCRIBE Data_5;
STORE Data_5 Into '/user/cloudera/Class Ex 2/Data_output' USING PigStorage(',');
Essentially I have to load 2 sets of data, and then make some basic text statistics and manipulation.
Everything works fine until the last statement, STORE.
When I add it I receive the scalar error.
What am I doing wrong here?
Thanks guys!
First of all, Pig only evaluates the alias' which finally lead to a STORE or a DUMP (this is called lazy evaluation). Hence, your error was always there; It just got caught once you added the STORE statement. Since you have not pasted the full trace, I would think that your error is in the third statement where you try to access the field ID using the dot (.) operator. You need to change it to one of the following:
1) Refer to the field ID directly since only one field called ID in both Data1 and Data2:
Data_3 = JOIN Data1 BY ID, Data2 BY ID;
2) Use :: instead of . if you do need to disambiguate:
Data_3 = JOIN Data1 BY Data1::ID, Data2 BY Data2::ID;
If you want to know why the dot (.) operator caused an error, might help to look at the following question: Getting exception while trying to execute a Pig Latin Script

Casting the output from Flatten and Strsplit in Pig

I am trying to parse a log extract with multiple delimiters with sample data as below using pig
CEF:0|NetScreen|Firewall/VPN||traffic:1|Permit|Low| eventId=5
msg=start_time\="2015-05-20 09:41:38" duration\=0 policy_id\=64
My code is as below:
A = LOAD '/user/cef.csv' USING PigStorage(' ') as
(a:chararray,b:chararray,c:chararray,d:chararray,e:chararray,f:chararray,g:chararray);
B = FOREACH A GENERATE STRSPLIT(SUBSTRING(a, LAST_INDEX_OF(a,'|')+1, (int)SIZE(a)),'=',2),STRSPLIT(b,'=',2),STRSPLIT(c,'=',2),STRSPLIT(d,'=',2),STRSP LIT(e,'=',2),STRSPLIT(f,'=',2),STRSPLIT(g,'=',2);
C = FOREACH B GENERATE FLATTEN($0), FLATTEN($1), FLATTEN($2),FLATTEN($3),FLATTEN($4),FLATTEN($5);
D = FOREACH C GENERATE $2,flatten(STRSPLIT($4,'"',2)),flatten(STRSPLIT($5,'"',2)),$7,$9;
E = FOREACH D GENERATE (int)$0,(chararray)$2,(chararray)$3,(int)$5,(int)$6 as (a:int,b:chararray,c:chararray,D:int,E:int);
Now when i dump E,i get the error
grunt> 2015-05-25 04:06:48,092 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 1031: Incompatable schema: left is
"a:int,b:chararray,c:chararray,D:int,E:int", right is ":int"
I am trying to cast the output of my flatten and strsplit operations into chararray and int.
Please let me know whether this can be done
Thank you for the help!
Your problem is how you use the as clause. Since you place the as after the sixth parameter, it assumes you are trying to specify that schema only for that sixth parameter. Therefore, you are assigning a schema of six fields to only one, hence the error.
Do it like this:
E = FOREACH D GENERATE (int)$0 as a:int,(chararray)$2 as b,(chararray)$3 as c,(int)$5 as d,(int)$6 as e;
However, you are casting 09:41:38" to an int, so it will give you another error once you change it. You need to check again how you are splitting the data.
In my humble opinion, you should try to split the files by their delimiter before processing them in Pig, and then load them with their delimiter and perform an union. If your data is too large, then forget this idea... But your code is going to get too messy if you have several delimiters in the same file.