IN clause in pig - apache-pig

user1,action,aa
user2,comedy,cc
user3,drama,dd
user4,action,aa
user5,action,aa
user6,comedy,cc
user7,action,aa
user8,comedy,cc
user9,drama,dd
user10,action,aa
user11,action,aa
user12,comedy,cc
I want to replace every 'aa' with 'bb' when the corresponding role is one of (action, comedy). I couldn't find CASE statement support or any other approach for this.

Assuming your data is loaded into the relation A, use a CASE expression (supported in Pig 0.12 and later) to check $1, the second field of A. When it is 'action' or 'comedy', replace 'aa' with 'bb' in $2; otherwise keep $2 unchanged.
B = FOREACH A GENERATE $0, $1,
(CASE
WHEN ($1 == 'action' OR $1 == 'comedy') THEN REPLACE($2, 'aa', 'bb')
ELSE $2 END);
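To sanity-check the expected result outside Pig, the same rule can be modelled in a few lines of Python. This is only a sketch of the logic, not Pig code: the `rows` list and the `replace_code` helper are hypothetical, and the fourth row is an invented case added to show that a non-matching role keeps its 'aa'.

```python
# Hypothetical in-memory stand-in for relation A: (user, role, code) rows.
rows = [
    ("user1", "action", "aa"),
    ("user2", "comedy", "cc"),
    ("user3", "drama", "dd"),
    ("user4", "drama", "aa"),  # invented row: role not in the list, so 'aa' must stay
]


def replace_code(row, roles=("action", "comedy")):
    """Replace 'aa' with 'bb' in the third field when the role is in `roles`."""
    user, role, code = row
    if role in roles:
        code = code.replace("aa", "bb")
    return (user, role, code)


result = [replace_code(r) for r in rows]
for r in result:
    print(r)
```

Only the first row changes; the comedy row has no 'aa' to replace and the drama rows fall through to the ELSE branch.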

Related

Case statement handling logic differently than expected

I'm trying to assign a status based on the number of IDs using a metric. This is the query I've written (and it works):
select
x.yyyy_mm_dd,
x.prov_id,
x.app,
x.metric,
x.is_100,
case
when ((x.is_100 = 'true') or size(collect_set(x.list)) >10) then 'implemented'
when ((x.is_100 = 'false') and size(collect_set(x.list)) between 1 and 10) then 'first contact'
else 'no contact'
end as impl_status,
size(collect_set(x.list)) as array_size,
collect_set(x.list) as list
from(
select
yyyy_mm_dd,
prov_id,
app,
metric,
is_100,
list
from
my_table
lateral view explode(ids) e as list
) x
group by
1,2,3,4,5
However, impl_status is incorrect for the second condition in the case statement. In the result set I can see rows with is_100 = false and array_size between 1 and 10, yet impl_status ends up being 'no contact' instead of 'first contact'. I thought between might not be inclusive, but according to the docs it is.
I am curious if this works:
(case when x.is_100 or count(distinct x.list) > 10
then 'implemented'
when (not x.is_100) and count(x.list) > 0
then 'first contact'
else 'no contact'
end) as impl_status,
This should be the same logic without the string comparisons -- here is an interesting viewpoint on booleans in Hive. I also think that COUNT() is clearer than the array functionality.
Make sure the string does not contain hidden whitespace:
when (( trim(x.is_100) = 'false') and size(collect_set(x.list)) between 1 and 10) then 'first contact'
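The hidden-whitespace theory is easy to reproduce on a small scale. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, so it only illustrates the string-comparison behaviour; the table `t` and the column `n` (standing in for size(collect_set(x.list))) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# `n` stands in for size(collect_set(x.list)); table and data are made up.
conn.execute("CREATE TABLE t (is_100 TEXT, n INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("false", 3), ("false ", 3), ("true", 12)],  # second value has a trailing space
)

sql = """
SELECT is_100,
       CASE WHEN {col} = 'true' OR n > 10 THEN 'implemented'
            WHEN {col} = 'false' AND n BETWEEN 1 AND 10 THEN 'first contact'
            ELSE 'no contact' END
FROM t ORDER BY rowid
"""

# Same query, once comparing the raw column and once comparing trim(column).
plain = conn.execute(sql.format(col="is_100")).fetchall()
trimmed = conn.execute(sql.format(col="trim(is_100)")).fetchall()

print(plain)
print(trimmed)
```

With the raw comparison the row containing 'false ' falls through to 'no contact', which matches the symptom described in the question; with trim() it is classified as 'first contact'.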

How to prevent text from being converted to boolean by a CASE WHEN expression in PostgreSQL?

select fti.pa_serial_, fti.homeownerm_name, fti.ward_, fti.villagetole, fti.status,
ftrq.date_reporting, ftrq.name_of_recorder_reporting,
case
when fti.status = 'terminate' then ftrq.is_the_site_cleared = '1'
end as is_the_site_cleared
from fti
join ftrq on ftrq.fulcrum_parent_id = fti.fulcrum_id
Here, is_the_site_cleared is a text column, but the when statement as written converts it into a boolean, so it is shown as true instead of 1. I also tried to output '1' explicitly, but that did not work either. My aim is to display '1' in the is_the_site_cleared column when fti.status = 'terminate'. Please help!
How about using integers rather than booleans?
select fti.pa_serial_, fti.homeownerm_name, fti.ward_,
fti.villagetole, fti.status, ftrq.date_reporting,
ftrq.name_of_recorder_reporting,
(case when fti.status = 'terminate' -- and ftrq.is_the_site_cleared = '1'
then 1 else 0
end) as is_the_site_cleared
from fti join
ftrq
on ftrq.fulcrum_parent_id = fti.fulcrum_id ;
From the description, I cannot tell if you want to include the condition ftrq.is_the_site_cleared = '1' in the when condition. But the idea is to have the then and else return numbers if that is what you want to see.
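The difference between comparing in the THEN branch and returning a value from it can be seen in a small experiment. The sketch below uses Python's sqlite3 module rather than PostgreSQL, so it is only an approximation: SQLite has no boolean type and reports a comparison result as 0 or 1, where PostgreSQL would show false or true. The sample rows (including the 'active' status) are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fti (fulcrum_id INTEGER, status TEXT)")
conn.execute("CREATE TABLE ftrq (fulcrum_parent_id INTEGER, is_the_site_cleared TEXT)")
# Invented sample data: the stored text value is '0' for both rows.
conn.executemany("INSERT INTO fti VALUES (?, ?)", [(1, "terminate"), (2, "active")])
conn.executemany("INSERT INTO ftrq VALUES (?, ?)", [(1, "0"), (2, "0")])

rows = conn.execute("""
    SELECT fti.status,
           -- THEN branch is a comparison: yields a truth value, not the text '1'
           CASE WHEN fti.status = 'terminate'
                THEN ftrq.is_the_site_cleared = '1' END AS compared,
           -- THEN branch is a literal: yields exactly what is written
           CASE WHEN fti.status = 'terminate' THEN '1' END AS literal_text,
           CASE WHEN fti.status = 'terminate' THEN 1 ELSE 0 END AS literal_int
    FROM fti JOIN ftrq ON ftrq.fulcrum_parent_id = fti.fulcrum_id
    ORDER BY fti.fulcrum_id
""").fetchall()

for row in rows:
    print(row)
```

For the 'terminate' row the compared column reports whether the stored text equals '1' (it does not here), while the two literal columns return '1' and 1 regardless of what is stored.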

Filter a null column out of tuple in pig

I am working on a use case where we have to eliminate the nulls from a tuple:
A =
(7,Ron,ron#abc.com)
(8,,rina#xyz.com )
(9,Don,)
(9,Don,dmes#xyz.com)
(10,Maya,maya#cnn.com)
B = FILTER A BY col2 != '';
Output :-
(7,Ron,ron#abc.com)
(9,Don,dmes#xyz.com)
(10,Maya,maya#cnn.com)
Here the filter operator removes the whole row. But we need to drop only the null column, not the row.
The expected output should be something like:
(7,Ron,ron#abc.com)
(8,rina#xyz.com)
(9,Don,dmes#xyz.com)
(9,Don)
(10,Maya,maya#cnn.com)
We can split the relation into sub-relations, project the desired columns, then union the results together.
So if the first column is never null, and the second and third are nullable but at least one of them is always present, then:
SPLIT A INTO col1null IF $1 is null, col2null IF $2 is null, allnotnull IF ($1 is not null AND $2 is not null);
col1reject = FOREACH col1null GENERATE $0,$2; --remove column $1
col2reject = FOREACH col2null GENERATE $0,$1; --remove column $2
OUT = UNION allnotnull ,col1reject , col2reject ;
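For checking the expected output, the effect of the split/project/union can be modelled in plain Python. This is a sketch of the result only, not Pig code; `rows` reuses the sample data from the question with None for the missing fields, and `drop_nulls` is a hypothetical helper.

```python
# Sample tuples from the question; None marks a missing field.
rows = [
    (7, "Ron", "ron#abc.com"),
    (8, None, "rina#xyz.com"),
    (9, "Don", None),
    (9, "Don", "dmes#xyz.com"),
    (10, "Maya", "maya#cnn.com"),
]


def drop_nulls(row):
    """Return the tuple without its null (or empty-string) fields."""
    return tuple(f for f in row if f is not None and f != "")


out = [drop_nulls(r) for r in rows]
for r in out:
    print(r)
```

Unlike UNION in Pig, which does not guarantee any ordering, this list keeps the input order, so the rows may come out in a different sequence than in the Pig job.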

Select with filter on function parameters, if specified

I have a function to try to match partial data to a database row.
I want it to find a match if the parameter is non null; if it's null it should ignore that parameter.
If one of the parameters has a value but finds no match, the query returns no rows.
In pseudocode, that's pretty much how it'd go:
get all rows where:
param_a matches col_a when param_a is not null else don't check this column
AND param_b matches col_b when param_b is not null else don't check this column
AND param_c matches col_c when param_c is not null else don't check this column
AND param_d matches col_d when param_d is not null else don't check this column
AND param_e matches col_e when param_e is not null else don't check this column
What I do right now:
SELECT * FROM table
WHERE nvl(param_a, col_a) = col_a
AND nvl(param_b, col_b) = col_b
AND nvl(param_c, col_c) = col_c
AND nvl(param_d, col_d) = col_d;
Etc... It works, but I'm not sure it's the best option. A colleague suggested that I use
SELECT * FROM table
WHERE (param_a = col_a or param_a is null)
AND (param_b = col_b or param_b is null)
AND (param_c = col_c or param_c is null)
AND (param_d = col_d or param_d is null);
As this is used for an auto-complete feature in a web application, the query is executed a lot, as the user types. Being fast is essential. The strictest columns are filtered first to reduce the number of rows to process.
Is either of these options preferable? If not, how would you do it?
EDIT:
I wrote the question to be generic, but to put it in context: in this case it's for addresses. param_a is actually postal code, param_b street name etc... The function gets the string the user writes (ex: 999 Random St, Fakestate, Countryland, 131ABD) and calls a procedure on it that tries to split it and returns a table containing address, city, country, etc... that is used by the select statement (which is the subject of the question).
I believe the second solution is better. It allows Oracle to skip evaluating col_a/b/c/d if the corresponding parameter is null.
However, it would be cleaner and faster if you dynamically build the query.
For example either in sql or in a programming language you can do something like this:
whereClause = 'WHERE 1 = 1'
IF paramA is not null
then whereClause += ' AND param_a = col_a'
IF paramB is not null
then whereClause += ' AND param_b = col_b'
etc...
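The pseudocode above could look roughly like this in application code. It is a minimal sketch in Python with sqlite3 standing in for Oracle; the `addresses` table, its sample rows, and the `build_query` helper are assumptions made for the example. Values are passed as bind parameters, and column names are expected to come from trusted code rather than user input.

```python
import sqlite3


def build_query(params):
    """Return (sql, args) that filter only on parameters that are not None.

    `params` maps a column name to the value to match. Column names must
    come from trusted code, never from user input.
    """
    clauses = []
    args = []
    for column, value in params.items():
        if value is not None:  # each parameter is checked independently
            clauses.append(f"{column} = ?")
            args.append(value)
    sql = "SELECT * FROM addresses"  # hypothetical table name
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, args


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (col_a TEXT, col_b TEXT, col_c TEXT, col_d TEXT)")
conn.executemany(
    "INSERT INTO addresses VALUES (?, ?, ?, ?)",
    [
        ("131ABD", "Random St", "Fakestate", "Countryland"),
        ("999XYZ", "Other St", "Fakestate", "Countryland"),
    ],
)

sql, args = build_query(
    {"col_a": "131ABD", "col_b": None, "col_c": "Fakestate", "col_d": None}
)
matches = conn.execute(sql, args).fetchall()
print(sql)
print(matches)
```

Because null parameters never reach the SQL text, the database only sees the predicates that actually matter for the current keystroke.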
For indexes I would only index the commonly used column combinations. There are too many combinations to cover them all. Pick the ones that give you the most bang for your buck.
If you are trying to keep up with typing speed, then I would suggest the following approach. Create a separate query for each combination of supplied parameters. With four parameters that is 2^4 - 1 = 15 queries, with where clauses such as:
WHERE param_a = col_a
WHERE param_b = col_b
. . .
WHERE param_a = col_a and param_b = col_b
. . .
WHERE param_a = col_a and param_b = col_b and param_c = col_c and param_d = col_d
Then, precompile these fifteen queries.
Then choose the appropriate query based on the current state of the parameters.
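To see how many queries this approach needs, the where clauses can be enumerated mechanically. The sketch below assumes four optional parameters named a to d, as in the SELECT statements in the question; every non-empty subset of them gives one clause, so the count is 2^4 - 1 = 15.

```python
from itertools import combinations

columns = ["a", "b", "c", "d"]


def where_clauses(cols):
    """Yield one WHERE clause per non-empty subset of the parameters."""
    for size in range(1, len(cols) + 1):
        for subset in combinations(cols, size):
            yield "WHERE " + " and ".join(f"param_{c} = col_{c}" for c in subset)


clauses = list(where_clauses(columns))
print(len(clauses))
for clause in clauses:
    print(clause)
```

A lookup keyed on which parameters are non-null could then pick the matching precompiled statement at run time.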
I would also add indexes, at a minimum:
table(col_a, col_b, col_c, col_d)
table(col_b, col_c, col_d, col_a)
table(col_c, col_d, col_a, col_b)
table(col_d, col_a, col_b, col_c)
This will at least cover all cases with one parameter. You might want to include other indexes for other parameters.

where condition based on bind variables

I want a different set of where conditions to be applied to a query depending on a bind variable in Oracle SQL.
Here is what I have tried
Table a contains
a_id primary key
a_role varchar2(10)
SELECT *
FROM a
WHERE a_role IN ('approved','rejected' , 'needInfo') AND
:bind = 'new' OR
:bind != 'new' AND
a_role IN ('complete') AND
:bind = 'approved' OR
:bind != 'approved'
In short, I am trying to select roles based on the current role, which I will pass in a bind variable. I want something like:
if(:bind = 'new')
{select 'approved' , 'rejected' , 'needInfo' }
else if (:bind = 'approved')
{ select 'complete' }
Thanks ,
Puneet
You have both AND and OR conditions mixed in your statement, and they may not be interpreted as you wanted. Additionally, it looks like the two comparisons to a_role should be OR'd together rather than AND'd. Try the following:
SELECT *
FROM a
WHERE (:bind = 'new' AND
a_role IN ('approved','rejected' , 'needInfo')) OR
(:bind = 'approved' AND
a_role IN ('complete'))
Share and enjoy.
I can't be certain because I'm not clear what your original query is trying to do, but it should be something like this, assuming you're running in SQL*Plus:
variable bind varchar2(20)
exec :bind := 'new'
SELECT *
FROM a
WHERE (:bind = 'new' AND a_role IN ('approved','rejected' , 'needInfo'))
OR (:bind = 'approved' AND a_role IN ('complete'));
This article explains things fairly neatly
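The parenthesised form can be tried out quickly with a named bind parameter. The sketch below uses Python's sqlite3 module instead of Oracle, which also accepts the :bind placeholder syntax; the sample rows in table a and the `roles_for` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (a_id INTEGER PRIMARY KEY, a_role TEXT)")
# Invented sample rows, one per role mentioned in the question.
conn.executemany(
    "INSERT INTO a (a_role) VALUES (?)",
    [("approved",), ("rejected",), ("needInfo",), ("complete",)],
)

QUERY = """
SELECT a_role FROM a
WHERE (:bind = 'new' AND a_role IN ('approved', 'rejected', 'needInfo'))
   OR (:bind = 'approved' AND a_role IN ('complete'))
ORDER BY a_id
"""


def roles_for(current):
    """Return the roles selectable from the given current role."""
    return [row[0] for row in conn.execute(QUERY, {"bind": current})]


print(roles_for("new"))
print(roles_for("approved"))
print(roles_for("complete"))
```

Each parenthesised group pairs one value of the bind variable with its own IN list, so a value that matches neither group returns no rows.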