detect sequence in hive column with lead function - hive

I'm trying to detect a sequence in a column of my hive table. I have 3 columns
(id, label, index). Each id has a sequence of labels and index is the ordering of the labels, like
id label index
a x 1
a y 2
a x 3
a y 4
b x 1
b y 2
b y 3
b y 4
b x 5
b y 6
I want to identify if the label sequence of x,y,x,y occurs.
I was thinking of trying a lead function to accomplish this like:
select id, index, label,
lead( label, 1) over (partition by id order by index) as l1_fac,
lead( label, 2) over (partition by id order by index) as l2_fac,
lead( label, 3) over (partition by id order by index) as l3_fac
from mytable
yields:
id index label l1_fac l2_fac l3_fac
a 1 x y x y
a 2 y x y NULL
a 3 x y NULL NULL
a 4 y NULL NULL NULL
b 1 x y y y
b 2 y y y x
b 3 y y x y
b 4 y x y NULL
b 5 x y NULL NULL
where l1_fac, l2_fac, and l3_fac are the next three label values. Then I could check for a pattern with
where label = l2_fac and l1_fac = l3_fac
This will work for id = a, but not id = b where the label sequence is: x, y, y, y, x, y. I don't care that there were 3 y's in a row; I am just interested that it went from x to y to x to y.
I'm not sure if this is possible. I tried a combination of group by and partition, but without success.

I answered this question where the OP wanted to collect items to a list and remove any repeating items. I think this is essentially what you want to do. This would extract actual xyxy sequences and also would account for your second example where xyxy occurs, but is clouded by 2 extra ys. You need to collect the label column to an array using this UDAF -- this will preserve the order -- then use the UDF I referenced, then you can use concat_ws to make the contents of this array a string, and lastly, check that string for the occurrence of your desired sequence. The function instr will spit out the location of the first occurrence and zero if it never finds the string.
Query:
add jar /path/to/jars/brickhouse-0.7.1.jar;
add jar /path/to/other/jar/duplicates.jar;
create temporary function remove_seq_dups as 'com.something.RemoveSequentialDuplicates';
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';
select id, label_string, instr(label_string, 'xyxy') str_flg
from (
  select id, concat_ws('', no_dups) label_string
  from (
    select id, remove_seq_dups(label_array) no_dups
    from (
      select id, collect(label) label_array
      from db.table
      group by id ) x
    ) y
  ) z
Output:
id label_string str_flg
============================
a xyxy 1
b xyxy 1
A better alternative might be to simply collect label with the UDAF, make it a string, and then use a regex to match the sequence xyxy, but I'm pretty terrible at regex, so possibly someone else can comment intelligently on this.
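As a rough illustration of the collapse-then-search idea outside Hive, here is a minimal Python sketch. It assumes single-character labels and rows available as (id, label, index) tuples; the helper names collapse_runs and has_sequence are made up for this example and are not the UDFs used above. The regex variant x+y+x+y+ is one possible way to match the uncollapsed string directly.

```python
import re
from collections import defaultdict
from itertools import groupby

# Sample rows from the question: (id, label, index)
rows = [
    ("a", "x", 1), ("a", "y", 2), ("a", "x", 3), ("a", "y", 4),
    ("b", "x", 1), ("b", "y", 2), ("b", "y", 3), ("b", "y", 4),
    ("b", "x", 5), ("b", "y", 6),
]

def collapse_runs(labels):
    """Drop consecutive repeats: x,y,y,y,x,y -> x,y,x,y."""
    return [label for label, _ in groupby(labels)]

def has_sequence(labels, pattern="xyxy"):
    """True if the collapsed label string contains the pattern."""
    return pattern in "".join(collapse_runs(labels))

# Collect the labels of each id, ordered by index.
labels_by_id = defaultdict(list)
for id_, label, _ in sorted(rows, key=lambda r: (r[0], r[2])):
    labels_by_id[id_].append(label)

# Approach 1: collapse repeats, then search for the literal pattern.
flags = {id_: has_sequence(labels) for id_, labels in labels_by_id.items()}

# Approach 2: leave repeats in place and let the regex absorb them.
regex_flags = {
    id_: re.search(r"x+y+x+y+", "".join(labels)) is not None
    for id_, labels in labels_by_id.items()
}

print(flags)
print(regex_flags)
```

Both approaches flag id a and id b for the sample data. With multi-character labels the joined string would need a separator, and the pattern would have to be adjusted to match.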

Related

SQL: Remove the entire value x from a column if that value x is also in the same row as column y where column y's value is a

Is there any way to remove rows from column X when Y="a"? Eg. Given:
X Y
100 a
101 a
101 b
200 c
The end result would be only the row [200, c].
[101, b] is also removed because 101 also appears in a row where Y is a. Thank you!
I don't know your RDBMS but I think this is ANSI standard SQL. To retrieve the rows you want, you can use a simple subquery:
SELECT X, Y FROM tbl WHERE X NOT IN
(SELECT X FROM tbl WHERE Y = 'a')
If you want to actually remove (DELETE) rows, you can simply DELETE with the same subquery:
DELETE FROM tbl WHERE X IN
(SELECT X FROM tbl WHERE Y = 'a')
Since you did not include the table name, I just made one up (tbl).
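For anyone who wants to try both statements without a database server, here is a small sketch using Python's built-in sqlite3 module. The table name tbl and the in-memory database are just for illustration; other engines may differ in details, such as how NOT IN behaves when X contains NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (X INTEGER, Y TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [(100, "a"), (101, "a"), (101, "b"), (200, "c")],
)

# Retrieve only the rows whose X never appears together with Y = 'a'.
kept = conn.execute(
    "SELECT X, Y FROM tbl WHERE X NOT IN (SELECT X FROM tbl WHERE Y = 'a')"
).fetchall()
print(kept)

# Actually delete every row whose X appears together with Y = 'a'.
conn.execute("DELETE FROM tbl WHERE X IN (SELECT X FROM tbl WHERE Y = 'a')")
remaining = conn.execute("SELECT X, Y FROM tbl").fetchall()
print(remaining)
```

In both cases only the row (200, 'c') is left.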

LIKE operator for array value of column

I have a simple table with columns x and y.
These columns contain SQL-like patterns for matching.
Column x is an array of varchar (VARCHAR[]).
Column y is a simple string (VARCHAR).
As example:
first row:
x y
{'asd','sdf%','%er%'} %er%
I have this query:
SELECT x, 'ters' LIKE ANY("x"), y, 'ters' LIKE "y" FROM s
So the result of this query is:
"{'asd','sdf%','%er%'}";f;"%er%";t
My questions are:
Why does the LIKE operator work for y but not for x?
How can I match against a VARCHAR[]?
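Your array input format is wrong. The elements should not be wrapped in single quotes, because inside the braces the quotes become part of each element's value: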
select x, 'ters' like any (x), x[3]
from (values
('{asd,sdf%,%er%}'::varchar[]),
($${'asd','sdf%','%er%'}$$)
) s(x);
x | ?column? | x
-----------------------+----------+--------
{asd,sdf%,%er%} | t | %er%
{'asd','sdf%','%er%'} | f | '%er%'
https://www.postgresql.org/docs/current/static/arrays.html#ARRAYS-INPUT
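To see why the quoted elements never match, here is a rough Python emulation of LIKE. It is only a sketch: the sql_like helper is hypothetical, it handles % and _ but ignores escape characters, and it is not how PostgreSQL implements the operator. The point is that a pattern such as '%er%' (quotes included) requires the value itself to start and end with a quote character.

```python
import re

def sql_like(value, pattern):
    """Rough LIKE emulation: % matches any run of characters, _ matches one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None

# Elements as stored with the correct literal {asd,sdf%,%er%}
unquoted = ["asd", "sdf%", "%er%"]
# Elements as stored with {'asd','sdf%','%er%'}: the quotes are part of the data
quoted = ["'asd'", "'sdf%'", "'%er%'"]

matches_unquoted = any(sql_like("ters", p) for p in unquoted)
matches_quoted = any(sql_like("ters", p) for p in quoted)
print(matches_unquoted, matches_quoted)
```

The first check succeeds through %er%, while the second fails because 'ters' contains no quote characters.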

Pandas GroupBy Filtering

I am looking to understand how to filter a groupby object.
I am generating this through:
groupby = df.groupby(['Order #', 'ProductLine', 'ProductType']).size()
And the result is:
Order # ProductLine ProductType QTY
1 A Z 1
Y 1
B X 2
2 A Z 1
Y 1
3 A Y 1
B X 1
I need to filter out two conditions:
Orders where only Product A is contained
Orders where Product A is contained, but there is no ProductType Z
In the example above, only order 1 is legitimate. Order 2 and 3 would be filtered out.
filter takes a callable that returns a boolean. The callable receives each group's entire DataFrame. If it returns True, that group's rows are kept; if False, they are dropped.
Only A
def f(df):
    v = df.ProductLine.values
    return (v == 'A').all()
# group per order so each order is judged as a whole
df.groupby('Order #').filter(f)
A and not Z
def f(df):
    lines = df.ProductLine.values
    types = df.ProductType.values  # Z is a ProductType, not a ProductLine
    return ('A' in lines) and ('Z' not in types)
df.groupby('Order #').filter(f)

SQL Query to Replace Multiple Matching Rows With a New Row

For simplicity, suppose I have a SQL table with one column, containing numeric data. Sample data is
11
13
21
22
23
3
31
32
33
41
42
131
132
133
141
142
143
If the table contains ALL the numbers of the form x1, x2, x3, but NOT x (where x is all of the digits of a number but the last digit. So for 123456, x would be 12345), then I want to replace these three rows with a new row, x.
The desired output of the above data would be:
11
13
2
3
31
32
33
41
42
131
132
133
14
How would I accomplish this with SQL? I should mention that I do not want to permanently alter the data - just for the query results.
Thanks.
I assume:
the presence of two functions, lastDigit and head, producing the last digit and the rest of the input value respectively
the data is unique
only the digits 1,2,3 are used for constructing the table values
the table is named t and has a single column x
you don't want this to work recursively
Create a view n like this (you can use inline views instead):
select head(x) h, lastDigit(x) last from t
Create a view c like this:
select h
from n
where h not in (select x from t) -- skip prefixes that already exist as rows
group by h
having count(*) = 3
Then this should give the desired result:
select distinct x
from (
select x from t where head(x) not in (select h from c)
union
select h from c
)
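The view-based approach can be tried end to end in SQLite through Python's sqlite3 module. This is only a sketch under a few assumptions: integer division x / 10 stands in for head and x % 10 for lastDigit, the views are written as a common table expression, and a prefix that already exists in the table is skipped, as the question requires.

```python
import sqlite3

values = [11, 13, 21, 22, 23, 3, 31, 32, 33, 41, 42,
          131, 132, 133, 141, 142, 143]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])

query = """
WITH c AS (
    -- prefixes h for which h1, h2, h3 all exist and h itself does not
    SELECT x / 10 AS h
    FROM t
    WHERE x >= 10
      AND x % 10 IN (1, 2, 3)
      AND x / 10 NOT IN (SELECT x FROM t)
    GROUP BY x / 10
    HAVING COUNT(*) = 3
)
-- keep every row that is not being collapsed ...
SELECT x FROM t
WHERE NOT (x >= 10 AND x % 10 IN (1, 2, 3) AND x / 10 IN (SELECT h FROM c))
UNION
-- ... and add one row per collapsed prefix
SELECT h FROM c
ORDER BY 1
"""
result = [row[0] for row in conn.execute(query)]
print(result)
```

For the sample data this yields the same set of values as the desired output in the question (2 and 14 replace their three children, while 31, 32, 33 and 131, 132, 133 stay because 3 and 13 already exist), sorted numerically. The underlying table is not modified.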
You need two SQL commands, one to remove the existing rows and the second to add the new row. They would look like this (where :X is a parameter containing your base number, 14 in your example):
DELETE FROM YourTable WHERE YourColumn BETWEEN (:X*10) AND ((:X*10) + 9)
INSERT INTO YourTable (YourColumn) VALUES (:X)
Note: I assume you want all the numbers from x0 to x9 to be removed, so that's what I wrote above. If you really only want x1, x2, x3 removed, then you would use this DELETE statement instead:
DELETE FROM YourTable WHERE YourColumn BETWEEN ((:X*10) + 1) AND ((:X*10) + 3)

SQL: Exchange column values

I want to clear up some concepts about SQL internals.
Suppose I have a table as:
---------tblXY-----------
X int
Y int
Now it has records as:
X Y
---
1 4
2 3
3 2
4 1
And I want the resulting table to be:
X Y
---
4 1
3 2
2 3
1 4
So I wrote the query as:
UPDATE tblXY
SET [X] = Y
,[Y] = X
and got the required result.
But how did it happen? I mean I'm setting X's value as Y's current value and at the very moment I'm setting Y's value as X's.
It's because the operations are a single atomic action - the current values of X and Y are read first before any of the assignments are done.
So it's not so much:
for every row:
    set x = y
    set y = x
but more like:
for every row:
    set tmpx = x
    set tmpy = y
    set x = tmpy
    set y = tmpx
Keep in mind that's just the conceptual view. It's likely to be much more efficient under the covers.
Without that, you'd have to store the temporary yourself for every row, or just rename the columns :-)
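The read-everything-first behaviour can be observed with a small sketch using Python's sqlite3 module. SQLite is just a convenient stand-in here; the table and data mirror the question. Python's own tuple assignment follows the same idea: the right-hand side is evaluated in full before anything is assigned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblXY (X INTEGER, Y INTEGER)")
conn.executemany(
    "INSERT INTO tblXY VALUES (?, ?)",
    [(1, 4), (2, 3), (3, 2), (4, 1)],
)

# Both right-hand sides see the row's values from before the update.
conn.execute("UPDATE tblXY SET X = Y, Y = X")
swapped = conn.execute("SELECT X, Y FROM tblXY ORDER BY rowid").fetchall()
print(swapped)

# The same idea in plain Python: evaluate (y, x) first, then assign.
x, y = 1, 4
x, y = y, x
print(x, y)
```

Not every engine follows this model. MySQL, for one, documents that single-table UPDATE assignments are generally evaluated left to right, so the same statement would not swap the columns there.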