Hive Query to select lines that meet multiple criteria

I have a table that looks something like this (Column 1 is a URL, column 2 is an action ID, and column 3 is a user ID):
1 2 3
===========
d x a
d q a
e y a
f z a
f z b
d i b
e x b
d i c
g q c
o q c
f q c
I'm trying to check and see if there are any rows where col1 = 'f'.
If col1 = 'f', I need to get the userID from col3 then check all rows where col3 = userID to see if there are any rows where col2 = 'x'.
If there are any userIDs that have a row where col1 = 'f' and a row where col2 = 'x', return all rows that have userID in col3
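I'm a Hive/SQL noob, but here is some Python code that I think would accomplish what I'm trying to do:
import pandas as pd
df = pd.DataFrame(table)
# user IDs that have at least one row where col1 == 'f'
f_users = df.loc[df['1'] == 'f', '3']
# of those, the user IDs that also have a row where col2 == 'x'
ids = df.loc[df['3'].isin(f_users) & (df['2'] == 'x'), '3'].unique()
# keep every row belonging to a qualifying user ID
df = df[df['3'].isin(ids)]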
The result of my desired query would return
1 2 3
===========
d x a
d q a
e y a
f z a
f z b
d i b
e x b
So far the closest I've gotten is this:
SELECT * FROM log AS a
WHERE a.3 in
(
SELECT DISTINCT 3
FROM log
WHERE ((to_date(log_date)) >= (date_sub(current_date, 1)))
AND 1 = 'f'
)
This gets me halfway there, but it doesn't filter on col2 and it takes an extraordinarily long time to run, which can cause it to fail in my environment.
Is there a way to accomplish this using only Hive / Spark? I really don't want to have to download this file and run a python script on it, as it is several GB and my office wifi is slow :(

Get all userids where url = 'f'. This will give you (a, b, c).
Use that to check userid for actionid = 'x'. This will give you (a, b).
Finally, get all rows with a userid from the above.
select * from log where userid in
(
select distinct userid from log
where
actionid ='x' and
userid in (select distinct userid from log where URL='f')
)
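To sanity-check the query above, here is a minimal sketch that loads the question's sample rows into an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for Hive here, and the table name log and columns url, actionid, userid follow the answer. The sketch also runs a GROUP BY / HAVING variant that finds the qualifying users in one grouped pass; that variant is my own assumption about a possibly cheaper shape, not something measured on Hive.

```python
import sqlite3

# Sample rows from the question: (url, actionid, userid).
rows = [
    ("d", "x", "a"), ("d", "q", "a"), ("e", "y", "a"), ("f", "z", "a"),
    ("f", "z", "b"), ("d", "i", "b"), ("e", "x", "b"),
    ("d", "i", "c"), ("g", "q", "c"), ("o", "q", "c"), ("f", "q", "c"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (url TEXT, actionid TEXT, userid TEXT)")
conn.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)

# The nested-IN query from the answer.
nested_sql = """
SELECT * FROM log WHERE userid IN (
    SELECT DISTINCT userid FROM log
    WHERE actionid = 'x'
      AND userid IN (SELECT DISTINCT userid FROM log WHERE url = 'f')
)
"""

# Hypothetical alternative: flag qualifying users in one grouped pass,
# then join back to fetch all of their rows.
grouped_sql = """
SELECT l.* FROM log AS l
JOIN (
    SELECT userid FROM log
    GROUP BY userid
    HAVING MAX(CASE WHEN url = 'f' THEN 1 ELSE 0 END) = 1
       AND MAX(CASE WHEN actionid = 'x' THEN 1 ELSE 0 END) = 1
) AS u ON u.userid = l.userid
"""

nested_result = sorted(conn.execute(nested_sql).fetchall())
grouped_result = sorted(conn.execute(grouped_sql).fetchall())
print(nested_result)  # the seven rows belonging to users a and b
```

Both queries return the same seven rows on this sample; user c is excluded because it has a row with url = 'f' but none with actionid = 'x'.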

Related

Break out nested data within SQL, criteria across multiple rows (similar to dcast in R)

I'm trying to write a simple query to take a data set that looks like this:
ID | Col2
X B
X C
Y B
Y D
and return this:
ID | Col2 | Col3
X B C
Y B D
Essentially, I have an ID column that can have either B, C, or D in Col2. I am trying to identify which IDs only have B and D. I have a query to find both, but not only that combination. Query:
select ID, Col2
from Table1
where ID in (
select ID from Table1
group by ID
having count(distinct Col2) = 2)
order by ID
Alternatively, I could use help in finding a way to filter that query on B and D and leave off B and C. I have seen perhaps a self join, but am not sure how to implement that.
Thanks!
EDIT: Most of the data set has, for a given ID, all three of B, C, and D. The goal here is to isolate the IDs that are missing one, namely missing C.

I am trying to identify which IDs only have B and D. I have a query to find both
If this is what you want, you don't need multiple columns:
select id
from table1
where col2 in ('B', 'D')
group by id
having count(distinct col2) = 2;
If you want only 'B' and 'D' and no others, then:
select id
from table1
group by id
having sum(case when col2 = 'B' then 1 else 0 end) > 0 AND
sum(case when col2 = 'D' then 1 else 0 end) > 0 AND
sum(case when col2 not in ('B', 'D') then 1 else 0 end) = 0;
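A quick way to check the "only B and D" condition is to run it on a small sample. The sketch below uses Python's sqlite3 module as a stand-in database; the rows for X and Y come from the question, while the id Z (which has all three values, as the question's EDIT describes) is a hypothetical addition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("X", "B"), ("X", "C"), ("Y", "B"), ("Y", "D"),
     ("Z", "B"), ("Z", "C"), ("Z", "D")],  # Z is a hypothetical id with all three values
)

# Keep ids that have a B, have a D, and have nothing outside {B, D}.
only_b_and_d = [r[0] for r in conn.execute("""
    SELECT id FROM table1
    GROUP BY id
    HAVING SUM(CASE WHEN col2 = 'B' THEN 1 ELSE 0 END) > 0
       AND SUM(CASE WHEN col2 = 'D' THEN 1 ELSE 0 END) > 0
       AND SUM(CASE WHEN col2 NOT IN ('B', 'D') THEN 1 ELSE 0 END) = 0
    ORDER BY id
""")]
print(only_b_and_d)  # ['Y']
```

Only Y survives: X lacks a D, and Z is rejected by the third condition because it also has a C.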
If each id has at most two distinct values, you can also easily pivot the values using aggregation:
select id, min(col2), nullif(max(col2), min(col2))
from table1
group by id;
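The min/max pivot can be tried out the same way. This is a minimal sketch with sqlite3 standing in for the real database; the id W, which has a single value, is a hypothetical row added to show what NULLIF does when min and max coincide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("X", "B"), ("X", "C"), ("Y", "B"), ("Y", "D"),
     ("W", "B")],  # W is a hypothetical id with only one value
)

# MIN gives the first value; NULLIF blanks the second column
# when the id has only one distinct value.
pivoted = conn.execute("""
    SELECT id, MIN(col2), NULLIF(MAX(col2), MIN(col2))
    FROM table1
    GROUP BY id
    ORDER BY id
""").fetchall()
print(pivoted)
```

With three or more distinct values per id the middle values would be silently dropped, so this shape only fits the two-value case.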

Trying to replicate a '=countifs' function from excel to SQL

I have an excel file with a table that looks like this:
A            B              C
Registry ID  Parent Reg ID  Focus Account (Y/N)
100000033    100000036778   Y
100000343    1000           Y
1000343223   100000036778   N
The formula in column D (Focus Parent) is: =IF(COUNTIFS(C:C,"Y",B:B,B)>=1,"Y","N")
In column D, the formula returns 'Y' for each row.
I've tried to replicate this in SQL with the following code:
SELECT
REGISTRY_ID,
PARENT_REG_ID,
FOCUS_ACCOUNT,
SCORE_DETAILS,
(CASE
WHEN FOCUS_ACCOUNT = 'Y' THEN
(CASE
WHEN COUNT(PARENT_REG_ID) >= 1 THEN 'Y'
ELSE 'N'
END)
ELSE 'N'
END) AS Focus_Parent
FROM MA_ACCOUNTS
But this query returns this error:
ORA-00937: not a single-group group function
Can you please advise?
Later edit:
Let me clarify this: I have a list with unique Registry_IDs that contain a Parent_Registry_ID. A Parent_Registry_ID can have multiple Registry_ID but if a Registry_ID is marked as ‘Y’ in the Column Focus_Account then that Parent_Registry_ID should have ‘Y’ in the column Focus_Parent.
Registry ID  Parent Reg ID  Focus Account (Y/N)
1            A              N
2            B              N
3            A              Y
4            C              Y
5            A              N
6            B              Y
7            A              N
8            D              Y
9            E              N
10           E              N
Expected outcome:
Registry ID  Parent Reg ID  Focus Account (Y/N)  Focus Parent (Y/N)
1            A              N                    Y
2            B              N                    Y
3            A              Y                    Y
4            C              Y                    Y
5            A              N                    Y
6            B              Y                    Y
7            A              N                    Y
8            D              Y                    Y
9            E              N                    N
10           E              N                    N

You are using an aggregated count() so Oracle is expecting a GROUP BY clause. However, that would not fit the shape of your result set. Seems like an analytic function would be better?
You have posted a clarification which I think defines this rule:
if any registry_id has focus_account='Y' then set focus_parent = 'Y' for all instances of its parent_reg_id.
If my interpretation is correct you can implement it quite simply with an analytic max():
select
registry_id,
parent_reg_id,
focus_account,
max( focus_account ) over (partition by parent_reg_id) as focus_parent
from ma_accounts
This works because focus_account is a Y/N flag: 'Y' sorts after 'N', so max() returns 'Y' for a parent as soon as any of its rows is flagged 'Y'. The above query produces your revised result set from the posted input data.
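To see the analytic max() at work on the clarified sample data, here is a minimal sketch using Python's sqlite3 module in place of Oracle. It assumes the bundled SQLite is version 3.25 or later, since older builds do not support window functions.

```python
import sqlite3

# Clarified sample data from the question:
# (registry_id, parent_reg_id, focus_account).
accounts = [
    (1, "A", "N"), (2, "B", "N"), (3, "A", "Y"), (4, "C", "Y"), (5, "A", "N"),
    (6, "B", "Y"), (7, "A", "N"), (8, "D", "Y"), (9, "E", "N"), (10, "E", "N"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ma_accounts "
    "(registry_id INTEGER, parent_reg_id TEXT, focus_account TEXT)"
)
conn.executemany("INSERT INTO ma_accounts VALUES (?, ?, ?)", accounts)

# One output row per input row; the window spans all rows of the same parent.
result = conn.execute("""
    SELECT registry_id, parent_reg_id, focus_account,
           MAX(focus_account) OVER (PARTITION BY parent_reg_id) AS focus_parent
    FROM ma_accounts
    ORDER BY registry_id
""").fetchall()

focus_parent = {row[0]: row[3] for row in result}
print(focus_parent)
```

Registry IDs 1 through 8 come out as 'Y' and 9 and 10 as 'N', matching the expected outcome table in the question.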

You're using an aggregate function in the select list, but you're not grouping by the other selected columns.
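Try the following, which avoids ORA-00937. Note that because REGISTRY_ID is unique, each group holds a single row, so Focus_Parent here simply equals FOCUS_ACCOUNT: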
SELECT
REGISTRY_ID,
PARENT_REG_ID,
FOCUS_ACCOUNT,
SCORE_DETAILS,
CASE WHEN COUNT(PARENT_REG_ID) >= 1 AND FOCUS_ACCOUNT = 'Y' THEN 'Y'
ELSE 'N' END AS Focus_Parent
FROM MA_ACCOUNTS
GROUP BY REGISTRY_ID,
PARENT_REG_ID,
FOCUS_ACCOUNT,
SCORE_DETAILS

sql: select rows where group of elements occurs several times in the table

I am searching for an implementation of the following pseudo-code:
SELECT A, B, C
FROM X
HAVING COUNT(A,B) > 1
Here is an example of what the code should do:
Assume table X looks as follows:
A B C D
--------------
1 1 0 2
1 1 1 1
2 1 1 0
The first and second rows have the same entries in columns A and B; the third row is identical in column B but different in column A. The desired output is columns A, B, and C of rows 1 and 2:
1 1 0
1 1 1
How could this be implemented? The problem with my pseudo-code is that COUNT accepts either a single column or all columns (*), but it can't take two out of four columns. GROUP BY has the same property.

You can do this with an exists clause. This should work in all databases:
select a, b, c
from x
where exists (select 1
from x x2
where x.a = x2.a and x.b = x2.b and x.c <> x2.c
);
This assumes that the rows have different c values.
This will perform best with an index on x(a, b).
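As a quick check, the EXISTS approach can be run against the question's three sample rows. This is a sketch with Python's sqlite3 module as a stand-in database; the only change from the query above is an explicit alias x1 on the outer table, which I added for clarity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?, ?, ?)",
    [(1, 1, 0, 2), (1, 1, 1, 1), (2, 1, 1, 0)],
)

# A row qualifies if another row shares its (a, b) but has a different c.
matches = conn.execute("""
    SELECT x1.a, x1.b, x1.c
    FROM x x1
    WHERE EXISTS (SELECT 1
                  FROM x x2
                  WHERE x1.a = x2.a AND x1.b = x2.b AND x1.c <> x2.c)
    ORDER BY x1.a, x1.b, x1.c
""").fetchall()
print(matches)  # [(1, 1, 0), (1, 1, 1)]
```

The third row (2, 1, 1) is left out because no other row shares its (a, b) pair.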

For an RDBMS that supports analytic functions, you can do:
SELECT a,b,c
FROM
(
SELECT a, b, c, count(1) OVER(PARTITION BY a,b) cnt
FROM X
)t1
WHERE t1.cnt >1
If analytic (window) functions are not available, a join should do the job:
SELECT t1.a, t1.b, t1.c
FROM X t1
INNER JOIN
(
SELECT a,b
FROM X
GROUP BY a,b
HAVING COUNT(1) >1
)t2 ON (t2.a=t1.a AND t2.b=t1.b)
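The join variant can be exercised in the same way. In this sketch (sqlite3 as a stand-in database) the first three rows are the question's sample; the last two are hypothetical rows that share (a, b) and also have the same c, to show that counting per (a, b) group still returns them, whereas a comparison on differing c values would not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")
conn.executemany(
    "INSERT INTO X VALUES (?, ?, ?, ?)",
    [(1, 1, 0, 2), (1, 1, 1, 1), (2, 1, 1, 0),
     (3, 3, 5, 0), (3, 3, 5, 9)],  # hypothetical pair with identical c
)

# Find (a, b) pairs that occur more than once, then join back for the rows.
repeated = conn.execute("""
    SELECT t1.a, t1.b, t1.c
    FROM X t1
    INNER JOIN (
        SELECT a, b
        FROM X
        GROUP BY a, b
        HAVING COUNT(1) > 1
    ) t2 ON (t2.a = t1.a AND t2.b = t1.b)
    ORDER BY t1.a, t1.b, t1.c
""").fetchall()
print(repeated)
```

Which behaviour is wanted depends on whether rows that agree on a, b and c should count as repeats.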

Select rows until condition met

I would like to write an Oracle query which returns a specific set of information. Using the table below, if given an id, it will return the id and value of B. Also, if B=T, it will return the next row as well. If that next row has B=T, it will return that, and so on until an F is encountered.
So, given 3 it would just return one row: (3,F). Given 4 it would return 3 rows: ((4,T),(5,T),(6,F))
id B
1 F
2 F
3 F
4 T
5 T
6 F
7 T
8 F
Thank you in advance!

Use a sub-query to find out at what point you should stop, then return all rows from your starting point to the calculated stop point.
SELECT
*
FROM
yourTable
WHERE
id >= 4
AND id <= (SELECT MIN(id) FROM yourTable WHERE b = 'F' AND id >= 4)
Note, this assumes that the last record is always an 'F'. You can deal with the last record being a 'T' using a COALESCE.
SELECT
*
FROM
yourTable
WHERE
id >= 4
AND id <= COALESCE(
(SELECT MIN(id) FROM yourTable WHERE b = 'F' AND id >= 4),
(SELECT MAX(id) FROM yourTable )
)
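To try the COALESCE version with different starting ids, the query can be wrapped in a small helper. This is a sketch using Python's sqlite3 module rather than Oracle; the function name rows_until_f and the :start parameter are hypothetical names introduced for illustration.

```python
import sqlite3

def rows_until_f(conn, start_id):
    """Return rows from start_id up to and including the next 'F',
    or up to the last row if no 'F' follows."""
    return conn.execute(
        """
        SELECT id, b FROM yourTable
        WHERE id >= :start
          AND id <= COALESCE(
                (SELECT MIN(id) FROM yourTable WHERE b = 'F' AND id >= :start),
                (SELECT MAX(id) FROM yourTable))
        ORDER BY id
        """,
        {"start": start_id},
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (id INTEGER, b TEXT)")
# Ids 1..8 with the B values from the question's table.
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?)",
    list(enumerate("FFFTTFTF", start=1)),
)

print(rows_until_f(conn, 3))  # [(3, 'F')]
print(rows_until_f(conn, 4))  # [(4, 'T'), (5, 'T'), (6, 'F')]
```

Binding the starting id as a parameter keeps the two occurrences of the value in sync, which is easy to get wrong when the literal 4 is edited by hand.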

SQL Counting the number of occurrences based on a subject

I find it hard to word what I am trying to achieve. I have a table that looks like this:
user char
---------
a | x
a | y
a | z
b | x
b | x
b | y
c | y
c | y
c | z
How do I write a query that would return me the following result?
user x y z
-------
a |1|1|1|
b |2|1|0|
c |0|2|1|
The numbers represent the number of occurrences of each char in the original table.
EDIT:
The chars values are unknown hence the solution cannot be restricted to these values. Sorry for not mentioning it sooner. I am using Oracle DB but planning to use JPQL to construct the query.

select user,
sum(case when char='x' then 1 else 0 end) as x,
sum(case when char='y' then 1 else 0 end) as y,
sum(case when char='z' then 1 else 0 end) as z
from thetable
group by user
Or, if you don't mind stacking vertically, this query works even with unknown sets of characters:
select user, char, count(*) as count
from thetable
group by user, char
This will give you:
user char count
a x 1
a y 1
a z 1
b x 2
If you want to string an unknown set of values out horizontally (as in your demo output), you're going to need to get into dynamic queries... the SQL standard is not designed to generate output with an unknown number of columns... Hope this is helpful!
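One way to picture the dynamic-query route is to build the SUM(CASE ...) columns in application code from the distinct values. The sketch below does this with Python's sqlite3 module as a stand-in database; the column names usr and val are hypothetical replacements for user and char, chosen to sidestep reserved words.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thetable (usr TEXT, val TEXT)")
conn.executemany(
    "INSERT INTO thetable VALUES (?, ?)",
    [("a", "x"), ("a", "y"), ("a", "z"), ("b", "x"), ("b", "x"),
     ("b", "y"), ("c", "y"), ("c", "y"), ("c", "z")],
)

# Step 1: discover the distinct values that will become columns.
values = [r[0] for r in conn.execute(
    "SELECT DISTINCT val FROM thetable ORDER BY val")]

# Step 2: build one SUM(CASE ...) column per value. The values themselves
# are bound as parameters; only the quoted column alias is interpolated.
columns = ", ".join(
    'SUM(CASE WHEN val = ? THEN 1 ELSE 0 END) AS "{}"'.format(v.replace('"', '""'))
    for v in values
)
sql = "SELECT usr, {} FROM thetable GROUP BY usr ORDER BY usr".format(columns)

# Step 3: run the generated query with the values bound in column order.
pivot = conn.execute(sql, values).fetchall()
print(pivot)  # [('a', 1, 1, 1), ('b', 2, 1, 0), ('c', 0, 2, 1)]
```

This takes two round trips to the database, and the set of columns is fixed at the moment the first query runs.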

Another option, using T-SQL PIVOT (SQL SERVER 2005+)
select *
from userchar as t
pivot
(
count([char]) for [char] in ([x],[y],[z])
) as p
Result:
user x y z
----------- ----------- ----------- -----------
a 1 1 1
b 2 1 0
c 0 2 1
(3 row(s) affected)
Edit ORACLE:
You can build a similar PIVOT table using ORACLE.
The tricky part is that you need the right column names in the IN ([x],[y],[z],...) statement. It shouldn't be too hard to construct the SQL query in code, getting a (SELECT DISTINCT [char] from table) and appending it to your base query.
See also: Pivoting rows into columns dynamically in Oracle

If you don't know the exact values on which to PIVOT, you'll either need to do something procedural or mess with dynamic sql (inside an anonymous block), or use XML (in 11g).
If you want the XML approach, it would be something like:
with x as (
select 'a' as usr, 'x' as val from dual
union all
select 'a' as usr, 'y' as val from dual
union all
select 'b' as usr, 'x' as val from dual
union all
select 'b' as usr, 'x' as val from dual
union all
select 'c' as usr, 'z' as val from dual
)
select * from x
pivot XML (count(val) as val_cnt for val in (ANY))
;
Output:
USR VAL_XML
a <PivotSet><item><column name = "VAL">x</column><column name = "VAL_CNT">1</column></item><item><column name = "VAL">y</column><column name = "VAL_CNT">1</column></item></PivotSet>
b <PivotSet><item><column name = "VAL">x</column><column name = "VAL_CNT">2</column></item></PivotSet>
c <PivotSet><item><column name = "VAL">z</column><column name = "VAL_CNT">1</column></item></PivotSet>
Hope that helps
