SQL (Hive) group-by using nulls as wildcard

SQL (Hive) group-by using nulls as wildcard - sql

I have a table like this:
group val1 val2 val3
group1 5 . .
group1 . 2 1
group1 . . 3
group2 1 4 .
group2 . . 8
group2 2 . 7
I need to count the occurrences of all possible combinations for each group in Hive, using null values (.) as a wildcard. This would give me results like this:
group val1 val2 val3 cnt
group1 5 2 1 2
group1 5 2 3 2
group2 1 4 8 2
group2 2 4 8 1
group2 2 4 7 1
I know I can do this by selecting all distinct group-val1 pairs, full joining this with all distinct group-val2 pairs, and full joining this with all distinct group-val3 pairs. This gives me all possible combinations for each group, which I can then inner join with my table, counting cases where a row of my original data is a subset of a combination.
Something like this:
create table my_results as
with combos as (
select *
from (select distinct group, val1 from data) A
full join (select distinct group, val2 from data) B
on A.group = B.group
full join (select distinct group, val3 from data) C
on A.group = C.group
)
select A.group, A.val1, A.val2, A.val3, count(*)
from combos A
inner join data B
on A.group = B.group
and (A.val1 = B.val1 OR B.val1 is null)
and (A.val2 = B.val2 OR B.val2 is null)
and (A.val3 = B.val3 OR B.val3 is null)
group by A.group, A.val1, A.val2, A.val3
But! My dataset is very large (100s of millions of rows), and the number of all possible combinations I can expect is also very large (10s of thousands). Such a join is just too big.
Is there another way? I wondered if I could use regular expressions, but I don't know where to start.

In your sample data, only the third column has multiple values. So, you could just fill in the one value for the two other columns:
select group,
max(max(col1)) over (partition by group) as col1,
max(max(col2)) over (partition by group) as col2,
col3,
count(*)
from data
group by group;

Related

SQL- how to pick consecutive rows?

I want to pick the IDs when the two conditions are met
Col2 of every ID should have 1 and
the consecutive row of same ID should be 2
it does not matter where 1 is but 2 must be the consecutive row after 1.
IDs
Col2
97654
1
97854
2
97854
3
97854
4
97854
5
76543
1
76543
3
76543
2
12345
2
12345
3
12345
4
34567
3
34567
1
34567
2
output
IDs
97854
34567
I tried to use this code but here it outputs all the IDs
SELECT *
from (SELECT IDs, Col2 FROM table_name where Col2 = 1)a
left outer join (SELECT IDs, Col2 FROM table_name where Col2 = 2)b
on a.IDs=b.IDs
Help is appreciated.

To do what you are asking for you should use window functions. Make it row_number or lead ones.
One option is:
SELECT t0.IDs
from (select *, ROW_NUMBER OVER (PARTITION BY IDs) as rn from table_name) t0
left join (select *, ROW_NUMBER OVER (PARTITION BY IDs) as rn from table_name) t1 on (t0.IDs = t1.IDs AND t0.rn + 1 = t1.rn)
where t0.Col2=1 and t1.Col2=2

You can join the table with itself. Each instance picks row #1 and row #2 respectively, and the join ensures both are present.
For example:
select t.ids
from t
join t b on b.ids = t.ids
where t.col2 = 1 and b.col2 = 2

How to get distinct count over multiple columns in SQL?

I have a table that looks like this. And I want to get the distinct count across the three columns.
ID
Column1
Column 2
Column 3
1
A
B
C
2
A
A
B
3
A
A
The desired output I'm looking for is:
ID
Column1
Column 2
Column 3
unique_count
1
A
B
C
3
2
A
A
B
2
3
A
A
1

You want to use cross apply for this one.
select *
from t cross apply
(select count(distinct cnt) as unique_count
from (values(Column1),(Column2),(Column3)) t(cnt)) t2
ID
Column1
Column2
Column3
unique_count
1
A
B
C
3
2
A
A
B
2
3
A
A
1
Fiddle

In standard SQL, you have to UNPIVOT first, do the count(distinct) group by on the PIVOTed result and then PIVOT again.
In recent ORACLE version, you could write your own Polymorphic Table Function to do it, passing the table and the list of columns to count for DISTINCT values.

try if this works
SELECT Count(*)
FROM (SELECT DISTINCT Column1 FROM TABLENAME);

Pivot duplicate column names and get all values for columns

EXPLAINATION
Imagine that I have 2 tables. FormFields where are stored column names as values, which should be pivoted and second table FilledValues with user's filled values with FormFieldId provided.
PROBLEM
As you see (below in SAMPLE section) in FormFields table I have duplicate names, but different ID's. I need to make that after joining tables, all values from FilledValues table will be assiged to column names, not to Id's.
What I need better you will see in OUTPUT section below.
SAMPLE DATA
FormFields
ID Name GroupId
1 col1 1
2 col2 1
3 col3 1
4 col1 2
5 col2 2
6 col3 2
FilledValues
ID Name FormFieldId GroupID
1 a 2 1
2 b 3 1
3 c 1 1
4 d 4 2
5 e 6 2
6 f 5 2
OUTPUT FOR NOW
col1 col2 col3
c a b -- As you see It returning only values for FormFieldId 1 2 3
-- d, e, f are lost that because It have duplicate col names, but different id's
DESIRED OUTPUT
col1 col2 col3
c a b
e f d
QUERY
SELECT * FROM
(
SELECT FF.Name AS NamePiv,
FV.Name AS Val1
FROM FormFields FF
JOIN FilledValues FV ON FF.Id = FV.FormFieldId
) x
PIVOT
(
MIN(Val1)
FOR NamePiv IN ([col1],[col2],[col3])
) piv
SQL FIDDLE
How can I produce the OUTPUT with the multiple rows?

Since you are using PIVOT the data is being aggregated so you only return one value for each column being grouped. You don't have any columns in your subquery that are unique and being used in the grouping aspect of PIVOT to return multiple rows. In order to do this you need some value. If you have a column with a unique value for each "group" then you would use that or you can use a windowing function like row_number().
row_number() will create a sequenced number for each FF.Name meaning if you have 2 col1 you will generate a 1 for a row and a 2 for another row. Once this is included in your subquery, you now have a unique value that is used when aggregating your data and you will return multiple rows:
SELECT [col1],[col2],[col3]
FROM
(
SELECT
FF.Name AS NamePiv,
FV.Name AS Val1,
rn = row_number() over(partition by ff.Name order by fv.Id)
FROM FormFields FF
JOIN FilledValues FV ON FF.Id = FV.FormFieldId
) x
PIVOT
(
MIN(Val1)
FOR NamePiv IN ([col1],[col2],[col3])
) piv;
See SQL Fiddle with Demo. The output is:
| col1 | col2 | col3 |
|------|------|------|
| c | a | b |
| e | f | d |

Just adding GroupId in Pivot source query will fix your problem
SELECT * FROM (
SELECT FF.Name AS NamePiv,
FV.Name AS Val1,
ff.groupid
FROM FormFields FF
JOIN FilledValues FV ON FF.Id = FV.FormFieldId
) x
PIVOT
(
MIN(Val1)
FOR NamePiv IN ([col1],[col2],[col3])
) piv
SQLFIDDLE DEMO

I would be inclined to do this with conditional aggregation:
select max(case when formfieldid % 3 = 1 then name end) as col1,
max(case when formfieldid % 3 = 2 then name end) as col2,
max(case when formfieldid % 3 = 0 then name end) as col3
from FilledValues
group by GroupID;
It is unclear what the rule is for assigning a value to a column. This uses the remainder, which works for your input data.

How SQL join work?

i am using MYSQL..
I have two tables:
TABLE 1 (TABLE NAME T1)
SL NAME
1 a
2 b
3 c
4 c
table 2 (table name T2)
SL NAME
1 a
2 c
3 c
4 c
Q1: how i count the total number of 'c' in both table?
Q2: which name is max occurrences in both table?
sl is primary key...
my query is:>
select count(*) from t1,t2
where t1.name=t2.name where t1.name='c';
but it showing 6

To count c in both tables you should use UNION, not JOIN.
Syntax:
SELECT ...
UNION [ALL | DISTINCT] SELECT ...
[UNION [ALL | DISTINCT] SELECT ...]
Doc:
http://dev.mysql.com/doc/refman/5.0/en/union.html
Edit:
I'll explain the query that you provided.
select count(*) from t1,t2 where t1.name=t2.name where t1.name='c';
First of all, you use WHERE clause twice which is a syntax error. Should be:
select count(*) from t1,t2 where t1.name=t2.name AND t1.name='c';
And this is the same that:
SELECT count(*) from t1
JOIN t2 ON t1.name=t2.name
WHERE t1.name='c';
You choose only rows with c value so these are the rows, that we will take under consideration:
TABLE 1 (TABLE NAME T1)
SL NAME
3 c
4 c
table 2 (table name T2)
SL NAME
2 c
3 c
4 c
Now, simple JOIN joins every row from table 1 to every row from table 2 (where condition is true of course)
So the result before counting is:
t1.SL t1.NAME t2.SL t2.NAME
3 c 2 c
4 c 3 c
3 c 4 c
4 c 2 c
3 c 3 c
4 c 4 c
This is 6 rows.
Answers for both of your questions.
SELECT name, count(*) as cnt
FROM(select t1.name from t1
union all
select name from t2) as tem
group by name
order by cnt DESC
This query will give you ranking of names ordered by occurrences.
To retrieve only c count, just add WHERE clause. To retrieve only the most occurring name set LIMIT clause to 1.

INSERT INTO #test
SELECT NAME FROM m_t1 WHERE NAME ='c'
UNION all
SELECT NAME FROM m_t2 WHERE NAME ='c'
SELECT count(*) FROM #test

Get ID pairs between 2 tables with matching child records

I have 2 tables with the same structure.
FIELD 1 INT
FIELD 2 VARCHAR(32) -- is a MD5 Hash
The query has to get matching FIELD 1 pairs from for records that have the exact combination of values for FIELD 2 in both TABLE 1 and TABLE 2.
These tables are pretty large ( 1 million records between the two ) but are deduced down to an ID and a Hash.
Example data:
TABLE 1
1 A
1 B
2 A
2 D
2 E
3 G
3 H
4 E
4 D
4 C
5 E
5 D
TABLE 2
8 A
8 B
9 E
9 D
9 C
10 F
11 G
11 H
12 B
12 D
13 A
13 B
14 E
14 A
The results of the query should be
8 1
9 4
11 3
13 1
I have tried creating a concatenated string of FIELD 2 using a correlated sub-query and FOR XML PATH string trick I read on here but that is very slow.

You can try following query also -
SELECT t_2.Field_1, t_1.Field_1 --1
FROM table_1 t_1, table_2 t_2 --2
WHERE t_1.Field_2 = t_2.Field_2 --3
GROUP BY t_1.Field_1, t_2.Field_1 --4
HAVING COUNT(*) = (SELECT COUNT(*) --5
FROM Table_1 t_1_1 --6
WHERE t_1_1.Field_1 = t_1.Field_1) --7
AND COUNT(*) = (SELECT COUNT(*) --8
FROM Table_2 t_2_1 --9
WHERE t_2_1.Field_1 =t_2.Field_1) --10
Edit
First the requested set of result is the combination of Field1 from both the tables where respective Field2 is exactly same.
so for that you can use one method which I have posted above.
Here
query will take the data from both the table based on field2 values (from line 1 to line 3)
then it will group the data based on field1 from table1 and field1 from table2 (line 4)
till this step you will get the result having field1 from table1 and field2 from table2 where it exists (at least one) matching based on field2 from tables for respective field1 values.
after this you just need to filter the result for correct (exactly same values for field2 values for respective field1 column value). so that you can make condition on row count.
here my assumption is that you don't have multiple values for field1 and field2 combination in either tables
means following rows will not be present -
1 b
1 b
In any of the tables.
if so, the rows count got for table1 and table2 for same field2 values should be match with the rows present in table1 for field1 and same rows only should present in tables2 for field2 value.
for this condition query has condition on count(*) in having clause (from line 5 to line 10).

Let me try to explain this version of the query:
select t1.field1 as t1field1, t2.field1 as t2field1
from (select t1.*,
count(*) over (partition by field1) as NumField2
from table1 t1
) t1 full outer join
(select t2.*,
count(*) over (partition by field1) as NumField2
from table2 t2
) t2
on t1.field2 = t2.field2
where t1.NumField2 = t2.NumField2
group by t1.Field1, t2.Field1
having count(t1.field2) = max(t1.NumField2) and
count(t2.field2) = max(t2.NumField2)
(which is here at SQLFiddle).
The idea is to compare the following counts for each pair of field1 values.
The number of field2 values on each.
The number of field2 values that they share.
All of these have to be equal.
Each subquery counts the number of values of field2 on each field1 value. For the first rows of your data, this produces:
1 A 2
1 B 2
2 A 3
2 D 3
2 E 3
. . .
And for the second table
8 A 2
8 B 2
9 E 3
9 D 3
9 C 3
Next, the full outer join is applied, requiring a match on both the count and the field2 value. This multiplies the data, producing rows such as:
1 A 2 8 A 2
1 B 2 8 B 2
2 A 3 NULL NULL NULL
2 D 3 9 D 3
2 E 3 9 E 3
NULL NULL NULL 9 C 3
And so on for all the possible combinations. Note that the NULLs appear due to the full outer join.
Note that when you have a pair, such as 1 and 8 that match, there are no rows with NULL values. When you have a pair with the same counts but they don't match, then you have NULL values. When you have a pair with different counts, they are filtered out by the where clause.
The filtering aggregation step applies these rules to get pairs that meet the first condition but not the second.
The having essentially removes any pair that has NULL values. When you count() a column, NULL values are not included. In that case, the count() on the column is fewer than the number of values expected (NumField2).

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

SQL (Hive) group-by using nulls as wildcard - sql

In your sample data, only the third column has multiple values. So, you could just fill in the one value for the two other columns: select group, max(max(col1)) over (partition by group) as col1, max(max(col2)) over (partition by group) as col2, col3, count(*) from data group by group;

Related

SQL- how to pick consecutive rows?

How to get distinct count over multiple columns in SQL?

Pivot duplicate column names and get all values for columns

How SQL join work?

Get ID pairs between 2 tables with matching child records

Categories

Resources