How to get distinct count over multiple columns in Hive SQL?

I have a table that looks like this, and I want to get the distinct count horizontally across the three columns, ignoring nulls.
ID | Column1 | Column 2 | Column 3
---+---------+----------+---------
1 | A | B | C
2 | A | A | B
3 | A | A |
The desired output I'm looking for is:
ID | Column1 | Column 2 | Column 3 | unique_count
---+---------+----------+----------+-------------
1 | A | B | C | 3
2 | A | A | B | 2
3 | A | A | | 1

One possible option would be
WITH sample AS (
SELECT 'A' Column1, 'B' Column2, 'C' Column3 UNION ALL
SELECT 'A', 'A', 'B' UNION ALL
SELECT 'A', 'A', NULL UNION ALL
SELECT '', 'A', NULL
)
SELECT Column1, Column2, Column3, COUNT(DISTINCT NULLIF(TRIM(c), '')) unique_count
FROM (SELECT *, ROW_NUMBER() OVER () rn FROM sample) t LATERAL VIEW EXPLODE(ARRAY(Column1, Column2, Column3)) tf AS c
GROUP BY Column1, Column2, Column3, rn;
Output:
+---------+---------+---------+--------------+
| column1 | column2 | column3 | unique_count |
+---------+---------+---------+--------------+
| | A | NULL | 1 |
| A | A | NULL | 1 |
| A | A | B | 2 |
| A | B | C | 3 |
+---------+---------+---------+--------------+
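The per-row logic of that query can also be sketched outside Hive. The snippet below is only an illustration in Python, not part of the answer: it treats each row as the array that EXPLODE would unpack, drops NULLs and blank strings the way NULLIF(TRIM(c), '') does, and counts what remains. The function name row_distinct_count is hypothetical.

```python
def row_distinct_count(row):
    """Count distinct non-null, non-blank values in one row (a tuple of str or None)."""
    seen = set()
    for value in row:
        if value is None:
            continue  # COUNT(DISTINCT ...) ignores NULLs
        trimmed = value.strip()  # note: strips all whitespace, slightly broader than TRIM
        if trimmed == "":
            continue  # NULLIF(TRIM(c), '') turns blanks into NULL
        seen.add(trimmed)
    return len(seen)


# Same four rows as the sample CTE in the query above.
sample = [
    ("A", "B", "C"),
    ("A", "A", "B"),
    ("A", "A", None),
    ("", "A", None),
]
counts = [row_distinct_count(row) for row in sample]
print(counts)
```

Each row is handled independently, which is what grouping by the row number rn achieves in the Hive version.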

case when C1 not in (C2, C3) then 1 else 0 end +
case when C2 not in (C3) then 1 else 0 end + 1
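This will not work if you intend to count nulls. Note also that a NULL in one of the compared columns can make a NOT IN test evaluate to NULL rather than true, so rows containing nulls need extra handling. The pattern extends to more columns by successively comparing each one to all of the columns to its right. The order doesn't strictly matter; there's just no point in repeating the same test over and over.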
If the values were alphabetically ordered then you could test only adjacent pairs to look for differences. While that applies to your limited sample it would not be the most general case.
Using a column pivot with a distinct count aggregate is likely to be a lot less efficient, less portable, and a lot less adaptable to a broad range of queries.
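To see the pairwise expression in action, here is a small sketch that runs it through Python's built-in sqlite3 module. This is an assumption-laden illustration: it uses SQLite rather than Hive, the table name sample and the columns c1, c2, c3 are made up, and the third row is filled with a non-null 'A' because the expression is not meant to handle nulls.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (id INTEGER, c1 TEXT, c2 TEXT, c3 TEXT)")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?, ?)",
    [(1, "A", "B", "C"), (2, "A", "A", "B"), (3, "A", "A", "A")],
)

# Each column is compared only with the columns to its right;
# the last column always contributes 1.
query = """
SELECT id,
       CASE WHEN c1 NOT IN (c2, c3) THEN 1 ELSE 0 END
     + CASE WHEN c2 NOT IN (c3) THEN 1 ELSE 0 END
     + 1 AS unique_count
FROM sample
ORDER BY id
"""
pairwise_counts = conn.execute(query).fetchall()
print(pairwise_counts)
```

With four columns the expression would gain one more CASE term, comparing the first column against the three to its right.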

Related

sql - Only want rows with NULL in column if it isn't defined somewhere else as well

I have a table with possible NULL values in a column. I need to return the rows with NULL values, but only if the value isn't also defined somewhere else. Below, I want row F, but I do not want row B. We have some automation that attempts something but also has a failover. We need to identify when both tries fail.
Column 1 | Column 2
A | 1
B | 1
B | null
C | 2
C | 1
D | 1
E | 2
F | null
F | null
G | 2

Simply use aggregation:
select col1, null as col2
from table t
group by col1
having max(col2) is null;

You can use not exists:
select t.*
from mytable t
where not exists (
select 1
from mytable t1
where t1.column1 = t.column1 and t1.column2 is not null
)
Or you can use window functions:
select column1, column2
from (
select t.*, max(column2) over(partition by column1) max_column2
from mytable t
) t
where max_column2 is null
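As a quick check of the aggregation and NOT EXISTS variants, the sketch below loads the question's sample rows into an in-memory SQLite database through Python's sqlite3 module. SQLite and the table name mytable are assumptions made for the demonstration; the queries themselves follow the answers above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (column1 TEXT, column2 INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [("A", 1), ("B", 1), ("B", None), ("C", 2), ("C", 1),
     ("D", 1), ("E", 2), ("F", None), ("F", None), ("G", 2)],
)

# Aggregation: MAX ignores NULLs, so it is NULL only when every value is NULL.
grouped = conn.execute(
    "SELECT column1, NULL AS column2 FROM mytable "
    "GROUP BY column1 HAVING MAX(column2) IS NULL"
).fetchall()

# NOT EXISTS: keep a row only if no row with the same key has a value.
not_exists = conn.execute(
    """
    SELECT t.column1, t.column2
    FROM mytable t
    WHERE NOT EXISTS (
        SELECT 1 FROM mytable t1
        WHERE t1.column1 = t.column1 AND t1.column2 IS NOT NULL
    )
    """
).fetchall()

print(grouped)
print(not_exists)
```

The aggregation returns one row per key, while NOT EXISTS returns every matching row, so F appears once in the first result and twice in the second.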

SQL code to get next variable in table with different value

I need to find a way in SQL Server 2014 Management Studio to find the next unique value in a column that shares the value of a different column.
So for example below I would want my results to be
Column 1 - A
Column 2 - 1
Column 3 - 4
As that is the first time that A has unique values in column 2 and 3
Column1 | Column2 | Column3
--------+---------+--------
A | X | 1
A | X | 2
B | Y | 3
A | Z | 4
Query:
SELECT
Column1,
LEAD(Column3) OVER (PARTITION BY Column2 ORDER BY Column3) AS FindValue
FROM
Table

If I understand it correctly I would try something like this:
-- first we find minimum values for column1, column2 variations
WITH min_values AS (
SELECT
column1,
column2,
min(column3) AS min_value
FROM
table
GROUP BY column1, column2 -- SQL Server does not accept column positions in GROUP BY
)
-- then we find bottom 2 values for column1
,bottom_2 AS (
SELECT
column1,
min_value,
row_number() OVER (PARTITION BY column1 ORDER BY min_value ASC) AS rn
FROM
min_values
)
-- THEN we JOIN results INTO single record
SELECT
b1.column1, b2.min_value, b1.min_value
FROM
bottom_2 b1
JOIN
bottom_2 b2 ON b1.column1 = b2.column1 AND b2.rn < b1.rn
WHERE b1.rn <= 2
I just checked comments above and would like to add some notes.
If you want to find next value ordered by column2 then you have to change order by from min_value to column2 in row_number() line. Otherwise, if you are looking for next inserted value then you need a timestamp or some kind of id.
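The three steps of that query (minimum per column1/column2 pair, ranking the minimums within column1, pairing the first two) can be traced with a short Python sketch. It is only an illustration of the logic, not a replacement for the SQL; the function name first_two_minimums is hypothetical.

```python
def first_two_minimums(rows):
    """rows: iterable of (column1, column2, column3) tuples."""
    # Step 1: minimum column3 for each (column1, column2) combination.
    min_values = {}
    for c1, c2, c3 in rows:
        key = (c1, c2)
        if key not in min_values or c3 < min_values[key]:
            min_values[key] = c3

    # Step 2: collect the minimums per column1 so they can be ranked.
    per_column1 = {}
    for (c1, _c2), minimum in min_values.items():
        per_column1.setdefault(c1, []).append(minimum)

    # Step 3: pair the two smallest minimums (rn = 1 and rn = 2).
    result = []
    for c1 in sorted(per_column1):
        ordered = sorted(per_column1[c1])
        if len(ordered) >= 2:  # keys with a single combination drop out, as in the join
            result.append((c1, ordered[0], ordered[1]))
    return result


table = [("A", "X", 1), ("A", "X", 2), ("B", "Y", 3), ("A", "Z", 4)]
pairs = first_two_minimums(table)
print(pairs)
```

B has only one column2 value, so it produces no pair, just as it finds no join partner in the SQL.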

Move All null columns (Columns with zero non-null values) to the extreme right of the table

I have a table with hundreds of columns. For a few of the columns, the value is always null. I want to move all such columns (columns with zero non-null values) to the extreme right so that my users see the usable information first.
E.g., test_table:
column1 | column2 | column3 | column4
a | null | null | 1
b | null | null | 2
c | null | null | 3
After insertion, I want test_table to look like the table below:
column1 | column4 | column2 | column3
a | 1 | null | null
b | 2 | null | null
c | 3 | null | null

The order the columns show up is defined when you query the table.
If you want these columns to show at the right end of the result set, you need to re-order them within the SELECT portion of your query.
So to stick to your example, instead of having:
Select column1, column2, column3, column4 FROM test_table
You would have:
Select column1, column4, column2, column3 FROM test_table
I do not believe there's a way to do this dynamically, but since you know these columns will always have a null value you should feel pretty confident with this.
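A single static query indeed cannot reorder its own columns, but a client program can build the column list before running the query. The sketch below shows one possible approach using Python's sqlite3 module; SQLite, the PRAGMA table_info lookup, and the variable names are all assumptions for illustration, and other databases would read the column names from their own catalog instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_table (column1 TEXT, column2 TEXT, column3 TEXT, column4 INTEGER)"
)
conn.executemany(
    "INSERT INTO test_table VALUES (?, ?, ?, ?)",
    [("a", None, None, 1), ("b", None, None, 2), ("c", None, None, 3)],
)

# Column names in their declared order (index 1 of each table_info row is the name).
columns = [row[1] for row in conn.execute("PRAGMA table_info(test_table)")]

# COUNT(col) counts only non-null values, so 0 means the column is entirely null.
counts_sql = "SELECT " + ", ".join(f'COUNT("{c}")' for c in columns) + " FROM test_table"
non_null_counts = conn.execute(counts_sql).fetchone()

populated = [c for c, n in zip(columns, non_null_counts) if n > 0]
all_null = [c for c, n in zip(columns, non_null_counts) if n == 0]
ordered_columns = populated + all_null

select_sql = "SELECT " + ", ".join(f'"{c}"' for c in ordered_columns) + " FROM test_table"
print(select_sql)
reordered_rows = conn.execute(select_sql).fetchall()
```

The counting query scans the whole table, so on a large table with hundreds of columns it may be worth caching the resulting order rather than recomputing it for every request.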

Use isnull or a case statement:
select distinct isnull(id,0) from Employee
select distinct case when id is null then 0 else id end as id from Employee

SQL query to give distinct values from one column, and a count of distinct values from a second column

Say I have a table like this:
column1 | column2
---------------------
1 | a
1 | b
1 | c
2 | a
2 | b
I need an SQL query to show the distinct values from column 1, and a count of the related distinct values from column 2. The output would look like:
column1 | count
-------------------
1 | 3
2 | 2

You could do something like this:
SELECT column1, count(column2)
FROM table
GROUP BY column1

You should do a COUNT(DISTINCT ...) with a GROUP BY:
Select Column1,
Count(Distinct Column2) As Count
From Table
Group By Column1
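The difference between the two answers only shows up when column2 repeats within a group. The sketch below uses Python's sqlite3 module to compare them; the table name t and the extra duplicate row (2, 'a') are assumptions added for the demonstration and are not in the question's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (column1 INTEGER, column2 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    # The last row is an added duplicate to make the two counts diverge.
    [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b"), (2, "a")],
)

plain = conn.execute(
    "SELECT column1, COUNT(column2) FROM t GROUP BY column1 ORDER BY column1"
).fetchall()
distinct = conn.execute(
    "SELECT column1, COUNT(DISTINCT column2) FROM t GROUP BY column1 ORDER BY column1"
).fetchall()

print(plain)
print(distinct)
```

On the question's original five rows both queries return the same counts, which is why the plain COUNT can look correct at first.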

How to pivot or 'merge' rows with column names?

I have the following table:
crit_id | criterium | val1 | val2
----------+------------+-------+--------
1 | T01 | 9 | 9
2 | T02 | 3 | 5
3 | T03 | 4 | 9
4 | T01 | 2 | 3
5 | T02 | 5 | 1
6 | T03 | 6 | 1
I need to convert the values in 'criterium' into columns as a 'cross product' with val1 and val2. So the result has to look like:
T01_val1 |T01_val2 |T02_val1 |T02_val2 | T03_val1 | T03_val2
---------+---------+---------+---------+----------+---------
9 | 9 | 3 | 5 | 4 | 9
2 | 3 | 5 | 1 | 6 | 1
Or to say differently: I need every value for all criteria to be in one row.
This is my current approach:
select
case when criterium = 'T01' then val1 else null end as T01_val1,
case when criterium = 'T01' then val2 else null end as T01_val2,
case when criterium = 'T02' then val1 else null end as T02_val1,
case when criterium = 'T02' then val2 else null end as T02_val2,
case when criterium = 'T03' then val1 else null end as T03_val1,
case when criterium = 'T03' then val2 else null end as T03_val2
from crit_table;
But the result does not look the way I want:
T01_val1 |T01_val2 |T02_val1 |T02_val2 | T03_val1 | T03_val2
---------+---------+---------+---------+----------+---------
9 | 9 | null | null | null | null
null | null | 3 | 5 | null | null
null | null | null | null | 4 | 9
What's the fastest way to achieve my goal?
Bonus question:
I have 77 criteria and seven different kinds of values for every criterion. So I have to write 539 case statements. What's the best way to create them dynamically?
I'm working with PostgreSQL 9.4.

Prepare for crosstab
In order to use the crosstab() function, the data must be reorganized. You need a dataset with three columns (row number, criterium, value). To have all values in one column you must unpivot the last two columns, changing the names of the criteria at the same time. As a row number you can use the rank() function over partitions by the new criteria.
select rank() over (partition by criterium order by crit_id), criterium, val
from (
select crit_id, criterium || '_v1' criterium, val1 val
from crit
union
select crit_id, criterium || '_v2' criterium, val2 val
from crit
) sub
order by 1, 2
rank | criterium | val
------+-----------+-----
1 | T01_v1 | 9
1 | T01_v2 | 9
1 | T02_v1 | 3
1 | T02_v2 | 5
1 | T03_v1 | 4
1 | T03_v2 | 9
2 | T01_v1 | 2
2 | T01_v2 | 3
2 | T02_v1 | 5
2 | T02_v2 | 1
2 | T03_v1 | 6
2 | T03_v2 | 1
(12 rows)
This dataset can be used in crosstab():
create extension if not exists tablefunc;
select * from crosstab($ct$
select rank() over (partition by criterium order by crit_id), criterium, val
from (
select crit_id, criterium || '_v1' criterium, val1 val
from crit
union
select crit_id, criterium || '_v2' criterium, val2 val
from crit
) sub
order by 1, 2
$ct$)
as ct (rank bigint, "T01_v1" int, "T01_v2" int,
"T02_v1" int, "T02_v2" int,
"T03_v1" int, "T03_v2" int);
rank | T01_v1 | T01_v2 | T02_v1 | T02_v2 | T03_v1 | T03_v2
------+--------+--------+--------+--------+--------+--------
1 | 9 | 9 | 3 | 5 | 4 | 9
2 | 2 | 3 | 5 | 1 | 6 | 1
(2 rows)
Alternative solution
For 77 criteria * 7 parameters the above query may be troublesome. If you can accept a bit different way of presenting the data, the issue becomes much easier.
select * from crosstab($ct$
select
rank() over (partition by criterium order by crit_id),
criterium,
concat_ws(' | ', val1, val2) vals
from crit
order by 1, 2
$ct$)
as ct (rank bigint, "T01" text, "T02" text, "T03" text);
rank | T01 | T02 | T03
------+-------+-------+-------
1 | 9 | 9 | 3 | 5 | 4 | 9
2 | 2 | 3 | 5 | 1 | 6 | 1
(2 rows)

DECLARE @Table1 TABLE
(crit_id int, criterium varchar(3), val1 int, val2 int)
;
INSERT INTO @Table1
(crit_id, criterium, val1, val2)
VALUES
(1, 'T01', 9, 9),
(2, 'T02', 3, 5),
(3, 'T03', 4, 9),
(4, 'T01', 2, 3),
(5, 'T02', 5, 1),
(6, 'T03', 6, 1)
;
-- Unpivot val1 and val2 into one column, number the rows per criterium, then pivot
select [T01] As [T01_val1], [T01-1] As [T01_val2], [T02] As [T02_val1], [T02-1] As [T02_val2], [T03] As [T03_val1], [T03-1] As [T03_val2]
from (
select T.criterium, T.val1, ROW_NUMBER() OVER (PARTITION BY T.criterium ORDER BY (SELECT NULL)) RN
from (
select criterium, val1 from @Table1
UNION ALL
select criterium + '-' + '1', val2 from @Table1) T) PP
PIVOT (MAX(val1) FOR criterium IN ([T01],[T02],[T03],[T01-1],[T02-1],[T03-1])) P

I agree with Michael's comment that this requirement looks a bit weird, but if you really need it that way, you were on the right track with your solution. It just needs a little bit of additional code (and small corrections wherever val_1 and val_2 were mixed up):
select
sum(case when criterium = 'T01' then val_1 else null end) as T01_val1,
sum(case when criterium = 'T01' then val_2 else null end) as T01_val2,
sum(case when criterium = 'T02' then val_1 else null end) as T02_val1,
sum(case when criterium = 'T02' then val_2 else null end) as T02_val2,
sum(case when criterium = 'T03' then val_1 else null end) as T03_val1,
sum(case when criterium = 'T03' then val_2 else null end) as T03_val2
from
crit_table
group by
trunc((crit_id-1)/3.0)
order by
trunc((crit_id-1)/3.0);
This works as follows. To aggregate the result you posted into the result you would like to have, the first helpful observation is that the desired result has fewer rows than your preliminary one. So some kind of grouping is necessary, and the key question is: "What's the grouping criterion?" In this case, it's rather non-obvious: it's the criterion ID (minus 1, to start counting at 0) divided by 3, and truncated. The three comes from the number of different criteria. Once that puzzle is solved, it is easy to see that among the input rows that are aggregated into the same result row, there is only one non-null value per column. That means the choice of aggregate function is not so important, as it only needs to return the single non-null value. I used sum in my code snippet, but you could just as well use min or max.
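The grouping idea can be checked on the question's six rows with a small sketch using Python's sqlite3 module. This is an illustration under assumptions: it runs on SQLite rather than PostgreSQL, uses the column names val1 and val2 from the question's table, relies on SQLite's integer division in place of trunc(), and picks max as the aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crit_table (crit_id INTEGER, criterium TEXT, val1 INTEGER, val2 INTEGER)"
)
conn.executemany(
    "INSERT INTO crit_table VALUES (?, ?, ?, ?)",
    [(1, "T01", 9, 9), (2, "T02", 3, 5), (3, "T03", 4, 9),
     (4, "T01", 2, 3), (5, "T02", 5, 1), (6, "T03", 6, 1)],
)

# (crit_id - 1) / 3 is integer division here: ids 1-3 give 0, ids 4-6 give 1.
query = """
SELECT MAX(CASE WHEN criterium = 'T01' THEN val1 END) AS T01_val1,
       MAX(CASE WHEN criterium = 'T01' THEN val2 END) AS T01_val2,
       MAX(CASE WHEN criterium = 'T02' THEN val1 END) AS T02_val1,
       MAX(CASE WHEN criterium = 'T02' THEN val2 END) AS T02_val2,
       MAX(CASE WHEN criterium = 'T03' THEN val1 END) AS T03_val1,
       MAX(CASE WHEN criterium = 'T03' THEN val2 END) AS T03_val2
FROM crit_table
GROUP BY (crit_id - 1) / 3
ORDER BY (crit_id - 1) / 3
"""
pivoted = conn.execute(query).fetchall()
print(pivoted)
```

The grouping expression only works while crit_id values are consecutive and every block contains each criterion exactly once; gaps in the ids would shift rows into the wrong group.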
As for the bonus question: Use a code generator query that generates the query you need. The code looks like this (with only three types of values to keep it brief):
with value_table as /* possible kinds of values, add the remaining ones here */
(select 'val_1' value_type union
select 'val_2' value_type union
select 'val_3' value_type )
select contents from (
select 0 order_id, 'select' contents
union
select row_number() over () order_id,
'max(case when criterium = '''||criterium||''' then '||value_type||' else null end) '||criterium||'_'||value_type||',' contents
from (select distinct criterium from crit_table) c /* one generated column per criterium, not per row */
cross join value_table
union select 9999999 order_id,
' from crit_table group by trunc((crit_id-1)/3.0) order by trunc((crit_id-1)/3.0);' contents
) v
order by order_id;
This basically uses a string template of your query and inserts the appropriate combinations of values for the criteria and the val-columns. You could even get rid of the with-clause by reading column names from information_schema.columns, but I think the basic idea is clearer in the version above. Note that the generated code contains one comma too many directly after the last column (before the from clause). It's easier to delete that by hand afterwards than to correct it in the generator.
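If the query text is generated outside the database instead, joining the column expressions with a separator avoids the trailing comma altogether. The following Python sketch shows that variant; the function name build_pivot_query and its parameters are hypothetical, and the criteria and value column names are assumed to be trusted identifiers, since they are pasted into the SQL text directly.

```python
def build_pivot_query(criteria, value_types, table="crit_table"):
    """Return the text of a conditional-aggregation pivot query.

    criteria and value_types must be trusted identifiers: they are
    interpolated into the SQL text without any escaping.
    """
    columns = [
        f"max(case when criterium = '{crit}' then {vt} else null end) as {crit}_{vt}"
        for crit in criteria
        for vt in value_types
    ]
    # One output row per block of len(criteria) consecutive crit_id values.
    group_expr = f"trunc((crit_id - 1) / {len(criteria)}.0)"
    return (
        "select\n  "
        + ",\n  ".join(columns)  # join puts commas only between columns
        + f"\nfrom {table}\ngroup by {group_expr}\norder by {group_expr};"
    )


generated = build_pivot_query(["T01", "T02", "T03"], ["val1", "val2"])
print(generated)
```

For the full case, the two lists would hold the 77 criteria and the seven value column names, giving the 539 expressions mentioned in the question.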
