SQL to combine rows into a single row based on a column - sql

I have a table as follows:
+---+---+---+
|obj|col|Val|
+---+---+---+
|1 |c1 | v1|
+---+---+---+
|1 |c2 | v2|
+---+---+---+
|2 |c1 | v3|
+---+---+---+
|2 |c2 | v4|
+---+---+---+
And I am looking for SQL that will give the result in the following format
+---+---+---+
|obj|c1 |c2 |
+---+---+---+
|1 |v1 | v2|
+---+---+---+
|2 |v3 | v4|
+---+---+---+

In this SQL, the CASE checks whether col = 'c?' and returns the corresponding Val. The GROUP BY is there to avoid the NULL values produced whenever the condition doesn't match: grouping on obj collapses each group so that MAX keeps only the non-NULL value in each column and produces the desired result.
SELECT obj,
MAX( CASE WHEN col = 'c1' THEN Val END ) AS c1,
MAX( CASE WHEN col = 'c2' THEN Val END ) AS c2
FROM Table
GROUP BY obj;
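To see why the GROUP BY matters, here is an illustration derived from the sample data above: before aggregation, the two CASE expressions turn each source row into a sparse row with one value and one NULL.
|obj|c1  |c2  |
|1  |v1  |NULL|
|1  |NULL|v2  |
|2  |v3  |NULL|
|2  |NULL|v4  |
MAX over each obj group then keeps the single non-NULL value per column, which is why the NULLs disappear from the final result.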

First you need to select all the unique ids from your table:
select distinct id
from a_table_you_did_not_name
Now you can use that to left join to your columns:
select base.id, one.val as c1, two.val as c2
from (
select distinct id
from a_table_you_did_not_name
) base
left join a_table_you_did_not_name one on one.id = base.id and one.col = 'c1'
left join a_table_you_did_not_name two on two.id = base.id and two.col = 'c2'
Note: your case is a relatively simple instance of this kind of join. I coded it this way because the method extends to more complicated cases and still works; there are other approaches to this particular requirement that might be simpler.
The most common complication is that the columns come from multiple tables rather than all from the same one. The method still works in those cases, as sketched below.
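A hedged sketch of that multi-table case (the names first_table, second_table and val are assumptions, not taken from the question):
select base.id, one.val as c1, two.val as c2
from (
select id from first_table
union
select id from second_table
) base
left join first_table one on one.id = base.id
left join second_table two on two.id = base.id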

Update/merge single value in group by where all columns have identical values in each group

I have a data quality task where I need to allow one row in each group to remain unchanged, and then update the 'duplicate' column in the Delta table for the rest of the rows in the group to 'true'. Initially, when the data is loaded into the table, all values in the 'duplicate' column are 'false'.
The data is stored in Delta format and I am using Spark in Databricks.
An example is shown below. I then want to run a query which updates the 'duplicate' column to 'true' for all but one row in each group, leaving a single row as 'false'. This is so a downstream pipeline can still pick up one of the rows where we have duplicates for downstream processing.
The table starts out like this:
|ID|Value1|Value2|Duplicate|
|23|a     |b     |false    |
|23|a     |b     |false    |
|24|c     |d     |false    |
|25|e     |f     |false    |
|26|g     |h     |false    |
|26|g     |h     |false    |
and I need to end up with this:
|ID|Value1|Value2|Duplicate|
|23|a     |b     |false    |
|23|a     |b     |true     |
|24|c     |d     |false    |
|25|e     |f     |false    |
|26|g     |h     |false    |
|26|g     |h     |true     |
I could of course simply call spark.table("myTable").dropDuplicates(), but I would like to assess how big the duplicate problem is: an external supplier provides the data, and we need to understand whether excessive retries on their system are pushing up costs.
I am struggling to find a way to change all but one entry where I do not have any unique identifier to use. I have tried many different ways, but have failed. I have included what I would like to achieve below (this will clearly fail).
WITH CTE AS (
SELECT
ID,
Duplicate,
row_number() over (partition by ID) AS rn,
FROM myTable
)
UPDATE myTable SET Duplicate = 'true' WHERE (SELECT rn FROM CTE) > 1
I am unsure if this is even possible without having some unique identifier. If possible I would like to try and avoid using any non-deterministic hashes just to accomplish this as the dataset is very large.
I can use any of Scala, PySpark or SQL within Spark so language isn't vital.
Any pointers would be greatly appreciated.
For SQL, something like this should do the trick.
WITH duplicates
AS (
SELECT
ID,
Duplicate,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) AS rn
FROM
myTable
)
UPDATE duplicates
SET Duplicate = 'true'
WHERE rn > 1
In PySpark you can try row_number as below:
from pyspark.sql.functions import col, row_number, when
from pyspark.sql.window import Window
df1.withColumn("rank",row_number().over(Window.partitionBy("ID").orderBy("ID"))).withColumn("duplicate",when(col("rank")>1,True).otherwise(False)).drop("rank").show()
#output
+---+------+------+---------+
| Id|value1|value2|duplicate|
+---+------+------+---------+
| 23| a| b| false|
| 23| a| b| true|
| 24| c| d| false|
| 25| e| f| false|
| 26| g| h| false|
| 26| g| h| true|
+---+------+------+---------+
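As a follow-up sketch of my own (not part of either answer above), since the goal is also to measure how big the duplicate problem is: a plain aggregation over the same table counts the surplus rows per group without touching the Duplicate column.
SELECT ID, Value1, Value2, COUNT(*) - 1 AS surplus_rows
FROM myTable
GROUP BY ID, Value1, Value2
HAVING COUNT(*) > 1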

Multiple group by and multiple display counts in Spark SQL?

I am new to Spark and I just want to ask a question related to Spark SQL. Let's consider this EMPLOYEE table:
Employee Sub_department Department
A 105182 10
A 105182 10 (data can be redundant !)
A 114256 11
A 127855 12
A 125182 12
B 136234 13
B 133468 13
Department is defined as substring(sub_department, 0, 2) to extract only the first 2 digits of the sub_department.
What I want is to split the employees into 3 sets:
Set 1 : Employees having at least 3 different departments (regardless of their sub_departments)
Set 2 : Employees having at least 5 different sub_departments AND 2 different departments
Set 3 : Employees having at least 10 different sub_departments with the same department
Concretely, I have no idea how to do this even in classical SQL. But I think the final output could look something like this:
Employee Sub_department total_sub_dept Department total_dept
A 105182 4 10 3
A 114256 4 11 3
A 127855 4 12 3
A 125182 4 12 3
And "eventually" a column named "Set" to show in which set an employee can belong to, but it's optional and I'm scared it will be too heavy to compute such a value...
It's important to display the different values AND the count for each of the 2 columns (the sub_department and the department).
I have a very big table (with many columns and a lot of potentially redundant data), so I thought of doing a first partition on the sub_department and storing it in a first table, then a second partition on the department (regardless of the sub_department value) and storing it in a second table, and finally doing an inner join between the two tables on the employee name.
But I got some wrong results, and I don't know if there is a better way to do this, or at least an optimisation, since the department column depends on the sub_department (so one group by rather than two).
So, how can I fix this? I tried, but it seems impossible to get a count for each of the 2 columns (sub_department and department) in the same query.
I'll help you out with the Set 1 requirement just to get you started. Work through the query below; once you understand it, Set 2 and Set 3 are very similar (a hedged sketch for Set 2 is added right after the query).
SELECT
employee,
total_dept
FROM
(
SELECT
employee,
COUNT(Department) AS total_dept
FROM
(
select
employee,
Sub_department,
SUBSTRING(Sub_department,0,2) AS Department,
ROW_NUMBER() OVER (partition by employee, SUBSTRING(Sub_department,0,2) ORDER BY Sub_department) AS redundancy
FROM
employee_table
)
WHERE redundancy = 1
GROUP BY employee
) WHERE total_dept >= 3
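For reference, a hedged sketch for Set 2 (my own addition, not part of the original answer; it assumes the table is called employee_table and that a plain COUNT(DISTINCT ...) is acceptable):
SELECT employee
FROM
(
SELECT
employee,
COUNT(DISTINCT Sub_department) AS total_sub_dept,
COUNT(DISTINCT SUBSTRING(Sub_department,0,2)) AS total_dept
FROM
employee_table
GROUP BY employee
) counts
WHERE total_sub_dept >= 5 AND total_dept >= 2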
EDIT1:
SELECT
full_data.employee,
full_data.Sub_department,
total_sub_dept_count.total_sub_dept,
SUBSTRING(full_data.Sub_department,0,2) AS Department,
total_dept_count.total_dept
FROM
(
SELECT
employee,
COUNT(Department) AS total_dept
FROM
(
select
employee,
Sub_department,
SUBSTRING(Sub_department,0,2) AS Department,
ROW_NUMBER() OVER (partition by employee, SUBSTRING(Sub_department,0,2) ORDER BY Sub_department) AS redundancy
FROM
employee_table
)
WHERE redundancy = 1
GROUP BY employee
) total_dept_count
JOIN
(
SELECT
employee,
COUNT(Sub_department) AS total_sub_dept
FROM
(
select
employee,
Sub_department,
ROW_NUMBER() OVER (partition by employee, Sub_department ORDER BY Sub_department) AS redundancy
FROM
employee_table
)
WHERE redundancy = 1
GROUP BY employee
) total_sub_dept_count
ON(total_dept_count.employee = total_sub_dept_count.employee)
JOIN
employee_table full_data
ON(total_sub_dept_count.employee = full_data.employee)
You can use the window function collect_set() together with size() and get the result. Check this out:
scala> val df = Seq(("A","105182","10"), ("A","105182","10" ), ("A","114256","11"), ("A","127855","12"), ("A","125182","12"), ("B","136234","13"), ("B","133468","13")).toDF("emp","subdept","dept")
df: org.apache.spark.sql.DataFrame = [emp: string, subdept: string ... 1 more field]
scala> df.printSchema
root
|-- emp: string (nullable = true)
|-- subdept: string (nullable = true)
|-- dept: string (nullable = true)
scala> df.show
+---+-------+----+
|emp|subdept|dept|
+---+-------+----+
| A| 105182| 10|
| A| 105182| 10|
| A| 114256| 11|
| A| 127855| 12|
| A| 125182| 12|
| B| 136234| 13|
| B| 133468| 13|
+---+-------+----+
scala> val df2 = df.withColumn("dept2",substring('subdept,3,7))
df2: org.apache.spark.sql.DataFrame = [emp: string, subdept: string ... 2 more fields]
scala> df2.createOrReplaceTempView("salaman")
scala> spark.sql(""" select *, size(collect_set(subdept) over(partition by emp)) sub_dep_count, size(collect_set(dept) over(partition by emp)) dep_count from salaman """).show(false)
+---+-------+----+-----+-------------+---------+
|emp|subdept|dept|dept2|sub_dep_count|dep_count|
+---+-------+----+-----+-------------+---------+
|B |136234 |13 |6234 |2 |1 |
|B |133468 |13 |3468 |2 |1 |
|A |105182 |10 |5182 |4 |3 |
|A |105182 |10 |5182 |4 |3 |
|A |125182 |12 |5182 |4 |3 |
|A |114256 |11 |4256 |4 |3 |
|A |127855 |12 |7855 |4 |3 |
+---+-------+----+-----+-------------+---------+
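A possible follow-up (my own sketch, not part of the answer above): with sub_dep_count and dep_count attached to every row, each of the three sets becomes a simple filter over the same query, e.g. Set 1 run through spark.sql against the salaman view:
select distinct emp
from (
select emp, size(collect_set(dept) over(partition by emp)) as dep_count
from salaman
) t
where dep_count >= 3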

sql query sort but display specific values on top

For example, I have a table with names as column 1 and dates as column 2. I want a query where specific names are on top, and the dates are sorted in descending order.
|names|dates|
|a |2016 |
|b |2013 |
|c |2017 |
|d |2011 |
I want to display a table where b and c are on top with their dates sorted in descending order, and the rest of the names follow, also sorted by date in descending order. It's like having two groups: b and c with their dates sorted, and the rest with their dates sorted separately. For example:
|names|dates|
|c |2017 |
|b |2013 |
|a |2016 |
|d |2011 |
What sql query should I use?
If you want all b and c rows on top sorted by dates, then the rest, try this:
ORDER BY (CASE WHEN names = 'b' or names = 'c' THEN '1' else '2' END) ASC, dates desc
DEMO
http://sqlfiddle.com/#!9/8c6ba6/1
If you want all b on top, then all c, then the rest, try this:
ORDER BY (CASE WHEN names = 'b' THEN '1'
WHEN names = 'c' THEN '2' else '3' END) ASC, dates desc
DEMO
http://sqlfiddle.com/#!9/8c6ba6/2
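For completeness, a sketch of the full statement for the first variant (the table name my_table is an assumption):
SELECT names, dates
FROM my_table
ORDER BY (CASE WHEN names = 'b' OR names = 'c' THEN '1' ELSE '2' END) ASC,
dates DESC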

Aggregate multiple select statements without replicating data

How do I aggregate 2 select clauses without replicating data?
For instance, suppose I have tab_a that contains the data from 1 to 10:
|id|
|1 |
|2 |
|3 |
|. |
|. |
|10|
And then, I want to generate the combinations of tab_b and tab_c, making sure the result has 10 rows, and add the column of tab_a to each result tuple.
Script:
SELECT tab_b.id, tab_c.id, tab_a.id
from tab_b, tab_c, tab_a;
However, this replicates the data from tab_a for every combination of tab_b and tab_c; what I want is that each combination of tab_b x tab_c gets exactly one row of tab_a added to it.
Example of data from tab_b
|id|
|1 |
|2 |
Example of data from tab_c
|id|
|1 |
|2 |
|3 |
|4 |
|5 |
I would like to get this output:
|tab_b.id|tab_c.id|tab_a.id|
|1 |1 |1 |
|2 |1 |2 |
|1 |2 |3 |
|... |... |... |
|2 |5 |10 |
Your question includes an unstated, invalid assumption: that the position of the values in the table (the row number) is meaningful in SQL. It's not. In SQL, rows have no order. All joins -- everything, in fact -- are based on values. To join tables, you have to supply the values the DBMS should use to determine which rows go together.
You got a hint of that with your attempted join: from tab_b, tab_c, tab_a. You didn't supply any basis for joining the rows, which in SQL means there's no restriction: all rows are "the same" for the purpose of this join. They all match, and voila, you get them all!
To do what you want, redesign your tables with at least one more column: the key that serves to identify the value. It could be a number; for example, your source data might be an array. More commonly each value has a name of some kind.
Once you have tables with keys, I think you'll find the join easier to write and understand.
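A hedged illustration of that advice (the pos column and the tab_bc table are assumptions of mine, not from the question): once every row carries an explicit key, the pairing becomes an ordinary equi-join instead of a guess about row order.
SELECT bc.b_id, bc.c_id, a.id AS a_id
FROM tab_a a
JOIN tab_bc bc ON bc.pos = a.pos -- tab_bc holds the pre-numbered b x c combinations; tab_a carries a matching pos column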
Perhaps you're new to SQL, but this is generally not the way things are done with RDBMSs. Anyway, if this is what you need, PostgreSQL can deal with it nicely, using different strategies:
Window Functions:
with
tab_a (id) as (select generate_series(1,10)),
tab_b (id) as (select generate_series(1,2)),
tab_c (id) as (select generate_series(1,5))
select tab_b_id, tab_c_id, tab_a.id
from (select *, row_number() over () from tab_a) as tab_a
left join (
select tab_b.id as tab_b_id, tab_c.id as tab_c_id, row_number() over ()
from tab_b, tab_c
order by 2, 1
) tabs_b_c ON (tabs_b_c.row_number = tab_a.row_number)
order by tab_a.id;
Arrays:
with
tab_a (id) as (select generate_series(1,10)),
tab_b (id) as (select generate_series(1,2)),
tab_c (id) as (select generate_series(1,5))
select bc[s][1], bc[s][2], a[s]
from (
select array(
select id
from tab_a
order by 1
) a,
array(
select array[tab_b.id, tab_c.id]
from tab_b, tab_c
order by tab_c.id, tab_b.id
) bc
) arr
join lateral generate_subscripts(arr.a, 1) s on true
If I understand your question correctly, maybe this is what you are looking for:
SELECT bctable.b_id, bctable.c_id, atable.a_id
FROM (SELECT a_id, ROW_NUMBER () OVER () AS arnum FROM a) atable
JOIN (SELECT p.b_id, p.c_id, ROW_NUMBER () OVER () AS bcrnum
FROM ( SELECT b.b_id, c.c_id
FROM b CROSS JOIN c
ORDER BY c.c_id, b.b_id) p) bctable
ON atable.arnum = bctable.bcrnum
Please check the SQLFiddle.

Aggregate by aggregate (ARRAY_AGG)?

Let's say I have a simple table agg_test with 3 columns - id, column_1 and column_2. Dataset, for example:
id|column_1|column_2
--------------------
1| 1| 1
2| 1| 2
3| 1| 3
4| 1| 4
5| 2| 1
6| 3| 2
7| 4| 3
8| 4| 4
9| 5| 3
10| 5| 4
A query like this (with self join):
SELECT
a1.column_1,
a2.column_1,
ARRAY_AGG(DISTINCT a1.column_2 ORDER BY a1.column_2)
FROM agg_test a1
JOIN agg_test a2 ON a1.column_2 = a2.column_2 AND a1.column_1 <> a2.column_1
WHERE a1.column_1 = 1
GROUP BY a1.column_1, a2.column_1
Will produce a result like this:
column_1|column_1|array_agg
---------------------------
1| 2| {1}
1| 3| {2}
1| 4| {3,4}
1| 5| {3,4}
We can see that for values 4 and 5 from the joined table we have the same result in the last column. So, is it possible to somehow group the results by it, e.g:
column_1|column_1|array_agg
---------------------------
1| {2}| {1}
1| {3}| {2}
1| {4,5}| {3,4}
Thanks for any answers. If anything isn't clear or can be presented in a better way - tell me in the comments and I'll try to make this question as readable as I can.
I'm not sure if you can aggregate by an array. If you can, here is one approach:
select col1, array_agg(col2), ar
from (SELECT a1.column_1 as col1, a2.column_1 as col2,
ARRAY_AGG(DISTINCT a1.column_2 ORDER BY a1.column_2) as ar
FROM agg_test a1 JOIN
agg_test a2
ON a1.column_2 = a2.column_2 AND a1.column_1 <> a2.column_1
WHERE a1.column_1 = 1
GROUP BY a1.column_1, a2.column_1
) t
group by col1, ar
The alternative is to use array_to_string() to convert the array values into a string.
You could also try something like this:
SELECT DISTINCT
a1.column_1,
ARRAY_AGG(a2.column_1) OVER (
PARTITION BY
a1.column_1,
ARRAY_AGG(DISTINCT a1.column_2 ORDER BY a1.column_2)
) AS "a2.column_1 agg",
ARRAY_AGG(DISTINCT a1.column_2 ORDER BY a1.column_2)
FROM agg_test a1
JOIN agg_test a2 ON a1.column_2 = a2.column_2 AND a1.column_1 <> a2.column_1
WHERE a1.column_1 = 1
GROUP BY a1.column_1, a2.column_1
;
(The parts that differ from the query posted in the question are the DISTINCT and the window ARRAY_AGG.)
The above uses a window ARRAY_AGG to combine the values of a2.column_1 alongside the other ARRAY_AGG, using the latter's result as one of the partitioning criteria. Without the DISTINCT, it would produce two {4,5} rows for your example. So, DISTINCT is needed to eliminate the duplicates.
Here's a SQL Fiddle demo: http://sqlfiddle.com/#!1/df5c3/4
Note, though, that the window ARRAY_AGG cannot have an ORDER BY like its "normal" counterpart. That means the order of a2.column_1 values in the list would be indeterminate, although in the linked demo it does happen to match the one in your expected output.