PySpark or SQL: consuming coalesce

I'm trying to coalesce multiple input columns into multiple output columns in either a PySpark DataFrame or a SQL table.
Each output column would contain the "first available" input value, and then "consume" it, so that value is unavailable to the following output columns.
+----+-----+-----+-----+-----+-----+---+------+------+------+
| ID | in1 | in2 | in3 | in4 | in5 | / | out1 | out2 | out3 |
+----+-----+-----+-----+-----+-----+---+------+------+------+
| 1  |     |     | C   |     |     | / | C    |      |      |
| 2  | A   |     | C   |     | E   | / | A    | C    | E    |
| 3  | A   | B   | C   |     |     | / | A    | B    | C    |
| 4  | A   | B   | C   | D   | E   | / | A    | B    | C    |
| 5  |     |     |     |     |     | / |      |      |      |
| 6  |     | B   |     |     | E   | / | B    | E    |      |
| 7  |     | B   |     | D   | E   | / | B    | D    | E    |
+----+-----+-----+-----+-----+-----+---+------+------+------+
What's the best way to do this?
Edit (clarification): in1, in2, in3, etc. can hold any value.

Here is one way: collect the input columns into an array, drop the nulls, then pick elements by position.
import pyspark.sql.functions as f

df = spark.read.option("header", "true").option("inferSchema", "true").csv("test.csv")

cols = df.columns
cols.remove('ID')

# gather the input columns into one array and remove the nulls
df2 = df.withColumn('ins', f.array_except(f.array(*cols), f.array(f.lit(None))))

# the first three surviving values become out1..out3; missing positions yield null
for i in range(0, 3):
    df2 = df2.withColumn('out' + str(i + 1), f.col('ins')[i])

df2.show(10, False)
+---+----+----+----+----+----+---------------+----+----+----+
|ID |in1 |in2 |in3 |in4 |in5 |ins |out1|out2|out3|
+---+----+----+----+----+----+---------------+----+----+----+
|1 |null|null|C |null|null|[C] |C |null|null|
|2 |A |null|C |null|E |[A, C, E] |A |C |E |
|3 |A |B |C |null|null|[A, B, C] |A |B |C |
|4 |A |B |C |D |E |[A, B, C, D, E]|A |B |C |
|5 |null|null|null|null|null|[] |null|null|null|
|6 |null|B |null|null|E |[B, E] |B |E |null|
|7 |null|B |null|D |E |[B, D, E] |B |D |E |
+---+----+----+----+----+----+---------------+----+----+----+
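Note that array_except also deduplicates, so a value appearing in several input columns would surface only once. If duplicates must be preserved, a variant built on the filter higher-order function only drops the nulls; a sketch, assuming Spark 2.4+ and reusing cols from above:
# filter() keeps order and duplicates; it only removes the nulls
df3 = df.withColumn('ins', f.expr("filter(array({}), x -> x IS NOT NULL)".format(','.join(cols))))
for i in range(0, 3):
    df3 = df3.withColumn('out' + str(i + 1), f.col('ins')[i])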

How to add columns to a pandas pivot table (multi-column)

How can I add columns? This is the DataFrame to pivot:
|date |country|type|qty|
|----------|-------|----|---|
|2021/03/01|jp |A |10 |
|2021/03/01|en |C |20 |
|2021/03/01|jp |C |15 |
|2021/03/02|jp |A |10 |
|2021/03/02|en |A |20 |
|2021/03/02|en |C |15 |
After pivoting:
|     | 2021/03/01 | 2021/03/02 |
|     | jp  | en   | jp  | en   |
|-----|-----|------|-----|------|
| A   | 10  | 0    | 50  | 30   |
| C   | 15  | 15   | 0   | 75   |
I would like to add a "rate" column:
|     | 2021/03/01 | 2021/03/02 |
|     | jp        | en        | jp        | en        |
|     | cnt | rate | cnt | rate | cnt | rate | cnt | rate |
|-----|-----|------|-----|------|-----|------|-----|------|
| A   | 10  | 0.4  | 0   | 0    | 50  | 1    | 30  | 0.26 |
| C   | 15  | 0.6  | 15  | 1    | 0   | 0    | 85  | 0.74 |
You can use concat with the keys parameter, dividing the values by the column sums, then reorder the levels with DataFrame.reorder_levels and sort the MultiIndex:
#change to your function if necessary
df1 = df.pivot_table(index='type', columns=['date','country'], values='qty', fill_value=0)
print (df1)
date    2021/03/01    2021/03/02
country         en jp         en jp
type
A                0 10         20 10
C               20 15         15  0
# rates = each column divided by its column sum, concatenated next to the counts
df = (pd.concat([df1, df1.div(df1.sum())], axis=1, keys=('cnt', 'rate'))
        .reorder_levels([1, 2, 0], axis=1)  # put cnt/rate as the innermost column level
        .sort_index(axis=1))                # group the columns by date, then country
print (df)
date    2021/03/01        2021/03/02
country         en      jp         en       jp
               cnt rate cnt rate  cnt rate  cnt rate
type
A                0  0.0  10  0.4   20  0.571429  10  1.0
C               20  1.0  15  0.6   15  0.428571   0  0.0
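For reference, a minimal setup that reproduces df above (names and values taken from the question's sample data):
import pandas as pd

df = pd.DataFrame({
    'date': ['2021/03/01', '2021/03/01', '2021/03/01',
             '2021/03/02', '2021/03/02', '2021/03/02'],
    'country': ['jp', 'en', 'jp', 'jp', 'en', 'en'],
    'type': ['A', 'C', 'C', 'A', 'A', 'C'],
    'qty': [10, 20, 15, 10, 20, 15],
})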

How to give an alias to an unnested array of tuples?

I'm trying to set up a set of data using this query:
WITH CTE AS (
SELECT *
FROM UNNEST([("a",NULL),("b",NULL),("c",1),("d",1),("e",2),("f",3),("g",4),("h",4),("i",4),("j",4),("k",5),("l",5),("m",6),("n",7),("o",7)])
)
SELECT *
FROM CTE
The yielded result is:
|Row|f0_|f1_ |
|---|---|----|
| 1 | a |null|
| 2 | b |null|
| 3 | c |1 |
| 4 | d |1 |
| 5 | e |2 |
| 6 | f |3 |
| 7 | g |4 |
| 8 | h |4 |
| 9 | i |4 |
| 10| j |4 |
| 11| k |5 |
| 12| l |5 |
| 13| m |6 |
| 14| n |7 |
| 15| o |7 |
What I want is:
|Row| x | y |
|---|---|----|
| 1 | a |null|
| 2 | b |null|
| 3 | c |1 |
| 4 | d |1 |
| 5 | e |2 |
| 6 | f |3 |
| 7 | g |4 |
| 8 | h |4 |
| 9 | i |4 |
| 10| j |4 |
| 11| k |5 |
| 12| l |5 |
| 13| m |6 |
| 14| n |7 |
| 15| o |7 |
Use STRUCT. Only the first element of the array literal needs the field names; BigQuery infers the array's STRUCT type, names included, from it and applies the names to the remaining tuples:
WITH CTE AS (
SELECT *
FROM UNNEST([STRUCT("a" as x,NULL as y),("b",NULL),("c",1),("d",1),("e",2),("f",3),("g",4),("h",4),("i",4),("j",4),("k",5),("l",5),("m",6),("n",7),("o",7)])
)
SELECT *
FROM CTE
You can use this trick with almost any SQL dialect:
WITH CTE AS (
SELECT NULL AS x, NULL AS y -- zero-row anchor that fixes the column names
FROM (SELECT 1) T
WHERE FALSE
UNION ALL
SELECT *
FROM UNNEST([
("a",NULL),("b",NULL),("c",1),("d",1),("e",2),("f",3),
("g",4),("h",4),("i",4),("j",4),("k",5),("l",5),("m",6),("n",7),("o",7)
])
)
SELECT *
FROM CTE;
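Alternatively, since the anonymous fields surface as f0_ and f1_ (as the first result table shows), you can simply alias them in the outer query; a minimal sketch on a shortened array:
WITH CTE AS (
SELECT *
FROM UNNEST([("a",NULL),("b",NULL),("c",1)])
)
SELECT f0_ AS x, f1_ AS y
FROM CTE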

SQL Server 2008: Efficient way to do the following query

I have the following data:
Input:
----------------------------
| Id | Value|
----------------------------
| 1 |A |
| 1 |B |
| 2 |C |
| 2 |D |
| 2 |E |
| 3 |F |
----------------------------
I need to convert the results to the following:
Output (Count is based on Id)
----------------------------
| Id | Value| Count|
----------------------------
| 1 |A | 2 |
| 1 |B | 2 |
| 2 |C | 3 |
| 2 |D | 3 |
| 2 |E | 3 |
| 3 |F | 1 |
----------------------------
I am using SQL Server 2008. Is it possible to write a query to do this?
If so, could anyone help me with SQL to obtain the above output from the input data?
You are looking for COUNT OVER:
select id, value, count(*) over (partition by id)
from mytable
order by id, value;
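If window functions were not an option, a join back to a grouped subquery gives the same result; a sketch, with mytable as above:
select t.id, t.value, c.cnt as [count]
from mytable t
inner join (
    -- count the rows per id once, then attach that count to every row
    select id, count(*) as cnt
    from mytable
    group by id
) c on c.id = t.id
order by t.id, t.value;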

SQL Server Query grouping Optimization

My input table is T1:
+----+------+------+------+
| Id | v1   | v2   | v3   |
+----+------+------+------+
| 1  | a    | b    | c    |
| 2  | null | b    | null |
| 3  | d    | null | null |
| 4  | null | e    | null |
| 5  | e    | f    | null |
+----+------+------+------+
My requirement: I have to compare each row with the others. If two rows have all values the same, or null/empty where the other row has a value, then I have to club their Id values together, separated by commas.
Required output:
+-----+------+------+------+
| Id  | v1   | v2   | v3   |
+-----+------+------+------+
| 1,2 | a    | b    | c    |
| 3,4 | d    | e    | null |
| 5   | e    | f    | null |
+-----+------+------+------+
Please assist. I am trying to use a while loop, but it is taking very long.
I want an optimized solution, as I have to run the statement on a large record set.

How to insert multiple rows into one table for each id of another table

I have 2 tables (20,000+ rows):
Table1:
C1_ID | C1_Name | L1_ID | C2_ID
-------------------------------
a     | Alan    | 123   | k
b     | Ben     | 345   | l
a     | Alan    | 123   | m
a     | Alan    | 453   | n
c     | John    | 111   | i
f     | Sasha   | 987   | e
c     | John    | 111   | s
c     | John    | 756   | null
z     | Peter   | 145   | null
Table2:
C2_ID | L2_ID | Category
------------------------
k     | 888   | 1
k     | 789   | 2
k     | 888   | 1
l     | 456   | 0
l     | 147   | 1
m     | 333   | 1
n     | 999   | 2
n     | 369   | 4
n     | 258   | 3
i     | 159   | 2
i     | 357   | 1
e     | 684   | 1
s     | 153   | 2
Desired output:
C1_ID | C1_Name | L1_ID | C2_ID | L2_ID | Category
--------------------------------------------------
a     | Alan    | 123   | k     | 888   | 1
a     | Alan    | 123   | k     | 789   | 2
a     | Alan    | 123   | m     | 333   | 1
a     | Alan    | 453   | n     | 999   | 2
a     | Alan    | 453   | n     | 369   | 4
a     | Alan    | 453   | n     | 258   | 3
b     | Ben     | 345   | l     | 456   | 0
b     | Ben     | 345   | l     | 147   | 1
c     | John    | 111   | i     | 159   | 2
c     | John    | 111   | i     | 357   | 1
c     | John    | 111   | s     | 153   | 2
c     | John    | 756   | null  | null  | null
f     | Sasha   | 987   | e     | 684   | 1
z     | Peter   | 145   | null  | null  | null
I need to update Table1, adding a row each time a C2_ID (with a distinct L2_ID) is found in Table2.
Each row should carry the relevant L2_ID and Category.
This is a simple join. The DISTINCT collapses Table2's duplicate rows (such as the repeated k | 888 | 1), and the LEFT JOIN keeps the Table1 rows whose C2_ID is null:
SELECT *
FROM TABLE1 T1
LEFT JOIN
(SELECT DISTINCT C2_ID, L2_ID, Category
FROM TABLE2) AS T2 ON T1.C2_ID = T2.C2_ID
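Note that SELECT * returns C2_ID twice, once from each side of the join. Listing the columns explicitly avoids that; a sketch, with an ORDER BY to match the desired output's ordering:
SELECT T1.C1_ID, T1.C1_Name, T1.L1_ID, T1.C2_ID, T2.L2_ID, T2.Category
FROM TABLE1 T1
LEFT JOIN
(SELECT DISTINCT C2_ID, L2_ID, Category
FROM TABLE2) AS T2 ON T1.C2_ID = T2.C2_ID
ORDER BY T1.C1_ID, T1.L1_ID;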