How to give an alias to an unnested array of tuples? (SQL)

I'm trying to set up a small data set using this query:
WITH CTE AS (
SELECT *
FROM UNNEST([
("a",NULL),("b",NULL),("c",1),("d",1),("e",2),("f",3),
("g",4),("h",4),("i",4),("j",4),("k",5),("l",5),("m",6),("n",7),("o",7)
])
)
SELECT *
FROM CTE
The result is:
|Row|f0_|f1_ |
|---|---|----|
| 1 | a |null|
| 2 | b |null|
| 3 | c |1 |
| 4 | d |1 |
| 5 | e |2 |
| 6 | f |3 |
| 7 | g |4 |
| 8 | h |4 |
| 9 | i |4 |
| 10| j |4 |
| 11| k |5 |
| 12| l |5 |
| 13| m |6 |
| 14| n |7 |
| 15| o |7 |
What I want is:
|Row| x | y |
|---|---|----|
| 1 | a |null|
| 2 | b |null|
| 3 | c |1 |
| 4 | d |1 |
| 5 | e |2 |
| 6 | f |3 |
| 7 | g |4 |
| 8 | h |4 |
| 9 | i |4 |
| 10| j |4 |
| 11| k |5 |
| 12| l |5 |
| 13| m |6 |
| 14| n |7 |
| 15| o |7 |

Use STRUCT:
WITH CTE AS (
SELECT *
FROM UNNEST([
STRUCT("a" as x, NULL as y),
("b",NULL),("c",1),("d",1),("e",2),("f",3),
("g",4),("h",4),("i",4),("j",4),("k",5),("l",5),("m",6),("n",7),("o",7)
])
)
SELECT *
FROM CTE
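
Only the first element needs to be a named STRUCT; BigQuery propagates its field names and types across the whole array literal. An equivalent, more explicit spelling uses a typed array literal (a sketch of the same query):
WITH CTE AS (
SELECT *
FROM UNNEST(ARRAY<STRUCT<x STRING, y INT64>>[
("a",NULL),("b",NULL),("c",1),("d",1),("e",2),("f",3),
("g",4),("h",4),("i",4),("j",4),("k",5),("l",5),("m",6),("n",7),("o",7)
])
)
SELECT *
FROM CTE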

You can use this trick with almost any SQL dialect:
WITH CTE AS (
SELECT NULL AS x, NULL AS y
FROM (SELECT 1) T
WHERE FALSE
UNION ALL
SELECT *
FROM UNNEST([
("a",NULL),("b",NULL),("c",1),("d",1),("e",2),("f",3),
("g",4),("h",4),("i",4),("j",4),("k",5),("l",5),("m",6),("n",7),("o",7)
])
)
SELECT *
FROM CTE;
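
The zero-row first branch contributes no data; UNION ALL simply takes its column names and types from it. The same trick in a dialect with VALUES, e.g. PostgreSQL (a sketch with a shortened value list):
SELECT NULL::text AS x, NULL::int AS y WHERE FALSE
UNION ALL
SELECT * FROM (VALUES ('a', NULL::int), ('b', NULL), ('c', 1)) AS v;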

Related

PySpark or SQL: consuming coalesce

I'm trying to coalesce multiple input columns into multiple output columns, in either a PySpark DataFrame or a SQL table.
Each output column would contain the "first available" input value, which is then "consumed" so that it is unavailable to subsequent output columns.
+----+-----+-----+-----+-----+-----+---+------+------+------+
| ID | in1 | in2 | in3 | in4 | in5 | / | out1 | out2 | out3 |
+----+-----+-----+-----+-----+-----+---+------+------+------+
| 1 | | | C | | | / | C | | |
| 2 | A | | C | | E | / | A | C | E |
| 3 | A | B | C | | | / | A | B | C |
| 4 | A | B | C | D | E | / | A | B | C |
| 5 | | | | | | / | | | |
| 6 | | B | | | E | / | B | E | |
| 7 | | B | | D | E | / | B | D | E |
+----+-----+-----+-----+-----+-----+---+------+------+------+
What's the best way to do this?
Edit (clarification): in1, in2, in3, etc. can be any value.
Here is one way.
import pyspark.sql.functions as f

df = spark.read.option("header","true").option("inferSchema","true").csv("test.csv")
cols = df.columns
cols.remove('ID')
# array_except removes every element of the second array (here just NULL);
# note that it also deduplicates the result.
df2 = df.withColumn('ins', f.array_except(f.array(*cols), f.array(f.lit(None))))
for i in range(0, 3):
    # Indexing past the end of the array yields NULL (with ANSI mode off).
    df2 = df2.withColumn('out' + str(i+1), f.col('ins')[i])
df2.show(10, False)
+---+----+----+----+----+----+---------------+----+----+----+
|ID |in1 |in2 |in3 |in4 |in5 |ins |out1|out2|out3|
+---+----+----+----+----+----+---------------+----+----+----+
|1 |null|null|C |null|null|[C] |C |null|null|
|2 |A |null|C |null|E |[A, C, E] |A |C |E |
|3 |A |B |C |null|null|[A, B, C] |A |B |C |
|4 |A |B |C |D |E |[A, B, C, D, E]|A |B |C |
|5 |null|null|null|null|null|[] |null|null|null|
|6 |null|B |null|null|E |[B, E] |B |E |null|
|7 |null|B |null|D |E |[B, D, E] |B |D |E |
+---+----+----+----+----+----+---------------+----+----+----+
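
One caveat: because array_except deduplicates, a value that appears in two input columns of the same row would be consumed only once. If that matters, a higher-order filter (Spark 3.1+) keeps duplicates; a sketch against the same df (the df3 name is just for illustration):
import pyspark.sql.functions as f

# cols is the same list of input columns built above.
# filter() drops the nulls but keeps duplicates and the left-to-right column order.
df3 = df.withColumn('ins', f.filter(f.array(*cols), lambda x: x.isNotNull()))
for i in range(0, 3):
    df3 = df3.withColumn('out' + str(i+1), f.col('ins')[i])
df3.show(10, False)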

Group rows based on column values in SQL / BigQuery

Is it possible to "group" rows within BigQuery/SQL depending on column values? Let's say I want to assign a string/id to all rows between stream_start_init and stream_start, and then do the same for the rows between stream_resume and the last stream_ad.
The number of stream_ad events can differ, hence I can't use RANK() or ROW_NUMBER() to group them based on those values.
|id |timestamp|event            |
|---|---------|-----------------|
|1  |1231231  |first_visit      |
|2  |1231232  |login            |
|3  |1231233  |page_view        |
|4  |1231234  |page_view        |
|5  |1231235  |stream_start_init|
|6  |1231236  |stream_ad        |
|7  |1231237  |stream_ad        |
|8  |1231238  |stream_ad        |
|9  |1231239  |stream_start     |
|6  |1231216  |stream_resume    |
|6  |1231236  |stream_ad        |
|7  |1231217  |stream_ad        |
|8  |1231258  |stream_ad        |
|10 |1231240  |page_view        |
How I'd like the table to look:
|id |timestamp|event            |group_id|
|---|---------|-----------------|--------|
|1  |1231231  |first_visit      |null    |
|2  |1231232  |login            |null    |
|3  |1231233  |page_view        |null    |
|4  |1231234  |page_view        |null    |
|5  |1231235  |stream_start_init|group_1 |
|6  |1231236  |stream_ad        |group_1 |
|7  |1231237  |stream_ad        |group_1 |
|8  |1231238  |stream_ad        |group_1 |
|9  |1231239  |stream_start     |group_1 |
|6  |1231216  |stream_resume    |group_2 |
|6  |1231236  |stream_ad        |group_2 |
|7  |1231217  |stream_ad        |group_2 |
|8  |1231258  |stream_ad        |group_2 |
|10 |1231240  |page_view        |null    |
I wouldn't assign a string; I would assign a number. This appears to be a cumulative sum, and I think a cumulative count of the "stream_start_init" and "stream_resume" events does what you want:
select t.*,
countif(event in ('stream_start_init', 'stream_resume')) over (order by timestamp) as group_id
from t;
Note that this produces 0 for the rows before the first group, which seems like a good thing. You can convert that 0 to a NULL using NULLIF().
If you really want strings, you can use CONCAT().
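Putting both suggestions together (a sketch, assuming BigQuery; note that, unlike the answer below, it still numbers non-stream rows that follow a group):
select t.*,
concat('group_', cast(nullif(
countif(event in ('stream_start_init', 'stream_resume'))
over (order by timestamp), 0) as string)) as group_id
from t;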
Below is for BigQuery Standard SQL
#standardSQL
SELECT *,
IF(event IN ('stream_start_init', 'stream_start', 'stream_resume', 'stream_ad'),
COUNTIF(event IN ('stream_start_init', 'stream_resume')) OVER(ORDER BY timestamp),
NULL
) AS group_id
FROM `project.dataset.table`

SQL Server Query grouping Optimization

My input table is T1:
+----+------+------+------+
| Id | v1   | v2   | v3   |
+----+------+------+------+
| 1  | a    | b    | c    |
| 2  | null | b    | null |
| 3  | d    | null | null |
| 4  | null | e    | null |
| 5  | e    | f    | null |
+----+------+------+------+
My requirement: I have to compare the rows with each other on the basis of their ids. If, in every column, the values are the same or one of them is null/empty, then I have to merge the rows, combining their id values separated by commas.
Required output:
+-----+------+------+------+
| Id  | v1   | v2   | v3   |
+-----+------+------+------+
| 1,2 | a    | b    | c    |
| 3,4 | d    | e    | null |
| 5   | e    | f    | null |
+-----+------+------+------+
Please assist. I am trying to use a while loop, but it is taking very long.
I want an optimized solution, as I have to run the statement on a large record set.

Derive and Update Column Value based on Row Value SQL Server

So I have a Request History table whose versions I would like to flag (a version is based on the end of a cycle). I was able to mark the end of each cycle, but somehow I couldn't update the version value associated with each cycle. Here is an example:
| history_id | Req_id | StatID | Time      | EndCycleDate |
|------------|--------|--------|-----------|--------------|
| 1          | 1      | 18     | 3/26/2017 | NULL         |
| 2          | 1      | 19     | 3/26/2017 | NULL         |
| 3          | 1      | 20     | 3/30/2017 | NULL         |
| 4          | 1      | 23     | 3/30/2017 | NULL         |
| 5          | 1      | 35     | 3/30/2017 | 3/30/2017    |
| 6          | 1      | 33     | 4/4/2017  | NULL         |
| 7          | 1      | 34     | 4/4/2017  | NULL         |
| 8          | 1      | 39     | 4/4/2017  | NULL         |
| 9          | 1      | 35     | 4/4/2017  | 4/4/2017     |
| 10         | 1      | 33     | 4/5/2017  | NULL         |
| 11         | 1      | 34     | 4/6/2017  | NULL         |
| 12         | 1      | 39     | 4/6/2017  | NULL         |
| 13         | 1      | 35     | 4/7/2017  | 4/7/2017     |
| 14         | 1      | 33     | 4/8/2017  | NULL         |
| 15         | 1      | 34     | 4/8/2017  | NULL         |
| 16         | 2      | 18     | 3/28/2017 | NULL         |
| 17         | 2      | 26     | 3/28/2017 | NULL         |
| 18         | 2      | 20     | 3/30/2017 | NULL         |
| 19         | 2      | 23     | 3/30/2017 | NULL         |
| 20         | 2      | 35     | 3/30/2017 | 3/30/2017    |
| 21         | 2      | 33     | 4/12/2017 | NULL         |
| 22         | 2      | 34     | 4/12/2017 | NULL         |
| 23         | 2      | 38     | 4/13/2017 | NULL         |
Now what I would like to achieve is to derive a new column, namely VER, and update its value like the following:
| history_id | Req_id | StatID | Time      | EndCycleDate | VER |
|------------|--------|--------|-----------|--------------|-----|
| 1          | 1      | 18     | 3/26/2017 | NULL         | 1   |
| 2          | 1      | 19     | 3/26/2017 | NULL         | 1   |
| 3          | 1      | 20     | 3/30/2017 | NULL         | 1   |
| 4          | 1      | 23     | 3/30/2017 | NULL         | 1   |
| 5          | 1      | 35     | 3/30/2017 | 3/30/2017    | 1   |
| 6          | 1      | 33     | 4/4/2017  | NULL         | 2   |
| 7          | 1      | 34     | 4/4/2017  | NULL         | 2   |
| 8          | 1      | 39     | 4/4/2017  | NULL         | 2   |
| 9          | 1      | 35     | 4/4/2017  | 4/4/2017     | 2   |
| 10         | 1      | 33     | 4/5/2017  | NULL         | 3   |
| 11         | 1      | 34     | 4/6/2017  | NULL         | 3   |
| 12         | 1      | 39     | 4/6/2017  | NULL         | 3   |
| 13         | 1      | 35     | 4/7/2017  | 4/7/2017     | 3   |
| 14         | 1      | 33     | 4/8/2017  | NULL         | 4   |
| 15         | 1      | 34     | 4/8/2017  | NULL         | 4   |
| 16         | 2      | 18     | 3/28/2017 | NULL         | 1   |
| 17         | 2      | 26     | 3/28/2017 | NULL         | 1   |
| 18         | 2      | 20     | 3/30/2017 | NULL         | 1   |
| 19         | 2      | 23     | 3/30/2017 | NULL         | 1   |
| 20         | 2      | 35     | 3/30/2017 | 3/30/2017    | 1   |
| 21         | 2      | 33     | 4/12/2017 | NULL         | 2   |
| 22         | 2      | 34     | 4/12/2017 | NULL         | 2   |
| 23         | 2      | 38     | 4/13/2017 | NULL         | 2   |
One method that comes really close is a cumulative count:
select t.*,
count(endCycleDate) over (partition by req_id order by history_id) as ver
from t;
However, this doesn't get the value exactly right on the rows where the endCycleDate is defined, and the count starts at 0. Most of these problems are fixed with a windowing clause:
select t.*,
(count(endCycleDate) over (partition by req_id
order by history_id
rows between unbounded preceding and 1 preceding) + 1
) as ver
from t;
But that misses the value on the first row. So here is a method that actually works: it enumerates the values backward and then subtracts from the total to get the versions in ascending order:
select t.*,
(1 + count(endCycleDate) over (partition by req_id) -
count(endCycleDate) over (partition by req_id
order by history_id desc)
) as ver
from t;
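
To persist VER rather than just select it, SQL Server allows updating through a CTE; a sketch, assuming the table is named RequestHistory (a hypothetical name) and a VER column has already been added:
WITH v AS (
SELECT *,
1 + COUNT(EndCycleDate) OVER (PARTITION BY Req_id) -
COUNT(EndCycleDate) OVER (PARTITION BY Req_id ORDER BY history_id DESC) AS new_ver
FROM RequestHistory
)
UPDATE v SET VER = new_ver;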

How to insert multiple rows into one table for each id of another table

I have 2 tables (20,000+ rows):
Table1:
C1_ID | C1_Name | L1_ID | C2_ID
-------------------------------
a     | Alan    | 123   | k
b     | Ben     | 345   | l
a     | Alan    | 123   | m
a     | Alan    | 453   | n
c     | John    | 111   | i
f     | Sasha   | 987   | e
c     | John    | 111   | s
c     | John    | 756   | null
z     | Peter   | 145   | null
Table2:
C2_ID | L2_ID | Category
------------------------
k     | 888   | 1
k     | 789   | 2
k     | 888   | 1
l     | 456   | 0
l     | 147   | 1
m     | 333   | 1
n     | 999   | 2
n     | 369   | 4
n     | 258   | 3
i     | 159   | 2
i     | 357   | 1
e     | 684   | 1
s     | 153   | 2
Desired output:
C1_ID | C1_Name | L1_ID | C2_ID | L2_ID | Category
--------------------------------------------------
a     | Alan    | 123   | k     | 888   | 1
a     | Alan    | 123   | k     | 789   | 2
a     | Alan    | 123   | m     | 333   | 1
a     | Alan    | 453   | n     | 999   | 2
a     | Alan    | 453   | n     | 369   | 4
a     | Alan    | 453   | n     | 258   | 3
b     | Ben     | 345   | l     | 456   | 0
b     | Ben     | 345   | l     | 147   | 1
c     | John    | 111   | i     | 159   | 2
c     | John    | 111   | i     | 357   | 1
c     | John    | 111   | s     | 153   | 2
c     | John    | 756   | null  | null  | null
f     | Sasha   | 987   | e     | 684   | 1
z     | Peter   | 145   | null  | null  | null
I need to update Table1, adding a row each time a C2_ID (with a distinct L2_ID) is found in Table2.
Each row should be filled with the relevant L2_ID and Category.
This is a simple join.
SELECT *
FROM TABLE1 T1
LEFT JOIN
(SELECT DISTINCT C2_ID, L2_ID, Category
FROM TABLE2) AS T2 ON T1.C2_ID = T2.C2_ID
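
To materialize the result as a new table instead of just selecting it, SELECT ... INTO works in SQL Server (a sketch; the Table1_expanded name is hypothetical, and other dialects would use CREATE TABLE AS):
SELECT T1.C1_ID, T1.C1_Name, T1.L1_ID, T1.C2_ID, T2.L2_ID, T2.Category
INTO Table1_expanded
FROM TABLE1 T1
LEFT JOIN
(SELECT DISTINCT C2_ID, L2_ID, Category
FROM TABLE2) AS T2 ON T1.C2_ID = T2.C2_ID;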