How to give an alias to an unnested array of tuples? (SQL)

I'm trying to set up a small data set using this query:
WITH CTE AS (
SELECT *
FROM UNNEST([
("a",NULL),("b",NULL),("c",1),("d",1),("e",2),("f",3),
("g",4),("h",4),("i",4),("j",4),("k",5),("l",5),("m",6),("n",7),("o",7)
])
)
SELECT *
FROM CTE
The result is:
|Row|f0_|f1_ |
|---|---|----|
| 1 | a |null|
| 2 | b |null|
| 3 | c |1 |
| 4 | d |1 |
| 5 | e |2 |
| 6 | f |3 |
| 7 | g |4 |
| 8 | h |4 |
| 9 | i |4 |
| 10| j |4 |
| 11| k |5 |
| 12| l |5 |
| 13| m |6 |
| 14| n |7 |
| 15| o |7 |
What I want is:
|Row| x | y |
|---|---|----|
| 1 | a |null|
| 2 | b |null|
| 3 | c |1 |
| 4 | d |1 |
| 5 | e |2 |
| 6 | f |3 |
| 7 | g |4 |
| 8 | h |4 |
| 9 | i |4 |
| 10| j |4 |
| 11| k |5 |
| 12| l |5 |
| 13| m |6 |
| 14| n |7 |
| 15| o |7 |

Use STRUCT:
WITH CTE AS (
SELECT *
FROM UNNEST([
STRUCT("a" as x, NULL as y),
("b",NULL),("c",1),("d",1),("e",2),("f",3),
("g",4),("h",4),("i",4),("j",4),("k",5),("l",5),("m",6),("n",7),("o",7)
])
)
SELECT *
FROM CTE
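
Only the first element needs to be a named STRUCT; BigQuery propagates its field names and types across the whole array literal. An equivalent, more explicit spelling uses a typed array literal (a sketch of the same query):
WITH CTE AS (
SELECT *
FROM UNNEST(ARRAY<STRUCT<x STRING, y INT64>>[
("a",NULL),("b",NULL),("c",1),("d",1),("e",2),("f",3),
("g",4),("h",4),("i",4),("j",4),("k",5),("l",5),("m",6),("n",7),("o",7)
])
)
SELECT *
FROM CTE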

You can use this trick with almost any SQL dialect:
WITH CTE AS (
SELECT NULL AS x, NULL AS y
FROM (SELECT 1) T
WHERE FALSE
UNION ALL
SELECT *
FROM UNNEST([
("a",NULL),("b",NULL),("c",1),("d",1),("e",2),("f",3),
("g",4),("h",4),("i",4),("j",4),("k",5),("l",5),("m",6),("n",7),("o",7)
])
)
SELECT *
FROM CTE;
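
The zero-row first branch contributes no data; UNION ALL simply takes its column names and types from it. The same trick in a dialect with VALUES, e.g. PostgreSQL (a sketch with a shortened value list):
SELECT NULL::text AS x, NULL::int AS y WHERE FALSE
UNION ALL
SELECT * FROM (VALUES ('a', NULL::int), ('b', NULL), ('c', 1)) AS v;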

Related

PySpark or SQL: consuming coalesce

I'm trying to coalesce multiple input columns into multiple output columns, in either a PySpark DataFrame or a SQL table.
Each output column would contain the "first available" input value, which is then "consumed" so that it is unavailable to subsequent output columns.
+----+-----+-----+-----+-----+-----+---+------+------+------+
| ID | in1 | in2 | in3 | in4 | in5 | / | out1 | out2 | out3 |
+----+-----+-----+-----+-----+-----+---+------+------+------+
| 1 | | | C | | | / | C | | |
| 2 | A | | C | | E | / | A | C | E |
| 3 | A | B | C | | | / | A | B | C |
| 4 | A | B | C | D | E | / | A | B | C |
| 5 | | | | | | / | | | |
| 6 | | B | | | E | / | B | E | |
| 7 | | B | | D | E | / | B | D | E |
+----+-----+-----+-----+-----+-----+---+------+------+------+
What's the best way to do this?
Edit (clarification): in1, in2, in3, etc. can be any value.
Here is one way.
import pyspark.sql.functions as f

df = spark.read.option("header","true").option("inferSchema","true").csv("test.csv")
cols = df.columns
cols.remove('ID')
# array_except removes every element of the second array (here just NULL);
# note that it also deduplicates the result.
df2 = df.withColumn('ins', f.array_except(f.array(*cols), f.array(f.lit(None))))
for i in range(0, 3):
    # Indexing past the end of the array yields NULL (with ANSI mode off).
    df2 = df2.withColumn('out' + str(i+1), f.col('ins')[i])
df2.show(10, False)
+---+----+----+----+----+----+---------------+----+----+----+
|ID |in1 |in2 |in3 |in4 |in5 |ins |out1|out2|out3|
+---+----+----+----+----+----+---------------+----+----+----+
|1 |null|null|C |null|null|[C] |C |null|null|
|2 |A |null|C |null|E |[A, C, E] |A |C |E |
|3 |A |B |C |null|null|[A, B, C] |A |B |C |
|4 |A |B |C |D |E |[A, B, C, D, E]|A |B |C |
|5 |null|null|null|null|null|[] |null|null|null|
|6 |null|B |null|null|E |[B, E] |B |E |null|
|7 |null|B |null|D |E |[B, D, E] |B |D |E |
+---+----+----+----+----+----+---------------+----+----+----+
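
One caveat: because array_except deduplicates, a value that appears in two input columns of the same row would be consumed only once. If that matters, a higher-order filter (Spark 3.1+) keeps duplicates; a sketch against the same df (the df3 name is just for illustration):
import pyspark.sql.functions as f

# cols is the same list of input columns built above.
# filter() drops the nulls but keeps duplicates and the left-to-right column order.
df3 = df.withColumn('ins', f.filter(f.array(*cols), lambda x: x.isNotNull()))
for i in range(0, 3):
    df3 = df3.withColumn('out' + str(i+1), f.col('ins')[i])
df3.show(10, False)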

Group rows based on column values in SQL / BigQuery

Is it possible to "group" rows within BigQuery/SQL depending on column values? Let's say I want to assign a string/id to all rows between stream_start_init and stream_start, and then do the same for the rows between stream_resume and the last stream_ad.
The number of stream_ad events can differ, hence I can't use RANK() or ROW_NUMBER() to group them based on those values.
|id |timestamp|event            |
|---|---------|-----------------|
|1  |1231231  |first_visit      |
|2  |1231232  |login            |
|3  |1231233  |page_view        |
|4  |1231234  |page_view        |
|5  |1231235  |stream_start_init|
|6  |1231236  |stream_ad        |
|7  |1231237  |stream_ad        |
|8  |1231238  |stream_ad        |
|9  |1231239  |stream_start     |
|6  |1231216  |stream_resume    |
|6  |1231236  |stream_ad        |
|7  |1231217  |stream_ad        |
|8  |1231258  |stream_ad        |
|10 |1231240  |page_view        |
How I'd like the table to look:
|id |timestamp|event            |group_id|
|---|---------|-----------------|--------|
|1  |1231231  |first_visit      |null    |
|2  |1231232  |login            |null    |
|3  |1231233  |page_view        |null    |
|4  |1231234  |page_view        |null    |
|5  |1231235  |stream_start_init|group_1 |
|6  |1231236  |stream_ad        |group_1 |
|7  |1231237  |stream_ad        |group_1 |
|8  |1231238  |stream_ad        |group_1 |
|9  |1231239  |stream_start     |group_1 |
|6  |1231216  |stream_resume    |group_2 |
|6  |1231236  |stream_ad        |group_2 |
|7  |1231217  |stream_ad        |group_2 |
|8  |1231258  |stream_ad        |group_2 |
|10 |1231240  |page_view        |null    |
I wouldn't assign a string; I would assign a number. This appears to be a cumulative sum, and I think a cumulative count of the "stream_start_init" and "stream_resume" events does what you want:
select t.*,
countif(event in ('stream_start_init', 'stream_resume')) over (order by timestamp) as group_id
from t;
Note that this produces 0 for the rows before the first group, which seems like a good thing. You can convert that 0 to a NULL using NULLIF().
If you really want strings, you can use CONCAT().
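Putting both suggestions together (a sketch, assuming BigQuery; note that, unlike the answer below, it still numbers non-stream rows that follow a group):
select t.*,
concat('group_', cast(nullif(
countif(event in ('stream_start_init', 'stream_resume'))
over (order by timestamp), 0) as string)) as group_id
from t;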
Below is for BigQuery Standard SQL
#standardSQL
SELECT *,
IF(event IN ('stream_start_init', 'stream_start', 'stream_resume', 'stream_ad'),
COUNTIF(event IN ('stream_start_init', 'stream_resume')) OVER(ORDER BY timestamp),
NULL
) AS group_id
FROM `project.dataset.table`

SQL Server Query grouping Optimization

My input table is T1:
+----+------+------+------+
| Id | v1   | v2   | v3   |
+----+------+------+------+
| 1  | a    | b    | c    |
| 2  | null | b    | null |
| 3  | d    | null | null |
| 4  | null | e    | null |
| 5  | e    | f    | null |
+----+------+------+------+
My requirement: I have to compare the rows with each other on the basis of their ids. If, in every column, the values are the same or one of them is null/empty, then I have to merge the rows, combining their id values separated by commas.
Required output:
+-----+------+------+------+
| Id  | v1   | v2   | v3   |
+-----+------+------+------+
| 1,2 | a    | b    | c    |
| 3,4 | d    | e    | null |
| 5   | e    | f    | null |
+-----+------+------+------+
Please assist. I am trying to use a while loop, but it is taking very long.
I want an optimized solution, as I have to run the statement on a large record set.

Derive and Update Column Value based on Row Value SQL Server

So I have a Request History table whose versions I would like to flag (a version is based on the end of a cycle). I was able to mark the end of each cycle, but somehow I couldn't update the version value associated with each cycle. Here is an example:
| history_id | Req_id | StatID | Time      | EndCycleDate |
|------------|--------|--------|-----------|--------------|
| 1          | 1      | 18     | 3/26/2017 | NULL         |
| 2          | 1      | 19     | 3/26/2017 | NULL         |
| 3          | 1      | 20     | 3/30/2017 | NULL         |
| 4          | 1      | 23     | 3/30/2017 | NULL         |
| 5          | 1      | 35     | 3/30/2017 | 3/30/2017    |
| 6          | 1      | 33     | 4/4/2017  | NULL         |
| 7          | 1      | 34     | 4/4/2017  | NULL         |
| 8          | 1      | 39     | 4/4/2017  | NULL         |
| 9          | 1      | 35     | 4/4/2017  | 4/4/2017     |
| 10         | 1      | 33     | 4/5/2017  | NULL         |
| 11         | 1      | 34     | 4/6/2017  | NULL         |
| 12         | 1      | 39     | 4/6/2017  | NULL         |
| 13         | 1      | 35     | 4/7/2017  | 4/7/2017     |
| 14         | 1      | 33     | 4/8/2017  | NULL         |
| 15         | 1      | 34     | 4/8/2017  | NULL         |
| 16         | 2      | 18     | 3/28/2017 | NULL         |
| 17         | 2      | 26     | 3/28/2017 | NULL         |
| 18         | 2      | 20     | 3/30/2017 | NULL         |
| 19         | 2      | 23     | 3/30/2017 | NULL         |
| 20         | 2      | 35     | 3/30/2017 | 3/30/2017    |
| 21         | 2      | 33     | 4/12/2017 | NULL         |
| 22         | 2      | 34     | 4/12/2017 | NULL         |
| 23         | 2      | 38     | 4/13/2017 | NULL         |
Now what I would like to achieve is to derive a new column, namely VER, and update its value like the following:
| history_id | Req_id | StatID | Time      | EndCycleDate | VER |
|------------|--------|--------|-----------|--------------|-----|
| 1          | 1      | 18     | 3/26/2017 | NULL         | 1   |
| 2          | 1      | 19     | 3/26/2017 | NULL         | 1   |
| 3          | 1      | 20     | 3/30/2017 | NULL         | 1   |
| 4          | 1      | 23     | 3/30/2017 | NULL         | 1   |
| 5          | 1      | 35     | 3/30/2017 | 3/30/2017    | 1   |
| 6          | 1      | 33     | 4/4/2017  | NULL         | 2   |
| 7          | 1      | 34     | 4/4/2017  | NULL         | 2   |
| 8          | 1      | 39     | 4/4/2017  | NULL         | 2   |
| 9          | 1      | 35     | 4/4/2017  | 4/4/2017     | 2   |
| 10         | 1      | 33     | 4/5/2017  | NULL         | 3   |
| 11         | 1      | 34     | 4/6/2017  | NULL         | 3   |
| 12         | 1      | 39     | 4/6/2017  | NULL         | 3   |
| 13         | 1      | 35     | 4/7/2017  | 4/7/2017     | 3   |
| 14         | 1      | 33     | 4/8/2017  | NULL         | 4   |
| 15         | 1      | 34     | 4/8/2017  | NULL         | 4   |
| 16         | 2      | 18     | 3/28/2017 | NULL         | 1   |
| 17         | 2      | 26     | 3/28/2017 | NULL         | 1   |
| 18         | 2      | 20     | 3/30/2017 | NULL         | 1   |
| 19         | 2      | 23     | 3/30/2017 | NULL         | 1   |
| 20         | 2      | 35     | 3/30/2017 | 3/30/2017    | 1   |
| 21         | 2      | 33     | 4/12/2017 | NULL         | 2   |
| 22         | 2      | 34     | 4/12/2017 | NULL         | 2   |
| 23         | 2      | 38     | 4/13/2017 | NULL         | 2   |
One method that comes really close is a cumulative count:
select t.*,
count(endCycleDate) over (partition by req_id order by history_id) as ver
from t;
However, this doesn't get the value exactly right on the rows where the endCycleDate is defined, and the count starts at 0. Most of these problems are fixed with a windowing clause:
select t.*,
(count(endCycleDate) over (partition by req_id
order by history_id
rows between unbounded preceding and 1 preceding) + 1
) as ver
from t;
But that misses the value on the first row. So here is a method that actually works: it enumerates the values backward and then subtracts from the total to get the versions in ascending order:
select t.*,
(1 + count(endCycleDate) over (partition by req_id) -
count(endCycleDate) over (partition by req_id
order by history_id desc)
) as ver
from t;
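
To persist VER rather than just select it, SQL Server allows updating through a CTE; a sketch, assuming the table is named RequestHistory (a hypothetical name) and a VER column has already been added:
WITH v AS (
SELECT *,
1 + COUNT(EndCycleDate) OVER (PARTITION BY Req_id) -
COUNT(EndCycleDate) OVER (PARTITION BY Req_id ORDER BY history_id DESC) AS new_ver
FROM RequestHistory
)
UPDATE v SET VER = new_ver;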

How to insert multiple rows into one table for each id of another table

I have 2 tables (20,000+ rows):
Table1:
C1_ID | C1_Name | L1_ID | C2_ID
-------------------------------
a     | Alan    | 123   | k
b     | Ben     | 345   | l
a     | Alan    | 123   | m
a     | Alan    | 453   | n
c     | John    | 111   | i
f     | Sasha   | 987   | e
c     | John    | 111   | s
c     | John    | 756   | null
z     | Peter   | 145   | null
Table2:
C2_ID | L2_ID | Category
------------------------
k     | 888   | 1
k     | 789   | 2
k     | 888   | 1
l     | 456   | 0
l     | 147   | 1
m     | 333   | 1
n     | 999   | 2
n     | 369   | 4
n     | 258   | 3
i     | 159   | 2
i     | 357   | 1
e     | 684   | 1
s     | 153   | 2
Desired output:
C1_ID | C1_Name | L1_ID | C2_ID | L2_ID | Category
--------------------------------------------------
a     | Alan    | 123   | k     | 888   | 1
a     | Alan    | 123   | k     | 789   | 2
a     | Alan    | 123   | m     | 333   | 1
a     | Alan    | 453   | n     | 999   | 2
a     | Alan    | 453   | n     | 369   | 4
a     | Alan    | 453   | n     | 258   | 3
b     | Ben     | 345   | l     | 456   | 0
b     | Ben     | 345   | l     | 147   | 1
c     | John    | 111   | i     | 159   | 2
c     | John    | 111   | i     | 357   | 1
c     | John    | 111   | s     | 153   | 2
c     | John    | 756   | null  | null  | null
f     | Sasha   | 987   | e     | 684   | 1
z     | Peter   | 145   | null  | null  | null
I need to update Table1, adding a row each time a C2_ID (with a distinct L2_ID) is found in Table2.
Each row should be filled with the relevant L2_ID and Category.
This is a simple join.
SELECT *
FROM TABLE1 T1
LEFT JOIN
(SELECT DISTINCT C2_ID, L2_ID, Category
FROM TABLE2) AS T2 ON T1.C2_ID = T2.C2_ID
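
To materialize the result as a new table instead of just selecting it, SELECT ... INTO works in SQL Server (a sketch; the Table1_expanded name is hypothetical, and other dialects would use CREATE TABLE AS):
SELECT T1.C1_ID, T1.C1_Name, T1.L1_ID, T1.C2_ID, T2.L2_ID, T2.Category
INTO Table1_expanded
FROM TABLE1 T1
LEFT JOIN
(SELECT DISTINCT C2_ID, L2_ID, Category
FROM TABLE2) AS T2 ON T1.C2_ID = T2.C2_ID;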