For the following table structure:
+----------------+--------+--------+-----+----------+
| timestamp      | value1 | value2 | ... | value100 |
+----------------+--------+--------+-----+----------+
| 1/1/1 00:00:00 | 1      | 2      | ... | 100      |
+----------------+--------+--------+-----+----------+
How could I transpose it into a structure like this using Spark SQL syntax?
+----------------+--------------+-------+
| timestamp      | id           | value |
+----------------+--------------+-------+
| 1/1/1 00:00:00 | value1       | 1     |
| 1/1/1 00:00:00 | value2       | 2     |
| 1/1/1 00:00:00 | ... value100 | 100   |
+----------------+--------------+-------+
In Python or R this would be relatively straightforward, and UNPIVOT doesn't seem to be applicable here.
A more concise approach would be to use STACK
Data Preparation
# 'sql' here refers to the active SparkSession (or SQLContext)
sparkDF = sql.createDataFrame([("20201021T00:00:00+0530",10,97,23,214),
                               ("20211011T00:00:00+0530",23,8218,9192,827),
                               ("20200212T00:00:00+0300",51,981,18,10),
                               ("20211021T00:00:00+0530",10,2197,871,108),
                               ("20211021T00:00:00+0900",128,9812,98,192),
                               ("20211021T00:00:00-0500",218,487,21,51)],
                              ['timestamp','value1','value2','value3','value4'])

# sparkDF.show(truncate=False)

sparkDF.createOrReplaceTempView("sparkDF")
sql.sql("""
SELECT
timestamp
,STACK(4,'value1',value1
,'value2',value2
,'value3',value3
,'value4',value4
) as (id,value)
FROM sparkDF
""").show()
+--------------------+------+-----+
| timestamp| id|value|
+--------------------+------+-----+
|20201021T00:00:00...|value1| 10|
|20201021T00:00:00...|value2| 97|
|20201021T00:00:00...|value3| 23|
|20201021T00:00:00...|value4| 214|
|20211011T00:00:00...|value1| 23|
|20211011T00:00:00...|value2| 8218|
|20211011T00:00:00...|value3| 9192|
|20211011T00:00:00...|value4| 827|
|20200212T00:00:00...|value1| 51|
|20200212T00:00:00...|value2| 981|
|20200212T00:00:00...|value3| 18|
|20200212T00:00:00...|value4| 10|
|20211021T00:00:00...|value1| 10|
|20211021T00:00:00...|value2| 2197|
|20211021T00:00:00...|value3| 871|
|20211021T00:00:00...|value4| 108|
|20211021T00:00:00...|value1| 128|
|20211021T00:00:00...|value2| 9812|
|20211021T00:00:00...|value3| 98|
|20211021T00:00:00...|value4| 192|
+--------------------+------+-----+
Stack String
You can further build the stack_str programmatically, depending on the columns you want:
col_len = 4

stack_str = ''

for i in range(col_len):
    if i == 0:
        stack_str += f'\'value{i+1}\',value{i+1}'
    else:
        stack_str += f',\'value{i+1}\',value{i+1}'

stack_str = f"STACK({col_len},{stack_str}) as (id,value)"

stack_str
"STACK(4,'value1',value1,'value2',value2,'value3',value3,'value4',value4) as (id,value)"
sql.sql(f"""
SELECT
timestamp
,{stack_str}
FROM sparkDF
""").show()
+--------------------+------+-----+
| timestamp| id|value|
+--------------------+------+-----+
|20201021T00:00:00...|value1| 10|
|20201021T00:00:00...|value2| 97|
|20201021T00:00:00...|value3| 23|
|20201021T00:00:00...|value4| 214|
|20211011T00:00:00...|value1| 23|
|20211011T00:00:00...|value2| 8218|
|20211011T00:00:00...|value3| 9192|
|20211011T00:00:00...|value4| 827|
|20200212T00:00:00...|value1| 51|
|20200212T00:00:00...|value2| 981|
|20200212T00:00:00...|value3| 18|
|20200212T00:00:00...|value4| 10|
|20211021T00:00:00...|value1| 10|
|20211021T00:00:00...|value2| 2197|
|20211021T00:00:00...|value3| 871|
|20211021T00:00:00...|value4| 108|
|20211021T00:00:00...|value1| 128|
|20211021T00:00:00...|value2| 9812|
|20211021T00:00:00...|value3| 98|
|20211021T00:00:00...|value4| 192|
+--------------------+------+-----+
You could do the same for all four value columns using regular SQL as follows
select timestamp
,'value1' as id
,value1 as value
from table
union all
select timestamp
,'value2' as id
,value2 as value
from table
union all
select timestamp
,'value3' as id
,value3 as value
from table
union all
select timestamp
,'value4' as id
,value4 as value
from table
I have data like below:
|Id | DateTime                     | products |
|---|------------------------------|----------|
| 1 | 2017-08-24T00:00:00.000+0000 | 1        |
| 1 | 2017-08-24T00:00:00.000+0000 | 2        |
| 1 | 2017-08-24T00:00:00.000+0000 | 3        |
| 1 | 2016-05-24T00:00:00.000+0000 | 1        |
I am using Window.unboundedPreceding, Window.unboundedFollowing as below to get the second most recent datetime.
sorted_times = Window.partitionBy('Id').orderBy(F.col('ModifiedTime').desc()) \
                     .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df3 = data.withColumn("second_recent",
                      F.collect_list(F.col('ModifiedTime')).over(sorted_times).getItem(1))
But I get the results below: the second datetime is taken from the second row, which has the same value as the first row.
|Id | DateTime                     | secondtime                   | Products |
|---|------------------------------|------------------------------|----------|
| 1 | 2017-08-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 1        |
| 1 | 2017-08-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 2        |
| 1 | 2017-08-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 3        |
| 1 | 2016-05-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 1        |
Please help me find the second latest datetime over distinct datetimes.
Thanks in advance
Use collect_set instead of collect_list for no duplicates:
df3 = data.withColumn(
    "second_recent",
    F.collect_set(F.col('LastModifiedTime')).over(sorted_times)[1]
)

df3.show(truncate=False)
#+-----+----------------------------+--------+----------------------------+
#|VipId|LastModifiedTime |products|second_recent |
#+-----+----------------------------+--------+----------------------------+
#|1 |2017-08-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|2 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|3 |2016-05-24T00:00:00.000+0000|
#|1 |2016-05-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#+-----+----------------------------+--------+----------------------------+
Another way is to use an unordered window and sort the array in descending order before taking the second element:
from pyspark.sql import functions as F, Window

df3 = data.withColumn(
    "second_recent",
    F.sort_array(
        F.collect_set(F.col('LastModifiedTime')).over(Window.partitionBy('VipId')),
        False
    )[1]
)
The source has data like:
Colum1 | Colum2 | Colum3 | Colum4 | Colum5 | Start_date | End_date
A      | B      | A      | B      | A      | 1/1/2021   | 4/1/2021
Is it possible to get data as follows using a query in Netezza?
Colum1 | Colum2 | Colum3 | Colum4 | Colum5 | Month
A      | B      | A      | B      | A      | 1-Jan-21
A      | B      | A      | B      | A      | 1-Feb-21
A      | B      | A      | B      | A      | 1-Mar-21
A      | B      | A      | B      | A      | 1-Apr-21
Sure, you need a time dimension and then do a 'between-join' against it:
Create temp table TimeDim as
Select '2010-01-01'::date + ((datasliceid - 1) || ' months')::interval as FirstDayOfMonth
From _v_dual_dslice
;
Then the between-join:
Select * from YourTable join TimeDim
On FirstDayOfMonth between Start_date and End_Date
;
Can you follow?
Lars
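Outside Netezza, the same month expansion can be sketched in Spark SQL (which most of this thread uses) by generating the month sequence inline instead of joining to a time dimension. The sketch below assumes YourTable is registered as a temp view with DATE-typed Start_date/End_date columns and that spark is the active SparkSession; it is not part of the Netezza answer above.
# Sketch only: produce one output row per month between Start_date and End_date.
spark.sql("""
    SELECT t.*, m AS Month
    FROM YourTable t
    LATERAL VIEW EXPLODE(
        SEQUENCE(TRUNC(Start_date, 'MM'), TRUNC(End_date, 'MM'), INTERVAL 1 MONTH)
    ) s AS m
""").show()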
I have 2 tables as follows.
Need to join these 2 tables to get below table
I am trying different joins but not getting expected results. Could you please help me to get the desired table.
Really appreciate your help.
Thanks...
Hope this solution can help you (I used SQL Server syntax):
SELECT isnull(date1,date2) as Date3, ISNULL(RM, 0 ),ISNULL(KM, 0 )
FROM table1
FULL JOIN table2
ON table1.Date1 = table2.Date2
order by Date3;
[RESULT]:
[EDIT]:
Live demo
create table Table1 (DATE1 date, RM int);
INSERT INTO Table1 VALUES ('1/4/2020' , 1);
INSERT INTO Table1 VALUES ('2/1/2020' , 4);
INSERT INTO Table1 VALUES ('2/10/2020' , 4);
GO
3 rows affected
create table Table2 (DATE2 date, KM int);
INSERT INTO Table2 VALUES ('2/2/2020' , 1);
INSERT INTO Table2 VALUES ('2/10/2020' , 3);
INSERT INTO Table2 VALUES ('3/5/2020' , 2);
GO
3 rows affected
select * from Table1;
GO
DATE1 | RM
:--------- | -:
2020-01-04 | 1
2020-02-01 | 4
2020-02-10 | 4
select * from Table2;
GO
DATE2 | KM
:--------- | -:
2020-02-02 | 1
2020-02-10 | 3
2020-03-05 | 2
SELECT isnull(date1,date2) as Date3, ISNULL(RM, 0 ),ISNULL(KM, 0 )
FROM table1
FULL JOIN table2
ON table1.Date1 = table2.Date2
order by Date3;
GO
Date3 | (No column name) | (No column name)
:--------- | ---------------: | ---------------:
2020-01-04 | 1 | 0
2020-02-01 | 4 | 0
2020-02-02 | 0 | 1
2020-02-10 | 4 | 3
2020-03-05 | 0 | 2
db<>fiddle here
I don't know Scala, but in PySpark you can do the following:
df1.join(df2, 'DATE', 'full').fillna(0)
Essentially you do a full join and fill all the NULLs with 0.
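A minimal runnable sketch of this approach, using the sample dates from the demo above (the column names DATE, RM and KM are assumptions):
# Minimal sketch: full outer join on DATE, then fill the NULLs produced for
# unmatched rows with 0. Sample data mirrors the db<>fiddle demo above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([("2020-01-04", 1), ("2020-02-01", 4), ("2020-02-10", 4)], ["DATE", "RM"])
df2 = spark.createDataFrame([("2020-02-02", 1), ("2020-02-10", 3), ("2020-03-05", 2)], ["DATE", "KM"])

df1.join(df2, 'DATE', 'full').fillna(0).orderBy('DATE').show()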
For Hive SQL I guess it would be something like
SELECT COALESCE(table1.Date, table2.Date) AS Date,
       CASE WHEN (table1.RM IS NOT NULL) THEN table1.RM ELSE 0 END AS RM,
       CASE WHEN (table2.KM IS NOT NULL) THEN table2.KM ELSE 0 END AS KM
FROM table1
FULL JOIN table2
ON table1.Date = table2.Date
I have created two initial DataFrames, named df_rm and df_km, as a source for your data.
df_rm looks like this:
+---------+---+
| date| rm|
+---------+---+
| 1/4/2020| 1|
| 2/1/2020| 4|
|2/10/2020| 4|
+---------+---+
df_km:
+---------+---+
| date| km|
+---------+---+
| 2/2/2020| 1|
|2/10/2020| 3|
| 3/5/2020| 2|
+---------+---+
Now we can do an outer join and then replace the null values, in this case with 0.
import org.apache.spark.sql.functions.{col, when}

df_km.join(df_rm, Seq("date"), "outer")
  .withColumn("rm", when(col("rm").isNull, 0).otherwise(col("rm")))
  .withColumn("km", when(col("km").isNull, 0).otherwise(col("km")))
  .show()
Which outputs like this:
+---------+---+---+
| date| km| rm|
+---------+---+---+
| 3/5/2020| 2| 0|
| 2/2/2020| 1| 0|
| 2/1/2020| 0| 4|
| 1/4/2020| 0| 1|
|2/10/2020| 3| 4|
+---------+---+---+
I have two tables. Table t1 defines the metadata, i.e. the attribute values an ideal transaction should contain. It also defines the order of importance of the attributes by the order of records in the array: the first record is the most important and has a weight of 1, the 2nd 0.9, the 3rd 0.8, the 4th 0.7, and so on. Anything beyond the 10th is of least importance. I need to measure the quality of the data filled into the transaction table t2: find the percentage of attributes filled and their quality rank.
t1
------------------------------------
| a_id | attribute_values |
------------------------------------
| 12345 | ["a1", "a2", "a3", "a5"] |
| 6789 | ["b1", "b4", "b7"] |
------------------------------------
t2
------------------------------------
| b_id | a_id | attribute_values|
------------------------------------
| B123 | 12345 | ["a2", "a5"] |
| B456 | 6789  | ["b1", "b7"]    |
-------------------------------------
I am looking for a way to calculate the quality rank for my t2 records as below:
------------------------------------------
| b_id | percent_complete | quality_rank |
------------------------------------------
| B123 | 50               | 0.4          |
| B456 | 66.66            | 0.6          |
------------------------------------------
B123 - 2 out of 4 attributes filled, so 50% complete; quality rank = (0.9 + 0.7) / 4 = 0.4
B456 - 2 out of 3 attributes filled, so 66.66% complete; quality rank = (1 + 0.8) / 3 = 0.6
Solved it by exploding both tables: calculated the weight and rank for the first table and then joined it with the other table. I was not able to do it in a single SQL statement, though (see the sketch after the output below).
scala> val t1 = Seq((12345, List("a1", "a2", "a3", "a5")), (6789, List("b1", "b5", "b7"))).toDF("a_id", "attribute_values")
scala> val t2 = Seq(("B123", 12345, List("a2", "a5")), ("B456", 6789, List("b1", "b7"))).toDF("b_id","a_id", "attribute_values")
scala> val t1_1 = t1.select($"a_id", posexplode($"attribute_values"))
scala> t1_1.show
+-----+---+---+
| a_id|pos|col|
+-----+---+---+
|12345| 0| a1|
|12345| 1| a2|
|12345| 2| a3|
|12345| 3| a5|
| 6789| 0| b1|
| 6789| 1| b5|
| 6789| 2| b7|
+-----+---+---+
scala> t1_1.createOrReplaceTempView("tab_t1_1")
scala> spark.sql("select *, 1 - (pos * 0.1) as calc_weight, count(col) over (partition by a_id) as rec_count from tab_t1_1").show
+-----+---+---+-----------+---------+
| a_id|pos|col|calc_weight|rec_count|
+-----+---+---+-----------+---------+
| 6789| 0| b1| 1.0| 3|
| 6789| 1| b5| 0.9| 3|
| 6789| 2| b7| 0.8| 3|
|12345| 0| a1| 1.0| 4|
|12345| 1| a2| 0.9| 4|
|12345| 2| a3| 0.8| 4|
|12345| 3| a5| 0.7| 4|
+-----+---+---+-----------+---------+
scala> val t1_2 = spark.sql("select *, 1 - (pos * 0.1) as calc_weight, count(col) over (partition by a_id) as rec_count from tab_t1_1")
scala> t1_2.createOrReplaceTempView("tab_t1_2")
scala> val t2_1 = t2.select($"b_id", $"a_id", explode($"attribute_values"))
scala> t2_1.show
+----+-----+---+
|b_id| a_id|col|
+----+-----+---+
|B123|12345| a2|
|B123|12345| a5|
|B456| 6789| b1|
|B456| 6789| b7|
+----+-----+---+
scala> t2_1.createOrReplaceTempView("tab_t2_1")
scala> spark.sql("Select b_id, t1.a_id, round(count(t2.col)*100/max(t1.rec_count),2) as percent_complete, round(sum(t1.calc_weight)/ max(t1.rec_count),2) as quality_rank from tab_t1_2 t1, tab_t2_1 t2 where t1.a_id = t2.a_id and t1.col = t2.col group by b_id, t1.a_id").show
+----+-----+----------------+------------+
|b_id| a_id|percent_complete|quality_rank|
+----+-----+----------------+------------+
|B123|12345| 50.0| 0.40|
|B456| 6789| 66.67| 0.60|
+----+-----+----------------+------------+
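Following up on "not able to do it in a single SQL statement": below is a sketch of how the same calculation might be collapsed into one query, assuming the un-exploded t1 and t2 are registered as temp views. It is written PySpark-style for consistency with the earlier answers; the spark.sql call itself works unchanged from the Scala shell.
# Sketch: percent_complete and quality_rank in a single SQL statement.
# Assumes the t1/t2 DataFrames are available and registered as temp views.
t1.createOrReplaceTempView("t1")
t2.createOrReplaceTempView("t2")

spark.sql("""
    SELECT e2.b_id,
           e1.a_id,
           ROUND(COUNT(*) * 100 / MAX(e1.rec_count), 2)      AS percent_complete,
           ROUND(SUM(e1.calc_weight) / MAX(e1.rec_count), 2) AS quality_rank
    FROM (
        SELECT a_id, col,
               1 - (pos * 0.1)        AS calc_weight,
               SIZE(attribute_values) AS rec_count
        FROM t1
        LATERAL VIEW POSEXPLODE(attribute_values) p AS pos, col
    ) e1
    JOIN (
        SELECT b_id, a_id, col
        FROM t2
        LATERAL VIEW EXPLODE(attribute_values) e AS col
    ) e2
      ON e1.a_id = e2.a_id AND e1.col = e2.col
    GROUP BY e2.b_id, e1.a_id
""").show()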
I have two tables.
T1
--------------------------
|IDT1|DESCR | VALUE |
--------------------------
| 1|TEST 1 | 100|
| 2|TEST 2 | 80|
--------------------------
T2
-----------
|IDT2|IDT1|
-----------
| 1| 1|
| 2| 1|
| 3| 2|
-----------
The field T2.IDT1 is foreign key of T1.IDT1.
I need to omit the duplicate values of T1 table (only), like the second row in the below result.
----------------------------
|IDT1|DESCR |IDT2| VALUE|
----------------------------
| 1|TEST 1 | 1| 100|
| | | 2| |
| 2|TEST 2 | 3| 80|
----------------------------
I am using Firebird 2.5.
I'm not familiar with Firebird, but if this were an Oracle DB, you could try this:
select
    t1.idt1,
    t1.descr,
    t2.idt2,
    t1.value
from (
    select
        t2.idt2 idt2,
        case
            when lag(t2.idt1) over (order by t2.idt1, t2.idt2) = t2.idt1 then null
            else t2.idt1
        end idt1
    from t2
) t2
left outer join t1
    on t1.idt1 = t2.idt1
order by 3;
You can test that here: SQL Fiddle