I have data like below
|Id | DateTime                     | products |
|---|------------------------------|----------|
| 1 | 2017-08-24T00:00:00.000+0000 | 1        |
| 1 | 2017-08-24T00:00:00.000+0000 | 2        |
| 1 | 2017-08-24T00:00:00.000+0000 | 3        |
| 1 | 2016-05-24T00:00:00.000+0000 | 1        |
I am using Window.unboundedPreceding and Window.unboundedFollowing as below to get the second most recent datetime.
sorted_times = Window.partitionBy('Id').orderBy(F.col('ModifiedTime').desc()).rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df3 = data.withColumn("second_recent", F.collect_list(F.col('ModifiedTime')).over(sorted_times).getItem(1))
But I get the results below: the second datetime is taken from the second row, which is the same as the first row.
|Id | DateTime                     | secondtime                   | Products |
|---|------------------------------|------------------------------|----------|
| 1 | 2017-08-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 1        |
| 1 | 2017-08-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 2        |
| 1 | 2017-08-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 3        |
| 1 | 2016-05-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 1        |
Please help me find the second latest datetime based on distinct datetimes.
Thanks in advance
Use collect_set instead of collect_list so that there are no duplicates:
df3 = data.withColumn(
    "second_recent",
    F.collect_set(F.col('LastModifiedTime')).over(sorted_times)[1]
)
df3.show(truncate=False)
#+-----+----------------------------+--------+----------------------------+
#|VipId|LastModifiedTime |products|second_recent |
#+-----+----------------------------+--------+----------------------------+
#|1 |2017-08-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|2 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|3 |2016-05-24T00:00:00.000+0000|
#|1 |2016-05-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#+-----+----------------------------+--------+----------------------------+
Another way is to use an unordered window and sort the array in descending order before taking the second element:
from pyspark.sql import functions as F, Window
df3 = data.withColumn(
    "second_recent",
    F.sort_array(
        F.collect_set(F.col('LastModifiedTime')).over(Window.partitionBy('VipId')),
        False
    )[1]
)
For the following table structure:
+------------------------------------------------------+
| timestamp | value1 | value2 | ... value100 |
+------------------------------------------------------+
|1/1/1 00:00:00 | 1 | 2 | 100 |
+------------------------------------------------------+
How could I transpose it into a structure like this using Spark SQL syntax?
+---------------------------------------+
| timestamp | id | value |
+---------------------------------------+
|1/1/1 00:00:00 | value1 | 1 |
|1/1/1 00:00:00 | value2 | 2 |
|1/1/1 00:00:00 | ... value100 | 100 |
+---------------------------------------+
In Python or R this would be relatively straightforward, and UNPIVOT doesn't seem to be applicable here.
A more concise approach would be to use STACK.
Data Preparation
sparkDF = sql.createDataFrame([("20201021T00:00:00+0530",10,97,23,214),
                               ("20211011T00:00:00+0530",23,8218,9192,827),
                               ("20200212T00:00:00+0300",51,981,18,10),
                               ("20211021T00:00:00+0530",10,2197,871,108),
                               ("20211021T00:00:00+0900",128,9812,98,192),
                               ("20211021T00:00:00-0500",218,487,21,51)],
                              ['timestamp','value1','value2','value3','value4'])
# sparkDF.show(truncate=False)
sparkDF.createOrReplaceTempView("sparkDF")
sql.sql("""
SELECT
timestamp
,STACK(4,'value1',value1
,'value2',value2
,'value3',value3
,'value4',value4
) as (id,value)
FROM sparkDF
""").show()
+--------------------+------+-----+
| timestamp| id|value|
+--------------------+------+-----+
|20201021T00:00:00...|value1| 10|
|20201021T00:00:00...|value2| 97|
|20201021T00:00:00...|value3| 23|
|20201021T00:00:00...|value4| 214|
|20211011T00:00:00...|value1| 23|
|20211011T00:00:00...|value2| 8218|
|20211011T00:00:00...|value3| 9192|
|20211011T00:00:00...|value4| 827|
|20200212T00:00:00...|value1| 51|
|20200212T00:00:00...|value2| 981|
|20200212T00:00:00...|value3| 18|
|20200212T00:00:00...|value4| 10|
|20211021T00:00:00...|value1| 10|
|20211021T00:00:00...|value2| 2197|
|20211021T00:00:00...|value3| 871|
|20211021T00:00:00...|value4| 108|
|20211021T00:00:00...|value1| 128|
|20211021T00:00:00...|value2| 9812|
|20211021T00:00:00...|value3| 98|
|20211021T00:00:00...|value4| 192|
+--------------------+------+-----+
Stack String
You can further build the stack_str programmatically, depending on the columns you want:
col_len = 4
stack_str = ''
for i in range(col_len):
    if i == 0:
        stack_str += f'\'value{i+1}\',value{i+1}'
    else:
        stack_str += f',\'value{i+1}\',value{i+1}'
stack_str = f"STACK({col_len},{stack_str}) as (id,value)"
stack_str
"STACK(4,'value1',value1,'value2',value2,'value3',value3,'value4',value4) as (id,value)"
sql.sql(f"""
SELECT
timestamp
,{stack_str}
FROM sparkDF
""").show()
+--------------------+------+-----+
| timestamp| id|value|
+--------------------+------+-----+
|20201021T00:00:00...|value1| 10|
|20201021T00:00:00...|value2| 97|
|20201021T00:00:00...|value3| 23|
|20201021T00:00:00...|value4| 214|
|20211011T00:00:00...|value1| 23|
|20211011T00:00:00...|value2| 8218|
|20211011T00:00:00...|value3| 9192|
|20211011T00:00:00...|value4| 827|
|20200212T00:00:00...|value1| 51|
|20200212T00:00:00...|value2| 981|
|20200212T00:00:00...|value3| 18|
|20200212T00:00:00...|value4| 10|
|20211021T00:00:00...|value1| 10|
|20211021T00:00:00...|value2| 2197|
|20211021T00:00:00...|value3| 871|
|20211021T00:00:00...|value4| 108|
|20211021T00:00:00...|value1| 128|
|20211021T00:00:00...|value2| 9812|
|20211021T00:00:00...|value3| 98|
|20211021T00:00:00...|value4| 192|
+--------------------+------+-----+
You could do the same using regular SQL as follows:
select timestamp
,'value1' as id
,value1 as value
from table
union all
select timestamp
,'value2' as id
,value2 as value
from table
union all
select timestamp
,'value3' as id
,value3 as value
from table
I have a production table in Hive which gets incremental data (changed/new records) from an external source on a daily basis. The values for a row may be spread across different dates; for example, this is how the records in the table look on the first day:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| |
| 3| | b3|
+---+----+----+
On the second day, we get the following -
+---+----+----+
| id|col1|col2|
+---+----+----+
| 4| a4| |
| 2| | b2 |
| 3| a3| |
+---+----+----+
which has new records as well as changed records.
The result I want to achieve is a merge of rows based on the primary key (id in this case), producing the output below -
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| b2 |
| 3| a3| b3|
| 4| a4| b4|
+---+----+----+
The number of columns is pretty large, typically in the range of 100-150. The aim is to provide the latest full view of all the data received so far. How can I do this within Hive itself?
(P.S.: it doesn't have to be sorted)
This can be achieved using COALESCE and a FULL OUTER JOIN.
SELECT COALESCE(a.id, b.id) as id,
       COALESCE(a.col1, b.col1) as col1,
       COALESCE(a.col2, b.col2) as col2
FROM tbl1 a
FULL OUTER JOIN tbl2 b
  ON a.id = b.id
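Since the question mentions 100-150 columns, here is a minimal sketch of generating that query instead of writing every COALESCE by hand. It assumes the two snapshots are registered as tables named tbl1 and tbl2 (hypothetical names), joined on id, and uses PySpark only to read the column list and run the generated HiveQL:

# Hypothetical sketch: tbl1 = data received so far, tbl2 = today's increment,
# both with the same schema and joinable on `id`; `spark` is a SparkSession.
cols = [c for c in spark.table("tbl1").columns if c != "id"]

# One COALESCE per column, mirroring the answer's a-then-b order; swap the
# arguments if the incremental table should take precedence for changed rows.
select_list = ",\n       ".join(f"COALESCE(a.{c}, b.{c}) AS {c}" for c in cols)

query = f"""
SELECT COALESCE(a.id, b.id) AS id,
       {select_list}
FROM tbl1 a
FULL OUTER JOIN tbl2 b
  ON a.id = b.id
"""
merged = spark.sql(query)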
I have two tables.
T1
--------------------------
|IDT1|DESCR | VALUE |
--------------------------
| 1|TEST 1 | 100|
| 2|TEST 2 | 80|
--------------------------
T2
-----------
|IDT2|IDT1|
-----------
| 1| 1|
| 2| 1|
| 3| 2|
-----------
The field T2.IDT1 is a foreign key referencing T1.IDT1.
I need to omit the duplicate values from the T1 table (only), as in the second row of the result below.
----------------------------
|IDT1|DESCR |IDT2| VALUE|
----------------------------
| 1|TEST 1 | 1| 100|
| | | 2| |
| 2|TEST 2 | 3| 80|
----------------------------
I am using Firebird 2.5.
I'm not familiar with Firebird, but if this were an Oracle DB, you could try this:
select
    t1.idt1,
    t1.descr,
    t2.idt2,
    t1.value
from (
    select
        t2.idt2 idt2,
        case
            when lag(t2.idt1) over (order by t2.idt1, t2.idt2) = t2.idt1 then null
            else t2.idt1
        end idt1
    from t2
) t2
left outer join t1
    on t1.idt1 = t2.idt1
order by 3;
You can test that here: SQL Fiddle
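Since Firebird 2.5 does not have window functions (they were introduced in Firebird 3.0), a rough sketch of the same idea using only a correlated subquery might look like this (untested on Firebird, so treat it as an assumption):

select
    -- show the T1 columns only on the first T2 row per IDT1
    case when t2.idt2 = (select min(x.idt2) from t2 x where x.idt1 = t2.idt1)
         then t1.idt1 end as idt1,
    case when t2.idt2 = (select min(x.idt2) from t2 x where x.idt1 = t2.idt1)
         then t1.descr end as descr,
    t2.idt2,
    case when t2.idt2 = (select min(x.idt2) from t2 x where x.idt1 = t2.idt1)
         then t1.value end as val
from t2
join t1 on t1.idt1 = t2.idt1
order by t2.idt1, t2.idt2;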
I need to get an intersection of two tables that are related to each other through two many-to-many tables. Example tables as follows:
**Discount**
|DisId| Discount|Amount|
+-----+---------+------+
|    1|   2% Off|  0.02|
|    2|  10% Off|  0.10|
|    3|  25% Off|  0.25|
|    4|  2 for 1|  0.50|
|    5|Clearance|  0.75|

**DiscountRef**
|DisId|RefType|RefId|IsActive|
+-----+-------+-----+--------+
|    1|Product| 9004|       0|
|    2|Product| 9002|       0|
|    2|   PCat| 3456|       1|
|    3|   PCat| 7346|       1|
|    3|   PCat| 4455|       1|
|    5|Product| 9004|       0|

**ProductCat**
|ProdId|CatId|
+------+-----+
|  9001| 3456|
|  9002| 3456|
|  9005| 3456|
|  9001| 7346|
|  9003| 7346|
|  9003| 4455|
|  9006| 4455|

**Product**
|ProdId|      ProdName|ProdPrice|
+------+--------------+---------+
|  9001|       9" Nail|     0.50|
|  9002|    2"x4" Stud|     2.50|
|  9003|   Claw Hammer|     5.99|
|  9004|     Wood Glue|     1.20|
|  9005|6'x4' Dry Wall|    10.39|
|  9006|   Screwdriver|     4.25|
With these tables I need to get the intersection of product categories if they are under the same Discount Id. The table below is what I need to get:
|DisId|ProdId|DisPrice|
+-----+------+--------+
| 2| 9001| 0.45|
| 2| 9002| 2.25|
| 2| 9005| 9.36|
| 3| 9003| 4.50|
I have tried a few different ways but can't seem to get to that table. The SQL below returns the discounts that have more than one category applied to them.
SELECT DR.DisId, PC.CatId
FROM DiscountRef DR
INNER JOIN (
SELECT DisId
FROM DiscountRef
GROUP BY DisId
HAVING COUNT(DisId) > 1
) SDR ON SDR.DisId = DR.DisId
INNER JOIN ProductCat PC ON PC.CatId = DR.RefId AND DR.RefType = 'PCat'
GROUP BY DR.DisId, PC.CatId
Table Returned:
|DisId|CatId|
+-----+-----+
| 3| 7346|
| 3| 4455|
Then, using those category Ids in an INTERSECT over the Product table, I get the correct set of product Ids.
SELECT P1.ProdId
FROM Product P1
INNER JOIN ProductCat PC1 ON PC1.ProdId = P1.ProdId AND PC1.CatId = 7346
INTERSECT
SELECT P2.ProdId
FROM Product P2
INNER JOIN ProductCat PC2 ON PC2.ProdId = P2.ProdId AND PC2.CatId = 4455
Also, a discount can have more than two categories (which narrows down the number of products), and sometimes there is more than one active discount (the discount data for that is omitted here, but a check will be done).
Any help on how I can get my desired table above?
EDIT: If there are multiple rows for a DisId in the DiscountRef table and they happen to be of the PCat type, the matching products are the ones shared across all of those categories. For example, Claw Hammer is the only item that appears in both CatId 7346 AND CatId 4455.
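Not a definitive answer, but one way to express that "shared across all active PCat categories of a discount" rule is relational division: count the discount's active PCat rows and keep only products whose categories match all of them, then apply the discount. A rough sketch in generic SQL against the tables above (the DisPrice rounding may differ slightly from the sample output):

SELECT DR.DisId,
       P.ProdId,
       P.ProdPrice * (1 - D.Amount) AS DisPrice
FROM DiscountRef DR
INNER JOIN ProductCat PC ON PC.CatId = DR.RefId
INNER JOIN Product P     ON P.ProdId = PC.ProdId
INNER JOIN Discount D    ON D.DisId  = DR.DisId
WHERE DR.RefType = 'PCat'
  AND DR.IsActive = 1
GROUP BY DR.DisId, P.ProdId, P.ProdPrice, D.Amount
-- relational division: the product must hit every active PCat row of the discount
HAVING COUNT(DISTINCT PC.CatId) = (SELECT COUNT(*)
                                   FROM DiscountRef DR2
                                   WHERE DR2.DisId = DR.DisId
                                     AND DR2.RefType = 'PCat'
                                     AND DR2.IsActive = 1)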