How to use Window.unboundedPreceding and Window.unboundedFollowing on distinct datetimes in a DataFrame

I have data like below:

| Id | DateTime                     | products |
|----|------------------------------|----------|
| 1  | 2017-08-24T00:00:00.000+0000 | 1        |
| 1  | 2017-08-24T00:00:00.000+0000 | 2        |
| 1  | 2017-08-24T00:00:00.000+0000 | 3        |
| 1  | 2016-05-24T00:00:00.000+0000 | 1        |
I am using Window.unboundedPreceding and Window.unboundedFollowing as below to get the second most recent datetime:
sorted_times = Window.partitionBy('Id').orderBy(F.col('ModifiedTime').desc()).rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df3 = data.withColumn("second_recent", F.collect_list(F.col('ModifiedTime')).over(sorted_times).getItem(1))
But I get the results below: the second_recent value is taken from the second element of the collected list, which is the same datetime as the first.
| Id | DateTime                     | secondtime                   | Products |
|----|------------------------------|------------------------------|----------|
| 1  | 2017-08-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 1        |
| 1  | 2017-08-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 2        |
| 1  | 2017-08-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 3        |
| 1  | 2016-05-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 1        |
Please help me find the second latest datetime over the distinct datetimes.
Thanks in advance

Use collect_set instead of collect_list for no duplicates:
df3 = data.withColumn(
    "second_recent",
    F.collect_set(F.col('LastModifiedTime')).over(sorted_times)[1]
)
df3.show(truncate=False)
#+-----+----------------------------+--------+----------------------------+
#|VipId|LastModifiedTime |products|second_recent |
#+-----+----------------------------+--------+----------------------------+
#|1 |2017-08-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|2 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|3 |2016-05-24T00:00:00.000+0000|
#|1 |2016-05-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#+-----+----------------------------+--------+----------------------------+
Another way is to use an unordered window and sort the collected array before taking the second_recent element:
from pyspark.sql import functions as F, Window
df3 = data.withColumn(
    "second_recent",
    F.sort_array(
        F.collect_set(F.col('LastModifiedTime')).over(Window.partitionBy('VipId')),
        False
    )[1]
)
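As a side note: if an Id has only one distinct timestamp, there is no second element to take. A minimal sketch of a variant (assuming the same VipId/LastModifiedTime column names as in the answer) using element_at, which is 1-based and returns null for an out-of-range index when ANSI mode is off:
from pyspark.sql import functions as F, Window

w = Window.partitionBy('VipId')
df3 = data.withColumn(
    "second_recent",
    # sort the distinct timestamps in descending order, then take the 2nd (1-based);
    # element_at yields null when a group has fewer than 2 distinct timestamps
    F.element_at(
        F.sort_array(F.collect_set('LastModifiedTime').over(w), asc=False),
        2
    )
)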

Related

Transposing in Spark SQL

For the following table structure:
+----------------+--------+--------+-----+----------+
| timestamp      | value1 | value2 | ... | value100 |
+----------------+--------+--------+-----+----------+
| 1/1/1 00:00:00 | 1      | 2      |     | 100      |
+----------------+--------+--------+-----+----------+
How could I transpose it into a structure like this using Spark SQL syntax?
+----------------+--------------+-------+
| timestamp      | id           | value |
+----------------+--------------+-------+
| 1/1/1 00:00:00 | value1       | 1     |
| 1/1/1 00:00:00 | value2       | 2     |
| 1/1/1 00:00:00 | ... value100 | 100   |
+----------------+--------------+-------+
In Python or R this would be relatively straightforward, and UNPIVOT doesn't seem to be applicable here.
A more concise approach would be to use STACK
Data Preparation
sparkDF = sql.createDataFrame(
    [("20201021T00:00:00+0530", 10, 97, 23, 214),
     ("20211011T00:00:00+0530", 23, 8218, 9192, 827),
     ("20200212T00:00:00+0300", 51, 981, 18, 10),
     ("20211021T00:00:00+0530", 10, 2197, 871, 108),
     ("20211021T00:00:00+0900", 128, 9812, 98, 192),
     ("20211021T00:00:00-0500", 218, 487, 21, 51)],
    ['timestamp', 'value1', 'value2', 'value3', 'value4']
)

# sparkDF.show(truncate=False)

sparkDF.createOrReplaceTempView("sparkDF")
sql.sql("""
SELECT
timestamp
,STACK(4,'value1',value1
,'value2',value2
,'value3',value3
,'value4',value4
) as (id,value)
FROM sparkDF
""").show()
+--------------------+------+-----+
| timestamp| id|value|
+--------------------+------+-----+
|20201021T00:00:00...|value1| 10|
|20201021T00:00:00...|value2| 97|
|20201021T00:00:00...|value3| 23|
|20201021T00:00:00...|value4| 214|
|20211011T00:00:00...|value1| 23|
|20211011T00:00:00...|value2| 8218|
|20211011T00:00:00...|value3| 9192|
|20211011T00:00:00...|value4| 827|
|20200212T00:00:00...|value1| 51|
|20200212T00:00:00...|value2| 981|
|20200212T00:00:00...|value3| 18|
|20200212T00:00:00...|value4| 10|
|20211021T00:00:00...|value1| 10|
|20211021T00:00:00...|value2| 2197|
|20211021T00:00:00...|value3| 871|
|20211021T00:00:00...|value4| 108|
|20211021T00:00:00...|value1| 128|
|20211021T00:00:00...|value2| 9812|
|20211021T00:00:00...|value3| 98|
|20211021T00:00:00...|value4| 192|
+--------------------+------+-----+
Stack String
You can also build the stack_str programmatically, depending on which columns you want:
col_len = 4

stack_str = ''
for i in range(col_len):
    if i == 0:
        stack_str += f"'value{i+1}',value{i+1}"
    else:
        stack_str += f",'value{i+1}',value{i+1}"

stack_str = f"STACK({col_len},{stack_str}) as (id,value)"
stack_str
"STACK(4,'value1',value1,'value2',value2,'value3',value3,'value4',value4) as (id,value)"
sql.sql(f"""
SELECT
timestamp
,{stack_str}
FROM sparkDF
""").show()
This produces the same output as shown above.
You could do the same using plain SQL with UNION ALL as follows:
select timestamp
      ,'value1' as id
      ,value1 as value
from sparkDF
union all
select timestamp
      ,'value2' as id
      ,value2 as value
from sparkDF
union all
select timestamp
      ,'value3' as id
      ,value3 as value
from sparkDF
union all
select timestamp
      ,'value4' as id
      ,value4 as value
from sparkDF
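If you prefer the DataFrame API over a temp view, the same stack expression can be passed to selectExpr (a small sketch reusing the sparkDF and stack_str built above):
# Unpivot by applying the generated stack expression directly to the DataFrame
unpivoted = sparkDF.selectExpr("timestamp", stack_str)
unpivoted.show(truncate=False)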

create multiple rows for each month from date range

The source has data like:

| Colum1 | Colum2 | Colum3 | Colum4 | Colum5 | Start_date | End_date |
|--------|--------|--------|--------|--------|------------|----------|
| A      | B      | A      | B      | A      | 1/1/2021   | 4/1/2021 |

Is it possible to get data as follows using a query in Netezza?

| Colum1 | Colum2 | Colum3 | Colum4 | Colum5 | Month    |
|--------|--------|--------|--------|--------|----------|
| A      | B      | A      | B      | A      | 1-Jan-21 |
| A      | B      | A      | B      | A      | 1-Feb-21 |
| A      | B      | A      | B      | A      | 1-Mar-21 |
| A      | B      | A      | B      | A      | 1-Apr-21 |
Sure, you need a time dimension and then do a 'between-join' against it:
Create temp table TimeDim as
Select '2010-01-01'::date + ((datasliceid - 1) || ' months')::interval as FirstDayOfMonth
From _v_dual_dslice
;
Then the between-join:
Select *
From YourTable join TimeDim
  On FirstDayOfMonth between Start_date and End_Date
;
Can you follow?
Lars

Using pyspark to create a segment array from a flat record

I have a sparsely populated table with values for various segments for unique user ids. I need to create an array with the user_id and only the relevant segment headers.
Please note that this is just an indicative dataset. I have several hundreds of segments like these.
------------------------------------------------
| user_id | seg1 | seg2 | seg3 | seg4 | seg5 |
------------------------------------------------
| 100     | M    | null | 25   | null | 30   |
| 200     | null | null | 43   | null | 250  |
| 300     | F    | 3000 | null | 74   | null |
------------------------------------------------
I am expecting the output to be
--------------------------------
| user_id | segment_array      |
--------------------------------
| 100     | [seg1, seg3, seg5] |
| 200     | [seg3, seg5]       |
| 300     | [seg1, seg2, seg4] |
--------------------------------
Is there any function available in PySpark or Spark SQL to accomplish this?
Thanks for your help!
I cannot find a direct way, but you can do this:
from pyspark.sql.functions import array, array_remove, col, lit, when

cols = df.columns[1:]
r = df.withColumn('array', array(*[when(col(c).isNotNull(), lit(c)).otherwise('notmatch') for c in cols])) \
      .withColumn('array', array_remove('array', 'notmatch'))
r.show()
+-------+----+----+----+----+----+------------------+
|user_id|seg1|seg2|seg3|seg4|seg5| array|
+-------+----+----+----+----+----+------------------+
| 100| M|null| 25|null| 30|[seg1, seg3, seg5]|
| 200|null|null| 43|null| 250| [seg3, seg5]|
| 300| F|3000|null| 74|null|[seg1, seg2, seg4]|
+-------+----+----+----+----+----+------------------+
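A variant of the above that skips the 'notmatch' placeholder by filtering nulls out of the array directly; a sketch assuming Spark 3.1+, where pyspark.sql.functions.filter accepts a Python lambda:
from pyspark.sql import functions as F

cols = df.columns[1:]  # seg1 ... seg5
r = df.withColumn(
    'segment_array',
    # build [seg name or null, ...] per row, then keep only the non-null entries
    F.filter(
        F.array(*[F.when(F.col(c).isNotNull(), F.lit(c)) for c in cols]),
        lambda x: x.isNotNull()
    )
)
r.select('user_id', 'segment_array').show(truncate=False)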
Not sure this is the best way but I'd attack it this way:
There's the collect_set function, which will always give you unique values across the list of values you aggregate over.
Do a union of one select per segment:
df_seg_1 = df.select(
    'user_id',
    fn.when(
        col('seg1').isNotNull(),
        lit('seg1')
    ).alias('segment')
)

# repeat for all segments
df = df_seg_1.union(df_seg_2).union(...)
df.groupBy('user_id').agg(collect_list('segment'))
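To avoid writing one block per segment by hand, the per-segment selects can be built in a loop and unioned with functools.reduce; a sketch assuming the same df and that fn is the pyspark.sql.functions alias:
from functools import reduce
from pyspark.sql import functions as fn

seg_cols = [c for c in df.columns if c != 'user_id']  # seg1 ... seg5

# one (user_id, segment-name-or-null) DataFrame per segment column
per_seg = [
    df.select('user_id',
              fn.when(fn.col(c).isNotNull(), fn.lit(c)).alias('segment'))
    for c in seg_cols
]

stacked = reduce(lambda a, b: a.union(b), per_seg)
result = (stacked.where(fn.col('segment').isNotNull())
                 .groupBy('user_id')
                 .agg(fn.collect_set('segment').alias('segment_array')))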

convert row to column in spark

I have data like below. I have to display the year_month column column-wise. How should I do this? I am new to Spark.
scala> spark.sql("""select sum(actual_calls_count),year_month from ph_com_b_gbl_dice.dm_rep_customer_call group by year_month""")
res0: org.apache.spark.sql.DataFrame = [sum(actual_calls_count): bigint, year_month: string]
scala> res0.show
+-----------------------+----------+
|sum(actual_calls_count)|year_month|
+-----------------------+----------+
| 1| 2019-10|
| 3693| 2018-10|
| 7| 2019-11|
| 32| 2017-10|
| 94| 2019-03|
| 10527| 2018-06|
| 4774| 2017-05|
| 1279| 2017-11|
| 331982| 2018-03|
| 315767| 2018-02|
| 7097| 2017-03|
| 8| 2017-08|
| 3| 2019-07|
| 3136| 2017-06|
| 6088| 2017-02|
| 6344| 2017-04|
| 223426| 2018-05|
| 9819| 2018-08|
| 1| 2017-07|
| 68| 2019-05|
+-----------------------+----------+
only showing top 20 rows
My output should be like this:
sum(actual_calls_count)|year_month1 | year_month2 | year_month3 and so on..
scala> df.groupBy(lit(1)).pivot(col("year_month")).agg(concat_ws("", collect_list(col("sum(actual_calls_count)")))).drop("1").show(false)
+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|2017-02|2017-03|2017-04|2017-05|2017-06|2017-07|2017-08|2017-10|2017-11|2018-02|2018-03|2018-05|2018-06|2018-08|2018-10|2019-03|2019-05|2019-07|2019-10|2019-11|
+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|6088 |7097 |6344 |4774 |3136 |1 |8 |32 |1279 |315767 |331982 |223426 |10527 |9819 |3693 |94 |68 |3 |1 |7 |
+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
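For reference, a rough PySpark equivalent of the Scala one-liner above (a sketch assuming the aggregated DataFrame is called df with columns sum(actual_calls_count) and year_month):
from pyspark.sql import functions as F

pivoted = (df.withColumnRenamed('sum(actual_calls_count)', 'calls')
             .groupBy(F.lit(1).alias('dummy'))  # single group so all months land in one row
             .pivot('year_month')
             .agg(F.first('calls'))
             .drop('dummy'))
pivoted.show(truncate=False)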

How to filter a dates column with a condition from another column in PySpark?

Assume I have the following data frame named table_df in PySpark:

sid  | date     | label
-----|----------|------
1033 | 20170521 | 0
1033 | 20170520 | 0
1033 | 20170519 | 1
1033 | 20170516 | 0
1033 | 20170515 | 0
1033 | 20170511 | 1
1033 | 20170511 | 0
1033 | 20170509 | 0
...
The data frame table_df contains different IDs across its rows; the above is simply one typical ID.
For each ID and for each date with label 1, I would like to find the date with label 0 that is the closest and before.
For the above table, with ID 1033, date=20170519, label 1, the date of label 0 that is closest and before is 20170516.
And with ID 1033, date=20170511, label 1, the date of label 0 that is closest and before is 20170509 .
So, finally using groupBy and some complicated operations, I will obtain the following table:
sid  | filtered_date
-----|--------------
1033 | 20170516
1033 | 20170509
Any help is highly appreciated. I tried but could not find any smart ways.
Thanks
We can use a window partitioned by sid and ordered by date, and take the difference between each label and the next row's label:
df.show()
+----+--------+-----+
| sid| date|label|
+----+--------+-----+
|1033|20170521| 0|
|1033|20170520| 0|
|1033|20170519| 1|
|1033|20170516| 0|
|1033|20170515| 0|
|1033|20170511| 1|
|1033|20170511| 0|
|1033|20170509| 0|
+----+--------+-----+
from pyspark.sql import Window
from pyspark.sql import functions as F
w = Window.partitionBy('sid').orderBy('date')
df.withColumn('diff',F.lead('label').over(w) - df['label']).where(F.col('diff') == 1).drop('diff').show()
+----+--------+-----+
| sid| date|label|
+----+--------+-----+
|1033|20170509| 0|
|1033|20170516| 0|
+----+--------+-----+
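To get the exact (sid, filtered_date) shape asked for, it is just a select and rename on top of the same result (a small sketch continuing from the code above):
result = (df.withColumn('diff', F.lead('label').over(w) - F.col('label'))
            .where(F.col('diff') == 1)
            .select('sid', F.col('date').alias('filtered_date')))
result.show()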