I want to create a new column in a PySpark DataFrame with row numbers that repeat every N rows, irrespective of the other columns in the DataFrame.
Original data:
name year
A 2010
A 2011
A 2011
A 2013
A 2014
A 2015
A 2016
A 2018
B 2018
B 2019
I want a new column where each row number repeats N times; consider N=3.
Expected Output:
name year rownumber
A 2010 1
A 2011 1
A 2011 1
A 2013 2
A 2014 2
A 2015 2
A 2016 3
A 2018 3
B 2018 3
B 2019 4
You can try row_number with integer division:
from pyspark.sql import functions as F, Window

n = 3
df.withColumn("rownum",
    ((F.row_number().over(Window.orderBy(F.lit(0))) - 1) / n).cast("int") + 1).show()
+----+----+------+
|name|year|rownum|
+----+----+------+
| A|2010| 1|
| A|2011| 1|
| A|2011| 1|
| A|2013| 2|
| A|2014| 2|
| A|2015| 2|
| A|2016| 3|
| A|2018| 3|
| B|2018| 3|
| B|2019| 4|
+----+----+------+
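Note that Window.orderBy(F.lit(0)) with no partitionBy pulls every row into a single partition (Spark logs a warning about this). For larger data, one alternative is to assign a consecutive index with zipWithIndex and derive the repeating group from it. A rough sketch, assuming the DataFrame's current row order is the order you want to number:
from pyspark.sql import functions as F

n = 3
# zipWithIndex assigns a consecutive 0-based index in the RDD's current order
indexed = df.rdd.zipWithIndex().toDF(["row", "idx"])
result = indexed.select("row.*", (F.floor(F.col("idx") / n) + 1).alias("rownumber"))
result.show()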
Say I have two tables. In the example below only two columns are changing, but I'm not sure whether a pivot would work well for 10 columns.
Table 1:
--------------------------
|id |filtercol| inputid1|
--------------------------
|100| 10 | 4 |
|108| 10 | 5 |
|200| 9 | 4 |
|106| 9 | 6 |
|110| 11 | 7 |
|130| 9 | 7 |
--------------------------
Table 2:
---------------------------------
|a | b | c | d |
---------------------------------
|"hello"| 1 | 4 | 6 |
|"world"| 2 | 5 | 6 |
|"test" | 3 | 4 | 7 |
---------------------------------
I want the final table to be
----------------------------------
|a | b | 10 | 11|
----------------------------------
|"hello"| 1 | 100 | |
|"world"| 2 | 108 | |
|"test" | 3 | 100 |110|
---------------------------------
So column c will be renamed to 10 and column d will be renamed to 11.
Then 10 is used as the filter on table 1's filtercol column, and the values in columns c and d are used as lookup values against column inputid1. Whatever match is found, the table 2 value is replaced with the corresponding id from table 1.
For example, the first row of the new table has 100 in column 10 because the original value in this column (4) was used as the lookup against inputid1, the new column name 10 was used as the filter on filtercol, and the matching id was 100, so 4 is replaced with 100.
The reason null is returned in column 11 is that looking up 6 in inputid1 with 11 as the filtercol returns no rows.
I was thinking of joining and filtering, but that does not seem like a good solution when, say, I also have columns e, f, g, h, i, j to check. My attempt so far:
from pyspark.sql.functions import col

df2 = df.withColumnRenamed("c", "10").withColumnRenamed("d", "11")
table3df = (
    df1.join(df2,
             df1.inputid1 == df2["10"], how='left')
)
# keep only the rows where filtercol matches the new column name
table3df = table3df.filter(col("filtercol") == 10)
I was playing with your example a bit and have not fully implemented it yet. You did not mention what to do when there are multiple matches for a value in column c. I resolved that with max, which gave me a different answer than the one you were expecting.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when
spark = SparkSession.builder.getOrCreate()
table1_df = spark.sql("""
SELECT 100 as id, 10 as filtercol, 4 as inputid1
UNION ALL
SELECT 108, 10, 5
UNION ALL
SELECT 200, 9, 4
UNION ALL
SELECT 106, 9, 6
UNION ALL
SELECT 110, 11, 7
UNION ALL
SELECT 130, 9, 7
""").alias("table1")
table2_df = spark.sql("""
SELECT 'hello' as a, 1 as b, 4 as c, 6 as d
UNION ALL
SELECT 'world', 2, 5, 6
UNION ALL
SELECT 'test', 3, 4, 7
""").alias("table2")
j = (
    table2_df
    .join(table1_df.alias("join_c"), col("table2.c") == col("join_c.inputid1"))
    .join(table1_df.alias("join_d"), col("table2.d") == col("join_d.inputid1"))
)
j.show()
j.select(
"table2.a",
"table2.b",
when(col("join_c.filtercol") == "10", col("join_c.id")).alias("10"),
when(col("join_d.filtercol") == "11", col("join_c.id")).alias("11")
).groupby("a", "b").max().show()
+-----+---+------+-------+-------+
| a| b|max(b)|max(10)|max(11)|
+-----+---+------+-------+-------+
|hello| 1| 1| 100| null|
|world| 2| 2| 108| null|
| test| 3| 3| 100| 200|
+-----+---+------+-------+-------+
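Since the real data has around 10 such columns, one way to generalise this (a hedged sketch, not a drop-in answer) is to loop over a mapping from old column name to filter value and do one small left join per column against a pre-filtered table 1. The col_map dict below is a hypothetical helper, and the sketch assumes each (filtercol, inputid1) pair matches at most one id, otherwise the joins will duplicate rows. It uses the table1_df and table2_df built above:
from pyspark.sql import functions as F

# hypothetical mapping: table2 column -> filtercol value (also the new column name)
col_map = {"c": 10, "d": 11}

result = table2_df
for old_col, filter_val in col_map.items():
    lookup = (
        table1_df
        .filter(F.col("filtercol") == filter_val)
        .select(F.col("inputid1").alias(old_col), F.col("id").alias(str(filter_val)))
    )
    # replace the lookup value with the matching id (null when there is no match)
    result = result.join(lookup, on=old_col, how="left").drop(old_col)

result.show()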
I have data like below
|Id | DateTime                     | products |
|---|------------------------------|----------|
| 1 | 2017-08-24T00:00:00.000+0000 | 1        |
| 1 | 2017-08-24T00:00:00.000+0000 | 2        |
| 1 | 2017-08-24T00:00:00.000+0000 | 3        |
| 1 | 2016-05-24T00:00:00.000+0000 | 1        |
I am using Window.unboundedPreceding and Window.unboundedFollowing as below to get the second most recent datetime.
sorted_times = Window.partitionBy('Id').orderBy(F.col('ModifiedTime').desc()).rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df3 = data.withColumn("second_recent", F.collect_list(F.col('ModifiedTime')).over(sorted_times).getItem(1))
But I get the results below: the second datetime is taken from the second element of the collected list, which is the same as the first because the most recent date is duplicated.
|Id | DateTime                     | secondtime                   | Products |
|---|------------------------------|------------------------------|----------|
| 1 | 2017-08-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 1        |
| 1 | 2017-08-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 2        |
| 1 | 2017-08-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 3        |
| 1 | 2016-05-24T00:00:00.000+0000 | 2017-08-24T00:00:00.000+0000 | 1        |
Please help me find the second-latest datetime among the distinct datetimes.
Thanks in advance
Use collect_set instead of collect_list so that there are no duplicates:
df3 = data.withColumn(
"second_recent",
F.collect_set(F.col('LastModifiedTime')).over(sorted_times)[1]
)
df3.show(truncate=False)
#+-----+----------------------------+--------+----------------------------+
#|VipId|LastModifiedTime |products|second_recent |
#+-----+----------------------------+--------+----------------------------+
#|1 |2017-08-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|2 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|3 |2016-05-24T00:00:00.000+0000|
#|1 |2016-05-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#+-----+----------------------------+--------+----------------------------+
Another way is to use an unordered window and sort the array before taking second_recent:
from pyspark.sql import functions as F, Window
df3 = data.withColumn(
"second_recent",
F.sort_array(
F.collect_set(F.col('LastModifiedTime')).over(Window.partitionBy('VipId')),
False
)[1]
)
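If you only need one row per id rather than an extra column on every row, the same expression also works as a plain aggregation. A small sketch, using the same column names as the snippets above:
second_recent_per_id = data.groupBy('VipId').agg(
    F.sort_array(F.collect_set('LastModifiedTime'), False)[1].alias('second_recent')
)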
I'm trying to do a join between two PySpark DataFrames, joining on a key; however, the date from the first table should always come after the date from the second table. As an example, we have two tables that we're trying to join:
Table 1:
Date1 value1 key
13 Feb 2020 1 a
01 Mar 2020 2 a
31 Mar 2020 3 a
15 Apr 2020 4 a
Table 2:
Date2 value2 key
10 Feb 2020 11 a
15 Mar 2020 22 a
After the join, the result should be something like this:
Date1 value1 value2 key
13 Feb 2020 1 11 a
01 Mar 2020 2 null a
31 Mar 2020 3 22 a
15 Apr 2020 4 null a
Any ideas?
This is an interesting join. My approach is to join on the key first, keep only rows where Date1 comes after Date2, pick the earliest such Date1 for each Date2, and then join that result back to the first table.
from pyspark.sql import functions as F, Window
# Clean up date format first
df3 = df1.withColumn('Date1', F.to_date('Date1', 'dd MMM yyyy'))
df4 = df2.withColumn('Date2', F.to_date('Date2', 'dd MMM yyyy'))
result = (df3.join(df4, 'key')
.filter('Date1 > Date2')
.withColumn('rn', F.row_number().over(Window.partitionBy('Date2').orderBy('Date1')))
.filter('rn = 1')
.drop('key', 'rn', 'Date2')
.join(df3, ['Date1', 'value1'], 'right')
)
result.show()
+----------+------+------+---+
|Date1 |value1|value2|key|
+----------+------+------+---+
|2020-02-13|1 |11 |a |
|2020-03-01|2 |null |a |
|2020-03-31|3 |22 |a |
|2020-04-15|4 |null |a |
+----------+------+------+---+
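One hedged caveat: if the data contains more than one key, the row_number window should presumably also partition by key so that rows from different keys are not ranked together, e.g.:
rn_window = Window.partitionBy('key', 'Date2').orderBy('Date1')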
You can try the window lag function. This is Scala, but the Python version will be similar.
// change col names for union all and add an extra col to identify the dataset
val df1A = df1.toDF("Date","value","key").withColumn("df",lit(1))
val df2A = df2.toDF("Date","value","key").withColumn("df",lit(2))
import org.apache.spark.sql.expressions.Window
df1A.unionAll(df2A)
.withColumn("value2",lag(array('value,'df),1) over Window.partitionBy('key).orderBy(to_date('Date,"dd MMM yyyy")))
.filter('df===1)
.withColumn("value2",when(element_at('value2,2)===2,element_at('value2,1)))
.drop("df")
.show
output:
+-----------+-----+---+------+
| Date|value|key|value2|
+-----------+-----+---+------+
|13 Feb 2020| 1| a| 11|
|01 Mar 2020| 2| a| null|
|31 Mar 2020| 3| a| 22|
|15 Apr 2020| 4| a| null|
+-----------+-----+---+------+
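For reference, a rough PySpark translation of the same lag idea (a sketch, assuming df1 and df2 hold the two tables shown in the question):
from pyspark.sql import functions as F, Window

# rename columns so the two frames can be unioned, and tag each source
df1a = df1.toDF("Date", "value", "key").withColumn("df", F.lit(1))
df2a = df2.toDF("Date", "value", "key").withColumn("df", F.lit(2))

w = Window.partitionBy("key").orderBy(F.to_date("Date", "dd MMM yyyy"))

result = (
    df1a.union(df2a)
    .withColumn("value2", F.lag(F.array("value", "df"), 1).over(w))
    .filter(F.col("df") == 1)
    # keep the lagged value only if it came from the second table
    .withColumn("value2", F.when(F.element_at("value2", 2) == 2, F.element_at("value2", 1)))
    .drop("df")
)
result.show()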
I'm working on an MS Access 2010 database and I'm struggling more than expected with this.
I have these tables:
tblBook:
IDBook (key)
Title
tblUser:
IDUser (key)
Username
tblOrder:
IDOrder (key)
IDUser (linked to tblUser)
Date
tblOrderBook:
IDOrderBook (key)
IDOrder (linked to tblOrder)
IDBook (linked to tblBook)
A user can pick up to 3 books per order. I made a query that displays them like this, by IDOrderBook:
IDOrderBook |IDOrder | Username | Date | Title
6 |3 | John | Aug 1| Harry Potter
5 |3 | John | Aug 1| Lord of the Rings
4 |2 | Susan | Jul 5| The Shining
3 |2 | Susan | Jul 5| Huck Finn
2 |2 | Susan | Jul 5| Peter Pan
1 |1 | Rita | Jul 4| Harry Potter
Now I want something to show them by IDOrder like this:
IDOrder | Username | Date | Title1 | Title2 | Title3
3 | John | Aug 1| Harry Potter | LoTR |
2 | Susan | Jul 5| The Shining | Huck Finn | Peter Pan
1 | Rita | Jul 4| Harry Potter | |
So with multiple titles in a single row. How do I build this query?
Thank you!
This is quite a difficult task. First, we create the column name. We do that by counting, within each order, how many records have an IDOrderBook less than or equal to the current one. Then, we pivot on that column name, creating the Title1, Title2 and Title3 columns.
I'm going to base all these queries on the query you've shared. We'll call that qry1.
The query creating the column names, qry2:
SELECT IDOrderBook, IDOrder, Username, [Date], Title,
"Title" & (
    SELECT Count(*)
    FROM qry1 s
    WHERE q.IDOrder = s.IDOrder AND q.IDOrderBook >= s.IDOrderBook
) As ColumnName
FROM qry1 q
Then, we use a pivot (crosstab) query to create your desired result:
TRANSFORM First(Title)
SELECT IDOrder, Username, [Date]
FROM qry2
GROUP BY IDOrder, Username, [Date]
PIVOT ColumnName
I leave merging these queries using subqueries as an exercise for the reader; I've split them up for clarity.
Assume I have the following DataFrame named table_df in PySpark:
sid | date | label
------------------
1033| 20170521 | 0
1033| 20170520 | 0
1033| 20170519 | 1
1033| 20170516 | 0
1033| 20170515 | 0
1033| 20170511 | 1
1033| 20170511 | 0
1033| 20170509 | 0
.....................
The DataFrame table_df contains many different IDs across its rows; the above is simply one typical case for a single ID.
For each ID and for each date with label 1, I would like to find the date with label 0 that is the closest and before.
For the above table, with ID 1033, date=20170519, label 1, the date of label 0 that is closest and before is 20170516.
And with ID 1033, date=20170511, label 1, the date of label 0 that is closest and before is 20170509 .
So, finally using groupBy and some complicated operations, I will obtain the following table:
sid  | filtered_date
-----|--------------
1033 | 20170516
1033 | 20170509
Any help is highly appreciated. I tried but could not find a smart way to do this.
Thanks
We can use a window partitioned by sid and ordered by date, and take the difference between the next row's label and the current row's label:
df.show()
+----+--------+-----+
| sid| date|label|
+----+--------+-----+
|1033|20170521| 0|
|1033|20170520| 0|
|1033|20170519| 1|
|1033|20170516| 0|
|1033|20170515| 0|
|1033|20170511| 1|
|1033|20170511| 0|
|1033|20170509| 0|
+----+--------+-----+
from pyspark.sql import Window
from pyspark.sql import functions as F
w = Window.partitionBy('sid').orderBy('date')
df.withColumn('diff',F.lead('label').over(w) - df['label']).where(F.col('diff') == 1).drop('diff').show()
+----+--------+-----+
| sid| date|label|
+----+--------+-----+
|1033|20170509| 0|
|1033|20170516| 0|
+----+--------+-----+
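If you want exactly the sid / filtered_date shape from the question, the same pipeline can end with a select instead of drop. A small sketch reusing w from above:
result = (
    df.withColumn('diff', F.lead('label').over(w) - F.col('label'))
      .where(F.col('diff') == 1)
      .select('sid', F.col('date').alias('filtered_date'))
)
result.show()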