Create empty sparse vectors in PySpark

I have a dataframe DF1 that looks like this:
+-------+------+
|user_id|meta |
+-------+------+
| 1| null|
| 11| null|
| 15| null|
+-------+------+
Schema:
root
|-- user_id: string (nullable = true)
|-- meta: string (nullable = true)
and I have another dataframe DF2 that looks like this
+-------+------------------------------------+
|user_id| Vectorz |
+-------+------------------------------------+
| 10| (2,[1],[1.0])|
| 12| (2,[1],[1.0])|
| 13| (2,[0],[1.0])|
| 14| (2,[1],[1.0])|
+-------+------------------------------------+
Schema is:
[user_id: string, Vectorz: vector]
I want to inject all the user_ids from DF1 into DF2, but create empty sparse vectors for them since their "meta" column is all NULLs.
So, I want DF2 to finally be:
+-------+------------------------------------+
|user_id| Vectorz |
+-------+------------------------------------+
| 1| (,[],[])|
| 10| (2,[1],[1.0])|
| 11| (,[],[])|
| 12| (2,[1],[1.0])|
| 13| (2,[0],[1.0])|
| 14| (2,[1],[1.0])|
| 15| (,[],[])|
+-------+------------------------------------+
Can somebody please help?
I am new to PySpark. So, sorry if I don't sound informed enough.

You can create empty vectors for all the user_ids whose meta is null. You still need to decide what to do when the meta column is not null.
Sample code (Scala):
DF1
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DataTypes

val spark = sqlContext.sparkSession
val implicits = spark.implicits
import implicits._
val df1 = sqlContext.range(1, 4)
  .withColumnRenamed("id", "user_id")
  .withColumn("meta", lit(null).cast(DataTypes.StringType))
df1.show(false)
df1.printSchema()
+-------+----+
|user_id|meta|
+-------+----+
|1 |null|
|2 |null|
|3 |null|
+-------+----+
root
|-- user_id: long (nullable = false)
|-- meta: string (nullable = true)
DF2
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.SQLDataTypes
val staticVector = udf(() => Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))), SQLDataTypes.VectorType)
val df2 = sqlContext.range(5,8)
.withColumnRenamed("id", "user_id")
.withColumn("Vectorz", staticVector())
df2.show(false)
df2.printSchema()
+-------+-------------------+
|user_id|Vectorz |
+-------+-------------------+
|5 |(5,[1,3],[1.0,7.0])|
|6 |(5,[1,3],[1.0,7.0])|
|7 |(5,[1,3],[1.0,7.0])|
+-------+-------------------+
root
|-- user_id: long (nullable = false)
|-- Vectorz: vector (nullable = true)
Processed DF
val emptyVector = udf(() => Vectors.sparse(0, Array.empty[Int], Array.empty[Double]), SQLDataTypes.VectorType)
// meta shouldn't have any value here; the filter on meta is null is just for safety.
// You need to decide what to do when meta is not null -- in this example an empty
// vector is assigned in that case as well.
val processedDF = df1.where(col("meta").isNull)
  .withColumn("Vectorz", when(col("meta").isNull, emptyVector()).otherwise(emptyVector()))
  .drop("meta")
  .unionByName(df2)
processedDF.show(false)
processedDF.printSchema()
+-------+-------------------+
|user_id|Vectorz |
+-------+-------------------+
|1 |(0,[],[]) |
|2 |(0,[],[]) |
|3 |(0,[],[]) |
|5 |(5,[1,3],[1.0,7.0])|
|6 |(5,[1,3],[1.0,7.0])|
|7 |(5,[1,3],[1.0,7.0])|
+-------+-------------------+
root
|-- user_id: long (nullable = false)
|-- Vectorz: vector (nullable = true)
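Since the question is tagged PySpark, here is a minimal sketch of the same approach in Python (it assumes the df1/df2 layout from the question; the vector size of 2 is an assumption to match DF2's vectors, use 0 if you want truly empty vectors):
from pyspark.ml.linalg import SparseVector, VectorUDT
import pyspark.sql.functions as F

# Zero-argument UDF returning an empty sparse vector.
# Size 2 is assumed to match DF2's vectors; SparseVector(0, {}) also works.
empty_vector = F.udf(lambda: SparseVector(2, {}), VectorUDT())

filled = (df1.where(F.col("meta").isNull())
             .withColumn("Vectorz", empty_vector())
             .drop("meta")
             .unionByName(df2))
filled.show(truncate=False)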


List of Winners of Each World Champions Trophy

The total result of all rounds of a tournament for a player is considered that player's score/result.
Schema:
|-- game_id: string (nullable = true)
|-- game_order: integer (nullable = true)
|-- event: string (nullable = true)
|-- site: string (nullable = true)
|-- date_played: string (nullable = true)
|-- round: double (nullable = true)
|-- white: string (nullable = true)
|-- black: string (nullable = true)
|-- result: string (nullable = true)
|-- white_elo: integer (nullable = true)
|-- black_elo: integer (nullable = true)
|-- white_title: string (nullable = true)
|-- black_title: string (nullable = true)
|-- winner: string (nullable = true)
|-- winner_elo: integer (nullable = true)
|-- loser: string (nullable = true)
|-- loser_elo: integer (nullable = true)
|-- winner_loser_elo_diff: integer (nullable = true)
|-- eco: string (nullable = true)
|-- date_created: string (nullable = true)
|-- tournament_name: string (nullable = true)
Sample DataFrame:
+--------------------+----------+--------+----------+-----------+-----+----------------+----------------+-------+---------+---------+-----------+-----------+---------+----------+----------------+---------+---------------------+---+--------------------+---------------+
| game_id|game_order| event| site|date_played|round| white| black| result|white_elo|black_elo|white_title|black_title| winner|winner_elo| loser|loser_elo|winner_loser_elo_diff|eco| date_created|tournament_name|
+--------------------+----------+--------+----------+-----------+-----+----------------+----------------+-------+---------+---------+-----------+-----------+---------+----------+----------------+---------+---------------------+---+--------------------+---------------+
|86e0b7f5-7b94-4ae...| 1|WCh 2021| Dubai UAE| 2021.11.26| 1.0|Nepomniachtchi,I| Carlsen,M|1/2-1/2| 2782| 2855| null| null| draw| null| draw| null| 0|C88|2022-07-22T22:33:...| WorldChamp2021|
|dc4a10ab-54cf-49d...| 2|WCh 2021| Dubai UAE| 2021.11.27| 2.0| Carlsen,M|Nepomniachtchi,I|1/2-1/2| 2855| 2782| null| null| draw| null| draw| null| 0|E06|2022-07-22T22:33:...| WorldChamp2021|
|f042ca37-8899-488...| 3|WCh 2021| Dubai UAE| 2021.11.28| 3.0|Nepomniachtchi,I| Carlsen,M|1/2-1/2| 2782| 2855| null| null| draw| null| draw| null| 0|C88|2022-07-22T22:33:...| WorldChamp2021|
|f70e4bbc-21e3-46f...| 4|WCh 2021| Dubai UAE| 2021.11.30| 4.0| Carlsen,M|Nepomniachtchi,I|1/2-1/2| 2855| 2782| null| null| draw| null| draw| null| 0|C42|2022-07-22T22:33:...| WorldChamp2021|
|c941c323-308a-4c8...| 5|WCh 2021| Dubai UAE| 2021.12.01| 5.0|Nepomniachtchi,I| Carlsen,M|1/2-1/2| 2782| 2855| null| null| draw| null| draw| null| 0|C88|2022-07-22T22:33:...| WorldChamp2021|
|58e83255-93bb-4d5...| 6|WCh 2021| Dubai UAE| 2021.12.03| 6.0| Carlsen,M|Nepomniachtchi,I| 1-0| 2855| 2782| null| null|Carlsen,M| 2855|Nepomniachtchi,I| 2782| 73|D02|2022-07-22T22:33:...| WorldChamp2021|
|29181d93-73f4-4fb...| 7|WCh 2021| Dubai UAE| 2021.12.04| 7.0|Nepomniachtchi,I| Carlsen,M|1/2-1/2| 2782| 2855| null| null| draw| null| draw| null| 0|C88|2022-07-22T22:33:...| WorldChamp2021|
|8a4ccd8c-d437-429...| 8|WCh 2021| Dubai UAE| 2021.12.05| 8.0| Carlsen,M|Nepomniachtchi,I| 1-0| 2855| 2782| null| null|Carlsen,M| 2855|Nepomniachtchi,I| 2782| 73|C43|2022-07-22T22:33:...| WorldChamp2021|
|55a122db-27d1-495...| 9|WCh 2021| Dubai UAE| 2021.12.07| 9.0|Nepomniachtchi,I| Carlsen,M| 0-1| 2782| 2855| null| null|Carlsen,M| 2855|Nepomniachtchi,I| 2782| 73|A13|2022-07-22T22:33:...| WorldChamp2021|
|1f900d18-5ea3-4f4...| 10|WCh 2021| Dubai UAE| 2021.12.08| 10.0| Carlsen,M|Nepomniachtchi,I|1/2-1/2| 2855| 2782| null| null| draw| null| draw| null| 0|C42|2022-07-22T22:33:...| WorldChamp2021|
My code looks like this. I think it's messed up. Am I supposed to do a sum somewhere?
winners = df_history_info.filter(df_history_info['winner'] != "draw").groupBy("tournament_name").agg({"winner":"max"}).show()
I'm getting this result but it is incorrect in many cases.
+---------------+--------------------+
|tournament_name| max(winner)|
+---------------+--------------------+
| WorldChamp2004| Leko,P|
| WorldChamp1894| Steinitz, William|
| WorldChamp2013| Carlsen, Magnus|
| FideChamp2000| Yermolinsky,A|
| WorldChamp2007| Svidler,P|
| FideChamp1993| Timman, Jan H|
|WorldChamp1910b| Lasker, Emanuel|
| WorldChamp1921|Capablanca, Jose ...|
| WorldChamp1958| Smyslov, Vassily|
| WorldChamp1981| Kortschnoj, Viktor|
| WorldChamp1961| Tal, Mihail|
| WorldChamp1978| Kortschnoj, Viktor|
| WorldChamp1960| Tal, Mihail|
| WorldChamp1948| Smyslov, Vassily|
| WorldChamp1929| Bogoljubow, Efim|
| WorldChamp1934| Bogoljubow, Efim|
| WorldChamp1986| Kasparov, Gary|
| PCAChamp1995| Kasparov, Gary|
| WorldChamp1886|Zukertort, Johann...|
| WorldChamp1907| Lasker, Emanuel|
+---------------+--------------------+
Since the winner column contains either the winning player's name or the word "draw" (which you've filtered out), the operation .agg({"winner":"max"}) returns the lexicographic maximum of a string. This is why Zukertort, Johann... appears as the winner of WorldChamp1886 instead of Steinitz..., and Yermolinsky,A appears as the winner of the 128-player field in FideChamp2000.
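You can verify this with a quick sketch (hypothetical two-row frame):
# max on a string column is the lexicographic maximum, not the most frequent value
spark.createDataFrame(
    [("Steinitz, William",), ("Zukertort, Johannes",)], ["winner"]
).agg({"winner": "max"}).show()  # prints Zukertort, Johannes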
Here is an example of something you could try with a spark dataframe that looks like the following:
df = spark.createDataFrame(
[
("WC1", "A"),
("WC1", "B"),
("WC1", "A"),
("WC1", "A"),
("WC1", "A"),
("WC1", "B"),
("WC1", "A"),
("WC1", "B"),
("WC2", "F"),
("WC2", "F"),
("WC2", "F"),
("WC2", "D"),
("WC2", "D"),
("WC2", "E"),
("WC2", "F"),
("WC2", "F"),
],
["tournament_name", "winner"] # add your column names here
)
Suppose you want to determine who wins each tournament by the number of times their name appears in the winner column:
+---------------+------+
|tournament_name|winner|
+---------------+------+
| WC1| A|
| WC1| B|
| WC1| A|
| WC1| A|
| WC1| A|
| WC1| B|
| WC1| A|
| WC1| B|
| WC2| F|
| WC2| F|
| WC2| F|
| WC2| D|
| WC2| D|
| WC2| E|
| WC2| F|
| WC2| F|
+---------------+------+
You can do a groupby count on tournament_name and winner:
d = df.groupby(["tournament_name","winner"]).count()
And that gives you this pyspark dataframe:
+---------------+------+-----+
|tournament_name|winner|count|
+---------------+------+-----+
| WC1| B| 3|
| WC1| A| 5|
| WC2| F| 5|
| WC2| D| 2|
| WC2| E| 1|
+---------------+------+-----+
Then, following this example, you could create a WindowSpec object that partitions by tournament_name, sorts in descending order of the count column, and apply it to d:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
windowDept = Window.partitionBy("tournament_name").orderBy(col("count").desc())
d.withColumn("row",row_number().over(windowDept)) \
.filter(col("row") == 1).drop("row") \
.show()
Final result:
+---------------+------+-----+
|tournament_name|winner|count|
+---------------+------+-----+
| WC1| A| 5|
| WC2| F| 5|
+---------------+------+-----+
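Applied to your df_history_info (a sketch reusing the column names from your schema; ties are broken arbitrarily unless you add more ordering keys):
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# Count non-draw wins per (tournament, player), then keep the top row per tournament.
wins = (df_history_info
        .filter(col("winner") != "draw")
        .groupBy("tournament_name", "winner")
        .count())

w = Window.partitionBy("tournament_name").orderBy(col("count").desc())
champions = (wins.withColumn("row", row_number().over(w))
                 .filter(col("row") == 1)
                 .drop("row"))
champions.show()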

pyspark dataframe replace null in one column with another column by converting it from string to array

I would like to replace a null value of a pyspark dataframe column with another string column converted to array.
import pyspark.sql.functions as F
import pyspark.sql.types as T
new_customers = spark.createDataFrame(data=[["Karen", ["a"]], ["Penny", ["b"]], ["John", [None]], ["Cosimo", ["d"]]], schema=["name", "val"])
new_customers.printSchema()
new_customers.show(5, False)
new_customers = new_customers.withColumn("new_val", F.coalesce(F.col("val"), F.array(F.col("name"))))
new_customers.show(10, truncate=False)
But the output is:
root
|-- name: string (nullable = true)
|-- val: array (nullable = true)
| |-- element: string (containsNull = true)
+------+---+
|name |val|
+------+---+
|Karen |[a]|
|Penny |[b]|
|John |[] |
|Cosimo|[d]|
+------+---+
+------+---+-------+
|name |val|new_val|
+------+---+-------+
|Karen |[a]|[a] |
|Penny |[b]|[b] |
|John |[] |[] |
|Cosimo|[d]|[d] |
+------+---+-------+
what I expect:
+------+---+-------+
|name |val|new_val|
+------+---+-------+
|Karen |[a]|[a] |
|Penny |[b]|[b] |
|John |[] |[John] |
|Cosimo|[d]|[d] |
+------+---+-------+
Did I miss something? Thanks.
The problem is that you have an array with a null element in it, and such an array will not test positive for an isNull check.
First clean up single-null-element arrays:
import pyspark.sql.functions as F
import pyspark.sql.types as T
new_customers = spark.createDataFrame(data=[["Karen", ["a"]], ["Penny", ["b"]], ["John", [None]], ["Cosimo", ["d"]]], schema=["name", "val"])
+------+------+
|name |val |
+------+------+
|Karen |[a] |
|Penny |[b] |
|John |[null]|
|Cosimo|[d] |
+------+------+
new_customers = new_customers.withColumn("val", F.filter(F.col("val"), lambda x: x.isNotNull()))
+------+---+
|name |val|
+------+---+
|Karen |[a]|
|Penny |[b]|
|John |[] |
|Cosimo|[d]|
+------+---+
Then change your expression to check for an empty array instead of a null:
new_customers = new_customers.withColumn("new_val", F.when(F.size("val")>0, F.col("val")).otherwise(F.array(F.col("name"))))
+------+---+-------+
|name |val|new_val|
+------+---+-------+
|Karen |[a]|[a] |
|Penny |[b]|[b] |
|John |[] |[John] |
|Cosimo|[d]|[d] |
+------+---+-------+
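If you prefer, the two steps can be chained into one statement (a sketch on the same new_customers frame; on Spark 3.4+ F.array_compact("val") could replace the F.filter call):
import pyspark.sql.functions as F

new_customers = (
    new_customers
    # drop null elements from the array
    .withColumn("val", F.filter(F.col("val"), lambda x: x.isNotNull()))
    # fall back to [name] when the cleaned array is empty
    .withColumn("new_val",
                F.when(F.size("val") > 0, F.col("val"))
                 .otherwise(F.array("name")))
)
new_customers.show(truncate=False)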

how to flatten multiple structs and get the keys as one of the fields

I have this struct schema
|-- teams: struct (nullable = true)
| |-- blue: struct (nullable = true)
| | |-- has_won: boolean (nullable = true)
| | |-- rounds_lost: long (nullable = true)
| | |-- rounds_won: long (nullable = true)
| |-- red: struct (nullable = true)
| | |-- has_won: boolean (nullable = true)
| | |-- rounds_lost: long (nullable = true)
| | |-- rounds_won: long (nullable = true)
which I want to turn to this schema
+----+-------+-----------+----------+
|team|has_won|rounds_lost|rounds_won|
+----+-------+-----------+----------+
|blue| 1| 13| 10|
| red| 0| 10| 13|
+----+-------+-----------+----------+
I already tried selectExpr(inline(array('teams.*'))), but I don't know how to get the team name into one of the fields. Thank you!
You can start by un-nesting the struct using * and then use stack to "un-pivot" the dataframe. Finally, un-nest the stats.
from pyspark.sql import Row
rows = [Row(teams=Row(blue=Row(has_won=1, rounds_lost=13, rounds_won=10),
red=Row(has_won=0, rounds_lost=10, rounds_won=13)))]
df = spark.createDataFrame(rows)
(df.select("teams.*")
.selectExpr("stack(2, 'blue', blue, 'red', red) as (team, stats)")
.selectExpr("team", "stats.*")
).show()
"""
+----+-------+-----------+----------+
|team|has_won|rounds_lost|rounds_won|
+----+-------+-----------+----------+
|blue| 1| 13| 10|
| red| 0| 10| 13|
+----+-------+-----------+----------+
"""

Spark: getting the first entry according to a date groupBy

Is it possible to get the first Datetime of each day from a certain dataframe?
Schema:
root
|-- Datetime: timestamp (nullable = true)
|-- Quantity: integer (nullable = true)
+-------------------+--------+
| Datetime|Quantity|
+-------------------+--------+
|2021-09-10 10:08:11| 200|
|2021-09-10 10:08:16| 100|
|2021-09-11 10:05:11| 100|
|2021-09-11 10:07:25| 100|
|2021-09-11 10:07:14| 3000|
|2021-09-12 09:24:11| 1000|
+-------------------+--------+
Desired output:
+-------------------+--------+
| Datetime|Quantity|
+-------------------+--------+
|2021-09-10 10:08:11| 200|
|2021-09-11 10:05:11| 100|
|2021-09-12 09:24:11| 1000|
+-------------------+--------+
You can use row_number for that. Simply define a Window partitioned by day and ordered by Datetime:
from pyspark.sql import functions as F, Window
w = Window.partitionBy(F.to_date("Datetime")).orderBy("Datetime")
df1 = df.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")
df1.show()
#+-------------------+--------+
#| Datetime|Quantity|
#+-------------------+--------+
#|2021-09-10 10:08:11| 200|
#|2021-09-11 10:05:11| 100|
#|2021-09-12 09:24:11| 1000|
#+-------------------+--------+
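An alternative sketch without a window: take the minimum struct per day (structs compare field by field, so Datetime decides) and unpack it:
from pyspark.sql import functions as F

firsts = (df.groupBy(F.to_date("Datetime").alias("day"))
            .agg(F.min(F.struct("Datetime", "Quantity")).alias("first"))
            .select("first.Datetime", "first.Quantity"))
firsts.show()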

How do I group records that are within a specific time interval using Spark Scala or sql?

I would like to group records in Scala only if they have the same ID and their times are within 1 minute of each other.
Conceptually I am thinking of something like this, but I am not really sure:
HAVING a.ID = b.ID AND a.time + 30 sec > b.time AND a.time - 30 sec < b.time
| ID | volume | Time |
|:-----------|------------:|:--------------------------:|
| 1 | 10 | 2019-02-17T12:00:34Z |
| 2 | 20 | 2019-02-17T11:10:46Z |
| 3 | 30 | 2019-02-17T13:23:34Z |
| 1 | 40 | 2019-02-17T12:01:02Z |
| 2 | 50 | 2019-02-17T11:10:30Z |
| 1 | 60 | 2019-02-17T12:01:57Z |
to this:
| ID | volume |
|:-----------|------------:|
| 1 | 50 | // (10+40)
| 2 | 70 | // (20+50)
| 3 | 30 |
df.groupBy($"ID", window($"Time", "1 minutes")).sum("volume")
The code above is one solution, but it always rounds to fixed window boundaries.
For example, 2019-02-17T12:00:45Z will fall in the range
2019-02-17T12:00:00Z TO 2019-02-17T12:01:00Z.
I am looking for this instead:
2019-02-17T12:00:45Z TO 2019-02-17T12:01:45Z.
Is there a way?
org.apache.spark.sql.functions provides overloaded window functions as below.
1. window(timeColumn: Column, windowDuration: String) : Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05).
The windows will look like:
09:00:00-09:01:00
09:01:00-09:02:00
09:02:00-09:03:00 ...
2. window(timeColumn: Column, windowDuration: String, slideDuration: String):
Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05).
slideDuration is the parameter specifying the sliding interval of the window, e.g. 1 minute. A new window will be generated every slideDuration; it must be less than or equal to the windowDuration.
The windows will look like:
09:00:00-09:01:00
09:00:10-09:01:10
09:00:20-09:01:20 ...
3. window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05).
The windows will look like:
09:00:05-09:01:05
09:00:15-09:01:15
09:00:25-09:01:25 ...
For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15..., provide startTime as 15 minutes. This is the overloaded window function that suits your requirement.
Please find working code as below.
import org.apache.spark.sql.SparkSession
object SparkWindowTest extends App {
val spark = SparkSession
.builder()
.master("local")
.appName("File_Streaming")
.getOrCreate()
import spark.implicits._
import org.apache.spark.sql.functions._
//Prepare Test Data
val df = Seq((1, 10, "2019-02-17 12:00:49"), (2, 20, "2019-02-17 11:10:46"),
(3, 30, "2019-02-17 13:23:34"),(2, 50, "2019-02-17 11:10:30"),
(1, 40, "2019-02-17 12:01:02"), (1, 60, "2019-02-17 12:01:57"))
.toDF("ID", "Volume", "TimeString")
df.show()
df.printSchema()
+---+------+-------------------+
| ID|Volume| TimeString|
+---+------+-------------------+
| 1| 10|2019-02-17 12:00:49|
| 2| 20|2019-02-17 11:10:46|
| 3| 30|2019-02-17 13:23:34|
| 2| 50|2019-02-17 11:10:30|
| 1| 40|2019-02-17 12:01:02|
| 1| 60|2019-02-17 12:01:57|
+---+------+-------------------+
root
|-- ID: integer (nullable = false)
|-- Volume: integer (nullable = false)
|-- TimeString: string (nullable = true)
//Converted String Timestamp into Timestamp
val modifiedDF = df.withColumn("Time", to_timestamp($"TimeString"))
//Dropped String Timestamp from DF
val modifiedDF1 = modifiedDF.drop("TimeString")
modifiedDF.show(false)
modifiedDF.printSchema()
+---+------+-------------------+-------------------+
|ID |Volume|TimeString |Time |
+---+------+-------------------+-------------------+
|1 |10 |2019-02-17 12:00:49|2019-02-17 12:00:49|
|2 |20 |2019-02-17 11:10:46|2019-02-17 11:10:46|
|3 |30 |2019-02-17 13:23:34|2019-02-17 13:23:34|
|2 |50 |2019-02-17 11:10:30|2019-02-17 11:10:30|
|1 |40 |2019-02-17 12:01:02|2019-02-17 12:01:02|
|1 |60 |2019-02-17 12:01:57|2019-02-17 12:01:57|
+---+------+-------------------+-------------------+
root
|-- ID: integer (nullable = false)
|-- Volume: integer (nullable = false)
|-- TimeString: string (nullable = true)
|-- Time: timestamp (nullable = true)
modifiedDF1.show(false)
modifiedDF1.printSchema()
+---+------+-------------------+
|ID |Volume|Time |
+---+------+-------------------+
|1 |10 |2019-02-17 12:00:49|
|2 |20 |2019-02-17 11:10:46|
|3 |30 |2019-02-17 13:23:34|
|2 |50 |2019-02-17 11:10:30|
|1 |40 |2019-02-17 12:01:02|
|1 |60 |2019-02-17 12:01:57|
+---+------+-------------------+
root
|-- ID: integer (nullable = false)
|-- Volume: integer (nullable = false)
|-- Time: timestamp (nullable = true)
//Main logic
val modifiedDF2 = modifiedDF1.groupBy($"ID", window($"Time", "1 minutes","1 minutes","45 seconds")).sum("Volume")
//Renamed all columns of DF.
val newNames = Seq("ID", "WINDOW", "VOLUME")
val finalDF = modifiedDF2.toDF(newNames: _*)
finalDF.show(false)
+---+---------------------------------------------+------+
|ID |WINDOW |VOLUME|
+---+---------------------------------------------+------+
|2 |[2019-02-17 11:09:45.0,2019-02-17 11:10:45.0]|50 |
|1 |[2019-02-17 12:01:45.0,2019-02-17 12:02:45.0]|60 |
|1 |[2019-02-17 12:00:45.0,2019-02-17 12:01:45.0]|50 |
|3 |[2019-02-17 13:22:45.0,2019-02-17 13:23:45.0]|30 |
|2 |[2019-02-17 11:10:45.0,2019-02-17 11:11:45.0]|20 |
+---+---------------------------------------------+------+
}
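For reference, the equivalent call with the PySpark API would look like this (a sketch, assuming a dataframe df with columns ID, Volume and a timestamp column Time):
from pyspark.sql import functions as F

# 1-minute windows sliding every minute, shifted by a 45-second startTime offset.
result = (df.groupBy("ID",
                     F.window("Time", "1 minute", "1 minute", "45 seconds"))
            .sum("Volume"))
result.show(truncate=False)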