Spark: getting the first entry according to a date groupBy - dataframe

Is it possible to get the first Datetime of each day from a certain dataframe?
Schema:
root
|-- Datetime: timestamp (nullable = true)
|-- Quantity: integer (nullable = true)
+-------------------+--------+
| Datetime|Quantity|
+-------------------+--------+
|2021-09-10 10:08:11| 200|
|2021-09-10 10:08:16| 100|
|2021-09-11 10:05:11| 100|
|2021-09-11 10:07:25| 100|
|2021-09-11 10:07:14| 3000|
|2021-09-12 09:24:11| 1000|
+-------------------+--------+
Desired output:
+-------------------+--------+
| Datetime|Quantity|
+-------------------+--------+
|2021-09-10 10:08:11| 200|
|2021-09-11 10:05:11| 100|
|2021-09-12 09:24:11| 1000|
+-------------------+--------+

You can use row_number for that. Simply define a Window partitioned by day and ordered by Datetime:
from pyspark.sql import functions as F, Window
w = Window.partitionBy(F.to_date("Datetime")).orderBy("Datetime")
df1 = df.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")
df1.show()
#+-------------------+--------+
#| Datetime|Quantity|
#+-------------------+--------+
#|2021-09-10 10:08:11| 200|
#|2021-09-11 10:05:11| 100|
#|2021-09-12 09:24:11| 1000|
#+-------------------+--------+
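If you prefer an aggregation over a window, a minimal alternative sketch (assuming the same df): take the minimum of a struct per day; since struct comparison uses the first field, the earliest Datetime of each day is kept together with its Quantity.
from pyspark.sql import functions as F
# min of a struct compares by the first field (Datetime),
# so the earliest row of each day survives with its Quantity
df1 = (df.groupBy(F.to_date("Datetime").alias("day"))
         .agg(F.min(F.struct("Datetime", "Quantity")).alias("first"))
         .select("first.*"))
df1.show()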

Related

List of Winners of Each World Champions Trophy

The total result across all rounds of a tournament for a player is considered that player's score/result.
Schema:
|-- game_id: string (nullable = true)
|-- game_order: integer (nullable = true)
|-- event: string (nullable = true)
|-- site: string (nullable = true)
|-- date_played: string (nullable = true)
|-- round: double (nullable = true)
|-- white: string (nullable = true)
|-- black: string (nullable = true)
|-- result: string (nullable = true)
|-- white_elo: integer (nullable = true)
|-- black_elo: integer (nullable = true)
|-- white_title: string (nullable = true)
|-- black_title: string (nullable = true)
|-- winner: string (nullable = true)
|-- winner_elo: integer (nullable = true)
|-- loser: string (nullable = true)
|-- loser_elo: integer (nullable = true)
|-- winner_loser_elo_diff: integer (nullable = true)
|-- eco: string (nullable = true)
|-- date_created: string (nullable = true)
|-- tournament_name: string (nullable = true)
Sample DataFrame:
+--------------------+----------+--------+----------+-----------+-----+----------------+----------------+-------+---------+---------+-----------+-----------+---------+----------+----------------+---------+---------------------+---+--------------------+---------------+
| game_id|game_order| event| site|date_played|round| white| black| result|white_elo|black_elo|white_title|black_title| winner|winner_elo| loser|loser_elo|winner_loser_elo_diff|eco| date_created|tournament_name|
+--------------------+----------+--------+----------+-----------+-----+----------------+----------------+-------+---------+---------+-----------+-----------+---------+----------+----------------+---------+---------------------+---+--------------------+---------------+
|86e0b7f5-7b94-4ae...| 1|WCh 2021| Dubai UAE| 2021.11.26| 1.0|Nepomniachtchi,I| Carlsen,M|1/2-1/2| 2782| 2855| null| null| draw| null| draw| null| 0|C88|2022-07-22T22:33:...| WorldChamp2021|
|dc4a10ab-54cf-49d...| 2|WCh 2021| Dubai UAE| 2021.11.27| 2.0| Carlsen,M|Nepomniachtchi,I|1/2-1/2| 2855| 2782| null| null| draw| null| draw| null| 0|E06|2022-07-22T22:33:...| WorldChamp2021|
|f042ca37-8899-488...| 3|WCh 2021| Dubai UAE| 2021.11.28| 3.0|Nepomniachtchi,I| Carlsen,M|1/2-1/2| 2782| 2855| null| null| draw| null| draw| null| 0|C88|2022-07-22T22:33:...| WorldChamp2021|
|f70e4bbc-21e3-46f...| 4|WCh 2021| Dubai UAE| 2021.11.30| 4.0| Carlsen,M|Nepomniachtchi,I|1/2-1/2| 2855| 2782| null| null| draw| null| draw| null| 0|C42|2022-07-22T22:33:...| WorldChamp2021|
|c941c323-308a-4c8...| 5|WCh 2021| Dubai UAE| 2021.12.01| 5.0|Nepomniachtchi,I| Carlsen,M|1/2-1/2| 2782| 2855| null| null| draw| null| draw| null| 0|C88|2022-07-22T22:33:...| WorldChamp2021|
|58e83255-93bb-4d5...| 6|WCh 2021| Dubai UAE| 2021.12.03| 6.0| Carlsen,M|Nepomniachtchi,I| 1-0| 2855| 2782| null| null|Carlsen,M| 2855|Nepomniachtchi,I| 2782| 73|D02|2022-07-22T22:33:...| WorldChamp2021|
|29181d93-73f4-4fb...| 7|WCh 2021| Dubai UAE| 2021.12.04| 7.0|Nepomniachtchi,I| Carlsen,M|1/2-1/2| 2782| 2855| null| null| draw| null| draw| null| 0|C88|2022-07-22T22:33:...| WorldChamp2021|
|8a4ccd8c-d437-429...| 8|WCh 2021| Dubai UAE| 2021.12.05| 8.0| Carlsen,M|Nepomniachtchi,I| 1-0| 2855| 2782| null| null|Carlsen,M| 2855|Nepomniachtchi,I| 2782| 73|C43|2022-07-22T22:33:...| WorldChamp2021|
|55a122db-27d1-495...| 9|WCh 2021| Dubai UAE| 2021.12.07| 9.0|Nepomniachtchi,I| Carlsen,M| 0-1| 2782| 2855| null| null|Carlsen,M| 2855|Nepomniachtchi,I| 2782| 73|A13|2022-07-22T22:33:...| WorldChamp2021|
|1f900d18-5ea3-4f4...| 10|WCh 2021| Dubai UAE| 2021.12.08| 10.0| Carlsen,M|Nepomniachtchi,I|1/2-1/2| 2855| 2782| null| null| draw| null| draw| null| 0|C42|2022-07-22T22:33:...| WorldChamp2021|
My code looks like this. I think it's messed up. Am I supposed to do sum somewhere?
winners = df_history_info.filter(df_history_info['winner'] != "draw").groupBy("tournament_name").agg({"winner":"max"}).show()
I'm getting this result but it is incorrect in many cases.
+---------------+--------------------+
|tournament_name| max(winner)|
+---------------+--------------------+
| WorldChamp2004| Leko,P|
| WorldChamp1894| Steinitz, William|
| WorldChamp2013| Carlsen, Magnus|
| FideChamp2000| Yermolinsky,A|
| WorldChamp2007| Svidler,P|
| FideChamp1993| Timman, Jan H|
|WorldChamp1910b| Lasker, Emanuel|
| WorldChamp1921|Capablanca, Jose ...|
| WorldChamp1958| Smyslov, Vassily|
| WorldChamp1981| Kortschnoj, Viktor|
| WorldChamp1961| Tal, Mihail|
| WorldChamp1978| Kortschnoj, Viktor|
| WorldChamp1960| Tal, Mihail|
| WorldChamp1948| Smyslov, Vassily|
| WorldChamp1929| Bogoljubow, Efim|
| WorldChamp1934| Bogoljubow, Efim|
| WorldChamp1986| Kasparov, Gary|
| PCAChamp1995| Kasparov, Gary|
| WorldChamp1886|Zukertort, Johann...|
| WorldChamp1907| Lasker, Emanuel|
+---------------+--------------------+
Since the winner column contains either the winning player's name or the word "draw" (which you've filtered out), the operation .agg({"winner":"max"}) returns the lexicographic maximum of a string column. This is why Zukertort, Johann... appears as the winner of WorldChamp1886 instead of Steinitz..., and Yermolinsky,A appears as the winner of the 128-player field in FideChamp2000.
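To see this concretely, here is a minimal hypothetical two-row frame where the lexicographic max picks Zukertort over Steinitz:
spark.createDataFrame(
    [("Steinitz, William",), ("Zukertort, Johann",)], ["winner"]
).agg({"winner": "max"}).show()
#+-----------------+
#|      max(winner)|
#+-----------------+
#|Zukertort, Johann|
#+-----------------+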
Here is an example of something you could try, using a Spark DataFrame that looks like the following:
df = spark.createDataFrame(
    [
        ("WC1", "A"),
        ("WC1", "B"),
        ("WC1", "A"),
        ("WC1", "A"),
        ("WC1", "A"),
        ("WC1", "B"),
        ("WC1", "A"),
        ("WC1", "B"),
        ("WC2", "F"),
        ("WC2", "F"),
        ("WC2", "F"),
        ("WC2", "D"),
        ("WC2", "D"),
        ("WC2", "E"),
        ("WC2", "F"),
        ("WC2", "F"),
    ],
    ["tournament_name", "winner"]  # add your column names here
)
Suppose you want to determine who wins each tournament based on how many times their name appears in the winner column:
+---------------+------+
|tournament_name|winner|
+---------------+------+
| WC1| A|
| WC1| B|
| WC1| A|
| WC1| A|
| WC1| A|
| WC1| B|
| WC1| A|
| WC1| B|
| WC2| F|
| WC2| F|
| WC2| F|
| WC2| D|
| WC2| D|
| WC2| E|
| WC2| F|
| WC2| F|
+---------------+------+
You can do a groupby count on tournament_name and winner:
d = df.groupby(["tournament_name","winner"]).count()
And that gives you this pyspark dataframe:
+---------------+------+-----+
|tournament_name|winner|count|
+---------------+------+-----+
| WC1| B| 3|
| WC1| A| 5|
| WC2| F| 5|
| WC2| D| 2|
| WC2| E| 1|
+---------------+------+-----+
Then, following this example, you can create a WindowSpec that partitions by tournament_name and sorts in descending order of the count column, and apply it to d:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
windowDept = Window.partitionBy("tournament_name").orderBy(col("count").desc())
d.withColumn("row", row_number().over(windowDept)) \
 .filter(col("row") == 1).drop("row") \
 .show()
Final result:
+---------------+------+-----+
|tournament_name|winner|count|
+---------------+------+-----+
| WC1| A| 5|
| WC2| F| 5|
+---------------+------+-----+
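Applied back to the question's frame, a hedged sketch (assuming df_history_info from the post, and treating a tournament's winner as the player whose name appears most often in the winner column):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window.partitionBy("tournament_name").orderBy(F.col("count").desc())
winners = (df_history_info
           .filter(F.col("winner") != "draw")
           .groupBy("tournament_name", "winner").count()
           .withColumn("row", F.row_number().over(w))
           .filter(F.col("row") == 1).drop("row"))
winners.show()
Note that this only counts decisive games; a scoring that also credits draws with half a point would need the result column as well.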

how to flatten multiple structs and get the keys as one of the fields

I have this struct schema
|-- teams: struct (nullable = true)
| |-- blue: struct (nullable = true)
| | |-- has_won: boolean (nullable = true)
| | |-- rounds_lost: long (nullable = true)
| | |-- rounds_won: long (nullable = true)
| |-- red: struct (nullable = true)
| | |-- has_won: boolean (nullable = true)
| | |-- rounds_lost: long (nullable = true)
| | |-- rounds_won: long (nullable = true)
which I want to turn into this schema:
+----+-------+-----------+----------+
|team|has_won|rounds_lost|rounds_won|
+----+-------+-----------+----------+
|blue| 1| 13| 10|
| red| 0| 10| 13|
+----+-------+-----------+----------+
I already tried selectExpr(inline(array('teams.*'))) (inline over an array of structs), but I have no idea how to get the team name into one of the fields. Thank you!
You can start by un-nesting the struct using * and then use stack to "un-pivot" the dataframe. Finally, un-nest the stats.
from pyspark.sql import Row
rows = [Row(teams=Row(blue=Row(has_won=1, rounds_lost=13, rounds_won=10),
                      red=Row(has_won=0, rounds_lost=10, rounds_won=13)))]
df = spark.createDataFrame(rows)
(df.select("teams.*")
.selectExpr("stack(2, 'blue', blue, 'red', red) as (team, stats)")
.selectExpr("team", "stats.*")
).show()
"""
+----+-------+-----------+----------+
|team|has_won|rounds_lost|rounds_won|
+----+-------+-----------+----------+
|blue| 1| 13| 10|
| red| 0| 10| 13|
+----+-------+-----------+----------+
"""

Always shows error: 'DataFrame' object does not support item assignment in Databricks

import numpy as np
import pandas as pd
df = df_final_bureau_balance
df.show()
df.printSchema()
df["*"] = df['STATUS']
I want to create a new column, but I always get the error: 'DataFrame' object does not support item assignment.
According to the pandas user manual, there is nothing wrong with this syntax.
Doesn't a DataFrame support item assignment?
+------------+------+-------------------+-------------------+-----+
|SK_ID_BUREAU|STATUS|max(MONTHS_BALANCE)|min(MONTHS_BALANCE)|count|
+------------+------+-------------------+-------------------+-----+
| 5001709| C| 0| -85| 86|
| 5001709| X| -86| -96| 11|
| 5001710| C| 0| -47| 48|
| 5001710| X| -49| -82| 30|
| 5001710| 0| -48| -53| 5|
| 5001711| X| 0| 0| 1|
| 5001711| 0| -1| -3| 3|
| 5001712| C| 0| -8| 9|
| 5001712| 0| -9| -18| 10|
| 5001713| X| 0| -21| 22|
| 5001714| X| 0| -14| 15|
| 5001715| X| 0| -59| 60|
| 5001716| 0| -39| -65| 27|
| 5001716| X| -66| -85| 20|
| 5001716| C| 0| -38| 39|
| 5001717| 0| -5| -21| 17|
| 5001717| C| 0| -4| 5|
| 5001718| C| 0| -2| 3|
| 5001718| X| -9| -38| 10|
| 5001718| 0| -3| -37| 24|
+------------+------+-------------------+-------------------+-----+
only showing top 20 rows
root
|-- SK_ID_BUREAU: integer (nullable = true)
|-- STATUS: string (nullable = true)
|-- max(MONTHS_BALANCE): integer (nullable = true)
|-- min(MONTHS_BALANCE): integer (nullable = true)
|-- count: long (nullable = true)
TypeError: 'DataFrame' object does not support item assignment
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-2083632421660035> in <module>
7 df.show()
8 df.printSchema()
----> 9 df["*"] = df['STATUS']
10
11
TypeError: 'DataFrame' object does not support item assignment
The following methods do not belong to a pandas DataFrame; they are Spark DataFrame methods, which means df here is a Spark DataFrame rather than a pandas one:
df.show()
df.printSchema()
The equivalent functionality for a pandas DataFrame would be:
print(df)
df.info(verbose=True)
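Since df is a Spark DataFrame, here is a minimal sketch of the two usual ways forward (the column name STATUS_COPY is only illustrative):
from pyspark.sql import functions as F
# Spark way: add a column with withColumn instead of item assignment
df2 = df.withColumn("STATUS_COPY", F.col("STATUS"))
# pandas way: bring the data to the driver first (only if it fits in memory),
# then item assignment works as documented in the pandas manual
pdf = df.toPandas()
pdf["STATUS_COPY"] = pdf["STATUS"]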

Dataframe column tolist(): column object is not callable

I am trying to make a list of column values of an existing dataframe named datadf.
datadf = sqlContext.createDataFrame(data[0:], ('Name', 'Date', 'Lat', 'Lon', 'Number'))
print(type(datadf))
datadf.printSchema()
Returns:
<class 'pyspark.sql.dataframe.DataFrame'>
root
|-- Name: string (nullable = true)
|-- Date: string (nullable = true)
|-- Lat: double (nullable = true)
|-- Lon: double (nullable = true)
|-- Number: long (nullable = true)
datadf.show()
Returns:
+-----------------+----------+-----------+-----------+------+
| Name| Date| Lat| Lon|Number|
+-----------------+----------+-----------+-----------+------+
|Fallopia japonica|16/09/2016| 52.3792| 6.499| 10|
|Fallopia japonica|21/08/2015| 51.813| 5.784| 1|
|Fallopia japonica|25/08/2016| 50.9623| 5.723| 1|
|Fallopia japonica|27/06/2013| 50.844| 5.688| 1|
|Fallopia japonica|31/05/2015| 51.7267| 5.615| 1|
|Fallopia japonica|04/07/2015| 52.0883| 5.147| 1|
|Fallopia japonica|21/05/2016| 51.5757| 5.027| 1|
|Fallopia japonica|09/06/2015| 51.5734| 5.024| 1|
|Fallopia japonica|13/08/2015| 51.6| 4.981| 101|
|Fallopia japonica|16/07/2014| 51.5656| 4.752| 5001|
|Fallopia japonica|26/09/2016| 51.3021| 3.977| 1|
|Fallopia japonica|27/09/2015| 53.1802005| 7.19828113| 1|
|Fallopia japonica|10/07/2011|53.11105167| 7.19632833| 1|
|Fallopia japonica|11/06/2014|53.00800151|7.192501277| 1|
|Fallopia japonica|19/06/2016|53.00857768| 7.19225564| 51|
|Fallopia japonica|21/04/2015|53.16380117|7.186146926| 1|
|Fallopia japonica|21/04/2015|53.16380117|7.186146926| 1|
|Fallopia japonica|23/08/2003|53.09439231|7.178677324| 1|
|Fallopia japonica|02/09/2002| 53.0050096|7.145194014| 1|
|Fallopia japonica|04/08/2013| 52.962782| 7.144035| 1|
+-----------------+----------+-----------+-----------+------+
only showing top 20 rows
The dataframe has latitude and longitude values, basically I want to make a python list of each.
import pandas
latlist = datadf['Lat'].tolist()
latlist = datadf['Lat'].values.tolist()
Both return: 'Column' object is not callable
Now, I suspect something is wrong with the dataframe values, as I ran into this error before. I have a basemap of the Netherlands, and I want to simply add these coordinates as points on this map.
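A minimal sketch of one way to get plain Python lists here (not part of the original post): a PySpark Column has no .tolist(), so either collect the column on the driver or convert to pandas first.
# collect() returns Row objects; pull the field out of each
latlist = [row["Lat"] for row in datadf.select("Lat").collect()]
lonlist = [row["Lon"] for row in datadf.select("Lon").collect()]
# or, if the data fits in driver memory, go through pandas
latlist = datadf.select("Lat").toPandas()["Lat"].tolist()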

Create empty sparse vectors in PySpark

I have a dataframe DF1 that looks like this:
+-------+------+
|user_id|meta |
+-------+------+
| 1| null|
| 11| null|
| 15| null|
+-------+------+
Schema:
root
|-- user_id: string (nullable = true)
|-- meta: string (nullable = true)
and I have another dataframe DF2 that looks like this
+-------+------------------------------------+
|user_id| Vectorz |
+-------+------------------------------------+
| 10| (2,[1],[1.0])|
| 12| (2,[1],[1.0])|
| 13| (2,[0],[1.0])|
| 14| (2,[1],[1.0])|
+-------+------------------------------------+
Schema is:
[user_id: string, Vectorz: vector]
I want to inject all the user_ids from DF1 into DF2, but create empty sparse vectors for them since their "meta" column is all NULLs.
So, I want DF2 to finally be:
+-------+------------------------------------+
|user_id| Vectorz |
+-------+------------------------------------+
| 1| (,[],[])|
| 10| (2,[1],[1.0])|
| 11| (,[],[])|
| 12| (2,[1],[1.0])|
| 13| (2,[0],[1.0])|
| 14| (2,[1],[1.0])|
| 15| (,[],[])|
+-------+------------------------------------+
Can somebody please help?
I am new to PySpark. So, sorry if I don't sound informed enough.
You can go ahead and create empty vectors for all the user_ids whose meta is null.
In any case, you need to decide what to do when the meta column is not null.
Sample code:
DF1
val spark = sqlContext.sparkSession
val implicits = sqlContext.sparkSession.implicits
import implicits._
val df1 = sqlContext.range(1, 4)
  .withColumnRenamed("id", "user_id")
  .withColumn("meta", lit(null).cast(DataTypes.StringType))
df1.show(false)
df1.printSchema()
+-------+----+
|user_id|meta|
+-------+----+
|1 |null|
|2 |null|
|3 |null|
+-------+----+
root
|-- user_id: long (nullable = false)
|-- meta: string (nullable = true)
DF2
import org.apache.spark.ml.linalg.Vectors
val staticVector = udf(() => Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))), SQLDataTypes.VectorType)
val df2 = sqlContext.range(5, 8)
  .withColumnRenamed("id", "user_id")
  .withColumn("Vectorz", staticVector())
df2.show(false)
df2.printSchema()
+-------+-------------------+
|user_id|Vectorz |
+-------+-------------------+
|5 |(5,[1,3],[1.0,7.0])|
|6 |(5,[1,3],[1.0,7.0])|
|7 |(5,[1,3],[1.0,7.0])|
+-------+-------------------+
root
|-- user_id: long (nullable = false)
|-- Vectorz: vector (nullable = true)
Processed DF
val emptyVector = udf(() => Vectors.sparse(0, Array.empty[Int], Array.empty[Double]), SQLDataTypes.VectorType)
val processedDF =
  // meta column shouldn't have any value
  // for the safer side adding filter as meta is null
  // need to decide what if meta is not null
  // I'm assigning empty vector to that also
  df1.where(col("meta").isNull)
    .withColumn("Vectorz", when(col("meta").isNull, emptyVector()).otherwise(emptyVector()))
    .drop("meta")
    .unionByName(df2)
processedDF.show(false)
processedDF.printSchema()
+-------+-------------------+
|user_id|Vectorz |
+-------+-------------------+
|1 |(0,[],[]) |
|2 |(0,[],[]) |
|3 |(0,[],[]) |
|5 |(5,[1,3],[1.0,7.0])|
|6 |(5,[1,3],[1.0,7.0])|
|7 |(5,[1,3],[1.0,7.0])|
+-------+-------------------+
root
|-- user_id: long (nullable = false)
|-- Vectorz: vector (nullable = true)
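The question is tagged PySpark, so here is a hedged PySpark sketch of the same idea (DF1, DF2, user_id, meta, and Vectorz are the names assumed from the question):
from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT
# zero-length sparse vector for users that only appear in DF1
empty_vector = F.udf(lambda: Vectors.sparse(0, [], []), VectorUDT())
processed_df = (DF1.where(F.col("meta").isNull())
                   .select("user_id")
                   .withColumn("Vectorz", empty_vector())
                   .unionByName(DF2))
processed_df.show(truncate=False)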