Count types for every time difference from the timestamps of one specific type, within a time range, with a granularity of one second, in PySpark / SQL

I have the following time-series data in a PySpark DataFrame:
(id, timestamp, type)
the id column can be any integer value, and many rows with the same id can exist in the table
the timestamp column is a timestamp represented by an integer (for simplification)
the type column is a string variable where each distinct string in the column represents one category. One special category among them is 'A'
My question is the following:
Is there any way to compute (with SQL or PySpark DataFrame operations):
the counts of every type
for all the time differences from the timestamps of the rows with type='A', within a time range (e.g. [-5, +5]), with a granularity of 1 second?
For example, for the following DataFrame:
ts_df = sc.parallelize([
(1,'A',100),(2,'A',1000),(3,'A',10000),
(1,'b',99),(1,'b',99),(1,'b',99),
(2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(1,'d',999),
(3,'c',9999),(3,'c',9999),(3,'d',9999),
(1,'b',98),(1,'b',98),
(2,'b',998),(2,'c',998),
(3,'c',9998)
]).toDF(["id","type","ts"])
ts_df.show()
+---+----+-----+
| id|type| ts|
+---+----+-----+
| 1| A| 100|
| 2| A| 1000|
| 3| A|10000|
| 1| b| 99|
| 1| b| 99|
| 1| b| 99|
| 2| b| 999|
| 2| b| 999|
| 2| c| 999|
| 2| c| 999|
| 1| d| 999|
| 3| c| 9999|
| 3| c| 9999|
| 3| d| 9999|
| 1| b| 98|
| 1| b| 98|
| 2| b| 998|
| 2| c| 998|
| 3| c| 9998|
+---+----+-----+
for a time difference of -1 second the result should be:
# result for time difference = -1 sec
# b: 5
# c: 4
# d: 2
while for a time difference of -2 seconds the result should be:
# result for time difference = -2 sec
# b: 3
# c: 2
# d: 0
and so on for any time difference within the time range, with a granularity of 1 second.
I tried many different approaches, mostly based on groupBy, but nothing seems to work.
I am mostly having difficulty expressing the time difference from each row of type='A', even when I only have to do it for one specific time difference.
Any suggestions would be greatly appreciated!
EDIT:
If I only have to do it for one specific time difference time_difference, then I can do it the following way:
from pyspark.sql import functions as F

time_difference = -1
df_type_A = ts_df.where(F.col("type") == 'A').selectExpr("ts as fts")
res = df_type_A.join(ts_df, on=df_type_A.fts + time_difference == ts_df.ts) \
    .drop("ts", "fts").groupBy(F.col("type")).count()
The returned res DataFrame then gives me exactly what I want for one specific time difference. I can create a loop and solve the problem by repeating the same query over and over again.
However, is there any more efficient way than that?
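For reference, that brute-force loop could look roughly like the sketch below (untested; it assumes pyspark.sql.functions is imported as F and simply unions the per-offset counts, tagging each with its offset in a td column):
from functools import reduce
from pyspark.sql import functions as F

df_type_A = ts_df.where(F.col("type") == 'A').selectExpr("ts as fts")

per_offset = []
for time_difference in range(-5, 6):          # every offset in [-5, +5]
    counts = (df_type_A
              .join(ts_df, on=df_type_A.fts + time_difference == ts_df.ts)
              .groupBy("type").count()
              .withColumn("td", F.lit(time_difference)))
    per_offset.append(counts)

res_all = reduce(lambda a, b: a.unionByName(b), per_offset)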
EDIT2 (solution)
This is how I did it in the end:
df1 = sc.parallelize([
(1,'b',99),(1,'b',99),(1,'b',99),
(2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(2,'d',999),
(3,'c',9999),(3,'c',9999),(3,'d',9999),
(1,'b',98),(1,'b',98),
(2,'b',998),(2,'c',998),
(3,'c',9998)
]).toDF(["id","type","ts"])
df1.show()
df2 = sc.parallelize([
(1,'A',100),(2,'A',1000),(3,'A',10000),
]).toDF(["id","type","ts"]).selectExpr("id as fid","ts as fts","type as ftype")
df2.show()
# join every non-'A' row to the 'A' row of the same id and compute the time difference
df3 = df2.join(df1, on=df1.id == df2.fid).withColumn("td", F.col("ts") - F.col("fts"))
df3.show()
# count rows per (type, time difference)
df4 = df3.groupBy([F.col("type"), F.col("td")]).count()
df4.show()
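If only differences inside a bounded window such as [-5, +5] are of interest, td can be filtered before the aggregation; a minimal sketch (the df4_range name is just for illustration, and F is pyspark.sql.functions):
df4_range = (df3.filter(F.col("td").between(-5, 5))   # keep only offsets in [-5, +5]
                .groupBy("type", "td").count())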
Will update performance details as soon as I have any.
Thanks!

Another way to solve this problem would be:
Split the existing DataFrame into two DataFrames: one with type 'A' and one without it.
Add a new column to the without-A DataFrame, which is the sum of "ts" and time_difference.
Join both DataFrames, group by type, and count.
Here is the code:
from pyspark.sql.functions import lit

time_difference = 1
# keep only the 'A' rows; their timestamps are the reference points
ts_df_A = (
    ts_df
    .filter(ts_df["type"] == "A")
    .drop("id")
    .drop("type")
)
# shift the non-'A' rows by time_difference seconds
ts_df_td = (
    ts_df
    .withColumn("ts_plus_td", lit(ts_df['ts'] + time_difference))
    .filter(ts_df["type"] != "A")
    .drop("ts")
)
# a match means the non-'A' row happened time_difference seconds before the 'A' row
joined_df = ts_df_A.join(ts_df_td, ts_df_A["ts"] == ts_df_td["ts_plus_td"])
agg_df = joined_df.groupBy("type").count()
>>> agg_df.show()
+----+-----+
|type|count|
+----+-----+
| d| 2|
| c| 4|
| b| 5|
+----+-----+
Let me know if this is what you are looking for.
Thanks,
Hussain Bohra
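A further note for readers who want the whole [-5, +5] window in a single pass: each 'A' timestamp can be expanded into all offsets with sequence/explode and joined once. This is a sketch only (not from either answer above) and assumes Spark 2.4+ for F.sequence:
from pyspark.sql import functions as F

# expand each 'A' timestamp into every offset in [-5, +5]
a_offsets = (ts_df.filter(F.col("type") == "A")
                  .select(F.col("ts").alias("a_ts"))
                  .withColumn("td", F.explode(F.sequence(F.lit(-5), F.lit(5))))
                  .withColumn("target_ts", F.col("a_ts") + F.col("td")))

others = ts_df.filter(F.col("type") != "A")

# one join gives the counts for every (type, offset) pair at once
res = (a_offsets.join(others, a_offsets["target_ts"] == others["ts"])
                .groupBy("type", "td").count())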

Related

PySpark: How to concatenate two distinct dataframes?

I have multiple dataframes that I need to concatenate together, row-wise. In pandas, we would typically write: pd.concat([df1, df2]).
This thread, How to concatenate/append multiple Spark dataframes column wise in Pyspark?, appears close, but its answer:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

df1_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
df1 = spark.sparkContext.parallelize([(1, "sammy"), (2, "jill"), (3, "john")])
df1 = spark.createDataFrame(df1, schema=df1_schema)
df2_schema = StructType([StructField("secNo", IntegerType()), StructField("city", StringType())])
df2 = spark.sparkContext.parallelize([(101, "LA"), (102, "CA"), (103, "DC")])
df2 = spark.createDataFrame(df2, schema=df2_schema)
schema = StructType(df1.schema.fields + df2.schema.fields)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0] + x[1])
spark.createDataFrame(df1df2, schema).show()
Yields the following error when done on my data at scale: Can only zip RDDs with same number of elements in each partition
How can I join 2 or more data frames that are identical in row length but are otherwise independent of content (they share a similar repeating structure/order but contain no shared data)?
Example expected data looks like:
+---+-----+ +-----+----+ +---+-----+-----+----+
| id| name| |secNo|city| | id| name|secNo|city|
+---+-----+ +-----+----+ +---+-----+-----+----+
| 1|sammy| + | 101| LA| => | 1|sammy| 101| LA|
| 2| jill| | 102| CA| | 2| jill| 102| CA|
| 3| john| | 103| DC| | 3| john| 103| DC|
+---+-----+ +-----+----+ +---+-----+-----+----+
You can create unique IDs with
from pyspark.sql.functions import expr

df1 = df1.withColumn("unique_id", expr("row_number() over (order by (select null))"))
df2 = df2.withColumn("unique_id", expr("row_number() over (order by (select null))"))
then you can left-join them on that column:
df1.join(df2, ["unique_id"], "left").drop("unique_id")
The final output looks like this (note that this illustration uses different example columns than the ones above):
+---+----+---+-------+
| id|name|age|address|
+---+----+---+-------+
| 1| a| 7| x|
| 2| b| 8| y|
| 3| c| 9| z|
+---+----+---+-------+
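For a pure-PySpark variant of the same idea, the row numbers can be generated with a Window ordered by monotonically_increasing_id() instead of the SQL expression above; this is an untested sketch, and the pairing is only meaningful if the current physical row order of both DataFrames is the one you want to align:
from pyspark.sql import Window
from pyspark.sql import functions as F

# note: a window without partitionBy moves all rows to a single partition
w = Window.orderBy(F.monotonically_increasing_id())
df1_idx = df1.withColumn("unique_id", F.row_number().over(w))
df2_idx = df2.withColumn("unique_id", F.row_number().over(w))

df1_idx.join(df2_idx, on="unique_id", how="left").drop("unique_id").show()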

Spark SQL generate SCD2 without dropping historic state

Data from a relational database is loaded into Spark - supposedly daily, but in reality not every day. Furthermore, it is a full copy of the DB - no delta loading.
In order to join the dimension tables easily with the main event data, I want to:
deduplicate it (i.e. improve the potential for a broadcast join later)
have valid_to/valid_from columns so that even though data is not available daily (inconsistently), it can still be used nicely (from downstream)
I am using Spark 3.0.1 and want to transform the existing data in SCD2 style - without losing history.
spark-shell
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
case class Foo (key:Int, value:Int, date:String)
val d = Seq(Foo(1, 1, "20200101"), Foo(1, 8, "20200102"), Foo(1, 9, "20200120"),Foo(1, 9, "20200121"),Foo(1, 9, "20200122"), Foo(1, 1, "20200103"), Foo(2, 5, "20200101"), Foo(1, 10, "20200113")).toDF
d.show
val windowDeduplication = Window.partitionBy("key", "value").orderBy("key", "date")
val windowPrimaryKey = Window.partitionBy("key").orderBy("key", "date")
val nextThing = lead("date", 1).over(windowPrimaryKey)
d.withColumn("date", to_date(col("date"), "yyyyMMdd")).withColumn("rank", rank().over(windowDeduplication)).filter(col("rank") === 1).drop("rank").withColumn("valid_to", nextThing).withColumn("valid_to", when(nextThing.isNotNull, date_sub(nextThing, 1)).otherwise(current_date)).withColumnRenamed("date", "valid_from").orderBy("key", "valid_from", "valid_to").show
results in:
+---+-----+----------+----------+
|key|value|valid_from| valid_to|
+---+-----+----------+----------+
| 1| 1|2020-01-01|2020-01-01|
| 1| 8|2020-01-02|2020-01-12|
| 1| 10|2020-01-13|2020-01-19|
| 1| 9|2020-01-20|2020-10-09|
| 2| 5|2020-01-01|2020-10-09|
+---+-----+----------+----------+
which is already pretty good. However:
| 1| 1|2020-01-03| 2|2020-01-12|
is lost. That is, any value which occurs again later (after an intermediary change) is lost.
How can I keep this row without also keeping the larger ranks, as in:
d.withColumn("date", to_date(col("date"), "yyyyMMdd")).withColumn("rank", rank().over(windowDeduplication)).withColumn("valid_to", nextThing).withColumn("valid_to",
when(nextThing.isNotNull, date_sub(nextThing, 1)).otherwise(current_date)).withColumnRenamed("date", "valid_from").orderBy("key", "valid_from", "valid_to").show
+---+-----+----------+----+----------+
|key|value|valid_from|rank| valid_to|
+---+-----+----------+----+----------+
| 1| 1|2020-01-01| 1|2020-01-01|
| 1| 8|2020-01-02| 1|2020-01-02|
| 1| 1|2020-01-03| 2|2020-01-12|
| 1| 10|2020-01-13| 1|2020-01-19|
| 1| 9|2020-01-20| 1|2020-01-20|
| 1| 9|2020-01-21| 2|2020-01-21|
| 1| 9|2020-01-22| 3|2020-10-09|
| 2| 5|2020-01-01| 1|2020-10-09|
+---+-----+----------+----+----------+
Which is definitely not desired.
The idea is to drop duplicates, but keep any historic changes to the data using valid_from/valid_to columns.
How can I properly transform this into an SCD2 representation, i.e. have valid_from and valid_to, without dropping intermediary state?
NOTICE: I do not need to update existing data (MERGE INTO, JOIN). It is fine to recreate / overwrite it.
That is, Implement SCD Type 2 in Spark seems to be way too complicated. Is there a better way in my case, where the state handling is not required? I have data originating from a daily full copy of a database and want to deduplicate it.
The previous approach only keeps the first (earliest) version of a duplicate. I think the only solution without a join for state handling is a window function in which each row is compared against the next one - and if nothing changes across the whole row, it is discarded.
This is probably less efficient - but more accurate. It also depends on the use case at hand, i.e. how likely it is that a changed value will be seen again.
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

case class Foo (key:Int, value:Int, value2:Int, date:String)
val d = Seq(Foo(1, 1,1, "20200101"), Foo(1, 8,1, "20200102"), Foo(1, 9,1, "20200120"),Foo(1, 6,1, "20200121"),Foo(1, 9,1, "20200122"), Foo(1, 1,1, "20200103"), Foo(2, 5,1, "20200101"), Foo(1, 10,1, "20200113"), Foo(1, 9,1, "20210120"),Foo(1, 9,1, "20220121"),Foo(1, 9,3, "20230122")).toDF

def compare2Rows(key: Seq[String], sortChangingIgnored: Seq[String], timeColumn: String)(df: DataFrame): DataFrame = {
  val windowPrimaryKey = Window.partitionBy(key.map(col): _*).orderBy(sortChangingIgnored.map(col): _*)
  val columnsToCompare = df.drop(key ++ sortChangingIgnored: _*).columns
  val nextDataChange = lead(timeColumn, 1).over(windowPrimaryKey)
  val deduplicated = df
    .withColumn("data_changes", columnsToCompare.map(e => col(e) =!= lead(col(e), 1).over(windowPrimaryKey)).reduce(_ or _))
    .filter(col("data_changes").isNull or col("data_changes"))
  deduplicated
    .withColumn("valid_to", when(nextDataChange.isNotNull, date_sub(nextDataChange, 1)).otherwise(current_date))
    .withColumnRenamed("date", "valid_from")
    .drop("data_changes")
}
d.orderBy("key", "date").show
d.withColumn("date", to_date(col("date"), "yyyyMMdd")).transform(compare2Rows(Seq("key"), Seq("date"), "date")).orderBy("key", "valid_from", "valid_to").show
returns:
+---+-----+------+----------+----------+
|key|value|value2|valid_from| valid_to|
+---+-----+------+----------+----------+
| 1| 1| 1|2020-01-01|2020-01-01|
| 1| 8| 1|2020-01-02|2020-01-02|
| 1| 1| 1|2020-01-03|2020-01-12|
| 1| 10| 1|2020-01-13|2020-01-19|
| 1| 9| 1|2020-01-20|2020-01-20|
| 1| 6| 1|2020-01-21|2022-01-20|
| 1| 9| 1|2022-01-21|2023-01-21|
| 1| 9| 3|2023-01-22|2020-10-09|
| 2| 5| 1|2020-01-01|2020-10-09|
+---+-----+------+----------+----------+
for an input of:
+---+-----+------+--------+
|key|value|value2| date|
+---+-----+------+--------+
| 1| 1| 1|20200101|
| 1| 8| 1|20200102|
| 1| 1| 1|20200103|
| 1| 10| 1|20200113|
| 1| 9| 1|20200120|
| 1| 6| 1|20200121|
| 1| 9| 1|20200122|
| 1| 9| 1|20210120|
| 1| 9| 1|20220121|
| 1| 9| 3|20230122|
| 2| 5| 1|20200101|
+---+-----+------+--------+
This function has the downside that an unlimited amount of state is built up for each key. But as I plan to apply this to rather small dimension tables, I think it should be fine anyway.
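For readers working in PySpark rather than Scala, a rough translation of compare2Rows might look like the sketch below (untested; it mirrors the Scala logic above, and the usage comment assumes a DataFrame d with the same columns):
from functools import reduce
from pyspark.sql import Window
from pyspark.sql import functions as F

def compare_2_rows(key, sort_changing_ignored, time_column, df):
    w = Window.partitionBy(*key).orderBy(*sort_changing_ignored)
    columns_to_compare = [c for c in df.columns if c not in key + sort_changing_ignored]
    # a row is kept if any compared column differs from the next row (or if there is no next row)
    changed = reduce(lambda a, b: a | b,
                     [F.col(c) != F.lead(c, 1).over(w) for c in columns_to_compare])
    deduplicated = (df.withColumn("data_changes", changed)
                      .filter(F.col("data_changes").isNull() | F.col("data_changes")))
    next_change = F.lead(time_column, 1).over(w)
    return (deduplicated
            .withColumn("valid_to",
                        F.when(next_change.isNotNull(), F.date_sub(next_change, 1))
                         .otherwise(F.current_date()))
            .withColumnRenamed(time_column, "valid_from")
            .drop("data_changes"))

# usage, assuming d is the equivalent PySpark DataFrame:
# result = compare_2_rows(["key"], ["date"], "date",
#                         d.withColumn("date", F.to_date("date", "yyyyMMdd")))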

How to get value_counts for a spark row?

I have a Spark DataFrame with 3 columns storing 3 different predictions. I want to know the count of each output value so as to pick the value that was obtained the maximum number of times as the final output.
I can do this in pandas easily by calling my lambda function for each row to get value_counts, as shown below. I have converted my Spark df to a pandas df here, but I need to be able to perform a similar operation on the Spark df directly.
from pyspark.sql import Row

r = [Row(run_1=1, run_2=2, run_3=1, name='test run', id=1)]
df1=spark.createDataFrame(r)
df1.show()
df2=df1.toPandas()
r=df2.iloc[0]
val_counts=r[['run_1','run_2','run_3']].value_counts()
print(val_counts)
top_val=val_counts.index[0]
top_val_cnt=val_counts.values[0]
print('Majority output = %s, occured %s out of 3 times'%(top_val,top_val_cnt))
The output tells me that the value 1 occurred the most times - twice in this case:
+---+--------+-----+-----+-----+
| id| name|run_1|run_2|run_3|
+---+--------+-----+-----+-----+
| 1|test run| 1| 2| 1|
+---+--------+-----+-----+-----+
1 2
2 1
Name: 0, dtype: int64
Majority output = 1, occured 2 out of 3 times
I am trying to write a udf function which can take each of the df1 rows and get the top_val and top_val_cnt. Is there a way to achieve this using spark df?
Python's code should be similar; maybe it will help you:
val df1 = Seq((1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)).toDF()
df1.show()
df1.select(array('*)).map(s => {
  val list = s.getList(0)
  (list.toString(), list.toArray.groupBy(i => i).mapValues(_.size).toList.toString())
}).show(false)
output:
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| 1| 1| 2|
| 1| 2| 3| 3|
| 2| 2| 2| 2|
+---+---+---+---+
+------------+-------------------------+
|_1 |_2 |
+------------+-------------------------+
|[1, 1, 1, 2]|List((2,1), (1,3)) |
|[1, 2, 3, 3]|List((2,1), (1,1), (3,2))|
|[2, 2, 2, 2]|List((2,4)) |
+------------+-------------------------+
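A rough PySpark counterpart of the Scala snippet above, assuming the df1 from the question (with integer run_1/run_2/run_3 columns), could be a small UDF over an array column - a sketch:
from collections import Counter
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, IntegerType

# map each value to how often it appears across the three run columns
value_counts = F.udf(lambda vals: dict(Counter(vals)),
                     MapType(IntegerType(), IntegerType()))

df1.withColumn("value_counts",
               value_counts(F.array("run_1", "run_2", "run_3"))).show(truncate=False)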
Let's have a test dataframe similar to yours.
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4)]
df=spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])
newdf = df.rdd.map(lambda x: (x[0], x[1], x[2:])) \
    .map(lambda x: (x[0], x[1], x[2][0], x[2][1], x[2][2], [max(set(x[2]), key=x[2].count)])) \
    .toDF(['id', 'test', 'run_1', 'run_2', 'run_3', 'most_frequent'])
>>> newdf.show()
+---+--------+-----+-----+-----+-------------+
| id| test|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| [1]|
| 2|test run| 3| 2| 3| [3]|
| 3|test run| 4| 4| 4| [4]|
+---+--------+-----+-----+-----+-------------+
Or, if you need to handle the case when each item in the list is different, i.e. return a null:
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4),(4,'test run',1,2,3)]
df=spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])
from pyspark.sql.functions import udf

@udf
def most_frequent(*mylist):
    counter = 1
    num = mylist[0]
    for i in mylist:
        curr_frequency = mylist.count(i)
        if curr_frequency > counter:
            counter = curr_frequency
            num = i
            return num
    else:
        # the loop completed without finding any value that occurs more than once
        return None
The counter is initialized to '1', and a value is returned only if its count is greater than '1'.
df.withColumn('most_frequent', most_frequent('run_1', 'run_2', 'run_3')).show()
+---+--------+-----+-----+-----+-------------+
| id| name|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| 1|
| 2|test run| 3| 2| 3| 3|
| 3|test run| 4| 4| 4| 4|
| 4|test run| 1| 2| 3| null|
+---+--------+-----+-----+-----+-------------+

Spark SQL: Is there a way to distinguish columns with same name?

I have a CSV with a header that contains columns with the same name.
I want to process them with Spark using only SQL and be able to refer to these columns unambiguously.
Ex.:
id name age height name
1 Alex 23 1.70
2 Joseph 24 1.89
I want to get only the first name column, using only Spark SQL.
As mentioned in the comments, I think the less error-prone method would be to change the schema of the input data.
Yet, in case you are looking for a quick workaround, you can simply index the duplicated names of the columns.
For instance, let's create a dataframe with three id columns.
val df = spark.range(3)
.select('id * 2 as "id", 'id * 3 as "x", 'id, 'id * 4 as "y", 'id)
df.show
+---+---+---+---+---+
| id| x| id| y| id|
+---+---+---+---+---+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+---+---+---+---+---+
Then I can use toDF to set new column names. Let's assume that I know that only id is duplicated. If we don't know, adding the extra logic to figure out which columns are duplicated would not be very difficult.
var i = -1
val names = df.columns.map( n =>
  if (n == "id") {
    i += 1
    s"id_$i"
  } else n )
val new_df = df.toDF(names: _*)
new_df.show
+----+---+----+---+----+
|id_0| x|id_1| y|id_2|
+----+---+----+---+----+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+----+---+----+---+----+
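For completeness, the same quick workaround in PySpark could look like this sketch (assuming a PySpark DataFrame df whose duplicated column is known to be id):
# rename duplicated columns by appending a running index
i = -1
names = []
for n in df.columns:
    if n == "id":
        i += 1
        names.append(f"id_{i}")
    else:
        names.append(n)

new_df = df.toDF(*names)
new_df.show()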

Add aggregated columns to pivot without join

Considering the table:
df=sc.parallelize([(1,1,1),(5,0,2),(27,1,1),(1,0,3),(5,1,1),(1,0,2)]).toDF(['id', 'error', 'timestamp'])
df.show()
+---+-----+---------+
| id|error|timestamp|
+---+-----+---------+
| 1| 1| 1|
| 5| 0| 2|
| 27| 1| 1|
| 1| 0| 3|
| 5| 1| 1|
| 1| 0| 2|
+---+-----+---------+
I would like to pivot on the timestamp column while keeping some other aggregated information from the original table. The result I am interested in can be achieved by
import pyspark.sql.functions as sf

df1 = df.groupBy('id').agg(sf.sum('error').alias('Ne'), sf.count('*').alias('cnt'))
df2 = df.groupBy('id').pivot('timestamp').agg(sf.count('*')).fillna(0)
df1.join(df2, on='id').filter(sf.col('cnt') > 1).show()
with the resulting table:
+---+---+---+---+---+---+
| id| Ne|cnt| 1| 2| 3|
+---+---+---+---+---+---+
| 5| 1| 2| 1| 1| 0|
| 1| 1| 3| 1| 1| 1|
+---+---+---+---+---+---+
However, there are at least two issues with this solution:
I am filtering by cnt at the end of the script. If I were able to do this at the beginning, I could avoid almost all the processing, because a large portion of the data is removed by this filter. Is there any way to do this, other than the collect and isin methods?
I am doing groupBy on id twice: first to aggregate some columns I need in the results, and a second time to get the pivot columns. Finally, I need a join to merge these columns. I feel that I am surely missing some solution, because it should be possible to do this with just one groupBy and without a join, but I cannot figure out how.
I think you cannot get around the join, because the pivot needs the timestamp values while the first grouping must not consider them. So, to create the Ne and cnt values, you have to group the DataFrame by id only, which loses the timestamps; if you want to preserve those values in columns, you have to do the pivot separately, as you did, and join it back.
The only improvement that can be made is to move the filter to the df1 creation. As you said, this could already improve performance, since df1 should be much smaller after the filtering on your real data.
from pyspark.sql.functions import *
df=sc.parallelize([(1,1,1),(5,0,2),(27,1,1),(1,0,3),(5,1,1),(1,0,2)]).toDF(['id', 'error', 'timestamp'])
df1=df.groupBy('id').agg(sum('error').alias('Ne'),count('*').alias('cnt')).filter(col('cnt')>1)
df2=df.groupBy('id').pivot('timestamp').agg(count('*')).fillna(0)
df1.join(df2, on='id').show()
Output:
+---+---+---+---+---+---+
| id| Ne|cnt| 1| 2| 3|
+---+---+---+---+---+---+
| 5| 1| 2| 1| 1| 0|
| 1| 1| 3| 1| 1| 1|
+---+---+---+---+---+---+
Actually, it is indeed possible to avoid the join by using Window functions:
from pyspark.sql import Window
import pyspark.sql.functions as sf

w1 = Window.partitionBy('id')
w2 = Window.partitionBy('id', 'timestamp')
df.select('id', 'timestamp',
          sf.sum('error').over(w1).alias('Ne'),
          sf.count('*').over(w1).alias('cnt'),
          sf.count('*').over(w2).alias('cnt_2')) \
  .filter(sf.col('cnt') > 1) \
  .groupBy('id', 'Ne', 'cnt').pivot('timestamp').agg(sf.first('cnt_2')).fillna(0).show()
# Ne and cnt are constant within each id (both are computed over w1, partitioned by id only),
# so adding them to the groupBy keys does not change the grouping granularity.