How to efficiently split a dataframe in Spark based on a condition? - dataframe

I have a situation like this with the following Spark dataframe:
id | value
----------
1  | 0
1  | 3
2  | 4
1  | 0
2  | 2
3  | 0
4  | 1
Now I want to efficiently split this single dataframe into 3 different ones using Apache Spark, such that each extracted dataframe contains the rows between two zeros in the "value" column (with each zero marking the beginning of a new dataframe), so that I would obtain this result:
Dataframe 1 (rows from first 0 value to the last value before the next 0):
id | value
----------
1  | 0
1  | 3
2  | 4
Dataframe 2 (rows from the second zero value to the last value before the 3rd zero):
id | value
----------
1  | 0
2  | 2
Dataframe 3:
id | value
----------
3  | 0
4  | 1

While, as samkart said, there is no efficient/easy way to break data up based on the order of rows, if you are using Spark v3.2+ you can still leverage pandas on Spark to do it the Spark way, like below:
import pyspark.pandas as ps
from pyspark.sql import functions as F
from pyspark.sql import Window

# read with pandas-on-Spark to get an ordered index, then convert back to a Spark dataframe
pdf = ps.read_csv("/FileStore/tmp4/pand.txt")
sdf = pdf.to_spark(index_col='index')

# a running count of zeros marks which split ("run") each row belongs to
sdf = sdf.withColumn("run", F.sum(F.when(F.col("value") == 0, 1).otherwise(0)).over(Window.orderBy("index")))

# number of splits = highest run value; create sdf1..sdfN dynamically
toval = sdf.agg(F.max(F.col("run"))).collect()[0][0]
for x in range(1, toval + 1):
    globals()[f"sdf{x}"] = sdf.filter(F.col("run") == x).drop("index", "run")
For the above data this will create 3 dataframes, sdf1, sdf2 and sdf3, like below:
sdf1.show()
sdf2.show()
sdf3.show()
#output
+---+-----+
| id|value|
+---+-----+
| 1| 0|
| 1| 3|
| 2| 4|
+---+-----+
+---+-----+
| id|value|
+---+-----+
| 1| 0|
| 2| 2|
+---+-----+
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 4| 1|
+---+-----+
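Rather than creating variables dynamically with globals(), collecting the splits in a dictionary keyed by the run number is often easier to work with; a small sketch, assuming the same sdf, toval and run column as above:
# collect each split into a dict instead of dynamically named variables
splits = {}
for x in range(1, toval + 1):
    splits[x] = sdf.filter(F.col("run") == x).drop("index", "run")

splits[1].show()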

How can I replace the values in one pyspark dataframe column with the values from another column in a sub-section of the dataframe?

I have to perform a group-by and pivot operation on a dataframe's "activity" column, and populate the new columns resulting from the pivot with the sum of the "quantity" column. One of the activity columns, however, has to be populated with the sum of the "cost" column.
Data frame before group-by and pivot:
+----+-----------+-----------+-----------+-----------+
| id | quantity | cost | activity | category |
+----+-----------+-----------+-----------+-----------+
| 1 | 2 | 2 | skiing | outdoor |
| 2 | 0 | 2 | swimming | outdoor |
+----+-----------+-----------+-----------+-----------+
pivot code:
pivotDF = df.groupBy("category").pivot("activity").sum("quantity")
result:
+----+-----------+-----------+-----------+
| id | category | skiing | swimming |
+----+-----------+-----------+-----------+
| 1 | outdoor | 2 | 5 |
| 2 | outdoor | 4 | 7 |
+----+-----------+-----------+-----------+
The problem is that for one of these activities, I need the activity column to be populated with sum("cost") instead of sum("quantity"). I can't seem to find a way to specify this during the pivot operation itself, so I thought maybe I can just exchange the values in the quantity column for the ones in the cost column wherever the activity column value corresponds to the relevant activity. However, I can't find an example of how to do this in a pyspark data frame.
Any help would be much appreciated.
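One way to express the idea from the last paragraph (a rough sketch, assuming hypothetically that the relevant activity is 'swimming') is a conditional swap with when/otherwise before the pivot:
from pyspark.sql import functions as F

# replace quantity with cost only where the activity matches the relevant one, then pivot as before
df = df.withColumn(
    'quantity',
    F.when(F.col('activity') == 'swimming', F.col('cost')).otherwise(F.col('quantity'))
)
pivotDF = df.groupBy("category").pivot("activity").sum("quantity")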
You can provide more than 1 aggregation after the pivot.
Let's say the input dataframe looks like the following
# +---+---+----+--------+-------+
# | id|qty|cost| act| cat|
# +---+---+----+--------+-------+
# | 1| 2| 2| skiing|outdoor|
# | 2| 0| 2|swimming|outdoor|
# | 3| 1| 2| skiing|outdoor|
# | 4| 2| 4|swimming|outdoor|
# +---+---+----+--------+-------+
Do a pivot and use agg() to provide more than 1 aggregation.
from pyspark.sql import functions as func

data_sdf. \
    groupBy('id', 'cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty')
        ). \
    show()
# +---+-------+-----------+----------+-------------+------------+
# | id| cat|skiing_cost|skiing_qty|swimming_cost|swimming_qty|
# +---+-------+-----------+----------+-------------+------------+
# | 2|outdoor| null| null| 2| 0|
# | 1|outdoor| 2| 2| null| null|
# | 3|outdoor| 2| 1| null| null|
# | 4|outdoor| null| null| 4| 2|
# +---+-------+-----------+----------+-------------+------------+
Notice the field names. PySpark automatically assigns the suffix based on the alias provided in the aggregations. Use a drop or select to retain the columns you need and rename them as you like.
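For example, a minimal sketch of such a select on the result above, assuming (hypothetically) that skiing should come from cost and swimming from quantity:
pivoted_sdf = data_sdf. \
    groupBy('id', 'cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty'))

# keep only the desired aggregation per activity and rename the columns
pivoted_sdf. \
    select('id', 'cat',
           func.col('skiing_cost').alias('skiing'),
           func.col('swimming_qty').alias('swimming')). \
    show()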
Removing id from the groupBy makes the result much better.
data_sdf. \
    groupBy('cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty')
        ). \
    show()
# +-------+-----------+----------+-------------+------------+
# | cat|skiing_cost|skiing_qty|swimming_cost|swimming_qty|
# +-------+-----------+----------+-------------+------------+
# |outdoor| 4| 3| 6| 2|
# +-------+-----------+----------+-------------+------------+

How do I transpose a dataframe with only one row and multiple column in pyspark?

I have dataframes with one row:
A B C D E
4 1 7 2 3
I would like to convert this to a dataframe with the following format:
Letter Number
A 4
B 1
C 7
D 2
E 3
I did not find any built-in PySpark function for this in the docs, so I created a very simple function that does the job. Given that your dataframe df has only one row, you can use the following solution.
def my_transpose(df):
    # get values
    letter = df.columns
    number = list(df.take(1)[0].asDict().values())

    # combine values for a new Spark dataframe
    data = [[a, b] for a, b in zip(letter, number)]
    res = spark.createDataFrame(data, ['Letter', 'Number'])
    return res

my_transpose(df).show()
+------+------+
|Letter|Number|
+------+------+
| A| 4|
| B| 1|
| C| 7|
| D| 2|
| E| 3|
+------+------+
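As an alternative, if all of the value columns share a compatible type, the SQL stack() expression can unpivot a single-row dataframe without collecting it to the driver first; a minimal sketch, assuming the column names from the example:
# build "stack(5, 'A', A, 'B', B, ...)" from the column names and unpivot in one select
stack_expr = ", ".join(f"'{c}', `{c}`" for c in df.columns)
df.selectExpr(f"stack({len(df.columns)}, {stack_expr}) as (Letter, Number)").show()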

How to create multiple flag columns based on list values found in the dataframe column?

The table looks like this:
ID |CITY
----------------------------------
1 |London|Paris|Tokyo
2 |Tokyo|Barcelona|Mumbai|London
3 |Vienna|Paris|Seattle
The CITY column contains around 1000+ values which are | delimited.
I want to create flag columns to indicate whether a person visited the cities of interest.
city_of_interest=['Paris','Seattle','Tokyo']
There are 20 such values in the list.
Output should look like this:
ID |Paris | Seattle | Tokyo
-------------------------------------------
1 |1 |0 |1
2 |0 |0 |1
3 |1 |1 |0
The solution can either be in pandas or pyspark.
For pyspark, use split + array_contains:
from pyspark.sql.functions import split, array_contains

df.withColumn('cities', split('CITY', r'\|')) \
  .select('ID', *[array_contains('cities', c).astype('int').alias(c) for c in city_of_interest]) \
  .show()
+---+-----+-------+-----+
| ID|Paris|Seattle|Tokyo|
+---+-----+-------+-----+
| 1| 1| 0| 1|
| 2| 0| 0| 1|
| 3| 1| 1| 0|
+---+-----+-------+-----+
For Pandas, use Series.str.get_dummies:
# Series.str.get_dummies splits on '|' by default
df[city_of_interest] = df.CITY.str.get_dummies()[city_of_interest]
df = df.drop('CITY', axis=1)
Pandas Solution
First transform the CITY strings to lists so we can use DataFrame.explode:
new_df=df.copy()
new_df['CITY']=new_df['CITY'].str.lstrip('|').str.split('|')
#print(new_df)
# ID CITY
#0 1 [London, Paris, Tokyo]
#1 2 [Tokyo, Barcelona, Mumbai, London]
#2 3 [Vienna, Paris, Seattle]
Then we can use:
Method 1: DataFrame.pivot_table
new_df = (new_df.explode('CITY')
                .pivot_table(columns='CITY', index='ID', aggfunc='size', fill_value=0)
                [city_of_interest]
                .reset_index()
                .rename_axis(columns=None)
         )
print(new_df)
Method 2: DataFrame.groupby + DataFrame.unstack
new_df = (new_df.explode('CITY')
                .groupby(['ID'])
                .CITY
                .value_counts()
                .unstack('CITY', fill_value=0)[city_of_interest]
                .reset_index()
                .rename_axis(columns=None)
         )
print(new_df)
Output new_df:
ID Paris Seattle Tokyo
0 1 1 0 1
1 2 0 0 1
2 3 1 1 0
Using a UDF to check if the city of interest value is in the delimited column.
from pyspark.sql.functions import udf, array, lit
from pyspark.sql.types import IntegerType

# Input list
city_of_interest = ['Paris', 'Seattle', 'Tokyo']

# UDF definition: returns 1 if the city is present in the delimited string, else 0
def city_present(city_name, city_list):
    return len(set([city_name]) & set(city_list.split('|')))

city_present_udf = udf(city_present, IntegerType())

# Converting the cities list to a column of array type for adding columns to the dataframe
city_array = array(*[lit(city) for city in city_of_interest])
l = len(city_of_interest)

col_names = df.columns + [city for city in city_of_interest]
result = df.select(df.columns + [city_present_udf(city_array[i], df.CITY) for i in range(l)])
result = result.toDF(*col_names)
result.show()

Count types for every time difference from the time of one specific type within a time range with a granularity of one second in pyspark

I have the following time-series data in a DataFrame in pyspark:
(id, timestamp, type)
the id column can be any integer value, and many rows with the same id can exist in the table
the timestamp column is a timestamp represented by an integer (for simplification)
the type column is a string variable where each distinct string in the column represents one category; one special category out of all is 'A'
My question is the following:
Is there any way to compute (with SQL or pyspark DataFrame operations) the counts of every type, for all the time differences from the timestamps of the rows with type='A', within a time range (e.g. [-5,+5]) and with a granularity of 1 second?
For example, for the following DataFrame:
ts_df = sc.parallelize([
(1,'A',100),(2,'A',1000),(3,'A',10000),
(1,'b',99),(1,'b',99),(1,'b',99),
(2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(1,'d',999),
(3,'c',9999),(3,'c',9999),(3,'d',9999),
(1,'b',98),(1,'b',98),
(2,'b',998),(2,'c',998),
(3,'c',9998)
]).toDF(["id","type","ts"])
ts_df.show()
+---+----+-----+
| id|type| ts|
+---+----+-----+
| 1| A| 100|
| 2| A| 1000|
| 3| A|10000|
| 1| b| 99|
| 1| b| 99|
| 1| b| 99|
| 2| b| 999|
| 2| b| 999|
| 2| c| 999|
| 2| c| 999|
| 1| d| 999|
| 3| c| 9999|
| 3| c| 9999|
| 3| d| 9999|
| 1| b| 98|
| 1| b| 98|
| 2| b| 998|
| 2| c| 998|
| 3| c| 9998|
+---+----+-----+
for a time difference of -1 second the result should be:
# result for time difference = -1 sec
# b: 5
# c: 4
# d: 2
while for a time difference of -2 seconds the result should be:
# result for time difference = -2 sec
# b: 3
# c: 2
# d: 0
and so on and so forth for any time difference within the time range, with a granularity of 1 second.
I tried many different ways, mostly using groupBy, but nothing seems to work.
I am mostly having difficulty expressing the time difference from each row of type='A', even if I only have to do it for one specific time difference.
Any suggestions would be greatly appreciated!
EDIT:
If I only have to do it for one specific time difference time_difference, then I could do it in the following way:
time_difference = -1
df_type_A = ts_df.where(F.col("type")=='A').selectExpr("ts as fts")
res = df_type_A.join(ts_df, on=df_type_A.fts+time_difference==ts_df.ts)\
.drop("ts","fts").groupBy(F.col("type")).count()
Then the returned res DataFrame gives me exactly what I want for one specific time difference. I can create a loop and solve the problem by repeating the same query over and over again.
However, is there any more efficient way than that?
EDIT2 (solution)
So this is how I did it in the end:
df1 = sc.parallelize([
(1,'b',99),(1,'b',99),(1,'b',99),
(2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(2,'d',999),
(3,'c',9999),(3,'c',9999),(3,'d',9999),
(1,'b',98),(1,'b',98),
(2,'b',998),(2,'c',998),
(3,'c',9998)
]).toDF(["id","type","ts"])
df1.show()
df2 = sc.parallelize([
(1,'A',100),(2,'A',1000),(3,'A',10000),
]).toDF(["id","type","ts"]).selectExpr("id as fid","ts as fts","type as ftype")
df2.show()
df3 = df2.join(df1, on=df1.id==df2.fid).withColumn("td", F.col("ts")-F.col("fts"))
df3.show()
df4 = df3.groupBy([F.col("type"),F.col("td")]).count()
df4.show()
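If only a bounded window of time differences is needed (for example the [-5, +5] range mentioned above), the grouped result can simply be filtered; a small sketch on top of df4:
# keep only the time differences inside the desired range (range taken from the example above)
df4.where(F.col("td").between(-5, 5)).orderBy("td", "type").show()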
I will update with performance details as soon as I have any.
Thanks!
Another way to solve this problem would be:
Divide the existing dataframe into two dataframes: one with A and one without A.
Add a new column to the without-A dataframe, which is the sum of "ts" and time_difference.
Join both dataframes, then group by and count.
Here is the code:
from pyspark.sql.functions import lit

time_difference = 1

ts_df_A = (
    ts_df
    .filter(ts_df["type"] == "A")
    .drop("id")
    .drop("type")
)

ts_df_td = (
    ts_df
    .withColumn("ts_plus_td", lit(ts_df['ts'] + time_difference))
    .filter(ts_df["type"] != "A")
    .drop("ts")
)

joined_df = ts_df_A.join(ts_df_td, ts_df_A["ts"] == ts_df_td["ts_plus_td"])
agg_df = joined_df.groupBy("type").count()
>>> agg_df.show()
+----+-----+
|type|count|
+----+-----+
| d| 2|
| c| 4|
| b| 5|
+----+-----+
Let me know if this is what you are looking for.
Thanks,
Hussain Bohra

How to filter a dates column with a condition from another column in PySpark?

Assume I have the following data frame named table_df in PySpark:
sid | date | label
------------------
1033| 20170521 | 0
1033| 20170520 | 0
1033| 20170519 | 1
1033| 20170516 | 0
1033| 20170515 | 0
1033| 20170511 | 1
1033| 20170511 | 0
1033| 20170509 | 0
.....................
The data frame table_df contains different IDs in different rows, the above is simply one typical case of ID.
For each ID and for each date with label 1, I would like to find the date with label 0 that is the closest and before.
For the above table, with ID 1033, date=20170519, label 1, the date of label 0 that is closest and before is 20170516.
And with ID 1033, date=20170511, label 1, the date of label 0 that is closest and before is 20170509 .
So, finally using groupBy and some complicated operations, I will obtain the following table:
sid | filtered_date
-------------------
1033| 20170516
1033| 20170509
Any help is highly appreciated. I tried but could not find any smart ways.
Thanks
We can use a window partitioned by sid and ordered by date, and find the difference with the next row:
df.show()
+----+--------+-----+
| sid| date|label|
+----+--------+-----+
|1033|20170521| 0|
|1033|20170520| 0|
|1033|20170519| 1|
|1033|20170516| 0|
|1033|20170515| 0|
|1033|20170511| 1|
|1033|20170511| 0|
|1033|20170509| 0|
+----+--------+-----+
from pyspark.sql import Window
from pyspark.sql import functions as F
w = Window.partitionBy('sid').orderBy('date')
df.withColumn('diff',F.lead('label').over(w) - df['label']).where(F.col('diff') == 1).drop('diff').show()
+----+--------+-----+
| sid| date|label|
+----+--------+-----+
|1033|20170509| 0|
|1033|20170516| 0|
+----+--------+-----+
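To match the requested output shape, the same result can be trimmed with a select and a rename; a small sketch reusing the window above:
df.withColumn('diff', F.lead('label').over(w) - df['label']) \
  .where(F.col('diff') == 1) \
  .select('sid', F.col('date').alias('filtered_date')) \
  .show()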