Selecting 'Exclusive Rows' from a PySpark Dataframe - dataframe

I have a PySpark dataframe like this:
+----------+-----+
|account_no|types|
+----------+-----+
| 1| K|
| 1| A|
| 1| S|
| 2| M|
| 2| D|
| 2| S|
| 3| S|
| 3| S|
| 4| M|
| 5| K|
| 1| S|
| 6| S|
+----------+-----+
and I am trying to pick the account numbers for which Exclusively 'S' exists.
For example: Even though '1' has type ='S', I will not pick it because it has also got other types. But I will pick 3 and 6, because they have just one type 'S'.
What I am doing right now is:
- First get all accounts for which 'K' exists and remove them; which in this example removes '1' and '5'
- Second find all accounts for which 'D' exists and remove them, which removes '2'
- Third find all accounts for which 'M' exists, and remove '4' ('2' has also got 'M' but it was removed at step 2)
- Fourth find all accounts for which 'A' exists, and remove them
So, now '1', '2', '4' and '5' are removed and I get '3' and '6' which have exclusive 'S'.
But this is a long process, how do I optimize it?
Thank you

Another alternative is counting distinct over a window and then filter where Distinct count == 1 and types == S , for ordering you can assign a monotonically increasing id and then orderBy the same.
from pyspark.sql import functions as F
W = Window.partitionBy('account_no')
out = (df.withColumn("idx",F.monotonically_increasing_id())
.withColumn("Distinct",F.approx_count_distinct(F.col("types")).over(W)).orderBy("idx")
.filter("Distinct==1 AND types =='S'")).drop('idx','Distinct')
out.show()
+----------+-----+
|account_no|types|
+----------+-----+
| 3| S|
| 3| S|
| 6| S|
+----------+-----+

One way to do this is to use Window functions. First we get a sum of the number of S in each account_no grouping. Then we compare that to the total number of entries for that group, in the filter, if they match we keep that number.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w=Window().partitionBy("account_no")
w1=Window().partitionBy("account_no").orderBy("types")
df.withColumn("sum_S", F.sum(F.when(F.col("types")=='S', F.lit(1)).otherwise(F.lit(0))).over(w))\
.withColumn("total", F.max(F.row_number().over(w1)).over(w))\
.filter('total=sum_S').drop("total","Sum_S").show()
#+----------+-----+
#|account_no|types|
#+----------+-----+
#| 6| S|
#| 3| S|
#| 3| S|
#+----------+-----+

You can simply detect the amount of distinct types an account has and then filter the 'S' accounts which only have 1 distinct type.
Here is my code for it:
from pyspark.sql.functions import countDistinct
data = [(1, 'k'),
(1, 'a'),
(1, 's'),
(2, 'm'),
(2, 'd'),
(2, 's'),
(3, 's'),
(3, 's'),
(4, 'm'),
(5, 'k'),
(1, 's'),
(6, 's')]
df = spark.createDataFrame(data, ['account_no', 'types']).distinct()
exclusive_s_accounts = (df.groupBy('account_no').agg(countDistinct('types').alias('distinct_count'))
.join(df, 'account_no')
.where((col('types') == 's') & (col('distinct_count') == 1))
.drop('distinct_count'))

Another alternate approach could be to get all the types under one column and then apply filter operations to exclude which has non "S" values.
from pyspark.sql.functions import concat_ws
from pyspark.sql.functions import collectivist
from pyspark.sql.functions import col
df = spark.read.csv("/Users/Downloads/account.csv", header=True, inferSchema=True, sep=",")
type_df = df.groupBy("account_no").agg(concat_ws(",", collect_list("types")).alias("all_types")).select(col("account_no"), col("all_types"))
+----------+---------+
|account_no|all_types|
+----------+---------+
| 1| K,A,S,S|
| 6| S|
| 3| S,S|
| 5| K|
| 4| M|
| 2| M,D,S|
+----------+---------+
further filtering using regular expression
only_s_df = type_df.withColumn("S_status",F.col("all_types").rlike("K|A|M|D"))
only_s_df.show()
+----------+---------+----------+
|account_no|all_types|S_status |
+----------+---------+----------+
| 1| K,A,S,S| true|
| 6| S| false|
| 3| S,S| false|
| 5| K| true|
| 4| M| true|
| 2| M,D,S| true|
+----------+---------+----------+
hope this way you can get the answer and further processing.

Related

PySpark: How to concatenate two distinct dataframes?

I have multiple dataframes that I need to concatenate together, row-wise. In pandas, we would typically write: pd.concat([df1, df2]).
This thread: How to concatenate/append multiple Spark dataframes column wise in Pyspark? appears close, but its respective answer:
df1_schema = StructType([StructField("id",IntegerType()),StructField("name",StringType())])
df1 = spark.sparkContext.parallelize([(1, "sammy"),(2, "jill"),(3, "john")])
df1 = spark.createDataFrame(df1, schema=df1_schema)
df2_schema = StructType([StructField("secNo",IntegerType()),StructField("city",StringType())])
df2 = spark.sparkContext.parallelize([(101, "LA"),(102, "CA"),(103,"DC")])
df2 = spark.createDataFrame(df2, schema=df2_schema)
schema = StructType(df1.schema.fields + df2.schema.fields)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0]+x[1])
spark.createDataFrame(df1df2, schema).show()
Yields the following error when done on my data at scale: Can only zip RDDs with same number of elements in each partition
How can I join 2 or more data frames that are identical in row length but are otherwise independent of content (they share a similar repeating structure/order but contain no shared data)?
Example expected data looks like:
+---+-----+ +-----+----+ +---+-----+-----+----+
| id| name| |secNo|city| | id| name|secNo|city|
+---+-----+ +-----+----+ +---+-----+-----+----+
| 1|sammy| + | 101| LA| => | 1|sammy| 101| LA|
| 2| jill| | 102| CA| | 2| jill| 102| CA|
| 3| john| | 103| DC| | 3| john| 103| DC|
+---+-----+ +-----+----+ +---+-----+-----+----+
You can create unique IDs with
df1 = df1.withColumn("unique_id", expr("row_number() over (order by (select null))"))
df2 = df2.withColumn("unique_id", expr("row_number() over (order by (select null))"))
then, you can left join them
df1.join(df2, Seq("unique_id"), "left").drop("unique_id")
Final output looks like
+---+----+---+-------+
| id|name|age|address|
+---+----+---+-------+
| 1| a| 7| x|
| 2| b| 8| y|
| 3| c| 9| z|
+---+----+---+-------+

Count types for every time difference from the time of one specific type within a time range with a granularity of one second in pyspark

I have the following time-series data in a DataFrame in pyspark:
(id, timestamp, type)
the id column can be any integer value and many rows of the same id
can exist in the table
the timestamp column is a timestamp represented by an integer (for simplification)
the type column is a string type variable where each distinct
string on the column represents one category. One special category
out of all is 'A'
My question is the following:
Is there any way to compute (with SQL or pyspark DataFrame operations):
the counts of every type
for all the time differences from the timestamp corresponding to all the rows
of type='A' within a time range (e.g. [-5,+5]), with granularity of 1 second
For example, for the following DataFrame:
ts_df = sc.parallelize([
(1,'A',100),(2,'A',1000),(3,'A',10000),
(1,'b',99),(1,'b',99),(1,'b',99),
(2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(1,'d',999),
(3,'c',9999),(3,'c',9999),(3,'d',9999),
(1,'b',98),(1,'b',98),
(2,'b',998),(2,'c',998),
(3,'c',9998)
]).toDF(["id","type","ts"])
ts_df.show()
+---+----+-----+
| id|type| ts|
+---+----+-----+
| 1| A| 100|
| 2| A| 1000|
| 3| A|10000|
| 1| b| 99|
| 1| b| 99|
| 1| b| 99|
| 2| b| 999|
| 2| b| 999|
| 2| c| 999|
| 2| c| 999|
| 1| d| 999|
| 3| c| 9999|
| 3| c| 9999|
| 3| d| 9999|
| 1| b| 98|
| 1| b| 98|
| 2| b| 998|
| 2| c| 998|
| 3| c| 9998|
+---+----+-----+
for a time difference of -1 second the result should be:
# result for time difference = -1 sec
# b: 5
# c: 4
# d: 2
while for a time difference of -2 seconds the result should be:
# result for time difference = -2 sec
# b: 3
# c: 2
# d: 0
and so on so forth for any time difference within a time range for a granularity of 1 second.
I tried many different ways by using mostly groupBy but nothing seems to work.
I am mostly having difficulties on how to express the time difference from each row of type=A even if I have to do it for one specific time difference.
Any suggestions would be greatly appreciated!
EDIT:
If I only have to do it for one specific time difference time_difference then I could do it with the following way:
time_difference = -1
df_type_A = ts_df.where(F.col("type")=='A').selectExpr("ts as fts")
res = df_type_A.join(ts_df, on=df_type_A.fts+time_difference==ts_df.ts)\
.drop("ts","fts").groupBy(F.col("type")).count()
The the returned res DataFrame will give me exactly what I want for one specific time difference. I create a loop and solve the problem by repeating the same query over and over again.
However, is there any more efficient way than that?
EDIT2 (solution)
So that's how I did it at the end:
df1 = sc.parallelize([
(1,'b',99),(1,'b',99),(1,'b',99),
(2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(2,'d',999),
(3,'c',9999),(3,'c',9999),(3,'d',9999),
(1,'b',98),(1,'b',98),
(2,'b',998),(2,'c',998),
(3,'c',9998)
]).toDF(["id","type","ts"])
df1.show()
df2 = sc.parallelize([
(1,'A',100),(2,'A',1000),(3,'A',10000),
]).toDF(["id","type","ts"]).selectExpr("id as fid","ts as fts","type as ftype")
df2.show()
df3 = df2.join(df1, on=df1.id==df2.fid).withColumn("td", F.col("ts")-F.col("fts"))
df3.show()
df4 = df3.groupBy([F.col("type"),F.col("td")]).count()
df4.show()
Will update performance details as soon as I'll have any.
Thanks!
Another way to solve this problem would be:
Divide existing data-frames in two data-frames - with A and without A
Add a new column in without A df, which is sum of "ts" and time_difference
Join both data frame, group By and count.
Here is a code:
from pyspark.sql.functions import lit
time_difference = 1
ts_df_A = (
ts_df
.filter(ts_df["type"] == "A")
.drop("id")
.drop("type")
)
ts_df_td = (
ts_df
.withColumn("ts_plus_td", lit(ts_df['ts'] + time_difference))
.filter(ts_df["type"] != "A")
.drop("ts")
)
joined_df = ts_df_A.join(ts_df_td, ts_df_A["ts"] == ts_df_td["ts_plus_td"])
agg_df = joined_df.groupBy("type").count()
>>> agg_df.show()
+----+-----+
|type|count|
+----+-----+
| d| 2|
| c| 4|
| b| 5|
+----+-----+
>>>
Let me know if this is what you are looking for?
Thanks,
Hussain Bohra

How to get value_counts for a spark row?

I have a spark dataframe with 3 columns storing 3 different predictions. I want to know the count of each output value so as to pick the value that was obtained max number of times as the final output.
I can do this in pandas easily by calling my lambda function for each row to get value_counts as shown below. I have converted my spark df to pandas df here, but I need to be able to perform similar operation on the spark df directly.
r=[Row(run_1=1, run_2=2, run_3=1, name='test run', id=1)]
df1=spark.createDataFrame(r)
df1.show()
df2=df1.toPandas()
r=df2.iloc[0]
val_counts=r[['run_1','run_2','run_3']].value_counts()
print(val_counts)
top_val=val_counts.index[0]
top_val_cnt=val_counts.values[0]
print('Majority output = %s, occured %s out of 3 times'%(top_val,top_val_cnt))
The output tells me that the value 1 occurred the most number of times- twice in this case -
+---+--------+-----+-----+-----+
| id| name|run_1|run_2|run_3|
+---+--------+-----+-----+-----+
| 1|test run| 1| 2| 1|
+---+--------+-----+-----+-----+
1 2
2 1
Name: 0, dtype: int64
Majority output = 1, occured 2 out of 3 times
I am trying to write a udf function which can take each of the df1 rows and get the top_val and top_val_cnt. Is there a way to achieve this using spark df?
python's code should be similar, maybe it will help you
val df1 = Seq((1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)).toDF()
df1.show()
df1.select(array('*)).map(s=>{
val list = s.getList(0)
(list.toString(),list.toArray.groupBy(i => i).mapValues(_.size).toList.toString())
}).show(false)
output:
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| 1| 1| 2|
| 1| 2| 3| 3|
| 2| 2| 2| 2|
+---+---+---+---+
+------------+-------------------------+
|_1 |_2 |
+------------+-------------------------+
|[1, 1, 1, 2]|List((2,1), (1,3)) |
|[1, 2, 3, 3]|List((2,1), (1,1), (3,2))|
|[2, 2, 2, 2]|List((2,4)) |
+------------+-------------------------+
Let's have a test dataframe similar like yours.
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4)]
df=spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])
newdf = df.rdd.map(lambda x : (x[0],x[1],x[2:])) \
.map(lambda x : (x[0],x[1],x[2][0],x[2][1],x[2][2],[max(set(x[2]),key=x[2].count )])) \
.toDF(['id','test','run_1','run_2','run_3','most_frequent'])
>>> newdf.show()
+---+--------+-----+-----+-----+-------------+
| id| test|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| [1]|
| 2|test run| 3| 2| 3| [3]|
| 3|test run| 4| 4| 4| [4]|
+---+--------+-----+-----+-----+-------------+
Or you need to handle a case when each item in list is different. i.e returning a null.
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4),(4,'test run',1,2,3)]
df=spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])
from pyspark.sql.functions import udf
#udf
def most_frequent(*mylist):
counter = 1
num = mylist[0]
for i in mylist:
curr_frequency = mylist.count(i)
if(curr_frequency> counter):
counter = curr_frequency
num = i
return num
else:
return None
Initializing counter to '1' and returning count if its greater than '1' only.
df.withColumn('most_frequent', most_frequent('run_1', 'run_2', 'run_3')).show()
+---+--------+-----+-----+-----+-------------+
| id| name|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| 1|
| 2|test run| 3| 2| 3| 3|
| 3|test run| 4| 4| 4| 4|
| 4|test run| 1| 2| 3| null|
+---+--------+-----+-----+-----+-------------+
+---+--------+-----+-----+-----+----+

Partition PySpark DataFrame depending on unique values in column (Custom Partitioning)

I have a PySpark data frame in which I have separate columns for names, types, days and values. An example of the dataframe can be seen below:
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
| name4| b| 1| 145|
| name5| b| 1| 185|
| name6| c| 1| 155|
| name7| c| 1| 160|
| name8| a| 2| 120|
| name9| a| 2| 110|
|name10| b| 2| 125|
|name11| b| 2| 185|
|name12| c| 3| 195|
+------+----+---+-----+
For a selected value of Type, I want to create separate dataframes depending on the unique values of the column titled Day. Let's say, I have chosen a as my preferred Type. In the aforementioned example, I have three unique values of Day (viz. 1, 2 , 3). For each unique value of Day which has a row with the chosen Type a - (that is days 1 and 2 in the above data), I want to create a dataframe which has all rows with the chosen chosen Type and Day. In the example mentioned above, I will have two dataframe which will look as below
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
+------+----+---+-----+
and
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name8| a| 2| 120|
| name9| a| 2| 110|
+------+----+---+-----+
How can I do this? In the actual data that I will be working with, I have millions of columns. So, I want to know about the most efficient way in which I can realize the above mentioned aim.
You can use the below mentioned code to generate the example given above.
from pyspark.sql import *
import numpy as np
Stats = Row("Name", "Type", "Day", "Value")
stat1 = Stats('name1', 'a', 1, 140)
stat2 = Stats('name2', 'a', 1, 180)
stat3 = Stats('name3', 'a', 1, 150)
stat4 = Stats('name4', 'b', 1, 145)
stat5 = Stats('name5', 'b', 1, 185)
stat6 = Stats('name6', 'c', 1, 155)
stat7 = Stats('name7', 'c', 1, 160)
stat8 = Stats('name8', 'a', 2, 120)
stat9 = Stats('name9', 'a', 2, 110)
stat10 = Stats('name10', 'b', 2, 125)
stat11 = Stats('name11', 'b', 2, 185)
stat12 = Stats('name12', 'c', 3, 195)
You can just use df.repartition("Type", "Day")
Docs for the same.
When I validate using the following function, I get the mentioned output
def validate(partition):
count = 0
for row in partition:
print(row)
count += 1
print(count)
My data
+------+--------------------+-------+-------+
|amount| trans_date|user_id|row_num|
+------+--------------------+-------+-------+
| 99.1|2019-06-04T00:00:...| 101| 1|
| 89.27|2019-06-04T00:00:...| 102| 2|
| 89.1|2019-03-04T00:00:...| 102| 3|
| 73.11|2019-09-10T00:00:...| 103| 4|
|-69.81|2019-09-11T00:00:...| 101| 5|
| 12.51|2018-12-14T00:00:...| 101| 6|
| 43.23|2018-09-11T00:00:...| 101| 7|
+------+--------------------+-------+-------+
After df.repartition("user_id") I get the following:
Output
Row(amount=73.11, trans_date='2019-09-10T00:00:00.000+05:30', user_id='103', row_num=4)
1
Row(amount=89.27, trans_date='2019-06-04T00:00:00.000+05:30', user_id='102', row_num=2)
Row(amount=89.1, trans_date='2019-03-04T00:00:00.000+05:30', user_id='102', row_num=3)
2
Row(amount=99.1, trans_date='2019-06-04T00:00:00.000+05:30', user_id='101', row_num=1)
Row(amount=-69.81, trans_date='2019-09-11T00:00:00.000+05:30', user_id='101', row_num=5)
Row(amount=12.51, trans_date='2018-12-14T00:00:00.000+05:30', user_id='101', row_num=6)
Row(amount=43.23, trans_date='2018-09-11T00:00:00.000+05:30', user_id='101', row_num=7)
4

Apache Spark SQL: How to use GroupBy and Max to filter data

I have a given dataset with the following structure:
https://i.imgur.com/Kk7I1S1.png
I need to solve the below problem using SparkSQL: Dataframes
For each postcode find the customer that has had the most number of previous accidents. In the case of a tie, meaning more than one customer have the same highest number of accidents, just return any one of them. For each of these selected customers output the following columns: postcode, customer id, number of previous accidents.
I think you have missed to provide data that you have mentioned in image link. I have created my own data set by taking your problem as a reference. You can use below code snippet just for your reference and also can replace df data Frame with your data set to add required column such as id etc.
scala> val df = spark.read.format("csv").option("header","true").load("/user/nikhil/acc.csv")
df: org.apache.spark.sql.DataFrame = [postcode: string, customer: string ... 1 more field]
scala> df.show()
+--------+--------+---------+
|postcode|customer|accidents|
+--------+--------+---------+
| 1| Nikhil| 5|
| 2| Ram| 4|
| 1| Shyam| 3|
| 3| pranav| 1|
| 1| Suman| 2|
| 3| alex| 2|
| 2| Raj| 5|
| 4| arpit| 3|
| 1| darsh| 2|
| 1| rahul| 3|
| 2| kiran| 4|
| 3| baba| 4|
| 4| alok| 3|
| 1| Nakul| 5|
+--------+--------+---------+
scala> df.createOrReplaceTempView("tmptable")
scala> spark.sql(s"""SELECT postcode,customer, accidents FROM (SELECT postcode,customer, accidents, row_number() over (PARTITION BY postcode ORDER BY accidents desc) as rn from tmptable) WHERE rn = 1""").show(false)
+--------+--------+---------+
|postcode|customer|accidents|
+--------+--------+---------+
|3 |baba |4 |
|1 |Nikhil |5 |
|4 |arpit |3 |
|2 |Raj |5 |
+--------+--------+---------+
You can get the result with the following code in python:
from pyspark.sql import Row, Window
import pyspark.sql.functions as F
from pyspark.sql.window import *
l = [(1, '682308', 25), (1, '682308', 23), (2, '682309', 23), (1, '682309', 27), (2, '682309', 22)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(c_id=int(x[0]), postcode=x[1], accident=int(x[2])))
schemaPeople = sqlContext.createDataFrame(people)
result = schemaPeople.groupby("postcode", "c_id").agg(F.max("accident").alias("accident"))
new_result = result.withColumn("row_num", F.row_number().over(Window.partitionBy("postcode").orderBy(F.desc("accident")))).filter("row_num==1")
new_result.show()