Partition PySpark DataFrame depending on unique values in column (Custom Partitioning) - apache-spark-sql

I have a PySpark data frame in which I have separate columns for names, types, days and values. An example of the dataframe can be seen below:
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
| name4| b| 1| 145|
| name5| b| 1| 185|
| name6| c| 1| 155|
| name7| c| 1| 160|
| name8| a| 2| 120|
| name9| a| 2| 110|
|name10| b| 2| 125|
|name11| b| 2| 185|
|name12| c| 3| 195|
+------+----+---+-----+
For a selected value of Type, I want to create separate dataframes depending on the unique values of the column titled Day. Say I have chosen a as my preferred Type. In the example above, there are three unique values of Day (viz. 1, 2, 3). For each unique value of Day that has a row with the chosen Type a (that is, days 1 and 2 in the above data), I want to create a dataframe which has all rows with the chosen Type and Day. In the example above, I will have two dataframes, which will look as below:
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
+------+----+---+-----+
and
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name8| a| 2| 120|
| name9| a| 2| 110|
+------+----+---+-----+
How can I do this? In the actual data that I will be working with, I have millions of rows. So I want to know the most efficient way to achieve this.
You can use the code below to generate the example given above.
from pyspark.sql import Row

Stats = Row("Name", "Type", "Day", "Value")

stat1 = Stats('name1', 'a', 1, 140)
stat2 = Stats('name2', 'a', 1, 180)
stat3 = Stats('name3', 'a', 1, 150)
stat4 = Stats('name4', 'b', 1, 145)
stat5 = Stats('name5', 'b', 1, 185)
stat6 = Stats('name6', 'c', 1, 155)
stat7 = Stats('name7', 'c', 1, 160)
stat8 = Stats('name8', 'a', 2, 120)
stat9 = Stats('name9', 'a', 2, 110)
stat10 = Stats('name10', 'b', 2, 125)
stat11 = Stats('name11', 'b', 2, 185)
stat12 = Stats('name12', 'c', 3, 195)

# Build the example dataframe from the rows above (assumes an active SparkSession named spark)
df = spark.createDataFrame([stat1, stat2, stat3, stat4, stat5, stat6,
                            stat7, stat8, stat9, stat10, stat11, stat12])

You can just use df.repartition("Type", "Day").
See the docs for repartition.
When I validate using the following function, I get the output shown below.
def validate(partition):
    count = 0
    for row in partition:
        print(row)
        count += 1
    print(count)
My data
+------+--------------------+-------+-------+
|amount| trans_date|user_id|row_num|
+------+--------------------+-------+-------+
| 99.1|2019-06-04T00:00:...| 101| 1|
| 89.27|2019-06-04T00:00:...| 102| 2|
| 89.1|2019-03-04T00:00:...| 102| 3|
| 73.11|2019-09-10T00:00:...| 103| 4|
|-69.81|2019-09-11T00:00:...| 101| 5|
| 12.51|2018-12-14T00:00:...| 101| 6|
| 43.23|2018-09-11T00:00:...| 101| 7|
+------+--------------------+-------+-------+
After df.repartition("user_id") I get the following:
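The counts shown below presumably come from applying the function above to each partition, e.g. something along these lines (the exact call is not shown in the original):
df.repartition("user_id").rdd.foreachPartition(validate)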
Output
Row(amount=73.11, trans_date='2019-09-10T00:00:00.000+05:30', user_id='103', row_num=4)
1
Row(amount=89.27, trans_date='2019-06-04T00:00:00.000+05:30', user_id='102', row_num=2)
Row(amount=89.1, trans_date='2019-03-04T00:00:00.000+05:30', user_id='102', row_num=3)
2
Row(amount=99.1, trans_date='2019-06-04T00:00:00.000+05:30', user_id='101', row_num=1)
Row(amount=-69.81, trans_date='2019-09-11T00:00:00.000+05:30', user_id='101', row_num=5)
Row(amount=12.51, trans_date='2018-12-14T00:00:00.000+05:30', user_id='101', row_num=6)
Row(amount=43.23, trans_date='2018-09-11T00:00:00.000+05:30', user_id='101', row_num=7)
4
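Note that repartition only controls the physical partitioning. If you literally want a separate DataFrame per unique Day for the chosen Type, as the question asks, a filter-based sketch (assuming the df built from the Stats rows in the question) could look like this:
chosen_type = "a"
days = [r["Day"] for r in df.filter(df.Type == chosen_type).select("Day").distinct().collect()]

# One lazily evaluated DataFrame per Day value that occurs with the chosen Type
frames = {d: df.filter((df.Type == chosen_type) & (df.Day == d)) for d in days}
frames[1].show()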

Related

PySpark: How to concatenate two distinct dataframes?

I have multiple dataframes that I need to concatenate together, row by row (side by side). In pandas, we would typically write: pd.concat([df1, df2], axis=1).
This thread: How to concatenate/append multiple Spark dataframes column wise in Pyspark? appears close, but its respective answer:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

df1_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
df1 = spark.sparkContext.parallelize([(1, "sammy"), (2, "jill"), (3, "john")])
df1 = spark.createDataFrame(df1, schema=df1_schema)

df2_schema = StructType([StructField("secNo", IntegerType()), StructField("city", StringType())])
df2 = spark.sparkContext.parallelize([(101, "LA"), (102, "CA"), (103, "DC")])
df2 = spark.createDataFrame(df2, schema=df2_schema)

schema = StructType(df1.schema.fields + df2.schema.fields)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0] + x[1])
spark.createDataFrame(df1df2, schema).show()
Yields the following error when done on my data at scale: Can only zip RDDs with same number of elements in each partition
How can I join 2 or more data frames that are identical in row length but are otherwise independent of content (they share a similar repeating structure/order but contain no shared data)?
Example expected data looks like:
+---+-----+ +-----+----+ +---+-----+-----+----+
| id| name| |secNo|city| | id| name|secNo|city|
+---+-----+ +-----+----+ +---+-----+-----+----+
| 1|sammy| + | 101| LA| => | 1|sammy| 101| LA|
| 2| jill| | 102| CA| | 2| jill| 102| CA|
| 3| john| | 103| DC| | 3| john| 103| DC|
+---+-----+ +-----+----+ +---+-----+-----+----+
You can create unique IDs with:
from pyspark.sql.functions import expr

df1 = df1.withColumn("unique_id", expr("row_number() over (order by (select null))"))
df2 = df2.withColumn("unique_id", expr("row_number() over (order by (select null))"))
Then you can left join them:
df1.join(df2, ["unique_id"], "left").drop("unique_id")
The final output for the example data looks like:
+---+-----+-----+----+
| id| name|secNo|city|
+---+-----+-----+----+
|  1|sammy|  101|  LA|
|  2| jill|  102|  CA|
|  3| john|  103|  DC|
+---+-----+-----+----+
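If that SQL window trick gives you trouble, a zipWithIndex-based sketch sidesteps the "Can only zip RDDs with same number of elements in each partition" error entirely (an alternative to the answer above, not its original method; it assumes both frames have the same row count, as in the question):
def with_index(df, colname="unique_id"):
    # zipWithIndex preserves the existing row order and works regardless of partitioning
    return (df.rdd.zipWithIndex()
              .map(lambda pair: pair[0] + (pair[1],))
              .toDF(df.columns + [colname]))

df1df2 = with_index(df1).join(with_index(df2), "unique_id", "inner").drop("unique_id")
df1df2.show()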

how to create & sort by an ordered categorical variable in pyspark

I'm migrating some code from pandas to pyspark. My source dataframe looks like this:
a b c
0 1 insert 1
1 2 update 1
2 3 seed 1
3 4 insert 2
4 5 update 2
5 6 delete 2
6 7 snapshot 1
and the operation (in python / pandas) that I'm applying is:
df.b = pd.Categorical(df.b, ordered=True, categories=['insert', 'seed', 'update', 'snapshot', 'delete'])
df.sort_values(['c', 'b'])
resulting in the output dataframe:
a b c
0 1 insert 1
2 3 seed 1
1 2 update 1
6 7 snapshot 1
3 4 insert 2
4 5 update 2
5 6 delete 2
I'm unsure how best to set up ordered categoricals using pyspark, and my initial approach creates a new column using case-when and attempts to use that subsequently:
from pyspark.sql.functions import when, col

df = df.withColumn(
    "_precedence",
    when(col("b") == "insert", 1)
    .when(col("b") == "seed", 2)
    .when(col("b") == "update", 3)
    .when(col("b") == "snapshot", 4)
    .when(col("b") == "delete", 5)
)
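For completeness, the precedence column built this way can then drive the sort directly, continuing the question's own case-when approach (a small sketch):
df.orderBy("c", "_precedence").drop("_precedence").show()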
You can use a map:
from pyspark.sql.functions import create_map, lit, col

categories = ['insert', 'seed', 'update', 'snapshot', 'delete']

# per #HaleemurAli, adjusted the below list comprehension to create the map
# (both keys and values are wrapped in lit() so they are treated as literals, not column names)
map1 = create_map([val for (i, c) in enumerate(categories) for val in (lit(c), lit(i))])
# Column<b'map(insert, 0, seed, 1, update, 2, snapshot, 3, delete, 4)'>
df.orderBy('c', map1[col('b')]).show()
+---+---+--------+---+
| id| a| b| c|
+---+---+--------+---+
| 0| 1| insert| 1|
| 2| 3| seed| 1|
| 1| 2| update| 1|
| 6| 7|snapshot| 1|
| 3| 4| insert| 2|
| 4| 5| update| 2|
| 5| 6| delete| 2|
+---+---+--------+---+
To reverse the order on column b: df.orderBy('c', map1[col('b')].desc()).show()
You could also do this using coalesce with your when statements.
from pyspark.sql import functions as F

categories = ['insert', 'seed', 'update', 'snapshot', 'delete']
cols = [F.when(F.col("b") == cat, F.lit(i)) for i, cat in enumerate(categories, start=1)]
df.orderBy("c", F.coalesce(*cols)).show()
#+---+--------+---+
#| a| b| c|
#+---+--------+---+
#| 1| insert| 1|
#| 3| seed| 1|
#| 2| update| 1|
#| 7|snapshot| 1|
#| 4| insert| 2|
#| 5| update| 2|
#| 6| delete| 2|
#+---+--------+---+

Selecting 'Exclusive Rows' from a PySpark Dataframe

I have a PySpark dataframe like this:
+----------+-----+
|account_no|types|
+----------+-----+
| 1| K|
| 1| A|
| 1| S|
| 2| M|
| 2| D|
| 2| S|
| 3| S|
| 3| S|
| 4| M|
| 5| K|
| 1| S|
| 6| S|
+----------+-----+
and I am trying to pick the account numbers for which exclusively 'S' exists.
For example: even though '1' has type 'S', I will not pick it because it also has other types. But I will pick 3 and 6, because they have just the one type 'S'.
What I am doing right now is:
- First, get all accounts for which 'K' exists and remove them; in this example that removes '1' and '5'
- Second, find all accounts for which 'D' exists and remove them, which removes '2'
- Third, find all accounts for which 'M' exists, and remove '4' ('2' also has 'M', but it was removed at step 2)
- Fourth, find all accounts for which 'A' exists, and remove them
So, now '1', '2', '4' and '5' are removed and I get '3' and '6' which have exclusive 'S'.
But this is a long process, how do I optimize it?
Thank you
Another alternative is counting distinct values over a window (approx_count_distinct, since an exact distinct count is not supported as a window function) and then filtering where the distinct count == 1 and types == 'S'. For ordering, you can assign a monotonically increasing id and then orderBy it.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

W = Window.partitionBy('account_no')
out = (df.withColumn("idx", F.monotonically_increasing_id())
         .withColumn("Distinct", F.approx_count_distinct(F.col("types")).over(W))
         .orderBy("idx")
         .filter("Distinct == 1 AND types == 'S'")
         .drop('idx', 'Distinct'))
out.show()
+----------+-----+
|account_no|types|
+----------+-----+
| 3| S|
| 3| S|
| 6| S|
+----------+-----+
One way to do this is to use window functions. First we get the number of 'S' rows in each account_no group; then we compare that to the total number of entries for the group in the filter, and if they match we keep that account.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("account_no")
w1 = Window().partitionBy("account_no").orderBy("types")

df.withColumn("sum_S", F.sum(F.when(F.col("types") == 'S', F.lit(1)).otherwise(F.lit(0))).over(w))\
  .withColumn("total", F.max(F.row_number().over(w1)).over(w))\
  .filter('total = sum_S').drop("total", "sum_S").show()
#+----------+-----+
#|account_no|types|
#+----------+-----+
#| 6| S|
#| 3| S|
#| 3| S|
#+----------+-----+
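The same exclusivity check can also be sketched with a plain aggregation plus a semi-join instead of window functions (an alternative idea, not part of the answers above): an account contains only 'S' exactly when both its minimum and maximum type are 'S'.
from pyspark.sql import functions as F

s_only = (df.groupBy("account_no")
            .agg(F.min("types").alias("mn"), F.max("types").alias("mx"))
            .filter("mn = 'S' AND mx = 'S'")
            .select("account_no"))

# left_semi keeps the original rows (including duplicates) for the qualifying accounts only
df.join(s_only, "account_no", "left_semi").show()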
You can simply detect the number of distinct types an account has and then keep the 'S' accounts which only have one distinct type.
Here is my code for it:
from pyspark.sql.functions import col, countDistinct

data = [(1, 'k'),
        (1, 'a'),
        (1, 's'),
        (2, 'm'),
        (2, 'd'),
        (2, 's'),
        (3, 's'),
        (3, 's'),
        (4, 'm'),
        (5, 'k'),
        (1, 's'),
        (6, 's')]
df = spark.createDataFrame(data, ['account_no', 'types']).distinct()

exclusive_s_accounts = (df.groupBy('account_no').agg(countDistinct('types').alias('distinct_count'))
                          .join(df, 'account_no')
                          .where((col('types') == 's') & (col('distinct_count') == 1))
                          .drop('distinct_count'))
Another alternative approach could be to collect all the types under one column and then apply a filter to exclude accounts that have any non-'S' values.
from pyspark.sql.functions import concat_ws
from pyspark.sql.functions import collect_list
from pyspark.sql.functions import col

df = spark.read.csv("/Users/Downloads/account.csv", header=True, inferSchema=True, sep=",")
type_df = df.groupBy("account_no").agg(concat_ws(",", collect_list("types")).alias("all_types")).select(col("account_no"), col("all_types"))
type_df.show()
+----------+---------+
|account_no|all_types|
+----------+---------+
| 1| K,A,S,S|
| 6| S|
| 3| S,S|
| 5| K|
| 4| M|
| 2| M,D,S|
+----------+---------+
Further filtering using a regular expression:
only_s_df = type_df.withColumn("S_status", col("all_types").rlike("K|A|M|D"))
only_s_df.show()
+----------+---------+----------+
|account_no|all_types|S_status |
+----------+---------+----------+
| 1| K,A,S,S| true|
| 6| S| false|
| 3| S,S| false|
| 5| K| true|
| 4| M| true|
| 2| M,D,S| true|
+----------+---------+----------+
Hopefully this way you can get the answer and do further processing.
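To then keep only the exclusively-'S' accounts from that flag, one more filter should do it (a small sketch continuing the code above):
exclusive_s = only_s_df.filter(~col("S_status")).select("account_no")
exclusive_s.show()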

How to get value_counts for a spark row?

I have a spark dataframe with 3 columns storing 3 different predictions. I want to know the count of each output value so as to pick the value that was obtained the maximum number of times as the final output.
I can do this in pandas easily by calling my lambda function for each row to get value_counts as shown below. I have converted my spark df to a pandas df here, but I need to be able to perform a similar operation on the spark df directly.
from pyspark.sql import Row

r = [Row(run_1=1, run_2=2, run_3=1, name='test run', id=1)]
df1 = spark.createDataFrame(r)
df1.show()

df2 = df1.toPandas()
r = df2.iloc[0]
val_counts = r[['run_1', 'run_2', 'run_3']].value_counts()
print(val_counts)
top_val = val_counts.index[0]
top_val_cnt = val_counts.values[0]
print('Majority output = %s, occured %s out of 3 times' % (top_val, top_val_cnt))
The output tells me that the value 1 occurred the most, twice in this case:
+---+--------+-----+-----+-----+
| id| name|run_1|run_2|run_3|
+---+--------+-----+-----+-----+
| 1|test run| 1| 2| 1|
+---+--------+-----+-----+-----+
1 2
2 1
Name: 0, dtype: int64
Majority output = 1, occured 2 out of 3 times
I am trying to write a udf function which can take each of the df1 rows and get the top_val and top_val_cnt. Is there a way to achieve this using spark df?
This is Scala, but the Python code should be similar; maybe it will help you:
val df1 = Seq((1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)).toDF()
df1.show()

df1.select(array('*)).map(s => {
  val list = s.getList(0)
  (list.toString(), list.toArray.groupBy(i => i).mapValues(_.size).toList.toString())
}).show(false)
output:
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| 1| 1| 2|
| 1| 2| 3| 3|
| 2| 2| 2| 2|
+---+---+---+---+
+------------+-------------------------+
|_1 |_2 |
+------------+-------------------------+
|[1, 1, 1, 2]|List((2,1), (1,3)) |
|[1, 2, 3, 3]|List((2,1), (1,1), (3,2))|
|[2, 2, 2, 2]|List((2,4)) |
+------------+-------------------------+
Let's have a test dataframe similar to yours.
data = [(1, 'test run', 1, 2, 1), (2, 'test run', 3, 2, 3), (3, 'test run', 4, 4, 4)]
df = spark.createDataFrame(data, ['id', 'name', 'run_1', 'run_2', 'run_3'])

newdf = df.rdd.map(lambda x: (x[0], x[1], x[2:])) \
    .map(lambda x: (x[0], x[1], x[2][0], x[2][1], x[2][2], [max(set(x[2]), key=x[2].count)])) \
    .toDF(['id', 'test', 'run_1', 'run_2', 'run_3', 'most_frequent'])
>>> newdf.show()
+---+--------+-----+-----+-----+-------------+
| id| test|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| [1]|
| 2|test run| 3| 2| 3| [3]|
| 3|test run| 4| 4| 4| [4]|
+---+--------+-----+-----+-----+-------------+
Or, if you need to handle the case where every item in the list is different (i.e. returning a null):
data = [(1, 'test run', 1, 2, 1), (2, 'test run', 3, 2, 3), (3, 'test run', 4, 4, 4), (4, 'test run', 1, 2, 3)]
df = spark.createDataFrame(data, ['id', 'name', 'run_1', 'run_2', 'run_3'])
from pyspark.sql.functions import udf

@udf
def most_frequent(*mylist):
    counter = 1
    num = mylist[0]
    for i in mylist:
        curr_frequency = mylist.count(i)
        if curr_frequency > counter:
            counter = curr_frequency
            num = i
            return num
        else:
            return None
The counter is initialized to 1, and a value is returned only if its count is greater than 1.
df.withColumn('most_frequent', most_frequent('run_1', 'run_2', 'run_3')).show()
+---+--------+-----+-----+-----+-------------+
| id| name|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| 1|
| 2|test run| 3| 2| 3| 3|
| 3|test run| 4| 4| 4| 4|
| 4|test run| 1| 2| 3| null|
+---+--------+-----+-----+-----+-------------+
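For reference, the per-row majority can also be computed without a UDF, using only built-in column functions (an alternative approach, not from the answers above; it assumes the run_1/run_2/run_3 columns of df1, and ties may resolve to either value):
from pyspark.sql import functions as F

runs = ["run_1", "run_2", "run_3"]

# For each run column, count how many runs in the same row share its value
def agree_count(c):
    return sum(F.when(F.col(other) == F.col(c), 1).otherwise(0) for other in runs)

# Pair each (count, value) in a struct and take the lexicographically largest pair
pairs = [F.struct(agree_count(c).alias("cnt"), F.col(c).alias("val")) for c in runs]
top = F.sort_array(F.array(*pairs), asc=False)[0]

df1.select("id", "name", top["val"].alias("top_val"), top["cnt"].alias("top_val_cnt")).show()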

Apache Spark SQL: How to use GroupBy and Max to filter data

I have a given dataset with the following structure:
https://i.imgur.com/Kk7I1S1.png
I need to solve the below problem using Spark SQL DataFrames:
For each postcode find the customer that has had the most number of previous accidents. In the case of a tie, meaning more than one customer have the same highest number of accidents, just return any one of them. For each of these selected customers output the following columns: postcode, customer id, number of previous accidents.
I think you have missed providing the data that you mentioned in the image link. I have created my own data set by taking your problem as a reference. You can use the below code snippet just for reference, and replace the df data frame with your data set to add required columns such as id etc.
scala> val df = spark.read.format("csv").option("header","true").load("/user/nikhil/acc.csv")
df: org.apache.spark.sql.DataFrame = [postcode: string, customer: string ... 1 more field]
scala> df.show()
+--------+--------+---------+
|postcode|customer|accidents|
+--------+--------+---------+
| 1| Nikhil| 5|
| 2| Ram| 4|
| 1| Shyam| 3|
| 3| pranav| 1|
| 1| Suman| 2|
| 3| alex| 2|
| 2| Raj| 5|
| 4| arpit| 3|
| 1| darsh| 2|
| 1| rahul| 3|
| 2| kiran| 4|
| 3| baba| 4|
| 4| alok| 3|
| 1| Nakul| 5|
+--------+--------+---------+
scala> df.createOrReplaceTempView("tmptable")
scala> spark.sql(s"""SELECT postcode,customer, accidents FROM (SELECT postcode,customer, accidents, row_number() over (PARTITION BY postcode ORDER BY accidents desc) as rn from tmptable) WHERE rn = 1""").show(false)
+--------+--------+---------+
|postcode|customer|accidents|
+--------+--------+---------+
|3 |baba |4 |
|1 |Nikhil |5 |
|4 |arpit |3 |
|2 |Raj |5 |
+--------+--------+---------+
You can get the result with the following code in python:
from pyspark.sql import Row, Window
import pyspark.sql.functions as F
from pyspark.sql.window import *
l = [(1, '682308', 25), (1, '682308', 23), (2, '682309', 23), (1, '682309', 27), (2, '682309', 22)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(c_id=int(x[0]), postcode=x[1], accident=int(x[2])))
schemaPeople = sqlContext.createDataFrame(people)
result = schemaPeople.groupby("postcode", "c_id").agg(F.max("accident").alias("accident"))
new_result = result.withColumn("row_num", F.row_number().over(Window.partitionBy("postcode").orderBy(F.desc("accident")))).filter("row_num==1")
new_result.show()
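In Spark 3.3+ the same result can be written more compactly with max_by (a sketch that assumes max_by is available in your version; it is not part of the original answers):
import pyspark.sql.functions as F

# For each postcode, pick the customer id associated with the maximum accident count
result = schemaPeople.groupBy("postcode").agg(
    F.max_by("c_id", "accident").alias("customer_id"),
    F.max("accident").alias("accidents"),
)
result.show()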