How to replace column values using the replace() method with a dictionary? - apache-spark-sql

When replacing values of a column in a DataFrame using the replace method, how can we make use of a dictionary to do so? I am having problems with the syntax.
person = spark.createDataFrame([
    (0, "Bill Chambers", 0, [100]),
    (1, "Matei Zaharia", 1, [500, 250, 100]),
    (2, "Michael Armbrust", 1, [250, 100]),
    (1, 'Adam', 4, [200])])\
    .toDF("id", "name", "graduate_program", "spark_status")
diz={'Bill Chambers':'ABC','Adam':'DEF'}
I saw that the syntax is:
person.replace(diz,1,'name')
What is the significance of 1 here in the arguments?

First of all, I encourage you to check the pyspark documentation and look up the replace(to_replace, value=<no value>, subset=None) function definition.
You are passing a dictionary diz of key/value pairs, so the value argument 1 is ignored in your case; thus, you will get the following result:
>>> person.replace(diz,1,'name').show()
+---+----------------+----------------+---------------+
| id| name|graduate_program| spark_status|
+---+----------------+----------------+---------------+
| 0| ABC| 0| [100]|
| 1| Matei Zaharia| 1|[500, 250, 100]|
| 2|Michael Armbrust| 1| [250, 100]|
| 1| DEF| 4| [200]|
+---+----------------+----------------+---------------+
Note that in your usage only the name column, which you specified as subset, is affected, and you can clearly see that your dictionary's key/value pairs have been used as to_replace/value.
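Because the value argument is ignored when a dictionary is passed, the same call can be written without the 1; a minimal sketch of the equivalent form:
# the dictionary already supplies both to_replace and value, so only subset is needed
person.replace(diz, subset='name').show()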
Now, if you want to see how the value argument works, check this example:
>>> person.replace(['Adam', 'Bill Chambers'],['Bob', 'Omar'],'name').show()
+---+----------------+----------------+---------------+
| id| name|graduate_program| spark_status|
+---+----------------+----------------+---------------+
| 0| Omar| 0| [100]|
| 1| Matei Zaharia| 1|[500, 250, 100]|
| 2|Michael Armbrust| 1| [250, 100]|
| 1| Bob| 4| [200]|
+---+----------------+----------------+---------------+
Note that if you want to specify lists of to_replace/value for two columns, check the following usage of DataFrame.replace():
>>> person.replace([1, 0],[9, 5],['id', 'graduate_program']).show()
+---+----------------+----------------+---------------+
| id| name|graduate_program| spark_status|
+---+----------------+----------------+---------------+
| 5| Bill Chambers| 5| [100]|
| 9| Matei Zaharia| 9|[500, 250, 100]|
| 2|Michael Armbrust| 9| [250, 100]|
| 9| Adam| 4| [200]|
+---+----------------+----------------+---------------+
In the previous example we targeted two columns of the same value type (int), [id, graduate_program], and forced all ones to be replaced with 9 and all zeros to be replaced with 5.
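To tie this back to the original question, the same two-column replacement can also be expressed with a dictionary; a sketch of the equivalent call:
# keys act as to_replace and values as the replacements; both columns are int-typed
person.replace({1: 9, 0: 5}, subset=['id', 'graduate_program']).show()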
I hope this answers your question.

Related

Stratified Sampling with PySpark on Multiple Columns

I have a dataframe with millions of records, like this:
CLI_ID OCCUPA_ID DIG_LABEL
125 2705 1
328 2708 7
400 2712 1
401 2705 2
525 2708 1
I want to take a random sample of 100k rows that contains 70% of 2705, 20% of 2708, 10% of 2712 from OCCUPA_ID, and 50% of 1, 20% of 2 and 30% of 7 from DIG_LABEL.
How can I get this in Spark, using pyspark?
Use sampleBy instead of the sample function in pyspark, because sample only does plain sampling without reference to any column; for stratified sampling we need sampleBy.
sampleBy takes a column, the fractions, and an optional seed.
Consider,
df_sample = df.sampleBy(column, fractions, seed)
where,
column selects the column you want to stratify the sample on,
fractions defines the sampling ratio for each value, e.g. 10% is given as 0.1, and so on,
seed makes the sample reproducible; without it you will get different rows every time.
So the answer your question requires is,
dfsample = df.sampleBy("OCCUPA_ID",{"2705":0.7,"2708":0.2,"2712":0.1},42).sampleBy("DIG_LABEL",{"1":0.5,"2":0.2,"7":0.3},42)
Just sample twice: first on OCCUPA_ID and then on DIG_LABEL.
42 is the seed both times.
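As a concrete illustration of the parameters described above, here is a minimal sketch built from the sample rows in the question (variable names are mine; the fractions keys are integers here because the toy column is integer-typed):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(125, 2705, 1), (328, 2708, 7), (400, 2712, 1), (401, 2705, 2), (525, 2708, 1)],
    ["CLI_ID", "OCCUPA_ID", "DIG_LABEL"])

# fractions maps each stratum value to its sampling ratio; seed makes the sample reproducible
sample = df.sampleBy("OCCUPA_ID", fractions={2705: 0.7, 2708: 0.2, 2712: 0.1}, seed=42)
sample.show()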
You can use the sampleBy method on pyspark DataFrames to perform a stratified sample, passing the column name and a dictionary of fractions for the values within that column. For example:
spark_df.sampleBy("OCCUPA_ID", fractions={"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42).show()
+------+---------+---------+
|CLI_ID|OCCUPA_ID|DIG_LABEL|
+------+---------+---------+
| 1| 2705| 7|
| 4| 2705| 1|
| 5| 2705| 7|
| 7| 2708| 2|
| 12| 2705| 1|
| 16| 2708| 2|
| 18| 2708| 2|
| 20| 2705| 7|
| 25| 2705| 2|
| 26| 2705| 2|
| 38| 2705| 7|
| 40| 2705| 1|
| 44| 2705| 2|
| 48| 2708| 7|
| 50| 2708| 2|
| 53| 2705| 1|
| 57| 2705| 1|
| 58| 2712| 1|
| 61| 2705| 2|
| 63| 2708| 7|
+------+---------+---------+
only showing top 20 rows
Since you want one pyspark DataFrame with two samplings performed from two different columns, you can chain the sampleBy methods together:
spark_stratified_sample_df = spark_df.sampleBy("OCCUPA_ID", fractions={"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42).sampleBy("DIG_LABEL", fractions={"1": 0.5, "2": 0.2, "7": 0.3}, seed=42)

Selecting 'Exclusive Rows' from a PySpark Dataframe

I have a PySpark dataframe like this:
+----------+-----+
|account_no|types|
+----------+-----+
| 1| K|
| 1| A|
| 1| S|
| 2| M|
| 2| D|
| 2| S|
| 3| S|
| 3| S|
| 4| M|
| 5| K|
| 1| S|
| 6| S|
+----------+-----+
and I am trying to pick the account numbers for which Exclusively 'S' exists.
For example: Even though '1' has type ='S', I will not pick it because it has also got other types. But I will pick 3 and 6, because they have just one type 'S'.
What I am doing right now is:
- First get all accounts for which 'K' exists and remove them; which in this example removes '1' and '5'
- Second find all accounts for which 'D' exists and remove them, which removes '2'
- Third find all accounts for which 'M' exists, and remove '4' ('2' has also got 'M' but it was removed at step 2)
- Fourth find all accounts for which 'A' exists, and remove them
So, now '1', '2', '4' and '5' are removed and I get '3' and '6' which have exclusive 'S'.
But this is a long process, how do I optimize it?
Thank you
Another alternative is counting distinct over a window and then filtering where the distinct count == 1 and types == 'S'; to preserve the ordering you can assign a monotonically increasing id and then orderBy it.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

W = Window.partitionBy('account_no')
out = (df.withColumn("idx", F.monotonically_increasing_id())
         .withColumn("Distinct", F.approx_count_distinct(F.col("types")).over(W)).orderBy("idx")
         .filter("Distinct==1 AND types =='S'")).drop('idx', 'Distinct')
out.show()
+----------+-----+
|account_no|types|
+----------+-----+
| 3| S|
| 3| S|
| 6| S|
+----------+-----+
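One caveat on the snippet above: approx_count_distinct is an approximation. An exact per-window distinct count is possible too; countDistinct itself is not supported over a window, but collect_set plus size gives the same result. A sketch of a drop-in replacement for the "Distinct" column, using the F and W already defined above:
exact_distinct = F.size(F.collect_set("types").over(W))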
One way to do this is to use Window functions. First we get a sum of the number of S entries in each account_no grouping. Then, in the filter, we compare that to the total number of entries for the group; if they match, we keep that account.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w=Window().partitionBy("account_no")
w1=Window().partitionBy("account_no").orderBy("types")
df.withColumn("sum_S", F.sum(F.when(F.col("types")=='S', F.lit(1)).otherwise(F.lit(0))).over(w))\
.withColumn("total", F.max(F.row_number().over(w1)).over(w))\
.filter('total=sum_S').drop("total","Sum_S").show()
#+----------+-----+
#|account_no|types|
#+----------+-----+
#| 6| S|
#| 3| S|
#| 3| S|
#+----------+-----+
You can simply compute the number of distinct types each account has and then keep the 's' accounts that have only one distinct type.
Here is my code for it:
from pyspark.sql.functions import countDistinct, col
data = [(1, 'k'),
(1, 'a'),
(1, 's'),
(2, 'm'),
(2, 'd'),
(2, 's'),
(3, 's'),
(3, 's'),
(4, 'm'),
(5, 'k'),
(1, 's'),
(6, 's')]
df = spark.createDataFrame(data, ['account_no', 'types']).distinct()
exclusive_s_accounts = (df.groupBy('account_no').agg(countDistinct('types').alias('distinct_count'))
                          .join(df, 'account_no')
                          .where((col('types') == 's') & (col('distinct_count') == 1))
                          .drop('distinct_count'))
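A quick sanity check; based on the sample data in the question, only the exclusively-'s' accounts should remain:
exclusive_s_accounts.show()
# expected: account_no 3 and 6, each with types 's'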
Another alternative approach could be to collect all the types under one column and then apply a filter to exclude the accounts that have any non-"S" values.
from pyspark.sql import functions as F
from pyspark.sql.functions import concat_ws, collect_list, col
df = spark.read.csv("/Users/Downloads/account.csv", header=True, inferSchema=True, sep=",")
type_df = df.groupBy("account_no").agg(concat_ws(",", collect_list("types")).alias("all_types")).select(col("account_no"), col("all_types"))
+----------+---------+
|account_no|all_types|
+----------+---------+
| 1| K,A,S,S|
| 6| S|
| 3| S,S|
| 5| K|
| 4| M|
| 2| M,D,S|
+----------+---------+
Further filtering using a regular expression:
only_s_df = type_df.withColumn("S_status",F.col("all_types").rlike("K|A|M|D"))
only_s_df.show()
+----------+---------+----------+
|account_no|all_types|S_status |
+----------+---------+----------+
| 1| K,A,S,S| true|
| 6| S| false|
| 3| S,S| false|
| 5| K| true|
| 4| M| true|
| 2| M,D,S| true|
+----------+---------+----------+
I hope this approach gets you to the answer and helps with further processing.
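To finish the approach, one would presumably keep only the rows where S_status is false and take their account numbers, e.g. (a sketch):
only_s_df.filter(~F.col("S_status")).select("account_no").show()
# expected: accounts 3 and 6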

How to get value_counts for a spark row?

I have a spark dataframe with 3 columns storing 3 different predictions. I want to know the count of each output value so as to pick the value that was obtained the maximum number of times as the final output.
I can do this in pandas easily by calling my lambda function for each row to get value_counts as shown below. I have converted my spark df to a pandas df here, but I need to be able to perform a similar operation on the spark df directly.
r=[Row(run_1=1, run_2=2, run_3=1, name='test run', id=1)]
df1=spark.createDataFrame(r)
df1.show()
df2=df1.toPandas()
r=df2.iloc[0]
val_counts=r[['run_1','run_2','run_3']].value_counts()
print(val_counts)
top_val=val_counts.index[0]
top_val_cnt=val_counts.values[0]
print('Majority output = %s, occured %s out of 3 times'%(top_val,top_val_cnt))
The output tells me that the value 1 occurred the most times, twice in this case:
+---+--------+-----+-----+-----+
| id| name|run_1|run_2|run_3|
+---+--------+-----+-----+-----+
| 1|test run| 1| 2| 1|
+---+--------+-----+-----+-----+
1 2
2 1
Name: 0, dtype: int64
Majority output = 1, occured 2 out of 3 times
I am trying to write a udf function which can take each of the df1 rows and get the top_val and top_val_cnt. Is there a way to achieve this using spark df?
This is Scala, but the Python code should be similar; maybe it will help you.
val df1 = Seq((1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)).toDF()
df1.show()
df1.select(array('*)).map(s => {
  val list = s.getList(0)
  (list.toString(), list.toArray.groupBy(i => i).mapValues(_.size).toList.toString())
}).show(false)
output:
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| 1| 1| 2|
| 1| 2| 3| 3|
| 2| 2| 2| 2|
+---+---+---+---+
+------------+-------------------------+
|_1 |_2 |
+------------+-------------------------+
|[1, 1, 1, 2]|List((2,1), (1,3)) |
|[1, 2, 3, 3]|List((2,1), (1,1), (3,2))|
|[2, 2, 2, 2]|List((2,4)) |
+------------+-------------------------+
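For reference, a rough PySpark equivalent of the Scala snippet above could look like this (a sketch only; the variable and column names are my own):
from pyspark.sql import functions as F

df1 = spark.createDataFrame([(1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)])

# pack each row's columns into one array, then count occurrences of each value per row
counts = (df1.select(F.array(*df1.columns).alias("vals")).rdd
          .map(lambda r: (r.vals, str({v: r.vals.count(v) for v in set(r.vals)})))
          .toDF(["vals", "value_counts"]))
counts.show(truncate=False)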
Let's have a test dataframe similar to yours.
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4)]
df=spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])
newdf = df.rdd.map(lambda x : (x[0],x[1],x[2:])) \
        .map(lambda x : (x[0],x[1],x[2][0],x[2][1],x[2][2],[max(set(x[2]), key=x[2].count)])) \
        .toDF(['id','test','run_1','run_2','run_3','most_frequent'])
>>> newdf.show()
+---+--------+-----+-----+-----+-------------+
| id| test|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| [1]|
| 2|test run| 3| 2| 3| [3]|
| 3|test run| 4| 4| 4| [4]|
+---+--------+-----+-----+-----+-------------+
Or, if you need to handle the case where every item in the list is different, i.e. returning a null:
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4),(4,'test run',1,2,3)]
df=spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])
from pyspark.sql.functions import udf
@udf
def most_frequent(*mylist):
    counter = 1
    num = mylist[0]
    for i in mylist:
        curr_frequency = mylist.count(i)
        if curr_frequency > counter:
            counter = curr_frequency
            num = i
            return num
    else:
        return None
We initialize counter to 1 and return a value only if its count is greater than 1; otherwise the loop's else branch returns None.
df.withColumn('most_frequent', most_frequent('run_1', 'run_2', 'run_3')).show()
+---+--------+-----+-----+-----+-------------+
| id| name|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| 1|
| 2|test run| 3| 2| 3| 3|
| 3|test run| 4| 4| 4| 4|
| 4|test run| 1| 2| 3| null|
+---+--------+-----+-----+-----+-------------+

Partition PySpark DataFrame depending on unique values in column (Custom Partitioning)

I have a PySpark data frame in which I have separate columns for names, types, days and values. An example of the dataframe can be seen below:
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
| name4| b| 1| 145|
| name5| b| 1| 185|
| name6| c| 1| 155|
| name7| c| 1| 160|
| name8| a| 2| 120|
| name9| a| 2| 110|
|name10| b| 2| 125|
|name11| b| 2| 185|
|name12| c| 3| 195|
+------+----+---+-----+
For a selected value of Type, I want to create separate dataframes depending on the unique values in the column titled Day. Let's say I have chosen a as my preferred Type. In the aforementioned example, I have three unique values of Day (viz. 1, 2, 3). For each unique value of Day that has a row with the chosen Type a (that is, days 1 and 2 in the above data), I want to create a dataframe that has all rows with the chosen Type and Day. In the example mentioned above, I will have two dataframes, which will look as below
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
+------+----+---+-----+
and
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name8| a| 2| 120|
| name9| a| 2| 110|
+------+----+---+-----+
How can I do this? In the actual data that I will be working with, I have millions of columns. So, I want to know about the most efficient way in which I can realize the above mentioned aim.
You can use the below mentioned code to generate the example given above.
from pyspark.sql import *
import numpy as np
Stats = Row("Name", "Type", "Day", "Value")
stat1 = Stats('name1', 'a', 1, 140)
stat2 = Stats('name2', 'a', 1, 180)
stat3 = Stats('name3', 'a', 1, 150)
stat4 = Stats('name4', 'b', 1, 145)
stat5 = Stats('name5', 'b', 1, 185)
stat6 = Stats('name6', 'c', 1, 155)
stat7 = Stats('name7', 'c', 1, 160)
stat8 = Stats('name8', 'a', 2, 120)
stat9 = Stats('name9', 'a', 2, 110)
stat10 = Stats('name10', 'b', 2, 125)
stat11 = Stats('name11', 'b', 2, 185)
stat12 = Stats('name12', 'c', 3, 195)
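One more line (implied but not shown above) assembles these Row objects into the example DataFrame; the variable name df is just a placeholder:
df = spark.createDataFrame([stat1, stat2, stat3, stat4, stat5, stat6,
                            stat7, stat8, stat9, stat10, stat11, stat12])
df.show()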
You can just use df.repartition("Type", "Day")
See the PySpark documentation for repartition.
When I validate using the following function, I get the output shown below.
def validate(partition):
    count = 0
    for row in partition:
        print(row)
        count += 1
    print(count)
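The call that applies validate is not shown here; presumably it is run once per partition after the repartition (df stands for the DataFrame shown below), something like:
# prints appear in the worker output (the console when running in local mode)
df.repartition("user_id").rdd.foreachPartition(validate)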
My data
+------+--------------------+-------+-------+
|amount| trans_date|user_id|row_num|
+------+--------------------+-------+-------+
| 99.1|2019-06-04T00:00:...| 101| 1|
| 89.27|2019-06-04T00:00:...| 102| 2|
| 89.1|2019-03-04T00:00:...| 102| 3|
| 73.11|2019-09-10T00:00:...| 103| 4|
|-69.81|2019-09-11T00:00:...| 101| 5|
| 12.51|2018-12-14T00:00:...| 101| 6|
| 43.23|2018-09-11T00:00:...| 101| 7|
+------+--------------------+-------+-------+
After df.repartition("user_id") I get the following:
Output
Row(amount=73.11, trans_date='2019-09-10T00:00:00.000+05:30', user_id='103', row_num=4)
1
Row(amount=89.27, trans_date='2019-06-04T00:00:00.000+05:30', user_id='102', row_num=2)
Row(amount=89.1, trans_date='2019-03-04T00:00:00.000+05:30', user_id='102', row_num=3)
2
Row(amount=99.1, trans_date='2019-06-04T00:00:00.000+05:30', user_id='101', row_num=1)
Row(amount=-69.81, trans_date='2019-09-11T00:00:00.000+05:30', user_id='101', row_num=5)
Row(amount=12.51, trans_date='2018-12-14T00:00:00.000+05:30', user_id='101', row_num=6)
Row(amount=43.23, trans_date='2018-09-11T00:00:00.000+05:30', user_id='101', row_num=7)
4

Spark SQL: Is there a way to distinguish columns with same name?

I have a csv with a header that has columns with the same name.
I want to process them with spark using only SQL and be able to refer to these columns unambiguously.
Ex.:
id name age height name
1 Alex 23 1.70
2 Joseph 24 1.89
I want to get only the first name column using only Spark SQL.
As mentioned in the comments, I think that the less error-prone method would be to have the schema of the input data changed.
Yet, in case you are looking for a quick workaround, you can simply index the duplicated names of the columns.
For instance, let's create a dataframe with three id columns.
val df = spark.range(3)
.select('id * 2 as "id", 'id * 3 as "x", 'id, 'id * 4 as "y", 'id)
df.show
+---+---+---+---+---+
| id| x| id| y| id|
+---+---+---+---+---+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+---+---+---+---+---+
Then I can use toDF to set new column names. Let's assume I know that only id is duplicated. If we don't know that, adding the extra logic to figure out which columns are duplicated would not be very difficult.
var i = -1
val names = df.columns.map( n =>
if(n == "id") {
i+=1
s"id_$i"
} else n )
val new_df = df.toDF(names : _*)
new_df.show
+----+---+----+---+----+
|id_0| x|id_1| y|id_2|
+----+---+----+---+----+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+----+---+----+---+----+
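Since the question asks for plain Spark SQL, here is a hedged PySpark sketch of the same renaming trick, finishing with a SQL query against a temp view (the view name people and the variable names are my own):
from pyspark.sql import functions as F

df = spark.range(3).select(
    (F.col("id") * 2).alias("id"),
    (F.col("id") * 3).alias("x"),
    F.col("id"),
    (F.col("id") * 4).alias("y"),
    F.col("id"))

# index the duplicated names so each column can be referenced unambiguously
i = -1
names = []
for n in df.columns:
    if n == "id":
        i += 1
        names.append("id_%d" % i)
    else:
        names.append(n)

renamed = df.toDF(*names)
renamed.createOrReplaceTempView("people")
spark.sql("SELECT id_0 FROM people").show()  # only the first id column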