Convert Column of List to a Dataframe Column - dataframe

I have a column of lists in a spark dataframe.
+---+------------+
| c1|          c2|
+---+------------+
|  a|[1, 0, 1, 1]|
|  b|[0, 1, 1, 0]|
|  c|[1, 1, 0, 0]|
+---+------------+
How do I convert this into another spark dataframe where each list is turned into a dataframe column, and each entry from column 'c1' becomes the name of the new column? Something like below:
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  0|  1|
|  0|  1|  1|
|  1|  1|  0|
|  1|  0|  0|
+---+---+---+
Note: I did think about following this: Convert Column of List to Dataframe and then taking a transpose of the resultant matrix. But this creates quite a lot of columns (the lists I have are pretty large) and therefore isn't an efficient solution.
Any help is welcome.

import pyspark.sql.functions as F

# Not a part of the solution, only used to generate the data sample
df = spark.sql("select stack(3, 'a', array(1, 0, 1, 1), 'b', array(0, 1, 1, 0), 'c', array(1, 1, 0, 0)) as (c1, c2)")

# Pivot the c1 values into columns holding their arrays, zip the arrays element-wise, and explode the result into rows
df.groupBy().pivot('c1').agg(F.first('c2')).selectExpr('inline(arrays_zip(*))').show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 0| 1|
| 0| 1| 1|
| 1| 1| 0|
| 1| 0| 0|
+---+---+---+
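Broken into steps, the one-liner does the following (the same transformation, with the column names written out instead of relying on * expansion):
# 1. pivot the c1 values into columns; each new column holds that key's array
pivoted = df.groupBy().pivot('c1').agg(F.first('c2'))
pivoted.show(truncate=False)  # one row: a=[1, 0, 1, 1], b=[0, 1, 1, 0], c=[1, 1, 0, 0]
# 2. zip the arrays element-wise into an array of structs, then explode the structs into rows
pivoted.selectExpr('inline(arrays_zip(a, b, c))').show()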
This can easily be tested on large datasets:
df = spark.sql("select id as c1, transform(sequence(1,10000), e -> tinyint(round(rand()))) as c2 from range(10000)")
Just completed a successful execution of 10K arrays, 10K elements each, on a VM with 4 cores & 32 GB RAM (Azure Databricks).
Took 5.35 minutes.
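For completeness, a sketch of how that timing can be reproduced (the count() at the end just forces execution; numbers will of course vary by cluster):
import time
import pyspark.sql.functions as F

df = spark.sql("select id as c1, transform(sequence(1,10000), e -> tinyint(round(rand()))) as c2 from range(10000)")

start = time.time()
rows = df.groupBy().pivot('c1').agg(F.first('c2')).selectExpr('inline(arrays_zip(*))').count()
print(rows, "rows in", round(time.time() - start, 1), "seconds")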

Related

Finding the value of a column based on 2 other columns

I have a specific problem where I want to derive the value of the bu_id field from id and matched_id.
When there is some value in the matched_id column, bu_id should be the same for that particular id and the ids in the corresponding matched_id.
When matched_id is blank, bu_id should be the same as id.
input
+---+----------+
|id |matched_id|
+---+----------+
|0  |7,8       |
|1  |          |
|2  |4         |
|3  |5,9       |
|4  |2         |
|5  |3,9       |
|6  |          |
|7  |0,8       |
|8  |0,7       |
|9  |3,5       |
+---+----------+
output
+---+----------+-----+
|id |matched_id|bu_id|
+---+----------+-----+
|0  |7,8       |0    |
|1  |          |1    |
|2  |4         |2    |
|3  |5,9       |3    |
|4  |2         |2    |
|5  |3,9       |3    |
|6  |          |6    |
|7  |0,8       |0    |
|8  |0,7       |0    |
|9  |3,5       |3    |
+---+----------+-----+
Can anyone help me with how to approach this problem? Thanks in advance.
We should try to use functions exclusively from the pyspark.sql.functions module because these are optimized for pyspark dataframes (see here), whereas udfs are not and should be avoided when possible.
To achieve the desired output pyspark dataframe, we can concatenate the "id" and "matched_id" columns together, convert that string into a list of strings using split, cast the result as an array of integers, and take the minimum of the array. We can get away with not having to worry about the blank strings because they get converted into null, and F.array_min drops nulls from consideration. This can be done with the following line of code (and while it is a little hard to read, it gets the job done):
import pyspark.sql.functions as F
df = spark.createDataFrame(
    [
        ("0", "7,8"),
        ("1", ""),
        ("2", "4"),
        ("3", "5,9"),
        ("4", "2"),
        ("5", "3,9"),
        ("6", ""),
        ("7", "0,8"),
        ("8", "0,7"),
        ("9", "3,5"),
    ],
    ["id", "matched_id"]
)
df.withColumn(
    "bu_id",
    F.array_min(F.split(F.concat(F.col("id"), F.lit(","), F.col("matched_id")), ",").cast("array<int>"))
).show()
Output:
+---+----------+-----+
| id|matched_id|bu_id|
+---+----------+-----+
| 0| 7,8| 0|
| 1| | 1|
| 2| 4| 2|
| 3| 5,9| 3|
| 4| 2| 2|
| 5| 3,9| 3|
| 6| | 6|
| 7| 0,8| 0|
| 8| 0,7| 0|
| 9| 3,5| 3|
+---+----------+-----+
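To see why the blank strings are harmless, here is a small diagnostic sketch (the aliases as_strings and as_ints are just illustrative names) showing the intermediate values for the row with id = 1:
df.select(
    "id",
    "matched_id",
    F.split(F.concat(F.col("id"), F.lit(","), F.col("matched_id")), ",").alias("as_strings"),
    F.split(F.concat(F.col("id"), F.lit(","), F.col("matched_id")), ",").cast("array<int>").alias("as_ints"),
).filter("id = '1'").show(truncate=False)
# as_strings is [1, ] and as_ints is [1, null]; F.array_min skips the null and returns 1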
Update: in the case of non-numeric strings in columns "id" and "matched_id", we can no longer cast to an array of integers, so we can instead use pyspark functions F.when and .otherwise (see here) to set our new column to the "id" column when "matched_id" is an empty string "", and apply our other longer nested function when "matched_id" is non-empty.
df2 = spark.createDataFrame(
    [
        ("0", "7,8"),
        ("1", ""),
        ("2", "4"),
        ("3", "5,9"),
        ("4", "2"),
        ("5", "3,9"),
        ("6", ""),
        ("7", "0,8"),
        ("8", "0,7"),
        ("9", "3,5"),
        ("x", ""),
        ("x", "y,z")
    ],
    ["id", "matched_id"]
)
df2.withColumn(
    "bu_id",
    F.when(
        F.col("matched_id") != "",
        F.array_min(F.split(F.concat(F.col("id"), F.lit(","), F.col("matched_id")), ","))
    ).otherwise(F.col("id"))
).show()
Output:
+---+----------+-----+
| id|matched_id|bu_id|
+---+----------+-----+
| 0| 7,8| 0|
| 1| | 1|
| 2| 4| 2|
| 3| 5,9| 3|
| 4| 2| 2|
| 5| 3,9| 3|
| 6| | 6|
| 7| 0,8| 0|
| 8| 0,7| 0|
| 9| 3,5| 3|
| x| | x|
| x| y,z| x|
+---+----------+-----+
To answer this question I assumed that the logic you are looking to implement is:
If the matched_id column is null, then bu_id should be the same as id.
If the matched_id column is not null, we should consider the values listed in both the id and matched_id columns and bu_id should be the minimum of those values.
The Set-Up
# imports to include
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
# making your dataframe
df = spark.createDataFrame(
    [
        ('0', '7,8'),
        ('1', ''),
        ('2', '4'),
        ('3', '5,9'),
        ('4', '2'),
        ('5', '3,9'),
        ('6', ''),
        ('7', '0,8'),
        ('8', '0,7'),
        ('9', '3,5'),
    ],
    ['id', 'matched_id'])
print(df.schema.fields)
df.show(truncate=False)
In this df, both the id and matched_id columns are StringType data types. The code that follows builds off this assumption. You can check the column types in your df by running print(df.schema.fields).
+---+----------+
|id |matched_id|
+---+----------+
|0  |7,8       |
|1  |          |
|2  |4         |
|3  |5,9       |
|4  |2         |
|5  |3,9       |
|6  |          |
|7  |0,8       |
|8  |0,7       |
|9  |3,5       |
+---+----------+
The Logic
To implement the logic for bu_id, we create a function called bu_calculation that defines the logic. Then we wrap the function in a pyspark sql UDF. The bu_id column is then created by passing the columns we need to evaluate (the id and matched_id columns) into the UDF:
# create custom function with the logic for bu_id
def bu_calculation(id_col, matched_id_col):
    id_int = int(id_col)
    # turn the string in the matched_id column into a list and remove empty values from the list
    matched_id_list = list(filter(None, matched_id_col.split(",")))
    if len(matched_id_list) > 0:
        # if matched_id column has values, convert strings to ints
        all_ids = [int(x) for x in matched_id_list]
        # join id column values with matched_id column values
        all_ids.append(id_int)
        # return minimum value
        return min(all_ids)
    else:
        # if matched_id column is empty return the id column value
        return id_int
# apply custom bu_calculation function to pyspark sql udf
# the use of IntegerType() here enforces that the bu_calculation function has to return an int
bu_udf = F.udf(bu_calculation, IntegerType())
# make a new column called bu_id using the pyspark sql udf we created called bu_udf
df = df.withColumn('bu_id', bu_udf('id', 'matched_id'))
df.show(truncate=False)
+---+----------+-----+
|id |matched_id|bu_id|
+---+----------+-----+
|0  |7,8       |0    |
|1  |          |1    |
|2  |4         |2    |
|3  |5,9       |3    |
|4  |2         |2    |
|5  |3,9       |3    |
|6  |          |6    |
|7  |0,8       |0    |
|8  |0,7       |0    |
|9  |3,5       |3    |
+---+----------+-----+
More about the pyspark sql udf function here: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.udf.html

Determine if pyspark DataFrame row value is present in other columns

I'm working with a dataframe in pyspark, and need to evaluate row by row if a value is present in other columns of the dataframe. As an example, given this dataframe:
df:
+---------+--------------+-------+-------+-------+
|Subject |SubjectTotal |TypeA |TypeB |TypeC |
+---------+--------------+-------+-------+-------+
|Subject1 |10 |5 |3 |2 |
+---------+--------------+-------+-------+-------+
|Subject2 |15 |0 |15 |0 |
+---------+--------------+-------+-------+-------+
|Subject3 |5 |0 |0 |5 |
+---------+--------------+-------+-------+-------+
As an output, I need to determine which Type has 100% of the SubjectTotal. So my output would look like this:
df_output:
+---------+--------------+
|Subject |Type |
+---------+--------------+
|Subject2 |TypeB |
+---------+--------------+
|Subject3 |TypeC |
+---------+--------------+
Is it even possible?
Thanks!
You can try the when().otherwise() PySpark SQL function, or a case statement in SQL:
import pyspark.sql.functions as F
df = spark.createDataFrame(
    [
        ("Subject1", 10, 5, 3, 2),
        ("Subject2", 15, 0, 15, 0),
        ("Subject3", 5, 0, 0, 5)
    ],
    ("subject", "subjectTotal", "TypeA", "TypeB", "TypeC"))
df.show()
+--------+------------+-----+-----+-----+
| subject|subjectTotal|TypeA|TypeB|TypeC|
+--------+------------+-----+-----+-----+
|Subject1| 10| 5| 3| 2|
|Subject2| 15| 0| 15| 0|
|Subject3| 5| 0| 0| 5|
+--------+------------+-----+-----+-----+
df.withColumn("Type", F.
when(F.col("subjectTotal") == F.col("TypeA"), "TypeA").
when(F.col("subjectTotal") == F.col("TypeB"), "TypeB").
when(F.col("subjectTotal") == F.col("TypeC"), "TypeC").
otherwise(None)).show()
+--------+------------+-----+-----+-----+-----+
| subject|subjectTotal|TypeA|TypeB|TypeC| Type|
+--------+------------+-----+-----+-----+-----+
|Subject1| 10| 5| 3| 2| null|
|Subject2| 15| 0| 15| 0|TypeB|
|Subject3| 5| 0| 0| 5|TypeC|
+--------+------------+-----+-----+-----+-----+
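The SQL case-statement variant mentioned above would look roughly like this (a sketch; it assumes the DataFrame has been registered as a temp view, here called subjects):
df.createOrReplaceTempView("subjects")
spark.sql("""
    SELECT subject,
           CASE WHEN subjectTotal = TypeA THEN 'TypeA'
                WHEN subjectTotal = TypeB THEN 'TypeB'
                WHEN subjectTotal = TypeC THEN 'TypeC'
           END AS Type
    FROM subjects
""").show()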
You can use a when expression within a list comprehension over all the TypeX columns, then coalesce the list of expressions:
from pyspark.sql import functions as F
df1 = df.select(
    F.col("Subject"),
    F.coalesce(*[F.when(F.col(c) == F.col("SubjectTotal"), F.lit(c)) for c in df.columns[2:]]).alias("Type")
).filter("Type is not null")
df1.show()
#+--------+-----+
#| Subject| Type|
#+--------+-----+
#|Subject2|TypeB|
#|Subject3|TypeC|
#+--------+-----+
You can unpivot the dataframe using stack and filter the rows where SubjectTotal is equal to the value in the type columns:
df2 = df.selectExpr(
    'Subject',
    'SubjectTotal',
    "stack(3, 'TypeA', TypeA, 'TypeB', TypeB, 'TypeC', TypeC) as (type, val)"
).filter('SubjectTotal = val').select('Subject', 'type')
df2.show()
+--------+-----+
| Subject| type|
+--------+-----+
|Subject2|TypeB|
|Subject3|TypeC|
+--------+-----+
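For reference, this is roughly what the intermediate unpivoted frame looks like before the filter (same sample data; one row per Subject/Type pair):
df.selectExpr(
    'Subject',
    'SubjectTotal',
    "stack(3, 'TypeA', TypeA, 'TypeB', TypeB, 'TypeC', TypeC) as (type, val)"
).show()
# e.g. Subject1 expands to (Subject1, 10, TypeA, 5), (Subject1, 10, TypeB, 3), (Subject1, 10, TypeC, 2)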

How to get value_counts for a spark row?

I have a spark dataframe with 3 columns storing 3 different predictions. I want to know the count of each output value so as to pick the value that was obtained the maximum number of times as the final output.
I can do this in pandas easily by calling a lambda function for each row to get value_counts as shown below. I have converted my spark df to a pandas df here, but I need to be able to perform a similar operation on the spark df directly.
r=[Row(run_1=1, run_2=2, run_3=1, name='test run', id=1)]
df1=spark.createDataFrame(r)
df1.show()
df2=df1.toPandas()
r=df2.iloc[0]
val_counts=r[['run_1','run_2','run_3']].value_counts()
print(val_counts)
top_val=val_counts.index[0]
top_val_cnt=val_counts.values[0]
print('Majority output = %s, occurred %s out of 3 times' % (top_val, top_val_cnt))
The output tells me that the value 1 occurred the most number of times, twice in this case:
+---+--------+-----+-----+-----+
| id| name|run_1|run_2|run_3|
+---+--------+-----+-----+-----+
| 1|test run| 1| 2| 1|
+---+--------+-----+-----+-----+
1 2
2 1
Name: 0, dtype: int64
Majority output = 1, occurred 2 out of 3 times
I am trying to write a udf function which can take each of the df1 rows and get the top_val and top_val_cnt. Is there a way to achieve this using spark df?
This is Scala, but the Python code should be similar; maybe it will help you:
val df1 = Seq((1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)).toDF()
df1.show()
df1.select(array('*)).map(s=>{
val list = s.getList(0)
(list.toString(),list.toArray.groupBy(i => i).mapValues(_.size).toList.toString())
}).show(false)
output:
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| 1| 1| 2|
| 1| 2| 3| 3|
| 2| 2| 2| 2|
+---+---+---+---+
+------------+-------------------------+
|_1 |_2 |
+------------+-------------------------+
|[1, 1, 1, 2]|List((2,1), (1,3)) |
|[1, 2, 3, 3]|List((2,1), (1,1), (3,2))|
|[2, 2, 2, 2]|List((2,4)) |
+------------+-------------------------+
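A rough PySpark translation of the same idea (a sketch: pack each row's values into an array, then count occurrences per row with collections.Counter over the RDD):
from collections import Counter
import pyspark.sql.functions as F

df1 = spark.createDataFrame([(1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)])

(df1.select(F.array(*df1.columns).alias('values'))
    .rdd
    .map(lambda row: (row['values'], str(Counter(row['values']))))
    .toDF(['values', 'value_counts'])
    .show(truncate=False))
# Counter(row['values']).most_common(1)[0] would give the majority value and its count directly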
Let's have a test dataframe similar to yours.
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4)]
df=spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])
newdf = df.rdd.map(lambda x: (x[0], x[1], x[2:])) \
    .map(lambda x: (x[0], x[1], x[2][0], x[2][1], x[2][2], [max(set(x[2]), key=x[2].count)])) \
    .toDF(['id', 'test', 'run_1', 'run_2', 'run_3', 'most_frequent'])
>>> newdf.show()
+---+--------+-----+-----+-----+-------------+
| id| test|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| [1]|
| 2|test run| 3| 2| 3| [3]|
| 3|test run| 4| 4| 4| [4]|
+---+--------+-----+-----+-----+-------------+
Or, if you need to handle the case where every item in the list is different, i.e. return a null:
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4),(4,'test run',1,2,3)]
df=spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])
from pyspark.sql.functions import udf

@udf
def most_frequent(*mylist):
    counter = 1
    num = mylist[0]
    for i in mylist:
        curr_frequency = mylist.count(i)
        if curr_frequency > counter:
            counter = curr_frequency
            num = i
            return num
    else:
        # the loop finished without finding a value that occurs more than once
        return None
The counter is initialized to 1, and a value is returned only if its count is greater than 1; if the loop finishes without finding such a value, null is returned.
df.withColumn('most_frequent', most_frequent('run_1', 'run_2', 'run_3')).show()
+---+--------+-----+-----+-----+-------------+
| id| name|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| 1|
| 2|test run| 3| 2| 3| 3|
| 3|test run| 4| 4| 4| 4|
| 4|test run| 1| 2| 3| null|
+---+--------+-----+-----+-----+-------------+

Apache Spark SQL: How to use GroupBy and Max to filter data

I have a given dataset with the following structure:
https://i.imgur.com/Kk7I1S1.png
I need to solve the below problem using SparkSQL: Dataframes
For each postcode find the customer that has had the most number of previous accidents. In the case of a tie, meaning more than one customer has the same highest number of accidents, just return any one of them. For each of these selected customers output the following columns: postcode, customer id, number of previous accidents.
I think you have missed providing the data that you mentioned in the image link. I have created my own data set, taking your problem as a reference. You can use the below code snippet for reference, and replace the df data frame with your own data set, adding required columns such as id etc.
scala> val df = spark.read.format("csv").option("header","true").load("/user/nikhil/acc.csv")
df: org.apache.spark.sql.DataFrame = [postcode: string, customer: string ... 1 more field]
scala> df.show()
+--------+--------+---------+
|postcode|customer|accidents|
+--------+--------+---------+
| 1| Nikhil| 5|
| 2| Ram| 4|
| 1| Shyam| 3|
| 3| pranav| 1|
| 1| Suman| 2|
| 3| alex| 2|
| 2| Raj| 5|
| 4| arpit| 3|
| 1| darsh| 2|
| 1| rahul| 3|
| 2| kiran| 4|
| 3| baba| 4|
| 4| alok| 3|
| 1| Nakul| 5|
+--------+--------+---------+
scala> df.createOrReplaceTempView("tmptable")
scala> spark.sql(s"""SELECT postcode,customer, accidents FROM (SELECT postcode,customer, accidents, row_number() over (PARTITION BY postcode ORDER BY accidents desc) as rn from tmptable) WHERE rn = 1""").show(false)
+--------+--------+---------+
|postcode|customer|accidents|
+--------+--------+---------+
|3 |baba |4 |
|1 |Nikhil |5 |
|4 |arpit |3 |
|2 |Raj |5 |
+--------+--------+---------+
You can get the result with the following code in Python: it takes the maximum accident count per (postcode, customer) pair and then keeps the top customer per postcode using a row_number window:
from pyspark.sql import Row, Window
import pyspark.sql.functions as F
from pyspark.sql.window import *
l = [(1, '682308', 25), (1, '682308', 23), (2, '682309', 23), (1, '682309', 27), (2, '682309', 22)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(c_id=int(x[0]), postcode=x[1], accident=int(x[2])))
schemaPeople = sqlContext.createDataFrame(people)
result = schemaPeople.groupby("postcode", "c_id").agg(F.max("accident").alias("accident"))
new_result = result.withColumn("row_num", F.row_number().over(Window.partitionBy("postcode").orderBy(F.desc("accident")))).filter("row_num==1")
new_result.show()

How to compare pairs of columns using udf in pyspark?

I have a dataframe like below:
+---+---+---+
| t1| t2|t3 |
+---+---+---+
|0 |1 |0 |
+---+---+---+
I want to compare each column with the other columns.
For example, if the t1 column value is 0 and the t2 column value is 1, then the t1/t2 combination column is 1.
We have to apply a logical OR for all column pairs.
My expected output will be like below:
+----+---+---+---+
|t123| t1| t2| t3|
+----+---+---+---+
|  t1|  0|  1|  0|
|  t2|  1|  0|  1|
|  t3|  0|  1|  0|
+----+---+---+---+
Please help me with this.
Try this,
import pandas as pd

# this answer assumes df is a pandas DataFrame
cols = df.columns
n = len(cols)
df1 = pd.concat([df] * n, ignore_index=True).eq(1)
df2 = pd.concat([df.T] * n, axis=1, ignore_index=True).eq(1)
df2.columns = cols
df2 = df2.reset_index(drop=True)
print((df1 | df2).astype(int))
Explanation:
Convert df1 into a boolean df, as you need
Convert df2 into a boolean df, as you need, with a transpose
Perform a logical OR on both dfs
Output:
t1 t2 t3
0 0 1 0
1 1 1 1
2 0 1 0
For pyspark, you can create an empty df and then union rows into it in a loop over the columns. The below works not only for 3 columns but for more columns as well.
>>> import pyspark.sql.functions as F
>>>
>>> df1 = spark.createDataFrame(sc.emptyRDD(), df.schema)
>>> df.show()
+---+---+---+
| t1| t2| t3|
+---+---+---+
| 0| 1| 0|
+---+---+---+
>>> df1 = spark.createDataFrame(sc.emptyRDD(), df.schema)
>>> df1 = df1.select(F.lit('').alias('t123'), F.col('*'))
>>> df1.show()
+----+---+---+---+
|t123| t1| t2| t3|
+----+---+---+---+
+----+---+---+---+
>>> for x in df.columns:
... mydf = df.select([(F.when(df[i]+df[x]==1,1).otherwise(0)).alias(i) for i in df.columns])
... df1 = df1.union(mydf.select(F.lit(x).alias('t123'), F.col('*')))
...
>>> df1.show()
+----+---+---+---+
|t123| t1| t2| t3|
+----+---+---+---+
| t1| 0| 1| 0|
| t2| 1| 0| 1|
| t3| 0| 1| 0|
+----+---+---+---+