How to create BinaryType Column using multiple columns of a pySpark Dataframe? - apache-spark-sql

I have recently started working with PySpark, so I don't know many details regarding this.
I am trying to create a BinaryType column in a data frame, but I am struggling to do it.
For example, let's take a simple df:
df.show(2)
+----+----+
|col1|col2|
+----+----+
| "1"|null|
| "2"|"20"|
+----+----+
Now I want to have a third column "col3" with BinaryType like
+----+----+--------+
|col1|col2|    col3|
+----+----+--------+
| "1"|null|[1 null]|
| "2"|"20"|  [2 20]|
+----+----+--------+
How should I do it?

Try this:
import pyspark.sql.functions as F
a = [('1', None), ('2', '20')]
df = spark.createDataFrame(a, ['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
| 1|null|
| 2| 20|
+----+----+
df = df.withColumn('col3', F.array(['col1', 'col2']))
df.show()
+----+----+-------+
|col1|col2| col3|
+----+----+-------+
| 1|null| [1,]|
| 2| 20|[2, 20]|
+----+----+-------+
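Note that F.array yields an ArrayType column rather than the BinaryType the question asked about; for many use cases the array is what you actually want. If a true BinaryType column is really required, here is a minimal sketch (an assumption on my part: the values are simply packed into UTF-8 bytes, joined by commas) using a udf:
import pyspark.sql.functions as F
from pyspark.sql.types import BinaryType

# pack the string columns into one bytes value; None becomes an empty field
to_bytes = F.udf(
    lambda *cols: bytearray(",".join("" if c is None else c for c in cols), "utf-8"),
    BinaryType()
)

df = df.withColumn('col3', to_bytes('col1', 'col2'))
df.printSchema()  # col3 is now listed as binary
Prefer the F.array version whenever an array is acceptable, since udfs add serialization overhead.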

Related

Finding the value of a column based on other 2 columns

I have a specific problem, where I want to derive the value of the bu_id field from id and matched_id.
When there is a value in the matched_id column, bu_id should be the same for that particular id and for the ids listed in its matched_id.
When matched_id is blank, bu_id should be the same as id.
input
+---+----------+
|id |matched_id|
+---+----------+
|0  |7,8       |
|1  |          |
|2  |4         |
|3  |5,9       |
|4  |2         |
|5  |3,9       |
|6  |          |
|7  |0,8       |
|8  |0,7       |
|9  |3,5       |
+---+----------+
output
+---+----------+-----+
|id |matched_id|bu_id|
+---+----------+-----+
|0  |7,8       |0    |
|1  |          |1    |
|2  |4         |2    |
|3  |5,9       |3    |
|4  |2         |2    |
|5  |3,9       |3    |
|6  |          |6    |
|7  |0,8       |0    |
|8  |0,7       |0    |
|9  |3,5       |3    |
+---+----------+-----+
Can anyone help me with how to approach this problem? Thanks in advance.
We should try to use functions exclusively from the pyspark.sql.functions module, because these are optimized for pyspark dataframes, whereas udfs are not and should be avoided when possible.
To achieve the desired output, we can concatenate the "id" and "matched_id" columns together, split the resulting string into a list of strings, cast the result to an array of integers, and take the minimum of the array. We do not have to worry about the blank strings, because they get converted into null and F.array_min drops nulls from consideration. This can be done with the following code (and while it is a little dense, it gets the job done):
import pyspark.sql.functions as F
df = spark.createDataFrame(
    [
        ("0", "7,8"),
        ("1", ""),
        ("2", "4"),
        ("3", "5,9"),
        ("4", "2"),
        ("5", "3,9"),
        ("6", ""),
        ("7", "0,8"),
        ("8", "0,7"),
        ("9", "3,5"),
    ],
    ["id", "matched_id"]
)
df.withColumn(
    "bu_id",
    F.array_min(
        F.split(F.concat(F.col("id"), F.lit(","), F.col("matched_id")), ",").cast("array<int>")
    )
).show()
Output:
+---+----------+-----+
| id|matched_id|bu_id|
+---+----------+-----+
| 0| 7,8| 0|
| 1| | 1|
| 2| 4| 2|
| 3| 5,9| 3|
| 4| 2| 2|
| 5| 3,9| 3|
| 6| | 6|
| 7| 0,8| 0|
| 8| 0,7| 0|
| 9| 3,5| 3|
+---+----------+-----+
Update: in the case of non-numeric strings in columns "id" and "matched_id", we can no longer cast to an array of integers, so we can instead use the pyspark functions F.when and .otherwise to set the new column to the "id" value when "matched_id" is an empty string "", and apply the longer nested expression when "matched_id" is non-empty.
df2 = spark.createDataFrame(
    [
        ("0", "7,8"),
        ("1", ""),
        ("2", "4"),
        ("3", "5,9"),
        ("4", "2"),
        ("5", "3,9"),
        ("6", ""),
        ("7", "0,8"),
        ("8", "0,7"),
        ("9", "3,5"),
        ("x", ""),
        ("x", "y,z")
    ],
    ["id", "matched_id"]
)
df2.withColumn(
    "bu_id",
    F.when(
        F.col("matched_id") != "",
        F.array_min(F.split(F.concat(F.col("id"), F.lit(","), F.col("matched_id")), ","))
    ).otherwise(F.col("id"))
).show()
Output:
+---+----------+-----+
| id|matched_id|bu_id|
+---+----------+-----+
| 0| 7,8| 0|
| 1| | 1|
| 2| 4| 2|
| 3| 5,9| 3|
| 4| 2| 2|
| 5| 3,9| 3|
| 6| | 6|
| 7| 0,8| 0|
| 8| 0,7| 0|
| 9| 3,5| 3|
| x| | x|
| x| y,z| x|
+---+----------+-----+
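One caveat worth adding (this is not part of the original answer): without the cast, F.array_min compares strings lexicographically, so a multi-digit id such as "10" would sort before "9". A hedged sketch that keeps the numeric cast whenever both columns are purely numeric, and falls back to the plain id otherwise:
df2.withColumn(
    "bu_id",
    F.when(
        F.col("id").rlike("^[0-9]+$") & F.col("matched_id").rlike("^[0-9,]+$"),
        # numeric case: same min-of-array logic as above, cast back to string
        F.array_min(
            F.split(F.concat(F.col("id"), F.lit(","), F.col("matched_id")), ",").cast("array<int>")
        ).cast("string")
    ).otherwise(F.col("id"))
).show()
This produces the same output as above for the sample data, since all of its numeric ids are single digits.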
To answer this question, I assumed that the logic you are looking to implement is:
If the matched_id column is null, then bu_id should be the same as id.
If the matched_id column is not null, we should consider the values listed in both the id and matched_id columns and bu_id should be the minimum of those values.
The Set-Up
# imports to include
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# making your dataframe
df = spark.createDataFrame(
    [
        ('0', '7,8'),
        ('1', ''),
        ('2', '4'),
        ('3', '5,9'),
        ('4', '2'),
        ('5', '3,9'),
        ('6', ''),
        ('7', '0,8'),
        ('8', '0,7'),
        ('9', '3,5'),
    ],
    ['id', 'matched_id'])

print(df.schema.fields)
df.show(truncate=False)
In this df, both the id and matched_id columns are StringType data types. The code that follows builds off this assumption. You can check the column types in your df by running print(df.schema.fields).
+---+----------+
|id |matched_id|
+---+----------+
|0  |7,8       |
|1  |          |
|2  |4         |
|3  |5,9       |
|4  |2         |
|5  |3,9       |
|6  |          |
|7  |0,8       |
|8  |0,7       |
|9  |3,5       |
+---+----------+
The Logic
To implement the logic for bu_id, we create a function called bu_calculation that defines the logic. Then we wrap the function in a pyspark sql UDF. The bu_id column is then created by passing the columns we need to evaluate (the id and matched_id columns) into the UDF.
# create custom function with the logic for bu_id
def bu_calculation(id_col, matched_id_col):
    id_int = int(id_col)
    # turn the string in the matched_id column into a list and remove empty values from the list
    matched_id_list = list(filter(None, matched_id_col.split(",")))
    if len(matched_id_list) > 0:
        # if matched_id column has values, convert strings to ints
        all_ids = [int(x) for x in matched_id_list]
        # join id column values with matched_id column values
        all_ids.append(id_int)
        # return minimum value
        return min(all_ids)
    else:
        # if matched_id column is empty return the id column value
        return id_int

# apply custom bu_calculation function to pyspark sql udf
# the use of IntegerType() here enforces that the bu_calculation function has to return an int
bu_udf = F.udf(bu_calculation, IntegerType())

# make a new column called bu_id using the pyspark sql udf we created called bu_udf
df = df.withColumn('bu_id', bu_udf('id', 'matched_id'))
df.show(truncate=False)
+---+----------+-----+
|id |matched_id|bu_id|
+---+----------+-----+
|0  |7,8       |0    |
|1  |          |1    |
|2  |4         |2    |
|3  |5,9       |3    |
|4  |2         |2    |
|5  |3,9       |3    |
|6  |          |6    |
|7  |0,8       |0    |
|8  |0,7       |0    |
|9  |3,5       |3    |
+---+----------+-----+
More about the pyspark sql udf function here: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.udf.html

PySpark: How to concatenate two distinct dataframes?

I have multiple dataframes that I need to concatenate together, row-wise. In pandas, we would typically write: pd.concat([df1, df2]).
This thread: How to concatenate/append multiple Spark dataframes column wise in Pyspark? appears close, but its respective answer:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
df1_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
df1 = spark.sparkContext.parallelize([(1, "sammy"),(2, "jill"),(3, "john")])
df1 = spark.createDataFrame(df1, schema=df1_schema)
df2_schema = StructType([StructField("secNo",IntegerType()),StructField("city",StringType())])
df2 = spark.sparkContext.parallelize([(101, "LA"),(102, "CA"),(103,"DC")])
df2 = spark.createDataFrame(df2, schema=df2_schema)
schema = StructType(df1.schema.fields + df2.schema.fields)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0]+x[1])
spark.createDataFrame(df1df2, schema).show()
yields the following error when run on my data at scale: Can only zip RDDs with same number of elements in each partition.
How can I join 2 or more data frames that are identical in row length but are otherwise independent of content (they share a similar repeating structure/order but contain no shared data)?
Example expected data looks like:
+---+-----+ +-----+----+ +---+-----+-----+----+
| id| name| |secNo|city| | id| name|secNo|city|
+---+-----+ +-----+----+ +---+-----+-----+----+
| 1|sammy| + | 101| LA| => | 1|sammy| 101| LA|
| 2| jill| | 102| CA| | 2| jill| 102| CA|
| 3| john| | 103| DC| | 3| john| 103| DC|
+---+-----+ +-----+----+ +---+-----+-----+----+
You can create unique IDs with
from pyspark.sql.functions import expr
df1 = df1.withColumn("unique_id", expr("row_number() over (order by (select null))"))
df2 = df2.withColumn("unique_id", expr("row_number() over (order by (select null))"))
then, you can left join them
df1.join(df2, ["unique_id"], "left").drop("unique_id")
Final output looks like
+---+-----+-----+----+
| id| name|secNo|city|
+---+-----+-----+----+
|  1|sammy|  101|  LA|
|  2| jill|  102|  CA|
|  3| john|  103|  DC|
+---+-----+-----+----+
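If the (order by (select null)) window is not accepted by your Spark version, a hedged alternative sketch (an assumption, not part of the original answer) is to build an explicit row index on each dataframe with monotonically_increasing_id() plus row_number(), then join on that index:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def concat_columnwise(left, right):
    # row_number over the increasing id gives matching 1..n indexes on both sides,
    # assuming each dataframe keeps its original row order
    w = Window.orderBy(F.monotonically_increasing_id())
    left_i = left.withColumn("_rn", F.row_number().over(w))
    right_i = right.withColumn("_rn", F.row_number().over(w))
    return left_i.join(right_i, "_rn", "inner").drop("_rn")

concat_columnwise(df1, df2).show()
Either way, this only works when both dataframes have the same number of rows and a stable order, which Spark does not guarantee after shuffles, so it is safest to do this right after reading the data.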

PySpark: get all dataframe columns defined as values into another column

I'm a newbie with PySpark and don't know what the problem with my code is.
I have 2 dataframes
df1 =
+---+--------------+
| id|No_of_Question|
+---+--------------+
|  1|            Q1|
|  2|            Q4|
|  3|           Q23|
|...|           ...|
+---+--------------+
df2 =
+---+---+---+---+---+---+---+---+---+---+
| Q1| Q2| Q3| Q4| Q5|...|Q22|Q23|Q24|Q25|
+---+---+---+---+---+---+---+---+---+---+
|  1|  0|  1|  0|  0|...|  1|  1|  1|  1|
+---+---+---+---+---+---+---+---+---+---+
I'd like to create a new dataframe containing only the columns of df2 whose names are listed in df1.No_of_Question.
Expected result
df2 =
+---+---+---+
| Q1| Q4|Q24|
+---+---+---+
|  1|  0|  1|
+---+---+---+
I've already tried
df2 = df2.select(*F.collect_list(df1.No_of_Question))  # Error: Column is not iterable
or
df2 = df2.select(F.collect_list(df1.No_of_Question))  # Error: Resolved attribute(s) No_of_Question#1791 missing from Q1, Q2...
or
df2 = df2.select(*df1.No_of_Question)
or
df2 = df2.select([col for col in df2.columns if col in df1.No_of_Question])
But none of these solutions worked.
Could you help me, please?
You can collect the values of No_of_Question into a Python list, then pass it to df2.select().
Try this:
import pyspark.sql.functions as F

questions = [
    F.col(r.No_of_Question).alias(r.No_of_Question)
    for r in df1.select("No_of_Question").collect()
]
df2 = df2.select(*questions)
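A hedged refinement (an assumption, not from the original answer: some questions listed in df1 might not exist as columns in df2) is to intersect the collected names with df2.columns, so that select() cannot fail on a missing column:
# keep only the requested question columns that actually exist in df2
wanted = {r.No_of_Question for r in df1.select("No_of_Question").distinct().collect()}
existing = [c for c in df2.columns if c in wanted]
df2 = df2.select(*existing)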

transposing rows into columns in pyspark

I have a dataframe track_log whose columns are
item  track_info  Date
------------------------------
1     ordered     01/01/19
1     Shipped     02/01/19
1     delivered   03/01/19
I want to get the data as
item  ordered   Shipped   Delivered
-----------------------------------
1     01/01/19  02/01/19  03/01/19
I need to resolve this using pyspark.
I can think of a solution like this:
>>> df.show()
+----+----------+--------+
|item|track_info| date|
+----+----------+--------+
| 1| ordered|01/01/19|
| 1| Shipped|02/01/19|
| 1| delivered|03/01/19|
+----+----------+--------+
>>> from pyspark.sql.functions import collect_list
>>> df_grouped = df.groupBy(df.item).agg(collect_list(df.track_info).alias('grouped_data'))
>>> df_grouped_date = df.groupBy(df.item).agg(collect_list(df.date).alias('grouped_dates'))
>>> df_cols = df_grouped.select(df_grouped.grouped_data).first()['grouped_data']
>>> df_cols.insert(0, 'item')  # list.insert mutates in place and returns None, so call it on its own line
>>> df_grouped_date.select(df_grouped_date.item, df_grouped_date.grouped_dates[0], df_grouped_date.grouped_dates[1], df_grouped_date.grouped_dates[2]).toDF(*df_cols).show()
+----+--------+--------+---------+
|item| ordered| Shipped|delivered|
+----+--------+--------+---------+
| 1|01/01/19|02/01/19| 03/01/19|
+----+--------+--------+---------+
You can use the spark pivot function to do that as a one-liner, as below:
>>> df.show()
+----+----------+--------+
|item|track_info| date|
+----+----------+--------+
| 1| ordered|01/01/19|
| 1| Shipped|02/01/19|
| 1| delivered|03/01/19|
+----+----------+--------+
>>> from pyspark.sql.functions import collect_list
>>> pivot_df = df.groupBy('item').pivot('track_info').agg(collect_list('date'))
>>> pivot_df.show()
+----+----------+----------+----------+
|item|   ordered|   Shipped| delivered|
+----+----------+----------+----------+
|   1|[01/01/19]|[02/01/19]|[03/01/19]|
+----+----------+----------+----------+
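A hedged variant (assuming there is at most one date per item/track_info pair, as in the sample data): aggregating with first() instead of collect_list() returns plain strings rather than single-element arrays.
>>> from pyspark.sql.functions import first
>>> df.groupBy('item').pivot('track_info').agg(first('date')).show()
+----+--------+--------+---------+
|item| ordered| Shipped|delivered|
+----+--------+--------+---------+
|   1|01/01/19|02/01/19| 03/01/19|
+----+--------+--------+---------+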

pyspark dataframe ordered by multiple columns at the same time

I have a json file that contains some data. I converted this json to a pyspark dataframe (I chose some columns, not all of them). This is my code:
import os
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
import json
from pyspark.sql.functions import col
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
df=spark.read.json("/Users/deemaalomair/PycharmProj
ects/first/deema.json").select('full_text',
'retweet_count', 'favorite_count')
c=df.count()
print(c)
df.orderBy(["retweet_count", "favorite_count"], ascending=[0, 0]).show(10)
and this is the output:
+--------------------+-------------+--------------+
| full_text|retweet_count|favorite_count|
+--------------------+-------------+--------------+
|Check out this in...| 388| 785|
|Review – Apple Ai...| 337| 410|
|This #iPhone atta...| 159| 243|
|March is #Nationa...| 103| 133|
|📱 Amazing vide...|           87|           139|
|Business email wi...| 86| 160|
|#wallpapers #iPho...| 80| 385|
|#wallpapers #iPho...| 71| 352|
|#wallpapers #iPho...| 57| 297|
|Millions of #iPho...| 46| 52|
+--------------------+-------------+--------------+
only showing top 10 rows
Q1:
Now I need to order this dataframe in descending order by two columns at the same time ('retweet_count', 'favorite_count').
I tried multiple functions like the one above, and:
cols = ['retweet_count', 'favorite_count']
df = df.orderBy(cols, ascending=False).show(10)
but all of them just order the first column and skip the second! I don't know what I am doing wrong.
I know there are a lot of similar questions, but I tried everything before posting the question here!
Q2: the dataframe output for the full_text column is shortened. How can I print the whole text?
If you are trying to see descending values in two columns simultaneously, that is not going to happen, as each column has its own separate order.
In the above data frame you can see that both retweet_count and favorite_count have their own order. This is the case with your data.
>>> import os
>>> from pyspark import SparkContext
>>> from pyspark.streaming import StreamingContext
>>> from pyspark.sql import SparkSession
>>> sc = SparkContext.getOrCreate()
>>> spark = SparkSession(sc)
>>> df = spark.read.format('csv').option("header","true").load("/home/samba693/test.csv")
>>> df.show()
+---------+-------------+--------------+
|full_text|retweet_count|favorite_count|
+---------+-------------+--------------+
| abc| 45| 45|
| def| 50| 40|
| ghi| 50| 39|
| jkl| 50| 41|
+---------+-------------+--------------+
When we apply order by on two columns, what exactly happens is: it orders by the first column and, if there is a tie, it takes the second column's value into consideration. But this might not be what you are looking for. You seem to be looking to sort both columns based on their sum.
>>> df.orderBy(["retweet_count", "favorite_count"], ascending=False).show()
+---------+-------------+--------------+
|full_text|retweet_count|favorite_count|
+---------+-------------+--------------+
| jkl| 50| 41|
| def| 50| 40|
| ghi| 50| 39|
| abc| 45| 45|
+---------+-------------+--------------+
One way to work around this is to add a new column with the sum of both columns, apply orderBy on the new column, and remove the new column after ordering.
>>> from pyspark.sql.functions import expr
>>> df1 = df.withColumn('total',expr("retweet_count+favorite_count"))
>>> df1.show()
+---------+-------------+--------------+-----+
|full_text|retweet_count|favorite_count|total|
+---------+-------------+--------------+-----+
| abc| 45| 45| 90.0|
| def| 50| 40| 90.0|
| ghi| 50| 39| 89.0|
| jkl| 50| 41| 91.0|
+---------+-------------+--------------+-----+
Ordering by the new column and removing it later:
>>> df2 = df1.orderBy("total", ascending=False)
>>> df2.show()
+---------+-------------+--------------+-----+
|full_text|retweet_count|favorite_count|total|
+---------+-------------+--------------+-----+
| jkl| 50| 41| 91.0|
| abc| 45| 45| 90.0|
| def| 50| 40| 90.0|
| ghi| 50| 39| 89.0|
+---------+-------------+--------------+-----+
>>> df = df2.select("full_text","retweet_count","favorite_count")
>>> df.show()
+---------+-------------+--------------+
|full_text|retweet_count|favorite_count|
+---------+-------------+--------------+
| jkl| 50| 41|
| abc| 45| 45|
| def| 50| 40|
| ghi| 50| 39|
+---------+-------------+--------------+
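As for Q2 (not covered above): show() truncates string columns to 20 characters by default; pass truncate=False (or a wider limit such as truncate=100) to print the whole full_text value.
>>> df.orderBy(["retweet_count", "favorite_count"], ascending=[False, False]).show(10, truncate=False)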
Hope this helps!