Is there any method to find the number of columns having data in a PySpark data frame?

I have a PySpark data frame with 7 columns. I need to add a new column named "sum" that holds, for each row, the number of columns that have data (are not null). Example: a data frame in which the yellow highlighted part is the required answer.

This sum can be calculated like this:
df = spark.createDataFrame([
(1, "a", "xxx", None, "abc", "xyz","fgh"),
(2, "b", None, 3, "abc", "xyz","fgh"),
(3, "c", "a23", None, None, "xyz","fgh")
], ("ID","flag", "col1", "col2", "col3", "col4", "col5"))
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
df2 = df.withColumn("sum",sum([(~F.isnull(df[col])).cast(IntegerType()) for col in df.columns]))
df2.show()
+---+----+----+----+----+----+----+---+
| ID|flag|col1|col2|col3|col4|col5|sum|
+---+----+----+----+----+----+----+---+
| 1| a| xxx|null| abc| xyz| fgh| 6|
| 2| b|null| 3| abc| xyz| fgh| 6|
| 3| c| a23|null|null| xyz| fgh| 5|
+---+----+----+----+----+----+----+---+
Hope this helps!
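As a plain-Python check of what that expression computes (no Spark involved; `count_non_null` and its `skip` parameter are hypothetical names used only for this sketch), each row's count is the number of values that are not None. The answer's count includes ID and flag; to count only the data columns you would restrict the column list, which in the PySpark version presumably means iterating over a subset of df.columns.

```python
columns = ("ID", "flag", "col1", "col2", "col3", "col4", "col5")
rows = [
    (1, "a", "xxx", None, "abc", "xyz", "fgh"),
    (2, "b", None, 3, "abc", "xyz", "fgh"),
    (3, "c", "a23", None, None, "xyz", "fgh"),
]

def count_non_null(row, skip=()):
    """Count the non-None values in a row, ignoring columns named in `skip`."""
    return sum(
        value is not None
        for name, value in zip(columns, row)
        if name not in skip
    )

# Counting every column reproduces the "sum" column shown above.
all_counts = [count_non_null(r) for r in rows]
# Leaving out the two key columns counts only col1..col5.
data_counts = [count_non_null(r, skip=("ID", "flag")) for r in rows]
print(all_counts)
print(data_counts)
```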

Related

How to retain the preceding updated row values in PySpark and use it in the next row calculation?

The below condition needs to be applied on RANK and RANKA columns
Input table:
Condition for RANK column:
IF RANK == 0 : then RANK= previous RANK value + 1 ;
else : RANK=RANK
Condition for RANKA column:
IF RANKA == 0 : then RANKA= previous RANKA value + current row Salary value;
else : RANKA=RANKA
Below is a piece of code that I tried.
I have created dummy columns named RANK_new and RANKA_new for storing the desired outputs of RANK and RANKA columns after applying conditions.
And then once I get the correct values I can replace the RANK and RANKA column with those dummy columns.
# importing necessary libraries
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col
from pyspark.sql.functions import lit
from pyspark.sql.functions import lag, lead
# function to create new SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("employee_profile.com") \
        .getOrCreate()
    return spk
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1
if __name__ == "__main__":
    # calling function to create SparkSession
    spark = create_session()
    input_data = [(1, "Shivansh", "Data Scientist", 2,1,1,2),
                  (0, "Rishabh", "Software Developer", 5,2,0,3),
                  (0, "Swati", "Data Analyst", 10,3,10,4),
                  (1, "Amar", "Data Analyst", 2,4,9,0),
                  (0, "Arpit", "Android Developer", 3,5,0,0),
                  (0, "Ranjeet", "Python Developer", 4,6,0,0),
                  (0, "Priyanka", "Full Stack Developer",5,7,0,0)]
    schema = ["Id", "Name", "Job Profile", "Salary",'hi','RANK','RANKA']
    # calling function to create dataframe
    dff = create_df(spark, input_data, schema)
    # Below 3 lines for RANK
    df1=dff.repartition(1)
    df2 = df1.withColumn('RANK_new', when(col('RANK') == 0,lag(col('RANK')+lit(1)).over(Window.orderBy(col('hi')))).otherwise(col('RANK')))
    df2 = df2.withColumn('RANK_new', when((col('RANK') == 0) & (lag(col('RANK')).over(Window.orderBy(col('hi'))) == 0) ,lag(col('RANK_new')+lit(1)).over(Window.orderBy(col('hi')))).otherwise(col('RANK_new')))
    #Below line for RANKA
    df2=df2.withColumn('RANKA_new', when(col('RANKA') == 0, lag(col("RANKA")).over(Window.orderBy("hi"))+col("Salary")).otherwise(col('RANKA')))
    df2.show()
The issue with this code is that the lag function is not taking the updated values of the previous rows.
This can be done with a for loop but since my data is so huge, I need a solution without for loop.
Below is the desired output:
Below is a summarized picture to show the Output I got and the desired output.
RANK_new, RANKA_new --> These are the output I got for RANK and RANKA columns after I applied the above code
RANK_desired, RANKA-desired ---> This is what is expected to be produced.
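You can first build a grouping column for each of RANK and RANKA: a running count of the non-zero values, so that every non-zero row starts a new group and the zero rows that follow it fall into the same group. A cumulative sum inside each group then gives the desired values.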
Input
from pyspark.sql import functions as F, Window as W
input_data = [(1, "Shivansh", "Data Scientist", 2,1,1,2),
(0, "Rishabh", "Software Developer", 5,2,0,3),
(0, "Swati", "Data Analyst", 10,3,10,4),
(1, "Amar", "Data Analyst", 2,4,9,0),
(0, "Arpit", "Android Developer", 3,5,0,0),
(0, "Ranjeet", "Python Developer", 4,6,0,0),
(0, "Priyanka", "Full Stack Developer",5,7,0,0)]
schema = ["Id", "Name", "Job Profile", "Salary",'hi','RANK','RANKA']
dff = spark.createDataFrame(input_data, schema)
Script:
w0 = W.orderBy('hi')
rank_grp = F.when(F.col('RANK') != 0, 1).otherwise(0)
dff = dff.withColumn('RANK_grp', F.sum(rank_grp).over(w0))
w1 = W.partitionBy('RANK_grp').orderBy('hi')
ranka_grp = F.when(F.col('RANKA') != 0, 1).otherwise(0)
dff = dff.withColumn('RANKA_grp', F.sum(ranka_grp).over(w0))
w2 = W.partitionBy('RANKA_grp').orderBy('hi')
dff = (dff
    .withColumn('RANK_new', F.sum(F.when(F.col('RANK') == 0, 1).otherwise(F.col('RANK'))).over(w1))
    .withColumn('RANKA_new', F.sum(F.when(F.col('RANKA') == 0, F.col('Salary')).otherwise(F.col('RANKA'))).over(w2))
    .drop('RANK_grp', 'RANKA_grp')
)
dff.show()
# +---+--------+--------------------+------+---+----+-----+--------+---------+
# | Id| Name| Job Profile|Salary| hi|RANK|RANKA|RANK_new|RANKA_new|
# +---+--------+--------------------+------+---+----+-----+--------+---------+
# | 1|Shivansh| Data Scientist| 2| 1| 1| 2| 1| 2|
# | 0| Rishabh| Software Developer| 5| 2| 0| 3| 2| 3|
# | 0| Swati| Data Analyst| 10| 3| 10| 4| 10| 4|
# | 1| Amar| Data Analyst| 2| 4| 9| 0| 9| 6|
# | 0| Arpit| Android Developer| 3| 5| 0| 0| 10| 9|
# | 0| Ranjeet| Python Developer| 4| 6| 0| 0| 11| 13|
# | 0|Priyanka|Full Stack Developer| 5| 7| 0| 0| 12| 18|
# +---+--------+--------------------+------+---+----+-----+--------+---------+
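To see why grouping followed by a cumulative sum gives these numbers, here is a plain-Python model of the same three steps (no Spark; `grouped_running_sum` is a hypothetical helper written only for this illustration, and the rows are assumed to be already ordered by hi). It is meant for checking the logic on a small sample, not as a replacement for the window functions.

```python
from itertools import accumulate, groupby

# (hi, Salary, RANK, RANKA) from the input data above, ordered by hi
rows = [
    (1, 2, 1, 2), (2, 5, 0, 3), (3, 10, 10, 4), (4, 2, 9, 0),
    (5, 3, 0, 0), (6, 4, 0, 0), (7, 5, 0, 0),
]

def grouped_running_sum(values, zero_fill):
    """Model of: group id = running count of non-zero values, then a running sum per group."""
    # Step 1: every non-zero value starts a new group (the F.sum(...).over(w0) step).
    group_ids = list(accumulate(1 if v != 0 else 0 for v in values))
    # Step 2: a row contributes its own value, or the fill amount when the value is 0.
    contributions = [v if v != 0 else f for v, f in zip(values, zero_fill)]
    # Step 3: running sum inside each group (the F.sum(...).over(w1 / w2) step).
    out = []
    for _, members in groupby(zip(group_ids, contributions), key=lambda pair: pair[0]):
        out.extend(accumulate(c for _, c in members))
    return out

rank_new = grouped_running_sum([r[2] for r in rows], [1] * len(rows))
ranka_new = grouped_running_sum([r[3] for r in rows], [r[1] for r in rows])
print(rank_new)
print(ranka_new)
```

The two printed lists line up with the RANK_new and RANKA_new columns in the output above.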

Replacing column values in pyspark by iterating through list

I have a PySpark data frame:
+---+----+-----+----+
| ID|colA| colB|colC|
+---+----+-----+----+
|ID1|   3| 5.85|  LB|
|ID2|   4|12.67|  RF|
|ID3|   2|20.78| LCM|
|ID4|   1|    2| LWB|
|ID5|   6|    3|  LF|
|ID6|   7|    4|  LM|
|ID7|   8|    5|  RS|
+---+----+-----+----+
My goal is to replace the values in colC: LB, LWB and LF become x, and so on, as shown below.
x = ['LB', 'LWB', 'LF']
y = ['RF', 'LCM']
z = ['LM', 'RS']
Currently I can achieve this by replacing each value manually, as in the code below:
# Replacing the values LB, LWB, LF with x
df_new = df.withColumn('ColC',f.when((f.col('ColC') == 'LB')|(f.col('ColC') == 'LWB')|(f.col('ColC') == 'LF'),'x').otherwise(df.ColC))
My question is: how can we replace the values of a column (colC in my example) dynamically, in one go, by iterating through the lists (x, y, z) using PySpark? What is the time complexity involved? Also, how can we truncate the decimal values in colB to 1 decimal place?
You can coalesce the when expressions if you have many conditions to match. You can also hold the mapping in a dictionary and construct the when expressions dynamically with a list comprehension over its items. Note that a when without an otherwise yields null, so any value that matches none of the lists becomes null. As for rounding to 1 decimal place, you can use round.
import pyspark.sql.functions as F
xyz_dict = {'x': ['LB','LWB','LF'],
'y': ['RF','LCM'],
'z': ['LM','RS']}
df2 = df.withColumn(
'colC',
F.coalesce(*[F.when(F.col('colC').isin(v), k) for (k, v) in xyz_dict.items()])
).withColumn(
'colB',
F.round('colB', 1)
)
df2.show()
+---+----+----+----+
| ID|colA|colB|colC|
+---+----+----+----+
|ID1| 3| 5.9| x|
|ID2| 4|12.7| y|
|ID3| 2|20.8| y|
|ID4| 1| 2.0| x|
|ID5| 6| 3.0| x|
|ID6| 7| 4.0| z|
|ID7| 8| 5.0| z|
+---+----+----+----+
You can use replace on the dataframe to replace the values in colC by passing a dict of mappings, and the round function to limit the number of decimals in colB:
from pyspark.sql import functions as F
replacement = {
"LB": "x", "LWB": "x", "LF": "x",
"RF": "y", "LCM": "y",
"LM": "z", "RS": "z"
}
df1 = df.replace(replacement, ["colC"]).withColumn("colB", F.round("colB", 1))
df1.show()
#+---+----+----+----+
#| ID|colA|colB|colC|
#+---+----+----+----+
#|ID1| 3| 5.9| x|
#|ID2| 4|12.7| y|
#|ID3| 2|20.8| y|
#|ID4| 1| 2.0| x|
#|ID5| 6| 3.0| x|
#|ID6| 7| 4.0| z|
#|ID7| 8| 5.0| z|
#+---+----+----+----+
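The flat mapping does not have to be typed out by hand. A minimal sketch, assuming the lists are kept in a label-to-values dictionary like the xyz_dict used in the earlier answer: a dict comprehension inverts it into the value-to-label form that replace expects.

```python
# Label -> list of original values (same shape as xyz_dict in the earlier answer).
xyz_dict = {
    "x": ["LB", "LWB", "LF"],
    "y": ["RF", "LCM"],
    "z": ["LM", "RS"],
}

# Invert it: every original value points at its label.
replacement = {old: new for new, olds in xyz_dict.items() for old in olds}
print(replacement)
```

The resulting dictionary has the same content as the hand-written replacement above, so it can be passed to df.replace in the same way.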
Also you can use isin function:
from pyspark.sql.functions import col, when
x = ['LB','LWB','LF']
y = ['LCM','RF']
z = ['LM','RS']
df = df.withColumn('ColC', when(col('colC').isin(x), "x")\
.otherwise(when(col('colC').isin(y), "y")\
.otherwise(when(col('colC').isin(z), "z")\
.otherwise(df.ColC))))
If you have only a few lists but each holds many values, this approach has lower complexity than blackbishop's answer; for this particular problem, though, his answer is simpler.
You can also try a regular expression with regexp_replace. The patterns are anchored with ^ and $ so that only whole values are replaced, not substrings of longer values:
import pyspark.sql.functions as f
replacements = [
    ("^(LB|LWB|LF)$", "x"),
    ("^(LCM|RF)$", "y"),
    ("^(LM|RS)$", "z")
]
for pattern, label in replacements:
    df = df.withColumn("colC", f.regexp_replace("colC", pattern, label))

Join PySpark dataframes with unequal numbers of rows

I have two PySpark dataframes, shown below.
The first is df1:
+-----+-----+----------+-----+
| name| type|timestamp1|score|
+-----+-----+----------+-----+
|name1|type1|2012-01-10| 11|
|name2|type1|2012-01-10| 14|
|name3|type2|2012-01-10| 2|
|name3|type2|2012-01-17| 3|
|name1|type1|2012-01-18| 55|
|name1|type1|2012-01-19| 10|
+-----+-----+----------+-----+
The second is df2:
+-----+-------------------+-------+-------+
| name| timestamp2|string1|string2|
+-----+-------------------+-------+-------+
|name1|2012-01-10 00:00:00| A| aa|
|name2|2012-01-10 00:00:00| A| bb|
|name3|2012-01-10 00:00:00| C| cc|
|name4|2012-01-17 00:00:00| D| dd|
|name3|2012-01-10 00:00:00| C| cc|
|name2|2012-01-17 00:00:00| A| bb|
|name2|2012-01-17 00:00:00| A| bb|
|name4|2012-01-10 00:00:00| D| dd|
|name3|2012-01-17 00:00:00| C| cc|
+-----+-------------------+-------+-------+
These two dataframes have one column in common, name. Each unique value of name in df2 always has the same values of string1 and string2.
I want to join df1 and df2 into a new dataframe df3 that contains all the rows of df1 (same structure and number of rows as df1), with the values of string1 and string2 from df2 attached to the matching name in df1. This is how I want the combined dataframe (df3) to look:
+-----+-----+----------+-----+-------+-------+
| name| type|timestamp1|score|string1|string2|
+-----+-----+----------+-----+-------+-------+
|name1|type1|2012-01-10| 11| A| aa|
|name2|type1|2012-01-10| 14| A| bb|
|name3|type2|2012-01-10| 2| C| cc|
|name3|type2|2012-01-17| 3| C| cc|
|name1|type1|2012-01-18| 55| A| aa|
|name1|type1|2012-01-19| 10| A| aa|
+-----+-----+----------+-----+-------+-------+
How can I get the dataframe (df3) shown above?
I tried df3 = df1.join( df2.select("name", "string1", "string2") , on=["name"], how="left"), but that gives me a dataframe with 14 rows containing multiple (duplicate) entries.
You can use the code below to generate df1 and df2.
from pyspark.sql import *
import pyspark.sql.functions as F
df1_Stats = Row("name", "type", "timestamp1", "score")
df1_stat1 = df1_Stats('name1', 'type1', "2012-01-10", 11)
df1_stat2 = df1_Stats('name2', 'type1', "2012-01-10", 14)
df1_stat3 = df1_Stats('name3', 'type2', "2012-01-10", 2)
df1_stat4 = df1_Stats('name3', 'type2', "2012-01-17", 3)
df1_stat5 = df1_Stats('name1', 'type1', "2012-01-18", 55)
df1_stat6 = df1_Stats('name1', 'type1', "2012-01-19", 10)
df1_stat_lst = [df1_stat1 , df1_stat2, df1_stat3, df1_stat4, df1_stat5, df1_stat6]
df1 = spark.createDataFrame(df1_stat_lst)
df2_Stats = Row("name", "timestamp2", "string1", "string2")
df2_stat1 = df2_Stats("name1", "2012-01-10 00:00:00", "A", "aa")
df2_stat2 = df2_Stats("name2", "2012-01-10 00:00:00", "A", "bb")
df2_stat3 = df2_Stats("name3", "2012-01-10 00:00:00", "C", "cc")
df2_stat4 = df2_Stats("name4", "2012-01-17 00:00:00", "D", "dd")
df2_stat5 = df2_Stats("name3", "2012-01-10 00:00:00", "C", "cc")
df2_stat6 = df2_Stats("name2", "2012-01-17 00:00:00", "A", "bb")
df2_stat7 = df2_Stats("name2", "2012-01-17 00:00:00", "A", "bb")
df2_stat8 = df2_Stats("name4", "2012-01-10 00:00:00", "D", "dd")
df2_stat9 = df2_Stats("name3", "2012-01-17 00:00:00", "C", "cc")
df2_stat_lst = [
df2_stat1,
df2_stat2,
df2_stat3,
df2_stat4,
df2_stat5,
df2_stat6,
df2_stat7,
df2_stat8,
df2_stat9,
]
df2 = spark.createDataFrame(df2_stat_lst)
It is better to remove the duplicates before joining, which also leaves a smaller table to join:
df3 = df1.join(df2.select("name", "string1", "string2").distinct(),on=["name"] , how="left")
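The reason this keeps the row count of df1 can be checked with a small plain-Python model (no Spark; `left_join` is a hypothetical helper written for this sketch). It relies on the question's statement that each name always carries the same string1 and string2, so the de-duplicated lookup has one row per name.

```python
df1_rows = [
    ("name1", "type1", "2012-01-10", 11),
    ("name2", "type1", "2012-01-10", 14),
    ("name3", "type2", "2012-01-10", 2),
    ("name3", "type2", "2012-01-17", 3),
    ("name1", "type1", "2012-01-18", 55),
    ("name1", "type1", "2012-01-19", 10),
]
# The (name, string1, string2) columns of df2, duplicates included.
df2_rows = [
    ("name1", "A", "aa"), ("name2", "A", "bb"), ("name3", "C", "cc"),
    ("name4", "D", "dd"), ("name3", "C", "cc"), ("name2", "A", "bb"),
    ("name2", "A", "bb"), ("name4", "D", "dd"), ("name3", "C", "cc"),
]

def left_join(left, right):
    """Left join on the first field; unmatched left rows get None for the right-hand fields."""
    joined = []
    for l_row in left:
        matches = [r_row[1:] for r_row in right if r_row[0] == l_row[0]]
        for match in matches or [(None, None)]:
            joined.append(l_row + match)
    return joined

# Equivalent of .distinct(): one lookup row per name.
lookup = sorted(set(df2_rows))
df3_rows = left_join(df1_rows, lookup)
print(len(lookup), len(df3_rows))
```

Because every name matches at most one lookup row, the join cannot multiply the rows of df1, and no dropDuplicates is needed afterwards.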
Apparently the following technique does it:
df3 = df1.join(
df2.select("name", "string1", "string2"), on=["name"], how="left"
).dropDuplicates()
df3.show()
+-----+-----+----------+-----+-------+-------+
| name| type|timestamp1|score|string1|string2|
+-----+-----+----------+-----+-------+-------+
|name2|type1|2012-01-10| 14| A| bb|
|name3|type2|2012-01-10| 2| C| cc|
|name1|type1|2012-01-18| 55| A| aa|
|name1|type1|2012-01-10| 11| A| aa|
|name3|type2|2012-01-17| 3| C| cc|
|name1|type1|2012-01-19| 10| A| aa|
+-----+-----+----------+-----+-------+-------+
I am still open for answers. So, if you have a more efficient method of answering the question, please feel free to drop your answer.

Using Spark SQL and a Spark dataframe, how can we find the column name of the minimum value in a row?

I have a dataframe df with 4 columns:
+-------+-------+-------+-------+
| dist1 | dist2 | dist3 | dist4 |
+-------+-------+-------+-------+
| 42 | 53 | 24 | 17 |
+-------+-------+-------+-------+
The output I want is
dist4
It seems easy, but I did not find any proper solution using the dataframe API or a Spark SQL query.
You can use the least function to get the minimum value:
select least(dist1,dist2,dist3,dist4) as min_dist
from yourTable;
For the opposite case, greatest can be used.
EDIT :
To recover the column name, the following can be used to turn the columns into rows of (value, name) pairs:
select inline(array(struct(42, 'dist1'), struct(53, 'dist2'),
struct(24, 'dist3'), struct(17, 'dist4') ))
42 dist1
53 dist2
24 dist3
17 dist4
and then the min function can be applied to these rows to get dist4.
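The idea of pairing each value with its column name can be modelled in a few lines of plain Python (not Spark code; the dictionary stands in for one row of the dataframe). Putting the value first in each pair means the comparison is driven by the value, and the name comes along with it.

```python
# One row of the dataframe, as column name -> value.
row = {"dist1": 42, "dist2": 53, "dist3": 24, "dist4": 17}

# Build (value, name) pairs, mirroring struct(42, 'dist1'), struct(53, 'dist2'), ...
pairs = [(value, name) for name, value in row.items()]

# Tuples compare element by element, so min() picks the smallest value
# and carries its column name along.
min_value, min_col = min(pairs)
print(min_col, min_value)
```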
Try this,
df.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
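import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
// Tag every value with its column name, e.g. 1 becomes "1,A"
val temp_df = df.columns.foldLeft(df) { (acc: DataFrame, colName: String) => acc.withColumn(colName, concat(col(colName), lit(","+colName)))}
// Strings compare lexicographically, so this finds the numeric minimum only when all values have the same number of digits
val minval = udf((ar: Seq[String]) => ar.min.split(",")(1))
val result = temp_df.withColumn("least", split(concat_ws(":",temp_df.columns.map(col(_)):_*),":")).withColumn("least_col", minval(col("least")))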
result.show
+---+---+---+---+--------------------+---------+
| A| B| C| D| least|least_col|
+---+---+---+---+--------------------+---------+
|1,A|2,B|3,C|4,D|[1,A, 2,B, 3,C, 4,D]| A|
|5,A|4,B|3,C|1,D|[5,A, 4,B, 3,C, 1,D]| D|
+---+---+---+---+--------------------+---------+
RDD way and without udf()s.
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = df.withColumn("arr", array(df.columns.map(col(_)):_*))
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 3 more fields]
scala> val rowarr = df.columns
rowarr: Array[String] = Array(A, B, C, D)
scala> val rdd1 = df2.rdd.map( x=> {val p = x.getAs[WrappedArray[Int]]("arr").toArray; val q=rowarr(p.indexWhere(_==p.min));Row.merge(x,Row(q)) })
rdd1: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[83] at map at <console>:47
scala> spark.createDataFrame(rdd1,df2.schema.add(StructField("mincol",StringType))).show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+
You can do something like this:
import org.apache.spark.sql.functions._
val cols = df.columns
val u1 = udf((s: Seq[Int]) => cols(s.zipWithIndex.min._2))
df.withColumn("res", u1(array("*")))
You could access the row's schema, retrieve the list of field names from it, look up each value by name, and work out the minimum that way.
See: https://spark.apache.org/docs/2.3.2/api/scala/index.html#org.apache.spark.sql.Row
It would look roughly like this
dataframe.map(
  row => {
    val schema = row.schema
    val fieldNames:List[String] = ??? // extract names from schema
    fieldNames.foldLeft(("", 0))(???) // retrieve each field's value using its name and retain the minimum
  }
)
This would yield a Dataset[String]

SparkSQL: conditional sum using two columns

I hope you can help me with this.
I have a DF as follows:
val df = sc.parallelize(Seq(
  (1, "a", "2014-12-01", "2015-01-01", 100),
  (2, "a", "2014-12-01", "2015-01-02", 150),
  (3, "a", "2014-12-01", "2015-01-03", 120),
  (4, "b", "2015-12-15", "2015-01-01", 100)
)).toDF("id", "prodId", "dateIns", "dateTrans", "value")
  .withColumn("dateIns", to_date($"dateIns"))
  .withColumn("dateTrans", to_date($"dateTrans"))
I would like to group by prodId and sum 'value' over date ranges defined by the difference between the columns 'dateIns' and 'dateTrans'. In particular, I would like a way to define a conditional sum that adds up all values within a predefined maximum difference between those two columns, i.e. all values that occurred within 10, 20 or 30 days of dateIns ('dateTrans' - 'dateIns' <= 10, 20, 30).
Is there any predefined aggregate function in Spark that allows conditional sums? Or do you recommend developing an aggregate UDF (if so, any suggestions)?
I'm using PySpark, but am very happy to get Scala solutions as well. Thanks a lot!
Let's make your data a little more interesting so that there are some events in each window:
val df = sc.parallelize(Seq(
  (1, "a", "2014-12-30", "2015-01-01", 100),
  (2, "a", "2014-12-21", "2015-01-02", 150),
  (3, "a", "2014-12-10", "2015-01-03", 120),
  (4, "b", "2014-12-05", "2015-01-01", 100)
)).toDF("id", "prodId", "dateIns", "dateTrans", "value")
  .withColumn("dateIns", to_date($"dateIns"))
  .withColumn("dateTrans", to_date($"dateTrans"))
What you need is more or less something like this:
import org.apache.spark.sql.functions.{col, datediff, lit, sum}
// Find difference in tens of days
val diff = (datediff(col("dateTrans"), col("dateIns")) / 10)
  .cast("integer") * 10
val dfWithDiff = df.withColumn("diff", diff)
val aggregated = dfWithDiff
  .where((col("diff") < 30) && (col("diff") >= 0))
  .groupBy(col("prodId"), col("diff"))
  .agg(sum(col("value")))
And the results
aggregated.show
// +------+----+----------+
// |prodId|diff|sum(value)|
// +------+----+----------+
// | a| 20| 120|
// | b| 20| 100|
// | a| 0| 100|
// | a| 10| 150|
// +------+----+----------+
where diff is a lower bound for the range (0 -> [0, 10), 10 -> [10, 20), ...). This will work in PySpark as well if you remove val, replace && with &, and adjust the imports.
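The bucketing arithmetic can be verified on the sample rows with plain Python (no Spark; `bucket` is a hypothetical helper written for this sketch). For non-negative differences, integer division by 10 followed by multiplication by 10 gives the same lower bound as the expression above. Negative differences are a different matter: Python's // floors, whereas casting to integer in Spark truncates toward zero, so the two should not be assumed to agree there.

```python
from collections import defaultdict
from datetime import date

# (id, prodId, dateIns, dateTrans, value) from the sample data above
rows = [
    (1, "a", date(2014, 12, 30), date(2015, 1, 1), 100),
    (2, "a", date(2014, 12, 21), date(2015, 1, 2), 150),
    (3, "a", date(2014, 12, 10), date(2015, 1, 3), 120),
    (4, "b", date(2014, 12, 5), date(2015, 1, 1), 100),
]

def bucket(date_ins, date_trans, width=10):
    """Lower bound of the `width`-day range that the day difference falls into."""
    return (date_trans - date_ins).days // width * width

# Sum the values per (prodId, bucket), keeping only buckets 0, 10 and 20.
totals = defaultdict(int)
for _id, prod_id, date_ins, date_trans, value in rows:
    diff = bucket(date_ins, date_trans)
    if 0 <= diff < 30:
        totals[(prod_id, diff)] += value

print(dict(totals))
```

The totals agree with the aggregated output shown above: the four sample rows fall 2, 12, 24 and 27 days after dateIns.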
Edit (aggregate per column):
import org.apache.spark.sql.functions.when
val exprs = Seq(0, 10, 20).map(x => sum(
  when(col("diff") === lit(x), col("value"))
    .otherwise(lit(0)))
  .alias(x.toString))
dfWithDiff.groupBy(col("prodId")).agg(exprs.head, exprs.tail: _*).show
// +------+---+---+---+
// |prodId| 0| 10| 20|
// +------+---+---+---+
// | a|100|150|120|
// | b| 0| 0|100|
// +------+---+---+---+
with Python equivalent:
# The alias keeps Python's built-in sum from being shadowed
from pyspark.sql.functions import col, lit, when, sum as spark_sum
def make_col(x):
    cnd = when(col("diff") == lit(x), col("value")).otherwise(lit(0))
    return spark_sum(cnd).alias(str(x))
exprs = [make_col(x) for x in range(0, 30, 10)]
dfWithDiff.groupBy(col("prodId")).agg(*exprs).show()
## +------+---+---+---+
## |prodId| 0| 10| 20|
## +------+---+---+---+
## | a|100|150|120|
## | b| 0| 0|100|
## +------+---+---+---+