Create new column in Pyspark Dataframe by filling existing Column - dataframe

I am trying to create new column in an existing Pyspark DataFrame. Currently the DataFrame looks as follows:
+----+----+---+----+----+----+----+
|Acct| M1D|M1C| M2D| M2C| M3D| M3C|
+----+----+---+----+----+----+----+
| B| 10|200|null|null| 20|null|
| C|1000|100| 10|null|null|null|
| A| 100|200| 200| 200| 300| 10|
+----+----+---+----+----+----+----+
I want to fill null values in column M2C with 0 and create a new column Ratio. My expected output would be as follows:
+------+------+-----+------+------+------+------+-------+
| Acct | M1D | M1C | M2D | M2C | M3D | M3C | Ratio |
+------+------+-----+------+------+------+------+-------+
| B | 10 | 200 | null | null | 20 | null | 0 |
| C | 1000 | 100 | 10 | null | null | null | 0 |
| A | 100 | 200 | 200 | 200 | 300 | 10 | 200 |
+------+------+-----+------+------+------+------+-------+
I was trying to achieve my desired results by using following lines of code.
df = df.withColumn('Ratio', df.select('M2C').na.fill(0))
The above line of code resulted in an assertion error as shown below.
AssertionError: col should be Column
The possible solution that I found using this link was to use lit function.
I changed my code to
df = df.withColumn('Ratio', lit(df.select('M2C').na.fill(0)))
The above code led to AttributeError: 'DataFrame' object has no attribute '_get_object_id'
How can I achieve my desired output?

You're doing two things wrong here.
df.select will return a dataframe, not a column.
na.fill will replace null values in all columns, not just in specific columns.
The following code snippet will solve your usecase
from pyspark.sql.functions import col
df = df.withColumn('Ratio', col('M2C')).fillna(0, subset=['Ratio'])

Related

add corresponding columns for every column of the spark df that assigns 1 for every notNull in the original column using case when

I have a sample dataframe:
df = spark.createDataFrame([('name1','id1',1,None,3),('name2','id2',None,2,5)],['NAME','personID','col1','col2','col3'])
My use case has 15 columns
What I would like to do is using case when and loop, to add new columns that correspond to each column from the original except the first two columns. Within those new columns, it will give a value of 1 if notNull, otherwise 0.
I am aiming to get something like below:
+--------+--------+--------+-------+-------+-------+------+------+
|Name | ID | col1 | col2 | col3 | col1_N|col2_N|col3_N|
+--------+--------+--------+-------+-------+-------+------+------+
|name1 | id1 | 1 | Null | 3 | 1 | 0 | 1 |
|name2 | id2 | Null | 2 | 5 | 0 | 1 | 1 |
+--------+--------+--------+-------+-------+-------+------+------+
the first five columns are the original columns, the last three columns will be added with corresponding 1 or 0 from 'col1', 'col2', and 'col3' values.
The last code/s I am working on creates a new one but does not keep the original dataframe values.
df.select([when(col(c).isNotNull(), 1).otherwise(0).alias(c + '_N') for c in df.columns])
for which I get:
+-------+-------+-------+------+------+
| Name_N| ID_N | col1_N|col2_N|col3_N|
+-------+-------+-------+------+------+
| 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 |
+-------+-------+-------+------+------+
The above could have been acceptable but I need to keep the original values of Name and ID columns.
I got an InvalidArgument with below:
df.select(['*'],[when(col(c).isNotNull(), 1).otherwise(0).alias(c + '_N') for c in df.columns])
TypeError: Invalid argument, not a string or column: ['*'] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
I thought selecting all first will give me all the columns of the original
UPDATE:
somehow this worked, but I only get the last column:
for c in df.columns[2:]:
sdf = df.withColumn(c+'_N', when(col(c).isNotNull(),1).otherwise(0))
but this is what I get:
+--------+--------+--------+-------+-------+------+
|Name | ID | col1 | col2 | col3 |col3_N|
+--------+--------+--------+-------+-------+------+
|name1 | id1 | 1 | Null | 3 | 1 |
|name2 | id2 | Null | 2 | 5 | 1 |
+--------+--------+--------+-------+-------+------+
``
I only got the last original column
Using list comprehension as show below will give expected result.
df.select([col(c) if c in ['NAME', 'personID'] else when(col(c).isNotNull(), 1).otherwise(0).alias(f"{c}_N") for c in df.columns]).show()
+-----+--------+------+------+------+
| NAME|personID|col1_N|col2_N|col3_N|
+-----+--------+------+------+------+
|name1| id1| 1| 0| 1|
|name2| id2| 0| 1| 1|
+-----+--------+------+------+------+
Just fix your 1st approach with specifying a slice of columns and simplify boolean condition:
df.select([col(c).isNotNull().cast("integer").alias(c + '_N') for c in df.columns[2:]])

Pandas apply to a range of columns

Given the following dataframe, I would like to add a fifth column that contains a list of column headers when a certain condition is met on a row, but only for a range of dynamically selected columns (ie subset of the dataframe)
| North | South | East | West |
|-------|-------|------|------|
| 8 | 1 | 8 | 6 |
| 4 | 4 | 8 | 4 |
| 1 | 1 | 1 | 2 |
| 7 | 3 | 7 | 8 |
For instance, given that the inner two columns ('South', 'East') are selected and that column headers are to be returned when the row contains the value of one (1), the expected output would look like this:
Headers
|---------------|
| [South] |
| |
| [South, East] |
| |
The following one liner manages to return column headers for the entire dataframe.
df['Headers'] = df.apply(lambda x: df.columns[x==1].tolist(),axis=1)
I tried adding the dynamic column range condition by using iloc but to no avail. What am I missing?
For reference, these are my two failed attempts (N1 and N2 being column range variables here)
df['Headers'] = df.iloc[N1:N2].apply(lambda x: df.columns[x==1].tolist(),axis=1)
df['Headers'] = df.apply(lambda x: df.iloc[N1:N2].columns[x==1].tolist(),axis=1)
This works:
df=pd.DataFrame({'North':[8,4,1,7],'South':[1,4,1,3],'East':[8,8,1,7],\
'West':[6,4,2,8]})
df1=df.melt(ignore_index=False)
condition1=df1['variable']=='South'
condition2=df1['variable']=='East'
condition3=df1['value']==1
df1=df1.loc[(condition1|condition2)&condition3]
df1=df1.groupby(df1.index)['variable'].apply(list)
df=df.join(df1)

Check the elements in two columns of different dataframes

I have two dataframes.
Df1
Id | Name | Remarks
---------------------
1 | A | Not bad
1 | B | Good
2 | C | Very bad
Df2
Id | Name | Place |Job
-----------------------
1 | A | Can | IT
2 |C | Cbe | CS
4 |L | anc | ME
5 | A | cne | IE
Output
Id | Name | Remarks |Results
------------------------------
1 | A | Not bad |True
1 | B | Good |False
2 | C | VeryGood |True
That is the result should be true if same id and name are present in both dataframes. I tried
df1['Results']=np.where(Df1['id','Name'].isin(Df2['Id','Name']),'true','false')
But it was not successful.
Use DataFrame.merge with indicator parameter and compare both values:
df = Df1[['id','Name']].merge(Df2[['Id','Name']], indicator='Results', how='left')
df['Results'] = df['Results'].eq('both')
Your solution is possible by compare index values by DataFrame.set_index with Index.isin:
df1['Results']= Df1.set_index(['id','Name']).index.isin(Df2.set_index(['id','Name']).index)
Or compare tuples from both columns:
df1['Results']= Df1[['id','Name']].agg(tuple, 1).isin(Df2[['id','Name']].agg(tuple, 1))
You can easily achieve by merge like #jezrael 's answer.
You can also achieve it with np.where,list comprehension and zip like below:
df1['Results']=np.where([str(i)+'_'+str(j)==str(k)+'_'+str(l) for i,j,k,l in zip(Df1['ID'],Df1['Name'],Df2['ID'],Df2['Name'])],True,False)

Using PySpark window functions with conditions to add rows

I have a need to be able to add new rows to a PySpark df will values based upon the contents of other rows with a common id. There will eventually millions of ids with lots rows for each id. I have tried the below method which works but seems overly complicated.
I start with a df in the format below (but in reality have more columns):
+-------+----------+-------+
| id | variable | value |
+-------+----------+-------+
| 1 | varA | 30 |
| 1 | varB | 1 |
| 1 | varC | -9 |
+-------+----------+-------+
Currently I am pivoting this df to get it in the following format:
+-----+------+------+------+
| id | varA | varB | varC |
+-----+------+------+------+
| 1 | 30 | 1 | -9 |
+-----+------+------+------+
On this df I can then use the standard withColumn and when functionality to add new columns based on the values in other columns. For example:
df = df.withColumn("varD", when((col("varA") > 16) & (col("varC") != -9)), 2).otherwise(1)
Which leads to:
+-----+------+------+------+------+
| id | varA | varB | varC | varD |
+-----+------+------+------+------+
| 1 | 30 | 1 | -9 | 1 |
+-----+------+------+------+------+
I can then pivot this df back to the original format leading to this:
+-------+----------+-------+
| id | variable | value |
+-------+----------+-------+
| 1 | varA | 30 |
| 1 | varB | 1 |
| 1 | varC | -9 |
| 1 | varD | 1 |
+-------+----------+-------+
This works but seems like it could, with millions of rows, lead to expensive and unnecessary operations. It feels like it should be doable without the need to pivot and unpivot the data. Do I need to do this?
I have read about Window functions and it sounds as if they may be another way to achieve the same result but to be honest I am struggling to get started with them. I can see how they can be used to generate a value, say a sum, for each id, or to find a maximum value but have not found a way to even get started on applying complex conditions that lead to a new row.
Any help to get started with this problem would be gratefully received.
You can use pandas_udf for adding/deleting rows/col on grouped data, and implement your processing logic in pandas udf.
import pyspark.sql.functions as F
row_schema = StructType(
[StructField("id", IntegerType(), True),
StructField("variable", StringType(), True),
StructField("value", IntegerType(), True)]
)
#F.pandas_udf(row_schema, F.PandasUDFType.GROUPED_MAP)
def addRow(pdf):
val = 1
if (len(pdf.loc[(pdf['variable'] == 'varA') & (pdf['value'] > 16)]) > 0 ) & \
(len(pdf.loc[(pdf['variable'] == 'varC') & (pdf['value'] != -9)]) > 0):
val = 2
return pdf.append(pd.Series([1, 'varD', val], index=['id', 'variable', 'value']), ignore_index=True)
df = spark.createDataFrame([[1, 'varA', 30],
[1, 'varB', 1],
[1, 'varC', -9]
], schema=['id', 'variable', 'value'])
df.groupBy("id").apply(addRow).show()
which resuts
+---+--------+-----+
| id|variable|value|
+---+--------+-----+
| 1| varA| 30|
| 1| varB| 1|
| 1| varC| -9|
| 1| varD| 1|
+---+--------+-----+

How to get distinct value, count of a column in dataframe and store in another dataframe as (k,v) pair using Spark2 and Scala

I want to get the distinct values and their respective counts of every column of a dataframe and store them as (k,v) in another dataframe.
Note: My Columns are not static, they keep changing. So, I cannot hardcore the column names instead I should loop through them.
For Example, below is my dataframe
+----------------+-----------+------------+
|name |country |DOB |
+----------------+-----------+------------+
| Blaze | IND| 19950312|
| Scarlet | USA| 19950313|
| Jonas | CAD| 19950312|
| Blaze | USA| 19950312|
| Jonas | CAD| 19950312|
| mark | USA| 19950313|
| mark | CAD| 19950313|
| Smith | USA| 19950313|
| mark | UK | 19950313|
| scarlet | CAD| 19950313|
My final result should be created in a new dataframe as (k,v) where k is the distinct record and v is the count of it.
+----------------+-----------+------------+
|name |country |DOB |
+----------------+-----------+------------+
| (Blaze,2) | (IND,1) |(19950312,3)|
| (Scarlet,2) | (USA,4) |(19950313,6)|
| (Jonas,3) | (CAD,4) | |
| (mark,3) | (UK,1) | |
| (smith,1) | | |
Can anyone please help me with this, I'm using Spark 2.4.0 and Scala 2.11.12
Note: My columns are dynamic, so I can't hardcore the columns and do groupby on them.
I don't have exact solution to your query but I can surely provide you with some help that can get you started working on your issue.
Create dataframe
scala> val df = Seq(("Blaze ","IND","19950312"),
| ("Scarlet","USA","19950313"),
| ("Jonas ","CAD","19950312"),
| ("Blaze ","USA","19950312"),
| ("Jonas ","CAD","19950312"),
| ("mark ","USA","19950313"),
| ("mark ","CAD","19950313"),
| ("Smith ","USA","19950313"),
| ("mark ","UK ","19950313"),
| ("scarlet","CAD","19950313")).toDF("name", "country","dob")
Next calculate count of distinct element of each column
scala> val distCount = df.columns.map(c => df.groupBy(c).count)
Create a range to iterate over distCount
scala> val range = Range(0,distCount.size)
range: scala.collection.immutable.Range = Range(0, 1, 2)
Aggregate your data
scala> val aggVal = range.toList.map(i => distCount(i).collect().mkString).toSeq
aggVal: scala.collection.immutable.Seq[String] = List([Jonas ,2][Smith ,1][Scarlet,1][scarlet,1][mark ,3][Blaze ,2], [CAD,4][USA,4][IND,1][UK ,1], [19950313,6][19950312,4])
Create data frame:
scala> Seq((aggVal(0),aggVal(1),aggVal(2))).toDF("name", "country","dob").show()
+--------------------+--------------------+--------------------+
| name| country| dob|
+--------------------+--------------------+--------------------+
|[Jonas ,2][Smith...|[CAD,4][USA,4][IN...|[19950313,6][1995...|
+--------------------+--------------------+--------------------+
I hope this helps you in some way.