Drop record based on multiple column values using pyspark - pandas

I have a pyspark dataframe like below:
I want to keep only one record when the two columns uniq_id and date_time have the same values.
Expected output:
I want to achieve this using pyspark.
Thank you

You can group by uniq_id and date_time and use first()
from pyspark.sql import functions as F
df.groupBy("uniq_id", "date_time").agg(F.first("col_1"), F.first("col_2"), F.first("col_3")).show()

I don't see how you can compare an int column and a timestamp one (though it can be done by casting the timestamp to an int), but such a filter can be written as:
from pyspark.sql import functions as F
# assume you already have your DataFrame
df = df.filter(F.col('first_column_name') == F.col('second_column_name'))
or just
df = df.filter('first_column_name = second_column_name')
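If the comparison really is between an int column and a timestamp column, a sketch of the casting mentioned above (int_col and ts_col are hypothetical names) could look like:
from pyspark.sql import functions as F
# casting a timestamp to long yields seconds since the Unix epoch
df = df.filter(F.col('int_col') == F.col('ts_col').cast('long'))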

Related

Pyspark dataframe: creating column based on other column values

I have a pyspark dataframe:
Now, I want to add a new column called "countryAndState", where, for example for the first row, the value would be "USA_CA". I have tried several approaches, the last one was the following:
df_2 = df.withColumn("countryAndState", '{}_{}'.format(df.country, df.state))
I have tried with "country" and "state" instead, or with simply country and state,and also using col() but nothing seems to work. Can anyone help me solve this?
You can't use Python format strings in Spark. Use concat instead:
import pyspark.sql.functions as F
df_2 = df.withColumn("countryAndState", F.concat(F.col('country'), F.lit('_'), F.col('state')))
or concat_ws, if you need to chain many columns together with a given separator:
import pyspark.sql.functions as F
df_2 = df.withColumn("countryAndState", F.concat_ws('_', F.col('country'), F.col('state')))

How to add Extra column with current date in Spark dataframe

I am trying to add one column to my existing Pyspark Dataframe using the withColumn method. I want to insert the current date into this column. My source does not have any date column, so I am adding this current-date column to my dataframe and saving the dataframe to my table, so that later I can use this date column for tracking purposes.
I am using the code below:
df2=df.withColumn("Curr_date",datetime.now().strftime('%Y-%m-%d'))
Here df is my existing Dataframe, and I want to save df2 as a table with the Curr_date column.
But it expects an existing column or the lit method instead of datetime.now().strftime('%Y-%m-%d').
Can someone please guide me on how to add this date column to my dataframe?
Use either lit or current_date:
from datetime import datetime
from pyspark.sql import functions as F
df2 = df.withColumn("Curr_date", F.lit(datetime.now().strftime("%Y-%m-%d")))
# OR
df2 = df.withColumn("Curr_date", F.current_date())
current_timestamp() is good, but it is evaluated once per query, so every row gets the same value.
If you prefer the timestamp of the moment each row is actually processed, you may use the method below:
from pyspark.sql.functions import expr
df = df.withColumn('current', expr("reflect('java.time.LocalDateTime', 'now')"))
There is a spark function current_timestamp().
from pyspark.sql.functions import *
df.withColumn('current', date_format(current_timestamp(), 'yyyy-MM-dd')).show()
+----+----------+
|test| current|
+----+----------+
|test|2020-09-09|
+----+----------+

How to filter in rows where any column is null in pyspark dataframe

It has to be somewhere on stackoverflow already but I'm only finding ways to filter the rows of a pyspark dataframe where 1 specific column is null, not where any column is null.
import pandas as pd
import pyspark.sql.functions as f
my_dict = {"column1":list(range(100)),"column2":["a","b","c",None]*25,"column3":["a","b","c","d",None]*20}
my_pandas_df = pd.DataFrame(my_dict)
sparkDf = spark.createDataFrame(my_pandas_df)
sparkDf.show(5)
I'm trying to include any row with null values on any column of my dataframe, basically the opposite of this:
sparkDf.na.drop()
For including rows having any column with null:
sparkDf.filter(f.greatest(*[f.col(c).isNull() for c in sparkDf.columns])).show(5)
For excluding the same:
sparkDf.na.drop(how='any').show(5)
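An equivalent way to write the "any column is null" condition, sketched below, is to OR the per-column isNull checks together:
from functools import reduce
import pyspark.sql.functions as f
# build one boolean expression: column1 IS NULL OR column2 IS NULL OR ...
any_null = reduce(lambda a, b: a | b, [f.col(c).isNull() for c in sparkDf.columns])
sparkDf.filter(any_null).show(5)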

Replace pyspark column based on other columns

In my "data" dataframe, I have 2 columns, 'time_stamp' and 'hour'. I want to insert 'hour' column values where 'time_stamp' values is missing. I do not want to create a new column, instead fill missing values in 'time_stamp'
What I'm trying to do is replace this pandas code to pyspark code:
data['time_stamp'] = data.apply(lambda x: x['hour'] if pd.isna(x['time_stamp']) else x['time_stamp'], axis=1)
Something like this should work:
from pyspark.sql import functions as f
# use hour when time_stamp is null, otherwise keep time_stamp
df = df.withColumn('time_stamp',
                   f.expr('case when time_stamp is null then hour else time_stamp end'))
Alternatively, if you don't like SQL expressions:
df = df.withColumn('time_stamp', f.when(f.col('time_stamp').isNull(), f.col('hour')).otherwise(f.col('time_stamp')))
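Since the logic is just "take hour when time_stamp is null", coalesce is another option (a sketch using the same columns):
from pyspark.sql import functions as f
# coalesce returns, per row, the first non-null value among its arguments
df = df.withColumn('time_stamp', f.coalesce(f.col('time_stamp'), f.col('hour')))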

pandas HDFStore select rows with non-null values in the data column

In a pandas DataFrame/Series there is an .isnull() method. Is there something similar in the syntax of the where= filter of HDFStore's select method?
WORKAROUND SOLUTION:
The /meta section of a data column inside hdf5 can be used as a hack solution:
import pandas as pd
store = pd.HDFStore('store.h5')
print(store.groups())
# read the values recorded under the column's /meta node and use them as the filter list
non_null = list(store.select("/df/meta/my_data_column/meta"))
df = store.select('df', where='my_data_column == non_null')
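Another way to approximate isnull(), sketched here on the assumption that my_data_column was saved as a data column and its nulls are stored as NaN, is to read just that column with select_column, compute the non-null row coordinates in pandas, and pass them back to select:
import pandas as pd

store = pd.HDFStore('store.h5')
# read only the data column; the result is a Series indexed by row number
col = store.select_column('df', 'my_data_column')
# row coordinates where the column is not null
coords = col[col.notnull()].index
df = store.select('df', where=coords)
store.close()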