How do I specify a default value when the value is "null" in a Spark dataframe?

I have a data frame like the picture below.
Where the value of the "item_param" column is "null", I want to replace it with the string 'test'.
How can I do it?
from pyspark.sql.functions import col, explode, from_json, from_unixtime, map_values
from pyspark.sql.types import MapType, StringType

df = sv_df.withColumn("srv_name", col('col.srv_name'))\
.withColumn("srv_serial", col('col.srv_serial'))\
.withColumn("col2",explode('col.groups'))\
.withColumn("groups_id", col('col2.group_id'))\
.withColumn("col3", explode('col2.items'))\
.withColumn("item_id", col('col3.item_id'))\
.withColumn("item_param", from_json(col("col3.item_param"), MapType(StringType(), StringType())) ) \
.withColumn("item_param", map_values(col("item_param"))[0])\
.withColumn("item_time", col('col3.item_time'))\
.withColumn("item_time", from_unixtime( col('col3.item_time')/10000000 - 11644473600))\
.withColumn("item_value",col('col3.item_value'))\
.drop("servers","col","col2", "col3")
df.show(truncate=False)
df.printSchema()

Use coalesce:
.withColumn("item_param", coalesce(col("item_param"), lit("someDefaultValue"))
It will apply the first column/expression which is not null
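A minimal, self-contained sketch of the same idea (the toy frame, its values, and the session setup are assumptions, not the asker's data):
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, lit

spark = SparkSession.builder.getOrCreate()

# Stand-in frame: only 'item_param' matters here (made-up values)
df = spark.createDataFrame([("a", None), ("b", "x")], ["item_id", "item_param"])
df = df.withColumn("item_param", coalesce(col("item_param"), lit("test")))
df.show()
# +-------+----------+
# |item_id|item_param|
# +-------+----------+
# |      a|      test|
# |      b|         x|
# +-------+----------+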

You can use fillna, which allows you to replace the null values in all columns, in a subset of columns, or per selected column individually (see the docs).
# All values
df = df.fillna(0)
# Subset of columns
df = df.fillna(0, subset=['a', 'b'])
# Per selected column
df = df.fillna( { 'a':0, 'b':-1 } )
In your case it would be:
df = df.fillna( {'item_param': 'test'} )

Related

Pandas sort column names by first character after delimiter

I want to sort the columns in a df based on the first letter after the delimiter '_':
df.columns = ['apple_B','cat_A','dog_C','car_D']
df.columns.sort_values(by=df.columns.str.split('-')[1])
TypeError: sort_values() got an unexpected keyword argument 'by'
df.sort_index(axis=1, key=lambda s: s.str.split('-')[1])
ValueError: User-provided `key` function must not change the shape of the array.
Desired columns would be:
'cat_A','apple_B','dog_C','car_D'
Many thanks!
I needed to sort the names by their suffix and then reorder both the index and the columns accordingly:
sorted_index = sorted(df.index, key=lambda s: s.split('_')[1])
# reorder index
df = df.loc[sorted_index]
# reorder columns
df = df[sorted_index]
Use sort_index with the extracted part of the string as key:
df.sort_index(axis=1, key=lambda s: s.str.extract(r'_(\w+)', expand=False))
Output columns:
[cat_A, apple_B, dog_C, car_D]
You can do:
df.columns = ['apple_B','cat_A','dog_C','car_D']
new_cols = sorted(df.columns, key=lambda s: s.split('_')[1])
df = df[new_cols]
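As a quick sanity check on a throwaway frame (the row values are arbitrary):
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=['apple_B', 'cat_A', 'dog_C', 'car_D'])
new_cols = sorted(df.columns, key=lambda s: s.split('_')[1])
df = df[new_cols]
print(df.columns.tolist())  # ['cat_A', 'apple_B', 'dog_C', 'car_D']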

python: aggregate columns in pivot table with multiindex structure

If I have a multi-index pivot table like this:
what would be the way to aggregate the total 'sum' and 'count' across all dates?
I want to see an additional column with totals for all rows in the table.
Thanks to @Nik03 for the idea. The concat call returns the required data frame, but with a single index level. To add it to the original dataframe, you have to create the columns first and assign the new frame's values to them:
table_to_show = pd.concat([table_to_record.filter(like='sum').sum(axis=1),
                           table_to_record.filter(like='count').sum(axis=1)], axis=1)
table_to_show.columns = ['sum', 'count']
table_to_record['total_sum'] = table_to_show['sum']
table_to_record['total_count'] = table_to_show['count']
# move the two total columns to the front
column_1st = table_to_record.pop('total_sum')
column_2nd = table_to_record.pop('total_count')
table_to_record.insert(0, 'total_sum', column_1st)
table_to_record.insert(1, 'total_count', column_2nd)
and here is the result:
One way:
df1 = pd.concat([df.filter(like='sum').sum(axis=1),
                 df.filter(like='mean').sum(axis=1)], axis=1)
df1.columns = ['sum', 'mean']
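A minimal sketch of the pattern on made-up data (single-level column names for brevity; filter(like=...) matches against the stringified label, so it also catches MultiIndex columns):
import pandas as pd

# Hypothetical table: one 'sum' and one 'count' column per period
df = pd.DataFrame({'jan_sum': [1, 2], 'jan_count': [3, 4],
                   'feb_sum': [5, 6], 'feb_count': [7, 8]})
totals = pd.concat([df.filter(like='sum').sum(axis=1),
                    df.filter(like='count').sum(axis=1)], axis=1)
totals.columns = ['total_sum', 'total_count']
print(totals)
#    total_sum  total_count
# 0          6           10
# 1          8           12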

How to drop a pandas column based on number of values in it?

Turns out that when trying to drop a column with categorical data (0s and 1s) I cannot get the desired result. I have tried several procedures but they all yield the same result: the dataframe itself with all columns.
df1.drop([i for i in df1 if df1[i].nunique == 2], axis = 1, inplace = True)
That's one way I tried. Another one is as follows:
df1.drop(df.columns[df.apply(lambda col: col.nunique == 2)], axis = 1)
Can anyone help? Thanks
One approach could be to get all the columns which are boolean and drop them, as below. This will work if the data type of the column is correctly classified; check the column's .dtypes and adjust the comparison as appropriate.
bool_cols = []
for col in df:
    if df[col].dtypes == "bool":
        bool_cols.append(col)
# drop the collected boolean columns in one go
df = df.drop(columns=bool_cols)
Your first attempt is almost perfect. You just need to add () to df1[i].nunique so that it becomes:
df1.drop([i for i in df1 if df1[i].nunique() == 2], axis=1, inplace=True)
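To see it work on a throwaway frame (column names and values are made up):
import pandas as pd

# 'flag' holds only 0s and 1s; 'value' has four distinct values
df1 = pd.DataFrame({'flag': [0, 1, 0, 1], 'value': [10, 20, 30, 40]})
df1.drop([i for i in df1 if df1[i].nunique() == 2], axis=1, inplace=True)
print(df1.columns.tolist())  # ['value']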

Replace pyspark column based on other columns

In my "data" dataframe, I have 2 columns, 'time_stamp' and 'hour'. I want to insert 'hour' column values where 'time_stamp' values is missing. I do not want to create a new column, instead fill missing values in 'time_stamp'
What I'm trying to do is replace this pandas code to pyspark code:
data['time_stamp'] = data.apply(lambda x: x['hour'] if pd.isna(x['time_stamp']) else x['time_stamp'], axis=1)
Something like this should work:
from pyspark.sql import functions as f

df = df.withColumn('time_stamp',
                   f.expr('case when time_stamp is null then hour else time_stamp end'))
Alternatively, if you don't like SQL expressions:
df = df.withColumn('time_stamp',
                   f.when(f.col('time_stamp').isNull(), f.col('hour'))
                    .otherwise(f.col('time_stamp')))
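Putting it together on a toy frame (values made up; both columns are numeric here so the two branches share a type):
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

# Stand-in for 'data': the first row is missing its time_stamp
df = spark.createDataFrame([(None, 7), (1609459200, 8)], ["time_stamp", "hour"])
df = df.withColumn('time_stamp',
                   f.when(f.col('time_stamp').isNull(), f.col('hour'))
                    .otherwise(f.col('time_stamp')))
df.show()
# +----------+----+
# |time_stamp|hour|
# +----------+----+
# |         7|   7|
# |1609459200|   8|
# +----------+----+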

concat series onto dataframe with column name

I want to add a Series (s) to a Pandas DataFrame (df) as a new column. The series has more values than there are rows in the dataframe, so I am using the concat method along axis 1.
df = pd.concat((df, s), axis=1)
This works, but the new column of the dataframe representing the series is given an arbitrary numerical column name, and I would like this column to have a specific name instead.
Is there a way to add a series to a dataframe, when the series is longer than the rows of the dataframe, and with a specified column name in the resulting dataframe?
You can try Series.rename:
df = pd.concat((df, s.rename('col')), axis=1)
One option is simply to specify the name when creating the series:
example_scores = pd.Series([1,2,3,4], index=['t1', 't2', 't3', 't4'], name='example_scores')
Using the name attribute when creating the series is all I needed.
Try:
df = pd.concat((df, s.rename('CoolColumnName')), axis=1)
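For instance (made-up values), the series' extra rows extend the result's index, and the frame's own columns get NaN there:
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
s = pd.Series([10, 20, 30], name='CoolColumnName')
out = pd.concat((df, s), axis=1)  # outer join on the index
print(out)
#      a  CoolColumnName
# 0  1.0              10
# 1  2.0              20
# 2  NaN              30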