Pandas: sort column names by first character after delimiter

I want to sort the columns in a df based on the first letter after the delimiter '_':
df.columns = ['apple_B','cat_A','dog_C','car_D']
df.columns.sort_values(by=df.columns.str.split('-')[1])
TypeError: sort_values() got an unexpected keyword argument 'by'
df.sort_index(axis=1, key=lambda s: s.str.split('-')[1])
ValueError: User-provided `key` function must not change the shape of the array.
Desired columns would be:
'cat_A','apple_B','dog_C','car_D'
Many thanks!
I needed to sort the names by the part after the delimiter and then reorder the rows and columns accordingly:
sorted_index = sorted(df.index, key=lambda s: s.split('_')[1])
# reorder index
df = df.loc[sorted_index]
# reorder columns
df = df[sorted_index]

Use sort_index with the extracted part of the string as key:
df.sort_index(axis=1, key=lambda s: s.str.extract(r'_(\w+)', expand=False))
Output columns:
[cat_A, apple_B, dog_C, car_D]

You can do:
df.columns = ['apple_B','cat_A','dog_C','car_D']
new_cols = sorted(df.columns, key=lambda s: s.split('_')[1])
df = df[new_cols]
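Putting the pieces together, here is a minimal runnable sketch of the sort_index approach (the data values are invented for illustration; the `key` argument requires pandas >= 1.1):

```python
import pandas as pd

# Toy frame whose column names carry the sort key after the underscore
df = pd.DataFrame([[1, 2, 3, 4]], columns=['apple_B', 'cat_A', 'dog_C', 'car_D'])

# Sort columns by the part after '_'; the key receives the whole Index at once
df = df.sort_index(axis=1, key=lambda idx: idx.str.split('_').str[1])

print(list(df.columns))  # ['cat_A', 'apple_B', 'dog_C', 'car_D']
```

Note the `.str[1]` element access rather than plain `[1]`: indexing the split result directly (as in the question's attempts) selects one row of the Index instead of one piece of every name, which is why the shape-change error appeared.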

Related

pandas DataFrame remove Index from columns

I have a dataFrame, such that when I execute:
df.columns
I get
Index(['a', 'b', 'c'])
I need to remove Index to have columns as list of strings, and was trying:
df.columns = df.columns.tolist()
but this doesn't remove Index.
tolist() should be able to convert the Index object to a list:
df1 = df.columns.tolist()
print(df1)
or use values to convert it to an array:
df1 = df.columns.values
The columns attribute of a pandas DataFrame always returns an Index object, so you cannot make df.columns itself a list: assigning a list back (as in your original code df.columns = df.columns.tolist()) just rebuilds an Index from it. You can, however, assign the list to a separate variable.
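For illustration, a tiny runnable sketch (toy frame) showing both conversions:

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})

cols_list = df.columns.tolist()  # plain Python list of strings
cols_arr = df.columns.values     # NumPy array of the same names

print(cols_list)  # ['a', 'b', 'c']
print(type(df.columns))  # still an Index on the frame itself
```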

How to remove part of the column name?

I have a DataFrame with several columns like:
'clientes_enderecos.CEP', 'tabela_clientes.RENDA','tabela_produtos.cod_ramo', 'tabela_qar.chave', etc
I want to rename the columns, removing the text on one side of the dot.
I only know the method DataFrame.rename(columns={'A': 'a', 'B': 'b'}).
But to name them as they are now I used:
df_tabela_clientes.columns = ["tabela_clientes." + str(col) for col in df_tabela_clientes.columns]
How could I reverse the process?
Try rename with lambda and string manipulation:
df = pd.DataFrame(columns=['clientes_enderecos.CEP', 'tabela_clientes.RENDA','tabela_produtos.cod_ramo', 'tabela_qar.chave'])
print(df)
# Empty DataFrame
# Columns: [clientes_enderecos.CEP, tabela_clientes.RENDA, tabela_produtos.cod_ramo, tabela_qar.chave]
# Index: []
dfc = df.rename(columns=lambda x: x.split('.')[-1])
print(dfc)
#Empty DataFrame
#Columns: [CEP, RENDA, cod_ramo, chave]
#Index: []
To get rid of what's on either side of the dot, you can split the column names and keep whichever side you want.
import pandas as pd
df = pd.DataFrame(columns=['clientes_enderecos.CEP', 'tabela_clientes.RENDA','tabela_produtos.cod_ramo', 'tabela_qar.chave'])
df.columns = [name.split('.')[0] for name in df.columns] # 0: before the dot | 1:after the dot

How do I specify a default value when the value is "null" in a spark dataframe?

I have a data frame like the picture below.
Where the value in the "item_param" column is null, I want to replace it with the string 'test'.
How can I do it?
df = sv_df.withColumn("srv_name", col('col.srv_name'))\
.withColumn("srv_serial", col('col.srv_serial'))\
.withColumn("col2",explode('col.groups'))\
.withColumn("groups_id", col('col2.group_id'))\
.withColumn("col3", explode('col2.items'))\
.withColumn("item_id", col('col3.item_id'))\
.withColumn("item_param", from_json(col("col3.item_param"), MapType(StringType(), StringType())) ) \
.withColumn("item_param", map_values(col("item_param"))[0])\
.withColumn("item_time", col('col3.item_time'))\
.withColumn("item_time", from_unixtime( col('col3.item_time')/10000000 - 11644473600))\
.withColumn("item_value",col('col3.item_value'))\
.drop("servers","col","col2", "col3")
df.show(truncate=False)
df.printSchema()
Use coalesce:
.withColumn("item_param", coalesce(col("item_param"), lit("someDefaultValue")))
It will apply the first column/expression that is not null.
You can use fillna, which allows you to replace the null values in all columns, a subset of columns, or each column individually (see the DataFrame.fillna docs).
# All values
df = df.fillna(0)
# Subset of columns
df = df.fillna(0, subset=['a', 'b'])
# Per selected column
df = df.fillna( { 'a':0, 'b':-1 } )
In your case it would be:
df = df.fillna( {'item_param': 'test'} )

concat series onto dataframe with column name

I want to add a Series (s) to a Pandas DataFrame (df) as a new column. The series has more values than there are rows in the dataframe, so I am using the concat method along axis 1.
df = pd.concat((df, s), axis=1)
This works, but the new column of the dataframe representing the series is given an arbitrary numerical column name, and I would like this column to have a specific name instead.
Is there a way to add a series to a dataframe, when the series is longer than the rows of the dataframe, and with a specified column name in the resulting dataframe?
You can try Series.rename:
df = pd.concat((df, s.rename('col')), axis=1)
One option is simply to specify the name when creating the series:
example_scores = pd.Series([1,2,3,4], index=['t1', 't2', 't3', 't4'], name='example_scores')
Using the name attribute when creating the series is all I needed.
Try:
df = pd.concat((df, s.rename('CoolColumnName')), axis=1)
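A minimal sketch (toy data, hypothetical column name 'scores') of the rename-then-concat pattern, also showing what happens when the Series is longer than the frame:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20]})       # 2 rows
s = pd.Series([1, 2, 3])                 # 3 values, longer than df

# Naming the Series via rename gives the new column its name
out = pd.concat((df, s.rename('scores')), axis=1)

print(out.columns.tolist())  # ['x', 'scores']
print(len(out))              # 3 -- indexes are unioned; df's missing cell becomes NaN
```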

Conditional on pandas DataFrame's

Let df1, df2, and df3 be pandas DataFrames with the same structure but different numerical values. I want to compute:
res = (df2 - df3) / (df1 - 1) where df1 > 1.0, else df3
res should have the same structure as df1, df2, and df3.
numpy.where() returns a plain NumPy array, losing the index and columns.
Edit 1:
res should have the same indices as df1, df2, and df3 have.
For example, I can access df2 as df2["instanceA"]["parameter1"]["parameter2"]. I want to access the new calculated DataFrame/Series res as res["instanceA"]["parameter1"]["parameter2"].
Actually numpy.where should work fine there. Output here is 4x2 (same as df1, df2, df3).
df1 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df2 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df3 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
res = df3.copy()
res[:] = np.where( df1 > 1, (df2-df3)/(df1-1), df3 )
x y
0 -0.671787 -0.445276
1 -0.609351 -0.881987
2 0.324390 1.222632
3 -0.138606 0.955993
Note that this should work on both series and dataframes. The [:] is slicing syntax that preserves the index and columns. Without that res will come out as an array rather than series or dataframe.
Alternatively, for a Series you could write it as @Kadir does in his answer:
res = pd.Series(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index)
Or similarly for a dataframe you could write:
res = pd.DataFrame(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index,
columns=df1.columns)
Integrating the idea in this question into JohnE's answer, I have come up with this solution:
res = pd.Series(np.where( df1 > 1, (df2-df3)/(df1-1), df3 ), index=df1.index)
A better answer using DataFrames will be appreciated.
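One DataFrame-native alternative, not mentioned in the answers above and offered here as a suggestion, is DataFrame.mask, which replaces values where a condition is True and preserves the index and columns automatically:

```python
import pandas as pd

# Deterministic toy data so the result is easy to check
df1 = pd.DataFrame({'x': [0.5, 2.0], 'y': [3.0, 0.0]})
df2 = pd.DataFrame({'x': [1.0, 1.0], 'y': [1.0, 1.0]})
df3 = pd.DataFrame({'x': [9.0, 9.0], 'y': [9.0, 9.0]})

# Where df1 > 1, replace df3's values with (df2-df3)/(df1-1); elsewhere keep df3
res = df3.mask(df1 > 1, (df2 - df3) / (df1 - 1))
print(res)
```

Because mask operates on the DataFrame itself, res keeps df3's index and columns without the `[:]` assignment trick.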
Say df is your initial DataFrame and res is the new column. Use a combination of setting values and boolean indexing.
Set res to be a copy of df3:
df['res'] = df['df3']
Then adjust the values where your condition holds (use .loc for the assignment; chained indexing like df[cond]['res'] = ... writes to a copy and silently does nothing):
df.loc[df['df1'] > 1.0, 'res'] = (df['df2'] - df['df3'])/(df['df1'] - 1)