pandas DataFrame remove Index from columns

I have a DataFrame such that when I execute:
df.columns
I get
Index(['a', 'b', 'c'])
I need to get rid of the Index and have the columns as a list of strings, and was trying:
df.columns = df.columns.tolist()
but this doesn't remove the Index.

tolist() converts the Index object to a plain list:
df1 = df.columns.tolist()
print(df1)
or use values to get a NumPy array instead:
df1 = df.columns.values
The columns attribute of a pandas DataFrame always returns an Index object: you cannot make df.columns hold a plain list (assigning one, as in your original df.columns = df.columns.tolist(), just builds a new Index from it), but you can assign the list to a separate variable.
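A minimal sketch of both points, using the column names 'a', 'b', 'c' from the question (the data values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})

cols = df.columns.tolist()   # plain list of strings
print(cols)                  # ['a', 'b', 'c']

# Assigning a list back to df.columns just rebuilds an Index from it:
df.columns = cols
print(type(df.columns))      # still a pandas Index, not a list
```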

Related

Pandas sort column names by first character after delimiter

I want to sort the columns in a df based on the first letter after the delimiter '_':
df.columns = ['apple_B','cat_A','dog_C','car_D']
df.columns.sort_values(by=df.columns.str.split('_')[1])
TypeError: sort_values() got an unexpected keyword argument 'by'
df.sort_index(axis=1, key=lambda s: s.str.split('_')[1])
ValueError: User-provided `key` function must not change the shape of the array.
Desired columns would be:
'cat_A','apple_B','dog_C','car_D'
Many thanks!
I needed to sort the index labels and then reorder both rows and columns accordingly (in my case the index and columns share the same labels):
sorted_labels = sorted(df.index, key=lambda s: s.split('_')[1])
# reorder rows
df = df.loc[sorted_labels]
# reorder columns
df = df[sorted_labels]
Use sort_index along axis=1 with the extracted part of the string as key:
df = df.sort_index(axis=1, key=lambda s: s.str.extract(r'_(\w+)', expand=False))
Output columns:
['cat_A', 'apple_B', 'dog_C', 'car_D']
You can do:
df.columns = ['apple_B','cat_A','dog_C','car_D']
new_cols = sorted(df.columns, key=lambda s: s.split('_')[1])
df = df[new_cols]
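A self-contained sketch of the sorted()-based approach, with the four column names from the question (the row data is a placeholder):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=['apple_B', 'cat_A', 'dog_C', 'car_D'])

# Sort column names by the part after the underscore, then reindex
new_cols = sorted(df.columns, key=lambda s: s.split('_')[1])
df = df[new_cols]
print(list(df.columns))  # ['cat_A', 'apple_B', 'dog_C', 'car_D']
```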

Pandas concat 2 data frames with different columns: Reindexing only valid with uniquely valued Index objects

I've seen this question before, but for duplicate columns; my columns are different:
df.cols: Index(['keys', 'clicks', 'impressions', 'ctr', 'position'], dtype='object')
split_df.cols: Index(['DEVICE', 'DATE', 'QUERY', 'COUNTRY', 'PAGE'], dtype='object')
The split_df dataframe actually comes from the original df: the keys column of df held lists, and I split each element out into several new columns (see below), which became split_df. Now I'm just trying to concat them back together, but when I concat I get:
Reindexing only valid with uniquely valued Index objects
df = g_conn.get_search_console(ds)
split_df = pd.DataFrame(df['keys'].tolist(), columns=['DEVICE', 'DATE', 'QUERY', 'COUNTRY', 'PAGE'])
df = pd.concat([df, split_df], axis=1)
It appears the duplicates are in your row index, not your columns. With axis=1, concat aligns on the row labels, so it is those that must be unique. For example, this raises that error:
df1 = pd.DataFrame({'A': list('ABCD')}, index=list('1122'))
df2 = pd.DataFrame({'B': list('WXYZ')}, index=list('3455'))
pd.concat([df1,df2], axis=1)
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
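A common fix (a sketch, assuming you don't need to keep the original row labels) is to reset both indexes before concatenating, so the rows align by position:

```python
import pandas as pd

df1 = pd.DataFrame({'A': list('ABCD')}, index=list('1122'))
df2 = pd.DataFrame({'B': list('WXYZ')}, index=list('3455'))

# The duplicate row labels make a plain concat fail; dropping them
# gives both frames a unique 0..n-1 RangeIndex.
out = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)],
    axis=1,
)
print(out)
```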

Convert json array column to list of dict in Pyspark

I created a pandas df from a pyspark df in the following way:
pd_df = (
    df
    .withColumn('city_list', F.struct(F.col('n_res'), F.col('city')))
    .groupBy(['user_ip', 'created_month'])
    .agg(
        F.to_json(
            F.sort_array(
                F.collect_list(F.col('city_list')), asc=False
            )
        ).alias('city_list')
    )
).toPandas()
However the column city_list of my pd_df was not converted into a list of dict but into a string.
Here is an example of a column value: "[{'n_res': 40653, 'city': 00005}, {'n_res': 12498, 'city': 00008}]".
A solution could be to call pd_df.city_list.apply(eval), but I wouldn't consider that a safe approach.
How can I create a list of dict directly from pyspark?
Thanks.
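A safer alternative to eval on the pandas side (a sketch, assuming F.to_json produced valid JSON with double-quoted keys and string city codes, rather than the single-quoted repr shown above) is to parse the column with json.loads:

```python
import json
import pandas as pd

# Hypothetical stand-in for the 'city_list' column produced by F.to_json
pd_df = pd.DataFrame({
    'city_list': [
        '[{"n_res": 40653, "city": "00005"}, {"n_res": 12498, "city": "00008"}]'
    ],
})

# Parse each JSON string into a list of dicts
pd_df['city_list'] = pd_df['city_list'].apply(json.loads)
print(pd_df['city_list'].iloc[0][0])  # {'n_res': 40653, 'city': '00005'}
```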

concat series onto dataframe with column name

I want to add a Series (s) to a Pandas DataFrame (df) as a new column. The series has more values than there are rows in the dataframe, so I am using the concat method along axis 1.
df = pd.concat((df, s), axis=1)
This works, but the new column of the dataframe representing the series is given an arbitrary numerical column name, and I would like this column to have a specific name instead.
Is there a way to add a series to a dataframe, when the series is longer than the rows of the dataframe, and with a specified column name in the resulting dataframe?
You can try Series.rename:
df = pd.concat((df, s.rename('col')), axis=1)
One option is simply to specify the name when creating the series:
example_scores = pd.Series([1,2,3,4], index=['t1', 't2', 't3', 't4'], name='example_scores')
Using the name attribute when creating the series is all I needed.
Try:
df = pd.concat((df, s.rename('CoolColumnName')), axis=1)
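A minimal sketch of the rename approach, with a series one value longer than the frame (the data and the column name 'scores' are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20]})   # 2 rows
s = pd.Series([1, 2, 3])             # 3 values, no name

out = pd.concat((df, s.rename('scores')), axis=1)
print(list(out.columns))  # ['x', 'scores']
# Row 2 exists only in the series, so out['x'] is NaN there.
```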

Assign dataframes in a list to a list of names; pandas

I have a variable
var=[name1,name2]
I have a dataframe also in a list
df= [df1, df2]
How do I assign df1 to name1 and df2 to name2, and so on?
If I understand correctly, then assuming both lists have the same length, you just iterate over the indices and assign, for example:
name1, name2 = None, None
var = [name1, name2]
df1, df2 = 1, 2
df = [df1, df2]

for x in range(len(var)):
    var[x] = df[x]

var
Output:
[1, 2]
If your variable list stores strings, then I would not create variables from those strings (see How do I create a variable number of variables?) and would instead build a dict:
var = ['name1', 'name2']
df1, df2 = 1, 2
df = [df1, df2]
d = dict(zip(var, df))
d
Output:
{'name1': 1, 'name2': 2}
To answer your question literally, you can do this with globals():
for name, frame in zip(var, df):
    globals()[name] = frame
And then access your variables by name.
But proceeding this way is bad practice: it's like letting a dog loose in your global environment. It's better to keep control over what you handle and keep your dataframes in a list or a dictionary.
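A sketch of the recommended dict approach with actual DataFrames instead of the placeholder integers (the names and data are illustrative):

```python
import pandas as pd

names = ['name1', 'name2']
frames = [pd.DataFrame({'a': [1]}), pd.DataFrame({'a': [2]})]

# One dict keeps the frames under control, instead of loose globals
d = dict(zip(names, frames))
print(d['name2']['a'].iloc[0])  # 2
```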