Pandas rename columns does not rename the column - pandas

I tried to do this:
districtident.rename(columns={'leaid':'nces'},inplace=True)
and it failed:
Other things that didn't work:
districtident = districtident.rename(columns={'leaid':'nces'})
The linked questions "Renaming the column names of pandas dataframe is not working as expected" and "renaming columns in pandas doesn't do anything" weren't great either.
Here's an obvious appeal:
Alas, no.
Restarting the kernel didn't work either. The only thing that worked was:
districtident['nces'] = districtident['leaid']
districtident.drop(['leaid'],axis=1,inplace=True)
But that's really not the best, I feel. Especially if I need to do a number of columns.
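
The question does not show the actual column labels, so the cause stays unconfirmed. One thing worth ruling out is that rename() silently skips keys it cannot find, for example when a label carries stray whitespace. A minimal sketch, assuming a made-up stand-in for districtident whose column is really 'leaid ' with a trailing space (the data and the extra 'name' column are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for districtident; note the trailing space in 'leaid '.
districtident = pd.DataFrame({"leaid ": [100, 200], "name": ["A", "B"]})

# By default rename() ignores keys that match no column: no change, no error.
unchanged = districtident.rename(columns={"leaid": "nces"})

# errors="raise" turns the silent no-op into a KeyError.
try:
    districtident.rename(columns={"leaid": "nces"}, errors="raise")
    raised = False
except KeyError:
    raised = True

# repr() makes hidden whitespace in the labels visible.
print([repr(c) for c in districtident.columns])

# Strip the labels, then rename several columns in a single call.
districtident.columns = districtident.columns.str.strip()
renamed = districtident.rename(columns={"leaid": "nces", "name": "district_name"})
print(list(renamed.columns))
```

If the labels turn out to be clean, this does not explain the failure, but the single rename() call with a multi-key dictionary still covers the "number of columns" case.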

Related

Error by conversion from Pyspark to Pandas

I am trying to convert a Pyspark dataframe to Pandas, so I simply write df1=df.toPandas(), and I get the error "ValueError: ordinal must be >= 1". Unfortunately, I don't see any other useful information in the error message (it's quite long, so I cannot post it here).
If somebody has an idea what could be wrong, I would appreciate it.
I only saw this error in the case when a Pyspark dataframe had multiple columns with the same name, but this is not the case this time.
Thanks in advance.
Edit: I have experimented and found that the problem appears only if I select some specific columns. But I don't see what could be wrong with these columns.

how to loop / iterate over multiple dataframes using their names as strings

i have some dataframes
df_1
df_2
…
df_99
df_100
over which i would like to iterate to perform some operations on a specific column, say Column_A, which exists in each dataframe.
i can create strings with the names of the dataframes using
for i in range(1, 101):
    'df_' + str(i)
but when i try to use these to access the dataframes like this
for i in range(1, 101):
    df_x = 'df_' + str(i)
    df_x['Column_A'].someoperation(i)
    # the operation involves the number of the dataframe
i get a TypeError: "string indices must be integers".
I searched extensively and the suggested solution to this kind of problem which i found most often was to create a dictionary with the names of the dataframes as keys and the actual dataframes as the associated values.
However i would not like to proceed like this for two or three reasons:
For one, as i am still rather new to pandas, i am not sure about how to address a specific column in a dataframe which is placed as a value in a dictionary.
Additionally, putting the dataframes in a dictionary would create copies of them (if i understand correctly), which is not ideal if there are very many dataframes or if the dataframes are large.
But most importantly, since i do not know how to iterate over the names, putting the dataframes in a dictionary would have to be done manually, so it is still the same problem in a way.
I tried creating a list with the names of the dataframes to loop over
df_list = []
for i in range(1, 101):
    df_list.append('df_' + str(i))
for df in df_list:
    df['Column_A'].someoperation
but that approach results in the same type error as above - and i cannot conveniently involve the number of the dataframe in "someoperation".
Apparently Python takes df_1, df_2 etc. as the strings they are and not as the names of the already existing dataframes i would like to access, but i don't know how to tell it to do otherwise.
Any suggestions how this could be solved are much appreciated.
You're defining a list of strings, but you're not giving Python any way of knowing that "df_1" is in some way connected to df_1.
To answer your question, you're looking for the eval function, which takes a string, evaluates it as a Python expression, and returns the result. So eval("df_1") will give you the dataframe df_1.
df_list = []
for i in range(1, 101):  #~ look up list comprehensions for a more elegant way to do this.
    df_list.append('df_' + str(i))
for df in df_list:
    eval(df)['Column_A'].someoperation
However, you should take the advice you've gotten and use a dictionary or list. Putting the dataframes in a dictionary would definitely not create copies of them. The dictionary is simply a mapping from a set of strings to the corresponding object in memory. This is also a much more elegant solution, keeping all of the relevant dataframes in one place without having to adhere to a strict naming convention that will inevitably get messed up in some way.
If you don't really need names for each dataframe and just want them accessible together, an even simpler solution would be to put them in a list and access each one by position, dfs[0] through dfs[99] for 100 dataframes.
If you've already got df_1-df_100 loaded the way you're describing, eval will let you organize them all into one place like that: dfs = [eval("df_"+str(i)) for i in range(1,101)] or dfs={i:eval(f"df_{i}") for i in range(1,101)}
Finally, you can access columns and do operations on dataframes accessed through lists and dictionaries in the normal way. E.g.
dfs[0]['column 1'] = 1.
means = dfs[40].groupby('date').mean()
#~ etc.
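
To make the "no copies" point concrete, here is a small sketch. It assumes two tiny made-up dataframes in place of df_1 to df_100, and multiplying Column_A by the dataframe's number stands in for "someoperation"; neither comes from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the existing dataframes.
df_1 = pd.DataFrame({"Column_A": [1, 2]})
df_2 = pd.DataFrame({"Column_A": [3, 4]})

# The dictionary holds references to the same objects, not copies.
dfs = {1: df_1, 2: df_2}
same_object = dfs[1] is df_1
print(same_object)

# The key doubles as the dataframe's number inside the loop.
for i, df in dfs.items():
    df["Column_A"] = df["Column_A"] * i

# The change made through the dictionary is visible on the original name.
print(df_2["Column_A"].tolist())
```

Using integer keys keeps the number available in the loop without parsing it back out of a name like 'df_2'.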

Dataframe Key Error Column not in index

I am reading an excel file into pandas using pd.ExcelFile.
It reads correctly and I can print the dataframe. But when I try to select a subset of columns like:
subdf= origdf[['CUTOMER_ID','ASSET_BAL']]
I get error:
KeyError: "['CUTOMER_ID' 'ASSET_BAL'] not in index"
Do I need to define some kind of index here? When I printed the df, I verified that the columns are there.
Ensure that the columns actually exist in the dataframe. For example, you have written CUTOMER and not CUSTOMER, which I assume is the correct name.
You can verify the column names by using list(origdf.columns.values).
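A quick way to run that check is to compare the requested names against the actual columns before selecting. This is only a sketch: origdf below is a made-up stand-in for the dataframe read from Excel, and the REGION column is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the dataframe read from the Excel file.
origdf = pd.DataFrame(
    {"CUSTOMER_ID": [1, 2], "ASSET_BAL": [10.0, 20.0], "REGION": ["N", "S"]}
)

# Names as typed in the question, including the misspelling.
wanted = ["CUTOMER_ID", "ASSET_BAL"]

# Any name listed here is what triggers the KeyError.
missing = [name for name in wanted if name not in origdf.columns]
print(missing)

# With the spelling corrected, the selection goes through.
subdf = origdf[["CUSTOMER_ID", "ASSET_BAL"]]
print(list(subdf.columns))
```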
And for when you don't have a typo problem, here is a solution:
Use loc instead,
subdf= origdf.loc[:, ['CUSTOMER_ID','ASSET_BAL']].values
(I'd be glad to learn why this one works, though.)

Pandas Removing index column

When I import an Excel file as a data frame in pandas and try to get rid of the first column, I'm unable to do it even though I give index=None. What am I missing?
I ran into this, too. You can make the friend column the index instead. That gets rid of the original index that comes with pd.read_excel(). As @ALollz says, a dataframe always has an index. You get to choose what's in it.
data.set_index('friend', inplace=True)
See examples in the documentation on DataFrame.set_index().
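
A minimal sketch of the idea, assuming a small made-up dataframe in place of the Excel import (the values and the 'age' column are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the dataframe returned by pd.read_excel().
data = pd.DataFrame({"friend": ["Ann", "Bob"], "age": [31, 27]})

# The default index is 0, 1, ...; promote 'friend' to be the index instead.
indexed = data.set_index("friend")
print(indexed)

# The index is still there, it just holds the 'friend' values now.
print(indexed.index.name)

# When exporting, index=False leaves the index out of the output entirely.
csv_text = data.to_csv(index=False)
print(csv_text)
```

If the goal is only to keep the index out of a saved or printed result, index=False on the export side may be enough and the dataframe itself can stay unchanged.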

Pandas conditions on count()

Hi hoping this is not a silly question.
I have a dataframe from which I am plotting a chart based on how many times something appears with the following code.
df.groupby('name').name.count().plot.bar()
plt.xlabel('Name')
plt.ylabel('Number')
plt.title('Number of times name appears')
Is there a way to get it to only plot those names that appear a certain amount of times? I am guessing I need some kind of function but not really sure where to start.
By using value_counts
df.name.value_counts().plot(kind='bar')
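Edit: to keep only the values that appear at least 8 times (Series.compress has been removed from recent pandas versions, so this uses .loc with a callable):
df.group1.value_counts().loc[lambda s: s>=8].plot(kind='bar')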
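
The same filter can be written out step by step. This is a sketch under assumptions: the data is made up, and the threshold of 2 is picked only so the tiny example has something to drop.

```python
import pandas as pd

# Hypothetical data; 'name' matches the column used in the question.
df = pd.DataFrame({"name": ["a", "a", "a", "b", "b", "c"]})

# Count how many times each name appears.
counts = df["name"].value_counts()

# Keep only the names that appear at least `threshold` times.
threshold = 2
frequent = counts[counts >= threshold]
print(frequent)

# Plotting works as before (requires matplotlib):
# frequent.plot(kind="bar")
```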