Error by conversion from Pyspark to Pandas

Error by conversion from Pyspark to Pandas - pandas

I am trying to convert a Pyspark dataframe to Pandas, so I simply write df1=df.toPandas(), and I get the error "ValueError: ordinal must be >= 1". Unfortunately, I don't see any other usefull information in the error message (it's quite long so i cannot post it here).
If somebody has an idea, what could be wrong, it would be nice.
I only saw this error in the case when a Pyspark dataframe had multiple columns with the same name, but this is not the case this time.
Thanks in advance.
Edit: I have experimented and found out, that the problem appears only if I select some specific columns. But I don't see what can be wrong with these columns.

Related

Pandas rename columns does not rename the column

I tried to do this:
districtident.rename(columns={'leaid':'nces'},inplace=True)
and it failed:
Other things that didn't work:
districtident = districtident.rename(columns={'leaid':'nces'})
Renaming the column names of pandas dataframe is not working as expected - python
renaming columns in pandas doesn't do anything
Wasn't great either.
Here's an obvious appeal:
Alas, no.
Restarting the kernel didnt' work either. The only thing that worked was:
districtident['nces'] = districtident['leaid']
disrictident.drop(['leaid'],axis=1,inplace=True)
But that's really not the best, I feel. Especially if I need to do a number of columns.

df.corr() does not work. it just considers one feature column

df.corr() resultI wonder if someone could help me to solve my problem. I have a data frame called: df_normalized that is a normal data frame with 17 columns. I want a correlation matrix based on spearman method to find if the feature columns are correlated with each other?
However, df_normalized.corr(method='spearman') just considers sex column as you can see in the uploaded pictures of my codes.[the data frame][1]

It would be nice if you could post the full code and at least part of the dataframe so it's easier to see what's wrong. It looks like there's only the sex column in your dataframe, but it's hard to tell.
You can find a nice example of how to do what you want here: https://datatofish.com/correlation-matrix-pandas/

Dataframe Key Error Column not in index

I am reading an excel file into pandas using pd.ExcelFile.
It reads correctly and I can print the dataframe. But when I try to select a subset of columns like:
subdf= origdf[['CUTOMER_ID','ASSET_BAL']]
I get error:
KeyError: "['CUTOMER_ID' 'ASSET_BAL'] not in index"
Do I need to define some kind of index here? When I printed the df, I verified that the columns are there.

Ensure that the columns actually exist in the dataframe. For example, you have written CUTOMER and not CUSTOMER, which I assume is the correct name.
You can verify the column names by using list(origdf.columns.values).

And for when you don't have a typo problem, here is a solution:
Use loc instead,
subdf= origdf.loc[:, ['CUSTOMER_ID','ASSET_BAL']].values
(I'd be glad to learn why this one works, though.)

Pandas Removing index column

When I import a excel as a data frame in pandas and try to get rid of the first column I'm unable to do it even though i give index=None. What am I missing?

I ran into this, too. You can make the friend column the index instead. That gets rid of the original index that comes with pd.read_excel(). As #ALollz says, a dataframe always has an index. You get to choose what's in it.
data.set_index('friend', inplace=True)
See examples in the documentation on DataFrame.set_index().

Pandas conditions on count()

Hi hoping this is not a silly question.
I have a dataframe from which I am plotting a chart based on how many times something appears with the following code.
df.groupby('name').name.count().plot.bar()
plt.xlabel('Name')
plt.ylabel('Number')
plt.title('Number of times name appears')
Is there a way to get it to only plot those names that appear a certain amount of times? I am guessing I need some kind of function but not really sure where to start.

By using value_counts
df.name.value_counts().plot(kind='bar')
Edit :
df.group1.value_counts().compress(lambda s: s>=8).plot(kind='bar')

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Error by conversion from Pyspark to Pandas - pandas

Related

Pandas rename columns does not rename the column

df.corr() does not work. it just considers one feature column

Dataframe Key Error Column not in index

Pandas Removing index column

Pandas conditions on count()

Categories

Resources