Hi, hoping this is not a silly question.
I have a dataframe from which I am plotting a chart of how many times each name appears, using the following code:
df.groupby('name').name.count().plot.bar()
plt.xlabel('Name')
plt.ylabel('Number')
plt.title('Number of times name appears')
Is there a way to get it to only plot the names that appear at least a certain number of times? I am guessing I need some kind of function, but I'm not really sure where to start.
By using value_counts:
df.name.value_counts().plot(kind='bar')
Edit: to keep only the values that appear at least 8 times, filter the counts before plotting. Series.compress is deprecated in newer pandas, so .loc with a callable does the same job:
df.group1.value_counts().loc[lambda s: s >= 8].plot(kind='bar')
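Put together, a minimal self-contained sketch (the 'name' column and the threshold of 8 are only example values):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'name': ['a'] * 10 + ['b'] * 9 + ['c'] * 3})   # toy data

counts = df['name'].value_counts()
counts[counts >= 8].plot(kind='bar')   # keep only names appearing at least 8 times
plt.xlabel('Name')
plt.ylabel('Number')
plt.title('Number of times name appears')
plt.show()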
I have a function that returns tuples. When I apply it to my pandas dataframe with DataFrame.apply(), the results look this way.
The Date here is an index and I am not interested in it.
I want to create two new columns in the dataframe and set their values to the values in these tuples.
How do I do this?
I tried the following:
This errors out, citing a mismatch between expected and available values. It is seeing each tuple as a single entity, so the two columns I specified on the left-hand side are a problem. It's expecting only one.
And what I need is to break it down into two parts that can be used to set two different columns.
What's the correct way to achieve this?
Make your function return a pd.Series; this will be expanded into a DataFrame:
orders.apply(lambda x: pd.Series(myFunc(x)), axis=1)
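If you also name the Series entries, you can assign the expanded result straight to new columns (the names 'a' and 'b' are just placeholders for whatever your columns should be called):
orders[['a', 'b']] = orders.apply(lambda x: pd.Series(myFunc(x), index=['a', 'b']), axis=1)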
Or use zip:
orders['a'], orders['b'] = zip(*orders['your_column'])
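A minimal sketch of the zip approach on toy data (the column names and myFunc are made up for illustration):
import pandas as pd

orders = pd.DataFrame({'value': [1, 2, 3]})

def myFunc(row):
    return row['value'], row['value'] ** 2   # returns a tuple

orders['pair'] = orders.apply(myFunc, axis=1)     # a single column of tuples
orders['a'], orders['b'] = zip(*orders['pair'])   # unpack into two columns
print(orders)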
I am trying to convert a PySpark dataframe to pandas, so I simply write df1 = df.toPandas(), and I get the error "ValueError: ordinal must be >= 1". Unfortunately, I don't see any other useful information in the error message (it's quite long, so I cannot post it here).
If somebody has an idea what could be wrong, it would be nice.
I have only seen this error when a PySpark dataframe had multiple columns with the same name, but that is not the case this time.
Thanks in advance.
Edit: I have experimented and found out that the problem appears only if I select some specific columns, but I don't see what could be wrong with these columns.
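This is roughly how I narrowed it down (simplified sketch; I have read that this error can come from date/timestamp values outside the range Python's datetime supports, but that is only a guess):
# Try each column separately to find the one(s) that break toPandas()
for c in df.columns:
    try:
        df.select(c).toPandas()
    except Exception as e:
        print(c, '->', e)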
I wonder if someone could help me solve my problem. I have a data frame called df_normalized, a normal data frame with 17 columns. I want a correlation matrix based on the Spearman method, to find out whether the feature columns are correlated with each other.
However, df_normalized.corr(method='spearman') only returns the sex column, as you can see in the uploaded picture of my code.
It would be nice if you could post the full code and at least part of the dataframe so it's easier to see what's wrong. It looks like there's only the sex column in your dataframe, but it's hard to tell.
You can find a nice example of how to do what you want here: https://datatofish.com/correlation-matrix-pandas/
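If it turns out that the other 16 columns are not numeric (for example stored as strings), .corr() will silently drop them. A rough sketch of how to check and coerce them, assuming the values really are numeric:
import pandas as pd

print(df_normalized.dtypes)   # corr() only looks at numeric columns

# Coerce string columns to numbers (non-convertible values become NaN)
numeric = df_normalized.apply(pd.to_numeric, errors='coerce')
print(numeric.corr(method='spearman'))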
Working with the Wine Review Data from Kaggle here, I am able to return the number of occurrences by variety using value_counts().
However, I am trying to find a quick way to limit the results to varieties and their counts where there is more than one occurrence.
Both
df.loc[df['variety'].value_counts() > 1].value_counts()
and
df['variety'].loc[df['variety'].value_counts() > 1].value_counts()
return errors.
The results can be turned into a DataFrame and the constraint applied there, but something tells me there is a far more elegant way to achieve this.
@wen answered this in the comments:
df['variety'].value_counts().loc[lambda x : x>1]
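A quick self-contained check with made-up data showing what the filter returns:
import pandas as pd

df = pd.DataFrame({'variety': ['Merlot', 'Merlot', 'Syrah', 'Malbec', 'Malbec', 'Malbec']})
print(df['variety'].value_counts().loc[lambda x: x > 1])
# Malbec    3
# Merlot    2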
I'd like to merge/concatenate multiple dataframes together; basically, it's to join many feature columns together based on the same first column, 'Name'.
F1.merge(F2, on='Name', how='outer').merge(F3, on='Name', how='outer').merge(F4,on='Name', how='outer')...
I tried the code above and it works, but I've got, say, 100 feature tables to join, so I'm wondering if there is a better way.
Without data it is not easy, but this should work (note the axis=1, so the frames are joined side by side on 'Name' rather than stacked):
df = pd.concat([x.set_index('Name') for x in [df1, df2, df3]], axis=1).reset_index()
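If the frames are already collected in a list, another option is to fold the pairwise merge with functools.reduce (the list name frames is just an assumption here):
from functools import reduce

frames = [F1, F2, F3]   # ...and so on, up to F100
merged = reduce(lambda left, right: left.merge(right, on='Name', how='outer'), frames)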