My categorical variable case_status takes on four unique values. I have data from 2014 to 2016. I would like to plot the distribution of case_status grouped by year. I try to do this using:
df.groupby('year').case_status.value_counts().plot.barh()
And I get the following plot:
What I would like is a nicer representation: for example, one color for each year, with all the "DENIED" bars standing next to each other.
I think it can be achieved since the groupby result has a MultiIndex, but I don't understand it well enough to create the plot I want.
The solution is:
df.groupby('year').case_status.value_counts().unstack(0).plot.barh()
and results in the plot I described: one color per year, with the bars for each case_status grouped together.
I think you need to add unstack to get a DataFrame:
df.groupby('year').case_status.value_counts().unstack().plot.barh()
It is also possible to change the unstacked level:
df.groupby('year').case_status.value_counts().unstack(0).plot.barh()
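To see why the two unstack calls give different groupings, note that the groupby/value_counts result is a Series with a two-level MultiIndex (year, case_status); unstack moves one of those levels into the columns, and the columns become the colored legend entries. A minimal sketch on made-up data (only the column names are taken from the question):

import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the original data; the values are invented.
df = pd.DataFrame({
    'year': [2014, 2014, 2015, 2015, 2016, 2016],
    'case_status': ['DENIED', 'CERTIFIED', 'DENIED', 'WITHDRAWN',
                    'CERTIFIED', 'DENIED'],
})

counts = df.groupby('year').case_status.value_counts()

# unstack() moves the innermost level (case_status) to the columns:
# bars are grouped by year, with one color per case_status.
counts.unstack().plot.barh()

# unstack(0) moves the outer level (year) to the columns instead:
# bars are grouped by case_status, with one color per year.
counts.unstack(0).plot.barh()
plt.show()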
Related
I have a df with a multiindex with 2 levels. One of these levels, age, is used to generate another column, Numeric Age.
Currently, my idea is to reset_index, use apply with age_func which reads row["age"], and then re-set the index, something like...
df = df.reset_index("age")
df["Numeric Age"] = df.apply(age_func, axis=1)
df = df.set_index("age") # ValueError: cannot reindex from a duplicate axis
This strikes me as a bad idea. I'm having a hard time resetting the indices correctly, and I think this is probably a slow way to go.
What is the correct way to make a new column based on the values of one of your indices? Or, if this is the correct way to go, is there a way to re-set the indices such that the df is the exact same as when I started, with the new column added?
We can set a new column using .loc, and modify the rows we need using masks. To use the correct col values, we also use a mask.
First step is to make a mask for the rows to target.
mask_foo = df.index.get_level_values("age") == "foo"
Later we will use .apply(axis=1), so write a function to handle the rows you will have from mask_foo.
def calc_foo_numeric_age(row):
    # The logic here isn't important; the point is that we have access to the row
    return row["some_other_column"].split(" ")[0]
And now the .loc magic
df.loc[mask_foo, "Numeric Age"] = df[mask_foo].apply(calc_foo_numeric_age, axis=1)
Repeat process for other target indices.
If your situation allows you to reset_index().apply(axis=1), I recommend that over this. I am doing this because I have other reasons for not wanting to reset_index().
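For completeness, here is a small runnable sketch of the pattern above on made-up data; the index level age, the value "foo", and the helper are the hypothetical names used in this answer:

import pandas as pd

df = pd.DataFrame(
    {"some_other_column": ["42 years", "7 years", "30 years"]},
    index=pd.MultiIndex.from_tuples(
        [("a", "foo"), ("b", "foo"), ("c", "bar")], names=["id", "age"]
    ),
)

# Mask the rows whose "age" index level equals "foo".
mask_foo = df.index.get_level_values("age") == "foo"

def calc_foo_numeric_age(row):
    return row["some_other_column"].split(" ")[0]

# .loc assigns to the masked rows only; the other rows get NaN in the new column.
df.loc[mask_foo, "Numeric Age"] = df[mask_foo].apply(calc_foo_numeric_age, axis=1)
print(df)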
I managed to solve my problem in two ways (sol 1 and sol 2), and both seem to work fine. My question is more about understanding how they work.
Sol 2 I think I get: first I filter the Series, and from what is left I take the indices and turn them into a list.
But in sol 1, I first take the indices, and then apply the filter to keep only the ones I want?
My Series is called serie.
print(serie.index[serie==serie.max()].tolist()) # sol 1
print(serie[serie==serie.max()].index.tolist()) # sol 2
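Both solutions build the same boolean mask, serie == serie.max(); the only difference is what the mask is applied to. In sol 1 it indexes the Index object directly; in sol 2 it filters the Series first and then reads .index. The resulting labels are identical. A minimal sketch:

import pandas as pd

serie = pd.Series([1, 3, 3, 2], index=["a", "b", "c", "d"])

mask = serie == serie.max()        # boolean Series: False, True, True, False

print(serie.index[mask].tolist())  # sol 1: apply the mask to the Index object
print(serie[mask].index.tolist())  # sol 2: filter the Series, then take .index
# both print ['b', 'c']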
I'm aware that this similar question has been asked; however, I'm looking for further clarification to have better understanding of .groupby if it's possible.
Data used: a DataFrame df with columns including age and survived.
I want the exact same result like this but with .groupby():
df.pivot(columns='survived').age.plot.hist()
So I try:
df.groupby('age')['survived'].count().plot.hist()
The x-axis doesn't look right. Is there any way I can get the same result as .pivot() does using the pure .groupby() method? Thank you.
Expanding on Quang's comment, you would want to bin the ages rather than grouping on every single age (which is what df.groupby('age') does).
One method is to cut the age bins:
df['age group'] = pd.cut(df.age, bins=range(0, 100, 10), right=False)
Then groupby those bins and make a bar plot of the survived.value_counts():
(df.groupby('age group').survived.value_counts()
.unstack().plot.bar(width=1, stacked=True))
I noticed that in the link you posted, all the histograms look a little different. I think that's due to slight differences in how each method bins the data. One advantage of cutting your own bins is that you can see the exact bin boundaries.
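Putting the steps together, here is an end-to-end sketch, assuming df is the seaborn Titanic dataset (which matches the age and survived columns used in the question):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset('titanic')  # assumption: the question's data

# Bin ages into decades, then count survived/not-survived within each bin.
df['age group'] = pd.cut(df.age, bins=range(0, 100, 10), right=False)
(df.groupby('age group').survived.value_counts()
   .unstack().plot.bar(width=1, stacked=True))
plt.show()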
I upvoted this question because there's a very subtle difference between pivot and groupby. I think you're looking for something similar to this:
import matplotlib.pyplot as plt

df.groupby('age').size().plot.bar(width=1)
plt.show()
However, I do not think there's a reasonable way to get the same result by grouping because hist() needs the observations in its raw form, while groupby is designed to be followed by a function that will transform the data (such as count, min, mean, etc.).
To see this, notice that by grouping by age and then using count, you no longer have the raw array of ages anymore. For instance, there are 13 observations of people who are 40 years of age. The raw data looks like (40, 40, ... , 40, 40), while the grouped count looks like:
age count
40 13
This is not what the data should look like for a histogram. Another key difference is the binning. As you can see, the first plot counts all the observations of people with ages between 0 and 10 in a single bin. By grouping by age, you would have 11 bins inside that one bin: one for people aged 0, one for people aged 1, one for people aged 2, and so on.
To summarize, groupby expects a function that will transform the original data, but to plot a histogram you need the data in its raw state. For this reason, pivot is the go-to solution for this kind of task: it also splits the data by survived, but does not apply any function to the data.
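To make that concrete, here is a sketch of what pivot produces (again assuming Titanic-like data with age and survived columns): the ages stay raw, merely split into one column per survived value, so hist() can still bin them itself.

import seaborn as sns

df = sns.load_dataset('titanic')  # assumption: data shaped like the question's

# One column of raw ages per value of 'survived'; NaN where the row
# belongs to the other group. No aggregation has happened yet.
raw = df.pivot(columns='survived').age
print(raw.head())

# hist() drops the NaNs and bins each column's raw observations.
raw.plot.hist(alpha=0.7)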
I have a large dataframe (14,000 rows). The columns include 'title', 'x' and 'y' as well as other random data.
For a particular title, I've written code which performs an analysis using the x and y values for a subset of this data (the specifics are unimportant here).
For this title (which is something like "Part number Y1-17") there are about 80 rows.
At the moment I have only worked out how to get my code to work on 1 subset of titles (i.e. one set of rows with the same title) at a time. For this I've been making a smaller dataframe out of my big one using:
df = pd.read_excel(r"mydata.xlsx")
a = df.loc[df['title'].str.contains('Y1-17')]
But given there are about 180 of these smaller datasets I need to do this analysis on, I don't want to have to do it manually.
My question is: is there a way to make all of the smaller dataframes automatically, by slicing the data by the unique 'title' values? In all the help I've found, it seems like you need to specify the 'title' to make a subset. I want to subset all of them, and I don't want to have to list every title name to do it.
I've searched quite a lot and haven't found anything, however I am a beginner so it's very possible I've missed some really basic way of doing this.
I'm not sure if it's important information, but the modules I'm working with are pandas and numpy.
Thanks for any help!
You can use pandas groupby. For example:
df_dict = {title: group for title, group in df.copy().groupby('title', sort=False)}
Which creates a dictionary of DataFrames each containing all the columns and only the rows pertaining to each unique value of title.
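You can then loop over the dictionary instead of slicing each title by hand; run_analysis below is a hypothetical stand-in for the question's per-title code:

# Hypothetical usage: apply the per-title analysis to every subset.
for title, subset in df_dict.items():
    # 'subset' plays the role of 'a' in the question, e.g. the ~80 "Y1-17" rows.
    run_analysis(subset['x'], subset['y'])  # run_analysis is a stand-in name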
I am relatively new to pandas and to Python, and I am trying to find out how to turn all the content (all fields are strings) of a pandas DataFrame into categorical values.
All the values from the rows and columns have to be treated as one big shared data set before turning them into categorical numbers.
So far I was able to write the following piece of code
for col_name in X.columns:
    if X[col_name].dtype == 'object':
        X[col_name] = X[col_name].astype('category')
        X[col_name] = X[col_name].cat.codes
that works on a data frame X of multiple columns. It takes the strings and turns them into unique numbers.
What I am not sure about is that my for loop only works per column, so I don't know whether the codes assigned are unique per column or across the whole data frame (the latter is the desired behavior).
Can you please provide advice on how I can turn my code to provide unique numbers considering all the values of the data frame?
I would like to thank you in advance for your help.
Regards
Alex
Use DataFrame.stack to reshape all object columns into one long MultiIndex Series, so the category codes are computed over the unique values of the whole frame, then Series.unstack to restore the original shape:
cols = df.select_dtypes('object').columns
df[cols] = df[cols].stack().astype('category').cat.codes.unstack()
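A quick sketch showing that with this approach the same string maps to the same code no matter which column it appears in:

import pandas as pd

df = pd.DataFrame({'A': ['x', 'y'], 'B': ['y', 'z']})

cols = df.select_dtypes('object').columns
df[cols] = df[cols].stack().astype('category').cat.codes.unstack()
print(df)
#    A  B
# 0  0  1
# 1  1  2
# 'y' gets code 1 in both columns; the codes are shared across the frame.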