pandas .plot.hist() with .groupby()

I'm aware that a similar question has been asked; however, I'm looking for further clarification to get a better understanding of .groupby, if possible.
Data used
I want the exact same result as this, but with .groupby():
df.pivot(columns='survived').age.plot.hist()
So I try:
df.groupby('age')['survived'].count().plot.hist()
The x-axis doesn't look right. Is there any way I can get the same result as .pivot() does using the pure .groupby() method? Thank you.

Expanding on Quang's comment, you would want to bin the ages rather than grouping on every single age (which is what df.groupby('age') does).
One method is to cut the age bins:
df['age group'] = pd.cut(df.age, bins=range(0, 100, 10), right=False)
Then group by those bins and make a stacked bar plot of the survived value counts:
(df.groupby('age group').survived.value_counts()
.unstack().plot.bar(width=1, stacked=True))
I noticed that in the link you posted, all the histograms look a little different. I think that's due to slight differences in how each method bins the data. One advantage of cutting your own bins is that you can clearly see the exact bin boundaries.
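If you want to double-check those boundaries, the bins pd.cut created are stored on the binned column (assuming the 'age group' column from above):

df['age group'].cat.categories
# -> IntervalIndex([[0, 10), [10, 20), ..., [80, 90)], closed on the left because right=False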

I upvoted this question because there's a very subtle difference between pivot and groupby. I think you're looking for something similar to this:
import matplotlib.pyplot as plt

df.groupby('age').size().plot.bar(width=1)
plt.show()
However, I do not think there's a reasonable way to get the same result by grouping because hist() needs the observations in its raw form, while groupby is designed to be followed by a function that will transform the data (such as count, min, mean, etc.).
To see this, notice that by grouping by age and then using count, you no longer have the raw array of ages. For instance, there are 13 observations of people who are 40 years of age. The raw data looks like (40, 40, ..., 40, 40), while the grouped count looks like:
age count
40 13
This is not what the data should look like for a histogram. Another key difference is the bins of a histogram. As you can see, the first plot counts all the observations of people with ages between 0 and 10. By grouping by age, you would have 11 bins inside this one bin: one for people aged 0, one for people aged 1, one for people aged 2, and so on.
To summarize, groupby expects a function that will transform the original data, but in order to plot a histogram you need the data in its raw form. For this reason, pivot is the go-to solution for this kind of task, as it also splits the data by survived but does not apply any function to the data.
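To make the shape difference concrete (assuming the same Titanic-style df with 'age' and 'survived' columns from the question), compare what each approach hands to the plotting step:

# pivot keeps one row per passenger: raw ages, split into two columns by 'survived'
df.pivot(columns='survived').age

# groupby + count collapses everything down to one row per distinct age
df.groupby('age').survived.count()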

Related

Slice dataframe according to unique values into many smaller dataframes

I have a large dataframe (14,000 rows). The columns include 'title', 'x' and 'y' as well as other random data.
For a particular title, I've written code which basically performs an analysis using the x and y values for a subset of this data (but the specifics are unimportant for this).
For this title (which is something like "Part number Y1-17") there are about 80 rows.
At the moment I have only worked out how to get my code to work on 1 subset of titles (i.e. one set of rows with the same title) at a time. For this I've been making a smaller dataframe out of my big one using:
df = pd.read_excel(r"mydata.xlsx")
a = df.loc[df['title'].str.contains('Y1-17')]
But given there are about 180 of these smaller datasets I need to do this analysis on, I don't want to have to do it manually.
My question is: is there a way to make all of the smaller dataframes automatically, by slicing the data by the unique 'title' values? In all the help I've found, it seems like you need to specify the 'title' to make a subset. I want to subset all of it, and I don't want to have to list all the title names to do it.
I've searched quite a lot and haven't found anything, however I am a beginner so it's very possible I've missed some really basic way of doing this.
I'm not sure if it's important information, but the modules I'm working with are pandas and numpy.
Thanks for any help!
You can use pandas groupby. For example:
df_dict = {title: group for title, group in df.groupby('title', sort=False)}
This creates a dictionary of DataFrames, each containing all the columns but only the rows belonging to one unique value of title.
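From there, a rough sketch of running your analysis on every subset might look like this (analyse_subset is a hypothetical placeholder for whatever your existing code does with the x and y values):

import pandas as pd

df = pd.read_excel(r"mydata.xlsx")
df_dict = {title: group for title, group in df.groupby('title', sort=False)}

def analyse_subset(subset):
    # hypothetical stand-in for your existing x/y analysis on one title's rows
    return subset[['x', 'y']].mean()

results = {title: analyse_subset(group) for title, group in df_dict.items()}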

How to do sampling in sql query to get dataframe with pandas

Note my question is a bit different here:
I am working with pandas on a dataset that has a lot of data (10M+):
q = "SELECT COUNT(*) as total FROM `<public table>`"
df = pd.read_gbq(q, project_id=project, dialect='standard')
I know I can use the pandas .sample() function with the frac option, like:
df_sample = df.sample(frac=0.01)
however, I do not want to generate the original df at that size in the first place. I wonder what the best practice is for generating a dataframe whose data is already sampled.
I've read some SQL posts where the sample data was generated from a slice; that is absolutely not acceptable in my case. The sample data needs to be as evenly distributed as possible.
Can anyone shed some more light on this?
Thank you very much.
UPDATE:
Below is a table showing what the data looks like:
Reputation is the field I am working on. You can see majority records have a very small reputation.
I don't want to work with a dataframe containing all the records; I want the sampled data to look like the un-sampled data (for example, to have a similar histogram). That's what I meant by "evenly".
I hope this clarifies a bit.
A simple random sample can be performed using the following syntax:
select * from mydata where rand()>0.9
This gives each row in the table a 10% chance of being selected. It doesn't guarantee a certain sample size or guarantee that every bin is represented (that would require a stratified sample). Here's a fiddle of this approach
http://sqlfiddle.com/#!9/21d1ee/2
On average, random sampling will provide a distribution the same as that of the underlying data, so meets your requirement. However if you want to 'force' the sample to be more representative or force it to be a certain size we need to look at something a little more advanced.
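If you are on BigQuery (which read_gbq suggests), a minimal sketch of pushing the sampling into the query itself, so that only roughly 1% of rows ever reach pandas, could look like this (reusing the project variable and placeholder table name from your question):

import pandas as pd

# RAND() returns a value in [0, 1) per row, so this keeps roughly 1% of rows at random
q = """
SELECT *
FROM `<public table>`
WHERE RAND() < 0.01
"""
df_sample = pd.read_gbq(q, project_id=project, dialect='standard')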

Understanding Stratified sampling in numpy

I am currently working through an exercise book on machine learning to get my feet wet, so to speak, in the discipline. Right now I am working on a real-estate data set: each instance is a district of California and has several attributes, including the district's median income, which has been scaled and capped at 15. The median income histogram reveals that most median income values are clustered around 2 to 5, but some values go far beyond 6. The author wants to use stratified sampling, basing the strata on the median income values. He offers the following piece of code to create an income category attribute.
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
He explains that he divides the median_income by 1.5 to limit the number of categories and that he then keeps only those categories lower than 5 and merges all other categories into category 5.
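To see concretely what those two lines do, here are a few illustrative values (the numbers are made up):

import numpy as np

incomes = np.array([1.2, 2.8, 4.5, 7.5, 12.0])
cats = np.ceil(incomes / 1.5)          # -> [1., 2., 3., 5., 8.]
cats = np.where(cats < 5, cats, 5.0)   # -> [1., 2., 3., 5., 5.]  (everything above 5 merged into category 5)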
What I don't understand is:
Why is it mathematically sound to divide the median_income of each instance to create the strata? What exactly does the result of this division mean? Are there other ways to calculate/limit the number of strata?
How does the division restrict the number of categories and why did he choose 1.5 as the divisor instead of a different value? How did he know which value to pick?
Why does he only want 5 categories and how did he know beforehand that there would be at least 5 categories?
Any help understanding these decisions would be greatly appreciated.
I'm also not sure if this is the right Stack Overflow category to post this question in, so if I made a mistake by doing so, please let me know what the appropriate forum might be.
Thank you!
You are probably the best person to analyze this further based on your data set, but I can help you understand stratified sampling so that you have an idea of what it does.
STRATIFIED SAMPLING: suppose you have a data set of consumers who eat different fruits. One feature is 'fruit type', and this feature has 10 different categories (apple, orange, grapes, etc.). Now, if you just sample the data randomly, there is a possibility that the sample might not cover all the categories, which is very bad when training a model. To avoid such a scenario, we use stratified sampling: the sample is drawn from each category (stratum) separately, so the category proportions in the sample match those of the full data set and no category is missed.
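As a concrete sketch (assuming the housing frame and the income_cat column from your question), scikit-learn can do the stratified split for you:

from sklearn.model_selection import train_test_split

# stratify= makes the income_cat proportions in train and test match the full data set
train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)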
Please let me know if you still have any questions; I would be very happy to help.

Internal db logic/operation to group/compress result

I have a CrateDB table storing various information for zipcodes. It contains around 30k zipcodes, and I need my query to return certain profiling information for all zipcodes at once. I understand that typically it wouldn't be feasible, but since I only need ballpark information and many zipcodes are consecutive, I think an optimization is possible.
For example, if I wanted to profile population, a grouped result such as this would work for me:
group 1 (0-1000): 00000-02000,02004-02010,02012
group 2 (1001-3000): ...
...
The populations and groups above are fake, but the idea should hold. Basically: group the profiled category into buckets, assign each zipcode to the correct bucket, and further reduce the size by using a range representation. I could settle for a predefined number of groups, or have the group buckets defined by the request/query itself. This would hopefully reduce the response from something too large for a single query to something manageable.
Is it possible to write a CrateDB function to do something similar, to avoid the bandwidth issues of having this grouping done on a different service/container/VM?
You could probably create groups on the fly, or as columns if you wish, with a regex; I have done this on a 23M-row table and grouped by that.
In my example, the regex grouping and AVG took around 30s, but that is very dependent on my hardware.
Something like this would probably work as a general pointer:
SELECT AVG(--yourColumn--), regexp_matches(--yourColumn--, '--your regex--', 'i')[1]
FROM "doc"."--yourTable--"
GROUP BY regexp_matches(postcode, '--your regex--', 'i')[1]
ORDER BY regexp_matches(postcode, '--your regex--', 'i')[1]
You could use an OVER windowed function, but CrateDB doesn't yet have full SQL support for partitioning, etc.
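If the final range-compression step (collapsing consecutive zipcodes into spans like 00000-02000) ends up happening on the client side anyway, a rough pandas sketch of that post-processing might look like this (the column names and bucket edges are assumptions, not your schema):

import pandas as pd

# Hypothetical zipcode/population frame pulled from CrateDB
df = pd.DataFrame({
    "zipcode": [0, 1, 2, 3, 10, 11, 500],
    "population": [120, 90, 3000, 150, 80, 2500, 40],
})

# 1. Assign each zipcode to a profiling bucket
df["bucket"] = pd.cut(df["population"], bins=[0, 1000, 3000, 10000])

# 2. Collapse consecutive zipcodes within each bucket into ranges
def to_ranges(codes):
    codes = sorted(codes)
    ranges, start = [], codes[0]
    for prev, cur in zip(codes, codes[1:] + [None]):
        if cur is None or cur != prev + 1:
            ranges.append(f"{start:05d}-{prev:05d}" if start != prev else f"{start:05d}")
            start = cur
    return ",".join(ranges)

print(df.groupby("bucket", observed=True)["zipcode"].apply(to_ranges))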

Bar plot with groupby

My categorical variable case_status takes on four unique values. I have data from 2014 to 2016. I would like to plot the distribution of case_status grouped by year. I try to do this using:
df.groupby('year').case_status.value_counts().plot.barh()
And I get the following plot:
What I would like to have is a nicer representation: for example, one color for each year, with all the "DENIED" bars standing next to each other.
I think it can be achieved since the groupby result is a MultiIndex Series, but I don't understand it well enough to create the plot I want.
The solution is:
df.groupby('year').case_status.value_counts().unstack(0).plot.barh()
and produces the plot I was after.
I think you need to add unstack to get a DataFrame:
df.groupby('year').case_status.value_counts().unstack().plot.barh()
It is also possible to change the level:
df.groupby('year').case_status.value_counts().unstack(0).plot.barh()
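The level passed to unstack decides which index level becomes the columns, and therefore which variable gets its own color in the bar plot. Assuming the same df:

counts = df.groupby('year').case_status.value_counts()

counts.unstack()    # columns = case_status -> one color per status, bars grouped by year
counts.unstack(0)   # columns = year        -> one color per year, bars grouped by case_status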