pandas: stacked bar plot from customer orders

I am trying to draw a stacked bar plot of customer orders across many companies. I want each order to appear as one segment of its company's bar. The problem is that the number of orders per company varies, and rendering the plot can crash my Jupyter notebook.
Conceptually, the following does what I want:
company1 = pd.Series([10,10,10])
company2 = pd.Series([20,20])
df = pd.DataFrame([company1, company2]).T
df.columns = ["company1", "company2"]
df.T.plot.bar(stacked=True)
which gives me a plot.
Now, how can I apply that to my dataset?
I tried the following on a subset of my data (only 3 companies in p2):
p3 = p2[["COMPANY", "TOTAL_PAID"]]
companies = [company for company, group in p3.groupby("COMPANY")]
series = [group["TOTAL_PAID"] for company, group in p3.groupby("COMPANY")]
df = pd.DataFrame(series).T
df.columns = companies
df.T.plot.bar(stacked=True, legend=False)
and it works.
But when I apply it to the whole file (which is still small: 15k lines), I wait a very long time without getting any result (I wrote this whole question after launching the plot, and it still has not been displayed). So my questions are:
Is building the frame from two list comprehensions a good strategy? I thought it was a bit suboptimal...
Is it normal for the plot to take so long to display?
Is it normal for Jupyter to crash?
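One possible alternative to the two list comprehensions, sketched on invented data: number each order within its company using groupby().cumcount() and pivot on that number. The column names COMPANY and TOTAL_PAID come from the question; the toy values and the helper column ORDER_NO are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for p2; the values are invented for illustration.
p2 = pd.DataFrame({
    "COMPANY": ["a", "a", "a", "b", "b"],
    "TOTAL_PAID": [10, 10, 10, 20, 20],
})

# Number the orders within each company: 0, 1, 2, ...
order_no = p2.groupby("COMPANY").cumcount()

# One row per company, one column per order position.
# Companies with fewer orders get NaN in the trailing columns.
wide = (
    p2.assign(ORDER_NO=order_no)  # ORDER_NO is a hypothetical helper column
      .pivot(index="COMPANY", columns="ORDER_NO", values="TOTAL_PAID")
)

print(wide)
# wide.plot.bar(stacked=True, legend=False) would then draw one bar per company.
```

This may also bear on the slowness. In the question's approach each group keeps its original row labels, so pd.DataFrame(series) appears to align them into one column per input row (about 15k columns for the full file). With cumcount the number of columns is only the largest number of orders any single company has.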

Related

How to use a loop to make a plot of 3 columns at the time?

I have a dataframe which contains the 3 columns of data (P, EP and Q) for each of the three catchment areas. I need to make a subplot of each catchment area showing the 3 columns of data that belong to this catchment area using one loop.
I did manage to make the three subplots without using a loop, but don't get how I am supposed to use one loop.
df = pd.read_excel('catchment_water_balance_data_ex2.xlsx', index_col=0, parse_dates=[0], skiprows=4)
df_monthly = df.resample('M').mean()
fig, axs = plt.subplots(3)
catchment_1 = df_monthly[['P1', 'EP1', 'Q1']]
catchment_2 = df_monthly[['P2', 'EP2', 'Q2']]
catchment_3 = df_monthly[['P3', 'EP3', 'Q3']]
axs[0].plot(catchment_1)
axs[1].plot(catchment_2)
axs[2].plot(catchment_3)
fig.suptitle('Water data of 3 catchments')
fig.supylabel('mm/day');
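A minimal sketch of the same figure built with a single loop. It assumes the columns follow the P1/EP1/Q1 naming shown above; the random data below only stands in for the Excel file, which is not available here.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for df_monthly: random values replace the Excel data.
idx = pd.date_range("2020-01-01", periods=12, freq="MS")
rng = np.random.default_rng(0)
cols = [f"{name}{i}" for i in (1, 2, 3) for name in ("P", "EP", "Q")]
df_monthly = pd.DataFrame(rng.random((12, 9)), index=idx, columns=cols)

fig, axs = plt.subplots(3, sharex=True)

# One pass per catchment: build the three column names from the loop index.
for i, ax in enumerate(axs, start=1):
    columns = [f"P{i}", f"EP{i}", f"Q{i}"]
    df_monthly[columns].plot(ax=ax)  # pandas labels each line by column name
    ax.set_title(f"Catchment {i}")

fig.suptitle("Water data of 3 catchments")
fig.supylabel("mm/day")
```

Building the column names from the loop index keeps the loop body identical for every catchment, so adding a fourth catchment would only mean changing the subplot count.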

Plotting a graph of the top 15 highest values

I am working on a dataset which shows the budget spent on movies. I want to make a plot which contains the top 15 highest-budget movies.
# Sort the 'budget' column in descending order and store it in a new dataframe.
info = pd.DataFrame(dp['budget'].sort_values(ascending = False))
info['original_title'] = dp['original_title']
data = list(map(str,(info['original_title'])))
#extract the top 10 budget movies data from the list and dataframe.
x = list(data[:10])
y = list(info['budget'][:10])
This was the output I got:
C:\Users\Phillip\AppData\Local\Temp\ipykernel_7692\1681814737.py:2: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
y = list(info['budget'][:5])
I'm new to data analysis, so I'm confused about how else to approach the problem.
A simple example using a movie dataset I found online:
import pandas as pd
url = "https://raw.githubusercontent.com/erajabi/Python_examples/master/movie_sample_dataset.csv"
df = pd.read_csv(url)
# Bar plot of 15 highest budgets:
df.nlargest(n=15, columns="budget").plot.bar(x="movie_title", y="budget")
You can customize your plot in various ways by adding arguments to the .bar(...) call.
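The FutureWarning quoted in the question comes from slicing a Series with [:n] when its index holds integers. A small sketch of the positional alternative the warning itself suggests, .iloc, on invented data (the titles and budgets below are made up; dp, original_title and budget are the names used in the question):

```python
import pandas as pd

# Toy stand-in for dp; titles and budgets are invented.
dp = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [30, 10, 40, 20],
})

# Sort the whole frame so titles stay attached to their budgets.
info = dp.sort_values("budget", ascending=False)

# .iloc slices by position, so it does not depend on the index labels.
top = info.iloc[:2]
x = top["original_title"].tolist()
y = top["budget"].tolist()

print(x, y)
```

Sorting the frame rather than a single column also avoids re-attaching the titles afterwards.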

Generating Percentages from Pandas

I am working with a data set from SQL currently -
import pandas as pd
df = spark.sql("select * from donor_counts_2015")
df_info = df.toPandas()
print(df_info)
The output looks like this (I can't include the actual output for privacy reasons):
As you can see, it's a data set that has the name of a fund and then the number of people who have donated to that fund. What I am trying to do now is calculate what percent of funds have only 1 donation, what percent have 2, 34, etc. I am wondering if there is an easy way to do this with pandas? I also would appreciate if you were able to see the percentage of a range of funds too, like what percentage of funds have between 50-100 donations, 500-1000, etc. Thanks!
You can make a histogram of the donations to visualize the distribution. np.histogram might help. Or you can also sort the data and count manually.
For the first task, to get the percentage of funds for each value in the column 'number_of_donations', you can do:
df['number_of_donations'].value_counts(normalize=True) * 100
For the second task, you need to create a new column with categories and then do the same:
# Create a Series of categories
new_series = pd.cut(df['number_of_donations'], bins=[0, 100, 200, 500, 99999999], labels=['Few', 'Medium', 'Many', 'Too Many'])
# Name the Series so that it becomes the 'Category' column
new_series.name = 'Category'
# Concatenate df and new_series
df = pd.concat([df, new_series], axis=1)
# Get the percentage of each category
df['Category'].value_counts(normalize=True) * 100
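A self-contained sketch of both steps on invented data, with bin edges chosen to give ranges like the 50-100 bucket the question mentions. The frame name df_info and the column number_of_donations follow the thread; the fund names, counts, bin edges and labels are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the pandas frame; the values are invented.
df_info = pd.DataFrame({
    "fund": ["f1", "f2", "f3", "f4", "f5"],
    "number_of_donations": [1, 1, 2, 75, 600],
})

# Percentage of funds for each exact donation count.
exact = df_info["number_of_donations"].value_counts(normalize=True) * 100

# pd.cut bins are right-closed by default: (0, 49], (49, 100], (100, 1000].
bins = [0, 49, 100, 1000]
labels = ["1-49", "50-100", "101-1000"]
ranges = pd.cut(df_info["number_of_donations"], bins=bins, labels=labels)

# Percentage of funds falling in each range.
by_range = ranges.value_counts(normalize=True) * 100

print(exact)
print(by_range)
```

Counts above the last bin edge would become NaN and drop out of the percentages, so the final edge should be at least as large as the biggest value in the data.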

How to plot a stacked bar using the groupby data from the dataframe in python?

I am reading a huge CSV file using the pandas module.
filename = pd.read_csv(filepath)
I then convert it to a DataFrame:
df = pd.DataFrame(filename, index=None)
From the CSV file, I am interested in three columns: country, year, and value.
I grouped by country, summed the values, and plotted the result as a bar graph with the following code:
df.groupby('country').value.sum().plot(kind='bar')
where the x axis is country and the y axis is value.
Now I want to turn this bar graph into a stacked bar, using the third column, year, so that a differently colored segment represents each year. I am looking for an easy way to do this.
Note that the year column contains years from 2000 to 2019.
Thanks.
From what I understand, you should try something like:
df.groupby(['country', 'year']).value.sum().unstack().plot(kind='bar', stacked=True)
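To see what the unstack step produces before plotting, here is a small sketch on invented data. The column names country, year and value are the ones the question describes; the rows are made up, and fill_value=0 is an optional addition so that missing country/year pairs show as zero instead of NaN.

```python
import pandas as pd

# Toy data; the rows are invented for illustration.
df = pd.DataFrame({
    "country": ["FR", "FR", "FR", "DE", "DE"],
    "year": [2000, 2000, 2001, 2000, 2001],
    "value": [1, 2, 3, 4, 5],
})

# Sum per (country, year), then move the year level into columns:
# one row per country, one column per year.
table = df.groupby(["country", "year"])["value"].sum().unstack(fill_value=0)

print(table)
# table.plot(kind="bar", stacked=True) would draw one bar per country,
# with one colored segment per year.
```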

Visualizing pandas grouped data

Hi I am working on the following dataset
Dataset
df = pd.read_csv('https://github.com/datameet/india-election-data/blob/master/parliament-elections/parliament.csv')
df.groupby(['YEAR','PARTY'])['PC'].nunique()
How do I create a stacked bar plot with year on the x axis, PC count on the y axis, and party names as the stacked segments? Basically, I want to display the top 5 parties each year by value and bucket all other parties (excluding IND) as 'others'.
I want to produce something like this: Election Viz
IIUC this should work:
sd = df.groupby(['YEAR','PARTY'])['PC'].nunique().reset_index()
sd.pivot(index='YEAR',values='PC',columns='PARTY').plot(kind='bar',stacked=True,figsize=(8,8))
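The snippet above stacks every party. A rough sketch of the "top N per year, rest as 'others'" part of the question, on invented data: it assumes IND rows are dropped entirely (the question is ambiguous on this) and uses TOP_N = 2 only so the toy example stays small. TOP_N, the toy rows and the label 'others' are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Toy data with the question's column names; the rows are invented.
df = pd.DataFrame({
    "YEAR": [2009] * 8 + [2014] * 7,
    "PARTY": ["A", "A", "A", "B", "B", "C", "D", "IND",
              "A", "B", "B", "B", "C", "C", "IND"],
    "PC": ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
           "x1", "x2", "x3", "x4", "x5", "x6", "x7"],
})

TOP_N = 2  # hypothetical; the question asks for 5

# Count distinct PCs per (year, party), leaving IND out (assumption).
counts = (
    df[df["PARTY"] != "IND"]
    .groupby(["YEAR", "PARTY"])["PC"].nunique()
    .reset_index()
)

# Rank parties within each year; 1 is the largest count.
rank = counts.groupby("YEAR")["PC"].rank(method="first", ascending=False)

# Keep the top parties' names and relabel the rest as 'others'.
counts["PARTY"] = counts["PARTY"].where(rank <= TOP_N, "others")

# Sum the relabelled rows into one cell per (year, party).
table = counts.pivot_table(index="YEAR", columns="PARTY", values="PC",
                           aggfunc="sum", fill_value=0)

print(table)
# table.plot(kind="bar", stacked=True, figsize=(8, 8)) would draw the chart.
```

pivot_table is used instead of pivot because several parties in the same year end up sharing the 'others' label, and pivot would reject those duplicate entries.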