I am trying to plot a grouped bar chart from a merged dataframe. With the code below the bars are drawn on top of each other. How can I put them side by side, as in a grouped bar chart?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
df1 = pd.DataFrame({
'key': ['A', 'B', 'C', 'D'],
'value':[ 10 ,6, 6, 8]})
df2 = pd.DataFrame({
'key': ['B', 'D', 'A', 'F'],
'value':[ 3, 5, 5, 7]})
df3 = pd.merge(df1, df2, how='inner', on=['key'])
print(df1)
print(df2)
print(df3)
fig, ax = plt.subplots(figsize=(12, 8))
b1 = ax.bar(df3['key'],df3['value_x'])
b2 = ax.bar(df3['key'],df3['value_y'])
pngname = "demo.png"
fig.savefig(pngname, dpi=fig.dpi)
print("[[./%s]]"%(pngname))
Current output:
The problem is that both calls use the same x-axis data. In your case the x values are not numbers but the keys "A", "B", "D", so matplotlib draws both sets of bars at the same positions, one on top of the other.
There's a simple way around it, as some online tutorials show, e.g. https://www.geeksforgeeks.org/create-a-grouped-bar-plot-in-matplotlib/.
Basically, you enumerate the keys, e.g. A=0, B=1, D=2 for the merged frame. Then you choose a bar width (I chose 0.4), shift one group of bars to the left by bar_width/2, and shift the other group to the right by bar_width/2.
Perhaps the code explains it better than I did:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
df1 = pd.DataFrame({
'key': ['A', 'B', 'C', 'D'],
'value':[ 10 ,6, 6, 8]})
df2 = pd.DataFrame({
'key': ['B', 'D', 'A', 'F'],
'value':[ 3, 5, 5, 7]})
df3 = pd.merge(df1, df2, how='inner', on=['key'])
fig, ax = plt.subplots(figsize=(12, 8))
# modifications
x = np.arange(len(df3['key'])) # enumerate the keys
bar_width = 0.4 # choose the bar width
b1 = ax.bar(x - bar_width/2,df3['value_x'], width=bar_width, label='value_x') # shift x values left
b2 = ax.bar(x + bar_width/2,df3['value_y'], width=bar_width, label='value_y') # shift x values right
plt.xticks(x, df3['key']) # replace x axis ticks with keys from df3.
plt.legend(['value_x', 'value_y'])
plt.show()
Result:
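As an aside, pandas' own plotting wrapper can do the offsetting for you. The following is only a minimal sketch, assuming the same df1 and df2 as above; the "Agg" backend line is my assumption so the snippet runs without a display, and you can drop it in an interactive session.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value': [10, 6, 6, 8]})
df2 = pd.DataFrame({'key': ['B', 'D', 'A', 'F'],
                    'value': [3, 5, 5, 7]})
df3 = pd.merge(df1, df2, how='inner', on='key')

fig, ax = plt.subplots(figsize=(12, 8))
# With 'key' as the index, every remaining column becomes one bar series,
# and pandas places the series side by side within each key.
df3.set_index('key')[['value_x', 'value_y']].plot.bar(ax=ax, width=0.8, rot=0)
ax.set_ylabel('value')
```

Here width is the total width of one group of bars, so each of the two series ends up 0.4 wide. Call fig.savefig(...) or plt.show() afterwards as in the original code.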
I have a pandas dataframe containing IDs and Codes which are of type list:
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 3, 4], 'Code': [['A', 'B'], ['A', 'B'], ['A', 'B', 'C'],
['A'], ['A'], ['A', 'C'], ['D', 'C'], ['A', 'D']]})
I would like to groupby ID and get a list of all codes associated with each ID:
df_groupby = pd.DataFrame(df.groupby('ID')['Code'].apply(list))
After executing the above code I have a dataframe at the ID level with the 'Code' column transformed to a list of lists. How would I flatten each list of lists within the 'Code' column such that I have a list of all codes associated with each ID?
Try this. You can use np.hstack, which stacks arrays in sequence horizontally. Note that the result is a NumPy array rather than a Python list.
import numpy as np
df_groupby["Code"] = df_groupby["Code"].apply(lambda x: np.hstack(x))
or
df_groupby["Code"] = df_groupby["Code"].apply(np.hstack)
Use a list comprehension:
df = df.groupby('ID')['Code'].agg(lambda x: [z for y in x for z in y]).to_frame()
print(df)
Code
ID
1 [A, B, A, B, A, B, C]
2 [A, A]
3 [A, C, D, C]
4 [A, D]
Answering my own question. Applying numpy's hstack does the trick:
df_groupby['Code'] = df_groupby['Code'].apply(np.hstack)
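If you want plain Python lists rather than NumPy arrays, itertools.chain is another option. This is a small sketch on the same sample data; the name flat is just mine for illustration.

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 3, 4],
                   'Code': [['A', 'B'], ['A', 'B'], ['A', 'B', 'C'],
                            ['A'], ['A'], ['A', 'C'], ['D', 'C'], ['A', 'D']]})

# chain.from_iterable concatenates the per-row lists of each group
# into a single flat list, preserving the original order.
flat = (df.groupby('ID')['Code']
          .apply(lambda s: list(chain.from_iterable(s)))
          .to_frame())
print(flat)
```

If duplicates are not wanted, wrapping the chained values in list(dict.fromkeys(...)) keeps the first occurrence of each code in order.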
I have a dataframe which, when dumped to Excel, appears as follows:
I need to dump it to Excel with formatting so that it appears as:
i.e. I have two dictionaries that are used to apply colors to the column names and the index names.
colorIndex = {'A':'Bb', 'B':'B'}
colorColumn = {'ATC':'X1', 'P25':'Y'}
I am using the following code to generate dataframe and dump to excel:
import pandas as pd, numpy as np, sys, os
def getDF():
    df = pd.DataFrame()
    df['ATC'] = np.random.rand(1, 7).round(2).flatten()
    df['P25'] = np.random.rand(1, 7).round(2).flatten()
    df['P75'] = np.random.rand(1, 7).round(2).flatten()
    df['Type1'] = ['A', 'B', 'B', 'A', 'B', 'B', 'A']
    df['Type11'] = ['A', 'Aa', 'Bb', 'A', 'Bb', 'B', 'Bb']
    df['Type2'] = ['X', 'X', 'X1', 'Y', 'Y', 'Y1', 'Y']
    df = df.pivot_table(index=['Type1', 'Type11'], columns='Type2', aggfunc=[np.mean])['mean']
    return df
df = getDF()
fn = r'C:\Users\Desktop\format_file.xlsx'
df.to_excel(fn, engine='openpyxl')
But I have no idea how to generate the style parameters for this kind of Excel dump.
In this dataframe...
import pandas as pd
import numpy as np
import datetime
tf = 365
dt = datetime.datetime.now()-datetime.timedelta(days=365)
df = pd.DataFrame({
'Cat': np.repeat(['a', 'b', 'c'], tf),
'Date': np.tile(pd.date_range(dt, periods=tf), 3),
'Val': np.random.rand(3*tf)
})
For a large dataset, how can I get a dictionary of the standard deviation of 'Val' for each 'Cat', computed over a given number of days counted back from the last day?
This code gives the standard deviation for the last 10 days...
{s: np.std(df[(df.Cat == s) &
(df.Date > today-datetime.timedelta(days=10))].Val)
for s in df.Cat.unique()}
...looks clunky.
Is there a better way?
First filter with boolean indexing, then aggregate with std. pandas' std defaults to ddof=1 while np.std defaults to ddof=0, so set ddof=0 to get the same result as np.std:
d1 = df[(df.Date>dt-datetime.timedelta(days=10))].groupby('Cat')['Val'].std(ddof=0).to_dict()
print (d1)
{'a': 0.28435695432581953, 'b': 0.2908486860242955, 'c': 0.2995981283031974}
Another solution is to use a custom function:
f = lambda x: np.std(x.loc[(x.Date > dt-datetime.timedelta(days=10)), 'Val'])
d2 = df.groupby('Cat').apply(f).to_dict()
The difference between the two solutions: if no rows of a group match the condition, the first solution drops that group, while the second assigns it NaN:
d1 = {'b': 0.2908486860242955, 'c': 0.2995981283031974}
d2 = {'a': nan, 'b': 0.2908486860242955, 'c': 0.2995981283031974}
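To count the window back from the last day that actually appears in the data, one option is to take the cutoff from the maximum of the Date column. This is only a sketch: the helper name trailing_std, the fixed start date and the seeded generator are my assumptions, chosen so the example is reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # assumption: seeded for reproducibility
n_days = 365
dates = pd.date_range("2023-01-01", periods=n_days)  # assumption: fixed start date
df = pd.DataFrame({
    'Cat': np.repeat(['a', 'b', 'c'], n_days),
    'Date': np.tile(dates, 3),
    'Val': rng.random(3 * n_days),
})

def trailing_std(frame, days, ddof=0):
    """Std of 'Val' per 'Cat' over the last `days` days found in the data."""
    cutoff = frame['Date'].max() - pd.Timedelta(days=days)
    recent = frame.loc[frame['Date'] > cutoff]
    result = recent.groupby('Cat')['Val'].std(ddof=ddof)
    # reindex keeps categories without recent rows, filled with NaN
    return result.reindex(frame['Cat'].unique()).to_dict()

stds = trailing_std(df, days=10)
print(stds)
```

The reindex step gives the NaN behaviour of the second solution while keeping the single vectorized groupby of the first.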
I have a seaborn boxplot (sns.boxplot) on which I would like to add some points. For example, say I have this pandas DataFrame:
[In] import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'Property 1':['a']*100+['b']*100,
'Property 2': ['w', 'x', 'y', 'z']*50,
'Value': np.random.normal(size=200)})
df.head(3)
[Out] Property 1 Property 2 Value
0 a w 1.421380
1 a x -1.034465
2 a y 0.212911
[In] df.shape
[Out] (200, 3)
I can easily generate a boxplot with seaborn:
[In] sns.boxplot(x='Property 2', hue='Property 1', y='Value', data=df)
[Out]
Now say I want to add markers for a specific case in my sample. I can get close with this:
[In] specific_case = pd.DataFrame([['a', 'w', '0.5'],
['a', 'x', '0.2'],
['a', 'y', '0.1'],
['a', 'z', '0.3'],
['b', 'w', '-0.5'],
['b', 'x', '-0.2'],
['b', 'y', '0.3'],
['b', 'z', '0.5']
],
columns = df.columns
)
[In] sns.boxplot(x='Property 2', hue='Property 1', y='Value', data=df)
plt.plot(np.arange(-0.25, 3.75, 0.5),
specific_case['Value'].values, 'ro')
[Out]
That is unsatisfactory, of course.
I then used this answer that talks about getting the Bbox and this tutorial about converting display coordinates into data coordinates to write this function:
[In] def get_x_coordinates_of_seaborn_boxplot(ax, x_or_y):
    display_coordinates = []
    inv = ax.transData.inverted()
    for c in ax.get_children():
        if type(c) == mpl.patches.PathPatch:
            if x_or_y == 'x':
                display_coordinates.append(
                    (c.get_extents().xmin+c.get_extents().xmax)/2)
            if x_or_y == 'y':
                display_coordinates.append(
                    (c.get_extents().ymin+c.get_extents().ymax)/2)
    return inv.transform(tuple(display_coordinates))
That works great for my first hue, but not at all for my second:
[In] ax = sns.boxplot(x='Property 2', hue='Property 1', y='Value', data=df)
coords = get_x_coordinates_of_seaborn_boxplot(ax, 'x')
plt.plot(coords, specific_case['Value'].values, 'ro')
[Out]
How can I get the data coordinates of all my boxes?
I'm unsure about the purpose of those transformations. It seems the real problem is just to plot the points from specific_case at the correct positions. The x-coordinate of every box is shifted by 0.2 from the whole number. (That is because a group of boxes is 0.8 wide by default; you have 2 boxes per group, which makes each 0.4 wide, and half of that is 0.2.)
You then need to arrange the x values so that they match the row order of the specific_case dataframe.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'Property 1':['a']*100+['b']*100,
'Property 2': ['w', 'x', 'y', 'z']*50,
'Value': np.random.normal(size=200)})
# Values are numeric here; strings would be treated as categories by matplotlib.
specific_case = pd.DataFrame([['a', 'w', 0.5],
['a', 'x', 0.2],
['a', 'y', 0.1],
['a', 'z', 0.3],
['b', 'w', -0.5],
['b', 'x', -0.2],
['b', 'y', 0.3],
['b', 'z', 0.5]
], columns = df.columns )
ax = sns.boxplot(x='Property 2', hue='Property 1', y='Value', data=df)
X = np.repeat(np.atleast_2d(np.arange(4)),2, axis=0)+ np.array([[-.2],[.2]])
ax.plot(X.flatten(), specific_case['Value'].values, 'ro', zorder=4)
plt.show()
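The offset arithmetic above can also be written as a lookup by label, so that it does not depend on the row order of specific_case and works for more than two hue levels. This is just a sketch: dodged_x is a hypothetical helper, and it assumes the default group width of 0.8 mentioned above and that you know the category and hue order seaborn used (order of appearance unless you pass order= / hue_order=).

```python
import numpy as np
import pandas as pd

def dodged_x(cats, hues, cat_order, hue_order, group_width=0.8):
    """x position of each (category, hue) pair in a dodged boxplot."""
    n_hues = len(hue_order)
    box_width = group_width / n_hues            # width of a single box
    cat_pos = {c: i for i, c in enumerate(cat_order)}
    hue_pos = {h: j for j, h in enumerate(hue_order)}
    base = np.array([cat_pos[c] for c in cats], dtype=float)
    # centre the hue levels around the whole-number category position
    offset = np.array([(hue_pos[h] - (n_hues - 1) / 2) * box_width
                       for h in hues])
    return base + offset

specific_case = pd.DataFrame({
    'Property 1': ['a'] * 4 + ['b'] * 4,
    'Property 2': ['w', 'x', 'y', 'z'] * 2,
    'Value': [0.5, 0.2, 0.1, 0.3, -0.5, -0.2, 0.3, 0.5],
})
x = dodged_x(specific_case['Property 2'], specific_case['Property 1'],
             cat_order=['w', 'x', 'y', 'z'], hue_order=['a', 'b'])
print(x)
```

The resulting x can be passed straight to ax.plot(x, specific_case['Value'], 'ro', zorder=4) on the boxplot axes.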
I got it figured out:
In your code, do the following to extract the x-coordinates per hue. I did not do it for y, but the logic should be the same.
Create two lists to hold your x-coordinates:
display_coordinates_1=[]
display_coordinates_2=[]
Inside your for loop that starts with:
for c in ax.get_children():
Use the following:
display_coordinates_1.append(c.get_extents().x0)
You need x0 for the x-coordinates of the boxes under the first hue.
The following gives you the x-coordinates of the boxes under the second hue. Note the use of x1 here:
display_coordinates_2.append(c.get_extents().x1)
Lastly, after you inv.transform() the two lists, make sure you select every other value: for the x-coordinates each list has 6 outputs and you want the ones at indices 0, 2, 4, i.e. [::2].
Hope this helps.