I have a DataFrame which, when dumped to Excel, appears as follows:
I need to dump it to Excel with formatting so that it appears as:
That is, I have dictionaries that are used to apply colors to the column names and index names.
colorIndex = {'A':'Bb', 'B':'B'}
colorColumn = {'ATC':'X1', 'P25':'Y'}
I am using the following code to generate dataframe and dump to excel:
import pandas as pd, numpy as np, sys, os

def getDF():
    df = pd.DataFrame()
    df['ATC'] = np.random.rand(1, 7).round(2).flatten()
    df['P25'] = np.random.rand(1, 7).round(2).flatten()
    df['P75'] = np.random.rand(1, 7).round(2).flatten()
    df['Type1'] = ['A', 'B', 'B', 'A', 'B', 'B', 'A']
    df['Type11'] = ['A', 'Aa', 'Bb', 'A', 'Bb', 'B', 'Bb']
    df['Type2'] = ['X', 'X', 'X1', 'Y', 'Y', 'Y1', 'Y']
    df = df.pivot_table(index=['Type1', 'Type11'], columns='Type2', aggfunc=[np.mean])['mean']
    return df

df = getDF()
fn = r'C:\Users\Desktop\format_file.xlsx'
df.to_excel(fn, engine='openpyxl')
But I have no idea how to generate the style parameters for this kind of Excel dump.
How do I make four subplots in one figure and save it to my desktop? I'm having trouble with the input prompt, where you can enter 4 different protein sequences.
import numpy as np
from collections import defaultdict
from matplotlib import pyplot as plt

protein_input = input('Protein Sequence: ')
protein_nospace = protein_input.strip()

# plot protein frequency and print graph
x_values = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
counts = defaultdict(int)
for aa in protein_nospace:
    if aa in x_values:
        counts[aa] += 1
    else:
        counts[aa] = 1
# one bar height per entry of x_values, so both sequences have the same length
y_values = np.array([counts[aa] for aa in x_values])
plt.figure()
plt.bar(x_values, y_values)
plt.title('Amino acid Frequencies')
plt.xlabel('Amino Acids')
plt.ylabel('Frequency')
plt.show()
You could create a subplot for each of the proteins. Matplotlib's object-oriented interface makes it easy to draw each protein into its own subplot, and plt.savefig saves the figure to a file.
In general, it is not a good idea to read this type of input in an interactive session. That way, you can't easily check for errors, nor can you easily refer back to the input later to verify how the plot was created.
For short scripts and short inputs, the easiest approach is to copy-paste your input into the code and then save the code for later reference. For longer inputs, or to create the same graph from other input sets, you can save just the strings in separate files.
The code below now shows how the input can be read interactively. (In the comment there is a version with "hard-coded" inputs.)
from matplotlib import pyplot as plt
from collections import defaultdict

protein_input_list = []
while True:
    protein_input = input('Enter next protein sequence (empty input to stop): ')
    protein_nospace = protein_input.strip()
    if len(protein_nospace) == 0:
        break
    else:
        protein_input_list.append(protein_nospace)

'''
protein_input_list = ['ABABDEADWDSWEFAECD',
                      'DEFSDMDSLHEWVOIHWEAAEHRG',
                      'HIWEORMLSDAWEEFWEFWEEWJK',
                      'JLSSFSFLIWIJOWHOE']
'''

# squeeze=False keeps axes a 2-D array even when only one sequence was entered
fig, axes = plt.subplots(ncols=len(protein_input_list), figsize=(15, 5), squeeze=False)
for index, (ax, protein_nospace) in enumerate(zip(axes.ravel(), protein_input_list)):
    x_values = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
    counts = defaultdict(int)
    for aa in protein_nospace:
        if aa in x_values:
            counts[aa] += 1
        else:
            counts[aa] = 1
    # bars appear in the order in which the letters first occur in the sequence
    ax.bar(list(counts.keys()), list(counts.values()))
    ax.set_title(f'Amino acid Frequencies {index+1}')
    ax.set_xlabel('Amino Acids')
    ax.set_ylabel('Frequency')
plt.savefig('Amino acid Frequencies.png')
plt.show()
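The earlier suggestion to keep the sequences in a separate file could be sketched roughly as follows. The one-sequence-per-line layout and the helper names `load_sequences` and `count_amino_acids` are assumptions made for illustration; they are not part of the original code.

```python
from collections import Counter
from pathlib import Path

# Same 20 letters, in the same order, as x_values in the code above.
AMINO_ACIDS = list('ACDEFGHIKLMNPQRSTVWY')

def load_sequences(path):
    """Return the stripped, non-empty lines of a text file (one sequence per line)."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]

def count_amino_acids(sequence):
    """Return one count per entry of AMINO_ACIDS, in that fixed order.

    Letters that are not in AMINO_ACIDS are ignored.
    """
    counts = Counter(sequence)
    return [counts[aa] for aa in AMINO_ACIDS]
```

With these helpers, `protein_input_list = load_sequences('proteins.txt')` (a hypothetical file name) could replace the interactive loop, and `ax.bar(AMINO_ACIDS, count_amino_acids(seq))` would draw the bars in the same order in every subplot.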
I have a pandas dataframe df with the contents below:
df = pd.DataFrame({'x': ['a', 'b', 'c'], 'y': [15, 10, 5]})
I would like to add a third column that shows each value of y divided by the value of y in the row where x is 'c'.
I tried the following, but it did not work:
df['z'] = df['y']/df.loc[df['x']== 'c', 'y']
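A likely reason the attempt above does not give the expected result is index alignment: `df.loc[df['x'] == 'c', 'y']` is a one-row Series, so the division only matches the row with the same index label and leaves the other rows as NaN. A minimal sketch of one way around this, assuming exactly one row has x equal to 'c', is to extract the scalar first (the name `divisor` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c'], 'y': [15, 10, 5]})

# Pull out the single matching value as a plain scalar.
# Assumes exactly one row has x == 'c'.
divisor = df.loc[df['x'] == 'c', 'y'].iloc[0]

# Dividing a Series by a scalar applies to every row; no index alignment is involved.
df['z'] = df['y'] / divisor
print(df)
```

If several rows could match, the choice of which value to divide by would have to be made explicitly.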
First, this answer does not work for me, but the problem is essentially the same. My data x is a list in the range [-2:18], labeled as [A:U]. The last bin (17 or T) is actually accumulating the number of values of 17-T and 18-U, showing bin 18-U empty.
My code looks like this (aesthetics have been omitted, x was read from a .csv):
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=figsize)
Labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
          'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U']
bins = len(Labels)
ax.hist(x, bins=bins, density=False, histtype='step', color='grey', linewidth=2)
ax.set_xticklabels(Labels)
plt.show()
The result is this:
Trying the existing solution, bins = len(Labels) + 1 does not make any difference.
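One commonly used way to give every integer value its own bin is to pass explicit bin edges centred on the integers instead of a bin count. The sketch below assumes x holds integers from -2 to 18; the sample data is made up for illustration, and `np.histogram` is used only to check the counts without drawing anything.

```python
import numpy as np

# Hypothetical sample data: every integer from -2 to 18 appears exactly twice.
x = np.repeat(np.arange(-2, 19), 2)

# 22 edges at -2.5, -1.5, ..., 18.5 give 21 unit-wide bins centred on the integers.
edges = np.arange(-2, 20) - 0.5

counts, _ = np.histogram(x, bins=edges)
print(len(counts), counts.min(), counts.max())
```

The same `edges` array can be passed as `bins=edges` to `ax.hist`. For the labels to line up, the tick positions would also need to be set, for example with `ax.set_xticks(np.arange(-2, 19))` before `ax.set_xticklabels(Labels)`.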
I have a seaborn boxplot (sns.boxplot) on which I would like to add some points. For example, say I have this pandas DataFrame:
[In] import pandas as pd
     import numpy as np
     import matplotlib.pyplot as plt
     import seaborn as sns
     df = pd.DataFrame({'Property 1': ['a']*100 + ['b']*100,
                        'Property 2': ['w', 'x', 'y', 'z']*50,
                        'Value': np.random.normal(size=200)})
     df.head(3)
[Out]   Property 1 Property 2     Value
      0          a          w  1.421380
      1          a          x -1.034465
      2          a          y  0.212911
[In] df.shape
[Out] (200, 3)
I can easily generate a boxplot with seaborn:
[In] sns.boxplot(x='Property 2', hue='Property 1', y='Value', data=df)
[Out]
Now say I want to add markers for a specific case in my sample. I can get close with this:
[In] specific_case = pd.DataFrame([['a', 'w', 0.5],
                                   ['a', 'x', 0.2],
                                   ['a', 'y', 0.1],
                                   ['a', 'z', 0.3],
                                   ['b', 'w', -0.5],
                                   ['b', 'x', -0.2],
                                   ['b', 'y', 0.3],
                                   ['b', 'z', 0.5]],
                                  columns=df.columns)
[In] sns.boxplot(x='Property 2', hue='Property 1', y='Value', data=df)
     plt.plot(np.arange(-0.25, 3.75, 0.5),
              specific_case['Value'].values, 'ro')
[Out]
That is unsatisfactory, of course.
I then used this answer, which talks about getting the bbox, and this tutorial about converting display coordinates into data coordinates, to write this function:
[In] import matplotlib as mpl

     def get_x_coordinates_of_seaborn_boxplot(ax, x_or_y):
         display_coordinates = []
         inv = ax.transData.inverted()
         for c in ax.get_children():
             if type(c) == mpl.patches.PathPatch:
                 if x_or_y == 'x':
                     display_coordinates.append(
                         (c.get_extents().xmin+c.get_extents().xmax)/2)
                 if x_or_y == 'y':
                     display_coordinates.append(
                         (c.get_extents().ymin+c.get_extents().ymax)/2)
         return inv.transform(tuple(display_coordinates))
That works great for my first hue, but not at all for my second:
[In] ax = sns.boxplot(x='Property 2', hue='Property 1', y='Value', data=df)
     coords = get_x_coordinates_of_seaborn_boxplot(ax, 'x')
     plt.plot(coords, specific_case['Value'].values, 'ro')
[Out]
How can I get the data coordinates of all my boxes?
I'm unsure about the purpose of those transformations, but it seems the real problem is just to plot the points from specific_case at the correct positions. The x-coordinate of each box is shifted by 0.2 from the whole number. (That is because the group of boxes is 0.8 wide by default; with 2 boxes, each is 0.4 wide, and half of that is 0.2.)
You then need to arrange the x values to fit to those of the specific_case dataframe.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({'Property 1': ['a']*100 + ['b']*100,
                   'Property 2': ['w', 'x', 'y', 'z']*50,
                   'Value': np.random.normal(size=200)})
# numeric values, so that they are plotted on the existing numeric y axis
specific_case = pd.DataFrame([['a', 'w', 0.5],
                              ['a', 'x', 0.2],
                              ['a', 'y', 0.1],
                              ['a', 'z', 0.3],
                              ['b', 'w', -0.5],
                              ['b', 'x', -0.2],
                              ['b', 'y', 0.3],
                              ['b', 'z', 0.5]],
                             columns=df.columns)
ax = sns.boxplot(x='Property 2', hue='Property 1', y='Value', data=df)
# row 0: x positions of the 'a' boxes (shifted by -0.2); row 1: the 'b' boxes (+0.2)
X = np.repeat(np.atleast_2d(np.arange(4)), 2, axis=0) + np.array([[-.2], [.2]])
ax.plot(X.flatten(), specific_case['Value'].values, 'ro', zorder=4)
plt.show()
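The arithmetic behind the 0.2 shift can be written out for any number of hue levels. This is only a sketch of the reasoning above, assuming a total group width of 0.8 split evenly between the hue levels; the function names `dodge_offsets` and `dodged_positions` are made up for illustration and are not seaborn API.

```python
import numpy as np

def dodge_offsets(n_hues, total_width=0.8):
    """Offsets of the box centres from the integer category position."""
    box_width = total_width / n_hues
    # Centre the n_hues boxes symmetrically around 0.
    return (np.arange(n_hues) - (n_hues - 1) / 2) * box_width

def dodged_positions(n_categories, n_hues, total_width=0.8):
    """x positions ordered hue by hue: all categories of hue 0, then hue 1, ..."""
    offsets = dodge_offsets(n_hues, total_width)
    return (np.arange(n_categories) + offsets[:, None]).ravel()

print(dodge_offsets(2))
print(dodged_positions(4, 2))
```

For 4 categories and 2 hue levels this gives the same values as `X.flatten()` in the code above, in the same order as the rows of specific_case.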
I got it figured out:
In your code do this to extract the x-coordinate based on hue. I did not do it for the y, but the logic should be the same:
Create two lists holding your x coordinate:
display_coordinates_1=[]
display_coordinates_2=[]
Inside your for loop that starts with:
for c in ax.get_children():
Use the following:
display_coordinates_1.append(c.get_extents().x0)
You need x0 for the x-coordinate of boxplots under first hue.
The following gives you the x-coordinates for the subplots in the second hue. Note the use of x1 here:
display_coordinates_2.append(c.get_extents().x1)
Lastly, after you inv.transform() the two lists, make sure you select every other value, since for x-coordinates each list has 6 outputs and you want the ones at indices 0,2,4 or [::2].
Hope this helps.
Is there a simple way to specify bar colors by column name using the Pandas DataFrame.plot(kind='bar') method?
I have a script that generates multiple DataFrames from several different data files in a directory. For example it does something like this:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pds
data_files = ['a', 'b', 'c', 'd']
df1 = pds.DataFrame(np.random.rand(4,3), columns=data_files[:-1])
df2 = pds.DataFrame(np.random.rand(4,3), columns=data_files[1:])
df1.plot(kind='bar', ax=plt.subplot(121))
df2.plot(kind='bar', ax=plt.subplot(122))
plt.show()
With the following output:
Unfortunately, the column colors aren't consistent for each label in the different plots. Is it possible to pass in a dictionary of (filenames:colors), so that any particular column always has the same color? For example, I could imagine creating this by zipping up the filenames with the Matplotlib color cycle:
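data_files = ['a', 'b', 'c', 'd']
# 'axes.color_cycle' in old Matplotlib versions; newer versions use 'axes.prop_cycle'
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
print(list(zip(data_files, colors)))
[('a', '#1f77b4'), ('b', '#ff7f0e'), ('c', '#2ca02c'), ('d', '#d62728')]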
I could figure out how to do this directly with Matplotlib: I just thought there might be a simpler, built-in solution.
Edit:
Below is a partial solution that works in pure Matplotlib. However, I'm using this in an IPython notebook that will be distributed to non-programmer colleagues, and I'd like to minimize the amount of excessive plotting code.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pds

data_files = ['a', 'b', 'c', 'd']
# 'axes.color_cycle' in old Matplotlib versions; newer versions use 'axes.prop_cycle'
mpl_colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
colors = dict(zip(data_files, mpl_colors))

def bar_plotter(df, colors, sub):
    ncols = df.shape[1]
    width = 1./(ncols+2.)
    starts = df.index.values - width*ncols/2.
    plt.subplot(120+sub)
    for n, col in enumerate(df):
        plt.bar(starts + width*n, df[col].values, color=colors[col],
                width=width, label=col)
    plt.xticks(df.index.values)
    plt.grid()
    plt.legend()

df1 = pds.DataFrame(np.random.rand(4,3), columns=data_files[:-1])
df2 = pds.DataFrame(np.random.rand(4,3), columns=data_files[1:])
bar_plotter(df1, colors, 1)
bar_plotter(df2, colors, 2)
plt.show()
You can pass a list as the colors. This will require a little bit of manual work to get it to line up, unlike if you could pass a dictionary, but may be a less cluttered way to accomplish your goal.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pds
data_files = ['a', 'b', 'c', 'd']
df1 = pds.DataFrame(np.random.rand(4,3), columns=data_files[:-1])
df2 = pds.DataFrame(np.random.rand(4,3), columns=data_files[1:])
color_list = ['b', 'g', 'r', 'c']
df1.plot(kind='bar', ax=plt.subplot(121), color=color_list)
df2.plot(kind='bar', ax=plt.subplot(122), color=color_list[1:])
plt.show()
EDIT
Ajean came up with a simple way to return a list of the correct colors from a dictionary:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pds
data_files = ['a', 'b', 'c', 'd']
color_list = ['b', 'g', 'r', 'c']
d2c = dict(zip(data_files, color_list))
df1 = pds.DataFrame(np.random.rand(4,3), columns=data_files[:-1])
df2 = pds.DataFrame(np.random.rand(4,3), columns=data_files[1:])
# list(...) is needed on Python 3, where map returns an iterator
df1.plot(kind='bar', ax=plt.subplot(121), color=list(map(d2c.get, df1.columns)))
df2.plot(kind='bar', ax=plt.subplot(122), color=list(map(d2c.get, df2.columns)))
plt.show()
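The dictionary lookup can also be wrapped in a small helper, so that a column missing from the dictionary falls back to a default color instead of None. This is only a sketch; the helper name `colors_for`, the default of 'gray', and the example column 'e' are assumptions for illustration.

```python
import pandas as pd

def colors_for(df, color_by_column, default='gray'):
    """Return one color per column of df, in column order."""
    return [color_by_column.get(col, default) for col in df.columns]

d2c = {'a': 'b', 'b': 'g', 'c': 'r', 'd': 'c'}
# Hypothetical frame with a column ('e') that has no entry in d2c.
df = pd.DataFrame({'b': [1, 2], 'c': [3, 4], 'e': [5, 6]})
print(colors_for(df, d2c))  # ['g', 'r', 'gray']
```

The result can be passed directly as `color=colors_for(df1, d2c)` in the `df1.plot(kind='bar', ...)` calls.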
Pandas version 1.1.0 makes this easier. You can pass a dictionary to the pandas.DataFrame.plot.bar() function to specify a different color for each column. Here is an example:
import pandas as pd

df1 = pd.DataFrame({'a': [1.2, .8, .9], 'b': [.2, .9, .7]})
df2 = pd.DataFrame({'b': [0.2, .5, .4], 'c': [.5, .6, .7], 'd': [1.1, .6, .7]})
color_dict = {'a': 'green', 'b': 'red', 'c': 'blue', 'd': 'cyan'}
df1.plot.bar(color=color_dict)
df2.plot.bar(color=color_dict)