PyPlot line plot changing color by column value - pandas

I have a dataframe with a structure similar to the following example.
df = pd.DataFrame({'x': ['2008-01-01', '2008-01-02', '2008-01-03', '2008-01-04'], 'y': [1, 2, 3, 6],
'group_id': ['OBSERVED', 'IMPUTED', 'OBSERVED', 'IMPUTED'], 'color': ['blue', 'red', 'blue', 'red']})
df['x'] = pd.to_datetime(df['x'])
I.e. a dataframe where some of the values (y) are observed and others are imputed.
           x  y  group_id color
0 2008-01-01  1  OBSERVED  blue
1 2008-01-02  2   IMPUTED   red
2 2008-01-03  3  OBSERVED  blue
3 2008-01-04  6   IMPUTED   red
How do I create a single line which changes color based on the group_id (the column color is uniquely determined by group_id, as in this example)?
I have tried the following two solutions (one of them is commented out):
df_grp = df.groupby('group_id')
fig, ax = plt.subplots(1)
for id, data in df_grp:
    #ax.plot(data['x'], data['y'], label=id, color=data['color'].unique().tolist()[0])
    data.plot('x', 'y', label=id, ax=ax)
plt.legend()
plt.show()
However, the plot is neither a single line nor colored correctly by each segment.

You can use the code below so that each segment takes the color of its starting point. The key is to shape the dataframe so that plotting becomes easy; you can print(df) after the manipulation to see what was done. For every row except the last, I added the x and y of the following row as extra columns in the current row. I also draw a marker in the resulting color so that you can tell whether a point is red or blue. Note that the dates in the x column must be in ascending order.
#Add x_end column to df from subsequent row - column X
end_date=df.iloc[1:,:]['x'].reset_index(drop=True).rename('x_end')
df = pd.concat([df, end_date], axis=1)
#Add y_end column to df from subsequent row - column y
end_y=df.iloc[1:,:]['y'].reset_index(drop=True).astype(int).rename('y_end')
df = pd.concat([df, end_y], axis=1)
#Add values for last row, same as x and y so the marker is of right color
df.iat[len(df)-1, 4] = df.iat[len(df)-1, 0]
df.iat[len(df)-1, 5] = df.iat[len(df)-1, 1]
for i in range(len(df)):
    plt.plot([df.iloc[i,0],df.iloc[i,4]],[df.iloc[i,1],df.iloc[i,5]], marker='o', c=df.iat[i,3])
Output plot
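The same idea can be written more compactly with shift(-1), which pulls the next row's coordinates onto the current row. This is only a sketch of that variant, not the code from the answer above; the names segments, x_end and y_end are chosen here for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data from the question, sorted by date so segments run left to right
df = pd.DataFrame({
    'x': pd.to_datetime(['2008-01-01', '2008-01-02', '2008-01-03', '2008-01-04']),
    'y': [1, 2, 3, 6],
    'group_id': ['OBSERVED', 'IMPUTED', 'OBSERVED', 'IMPUTED'],
    'color': ['blue', 'red', 'blue', 'red'],
}).sort_values('x').reset_index(drop=True)

# Each row gets the coordinates of the following row; the last row has no
# successor, so it is dropped from the segment table
segments = df.assign(x_end=df['x'].shift(-1), y_end=df['y'].shift(-1))
segments = segments.dropna(subset=['x_end'])

fig, ax = plt.subplots()
# One short line per segment, colored by the segment's starting point
for row in segments.itertuples(index=False):
    ax.plot([row.x, row.x_end], [row.y, row.y_end], color=row.color)
# Markers for every point, including the last one
ax.scatter(df['x'], df['y'], c=df['color'].tolist(), zorder=3)
# call plt.show() to display the figure
```

With four points this draws three segments (blue, red, blue) plus four markers.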

Related

Matplotlib Plot X, Y Line Plot Multiple Columns Fixed X Axis

I'm trying to plot a df with the x axis forced to the order 12, 1, 2 (Dec, Jan, Feb) and I cannot see how to do this. Matplotlib keeps plotting the x axis in 1, 2, 12 order. A partial view of my DF (analogs_re) for this example looks like this:
   Month      2000      1995      2009      2014      1994      2003
0     12 -0.203835  0.580590  0.233124  0.490193  0.605808  0.016756
1      1 -0.947029 -1.239794 -0.977004  0.207236  0.436458 -0.501948
2      2 -0.059957  0.708626  0.111840  0.422534  1.051873 -0.149000
I need the y data plotted with the x axis in 12, 1, 2 order, as shown in the 'Month' column.
My code:
fig, ax = plt.subplots()
#for name, group in groups:
analogs_re.set_index('Month').plot(figsize=(10,5),grid=True)
analogs_re.plot(x='Month', y=analogs_re.columns[1:len(analogs_re.columns)])
When you set Month as the x-axis, its values are treated as numbers and drawn in numerical order (1, 2, ..., 12), because a numeric axis cannot start with 12 and then continue with 1 and 2.
The trick is to use the original index as x-axis, then label those ticks using the month number:
fig, ax = plt.subplots()
analogs_re.drop(columns='Month').plot(figsize=(10,5), grid=True, ax=ax)
ax.set_xticks(analogs_re.index)
ax.set_xticklabels(analogs_re["Month"])
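For reference, here is a self-contained sketch of the tick relabeling above. It uses only the first two data columns from the question, and the month labels are converted to strings explicitly, which is a choice made for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# First two data columns from the question; Month is deliberately out of numeric order
analogs_re = pd.DataFrame({
    'Month': [12, 1, 2],
    '2000': [-0.203835, -0.947029, -0.059957],
    '1995': [0.580590, -1.239794, 0.708626],
})

fig, ax = plt.subplots(figsize=(10, 5))
# Plot against the default index (0, 1, 2) so the row order is preserved
analogs_re.drop(columns='Month').plot(grid=True, ax=ax)
# Put one tick at each row position and label it with that row's month
ax.set_xticks(analogs_re.index)
ax.set_xticklabels(analogs_re['Month'].astype(str).tolist())

tick_labels = [t.get_text() for t in ax.get_xticklabels()]
```

The tick positions stay 0, 1, 2, while the labels read 12, 1, 2.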

Why does pandas.DataFrame.apply produce a Series instead of a DataFrame

I do not really understand why the following code returns a Series rather than a DataFrame.
import pandas as pd
df = pd.DataFrame([[4,9]]*3, columns = ["A", "B"])
def plus_2(x):
    y = []
    for i in range(0, len(x)):
        y.append(x[i]+2)
    return y
df_row = df.apply(plus_2, axis = 1) # Applied to each row
df_row
Whereas if I change to axis=0, it produces a DataFrame as expected:
import pandas as pd
df = pd.DataFrame([[4,9]]*3, columns = ["A", "B"])
def plus_2(x):
    y = []
    for i in range(0, len(x)):
        y.append(x[i]+2)
    return y
df_row = df.apply(plus_2, axis = 0) # Applied to each column
df_row
Here is the output:
In the first example, axis=1 applies the function at row level.
For each row, plus_2 returns y, which is a list of two elements. The list as a whole is treated as a single value, so the result is a pd.Series of lists.
In your example, three lists (two elements each) are returned, one list per row.
You can expand this result into two columns (each list element becomes its own column) by passing result_type="expand" to apply:
df_row = df.apply(lambda x: plus_2(x), axis=1, result_type="expand")
# output
   0   1
0  6  11
1  6  11
2  6  11
In the second approach you have axis=0, so the function is applied at column level.
For each column, plus_2 returns y, so plus_2 is called twice, once for column A and once for column B. This is why it returns a DataFrame: your input is a DataFrame with columns A and B, plus_2 is applied to each column, and the results come back as columns A and B.
In your example, two lists (three elements each) are returned, one list per column.
So the main difference between axis=1 and axis=0 is this:
if you apply at row level, the function returns:
[6, 11]
[6, 11]
[6, 11]
if you apply at column level, the function returns:
[6, 6, 6]
[11, 11, 11]
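The three cases can be checked side by side. This is a small sketch; plus_2 is rewritten here as a list comprehension for brevity, and the variable names by_row, by_col and expanded are made up for the example.

```python
import pandas as pd

df = pd.DataFrame([[4, 9]] * 3, columns=["A", "B"])

def plus_2(x):
    # Returns a plain Python list, as in the question
    return [value + 2 for value in x]

# Row level: each row yields one list, stored as a single value in a Series
by_row = df.apply(plus_2, axis=1)

# Column level: each column yields one list, which becomes a column of the result
by_col = df.apply(plus_2, axis=0)

# Row level again, with each list spread across columns
expanded = df.apply(plus_2, axis=1, result_type="expand")

print(type(by_row).__name__, type(by_col).__name__, type(expanded).__name__)
```

Here by_row is a Series whose values are lists, while by_col and expanded are both DataFrames.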

Cumulative histogram plot from dataframe

The goal is to create a plot like this
Dummy df:
columns = ['number_of_words', 'occurrences']
data = [[1, 2312252],
[2,1000000],
[3,800000],
[4, 400000],
[5, 100000],
[6, 70000],
[7, 40000],
[8, 10000],
[9, 4000],
[10, 50]]
dummy_df = pd.DataFrame(columns=columns, data=data)
The y axis represents the occurrences and the x axis the number of words column from the dummy_df.
The y axis should be cumulative such that it stacks the values on top of each other.
Example: With number_of_words = 1 we have around 2.3m occurrences. With number_of_words = 2 we have around 1m occurrences, thus it should plot 2.3m + 1m at number_of_words = 2.
At the final entry of number_of_words the histogram should reach sum(occurrences).
I do NOT want to normalize it.
Since you already have the frequencies worked out, just sum them cumulatively:
dummy_df['acc'] = dummy_df.occurrences.cumsum()
ax = dummy_df['acc'].plot(kind='bar', width=1, color='b')
dummy_df['acc'].shift().plot(kind='bar', alpha=0.7, width=1, color='r', ax=ax)
To split it into parts, plot it twice. The first plot is the normal cumsum; the second is just the values, with the shifted cumsum setting the bottom (this overlaps the top of the previously plotted cumsum).
Slicing the Series with .iloc[1:] just before plotting removes the first bar, which you want to exclude.
fig, ax = plt.subplots()
dummy_df['occurrences'].cumsum().iloc[1:].plot(kind='bar', width=1, ec='k', ax=ax)
dummy_df['occurrences'].iloc[1:].plot(kind='bar', width=1, ec='k',
    bottom=dummy_df['occurrences'].cumsum().shift().fillna(0).iloc[1:], ax=ax, color='red')
plt.show()
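A variant of the same stacking, written directly against matplotlib's ax.bar instead of the pandas plot accessor, might look like this. It is only a sketch: the column name prev and the colors are choices made for the example, and it keeps all ten bars rather than dropping the first.

```python
import pandas as pd
import matplotlib.pyplot as plt

dummy_df = pd.DataFrame({
    'number_of_words': range(1, 11),
    'occurrences': [2312252, 1000000, 800000, 400000, 100000,
                    70000, 40000, 10000, 4000, 50],
})

# Running total up to and including each row (not normalized)
dummy_df['acc'] = dummy_df['occurrences'].cumsum()
# Running total before each row; 0 for the first row
dummy_df['prev'] = dummy_df['acc'].shift().fillna(0)

fig, ax = plt.subplots()
# Lower layer: everything accumulated so far
ax.bar(dummy_df['number_of_words'], dummy_df['prev'], width=1, ec='k', color='b')
# Upper layer: this row's own occurrences, stacked on the previous total
ax.bar(dummy_df['number_of_words'], dummy_df['occurrences'], width=1, ec='k',
       bottom=dummy_df['prev'], color='r')
ax.set_xlabel('number_of_words')
ax.set_ylabel('cumulative occurrences')
```

The top of the last bar equals the sum of the occurrences column.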

Matplotlib - Correlation plots with different range of numbers but on same scale

I would like to have a 2 by 3 figure with 6 correlation plots, sharing the same scale, even when the values in the plots have different ranges. Below you can see what I have so far.
In the first column, the values range from 0 to 1, with 1 on the diagonal and close to 0 elsewhere. In the other two columns, the values in the top row range from 0 to 1, whereas the values in the bottom row range from -1 to 1. The difference between the second and third column is that the values in the second column are around 0.3 (and -0.3) and the values in the third column are around 0.7 (and -0.7).
As you can see, several things are going wrong. First of all, although I want all plots to use the same color scale, with dark blue being -1 and yellow being 1, this is clearly not the case. If it were, the first column would show bright blue/greenish colors. How can I set the range for the colors? Next, how do I change the labels of the color scale on the right? I would like it to range from -1 to 1.
Below, you find my implementation.
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(15,8))
idx_mixed = {False: 0, True: 1}
idx_rho = {0: 0, 0.3: 1, 0.7: 2}
for mixed in [False, True]:
    for rho in [0, 0.3, 0.7]:
        ax[idx_mixed[mixed]][idx_rho[rho]].matshow(results[mixed][rho])
ax[0][0].set_title("No correlation", pad=20, fontsize=14)
ax[0][1].set_title("Weakly correlated", pad=20, fontsize=14)
ax[0][2].set_title("Strongly correlated", pad=20, fontsize=14)
ax[0][0].set_ylabel("Positive correlations", fontsize = 14)
ax[1][0].set_ylabel("Mixed correlations", fontsize = 14)
fig.colorbar(mpl.cm.ScalarMappable(), ax=fig.get_axes())
You need to provide a norm= argument to matshow() so that the data is scaled to the range [-1, 1] rather than a range defined by the min and max value present in the data. See Colormap Normalization for more details.
cmap = 'viridis'
norm = matplotlib.colors.Normalize(vmin=-1, vmax=1)
fig, axs = plt.subplots(2,3)
for ax, d in zip(axs.flat, data):
    m = ax.matshow(d, cmap=cmap, norm=norm)
fig.colorbar(m, ax=axs)
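A runnable sketch of the shared normalization follows. The six matrices are random placeholders generated here for illustration, not the asker's results; only the Normalize(vmin=-1, vmax=1) part reflects the answer above.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

# Placeholder data: top row in [0, 1], bottom row in [-1, 1]
rng = np.random.default_rng(0)
data = [rng.uniform(low, 1, size=(5, 5)) for low in (0, 0, 0, -1, -1, -1)]

# One norm shared by every panel, so equal values get equal colors
norm = Normalize(vmin=-1, vmax=1)

fig, axs = plt.subplots(2, 3, figsize=(15, 8))
images = [ax.matshow(d, cmap='viridis', norm=norm)
          for ax, d in zip(axs.flat, data)]

# Any of the images can drive the colorbar, since they all share the norm
cbar = fig.colorbar(images[0], ax=axs)
```

Because the colorbar is built from an image that carries the norm, its scale runs from -1 to 1 regardless of the range present in any single panel.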

Making multiple pie charts out of a pandas dataframe (one for each column)

My question is similar to Making multiple pie charts out of a pandas dataframe (one for each row).
However, instead of each row, I am looking for each column in my case.
I can make a pie chart for each column; however, as I have 12 columns, the pie charts are too close to each other.
I have used this code:
fig, axes = plt.subplots(4, 3, figsize=(10, 6))
for i, (idx, row) in enumerate(df.iterrows()):
    ax = axes[i // 3, i % 3]
    row = row[row.gt(row.sum() * .01)]
    ax.pie(row, labels=row.index, startangle=30)
    ax.set_title(idx)
fig.subplots_adjust(wspace=.2)
and I have the following result
But what I want is the other way around. I need 12 pie charts (because I have 12 columns), and each pie chart should have 4 sections (which are leg, car, walk, and bike).
If I write this code
fig, axes = plt.subplots(4,3)
for i, col in enumerate(df.columns):
    ax = axes[i // 3, i % 3]
    plt.plot(df[col])
then I have the following results:
and if I use :
plot = df.plot.pie(subplots=True, figsize=(17, 8),labels=['pt','car','walk','bike'])
then I have the following results:
This is close to what I am looking for, but the pie charts are not readable. A clearer output would be better.
As in your linked post, I would use matplotlib.pyplot for this. The accepted answer there uses plt.subplots(2, 3), and I would suggest doing the same to create two rows with three plots each.
Like this:
fig, axes = plt.subplots(2,3)
for i, col in enumerate(df.columns):
    ax = axes[i // 3, i % 3]
    ax.plot(df[col])
In the end, I realized that I can transpose the dataframe (swap rows and columns):
df_sw = df.T
Then I can use the code from the examples in:
Making multiple pie charts out of a pandas dataframe (one for each row)
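Alternatively, the columns can be iterated directly without transposing. The sketch below assumes a dataframe with 4 rows (pt, car, walk, bike, taken from the labels in the question) and 12 columns; the column names zone_1 to zone_12 and the values are invented for the example. A 4 by 3 grid is used so that all 12 columns fit, and ax.pie is called on each axes.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data: 4 transport modes (rows) by 12 columns
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(1, 100, size=(4, 12)),
    index=['pt', 'car', 'walk', 'bike'],
    columns=[f'zone_{i}' for i in range(1, 13)],  # hypothetical column names
)

# 4 rows x 3 columns of axes; a tall figure keeps the pies from crowding
fig, axes = plt.subplots(4, 3, figsize=(10, 12))
for ax, col in zip(axes.flat, df.columns):
    # One pie per dataframe column, one wedge per row label
    ax.pie(df[col], labels=df.index.tolist(), startangle=30)
    ax.set_title(col)
fig.subplots_adjust(wspace=0.2, hspace=0.3)
```

Increasing figsize or the hspace and wspace values gives the labels more room if they still overlap.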