Pandas bar plot with continuous x axis - pandas

I am trying to make a bar chart in pandas, with two data series coming from a groupby:
data.groupby(['popup','UID']).size().groupby(level=0).value_counts().unstack().transpose().plot(kind='bar', layout=(2,2))
The x axis is not continuous, and only shows values that are in the dataset. In this example, it jumps from 11 to 13.
How can I make it continuous?
**EDIT 2: **
I tried JohnE's data-centric approach, and it works. It creates a new index with no missing values:
temp = data.groupby(['popup','UID']).size().groupby(level=0).value_counts().unstack().transpose()
temp.reindex(np.arange(temp.index.min(), temp.index.max() + 1)).plot(kind='bar', layout=(2,2))
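As a minimal sketch of the reindexing idea, here is the same step on a small made-up frame standing in for temp (the values and the fill_value=0 choice are assumptions, not taken from the real data). np.arange excludes its stop value, so the sketch adds 1 to keep the largest index.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the unstacked, transposed counts:
# the index holds "records per UID" values, and 12 is missing.
temp = pd.DataFrame(
    {0: [5.0, 3.0, 2.0], 1: [4.0, 1.0, 6.0]},
    index=pd.Index([10, 11, 13]),
)

# Build the full integer range (stop is exclusive, hence the + 1)
# and reindex onto it; absent rows are filled with 0 (an assumption).
full_index = np.arange(temp.index.min(), temp.index.max() + 1)
filled = temp.reindex(full_index, fill_value=0)

print(filled)
```

Calling filled.plot(kind='bar') on the result would then draw an empty slot at 12 instead of skipping it.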
However, I assume there should be a better approach using a histogram instead of a bar plot. The best I could do with histograms is:
data.groupby(['popup','UID']).size().groupby(level=0).plot(kind='hist', bins=30, alpha=0.5, layout=(2,2), legend=True)
But I didn't find any option in the hist plot to get the same rendering as the bar plot, without the bars overlapping.
**EDIT: ** Here is some information to answer the comments.
Data sample:
INSEE C1 popup C3 date \
0 75101.0 0.0 0 NaN 2017-05-17T13:20:16Z
0 75101.0 0.0 0 NaN 2017-05-17T14:23:51Z
1 31557.0 0.0 1 NaN 2017-05-17T14:58:27Z
UID
0 ba4bd353-f14d-4bc5-95ba-6a1f5134cc84
0 ba4bd353-f14d-4bc5-95ba-6a1f5134cc84
1 bafe9715-3a07-4d9b-b85c-0bbf658a9115
First groupby result (sample):
data.groupby(['popup','UID']).size().head(3)
popup UID
0 016d3e7e-1901-4f84-be0e-117988ec57a8 6
01c15455-29cc-4d1e-8743-638fd0f51602 6
03fc9eb0-c5fb-4205-91f0-4b74f78a8b96 3
dtype: int64
Second groupby result (sample):
data.groupby(['popup','UID']).size().groupby(level=0).value_counts().head(3)
popup
0 1 46
3 23
4 22
dtype: int64
After unstack and transpose:
data.groupby(['popup','UID']).size().groupby(level=0).value_counts().unstack().transpose().head(3)
popup 0 1
1 46.0 38.0
2 21.0 35.0
3 23.0 22.0

One solution is a histogram plotted with matplotlib.axes.Axes.hist. A histogram suits this purpose better than a bar plot: its x axis is numeric and continuous, and we can choose the number of bins.
import matplotlib.pyplot as plt
# Separate the groups by 'popup' and count the number of records for each 'UID'
popup_values = data['popup'].unique()
count_by_popup = [data[data['popup'] == popup_value].groupby(['UID']).size() for popup_value in popup_values]
# Create a histogram with 20 bins; passing a list of series draws the groups side by side
fig, ax = plt.subplots()
ax.hist(count_by_popup, bins=20, histtype='bar', label=[str(x) for x in popup_values])
ax.legend()
plt.show()
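Because the counts are integers, one possible refinement is to give every integer its own bin, so that the bars line up with the tick labels. The sketch below assumes two small made-up arrays of per-UID counts (they are not the real data) and places the bin edges at half-integers.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical per-UID record counts for two 'popup' groups.
counts_popup_0 = np.array([1, 1, 3, 3, 4, 6, 11, 13])
counts_popup_1 = np.array([1, 2, 2, 3, 5, 13])

# One bin per integer k, with edges at k - 0.5 and k + 0.5.
low = int(min(counts_popup_0.min(), counts_popup_1.min()))
high = int(max(counts_popup_0.max(), counts_popup_1.max()))
bin_edges = np.arange(low, high + 2) - 0.5

fig, ax = plt.subplots()
heights, edges, _ = ax.hist(
    [counts_popup_0, counts_popup_1],
    bins=bin_edges,
    histtype="bar",      # groups are drawn side by side, not overlapping
    label=["0", "1"],
)
ax.set_xticks(np.arange(low, high + 1))
ax.legend()
fig.savefig("counts_hist.png")  # hypothetical output file name
```

Values with no records, such as 12 here, still get a slot on the axis, just with zero height.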

Related

Plotting interval of data in dataframe

I'm a bit new to Python, so maybe the code could be improved.
I have a txt file with x and y values, separated by some NaN in between.
The data goes from -x to x and then comes back (x to -x), but with somewhat different values of y, say:
x=np.array([-0.02,-0.01,0,0.01,0.02,np.nan,1,np.nan,0.02,0.01,0,-0.01,-0.02])
I would like to plot (with matplotlib) the data up to the first NaN with one format, x=1 with another format, and the last set of data with a third format (color, marker, linewidth...).
Of course the data I have is a bit more complex, but I guess this is a useful simplification.
Any ideas?
I'm using pandas as my data manipulation tool.
You can create a group label taking the cumsum of where x is null. Then you can define a dictionary keyed by the label with values being a dictionary containing all of the plotting parameters. Use groupby to plot each group separately, unpacking all the parameters to set the arguments for that group.
Sample Data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
x = np.array([-0.02,-0.01,0,0.01,0.02,np.nan,1,np.nan,0.02,0.01,0,-0.01,-0.02])
df = pd.DataFrame({'x': x})
Code
df['label'] = df.x.isnull().cumsum().where(df.x.notnull())
plot_params = {0: {'lw': 2, 'color': 'red', 'marker': 'o'},
               1: {'lw': 6, 'color': 'black', 'marker': 's'},
               2: {'lw': 9, 'color': 'blue', 'marker': 'x'}}
fig, ax = plt.subplots(figsize=(3,3))
for label, gp in df.groupby('label'):
    gp.plot(y='x', **plot_params[label], ax=ax, legend=None)
plt.show()
For reference, this is what df looks like after defining the group label:
print(df)
x label
0 -0.02 0.0
1 -0.01 0.0
2 0.00 0.0
3 0.01 0.0
4 0.02 0.0
5 NaN NaN
6 1.00 1.0
7 NaN NaN
8 0.02 2.0
9 0.01 2.0
10 0.00 2.0
11 -0.01 2.0
12 -0.02 2.0
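To see why the cumsum works as a group label, here is a small sketch that breaks the one-liner into steps on the same sample array. The intermediate names (is_gap, running, segments) are only illustrative.

```python
import numpy as np
import pandas as pd

x = np.array([-0.02, -0.01, 0, 0.01, 0.02, np.nan, 1, np.nan,
              0.02, 0.01, 0, -0.01, -0.02])
s = pd.Series(x)

is_gap = s.isnull()             # True at each NaN separator
running = is_gap.cumsum()       # increases by 1 at every separator
label = running.where(~is_gap)  # separators themselves get no label

# groupby drops NaN keys by default, so only the real segments remain.
segments = {int(key): group.tolist() for key, group in s.groupby(label)}
print(segments)
```

Each NaN bumps the running count, so every run of values between separators ends up sharing one label.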

Can't get y-axis on matplotlib histogram to display the right numbers

So I have this simple DataFrame that I am trying to plot a histogram from:
Hour Count Average Count
2 6 4 0.129032
4 7 1 0.032258
1 12 9 0.290323
3 16 3 0.096774
0 20 2022 65.225806
What I want is Hour on the x-axis and Average Count on the y-axis. But when I tried this:
fig, hour = plt.subplots(1, 1)
hour.hist(test.Hour)
hour.set_xlabel('Time in 24 Hours')
hour.set_ylabel('Frequency')
plt.show()
I got this instead. I have also tried test.Count and test['Average Count'], but both only affect the x-axis.
Are you looking for something like this?
Here, df is the name of the DataFrame.
df.plot(x='Hour', y='Average Count', kind='bar')
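Some background on why hist did not help: Axes.hist counts how often each value occurs in the data passed to it, so its y-axis is always a frequency. When the heights are already computed, as with Average Count here, a bar chart is the usual choice. A minimal matplotlib sketch using the rows from the question (the frame is rebuilt by hand, which is an assumption about how test was created):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# The DataFrame from the question, rebuilt by hand.
test = pd.DataFrame(
    {
        "Hour": [6, 7, 12, 16, 20],
        "Count": [4, 1, 9, 3, 2022],
        "Average Count": [0.129032, 0.032258, 0.290323, 0.096774, 65.225806],
    },
    index=[2, 4, 1, 3, 0],
)

fig, ax = plt.subplots()
# One bar per row: x position from Hour, height from Average Count.
bars = ax.bar(test["Hour"], test["Average Count"])
ax.set_xlabel("Time in 24 Hours")
ax.set_ylabel("Average Count")
fig.savefig("average_count.png")  # hypothetical output file name
```

Since Hour is passed as numeric x positions, the axis stays continuous and hours without data simply show no bar.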

plot dataframe column on one axis and other columns as separate lines on the same plot (in different color)

I have the following DataFrame:
precision recall F1 cutoff
cutoff
0 0.690148 1.000000 0.814610 0
1 0.727498 1.000000 0.839943 1
2 0.769298 0.916667 0.834051 2
3 0.813232 0.916667 0.859741 3
4 0.838062 0.833333 0.833659 4
5 0.881454 0.833333 0.854946 5
6 0.925455 0.750000 0.827202 6
7 0.961111 0.666667 0.786459 7
8 0.971786 0.500000 0.659684 8
9 0.970000 0.166667 0.284000 9
10 0.955000 0.083333 0.152857 10
I want to plot the cutoff column on the x-axis and the precision, recall and F1 values as separate lines on the same plot (in different colors). How can I do it?
When I try to plot the DataFrame, it also plots the cutoff column.
Thanks
Remove the column before plotting:
df.drop('cutoff', axis=1).plot()
But maybe the problem is in how the index was created. It may help to change:
df = df.set_index(df['cutoff'])
df.drop('cutoff', axis=1).plot()
to:
df = df.set_index('cutoff')
df.plot()
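Another option, sketched here on the first three rows of the table, is to leave cutoff as an ordinary column and name the x and y columns explicitly in DataFrame.plot. This assumes the frame has a default index rather than one built from cutoff.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# First three rows of the table from the question.
df = pd.DataFrame({
    "precision": [0.690148, 0.727498, 0.769298],
    "recall": [1.0, 1.0, 0.916667],
    "F1": [0.814610, 0.839943, 0.834051],
    "cutoff": [0, 1, 2],
})

# cutoff goes on the x-axis; each listed y column becomes its own line.
ax = df.plot(x="cutoff", y=["precision", "recall", "F1"])
ax.figure.savefig("metrics.png")  # hypothetical output file name
```

pandas picks a different color for each line from the default color cycle.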

Numpy matrix to Pandas DataFrame

I have a numpy user-item matrix, where each row corresponds to a user and each column corresponds to an item.
I want to convert the matrix into a pandas DataFrame like the following:
user item rating
0 1 1907 4.0
1 1 1028 5.0
2 1 608 4.0
3 1 2692 4.0
4 1 1193 5.0
I use the following code to generate a DataFrame:
predictions = pd.DataFrame(data=pred)
predictions = predictions.stack().reset_index(name='rating')
predictions.columns = ['user', 'item', 'rating']
and I obtain a df like this:
user item rating
0 0 0 5.000000
1 0 1 0.000000
2 0 2 0.000000
3 0 3 0.000000
Is there a way in pandas to map each value in the user and item columns to a value stored in a list? A user with value 0 should be mapped to the 1st value in the user list, a user with value 5 to the 6th element in the user list, and so on...
I'm trying to use:
predictions[["user"]].apply(lambda value: users[value])
but I get an IndexError that I don't understand, because my users list is of size 96:
IndexError: ('index 96 is out of bounds for axis 1 with size 96', 'occurred at index user')
My mistake was in this code:
while not session.should_stop():
    predictions = session.run(decoder_op)
    pred = np.vstack((pred, predictions))
I just replaced the last line with:
np.vstack((pred, predictions))
and it works like a charm with:
predictions['user'] = predictions['user'].map(lambda value: users[value])
predictions['item'] = predictions['item'].map(lambda value: items[value])
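For reference, here is a self-contained sketch of the whole conversion on a tiny made-up matrix. The matrix and the users and items lists are hypothetical. The shape check is an extra guard, on the assumption that the original IndexError came from the matrix having more rows than there were users.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 prediction matrix and label lists (not from the question).
pred = np.array([[5.0, 0.0, 3.0],
                 [0.0, 4.0, 0.0]])
users = [1, 7]
items = [1907, 1028, 608]

# Guard: one row per user and one column per item, otherwise the
# positional lookups below would run out of bounds.
assert pred.shape == (len(users), len(items))

# Wide matrix -> long (row position, column position, rating) table.
predictions = pd.DataFrame(pred).stack().reset_index(name="rating")
predictions.columns = ["user", "item", "rating"]

# Replace the positions with the real ids.
predictions["user"] = predictions["user"].map(lambda pos: users[pos])
predictions["item"] = predictions["item"].map(lambda pos: items[pos])

print(predictions)
```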

How can I plot rows less than 0 as a different color?

I have a dataframe with one column of data. I'd like to visualize the data such that all the bars above the horizontal axis are blue, and those below it are red.
How can I accomplish this?
You can use where to select the values above and below 0 into new columns b and c:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
data = np.random.randn(10)
df = pd.DataFrame({'a':data})
df['b'] = df.a.where(df.a >= 0)
df['c'] = df.a.where(df.a < 0)
print (df)
a b c
0 1.624345 1.624345 NaN
1 -0.611756 NaN -0.611756
2 -0.528172 NaN -0.528172
3 -1.072969 NaN -1.072969
4 0.865408 0.865408 NaN
5 -2.301539 NaN -2.301539
6 1.744812 1.744812 NaN
7 -0.761207 NaN -0.761207
8 0.319039 0.319039 NaN
9 -0.249370 NaN -0.249370
#plot to same figure
ax = df.b.plot.bar(color='b')
df.c.plot.bar(ax=ax, color='r')
plt.show()
Using numpy.where you can get the indices at which the data is below 0, np.where(x < 0), and at or above 0, np.where(x >= 0). This gives you two non-overlapping arrays, which you can plot in different colors.
pandas also has its own equivalent of numpy.where; see this question: pandas equivalent of np.where
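A related variant, sketched below, uses the three-argument form of np.where to build one color per bar and then draws everything with a single bar call. It reuses the seeded random data from the answer above. The file name and the zero line are additions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({"a": np.random.randn(10)})

# np.where(condition, value_if_true, value_if_false) picks a color per row.
colors = np.where(df["a"] >= 0, "blue", "red").tolist()

fig, ax = plt.subplots()
bars = ax.bar(df.index.to_numpy(), df["a"].to_numpy(), color=colors)
ax.axhline(0, color="black", linewidth=0.8)  # horizontal axis for reference
fig.savefig("signed_bars.png")  # hypothetical output file name
```

This avoids creating the helper columns b and c, at the cost of going through matplotlib directly.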