Stacked bar chart with ggplotly isn't shown properly - ggplotly

One category in my stacked bar chart doesn't have any color, although it has the right color in the legend. Does anyone have an idea how this can be solved? It concerns "date" = 2020-11-7 and "Altersklassen_num" = 1.
In ggplot (the first step) the graph looks perfectly fine. Also when I remove dynamicTicks (but then the x-scale gets very messy).
My data
structure(list(date = structure(c(18567, 18568, 18569,
18569, 18570, 18571, 18571, 18571, 18572, 18572, 18572, 18573,
18573, 18574, 18574, 18575, 18575, 18575, 18576, 18576, 18577,
18578, 18578, 18579, 18580, 18580, 18581, 18581, 18582, 18582,
18583, 18584, 18584, 18585, 18585, 18586, 18586, 18587, 18587,
18588, 18588, 18589, 18589, 18589, 18590, 18591, 18592, 18592,
18593, 18593, 18594, 18594, 18595, 18595, 18596, 18596, 18596,
18597, 18598, 18598, 18599, 18599, 18600, 18600, 18601, 18601,
18601, 18602, 18602, 18603, 18603, 18603), class = "Date"), Altersklassen_num = structure(c(2L,
1L, 1L, 2L, 1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 1L, 2L,
3L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 3L, 1L, 2L, 1L, 2L, 4L), .Label = c("4", "3", "2", "1"), class = "factor"),
Altersklasse1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1), Altersklasse2 = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 0,
0, 0, 0, 0), Altersklasse3 = c(2, 0, 4, 4, 0, 3, 3, 3, 3,
3, 3, 1, 1, 4, 4, 1, 1, 1, 3, 3, 0, 2, 2, 0, 3, 3, 2, 2,
3, 3, 0, 2, 2, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 0, 0, 2,
2, 4, 4, 2, 2, 4, 4, 3, 3, 3, 0, 2, 2, 3, 3, 1, 1, 1, 1,
1, 1, 1, 3, 3, 3), Altersklasse4 = c(0, 3, 7, 7, 6, 5, 5,
5, 7, 7, 7, 3, 3, 11, 11, 9, 9, 9, 5, 5, 6, 2, 2, 5, 3, 3,
2, 2, 4, 4, 2, 3, 3, 4, 4, 2, 2, 3, 3, 5, 5, 3, 3, 3, 3,
3, 2, 2, 7, 7, 3, 3, 7, 7, 4, 4, 4, 7, 8, 8, 5, 5, 4, 4,
6, 6, 6, 1, 1, 3, 3, 3), Anz = c(2, 3, 7, 4, 6, 5, 3, 1,
7, 3, 1, 3, 1, 11, 4, 9, 1, 2, 5, 3, 6, 2, 2, 5, 3, 3, 2,
2, 4, 3, 2, 3, 2, 4, 1, 2, 2, 3, 2, 5, 3, 3, 3, 1, 3, 3,
2, 2, 7, 4, 3, 2, 7, 4, 4, 3, 1, 7, 8, 2, 5, 3, 4, 1, 6,
1, 2, 1, 1, 3, 3, 1)), row.names = c(NA, -72L), class = c("tbl_df",
"tbl", "data.frame"))
My code for the plot:
g.barchart<- ggplot(d.data_agg, aes(x= as.Date(date, "%Y-%m-%d"), y=Anz, fill=Altersklassen_num, text=paste("Datum: ", format(as.Date(date),"%d.%m.%Y"), "<br><br>0- bis 49-Jährige: ", Altersklasse1,"<br>50- bis 64-Jährige: ", Altersklasse2,"<br>65- bis 79-Jährige: ", Altersklasse3,"<br>über 80-Jährige: ", Altersklasse4))) +
geom_bar(position="stack",stat="identity",width = 1.0)
g.barchart
fig <- ggplotly(g.barchart,tooltip = c("text"), dynamicTicks = TRUE) %>%
style(hoverinfo = c("text")) %>%
config(displayModeBar = T, locale = "de") %>%
layout(hoverlabel = list(align = "left", bgcolor = "#E5E8E8", font = list(size=14)),
legend = list(orientation = "h", x = 0.15, y = -0.15),
xaxis = list(showgrid = F, ticklen = 10,
tickval = NULL,
ticktext = NULL,
ticks = "outside",
tickmode = "array",
tickformat = '%e.%b',
ticktext = labels,
tickangle = 0,
title = "",
type = "date",
autorange = FALSE,
range = c(as.numeric(as.POSIXct("2020-10-30", format="%Y-%m-%d"))*1000,
as.numeric(as.POSIXct(k.DownDat, format="%Y-%m-%d"))*1000)),
yaxis = list(autorange = FALSE))
fig

Try to complete the dataframe and then run ggplot. After that if you run ggplotly, you will get the missing stacked bar.
d.data_agg <- d.data_agg %>% complete(date, nesting(Altersklassen_num), fill = list(Anz=0))

Related

ax=b solving with numpy llinalg gives "Last 2 dimensions of the array must be square"

I have 3 matrices of shape (157, 4) , (157,) and (4,) respectively. The first matrix (an_array) contains the quantity of the items. the second, (Items matrix) contains the items names and the 3rd (Price matrix) one contains the price. I am trying to workout the price of individual items but I get an error saying "Last 2 dimensions of the array must be square" everytime I am trying to do
X = np.linalg.solve(an_array,price)
I am trying to solve ax=b but unfortunately its not working out. Any help would be very much appreciated.
**Price matrix**
array([499.25, 381. , 59.5 , 290. , 128.5 , 305.25, 336.25, 268.5 ,
395.25, 136.5 , 194.5 , 498.75, 62.25, 312.75, 332. , 171. ,
402.5 , 144.5 , 261.5 , 242.75, 381. , 371. , 355.5 , 373. ,
65.5 , 228.75, 208.75, 204.5 , 86.5 , 143. , 70.5 , 36.5 ,
82. , 302.5 , 365.5 , 158.5 , 316.5 , 508. , 86.5 , 359.75,
25.5 , 345.5 , 304.5 , 491.25, 181.5 , 343.75, 383.5 , 283.5 ,
140.25, 426. , 386. , 337.25, 415.5 , 268.25, 406. , 149.5 ,
200. , 122. , 510.25, 280. , 406.75, 191.25, 198. , 114.5 ,
211.5 , 241.75, 195.75, 276.25, 276. , 165.25, 102. , 425. ,
195. , 132.25, 86.75, 446.5 , 318. , 290.75, 286. , 232. ,
520.5 , 382.75, 94. , 482.75, 233.25, 262. , 368.25, 438.75,
433.5 , 334.5 , 360. , 422. , 191. , 292.25, 151.75, 440.25,
370. , 105.25, 122. , 455.5 , 363. , 436. , 147.5 , 548.5 ,
365.75, 185.5 , 348. , 342.5 , 509.25, 465.5 , 380.25, 361. ,
271.25, 414.25, 366.75, 145.5 , 348. , 471.25, 254.5 , 329. ,
441. , 253.25, 448.5 , 142. , 312.5 , 350. , 94. , 333. ,
418. , 194.5 , 543. , 212.5 , 66.5 , 370. , 423. , 164. ,
393.25, 299.75, 529.5 , 166.25, 228.5 , 476. , 373. , 383.25,
409. , 241. , 107.75, 194.5 , 350. , 221.75, 633.25, 444.25,
155.25, 76. , 542. , 346. , 159.75])
**Item matrix**
array(['Cereal', 'Coffee', 'Eggs', 'Pancake'], dtype='<U7')
#an_array matrix
array([[ 7, 6, 6, 9],
[ 6, 7, 0, 10],
[ 0, 7, 1, 0],
[ 4, 0, 10, 0],
[ 0, 0, 4, 2],
[10, 7, 0, 3],
[ 9, 1, 4, 3],
[ 9, 8, 0, 2],
[10, 0, 4, 5],
[ 6, 3, 0, 0],
[ 0, 1, 9, 0],
[ 3, 9, 9, 9],
[ 2, 0, 0, 1],
[ 6, 0, 6, 3],
[ 9, 0, 3, 4],
[ 8, 2, 0, 0],
[ 9, 0, 0, 10],
[ 5, 0, 0, 2],
[ 8, 7, 3, 0],
[ 0, 1, 6, 5],
[ 7, 0, 3, 8],
[ 0, 5, 10, 6],
[ 7, 1, 10, 0],
[ 4, 7, 10, 2],
[ 0, 0, 1, 2],
[ 2, 6, 0, 7],
[ 6, 4, 0, 3],
[ 8, 0, 0, 2],
[ 0, 0, 2, 2],
[ 3, 7, 0, 2],
[ 0, 9, 1, 0],
[ 1, 3, 0, 0],
[ 3, 4, 0, 0],
[ 4, 0, 0, 10],
[ 4, 9, 7, 4],
[ 6, 7, 0, 0],
[ 6, 9, 7, 0],
[ 6, 0, 10, 8],
[ 0, 0, 2, 2],
[ 6, 0, 4, 7],
[ 1, 1, 0, 0],
[ 3, 9, 7, 4],
[ 0, 1, 10, 4],
[ 6, 1, 10, 7],
[ 0, 2, 6, 2],
[ 4, 1, 7, 5],
[ 0, 3, 9, 8],
[ 3, 2, 8, 2],
[ 0, 10, 3, 1],
[ 4, 0, 8, 8],
[ 2, 0, 8, 8],
[ 7, 8, 2, 5],
[ 7, 2, 2, 10],
[ 0, 9, 3, 7],
[ 4, 4, 6, 8],
[ 5, 9, 0, 0],
[ 0, 2, 9, 0],
[ 4, 0, 2, 0],
[10, 9, 5, 7],
[ 4, 4, 0, 8],
[ 9, 10, 5, 3],
[ 4, 0, 0, 5],
[ 1, 0, 0, 8],
[ 1, 1, 0, 4],
[ 4, 1, 6, 0],
[ 0, 8, 2, 7],
[ 1, 5, 6, 1],
[ 4, 4, 3, 5],
[ 9, 6, 3, 0],
[ 2, 3, 2, 3],
[ 4, 4, 0, 0],
[ 8, 10, 10, 0],
[ 6, 6, 2, 0],
[ 0, 0, 1, 5],
[ 1, 0, 0, 3],
[ 7, 0, 4, 10],
[ 4, 8, 5, 4],
[ 9, 8, 0, 3],
[ 4, 6, 4, 4],
[10, 2, 1, 0],
[10, 3, 6, 8],
[ 6, 8, 3, 7],
[ 0, 9, 0, 2],
[ 8, 7, 4, 9],
[ 0, 6, 0, 9],
[ 0, 0, 4, 8],
[ 0, 0, 8, 9],
[ 8, 8, 8, 3],
[ 3, 5, 8, 8],
[ 7, 3, 0, 8],
[ 9, 6, 7, 0],
[ 8, 0, 4, 8],
[ 9, 2, 0, 0],
[ 6, 3, 0, 7],
[ 0, 4, 3, 3],
[10, 1, 8, 3],
[ 5, 10, 6, 4],
[ 0, 7, 0, 3],
[ 4, 0, 2, 0],
[10, 6, 0, 10],
[ 4, 0, 5, 8],
[10, 0, 7, 4],
[ 1, 7, 0, 4],
[10, 0, 6, 10],
[ 8, 10, 4, 3],
[ 9, 1, 0, 0],
[ 9, 0, 8, 0],
[ 6, 0, 0, 10],
[ 9, 1, 8, 7],
[ 1, 10, 8, 10],
[ 0, 6, 7, 9],
[10, 5, 0, 6],
[ 0, 10, 5, 5],
[ 8, 6, 1, 9],
[ 0, 4, 9, 7],
[ 4, 0, 1, 2],
[ 3, 9, 5, 6],
[10, 10, 5, 5],
[ 5, 0, 1, 6],
[ 9, 8, 5, 0],
[ 8, 3, 2, 10],
[ 0, 2, 2, 9],
[ 5, 9, 10, 4],
[ 6, 4, 0, 0],
[ 8, 1, 7, 0],
[ 6, 7, 7, 2],
[ 0, 9, 0, 2],
[10, 8, 0, 4],
[10, 1, 8, 2],
[ 0, 3, 0, 8],
[10, 8, 10, 4],
[ 0, 0, 8, 2],
[ 0, 4, 0, 2],
[ 8, 0, 10, 0],
[ 7, 0, 5, 8],
[ 6, 8, 0, 0],
[ 8, 6, 0, 9],
[10, 6, 0, 3],
[ 9, 4, 5, 10],
[ 0, 10, 0, 5],
[ 6, 4, 2, 2],
[ 9, 10, 3, 8],
[ 8, 3, 3, 6],
[ 6, 0, 3, 9],
[ 3, 1, 10, 6],
[ 0, 0, 3, 8],
[ 4, 1, 0, 1],
[ 5, 1, 0, 4],
[ 7, 0, 10, 0],
[ 5, 10, 0, 3],
[10, 8, 9, 9],
[10, 6, 9, 1],
[ 0, 8, 0, 5],
[ 0, 10, 1, 0],
[ 8, 7, 10, 6],
[ 0, 0, 8, 8],
[ 0, 5, 1, 5]])
you have 4 variables that are present in 157 equations, and you need to solve for the 4 variables, this system has no solution and is not solved by a linear system solution, rather as a least squares solution using np.linalg.lstsq
X = np.linalg.lstsq(an_array, price)[0]
print(X)
Using the lstsqr as recommended by #Ahmed:
In [88]: X = np.linalg.lstsq(quant, price)[0]
C:\Users\paul\AppData\Local\Temp\ipykernel_6428\995094510.py:1: FutureWarning: `rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to keep using the old, explicitly pass `rcond=-1`.
X = np.linalg.lstsq(quant, price)[0]
In [90]: X
Out[90]: array([20. , 5.5 , 21. , 22.25])
Multiplying the quantities by these prices and summing gives the same numbers as your price array:
In [91]: quant#X
Out[91]:
array([499.25, 381. , 59.5 , 290. , 128.5 , 305.25, 336.25, 268.5 ,
395.25, 136.5 , 194.5 , 498.75, 62.25, 312.75, 332. , 171. ,
... , 159.75])
So there's no noise in these prices - the values are exact. That means you could use solve with any 4 of the price and quant values. That addresses the "must be square" error. quant[:4,:] has (4,4) shape.
In [92]: np.linalg.solve(quant[:4,:], price[:4])
Out[92]: array([20. , 5.5 , 21. , 22.25])

Bar of proportion of two variables

I am having a pandas dataframe as shown below
import numpy as np
data = {
'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50],
'baseline': [1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1],
'endline': [1, 0, np.nan, 1, 0, 0, 1, np.nan, 1, 0, 0, 1, 0, 0, 1, 0, np.nan, np.nan, 1, 0, 1, np.nan, 0, 1, 0, 1, 0, np.nan, 1, 0, np.nan, 0, 0, 0, np.nan, 1, np.nan, 1, np.nan, 0, np.nan, 1, 1, 0, 1, 1, 1, 0, 1, 1],
'gender': ['male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'male', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female', 'female']
}
df = pd.DataFrame(data)
df.head(n = 5)
The challenge is the endline column may have some missing values. My goal is to have 2 bars for each variable side by side as shown below.
Thanks in advance!
Seaborn prefers its data in "long form". Pandas' melt can convert the given dataframe to combine the 'baseline' and 'endline' columns.
By default, sns.barplot shows the mean when there are multiple y-values belonging to the same x-value. You can use a different estimator, e.g. summing the values and dividing by the number of values to get a percentage.
Here is some code to get you started:
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns
import pandas as pd
import numpy as np
data = {
'id': range(1, 51),
'baseline': [1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1],
'endline': [1, 0, np.nan, 1, 0, 0, 1, np.nan, 1, 0, 0, 1, 0, 0, 1, 0, np.nan, np.nan, 1, 0, 1, np.nan, 0, 1, 0, 1, 0, np.nan, 1, 0, np.nan, 0, 0, 0, np.nan, 1, np.nan, 1, np.nan, 0, np.nan, 1, 1, 0, 1, 1, 1, 0, 1, 1]
}
df = pd.DataFrame(data)
sns.set_style('white')
ax = sns.barplot(data=df.melt(value_vars=['baseline', 'endline']),
x='variable', y='value',
estimator=lambda x: np.sum(x) / np.size(x) * 100, ci=None,
color='cornflowerblue')
ax.bar_label(ax.containers[0], fmt='%.1f %%', fontsize=20)
sns.despine(ax=ax, left=True)
ax.grid(True, axis='y')
ax.yaxis.set_major_formatter(PercentFormatter(100))
ax.set_xlabel('')
ax.set_ylabel('')
plt.tight_layout()
plt.show()

Counting zeros in a rolling - numpy array (including NaNs)

I am trying to find a way of Counting zeros in a rolling using numpy array ?
Using pandas I can get it using:
df['demand'].apply(lambda x: (x == 0).rolling(7).sum()).fillna(0))
or
df['demand'].transform(lambda x: x.rolling(7).apply(lambda x: 7 - np.count _nonzero(x))).fillna(0)
In numpy, using the code from Here
def rolling_window(a, window_size):
shape = (a.shape[0] - window_size + 1, window_size) + a.shape[1:]
print(shape)
strides = (a.strides[0],) + a.strides
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
arr = np.asarray([10, 20, 30, 5, 6, 0, 0, 0])
np.count_nonzero(rolling_window(arr==0, 7), axis=1)
Output:
array([2, 3])
However, I need the first 6 NaNs as well, and fill it with zeros:
Expected output:
array([0, 0, 0, 0, 0, 0, 2, 3])
Think an efficient one would be with 1D convolution -
def sum_occurences_windowed(arr, W):
K = np.ones(W, dtype=int)
out = np.convolve(arr==0,K)[:len(arr)]
out[:W-1] = 0
return out
Sample run -
In [42]: arr
Out[42]: array([10, 20, 30, 5, 6, 0, 0, 0])
In [43]: sum_occurences_windowed(arr,W=7)
Out[43]: array([0, 0, 0, 0, 0, 0, 2, 3])
Timings on varying length arrays and window of 7
Including count_rolling from #Quang Hoang's post.
Using benchit package (few benchmarking tools packaged together; disclaimer: I am its author) to benchmark proposed solutions.
import benchit
funcs = [sum_occurences_windowed, count_rolling]
in_ = {n:(np.random.randint(0,5,(n)),7) for n in [10,20,50,100,200,500,1000,2000,5000]}
t = benchit.timings(funcs, in_, multivar=True, input_name='Length')
t.plot(logx=True, save='timings.png')
Extending to generic n-dim arrays
from scipy.ndimage.filters import convolve1d
def sum_occurences_windowed_ndim(arr, W, axis=-1):
K = np.ones(W, dtype=int)
out = convolve1d((arr==0).astype(int),K,axis=axis,origin=-(W//2))
out.swapaxes(axis,0)[:W-1] = 0
return out
So, on a 2D array, for counting along each row, use axis=1 and for cols, axis=0 and so on.
Sample run -
In [155]: np.random.seed(0)
In [156]: a = np.random.randint(0,3,(3,10))
In [157]: a
Out[157]:
array([[0, 1, 0, 1, 1, 2, 0, 2, 0, 0],
[0, 2, 1, 2, 2, 0, 1, 1, 1, 1],
[0, 1, 0, 0, 1, 2, 0, 2, 0, 1]])
In [158]: sum_occurences_windowed_ndim(a, W=7)
Out[158]:
array([[0, 0, 0, 0, 0, 0, 3, 2, 3, 3],
[0, 0, 0, 0, 0, 0, 2, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 4, 3, 4, 3]])
# Verify with earlier 1D solution
In [159]: np.vstack([sum_occurences_windowed(i,7) for i in a])
Out[159]:
array([[0, 0, 0, 0, 0, 0, 3, 2, 3, 3],
[0, 0, 0, 0, 0, 0, 2, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 4, 3, 4, 3]])
Let's test out our original 1D input array -
In [187]: arr
Out[187]: array([10, 20, 30, 5, 6, 0, 0, 0])
In [188]: sum_occurences_windowed_ndim(arr, W=7)
Out[188]: array([0, 0, 0, 0, 0, 0, 2, 3])
I would modify the function as follow:
def count_rolling(a, window_size):
shape = (a.shape[0] - window_size + 1, window_size) + a.shape[1:]
strides = (a.strides[0],) + a.strides
rolling = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
out = np.zeros_like(a)
out[window_size-1:] = (rolling == 0).sum(1)
return out
arr = np.asarray([10, 20, 30, 5, 6, 0, 0, 0])
count_rolling(arr,7)
Output:
array([0, 0, 0, 0, 0, 0, 2, 3])

Vectorize this for loop in numpy

I am trying to compute matrix z (defined below) in python with numpy.
Here's my current solution (using 1 for loop)
z = np.zeros((n, k))
for i in range(n):
v = pi * (1 / math.factorial(x[i])) * np.exp(-1 * lamb) * (lamb ** x[i])
numerator = np.sum(v)
c = v / numerator
z[i, :] = c
return z
Is it possible to completely vectorize this computation? I need to do this computation for thousands of iterations, and matrix operations in numpy is much faster than huge for loops.
Here is a vectorized version of E. It replaces the for-loop and scalar arithmetic with NumPy broadcasting and array-based arithmetic:
def alt_E(x):
x = x[:, None]
z = pi * (np.exp(-lamb) * (lamb**x)) / special.factorial(x)
denom = z.sum(axis=1)[:, None]
z /= denom
return z
I ran em.py to get a sense for the typical size of x, lamb, pi, n and k. On data of this size,
alt_E is about 120x faster than E:
In [32]: %timeit E(x)
100 loops, best of 3: 11.5 ms per loop
In [33]: %timeit alt_E(x)
10000 loops, best of 3: 94.7 µs per loop
In [34]: 11500/94.7
Out[34]: 121.43611404435057
This is the setup I used for the benchmark:
import math
import numpy as np
import scipy.special as special
def alt_E(x):
x = x[:, None]
z = pi * (np.exp(-lamb) * (lamb**x)) / special.factorial(x)
denom = z.sum(axis=1)[:, None]
z /= denom
return z
def E(x):
z = np.zeros((n, k))
for i in range(n):
v = pi * (1 / math.factorial(x[i])) * \
np.exp(-1 * lamb) * (lamb ** x[i])
numerator = np.sum(v)
c = v / numerator
z[i, :] = c
return z
n = 576
k = 2
x = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5])
lamb = np.array([ 0.84835141, 1.04025989])
pi = np.array([ 0.5806958, 0.4193042])
assert np.allclose(alt_E(x), E(x))
By the way, E could also be calculated using scipy.stats.poisson:
import scipy.stats as stats
pois = stats.poisson(mu=lamb)
def alt_E2(x):
z = pi * pois.pmf(x[:,None])
denom = z.sum(axis=1)[:, None]
z /= denom
return z
but this does not turn out to be faster, at least for arrays of this length:
In [33]: %timeit alt_E(x)
10000 loops, best of 3: 94.7 µs per loop
In [102]: %timeit alt_E2(x)
1000 loops, best of 3: 278 µs per loop
For larger x, alt_E2 is faster:
In [104]: x = np.random.random(10000)
In [106]: %timeit alt_E(x)
100 loops, best of 3: 2.18 ms per loop
In [105]: %timeit alt_E2(x)
1000 loops, best of 3: 643 µs per loop

Convert string to integer pandas dataframe index

I have a pandas dataframe with a multiindex. Unfortunately one of the indices gives years as a string
e.g. '2010', '2011'
how do I convert these to integers?
More concretely
MultiIndex(levels=[[u'2010', u'2011'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, , ...]], names=[u'Year', u'Month'])
.
df_cbs_prelim_total.index.set_levels(df_cbs_prelim_total.index.get_level_values(0).astype('int'))
seems to do it, but not inplace. Any proper way of changing them?
Cheers,
Mike
Will probably be cleaner to do this before you assign it as index (as #EdChum points out), but when you already have it as index, you can indeed use set_levels to alter one of the labels of a level of your multi-index. A bit cleaner as your code (you can use index.levels[..]):
In [165]: idx = pd.MultiIndex.from_product([[1,2,3], ['2011','2012','2013']])
In [166]: idx
Out[166]:
MultiIndex(levels=[[1, 2, 3], [u'2011', u'2012', u'2013']],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])
In [167]: idx.levels[1]
Out[167]: Index([u'2011', u'2012', u'2013'], dtype='object')
In [168]: idx = idx.set_levels(idx.levels[1].astype(int), level=1)
In [169]: idx
Out[169]:
MultiIndex(levels=[[1, 2, 3], [2011, 2012, 2013]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])
You have to reassign it to save the changes (as is done above, in your case this would be df_cbs_prelim_total.index = df_cbs_prelim_total.index.set_levels(...))