OverflowError: Python int too large to convert to C long- Matplotlib - pandas

I have a pretty simple dataframe, shown below. I am trying to manipulate the x-axis (dates) so that it starts at 1996-31-12 and ends at 2016-31-12, in increments of 365 days.
Date A B
1996-31-12 10 3
1997-31-03 5 6
1997-31-07 7 5
1997-30-11 3 12
1997-31-12 4 10
1998-31-03 5 8
2016-31-12 3 9
#change date string to datetime variable
df12.Date = pd.to_datetime(df12.Date)
fig, ax = plt.subplots()
ax.xaxis.grid(True, which="major")
I get an error message when I try to run the above code, and I am not sure what it means: OverflowError: Python int too large to convert to C long. Does anyone know what this means? If not, is there another way to do what I want to do?
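One thing that may be worth checking first (an assumption, since the full plotting code is not shown): the sample dates are written as year-day-month, so passing an explicit format to pd.to_datetime avoids relying on format inference. A minimal sketch using a few of the sample rows:

```python
import pandas as pd

# A few sample rows from the question; dates are written as year-day-month.
df12 = pd.DataFrame({
    "Date": ["1996-31-12", "1997-31-03", "1997-31-07", "1997-30-11"],
    "A": [10, 5, 7, 3],
    "B": [3, 6, 5, 12],
})

# An explicit format avoids relying on pandas guessing the day/month order.
df12["Date"] = pd.to_datetime(df12["Date"], format="%Y-%d-%m")

# Inspect the parsed months to confirm the day and month were not swapped.
print(df12["Date"].dt.month.tolist())
```

This only addresses the parsing step; whether it removes the OverflowError depends on the plotting calls that are not shown.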


Python altair - facet line plot with multiple variables

I have the following kind of DataFrame
Marque Annee Modele PVFP PM
0 A 1 Python 70783.066836 2.067821e+07
1 A 2 Python 75504.270716 1.957717e+07
2 A 3 Python 66383.237169 1.848982e+07
3 A 4 Python 61966.851675 1.755261e+07
4 A 5 Python 54516.367597 1.671907e+07
5 A 1 Sol 66400.686091 2.067821e+07
6 A 2 Sol 74953.770294 1.955218e+07
7 A 3 Sol 66500.916446 1.844078e+07
8 A 4 Sol 62016.941237 1.748098e+07
9 A 5 Sol 54356.008414 1.662684e+07
10 B 1 Python 43152.461787 1.340989e+07
11 B 2 Python 62397.794144 1.494418e+07
12 B 3 Python 1871.135251 2.178552e+06
I tried to build a facet graph, but without much success. I am only able to vertically concatenate the two charts I generated. I would be grateful for any idea on how to do this properly in one operation.
My current code :
chart = alt.Chart(euro).mark_line().encode(
chart2 = alt.Chart(euro).mark_line().encode(
chart & chart2
One good way to do this is to use a Fold Transform to fold your two columns into one, and then you can use row and column facets to facet by both variables at once. For example:
['PVFP', 'PM'], as_=['key', 'value']
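To see what the fold does to the data, here is a rough pandas analogue using DataFrame.melt (a swapped-in illustration, not Altair itself; the three sample rows are taken from the question). Each original row becomes two rows, one per folded column, with the column name in key and the number in value, which is what lets key act as a facet variable:

```python
import pandas as pd

# Three rows copied from the question's DataFrame.
euro = pd.DataFrame({
    "Marque": ["A", "A", "B"],
    "Annee": [1, 2, 1],
    "Modele": ["Python", "Python", "Python"],
    "PVFP": [70783.066836, 75504.270716, 43152.461787],
    "PM": [2.067821e+07, 1.957717e+07, 1.340989e+07],
})

# Fold the two measure columns into a single (key, value) pair of columns.
long = euro.melt(
    id_vars=["Marque", "Annee", "Modele"],
    value_vars=["PVFP", "PM"],
    var_name="key",
    value_name="value",
)
print(long)
```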

Unexpected groupby result: some rows are missing

I am facing an issue with transforming my data using Pandas' groupby. I have a table (several million rows and 3 variables) that I am trying to group by the "Date" variable.
Snippet from a raw table:
Date V1 V2
07_19_2017_17_00_06 10 5
07_19_2017_17_00_06 20 6
07_19_2017_17_00_08 15 3
01_07_2019_14_06_59 30 1
01_07_2019_14_06_59 40 2
The goal is to group rows with the same value of "Date" by applying a mean function over V1 and sum function over V2. So that the expected result resembles:
Date V1 V2
07_19_2017_17_00_06 15 11 # This row has changed
07_19_2017_17_00_08 15 3
01_07_2019_14_06_59 35 3 # and this one too!
My code:
df = df.groupby(['Date'], as_index=False).agg({'V1': 'mean', 'V2': 'sum'})
The output I am getting, however, is totally unexpected, and I can't find a reasonable explanation for why it happens. It seems like Pandas is only processing data from 01_01_2018_00_00_01 to 12_31_2018_23_58_40, instead of 07_19_2017_17_00_06 to 01_07_2019_14_06_59.
Date V1 V2
01_01_2018_00_00_01 30 3
01_01_2018_00_00_02 20 4
12_31_2018_23_58_35 15 3
12_31_2018_23_58_40 16 11
If you have any clue, I would really appreciate your input. Thank you!
I suspect the issue stems from Pandas not recognizing the date format I used. The solution turned out to be quite simple: convert all of the dates to UNIX time, divide by 60, and then repeat the groupby procedure.
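A related sketch, assuming the strings follow the month_day_year_hour_minute_second pattern seen in the sample: parse them into real datetimes with an explicit format (used here instead of the UNIX-time conversion) and then group. The format string is inferred from the sample rows and is not confirmed anywhere in the post.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    "Date": [
        "07_19_2017_17_00_06",
        "07_19_2017_17_00_06",
        "07_19_2017_17_00_08",
        "01_07_2019_14_06_59",
        "01_07_2019_14_06_59",
    ],
    "V1": [10, 20, 15, 30, 40],
    "V2": [5, 6, 3, 1, 2],
})

# Assumed format: month_day_year_hour_minute_second.
df["Date"] = pd.to_datetime(df["Date"], format="%m_%d_%Y_%H_%M_%S")

# Same aggregation as in the question: mean of V1, sum of V2 per timestamp.
result = df.groupby("Date", as_index=False).agg({"V1": "mean", "V2": "sum"})
print(result)
```

With real datetimes the groups also come out in chronological order, rather than in the string order that puts 01_07_2019 before 07_19_2017.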

Removing every 2nd xtick label only works for the first 6 ticks

I have a simple line graph, but the xticks are overlapping. Therefore I want to display only every 2nd xtick. I implemented this answer which worked for me in another graph. As you can see below it stops working after the 6th tick and I can't wrap my head around why.
My code is the following:
data = pf.cum_perc
fig, ax = plt.subplots(figsize=(10, 6))
ax.set_xlabel("Days", size=18)
ax.set_ylabel("Share of reviews", size = 18)
for label in ax.xaxis.get_ticklabels()[1::2]:
    label.set_visible(False)
pf.cum_perc is a column of a data frame (therefore a series) with the following data:
1 0.037599
2 0.089759
3 0.203477
4 0.302451
5 0.398169
6 0.486392
7 0.533514
8 0.538183
9 0.539411
10 0.550040
11 0.550716
12 0.553050
13 0.553726
14 0.654789
15 0.681084
16 0.706211
17 0.731462
18 0.756712
19 0.781594
20 0.807766
21 0.873687
(and so on)
The resulting graph:
Any help is greatly appreciated :)
As user #ImportanceOfBeingErnest suggested:
Solution 1:
Convert the x-axis data to numbers, so matplotlib takes care of the ticks automatically. In my case this is done by
pf.index = pf.index.map(int)
Solution 2:
Hide the tick labels after the graph is plotted; otherwise the objects don't exist yet and therefore can't be set invisible.
The new code would look like this:
data = pf.cum_perc
fig, ax = plt.subplots(figsize=(10, 6))
ax.set_xlabel("Days", size=18)
ax.set_ylabel("Share of reviews", size = 18)
for label in ax.xaxis.get_ticklabels()[1::2]:
    label.set_visible(False)
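Both solutions can be combined in a small self-contained sketch. The data below is a shortened stand-in for pf.cum_perc with a string index (an assumption about why the ticks were treated as categories), and the Agg backend is chosen only so the example runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Shortened stand-in for pf.cum_perc, indexed by day numbers stored as strings.
cum_perc = pd.Series(
    [0.037599, 0.089759, 0.203477, 0.302451, 0.398169, 0.486392, 0.533514, 0.538183],
    index=[str(day) for day in range(1, 9)],
)

# Solution 1: make the index numeric so matplotlib chooses the ticks itself.
cum_perc.index = cum_perc.index.map(int)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(cum_perc.index, cum_perc.values)
ax.set_xlabel("Days", size=18)
ax.set_ylabel("Share of reviews", size=18)

# Solution 2: draw first so the tick label objects exist, then hide every 2nd one.
fig.canvas.draw()
labels = ax.xaxis.get_ticklabels()
for label in labels[1::2]:
    label.set_visible(False)
```

With a numeric axis the second step is often unnecessary, since the automatic locator already spaces the ticks out.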

Get coherent subsets from pandas series

I'm rather new to pandas and recently ran into a problem. I have a pandas DataFrame that I need to process. I need to extract parts of the DataFrame where specific conditions are met. However, I want these parts to be coherent blocks, not one big set.
Consider the following pandas DataFrame
col1 col2
0 3 11
1 7 15
2 9 1
3 11 2
4 13 2
5 16 16
6 19 17
7 23 13
8 27 4
9 32 3
I want to extract the subframes where the values of col2 >= 10, resulting maybe in a list of DataFrames in the form of (in this case):
col1 col2
0 3 11
1 7 15
col1 col2
5 16 16
6 19 17
7 23 13
Ultimately, I need to do further analysis on the values in col1 within the resulting parts. However, the start and end of each of these blocks are important to me, so simply creating a subset using pandas.DataFrame.loc isn't going to work for me, I think.
What I have tried:
Right now I have a workaround that gets the subset using pandas.DataFrame.loc and then extracts the start and end index of each coherent block afterwards, by iterating through the subset and checking whether there is a jump in the indices. However, it feels rather clumsy, and I feel that I'm missing a basic pandas function here that would make my code more efficient and clean.
This is code representing my current workaround as adapted to the above example
# here the blocks will be collected for further computations
blocks = []
# get all the items where col2 >10 using 'loc[]'
subset = df.loc[df['col2']>10]
block_start = 0
block_end = None
#loop through all items in subset
for i in range(1, len(subset)):
    # if the difference between the current index and the last is greater than 1 ...
    if subset.index[i]-subset.index[i-1] > 1:
        # ... this is the current blocks end
        next_block_start = i
        # extract the according block and add it to the list of all blocks
        block = subset[block_start:next_block_start]
        blocks.append(block)
        #the next_block_start index is now the new block's starting index
        block_start = next_block_start
#close and add last block
blocks.append(subset[block_start:])
Edit: I was by mistake previously referring to 'pandas.DataFrame.where' instead of 'pandas.DataFrame.loc'. I seem to be a bit confused by my recent research.
You can split your problem into parts. First, you check the condition:
df['mask'] = (df['col2']>10)
We use this to see where a new subset starts:
df['new'] = df['mask'].gt(df['mask'].shift(fill_value=False))
Now you can combine this information into a group number. The cumsum generates a step function, which we set to zero (via the mask column) wherever a row does not belong to a group we are interested in.
df['grp'] = (df.new + 0).cumsum() * df['mask']
You don't have to do the group calculation in your df:
s = (df['col2']>10)
s = (s.gt(s.shift(fill_value=False)) + 0).cumsum() * s
After that you can split this into a dict of separate DataFrames
grp = {}
for i in np.unique(s)[1:]:
    grp[i] = df.loc[s == i, ['col1', 'col2']]
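The same idea can also be written with groupby, which avoids the explicit loop over group numbers. This is only a sketch on the example data from the question; split_blocks is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [3, 7, 9, 11, 13, 16, 19, 23, 27, 32],
    "col2": [11, 15, 1, 2, 2, 16, 17, 13, 4, 3],
})


def split_blocks(frame, mask):
    """Return a list of DataFrames, one per run of consecutive True values in mask."""
    # A block starts where the mask is True and the previous row's mask differs.
    starts = mask & (mask != mask.shift(fill_value=False))
    # The running count of starts gives every row the number of its block.
    block_ids = starts.cumsum()
    # Keep only the rows that satisfy the condition and split them by block number.
    return [block for _, block in frame[mask].groupby(block_ids[mask])]


blocks = split_blocks(df, df["col2"] > 10)
for block in blocks:
    # The first and last index labels mark the start and end of each block.
    print(block.index[0], block.index[-1])
```

Because each block keeps its original index, the start and end positions stay available for the further analysis of col1.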

Most efficient way to shift MultiIndex time series

I have a DataFrame that consists of many stacked time series. The index is (poolId, month) where both are integers, the "month" being the number of months since 2000. What's the best way to calculate one-month lagged versions of multiple variables?
Right now, I do something like:
cols_to_shift = ["bal", ...5 more columns...]
df_shift = df[cols_to_shift].groupby(level=0).transform(lambda x: x.shift(-1))
For my data, this took me a full 60 s to run. (I have 48k different pools and a total of 718k rows.)
I'm converting this from R code and the equivalent data.table call:
dt.shift <- dt[, list(bal=myshift(bal), ...), by=list(poolId)]
only takes 9 s to run. (Here "myshift" is something like "function(x) c(x[-1], NA)".)
Is there a way I can get the pandas version to be back in line speed-wise? I tested this on 0.8.1.
Edit: Here's an example of generating a close-enough data set, so you can get some idea of what I mean:
ids = np.arange(48000)
lens = np.maximum(np.round(15+9.5*np.random.randn(48000)), 1.0).astype(int)
id_vec = np.repeat(ids, lens)
lens_shift = np.concatenate(([0], lens[:-1]))
mon_vec = np.arange(lens.sum()) - np.repeat(np.cumsum(lens_shift), lens)
n = len(mon_vec)
df = pd.DataFrame.from_items([('pool', id_vec), ('month', mon_vec)] + [(c, np.random.rand(n)) for c in 'abcde'])
df = df.set_index(['pool', 'month'])
%time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))
That took 64 s when I tried it. This data has every series starting at month 0; really, they should all end at month np.max(lens), with ragged start dates, but good enough.
Edit 2: Here's some comparison R code. This takes 0.8 s. Factor of 80, not good.
ids <- 1:48000
lens <- as.integer(pmax(1, round(rnorm(ids, mean=15, sd=9.5))))
id.vec <- rep(ids, times=lens)
lens.shift <- c(0, lens[-length(lens)])
mon.vec <- (1:sum(lens)) - rep(cumsum(lens.shift), times=lens)
n <- length(id.vec)
dt <- data.table(pool=id.vec, month=mon.vec, a=rnorm(n), b=rnorm(n), c=rnorm(n), d=rnorm(n), e=rnorm(n))
setkey(dt, pool, month)
myshift <- function(x) c(x[-1], NA)
system.time(dt.shift <- dt[, list(month=month, a=myshift(a), b=myshift(b), c=myshift(c), d=myshift(d), e=myshift(e)), by=pool])
I would suggest you reshape the data and do a single shift versus the groupby approach:
result = df.unstack(0).shift(1).stack()
This switches the order of the levels so you'd want to swap and reorder:
result = result.swaplevel(0, 1).sortlevel(0)
You can verify it's been lagged by one period (you want shift(1) instead of shift(-1)):
In [17]: result.ix[1]
a b c d e
1 0.752511 0.600825 0.328796 0.852869 0.306379
2 0.251120 0.871167 0.977606 0.509303 0.809407
3 0.198327 0.587066 0.778885 0.565666 0.172045
4 0.298184 0.853896 0.164485 0.169562 0.923817
5 0.703668 0.852304 0.030534 0.415467 0.663602
6 0.851866 0.629567 0.918303 0.205008 0.970033
7 0.758121 0.066677 0.433014 0.005454 0.338596
8 0.561382 0.968078 0.586736 0.817569 0.842106
9 0.246986 0.829720 0.522371 0.854840 0.887886
10 0.709550 0.591733 0.919168 0.568988 0.849380
11 0.997787 0.084709 0.664845 0.808106 0.872628
12 0.008661 0.449826 0.841896 0.307360 0.092581
13 0.727409 0.791167 0.518371 0.691875 0.095718
14 0.928342 0.247725 0.754204 0.468484 0.663773
15 0.934902 0.692837 0.367644 0.061359 0.381885
16 0.828492 0.026166 0.050765 0.524551 0.296122
17 0.589907 0.775721 0.061765 0.033213 0.793401
18 0.532189 0.678184 0.747391 0.199283 0.349949
In [18]: df.ix[1]
a b c d e
0 0.752511 0.600825 0.328796 0.852869 0.306379
1 0.251120 0.871167 0.977606 0.509303 0.809407
2 0.198327 0.587066 0.778885 0.565666 0.172045
3 0.298184 0.853896 0.164485 0.169562 0.923817
4 0.703668 0.852304 0.030534 0.415467 0.663602
5 0.851866 0.629567 0.918303 0.205008 0.970033
6 0.758121 0.066677 0.433014 0.005454 0.338596
7 0.561382 0.968078 0.586736 0.817569 0.842106
8 0.246986 0.829720 0.522371 0.854840 0.887886
9 0.709550 0.591733 0.919168 0.568988 0.849380
10 0.997787 0.084709 0.664845 0.808106 0.872628
11 0.008661 0.449826 0.841896 0.307360 0.092581
12 0.727409 0.791167 0.518371 0.691875 0.095718
13 0.928342 0.247725 0.754204 0.468484 0.663773
14 0.934902 0.692837 0.367644 0.061359 0.381885
15 0.828492 0.026166 0.050765 0.524551 0.296122
16 0.589907 0.775721 0.061765 0.033213 0.793401
17 0.532189 0.678184 0.747391 0.199283 0.349949
Perf isn't too bad with this method (it might be a touch slower in 0.9.0):
In [19]: %time result = df.unstack(0).shift(1).stack()
CPU times: user 1.46 s, sys: 0.24 s, total: 1.70 s
Wall time: 1.71 s
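On more recent pandas versions, groupby objects have their own shift method, which may be worth trying in place of transform with a lambda, since it avoids calling a Python function once per pool. A small sketch on made-up data (the values are illustrative only, and no timing is claimed here):

```python
import pandas as pd

# Tiny stand-in for the (pool, month) indexed frame from the question.
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)], names=["pool", "month"]
)
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 10.0, 20.0]}, index=index)

# Lag by one month within each pool; the first month of every pool becomes NaN.
lagged = df.groupby(level="pool").shift(1)

# The question's shift(-1) pulls the next month's value back instead.
lead = df.groupby(level="pool").shift(-1)

print(lagged)
print(lead)
```

Unlike the unstack/shift/stack route, this keeps the original row order and index, so no swaplevel or re-sorting step is needed afterwards.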