Pandas Timestamp and Python datetime interpret timezone differently

I don't understand why a isn't the same as b:
import pandas as pd
from datetime import datetime
import pytz
here = pytz.timezone('Europe/Amsterdam')
a = pd.Timestamp('2018-4-9', tz=here).to_pydatetime()
# datetime.datetime(2018, 4, 9, 0, 0, tzinfo=<DstTzInfo 'Europe/Amsterdam' CEST+2:00:00 DST>)
b = datetime(2018, 4, 9, 0, tzinfo=here)
# datetime.datetime(2018, 4, 9, 0, 0, tzinfo=<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>)
print(b-a)
# returns 01:40:00

From this Stack Overflow post I learned that passing a pytz timezone as tzinfo doesn't work well for some timezones, and that this could be the reason for the wrong result.
The pytz documentation says:
Unfortunately using the tzinfo argument of the standard datetime
constructors "does not work" with pytz for many timezones.
The solution is to use localize or astimezone:
import pandas as pd
from datetime import datetime
import pytz
here = pytz.timezone('Europe/Amsterdam')
a = pd.Timestamp('2018-4-9', tz=here).to_pydatetime()
# datetime.datetime(2018, 4, 9, 0, 0, tzinfo=<DstTzInfo 'Europe/Amsterdam' CEST+2:00:00 DST>)
b = here.localize(datetime(2018, 4, 9))
# datetime.datetime(2018, 4, 9, 0, 0, tzinfo=<DstTzInfo 'Europe/Amsterdam' CEST+2:00:00 DST>)
print(b-a)
# returns 00:00:00
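To see where the 01:40:00 comes from, here is a minimal sketch that compares the UTC offset each construction attaches. The names naive, wrong, right and converted are illustrative and not from the original post. The exact LMT offset depends on the installed pytz and tz database version, so it is printed rather than assumed.

```python
from datetime import datetime, timedelta
import pytz

here = pytz.timezone('Europe/Amsterdam')
naive = datetime(2018, 4, 9)

# Passing a pytz zone via tzinfo attaches the zone's first known offset (LMT),
# not the offset that applies on the given date.
wrong = naive.replace(tzinfo=here)

# localize() looks up the offset valid for this wall-clock time (CEST in April).
right = here.localize(naive)

# astimezone() converts an already-aware datetime and also picks the right offset.
converted = datetime(2018, 4, 8, 22, 0, tzinfo=pytz.utc).astimezone(here)

print(wrong.utcoffset())   # LMT offset, depends on the tz database version
print(right.utcoffset())   # 2:00:00
print(converted == right)  # True
```

Both localize and astimezone end up with the CEST offset, so either one avoids the LMT surprise.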

If you look at a and b from the first snippet,
a
datetime.datetime(2018, 4, 9, 0, 0, tzinfo=<DstTzInfo 'Europe/Amsterdam' CEST+2:00:00 DST>)
versus
b
datetime.datetime(2018, 4, 9, 0, 0, tzinfo=<DstTzInfo 'Europe/Amsterdam' LMT+0:20:00 STD>)
the offsets differ: CEST is Central European Summer Time (UTC+2:00)
vs
LMT, which is Local Mean Time (UTC+0:20 here). That 1 hour 40 minute gap is the b-a result.

Related

Efficient negation of subdiagonal values of a 2d numpy array

How would one efficiently negate the subdiagonal entries of a 2d numpy array?
You may use numpy.tril_indices with a diagonal offset of k = -1 to compute the indices of the entries below the main diagonal, and negate them through indexed assignment, for instance:
import numpy as np
a = np.arange(16).reshape(4, 4)
# array([[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7],
#        [ 8,  9, 10, 11],
#        [12, 13, 14, 15]])
idx_low_tri = np.tril_indices(a.shape[0], k=-1)
a[idx_low_tri] = -a[idx_low_tri]
# array([[  0,   1,   2,   3],
#        [ -4,   5,   6,   7],
#        [ -8,  -9,  10,  11],
#        [-12, -13, -14,  15]])
Hope this helps!
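If the original array should stay untouched, or the array is not square, a boolean mask is one alternative. This is only a sketch: the names below_diag and negated are made up for illustration, and np.tri with dtype=bool is used here to build the mask instead of tril_indices.

```python
import numpy as np

a = np.arange(16).reshape(4, 4)

# Boolean mask that is True strictly below the main diagonal (k=-1).
# np.tri takes the row and column counts, so non-square shapes work too.
below_diag = np.tri(*a.shape, k=-1, dtype=bool)

# np.where builds a new array and leaves `a` unchanged.
negated = np.where(below_diag, -a, a)
print(negated)
```

The in-place version from the answer avoids the extra allocation, so which one is preferable depends on whether a copy is wanted.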

Pandas dataframe consider multiple columns to an aggregate function for each group

import pandas as pd
from functools import partial
def maxx(x, y, take_higher):
    """
    :param x: some column in the df
    :param y: some column in the df
    :param take_higher: bool
    :return: if take_higher is True: max(max(x), max(y)), else: min(max(x), max(y))
    """
    pass
df = pd.DataFrame({'cat': [0, 1, 0, 0, 0, 1, 0, 0, 0, 0], 'x': [10, 15, 5, 11, 0, 4.3, 5.1, 8, 10, 12], 'y': [1, 3, 5, 1, 0, 4.3, 1, 0, 2, 2], 'z': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] })
My goal is to apply the maxx function to each group (based on cat). It should take BOTH columns x and y as input. I would like to specify which column names are used as x and y in the function. I would also like to pass the take_higher parameter (for that purpose, I have imported functools.partial so we can wrap the function and bind the parameter). Lastly, I would like to apply the function with both take_higher=True and take_higher=False.
I am trying to do something like:
df.groupby(df.cat).agg(partial(maxx, take_higher=True), partial(maxx, take_higher=False))
but obviously, it does not work. I don't know how to specify which columns should be taken into account. How can I do it?
You can use apply:
def maxx(gdf, take_higher):
    if take_higher:
        return max(max(gdf.x), max(gdf.y))
    else:
        return min(max(gdf.x), max(gdf.y))
df.groupby(df.cat).apply(lambda g: maxx(g, take_higher=False))
# do both aggregations in one call
df.groupby(df.cat).apply(lambda g: pd.Series({'maxx_min': maxx(g, take_higher=False), 'maxx_max': maxx(g, take_higher=True)}))
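Since the question also asks how to choose the columns by name, here is a possible variant that takes the column names as arguments and avoids apply altogether. It is a sketch, not the answerer's method: maxx_by_group and its parameters group_col, x_col and y_col are hypothetical names, and it relies on the fact that both results only need the per-group column maxima.

```python
import pandas as pd

df = pd.DataFrame({
    'cat': [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    'x': [10, 15, 5, 11, 0, 4.3, 5.1, 8, 10, 12],
    'y': [1, 3, 5, 1, 0, 4.3, 1, 0, 2, 2],
})

def maxx_by_group(frame, group_col, x_col, y_col):
    """Per group: the larger and the smaller of max(x) and max(y)."""
    # One row per group, one column per selected input column.
    col_max = frame.groupby(group_col)[[x_col, y_col]].max()
    return pd.DataFrame({
        'maxx_max': col_max.max(axis=1),  # corresponds to take_higher=True
        'maxx_min': col_max.min(axis=1),  # corresponds to take_higher=False
    })

result = maxx_by_group(df, 'cat', 'x', 'y')
print(result)
```

For an arbitrary function of several columns that cannot be decomposed like this, apply as shown in the answer remains the more general tool.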

yaml dump of a pandas dataframe

I figured I'd share, since I searched that on SO and couldn't quite find what I needed.
I wanted to dump a pd.DataFrame into a yaml file.
Timestamps should be shown nicely, not as the default:
date: !!python/object/apply:pandas._libs.tslibs.timestamps.Timestamp
- 1589241600000000000
- null
- null
Also, the output should be valid YAML, i.e., it should be readable back by yaml.load. The output should be reasonably concise, i.e. preferring the 'flow' format.
As an example, here is some data:
import pandas as pd
df = pd.DataFrame([
    dict(
        date=pd.Timestamp.now().normalize() - pd.Timedelta('1 day'),
        x=0,
        b='foo',
        c=[1,2,3,4],
        other_t=pd.Timestamp.now(),
    ),
    dict(
        date=pd.Timestamp.now().normalize(),
        x=1,
        b='bar',
        c=list(range(32)),
        other_t=pd.Timestamp.now(),
    ),
]).set_index('date')
Here is what I came up with. It customizes the Dumper to handle Timestamp. The output is more legible and still valid YAML. Upon loading, yaml recognizes the format of a valid datetime (ISO format, I think) and re-creates those values as datetime. In fact, we can read it back into a DataFrame, where these datetimes are automatically converted into Timestamp. After setting the index again, we observe that the new df is identical to the original.
import yaml
from yaml import CDumper
from yaml.representer import SafeRepresenter
import datetime
class TSDumper(CDumper):
    pass
def timestamp_representer(dumper, data):
    return SafeRepresenter.represent_datetime(dumper, data.to_pydatetime())
TSDumper.add_representer(datetime.datetime, SafeRepresenter.represent_datetime)
TSDumper.add_representer(pd.Timestamp, timestamp_representer)
With this, now we can do:
text = yaml.dump(
    df.reset_index().to_dict(orient='records'),
    sort_keys=False, width=72, indent=4,
    default_flow_style=None, Dumper=TSDumper,
)
print(text)
The output is relatively clean:
-   date: 2020-05-12 00:00:00
    x: 0
    b: foo
    c: [1, 2, 3, 4]
    other_t: 2020-05-13 02:30:23.422589
-   date: 2020-05-13 00:00:00
    x: 1
    b: bar
    c: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
        19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
    other_t: 2020-05-13 02:30:23.422613
Now, we can load this back:
df2 = pd.DataFrame(yaml.load(text, Loader=yaml.SafeLoader)).set_index('date')
And (drum roll, please):
df2.equals(df)
# True
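CDumper is only available when PyYAML was built against libyaml. As a hedged alternative, the same idea can be sketched on top of the pure-Python yaml.SafeDumper. The names TSSafeDumper, represent_timestamp and records are illustrative, and the single-record data is made up for the example.

```python
import datetime

import pandas as pd
import yaml

class TSSafeDumper(yaml.SafeDumper):
    """SafeDumper subclass, so the extra representer stays local to it."""

def represent_timestamp(dumper, data):
    # Reuse the stock datetime representer after converting the Timestamp.
    return dumper.represent_datetime(data.to_pydatetime())

TSSafeDumper.add_representer(pd.Timestamp, represent_timestamp)

records = [{'date': pd.Timestamp('2020-05-12'), 'x': 0, 'c': [1, 2, 3]}]
text = yaml.dump(records, Dumper=TSSafeDumper, sort_keys=False,
                 default_flow_style=None)
print(text)

# safe_load turns the ISO-like string back into a datetime.datetime.
loaded = yaml.safe_load(text)
print(type(loaded[0]['date']))
```

Registering the representer on a subclass rather than on SafeDumper itself keeps other yaml.dump calls in the same process unaffected.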

How can pd.cut return a number as group?

Example
pd.cut(df['a'],[0,2,4,10,np.inf],right=False)
It returns [0,2),[2,4),[4,10),[10,np.inf).
But how can I get [0],(0,2),[2,4),[4,10),[10,np.inf)?
If all values are non-negative integers, this could work, because for integers [-inf,1) contains only 0 and [1,2) is the same as (0,2):
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 3, 5, 7, 9, 11, 13]})
pd.cut(df['a'], [-np.inf, 1, 2, 4, 10, np.inf], right=False)
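To make the result read like the intervals asked for, the labels argument of pd.cut can rename the bins. The following is a small sketch under the same non-negative-integer assumption; the sample values and the label strings are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 3, 5, 11])
bins = [-np.inf, 1, 2, 4, 10, np.inf]
# One label per bin, in the same order as the bin edges.
labels = ['[0]', '(0,2)', '[2,4)', '[4,10)', '[10,inf)']

groups = pd.cut(s, bins, right=False, labels=labels)
print(groups.tolist())
```

The labels only change how the bins are displayed. A non-integer value such as 0.5 would still fall into the first bin, so this does not generalize to floats.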

How should width be set for a bar in matplotlib?

I'm using Python 2. The following code uses some example data; my actual data can vary in length and might not be at one-minute intervals.
import numpy as np
import datetime
import matplotlib
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
x_values = [datetime.datetime(2018, 11, 8, 11, 16),
datetime.datetime(2018, 11, 8, 11, 17),
datetime.datetime(2018, 11, 8, 11, 18),
datetime.datetime(2018, 11, 8, 11, 19),
datetime.datetime(2018, 11, 8, 11, 20),
datetime.datetime(2018, 11, 8, 11, 21),
datetime.datetime(2018, 11, 8, 11, 22),
datetime.datetime(2018, 11, 8, 11, 23),
datetime.datetime(2018, 11, 8, 11, 24),
datetime.datetime(2018, 11, 8, 11, 25),
datetime.datetime(2018, 11, 8, 11, 26),
datetime.datetime(2018, 11, 8, 11, 27),
datetime.datetime(2018, 11, 8, 11, 28),
datetime.datetime(2018, 11, 8, 11, 29),
datetime.datetime(2018, 11, 8, 11, 30),
datetime.datetime(2018, 11, 8, 11, 31)]
y_values = [1392.1017964071857,
1392.2814371257484,
1392.37125748503,
1227.6802721088436,
1083.1,
1317.0461538461539,
1393.059880239521,
1393.4011976047905,
1393.491017964072,
1393.8502994011976,
1318.3461538461538,
1229.4965986394557,
1394.2095808383233,
1394.3892215568862,
1394.6586826347304,
1394.688622754491]
rects1 = ax.bar(x_values, y_values)
fig.tight_layout()
plt.show()
How am I supposed to set the width of the bars automatically? As it is I get the following:
If I set the width to 0.0006 then it looks good for the example data, from which I've worked out that matplotlib measures the x axis in days (0.0007 days is almost exactly 1 minute, which matches my time intervals, and 0.0006 leaves gaps between bars). But that's no good if I get hourly values, or seconds, or weeks, etc. Surely there's an option for handling this automatically?
If you want the bar width to be no larger than the difference between any successive datetimes, you can calculate that number and supply it to the bar's width argument.
import matplotlib.dates as mdates
width = np.min(np.diff(mdates.date2num(x_values)))
ax.bar(x_values, y_values, width=width, ec="k")
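The same calculation can be wrapped in a small helper that also copes with unsorted input, duplicate timestamps and a single data point. This is a sketch only: auto_bar_width and its fill and default parameters are hypothetical, not part of matplotlib, and the sample data is shortened to five points.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np

def auto_bar_width(x_values, fill=0.9, default=1.0):
    """Return a bar width in days: `fill` times the smallest positive gap."""
    nums = np.sort(mdates.date2num(x_values))
    gaps = np.diff(nums)
    gaps = gaps[gaps > 0]  # ignore duplicate timestamps
    if gaps.size == 0:
        return default  # fewer than two distinct points: fall back
    return fill * gaps.min()

start = datetime.datetime(2018, 11, 8, 11, 16)
x_values = [start + datetime.timedelta(minutes=i) for i in range(5)]
y_values = [3, 1, 4, 1, 5]

width = auto_bar_width(x_values)
fig, ax = plt.subplots()
bars = ax.bar(x_values, y_values, width=width, ec="k")
fig.tight_layout()
```

A fill below 1 leaves a small gap between neighbouring bars, similar to the 0.0006 versus 0.0007 observation in the question.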