Pandas timeseries indexing fails when the index is hierarchical - pandas

I tried the following code snippet.
In [84]:
from datetime import datetime

import pandas as pd
from pandas import Series

rng = [datetime(2017,1,13), datetime(2017,1,14), datetime(2017,2,15), datetime(2017,2,16)]

s = Series([1,2,3,4], index=rng)
s['2017/1']
Out[84]:
2017-01-13    1
2017-01-14    2
dtype: int64
As I expected, I could retrieve only the items belonging to January by specifying the date only down to the month, as in s['2017/1'].
Next, I tried a slightly extended version of the above code, where a hierarchical index was used instead:
from datetime import datetime
import pandas as pd

rng1 = [datetime(2017,1,1), datetime(2017,1,1), datetime(2017,2,1), datetime(2017,2,1)]
rng2 = [datetime(2017,1,13), datetime(2017,1,14), datetime(2017,2,15), datetime(2017,2,16)]
midx = pd.MultiIndex.from_arrays([rng1, rng2])
s = pd.Series([1,2,3,4], index=midx)
s['2017/1']
The above code snippet, however, generates an error:
TypeError: unorderable types: int() > slice()
Would you give me some help?

It seems this is more complicated.
Partial string indexing on a DatetimeIndex that is part of a MultiIndex was implemented for DataFrame in pandas 0.18.
So if you use:
rng1 = [pd.Timestamp(2017,5,1), pd.Timestamp(2017,5,1),
        pd.Timestamp(2017,6,1), pd.Timestamp(2017,6,1)]
rng2 = (pd.date_range('2017-01-13', periods=2).tolist() +
        pd.date_range('2017-02-15', periods=2).tolist())
s = pd.Series([1,2,3,4], index=[rng1, rng2])
print (s)
2017-05-01  2017-01-13    1
            2017-01-14    2
2017-06-01  2017-02-15    3
            2017-02-16    4
dtype: int64
Then this works for me:
print (s.to_frame().loc[pd.IndexSlice[:, '2017/1'], :].squeeze())
2017-05-01  2017-01-13    1
            2017-01-14    2
Name: 0, dtype: int64
print (s.loc['2017/6'])
2017-06-01  2017-02-15    3
            2017-02-16    4
dtype: int64
But this returns an empty Series:
print (s.loc[pd.IndexSlice[:, '2017/2']])
Series([], dtype: int64)
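If partial string indexing on the inner level comes up empty like this, one workaround (a sketch of my own, not from the answer above) is to build a boolean mask from the level's values instead of relying on the slicing machinery:
import pandas as pd

# Rebuild the example Series with a two-level datetime MultiIndex.
rng1 = [pd.Timestamp(2017, 5, 1)] * 2 + [pd.Timestamp(2017, 6, 1)] * 2
rng2 = (pd.date_range('2017-01-13', periods=2).tolist() +
        pd.date_range('2017-02-15', periods=2).tolist())
s = pd.Series([1, 2, 3, 4], index=[rng1, rng2])

# Mask over the second index level: select the February 2017 rows.
level1 = s.index.get_level_values(1)
print(s[(level1.year == 2017) & (level1.month == 2)])
This avoids partial string indexing entirely, at the cost of spelling out the year/month comparison by hand.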

Related

How to convert pandas float64 type to NUMERIC BigQuery type?

I have a pandas dataframe df (showing df.head()):
                DAT_RUN          DAT_FORECAST LIB_SOURCE  MES_LONGITUDE  MES_LATITUDE  MES_TEMPERATURE  MES_HUMIDITE  MES_PLUIE  MES_VITESSE_VENT  MES_U_WIND  MES_V_WIND
0  2022-03-29T00:00:00Z  2022-03-29T01:00:00Z    gfs_025          43.50          3.75        11.994824          72.0        0.0          2.653137   -2.402910   -1.124792
1  2022-03-29T00:00:00Z  2022-03-29T01:00:00Z    gfs_025          43.50          4.00        13.094824          74.3        0.0          2.976434   -2.972910   -0.144792
2  2022-03-29T00:00:00Z  2022-03-29T01:00:00Z    gfs_025          43.50          4.25        12.594824          75.3        0.0          3.128418   -2.702910    1.575208
3  2022-03-29T00:00:00Z  2022-03-29T01:00:00Z    gfs_025          43.50          4.50        12.094824          75.5        0.0          3.183418   -2.342910    2.155208
I convert the DAT_RUN and DAT_FORECAST columns to datetime format:
df["DAT_RUN"] = pd.to_datetime(df['DAT_RUN'], format="%Y-%m-%dT%H:%M:%SZ") # previously "%Y-%m-%d %H:%M:%S"
df["DAT_FORECAST"] = pd.to_datetime(df['DAT_FORECAST'], format="%Y-%m-%dT%H:%M:%SZ")
df.dtypes:
DAT_RUN datetime64[ns]
DAT_FORECAST datetime64[ns]
LIB_SOURCE object
MES_LONGITUDE float64
MES_LATITUDE float64
MES_TEMPERATURE float64
MES_HUMIDITE float64
MES_PLUIE float64
MES_VITESSE_VENT float64
MES_U_WIND float64
MES_V_WIND float64
I use the bigquery.Client().load_table_from_dataframe() function to insert the data into a BigQuery table whose numeric columns have the NUMERIC BigQuery type.
It returns this error:
pyarrow.lib.ArrowInvalid: Got bytestring of length 8 (expected 16)
I tried to fix it with:
df["MES_LONGITUDE"] = df["MES_LONGITUDE"].astype(str).map(decimal.Decimal)
But that did not help.
Thanks.
I managed to work around this issue with a decimal.Context; hope it helps:
import decimal
import numpy as np
import pandas as pd
from google.cloud import bigquery
df = pd.DataFrame(
    data={
        "MES_HUMIDITE": np.array([2.653137, 2.976434, 3.128418, 3.183418]),
        "MES_PLUIE": np.array([-2.402910, -2.972910, -2.702910, -2.342910]),
    },
    dtype="float",
)
We check the data type declarations:
df.dtypes
# MES_HUMIDITE float64
# MES_PLUIE float64
# dtype: object
Initialize a Context with 7 digits, because that is the precision in those columns; you can create multiple Contexts if you need different precision values for each column:
context = decimal.Context(prec=7)
df["MES_HUMIDITE"] = df["MES_HUMIDITE"].apply(context.create_decimal_from_float)
df["MES_PLUIE"] = df["MES_PLUIE"].apply(context.create_decimal_from_float)
Now, each item is a Decimal object:
df["MES_HUMIDITE"][0]
# Decimal('2.653137')
The types have changed: pandas stores Decimals as objects, since, I guess, Decimal is not a native data format:
df.dtypes
# MES_HUMIDITE object
# MES_PLUIE object
# dtype: object
table_id = "test_dataset.test"
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("MES_HUMIDITE", "NUMERIC"),
        bigquery.SchemaField("MES_PLUIE", "NUMERIC"),
    ],
    write_disposition="WRITE_TRUNCATE",
)
client = bigquery.Client.from_service_account_json("/path_to_key.json")
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()
However, decimal types are generally recommended for financial calculations and, although I do not know your exact case and usage, you are probably safe using FLOAT64, at least for latitude and longitude.
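For completeness, if FLOAT64 is acceptable, a minimal sketch (reusing the df, table_id, and client from above, and skipping the Decimal conversion entirely) would just declare FLOAT64 fields in the schema:
# Hypothetical variant: keep the native float64 columns and declare FLOAT64.
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("MES_HUMIDITE", "FLOAT64"),
        bigquery.SchemaField("MES_PLUIE", "FLOAT64"),
    ],
    write_disposition="WRITE_TRUNCATE",
)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()  # wait for the load job to finish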

Pandas Index.droplevel() works in 0.25.3 but not in 1.2.4

Some formerly-working code fails after I migrated from pandas 0.25.3 to 1.2.4. Here is a reproducible example:
import numpy as np
import pandas as pd

print(f"pandas: {pd.__version__}")
!python --version

cols = pd.MultiIndex.from_product([['coz',], ['alpha', 'beta', 'gamma']], names=['health', 'protocol'])
index = pd.date_range(start="1jan2020", end=None, periods=5, freq="d", name="Date")
data = np.random.rand(5, 3)
df = pd.DataFrame(data=data, index=index, columns=cols)

def foo(row):
    row.index = row.index.droplevel(0)
    return row['beta'] > row['alpha']

df.apply(foo, axis="columns")
In 0.25.3 this worked as I wanted:
pandas: 0.25.3
Python 3.7.11
Date
2020-01-01 False
2020-01-02 True
2020-01-03 False
2020-01-04 True
2020-01-05 False
Freq: D, dtype: bool
but in 1.2.4 the same code throws an error apparently due to the droplevel:
pandas: 1.2.4
Python 3.9.4
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-22-4242d4f13ab1> in <module>
15 return row['beta'] > row['alpha']
16
---> 17 df.apply(foo, axis="columns")
~\.conda\envs\yagi\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7766 kwds=kwds,
7767 )
-> 7768 return op.get_result()
7769
7770 def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:
~\.conda\envs\yagi\lib\site-packages\pandas\core\apply.py in get_result(self)
183 return self.apply_raw()
184
--> 185 return self.apply_standard()
186
187 def apply_empty_result(self):
~\.conda\envs\yagi\lib\site-packages\pandas\core\apply.py in apply_standard(self)
274
275 def apply_standard(self):
--> 276 results, res_index = self.apply_series_generator()
277
278 # wrap results
~\.conda\envs\yagi\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
288 for i, v in enumerate(series_gen):
289 # ignore SettingWithCopy here in case the user mutates
--> 290 results[i] = self.f(v)
291 if isinstance(results[i], ABCSeries):
292 # If we have a view on v, we need to make a copy because
<ipython-input-22-4242d4f13ab1> in foo(row)
12
13 def foo(row):
---> 14 row.index = row.index.droplevel(0)
15 return row['beta'] > row['alpha']
16
~\.conda\envs\yagi\lib\site-packages\pandas\core\indexes\base.py in droplevel(self, level)
1609 levnums = sorted(self._get_level_number(lev) for lev in level)[::-1]
1610
-> 1611 return self._drop_level_numbers(levnums)
1612
1613 def _drop_level_numbers(self, levnums: List[int]):
~\.conda\envs\yagi\lib\site-packages\pandas\core\indexes\base.py in _drop_level_numbers(self, levnums)
1619 return self
1620 if len(levnums) >= self.nlevels:
-> 1621 raise ValueError(
1622 f"Cannot remove {len(levnums)} levels from an index with "
1623 f"{self.nlevels} levels: at least one level must be left."
ValueError: Cannot remove 1 levels from an index with 1 levels: at least one level must be left.
What seems to be happening is that in 1.2.4 the droplevel accumulates! The first row passed into apply() has a 2-level index, but the second row passed into apply() has a single-level index, and this is where the error screams. This I don't understand at all.
Here is the same toy example with a print diagnostic:
import numpy as np
import pandas as pd

print(f"pandas: {pd.__version__}")
!python --version

cols = pd.MultiIndex.from_product([['coz',], ['alpha', 'beta', 'gamma']], names=['health', 'protocol'])
index = pd.date_range(start="1jan2020", end=None, periods=5, freq="d", name="Date")
data = np.random.rand(5, 3)
df = pd.DataFrame(data=data, index=index, columns=cols)

def foo(row):
    print(f"\nROW: {row} END")
    row.index = row.index.droplevel(0)
    return row['beta'] > row['alpha']

foo = df.apply(foo, axis="columns")
correct output:
pandas: 0.25.3
Python 3.7.11
ROW: health protocol
coz alpha 0.054421
beta 0.922885
gamma 0.843888
Name: 2020-01-01T00:00:00.000000000, dtype: float64 END
ROW: health protocol
coz alpha 0.962803
beta 0.827594
gamma 0.260147
Name: 2020-01-02T00:00:00.000000000, dtype: float64 END
ROW: health protocol
coz alpha 0.680902
beta 0.124468
gamma 0.960604
Name: 2020-01-03T00:00:00.000000000, dtype: float64 END
ROW: health protocol
coz alpha 0.133331
beta 0.664735
gamma 0.623440
Name: 2020-01-04T00:00:00.000000000, dtype: float64 END
ROW: health protocol
coz alpha 0.984164
beta 0.578701
gamma 0.538993
Name: 2020-01-05T00:00:00.000000000, dtype: float64 END
Date
2020-01-01 True
2020-01-02 False
2020-01-03 False
2020-01-04 True
2020-01-05 False
Freq: D, dtype: bool
failing output:
pandas: 1.2.4
Python 3.9.4
ROW: health protocol
coz alpha 0.374974
beta 0.137263
gamma 0.494556
Name: 2020-01-01 00:00:00, dtype: float64 END
ROW: protocol
alpha 0.591057
beta 0.560530
gamma 0.183457
Name: 2020-01-02 00:00:00, dtype: float64 END
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-28-bbef1b39f13a> in <module>
16 return row['beta'] > row['alpha']
17
---> 18 foo = df.apply(foo, axis="columns")
...
ValueError: Cannot remove 1 levels from an index with 1 levels: at least one level must be left.
========
So I can fix this by operating on a .copy() of the row, but this feels like a hack. I don't understand why the code's behavior changed like this between versions.
def foo(row):
    # print(f"\nROW: {row} END")
    row = row.copy()
    row.index = row.index.droplevel(0)
    return row['beta'] > row['alpha']
https://pandas.pydata.org/docs/user_guide/gotchas.html#mutating-with-user-defined-function-udf-methods
Do not mutate inside user-defined functions passed to methods like .apply(). I was just lucky that it worked in 0.25.3...
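A non-mutating alternative (my own sketch, not part of the original answer, using the same df as above): drop the top column level once on the whole frame and compare the columns vectorized, which sidesteps .apply() and the mutation issue entirely:
# Sketch: drop the top column level up front instead of mutating each row.
flat = df.copy()
flat.columns = flat.columns.droplevel(0)  # columns become 'alpha', 'beta', 'gamma'
result = flat['beta'] > flat['alpha']     # vectorized, no .apply needed
print(result)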

Convert a dict to a DataFrame in pandas

I am using the following code:
import pandas as pd
from yahoofinancials import YahooFinancials
mutual_funds = ['PRLAX', 'QASGX', 'HISFX']
yahoo_financials_mutualfunds = YahooFinancials(mutual_funds)
daily_mutualfund_prices = yahoo_financials_mutualfunds.get_historical_price_data('2015-01-01', '2021-01-30', 'daily')
I get a dictionary as output. I would like to get a pandas dataframe with the columns date, PRLAX, QASGX, HISFX, where date is the formatted_date and the other columns hold the Open price for each ticker.
What you can do is this:
df = pd.DataFrame({
    a: {x['formatted_date']: x['adjclose'] for x in daily_mutualfund_prices[a]['prices']}
    for a in mutual_funds
})
which gives:
PRLAX QASGX HISFX
2015-01-02 19.694817 17.877445 11.852874
2015-01-05 19.203604 17.606575 11.665626
2015-01-06 19.444574 17.316357 11.450289
2015-01-07 19.963596 17.616247 11.525190
2015-01-08 20.260176 18.003208 11.665626
... ... ... ...
2021-01-25 21.799999 33.700001 14.350000
2021-01-26 22.000000 33.139999 14.090000
2021-01-27 21.620001 32.000000 13.590000
2021-01-28 22.120001 32.360001 13.990000
2021-01-29 21.379999 31.709999 13.590000
[1530 rows x 3 columns]
or any of the other values in the dict.
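Since the question asks for the Open price rather than the adjusted close, a minimal variant (assuming each prices record also carries an 'open' key, as YahooFinancials normally returns) just swaps the field:
# Hypothetical variant of the dict comprehension above, using the 'open' field.
df_open = pd.DataFrame({
    ticker: {x['formatted_date']: x['open']
             for x in daily_mutualfund_prices[ticker]['prices']}
    for ticker in mutual_funds
})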

Pandas equals gives False even when it should be True

Let me generate a dataframe
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 1]])
then I compare
df[0].equals(df[1].sort_values())
this gives False.
However, both df[0] and df[1].sort_values() give the same output:
0 1
1 2
Name: 0, dtype: int64
Why does equals give False? What is wrong?
The index values are in a different order, so if you make them the same, e.g. here with Series.reset_index with drop=True, it works as you expected:
a = df[0].equals(df[1].sort_values().reset_index(drop=True))
print (a)
True
Details:
print (df[0])
0 1
1 2
Name: 0, dtype: int64
print (df[1].sort_values())
1 1
0 2
Name: 1, dtype: int64
print (df[1].sort_values().reset_index(drop=True))
0 1
1 2
Name: 1, dtype: int64
You can also directly access Series values:
np.equal(df[0].values, df[1].sort_values().values)
array([ True, True])
np.equal(df[0].values, df[1].sort_values().values).all()
True
np.array_equal(df[0], df[1].sort_values())
True
As far as time performance is concerned, the second and third approaches are equivalent, while df[0].equals(df[1].sort_values().reset_index(drop=True)) is about 1.5x slower.
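To check the relative timings yourself, a rough benchmark sketch (numbers vary by machine and pandas version; the 1.5x figure above is the answerer's measurement, not mine):
import timeit

setup = "import numpy as np; import pandas as pd; df = pd.DataFrame([[1, 2], [2, 1]])"

# equals() after realigning the index
print(timeit.timeit("df[0].equals(df[1].sort_values().reset_index(drop=True))", setup, number=10000))
# numpy elementwise comparison reduced with .all()
print(timeit.timeit("np.equal(df[0].values, df[1].sort_values().values).all()", setup, number=10000))
# numpy array_equal
print(timeit.timeit("np.array_equal(df[0], df[1].sort_values())", setup, number=10000))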

Tensorflow - timeseries data import with datetime

I am a TensorFlow noob and I am starting with a timeseries prediction example.
I would like to import the exact datetime instead of the sequence number in the code below. How can I do that? Thanks.
Code:
import tensorflow as tf

csv_file_name = './data/sales.csv'
reader = tf.contrib.timeseries.CSVReader(csv_file_name)
train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(reader, batch_size=16, window_size=42)

with tf.Session() as sess:
    data = reader.read_full()
    coord = tf.train.Coordinator()
    tf.train.start_queue_runners(sess=sess, coord=coord)
    data = sess.run(data)
    coord.request_stop()

ar = tf.contrib.timeseries.ARRegressor(
    periodicities=100, input_window_size=35, output_window_size=7,
    num_features=1,
    loss=tf.contrib.timeseries.ARModel.NORMAL_LIKELIHOOD_LOSS)
ar.train(input_fn=train_input_fn, steps=6000)

evaluation_input_fn = tf.contrib.timeseries.WholeDatasetInputFn(reader)
evaluation = ar.evaluate(input_fn=evaluation_input_fn, steps=1)
(predictions,) = tuple(ar.predict(
    input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
        evaluation, steps=100)))
sales.csv
1,12223696.5
2,14098603
3,10515241
4,6328012
5,7200172
6,7864498
7,8036747.5
8,7537712.5
9,15359748.5
10,10074294.5
Error if I try to import the datetime:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Field 0 in record 0 is not a valid int64: 2017-01-01
According to the source code, a RandomWindowInputFn accepts either a CSVReader or a NumpyReader. So you could use pandas to read the CSV, do the date parsing, and then feed the transformed dates into a NumpyReader.
My time-series data looks like this:
timestamp value
0 2014-02-14 14:30:00 0.132
1 2014-02-14 14:35:00 0.134
2 2014-02-14 14:40:00 0.134
3 2014-02-14 14:45:00 0.134
4 2014-02-14 14:50:00 0.134
First, I parsed the timestamp column into an integer column using pandas:
from datetime import datetime as dt
import pandas as pd

def date_parser(date_str):
    # "%s" formats as seconds since the epoch; note this directive is
    # platform-dependent (not available on Windows)
    return dt.strptime(date_str, "%Y-%m-%d %H:%M:%S").strftime("%s")

data = pd.read_csv("my_data.csv",
                   header=0,
                   parse_dates=['timestamp'],
                   date_parser=date_parser)
data['timestamp'] = data['timestamp'].apply(lambda x: int(x))
Then we can pass these arrays on to the NumpyReader:
np_reader = tf.contrib.timeseries.NumpyReader(data={
    tf.contrib.timeseries.TrainEvalFeatures.TIMES: data['timestamp'].values,
    tf.contrib.timeseries.TrainEvalFeatures.VALUES: data['value'].values,
})
And finally pass the np_reader to the RandomWindowInputFn:
train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
    np_reader, batch_size=32, window_size=16)
Hope this helps somebody!
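As an aside (my own sketch, not part of the original answer): since the "%s" strftime directive is not portable, you can instead let pandas do the whole conversion through its datetime64 representation, which works on all platforms:
import pandas as pd

# Parse timestamps natively, then convert nanoseconds since epoch to seconds.
data = pd.read_csv("my_data.csv", header=0, parse_dates=['timestamp'])
data['timestamp'] = data['timestamp'].astype('int64') // 10**9
The resulting integer column can be fed to the NumpyReader exactly as above.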