missing observation panel data, bring forward value 20 periods - pandas

Here's how to read in a DataFrame like the one I'm looking at:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'period': [1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15, 16, 19, 20, 21, 22,
               23, 25, 26],
    'id': [1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285,
           1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285],
    'pred': [-1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
             -1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
             -1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
             -1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775],
    'ret': [None, -0.02222222, -0.01363636, 0., -0.02764977,
            None, -0.00909091, -0.01376147, 0.00465116, None,
            0.01869159, 0., 0., None, -0.00460829,
            0.00462963, 0.02304147, 0., None, -0.00050756]})
It will look like this when read in:
period id pred ret
0 1 1285 -1.653477 NaN
1 2 1285 -1.653477 -0.022222
2 3 1285 -1.653477 -0.013636
3 4 1285 -1.653477 0.000000
4 5 1285 -1.653477 -0.027650
5 8 1285 -1.653477 NaN
6 9 1285 -1.653477 -0.009091
7 10 1285 -1.653477 -0.013761
8 11 1285 -1.653477 0.004651
9 13 1285 -1.653477 NaN
10 14 1285 -1.653477 0.018692
11 15 1285 -1.653477 0.000000
12 16 1285 -1.653477 0.000000
13 19 1285 -1.653477 NaN
14 20 1285 -1.653477 -0.004608
15 21 1285 -1.653477 0.004630
16 22 1285 -1.653477 0.023041
17 23 1285 -1.653477 0.000000
18 25 1285 -1.653477 NaN
19 26 1285 -1.653477 -0.000508
pred is a 20-period-ahead prediction, so what I want to do is bring the returns back 20 periods (but do it in a flexible way). Here's the lag function I have presently:
def lag(df, col, lag_dist=1, ref='period', group='id'):
    df = df.copy()
    new_col = 'lag' + str(lag_dist) + '_' + col
    df[new_col] = df.groupby(group)[col].shift(lag_dist)
    # blank out shifted values whose period gap differs from the requested distance
    df[new_col] = (df.groupby(group)[ref]
                     .shift(lag_dist)
                     .sub(df[ref])
                     .eq(-lag_dist)
                     .mul(1)
                     .replace(0, np.nan) * df[new_col])
    return df[new_col]
but when I run
df['fut20_ret'] = lag(df, 'ret', -20, 'period')
df.head(20)
I get
period id pred gain fee prc ret fut20_ret
0 1 1285 -1.653478 0.000000 0.87 1.000000 NaN NaN
1 2 1285 -1.653478 -0.022222 0.87 0.977778 -0.022222 NaN
2 3 1285 -1.653478 -0.035556 0.87 0.964444 -0.013636 NaN
3 4 1285 -1.653478 -0.035556 0.87 0.964444 0.000000 NaN
4 5 1285 -1.653478 -0.062222 0.87 0.937778 -0.027650 NaN
6 8 1285 -1.653478 -0.022222 0.87 0.977778 NaN NaN
7 9 1285 -1.653478 -0.031111 0.87 0.968889 -0.009091 NaN
8 10 1285 -1.653478 -0.044444 0.87 0.955556 -0.013761 NaN
9 11 1285 -1.653478 -0.040000 0.87 0.960000 0.004651 NaN
10 13 1285 -1.653478 -0.048889 0.87 0.951111 NaN NaN
11 14 1285 -1.653478 -0.031111 0.87 0.968889 0.018692 NaN
12 15 1285 -1.653478 -0.031111 0.87 0.968889 0.000000 NaN
13 16 1285 -1.653478 -0.031111 0.87 0.968889 0.000000 NaN
15 19 1285 -1.653478 -0.035556 0.87 0.964444 NaN NaN
16 20 1285 -1.653478 -0.040000 0.87 0.960000 -0.004608 NaN
17 21 1285 -1.653478 -0.035556 0.87 0.964444 0.004630 NaN
18 22 1285 -1.653478 -0.013333 0.87 0.986667 0.023041 NaN
19 23 1285 -1.653478 -0.013333 0.87 0.986667 0.000000 NaN
How can I modify my lag function so that it works properly? It's close, but I'm struggling with the last little bit.
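One possible fix (a sketch of my own, not from the thread): instead of shifting by row position, which is exactly what breaks when periods are missing, match on the period value itself. The lead helper below is hypothetical; it merges each row with the row exactly dist periods ahead within the same id, returning NaN when that period does not exist:
def lead(df, col, dist=20, ref='period', group='id'):
    # re-key the future rows onto the period they answer for
    future = df[[group, ref, col]].copy()
    future[ref] = future[ref] - dist
    # a left merge preserves df's row order; unmatched periods come back NaN
    merged = df[[group, ref]].merge(future, on=[group, ref], how='left')
    return merged[col].to_numpy()

df['fut20_ret'] = lead(df, 'ret', 20)
On the sample above, period 1 picks up the return observed at period 21 (0.00462963), and rows whose period + 20 is absent stay NaN - the gap-aware behavior the .eq(-lag_dist) check was aiming for.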


pandas multiindex sort by specified rules

There is a dataframe like the one below:
import numpy as np
import pandas as pd

arrays = [
    np.array(["baz", "baz", "bar", "bar", "qux", "foo"]),
    np.array(["yes", "no", "yes", "no", "yes", "no"]),
]
df = pd.DataFrame(np.random.randint(100, size=(6, 4)), index=arrays)
df
Now I want to know the yes_rate (yes/all) of each column. The implementation code is below:
first_index_list = list(df.index.get_level_values(0).unique())
for index in first_index_list:
    index_sum = df.loc[index].sum()
    if df.index.isin([(index, 'yes')]).any():
        yes_rate = df.loc[(index, 'yes')] / index_sum
        df.loc[(index, 'yes_rate'), :] = yes_rate
    df.loc[(index, 'All'), :] = index_sum
df.sort_index()
But the code is not ideal; there are some problems:
how to sort in the order below:
first index: [baz, bar, qux, foo], i.e. the original order
second index: [no, yes, All, yes_rate]
if the code is executed repeatedly, the All and yes_rate values should not change
and how to add only yes and no together to generate All (note: yes and no are not guaranteed to exist):
index_sum = ...
if yes exists:
    index_sum += df.loc[(index, 'yes')]
if no exists:
    index_sum += df.loc[(index, 'no')]
IIUC, you can use pandas.concat to concatenate in the desired order, then only sort the first level:
l = ['baz', 'bar', 'qux', 'foo']
order = pd.Series({k: v for v, k in enumerate(l)})

df_all = df.groupby(level=0).sum()

out = (pd
   .concat([df,
            pd.concat({'All': df_all,
                       'yes_rate': df.xs('yes', level=1).div(df_all)})
              .dropna(how='all')
              .swaplevel()
            ])
   .sort_index(level=0, key=order.reindex, sort_remaining=False)
)
output:
0 1 2 3
baz yes 20.000000 97.000000 95.000000 38.000000
no 85.000000 73.000000 23.000000 27.000000
All 105.000000 170.000000 118.000000 65.000000
yes_rate 0.190476 0.570588 0.805085 0.584615
bar yes 86.000000 32.000000 73.000000 16.000000
no 9.000000 97.000000 2.000000 55.000000
All 95.000000 129.000000 75.000000 71.000000
yes_rate 0.905263 0.248062 0.973333 0.225352
qux yes 69.000000 16.000000 92.000000 82.000000
All 69.000000 16.000000 92.000000 82.000000
yes_rate 1.000000 1.000000 1.000000 1.000000
foo no 77.000000 5.000000 12.000000 3.000000
All 77.000000 5.000000 12.000000 3.000000
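One thing neither the question's loop nor the snippet above addresses is idempotency: if All and yes_rate rows get written back into df, re-running the computation double-counts them. A small guard (my own sketch, not from the thread) is to recompute from the raw yes/no rows only:
# keep only the raw rows so a re-run never double-counts All/yes_rate
base = df[df.index.get_level_values(1).isin(['yes', 'no'])]
df_all = base.groupby(level=0).sum()
Feeding base and this df_all into the concat above makes repeated execution stable.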
def function1(ss: pd.Series):
    if ss.name == 'level_1':
        ss[6] = 'yes_rate'
        ss[-1] = 'ALL'
    else:
        ss[6] = ss.iloc[0] / ss.sum()
        ss[-1] = ss.sum()
    return ss.sort_index()

df.reset_index().groupby('level_0').apply(lambda dd: dd.apply(function1).set_index('level_1'))
output:
0 1 2 3
level_0 level_1
bar ALL 164.506098 136.514706 176.511364 83.012048
yes 83.000000 70.000000 90.000000 1.000000
no 81.000000 66.000000 86.000000 82.000000
yes_rate 0.506098 0.514706 0.511364 0.012048
baz ALL 100.310000 32.281250 35.228571 143.342657
yes 31.000000 9.000000 8.000000 49.000000
no 69.000000 23.000000 27.000000 94.000000
yes_rate 0.310000 0.281250 0.228571 0.342657
foo ALL 52.000000 29.000000 29.000000 35.000000
no 51.000000 28.000000 28.000000 34.000000
yes_rate 1.000000 1.000000 1.000000 1.000000
qux ALL 7.000000 65.000000 33.000000 57.000000
yes 6.000000 64.000000 32.000000 56.000000
yes_rate 1.000000 1.000000 1.000000 1.000000

Apply customized functions in pandas groupby and panel data

I have panel data as follows:
volume VWAP open close high low n ticker date
time
2021-09-02 09:30:00 597866 110.2781 110.32 110.37 110.4900 110.041 3719.0 AMD 2021-09-02
2021-09-02 09:31:00 512287 109.9928 110.36 109.85 110.4000 109.725 3732.0 AMD 2021-09-02
2021-09-02 09:32:00 359379 109.7271 109.81 109.89 109.9600 109.510 2455.0 AMD 2021-09-02
2021-09-02 09:33:00 368225 109.5740 109.89 109.66 109.8900 109.420 2555.0 AMD 2021-09-02
2021-09-02 09:34:00 320260 109.5616 109.67 109.45 109.8299 109.390 2339.0 AMD 2021-09-02
... ... ... ... ... ... ... ... ... ...
2021-12-31 15:56:00 62680 3334.8825 3332.24 3337.60 3337.8500 3331.890 2334.0 AMZN 2021-12-31
2021-12-31 15:57:00 26046 3336.0700 3337.70 3335.72 3338.6000 3334.990 1292.0 AMZN 2021-12-31
2021-12-31 15:58:00 47989 3336.3885 3334.65 3337.23 3338.0650 3334.650 1651.0 AMZN 2021-12-31
2021-12-31 15:59:00 63865 3335.5288 3336.70 3334.72 3337.3700 3334.180 2172.0 AMZN 2021-12-31
2021-12-31 16:00:00 1974 3334.8869 3334.34 3334.34 3334.3400 3334.340 108.0 AMZN 2021-12-31
153700 rows × 9 columns
I would like to calculate a series of attributes engineered from the panel data. These functions are pre-written and posted on GitHub: https://github.com/twopirllc/pandas-ta/blob/main/pandas_ta/overlap/ema.py. In Dr. Jansen's example, he used
import pandas_ta as ta
import pandas as pd
df["feature"] = df.groupby("ticker", group_keys = False).apply(lambda x: ta.ema(x.close))
I was able to follow along using Google Cloud's Compute Engine under Python 3.7. However, when I use my school's cluster with Python 3.8, even though it has the same pandas version, it would not work. I also tried using the same version of Python; unfortunately that did not work either.
df.groupby("ticker").apply(lambda x: ta.ema(x.close, 200))
output:
ticker time
AAPL 2021-09-02 09:30:00 NaN
2021-09-02 09:31:00 NaN
2021-09-02 09:32:00 NaN
2021-09-02 09:33:00 NaN
2021-09-02 09:34:00 NaN
...
TSLA 2021-12-31 15:56:00 1064.446659
2021-12-31 15:57:00 1064.358135
2021-12-31 15:58:00 1064.278452
2021-12-31 15:59:00 1064.207621
2021-12-31 16:00:00 1064.135904
Name: EMA_200, Length: 153700, dtype: float64
df["alpha_01"] = df.groupby("ticker").apply(lambda x: ta.ema(x.close))
output:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:10772, in _reindex_for_setitem(value, index)
10771 try:
> 10772 reindexed_value = value.reindex(index)._values
10773 except ValueError as err:
10774 # raised in MultiIndex.from_tuples, see test_insert_error_msmgs
File ~/quant/lib/python3.8/site-packages/pandas/core/series.py:4579, in Series.reindex(self, index, **kwargs)
4571 @doc(
4572 NDFrame.reindex, # type: ignore[has-type]
4573 klass=_shared_doc_kwargs["klass"],
(...)
4577 )
4578 def reindex(self, index=None, **kwargs):
-> 4579 return super().reindex(index=index, **kwargs)
File ~/quant/lib/python3.8/site-packages/pandas/core/generic.py:4809, in NDFrame.reindex(self, *args, **kwargs)
4808 # perform the reindex on the axes
-> 4809 return self._reindex_axes(
4810 axes, level, limit, tolerance, method, fill_value, copy
4811 ).__finalize__(self, method="reindex")
File ~/quant/lib/python3.8/site-packages/pandas/core/generic.py:4825, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
4824 ax = self._get_axis(a)
-> 4825 new_index, indexer = ax.reindex(
4826 labels, level=level, limit=limit, tolerance=tolerance, method=method
4827 )
4829 axis = self._get_axis_number(a)
File ~/quant/lib/python3.8/site-packages/pandas/core/indexes/multi.py:2533, in MultiIndex.reindex(self, target, method, level, limit, tolerance)
2532 try:
-> 2533 target = MultiIndex.from_tuples(target)
2534 except TypeError:
2535 # not all tuples, see test_constructor_dict_multiindex_reindex_flat
File ~/quant/lib/python3.8/site-packages/pandas/core/indexes/multi.py:202, in names_compat.<locals>.new_meth(self_or_cls, *args, **kwargs)
200 kwargs["names"] = kwargs.pop("name")
--> 202 return meth(self_or_cls, *args, **kwargs)
File ~/quant/lib/python3.8/site-packages/pandas/core/indexes/multi.py:553, in MultiIndex.from_tuples(cls, tuples, sortorder, names)
551 tuples = np.asarray(tuples._values)
--> 553 arrays = list(lib.tuples_to_object_array(tuples).T)
554 elif isinstance(tuples, list):
File ~/quant/lib/python3.8/site-packages/pandas/_libs/lib.pyx:2919, in pandas._libs.lib.tuples_to_object_array()
ValueError: cannot include dtype 'M' in a buffer
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
Input In [19], in <cell line: 1>()
----> 1 df_features["alpha_01"] = df.groupby("ticker").apply(lambda x: ta.ema(x.close))
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:3607, in DataFrame.__setitem__(self, key, value)
3604 self._setitem_array([key], value)
3605 else:
3606 # set column
-> 3607 self._set_item(key, value)
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:3779, in DataFrame._set_item(self, key, value)
3769 def _set_item(self, key, value) -> None:
3770 """
3771 Add series to DataFrame in specified column.
3772
(...)
3777 ensure homogeneity.
3778 """
-> 3779 value = self._sanitize_column(value)
3781 if (
3782 key in self.columns
3783 and value.ndim == 1
3784 and not is_extension_array_dtype(value)
3785 ):
3786 # broadcast across multiple columns if necessary
3787 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:4501, in DataFrame._sanitize_column(self, value)
4499 # We should never get here with DataFrame value
4500 if isinstance(value, Series):
-> 4501 return _reindex_for_setitem(value, self.index)
4503 if is_list_like(value):
4504 com.require_length_match(value, self.index)
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:10779, in _reindex_for_setitem(value, index)
10775 if not value.index.is_unique:
10776 # duplicate axis
10777 raise err
> 10779 raise TypeError(
10780 "incompatible index of inserted column with frame index"
10781 ) from err
10782 return reindexed_value
TypeError: incompatible index of inserted column with frame index
df_features["alpha_01"] = df.groupby("ticker", group_keys = False).apply(lambda x: ta.ema(x.close))
output:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [20], in <cell line: 1>()
----> 1 df_features["alpha_01"] = df.groupby("ticker", group_keys = False).apply(lambda x: ta.ema(x.close))
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:3607, in DataFrame.__setitem__(self, key, value)
3604 self._setitem_array([key], value)
3605 else:
3606 # set column
-> 3607 self._set_item(key, value)
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:3779, in DataFrame._set_item(self, key, value)
3769 def _set_item(self, key, value) -> None:
3770 """
3771 Add series to DataFrame in specified column.
3772
(...)
3777 ensure homogeneity.
3778 """
-> 3779 value = self._sanitize_column(value)
3781 if (
3782 key in self.columns
3783 and value.ndim == 1
3784 and not is_extension_array_dtype(value)
3785 ):
3786 # broadcast across multiple columns if necessary
3787 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:4501, in DataFrame._sanitize_column(self, value)
4499 # We should never get here with DataFrame value
4500 if isinstance(value, Series):
-> 4501 return _reindex_for_setitem(value, self.index)
4503 if is_list_like(value):
4504 com.require_length_match(value, self.index)
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:10777, in _reindex_for_setitem(value, index)
10773 except ValueError as err:
10774 # raised in MultiIndex.from_tuples, see test_insert_error_msmgs
10775 if not value.index.is_unique:
10776 # duplicate axis
> 10777 raise err
10779 raise TypeError(
10780 "incompatible index of inserted column with frame index"
10781 ) from err
10782 return reindexed_value
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:10772, in _reindex_for_setitem(value, index)
10770 # GH#4107
10771 try:
> 10772 reindexed_value = value.reindex(index)._values
10773 except ValueError as err:
10774 # raised in MultiIndex.from_tuples, see test_insert_error_msmgs
10775 if not value.index.is_unique:
10776 # duplicate axis
File ~/quant/lib/python3.8/site-packages/pandas/core/series.py:4579, in Series.reindex(self, index, **kwargs)
4571 @doc(
4572 NDFrame.reindex, # type: ignore[has-type]
4573 klass=_shared_doc_kwargs["klass"],
(...)
4577 )
4578 def reindex(self, index=None, **kwargs):
-> 4579 return super().reindex(index=index, **kwargs)
File ~/quant/lib/python3.8/site-packages/pandas/core/generic.py:4809, in NDFrame.reindex(self, *args, **kwargs)
4806 return self._reindex_multi(axes, copy, fill_value)
4808 # perform the reindex on the axes
-> 4809 return self._reindex_axes(
4810 axes, level, limit, tolerance, method, fill_value, copy
4811 ).__finalize__(self, method="reindex")
File ~/quant/lib/python3.8/site-packages/pandas/core/generic.py:4830, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
4825 new_index, indexer = ax.reindex(
4826 labels, level=level, limit=limit, tolerance=tolerance, method=method
4827 )
4829 axis = self._get_axis_number(a)
-> 4830 obj = obj._reindex_with_indexers(
4831 {axis: [new_index, indexer]},
4832 fill_value=fill_value,
4833 copy=copy,
4834 allow_dups=False,
4835 )
4837 return obj
File ~/quant/lib/python3.8/site-packages/pandas/core/generic.py:4874, in NDFrame._reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
4871 indexer = ensure_platform_int(indexer)
4873 # TODO: speed up on homogeneous DataFrame objects
-> 4874 new_data = new_data.reindex_indexer(
4875 index,
4876 indexer,
4877 axis=baxis,
4878 fill_value=fill_value,
4879 allow_dups=allow_dups,
4880 copy=copy,
4881 )
4882 # If we've made a copy once, no need to make another one
4883 copy = False
File ~/quant/lib/python3.8/site-packages/pandas/core/internals/managers.py:663, in BaseBlockManager.reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy, consolidate, only_slice)
661 # some axes don't allow reindexing with dups
662 if not allow_dups:
--> 663 self.axes[axis]._validate_can_reindex(indexer)
665 if axis >= self.ndim:
666 raise IndexError("Requested axis not found in manager")
File ~/quant/lib/python3.8/site-packages/pandas/core/indexes/base.py:3785, in Index._validate_can_reindex(self, indexer)
3783 # trying to reindex on an axis with duplicates
3784 if not self._index_as_unique and len(indexer):
-> 3785 raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
The data and the ipynb are available via this link: https://drive.google.com/drive/folders/1QnIdYnDFs8XNk7L8KFzCHC_YJPDo618t?usp=sharing
Ideal output:
df["new_col"] = df.groupby().apply() # without writing any additional helper function
Calling apply on the grouped dataframe produces the following output:
df.groupby("ticker").apply(lambda x: ta.ema(x.close, 200))
output:
ticker time
AAPL 2021-09-02 09:30:00 NaN
2021-09-02 09:31:00 NaN
2021-09-02 09:32:00 NaN
2021-09-02 09:33:00 NaN
2021-09-02 09:34:00 NaN
...
TSLA 2021-12-31 15:56:00 1064.446659
2021-12-31 15:57:00 1064.358135
2021-12-31 15:58:00 1064.278452
2021-12-31 15:59:00 1064.207621
2021-12-31 16:00:00 1064.135904
Name: EMA_200, Length: 153700, dtype: float64
We want the dataframe being appended to have an identical MultiIndex. (The original df is indexed by time alone, so the same timestamp repeats across tickers; that non-unique index is what both tracebacks above are complaining about.)
df_features = df.reset_index().groupby([pd.Grouper(key = "ticker"), "time"]).sum()
df_features
out:
volume VWAP open close high low n
ticker time
AAPL 2021-09-02 09:30:00 1844930 154.0857 153.8700 154.4300 154.4402 153.8600 9899.0
2021-09-02 09:31:00 565141 154.2679 154.4299 154.0600 154.4600 154.0600 5132.0
2021-09-02 09:32:00 524794 154.1198 154.0600 154.2339 154.3700 153.8500 4036.0
2021-09-02 09:33:00 504479 154.3071 154.2305 154.4750 154.4800 154.1600 4171.0
2021-09-02 09:34:00 794989 154.5478 154.4800 154.4906 154.7100 154.4206 5019.0
... ... ... ... ... ... ... ... ...
TSLA 2021-12-31 15:56:00 91296 1055.9030 1055.4900 1055.9400 1056.3200 1055.3200 2360.0
2021-12-31 15:57:00 104648 1056.0563 1055.9850 1055.5500 1056.4300 1055.5500 2988.0
2021-12-31 15:58:00 149130 1055.6994 1055.5500 1056.3500 1056.8000 1054.5900 3603.0
2021-12-31 15:59:00 189018 1056.4131 1056.2900 1057.1600 1057.2400 1056.0700 4214.0
2021-12-31 16:00:00 37983 1056.3289 1057.0100 1057.0000 1057.1000 1056.0000 319.0
153700 rows × 7 columns
Then append the calculated series to this dataframe.
df_features["alpha_01"] = df.groupby("ticker").parallel_apply(lambda x: ta.ema(x.close, 200))
df_features
out:
volume VWAP open close high low n alpha_01
ticker time
AAPL 2021-09-02 09:30:00 1844930 154.0857 153.8700 154.4300 154.4402 153.8600 9899.0 NaN
2021-09-02 09:31:00 565141 154.2679 154.4299 154.0600 154.4600 154.0600 5132.0 NaN
2021-09-02 09:32:00 524794 154.1198 154.0600 154.2339 154.3700 153.8500 4036.0 NaN
2021-09-02 09:33:00 504479 154.3071 154.2305 154.4750 154.4800 154.1600 4171.0 NaN
2021-09-02 09:34:00 794989 154.5478 154.4800 154.4906 154.7100 154.4206 5019.0 NaN
... ... ... ... ... ... ... ... ... ...
TSLA 2021-12-31 15:56:00 91296 1055.9030 1055.4900 1055.9400 1056.3200 1055.3200 2360.0 1064.446659
2021-12-31 15:57:00 104648 1056.0563 1055.9850 1055.5500 1056.4300 1055.5500 2988.0 1064.358135
2021-12-31 15:58:00 149130 1055.6994 1055.5500 1056.3500 1056.8000 1054.5900 3603.0 1064.278452
2021-12-31 15:59:00 189018 1056.4131 1056.2900 1057.1600 1057.2400 1056.0700 4214.0 1064.207621
2021-12-31 16:00:00 37983 1056.3289 1057.0100 1057.0000 1057.1000 1056.0000 319.0 1064.135904
153700 rows × 8 columns
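An alternative sketch (my own, assuming ta.ema preserves the index of the Series it receives): rather than rebuilding the frame with a groupby-sum, put the (ticker, time) MultiIndex on df directly so that every row label is unique, after which the assignment aligns without error:
df2 = df.set_index("ticker", append=True).swaplevel()  # index becomes (ticker, time)
df2["alpha_01"] = df2.groupby(level="ticker", group_keys=False).apply(
    lambda x: ta.ema(x.close, 200))
With group_keys=False the applied result keeps df2's own unique MultiIndex, so pandas can align it instead of raising the reindexing errors shown above.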

Why are the limits of the histogram data autodetected as [nan, nan] instead of discarding NaNs?

The following code generates an error
print(g['resp'])
par = {'hist': True, 'kde': False, 'fit': scipy.stats.norm, 'bins': 'auto'}
sns.distplot(g['resp'], color='blue', **par)
31 23.0
32 28.0
33 29.0
34 31.0
35 32.0
36 35.0
37 35.0
38 36.0
39 37.0
40 38.0
41 38.0
42 38.0
43 41.0
44 42.0
45 42.0
46 42.0
47 42.0
48 46.0
49 48.0
50 49.0
51 50.0
52 52.0
53 55.0
54 56.0
55 60.0
56 60.0
57 100.0
58 NaN
59 NaN
60 NaN
61 NaN
Name: resp, dtype: float64
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-42944bf1e405> in <module>
1 print(g['resp'])
2 par = {'hist': True, 'kde': False, 'fit': scipy.stats.norm, 'bins': 'auto'}
----> 3 sns.distplot(g['resp'], color='blue', **par)
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\distributions.py in distplot(a, bins, hist, kde, rug, fit, hist_kws, kde_kws, rug_kws, fit_kws, color, vertical, norm_hist, axlabel, label, ax)
223 hist_color = hist_kws.pop("color", color)
224 ax.hist(a, bins, orientation=orientation,
--> 225 color=hist_color, **hist_kws)
226 if hist_color != color:
227 hist_kws["color"] = hist_color
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\__init__.py in inner(ax, data, *args, **kwargs)
1808 "the Matplotlib list!)" % (label_namer, func.__name__),
1809 RuntimeWarning, stacklevel=2)
-> 1810 return func(ax, *args, **kwargs)
1811
1812 inner.__doc__ = _add_data_doc(inner.__doc__,
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py in hist(self, x, bins, range, density, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, normed, **kwargs)
6589 # this will automatically overwrite bins,
6590 # so that each histogram uses the same bins
-> 6591 m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
6592 m = m.astype(float) # causes problems later if it's an int
6593 if mlast is None:
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\histograms.py in histogram(a, bins, range, normed, weights, density)
708 a, weights = _ravel_and_check_weights(a, weights)
709
--> 710 bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights)
711
712 # Histogram is an integer or a float array depending on the weights.
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\histograms.py in _get_bin_edges(a, bins, range, weights)
331 "bins is not supported for weighted data")
332
--> 333 first_edge, last_edge = _get_outer_edges(a, range)
334
335 # truncate the range if needed
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\histograms.py in _get_outer_edges(a, range)
259 if not (np.isfinite(first_edge) and np.isfinite(last_edge)):
260 raise ValueError(
--> 261 "autodetected range of [{}, {}] is not finite".format(first_edge, last_edge))
262
263 # expand empty range to avoid divide by zero
ValueError: autodetected range of [nan, nan] is not finite
It looks like the NaN values are causing trouble - how can I discard them?
I think distplot cannot handle missing values on its own, so a possible solution is Series.dropna to remove them:
sns.distplot(g['resp'].dropna(), color='blue', **par)
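For reference, a minimal self-contained reproduction of the fix (the resp values here are made up for illustration):
import numpy as np
import pandas as pd
import scipy.stats
import seaborn as sns

g = pd.DataFrame({'resp': [23.0, 28.0, 35.0, 42.0, 100.0, np.nan, np.nan]})
par = {'hist': True, 'kde': False, 'fit': scipy.stats.norm, 'bins': 'auto'}
# dropping the NaNs first gives np.histogram a finite range to autodetect
sns.distplot(g['resp'].dropna(), color='blue', **par)
Note that recent seaborn releases deprecate distplot in favor of histplot and displot.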

How to add missing data to Pandas in Monthly Data

I have this following dataframe:
Date
2002-01-01 10.0 NaN NaN
2002-05-01 NaN 30.0 40.0
2002-07-01 NaN NaN 50.0
I would like to complete the missing months with zeros. I am actually able to do that, but only by adding the entire range of days that are missing, as you can see in the following code. The relevant part of the code is marked with #############################.
import datetime

import numpy as np
import pandas as pd

def createSeriesOfCompanies(df):
    listOfCompanies = list(set(df['Company']))
    dfSeries = df.pivot(index='Date', columns='Company', values='var1')
    # Here I include the missing dates
    #######################################################
    initialDate = dfSeries.index[0]
    endDate = dfSeries.index[-1]
    idx = pd.date_range(initialDate, endDate)
    dfSeries.index = pd.DatetimeIndex(dfSeries.index)
    dfSeries = dfSeries.reindex(idx, fill_value=0)
    ########################################################
    # Here it finishes the procedure
    return dfSeries, listOfCompanies  # needed for the unpacking in __main__ below
def creatingDataFrame():
    dateList = []
    dateList.append(datetime.date(2002, 1, 1))
    dateList.append(datetime.date(2002, 7, 1))
    dateList.append(datetime.date(2002, 5, 1))
    dateList.append(datetime.date(2002, 5, 1))
    dateList.append(datetime.date(2002, 7, 1))
    raw_data = {'Date': dateList,
                'Company': ['A', 'B', 'B', 'C', 'C'],
                'var1': [10, 20, 30, 40, 50]}
    df = pd.DataFrame(raw_data, columns=['Date', 'Company', 'var1'])
    df.loc[1, 'var1'] = np.nan
    return df

if __name__ == "__main__":
    df = creatingDataFrame()
    print(df)
    dfSeries, listOfCompanies = createSeriesOfCompanies(df)
dfSeries,listOfCompanies=createSeriesOfCompanies(df)
I would like to get
Date
2002-01-01 10.0 NaN NaN
2002-02-01 0 0 0
2002-03-01 0 0 0
2002-04-01 0 0 0
2002-05-01 NaN 30.0 40.0
2002-06-01 0 0 0
2002-07-01 NaN NaN 50.0
But I am getting this
Company A B C
2002-01-01 10.0 NaN NaN
2002-01-02 0.0 0.0 0.0
2002-01-03 0.0 0.0 0.0
2002-01-04 0.0 0.0 0.0
2002-01-05 0.0 0.0 0.0
2002-01-06 0.0 0.0 0.0
2002-01-07 0.0 0.0 0.0
2002-01-08 0.0 0.0 0.0
2002-01-09 0.0 0.0 0.0
2002-01-10 0.0 0.0 0.0
2002-01-11 0.0 0.0 0.0
2002-01-12 0.0 0.0 0.0
2002-01-13 0.0 0.0 0.0
2002-01-14 0.0 0.0 0.0
2002-01-15 0.0 0.0 0.0
2002-01-16 0.0 0.0 0.0
2002-01-17 0.0 0.0 0.0
2002-01-18 0.0 0.0 0.0
2002-01-19 0.0 0.0 0.0
2002-01-20 0.0 0.0 0.0
2002-01-21 0.0 0.0 0.0
2002-01-22 0.0 0.0 0.0
2002-01-23 0.0 0.0 0.0
2002-01-24 0.0 0.0 0.0
2002-01-25 0.0 0.0 0.0
2002-01-26 0.0 0.0 0.0
2002-01-27 0.0 0.0 0.0
2002-01-28 0.0 0.0 0.0
2002-01-29 0.0 0.0 0.0
2002-01-30 0.0 0.0 0.0
...
How can I deal with this problem?
You can use reindex. Given that the date is the index:
df.index = pd.to_datetime(df.index)
df.reindex(pd.date_range(df.index.min(), df.index.max(), freq = 'MS'))
A B C
2002-01-01 10.0 NaN NaN
2002-02-01 NaN NaN NaN
2002-03-01 NaN NaN NaN
2002-04-01 NaN NaN NaN
2002-05-01 NaN 30.0 40.0
2002-06-01 NaN NaN NaN
2002-07-01 NaN NaN 50.0
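If the newly added months should be zero-filled while the original NaNs are kept (as in the desired output above), note that reindex's fill_value only applies to labels that were missing before the reindex:
df.reindex(pd.date_range(df.index.min(), df.index.max(), freq='MS'), fill_value=0)
Pre-existing NaNs, such as company A's value in May, stay NaN; only the inserted rows are filled with 0.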
Use asfreq by MS (start of months):
df=creatingDataFrame()
df = df.pivot(index='Date', columns='Company', values='var1').asfreq('MS', fill_value=0)
print (df)
Company A B C
Date
2002-01-01 10.0 NaN NaN
2002-02-01 0.0 0.0 0.0
2002-03-01 0.0 0.0 0.0
2002-04-01 0.0 0.0 0.0
2002-05-01 NaN 30.0 40.0
2002-06-01 0.0 0.0 0.0
2002-07-01 NaN NaN 50.0

SQL query is not working (Error in rsqlite_send_query)

This is what the head of my data frame looks like
> head(d19_1)
SMZ SIZ1_diff SIZ1_base SIZ2_diff SIZ2_base SIZ3_diff SIZ3_base SIZ4_diff SIZ4_base SIZ5_diff SIZ5_base
1 1 -620 4170 -189 1347 -35 2040 82 1437 244 1533
2 2 -219 831 -57 255 -4 392 8 282 14 297
3 3 -426 834 -162 294 -134 379 -81 241 -22 221
4 4 -481 676 -142 216 -114 267 -50 158 -43 166
5 5 -233 1711 -109 584 54 913 71 624 74 707
6 6 -322 1539 -79 512 -50 799 23 532 63 576
Total_og Total_base %_SIZ1 %_SIZ2 %_SIZ3 %_SIZ4 %_SIZ5 Total_og Total_base
1 11980 12648 14.86811 14.03118 1.715686 5.706333 15.916504 11980 12648
2 2156 2415 26.35379 22.35294 1.020408 2.836879 4.713805 2156 2415
3 1367 2314 51.07914 55.10204 35.356201 33.609959 9.954751 1367 2314
4 790 1736 71.15385 65.74074 42.696629 31.645570 25.903614 790 1736
5 5339 5496 13.61777 18.66438 5.914567 11.378205 10.466761 5339 5496
6 4362 4747 20.92268 15.42969 6.257822 4.323308 10.937500 4362 4747
The structure of the data frame, from str(d19_1), is below:
> str(d19_1)
'data.frame': 1588 obs. of 20 variables:
$ SMZ : int 1 2 3 4 5 6 7 8 9 10 ...
$ SIZ1_diff : int -620 -219 -426 -481 -233 -322 -176 -112 -34 -103 ...
$ SIZ1_base : int 4170 831 834 676 1711 1539 720 1396 998 1392 ...
$ SIZ2_diff : int -189 -57 -162 -142 -109 -79 -12 72 -36 -33 ...
$ SIZ2_base : int 1347 255 294 216 584 512 196 437 343 479 ...
$ SIZ3_diff : int -35 -4 -134 -114 54 -50 16 4 26 83 ...
$ SIZ3_base : int 2040 392 379 267 913 799 361 804 566 725 ...
$ SIZ4_diff : int 82 8 -81 -50 71 23 36 127 46 75 ...
$ SIZ4_base : int 1437 282 241 158 624 532 242 471 363 509 ...
$ SIZ5_diff : int 244 14 -22 -43 74 63 11 143 79 125 ...
$ SIZ5_base : int 1533 297 221 166 707 576 263 582 429 536 ...
$ Total_og : int 11980 2156 1367 790 5339 4362 2027 4715 3465 4561 ...
$ Total_base: int 12648 2415 2314 1736 5496 4747 2168 4464 3278 4375 ...
$ %_SIZ1 : num 14.9 26.4 51.1 71.2 13.6 ...
$ %_SIZ2 : num 14 22.4 55.1 65.7 18.7 ...
$ %_SIZ3 : num 1.72 1.02 35.36 42.7 5.91 ...
$ %_SIZ4 : num 5.71 2.84 33.61 31.65 11.38 ...
$ %_SIZ5 : num 15.92 4.71 9.95 25.9 10.47 ...
$ Total_og : int 11980 2156 1367 790 5339 4362 2027 4715 3465 4561 ...
$ Total_base: int 12648 2415 2314 1736 5496 4747 2168 4464 3278 4375 ...
When I run the query below, it returns the error shown and I don't know why - I don't have any column named <NA> in the table.
Query
d20_1 <- sqldf('SELECT *, CASE
WHEN SMZ BETWEEN 1 AND 110 THEN "Baltimore City"
WHEN SMZ BETWEEN 111 AND 217 THEN "Anne Arundel County"
WHEN SMZ BETWEEN 218 AND 405 THEN "Baltimore County"
WHEN SMZ BETWEEN 406 AND 453 THEN "Carroll County"
WHEN SMZ BETWEEN 454 AND 524 THEN "Harford County"
WHEN SMZ BETWEEN 1667 AND 1674 THEN "York County"
ELSE 0
END Jurisdiction
FROM d19_1')
Error:
Error in rsqlite_send_query(conn@ptr, statement) :
table d19_1 has no column named <NA>
Your code works correctly for me, so the problem is likely in your actual d19_1 rather than the query; the error suggests the data frame contains a column whose name is NA (checking names(d19_1) may help). Here is the data I tested with:
d19_1 <- structure(list(SMZ = 1:6, SIZ1_diff = c(-620L, -219L, -426L,
-481L, -233L, -322L), SIZ1_base = c(4170L, 831L, 834L, 676L,
1711L, 1539L), SIZ2_diff = c(-189L, -57L, -162L, -142L, -109L,
-79L), SIZ2_base = c(1347L, 255L, 294L, 216L, 584L, 512L), SIZ3_diff = c(-35L,
-4L, -134L, -114L, 54L, -50L), SIZ3_base = c(2040L, 392L, 379L,
267L, 913L, 799L), SIZ4_diff = c(82L, 8L, -81L, -50L, 71L, 23L
), SIZ4_base = c(1437L, 282L, 241L, 158L, 624L, 532L), SIZ5_diff = c(244L,
14L, -22L, -43L, 74L, 63L), SIZ5_base = c(1533L, 297L, 221L,
166L, 707L, 576L), Total_og = c(11980L, 2156L, 1367L, 790L, 5339L,
4362L), Total_base = c(12648L, 2415L, 2314L, 1736L, 5496L, 4747L
), X._SIZ1 = c(14.86811, 26.35379, 51.07914, 71.15385, 13.61777,
20.92268), X._SIZ2 = c(14.03118, 22.35294, 55.10204, 65.74074,
18.66438, 15.42969), X._SIZ3 = c(1.715686, 1.020408, 35.356201,
42.696629, 5.914567, 6.257822), X._SIZ4 = c(5.706333, 2.836879,
33.609959, 31.64557, 11.378205, 4.323308), X._SIZ5 = c(15.916504,
4.713805, 9.954751, 25.903614, 10.466761, 10.9375), Total_og.1 = c(11980L,
2156L, 1367L, 790L, 5339L, 4362L), Total_base.1 = c(12648L, 2415L,
2314L, 1736L, 5496L, 4747L)), .Names = c("SMZ", "SIZ1_diff",
"SIZ1_base", "SIZ2_diff", "SIZ2_base", "SIZ3_diff", "SIZ3_base",
"SIZ4_diff", "SIZ4_base", "SIZ5_diff", "SIZ5_base", "Total_og",
"Total_base", "X._SIZ1", "X._SIZ2", "X._SIZ3", "X._SIZ4", "X._SIZ5",
"Total_og.1", "Total_base.1"), row.names = c(NA, -6L), class = "data.frame")
library(sqldf)
sqldf('SELECT *, CASE
WHEN SMZ BETWEEN 1 AND 110 THEN "Baltimore City"
WHEN SMZ BETWEEN 111 AND 217 THEN "Anne Arundel County"
WHEN SMZ BETWEEN 218 AND 405 THEN "Baltimore County"
WHEN SMZ BETWEEN 406 AND 453 THEN "Carroll County"
WHEN SMZ BETWEEN 454 AND 524 THEN "Harford County"
WHEN SMZ BETWEEN 1667 AND 1674 THEN "York County"
ELSE 0
END Jurisdiction
FROM d19_1')