Apply customized functions in pandas groupby and panel data - pandas

I have panel data as follows:
volume VWAP open close high low n ticker date
time
2021-09-02 09:30:00 597866 110.2781 110.32 110.37 110.4900 110.041 3719.0 AMD 2021-09-02
2021-09-02 09:31:00 512287 109.9928 110.36 109.85 110.4000 109.725 3732.0 AMD 2021-09-02
2021-09-02 09:32:00 359379 109.7271 109.81 109.89 109.9600 109.510 2455.0 AMD 2021-09-02
2021-09-02 09:33:00 368225 109.5740 109.89 109.66 109.8900 109.420 2555.0 AMD 2021-09-02
2021-09-02 09:34:00 320260 109.5616 109.67 109.45 109.8299 109.390 2339.0 AMD 2021-09-02
... ... ... ... ... ... ... ... ... ...
2021-12-31 15:56:00 62680 3334.8825 3332.24 3337.60 3337.8500 3331.890 2334.0 AMZN 2021-12-31
2021-12-31 15:57:00 26046 3336.0700 3337.70 3335.72 3338.6000 3334.990 1292.0 AMZN 2021-12-31
2021-12-31 15:58:00 47989 3336.3885 3334.65 3337.23 3338.0650 3334.650 1651.0 AMZN 2021-12-31
2021-12-31 15:59:00 63865 3335.5288 3336.70 3334.72 3337.3700 3334.180 2172.0 AMZN 2021-12-31
2021-12-31 16:00:00 1974 3334.8869 3334.34 3334.34 3334.3400 3334.340 108.0 AMZN 2021-12-31
153700 rows × 9 columns
I would like to calculate a series of attributes engineered from the panel data. These functions are pre-written and posted on GitHub: https://github.com/twopirllc/pandas-ta/blob/main/pandas_ta/overlap/ema.py. In Dr. Jansen's example, he used
import pandas_ta as ta
import pandas as pd
df["feature"] = df.groupby("ticker", group_keys = False).apply(lambda x: ta.ema(x.close))
I was able to follow along using Google Cloud's Compute Engine under Python 3.7. However, when I use my school's cluster with Python 3.8, it does not work, even though the pandas version is the same. I also tried using the same Python version; unfortunately that did not work either.
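For reference, calling ta.ema on a single ticker's closes does return a plain Series indexed like its input (a minimal sketch; length=10 is an arbitrary choice, not from the example above):
amd = df[df["ticker"] == "AMD"]          # rows for one ticker
ema10 = ta.ema(amd["close"], length=10)  # Series named "EMA_10", same index as amd
# the leading rows are NaN until the 10-bar window has filled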
df.groupby("ticker").apply(lambda x: ta.ema(x.close, 200))
output:
ticker time
AAPL 2021-09-02 09:30:00 NaN
2021-09-02 09:31:00 NaN
2021-09-02 09:32:00 NaN
2021-09-02 09:33:00 NaN
2021-09-02 09:34:00 NaN
...
TSLA 2021-12-31 15:56:00 1064.446659
2021-12-31 15:57:00 1064.358135
2021-12-31 15:58:00 1064.278452
2021-12-31 15:59:00 1064.207621
2021-12-31 16:00:00 1064.135904
Name: EMA_200, Length: 153700, dtype: float64
df["alpha_01"] = df.groupby("ticker").apply(lambda x: ta.ema(x.close))
output:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:10772, in _reindex_for_setitem(value, index)
10771 try:
> 10772 reindexed_value = value.reindex(index)._values
10773 except ValueError as err:
10774 # raised in MultiIndex.from_tuples, see test_insert_error_msmgs
File ~/quant/lib/python3.8/site-packages/pandas/core/series.py:4579, in Series.reindex(self, index, **kwargs)
4571 @doc(
4572 NDFrame.reindex, # type: ignore[has-type]
4573 klass=_shared_doc_kwargs["klass"],
(...)
4577 )
4578 def reindex(self, index=None, **kwargs):
-> 4579 return super().reindex(index=index, **kwargs)
File ~/quant/lib/python3.8/site-packages/pandas/core/generic.py:4809, in NDFrame.reindex(self, *args, **kwargs)
4808 # perform the reindex on the axes
-> 4809 return self._reindex_axes(
4810 axes, level, limit, tolerance, method, fill_value, copy
4811 ).__finalize__(self, method="reindex")
File ~/quant/lib/python3.8/site-packages/pandas/core/generic.py:4825, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
4824 ax = self._get_axis(a)
-> 4825 new_index, indexer = ax.reindex(
4826 labels, level=level, limit=limit, tolerance=tolerance, method=method
4827 )
4829 axis = self._get_axis_number(a)
File ~/quant/lib/python3.8/site-packages/pandas/core/indexes/multi.py:2533, in MultiIndex.reindex(self, target, method, level, limit, tolerance)
2532 try:
-> 2533 target = MultiIndex.from_tuples(target)
2534 except TypeError:
2535 # not all tuples, see test_constructor_dict_multiindex_reindex_flat
File ~/quant/lib/python3.8/site-packages/pandas/core/indexes/multi.py:202, in names_compat.<locals>.new_meth(self_or_cls, *args, **kwargs)
200 kwargs["names"] = kwargs.pop("name")
--> 202 return meth(self_or_cls, *args, **kwargs)
File ~/quant/lib/python3.8/site-packages/pandas/core/indexes/multi.py:553, in MultiIndex.from_tuples(cls, tuples, sortorder, names)
551 tuples = np.asarray(tuples._values)
--> 553 arrays = list(lib.tuples_to_object_array(tuples).T)
554 elif isinstance(tuples, list):
File ~/quant/lib/python3.8/site-packages/pandas/_libs/lib.pyx:2919, in pandas._libs.lib.tuples_to_object_array()
ValueError: cannot include dtype 'M' in a buffer
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
Input In [19], in <cell line: 1>()
----> 1 df_features["alpha_01"] = df.groupby("ticker").apply(lambda x: ta.ema(x.close))
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:3607, in DataFrame.__setitem__(self, key, value)
3604 self._setitem_array([key], value)
3605 else:
3606 # set column
-> 3607 self._set_item(key, value)
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:3779, in DataFrame._set_item(self, key, value)
3769 def _set_item(self, key, value) -> None:
3770 """
3771 Add series to DataFrame in specified column.
3772
(...)
3777 ensure homogeneity.
3778 """
-> 3779 value = self._sanitize_column(value)
3781 if (
3782 key in self.columns
3783 and value.ndim == 1
3784 and not is_extension_array_dtype(value)
3785 ):
3786 # broadcast across multiple columns if necessary
3787 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:4501, in DataFrame._sanitize_column(self, value)
4499 # We should never get here with DataFrame value
4500 if isinstance(value, Series):
-> 4501 return _reindex_for_setitem(value, self.index)
4503 if is_list_like(value):
4504 com.require_length_match(value, self.index)
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:10779, in _reindex_for_setitem(value, index)
10775 if not value.index.is_unique:
10776 # duplicate axis
10777 raise err
> 10779 raise TypeError(
10780 "incompatible index of inserted column with frame index"
10781 ) from err
10782 return reindexed_value
TypeError: incompatible index of inserted column with frame index
df_features["alpha_01"] = df.groupby("ticker", group_keys = False).apply(lambda x: ta.ema(x.close))
output:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [20], in <cell line: 1>()
----> 1 df_features["alpha_01"] = df.groupby("ticker", group_keys = False).apply(lambda x: ta.ema(x.close))
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:3607, in DataFrame.__setitem__(self, key, value)
3604 self._setitem_array([key], value)
3605 else:
3606 # set column
-> 3607 self._set_item(key, value)
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:3779, in DataFrame._set_item(self, key, value)
3769 def _set_item(self, key, value) -> None:
3770 """
3771 Add series to DataFrame in specified column.
3772
(...)
3777 ensure homogeneity.
3778 """
-> 3779 value = self._sanitize_column(value)
3781 if (
3782 key in self.columns
3783 and value.ndim == 1
3784 and not is_extension_array_dtype(value)
3785 ):
3786 # broadcast across multiple columns if necessary
3787 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:4501, in DataFrame._sanitize_column(self, value)
4499 # We should never get here with DataFrame value
4500 if isinstance(value, Series):
-> 4501 return _reindex_for_setitem(value, self.index)
4503 if is_list_like(value):
4504 com.require_length_match(value, self.index)
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:10777, in _reindex_for_setitem(value, index)
10773 except ValueError as err:
10774 # raised in MultiIndex.from_tuples, see test_insert_error_msmgs
10775 if not value.index.is_unique:
10776 # duplicate axis
> 10777 raise err
10779 raise TypeError(
10780 "incompatible index of inserted column with frame index"
10781 ) from err
10782 return reindexed_value
File ~/quant/lib/python3.8/site-packages/pandas/core/frame.py:10772, in _reindex_for_setitem(value, index)
10770 # GH#4107
10771 try:
> 10772 reindexed_value = value.reindex(index)._values
10773 except ValueError as err:
10774 # raised in MultiIndex.from_tuples, see test_insert_error_msmgs
10775 if not value.index.is_unique:
10776 # duplicate axis
File ~/quant/lib/python3.8/site-packages/pandas/core/series.py:4579, in Series.reindex(self, index, **kwargs)
4571 @doc(
4572 NDFrame.reindex, # type: ignore[has-type]
4573 klass=_shared_doc_kwargs["klass"],
(...)
4577 )
4578 def reindex(self, index=None, **kwargs):
-> 4579 return super().reindex(index=index, **kwargs)
File ~/quant/lib/python3.8/site-packages/pandas/core/generic.py:4809, in NDFrame.reindex(self, *args, **kwargs)
4806 return self._reindex_multi(axes, copy, fill_value)
4808 # perform the reindex on the axes
-> 4809 return self._reindex_axes(
4810 axes, level, limit, tolerance, method, fill_value, copy
4811 ).__finalize__(self, method="reindex")
File ~/quant/lib/python3.8/site-packages/pandas/core/generic.py:4830, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
4825 new_index, indexer = ax.reindex(
4826 labels, level=level, limit=limit, tolerance=tolerance, method=method
4827 )
4829 axis = self._get_axis_number(a)
-> 4830 obj = obj._reindex_with_indexers(
4831 {axis: [new_index, indexer]},
4832 fill_value=fill_value,
4833 copy=copy,
4834 allow_dups=False,
4835 )
4837 return obj
File ~/quant/lib/python3.8/site-packages/pandas/core/generic.py:4874, in NDFrame._reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
4871 indexer = ensure_platform_int(indexer)
4873 # TODO: speed up on homogeneous DataFrame objects
-> 4874 new_data = new_data.reindex_indexer(
4875 index,
4876 indexer,
4877 axis=baxis,
4878 fill_value=fill_value,
4879 allow_dups=allow_dups,
4880 copy=copy,
4881 )
4882 # If we've made a copy once, no need to make another one
4883 copy = False
File ~/quant/lib/python3.8/site-packages/pandas/core/internals/managers.py:663, in BaseBlockManager.reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy, consolidate, only_slice)
661 # some axes don't allow reindexing with dups
662 if not allow_dups:
--> 663 self.axes[axis]._validate_can_reindex(indexer)
665 if axis >= self.ndim:
666 raise IndexError("Requested axis not found in manager")
File ~/quant/lib/python3.8/site-packages/pandas/core/indexes/base.py:3785, in Index._validate_can_reindex(self, indexer)
3783 # trying to reindex on an axis with duplicates
3784 if not self._index_as_unique and len(indexer):
-> 3785 raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
The data and the ipynb are available via this link: https://drive.google.com/drive/folders/1QnIdYnDFs8XNk7L8KFzCHC_YJPDo618t?usp=sharing
Ideal output:
df["new_col"] = df.groupby().apply() # without writing any additional helper function

The apply call on the grouped dataframe has the following output:
df.groupby("ticker").apply(lambda x: ta.ema(x.close, 200))
output:
ticker time
AAPL 2021-09-02 09:30:00 NaN
2021-09-02 09:31:00 NaN
2021-09-02 09:32:00 NaN
2021-09-02 09:33:00 NaN
2021-09-02 09:34:00 NaN
...
TSLA 2021-12-31 15:56:00 1064.446659
2021-12-31 15:57:00 1064.358135
2021-12-31 15:58:00 1064.278452
2021-12-31 15:59:00 1064.207621
2021-12-31 16:00:00 1064.135904
Name: EMA_200, Length: 153700, dtype: float64
We want the dataframe we append to, to carry an identical (ticker, time) MultiIndex.
df_features = df.reset_index().groupby([pd.Grouper(key = "ticker"), "time"]).sum()
df_features
out:
volume VWAP open close high low n
ticker time
AAPL 2021-09-02 09:30:00 1844930 154.0857 153.8700 154.4300 154.4402 153.8600 9899.0
2021-09-02 09:31:00 565141 154.2679 154.4299 154.0600 154.4600 154.0600 5132.0
2021-09-02 09:32:00 524794 154.1198 154.0600 154.2339 154.3700 153.8500 4036.0
2021-09-02 09:33:00 504479 154.3071 154.2305 154.4750 154.4800 154.1600 4171.0
2021-09-02 09:34:00 794989 154.5478 154.4800 154.4906 154.7100 154.4206 5019.0
... ... ... ... ... ... ... ... ...
TSLA 2021-12-31 15:56:00 91296 1055.9030 1055.4900 1055.9400 1056.3200 1055.3200 2360.0
2021-12-31 15:57:00 104648 1056.0563 1055.9850 1055.5500 1056.4300 1055.5500 2988.0
2021-12-31 15:58:00 149130 1055.6994 1055.5500 1056.3500 1056.8000 1054.5900 3603.0
2021-12-31 15:59:00 189018 1056.4131 1056.2900 1057.1600 1057.2400 1056.0700 4214.0
2021-12-31 16:00:00 37983 1056.3289 1057.0100 1057.0000 1057.1000 1056.0000 319.0
153700 rows × 7 columns
Then append the calculated series to this dataframe.
df_features["alpha_01"] = df.groupby("ticker").parallel_apply(lambda x: ta.ema(x.close, 200))
df_features
out:
volume VWAP open close high low n alpha_01
ticker time
AAPL 2021-09-02 09:30:00 1844930 154.0857 153.8700 154.4300 154.4402 153.8600 9899.0 NaN
2021-09-02 09:31:00 565141 154.2679 154.4299 154.0600 154.4600 154.0600 5132.0 NaN
2021-09-02 09:32:00 524794 154.1198 154.0600 154.2339 154.3700 153.8500 4036.0 NaN
2021-09-02 09:33:00 504479 154.3071 154.2305 154.4750 154.4800 154.1600 4171.0 NaN
2021-09-02 09:34:00 794989 154.5478 154.4800 154.4906 154.7100 154.4206 5019.0 NaN
... ... ... ... ... ... ... ... ... ...
TSLA 2021-12-31 15:56:00 91296 1055.9030 1055.4900 1055.9400 1056.3200 1055.3200 2360.0 1064.446659
2021-12-31 15:57:00 104648 1056.0563 1055.9850 1055.5500 1056.4300 1055.5500 2988.0 1064.358135
2021-12-31 15:58:00 149130 1055.6994 1055.5500 1056.3500 1056.8000 1054.5900 3603.0 1064.278452
2021-12-31 15:59:00 189018 1056.4131 1056.2900 1057.1600 1057.2400 1056.0700 4214.0 1064.207621
2021-12-31 16:00:00 37983 1056.3289 1057.0100 1057.0000 1057.1000 1056.0000 319.0 1064.135904
153700 rows × 8 columns
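A note on parallel_apply: it is not a pandas method; judging by the name it comes from the pandarallel package (an assumption), which has to be initialized before use. A plain .apply gives the same result, just single-threaded.
from pandarallel import pandarallel
pandarallel.initialize()  # patches groupby objects with parallel_apply
df_features["alpha_01"] = df.groupby("ticker").parallel_apply(lambda x: ta.ema(x.close, 200))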

Related

Getting error message when trying to get at risk numbers below KM-plot (lifelines)

I've used lifelines a lot, but when I'm re-running old code that previously worked fine I get the following error: KeyError: "None of [Index(['At risk', 'Censored', 'Events'], dtype='object')] are in the [index]"
I'm guessing there have been some changes to the code that displays at-risk counts, but I can't find any evidence of it in the lifelines documentation. I am using version 27.0.
Snippet of the table with data:
index  t2p   O
1      354   False
2      113   False
3      1222  False
4      13    True
5      59    False
6      572   False
Code:
from lifelines import KaplanMeierFitter
from lifelines.plotting import add_at_risk_counts
import matplotlib.pyplot as plt

ax = plt.subplot(111)
m = KaplanMeierFitter()
ax = m.fit(h.t2p, h.O, label='PPI').plot_cumulative_density(ax=ax, ci_show=False)
add_at_risk_counts(m)
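For reference, add_at_risk_counts also accepts the target axes as a keyword, so an equivalent call would be the following (a hedged variant for comparison, not a confirmed fix for this KeyError):
add_at_risk_counts(m, ax=ax)  # ax is a documented keyword of add_at_risk_counts
plt.tight_layout()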
Full error:
KeyError Traceback (most recent call last)
<ipython-input-96-a8ce3ea9e60c> in <module>
4 ax = m.fit(h.t2p, h.O, label='PPI').plot_cumulative_density(ax=ax,ci_show=False)
5
----> 6 add_at_risk_counts(m)
7
8
~\AppData\Local\Continuum\anaconda3\lib\site-packages\lifelines\plotting.py in add_at_risk_counts(labels, rows_to_show, ypos, xticks, ax, at_risk_count_from_start_of_period, *fitters, **kwargs)
510 .rename({"at_risk": "At risk", "censored": "Censored", "observed": "Events"})
511 )
--> 512 counts.extend([int(c) for c in event_table_slice.loc[rows_to_show]])
513
514 if n_rows > 1:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
1766
1767 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768 return self._getitem_axis(maybe_callable, axis=axis)
1769
1770 def _is_scalar_access(self, key: Tuple):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1952 raise ValueError("Cannot index with multidimensional key")
1953
-> 1954 return self._getitem_iterable(key, axis=axis)
1955
1956 # nested tuple slicing
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_iterable(self, key, axis)
1593 else:
1594 # A collection of keys
-> 1595 keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
1596 return self.obj._reindex_with_indexers(
1597 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1551
1552 self._validate_read_indexer(
-> 1553 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
1554 )
1555 return keyarr, indexer
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1638 if missing == len(indexer):
1639 axis_name = self.obj._get_axis_name(axis)
-> 1640 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1641
1642 # We (temporarily) allow for some missing keys with .loc, except in
KeyError: "None of [Index(['At risk', 'Censored', 'Events'], dtype='object')] are in the [index]"

Seaborn pairplot not running only on a specific system

I have the following data with the name 'Salaries.csv'. The dataset has columns Index(['yearID', 'teamID', 'lgID', 'salary', 'num_feat'], dtype='object'). Please note that I added the column num_feat to the DataFrame myself.
I want to do a Seaborn pairplot for team 'ATL' to plot scatter plots among all numeric features in the data frame.
I have the following code :
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
var_set = [
"yearID",
"teamID",
"lgID",
"playerID",
"salary"
]
head_set = []
head_set.extend(var_set)
head_set.append("num_feat")
df = pd.read_csv('Salaries.csv',index_col='playerID', header=None, names=head_set)
df['num_feat'] = 100 * np.random.random_sample(df.shape[0])  # adding column num_feat
df_copy = df
cols_with_team_ATL = df_copy.loc[df_copy.teamID=="ATL", ]
# Create the default pairplot
pairplot_fig = sns.pairplot(cols_with_team_ATL, vars=['yearID', 'salary', 'num_feat'])
plt.subplots_adjust(top=0.9)
pairplot_fig.fig.suptitle("Scatter plots among all numeric features in the data frame for teamID = ATL", fontsize=18, alpha=0.9, weight='bold')
plt.show()
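One quick check that may narrow this down (a diagnostic sketch, on the assumption that the failure is dtype-related, since the traceback below dies while converting the diagonal variables to numeric):
print(cols_with_team_ATL[['yearID', 'salary', 'num_feat']].dtypes)
# if yearID or salary shows up as object rather than int64/float64, the CSV
# header row was likely read as data (header=None combined with an actual header line)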
The same code runs perfectly on my friend's system but not on mine. It shows the following error on my system:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/ch/6r9p7n0j3xg1l79lz1zdkvsh0000gq/T/ipykernel_97373/3735184261.py in <module>
25 # Create the default pairplot
26 print(df.columns)
---> 27 pairplot_fig = sns.pairplot(cols_with_team_ATL, vars=['yearID', 'salary', 'num_feat'])
28 plt.subplots_adjust(top=0.9)
29 pairplot_fig.fig.suptitle("Scatter plots among all numeric features in the data frame for teamID = ATL", fontsize=18, alpha=0.9, weight='bold')
~/USC/anaconda3/lib/python3.9/site-packages/seaborn/_decorators.py in inner_f(*args, **kwargs)
44 )
45 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46 return f(**kwargs)
47 return inner_f
48
~/USC/anaconda3/lib/python3.9/site-packages/seaborn/axisgrid.py in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, height, aspect, corner, dropna, plot_kws, diag_kws, grid_kws, size)
2124 diag_kws.setdefault("legend", False)
2125 if diag_kind == "hist":
-> 2126 grid.map_diag(histplot, **diag_kws)
2127 elif diag_kind == "kde":
2128 diag_kws.setdefault("fill", True)
~/USC/anaconda3/lib/python3.9/site-packages/seaborn/axisgrid.py in map_diag(self, func, **kwargs)
1476 plot_kwargs.setdefault("hue_order", self._hue_order)
1477 plot_kwargs.setdefault("palette", self._orig_palette)
-> 1478 func(x=vector, **plot_kwargs)
1479 ax.legend_ = None
1480
~/USC/anaconda3/lib/python3.9/site-packages/seaborn/distributions.py in histplot(data, x, y, hue, weights, stat, bins, binwidth, binrange, discrete, cumulative, common_bins, common_norm, multiple, element, fill, shrink, kde, kde_kws, line_kws, thresh, pthresh, pmax, cbar, cbar_ax, cbar_kws, palette, hue_order, hue_norm, color, log_scale, legend, ax, **kwargs)
1460 if p.univariate:
1461
-> 1462 p.plot_univariate_histogram(
1463 multiple=multiple,
1464 element=element,
~/USC/anaconda3/lib/python3.9/site-packages/seaborn/distributions.py in plot_univariate_histogram(self, multiple, element, fill, common_norm, common_bins, shrink, kde, kde_kws, color, legend, line_kws, estimate_kws, **plot_kws)
426
427 # First pass through the data to compute the histograms
--> 428 for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
429
430 # Prepare the relevant data
~/USC/anaconda3/lib/python3.9/site-packages/seaborn/_core.py in iter_data(self, grouping_vars, reverse, from_comp_data)
981
982 if from_comp_data:
--> 983 data = self.comp_data
984 else:
985 data = self.plot_data
~/USC/anaconda3/lib/python3.9/site-packages/seaborn/_core.py in comp_data(self)
1055 orig = self.plot_data[var].dropna()
1056 comp_col = pd.Series(index=orig.index, dtype=float, name=var)
-> 1057 comp_col.loc[orig.index] = pd.to_numeric(axis.convert_units(orig))
1058
1059 if axis.get_scale() == "log":
~/USC/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
721
722 iloc = self if self.name == "iloc" else self.obj.iloc
--> 723 iloc._setitem_with_indexer(indexer, value, self.name)
724
725 def _validate_key(self, key, axis: int):
~/USC/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
1730 self._setitem_with_indexer_split_path(indexer, value, name)
1731 else:
-> 1732 self._setitem_single_block(indexer, value, name)
1733
1734 def _setitem_with_indexer_split_path(self, indexer, value, name: str):
~/USC/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _setitem_single_block(self, indexer, value, name)
1966
1967 # actually do the set
-> 1968 self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
1969 self.obj._maybe_update_cacher(clear=True)
1970
~/USC/anaconda3/lib/python3.9/site-packages/pandas/core/internals/managers.py in setitem(self, indexer, value)
353
354 def setitem(self: T, indexer, value) -> T:
--> 355 return self.apply("setitem", indexer=indexer, value=value)
356
357 def putmask(self, mask, new, align: bool = True):
~/USC/anaconda3/lib/python3.9/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
325 applied = b.apply(f, **kwargs)
326 else:
--> 327 applied = getattr(b, f)(**kwargs)
328 except (TypeError, NotImplementedError):
329 if not ignore_failures:
~/USC/anaconda3/lib/python3.9/site-packages/pandas/core/internals/blocks.py in setitem(self, indexer, value)
941
942 # length checking
--> 943 check_setitem_lengths(indexer, value, values)
944 exact_match = is_exact_shape_match(values, arr_value)
945
~/USC/anaconda3/lib/python3.9/site-packages/pandas/core/indexers.py in check_setitem_lengths(indexer, value, values)
174 and len(indexer[indexer]) == len(value)
175 ):
--> 176 raise ValueError(
177 "cannot set using a list-like indexer "
178 "with a different length than the value"
ValueError: cannot set using a list-like indexer with a different length than the value
Why does it fail only on my system? Is there a problem with the Python version or the Jupyter Notebook?
Please help.

Why are the limits of the histogram data autodetected as [nan, nan] instead of discarding NaNs?

The following code generates an error
print(g['resp'])
par = {'hist': True, 'kde': False, 'fit': scipy.stats.norm, 'bins': 'auto'}
sns.distplot(g['resp'], color='blue', **par)
31 23.0
32 28.0
33 29.0
34 31.0
35 32.0
36 35.0
37 35.0
38 36.0
39 37.0
40 38.0
41 38.0
42 38.0
43 41.0
44 42.0
45 42.0
46 42.0
47 42.0
48 46.0
49 48.0
50 49.0
51 50.0
52 52.0
53 55.0
54 56.0
55 60.0
56 60.0
57 100.0
58 NaN
59 NaN
60 NaN
61 NaN
Name: resp, dtype: float64
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-42944bf1e405> in <module>
1 print(g['resp'])
2 par = {'hist': True, 'kde': False, 'fit': scipy.stats.norm, 'bins': 'auto'}
----> 3 sns.distplot(g['resp'], color='blue', **par)
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\distributions.py in distplot(a, bins, hist, kde, rug, fit, hist_kws, kde_kws, rug_kws, fit_kws, color, vertical, norm_hist, axlabel, label, ax)
223 hist_color = hist_kws.pop("color", color)
224 ax.hist(a, bins, orientation=orientation,
--> 225 color=hist_color, **hist_kws)
226 if hist_color != color:
227 hist_kws["color"] = hist_color
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\__init__.py in inner(ax, data, *args, **kwargs)
1808 "the Matplotlib list!)" % (label_namer, func.__name__),
1809 RuntimeWarning, stacklevel=2)
-> 1810 return func(ax, *args, **kwargs)
1811
1812 inner.__doc__ = _add_data_doc(inner.__doc__,
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py in hist(self, x, bins, range, density, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, normed, **kwargs)
6589 # this will automatically overwrite bins,
6590 # so that each histogram uses the same bins
-> 6591 m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
6592 m = m.astype(float) # causes problems later if it's an int
6593 if mlast is None:
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\histograms.py in histogram(a, bins, range, normed, weights, density)
708 a, weights = _ravel_and_check_weights(a, weights)
709
--> 710 bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights)
711
712 # Histogram is an integer or a float array depending on the weights.
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\histograms.py in _get_bin_edges(a, bins, range, weights)
331 "bins is not supported for weighted data")
332
--> 333 first_edge, last_edge = _get_outer_edges(a, range)
334
335 # truncate the range if needed
C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\histograms.py in _get_outer_edges(a, range)
259 if not (np.isfinite(first_edge) and np.isfinite(last_edge)):
260 raise ValueError(
--> 261 "autodetected range of [{}, {}] is not finite".format(first_edge, last_edge))
262
263 # expand empty range to avoid divide by zero
ValueError: autodetected range of [nan, nan] is not finite
It looks like the NaN values are causing trouble - how do I discard them?
I don't think there is a built-in option for that here, so a possible solution is Series.dropna to remove the missing values:
sns.distplot(g['resp'].dropna(), color='blue', **par)
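A self-contained version of the fix (a sketch; it assumes g is a DataFrame with the 'resp' column printed above):
import scipy.stats
import seaborn as sns
import matplotlib.pyplot as plt

par = {'hist': True, 'kde': False, 'fit': scipy.stats.norm, 'bins': 'auto'}
sns.distplot(g['resp'].dropna(), color='blue', **par)  # drop NaNs before binning
plt.show()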

index value vs. flight (data range A row & E row )

I want to make a scatter plot of the sum of the flight field per minute. My data is as follows:
http://python2018.byethost10.com/flights.csv
My code is as follows:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['Noto Serif CJK TC']
matplotlib.rcParams['font.family']='sans-serif'
df = pd.read_csv('flights.csv')
df['time_hour'] = pd.to_datetime(df['time_hour'])
grp = df.groupby(by=[df.time_hour.map(lambda x : (x.hour, x.minute))])
a=grp.sum()
plt.scatter(a.index, a['flight'], c='b', marker='o')
plt.xlabel('index value', fontsize=16)
plt.ylabel('flight', fontsize=16)
plt.title('scatter plot - index value vs. flight (data range A row & E row )', fontsize=20)
plt.show()
Produced the following error:
Traceback (most recent call last):
File "I:/PycharmProjects/1223/raise1/char3.py", line 10, in <module>
plt.scatter(a.index, a['flight'], c='b', marker='o')
File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\pyplot.py", line 3470, in scatter
edgecolors=edgecolors, data=data, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\__init__.py", line 1855, in inner
return func(ax, *args, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py", line 4320, in scatter
alpha=alpha
File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\collections.py", line 927, in __init__
Collection.__init__(self, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\collections.py", line 159, in __init__
offsets = np.asanyarray(offsets, float)
File "C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\numeric.py", line 544, in asanyarray
return array(a, dtype, copy=False, order=order, subok=True)
ValueError: setting an array element with a sequence.
How can I produce the following results? Thank you.
http://python2018.byethost10.com/image.png
The problem is in the aggregation: your code returns tuples in the index.
The solution is to convert the time_hour column to HH:MM strings with Series.dt.strftime:
a = df.groupby(by=[df.time_hour.dt.strftime('%H:%M')]).sum()
All together:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['Noto Serif CJK TC']
matplotlib.rcParams['font.family']='sans-serif'
#first column is the index and second column is parsed to datetimes
df=pd.read_csv('flights.csv', index_col=[0], parse_dates=[1])
a = df.groupby(by=[df.time_hour.dt.strftime('%H:%M')]).sum()
print (a)
year sched_dep_time flight air_time distance hour minute
time_hour
05:00 122793 37856 87445 11282.0 72838 366 1256
05:01 120780 44810 82113 11115.0 71168 435 1310
05:02 122793 52989 99975 11165.0 72068 515 1489
05:03 120780 57653 98323 10366.0 65137 561 1553
05:04 122793 67706 110230 10026.0 63118 661 1606
05:05 122793 75807 126426 9161.0 55371 742 1607
05:06 120780 82010 120753 10804.0 67827 799 2110
05:07 122793 90684 130339 8408.0 52945 890 1684
05:08 120780 93687 114415 10299.0 63271 922 1487
05:09 122793 101571 99526 11525.0 72915 1002 1371
05:10 122793 107252 107961 10383.0 70137 1056 1652
05:11 120780 111351 120261 10949.0 73350 1098 1551
05:12 122793 120575 135930 8661.0 57406 1190 1575
05:13 120780 118272 104763 7784.0 55886 1166 1672
05:14 122793 37289 109300 9838.0 63582 364 889
05:15 122793 42374 67193 11480.0 78183 409 1474
05:16 58377 22321 53424 4271.0 27527 216 721
plt.scatter(a.index, a['flight'], c='b', marker='o')
#rotate labels of x axis
plt.xticks(rotation=90)
plt.xlabel('index value', fontsize=16)
plt.ylabel('flight', fontsize=16)
plt.title('scatter plot - index value vs. flight (data range A row & E row )', fontsize=20)
plt.show()
Another solution is to convert the datetimes to times:
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib
matplotlib.rcParams['font.sans-serif'] = 'Noto Serif CJK TC'
matplotlib.rcParams['font.family']='sans-serif'
df=pd.read_csv('flights.csv', index_col=[0], parse_dates=[1])
a = df.groupby(by=[df.time_hour.dt.time]).sum()
print (a)
year sched_dep_time flight air_time distance hour minute
time_hour
05:00:00 122793 37856 87445 11282.0 72838 366 1256
05:01:00 120780 44810 82113 11115.0 71168 435 1310
05:02:00 122793 52989 99975 11165.0 72068 515 1489
05:03:00 120780 57653 98323 10366.0 65137 561 1553
05:04:00 122793 67706 110230 10026.0 63118 661 1606
05:05:00 122793 75807 126426 9161.0 55371 742 1607
05:06:00 120780 82010 120753 10804.0 67827 799 2110
05:07:00 122793 90684 130339 8408.0 52945 890 1684
05:08:00 120780 93687 114415 10299.0 63271 922 1487
05:09:00 122793 101571 99526 11525.0 72915 1002 1371
05:10:00 122793 107252 107961 10383.0 70137 1056 1652
05:11:00 120780 111351 120261 10949.0 73350 1098 1551
05:12:00 122793 120575 135930 8661.0 57406 1190 1575
05:13:00 120780 118272 104763 7784.0 55886 1166 1672
05:14:00 122793 37289 109300 9838.0 63582 364 889
05:15:00 122793 42374 67193 11480.0 78183 409 1474
05:16:00 58377 22321 53424 4271.0 27527 216 721
plt.scatter(a.index, a['flight'], c='b', marker='o')
plt.xticks(rotation=90)
plt.xlabel('index value', fontsize=16)
plt.ylabel('flight', fontsize=16)
plt.title('scatter plot - index value vs. flight (data range A row & E row )', fontsize=20)
plt.show()

xarray: mean of data stored via OPeNDAP

I'm using xarray's very cool pydap back-end (http://xarray.pydata.org/en/stable/io.html#opendap) to read data stored via OPeNDAP at IRI:
import xarray as xr
remote_data = xr.open_dataarray('http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/.RSMAS/.CCSM4/.hindcast/.zg/dods')
print(remote_data)
#<xarray.DataArray 'zg' (P: 2, S: 6569, M: 3, L: 45, Y: 181, X: 360)>
#[115569730800 values with dtype=float32]
#Coordinates:
# * L (L) timedelta64[ns] 0 days 12:00:00 1 days 12:00:00 ...
# * Y (Y) float32 -90.0 -89.0 -88.0 -87.0 -86.0 -85.0 -84.0 -83.0 ...
# * S (S) datetime64[ns] 1999-01-07 1999-01-08 1999-01-09 1999-01-10 ...
# * M (M) float32 1.0 2.0 3.0
# * X (X) float32 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 ...
# * P (P) int32 500 200
#Attributes:
# level_type: pressure level
# standard_name: geopotential_height
# long_name: Geopotential Height
# units: m
For reference, it's sub-seasonal forecast data where L is the lead time (45-day forecasts), S is the initialization date and M is the ensemble member.
I would like to take an ensemble mean, and I'm only interested in the 500 hPa level. However, it crashes and gives a RuntimeError: NetCDF: Access failure:
da = remote_data.sel(P=500)
da_ensmean = da.mean(dim='M')
RuntimeError Traceback (most recent call last)
<ipython-input-46-eca488e9def5> in <module>()
1 remote_data = xr.open_dataarray('http://iridl.ldeo.columbia.edu/SOURCES/.Models' '/.SubX/.RSMAS/.CCSM4/.hindcast/.zg/dods')
2 da = remote_data.sel(P=500)
----> 3 da_ensmean = da.mean(dim='M')
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/common.py in wrapped_func(self, dim, axis, skipna, keep_attrs, **kwargs)
20 keep_attrs=False, **kwargs):
21 return self.reduce(func, dim, axis, keep_attrs=keep_attrs,
---> 22 skipna=skipna, allow_lazy=True, **kwargs)
23 else:
24 def wrapped_func(self, dim=None, axis=None, keep_attrs=False,
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/dataarray.py in reduce(self, func, dim, axis, keep_attrs, **kwargs)
1359 summarized data and the indicated dimension(s) removed.
1360 """
-> 1361 var = self.variable.reduce(func, dim, axis, keep_attrs, **kwargs)
1362 return self._replace_maybe_drop_dims(var)
1363
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/variable.py in reduce(self, func, dim, axis, keep_attrs, allow_lazy, **kwargs)
1264 if dim is not None:
1265 axis = self.get_axis_num(dim)
-> 1266 data = func(self.data if allow_lazy else self.values,
1267 axis=axis, **kwargs)
1268
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/variable.py in data(self)
293 return self._data
294 else:
--> 295 return self.values
296
297 @data.setter
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/variable.py in values(self)
385 def values(self):
386 """The variable's data as a numpy.ndarray"""
--> 387 return _as_array_or_item(self._data)
388
389 @values.setter
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/variable.py in _as_array_or_item(data)
209 TODO: remove this (replace with np.asarray) once these issues are fixed
210 """
--> 211 data = np.asarray(data)
212 if data.ndim == 0:
213 if data.dtype.kind == 'M':
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
490
491 """
--> 492 return array(a, dtype, copy=False, order=order)
493
494
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/indexing.py in __array__(self, dtype)
622
623 def __array__(self, dtype=None):
--> 624 self._ensure_cached()
625 return np.asarray(self.array, dtype=dtype)
626
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/indexing.py in _ensure_cached(self)
619 def _ensure_cached(self):
620 if not isinstance(self.array, NumpyIndexingAdapter):
--> 621 self.array = NumpyIndexingAdapter(np.asarray(self.array))
622
623 def __array__(self, dtype=None):
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
490
491 """
--> 492 return array(a, dtype, copy=False, order=order)
493
494
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/indexing.py in __array__(self, dtype)
600
601 def __array__(self, dtype=None):
--> 602 return np.asarray(self.array, dtype=dtype)
603
604 def __getitem__(self, key):
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
490
491 """
--> 492 return array(a, dtype, copy=False, order=order)
493
494
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/indexing.py in __array__(self, dtype)
506 def __array__(self, dtype=None):
507 array = as_indexable(self.array)
--> 508 return np.asarray(array[self.key], dtype=None)
509
510 def transpose(self, order):
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/coding/variables.py in __getitem__(self, key)
64
65 def __getitem__(self, key):
---> 66 return self.func(self.array[key])
67
68 def __repr__(self):
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/coding/variables.py in _apply_mask(data, encoded_fill_values, decoded_fill_value, dtype)
133 for fv in encoded_fill_values:
134 condition |= data == fv
--> 135 data = np.asarray(data, dtype=dtype)
136 return np.where(condition, decoded_fill_value, data)
137
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
490
491 """
--> 492 return array(a, dtype, copy=False, order=order)
493
494
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/indexing.py in __array__(self, dtype)
506 def __array__(self, dtype=None):
507 array = as_indexable(self.array)
--> 508 return np.asarray(array[self.key], dtype=None)
509
510 def transpose(self, order):
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/backends/netCDF4_.py in __getitem__(self, key)
63 with self.datastore.ensure_open(autoclose=True):
64 try:
---> 65 array = getitem(self.get_array(), key.tuple)
66 except IndexError:
67 # Catch IndexError in netCDF4 and return a more informative
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/backends/common.py in robust_getitem(array, key, catch, max_retries, initial_delay)
114 for n in range(max_retries + 1):
115 try:
--> 116 return array[key]
117 except catch:
118 if n == max_retries:
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__getitem__()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable._get()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()
RuntimeError: NetCDF: Access failure
Breaking the calculation down removes the RuntimeError. I guess it was just too hefty a request with all the start times. It shouldn't be too difficult to put this in a loop over S (see the sketch after the output below):
da = remote_data.isel(P=0,S=0)
da_ensmean = da.mean(dim='M')
print(da_ensmean)
<xarray.DataArray 'zg' (L: 45, Y: 181, X: 360)>
array([[[5231.1445, 5231.1445, ..., 5231.1445, 5231.1445],
[5231.1445, 5231.1445, ..., 5231.1445, 5231.1445],
...,
[5056.2383, 5056.2383, ..., 5056.2383, 5056.2383],
[5056.2383, 5056.2383, ..., 5056.2383, 5056.2383]],
[[5211.346 , 5211.346 , ..., 5211.346 , 5211.346 ],
[5211.346 , 5211.346 , ..., 5211.346 , 5211.346 ],
...,
[5082.062 , 5082.062 , ..., 5082.062 , 5082.062 ],
[5082.062 , 5082.062 , ..., 5082.062 , 5082.062 ]],
...,
[[5108.8247, 5108.8247, ..., 5108.8247, 5108.8247],
[5108.8247, 5108.8247, ..., 5108.8247, 5108.8247],
...,
[5154.2173, 5154.2173, ..., 5154.2173, 5154.2173],
[5154.2173, 5154.2173, ..., 5154.2173, 5154.2173]],
[[5106.4893, 5106.4893, ..., 5106.4893, 5106.4893],
[5106.4893, 5106.4893, ..., 5106.4893, 5106.4893],
...,
[5226.0063, 5226.0063, ..., 5226.0063, 5226.0063],
[5226.0063, 5226.0063, ..., 5226.0063, 5226.0063]]], dtype=float32)
Coordinates:
* L (L) timedelta64[ns] 0 days 12:00:00 1 days 12:00:00 ...
* Y (Y) float32 -90.0 -89.0 -88.0 -87.0 -86.0 -85.0 -84.0 -83.0 ...
S datetime64[ns] 1999-01-07
* X (X) float32 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 ...
P int32 500
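The loop over S mentioned above might look like this (a rough sketch; it assumes each single-initialization request stays under the server's size limits):
import xarray as xr

means = []
for s in range(remote_data.sizes['S']):   # one initialization date at a time
    da = remote_data.isel(P=0, S=s)       # 500 hPa level, single start date
    means.append(da.mean(dim='M'))        # ensemble mean for that start
da_ensmean_all = xr.concat(means, dim='S')  # stitch the results back together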
This is a good use-case for chunking with dask, e.g.,
import xarray as xr
url = 'http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/.RSMAS/.CCSM4/.hindcast/.zg/dods'
remote_data = xr.open_dataarray(url, chunks={'S': 1, 'L': 1})
da = remote_data.sel(P=500)
da_ensmean = da.mean(dim='M')
This version will access the data server in parallel, using many smaller chunks. It will still be slow to download 231 GB of data, but your request will have much better odds of success.
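Note that the chunked version is lazy: da_ensmean is only a recipe until you ask for values. To trigger the actual download and watch its progress (assuming dask is installed, which the chunks= argument already requires):
from dask.diagnostics import ProgressBar

with ProgressBar():
    result = da_ensmean.compute()  # downloads and reduces chunk by chunk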