pyspark toPandas() IndexError: index is out of bounds

I'm seeing strange behaviour from PySpark's .toPandas() method when running it from Jupyter. For example, if I try this:
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
{"Category": 'Category B', "ID": 2, "Value": 30.10},
{"Category": 'Category C', "ID": 3, "Value": 100.01}
]
# Create data frame (where spark is a SparkSession)
df = spark.createDataFrame(data)
df.show()
I'm able to successfully create the pyspark dataframe. However, when converting to pandas I get IndexError: index is out of bounds:
IndexError Traceback (most recent call last)
<path_to_python>/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
700 type_pprinters=self.type_printers,
701 deferred_pprinters=self.deferred_printers)
--> 702 printer.pretty(obj)
703 printer.flush()
704 return stream.getvalue()
<path_to_python>/lib/python3.7/site-packages/IPython/lib/pretty.py in pretty(self, obj)
400 if cls is not object \
401 and callable(cls.__dict__.get('__repr__')):
--> 402 return _repr_pprint(obj, self, cycle)
403
404 return _default_pprint(obj, self, cycle)
<path_to_python>/lib/python3.7/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
695 """A pprint that just redirects to the normal repr function."""
696 # Find newlines and replace them with p.break_()
--> 697 output = repr(obj)
698 for idx,output_line in enumerate(output.splitlines()):
699 if idx:
<path_to_python>/lib/python3.7/site-packages/pandas/core/base.py in __repr__(self)
76 Yields Bytestring in Py2, Unicode String in py3.
77 """
---> 78 return str(self)
79
80
<path_to_python>/lib/python3.7/site-packages/pandas/core/base.py in __str__(self)
55
56 if compat.PY3:
---> 57 return self.__unicode__()
58 return self.__bytes__()
59
<path_to_python>/lib/python3.7/site-packages/pandas/core/frame.py in __unicode__(self)
632 width = None
633 self.to_string(buf=buf, max_rows=max_rows, max_cols=max_cols,
--> 634 line_width=width, show_dimensions=show_dimensions)
635
636 return buf.getvalue()
<path_to_python>/lib/python3.7/site-packages/pandas/core/frame.py in to_string(self, buf, columns, col_space, header, index, na_rep, formatters, float_format, sparsify, index_names, justify, max_rows, max_cols, show_dimensions, decimal, line_width)
719 decimal=decimal,
720 line_width=line_width)
--> 721 formatter.to_string()
722
723 if buf is None:
<path_to_python>/lib/python3.7/site-packages/pandas/io/formats/format.py in to_string(self)
596 else:
597
--> 598 strcols = self._to_str_columns()
599 if self.line_width is None: # no need to wrap around just print
600 # the whole frame
<path_to_python>/lib/python3.7/site-packages/pandas/io/formats/format.py in _to_str_columns(self)
527 str_columns = [[label] for label in self.header]
528 else:
--> 529 str_columns = self._get_formatted_column_labels(frame)
530
531 stringified = []
<path_to_python>/lib/python3.7/site-packages/pandas/io/formats/format.py in _get_formatted_column_labels(self, frame)
770 need_leadsp[x] else x]
771 for i, (col, x) in enumerate(zip(columns,
--> 772 fmt_columns))]
773
774 if self.show_row_idx_names:
<path_to_python>/lib/python3.7/site-packages/pandas/io/formats/format.py in <listcomp>(.0)
769 str_columns = [[' ' + x if not self._get_formatter(i) and
770 need_leadsp[x] else x]
--> 771 for i, (col, x) in enumerate(zip(columns,
772 fmt_columns))]
773
<path_to_python>/lib/python3.7/site-packages/pandas/io/formats/format.py in _get_formatter(self, i)
362 else:
363 if is_integer(i) and i not in self.columns:
--> 364 i = self.columns[i]
365 return self.formatters.get(i, None)
366
<path_to_python>/lib/python3.7/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
3956 if is_scalar(key):
3957 key = com.cast_scalar_indexer(key)
-> 3958 return getitem(key)
3959
3960 if isinstance(key, slice):
IndexError: index 3 is out of bounds for axis 0 with size 3
I'm not sure where the problem is. I've used this many times without problems, but this time I tried a new environment and got this issue. In case it helps, my configuration is:
Python: 3.7.6;
Pandas: 0.24.2;
PySpark: 2.4.5
Any idea?
Thanks :)

I found the issue. While minimizing the code to reproduce the error, I left out that I was also setting a pandas display option:
pd.set_option('display.max_columns', -1)
This caused the error regardless of which dataframe was being converted. To fix it, I set display.max_columns to a positive number of columns or to None.
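A minimal sketch of that fix, assuming pandas is installed. The small frame is only a stand-in for the result of df.toPandas(), built from the sample data in the question so that no Spark session is needed:

```python
import pandas as pd

# None means "no limit on displayed columns"; a positive integer caps the count.
pd.set_option("display.max_columns", None)

# Stand-in for df.toPandas(), using the sample rows from the question.
pdf = pd.DataFrame(
    {
        "Category": ["Category A", "Category B", "Category C"],
        "ID": [1, 2, 3],
        "Value": [12.40, 30.10, 100.01],
    }
)

# repr() is what the notebook calls to display the frame; it is the call
# that raised the IndexError in the traceback above.
text = repr(pdf)
print(text)
```

Because the option is global, it affects every dataframe displayed afterwards in the same session.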

Related

NotImplementedError when trying to concat dataframes in Pandas

I have a pandas dataframe where there is a column called 'CITY' with various city names. I did one-hot encoding on that column to convert the categorical features to numeric features.
dummy_CITY = pd.get_dummies(df['CITY'], drop_first=False)
dummy_CITY.head()
Next I'm trying to concatenate the new dataframe obtained after one-hot encoding, as shown below:
df_cat = pd.concat([df, dummy_CITY])
for which I'm getting the following error:
NotImplementedError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_8620/1976427847.py in <module>
----> 1 df_cat = pd.concat([df, dummy_CITY])
~\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
~\anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
305 )
306
--> 307 return op.get_result()
308
309
~\anaconda3\lib\site-packages\pandas\core\reshape\concat.py in get_result(self)
530 mgrs_indexers.append((obj._mgr, indexers))
531
--> 532 new_data = concatenate_managers(
533 mgrs_indexers, self.new_axes, concat_axis=self.bm_axis, copy=self.copy
534 )
~\anaconda3\lib\site-packages\pandas\core\internals\concat.py in concatenate_managers(mgrs_indexers, axes, concat_axis, copy)
224 fastpath = blk.values.dtype == values.dtype
225 else:
--> 226 values = _concatenate_join_units(join_units, concat_axis, copy=copy)
227 fastpath = False
228
~\anaconda3\lib\site-packages\pandas\core\internals\concat.py in _concatenate_join_units(join_units, concat_axis, copy)
486
487 has_none_blocks = any(unit.block is None for unit in join_units)
--> 488 upcasted_na = _dtype_to_na_value(empty_dtype, has_none_blocks)
489
490 to_concat = [
~\anaconda3\lib\site-packages\pandas\core\internals\concat.py in _dtype_to_na_value(dtype, has_none_blocks)
546 elif dtype.kind == "O":
547 return np.nan
--> 548 raise NotImplementedError
549
550
NotImplementedError:
I expected the new dataframe to be concatenated to the old one without any errors, as the number of rows match for both dataframes.

Seaborn pairplot not running only on a specific system

I have the following data in a file named 'Salaries.csv'. The dataset has the columns Index(['yearID', 'teamID', 'lgID', 'salary', 'num_feat'], dtype='object'). Please note that I added the column num_feat to the DataFrame myself.
I want to do a Seaborn pairplot for team 'ATL' to plot scatter plots among all numeric features in the data frame.
I have the following code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
var_set = [
"yearID",
"teamID",
"lgID",
"playerID",
"salary"
]
head_set = []
head_set.extend(var_set)
head_set.append("num_feat")
df = pd.read_csv('Salaries.csv',index_col='playerID', header=None, names=head_set)
df['num_feat'] = 100 * np.random.random_sample(df.shape[0])  # Add the num_feat column
df_copy = df
cols_with_team_ATL = df_copy.loc[df_copy.teamID=="ATL", ]
# Create the default pairplot
pairplot_fig = sns.pairplot(cols_with_team_ATL, vars=['yearID', 'salary', 'num_feat'])
plt.subplots_adjust(top=0.9)
pairplot_fig.fig.suptitle("Scatter plots among all numeric features in the data frame for teamID = ATL", fontsize=18, alpha=0.9, weight='bold')
plt.show()
The same code runs perfectly on my friend's system but not on mine. On my system it raises the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/ch/6r9p7n0j3xg1l79lz1zdkvsh0000gq/T/ipykernel_97373/3735184261.py in <module>
25 # Create the default pairplot
26 print(df.columns)
---> 27 pairplot_fig = sns.pairplot(cols_with_team_ATL, vars=['yearID', 'salary', 'num_feat'])
28 plt.subplots_adjust(top=0.9)
29 pairplot_fig.fig.suptitle("Scatter plots among all numeric features in the data frame for teamID = ATL", fontsize=18, alpha=0.9, weight='bold')
~/USC/anaconda3/lib/python3.9/site-packages/seaborn/_decorators.py in inner_f(*args, **kwargs)
44 )
45 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46 return f(**kwargs)
47 return inner_f
48
~/USC/anaconda3/lib/python3.9/site-packages/seaborn/axisgrid.py in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, height, aspect, corner, dropna, plot_kws, diag_kws, grid_kws, size)
2124 diag_kws.setdefault("legend", False)
2125 if diag_kind == "hist":
-> 2126 grid.map_diag(histplot, **diag_kws)
2127 elif diag_kind == "kde":
2128 diag_kws.setdefault("fill", True)
~/USC/anaconda3/lib/python3.9/site-packages/seaborn/axisgrid.py in map_diag(self, func, **kwargs)
1476 plot_kwargs.setdefault("hue_order", self._hue_order)
1477 plot_kwargs.setdefault("palette", self._orig_palette)
-> 1478 func(x=vector, **plot_kwargs)
1479 ax.legend_ = None
1480
~/USC/anaconda3/lib/python3.9/site-packages/seaborn/distributions.py in histplot(data, x, y, hue, weights, stat, bins, binwidth, binrange, discrete, cumulative, common_bins, common_norm, multiple, element, fill, shrink, kde, kde_kws, line_kws, thresh, pthresh, pmax, cbar, cbar_ax, cbar_kws, palette, hue_order, hue_norm, color, log_scale, legend, ax, **kwargs)
1460 if p.univariate:
1461
-> 1462 p.plot_univariate_histogram(
1463 multiple=multiple,
1464 element=element,
~/USC/anaconda3/lib/python3.9/site-packages/seaborn/distributions.py in plot_univariate_histogram(self, multiple, element, fill, common_norm, common_bins, shrink, kde, kde_kws, color, legend, line_kws, estimate_kws, **plot_kws)
426
427 # First pass through the data to compute the histograms
--> 428 for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
429
430 # Prepare the relevant data
~/USC/anaconda3/lib/python3.9/site-packages/seaborn/_core.py in iter_data(self, grouping_vars, reverse, from_comp_data)
981
982 if from_comp_data:
--> 983 data = self.comp_data
984 else:
985 data = self.plot_data
~/USC/anaconda3/lib/python3.9/site-packages/seaborn/_core.py in comp_data(self)
1055 orig = self.plot_data[var].dropna()
1056 comp_col = pd.Series(index=orig.index, dtype=float, name=var)
-> 1057 comp_col.loc[orig.index] = pd.to_numeric(axis.convert_units(orig))
1058
1059 if axis.get_scale() == "log":
~/USC/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
721
722 iloc = self if self.name == "iloc" else self.obj.iloc
--> 723 iloc._setitem_with_indexer(indexer, value, self.name)
724
725 def _validate_key(self, key, axis: int):
~/USC/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value, name)
1730 self._setitem_with_indexer_split_path(indexer, value, name)
1731 else:
-> 1732 self._setitem_single_block(indexer, value, name)
1733
1734 def _setitem_with_indexer_split_path(self, indexer, value, name: str):
~/USC/anaconda3/lib/python3.9/site-packages/pandas/core/indexing.py in _setitem_single_block(self, indexer, value, name)
1966
1967 # actually do the set
-> 1968 self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
1969 self.obj._maybe_update_cacher(clear=True)
1970
~/USC/anaconda3/lib/python3.9/site-packages/pandas/core/internals/managers.py in setitem(self, indexer, value)
353
354 def setitem(self: T, indexer, value) -> T:
--> 355 return self.apply("setitem", indexer=indexer, value=value)
356
357 def putmask(self, mask, new, align: bool = True):
~/USC/anaconda3/lib/python3.9/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, ignore_failures, **kwargs)
325 applied = b.apply(f, **kwargs)
326 else:
--> 327 applied = getattr(b, f)(**kwargs)
328 except (TypeError, NotImplementedError):
329 if not ignore_failures:
~/USC/anaconda3/lib/python3.9/site-packages/pandas/core/internals/blocks.py in setitem(self, indexer, value)
941
942 # length checking
--> 943 check_setitem_lengths(indexer, value, values)
944 exact_match = is_exact_shape_match(values, arr_value)
945
~/USC/anaconda3/lib/python3.9/site-packages/pandas/core/indexers.py in check_setitem_lengths(indexer, value, values)
174 and len(indexer[indexer]) == len(value)
175 ):
--> 176 raise ValueError(
177 "cannot set using a list-like indexer "
178 "with a different length than the value"
ValueError: cannot set using a list-like indexer with a different length than the value
Why does it fail only on my system? Is there a problem with my Python version or with Jupyter Notebook?
Please help.

How to convert coordinate columns to Point column with Shapely and Dask?

I have the following problem. My data is a huge dataframe that looks like this (this is the head of the dataframe):
import pandas as pd
import dask.dataframe as dd
data = dd.read_csv(data_path)
data.persist()
print(data.head())
     Gitter_ID_100m  x_mp_100m  y_mp_100m  Einwohner
0  100mN26840E43341    4334150    2684050         -1
1  100mN26840E43342    4334250    2684050         -1
2  100mN26840E43343    4334350    2684050         -1
3  100mN26840E43344    4334450    2684050         -1
4  100mN26840E43345    4334550    2684050         -1
I am using Dask to handle it. I now want to create a new column where the 'x_mp_100m' and 'y_mp_100m' are converted into a Shapely Point. For a single row, it would look like this:
from shapely.geometry import Point
test_df = data.head(1)
test_df = test_df.assign(geom=lambda k: Point(k.x_mp_100m,k.y_mp_100m))
print(test_df)
     Gitter_ID_100m  x_mp_100m  y_mp_100m  Einwohner                     geom
0  100mN26840E43341    4334150    2684050         -1  POINT (4334150 2684050)
I already tried the following code with Dask:
data_out = data.map_partitions(lambda df: df.assign(geom= lambda k: Point(k.x_mp_100m,k.y_mp_100m)), meta=pd.DataFrame)
When doing that, I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-b8de11d9b9b3> in <module>
----> 1 data_out.compute()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\base.py in compute(self, **kwargs)
154 dask.base.compute
155 """
--> 156 (result,) = compute(self, traverse=False, **kwargs)
157 return result
158
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\base.py in compute(*args, **kwargs)
395 keys = [x.__dask_keys__() for x in collections]
396 postcomputes = [x.__dask_postcompute__() for x in collections]
--> 397 results = schedule(dsk, keys, **kwargs)
398 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
399
~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
2319 try:
2320 results = self.gather(packed, asynchronous=asynchronous,
-> 2321 direct=direct)
2322 finally:
2323 for f in futures.values():
~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
1653 return self.sync(self._gather, futures, errors=errors,
1654 direct=direct, local_worker=local_worker,
-> 1655 asynchronous=asynchronous)
1656
1657 @gen.coroutine
~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\client.py in sync(self, func, *args, **kwargs)
671 return future
672 else:
--> 673 return sync(self.loop, func, *args, **kwargs)
674
675 def __repr__(self):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\utils.py in sync(loop, func, *args, **kwargs)
275 e.wait(10)
276 if error[0]:
--> 277 six.reraise(*error[0])
278 else:
279 return result[0]
~\AppData\Local\Continuum\anaconda3\lib\site-packages\six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None
~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\utils.py in f()
260 if timeout is not None:
261 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262 result[0] = yield future
263 except Exception as exc:
264 error[0] = sys.exc_info()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\tornado\gen.py in run(self)
1131
1132 try:
-> 1133 value = future.result()
1134 except Exception:
1135 self.had_exception = True
~\AppData\Local\Continuum\anaconda3\lib\site-packages\tornado\gen.py in run(self)
1139 if exc_info is not None:
1140 try:
-> 1141 yielded = self.gen.throw(*exc_info)
1142 finally:
1143 # Break up a reference to itself
~\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\client.py in _gather(self, futures, errors, direct, local_worker)
1498 six.reraise(type(exception),
1499 exception,
-> 1500 traceback)
1501 if errors == 'skip':
1502 bad_keys.add(key)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\six.py in reraise(tp, value, tb)
690 value = tp()
691 if value.__traceback__ is not tb:
--> 692 raise value.with_traceback(tb)
693 raise value
694 finally:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\dask\dataframe\core.py in apply_and_enforce()
3682
3683 Ensures the output has the same columns, even if empty."""
-> 3684 df = func(*args, **kwargs)
3685 if isinstance(df, (pd.DataFrame, pd.Series, pd.Index)):
3686 if len(df) == 0:
<ipython-input-16-d5710cb00158> in <lambda>()
----> 1 data_out = data.map_partitions(lambda df: df.assign(geom= lambda k: Point(k.x_mp_100m,k.y_mp_100m)), meta=pd.DataFrame)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in assign()
3549 if PY36:
3550 for k, v in kwargs.items():
-> 3551 data[k] = com.apply_if_callable(v, data)
3552 else:
3553 # <= 3.5: do all calculations first...
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\common.py in apply_if_callable()
327
328 if callable(maybe_callable):
--> 329 return maybe_callable(obj, **kwargs)
330
331 return maybe_callable
<ipython-input-16-d5710cb00158> in <lambda>()
----> 1 data_out = data.map_partitions(lambda df: df.assign(geom= lambda k: Point(k.x_mp_100m,k.y_mp_100m)), meta=pd.DataFrame)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\shapely\geometry\point.py in __init__()
47 BaseGeometry.__init__(self)
48 if len(args) > 0:
---> 49 self._set_coords(*args)
50
51 # Coordinate getters and setters
~\AppData\Local\Continuum\anaconda3\lib\site-packages\shapely\geometry\point.py in _set_coords()
130 self._geom, self._ndim = geos_point_from_py(args[0])
131 else:
--> 132 self._geom, self._ndim = geos_point_from_py(tuple(args))
133
134 coords = property(BaseGeometry._get_coords, _set_coords)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\shapely\geometry\point.py in geos_point_from_py()
207 coords = ob
208 n = len(coords)
--> 209 dx = c_double(coords[0])
210 dy = c_double(coords[1])
211 dz = None
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py in wrapper()
91 return converter(self.iloc[0])
92 raise TypeError("cannot convert the series to "
---> 93 "{0}".format(str(converter)))
94
95 wrapper.__name__ = "__{name}__".format(name=converter.__name__)
TypeError: cannot convert the series to <class 'float'>
So I think I am using the pandas assign() function in the wrong way, or there is a better-suited function; I just cannot wrap my head around it. Do you know a better way to handle this?
I also found this way:
data_out = data.map_partitions(lambda df: df.apply(lambda row: Point(row['x_mp_100m'],row['y_mp_100m']), axis=1))
But is that the most efficient way?

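What you're doing seems fine. I would either find a function that works well on a single row and use it with the apply method, or find a function that works well on a single pandas dataframe and use it with the map_partitions method.
For the error that you're getting, I would first verify that your function works on a plain pandas dataframe. The traceback suggests it does not: inside assign, the lambda receives the whole dataframe, so k.x_mp_100m is a Series rather than a single number, and Point cannot convert it to a float once there is more than one row.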
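As a rough sketch of that advice, the function below builds the geom column row by row and can be checked on a plain pandas dataframe first. The name add_geom and its point_factory parameter are hypothetical, introduced here for illustration; a plain tuple stands in for shapely's Point so the sketch runs with pandas alone:

```python
import pandas as pd


def add_geom(df, point_factory=tuple):
    """Return a copy of df with a 'geom' column built one row at a time.

    point_factory is a hypothetical parameter: any callable that accepts a
    single (x, y) pair. The default (tuple) is only a stand-in; passing
    shapely.geometry.Point would produce real Point objects instead.
    """
    out = df.copy()
    # zip yields one scalar pair per row, so the factory never sees a Series.
    out["geom"] = [
        point_factory((x, y)) for x, y in zip(df["x_mp_100m"], df["y_mp_100m"])
    ]
    return out


# Check the function on a small pandas dataframe before handing it to Dask.
sample = pd.DataFrame(
    {"x_mp_100m": [4334150, 4334250], "y_mp_100m": [2684050, 2684050]}
)
result = add_geom(sample)
print(result)
```

Once this works on pandas, the same function could be passed to data.map_partitions, with Point supplied as point_factory. Whether that is faster than the row-wise apply version would need to be measured.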

xarray: mean of data stored via OPeNDAP

I'm using xarray's very cool pydap back-end (http://xarray.pydata.org/en/stable/io.html#opendap) to read data stored via OPeNDAP at IRI:
import xarray as xr
remote_data = xr.open_dataarray('http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/.RSMAS/.CCSM4/.hindcast/.zg/dods')
print(remote_data)
#<xarray.DataArray 'zg' (P: 2, S: 6569, M: 3, L: 45, Y: 181, X: 360)>
#[115569730800 values with dtype=float32]
#Coordinates:
# * L (L) timedelta64[ns] 0 days 12:00:00 1 days 12:00:00 ...
# * Y (Y) float32 -90.0 -89.0 -88.0 -87.0 -86.0 -85.0 -84.0 -83.0 ...
# * S (S) datetime64[ns] 1999-01-07 1999-01-08 1999-01-09 1999-01-10 ...
# * M (M) float32 1.0 2.0 3.0
# * X (X) float32 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 ...
# * P (P) int32 500 200
#Attributes:
# level_type: pressure level
# standard_name: geopotential_height
# long_name: Geopotential Height
# units: m
For reference, it's sub-seasonal forecast data where L is the lead time (45-day forecasts), S is the initialization date and M is the ensemble member.
I would like to compute an ensemble mean, and I'm only interested in the 500 hPa level. However, it crashes with RuntimeError: NetCDF: Access failure:
da = remote_data.sel(P=500)
da_ensmean = da.mean(dim='M')
RuntimeError Traceback (most recent call last)
<ipython-input-46-eca488e9def5> in <module>()
1 remote_data = xr.open_dataarray('http://iridl.ldeo.columbia.edu/SOURCES/.Models' '/.SubX/.RSMAS/.CCSM4/.hindcast/.zg/dods')
2 da = remote_data.sel(P=500)
----> 3 da_ensmean = da.mean(dim='M')
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/common.py in wrapped_func(self, dim, axis, skipna, keep_attrs, **kwargs)
20 keep_attrs=False, **kwargs):
21 return self.reduce(func, dim, axis, keep_attrs=keep_attrs,
---> 22 skipna=skipna, allow_lazy=True, **kwargs)
23 else:
24 def wrapped_func(self, dim=None, axis=None, keep_attrs=False,
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/dataarray.py in reduce(self, func, dim, axis, keep_attrs, **kwargs)
1359 summarized data and the indicated dimension(s) removed.
1360 """
-> 1361 var = self.variable.reduce(func, dim, axis, keep_attrs, **kwargs)
1362 return self._replace_maybe_drop_dims(var)
1363
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/variable.py in reduce(self, func, dim, axis, keep_attrs, allow_lazy, **kwargs)
1264 if dim is not None:
1265 axis = self.get_axis_num(dim)
-> 1266 data = func(self.data if allow_lazy else self.values,
1267 axis=axis, **kwargs)
1268
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/variable.py in data(self)
293 return self._data
294 else:
--> 295 return self.values
296
297 @data.setter
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/variable.py in values(self)
385 def values(self):
386 """The variable's data as a numpy.ndarray"""
--> 387 return _as_array_or_item(self._data)
388
389 @values.setter
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/variable.py in _as_array_or_item(data)
209 TODO: remove this (replace with np.asarray) once these issues are fixed
210 """
--> 211 data = np.asarray(data)
212 if data.ndim == 0:
213 if data.dtype.kind == 'M':
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
490
491 """
--> 492 return array(a, dtype, copy=False, order=order)
493
494
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/indexing.py in __array__(self, dtype)
622
623 def __array__(self, dtype=None):
--> 624 self._ensure_cached()
625 return np.asarray(self.array, dtype=dtype)
626
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/indexing.py in _ensure_cached(self)
619 def _ensure_cached(self):
620 if not isinstance(self.array, NumpyIndexingAdapter):
--> 621 self.array = NumpyIndexingAdapter(np.asarray(self.array))
622
623 def __array__(self, dtype=None):
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
490
491 """
--> 492 return array(a, dtype, copy=False, order=order)
493
494
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/indexing.py in __array__(self, dtype)
600
601 def __array__(self, dtype=None):
--> 602 return np.asarray(self.array, dtype=dtype)
603
604 def __getitem__(self, key):
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
490
491 """
--> 492 return array(a, dtype, copy=False, order=order)
493
494
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/indexing.py in __array__(self, dtype)
506 def __array__(self, dtype=None):
507 array = as_indexable(self.array)
--> 508 return np.asarray(array[self.key], dtype=None)
509
510 def transpose(self, order):
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/coding/variables.py in __getitem__(self, key)
64
65 def __getitem__(self, key):
---> 66 return self.func(self.array[key])
67
68 def __repr__(self):
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/coding/variables.py in _apply_mask(data, encoded_fill_values, decoded_fill_value, dtype)
133 for fv in encoded_fill_values:
134 condition |= data == fv
--> 135 data = np.asarray(data, dtype=dtype)
136 return np.where(condition, decoded_fill_value, data)
137
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
490
491 """
--> 492 return array(a, dtype, copy=False, order=order)
493
494
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/indexing.py in __array__(self, dtype)
506 def __array__(self, dtype=None):
507 array = as_indexable(self.array)
--> 508 return np.asarray(array[self.key], dtype=None)
509
510 def transpose(self, order):
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/backends/netCDF4_.py in __getitem__(self, key)
63 with self.datastore.ensure_open(autoclose=True):
64 try:
---> 65 array = getitem(self.get_array(), key.tuple)
66 except IndexError:
67 # Catch IndexError in netCDF4 and return a more informative
~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/backends/common.py in robust_getitem(array, key, catch, max_retries, initial_delay)
114 for n in range(max_retries + 1):
115 try:
--> 116 return array[key]
117 except catch:
118 if n == max_retries:
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__getitem__()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable._get()
netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()
RuntimeError: NetCDF: Access failure
Breaking the calculation down removes the RuntimeError. I guess the request was simply too large with all the start times included. It shouldn't be too difficult to put this in a loop over S:
da = remote_data.isel(P=0,S=0)
da_ensmean = da.mean(dim='M')
print(da_ensmean)
<xarray.DataArray 'zg' (L: 45, Y: 181, X: 360)>
array([[[5231.1445, 5231.1445, ..., 5231.1445, 5231.1445],
[5231.1445, 5231.1445, ..., 5231.1445, 5231.1445],
...,
[5056.2383, 5056.2383, ..., 5056.2383, 5056.2383],
[5056.2383, 5056.2383, ..., 5056.2383, 5056.2383]],
[[5211.346 , 5211.346 , ..., 5211.346 , 5211.346 ],
[5211.346 , 5211.346 , ..., 5211.346 , 5211.346 ],
...,
[5082.062 , 5082.062 , ..., 5082.062 , 5082.062 ],
[5082.062 , 5082.062 , ..., 5082.062 , 5082.062 ]],
...,
[[5108.8247, 5108.8247, ..., 5108.8247, 5108.8247],
[5108.8247, 5108.8247, ..., 5108.8247, 5108.8247],
...,
[5154.2173, 5154.2173, ..., 5154.2173, 5154.2173],
[5154.2173, 5154.2173, ..., 5154.2173, 5154.2173]],
[[5106.4893, 5106.4893, ..., 5106.4893, 5106.4893],
[5106.4893, 5106.4893, ..., 5106.4893, 5106.4893],
...,
[5226.0063, 5226.0063, ..., 5226.0063, 5226.0063],
[5226.0063, 5226.0063, ..., 5226.0063, 5226.0063]]], dtype=float32)
Coordinates:
* L (L) timedelta64[ns] 0 days 12:00:00 1 days 12:00:00 ...
* Y (Y) float32 -90.0 -89.0 -88.0 -87.0 -86.0 -85.0 -84.0 -83.0 ...
S datetime64[ns] 1999-01-07
* X (X) float32 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 ...
P int32 500

This is a good use-case for chunking with dask, e.g.,
import xarray as xr
url = 'http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/.RSMAS/.CCSM4/.hindcast/.zg/dods'
remote_data = xr.open_dataarray(url, chunks={'S': 1, 'L': 1})
da = remote_data.sel(P=500)
da_ensmean = da.mean(dim='M')
This version will access the data server in parallel, using many smaller chunks. It will still be slow to download 231 GB of data, but your request will have much better odds of success.
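The 231 GB figure can be checked with a little arithmetic. This sketch only uses the dimension sizes and float32 dtype printed in the question, and counts decimal gigabytes (1 GB = 1e9 bytes):

```python
# Dimension sizes copied from the printed DataArray, for a single P level.
sizes = {"S": 6569, "M": 3, "L": 45, "Y": 181, "X": 360}
BYTES_PER_VALUE = 4  # float32

values_one_level = 1
for n in sizes.values():
    values_one_level *= n

total_bytes = values_one_level * BYTES_PER_VALUE
# Data behind a single start date, as in the isel(P=0, S=0) workaround.
bytes_per_start = total_bytes // sizes["S"]

print(f"P=500 level: {total_bytes / 1e9:.0f} GB")            # P=500 level: 231 GB
print(f"one start date: {bytes_per_start / 1e6:.1f} MB")     # one start date: 35.2 MB
```

A single start date is only about 35 MB, which is consistent with the per-S request succeeding where the full request failed.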

Folium Choropleth + GeoJSON raises AttributeError: 'NoneType'

I'm trying to build a choropleth using folium, which offers a great link between GeoJSON, Pandas and Leaflet.
The GeoJSON format looks like this:
{
"type":"FeatureCollection",
"features":[
{
"type":"Feature",
"geometry":
{
"type":"Polygon",
"coordinates":[[[-1.6704591323124895,49.62681486270549], .....
{
"insee":"50173",
"nom":"Équeurdreville-Hainneville",
"wikipedia":"fr:Équeurdreville-Hainneville",
"surf_m2":12940306}},
Pandas DataFrame:
postal_count.head(5)
Out[98]:
  Code_commune_INSEE  CP_count
0              75120       723
1              75115       698
2              75112       671
3              75118       627
4              75111       622
"Code_commune_INSEE" corresponds to the attribute "insee" in the GeoJSON. I'd like to build a choropleth using the variable "CP_count" in the above DataFrame.
Here is my code (snippet from this notebook):
map_france = folium.Map(location=[47.000000, 2.000000], zoom_start=6)
map_france.choropleth(
geo_str=open(geo_path + 'simplified_communes100m.json').read(),
data=postal_count,
columns=['Code_commune_INSEE', 'CP_count'],
key_on='feature.geometry.properties.insee',
fill_color='YlGn',
)
map_france.save(table_path + 'choro_test1.html')
I keep getting this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-83-ea0fd2c1c207> in <module>()
8 fill_color='YlGn',
9 )
---> 10 map_france.save('/media/flo/Stockage/Data/MesAides/map/choro_test1.html')
/home/flo/.virtualenvs/mesaides/lib/python3.4/site-packages/folium/element.py in save(self, outfile, close_file, **kwargs)
151
152 root = self.get_root()
--> 153 html = root.render(**kwargs)
154 fid.write(html.encode('utf8'))
155 if close_file:
/home/flo/.virtualenvs/mesaides/lib/python3.4/site-packages/folium/element.py in render(self, **kwargs)
357 """Renders the HTML representation of the element."""
358 for name, child in self._children.items():
--> 359 child.render(**kwargs)
360 return self._template.render(this=self, kwargs=kwargs)
361
/home/flo/.virtualenvs/mesaides/lib/python3.4/site-packages/folium/element.py in render(self, **kwargs)
665
666 for name, element in self._children.items():
--> 667 element.render(**kwargs)
/home/flo/.virtualenvs/mesaides/lib/python3.4/site-packages/folium/element.py in render(self, **kwargs)
661 script = self._template.module.__dict__.get('script', None)
662 if script is not None:
--> 663 figure.script.add_children(Element(script(self, kwargs)),
664 name=self.get_name())
665
/home/flo/.virtualenvs/mesaides/lib/python3.4/site-packages/jinja2/runtime.py in __call__(self, *args, **kwargs)
434 raise TypeError('macro %r takes not more than %d argument(s)' %
435 (self.name, len(self.arguments)))
--> 436 return self._func(*arguments)
437
438 def __repr__(self):
<template> in macro(l_this, l_kwargs)
/home/flo/.virtualenvs/mesaides/lib/python3.4/site-packages/jinja2/runtime.py in call(_Context__self, _Context__obj, *args, **kwargs)
194 args = (__self.environment,) + args
195 try:
--> 196 return __obj(*args, **kwargs)
197 except StopIteration:
198 return __self.environment.undefined('value was undefined because '
/home/flo/.virtualenvs/mesaides/lib/python3.4/site-packages/folium/features.py in style_data(self)
352
353 for feature in self.data['features']:
--> 354 feature.setdefault('properties', {}).setdefault('style', {}).update(self.style_function(feature)) # noqa
355 return json.dumps(self.data, sort_keys=True)
356
/home/flo/.virtualenvs/mesaides/lib/python3.4/site-packages/folium/folium.py in style_function(x)
671 "color": line_color,
672 "fillOpacity": fill_opacity,
--> 673 "fillColor": color_scale_fun(x)
674 }
675
/home/flo/.virtualenvs/mesaides/lib/python3.4/site-packages/folium/folium.py in color_scale_fun(x)
659 def color_scale_fun(x):
660 return color_range[len(
--> 661 [u for u in color_domain if
662 u <= color_data[get_by_key(x, key_on)]])]
663 else:
/home/flo/.virtualenvs/mesaides/lib/python3.4/site-packages/folium/folium.py in <listcomp>(.0)
660 return color_range[len(
661 [u for u in color_domain if
--> 662 u <= color_data[get_by_key(x, key_on)]])]
663 else:
664 def color_scale_fun(x):
/home/flo/.virtualenvs/mesaides/lib/python3.4/site-packages/folium/folium.py in get_by_key(obj, key)
655 return (obj.get(key, None) if len(key.split('.')) <= 1 else
656 get_by_key(obj.get(key.split('.')[0], None),
--> 657 '.'.join(key.split('.')[1:])))
658
659 def color_scale_fun(x):
/home/flo/.virtualenvs/mesaides/lib/python3.4/site-packages/folium/folium.py in get_by_key(obj, key)
655 return (obj.get(key, None) if len(key.split('.')) <= 1 else
656 get_by_key(obj.get(key.split('.')[0], None),
--> 657 '.'.join(key.split('.')[1:])))
658
659 def color_scale_fun(x):
/home/flo/.virtualenvs/mesaides/lib/python3.4/site-packages/folium/folium.py in get_by_key(obj, key)
653
654 def get_by_key(obj, key):
--> 655 return (obj.get(key, None) if len(key.split('.')) <= 1 else
656 get_by_key(obj.get(key.split('.')[0], None),
657 '.'.join(key.split('.')[1:])))
AttributeError: 'NoneType' object has no attribute 'get'
I tried playing with key_on='feature.geometry.properties.insee' without any success.

There were two problems:
1 - The correct path to the 'insee' property is: key_on='feature.properties.insee'
The best way to find the right key_on is to explore the GeoJSON dict and make sure you are pointing at the right properties.
2 - Once you have the right key_on parameter, you need to make sure that all the keys available in the GeoJSON are also contained in your Pandas DataFrame (otherwise it will raise a KeyError).
In this case, I used the following command line to get all the insee keys contained in my GeoJSON:
ogrinfo -ro -al communes-20150101-100m.shp -geom=NO | grep insee > list_code_insee.txt
If you are experiencing the same issue, this should solve your problem.
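Both checks can also be sketched in plain Python before calling folium. Here get_by_key is a simplified stand-in for the folium helper seen in the traceback, not folium's own code, and check_key_on, the sample GeoJSON and known_codes are hypothetical names and data used only for illustration:

```python
def get_by_key(obj, key):
    """Follow a dotted path such as 'properties.insee' through nested dicts."""
    for part in key.split("."):
        if not isinstance(obj, dict):
            return None  # the path left the dict structure
        obj = obj.get(part)
    return obj


def check_key_on(geojson, key_on, known_codes):
    """Return (number of unresolved features, sorted codes missing from known_codes)."""
    # key_on starts with 'feature.', which refers to each item of 'features'.
    path = key_on.split(".", 1)[1]
    values = [get_by_key(feature, path) for feature in geojson["features"]]
    unresolved = sum(value is None for value in values)
    missing = sorted(
        {v for v in values if v is not None and v not in known_codes}
    )
    return unresolved, missing


# Tiny illustrative GeoJSON with the same layout as in the question.
geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": {"type": "Polygon"},
         "properties": {"insee": "50173"}},
        {"type": "Feature", "geometry": {"type": "Polygon"},
         "properties": {"insee": "75120"}},
    ],
}
known_codes = {"75120"}  # codes assumed to be present in the DataFrame

wrong = check_key_on(geojson, "feature.geometry.properties.insee", known_codes)
right = check_key_on(geojson, "feature.properties.insee", known_codes)
print(wrong)  # (2, [])
print(right)  # (0, ['50173'])
```

A non-zero first value points at a wrong key_on path (problem 1), while a non-empty list shows GeoJSON keys that the DataFrame does not cover (problem 2).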

I had the same problem in JupyterLab (on labs.cognitiveclass.ai) using Folium 0.5.0. I then copied my code and ran it in PyCharm, and it worked! I don't understand why; perhaps there is some backend issue (?)
If you want to display a folium map outside of a Jupyter notebook, you have to save the map to HTML:
map_france.save('map_france.html')
and open the html in your browser.
