I have the following class:
import numpy as np
import pandas as pd

class Analysis():
    def __init__(self, file_dir):
        self.path = file_dir  # file path directory

    def getData(self):
        return pd.read_csv(self.path)  # create a pandas dataframe

    def getStd(self):
        return self.getData().loc['1':'5'].apply(lambda x: x.std())  # calculate the standard deviation of all columns

    def getHighlight(self):
        # a function to highlight the df based on the given conditions
        def highlight(x):
            c1 = 'background-color:red'
            c2 = 'background-color:yellow'
            c3 = 'background-color:green'
            # rows over which the highlighting function should apply
            r = ['1', '2', '3', '4', '5']
            # first boolean mask for selecting the df elements
            m1 = (x.loc[r] > x.loc['USL']) | (x.loc[r] < x.loc['LSL'])
            # second boolean mask for selecting the df elements
            m2 = (x.loc[r] == x.loc['USL']) | (x.loc[r] == x.loc['LSL'])
            # DataFrame with the same index and column names as the original, filled with empty strings
            df1 = pd.DataFrame('', index=x.index, columns=x.columns)
            # modify the values of the df1 rows selected by the boolean masks
            df1.loc[r, :] = np.select([m1, m2], [c1, c2], default=c3)
            return df1
        # apply the highlight function to the df
        return self.getData().style.apply(highlight, axis=None)
The getData method returns the df like this:
my_analysis = Analysis(path_to_file)
my_analysis.getData()
A-A A-B A-C A-D A-E
Tg 0.37 10.24 5.02 0.63 20.30
USL 0.39 10.26 5.04 0.65 20.32
LSL 0.35 10.22 5.00 0.63 20.28
1 0.35 10.23 5.05 0.65 20.45
2 0.36 10.19 5.07 0.67 20.25
3 0.34 10.25 5.03 0.66 20.33
4 0.35 10.20 5.08 0.69 20.22
5 0.33 10.17 5.05 0.62 20.40
Max 0.36 10.25 5.08 0.69 20.45
Min 0.33 10.17 5.03 0.62 20.22
The getHighlight method has an inner function which is applied to the df in order to highlight its elements based on the given masks, and it would output something like this:
my_analysis.getHighlight()
My question is: what is the pythonic or elegant way of defining the inner function inside the class method?
Disclaimer: the following remarks represent my opinion about the topic of pythonic code.
Avoid Inner Functions
You should avoid inner functions at all costs. Sometimes they're necessary, but most of the time they're an indication that you might want to refactor your code.
Avoid re-reading multiple times
I would also avoid calling pd.read_csv every time I want to perform some operation on the data. Unless there's a good reason to read the file over and over again, it's more performant to read it once and store it in a class attribute, or property.
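A minimal sketch of that idea, using functools.cached_property (Python 3.8+) together with the CSV-reading logic from the question, so the file is only read on first access:

import pandas as pd
from functools import cached_property

class Analysis:
    def __init__(self, file_dir):
        self.path = file_dir

    @cached_property
    def data(self) -> pd.DataFrame:
        # Read the file once; later accesses reuse the cached dataframe.
        return pd.read_csv(self.path)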
PEP-8 Naming Conventions
Another important thing to consider, if you're trying to make your code more pythonic, is to follow the PEP-8 naming conventions, unless you're working on legacy code that does not follow PEP-8.
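For instance, the camelCase method names from the question would become snake_case:

class Analysis:
    # PEP-8 names: getData -> get_data, getStd -> get_std
    def get_data(self):
        ...

    def get_std(self):
        ...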
Class Overkill
Finally, I think that creating a class for what you're doing seems a little overkill. Most of your methods are simply transformations that could easily be converted to functions, as sketched below. Aside from making your code less complex, it would improve its reusability.
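As a rough sketch of that refactoring (assuming the data has already been read into a dataframe, e.g. via the cached property above), the standard-deviation method could become a plain function:

import pandas as pd

def get_std(df: pd.DataFrame) -> pd.Series:
    # Equivalent to the question's apply(lambda x: x.std()), applied
    # to the rows labelled '1' to '5'.
    return df.loc["1":"5"].std()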
How I would write the Analysis class
from __future__ import absolute_import, annotations
from pathlib import Path
from typing import Any, Collection, Iterable, Type, Union
import numpy as np
import pandas as pd
from pandas.core.dtypes.dtypes import ExtensionDtype # type: ignore
# Custom types for type hinting
Axes = Collection[Any]
NpDtype = Union[
str, np.dtype, Type[Union[str, float, int, complex, bool, object]]
]
Dtype = Union["ExtensionDtype", NpDtype]
# Auxiliary functions
def is_iterable_not_string(iterable: Any) -> bool:
"""Return True, if `iterable` is an iterable object, and not a string.
Parameters
----------
iterable: Any
The object to check whether it's an iterable except for strings,
or not.
Returns
-------
bool
True, if object is iterable, but not a string.
Otherwise, if object isn't an iterable, or if it's a string, return
False.
Examples
--------
>>> import numpy as np
>>> import pandas as pd
>>> class FakeIterable(int):
... def __iter__(self): pass
>>> print(is_iterable_not_string('abcde'))
False
>>> print(is_iterable_not_string(bytes(12345)))
False
>>> print(is_iterable_not_string(12345))
False
>>> print(is_iterable_not_string(123.45))
False
>>> print(is_iterable_not_string(type))
False
>>> print(is_iterable_not_string(list)) # Type list isn't iterable
False
>>> print(is_iterable_not_string(object))
False
>>> print(is_iterable_not_string(None))
False
>>> print(is_iterable_not_string(list())) # Empty list is still iterable
True
>>> # `FakeIterable` has a method `__iter__`, therefore it's considered
>>> # iterable, even though it isn't.
>>> print(is_iterable_not_string(FakeIterable(10)))
True
>>> print(is_iterable_not_string(list('abcde')))
True
>>> print(is_iterable_not_string(tuple('abcde')))
True
>>> print(is_iterable_not_string(set('abcde')))
True
>>> print(is_iterable_not_string(np.array(list('abcdef'))))
True
>>> print(is_iterable_not_string({col: [1, 2, 3, 4] for col in 'abcde'}))
True
>>> print(is_iterable_not_string(
... pd.DataFrame({col: [1, 2, 3, 4] for col in 'abcde'}))
... )
True
>>> print(is_iterable_not_string(pd.DataFrame()))
True
Notes
-----
    In Python, any object that implements a method called `__iter__` is
    considered an “iterable”. This means that you can, in theory, fake an “iterable”
object, by creating a method called `__iter__` that doesn't contain any
real implementation. For a concrete case, see the examples section.
Python common iterable objects:
- strings
- bytes
- lists
- tuples
- sets
- dictionaries
Python common non-iterable objects:
- integers
- floats
- None
- types
- objects
"""
return (not isinstance(iterable, (bytes, str))
and isinstance(iterable, Iterable))
def prepare_dict(data: dict) -> dict:
"""Transform non-iterable dictionary values into lists.
Parameters
----------
data : dict
The dictionary to convert non-iterable values into lists.
Returns
-------
dict
Dictionary with non-iterable values converted to lists.
Examples
--------
>>> import pandas as pd
>>> d = {'a': '1', 'b': 2}
>>> prepare_dict(d)
{'a': ['1'], 'b': [2]}
>>> pd.DataFrame(d) # doctest: +ELLIPSIS
Traceback (most recent call last):
...
ValueError: If using all scalar values, you must pass an index
>>> pd.DataFrame(prepare_dict(d))
a b
0 1 2
Notes
-----
Use this function to prepare dictionaries, before calling
`pandas.DataFrame`, to make sure all values have the correct format.
"""
return {
key: value if is_iterable_not_string(value) else [value]
for key, value in data.items()
}
def check_dict_value_lens(data: dict) -> bool:
"""Check whether all values from dictionary have the same lenght.
Parameters
----------
data : dict
        The dictionary whose value lengths should be checked.
Returns
-------
bool
        True, if all `data` values have the same length. False otherwise.
"""
    min_len = min(len(value) for value in data.values())
    return all(len(value) == min_len for value in data.values())
def read_file(path: Path | str, **kwargs: Any) -> pd.DataFrame:
"""
Read a DataFrame from a file.
    Supported file types are:
    - `.csv`, `.txt`
    - `.xlsx`, `.xls`, `.xlsm`, `.xlsb` (Excel files)
    - `.json`
    - `.pickle`
    - `.parquet`, `.pq`
    - `.feather`
    - `.html`
Parameters
----------
path : Path | str
The path to the file.
kwargs : Any
Keyword arguments to pass to pandas io functions.
Returns
-------
pd.DataFrame
The DataFrame read from the file.
Raises
------
ValueError
        If the file type is not supported.
FileNotFoundError
If the file doesn't exist.
"""
_path = Path(path)
path = str(path)
if not _path.is_file():
raise FileNotFoundError(f"File {path} does not exist.")
if _path.suffix in [".csv", ".txt"]:
return pd.read_csv(path, **kwargs)
if ".xls" in _path.suffix:
return pd.read_excel(path, **kwargs)
if _path.suffix == ".json":
return pd.read_json(path, **kwargs)
if _path.suffix == ".pickle":
return pd.read_pickle(path, **kwargs)
if _path.suffix == ".html":
return pd.read_html(path, **kwargs)
if _path.suffix == ".feather":
return pd.read_feather(path, **kwargs)
if _path.suffix in [".parquet", ".pq"]:
return pd.read_parquet(path, **kwargs)
raise ValueError(f"File {path} has an unknown extension.")
def highlight(df: pd.DataFrame) -> pd.DataFrame:
"""Highlight a DataFrame.
Parameters
----------
df : pd.DataFrame
The DataFrame to highlight. Required indexes:
- ["USL", "LSL", "1", "2", "3", "4", "5"]
Returns
-------
pd.DataFrame
The DataFrame with highlighted rows.
"""
# The dataframe cells background colors.
c1: str = "background-color:red"
c2: str = "background-color:yellow"
c3: str = "background-color:green"
# Rows over which the highlighting function should apply
rows: list[str] = ["1", "2", "3", "4", "5"]
# First boolean mask for selecting the df elements
m1 = (df.loc[rows] > df.loc["USL"]) | (df.loc[rows] < df.loc["LSL"])
# Second boolean mask for selecting the df elements
m2 = (df.loc[rows] == df.loc["USL"]) | (df.loc[rows] == df.loc["LSL"])
    # DataFrame with the same index and column names as the original,
    # filled with empty strings.
df_highlight = pd.DataFrame("", index=df.index, columns=df.columns)
    # Change the values of the highlighted rows, selected by the boolean masks
df_highlight.loc[rows, :] = np.select(
[m1, m2], [c1, c2], default=c3
)
return df_highlight
class Analysis:
"""
    Read a dataframe, and help perform some analysis on the data.
Parameters
----------
path_or_data : str | Path | pd.DataFrame
The path to a file, or a dataframe to analyze.
Attributes
----------
_data : pd.DataFrame
The data read from the file.
_path : str | Path
The path to the file.
Examples
--------
>>> data = {
... 'A-A': [
... 0.37, 0.39, 0.35, 0.35, 0.36, 0.34, 0.35, 0.33, 0.36, 0.33,
... ],
... 'A-B': [
... 10.24, 10.26, 10.22, 10.23, 10.19, 10.25, 10.2, 10.17, 10.25,
... 10.17,
... ],
... 'A-C': [
... 5.02, 5.04, 5.0, 5.05, 5.07, 5.03, 5.08, 5.05, 5.08, 5.03,
... ],
... 'A-D': [
... 0.63, 0.65, 0.63, 0.65, 0.67, 0.66, 0.69, 0.62, 0.69, 0.62,
... ],
... 'A-E': [
... 20.3, 20.32, 20.28, 20.45, 20.25, 20.33, 20.22, 20.4,
... 20.45, 20.22,
... ],
... }
>>> index = ['Tg', 'USL', 'LSL', '1', '2', '3', '4', '5', 'Max', 'Min']
>>> analysis = Analysis.from_dict(data, index=index)
>>> analysis.get_std()
A-A 0.011402
A-B 0.031937
A-C 0.019494
A-D 0.025884
A-E 0.097211
dtype: float64
"""
_path: Path | str | None = None
_data: pd.DataFrame | None = None
    @property
def path(self) -> str | Path:
"""Get the path to the file.
Returns
-------
str | Path
The path to the file.
Raises
------
ValueError
If `_path` is `None`.
"""
if self._path is None:
raise ValueError("Path not set.")
return str(self._path)
    @path.setter
def path(self, path: str | Path):
"""Set the path of the file to analyze.
Parameters
----------
path : str | Path
The path of the file to analyze.
            Path should point to a file of a type supported by `read_file`.
Raises
------
FileNotFoundError
            If `path` does not point to an existing file.
"""
_path = Path(path)
if _path.is_file():
self._path = str(path)
else:
raise FileNotFoundError(f"Path {path} does not exist.")
    @property
def data(self) -> pd.DataFrame:
"""Dataframe read from `path`.
Returns
-------
pd.DataFrame
The dataframe read from `path`.
"""
if self._data is None:
self._data = self.get_data()
return self._data
    @data.setter
def data(self, data: pd.DataFrame):
"""Set the data to analyze.
Parameters
----------
data : pd.DataFrame
The data to analyze.
"""
self._data = data
def __init__(self, path_or_data: str | Path | pd.DataFrame):
"""Initialize the Analyzer.
Parameters
----------
path_or_data : str | Path | pd.DataFrame
The path to a file, or a dataframe to analyze.
Raises
------
ValueError
            If `path_or_data` is not a `str`, `Path`, or `pd.DataFrame`.
"""
if isinstance(path_or_data, (str, Path)):
self.path = path_or_data
elif isinstance(path_or_data, pd.DataFrame):
self.data = path_or_data
else:
raise ValueError(f"Invalid type {type(path_or_data)}.")
def get_data(self) -> pd.DataFrame:
"""Read the data from the file.
Returns
-------
pd.DataFrame
The dataframe read from the `path` property.
"""
return read_file(self.path)
def get_std(self) -> pd.Series:
"""Calcuate the standard deviation of every column.
Returns
-------
pd.Series
The standard deviation of every column.
"""
return self.data.loc["1":"5"].apply(lambda x: x.std()) # type: ignore
def highlight_frame(
self, round_values: int | None = None
) -> pd.io.formats.style.Styler: # type: ignore
"""Highlight dataframe, based on some condition.
Parameters
----------
round_values: int | None
If defined, sets the precision of the Styler object with the
highlighted dataframe.
Returns
-------
pd.io.formats.style.Styler
The Styler object with the highlighted dataframe.
"""
highlight_df = self.data.style.apply(highlight, axis=None)
if isinstance(round_values, int) and round_values >= 0:
return highlight_df.format(precision=round_values)
return highlight_df
    @classmethod
def from_dict(
cls,
data: dict,
index: Axes | None = None,
columns: Axes | None = None,
dtype: Dtype | None = None,
) -> Analysis:
"""Create an Analysis object from a dictionary.
Parameters
----------
data : dict
The dictionary to create the Analysis object from.
index : Index or array-like
Index to use for resulting frame. Defaults to RangeIndex, if
no indexing information part of input data and no index provided.
columns : Index or array-like
Column labels to use for resulting frame when data doesn't have
them, defaulting to RangeIndex(0, 1, 2, ..., n).
If data contains column labels, will perform column selection
instead.
dtype : dtype, default None
Data type to force. Only a single dtype allowed. If None, infer.
Returns
-------
Analysis
An instance of the `Analysis` class.
Raises
------
ValueError
            If dictionary values have different lengths.
"""
data = prepare_dict(data)
if check_dict_value_lens(data):
return cls(
pd.DataFrame(data, index=index, columns=columns, dtype=dtype)
)
raise ValueError(
f"Dictionary values don't have the same lenghts.\nData: {data}"
)
if __name__ == "__main__":
import doctest
doctest.testmod()
I have a dataframe in which one of the columns holds a Series of shapely Points, and another dataframe in which one column holds a Series of Polygons.
df.head()
hash number street unit \
2024459 283e04eca5c4932a SN AVENIDA DOUTOR SEVERIANO DE ALMEIDA NaN
2024460 1a92a1c3cba7941a 485 AVENIDA DOUTOR SEVERIANO DE ALMEIDA NaN
2024461 837341c45de519a3 475 AVENIDA DOUTOR SEVERIANO DE ALMEIDA NaN
city district region postcode id geometry
2024459 Jaguari NaN RS 97760-000 NaN POINT (-54.69445 -29.49421)
2024460 Jaguari NaN RS 97760-000 NaN POINT (-54.69445 -29.49421)
2024461 Jaguari NaN RS 97760-000 NaN POINT (-54.69445 -29.49421)
poly_df.head()
centroids geometry
0 POINT (-29.31067315122428 -54.64176359828149) POLYGON ((-54.64069 -29.31161, -54.64069 -29.3...
1 POINT (-29.31067315122428 -54.63961783106958) POLYGON ((-54.63854 -29.31161, -54.63854 -29.3...
2 POINT (-29.31067315122428 -54.637472063857665) POLYGON ((-54.63640 -29.31161, -54.63640 -29.3...
I'm checking whether the Point belongs to the Polygon and, if it does, inserting the Point object into a cell of the other dataframe. However, I'm getting the following error:
Traceback (most recent call last):
File "/tmp/ipykernel_4771/1967309101.py", line 1, in <module>
df.loc[idx, 'centroids'] = poly_mun.loc[ix, 'centroids']
File ".local/lib/python3.8/site-packages/pandas/core/indexing.py", line 692, in __setitem__
iloc._setitem_with_indexer(indexer, value, self.name)
File ".local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1599, in _setitem_with_indexer
self.obj[key] = infer_fill_value(value)
File ".local/lib/python3.8/site-packages/pandas/core/dtypes/missing.py", line 516, in infer_fill_value
val = np.array(val, copy=False)
TypeError: float() argument must be a string or a number, not 'Point'
I'm using the following command line:
df.loc[idx, 'centroids'] = poly_df.loc[ix, 'centroids']
I have already tried .at as well.
Thanks
You can't create a new column in pandas with a shapely geometry using loc:
In [1]: import pandas as pd, shapely.geometry
In [2]: df = pd.DataFrame({'mycol': [1, 2, 3]})
In [3]: df.loc[0, "centroid"] = shapely.geometry.Point([0, 0])
/Users/mikedelgado/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/indexing.py:1642: ShapelyDeprecationWarning: The array interface is deprecated and will no longer work in Shapely 2.0. Convert the '.coords' to a numpy array instead.
self.obj[key] = infer_fill_value(value)
/Users/mikedelgado/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/dtypes/missing.py:550: FutureWarning: The input object of type 'Point' is an array-like implementing one of the corresponding protocols (`__array__`, `__array_interface__` or `__array_struct__`); but not a sequence (or 0-D). In the future, this object will be coerced as if it was first converted using `np.array(obj)`. To retain the old behaviour, you have to either modify the type 'Point', or assign to an empty array created with `np.empty(correct_shape, dtype=object)`.
val = np.array(val, copy=False)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 df.loc[0, "centroid"] = shapely.geometry.Point([0, 0])
File ~/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/indexing.py:716, in _LocationIndexer.__setitem__(self, key, value)
713 self._has_valid_setitem_indexer(key)
715 iloc = self if self.name == "iloc" else self.obj.iloc
--> 716 iloc._setitem_with_indexer(indexer, value, self.name)
File ~/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/indexing.py:1642, in _iLocIndexer._setitem_with_indexer(self, indexer, value, name)
1639 self.obj[key] = empty_value
1641 else:
-> 1642 self.obj[key] = infer_fill_value(value)
1644 new_indexer = convert_from_missing_indexer_tuple(
1645 indexer, self.obj.axes
1646 )
1647 self._setitem_with_indexer(new_indexer, value, name)
File ~/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/dtypes/missing.py:550, in infer_fill_value(val)
548 if not is_list_like(val):
549 val = [val]
--> 550 val = np.array(val, copy=False)
551 if needs_i8_conversion(val.dtype):
552 return np.array("NaT", dtype=val.dtype)
TypeError: float() argument must be a string or a real number, not 'Point'
Essentially, pandas doesn't know how to interpret a point object, and so creates a float column with NaNs, and then can't handle the point. This might get fixed in the future, but you're best off explicitly defining the column as object dtype:
In [27]: df['centroid'] = None
In [28]: df['centroid'] = df['centroid'].astype(object)
In [29]: df
Out[29]:
mycol centroid
0 1 None
1 2 None
2 3 None
In [30]: df.loc[0, "centroid"] = shapely.geometry.Point([0, 0])
/Users/mikedelgado/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/internals/managers.py:304: ShapelyDeprecationWarning: The array interface is deprecated and will no longer work in Shapely 2.0. Convert the '.coords' to a numpy array instead.
applied = getattr(b, f)(**kwargs)
In [31]: df
Out[31]:
mycol centroid
0 1 POINT (0 0)
1 2 None
2 3 None
That said, joining two GeoDataFrames with polygons and points based on whether the points are in the polygons certainly sounds like a job for geopandas.sjoin:
union = gpd.sjoin(polygon_df, points_df, op='contains')
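Here is a minimal sketch of that approach with hypothetical toy frames; note that newer geopandas versions use the predicate keyword, which replaced op:

import geopandas as gpd
from shapely.geometry import Point, Polygon

# One unit square, and two points of which only the first lies inside it.
polygons = gpd.GeoDataFrame(
    {"poly_id": [1]},
    geometry=[Polygon([(0, 0), (0, 1), (1, 1), (1, 0)])],
)
points = gpd.GeoDataFrame(
    {"point_id": [1, 2]},
    geometry=[Point(0.5, 0.5), Point(2, 2)],
)

# Each polygon row is joined with every point it contains.
union = gpd.sjoin(polygons, points, predicate="contains")
print(union)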
Some formerly-working code fails after I migrated from pandas 0.25.3 to 1.2.4. Here is a reproducible example:
import numpy as np
import pandas as pd
print(f"pandas: {pd.__version__}")
!python --version
cols = pd.MultiIndex.from_product([['coz',], ['alpha', 'beta', 'gamma']], names=['health', 'protocol'])
index=pd.date_range(start="1jan2020", end=None, periods=5, freq="d", name="Date")
data = np.random.rand(5,3)
df = pd.DataFrame(data=data, index=index, columns=cols)
def foo(row):
row.index = row.index.droplevel(0)
return row['beta'] > row['alpha']
df.apply(foo, axis="columns")
in 0.25.3 this worked as I wanted:
pandas: 0.25.3
Python 3.7.11
Date
2020-01-01 False
2020-01-02 True
2020-01-03 False
2020-01-04 True
2020-01-05 False
Freq: D, dtype: bool
but in 1.2.4 the same code throws an error apparently due to the droplevel:
pandas: 1.2.4
Python 3.9.4
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-22-4242d4f13ab1> in <module>
15 return row['beta'] > row['alpha']
16
---> 17 df.apply(foo, axis="columns")
~\.conda\envs\yagi\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7766 kwds=kwds,
7767 )
-> 7768 return op.get_result()
7769
7770 def applymap(self, func, na_action: Optional[str] = None) -> DataFrame:
~\.conda\envs\yagi\lib\site-packages\pandas\core\apply.py in get_result(self)
183 return self.apply_raw()
184
--> 185 return self.apply_standard()
186
187 def apply_empty_result(self):
~\.conda\envs\yagi\lib\site-packages\pandas\core\apply.py in apply_standard(self)
274
275 def apply_standard(self):
--> 276 results, res_index = self.apply_series_generator()
277
278 # wrap results
~\.conda\envs\yagi\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
288 for i, v in enumerate(series_gen):
289 # ignore SettingWithCopy here in case the user mutates
--> 290 results[i] = self.f(v)
291 if isinstance(results[i], ABCSeries):
292 # If we have a view on v, we need to make a copy because
<ipython-input-22-4242d4f13ab1> in foo(row)
12
13 def foo(row):
---> 14 row.index = row.index.droplevel(0)
15 return row['beta'] > row['alpha']
16
~\.conda\envs\yagi\lib\site-packages\pandas\core\indexes\base.py in droplevel(self, level)
1609 levnums = sorted(self._get_level_number(lev) for lev in level)[::-1]
1610
-> 1611 return self._drop_level_numbers(levnums)
1612
1613 def _drop_level_numbers(self, levnums: List[int]):
~\.conda\envs\yagi\lib\site-packages\pandas\core\indexes\base.py in _drop_level_numbers(self, levnums)
1619 return self
1620 if len(levnums) >= self.nlevels:
-> 1621 raise ValueError(
1622 f"Cannot remove {len(levnums)} levels from an index with "
1623 f"{self.nlevels} levels: at least one level must be left."
ValueError: Cannot remove 1 levels from an index with 1 levels: at least one level must be left.
What seems to be happening is that in 1.2.4 the droplevel accumulates! The first row passed into apply() has a 2-level index, but the second row passed into apply() has a single-level index, and this is where the error screams. This I don't understand at all.
Here is the same toy with a print diagnostic:
import numpy as np
import pandas as pd
print(f"pandas: {pd.__version__}")
!python --version
cols = pd.MultiIndex.from_product([['coz',], ['alpha', 'beta', 'gamma']], names=['health', 'protocol'])
index=pd.date_range(start="1jan2020", end=None, periods=5, freq="d", name="Date")
data = np.random.rand(5,3)
df = pd.DataFrame(data=data, index=index, columns=cols)
def foo(row):
print(f"\nROW: {row} END")
row.index = row.index.droplevel(0)
return row['beta'] > row['alpha']
foo = df.apply(foo, axis="columns")
correct output:
pandas: 0.25.3
Python 3.7.11
ROW: health protocol
coz alpha 0.054421
beta 0.922885
gamma 0.843888
Name: 2020-01-01T00:00:00.000000000, dtype: float64 END
ROW: health protocol
coz alpha 0.962803
beta 0.827594
gamma 0.260147
Name: 2020-01-02T00:00:00.000000000, dtype: float64 END
ROW: health protocol
coz alpha 0.680902
beta 0.124468
gamma 0.960604
Name: 2020-01-03T00:00:00.000000000, dtype: float64 END
ROW: health protocol
coz alpha 0.133331
beta 0.664735
gamma 0.623440
Name: 2020-01-04T00:00:00.000000000, dtype: float64 END
ROW: health protocol
coz alpha 0.984164
beta 0.578701
gamma 0.538993
Name: 2020-01-05T00:00:00.000000000, dtype: float64 END
Date
2020-01-01 True
2020-01-02 False
2020-01-03 False
2020-01-04 True
2020-01-05 False
Freq: D, dtype: bool
failing output:
pandas: 1.2.4
Python 3.9.4
ROW: health protocol
coz alpha 0.374974
beta 0.137263
gamma 0.494556
Name: 2020-01-01 00:00:00, dtype: float64 END
ROW: protocol
alpha 0.591057
beta 0.560530
gamma 0.183457
Name: 2020-01-02 00:00:00, dtype: float64 END
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-28-bbef1b39f13a> in <module>
16 return row['beta'] > row['alpha']
17
---> 18 foo = df.apply(foo, axis="columns")
...
ValueError: Cannot remove 1 levels from an index with 1 levels: at least one level must be left.
========
So I can fix this by operating on a .copy() of the row, but this feels like a hack. I don't understand why the code started behaving this way after the version change.
def foo(row):
#print(f"\nROW: {row} END")
row=row.copy()
row.index = row.index.droplevel(0)
return row['beta'] > row['alpha']
https://pandas.pydata.org/docs/user_guide/gotchas.html#mutating-with-user-defined-function-udf-methods
Do not mutate inside user-defined functions (UDFs) passed to methods like .apply(). I was just lucky that it worked in 0.25.3.
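A mutation-free sketch of the same check, reusing df from the reproducible example above: Series.droplevel returns a new Series instead of mutating the row in place, and the vectorized comparison avoids apply altogether.

def foo(row):
    # droplevel returns a new Series; the row passed in is untouched.
    row = row.droplevel(0)
    return row['beta'] > row['alpha']

df.apply(foo, axis="columns")

# Or skip apply entirely and compare the columns directly:
df[('coz', 'beta')] > df[('coz', 'alpha')]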
Alright, I just started a new job and I have been tasked with writing a simple notebook in Jupyter. I really want to impress my supervisor; I have been working on this code for hours and can't get it to work, so hopefully somebody here can help me.
Here is the code I have been working on:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
df = pd.read_csv(r'C:\Users\jk2588\Documents\EDA\EDA Practice\top1000_dataset.csv', converters={'GENDER': lambda x: int(x == 'Male')}, usecols = ['MEMBER_ID', 'GENDER', 'Age', 'Dement'])
df_gp_1 = df[['MEMBER_ID', 'Dement']].groupby('MEMBER_ID').agg(np.mean).reset_index()
df_gp_2 = df[['MEMBER_ID', 'GENDER', 'Age']].groupby('MEMBER_ID').agg(max).reset_index()
df_gp = pd.merge(df_gp_1, df_gp_2, on = ['MEMBER_ID'])
df.head()
Output: MEMBER_ID Age Dement GENDER
0 000000002 01 36 NaN 0
1 000000002 01 36 NaN 0
2 000000002 01 36 NaN 0
3 000000002 01 36 NaN 0
4 000000002 01 36 NaN 0
df['Dement'] = df['Dement'].fillna(0)
df['Dement'] = df['Dement'].astype('int64')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 4 columns):
MEMBER_ID 999 non-null object
Age 999 non-null int64
Dement 999 non-null int64
GENDER 999 non-null int64
dtypes: int64(3), object(1)
memory usage: 31.3+ KB
freq = ((df_gp.Age.value_counts(normalize = True).reset_index().sort_values(by = 'index').Age)*100).tolist()
number_gp = 7
def ax_settings(ax, var_name, x_min, x_max):
ax.set_xlim(x_min,x_max)
ax.set_yticks([])
ax.spines['left'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['bottom'].set_edgecolor('#444444')
ax.spines['bottom'].set_linewidth(2)
ax.text(0.02, 0.05, var_name, fontsize=17, fontweight="bold", transform = ax.transAxes)
return None
fig = plt.figure(figsize=(12,7))
gs = gridspec.GridSpec(nrows=number_gp,
ncols=2,
figure=fig,
width_ratios= [3, 1],
height_ratios= [1]*number_gp,
wspace=0.2, hspace=0.05
)
ax = [None]*(number_gp + 1)
features = ['0-17', '18-25', '26-35', '36-45', '46-50', '51-55', '55+']
for i in range(number_gp):
ax[i] = fig.add_subplot(gs[i, 0])
ax_settings(ax[i], 'Age: ' + str(features[i]), -1000, 20000)
sns.kdeplot(data=df_gp[(df_gp.GENDER == 'M') & (df_gp.Age == features[i])].Dement, ax=ax[i], shade=True, color="blue", bw=300, legend=False)
sns.kdeplot(data=df_gp[(df_gp.GENDER == 'F') & (df_gp.Age == features[i])].Dement, ax=ax[i], shade=True, color="red", bw=300, legend=False)
if i < (number_gp - 1): ax[i].set_xticks([])
ax[0].legend(['Male', 'Female'], facecolor='w')
ax[number_gp] = fig.add_subplot(gs[:, 1])
ax[number_gp].spines['right'].set_visible(False)
ax[number_gp].spines['top'].set_visible(False)
ax[number_gp].barh(features, freq, color='#004c99', height=0.4)
ax[number_gp].set_xlim(0,100)
ax[number_gp].invert_yaxis()
ax[number_gp].text(1.09, -0.04, '(%)', fontsize=10, transform = ax[number_gp].transAxes)
ax[number_gp].tick_params(axis='y', labelsize = 14)
plt.show()
I am then met with:
C:\Users\jk2588\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\ops.py:1167: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
result = method(y)
--------------------------------------------------------------------------
TypeError Traceback (most recent call last
<ipython-input-38-8665030edb1c> in <module>()
24 ax[i] = fig.add_subplot(gs[i, 0])
25 ax_settings(ax[i], 'Age: ' + str(features[i]), -1000, 20000)
---> 26 sns.kdeplot(data=df_gp[(df_gp.GENDER == 'M') & (df_gp.Age == features[i])].Dement, ax=ax[i], shade=True, color="blue", bw=300, legend=False)
27 sns.kdeplot(data=df_gp[(df_gp.GENDER == 'F') & (df_gp.Age == features[i])].Dement, ax=ax[i], shade=True, color="red", bw=300, legend=False)
28 if i < (number_gp - 1): ax[i].set_xticks([])
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\ops.py in wrapper(self, other, axis)
1281
1282 with np.errstate(all='ignore'):
-> 1283 res = na_op(values, other)
1284 if is_scalar(res):
1285 raise TypeError('Could not compare {typ} type with Series'
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\ops.py in na_op(x, y)
1167 result = method(y)
1168 if result is NotImplemented:
-> 1169 raise TypeError("invalid type comparison")
1170 else:
1171 result = op(x, y)
TypeError: invalid type comparison
Please help, I have been faced with an absurd amount of errors this week.
I have a MultiIndex dataframe df1 as:
node A1 A2
bkt B1 B2
Month
1 0.15 -0.83
2 0.06 -0.12
df1.columns
MultiIndex([('A1', 'B1'),
            ('A2', 'B2')],
           names=['node', 'bkt'])
and another similar MultiIndex dataframe df2 as:
node A1 A2
bkt B1 B2
Month
1 -0.02 -0.15
2 0 0
3 -0.01 -0.01
4 -0.06 -0.11
I want to concat them vertically so that the resulting dataframe df3 looks as follows:
df3 = pd.concat([df1, df2], axis=0)
While concatenating, I want to introduce two blank rows between dataframes df1 and df2. In addition, I want to introduce the two strings Basis Mean and Basis P25 in df3, as shown below.
print(df3)
Basis Mean
node A1 A2
bkt B1 B2
Month
1 0.15 -0.83
2 0.06 -0.12
Basis P25
node A1 A2
bkt B1 B2
Month
1 -0.02 -0.15
2 0 0
3 -0.01 -0.01
4 -0.06 -0.11
I don't know whether there is any way of doing the above.
I don't think that what you are describing is an actual concatenation.
The following could already do the trick:
print('Basis Mean')
print(df1.to_string())
print('\n')
print('Basis P25')
print(df2.to_string())
This isn't usually how DataFrames are used, but perhaps you wish to append rows of empty strings in between df1 and df2, along with rows containing your titles?
df1 = pd.concat([pd.DataFrame([["Basis Mean", ""]], columns=df1.columns), df1], axis=0)
df1 = df1.append(pd.Series("", index=df1.columns), ignore_index=True)
df1 = df1.append(pd.Series("", index=df1.columns), ignore_index=True)
df1 = df1.append(pd.Series(["Basis P25", ""], index=df1.columns), ignore_index=True)
df3 = pd.concat([df1, df2], axis=0)
The author clarified in a comment that they want to make it easy to print to an Excel file. This can be achieved using pd.ExcelWriter.
Below is an example of how to do it.
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
import pandas as pd
@dataclass
class SaveTask:
df: pd.DataFrame
header: Optional[str]
extra_pd_settings: Optional[Dict[str, Any]] = None
def fill_xlsx(
save_tasks: List[SaveTask],
writer: pd.ExcelWriter,
sheet_name: str = "Sheet1",
n_rows_between_blocks: int = 2,
) -> None:
current_row = 0
for save_task in save_tasks:
extra_pd_settings = save_task.extra_pd_settings or {}
if "startrow" in extra_pd_settings:
raise ValueError(
"You should not use parameter 'startrow' in extra_pd_settings"
)
save_task.df.to_excel(
writer,
sheet_name=sheet_name,
startrow=current_row + 1,
**extra_pd_settings
)
worksheet = writer.sheets[sheet_name]
worksheet.write(current_row, 0, save_task.header)
has_header = extra_pd_settings.get("header", True)
current_row += (
1 + save_task.df.shape[0] + n_rows_between_blocks + int(has_header)
)
if __name__ == "__main__":
# INPUTS
df1 = pd.DataFrame(
{"hello": [1, 2, 3, 4], "world": [0.55, 1.12313, 23.12, 0.0]}
)
df2 = pd.DataFrame(
{"foo": [3, 4]},
index=pd.MultiIndex.from_tuples([("foo", "bar"), ("baz", "qux")]),
)
# Xlsx creation
writer = pd.ExcelWriter("test.xlsx", engine="xlsxwriter")
fill_xlsx(
[
SaveTask(
df1,
"Hello World Table",
{"index": False, "float_format": "%.3f"},
),
SaveTask(df2, "Foo Table with MultiIndex"),
],
writer,
)
writer.save()
As an extra bonus, pd.ExcelWriter lets you save data to different sheets of the Excel file and choose their names right from Python code.
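For example, reusing the fill_xlsx helper and the toy df1/df2 from above, each table can go to its own named sheet (here pd.ExcelWriter is used as a context manager, which saves the file on exit):

with pd.ExcelWriter("report.xlsx", engine="xlsxwriter") as writer:
    fill_xlsx([SaveTask(df1, "Hello World Table")], writer, sheet_name="Means")
    fill_xlsx([SaveTask(df2, "Foo Table")], writer, sheet_name="P25")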