Pandas: Fast way to get cols/rows containing na - pandas

In Pandas we can drop cols/rows by .dropna(how = ..., axis = ...) but is there a way to get an array-like of True/False indicators for each col/row, which would indicate whether a col/row contains na according to how and axis arguments?
I.e. is there a way to convert .dropna(how = ..., axis = ...) to a method, which would instead of actual removal just tell us, which cols/rows would be removed if we called .dropna(...) with specific how and axis.
Thank you for your time!

You can use isna() to replicate the behaviour of dropna without actually removing data. To mimic the 'how' and 'axis' parameter, you can add any() or all() and set the axis accordingly.
Here is a simple example:
import pandas as pd
df = pd.DataFrame([[pd.NA, pd.NA, 1], [pd.NA, pd.NA, pd.NA]])
df.isna()
Output:
0 1 2
0 True True False
1 True True True
Eq. to dropna(how='any', axis=0)
df.isna().any(axis=0)
Output:
0 True
1 True
2 True
dtype: bool
Eq. to dropna(how='any', axis=1)
df.isna().any(axis=1)
Output:
0 True
1 True
dtype: bool

Related

alternative way to define a function inside a class method [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 6 months ago.
Improve this question
I have a the following class:
class Analysis():
def __init__(self, file_dir):
self.path = file_dir #file path directory
def getData(self):
return pd.read_csv(self.path) # create a pandas dataframe
def getStd(self):
return self.getData().loc['1':'5'].apply(lambda x: x.std()) # cacluate the standard deviation of all columns
def getHighlight(self):
#a function to highlight df based on the given condition
def highlight(x):
c1 = 'background-color:red'
c2 = 'background-color:yellow'
c3 = 'background-color:green'
#rows over which the highlighting function should apply
r = ['1', '2', '3', '4', '5']
#first boolean mask for selecting the df elements
m1 = (x.loc[r]>x.loc['USL']) | (x.loc[r]<x.loc['LSL'])
#second boolean mask for selecting the df elements
m2 = (x.loc[r]==x.loc['USL']) | (x.loc[r]==x.loc['LSL'])
#DataFrame with same index and columns names as original filled empty strings
df1 = pd.DataFrame('', index=x.index, columns=x.columns)
#modify values of df1 columns by boolean mask
df1.loc[r, :] = np.select([m1, m2], [c1, c2], default=c3)
return df1
#apply the highlight function on the df to get highlighted
return self.getData().style.apply(highlight, axis=None)
getData method returns the df like this:
my_analysis = Analysis(path_to_file)
my_analysis.getData()
A-A A-B A-C A-D A-E
Tg 0.37 10.24 5.02 0.63 20.30
USL 0.39 10.26 5.04 0.65 20.32
LSL 0.35 10.22 5.00 0.63 20.28
1 0.35 10.23 5.05 0.65 20.45
2 0.36 10.19 5.07 0.67 20.25
3 0.34 10.25 5.03 0.66 20.33
4 0.35 10.20 5.08 0.69 20.22
5 0.33 10.17 5.05 0.62 20.40
Max 0.36 10.25 5.08 0.69 20.45
Min 0.33 10.17 5.03 0.62 20.22
The getHighligt method has an inner function which applies to the df in order to highlight the df elements based on the given mask and it would out put something like this:
my_analysis.getHighlight()
My question is what is the pythonic or elegant way of defining the inner function inside the class method?
Disclaimer: the following remarks represent my opinion about the topic of pythonic code.
Avoid Inner Functions
You should avoid inner functions at all cost. Sometimes they're necessary, but most of the time they're an indication that you might want to refactor your code.
Avoid re-reading multiple times
I would also avoid calling pd.read_csv every time I want to perform some operation in the data. Unless there's a good reason to read the file over and over again, It's more performant to read it once and store it in a class attribute, or property.
PEP-8 Naming Conventions
Another important thing to consider, if you're trying to make your code more pythonic, is to try to follow the PEP8 naming conventions, unless you're working on legacy code that does not follow PEP-8.
Class Overkill
Finally, I think that creating a class for what you're doing seems a little overkill. Most of your methods are simply transformations that could be easily converted to functions. Aside from making your code less complex, It would improve its reusability.
How I would write the Analysis class
from __future__ import absolute_import, annotations
from pathlib import Path
from typing import Any, Collection, Iterable, Type, Union
import numpy as np
import pandas as pd
from pandas.core.dtypes.dtypes import ExtensionDtype # type: ignore
# Custom types for type hinting
Axes = Collection[Any]
NpDtype = Union[
str, np.dtype, Type[Union[str, float, int, complex, bool, object]]
]
Dtype = Union["ExtensionDtype", NpDtype]
# Auxiliary functions
def is_iterable_not_string(iterable: Any) -> bool:
"""Return True, if `iterable` is an iterable object, and not a string.
Parameters
----------
iterable: Any
The object to check whether it's an iterable except for strings,
or not.
Returns
-------
bool
True, if object is iterable, but not a string.
Otherwise, if object isn't an iterable, or if it's a string, return
False.
Examples
--------
>>> import numpy as np
>>> import pandas as pd
>>> class FakeIterable(int):
... def __iter__(self): pass
>>> print(is_iterable_not_string('abcde'))
False
>>> print(is_iterable_not_string(bytes(12345)))
False
>>> print(is_iterable_not_string(12345))
False
>>> print(is_iterable_not_string(123.45))
False
>>> print(is_iterable_not_string(type))
False
>>> print(is_iterable_not_string(list)) # Type list isn't iterable
False
>>> print(is_iterable_not_string(object))
False
>>> print(is_iterable_not_string(None))
False
>>> print(is_iterable_not_string(list())) # Empty list is still iterable
True
>>> # `FakeIterable` has a method `__iter__`, therefore it's considered
>>> # iterable, even though it isn't.
>>> print(is_iterable_not_string(FakeIterable(10)))
True
>>> print(is_iterable_not_string(list('abcde')))
True
>>> print(is_iterable_not_string(tuple('abcde')))
True
>>> print(is_iterable_not_string(set('abcde')))
True
>>> print(is_iterable_not_string(np.array(list('abcdef'))))
True
>>> print(is_iterable_not_string({col: [1, 2, 3, 4] for col in 'abcde'}))
True
>>> print(is_iterable_not_string(
... pd.DataFrame({col: [1, 2, 3, 4] for col in 'abcde'}))
... )
True
>>> print(is_iterable_not_string(pd.DataFrame()))
True
Notes
-----
In python, any object that contains a method called `__iter__` considered
an “iterable”. This means that you can, in theory, fake an “iterable”
object, by creating a method called `__iter__` that doesn't contain any
real implementation. For a concrete case, see the examples section.
Python common iterable objects:
- strings
- bytes
- lists
- tuples
- sets
- dictionaries
Python common non-iterable objects:
- integers
- floats
- None
- types
- objects
"""
return (not isinstance(iterable, (bytes, str))
and isinstance(iterable, Iterable))
def prepare_dict(data: dict) -> dict:
"""Transform non-iterable dictionary values into lists.
Parameters
----------
data : dict
The dictionary to convert non-iterable values into lists.
Returns
-------
dict
Dictionary with non-iterable values converted to lists.
Examples
--------
>>> import pandas as pd
>>> d = {'a': '1', 'b': 2}
>>> prepare_dict(d)
{'a': ['1'], 'b': [2]}
>>> pd.DataFrame(d) # doctest: +ELLIPSIS
Traceback (most recent call last):
...
ValueError: If using all scalar values, you must pass an index
>>> pd.DataFrame(prepare_dict(d))
a b
0 1 2
Notes
-----
Use this function to prepare dictionaries, before calling
`pandas.DataFrame`, to make sure all values have the correct format.
"""
return {
key: value if is_iterable_not_string(value) else [value]
for key, value in data.items()
}
def check_dict_value_lens(data: dict) -> bool:
"""Check whether all values from dictionary have the same lenght.
Parameters
----------
data : dict
The dictionary to check the values lenghts.
Returns
-------
bool
True, if all `data` values have the same lenght. False otherwise.
"""
min_len = min(map(lambda value: len(value), data.values()))
return all(len(value) == min_len for value in data.values())
def read_file(path: Path | str, **kwargs: Any) -> pd.DataFrame:
"""
Read a DataFrame from a file.
Supported file types are:
- `.csv`
- `.xlsx`, `.xls`, `.xlsm`, `.xlsb` (Excel files)
- `.json`
- `.parquet`
- `.feather`
- `.html`
Parameters
----------
path : Path | str
The path to the file.
kwargs : Any
Keyword arguments to pass to pandas io functions.
Returns
-------
pd.DataFrame
The DataFrame read from the file.
Raises
------
ValueError
If the file type not supported.
FileNotFoundError
If the file doesn't exist.
"""
_path = Path(path)
path = str(path)
if not _path.is_file():
raise FileNotFoundError(f"File {path} does not exist.")
if _path.suffix in [".csv", ".txt"]:
return pd.read_csv(path, **kwargs)
if ".xls" in _path.suffix:
return pd.read_excel(path, **kwargs)
if _path.suffix == ".json":
return pd.read_json(path, **kwargs)
if _path.suffix == ".pickle":
return pd.read_pickle(path, **kwargs)
if _path.suffix == ".html":
return pd.read_html(path, **kwargs)
if _path.suffix == ".feather":
return pd.read_feather(path, **kwargs)
if _path.suffix in [".parquet", ".pq"]:
return pd.read_parquet(path, **kwargs)
raise ValueError(f"File {path} has an unknown extension.")
def highlight(df: pd.DataFrame) -> pd.DataFrame:
"""Highlight a DataFrame.
Parameters
----------
df : pd.DataFrame
The DataFrame to highlight. Required indexes:
- ["USL", "LSL", "1", "2", "3", "4", "5"]
Returns
-------
pd.DataFrame
The DataFrame with highlighted rows.
"""
# The dataframe cells background colors.
c1: str = "background-color:red"
c2: str = "background-color:yellow"
c3: str = "background-color:green"
# Rows over which the highlighting function should apply
rows: list[str] = ["1", "2", "3", "4", "5"]
# First boolean mask for selecting the df elements
m1 = (df.loc[rows] > df.loc["USL"]) | (df.loc[rows] < df.loc["LSL"])
# Second boolean mask for selecting the df elements
m2 = (df.loc[rows] == df.loc["USL"]) | (df.loc[rows] == df.loc["LSL"])
# DataFrame with same index, and column names as the original,
# but with filled empty strings.
df_highlight = pd.DataFrame("", index=df.index, columns=df.columns)
# Change values of df1 columns by boolean mask
df_highlight.loc[rows, :] = np.select(
[m1, m2], [c1, c2], default=c3
)
return df_highlight
class Analysis:
"""
Read a dataframe, and help performing some analysis in the data.
Parameters
----------
path_or_data : str | Path | pd.DataFrame
The path to a file, or a dataframe to analyze.
Attributes
----------
_data : pd.DataFrame
The data read from the file.
_path : str | Path
The path to the file.
Examples
--------
>>> data = {
... 'A-A': [
... 0.37, 0.39, 0.35, 0.35, 0.36, 0.34, 0.35, 0.33, 0.36, 0.33,
... ],
... 'A-B': [
... 10.24, 10.26, 10.22, 10.23, 10.19, 10.25, 10.2, 10.17, 10.25,
... 10.17,
... ],
... 'A-C': [
... 5.02, 5.04, 5.0, 5.05, 5.07, 5.03, 5.08, 5.05, 5.08, 5.03,
... ],
... 'A-D': [
... 0.63, 0.65, 0.63, 0.65, 0.67, 0.66, 0.69, 0.62, 0.69, 0.62,
... ],
... 'A-E': [
... 20.3, 20.32, 20.28, 20.45, 20.25, 20.33, 20.22, 20.4,
... 20.45, 20.22,
... ],
... }
>>> index = ['Tg', 'USL', 'LSL', '1', '2', '3', '4', '5', 'Max', 'Min']
>>> analysis = Analysis.from_dict(data, index=index)
>>> analysis.get_std()
A-A 0.011402
A-B 0.031937
A-C 0.019494
A-D 0.025884
A-E 0.097211
dtype: float64
"""
_path: Path | str | None = None
_data: pd.DataFrame | None = None
#property
def path(self) -> str | Path:
"""Get the path to the file.
Returns
-------
str | Path
The path to the file.
Raises
------
ValueError
If `_path` is `None`.
"""
if self._path is None:
raise ValueError("Path not set.")
return str(self._path)
#path.setter
def path(self, path: str | Path):
"""Set the path of the file to analyze.
Parameters
----------
path : str | Path
The path of the file to analyze.
Path should point to a `.csv` file.
Raises
------
FileNotFoundError
If the path not found.
"""
_path = Path(path)
if _path.is_file():
self._path = str(path)
else:
raise FileNotFoundError(f"Path {path} does not exist.")
#property
def data(self) -> pd.DataFrame:
"""Dataframe read from `path`.
Returns
-------
pd.DataFrame
The dataframe read from `path`.
"""
if self._data is None:
self._data = self.get_data()
return self._data
#data.setter
def data(self, data: pd.DataFrame):
"""Set the data to analyze.
Parameters
----------
data : pd.DataFrame
The data to analyze.
"""
self._data = data
def __init__(self, path_or_data: str | Path | pd.DataFrame):
"""Initialize the Analyzer.
Parameters
----------
path_or_data : str | Path | pd.DataFrame
The path to a file, or a dataframe to analyze.
Raises
------
ValueError
If `path_or_data` not a `str`, `Path`, or `pd.DataFrame`.
"""
if isinstance(path_or_data, (str, Path)):
self.path = path_or_data
elif isinstance(path_or_data, pd.DataFrame):
self.data = path_or_data
else:
raise ValueError(f"Invalid type {type(path_or_data)}.")
def get_data(self) -> pd.DataFrame:
"""Read the data from the file.
Returns
-------
pd.DataFrame
The dataframe read from the `path` property.
"""
return read_file(self.path)
def get_std(self) -> pd.Series:
"""Calcuate the standard deviation of every column.
Returns
-------
pd.Series
The standard deviation of every column.
"""
return self.data.loc["1":"5"].apply(lambda x: x.std()) # type: ignore
def highlight_frame(
self, round_values: int | None = None
) -> pd.io.formats.style.Styler: # type: ignore
"""Highlight dataframe, based on some condition.
Parameters
----------
round_values: int | None
If defined, sets the precision of the Styler object with the
highlighted dataframe.
Returns
-------
pd.io.formats.style.Styler
The Styler object with the highlighted dataframe.
"""
highlight_df = self.data.style.apply(highlight, axis=None)
if isinstance(round_values, int) and round_values >= 0:
return highlight_df.format(precision=round_values)
return highlight_df
#classmethod
def from_dict(
cls,
data: dict,
index: Axes | None = None,
columns: Axes | None = None,
dtype: Dtype | None = None,
) -> Analysis:
"""Create an Analysis object from a dictionary.
Parameters
----------
data : dict
The dictionary to create the Analysis object from.
index : Index or array-like
Index to use for resulting frame. Defaults to RangeIndex, if
no indexing information part of input data and no index provided.
columns : Index or array-like
Column labels to use for resulting frame when data doesn't have
them, defaulting to RangeIndex(0, 1, 2, ..., n).
If data contains column labels, will perform column selection
instead.
dtype : dtype, default None
Data type to force. Only a single dtype allowed. If None, infer.
Returns
-------
Analysis
An instance of the `Analysis` class.
Raises
------
ValueError
If dictionary values have different lenghts.
"""
data = prepare_dict(data)
if check_dict_value_lens(data):
return cls(
pd.DataFrame(data, index=index, columns=columns, dtype=dtype)
)
raise ValueError(
f"Dictionary values don't have the same lenghts.\nData: {data}"
)
if __name__ == "__main__":
import doctest
doctest.testmod()

Pandas create new column base on groupby and apply lambda if statement

I have the issue with groupby and apply
df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'b', 'b'], 'B': np.r_[1:8]})
I want to create a column C for each group take value 1 if B > z_score=2 and 0 otherwise. The code:
from scipy import stats
df['C'] = df.groupby('A').apply(lambda x: 1 if np.abs(stats.zscore(x['B'], nan_policy='omit')) > 2 else 0, axis=1)
However, I am unsuccessful with code and cannot figure out the issue
Use GroupBy.transformwith lambda, function, then compare and for convert True/False to 1/0 convert to integers:
from scipy import stats
s = df.groupby('A')['B'].transform(lambda x: np.abs(stats.zscore(x, nan_policy='omit')))
df['C'] = (s > 2).astype(int)
Or use numpy.where:
df['C'] = np.where(s > 2, 1, 0)
Error in your solution is per groups:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: 1 if np.abs(stats.zscore(x, nan_policy='omit')) > 2 else 0)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
If check gotcha in pandas docs:
pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if-statement or when using the boolean operations: and, or, and not.
So if use one of solutions instead if-else:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: (np.abs(stats.zscore(x, nan_policy='omit')) > 2).astype(int))
print (df)
A
a [0, 0, 0]
b [0, 0, 0, 0]
Name: B, dtype: object
but then need convert to column, for avoid this problems is used groupby.transform.
You can use groupby + apply a function that finds the z-scores of each item in each group; explode the resulting list; use gt to create a boolean series and convert it to dtype int
df['C'] = df.groupby('A')['B'].apply(lambda x: stats.zscore(x, nan_policy='omit')).explode(ignore_index=True).abs().gt(2).astype(int)
Output:
A B C
0 a 1 0
1 a 2 0
2 a 3 0
3 b 4 0
4 b 5 0
5 b 6 0
6 b 7 0

How do I call my columns after transposing?

I have a dataframe I want to transpose, after doing this I need to call the columns but they are set as index. I have tried resetting the index to no avail
index False True
0 Scan_Periodicity_%_Changed 0.785003 0.214997
1 Assets_Scanned_%_Changed 0.542056 0.457944
I want the True and False columns to be regular columns but they are part of the index and I cannot call
df['True']
Expected Output:
False True
0 Scan_Periodicity_%_Changed 0.785003 0.214997
1 Assets_Scanned_%_Changed 0.542056 0.457944
and when i call True and False I want it to be a column not an index
df['True']
.214997
.457944
Try via reset_index(),set_index() and rename_axis() method:
out= (df.reset_index()
.set_index(['level_0','index'])
.rename_axis(index=[None,None]))
output of out:
False True
0 Scan_Periodicity_%_Changed 0.785003 0.214997
1 Assets_Scanned_%_Changed 0.542056 0.457944
Not sure what do you mean by "I want it to be a column not an index"
If you are looking for not having the 'index' value being displayed as a header you can do as
>>> import pandas as pd
>>>
>>> d = {
... 'index':['Scan_Periodicity_%_Changed','Assets_Scanned_%_Changed'],
... 'False':[0.78,0.54],
... 'True':[0.21,0.45]
... }
>>>
>>> df = pd.DataFrame(d)
>>>
>>> df = df.set_index('index')
>>>
>>> df.index.name = None
>>>
>>>
>>> df
False True
Scan_Periodicity_%_Changed 0.78 0.21
Assets_Scanned_%_Changed 0.54 0.45

Is there a nullable boolean type I can use in a Pandas dataframe?

In a program I am working on I have to explicitly set the type of a column that contains boolean data. Sometimes all of the values in this column are None. Unless I provide explicit type information Pandas will infer the wrong type information for that column.
Is there a pandas-compatible type that represents a nullable-bool? I want to do something like this, but preserve the Nones:
s = pandas.Series([True, False, None]).astype(bool)
print([v for v in s])
gives:
[True, False, False]
Python's built-in bool class cannot have a Null value. It can only be True or False. And in this case, because bool(None)==False the final Null is lost.
But what if I want to preserve my nulls? Is there a type I can give the column which allows for True, False and None?
I have solved a similar issue with numeric columns: For these I can use the Numpy Int64 which is a pandas-compatible nullable integer type:
s = pandas.Series([1, 2, None, numpy.NaN]).astype("Int64")
print([v for v in s])
gives:
[1, 2, <NA>, <NA>]
Which is exactly right behaviour for nullable integers, I just need a type I can use for my Nullable bools.
boolean dtype should work:
>>> pd.Series([True, False, None])
0 True
1 False
2 None
dtype: object
>>> pd.Series([True, False, None]).astype("boolean")
0 True
1 False
2 <NA>
dtype: boolean

A contain masking operation

Suppose such an array
In [8]: pd.Series(['testing', 'the', 'masking'])
Out[8]:
0 testing
1 the
2 masking
dtype: object
Masking is handy
In [10]: arr == 'testing'
Out[10]:
0 True
1 False
2 False
dtype: bool
If check if 't' in the individual strings, nested iterations should be applied
In [11]: [ u for u in arr if 't' in u]
Out[11]: ['testing', 'the']
Is it possible to get it done with
arr contains 't'
It is possible
s[s.str.contains('t')]