pandas filter with replace and search - pandas

I have a dataframe like the one constructed below.
I want to filter rows and replace values based on the filter, applying a different operation depending on whether the filter is True or False.
How can I do this? My current approach does not work:
import pandas as pd
import re
df = pd.DataFrame(data={'a': ['aa', 'bb', 'c banana a dupa', 'dd'], 'b': [r'\w', r'\d', '[ab-c]', '[^c b]']})
df['filter'] = (df['a'] > 2).replace({True: f"dupa {df['b']}", False: re.search(df['b'], df['a'])})

I think you want numpy.where, because replace only substitutes scalar values. To compare lengths use Series.str.len, to prepend a value use +, and to run the search row by row use apply:
import numpy as np
df = pd.DataFrame(data={'a': ['aa', 'bb', 'c banana a dupa', 'dd'],
                        'b': [r'\w', r'\d', '[ab-c]', '[^c b]']})
# rows with len(a) > 2 get "dupa " + pattern, the rest get the re.search result
df['filter'] = np.where(df['a'].str.len() > 2,
                        "dupa " + df['b'],
                        df.apply(lambda x: re.search(x['b'], x['a']), axis=1))
print(df)
a b filter
0 aa \w <re.Match object; span=(0, 1), match='a'>
1 bb \d None
2 c banana a dupa [ab-c] dupa [ab-c]
3 dd [^c b] <re.Match object; span=(0, 1), match='d'>
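If the Match objects themselves are not needed, a variant of the same np.where idea can store the matched text instead. This is only a sketch: first_match is a hypothetical helper, and keeping the matched string rather than the Match object is an assumption about what the column is for.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["aa", "bb", "c banana a dupa", "dd"],
    "b": [r"\w", r"\d", "[ab-c]", "[^c b]"],
})


def first_match(pattern, text):
    """Return the first matched substring, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# Run the search once per row, pairing each pattern with its own text.
searched = [first_match(p, t) for p, t in zip(df["b"], df["a"])]

# Long strings get the prefixed pattern, short ones get the search result.
df["filter"] = np.where(df["a"].str.len() > 2, "dupa " + df["b"], searched)
print(df)
```

The replace call in the question cannot work this way, because the f-string and re.search are evaluated once on whole columns before replace runs, not row by row.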

Related

Pandas create new column base on groupby and apply lambda if statement

I have an issue with groupby and apply:
df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'b', 'b'], 'B': np.r_[1:8]})
I want to create a column C that, within each group, takes the value 1 if the absolute z-score of B is greater than 2 and 0 otherwise. My code:
from scipy import stats
df['C'] = df.groupby('A').apply(lambda x: 1 if np.abs(stats.zscore(x['B'], nan_policy='omit')) > 2 else 0, axis=1)
However, the code fails and I cannot figure out why.
Use GroupBy.transform with a lambda function, then compare, and convert the True/False result to 1/0 by casting to integers:
from scipy import stats
s = df.groupby('A')['B'].transform(lambda x: np.abs(stats.zscore(x, nan_policy='omit')))
df['C'] = (s > 2).astype(int)
Or use numpy.where:
df['C'] = np.where(s > 2, 1, 0)
The error in your solution comes from the per-group if-else, because each group produces an array rather than a single value:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: 1 if np.abs(stats.zscore(x, nan_policy='omit')) > 2 else 0)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
See the gotchas section of the pandas docs:
pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if-statement or when using the boolean operations: and, or, and not.
So if you use one of the solutions above instead of if-else:
from scipy import stats
df = df.groupby('A')['B'].apply(lambda x: (np.abs(stats.zscore(x, nan_policy='omit')) > 2).astype(int))
print (df)
A
a [0, 0, 0]
b [0, 0, 0, 0]
Name: B, dtype: object
but then the result still has to be converted into a column. groupby.transform avoids this problem.
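Both points can be checked on the question's data. The sketch below assumes a hand-written population z-score (ddof=0, the same default as scipy.stats.zscore) in place of scipy, so abs_zscore is an illustrative helper rather than the answer's exact code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": list("aaabbbb"), "B": np.arange(1, 8)})


def abs_zscore(x):
    """Absolute population z-score (ddof=0) of one group's values."""
    return ((x - x.mean()) / x.std(ddof=0)).abs()


# transform returns a Series aligned with the original rows,
# so it can be assigned straight back as a column.
s = df.groupby("A")["B"].transform(abs_zscore)
df["C"] = (s > 2).astype(int)

# Using the whole boolean Series in an if-statement is what raises the error.
try:
    flag = 1 if (s > 2) else 0
except ValueError as exc:
    message = str(exc)

print(df)
print(message)
```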
You can use groupby + apply with a function that finds the z-scores of each item in each group, explode the resulting lists, use gt to create a boolean Series, and convert it to dtype int:
df['C'] = df.groupby('A')['B'].apply(lambda x: stats.zscore(x, nan_policy='omit')).explode(ignore_index=True).abs().gt(2).astype(int)
Output:
A B C
0 a 1 0
1 a 2 0
2 a 3 0
3 b 4 0
4 b 5 0
5 b 6 0
6 b 7 0

Concatenating 2 dataframes vertically with empty row in middle

I have a MultiIndex dataframe df1 as:
node A1 A2
bkt B1 B2
Month
1 0.15 -0.83
2 0.06 -0.12
bs.columns
MultiIndex([( 'A1', 'B1'),
( 'A2', 'B2')],
names=['node', 'bkt'])
and another similar multiindex dataframe df2 as:
node A1 A2
bkt B1 B2
Month
1 -0.02 -0.15
2 0 0
3 -0.01 -0.01
4 -0.06 -0.11
I want to concatenate them vertically into a dataframe df3:
df3 = pd.concat([df1, df2], axis=0)
While concatenating, I want to introduce 2 blank rows between df1 and df2. In addition, I want to introduce the two strings Basis Mean and Basis P25 in df3 as shown below.
print(df3)
Basis Mean
node A1 A2
bkt B1 B2
Month
1 0.15 -0.83
2 0.06 -0.12
Basis P25
node A1 A2
bkt B1 B2
Month
1 -0.02 -0.15
2 0 0
3 -0.01 -0.01
4 -0.06 -0.11
I don't know whether there is any way of doing the above.
I don't think what you are describing is an actual concatenation.
The following could already do the trick:
print('Basis Mean')
print(df1.to_string())
print('\n')
print('Basis P25')
print(df2.to_string())
This isn't usually how DataFrames are used, but perhaps you wish to append rows of empty strings in between df1 and df2, along with rows containing your titles?
# DataFrame.append was removed in pandas 2.0, so each piece is built as a small frame and concatenated once.
# The sample frames have two columns, so every inserted row holds two values.
title_mean = pd.DataFrame([["Basis", "Mean"]], columns=df1.columns)
blanks = pd.DataFrame([["", ""], ["", ""]], columns=df1.columns)
title_p25 = pd.DataFrame([["Basis", "P25"]], columns=df1.columns)
df3 = pd.concat([title_mean, df1, blanks, title_p25, df2], axis=0, ignore_index=True)
The author clarified in a comment that the goal is to make it easy to write the result to an Excel file. This can be achieved using pd.ExcelWriter.
Below is an example of how to do it.
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
import pandas as pd
@dataclass
class SaveTask:
    df: pd.DataFrame
    header: Optional[str]
    extra_pd_settings: Optional[Dict[str, Any]] = None
def fill_xlsx(
    save_tasks: List[SaveTask],
    writer: pd.ExcelWriter,
    sheet_name: str = "Sheet1",
    n_rows_between_blocks: int = 2,
) -> None:
    current_row = 0
    for save_task in save_tasks:
        extra_pd_settings = save_task.extra_pd_settings or {}
        if "startrow" in extra_pd_settings:
            raise ValueError(
                "You should not use parameter 'startrow' in extra_pd_settings"
            )
        # the table starts one row below its title
        save_task.df.to_excel(
            writer,
            sheet_name=sheet_name,
            startrow=current_row + 1,
            **extra_pd_settings
        )
        worksheet = writer.sheets[sheet_name]
        worksheet.write(current_row, 0, save_task.header)
        has_header = extra_pd_settings.get("header", True)
        # advance past the title, the optional header row, the data and the gap
        current_row += (
            1 + save_task.df.shape[0] + n_rows_between_blocks + int(has_header)
        )
if __name__ == "__main__":
    # INPUTS
    df1 = pd.DataFrame(
        {"hello": [1, 2, 3, 4], "world": [0.55, 1.12313, 23.12, 0.0]}
    )
    df2 = pd.DataFrame(
        {"foo": [3, 4]},
        index=pd.MultiIndex.from_tuples([("foo", "bar"), ("baz", "qux")]),
    )
    # Xlsx creation; the context manager saves and closes the file on exit
    # (writer.save() was removed in pandas 2.0)
    with pd.ExcelWriter("test.xlsx", engine="xlsxwriter") as writer:
        fill_xlsx(
            [
                SaveTask(
                    df1,
                    "Hello World Table",
                    {"index": False, "float_format": "%.3f"},
                ),
                SaveTask(df2, "Foo Table with MultiIndex"),
            ],
            writer,
        )
As an extra bonus, pd.ExcelWriter also lets you save data on different sheets in Excel and choose their names right from Python code.
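The row bookkeeping in fill_xlsx can be checked without writing a file. A minimal sketch, assuming one header row per table; block_title_rows is a hypothetical helper introduced only for illustration and is not part of pandas:

```python
def block_title_rows(n_data_rows, n_rows_between_blocks=2, header_rows=1):
    """Return the sheet row on which each block's title would be written."""
    rows = []
    current_row = 0
    for n in n_data_rows:
        rows.append(current_row)
        # title row + header row(s) + data rows + blank gap
        current_row += 1 + header_rows + n + n_rows_between_blocks
    return rows


# A 4-row table followed by a 2-row table, as in the example inputs.
title_rows = block_title_rows([4, 2])
print(title_rows)
```

With the defaults, the first title sits on row 0 and the second on row 8, which leaves two empty rows after the first table.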

Pandas apply function on multiple columns

I am trying to apply a function to every column in a dataframe. When I do it on a single fixed column name it works, but when I pass the column name as an argument to the function I get an error.
How do you properly pass arguments to apply a function on a data frame?
def result(row,c):
    if row[c] >=0 and row[c] <=1:
        return 'c'
    elif row[c] >1 and row[c] <=2:
        return 'b'
    else:
        return 'a'
cols = list(df.columns.values)
for c in cols:
    df[c] = df.apply(result, args = (c), axis=1)
TypeError: ('result() takes exactly 2 arguments (21 given)', u'occurred at index 0')
Input data frame format:
d = {'c1': [1, 2, 1, 0], 'c2': [3, 0, 1, 2]}
df = pd.DataFrame(data=d)
df
c1 c2
0 1 3
1 2 0
2 1 1
3 0 2
You don't need to pass the column name to apply, since you only want to check whether the values of each column fall in a certain range and return a, b or c. You can make the following changes:
def result(val):
    if 0<=val<=1:
        return 'c'
    elif 1<val<=2:
        return 'b'
    return 'a'
cols = list(df.columns.values)
for c in cols:
    df[c] = df[c].apply(result)
Note that this will replace your column values.
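A faster way is np.select:
import numpy as np
values = ['c', 'b']
for col in df.columns:
    # a chained comparison such as 0 <= df[col] <= 1 raises ValueError on a Series,
    # so each condition is written as an explicit boolean mask
    conditions = [df[col].between(0, 1), (df[col] > 1) & (df[col] <= 2)]
    df[col] = np.select(conditions, values, default='a')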
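If the column name does have to be passed through apply, the likely culprit in the question is args = (c): parentheses alone do not make a tuple, so the string is unpacked character by character into extra arguments. A minimal sketch of the tuple form, using the question's data; collecting the results in a separate labels frame is a choice made here to leave df untouched:

```python
import pandas as pd

df = pd.DataFrame({"c1": [1, 2, 1, 0], "c2": [3, 0, 1, 2]})


def result(row, c):
    """Label the value of column c in this row as 'c', 'b' or 'a'."""
    if 0 <= row[c] <= 1:
        return "c"
    elif 1 < row[c] <= 2:
        return "b"
    return "a"


# ("c1") is just the string "c1"; the trailing comma is what makes a tuple.
assert ("c1") == "c1"
assert isinstance(("c1",), tuple)

# Pass the column name as a one-element tuple.
labels = pd.DataFrame(
    {c: df.apply(result, args=(c,), axis=1) for c in df.columns}
)
print(labels)
```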

How to use the values of one column to access values in another column?

How to use the values of one column to access values in another
import numpy
import pandas
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=[['Value']])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
So how do I use 'bleh' to pick the value of 'Value' at that position for each row? I tried:
df.Value.iloc[df['bleh']]
Edit:
Thanks to @ScottBoston. My DF constructor had one layer of [] too many.
The correct answer is:
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
df['idx_int'] = range(df.shape[0])
df['haa'] = df['idx_int'] - df.bleh.values
df['newcol'] = df.Value.iloc[df['haa'].values].values
Try:
df['Value'].tolist()
Output:
[-1.0856306033005612,
0.9973454465835858,
0.28297849805199204,
-1.506294713918092,
-0.5786002519685364,
1.651436537097151,
-2.426679243393074,
-0.42891262885617726,
1.265936258705534,
-0.8667404022651017]
Your dataframe constructor still needs to be fixed.
Are you looking for:
df.set_index('bleh')
output:
Value
bleh
0 -1.085631
1 0.997345
2 0.282978
1 -1.506295
4 -0.578600
0 1.651437
0 -2.426679
4 -0.428913
1 1.265936
7 -0.866740
If so, your dataframe constructor has an extra set of [] in it.
np.random.seed(123)
df = pd.DataFrame((np.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: np.random.randint(0, x + 1, 1)[0])
The columns parameter of DataFrame takes a list, not a list of lists.
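The difference is easy to see on a small frame. The sketch below uses made-up values instead of the random data, and it assumes the goal is to read Value at the position stored in bleh; the picked column name is hypothetical:

```python
import numpy as np
import pandas as pd

values = np.array([10.0, 20.0, 30.0, 40.0])

# A list of lists is read as the levels of a MultiIndex ...
nested = pd.DataFrame(values, columns=[["Value"]])
# ... while a plain list gives ordinary column labels.
flat = pd.DataFrame(values, columns=["Value"])

print(type(nested.columns).__name__)
print(type(flat.columns).__name__)

# With ordinary columns, one column can index another by position.
flat["bleh"] = [0, 0, 1, 3]
flat["picked"] = flat["Value"].to_numpy()[flat["bleh"].to_numpy()]
print(flat)
```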

Aggregate/Remove duplicate rows in DataFrame based on swapped index levels

Sample input
import pandas as pd
df = pd.DataFrame([
['A', 'B', 1, 5],
['B', 'C', 2, 2],
['B', 'A', 1, 1],
['C', 'B', 1, 3]],
columns=['from', 'to', 'type', 'value'])
df = df.set_index(['from', 'to', 'type'])
Which looks like this:
value
from to type
A B 1 5
B C 2 2
A 1 1
C B 1 3
Goal
I now want to remove "duplicate" rows from this in the following sense: for each row with an arbitrary index (from, to, type), if there exists a row (to, from, type), the value of the second row should be added to the first row and the second row be dropped. In the example above, the row (B, A, 1) with value 1 should be added to the first row and dropped, leading to the following desired result.
Sample result
value
from to type
A B 1 6
B C 2 2
C B 1 3
This is my best try so far. It feels unnecessarily verbose and clunky:
# aggregate val of rows with (from,to,type) == (to,from,type)
df2 = df.reset_index()
df3 = df2.rename(columns={'from':'to', 'to':'from'})
df_both = df.join(df3.set_index(
    ['from', 'to', 'type']),
    rsuffix='_b').sum(axis=1)
# then remove the second, i.e. the (to,from,t) row
rows_to_keep = []
rows_to_remove = []
for a,b,t in df_both.index:
    if (b,a,t) in df_both.index and not (b,a,t) in rows_to_keep:
        rows_to_keep.append((a,b,t))
        rows_to_remove.append((b,a,t))
df_final = df_both.drop(rows_to_remove)
df_final
Especially the second "de-duplication" step feels very unpythonic. (How) can I improve these steps?
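One possible way to express the goal above without the explicit loop is to group on a direction-independent key. This is a sketch under assumptions, not a tested drop-in: the pair column is a hypothetical helper, and the first row seen for each pair is assumed to be the one whose orientation should be kept.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", "B", 1, 5], ["B", "C", 2, 2], ["B", "A", 1, 1], ["C", "B", 1, 3]],
    columns=["from", "to", "type", "value"],
).set_index(["from", "to", "type"])

flat = df.reset_index()
# Direction-independent key: ('A', 'B') and ('B', 'A') both become ('A', 'B').
flat["pair"] = [tuple(sorted(p)) for p in zip(flat["from"], flat["to"])]

df_final = (
    flat.groupby(["pair", "type"], sort=False)
    # keep the orientation of the first row in each group, sum the values
    .agg({"from": "first", "to": "first", "value": "sum"})
    .reset_index()
    .drop(columns="pair")
    .set_index(["from", "to", "type"])
)
print(df_final)
```

Rows that differ in type stay separate, so (B, C, 2) and (C, B, 1) are not merged.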
Not sure how much better this is, but it's certainly different
import pandas as pd
from collections import Counter
df = pd.DataFrame([
['A', 'B', 1, 5],
['B', 'C', 2, 2],
['B', 'A', 1, 1],
['C', 'B', 1, 3]],
columns=['from', 'to', 'type', 'value'])
df = df.set_index(['from', 'to', 'type'])
ls = df.to_records()
ls = list(ls)
ls2=[]
for l in ls:
    i=0
    while i <= l[3]:
        ls2.append(list(l)[:3])
        i+=1
counted = Counter(tuple(sorted(entry)) for entry in ls2)