Find string in multiple columns? - pandas

I have a DataFrame with 3 columns: tel1, tel2, tel3.
I want to keep the rows that contain a specific value in one or more of those columns.
For example, I want to keep the rows where tel1, tel2, or tel3 starts with '06'.
How can I do that?
Thanks

Let's use this df as an example DataFrame:
In [54]: df = pd.DataFrame({'tel{}'.format(j): ['{:02d}'.format(i + j) for i in range(10)]
    ...:                    for j in range(3)})
In [71]: df
Out[71]:
tel0 tel1 tel2
0 00 01 02
1 01 02 03
2 02 03 04
3 03 04 05
4 04 05 06
5 05 06 07
6 06 07 08
7 07 08 09
8 08 09 10
9 09 10 11
You can find which values in df['tel0'] start with '06' using the vectorized
StringMethods.startswith:
In [72]: df['tel0'].str.startswith('06')
Out[72]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
9 False
Name: tel0, dtype: bool
To combine two boolean Series with logical-or, use |:
In [73]: df['tel0'].str.startswith('06') | df['tel1'].str.startswith('06')
Out[73]:
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 False
8 False
9 False
dtype: bool
Or, if you want to combine a list of boolean Series using logical-or, you could use reduce:
In [79]: import functools
In [80]: import numpy as np
In [81]: mask = functools.reduce(np.logical_or, [df['tel{}'.format(i)].str.startswith('06') for i in range(3)])
In [82]: mask
Out[82]:
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 False
9 False
Name: tel0, dtype: bool
Once you have the boolean mask, you can select the associated rows using df.loc:
In [75]: df.loc[mask]
Out[75]:
tel0 tel1 tel2
4 04 05 06
5 05 06 07
6 06 07 08
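As a more compact alternative (a sketch, not part of the original answer), you can build the whole mask at once by applying startswith to every column and reducing with DataFrame.any:
In [76]: mask = df.apply(lambda col: col.str.startswith('06')).any(axis=1)
In [77]: df.loc[mask]
Out[77]:
tel0 tel1 tel2
4 04 05 06
5 05 06 07
6 06 07 08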
Note there are many other vectorized str methods besides startswith.
You might find str.contains useful for finding which rows contain a string. Note that str.contains interprets its argument as a regex pattern by default:
In [85]: df['tel0'].str.contains(r'6|7')
Out[85]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 True
8 False
9 False
Name: tel0, dtype: bool
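If you want a literal (non-regex) substring match instead, str.contains accepts regex=False (a quick sketch):
In [86]: df['tel0'].str.contains('6', regex=False)
which matches '6' as a plain substring rather than as a pattern.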

I like to use DataFrame.apply in such situations:
#search dataframe multiple columns
#generate some random numbers
import random as r
import pandas as pd
rand_numbers = [[r.randint(100000, 9999999) for __ in range(3)] for _ in range(20)]
df = pd.DataFrame.from_records(rand_numbers, columns=['tel1', 'tel2', 'tel3'])
df.head()
#a really simple search function
#if you need speed, use Cython here ;-)
def searchfilter(row, search='5'):
    #df.apply passes each row as a sequence of values
    for string in row:
        #the values are numbers here, so we must cast them
        if str(string).startswith(search):
            return True
    #no column matched
    return False
#apply the search function to each row (axis=1 runs it row-wise)
result_bool_array = df.apply(searchfilter, axis=1)
df[result_bool_array]
#other search with a lambda in apply
result_bool_array = df.apply(lambda row: searchfilter(row, search='6'), axis=1)
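For larger frames, the same row-wise filter can be written without a Python-level loop (a sketch for the same layout; the numeric columns must be cast to strings first):
result_bool_array = df.astype(str).apply(lambda col: col.str.startswith('6')).any(axis=1)
df[result_bool_array]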

Related

Pie Chart Issues With Booleans

I have a weird pie chart that isn't coming out right. The column I'm plotting is a boolean with only True and False values, and I'm just looking to make it return two slices.
Thank you!
As you didn't post any minimal data sample to reproduce your issue, let's take a look at some fictive data; maybe you'll get some ideas from that. Pie charts on booleans can be done this way. Let's assume your data looks like this:
var1 Verified
0 A True
1 A True
2 A True
3 A True
4 A True
5 A False
6 A False
7 A False
8 A False
9 A False
10 A False
11 B True
12 B True
13 B True
14 B True
15 B True
16 B False
17 B False
18 B True
19 B True
20 B True
21 B True
22 B True
23 B False
24 B False
25 B True
26 B True
27 B True
28 C True
29 C True
30 C False
31 C False
32 C True
33 C True
34 C True
35 C True
36 C True
37 C False
38 C False
39 C True
40 C True
41 C True
42 C True
43 C True
44 C False
45 C False
46 C True
47 C True
48 C True
49 C True
50 C True
51 C False
52 C False
53 C True
You can then do the following:
import matplotlib.pyplot as plt

def labelling(val):
    return f'{val / 100 * len(df):.0f}\n{val:.0f}%'

fig, (ax1) = plt.subplots(ncols=1, figsize=(10, 5))
df.groupby('var1').size().plot(kind='pie', autopct=labelling, textprops={'fontsize': 20},
                               colors=['red', 'green', 'blue'], ax=ax1)
ax1.set_ylabel('Per var1', size=22)
plt.show()
which gives you a pie chart showing the count and percentage for each var1 value.
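If what you actually want is a pie over the boolean column itself (an assumption about your goal), the same pattern works with value_counts:
df['Verified'].value_counts().plot(kind='pie', autopct=labelling, textprops={'fontsize': 20}, colors=['green', 'red'])
plt.show()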

On use of any method

Some code from Kaggle, which is said to remove outliers:
outliers_mask = (ft.abs() > ft.abs().quantile(outl_thresh)).any(axis=1)
Wouldn't any return a single boolean, i.e. whether an item is in a list or not?
So what the code says is: save in the mask all absolute values in ft which are above the quantile (set by another variable)? What does the any stand for there? Thank you.
The first part returns a DataFrame filled with boolean True/False values:
(ft.abs() > ft.abs().quantile(outl_thresh))
so DataFrame.any is added to test whether there is at least one True per row, reducing it to a boolean Series.
df = pd.DataFrame({'a': [False, False, True],
                   'b': [False, True, True],
                   'c': [False, False, True]})
print (df)
a b c
0 False False False
1 False True False
2 True True True
print (df.any(axis=1))
0 False <- no True in this row
1 True <- one True in this row
2 True <- three Trues in this row
dtype: bool
The similar method for testing whether all values are True is DataFrame.all:
print (df.all(axis=1))
0 False
1 False
2 True
dtype: bool
The reason: filtering with boolean indexing requires a boolean Series, not a boolean DataFrame.
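For illustration (a sketch, not from the original answer), indexing with a boolean DataFrame masks individual cells instead of filtering rows:
print (df[df])
#-> same shape as df; positions that are False become NaN, no rows are dropped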
Another data sample:
np.random.seed(2021)
ft = pd.DataFrame(np.random.randint(100, size=(10, 5))).sub(20)
print (ft)
0 1 2 3 4
0 65 37 -20 74 66
1 24 42 71 9 1
2 73 4 -8 50 50
3 13 -13 -19 77 6
4 46 28 79 43 29
5 -4 30 34 32 73
6 -15 29 18 -6 51
7 65 50 21 1 5
8 -10 16 -1 37 62
9 70 -5 20 56 33
outl_thresh = 0.95
print (ft.abs().quantile(outl_thresh))
0 71.65
1 46.40
2 75.40
3 75.65
4 69.85
Name: 0.95, dtype: float64
print((ft.abs() > ft.abs().quantile(outl_thresh)))
0 1 2 3 4
0 False False False False False
1 False False False False False
2 True False False False False
3 False False False True False
4 False False True False False
5 False False False False True
6 False False False False False
7 False True False False False
8 False False False False False
9 False False False False False
outliers_mask = (ft.abs() > ft.abs().quantile(outl_thresh)).any(axis=1)
print (outliers_mask)
0 False
1 False
2 True
3 True
4 True
5 True
6 False
7 True
8 False
9 False
dtype: bool
df1 = ft[outliers_mask]
print (df1)
0 1 2 3 4
2 73 4 -8 50 50
3 13 -13 -19 77 6
4 46 28 79 43 29
5 -4 30 34 32 73
7 65 50 21 1 5
df2 = ft[~outliers_mask]
print (df2)
0 1 2 3 4
0 65 37 -20 74 66
1 24 42 71 9 1
6 -15 29 18 -6 51
8 -10 16 -1 37 62
9 70 -5 20 56 33

Pandas subtract columns with groupby and mask

For groups under one "SN", I would like to compute three performance indicators for each group. A group's boundaries are the serial number SN plus a run of sequential boolean True values in mask (so multiple True sequences can exist under one SN).
The first indicator I want, Csub, subtracts the first and last values of each group in column 'C'. The second, Bmean, is the mean of each group in column 'B'.
For example:
In:
df = pd.DataFrame({"SN" : ["66", "66", "66", "77", "77", "77", "77", "77"], "B" : [-2, -1, -2, 3, 1, -1, 1, 1], "C" : [1, 2, 3, 15, 11, 2, 1, 2],
"mask" : [False, False, False, True, True, False, True, True] })
SN B C mask
0 66 -2 1 False
1 66 -1 2 False
2 66 -2 3 False
3 77 3 15 True
4 77 1 11 True
5 77 -1 2 False
6 77 1 1 True
7 77 1 2 True
Out:
SN B C mask Csub Bmean CdivB
0 66 -2 1 False NaN NaN NaN
1 66 -1 2 False NaN NaN NaN
2 66 -2 3 False NaN NaN NaN
3 77 3 15 True -4 13 -0.3
4 77 1 11 True -4 13 -0.3
5 77 -1 2 False NaN NaN NaN
6 77 1 1 True 1 1 1
7 77 1 2 True 1 1 1
I cooked up something like this, but it groups by the mask True/False values. It should group by SN and sequential True values, not ALL True values. Further, I cannot figure out how to squeeze a subtraction into this.
# Extracting performance values
perf = (df.assign(Bmean=df['B'], CdivB=df['C'] / df['B'])
          .groupby(['SN', 'mask'])
          .agg(dict(Bmean='mean', CdivB='mean'))
          .reset_index(drop=False)
        )
It's not pretty, but you can try the following.
First, prepare a 'group_key' column in order to group by consecutive True values in 'mask':
# Select the rows where 'mask' is True preceded by False.
first_true = df.loc[
    (df['mask'] == True)
    & (df['mask'].shift(fill_value=False) == False)
]
# Add the column.
df['group_key'] = pd.Series()
# Each row in first_true gets assigned a different 'group_key' value.
df.loc[first_true.index, 'group_key'] = range(len(first_true))
# Forward fill 'group_key' on mask.
df.loc[df['mask'], 'group_key'] = df.loc[df['mask'], 'group_key'].ffill()
Then we can group by 'SN' and 'group_key' and compute and assign the indicator values.
# Group by 'SN' and 'group_key'.
gdf = df.groupby(by=['SN', 'group_key'], as_index=False)
# Compute indicator values
indicators = pd.DataFrame(gdf.nth(0)) # pd.DataFrame used here to avoid a SettingWithCopyWarning.
indicators['Csub'] = gdf.nth(0)['C'].array - gdf.nth(-1)['C'].array
indicators['Bmean'] = gdf.mean()['B'].array
# Write values to original dataframe
df = df.join(indicators.reindex(columns=['Csub', 'Bmean']))
# Forward fill the indicator values
df.loc[df['mask'], ['Csub', 'Bmean']] = df.loc[df['mask'], ['Csub', 'Bmean']].ffill()
# Drop 'group_key' column
df = df.drop(columns=['group_key'])
I excluded 'CdivB' since I couldn't understand what its value should be.
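As an alternative (a sketch, not part of the original answer, reproducing the Csub values in the example output as last minus first, and Bmean as the literal mean of 'B'): runs of consecutive True values can be labelled by comparing 'mask' with its shift and taking a cumulative sum; transform then broadcasts each run's result back to its rows:
# Label each run of consecutive equal mask values.
df['run_id'] = (df['mask'] != df['mask'].shift()).cumsum()
g = df[df['mask']].groupby(['SN', 'run_id'])
# Csub: last minus first C per run; Bmean: mean of B per run.
df.loc[df['mask'], 'Csub'] = g['C'].transform(lambda s: s.iloc[-1] - s.iloc[0])
df.loc[df['mask'], 'Bmean'] = g['B'].transform('mean')
df = df.drop(columns='run_id')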

Pandas: How do I merge the values of two dataframe columns that match within some tolerance?

I am trying to match two sets of rows in pandas dataframes containing positive and negative data, to within some user-defined tolerance, e.g. (initially):
timestamp value has_a_matching_minus should_match_tolerance_equals_ten
01 36.00 False False
02 68.00 False False
03 131.00 False False
04 94.00 False True
05 -1000.00 False False
06 100.00 False True
07 540.00 False False
08 -100.00 False False
09 54.00 False False
(created with:
df = pd.DataFrame({'timestamp': range(9), 'value': [36, 68, 131, 94, -1000, 100, 540, -100, 54]}))
The plusses may or may not have one (or more) match among the minuses. If a plus does have a match within the tolerance, the corresponding row of plusses must have its column 'has_a_matching_minus' set to True (otherwise it remains False).
I know I can make use of df.between(low, high), but it only takes low and high as scalars, not as Series/DataFrame columns.
How can I avoid the following (slow!) for loop over between? Should I rather be using merge, etc.?
import numpy as np
import pandas as pd

minuses = data[data['value'] < 0.0]
plusses = data[data['value'] > 0.0].copy()
tolerance = 10.0
match_queries = np.abs(minuses['value'])
match_queries_high = match_queries + tolerance
match_queries_low = match_queries - tolerance
plusses['has_a_matching_minus'] = False
for (l, h) in zip(match_queries_low, match_queries_high):
    in_range = plusses['value'].between(l, h).astype(bool)
    plusses['has_a_matching_minus'] = plusses['has_a_matching_minus'] | in_range
assert (plusses['has_a_matching_minus'] == plusses['should_match_tolerance_equals_ten']).all(), 'The acid test'
I'm not sure I got the details of the question 100%, but the following can probably show how to approach it.
Suppose you start with
df = pd.DataFrame({'timestamp': range(9), 'value': [36, 68, 131, 94, -1000, 100, 540, -100, 54]})
Use a dummy column to perform a self outer join:
df['dummy'] = 1
merged = pd.merge(df, df, on='dummy', how='outer')
Now calculate, per timestamp, whether there is some other row with a negative value whose absolute value is within 10 of its own absolute value:
merged['has_a_matching_minus'] = (merged.timestamp_x != merged.timestamp_y) & (merged.value_y < 0) & ((merged.value_x.abs() - merged.value_y.abs()).abs() < 10)
>>> merged.has_a_matching_minus.astype(int).groupby(merged.timestamp_x).max().astype(bool).to_frame()
has_a_matching_minus
timestamp_x
0 False
1 False
2 False
3 True
4 False
5 True
6 False
7 False
8 False
You can easily merge this into the original frame. If you need several columns, perform their calculations on merged similarly.
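For example (a sketch, not part of the original answer), the per-timestamp result can be written back with groupby and map:
matches = merged.groupby('timestamp_x')['has_a_matching_minus'].any()
df['has_a_matching_minus'] = df['timestamp'].map(matches)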

Calculations within the same category

Small data frame example:
ID V1 V2 is
1 01 23569.5 0.138996 FALSE
2 01 23611.5 1.318343 TRUE
3 01 23636.0 0.071871 FALSE
4 01 23665.5 0.081087 FALSE
5 01 33417.5 0.102158 FALSE
6 01 33563.5 0.119645 FALSE
7 01 42929.5 0.175000 FALSE
8 01 44552.5 0.066056 FALSE
9 01 45539.5 0.227691 FALSE
10 01 46984.5 0.649687 FALSE
11 01 47018.0 0.932445 FALSE
12 02 23611.5 1.418377 TRUE
13 02 23667.5 0.474754 FALSE
14 02 46984.0 0.443233 FALSE
15 02 47018.0 0.847738 FALSE
16 02 47051.5 0.446792 FALSE
17 02 47096.5 3.602696 FALSE
18 03 23464.0 1.010199 FALSE
19 03 23523.5 0.150067 FALSE
20 03 23611.5 1.273281 TRUE
21 03 29608.0 0.071324 FALSE...
There is only one row within each ID category with is=T. I would like to know a convenient way of calculating the ratio V2(is=F) / V2(is=T) within each ID, adding the result as a new column/vector, like this:
ID V1 V2 is Ratio
1 1 23569.5 0.138996 FALSE 0.10543235
2 1 23611.5 1.318343 TRUE 1
3 1 23636 0.071871 FALSE 0.054516162
4 1 23665.5 0.081087 FALSE 0.061506755
5 1 33417.5 0.102158 FALSE 0.077489697
6 1 33563.5 0.119645 FALSE 0.090754075
7 1 42929.5 0.175000 FALSE 0.132742389
8 1 44552.5 0.066056 FALSE 0.050105322
9 1 45539.5 0.227691 FALSE 0.172709985
10 1 46984.5 0.649687 FALSE 0.492805742
11 1 47018 0.932445 FALSE 0.707285585
12 2 23611.5 1.418377 TRUE 1
13 2 23667.5 0.474754 FALSE 0.334716369
14 2 46984 0.443233 FALSE 0.312493082
15 2 47018 0.847738 FALSE 0.597681716
16 2 47051.5 0.446792 FALSE 0.315002288
17 2 47096.5 3.602696 FALSE 2.540012987
18 3 23464 1.010199 FALSE 0.793382608
19 3 23523.5 0.150067 FALSE 0.117858509
20 3 23611.5 1.273281 TRUE 1
21 3 29608 0.071324 FALSE 0.056015915...
I am sorry for the trivial question; however, my searching has not helped me find the solution I am looking for.
I assume that your data frame is called data and is already sorted by ID.
Select the records with is==TRUE:
data.true = data[data$is==TRUE,]
Obtain the run length encoding of ID:
rle.id = rle(data$ID)
For each V2 with is==TRUE, repeat it as many times as its group has members:
v2.true = rep(data.true$V2, rle.id$lengths)
Make the division:
data$Ratio = data$V2/v2.true