Calculations within the same category - dataframe

Small data frame example:
ID V1 V2 is
1 01 23569.5 0.138996 FALSE
2 01 23611.5 1.318343 TRUE
3 01 23636.0 0.071871 FALSE
4 01 23665.5 0.081087 FALSE
5 01 33417.5 0.102158 FALSE
6 01 33563.5 0.119645 FALSE
7 01 42929.5 0.175000 FALSE
8 01 44552.5 0.066056 FALSE
9 01 45539.5 0.227691 FALSE
10 01 46984.5 0.649687 FALSE
11 01 47018.0 0.932445 FALSE
12 02 23611.5 1.418377 TRUE
13 02 23667.5 0.474754 FALSE
14 02 46984.0 0.443233 FALSE
15 02 47018.0 0.847738 FALSE
16 02 47051.5 0.446792 FALSE
17 02 47096.5 3.602696 FALSE
18 03 23464.0 1.010199 FALSE
19 03 23523.5 0.150067 FALSE
20 03 23611.5 1.273281 TRUE
21 03 29608.0 0.071324 FALSE...
There is only one row within each ID-category with is=T. I would like to know a convenient way of calculating the ratio V2 (is=F)/V2 (is=T) within each ID and add the result in a new column/vector with a result like this:
ID V1 V2 is Ratio
1 1 23569.5 0.138996 FALSE 0.10543235
2 1 23611.5 1.318343 TRUE 1
3 1 23636 0.071871 FALSE 0.054516162
4 1 23665.5 0.081087 FALSE 0.061506755
5 1 33417.5 0.102158 FALSE 0.077489697
6 1 33563.5 0.119645 FALSE 0.090754075
7 1 42929.5 0.175000 FALSE 0.132742389
8 1 44552.5 0.066056 FALSE 0.050105322
9 1 45539.5 0.227691 FALSE 0.172709985
10 1 46984.5 0.649687 FALSE 0.492805742
11 1 47018 0.932445 FALSE 0.707285585
12 2 23611.5 1.418377 TRUE 1
13 2 23667.5 0.474754 FALSE 0.334716369
14 2 46984 0.443233 FALSE 0.312493082
15 2 47018 0.847738 FALSE 0.597681716
16 2 47051.5 0.446792 FALSE 0.315002288
17 2 47096.5 3.602696 FALSE 2.540012987
18 3 23464 1.010199 FALSE 0.793382608
19 3 23523.5 0.150067 FALSE 0.117858509
20 3 23611.5 1.273281 TRUE 1
21 3 29608 0.071324 FALSE 0.056015915...
I am sorry for the trivial question. However, my searches have not turned up the solution I am looking for.

I assume that your data frame is called data and is already sorted by ID.
Select the records with is==TRUE:
data.true = data[data$is==TRUE,]
Obtain the run-length encoding of ID:
rle.id = rle(data$ID)
Repeat each group's is==TRUE value of V2 once for every member of that group:
v2.true = rep(data.true$V2, rle.id$lengths)
Then do the division:
data$Ratio = data$V2/v2.true
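For readers working in pandas rather than R, the same within-group ratio can be sketched as below. This is only an illustration under assumptions: the small frame is a hypothetical subset of the question's data, and the name reference is made up for the example. Unlike the run-length approach, it looks up each row's reference value by ID, so it does not rely on the rows being sorted.

```python
import pandas as pd

# Hypothetical subset of the question's data (IDs 1 and 2 only).
data = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "V2": [0.138996, 1.318343, 0.071871, 1.418377, 0.474754],
    "is": [False, True, False, True, False],
})

# V2 of the single is == True row in each ID, indexed by ID.
reference = data.loc[data["is"]].set_index("ID")["V2"]

# Look up each row's reference value by its ID, then divide.
data["Ratio"] = data["V2"] / data["ID"].map(reference)
print(data)
```

The is == True rows divide by themselves, so their ratio comes out as 1, matching the desired output in the question.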

Related

Pie Chart Issues With Booleans

I have a weird pie chart that isn't coming out right. The column that I'm passing in is a boolean with only true and false values, and I just want the chart to show two values.
Thank you!
As you didn't post a minimal data sample to reproduce your issue, let's look at some fictitious data; maybe you'll get some ideas from that. Pie charts on data like this can be done as follows. Let's assume your data looks like this:
var1 Verified
0 A True
1 A True
2 A True
3 A True
4 A True
5 A False
6 A False
7 A False
8 A False
9 A False
10 A False
11 B True
12 B True
13 B True
14 B True
15 B True
16 B False
17 B False
18 B True
19 B True
20 B True
21 B True
22 B True
23 B False
24 B False
25 B True
26 B True
27 B True
28 C True
29 C True
30 C False
31 C False
32 C True
33 C True
34 C True
35 C True
36 C True
37 C False
38 C False
39 C True
40 C True
41 C True
42 C True
43 C True
44 C False
45 C False
46 C True
47 C True
48 C True
49 C True
50 C True
51 C False
52 C False
53 C True
You can then do the following:
import matplotlib.pyplot as plt
def labelling(val):
    # val is the slice's percentage; convert it back to a row count for the label
    return f'{val / 100 * len(df):.0f}\n{val:.0f}%'
fig, (ax1) = plt.subplots(ncols=1, figsize=(10, 5))
df.groupby('var1').size().plot(kind='pie', autopct=labelling, textprops={'fontsize': 20}, colors=['red', 'green', 'blue'], ax=ax1)
ax1.set_ylabel('Per var1', size=22)
plt.show()
which gives you a pie chart with one slice per var1 value, each labelled with its row count and percentage.
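The example above slices the pie by var1. If the goal is one slice for True and one for False, a minimal sketch is to count the boolean column itself before plotting. The data here is hypothetical, and the plotting call is left as a comment so the snippet runs without matplotlib.

```python
import pandas as pd

# Hypothetical data: a boolean column with 3 True and 2 False values.
df = pd.DataFrame({"Verified": [True, True, False, True, False]})

# One entry per distinct value, so a boolean column yields two slices.
counts = df["Verified"].value_counts()
print(counts)

# To draw it (requires matplotlib):
# counts.plot(kind="pie", autopct="%.0f%%")
```

Passing the raw boolean column straight to a pie plot would draw one slice per row, which is one possible cause of a chart that looks wrong.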

On use of any method

Some code from Kaggle, which is said to remove outliers:
outliers_mask = (ft.abs() > ft.abs().quantile(outl_thresh)).any(axis=1)
Wouldn't any return a single boolean, i.e. whether or not an item is in a list?
So does the code say: save in the mask all absolute values in ft that are above the quantile (set by another variable)? What does any stand for, and what is it for? Thank you.
I think the first part returns a DataFrame filled with boolean True and/or False values:
(ft.abs() > ft.abs().quantile(outl_thresh))
so DataFrame.any is added to test whether there is at least one True per row, which gives a boolean Series.
df = pd.DataFrame({'a':[False, False, True],
'b':[False, True, True],
'c':[False, False, True]})
print (df)
a b c
0 False False False
1 False True False
2 True True True
print (df.any(axis=1))
0 False <- no True in the row
1 True <- one True in the row
2 True <- three Trues in the row
dtype: bool
A similar method, which tests whether all values are True, is DataFrame.all:
print (df.all(axis=1))
0 False
1 False
2 True
dtype: bool
The reason is that filtering by boolean indexing needs a boolean Series, not a boolean DataFrame.
Another sample data:
np.random.seed(2021)
ft = pd.DataFrame(np.random.randint(100, size=(10, 5))).sub(20)
print (ft)
0 1 2 3 4
0 65 37 -20 74 66
1 24 42 71 9 1
2 73 4 -8 50 50
3 13 -13 -19 77 6
4 46 28 79 43 29
5 -4 30 34 32 73
6 -15 29 18 -6 51
7 65 50 21 1 5
8 -10 16 -1 37 62
9 70 -5 20 56 33
outl_thresh = 0.95
print (ft.abs().quantile(outl_thresh))
0 71.65
1 46.40
2 75.40
3 75.65
4 69.85
Name: 0.95, dtype: float64
print((ft.abs() > ft.abs().quantile(outl_thresh)))
0 1 2 3 4
0 False False False False False
1 False False False False False
2 True False False False False
3 False False False True False
4 False False True False False
5 False False False False True
6 False False False False False
7 False True False False False
8 False False False False False
9 False False False False False
outliers_mask = (ft.abs() > ft.abs().quantile(outl_thresh)).any(axis=1)
print (outliers_mask)
0 False
1 False
2 True
3 True
4 True
5 True
6 False
7 True
8 False
9 False
dtype: bool
df1 = ft[outliers_mask]
print (df1)
0 1 2 3 4
2 73 4 -8 50 50
3 13 -13 -19 77 6
4 46 28 79 43 29
5 -4 30 34 32 73
7 65 50 21 1 5
df2 = ft[~outliers_mask]
print (df2)
0 1 2 3 4
0 65 37 -20 74 66
1 24 42 71 9 1
6 -15 29 18 -6 51
8 -10 16 -1 37 62
9 70 -5 20 56 33

PostgreSQL Insert into Table A a Record for Each Record in Table B

I have three tables: checklists, checklist_items and checklist_item_types. In my scheduled job I want to INSERT a record into checklist_items for every record in checklist_item_types that meets a criterion (its is_active column being true).
Here is my checklists table:
id checklist_date notes
1 "2018-07-23" "Fixed extra stuff"
2 "2018-07-24" "These are some extra notes"
3 "2018-07-25" "Notes notes"
Here is my checklist_items table, data reduced:
id checklists_id checklists_item_types_id is_completed
1 1 1 false
2 1 2 true
3 1 3 true
...
34 2 16 true
35 2 17 true
36 2 18 true
And here is checklist_item_types, data reduced (for example assume all is_active are true except for 15):
id description is_active
1 "Unlock Entrances" true
2 "Ladies Locker Room Lights" true
3 "Check Hot Tubs (AM)" true
...
15 "Water Softener Boiler Room" false
16 "Water Softener Laundry" true
17 "Check/Stock Fire Logs" true
18 "Drain Steam Lines (4 locations)" true
So when my job runs I want checklist_items to get, using examples above, 17 new records (18 checklist_item_types minus 1 because it's false for is_active).
The "new" checklist_items, after the job runs once would look like:
id checklists_id checklists_item_types_id is_completed
1 1 1 false
2 1 2 true
3 1 3 true
...
34 2 16 true
35 2 17 true
36 2 18 true
---new data starting below---
37 3 1 false
38 3 2 false
39 3 3 false
40 3 4 false
41 3 5 false
42 3 6 false
43 3 7 false
44 3 8 false
45 3 9 false
46 3 10 false
47 3 11 false
48 3 12 false
49 3 13 false
50 3 14 false
51 3 16 false
52 3 17 false
53 3 18 false
You seem to want insert . . . select, taking the id from checklist_item_types and keeping only the active rows:
insert into checklist_items (checklists_id, checklists_item_types_id, is_completed)
select 3, cit.id, false
from checklist_item_types cit
where cit.is_active;
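The insert . . . select idea can be tried out end to end with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, so booleans are stored as 0/1, and only three of the item types are loaded. The variable new_checklist_id is a hypothetical name for the id of the checklist the job is filling.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE checklist_item_types (
        id INTEGER PRIMARY KEY, description TEXT, is_active INTEGER
    );
    CREATE TABLE checklist_items (
        id INTEGER PRIMARY KEY,
        checklists_id INTEGER,
        checklists_item_types_id INTEGER,
        is_completed INTEGER
    );
    INSERT INTO checklist_item_types (id, description, is_active) VALUES
        (15, 'Water Softener Boiler Room', 0),
        (16, 'Water Softener Laundry', 1),
        (17, 'Check/Stock Fire Logs', 1);
""")

new_checklist_id = 3  # hypothetical: the checklist the job is filling

# One new checklist_items row per active item type, all not yet completed.
conn.execute(
    """
    INSERT INTO checklist_items (checklists_id, checklists_item_types_id, is_completed)
    SELECT ?, cit.id, 0
    FROM checklist_item_types AS cit
    WHERE cit.is_active = 1
    """,
    (new_checklist_id,),
)
conn.commit()

rows = conn.execute(
    "SELECT checklists_id, checklists_item_types_id, is_completed "
    "FROM checklist_items ORDER BY checklists_item_types_id"
).fetchall()
print(rows)
```

The inactive type 15 is skipped, which mirrors the "18 minus 1" count described in the question.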
I would suggest creating classes to communicate with the database. Here is a portion of a class that extracts the checklist_item_types you need. As you can see, you need another class, "DatabaseConn", in the database.php file to open the connection to the database. You will also need some PHP code to call this function, plus a function that adds rows to the checklist_items table.
<?php
require_once("model/database.php");
class checklist_item_types extends DatabaseConn
{
    ...
    // Fetch item types with id >= $starting_id and the given is_active flag.
    public function get_records_by_id_and_active($starting_id, $is_active) {
        global $dbConn;
        // Placeholders keep the values out of the SQL string.
        $sql = "SELECT * FROM checklist_item_types
                WHERE id >= :starting_id AND is_active = :is_active";
        try {
            $statement = $dbConn->prepare($sql);
            $statement->bindValue(':starting_id', $starting_id, PDO::PARAM_INT);
            $statement->bindValue(':is_active', $is_active, PDO::PARAM_BOOL);
            $statement->execute();
            $result = $statement->fetchAll();
            $statement->closeCursor();
            return $result;
        } catch (PDOException $e) {
            echo $sql . "<br>" . $e->getMessage();
        }
    }
    ...
}
?>

Converting boolean to zero-or-one, for all elements in an array

I have the following datasets of boolean columns
date hr energy
0 5-Feb-18 False False
1 29-Jan-18 False False
2 6-Dec-17 True False
3 16-Nov-17 False False
4 14-Nov-17 True True
5 25-Oct-17 False False
6 24-Oct-17 False False
7 5-Oct-17 False False
8 3-Oct-17 False False
9 26-Sep-17 False False
10 13-Sep-17 True False
11 7-Sep-17 False False
12 31-Aug-17 False False
I want to multiply each boolean column by 1 to turn it into a dummy
I tried:
df = df.iloc[:, 1:]
for col in df:
    col = col*1
but the columns remain boolean, why?
Just use astype(int):
df.iloc[:,1:]=df.iloc[:,1:].astype(int)
df
Out[477]:
date hr energy
0 5-Feb-18 0 0
1 29-Jan-18 0 0
2 6-Dec-17 1 0
3 16-Nov-17 0 0
4 14-Nov-17 1 1
5 25-Oct-17 0 0
6 24-Oct-17 0 0
7 5-Oct-17 0 0
8 3-Oct-17 0 0
9 26-Sep-17 0 0
10 13-Sep-17 1 0
11 7-Sep-17 0 0
12 31-Aug-17 0 0
For future cases with values other than True or False: if you want to convert categorical values into numerical ones, you can always use the replace function.
df.iloc[:,1:]=df.iloc[:,1:].replace({True:1,False:0})
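As for why the loop in the question leaves the columns boolean, the sketch below may help. It assumes a shortened, hypothetical version of the question's frame. Iterating over a DataFrame yields the column labels, and col = col*1 only rebinds the loop variable, so nothing is ever written back to df. The names labels, dtype_after_loop and bool_cols are made up for the example.

```python
import pandas as pd

# Hypothetical, shortened version of the question's data.
df = pd.DataFrame({
    "date": ["5-Feb-18", "6-Dec-17", "14-Nov-17"],
    "hr": [False, True, True],
    "energy": [False, False, True],
})

# Iterating over a DataFrame yields column labels (strings), not the columns.
labels = [col for col in df]
print(labels)

# The loop from the question only rebinds the local name `col`; df is untouched.
for col in df:
    col = col * 1
dtype_after_loop = df["hr"].dtype
print(dtype_after_loop)

# Assigning the converted columns back to df is what changes it.
bool_cols = df.select_dtypes(include="bool").columns
df[bool_cols] = df[bool_cols].astype(int)
print(df)
```

Selecting the columns with select_dtypes is a design choice for the sketch; it avoids relying on the boolean columns sitting at fixed positions.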

Find string in multiple columns?

I have a dataframe with 3 columns: tel1, tel2, tel3.
I want to keep the rows that contain a specific value in one or more columns.
For example, I want to keep the rows where column tel1 or tel2 or tel3 starts with '06'.
How can I do that?
Thanks
Let's use this df as an example DataFrame:
In [54]: df = pd.DataFrame({'tel{}'.format(j):
['{:02d}'.format(i+j)
for i in range(10)] for j in range(3)})
In [71]: df
Out[71]:
tel0 tel1 tel2
0 00 01 02
1 01 02 03
2 02 03 04
3 03 04 05
4 04 05 06
5 05 06 07
6 06 07 08
7 07 08 09
8 08 09 10
9 09 10 11
You can find which values in df['tel0'] start with '06' using StringMethods.startswith:
In [72]: df['tel0'].str.startswith('06')
Out[72]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
9 False
Name: tel0, dtype: bool
To combine two boolean Series with logical-or, use |:
In [73]: df['tel0'].str.startswith('06') | df['tel1'].str.startswith('06')
Out[73]:
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 False
8 False
9 False
dtype: bool
Or, if you want to combine a list of boolean Series using logical-or, you could use reduce:
In [79]: import functools
In [80]: import numpy as np
In [80]: mask = functools.reduce(np.logical_or, [df['tel{}'.format(i)].str.startswith('06') for i in range(3)])
In [81]: mask
Out[81]:
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 False
9 False
Name: tel0, dtype: bool
Once you have the boolean mask, you can select the associated rows using df.loc:
In [75]: df.loc[mask]
Out[75]:
tel0 tel1 tel2
4 04 05 06
5 05 06 07
6 06 07 08
Note there are many other vectorized str methods besides startswith.
You might find str.contains useful for finding which rows contain a string. Note that str.contains interprets its argument as a regex pattern by default:
In [85]: df['tel0'].str.contains(r'6|7')
Out[85]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 True
8 False
9 False
Name: tel0, dtype: bool
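When every column should be checked, the per-column tests can also be collapsed with any(axis=1) instead of being combined by hand. The following is a minimal sketch under assumptions: the phone-number strings are hypothetical, and starts_with_06 is a name made up for the example.

```python
import pandas as pd

# Hypothetical phone-number strings in three columns.
df = pd.DataFrame({
    "tel1": ["0612", "0145", "0299"],
    "tel2": ["0733", "0655", "0388"],
    "tel3": ["0144", "0277", "0311"],
})

# Apply the vectorized str.startswith to each column, giving a boolean DataFrame.
starts_with_06 = df.apply(lambda column: column.str.startswith("06"))

# Collapse to one boolean per row: True if any column matched.
mask = starts_with_06.any(axis=1)
print(df.loc[mask])
```

This keeps the first two rows, since each has one column starting with '06', and drops the third.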
I like to use DataFrame.apply in such situations:
# search multiple DataFrame columns
import random as r
import pandas as pd
# generate some random numbers
rand_numbers = [[r.randint(100000, 9999999) for __ in range(3)] for _ in range(20)]
df = pd.DataFrame.from_records(rand_numbers, columns=['tel1', 'tel2', 'tel3'])
df.head()
# a really simple search function
# if you need speed, use Cython here ;-)
def searchfilter(row, search='5'):
    # with axis=1, df.apply passes each row as a Series
    for value in row:
        # value is a number here, so we must cast it to str
        if str(value).startswith(search):
            return True
    # only give up once every value in the row has been checked
    return False
# apply the search function to each row
result_bool_array = df.apply(searchfilter, axis=1)  # axis=1 runs it row-wise
df[result_bool_array]
# another search, passing the prefix through a lambda
result_bool_array = df.apply(lambda row: searchfilter(row, search='6'), axis=1)