I have a weird pie chart that isn't coming across right. The column that I'm typing in is a boolean with only true and false values and I'm just looking to make it so it returns two values.
Thank you!
Since you didn't post a minimal data sample to reproduce your issue, let's look at some fictitious data; maybe you'll get some ideas from it. Pie charts on booleans can be done this way. Let's assume your data looks like this:
var1 Verified
0 A True
1 A True
2 A True
3 A True
4 A True
5 A False
6 A False
7 A False
8 A False
9 A False
10 A False
11 B True
12 B True
13 B True
14 B True
15 B True
16 B False
17 B False
18 B True
19 B True
20 B True
21 B True
22 B True
23 B False
24 B False
25 B True
26 B True
27 B True
28 C True
29 C True
30 C False
31 C False
32 C True
33 C True
34 C True
35 C True
36 C True
37 C False
38 C False
39 C True
40 C True
41 C True
42 C True
43 C True
44 C False
45 C False
46 C True
47 C True
48 C True
49 C True
50 C True
51 C False
52 C False
53 C True
You can then do the following:
import matplotlib.pyplot as plt

# Label each slice with its row count and its percentage
def labelling(val):
    return f'{val / 100 * len(df):.0f}\n{val:.0f}%'

fig, ax1 = plt.subplots(ncols=1, figsize=(10, 5))
df.groupby('var1').size().plot(kind='pie', autopct=labelling, textprops={'fontsize': 20}, colors=['red', 'green', 'blue'], ax=ax1)
ax1.set_ylabel('Per var1', size=22)
plt.show()
which gives you a pie chart with one slice per var1 value, each labelled with its row count and percentage.
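Since the question asks for just two slices from a boolean column, here is a minimal sketch of that case. It assumes the boolean column is called Verified, as in the sample data above; the small DataFrame is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample with a boolean column named as in the data above
df = pd.DataFrame({'Verified': [True, True, True, True, True, False, False, False]})

# Count each boolean value; the index then holds the two slice labels
counts = df['Verified'].value_counts()
counts.index = counts.index.astype(str)

# Draw one slice per boolean value, annotated with its percentage
fig, ax = plt.subplots(figsize=(5, 5))
counts.plot(kind='pie', autopct='%1.0f%%', ax=ax)
ax.set_ylabel('Verified')
```

Call plt.show() (or fig.savefig with a file name of your choice) afterwards to display or save the chart.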
I have some code from Kaggle that is said to remove outliers:
outliers_mask = (ft.abs() > ft.abs().quantile(outl_thresh)).any(axis=1)
Wouldn't any return a single boolean, indicating whether an item is in a list or not?
So what the code says is: save in the mask all absolute values in ft that are above the quantile (set by another variable)? What does the any stand for, and what is it for? Thank you.
I think the first part returns a DataFrame filled with boolean True and/or False values:
(ft.abs() > ft.abs().quantile(outl_thresh))
so DataFrame.any is added to test whether there is at least one True per row, which reduces it to a boolean Series.
df = pd.DataFrame({'a':[False, False, True],
'b':[False, True, True],
'c':[False, False, True]})
print (df)
a b c
0 False False False
1 False True False
2 True True True
print (df.any(axis=1))
0 False <- no True per rows
1 True <- one True per rows
2 True <- three Trues per rows
dtype: bool
A similar method, which tests whether all values are True, is DataFrame.all:
print (df.all(axis=1))
0 False
1 False
2 True
dtype: bool
The reason is that filtering rows by boolean indexing needs a boolean Series, not a boolean DataFrame.
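To make the difference concrete, here is a small sketch contrasting the two kinds of mask. The frame and the threshold of 10 are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two columns, three rows
ft = pd.DataFrame({'a': [1, 50, 3], 'b': [2, 4, 60]})

# Element-wise comparison gives a boolean DataFrame of the same shape
bool_frame = ft > 10

# Indexing with a boolean DataFrame keeps every row and puts NaN where the mask is False
masked = ft[bool_frame]

# Reducing with any(axis=1) gives a boolean Series, which selects whole rows
filtered = ft[bool_frame.any(axis=1)]

print(masked)
print(filtered)
```

Only the second form drops rows, which is why the Kaggle snippet ends with .any(axis=1).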
Another example with sample data:
np.random.seed(2021)
ft = pd.DataFrame(np.random.randint(100, size=(10, 5))).sub(20)
print (ft)
0 1 2 3 4
0 65 37 -20 74 66
1 24 42 71 9 1
2 73 4 -8 50 50
3 13 -13 -19 77 6
4 46 28 79 43 29
5 -4 30 34 32 73
6 -15 29 18 -6 51
7 65 50 21 1 5
8 -10 16 -1 37 62
9 70 -5 20 56 33
outl_thresh = 0.95
print (ft.abs().quantile(outl_thresh))
0 71.65
1 46.40
2 75.40
3 75.65
4 69.85
Name: 0.95, dtype: float64
print((ft.abs() > ft.abs().quantile(outl_thresh)))
0 1 2 3 4
0 False False False False False
1 False False False False False
2 True False False False False
3 False False False True False
4 False False True False False
5 False False False False True
6 False False False False False
7 False True False False False
8 False False False False False
9 False False False False False
outliers_mask = (ft.abs() > ft.abs().quantile(outl_thresh)).any(axis=1)
print (outliers_mask)
0 False
1 False
2 True
3 True
4 True
5 True
6 False
7 True
8 False
9 False
dtype: bool
df1 = ft[outliers_mask]
print (df1)
0 1 2 3 4
2 73 4 -8 50 50
3 13 -13 -19 77 6
4 46 28 79 43 29
5 -4 30 34 32 73
7 65 50 21 1 5
df2 = ft[~outliers_mask]
print (df2)
0 1 2 3 4
0 65 37 -20 74 66
1 24 42 71 9 1
6 -15 29 18 -6 51
8 -10 16 -1 37 62
9 70 -5 20 56 33
I have the following MultiIndex dataframe:
Close ATR condition
Date Symbol
1990-01-01 A 24 1 True
B 72 1 False
C 40 3 False
D 21 5 True
1990-01-02 A 65 4 True
B 19 2 True
C 43 3 True
D 72 1 False
1990-01-03 A 92 5 False
B 32 3 True
C 52 2 False
D 33 1 False
I perform the following calculation on this dataframe:
data.loc[data.index.levels[0][0], 'Shares'] = 0
data.loc[data.index.levels[0][0], 'Closed_P/L'] = 0
data = data.reset_index()
Equity = 10000
def calcs(x):
    global Equity
    # Skip first date
    if x.index[0] == 0:
        return x
    # calculate Shares where condition is True
    x.loc[x['condition'] == True, 'Shares'] = np.floor((Equity * 0.02 / x['ATR']).astype(float))
    # other calculations
    x['Closed_P/L'] = x['Shares'] * x['Close']
    Equity += x['Closed_P/L'].sum()
    return x
data = data.groupby('Date').apply(calcs)
data['Equity'] = data.groupby('Date')['Closed_P/L'].transform('sum')
data['Equity'] = data.groupby('Symbol')['Equity'].cumsum() + Equity
data = data.set_index(['Date','Symbol'])
The output is:
Close ATR condition Shares Closed_P/L Equity
Date Symbol
1990-01-01 A 24 1.2 True 0.0 0.0 10000.0
B 72 1.4 False 0.0 0.0 10000.0
C 40 3 False 0.0 0.0 10000.0
D 21 5 True 0.0 0.0 10000.0
1990-01-02 A 65 4 True 50.0 3250.0 17988.0
B 19 2 True 100.0 1900.0 17988.0
C 43 3 True 66.0 2838.0 17988.0
D 72 1 False NaN NaN 17988.0
1990-01-03 A 92 5 False NaN NaN 21796.0
B 32 3 True 119.0 3808.0 21796.0
C 52 2 False NaN NaN 21796.0
D 33 1 False NaN NaN 21796.0
I want to forward fill Shares values - grouped by Symbol - in case condition evaluates to False (except for first date). So the Shares value on 1990-01-02 for D should be 0 (because on 1990-01-01 the Shares value for D was 0 and the condition on 1990-01-02 is False). Also values for Shares on 1990-01-03 for A, C and D should be 50, 66 and 0 respectively based on the logic described above. How can I do that?
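One possible way to get the requested Shares values is a grouped forward fill applied after the calculation. The sketch below rebuilds only the Shares column from the output above and assumes the index levels are named Date and Symbol. It fills the column after the fact, so Closed_P/L and Equity are not recomputed from the filled values.

```python
import numpy as np
import pandas as pd

# Rebuild the Shares column from the output above (other columns omitted)
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['1990-01-01', '1990-01-02', '1990-01-03']), list('ABCD')],
    names=['Date', 'Symbol'],
)
data = pd.DataFrame(
    {'Shares': [0, 0, 0, 0, 50, 100, 66, np.nan, np.nan, 119, np.nan, np.nan]},
    index=idx,
)

# Forward fill within each Symbol, so a NaN takes that symbol's previous value
data['Shares'] = data.groupby(level='Symbol')['Shares'].ffill()
print(data)
```

With this data, D gets 0 on 1990-01-02, and A, C and D get 50, 66 and 0 on 1990-01-03.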
I have three tables: checklists, checklist_items and checklist_item_types. In my scheduled job I want to INSERT a record into checklist_items for every record in checklist_item_types that meets a criterion (the is_active column being true).
Here is my checklists table:
id checklist_date notes
1 "2018-07-23" "Fixed extra stuff"
2 "2018-07-24" "These are some extra notes"
3 "2018-07-25" "Notes notes"
Here is my checklist_items table, data reduced:
id checklists_id checklists_item_types_id is_completed
1 1 1 false
2 1 2 true
3 1 3 true
...
34 2 16 true
35 2 17 true
36 2 18 true
And here is checklist_item_types, data reduced (for example assume all is_active are true except for 15):
id description is_active
1 "Unlock Entrances" true
2 "Ladies Locker Room Lights" true
3 "Check Hot Tubs (AM)" true
...
15 "Water Softener Boiler Room" false
16 "Water Softener Laundry" true
17 "Check/Stock Fire Logs" true
18 "Drain Steam Lines (4 locations)" true
So when my job runs I want checklist_items to get, using examples above, 17 new records (18 checklist_item_types minus 1 because it's false for is_active).
The "new" checklist_items, after the job runs once would look like:
id checklists_id checklists_item_types_id is_completed
1 1 1 false
2 1 2 true
3 1 3 true
...
34 2 16 true
35 2 17 true
36 2 18 true
---new data starting below---
37 3 1 false
38 3 2 false
39 3 3 false
40 3 4 false
41 3 5 false
42 3 6 false
43 3 7 false
44 3 8 false
45 3 9 false
46 3 10 false
47 3 11 false
48 3 12 false
49 3 13 false
50 3 14 false
51 3 16 false
52 3 17 false
53 3 18 false
You seem to want insert . . . select, filtered on is_active:
insert into checklist_items (checklists_id, checklists_item_types_id, is_completed)
select 3, cit.id, false
from checklist_item_types cit
where cit.is_active;
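The insert . . . select idea can be tried end to end with a throwaway database. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, since the question does not say which database is used; booleans are stored as 0/1 and only three item types are loaded.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real one (assumption)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE checklist_item_types (
        id INTEGER PRIMARY KEY, description TEXT, is_active INTEGER
    );
    CREATE TABLE checklist_items (
        id INTEGER PRIMARY KEY,
        checklists_id INTEGER,
        checklists_item_types_id INTEGER,
        is_completed INTEGER
    );
    INSERT INTO checklist_item_types VALUES
        (1, 'Unlock Entrances', 1),
        (15, 'Water Softener Boiler Room', 0),
        (16, 'Water Softener Laundry', 1);
""")

# One new checklist_items row per active item type, for checklist 3
new_checklist_id = 3
conn.execute(
    "INSERT INTO checklist_items (checklists_id, checklists_item_types_id, is_completed) "
    "SELECT ?, id, 0 FROM checklist_item_types WHERE is_active = 1 ORDER BY id",
    (new_checklist_id,),
)
conn.commit()

rows = conn.execute(
    "SELECT checklists_id, checklists_item_types_id, is_completed "
    "FROM checklist_items ORDER BY id"
).fetchall()
print(rows)  # [(3, 1, 0), (3, 16, 0)]
```

The inactive type 15 is skipped, which matches the 18 minus 1 count described in the question.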
I would suggest creating classes to communicate with the database. Here is a portion of a class to extract your required checklist_item_types. As you can see, you need another class "DatabaseConn" in the database.php file to open connection to the database. You will need some PHP code to call this function and also the function to add to checklist_items table.
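<?php
require_once("model/database.php");

class checklist_item_types extends DatabaseConn
{
    ...
    // Return every checklist_item_types row with id >= $starting_id and the given is_active flag
    public function get_records_by_id_and_active($starting_id, $is_active) {
        global $dbConn;
        // Placeholders keep the values out of the SQL string
        $sql = "SELECT * FROM checklist_item_types
                WHERE id >= :starting_id AND is_active = :is_active";
        try {
            $statement = $dbConn->prepare($sql);
            $statement->bindValue(':starting_id', $starting_id, PDO::PARAM_INT);
            $statement->bindValue(':is_active', $is_active, PDO::PARAM_BOOL);
            $statement->execute();
            $result = $statement->fetchAll();
            $statement->closeCursor();
            return $result;
        } catch (PDOException $e) {
            echo $sql . "<br>" . $e->getMessage();
        }
    }
    ...
}
?>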
df = [bigdataframe[['Action', 'Adventure','Animation',
'Childrens', 'Comedy', 'Crime','Documentary',
'Drama', 'Fantasy', 'FilmNoir', 'Horror',
'Musical',
'Mystery', 'Romance','SciFi', 'Thriller', 'War',
'Western']].sum(axis=1) > 1]
df
Out[8]:
[0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 False
9 True
10 False
11 True
12 True
13 True
14 True
15 False
16 True
17 False
18 True
19 False
20 False
21 True
22 True
23 True
24 False
25 True
26 True
27 True
28 True
29 True
99970 True
99971 True
99972 False
99973 True
99974 True
99975 True
99976 True
99977 True
99978 False
99979 False
99980 True
99981 False
99982 True
99983 False
99984 True
99985 True
99986 True
99987 True
99988 False
99989 True
99990 True
99991 True
99992 False
99993 True
99994 True
99995 True
99996 True
99997 True
99998 True
99999 False
Length: 100000, dtype: bool]
I have tried:
len(df[df==True])
Masking
They are in a list so shouldn't I just be able to count them? Or do I need to assign them numerical values, 1 for true and 0 for false and then use the count or sum function to find how many are true?
Demo:
In [386]: df = pd.DataFrame(np.random.rand(5,3), columns=list('ABC'))
In [387]: df
Out[387]:
A B C
0 0.228687 0.647431 0.526471
1 0.795122 0.915011 0.950481
2 0.386244 0.705412 0.420596
3 0.343213 0.928993 0.192527
4 0.201023 0.209281 0.304799
In [388]: df[['A','B','C']].sum(axis=1).gt(1.5)
Out[388]:
0 False
1 True
2 True
3 False
4 False
dtype: bool
In [389]: df[['A','B','C']].sum(axis=1).gt(1.5).sum()
Out[389]: 2
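Applied to the question, the same idea works once the comparison is no longer wrapped in a list, because the square brackets around the expression produce a one-element Python list instead of a Series. A minimal sketch with a made-up three-genre stand-in for bigdataframe:

```python
import pandas as pd

# Hypothetical stand-in for bigdataframe with 0/1 genre indicator columns
bigdataframe = pd.DataFrame({
    'Action': [1, 0, 1, 0],
    'Comedy': [1, 1, 0, 0],
    'Drama':  [0, 1, 0, 0],
})
genres = ['Action', 'Comedy', 'Drama']

# Boolean Series (no surrounding list): True where a row has more than one genre
mask = bigdataframe[genres].sum(axis=1) > 1

# True counts as 1 and False as 0, so sum() is the number of True values
n_true = int(mask.sum())
print(n_true)  # 2
```

No conversion to 1 and 0 is needed; summing a boolean Series already counts the True entries.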
To count the number of TRUE values in a list in R:
sum(unlist(your.list.object))
Small data frame example:
ID V1 V2 is
1 01 23569.5 0.138996 FALSE
2 01 23611.5 1.318343 TRUE
3 01 23636.0 0.071871 FALSE
4 01 23665.5 0.081087 FALSE
5 01 33417.5 0.102158 FALSE
6 01 33563.5 0.119645 FALSE
7 01 42929.5 0.175000 FALSE
8 01 44552.5 0.066056 FALSE
9 01 45539.5 0.227691 FALSE
10 01 46984.5 0.649687 FALSE
11 01 47018.0 0.932445 FALSE
12 02 23611.5 1.418377 TRUE
13 02 23667.5 0.474754 FALSE
14 02 46984.0 0.443233 FALSE
15 02 47018.0 0.847738 FALSE
16 02 47051.5 0.446792 FALSE
17 02 47096.5 3.602696 FALSE
18 03 23464.0 1.010199 FALSE
19 03 23523.5 0.150067 FALSE
20 03 23611.5 1.273281 TRUE
21 03 29608.0 0.071324 FALSE...
There is only one row within each ID-category with is=T. I would like to know a convenient way of calculating the ratio V2 (is=F)/V2 (is=T) within each ID and add the result in a new column/vector with a result like this:
ID V1 V2 is Ratio
1 1 23569.5 0.138996 FALSE 0.10543235
2 1 23611.5 1.318343 TRUE 1
3 1 23636 0.071871 FALSE 0.054516162
4 1 23665.5 0.081087 FALSE 0.061506755
5 1 33417.5 0.102158 FALSE 0.077489697
6 1 33563.5 0.119645 FALSE 0.090754075
7 1 42929.5 0.175000 FALSE 0.132742389
8 1 44552.5 0.066056 FALSE 0.050105322
9 1 45539.5 0.227691 FALSE 0.172709985
10 1 46984.5 0.649687 FALSE 0.492805742
11 1 47018 0.932445 FALSE 0.707285585
12 2 23611.5 1.418377 TRUE 1
13 2 23667.5 0.474754 FALSE 0.334716369
14 2 46984 0.443233 FALSE 0.312493082
15 2 47018 0.847738 FALSE 0.597681716
16 2 47051.5 0.446792 FALSE 0.315002288
17 2 47096.5 3.602696 FALSE 2.540012987
18 3 23464 1.010199 FALSE 0.793382608
19 3 23523.5 0.150067 FALSE 0.117858509
20 3 23611.5 1.273281 TRUE 1
21 3 29608 0.071324 FALSE 0.056015915...
I am sorry for the trivial question. However, my searches have not helped me find the solution I am looking for.
I assume that your dataframe is called data and already sorted by ID.
Select records with is==TRUE:
data.true = data[data$is==TRUE,]
Obtain run length encoding of ID:
rle.id = rle(data$ID)
For each V2 with is==TRUE, repeat it once for every member of its ID group:
v2.true = rep(data.true$V2, rle.id$lengths)
Make the division:
data$Ratio = data$V2/v2.true
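For readers working in pandas rather than R, the same ratio can be sketched as follows. This is a rough equivalent and not the answer's method: it does not need the data to be sorted by ID, but it still assumes exactly one is==TRUE row per ID. Only a few rows of the example are reproduced.

```python
import pandas as pd

# A few rows from the example above
df = pd.DataFrame({
    'ID': ['01', '01', '01', '02', '02'],
    'V2': [0.138996, 1.318343, 0.071871, 1.418377, 0.474754],
    'is': [False, True, False, True, False],
})

# Keep V2 only on the is==True rows (NaN elsewhere), then broadcast that
# single value to every row of the same ID; max skips the NaN entries
reference = df['V2'].where(df['is']).groupby(df['ID']).transform('max')

# Divide each V2 by the reference value of its group
df['Ratio'] = df['V2'] / reference
print(df)
```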