How can I get the percentage makeup of TRUE from a vector? - r

This is the code used to derive the first table in my question.
JH %>% group_by(ATT_ID, CAR=="B") %>%
summarize(count = n(), .groups = "drop")
ATT_ID  CAR == "B"  Count
ONE     FALSE           1
TWO     TRUE            1
THREE   TRUE            3
THREE   FALSE           5
FOUR    FALSE           2
FIVE    TRUE            4
SIX     TRUE            8
SIX     FALSE           4
How can I get the table above to look like:
ATT_ID  Percentage of "B"
ONE     0%
TWO     100%
THREE   37.5%
FOUR    0%
FIVE    100%
SIX     67%
Notice how some IDs appear twice, showing the presence of both FALSE and TRUE, whereas other IDs appear only once because only one of the two is present.
Thank you

You can do the following:
dt %>%
group_by(ATT_ID) %>%
summarize(perc = sprintf("%3.1f%%", 100*sum(Count*`CAR =="B"`)/sum(Count)))
Output:
# A tibble: 6 × 2
  ATT_ID perc
  <chr>  <chr>
1 FIVE   100.0%
2 FOUR   0.0%
3 ONE    0.0%
4 SIX    66.7%
5 THREE  37.5%
6 TWO    100.0%
Input:
structure(list(ATT_ID = c("ONE", "TWO", "THREE", "THREE", "FOUR",
"FIVE", "SIX", "SIX"), `CAR =="B"` = c(FALSE, TRUE, TRUE, FALSE,
FALSE, TRUE, TRUE, FALSE), Count = c(1, 1, 3, 5, 2, 4, 8, 4)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -8L))

Related

Pandas - drop n rows by column value

I need to remove the last n rows where Status equals 1.
v = df[df['Status'] == 1].count()
f = df[df['Status'] == 0].count()
diff = v - f
diff
df2 = df[~df['Status'] == 1].tail(diff).all() #ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
df2
Check whether Status is equal to 1 and get only those places where it is (.loc[lambda s: s] is doing that using boolean indexing). The index of n such rows from tail will be dropped:
df.drop(df.Status.eq(1).loc[lambda s: s].tail(n).index)
sample run:
In [343]: df
Out[343]:
Status
0 1
1 2
2 3
3 2
4 1
5 1
6 1
7 2
In [344]: n
Out[344]: 2
In [345]: df.Status.eq(1)
Out[345]:
0 True
1 False
2 False
3 False
4 True
5 True
6 True
7 False
Name: Status, dtype: bool
In [346]: df.Status.eq(1).loc[lambda s: s]
Out[346]:
0 True
4 True
5 True
6 True
Name: Status, dtype: bool
In [347]: df.Status.eq(1).loc[lambda s: s].tail(n)
Out[347]:
5 True
6 True
Name: Status, dtype: bool
In [348]: df.Status.eq(1).loc[lambda s: s].tail(n).index
Out[348]: Int64Index([5, 6], dtype='int64')
In [349]: df.drop(df.Status.eq(1).loc[lambda s: s].tail(n).index)
Out[349]:
Status
0 1
1 2
2 3
3 2
4 1
7 2
Using groupby() and transform() to mark rows to keep:
df = pd.DataFrame({"Status": [1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1]})
n = 3
df["Keep"] = df.groupby("Status")["Status"].transform(
lambda x: x.reset_index().index < len(x) - n if x.name == 1 else True
)
df.loc[df["Keep"]].drop(columns="Keep")
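For comparison, here is a sketch of the same idea using groupby().cumcount(), which avoids the reset_index trick; it assumes df and n are defined as above and is not part of the original answer:
# position of each row counted from the end of its Status group (last row gets 0)
pos_from_end = df.groupby("Status").cumcount(ascending=False)
# drop rows that are both Status == 1 and among the last n of that group
df_result = df[~(df["Status"].eq(1) & pos_from_end.lt(n))]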

How to change structure of a pandas dataframe

I have this structure
Structure 1
And I want to transform it into this one
Structure 2
Given df,
df1 = pd.DataFrame({'result1':[True, True, False],
'result2':[False, True, False],
'result3':[False, True, True]}, index=[1,2,3])
If 'id' is in the index, you'll need to reset_index first:
df1 = df1.reset_index()
df1.melt('index').query('value')
Output:
index variable value
0 1 result1 True
1 2 result1 True
4 2 result2 True
7 2 result3 True
8 3 result3 True
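Since the Structure 2 image isn't visible here, the exact target columns are a guess; assuming the goal is one row per (id, result) pair, a possible final cleanup of the melted frame (the names 'id' and 'result' are hypothetical) could be:
out = (df1.melt('index')
          .query('value')
          .drop(columns='value')
          .rename(columns={'index': 'id', 'variable': 'result'}))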

Removing 'dominated' rows from a Pandas dataframe (rows with all values lower than the values of any other row)

Edit: changed example df for clarity
I have a dataframe, similar to the one given below (except the real one has a few thousand rows and columns, and values being floats):
df = pd.DataFrame([[6, 5, 4, 3, 8], [6, 5, 4, 3, 6], [1, 1, 3, 9, 5], [0, 1, 2, 7, 4], [2, 0, 0, 4, 0]])
0 1 2 3 4
0 6 5 4 3 8
1 6 5 4 3 6
2 1 1 3 9 5
3 0 1 2 7 4
4 2 0 0 4 0
From this dataframe, I would like to drop all rows for which all values are lower than or equal to any other row. For this simple example, row 1 and row 3 should be deleted ('dominated' by row 0 and row 2, respectively):
filtered df:
0 1 2 3 4
0 6 5 4 3 8
2 1 1 3 9 5
4 2 0 0 4 0
It would be even better if the approach could take floating-point errors into account, since my real dataframe contains floats (i.e. instead of a strict lower/equal comparison, values that differ by less than a small tolerance, e.g. 0.0001, should be treated as equal).
My initial idea to tackle this problem was as follows:
Select the first row
Compare the other rows with it using a list comprehension (see below)
Drop all rows that returned True
Repeat for the next row
List comprehension code:
selected_row = df.loc[0]
[(df.loc[r] <= selected_row).all() and (df.loc[r] < selected_row).any() for r in range(len(df))]
[False, True, False, False, False]
This hardly seems efficient, however. Any suggestions on how to (efficiently) tackle this problem would be greatly appreciated.
We can try with broadcasting:
import pandas as pd
df = pd.DataFrame([
[6, 5, 4, 3, 8], [6, 5, 4, 3, 6], [1, 1, 3, 9, 5],
[0, 1, 2, 7, 4], [2, 0, 0, 4, 0]
])
# Ensure each row appears only once, since the filter below compares the
# number of matching rows to 1 (one and only one self-match per row)
df = df.drop_duplicates()
# Broadcasted comparison explanation below
cmp = (df.values[:, None] <= df.values).all(axis=2).sum(axis=1) == 1
# Filter using the results from the comparison
df = df[cmp]
df:
0 1 2 3 4
0 6 5 4 3 8
2 1 1 3 9 5
4 2 0 0 4 0
Intuition:
Broadcast the comparison operation over the DataFrame:
(df.values[:, None] <= df.values)
[[[ True True True True True]
[ True True True True False]
[False False False True False]
[False False False True False]
[False False False True False]] # df vs [6 5 4 3 8]
[[ True True True True True]
[ True True True True True]
[False False False True False]
[False False False True False]
[False False False True False]] # df vs [6 5 4 3 6]
[[ True True True False True]
[ True True True False True]
[ True True True True True]
[False True False False False]
[ True False False False False]] # df vs [1 1 3 9 5]
[[ True True True False True]
[ True True True False True]
[ True True True True True]
[ True True True True True]
[ True False False False False]] # df vs [0 1 2 7 4]
[[ True True True False True]
[ True True True False True]
[False True True True True]
[False True True True True]
[ True True True True True]]] # df vs [2 0 0 4 0]
Then we can check for all on axis=2:
(df.values[:, None] <= df.values).all(axis=2)
[[ True False False False False] # Rows le [6 5 4 3 8]
[ True True False False False] # Rows le [6 5 4 3 6]
[False False True False False] # Rows le [1 1 3 9 5]
[False False True True False] # Rows le [0 1 2 7 4]
[False False False False True]] # Rows le [2 0 0 4 0]
Then we can use sum to total how many rows are less than or equal to:
(df.values[:, None] <= df.values).all(axis=2).sum(axis=1)
[1 2 1 2 1]
The rows for which there is only 1 row that is less than or equal to them (the self-match) are the rows to keep. Because of drop_duplicates there are no duplicates in the dataframe, so the only True values are the self-match and any rows that are genuinely less than or equal:
(df.values[:, None] <= df.values).all(axis=2).sum(axis=1) == 1
[ True False True False True]
This then becomes the filter for the DataFrame:
df = df[[True, False, True, False, True]]
df:
0 1 2 3 4
0 6 5 4 3 8
2 1 1 3 9 5
4 2 0 0 4 0
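The floating-point tolerance part of the question is not addressed above. Here is a minimal sketch of one way to add it, assuming "dominated" mirrors the question's own list comprehension (all values lower or equal within a tolerance eps, and at least one value lower by more than eps); the eps value and this domination rule are assumptions, not part of the original answer:
import pandas as pd

df = pd.DataFrame([
    [6, 5, 4, 3, 8], [6, 5, 4, 3, 6], [1, 1, 3, 9, 5],
    [0, 1, 2, 7, 4], [2, 0, 0, 4, 0]
], dtype=float)

eps = 1e-4  # assumed tolerance, as suggested in the question
vals = df.to_numpy()
# diff[i, j, k] = row_i[k] - row_j[k]
diff = vals[:, None, :] - vals[None, :, :]
# row j is dominated by row i when row_i >= row_j - eps everywhere
# and row_i > row_j + eps somewhere
dominated = ((diff >= -eps).all(axis=2) & (diff > eps).any(axis=2)).any(axis=0)
filtered = df[~dominated]
On the example data this keeps rows 0, 2 and 4, matching the expected output; because of the strict "greater by more than eps" condition, exact or near-duplicate rows do not eliminate each other, so drop_duplicates is not needed in this variant.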
What is the expected proportion of dominant rows?
What is the size of the datasets that you will handle and the available memory?
While a solution like the broadcasting approach is very clever and efficient (vectorized), it will not be able to handle large dataframes, as the size of the broadcast will quickly exceed the memory limit (a 100,000×10 input array will not run on most computers).
Here is another approach that avoids testing all combinations and computing everything at once in memory. It is slower due to the loop, but it should be able to handle much larger arrays. It will also run faster when the proportion of dominated rows increases.
In summary, it compares the dataset with the first row, drops the dominated rows, shifts the first row to the end and starts again until a full loop has been completed. If rows get dropped over time, the number of comparisons decreases.
def get_dominants_loop(df):
    from tqdm import tqdm
    seen = []        # keep track of tested rows
    idx = df.index   # initial index
    for i in tqdm(range(len(df) + 1)):
        x = idx[0]
        if x in seen:  # done a full loop
            return df.loc[idx]
        seen.append(idx[0])
        # check which rows are dominated and drop them from the index
        idx = (df.loc[idx] - df.loc[x]).le(0).all(axis=1)
        # put tested row at the end
        idx = list(idx[~idx].index) + [x]
To drop the dominated rows:
df = get_dominants_loop(df)
NB: I used tqdm here to have a progress bar; it is not needed for the code to run.
Quick benchmarking in cases where the broadcast approach could not run: under 2 minutes for 100k×10 in a case where most rows are not dominated; 4s when most rows are dominated.
You can try:
df[df.shift(1)[0] >= df[1][0]]
output
   0  1  2  3  4
1  6  5  4  3  6
2  1  1  3  9  5
You can try something like this:
# Cartesian product
x = np.tile(df, df.shape[0]).reshape(-1, df.shape[1])
y = np.tile(df.T, df.shape[0]).T
# Remove same rows
#dups = np.all(x == y, axis=1)
#x = x[~dups]
#y = y[~dups]
x = np.delete(x, slice(None, None, df.shape[0]+1), axis=0)
y = np.delete(y, slice(None, None, df.shape[0]+1), axis=0)
# Keep dominant rows
m = x[np.all(x >= y, axis=1)]
>>> m
array([[6, 5, 4, 3, 8],
[1, 1, 3, 9, 5]])
# Before removing duplicates
# df1 = pd.DataFrame({'x': x.tolist(), 'y': y.tolist()})
>>> df1
x y
0 [6, 5, 4, 3, 8] [6, 5, 4, 3, 8] # dup
1 [6, 5, 4, 3, 8] [6, 5, 4, 3, 6] # DOMINANT
2 [6, 5, 4, 3, 8] [1, 1, 3, 9, 5]
3 [6, 5, 4, 3, 8] [0, 1, 2, 7, 4]
4 [6, 5, 4, 3, 6] [6, 5, 4, 3, 8]
5 [6, 5, 4, 3, 6] [6, 5, 4, 3, 6] # dup
6 [6, 5, 4, 3, 6] [1, 1, 3, 9, 5]
7 [6, 5, 4, 3, 6] [0, 1, 2, 7, 4]
8 [1, 1, 3, 9, 5] [6, 5, 4, 3, 8]
9 [1, 1, 3, 9, 5] [6, 5, 4, 3, 6]
10 [1, 1, 3, 9, 5] [1, 1, 3, 9, 5] # dup
11 [1, 1, 3, 9, 5] [0, 1, 2, 7, 4] # DOMINANT
12 [0, 1, 2, 7, 4] [6, 5, 4, 3, 8]
13 [0, 1, 2, 7, 4] [6, 5, 4, 3, 6]
14 [0, 1, 2, 7, 4] [1, 1, 3, 9, 5]
15 [0, 1, 2, 7, 4] [0, 1, 2, 7, 4] # dup
Here is a way using df.apply()
m = (pd.concat(df.apply(lambda x: df.ge(x,axis=1),axis=1).tolist(),keys = df.index)
.all(axis=1)
.groupby(level=0)
.sum()
.eq(1))
ndf = df.loc[m]
Output:
0 1 2 3 4
0 6 5 4 3 8
2 1 1 3 9 5
4 2 0 0 4 0

geom_violin using the weight aesthetic unexpectedly drop levels

library(tidyverse)
set.seed(12345)
dat <- data.frame(year = c(rep(1990, 100), rep(1991, 100), rep(1992, 100)),
fish_length = sample(x = seq(from = 10, 131, by = 0.1), 300, replace = F),
nb_caught = sample(x = seq(from = 1, 200, by = 0.1), 300, replace = T),
stringsAsFactors = F) %>%
mutate(age = ifelse(fish_length < 20, 1,
ifelse(fish_length >= 20 & fish_length < 100, 2,
ifelse(fish_length >= 100 & fish_length < 130, 3, 4)))) %>%
arrange(year, fish_length)
head(dat)
year fish_length nb_caught age
1 1990 10.1 45.2 1
2 1990 10.7 170.0 1
3 1990 10.9 62.0 1
4 1990 12.1 136.0 1
5 1990 14.1 80.8 1
6 1990 15.0 188.9 1
dat %>% group_by(year) %>% summarise(ages = n_distinct(age)) # Only 1992 has age 4 fish
# A tibble: 3 x 2
year ages
<dbl> <int>
1 1990 3
2 1991 3
3 1992 4
dat %>% filter(age == 4) # only 1 row for age 4
year fish_length nb_caught age
1 1992 130.8 89.2 4
Here:
year = year of sampling
fish_length = length of the fish in cm
nb_caught = number of fish caught following the use of an age-length key, hence explaining the presence of decimals
age = age of the fish
graph1: geom_violin not using the weight aesthetic.
Here, I have to copy each line of dat according to the value found in nb_caught.
dim(dat) # 300 rows
dat_graph1 <- dat[rep(1:nrow(dat), floor(dat$nb_caught)), ]
dim(dat_graph1) # 30932 rows
dat_graph1$nb_caught <- NULL # useless now
sum(dat$nb_caught) - nrow(dat_graph1) # 128.2 rows lost here
Since I have decimal values of nb_caught, I took the integer value to create dat_graph1. I lost 128.2 "rows" in the process.
Now for the graph:
dat_tile <- data.frame(year = sort(unique(dat$year))[sort(unique(dat$year)) %% 2 == 0])
# for the figure's background
graph1 <- ggplot(data = dat_graph1,
aes(x = as.factor(year), y = fish_length, fill = as.factor(age),
color = as.factor(age), .drop = F)) +
geom_tile(data = dat_tile, aes(x = factor(year), y = 1, height = Inf, width = 1),
fill = "grey80", inherit.aes = F) +
geom_violin(draw_quantiles = c(0.05, 0.5, 0.95), color = "black",
scale = "width", position = "dodge") +
scale_x_discrete(expand = c(0,0)) +
labs(x = "Year", y = "Fish length", fill = "Age", color = "Age", title = "graph1") +
scale_fill_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
scale_color_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
scale_y_continuous(expand = expand_scale(mult = 0.01)) +
theme_bw()
graph1
Note here that I have a flat bar for age 4 in year 1992.
dat_graph1 %>% filter(year == 1992, age == 4) %>% pull(fish_length) %>% unique
[1] 130.8
That is because I only have one length for that particular year-age combination.
graph2: geom_violin using the weight aesthetic.
Now, instead of copying each row of dat by the value of number_caught, let's use the weight aesthetic.
Let's calculate the weight wt that each line of dat will have in the calculation of the density curve of each year-age combination.
dat_graph2 <- dat %>%
group_by(year, age) %>%
mutate(wt = nb_caught / sum(nb_caught)) %>%
as.data.frame()
head(dat_graph2)
year fish_length nb_caught age wt
1 1990 10.1 45.2 1 0.03573123
2 1990 10.7 170.0 1 0.13438735
3 1990 10.9 62.0 1 0.04901186
4 1990 12.1 136.0 1 0.10750988
5 1990 14.1 80.8 1 0.06387352
6 1990 15.0 188.9 1 0.14932806
graph2 <- ggplot(data = dat_graph2,
aes(x = as.factor(year), y = fish_length, fill = as.factor(age),
color = as.factor(age), .drop = F)) +
geom_tile(data = dat_tile, aes(x = factor(year), y = 1, height = Inf, width = 1),
fill = "grey80", inherit.aes = F) +
geom_violin(aes(weight = wt), draw_quantiles = c(0.05, 0.5, 0.95), color = "black",
scale = "width", position = "dodge") +
scale_x_discrete(expand = c(0,0)) +
labs(x = "Year", y = "Fish length", fill = "Age", color = "Age", title = "graph2") +
scale_fill_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
scale_color_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
scale_y_continuous(expand = expand_scale(mult = 0.01)) +
theme_bw()
graph2
dat_graph2 %>% filter(year == 1992, age == 4)
year fish_length nb_caught age wt
1 1992 130.8 89.2 4 1
graph2
Note here that the flat bar for age 4 in year 1992 seen on graph1 has been dropped here even though the line exists in dat_graph2.
My questions
Why is the age 4 in 1992 level dropped when using the weight aesthetic? How can I overcome this?
Why are the two graphs not visually alike even though they used the same data?
Thanks in advance for your help!
1.
Problem 1 is not related to using the weight aesthetic. You can check this by dropping the weight aesthetic in the code for your second graph. The problem is that the algorithm for computing the density fails when there are too few observations.
That is the reason why group 4 shows up in graph 1 with the expanded dataset: there you increase the number of observations by replicating rows.
Unfortunately, geom_violin gives no warning in your specific case. However, if you filter dat_graph2 for age == 4, geom_violin gives you the warning:
Warning message:
Computation failed in `stat_ydensity()`:
replacement has 1 row, data has 0
geom_density is much clearer on this issue, warning that groups with fewer than two observations have been dropped.
Unfortunately, I have no solution to overcome this, besides working with the expanded dataset.
2.
Concerning problem 2, I have no convincing answer except to guess that this is related to the details of the kernel density estimator used by geom_violin, geom_density, etc., and perhaps also to the number of data points.

Pandas subtract columns with groupby and mask

For groups under one "SN", I would like to compute three performance indicators for each group. A group's boundaries are the serial number SN and a run of sequential Boolean True values in mask. (So multiple True sequences can exist under one SN.)
The first indicator I want is Csub, the difference between the first and last values of column 'C' within each group. The second, Bmean, is the mean of column 'B' within each group.
For example:
In:
df = pd.DataFrame({"SN" : ["66", "66", "66", "77", "77", "77", "77", "77"], "B" : [-2, -1, -2, 3, 1, -1, 1, 1], "C" : [1, 2, 3, 15, 11, 2, 1, 2],
"mask" : [False, False, False, True, True, False, True, True] })
  SN  B   C   mask
0 66 -2   1  False
1 66 -1   2  False
2 66 -2   3  False
3 77  3  15   True
4 77  1  11   True
5 77 -1   2  False
6 77  1   1   True
7 77  1   2   True
Out:
  SN  B   C   mask  Csub  Bmean  CdivB
0 66 -2   1  False   NaN    NaN    NaN
1 66 -1   2  False   NaN    NaN    NaN
2 66 -2   3  False   NaN    NaN    NaN
3 77  3  15   True    -4     13   -0.3
4 77  1  11   True    -4     13   -0.3
5 77 -1   2  False   NaN    NaN    NaN
6 77  1   1   True     1      1      1
7 77  1   2   True     1      1      1
I cooked up something like this, but it groups by the mask True/False values. It should group by SN and sequential True values, not ALL True values. Further, I cannot figure out how to get the subtraction squeezed into this.
# Extracting performance values
perf = (df.assign(
Bmean = df['B'], CdivB = df['C']/df['B']
).groupby(['SN','mask'])
.agg(dict(Bmean ='mean', CdivB = 'mean'))
.reset_index(drop=False)
)
It's not pretty, but you can try the following.
First, prepare a 'group_key' column in order to group by consecutive True values in 'mask':
# Select the rows where 'mask' is True preceded by False.
first_true = df.loc[
(df['mask'] == True)
& (df['mask'].shift(fill_value=False) == False)
]
# Add the column.
df['group_key'] = pd.Series()
# Each row in first_true gets assigned a different 'group_key' value.
df.loc[first_true.index, 'group_key'] = range(len(first_true))
# Forward fill 'group_key' on mask.
df.loc[df['mask'], 'group_key'] = df.loc[df['mask'], 'group_key'].ffill()
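As a side note, the same consecutive-True group key can be built more compactly with a cumulative sum over the run starts. This is only a sketch of an alternative, assuming the same df; the key values differ from the range-based ones above, but the resulting grouping is identical:
# start of each True run: True preceded by False
run_start = df['mask'] & ~df['mask'].shift(fill_value=False)
# number the runs and keep the key only where mask is True
df['group_key'] = run_start.cumsum().where(df['mask'])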
Then we can group by 'SN' and 'group_key' and compute and assign the indicator values.
# Group by 'SN' and 'group_key'.
gdf = df.groupby(by=['SN', 'group_key'], as_index=False)
# Compute indicator values
indicators = pd.DataFrame(gdf.nth(0)) # pd.DataFrame used here to avoid a SettingWithCopyWarning.
indicators['Csub'] = gdf.nth(0)['C'].array - gdf.nth(-1)['C'].array
indicators['Bmean'] = gdf.mean()['B'].array
# Write values to original dataframe
df = df.join(indicators.reindex(columns=['Csub', 'Bmean']))
# Forward fill the indicator values
df.loc[df['mask'], ['Csub', 'Bmean']] = df.loc[df['mask'], ['Csub', 'Bmean']].ffill()
# Drop 'group_key' column
df = df.drop(columns=['group_key'])
I excluded 'CdivB' since I couldn't understand what its value should be.
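Alternatively, while 'group_key' is still present (i.e. before it is dropped), the per-group values can be broadcast back with transform instead of the join/ffill steps. This is only a sketch along the same lines, not part of the original answer; Csub mirrors the first-minus-last subtraction used above:
# group only the masked rows by serial number and run key
g = df.loc[df['mask']].groupby(['SN', 'group_key'])
# broadcast the per-group values back onto the masked rows
df.loc[df['mask'], 'Csub'] = g['C'].transform('first') - g['C'].transform('last')
df.loc[df['mask'], 'Bmean'] = g['B'].transform('mean')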