I want to create a column called lag_diff3 that holds, in every row, the lagged difference at A = "2003". I can do this with the following code, but it seems ugly. Is there a simpler way to write it?
A = c("2001", "2002", "2003", "2004")
B = c(10, 20, 60, 70)
dat = tibble(A = A, B = B) %>%
mutate(lag_diff1 = B - lag(B, 1),
lag_diff2 = ifelse(A != "2003", -100, lag_diff1),
lag_diff3 = max(lag_diff2))
> dat
# A tibble: 4 × 5
A B lag_diff1 lag_diff2 lag_diff3
<chr> <dbl> <dbl> <dbl> <dbl>
1 2001 10 NA -100 40
2 2002 20 10 -100 40
3 2003 60 40 40 40
4 2004 70 10 -100 40
You could do lag_diff1[A == "2003"]:
library(dplyr)
A = c("2001", "2002", "2003", "2004")
B = c(10, 20, 60, 70)
tibble(A = A, B = B) %>%
mutate(lag_diff1 = B - lag(B, 1),
lag_diff3 = lag_diff1[A == "2003"])
#> # A tibble: 4 × 4
#> A B lag_diff1 lag_diff3
#> <chr> <dbl> <dbl> <dbl>
#> 1 2001 10 NA 40
#> 2 2002 20 10 40
#> 3 2003 60 40 40
#> 4 2004 70 10 40
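One caveat worth noting (my addition, not part of the original answer): subsetting with A == "2003" assumes exactly one row matches; if the year were absent, mutate() would error on a zero-length value. Using match() instead yields NA in that case. A minimal sketch:
tibble(A = A, B = B) %>%
  mutate(lag_diff1 = B - lag(B, 1),
         # match() returns a single index, or NA when "2003" is absent,
         # so lag_diff3 is one value recycled down the column
         lag_diff3 = lag_diff1[match("2003", A)])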
Consider this sample data
# A tibble: 10 x 3
x y z
<int> <dbl> <dbl>
1 1 1 5
2 2 3 6
3 3 4 7
4 4 7 8
5 5 NA 9
6 6 12 10
7 7 NA NA
8 8 2 20
9 9 5.5 25
10 10 NA 8
I would like to mutate a new column value that contains the row sum if no NAs are present in any of the columns.
If there are NAs, take the highest value in the row times 1.2.
BUT, if only one column in the row has a value, take that value.
Finally, I'd like another column NA_column with the names of the columns containing NA in that row!
This is what I have in mind, but I could not figure out the rest:
df %>%
mutate(
value = case_when(any_vars(is.na()) ~ BIGGEST VALUE * 1.2,
TRUE ~ rowsum()),
NA_column = columns that has NA in that row
)
DATA
df <- tibble(
x = 1:10,
y = c(1, 3, 4, 7, NA, 12, NA, 2, 5.5, NA),
z = c(5:10, NA, 20, 25, 8)
)
Use rowwise() and c_across() to perform the operations:
library(dplyr)
df_out <- df %>%
rowwise() %>%
mutate(
value = ifelse(anyNA(c_across(everything())),
               max(c_across(everything()), na.rm = TRUE) * 1.2,
               x), # or 'y' or 'z' instead of x
NA_column = paste(colnames(df)[is.na(c_across(x:z))], collapse = " | ")
) %>%
ungroup()
df_out
# A tibble: 10 × 5
x y z value NA_column
<int> <dbl> <dbl> <dbl> <chr>
1 1 1 5 1 ""
2 2 3 6 2 ""
3 3 4 7 3 ""
4 4 7 8 4 ""
5 5 NA 9 10.8 "y"
6 6 12 10 6 ""
7 7 NA NA 8.4 "y | z"
8 8 2 20 8 ""
9 9 5.5 25 9 ""
10 10 NA 8 12 "y"
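Note that the ifelse() above keeps x itself for complete rows. If, per the question's spec, complete rows should get the row sum and single-value rows should keep their lone value, a hedged variant of the same rowwise mutate (assuming the data columns are x:z) would be:
df %>%
  rowwise() %>%
  mutate(
    value = case_when(
      # only one non-NA value in the row: keep it
      sum(!is.na(c_across(x:z))) == 1 ~ max(c_across(x:z), na.rm = TRUE),
      # some NAs: highest value times 1.2
      anyNA(c_across(x:z)) ~ max(c_across(x:z), na.rm = TRUE) * 1.2,
      # no NAs: row sum
      TRUE ~ sum(c_across(x:z))
    )
  ) %>%
  ungroup()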
Often when solving a problem such as this I find it best to lay out the solution in discrete steps. In this case, using tidyverse syntax, it is possible to create temporary columns containing the bits of data needed to ultimately compute the desired value.
I couldn't immediately improve upon the solution provided for the second part (NA_column) by @Julien above, so I've added that code chunk below.
df <- tibble(
x = 1:10,
y = c(1, 3, 4, 7, NA, 12, NA, 2, 5.5, NA),
z = c(5:10, NA, 20, 25, 8)
)
out <-
df %>%
mutate(
# get number of columns of data
num_cols = ncol(.),
# get number of values in row that are not NA
num_not_na = rowSums(!is.na(.))
) %>%
rowwise() %>%
mutate(
# get maximum value of data in row, note we need to be explicit about which columns are data, e.g., (1:3)
max_value = max(across(1:3), na.rm = TRUE)
) %>%
ungroup() %>%
mutate(
# get the desired value for the row
# if there are no NA values or only one non-NA value then we can just get the row sum,
# again we need to be explicit about the columns, e.g. [, 1:3]
# otherwise take the maximum value multiplied by 1.2
value = ifelse(num_cols == num_not_na | num_not_na == 1,
               rowSums(.[, 1:3], na.rm = TRUE),
               max_value * 1.2)
)
# with credit to @Julien for the following code to get the names of the NA columns
# (the %$% exposition pipe below requires the magrittr package)
library(magrittr)
out$NA_column <- sapply(1:nrow(out), function(i) {
out[i, ] %$%
colnames(.)[is.na(.)] %>%
paste(collapse = " | ")
})
# can optionally remove the temporary columns
out <-
out %>%
dplyr::select(
-c(num_cols, num_not_na, max_value)
)
# > out
# # A tibble: 10 x 5
# x y z value NA_column
# <int> <dbl> <dbl> <dbl> <chr>
# 1 1 1 5 7 ""
# 2 2 3 6 11 ""
# 3 3 4 7 14 ""
# 4 4 7 8 19 ""
# 5 5 NA 9 10.8 "y"
# 6 6 12 10 28 ""
# 7 7 NA NA 7 "y | z"
# 8 8 2 20 30 ""
# 9 9 5.5 25 39.5 ""
#10 10 NA 8 12 "y"
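As a side note (my addition, not from the answer above), the rowwise() max step can be replaced by the vectorized pmax(), which avoids grouping row by row and is usually faster on large frames:
# vectorized row-wise maximum of the three data columns
df %>% mutate(max_value = pmax(x, y, z, na.rm = TRUE))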
I made a reproducible example of a dataframe with 2 patient IDs (ID 1 and ID 2) and the value of a measurement (m_value) on different days (m_day):
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
                 m_value = c(10, 15, 12, 13, 18, 16, 19),
                 m_day = c(14, 143, 190, 402, 16, 55, 75))
ID m_value m_day
1 1 10 14
2 1 15 143
3 1 12 190
4 1 13 402
5 2 18 16
6 2 16 55
7 2 19 75
Now I want to obtain, for each patient, the best value of m before day 100 (period 1) and from day 100 onward (period 2), together with the dates of these best values, so that I obtain the following table:
ID m_value m_day best_m_period1 best_m_period2 date_best_m_period1 date_best_m_period2
1 1 10 14 10 15 14 143
2 1 15 143 10 15 14 143
3 1 12 190 10 15 14 143
4 1 13 402 10 15 14 143
5 2 18 16 19 NA 75 NA
6 2 16 55 19 NA 75 NA
7 2 19 75 19 NA 75 NA
I tried the following code:
df2 <- df %>%
group_by (ID)%>%
mutate (best_m_period1 = max(m_value[m_day < 100]))%>%
mutate (best_m_period2 = max (m_value[m_day >=100])) %>%
mutate (date_best_m_period1 =
ifelse (is.null(which.max(m_value[m_day<100])), NA,
m_day[which.max(m_value[m_day < 100])])) %>%
mutate (date_best_m_period2 =
ifelse (is.null(which.max(m_value[m_day >= 100])), NA,
m_day[which.max(m_value[m_day >= 100])]))
But I obtain the following table:
ID m_value m_day best_m_period1 best_m_period2 date_best_m_period1 date_best_m_period2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10 14 10 15 14 14
2 1 15 143 10 15 14 14
3 1 12 190 10 15 14 14
4 1 13 402 10 15 14 14
5 2 18 16 19 -Inf 75 NA
6 2 16 55 19 -Inf 75 NA
7 2 19 75 19 -Inf 75 NA
The date_best_m_period2 for ID 1 is not 143 as expected (the day of the maximum value, 15, for ID 1 in period 2, i.e. m_day >= 100); instead it returns 14, the date of the maximum in period 1.
How can I resolve this problem? Thank you very much for your help.
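The root cause is that which.max() returns a position within the subset m_value[m_day >= 100], but that position is then used to index the full m_day vector. A sketch of one way to fix it: subset m_day to the same period before indexing, and guard the empty-period case so ID 2 gets NA instead of -Inf:
library(dplyr)
df2 <- df %>%
  group_by(ID) %>%
  mutate(
    best_m_period1 = if (any(m_day < 100)) max(m_value[m_day < 100]) else NA_real_,
    best_m_period2 = if (any(m_day >= 100)) max(m_value[m_day >= 100]) else NA_real_,
    # subset m_day to the period first, so which.max() positions line up
    date_best_m_period1 = if (any(m_day < 100))
      m_day[m_day < 100][which.max(m_value[m_day < 100])] else NA_real_,
    date_best_m_period2 = if (any(m_day >= 100))
      m_day[m_day >= 100][which.max(m_value[m_day >= 100])] else NA_real_
  ) %>%
  ungroup()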
Question updated, see below
I have a large dataframe similar in structure to e.g.
import pandas as pd
df = pd.DataFrame({'A': [0, 0, 0, 11, 22, 33], 'B': [10, 20, 30, 110, 220, 330], 'C': ['x', 'y', 'z', 'x', 'y', 'z']})
df
A B C
0 0 10 x
1 0 20 y
2 0 30 z
3 11 110 x
4 22 220 y
5 33 330 z
I want to create a new column by taking the value of B from a different row: the row whose C equals the current row's C and whose A equals 0. The expected result is
A B C new_B_based_on_A_and_C
0 0 10 x 10
1 0 20 y 20
2 0 30 z 30
3 11 110 x 10
4 22 220 y 20
5 33 330 z 30
I want a simple solution that avoids a for loop over the rows. Something like:
df.apply(lambda row: df[(df['C'] == row.C) & (df['A'] == 0)]['B'].iloc[0], axis=1)
The dataframe is guaranteed to contain a matching A == 0 row for every key, and that row is unique.
Update for a more general case
I am looking for a general solution that would also work for multiple columns to match on e.g.
df = pd.DataFrame({'A': [0, 0, 0, 0, 11, 22, 33, 44],
                   'B': [10, 20, 30, 40, 110, 220, 330, 440],
                   'C': ['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y'],
                   'D': [1, 1, 5, 5, 1, 1, 5, 5]})
A B C D
0 0 10 x 1
1 0 20 y 1
2 0 30 x 5
3 0 40 y 5
4 11 110 x 1
5 22 220 y 1
6 33 330 x 5
7 44 440 y 5
and the result would be then
A B C D new_B_based_on_A_C_D
0 0 10 x 1 10
1 0 20 y 1 20
2 0 30 x 5 30
3 0 40 y 5 40
4 11 110 x 1 10
5 22 220 y 1 20
6 33 330 x 5 30
7 44 440 y 5 40
You can do a map:
# you **must** make sure that for each unique `C` value,
# there is only one row with `A==0`.
df['new'] = df['C'].map(df.loc[df['A']==0].set_index('C')['B'])
Output:
A B C new
0 0 10 x 10
1 0 20 y 20
2 0 30 z 30
3 11 110 x 10
4 22 220 y 20
5 33 330 z 30
Explanation: Imagine you have a series s indicating the mapping:
idx
idx1 value1
idx2 value2
idx3 value3
then that's what map does: df['C'].map(s).
Now, for a dataframe d:
C B
c1 b1
c2 b2
c3 b3
we do s=d.set_index('C')['B'] to get the above form.
Finally, as mentioned, your mapping happens where A == 0, so d = df[df['A'] == 0].
Composing the forward path:
mapping_data = df[df['A']==0]
mapping_series = mapping_data.set_index('C')['B']
new_values = df['C'].map(mapping_series)
and the first piece of code is just all these lines combined.
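The same map idea can be stretched to several key columns by mapping on tuples; a sketch of my own (the merge in the next answer is usually cleaner):
# build the lookup keyed by (C, D) tuples, then map zipped columns through it
mapping = df.loc[df['A'] == 0].set_index(['C', 'D'])['B'].to_dict()
df['new_B_based_on_A_C_D'] = pd.Series(list(zip(df['C'], df['D'])), index=df.index).map(mapping)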
If I understood the question, for the general case you could use a merge like this:
df.merge(df.loc[df['A'] == 0, ['B', 'C', 'D']], on=['C', 'D'], how='left', suffixes=('', '_new'))
Output:
   A    B  C  D  B_new
0   0   10  x  1     10
1   0   20  y  1     20
2   0   30  x  5     30
3   0   40  y  5     40
4  11  110  x  1     10
5  22  220  y  1     20
6  33  330  x  5     30
7  44  440  y  5     40
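A defensive tweak (my addition, not part of the original answer): merge can assert that each (C, D) key matches exactly one A == 0 row via its validate parameter:
df.merge(df.loc[df['A'] == 0, ['B', 'C', 'D']],
         on=['C', 'D'], how='left', suffixes=('', '_new'),
         validate='m:1')  # raises MergeError if the right side has duplicate (C, D) keys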
I have a dataframe and I'd like to group by a column value and then do a calculation to create a new column. Below is the set up data:
import pandas as pd
df = pd.DataFrame({
'Red' : [1,2,3,4,5,6,7,8,9,10],
'Groups':['A','B','A','A','B','C','B','C','B','C'],
'Blue':[10,20,30,40,50,60,70,80,90,100]
})
df.groupby('Groups').apply(print)
What I want to do is create a 'TOTAL' column in the original dataframe. If a row is the first record of its group, 'TOTAL' gets a zero; otherwise 'TOTAL' gets that row's ['Blue'] value minus the ['Red'] value of the previous row in the same group.
I tried to do this in a function below but it does not work.
def funct(group):
count = 0
lst = []
for info in group:
if count == 0:
lst.append(0)
count += 1
else:
num = group.iloc[count]['Blue'] - group.iloc[count-1]['Red']
lst.append(num)
count += 1
group['Total'] = lst
return group
df = df.join(df.groupby('Groups').apply(funct))
The code works for the first group but then errors out.
The desired outcome is:
df_final = pd.DataFrame({
'Red' : [1,2,3,4,5,6,7,8,9,10],
'Groups':['A','B','A','A','B','C','B','C','B','C'],
'Blue':[10,20,30,40,50,60,70,80,90,100],
'Total':[0,0,29,37,48,0,65,74,83,92]
})
df_final
df_final.groupby('Groups').apply(print)
Thank you for the help!
For each group, calculate the difference between Blue and shifted Red (Red at previous index):
df['Total'] = (df.groupby('Groups')
.apply(lambda g: g.Blue - g.Red.shift().fillna(g.Blue))
.reset_index(level=0, drop=True))
df
Red Groups Blue Total
0 1 A 10 0.0
1 2 B 20 0.0
2 3 A 30 29.0
3 4 A 40 37.0
4 5 B 50 48.0
5 6 C 60 0.0
6 7 B 70 65.0
7 8 C 80 74.0
8 9 B 90 83.0
9 10 C 100 92.0
Or, as @anky has commented, you can avoid apply by shifting the Red column first:
df['Total'] = (df.Blue - df.Red.groupby(df.Groups).shift()).fillna(0, downcast='infer')
df
Red Groups Blue Total
0 1 A 10 0
1 2 B 20 0
2 3 A 30 29
3 4 A 40 37
4 5 B 50 48
5 6 C 60 0
6 7 B 70 65
7 8 C 80 74
8 9 B 90 83
9 10 C 100 92
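For completeness, a sketch of the same logic with numpy's where (my addition, not from the original answers):
import numpy as np
prev_red = df.groupby('Groups')['Red'].shift()  # previous Red within each group
df['Total'] = np.where(prev_red.isna(), 0, df['Blue'] - prev_red)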
My frame has many pairs of identically named columns, with the only difference being the prefix. For example, player1.player.id and player2.player.id.
Here's an example (with fewer and shorter columns):
pd.DataFrame({'p1.a': {0: 4, 1: 0}, 'p1.b': {0: 1, 1: 4},
'p1.c': {0: 2, 1: 8}, 'p1.d': {0: 3, 1: 12},
'p1.e': {0: 4, 1: 16}, 'p1.f': {0: 5, 1: 20},
'p1.g': {0: 6, 1: 24},
'p2.a': {0: 0, 1: 0}, 'p2.b': {0: 3, 1: 12},
'p2.c': {0: 6, 1: 24}, 'p2.d': {0: 9, 1: 36},
'p2.e': {0: 12, 1: 48}, 'p2.f': {0: 15, 1: 60},
'p2.g': {0: 18, 1: 72}})
p1.a p1.b p1.c p1.d p1.e p1.f p1.g p2.a p2.b p2.c p2.d p2.e p2.f p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
I'd like to turn it into a long format, with a new side column denoting either p1 or p2. I have several crappy ways of doing it, for example:
df1 = df.filter(regex='^p1.*').assign(side='p1')
df2 = df.filter(regex='^p2.*').assign(side='p2')
df1.columns = [c.replace('p1.', '') for c in df1.columns]
df2.columns = [c.replace('p2.', '') for c in df2.columns]
pd.concat([df1, df2]).head()
a b c d e f g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
This feels non-idiomatic, and I couldn't get pd.wide_to_long() to work here.
I'd appreciate an answer which also handles arbitrary substrings, not just prefix, i.e., I'm also interested in something like this:
foo.p1.a foo.p1.b foo.p1.c foo.p1.d foo.p1.e foo.p1.f foo.p1.g foo.p2.a foo.p2.b foo.p2.c foo.p2.d foo.p2.e foo.p2.f foo.p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
Turning into:
foo.a foo.b foo.c foo.d foo.e foo.f foo.g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
But if there's an idiomatic way to handle prefixes whereas substrings require complexity, I'd appreciate learning about both.
What's the idiomatic (pythonic? pandonic?) way of doing this?
A couple of options to do this:
with pd.wide_to_long, you need to reorder the positions based on the delimiter; in this case we move the a, b, ... to the fore and the p1, p2 to the back, before reshaping:
temp = df.copy()
temp = temp.rename(columns = lambda col: ".".join(col.split(".")[::-1]))
(pd.wide_to_long(temp.reset_index(),
                 stubnames = ["a", "b", "c", "d", "e", "f", "g"],
                 sep = ".",
                 suffix = ".+",
                 i = "index",
                 j = "side")
 .droplevel('index')
 .reset_index())
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
One limitation with pd.wide_to_long is the reshaping of positions. The other limitation is that the stubnames have to be explicitly specified.
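If listing the stubnames by hand is the sticking point, they can be derived from the (already reversed) column names; a small sketch assuming the single-dot layout above:
# derive ['a', ..., 'g'] from columns like 'a.p1', 'b.p2', ...
stubnames = sorted({col.split(".")[0] for col in temp.columns})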
Another option is via stack, where the columns are split, based on the delimiter and reshaped:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
temp.stack(0).droplevel(0).rename_axis('side').reset_index()
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
stack is quite flexible, and did not require us to list the column names. The limitation of stack is that it fails if the index is not unique.
Another option is pivot_longer from pyjanitor, which abstracts the process:
# pip install janitor
import janitor
df.pivot_longer(index = None,
names_to = ("side", ".value"),
names_sep=".")
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
The workhorse here is .value. This tells the function that anything after the . should remain as a column name, while anything before the . should be collected into a new column (side). Note that, unlike wide_to_long, the stubnames do not need to be stated; pivot_longer abstracts that for us. Also, it can handle duplicate indices, since it uses pd.melt under the hood.
One limitation of pivot_longer is that you have to install the pyjanitor library.
For the other example, I'll use stack and pivot_longer; you can still use pd.wide_to_long to solve it.
With stack:
first split the columns and convert into a MultiIndex:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
Reshape the data:
temp = temp.stack(1).droplevel(0).rename_axis('side')
Merge the column names:
temp.columns = temp.columns.map(".".join)
Reset the index:
temp.reset_index()
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
With pivot_longer, one option is to reorder the columns, before reshaping:
temp = df.copy()
temp.columns = ["".join([first, last, middle])
for first, middle, last in
temp.columns.str.split(r'(\.p\d)')]
(
temp
.pivot_longer(
index = None,
names_to = ('.value', 'side'),
names_pattern = r"(.+)\.(p\d)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
In the dev version, however, the column reorder is not necessary; we can simply use multiple .value entries to reshape the dataframe. Note that you'll have to install from the repo to get the latest dev version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
(df
.pivot_longer(
index = None,
names_to = ('.value', 'side', '.value'),
names_pattern = r"(.+)\.(.\d)(.+)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
Another option with names_sep:
(df
.pivot_longer(
index = None,
names_to = ('.value', 'side', '.value'),
names_sep = r'\.(p\d)')
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72