Mutate a column conditionally: if any NAs, take the highest value, and grab the column names - dataframe

Consider this sample data:
# A tibble: 10 x 3
       x     y     z
   <int> <dbl> <dbl>
 1     1   1       5
 2     2   3       6
 3     3   4       7
 4     4   7       8
 5     5  NA       9
 6     6  12      10
 7     7  NA      NA
 8     8   2      20
 9     9   5.5    25
10    10  NA       8
I would like to mutate a new column, value, that is the row sum if no NAs are present in any of the columns.
If there are NAs, take the highest value in the row times 1.2.
BUT, if only one column in the row has a value, take that value.
Finally, add another column, NA_column, with the names of the columns containing NA in that row!
This is what I have in mind, but I could not figure out the rest:
df %>%
  mutate(
    value = case_when(any_vars(is.na()) ~ BIGGEST VALUE * 1.2,
                      TRUE ~ rowsum()),
    NA_column = columns that have NA in that row
  )
DATA
df <- tibble(
  x = 1:10,
  y = c(1, 3, 4, 7, NA, 12, NA, 2, 5.5, NA),
  z = c(5:10, NA, 20, 25, 8)
)

Use rowwise() and c_across() to perform the operations:
library(dplyr)
df_out <- df %>%
  rowwise() %>%
  mutate(
    value = ifelse(anyNA(c_across(everything())), max(c_across(everything()), na.rm = TRUE) * 1.2, x), # or 'y' or 'z' instead of 'x'
    NA_column = paste(colnames(df)[is.na(c_across(x:z))], collapse = " | ")
  ) %>%
  ungroup()
df_out
# A tibble: 10 × 5
       x     y     z value NA_column
   <int> <dbl> <dbl> <dbl> <chr>
 1     1   1       5   1   ""
 2     2   3       6   2   ""
 3     3   4       7   3   ""
 4     4   7       8   4   ""
 5     5  NA       9  10.8 "y"
 6     6  12      10   6   ""
 7     7  NA      NA   8.4 "y | z"
 8     8   2      20   8   ""
 9     9   5.5    25   9   ""
10    10  NA       8  12   "y"
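Note that the ifelse() above returns x (not the row sum) when the row is complete. A rowwise sketch that follows all three rules in the question (row sum when complete, highest value times 1.2 when some values are NA, the lone value when only one column is filled) could look like this:
df %>%
  rowwise() %>%
  mutate(
    value = case_when(
      sum(!is.na(c_across(x:z))) == 1 ~ sum(c_across(x:z), na.rm = TRUE), # only one value: take it
      anyNA(c_across(x:z)) ~ max(c_across(x:z), na.rm = TRUE) * 1.2,      # some NAs: highest value * 1.2
      TRUE ~ sum(c_across(x:z))                                           # complete row: row sum
    )
  ) %>%
  ungroup()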

Often when solving a problem such as this I find it best to lay out the solution in discrete steps. In this case, using tidyverse syntax, it is possible to create temporary columns containing the bits of data needed to ultimately compute the desired value.
I couldn't immediately improve upon the solution provided for the second part (NA_column) by @Julien above, so I've added that code chunk below.
df <- tibble(
  x = 1:10,
  y = c(1, 3, 4, 7, NA, 12, NA, 2, 5.5, NA),
  z = c(5:10, NA, 20, 25, 8)
)
out <-
  df %>%
  mutate(
    # get the number of columns of data
    num_cols = ncol(.),
    # get the number of values in the row that are not NA
    num_not_na = rowSums(!is.na(.))
  ) %>%
  rowwise() %>%
  mutate(
    # get the maximum value of the data in the row; note we need to be
    # explicit about which columns are data, e.g., (1:3)
    max_value = max(across(1:3), na.rm = TRUE)
  ) %>%
  ungroup() %>%
  mutate(
    # get the desired value for the row:
    # if there are no NA values, or only one non-NA value, we can just take the
    # row sum (again being explicit about the columns, e.g. [, 1:3]);
    # otherwise take the maximum value multiplied by 1.2
    value = ifelse(num_cols == num_not_na | num_not_na == 1, rowSums(.[, 1:3], na.rm = TRUE), max_value * 1.2)
  )
# with credit to @Julien for the following code to get the names of the NA columns
# (the %$% exposition pipe below comes from the magrittr package)
library(magrittr)
out$NA_column <- sapply(1:nrow(out), function(i) {
  out[i, ] %$%
    colnames(.)[is.na(.)] %>%
    paste(collapse = " | ")
})
# can optionally remove the temporary columns
out <-
  out %>%
  dplyr::select(
    -c(num_cols, num_not_na, max_value)
  )
# > out
# # A tibble: 10 x 5
#        x     y     z value NA_column
#    <int> <dbl> <dbl> <dbl> <chr>
#  1     1   1       5   7   ""
#  2     2   3       6  11   ""
#  3     3   4       7  14   ""
#  4     4   7       8  19   ""
#  5     5  NA       9  10.8 "y"
#  6     6  12      10  28   ""
#  7     7  NA      NA   7   "y | z"
#  8     8   2      20  30   ""
#  9     9   5.5    25  39.5 ""
# 10    10  NA       8  12   "y"
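Both answers lean on rowwise(), which can get slow on large data. For reference, a vectorized sketch of the value logic (assuming the data columns are x, y, and z; the NA_column step would still need a row-wise construct such as the sapply above):
df %>%
  mutate(
    n_vals = rowSums(!is.na(across(x:z))), # number of non-NA values in the row
    value = case_when(
      n_vals == 1 ~ rowSums(across(x:z), na.rm = TRUE), # only one value: take it
      n_vals < 3 ~ pmax(x, y, z, na.rm = TRUE) * 1.2,   # some NAs: highest value * 1.2
      TRUE ~ rowSums(across(x:z))                       # complete row: row sum
    )
  )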

Related

Simple way to get lagged value of a specific row

I want to create a column called lag_diff3. This column holds, for every row, the lagged difference at A = 2003. I can do this using the following code, but it seems ugly. Is there a simpler way to write this?
A = c("2001", "2002", "2003", "2004")
B = c(10, 20, 60, 70)
dat = tibble(A = A, B = B) %>%
mutate(lag_diff1 = B - lag(B, 1),
lag_diff2 = ifelse(A != "2003", -100, lag_diff1),
lag_diff3 = max(lag_diff2))
> dat
# A tibble: 4 × 5
  A         B lag_diff1 lag_diff2 lag_diff3
  <chr> <dbl>     <dbl>     <dbl>     <dbl>
1 2001     10        NA      -100        40
2 2002     20        10      -100        40
3 2003     60        40        40        40
4 2004     70        10      -100        40
You could do lag_diff1[A == "2003"]:
library(dplyr)
A = c("2001", "2002", "2003", "2004")
B = c(10, 20, 60, 70)
tibble(A = A, B = B) %>%
  mutate(lag_diff1 = B - lag(B, 1),
         lag_diff3 = lag_diff1[A == "2003"])
#> # A tibble: 4 × 4
#>   A         B lag_diff1 lag_diff3
#>   <chr> <dbl>     <dbl>     <dbl>
#> 1 2001     10        NA        40
#> 2 2002     20        10        40
#> 3 2003     60        40        40
#> 4 2004     70        10        40

Reshape wide to long for many columns with a common prefix

My frame has many pairs of identically named columns, with the only difference being the prefix. For example, player1.player.id and player2.player.id.
Here's an example (with fewer and shorter columns):
pd.DataFrame({'p1.a': {0: 4, 1: 0}, 'p1.b': {0: 1, 1: 4},
              'p1.c': {0: 2, 1: 8}, 'p1.d': {0: 3, 1: 12},
              'p1.e': {0: 4, 1: 16}, 'p1.f': {0: 5, 1: 20},
              'p1.g': {0: 6, 1: 24},
              'p2.a': {0: 0, 1: 0}, 'p2.b': {0: 3, 1: 12},
              'p2.c': {0: 6, 1: 24}, 'p2.d': {0: 9, 1: 36},
              'p2.e': {0: 12, 1: 48}, 'p2.f': {0: 15, 1: 60},
              'p2.g': {0: 18, 1: 72}})
   p1.a  p1.b  p1.c  p1.d  p1.e  p1.f  p1.g  p2.a  p2.b  p2.c  p2.d  p2.e  p2.f  p2.g
0     4     1     2     3     4     5     6     0     3     6     9    12    15    18
1     0     4     8    12    16    20    24     0    12    24    36    48    60    72
I'd like to turn it into a long format, with a new side column denoting either p1 or p2. I have several crappy ways of doing it, for example:
df1 = df.filter(regex='^p1.*').assign(side='p1')
df2 = df.filter(regex='^p2.*').assign(side='p2')
df1.columns = [c.replace('p1.', '') for c in df1.columns]
df2.columns = [c.replace('p2.', '') for c in df2.columns]
pd.concat([df1, df2]).head()
   a   b   c   d   e   f   g side
0  4   1   2   3   4   5   6   p1
1  0   4   8  12  16  20  24   p1
0  0   3   6   9  12  15  18   p2
1  0  12  24  36  48  60  72   p2
This feels non-idiomatic, and I couldn't get pd.wide_to_long() to work here.
I'd appreciate an answer which also handles arbitrary substrings, not just prefix, i.e., I'm also interested in something like this:
   foo.p1.a  foo.p1.b  foo.p1.c  foo.p1.d  foo.p1.e  foo.p1.f  foo.p1.g  foo.p2.a  foo.p2.b  foo.p2.c  foo.p2.d  foo.p2.e  foo.p2.f  foo.p2.g
0         4         1         2         3         4         5         6         0         3         6         9        12        15        18
1         0         4         8        12        16        20        24         0        12        24        36        48        60        72
Turning into:
   foo.a  foo.b  foo.c  foo.d  foo.e  foo.f  foo.g side
0      4      1      2      3      4      5      6   p1
1      0      4      8     12     16     20     24   p1
0      0      3      6      9     12     15     18   p2
1      0     12     24     36     48     60     72   p2
But if there's an idiomatic way to handle prefixes whereas substrings require complexity, I'd appreciate learning about both.
What's the idiomatic (pythonic? pandonic?) way of doing this?
A couple of options to do this:
with pd.wide_to_long, you need to reorder the positions based on the delimiter; in this case we move the a, b, ... to the fore and the p1, p2 to the back, before reshaping:
temp = df.copy()
temp = temp.rename(columns=lambda col: ".".join(col.split(".")[::-1]))
(pd.wide_to_long(temp.reset_index(),
                 stubnames=["a", "b", "c", "d", "e", "f", "g"],
                 sep=".",
                 suffix=".+",
                 i="index",
                 j="side")
   .droplevel('index')
   .reset_index())

  side  a   b   c   d   e   f   g
0   p1  4   1   2   3   4   5   6
1   p1  0   4   8  12  16  20  24
2   p2  0   3   6   9  12  15  18
3   p2  0  12  24  36  48  60  72
One limitation of pd.wide_to_long is this need to reorder the positions; the other is that the stubnames have to be explicitly specified.
Another option is via stack, where the columns are split based on the delimiter and reshaped:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand=True)
temp.stack(0).droplevel(0).rename_axis('side').reset_index()

  side  a   b   c   d   e   f   g
0   p1  4   1   2   3   4   5   6
1   p2  0   3   6   9  12  15  18
2   p1  0   4   8  12  16  20  24
3   p2  0  12  24  36  48  60  72
stack is quite flexible and does not require us to list the column names. The limitation of stack is that it fails if the index is not unique.
Another option is pivot_longer from pyjanitor, which abstracts the process:
# pip install pyjanitor
import janitor
df.pivot_longer(index=None,
                names_to=("side", ".value"),
                names_sep=".")

  side  a   b   c   d   e   f   g
0   p1  4   1   2   3   4   5   6
1   p1  0   4   8  12  16  20  24
2   p2  0   3   6   9  12  15  18
3   p2  0  12  24  36  48  60  72
The workhorse here is .value. This tells the function that anything after . should remain as a column name, while anything before . should be collated into a new column (side). Note that, unlike wide_to_long, the stubnames do not need to be stated; pivot_longer abstracts that for us. Also, it can handle duplicate indices, since it uses pd.melt under the hood.
One limitation of pivot_longer is that you have to install the pyjanitor library.
For the other example, I'll use stack and pivot_longer; you can still use pd.wide_to_long to solve it, as sketched below.
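For instance, a sketch with pd.wide_to_long: as with the first example, move the p1/p2 token to the back of each name so that the stubs (foo.a through foo.g) sit at the front. The stubs contain dots, which should be fine since recent pandas escapes the stub names internally:
temp = df.copy()
# 'foo.p1.a' -> 'foo.a.p1'
temp.columns = [".".join([foo, col, side])
                for foo, side, col in (c.split(".") for c in temp.columns)]
(pd.wide_to_long(temp.reset_index(),
                 stubnames=["foo." + c for c in "abcdefg"],
                 sep=".",
                 suffix=".+",
                 i="index",
                 j="side")
   .droplevel("index")
   .reset_index())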
With stack:
First, split the columns and convert them into a MultiIndex:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand=True)
Reshape the data:
temp = temp.stack(1).droplevel(0).rename_axis('side')
Merge the column names:
temp.columns = temp.columns.map(".".join)
Reset the index:
temp.reset_index()

  side  foo.a  foo.b  foo.c  foo.d  foo.e  foo.f  foo.g
0   p1      4      1      2      3      4      5      6
1   p2      0      3      6      9     12     15     18
2   p1      0      4      8     12     16     20     24
3   p2      0     12     24     36     48     60     72
With pivot_longer, one option is to reorder the columns before reshaping:
temp = df.copy()
temp.columns = ["".join([first, last, middle])
                for first, middle, last in
                temp.columns.str.split(r'(\.p\d)')]
(
    temp
    .pivot_longer(
        index=None,
        names_to=('.value', 'side'),
        names_pattern=r"(.+)\.(p\d)")
)

  side  foo.a  foo.b  foo.c  foo.d  foo.e  foo.f  foo.g
0   p1      4      1      2      3      4      5      6
1   p1      0      4      8     12     16     20     24
2   p2      0      3      6      9     12     15     18
3   p2      0     12     24     36     48     60     72
In the dev version, however, the column reorder is not necessary; we can simply use multiple .value entries to reshape the dataframe. Note that you'll have to install from the repo to get the latest dev version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
(df
    .pivot_longer(
        index=None,
        names_to=('.value', 'side', '.value'),
        names_pattern=r"(.+)\.(.\d)(.+)")
)

  side  foo.a  foo.b  foo.c  foo.d  foo.e  foo.f  foo.g
0   p1      4      1      2      3      4      5      6
1   p1      0      4      8     12     16     20     24
2   p2      0      3      6      9     12     15     18
3   p2      0     12     24     36     48     60     72
Another option with names_sep:
(df
    .pivot_longer(
        index=None,
        names_to=('.value', 'side', '.value'),
        names_sep=r'\.(p\d)')
)

  side  foo.a  foo.b  foo.c  foo.d  foo.e  foo.f  foo.g
0   p1      4      1      2      3      4      5      6
1   p1      0      4      8     12     16     20     24
2   p2      0      3      6      9     12     15     18
3   p2      0     12     24     36     48     60     72

How to do a three-way weighted table in R - similar to wtd.table

I have found MANY questions similar to mine, but they either don't want weighted tables or only want two-way tables. I am trying to do both.
Using wtd.table, I have the following line of code:
wtd.table(fulldata2$income, fulldata2$WIHH, fulldata2$hhsize, weights = fulldata2$WGTP)
This outputs only income and WIHH, weighted. It does not also include hhsize.
Using regular table, I get the correct output in a three-way format, but not weighted.
tab <- table(fulldata2$income, fulldata2$WIHH, fulldata2$hhsize)
tab2 <- prop.table(tab)
What function can do both three-way and weighted frequency tables? Ideally, also give it in a proportion like prop.table does.
Thanks!
First, here are some sample data (try to include these in your questions, even if it requires creating a sample data set like this). Note that I am using the tidyverse packages here:
test <-
  tibble(
    var1 = "A"
    , var2 = "b"
    , var3 = "alpha") %>%
  complete(
    var1 = c("A", "B")
    , var2 = c("a", "b")
    , var3 = c("alpha", "beta")) %>%
  mutate(wt = 1:n())
So, the data are:
# A tibble: 8 x 4
  var1  var2  var3     wt
  <chr> <chr> <chr> <int>
1 A     a     alpha     1
2 A     a     beta      2
3 A     b     alpha     3
4 A     b     beta      4
5 B     a     alpha     5
6 B     a     beta      6
7 B     b     alpha     7
8 B     b     beta      8
The function you are looking for then is xtabs:
xtabs(wt ~ var1 + var2 + var3
      , data = test)
gives:
, , var3 = alpha

    var2
var1 a b
   A 1 3
   B 5 7

, , var3 = beta

    var2
var1 a b
   A 2 4
   B 6 8
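Since xtabs returns a table object, you can pass the result straight to prop.table for the proportions asked about in the question; margin = 3, for example, gives proportions within each level of var3:
tab <- xtabs(wt ~ var1 + var2 + var3, data = test)
prop.table(tab)             # proportions of the grand total
prop.table(tab, margin = 3) # proportions within each level of var3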
If you don't need the result to have the table class, you can also do this by just using count from dplyr (part of the tidyverse):
test %>%
  count(var1, var2, var3
        , wt = wt)
gives a tibble (a modified data.frame) with your results:
# A tibble: 8 x 4
  var1  var2  var3      n
  <chr> <chr> <chr> <int>
1 A     a     alpha     1
2 A     a     beta      2
3 A     b     alpha     3
4 A     b     beta      4
5 B     a     alpha     5
6 B     a     beta      6
7 B     b     alpha     7
8 B     b     beta      8
And you can then perform whatever calculations you want on it, e.g. the percent within each var3:
test %>%
  count(var1, var2, var3
        , wt = wt) %>%
  group_by(var3) %>%
  mutate(prop_in_var3 = n / sum(n))
gives:
# A tibble: 8 x 5
# Groups:   var3 [2]
  var1  var2  var3      n prop_in_var3
  <chr> <chr> <chr> <int>        <dbl>
1 A     a     alpha     1       0.0625
2 A     a     beta      2       0.1
3 A     b     alpha     3       0.188
4 A     b     beta      4       0.2
5 B     a     alpha     5       0.312
6 B     a     beta      6       0.3
7 B     b     alpha     7       0.438
8 B     b     beta      8       0.4

inplace apply to columns of pandas dataframe satisfying conditions

Consider the following pandas dataframe:
df = pd.DataFrame({'t': [1, 2, 3], 'x1': [4, 5, 6], 'x2': [7, 8, 9]})
>>> print(df)
   t  x1  x2
0  1   4   7
1  2   5   8
2  3   6   9
I would like to apply a function (say, multiplying by 2) to those columns whose names contain the character 'x'.
This can be done by:
df.filter(regex='x').apply(lambda c: 2*c)
but not in place. My solution is:
tmp = df.filter(regex='x')
tmp = tmp.apply(lambda c: 2*c)
tmp['t'] = df['t']
df = tmp
which has the added problem of changing the order of the columns. Is there a better way?
IIUC you can do something like this:
In [239]: df.apply(lambda x: x*2 if 'x' in x.name else x)
Out[239]:
   t  x1  x2
0  1   8  14
1  2  10  16
2  3  12  18
UPDATE:
In [258]: df.apply(lambda x: x*2 if 'x' in x.name else x) \
              .rename(columns=lambda x: 'ytext_{}_moretext'.format(x[-1]) if 'x' in x else x)
Out[258]:
   t  ytext_1_moretext  ytext_2_moretext
0  1                 8                14
1  2                10                16
2  3                12                18
Use df.columns.str.contains('x') to get a boolean mask and slice df:
df.loc[:, df.columns.str.contains('x')] *= 2
print(df)
   t  x1  x2
0  1   8  14
1  2  10  16
2  3  12  18
More generalized:
def f(x):
    return 2 * x

m = df.columns.str.contains('x')
df.loc[:, m] = f(df.loc[:, m])
print(df)
   t  x1  x2
0  1   8  14
1  2  10  16
2  3  12  18
Using apply:
m = df.columns.str.contains('x')
df.loc[:, m] = df.loc[:, m].apply(f)
print(df)
   t  x1  x2
0  1   8  14
1  2  10  16
2  3  12  18
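A variant of the same idea reuses the df.filter call from the question to pick the columns; filter returns a copy, so we take only its column labels and assign back in place (a minimal sketch, reusing f from above):
cols = df.filter(regex='x').columns  # Index(['x1', 'x2'])
df[cols] = df[cols].apply(f)         # doubles x1 and x2; column order is unchanged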

Combine multiple columns into two columns: "column name" and "value"

There is probably an easy way of doing this, so I hope someone has a nice solution (currently I am doing it with ugly for loops).
My data looks like:
In [1]: df = pd.DataFrame({'Ref': [5, 6, 7],
                           'Col1': [10, 11, 12],
                           'Col2': [20, 21, 22],
                           'Col3': [30, 31, 32]})
In [2]: df
Out[2]:
   Col1  Col2  Col3  Ref
0    10    20    30     5
1    11    21    31     6
2    12    22    32     7
And I am trying to flatten the table (for 2D histograms) to use a single column for the column id and one column for the actual values while keeping the corresponding Ref, like this:
   Ref Col  Value
0    5   1     10
1    5   2     20
2    5   3     30
3    6   1     11
4    6   2     21
5    6   3     31
6    7   1     12
7    7   2     22
8    7   3     32
I remember there was some kind of a join/group operation to do the reverse operation, but I cannot recall it anymore...
Maybe not the most elegant solution, but it works on your data, using a combination of pivot_table and stack:
import pandas as pd

df = pd.DataFrame({'Ref': [5, 6, 7],
                   'Col1': [10, 11, 12],
                   'Col2': [20, 21, 22],
                   'Col3': [30, 31, 32]})
# In [23]: df
# Out[23]:
#    Col1  Col2  Col3  Ref
# 0    10    20    30     5
# 1    11    21    31     6
# 2    12    22    32     7

piv = df.pivot_table(index=['Ref']).stack()
df2 = pd.DataFrame(piv)
df2.reset_index(inplace=True)
df2.columns = ['Ref', 'Col', 'Value']
# In [19]: df2
# Out[19]:
#    Ref   Col  Value
# 0    5  Col1     10
# 1    5  Col2     20
# 2    5  Col3     30
# 3    6  Col1     11
# 4    6  Col2     21
# 5    6  Col3     31
# 6    7  Col1     12
# 7    7  Col2     22
# 8    7  Col3     32
If you want 'Col' to just be the last digit of the column name, you could do something like this:
df2.Col = df2.Col.apply(lambda x: x[-1:])
# In [21]: df2
# Out[21]:
#    Ref Col  Value
# 0    5   1     10
# 1    5   2     20
# 2    5   3     30
# 3    6   1     11
# 4    6   2     21
# 5    6   3     31
# 6    7   1     12
# 7    7   2     22
# 8    7   3     32
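For reference, this wide-to-long reshape is exactly what pd.melt does in one step; a sketch on the same df, sorted to match the desired row order:
df3 = (pd.melt(df, id_vars='Ref', var_name='Col', value_name='Value')
         .sort_values(['Ref', 'Col'])
         .reset_index(drop=True))
df3['Col'] = df3['Col'].str[-1:]  # keep just the trailing digit, as above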