How to do a three-way weighted frequency table in R (similar to wtd.table)

I have found MANY questions similar to mine, but either they don't want weighted tables or only want two-way tables. I am trying to do both.
Using wtd.table, I have the following line of code:
wtd.table(fulldata2$income, fulldata2$WIHH, fulldata2$hhsize, weights = fulldata2$WGTP)
This output only provides a weighted table of income by WIHH; it does not also include hhsize.
Using regular table, I get the correct output in a three-way format, but not weighted.
tab <- table(fulldata2$income, fulldata2$WIHH, fulldata2$hhsize)
tab2 <- prop.table(tab)
What function can do both three-way and weighted frequency tables? Ideally, also give it in a proportion like prop.table does.
Thanks!

First, here are some sample data (try to include these in your questions, even if it requires creating a sample data set like this). Note that I am using the tidyverse packages here:
test <-
  tibble(
    var1 = "A"
    , var2 = "b"
    , var3 = "alpha") %>%
  complete(
    var1 = c("A", "B")
    , var2 = c("a", "b")
    , var3 = c("alpha", "beta")) %>%
  mutate(wt = 1:n())
So, the data are:
# A tibble: 8 x 4
var1 var2 var3 wt
<chr> <chr> <chr> <int>
1 A a alpha 1
2 A a beta 2
3 A b alpha 3
4 A b beta 4
5 B a alpha 5
6 B a beta 6
7 B b alpha 7
8 B b beta 8
The function you are looking for then is xtabs:
xtabs(wt ~ var1 + var2 + var3
      , data = test)
gives:
, , var3 = alpha

     var2
var1  a  b
   A  1  3
   B  5  7

, , var3 = beta

     var2
var1  a  b
   A  2  4
   B  6  8
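The result of xtabs has class table, so prop.table works on it directly, just as in your original code; for example, to get proportions within each level of var3, pass margin = 3:
tab <- xtabs(wt ~ var1 + var2 + var3, data = test)
prop.table(tab)             # proportions of the grand total
prop.table(tab, margin = 3) # proportions within each level of var3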
If you don't need the result to have the table class, you can also do this by just using count from dplyr (part of the tidyverse):
test %>%
  count(var1, var2, var3
        , wt = wt)
gives a tibble (a modified data.frame) with your results:
# A tibble: 8 x 4
var1 var2 var3 n
<chr> <chr> <chr> <int>
1 A a alpha 1
2 A a beta 2
3 A b alpha 3
4 A b beta 4
5 B a alpha 5
6 B a beta 6
7 B b alpha 7
8 B b beta 8
And you can then perform whatever calculations you want on it, e.g. the percent within each var3:
test %>%
  count(var1, var2, var3
        , wt = wt) %>%
  group_by(var3) %>%
  mutate(prop_in_var3 = n / sum(n))
gives:
# A tibble: 8 x 5
# Groups: var3 [2]
var1 var2 var3 n prop_in_var3
<chr> <chr> <chr> <int> <dbl>
1 A a alpha 1 0.0625
2 A a beta 2 0.1
3 A b alpha 3 0.188
4 A b beta 4 0.2
5 B a alpha 5 0.312
6 B a beta 6 0.3
7 B b alpha 7 0.438
8 B b beta 8 0.4


Mutate column conditionally, if any NAs take the highest value and grab the column names

Consider this sample data
# A tibble: 10 x 3
x y z
<int> <dbl> <dbl>
1 1 1 5
2 2 3 6
3 3 4 7
4 4 7 8
5 5 NA 9
6 6 12 10
7 7 NA NA
8 8 2 20
9 9 5.5 25
10 10 NA 8
I would like to mutate a new column value that is the rowSums if there are no NAs present in any of the columns.
If there are, take the highest value in the row times 1.2.
BUT, if only one column has a value, take that value.
Finally, I'd also like another column NA_column with the names of the columns containing NA in that row!
Here is what I have in mind, but I could not figure out the rest:
df %>%
  mutate(
    value = case_when(any_vars(is.na()) ~ BIGGEST VALUE * 1.2,
                      TRUE ~ rowsum()),
    NA_column = columns that have NA in that row
  )
DATA
df <- tibble(
x = 1:10,
y = c(1, 3, 4, 7, NA, 12, NA, 2, 5.5, NA),
z = c(5:10, NA, 20, 25, 8)
)
Use rowwise and c_across to perform the operations row by row:
library(dplyr)

df_out <- df %>%
  rowwise() %>%
  mutate(
    value = ifelse(anyNA(c_across(everything())),
                   max(c_across(everything()), na.rm = TRUE) * 1.2,
                   x), # or 'y' or 'z' instead of x
    NA_column = paste(colnames(df)[is.na(c_across(x:z))], collapse = " | ")
  ) %>%
  ungroup()
df_out
# A tibble: 10 × 5
x y z value NA_column
<int> <dbl> <dbl> <dbl> <chr>
1 1 1 5 1 ""
2 2 3 6 2 ""
3 3 4 7 3 ""
4 4 7 8 4 ""
5 5 NA 9 10.8 "y"
6 6 12 10 6 ""
7 7 NA NA 8.4 "y | z"
8 8 2 20 8 ""
9 9 5.5 25 9 ""
10 10 NA 8 12 "y"
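As an aside, if you want the row sum when there are no NAs (as the question describes) rather than a single column, a minimal rowwise sketch covering all three cases could look like this; the helper column n_present is my own addition:
df %>%
  rowwise() %>%
  mutate(
    # count the non-NA values in the row (helper column, removed below)
    n_present = sum(!is.na(c_across(x:z))),
    value = case_when(
      n_present == ncol(df) ~ sum(c_across(x:z)),        # no NAs: row sum
      n_present == 1 ~ max(c_across(x:z), na.rm = TRUE), # one value: take it
      TRUE ~ max(c_across(x:z), na.rm = TRUE) * 1.2      # otherwise: max * 1.2
    )
  ) %>%
  ungroup() %>%
  select(-n_present)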
Often when solving a problem such as this, I find it best to lay out the solution in discrete steps. In this case, using tidyverse syntax, it is possible to create temporary columns containing the bits of data needed to ultimately compute the desired value.
I couldn't immediately improve upon the solution provided for the second part (NA_column) by @Julien above, so I've added that code chunk below.
df <- tibble(
x = 1:10,
y = c(1, 3, 4, 7, NA, 12, NA, 2, 5.5, NA),
z = c(5:10, NA, 20, 25, 8)
)
library(dplyr)
library(magrittr) # for the %$% pipe used below

out <-
  df %>%
  mutate(
    # get number of columns of data
    num_cols = ncol(.),
    # get number of values in the row that are not NA
    num_not_na = rowSums(!is.na(.))
  ) %>%
  rowwise() %>%
  mutate(
    # get the maximum value of the data in the row; note we need to be
    # explicit about which columns are data, e.g. (1:3)
    max_value = max(across(1:3), na.rm = TRUE)
  ) %>%
  ungroup() %>%
  mutate(
    # get the desired value for the row:
    # if there are no NA values, or only one non-NA value, just take the row sum
    # (again being explicit about the columns, e.g. [, 1:3]);
    # otherwise take the maximum value multiplied by 1.2
    value = ifelse(num_cols == num_not_na | num_not_na == 1,
                   rowSums(.[, 1:3], na.rm = TRUE),
                   max_value * 1.2)
  )
# with credit to @Julien for the following code to get the names of the NA columns
out$NA_column <- sapply(1:nrow(out), function(i) {
  out[i, ] %$%
    colnames(.)[is.na(.)] %>%
    paste(collapse = " | ")
})
# can optionally remove the temporary columns
out <-
  out %>%
  dplyr::select(
    -c(num_cols, num_not_na, max_value)
  )
# > out
# # A tibble: 10 x 5
# x y z value NA_column
# <int> <dbl> <dbl> <dbl> <chr>
# 1 1 1 5 7 ""
# 2 2 3 6 11 ""
# 3 3 4 7 14 ""
# 4 4 7 8 19 ""
# 5 5 NA 9 10.8 "y"
# 6 6 12 10 28 ""
# 7 7 NA NA 7 "y | z"
# 8 8 2 20 30 ""
# 9 9 5.5 25 39.5 ""
#10 10 NA 8 12 "y"

Cartesian product in R

What is the fastest way to find the cartesian product of two lists in R? For example, I have:
x <- c("a", "b", "c", "d")
y <- c(1, 2, 3)
I need to make from them the following data.frame:
x y
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3
10 d 1
11 d 2
12 d 3
Assuming x cross y, this would be one way:
# Tidyverse solution
library(tidyr)
x <- letters[1:4]
y <- c(1, 2, 3)
tibble(
  x = x,
  y = list(y)
) %>%
  unnest(y)
# A tibble: 12 x 2
x y
<chr> <dbl>
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3
10 d 1
11 d 2
12 d 3
# Base R solution
expand.grid(y = y, x = x)
y x
1 1 a
2 2 a
3 3 a
4 1 b
5 2 b
6 3 b
7 1 c
8 2 c
9 3 c
10 1 d
11 2 d
12 3 d
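As an aside, tidyr also provides crossing(), which returns the cartesian product directly as a tibble, with the first column varying slowest (matching the desired output):
library(tidyr)
crossing(x = x, y = y)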

How to melt four column in the same table in one column with pandas.melt

I am learning data analytics and have a project which needs 4 columns melted into one column, using - between them.
For example, a row holding puppo none doggo none
should end up in one column
as puppo-doggo.
Assume that your source DataFrame contains:
id A B C D
0 X1 1 5 9 12
1 X2 2 6 10 14
2 X3 3 7 11 15
3 X4 4 8 12 16
To melt columns A through D you can run:
result = df.melt(id_vars=['id'], value_vars=['A', 'B', 'C', 'D'],
                 var_name='Column', value_name='Value')
The result is:
id Column Value
0 X1 A 1
1 X2 A 2
2 X3 A 3
3 X4 A 4
4 X1 B 5
5 X2 B 6
6 X3 B 7
7 X4 B 8
8 X1 C 9
9 X2 C 10
10 X3 C 11
11 X4 C 12
12 X1 D 12
13 X2 D 14
14 X3 D 15
15 X4 D 16
Read the documentation concerning melt and experiment with other settings of the available parameters and their default values.
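Since the question's actual goal seems to be combining the non-'none' values of several columns into a single string separated by -, here is a minimal sketch of that step; the column names A through D and the 'none' placeholder are assumptions based on the example in the question:
import pandas as pd

# hypothetical data: each column holds a stage name or the string 'none'
df = pd.DataFrame({'A': ['puppo', 'none'],
                   'B': ['none', 'doggo'],
                   'C': ['doggo', 'none'],
                   'D': ['none', 'none']})

# join the non-'none' values of the four columns with '-'
df['combined'] = df[['A', 'B', 'C', 'D']].apply(
    lambda row: '-'.join(v for v in row if v != 'none'), axis=1)
print(df['combined'])
# 0    puppo-doggo
# 1          doggo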

How to merge two pandas dataframes correctly

I have 2 dataframes
df1
Code Sales Store
A 10 alpha
B 5 beta
C 4 gamma
B 3 alpha
df2
Code Unit_Price
A 2
B 3
C 4
D 5
E 6
I want to do 2 things here.
First, I want to check that all unique codes in df1 are present in df2.
Second, I want to merge these 2 dataframes by code.
df3 should look like:
Code Sales Store unit_price
A 10 alpha 2
B 5 beta 3
C 4 gamma 4
B 3 alpha 3
I did:
df3 = df1.merge(df2, on='Code', how='left')
Not sure if I am right. I would appreciate your time and effort to help me with this.
Use numpy.setdiff1d to check membership of the unique values of the columns:
import numpy as np

print (np.setdiff1d(df1['Code'].unique(), df2['Code'].unique()))
[]
print (np.setdiff1d(df2['Code'].unique(), df1['Code'].unique()))
['D' 'E']
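An equivalent check with pandas alone is isin, which here returns True when every code in df1 also appears in df2:
print (df1['Code'].isin(df2['Code']).all())
True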
Your solution is good, especially if you need to add more columns, like:
print (df2)
Code Unit_Price col
0 A 2 7
1 B 3 2
2 C 4 1
3 D 5 0
4 E 6 3
df3 = df1.merge(df2,on='Code',how='left')
print (df3)
Code Sales Store Unit_Price col
0 A 10 alpha 2 7
1 B 5 beta 3 2
2 C 4 gamma 4 1
3 B 3 alpha 3 2
If you need to add only one column, it is possible to use map with a Series, which should be faster:
df1['unit_price'] = df1['Code'].map(df2.set_index('Code')['Unit_Price'])
print (df1)
Code Sales Store unit_price
0 A 10 alpha 2
1 B 5 beta 3
2 C 4 gamma 4
3 B 3 alpha 3
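As a further safeguard, merge accepts a validate argument that raises a MergeError if the expected relationship does not hold, and indicator=True adds a _merge column flagging rows without a match; a sketch:
# raises MergeError if 'Code' is not unique in df2
df3 = df1.merge(df2, on='Code', how='left', validate='m:1', indicator=True)
print (df3['_merge'].value_counts())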

Apply an element-wise function on a pandas dataframe with index and column values as inputs

I often have this need, and I can't seem to find the way to do it efficiently.
Let's say I have a pandas DataFrame object and I want the value of each element (i,j) to be equal to f(index[i], columns[j]).
Using applymap, value of index and column for each element is lost.
What is the best way to do it?
It depends on what you are trying to do specifically.
Clever hack: using pd.Panel.apply
It works because it will iterate over each series along the major and minor axes; its name will be the tuple we need. (Note that pd.Panel was deprecated in pandas 0.20 and removed in 1.0, so this only runs on older versions.)
import pandas as pd

df = pd.DataFrame(index=range(5), columns=range(5))

def f1(x):
    n = x.name
    return n[0] + n[1] ** 2

pd.Panel(dict(A=df)).apply(f1, 0)
0 1 2 3 4
0 0 1 4 9 16
1 1 2 5 10 17
2 2 3 6 11 18
3 3 4 7 12 19
4 4 5 8 13 20
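On current pandas versions, where pd.Panel no longer exists, a plain comprehension over the index and columns gives the same result, and np.add.outer is a vectorized option when f is simple arithmetic; a sketch assuming f takes (index value, column value):
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(5), columns=range(5))
f = lambda i, j: i + j ** 2

# generic: evaluate f for every (index, column) pair
result = pd.DataFrame([[f(i, j) for j in df.columns] for i in df.index],
                      index=df.index, columns=df.columns)

# vectorized equivalent for this particular f
result2 = pd.DataFrame(np.add.outer(df.index, df.columns ** 2),
                       index=df.index, columns=df.columns)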
example 1
Here is one such use case and one possible solution for that use case
df = pd.DataFrame(index=range(5), columns=range(5))
f = lambda x: x[0] + x[1]
s = df.stack(dropna=False)
s.loc[:] = s.index.map(f)
s.unstack()
0 1 2 3 4
0 0 1 2 3 4
1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
4 4 5 6 7 8
or this will do the same thing
df.stack(dropna=False).to_frame().apply(lambda x: f(x.name), 1).unstack()
example 2
import numpy as np

df = pd.DataFrame(index=list('abcd'), columns=list('xyz'))
v = df.values
c = df.columns.values
i = df.index.values
pd.DataFrame(
    (i.repeat(len(c)) + np.tile(c, len(i))).reshape(v.shape),
    i, c
)
    x   y   z
a  ax  ay  az
b  bx  by  bz
c  cx  cy  cz
d  dx  dy  dz
d bz cz dz