Creating a dataset (data frame, df) in R consisting of all possible combinations of 10 different variables

Thanks in advance for any advice. As part of a study, I need to:
Part 1:
I need to create a .csv dataset (or R data frame?) that contains all possible combinations of 10 different variables. Each of the 10 variables has either 2 levels (i.e., binary 0/1) or 4 levels. I think this should be easy in both Excel and R, but R would be preferable. The variables and their levels are given in the R code further down.
For example, the first set of combinations would fix DrugA_LIFE at 0.5 and cycle through all combinations of the other variables, then fix DrugA_LIFE at 1 and cycle through the others again, and so on. Eventually it would move on to DrugA_NEED, fixing it at 0 while the other variables change, then at 1, and so on.
The dataset should be the full set of combinations with no repeated combinations.
I understand there is a large number of possible combinations; this is expected, but I don't think it should be too difficult to compute.
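For illustration, a minimal two-variable version of such a full factorial can be produced with base R's expand.grid (the column names here are just shorthand for illustration, and the row order differs from the description above, but the set of combinations is the same):
expand.grid(LIFE = c(0.5, 1), NEED = c(0, 1))
#   LIFE NEED
# 1  0.5    0
# 2  1.0    0
# 3  0.5    1
# 4  1.0    1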
Part 2:
I then need to go through this dataset, selecting only the combinations where:
1) DrugA_LIFE is the same as DrugB_LIFE
AND
2) DrugA_NEED is the same as DrugB_NEED
I think this should be simple with the dplyr package in R.
I have created the df in r, but do not know how to begin with cycling through to produce all possible combinations.
#DATASET OF ALL POSSIBLE CHOICE SETS#
#Creating the Vectors of choices
DrugA_LIFE <- c(0.5, 1, 2, 5)
DrugA_NEED <- c(0, 1)
DrugA_CERT <- c(0, 0.2, 0.4, 0.6)
DrugA_RISK <- c(0.1, 0.2, 0.4, 0.6)
DrugA_WAIT <- c(0, 0.5, 1, 2)
DrugB_LIFE <- c(0.5, 1, 2, 5)
DrugB_NEED <- c(0, 1)
DrugB_CERT <- c(0, 0.2, 0.4, 0.6)
DrugB_RISK <- c(0.1, 0.2, 0.4, 0.6)
DrugB_WAIT <- c(0, 0.5, 1, 2)
#Create data frame
df <- data.frame(DrugA_LIFE, DrugA_NEED, DrugA_CERT, DrugA_RISK, DrugA_WAIT, DrugB_LIFE, DrugB_NEED, DrugB_CERT, DrugB_RISK, DrugB_WAIT)

All possible combinations? expand.grid or tidyr::expand_grid. We can apply either function to an already-made frame using do.call.
Unique? Use R's unique or dplyr::distinct.
Filtering? Use dplyr::filter (or base R subset). A base R sketch using those alternatives is included after the data below.
library(dplyr)
# tidyr is called via tidyr::expand_grid, so it does not need to be attached
do.call(tidyr::expand_grid, df) %>%
  distinct() %>%
  filter(DrugA_LIFE == DrugB_LIFE, DrugA_NEED == DrugB_NEED)
# # A tibble: 32,768 × 10
#    DrugA_LIFE DrugA_NEED DrugA_CERT DrugA_RISK DrugA_WAIT DrugB_LIFE DrugB_NEED DrugB_CERT DrugB_RISK DrugB_WAIT
#         <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#  1        0.5          0          0        0.1          0        0.5          0          0        0.1        0
#  2        0.5          0          0        0.1          0        0.5          0          0        0.1        0.5
#  3        0.5          0          0        0.1          0        0.5          0          0        0.1        1
#  4        0.5          0          0        0.1          0        0.5          0          0        0.1        2
#  5        0.5          0          0        0.1          0        0.5          0          0        0.2        0
#  6        0.5          0          0        0.1          0        0.5          0          0        0.2        0.5
#  7        0.5          0          0        0.1          0        0.5          0          0        0.2        1
#  8        0.5          0          0        0.1          0        0.5          0          0        0.2        2
#  9        0.5          0          0        0.1          0        0.5          0          0        0.4        0
# 10        0.5          0          0        0.1          0        0.5          0          0        0.4        0.5
# # … with 32,758 more rows
# # ℹ Use `print(n = ...)` to see more rows
Data:
df <- structure(list(DrugA_LIFE = c(0.5, 1, 2, 5), DrugA_NEED = c(0, 1, 0, 1), DrugA_CERT = c(0, 0.2, 0.4, 0.6), DrugA_RISK = c(0.1, 0.2, 0.4, 0.6), DrugA_WAIT = c(0, 0.5, 1, 2), DrugB_LIFE = c(0.5, 1, 2, 5), DrugB_NEED = c(0, 1, 0, 1), DrugB_CERT = c(0, 0.2, 0.4, 0.6), DrugB_RISK = c(0.1, 0.2, 0.4, 0.6), DrugB_WAIT = c(0, 0.5, 1, 2)), class = "data.frame", row.names = c(NA, -4L))
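For reference, a minimal base R sketch of the same pipeline using expand.grid, unique and subset, as mentioned above; it should yield the same 32,768 rows, just in a different row order (the output file name is only an example):
res <- do.call(expand.grid, df)   # all combinations of the columns of df
res <- unique(res)                # drop duplicated rows (the *_NEED columns repeat values)
res <- subset(res, DrugA_LIFE == DrugB_LIFE & DrugA_NEED == DrugB_NEED)
nrow(res)                         # 32768 expected
# write.csv(res, "choice_sets.csv", row.names = FALSE)   # if a .csv file is needed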

Related

How to create a new column based on row values in python?

I have data like below:
df = pd.DataFrame()
df["collection_amount"] = 100, 200, 300
df["25%_coll"] = 1, 0, 1
df["75%_coll"] = 0, 1, 1
df["month"] = 4, 5, 6
I want to create an output like the one below: basically, if 25%_coll is 1, it should create a new column based on the month.
Please help me, thank you.
This should work; do ask if something doesn't make sense:
for i in range(len(df)):
    if df['25%_coll'][i] == 1:
        df['month_%i_25%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
    if df['75%_coll'][i] == 1:
        df['month_%i_75%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
To build the new columns you could try the following:
df2 = df.melt(id_vars=["month", "collection_amount"])
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]
df2 = df2.pivot_table(
    index="month", columns="new_cols", values="collection_amount",
    fill_value=0, aggfunc="sum"
).reset_index(drop=True)
.melt() the dataframe with index columns month and collection_amount.
Set the appropriate collection_amount values to 0.
Build the new column names in column new_cols.
   month  collection_amount  variable  value          new_cols
0      4                100  25%_coll      1  month_4_25%_coll
1      5                  0  25%_coll      0  month_5_25%_coll
2      6                300  25%_coll      1  month_6_25%_coll
3      4                  0  75%_coll      0  month_4_75%_coll
4      5                200  75%_coll      1  month_5_75%_coll
5      6                300  75%_coll      1  month_6_75%_coll
Use .pivot_table() on this dataframe to build the new columns.
The rest isn't completely clear: either use df = pd.concat([df, df2], axis=1), or merge on month with df.merge(df2, ...) (building df2 with .reset_index() instead of drop=True). A sketch of the first option follows the result below.
Result for the sample dataframe
df = pd.DataFrame({
    "collection_amount": [100, 200, 300],
    "25%_coll": [1, 0, 1], "75%_coll": [0, 1, 1],
    "month": [4, 5, 6]
})
is
new_cols  month_4_25%_coll  month_4_75%_coll  month_5_25%_coll  \
0                      100                 0                 0
1                        0                 0                 0
2                        0                 0                 0

new_cols  month_5_75%_coll  month_6_25%_coll  month_6_75%_coll
0                        0                 0                 0
1                      200                 0                 0
2                        0               300               300
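To attach these new columns back to the original dataframe, here is a minimal sketch of the pd.concat option mentioned above (the name combined is just for illustration; it assumes df and df2 are built exactly as in this answer, so both carry the default 0..2 index):
import pandas as pd

df = pd.DataFrame({
    "collection_amount": [100, 200, 300],
    "25%_coll": [1, 0, 1], "75%_coll": [0, 1, 1],
    "month": [4, 5, 6]
})

# df2 rebuilt with the steps from above
df2 = df.melt(id_vars=["month", "collection_amount"])
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]
df2 = df2.pivot_table(
    index="month", columns="new_cols", values="collection_amount",
    fill_value=0, aggfunc="sum"
).reset_index(drop=True)

# Both frames share the default RangeIndex 0, 1, 2, so a column-wise concat
# lines the new columns up with the original rows
combined = pd.concat([df, df2], axis=1)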

The efficient way to compare value between two cell and assign value based on condition in Numpy

The objective is to count the frequency with which two nodes have the same value.
Say, for example, we have a vector
pd.DataFrame([0,4,1,1,1],index=['A','B','C','D','E'])
as below
   0
A  0
B  4
C  1
D  1
E  1
The element Nij is equal to 1 if nodes i and j have the same value and is equal to zero otherwise.
N is then
   A  B  C  D  E
A  1  0  0  0  0
B  0  1  0  0  0
C  0  0  1  1  1
D  0  0  1  1  1
E  0  0  1  1  1
This simple example can be extended to 2D. For example, here we create an array of shape (4, 5):
   A  B  C  D  E
0  0  0  0  0  0
1  0  4  1  1  1
2  0  1  1  2  2
3  0  3  2  2  2
Similarly, we go row by row and set the element Nij to 1 if nodes i and j (the columns) have the same value in that row and to zero otherwise. Summing these over all rows gives the frequency matrix:
     A    B    C    D    E
A  4.0  1.0  1.0  1.0  1.0
B  1.0  4.0  2.0  1.0  1.0
C  1.0  2.0  4.0  3.0  3.0
D  1.0  1.0  3.0  4.0  4.0
E  1.0  1.0  3.0  4.0  4.0
Based on this, the following code is proposed. However, the current implementation uses 3 for-loops and an if-else statement.
I am curious whether the code below can be enhanced further, or whether there is a built-in method within Pandas or Numpy that can be used to achieve the same objective.
import numpy as np

arr = [[0, 0, 0, 0, 0],
       [0, 4, 1, 1, 1],
       [0, 1, 1, 2, 2],
       [0, 3, 2, 2, 2]]
arr = np.array(arr)
# C = arr
# nrows
npart = len(arr[:, 0])
# ncolumns
m = len(arr[0, :])
X = np.zeros(shape=(m, m), dtype=np.double)
for i in range(npart):
    for k in range(m):
        for p in range(m):
            # Check whether the pair have the same value or not
            if arr[i, k] == arr[i, p]:
                X[k, p] = X[k, p] + 1
            else:
                X[k, p] = X[k, p] + 0
Output:
4.00000,1.00000,1.00000,1.00000,1.00000
1.00000,4.00000,2.00000,1.00000,1.00000
1.00000,2.00000,4.00000,3.00000,3.00000
1.00000,1.00000,3.00000,4.00000,4.00000
1.00000,1.00000,3.00000,4.00000,4.00000
p.s. The index A, B, C, D, E and the use of pandas are for clarification purposes only.
With numpy, you can use broadcasting:
1D
a = np.array([0,4,1,1,1])
(a==a[:, None])*1
output:
array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 1, 1],
       [0, 0, 1, 1, 1],
       [0, 0, 1, 1, 1]])
2D
a = np.array([[0, 0, 0, 0, 0],
              [0, 4, 1, 1, 1],
              [0, 1, 1, 2, 2],
              [0, 3, 2, 2, 2]])
(a.T == a.T[:,None]).sum(2)
output:
array([[4, 1, 1, 1, 1],
       [1, 4, 2, 1, 1],
       [1, 2, 4, 3, 3],
       [1, 1, 3, 4, 4],
       [1, 1, 3, 4, 4]])
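For clarity, the same 2D expression broken into steps with shape comments (the intermediate names cols, pairs and freq are just for illustration):
import numpy as np

a = np.array([[0, 0, 0, 0, 0],
              [0, 4, 1, 1, 1],
              [0, 1, 1, 2, 2],
              [0, 3, 2, 2, 2]])

cols = a.T                      # shape (5, 4): one row per original column A..E
pairs = cols == cols[:, None]   # shape (5, 5, 4): compare every pair of columns, row by row
freq = pairs.sum(axis=2)        # shape (5, 5): number of rows on which each pair of columns agrees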

geom_violin using the weight aesthetic unexpectedly drop levels

library(tidyverse)
set.seed(12345)
dat <- data.frame(year = c(rep(1990, 100), rep(1991, 100), rep(1992, 100)),
                  fish_length = sample(x = seq(from = 10, 131, by = 0.1), 300, replace = F),
                  nb_caught = sample(x = seq(from = 1, 200, by = 0.1), 300, replace = T),
                  stringsAsFactors = F) %>%
  mutate(age = ifelse(fish_length < 20, 1,
                      ifelse(fish_length >= 20 & fish_length < 100, 2,
                             ifelse(fish_length >= 100 & fish_length < 130, 3, 4)))) %>%
  arrange(year, fish_length)
head(dat)
year fish_length nb_caught age
1 1990 10.1 45.2 1
2 1990 10.7 170.0 1
3 1990 10.9 62.0 1
4 1990 12.1 136.0 1
5 1990 14.1 80.8 1
6 1990 15.0 188.9 1
dat %>% group_by(year) %>% summarise(ages = n_distinct(age)) # Only 1992 has age 4 fish
# A tibble: 3 x 2
year ages
<dbl> <int>
1 1990 3
2 1991 3
3 1992 4
dat %>% filter(age == 4) # only 1 row for age 4
year fish_length nb_caught age
1 1992 130.8 89.2 4
Here:
year = year of sampling
fish_length = length of the fish in cm
nb_caught = number of fish caught following the use of an age-length key, hence explaining the presence of decimals
age = age of the fish
graph1: geom_violin not using the weight aesthetic.
Here, I copy each line of dat according to the value found in nb_caught.
dim(dat) # 300 rows
dat_graph1 <- dat[rep(1:nrow(dat), floor(dat$nb_caught)), ]
dim(dat_graph1) # 30932 rows
dat_graph1$nb_caught <- NULL # useless now
sum(dat$nb_caught) - nrow(dat_graph1) # 128.2 rows lost here
Since I have decimal values of nb_caught, I took the integer value to create dat_graph1. I lost 128.2 "rows" in the process.
Now for the graph:
dat_tile <- data.frame(year = sort(unique(dat$year))[sort(unique(dat$year)) %% 2 == 0])
# for the figure's background
graph1 <- ggplot(data = dat_graph1,
                 aes(x = as.factor(year), y = fish_length, fill = as.factor(age),
                     color = as.factor(age), .drop = F)) +
  geom_tile(data = dat_tile, aes(x = factor(year), y = 1, height = Inf, width = 1),
            fill = "grey80", inherit.aes = F) +
  geom_violin(draw_quantiles = c(0.05, 0.5, 0.95), color = "black",
              scale = "width", position = "dodge") +
  scale_x_discrete(expand = c(0,0)) +
  labs(x = "Year", y = "Fish length", fill = "Age", color = "Age", title = "graph1") +
  scale_fill_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
  scale_color_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
  scale_y_continuous(expand = expand_scale(mult = 0.01)) +
  theme_bw()
graph1
Note here that I have a flat bar for age 4 in year 1992.
dat_graph1 %>% filter(year == 1992, age == 4) %>% pull(fish_length) %>% unique
[1] 130.8
That is because I only have one length for that particular year-age combination.
graph2: geom_violin using the weight aesthetic.
Now, instead of copying each row of dat by the value of nb_caught, let's use the weight aesthetic.
Let's calculate the weight wt that each line of dat will have in the calculation of the density curve of each year-age combination.
dat_graph2 <- dat %>%
  group_by(year, age) %>%
  mutate(wt = nb_caught / sum(nb_caught)) %>%
  as.data.frame()
head(dat_graph2)
year fish_length nb_caught age wt
1 1990 10.1 45.2 1 0.03573123
2 1990 10.7 170.0 1 0.13438735
3 1990 10.9 62.0 1 0.04901186
4 1990 12.1 136.0 1 0.10750988
5 1990 14.1 80.8 1 0.06387352
6 1990 15.0 188.9 1 0.14932806
graph2 <- ggplot(data = dat_graph2,
                 aes(x = as.factor(year), y = fish_length, fill = as.factor(age),
                     color = as.factor(age), .drop = F)) +
  geom_tile(data = dat_tile, aes(x = factor(year), y = 1, height = Inf, width = 1),
            fill = "grey80", inherit.aes = F) +
  geom_violin(aes(weight = wt), draw_quantiles = c(0.05, 0.5, 0.95), color = "black",
              scale = "width", position = "dodge") +
  scale_x_discrete(expand = c(0,0)) +
  labs(x = "Year", y = "Fish length", fill = "Age", color = "Age", title = "graph2") +
  scale_fill_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
  scale_color_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
  scale_y_continuous(expand = expand_scale(mult = 0.01)) +
  theme_bw()
graph2
dat_graph2 %>% filter(year == 1992, age == 4)
year fish_length nb_caught age wt
1 1992 130.8 89.2 4 1
Note here that the flat bar for age 4 in year 1992 seen on graph1 has been dropped here even though the line exists in dat_graph2.
My questions
Why is the age 4 in 1992 level dropped when using the weight aesthetic? How can I overcome this?
Why are the two graphs not visually alike even though they used the same data?
Thanks in advance for your help!
1.
Problem 1 is not related to using the weight aesthetic. You can check this by dropping the weight aesthetic in the code for your second graph. The problem is that the algorithm for computing the density fails when there are too few observations.
That is the reason why age group 4 shows up in graph 1, which uses the expanded dataset: there you increase the number of observations by replicating the rows.
Unfortunately, geom_violin gives no warning in your specific case. However, if you filter dat_graph2 for age == 4, geom_violin gives you the warning
Warning message:
Computation failed in `stat_ydensity()`:
replacement has 1 row, data has 0
geom_density is much clearer on this issue, warning that groups with fewer than two observations have been dropped.
Unfortunately, I have no solution to overcome this besides working with the expanded dataset.
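As a minimal sketch of the group-size point (made-up data, not from the question; the exact warning text can differ between ggplot2 versions):
library(ggplot2)
# Group "a" has three observations, group "b" only one
d <- data.frame(g = c("a", "a", "a", "b"), y = c(1, 2, 3, 5))
# A density estimate cannot be computed from a single observation,
# so the violin for group "b" is dropped (with a warning in recent ggplot2 versions)
ggplot(d, aes(g, y)) + geom_violin()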
2.
Concerning problem 2, I have no convincing answer other than a guess: it is probably related to the details of the kernel density estimator used by geom_violin, geom_density, etc., and perhaps also to the number of data points.

Apply Process Function to Groups in a Dataframe

I've got a dataframe, like this:
df_1 = pd.DataFrame({'X' : ['A','A','A','A','B','B','B'],
                     'Y' : [1, 0, 1, 1, 0, 0, 'Nan']})
I would like to group it by X and create a column Z:
df_2 = pd.DataFrame({'X' : ['A','B'],
                     'Z' : [0.5, 0.5]})
But the thing I find difficult to describe is that I would then like to apply this function:
def fun(Y, Z):
    if Y == 1:
        Z = Z + 1
    elif Y == 0:
        Z = Z - 1
So the first Y in df_1 is a 1 in group A, so the Z for group A increases to 1.5. The next one is a 0, so it goes back to 0.5; then there are two more 1's, so it ends up at 2.5.
Which would give me:
X Z
A 2.5
B -1.5
You can modify your first DataFrame and use sum with index alignment:
0 -> -1 (subtract 1 when a zero is found)
NaN -> 0 (unchanged when a NaN is found)
import numpy as np

df_1['Y'] = pd.to_numeric(df_1.Y, errors='coerce')
u = df_1.assign(Z=df_1.Y.replace({0: -1, np.nan: 0})).groupby('X')['Z'].sum().to_frame()
df_2 = df_2.set_index('X') + u
Z
X
A 2.5
B -1.5

Get frequency of items in a pandas column in given intervals of values stored in another pandas column

My dataframe
class_lst = ["B","A","C","Z","H","K","O","W","L","R","M","Y","Q","X","X","G","G","G","G","G"]
value_lst = [1,0.999986,1,0.999358,0.999906,0.995292,0.998481,0.388307,0.99608,0.99829,1,0.087298,1,1,0.999993,1,1,1,1,1]
df = pd.DataFrame(
    {'class': class_lst,
     'val': value_lst
    })
For any interval of 'val' in ranges
ranges = np.arange(0.0, 1.1, 0.1)
I would like to get the frequency of 'val' items, as follows:
class range frequency
A (0, 0.10] 0
A (0.10, 0.20] 0
A (0.20, 0.30] 0
...
A (0.90, 1.00] 1
G (0, 0.10] 0
G (0.10, 0.20] 0
G (0.20, 0.30] 0
...
G (0.80, 0.90] 0
G (0.90, 1.00] 5
...
I tried
df.groupby(pd.cut(df.val, ranges)).count()
but the output looks like
class val
val
(0, 0.1] 1 1
(0.1, 0.2] 0 0
(0.2, 0.3] 0 0
(0.3, 0.4] 1 1
(0.4, 0.5] 0 0
(0.5, 0.6] 0 0
(0.6, 0.7] 0 0
(0.7, 0.8] 0 0
(0.8, 0.9] 0 0
(0.9, 1] 18 18
and does not match the expected one.
This might be a good start:
df["range"] = pd.cut(df['val'], ranges)
class val range
0 B 1.000000 (0.9, 1.0]
1 A 0.999986 (0.9, 1.0]
2 C 1.000000 (0.9, 1.0]
3 Z 0.999358 (0.9, 1.0]
4 H 0.999906 (0.9, 1.0]
5 K 0.995292 (0.9, 1.0]
6 O 0.998481 (0.9, 1.0]
7 W 0.388307 (0.3, 0.4]
8 L 0.996080 (0.9, 1.0]
9 R 0.998290 (0.9, 1.0]
10 M 1.000000 (0.9, 1.0]
11 Y 0.087298 (0.0, 0.1]
12 Q 1.000000 (0.9, 1.0]
13 X 1.000000 (0.9, 1.0]
14 X 0.999993 (0.9, 1.0]
15 G 1.000000 (0.9, 1.0]
16 G 1.000000 (0.9, 1.0]
17 G 1.000000 (0.9, 1.0]
18 G 1.000000 (0.9, 1.0]
19 G 1.000000 (0.9, 1.0]
and then
df.groupby(["class", "range"]).size()
class range
A (0.9, 1.0] 1
B (0.9, 1.0] 1
C (0.9, 1.0] 1
G (0.9, 1.0] 5
H (0.9, 1.0] 1
K (0.9, 1.0] 1
L (0.9, 1.0] 1
M (0.9, 1.0] 1
O (0.9, 1.0] 1
Q (0.9, 1.0] 1
R (0.9, 1.0] 1
W (0.3, 0.4] 1
X (0.9, 1.0] 2
Y (0.0, 0.1] 1
Z (0.9, 1.0] 1
This already gives the right bin for each class together with its frequency; a sketch for also listing the empty bins follows below.
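To also get the zero-frequency bins that the expected output lists, one hedged option is to keep all categories produced by pd.cut via groupby's observed argument (assuming a pandas version that supports it; the name freq is just for illustration):
import numpy as np
import pandas as pd

# df and ranges as defined in the question
class_lst = ["B","A","C","Z","H","K","O","W","L","R","M","Y","Q","X","X","G","G","G","G","G"]
value_lst = [1,0.999986,1,0.999358,0.999906,0.995292,0.998481,0.388307,0.99608,0.99829,1,0.087298,1,1,0.999993,1,1,1,1,1]
df = pd.DataFrame({"class": class_lst, "val": value_lst})
ranges = np.arange(0.0, 1.1, 0.1)

df["range"] = pd.cut(df["val"], ranges)
# observed=False keeps every category of the cut, so each class also gets
# rows with frequency 0 for the bins it never falls into
freq = (
    df.groupby(["class", "range"], observed=False)
      .size()
      .rename("frequency")
      .reset_index()
)
print(freq)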