R : which.max + condition does not return the expected value - conditional-statements

I made a reproducible example of a dataframe with two patient IDs (ID 1 and ID 2) and the value of a measurement (m_value) on different days (m_day).
df <- data.frame (ID = c (1, 1, 1, 1, 2, 2, 2),
m_value = c (10, 15, 12, 13, 18, 16, 19),
m_day = c (14, 143, 190, 402, 16, 55, 75))
ID m_value m_day
1 1 10 14
2 1 15 143
3 1 12 190
4 1 13 402
5 2 18 16
6 2 16 55
7 2 19 75
Now I want to obtain, for each patient, the best value of m before day 100 (period 1) and from day 100 onwards (period 2), together with the dates of these best values, so that I get the following table:
ID m_value m_day best_m_period1 best_m_period2 date_best_m_period1 date_best_m_period2
1 1 10 14 10 15 14 143
2 1 15 143 10 15 14 143
3 1 12 190 10 15 14 143
4 1 13 402 10 15 14 143
5 2 18 16 19 NA 75 NA
6 2 16 55 19 NA 75 NA
7 2 19 75 19 NA 75 NA
I tried the following code:
library(dplyr)

df2 <- df %>%
group_by (ID)%>%
mutate (best_m_period1 = max(m_value[m_day < 100]))%>%
mutate (best_m_period2 = max (m_value[m_day >=100])) %>%
mutate (date_best_m_period1 =
ifelse (is.null(which.max(m_value[m_day<100])), NA,
m_day[which.max(m_value[m_day < 100])])) %>%
mutate (date_best_m_period2 =
ifelse (is.null(which.max(m_value[m_day >= 100])), NA,
m_day[which.max(m_value[m_day >= 100])]))
But I obtain the following table:
ID m_value m_day best_m_period1 best_m_period2 date_best_m_period1 date_best_m_period2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10 14 10 15 14 14
2 1 15 143 10 15 14 14
3 1 12 190 10 15 14 14
4 1 13 402 10 15 14 14
5 2 18 16 19 -Inf 75 NA
6 2 16 55 19 -Inf 75 NA
7 2 19 75 19 -Inf 75 NA
The date_best_m_period2 for ID 1 is not 143 as expected (the day of the max value of 15 for ID 1 in period 2, i.e. day >= 100); instead it returns 14, which is the day of the best value in period 1.
How can I resolve this problem? Thank you very much for your help.

Related

How can I plot two lines in one graph where values of the lines do not exist for the same x axis?

I would like to plot SupDem (variable) where e_boix_regime==1 and SupDem where e_boix_regime==0.
My data:
year SupDem e_boix_regime
1997 0.98 1
1998 0.75 0
My code:
dem = dem_aut[dem_aut["e_boix_regime"]==1].SupDem
aut = dem_aut[dem_aut["e_boix_regime"]==0].SupDem
year = dem_aut["year"]
plt.plot(year, dem, label="Support for Democracy in Democracies")
plt.plot(year, aut, label="Support for Democracy in Autocracies")
plt.show()
The error is the following: x and y must have same first dimension, but have shapes (53,) and (28,)
I just wanted to plot two lines together.
This can help you solve the problem. I hope you can reproduce the code with it:
two (or more) graphs in one plot with different x-axis AND y-axis scales in python
Issue
Your issue is about the shapes of x and y. To plot a line, the x-values and y-values must have the same number of data points.
Solution
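Select the years with the same conditions, dem_aut["e_boix_regime"]==1 and dem_aut["e_boix_regime"]==0, that you already apply to SupDem. (The example below uses random data in which the two regimes are coded 1 and 2.)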
Source Code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(
{
"SupDem": np.random.randint(1, 11, 30),
"year": np.random.randint(10, 21, 30),
"e_boix_regime": np.random.randint(1, 3, 30),
}
) # see DataFrame below
df["e_boix_regime"].value_counts() # 1 = 18, 2 = 12
df[df["e_boix_regime"] == 2][["SupDem", "year"]] # see below
# you need same no. of data points for both x/y axis i.e. `year` and `SupDem`
plt.plot(
df[df["e_boix_regime"] == 1]["year"], df[df["e_boix_regime"] == 1]["SupDem"], marker="o", label="e_boix_regime==1"
)
# hence applying same condition for grabbing year which is applied for SupDem
plt.plot(
df[df["e_boix_regime"] == 2]["year"], df[df["e_boix_regime"] == 2]["SupDem"], marker="o", label="e_boix_regime==2"
)
plt.xlabel("Year")
plt.ylabel("SupDem")
plt.legend()
plt.show()
Output
PS: Ignore the plotted data points; they are generated from random values.
DataFrame Outputs
SupDem year e_boix_regime
0 1 12 2
1 10 10 1
2 5 19 2
3 4 14 2
4 8 14 2
5 4 17 2
6 2 15 2
7 10 11 1
8 8 11 2
9 6 19 2
10 5 15 1
11 8 17 1
12 9 10 2
13 1 14 2
14 8 18 1
15 3 13 2
16 6 16 2
17 1 16 1
18 7 13 1
19 8 15 2
20 2 17 2
21 5 10 2
22 1 19 2
23 5 20 2
24 7 16 1
25 10 14 1
26 2 11 2
27 1 18 1
28 5 16 1
29 10 18 2
df[df["e_boix_regime"] == 2][["SupDem", "year"]]
SupDem year
0 1 12
2 5 19
3 4 14
4 8 14
5 4 17
6 2 15
8 8 11
9 6 19
12 9 10
13 1 14
15 3 13
16 6 16
19 8 15
20 2 17
21 5 10
22 1 19
23 5 20
26 2 11
29 10 18
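Applied to the 0/1 coding from the question, the same idea might look like the sketch below. The dem_aut frame here is a small made-up stand-in (only the first two rows come from the question), and sorting by year is an extra assumption so that each line is drawn left to right.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the asker's dem_aut frame
dem_aut = pd.DataFrame(
    {
        "year": [1997, 1998, 1999, 2000, 2001],
        "SupDem": [0.98, 0.75, 0.80, 0.60, 0.90],
        "e_boix_regime": [1, 0, 1, 0, 1],
    }
)

labels = {
    1: "Support for Democracy in Democracies",
    0: "Support for Democracy in Autocracies",
}

fig, ax = plt.subplots()
for regime, label in labels.items():
    # Filter once, then take x and y from the same subset so their lengths match
    subset = dem_aut[dem_aut["e_boix_regime"] == regime].sort_values("year")
    ax.plot(subset["year"], subset["SupDem"], marker="o", label=label)

ax.set_xlabel("Year")
ax.set_ylabel("SupDem")
ax.legend()
```

Because both x and y come from the same filtered subset, the two lines can have different lengths without triggering the shape error.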

ValueError: Data must be 1-dimensional......verify_integrity

Hello,
I don't understand why this issue occurs.
print("p.shape= ", p.shape)
print("dfmj_dates['deces'].shape = ",dfmj_dates['deces'].shape)
cross_dfmj = pd.crosstab(p, dfmj_dates['deces'])
That produces:
p.shape= (683, 1)
dfmj_dates['deces'].shape = (683,)
----> 3 cross_dfmj = pd.crosstab(p, dfmj_dates['deces'])
--> 654 df = DataFrame(data, index=common_idx)
--> 614 mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
--> 589 val = sanitize_array(
--> 576 subarr = _sanitize_ndim(subarr, data, dtype, index, allow_2d=allow_2d)
--> 627 raise ValueError("Data must be 1-dimensional")
ValueError: Data must be 1-dimensional
I suspect the issue comes from the difference between (683, 1) and (683,). I tried p.flatten(order='C') to get (683,), and also pd.DataFrame(dfmj_dates['deces']), but both failed.
Do you have any idea? Regards, Atapalou
print(p.head(30))
print(df.head(30))
that produces
week
0 8
1 8
2 8
3 9
4 9
5 9
6 9
7 9
8 9
9 9
10 10
11 10
12 10
13 10
14 10
15 10
16 10
17 11
18 11
19 11
20 11
21 11
22 11
23 11
24 12
25 12
26 12
27 12
28 12
29 12
deces
0 0
1 1
2 0
3 0
4 0
5 1
6 0
7 0
8 0
9 0
10 1
11 1
12 0
13 3
14 4
15 5
16 3
17 11
18 3
19 15
20 13
21 18
22 12
23 36
24 21
25 27
26 69
27 128
28 78
29 112
Try to squeeze p:
cross_dfmj = pd.crosstab(p.squeeze(), dfmj_dates['deces'])
Example:
p = np.random.random((5, 1))
p.shape
# (5, 1)
p.squeeze().shape
# (5,)
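Since p in the question has a .head() method, it is presumably a one-column DataFrame rather than a NumPy array. A minimal sketch under that assumption, with made-up values for week and deces:

```python
import pandas as pd

# Hypothetical one-column DataFrame, like p in the question
p = pd.DataFrame({"week": [8, 8, 9, 9, 9]})
deces = pd.Series([0, 1, 0, 0, 1], name="deces")

print(p.shape)  # two-dimensional: 5 rows, 1 column

# Collapse the single column into a Series (selecting p["week"] would do the same)
week = p.squeeze("columns")
print(week.shape)  # one-dimensional

table = pd.crosstab(week, deces)
print(table)
```

Passing "columns" to squeeze keeps the result a Series even if the frame happens to have a single row.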

Required data frame after explode or other option to fill a running difference b/w two columns pandas dataframe

The input data frame is given below:
data = {
'labels': ["A","B","A","B","A","B","M","B","M","B","M"],
'start': [0,9,13,23,47,77,81,92,100,104,118],
'stop': [9,13,23,47,77,81,92,100,104,118,145],
}
df = pd.DataFrame.from_dict(data)
labels start stop
0 A 0 9
1 B 9 13
2 A 13 23
3 B 23 47
4 A 47 77
5 B 77 81
6 M 81 92
7 B 92 100
8 M 100 104
9 B 104 118
10 M 118 145
The output data frame required is as below,
Try this:
df['start'] = df.apply(lambda x: range(x['start'] + 1, x['stop'] + 1), axis=1)
df = df.explode('start')
Output:
>>> df
labels start stop
0 A 1 9
0 A 2 9
0 A 3 9
0 A 4 9
0 A 5 9
0 A 6 9
0 A 7 9
0 A 8 9
0 A 9 9
1 B 10 13
1 B 11 13
1 B 12 13
1 B 13 13
2 A 14 23
2 A 15 23
2 A 16 23
2 A 17 23
2 A 18 23
2 A 19 23
2 A 20 23
2 A 21 23
2 A 22 23
2 A 23 23
...
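For a quick check on a smaller frame, here is a self-contained variant using only the first two rows of the input. It builds the lists with zip instead of apply, which is a swapped-in alternative and not the code above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "labels": ["A", "B"],
        "start": [0, 9],
        "stop": [9, 13],
    }
)

# One list per row holding every value from start + 1 up to and including stop
df["start"] = [list(range(a + 1, b + 1)) for a, b in zip(df["start"], df["stop"])]

# explode turns each list element into its own row and repeats the other columns
df = df.explode("start")
df["start"] = df["start"].astype(int)  # explode leaves the column as object dtype

print(df)
```

The original index values are repeated for each exploded row; call reset_index(drop=True) afterwards if a fresh index is needed.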

How to count each x entries and mark the occurrence of this sequence with a value in a pandas dataframe?

I want to create a column C (based on B) that numbers each consecutive series of 4 entries in B (or in the dataframe in general). I have the following pandas data frame:
A B
1 100
2 102
3 103
4 104
5 105
6 106
7 108
8 109
9 110
10 112
11 113
12 115
13 116
14 118
15 120
16 121
I want to create the following column C:
A C
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 3
10 3
11 3
12 3
13 4
14 4
15 4
16 4
This column C should count each series of 4 entries of the dataframe.
Thanks in advance.
Use:
df['C'] = df.index // 4 + 1
Given that you have a fairly simple dataframe, it's reasonable to assume that it has the default index, which is a RangeIndex object.
In your example it would look like this:
df.index
#RangeIndex(start=0, stop=16, step=1)
The values of this index are the following:
df.index.values
#array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype=int64)
Such an array is converted into your desired output using the formula:
x // 4 + 1
Where // is the operator used for floor division.
A general solution is to create a numpy array with np.arange, then use integer division by 4 and add 1, because Python counts from 0:
df['C'] = np.arange(len(df)) // 4 + 1
print (df)
A B C
0 1 100 1
1 2 102 1
2 3 103 1
3 4 104 1
4 5 105 2
5 6 106 2
6 7 108 2
7 8 109 2
8 9 110 3
9 10 112 3
10 11 113 3
11 12 115 3
12 13 116 4
13 14 118 4
14 15 120 4
15 16 121 4
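The difference between the two answers shows up when the index is not the default RangeIndex. A small sketch with a made-up string index, where df.index // 4 would not work but the positional np.arange version still does:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a non-numeric index
df = pd.DataFrame(
    {"B": [100, 102, 103, 104, 105, 106, 108, 109]},
    index=list("abcdefgh"),
)

group_size = 4  # number of consecutive rows per group

# Row positions 0..n-1, floor-divided by the group size, shifted to start at 1
df["C"] = np.arange(len(df)) // group_size + 1

print(df)
```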

Pandas search in ascending index and match certain column value

I have a DF with thousands of rows. Column 'col1' repeats the values 1 to 6. Column 'target' holds the numbers of interest:
diction = {'col1': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6], 'target': [34, 65, 23, 65, 12, 87, 36, 51, 26, 74, 34, 87]}
df1 = pd.DataFrame(diction, index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
col1 target
0 1 34
1 2 65
2 3 23
3 4 65
4 5 12
5 6 87
6 1 36
7 2 51
8 3 26
9 4 74
10 5 34
11 6 87
I'm trying to create a new column (let's call it previous_col) that holds, for each row, the target value from the previous row with the same col1 value. For example, col1 value 2 first appears with target 65, so the next row where col1 is 2 should get 65 in previous_col:
col1 previous_col target
0 1 0 34
1 2 0 65
2 3 0 23
3 4 0 65
4 5 0 12
5 6 0 87
6 1 34 36
7 2 65 51
8 3 23 26
9 4 65 74
10 5 12 34
11 6 87 87
Note that the first 6 rows have 0 in previous_col because no previous target values exist.
The tricky part is that the previous targets must be taken in ascending index order: each value in previous_col should come from the nearest earlier row with the same col1 value, not from an arbitrary matching row somewhere in a DF of 10k rows. I know I can do it with a plain shift, but sometimes col1 skips values instead of running strictly from 1 to 6, so I need to match on the exact col1 value.
df1['Per_col']=df1.groupby('col1').target.shift(1).fillna(0)
df1
Out[1117]:
col1 target Per_col
0 1 34 0.0
1 2 65 0.0
2 3 23 0.0
3 4 65 0.0
4 5 12 0.0
5 6 87 0.0
6 1 36 34.0
7 2 51 65.0
8 3 26 23.0
9 4 74 65.0
10 5 34 12.0
11 6 87 87.0
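To see that the grouped shift copes with col1 values that do not run strictly from 1 to 6, here is a sketch on made-up data with an irregular order. The astype(int) at the end is an optional addition to match the integer look of the desired table.

```python
import pandas as pd

# Hypothetical data: col1 values appear in an irregular order
df1 = pd.DataFrame(
    {
        "col1": [1, 2, 3, 1, 3, 2, 1],
        "target": [34, 65, 23, 36, 26, 51, 40],
    }
)

# Within each col1 group, shift target down by one row (in index order);
# the first occurrence of each col1 value has no predecessor and becomes 0
df1["previous_col"] = (
    df1.groupby("col1")["target"].shift(1).fillna(0).astype(int)
)

print(df1)
```

Each row picks up the target of the nearest earlier row with the same col1 value, regardless of how many other rows sit in between.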