Pandas search in ascending index and match certain column value

I have a DF with thousands of rows. Column 'col1' cycles repeatedly from 1 to 6. Column 'target' holds unique numbers:
import pandas as pd

diction = {'col1': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
           'target': [34, 65, 23, 65, 12, 87, 36, 51, 26, 74, 34, 87]}
df1 = pd.DataFrame(diction, index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
    col1  target
0      1      34
1      2      65
2      3      23
3      4      65
4      5      12
5      6      87
6      1      36
7      2      51
8      3      26
9      4      74
10     5      34
11     6      87
I'm trying to create a new column (let's call it previous_col) that matches on the col1 value. Say a row has COL1 value 2 and TARGET value 65; the next time COL1 has value 2, previous_col should refer to that previous TARGET value of 65:
    col1  previous_col  target
0      1             0      34
1      2             0      65
2      3             0      23
3      4             0      65
4      5             0      12
5      6             0      87
6      1            34      36
7      2            65      51
8      3            23      26
9      4            65      74
10     5            12      34
11     6            87      87
Note that the first 6 rows have 0 in the previous column because no previous target values exist :D
The tricky part here is that each previous target has to be extracted in ascending index order, from the first earlier row whose COL1 value matches. So with a DF of 10k rows I can't just match the same COL1 value from the top or from the middle and take its TARGET; each value in PREVIOUS_COL should come from the nearest preceding row (by index) with a matching COL1 value. I know I can do it with shift, but sometimes COL1 is out of order, not strictly 1 to 6, so I need to match the COL1 value exactly.
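A grouped shift handles this: grouping by col1 first makes shift(1) operate within each col1 group, so every row picks up the target from the most recent earlier row with the same col1 value, even when col1 does not cycle strictly from 1 to 6: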

df1['Per_col'] = df1.groupby('col1').target.shift(1).fillna(0)
df1
Out[1117]:
    col1  target  Per_col
0      1      34      0.0
1      2      65      0.0
2      3      23      0.0
3      4      65      0.0
4      5      12      0.0
5      6      87      0.0
6      1      36     34.0
7      2      51     65.0
8      3      26     23.0
9      4      74     65.0
10     5      34     12.0
11     6      87     87.0
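Per_col comes out as float because the shift introduces NaN for the first occurrence of each col1 value. If integer output is preferred, a minimal extra step (assuming the same df1 as above) is to cast after filling:

df1['Per_col'] = df1.groupby('col1').target.shift(1).fillna(0).astype(int)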

Related

R: which.max + condition does not return the expected value

I made a reproducible example of a data frame with 2 patient IDs (ID 1 and ID 2) and the value of a measurement (m_value) on different days (m_day).
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
                 m_value = c(10, 15, 12, 13, 18, 16, 19),
                 m_day = c(14, 143, 190, 402, 16, 55, 75))
  ID m_value m_day
1  1      10    14
2  1      15   143
3  1      12   190
4  1      13   402
5  2      18    16
6  2      16    55
7  2      19    75
Now I want to obtain, for each patient, the best value of m before day 100 (period 1) and from day 100 onward (period 2), together with the dates of these best values, so that I obtain the following table:
  ID m_value m_day best_m_period1 best_m_period2 date_best_m_period1 date_best_m_period2
1  1      10    14             10             15                  14                 143
2  1      15   143             10             15                  14                 143
3  1      12   190             10             15                  14                 143
4  1      13   402             10             15                  14                 143
5  2      18    16             19             NA                  75                  NA
6  2      16    55             19             NA                  75                  NA
7  2      19    75             19             NA                  75                  NA
I tried the following code:
library(dplyr)

df2 <- df %>%
  group_by(ID) %>%
  mutate(best_m_period1 = max(m_value[m_day < 100])) %>%
  mutate(best_m_period2 = max(m_value[m_day >= 100])) %>%
  mutate(date_best_m_period1 =
           ifelse(is.null(which.max(m_value[m_day < 100])), NA,
                  m_day[which.max(m_value[m_day < 100])])) %>%
  mutate(date_best_m_period2 =
           ifelse(is.null(which.max(m_value[m_day >= 100])), NA,
                  m_day[which.max(m_value[m_day >= 100])]))
But I obtain the following table:
     ID m_value m_day best_m_period1 best_m_period2 date_best_m_period1 date_best_m_period2
  <dbl>   <dbl> <dbl>          <dbl>          <dbl>               <dbl>               <dbl>
1     1      10    14             10             15                  14                  14
2     1      15   143             10             15                  14                  14
3     1      12   190             10             15                  14                  14
4     1      13   402             10             15                  14                  14
5     2      18    16             19           -Inf                  75                  NA
6     2      16    55             19           -Inf                  75                  NA
7     2      19    75             19           -Inf                  75                  NA
The date_best_m_period2 for ID 1 is not 143 as expected (corresponding to the max value of 15 for ID 1 in period 2, i.e. >= day 100), but 14, the date of the max value in period 1.
How can I resolve this problem? Thank you very much for your help.

Reshape wide to long for many columns with a common prefix

My frame has many pairs of identically named columns, with the only difference being the prefix. For example, player1.player.id and player2.player.id.
Here's an example (with fewer and shorter columns):
import pandas as pd

df = pd.DataFrame({'p1.a': {0: 4, 1: 0}, 'p1.b': {0: 1, 1: 4},
                   'p1.c': {0: 2, 1: 8}, 'p1.d': {0: 3, 1: 12},
                   'p1.e': {0: 4, 1: 16}, 'p1.f': {0: 5, 1: 20},
                   'p1.g': {0: 6, 1: 24},
                   'p2.a': {0: 0, 1: 0}, 'p2.b': {0: 3, 1: 12},
                   'p2.c': {0: 6, 1: 24}, 'p2.d': {0: 9, 1: 36},
                   'p2.e': {0: 12, 1: 48}, 'p2.f': {0: 15, 1: 60},
                   'p2.g': {0: 18, 1: 72}})
p1.a p1.b p1.c p1.d p1.e p1.f p1.g p2.a p2.b p2.c p2.d p2.e p2.f p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
I'd like to turn it into a long format, with a new side column denoting either p1 or p2. I have several crappy ways of doing it, for example:
df1 = df.filter(regex='^p1.*').assign(side='p1')
df2 = df.filter(regex='^p2.*').assign(side='p2')
df1.columns = [c.replace('p1.', '') for c in df1.columns]
df2.columns = [c.replace('p2.', '') for c in df2.columns]
pd.concat([df1, df2]).head()
a b c d e f g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
This feels non-idiomatic, and I couldn't get pd.wide_to_long() to work here.
I'd appreciate an answer which also handles arbitrary substrings, not just prefix, i.e., I'm also interested in something like this:
foo.p1.a foo.p1.b foo.p1.c foo.p1.d foo.p1.e foo.p1.f foo.p1.g foo.p2.a foo.p2.b foo.p2.c foo.p2.d foo.p2.e foo.p2.f foo.p2.g
0 4 1 2 3 4 5 6 0 3 6 9 12 15 18
1 0 4 8 12 16 20 24 0 12 24 36 48 60 72
Turning into:
foo.a foo.b foo.c foo.d foo.e foo.f foo.g side
0 4 1 2 3 4 5 6 p1
1 0 4 8 12 16 20 24 p1
0 0 3 6 9 12 15 18 p2
1 0 12 24 36 48 60 72 p2
But if there's an idiomatic way to handle prefixes whereas substrings require complexity, I'd appreciate learning about both.
What's the idiomatic (pythonic? pandonic?) way of doing this?
A couple of options to do this:
with pd.wide_to_long, you need to reorder the positions based on the delimiter; in this case we move the a, b, ... to the fore and the p1, p2 to the back, before reshaping:
temp = df.copy()
temp = temp.rename(columns=lambda col: ".".join(col.split(".")[::-1]))
(pd.wide_to_long(temp.reset_index(),
                 stubnames=["a", "b", "c", "d", "e", "f", "g"],
                 sep=".",
                 suffix=".+",
                 i="index",
                 j="side")
   .droplevel('index')
   .reset_index())
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
One limitation of pd.wide_to_long is this need to reorder the name parts. The other limitation is that the stubnames have to be explicitly specified.
Another option is via stack, where the columns are split on the delimiter and reshaped:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
temp.stack(0).droplevel(0).rename_axis('side').reset_index()
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
stack is quite flexible, and did not require us to list the column names. The limitation of stack is that it fails if the index is not unique.
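If your real frame does have a non-unique index, one workaround (a sketch, not from the original answer) is to force a unique RangeIndex before stacking; the row level is dropped by droplevel(0) anyway:

temp = df.copy()
temp.columns = temp.columns.str.split(".", expand=True)
(temp.reset_index(drop=True)
     .stack(0)
     .droplevel(0)
     .rename_axis('side')
     .reset_index())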
Another option is pivot_longer from pyjanitor, which abstracts the process:
# pip install pyjanitor
import janitor
df.pivot_longer(index=None,
                names_to=("side", ".value"),
                names_sep=".")
side a b c d e f g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
The workhorse here is .value: it tells the function that anything after . should remain as column names, while anything before . should be collated into a new column (side). Note that, unlike wide_to_long, the stubnames do not need to be stated; it abstracts that for us. Also, it can handle duplicate indices, since it uses pd.melt under the hood.
One limitation of pivot_longer is that you have to install the pyjanitor library.
For the other example, I'll use stack and pivot_longer; you can still use pd.wide_to_long to solve it (a sketch of that route is at the end of this answer).
With stack:
First, split the columns and convert them into a MultiIndex:
temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)
Reshape the data:
temp = temp.stack(1).droplevel(0).rename_axis('side')
Merge the column names:
temp.columns = temp.columns.map(".".join)
Reset the index:
temp.reset_index()
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p2 0 3 6 9 12 15 18
2 p1 0 4 8 12 16 20 24
3 p2 0 12 24 36 48 60 72
With pivot_longer, one option is to reorder the columns, before reshaping:
temp = df.copy()
temp.columns = ["".join([first, last, middle])
                for first, middle, last in
                temp.columns.str.split(r'(\.p\d)')]
(temp
 .pivot_longer(
     index=None,
     names_to=('.value', 'side'),
     names_pattern=r"(.+)\.(p\d)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
In the dev version, however, the column reorder is not necessary; we can simply use multiple .value entries to reshape the dataframe. Note that you'll have to install from the repo to get the latest dev version:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
(df
 .pivot_longer(
     index=None,
     names_to=('.value', 'side', '.value'),
     names_pattern=r"(.+)\.(.\d)(.+)")
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
Another option with names_sep:
(df
 .pivot_longer(
     index=None,
     names_to=('.value', 'side', '.value'),
     names_sep=r'\.(p\d)')
)
side foo.a foo.b foo.c foo.d foo.e foo.f foo.g
0 p1 4 1 2 3 4 5 6
1 p1 0 4 8 12 16 20 24
2 p2 0 3 6 9 12 15 18
3 p2 0 12 24 36 48 60 72
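For completeness, here is a sketch of the pd.wide_to_long route for the foo example (not from the original answer): as in the first example, reorder the column names so the p1/p2 part sits at the back, then pass the foo.a ... foo.g stubs:

temp = df.copy()
temp.columns = ["".join([first, last, middle])
                for first, middle, last in
                temp.columns.str.split(r'(\.p\d)')]
(pd.wide_to_long(temp.reset_index(),
                 stubnames=[f"foo.{c}" for c in "abcdefg"],
                 sep=".",
                 suffix=".+",
                 i="index",
                 j="side")
   .droplevel('index')
   .reset_index())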

Pandas: how to group on column change?

I am working with a log system, and I need to group data not in a standard way.
Alas, with my limited knowledge of Pandas, I couldn't find any example, probably because I don't know the proper search terms.
This is a sample dataframe:
df = pd.DataFrame({
"speed": [2, 4, 6, 8, 8, 9, 2, 3, 8, 9, 13, 18, 25, 27, 18, 8, 6, 8, 12, 20, 27, 34, 36, 41, 44, 54, 61, 60, 61, 40, 17, 12, 15, 24],
"class": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 3, 1, 1, 1, 2]
})
df.groupby(by="class").groups returns the indexes of each row, all grouped together by class value:
class indexes
1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16, 17, 18, 30, 32],
2: [12, 13, 19, 20, 21, 22, 33],
3: [23, 24, 29],
4: [25],
5: [26, 27, 28]
I need instead to split every time column class changes:
speed class
0 2 1
1 4 1
2 6 1
3 8 1
4 8 1
5 9 1
6 2 1
7 3 1
8 8 1
9 9 1
10 13 1
11 18 1
12 25 2 <= split here
13 27 2
14 18 1 <= split here
15 8 1
16 6 1
17 8 1
18 12 1 <= split here
19 20 2
20 27 2
21 34 2
22 36 2 <= split here
23 41 3
24 44 3 <= split here
25 54 4 <= split here
26 61 5
27 60 5
28 61 5 <= split here
29 40 3 <= split here
30 17 1 <= split here
31 12 1
32 15 1
33 24 2 <= split here
The desired grouping should return something like:
class count mean
0 1 12 7.50
1 2 2 26.00
2 1 5 10.40
3 2 4 29.25
4 3 2 42.50
5 4 1 54.00
6 5 3 60.66
7 3 1 40.00
8 1 3 14.66
9 2 1 24.00
Is there any command to do it not iteratively?
Create a helper Series with Series.cumsum after comparing class to its shifted values with Series.ne, then aggregate by GroupBy.agg:
g = df["class"].ne(df["class"].shift()).cumsum()
df = (df.groupby(['class', g], sort=False)['speed'].agg(['count', 'mean'])
        .reset_index(level=1, drop=True)
        .reset_index())
print (df)
class count mean
0 1 12 7.500000
1 2 2 26.000000
2 1 5 10.400000
3 2 4 29.250000
4 3 2 42.500000
5 4 1 54.000000
6 5 3 60.666667
7 3 1 40.000000
8 1 3 14.666667
9 2 1 24.000000
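To see why this works, print the helper series g: it increments every time class changes, so each consecutive run of equal class values gets its own label:
print(g.tolist())
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 7, 7, 7, 8, 9, 9, 9, 10]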
You can groupby the cumsum of when the class column differs from the value above it:
df.groupby(df["class"].diff().ne(0).cumsum()).speed.agg(['size', 'mean'])
size mean
class
1 12 7.500000
2 2 26.000000
3 5 10.400000
4 4 29.250000
5 2 42.500000
6 1 54.000000
7 3 60.666667
8 1 40.000000
9 3 14.666667
10 1 24.000000
Update: I hadn't seen how you wanted the class column. What you can do is group by the original class column as well as the cumsum above, and do a bit of index sorting and resetting (but at this point this answer just converges with @jezrael's answer :P)
result = (
    df.groupby(["class", df["class"].diff().ne(0).cumsum()])
      .speed.agg(["size", "mean"])
      .sort_index(level=1)
      .reset_index(level=0)
      .reset_index(drop=True)
)
class size mean
0 1 12 7.500000
1 2 2 26.000000
2 1 5 10.400000
3 2 4 29.250000
4 3 2 42.500000
5 4 1 54.000000
6 5 3 60.666667
7 3 1 40.000000
8 1 3 14.666667
9 2 1 24.000000
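An equivalent formulation (a sketch on the original df, not from the original answers) avoids the double group key and the index sorting by using named aggregation, taking each run's class with 'first':

g = df["class"].diff().ne(0).cumsum()
(df.groupby(g)
   .agg(**{"class": ("class", "first"),
           "count": ("speed", "size"),
           "mean": ("speed", "mean")})
   .reset_index(drop=True))

This returns the same table as above.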

How to compute mean for different size subsets within pandas dataframe?

I want to compute the mean of a particular column for each subset of rows in a pandas dataframe. In the following example each subset runs until a 1 appears in column "Flag", i.e. (54+34+78+91+29)/5 = 57.2 and (81+44+61)/3 = 62.0.
Currently I am unable to compute this, since the subsets have different sizes determined by a column condition rather than a fixed rolling window.
>>> import pandas as pd
>>> df = pd.DataFrame({"Indx": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
...                    "Units": [54, 34, 78, 91, 29, 81, 44, 61, 73, 19],
...                    "Flag": [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]})
>>> df
Indx Units Flag
0 1 54 0
1 2 34 0
2 3 78 0
3 4 91 0
4 5 29 1
5 6 81 0
6 7 44 0
7 8 61 1
8 9 73 0
9 10 19 1
# DESIRED OUTPUT
>>> df
Indx Units Flag avg
0 1 54 0 57.2
1 2 34 0 57.2
2 3 78 0 57.2
3 4 91 0 57.2
4 5 29 1 57.2
5 6 81 0 62.0
6 7 44 0 62.0
7 8 61 1 62.0
8 9 73 0 46.0
9 10 19 1 46.0
Create the group key by reversing Flag and using cumsum, then transform:
df['Units'].groupby(df.Flag.iloc[::-1].cumsum()).transform('mean')
0 57.2
1 57.2
2 57.2
3 57.2
4 57.2
5 62.0
6 62.0
7 62.0
8 46.0
9 46.0
Name: Units, dtype: float64
# to assign back to the dataframe:
df['new'] = df['Units'].groupby(df.Flag.iloc[::-1].cumsum()).transform('mean')
The shortest solution (I think) is:
df['avg'] = df.groupby(df.Flag[::-1].cumsum()).Units.transform('mean')
You don't even need to use iloc, as df.Flag[::-1] already retrieves the Flag column in reversed order.
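The reversal matters because Flag marks the end of each subset: taking cumsum over the reversed flags gives every row up to and including each closing 1 a common label. A quick way to inspect the group key (same df as above):

print(df.Flag[::-1].cumsum()[::-1].tolist())
# [3, 3, 3, 3, 3, 2, 2, 2, 1, 1]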

How to count each x entries and mark the occurrence of this sequence with a value in a pandas dataframe?

I want to create a column C (based on B) which counts each series of 4 entries in B (or in the dataframe in general). I have the following pandas dataframe:
A B
1 100
2 102
3 103
4 104
5 105
6 106
7 108
8 109
9 110
10 112
11 113
12 115
13 116
14 118
15 120
16 121
I want to create the following column C:
A C
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 3
10 3
11 3
12 3
13 4
14 4
15 4
16 4
This column C should count each series of 4 entries of the dataframe.
Thanks in advance.
Use:
df['C'] = df.index // 4 + 1
Given that you have a fairly simple dataframe, it's okay to assume that you have a generic index, which is a RangeIndex object.
In your example it would look like this:
df.index
#RangeIndex(start=0, stop=16, step=1)
That being said, the values of this index are the following:
df.index.values
#array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype=int64)
Converting such an array into your desired output is done with the formula:
x // 4 + 1
where // is the floor-division operator.
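A quick sanity check of the formula on the first eight index values:

print([i // 4 + 1 for i in range(8)])
# [1, 1, 1, 1, 2, 2, 2, 2]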
A more general solution is to create a numpy array with np.arange, then use integer division by 4 and add 1 (because Python counts from 0):
import numpy as np

df['C'] = np.arange(len(df)) // 4 + 1
print (df)
A B C
0 1 100 1
1 2 102 1
2 3 103 1
3 4 104 1
4 5 105 2
5 6 106 2
6 7 108 2
7 8 109 2
8 9 110 3
9 10 112 3
10 11 113 3
11 12 115 3
12 13 116 4
13 14 118 4
14 15 120 4
15 16 121 4