iteration calculation based on another dataframe - pandas

How can I do the iterative calculation shown in df2 below (the desired output)? Any reference links for this? Many thanks for helping.
df1
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
df2:
a b c
0 1 0 5 >> values from df1
1 19 18 9 >> values from (df1.iloc[1] * 2) + (df2.iloc[0] * 1)
2 23 22 25 >> values from (df1.iloc[2] * 2) + (df2.iloc[1] * 1)
3 35 28 25 >> values from (df1.iloc[3] * 2) + (df2.iloc[2] * 1)
4 47 30 39 >> values from (df1.iloc[4] * 2) + (df2.iloc[3] * 1)

IIUC, you can try:
df2 = df1.mul(2).cumsum().sub(df1.iloc[0])
Output:
a b c
0 1 0 5
1 19 18 9
2 23 22 25
3 35 28 25
4 47 30 39
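Why this works (a sketch): the recurrence df2[n] = df1[n] * 2 + df2[n-1] telescopes into a cumulative sum of df1 * 2, except that row 0 should stay df1.iloc[0] rather than 2 * df1.iloc[0], hence the final subtraction. A quick check against an explicit loop:

loop = df1.copy()
for i in range(1, len(loop)):
    loop.iloc[i] = df1.iloc[i] * 2 + loop.iloc[i - 1]   # the stated recurrence
assert loop.equals(df1.mul(2).cumsum().sub(df1.iloc[0]))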
More complex operation
If you want a recurrence that cannot be expressed as a cumulative sum, such as out[n] = x[n] * 2 + out[n-1] * 3, you need to iterate:
def process(s):
    out = [s.iloc[0]]
    for x in s.iloc[1:]:
        out.append(x * 2 + out[-1] * 3)   # current value times 2, plus previous result times 3
    return out

df1.apply(process)
Output:
a b c
0 1 0 5
1 21 18 19
2 67 58 73
3 213 180 219
4 651 542 671
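A Python-level loop gets slow on long Series. If SciPy is available (an assumption here, not part of the original answer), the same linear recurrence out[n] = x[n] * 2 + out[n-1] * 3 can be computed with scipy.signal.lfilter, seeding the filter state so the first output equals the first input:

import numpy as np
from scipy.signal import lfilter

def process_fast(s):
    x = s.to_numpy(dtype=float)
    # y[n] = 2 * x[n] + 3 * y[n-1]; zi seeds the state with 3 * out[0]
    y, _ = lfilter([2.0], [1.0, -3.0], x[1:], zi=[3.0 * x[0]])
    return np.concatenate([[x[0]], y])

df1.apply(process_fast)   # same values as above, as floats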


ValueError: Data must be 1-dimensional ... verify_integrity

Hello,
I don't understand why this issue occurs.
print("p.shape= ", p.shape)
print("dfmj_dates['deces'].shape = ",dfmj_dates['deces'].shape)
cross_dfmj = pd.crosstab(p, dfmj_dates['deces'])
That produces:
p.shape =  (683, 1)
dfmj_dates['deces'].shape =  (683,)
followed by this abridged traceback:
----> 3 cross_dfmj = pd.crosstab(p, dfmj_dates['deces'])
--> 654 df = DataFrame(data, index=common_idx)
--> 614 mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
--> 589 val = sanitize_array(
--> 576 subarr = _sanitize_ndim(subarr, data, dtype, index, allow_2d=allow_2d)
--> 627 raise ValueError("Data must be 1-dimensional")
ValueError: Data must be 1-dimensional
I suspect the issue comes from the difference between (683, 1) and (683,). I tried something like p.flatten(order='C') to get (683,), and also pd.DataFrame(dfmj_dates['deces']), but both failed. Do you have any idea? Regards, Atapalou
print(p.head(30))
print(df.head(30))
which produces:
week
0 8
1 8
2 8
3 9
4 9
5 9
6 9
7 9
8 9
9 9
10 10
11 10
12 10
13 10
14 10
15 10
16 10
17 11
18 11
19 11
20 11
21 11
22 11
23 11
24 12
25 12
26 12
27 12
28 12
29 12
deces
0 0
1 1
2 0
3 0
4 0
5 1
6 0
7 0
8 0
9 0
10 1
11 1
12 0
13 3
14 4
15 5
16 3
17 11
18 3
19 15
20 13
21 18
22 12
23 36
24 21
25 27
26 69
27 128
28 78
29 112
Try to squeeze p:
cross_dfmj = pd.crosstab(p.squeeze(), dfmj_dates['deces'])
Example:
import numpy as np

p = np.random.random((5, 1))
p.shape
# (5, 1)
p.squeeze().shape
# (5,)
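The printout suggests p is a single-column DataFrame with a week column; if so, selecting that column directly also yields a 1-dimensional input (the column name here is an assumption based on the printout):

cross_dfmj = pd.crosstab(p['week'], dfmj_dates['deces'])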

Required data frame after explode, or another option to fill a running difference between two columns in a pandas dataframe

The input data frame is given below:
import pandas as pd

data = {
    'labels': ["A","B","A","B","A","B","M","B","M","B","M"],
    'start': [0,9,13,23,47,77,81,92,100,104,118],
    'stop': [9,13,23,47,77,81,92,100,104,118,145],
}
df = pd.DataFrame.from_dict(data)
labels start stop
0 A 0 9
1 B 9 13
2 A 13 23
3 B 23 47
4 A 47 77
5 B 77 81
6 M 81 92
7 B 92 100
8 M 100 104
9 B 104 118
10 M 118 145
The required output, shown as the answer's output below, expands each row into one row per integer in the interval (start, stop].
Try this:
df['start'] = df.apply(lambda x: range(x['start'] + 1, x['stop'] + 1), axis=1)
df = df.explode('start')
Output:
>>> df
labels start stop
0 A 1 9
0 A 2 9
0 A 3 9
0 A 4 9
0 A 5 9
0 A 6 9
0 A 7 9
0 A 8 9
0 A 9 9
1 B 10 13
1 B 11 13
1 B 12 13
1 B 13 13
2 A 14 23
2 A 15 23
2 A 16 23
2 A 17 23
2 A 18 23
2 A 19 23
2 A 20 23
2 A 21 23
2 A 22 23
2 A 23 23
...
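Note that explode keeps the original index (hence the repeated 0, 0, 0, ... labels) and leaves the exploded column as object dtype. If you want a clean RangeIndex and integer values afterwards:

df = df.reset_index(drop=True)
df['start'] = df['start'].astype(int)   # explode leaves object dtype behind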

Keep only the first value on duplicated column (set 0 to others)

Suppose I have the following situation:
a dataframe whose first column ['ID'] may have duplicated values.
import pandas as pd
df = pd.DataFrame({"ID": [1,2,3,4,4,5,5,5,6,6],
"l_1": [10,12,32,45,45,20,20,20,20,20],
"l_2": [11,12,32,11,21,27,38,12,9,6],
"l_3": [5,9,32,12,21,21,18,12,8,1],
"l_4": [6,21,12,77,77,2,2,2,8,8]})
ID l_1 l_2 l_3 l_4
1 10 11 5 6
2 12 12 9 21
3 32 32 32 12
4 45 11 12 77
4 45 21 21 77
5 20 27 21 2
5 20 38 18 2
5 20 12 12 2
6 20 9 8 8
6 20 6 1 8
When duplicated IDs occur:
I need to keep only the first value in columns l_1 and l_4 (the other duplicated rows must be set to zero).
Columns 'l_2' and 'l_3' must stay the same.
When duplicated IDs occur, the values in columns l_1 and l_4 are duplicated on those rows as well.
Expected output:
ID l_1 l_2 l_3 l_4
1 10 11 5 6
2 12 12 9 21
3 32 32 32 12
4 45 11 12 77
4 0 21 21 0
5 20 27 21 2
5 0 38 18 0
5 0 12 12 0
6 20 9 8 8
6 0 6 1 0
Is there a straightforward way, using pandas or numpy, to accomplish this?
I could only accomplish it with all of these steps:
x1 = df[df.duplicated(subset=['ID'], keep=False)].copy()
x1.loc[x1.groupby('ID')['l_1'].apply(lambda x: (x.shift(1) == x)), 'l_1'] = 0
x1.loc[x1.groupby('ID')['l_4'].apply(lambda x: (x.shift(1) == x)), 'l_4'] = 0
df = df.drop_duplicates(subset=['ID'], keep=False)
df = pd.concat([df, x1])
Isn't this just:
df.loc[df.duplicated('ID'), ['l_1','l_4']] = 0
Output:
ID l_1 l_2 l_3 l_4
0 1 10 11 5 6
1 2 12 12 9 21
2 3 32 32 32 12
3 4 45 11 12 77
4 4 0 21 21 0
5 5 20 27 21 2
6 5 0 38 18 0
7 5 0 12 12 0
8 6 20 9 8 8
9 6 0 6 1 0
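This works because df.duplicated('ID') (with the default keep='first') flags every occurrence of an ID after its first row, so the assignment zeroes l_1 and l_4 only on those repeats:

df.duplicated('ID')
# 0    False
# 1    False
# 2    False
# 3    False
# 4     True
# 5    False
# 6     True
# 7     True
# 8    False
# 9     True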

Pandas Groupby and divide the dataset into subgroups based on user input and label numbers to each subgroup

Here is my data:
ID Mnth Amt Flg
B 1 10 0
B 2 12 0
B 3 14 0
B 4 41 0
B 5 134 0
B 6 14 0
B 7 134 0
B 8 134 0
B 9 12 0
B 10 41 0
B 11 4 0
B 12 14 0
B 12 14 0
A 1 34 0
A 2 22 0
A 3 56 0
A 4 129 0
A 5 40 0
A 6 20 0
A 7 58 0
A 8 123 0
If I give 3 as input, my output should be:
ID Mnth Amt Flg Level_Flag
B 1 10 0 0
B 2 12 0 1
B 3 14 0 1
B 4 41 0 1
B 5 134 0 2
B 6 14 0 2
B 7 134 0 2
B 8 134 0 3
B 9 12 0 3
B 10 41 0 3
B 11 4 0 4
B 12 14 0 4
B 12 14 0 4
A 1 34 0 0
A 2 22 0 0
A 3 56 0 1
A 4 129 0 1
A 5 40 0 1
A 6 20 0 2
A 7 58 0 2
A 8 123 0 2
So basically I want to divide the data into subgroups of 3 rows each, from the bottom up, and label those subgroups as shown in the Level_Flag column. I have other IDs like A, C, and so on, so I want to do this for each ID group. Thanks in advance.
Edit: I want the same thing done after grouping by ID.
First we compute the number of subgroup labels, nums, by dividing the length of the group by n and rounding up. Then we repeat each of those labels n times. Finally we reverse the array, chop it off at the length of the group, and reverse it one more time so that the incomplete subgroup, if any, sits at the top with label 0.
import numpy as np

def create_flags(d, n):
    nums = np.ceil(len(d) / n)   # number of subgroups (a float, hence the .0 labels below)
    level_flag = np.repeat(np.arange(nums), n)[::-1][:len(d)][::-1]
    return level_flag

df['Level_Flag'] = df.groupby('ID')['ID'].transform(lambda x: create_flags(x, 3))
ID Mnth Amt Flg Level_Flag
0 B 1 10 0 0.0
1 B 2 12 0 1.0
2 B 3 14 0 1.0
3 B 4 41 0 1.0
4 B 5 134 0 2.0
5 B 6 14 0 2.0
6 B 7 134 0 2.0
7 B 8 134 0 3.0
8 B 9 12 0 3.0
9 B 10 41 0 3.0
10 B 11 4 0 4.0
11 B 12 14 0 4.0
12 B 12 14 0 4.0
To remove the rows of incomplete subgroups (those with fewer than 3 rows), use GroupBy.transform:
m = df.groupby(['ID', 'Level_Flag'])['Level_Flag'].transform('count').ge(3)
df = df[m]
ID Mnth Amt Flg Level_Flag
1 B 2 12 0 1.0
2 B 3 14 0 1.0
3 B 4 41 0 1.0
4 B 5 134 0 2.0
5 B 6 14 0 2.0
6 B 7 134 0 2.0
7 B 8 134 0 3.0
8 B 9 12 0 3.0
9 B 10 41 0 3.0
10 B 11 4 0 4.0
11 B 12 14 0 4.0
12 B 12 14 0 4.0
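A loop-free alternative (a sketch producing the same labels): number the rows of each ID group from the bottom with GroupBy.cumcount(ascending=False), integer-divide by the subgroup size, and flip so the top, possibly incomplete, subgroup gets label 0:

n = 3
size = df.groupby('ID')['ID'].transform('size')
pos = df.groupby('ID').cumcount(ascending=False)   # 0 at each group's last row
df['Level_Flag'] = (np.ceil(size / n) - 1 - pos // n).astype(int)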

Sum of group but keep the same value for each row in pandas

How can I solve the same problem as in this link, Sum of group but keep the same value for each row in r, using pandas?
I can generate a separate df that has the sum for each group and then merge it with the original, but is there a more direct way?
You can use groupby & transform as below to get your output.
df['sumx']=df.groupby(['ID', 'Group'],sort=False)['x'].transform(sum)
df['sumy']=df.groupby(['ID', 'Group'],sort=False)['y'].transform(sum)
df
output
ID Group x y sumx sumy
1 1 1 1 12 3 25
2 1 1 2 13 3 25
3 1 2 3 14 3 14
4 3 1 4 15 15 48
5 3 1 5 16 15 48
6 3 1 6 17 15 48
7 3 2 7 18 15 37
8 3 2 8 19 15 37
9 4 1 9 20 30 63
10 4 1 10 21 30 63
11 4 1 11 22 30 63
12 4 2 12 23 12 23
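Equivalently, both columns can be filled in one call by transforming a frame and prefixing the resulting column names:

df = df.join(df.groupby(['ID', 'Group'], sort=False)[['x', 'y']].transform('sum').add_prefix('sum'))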