Convert a Pandas DataFrame into multiple rows - pandas

I have a dataframe with N rows and 4 columns, i.e. shape (N, 4). I want to reshape it so that every consecutive (4, 4) block of rows becomes a single row of 16 values, giving shape (N/4, 16).
For example if I have:
A B C D
1, 5, 9, 13
2, 6, 10, 14
3, 7, 11, 15
4, 8, 12, 16
1, 11, 56, 9
1, 34, 87, 91
67, 67, 9, 1
1, 37, 77, 9
I want a result like this:
A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 A4 B4 C4 D4
1 5 9 13 2 6 10 14 3 7 11 15 4 8 12 16
1 11 56 9 1 34 87 91 67 67 9 1 1 37 77 9

You can do it using reshape from numpy:
import pandas as pd
from io import StringIO

df = pd.read_csv(
    StringIO(
        """
1, 5, 9, 13
2, 6, 10, 14
3, 7, 11, 15
4, 8, 12, 16
1, 11, 56, 9
1, 34, 87, 91
67, 67, 9, 1
1, 37, 77, 9
"""
    ),
    header=None,  # the data has no header row; without this the first row is consumed as column names
)

pd.DataFrame(
    df.values.reshape(2, 16),
    columns=[f"{i}{n}" for n in range(1, 5) for i in ["A", "B", "C", "D"]],
)
This will return:
A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 A4 B4 C4 D4
0 1 5 9 13 2 6 10 14 3 7 11 15 4 8 12 16
1 1 11 56 9 1 34 87 91 67 67 9 1 1 37 77 9
as requested.
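For an input with any number of rows that is a multiple of 4, the same reshape generalizes by letting numpy infer the row count; a minimal sketch under that divisibility assumption:
out = pd.DataFrame(
    df.values.reshape(-1, 16),  # -1: infer the number of output rows (requires len(df) % 4 == 0)
    columns=[f"{i}{n}" for n in range(1, 5) for i in ["A", "B", "C", "D"]],
)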

Related

How to stack and rename N successive columns in df

How would I achieve the desired output as shown below? I.e., stack the first 3 columns underneath each other, stack the second 3 columns underneath each other, and rename the columns.
d = {'A': [76, 34], 'B': [21, 48], 'C': [45, 89], 'D': [56, 41], 'E': [3, 2],
     'F': [78, 32]}
df = pd.DataFrame(data=d)
df.columns = ['A', 'A', 'A', 'A', 'A', 'A']
Output
df
A A A A A A
0 76 21 45 56 3 78
1 34 48 89 41 2 32
Desired Output
Z1 Z2
0 76 56
1 34 41
2 21 3
3 48 2
4 45 78
5 89 32
Go down into numpy, reshape and create a new dataframe:
pd.DataFrame(df.to_numpy().reshape((-1, 2), order='F'), columns = ['Z1','Z2'])
Out[19]:
Z1 Z2
0 76 56
1 34 41
2 21 3
3 48 2
4 45 78
5 89 32
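The order='F' (column-major) argument is what makes this work: the 2x6 array is read down its columns (all of A, then B, C, D, E, F) and those 12 values are poured column-wise into the (6, 2) result, so columns A, B, C stack into Z1 and D, E, F into Z2. A small sketch of the intermediate step:
arr = df.to_numpy()                         # shape (2, 6)
flat = arr.ravel(order='F')                 # [76, 34, 21, 48, 45, 89, 56, 41, 3, 2, 78, 32]
stacked = flat.reshape((-1, 2), order='F')  # first 6 values fill Z1, last 6 fill Z2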

How to duplicate each row with only one column different from the previous row in a pandas data frame?

I have a large dataset and I want to duplicate each row just below the original row, changing just one column's value.
I want to copy the previous row's values in place of "same", and I want the last column (F) to take the same value as the C column.
import numpy as np
import pandas as pd

df = pd.DataFrame([[45, 20, 'A1', 46, 20, 'A2'],
                   [45, 20, 'B2', 46, 20, 'B1'],
                   [46, 20, 'A2', 47, 20, 'A1'],
                   [46, 20, 'B1', 47, 20, 'B2']],
                  columns=['A', 'B', 'C', 'D', 'E', 'F'])
new_row = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 0, "F": 0}
s = pd.Series(new_row, df.columns)
# note: DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
f = lambda d: d.append(s, ignore_index=True)
grp = np.arange(len(df)) // 1
df.groupby(grp, group_keys=False).apply(f).reset_index(drop=True)
Assuming this input:
A B C D E F
45 20 A1 46 20 A2
45 20 B2 46 20 B1
46 20 A2 47 20 A1
46 20 B1 47 20 B2
and given that you want to duplicate the rows while taking the values of C for column F:
out = (pd.concat([df, df.assign(F=df['C'])])
         .sort_index(kind='stable')
         .reset_index(drop=True)
       )
output:
A B C D E F
0 45 20 A1 46 20 A2
1 45 20 A1 46 20 A1
2 45 20 B2 46 20 B1
3 45 20 B2 46 20 B2
4 46 20 A2 47 20 A1
5 46 20 A2 47 20 A2
6 46 20 B1 47 20 B2
7 46 20 B1 47 20 B1
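The interleaving works because pd.concat keeps the original index on both halves, so each index value appears twice, and sort_index(kind='stable') preserves the original-before-copy order within each index value. A quick check of the intermediate index:
tmp = pd.concat([df, df.assign(F=df['C'])])
print(tmp.index.tolist())  # [0, 1, 2, 3, 0, 1, 2, 3] -> the stable sort pairs each row with its copy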

Pandas: how to group on column change?

I am working with a log system, and I need to group data not in a standard way.
Alas, with my limited knowledge of Pandas, I couldn't find any example, probably because I don't know the proper search terms.
This is a sample dataframe:
df = pd.DataFrame({
    "speed": [2, 4, 6, 8, 8, 9, 2, 3, 8, 9, 13, 18, 25, 27, 18, 8, 6, 8, 12, 20, 27, 34, 36, 41, 44, 54, 61, 60, 61, 40, 17, 12, 15, 24],
    "class": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 3, 1, 1, 1, 2],
})
df.groupby(by="class").groups returns indexed of each row, all grouped together by class value:
class indexes
1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16, 17, 18, 30, 32],
2: [12, 13, 19, 20, 21, 22, 33],
3: [23, 24, 29],
4: [25],
5: [26, 27, 28]
I need instead to split every time the class column changes:
speed class
0 2 1
1 4 1
2 6 1
3 8 1
4 8 1
5 9 1
6 2 1
7 3 1
8 8 1
9 9 1
10 13 1
11 18 1
12 25 2 <= split here
13 27 2
14 18 1 <= split here
15 8 1
16 6 1
17 8 1
18 12 1 <= split here
19 20 2
20 27 2
21 34 2
22 36 2 <= split here
23 41 3
24 44 3 <= split here
25 54 4 <= split here
26 61 5
27 60 5
28 61 5 <= split here
29 40 3 <= split here
30 17 1 <= split here
31 12 1
32 15 1
33 24 2 <= split here
The desired grouping should return something like:
class count mean
0 1 12 7.50
1 2 2 26.00
2 1 5 10.40
3 2 4 29.25
4 3 2 42.50
5 4 1 54.00
6 5 3 60.66
7 3 1 40.00
8 1 3 14.66
9 2 1 24.00
Is there any command to do it not iteratively?
Compare the class column with its shifted values using Series.ne (not equal), take Series.cumsum to label each consecutive run, and aggregate by GroupBy.agg:
g = df["class"].ne(df["class"].shift()).cumsum()
df = (df.groupby(['class', g], sort=False)['speed'].agg(['count','mean'])
.reset_index(level=1, drop=True)
.reset_index())
print (df)
class count mean
0 1 12 7.500000
1 2 2 26.000000
2 1 5 10.400000
3 2 4 29.250000
4 3 2 42.500000
5 4 1 54.000000
6 5 3 60.666667
7 3 1 40.000000
8 1 3 14.666667
9 2 1 24.000000
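For intuition, the helper series g numbers each run of consecutive equal class values, which is why separate runs of the same class stay apart:
print(g.tolist())
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 7, 7, 7, 8, 9, 9, 9, 10]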
You can groupby the cumsum of where the class column differs from the value in the previous row:
df.groupby(df["class"].diff().ne(0).cumsum()).speed.agg(['size', 'mean'])
size mean
class
1 12 7.500000
2 2 26.000000
3 5 10.400000
4 4 29.250000
5 2 42.500000
6 1 54.000000
7 3 60.666667
8 1 40.000000
9 3 14.666667
10 1 24.000000
Update: I hadn't seen how you wanted the class column: what you can do is group by the original class column as well as the cumsum above, then do a bit of index sorting and resetting (but at this point this answer just converges with jezrael's answer :P)
result = (
    df.groupby(["class", df["class"].diff().ne(0).cumsum()])
      .speed.agg(["size", "mean"])
      .sort_index(level=1)
      .reset_index(level=0)
      .reset_index(drop=True)
)
class size mean
0 1 12 7.500000
1 2 2 26.000000
2 1 5 10.400000
3 2 4 29.250000
4 3 2 42.500000
5 4 1 54.000000
6 5 3 60.666667
7 3 1 40.000000
8 1 3 14.666667
9 2 1 24.000000
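The sort_index(level=1) step is the key design choice here: the cumsum level increases monotonically down the original frame, so sorting on it restores the run order that grouping by class first would otherwise scramble.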

Pandas search in ascending index and match certain column value

I have a DF with thousands of rows. Column 'col1' cycles repeatedly from 1 to 6. Column 'target' holds unique numbers:
diction = {'col1': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
           'target': [34, 65, 23, 65, 12, 87, 36, 51, 26, 74, 34, 87]}
df1 = pd.DataFrame(diction, index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
col1 target
0 1 34
1 2 65
2 3 23
3 4 65
4 5 12
5 6 87
6 1 36
7 2 51
8 3 26
9 4 74
10 5 34
11 6 87
I'm trying to create a new column (let's call it previous_col) that holds, for each row, the previous target seen for the same col1 value. For example, col1 value 2 first pairs with target 65, so the next time col1 is 2 the new column should refer back to that previous target of 65 from the earlier row:
col1 previous_col target
0 1 0 34
1 2 0 65
2 3 0 23
3 4 0 65
4 5 0 12
5 6 0 87
6 1 34 36
7 2 65 51
8 3 23 26
9 4 65 74
10 5 12 34
11 6 87 87
Note that the first 6 rows have 0 values for the previous column because no previous target values exist :D
The tricky part here is that the previous target must be extracted in ascending index order, i.e. from the most recent earlier row with the same col1 value. With a DF of 10k rows, I can't just match the same col1 value from the top or from the middle and take its target; each value in previous_col should follow the index order and match on the col1 value. I know I can do it with shift, but sometimes col1 is not strictly ordered from 1 to 6, so I need to match exactly on the col1 value.
df1['Per_col'] = df1.groupby('col1').target.shift(1).fillna(0)
df1
Out[1117]:
col1 target Per_col
0 1 34 0.0
1 2 65 0.0
2 3 23 0.0
3 4 65 0.0
4 5 12 0.0
5 6 87 0.0
6 1 36 34.0
7 2 51 65.0
8 3 26 23.0
9 4 74 65.0
10 5 34 12.0
11 6 87 87.0
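How it works: groupby('col1') partitions the rows by col1 value and shift(1) moves each group's target down one position within the group, in index order, so every row receives the previous target for its own col1 value; fillna(0) covers each value's first occurrence. An illustrative check for col1 == 2:
print(df1.loc[df1['col1'] == 2, ['target', 'Per_col']])
#    target  Per_col
# 1      65      0.0
# 7      51     65.0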

pandas: conditionally select a row cell for each column based on a mask

I want to be able to extract values from a pandas dataframe using a mask. However, after searching around, I cannot find a solution to my problem.
df = pd.DataFrame(np.random.randint(0,2, size=(2,10)))
mask = np.random.randint(0,2, size=(1,10))
I basically want the mask to serve as an index lookup for each column.
So if the mask was [0,1] for columns [a,b], I want to return:
df.iloc[0,a], df.iloc[1,b]
but in a pythonic way.
I have tried e.g.:
df.apply(lambda x: df.iloc[mask[x], x] for x in range(len(mask)))
which gives a TypeError that I don't understand.
A for loop can work but is slow.
With NumPy, that's covered as advanced-indexing and should be pretty efficient -
df.values[mask, np.arange(mask.size)]
Sample run -
In [59]: df = pd.DataFrame(np.random.randint(11,99, size=(5,10)))
In [60]: mask = np.random.randint(0,5, size=(1,10))
In [61]: df
Out[61]:
0 1 2 3 4 5 6 7 8 9
0 17 87 73 98 32 37 61 58 35 87
1 52 64 17 79 20 19 89 88 19 24
2 50 33 41 75 19 77 15 59 84 86
3 69 13 88 78 46 76 33 79 27 22
4 80 64 17 95 49 16 87 82 60 19
In [62]: mask
Out[62]: array([[2, 3, 0, 4, 2, 2, 4, 0, 0, 0]])
In [63]: df.values[mask, np.arange(mask.size)]
Out[63]: array([[50, 13, 73, 95, 19, 77, 87, 58, 35, 87]])
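If a flat 1-D result (one picked value per column) is preferred over the (1, 10) array above, ravel the mask first; a minimal variation of the same advanced indexing:
picked = df.values[mask.ravel(), np.arange(mask.size)]  # shape (10,), one value per column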