How to group consecutive values in other columns into ranges based on one column - pandas

I have the following dataframe:
I would like to get the following output from the dataframe
Is there any way to group the other columns ['B', 'index'] based on column 'A', using a groupby aggregation or pivot_table in pandas? I couldn't come up with an approach to write the code.

Use:
df = df.reset_index()  # if 'index' is not already a column
g=df['A'].ne(df['A'].shift()).cumsum()
new_df=df.groupby(g,as_index=False).agg(index=('index',list),A=('A','first'),B=('B',lambda x: list(x.unique())))
print(new_df)
In pandas < 0.25 (before named aggregation was added), use a dict of functions instead:
new_df=df.groupby(g,as_index=False).agg({'index':list,'A':'first','B':lambda x: list(x.unique())})
If you do not want repeated values in the 'index' lists, use the same function for the 'index' column as for 'B':
new_df=df.groupby(g,as_index=False).agg(index=('index',lambda x: list(x.unique())),A=('A','first'),B=('B',lambda x: list(x.unique())))
print(new_df)
Here is an example:
df = pd.DataFrame({'index': range(20),
                   'A': [1, 1, 1, 1, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 3, 3],
                   'B': [1, 2, 3, 5, 5, 5, 7, 8, 9, 9, 9, 12, 12, 14, 15, 16, 17, 18, 19, 20]})
print(df)
index A B
0 0 1 1
1 1 1 2
2 2 1 3
3 3 1 5
4 4 2 5
5 5 2 5
6 6 0 7
7 7 0 8
8 8 0 9
9 9 1 9
10 10 1 9
11 11 1 12
12 12 1 12
13 13 1 14
14 14 1 15
15 15 0 16
16 16 0 17
17 17 0 18
18 18 3 19
19 19 3 20
g=df['A'].ne(df['A'].shift()).cumsum()
new_df=df.groupby(g,as_index=False).agg(index=('index',list),A=('A','first'),B=('B',lambda x: list(x.unique())))
print(new_df)
index A B
0 [0, 1, 2, 3] 1 [1, 2, 3, 5]
1 [4, 5] 2 [5]
2 [6, 7, 8] 0 [7, 8, 9]
3 [9, 10, 11, 12, 13, 14] 1 [9, 12, 14, 15]
4 [15, 16, 17] 0 [16, 17, 18]
5 [18, 19] 3 [19, 20]
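Since the question title asks for ranges, here is a hedged variation of the same idea (a sketch; the start/end column names are mine, assuming you want only the first and last 'index' of each run instead of the full list):
g = df['A'].ne(df['A'].shift()).cumsum()
ranges = df.groupby(g, as_index=False).agg(start=('index', 'first'),
                                           end=('index', 'last'),
                                           A=('A', 'first'),
                                           B=('B', lambda x: list(x.unique())))
print(ranges)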

Related

How to find min and max values of a dataframe within x rolling rows without a loop?

I have this dataframe that looks like this:
df = pd.DataFrame(
    [
        [5, 8],
        [8, 10],
        [3, 15],
        [16, 20],
        [12, 21],
        [5, 9],
        [10, 12],
        [20, 22],
        [4, 10],
        [7, 13],
        [9, 15],
        [6, 9],
    ],
    columns=list("lh"),
)
I would like to know, for each row, the min and max within x rows before and after it, but without using a loop, as the dataframe is quite large and the loop takes a long time.
I have this function that works:
def pivotid(df1, l, n1, n2):  # n1, n2: rows before and after candle l
    if l - n1 < 0 or l + n2 >= len(df1):
        return 0
    pividlow = 1
    pividhigh = 1
    for i in range(l - n1, l + n2 + 1):
        if df1.l[l] > df1.l[i]:
            pividlow = 0
        if df1.h[l] < df1.h[i]:
            pividhigh = 0
    if pividlow and pividhigh:
        return 3  # both a local low and a local high
    elif pividlow:
        return 1  # local low
    elif pividhigh:
        return 2  # local high
    else:
        return 0
Here's how I call it:
df['pivot'] = df.apply(lambda x: pivotid(df, x.name, 2, 2), axis=1)
Here's the expected result:
l h pivot
0 5 8 0
1 8 10 0
2 3 15 1
3 16 20 0
4 12 21 2
5 5 9 1
6 10 12 0
7 20 22 2
8 4 10 1
9 7 13 0
10 9 15 0
11 6 9 0
Do you think that there's a way to achieve that without using a for loop with pandas?
With the dataframe you provided, here is one way to do it using Pandas rolling:
df["pivot"] = (
(df["l"] == df["l"].rolling(window=5, center=True).min()).astype(int)
+ (df["h"] == df["h"].rolling(window=5, center=True).max()).astype(int) * 2
)
Then:
print(df)
# Output
l h pivot
0 5 8 0
1 8 10 0
2 3 15 1
3 16 20 0
4 12 21 2
5 5 9 1
6 10 12 0
7 20 22 2
8 4 10 1
9 7 13 0
10 9 15 0
11 6 9 0
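As a hedged follow-up sketch (not part of the answer above): the window can be parametrised for an asymmetric n1/n2 by rolling over n1 + n2 + 1 rows and shifting the result back by n2. Edge rows compare against NaN and therefore stay 0, matching the loop version:
n1, n2 = 2, 2  # rows before / after, as in pivotid(df, i, 2, 2)
w = n1 + n2 + 1
low = df["l"].rolling(window=w).min().shift(-n2)
high = df["h"].rolling(window=w).max().shift(-n2)
df["pivot"] = df["l"].eq(low).astype(int) + df["h"].eq(high).astype(int) * 2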

Pandas: how to group on column change?

I am working with a log system, and I need to group data in a non-standard way. With my limited knowledge of Pandas I couldn't find any example, probably because I don't know the proper search terms.
This is a sample dataframe:
df = pd.DataFrame({
    "speed": [2, 4, 6, 8, 8, 9, 2, 3, 8, 9, 13, 18, 25, 27, 18, 8, 6, 8, 12, 20, 27, 34, 36, 41, 44, 54, 61, 60, 61, 40, 17, 12, 15, 24],
    "class": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 3, 1, 1, 1, 2]
})
df.groupby(by="class").groups returns indexed of each row, all grouped together by class value:
class indexes
1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16, 17, 18, 30, 32],
2: [12, 13, 19, 20, 21, 22, 33],
3: [23, 24, 29],
4: [25],
5: [26, 27, 28]
I need instead to split every time the class column changes:
speed class
0 2 1
1 4 1
2 6 1
3 8 1
4 8 1
5 9 1
6 2 1
7 3 1
8 8 1
9 9 1
10 13 1
11 18 1
12 25 2 <= split here
13 27 2
14 18 1 <= split here
15 8 1
16 6 1
17 8 1
18 12 1 <= split here
19 20 2
20 27 2
21 34 2
22 36 2 <= split here
23 41 3
24 44 3 <= split here
25 54 4 <= split here
26 61 5
27 60 5
28 61 5 <= split here
29 40 3 <= split here
30 17 1 <= split here
31 12 1
32 15 1
33 24 2 <= split here
The desired grouping should return something like:
class count mean
0 1 12 7.50
1 2 2 26.00
2 1 5 10.40
3 2 4 29.25
4 3 2 42.50
5 4 1 54.00
6 5 3 60.66
7 3 1 40.00
8 1 3 14.66
9 2 1 24.00
Is there any way to do it non-iteratively?
Compare the class column against its shifted values with Series.ne, take the Series.cumsum to label each consecutive run, and aggregate with GroupBy.agg:
g = df["class"].ne(df["class"].shift()).cumsum()
df = (df.groupby(['class', g], sort=False)['speed'].agg(['count', 'mean'])
        .reset_index(level=1, drop=True)
        .reset_index())
print (df)
class count mean
0 1 12 7.500000
1 2 2 26.000000
2 1 5 10.400000
3 2 4 29.250000
4 3 2 42.500000
5 4 1 54.000000
6 5 3 60.666667
7 3 1 40.000000
8 1 3 14.666667
9 2 1 24.000000
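For illustration (a small sketch, with values computed from the sample dataframe above), the helper g simply labels each consecutive run of equal class values:
print(g.tolist())
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 7, 7, 7, 8, 9, 9, 9, 10]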
You can groupby the cumsum of when the class column differs from the previous row (the row above it):
df.groupby(df["class"].diff().ne(0).cumsum()).speed.agg(['size', 'mean'])
size mean
class
1 12 7.500000
2 2 26.000000
3 5 10.400000
4 4 29.250000
5 2 42.500000
6 1 54.000000
7 3 60.666667
8 1 40.000000
9 3 14.666667
10 1 24.000000
Update: I hadn't seen how you wanted the class column. What you can do is group by the original class column as well as the cumsum above, and do a bit of index sorting and resetting (but at this point this answer just converges with @jezrael's answer :P):
result = (
    df.groupby(["class", df["class"].diff().ne(0).cumsum()])
      .speed.agg(["size", "mean"])
      .sort_index(level=1)
      .reset_index(level=0)
      .reset_index(drop=True)
)
class size mean
0 1 12 7.500000
1 2 2 26.000000
2 1 5 10.400000
3 2 4 29.250000
4 3 2 42.500000
5 4 1 54.000000
6 5 3 60.666667
7 3 1 40.000000
8 1 3 14.666667
9 2 1 24.000000
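A hedged variant of the same grouping (a sketch, assuming pandas >= 0.25 named aggregation) that keeps the class label without the extra index juggling:
runs = df["class"].diff().ne(0).cumsum()
result = (df.groupby(runs, sort=False)
            .agg(**{"class": ("class", "first"),
                    "size": ("speed", "size"),
                    "mean": ("speed", "mean")})
            .reset_index(drop=True))
print(result)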

Get and Modify column in groups on rows that meet a condition

I have this DataFrame:
df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3], 'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12], 'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})
day hour sales
0 1 10 0
1 1 10 40
2 1 10 30
3 2 11 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 12 20
And I would like to filter to get the first entry of each day that has sales greater than 0. As an additional thing, I would like to change the 'hour' column for these rows to 9.
So to get something like this:
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
I only came up with this iterative solution. Is there a way to apply it in a more functional way?
# Group by day:
groups = df.groupby(by=['day'])

# Get all indices of the first non-zero sales entry per day:
indices = []
for name, group in groups:
    group = group[group['sales'] > 0]
    indices.append(group.index.to_list()[0])

# Change their values:
df.iloc[indices, df.columns.get_loc('hour')] = 9
You can group df['sales'].gt(0) by df['day'], then get idxmax to find the first True per day, filter out groups which do not have any value greater than 0 by using any, and then assign with loc[]:
g = df['sales'].gt(0).groupby(df['day'])
idx = g.idxmax()
df.loc[idx[g.any()],'hour']=9
print(df)
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
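For clarity, a small sketch of the intermediate pieces, with values computed from the sample frame above:
print(g.idxmax().tolist())  # [1, 3, 8] -> first row per day with sales > 0
print(g.any().tolist())     # [True, True, True] -> every day has at least one positive sale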
Create a helper m that groups by day as well as by whether sales is non-zero and takes the first sales value of each group.
Then, use this together with df['sales'] > 0 to change the hour of those specific rows to 9 with np.where().
import numpy as np

df = pd.DataFrame({'day': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'hour': [10, 10, 10, 11, 11, 11, 12, 12, 12],
                   'sales': [0, 40, 30, 10, 80, 70, 0, 0, 20]})

m = df.groupby(['day', df['sales'].ne(0)])['sales'].transform('first')
df['hour'] = np.where((df['sales'] == m) & (df['sales'] > 0), 9, df['hour'])
df
Out[37]:
day hour sales
0 1 10 0
1 1 9 40
2 1 10 30
3 2 9 10
4 2 11 80
5 2 11 70
6 3 12 0
7 3 12 0
8 3 9 20
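Another hedged alternative (a sketch, not from either answer above, starting again from the original df): keep only the rows with positive sales, drop duplicate days so only the first such row per day remains, and assign on those labels:
first_pos = df[df['sales'] > 0].drop_duplicates('day').index
df.loc[first_pos, 'hour'] = 9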

ValueError: operands could not be broadcast together with shapes when concatenating arrays across pandas columns

I am working with a pandas dataframe that something looks like this:
col1 col2 col3 col_num
0 [-0.20447069290738076, 0.4159556680196389, -0.... [-0.10935000772973974, -0.04425263358067333, -... [51.0834196, 10.4234469] 3160
1 [-0.42439951483476124, -0.3135960467759942, 0.... [0.3842614765721414, -0.06756644506033657, 0.4... [45.5643442, 17.0118954] 3159
3 [0.3158755226012898, -0.007057682056994253, 0.... [-0.33158941456615376, 0.09637640660002277, -0... [50.6402809, 4.6667145] 3157
5 [-0.011089723491692679, -0.01649481399305317, ... [-0.02827408211098023, 0.00019040943944721592,... [53.45733965, -2.22695880505223] 3157
I would like to concatenate the vectors in each row, like so:
df['col1'] + df['col2'] + df['col3'] + df['col_num'].transform(lambda item: [item])
However I am prompted with the following error:
/opt/conda/lib/python3.6/site-packages/pandas/core/ops.py in <lambda>(x)
708 if is_object_dtype(lvalues):
709 return libalgos.arrmap_object(lvalues,
--> 710 lambda x: op(x, rvalues))
711 raise
712
ValueError: operands could not be broadcast together with shapes (30,) (86597,)
It looks like for some reason it's getting stuck at concatenating the 3rd column, which only has 2 values per row. The data is 86597 rows long. How can I fix this error?
You can convert the values of the problematic column to lists, like:
df['col1'] + df['col2'] + df['col3'].apply(list) + df['col_num'].transform(lambda x: [x])
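As a hedged aside (a sketch, assuming each cell holds a Python list or 1-D array; the 'combined' column name is just for illustration), the same row-wise concatenation can also be written with apply, which tolerates different list lengths per column:
df['combined'] = df.apply(lambda r: list(r['col1']) + list(r['col2'])
                                    + list(r['col3']) + [r['col_num']], axis=1)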
Another solution is to convert all the lists to 2d NumPy arrays and use hstack, if the lists in each column have the same length. (Storing lists in cells is not recommended anyway, because you lose the vectorised functionality that comes with NumPy arrays held in contiguous memory blocks.)
np.random.seed(123)
N = 10
df = pd.DataFrame({
    "col1": [np.random.randint(10, size=3) for i in range(N)],
    "col2": [np.random.randint(10, size=3) for i in range(N)],
    "col3": [np.random.randint(10, size=2) for i in range(N)],
    'col_num': range(N)
})
print (df)
col1 col2 col3 col_num
0 [2, 2, 6] [9, 3, 4] [2, 4] 0
1 [1, 3, 9] [6, 1, 5] [8, 1] 1
2 [6, 1, 0] [6, 2, 1] [2, 1] 2
3 [1, 9, 0] [8, 3, 5] [1, 3] 3
4 [0, 9, 3] [0, 2, 6] [5, 9] 4
5 [4, 0, 0] [2, 4, 4] [0, 8] 5
6 [4, 1, 7] [6, 3, 0] [1, 6] 6
7 [3, 2, 4] [6, 4, 7] [3, 3] 7
8 [7, 2, 4] [6, 7, 1] [5, 9] 8
9 [8, 0, 7] [5, 7, 9] [7, 9] 9
a = np.array(df['col1'].values.tolist())
b = np.array(df['col2'].values.tolist())
c = np.array(df['col3'].values.tolist())
#create Nx1 array
d = df['col_num'].values[:, None]
arr = np.hstack((a,b,c, d))
print (arr)
[[2 2 6 9 3 4 2 4 0]
[1 3 9 6 1 5 8 1 1]
[6 1 0 6 2 1 2 1 2]
[1 9 0 8 3 5 1 3 3]
[0 9 3 0 2 6 5 9 4]
[4 0 0 2 4 4 0 8 5]
[4 1 7 6 3 0 1 6 6]
[3 2 4 6 4 7 3 3 7]
[7 2 4 6 7 1 5 9 8]
[8 0 7 5 7 9 7 9 9]]
df = pd.DataFrame(arr)
print (df)
0 1 2 3 4 5 6 7 8
0 2 2 6 9 3 4 2 4 0
1 1 3 9 6 1 5 8 1 1
2 6 1 0 6 2 1 2 1 2
3 1 9 0 8 3 5 1 3 3
4 0 9 3 0 2 6 5 9 4
5 4 0 0 2 4 4 0 8 5
6 4 1 7 6 3 0 1 6 6
7 3 2 4 6 4 7 3 3 7
8 7 2 4 6 7 1 5 9 8
9 8 0 7 5 7 9 7 9 9

Merging dataframes in pandas

I am new to pandas and I am facing the following problem:
I have 2 data frames:
df1 :
x y
1 3 4
2 nan
3 6
4 nan
5 9 2
6 1 4 9
df2:
x y
1 2 3 6 1 5
2 4 1 8 7 5
3 6 3 1 4 5
4 2 1 3 5 4
5 9 2 3 8 7
6 1 4 5 3 7
The two dataframes have the same size.
I want to merge them so that the resulting dataframe I get is the following:
result :
x y
1 3 4 6 1 5
2 4 1 8 7 5
3 6 3 1 4 5
4 2 1 3 5 4
5 9 2 3 8 7
6 1 4 5 6 7
So in the result, priority is given to df2. If there is a value in df2, it is put first and the remaining values are taken from df1 (they keep the same position as in df1). There should be no repeated values in the result (i.e. if a value is in position 1 in df1 and position 3 in df2, then that value should appear only in position 1 in the result and not be repeated).
Any kind of help will be appreciated.
Thanks!
IIUC
Setup
df1 = pd.DataFrame(dict(x=range(1, 7),
                        y=[[3, 4], None, [6], None, [9, 2], [1, 4, 9]]))
df2 = pd.DataFrame(dict(x=range(1, 7), y=[[2, 3, 6, 1, 5], [4, 1, 8, 7, 5],
                                          [6, 3, 1, 4, 5], [2, 1, 3, 5, 4],
                                          [9, 2, 3, 8, 7], [1, 4, 5, 3, 7]]))
print(df1)
print()
print(df2)
x y
0 1 [3, 4]
1 2 None
2 3 [6]
3 4 None
4 5 [9, 2]
5 6 [1, 4, 9]
x y
0 1 [2, 3, 6, 1, 5]
1 2 [4, 1, 8, 7, 5]
2 3 [6, 3, 1, 4, 5]
3 4 [2, 1, 3, 5, 4]
4 5 [9, 2, 3, 8, 7]
5 6 [1, 4, 5, 3, 7]
convert to something more usable:
df1_ = df1.set_index('x').y.apply(pd.Series)
df2_ = df2.set_index('x').y.apply(pd.Series)
print(df1_)
print()
print(df2_)
0 1 2
x
1 3.0 4.0 NaN
2 NaN NaN NaN
3 6.0 NaN NaN
4 NaN NaN NaN
5 9.0 2.0 NaN
6 1.0 4.0 9.0
0 1 2 3 4
x
1 2 3 6 1 5
2 4 1 8 7 5
3 6 3 1 4 5
4 2 1 3 5 4
5 9 2 3 8 7
6 1 4 5 3 7
Combine with priority given to df1 (I think you meant df1, as that was what was consistent with my interpretation of your question and the expected output you provided), then reduce to eliminate duplicates:
print(df1_.combine_first(df2_).apply(lambda x: x.unique(), axis=1))
0 1 2 3 4
x
1 3 4 6 1 5
2 4 1 8 7 5
3 6 3 1 4 5
4 2 1 3 5 4
5 9 2 3 8 7
6 1 4 9 3 7
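As a hedged follow-up sketch (assuming you want the result back in the original single-column-of-lists layout; the column name 'y' follows the setup above):
result = (df1_.combine_first(df2_)
              .apply(lambda row: list(pd.unique(row.dropna().astype(int))), axis=1)
              .rename('y')
              .reset_index())
print(result)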