I have a pandas df that looks like this:
Ln2
[C, C, C, C, C, C, G, I, O, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, R, R, R, R, R, R]
[C, C, C, C, C, C, G, I, O, P, P, P, R, R, R, R, R, R]
There are 43,000+ rows, and each list's length varies from 5 to 20. This is a single column, so each row is one cell holding one list.
I want to one-hot encode this column. It was suggested that I use torch.nn.functional.one_hot, but that function expects a numeric index tensor, not lists of letters. Any suggestions would be greatly appreciated.
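One possible way to get from lists of letters to a one-hot array is sketched below. It is only an illustration under stated assumptions: the three-row frame is toy data standing in for the real column, index 0 is reserved for padding so that rows of different lengths fit in one array, and the names label_to_idx and idx are made up for this example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real column: each cell holds a list of labels.
df = pd.DataFrame({"Ln2": [list("CCGPR"), list("CGIOPPR"), list("CCCCCR")]})

# Vocabulary of distinct labels; index 0 is reserved for padding (an assumption).
vocab = sorted({label for row in df["Ln2"] for label in row})
label_to_idx = {label: i + 1 for i, label in enumerate(vocab)}

# Length of the longest list, used as the padded width.
max_len = df["Ln2"].str.len().max()

# Integer index matrix of shape (rows, max_len), padded with 0.
idx = np.zeros((len(df), max_len), dtype=np.int64)
for r, row in enumerate(df["Ln2"]):
    idx[r, :len(row)] = [label_to_idx[label] for label in row]

# One-hot by looking up rows of an identity matrix:
# shape (rows, max_len, len(vocab) + 1).
one_hot = np.eye(len(vocab) + 1, dtype=np.int64)[idx]
print(one_hot.shape)
```

If the torch route is still preferred, an integer array like idx, converted to an int64 tensor, is the kind of input torch.nn.functional.one_hot works on.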
How to get the values of the previous three rows in a new column?
import pandas as pd

data = {'foo': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}
df = pd.DataFrame(data)
df = some_function(df)
print(df)
foo bar
1 a None
2 b None
3 c None
4 d ['a','b','c']
5 e ['b','c','d']
6 f ['c','d','e']
7 g ['d','e','f']
I could use the following method, which adds shifted columns and then merges them into a new one, but I wonder if there is a better way to do this:
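def some_function_v1(df):
    # Temporary columns holding the values from 1, 2 and 3 rows back
    df['foo1'] = df.foo.shift(1)
    df['foo2'] = df.foo.shift(2)
    df['foo3'] = df.foo.shift(3)
    # Oldest value first, so the row for 'd' gets ['a', 'b', 'c']
    df['bar'] = df.apply(lambda x: [x['foo3'], x['foo2'], x['foo1']], axis=1)
    df = df.drop(columns=['foo1', 'foo2', 'foo3'])
    return df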
Try sliding_window_view on foo to create a new DataFrame with the grouped lists:
window = 3
bar_df = pd.DataFrame({
    'bar': np.lib.stride_tricks.sliding_window_view(df['foo'], window).tolist()
})
Offset the index:
bar_df.index += window
bar_df:
bar
3 [a, b, c]
4 [b, c, d]
5 [c, d, e]
6 [d, e, f]
7 [e, f, g]
Then join back to the original frame:
out = df.join(bar_df)
out:
foo bar
0 a NaN
1 b NaN
2 c NaN
3 d [a, b, c]
4 e [b, c, d]
5 f [c, d, e]
6 g [d, e, f]
Complete Working Example:
import numpy as np
import pandas as pd
data = {'foo': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}
df = pd.DataFrame(data)
window = 3
bar_df = pd.DataFrame({
    'bar': np.lib.stride_tricks.sliding_window_view(df['foo'], window).tolist()
})
bar_df.index += window
out = df.join(bar_df)
print(out)
We can use a list comprehension to generate the sliding-window view:
n, v = 3, df['foo'].to_numpy()
df['bar'] = [None] * n + [v[i: i + n] for i in range(len(v) - n)]
An alternative approach uses the sliding_window_view method:
n, v = 3, df['foo'].to_numpy()
df['bar'] = [None] * n + list(np.lib.stride_tricks.sliding_window_view(v[:-1], n))
foo bar
0 a None
1 b None
2 c None
3 d [a, b, c]
4 e [b, c, d]
5 f [c, d, e]
6 g [d, e, f]
You can use shift with zip to shift and merge the lists element-wise instead of creating new columns:
df['bar'] = pd.Series(zip(df.foo.shift(3), df.foo.shift(2), df.foo.shift(1))).apply(lambda x:None if np.nan in x else list(x))
Here's a function to make the shift dynamic:
n_shift = lambda s, n: pd.Series(zip(*[s.shift(x) for x in range(n,0,-1)])).apply(lambda x:None if np.nan in x else list(x))
df['bar'] = n_shift(df.foo, 3)
Output:
foo bar
0 a None
1 b None
2 c None
3 d [a, b, c]
4 e [b, c, d]
5 f [c, d, e]
6 g [d, e, f]
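The shift-and-zip idea above can also be written as a named function that detects missing values with pd.isna instead of testing for np.nan by membership. This is a sketch, not a drop-in replacement: the name previous_n is hypothetical, and it assumes the original column has no missing values of its own, since any row whose window contains one would also come out as None.

```python
import pandas as pd

def previous_n(s: pd.Series, n: int) -> pd.Series:
    """For each row, list the n preceding values (oldest first), or None if fewer exist."""
    # shift(n), shift(n-1), ..., shift(1): oldest value first.
    shifted = [s.shift(k) for k in range(n, 0, -1)]
    # Each tuple from zip holds the n previous values for one row.
    out = [
        None if any(pd.isna(v) for v in row) else list(row)
        for row in zip(*shifted)
    ]
    return pd.Series(out, index=s.index, dtype=object)

df = pd.DataFrame({"foo": ["a", "b", "c", "d", "e", "f", "g"]})
df["bar"] = previous_n(df["foo"], 3)
print(df)
```

Passing the original index through keeps the result aligned even when the frame does not use the default 0..n-1 index.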
I have a df like this:
ix y1 y2 id
ix1 X X AP10579
ix2 E E AP17998
ix3 C C AP283716
ix4 C C AP283716
ix5 E E AP17998
ix6 T T AP21187
ix7 X Z AP10579
ix8 T K AP21187
ix9 E E AP12457
ix10 C C Ap87930
In the id column, some ids appear twice (for example, ix1 and ix7 have the same id, as do ix2 and ix5, and so on), and some ids are unique.
For each id that appears twice, I want to check whether y1+y2 is the same in both rows.
If it is, one of the two rows should be moved to a new df.
Every row with a unique id should be moved as well.
So I should end up with a new df, df_new, like this:
ix y1 y2 id
ix2 E E AP17998
ix3 C C AP283716
ix9 E E AP12457
ix10 C C Ap87930
Any suggestions are appreciated.
df = {
'ix': ['ix1','ix2','ix3','ix4','ix5','ix6','ix7','ix8','ix9','ix10'],
'y1': ['X','E','C','C','E','T','X','T', 'E','C'],
'y2': ['X','E','C','C','E','T','Z','K', 'E','C'],
'id': ['AP10579','AP17998','AP283716','AP283716','AP17998','AP21187','AP10579','AP21187', 'AP12457', 'Ap87930']
}
This is a possible approach:
df = pd.DataFrame({
'ix': ['ix1','ix2','ix3','ix4','ix5','ix6','ix7','ix8','ix9','ix10'],
'y1': ['X','E','C','C','E','T','X','T', 'E','C'],
'y2': ['X','E','C','C','E','T','Z','K', 'E','C'],
'id': ['AP10579','AP17998','AP283716','AP283716','AP17998','AP21187','AP10579','AP21187', 'AP12457', 'Ap87930']
})
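def filter_df(g):
    # An id that occurs only once is always kept
    if len(g) == 1:
        return g.iloc[0]
    # Keep the first row when y1 and y2 each have a single distinct value in the group
    if g.y1.unique().size + g.y2.unique().size == 2:
        return g.iloc[0]
    # Otherwise the function returns None and the group is dropped by dropna()

df.groupby('id').agg(filter_df).dropna().reset_index()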
output:
id ix y1 y2
0 AP12457 ix9 E E
1 AP17998 ix2 E E
2 AP283716 ix3 C C
3 Ap87930 ix10 C C
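A different route to the same rows, sketched here as an alternative rather than as the method above, is to count the distinct y1 and y2 values per id with transform and then keep the first row of every id whose rows all agree. The names n_distinct and consistent are made up for this example; it assumes "the same" means that y1 and y2 each take a single value within the id.

```python
import pandas as pd

df = pd.DataFrame({
    'ix': ['ix1', 'ix2', 'ix3', 'ix4', 'ix5', 'ix6', 'ix7', 'ix8', 'ix9', 'ix10'],
    'y1': ['X', 'E', 'C', 'C', 'E', 'T', 'X', 'T', 'E', 'C'],
    'y2': ['X', 'E', 'C', 'C', 'E', 'T', 'Z', 'K', 'E', 'C'],
    'id': ['AP10579', 'AP17998', 'AP283716', 'AP283716', 'AP17998',
           'AP21187', 'AP10579', 'AP21187', 'AP12457', 'Ap87930'],
})

# For every row, the number of distinct y1 / y2 values among rows with the same id.
n_distinct = df.groupby('id')[['y1', 'y2']].transform('nunique')

# True where the id's rows agree on both columns (single-row ids qualify trivially).
consistent = (n_distinct == 1).all(axis=1)

# Keep one row per qualifying id, in the original row order.
df_new = df[consistent].drop_duplicates(subset='id').reset_index(drop=True)
print(df_new)
```

This keeps the original column order and row order, so the result lists ix2, ix3, ix9 and ix10 as in the question.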