Pandas: Create several rows from a column that is a list

Let's say I have something like this:
df = pd.DataFrame({'key': [1, 2, 3], 'type': [[1, 3], [1, 2, 3], [1]], 'value': [5, 1, 8]})
key type value
1 [1, 3] 5
2 [1, 2, 3] 1
3 [1] 8
Where one of the columns contains a list of items.
I would like to create several rows for each row that contains multiple types.
Obtaining this:
key type value
1 1 5
1 3 5
2 1 1
2 2 1
2 3 1
3 1 8
I've been playing with apply with axis=1, but I can't find a way to return more than one row per row of the DataFrame.
Extracting all the different 'types' and then looping and concatenating seems ugly.
Any ideas?
Thanks!!!

import itertools
import pandas as pd
import numpy as np

def melt_series(s):
    # how many times each index label has to be repeated
    lengths = s.str.len().values
    # flatten all the lists into one sequence
    flat = list(itertools.chain.from_iterable(s.values.tolist()))
    # repeat each original index label once per list element
    idx = np.repeat(s.index.values, lengths)
    return pd.Series(flat, idx, name=s.name)

melt_series(df.type).to_frame().join(df.drop(columns='type')).reindex(columns=df.columns)
setup
df = pd.DataFrame({'key': [1, 2, 3],
                   'type': [[1, 3], [1, 2, 3], [1]],
                   'value': [5, 1, 8]})
df
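As an aside (not part of the original answer), pandas 0.25 and later ship DataFrame.explode, which covers this case directly:

# one row per list element; the other columns are repeated
out = df.explode('type').reset_index(drop=True)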

Creating a dataframe using roll-forward window on multivariate time series

Based on the simplified sample dataframe
import pandas as pd
import numpy as np

timestamps = pd.date_range(start='2017-01-01', end='2017-01-05', inclusive='left')
values = np.arange(0, len(timestamps))
df = pd.DataFrame({'A': values, 'B': values * 2}, index=timestamps)
print(df)
A B
2017-01-01 0 0
2017-01-02 1 2
2017-01-03 2 4
2017-01-04 3 6
I want to use a roll-forward window of size 2 with a stride of 1 to create a resulting dataframe like
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
I.e., each window step should create a data item with the two values of A and B in this window and the A and B values immediately to the right of the window as target values.
My first idea was to use pandas' DataFrame.rolling (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html), but that only seems to work in combination with aggregate functions such as sum, which is a different use case.
Any ideas on how to implement this rolling-window-based sampling approach?
Here is one way to do it:
window_size = 3  # 2 timesteps + 1 target column
new_df = pd.concat(
    [
        df.iloc[i : i + window_size, :]      # 3 consecutive rows
        .T.reset_index()                     # features A and B become rows
        .assign(other_index=i)               # remember which window this is
        .set_index(["other_index", "index"])
        .set_axis([f"timestep_{j}" for j in range(1, window_size)] + ["target"], axis=1)
        for i in range(df.shape[0] - window_size + 1)
    ]
)
new_df.index.names = ["", ""]
print(new_df)
# Output
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
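An alternative (a sketch of my own, not from the original answer) builds the windows with NumPy's sliding_window_view instead of slicing the frame per window:

import numpy as np
import pandas as pd

window = 2  # timesteps per sample; one extra column holds the target
# win has shape (n_windows, n_features, window + 1); win[i][j, k] == df.iloc[i + k, j]
win = np.lib.stride_tricks.sliding_window_view(df.to_numpy(), window + 1, axis=0)
new_df = pd.concat({
    i: pd.DataFrame(w, index=df.columns,
                    columns=[f"timestep_{j}" for j in range(1, window + 1)] + ["target"])
    for i, w in enumerate(win)
})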

pandas finding duplicate rows with different label

I have a case where I want to sanity-check labeled data. I have hundreds of features and want to find points which have the same features but a different label. These found clusters of disagreeing labels should then be numbered and put into a new dataframe.
This isn't hard, but I am wondering what the most elegant solution for this is.
Here is an example:
import pandas as pd

df = pd.DataFrame({
    "feature_1": [0, 0, 0, 4, 4, 2],
    "feature_2": [0, 5, 5, 1, 1, 3],
    "label": ["A", "A", "B", "B", "D", "A"]
})
result_df = pd.DataFrame({
    "cluster_index": [0, 0, 1, 1],
    "feature_1": [0, 0, 4, 4],
    "feature_2": [5, 5, 1, 1],
    "label": ["A", "B", "B", "D"]
})
In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:
g = df.groupby(['feature_1', 'feature_2'])['label']

(df.assign(cluster_index=g.ngroup())   # get group id
   .loc[g.transform('size').gt(1)]     # filter out the non-duplicates
   # line below only to get a compact cluster_index range (0, 1, …)
   .assign(cluster_index=lambda d: d['cluster_index'].factorize()[0])
)
output:
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
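Since the question mentions hundreds of feature columns, a small variation (my sketch, assuming every column except label is a feature) builds the grouping key programmatically:

features = df.columns.difference(['label']).tolist()
g = df.groupby(features)['label']
result = (df.assign(cluster_index=g.ngroup())
            .loc[g.transform('size').gt(1)]
            .assign(cluster_index=lambda d: d['cluster_index'].factorize()[0]))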
First get all rows duplicated by the feature columns, then if necessary remove rows duplicated across all columns (not needed for the sample data here), and last add GroupBy.ngroup for the group indices:
df = df[df.duplicated(['feature_1', 'feature_2'], keep=False)].drop_duplicates()
df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print(df)
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1

Check if list cell contains value

Having a dataframe like this:
month transactions_ids
0 1 [0, 5, 1]
1 2 [7, 4]
2 3 [8, 10, 9, 11]
3 6 [2]
4 9 [3]
For a given transaction_id, I would like to get the month when it took place. Notice that a transaction_id can only be related to one single month.
So for example, given transaction_id = 4, the month would be 2.
I know this can be done with a loop, checking month by month whether the related transactions_ids contain the given transaction_id, but I'm wondering whether there is a more efficient way than that.
Cheers
The best way in my opinion is to explode your data frame and avoid having Python lists in your cells.
df = df.explode('transactions_ids')
which outputs
month transactions_ids
0 1 0
0 1 5
0 1 1
1 2 7
1 2 4
2 3 8
2 3 10
2 3 9
2 3 11
3 6 2
4 9 3
Then, simply
id_to_find = 1 # example
df.loc[df.transactions_ids == id_to_find, 'month']
P.S: be aware of the duplicated indexes that explode outputs. In general, it is better to do explode(...).reset_index(drop=True) for most cases to avoid unwanted behavior.
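Building on the explode idea (my own extension, not part of the original answer), if you need many lookups you can build a reverse lookup Series once, assuming the original frame with list cells:

# one-off reverse lookup: transaction id -> month
lookup = (df.explode('transactions_ids')
            .set_index('transactions_ids')['month'])
lookup.loc[4]  # -> 2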
You can use pandas string methods to find the id in the "list" (it's really just a string as far as pandas is concerned when read in with read_csv):
import pandas as pd
from io import StringIO
data = StringIO("""
month transactions_ids
1 [0,5,1]
2 [7,4]
3 [8,10,9,11]
6 [2]
9 [3]
""")
df = pd.read_csv(data, delim_whitespace=True)
df.loc[df['transactions_ids'].str.contains('4'), 'month']
In case your transactions_ids are real lists, then you can use map to check for membership:
df['transactions_ids'].map(lambda x: 3 in x)

How to make pandas work for cross multiplication

I have 3 data frames:
df1
id,k,a,b,c
1,2,1,5,1
2,3,0,1,0
3,6,1,1,0
4,1,0,5,0
5,1,1,5,0
df2
name,a,b,c
p,4,6,8
q,1,2,3
df3
type,w_ave,vac,yak
n,3,5,6
v,2,1,4
From the multiplication, using pandas and numpy, I want this output in df1:
id,k,a,b,c,w_ave,vac,yak
1,2,1,5,1,16,15,18
2,3,0,1,0,0,3,6
3,6,1,1,0,5,4,7
4,1,0,5,0,0,11,14
5,1,1,5,0,13,12,15
the conditions are:
The value of the new column will be:
# this is not code
df1["w_ave"][1] = df3["w_ave"]["v"] + df1["a"][1]*df2["a"]["q"] + df1["b"][1]*df2["b"]["q"] + df1["c"][1]*df2["c"]["q"]
so output["w_ave"][1] = 2 + (1*1) + (5*2) + (1*3)
where
df3["w_ave"]["v"] = 2
df1["a"][1] = 1, df2["a"]["q"] = 1;
df1["b"][1] = 5, df2["b"]["q"] = 2;
df1["c"][1] = 1, df2["c"]["q"] = 3;
Which means:
- a new column will be added to df1, named after the column from df3.
- for each row of df1, the values of a, b, c will be multiplied with the same-named values from df2's q row and summed together with the corresponding value of df3.
- only columns of df1 whose names match a column of df2 are multiplied; non-matching columns, like df1["k"], are not.
- however, if df1["a"] is 0 in a row, the corresponding output should be zero.
I am struggling with this, and it was tough to explain as well. My attempts are very silly; I know the attempt below will not work, but I have added it anyway:
import pandas as pd, numpy as np

data1 = "Sample_data1.csv"
data2 = "Sample_data2.csv"
data3 = "Sample_data3.csv"
folder = '~Sample_data/'
df1 = pd.read_csv(folder + data1)
df2 = pd.read_csv(folder + data2)
df3 = pd.read_csv(folder + data3)
df1 = df2 * df1
Ok, so this will in no way resemble your desired output, but vectorizing the formula you provided:
df2 = df2.set_index("name")
df3 = df3.set_index("type")
df1["w_ave"] = (df3.loc["v", "w_ave"]
                + df1["a"].mul(df2.loc["q", "a"])
                + df1["b"].mul(df2.loc["q", "b"])
                + df1["c"].mul(df2.loc["q", "c"]))
Outputs:
id k a b c w_ave
0 1 2 1 5 1 16
1 2 3 0 1 0 4
2 3 6 1 1 0 5
3 4 1 0 5 0 12
4 5 1 1 5 0 13
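To fill all three df3 columns at once, the same formula can be vectorized a bit further (my sketch, still ignoring the special zero condition, and assuming df2 and df3 have been re-indexed as above):

cols = ['a', 'b', 'c']
# dot product of each df1 row with df2's "q" row over the shared columns
dots = df1[cols].to_numpy() @ df2.loc['q', cols].to_numpy()
# add the "v" row of df3 as a per-column offset
for col in df3.columns:
    df1[col] = df3.loc['v', col] + dots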

In Python Pandas using cumsum with groupby

I am trying to do a pandas cumsum(), where I want to reset the value to 0 every time the group changes.
Say I have the dataframe below, where after a group-by I have col2 (Group) and expect col3 (Cumsum) when using the function:
Value Group Cumsum
a 1 0
a 1 1
a 1 2
b 2 0
b 2 1
b 2 2
b 2 3
c 3 0
c 3 1
d 4 0
This doesn't work:
df['Cumsum'] = df['Group'].cumsum()
Please advise.
Thanks!
Hmm, this turned out more complicated than I imagined, due to getting the groups' keys back in. Perhaps someone else will find something shorter.
First, imports
import pandas as pd
import itertools
Now a DataFrame:
df = pd.DataFrame({
    'a': ['a', 'b', 'a', 'b'],
    'b': [0, 1, 2, 3]})
So now we separately do a groupby-cumsum, some itertools stuff for finding the keys, and combine both:
>>> pd.DataFrame({
        'keys': list(itertools.chain.from_iterable(
            [len(g) * [k] for k, g in df.b.groupby(df.a)])),
        'cumsum': df.b.groupby(df.a).cumsum()})
cumsum keys
0 0 a
1 1 a
2 2 b
3 4 b
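A shorter variant (my addition, not part of the original answer) keeps the keys aligned by simply reusing the original key column; and for the 0-started counter shown in the question, groupby(...).cumcount() does it directly:

# keys stay aligned because we never leave the original row order
out = df.assign(keys=df['a'], cumsum=df.groupby('a')['b'].cumsum())

# for the question's Cumsum column (restarting at 0 per group):
# df['Cumsum'] = df.groupby('Group').cumcount()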