Mapping dictionary of lists to a pandas df - pandas

I have a dictionary which contains an id and a list of corresponding values for that id.
I am attempting to map this dictionary to a pandas df.
The df contains the same id to map to, but it needs to map the items in that list in order of appearance within the df.
For example:
sample_dict = {0:[0.1,0.4,0.5], 1:[0.2,0.14,0.3], 2:[0.2,0.1,0.4]}
The df looks like:
The output of mapping the dictionary to the df would look like:
Sorry for typing the table out like this, the actual df is very large, and I'm still new to stack exchange and pandas.
The end output should just map each id's list values, in order, to the players as they appear, since the df is sorted by id and then by player.

Let us try explode with reindex
df['new'] = pd.Series(sample_dict).reindex(df.id.unique()).explode().values
df
Out[140]:
id Player new
0 0 1 0.1
1 0 2 0.4
2 0 3 0.5
3 1 1 0.2
4 1 2 0.14
5 1 3 0.3
6 2 1 0.2
7 2 2 0.1
8 2 3 0.4
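For reference, here is a minimal self-contained sketch of the same idea. The question's input table did not survive, so the `df` below is reconstructed from the output above and is an assumption. The second column, `new_by_position` (a hypothetical name), shows an alternative using `groupby(...).cumcount()`, which looks each value up by its position within the id and so does not rely on the df being sorted by id.

```python
import pandas as pd

sample_dict = {0: [0.1, 0.4, 0.5], 1: [0.2, 0.14, 0.3], 2: [0.2, 0.1, 0.4]}

# Assumed input: three players per id, sorted by id and then by player.
df = pd.DataFrame({
    "id": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "Player": [1, 2, 3, 1, 2, 3, 1, 2, 3],
})

# Approach from the answer: one row per list element, in id order.
exploded = pd.Series(sample_dict).reindex(df["id"].unique()).explode()
assert len(exploded) == len(df)  # each list must match the row count of its id
df["new"] = exploded.to_numpy(dtype=float)

# Alternative: position of each row within its id, then a plain lookup.
position = df.groupby("id").cumcount()
df["new_by_position"] = [
    sample_dict[int(i)][int(k)] for i, k in zip(df["id"], position)
]

print(df)
```

Both columns agree on this sample. The `explode` version assigns by position across the whole frame, so it silently misaligns if the rows of one id are not contiguous.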

pandas finding duplicate rows with different label

I want to sanity-check labeled data. I have hundreds of features and want to find points that have the same features but different labels. Each cluster of disagreeing labels should then be numbered and put into a new dataframe.
This isn't hard, but I am wondering what the most elegant solution is.
Here is an example:
import pandas as pd
df = pd.DataFrame({
"feature_1" : [0,0,0,4,4,2],
"feature_2" : [0,5,5,1,1,3],
"label" : ["A","A","B","B","D","A"]
})
result_df = pd.DataFrame({
"cluster_index" : [0,0,1,1],
"feature_1" : [0,0,4,4],
"feature_2" : [5,5,1,1],
"label" : ["A","B","B","D"]
})
In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:
g = df.groupby(['feature_1', 'feature_2'])['label']
(df.assign(cluster_index=g.ngroup()) # get group name
.loc[g.transform('size').gt(1)] # filter the non-duplicates
# line below only to have a nice cluster_index range (0,1…)
.assign(cluster_index= lambda d: d['cluster_index'].factorize()[0])
)
output:
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
First get all rows that are duplicated in the feature columns, then if necessary remove rows duplicated across all columns (not necessary for this sample data), and finally add GroupBy.ngroup for the group indices:
df = df[df.duplicated(['feature_1','feature_2'],keep=False)].drop_duplicates()
df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print (df)
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
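Both answers above select rows whose features repeat. If the goal is strictly "same features, different label", one possible variant is to count distinct labels per feature group. This is a sketch on the question's sample data; the names `features`, `n_labels` and `conflicts` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "feature_1": [0, 0, 0, 4, 4, 2],
    "feature_2": [0, 5, 5, 1, 1, 3],
    "label": ["A", "A", "B", "B", "D", "A"],
})

features = ["feature_1", "feature_2"]

# Number of distinct labels within each feature combination, per row.
n_labels = df.groupby(features)["label"].transform("nunique")

# Keep only rows whose feature combination carries more than one label.
conflicts = df[n_labels > 1].copy()

# Number the remaining groups 0, 1, ... in sorted key order.
conflicts["cluster_index"] = conflicts.groupby(features).ngroup()

print(conflicts)
```

On this sample the result matches the outputs above. The difference only shows when a feature combination repeats with an identical label, which this variant leaves out.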

python - List of Lists into pandas dataframe including name of columns

I would like to transfer a list of lists into a dataframe with columns based on the lists in the list.
This is still easy.
list = [[....],[....],[...]]
df = pd.DataFrame(list)
df = df.transpose()
The problem is: I would like to give the columns a column-name based on entries I have in another list:
list_two = [A,B,C,...]
This is the issue I'm still struggling with.
Is there any approach to solve this problem?
Thanks a lot in advance for your help.
Best regards
Sascha
Use zip with dict for dictionary of lists and pass to DataFrame:
L= [[1,2,3,5],[4,8,9,8],[1,2,5,3]]
list_two = list('ABC')
df = pd.DataFrame(dict(zip(list_two, L)))
print (df)
A B C
0 1 4 1
1 2 8 2
2 3 9 5
3 5 8 3
Or pass the list as the index parameter and transpose, so that it becomes the column names:
df = pd.DataFrame(L, index=list_two).T
print (df)
A B C
0 1 4 1
1 2 8 2
2 3 9 5
3 5 8 3
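The `dict(zip(...))` approach needs all inner lists to have the same length. If they may differ (an assumption, since the question elides the actual data), a possible sketch is to wrap each list in a `pd.Series` so that shorter columns are padded with NaN:

```python
import pandas as pd

# Hypothetical ragged input: inner lists of different lengths.
L = [[1, 2, 3, 5], [4, 8, 9], [1, 2]]
list_two = list("ABC")

# Each list becomes a Series; pandas aligns them on the row index
# and fills the missing positions with NaN.
df = pd.DataFrame({name: pd.Series(values) for name, values in zip(list_two, L)})

print(df)
```

Columns that receive NaN are converted to float, which is the usual pandas behaviour for missing values in integer data.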

I want to remove specific rows and restart the values from 1

I have a dataframe that looks like this:
Time Value
1 5
2 3
3 3
4 2
5 1
I want to remove the first two rows and then restart time from 1. The dataframe should then look like:
Time Value
1 3
2 2
3 1
I attach the code:
file0 = pd.read_excel(r'C:......xlsx')
df = file0.loc[(file0['Time']>2) & (file0['Time']<11)]
df = df.reset_index()
Now what I get is:
index Time Value
0 3 3
1 4 2
2 5 1
Thank you!
You can use the .loc[] accessor and the reset_index() method:
df=df.loc[2:].reset_index(drop=True)
Finally use list comprehension:
df['Time']=[x for x in range(1,len(df)+1)]
Now if you print df you will get your desired output:
Time Value
0 1 3
1 2 2
2 3 1
You can use df.loc to extract the subset of the dataframe, reset the index, and then change the values of the Time column.
df = df.loc[2:].reset_index(drop=True)
df['Time'] = df.index + 1
print(df)
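A self-contained sketch of this approach on the question's sample data follows. It filters on the Time value, as the question's own code does, rather than on row position; the name `out` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Time": [1, 2, 3, 4, 5], "Value": [5, 3, 3, 2, 1]})

# Drop the rows with Time <= 2 and discard the old index.
out = df.loc[df["Time"] > 2].reset_index(drop=True)

# Restart Time from 1 using the fresh 0-based index.
out["Time"] = out.index + 1

print(out)
```

Passing `drop=True` is what prevents the extra `index` column seen in the question.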
You have two ways to do that.
First:
df[2:].assign(time = df.time.values[:-2])
Which returns your desired output:
time value
1 3
2 2
3 1
Second:
df = df.set_index('time')
df['value'] = df['value'].shift(-2)
df.dropna()
This returns your output too, but converts the numbers to float64:
time value
1 3.0
2 2.0
3 1.0

How to make pandas work for cross multiplication

I have 3 data frames:
df1
id,k,a,b,c
1,2,1,5,1
2,3,0,1,0
3,6,1,1,0
4,1,0,5,0
5,1,1,5,0
df2
name,a,b,c
p,4,6,8
q,1,2,3
df3
type,w_ave,vac,yak
n,3,5,6
v,2,1,4
Using pandas and numpy, I want the multiplication to produce this output in df1:
id,k,a,b,c,w_ave,vac,yak
1,2,1,5,1,16,15,18
2,3,0,1,0,0,3,6
3,6,1,1,0,5,4,7
4,1,0,5,0,0,11,14
5,1,1,5,0,13,12,15
The conditions are as follows. The value of the new column will be:
# this is not code
df1["w_ave"][1] = df3["w_ave"]["v"]+ df1["a"][1]*df2["a"]["q"]+df1["b"][1]*df2["b"]["q"]+df1["c"][1]*df2["c"]["q"]
for output["w_ave"][1]= 2 +(1*1)+(5*2)+(1*3)
df3["w_ave"]["v"]=2
df1["a"][1]=1, df2["a"]["q"]=1 ;
df1["b"][1]=5, df2["b"]["q"]=2 ;
df1["c"][1]=1, df2["c"]["q"]=3 ;
Which means:
- A new column will be added to df1 for each column name in df3.
- For each row of df1, the values of a, b and c will be multiplied by the same-named q values from df2 and summed together with the corresponding value of df3.
- Columns of df1 whose names match columns of df2 will be multiplied. Unmatched columns, like df1[k], will not be multiplied.
- However, if there is any 0 in df1["a"], the corresponding output will be zero.
I am struggling with this, and it was tough to explain too. I know my attempt is naive and will not work, but I have added it anyway:
import pandas as pd, numpy as np
data1 = "Sample_data1.csv"
data2 = "Sample_data2.csv"
data3 = "Sample_data3.csv"
folder = '~Sample_data/'
df1 =pd.read_csv(folder + data1)
df2 =pd.read_csv(folder + data2)
df3 =pd.read_csv(folder + data3)
df1= df2 * df1
Ok, so this will in no way resemble your desired output, but vectorizing the formula you provided:
df2=df2.set_index("name")
df3=df3.set_index("type")
df1["w_ave"] = (
    df3.loc["v", "w_ave"]
    + df1["a"].mul(df2.loc["q", "a"])
    + df1["b"].mul(df2.loc["q", "b"])
    + df1["c"].mul(df2.loc["q", "c"])
)
Outputs:
id k a b c w_ave
0 1 2 1 5 1 16
1 2 3 0 1 0 4
2 3 6 1 1 0 5
3 4 1 0 5 0 12
4 5 1 1 5 0 13
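The formula can be generalized to every df3 column with a dot product over the shared column names. The sketch below rebuilds the three frames from the question. The final masking step is one interpretation of the "zero when df1["a"] is 0" rule: the desired table only shows zeros in w_ave, so it is applied to that column alone, which is an assumption.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "k": [2, 3, 6, 1, 1],
    "a": [1, 0, 1, 0, 1],
    "b": [5, 1, 1, 5, 5],
    "c": [1, 0, 0, 0, 0],
})
df2 = pd.DataFrame(
    {"name": ["p", "q"], "a": [4, 1], "b": [6, 2], "c": [8, 3]}
).set_index("name")
df3 = pd.DataFrame(
    {"type": ["n", "v"], "w_ave": [3, 2], "vac": [5, 1], "yak": [6, 4]}
).set_index("type")

# Columns present in both df1 and df2 (here a, b, c); k and id are skipped.
shared = [col for col in df2.columns if col in df1.columns]

# Row-wise weighted sum: a*q_a + b*q_b + c*q_c.
weighted = df1[shared].dot(df2.loc["q", shared])

# One new column per df3 column, offset by the "v" row of df3.
for col in df3.columns:
    df1[col] = weighted + df3.loc["v", col]

# Assumed reading of the zero rule: only w_ave is zeroed where a == 0.
df1["w_ave"] = df1["w_ave"].where(df1["a"] != 0, 0)

print(df1)
```

With this reading, the result reproduces the desired table from the question, including the zeros in w_ave for ids 2 and 4.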

Lookup a pandas df for a column value by matching rows with another dataframe

Say I have a pandas dataframe df1 as follows:
OpDay Rid Tid Sid Dist
0 18Sep 1 1 1 10
1 18Sep 1 1 1 15
2 18Sep 1 1 1 20
3 18Sep 1 5 4 5
4 18Sep 1 5 4 50
and df2 like:
S_Day R_ID T_ID S_ID ABC XYZ
0 18Sep 1 1 1 100 60
1 18Sep 1 5 4 125 100
Number of rows in df2 is equal to total number of unique combinations of OpDay+Rid+Tid+Sid in df1.
Now, I want the values of columns ABC and XYZ from df2 corresponding to this each unique combination. But I don't want to store these values in df1. Just need these values for some computation purpose and then I want to store the result in df2 only by creating a new column.
To summarize, let's say I want to do some computation using df1.Dist[3], for which I also need values from columns df2.ABC and df2.XYZ. So first find the row index in df2 where
S_Day = OpDay[3],
R_ID = Rid[3],
T_ID = Tid[3] and
S_ID = Sid[3]
(in this case it's row #1),
so use df2.ABC[1] and df2.XYZ[1] and store results in df2.RESULT[1].
So now df2 will look something like:
S_Day R_ID T_ID S_ID ABC XYZ RESULT
0 18Sep 1 1 1 100 60 NaN
1 18Sep 1 5 4 125 100 some computed value
Basically I guess I need a lookup kind of a function but don't know how to proceed further.
Please help as I am new to the world of python and programming. Many thanks in advance.
You can use .loc and Boolean indices to do what you want. Let's say that you're after the ith row of df1:
i = 3
Next, you can use Boolean indexing to find the corresponding rows in df2:
bool_index = (
    (df1.loc[i, 'OpDay'] == df2.loc[:, 'S_Day'])
    & (df1.loc[i, 'Rid'] == df2.loc[:, 'R_ID'])
    & (df1.loc[i, 'Tid'] == df2.loc[:, 'T_ID'])
    & (df1.loc[i, 'Sid'] == df2.loc[:, 'S_ID'])
)
You might want to include a check to verify that you found one and only one combination:
sum(bool_index) == 1
And finally, you can use the boolean index to call the right values from df2:
ABC_for_computation = df2.loc[bool_index, 'ABC']
XYZ_for_computation = df2.loc[bool_index, 'XYZ']
Note that I'm not too sure about the speed of this operation if you have large datasets. In my experience, if speed is affected you should switch to numpy arrays instead of dataframes, particularly when writing data into your dataframe.
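If the lookup is needed for every row of df1 rather than a single i, a merge on the four key columns may be simpler than a loop. The sketch below uses the sample frames from the question. The computation `Dist + ABC - XYZ`, summed per combination, is a placeholder, since the question does not say what the real computation is, and `partial` is a hypothetical helper column.

```python
import pandas as pd

df1 = pd.DataFrame({
    "OpDay": ["18Sep"] * 5,
    "Rid": [1, 1, 1, 1, 1],
    "Tid": [1, 1, 1, 5, 5],
    "Sid": [1, 1, 1, 4, 4],
    "Dist": [10, 15, 20, 5, 50],
})
df2 = pd.DataFrame({
    "S_Day": ["18Sep", "18Sep"],
    "R_ID": [1, 1],
    "T_ID": [1, 5],
    "S_ID": [1, 4],
    "ABC": [100, 125],
    "XYZ": [60, 100],
})

keys1 = ["OpDay", "Rid", "Tid", "Sid"]
keys2 = ["S_Day", "R_ID", "T_ID", "S_ID"]

# Temporary frame: each df1 row with its matching ABC and XYZ; df1 is untouched.
merged = df1.merge(df2, left_on=keys1, right_on=keys2, how="left")

# Placeholder computation (assumption): combine Dist with ABC and XYZ per row.
merged["partial"] = merged["Dist"] + merged["ABC"] - merged["XYZ"]

# Aggregate per key combination and attach the result to df2 as RESULT.
result = merged.groupby(keys2)["partial"].sum().rename("RESULT").reset_index()
df2 = df2.merge(result, on=keys2, how="left")

print(df2)
```

Only df2 gains a new column; the ABC and XYZ values live in the temporary `merged` frame.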