Merging Shuffled DF with new output DF - Pandas

I'm trying to merge two data frames: DF1, which is shuffled, and DF2, the predicted outputs derived from it. I was able to merge them together; however, the result is always the unshuffled DF1 merged with DF2, which gives a row mismatch in DF3.
df1 = shuffle(pd.read_csv("C:/Users/.."))
filename = 'C:/Users/..'
loaded_model = pickle.load(open(filename, 'rb'))
df2 = loaded_model.predict(df1)
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
Here's an example of the problem:
df1 (original)   df1 (shuffled)
0                3
1                2
2                1
3                0
df2
0
0
0
1
expected df3 = (shuffled df1) + df2:
df1 df2
3 0
2 0
1 0
0 1
However, I'm getting:
df1 df2
0 0
1 0
2 0
3 1
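For what it's worth, this mismatch usually happens because sklearn's shuffle keeps the original index labels while the predictions carry a fresh 0..n-1 positional index, so merging on the index re-pairs rows by the old labels. A minimal sketch of the cause and one possible fix, using DataFrame.sample as a stand-in for shuffle and a hand-made predictions frame:

```python
import pandas as pd

df1 = pd.DataFrame({'df1': [0, 1, 2, 3]})
shuffled = df1.sample(frac=1, random_state=1)  # like shuffle(): rows move, index labels don't
preds = pd.DataFrame({'df2': [0, 0, 0, 1]})    # positional predictions, fresh 0..3 index

# Merging shuffled with preds on the index would re-pair rows by the OLD labels.
# Dropping the leftover index first keeps the shuffled order and aligns positionally:
df3 = pd.concat([shuffled.reset_index(drop=True), preds], axis=1)
```

Equivalently, calling reset_index(drop=True) right after shuffle() in the original code should make the subsequent pd.merge line up row for row.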
Thanks in advance for the time and effort!

Related

Create dataframe based on conditionals from other dataframes AND past (t-1) values

I would like to create a new dataframe depending on the values of two other dataframes (same shape), while also comparing the t and t-1 (immediately preceding) values of one of them.
I have two dataframes: P and PR. The logic for the third one would be: if (P_t >= 30.6 and PR_t < PR_t-1) then 1, else 0. So for example:
P = pd.DataFrame({'col1': [26.7,24.7,26.1,26.4,24.5], 'col2': [30.8,30.8,30.7,30.8,29.8], 'col3': [30.8,30.7,30.5,30.6,30.0]})
PR = pd.DataFrame({'col1': [79.8,73.6,81.1,79.4,75.7], 'col2': [74.1,74.1,77.0,74.7,74.1], 'col3': [74.0,74.0,76.4,74.3,74.8]})
Would give me a resulting dataframe like:
pd.DataFrame({'col1': [0,0,0,0,0], 'col2': [0,1,0,1,0], 'col3': [0,1,0,0,0]})
Any help is much appreciated!
You can use:
(P.ge(30.6) & PR.diff().lt(0)).astype(int)
Output:
col1 col2 col3
0 0 0 0
1 0 0 0
2 0 0 0
3 0 1 1
4 0 0 0
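A self-contained version of that one-liner, built from the frames in the question:

```python
import pandas as pd

P = pd.DataFrame({'col1': [26.7, 24.7, 26.1, 26.4, 24.5],
                  'col2': [30.8, 30.8, 30.7, 30.8, 29.8],
                  'col3': [30.8, 30.7, 30.5, 30.6, 30.0]})
PR = pd.DataFrame({'col1': [79.8, 73.6, 81.1, 79.4, 75.7],
                   'col2': [74.1, 74.1, 77.0, 74.7, 74.1],
                   'col3': [74.0, 74.0, 76.4, 74.3, 74.8]})

# P_t >= 30.6 AND PR_t < PR_{t-1}, element-wise per column;
# diff() gives NaN in the first row, and NaN < 0 is False, so row 0 is always 0
res = (P.ge(30.6) & PR.diff().lt(0)).astype(int)
```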

How to update at a specific row, after finding the same value in two tables

import pandas as pd
data1 = [['xx'], ['4']]
data2 = [['4', 'x0'], ['aa', 'bb'], ['cc', 'dd']]
df1 = pd.DataFrame(data=data1, columns=["isin"])
print(df1)
df2 = pd.DataFrame(data=data2, columns=["isin", "data"])
print(df2)
df1.loc[df1['isin'] == df2['isin'], 'data'] = df2['data']
print (df1)
# Exception has occurred: ValueError
# Can only compare identically-labeled Series objects
# df1.loc[df1['isin'] == df2['isin'], 'data'] = df2['data']
# THIS IS IT NOW
# df1:
# isin
# 0 xx
# 1 4
# df2:
# isin data
# 0 4 x0
# 1 aa bb
# 2 cc dd
Problem:
the algorithm should find the row with '4' in column 'isin' in both dataframes,
pull 'data' from df2 at that row (in this case 'x0'),
and add it to df1, at the row of '4', in a new column 'data'.
# df3:
# isin data
# 0 xx NaN
# 1 4 x0
I agree with tranbi that the question needs more clarity. 100 is not in df1 anywhere. But, if you want to update just one cell in the dataframe, assuming we have this:
years isin toast
0 55 55 55
1 55 55 55
2 this information 4 55
then
df2.loc[df2['years'] == 'this information', ['years']] = 'that information'
will update just that cell. You could use df.loc instead of 'this information' to find the value in df1. I couldn't do it here because that value doesn't exist in the example you gave, so I'm not quite sure that's what you are referring to.
import pandas as pd
data1 = [['xx'], ['4']]
data2 = [['4', 'x0'], ['aa', 'bb'], ['cc', 'dd']]
df1 = pd.DataFrame(data=data1, columns=["isin"])
print(df1)
df2 = pd.DataFrame(data=data2, columns=["isin", "data"])
print(df2)
# merge was the solution I was looking for
df3 = df1.merge(df2, how='left')
print (df3)
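As an aside, when only a single column needs to be pulled across, Series.map with a lookup built from df2 is a common alternative to the merge; a minimal sketch on the same data:

```python
import pandas as pd

df1 = pd.DataFrame({'isin': ['xx', '4']})
df2 = pd.DataFrame({'isin': ['4', 'aa', 'cc'], 'data': ['x0', 'bb', 'dd']})

# build an isin -> data lookup from df2; rows of df1 with no match get NaN
df1['data'] = df1['isin'].map(df2.set_index('isin')['data'])
```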

How to make pandas work for cross multiplication

I have 3 data frame:
df1
id,k,a,b,c
1,2,1,5,1
2,3,0,1,0
3,6,1,1,0
4,1,0,5,0
5,1,1,5,0
df2
name,a,b,c
p,4,6,8
q,1,2,3
df3
type,w_ave,vac,yak
n,3,5,6
v,2,1,4
From the multiplication, using pandas and numpy, I want the output in df1 to be:
id,k,a,b,c,w_ave,vac,yak
1,2,1,5,1,16,15,18
2,3,0,1,0,0,3,6
3,6,1,1,0,5,4,7
4,1,0,5,0,0,11,14
5,1,1,5,0,13,12,15
The conditions are:
The value of the new column will be:
# this is not real code, just the idea
df1["w_ave"][1] = df3["w_ave"]["v"]+ df1["a"][1]*df2["a"]["q"]+df1["b"][1]*df2["b"]["q"]+df1["c"][1]*df2["c"]["q"]
for output["w_ave"][1]= 2 +(1*1)+(5*2)+(1*3)
df3["w_ave"]["v"]=2
df1["a"][1]=1, df2["a"]["q"]=1 ;
df1["b"][1]=5, df2["b"]["q"]=2 ;
df1["c"][1]=1, df2["c"]["q"]=3 ;
Which means:
- a new column will be added to df1, named after each column of df3.
- for each row of df1, the values of a, b, c will be multiplied by the same-named values from df2's row "q", and summed together with the corresponding value from df3.
- only the columns of df1 whose names match columns of df2 are multiplied; unmatched columns, like df1["k"], are not.
- however, if there is a 0 in df1["a"], the corresponding output will be zero.
I am struggling with this; it was tough to explain, too. My attempt is very naive and I know it will not work, but here is what I have added:
import pandas as pd, numpy as np
data1 = "Sample_data1.csv"
data2 = "Sample_data2.csv"
data3 = "Sample_data3.csv"
folder = '~Sample_data/'
df1 =pd.read_csv(folder + data1)
df2 =pd.read_csv(folder + data2)
df3 =pd.read_csv(folder + data3)
df1= df2 * df1
Ok, so this will in no way resemble your desired output, but vectorizing the formula you provided:
df2=df2.set_index("name")
df3=df3.set_index("type")
df1["w_ave"] = df3.loc["v", "w_ave"]+ df1["a"].mul(df2.loc["q", "a"])+df1["b"].mul(df2.loc["q", "b"])+df1["c"].mul(df2.loc["q", "c"])
Outputs:
id k a b c w_ave
0 1 2 1 5 1 16
1 2 3 0 1 0 4
2 3 6 1 1 0 5
3 4 1 0 5 0 12
4 5 1 1 5 0 13
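Extending that vectorized formula to all three df3 columns, and adding the zero rule described in the question (assumed here, based on the desired output, to blank only w_ave when df1["a"] is 0), a sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'k': [2, 3, 6, 1, 1],
                    'a': [1, 0, 1, 0, 1], 'b': [5, 1, 1, 5, 5], 'c': [1, 0, 0, 0, 0]})
df2 = pd.DataFrame({'name': ['p', 'q'], 'a': [4, 1],
                    'b': [6, 2], 'c': [8, 3]}).set_index('name')
df3 = pd.DataFrame({'type': ['n', 'v'], 'w_ave': [3, 2],
                    'vac': [5, 1], 'yak': [6, 4]}).set_index('type')

# row-wise dot product of df1's a, b, c with df2's row "q"
prods = df1[['a', 'b', 'c']].mul(df2.loc['q']).sum(axis=1)

# each new column = that dot product plus the matching value from df3's row "v"
for col in df3.columns:
    df1[col] = prods + df3.loc['v', col]

# zero rule from the question, applied to w_ave only (matches the desired output)
df1.loc[df1['a'].eq(0), 'w_ave'] = 0
```

This reproduces the desired output table from the question exactly.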

How to add a new row to pandas dataframe with non-unique multi-index

df = pd.DataFrame(np.arange(4*3).reshape(4,3), index=[['a','a','b','b'],[1,2,1,2]], columns=list('xyz'))
where df looks like:
     x   y   z
a 1  0   1   2
  2  3   4   5
b 1  6   7   8
  2  9  10  11
Now I add a new row by:
df.loc['new',:]=[0,0,0]
Then df becomes:
       x     y     z
a 1    0.0   1.0   2.0
  2    3.0   4.0   5.0
b 1    6.0   7.0   8.0
  2    9.0  10.0  11.0
new    0.0   0.0   0.0
Now I want to do the same but with a different df that has non-unique multi-index:
df = pd.DataFrame(np.arange(4*3).reshape(4,3), index=[['a','a','b','b'],[1,1,2,2]], columns=list('xyz'))
which looks like:
     x   y   z
a 1  0   1   2
  1  3   4   5
b 2  6   7   8
  2  9  10  11
and call
df.loc['new',:]=[0,0,0]
The result is "Exception: cannot handle a non-unique multi-index!"
How could I achieve the goal?
Use concat (or, in pandas versions before 2.0, append) with a helper DataFrame whose index has the same number of levels:
df1 = pd.DataFrame([[0, 0, 0]],
                   columns=df.columns,
                   index=pd.MultiIndex.from_arrays([['new'], ['']]))
df2 = pd.concat([df, df1])  # df.append(df1) also worked before append was removed in pandas 2.0
print (df2)
x y z
a 1 0 1 2
1 3 4 5
b 2 6 7 8
2 9 10 11
new 0 0 0

How to expand one row to multiple rows according to its value in Pandas

This is a DataFrame I have, for example:
Before:
   Code  Price  Bacon  Onion  Tomato  Cheese
1  2134     20      1      1       1       0
2  1010      5      1      0       0       0
3  3457     15      0      1       1       0
d = {1: ['2134',20, 1,1,1,0], 2: ['1010',5, 1,0,0,0], 3: ['3457',15, 0,1,1,0]}
columns=['Code', 'Price', 'Bacon','Onion','Tomato', 'Cheese']
df = pd.DataFrame.from_dict(data=d, orient='index').sort_index()
df.columns = columns
What I want to do is expand a single row into multiple rows, so that the DataFrame looks like the one below. The intention is to use some of the columns (from 'Bacon' to 'Cheese') as categories.
After:
   Code  Price  Bacon  Onion  Tomato  Cheese
0  2134     20      1      0       0       0
1  2134     20      0      1       0       0
2  2134     20      0      0       1       0
3  1010      5      1      0       0       0
4  3457     15      0      1       0       0
5  3457     15      0      0       1       0
I tried to find the answer myself but failed. Thanks.
You can first reshape with set_index and stack, then filter with query, build indicator columns from level_2 with get_dummies, and finally reindex the columns (to add back any that had no 1s) and reset_index:
df = df.set_index(['Code', 'Price']) \
.stack() \
.reset_index(level=2, name='val') \
.query('val == 1') \
.level_2.str.get_dummies() \
.reindex(columns=df.columns[2:], fill_value=0) \
.reset_index()
print (df)
Code Price Bacon Onion Tomato Cheese
0 2134 20 1 0 0 0
1 2134 20 0 1 0 0
2 2134 20 0 0 1 0
3 1010 5 1 0 0 0
4 3457 15 0 1 0 0
5 3457 15 0 0 1 0
You can use stack and transpose to do this operation and format accordingly.
df = df.stack().to_frame().T
df.columns = ['{}_{}'.format(*c) for c in df.columns]
Use pd.melt to put all the foods in one column and then pd.get_dummies to expand them back into columns.
df1 = pd.melt(df, id_vars=['Code', 'Price'])
df1 = df1[df1['value'] == 1]
df1 = pd.get_dummies(df1, columns=['variable'], prefix='', prefix_sep='').sort_values(['Code', 'Price'])
df1 = df1.reindex(columns=df.columns, fill_value=0)  # reindex returns a new frame, so assign the result
Edited after I saw how jezrael used reindex to both add and drop a column.
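For reference, the melt route runs end-to-end like this; a final astype(int) is added because recent pandas versions return boolean dummy columns (an assumption about the pandas version in use):

```python
import pandas as pd

d = {1: ['2134', 20, 1, 1, 1, 0], 2: ['1010', 5, 1, 0, 0, 0], 3: ['3457', 15, 0, 1, 1, 0]}
df = pd.DataFrame.from_dict(data=d, orient='index').sort_index()
df.columns = ['Code', 'Price', 'Bacon', 'Onion', 'Tomato', 'Cheese']

foods = ['Bacon', 'Onion', 'Tomato', 'Cheese']
df1 = pd.melt(df, id_vars=['Code', 'Price'])          # one row per (Code, Price, food)
df1 = df1[df1['value'] == 1]                          # keep only the toppings that are present
df1 = pd.get_dummies(df1, columns=['variable'],
                     prefix='', prefix_sep='').sort_values(['Code', 'Price'])
df1 = df1.reindex(columns=df.columns, fill_value=0)   # drops 'value', adds the all-zero 'Cheese'
df1[foods] = df1[foods].astype(int)
```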