Trying to create a new dataframe by first splitting the original one in two:
df1 - contains only the rows of the original frame whose selected column has values from a given list
df2 - contains only the rows of the original whose selected column has other values, with those values then changed to a new given value.
Return the new dataframe as the concatenation of df1 and df2.
This works fine:
l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})
print(df)
  cat  val
0   a    1
1   b    2
2   c    3
3   d    4
4   a    5
5   b    6
df['cat'] = df['cat'].apply(lambda x: 'other')
print(df)
     cat  val
0  other    1
1  other    2
2  other    3
3  other    4
4  other    5
5  other    6
Yet when I define a function:
def create_df(df, select, vals, other):
    df1 = df.loc[df[select].isin(vals)]
    df2 = df.loc[~df[select].isin(vals)]
    df2[select] = df2[select].apply(lambda x: other)
    result = pd.concat([df1, df2])
    return result
and call it:
df3 = create_df(df, 'cat', ['a','b'], 'xxx')
print(df3)
Which results in what I actually need:
  cat  val
0   a    1
1   b    2
4   a    5
5   b    6
2  xxx    3
3  xxx    4
And for some reason in this case I get a warning:
..\usr\conda\lib\site-packages\ipykernel\__main__.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
So how is this case (assigning a value to a column inside a function) different from the first one, where I assign the value outside a function?
What is the right way to change a column's values?
Well, there are many ways this code could be optimized, I guess, but the difference is that inside the function df2 = df.loc[...] is a slice of df, so assigning to one of its columns may write into a copy (hence the warning), whereas your first example assigns to a column of the original df directly. For it to work, simply take copies of the slices and concat those:
def create_df(df, select, vals, other):
    df1 = df[df[select].isin(vals)].copy()   # boolean index, then copy
    df2 = df[~df[select].isin(vals)].copy()  # boolean index, then copy
    df2[select] = other  # plain assignment is sufficient, no apply needed
    result = pd.concat([df1, df2])
    return result
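Calling it exactly as before should now return the same frame without tripping the warning:
df3 = create_df(df, 'cat', ['a','b'], 'xxx')
print(df3)  # same output as above, no SettingWithCopyWarning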
Alternative version:
l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})
# define a mask
mask = df['cat'].isin(list("ab"))
# concatenate matching and non-matching rows (note: invert with ~, not -)
df2 = pd.concat([df[mask], df[~mask]])
# change the non-matching values to 'xxx'
df2.loc[~mask, ["cat"]] = "xxx"
Outputs:
  cat  val
0   a    1
1   b    2
4   a    5
5   b    6
2  xxx    3
3  xxx    4
Or as a function:
def create_df(df, filter_, isin_, value):
    # define a mask
    mask = df[filter_].isin(isin_)
    # concatenate matching and non-matching rows
    df = pd.concat([df[mask], df[~mask]])
    # change the non-matching values to `value`
    df.loc[~mask, [filter_]] = value
    return df
df2 = create_df(df, 'cat', ['a','b'], 'xxx')
df2
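With the sample df this call should print the same frame as the non-function version above:
  cat  val
0   a    1
1   b    2
4   a    5
5   b    6
2  xxx    3
3  xxx    4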
How can I fill the '0' values in df1 with unique values from another dataframe (df2)? The expected output has no duplicates in df1. Any reference links for this? Thanks for helping out.
data1 = {'test' :['b',0,'e',0,0,'f']}
df1 = pd.DataFrame(data=data1)
data2 = {'test' :['a','b','c','d','e','f']}
df2 = pd.DataFrame(data=data2)
df1
  test
0    b
1    0
2    e
3    0
4    0
5    f
df2
  test
0    a
1    b
2    c
3    d
4    e
5    f
expected output:
  test
0    b  -- df1
1    a  -- fill with a from df2
2    e  -- df1
3    c  -- fill with c from df2
4    d  -- fill with d from df2
5    f  -- df1
Assuming you have enough unique values in df2 to fill the 0s in df1, extract those unique values, and assign them with boolean indexing:
# which rows are 0?
m = df1['test'].eq(0)
# extract values of df2 that are not in df1
vals = df2.loc[~df2['test'].isin(df1['test']), 'test'].tolist()
# ['a', 'c', 'd']
# assign the values in the limit of the needed number
df1.loc[m, 'test'] = vals[:m.sum()]
print(df1)
Output:
  test
0    b
1    a
2    e
3    c
4    d
5    f
If there are not always enough values in df2 and you want to fill only the first possible 0s:
m = df1['test'].eq(0)
vals = df2.loc[~df2['test'].isin(df1['test']), 'test'].unique()
# ['a', 'c', 'd']
m2 = m.cumsum().le(len(vals))
df1.loc[m&m2, 'test'] = vals[:m.sum()]
print(df1)
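Note that with this particular sample there are exactly three 0s and three spare values ('a', 'c', 'd'), so the guarded version returns the same output as above; the m2 mask only kicks in when df2 runs short.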
Solution assumptions:
- the number of '0's in df1 equals the number of unique values in df2 that are not already in df1
- there is a column like 'test' to be manipulated
# get the unique values in df1
uni = df1['test'].unique()
# get the unique values in df2 which are not in df1
unique_df2 = df2[~df2['test'].isin(uni)]
# get the index of all the '0's in df1 as a list
index_df1_0 = df1.index[df1['test'] == 0].tolist()
# replace the '0's in df1 with unique values from df2 (assumption #1 is important!)
for val_ in range(len(index_df1_0)):
    df1.iloc[index_df1_0[val_]] = unique_df2.iloc[val_]
print(df1)
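For what it's worth, a slightly more explicit variant of the same loop (my rewording, not the answerer's) assigns scalar values by label instead of whole rows by position:
for pos, idx in enumerate(index_df1_0):
    # write the pos-th spare value from df2 at label idx of df1
    df1.loc[idx, 'test'] = unique_df2['test'].iloc[pos]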
I have 3 data frames:
df1
id,k,a,b,c
1,2,1,5,1
2,3,0,1,0
3,6,1,1,0
4,1,0,5,0
5,1,1,5,0
df2
name,a,b,c
p,4,6,8
q,1,2,3
df3
type,w_ave,vac,yak
n,3,5,6
v,2,1,4
From the multiplication, using pandas and numpy, I want the output in df1 to be:
id,k,a,b,c,w_ave,vac,yak
1,2,1,5,1,16,15,18
2,3,0,1,0,0,3,6
3,6,1,1,0,5,4,7
4,1,0,5,0,0,11,14
5,1,1,5,0,13,12,15
The conditions are:
The value of the new column will be (this is not real code):
df1["w_ave"][1] = df3["w_ave"]["v"]+ df1["a"][1]*df2["a"]["q"]+df1["b"][1]*df2["b"]["q"]+df1["c"][1]*df2["c"]["q"]
for output["w_ave"][1] = 2 + (1*1) + (5*2) + (1*3), where:
df3["w_ave"]["v"] = 2
df1["a"][1] = 1, df2["a"]["q"] = 1;
df1["b"][1] = 5, df2["b"]["q"] = 2;
df1["c"][1] = 1, df2["c"]["q"] = 3
Which means:
- a new column will be added to df1, named after the corresponding column of df3.
- for each row of df1, the values of a, b, c will be multiplied by the same-named values in df2's row q, and summed together with the corresponding value from df3.
- only the columns of df1 whose names match a column of df2 are multiplied; non-matching columns, like df1["k"], are not.
- however, if there is a 0 in df1["a"], the corresponding output will be zero.
I am struggling with this; it was tough even to explain. My attempt below is very naive and I know it will not work, but I have added it anyway:
import pandas as pd, numpy as np

data1 = "Sample_data1.csv"
data2 = "Sample_data2.csv"
data3 = "Sample_data3.csv"
folder = '~Sample_data/'
df1 = pd.read_csv(folder + data1)
df2 = pd.read_csv(folder + data2)
df3 = pd.read_csv(folder + data3)
df1 = df2 * df1
OK, so this will not fully match your desired output (it ignores the zero rule), but vectorizing the formula you provided:
df2 = df2.set_index("name")
df3 = df3.set_index("type")
df1["w_ave"] = (df3.loc["v", "w_ave"]
                + df1["a"].mul(df2.loc["q", "a"])
                + df1["b"].mul(df2.loc["q", "b"])
                + df1["c"].mul(df2.loc["q", "c"]))
Outputs:
   id  k  a  b  c  w_ave
0   1  2  1  5  1     16
1   2  3  0  1  0      4
2   3  6  1  1  0      5
3   4  1  0  5  0     12
4   5  1  1  5  0     13
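If I read the remaining conditions right, the same pattern extends to the other df3 columns, with the zero rule applied afterwards (judging by the desired output, it only zeroes w_ave). A sketch under those assumptions, reusing the indexed df2/df3 from above:
# element-wise products with df2's row 'q' align on the column names a, b, c
base = (df1[['a', 'b', 'c']] * df2.loc['q']).sum(axis=1)
# one new column per df3 column, offset by df3's row 'v'
for col in df3.columns:  # w_ave, vac, yak
    df1[col] = df3.loc['v', col] + base
# the asker's zero rule: rows where a == 0 get w_ave = 0
df1.loc[df1['a'].eq(0), 'w_ave'] = 0
This reproduces the desired table, including the 0/3/6 and 0/11/14 rows.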
Hello friends,
I would like to iterate through all the numeric columns in the df (in a generic way). For each unique df["Type"] group in each numeric column, replace all values that are greater than that column's mean + 2 standard deviations with NaN.
df = pd.DataFrame()
df['PRODUCT'] = ['A','B','C','A','B','C']
df['Test1'] = [7,1,2,5,1,90]
df['Test2'] = [99,10,13,12,11,87]
df['Type'] = ['Y','X','X','Y','Y','X']
Sample df:
PRODUCT  Test1  Test2  Type
A            7     99     Y
B            1     10     X
C            2     13     X
A            5     12     Y
B            1     11     Y
C           90     87     X
Expected output:
PRODUCT  Test1  Test2  Type
A            7    nan     Y
B            1     10     X
C            2     13     X
A            5     12     Y
B            1     11     Y
C          nan    nan     X
Logically, it can go like this:
import numpy as np

test_cols = ['Test1', 'Test2']
# calculate mean and std per Type group
groups = df.groupby('Type')
test_mean = groups[test_cols].transform('mean')
test_std = groups[test_cols].transform('std')
# threshold: mean + 2 standard deviations
thresh = test_mean + 2 * test_std
# NaN out everything above the threshold
df[test_cols] = np.where(df[test_cols] > thresh, np.nan, df[test_cols])
However, from your sample data set, thresh is:
        Test1       Test2
0   10.443434  141.707912
1  133.195890  123.898159
2  133.195890  123.898159
3   10.443434  141.707912
4   10.443434  141.707912
5  133.195890  123.898159
So, it wouldn't change anything.
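Incidentally, a 1-standard-deviation threshold would reproduce your expected output on this sample (90 and 87 exceed the X-group cutoffs of roughly 82.1 and 80.3, and 99 exceeds the Y-group Test2 cutoff of roughly 91.2):
thresh = test_mean + 1 * test_std
df[test_cols] = np.where(df[test_cols] > thresh, np.nan, df[test_cols])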
You can get this through a groupby and transform:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['Product'] = ['A', 'B', 'C', 'A', 'B', 'C']
df['Test1']=[7,1,2,5,1,90]
df['Test2']=[99,10,13,12,11,87]
df['Type']=['Y','X','X','Y','Y','X']
df = df.set_index('Product')
def nan_out_values(type_df):
    type_df[type_df > type_df.mean() + 2*type_df.std()] = np.nan
    return type_df
df[['Test1', 'Test2']] = df.groupby('Type').transform(nan_out_values)
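The in-place mutation inside the transformed function works, but a non-mutating variant of the same idea (my phrasing) reads more cleanly with Series.where, which keeps values where the condition holds and fills NaN elsewhere:
def nan_out_values(s):
    # keep values at or below mean + 2*std, NaN out the rest
    return s.where(s <= s.mean() + 2 * s.std())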
I'm trying to subset a pandas df by removing rows that fall between specific values. The problem is that these values can be at different rows, so I can't select fixed row positions.
Specifically, I want to remove the rows that fall between ABC xxx and the integer 5. These values could fall anywhere in the df, and the spans can be of unequal length.
Note: the string ABC will be followed by different values.
I thought about returning all the indexes that contain these two values, but would a mask work better if I could return all rows between these two values?
df = pd.DataFrame({
    'Val' : ['None','ABC','None',1,2,3,4,5,'X',1,2,'ABC',1,4,5,'Y',1,2],
})

mask = (df['Val'].str.contains(r'ABC(?!$)')) & (df['Val'] == 5)
Intended Output:
     Val
0   None
8      X
9      1
10     2
15     Y
16     1
17     2
If ABC always comes before 5 and they always occur in pairs (ABC, 5): get the positions of both values with np.where, zip them, build the index ranges in between, and filter those out with isin, inverting the mask with ~:
# two (ABC, 5) pairs in the data
df = pd.DataFrame({
    'Val' : ['None','ABC','None',1,2,3,4,5,'X',1,2,
             'None','ABC','None',1,2,3,4,5,'X',1,2]
})
m1 = np.where(df['Val'].str.contains(r'ABC', na=False))[0]
m2 = np.where(df['Val'] == 5)[0]
print (m1)
[ 1 12]
print (m2)
[ 7 18]
idx = [x for y, z in zip(m1, m2) for x in range(y, z + 1)]
print (df[~df.index.isin(idx)])
     Val
0   None
8      X
9      1
10     2
11  None
19     X
20     1
21     2
a = df.index[df['Val'].str.contains('ABC')==True][0]
b = df.index[df['Val']==5][0]+1
c = np.array(range (a,b))
bad_df = df.index.isin(c)
df[~bad_df]
Output:
     Val
0   None
8      X
9      1
10     2
If there is more than one 'ABC' and 5, then use the version below. With this you get the rows outside the span from the first ABC to the last 5:
a = (df['Val'].str.contains('ABC')==True).idxmax()
b = df['Val'].where(df['Val']==5).last_valid_index()+1
c = np.array(range(a, b))
bad_df = df.index.isin(c)
df[~bad_df]
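Tracing this against the question's df: a is 1 (the first ABC), the last 5 sits at index 14, so indexes 1 through 14 are dropped, including the X, 1, 2 block between the two pairs, leaving:
     Val
0   None
15     Y
16     1
17     2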
df = pd.DataFrame(np.arange(4*3).reshape(4,3), index=[['a','a','b','b'],[1,2,1,2]], columns=list('xyz'))
where df looks like:
     x   y   z
a 1  0   1   2
  2  3   4   5
b 1  6   7   8
  2  9  10  11
Now I add a new row by:
df.loc['new',:] = [0,0,0]
Then df becomes:
       x   y   z
a 1    0   1   2
  2    3   4   5
b 1    6   7   8
  2    9  10  11
new    0   0   0
Now I want to do the same but with a different df that has non-unique multi-index:
df = pd.DataFrame(np.arange(4*3).reshape(4,3), index=[['a','a','b','b'],[1,1,2,2]], columns=list('xyz'))
which looks like:
     x   y   z
a 1  0   1   2
  1  3   4   5
b 2  6   7   8
  2  9  10  11
and call
df.loc['new',:]=[0,0,0]
The result is "Exception: cannot handle a non-unique multi-index!"
How could I achieve the goal?
Use append or concat with a helper DataFrame:
df1 = pd.DataFrame([[0,0,0]],
                   columns=df.columns,
                   index=pd.MultiIndex.from_arrays([['new'], ['']]))

df2 = df.append(df1)        # append is deprecated (removed in pandas 2.0)...
df2 = pd.concat([df, df1])  # ...so prefer concat
print (df2)
       x   y   z
a 1    0   1   2
  1    3   4   5
b 2    6   7   8
  2    9  10  11
new    0   0   0
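The same helper-frame trick extends to several rows at once; a sketch, with the 'new2' label made up for illustration:
rows = pd.DataFrame([[0, 0, 0], [1, 1, 1]],
                    columns=df.columns,
                    index=pd.MultiIndex.from_arrays([['new', 'new2'], ['', '']]))
df2 = pd.concat([df, rows])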