Pandas: dividing a filtered column from df1 by a filtered column of df2 gives a warning and weird behavior

I have a data frame which is conditionally broken up into two separate dataframes as follows:
import pandas as pd

df = pd.read_csv(file, names=names)        # read with explicit column names
df = df.loc[df['name1'] == common_val]     # keep only rows sharing common_val
df1 = df.loc[df['name2'] == target1]
df2 = df.loc[df['name2'] == target2]
# each df has a 'name3' column I want to perform a division on after this filtering
The original df is filtered by a value shared by the two dataframes, and then each of the two new dataframes is further filtered on another shared column.
What I want to work:
df1['name3'] = df1['name3']/df2['name3']
However, as many questions have pointed out, this raises a SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I tried what was recommended in this question:
df1.loc[:,'name3'] = df1.loc[:,'name3'] / df2.loc[:,'name3']
# also tried:
df1.loc[:,'name3'] = df1.loc[:,'name3'] / df2['name3']
But in both cases I still get the weird behavior and the SettingWithCopyWarning.
I then tried what was recommended in this answer:
df.loc[df['name2']==target1, 'name3'] = df.loc[df['name2']==target1, 'name3']/df.loc[df['name2'] == target2, 'name3']
which still results in the same copy warning.
If possible I would like to avoid copying the data frame to get around this because of the size of these dataframes (and I'm already somewhat wastefully making two almost identical dfs from the original).
If copying is the best way to go with this problem I'm interested to hear why that works over all the options I explored above.
Edit: here is a simple data frame along the lines of what df would look like after the line df = df.loc[df['name1'] == common_val]:
name1 other1 other2 name2 name3
a x y 1 2
a x y 1 4
a x y 2 5
a x y 2 3
So if target1 = 1 and target2 = 2,
I would like df1 to contain only the rows where name2 == 1 and df2 to contain only the rows where name2 == 2, and then divide the resulting df1['name3'] by the resulting df2['name3'].
If there is a less convoluted way to do this (without splitting the original df) I'm open to that as well!
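For the sample data in the edit, here is a minimal sketch of one way the division could work; the .copy() calls and the .to_numpy() conversion are assumptions on my part (the two filtered slices have different row indexes, so dividing the Series directly would align on index and produce NaN), and it assumes both groups have the same number of rows:

import pandas as pd

# the sample frame from the edit above (other1/other2 omitted for brevity)
df = pd.DataFrame({'name1': ['a', 'a', 'a', 'a'],
                   'name2': [1, 1, 2, 2],
                   'name3': [2, 4, 5, 3]})

# .copy() makes the filtered frames independent, which avoids the warning
df1 = df.loc[df['name2'] == 1].copy()
df2 = df.loc[df['name2'] == 2].copy()

# divide position-wise on the underlying arrays instead of index-aligning
df1['name3'] = df1['name3'].to_numpy() / df2['name3'].to_numpy()
print(df1)
#   name1  name2     name3
# 0     a      1  0.400000
# 1     a      1  1.333333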

Related

Creating batches based on city in pandas

I have two different dataframes that I want to fuzzy match against each other to find and remove duplicates. To make the process faster and more accurate, I only want to fuzzy match records from both dataframes that are in the same cities. That makes it necessary to create batches based on city in the one dataframe and then run the fuzzy matcher between each batch and the subset of the other dataframe with matching cities. I can't find another post that does this and I am stuck. Here is what I have so far. Thanks!
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 2, 2, 3, 3],
                   'B': ['Q', 'Q', 'R', 'R', 'R', 'P', 'L', 'L'],
                   'origin': ['file1', 'file2', 'file3', 'file4', 'file5', 'file6', 'file7', 'file8']})
cols = ['B']
df1 = df[df.duplicated(subset=cols, keep=False)].copy()            # keep only rows whose B value repeats
df1 = df1.sort_values(cols)
df1['group'] = 'g' + (df1.groupby(cols).ngroup() + 1).astype(str)  # label each duplicate group g1, g2, ...
df1['duplicate_count'] = df1.groupby(cols)['origin'].transform('size')
df1_g1 = df1.loc[df1['group'] == 'g1']
print(df1_g1)
which will not factor in anything that isn't duplicated, so if a value appears only once it will be skipped, as is the case with 'P' in column B. It also requires me to go in and hard-code the group each time, which is not ideal. I haven't been able to figure out a for loop or any other method to solve this. Thanks!
You can assign the groups into locals():
variables = locals()
for i, j in df1.groupby('group'):
    variables["df1_{0}".format(i)] = j
df1_g1
Out[314]:
A B origin group duplicate_count
6 3 L file7 g1 2
7 3 L file8 g1 2
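A dictionary keyed by the group label is a common alternative that avoids injecting into locals() and is easy to loop over; a minimal sketch using the df1 built above (the batches name is just illustrative):

# collect each group into a dict instead of creating one variable per group
batches = {name: grp for name, grp in df1.groupby('group')}
print(batches['g1'])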

Vaex: apply changes to selection

Using Vaex, I would like to make a selection of rows, modify the values of some columns on that selection and get the changes applied on the original dataframe.
I can do a selection and make changes to that selection, but how can I get them ported to the original dataframe?
import pandas as pd
import vaex

df = vaex.from_pandas(pd.DataFrame({'a': [1, 2], 'b': [3, 4]}))
df_selected = df[df.a==1]
df_selected['b'] = df_selected.b * 0 + 5
df_selected
# a b
0 1 5
df
# a b
0 1 3
1 2 4
So far, the only solution that comes to my mind is to obtain two complementary selections, modify the one I am interested in, and then concatenate it with the other selection. Is there a more direct way of doing this?
You are probably looking for the where method.
I think it should be something like this:
df = vaex.from_pandas(pd.DataFrame({'a':[1,2], 'b':[3,4]}))
df['c'] = df.func.where(df.a==1, df.b * 0 + 5, df.a)
The where syntax is where(condition, value_if_true, value_if_false), i.e. where(if, then, else).
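If the goal is to keep the original b where the condition fails, the else argument would presumably be df.b rather than df.a; a minimal sketch under that assumption (still writing into a new column c, as in the answer):

import pandas as pd
import vaex

df = vaex.from_pandas(pd.DataFrame({'a': [1, 2], 'b': [3, 4]}))
# where a == 1 take 5, otherwise keep the original b
df['c'] = df.func.where(df.a == 1, 5, df.b)
print(df)
#  #    a    b    c
#  0    1    3    5
#  1    2    4    4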

pandas vertical concat not working as expected

I have 2 dataframes which I am trying to merge/stack vertically. First dataframe has 25 columns and second dataframe has 13 columns, out of which I only want to select 1.
When I execute the code, I get more records than expected.
I don't understand where the problem lies.
To understand this, I tried loading the data again in a fresh pandas dataframe.
input_df = pd.read_csv()
print(input_df.shape)
(8809, 11)
filtered_df = input_df[input_df['label'] != -1] # try to filter based on label column
print(filtered_df.shape)
(6603, 11)
But when I simply print the filtered_df, I can still see 8809 records.
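For what it's worth, a minimal reproduction of that filtering step on made-up data (a sketch, not the original CSV): boolean indexing returns a new frame with the reduced row count, while the original input_df keeps all of its rows.

import pandas as pd

# toy stand-in for the CSV (label values are made up)
input_df = pd.DataFrame({'label': [-1, 0, 1, -1, 2]})
print(input_df.shape)        # (5, 1)

filtered_df = input_df[input_df['label'] != -1]   # boolean filtering returns a new frame
print(filtered_df.shape)     # (3, 1)
print(input_df.shape)        # still (5, 1): the original frame is unchanged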

show observation that got lost in merge

Let's say I want to merge two different dataframes on a key made of two columns.
Dataframe One has 70000 obs of 10 variables.
Dataframe Two has 4500 obs of 5 variables.
Now I checked how many of my observations from my new dataframe are left by using this code.
So I realize that the rows from my dataframe Two are now only 4490 obs of 10 variables.
That's all right.
My question is:
Is there a way of getting back the 5 observations from my dataframe Two that I lost during the process? The names would be enough.
Thank you :)
I think you can use dplyr::anti_join for this. From its documentation:
return all rows from x where there are not matching values in y, keeping just columns from x.
You'd probably have to pass your data frame TWO as x.
EDIT: as mentioned in the comments, the syntax for its by argument is different.
Example:
df1 <- data.frame(Name=c("a", "b", "c"),
                  Date1=c(1, 2, 3),
                  stringsAsFactors=FALSE)
df2 <- data.frame(Name=c("a", "d"),
                  Date2=c(1, 2),
                  stringsAsFactors=FALSE)
> dplyr::anti_join(df2, df1, by=c("Name"="Name", "Date2"="Date1"))
  Name Date2
1    d     2

How to avoid temporary variables when creating new column via groupby.apply

I would like to create a new column newcol in a dataframe df as the result of
df.groupby('keycol').apply(somefunc)
The obvious:
df['newcol'] = df.groupby('keycol').apply(somefunc)
does not work: either df['newcol'] ends up containing all NaNs (which is certainly not what the RHS evaluates to), or some exception is raised (the details of the exception vary wildly depending on what somefunc returns).
I have tried many variations of the above, including stuff like
import pandas as pd
df['newcol'] = pd.Series(df.groupby('keycol').apply(somefunc), index=df.index)
They all fail.
The only thing that has worked requires defining an intermediate variable:
import pandas as pd
tmp = df.groupby('keycol').apply(lambda x: pd.Series(somefunc(x)))
tmp.index = df.index
df['newcol'] = tmp
Is there a way to achieve this without having to create an intermediate variable?
(The documentation for GroupBy.apply is almost content-free.)
Let's build up an example and I think I can illustrate why your first attempts are failing:
Example data:
import numpy as np
import pandas as pd
from numpy.random import randn

n = 25
df = pd.DataFrame({'expenditure': np.random.choice(['foo', 'bar'], n),
                   'groupid': np.random.choice(['one', 'two'], n),
                   'coef': randn(n)})
print(df.head(10))
results in:
coef expenditure groupid
0 0.874076 bar one
1 -0.972586 foo two
2 -0.003457 bar one
3 -0.893106 bar one
4 -0.387922 bar two
5 -0.109405 bar two
6 1.275657 foo two
7 -0.318801 foo two
8 -1.134889 bar two
9 1.812964 foo two
So if we apply a simple function, mean, to the grouped data we get the following:
df2 = df.groupby('groupid').apply(np.mean)
print(df2)
Which is:
coef
groupid
one -0.215539
two 0.149459
So the dataframe above is indexed by groupid and has one column, coef.
What you tried to do first was, effectively, the following:
df['newcol'] = df2
That gives all NaNs for newcol. Honestly I have no idea why it doesn't throw an error rather than producing anything at all, but the NaNs come from index alignment: df2 is indexed by groupid while df has a plain integer index, so nothing lines up. I think what you really want to do is merge df2 back into df.
To merge df and df2 we need to remove the index from df2, rename the new column, then merge:
df2 = df.groupby('groupid').apply(np.mean)
df2.reset_index(inplace=True)
df2.columns = ['groupid', 'newcol']
df.merge(df2)
which I think is what you were after.
This is such a common idiom that Pandas includes the transform method which wraps all this up into a much simpler syntax:
df['newcol'] = df.groupby('groupid')['coef'].transform('mean')
print(df.head())
results:
coef expenditure groupid newcol
0 1.705825 foo one -0.025112
1 -0.608750 bar one -0.025112
2 -1.215015 bar one -0.025112
3 -0.831478 foo two -0.073560
4 2.174040 bar one -0.025112
Better documentation is here.
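Translating the transform idiom back to the names in the question, something like the sketch below should give the one-liner being asked for; keycol, somefunc, and newcol are the question's placeholders, while 'valcol', the toy data, and the particular reducer are my assumptions:

import pandas as pd

df = pd.DataFrame({'keycol': ['a', 'a', 'b', 'b'],
                   'valcol': [1.0, 3.0, 10.0, 30.0]})

def somefunc(s):
    # any reducer works; transform broadcasts the per-group result back to every row
    return s.max() - s.min()

df['newcol'] = df.groupby('keycol')['valcol'].transform(somefunc)
print(df)
#   keycol  valcol  newcol
# 0      a     1.0     2.0
# 1      a     3.0     2.0
# 2      b    10.0    20.0
# 3      b    30.0    20.0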