Vaex: apply changes to selection - vaex

Using Vaex, I would like to make a selection of rows, modify the values of some columns on that selection and get the changes applied on the original dataframe.
I can do a selection and make changes to that selection, but how can I get them ported to the original dataframe?
import pandas as pd
import vaex
df = vaex.from_pandas(pd.DataFrame({'a':[1,2], 'b':[3,4]}))
df_selected = df[df.a==1]
df_selected['b'] = df_selected.b * 0 + 5
df_selected
# a b
0 1 5
df
# a b
0 1 3
1 2 4
So far, the only solution that comes to my mind is to obtain two complementary selections, modify the one I am interested in, and then concatenate it with the other selection. Is there a more direct way of doing this?

You are probably looking for the where method.
I think it should be something like this:
df = vaex.from_pandas(pd.DataFrame({'a':[1,2], 'b':[3,4]}))
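df['c'] = df.func.where(df.a==1, df.b * 0 + 5, df.b)
The where syntax is where(condition, value_if_true, value_if_false): rows where the condition holds take the second argument, and all other rows take the third. Passing df.b as the third argument leaves the unselected rows unchanged.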
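The three-argument order can be illustrated with NumPy's np.where, which takes (condition, then, else) in the same way. This is a stand-in sketch rather than Vaex code: the arrays a and b mirror the question's columns, and new_b is a name chosen only for illustration.

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

# Where a == 1 take 5, otherwise keep the existing value of b.
new_b = np.where(a == 1, 5, b)
print(new_b)
```

Only the row matching the condition changes; every other row keeps its original value.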

Related

Creating batches based on city in pandas

I have two different dataframes that I want to fuzzy match against each other to find and remove duplicates. To make the process faster/more accurate I want to only fuzzy match records from both dataframes in the same cities. So that makes it necessary to create batches based on cities in the one dataframe then running the fuzzy matcher between each batch and a subset of the other dataframe with like cities. I can't find another post that does this and I am stuck. Here is what I have so far. Thanks!
df = pd.DataFrame({'A':[1,1,2,2,2,2,3,3],'B':['Q','Q','R','R','R','P','L','L'],'origin':['file1','file2','file3','file4','file5','file6','file7','file8']})
cols = ['B']
df1 = df[df.duplicated(subset=cols,keep=False)].copy()
df1 = df1.sort_values(cols)
df1['group'] = 'g' + (df1.groupby(cols).ngroup() + 1).astype(str)
df1['duplicate_count'] = df1.groupby(cols)['origin'].transform('size')
df1_g1 = df1.loc[df1['group'] == 'g1']
print(df1_g1)
This does not factor in anything that isn't duplicated, so a value that appears only once is skipped, as is the case with 'P' in column B. It also requires me to hard-code the group each time, which is not ideal. I haven't been able to figure out a for loop or any other method to solve this. Thanks!
You can store each group as a variable through locals():
# Writing to locals() creates variables only at module level (e.g. in a notebook cell), not inside a function
variables = locals()
for i, j in df1.groupby('group'):
    variables["df1_{0}".format(i)] = j
df1_g1
Out[314]:
A B origin group duplicate_count
6 3 L file7 g1 2
7 3 L file8 g1 2
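A dictionary keyed by group is one way to avoid both the hard-coded group names and writing into locals(). The sketch below groups directly on column B, so values that appear only once (such as 'P') get their own batch; the name batches is an assumption made for illustration, and the loop body only marks where a fuzzy matcher could be called.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 2, 2, 2, 3, 3],
    'B': ['Q', 'Q', 'R', 'R', 'R', 'P', 'L', 'L'],
    'origin': ['file1', 'file2', 'file3', 'file4',
               'file5', 'file6', 'file7', 'file8'],
})

# One sub-DataFrame per distinct value of B, singletons included.
batches = {key: group for key, group in df.groupby('B')}

for key, group in batches.items():
    # Each batch could be handed to the fuzzy matcher here.
    print(key, len(group))
```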

Pandas dividing filtered column from df 1 by filtered column of df 2 warning and weird behavior

I have a data frame which is conditionally broken up into two separate dataframes as follows:
df = pd.read_csv(file, names)
df = df.loc[df['name1'] == common_val]
df1 = df.loc[df['name2'] == target1]
df2 = df.loc[df['name2'] == target2]
# each df has a 'name3' I want to perform a division on after this filtering
The original df is filtered by a value shared by the two dataframes, and then each of the two new dataframes is further filtered by another shared column.
What I want to work:
df1['name3'] = df1['name3']/df2['name3']
However, as many questions have pointed out, this causes a setting with copy warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I tried what was recommended in this question:
df1.loc[:,'name3'] = df1.loc[:,'name3'] / df2.loc[:,'name3']
# also tried:
df1.loc[:,'name3'] = df1.loc[:,'name3'] / df2['name3']
But in both cases I still get weird behavior and the set by copy warning.
I then tried what was recommended in this answer:
df.loc[df['name2']==target1, 'name3'] = df.loc[df['name2']==target1, 'name3']/df.loc[df['name2'] == target2, 'name3']
which still results in the same copy warning.
If possible I would like to avoid copying the data frame to get around this because of the size of these dataframes (and I'm already somewhat wastefully making two almost identical dfs from the original).
If copying is the best way to go with this problem I'm interested to hear why that works over all the options I explored above.
Edit: here is a simple data frame along the lines of what df would look like after the line df.loc[df['name1'] == common_val]
name1 other1 other2 name2 name3
a x y 1 2
a x y 1 4
a x y 2 5
a x y 2 3
So if target1=1 and target2=2,
I would like df1 to contain only rows where name2=1 and df2 to contain only rows where name2=2, then divide the resulting df1['name3'] by the resulting df2['name3'].
If there is a less convoluted way to do this (without splitting the original df) I'm open to that as well!
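One possible approach, sketched under the assumption that the two subsets have the same number of rows and should be paired by position. pandas aligns Series on index labels when dividing, and the two filtered subsets share no labels, which is a likely source of the unexpected results. Converting both sides to NumPy arrays drops the labels, and assigning through a boolean mask on the original df avoids creating intermediate copies. The sample values follow the question's example table; mask1, mask2 and ratio are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'name1': ['a', 'a', 'a', 'a'],
    'name2': [1, 1, 2, 2],
    'name3': [2.0, 4.0, 5.0, 3.0],
})

mask1 = df['name2'] == 1   # rows for target1
mask2 = df['name2'] == 2   # rows for target2

# Divide by position, not by index label (assumes equal row counts).
ratio = df.loc[mask1, 'name3'].to_numpy() / df.loc[mask2, 'name3'].to_numpy()

# Write the result back into the original frame in a single .loc call.
df.loc[mask1, 'name3'] = ratio
print(df)
```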

Error in using Pandas groupby.apply to drop duplication

I have a Pandas data frame which has some duplicate values, not rows. I want to use groupby.apply to remove the duplication. An example is as follows.
df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])
A B C
0 a 1 1
1 a 1 2
2 b 1 1
# My function
def get_uniq_t(df):
    if df.shape[0] > 1:
        df['D'] = df.C * 10 + df.B
        df = df[df.D == df.D.max()].drop(columns='D')
    return df
df = df.groupby('A').apply(get_uniq_t)
Then I get the following ValueError. The issue seems to be related to creating the new column D: if I create column D outside the function, the code seems to run fine. Can someone help explain what caused the error?
ValueError: Shape of passed values is (3, 3), indices imply (2, 3)
The problem with your code is that it attempts to modify the original group.
Another problem is that this function should return a single row, not a DataFrame.
Change your function to:
def get_uniq_t(df):
    # Index label of the row with the largest C * 10 + B in this group
    iMax = (df.C * 10 + df.B).idxmax()
    return df.loc[iMax]
Then its application returns:
A B C
A
a a 1 2
b b 1 1
Edit following the comment
In my opinion, modifying the original group is not allowed, as it would indirectly modify the original DataFrame. At the very least it raises a warning and is considered bad practice. Search the web for SettingWithCopyWarning for a more extensive description.
My code (the get_uniq_t function) does not modify the original group. It only returns one row from the current group, selected as the row with the greatest value of df.C * 10 + df.B. So when you apply this function, the result is a new DataFrame whose consecutive rows are the results of this function for consecutive groups.
You can perform an operation equivalent to modification by creating new content, e.g. as the result of a groupby instruction, and then saving it under the same variable that previously held the source DataFrame.
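The same selection can also be sketched without apply, by computing the score once and asking each group for the index label of its best row. This is an alternative under the same scoring rule (C * 10 + B), not the answer's own code; score, best and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]],
                  columns=['A', 'B', 'C'])

# Score each row, then find the index label of the best row per group.
score = df.C * 10 + df.B
best = score.groupby(df.A).idxmax()

# Select those rows from the original frame; nothing is modified in place.
result = df.loc[best].reset_index(drop=True)
print(result)
```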

Pandas - Trying to create a list or Series in a data frame cell

I have the following data frame
df = pd.DataFrame({'A':[74.75, 91.71, 145.66], 'B':[4, 3, 3], 'C':[25.34, 33.52, 54.70]})
A B C
0 74.75 4 25.34
1 91.71 3 33.52
2 145.66 3 54.70
I would like to create another column df['D'] that would be a list or series from the first 3 columns suitable for use in another column with the np.irr function that would look like this
D
0 [ -74.75, 25.34, 25.34, 25.34, 25.34]
1 [ -91.71, 33.52, 33.52, 33.52]
2 [-145.66, 54.70, 54.70, 54.70]
so I could ultimately do something like this
df['E'] = np.irr(df['D'])
I did get as far as this
[-df.A[0]]+[df.C[0]]*df.B[0]
but it is not quite there.
Do you really need the column 'D'?
By the way you can easily add it as:
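# Note: np.irr was removed in NumPy 1.20; numpy_financial.irr is its replacement
df['D'] = [[-df.A[i]]+[df.C[i]]*df.B[i] for i in range(len(df))]
df['E'] = df['D'].map(np.irr)
if you don't need it, you can directly set E
df['E'] = [np.irr([-df.A[i]]+[df.C[i]]*df.B[i]) for i in range(len(df))]
or:
# int() is needed because a row of mixed int/float columns is upcast to float
df['E'] = df.apply(lambda x: np.irr([-x.A] + [x.C] * int(x.B)), axis=1)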
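The cash-flow lists can also be built column-wise with zip, which avoids positional lookups by i. The sketch below only constructs column D from the question's data; computing the rate itself is left out because it depends on which IRR implementation is installed.

```python
import pandas as pd

df = pd.DataFrame({'A': [74.75, 91.71, 145.66],
                   'B': [4, 3, 3],
                   'C': [25.34, 33.52, 54.70]})

# One outflow of -A followed by B inflows of C, built row by row.
df['D'] = [[-a] + [c] * b for a, b, c in zip(df['A'], df['B'], df['C'])]
print(df['D'].tolist())
```

Each list in D can then be passed to an IRR function with df['D'].map(...).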

selecting data from pandas panel with MultiIndex

I have a DataFrame with MultiIndex, for example:
In [1]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [2]: df = DataFrame(randn(6,2),index=MultiIndex.from_tuples(zip(*arrays)),columns=['A','B'])
In [3]: df
Out [3]:
A B
one 1 -2.028736 -0.466668
2 -1.877478 0.179211
3 0.886038 0.679528
two 1 1.101735 0.169177
2 0.756676 -1.043739
3 1.189944 1.342415
Now I want to compute the means of elements 2 and 3 (index level 1) for each row (index level 0) and each column. So I need a DataFrame which would look like
A B
one 1 mean(df['A'].ix['one'][1:3]) mean(df['B'].ix['one'][1:3])
two 1 mean(df['A'].ix['two'][1:3]) mean(df['B'].ix['two'][1:3])
How do I do that without using loops over rows (index level 0) of the original data frame? What if I want to do the same for a Panel? There must be a simple solution with groupby, but I'm still learning it and can't think of an answer.
You can use the xs function to select on levels.
Starting with:
A B
one 1 -2.712137 -0.131805
2 -0.390227 -1.333230
3 0.047128 0.438284
two 1 0.055254 -1.434262
2 2.392265 -1.474072
3 -1.058256 -0.572943
You can then create a new dataframe using:
DataFrame({'one':df.xs('one',level=0)[1:3].apply(np.mean), 'two':df.xs('two',level=0)[1:3].apply(np.mean)}).transpose()
which gives the result:
A B
one -0.171549 -0.447473
two 0.667005 -1.023508
To do the same without specifying the items in the level, you can use groupby:
grouped = df.groupby(level=0)
d = {}
for name, group in grouped:
    d[name] = group[1:3].apply(np.mean)
DataFrame(d).transpose()
I'm not sure about panels - it's not as well documented, but something similar should be possible
I know this is an old question, but for anyone who searches and finds this page, I think the easier solution is the level keyword in mean:
In [4]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [5]: df = pd.DataFrame(np.random.randn(6,2),index=pd.MultiIndex.from_tuples(zip(*arrays)),columns=['A','B'])
In [6]: df
Out[6]:
A B
one 1 -0.472890 2.297778
2 -2.002773 -0.114489
3 -1.337794 -1.464213
two 1 1.964838 -0.623666
2 0.838388 0.229361
3 1.735198 0.170260
In [7]: df.mean(level=0)
Out[7]:
A B
one -1.271152 0.239692
two 1.512808 -0.074682
In this case it means that level 0 is kept over axis 0 (the rows, default value for mean)
Do the following:
# Specify the indices you want to work with.
idxs = [("one", elem) for elem in [2,3]] + [("two", elem) for elem in [2,3]]
# Compute grouped mean over only those indices.
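# .ix and mean(level=...) were removed in later pandas releases; .loc and groupby(level=0) are the equivalents.
df.loc[idxs].groupby(level=0).mean()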
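For recent pandas versions, where DataFrame.ix and the level argument of mean are no longer available, the same computation can be sketched with a filter on the inner index level followed by groupby(level=0). The data below is a deterministic stand-in for the random frame in the question, chosen so the means are easy to check; subset and result are illustrative names.

```python
import numpy as np
import pandas as pd

arrays = [['one', 'one', 'one', 'two', 'two', 'two'], [1, 2, 3, 1, 2, 3]]
index = pd.MultiIndex.from_arrays(arrays)
df = pd.DataFrame(np.arange(12, dtype=float).reshape(6, 2),
                  index=index, columns=['A', 'B'])

# Keep only the rows whose inner index level is 2 or 3.
subset = df[df.index.get_level_values(1).isin([2, 3])]

# Average the remaining rows within each outer index label.
result = subset.groupby(level=0).mean()
print(result)
```

This avoids listing the outer labels by hand, so it extends to any number of groups.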