Vectorized method to sync two arrays - numpy

I have two pandas TimeSeries, x and y, which I would like to sync "as of": for every element in x I would like to find the latest element in y (by index value) that precedes it. For example, I would like to compute this new_x:
x       new_x
-----   -----
13:01   13:00
14:02   14:00

y
-----
13:00
13:01
13:30
14:00
I am looking for a vectorized solution, not a Python loop. The time values are NumPy datetime64. The length of y is on the order of millions, so O(n^2) solutions are probably not practical.

In some circles this operation is known as the "asof" join. Here is an implementation:
def diffCols(df1, df2):
    """ Find columns in df1 not present in df2.

    Return df1.columns - df2.columns, maintaining the order in which the
    resulting columns appear in df1.

    Parameters
    ----------
    df1 : pandas DataFrame
    df2 : pandas DataFrame

    Pandas already offers df1.columns - df2.columns, but unfortunately
    the original order of the resulting columns is not maintained.
    """
    return [i for i in df1.columns if i not in df2.columns]

def aj(df1, df2, overwriteColumns=True, inplace=False):
    """ KDB+-like asof join.

    Finds prevailing values of df2 as of df1's index. The resulting dataframe
    will have the same number of rows as df1.

    Parameters
    ----------
    df1 : pandas DataFrame
    df2 : pandas DataFrame
    overwriteColumns : boolean, default True
        The columns of df2 will overwrite the columns of df1 if they have the
        same name, unless overwriteColumns is set to False. In that case, this
        function will only join columns of df2 which are not present in df1.
    inplace : boolean, default False
        If True, adds columns of df2 to df1. Otherwise, creates a new dataframe
        with columns of both df1 and df2.

    Assumes both df1 and df2 have a datetime64 index.
    """
    joiner = lambda x: x.asof(df1.index)
    if not overwriteColumns:
        # Only join columns of df2 not already present in df1
        cols = diffCols(df2, df1)
        if len(cols) > 0:
            df2 = df2.loc[:, cols]  # .ix is deprecated/removed; .loc works here
    result = df2.apply(joiner)
    if inplace:
        for i in result.columns:
            df1[i] = result[i]
        return df1
    else:
        return result
Internally, this uses pandas.Series.asof().
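For the single-series setup in the question, a minimal sketch of that underlying call might look like this (timestamps are made up to mirror the example; note that asof matches "at or before", so an exact hit in y returns that same timestamp):
import pandas as pd

# y's values are its own timestamps, so asof returns the prevailing time itself
y_times = pd.to_datetime(["2013-01-01 13:00", "2013-01-01 13:01",
                          "2013-01-01 13:30", "2013-01-01 14:00"])
y = pd.Series(y_times, index=y_times)

x = pd.to_datetime(["2013-01-01 13:05", "2013-01-01 14:02"])

# For each timestamp in x, asof picks the last entry of y whose index
# is at or before that timestamp (vectorized inside pandas)
new_x = y.asof(x)
print(new_x)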

What about using Series.searchsorted() to return the position in y where each element of x would be inserted? You could then subtract one from that value and use it to index y.
In [1]: x
Out[1]:
0 1301
1 1402
In [2]: y
Out[2]:
0 1300
1 1301
2 1330
3 1400
In [3]: y[y.searchsorted(x)-1]
Out[3]:
0 1300
3 1400
note: the above example uses int64 Series
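A sketch of the same idea on datetime64 values, closer to the question's setup (data made up; y must be sorted):
import pandas as pd

y = pd.Series(pd.to_datetime(["2013-01-01 13:00", "2013-01-01 13:01",
                              "2013-01-01 13:30", "2013-01-01 14:00"]))
x = pd.Series(pd.to_datetime(["2013-01-01 13:01", "2013-01-01 14:02"]))

# searchsorted (side='left') gives the first position in y that is >= each x,
# so subtracting one yields the latest strictly-preceding element.
pos = y.searchsorted(x) - 1
new_x = y.iloc[pos].reset_index(drop=True)
print(new_x)
# An x earlier than everything in y would give position -1 (wrapping around to
# the last element), so such values would need separate handling.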

How do I offset a dataframe with values in another dataframe?

I have two dataframes. One holds the base values (df) and the other holds an offset (df2).
How do I create a third dataframe that is the first dataframe offset by matching values (the ID) in the second dataframe?
This post doesn't seem to do the offset... Update only some values in a dataframe using another dataframe
import pandas as pd
# initialize list of lists
data = [['1092', 10.02], ['18723754', 15.76], ['28635', 147.87]]
df = pd.DataFrame(data, columns = ['ID', 'Price'])
offsets = [['1092', 100.00], ['28635', 1000.00], ['88273', 10.]]
df2 = pd.DataFrame(offsets, columns = ['ID', 'Offset'])
print (df)
print (df2)
>>> print (df)
         ID   Price
0      1092   10.02
1  18723754   15.76    # no offset to affect it
2     28635  147.87
>>> print (df2)
      ID   Offset
0   1092   100.00
1  28635  1000.00
2  88273    10.00    # < no match
This is what I want to produce: the Price has been offset where the IDs match.
         ID    Price
0      1092   110.02
1  18723754    15.76
2     28635  1147.87
I've also looked at Pandas Merging 101
I don't want to add columns to the dataframe, and I don't want to just replace column values with values from another dataframe.
What I want is to add (sum) column values from the other dataframe to this dataframe, where the IDs match.
The closest I have come is df_add = df.reindex_like(df2) + df2, but the problem is that it sums all columns, even the ID column.
Try this:
df['Price'] = pd.merge(df, df2, on=["ID"], how="left")[['Price','Offset']].sum(axis=1)
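A quick runnable check of that one-liner on the question's data (expected output shown in comments):
import pandas as pd

df = pd.DataFrame([['1092', 10.02], ['18723754', 15.76], ['28635', 147.87]],
                  columns=['ID', 'Price'])
df2 = pd.DataFrame([['1092', 100.00], ['28635', 1000.00], ['88273', 10.]],
                   columns=['ID', 'Offset'])

# The left merge brings in Offset where the IDs match (NaN otherwise);
# sum(axis=1) skips NaN by default, so unmatched rows keep their original Price.
df['Price'] = pd.merge(df, df2, on=['ID'], how='left')[['Price', 'Offset']].sum(axis=1)
print(df)
#          ID    Price
# 0      1092   110.02
# 1  18723754    15.76
# 2     28635  1147.87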

DataFrame Index Created From Columns

I have a dataframe that I am populating with data from Bloomberg using TIA. When I look at df.index, the data that I intended to be columns appears as what looks like a multi-index. The output of df.columns is like this:
Index([u'column1', u'column2'])
I have tried various iterations of reset_index but have not been able to remedy this situation.
1) what about the TIA manager causes the dataframe columns to be read in as an index?
2) How can I properly identify these columns as columns instead of a multi-index?
The ultimate problem that I'm trying to fix is that when I try to add this column to df2, the values for that column in df2 come out as NaT. Like below:
df2['column3'] = df1['column1']
Produces:
df2
column1 column2 column3
1135 32 NaT
1351 43 NaT
35 13 NaT
135 13 NaT
From the comments it appears df1 and df2 have completely different indexes
In [396]: df1.index
Out[400]: Index(['Jan', 'Feb', 'Mar', 'Apr', 'May'], dtype='object')
In [401]: df2.index
Out[401]: Index(['One', 'Two', 'Three', 'Four', 'Five'], dtype='object')
but we wish to assign values from df1 to df2, preserving order.
Usually, Pandas operations try to automatically align values based on index (and/or column) labels.
In this case, we wish to ignore the labels. To do that, use
df2['columns3'] = df1['column1'].values
df1['column1'].values is a NumPy array. Since it doesn't have an Index, Pandas simply assigns the values in the array into df2['columns3'] in order.
The assignment would behave the same way if the right-hand side were a list or a tuple.
Note that this also relies on len(df1) equaling len(df2).
For example,
import pandas as pd
df1 = pd.DataFrame(
    {"column1": [1135, 1351, 35, 135, 0], "column2": [32, 43, 13, 13, 0]},
    index=[u"Jan", u"Feb", u"Mar", u"Apr", u"May"],
)
df2 = pd.DataFrame(
    {"column1": range(len(df1))}, index=[u"One", u"Two", u"Three", u"Four", u"Five"]
)
df2["columns3"] = df1["column1"].values
print(df2)
yields
column1 columns3
One 0 1135
Two 1 1351
Three 2 35
Four 3 135
Five 4 0
Alternatively, you could make the two Indexes the same, and then df2["columns3"] = df1["column1"] would produce the same result (but now because the index labels are being aligned):
df1.index = df2.index
df2["columns3"] = df1["column1"]
Another way to make the Indexes match is to reset the index on both DataFrames:
df1 = df1.reset_index()
df2 = df2.reset_index()
df2["columns3"] = df1["column1"]
reset_index moves the old index into a column named index by default (if index.name was None). Integers (starting with 0) are assigned as the new index labels:
In [402]: df1.reset_index()
Out[410]:
index column1 column2
0 Jan 1135 32
1 Feb 1351 43
2 Mar 35 13
3 Apr 135 13
4 May 0 0
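For completeness, a self-contained check of the reset_index route (same data as the example above):
import pandas as pd

df1 = pd.DataFrame({"column1": [1135, 1351, 35, 135, 0],
                    "column2": [32, 43, 13, 13, 0]},
                   index=["Jan", "Feb", "Mar", "Apr", "May"])
df2 = pd.DataFrame({"column1": range(5)},
                   index=["One", "Two", "Three", "Four", "Five"])

# After reset_index both frames share a fresh 0..4 integer index,
# so label alignment now pairs the rows in their original order.
df1 = df1.reset_index()
df2 = df2.reset_index()
df2["columns3"] = df1["column1"]
print(df2)
#    index  column1  columns3
# 0    One        0      1135
# 1    Two        1      1351
# 2  Three        2        35
# 3   Four        3       135
# 4   Five        4         0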

pandas add multiple columns with apply [duplicate]

This question already has answers here:
How can I split a column of tuples in a Pandas dataframe?
(6 answers)
Closed 3 years ago.
I am currently projecting latitude/longitude coordinates to a Cartesian plane in my pandas dataframe. So, I have a method for the projection:
def convert_lat_long_xy(lat, lo):
    return x, y
So this returns a tuple and I can use this method on my dataframe as:
df.apply(lambda x: convert_lat_long_xy(x.latitude, x.longitude), axis=1)
Now, what I would like to do is create two extra columns in my data frame called 'x' and 'y' to hold these values. I know I can do something like:
df['proj'] = df.apply(lambda x: convert_lat_long_xy(x.latitude, x.longitude), axis=1)
But is it possible to add the values to two different columns?
Yes, you need to convert the output of the lambda into a pd.Series. Here's an example:
In [1]: import pandas as pd
In [2]: pd.DataFrame(["1,2", "2,3"], columns=["coord"])
Out[2]:
coord
0 1,2
1 2,3
In [3]: df = pd.DataFrame(["1,2", "2,3"], columns=["coord"])
In [4]: df.apply(lambda x: pd.Series(x["coord"].split(",")), axis=1)
Out[4]:
   0  1
0  1  2
1  2  3
In [5]: df[["x", "y"]] = df.apply(lambda x: pd.Series(x["coord"].split(",")), axis=1)
In [6]: df
Out[6]:
coord x y
0 1,2 1 2
1 2,3 2 3
For your particular case, df.apply will become like this:
df[['x', 'y']] = df.apply(lambda x: pd.Series(convert_lat_long_xy(x.latitude, x.longitude)), axis=1)
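A self-contained sketch of that last line, using a stand-in for convert_lat_long_xy (the real projection is assumed from the question):
import pandas as pd

def convert_lat_long_xy(lat, lon):
    # placeholder arithmetic standing in for the real projection
    return lat * 2, lon * 2

df = pd.DataFrame({"latitude": [10.0, 20.0], "longitude": [30.0, 40.0]})

# Wrapping the returned tuple in pd.Series lets apply produce two columns,
# which are then assigned to 'x' and 'y' in one step.
df[["x", "y"]] = df.apply(
    lambda row: pd.Series(convert_lat_long_xy(row.latitude, row.longitude)),
    axis=1)
print(df)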

search and compare data between dataframes

I have an issue with merging data-frames.
I have two data-frames as follows:
df1:
ID  name-group  status
1   bob,david   good
2   CC,robben   good
3   jack        bad
df2:
ID  leader   location
2   robben   JAPAN
3   jack     USA
4   bob      UK
I want to get a result as follows:
dft:
ID  name-group  Leader  location
1   bob,david
2   CC,robben   Robben  JAPAN
3   jack        Jack    USA
The [Leader] and [location] columns should be merged in when
[leader] in df2 IN [name-group] of df1
&
[ID] of df2 = [ID] of df1
I have tried a for loop, but it is far too slow.
Any ideas for this issue?
Thanks
See the end of the post for runnable code. The proposed solution is in the function using_tidy.
The main problem here is that having multiple names in name-group, separated
by commas, makes searching for membership difficult. If, instead, df1 had each
member of name-group in its own row, then testing for membership would be
easy. That is, suppose df1 looked like this:
   ID name-group status
0   1        bob   good
0   1      david   good
1   2         CC   good
1   2     robben   good
2   3       jack    bad
Then you could simply merge df1 and df2 on ID and test if leader
equals name-group... almost (see why "almost" below).
Putting df1 in tidy format
is the main idea in the solution below. The reason it improves performance
is that testing for equality between two columns is much, much faster than
testing whether each string in one column is a substring of, or a member of a
comma-separated list in, the corresponding string in another column.
The reason I said "almost" above is that there is another difficulty:
after merging df1 and df2 on ID, some rows are leaderless, such as the bob,david row:
ID  name-group  Leader  location
1   bob,david
Since we simply want to keep these rows and we don't want to test if criteria #1 holds in this case, we need to treat these rows differently -- don't expand them.
We can handle this problem by separating the leaderless rows from those with potential leaders (see below).
The second criteria, that the IDs match, is easy to enforce by merging df1 and df2 on ID:
dft = pd.merge(df1, df2, on='ID', how='left')
The first criteria is that dft['leader'] is in dft['name-group'].
This criteria could be expressed as
In [293]: dft.apply(lambda x: pd.isnull(x['leader']) or (x['leader'] in x['name-group'].split(',')), axis=1)
Out[293]:
0 True
1 True
2 True
dtype: bool
but using dft.apply(..., axis=1) calls the lambda function once for each
row. This can be very slow if there are many rows in dft.
If there are many rows in dft, we can do better by first converting dft to
tidy format, placing each member of dft['name-group'] on its own row. But first,
let's split dft into 2 sub-DataFrames: those rows which have a leader, and those which don't:
has_leader = pd.notnull(dft['leader'])
leaderless, leaders = dft.loc[~has_leader, :], dft.loc[has_leader, :]
Now put the leaders in tidy format (one member per row):
member = leaders['name-group'].str.split(',', expand=True)
member = member.stack()
member.index = member.index.droplevel(1)
member.name = 'member'
leaders = pd.concat([member, leaders], axis=1)
The payoff for all this work is that criteria #1 can now be expressed by a fast calculation:
# this enforces criteria #1 (leader of df2 is in name-group of df1)
mask = (leaders['leader'] == leaders['member'])
leaders = leaders.loc[mask, :]
leaders = leaders.drop('member', axis=1)
and the desired result is:
dft = pd.concat([leaderless, leaders], axis=0)
We had to do some work to get df1 into tidy format. We need to benchmark to
determine if the cost of doing that extra work pays off by being able to compute criteria #1 faster.
Here is a benchmark using largish dataframes of 1000 rows for df1 and df2:
In [356]: %timeit using_tidy(df1, df2)
100 loops, best of 3: 17.8 ms per loop
In [357]: %timeit using_apply(df1, df2)
10 loops, best of 3: 98.2 ms per loop
The speed advantage of using_tidy over using_apply increases as the number
of rows in pd.merge(df1, df2, on='ID', how='left') increases.
Here is the setup for the benchmark:
import string
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name-group': ['bob,david', 'CC,robben', 'jack'],
                    'status': ['good', 'good', 'bad'],
                    'ID': [1, 2, 3]})
df2 = pd.DataFrame({'leader': ['robben', 'jack', 'bob'],
                    'location': ['JAPAN', 'USA', 'UK'],
                    'ID': [2, 3, 4]})

def using_apply(df1, df2):
    dft = pd.merge(df1, df2, on='ID', how='left')
    mask = dft.apply(lambda x: pd.isnull(x['leader'])
                     or (x['leader'] in x['name-group'].split(',')), axis=1)
    return dft.loc[mask, :]

def using_tidy(df1, df2):
    # this enforces criteria #2 (the IDs are the same)
    dft = pd.merge(df1, df2, on='ID', how='left')
    # split dft into 2 sub-DataFrames, based on rows which have a leader and those which do not
    has_leader = pd.notnull(dft['leader'])
    leaderless, leaders = dft.loc[~has_leader, :], dft.loc[has_leader, :]
    # expand leaders so each member in name-group has its own row
    member = leaders['name-group'].str.split(',', expand=True)
    member = member.stack()
    member.index = member.index.droplevel(1)
    member.name = 'member'
    leaders = pd.concat([member, leaders], axis=1)
    # this enforces criteria #1 (leader of df2 is in name-group of df1)
    mask = (leaders['leader'] == leaders['member'])
    leaders = leaders.loc[mask, :]
    leaders = leaders.drop('member', axis=1)
    dft = pd.concat([leaderless, leaders], axis=0)
    return dft

def make_random_str_array(letters=string.ascii_uppercase, strlen=10, size=100):
    return (np.random.choice(list(letters), size*strlen)
            .view('|U{}'.format(strlen)))

def make_dfs(N=1000):
    names = make_random_str_array(strlen=4, size=10)
    df1 = pd.DataFrame({
        'name-group': [','.join(np.random.choice(names, size=np.random.randint(1, 10),
                                                 replace=False)) for i in range(N)],
        'status': np.random.choice(['good', 'bad'], size=N),
        'ID': np.random.randint(4, size=N)})
    df2 = pd.DataFrame({
        'leader': np.random.choice(names, size=N),
        'location': np.random.randint(10, size=N),
        'ID': np.random.randint(4, size=N)})
    return df1, df2

df1, df2 = make_dfs()
Why don't you use
Dft = pd.merge(df1, df2, how='left', left_on=['ID'], right_on=['ID'])
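On the question's sample frames this left merge already happens to match the desired output, since every matching ID there also satisfies the name-group condition; on its own, though, it only enforces criteria #2 (matching IDs), as the answer above discusses. A quick check, with df1 and df2 as built in the setup above:
Dft = pd.merge(df1, df2, how='left', left_on=['ID'], right_on=['ID'])
print(Dft)
# roughly:
#   name-group status  ID  leader location
# 0  bob,david   good   1     NaN      NaN
# 1  CC,robben   good   2  robben    JAPAN
# 2       jack    bad   3    jack      USA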

correct accessing of slices with duplicate index-values present

I have a dataframe with an index that sometimes contains rows with the same index-value. Now I want to slice that dataframe and set values based on row-indices.
Consider the following example:
import pandas as pd
df = pd.DataFrame({'index':[1,2,2,3], 'values':[10,20,30,40]})
df.set_index(['index'], inplace=True)
df1 = df.copy()
df2 = df.copy()
#copy warning
df1.iloc[0:2]['values'] = 99
print(df1)
df2.loc[df.index[0:2], 'values'] = 99
print(df2)
df1 is the expected result, but gives me a SettingWithCopyWarning.
df2 uses the access method suggested by the docs, but gives me the wrong result (because of the duplicate index).
Is there a "proper" way to set those values correctly with the duplicate index-values present?
.loc is not recommended when you have a duplicate index, so you have to go for position-based selection with iloc. Since we need to pass positions, we use get_loc to get the position of the column:
print (df2.columns.get_loc('values'))
0
df1.iloc[0:2, df2.columns.get_loc('values')] = 99
print(df1)
       values
index
1          99
2          99
2          30
3          40
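Putting the question's setup and the positional fix together, a minimal runnable sketch:
import pandas as pd

df = pd.DataFrame({'index': [1, 2, 2, 3], 'values': [10, 20, 30, 40]})
df.set_index(['index'], inplace=True)

# A positional row slice combined with a positional column lookup avoids both
# the SettingWithCopyWarning and the ambiguity of .loc with duplicate labels.
col = df.columns.get_loc('values')
df.iloc[0:2, col] = 99
print(df)
#        values
# index
# 1          99
# 2          99
# 2          30
# 3          40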