sort dataframe by other dataframe - pandas

Given two dataframes:
knn_df
0 1 2 3
0 1.1565523 1.902790 1.927971 1.1530536
1 1.927971 1.1565523 1.815097 1.1530536
2 1.902790 1.1565523 1.815097 1.927971
3 1.815097 1.927971 1.902790 1.1530536
4 1.902790 1.1565523 1.815097 1.1530536
dates_df
0 1 2 3
0 2011-11-14 02:30:00.601 2003-08-12 00:00:00.000 2003-11-30 23:00:00.000 2011-10-25 12:00:00.000
1 2003-11-30 23:00:00.000 2011-11-14 02:30:00.601 2002-08-06 00:00:00.000 2011-10-25 12:00:00.000
2 2003-08-12 00:00:00.000 2011-11-14 02:30:00.601 2002-08-06 00:00:00.000 2003-11-30 23:00:00.000
3 2002-08-06 00:00:00.000 2003-11-30 23:00:00.000 2003-08-12 00:00:00.000 2011-10-25 12:00:00.000
4 2003-08-12 00:00:00.000 2011-11-14 02:30:00.601 2002-08-06 00:00:00.000 2011-10-25 12:00:00.000
I have to sort the values of knn_df by the dates of dates_df.
Every row in dates_df corresponds to a row in knn_df.
I tried to sort like this:
np.argsort(dates_df.values,axis=1)[:,::-1]
array([[0, 3, 2, 1],
[1, 3, 0, 2],
[1, 3, 0, 2],
[3, 1, 2, 0],
[1, 3, 0, 2]])
which gives the right order of the values per row. But when I tried to reorder:
Sorted_knn = (knn_df.values[np.arange(len(knn_df)),
np.argsort(dates_df.values,axis=1)[:,::-1]])
I get an error
IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (5,) (5,4)
What am I missing?

Add [:, None] to make the row indexer a two-dimensional 5x1 array, so it broadcasts correctly against the 5x4 column indexer:
a = np.argsort(dates_df.values,axis=1)[:,::-1]
b = knn_df.values[np.arange(len(knn_df))[:, None], a]
print (b)
[[1.1565523 1.1530536 1.927971 1.90279 ]
[1.1565523 1.1530536 1.927971 1.815097 ]
[1.1565523 1.927971 1.90279 1.815097 ]
[1.1530536 1.927971 1.90279 1.815097 ]
[1.1565523 1.1530536 1.90279 1.815097 ]]

Related

How to filter a dataframe given a specific daily hour?

Given the two data frames:
df1:
datetime v
2020-10-01 12:00:00 15
2020-10-02 4
2020-10-03 07:00:00 3
2020-10-03 08:01:00 51
2020-10-03 09:00:00 9
df2:
datetime p
2020-10-01 11:00:00 1
2020-10-01 12:00:00 2
2020-10-02 13:00:00 14
2020-10-02 13:01:00 5
2020-10-03 20:00:00 12
2020-10-03 02:01:00 30
2020-10-03 07:00:00 7
I want to merge these two dataframes into one; the policy is to look up the value nearest to 08:00 each day. The final result should be:
datetime v p
2020-10-01 08:00:00 15 1
2020-10-02 08:00:00 4 14
2020-10-03 08:00:00 51 7
How can I implement this?
Given the following dataframes:
import pandas as pd
df1 = pd.DataFrame(
    {
        "datetime": [
            "2020-10-01 12:00:00",
            "2020-10-02",
            "2020-10-03 07:00:00",
            "2020-10-03 08:01:00",
            "2020-10-03 09:00:00",
        ],
        "v": [15, 4, 3, 51, 9],
    }
)
df2 = pd.DataFrame(
    {
        "datetime": [
            "2020-10-01 11:00:00",
            "2020-10-01 12:00:00",
            "2020-10-02 13:00:00",
            "2020-10-02 13:01:00",
            "2020-10-03 20:00:00",
            "2020-10-03 02:01:00",
            "2020-10-03 07:00:00",
        ],
        "p": [1, 2, 14, 5, 12, 30, 7],
    }
)
You can define a helper function:
def align(df):
    # Set proper type
    df["datetime"] = pd.to_datetime(df["datetime"])
    # Slice df by day
    dfs = [
        df.copy().loc[df["datetime"].dt.date == item, :]
        for item in df["datetime"].dt.date.unique()
    ]
    # Evaluate distance in seconds between given hour and 08:00:00 and filter on min
    for i, df in enumerate(dfs):
        df["target"] = pd.to_datetime(df["datetime"].dt.date.astype(str) + " 08:00:00")
        df["distance"] = (
            df["target"].map(lambda x: x.hour * 3600 + x.minute * 60 + x.second)
            - df["datetime"].map(lambda x: x.hour * 3600 + x.minute * 60 + x.second)
        ).abs()
        dfs[i] = df.loc[df["distance"].idxmin(), :]
    # Concatenate filtered dataframes
    df = (
        pd.concat(dfs, axis=1)
        .T.drop(columns=["datetime", "distance"])
        .rename(columns={"target": "datetime"})
        .set_index("datetime")
    )
    return df
Apply it to df1 and df2, then merge:
df = pd.merge(
    right=align(df1), left=align(df2), how="outer", right_index=True, left_index=True
).reindex(columns=["v", "p"])
print(df)
# Output
v p
datetime
2020-10-01 08:00:00 15 1
2020-10-02 08:00:00 4 14
2020-10-03 08:00:00 51 7

How to convert dictionary with list to dataframe with default index and column names

How to convert a dictionary to a dataframe with a default index and given column names?
d = {0: [1, 'Sports', 222], 1: [2, 'Tools', 11], 2: [3, 'Clothing', 23]}
Expected df:
id type value
0 1 Sports 222
1 2 Tools 11
2 3 Clothing 23
Use DataFrame.from_dict with the orient='index' parameter:
d = {0: [1, 'Sports', 222], 1: [2, 'Tools', 11], 2: [3, 'Clothing', 23]}
df = pd.DataFrame.from_dict(d, orient='index', columns=['id','type','value'])
print (df)
id type value
0 1 Sports 222
1 2 Tools 11
2 3 Clothing 23
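For completeness, two equivalent constructions (a sketch, not part of the original answer): transposing a column-oriented frame, or discarding the keys and building straight from the values:

```python
import pandas as pd

d = {0: [1, 'Sports', 222], 1: [2, 'Tools', 11], 2: [3, 'Clothing', 23]}

# Transpose a column-oriented frame, then relabel the columns ...
df_a = pd.DataFrame(d).T.set_axis(['id', 'type', 'value'], axis=1)

# ... or ignore the keys entirely and build from the values
df_b = pd.DataFrame(list(d.values()), columns=['id', 'type', 'value'])
```

The transpose route keeps the dict keys as the index (here they coincide with the default 0..2), while the values route always yields a fresh default RangeIndex.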

append columns from rows in pandas

I want to convert rows into new columns, like this:
original dataframe:
attr_0 attr_1 attr_2 attr_3
0 day_0 -0.032546 0.161111 -0.488420 -0.811738
1 day_1 -0.341992 0.779818 -2.937992 -0.236757
2 day_2 0.592365 0.729467 0.421381 0.571941
3 day_3 -0.418947 2.022934 -1.349382 1.411210
4 day_4 -0.726380 0.287871 -1.153566 -2.275976
...
after conversion:
day_0_attr_0 day_0_attr_1 day_0_attr_2 day_0_attr_3 day_1_attr_0 \
0 -0.032546 0.144388 -0.992263 0.734864 -0.936625
day_1_attr_1 day_1_attr_2 day_1_attr_3 day_2_attr_0 day_2_attr_1 \
0 -1.717135 -0.228005 -0.330573 -0.28034 0.834345
day_2_attr_2 day_2_attr_3 day_3_attr_0 day_3_attr_1 day_3_attr_2 \
0 1.161089 0.385277 -0.014138 -1.05523 -0.618873
day_3_attr_3 day_4_attr_0 day_4_attr_1 day_4_attr_2 day_4_attr_3
0 0.724463 0.137691 -1.188638 -2.457449 -0.171268
If the index is a MultiIndex, use:
print (df.index)
MultiIndex(levels=[[0, 1, 2, 3, 4], ['day_0', 'day_1', 'day_2', 'day_3', 'day_4']],
labels=[[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]])
df = df.reset_index(level=0, drop=True).stack().reset_index()
level_0 level_1 0
0 day_0 attr_0 -0.032546
1 day_0 attr_1 0.161111
2 day_0 attr_2 -0.488420
3 day_0 attr_3 -0.811738
4 day_1 attr_0 -0.341992
5 day_1 attr_1 0.779818
6 day_1 attr_2 -2.937992
7 day_1 attr_3 -0.236757
8 day_2 attr_0 0.592365
9 day_2 attr_1 0.729467
10 day_2 attr_2 0.421381
11 day_2 attr_3 0.571941
12 day_3 attr_0 -0.418947
13 day_3 attr_1 2.022934
14 day_3 attr_2 -1.349382
15 day_3 attr_3 1.411210
16 day_4 attr_0 -0.726380
17 day_4 attr_1 0.287871
18 day_4 attr_2 -1.153566
19 day_4 attr_3 -2.275976
df = pd.DataFrame([df[0].values], columns = df['level_0'] + '_' + df['level_1'])
print (df)
day_0_attr_0 day_0_attr_1 ... day_4_attr_2 day_4_attr_3
0 -0.032546 0.161111 ... -1.153566 -2.275976
[1 rows x 20 columns]
Another solution with product:
from itertools import product
cols = ['{}_{}'.format(a,b) for a, b in product(df.index.get_level_values(1), df.columns)]
print (cols)
['day_0_attr_0', 'day_0_attr_1', 'day_0_attr_2', 'day_0_attr_3',
'day_1_attr_0', 'day_1_attr_1', 'day_1_attr_2', 'day_1_attr_3',
'day_2_attr_0', 'day_2_attr_1', 'day_2_attr_2', 'day_2_attr_3',
'day_3_attr_0', 'day_3_attr_1', 'day_3_attr_2', 'day_3_attr_3',
'day_4_attr_0', 'day_4_attr_1', 'day_4_attr_2', 'day_4_attr_3']
df = pd.DataFrame([df.values.ravel()], columns=cols)
print (df)
day_0_attr_0 day_0_attr_1 ... day_4_attr_2 day_4_attr_3
0 -0.032546 0.161111 ... -1.153566 -2.275976
[1 rows x 20 columns]
If there is no MultiIndex, the solutions change a bit:
print (df.index)
Index(['day_0', 'day_1', 'day_2', 'day_3', 'day_4'], dtype='object')
df = df.stack().reset_index()
df = pd.DataFrame([df[0].values], columns = df['level_0'] + '_' + df['level_1'])
from itertools import product
cols = ['{}_{}'.format(a,b) for a, b in product(df.index, df.columns)]
df = pd.DataFrame([df.values.ravel()], columns=cols)
print (df)
You can use a melt-and-string-concatenation approach, i.e.
idx = df.index
temp = df.melt()
# Repeat the index
temp['variable'] = pd.Series(np.concatenate([idx]*len(df.columns))) + '_' + temp['variable']
# Set index and transpose
temp.set_index('variable').T
variable day_0_attr_0 day_1_attr_0 day_2_attr_0 day_3_attr_0 day_4_attr_0 . . . .
value -0.032546 -0.341992 0.592365 -0.418947 -0.72638 . . . .
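A compact variant of the same idea (a sketch, not in the original answers): stack always yields a (row label, column label) MultiIndex on the resulting Series, so its index pairs can be joined directly into the new column names:

```python
import numpy as np
import pandas as pd

# Small illustrative frame in place of the random data above
df = pd.DataFrame(np.arange(8).reshape(2, 4),
                  index=['day_0', 'day_1'],
                  columns=['attr_0', 'attr_1', 'attr_2', 'attr_3'])

s = df.stack()  # Series indexed by (day, attr) pairs
wide = pd.DataFrame([s.values],
                    columns=[f'{day}_{attr}' for day, attr in s.index])
```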

Efficiently Creating A Pandas DataFrame From A Numpy 3d array

Suppose we start with
import numpy as np
a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
How can this efficiently be made into a pandas DataFrame equivalent to
import pandas as pd
>>> pd.DataFrame({'a': [0, 0, 1, 1], 'b': [1, 3, 5, 7], 'c': [2, 4, 6, 8]})
a b c
0 0 1 2
1 0 3 4
2 1 5 6
3 1 7 8
The idea is for the a column to hold the index along the first dimension of the original array, and for the remaining columns to be a vertical concatenation of the 2d arrays in the latter two dimensions of the original array.
(This is easy to do with loops; the question is how to do it without them.)
Longer Example
Using @Divakar's excellent suggestion:
>>> np.random.randint(0,9,(4,3,2))
array([[[0, 6],
[6, 4],
[3, 4]],
[[5, 1],
[1, 3],
[6, 4]],
[[8, 0],
[2, 3],
[3, 1]],
[[2, 2],
[0, 0],
[6, 3]]])
Should be made to something like:
>>> pd.DataFrame({
'a': [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
'b': [0, 6, 3, 5, 1, 6, 8, 2, 3, 2, 0, 6],
'c': [6, 4, 4, 1, 3, 4, 0, 3, 1, 2, 0, 3]})
a b c
0 0 0 6
1 0 6 4
2 0 3 4
3 1 5 1
4 1 1 3
5 1 6 4
6 2 8 0
7 2 2 3
8 2 3 1
9 3 2 2
10 3 0 0
11 3 6 3
Here's one approach that does most of the processing on NumPy before finally putting it out as a DataFrame, like so -
m,n,r = a.shape
out_arr = np.column_stack((np.repeat(np.arange(m),n),a.reshape(m*n,-1)))
out_df = pd.DataFrame(out_arr)
If you know in advance that there will be exactly two data columns, so that b and c are the last two columns and a the first one, you can add column names like so -
out_df = pd.DataFrame(out_arr,columns=['a', 'b', 'c'])
Sample run -
>>> a
array([[[2, 0],
[1, 7],
[3, 8]],
[[5, 0],
[0, 7],
[8, 0]],
[[2, 5],
[8, 2],
[1, 2]],
[[5, 3],
[1, 6],
[3, 2]]])
>>> out_df
a b c
0 0 2 0
1 0 1 7
2 0 3 8
3 1 5 0
4 1 0 7
5 1 8 0
6 2 2 5
7 2 8 2
8 2 1 2
9 3 5 3
10 3 1 6
11 3 3 2
Using Panel:
a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
b=pd.Panel(np.rollaxis(a,2)).to_frame()
c=b.set_index(b.index.labels[0]).reset_index()
c.columns=list('abc')
then a is :
[[[1 2]
[3 4]]
[[5 6]
[7 8]]]
b is :
0 1
major minor
0 0 1 2
1 3 4
1 0 5 6
1 7 8
and c is :
a b c
0 0 1 2
1 0 3 4
2 1 5 6
3 1 7 8
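Note that pd.Panel was removed in pandas 1.0, so the Panel snippet above no longer runs on current versions. The same frame can be built on any version with the reshape idea from the first answer; a minimal sketch:

```python
import numpy as np
import pandas as pd

a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
m, n, r = a.shape

# Flatten the trailing 2d blocks, then prepend the first-axis index
out = pd.DataFrame(a.reshape(m * n, r), columns=['b', 'c'])
out.insert(0, 'a', np.repeat(np.arange(m), n))
```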

Select a subset by a conditional expression from a pandas dataframe, but I get an error

A sample like this:
In [39]: ts = pd.Series(np.random.randn(20),index=pd.date_range('1/1/2000',periods=20))
In [40]: t = pd.DataFrame(ts,columns=['base'],index=ts.index)
In [42]: t['shift_one'] = t.base - t.base.shift(1)
In [43]: t['shift_two'] = t.shift_one.shift(1)
In [44]: t
Out[44]:
base shift_one shift_two
2000-01-01 -1.239924 NaN NaN
2000-01-02 1.116260 2.356184 NaN
2000-01-03 0.401853 -0.714407 2.356184
2000-01-04 -0.823275 -1.225128 -0.714407
2000-01-05 -0.562103 0.261171 -1.225128
2000-01-06 0.347143 0.909246 0.261171
.............
2000-01-20 -0.062557 -0.467466 0.512293
Now, t[t.shift_one > 0] works OK, but when we use:
In [48]: t[t.shift_one > 0 and t.shift_two < 0]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in ()
----> 1 t[t.shift_one > 0 and t.shift_two < 0]
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Suppose we want to get a subset that satisfies both conditions; how can we do that? Thanks a lot.
You need parentheses and &, not and; Python's and tries to evaluate the truth value of a whole Series, which is ambiguous.
See the docs here:
http://pandas.pydata.org/pandas-docs/dev/indexing.html#boolean-indexing
In [3]: ts = pd.Series(np.random.randn(20),index=pd.date_range('1/1/2000',periods=20))
In [4]: t = pd.DataFrame(ts,columns=['base'],index=ts.index)
In [5]: t['shift_one'] = t.base - t.base.shift(1)
In [6]: t['shift_two'] = t.shift_one.shift(1)
In [7]: t
Out[7]:
base shift_one shift_two
2000-01-01 -1.116040 NaN NaN
2000-01-02 1.592079 2.708118 NaN
2000-01-03 0.958642 -0.633436 2.708118
2000-01-04 0.431970 -0.526672 -0.633436
2000-01-05 1.275624 0.843654 -0.526672
2000-01-06 0.498401 -0.777223 0.843654
2000-01-07 -0.351465 -0.849865 -0.777223
2000-01-08 -0.458742 -0.107277 -0.849865
2000-01-09 -2.100404 -1.641662 -0.107277
2000-01-10 0.601658 2.702062 -1.641662
2000-01-11 -2.026495 -2.628153 2.702062
2000-01-12 0.391426 2.417921 -2.628153
2000-01-13 -1.177292 -1.568718 2.417921
2000-01-14 -0.374543 0.802749 -1.568718
2000-01-15 0.338649 0.713192 0.802749
2000-01-16 -1.124820 -1.463469 0.713192
2000-01-17 0.484175 1.608995 -1.463469
2000-01-18 -1.477772 -1.961947 1.608995
2000-01-19 0.481843 1.959615 -1.961947
2000-01-20 0.760168 0.278325 1.959615
In [8]: t[(t.shift_one>0) & (t.shift_two<0)]
Out[8]:
base shift_one shift_two
2000-01-05 1.275624 0.843654 -0.526672
2000-01-10 0.601658 2.702062 -1.641662
2000-01-12 0.391426 2.417921 -2.628153
2000-01-14 -0.374543 0.802749 -1.568718
2000-01-17 0.484175 1.608995 -1.463469
2000-01-19 0.481843 1.959615 -1.961947
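As an aside (not in the original answer), DataFrame.query does accept the and keyword, because it parses the expression string itself rather than evaluating Python booleans; a sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # illustrative data only
ts = pd.Series(np.random.randn(20), index=pd.date_range('1/1/2000', periods=20))
t = pd.DataFrame({'base': ts})
t['shift_one'] = t.base - t.base.shift(1)
t['shift_two'] = t.shift_one.shift(1)

# Bitwise & with parentheses, or an equivalent query string
mask_sel = t[(t.shift_one > 0) & (t.shift_two < 0)]
query_sel = t.query('shift_one > 0 and shift_two < 0')
```

Both selections return the same rows; query is often more readable for multi-condition filters.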