How do I merge two pandas dataframes by time series index? - pandas

Currently I have two dataframes that look like this:
FSample
GMSample
What I want is something that ideally looks like this:
I attempted to do something similar to
result = pd.concat([FSample,GMSample],axis=1)
result
But my result has the data stacked on top of each other.
Then I attempted to use the merge command like this
result = pd.merge(FSample,GMSample,how='inner',on='Date')
result
From that I got a KeyError on 'Date'
So I feel like I am missing both an understanding of how I should be trying to combine these dataframes (i.e. multi-index?) and the syntax to do so properly.

You get a KeyError because Date is part of the index, whereas the "on" keyword in merge expects a column. Alternatively, you could remove Symbol from the indexes and then join the dataframes on the remaining Date index.
FSample.reset_index("Symbol").join(GMSample.reset_index("Symbol"), lsuffix="_x", rsuffix="_y")
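As a minimal sketch of that join, assuming FSample and GMSample are indexed by (Symbol, Date) and each holds a single Close column (the three rows below are illustrative sample values, not the asker's full data):

```python
import pandas as pd

dates = pd.to_datetime(['2012-01-03', '2012-01-04', '2012-01-05'])

# Assumed shape of the two frames: MultiIndex (Symbol, Date), one Close column.
FSample = pd.DataFrame(
    {'Symbol': 'F', 'Date': dates, 'Close': [11.13, 11.30, 11.59]}
).set_index(['Symbol', 'Date'])
GMSample = pd.DataFrame(
    {'Symbol': 'GM', 'Date': dates, 'Close': [21.05, 21.15, 22.17]}
).set_index(['Symbol', 'Date'])

# Move Symbol out of the index so both frames are indexed by Date only,
# then join on that shared index. The suffixes separate the clashing names.
joined = FSample.reset_index('Symbol').join(
    GMSample.reset_index('Symbol'), lsuffix='_x', rsuffix='_y'
)
print(joined)
```

The result keeps Date as the index and carries Symbol_x, Close_x, Symbol_y and Close_y as columns; drop the Symbol columns if you only need the prices.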

Working with MultiIndexes in pandas usually requires you to constantly set/reset the index. That is probably going to be the easiest thing to do in this case as well, as pd.merge does not immediately support merging on specific levels of a MultiIndex.
import pandas as pd

# Sample data for each symbol, indexed by (Symbol, Date)
df_f = pd.DataFrame(
    data={
        'Symbol': ['F'] * 5,
        'Date': pd.to_datetime(['2012-01-03', '2012-01-04', '2012-01-05', '2012-01-06', '2012-01-09']),
        'Close': [11.13, 11.30, 11.59, 11.71, 11.80],
    },
).set_index(['Symbol', 'Date']).sort_index()
df_gm = pd.DataFrame(
    data={
        'Symbol': ['GM'] * 5,
        'Date': pd.to_datetime(['2012-01-03', '2012-01-04', '2012-01-05', '2012-01-06', '2012-01-09']),
        'Close': [21.05, 21.15, 22.17, 22.92, 22.84],
    },
).set_index(['Symbol', 'Date']).sort_index()

# Turn the Date level into a column, merge on it, then restore Date as the index
pd.merge(
    df_f.reset_index(level='Date'),
    df_gm.reset_index(level='Date'),
    how='inner',
    on='Date',
    suffixes=('_F', '_GM'),
).set_index('Date')
The result:
            Close_F  Close_GM
Date
2012-01-03    11.13     21.05
2012-01-04    11.30     21.15
2012-01-05    11.59     22.17
2012-01-06    11.71     22.92
2012-01-09    11.80     22.84

Related

Stuck merging two dataframes, getting unexpected results

I am only 3 weeks into learning pandas and I am getting unexpected results. Any guidance would be appreciated, thank you.
I would like to merge two DataFrames together and retain my set_index.
I have a simple DataFrame
import pandas as pd
data = {
'part_number': [123,123,123],
'part_name': ['some name in 11', 'some name in 12', 'some name in 13'],
'part_size': [11,12,13]
}
df = pd.DataFrame(data=data)
df.set_index('part_name', inplace=True)
I groupby the part_sizes, and merge.
This is where my knowledge breaks down: I lose my index, which is the part_name.
I see there are joins and concats, am I using the wrong syntax?
part_size_merge = df.groupby(['part_number'], dropna=False)['part_size'].agg(tuple).to_frame()
merged = df.merge(part_size_merge, on=['part_number'])
display(merged.head())
I tried concat, however, it looks like it stacks the two df's together, which isn't how I'd like it.
x = pd.concat([df, part_size_merge], axis=0, join='inner')
x.head()
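Yes, that is normal merge behavior: when you merge on columns, the existing indexes are ignored. Reset the index before merging and set it again afterwards: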
out = df.reset_index().merge(part_size_merge, on=['part_number']).set_index('part_name')
Out[334]:
                 part_number  part_size_x   part_size_y
part_name
some name in 11          123           11  (11, 12, 13)
some name in 12          123           12  (11, 12, 13)
some name in 13          123           13  (11, 12, 13)
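If you would rather not reset and restore the index, one alternative sketch uses DataFrame.join with on=, which matches a column of the left frame against the index of the right frame and keeps the left frame's index. The suffix '_all' is only an illustrative name, and dropna=False is left out here for brevity:

```python
import pandas as pd

data = {
    'part_number': [123, 123, 123],
    'part_name': ['some name in 11', 'some name in 12', 'some name in 13'],
    'part_size': [11, 12, 13],
}
df = pd.DataFrame(data=data).set_index('part_name')

# One row per part_number, holding all sizes as a tuple; indexed by part_number.
part_size_merge = df.groupby(['part_number'])['part_size'].agg(tuple).to_frame()

# join(on=...) looks up df['part_number'] in part_size_merge's index
# and preserves df's own index (part_name).
merged = df.join(part_size_merge, on='part_number', rsuffix='_all')
print(merged)
```

The original part_size column keeps its name and the aggregated tuples land in part_size_all.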

Creating a Pandas Dataframe and Assigning Values Based on Another Dataframe

I have a dataframe that looks like this:
df1
ticker  period  calendarDate  updated     dateKey     assetsAverage
WMT     Q       2021-01-01    2021-03-31  2021-04-02  100000000
What I want to do is take these values and put them into another dataframe that looks like this:
df2
ticker  period  Calendar Date  Updated     Date Key    Assets Average
WMT     Q       2021-01-01     2021-03-31  2021-04-02  100000000
I'm using the 2nd dataframe as my output and using my 1st dataframe as temporary storage.
Any suggestions?
I tried doing something like this:
df2 = pd.DataFrame(
{
"Ticker":df1["ticker"],
"Period":df1["period"],
"Calendar Date":df1["calendarDate"],
"Updated":df1["updated"],
"Date Key":df1["dateKey"],
"Assets Average":df1["assetsAverage"]
}
)
The error message I got was
TypeError: __init__() takes from 1 to 6 positional arguments but 112 were given
(I'm actually using more columns, but getting my point across only required a few.)
Edit #1:
This is what I am trying to do now:
df2 = df1.copy()
df2 = df2.rename(columns = {
"ticker":"Ticker",
"period":"Period",
"calendarDate":"Calendar Date",
"updated":"Updated",
"dateKey":"Date Key",
"assetsAverage":"Assets Average"
}
)
Unfortunately, I got the same error message as before. Any suggestions?
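Copy the dataframe and assign the new column names directly (the list must be in the same order as the existing columns):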
df2 = df1.copy()
df2.columns = ['ticker', 'period', 'Calendar Date', 'Updated', 'Date Key', 'Assets Average']
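Assigning a list relies on the column order staying fixed. As a hedged alternative, a rename with an explicit mapping does not depend on order. The sketch below rebuilds the one-row df1 from the question and uses the names from the desired output:

```python
import pandas as pd

# One-row stand-in for the asker's df1 (values copied from the question).
df1 = pd.DataFrame({
    'ticker': ['WMT'],
    'period': ['Q'],
    'calendarDate': ['2021-01-01'],
    'updated': ['2021-03-31'],
    'dateKey': ['2021-04-02'],
    'assetsAverage': [100000000],
})

# Old name -> new name; columns not listed here keep their current name.
column_names = {
    'calendarDate': 'Calendar Date',
    'updated': 'Updated',
    'dateKey': 'Date Key',
    'assetsAverage': 'Assets Average',
}

# rename returns a new dataframe, so df1 is left untouched.
df2 = df1.rename(columns=column_names)
print(df2.columns.tolist())
```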

pandas groupby keeping other columns

This question is similar to this one, but in my case I need to apply a function that returns a Series rather than a single value for each group — that question is about aggregating with sum, but I need to use rank (so the difference is like that between agg and transform).
I have data on firms over time. This generates some dummy data that looks like my use case:
import numpy as np
import pandas as pd
dates = pd.date_range('1926', '2020', freq='M')
ndates = len(dates)
nfirms = 5000
cols = list('ABCDE')
df = pd.DataFrame(np.random.randn(nfirms*ndates,len(cols)),
index=np.tile(dates,nfirms),
columns=cols)
df.insert(0, 'id', np.repeat(np.arange(nfirms), ndates))
I need to calculate ranks of column E within each date (the index), but keeping column id.
If I just use groupby and .rank I get this:
df.groupby(level=0)['E'].rank()
1926-01-31 3226.0
1926-02-28 1042.0
1926-03-31 1611.0
1926-04-30 2591.0
1926-05-31 30.0
...
2019-08-31 1973.0
2019-09-30 227.0
2019-10-31 4381.0
2019-11-30 1654.0
2019-12-31 1572.0
Name: E, Length: 5640000, dtype: float64
This has the same dimension as df but I'm not sure it's safe to merge on the index — I really need to join on the id column also. Can I assume that the order remains the same?
If the order in the output is the same as in the input, I think I can do this:
df['ranks'] = df.groupby(level=0)['E'].rank()
But something about this seems strange, and I assume there is a way to include additional columns in the groupby output.
(I'm also not clear if calling .rank() is equivalent to .transform('rank').)
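One way to check both points on a small frame: groupby rank is a transformation, so its result should carry the same index and row order as the input, which is what makes the direct assignment work. The sketch below uses a tiny made-up panel (4 firms, 3 dates) rather than the full data, and compares .rank() with .transform('rank'):

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31'])
nfirms = 4
rng = np.random.default_rng(0)

# Same layout as the question: repeated dates as the index, firm id as a column.
small = pd.DataFrame(
    {'id': np.repeat(np.arange(nfirms), len(dates)),
     'E': rng.standard_normal(nfirms * len(dates))},
    index=np.tile(dates, nfirms),
)

ranks = small.groupby(level=0)['E'].rank()
via_transform = small.groupby(level=0)['E'].transform('rank')

# The result is row-aligned with the input, so assignment keeps id next to its rank.
small['ranks'] = ranks
print(small.sort_index())
```

On this sample the two calls give identical Series, and each row's rank sits beside the id it belongs to.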

How to resample a dataframe with different functions applied to each column if we have more than 20 columns?

I know this question has been asked before. The answer is as follows:
df.resample('M').agg({'col1': np.sum, 'col2': np.mean})
But I have 27 columns, and I want to sum the first 25 and average the remaining two. Should I write out 'col1': np.sum through 'col25': np.sum for the 25 columns, and 'col26': np.mean, 'col27': np.mean for the other two?
My dataframe contains hourly data and I want to convert it to monthly data. I tried something like this, but it is nonsense:
for i in col_list:
df = df.resample('M').agg({i-2: np.sum, 'col26': np.mean, 'col27': np.mean})
Is there any shortcut for this situation?
You can try this, without a for loop:
sum_col = ['col1','col2','col3','col4', ...]
sum_df = df.resample('M')[sum_col].sum()
mean_col = ['col26','col27']
mean_df = df.resample('M')[mean_col].mean()
df = sum_df.join(mean_df)
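Another possible shortcut is to build the agg dictionary programmatically instead of typing 27 entries. This is a sketch on made-up hourly data with columns named col1 to col27 (an assumption about the real names). It uses the month-start alias 'MS' because the month-end alias differs between pandas versions, and the monthly sums and means are the same either way:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data covering January and February 2021, all ones.
idx = pd.date_range('2021-01-01', periods=24 * 59, freq='h')
columns = [f'col{i}' for i in range(1, 28)]
hourly = pd.DataFrame(np.ones((len(idx), len(columns))), index=idx, columns=columns)

# First 25 columns are summed, the last two are averaged.
agg_map = {col: 'sum' for col in hourly.columns[:25]}
agg_map.update({col: 'mean' for col in hourly.columns[25:]})

monthly = hourly.resample('MS').agg(agg_map)
print(monthly[['col1', 'col25', 'col26', 'col27']])
```

Slicing hourly.columns by position assumes the columns are already in the order you expect; select by name instead if they are not.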

Trouble working with date indexes with Multi-Index

I am trying to understand how the date-related features of indexing in pandas work.
If I have this data frame:
dates = pd.date_range('6/1/2000', periods=12, freq='M')
df1 = DataFrame(randn(12, 2), index=dates, columns=['A', 'B'])
I know that we can extract records from 2000 using df1['2000'] or a range of dates using df1['2000-09':'2001-03'].
But suppose instead I have a dataframe with a multi-index
index = pd.MultiIndex.from_arrays([dates, list('HIJKHIJKHIJK')], names=['date', 'id'])
df2 = DataFrame(randn(12, 2), index=index, columns=['C', 'D'])
Is there a way to extract rows with a year 2000 as we did with a single index? It appears that df2.xs('2000-06-30') works for accessing a particular date, but df2.xs('2000') does not return anything. Is xs not the right way to go about this?
You don't need to use xs for this, but you can index using .loc.
One of the examples you tried would then look like df2.loc['2000-09':'2001-03']. The only problem is that the 'partial string parsing' feature does not work yet when using a multi-index, so you have to provide actual datetimes:
In [17]: df2.loc[pd.Timestamp('2000-09'):pd.Timestamp('2001-04')]
Out[17]:
                      C         D
date       id
2000-09-30 K  -0.441505  0.364074
2000-10-31 H   2.366365 -0.404136
2000-11-30 I   0.371168  1.218779
2000-12-31 J  -0.579180  0.026119
2001-01-31 K   0.450040  1.048433
2001-02-28 H   1.090321  1.676140
2001-03-31 I  -0.272268  0.213227
But note that in this case pd.Timestamp('2001-03') would be interpreted as 2001-03-01 00:00:00 (an actual moment in time), so the March row would be excluded. Therefore, you have to adjust the start/stop values a little bit.
A selection for a full year (e.g. df1['2000']) would then become df2.loc[pd.Timestamp('2000'):pd.Timestamp('2001')] or df2.loc[pd.Timestamp('2000-01-01'):pd.Timestamp('2000-12-31')].
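The selections above can be tried end to end with the sketch below. It rebuilds df2 with month-end dates (constructed from month starts plus a MonthEnd offset, to avoid the version-dependent month-end alias) and also shows a boolean mask on the date level, which is an alternative not mentioned in the answer:

```python
import numpy as np
import pandas as pd

# Month-end dates from 2000-06-30 to 2001-05-31.
dates = pd.date_range('2000-06-01', periods=12, freq='MS') + pd.offsets.MonthEnd(1)
index = pd.MultiIndex.from_arrays([dates, list('HIJKHIJKHIJK')], names=['date', 'id'])
df2 = pd.DataFrame(np.random.randn(12, 2), index=index, columns=['C', 'D'])

# Slice the first (date) level with explicit Timestamps.
window = df2.loc[pd.Timestamp('2000-09'):pd.Timestamp('2001-04')]
year_2000 = df2.loc[pd.Timestamp('2000'):pd.Timestamp('2001')]

# Alternative: filter on the year of the date level with a boolean mask.
level_dates = df2.index.get_level_values('date')
year_2000_mask = df2[level_dates.year == 2000]

print(window)
print(year_2000)
```

The mask version does not need the index to be sorted, whereas label slicing on a MultiIndex does.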