Combine Pandas DataFrames while creating MultiIndex Columns - pandas

I have two DataFrames, something like this:
import pandas as pd

dates = pd.Index(['2016-10-03', '2016-10-04', '2016-10-05'], name='Date')
close = pd.DataFrame({'AAPL': [112.52, 113.00, 113.05],
                      'CSCO': [31.50, 31.35, 31.59],
                      'MSFT': [57.42, 57.24, 57.64]}, index=dates)
volume = pd.DataFrame({'AAPL': [21701800, 29736800, 21453100],
                       'CSCO': [14070500, 18460400, 11808600],
                       'MSFT': [19189500, 20085900, 16726400]}, index=dates)
The output of DataFrame 'close' looks like this:
AAPL CSCO MSFT
Date
2016-10-03 112.52 31.50 57.42
2016-10-04 113.00 31.35 57.24
2016-10-05 113.05 31.59 57.64
And the output of DataFrame 'volume' looks like this:
AAPL CSCO MSFT
Date
2016-10-03 21701800 14070500 19189500
2016-10-04 29736800 18460400 20085900
2016-10-05 21453100 11808600 16726400
I would like to combine these two DataFrames into a single DataFrame with MultiIndex COLUMNS so that it looks like this:
AAPL CSCO MSFT
Close Volume Close Volume Close Volume
Date
2016-10-03 112.52 21701800 31.50 14070500 57.42 19189500
2016-10-04 113.00 29736800 31.35 18460400 57.24 20085900
2016-10-05 113.05 21453100 31.59 11808600 57.64 16726400
Can anyone give me an idea how to do that? I've been playing with pd.concat and pd.merge, but it's not clear to me how to get them to line up on the date index and let me provide names for the sub-index ('Close' and 'Volume') on the columns.

You can use the keys kwarg of concat:
In [11]: res = pd.concat([close, volume], axis=1, keys=["close", "volume"])
In [12]: res
Out[12]:
close volume
AAPL CSCO MSFT AAPL CSCO MSFT
Date
2016-10-03 112.52 31.50 57.42 21701800 14070500 19189500
2016-10-04 113.00 31.35 57.24 29736800 18460400 20085900
2016-10-05 113.05 31.59 57.64 21453100 11808600 16726400
With a little rearrangement:
In [13]: res.swaplevel(0, 1, axis=1).sort_index(axis=1)
Out[13]:
AAPL CSCO MSFT
close volume close volume close volume
Date
2016-10-03 112.52 21701800 31.50 14070500 57.42 19189500
2016-10-04 113.00 29736800 31.35 18460400 57.24 20085900
2016-10-05 113.05 21453100 31.59 11808600 57.64 16726400
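If you also want to name the levels of the resulting column MultiIndex (the "sub-index" names the asker mentions), concat takes a names kwarg alongside keys. A minimal sketch; the level names "Field" and "Ticker" are illustrative, not from the thread:

res = pd.concat([close, volume], axis=1, keys=["Close", "Volume"],
                names=["Field", "Ticker"])       # label both column levels
res = res.swaplevel(0, 1, axis=1).sort_index(axis=1)  # ticker on top, field below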

Related

Python DataFrame: How to write and read multiple tickers time-series dataframe?

This simple download produces a fairly complicated dataframe (the columns are a MultiIndex). After saving it to a file (to_csv), I can't seem to read it properly (read_csv) back into a dataframe as before. Please help.
import yfinance as yf
import pandas as pd

tickers = ['AAPL', 'MSFT']
header = ['Open', 'High', 'Low', 'Close', 'Adj Close']
df = yf.download(tickers, period='1y')[header]
df.to_csv("data.csv", index=True)
dfr = pd.read_csv("data.csv")
dfr = dfr.set_index('Date')
print(dfr)
KeyError: "None of ['Date'] are in the columns"
Note: in df, Date is the index:
Open High
AAPL MSFT AAPL MSFT
Date
2022-02-07 172.86 306.17 173.95 307.84
2022-02-08 171.73 301.25 175.35 305.56
2022-02-09 176.05 309.87 176.65 311.93
2022-02-10 174.14 304.04 175.48 309.12
2022-02-11 172.33 303.19 173.08 304.29
But dfr (after read_csv) looks like this:
Unnamed: 0 Open ... High High.1
0 NaN AAPL ... AAPL MSFT
1 Date NaN ... NaN NaN
2 2022-02-07 172.86 ... 173.94 307.83
3 2022-02-08 171.72 ... 175.35 305.55
4 2022-02-09 176.05 ... 176.64 311.92
How can I make dfr look like df again?
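One common fix, sketched here as an assumption rather than an answer from the thread: to_csv writes the MultiIndex columns as two header rows, so tell read_csv to read them back that way and to use the first column as the date index:

import pandas as pd

# Sketch: two header rows -> column MultiIndex; column 0 -> Date index
dfr = pd.read_csv("data.csv", header=[0, 1], index_col=0, parse_dates=True)
print(dfr.head())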

Create a new column based on multiple conditions between two different dataframes with different dimensions

I am trying to build a stock portfolio. I have mainly 2 dfs: 1 with my transactions and 1 with the stock prices of the individual stocks.
My transactions df looks like this:
Date Ticker Position
0 2022-11-01 MSFT 20
1 2022-11-15 PG 10
2 2022-11-25 JNJ 10
3 2022-11-22 MSFT 10
The Position column indicates how many shares were purchased. So in the row labeled 3 I bought a second position in MSFT, adding 10 shares and taking my total to 30 shares of MSFT.
My stock prices df looks like this (from yfinance):
Ticker Adj Close
Date
2022-11-01 MSFT 227.528793
2022-11-02 MSFT 219.481476
2022-11-03 MSFT 213.647903
2022-11-04 MSFT 220.767838
2022-11-07 MSFT 227.229630
... ... ...
2022-12-05 JNJ 178.779999
2022-12-06 JNJ 176.100006
2022-12-07 JNJ 177.169998
2022-12-08 JNJ 177.199997
2022-12-09 JNJ 175.740005
I would like to add a column called Position to my stock prices df, showing the position held at each date.
I think this should not be too difficult with a double condition:
If prices_date >= transaction_date AND transaction_ticker = prices_ticker
THEN prices_position = prices_position + transaction_position
I thought of initially loading the new column with all zeros, which should allow for a simple addition (or subtraction if shares were sold).
I tried resetting the index to make the date easier to compare against, and nested for loops, but only errors occurred. I cannot wrap my head around how to do it in Python, nor have I found an answer online.
Any suggestions are appreciated.
Here's a solution that relies on merge, fillna, and groupby.cumsum():
import pandas as pd

df1 = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2022-11-01", "2022-11-15", "2022-12-07", "2022-11-04"]
        ),
        "Ticker": ["MSFT", "JNJ", "JNJ", "MSFT"],
        "Position": [20, 10, 10, 10],
    }
)

df2 = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2022-11-01", "2022-11-02", "2022-11-03", "2022-11-04", "2022-11-07",
             "2022-12-05", "2022-12-06", "2022-12-07", "2022-12-08", "2022-12-09"]
        ),
        "Ticker": ["MSFT"] * 5 + ["JNJ"] * 5,
        "Close": [227.528793, 219.481476, 213.647903, 220.767838, 227.229630,
                  178.779999, 176.100006, 177.169998, 177.199997, 175.740005],
    }
)

# Outer merge lines up transactions with price rows on (Ticker, Date);
# dates present in only one frame get NaN in the other frame's columns.
df3 = df2.merge(df1, how='outer', on=['Ticker', 'Date'])
df3 = df3.sort_values(by='Date')
df3 = df3.dropna(axis=0, subset=['Close'])       # drop transaction-only rows
df3['Position'] = df3['Position'].fillna(0)      # no trade that day -> 0 change
df3['Position'] = df3.groupby('Ticker')['Position'].cumsum()  # running total per ticker
df3
Result:
Date Ticker Close Position
0 2022-11-01 MSFT 227.528793 20.0
1 2022-11-02 MSFT 219.481476 20.0
2 2022-11-03 MSFT 213.647903 20.0
3 2022-11-04 MSFT 220.767838 30.0
4 2022-11-07 MSFT 227.22963 30.0
5 2022-12-05 JNJ 178.779999 0.0
6 2022-12-06 JNJ 176.100006 0.0
7 2022-12-07 JNJ 177.169998 10.0
8 2022-12-08 JNJ 177.199997 10.0
9 2022-12-09 JNJ 175.740005 10.0
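One caveat worth flagging (my observation, not from the thread): because dropna runs before the cumsum, a purchase dated where the price frame has no row is silently discarded; in the sample data above, the 2022-11-15 JNJ purchase is lost this way, so JNJ tops out at 10 instead of 20. A sketch of a safer ordering, accumulating before dropping:

df3 = df2.merge(df1, how='outer', on=['Ticker', 'Date']).sort_values('Date')
df3['Position'] = df3['Position'].fillna(0)                   # no trade -> 0 change
df3['Position'] = df3.groupby('Ticker')['Position'].cumsum()  # accumulate first
df3 = df3.dropna(subset=['Close'])                            # then drop price-less rows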

Select rows in dataframes inside a list then append to another dataframe inside another list

I have daily stock data inside a list of n dataframes (each stock has its own dataframe). I want to select m rows at equal time intervals from each dataframe and append them to dataframes inside another list. Basically the new list should have m dataframes (one per sampled day), and each dataframe should have n rows (one per stock).
I tried nested for loops, but it just didn't work:
cross_section = []
cross_sections_list = []
for m in range(0, len(datalist[0]), 100):
    for n in range(len(datalist)):
        cross_section.append(datalist[n].iloc[m])
    cross_sections_list.append(cross_section)
This code didn't do anything; my machine just got stuck on it. If there is another way, like multi-indexing for example, I would love to try that too.
For example
input:
[
Adj Close Ticker
Date
2020-06-01 321.850006 AAPL
2020-06-02 323.339996 AAPL
2020-06-03 325.119995 AAPL
2020-06-04 322.320007 AAPL
2020-06-05 331.500000 AAPL
2020-06-08 333.459991 AAPL
2020-06-09 343.989990 AAPL
2020-06-10 352.839996 AAPL ,
Adj Close Ticker
Date
2020-06-01 182.830002 MSFT
2020-06-02 184.910004 MSFT
2020-06-03 185.360001 MSFT
2020-06-04 182.919998 MSFT
2020-06-05 187.199997 MSFT
2020-06-08 188.360001 MSFT
2020-06-09 189.800003 MSFT
2020-06-10 196.839996 MSFT ]
output:
[
Adj Close Ticker
Date
2020-06-01 321.850006 AAPL
2020-06-01 182.830002 MSFT ,
Adj Close Ticker
Date
2020-06-03 325.119995 AAPL
2020-06-03 185.360001 MSFT ,
Adj Close Ticker
Date
2020-06-05 331.500000 AAPL
2020-06-05 187.199997 MSFT ]
and so on.
Thank you
Not exactly clear what you want, but here is some code that hopefully helps.
import pandas as pd

list_of_df = []  # list of all the df's
alldf = pd.concat(list_of_df)  # brings all df's into one df
list_of_grouped = [y for x, y in alldf.groupby('Date')]  # separates df's by date and puts them in a list
number_of_days = alldf.groupby('Date').ngroups  # total number of groups (dates)
list_of_Dates = [x for x, y in alldf.groupby('Date')]  # list of all the dates that were grouped
count_of_stocks = []
for i in range(len(list_of_grouped)):
    count_of_stocks.append(len(list_of_grouped[i]))  # puts the row count of each grouped df into a list
zipped = list(zip(list_of_Dates, count_of_stocks))  # pairs each date with its stock count
data_global = pd.concat(datalist)  # first merge all dataframes into one
                                   # (pd.concat, since DataFrame.append is deprecated)
data_by_date = [i for x, i in data_global.groupby('Date')]  # step 2: group the data by date
each_nth_day = []
for i in range(0, len(data_by_date), 21):
    each_nth_day.append(data_by_date[i])  # lastly take every n-th dataframe (21 in this case)
Thanks for your help, user13802115.
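Since multi-indexing came up: a minimal sketch of that route (my addition, assuming each dataframe in datalist is indexed by 'Date' as in the example input):

import pandas as pd

alldf = pd.concat(datalist)                       # one long frame, still indexed by Date
dates = alldf.index.unique().sort_values()[::21]  # every 21st distinct trading day
cross_sections_list = [alldf.loc[[d]] for d in dates]  # one cross-section df per date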

How to sort 2 indexes in python

I am trying to achieve something like the following result:
date symbol open high low close
2016-12-23 AAPL 804.6 809.9 800.5 809.1
CSCO 29.8 29.8 29.8 29.8
2016-12-27 AAPL 824.6 842.3 822.15 835.77
CSCO 29.32 29.9 29.3 29.85
Here is my code:
from datetime import datetime
from pandas_datareader import data
import pandas as pd

start = '2014-01-01'
end = datetime.today().utcnow()
symbol = ['AAPL', 'MSFT']
out = pd.DataFrame()
for d in symbol:
    data_original = data.DataReader(d, 'iex', start, end)
    data_original['symbol'] = d
    data_original = data_original.set_index(['symbol'], append=True)
    out = pd.concat([out, data_original], axis=0)
out.sort_index()
print(out.tail(5))
and this is my outcome:
open high low close volume
date symbol
2019-02-11 MSFT 106.20 106.58 104.9650 105.25 18914123
2019-02-12 MSFT 106.14 107.14 105.4800 106.89 25056595
2019-02-13 MSFT 107.50 107.78 106.7100 106.81 18394869
2019-02-14 MSFT 106.31 107.29 105.6600 106.90 21784703
2019-02-15 MSFT 107.91 108.30 107.3624 108.22 26606886
I am trying to sort on both index levels (date + symbol) and I am getting confused about how to use the sort.
thanks!
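A likely culprit (my note; the thread's answer isn't shown here): sort_index returns a new, sorted DataFrame rather than sorting in place, so the result has to be assigned back:

out = out.sort_index(level=['date', 'symbol'])  # assign the sorted frame back
print(out.tail(5))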

Not getting top5 values for each month using grouper and groupby in pandas

I'm trying to get the top 5 values of amount for each month, along with the text column. I've tried resampling and a groupby statement.
Dataset:
text     amount  date
123…     11.00   11-05-17
123abc…  10.00   11-08-17
Xyzzy…   22.00   12-07-17
Xyzzy…   221.00  11-08-17
Xyzzy…   212.00  10-08-17
Xyzzy…   242.00  18-08-17
Code:
df1 = df.groupby(['text', pd.Grouper(key='date', freq='M')])['amount'].apply(lambda x: x.nlargest(5))
I get groups of text, but they are not arranged by month and the largest values are not sorted in descending order.
df1 = df.groupby([pd.Grouper(key='date', freq='M')])['amount'].apply(lambda x: x.nlargest(5))
This code works fine but does not give the text column.
Assuming that amount is a numeric column:
In [8]: df.groupby(['text', pd.Grouper(key='date', freq='M')]).apply(lambda x: x.nlargest(2, 'amount'))
Out[8]:
text amount date
text date
123abc… 2017-11-30 1 123abc… 10.0 2017-11-08
123… 2017-11-30 0 123… 11.0 2017-11-05
Xyzzy… 2017-08-31 5 Xyzzy… 242.0 2017-08-18
2017-10-31 4 Xyzzy… 212.0 2017-10-08
2017-11-30 3 Xyzzy… 221.0 2017-11-08
2017-12-31 2 Xyzzy… 22.0 2017-12-07
You can use head with sort_values:
df1 = df.sort_values('amount', ascending=False).groupby(['text', pd.Grouper(key='date', freq='M')]).head(2)
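A closing note of my own (not from the thread): both snippets above group by text as well as month, so they return the top rows per (text, month) pair. If the goal is the top 5 amounts per month across all texts, with the text column retained, a sketch:

# Sketch: top 5 amounts within each month, all columns kept
df1 = (df.sort_values('amount', ascending=False)
         .groupby(pd.Grouper(key='date', freq='M'))
         .head(5)
         .sort_values(['date', 'amount'], ascending=[True, False]))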