How to sort 2 indexes in python - pandas

I am trying to achieve something like the following result:
date symbol Open High low close
2016-12-23 AAPL 804.6 809.9 800.5 809.1
CSCO 29.8 29.8 29.8 29.8
2016-12-27 AAPL 824.6 842.3 822.15 835.77
CSCO 29.32 29.9 29.3 29.85
Here is my code:
from datetime import datetime
from iexfinance.stocks import get_historical_data
from pandas_datareader import data
import matplotlib.pyplot as plt
import pandas as pd
start = '2014-01-01'
end = datetime.today().utcnow()
symbol = ['AAPL', 'MSFT']
out = pd.DataFrame()
datasets_test = []
for d in symbol:
    data_original = data.DataReader(d, 'iex', start, end)
    data_original['symbol'] = d
    data_original = data_original.set_index(['symbol'], append=True)
    out = pd.concat([out,data_original],axis=0)
out.sort_index()
print(out.tail(5))
and this is my outcome:
open high low close volume
date symbol
2019-02-11 MSFT 106.20 106.58 104.9650 105.25 18914123
2019-02-12 MSFT 106.14 107.14 105.4800 106.89 25056595
2019-02-13 MSFT 107.50 107.78 106.7100 106.81 18394869
2019-02-14 MSFT 106.31 107.29 105.6600 106.90 21784703
2019-02-15 MSFT 107.91 108.30 107.3624 108.22 26606886
I am trying to sort on both index levels (date + symbol) and am confused about how to use sort_index.
Thanks!
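One likely cause is that sort_index() returns a new, sorted DataFrame instead of sorting in place, so the result of out.sort_index() in the code above is discarded and tail(5) still shows the last symbol that was appended. A minimal sketch, assuming small made-up frames in place of the DataReader download (the close prices are copied from the desired output; prices and frames are illustrative names, not part of the original code):

```python
import pandas as pd

# Stand-in data (assumption): close prices taken from the desired output above.
prices = {
    "AAPL": {"2016-12-23": 809.1, "2016-12-27": 835.77},
    "CSCO": {"2016-12-23": 29.8, "2016-12-27": 29.85},
}

frames = []
for sym, closes in prices.items():
    part = pd.DataFrame({"close": closes})
    part.index = pd.to_datetime(part.index)
    part.index.name = "date"
    part["symbol"] = sym
    # Add symbol as a second index level next to date.
    frames.append(part.set_index("symbol", append=True))

# Concatenate once, then keep the sorted result: sort_index() is not in place.
out = pd.concat(frames).sort_index()
print(out)
```

By default sort_index() orders by every index level in turn, so rows are grouped by date first and by symbol within each date.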

Python DataFrame: How to write and read multiple tickers time-series dataframe?

This seems a fairly complicated dataframe using a simple download. After saving to file (to_csv), I can't seem to read it properly (read_csv) back into a dataframe as before. Please help.
import yfinance as yf
import pandas as pd
tickers=['AAPL', 'MSFT']
header = ['Open', 'High', 'Low', 'Close', 'Adj Close']
df = yf.download(tickers, period='1y')[header]
df.to_csv("data.csv", index=True)
dfr = pd.read_csv("data.csv")
dfr = dfr.set_index('Date')
print(dfr)
KeyError: "None of ['Date'] are in the columns"
Note:
df: Date is the Index
Open High
AAPL MSFT AAPL MSFT
Date
2022-02-07 172.86 306.17 173.95 307.84
2022-02-08 171.73 301.25 175.35 305.56
2022-02-09 176.05 309.87 176.65 311.93
2022-02-10 174.14 304.04 175.48 309.12
2022-02-11 172.33 303.19 173.08 304.29
But dfr (after read_csv)
Unnamed: 0 Open ... High High.1
0 NaN AAPL ... AAPL MSFT
1 Date NaN ... NaN NaN
2 2022-02-07 172.86 ... 173.94 307.83
3 2022-02-08 171.72 ... 175.35 305.55
4 2022-02-09 176.05 ... 176.64 311.92
How to make dfr like df?
I ran the code but got this error:
KeyError: "None of ['Date'] are in the columns"
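The CSV written from a frame with two column levels has two header rows, followed by a row that holds the index name, so read_csv needs to be told about that layout. A minimal sketch, assuming a small hand-built frame and an in-memory buffer in place of the yfinance download and "data.csv" (the numbers are placeholders, not real quotes):

```python
from io import StringIO
import pandas as pd

# Stand-in for the yfinance download (assumption): two column levels, Date index.
columns = pd.MultiIndex.from_product([["Open", "High"], ["AAPL", "MSFT"]])
index = pd.DatetimeIndex(["2022-02-07", "2022-02-08"], name="Date")
df = pd.DataFrame(
    [[172.5, 306.25, 173.75, 307.5], [171.5, 301.25, 175.25, 305.5]],
    index=index,
    columns=columns,
)

buffer = StringIO()  # in-memory replacement for "data.csv"
df.to_csv(buffer)
buffer.seek(0)

# header=[0, 1]: the first two rows form the column MultiIndex.
# index_col=0: the first column holds the Date index.
dfr = pd.read_csv(buffer, header=[0, 1], index_col=0)
dfr.index = pd.to_datetime(dfr.index)  # dates are read back as strings
print(dfr)
```

With a real file, the same two keyword arguments would be passed to pd.read_csv("data.csv", ...).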

window function for moving average

I am trying to replicate SQL's window function in pandas.
SELECT avg(totalprice) OVER (
PARTITION BY custkey
ORDER BY orderdate
RANGE BETWEEN interval '1' month PRECEDING AND CURRENT ROW)
FROM orders
I have this dataframe:
from io import StringIO
import pandas as pd
myst="""cust_1,2020-10-10,100
cust_2,2020-10-10,15
cust_1,2020-10-15,200
cust_1,2020-10-16,240
cust_2,2020-12-20,25
cust_1,2020-12-25,140
cust_2,2021-01-01,5
"""
u_cols=['customer_id', 'date', 'price']
df = pd.read_csv(StringIO(myst), sep=',', names = u_cols)
df=df.sort_values(list(df.columns))
And after calculating moving average restricted to last 1 month, it will look like this...
from io import StringIO
import pandas as pd
myst="""cust_1,2020-10-10,100,100
cust_2,2020-10-10,15,15
cust_1,2020-10-15,200,150
cust_1,2020-10-16,240,180
cust_2,2020-12-20,25,25
cust_1,2020-12-25,140,140
cust_2,2021-01-01,5,15
"""
u_cols=['customer_id', 'date', 'price', 'my_average']
my_df = pd.read_csv(StringIO(myst), sep=',', names = u_cols)
my_df=my_df.sort_values(list(my_df.columns))
As shown in this image:
https://trino.io/assets/blog/window-features/running-average-range.svg
I tried to write a function like this...
import numpy as np
def mylogic(myro):
    mylist = list()
    mydate = myro['date'][0]
    for i in range(len(myro)):
        if myro['date'][i] > mydate:
            mylist.append(myro['price'][i])
            mydate = myro['date'][i]
    return np.mean(mylist)
But that raised a KeyError.
You can use rolling with a 30-day window within each customer group. Note that '30D' only approximates a calendar month, and .astype(int) truncates any fractional average:
df['date'] = pd.to_datetime(df['date'])
df['my_average'] = (df.groupby('customer_id')
                      .apply(lambda d: d.rolling('30D', on='date')['price'].mean())
                      .reset_index(level=0, drop=True)
                      .astype(int)
                   )
Output:
customer_id date price my_average
0 cust_1 2020-10-10 100 100
2 cust_1 2020-10-15 200 150
3 cust_1 2020-10-16 240 180
5 cust_1 2020-12-25 140 140
1 cust_2 2020-10-10 15 15
4 cust_2 2020-12-20 25 25
6 cust_2 2021-01-01 5 15
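If the window has to be a true calendar month, as in the SQL RANGE clause, rather than 30 days, one option is to build the window explicitly with pd.DateOffset. This is only a sketch under that assumption: month_window_mean is a hypothetical helper name, it treats both ends of the window as inclusive, and it loops over rows, so it suits small frames rather than large ones.

```python
from io import StringIO
import pandas as pd

raw = """cust_1,2020-10-10,100
cust_2,2020-10-10,15
cust_1,2020-10-15,200
cust_1,2020-10-16,240
cust_2,2020-12-20,25
cust_1,2020-12-25,140
cust_2,2021-01-01,5
"""
df = pd.read_csv(StringIO(raw), names=["customer_id", "date", "price"])
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(["customer_id", "date"])


def month_window_mean(group):
    """Mean price over [date - 1 calendar month, date] for each row of one customer."""
    means = []
    for current in group["date"]:
        start = current - pd.DateOffset(months=1)
        in_window = (group["date"] >= start) & (group["date"] <= current)
        means.append(group.loc[in_window, "price"].mean())
    return pd.Series(means, index=group.index)


# Compute per customer, then align back to df by the original row labels.
df["my_average"] = pd.concat(
    [month_window_mean(group) for _, group in df.groupby("customer_id")]
)
print(df)
```

On this sample the result matches the 30-day version, because no pair of orders falls between 30 days and one calendar month apart.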

Pandas SetIndex with DatetimeIndex

I have a CSV file with the following contents:
Symbol, Date, Unix_Tick, OpenPrice, HighPrice, LowPrice, ClosePrice, volume,
AAPL, 2021-01-04 09:00:00, 1609750800, 133.31, 133.49, 133.02, 133.49, 25000
AAPL, 2021-01-04 09:01:00, 1609750860, 133.49, 133.49, 133.49, 133.49, 700
AAPL, 2021-01-04 09:02:00, 1609750920, 133.6, 133.6, 133.5, 133.5, 500
So I attempt to create a pandas index using Date like this
import pandas as pd
import numpy as np
df = pd.read_csv(csvFile)
df = df.set_index(pd.DatetimeIndex(df["Date"]))
I get KeyError: 'Date'
This happens because the file is not strictly comma-separated: each comma is followed by a space, so the column is actually named " Date" (with a leading space).
You can either strip the whitespace from the column names:
df = pd.read_csv(csvFile)
df.columns = df.columns.str.strip()
df = df.set_index(pd.DatetimeIndex(df["Date"]))
or read the CSV file with ", " as the separator (a multi-character separator needs the Python parsing engine):
df = pd.read_csv(csvFile, sep=", ", engine="python")
df = df.set_index(pd.DatetimeIndex(df["Date"]))
The problem is most probably the space after each comma. You can try loading the data with a custom sep= parameter:
df = pd.read_csv("a1.txt", sep=r",\s+", engine="python")
df = df.set_index(pd.DatetimeIndex(df["Date"]))
print(df)
Prints:
Symbol Date Unix_Tick OpenPrice HighPrice LowPrice ClosePrice volume,
Date
2021-01-04 09:00:00 AAPL 2021-01-04 09:00:00 1609750800 133.31 133.49 133.02 133.49 25000
2021-01-04 09:01:00 AAPL 2021-01-04 09:01:00 1609750860 133.49 133.49 133.49 133.49 700
2021-01-04 09:02:00 AAPL 2021-01-04 09:02:00 1609750920 133.60 133.60 133.50 133.50 500
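Another option that may be worth trying is read_csv's skipinitialspace parameter, which skips the spaces after each delimiter and keeps the default comma separator. A minimal sketch, assuming the sample rows from the question are held in an in-memory string instead of the real file:

```python
from io import StringIO
import pandas as pd

# Sample rows from the question; StringIO stands in for the real file.
csv_text = """Symbol, Date, Unix_Tick, OpenPrice, HighPrice, LowPrice, ClosePrice, volume,
AAPL, 2021-01-04 09:00:00, 1609750800, 133.31, 133.49, 133.02, 133.49, 25000
AAPL, 2021-01-04 09:01:00, 1609750860, 133.49, 133.49, 133.49, 133.49, 700
AAPL, 2021-01-04 09:02:00, 1609750920, 133.6, 133.6, 133.5, 133.5, 500
"""

df = pd.read_csv(StringIO(csv_text), skipinitialspace=True, parse_dates=["Date"])
# The trailing comma in the header may yield an empty "Unnamed: ..." column;
# drop it if present.
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]
df = df.set_index("Date")
print(df)
```

Unlike the DatetimeIndex(df["Date"]) variants, set_index("Date") moves the column into the index, so Date does not appear twice in the output.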

TypeError: '<=' not supported between instances of 'Timestamp' and 'numpy.float64'

I am trying to plot using hvplot, and I am getting this:
TypeError: '<=' not supported between instances of 'Timestamp' and 'numpy.float64'
Here is my data:
TimeConv Hospitalizations
1 2020-04-04 827
2 2020-04-05 1132
3 2020-04-06 1153
4 2020-04-07 1252
5 2020-04-08 1491
... ... ...
71 2020-06-13 2242
72 2020-06-14 2287
73 2020-06-15 2326
74 NaT NaN
75 NaT NaN
Below is my code:
import numpy as np
import matplotlib.pyplot as plt
import xlsxwriter
import pandas as pd
from pandas import DataFrame
path = ('Casecountdata.xlsx')
xl = pd.ExcelFile(path)
df1 = xl.parse('Hospitalization by Day')
df2 = df1[['Unnamed: 1','Unnamed: 2']]
df2 = df2.drop(df2.index[0])
df2 = df2.rename(columns={"Unnamed: 1": "Time", "Unnamed: 2": "Hospitalizations"})
df2['TimeConv'] = pd.to_datetime(df2.Time)
df3 = df2[['TimeConv','Hospitalizations']]
When I take a sample of your data above and try to plot it, it works for me, so something may be going wrong when you read the data from Excel into pandas. Run df.info() to check the dtypes: column TimeConv should be datetime64[ns] and column Hospitalizations should be int64 (or float). It could also be a version problem, so check that you have the latest versions of hvplot and its dependencies installed. My guess, though, is that your data does not have the expected dtypes.
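The dtype check described above can be sketched as follows. This assumes, without being able to confirm it from the question, that the Excel sheet delivers object-typed columns with empty trailing rows; df2 here is a hypothetical stand-in for the parsed sheet, and clean is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for the frame parsed from Excel: everything is object dtype.
df2 = pd.DataFrame(
    {
        "Time": ["2020-04-04", "2020-04-05", "2020-06-15", None],
        "Hospitalizations": ["827", "1132", "2326", None],
    },
    dtype=object,
)

clean = pd.DataFrame(
    {
        # Unparseable or missing entries become NaT / NaN instead of raising.
        "TimeConv": pd.to_datetime(df2["Time"], errors="coerce"),
        "Hospitalizations": pd.to_numeric(df2["Hospitalizations"], errors="coerce"),
    }
)
# Drop the trailing NaT/NaN rows so only plottable points remain.
clean = clean.dropna(subset=["TimeConv", "Hospitalizations"])
clean.info()
```

If the plot works on a frame cleaned this way, the original error most likely came from the dtypes or the empty rows rather than from hvplot itself.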
In any case, when I run the following, it works and plots your data:
# import libraries
import pandas as pd
import hvplot.pandas
import holoviews as hv
hv.extension('bokeh')
from io import StringIO # need this to read your text data
# your sample data
text_data = StringIO("""
column1 TimeConv Hospitalizations
1 2020-04-04 827
2 2020-04-05 1132
72 2020-06-14 2287
73 2020-06-15 2326
74 NaT NaN
""")
# read text data to dataframe
df = pd.read_csv(text_data, sep=r"\s+")
df['TimeConv'] = pd.to_datetime(df.TimeConv, yearfirst=True)
# shortly checkout datatypes of your data
df.info()
# create scatter plot of your data
df.hvplot.scatter(
x='TimeConv',
y='Hospitalizations',
width=500,
title='Showing hospitalizations over time',
)
This code produces a scatter plot of Hospitalizations against TimeConv.

Combine Pandas DataFrames while creating MultiIndex Columns

I have two DataFrames, something like this:
import pandas as pd
dates = pd.Index(['2016-10-03', '2016-10-04', '2016-10-05'], name='Date')
close = pd.DataFrame( {'AAPL': [112.52, 113., 113.05],
'CSCO': [ 31.5, 31.35, 31.59 ],
'MSFT': [ 57.42, 57.24, 57.64 ] }, index = dates )
volume= pd.DataFrame( {'AAPL': [21701800, 29736800, 21453100] ,
'CSCO': [14070500, 18460400, 11808600] ,
'MSFT': [19189500, 20085900, 16726400] }, index = dates )
The output of DataFrame 'close' looks like this:
AAPL CSCO MSFT
Date
2016-10-03 112.52 31.50 57.42
2016-10-04 113.00 31.35 57.24
2016-10-05 113.05 31.59 57.64
And the output of DataFrame 'volume' looks like this:
AAPL CSCO MSFT
Date
2016-10-03 21701800 14070500 19189500
2016-10-04 29736800 18460400 20085900
2016-10-05 21453100 11808600 16726400
I would like to combine these two DataFrames into a single DataFrame with MultiIndex COLUMNS so that it looks like this:
AAPL CSCO MSFT
Close Volume Close Volume Close Volume
Date
2016-10-03 112.52 21701800 31.50 14070500 57.42 19189500
2016-10-04 113.00 29736800 31.35 18460400 57.24 20085900
2016-10-05 113.05 21453100 31.59 11808600 57.64 16726400
Can anyone give me an idea how to do that? I've been playing with pd.concat and pd.merge, but it's not clear to me how to get it to line up on the date index and allow me to provide names for the sub-index ('Close' and 'Volume') on the columns.
You can use the keys kwarg of concat:
In [11]: res = pd.concat([close, volume], axis=1, keys=["close", "volume"])
In [12]: res
Out[12]:
close volume
AAPL CSCO MSFT AAPL CSCO MSFT
Date
2016-10-03 112.52 31.50 57.42 21701800 14070500 19189500
2016-10-04 113.00 31.35 57.24 29736800 18460400 20085900
2016-10-05 113.05 31.59 57.64 21453100 11808600 16726400
With a little rearrangement:
In [13]: res.swaplevel(0, 1, axis=1).sort_index(axis=1)
Out[13]:
AAPL CSCO MSFT
close volume close volume close volume
Date
2016-10-03 112.52 21701800 31.50 14070500 57.42 19189500
2016-10-04 113.00 29736800 31.35 18460400 57.24 20085900
2016-10-05 113.05 21453100 31.59 11808600 57.64 16726400
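Put together as a standalone script, the same idea looks roughly like this. It is a sketch with two small variations on the session above: the keys are capitalized to match the labels requested in the question, and the levels are swapped on the columns object directly rather than through DataFrame.swaplevel.

```python
import pandas as pd

dates = pd.Index(["2016-10-03", "2016-10-04", "2016-10-05"], name="Date")
close = pd.DataFrame(
    {"AAPL": [112.52, 113.0, 113.05],
     "CSCO": [31.5, 31.35, 31.59],
     "MSFT": [57.42, 57.24, 57.64]},
    index=dates,
)
volume = pd.DataFrame(
    {"AAPL": [21701800, 29736800, 21453100],
     "CSCO": [14070500, 18460400, 11808600],
     "MSFT": [19189500, 20085900, 16726400]},
    index=dates,
)

# keys= adds an outer column level: ("Close", ticker) and ("Volume", ticker).
res = pd.concat([close, volume], axis=1, keys=["Close", "Volume"])
# Move the ticker to the outer level, then group the columns by ticker.
res.columns = res.columns.swaplevel(0, 1)
res = res.sort_index(axis=1)
print(res)
```

Because concat aligns on the shared Date index, rows from close and volume line up by date without an explicit merge.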