building a DataFrame of a portfolio of symbols - pandas

I'm new to pandas.
I'd like to read the quotes for a number of symbols (e.g. ['SPY', 'IWM', 'QQQ']) from Yahoo (which I can do with no problem), and then I'd like to use only the 'Adj Close' columns to build a portfolio of ETFs over a given period of time.
Say that I'd like to start with an empty DataFrame whose index is the set of dates when the market is open, taken for example from the first df. Then I'd like to append one column at a time on the right, containing the 'Adj Close' of each symbol, renamed with the ticker name.
I'm sure it must be simple, but I can't get it. Can anybody help me? Thank you in advance.

If you are just using the Adj Close column, it is easiest to extract it immediately after reading the data.
from pandas_datareader import data as web  # pandas.io.data has been removed; use the pandas-datareader package
df = web.DataReader(['F', 'AAPL', 'IBM'], 'yahoo', '2016-05-02', '2016-05-06')['Adj Close']
>>> df
AAPL F IBM
Date
2016-05-02 93.073328 13.62 143.881476
2016-05-03 94.604009 13.43 142.752373
2016-05-04 93.620002 13.31 142.871221
2016-05-05 93.239998 13.32 145.070003
2016-05-06 92.720001 13.44 147.289993
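
If you would rather build the frame one column at a time, as the question describes, here is a minimal sketch (assuming the pandas-datareader package is installed and that the Yahoo endpoint still serves data; the symbols are the ones from the question):

import pandas as pd
from pandas_datareader import data as web  # assumption: pandas-datareader is installed

symbols = ['SPY', 'IWM', 'QQQ']
portfolio = pd.DataFrame()  # empty frame; its index is taken from the first column assigned

for sym in symbols:
    quotes = web.DataReader(sym, 'yahoo', '2016-05-02', '2016-05-06')
    portfolio[sym] = quotes['Adj Close']  # append one 'Adj Close' column, named after the ticker

Column assignment aligns on the index, so a symbol with missing dates simply gets NaN for those rows.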

Related

pd.read_csv - dates in pandas multiindex column names

I import a csv file into a pandas dataframe.
df = pd.read_csv('data.csv', index_col=[0], header=[0,1])
My data has a column MultiIndex with two levels. Level(0) contains strings and level(1) contains dates.
By default, these dates become strings when imported.
I would like to convert the level(1) column names to dates, either when I import the data (I cannot figure out the right way to do that from the documentation) or afterwards, if it is not possible during the import phase.
However, if I do:
df.columns.levels[1] = df.columns.levels[1].astype('datetime64[ns]')
I get an error saying that 'FrozenList' does not support mutable operations.
Is there a way to do that?
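One workaround (a sketch, not from the original thread): the levels of a MultiIndex are immutable, but MultiIndex.set_levels returns a new index with a level replaced, so the columns can be rebuilt with the dates parsed:

import pandas as pd

df = pd.read_csv('data.csv', index_col=[0], header=[0, 1])
# set_levels returns a new MultiIndex instead of mutating the FrozenList in place
df.columns = df.columns.set_levels(pd.to_datetime(df.columns.levels[1]), level=1)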
d = {'Ticker': ['ABBN.SW', 'ABBN.SW', 'ABBN.SW', 'ABBN.SW'],
     'Date': ['31/12/2021 00:00', '30/09/2021 00:00', '30/06/2021 00:00', '31/03/2021 00:00'],
     'investments': [-480000000, 251000000, 892000000, 162000000],
     'changeToLiabilities': [298000000, 52000000, 267000000, 42000000]}
pd.DataFrame(d)
Ticker Date investments changeToLiabilities
0 ABBN.SW 31/12/2021 00:00 -480000000 298000000
1 ABBN.SW 30/09/2021 00:00 251000000 52000000
2 ABBN.SW 30/06/2021 00:00 892000000 267000000
3 ABBN.SW 31/03/2021 00:00 162000000 42000000

Iterating over timeseries data in pandas

I am working with historical stock data stored in a dataframe, h2, that looks like so,
print (h2)
Open High Low Close Adj Close Volume
Date
2021-10-14 439.079987 442.660004 438.579987 442.500000 442.500000 70236800
2021-10-15 444.750000 446.260010 444.089996 445.869995 445.869995 66226800
2021-10-18 443.970001 447.549988 443.269989 447.190002 447.190002 62213200
2021-10-19 448.920013 450.709991 448.269989 450.640015 450.640015 46881100
2021-10-20 451.130005 452.730011 451.010010 451.875000 451.875000 21651910
I am trying to iterate over this data with a for loop, and I get an error relating to pandas Timestamps:
for d, r in h2.iterrows():
    print(d, h2[d])

KeyError: Timestamp('2021-10-14 00:00:00')
It seems that iterrows() is changing the type of the index value, so that it becomes inaccessible.
Is there a better way to iterate over a pandas timeseries?
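
A note on what is actually happening (an observation, not part of the original thread): iterrows() does not change the index; h2[d] fails because indexing a DataFrame with a single label looks d up among the columns, not the rows. The row is already available as r, or it can be fetched by label with .loc:

for d, r in h2.iterrows():
    print(d, r['Close'])  # r is the row as a Series, indexed by column name
    # equivalent label-based lookup: h2.loc[d, 'Close']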

Reshape Pandas dataframe (partial transpose)

I have a csv similar to the following, where the column heading specifies the time (hour number):
Day,Location,1,2,3
1/1/2021,A,0.26,0.25,0.49
1/1/2021,B,0.8,0.23,0.55
1/1/2021,C,0.32,0.11,0.58
1/2/2021,A,0.67,0.72,0.49
1/2/2021,B,0.25,0.09,0.56
1/2/2021,C,0.83,0.54,0.7
When I load it as a dataframe using
df = pd.read_csv(open('VirusLevels.csv', 'r'), index_col=[0,1], header=0)
Pandas creates a dataframe with indices Day and Location, and column names 1, 2, and 3.
I need it to be reshaped as shown below, where Day and Time are the indices, and the Location is the column heading:
I've tried a lot of things and followed a lot of rabbit holes, but haven't been successful. The most on-point example I could find suggested something like the following, but it doesn't work (it raises KeyError: 'Day'):
df.melt(id_vars=['Day'], var_name='Time',
        value_name='VirusLevels').sort_values(by='Location').reset_index(drop=True)
Thanks in advance for any help.
Try:
df = pd.read_csv('VirusLevels.csv', index_col=[0,1])
df.rename_axis(columns='Time').stack().unstack('Location')
# or
# df.rename_axis('Time',axis='columns').stack().unstack('Location')
Output:
Location         A     B     C
Day      Time
1/1/2021 1    0.26  0.80  0.32
         2    0.25  0.23  0.11
         3    0.49  0.55  0.58
1/2/2021 1    0.67  0.25  0.83
         2    0.72  0.09  0.54
         3    0.49  0.56  0.70
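
For completeness (a sketch, not part of the original answer): the melt attempt raised KeyError: 'Day' because Day and Location were read in as index levels, not columns. Resetting the index first makes that route work as well (pivot with a list of index columns needs pandas >= 1.1):

out = (df.reset_index()  # Day and Location become columns again
         .melt(id_vars=['Day', 'Location'], var_name='Time', value_name='VirusLevels')
         .pivot(index=['Day', 'Time'], columns='Location', values='VirusLevels'))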

"None of [Float64Index([56.0, ..\n dtype='float64', length=1057499)] are in the [columns]" Pandas dataframe

Please excuse any obvious mistakes, as I am new to Pandas and coding in general.
I am filtering the original dataframe and creating a copy with the chosen columns. This is what my dataframe looks like:
(dataframe filter routine):
df_new = df.filter(['date', 'location', 'value', 'lat_final', 'lon_final'], axis=1)
df_new = df_new.set_index('date')
print(df_new.head())
The new dataframe:
location value lat_final lon_final
date
2015-06-30 09:40:00+05:30 XYZI 56.0 28.6508 77.3152
2015-06-30 11:00:00+05:30 MNOP 36.0 28.6683 77.1167
2015-06-30 17:10:00+05:30 QRST 71.0 28.6508 77.3152
2015-06-30 11:00:00+05:30 UVWX 98.0 28.6508 77.3152
2015-06-30 09:40:00+05:30 XXYZ 26.0 28.6683 77.1167
While trying to perform some operations on columns in this new dataframe, I get the error quoted in the title. These are the operations I am performing:
(This step goes fine)
f=df_new[df_new['value']>=0]
f.drop(f[f['value'] >1500].index, inplace = True)
f.drop(f[f['value'] <2].index, inplace = True)
(The error crops up here):
# Filtering steps
# Step 1: group into 12-hour (or n-hour) intervals
diurnal = f[f['value']].resample('12h')
Where am I going wrong?
Any help will be much appreciated.
This: f[f['value']] will give you an error, because it tries to use the float values in the value column as column labels (hence the "None of [Float64Index(...)] are in the [columns]" message). If you want to resample the value column, you should select it properly, and also tell resample how you want to aggregate the values (sum, mean?). Something like this:
f['value'].resample('12h').sum()
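
If the goal is 12-hour averages rather than sums, or averages of every numeric column at once, a variation (an assumption about intent, not from the original answer):

f['value'].resample('12h').mean()           # 12-hour means of the value column
f.resample('12h').mean(numeric_only=True)   # 12-hour means of all numeric columns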

How to move the timestamp bounds for datetime in pandas (working with historical data)?

I'm working with historical data, and have some very old dates that are outside the timestamp bounds for pandas. I've consulted the pandas Time series/date functionality documentation, which has some information on out-of-bounds spans, but it still wasn't clear to me what, if anything, I could do to convert my data into a datetime type.
I've also seen a few threads on Stack Overflow about this, but they either just point out the problem (i.e. nanosecond resolution, a maximum range of roughly 584 years) or suggest setting errors='coerce', which turns 80% of my data into NaTs.
Is it possible to turn dates lower than the default Pandas lower bound into dates? Here's a sample of my data:
import pandas as pd

df = pd.DataFrame({'id': ['836', '655', '508', '793', '970', '1075', '1119', '969', '1166', '893'],
                   'date': ['1671-11-25', '1669-11-22', '1666-05-15', '1673-01-18', '1675-05-07',
                            '1677-02-08', '1678-02-08', '1675-02-15', '1678-11-28', '1673-12-23']})
You can create daily periods with a lambda function:
df['date'] = df['date'].apply(lambda x: pd.Period(x, freq='D'))
Or, as @Erfan mentioned in the comments (thank you):
df['date'] = df['date'].apply(pd.Period)
print (df)
id date
0 836 1671-11-25
1 655 1669-11-22
2 508 1666-05-15
3 793 1673-01-18
4 970 1675-05-07
5 1075 1677-02-08
6 1119 1678-02-08
7 969 1675-02-15
8 1166 1678-11-28
9 893 1673-12-23
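
Why this works, plus a vectorized variant (a sketch, not from the original answer): a Period with daily frequency is not stored as a nanosecond timestamp, so it is not subject to the Timestamp bounds (roughly 1677 to 2262). pd.PeriodIndex parses the whole column without the per-row apply:

df['date'] = pd.PeriodIndex(df['date'], freq='D')  # vectorized; handles dates outside Timestamp bounds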