A simple yfinance download produces a fairly complicated dataframe. After saving it to a file (to_csv), I can't seem to read it back (read_csv) into the same dataframe as before. Please help.
import yfinance as yf
import pandas as pd
tickers=['AAPL', 'MSFT']
header = ['Open', 'High', 'Low', 'Close', 'Adj Close']
df = yf.download(tickers, period='1y')[header]
df.to_csv("data.csv", index=True)
dfr = pd.read_csv("data.csv")
dfr = dfr.set_index('Date')
print(dfr)
KeyError: "None of ['Date'] are in the columns"
Note:
df: Date is the Index
Open High
AAPL MSFT AAPL MSFT
Date
2022-02-07 172.86 306.17 173.95 307.84
2022-02-08 171.73 301.25 175.35 305.56
2022-02-09 176.05 309.87 176.65 311.93
2022-02-10 174.14 304.04 175.48 309.12
2022-02-11 172.33 303.19 173.08 304.29
But dfr (after read_csv)
Unnamed: 0 Open ... High High.1
0 NaN AAPL ... AAPL MSFT
1 Date NaN ... NaN NaN
2 2022-02-07 172.86 ... 173.94 307.83
3 2022-02-08 171.72 ... 175.35 305.55
4 2022-02-09 176.05 ... 176.64 311.92
How to make dfr like df?
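One possible way to get the original layout back is to tell read_csv that the file has two header rows and that the first column is the index. The sketch below uses a small hand-built frame as a stand-in for the yfinance download (no network access is assumed), and writes to an in-memory buffer instead of data.csv; the sample values are copied from the printout above.

```python
import io
import pandas as pd

# Stand-in for the yfinance frame: two column levels and an index named "Date".
columns = pd.MultiIndex.from_product([["Open", "High"], ["AAPL", "MSFT"]])
index = pd.DatetimeIndex(["2022-02-07", "2022-02-08"], name="Date")
df = pd.DataFrame(
    [[172.86, 306.17, 173.95, 307.84],
     [171.73, 301.25, 175.35, 305.56]],
    index=index,
    columns=columns,
)

# Write to an in-memory buffer (a real file path works the same way).
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# header=[0, 1] rebuilds the two-level columns, index_col=0 restores the
# "Date" index, and parse_dates=True converts that index back to datetimes.
dfr = pd.read_csv(buffer, header=[0, 1], index_col=0, parse_dates=True)
print(dfr)
```

With the real file, the same three arguments would be passed as pd.read_csv("data.csv", header=[0, 1], index_col=0, parse_dates=True), and the later set_index('Date') call should no longer be needed.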
I have a dataframe, df, with a DatetimeIndex and a single column.
I need to count how many non-zero entries I have in each month. For example, in January I would have 2 entries, in February 1 entry and in March 2 entries. I have more months in the dataframe, but I guess that explains the problem.
I tried using pandas groupby:
df.groupby(df.index.month).count()
But that just gives me the total number of days in each month, and I didn't see any other parameter in count() that I could use here.
Any ideas?
Try index.to_period()
For example:
In [1]: import pandas as pd
import numpy as np
x_df = pd.DataFrame(
    {
        'values': np.random.randint(low=0, high=2, size=(120,))
    },
    index=pd.date_range("2022-01-01", periods=120, freq="D")
)
In [2]: x_df
Out[2]:
values
2022-01-01 0
2022-01-02 0
2022-01-03 1
2022-01-04 0
2022-01-05 0
...
2022-04-26 1
2022-04-27 0
2022-04-28 0
2022-04-29 1
2022-04-30 1
[120 rows x 1 columns]
In [3]: x_df[x_df['values'] != 0].groupby(lambda x: x.to_period("M")).count()
Out[3]:
values
2022-01 17
2022-02 15
2022-03 16
2022-04 17
can you try this:
#drop nans
import numpy as np
dfx['col1']=dfx['col1'].replace(0,np.nan)
dfx=dfx.dropna()
dfx=dfx.resample('1M').count()
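A variant of the same idea that leaves the original frame untouched is to turn the column into a boolean "is non-zero" series and sum it per month. This is only a sketch: the column name values, the variable nonzero_per_month and the sample data are made up for illustration.

```python
import pandas as pd

# Made-up data: 90 daily rows, non-zero only on the first three days of each month.
index = pd.date_range("2022-01-01", periods=90, freq="D")
values = [1 if day.day <= 3 else 0 for day in index]
df = pd.DataFrame({"values": values}, index=index)

# True counts as 1 when summed, so this is the number of non-zero rows per month.
# "MS" labels each bin by the first day of its month.
nonzero_per_month = (df["values"] != 0).resample("MS").sum()
print(nonzero_per_month)
```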
I have a csv file with the following
Symbol, Date, Unix_Tick, OpenPrice, HighPrice, LowPrice, ClosePrice, volume,
AAPL, 2021-01-04 09:00:00, 1609750800, 133.31, 133.49, 133.02, 133.49, 25000
AAPL, 2021-01-04 09:01:00, 1609750860, 133.49, 133.49, 133.49, 133.49, 700
AAPL, 2021-01-04 09:02:00, 1609750920, 133.6, 133.6, 133.5, 133.5, 500
So I attempt to create a pandas index using Date like this
import pandas as pd
import numpy as np
df = pd.read_csv(csvFile)
df = df.set_index(pd.DatetimeIndex(df["Date"]))
I get KeyError: 'Date'
It's because the file isn't strictly comma-separated: the fields are separated by a comma plus a space, so the column is actually named " Date" (with a leading space).
You can either strip the column names to remove spaces:
df = pd.read_csv(csvFile)
df.columns = df.columns.str.strip()
df = df.set_index(pd.DatetimeIndex(df["Date"]))
or read the CSV file with separator ", ":
df = pd.read_csv(csvFile, sep=", ")
df = df.set_index(pd.DatetimeIndex(df["Date"]))
The problem is most probably the space after each comma. You can try loading the data with a custom sep= parameter:
df = pd.read_csv("a1.txt", sep=r",\s+", engine="python")
df = df.set_index(pd.DatetimeIndex(df["Date"]))
print(df)
Prints:
Symbol Date Unix_Tick OpenPrice HighPrice LowPrice ClosePrice volume,
Date
2021-01-04 09:00:00 AAPL 2021-01-04 09:00:00 1609750800 133.31 133.49 133.02 133.49 25000
2021-01-04 09:01:00 AAPL 2021-01-04 09:01:00 1609750860 133.49 133.49 133.49 133.49 700
2021-01-04 09:02:00 AAPL 2021-01-04 09:02:00 1609750920 133.60 133.60 133.50 133.50 500
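Another option, sketched here under the assumption that the only irregularity is the space after each comma, is read_csv's skipinitialspace flag, which also lets the Date column be parsed and used as the index in one call. The sample text is copied from the question, minus the trailing comma in the header row.

```python
import io
import pandas as pd

# Sample rows from the question (header without the trailing comma).
csv_text = (
    "Symbol, Date, Unix_Tick, OpenPrice, HighPrice, LowPrice, ClosePrice, volume\n"
    "AAPL, 2021-01-04 09:00:00, 1609750800, 133.31, 133.49, 133.02, 133.49, 25000\n"
    "AAPL, 2021-01-04 09:01:00, 1609750860, 133.49, 133.49, 133.49, 133.49, 700\n"
)

# skipinitialspace=True drops the blank after each comma, in the header too,
# so the column is named "Date" rather than " Date".
df = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,
    parse_dates=["Date"],
    index_col="Date",
)
print(df)
```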
I am trying to plot using hvplot, and I am getting this:
TypeError: '<=' not supported between instances of 'Timestamp' and 'numpy.float64'
Here is my data:
TimeConv Hospitalizations
1 2020-04-04 827
2 2020-04-05 1132
3 2020-04-06 1153
4 2020-04-07 1252
5 2020-04-08 1491
... ... ...
71 2020-06-13 2242
72 2020-06-14 2287
73 2020-06-15 2326
74 NaT NaN
75 NaT NaN
Below is my code:
import numpy as np
import matplotlib.pyplot as plt
import xlsxwriter
import pandas as pd
from pandas import DataFrame
path = ('Casecountdata.xlsx')
xl = pd.ExcelFile(path)
df1 = xl.parse('Hospitalization by Day')
df2 = df1[['Unnamed: 1','Unnamed: 2']]
df2 = df2.drop(df2.index[0])
df2 = df2.rename(columns={"Unnamed: 1": "Time", "Unnamed: 2": "Hospitalizations"})
df2['TimeConv'] = pd.to_datetime(df2.Time)
df3 = df2[['TimeConv','Hospitalizations']]
When I take a sample of your data above and try to plot it, it works for me, so there might be something wrong in the way you read your data from excel to pandas. You can try to do df.info() to see what the datatypes of your data look like. Column TimeConv should be datetime64[ns] and column Hospitalizations should be int64 (or float). Could also be a version problem... do you have the latest versions of hvplot etc installed? But my guess is, your data doesn't look right.
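If the dtypes do turn out to be wrong, one possible cleanup before plotting is to coerce both columns and drop the empty trailing rows. This is a sketch only: the frame raw below is a made-up stand-in for what the Excel sheet might look like after parsing (object columns with missing values at the end), not the actual file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the parsed sheet: object dtype, empty trailing rows.
raw = pd.DataFrame(
    {
        "Time": ["2020-04-04", "2020-04-05", None, None],
        "Hospitalizations": [827, 1132, np.nan, np.nan],
    },
    dtype=object,
)

clean = pd.DataFrame(
    {
        # Unparseable or missing entries become NaT / NaN instead of raising.
        "TimeConv": pd.to_datetime(raw["Time"], errors="coerce"),
        "Hospitalizations": pd.to_numeric(raw["Hospitalizations"], errors="coerce"),
    }
).dropna()  # remove the rows that are NaT / NaN

clean.info()
```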
In any case, when I run the following, it works and plots your data:
# import libraries
import pandas as pd
import hvplot.pandas
import holoviews as hv
hv.extension('bokeh')
from io import StringIO # need this to read your text data
# your sample data
text_data = StringIO("""
column1 TimeConv Hospitalizations
1 2020-04-04 827
2 2020-04-05 1132
72 2020-06-14 2287
73 2020-06-15 2326
74 NaT NaN
""")
# read text data to dataframe
df = pd.read_csv(text_data, sep=r"\s+")
df['TimeConv'] = pd.to_datetime(df.TimeConv, yearfirst=True)
# shortly checkout datatypes of your data
df.info()
# create scatter plot of your data
df.hvplot.scatter(
x='TimeConv',
y='Hospitalizations',
width=500,
title='Showing hospitalizations over time',
)
This code produces a scatter plot of Hospitalizations against TimeConv.
I am trying to achieve something like the following result:
date symbol Open High low close
2016-12-23 AAPL 804.6 809.9 800.5 809.1
CSCO 29.8 29.8 29.8 29.8
2016-12-27 AAPL 824.6 842.3 822.15 835.77
CSCO 29.32 29.9 29.3 29.85
Here is my code:
from datetime import datetime
from iexfinance.stocks import get_historical_data
from pandas_datareader import data
import matplotlib.pyplot as plt
import pandas as pd
start = '2014-01-01'
end = datetime.today().utcnow()
symbol = ['AAPL', 'MSFT']
out = pd.DataFrame()
datasets_test = []
for d in symbol:
    data_original = data.DataReader(d, 'iex', start, end)
    data_original['symbol'] = d
    data_original = data_original.set_index(['symbol'], append=True)
    out = pd.concat([out,data_original],axis=0)
out.sort_index()
print(out.tail(5))
and this is my outcome:
open high low close volume
date symbol
2019-02-11 MSFT 106.20 106.58 104.9650 105.25 18914123
2019-02-12 MSFT 106.14 107.14 105.4800 106.89 25056595
2019-02-13 MSFT 107.50 107.78 106.7100 106.81 18394869
2019-02-14 MSFT 106.31 107.29 105.6600 106.90 21784703
2019-02-15 MSFT 107.91 108.30 107.3624 108.22 26606886
I am trying to sort on both index levels (date + symbol) and am getting confused about how to use sort_index.
thanks!
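One likely cause is that sort_index returns a new DataFrame rather than sorting in place, so the result has to be assigned back. The sketch below uses a tiny hand-built frame with made-up close values instead of the IEX download; the variable by_symbol is also just for illustration.

```python
import pandas as pd

# Made-up frame with the same (date, symbol) index layout as in the question.
index = pd.MultiIndex.from_tuples(
    [("2019-02-14", "MSFT"), ("2019-02-15", "MSFT"),
     ("2019-02-14", "AAPL"), ("2019-02-15", "AAPL")],
    names=["date", "symbol"],
)
out = pd.DataFrame({"close": [1.0, 2.0, 3.0, 4.0]}, index=index)

# sort_index() does not modify `out`; keep the returned frame.
out = out.sort_index()  # sorts by date first, then by symbol

# To group all rows of one symbol together instead, name the level order.
by_symbol = out.sort_index(level=["symbol", "date"])

print(out)
print(by_symbol)
```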
I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_timestamp. I am trying to figure out how to calculate people's ages from the time difference between 'entry_date' and 'dob'. To do this I need the difference in days between the two columns (so that I can then do something like round(days/365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date-munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the pandas Timedelta type, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
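Applied to the question's columns, the same .dt.days accessor might be used as follows. This is a sketch with made-up dates; the column names entry_date and dob come from the question, while the frame contents and the variables days and ages are assumptions for illustration.

```python
import pandas as pd

# Made-up rows using the column names from the question.
munged_data = pd.DataFrame(
    {
        "dob": pd.to_datetime(["2000-01-01", "1990-01-01"]),
        "entry_date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    }
)

# Vectorized difference in whole days, as an integer Series.
days = (munged_data["entry_date"] - munged_data["dob"]).dt.days

# Approximate age in years, following the round(days / 365.25) idea.
ages = (days / 365.25).round().astype(int)
print(ages)
```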
You need 0.11 for this (0.11rc1 is out, final prob next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns]; that is coming in 0.12).
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
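On recent pandas versions the .ix indexer has been removed, and casting to a year-based timedelta64 is no longer available as far as I know. A rough equivalent, offered as a sketch rather than a drop-in replacement, is positional indexing with .iloc and dividing by a fixed-length Timedelta; the 365.25-day year is an approximation.

```python
import pandas as pd

dates = pd.Series([pd.Timestamp("20010101"), pd.Timestamp("20040605")])

# .iloc replaces the removed .ix indexer for positional access.
diff = dates.iloc[0] - dates.iloc[1]  # a single pd.Timedelta
print(diff.days)

# Dividing by a fixed-length Timedelta gives a plain float number of "years".
years = diff / pd.Timedelta(days=365.25)
print(years)
```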
Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
To convert a duration given in another fixed unit into whole days, use pd.Timedelta(...).days (the ambiguous units 'Y' and 'M' are not accepted by recent pandas versions):
pd.Timedelta(48, unit='h').days
2