I have a CSV file with the following content:
Symbol, Date, Unix_Tick, OpenPrice, HighPrice, LowPrice, ClosePrice, volume,
AAPL, 2021-01-04 09:00:00, 1609750800, 133.31, 133.49, 133.02, 133.49, 25000
AAPL, 2021-01-04 09:01:00, 1609750860, 133.49, 133.49, 133.49, 133.49, 700
AAPL, 2021-01-04 09:02:00, 1609750920, 133.6, 133.6, 133.5, 133.5, 500
So I attempt to create a pandas DatetimeIndex from the Date column like this:
import pandas as pd
import numpy as np
df = pd.read_csv(csvFile)
df = df.set_index(pd.DatetimeIndex(df["Date"]))
I get KeyError: 'Date'
That's because the file isn't strictly comma-separated; it is separated by a comma plus a space, so the column names end up with leading spaces (' Date' instead of 'Date').
You can either strip the column names to remove the spaces:
df = pd.read_csv(csvFile)
df.columns = df.columns.str.strip()
df = df.set_index(pd.DatetimeIndex(df["Date"]))
or read the CSV file with separator ", ":
df = pd.read_csv(csvFile, sep=", ")
df = df.set_index(pd.DatetimeIndex(df["Date"]))
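Note that a separator longer than one character is treated as a regular expression and forces the slower Python parsing engine (pandas emits a ParserWarning unless engine="python" is passed explicitly). A hedged alternative that keeps the default C engine is skipinitialspace, which ignores the space after each comma; a minimal sketch, assuming the same file:
# Sketch: keep the plain comma separator but skip the space that follows it,
# so the column names come out as "Date" rather than " Date".
df = pd.read_csv(csvFile, skipinitialspace=True)
df = df.set_index(pd.DatetimeIndex(df["Date"]))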
The problem is most probably the space after the comma. You can try loading the data with a custom sep= parameter:
df = pd.read_csv("a1.txt", sep=r",\s+", engine="python")
df = df.set_index(pd.DatetimeIndex(df["Date"]))
print(df)
Prints:
Symbol Date Unix_Tick OpenPrice HighPrice LowPrice ClosePrice volume,
Date
2021-01-04 09:00:00 AAPL 2021-01-04 09:00:00 1609750800 133.31 133.49 133.02 133.49 25000
2021-01-04 09:01:00 AAPL 2021-01-04 09:01:00 1609750860 133.49 133.49 133.49 133.49 700
2021-01-04 09:02:00 AAPL 2021-01-04 09:02:00 1609750920 133.60 133.60 133.50 133.50 500
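As a side note, the datetime index can also be built in one step by read_csv itself; a minimal sketch, assuming the same comma-plus-space file as above:
# Sketch: parse_dates converts the Date column to datetimes and
# index_col makes it the index in a single call.
df = pd.read_csv(csvFile, sep=r",\s+", engine="python",
                 parse_dates=["Date"], index_col="Date")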
Related
With a datetime index to a Pandas dataframe, it is easy to get a range of dates:
df[datetime(2018,1,1):datetime(2018,1,10)]
Filtering is straightforward too:
df[ (df['column A'] == 'Done') & (df['column B'] < 3.14 )]
But what is the best way to simultaneously filter by range of dates and any other non-date criteria?
3 boolean conditions
c0 = df.index.to_series().between('2018-01-01', '2018-01-10')
c1 = df['column A'] == 'Done'
c2 = df['column B'] < 3.14
df[c0 & c1 & c2]
column A column B
2018-01-04 Done 2.533385
2018-01-06 Done 2.789072
2018-01-08 Done 2.230017
Setup
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame({
    'column A': ['Done', 'Not Done'] * 10,
    'column B': np.random.randn(20) + np.pi
}, pd.date_range('2017-12-25', periods=20))
df
column A column B
2017-12-25 Done 1.011868
2017-12-26 Not Done 1.873127
2017-12-27 Done 1.171093
2017-12-28 Not Done 0.882538
2017-12-29 Done 2.792306
2017-12-30 Not Done 3.114638
2017-12-31 Done 3.457829
2018-01-01 Not Done 3.490375
2018-01-02 Done 3.856957
2018-01-03 Not Done 3.912356
2018-01-04 Done 2.533385
2018-01-05 Not Done 3.493983
2018-01-06 Done 2.789072
2018-01-07 Not Done 2.725724
2018-01-08 Done 2.230017
2018-01-09 Not Done 2.999055
2018-01-10 Done 3.888432
2018-01-11 Not Done 1.637436
2018-01-12 Done 3.752955
2018-01-13 Not Done 3.541812
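An equivalent result can also be obtained by slicing the DatetimeIndex first and filtering the smaller frame afterwards; a minimal sketch using the setup above:
# Sketch: .loc slicing on a DatetimeIndex is inclusive of both endpoints,
# matching the between() condition used above.
sub = df.loc['2018-01-01':'2018-01-10']
sub[(sub['column A'] == 'Done') & (sub['column B'] < 3.14)]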
If there are multiple boolean masks, it is possible to use np.logical_and.reduce:
m1 = df.index > '2018-01-01'
m2 = df.index < '2018-01-10'
m3 = df['column A'] == 'Done'
m4 = df['column B'] < 3.14
#piRSquared's data sample
df = df[np.logical_and.reduce([m1, m2, m3, m4])]
print (df)
column A column B
2018-01-04 Done 2.533385
2018-01-06 Done 2.789072
2018-01-08 Done 2.230017
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2018-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2018-2-1':'2018-2-10'])
Hope it will be helpful.
I did the following to filter both dataframes down to the same date range:
corn_url = 'https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=WPU012202&scale=left&cosd=1971-01-01&coed=2020-04-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2009-06-01&line_index=1&transformation=lin&vintage_date=2020-06-09&revision_date=2020-06-09&nd=1971-01-01'
wheat_url ='https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=WPU0121&scale=left&cosd=1947-01-01&coed=2020-04-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2009-06-01&line_index=1&transformation=lin&vintage_date=2020-06-09&revision_date=2020-06-09&nd=1947-01-01'
corn = pd.read_csv(corn_url,index_col=0,parse_dates=True)
wheat = pd.read_csv(wheat_url,index_col=0, parse_dates=True)
corn.head()
PP Index 1982
DATE
1971-01-01 63.4
1971-02-01 63.6
1971-03-01 62.0
1971-04-01 60.8
1971-05-01 60.2
wheat.head()
PP Index 1982
DATE
1947-01-01 53.1
1947-02-01 56.5
1947-03-01 68.0
1947-04-01 66.0
1947-05-01 66.7
wheat = wheat[wheat.index > '1970-12-31']
wheat.head()
PP Index 1982
DATE
1971-01-01 42.6
1971-02-01 42.6
1971-03-01 41.4
1971-04-01 41.7
1971-05-01 41.8
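To actually align both frames on exactly the same dates, one could also intersect the two indexes; a minimal sketch building on the corn and wheat frames above:
# Sketch: keep only the dates that appear in both frames.
common_dates = corn.index.intersection(wheat.index)
corn = corn.loc[common_dates]
wheat = wheat.loc[common_dates]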
This is a fairly complicated dataframe for such a simple download. After saving it to file (to_csv), I can't seem to read it back (read_csv) into the same dataframe as before. Please help.
import yfinance as yf
import pandas as pd
tickers=['AAPL', 'MSFT']
header = ['Open', 'High', 'Low', 'Close', 'Adj Close']
df = yf.download(tickers, period='1y')[header]
df.to_csv("data.csv", index=True)
dfr = pd.read_csv("data.csv")
dfr = dfr.set_index('Date')
print(dfr)
KeyError: "None of ['Date'] are in the columns"
Note:
df: Date is the Index
Open High
AAPL MSFT AAPL MSFT
Date
2022-02-07 172.86 306.17 173.95 307.84
2022-02-08 171.73 301.25 175.35 305.56
2022-02-09 176.05 309.87 176.65 311.93
2022-02-10 174.14 304.04 175.48 309.12
2022-02-11 172.33 303.19 173.08 304.29
But dfr (after read_csv)
Unnamed: 0 Open ... High High.1
0 NaN AAPL ... AAPL MSFT
1 Date NaN ... NaN NaN
2 2022-02-07 172.86 ... 173.94 307.83
3 2022-02-08 171.72 ... 175.35 305.55
4 2022-02-09 176.05 ... 176.64 311.92
How can I make dfr look like df?
When I run the code, I get the error:
KeyError: "None of ['Date'] are in the columns"
I have a dataframe like this:
import pandas as pd
import sqlalchemy
con = sqlalchemy.create_engine('....')
df=pd.DataFrame({'user_id':[1,2,3],'start_date':pd.Series(['2022-05-01 00:00:00','2022-05-10 00:00:00','2022-05-20 00:00:00'],dtype='datetime64[ns]'),
                 'end_date':pd.Series(['2022-06-01 00:00:00','2022-06-10 00:00:00','2022-06-20 00:00:00'],dtype='datetime64[ns]')})
'''
user_id start_date end_date
1 2022-05-01 00:00:00 2022-06-01 00:00:00
2 2022-05-10 00:00:00 2022-06-10 00:00:00
3 2022-05-20 00:00:00 2022-06-20 00:00:00
'''
I want to get the sales data for each user from the database for the date ranges specified in the df. Below is the code I am currently using, and it works correctly.
df_stats=pd.DataFrame()
for k,j in df.iterrows():
    sql='''
    select '{}' as user_id,sum(item_price) as sales,count(return) as return from sales
    where created_at between '{}' and '{}' and user_id={}'''.format(j['user_id'],j['start_date'],j['end_date'],j['user_id'])
    sql_to_df = pd.read_sql(sql, con)
    df_stats = df_stats.append(sql_to_df)
final=df.merge(df_stats,on='user_id')
'''
final:
user_id start_date end_date sales return
1 2022-05-01 00:00:00 2022-06-01 00:00:00 1500 5
2 2022-05-10 00:00:00 2022-06-10 00:00:00 2900 9
3 2022-05-20 00:00:00 2022-06-20 00:00:00 1450 1
'''
But the articles I have read mention that using iterrows() is very slow. Is there a way to make this process more efficient?
Note: this previously asked question is similar to mine, but I couldn't find a satisfactory answer there.
You can use .to_records to transform the rows into a list of tuples. Then iterate over the list, unpack each tuple, and pass the arguments to your_sql_function.
import pandas as pd
data = {
    "user_id": [1, 2, 3],
    "start_date": pd.Series(["2022-05-01 00:00:00", "2022-05-10 00:00:00", "2022-05-20 00:00:00"], dtype="datetime64[ns]"),
    "end_date": pd.Series(["2022-06-01 00:00:00", "2022-06-10 00:00:00", "2022-06-20 00:00:00"], dtype="datetime64[ns]")
}
df = pd.DataFrame(data)
for user, start, end in df.to_records(index=False):
    your_sql_function(user, start, end)
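For the database part, a minimal sketch of how this could replace the iterrows() loop, assuming the same sales table, connection, and query as in the question; it collects one frame per user and concatenates them once instead of calling append inside the loop:
frames = []
for user, start, end in df.to_records(index=False):
    # to_records yields numpy datetime64 values; pd.Timestamp formats them
    # as '2022-05-01 00:00:00', like the original loop did.
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    sql = '''
    select '{0}' as user_id, sum(item_price) as sales, count(return) as return from sales
    where created_at between '{1}' and '{2}' and user_id={0}'''.format(user, start, end)
    frames.append(pd.read_sql(sql, con))
df_stats = pd.concat(frames, ignore_index=True)
final = df.merge(df_stats, on='user_id')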
I have a pandas dataframe
published | sentiment
2022-01-31 10:00:00 | 0
2021-12-29 00:30:00 | 5
2021-12-20 | -5
Since some rows don't have hours, minutes, and seconds, I strip the time part from all of them:
df_dominant_topic2['published']=df_dominant_topic2['published'].astype(str).str.slice(0, 10)
df_dominant_topic2['published']=df_dominant_topic2['published'].str.slice(0, 10)
I get:
published | sentiment
2022-01-31 | 0
2021-12-29 | 5
2021-12-20 | -5
If I plot the data:
plt.pyplot.plot_date(df['published'],df['sentiment'] )
I get this error:
TypeError: float() argument must be a string or a number, not 'datetime.datetime'
But I don't know why, since it should be a string.
How can I plot it (possibly keeping the temporal order)? Thank you.
Try it like this:
import pandas as pd
from matplotlib import pyplot as plt
values=[('2022-01-31 10:00:00',0),('2021-12-29 00:30:00',5),('2021-12-20',-5)]
cols=['published','sentiment']
df_dominant_topic2 = pd.DataFrame.from_records(values, columns=cols)
df_dominant_topic2['published']=df_dominant_topic2['published'].astype(str).str.slice(0, 10)
df_dominant_topic2['published']=df_dominant_topic2['published'].str.slice(0, 10)
#you may sort the data by date
df_dominant_topic2.sort_values(by='published', ascending=True, inplace=True)
plt.plot(df_dominant_topic2['published'],df_dominant_topic2['sentiment'])
plt.show()
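An alternative sketch, in case real dates on the x-axis are preferred over plain strings: convert the truncated column with pd.to_datetime and let matplotlib handle the date axis (same df_dominant_topic2 as above):
# Sketch: cut off the time part and convert to real datetimes.
df_dominant_topic2['published'] = pd.to_datetime(
    df_dominant_topic2['published'].astype(str).str.slice(0, 10))
df_dominant_topic2 = df_dominant_topic2.sort_values('published')
plt.plot(df_dominant_topic2['published'], df_dominant_topic2['sentiment'])
plt.show()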
I hope that with this additional information someone can find time to help me with this new issue.
Sample data here --> file
'Date as index' (datetime.date)
As I said, I'm trying to select a range in the dataframe every time x falls in the interval [-190, 0], and to create a new dataframe with a new column that is the sum of the selected rows, keeping the last "encountered" date as the index.
EDIT: The "loop" starts at the first date (the beginning of the df); when a value between -190 and 0 is found, the rows so far are summed up, then the search and summing continue, and so on.
BUT I still get sums that are themselves inside the interval (-190, 0).
Example and code below.
Thanks
import pandas as pd
df = pd.read_csv('http://www.sharecsv.com/s/0525f76a07fca54717f7962d58cac692/sample_file.csv', sep = ';')
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
df3 = df.groupby('Date', as_index=False)['x'].sum()
df3
##### output #####
Date sum
0 2019-01-01 13:48:00 -131395.21
1 2019-01-02 11:23:00 -250830.08
2 2019-01-02 11:28:00 -154.35
3 2019-01-02 12:08:00 -4706.87
4 2019-01-03 12:03:00 -260158.22
... ... ...
831 2019-09-29 09:18:00 -245939.92
832 2019-09-29 16:58:00 -0.38
833 2019-09-30 17:08:00 -129365.71
834 2019-09-30 17:13:00 -157.05
835 2019-10-01 08:58:00 -111911.98
########## expected output #############
Date sum
0 2019-01-01 13:48:00 -131395.21
1 2019-01-02 11:23:00 -250830.08
2 2019-01-02 12:08:00 -4706.87
3 2019-01-03 12:03:00 -260158.22
... ... ...
831 2019-09-29 09:18:00 -245939.92
832 2019-09-30 17:08:00 -129365.71
833 2019-10-01 08:58:00 -111911.98
...
...
Use Series.where with Series.between to replace the values of the Date column with NaN where x is outside the range, back-fill the missing values, and then aggregate with sum. The next step is to filter out the rows whose sum still falls in the range, using boolean indexing, and finally use DataFrame.resample, casting the resulting Series to a one-column DataFrame with Series.to_frame:
# range -190, 0
# Date is read from the CSV as strings, so convert it for the resample step
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = df['Date'].where(df['x'].between(-190, 0)).bfill()
df3 = df.groupby('Date', as_index=False)['x'].sum()
# drop the groups whose sum still falls inside the range
df3 = df3[~df3['x'].between(-190, 0)]
df3 = df3.resample('D', on='Date')['x'].sum().to_frame()
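One caveat of the daily resample: days without any remaining group come out as a 0 sum. If those rows are not wanted, they can be dropped afterwards; a small sketch:
# Sketch: remove the zero-filled days introduced by the daily resample.
df3 = df3[df3['x'] != 0]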