Setting pandas Series row values to multiple column values - pandas

I have one DataFrame object, df, into which I have read some data from an Excel sheet and then added certain date columns. This df also holds stock tickers from Yahoo Finance. I now try to get two months of price history for these tickers from Yahoo Finance (which will be about 60 rows each) and then assign those price values to the columns for the relevant dates in df, but I am not able to do so.
In the last line of code, I am trying to set the "Volume" values, which come back as separate rows, into the column values for the respective dates in df, but it does not work. Need help. Thanks.
df = pd.read_excel(r"D:\Volume Trading\python\excel"
                   r"\Nifty-sector-cap.xlsx")
start_date = date(2022, 3, 1)  # date YYYY MM DD
end_date = date(2022, 4, 25)
# downloading the data below just to get the dates that will become columns of df
temp_data = yf.download("HDFCBANK.NS", start_date, end_date, interval="1d")["Adj Close"]
temp_data.index = temp_data.index.date
# setting the dates as column headers in df
df = df.reindex(columns=df.columns.tolist() + temp_data.index.tolist())
# putting the volume for each ticker on each date in df
for i in range(len(df)):
    temp_vol = yf.download(df["Yahoo_Symbol"].iloc[i], start_date, end_date, interval="1d")["Volume"]
    temp_vol.index = temp_vol.index.date
    df[temp_vol.index.tolist()].iloc[i] = temp_vol.transpose()
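The assignment in the last line goes through a chained selection (df[...].iloc[i] = ...), which writes into a temporary object rather than into df itself. A minimal sketch of the intended per-row assignment with a single .loc call, assuming the dates returned for each ticker line up with the date columns added from temp_data:

for i in range(len(df)):
    temp_vol = yf.download(df["Yahoo_Symbol"].iloc[i], start_date, end_date, interval="1d")["Volume"]
    temp_vol.index = temp_vol.index.date
    # one .loc call with the row label and the date columns writes into df directly
    df.loc[df.index[i], temp_vol.index.tolist()] = temp_vol.values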

Related

Pandas grouping values and getting most recent date

I have a large CSV file that I read into pandas, which gives me a DataFrame of "Community_Name" and "Date"; it's about 186k lines with about 120 unique community names and a range of dates. I would like to group the data by community and find the most recent date for each one in the file. I will use this later on to pull data from each community up to that most recent date.
I am struggling with getting the most recent date value for each community. I thought .max() would work, but it returns the greatest value overall rather than per community...
import csv
import datetime
import pandas as pd

dates_list = []
with open('communitydates.csv', 'r', newline='', encoding='utf-8') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for line in csv_reader:
        date = line['Date'] + " " + line['Year']
        date = datetime.datetime.strptime(date, '%B %d %Y').strftime('%Y %d %m')
        community_name = line['Community']
        entry = community_name, date
        dates_list.append(entry)

df = pd.DataFrame(dates_list)
df.columns = ["Community", "Date"]
df["Date"] = pd.to_datetime(df["Date"], format='%Y %d %m').max()  # .max() here collapses the whole column to a single date
grouped_by_community = df.groupby("Community")
recent_date_by_community = grouped_by_community.first()
Ideally I want to convert the DataFrame into a Dictionary or List to do the check later on.
max_dates = recent_date_by_community.to_dict('index')
for k in max_dates:
    print(k, max_dates[k]['Date'])
Which currently gives me this... but the date is the same for all 102 communities instead of the actual date from the file.
Addison 2019-10-09 00:00:00
I assume I am using the .max() statement incorrectly, but I have not been able to figure out how to change it.
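For reference, a minimal sketch of the per-community maximum the question is after, assuming the earlier overall .max() call is dropped so that df["Date"] still holds the parsed per-row dates:

# most recent date per community, as a Series indexed by Community
recent_date_by_community = df.groupby("Community")["Date"].max()
# dictionary keyed by community name for the later lookup
max_dates = recent_date_by_community.to_dict()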

How to plot only business hours and weekdays in pandas

I have hourly stock data.
I need (a) to format it so that matplotlib ignores weekends and non-business hours, and (b) an hourly frequency.
The problem:
Currently, the graph looks crammed and I suspect it is because matplotlib is taking into account 24 hours instead of 8, and 7 days a week instead of business days.
How do I tell pandas to only take business hours, Monday through Friday, into account?
How I am graphing the data:
I am looping through a list of price data dataframes, graphing each data frame:
mm = 0
for ii in df:
    Ddate = ii['Date']
    Pprice = ii['Price']
    d = Ddate.to_list()
    p = Pprice.to_list()
    dates = make_dt(d)
    prices = unstring(p)
    plt.figure()
    plt.plot(dates, prices)
    plt.title(stocks[mm])
    plt.grid(True)
    plt.xlabel('Dates')
    plt.ylabel('Prices')
    mm += 1
the graph:
To fetch business days, you can use the function below:
df["IsBDay"] = bool(len(pd.bdate_range(df['date'], df['date'])))
# The line above should add a new column IsBDay into the DF.
# You can also use a lambda expression to check and build the IsBDay column.
df['IsBDay'] = df['date'].apply(lambda x: 'True' if bool(len(pd.bdate_range(x, x))) else 'False')
Now create a new DF that keeps only the rows whose IsBDay value is 'True' along with the other columns:
df[df.IsBDay != 'False']
Now your DF is ready for plotting.
Hope this helps.
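As a complement, a minimal sketch for the weekday/business-hour filtering itself, assuming the hourly data sits on a DatetimeIndex and assuming 09:00-17:00 as the business hours (adjust as needed):

import pandas as pd

# assumed example frame: hourly prices on a DatetimeIndex
rng = pd.date_range('2021-01-04', '2021-01-15 23:00', freq='H')
prices = pd.DataFrame({'Price': range(len(rng))}, index=rng)

# keep Monday-Friday only (dayofweek 0-4) ...
weekdays_only = prices[prices.index.dayofweek < 5]
# ... and then only the assumed business hours
business_hours = weekdays_only.between_time('09:00', '17:00')

Plotting business_hours directly will still show visual gaps where rows were removed; plotting against a simple positional index and relabeling the ticks is one common way to close those gaps.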

Conditional join using sqldf in R with time data

So I have a table (~2000 rows, call it df1) of when a particular subject received a medication on a particular date, and I have a large Excel file (>1 million rows) of weight data for subjects on different dates (call it df2).
AIM: I want to group by subject and find the weight in df2 that was recorded closest to the medication admin time in df1, using sqldf (because the tables are too big to load into R). Alternatively, I can set up a time frame of interest (e.g. +/- 1 week of the medication being given) and find a row that falls within that timeframe.
Example:
df1 <- data.frame(
PtID = rep(c(1:5), each=2),
Dose = rep(seq(100,200,25),2),
ADMIN_TIME =seq.Date(as.Date("2016/01/01"), by = "month", length.out = 10)
)
df2 <- data.frame(
PtID = rep(c(1:5),each=10),
Weight = rnorm(50, 50, 10),
Wt_time = seq.Date(as.Date("2016/01/01"), as.Date("2016/10/31"), length.out = 50)
)
So I think I want to left_join df1 and df2, group by PtID, and set up some condition that identifies either the closest df2$Weight to df1$ADMIN_TIME or a df2$Weight within an acceptable range around df1$ADMIN_TIME, using SQL formatting.
So I tried creating a range and then querying the following:
library(dplyr)
library(lubridate)
df1 <- df1 %>%
  mutate(ADMIN_START = ADMIN_TIME - ddays(30),
         ADMIN_END = ADMIN_TIME + ddays(30))
#df2.csv is the large spreadsheet saved in my working directory
result <- read.csv.sql("df2.csv", sql = "select Weight from file
left join df1
on file.Wt_time between df1.ADMIN_START and df1.ADMIN_END")
This will run, but it never returns anything and I have to escape out of it. Any thoughts are appreciated.
Thanks!

How do I use ffill with a MultiIndex

I asked (and answered) a question here, Pandas ffill resampled data grouped by column, where I wanted to know how to ffill a date range for each unique entry in a column (my assets column).
My solution requires that the asset "id" is a column. However, the data makes more sense to me as a MultiIndex, and I would also like more fields in the MultiIndex. Is the only way of filling forward to drop the non-date fields from the MultiIndex before ffilling?
A modified version of my example (adapted to work on a df with a MultiIndex) is here:
import pandas as pd
from datetime import datetime, timedelta
import pytz

some_time = datetime(2018, 4, 2, 20, 20, 42)
start_date = datetime(some_time.year, some_time.month, some_time.day).astimezone(pytz.timezone('Europe/London'))
end_date = start_date + timedelta(days=1)
start_date = start_date + timedelta(hours=some_time.hour, minutes=(0 if some_time.minute < 30 else 30))
df = pd.DataFrame(['A', 'B'], columns=['asset_id'])
df2 = df.copy()
df['datetime'] = start_date
df2['datetime'] = end_date
df['some_property'] = 0
df.loc[df['asset_id'] == 'B', 'some_property'] = 2
df = pd.concat([df, df2]).set_index(['asset_id', 'datetime'])  # df.append is deprecated/removed in newer pandas
With what is arguably my crazy solution here:
df = df.reset_index()
df = df.set_index('datetime').groupby('asset_id').resample('30T').ffill().drop('asset_id',axis=1)
df = df.reset_index().set_index(['asset_id','datetime'])
Can I avoid all that re-indexing?
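One way to avoid most of that reindexing, sketched below under the assumption that the index levels are named asset_id and datetime as in the example above: group on the asset level, drop that level inside each group so a plain DatetimeIndex remains, then resample and forward-fill, letting groupby put the asset level back on the result.

# minimal sketch: per-asset resample + ffill without the reset_index/set_index round trip
filled = (
    df.groupby(level='asset_id', group_keys=True)
      .apply(lambda g: g.droplevel('asset_id').resample('30T').ffill())
)

The result should again carry the (asset_id, datetime) MultiIndex.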

Pandas Dataframe timeseries

I want to build a dataframe with a datetime stamp (down to minutes) as the index and keep adding columns as I get data for each new column. For example, for Col-A, I aggregate another dataset by day, hour and minute down to a value 'k'. I want to insert this value 'k' into the dataframe at the 'right' row index. The problem I am facing is that the current row identifier comes from a groupby object on date, hour and minute, and I am not sure how to 'concatenate' these three into a nice timeseries-type index.
This is what I have currently (output of my groupby object):
currGroupedData = cData.groupby(['DATE', 'HOUR', 'MINUTE'])
numUniqValuesPerDayHrMin = currGroupedData['UID'].nunique()
print(numUniqValuesPerDayHrMin)
Computing Values for A:
DATE HOUR MINUTE
2015-08-15 6 38 65
Name: UID, dtype: int64
To form a new dataframe to hold many columns (A, B, .., Z), I am doing this:
index = pd.date_range('2015-10-05 10:00:00', '2015-11-10 10:00:00', freq='1min')
df = pd.DataFrame(index=index)
Now, I want to 'somehow' take the value 65 and populate it into my dataframe. How do I do this? I must somehow convert the "date, hour, minute" groupby index into a timeseries object...?
Also, I will have a series of values for Col-A for many minutes of that day. I want to populate an entire column with those values in one shot and fill the rest with 0s, then move on to processing/filling the next column.
Can I do this:
str = '2015-10-10 06:10:00'
str
Out[362]: '2015-10-10 06:10:00'
pd.to_datetime(str, format='%Y-%m-%d %H:%M:%S', coerce=True)
Out[363]: Timestamp('2015-10-10 06:10:00')
row_idx = pd.to_datetime(str, format='%Y-%m-%d %H:%M:%S', coerce=True)
type(row_idx)
Out[365]: pandas.tslib.Timestamp
data = pd.DataFrame({'Col-A': 65}, index = pd.Series(row_idx))
df.add(data)
Any thoughts?
You almost got it figured out in your code; a few changes get the trick done:
1. Initialize the dataframe without data and with the time index (you can always append more rows later).
2. Initialize the new column with all values set to 0.
3. Set the value for the column at the target time.
import pandas as pd

index = pd.date_range('2015-10-05 10:00:00', '2015-11-10 10:00:00', freq='1min')
df = pd.DataFrame(index=index)

# initialize the column with all values set to 0
df['first_column'] = 0

# format the target time into a timestamp
target_time = pd.to_datetime('2015-10-15 6:38')

# set the value for the target time to 65 (.loc avoids chained-indexing assignment)
df.loc[target_time, 'first_column'] = 65

# output the value at the target time
df.loc[target_time, 'first_column']
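To cover the 'one-shot' part of the question as well (filling a whole column from the day/hour/minute groupby result and padding every other minute with 0), here is a minimal sketch; the small Series s stands in for the nunique() output shown in the question, and the column name Col-A is taken from the question.

import pandas as pd

# stand-in for currGroupedData['UID'].nunique(): a Series whose MultiIndex
# levels are DATE, HOUR and MINUTE (values copied from the question's output)
s = pd.Series([65],
              index=pd.MultiIndex.from_tuples([('2015-08-15', 6, 38)],
                                              names=['DATE', 'HOUR', 'MINUTE']),
              name='UID')

# collapse the three index levels into a single DatetimeIndex
ts_index = pd.to_datetime([f"{d} {h:02d}:{m:02d}" for d, h, m in s.index])

# minute-frequency frame covering the period of interest
index = pd.date_range('2015-08-15 00:00:00', '2015-08-16 00:00:00', freq='1min')
df = pd.DataFrame(index=index)

# populate the entire column in one shot, with 0 wherever s has no value
df['Col-A'] = pd.Series(s.values, index=ts_index).reindex(df.index, fill_value=0)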