How to get that graph in plotly? - dataframe

I have hourly data time series for 2016 till 2020 and I want to get a graph looks like the attached picture
datetime Demand
2016-1-1 01:00:00. 500
2016-1-1 02:00:00. 450
2016-1-1 03:00:00. 650
.........................
2017-1-1 01:00:00. 570
2017-1-1 02:00:00. 470
2017-1-1 03:00:00. 600
.........................
.........................
2020-1-1 01:00:00. 900
2020-1-1 02:00:00. 800
2020-1-1 03:00:00. 950
My dataframe looks like aboved dataframe

Basically you'll need to create two new columns for your dataframe
Year and Hour.
You can use datetime in order to do that.
With those two columns you can now create a px.line graph where x is your Hour column, where Y is your Demand column and where color is your Year column.
References:
Datetime
Line Charts With Plotly

Related

Date dependent calculation from 2 dataframes - average 6-month return

I am working with the following dataframe, I have data for multiple companies, each row associated with a specific datadate, so I have many rows related to many companies - with ipo date from 2009 to 2022.
index ID price daily_return datadate daily_market_return mean_daily_market_return ipodate
0 1 27.50 0.008 01-09-2010 0.0023 0.03345 01-12-2009
1 2 33.75 0.0745 05-02-2017 0.00458 0.0895 06-12-2012
2 3 29,20 0.00006 08-06-2020 0.0582 0.0045 01-05-2013
3 4 20.54 0.00486 09-06-2018 0.0009 0.0006 27-11-2013
4 1 21.50 0.009 02-09-2021 0.0846 0.04345 04-05-2009
5 4 22.75 0.00539 06-12-2019 0.0003 0.0006 21-09-2012
...
26074 rows
I also have a dataframe containing the Market yield on US Treasury securities at 10-year constant maturity - measured daily. Each row represents the return associated with a specific day, each day from 2009 to 2022.
date dgs10
1 2009-01-02 2.46
2 2009-01-05 2.49
3 2009-01-06 2.51
4 2009-01-07 2.52
5 2009-01-08 2.47
6 2009-01-09 2.43
7 2009-01-12 2.34
8 2009-01-13 2.33
...
date dgs10
3570 2022-09-08 3.29
3571 2022-09-09 3.33
3572 2022-09-12 3.37
3573 2022-09-13 3.42
3574 2022-09-14 3.41
My goal is to calculate, for each ipodate (from dataframe 1), the average of the previous 6-month return of the the Market yield on US Treasury securities at 10-year constant maturity (from dataframe 2). The result should either be in a new dataframe or in an additionnal column in dataframe 1. Both dataframes are not the same length. I tried using rolling(), but it doesn't seem to be working. Anyone knows how to fix this?
# Make sure that all date columns are of type Timestamp. They are a lot easier
# to work with
df1["ipodate"] = pd.to_datetime(df1["ipodate"], dayfirst=True)
df2["date"] = pd.to_datetime(df2["date"])
# Calculate the mean market yield of the previous 6 months. Six month is not a
# fixed length of time so I replaced it with 180 days.
tmp = df2.rolling("180D", on="date").mean()
# The values of the first 180 days are invalid, because we have insufficient
# data to calculate the rolling mean. You may consider extending df2 further
# back to 2008. (You may come up with other rules for this period.)
is_invalid = (tmp["date"] - tmp["date"].min()) / pd.Timedelta(1, "D") < 180
tmp.loc[is_invalid, "dgs10"] = np.nan
# Result
df1.merge(tmp, left_on="ipodate", right_on="date", how="left")

I want to do some aggregations with the help of Group By function in pandas

My dataset consists of a date column in 'datetime64[ns]' dtype; it also has a price and a no. of sales column.
I want to calculate the monthly VWAP (Volume Weighted Average Price ) of the stock.
( VWAP = sum(price*no.of sales)/sum(no. of sales) )
What I applied is:-
created a new dataframe column of month and year using pandas functions.
Now, I want monthly VWAP from this dataset which I modified, also, it should be distinct by year.
For eg. - March,2016 and March,2017 should have their seperate VWAP monthly values.
Start from defining a function to count vwap for the current
month (group of rows):
def vwap(grp):
return (grp.price * grp.salesNo).sum() / grp.salesNo.sum()
Then apply it to monthly groups:
df.groupby(df.dat.dt.to_period('M')).apply(vwap)
Using the following test DataFrame:
dat price salesNo
0 2018-05-14 120.5 10
1 2018-05-16 80.0 22
2 2018-05-20 30.2 12
3 2018-08-10 75.1 41
4 2018-08-20 92.3 18
5 2019-05-10 10.0 33
6 2019-05-20 20.0 41
(containing data from the same months in different years), I got:
dat
2018-05 75.622727
2018-08 80.347458
2019-05 15.540541
Freq: M, dtype: float64
As you can see, the result contains separate entries for May in both
years from the source data.

Adding the most recent data to a pandas data frame

I am trying to build, and keep up to date, a data frame/time series where I scrape the data from a website table, and want to take the most recent data, and add to the data I've already got. A sample of what the data frame looks like is:
Date Price
0 10/01/19 100
1 09/01/19 95
2 08/01/19 96
3 07/01/19 97
What I then want to do is run my little program and have it identify that I am missing data for the 11th and 12th of Jan, and then add it to the top of the data frame. I am quite comfortable with compiling a data frame using .read_html, and generally building a data frame, but this is a bit beyond my talents currently.
I know the done thing is usually to show you what I have so far attempted but to be honest I actually don't know where to begin with this one.
Many thanks
Lets say the old dataframe as df which looks like:
Date Price
0 2019-01-10 100
1 2019-01-09 95
2 2019-01-08 96
3 2019-01-07 97
After 2 days you download a data which gives you 2 rows for 2019-01-11 and 2019-01-12, lets name it new_df (values are just as examples):
Date Price
0 2019-01-12 67
1 2019-01-11 89
2 2019-01-10 100
3 2019-01-09 95
Note: there are a few values in the new df which are present in the old df.
Using df.append() , df.drop_duplicates() and df.sort_values() :-
>>df.append(new_df,ignore_index=True).drop_duplicates().sort_values(by='Date',ascending=False)
Date Price
4 2019-01-12 67
5 2019-01-11 89
0 2019-01-10 100
1 2019-01-09 95
2 2019-01-08 96
3 2019-01-07 97
This will append the new values and sort them in descending manner based on Date column keeping the latest date at the top.
if you want the index sorted just add sort_index() in the end : df.append(new_df,ignore_index=True).drop_duplicates().sort_values(by='Date',ascending=False).sort_index()

how to handle the missing values like this and date format for regression?

I want to make the regression model from this dataset(first two are dependent variable and last one is dependent variable).I have import dataset using dataset=pd.read_csv('data.csv')
Now I have made model previously also but never have done with date format dataset as independent variable so how should we handle these date format to make the regression model.
also how should we handle 0 value data in given dataset.
My dataset is like:in .csv format:
Month/Day, Sales, Revenue
01/01 , 0 , 0
01/02 , 100000, 0
01/03 , 400000, 0
01/06 ,300000, 0
01/07 ,950000, 1000000
01/08 ,10000, 15000
01/10 ,909000, 1000000
01/30 ,12200, 12000
02/01 ,950000, 1000000
02/09 ,10000, 15000
02/13 ,909000, 1000000
02/15 ,12200, 12000
I don't know to handle this format date and 0 value
Here's a start. I saved your data into a file and stripped all the whitespace.
import pandas as pd
df = pd.read_csv('20180112-2.csv')
df['Month/Day'] = pd.to_datetime(df['Month/Day'], format = '%m/%d')
print(df)
Output:
Month/Day Sales Revenue
0 1900-01-01 0 0
1 1900-01-02 100000 0
2 1900-01-03 400000 0
3 1900-01-06 300000 0
4 1900-01-07 950000 1000000
5 1900-01-08 10000 15000
6 1900-01-10 909000 1000000
7 1900-01-30 12200 12000
8 1900-02-01 950000 1000000
9 1900-02-09 10000 15000
10 1900-02-13 909000 1000000
11 1900-02-15 12200 12000
The year defaults to 1900 since it is not provided in your data. If you need to change it, that's an additional, different question. To change the year, see: Pandas: Change day
import datetime as dt
df['Month/Day'] = df['Month/Day'].apply(lambda dt: dt.replace(year = 2017))
print(df)
Output:
Month/Day Sales Revenue
0 2017-01-01 0 0
1 2017-01-02 100000 0
2 2017-01-03 400000 0
3 2017-01-06 300000 0
4 2017-01-07 950000 1000000
5 2017-01-08 10000 15000
6 2017-01-10 909000 1000000
7 2017-01-30 12200 12000
8 2017-02-01 950000 1000000
9 2017-02-09 10000 15000
10 2017-02-13 909000 1000000
11 2017-02-15 12200 12000
Finally, to find the correlation between columns, just use df.corr():
print(df.corr())
Output:
Sales Revenue
Sales 1.000000 0.953077
Revenue 0.953077 1.000000
How to handle missing data?
There is a number of ways to replace it. By average, by median or using moving average window or even RF-approach (or similar, MICE and so on).
For 'sales' column you can try any of this methods.
For 'revenue' column better not to use any of this especially if you have many missing values (it will harm the model). Just remove rows with missing values in 'revenue' column.
By the way, a few methods in ML accept missing values: XGBoost and in some way Trees/Forests. For the latest ones you may replace zeroes to some very different values like -999999.
What to do with the data?
Many things related to feature engineering can be done here:
1. Day of week
2. Weekday or weekend
3. Day in month (number)
4. Pre- or post-holiday
5. Week number
6. Month number
7. Year number
8. Indication of some factors (for example, if it is fruit sales data you can some boolean columns related to it)
9. And so on...
Almost every feature here should be preprocessed via one-hot-encoding.
And clean from correlations of course if you use linear models.

Generate Pandas DF OHLC data with Numpy

I would like to generate the following test data in my dataframe in a way similar to this:
df = pd.DataFrame(data=np.linspace(1800, 100, 400), index=pd.date_range(end='2015-07-02', periods=400), columns=['close'])
df
close
2014-05-29 1800.000000
2014-05-30 1795.739348
2014-05-31 1791.478697
2014-06-01 1787.218045
But using the following criteria:
intervals of 1 minute
increments of .25
prices moving up and down around 1800.00
maximum 2100.00, minimum 1700.00
parse_dates= "Timestamp"
Volume column rows have a range of min = 50 - max = 300
Day start 09:30 Day End 16:29:59
Please see desired output:
Open High Low Last Volume
Timestamp
2014-03-04 09:30:00 1783.50 1784.50 1783.50 1784.50 171
2014-03-04 09:31:00 1784.75 1785.75 1784.50 1785.25 28
2014-03-04 09:32:00 1785.00 1786.50 1785.00 1786.50 81
2014-03-04 09:33:00 1786.00
I have limited python experience and find the example for Numpy etc hard to follow as they look to be focused on academia. Is it possible to assist with this?