How to handle missing values like this and the date format for regression? - pandas

I want to build a regression model from this dataset (the first two columns are the independent variables and the last one is the dependent variable). I have imported the dataset using dataset=pd.read_csv('data.csv').
I have built models before, but never with a date column as an independent variable, so how should I handle this date format to build the regression model?
Also, how should I handle the 0 values in the given dataset?
My dataset, in .csv format, looks like this:
Month/Day, Sales, Revenue
01/01 , 0 , 0
01/02 , 100000, 0
01/03 , 400000, 0
01/06 ,300000, 0
01/07 ,950000, 1000000
01/08 ,10000, 15000
01/10 ,909000, 1000000
01/30 ,12200, 12000
02/01 ,950000, 1000000
02/09 ,10000, 15000
02/13 ,909000, 1000000
02/15 ,12200, 12000
I don't know how to handle this date format and the 0 values.

Here's a start. I saved your data into a file and stripped all the whitespace.
import pandas as pd
df = pd.read_csv('20180112-2.csv')
df['Month/Day'] = pd.to_datetime(df['Month/Day'], format = '%m/%d')
print(df)
Output:
Month/Day Sales Revenue
0 1900-01-01 0 0
1 1900-01-02 100000 0
2 1900-01-03 400000 0
3 1900-01-06 300000 0
4 1900-01-07 950000 1000000
5 1900-01-08 10000 15000
6 1900-01-10 909000 1000000
7 1900-01-30 12200 12000
8 1900-02-01 950000 1000000
9 1900-02-09 10000 15000
10 1900-02-13 909000 1000000
11 1900-02-15 12200 12000
The year defaults to 1900 since it is not provided in your data. If you need to change it, see: Pandas: Change day
df['Month/Day'] = df['Month/Day'].apply(lambda d: d.replace(year=2017))
print(df)
Output:
Month/Day Sales Revenue
0 2017-01-01 0 0
1 2017-01-02 100000 0
2 2017-01-03 400000 0
3 2017-01-06 300000 0
4 2017-01-07 950000 1000000
5 2017-01-08 10000 15000
6 2017-01-10 909000 1000000
7 2017-01-30 12200 12000
8 2017-02-01 950000 1000000
9 2017-02-09 10000 15000
10 2017-02-13 909000 1000000
11 2017-02-15 12200 12000
Finally, to find the correlation between columns, just use df.corr():
print(df.corr())
Output:
Sales Revenue
Sales 1.000000 0.953077
Revenue 0.953077 1.000000
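To go from the parsed dates to an actual regression, one minimal sketch (scikit-learn is my assumption here, the question does not name a library) is to turn the date into a numeric feature and treat Revenue as the target:
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical sketch: the parsed date as an ordinal number plus Sales as
# inputs, Revenue as the target
df['date_ordinal'] = df['Month/Day'].map(pd.Timestamp.toordinal)
X = df[['date_ordinal', 'Sales']]
y = df['Revenue']
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)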

How to handle missing data?
There are a number of ways to replace it: by the mean, by the median, with a moving-average window, or even with a model-based approach (random forests, MICE and so on).
For the 'Sales' column you can try any of these methods.
For the 'Revenue' column it is better not to use any of them, especially if you have many missing values (it would harm the model). Just remove the rows with missing values in the 'Revenue' column.
By the way, a few ML methods accept missing values: XGBoost, and to some extent trees/forests. For the latter you may replace the zeroes with some very distinct value such as -999999.
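A minimal sketch of these suggestions, assuming the zeros in the question's sample really mean "missing" (column names follow the CSV above):
import numpy as np

# treat zeros as missing values
df[['Sales', 'Revenue']] = df[['Sales', 'Revenue']].replace(0, np.nan)
# impute Sales (median here; the mean, a rolling mean, MICE etc. are alternatives)
df['Sales'] = df['Sales'].fillna(df['Sales'].median())
# drop the rows where the target (Revenue) is missing
df = df.dropna(subset=['Revenue'])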
What to do with the data?
Many things related to feature engineering can be done here:
1. Day of week
2. Weekday or weekend
3. Day in month (number)
4. Pre- or post-holiday
5. Week number
6. Month number
7. Year number
8. Indicators of external factors (for example, if it is fruit sales data you can add some boolean columns related to that)
9. And so on...
Almost every feature here should be preprocessed via one-hot-encoding.
And, of course, remove correlated features if you use linear models.
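A rough sketch of a few of the features listed above, assuming the 'Month/Day' column has already been parsed with pd.to_datetime as in the first answer, and using pd.get_dummies as the one-hot encoder:
import pandas as pd

df['dow'] = df['Month/Day'].dt.dayofweek               # 1. day of week
df['is_weekend'] = (df['dow'] >= 5).astype(int)        # 2. weekday or weekend
df['day'] = df['Month/Day'].dt.day                     # 3. day in month
df['week'] = df['Month/Day'].dt.isocalendar().week     # 5. week number
df['month'] = df['Month/Day'].dt.month                 # 6. month number
# one-hot encode the categorical features (and drop correlated columns
# afterwards if a linear model is used)
df = pd.get_dummies(df, columns=['dow', 'day', 'week', 'month'], drop_first=True)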

Related

count number of records by month over the last five years where record date > select month

I need to show the number of valid inspectors we have by month over the last five years. Inspectors are considered valid when the expiration date on their certification has not yet passed, recorded as the month end date. The SQL below is the text of the query that counts valid inspectors for January 2017:
SELECT Count(*) AS RecordCount
FROM dbo_Insp_Type
WHERE dbo_Insp_Type.CERT_EXP_DTE >= #2/1/2017#;
Rather than designing 60 queries, one for each month, and compiling the results in a final table (or, err, query), are there other methods I can use that call for less manual input?
From this sample:
Id  CERT_EXP_DTE
1   2022-01-15
2   2022-01-23
3   2022-02-01
4   2022-02-03
5   2022-05-01
6   2022-06-06
7   2022-06-07
8   2022-07-21
9   2022-02-20
10  2021-11-05
11  2021-12-01
12  2021-12-24
this single query:
SELECT
Format([CERT_EXP_DTE],"yyyy-mm") AS YearMonth,
Count(*) AS AllInspectors,
Sum(Abs([CERT_EXP_DTE] >= DateSerial(Year([CERT_EXP_DTE]), Month([CERT_EXP_DTE]), 2))) AS ValidInspectors
FROM
dbo_Insp_Type
GROUP BY
Format([CERT_EXP_DTE],"yyyy-mm");
will return:
YearMonth  AllInspectors  ValidInspectors
2021-11    1              1
2021-12    2              1
2022-01    2              2
2022-02    3              2
2022-05    1              0
2022-06    2              2
2022-07    1              1
Given this sample data:
ID  Cert_Iss_Dte  Cert_Exp_Dte
1   1/15/2020     1/15/2022
2   1/23/2020     1/23/2022
3   2/1/2020      2/1/2022
4   2/3/2020      2/3/2022
5   5/1/2020      5/1/2022
6   6/6/2020      6/6/2022
7   6/7/2020      6/7/2022
8   7/21/2020     7/21/2022
9   2/20/2020     2/20/2022
10  11/5/2021     11/5/2023
11  12/1/2021     12/1/2023
12  12/24/2021    12/24/2023
A UNION query could calculate a record for each of up to 50 months, but since you want 60, UNION is out.
Or a query with 60 calculated fields using IIf() and Count(), referencing a textbox on a form for the start date:
SELECT Count(IIf(CERT_EXP_DTE>=Forms!formname!tbxDate,1,Null)) AS Dt1,
Count(IIf(CERT_EXP_DTE>=DateAdd("m",1,Forms!formname!tbxDate),1,Null)) AS Dt2,
...
FROM dbo_Insp_Type
Using the above data, the following is the output for Feb and Mar 2022. I tested with Cert_Iss_Dte included in the criteria and it did not make a difference for this sample data.
Dt1  Dt2
10   8
Or a report with 60 textboxes, each calling a DCount() expression with the same criteria as used in the query.
Or a VBA procedure that writes data to a 'temp' table.

Date dependent calculation from 2 dataframes - average 6-month return

I am working with the following dataframe. I have data for multiple companies, each row associated with a specific datadate, so I have many rows related to many companies, with IPO dates from 2009 to 2022.
index ID price daily_return datadate daily_market_return mean_daily_market_return ipodate
0 1 27.50 0.008 01-09-2010 0.0023 0.03345 01-12-2009
1 2 33.75 0.0745 05-02-2017 0.00458 0.0895 06-12-2012
2 3 29.20 0.00006 08-06-2020 0.0582 0.0045 01-05-2013
3 4 20.54 0.00486 09-06-2018 0.0009 0.0006 27-11-2013
4 1 21.50 0.009 02-09-2021 0.0846 0.04345 04-05-2009
5 4 22.75 0.00539 06-12-2019 0.0003 0.0006 21-09-2012
...
26074 rows
I also have a dataframe containing the Market yield on US Treasury securities at 10-year constant maturity - measured daily. Each row represents the return associated with a specific day, each day from 2009 to 2022.
date dgs10
1 2009-01-02 2.46
2 2009-01-05 2.49
3 2009-01-06 2.51
4 2009-01-07 2.52
5 2009-01-08 2.47
6 2009-01-09 2.43
7 2009-01-12 2.34
8 2009-01-13 2.33
...
date dgs10
3570 2022-09-08 3.29
3571 2022-09-09 3.33
3572 2022-09-12 3.37
3573 2022-09-13 3.42
3574 2022-09-14 3.41
My goal is to calculate, for each ipodate (from dataframe 1), the average of the previous 6-month return of the Market yield on US Treasury securities at 10-year constant maturity (from dataframe 2). The result should either be in a new dataframe or in an additional column in dataframe 1. The two dataframes are not the same length. I tried using rolling(), but it doesn't seem to be working. Does anyone know how to fix this?
import numpy as np
import pandas as pd

# Make sure that all date columns are of type Timestamp. They are a lot easier
# to work with.
df1["ipodate"] = pd.to_datetime(df1["ipodate"], dayfirst=True)
df2["date"] = pd.to_datetime(df2["date"])
# Calculate the mean market yield of the previous 6 months. Six months is not
# a fixed length of time, so I replaced it with 180 days.
tmp = df2.rolling("180D", on="date").mean()
# The values of the first 180 days are invalid, because we have insufficient
# data to calculate the rolling mean. You may consider extending df2 further
# back to 2008. (You may come up with other rules for this period.)
is_invalid = (tmp["date"] - tmp["date"].min()) / pd.Timedelta(1, "D") < 180
tmp.loc[is_invalid, "dgs10"] = np.nan
# Result
result = df1.merge(tmp, left_on="ipodate", right_on="date", how="left")
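If some ipodates fall on days that are missing from df2 (weekends, holidays), the exact-match merge above leaves NaN for them. A possible alternative, sketched here as an assumption rather than as part of the original answer, is pd.merge_asof, which matches each ipodate to the most recent available yield date; both frames must be sorted on the merge keys:
# alternative to the exact merge above: take the latest yield date on or
# before each ipodate
result = pd.merge_asof(
    df1.sort_values("ipodate"),
    tmp.sort_values("date"),
    left_on="ipodate",
    right_on="date",
    direction="backward",
)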

Pandas: to get mean for each data category daily [duplicate]

I am a somewhat beginner programmer learning Python (+pandas) and I hope I can explain this well enough. I have a large time-series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions, like counting records per hour per day and getting the average per hour over several years. However, I run into trouble when including the 'Id' variable.
I'm looking to get the mean number of people taking a ticket for each hour, for each day of the week (Mon-Fri), and per station.
I have the following, with the datetime set as the index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Trip_hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby(['Id', 'Day_name_no', 'Trip_hour']).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I am looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model on these groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either in my code or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour, to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for each Id, date, dow and hour combination
df = df.groupby(['Id', 'date', 'dow', 'hour'])['Count'].sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour'])['Count'].mean()
You can also group by the 'Id' column and then use the resample function followed by .sum().
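A rough sketch of that idea, assuming Start_date is the DatetimeIndex and the frame has 'Id' and 'Count' columns; note that resample fills hours with no tickets with 0, and those empty hours then also enter the mean:
# tickets per Id per hour, then average per Id, day of week and hour
hourly = df.groupby('Id')['Count'].resample('H').sum().reset_index()
hourly['dow'] = hourly['Start_date'].dt.dayofweek
hourly['hour'] = hourly['Start_date'].dt.hour
mean_per_id = hourly.groupby(['Id', 'dow', 'hour'])['Count'].mean()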

I want to do some aggregations with the help of the Group By function in pandas

My dataset has a date column of dtype 'datetime64[ns]'; it also has a price column and a number-of-sales column.
I want to calculate the monthly VWAP (Volume Weighted Average Price) of the stock.
(VWAP = sum(price * no. of sales) / sum(no. of sales))
What I have done so far: created new dataframe columns for month and year using pandas functions.
Now I want the monthly VWAP from this modified dataset, and it should be distinct by year.
For example, March 2016 and March 2017 should each have their own separate monthly VWAP value.
Start by defining a function to compute the VWAP for the current month (a group of rows):
def vwap(grp):
    return (grp.price * grp.salesNo).sum() / grp.salesNo.sum()
Then apply it to monthly groups:
df.groupby(df.dat.dt.to_period('M')).apply(vwap)
Using the following test DataFrame:
dat price salesNo
0 2018-05-14 120.5 10
1 2018-05-16 80.0 22
2 2018-05-20 30.2 12
3 2018-08-10 75.1 41
4 2018-08-20 92.3 18
5 2019-05-10 10.0 33
6 2019-05-20 20.0 41
(containing data from the same months in different years), I got:
dat
2018-05 75.622727
2018-08 80.347458
2019-05 15.540541
Freq: M, dtype: float64
As you can see, the result contains separate entries for May in both
years from the source data.
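For what it is worth, the same monthly VWAP can also be computed without apply(); this is just a sketch using the columns of the test DataFrame above:
# sum the price*volume products and the volumes per month, then divide
monthly = (
    df.assign(pv=df.price * df.salesNo)
      .groupby(df.dat.dt.to_period('M'))[['pv', 'salesNo']]
      .sum()
)
vwap_monthly = monthly['pv'] / monthly['salesNo']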

Normalize time variable for recurrent LSTM Neural Network using Keras

I am using Keras to create an LSTM neural network that can predict the concentration of a certain drug in the blood. I have a dataset with time stamps at which a drug dosage was administered and at which the concentration in the blood was measured. These dosage and measurement time stamps are disjoint. Furthermore, several other variables are measured at all time steps (both dosages and measurements). These variables are the input for my model, along with the dosages (0 when no dosage was given at time t). The observed concentration in the blood is the response variable.
I have normalized all input features using the MinMaxScaler().
Q1:
Now I am wondering, do I need to normalize the time variable that corresponds with all rows as well and give it as input to the model? Or can I leave this variable out since the time steps are equally spaced?
The data looks like:
PatientID Time Dosage DosageRate DrugConcentration
1 0 100 12 NA
1 309 100 12 NA
1 650 100 12 NA
1 1030 100 12 NA
1 1320 NA NA 12
1 1405 100 12 NA
1 1812 90 8 NA
1 2078 90 8 NA
1 2400 NA NA 8
2 0 120 13.5 NA
2 800 120 13.5 NA
2 920 NA NA 16
2 1515 120 13.5 NA
2 1832 120 13.5 NA
2 2378 120 13.5 NA
2 2600 120 13.5 NA
2 3000 120 13.5 NA
2 4400 NA NA 2
As you can see, the time between two consecutive dosages and measurements differs within a patient and between patients, which makes the problem difficult.
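One way to keep that timing information, sketched here purely as an assumption (it is not part of the question), is to turn the absolute Time stamps into per-patient time deltas and scale them like the other input features:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# time elapsed since the previous row of the same patient
df = df.sort_values(['PatientID', 'Time'])
df['TimeDelta'] = df.groupby('PatientID')['Time'].diff().fillna(0)
# scale the new feature like the other inputs
df['TimeDelta_scaled'] = MinMaxScaler().fit_transform(df[['TimeDelta']]).ravel()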
Q2:
One approach I can think of is aggregating over measurement intervals and taking the average dosage and SD between two measurements. Then we only predict at the time stamps for which we know the observed drug concentration. Would this work, or would we lose too much information?
Q3:
A second approach I can think of is to create new data points so that all intervals between dosages are the same, and to set the dosage and dosage rate at those new time points to zero. The disadvantage is that we can then only calculate the error at the time stamps for which we know the observed drug concentration. How should we tackle this?