Normalize time variable for recurrent LSTM Neural Network using Keras - tensorflow

I am using Keras to create an LSTM neural network that can predict the concentration of a certain drug in the blood. I have a dataset with the time stamps at which a drug dosage was administered and at which the concentration in the blood was measured. The dosage and measurement time stamps are disjoint. Furthermore, several other variables are measured at all time steps (both dosages and measurements). These variables are the input for my model, along with the dosages (0 when no dosage was given at time t). The observed concentration in the blood is the response variable.
I have normalized all input features using the MinMaxScaler().
Q1:
Now I am wondering, do I need to normalize the time variable that corresponds with all rows as well and give it as input to the model? Or can I leave this variable out since the time steps are equally spaced?
The data looks like:
PatientID Time Dosage DosageRate DrugConcentration
1 0 100 12 NA
1 309 100 12 NA
1 650 100 12 NA
1 1030 100 12 NA
1 1320 NA NA 12
1 1405 100 12 NA
1 1812 90 8 NA
1 2078 90 8 NA
1 2400 NA NA 8
2 0 120 13.5 NA
2 800 120 13.5 NA
2 920 NA NA 16
2 1515 120 13.5 NA
2 1832 120 13.5 NA
2 2378 120 13.5 NA
2 2600 120 13.5 NA
2 3000 120 13.5 NA
2 4400 NA NA 2
As you can see, the time between two consecutive dosages and measurements differs for a patient and between patients, which makes the problem difficult.
Q2:
One approach I can think of is aggregating over measurement intervals and taking the average dosage and SD between two measurements. Then we only predict at time stamps for which we know the observed drug concentration. Would this work, or would we lose too much information?
Q3:
A second approach I could think of is to create new data points, so that all intervals between dosages are the same, and set the dosage and dosage rate at those time points to zero. The disadvantage is that we can then only calculate the error at the time stamps for which we know the observed drug concentration. How should we tackle this?
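One possible way to address both the irregular spacing (Q1) and the sparse targets (Q3), sketched under assumptions: pass the elapsed time since the previous row as an extra feature instead of the raw time stamp, and compute the error only on rows where a concentration was observed. The column name DeltaT, the helper masked_mse, and the constant placeholder predictions are hypothetical and introduced only for illustration; no model is trained here.

```python
import numpy as np
import pandas as pd

# Patient 1 from the question (DosageRate omitted for brevity)
df = pd.DataFrame({
    "PatientID": [1] * 9,
    "Time": [0, 309, 650, 1030, 1320, 1405, 1812, 2078, 2400],
    "Dosage": [100, 100, 100, 100, np.nan, 100, 90, 90, np.nan],
    "DrugConcentration": [np.nan] * 4 + [12] + [np.nan] * 3 + [8],
})

# Elapsed time since the previous row of the same patient (0 for the first row)
df["DeltaT"] = df.groupby("PatientID")["Time"].diff().fillna(0)

# As in the question: dosage is 0 when no dosage was given at time t
df["Dosage"] = df["Dosage"].fillna(0)

# True only on rows where a concentration was actually measured
mask = df["DrugConcentration"].notna().to_numpy()


def masked_mse(y_true, y_pred, mask):
    """Mean squared error restricted to the rows selected by mask."""
    diff = (y_true - y_pred)[mask]
    return float(np.mean(diff ** 2))


y_true = df["DrugConcentration"].to_numpy()
y_pred = np.full(len(df), 10.0)  # placeholder predictions, not a real model
loss = masked_mse(y_true, y_pred, mask)
print(df[["Time", "DeltaT", "Dosage"]])
print(loss)
```

DeltaT would still need scaling like the other inputs. Whether it helps compared with resampling to a regular grid depends on the data and would have to be checked empirically.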

Related

Date dependent calculation from 2 dataframes - average 6-month return

I am working with the following dataframe. It holds data for multiple companies, with each row associated with a specific datadate, so there are many rows for many companies, with IPO dates from 2009 to 2022.
index ID price daily_return datadate daily_market_return mean_daily_market_return ipodate
0 1 27.50 0.008 01-09-2010 0.0023 0.03345 01-12-2009
1 2 33.75 0.0745 05-02-2017 0.00458 0.0895 06-12-2012
2 3 29,20 0.00006 08-06-2020 0.0582 0.0045 01-05-2013
3 4 20.54 0.00486 09-06-2018 0.0009 0.0006 27-11-2013
4 1 21.50 0.009 02-09-2021 0.0846 0.04345 04-05-2009
5 4 22.75 0.00539 06-12-2019 0.0003 0.0006 21-09-2012
...
26074 rows
I also have a dataframe containing the Market yield on US Treasury securities at 10-year constant maturity - measured daily. Each row represents the return associated with a specific day, each day from 2009 to 2022.
date dgs10
1 2009-01-02 2.46
2 2009-01-05 2.49
3 2009-01-06 2.51
4 2009-01-07 2.52
5 2009-01-08 2.47
6 2009-01-09 2.43
7 2009-01-12 2.34
8 2009-01-13 2.33
...
date dgs10
3570 2022-09-08 3.29
3571 2022-09-09 3.33
3572 2022-09-12 3.37
3573 2022-09-13 3.42
3574 2022-09-14 3.41
My goal is to calculate, for each ipodate (from dataframe 1), the average over the previous 6 months of the market yield on US Treasury securities at 10-year constant maturity (from dataframe 2). The result should either be in a new dataframe or in an additional column in dataframe 1. The two dataframes are not the same length. I tried using rolling(), but it doesn't seem to be working. Does anyone know how to fix this?
import numpy as np
import pandas as pd

# Make sure that all date columns are of type Timestamp. They are a lot easier
# to work with
df1["ipodate"] = pd.to_datetime(df1["ipodate"], dayfirst=True)
df2["date"] = pd.to_datetime(df2["date"])
# Calculate the mean market yield of the previous 6 months. Six months is not a
# fixed length of time, so I replaced it with 180 days. A time-based window
# needs the "date" column sorted in ascending order, hence the sort.
tmp = df2.sort_values("date").rolling("180D", on="date").mean()
# The values of the first 180 days are invalid, because we have insufficient
# data to calculate the rolling mean. You may consider extending df2 further
# back to 2008. (You may come up with other rules for this period.)
is_invalid = (tmp["date"] - tmp["date"].min()) / pd.Timedelta(1, "D") < 180
tmp.loc[is_invalid, "dgs10"] = np.nan
# Result: an ipodate with no exactly matching date in df2 (e.g. a weekend)
# gets NaN from this left merge
result = df1.merge(tmp, left_on="ipodate", right_on="date", how="left")
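The exact-match merge above leaves NaN for any ipodate that is not a trading day in the yield table. A possible variant, sketched here on a few rows taken from the question, uses pd.merge_asof to pick the most recent earlier yield date instead. The two miniature frames and the column name dgs10_mean_180d are assumptions made for illustration, and the invalid-period rule from the answer is left out to keep the sketch short.

```python
import pandas as pd

# First rows of the yield table from the question
df2 = pd.DataFrame({
    "date": pd.to_datetime(["2009-01-02", "2009-01-05", "2009-01-06", "2009-01-07"]),
    "dgs10": [2.46, 2.49, 2.51, 2.52],
})
# Hypothetical IPO dates; 2009-01-04 is a Sunday and has no row in df2
df1 = pd.DataFrame({
    "ID": [1, 2],
    "ipodate": pd.to_datetime(["2009-01-04", "2009-01-07"]),
})

# 180-day rolling mean, as in the answer
rolled = df2.sort_values("date").rolling("180D", on="date").mean()
rolled = rolled.rename(columns={"dgs10": "dgs10_mean_180d"})

# For each ipodate take the last yield date at or before it.
# Both frames must be sorted on their key columns.
result = pd.merge_asof(
    df1.sort_values("ipodate"),
    rolled,
    left_on="ipodate",
    right_on="date",
    direction="backward",
)
print(result)
```

With direction="backward", the Sunday IPO date is matched to the preceding Friday rather than dropped to NaN.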

Quicksight Calculated field: sum of average?

The dataset I have is currently like so:
country  itemid  device    num_purchases  total_views_per_country_and_day  day
USA      ABC     iPhone11  2              900                              2022-06-15
USA      ABC     iPhoneX   5              900                              2022-06-15
USA      DEF     iPhoneX   8              900                              2022-06-15
UK       ABC     iPhone11  10             350                              2022-06-15
UK       DEF     iPhone11  20             350                              2022-06-15
total_views_per_country_and_day is already pre-calculated to be the sum grouped by country and day. That is why for each country-day pair, the number is the same.
I have a Quicksight analysis with a filter for day.
The first thing I want is to have a table on my dashboard that shows the number of total views for each country.
However, if I were to do it with the dataset just like that, the table would sum everything:
country  total_views
USA      900+900+900=2700
UK       350+350=700
So I created a calculated field that takes the average of total_views. That worked, but only when the day filter on my dashboard was set to ONE day.
When filtered for day = 2022-06-15: correct
country  avg(total_views)
USA      2700/3=900
UK       700/2=350
But let's say we have data from 2022-06-16 as well, the averaging method doesn't work, because it will average based on the entire dataset. So, example dataset with two days:
country  itemid  device    num_purchases  total_views_per_country_and_day  day
USA      ABC     iPhone11  2              900                              2022-06-15
USA      ABC     iPhoneX   5              900                              2022-06-15
USA      DEF     iPhoneX   8              900                              2022-06-15
UK       ABC     iPhone11  10             350                              2022-06-15
UK       DEF     iPhone11  20             350                              2022-06-15
USA      ABC     iPhone11  2              1000                             2022-06-16
USA      ABC     iPhoneX   5              1000                             2022-06-16
UK       ABC     iPhone11  10             500                              2022-06-16
UK       DEF     iPhone11  20             500                              2022-06-16
Desired Table Visualization:
country  total_views
USA      900 + 1000 = 1900
UK       350 + 500 = 850
USA calculation: (900 * 3)/3 + (1000 * 2) /2 = 900 + 1000
UK calculation: (350 * 2) /2 + (500 * 2) /2 = 350 + 500
Basically, a sum of averages.
However, instead it is calculated like:
country  avg(total_views)
USA      [(900 * 3) + (1000*2)] / 5 = 940
UK       [(350 * 2) + (500 * 2)] / 4 = 425
I want to be able to use this calculation later on as well to calculate num_purchases / total_views. So ideally I would want it to be a calculated field. Is there a formula that can do this?
I also tried, instead of a calculated field, just aggregating total_views by average instead of sum in the analysis. It has the exact same issue, but I can get a running total if I include day in the table visualization, e.g.:
country  day         running total of avg(total_views)
USA      2022-06-15  900
USA      2022-06-16  900+1000=1900
UK       2022-06-15  350
UK       2022-06-16  350+500=850
So you can see that the total (2nd and 4th rows) is my desired value. However, this is not exactly what I want: I don't want to have to add the day to the table to get it right.
I've also tried avgOver with day as a partition, but that requires day to be in the table visualization as well.
sum({total_views_per_country_and_day}) / distinct_count( {day})
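This calculates the average as the sum of the metric divided by the number of unique days. Note that it reproduces the desired sum of per-day averages only when every day contributes the same number of rows for a country: with the two-day example it gives 1700 / 2 = 850 for UK, but 4700 / 2 = 2350 for USA rather than 1900.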
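To check the arithmetic outside QuickSight, here is a small pandas sketch that runs the question's two-day example through both calculations: the desired sum of per-day averages, and the sum divided by the number of distinct days. The frame below keeps only the columns needed, and the shortened column name total_views is an assumption; it is not a QuickSight formula.

```python
import pandas as pd

# (country, day, total_views_per_country_and_day), one tuple per source row
rows = (
    [("USA", "2022-06-15", 900)] * 3
    + [("UK", "2022-06-15", 350)] * 2
    + [("USA", "2022-06-16", 1000)] * 2
    + [("UK", "2022-06-16", 500)] * 2
)
df = pd.DataFrame(rows, columns=["country", "day", "total_views"])

# Desired: average within each (country, day), then sum per country
per_day = df.groupby(["country", "day"])["total_views"].mean()
sum_of_avgs = per_day.groupby(level="country").sum()

# The suggested expression: sum(total_views) / distinct_count(day)
by_country = df.groupby("country")
ratio = by_country["total_views"].sum() / by_country["day"].nunique()

print(sum_of_avgs)  # USA 1900, UK 850
print(ratio)        # USA 2350, UK 850
```

The two agree for UK because both days have two rows each, and differ for USA because 2022-06-15 has three rows and 2022-06-16 has two.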

Tallying events within specific time prior to a current event

I am trying to tally the number of events that happened in specific periods of time previous to each of my events (day/week/month) in a data frame.
I have a data frame with 50 individuals, each of whom has events scattered over different periods of time (days/weeks/months). Every row in the data frame is an event, and I'm trying to understand how the number of events in the previous day/week/month affected the way the individual responded to the current event. Every event is marked with an individual ID (ID.2) and has a date and time associated with it (Datetime). I have already created columns for day (epd), week (epw), and month (epm), and want to populate them, for each event, with the number of events for that specific individual in the previous day, week, and month respectively.
My data looks like this:
> head(ACss)
Date Datetime ID.2 month day year epd epw epm
1 2019-05-25 2019-05-25 11:57 139 5 25 2019 NA NA NA
2 2019-06-09 2019-06-09 19:42 43 6 9 2019 NA NA NA
3 2019-07-05 2019-07-05 20:12 139 7 5 2019 NA NA NA
4 2019-07-27 2019-07-27 17:27 152 7 27 2019 NA NA NA
5 2019-08-04 2019-08-04 9:13 152 8 4 2019 NA NA NA
6 2019-08-04 2019-08-04 16:18 139 8 4 2019 NA NA NA
I have no idea how to go about doing this so haven't tried anything yet! Any and all suggestions are greatly appreciated!
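The question is about R, but the counting logic can be sketched in Python with pandas on the six rows shown above; the same idea carries over to R. Assumptions made here: a "month" is treated as 30 days, only events strictly before the current one are counted, and the helper count_prior is a hypothetical name.

```python
import pandas as pd

events = pd.DataFrame({
    "ID.2": [139, 43, 139, 152, 152, 139],
    "Datetime": pd.to_datetime([
        "2019-05-25 11:57", "2019-06-09 19:42", "2019-07-05 20:12",
        "2019-07-27 17:27", "2019-08-04 09:13", "2019-08-04 16:18",
    ]),
})


def count_prior(times, window):
    """For each event time, count the earlier events that fall within `window`."""
    return [
        int(((times < t) & (times >= t - window)).sum())
        for t in times
    ]


windows = {
    "epd": pd.Timedelta(days=1),
    "epw": pd.Timedelta(days=7),
    "epm": pd.Timedelta(days=30),  # assumption: one month = 30 days
}

for col, window in windows.items():
    events[col] = 0
    # Count separately for each individual
    for _, group in events.groupby("ID.2"):
        events.loc[group.index, col] = count_prior(group["Datetime"], window)

print(events)
```

This compares every event with every other event of the same individual, which is fine for modest data sizes; a rolling time window would scale better for long histories.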

how to handle the missing values like this and date format for regression?

I want to build a regression model from this dataset (the first two columns are the independent variables and the last one is the dependent variable). I imported the dataset using dataset=pd.read_csv('data.csv').
I have built models before, but never with a date-formatted column as an independent variable. How should I handle the date format in order to build the regression model?
Also, how should I handle the 0 values in the dataset?
My dataset, in .csv format, looks like this:
Month/Day, Sales, Revenue
01/01 , 0 , 0
01/02 , 100000, 0
01/03 , 400000, 0
01/06 ,300000, 0
01/07 ,950000, 1000000
01/08 ,10000, 15000
01/10 ,909000, 1000000
01/30 ,12200, 12000
02/01 ,950000, 1000000
02/09 ,10000, 15000
02/13 ,909000, 1000000
02/15 ,12200, 12000
I don't know how to handle this date format and the 0 values.
Here's a start. I saved your data into a file and stripped all the whitespace.
import pandas as pd
df = pd.read_csv('20180112-2.csv')
df['Month/Day'] = pd.to_datetime(df['Month/Day'], format = '%m/%d')
print(df)
Output:
Month/Day Sales Revenue
0 1900-01-01 0 0
1 1900-01-02 100000 0
2 1900-01-03 400000 0
3 1900-01-06 300000 0
4 1900-01-07 950000 1000000
5 1900-01-08 10000 15000
6 1900-01-10 909000 1000000
7 1900-01-30 12200 12000
8 1900-02-01 950000 1000000
9 1900-02-09 10000 15000
10 1900-02-13 909000 1000000
11 1900-02-15 12200 12000
The year defaults to 1900 since it is not provided in your data. If you need to change it, that's an additional, different question. To change the year, see: Pandas: Change day
# Replace the default year 1900 with the actual year (2017 is used as an example)
df['Month/Day'] = df['Month/Day'].apply(lambda d: d.replace(year=2017))
print(df)
Output:
Month/Day Sales Revenue
0 2017-01-01 0 0
1 2017-01-02 100000 0
2 2017-01-03 400000 0
3 2017-01-06 300000 0
4 2017-01-07 950000 1000000
5 2017-01-08 10000 15000
6 2017-01-10 909000 1000000
7 2017-01-30 12200 12000
8 2017-02-01 950000 1000000
9 2017-02-09 10000 15000
10 2017-02-13 909000 1000000
11 2017-02-15 12200 12000
Finally, to find the correlation between the numeric columns, select them and use corr():
print(df[['Sales', 'Revenue']].corr())
Output:
Sales Revenue
Sales 1.000000 0.953077
Revenue 0.953077 1.000000
How to handle missing data?
There are a number of ways to replace it: by the average, by the median, with a moving-average window, or even with an RF-based approach (or similar, MICE and so on).
For the 'sales' column you can try any of these methods.
For the 'revenue' column it is better not to use any of them, especially if you have many missing values (it will harm the model). Just remove the rows with missing values in the 'revenue' column.
By the way, a few ML methods accept missing values: XGBoost and, to some extent, Trees/Forests. For the latter you may replace zeros with some very different value like -999999.
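The two suggestions above (impute 'sales', drop rows with missing 'revenue') might look like this in pandas. This is only a sketch on the first six rows of the question's data, and it assumes that a 0 means "not recorded" rather than a genuine zero, which the question does not confirm.

```python
import numpy as np
import pandas as pd

# First six rows of the question's data
df = pd.DataFrame({
    "Sales": [0, 100000, 400000, 300000, 950000, 10000],
    "Revenue": [0, 0, 0, 0, 1000000, 15000],
})

# Assumption: 0 marks a missing value
df = df.replace(0, np.nan)

# Impute the feature column with its median
imputed = df.copy()
imputed["Sales"] = imputed["Sales"].fillna(imputed["Sales"].median())

# Do not impute the target: drop rows where it is missing
cleaned = imputed.dropna(subset=["Revenue"])

print(imputed)
print(cleaned)
```

On this small extract only two rows survive the drop, which illustrates the trade-off: removing rows with a missing target is safe but can cost a lot of data.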
What to do with the data?
Many things related to feature engineering can be done here:
1. Day of week
2. Weekday or weekend
3. Day in month (number)
4. Pre- or post-holiday
5. Week number
6. Month number
7. Year number
8. Indication of some factors (for example, if it is fruit sales data you can add some boolean columns related to it)
9. And so on...
Almost every feature here should be preprocessed via one-hot encoding.
And, of course, remove correlated features if you use linear models.
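A few of the calendar features from the list can be derived with the pandas .dt accessor and one-hot encoded with get_dummies. The sketch below uses three example dates and hypothetical column names; holiday flags and domain-specific indicators are left out because they need external information.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-07"]),
})

df["day_of_week"] = df["date"].dt.dayofweek        # Monday=0 ... Sunday=6
df["is_weekend"] = df["date"].dt.dayofweek >= 5    # Saturday or Sunday
df["day_of_month"] = df["date"].dt.day
df["week_number"] = df["date"].dt.isocalendar().week.astype(int)  # ISO week
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year

# One-hot encode a categorical calendar feature
encoded = pd.get_dummies(df, columns=["day_of_week"], prefix="dow")
print(encoded)
```

Note that 2017-01-01 belongs to ISO week 52 of the previous year, so week numbers near year boundaries may need extra care.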

Proc Optmodel SAS Variable not unique

I am using proc optmodel to solve a problem in which several items must be priced the same within the same location (let's say they are different colors of same product and are not currently priced the same). I know that volume will increase/decrease depending on direction of price change, and I have some MIN/MAX constraints as well.
The problem I am running into is that the procedure only reads one group of unique SKUs, I think because the SKUs repeat across locations. How can I get the procedure to optimize all unique combinations of SKU/LOCATION? I tried just changing the item numbers, which of course works, but is not practical for my business solution. Thanks.
data input_data;
input SKU DESC $ LOCATION $ OLD_PRICE MIN MAX LIFT OLD_UNITS;
cards;
111 black NY 12.99 10 15 1.3 100
222 white NY 13.45 11 15 .9 150
333 red NY 13.29 13 15 1.6 200
111 black DC 11.75 10 14 1.2 300
222 white DC 11.75 10 14 1.5 100
333 red DC 11.99 10 14 1.7 140
111 black LA 14.21 12 17 2.0 600
222 white LA 14.79 14 17 1.5 500
333 red LA 15.99 13 17 .3 200
444 orange LA 14.11 12 17 .6 300
;
run;
proc optmodel;
set<num> SKU;
string LOCATION{SKU};
string DESC{SKU};
set LOCATIONS = setof{i in SKU} LOCATION[i];
set SKUperLOCATION{gi in LOCATIONS} = {i in SKU: LOCATION[i] = gi};
number OLD_PRICE{SKU};
number MIN{SKU};
number MAX{SKU};
var NEW_PRICE{gi in LOCATIONs} >= max{i in SKUperLOCATION[gi]} MIN[i] <= min{i in SKUperLOCATION[gi]} MAX[i];
impvar NEW_PRICEbySKU{i in SKU} = NEW_PRICE[LOCATION[i]];
number LIFT{SKU};
number OLD_UNITS{SKU};
read data input_data into
SKU=[SKU]
DESC
LOCATION
OLD_PRICE
MIN
MAX
LIFT
OLD_UNITS;
max sales=sum{gi in LOCATIONs}
sum{i in SKUperLOCATION[gi]}
(NEW_PRICE[gi])*(1-(NEW_PRICE[gi]-OLD_PRICE[i])*LIFT[i]/OLD_PRICE[i])*OLD_UNITS[i];
expand;
solve;
create data results_FAM_maxsales
from [SKU]={SKU}
DESC
LOCATION
OLD_PRICE
NEW_PRICE=NEW_PRICEbySKU
MIN
MAX
LIFT
OLD_UNITS;
print NEW_PRICE sales;
quit;
One way would be to set your unique key to be SKU & Location. I haven't used OPTMODEL in a while, but something like this should work.
set<num,str> SKU_Loc;
num old_price{SKU_Loc};
<code>
read data input_data into SKU_Loc = [SKU Location];
<code>
Then change the rest of the code to reference the unique combination of SKU & location.
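The effect of the key choice can be illustrated outside SAS with a loose Python analogy: indexing the question's rows by SKU alone collapses repeated SKUs, while indexing by the (SKU, location) pair keeps all ten rows. The dictionaries below are hypothetical and only mirror the idea; they say nothing about how OPTMODEL stores its data internally.

```python
# (SKU, LOCATION, OLD_PRICE) taken from the question's input data
rows = [
    (111, "NY", 12.99), (222, "NY", 13.45), (333, "NY", 13.29),
    (111, "DC", 11.75), (222, "DC", 11.75), (333, "DC", 11.99),
    (111, "LA", 14.21), (222, "LA", 14.79), (333, "LA", 15.99),
    (444, "LA", 14.11),
]

# Keyed by SKU only: later rows overwrite earlier ones with the same SKU
by_sku = {sku: price for sku, loc, price in rows}

# Keyed by the (SKU, location) pair: every row keeps its own entry
by_sku_loc = {(sku, loc): price for sku, loc, price in rows}

# Group the composite keys by location, like SKUperLOCATION in the model
locations = sorted({loc for _, loc in by_sku_loc})
sku_per_location = {
    loc: [sku for (sku, l) in by_sku_loc if l == loc] for loc in locations
}

print(len(by_sku), len(by_sku_loc))
print(sku_per_location)
```

With the composite key, each location still yields its own group of SKUs, so one price per location can be defined over those groups as in the original model.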