Discrepancy in Azure SQL DTU reporting? - azure-sql-database

Refer to the DTU graphs below.
• Both graphs show DTU consumption for the same period, but captured at different times.
• The graph on the left was captured minutes after a DTU-consuming event;
• the graph on the right was captured about 19 hours later.
Why are the two graphs different?

The difference is in the granularity of the data points: both graphs cover the same period (likely via the 'custom' view of the DTU percentage and other metrics), but the granularity of the underlying data has changed. The granularity for the last hour of data is 5 seconds, whereas for a multi-hour window it is 5 minutes, and each 5-minute data point is the average of the sixty 5-second samples it covers, so short spikes get smoothed away.
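As a toy illustration of that averaging (a pandas sketch with synthetic numbers, not actual Azure telemetry), note how a spike that is obvious at 5-second granularity nearly vanishes once averaged into 5-minute buckets:

    import pandas as pd

    # Synthetic DTU% samples at 5-second granularity: idle at 5%,
    # with a one-minute spike to 90% during the hour.
    idx = pd.date_range("2021-01-01 00:00", periods=720, freq="5s")  # one hour
    dtu = pd.Series(5.0, index=idx)
    dtu["2021-01-01 00:20:00":"2021-01-01 00:20:55"] = 90.0

    print(dtu.max())                          # 90.0 -> peak in the 1-hour (5 s) view
    print(dtu.resample("5min").mean().max())  # 22.0 -> peak in the multi-hour (5 min) view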
I'll verify this with the engineering team and update if it is inaccurate.

Related

Does a deeper LSTM need more units?

I'm applying an LSTM to time series forecasting with 20 lags. Suppose we have two cases: the first uses just five lags, and the second (like my case) uses 20 lags. Is it correct that the second case needs more units than the first? If yes, how can we support this idea? I have 2000 samples for training the model, so that is the main limitation on increasing the number of units here.
It is very difficult to give an exact answer, as the relationship between timesteps and the number of hidden units is not an exact science. For example, the following factors can affect the number of units required.
Short term memory problem vs long-term memory problem
If your problem can be solved with relatively little memory (i.e. it requires remembering only a few time steps), you won't get much benefit from adding more neurons as you increase the number of steps.
The amount of data
If you don't have enough data for the model to learn from (which I feel like you will run into with 2000 data points - but I could be wrong), then increasing the number of timesteps won't help you much.
The type of model you use
Depending on the type of model you use (e.g. LSTM / GRU) you might get different results (this is not always true, but it can happen for certain problems).
I'm sure there are other factors out there, but these are a few that came to mind.
Proving that more units give better results with more time steps (if true)
That should be relatively easy, as you can try a few different options:
5 lags with 10 / 20 / 50 hidden units
20 lags with 10 / 20 / 50 hidden units
And if you get better performance (e.g. lower MSE) on the 20-lag problem than on the 5-lag problem (when you use 50 units), then you have made your point. You can reinforce your claims by showing results with different types of models (e.g. LSTMs vs GRUs).
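A minimal Keras sketch of that grid (the sine series and the make_windows helper are stand-ins for your own data and windowing code):

    import numpy as np
    from tensorflow import keras

    def make_windows(series, lags):
        """Slice a 1-D series into (samples, lags, 1) inputs and next-step targets."""
        X = np.stack([series[i:i + lags] for i in range(len(series) - lags)])
        return X[..., None], series[lags:]

    def run_trial(lags, units, X, y):
        """Train a small LSTM and return the best validation MSE."""
        model = keras.Sequential([
            keras.layers.Input(shape=(lags, 1)),
            keras.layers.LSTM(units),
            keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")
        hist = model.fit(X, y, validation_split=0.2, epochs=20, verbose=0)
        return min(hist.history["val_loss"])

    series = np.sin(np.linspace(0, 100, 2000))  # toy stand-in for your 2000 samples
    for lags in (5, 20):
        X, y = make_windows(series, lags)
        for units in (10, 20, 50):
            print(f"lags={lags} units={units} val_mse={run_trial(lags, units, X, y):.4f}")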

How can I combine two time-series datasets with different time-steps?

I want to train a multivariate LSTM model using data from two datasets, MIMIC-1.0 and MIMIC-III. The problem is that the vital signs in the first dataset are recorded minute by minute, while in MIMIC-III the data is recorded hourly, so there is an interval difference between the two datasets.
I want to predict diagnoses from the vital signs by giving streams/sequences of vital signs to my model every 5 minutes. How can I merge the two datasets for my model?
You need to find a common field on which you can merge, e.g. patient_ids or the like. You can do the same with ICU episode identifiers. It's been a while since I worked on the MIMIC dataset, so I can't recall exactly what those fields were. To align the two sampling rates on a common 5-minute grid:
Dataset   | Granularity | Resampling to 5-minutely
MIMIC-I   | Minutely    | Subsample every 5th reading
MIMIC-III | Hourly      | Interpolate the 11 five-minutely readings between each pair of consecutive hourly readings
The interpolation method you choose to get the between-hour readings could be as simple as forward-filling the last value. If the readings are more volatile, a more complex method may be appropriate.
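A minimal pandas sketch of bringing both sources onto a common 5-minute grid (column names, frequencies and values here are made up; adapt them to the actual MIMIC schemas):

    import pandas as pd

    # Assume each frame holds one patient/episode, indexed by timestamp.
    minutely = pd.DataFrame(  # MIMIC-I style: one row per minute
        {"heart_rate": range(120)},
        index=pd.date_range("2021-01-01", periods=120, freq="1min"),
    )
    hourly = pd.DataFrame(    # MIMIC-III style: one row per hour
        {"heart_rate": [80, 84, 90]},
        index=pd.date_range("2021-01-01", periods=3, freq="1h"),
    )

    # MIMIC-I: keep every 5th reading (the first value in each 5-minute bucket).
    five_min_a = minutely.resample("5min").first()

    # MIMIC-III: upsample to 5 minutes; forward-fill is the simplest interpolation,
    # .interpolate("time") is a smoother alternative for volatile vitals.
    five_min_b = hourly.resample("5min").ffill()

    # Pool both sources (different patients) into one training set.
    combined = pd.concat([five_min_a, five_min_b], keys=["mimic1", "mimic3"])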

Getting fuel% from analog data

I am getting analog voltage data, in mV, from a fuel gauge. The calibration readings were taken at every 10% change on the fuel gauge, as listed below:
0% - 2000mV
10% - 2100mV
20% - 3200mV
30% - 3645mV
40% - 3755mV
50% - 3922mV
60% - 4300mV
70% - 4500mV
80% - 5210mV
90% - 5400mV
100% - 5800mV
The tank capacity is 45L.
Post-calibration, suppose I get a reading from the ADC of, say, 3000mV. How do I calculate the exact % of fuel left in the tank?
If you plot the transfer function of ADC reading against percentage tank contents, you get a graph like this:
There appears to be a fair degree of non-linearity in the relationship between the sensor and the measured quantity. This could be down to a measurement error made while performing the calibration, or it could be a true non-linear relationship between the sensor reading and the tank contents. Either way, using these results as-is will give fairly inaccurate estimates of tank contents because of the non-linearity of the transfer function.
If the relationship is linear, or can be described by another mathematical relationship, then you can interpolate between known points using that relationship.
If the relationship is not linear, then you will need many more known points in your calibration data so that the errors due to interpolation between points are minimised.
The percentage value corresponding to an ADC reading can be approximated by finding the calibration entries above and below the reading that has been taken; for the ADC reading example in the question these would be the 10% and 20% values:
Interpolation_Proportion = (ADC - ADC_Below) / (ADC_Above - ADC_Below) ;
Percent = Percent_Below + (Interpolation_Proportion * (Percent_Above - Percent_Below)) ;
Interpolation proportion = (3000-2100)/(3200-2100)
= 900/1100
= 0.82
Percent = 10 + (0.82 * (20 - 10))
= 10 + 8.2
= 18.2%
Capacity = 45 * 18.2 / 100
= 8.19 litres
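The same piecewise-linear lookup as a small runnable Python function, using the calibration table from the question:

    CALIBRATION = [  # (millivolts, percent full)
        (2000, 0), (2100, 10), (3200, 20), (3645, 30), (3755, 40), (3922, 50),
        (4300, 60), (4500, 70), (5210, 80), (5400, 90), (5800, 100),
    ]

    def fuel_percent(adc_mv, table=CALIBRATION):
        """Interpolate linearly between the two calibration points bracketing adc_mv."""
        for (mv_lo, pct_lo), (mv_hi, pct_hi) in zip(table, table[1:]):
            if mv_lo <= adc_mv <= mv_hi:
                proportion = (adc_mv - mv_lo) / (mv_hi - mv_lo)
                return pct_lo + proportion * (pct_hi - pct_lo)
        raise ValueError("reading outside calibration range")

    print(fuel_percent(3000))             # ~18.2 (percent)
    print(45 * fuel_percent(3000) / 100)  # ~8.2 litres in the 45 L tank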
When plotted, the data appears broadly linear, with some outliers. These are likely experimental error, possibly influenced by confounding factors such as electrical noise or temperature variation, or even just the liquid slopping around! Without details of how the data was gathered, and how carefully, it is not possible to tell; I would ask how many samples were taken per measurement, whether they are averaged or instantaneous, and whether the results are exactly repeatable over more than one experiment.
Assuming the results are "indicative" only, it is probably wisest, given the data you do have, to assume that the transfer function is linear and to perform a linear regression on the scatter plot of your test data. That can be done most easily using any spreadsheet's charting "trendline" function:
From your data the transfer function is:
Fuel% = (0.0262 x SensormV) - 54.5
So for your example 3000mV, Fuel% = (0.0262 x 3000) - 54.5 = 24.1%
For your 45L tank that equates to about 10.8 Litres.
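The same fit can be reproduced outside a spreadsheet with an ordinary least-squares fit, e.g. numpy's polyfit, which returns roughly the same coefficients:

    import numpy as np

    mv  = np.array([2000, 2100, 3200, 3645, 3755, 3922, 4300, 4500, 5210, 5400, 5800])
    pct = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

    slope, intercept = np.polyfit(mv, pct, 1)  # ~0.0262 and ~-54.5
    fuel_pct = slope * 3000 + intercept        # ~24 %
    print(fuel_pct, 45 * fuel_pct / 100)       # percent and litres for the 45 L tank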

How to train an LSTM for multiple time series data - both univariate and multivariate scenarios?

I have data for hundreds of devices (pardon me, I am not giving much detail about the devices or the data recorded for them). For each device, data is recorded on an hourly basis.
The recorded data has 25 dimensions.
I have a few prediction tasks, such as time series forecasting, for which I am using an LSTM. Because I have hundreds of devices, and each device is a (multivariate) time series, my data as a whole is multiple time series with multivariate data.
To deal with multiple time series, my first approach is to concatenate the data one series after another, treat the result as one time series (univariate or multivariate), and train my LSTM model on it.
But with this approach (concatenating time series data), I am losing the time property of my data, so I need a better approach.
Please suggest some ideas, or blog posts.
Kindly don't confuse multiple time series with multivariate time series data.
You may consider a one-fits-all model or Seq2Seq, as e.g. this Google paper suggests. The approach works as follows:
Let us assume that you want to make a 1-day-ahead forecast (24 values) and are using the last 7 days (7 * 24 = 168 values) as input.
In time series analysis the data is time-dependent, so you need a validation strategy that respects this time dependence, e.g. a rolling-forecast approach. Set aside hold-out data for testing your final trained model.
In the first step you generate, from your many time series, slices of length 168 + 24 (see the Google paper for an illustration). The x input will have length 168 and the y target length 24. Use all of the generated slices for training the LSTM/GRU network, and finally predict on your hold-out set.
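A minimal numpy sketch of that slicing step (random arrays stand in for the real device data; production code would also scale features and carry a device identifier):

    import numpy as np

    IN_LEN, OUT_LEN = 168, 24  # last 7 days in, next day out

    def make_slices(series):
        """Cut one device's (timesteps, features) array into (x, y) training slices."""
        xs, ys = [], []
        for start in range(len(series) - IN_LEN - OUT_LEN + 1):
            xs.append(series[start : start + IN_LEN])
            ys.append(series[start + IN_LEN : start + IN_LEN + OUT_LEN, 0])  # forecast feature 0
        return np.array(xs), np.array(ys)

    # Pool slices from every device into one training set ("one-fits-all").
    devices = [np.random.rand(1000, 25) for _ in range(3)]  # stand-in device data
    slices = [make_slices(d) for d in devices]
    X = np.concatenate([x for x, _ in slices])
    Y = np.concatenate([y for _, y in slices])
    print(X.shape, Y.shape)  # (n_slices, 168, 25) and (n_slices, 24)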
Good papers on this issue:
Foundations of Sequence-to-Sequence Modeling for Time Series
Deep and Confident Prediction for Time Series at Uber
Kaggle Winning Solution
Kaggle Web Traffic Time Series Forecasting
The list is not comprehensive, but you can use it as a starting point.

Accord.Net Implementation of weather scenario

I am trying to implement a prediction application for weather based on hidden markov models, using the accord framework. I am having some trouble on how to map the concepts into HMM structures and would like to get some insights. I am starting off with their sample application that can be found here: https://github.com/accord-net/framework/tree/master/Samples/Statistics/Gestures%20(HMMs)
Imagine the following scenario:
I am told every 6 hours what the weather is like: Cloudy, Sunny or Rainy. These would be my states in the framework, correct?
Besides that, I have access to results of two different instruments, which are an air humidity meter and a wind speed meter. For simplicity, let's assume that both instruments provide a measure from 0 to 100, with 4 ranges. I would have something like 0, 1, 2 and 3 for observations regarding the humidity (0-25, 26-50, 51-75, 76-100) and the same ranges for wind would have values 4, 5, 6 and 7. These would be my observable values for sequences.
For a couple of days, I store the observations from those two instruments and save the data for later use in training.
One of the questions I have is about timing. Since I only learn the state every 6 hours, does it make sense, or is it even possible, to store instrument observations at a different rate? For example, if I stored observations every hour, I would end up with a 12-element sequence and the corresponding state, something like this for the first 12 hours:
0-4-0-5-0-4-1-7-1-6-0-4 - Cloudy
0-4-0-5-0-4-0-4-0-5-0-4 - Sunny
The 12-element sequence would be:
First hour observation of humidity - observation of wind speed (0-4)
Second hour observation of humidity - observation of wind speed (0-5)
and so on...
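For concreteness, here is a small sketch (Python, made-up readings, independent of the Accord API) of how I would build such a 12-element sequence:

    def bin_reading(value):
        """Map a 0-100 reading to a range index: 0-25 -> 0, 26-50 -> 1, 51-75 -> 2, 76-100 -> 3."""
        if value <= 25:
            return 0
        if value <= 50:
            return 1
        if value <= 75:
            return 2
        return 3

    def encode(humidity, wind):
        """Interleave hourly humidity (symbols 0..3) and wind (symbols 4..7) readings."""
        seq = []
        for h, w in zip(humidity, wind):
            seq.append(bin_reading(h))      # humidity symbol
            seq.append(bin_reading(w) + 4)  # wind symbol, offset by 4
        return seq

    humidity = [10, 12, 8, 30, 27, 9]    # six hourly humidity readings
    wind     = [55, 60, 51, 90, 70, 52]  # six hourly wind readings
    print(encode(humidity, wind))        # [0, 6, 0, 6, 0, 6, 1, 7, 1, 6, 0, 6]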
Should I, besides observation sequences and states, use labels for each of the instruments? Something like this:
0-0-0-1-1-0 - Humidity - Cloudy
4-5-4-4-5-4 - Wind Sp - Sunny
0-0-0-1-1-0 - Humidity - Cloudy
4-5-4-4-5-4 - Wind Sp - Sunny
The labels would be the instruments being measured, the sequences would be the observed values, and the states would be the weather at the end of each 6-hour period.
With this, I would like to be able to feed this information back to the model and predict the next state. Am I approaching this problem correctly? Would I be able to do what I need like this?
Thank you.