Sequence-dependent calculation in SQL - sql

I have a silo where I buy material from different suppliers (each time I do this, the added material has a lot number associated) and mix them inside. Then I consume mixed material from this silo.
I want to be able to calculate the proportion existing on the silo at any given moment with SQL.
I know how to do this with sequential programming, but there must be a way to do this in SQL.
The table I have is a history of movements, containing all the necessary information for the calculations:
ID
Date
Lot number (only relevant if it's a buy)
Movement type (buy (1) or consumption (0))
Quantity
Here it is a SQL fiddle for testing: http://sqlfiddle.com/#!3/4ca50/1
In this example, the initial movement is a buy of 10.000 of lot 1000, then a consumption of 1.000 (so 9.000 remaining) and then another buy of 5.000 units of lot 2000.
Now, if I run the desired query up to that moment, it should return:
Lot 1000: 9.000 units (64,29%)
Lot 2000: 5.000 units (35,71%)
Next there are two consumptions for a total of 4.000 units, so at that point there are 10.000 units remaining in the silo.
If I run the query until then, it should keep the same percentage but different amounts:
Lot 1000: 6.429 units (64,29%)
Lot 2000: 3.571 units (35,71%)
Then there is another buy of 8.000 units of lot 3000, making 18.000 units total in the silo, so the expected output is:
Lot 1000: 6.429 units (35,71%)
Lot 2000: 3.571 units (19,83%)
Lot 3000: 8.000 units (44,44%)
And finally, after the last consumption of 2.000 there are 16.000 units remaining, at the same percentages:
Lot 1000: 5.715 units (35,71%)
Lot 2000: 3.173 units (19,83%)
Lot 3000: 7.112 units (44,44%)
I expect you get the idea... basically each buy changes the composition in percentage, and each consumption keeps the proportions but changes the quantities.
I don't know where even to begin with the query... maybe it isn't even doable in pure SQL?
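For reference, here is a minimal sketch in Python of the sequential logic I already have (the exact split of the two consumptions totalling 4.000 is assumed here); whatever the SQL query looks like, it should reproduce these numbers:

movements = [
    # (movement_type, lot, quantity): 1 = buy, 0 = consumption
    (1, 1000, 10000),
    (0, None,  1000),
    (1, 2000,  5000),
    (0, None,  3000),   # assumed split of the two consumptions totalling 4.000
    (0, None,  1000),
    (1, 3000,  8000),
    (0, None,  2000),
]

silo = {}  # lot number -> quantity currently in the silo

for mov_type, lot, qty in movements:
    if mov_type == 1:                       # buy: the lot is added as-is
        silo[lot] = silo.get(lot, 0) + qty
    else:                                   # consumption: every lot shrinks proportionally
        factor = 1 - qty / sum(silo.values())
        silo = {l: q * factor for l, q in silo.items()}

total = sum(silo.values())
for lot, qty in sorted(silo.items()):
    print(f"Lot {lot}: {qty:.0f} units ({100 * qty / total:.2f}%)")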

Include group membership in the model

If you had data like (prices and market-cap are not real)
Date Stock Close Market-cap GDP
15.4.2010 Apple 7.74 1.03 ...
15.4.2010 VW 50.03 0.8 ...
15.5.2010 Apple 7.80 1.04 ...
15.5.2010 VW 52.04 0.82 ...
where Close is the y you want to predict and Market-cap and GDP are your x-variables, would you also include Stock in your model as another independent variable, since it could be, for example, that price formation works differently for Apple than for VW?
If yes, how would you do it? My idea is to assign 0 to Apple and 1 to VW in the column Stock.
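For example, here is a minimal sketch of that idea with pandas and statsmodels (the GDP values are just placeholders, and with real data there would be many more rows):

import pandas as pd
import statsmodels.api as sm

# Toy data in the shape shown above (prices and market-cap are not real; GDP values are placeholders)
df = pd.DataFrame({
    "Stock":     ["Apple", "VW", "Apple", "VW"],
    "Close":     [7.74, 50.03, 7.80, 52.04],
    "MarketCap": [1.03, 0.80, 1.04, 0.82],
    "GDP":       [14.6, 14.6, 14.7, 14.7],
})

# Encode Stock as a 0/1 dummy column (drop_first keeps one category as the baseline)
X = pd.get_dummies(df[["MarketCap", "GDP", "Stock"]], columns=["Stock"], drop_first=True)
X = sm.add_constant(X.astype(float))
y = df["Close"]

model = sm.OLS(y, X).fit()   # Close ~ const + MarketCap + GDP + Stock_VW
print(model.params)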
You first need to identify what exactly you are trying to predict. As it stands, you have longitudinal data: multiple measurements from the same company over a period of time.
Are you trying to predict the close price based on market cap + GDP?
Or are you trying to predict the future close price based on previous close price measurements?
You could stratify based on company name, but it really depends on what you are trying to achieve. What is the question you are trying to answer?
You may also want to take the following considerations into account:
close prices measured at different times for the same company are correlated with each other.
correlations between two measurements taken soon after each other will be stronger than correlations between two measurements far apart in time.
There are four assumptions associated with a linear regression model:
Linearity: The relationship between X and the mean of Y is linear.
Homoscedasticity: The variance of residual is the same for any value of X.
Independence: Observations are independent of each other.
Normality: For any fixed value of X, Y is normally distributed.

Quantity of motion from quaternions

I made some recordings from a head tracker which provides the 4 values of a quaternion; these are saved in a CSV (each row is one quaternion plus a timestamp).
I need to calculate for the whole recording how much the head moved. This is needed for an experiment where I would like to see whether under a condition the head moved more or less compared to another condition.
What is the best way to get a single quantity for each recording?
I have some proposals, but I do not know how appropriate they are:
PROPOSAL 1) I calculate the cumulative sum of the absolute value of the derivatives for each quaternion value, then I sum the 4 sums together to get a single value
PROPOSAL 2) I calculate the cumulative sum of the absolute value of the derivatives of the norm
Sounds like you just want a rough estimate of total angular movement as a single value. One way is to assume the minimum rotation angle between consecutive quaternion samples and then just add up those angles. E.g., suppose two consecutive quaternion samples are q1 and q2. Then compute the quaternion product q = q1 * inv(q2), and your delta-angle for that step is 2*acos(abs(qw)). Do this for each step and add up all the delta angles.
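Here is a minimal sketch in Python, assuming unit quaternions stored in (w, x, y, z) order (the CSV layout is an assumption and depends on your tracker). For unit quaternions the scalar part of q1 * inv(q2) equals dot(q1, q2), so 2*acos(abs(qw)) reduces to 2*acos(abs(dot(q1, q2))):

import numpy as np

def total_rotation(quats):
    """Accumulated minimum rotation angle, in radians, over consecutive quaternion samples."""
    q = np.asarray(quats, dtype=float)
    q /= np.linalg.norm(q, axis=1, keepdims=True)        # re-normalise against sensor drift
    dots = np.abs(np.sum(q[:-1] * q[1:], axis=1))        # |dot(q1, q2)| handles the q / -q ambiguity
    return float(np.sum(2.0 * np.arccos(np.clip(dots, 0.0, 1.0))))

# Hypothetical usage: columns assumed to be timestamp, w, x, y, z with a header row
# data = np.loadtxt("recording.csv", delimiter=",", skiprows=1)
# print(total_rotation(data[:, 1:5]))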

Data selecting in data science project about predicting house prices

This is my first data science project and I need to select some data. Of course, I know that I can not just select all the data available because this will result in overfitting. I am currently investigating house prices in the capital of Denmark for the past 10 years and I wanted to know which type of houses I should select in my data:
Owner-occupied flats and houses (This gives a dataset of 50000 elements)
Or just owner-occupied flats (this gives a dataset of 43000 elements)
So as you can see there are a lot more owner-occupied flats sold in the capital of Denmark. My opinion is that I should select just the owner-occupied flats because then I have the "same" kind of elements in my data and still have 43000 elements.
Also, there are a lot higher taxes involved if you own a house rather than owning an owner-occupied flat. This might affect the price of the house and skew the data a little bit.
I have seen a few projects where both owner-occupied flats and houses are selected for the data and the conclusion was overfitting, so that is what I am looking to avoid.
This is a classic example of over-fitting due to insufficient data.
Let me explain the selection process to resolve this kind of problem. I will use the example of credit card fraud and then relate it to your question, or to any future prediction problem with class-labelled data.
In the real world, credit card fraud is not that common. So, if you look at the real data you will find that only about 2% of it results in fraud. If you train a model with this dataset it will be biased, as you don't have a balanced distribution of the classes (i.e. fraud and non-fraud transactions; in your case owner-occupied flats and houses). There are 4 ways to tackle this issue.
Let's suppose the dataset has 90 non-fraud data points and 10 fraud data points.
1. Under sampling majority class
Here we just select 10 data points from the 90 and train the model on a 10:10 split so the distribution is balanced (in your case, using only 7000 of the 43000 flats). This is not ideal, as we would be throwing away a huge amount of data.
2. Over sampling minority class by duplication
Here we duplicate the 10 data points to make 90, so the distribution is balanced (in your case, duplicating the 7000 house data points until they match the 43000 flats). While this works, there is a better way.
3. Over sampling minority class by SMOTE (recommended)
Synthetic Minority Over-sampling Technique uses the k-nearest-neighbours algorithm to generate synthetic samples of the minority class, in your case the housing data. There is a module named imbalanced-learn (here) which can be used to implement this; a short sketch is shown after this list.
4. Ensemble Method
In this method you divide your data into multiple datasets to make them balanced, for example dividing the 90 into 9 sets so that each set has 10 fraud and 10 non-fraud data points (in your case, dividing the 43000 flats into batches of 7000). Then train a model on each one separately and use a majority-vote mechanism to predict.
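Here is a minimal sketch of option 3 with imbalanced-learn; the file name, the feature columns and the flat/house label column are only assumptions about how your data might be laid out:

import pandas as pd
from imblearn.over_sampling import SMOTE

df = pd.read_csv("copenhagen_sales.csv")          # assumed file
X = df[["size_m2", "rooms", "year_built"]]        # assumed numeric features
y = df["is_house"]                                # assumed label: 1 = house, 0 = flat

# Generate synthetic minority-class (house) samples until both classes are the same size
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)

print(y.value_counts())                           # original, imbalanced counts
print(pd.Series(y_resampled).value_counts())      # balanced counts after SMOTE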
So now I have created the following diagram. The green line shows the price per square metre of owner-occupied flats and the red line shows the price per square metre of houses (all prices in DKK). I was wondering whether this counts as an imbalanced classification? The maximum deviation of the prices is at most 10% (see for example 2018). Is 10% enough to say that the data is biased and therefore imbalanced?

Getting fuel% from analog data

I am getting analog voltage data, in mV, from a fuel gauge. The calibration readings were taken for every 10% change in the fuel gauge as mentioned below :
0% - 2000mV
10% - 2100mV
20% - 3200mV
30% - 3645mV
40% - 3755mV
50% - 3922mV
60% - 4300mV
70% - 4500mV
80% - 5210mV
90% - 5400mV
100% - 5800mV
The tank capacity is 45L.
Post calibration, I am getting reading from adc as let's say, 3000mV. How to calculate the exact % of fuel left in the tank?
If you plot the transfer function of the ADC reading against the percentage tank contents, you get a graph like this:
There appears to be a fair degree of non-linearity in the relationship between the sensor and the measured quantity. This could be down to a measurement error made while performing the calibration, or it could be a true non-linear relationship between the sensor reading and the tank contents. Using these results will give fairly inaccurate estimates of tank contents due to the non-linearity of the transfer function.
If the relationship is linear, or can be described by another mathematical relationship, then you can perform an interpolation between known points using this mathematical relationship.
If the relationship is not linear, then you will need many more known points in your calibration data so that the errors due to the interpolation between points are minimised.
The percentage value corresponding to the ADC reading can be approximated by finding the entries in the calibration table above and below the reading that has been taken - for the ADC reading example in the question these would be the 10% and 20% values:
Interpolation_Proportion = (ADC - ADC_Below) / (ADC_Above - ADC_Below) ;
Percent = Percent_Below + (Interpolation_Proportion * (Percent_Above - Percent_Below)) ;
Interpolation proportion = (3000 - 2100) / (3200 - 2100)
                         = 900 / 1100
                         = 0.82
Percent = 10 + (0.82 * (20 - 10))
        = 10 + 8.2
        = 18.2%
Capacity = 45 * 18.2 / 100
= 8.19 litres
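The same lookup can be scripted; here is a small sketch in Python using numpy.interp for the piecewise-linear interpolation, which reproduces the figures above up to rounding:

import numpy as np

# Calibration table from the question: ADC reading (mV) vs. tank contents (%)
mv  = np.array([2000, 2100, 3200, 3645, 3755, 3922, 4300, 4500, 5210, 5400, 5800])
pct = np.array([   0,   10,   20,   30,   40,   50,   60,   70,   80,   90,  100])
TANK_LITRES = 45

def fuel_from_adc(adc_mv):
    """Interpolate linearly between the two nearest calibration points."""
    percent = np.interp(adc_mv, mv, pct)          # clamps to 0% / 100% outside the table
    return percent, TANK_LITRES * percent / 100

percent, litres = fuel_from_adc(3000)
print(f"{percent:.1f}%  ->  {litres:.2f} L")      # about 18.2% and 8.2 litres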
When plotted, it appears that the data is broadly linear, with some outliers. It is likely that this is experimental error, or possibly the influence of confounding factors such as electrical noise or temperature variation, or even just the liquid slopping around! Without details of how the data was gathered, and how carefully, it is not possible to tell, but I would ask how many samples were taken per measurement, whether these are averaged or instantaneous, and whether the results are exactly repeatable over more than one experiment.
Assuming the results are "indicative" only, it is probably wisest, from the data you do have, to assume that the transfer function is linear, and to perform a linear regression on the scatter plot of your test data. That can be done most easily using any spreadsheet charting "trendline" function:
From your data the transfer function is:
Fuel% = (0.0262 x SensormV) - 54.5
So for your example 3000mV, Fuel% = (0.0262 x 3000) - 54.5 = 24.1%
For your 45L tank that equates to about 10.8 Litres.
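Here is a small sketch of that fit in Python using numpy.polyfit on the calibration table; it recovers roughly the same slope and intercept as the spreadsheet trendline (small differences are just rounding):

import numpy as np

mv  = np.array([2000, 2100, 3200, 3645, 3755, 3922, 4300, 4500, 5210, 5400, 5800])
pct = np.array([   0,   10,   20,   30,   40,   50,   60,   70,   80,   90,  100])

# Least-squares straight line: Fuel% = slope * SensormV + intercept
slope, intercept = np.polyfit(mv, pct, 1)
print(slope, intercept)                            # roughly 0.0262 and -54.5

adc = 3000
fuel_pct = slope * adc + intercept
print(f"{fuel_pct:.1f}%  ->  {45 * fuel_pct / 100:.1f} L")   # roughly 24% and just under 11 litres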

Discrepancy in Azure SQL DTU reporting?

Refer to DTU graph below.
• Both graphs show DTU consumption for the same period, but captured at different times.
• Graph on the left was captured minutes after DTU-consuming event;
• Graph on the right was captured some 19 hrs after.
Why are the two graphs different?
The difference is in the scale of the data points: your graph shows the same scale on the bottom (likely through use of the 'custom' view of the DTU percentage and other metrics), but the granularity of the data has changed. This is a similar question - the granularity for the last hour of data is 5 seconds, whereas the granularity for multiple hours is 5 minutes - and the average of the underlying datapoints becomes the value for each 5 minute data point.
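To illustrate the effect with made-up numbers (synthetic data, not actual Azure metrics): a short spike that is obvious in a 5-second series gets flattened once the same samples are averaged into 5-minute points.

import pandas as pd

# One hour of synthetic DTU% samples at 5-second granularity: mostly idle, with a ~2 minute spike
idx = pd.date_range("2024-01-01 12:00", periods=720, freq="5s")
dtu = pd.Series(5.0, index=idx)
dtu.loc["2024-01-01 12:20":"2024-01-01 12:22"] = 95.0        # the DTU-consuming event

coarse = dtu.resample("5min").mean()                         # what a multi-hour view would plot

print(dtu.max())      # 95.0 - the spike is visible at 5-second granularity
print(coarse.max())   # ~42  - the same spike averaged into a 5-minute data point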
I'll verify this with the engineering team and update if it is inaccurate.