SQL linear interpolation based on lookup table

I need to build linear interpolation into an SQL query, using a joined table containing lookup values (more like lookup thresholds, in fact). As I am relatively new to SQL scripting, I have searched for example code to point me in the right direction, but most of the SQL scripts I came across were for interpolating between dates and timestamps, and I couldn't relate these to my situation.
Basically, I have a main data table with many rows of decimal values in a single column, for example:
Main_Value
0.33
0.12
0.56
0.42
0.1
Now, I need to yield interpolated data points for each of the rows above, based on a joined lookup table with 6 rows, containing non-linear threshold values and the associated linear normalized values:
Threshold_Level Normalized_Value
0 0
0.15 20
0.45 40
0.60 60
0.85 80
1 100
So, for example, if the value in the Main_Value column is 0.45, the query will look up its position at (or between) the nearest Threshold_Level values and interpolate based on the adjacent values in the Normalized_Value column (which would yield a value of 40 in this example).
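To make the arithmetic I am after concrete, here it is sketched in Python (the function name and layout are purely illustrative; this is not existing code):
# Lookup table from above as (Threshold_Level, Normalized_Value) pairs.
thresholds = [(0.0, 0), (0.15, 20), (0.45, 40), (0.60, 60), (0.85, 80), (1.0, 100)]

def interpolate(value):
    # Bracketing rows: highest threshold <= value and lowest threshold >= value.
    low = max((t for t in thresholds if t[0] <= value), key=lambda t: t[0])
    high = min((t for t in thresholds if t[0] >= value), key=lambda t: t[0])
    if low[0] == high[0]:
        return low[1]  # the value sits exactly on a threshold
    # Standard linear interpolation between the two bracketing rows.
    return low[1] + (value - low[0]) * (high[1] - low[1]) / (high[0] - low[0])

print(interpolate(0.45))  # 40.0, as in the example above
print(interpolate(0.30))  # 30.0, halfway between the 0.15 -> 20 and 0.45 -> 40 rows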
I really would be grateful for any insight into building a SQL query around this, especially as it has been hard to track down any SQL examples of linear interpolation using a joined table.
It has been pointed out that I could use some sort of rounding, so I have included a more detailed table below. I would like the SQL query to look up each Main_Value (from the first table above) where it falls between the Threshold_Min and Threshold_Max values in the table below, and return the 'Normalized_%' value:
Threshold_Min Threshold_Max Normalized_%
0.00 0.15 0
0.15 0.18 5
0.18 0.22 10
0.22 0.25 15
0.25 0.28 20
0.28 0.32 25
0.32 0.35 30
0.35 0.38 35
0.38 0.42 40
0.42 0.45 45
0.45 0.60 50
0.60 0.63 55
0.63 0.66 60
0.66 0.68 65
0.68 0.71 70
0.71 0.74 75
0.74 0.77 80
0.77 0.79 85
0.79 0.82 90
0.82 0.85 95
0.85 1.00 100
For example, if the value from the Main_Value table is 0.52, it falls between Threshold_Min 0.45 and Threshold_Max 0.60, so the Normalized_% returned is 50%. The problem is that the Threshold_Min and Max values are not linear. Could anyone point me in the direction of how to script this?
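For clarity, the lookup I am describing is just an interval containment test per row; in Python it would look roughly like this (table abbreviated, function name made up):
# A few rows of the (Threshold_Min, Threshold_Max, Normalized_%) table above.
ranges = [(0.00, 0.15, 0), (0.42, 0.45, 45), (0.45, 0.60, 50), (0.85, 1.00, 100)]

def normalized_pct(value):
    for lo, hi, pct in ranges:
        # Treat each row as the half-open interval [Threshold_Min, Threshold_Max),
        # with the final row closed at 1.00.
        if lo <= value < hi or (hi == 1.00 and value == 1.00):
            return pct
    return None

print(normalized_pct(0.52))  # 50, as in the example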

Assuming that for each Main_Value you want the Normalized_Value of the nearest Threshold_Level that is lower than or equal to it, you can do it like this:
select t1.Main_Value, max(t2.Normalized_Value) as Normalized_Value
from #t1 t1
inner join #t2 t2 on t1.Main_Value >= t2.Threshold_Level
group by t1.Main_Value
Replace #t1 and #t2 with the correct table names.
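To sanity-check, the same "nearest lower or equal" rule applied to the sample Main_Values in plain Python (purely for illustration, not part of the query) gives:
thresholds = [(0.0, 0), (0.15, 20), (0.45, 40), (0.60, 60), (0.85, 80), (1.0, 100)]
for v in [0.33, 0.12, 0.56, 0.42, 0.1]:
    print(v, max(n for t, n in thresholds if v >= t))
# 0.33 -> 20, 0.12 -> 0, 0.56 -> 40, 0.42 -> 20, 0.1 -> 0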

Related

How can I group a continuous column (0-1) into equal sizes? Scala Spark

I have a dataframe column that I want to split into equal-size buckets. The values in this column are floats between 0 and 1. Most of the data is skewed, so most values fall in the 0.90s or are exactly 1.
Bucket 10: All 1's (the size of this bucket will be different from 2-9 and 1)
Bucket 2-9: Any values > 0 and < 1 (equal sized)
Bucket 1: All 0's (the size of this bucket will be different from 2-9 and 10)
Example:
continuous_number_col   Bucket
0.001                   2
0.95                    9
1                       10
0                       1
This should be how it looks when I groupBy("Bucket"). The counts of buckets 1 and 10 aren't significant here; they will just be in their own buckets. And the counts of 75 will differ in practice; they are just used as an example.
Bucket   Count   Values
1        1000    0
2        75      0.01 - 0.50
3        75      0.51 - 0.63
4        75      0.64 - 0.71
5        75      0.72 - 0.83
6        75      0.84 - 0.89
7        75      0.90 - 0.92
8        75      0.93 - 0.95
9        75      0.95 - 0.99
10       2000    1
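To make the target scheme concrete, here is roughly what I mean in plain Python/numpy rather than Spark (the data and names below are made up for illustration only):
import numpy as np

values = np.array([0.0, 0.001, 0.4, 0.77, 0.93, 0.95, 1.0])  # dummy data
interior = values[(values > 0) & (values < 1)]

# Equal-frequency edges for buckets 2-9: eight bins over the open interval (0, 1).
edges = np.quantile(interior, np.linspace(0, 1, 9))

def bucket(v):
    if v == 0:
        return 1           # all 0's go to bucket 1
    if v == 1:
        return 10          # all 1's go to bucket 10
    # searchsorted maps v to one of the eight interior bins; shift so they become 2-9.
    return int(np.clip(np.searchsorted(edges, v, side="right"), 1, 8)) + 1

print([bucket(v) for v in values])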
I've tried using the QuantileDiscretizer() function like this:
val df = {
  rawDf
    // Taking 1's and 0's out for the moment
    .filter(col("continuous_number_col") =!= 1 && col("continuous_number_col") =!= 0)
}
val discretizer = new QuantileDiscretizer()
  .setInputCol("continuous_number_col")
  .setOutputCol("bucket_result")
  .setNumBuckets(8)
val result = discretizer.fit(df).transform(df)
However, this gives me the following, not equal buckets:
bucket_result   count
7.0             20845806
6.0             21096698
5.0             21538813
4.0             21222511
3.0             21193393
2.0             21413413
1.0             21032666
0.0             21681424
Hopefully this gives enough context to what I'm trying to do. Thanks in advance.

Curse of dimensionality when dimensions are fixed

I think there is a big misunderstanding in the Data Science community with respect to what exactly the 'curse of high dimensionality' means. Please consider two examples:
1) I want to compare the distance between point A and point B in a 1000-dimensional and a 1001-dimensional space. This is an example of the curse of high dimensionality, because there is a high chance that the distance will be higher in the 1001-dimensional space.
2) I want to compare the distance between point A and point B in a 1000-dimensional space, and the distance between point A and point C in the same 1000-dimensional space. This is not the curse of high dimensionality, because even though the number of dimensions is high, it is kept fixed.
Is the second statement correct? If the distance between points A and B is twice as large as the distance between A and C in a 2-dimensional space, I would expect to see the same factor of two between the same points in a 1000-dimensional space. This means that the curse of high dimensionality only occurs when one tries to compare distances across different numbers of dimensions.
I think I have answered this question with a little test, so I am going to leave it here in case it is useful for somebody:
I did an experiment where I created a dummy data set with 3 observations (A=1, B=2, C=4), calculated the Euclidean distances between the points, and varied the number of features to see whether the ratios of distances between the points start to change as the number of features increases.
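In code, the calculation just described boils down to something like this numpy sketch:
import numpy as np

# Three observations whose value is repeated across every feature: A=1, B=2, C=4.
for n_features in (2, 100, 1000, 10000):
    X = np.array([[1.0], [2.0], [4.0]]).repeat(n_features, axis=1)
    # Pairwise Euclidean distances between the three points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Ratio of the farther to the nearer non-zero distance for each point.
    ratios = np.sort(dist, axis=1)[:, 2] / np.sort(dist, axis=1)[:, 1]
    print(n_features, np.round(dist, 2).tolist(), np.round(ratios, 2).tolist())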
After 2 features:
0 1 2 ratio
0 0.00 1.41 4.24 3.00
1 0.00 1.41 2.83 2.00
2 0.00 2.83 4.24 1.50
After 100 features:
0 1 2 ratio
0 0.00 10.00 30.00 3.00
1 0.00 10.00 20.00 2.00
2 0.00 20.00 30.00 1.50
After 1000 features:
0 1 2 ratio
0 0.00 31.62 94.87 3.00
1 0.00 31.62 63.25 2.00
2 0.00 63.25 94.87 1.50
After 10000 features:
0 1 2 ratio
0 0.00 100.00 300.00 3.00
1 0.00 100.00 200.00 2.00
2 0.00 200.00 300.00 1.50
What does this mean? The curse of high dimensionality does not occur when the number of dimensions is fixed. It can be seen that the ratio of the distances to the first closest point (1) and the second closest point (2) remains constant as the number of dimensions increases.
To put it in perspective, yes, you do travel farther between points, but that makes sense, as the total data space grows with each added feature. However, the ratio of the distances between points stays the same, and that is what matters.
To be honest, I do not see such a problem with the famous 'curse of high dimensionality' unless you are in a situation where you need to compare the same points across varying numbers of dimensions.

Select last row from each column of multi-index Pandas DataFrame based on time, when columns are unequal length

I have the following Pandas multi-index DataFrame with the top level index being a group ID and the second level index being when, in ISO 8601 time format (shown here without the time):
value weight
when
5e33c4bb4265514aab106a1a 2011-05-12 1.34 0.79
2011-05-07 1.22 0.83
2011-05-03 2.94 0.25
2011-04-28 1.78 0.89
2011-04-22 1.35 0.92
... ... ...
5e33c514392b77d517961f06 2009-01-31 30.75 0.12
2009-01-24 30.50 0.21
2009-01-23 29.50 0.96
2009-01-10 28.50 0.98
2008-12-08 28.50 0.65
when is currently defined as an index but this is not a requirement.
Assertions
when may be non-unique.
Columns may be of unequal length across groups.
Within groups, when, value and weight will always be of equal length (for each when there will always be a value and a weight).
Question
Using the parameter index_time, how do you retrieve:
The most recent past value and weight from each group relative to index_time along with the difference (in seconds) between index_time and when.
index_time may be a time in the past such that only entries where when <= index_time are selected.
The result should be indexed in some way so that the group id of each result can be deduced
Example
From the above, if the index_time was 2011-05-10 then the result should be:
value weight age
5e33c4bb4265514aab106a1a 1.22 0.83 259200
5e33c514392b77d517961f06 30.75 0.12 72576000
Where the original DataFrame given in the question is df (assuming the first index level, the group id, is named 'id') and when holds the index_time from the example:
import pandas as pd
when = pd.Timestamp('2011-05-10')  # index_time
df.sort_index(inplace=True)
# Keep only rows at or before index_time, then take the last (most recent) row per group.
result = df.loc[pd.IndexSlice[:, :when], :].groupby('id').tail(1)
# Age of each selected row in seconds.
result['age'] = (when - result.index.get_level_values(level=1)).total_seconds()

pandas time-weighted average groupby in panel data

Hi, I have a panel data set that looks like:
stock date time spread1 weight spread2
VOD 01-01 9:05 0.01 0.03 ...
VOD 01-01 9.12 0.03 0.05 ...
VOD 01-01 10.04 0.02 0.30 ...
VOD 01-02 11.04 0.02 0.05
... ... ... .... ...
BAT 01-01 0.05 0.04 0.03
BAT 01-01 0.07 0.05 0.03
BAT 01-01 0.10 0.06 0.04
I want to calculate the weighted average of spread1 for each stock on each day. I can break the solution into several steps, i.e. I can apply groupby and agg to get the sum of spread1*weight for each stock on each day in dataframe1, then calculate the sum of weight for each stock on each day in dataframe2, and after that merge the two data sets to get the weighted average of spread1.
My question is: is there any simpler way to calculate the weighted average of spread1 here? I also have spread2, spread3 and spread4, so I want to write as little code as possible. Thanks
IIUC, you need to transform the result back onto the original rows, but using .transform with output that depends on two columns is tricky. We write our own function, to which we pass the spread Series s and the original DataFrame df so that we can also use the weights:
import numpy as np

def weighted_avg(s, df):
    # Average the group's spread values, weighted by the matching rows of 'weight'.
    return np.average(s, weights=df.loc[df.index.isin(s.index), 'weight'])

df['spread1_avg'] = df.groupby(['stock', 'date']).spread1.transform(weighted_avg, df)
Output:
stock date time spread1 weight spread1_avg
0 VOD 01-01 9:05 0.01 0.03 0.020526
1 VOD 01-01 9.12 0.03 0.05 0.020526
2 VOD 01-01 10.04 0.02 0.30 0.020526
3 VOD 01-02 11.04 0.02 0.05 0.020000
4 BAT 01-01 0.05 0.04 0.03 0.051000
5 BAT 01-01 0.07 0.05 0.03 0.051000
6 BAT 01-01 0.10 0.06 0.04 0.051000
If needed for multiple columns:
gp = df.groupby(['stock', 'date'])
for col in [f'spread{i}' for i in range(1, 5)]:
    df[f'{col}_avg'] = gp[col].transform(weighted_avg, df)
Alternatively, if you don't need to transform back and want one value per stock-date:
import pandas as pd

def my_avg2(gp):
    avg = np.average(gp.filter(like='spread'), weights=gp.weight, axis=0)
    return pd.Series(avg, index=[col for col in gp.columns if col.startswith('spread')])

### Create some dummy data
df['spread2'] = df.spread1 + 1
df['spread3'] = df.spread1 + 12.1
df['spread4'] = df.spread1 + 1.13

df.groupby(['stock', 'date'])[['weight'] + [f'spread{i}' for i in range(1, 5)]].apply(my_avg2)
# spread1 spread2 spread3 spread4
#stock date
#BAT 01-01 0.051000 1.051000 12.151000 1.181000
#VOD 01-01 0.020526 1.020526 12.120526 1.150526
# 01-02 0.020000 1.020000 12.120000 1.150000

Separating a column data into multiple columns in hive

I have some sample data for devices, each of which contains two controllers and their versions. The sample data is as follows:
device_id controller_id versions
123 1 0.1
123 2 0.15
456 2 0.25
143 1 0.35
143 2 0.36
The above data should be transformed into the below format:
device_id 1st_ctrl_id_ver 2nd_ctrl_id_ver
123 0.1 0.15
456 NULL 0.25
143 0.35 0.36
I used the below code which is not working:
select
device_id,
case when controller_id="1" then versions end as 1st_ctrl_id_ver,
case when controller_id="2" then versions end as 2nd_ctrl_id_ver
from device_versions
The output which I got is:
device_id 1st_ctrl_id_ver 2nd_ctrl_id_ver
123 0.1 NULL
123 NULL 0.15
456 NULL 0.25
143 0.35 NULL
143 NULL 0.36
I don't want the NULL values in each row. Can someone help me write the correct code?
To "fold" all lines with a given key to a single line, you have to run an aggregation. Even if you don't really aggregate values in practise.
Something like
select device_id,
MAX(case when controller_id="1" then versions end) as 1st_ctrl_id_ver,
MAX(case when controller_id="2" then versions end) as 2nd_ctrl_id_ver
from device_versions
GROUP BY device_id
But be aware that this code will work if and only if you have at most one entry per controller per device, and any controller_id other than 1 or 2 will be ignored. In other words, it is rather brittle (but you can't do better in SQL anyway).