How to average values based on location proximity - sql

I have an SQL table with geo-tagged values (Longitude, Latitude, value). The table is accumulated quickly and has thousands entries. Therefore, querying the table for values in some area return very large data-set.
I would like to know the way to average value with close location proximity to one value, here is an illustration:
Table:
Long lat value
10.123001 53.567001 10
10.123002 53.567002 12
10.123003 53.567003 18
10.124003 53.568003 13
lets say my current location is 10.123004, 53.567004. If I am querying for the values near by I will get the four raws with values 10, 12, 18, and 13. This works if the data-set is relatively small. If the data is large I would like to query sql for rounded location (10.123, 53.567) and need sql to return something like
Long lat value
10.123 53.567 10 (this is the average of 10, 12, and 18)
10.124 53.568 13
Is this possible? how we can average large data set based on locations?
Is sql database is the right choice in the first place?

GROUP BY rounded columns, and the AVG aggregate function should work fine for this:
SELECT ROUND(Long, 3) Long,
ROUND(Lat, 3) Lat,
AVG(value)
FROM Table
GROUP BY ROUND(Long, 3), ROUND(Lat, 3)
Add a WHERE clause to filter as needed.

Here's some rough pseudocode that might be a start. You need to provide the proper precision arguments for the round function in the dialect of SQL you are using for your project, so understand that the 3 I provide as the second argument to Round is the number of decimals of precision to which the number is rounded, as indicated by your original post.
Select round(lat,3),round(long,3),avg(value)
Group by round(lat,3),round(long,3)

The problem with the rounding approach is the boundary conditions -- what happens when points are close to the bounday.
However, for the neighborhood of a given point it is better to use something like:
select *
from table
where long between #MyLong - #DeltaLong and #MyLong + #DeltaLong and
lat between #MyLat - #DeltaLat and #MyLat + #DeltaLat
For this, you need to define #DeltaLong and #DeltaLat.
Rounding works fine for summarization, if that is your problem.

Related

Sum over partition between current row and number from another number

I need a moving sum that starts from the current row until X following rows. The problem is that X is not static (i.e. it comes from another column).
The code:
sum(column 0) OVER(
ORDER BY column 1, column 2, column 3
ROWS BETWEEN CURRENT ROW AND column 4 FOLLOWING
) as X
column 4 is already an integer, but the SQL "complains" asking for a hard coded integer. Casting and converting didn't work either.
Thank you in advance!
Usually SQL engines cannot handle "variables" on some parts of the SQL statement.
This is one of those cases. Unfortunately, a non-literal is not accepted at this place. You'll need to rephrase your query using a different strategy.

SQL to powerBI expression?

How to write this expression in PowerBI
select distinct([date]),Temperature from Device47A8F where Temperature>25
Totally new to PowerBI. Is there any tool that can change the query from sql to PowerBI expression?
I have tried so many type of different type of expressions but getting error, Most of the time I am getting this:
The expression refers to multiple columns. Multiple columns cannot be converted to a scalar value.
Need help, Thanks.
After I posted my answer, wondered if your expected result is get only one date by temperature, In other words, without repeated dates in your result set.
A side note: select distinct([date]),Temperature from Device47A8F where Temperature>25 returns repeated dates since DISTINCT keyword evaluate distinct columns values specified in the SELECT statement, it doesn't return distinct values in a specific column even if you surround it with parenthesis.
Now what brings us here. What I can see in your error is that you are trying to use a table-valued (produces a table with multiple columns) expression in a measure which only accepts scalar-valued (calculate only one value).
Supposing you have a table like this:
Running your SQL query you will get the highlighted in yellow rows:
You can see 01/09/2016 date is repeated. If you want to create a measure you have to define what calculation you want to show for temperature. i.e, average, max or min etc.
In the below expression is being calculated the maximum temperature greater than 25 per date:
MaxTempGreaterThan25 =
CALCULATE ( MAX ( Device47A8F[Temperature] ), Device47A8F[Temperature] > 25 )
In this case the measure MaxTempGreaterThan25 is calculated per date.
If you don't want to produce a measure but a table. In the Power BI Tool bar select Modeling tab and click New Table icon.
Use this expression:
MyTemperatureTable =
FILTER ( Device47A8F, Device47A8F[Temperature] > 25 )
It should produce a new table named MyTemperatureTable like this:
I recommend you learn some basics about DAX, it is pretty different from SQL / T-SQL and there are things you can't do depending on your model and data.
Let me know if this helps.
You probably don't need to write any code if your objective is to show the result in a Power BI visual e.g. a table. Power BI naturally aggregates data if the datatype is numeric (e.g. Temperature).
I would just add a Table visual on a Report page and add the Date and Temperature columns to it. Then in Visualizations / Fields / Values I would click the little down-arrow on the Temperature field and set the Aggregation e.g. Maximum. Then in Visualizations / Fields / Filters I would click the little down-arrow on the Temperature field and set the Filter e.g. is greater than: 25
Hard-coded solutions are unlikely to survive the next question from your users e.g. "but what if I want to see Temperature > 24? Or 20? Or 30?"

How does the Average function work in relational databases?

I'm trying to find geometric average of values from a table with millions of rows. For those that don't know, to find the geometric average, you mulitply each value times each other then divide by the number of rows.
You probably already see the problem; The number multiplied number will quickly exceed the maximum allowed system maximum. I found a great solution that uses the natural log.
http://timothychenallen.blogspot.com/2006/03/sql-calculating-geometric-mean-geomean.html
However that got me to wonder wouldn't the same problem apply with the arithmetic mean? If you have N records, and N is very large the running sum can also exceed the system maximum.
So how do RDMS calculate averages during queries?
I don't know an exact implementation for arithmetic mean in an RDBMS, nor did you specify one in your original question. But the RDBMS does not need to sum a million rows in a column in order to obtain the arithmetic mean. Consider the following summation:
sum = (x1 + x2 + x3 + ... + x1000000)
Then the mean can be written as
mean = sum / N = (x1 + x2 + x3 + ... + x1000000) / N, for N = 1,000,000
But this expression can be broken up into pieces like this:
mean = [(x1 + x2 + x3) / N ] + [(x4 + x5 + x6) / N] + ...
In other words, the RDBMS can simply scan down the million rows in a column and find the mean section by section, without running the risk of an overflow. And since each number in the column is presumably within range for the type storing it, there is no chance of the mean value itself overflowing.
Most databases don't support a product() function the way they support an average.
However, you can use do what you want with logs. The product (simplified) is like:
select exp(sum(ln(x)) as product
The average would be:
select power(exp(sum(ln(x))), 1.0 / count(*)) as geoaverage
or
select EXP(AVG(LN(x))) as geoaverage
The LN() function might be LOG() on some platforms...
These are schematics. The functions for exp() and ln() and power() vary, depending on the database. Plus, if you have to take into account zero or negative numbers, the logic is more complicated.
Very easy to check. For example, SQL Server 2008.
DECLARE #T TABLE(i int);
INSERT INTO #T(i) VALUES
(2147483647),
(2147483647);
SELECT AVG(i) FROM #T;
result
(2 row(s) affected)
Msg 8115, Level 16, State 2, Line 7
Arithmetic overflow error converting expression to data type int.
There is no magic. Column type is int, server adds values together using internal variable of the same type int and intermediary result exceeds range for int.
You can run the similar check for any other DBMS that you use. Different engines may behave differently, but I would expect all of them to stick to the original type of the column. For example, averaging two int values 100 and 101 may result in 100 or 101 (still int), but never 100.5.
For SQL Server this behavior is documented. I would expect something similar for all other engines:
AVG () computes the average of a set of values by dividing the sum of
those values by the count of nonnull values. If the sum exceeds the
maximum value for the data type of the return value an error will be
returned.
So, you have to be careful when calculating simple average as well, not just product.
Here is extract from SQL 92 Standard:
6) Let DT be the data type of the < value expression >.
9) If SUM or AVG is specified, then:
a) DT shall not be character string, bit string, or datetime.
b) If SUM is specified and DT is exact numeric with scale S, then the
data type of the result is exact numeric with implementation-defined
precision and scale S.
c) If AVG is specified and DT is exact numeric, then the data type of
the result is exact numeric with implementation- defined precision not
less than the precision of DT and implementation-defined scale not
less than the scale of DT.
d) If DT is approximate numeric, then the data type of the result is
approximate numeric with implementation-defined precision not less
than the precision of DT.
e) If DT is interval, then the data type of the result is inter- val
with the same precision as DT.
So, DBMS can convert int to larger type when calculating AVG, but it has to be an exact numeric type, not floating-point. In any case, depending on the values you can still get arithmetic overflow.
Some DBMS — specifically, the Informix DBMS — convert from an INT type to a floating point type to do the calculation:
SQL[2148]: create table t(i int);
SQL[2149]: insert into t values(214748347);
SQL[2150]: insert into t values(214748347);
SQL[2151]: insert into t values(214748347);
SQL[2152]: select avg(i) from t;
214748347.0
SQL[2153]: types on;
SQL[2154]: select i from t;
INTEGER
214748347
214748347
214748347
SQL[2155]: select avg(i) from t;
DECIMAL(32)
214748347.0
SQL[2156]:
Similarly with other types. This can still end with an overflow under some circumstances; you then get a runtime error. However, it is rather seldom that you exceed the precision — it typically takes a very large number of rows for the sum to exceed the limits, even if you're counting the US deficit over the next century in atto-Zimbabwean dollars circa 2009.

SQL - View column that calculates the percentage from other columns

I have a query from Access where I caluclated the percentage score of three seperate numbers Ex:
AFPercentageMajor: [AFNumberOfMajors]/([AFTotalMajor]-[AFMajorNA])
which could have values of 20/(23-2) = 95%
I have imported this table into my SQL database and tried to write a expression in the view (changed the names of the columns a bit)
AF_Major / (AF_Major_Totals - AF_Major_NA)
I tried adding *100 to the end of the statement but it only works if the calculation is at 100%. If it is anything less than that it puts it as a 0.
I have a feeling it just doesn't like the combincation of the three seperate column names. But like I said I'm still learning so I could be going at this completely wrong!
SQL Server does integer division. You need to change one of the values to a floating point representation. The following will work:
cast([AFNumberOfMajors] as float)/([AFTotalMajor]-[AFMajorNA])
You can multiply this by 100 to get the percentage value.

How to add "weights" to a MySQL table and select random values according to these?

I want to create a table, with each row containing some sort of weight. Then I want to select random values with the probability equal to (weight of that row)/(weight of all rows). For example, having 5 rows with weights 1,2,3,4,5 out of 1000 I'd get approximately 1/15*1000=67 times first row and so on.
The table is to be filled manually. Then I'll take a random value from it. But I want to have an ability to change the probabilities on the filling stage.
I found this nice little algorithm in Quod Libet. You could probably translate it to some procedural SQL.
function WeightedShuffle(list of items with weights):
max_score ← the sum of every item’s weight
choice ← random number in the range [0, max_score)
current ← 0
for each item (i, weight) in items:
current ← current + weight
if current ≥ choice or i is the last item:
return item i
The easiest (and maybe best/safest?) way to do this is to add those rows to the table as many times as you want the weight to be - say I want "Tree" to be found 2x more often then "Dog" - I insert it 2 times into the table and I insert "Dog" once and just select elements at random one by one.
If the rows are complex/big then it would be best to create a separate table (weighted_Elements or something) in which you'll just have foreign keys to the real rows inserted as many times as the weights dictate.
The best possible scenario (if i understand your question properly) is to setup your table as you normally would and then add two columns both INT's.
Column 1: Weight - This column would hold your weight value going from -X to +X, X being the highest value you want to have as a weight (IE: X=100, -100 to 100). This value is populated to give the row an actual weight and increase or decrease the probability of it coming up.
Column 2: *Count** - This column would hold the count of how many times this row has come up, this column is needed only if you want to use fair weighting. Fair weighting prevents one row from always showing up. (IE: if you have one row weighted at 100 and another at 2 the row with 100 will always show up, this column will allow weight 2 to be more 'valueable' as you get more weight 100 results). This column should be incremented by 1 each time a row result is pulled but you can make the logic more advanced later so it adds the weight etc.
Logic: - Its really simple now, your query simply has to request all rows as you normally would then make an extra select that (you can change the logic here to whatever you want) takes the weights and subtracts the count and order by that column.
The end result should be a table where you will get your weights appearing more often until a certain point where the system will evenly distribute itself out (leave out column 2) and you will have a system that will always return the same weighted order unless you offset the base of the query (IE: LIMIT [RANDOM NUMBER], [NUMBER OF ROWS TO RETURN])
I'm not an expert in probability theory, but assuming you have a column called WEIGHT, how about
select FIELD_1, ... FIELD_N, (rand() * WEIGHT) as SCORE
from YOURTABLE
order by SCORE
limit 0, 10
This would give you 10 records, but you can change the limit clause, of course.
The problem is called Reservoir Sampling (https://en.wikipedia.org/wiki/Reservoir_sampling)
The A-Res algorithm is easy to implement in SQL:
SELECT *
FROM table
ORDER BY pow(rand(), 1 / weight) DESC
LIMIT 10;
I came looking for the answer to the same question - I decided to come up with this:
id weight
1 5
2 1
SELECT * FROM table ORDER BY RAND()/weight
it's not exact - but it is using random so i might not expect exact. I ran it 70 times to get number 2 10 times. I would have expect 1/6th but i got 1/7th. I'd say that's pretty close. I'd have to run a script to do it a few thousand times to get a really good idea if it's working.