SQL Server CTE average with conditions

I've been trying to visualise how to do this with a CTE, as on the surface that appears to be the best way, but I just can't get it going. Maybe it needs a temp table as well. I am using SQL Server 2008 R2.
I need to create intercepts (a length along a line essentially) with the following parameters.
The average of the intercept must be greater than .7
The aim is to get the largest intercept possible
There can be up to 2 consecutive meters of values less than .7 (internal waste) but no more
There is no limit to the total internal waste within an intercept
There is no minimum intercept length (well there is but I'll take care of it later)
Note: there will be no gaps, as I have taken care of that, and the from and to can be decimal.
An example is shown in an image attached to the original question (not reproduced here).
A second image shows the same intervals in space, with the assay on the left and the depth on the right (not reproduced here).
So, for a little more clarity if needed: intervals 6 to 7 and 17 to 18 do not form part of the larger intercept because the internal waste (7-9 and/or 15-17) would bring the average below 0.7 - not because of the amount of internal waste.
However, the result for 21-22 is not included because there are 3 meters of internal waste between it and the result for 17-18.
Note that there are multiple sites and areas which form part of the original table's primary key, so I imagine a partition by area and site would be used in any ROW_NUMBER OVER statements.
Edit: the original data had errors in the from and to values (multiple 14 to 15 rows), which would have been confusing - sorry. There will be no overlapping from-to ranges, which hopefully simplifies things.
Example values to use:
create table #temp_inter (
area nvarchar(10),
site_ID nvarchar(10),
d_from decimal (18,3),
d_to decimal (18,3),
assay decimal (18,3))
insert into #temp_inter
values ('area_1','abc','0','5','0'),
('area_1','abc','5','6','0.165'),
('area_1','abc','6','7','0.761'),
('area_1','abc','7','8','0.321'),
('area_1','abc','8','9','0.292'),
('area_1','abc','9','10','1.135'),
('area_1','abc','10','11','0.225'),
('area_1','abc','11','12','0.983'),
('area_1','abc','12','13','0.118'),
('area_1','abc','13','14','0.438'),
('area_1','abc','14','15','0.71'),
('area_1','abc','15','16','0.65'),
('area_1','abc','16','17','2'),
('area_1','abc','17','18','0.367'),
('area_1','abc','18','19','0.047'),
('area_1','abc','19','20','0.71'),
('area_1','abc','20','21','0'),
('area_1','abc','21','22','0'),
('area_1','abc','22','23','0'),
('area_1','abc','23','24','2'),
('area_1','abc','24','25','0'),
('area_1','abc','25','26','0'),
('area_1','abc','26','30','0'),
('area_2','zzz','0','5','0'),
('area_2','zzz','5','6','1.165'),
('area_2','zzz','6','7','0.396'),
('area_2','zzz','7','8','0.46'),
('area_2','zzz','8','9','0.111'),
('area_2','zzz','9','10','0.053'),
('area_2','zzz','10','11','0.057'),
('area_2','zzz','11','12','0.055'),
('area_2','zzz','12','13','0.03'),
('area_2','zzz','13','14','0.026'),
('area_2','zzz','14','15','0.194'),
('area_2','zzz','15','16','0.367'),
('area_2','zzz','16','17','0.431'),
('area_2','zzz','17','18','0.341'),
('area_2','zzz','18','19','0.071'),
('area_2','zzz','19','20','0.26'),
('area_2','zzz','20','21','0.659'),
('area_2','zzz','21','22','0.602'),
('area_2','zzz','22','23','2.436'),
('area_2','zzz','23','24','0.874'),
('area_2','zzz','24','25','3.173'),
('area_2','zzz','25','26','0.179'),
('area_2','zzz','26','27','0.065'),
('area_2','zzz','27','28','0.024'),
('area_2','zzz','28','29','0')
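As a starting point (this is only a sketch, not part of the original question, and it ignores the internal-waste and maximisation rules), the gaps-and-islands grouping hinted at by the ROW_NUMBER/partition idea above could look like this against the sample data:
-- Sketch only: islands of consecutive rows with assay >= 0.7, per area/site.
-- rn - rn_flag is constant within each run of rows sharing the same above/below-cutoff flag.
;WITH flagged AS (
    SELECT area, site_ID, d_from, d_to, assay,
           ROW_NUMBER() OVER (PARTITION BY area, site_ID ORDER BY d_from) AS rn,
           ROW_NUMBER() OVER (PARTITION BY area, site_ID,
                                           CASE WHEN assay >= 0.7 THEN 1 ELSE 0 END
                              ORDER BY d_from) AS rn_flag
    FROM #temp_inter
)
SELECT area, site_ID,
       MIN(d_from) AS intercept_from,
       MAX(d_to) AS intercept_to,
       SUM(assay * (d_to - d_from)) / SUM(d_to - d_from) AS length_weighted_avg
FROM flagged
WHERE assay >= 0.7
GROUP BY area, site_ID, rn - rn_flag
ORDER BY area, site_ID, intercept_from;
Extending these islands across up to 2 m of internal waste while keeping the length-weighted average above 0.7 is the part that still needs the recursive or iterative logic the question asks about.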

Related

How to query column with letters on SQL?

I'm new to this.
I have a column (chocolate_weight) on the table (Chocolate) which has g at the end of every number, so 30x, 2x5g, 10g etc.
I want to remove the letter at the end and then query it to show any that weigh greater than 35.
So far I have done
Select *
From Chocolate
Where chocolate_weight IN
(SELECT
REPLACE(chocolote_weight,'x','') From Chocolate) > 35
It is coming back with 0 , even though there are many that weigh more than 35.
Any help is appreciated
Thanks
If 'g' is always the suffix then your current query is along the right lines, but you don't need the IN; you can do the replace in the WHERE clause:
SELECT *
FROM Chocolate
WHERE CAST(REPLACE(chocolate_weight,'g','') AS DECIMAL(10, 2)) > 35;
N.B. This works in both of the tagged DBMSs, SQL Server and MySQL.
This will fail (although only silently in MySQL) if anything contains units other than grams, though. So what I would strongly suggest, if it is not too late, is that you fix your design: store the weight as a numeric type and lose the 'g' completely if you only ever store grams. If you use multiple different units, you may wish to standardise them so all are stored as grams, or alternatively store the two things in separate columns: one decimal/int column for the numeric value and a separate column for the unit, e.g.
+--------+------+
| Weight | Unit |
+--------+------+
| 10     | g    |
| 150    | g    |
| 1000   | lb   |
+--------+------+
The issue you will have here, though, is that you will have to start doing conversions in your queries to ensure you get all results. It is easier to do the conversion once when the data is saved and use a standard measure for all records.
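If you do end up with separate weight and unit columns, the query-time conversion mentioned above might look something like this (a sketch only; the Weight and Unit names come from the example table, and the conversion factors are assumptions):
SELECT *
FROM Chocolate
WHERE Weight * CASE Unit
                   WHEN 'g'  THEN 1
                   WHEN 'kg' THEN 1000
                   WHEN 'lb' THEN 453.592  -- assumed factor; adjust to the units you actually store
                   ELSE NULL               -- unknown units are excluded rather than guessed
               END > 35;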

Closest position between randomly moving objects

I have a large database table that contains grid references (X and Y) associated with various objects (each with a unique object identifier) as they move with time. The objects move at approximately constant speed but in random directions.
The table looks something like this….
CREATE TABLE positions (
objectId INTEGER,
x_coord INTEGER,
y_coord INTEGER,
posTime TIMESTAMP);
I want to find which two objects got closest to each other and at what time.
Finding the distance between two fixes is relatively easy – simple Pythagoras for the differences between the X and Y values should do the trick.
The first problem seems to be one of volume. The grid itself is large, 100,000 possible X co-ordinates and a similar number of Y co-ordinates. For any given time period the table might contain 10,000 grid reference positions for 1000 different objects – 10 million rows in total.
That’s not in itself a large number, but I can’t think of a way of avoiding doing a ‘product query’ to compare every fix to every other fix. Doing this with 10 million rows will produce 100 million million results.
The next issue is that I’m not just interested in the closest two fixes to each other, I’m interested in the closest two fixes from different objects.
Another issue is that I need to match time as well as position – I’m not just interested in two objects that have visited the same grid square, they need to have done so at the same time.
The other point (may not be relevant) is that the items are unlikely to ever occupy exactly the same location at the same time.
I’ve got as far as a simple product query with a few sample rows, but I’m not sure of my next steps. I’m beginning to think this isn’t something I can pull off with a single SQL query (please prove me wrong) and I’m likely to have to extract the data and subject it to some procedural programming.
Any suggestions?
I’m not sure which SE forum this is best suited for – database SQL? Programming? Maths?
UPDATE - Another issue to add to the complexity: the timestamping for each object and position is irregular, one item might have a position recorded at 14:10:00 and another at 14:10:01. If these two positions are right next to each other and one second apart then they may actually represent the closest position even though the times don't match!
In order to reduce the number of tested combinations you should segregate them by postime using subqueries. Also, it's recommended that you create an index on postime to improve performance.
create index ix1_time on positions (postime);
Since you didn't mention any specific database I assumed PostgreSQL since it's easy to use (for me). The solution should look like:
with t as (
    select distinct postime as pt from positions
)
select y.*
from t
cross join lateral (
    select
        t.pt as postime,
        a.objectid as aid, b.objectid as bid,
        a.x_coord + a.y_coord + b.x_coord + b.y_coord as dist -- fix here!
    from positions a
    join positions b on b.postime = a.postime
    where a.postime = t.pt
      and a.objectid <> b.objectid
    order by dist -- ascending, so the smallest distance wins
    limit 1
) y;
This SQL should compare each of the 10,000 objects against the others, one postime at a time. It will test 10 million combinations for each different postime value, but not against other postime values.
Please note: I used a.x_coord + a.y_coord + b.x_coord + b.y_coord as the distance formula. I leave the correct one for you to implement here.
In total it will compute 10 million x 1000 time values: a total of 10 billion comparisons. It will return the closest two points for each timepos, that is a total of 1000 rows.
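For reference, here is a self-contained variant showing the Pythagorean distance the placeholder stands in for (my own sketch, not part of the answer; the square root is omitted because it does not change the ordering):
-- Closest same-time pair overall, using a real (squared) Euclidean distance
select a.objectid as aid, b.objectid as bid, a.postime,
       (a.x_coord - b.x_coord) * (a.x_coord - b.x_coord)
     + (a.y_coord - b.y_coord) * (a.y_coord - b.y_coord) as dist
from positions a
join positions b
  on b.postime = a.postime
 and a.objectid <> b.objectid
order by dist
limit 1;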

Select row with mostly higher value and rarely lower value

I'm trying to select a random row from a table, but there is a column in this table called Rate. I want it to mostly return rows that have a higher rate, and only rarely return rows that have a lower rate. Is this possible?
Table :
CREATE TABLE _Random (Code varchar(128), Rate tinyint)
So you want a random row, but weighted towards the ones with higher rates?
It would also be good to know how many rows there are in the table - sorting the whole lot is kinda expensive. You may prefer to use a row_number concept rather than sorting by N guids.
So... One option could be to generate a single number, and then divide 100 by it. Imagine we generate a number between 0 and 1.
.25 gives us 400, .5 gives us 200, .75 gives us 133... Notice that there's a curve here - so the numbers closer to 100 come up more often (subtract 100 to make the range start at 1).
You could use RAND() for a single value between 0 and 1 (it's probably good enough), and then do the division and subtraction to get a number. If this is higher than the count of records, then maybe repeat? But try to choose a value for your division that suits.
If you need to weight it more, you could raise your RAND() value by some number, to flatten it out or steepen it up. Do some experimenting to see how it looks.
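A rough sketch of that idea in SQL Server terms (my own interpretation, not code from the answer): map a 0-1 random number through 100 / r so that small positions are far more likely, then use the result as a position into the rows ordered by Rate descending.
-- Sketch only: skewed position pick; falls back to the top row if the position is past the end
DECLARE @pos int = CAST(100.0 / (RAND() * 0.99 + 0.01) AS int) - 99;  -- roughly 1..9901, heavily skewed towards 1
SELECT Code, Rate
FROM (
    SELECT Code, Rate,
           ROW_NUMBER() OVER (ORDER BY Rate DESC) AS rn
    FROM _Random
) AS ranked
WHERE rn = @pos
   OR (rn = 1 AND @pos > (SELECT COUNT(*) FROM _Random));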
This query will fetch a random record which has an above average rate
SELECT TOP (1) * FROM _Random
WHERE Rate>(SELECT AVG(Rate) FROM _Random)
ORDER BY NEWID()

Find nearest lines to large number of points in an oracle spatial database

The problem I have is simple:
I have a set of datasets. Each dataset has within it a set of points. Each set of points is an identical 6 km spaced grid (this grid never changes). Each point has an associated value. Each dataset is unrelated, so the problem can be seen as just a single set of points.
If the value of a point exceeds a predefined threshold value then the point has to be queried against an oracle spatial database to find all line segments within a certain distance of the point.
Which is a simple enough problem to solve.
The line segments have a non-unique ID, which allows them to be grouped together into features of 1 to 700 segments (it's all predefined topology).
Ultimately I need to know which feature IDs match against which points as well as the number of line segments for each feature match against each point.
In terms of dataset sizes:
There are around 200 datasets.
There are 56,000 points per dataset.
There are a little over 180,000 line segments in the spatially indexed database.
The line segments can be grouped into a total of 1900 features.
Usually no more than on the order of 10^3 points exceed the threshold per dataset.
I have created a solution and it works adequately; however, I'm unhappy with the overall run time - it takes around 3 minutes per dataset.
Normally I wouldn't mind if a precomputation task takes that long, but due to constraints this task cannot take more than an hour to run, and ideally would only take 1/2 an hour.
Currently I use SDO_WITHIN_DISTANCE to do the query, and I run this query for each and every point that exceeds the threshold:
SELECT id, count(shape) AS segments, sum(length) AS length
FROM (
SELECT shape, id, length
FROM lines_1
UNION ALL
SELECT shape, id, length
FROM lines_2
)
WHERE SDO_WITHIN_DISTANCE(
shape,
sdo_geometry(
3001,
8307,
SDO_POINT_TYPE(:lng,:lat, 0),
null,
null
),
'distance=4 unit=km'
) = 'TRUE'
GROUP BY id
This query takes around 0.4s to execute, which isn't all that bad, but it adds up for a single dataset, and is compounded over all of the datasets.
I am not overly experienced with Oracle spatial databases, so I'm not sure how to improve the speed.
Note that I cannot change the format of the incoming set of points, nor can I change the format of the database.
The only way to speed it up that I can think of is by pre computing the query for each point and storing that in a separate table, but I'd rather not do that as it more or less creates another copy of the data.
So the question is - is there a better way to do the query?
I ended up precomputing my query into the following table.
+---------+---------+
| LINE_ID | VARCHAR |
| LAT     | FLOAT   |
| LNG     | FLOAT   |
+---------+---------+
There were just too many multiline segments for it to be efficient.
By precomputing it I can just look up the relevant IDs in the table (which ultimately was all I cared about).
The query takes less than 1/10th of the time, so it works out a lot faster.
Ultimately the tradeoff of having to recompute the point to ID mapping every week (takes about 2 hours) was worth the speed up.
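For illustration only (the table name below is hypothetical; the columns are the ones shown above), the per-point lookup then reduces to a plain equality query against the precomputed table:
-- Look up the precomputed line IDs for one grid point
SELECT line_id
FROM point_line_matches  -- hypothetical name for the precomputed table above
WHERE lat = :lat
  AND lng = :lng;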

Biased random in SQL?

I have some entries in my database, in my case videos with a rating, popularity and other factors. From all these factors I calculate a likelihood factor, or rather a boost factor.
So I essentially have the fields ID and BOOST. The boost is calculated in a way that it turns out as an integer that represents the percentage of how often this entry should be hit in comparison.
ID Boost
1 1
2 2
3 7
So if I run my random function indefinitely I should end up with X hits on ID 1, twice as many on ID 2 and 7 times as many on ID 3.
So every hit should be random but with a probability of (boost / sum of boosts). So the probability for ID 3 in this example should be 0.7 (because the sum is 10; I chose those values for simplicity).
I thought about something like the following query:
SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();
Unfortunately that doesn't work, after considering the following entries in the table:
ID Boost
1 1
2 2
It will, with a 50/50 chance, have only the 2nd or both elements to choose from randomly.
So 0.5 hit goes to the second element
And 0.5 hit goes to the (second and first) element, which is then chosen from randomly, so 0.25 each.
So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.66
I need some modification or a new method to do this with good performance.
I also thought about storing the boost field cumulatively so I just do a range query from (0-sum()), but then I would have to re-index everything coming after one item if I change it or develop some swapping algorithm or something... but that's really not elegant and stuff.
Both inserting/updating and selecting should be fast!
Do you have any solutions to this problem?
The best use case to think of is probably advertisement delivery: "please choose a random ad with a given probability". However, I need it for another purpose; this is just to give you a final picture of what it should do.
edit:
Thanks to Ken's answer I thought about the following approach:
calculate a random value from 0-sum(distinct boost)
SET @randval = (select ceil(rand() * sum(DISTINCT boost)) from test);
select the boost factor from all distinct boost factors whose running total surpasses the random value
then we have in our 1st example 1 with a 0.1, 2 with a 0.2 and 7 with a 0.7 probability.
now select one random entry from all entries having this boost factor
PROBLEM: the count of entries having a given boost is always different. For example, if there is only one 1-boosted entry I get it in 1 of 10 calls, but if there are 1 million entries with boost 7, each of them is hardly ever returned...
So this doesn't work out :( trying to refine it.
I have to somehow include the count of entries with this boost factor... but I am somehow stuck on that...
You need to generate a random number per row and weight it.
In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost
DECLARE @sample TABLE (id int, boost int)
INSERT @sample VALUES (1, 1), (2, 2), (3, 7)
SELECT
    RAND(CHECKSUM(NEWID())) * boost AS weighted,
    SUM(boost) OVER () AS boostcount,
    id
FROM
    @sample
GROUP BY
    id, boost
ORDER BY
    weighted DESC
If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.
Finally, ORDER BY NEWID() is a randomness that would take no account of boost. It's useful to seed RAND but not by itself.
This sample was put together on SQL Server 2008, BTW
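If only a single winner is needed, the same weighting can feed a TOP (1) directly (a usage sketch, not part of the original answer; run it in the same batch as the DECLARE above):
-- Pick just one weighted-random row from the table variable above
SELECT TOP (1) id
FROM @sample
ORDER BY RAND(CHECKSUM(NEWID())) * boost DESC;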
I dare to suggest a straightforward solution with two queries, using a cumulative boost calculation.
First, select sum of boosts, and generate some number between 0 and boost sum:
select ceil(rand() * sum(boost)) from table;
This value should be stored in a variable; let's call it {random_number}.
Then, select the table rows, calculating the cumulative sum of boosts, and find the first row which has a cumulative boost greater than or equal to {random_number}:
SET @cumulative_boost = 0;
SELECT
    id,
    @cumulative_boost := (@cumulative_boost + boost) AS cumulative_boost
FROM
    table
HAVING
    cumulative_boost >= {random_number}
ORDER BY id
LIMIT 1;
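Because MySQL gives only weak guarantees about the order in which user variables are evaluated, a variable-free variant of the same cumulative idea may be worth knowing (my own sketch; {random_number} is the value computed above, and the self-join makes this slower on large tables):
-- Cumulative boost in id order via a self-join; `table` is the placeholder name used above
SELECT t1.id
FROM `table` AS t1
JOIN `table` AS t2 ON t2.id <= t1.id
GROUP BY t1.id
HAVING SUM(t2.boost) >= {random_number}
ORDER BY t1.id
LIMIT 1;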
My problem was similar: every person had a calculated number of tickets in the final draw. If you had more tickets then you would have a higher chance to win "the lottery".
Since I didn't trust any of the results I found on the web (rand() * multiplier, or the one with -log(rand())), I wanted to implement my own straightforward solution.
What I did, which in your case would look a little bit like this:
SELECT `values`.id, `values`.boost  -- `values` is a reserved word in MySQL, hence the backticks
FROM (SELECT id, boost FROM foo) AS `values`
INNER JOIN (
    SELECT id % 100 + 1 AS counter
    FROM user
    GROUP BY counter
) AS numbers ON numbers.counter <= `values`.boost
ORDER BY RAND()
Since I don't have to run it often I don't really care about future performance and at the moment it was fast for me.
Before I used this query I checked two things:
The maximum boost value is less than the maximum number returned by the numbers query
That the inner query returns ALL numbers between 1..100. It might not depending on your table!
Since I have all distinct numbers between 1..100, joining on numbers.counter <= values.boost means that if a row has a boost of 2 it ends up duplicated in the final result, and if a row has a boost of 100 it ends up in the final set 100 times. Or in other words: if the sum of boosts is 4212, which it was in my case, you would have 4212 rows in the final set.
Finally I let MySql sort it randomly.
Edit: For the inner query to work properly make sure to use a large table, or make sure that the ids don't skip any numbers. Better yet, and probably a bit faster, you might even create a temporary table which simply has all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= values.boost
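A sketch of that temporary numbers table variant (my own; the helper table is hypothetical, foo is the example table name used above, and the f alias sidesteps the reserved word values):
-- Hypothetical helper table holding 1..100; extend it to the largest boost you use
CREATE TEMPORARY TABLE numbers (id INT PRIMARY KEY);
INSERT INTO numbers (id)
SELECT t.i * 10 + u.i + 1
FROM (SELECT 0 AS i UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS t
CROSS JOIN
     (SELECT 0 AS i UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
      UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS u;
-- Each row is repeated boost times, then one repetition is drawn at random
SELECT f.id
FROM foo AS f
INNER JOIN numbers ON numbers.id <= f.boost
ORDER BY RAND()
LIMIT 1;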