I have a database of aircraft flight track data that cross certain points. I'm looking at the altitude that the aircraft crossed these points at and trying to bin them by every 100 ft. The altitudes range from about 2000 ft to 15000 ft so I want a way to do this that automates the 100 ft increments. So I want to have the crossing point, a range (say 2000-2100 ft), and the count. And the next line is the crossing point, the next range (2100-2200 ft), and the count, and so on.
I'm still a SQL newbie so any help to get me pointed in the right direction would be appreciated. Thanks.
Edited for clarity - I have nothing yet. I want a column with my crossing location, another with the altitude range, and a third with the count. I'm just not sure how to bin the data so it gives me the ranges in 100 ft increments.
You can use a computed column for the AltitudeBucket, which is calculated automatically. (This technique is often used when loading dimension tables into data warehouses.)
In this case, making AltitudeBucket a persisted computed column means you can do calculations on it and use it in WHERE clauses.
Create and populate a table.
CREATE TABLE dbo.TrackPoint
(
TrackPointID int NOT NULL IDENTITY(1,1) PRIMARY KEY,
CrossingPoint nvarchar(50) NOT NULL,
AltitudeFeet int NOT NULL
CHECK (AltitudeFeet BETWEEN 1 AND 60000),
AltitudeBucket AS (AltitudeFeet / 100) * 100 PERSISTED NOT NULL
);
GO
INSERT INTO dbo.TrackPoint (CrossingPoint, AltitudeFeet)
VALUES
(N'Paris', 12772),
(N'Paris', 12765),
(N'Paris', 32123),
(N'Toulouse', 5123),
(N'Toulouse', 6123),
(N'Toulouse', 6120),
(N'Lyon', 15000),
(N'Lyon', 15010);
Display what's in the table.
SELECT *
FROM dbo.TrackPoint;
Run a SELECT query to calculate summarised counts.
SELECT CrossingPoint, AltitudeBucket, COUNT(*) AS 'Count'
FROM dbo.TrackPoint
GROUP BY CrossingPoint, AltitudeBucket
ORDER BY CrossingPoint, AltitudeBucket;
If you want to display the altitude range:
SELECT CrossingPoint, AltitudeBucket, CAST(AltitudeBucket AS nvarchar) + N'-' + CAST(AltitudeBucket + 99 AS nvarchar) AS 'AltitudeBucketRange', COUNT(*) AS 'Count'
FROM dbo.TrackPoint
GROUP BY CrossingPoint, AltitudeBucket
ORDER BY CrossingPoint, AltitudeBucket;
Whenever you're attempting to automate any kind of process, you must first design an algorithm that you could execute manually. To begin, pick out the smallest piece of this process: returning a count of altitudes between x and x+100. So when x = 2000, you want to return all records between 2000 and 2100.
SELECT COUNT(*) FROM AltitudesTable
WHERE altitude >= 2000 AND altitude < 2100;
The above code works for one case: 2000 <= x < 2100.
To "automate," or loop through all cases, try using T-SQL:
DECLARE @x INT = 2000;
WHILE @x < 15100
BEGIN
SELECT COUNT(*) FROM AltitudesTable
WHERE altitude >= @x AND altitude < @x + 100;
SET @x = @x + 100;
END
Respectfully, your requirements are not solidly defined, so I had to make some assumptions regarding table structure and datatypes.
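A plain-Python mirror of the loop above shows the mechanics (the altitude values here are made up for illustration):

```python
# Step through the range in 100 ft increments and count the matching
# altitudes at each step, like the T-SQL WHILE loop does.
altitudes = [2050, 2070, 2150, 3010, 14990]  # made-up sample crossings

counts = {}
x = 2000
while x < 15100:
    counts[x] = sum(1 for a in altitudes if x <= a < x + 100)
    x += 100

print(counts[2000], counts[2100], counts[3000])  # 2 1 1
```

In practice the set-based GROUP BY approach shown earlier does this in a single pass over the table, which is usually preferable to running one query per bucket.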
Related
I've been able to find a few examples of questions similar to this one, but most only involve a single column being checked.
SQL Select until Quantity Met
Select rows until condition met
I have a large table representing facilities, with a column for each type of resource and the number of that resource available per facility. I want a stored procedure that takes integer values as parameters (one per resource column) plus a Lat/Lon. It should iterate over the table sorted by distance, and return rows (facilities) until the required quantities of available resources (specified by the parameters) are met.
Data source example:
Id | Lat    | Long | Resource1 | Resource2 | ...
1  | 50.123 | 4.23 | 5         | 12        | ...
2  | 61.234 | 5.34 | 0         | 9         | ...
3  | 50.634 | 4.67 | 21        | 18        | ...
Result Wanted:
@latQuery = 50.634
@LongQuery = 4.67
@res1Query = 10
@res2Query = 20
Id | Lat    | Long | Resource1 | Resource2 | ...
3  | 50.634 | 4.67 | 21        | 18        | ...
1  | 50.123 | 4.23 | 5         | 12        | ...
The result includes all rows needed to meet each quota individually, sorted by distance to the requested lat/lon.
I'm able to sort the results by distance, and sum the total running values as suggested in other threads, but I'm having some trouble with the logic comparing the running values with the quota provided in the params.
First I have some CTEs to get the most recent edits, order by distance, and then sum the running totals:
WITH cte1 AS (SELECT
#origin.STDistance(geography::Point(Facility.Lat, Facility.Long, 4326)) AS distance,
Facility.Resource1 AS res1,
Facility.Resource2 AS res2
-- ...etc
FROM Facility
),
cte2 AS (SELECT
distance,
res1,
SUM(res1) OVER (ORDER BY distance) AS totRes1,
res2,
SUM(res2) OVER (ORDER BY distance) AS totRes2
-- ...etc, there's 15-20 columns here
FROM cte1
)
Next, with the results of that CTE, I need to pull rows until all quotas are met. This is where I'm having issues: it works for one row, but my logic with all the ANDs isn't exactly right.
SELECT * FROM cte2 WHERE (
(totRes1 <= @res1Query OR (totRes1 > @res1Query AND totRes1 - res1 <= @totRes1)) AND
(totRes2 <= @res2Query OR (totRes2 > @res2Query AND totRes2 - res2 <= @totRes2)) AND
-- ... I also feel like this method of pulling the next row once it's over may be convoluted as well?
)
As-is right now, it's mostly returning nothing, and I'm guessing it's because it's too strict? Essentially, I want to be able to let the total values go past the required values until they are all past the required values, and then return that list.
Has anyone come across a better method of searching using separate quotas for multiple columns?
See my update in the answers/comments
I think you are massively over-complicating this. This does not need any joins, just some running sum calculations, and the right OR logic.
The key to solving this is that you need every row where the running sum up to the previous row is still below at least one of the requirements. This means that you include all rows for which some requirement has not yet been met, plus the first row at which the last requirement is met or exceeded.
To do this you can subtract the current row's value from the running sum.
You could utilize a ROWS specification of ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING. But then you need to deal with NULL on the first row.
In any event, even a regular running sum should always use ROWS UNBOUNDED PRECEDING, because the default is RANGE UNBOUNDED PRECEDING, which is subtly different and can cause incorrect results, as well as being slower.
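The difference is easy to see on a tiny made-up table; here is a sketch using SQLite (3.25+ for window functions), where duplicate ORDER BY keys make the default RANGE frame sum all "peer" rows together:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (k INTEGER, v INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(1, 1), (1, 2), (2, 3)])

# Default frame is RANGE UNBOUNDED PRECEDING: rows tied on the ORDER BY
# key ("peers") are summed together, so both k=1 rows see the same total.
range_sums = [r[0] for r in con.execute(
    "SELECT SUM(v) OVER (ORDER BY k) FROM t ORDER BY k, v")]

# Explicit ROWS frame: each row sees only the rows physically before it
# (the order among tied rows is unspecified, so the two k=1 sums may swap).
rows_sums = [r[0] for r in con.execute(
    "SELECT SUM(v) OVER (ORDER BY k ROWS UNBOUNDED PRECEDING) FROM t "
    "ORDER BY k, v")]

print(range_sums, rows_sums)
```

On the tied rows the two frames disagree, which is exactly the subtle difference (and the peer-scanning cost) that the explicit ROWS frame avoids.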
You can also factor out the distance calculation into a CROSS APPLY (VALUES ...), avoiding the need for lots of CTEs or derived tables. You now only need one level of derivation.
DECLARE @origin geography = geography::Point(@latQuery, @LongQuery, 4326);
SELECT
f.Id,
f.Lat,
f.Long,
f.Resource1,
f.Resource2
FROM (
SELECT f.*,
SumRes1 = SUM(f.Resource1) OVER (ORDER BY v1.Distance ROWS UNBOUNDED PRECEDING) - f.Resource1,
SumRes2 = SUM(f.Resource2) OVER (ORDER BY v1.Distance ROWS UNBOUNDED PRECEDING) - f.Resource2
FROM Facility f
CROSS APPLY (VALUES(
@origin.STDistance(geography::Point(f.Lat, f.Long, 4326))
)) v1(Distance)
) f
WHERE (
f.SumRes1 < @res1Query
OR f.SumRes2 < @res2Query
);
db<>fiddle
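Here is a minimal sketch of that cutoff in SQLite (3.25+ window functions), with a made-up Distance column standing in for the geography calculation:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Facility (Id INTEGER, Distance REAL, "
            "Resource1 INTEGER, Resource2 INTEGER)")
con.executemany("INSERT INTO Facility VALUES (?, ?, ?, ?)",
                [(3, 0.0, 21, 18), (1, 1.0, 5, 12), (2, 2.0, 0, 9)])

res1_query, res2_query = 10, 20
# A row qualifies while the running sum BEFORE it (running sum minus the
# row's own value) is still below at least one requirement.
ids = [r[0] for r in con.execute("""
    SELECT Id FROM (
        SELECT Id, Distance,
               SUM(Resource1) OVER (ORDER BY Distance ROWS UNBOUNDED PRECEDING)
                   - Resource1 AS SumRes1,
               SUM(Resource2) OVER (ORDER BY Distance ROWS UNBOUNDED PRECEDING)
                   - Resource2 AS SumRes2
        FROM Facility
    )
    WHERE SumRes1 < ? OR SumRes2 < ?
    ORDER BY Distance
""", (res1_query, res2_query))]
print(ids)  # nearest facilities, stopping once both quotas are covered
```

With the question's sample numbers this yields facilities 3 then 1: facility 3 alone covers Resource1, but Resource2 is only at 18 of 20, so the next-nearest facility is included as well.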
Was able to figure out the problem on my own here. The primary issue I was running into was that I was comparing 25 different columns' running totals versus the 25 stored proc parameters (quotas of resources required by the search).
Changing lines such as this
(totRes1 <= @res1Query OR (totRes1 > @res1Query AND totRes1 - res1 <= @totRes1)) AND --...
to
(totRes1 <= @res1Query OR (totRes1 > @res1Query AND totRes1 - res1 <= @totRes1) OR @res1Query = 0) AND --...
(adding in the OR @res1Query = 0) solved my issue.
In other words, the search is often only for one or two columns (types of resources), leaving the others at zero. The way my logic was set up caused it to skip lots of rows, because it instantly marked them as having met the quota (value less than or equal to the quota). As @A Neon Tetra suggested, I was pretty close to it already.
Update:
First attempt didn't exactly fix my own issue. Posting the stripped down version of my code that is now working for me.
DECLARE @Lat AS DECIMAL(12,6)
DECLARE @Lon AS DECIMAL(12,6)
DECLARE @res1Query AS INT
DECLARE @res2Query AS INT
-- repeat for Resource 3 through 25, etc...
DECLARE @origin geography = geography::Point(@Lat, @Lon, 4326);
-- CTE to be able to expose distance
WITH cte AS (SELECT TOP(99999) -- --> this is hacky, it won't let me order by distance unless I'm selecting TOP(x) or some other fn?
dbo.Facility.FacilityID,
dbo.Facility.Lat,
dbo.Facility.Lon,
@origin.STDistance(geography::Point(dbo.Facility.Lat, dbo.Facility.Lon, 4326))
AS distance,
dbo.Facility.Resource1 AS res1,
dbo.Facility.Resource2 AS res2,
-- repeat for Resource 3 through 25, etc...
FROM dbo.Facility
ORDER BY distance),
-- second CTE - has access to distance so we can keep track of a running total ordered by distance
---> have to separate into two since you can't reference the same alias (distance) again within the same SELECT
fullCTE AS (SELECT
FacilityID,
Lat,
Lon,
distance,
res1,
SUM(res1) OVER (ORDER BY distance) AS totRes1,
res2,
SUM(res2) OVER (ORDER BY distance) AS totRes2,
-- repeat for Resource 3 through 25, etc...
FROM cte)
SELECT * -- Customize what you're pulling here for your output as needed
FROM dbo.Facility INNER JOIN fullCTE ON (fullCTE.FacilityID = dbo.Facility.FacilityID)
WHERE EXISTS
(SELECT
FacilityID
FROM fullCTE WHERE (
FacilityID = dbo.Facility.FacilityID AND
-- Keep pulling rows until all conditions are met, as opposed to pulling rows while they're under the quota
NOT (
((totRes1 - res1 >= #res1Query AND #res1Query <> 0) OR (#res1Query = 0)) AND
((totRes2 - res2 >= #res2Query AND #res2Query <> 0) OR (#res2Query = 0)) AND
-- repeat for Resource 3 through 25, etc...
)
)
)
I have a scenario: there is an escalator with a 1000 lbs capacity. The total weight of the persons entering the escalator should not exceed 1000 lbs. The LINE table contains each person's name, weight, and turn in the queue.
Below is the table definition and the values in it:
create table line (id int not null PRIMARY KEY,
name varchar(255) not null,
weight int not null,
turn int unique not null,
check (weight > 0)
);
INSERT INTO LINE VALUES(6,'George Washington', 250, 1);
INSERT INTO LINE VALUES(5,'Thomas Jefferson',175, 7);
INSERT INTO LINE VALUES(3,'John Adams',350, 2);
INSERT INTO LINE VALUES(7,'Thomas Jefferson',800, 3);
INSERT INTO LINE VALUES(1,'James Elephant',500, 6);
INSERT INTO LINE VALUES(2,'Andy',200, 5);
INSERT INTO LINE VALUES(4,'Will Smith',400, 4);
Now I need to write a query to print the name of the last person who enters the escalator, i.e. the person with whom the escalator's capacity is filled. Priority should be given based on the TURN value: the 1st person gets 1st priority, the 2nd person 2nd priority, and so on. The sum of the first 2 persons' weights in turn order is 600; if the next person (Thomas Jefferson, weight 800) enters the escalator, it exceeds the escalator's capacity, so this person should be ignored/excluded and the next person (Will Smith, weight 400) added to the escalator. Now the sum of all persons' weights is 1000, so Will Smith's name should be displayed in the output.
Could you please guide me in writing a SQL query for this.
PS: This is my first post. Kindly excuse any errors.
SQL fiddle link
You are solving a kind of optimization task similar to the knapsack problem, except that - luckily - in your case a greedy traversal suffices. You need a recursive query. In each iteration one turn is processed, and it is decided whether the new last person on the elevator is the previous one or the new one (please uncomment the select * for a clearer picture). The name of the last person in the CTE is what you are looking for. (For simplicity I assumed the turns are a contiguous sequence starting from 1.)
with actual_elevator (name, last_turn, total) as (
select name, turn, weight from line where turn = 1 and weight <= 1000
union all
select case when r.total + l.weight <= 1000 then l.name else r.name end
, l.turn
, case when r.total + l.weight <= 1000 then r.total + l.weight else r.total end
from actual_elevator r
join line l on r.last_turn + 1 = l.turn
where r.total < 1000 -- UPDATE 1
)
--select * from actual_elevator
select name from actual_elevator where last_turn = (select max(turn) from line)
Modified fiddle.
I have to praise you for preparing a fiddle, specifying the db vendor, and describing the problem for a concrete case. Not everyone asking [sql] questions on SO is so diligent.
UPDATE 1: To stop the iteration when the sum is exactly 1000, use a where condition to stop testing new rows once the limit is reached (modified fiddle).
Note that when the sum is less than 1000, the iteration must run to the end, because you never know whether a suitable value is, say, on the last row until you see it.
Also note that the algorithm greedily finds a suboptimal solution. For example, for the input 800, 100, 200 it stops at 900 and does not backtrack to find 1000 or the maximum. That would be a quite different and more difficult task, which I assume you didn't require.
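For intuition, here is the same greedy traversal in plain Python, using the data from the question's INSERTs:

```python
# Greedy walk over the queue in turn order: a person boards only if their
# weight still fits; the answer is the last person who boarded.
line = [  # (turn, name, weight), from the question's sample data
    (1, "George Washington", 250), (2, "John Adams", 350),
    (3, "Thomas Jefferson", 800), (4, "Will Smith", 400),
    (5, "Andy", 200), (6, "James Elephant", 500),
    (7, "Thomas Jefferson", 175),
]
CAPACITY = 1000

total, last_name = 0, None
for turn, name, weight in sorted(line):
    if total + weight <= CAPACITY:
        total += weight
        last_name = name
    if total == CAPACITY:  # mirror of the UPDATE 1 early stop
        break

print(last_name)  # -> Will Smith
```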
Ok, since it seems that my last two questions (this one and this one) only led to confusion, I will try to explain the FULL problem here, so it might be a long post.
I'm trying to create a database for a trading system. The database has 2 main tables. One is the "Ticks" table and the other is "Candles". As shown in the figure, each table has its own attributes.
Candles, bars or ohlc are the same thing.
The way a candle is seen in a chart is like this:
Candles are just a way to represent aggregated data, nothing more.
There are many ways to aggregate ticks in order to create one candle. In this post, I'm asking for a particular way that is creating one candle every 500 ticks. So, if the ticks table has 1000 ticks, I can create only 2 candles. If it has 500 ticks, I can create 1 candle. If it has 5000 ticks, I can create 10 candles. If there are 5001 ticks I still have only 10 candles, because I'm missing the other 499 ticks in order to create the 11th candle.
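The chunking rule itself is straightforward; here is a plain-Python sketch using a chunk size of 4 instead of 500 (bid values invented) to keep it readable:

```python
# One candle per full chunk of ticks: open = first bid, close = last bid,
# high/low = max/min within the chunk. A trailing partial chunk makes no candle.
def make_candles(bids, size):
    candles = []
    for i in range(0, len(bids) - size + 1, size):
        chunk = bids[i:i + size]
        candles.append({"open": chunk[0], "high": max(chunk),
                        "low": min(chunk), "close": chunk[-1]})
    return candles

bids = [1.10, 1.12, 1.09, 1.11,   # chunk 1 -> one candle
        1.11, 1.15, 1.13, 1.14,   # chunk 2 -> one candle
        1.16]                     # leftover tick -> no candle yet
print(make_candles(bids, 4))
```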
Actually, I'm storing all the ticks using a python script and creating (and therefore, inserting in the candles table) candles with another python script. This is a real time process.
Both scripts run in a while True: loop. No, I can't (read shouldn't) stop the scripts because the market is opened 24 hours - 5 days a week.
What I'm trying to do is get rid of the python script that creates and stores the candles in the candles table. Why? Because I think it will improve performance. Instead of doing multiple queries to find out how many ticks are available to create a new candle, I think a trigger could handle it more efficiently (please correct me if I'm mistaken).
I don't know how to actually solve it, but what I'm trying is this (thanks to @GordonLinoff for helping me in previous questions):
do $$
begin
with total_ticks as (
select count(*) c from (
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc) totals),
ticks_for_candles as(
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc
), candles as(
select max(date) as date,
max(bid) filter (where mod(seqnum, 500) = 1) as open,
max(bid) as high,
min(bid) as low,
max(bid) filter (where mod(seqnum, 500) = 500-1) as close,
max(ask) filter (where mod(seqnum, 500) = 500-1) as ask
from (
select t.*, row_number() over (order by date) as seqnum
from (select * from ticks_for_candles) t) as a
group by floor((seqnum - 1) /500)
having count(*) = 500
)
case 500<(select * from total_ticks)
when true then
return select * from candles
end;
end $$;
Using this, I get this error:
ERROR: syntax error at or near "case"
LINE 33: case 500<(select * from total_ticks)
^
SQL state: 42601
Character: 945
As you can see, there is no select after the CTEs. If I put:
select case 500<(select * from total_ticks)
when true then
return select * from candles
end;
end $$;
I get this error:
ERROR: subquery must return only one column
LINE 31: (select * from candles)
^
QUERY: with total_ticks as (
select count(*) c from (
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc) totals),
ticks_for_candles as(
select * from eurusd_tick2 eurusd where date >
(SELECT date from eurusd_ohlc order by date desc limit 1)
order by date asc
), candles as(
select max(date) as date,
max(bid) filter (where mod(seqnum, 500) = 1) as open,
max(bid) as high,
min(bid) as low,
max(bid) filter (where mod(seqnum, 500) = 500-1) as close,
max(ask) filter (where mod(seqnum, 500) = 500-1) as ask
from (
select t.*, row_number() over (order by date) as seqnum
from (select * from ticks_for_candles) t) as a
group by floor((seqnum - 1) /500)
having count(*) = 500
)
select case 1000>(select * from total_ticks)
when true then
(select * from candles)
end
CONTEXT: PL/pgSQL function inline_code_block line 4 at SQL statement
SQL state: 42601
So honestly, I don't know how to do it correctly. It doesn't have to be the actual code I provide here, but the desired output looks as follows:
---------------------------------------------------------------------------
|           date           |  open  |  high   |   low   |  close  |   ask  |
|2020-05-01 20:39:27.603452| 1.0976 | 1.09766 | 1.09732 | 1.09762 | 1.09776|
This would be the output when there is enough ticks to create only 1 candle. If there is enough to create two of them, then there should be 2 rows.
So, at the end of the day, what I have in mind is that the trigger should check constantly if there is enough data to create a candle and if it is, then create it.
Is this a good idea or I should stick to the python script?
Can this be achieved with my approach?
What I'm doing wrong?
What should I do and how should I manage this situation?
I really hope that the question now is complete and there is no missing information.
All comments and advice are appreciated.
Thanks!
EDIT: Since this is a real time process, in one second there could be 499 ticks in the database and in the next second there could be 503 ticks. This means that 4 ticks arrived in 1 second.
Being a database guy, my approach would be to use triggers in the database.
Create a third table candle_in_the_making that contains the data from the ticks that have not yet been aggregated to a candles entry.
Create an INSERT trigger on the ticks table (doesn't matter if BEFORE or AFTER) that does the following:
For every tick inserted, add a row to candle_in_the_making.
If the row count reaches 500, compute and insert a new candles row and TRUNCATE candle_in_the_making.
This is simple if ticks are inserted only in a single thread.
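For the single-threaded case, the two trigger steps can be prototyped with SQLite triggers; this sketch uses a chunk size of 3 and invented table/column names, and wraps the aggregate in a subquery so the "only when full" check filters the single aggregate row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ticks   (id INTEGER PRIMARY KEY, bid REAL);
CREATE TABLE staging (id INTEGER PRIMARY KEY, bid REAL);  -- candle_in_the_making
CREATE TABLE candles (open REAL, high REAL, low REAL, close REAL);

-- Every tick is copied into staging; once staging holds 3 rows,
-- a candle is emitted and staging is emptied.
CREATE TRIGGER make_candle AFTER INSERT ON ticks
BEGIN
    INSERT INTO staging (bid) VALUES (NEW.bid);
    INSERT INTO candles
        SELECT o, h, l, c FROM (
            SELECT (SELECT bid FROM staging ORDER BY id LIMIT 1)      AS o,
                   MAX(bid) AS h,
                   MIN(bid) AS l,
                   (SELECT bid FROM staging ORDER BY id DESC LIMIT 1) AS c
            FROM staging
        ) WHERE (SELECT COUNT(*) FROM staging) = 3;
    DELETE FROM staging WHERE (SELECT COUNT(*) FROM staging) = 3;
END;
""")
con.executemany("INSERT INTO ticks (bid) VALUES (?)",
                [(1.0,), (5.0,), (3.0,), (10.0,), (2.0,), (6.0,), (7.0,)])
candles = con.execute("SELECT * FROM candles").fetchall()
print(candles)  # two complete candles; the 7th tick is still in staging
```

The same shape carries over to PostgreSQL, where the trigger body would live in a PL/pgSQL function.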
If ticks are inserted concurrently, you have to find a way to prevent two threads from inserting the 500th tick in candle_in_the_making at the same time (so that you end up with 501 entries). I can think of two ways to do that in the database:
Have an extra table c_i_m_count that contains only a single number, which is the number of rows in candle_in_the_making. Before you insert into candle_in_the_making, you run the atomic
UPDATE c_i_m_count SET counter = counter + 1 RETURNING counter;
This locks the row, so that any two INSERTs into candle_in_the_making are effectively serialized.
Use advisory locks to serialize the inserting threads. In particular, a transaction level exclusive lock as taken by pg_advisory_xact_lock would be indicated.
I’m in the process of creating a report that will tell end users what percentage of a gridview (the total number of records is a finite number) was completed in a given month. I have a gridview with records that I’ve imported, and users have to go into each record and update a couple of fields. I’m attempting to create a report telling me what percentage of the grand total of records was completed in a given month. All I need is the percentage. The grand total (for this example) is 2000.
I’m not sure if the actual gridview information/code is needed here but if it does, let me know and I’ll add it.
The problem is that I have been able to calculate the percentage total, but when it's displayed, the percentage total is repeated for every single line in the table. I'm scratching my head over how to make this result appear only once.
Right now here’s what I have for my SQL code (I use nvarchar because we import from many non windows systems and get all sorts of extra characters and added spaces to our information):
Declare @DateCount nvarchar(max);
Declare @DivNumber decimal(5,1);
SET @DivNumber = (.01 * 2541);
SET @DateCount = (SELECT (Count(date_record_entered) FROM dbo.tablename WHERE date_record_entered IS NOT NULL and date_record_entered >= 20131201 AND date_record_entered <= 20131231);
SELECT CAST(ROUND(@DivNumber / @DateCount, 1) AS decimal(5,1) FROM dbo.tablename
Let’s say for this example the total number of records in the date_record_entered for the month of December is 500.
I’ve tried the smaller pieces of code separately with no success. This is the most recent thing I’ve tried.
I know I'm missing something simple here but I'm not sure what.
::edit::
What I'm looking for as the expected result of my query is a percentage of the records modified in a given month. If 500 records were done, that would be 25%. I just want the 25 (and trailing decimal(s) when it applies) to show once, not once for every row in this table.
The following query should provide what you are looking for:
Declare @DivNumber decimal(5,1);
SET @DivNumber = (.01 * 2541);
SELECT
CAST(ROUND(Count(date_record_entered) / @DivNumber, 1) AS decimal(5,1))
FROM dbo.tablename
WHERE date_record_entered IS NOT NULL
and date_record_entered >= 20131201
AND date_record_entered <= 20131231
Why do you select the constant value cast(round(@divNumber / @DateCount, 1) as decimal(5,1)) from the table? That's the cause of your problem.
I'm not too familiar with SQL Server, but you might try to just select without a FROM clause.
select emp.Natinality_Id, mstn.Title as Nationality, count(emp.Employee_Id) as Employee_Count,
count(emp.Employee_Id) * 100.0 / nullif(sum(count(*)) over (), 0) as Nationality_Percentage
FROM Employee as emp
left join Mst_Natinality as mstn on mstn.Natinality_Id = emp.Natinality_Id
where
emp.Is_Deleted = 'false'
group by emp.Natinality_Id, mstn.Title
I have a MySQL table with approximately 3000 rows per user. One of the columns is a datetime field, which is mutable, so the rows aren't in chronological order.
I'd like to visualize the time distribution in a chart, so I need a number of individual datapoints. 20 datapoints would be enough.
I could do this:
select timefield from entries where uid = ? order by timefield;
and look at every 150th row.
Or I could do 20 separate queries and use limit 1 and offset.
But there must be a more efficient solution...
Michal Sznajder almost had it, but you can't use column aliases in a WHERE clause in SQL. So you have to wrap it as a derived table. I tried this and it returns 20 rows:
SELECT * FROM (
SELECT @rownum:=@rownum+1 AS rownum, e.*
FROM (SELECT @rownum := 0) r, entries e) AS e2
WHERE uid = ? AND rownum % 150 = 0;
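On MySQL 8.0+ (or any engine with window functions) the user-variable trick can be replaced by ROW_NUMBER(), which also lets you filter on uid before numbering; here is a sketch against SQLite, whose window syntax is the same (3,000 synthetic timestamps):

```python
import sqlite3
from datetime import datetime, timedelta

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entries (uid INTEGER, timefield TEXT)")
start = datetime(2024, 1, 1)
con.executemany(
    "INSERT INTO entries VALUES (1, ?)",
    [((start + timedelta(minutes=i)).isoformat(),) for i in range(3000)])

# Number the user's rows in time order, then keep every 150th one.
sample = con.execute("""
    SELECT timefield FROM (
        SELECT timefield,
               ROW_NUMBER() OVER (ORDER BY timefield) AS rn
        FROM entries
        WHERE uid = ?  -- filtered before numbering, unlike the user-variable version
    )
    WHERE rn % 150 = 0
""", (1,)).fetchall()
print(len(sample))  # 3000 rows sampled down to 20
```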
Something like this came to my mind
select @rownum:=@rownum+1 rownum, entries.*
from (select @rownum:=0) r, entries
where uid = ? and rownum % 150 = 0
I don't have MySQL at my hand but maybe this will help ...
As far as visualization, I know this is not the periodic sampling you are talking about, but I would look at all the rows for a user and choose an interval bucket, SUM within the buckets and show on a bar graph or similar. This would show a real "distribution", since many occurrences within a time frame may be significant.
SELECT DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket -- choose an appropriate granularity (days used here)
,COUNT(*)
FROM entries
WHERE uid = ?
GROUP BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
ORDER BY DATEADD(day, DATEDIFF(day, 0, timefield), 0)
Or if you don't like the way you have to repeat yourself - or if you are playing with different buckets and want to analyze across many users in 3-D (measure in Z against x, y uid, bucket):
SELECT uid
,bucket
,COUNT(*) AS measure
FROM (
SELECT uid
,DATEADD(day, DATEDIFF(day, 0, timefield), 0) AS bucket
FROM entries
) AS buckets
GROUP BY uid
,bucket
ORDER BY uid
,bucket
If I wanted to plot in 3-D, I would probably determine a way to order users according to some meaningful overall metric for the user.
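The DATEADD/DATEDIFF pair above is the T-SQL idiom for truncating to a day; engines with a date-truncation function make the bucket a one-liner. A sketch using SQLite's date() on invented sample rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entries (uid INTEGER, timefield TEXT)")
con.executemany("INSERT INTO entries VALUES (?, ?)", [
    (1, "2024-01-01 09:30:00"), (1, "2024-01-01 17:45:00"),
    (1, "2024-01-02 08:00:00"), (2, "2024-01-01 12:00:00"),
])

# date() truncates each timestamp to its day, giving one bucket per calendar day.
buckets = con.execute("""
    SELECT date(timefield) AS bucket, COUNT(*)
    FROM entries
    WHERE uid = ?
    GROUP BY bucket
    ORDER BY bucket
""", (1,)).fetchall()
print(buckets)  # [('2024-01-01', 2), ('2024-01-02', 1)]
```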
@Michal
For whatever reason, your example only works when the WHERE on rownum uses a less-than operator. I think when the WHERE filters out a row, rownum doesn't get incremented, and then it can't match anything else.
If the original table has an auto incremented id column, and rows were inserted in chronological order, then this should work:
select timefield from entries
where uid = ? and id % 150 = 0 order by timefield;
Of course that doesn't work if there is no correlation between the id and the timefield, unless you don't actually care about getting evenly spaced timefields, just 20 random ones.
Do you really care about the individual data points? Or will using the statistical aggregate functions on the day number instead suffice to tell you what you wish to know?
AVG
STDDEV_POP
VARIANCE
TO_DAYS
select timefield
from entries
where rand() <= .01 -- will return ~1% of rows; adjust as needed.
Not a MySQL expert, so I'm not sure how rand() behaves in this environment.
For my reference - and for those using postgres - Postgres 9.4 will have ordered set aggregates that should solve this problem:
SELECT percentile_disc(0.95)
WITHIN GROUP (ORDER BY response_time)
FROM pageviews;
Source: http://www.craigkerstiens.com/2014/02/02/Examining-PostgreSQL-9.4/
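percentile_disc(0.95) returns the smallest value whose cumulative share of the ordered rows is at least 95%; the same pick in plain Python (response times invented for the example):

```python
import math

# Discrete percentile: sort the values, then take the one at 1-based
# rank ceil(p * n) - the smallest value covering at least fraction p of the rows.
def percentile_disc(values, p):
    ordered = sorted(values)
    k = math.ceil(p * len(ordered))
    return ordered[k - 1]

response_times = list(range(1, 101))  # invented response times, 1..100 ms
print(percentile_disc(response_times, 0.95))  # -> 95
```

Unlike percentile_cont, this always returns an actual value from the data set rather than an interpolated one.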