Efficiently running an SQL query over multiple inputs

Hi, I've got a simulation snapshot that is currently stored in a PostgreSQL database as a table. The schema for the snapshot table is:
simdb=> \d isonew_4.snapshot_102
Table "isonew_4.snapshot_102"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
x | real |
y | real |
z | real |
vx | real |
vy | real |
vz | real |
pot | real |
mass | real |
Indexes:
"snapshot_102_id_idx" btree (id) WITH (fillfactor=100)
I've got a query that calculates the mass enclosed for a single radius fine:
SELECT SUM(mass) AS mass
FROM isonew_4.snapshot_102 AS s
WHERE SQRT(s.x^2 + s.y^2 + s.z^2) < {radius}
However, I would like to run this over a number of different radii.
Since the table has around 100 million rows, it's something that I would prefer to do as a SQL query rather than grabbing all of the particles and using something like numpy.histogram in Python to do the binning locally on my machine.

Method #1
This query might work, with for example 10, 20 and 25 as the successive values for the radius:
WITH r(radius) as (values (10),(20),(25))
SELECT radius, SUM(mass) AS mass
FROM isonew_4.snapshot_102 AS s CROSS JOIN r
WHERE SQRT(s.x^2 + s.y^2 + s.z^2) < radius
GROUP BY radius;
The output has two columns: radius and corresponding sum(mass).
Method #2
If the query is too slow because of the CROSS JOIN with the list (presumably, EXPLAIN or better EXPLAIN ANALYZE would tell for sure), a different approach that certainly guarantees a single scan of the big table is to gather all results in a single row, one column per radius, with a generated query looking like this:
SELECT
sum(case when r < 10 then s.mass else 0 end) as radius10,
sum(case when r < 20 then s.mass else 0 end) as radius20,
sum(case when r < 25 then s.mass else 0 end) as radius25
FROM (select mass,SQRT(x^2 + y^2 + z^2) as r from isonew_4.snapshot_102) AS s
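A variation on the same single-scan idea, if the server is PostgreSQL 9.4 or later (an assumption), is to use aggregate FILTER clauses instead of CASE expressions; a minimal sketch:
SELECT
    SUM(mass) FILTER (WHERE r < 10) AS radius10,
    SUM(mass) FILTER (WHERE r < 20) AS radius20,
    SUM(mass) FILTER (WHERE r < 25) AS radius25
FROM (
    SELECT mass, SQRT(x^2 + y^2 + z^2) AS r
    FROM isonew_4.snapshot_102
) AS s;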
Method #3
If that's not practical, another completely different approach that might be worth trying would be to pre-compute SQRT(x^2 + y^2 + z^2) in a B-tree expression (functional) index, in the hope that the SQL engine can use it with the inequality comparison. Whether this happens, and whether the query would be faster or not, depends mainly on the data distribution.
create index radius_idx on isonew_4.snapshot_102(SQRT(x^2 + y^2 + z^2));
Then use the first query, either repeated with a single radius each time, or method #1 with the GROUP BY and all values at once. If the values are very selective, the execution might be way faster than even a single large sequential scan.
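To see whether the planner actually uses the expression index, the WHERE clause has to repeat the indexed expression verbatim; a quick, hypothetical sanity check could look like:
EXPLAIN ANALYZE
SELECT SUM(mass) AS mass
FROM isonew_4.snapshot_102
WHERE SQRT(x^2 + y^2 + z^2) < 10;
-- look for an index scan / bitmap index scan on radius_idx in the output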

Related

How to replace 0 and 1 in SQL Server 2012

I have a rugby database with a player table. In the player table I have a performance column and I want to represent the performance as
0 = low
1 = medium
2 = high
I don't know what datatype the column should be. And what is the formula or function to do that?
Please help
You can define your column like this:
performance tinyint not null check (performance in (0, 1, 2))
tinyint takes only 1 byte for a value and values can range from 0 to 255.
If you store the values as 1 - Low, 2 - Medium, 3 - High and are using SQL Server 2012+, then you can simply use the CHOOSE function to convert the value to text when selecting, like this:
select choose(performance,'Low','Medium','High')
. . .
If you really want to store as 0,1,2, use :
select choose(performance+1,'Low','Medium','High')
. . .
If you are using a lower version of SQL server, you can use CASE like this:
case performance
when 0 then 'Low'
when 1 then 'Medium'
when 2 then 'High'
end
1- The column datatype should be int.
2- Where you send the data, check the performance first, like:
if (performance = low)
    perVar = 0
send it into the database
There are a number of ways you can handle this. One way would be to represent the performance using an int column, which would take on values 0, 1, 2, .... To get the labels for those performances, you could create a separate table which would map those numbers to descriptive strings, e.g.
id | text
0 | low
1 | medium
2 | high
You would then join to this table whenever you needed the full text description. Note that this is probably the only option which will scale as the number of performance types starts to get large.
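A minimal sketch of that lookup table and join, with made-up names (player and performance come from the question; PerformanceLevel and PlayerName are assumptions):
CREATE TABLE PerformanceLevel (
    id tinyint NOT NULL PRIMARY KEY,
    description varchar(20) NOT NULL
);
INSERT INTO PerformanceLevel (id, description)
VALUES (0, 'low'), (1, 'medium'), (2, 'high');

-- join whenever the text label is needed
SELECT p.PlayerName, pl.description AS performance
FROM player p
JOIN PerformanceLevel pl ON pl.id = p.performance;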
If you don't want a separate table, you could also use a CASE expression to generate labels when querying, e.g.
CASE WHEN id = 0 THEN 'low'
WHEN id = 1 THEN 'medium'
WHEN id = 2 THEN 'high'
END
I would use a TINYINT datatype for the performance column to conserve space, then use a FOREIGN KEY CONSTRAINT to a second table which holds the descriptions. The constraint would force the entry of 0, 1, 2 in the performance column while providing a normalized solution that could grow to include additional performance metrics.
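For example, reusing the PerformanceLevel lookup table sketched above (the constraint name is made up):
ALTER TABLE player
    ADD CONSTRAINT FK_player_performance
    FOREIGN KEY (performance) REFERENCES PerformanceLevel (id);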

How to consolidate blocks of time?

I have a derived table with a list of relative seconds to a foreign key (ID):
CREATE TABLE Times (
ID INT
, TimeFrom INT
, TimeTo INT
);
The table contains mostly non-overlapping data, but there are occasions where one record's TimeFrom is less than the TimeTo of another record, i.e. they overlap:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 10 | 30 |
| 10 | 50 | 70 |
| 10 | 60 | 150 |
| 10 | 75 | 150 |
| .. | ... | ... |
+----+----------+--------+
The result set is meant to be a flattened linear idle report, but with too many of these overlaps, I end up with negative time in use. I.e., if the window above for ID = 10 was 150 seconds long, and I summed the differences of relative seconds to subtract from the window size, I'd wind up with 150-(20+20+90+75)=-55. I have tried this approach, and it is what led me to realize there were overlaps that needed to be flattened.
So, what I'm looking for is a solution to flatten the overlaps into one set of times:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 10 | 30 |
| 10 | 50 | 150 |
| .. | ... | ... |
+----+----------+--------+
Considerations: Performance is very important here, as this is part of a larger query that will perform well on its own, and I'd rather not impact its performance much if I can help it.
On a comment regarding "Which seconds have an interval", this is something I have tried for the end result, and am looking for something with better performance. Adapted to my example:
SELECT SUM(C.N)
FROM (
SELECT A.N, ROW_NUMBER()OVER(ORDER BY A.N) RowID
FROM
(SELECT TOP 60 1 N FROM master..spt_values) A
, (SELECT TOP 720 1 N FROM master..spt_values) B
) C
WHERE EXISTS (
SELECT 1
FROM Times SE
WHERE SE.ID = 10
AND SE.TimeFrom <= C.RowID
AND SE.TimeTo >= C.RowID
AND EXISTS (
SELECT 1
FROM Times2 D
WHERE ID = SE.ID
AND D.TimeFrom <= C.RowID
AND D.TimeTo >= C.RowID
)
GROUP BY SE.ID
)
The problem I have with this solution is that I get a Row Count Spool out of the EXISTS query in the query plan, with a number of executions equal to COUNT(C.*). I left the real numbers in that query to illustrate that getting around this approach is for the best. Because even with a Row Count Spool reducing the cost of the query by quite a bit, its execution count increases the cost of the query as a whole by quite a bit as well.
Further Edit: The end goal is to put this in a procedure, so Table Variables and Temp Tables are also a possible tool to use.
OK. I'm still trying to do this with just one SELECT. But this totally works:
DECLARE @tmp TABLE (ID INT, GroupId INT, TimeFrom INT, TimeTo INT)
INSERT INTO @tmp
SELECT ID, 0, TimeFrom, TimeTo
FROM Times
ORDER BY Id, TimeFrom
DECLARE @timeTo int, @id int, @groupId int
SET @groupId = 0
-- "quirky update": walk the rows and bump the group number whenever a gap is found
UPDATE @tmp
SET
@groupId = CASE WHEN id != @id THEN 0
WHEN TimeFrom > @timeTo THEN @groupId + 1
ELSE @groupId END,
GroupId = @groupId,
@timeTo = TimeTo,
@id = id
-- collapse each group of overlapping rows into a single interval
SELECT Id, MIN(TimeFrom), MAX(TimeTo) FROM @tmp
GROUP BY ID, GroupId ORDER BY ID
Left join each row to its successor overlapping row on the same ID value (where one exists).
Now for each row in the result-set of LHS left join RHS the contribution to the elapsed time for the ID is:
isnull(RHS.TimeFrom,LHS.TimeTo) - LHS.TimeFrom as TimeElapsed
Summing these by ID should give you the correct answer.
Note that:
- where there isn't an overlapping successor row the calculation is simply
LHS.TimeTo - LHS.TimeFrom
- where there is an overlapping successor row the calculation will net to
(RHS.TimeFrom - LHS.TimeFrom) + (RHS.TimeTo - RHS.TimeFrom)
which simplifies to
RHS.TimeTo - LHS.TimeFrom
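A rough sketch of that approach in T-SQL, using OUTER APPLY to pick only the nearest overlapping successor per row (assumes the Times table from the question; not tested against edge cases such as fully nested intervals):
SELECT t.ID,
       SUM(ISNULL(nxt.TimeFrom, t.TimeTo) - t.TimeFrom) AS TimeElapsed
FROM Times t
OUTER APPLY (
    SELECT TOP 1 n.TimeFrom
    FROM Times n
    WHERE n.ID = t.ID
      AND n.TimeFrom > t.TimeFrom  -- a later row...
      AND n.TimeFrom < t.TimeTo    -- ...that overlaps this one
    ORDER BY n.TimeFrom
) nxt
GROUP BY t.ID;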
What about something like below (assumes SQL 2008+ due to CTE):
WITH Overlaps
AS
(
SELECT t1.Id,
TimeFrom = MIN(t1.TimeFrom),
TimeTo = MAX(t2.TimeTo)
FROM dbo.Times t1
INNER JOIN dbo.Times t2 ON t2.Id = t1.Id
AND t2.TimeFrom > t1.TimeFrom
AND t2.TimeFrom < t1.TimeTo
GROUP BY t1.Id
)
SELECT o.Id,
o.TimeFrom,
o.TimeTo
FROM Overlaps o
UNION ALL
SELECT t.Id,
t.TimeFrom,
t.TimeTo
FROM dbo.Times t
INNER JOIN Overlaps o ON o.Id = t.Id
AND (o.TimeFrom > t.TimeFrom OR o.TimeTo < t.TimeTo);
I do not have a lot of data to test with, but it seems decent on the smaller data sets I have.
I also wrapped my head around this issue, and after all I found that the problem is your data.
You claim (if I get that right) that these entries should reflect the relative times when a user goes idle / comes back.
So, you should consider sanitizing your data and refactoring your inserts to produce valid data sets.
For instance, the two lines:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 50 | 70 |
| 10 | 60 | 150 |
how can it be possible that a user is idle until second 70, but goes idle again at second 60? This already implies that he must have come back at second 59 at the latest.
I can only assume that this issue comes from different threads and/or browser windows (tabs) a user might be using your application with. (Each having its own "idle detection".)
So instead of working around the symptoms, you should fix the cause! Why is this data entry inserted into the table? You could avoid this by simply checking whether the user is already idle before inserting a new row.
Create a unique key constraint on ID and TimeTo
Whenever an idle-event is detected, execute the following query:
INSERT IGNORE INTO Times (ID,TimeFrom,TimeTo)VALUES('10', currentTimeStamp, -1);
-- (If the user is already "idle" - nothing will happen)
Whenever a comeback-event is detected, execute the following query:
UPDATE Times SET TimeTo=currentTimeStamp WHERE ID='10' and TimeTo=-1
-- (If the user is already "back" - nothing will happen)
The fiddle linked here: http://sqlfiddle.com/#!2/dcb17/1 would reproduce the chain of events for your example, but resulting in a clean and logical set of idle-windows:
ID TIMEFROM TIMETO
10 10 30
10 50 70
10 75 150
Note: the output is slightly different from the output you desired. But I feel that this is more accurate, because of the reason outlined above: a user cannot go idle at second 70 without returning from his current idle state before that. He either STAYS idle (and a second thread/tab runs into the idle-event), or he returned in between.
Especially given your need to maximize performance, you should fix the data and not invent a work-around query. This is maybe 3 ms upon inserts, but could be worth 20 seconds upon select!
Edit: if multi-threading / multiple sessions is the cause of the wrong inserts, you would also need to implement a check whether most_recent_come_back_time < now() - idleTimeout - otherwise a user might come back on tab1 and be recorded as idle on tab2 a few seconds later, because tab2 ran into its idle timeout since the user only refreshed tab1.
I had the 'same' problem once with 'days' (additionally without counting weekends and holidays).
The word "counting" gave me the following idea:
create table Seconds ( sec INT);
insert into Seconds values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9), ...
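If typing out the values is impractical, the Seconds table could also be filled from a row source; a sketch assuming SQL Server and that one day's worth of seconds (86400) is enough:
;WITH n AS (
    SELECT TOP (86400)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS sec
    FROM master..spt_values a CROSS JOIN master..spt_values b
)
INSERT INTO Seconds (sec)
SELECT sec FROM n;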
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom and t.timeto-1
and id=10;
You can shift the start to 0 (I put the '10' here in parentheses):
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom- (10) and t.timeto- (10)-1
and id=10;
and finally
select count(distinct sec) from times t, seconds s,
(select min(timefrom) m from times where id=10) as m
where s.sec between t.timefrom-m.m and t.timeto-m.m-1
and id=10;
Additionally, you can "ignore" e.g. 10 seconds by dividing; you lose some precision but gain speed:
select count(distinct sec)*d.d from times t, seconds s,
(select min(timefrom) m from times where id=10) as m,
(select 10 d) as d
where s.sec between (t.timefrom-m.m)/d.d and (t.timeto-m.m)/d.d-1
and t.id=10;
Sure it depends on the range you have to look at, but a 'day' or two of seconds should work (although I did not test it).
fiddle ...

MS-SQL JOIN with multiple SUBSTRING and LIKE

I have an MS SQL 2005/2008 database and am trying to compare two tables of data using substrings with the % wildcard, to try to find data within one character of a column in the other table.
Example is:
UPDATE table1
SET table1.Marker = 1
FROM table1
INNER JOIN table2
ON table1.ForeignKey = table2.ID
AND table1.CharacterColumn LIKE SUBSTRING(table2.CharacterColumn , 1, 5) + '%' + SUBSTRING(table2.CharacterColumn , 7, 8)
UPDATE table1
SET table1.Marker = 1
FROM table1
INNER JOIN table2
ON table1.ForeignKey = table2.ID
AND table1.CharacterColumn LIKE SUBSTRING(table2.CharacterColumn , 1, 6) + '%' + SUBSTRING(table2.CharacterColumn , 8, 8)
At present it takes a while to run this routine, as the column can contain up to 10 characters and the data comes from a table1 of 300 million rows (with a working dataset of maybe 300k) and a table2 of 2 million rows (a dataset of 100k).
My question is: is the JOIN statement the best way to do one-character-out searching on a column?
I can't give exact examples as the data is protected, however this should help:
Table2 -
ID | FK | Name
1 | 100 | Phillips
2 | 100 | Bloggs
3 | 100 | Jones
Table1 -
ID | Table2FK | Name
1 | 100 | Philpips
2 | 100 | Bloggs
3 | 100 | Jones
As you can see, table2 record 1 is within one character of table1 record 1, and I want to identify that. Also, the differing character can be at any point in the string.
When you wrap a column in a SQL function, SQL Server is no longer able to use indexes on it. If you have large tables like you have described, SQL Server will need to do many CPU-intensive operations like index scans. You have 2 alternatives:
Create an indexed view with columns holding those sub-strings. It will take longer to build the first time, but after that you will be able to join easily.
The second alternative is to modify your tables to break the character column apart into two separate columns, and then create an index on those two columns.
String operations are very costly, and it is best to break strings apart into separate columns instead of doing it in real time.
Indexed Views Documentation http://technet.microsoft.com/en-us/library/dd171921(v=sql.100).aspx
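One possible way to "break the column apart" without changing the application is a pair of computed columns that can then be indexed (the column and index names below are assumptions; whether the optimizer can exploit them for this particular LIKE join would still need checking against the execution plan):
ALTER TABLE table2 ADD
    NamePrefix AS SUBSTRING(CharacterColumn, 1, 5) PERSISTED,
    NameSuffix AS SUBSTRING(CharacterColumn, 7, 8) PERSISTED;

CREATE INDEX IX_table2_name_parts ON table2 (ID, NamePrefix, NameSuffix);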

How to generate records and spread them among pairs from a table?

I have to generate about a million random trips between about 40K destinations. Each destination has its own weight (total_probability); the larger it is, the more trips should start or end at this place.
Either the trips should be generated randomly, but destinations (start and end points) should be weighted by probability, or it's possible to just pre-calculate an exact number of trips (divide each weight by the sum of weights, multiply by 1M and round to integers).
Problem is how to make it in PostgreSQL without generating the 40K*40K table with all destinations pairs.
Table "public.dests"
Column | Type | Modifiers
-------------------+------------------+-----------
id | integer |
total_probability | double precision |
Table "public.trips"
Column | Type | Modifiers
------------+------------------+-----------
from_id | integer |
to_id | integer |
trips_num | integer |
...
some other metrics...
primary key for trips is (from_id, to_id)
Should I generate a table with 1M records and then update it iteratively, or will a for loop with 1M inserts be fast enough? I work on a 2-core lightweight laptop.
P.S. I gave up and did this in Python. To perform a set of queries and the transformation in Python, I'll run SQL scripts from Python rather than from a shell script. Thanks for suggestions!
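For reference, the "pre-calculate an exact number of trips" option mentioned in the question can be sketched directly in SQL (assuming the dests table above and a budget of one million trips; the alias trips_starting_here is made up):
SELECT id,
       ROUND(1000000 * total_probability
             / SUM(total_probability) OVER ())::int AS trips_starting_here
FROM dests;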
In 9.1, you can use TRIGGERs on VIEWs, which effectively let you create materialized views (albeit manually). I think your first run may be expensive, and using a loop is probably the way to go, but after that I'd use a series of TRIGGERs to maintain the data in a table.
At the end of the day you need to decide whether or not you want to calculate the results for every query, or you memoize the result via a materialized view.
I'm confused by your requirement but I guess this can get you started:
select
f.id as "from", t.id as to,
f.total_prob as from_prob, t.total_prob as to_prob
from
(
select id, total_prob
from dest
order by random()
limit 1010
) f
inner join
(
select id, total_prob
from dest
order by random()
limit 1010
) t on f.id != t.id
order by random()
limit 1000000
;
EDIT:
This took about ten minutes on my not-that-modern desktop:
create table trips (from_id integer, to_id integer, trip_prob double precision);
insert into trips (from_id, to_id, trip_prob)
select
f.id, t.id, f.total_prob * t.total_prob
from
(
select id, total_prob
from dests
) f
inner join
(
select id, total_prob
from dests
) t on f.id != t.id
where random() <= f.total_prob * t.total_prob
order by random()
limit 1000000
;
alter table trips add primary key (from_id, to_id);
select * from trips limit 5;
from_id | to_id | trip_prob
---------+-------+--------------------
1 | 6 | 0.0728749980226821
1 | 11 | 0.239824750923743
1 | 14 | 0.235899211677577
1 | 15 | 0.176168172647811
1 | 17 | 0.19708509944588
(5 rows)

How can I speed up queries that are looking for the root node of a transitive closure?

I have a historical transitive closure table that represents a tree.
create table TRANSITIVE_CLOSURE
(
CHILD_NODE_ID number not null enable,
ANCESTOR_NODE_ID number not null enable,
DISTANCE number not null enable,
FROM_DATE date not null enable,
TO_DATE date not null enable,
constraint TRANSITIVE_CLOSURE_PK unique (CHILD_NODE_ID, ANCESTOR_NODE_ID, DISTANCE, FROM_DATE, TO_DATE)
);
Here's some sample data:
CHILD_NODE_ID | ANCESTOR_NODE_ID | DISTANCE
--------------------------------------------
1 | 1 | 0
2 | 1 | 1
2 | 2 | 0
3 | 1 | 2
3 | 2 | 1
3 | 3 | 0
Unfortunately, my current query for finding the root node causes a full table scan:
select *
from transitive_closure tc
where
distance = 0
and not exists (
select null
from transitive_closure tci
where tc.child_node_id = tci.child_node_id
and tci.distance <> 0
);
On the surface, it doesn't look too expensive, but as I approach 1 million rows, this particular query is starting to get nasty... especially when it's part of a view that grabs the adjacency tree for legacy support.
Is there a better way to find the root node of a transitive closure? I would like to rewrite all of our old legacy code, but I can't... so I need to build the adjacency list somehow. Getting everything except the root node is easy, so is there a better way? Am I thinking about this problem the wrong way?
Query plan on a table with 800k rows.
OPERATION OBJECT_NAME OPTIONS COST
SELECT STATEMENT 2301
HASH JOIN RIGHT ANTI 2301
Access Predicates
TC.CHILD_NODE_ID=TCI.CHILD_NODE_ID
TABLE ACCESS TRANSITIVE_CLOSURE FULL 961
Filter Predicates
TCI.DISTANCE = 1
TABLE ACCESS TRANSITIVE_CLOSURE FULL 962
Filter Predicates
DISTANCE=0
How long does the query take to execute, and how long do you want it to take? (You usually do not want to use the cost for tuning. Very few people know what the explain plan cost really means.)
On my slow desktop the query only took 1.5 seconds for 800K rows, and then 0.5 seconds after the data was in memory. Are you getting something significantly worse, or will this query be run very frequently?
I don't know what your data looks like, but I'd guess that a full table scan will always be best for this query. Assuming that your hierarchical data is relatively shallow, i.e. there are many distances of 0 and 1 but very few distances of 100, the most important column will not be very distinct. This means that any of the index entries for distance will point to a large number of blocks. It will be much cheaper to read the whole table at once using multi-block reads than to read a large amount of it one block at a time.
Also, what do you mean by historical? Can you store the results of this query in a materialized view?
Another possible idea is to use analytic functions. This replaces the second table scan with a sort. This approach is usually faster, but for me this query actually takes longer, 5.5 seconds instead of 1.5. But maybe it will do better in your environment.
select * from
(
select
max(case when distance <> 0 then 1 else 0 end)
over (partition by child_node_id) has_non_zero_distance
,transitive_closure.*
from transitive_closure
)
where distance = 0
and has_non_zero_distance = 0;
Can you try adding an index on distance and child_node_id, or changing the order of these columns in the existing unique index? I think it should then be possible for the outer query to access the table via the index on distance, while the inner query needs only access to the index.
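Something along those lines (the index name is made up):
CREATE INDEX tc_dist_child_idx
    ON transitive_closure (distance, child_node_id);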
Add ONE root node from which all your current root nodes are descended. Then you would simply query the children of your one root. Problem solved.
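With such a super-root in place, the old root nodes are simply its direct children, e.g. (a sketch; the super-root's id is assumed to be known and bound as a parameter):
SELECT child_node_id
FROM transitive_closure
WHERE ancestor_node_id = :super_root_id
  AND distance = 1;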