Postgres Query Based on Previous and Next Rows

I'm trying to solve the bus routing problem in postgresql which requires visibility of previous and next rows. Here is my solution.
Step 1) Have one edges table which represents all the edges (the source and target represent vertices (bus stops):
postgres=# select id, source, target, cost from busedges;
id | source | target | cost
----+--------+--------+------
1 | 1 | 2 | 1
2 | 2 | 3 | 1
3 | 3 | 4 | 1
4 | 4 | 5 | 1
5 | 1 | 7 | 1
6 | 7 | 8 | 1
7 | 1 | 6 | 1
8 | 6 | 8 | 1
9 | 9 | 10 | 1
10 | 10 | 11 | 1
11 | 11 | 12 | 1
12 | 12 | 13 | 1
13 | 9 | 15 | 1
14 | 15 | 16 | 1
15 | 9 | 14 | 1
16 | 14 | 16 | 1
Step 2) Have a table which represents bus details like from time, to time, edge etc.
NOTE: I have used an integer format for the "from" and "to" columns for faster results, as I can do an integer query, but I can replace it with a better format if one is available.
postgres=# select id, "busedgeId", "busId", "from", "to" from busedgetimes;
id | busedgeId | busId | from | to
----+-----------+-------+-------+-------
18 | 1 | 1 | 33000 | 33300
19 | 2 | 1 | 33300 | 33600
20 | 3 | 2 | 33900 | 34200
21 | 4 | 2 | 34200 | 34800
22 | 1 | 3 | 36000 | 36300
23 | 2 | 3 | 36600 | 37200
24 | 3 | 4 | 38400 | 38700
25 | 4 | 4 | 38700 | 39540
Step 3) Use Dijkstra's algorithm to find the shortest path.
Step 4) Get the upcoming buses from the busedgetimes table, earliest first, for the path found in Step 3.
Problem: I am finding it difficult to make the query for the Step 4.
For example: suppose I get the path as edges 2, 3, 4 to travel from source vertex 2 to target vertex 5 in the above records. Getting the first bus for the first edge is not so hard, as I can simply query with "from" < 'expected departure' ORDER BY "from" DESC. But for the second edge, the "from" condition requires the "to" time of the first result row. The query also needs a filter on the edge ids.
How can I achieve this in a single query?

I am not sure if I understood your problem correctly, but getting values from other rows can be done with window functions (https://www.postgresql.org/docs/current/static/tutorial-window.html):
demo: db<>fiddle
SELECT
id,
lag("to") OVER (ORDER BY id) as prev_to,
"from",
"to",
lead("from") OVER (ORDER BY id) as next_from
FROM bustimes;
The lag function moves the value of the previous row into the current one. The lead function does the same with the next row. So you are able to calculate a difference between last arrival and current departure or something like that.
Result:
id   prev_to   from    to      next_from
18             33000   33300   33300
19   33300     33300   33600   33900
20   33600     33900   34200   34200
21   34200     34200   34800   36000
22   34800     36000   36300
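For instance, since "from" and "to" are plain integers here, the waiting time before each departure can be computed directly from the lag value (a small sketch):
SELECT id,
       "from" - lag("to") OVER (ORDER BY id) AS wait_seconds  -- NULL for the first row
FROM bustimes;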
Please notice that "from" and "to" are reserved words in PostgreSQL. It would be better to choose other names.
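As for Step 4 of the original question: one way to chain the legs in a single query is a recursive CTE with LATERAL, where each leg picks the earliest bus on its edge departing no earlier than the previous leg's arrival. This is a sketch only, assuming the path from Step 3 has been materialized in a (hypothetical) table path(seq, "busedgeId") ordered by seq:
WITH RECURSIVE trip AS (
    SELECT p.seq, b."busedgeId", b."busId", b."from", b."to"
    FROM path p
    CROSS JOIN LATERAL (
        SELECT t."busedgeId", t."busId", t."from", t."to"
        FROM busedgetimes t
        WHERE t."busedgeId" = p."busedgeId"
          AND t."from" >= 33000            -- hypothetical expected departure
        ORDER BY t."from"
        LIMIT 1
    ) b
    WHERE p.seq = 1
    UNION ALL
    SELECT p.seq, b."busedgeId", b."busId", b."from", b."to"
    FROM trip
    JOIN path p ON p.seq = trip.seq + 1
    CROSS JOIN LATERAL (
        SELECT t."busedgeId", t."busId", t."from", t."to"
        FROM busedgetimes t
        WHERE t."busedgeId" = p."busedgeId"
          AND t."from" >= trip."to"        -- depart after the previous leg arrives
        ORDER BY t."from"
        LIMIT 1
    ) b
)
SELECT * FROM trip ORDER BY seq;
Each LATERAL subquery is a cheap index lookup given an index on ("busedgeId", "from").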

Related

Generate 'average' column from sub query and ROW_NUMBER window function in SQL SELECT

I have the following SQL Server tables (with sample data):
Questionnaire
id | coachNodeId | youngPersonNodeId | complete
1 | 12 | 678 | 1
2 | 12 | 52 | 1
3 | 30 | 99 | 1
4 | 12 | 678 | 1
5 | 12 | 678 | 1
6 | 30 | 99 | 1
7 | 12 | 52 | 1
8 | 30 | 102 | 1
Answer
id | questionnaireId | score
1 | 1 | 1
2 | 2 | 3
3 | 2 | 2
4 | 2 | 5
5 | 3 | 5
6 | 4 | 5
7 | 4 | 3
8 | 5 | 4
9 | 6 | 1
10 | 6 | 3
11 | 7 | 5
12 | 8 | 5
ContentNode
id | text
12 | Zak
30 | Phil
52 | Jane
99 | Ali
102 | Ed
678 | Chris
I have the following T-SQL query:
SELECT
Questionnaire.id AS questionnaireId,
coachNodeId AS coachNodeId,
coachNode.[text] AS coachName,
youngPersonNodeId AS youngPersonNodeId,
youngPersonNode.[text] AS youngPersonName,
ROW_NUMBER() OVER (PARTITION BY Questionnaire.coachNodeId, Questionnaire.youngPersonNodeId ORDER BY Questionnaire.id) AS questionnaireNumber,
score = (SELECT AVG(score) FROM Answer WHERE Answer.questionnaireId = Questionnaire.id)
FROM
Questionnaire
LEFT JOIN
ContentNode AS coachNode ON Questionnaire.coachNodeId = coachNode.id
LEFT JOIN
ContentNode AS youngPersonNode ON Questionnaire.youngPersonNodeId = youngPersonNode.id
WHERE
(complete = 1)
ORDER BY
coachNodeId, youngPersonNodeId
This query outputs the following example data:
questionnaireId | coachNodeId | coachName | youngPersonNodeId | youngPersonName | questionnaireNumber | score
1 | 12 | Zak | 678 | Chris | 1 | 1
2 | 12 | Zak | 52 | Jane | 1 | 3
3 | 30 | Phil | 99 | Ali | 1 | 5
4 | 12 | Zak | 678 | Chris | 2 | 4
5 | 12 | Zak | 678 | Chris | 3 | 4
6 | 30 | Phil | 99 | Ali | 2 | 2
7 | 12 | Zak | 52 | Jane | 2 | 5
8 | 30 | Phil | 102 | Ed | 1 | 5
To explain what's happening here… There are various coaches whose job is to undertake questionnaires with various young people, and log the scores. A coach might, at a later date, repeat the questionnaire with the same young person several times, hoping that they get a better score. The ultimate goal of what I'm trying to achieve is that the managers of the coaches want to see how well the coaches are performing, so they'd like to see whether the scores for the questionnaires tend to go up or not. The window function represents a way to establish how many times the questionnaire has been undertaken by the same coach/young person combo.
I need to be able to determine the average score based on the questionnaire number. So for example, the coach 'Zak' logged scores of '1' and '3' for his first questionnaires (where questionnaireNumber = 1) so the average would be 2. For his second questionnaires (where questionnaireNumber = 2) the scores were '3' and '5' so the average would be 4. So in analysing this data we know that over time Zak's questionnaire scores have improved from an average of '2' the first time to an average of '4' the second time.
I feel like the query needs to be grouped by the coachNodeId and questionnaireNumber values so it would output something like this (I've omitted the questionnaireId, youngPersonNodeId, youngPersonName and score columns as they aren't crucial for the output — they're only used to derive the averageScore — and wouldn't be useful the way the results are grouped):
coachNodeId | coachName | questionnaireNumber | averageScore
12 | Zak | 1 | 2 (calculation: (1 + 3) / 2)
12 | Zak | 2 | 4 (calculation: (3 + 5) / 2)
12 | Zak | 3 | 4 (only one value: 4)
30 | Phil | 1 | 5 (calculation: (5 + 5) / 2)
30 | Phil | 2 | 2 (only one value: 2)
Could anyone suggest how I can modify my query to output the average scores based on the score from the sub-query and the ROW_NUMBER window function? I've hit the limits of my SQL skills!
Many thanks.
If I understand correctly, you are describing aggregation:
SELECT q.coachNodeId AS coachNodeId,
       cn.[text] AS coachName,
       q.youngPersonNodeId AS youngPersonNodeId,
       ypn.[text] AS youngPersonName,
       AVG(score) AS score
FROM Questionnaire q
JOIN ContentNode cn ON q.coachNodeId = cn.id
JOIN ContentNode ypn ON q.youngPersonNodeId = ypn.id
LEFT JOIN Answer a ON a.questionnaireId = q.id
WHERE complete = 1
GROUP BY q.coachNodeId, cn.[text],
         q.youngPersonNodeId, ypn.[text]
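That aggregates per coach/young-person pair, though. To average per questionnaireNumber, as in the desired output, one option is to keep the original query's window function in a CTE and group on its result (a sketch; note that AVG over int columns truncates in SQL Server, which matches the sample output):
WITH numbered AS (
    SELECT
        q.coachNodeId,
        cn.[text] AS coachName,
        ROW_NUMBER() OVER (PARTITION BY q.coachNodeId, q.youngPersonNodeId
                           ORDER BY q.id) AS questionnaireNumber,
        (SELECT AVG(score) FROM Answer a
         WHERE a.questionnaireId = q.id) AS score
    FROM Questionnaire q
    LEFT JOIN ContentNode cn ON q.coachNodeId = cn.id
    WHERE q.complete = 1
)
SELECT coachNodeId, coachName, questionnaireNumber,
       AVG(score) AS averageScore   -- use AVG(1.0 * score) if decimals are wanted
FROM numbered
GROUP BY coachNodeId, coachName, questionnaireNumber
ORDER BY coachNodeId, questionnaireNumber;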

Select well spread points from a big table

I'm trying to write a stored procedure for selecting X well-spread points in time from a big table.
I have a table points:
"Userid" integer
, "Time" timestamp with time zone
, "Value" integer
It contains hundreds of millions of records, and about a million records per user.
I want to select X points (let's say 50), all well spread between time A and time B. The problem is that the points are not spread equally (one point might be at 6:00:00 and the next after 15 seconds, 20 seconds, or 4 minutes, for example).
Selecting all the points for an id can take up to 60 seconds (because there are about a million points).
Is there any way to select the exact number of points I desire, as well spread as possible, in a fast way?
Sample data:
+----+--------+---------------------+-------+
| id | UserId | Time                | Value |
+----+--------+---------------------+-------+
|  1 |      1 | 2017-04-10 14:00:00 |     1 |
|  2 |      1 | 2017-04-10 14:00:10 |    10 |
|  3 |      1 | 2017-04-10 14:00:20 |    32 |
|  4 |      1 | 2017-04-10 14:00:35 |    80 |
|  5 |      1 | 2017-04-10 14:00:58 |   101 |
|  6 |      1 | 2017-04-10 14:01:00 |   203 |
|  7 |      1 | 2017-04-10 14:01:30 |   204 |
|  8 |      1 | 2017-04-10 14:01:40 |   205 |
|  9 |      1 | 2017-04-10 14:02:02 |    32 |
| 10 |      1 | 2017-04-10 14:02:15 |     7 |
| 11 |      1 | 2017-04-10 14:02:30 |   900 |
| 12 |      1 | 2017-04-10 14:02:45 |    22 |
| 13 |      1 | 2017-04-10 14:03:00 |    34 |
| 14 |      1 | 2017-04-10 14:03:30 |    54 |
| 15 |      1 | 2017-04-10 14:04:00 |    54 |
| 16 |      1 | 2017-04-10 14:06:00 |    60 |
| 17 |      1 | 2017-04-10 14:07:20 |   654 |
| 18 |      1 | 2017-04-10 14:07:40 |    32 |
| 19 |      1 | 2017-04-10 14:08:00 |    33 |
| 20 |      1 | 2017-04-10 14:08:12 |    32 |
| 21 |      1 | 2017-04-10 14:10:00 |     8 |
+----+--------+---------------------+-------+
I want to select 11 "best" points from the list above, for the user with Id 1,
from time 2017-04-10 14:00:00 to 2017-04-10 14:10:00.
Currently it's done on the server, after selecting all the points for the user.
I calculate the "best times" by dividing the time range evenly, getting a list such as 14:00:00, 14:01:00, ..., 14:10:00 (11 "best times", one per requested point). Then I look for the closest point to each "best time" that has not been selected yet.
The result will be points: 1, 6, 9, 13, 15, 16, 17, 18, 19, 20, 21
Edit:
I'm trying something like this:
SELECT * FROM "points"
WHERE "Userid" = 1 AND
(("Time" =
(SELECT "Time" FROM
"points"
ORDER BY abs(extract(epoch from '2017-04-10 14:00:00' - "Time"))
LIMIT 1)) OR
("Time" =
(SELECT "Time" FROM
"points"
ORDER BY abs(extract(epoch from '2017-04-10 14:01:00' - "Time"))
LIMIT 1)) OR
("Time" =
(SELECT "Time" FROM
"points"
ORDER BY abs(extract(epoch from '2017-04-10 14:02:00' - "Time"))
LIMIT 1)))
The problems here are that:
A) It doesn't take into account points that have already been selected.
B) Because of the ORDER BY, each additional time increases the running time of the query by ~1 second, and for 50 points I'm back at the 1-minute mark.
There is an optimization problem behind your question that's hard to solve with just SQL.
That said, your approximation can be implemented in a way that uses an index and shows good performance regardless of table size. You need this index if you don't have it already:
CREATE INDEX ON points ("Userid", "Time");
Query:
SELECT *
FROM generate_series(timestamptz '2017-04-10 14:00:00+0'
, timestamptz '2017-04-10 14:09:00+0' -- 1 min *before* end!
, interval '1 minute') grid(t)
LEFT JOIN LATERAL (
SELECT *
FROM points
WHERE "Userid" = 1
AND "Time" >= grid.t
AND "Time" < grid.t + interval '1 minute' -- same interval
ORDER BY "Time"
LIMIT 1
) t ON true;
dbfiddle here
Most importantly, the rewritten query can use the above index and will be very fast, solving problem B).
It also addresses problem A) to some extent, as no point is returned more than once. If there is no row between two adjacent points in the grid, you get no row in the result. Using LEFT JOIN .. ON true keeps all grid rows and appends NULL in this case. Eliminate those NULL rows by switching to CROSS JOIN; you may get fewer result rows this way.
The query only searches ahead of each grid point. You might append a second LATERAL join to also search behind each grid point (just another index scan), and take the closer of the two results (ignoring NULL). But that introduces problems of its own:
If one match is behind and the next is ahead, the gap widens.
You need special treatment for the lower and/or upper bound of the outer interval.
And you need two LATERAL joins with two index scans.
You could instead use a recursive CTE to search 1 minute ahead of the last time actually found (sketched below), but then the total number of rows returned varies even more.
It all comes down to an exact definition of what you need, and where compromises are allowed.
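For completeness, a sketch of that recursive variant, using the question's user id and time bounds; each step jumps to the first point at least 1 minute after the previously found one, so the row count depends on the data:
WITH RECURSIVE walk AS (
   (SELECT "Userid", "Time", "Value"
    FROM   points
    WHERE  "Userid" = 1
    AND    "Time" >= timestamptz '2017-04-10 14:00:00+0'
    ORDER  BY "Time"
    LIMIT  1)
   UNION ALL
   SELECT p."Userid", p."Time", p."Value"
   FROM   walk w
   CROSS  JOIN LATERAL (
      SELECT "Userid", "Time", "Value"
      FROM   points
      WHERE  "Userid" = 1
      AND    "Time" >= w."Time" + interval '1 minute'   -- ahead of last found point
      AND    "Time" <= timestamptz '2017-04-10 14:10:00+0'
      ORDER  BY "Time"
      LIMIT  1
      ) p
   )
SELECT * FROM walk;
The recursion stops by itself once the LATERAL subquery finds no further point, and each step is a single index lookup on the same index as above.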
Related:
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
Aggregating the most recent joined records per week
MySQL/Postgres query 5 minutes interval data
Optimize GROUP BY query to retrieve latest row per user
This answer uses generate_series('2017-04-10 14:00:00','2017-04-10 14:10:00','1 minute'::interval) and a join for comparison.
To save others time recreating the data set:
t=# create table points(i int,"UserId" int,"Time" timestamp(0), "Value" int,b text);
CREATE TABLE
Time: 13.728 ms
t=# copy points from stdin delimiter '|';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1 | 1 | 2017-04-10 14:00:00 | 1 |
>> 2 | 1 | 2017-04-10 14:00:10 | 10 |
>> 3 | 1 | 2017-04-10 14:00:20 | 32 |
>> 4 | 1 | 2017-04-10 14:00:35 | 80 |
>> 5 | 1 | 2017-04-10 14:00:58 | 101 |
>> 6 | 1 | 2017-04-10 14:01:00 | 203 |
>> 7 | 1 | 2017-04-10 14:01:30 | 204 |
>> 8 | 1 | 2017-04-10 14:01:40 | 205 |
>> 9 | 1 | 2017-04-10 14:02:02 | 32 |
>> 10 | 1 | 2017-04-10 14:02:15 | 7 |
>> 11 | 1 | 2017-04-10 14:02:30 | 900 |
>> 12 | 1 | 2017-04-10 14:02:45 | 22 |
>> 13 | 1 | 2017-04-10 14:03:00 | 34 |
>> 14 | 1 | 2017-04-10 14:03:30 | 54 |
>> 15 | 1 | 2017-04-10 14:04:00 | 54 |
>> 16 | 1 | 2017-04-10 14:06:00 | 60 |
>> 17 | 1 | 2017-04-10 14:07:20 | 654 |
>> 18 | 1 | 2017-04-10 14:07:40 | 32 |
>> 19 | 1 | 2017-04-10 14:08:00 | 33 |
>> 20 | 1 | 2017-04-10 14:08:12 | 32 |
>> 21 | 1 | 2017-04-10 14:10:00 | 8 |
>> \.
COPY 21
Time: 7684.259 ms
t=# alter table points rename column "UserId" to "Userid";
ALTER TABLE
Time: 1.013 ms
Frankly, I don't fully understand the request. This is how I read the description; the results differ from what the OP expects:
t=# with r as (
with g as (
select generate_series('2017-04-10 14:00:00','2017-04-10 14:10:00','1 minute'::interval) s
)
select *,abs(extract(epoch from '2017-04-10 14:02:00' - "Time"))
from g
join points on g.s = date_trunc('minute',"Time")
order by abs
limit 11
)
select i, "Time","Value",abs
from r
order by i;
i | Time | Value | abs
----+---------------------+-------+-----
4 | 2017-04-10 14:00:35 | 80 | 85
5 | 2017-04-10 14:00:58 | 101 | 62
6 | 2017-04-10 14:01:00 | 203 | 60
7 | 2017-04-10 14:01:30 | 204 | 30
8 | 2017-04-10 14:01:40 | 205 | 20
9 | 2017-04-10 14:02:02 | 32 | 2
10 | 2017-04-10 14:02:15 | 7 | 15
11 | 2017-04-10 14:02:30 | 900 | 30
12 | 2017-04-10 14:02:45 | 22 | 45
13 | 2017-04-10 14:03:00 | 34 | 60
14 | 2017-04-10 14:03:30 | 54 | 90
(11 rows)
I added the abs column to show why I thought those rows fit the request better.

SQL: Add values according to index columns only for lines sharing an id

Yesterday I asked this question: SQL: How to add values according to index columns, but I found out that my problem is a bit more complicated:
I have a table like this
id | value | position | relates_to_position | type
19 |   100 |        2 |                NULL |    1
19 |    50 |        6 |                NULL |    2
19 |    20 |        7 |                   6 |    3
20 |    30 |        3 |                NULL |    2
20 |    10 |        4 |                   3 |    3
From this I need to create the resulting table, which adds each line's value into the line whose position matches its relates_to_position, but only for lines sharing the same id!
The resulting table should be
id | value | position | type
19 |   100 |        2 |    1
19 |    70 |        6 |    2
20 |    40 |        3 |    2
I am using Oracle 11. There is only one level of recursion, meaning a line would not refer to a line which has the relates_to_position field set.
I think the following query will do this:
select id, coalesce(relates_to_position, position) as position,
sum(value) as value, min(type) as type
from t
group by id, coalesce(relates_to_position, position);
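To sanity-check this against the sample rows, the data can be inlined via dual (a throwaway verification query; column names taken from the question, assuming they are usable as identifiers as in the original table):
WITH t AS (
  SELECT 19 AS id, 100 AS value, 2 AS position,
         CAST(NULL AS NUMBER) AS relates_to_position, 1 AS type FROM dual
  UNION ALL SELECT 19, 50, 6, NULL, 2 FROM dual
  UNION ALL SELECT 19, 20, 7, 6, 3 FROM dual
  UNION ALL SELECT 20, 30, 3, NULL, 2 FROM dual
  UNION ALL SELECT 20, 10, 4, 3, 3 FROM dual
)
SELECT id, COALESCE(relates_to_position, position) AS position,
       SUM(value) AS value, MIN(type) AS type
FROM t
GROUP BY id, COALESCE(relates_to_position, position)
ORDER BY id, position;
This returns the three rows shown in the expected output.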

Simple psql count query

I am very new to postgresql and would like to generate some summary data from our table.
We have a simple message board: the table messages has a column msg_ctg_uid, and each msg_ctg_uid corresponds to a category name in the table categories.
Here are the categories (select * from categories ORDER BY ctg_uid ASC;):
ctg_uid | ctg_category | ctg_creator_uid
---------+--------------------+-----------------
1 | general | 1
2 | faults | 1
3 | computing | 1
4 | teaching | 2
5 | QIS-FEEDBACK | 3
6 | QIS-PHYS-FEEDBACK | 3
7 | SOP-?-CHANGE | 3
8 | agenda items | 7
10 | Acq & Process | 2
12 | physics-jobs | 3
13 | Tech meeting items | 12
16 | incident-forms | 3
17 | ERRORS | 3
19 | Files | 10
21 | QIS-CAR | 3
22 | doses | 4
24 | admin | 3
25 | audit | 3
26 | For Sale | 4
31 | URGENT-REPORTS | 4
34 | dt-jobs | 3
35 | JOBS | 3
36 | IN-PATIENTS | 4
37 | Ordering | 4
38 | dep-meetings | 4
39 | reporting | 4
What I would like to do, for all messages in our messages table, is count the frequency of each category.
I can do it on a category-by-category basis:
SELECT count(msg_ctg_uid) FROM messages where msg_ctg_uid='13';
However, is it possible to do this in a one-liner?
The following gives the category and ctg_uid for each message:
SELECT ctg_category, msg_ctg_uid FROM messages INNER JOIN categories ON (ctg_uid = msg_ctg_uid);
but SELECT ctg_category, count(msg_ctg_uid) FROM messages INNER JOIN categories ON (ctg_uid = msg_ctg_uid);
gives me the error ERROR: column "categories.ctg_category" must appear in the GROUP BY clause or be used in an aggregate function
How do I aggregate the frequency of each category ?
You're missing the GROUP BY clause:
SELECT ctg_category, count(msg_ctg_uid)
FROM messages INNER JOIN categories ON (ctg_uid = msg_ctg_uid)
GROUP BY ctg_category
This means you want the count per ctg_category.
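As a side note: if categories with no messages should also appear (with a count of 0), the join can be flipped; count(msg_ctg_uid) ignores the NULLs produced for unmatched categories:
SELECT ctg_category, count(msg_ctg_uid)
FROM categories
LEFT JOIN messages ON (msg_ctg_uid = ctg_uid)
GROUP BY ctg_category;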

Oracle CONNECT Operations

I've read through the Oracle documentation concerning the CONNECT operations, but I can't seem to get my head around a database query we have in an existing application. Below is a simplified version of the query.
SELECT LEVEL,
CONNECT_BY_ROOT MY_MONTH MY_LABEL,
b.*
FROM (
SELECT ROWNUM AS ORDERING,
MY_AREA,
TRUNC (THE_MONTH, 'MONTH') AS MY_MONTH
FROM MY_TABLE
ORDER BY MY_AREA, MY_MONTH DESC
) b
WHERE LEVEL <= 3
START WITH 1 = 1
CONNECT BY PRIOR MY_AREA = MY_AREA
AND PRIOR ORDERING = ORDERING - 1
AND PRIOR MY_MONTH <= ADD_MONTHS(MY_MONTH, 6);
While I have a basic understanding of the CONNECT functionalities, this combination has me lost. Can anyone explain what is going on in this query?
I think the end says to get all of the rows that have the same area and a row number 1 less than the current row number and a date before 6 months in the future from the current date. I would guess this would only return 1 row (due to the row number criteria) or 0 rows if the other criteria weren't met. And then maybe the first CONNECT_BY_ROOT says to get that row's MY_MONTH value?
Start with b: a table of MY_AREA (a number?), MY_MONTH (a month-truncated date, i.e. the days are all set to 01), and ROWNUM aliased as ORDERING, whose values are determined by the ORDER BY MY_AREA, MY_MONTH DESC clause, e.g.:
+----------+---------+-----------+
| ORDERING | MY_AREA | MY_MONTH |
+----------+---------+-----------+
| 1 | 10 | 01-SEP-12 |
| 2 | 10 | 01-JAN-12 |
| 3 | 12 | 01-AUG-12 |
| 4 | 12 | 01-JUN-12 |
| 5 | 12 | 01-MAY-12 |
| 6 | 12 | 01-JAN-12 |
| 7 | 12 | 01-JAN-10 |
+----------+---------+-----------+
The WHERE clause doesn't come into play until later, so move on to START WITH, which says only 1 = 1. This means that every row in b will be used in the query; if you had had another condition here, e.g. my_area < 5 or whatever, only a certain set of rows would have been used.
Now, the CONNECT BY, which determines how the hierarchy should be built. This works like a WHERE clause, except for the special PRIOR keyword which tells the DB to look at the previous level in the hierarchy. So:
PRIOR MY_AREA = MY_AREA just means that the child node has to have the same value for MY_AREA.
PRIOR ORDERING = ORDERING - 1 means that the child should come one row after the current node in b's ordering.
PRIOR MY_MONTH <= ADD_MONTHS(MY_MONTH, 6) means that in order to be joined into the hierarchy, the previous MY_MONTH should be 6 months or less after the date of the current node.
The whole hierarchy is then created. LEVEL (special for CONNECT BY...) is set to the level in the hierarchy, CONNECT_BY_ROOT gives the MY_MONTH value for the root of that hierarchy and aliases it to MY_LABEL. After this, the table would look something like the following table. I've added separators for each hierarchy for clarity.
+-------+-----------+----------+---------+-----------+
| LEVEL | MY_LABEL | ORDERING | MY_AREA | MY_MONTH |
+-------+-----------+----------+---------+-----------+
| 1 | 01-SEP-12 | 1 | 10 | 01-SEP-12 |
+-------+-----------+----------+---------+-----------+
| 1 | 01-JAN-12 | 2 | 10 | 01-JAN-12 |
+-------+-----------+----------+---------+-----------+
| 1 | 01-AUG-12 | 3 | 12 | 01-AUG-12 |
| 2 | 01-AUG-12 | 4 | 12 | 01-JUN-12 |
| 3 | 01-AUG-12 | 5 | 12 | 01-MAY-12 |
| 4 | 01-AUG-12 | 6 | 12 | 01-JAN-12 |
+-------+-----------+----------+---------+-----------+
| 1 | 01-JUN-12 | 4 | 12 | 01-JUN-12 |
| 2 | 01-JUN-12 | 5 | 12 | 01-MAY-12 |
| 3 | 01-JUN-12 | 6 | 12 | 01-JAN-12 |
+-------+-----------+----------+---------+-----------+
| 1 | 01-MAY-12 | 5 | 12 | 01-MAY-12 |
| 2 | 01-MAY-12 | 6 | 12 | 01-JAN-12 |
+-------+-----------+----------+---------+-----------+
| 1 | 01-JAN-12 | 6 | 12 | 01-JAN-12 |
+-------+-----------+----------+---------+-----------+
| 1 | 01-JAN-10 | 7 | 12 | 01-JAN-10 |
+-------+-----------+----------+---------+-----------+
So, as you can see, each of the rows appears at the top of its own hierarchy, with all nodes meeting the CONNECT BY criteria under it.
Finally, the WHERE clause is applied; this chops off all of the levels > 3 in every hierarchy, so you're left with a maximum of 3 levels. This affects only one row in the middle hierarchy, the one with LEVEL = 4.
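To see the mechanics in isolation, here is a minimal, self-contained version of the same pattern (hypothetical three-row data; every row roots its own hierarchy and each child is the next row in the ordering):
SELECT LEVEL, CONNECT_BY_ROOT n root_n, n
FROM (SELECT 1 AS grp, 1 AS n FROM dual
      UNION ALL SELECT 1, 2 FROM dual
      UNION ALL SELECT 1, 3 FROM dual)
WHERE LEVEL <= 2
START WITH 1 = 1
CONNECT BY PRIOR grp = grp AND PRIOR n = n - 1;
With the LEVEL <= 2 cutoff this returns (LEVEL, root_n, n) = (1,1,1), (2,1,2), (1,2,2), (2,2,3), (1,3,3): each row appears as a root, plus at most one child level, just as in the larger query above.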