My context is PostgreSQL 8.3
I need to speed up this query as both tables have millions of records.
For each row in table Calls, there are two rows in table Trunks. For every call_id, I want to copy the value of trunks.trunk into calls.orig_trunk for the row with the lower trunk_id, and into calls.dest_trunk for the row with the higher trunk_id.
initial content of Table Calls:
Call_ID | dialed_number | orig_trunk | dest_trunk
--------|---------------|------------|-----------
1 | 5145551212 | null | null
2 | 8883331212 | null | null
3 | 4164541212 | null | null
Table Trunks:
Call_ID | trunk_id | trunk
--------|----------|-------
1 | 1 | 116
1 | 2 | 9
2 | 3 | 168
2 | 4 | 3
3 | 5 | 124
3 | 6 | 9
final content of Table Calls:
Call_ID | dialed_number | orig_trunk | dest_trunk
--------|---------------|------------|-----------
1 | 5145551212 | 116 | 9
2 | 8883331212 | 168 | 3
3 | 4164541212 | 124 | 9
I have created an index on every column.
update calls set orig_trunk = t2.trunk
from ( select call_id, trunk_id, trunk from trunks
order by trunk_id ASC ) as t2
where (calls.call_id=t2.call_id );
update calls set dest_trunk = t2.trunk
from ( select call_id, trunk_id, trunk from trunks
order by trunk_id DESC ) as t2
where (calls.call_id=t2.call_id );
Any ideas?
This is the final code, with the test conditions left in as comments.
The subquery is very efficient and fast. However, testing revealed that partitioning the table has a greater impact on execution time than the efficiency of the subquery: on a table of 1 million rows the update takes 80 seconds; on a table of 12 million rows it takes 580 seconds.
update calls1900 set orig_trunk = a.orig_trunk, dest_trunk = a.dest_trunk
from (select
x.call_id,
t1.trunk as orig_trunk, t2.trunk as dest_trunk
from (select calls1900.call_id
,min(t.trunk_id) as orig_trunk_id
,max(t.trunk_id) as dest_trunk_id
from calls1900
join trunks t on (t.call_id = calls1900.call_id)
-- where calls1900.call_id between 43798930 and 43798950
group by calls1900.call_id
) x
join trunks t1 on (t1.trunk_id = x.orig_trunk_id)
join trunks t2 on (t2.trunk_id = x.dest_trunk_id)
) a
where (calls1900.call_id = a.call_id); -- and (calls1900.call_id between 43798930 and 43798950)
From the example posted, it looks like many unnecessary updates are being performed. Here is an example of a query to get the results you are looking for (note that window functions require PostgreSQL 8.4 or later, so this first option would need an upgrade from 8.3):
select distinct c.call_id, c.dialed_number
,first_value(t.trunk) over w as orig_trunk
,last_value(t.trunk) over w as dest_trunk
from calls c
join trunks t on (t.call_id = c.call_id)
window w as (partition by c.call_id
order by trunk_id
range between unbounded preceding
and unbounded following
)
There are other ways to do it without the analytic function, for example:
select x.call_id
,x.dialed_number
,t1.trunk as orig_trunk
,t2.trunk as dest_trunk
from (select c.call_id, c.dialed_number
,min(t.trunk_id) as orig_trunk_id
,max(t.trunk_id) as dest_trunk_id
from calls c
join trunks t on (t.call_id = c.call_id)
group by c.call_id, c.dialed_number
) x
join trunks t1 on (t1.trunk_id = x.orig_trunk_id)
join trunks t2 on (t2.trunk_id = x.dest_trunk_id)
Experiment to see what works best in your situation. You will probably want indexes on the joining columns.
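For example (the index names are illustrative; adjust to your own schema):
CREATE INDEX trunks_call_id_idx ON trunks (call_id);
CREATE INDEX trunks_trunk_id_idx ON trunks (trunk_id);
CREATE INDEX calls_call_id_idx ON calls (call_id);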
What to do with the result set depends on the nature of the application. Is this a one-off? Then why not just create a new table from the result set:
CREATE TABLE trunk_summary AS
SELECT ...
Is it constantly changing? Is it frequently accessed? Is it sufficient to just create a view? Or maybe an update is to be performed based on the result set. Maybe a range can be updated at a time. It really depends, but this might give a start.
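For the one-off case, here is a sketch combining CREATE TABLE ... AS with the aggregation query above (trunk_summary is an illustrative name):
CREATE TABLE trunk_summary AS
select x.call_id
      ,x.dialed_number
      ,t1.trunk as orig_trunk
      ,t2.trunk as dest_trunk
from (select c.call_id, c.dialed_number
            ,min(t.trunk_id) as orig_trunk_id
            ,max(t.trunk_id) as dest_trunk_id
      from calls c
      join trunks t on (t.call_id = c.call_id)
      group by c.call_id, c.dialed_number
     ) x
join trunks t1 on (t1.trunk_id = x.orig_trunk_id)
join trunks t2 on (t2.trunk_id = x.dest_trunk_id);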
Related
I have a table like this:
ID | Flag
-----------
1 | True
1 | True
1 | NULL
1 | True
1 | NULL
2 | False
2 | False
2 | False
2 | NULL
2 | NULL
And I want an output like this:
ID | Flag
-----------
1 | True
1 | True
1 | True
1 | True
1 | True
2 | False
2 | False
2 | False
2 | False
2 | False
I want to replace nulls with the value assigned in different records. Is there a way to do it in a single update statement?
One option uses a correlated subquery:
update mytable t
set flag = (select bool_or(flag) from mytable t1 where t1.id = t.id)
Demo on DB Fiddle:
id | flag
---|-----
1 | t
1 | t
1 | t
1 | t
1 | t
2 | f
2 | f
2 | f
2 | f
2 | f
You can also use exists:
update t
set flag = exists (select 1 from t t2 where t2.id = t.id and t2.flag);
The advantage of exists over a subquery with aggregation is performance: the query can stop at the first row where flag is true. This is a simple index lookup on an index on (id, flag).
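For reference, the supporting index would look something like this (using the mytable name from the first statement; the index name is illustrative):
create index mytable_id_flag_idx on mytable (id, flag);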
Performance can be improved further by limiting the number of rows being updated. That actually suggests two separate statements:
update t
set flag = true
where (flag is null or not flag) and
exists (select 1 from t t2 where t2.id = t.id and t2.flag);
update t
set flag = false
where (flag is null or flag) and
not exists (select 1 from t t2 where t2.id = t.id and t2.flag);
These could be combined into a single (more complicated) statement, but the sets being updated are disjoint. This limits the updates to the rows that need to be updated, as well as limiting the subquery to a simple lookup (assuming an index on (id, flag)).
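For illustration only, here is a sketch of what the combined single-statement form could look like; the two-statement version above is likely both clearer and faster:
update t
set flag = exists (select 1 from t t2 where t2.id = t.id and t2.flag)
where flag is distinct from
      (exists (select 1 from t t2 where t2.id = t.id and t2.flag));
The IS DISTINCT FROM test plays the same role as the WHERE clauses above: it skips rows whose flag already has the target value, avoiding no-op updates.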
The answers provided satisfy your sample data, but may still leave you short of a satisfactory answer, because your sample data is missing a couple of significant cases. What happens if you had the following, either instead of or in addition to your current sample data?
+----+-------+
| id | flag |
+----+-------+
| 3 | true |
| 3 | false |
| 3 | null |
| 4 | null |
| 4 | null |
+----+-------+
The answer could be significantly different.
Assuming (like your sample data suggests):
There can never be the same id with true and false in the set. Else, you'd have to define what to do.
null values remain unchanged if there is no non-null value for the same id.
This should give you best performance:
UPDATE tbl t
SET flag = t1.flag
FROM (
SELECT DISTINCT ON (id)
id, flag
FROM tbl
ORDER BY id, flag
) t1 -- avoid repeated computation for same id
WHERE t.id = t1.id
AND t.flag IS NULL -- avoid costly no-op updates
AND t1.flag IS NOT NULL; -- avoid costly no-op updates
db<>fiddle here
The subquery t1 distills target values per id once.
SELECT DISTINCT ON (id)
id, flag
FROM tbl
ORDER BY id, flag;
Since null sorts last, it effectively grabs the first non-null value per id. false sorts before true, but that has no bearing on the case as there can never be both for the same id. See:
Sort NULL values to the end of a table
Select first row in each GROUP BY group?
If you have many rows per id, there are faster techniques:
Optimize GROUP BY query to retrieve latest row per user
The added conditions in the outer query prevent all no-op updates from happening, thus avoiding major cost. Only rows are updated where a null value actually changes. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
I am performing some queries using PostgreSQL's SELECT DISTINCT ON syntax, and I would like the query to return the total number of rows alongside every result row.
Assume I have a table my_table like the following:
CREATE TABLE my_table(
id int,
my_field text,
id_reference bigint
);
I then have a couple of values:
id | my_field | id_reference
----+----------+--------------
1 | a | 1
1 | b | 2
2 | a | 3
2 | c | 4
3 | x | 5
Basically my_table contains versioned data. The id_reference is a reference to a global version of the database. Every change to the database increases the global version number; changes always add new rows (instead of updating or deleting values) and record the new version number.
My goal is to perform a query that will only retrieve the latest values in the table, alongside with the total number of rows.
For example, in the above case I would like to retrieve the following output:
| total | id | my_field | id_reference |
+-------+----+----------+--------------+
| 3 | 1 | b | 2 |
+-------+----+----------+--------------+
| 3 | 2 | c | 4 |
+-------+----+----------+--------------+
| 3 | 3 | x | 5 |
+-------+----+----------+--------------+
My attempt is the following:
select distinct on (id)
count(*) over () as total,
*
from my_table
order by id, id_reference desc
This returns almost the correct output, except that total is the number of rows in my_table instead of the number of rows returned by the query:
total | id | my_field | id_reference
-------+----+----------+--------------
5 | 1 | b | 2
5 | 2 | c | 4
5 | 3 | x | 5
(3 rows)
As you can see it has 5 instead of the expected 3.
I can fix this by using a subquery (here a CTE) and counting over its result:
with my_values as (
select distinct on (id)
*
from my_table
order by id, id_reference desc
)
select count(*) over (), * from my_values
Which produces my expected output.
My question: is there a way to avoid using this subquery and have something similar to count(*) over () return the result I want?
You are looking at my_table 3 ways:
to find the latest id_reference for each id
to find my_field for the latest id_reference for each id
to count the distinct number of ids in the table
I therefore prefer this solution:
select
c.id_count as total,
a.id,
a.my_field,
b.max_id_reference
from
my_table a
join
(
select
id,
max(id_reference) as max_id_reference
from
my_table
group by
id
) b
on
a.id = b.id and
a.id_reference = b.max_id_reference
join
(
select
count(distinct id) as id_count
from
my_table
) c
on true;
This is a bit longer (especially the long, thin way I write SQL) but it makes it clear what is happening. If you come back to it in a few months' time (somebody usually does) then it will take less time to understand what is going on.
The "on true" at the end is a deliberate Cartesian product, because there can only ever be exactly one result row from the subquery "c", and you do want a Cartesian product with that.
There is nothing necessarily wrong with subqueries.
I have an "insert only" database, wherein records aren't physically updated, but rather logically updated by adding a new record, with a CRUD value, carrying a larger sequence. In this case, the "seq" (sequence) column is more in line with what you may consider a primary key, but the "id" is the logical identifier for the record. In the example below,
This is the physical representation of the table:
seq | id | name  | CRUD
----|----|-------|-----
 1  | 10 | john  | C
 2  | 10 | joe   | U
 3  | 11 | kent  | C
 4  | 12 | katie | C
 5  | 12 | sue   | U
 6  | 13 | jill  | C
 7  | 14 | bill  | C
This is the logical representation of the table, considering the "most recent" records:
seq | id | name | CRUD
----|----|------|-----
 2  | 10 | joe  | U
 3  | 11 | kent | C
 5  | 12 | sue  | U
 6  | 13 | jill | C
 7  | 14 | bill | C
In order to, for instance, retrieve the most recent record for the person with id=12, I would currently do something like this:
SELECT
*
FROM
PEOPLE P
WHERE
P.ID = 12
AND
P.SEQ = (
SELECT
MAX(P1.SEQ)
FROM
PEOPLE P1
WHERE P1.ID = P.ID
)
...and I would receive this row:
seq | id | name | CRUD
----|----|------|-----
 5  | 12 | sue  | U
What I'd rather do is something like this:
WITH
NEW_P
AS
(
--CTE representing all of the most recent records
--i.e. for any given id, the most recent sequence
)
SELECT
*
FROM
NEW_P P2
WHERE
P2.ID = 12
The first SQL example using the subquery already works for us.
Question: How can I leverage a CTE to simplify my predicates when I need the "most recent" logical view of the table? In essence, I don't want to inline a subquery every single time I want to get at the most recent record. I'd rather define a CTE and reference it in any subsequent predicate.
P.S. While I'm currently using DB2, I'm looking for a solution that is database agnostic.
This is a clear case for window (or OLAP) functions, which are supported by all modern SQL databases. For example:
WITH
ORD_P
AS
(
SELECT p.*, ROW_NUMBER() OVER ( PARTITION BY id ORDER BY seq DESC) rn
FROM people p
)
,
NEW_P
AS
(
SELECT * from ORD_P
WHERE rn = 1
)
SELECT
*
FROM
NEW_P P2
WHERE
P2.ID = 12
PS. Not tested. You may need to explicitly list all columns in the CTE clauses.
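If it turns out the optimizer does not push the P2.ID = 12 predicate down into the CTE, a variant that filters early keeps the window sort small (again a sketch; whether this is necessary depends on the database):
WITH
ORD_P
AS
(
SELECT p.*, ROW_NUMBER() OVER ( PARTITION BY id ORDER BY seq DESC) rn
FROM people p
WHERE p.id = 12
)
SELECT * FROM ORD_P WHERE rn = 1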
I guess you already put it together. First find the max seq associated with each id, then use that to join back to the main table:
WITH newp AS (
SELECT id, MAX(seq) AS latestseq
FROM people
GROUP BY id
)
SELECT p.*
FROM people p
JOIN newp n ON (n.latestseq = p.seq)
ORDER BY p.id
What you originally had would work, as would moving the CTE into the FROM clause (see the sketch below). Maybe you want to use a timestamp field rather than a sequence number for the ordering?
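For instance, the same query with the CTE moved into the FROM clause as a derived table:
SELECT p.*
FROM people p
JOIN (SELECT id, MAX(seq) AS latestseq
      FROM people
      GROUP BY id) n ON (n.latestseq = p.seq)
ORDER BY p.id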
Following up from @Glenn's answer, here is an updated query which meets my original goal and is on par with @mustaccio's answer, but I'm still not sure what the performance (and other) implications of this approach vs. the other are.
WITH
LATEST_PERSON_SEQS AS
(
SELECT
ID,
MAX(SEQ) AS LATEST_SEQ
FROM
PERSON
GROUP BY
ID
)
,
LATEST_PERSON AS
(
SELECT
P.*
FROM
PERSON P
JOIN
LATEST_PERSON_SEQS L
ON
(L.LATEST_SEQ = P.SEQ)
)
SELECT
*
FROM
LATEST_PERSON L2
WHERE
L2.ID = 12
I have a table ('sales') in a MySQL DB which should rightfully have had a unique constraint enforced to prevent duplicates. Removing the dupes first and then setting the constraint is proving a bit tricky.
Table structure (simplified):
id (unique, autoinc)
product_id
The goal is to enforce uniqueness for product_id. The de-duping policy I want to apply is to remove all duplicate records except the most recently created, i.e. the one with the highest id.
Or to put it another way, I would like to delete only duplicate records, excluding the ids matched by the following query, whilst also preserving the existing non-duped records:
select id
from sales s
inner join (select product_id,
max(id) as maxId
from sales
group by product_id
having count(product_id) > 1) groupedByProdId on s.product_id = groupedByProdId.product_id
and s.id = groupedByProdId.maxId
I've struggled with this on two fronts: writing the query to select the correct records to delete, and also the MySQL restriction that a subselect in the FROM clause of a DELETE cannot reference the same table from which data is being removed.
I checked out this answer and it seemed to deal with the subject, but seems specific to SQL Server, though I wouldn't rule out this question duplicating another.
In reply to your comment, here's a query that works in MySQL:
delete YourTable
from YourTable
inner join YourTable yt2
on YourTable.product_id = yt2.product_id
and YourTable.id < yt2.id
This only removes duplicate rows: the join condition never matches the row with the highest id for each product, so the latest row survives, and rows for products with no duplicates are left untouched.
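To be safe, you can run the same join as a SELECT first to preview exactly which rows would be deleted:
select YourTable.*
from YourTable
inner join YourTable yt2
on YourTable.product_id = yt2.product_id
and YourTable.id < yt2.id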
P.S. If you try to alias the table after FROM, MySQL requires you to specify the name of the database, like:
delete <DatabaseName>.yt
from YourTable yt
inner join YourTable yt2
on yt.product_id = yt2.product_id
and yt.id < yt2.id;
Perhaps use ALTER IGNORE TABLE ... ADD UNIQUE KEY.
For example:
describe sales;
+------------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| product_id | int(11) | NO | | NULL | |
+------------+---------+------+-----+---------+----------------+
select * from sales;
+----+------------+
| id | product_id |
+----+------------+
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 3 |
| 5 | 3 |
| 6 | 2 |
+----+------------+
ALTER IGNORE TABLE sales ADD UNIQUE KEY idx1(product_id), ORDER BY id DESC;
Query OK, 6 rows affected (0.03 sec)
Records: 6 Duplicates: 3 Warnings: 0
select * from sales;
+----+------------+
| id | product_id |
+----+------------+
| 6 | 2 |
| 5 | 3 |
| 2 | 1 |
+----+------------+
See this pythian post for more information.
Note that the ids end up in reverse order. I don't think this matters, since the order of the ids should not matter in a database (as far as I know!). If this displeases you, the post linked above shows a way to solve that problem too; however, it involves creating a temporary table, which requires more disk space than the in-place method posted above.
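For reference, the temporary-table route is roughly along these lines (a sketch, not the linked post's exact statements):
CREATE TABLE sales_dedup LIKE sales;
INSERT INTO sales_dedup
SELECT s.*
FROM sales s
JOIN (SELECT product_id, MAX(id) AS id
      FROM sales
      GROUP BY product_id) m ON m.id = s.id;
RENAME TABLE sales TO sales_old, sales_dedup TO sales;
ALTER TABLE sales ADD UNIQUE KEY idx1(product_id);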
I might do the following in sql-server to eliminate the duplicates:
DELETE FROM Sales
FROM Sales
INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id
It looks like the analogous delete statement for mysql might be:
DELETE FROM Sales
USING Sales
INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id
This type of problem is easier to solve with CTEs and ranking functions; however, you should be able to do something like the following to solve your problem:
Delete Sales
Where Exists(
Select 1
From Sales As S2
Where S2.product_id = Sales.product_id
And S2.id > Sales.Id
Having Count(*) > 0
)
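For comparison, the CTE/ranking version alluded to above would look something like this in SQL Server (not valid in MySQL 5.x, which lacks both features):
With Ranked As
(
Select Row_Number() Over(Partition By product_id Order By id Desc) As rn
From Sales
)
Delete From Ranked
Where rn > 1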
In MySQL, I want to select the bottom 2 items from each category:
Category Value
1 1.3
1 4.8
1 3.7
1 1.6
2 9.5
2 9.9
2 9.2
2 10.3
3 4
3 8
3 16
Giving me:
Category Value
1 1.3
1 1.6
2 9.5
2 9.2
3 4
3 8
Before I migrated from SQLite3, I had to first select the lowest value from each category, then, excluding anything that joined to that, select the lowest from each category again. Anything in a category equal to or less than that new lowest won. This would also pick more than 2 rows in case of a tie, which was annoying... It also had a really long runtime.
My ultimate goal is to count the number of times an individual is in one of the lowest 2 of a category (there is also a name field) and this is the one part I don't know how to do.
Thanks
SELECT c1.category, c1.value
FROM catvals c1
LEFT OUTER JOIN catvals c2
ON (c1.category = c2.category AND c1.value > c2.value)
GROUP BY c1.category, c1.value
HAVING COUNT(*) < 2;
Tested on MySQL 5.1.41 with your test data. Output:
+----------+-------+
| category | value |
+----------+-------+
| 1 | 1.30 |
| 1 | 1.60 |
| 2 | 9.20 |
| 2 | 9.50 |
| 3 | 4.00 |
| 3 | 8.00 |
+----------+-------+
(The extra decimal places are because I declared the value column as NUMERIC(9,2).)
Like other solutions, this produces more than 2 rows per category if there are ties. There are ways to construct the join condition to resolve that, but we'd need to use a primary key or unique key in your table, and we'd also have to know how you intend ties to be resolved.
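For the ultimate goal of counting how often each individual lands in the bottom two, this query can be wrapped (a sketch that assumes the table also has a name column, as mentioned in the question):
SELECT name, COUNT(*) AS times_in_bottom_two
FROM (SELECT c1.name, c1.category, c1.value
      FROM catvals c1
      LEFT OUTER JOIN catvals c2
        ON (c1.category = c2.category AND c1.value > c2.value)
      GROUP BY c1.name, c1.category, c1.value
      HAVING COUNT(*) < 2) bottom_two
GROUP BY name;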
You could try this:
SELECT * FROM (
SELECT c.*,
(SELECT COUNT(*)
FROM user_category c2
WHERE c2.category = c.category
AND c2.value < c.value) cnt
FROM user_category c ) uc
WHERE cnt < 2
It should give you the desired results, but check if performance is ok.
Here's a solution that handles duplicates properly. The table name is 'zzz' and the columns are an int and a float.
select
smallest.category category, min(smallest.value) value
from
zzz smallest
group by smallest.category
union
select
second_smallest.category category, min(second_smallest.value) value
from
zzz second_smallest
where
concat(second_smallest.category,'x',second_smallest.value)
not in ( -- recreate the results from the first half of the union
select concat(c.category,'x',min(c.value))
from zzz c
group by c.category
)
group by second_smallest.category
order by category
Caveats:
If there is only one value for a given category, then only that single entry is returned.
If there was a unique recordID for each row you wouldn't need all the concats to simulate a unique key.
Your mileage may vary,
--Mark
A union should work. I'm not sure of the performance compared to Peter's solution.
SELECT smallest.category, MIN(smallest.value)
FROM categories smallest
GROUP BY smallest.category
UNION
SELECT second_smallest.category, MIN(second_smallest.value)
FROM categories second_smallest
WHERE second_smallest.value > (SELECT MIN(smallest.value) FROM categories smallest WHERE smallest.category = second_smallest.category)
GROUP BY second_smallest.category
Here is a very generalized solution that will work for selecting the first n rows for each Category. It works even if there are duplicates in value.
/* creating temporary variables */
mysql> set @cnt = 0;
mysql> set @trk = 0;
/* query */
mysql> select Category, Value
from (select *,
@cnt:=if(@trk = Category, @cnt+1, 0) cnt,
@trk:=Category
from user_categories
order by Category, Value ) c1
where c1.cnt < 2;
Here is the result.
+----------+-------+
| Category | Value |
+----------+-------+
| 1 | 1.3 |
| 1 | 1.6 |
| 2 | 9.2 |
| 2 | 9.5 |
| 3 | 4 |
| 3 | 8 |
+----------+-------+
This is tested on MySQL 5.0.88
Note that the initial value of the @trk variable must not equal the first Category value, otherwise the count for that first group would start at 1 instead of 0.