An application of mine is trying to execute a count(*) query which returns after about 30 minutes. What's strange is that the query is very simple and the tables involved are large, but not gigantic (10,000 and 50,000 records).
The query which takes 30 minutes is:
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.BATCH_ID = 1 and g.ENABLED = 'Y'
The database schema is essentially:
create table BATCH (
BATCH_ID int not null,
[other columns]...,
CONSTRAINT PK_BATCH PRIMARY KEY (BATCH_ID)
);
create table GROUP (
GROUP_ID int not null,
BATCH_ID int,
ENABLED char(1) not null,
[other columns]...,
CONSTRAINT PK_GROUP PRIMARY KEY (GROUP_ID),
CONSTRAINT FK_GROUP_BATCH_ID FOREIGN KEY (BATCH_ID)
REFERENCES BATCH (BATCH_ID),
CONSTRAINT CHK_GROUP_ENABLED CHECK(ENABLED in ('Y', 'N'))
);
create table RECORD (
GROUP_ID int not null,
RECORD_NUMBER int not null,
[other columns]...,
CONSTRAINT PK_RECORD PRIMARY KEY (GROUP_ID, RECORD_NUMBER),
CONSTRAINT FK_RECORD_GROUP_ID FOREIGN KEY (GROUP_ID)
REFERENCES GROUP (GROUP_ID)
);
create index IDX_GROUP_BATCH_ID on GROUP(BATCH_ID);
I checked whether there are any blocks in the database and there are none. I also ran the following pieces of the query and all except the last two returned instantly:
select count(*) from RECORD -- 55,501
select count(*) from GROUP -- 11,693
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
-- 55,501
select count(*)
from GROUP g
where g.BATCH_ID = 1 and g.ENABLED = 'Y'
-- 3,112
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.BATCH_ID = 1
-- 27,742 - took around 5 minutes to run
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.ENABLED = 'Y'
-- 51,749 - took around 5 minutes to run
Can someone explain what's going on? How can I improve the query's performance? Thanks.
A coworker figured out the issue. It's because the table statistics weren't being updated and the last time the table was analyzed was a couple of months ago (when the table was essentially empty). I ran analyze table RECORD compute statistics and now the query is returning in less than a second.
I'll have to talk to the DBA about why the table statistics weren't being updated.
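As a rough analogue of the fix above, here is a minimal sketch using Python's sqlite3 module rather than Oracle. The lowercase names grp and record are hypothetical stand-ins for GROUP and RECORD, the data is made up, and SQLite's ANALYZE is only a loose counterpart of Oracle's analyze table. It shows the same situation: the planner has no statistics until ANALYZE is run after the data is loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE grp (
        group_id INTEGER PRIMARY KEY,
        batch_id INTEGER,
        enabled  TEXT NOT NULL CHECK (enabled IN ('Y', 'N'))
    );
    CREATE TABLE record (
        group_id      INTEGER NOT NULL REFERENCES grp (group_id),
        record_number INTEGER NOT NULL,
        PRIMARY KEY (group_id, record_number)
    );
    CREATE INDEX idx_grp_batch_id ON grp (batch_id);
""")

# Data arrives after the tables were created, which is how statistics go stale.
conn.executemany(
    "INSERT INTO grp VALUES (?, ?, ?)",
    [(g, g % 4, "Y" if g % 10 else "N") for g in range(1, 201)],
)
conn.executemany(
    "INSERT INTO record VALUES (?, ?)",
    [(g, n) for g in range(1, 201) for n in range(5)],
)


def table_stats(connection):
    """Return {(table, index): stat} from sqlite_stat1, or {} if ANALYZE never ran."""
    exists = connection.execute(
        "SELECT 1 FROM sqlite_master WHERE name = 'sqlite_stat1'"
    ).fetchone()
    if exists is None:
        return {}
    cursor = connection.execute("SELECT tbl, idx, stat FROM sqlite_stat1")
    return {(tbl, idx): stat for tbl, idx, stat in cursor}


stats_before = table_stats(conn)   # no statistics gathered yet
conn.execute("ANALYZE")            # gather statistics for the planner
stats_after = table_stats(conn)

matching = conn.execute("""
    SELECT COUNT(*)
    FROM record r INNER JOIN grp g ON g.group_id = r.group_id
    WHERE g.batch_id = 1 AND g.enabled = 'Y'
""").fetchone()[0]

print(stats_before, stats_after, matching)
```

The count itself does not change; only the information the planner can use to pick a join order does.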
SELECT COUNT(*)
FROM RECORD R
LEFT OUTER JOIN GROUP G ON G.GROUP_ID = R.GROUP_ID
AND G.BATCH_ID = 1
AND G.ENABLED = 'Y'
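Try that and let me know how it turns out. I'm not saying this IS the answer; I don't have access to a DB right now, so I can't test it. One caveat: with a LEFT OUTER JOIN and no WHERE clause, the count includes every RECORD row whether or not it matches a GROUP, so change it to an INNER JOIN if you need the same result as your original query. Hope it works for ya.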
An explain plan would be a good place to start.
See "Strange speed changes with sql query" for how to use the explain plan syntax (and the query to see the result).
If that doesn't show anything suspicious, you'll probably want to look at a trace.
Related
I have two tables: "tracks" (the header of a track) and "track_points" (the points in a track).
Schema:
CREATE TABLE tracks(
id INTEGER PRIMARY KEY ASC,
start_time TEXT NOT NULL
);
CREATE TABLE track_points (
id INTEGER PRIMARY KEY AUTOINCREMENT,
data BLOB,
track_id INTEGER NOT NULL,
FOREIGN KEY(track_id) REFERENCES tracks(id) ON DELETE CASCADE
);
CREATE INDEX track_id_idx ON track_points (track_id);
CREATE INDEX start_time_idx ON tracks (start_time);
I want to delete all "tracks" that have 0 or 1 points.
Note that a track with 0 points has no rows in "track_points".
I wrote this query:
DELETE FROM tracks WHERE tracks.id IN
(SELECT track_id FROM
(SELECT tracks.id as track_id, COUNT(track_points.id) as track_len FROM tracks
LEFT JOIN track_points ON tracks.id=track_points.track_id GROUP BY tracks.id)
WHERE track_len<=1)
It seems to work, but I wonder whether it is possible to optimize this query.
I mean its run time (currently 10 seconds on a big table on my machine).
Or maybe the SQL can be simplified (without making it slower, of course)?
You can simplify your code by removing 1 level of your subqueries, because you can achieve the same with a HAVING clause instead of an outer WHERE clause:
DELETE FROM tracks
WHERE id IN (
SELECT t.id
FROM tracks t LEFT JOIN track_points p
ON t.id = p.track_id
GROUP BY t.id
HAVING COUNT(p.id) <= 1
);
The above code may not make any difference, but it's simpler.
The same logic could also be applied by using EXCEPT:
DELETE FROM tracks
WHERE id IN (
SELECT id FROM tracks
EXCEPT
SELECT track_id
FROM track_points
GROUP BY track_id
HAVING COUNT(*) > 1
);
What you can try is a query that does not involve the join of the 2 tables.
Aggregate only in track_points and get the track_ids with 2 or more occurrences. Then delete all the rows from tracks with ids that are not included in the result of the previous query:
DELETE FROM tracks
WHERE id NOT IN (
SELECT track_id
FROM track_points
GROUP BY track_id
HAVING COUNT(*) > 1
);
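To check that the three DELETE variants above agree, here is a small sketch using Python's sqlite3 module with the schema from the question. The sample rows are made up: track 1 has no points, track 2 has one, track 3 has two, and track 4 has three. It does not measure speed, only that each statement removes the same tracks.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tracks (
    id INTEGER PRIMARY KEY ASC,
    start_time TEXT NOT NULL
);
CREATE TABLE track_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    data BLOB,
    track_id INTEGER NOT NULL,
    FOREIGN KEY (track_id) REFERENCES tracks (id) ON DELETE CASCADE
);
CREATE INDEX track_id_idx ON track_points (track_id);
"""

DELETES = {
    "having": """
        DELETE FROM tracks WHERE id IN (
            SELECT t.id
            FROM tracks t LEFT JOIN track_points p ON t.id = p.track_id
            GROUP BY t.id
            HAVING COUNT(p.id) <= 1)""",
    "except": """
        DELETE FROM tracks WHERE id IN (
            SELECT id FROM tracks
            EXCEPT
            SELECT track_id FROM track_points
            GROUP BY track_id HAVING COUNT(*) > 1)""",
    "not_in": """
        DELETE FROM tracks WHERE id NOT IN (
            SELECT track_id FROM track_points
            GROUP BY track_id HAVING COUNT(*) > 1)""",
}


def surviving_tracks(delete_sql):
    """Build a fresh sample database, run one DELETE, return the remaining track ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO tracks (id, start_time) VALUES (?, ?)",
        [(i, f"2020-01-0{i}") for i in range(1, 5)],
    )
    conn.executemany(
        "INSERT INTO track_points (track_id) VALUES (?)",
        [(2,), (3,), (3,), (4,), (4,), (4,)],
    )
    conn.execute(delete_sql)
    remaining = [row[0] for row in conn.execute("SELECT id FROM tracks ORDER BY id")]
    conn.close()
    return remaining


results = {name: surviving_tracks(sql) for name, sql in DELETES.items()}
print(results)
```

The NOT IN form is only safe here because track_id is declared NOT NULL; a NULL in the subquery result would make NOT IN match nothing.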
In Oracle SQL, imagine I have a table of game developers and a table of products sold by a game store. I am trying to select only the game developers that have fewer than 10 products available at the game store.
For the sake of this, I will call the tables 'Developers' and 'Games'. Developers contains a PK of DEV_ID which will serve as the FK of GAME_DEV in Games.
CREATE TABLE Developers (
DEV_ID varchar(5) NOT NULL PRIMARY KEY,
DEV_NAME varchar(20) NOT NULL);
CREATE TABLE Games (
GAME_ID varchar(5) NOT NULL PRIMARY KEY,
GAME_NAME varchar(20) NOT NULL,
GAME_PRICE varchar(10) NOT NULL,
GAME_DEV varchar(5) NOT NULL,
CONSTRAINT game_fk FOREIGN KEY (GAME_DEV)
REFERENCES Developers(DEV_ID));
I have tried creating a view and then selecting only the DEV_ID values from the view where the number of entries is less than 10. Here's what I've tried:
CREATE OR REPLACE VIEW games_developers AS
SELECT * FROM Games g
INNER JOIN Developers d
ON g.GAME_DEV = d.DEV_ID;
SELECT DEV_ID FROM games_developers
WHERE COUNT(DEV_NAME) < 10;
Now I get the error message "group function is not allowed here".
Any ideas on how I can pull a list of developers who have fewer than 10 games available at the store?
One method is:
SELECT d.*
FROM Developers d
WHERE d.DEV_ID IN (SELECT g.GAME_DEV
FROM Games g
GROUP BY g.GAME_DEV
HAVING COUNT(*) < 10
);
However, this will miss developers with no games in the store. So:
SELECT d.*
FROM Developers d
WHERE d.DEV_ID NOT IN (SELECT g.GAME_DEV
FROM Games g
GROUP BY g.GAME_DEV
HAVING COUNT(*) >= 10
);
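Another option, which may or may not be any clearer, is to count g.game_id rather than count(*), so that a developer with no games is reported as 0 instead of 1:
select d.dev_id, d.dev_name, count(g.game_id) num_games
from developers d
left outer join games g
on( d.dev_id = g.game_dev )
group by d.dev_id, d.dev_name
having count(g.game_id) < 10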
If you changed your view to do a left outer join from Developers to Games instead, then you could just do:
select d.dev_id, d.dev_name, count(d.game_id) num_games
from games_developers d /* If the view does an outer join */
group by d.dev_id, d.dev_name
having count(d.game_id) < 10
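The outer-join approach can be tried out with a short sketch in Python's sqlite3 module (SQLite rather than Oracle, with made-up sample data and simplified column types). It puts COUNT(g.game_id) next to COUNT(*) to show how each treats a developer with no games: the outer join produces one NULL-extended row for that developer, which COUNT(*) counts and COUNT(g.game_id) does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE developers (
        dev_id   TEXT NOT NULL PRIMARY KEY,
        dev_name TEXT NOT NULL
    );
    CREATE TABLE games (
        game_id   TEXT NOT NULL PRIMARY KEY,
        game_name TEXT NOT NULL,
        game_dev  TEXT NOT NULL REFERENCES developers (dev_id)
    );
""")

# Hypothetical data: D1 has 12 games, D2 has 3, D3 has none.
conn.executemany(
    "INSERT INTO developers VALUES (?, ?)",
    [("D1", "Big Studio"), ("D2", "Small Studio"), ("D3", "No Games Yet")],
)
games = [(f"B{i}", f"Big {i}", "D1") for i in range(12)]
games += [(f"S{i}", f"Small {i}", "D2") for i in range(3)]
conn.executemany("INSERT INTO games VALUES (?, ?, ?)", games)

QUERY = """
    SELECT d.dev_id,
           COUNT(g.game_id) AS num_games,   -- ignores the NULL-extended row
           COUNT(*)         AS joined_rows  -- counts it
    FROM developers d
    LEFT OUTER JOIN games g ON d.dev_id = g.game_dev
    GROUP BY d.dev_id, d.dev_name
    HAVING COUNT(g.game_id) < 10
    ORDER BY d.dev_id
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [('D2', 3, 3), ('D3', 0, 1)]
```

Either count keeps D3 in the result because both values are below 10; only the reported number of games differs.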
I'm having a performance problem with a TOP(1) (or EXISTS) select statement on a join of 2 tables.
I'm using SQL Server 2008 R2.
I have 2 tables:
CREATE TABLE Records(
Id INT NOT NULL PRIMARY KEY,
User INT NOT NULL,
RecordType INT NOT NULL)
CREATE TABLE Values(
Id BIGINT NOT NULL PRIMARY KEY,
RecordId INT NOT NULL,
Field INT NOT NULL,
Value NVARCHAR(400) NOT NULL,
CONSTRAINT FK_Values_Record FOREIGN KEY(RecordId) REFERENCES Records(Id))
with indexes:
CREATE NONCLUSTERED INDEX IDX_Records ON Records(User ASC, RecordType ASC) INCLUDE(Id)
CREATE NONCLUSTERED INDEX IDX_Values ON Values(RecordId ASC, Field ASC) INCLUDE(Value)
CREATE NONCLUSTERED INDEX IDX_ValuesByVal ON Values(Field ASC, Value ASC) INCLUDE(RecordId)
The tables contain a lot of data, around 100 million records in Records and 150 million in Values, and they are still growing. Some users have a lot of data, some only a small amount.
For some user/field combination we might have no records in the Values table, but for some other user/field we have almost as many records in the Values table as we have in the Records table for that user.
I want to write a query testing if I have any data for a user/field combination. My first try was this:
SELECT TOP(1) V.Field
FROM Records R
INNER JOIN Values V ON V.RecordId = R.Id
WHERE R.User = @User
AND R.RecordType = @RecordType
AND V.Field = @Field
The problem with this query was that if the execution plan was not in the server's cache and the first user did not have a lot of data, the server would cache an execution plan that did not work well for a user with a lot of data, resulting in a timeout (more than 15 seconds). The same problem occurred for RecordTypes and Fields. So I had to hardcode the IDs in the query instead of using variables.
SELECT TOP(1) V.Field
FROM Records R
INNER JOIN Values V ON V.RecordId = R.Id
WHERE R.User = 123
AND R.RecordType = 45
AND V.Field = 67
But even then the server would sometimes do a table scan instead of using the available indexes, also resulting in timeouts. So I had to add FORCESEEK to the query:
SELECT TOP(1) V.Field
FROM Records R WITH (FORCESEEK)
INNER JOIN Values V WITH (FORCESEEK) ON V.RecordId = R.Id
WHERE R.User = 123
AND R.RecordType = 45
AND V.Field = 67
But even now, the server sometimes seeks in the Records table first and then in the Values table, instead of seeking in the Values table first and then in the Records table, which also results in timeouts. I don't know why this results in a timeout, but it does. As fields are linked to a RecordType in my model, I could remove the RecordType clause, forcing the server to seek in the Values table first:
SELECT TOP(1) V.Field
FROM Records R WITH (FORCESEEK)
INNER JOIN Values V WITH (FORCESEEK) ON V.RecordId = R.Id
WHERE R.User = 123
AND V.Field = 67
With this last change I no longer have any timeouts, but the query still takes around 1 to 2 seconds, sometimes even 5 to 7 seconds.
I still don't understand why it takes this much time.
Does anyone have any ideas on how to improve this query to avoid these long query times?
It should not make any difference, but for grins try:
SELECT TOP(1) 1
FROM Records R
JOIN Values V
ON V.RecordId = R.Id
AND R.User = 123
AND R.RecordType = 45
AND V.Field = 67
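The question also mentions EXISTS as an alternative to TOP(1). Here is a minimal sketch of that form of the existence check, written with Python's sqlite3 module rather than SQL Server. The names records, record_values, user_id and has_data are hypothetical (Values and User are reserved words), the data is made up, and the sketch says nothing about which plan SQL Server would pick.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (
        id          INTEGER PRIMARY KEY,
        user_id     INTEGER NOT NULL,
        record_type INTEGER NOT NULL
    );
    CREATE TABLE record_values (
        id        INTEGER PRIMARY KEY,
        record_id INTEGER NOT NULL REFERENCES records (id),
        field     INTEGER NOT NULL,
        value     TEXT NOT NULL
    );
    CREATE INDEX idx_records ON records (user_id, record_type);
    CREATE INDEX idx_values_by_field ON record_values (field, record_id);
""")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(1, 123, 45), (2, 456, 45)],
)
conn.executemany(
    "INSERT INTO record_values VALUES (?, ?, ?, ?)",
    [(1, 1, 67, "a"), (2, 2, 68, "b")],
)


def has_data(connection, user_id, field):
    """Return True if the user has at least one value stored for the field."""
    row = connection.execute(
        """
        SELECT EXISTS (
            SELECT 1
            FROM records r
            JOIN record_values v ON v.record_id = r.id
            WHERE r.user_id = ? AND v.field = ?
        )
        """,
        (user_id, field),
    ).fetchone()
    return bool(row[0])


print(has_data(conn, 123, 67), has_data(conn, 123, 68))
```

EXISTS states the intent directly: the engine may stop at the first matching row, and nothing has to be returned from the joined tables.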
Below is an example schema with 3 tables. I'm trying to run a query that returns all Jobs where all child Shifts are of status 6. If a Job has a child Shift with a status of 5, the Job should not be returned. The proper response for a query from the sample data inserted below is no rows returned.
There is a working query below with the comment "Works". I am trying to refactor the "works" query to use joins instead of subqueries. The query with the comment "Does not work" is my attempt.
-- begin setup and table creation: only run this section once.
CREATE EXTENSION "uuid-ossp";
CREATE TABLE jobs
(
id uuid NOT NULL DEFAULT uuid_generate_v4(),
CONSTRAINT jobs_pkey PRIMARY KEY (id)
);
CREATE TABLE bookings
(
id uuid NOT NULL DEFAULT uuid_generate_v4(),
job_id uuid,
CONSTRAINT bookings_pkey PRIMARY KEY (id)
);
CREATE TABLE shifts
(
id uuid NOT NULL DEFAULT uuid_generate_v4(),
booking_id uuid,
status integer,
CONSTRAINT shifts_pkey PRIMARY KEY (id)
);
insert into jobs (id) values ('e857c86c-bc31-11e6-9aae-57793f585d49');
insert into bookings (id, job_id) values ('736da82c-bc32-11e6-b9b8-f36753d321ac', 'e857c86c-bc31-11e6-9aae-57793f585d49');
insert into bookings (id, job_id) values ('7d839e5c-bc32-11e6-8bb3-4fa95be86a74', 'e857c86c-bc31-11e6-9aae-57793f585d49');
insert into shifts (booking_id, status) values ('736da82c-bc32-11e6-b9b8-f36753d321ac', 6);
insert into shifts (booking_id, status) values ('7d839e5c-bc32-11e6-8bb3-4fa95be86a74', 5);
-- end setup and table creation
We want all jobs where all child shifts are of status 6. If a job has a child shift with a status of 5, the job should not be returned. The proper response for a query from the sample data inserted above is no rows returned.
Does not work :(
SELECT "jobs".*
FROM "jobs"
inner join bookings b1 on jobs.id = b1.job_id
inner join shifts s1 on b1.id = s1.booking_id
left outer join bookings b2 on jobs.id = b2.job_id
left outer join shifts s2 on b2.id = s2.booking_id and s2.status IN (2,3,4,5)
WHERE s1.status = 6
AND s2.id IS NULL
GROUP BY "jobs"."id";
Works
SELECT "jobs".*
FROM "jobs"
WHERE jobs.id IN (
SELECT job_id
FROM bookings
WHERE bookings.id IN (
SELECT booking_id FROM shifts WHERE status = 6
)
) AND jobs.id NOT IN (
SELECT job_id FROM bookings WHERE bookings.id IN (
SELECT booking_id FROM shifts WHERE status IN (2,3,4,5)
)
)
GROUP BY "jobs"."id";
How can I refactor the "works" query to use joins instead of subqueries? The "does not work" query is my attempt.
Try this (haven't tested so there may be typos):
with prohibited_jobs as (
select distinct jobs.id
from jobs
join bookings on jobs.id = bookings.job_id
join shifts on shifts.booking_id = bookings.id
where shifts.status != 6
)
select jobs.*
from jobs
left outer join prohibited_jobs p on p.id = jobs.id
where
p.id IS NULL
It's not completely free from subqueries (doing everything with joins would most certainly be less efficient), but it removes some unnecessary checks, so may be a little bit faster (which I suspect is your goal).
There is a small difference to your working query, in that it returns all jobs where all shifts have status 6 (as you said you want), whereas your query also ensures that the job has at least one shift (of status 6).
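Here is a small sketch of the anti-join idea using Python's sqlite3 module instead of PostgreSQL. It uses integer ids in place of UUIDs to keep it short, and adds a second job with no bookings (an assumption, not part of the question's sample data) to show the difference described above. The helper name jobs_without_other_statuses is hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE jobs (id INTEGER PRIMARY KEY);
CREATE TABLE bookings (id INTEGER PRIMARY KEY, job_id INTEGER);
CREATE TABLE shifts (
    id INTEGER PRIMARY KEY,
    booking_id INTEGER,
    status INTEGER
);
"""

ANTI_JOIN = """
WITH prohibited_jobs AS (
    SELECT DISTINCT jobs.id
    FROM jobs
    JOIN bookings ON jobs.id = bookings.job_id
    JOIN shifts ON shifts.booking_id = bookings.id
    WHERE shifts.status != 6
)
SELECT jobs.id
FROM jobs
LEFT OUTER JOIN prohibited_jobs p ON p.id = jobs.id
WHERE p.id IS NULL
ORDER BY jobs.id
"""


def jobs_without_other_statuses(first_status, second_status):
    """Job 1 has two bookings with one shift each; job 2 has no bookings at all."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO jobs (id) VALUES (?)", [(1,), (2,)])
    conn.executemany(
        "INSERT INTO bookings (id, job_id) VALUES (?, ?)", [(10, 1), (11, 1)]
    )
    conn.executemany(
        "INSERT INTO shifts (booking_id, status) VALUES (?, ?)",
        [(10, first_status), (11, second_status)],
    )
    result = [row[0] for row in conn.execute(ANTI_JOIN)]
    conn.close()
    return result


mixed = jobs_without_other_statuses(6, 5)   # job 1 has a status-5 shift
all_six = jobs_without_other_statuses(6, 6)  # every shift of job 1 is status 6
print(mixed, all_six)  # [2] [1, 2]
```

Job 2 is returned in both cases because it has no shifts at all. To match the original "works" query, an extra inner join to a status-6 shift would be needed.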
I have two entities in my database that are connected with a many to many relationship. I was wondering what would be the best way to list which entities have the most similarities based on it?
I tried doing a count(*) with intersect, but the query takes too long to run on every entry in my database (there are about 20k records). When running the query I wrote, CPU usage jumps to 100% and the database has locking issues.
Here is some code showing what I've tried:
My tables look something along these lines:
/* 20k records */
create table Movie(
Id INT PRIMARY KEY,
Title varchar(255)
);
/* 200-300 records */
create table Tags(
Id INT PRIMARY KEY,
Descr varchar(255)
);
/* 200,000-300,000 records */
create table TagMovies(
Movie_Id INT,
Tag_Id INT,
PRIMARY KEY (Movie_Id, Tag_Id),
FOREIGN KEY (Movie_Id) REFERENCES Movie(Id),
FOREIGN KEY (Tag_Id) REFERENCES Tags(Id),
);
(This works, but it is terribly slow)
This is the query that I wrote to try and list them:
Usually I also filter with top 1 & add a where clause to get a specific set of related data.
SELECT
bk.Id,
rh.Id
FROM
Movies bk
CROSS APPLY (
SELECT TOP 15
b.Id,
/* Tags Score */
(
SELECT COUNT(*) FROM (
SELECT x.Tag_Id FROM TagMovies x WHERE x.Movie_Id = bk.Id
INTERSECT
SELECT x.Tag_Id FROM TagMovies x WHERE x.Movie_Id = b.Id
) Q1
)
as Amount
FROM
Movies b
WHERE
b.Id <> bk.Id
ORDER BY Amount DESC
) rh
Explanation:
Movies have tags, and the user can try to find movies similar to the one they selected, based on other movies that have similar tags.
Hmm ... just an idea, but maybe I didn't understand ...
This query should return best matched movies by tags for a given movie ID:
SELECT m.id, m.title, GROUP_CONCAT(DISTINCT t.Descr SEPARATOR ', ') as tags, count(*) as matches
FROM stack.Movie m
LEFT JOIN stack.TagMovies tm ON m.Id = tm.Movie_Id
LEFT JOIN stack.Tags t ON tm.Tag_Id = t.Id
WHERE m.id != 1
AND tm.Tag_Id IN (SELECT Tag_Id FROM stack.TagMovies tm WHERE tm.Movie_Id = 1)
GROUP BY m.id
ORDER BY matches DESC
LIMIT 15;
EDIT:
I just realized that it's for M$ SQL ... but maybe something similar can be done...
You should probably decide on a naming convention and stick with it. Are tables singular or plural nouns? I don't want to get into that debate, but pick one or the other.
Without access to your database I don't know how this will perform. It's just off the top of my head. You could also limit this by the M.id value to find the best matches for a single movie, which I think would improve performance by quite a bit.
Also, TOP x should let you get the x closest matches.
SELECT
M.id,
M.title,
SM.id AS similar_movie_id,
SM.title AS similar_movie_title,
COUNT(*) AS matched_tags
FROM
Movie M
INNER JOIN TagMovies TM1 ON TM1.movie_id = M.id
INNER JOIN TagMovies TM2 ON
TM2.tag_id = TM1.tag_id AND
TM2.movie_id <> TM1.movie_id
INNER JOIN Movie SM ON SM.id = TM2.movie_id
GROUP BY
M.id,
M.title,
SM.id,
SM.title
ORDER BY
COUNT(*) DESC
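The self-join idea can be tried on a tiny data set with Python's sqlite3 module. This is a sketch under assumptions: SQLite instead of SQL Server, LIMIT in place of TOP, made-up sample rows, a hypothetical helper named similar_movies, and a filter on one movie id as suggested above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movie (Id INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Tags (Id INTEGER PRIMARY KEY, Descr TEXT);
    CREATE TABLE TagMovies (
        Movie_Id INTEGER REFERENCES Movie (Id),
        Tag_Id   INTEGER REFERENCES Tags (Id),
        PRIMARY KEY (Movie_Id, Tag_Id)
    );
""")
conn.executemany("INSERT INTO Movie VALUES (?, ?)", [(1, "A"), (2, "B"), (3, "C")])
conn.executemany(
    "INSERT INTO Tags VALUES (?, ?)", [(1, "space"), (2, "heist"), (3, "robots")]
)
# Movie 1 has tags 1-3, movie 2 shares tags 1 and 2, movie 3 shares tag 3.
conn.executemany(
    "INSERT INTO TagMovies VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 3)],
)

SIMILAR = """
    SELECT SM.Id, SM.Title, COUNT(*) AS matched_tags
    FROM TagMovies TM1
    JOIN TagMovies TM2
      ON TM2.Tag_Id = TM1.Tag_Id
     AND TM2.Movie_Id <> TM1.Movie_Id
    JOIN Movie SM ON SM.Id = TM2.Movie_Id
    WHERE TM1.Movie_Id = ?
    GROUP BY SM.Id, SM.Title
    ORDER BY matched_tags DESC, SM.Id
    LIMIT 15
"""


def similar_movies(connection, movie_id):
    """Return (id, title, shared tag count) for the movies closest to movie_id."""
    return connection.execute(SIMILAR, (movie_id,)).fetchall()


print(similar_movies(conn, 1))  # [(2, 'B', 2), (3, 'C', 1)]
```

Each shared tag produces exactly one joined row, so COUNT(*) per similar movie equals the size of the tag intersection that the INTERSECT subquery computed pair by pair.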