I have a table with 700K+ records on which a simple GROUP BY query takes more than 35 seconds to execute. I'm out of ideas on how to optimize this.
SELECT TOP 10 called_dn, COUNT(called_dn) FROM reportview.calls_out GROUP BY called_dn;
Here I add TOP 10 only to limit delays caused by network transfer.
I have an index on called_dn (HSQLDB does not seem to be using it).
called_dn is non-nullable.
reportview.calls_out is a cached table.
Here's the table script:
CREATE TABLE calls_out (
pk_global_call_id INTEGER GENERATED BY DEFAULT AS SEQUENCE seq_global_call_id NOT NULL,
sys_global_call_id VARCHAR(65),
call_start TIMESTAMP WITH TIME ZONE NOT NULL,
call_end TIMESTAMP WITH TIME ZONE NOT NULL,
duration_interval INTERVAL HOUR TO SECOND(0),
duration_seconds INTEGER,
call_segments INTEGER,
calling_dn VARCHAR(25) NOT NULL,
called_dn VARCHAR(25) NOT NULL,
called_via_dn VARCHAR(25),
fk_end_status INTEGER NOT NULL,
fk_incoming_queue INTEGER,
call_start_year INTEGER,
call_start_month INTEGER,
call_start_week INTEGER,
call_start_day INTEGER,
call_start_hour INTEGER,
call_start_minute INTEGER,
call_start_second INTEGER,
utc_created TIMESTAMP WITH TIME ZONE,
created_by VARCHAR(25),
utc_modified TIMESTAMP WITH TIME ZONE,
modified_by VARCHAR(25),
PRIMARY KEY (pk_global_call_id),
FOREIGN KEY (fk_incoming_queue)
REFERENCES lookup_incoming_queue(pk_id),
FOREIGN KEY (fk_end_status)
REFERENCES lookup_end_status(pk_id));
Am I stuck with this kind of performance, or is there something I might try to speed up this query?
EDIT: Here's the query plan if it helps:
isDistinctSelect=[false]
isGrouped=[true]
isAggregated=[true]
columns=[ COLUMN: REPORTVIEW.CALLS_OUT.CALLED_DN not nullable
COUNT arg=[ COLUMN: REPORTVIEW.CALLS_OUT.CALLED_DN nullable]
[range variable 1
join type=INNER
table=CALLS_OUT
cardinality=771855
access=FULL SCAN
join condition = [index=SYS_IDX_SYS_PK_10173_10177]]]
groupColumns=[COLUMN: REPORTVIEW.CALLS_OUT.CALLED_DN]
offset=[VALUE = 0, TYPE = INTEGER]
limit=[VALUE = 10, TYPE = INTEGER]
PARAMETERS=[]
SUBQUERIES[]
Well, it seems there's no way to avoid a full scan in this situation.
Just for reference for future souls reaching this question, here's what I resorted to in the end:
I created a summary table maintained by INSERT / DELETE triggers on the original table. This, in combination with suitable indexes and LIMIT USING INDEX clauses in my queries, yields very good performance.
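A rough sketch of that setup (the table and trigger names are made up, and the trigger bodies assume HSQLDB 2.x syntax, so treat it as an outline rather than a drop-in script):
CREATE TABLE called_dn_counts (
    called_dn VARCHAR(25) PRIMARY KEY,
    call_count INTEGER DEFAULT 0 NOT NULL
);
-- keep the counts in sync when rows are added
CREATE TRIGGER trg_calls_out_ins AFTER INSERT ON calls_out
    REFERENCING NEW ROW AS newrow
    FOR EACH ROW
    MERGE INTO called_dn_counts c
    USING (VALUES (newrow.called_dn)) AS v(dn) ON c.called_dn = v.dn
    WHEN MATCHED THEN UPDATE SET c.call_count = c.call_count + 1
    WHEN NOT MATCHED THEN INSERT (called_dn, call_count) VALUES (v.dn, 1);
-- ...and when rows are removed
CREATE TRIGGER trg_calls_out_del AFTER DELETE ON calls_out
    REFERENCING OLD ROW AS oldrow
    FOR EACH ROW
    UPDATE called_dn_counts SET call_count = call_count - 1
    WHERE called_dn = oldrow.called_dn;
The top-10 query then becomes a read of the small summary table, e.g. SELECT called_dn, call_count FROM called_dn_counts ORDER BY call_count DESC LIMIT 10, instead of a full scan over 700K+ rows.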
Related
I am working on an application with billions of records, and I need to write a query that requires a GROUP BY clause.
Table Schema:
CREATE TABLE event (
eventId INTEGER PRIMARY KEY,
eventTime INTEGER NOT NULL,
sourceId INTEGER NOT NULL,
plateNumber VARCHAR(10) NOT NULL,
plateCodeId INTEGER NOT NULL,
plateCountryId INTEGER NOT NULL,
plateStateId INTEGER NOT NULL
);
CREATE TABLE source (
sourceId INTEGER PRIMARY KEY,
sourceName VARCHAR(32) NOT NULL
);
Scenario:
The user will select sources, say source IDs (1, 2, 3).
We need to get all events which occurred more than once for those sources within an event-time range.
"Same event" criteria: same plateNumber, plateCodeId, plateStateId, plateCountryId.
I have prepared a query to perform the above operation, but it is taking a long time to execute.
select plateNumber, plateCodeId, plateStateId,
       plateCountryId, sourceId, count(1)
from event
where sourceId in (1, 2, 3)
group by sourceId, plateCodeId, plateStateId,
         plateCountryId, plateNumber
having count(1) > 1
limit 10 offset 0
Can you recommend an optimized query for it?
Since you didn't supply the projection DDL, I'll assume the projection is the default one created by the CREATE TABLE statement.
Your goal is to get the GROUPBY PIPELINED algorithm used instead of GROUPBY HASH, which is usually slower and consumes more memory.
To do so, you need the table (more precisely, its projection) to be sorted by the columns in the GROUP BY clause.
More info here: GROUP BY Implementation Options
CREATE TABLE event (
eventId INTEGER PRIMARY KEY,
eventTime INTEGER NOT NULL,
sourceId INTEGER NOT NULL,
plateNumber VARCHAR(10) NOT NULL,
plateCodeId INTEGER NOT NULL,
plateCountryId INTEGER NOT NULL,
plateStateId INTEGER NOT NULL
)
ORDER BY sourceId,
plateCodeId,
plateStateId,
plateCountryId,
plateNumber;
You can see which algorithm is being used by adding EXPLAIN before your query.
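For instance (this is just the query from the question with EXPLAIN prefixed; nothing else changes):
EXPLAIN
select plateNumber, plateCodeId, plateStateId,
       plateCountryId, sourceId, count(1)
from event
where sourceId in (1, 2, 3)
group by sourceId, plateCodeId, plateStateId,
         plateCountryId, plateNumber
having count(1) > 1
limit 10 offset 0;
In the plan output, look for GROUPBY PIPELINED (rather than GROUPBY HASH) on the aggregation step once the projection is sorted as above.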
The issue here is as simple as this: PostgreSQL doesn't allow the following query structure:
-- TABLE OF FACTS
CREATE TABLE facts_table (
id integer NOT NULL,
description CHARACTER VARYING(50),
amount NUMERIC(12,2) DEFAULT 0,
quantity INTEGER,
detail_1 CHARACTER VARYING(50),
detail_2 CHARACTER VARYING(50),
detail_3 CHARACTER VARYING(50),
time TIMESTAMP(0) WITHOUT TIME ZONE DEFAULT LOCALTIMESTAMP(0)
);
ALTER TABLE facts_table ADD PRIMARY KEY(id);
-- SUMMARIZED TABLE
CREATE TABLE table_cube (
id INTEGER,
description CHARACTER VARYING(50),
amount NUMERIC(12,2) DEFAULT 0,
quantity INTEGER,
time TIMESTAMP(0) WITHOUT TIME ZONE DEFAULT LOCALTIMESTAMP(0)
);
ALTER TABLE table_cube ADD PRIMARY KEY(id);
INSERT INTO table_cube(id, description, amount, quantity, time)
SELECT
id,
description,
SUM(amount) AS amount,
SUM(quantity) AS quantity,
time
FROM facts_table
GROUP BY CUBE(id, description, time);
----------------------------------------------------------------
ERROR: grouping sets are not allowed in INSERT SELECT queries.
I think it's pretty obvious that CUBE produces NULL results in every field used as a grouping set (as it computes every possible combination), therefore I cannot insert those rows into my table_cube table. So does Postgres just assume that I'm trying to insert rows into a table with a PK field? Even if the table_cube table doesn't have a PK, this cannot be accomplished.
Thanks.
Version: PostgreSQL 9.6
You have defined table_cube(id) as the primary key, so if CUBE produces NULL values they can't be inserted. I checked without id as a primary key and it works fine; with id defined as the primary key I get this error:
"ERROR: id contains null values" SQL state: 23502
As suggested by Haleemur Ali,
"If a constraint is required, use a unique index with all the grouping
set columns: CREATE UNIQUE INDEX unq_table_cube_id_description_time ON
table_cube(id, description, time); Please update your question with more
information on database & version."
is a good option, but you have to remove the primary key on id and keep only the unique index as suggested above; with both the primary key and the unique index you again get this error:
ERROR: null value in column "id" violates not-null constraint
DETAIL: Failing row contains (null, null, 1300, 1522, null).
SQL state: 23502
So the conclusion is: with the unique index there is no need for a primary key, and with CUBE you can also do without the unique index and the primary key entirely.
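A rough sketch of that setup, based on the DDL in the question (table_cube_pkey is PostgreSQL's default constraint name and may differ in your database):
-- drop the primary key, then add the unique index over the grouping columns
ALTER TABLE table_cube DROP CONSTRAINT table_cube_pkey;
CREATE UNIQUE INDEX unq_table_cube_id_description_time
    ON table_cube (id, description, time);
INSERT INTO table_cube (id, description, amount, quantity, time)
SELECT id, description, SUM(amount) AS amount, SUM(quantity) AS quantity, time
FROM facts_table
GROUP BY CUBE (id, description, time);
Because a unique index treats NULLs as distinct, the NULL-padded rows that CUBE generates do not violate it, whereas a primary key rejects them for being NULL.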
I use SQLite 3 (exactly 3.9.2, 2015-11-02) in my application. For test purposes I have a few tables.
One of them has the following schema:
CREATE TABLE "Recordings" (
`PartId` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
`CameraId` INTEGER NOT NULL,
`StartTime` INTEGER NOT NULL,
`EndTime` INTEGER NOT NULL,
`FilePath` TEXT NOT NULL UNIQUE,
`FileSize` INTEGER NOT NULL,
`DeleteLockManual` INTEGER NOT NULL DEFAULT 0,
`Event1` INTEGER NOT NULL,
`Event2` INTEGER NOT NULL,
`Event3` INTEGER NOT NULL,
`Event4` INTEGER NOT NULL,
`Event5` INTEGER NOT NULL,
`Event6` INTEGER NOT NULL,
FOREIGN KEY(`CameraId`) REFERENCES Devices ( CameraId )
);
CREATE INDEX `Table_Event2` ON `Recordings` (`Event2`);
CREATE INDEX `Table_Event3` ON `Recordings` (`Event3`);
CREATE INDEX `Table_Event4` ON `Recordings` (`Event4`);
CREATE INDEX `Table_Event5` ON `Recordings` (`Event5`);
CREATE INDEX `Table_Event6` ON `Recordings` (`Event6`);
CREATE INDEX `Table_Event1` ON `Recordings` (`Event1`);
CREATE INDEX `Table_DeleteLockManual` ON `Recordings` (`DeleteLockManual`);
CREATE INDEX `Table_EndTime` ON `Recordings` (`EndTime`);
The hardware I'm using is pretty old: a Pentium 4 at 2.4 GHz, 512 MB RAM, and an old 40 GB hard drive.
The Recordings table contains ~60k rows.
When I do an INSERT into this table, from time to time (roughly one in 30) the query takes extremely long to finish. Last time it took 23 seconds (sic!) for a single one-row INSERT. The rest of the time it takes ~120 ms.
I dumped the stack during this operation:
[<e0a245a8>] jbd2_log_wait_commit+0x88/0xbc [jbd2]
[<e0a258ce>] jbd2_complete_transaction+0x69/0x6d [jbd2]
[<e0ac1e8e>] ext4_sync_file+0x208/0x279 [ext4]
[<c020acb1>] vfs_fsync_range+0x64/0x76
[<c020acd7>] vfs_fsync+0x14/0x16
[<c020acfb>] do_fsync+0x22/0x3f
[<c020aed5>] SyS_fdatasync+0x10/0x12
[<c0499292>] syscall_after_call+0x0/0x4
[<ffffffff>] 0xffffffff
Or:
[<c02b3560>] submit_bio_wait+0x46/0x51
[<c02bb54b>] blkdev_issue_flush+0x41/0x67
[<e0ac1ea8>] ext4_sync_file+0x222/0x279 [ext4]
[<c020acb1>] vfs_fsync_range+0x64/0x76
[<c020acd7>] vfs_fsync+0x14/0x16
[<c020acfb>] do_fsync+0x22/0x3f
[<c020aed5>] SyS_fdatasync+0x10/0x12
[<c0499292>] syscall_after_call+0x0/0x4
[<ffffffff>] 0xffffffff
The application using this database is single-threaded.
What can cause such behavior?
Will switching to recent hardware (with an SSD) solve this issue?
Here's the creation of my tables...
CREATE TABLE questions(
id INTEGER PRIMARY KEY AUTOINCREMENT,
question VARCHAR(256) UNIQUE NOT NULL,
rangeMin INTEGER,
rangeMax INTEGER,
level INTEGER NOT NULL,
totalRatings INTEGER DEFAULT 0,
totalStars INTEGER DEFAULT 0
);
CREATE TABLE games(
id INTEGER PRIMARY KEY AUTOINCREMENT,
level INTEGER NOT NULL,
inclusive BOOL NOT NULL DEFAULT 0,
active BOOL NOT NULL DEFAULT 0,
questionCount INTEGER NOT NULL,
completedCount INTEGER DEFAULT 0,
startTime DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE gameQuestions(
gameId INTEGER,
questionId INTEGER,
asked BOOL DEFAULT 0,
FOREIGN KEY(gameId) REFERENCES games(id),
FOREIGN KEY(questionId) REFERENCES questions(id)
);
I'll explain the full steps that I'm doing, and then I'll ask for input.
I need to...
Using a games.id value, lookup the games.questionCount and games.level for that game.
Now since I have games.questionCount and games.level, I need to look at all of the rows in the questions table with questions.level = games.level and select games.questionCount of them at random.
Now, with the rows (aka questions) I got from step 2, I need to insert them into the gameQuestions table using the games.id value and the questions.id values.
What's the best way to accomplish this? I could do it with several different SQL queries, but I feel like someone really skilled with SQL could make it happen more efficiently. I am using sqlite3.
This does it in one statement. Let's assume :game_id is the game id you want to process.
insert into gameQuestions (gameId, questionId)
select :game_id, id
from questions
where level = (select level from games where id = :game_id)
order by random()
limit (select questionCount from games where id = :game_id);
@Tony: the SQLite docs say LIMIT takes an expression. The above statement works fine using SQLite 3.8.0.2 and produces the desired results. I have not tested an older version.
Anyone have guidance on how to approach building indexes for the following query? The query works as expected, but I can't seem to get around full table scans. Working with Oracle 11g.
SELECT v.volume_id
FROM ( SELECT MIN (usv.volume_id) volume_id
FROM user_stage_volume usv
WHERE usv.status = 'NEW'
AND NOT EXISTS
(SELECT 1
FROM user_stage_volume kusv
WHERE kusv.deal_num = usv.deal_num
AND kusv.locked = 'Y')
GROUP BY usv.deal_num, usv.volume_type
ORDER BY MAX (usv.priority) DESC, MIN (usv.last_update) ASC) v
WHERE ROWNUM = 1;
Please request any more info you may need in comments and I'll edit.
Here is the create script for the table. The PK is VOLUME_ID. DEAL_NUM is not unique.
CREATE TABLE ENDUR.USER_STAGE_VOLUME
(
DEAL_NUM NUMBER(38) NOT NULL,
EXTERNAL_ID NUMBER(38) NOT NULL,
VOLUME_TYPE NUMBER(38) NOT NULL,
EXTERNAL_TYPE VARCHAR2(100 BYTE) NOT NULL,
GMT_START DATE NOT NULL,
GMT_END DATE NOT NULL,
VALUE FLOAT(126) NOT NULL,
VOLUME_ID NUMBER(38) NOT NULL,
PRIORITY INTEGER NOT NULL,
STATUS VARCHAR2(100 BYTE) NOT NULL,
LAST_UPDATE DATE NOT NULL,
LOCKED CHAR(1 BYTE) NOT NULL,
RETRY_COUNT INTEGER DEFAULT 0 NOT NULL,
INS_DATE DATE NOT NULL
)
ALTER TABLE ENDUR.USER_STAGE_VOLUME ADD (
PRIMARY KEY
(VOLUME_ID))
An index on (deal_num) would help the subquery greatly. In fact, an index on (deal_num, locked) would allow the subquery to avoid the table itself altogether.
You should expect a full table scan on the main query, as it filters on status, which is not indexed (and most likely would not benefit from being indexed, unless 'NEW' is a fairly rare value for status).
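For example (the index name is made up; adjust it to your naming convention):
CREATE INDEX usv_deal_num_locked_ix
    ON ENDUR.USER_STAGE_VOLUME (deal_num, locked);
With that in place the NOT EXISTS probe can be answered from the index alone, without touching the table.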
I think it's running your inner subquery (inside NOT EXISTS ...) once for every row of the outer query.
That will be where performance takes a hit: it runs through all of user_stage_volume for each row in user_stage_volume, which is O(n^2), n being the number of rows in usv.
An alternative would be to create a view for the inner subquery and use that view, or to name it as a temporary result set using WITH.
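A sketch of the WITH variant (same logic as the original query; untested against your data, so verify the plan with EXPLAIN PLAN):
WITH locked_deals AS (
    SELECT DISTINCT deal_num
    FROM user_stage_volume
    WHERE locked = 'Y'
)
SELECT v.volume_id
FROM ( SELECT MIN (usv.volume_id) volume_id
       FROM user_stage_volume usv
       WHERE usv.status = 'NEW'
         AND usv.deal_num NOT IN (SELECT deal_num FROM locked_deals)
       GROUP BY usv.deal_num, usv.volume_type
       ORDER BY MAX (usv.priority) DESC, MIN (usv.last_update) ASC) v
WHERE ROWNUM = 1;
Since deal_num is declared NOT NULL, the NOT IN form here is equivalent to the original NOT EXISTS.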