Extremely poor insert performance with SQLite3 - sql

I use SQLite3 (version 3.9.2, 2015-11-02) in my application. For test purposes I have a few tables.
One of them has the following schema:
CREATE TABLE "Recordings" (
`PartId` INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
`CameraId` INTEGER NOT NULL,
`StartTime` INTEGER NOT NULL,
`EndTime` INTEGER NOT NULL,
`FilePath` TEXT NOT NULL UNIQUE,
`FileSize` INTEGER NOT NULL,
`DeleteLockManual` INTEGER NOT NULL DEFAULT 0,
`Event1` INTEGER NOT NULL,
`Event2` INTEGER NOT NULL,
`Event3` INTEGER NOT NULL,
`Event4` INTEGER NOT NULL,
`Event5` INTEGER NOT NULL,
`Event6` INTEGER NOT NULL,
FOREIGN KEY(`CameraId`) REFERENCES Devices ( CameraId )
);
CREATE INDEX `Table_Event2` ON `Table` (`Event2`);
CREATE INDEX `Table_Event3` ON `Table` (`Event3`);
CREATE INDEX `Table_Event4` ON `Table` (`Event4`);
CREATE INDEX `Table_Event5` ON `Table` (`Event5`);
CREATE INDEX `Table_Event6` ON `Table` (`Event6`);
CREATE INDEX `Table_Event1` ON `Table` (`Event1`);
CREATE INDEX `Table_DeleteLockManual` ON `Table` (`DeleteLockManual`);
CREATE INDEX `Table_EndTime` ON `Table` (`EndTime`);
The hardware I'm using is pretty old: Pentium 4 2.4 GHz, 512 MB RAM, and an old 40 GB hard drive.
The Recordings table contains ~60k rows.
When I INSERT into this table, from time to time (roughly one in 30) the query takes extremely long to finish. Last time it took 23 seconds for a single one-row INSERT. The rest of the time it takes ~120 ms.
I dumped the stack during this operation:
[<e0a245a8>] jbd2_log_wait_commit+0x88/0xbc [jbd2]
[<e0a258ce>] jbd2_complete_transaction+0x69/0x6d [jbd2]
[<e0ac1e8e>] ext4_sync_file+0x208/0x279 [ext4]
[<c020acb1>] vfs_fsync_range+0x64/0x76
[<c020acd7>] vfs_fsync+0x14/0x16
[<c020acfb>] do_fsync+0x22/0x3f
[<c020aed5>] SyS_fdatasync+0x10/0x12
[<c0499292>] syscall_after_call+0x0/0x4
[<ffffffff>] 0xffffffff
Or:
[<c02b3560>] submit_bio_wait+0x46/0x51
[<c02bb54b>] blkdev_issue_flush+0x41/0x67
[<e0ac1ea8>] ext4_sync_file+0x222/0x279 [ext4]
[<c020acb1>] vfs_fsync_range+0x64/0x76
[<c020acd7>] vfs_fsync+0x14/0x16
[<c020acfb>] do_fsync+0x22/0x3f
[<c020aed5>] SyS_fdatasync+0x10/0x12
[<c0499292>] syscall_after_call+0x0/0x4
[<ffffffff>] 0xffffffff
The application using this database is single-threaded.
What can cause such behavior?
Will switching to recent hardware (with ssd) solve this issue?
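Both stack traces end in fdatasync, so the time is spent waiting for the disk to flush data at commit, not in SQLite's own processing. The following is a minimal sketch, using Python's sqlite3 module, of two common ways to reduce the number of syncs: grouping several inserts into one transaction, and switching to WAL journaling with synchronous=NORMAL. The table is a cut-down, hypothetical version of Recordings, and the file path is a temporary one chosen for illustration. Note that synchronous=NORMAL in WAL mode trades some durability for speed (a power loss may roll back the most recent commits), so whether it is acceptable here is an assumption.

```python
import os
import sqlite3
import tempfile

# Hypothetical, cut-down version of the Recordings table for illustration.
SCHEMA = """
CREATE TABLE Recordings (
    PartId   INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
    CameraId INTEGER NOT NULL,
    FilePath TEXT NOT NULL UNIQUE
)
"""

def open_db(path):
    conn = sqlite3.connect(path)
    # WAL appends commits to a log file instead of rewriting pages in place.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # In WAL mode, NORMAL defers syncing to checkpoints rather than each commit.
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn, mode

def insert_batch(conn, rows):
    # One transaction means one commit for the whole batch, not one per row.
    with conn:
        conn.executemany(
            "INSERT INTO Recordings (CameraId, FilePath) VALUES (?, ?)", rows
        )

db_path = os.path.join(tempfile.mkdtemp(), "test.db")
conn, journal_mode = open_db(db_path)
conn.execute(SCHEMA)
insert_batch(conn, [(1, f"/rec/{i}.mp4") for i in range(100)])
row_count = conn.execute("SELECT COUNT(*) FROM Recordings").fetchone()[0]
conn.close()
```

Measuring the insert latency before and after each change on the actual hardware would show which of the two, if either, removes the long stalls.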

Related

Primary Index not being used

I have a Greenplum cluster with the following specifications:
Master (16 VCPUs, 32GB RAM, 27GB Swap)
4 Segments (16 VCPUs, 62GB RAM, 27GB Swap each)
Earlier I had two segments and was getting outstanding performance for my use cases, but ever since I expanded the cluster to four nodes, I have been unable to get the queries to use the indexes.
Queries that used to execute within 10 ms (with an index hit) now take 2-5 seconds with a sequential scan.
I have attached my schema and some sample EXPLAIN ANALYZE outputs (this is a sample query plan; the relevant table has 48260809 rows).
Schema:
\c dmiprod
Create Table dmiprod_schema."package"
(
identity varchar(4096) not null,
"identityHash" bytea not null,
"packageDate" date not null,
ctime timestamp not null,
customer varchar(32) not null
) distributed by ("identityHash");
ALTER TABLE ONLY dmiprod_schema.package ADD CONSTRAINT "package_pkey" PRIMARY KEY ("identityHash");
create index idx_package_ctime on dmiprod_schema.package ("ctime");
create index idx_package_packageDate on dmiprod_schema.package ("packageDate");
create index idx_package_customer on dmiprod_schema.package ("customer");
CREATE TABLE dmiprod_schema."tags"
(
"identityHash" bytea not null,
tag varchar(32) not null,
UNIQUE ("identityHash",tag)
) distributed by ("identityHash");
create index "idx_tags_identityHash" on dmiprod_schema.tags ("identityHash");
create index idx_tags_tag on dmiprod_schema.tags ("tag");
CREATE TABLE dmiprod_schema."features"
(
"identityHash" bytea not null,
ctime timestamp not null,
utime timestamp not null,
phash varchar(64) ,
ahash varchar(64),
chash varchar(78),
iimages JSON ,
lcert JSON ,
slogos JSON
) distributed by ("identityHash");
ALTER TABLE ONLY dmiprod_schema.features ADD CONSTRAINT "features_pkey" PRIMARY KEY ("identityHash");
create index idx_features_phash on dmiprod_schema.features ("phash");
CREATE TABLE dmiprod_schema."raw"
(
"identityHash" bytea not null,
ctime timestamp not null,
utime timestamp not null,
ourl TEXT,
lurl TEXT,
"pageText" TEXT,
"ocrText" TEXT,
html TEXT,
meta JSON
) distributed by ("identityHash");
ALTER TABLE ONLY dmiprod_schema.raw ADD CONSTRAINT "raw_pkey" PRIMARY KEY ("identityHash");
CREATE TABLE dmiprod_schema.packageLock
(
"identityHash" bytea not null,
secret bytea not null,
ctime timestamp not null,
UNIQUE ("identityHash")
) distributed by ("identityHash");
ALTER TABLE ONLY dmiprod_schema.packageLock ADD CONSTRAINT "packageLock_pkey" PRIMARY KEY ("identityHash");
create index idx_packageLock_secret on dmiprod_schema.packageLock ("secret");
Recreating the table and the indexes, inserting 20 million random bytea values of length 32, 48, 64, 96 and 128 into identityHash, and then performing the same select results in package_pkey being used and the query finishing in under 20 ms.
Aside from index usage, the other difference is usage of the GPORCA optimizer. I suggest you set optimizer = 'on'; and try again. If that does not work, post your GPDB/Greenplum version and session settings, including optimizer, enable_indexscan, and any other relevant settings.
I tested on VMs on a single physical host, with Tanzu Greenplum 6.17.1 and 4 segment hosts.
Just to confirm: when you expanded the system (assuming you used gpexpand), did you run the redistribution phase (gpexpand is a two-step process)? When that completed, did you run analyzedb on the system to make sure statistics were updated with the new table/index segments?

GroupBy query for billion records - Vertica

I am working on an application with billions of records, and I need to write a query that uses a GROUP BY clause.
Table Schema:
CREATE TABLE event (
eventId INTEGER PRIMARY KEY,
eventTime INTEGER NOT NULL,
sourceId INTEGER NOT NULL,
plateNumber VARCHAR(10) NOT NULL,
plateCodeId INTEGER NOT NULL,
plateCountryId INTEGER NOT NULL,
plateStateId INTEGER NOT NULL
);
CREATE TABLE source (
sourceId INTEGER PRIMARY KEY,
sourceName VARCHAR(32) NOT NULL
);
Scenario:
The user selects sources, for example source IDs (1,2,3).
We need to get all events that occurred more than once for those sources within an event time range.
Events count as the same when they match on plateNumber, plateCodeId, plateStateId and plateCountryId.
I have prepared a query to perform the above operation, but it takes a long time to execute.
select plateNumber, plateCodeId, plateStateId,
plateCountryId, sourceId,count(1) from event
where sourceId in (1,2,3)
group by sourceId, plateCodeId, plateStateId,
plateCountryId, plateNumber
having count(1) > 1 limit 10 offset 0
Can you recommend an optimized query for it?
Since you didn't supply the projection DDL, I'll assume the projection is the default one created by the CREATE TABLE statement.
Your goal is to get Vertica to use the GROUPBY PIPELINED algorithm instead of GROUPBY HASH, which is usually slower and consumes more memory.
To do so, you need the table's projection to be sorted by the columns in the GROUP BY clause.
More info here: GROUP BY Implementation Options
CREATE TABLE event (
eventId INTEGER PRIMARY KEY,
eventTime INTEGER NOT NULL,
sourceId INTEGER NOT NULL,
plateNumber VARCHAR(10) NOT NULL,
plateCodeId INTEGER NOT NULL,
plateCountryId INTEGER NOT NULL,
plateStateId INTEGER NOT NULL
)
ORDER BY sourceId,
plateCodeId,
plateStateId,
plateCountryId,
plateNumber;
You can see which algorithm is being used by adding EXPLAIN before your query.
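To illustrate why sort order matters, here is a conceptual sketch in Python of the two grouping strategies. It is not Vertica's implementation; the function names and sample rows are hypothetical. Hash grouping accepts rows in any order but must keep a counter for every distinct key, while pipelined grouping relies on the input already being sorted by the grouping columns, so it only needs to track the current group.

```python
from collections import Counter
from itertools import groupby

# Hypothetical sample rows:
# (sourceId, plateCodeId, plateStateId, plateCountryId, plateNumber)
events = [
    (1, 10, 5, 1, "ABC123"),
    (2, 11, 5, 1, "XYZ789"),
    (1, 10, 5, 1, "ABC123"),
    (3, 12, 6, 1, "JKL456"),
    (2, 11, 5, 1, "XYZ789"),
    (1, 10, 5, 1, "ABC123"),
]

def group_hash(rows):
    # Hash grouping: any input order, one counter per distinct key in memory.
    counts = Counter(rows)
    return {key: n for key, n in counts.items() if n > 1}

def group_pipelined(sorted_rows):
    # Pipelined grouping: equal keys are adjacent in sorted input, so each
    # group can be counted and emitted before the next one starts.
    for key, group in groupby(sorted_rows):
        n = sum(1 for _ in group)
        if n > 1:
            yield key, n

hash_result = group_hash(events)
# sorted() stands in for a projection that is already stored in this order.
pipelined_result = dict(group_pipelined(sorted(events)))
```

Both approaches return the same groups; the difference is that the pipelined version can stream its output, which also lets a LIMIT stop the scan early.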

Optimizing GROUP BY in hsqldb

I have a table with 700K+ records on which a simple GROUP BY query takes more than 35 seconds to execute. I'm out of ideas on how to optimize this.
SELECT TOP 10 called_dn, COUNT(called_dn) FROM reportview.calls_out GROUP BY called_dn;
Here I add TOP 10 to limit network transfer induced delays.
I have an index on called_dn (hsqldb seems not to be using this).
called_dn is non nullable.
reportview.calls_out is a cached table.
Here's the table script:
CREATE TABLE calls_out (
pk_global_call_id INTEGER GENERATED BY DEFAULT AS SEQUENCE seq_global_call_id NOT NULL,
sys_global_call_id VARCHAR(65),
call_start TIMESTAMP WITH TIME ZONE NOT NULL,
call_end TIMESTAMP WITH TIME ZONE NOT NULL,
duration_interval INTERVAL HOUR TO SECOND(0),
duration_seconds INTEGER,
call_segments INTEGER,
calling_dn VARCHAR(25) NOT NULL,
called_dn VARCHAR(25) NOT NULL,
called_via_dn VARCHAR(25),
fk_end_status INTEGER NOT NULL,
fk_incoming_queue INTEGER,
call_start_year INTEGER,
call_start_month INTEGER,
call_start_week INTEGER,
call_start_day INTEGER,
call_start_hour INTEGER,
call_start_minute INTEGER,
call_start_second INTEGER,
utc_created TIMESTAMP WITH TIME ZONE,
created_by VARCHAR(25),
utc_modified TIMESTAMP WITH TIME ZONE,
modified_by VARCHAR(25),
PRIMARY KEY (pk_global_call_id),
FOREIGN KEY (fk_incoming_queue)
REFERENCES lookup_incoming_queue(pk_id),
FOREIGN KEY (fk_end_status)
REFERENCES lookup_end_status(pk_id));
Am I stuck with this kind of performance, or is there something I might try to speed up this query?
EDIT: Here's the query plan if it helps:
isDistinctSelect=[false]
isGrouped=[true]
isAggregated=[true]
columns=[ COLUMN: REPORTVIEW.CALLS_OUT.CALLED_DN not nullable
COUNT arg=[ COLUMN: REPORTVIEW.CALLS_OUT.CALLED_DN nullable]
[range variable 1
join type=INNER
table=CALLS_OUT
cardinality=771855
access=FULL SCAN
join condition = [index=SYS_IDX_SYS_PK_10173_10177]]]
groupColumns=[COLUMN: REPORTVIEW.CALLS_OUT.CALLED_DN]
offset=[VALUE = 0, TYPE = INTEGER]
limit=[VALUE = 10, TYPE = INTEGER]
PARAMETERS=[]
SUBQUERIES[]
Well, it seems there's no way to avoid a full column scan in this situation.
For future readers who reach this question, here's what I resorted to in the end:
I created a summary table maintained by INSERT / DELETE triggers on the original table. This, in combination with suitable indexes and LIMIT USING INDEX clauses in my queries, yields very good performance.
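A minimal sketch of the trigger-maintained summary table idea follows. It uses SQLite through Python's sqlite3 module as a stand-in, not HSQLDB, so the trigger syntax differs from what HSQLDB would need. The summary table name, trigger names, and sample data are hypothetical, and calls_out is reduced to the two columns that matter here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calls_out (
    pk_global_call_id INTEGER PRIMARY KEY,
    called_dn VARCHAR(25) NOT NULL
);
-- Hypothetical summary table: one row per called_dn with its call count.
CREATE TABLE calls_out_summary (
    called_dn VARCHAR(25) PRIMARY KEY,
    call_count INTEGER NOT NULL
);
CREATE TRIGGER calls_out_ai AFTER INSERT ON calls_out
BEGIN
    -- Create the counter row on first sight of a number, then increment it.
    INSERT OR IGNORE INTO calls_out_summary (called_dn, call_count)
        VALUES (NEW.called_dn, 0);
    UPDATE calls_out_summary SET call_count = call_count + 1
        WHERE called_dn = NEW.called_dn;
END;
CREATE TRIGGER calls_out_ad AFTER DELETE ON calls_out
BEGIN
    -- Decrement the counter and drop the row once it reaches zero.
    UPDATE calls_out_summary SET call_count = call_count - 1
        WHERE called_dn = OLD.called_dn;
    DELETE FROM calls_out_summary
        WHERE called_dn = OLD.called_dn AND call_count = 0;
END;
""")

conn.executemany(
    "INSERT INTO calls_out (called_dn) VALUES (?)",
    [("100",), ("100",), ("200",), ("300",)],
)
conn.execute("DELETE FROM calls_out WHERE called_dn = '300'")
conn.commit()

# The grouped counts now come from a small table instead of a full scan.
summary = dict(conn.execute(
    "SELECT called_dn, call_count FROM calls_out_summary ORDER BY called_dn"
))
```

The trade-off is a little extra work on every insert and delete in exchange for reading the counts from a table with one row per called_dn.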

Postgresql Query is very Slow

I have a table with 300000 rows, and when I run a simple query like
select * from diario_det;
it takes 41041 ms to return the rows. Is that normal? How can I optimize the query?
I use PostgreSQL 9.3 on CentOS 7.
Here is my table:
CREATE TABLE diario_det
(
cod_empresa numeric(2,0) NOT NULL,
nro_asiento numeric(8,0) NOT NULL,
nro_secue_pase numeric(4,0) NOT NULL,
descripcion_pase character varying(150) NOT NULL,
monto_debe numeric(16,3),
monto_haber numeric(16,3),
estado character varying(1) NOT NULL,
cod_pcuenta character varying(15) NOT NULL,
cod_local numeric(2,0) NOT NULL,
cod_centrocosto numeric(4,0) NOT NULL,
cod_ejercicio numeric(4,0) NOT NULL,
nro_comprob character varying(15),
conciliado_por character varying(10),
CONSTRAINT fk_diario_det_cab FOREIGN KEY (cod_empresa, cod_local, cod_ejercicio, nro_asiento)
REFERENCES diario_cab (cod_empresa, cod_local, cod_ejercicio, nro_asiento) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT fk_diario_det_pc FOREIGN KEY (cod_empresa, cod_pcuenta)
REFERENCES plan_cuenta (cod_empresa, cod_pcuenta) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
WITH (
OIDS=TRUE
);
ALTER TABLE diario_det
OWNER TO postgres;
-- Index: pk_diario_det_ax
-- DROP INDEX pk_diario_det_ax;
CREATE INDEX pk_diario_det_ax
ON diario_det
USING btree
(cod_pcuenta COLLATE pg_catalog."default", cod_local, estado COLLATE pg_catalog."default");
Very roughly, the size of one row is 231 bytes; times 300000 rows, that is 69300000 bytes (~69 MB) that have to be transferred from server to client.
I think that 41 seconds is a bit long, but the query is bound to be slow because of the amount of data that has to be loaded from disk and transferred.
You can optimise the query by
selecting just the columns you are going to use rather than all of them (if you need just cod_empresa, that would reduce the total amount of transferred data to ~1.2 MB, but the server would still have to iterate through all records, which is slow)
filtering only the rows you are going to use; a WHERE clause on indexed columns can really speed the query up
If you want to know what is happening in your query, play around with EXPLAIN and EXPLAIN ANALYZE.
Also, if you're running a dedicated database server, be sure to configure it properly so it can use more of the system's resources.
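The transfer-size estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 231-byte row size is the rough figure from the answer, and the 4 bytes per cod_empresa value is an assumption inferred from the ~1.2 MB figure, not a measured size.

```python
ROWS = 300_000
ROW_BYTES = 231      # rough per-row estimate taken from the answer
COLUMN_BYTES = 4     # assumption: implied by ~1.2 MB / 300000 rows

# Bytes sent to the client for SELECT * versus a single narrow column.
full_transfer = ROWS * ROW_BYTES
single_column_transfer = ROWS * COLUMN_BYTES

print(f"all columns: {full_transfer / 1e6:.1f} MB")
print(f"one column:  {single_column_transfer / 1e6:.1f} MB")
```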

SQL - Selecting random rows and combining into a new table

Here's the creation of my tables...
CREATE TABLE questions(
id INTEGER PRIMARY KEY AUTOINCREMENT,
question VARCHAR(256) UNIQUE NOT NULL,
rangeMin INTEGER,
rangeMax INTEGER,
level INTEGER NOT NULL,
totalRatings INTEGER DEFAULT 0,
totalStars INTEGER DEFAULT 0
);
CREATE TABLE games(
id INTEGER PRIMARY KEY AUTOINCREMENT,
level INTEGER NOT NULL,
inclusive BOOL NOT NULL DEFAULT 0,
active BOOL NOT NULL DEFAULT 0,
questionCount INTEGER NOT NULL,
completedCount INTEGER DEFAULT 0,
startTime DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE gameQuestions(
gameId INTEGER,
questionId INTEGER,
asked BOOL DEFAULT 0,
FOREIGN KEY(gameId) REFERENCES games(id),
FOREIGN KEY(questionId) REFERENCES questions(id)
);
I'll explain the full steps that I'm doing, and then I'll ask for input.
I need to...
1. Using a games.id value, look up the games.questionCount and games.level for that game.
2. Now that I have games.questionCount and games.level, look at all of the rows in the questions table with questions.level = games.level and select games.questionCount of them at random.
3. Take the rows (aka questions) from step 2 and put them into the gameQuestions table using the games.id value and the questions.id value.
What's the best way to accomplish this? I could do it with several different SQL queries, but I feel like someone really skilled with SQL could make it happen a bit more efficiently. I am using sqlite3.
This does it in one statement. Let's assume :game_id to be the game id you want to process.
insert into gameQuestions (gameId, questionId)
select :game_id, id
from questions
where level = (select level from games where id = :game_id)
order by random()
limit (select questionCount from games where id = :game_id);
@Tony: The SQLite documentation says LIMIT takes an expression. The above statement works fine using SQLite 3.8.0.2 and produces the desired results. I have not tested an older version.
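For anyone who wants to try the statement, here is a small sketch that runs it through Python's sqlite3 module against an in-memory database. The tables are trimmed to the columns the statement touches, and the sample questions, the single game row, and the assign_questions helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questions(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    question VARCHAR(256) UNIQUE NOT NULL,
    level INTEGER NOT NULL
);
CREATE TABLE games(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    level INTEGER NOT NULL,
    questionCount INTEGER NOT NULL
);
CREATE TABLE gameQuestions(
    gameId INTEGER,
    questionId INTEGER,
    asked BOOL DEFAULT 0,
    FOREIGN KEY(gameId) REFERENCES games(id),
    FOREIGN KEY(questionId) REFERENCES questions(id)
);
""")

# Hypothetical sample data: 10 questions at level 1 and 10 at level 2.
conn.executemany(
    "INSERT INTO questions (question, level) VALUES (?, ?)",
    [(f"Question {i}", 1 + i % 2) for i in range(20)],
)
# One game at level 2 that needs 3 questions; it gets id 1.
conn.execute("INSERT INTO games (level, questionCount) VALUES (2, 3)")

def assign_questions(conn, game_id):
    # The named parameter :game_id is bound once and reused three times.
    with conn:
        conn.execute("""
            INSERT INTO gameQuestions (gameId, questionId)
            SELECT :game_id, id
            FROM questions
            WHERE level = (SELECT level FROM games WHERE id = :game_id)
            ORDER BY random()
            LIMIT (SELECT questionCount FROM games WHERE id = :game_id)
        """, {"game_id": game_id})

assign_questions(conn, 1)
picked = conn.execute("""
    SELECT q.id, q.level
    FROM gameQuestions gq
    JOIN questions q ON q.id = gq.questionId
    WHERE gq.gameId = 1
""").fetchall()
```

Because of ORDER BY random(), the specific questions chosen differ from run to run, but the number of rows and their level are fixed by the games row.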