MariaDB Optimizing Big Table Selects With Table Partitioning - optimization

I have the following table that contains millions of rows (2M-30M):
rowId (int), intElementId (int), timestamp (int), price (float), volume (float)
This table gets truncated and refilled every time it is used; the primary operation is a SELECT filtered by intElementId (an integer) and the timestamp.
For example
SELECT * from myTable where intElementId = x and timestamp = y
Currently I'm using B-tree indexes on intElementId and timestamp. The operations together take quite a lot of time when there is more data in the table.
I have thought of dynamically generating multiple tables based on my intElementId (e.g. testTable_xxx, testTable_xxx), which is usually below a thousand, so I can query by timestamp more effectively on a smaller sample. While searching different questions about this I have been discouraged from doing it, because as I found out MariaDB is not optimized for something like this.
An alternative solution I have found is partitioning my table; I have seen a couple of applications of this but not many examples of it actually being used.
Partitioning by key seems to be what I want, but the MariaDB wiki once again discourages it and suggests partitioning by range instead.
My questions are :
Is partitioning by key okay?
Is having ~1000 partitions okay?
Is having fewer partitions than the number of distinct intElementIds bad, and how does it affect me?
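For reference, the range-based layout the wiki suggests would look roughly like this; a minimal sketch assuming timestamp holds Unix epoch seconds, with illustrative yearly boundaries (MariaDB requires the partitioning column to be part of every unique key, hence the composite primary key):
CREATE TABLE testTable (
    `id` INT NOT NULL AUTO_INCREMENT,
    intElementId INT,
    timestamp INT,
    price FLOAT,
    volume FLOAT,
    PRIMARY KEY (`id`, `timestamp`)
)
PARTITION BY RANGE (timestamp) (
    PARTITION p2022 VALUES LESS THAN (1672531200),   -- < 2023-01-01 UTC
    PARTITION p2023 VALUES LESS THAN (1704067200),   -- < 2024-01-01 UTC
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);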
To test my solution I have used the following, and it does seem to improve performance by about ~48%:
testData = generateTestData()
traditionalInsert(testData)
time = timeit.timeit(benchmarkDbTable, number=20)
partitionedInsert(testData)
time = timeit.timeit(benchmarkDbTable, number=20)
Where my tables are generated like so
Traditional
CREATE TABLE testTable (`id` INT NOT NULL AUTO_INCREMENT, PRIMARY KEY (`id`), intElementId INT, timestamp INT, price FLOAT, volume FLOAT)
ALTER TABLE testTable ADD INDEX `intElementId` (`intElementId`)
Partitioned, where the number of partitions equals the number of unique intElementId values (<1000)
CREATE TABLE testTable (`id` INT NOT NULL AUTO_INCREMENT, PRIMARY KEY (`id`, `intElementId`), intElementId INT, timestamp INT, price FLOAT, volume INT) PARTITION BY KEY (intElementId) PARTITIONS ?;
Testing like so:
cur.execute("SELECT COUNT(*) FROM testTable ")
cur.execute("SELECT DISTINCT intElementId FROM testTable ")
intElementId = cur.fetchall()
for element in intElementId:
cur.execute(f"SELECT COUNT(*) FROM testTable WHERE intElementId= {element[0]}")
for element in intElementId:
cur.execute(f"SELECT MIN(price), MAX(price) FROM testTable WHERE intElementId= {intElementId[0]}")
for elements in intElementId:
cur.execute(f"SELECT price FROM testTable WHERE intElementId= {intElementId[0]} ORDER BY price DESC LIMIT 10")

Related

Faster Sqlite insert from another table

I have an SQLite DB which I am doing updates on and it's very slow. I am wondering if I am doing it the best way or if there is a faster way. My tables are:
create table files(
fileid integer PRIMARY KEY,
name TEXT not null,
sha256 TEXT,
created INT,
mtime INT,
inode INT,
nlink INT,
fsno INT,
sha_id INT,
size INT not null
);
create table fls2 (
fileid integer PRIMARY KEY,
name TEXT not null UNIQUE,
size INT not null,
sha256 TEXT not null,
fs2,
fs3,
fs4,
fs7
);
Table 'files' is actually in an attached DB named ttb. I am then doing this:
UPDATE fls2
SET fs3 = (
SELECT inode || 'X' || mtime || 'X' || nlink
FROM
ttb.files
WHERE
ttb.files.fsno = 3
AND
fls2.name = ttb.files.name
AND
fls2.sha256 = ttb.files.sha256
);
So the idea is, fls2 has values in 'name' which are also present in ttb.files.name. In ttb.files there are other parameters which I want to insert into the corresponding rows in fls2. The query works, but I assume the matching up of the two tables is taking the time, and I wonder if there's a more efficient way to do it. There are indexes on each column in fls2 but none on files. I am doing it as a transaction, with pragma journal = memory (although SQLite seems to be ignoring that because a journal file is being created).
It seems slow, so far about 90 minutes for around a million rows in each table.
One CPU is pegged, so I assume it's not disk-bound.
Can anyone suggest a better way to structure the query?
EDIT: EXPLAIN QUERY PLAN
|--SCAN TABLE fls2
`--CORRELATED SCALAR SUBQUERY 1
`--SCAN TABLE files
Not sure what that means though. It carries out the SCAN TABLE files for each SCAN TABLE fls2 hit?
EDIT2:
Well blimey. Ctrl-C the query, which had been running 2.5 hours at that point, exit SQLite, run sqlite3 with the files DB, create an index on (sha256, name) - 1 minute or so. Exit that, run SQLite with the main DB. EXPLAIN shows that now the latter scan is done with the index. Run the update - takes 150 seconds. Compared to >150 minutes, that's a heck of a speed-up. Thanks for the assistance.
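For reference, the index described above would be something like this (the index name is illustrative):
CREATE INDEX idx_files_sha256_name ON files(sha256, name);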
TIA, Pete
There are indexes on each column in fls2
Indexes are used for faster selection. They slow down inserts and updates. Maybe removing the one for fls2.fs3 helps?
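A sketch of that; the actual index name on fls2.fs3 will differ, so list the indexes first:
PRAGMA index_list('fls2');              -- find the real index name covering fs3
DROP INDEX IF EXISTS idx_fls2_fs3;      -- hypothetical name; recreate it after the UPDATE if still needed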
Not an expert on SQLite, but on some databases it is more performant to insert the data into a temporary table, delete the original rows, then insert them back from the temp table.
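A staging table has to exist first; a minimal sketch that just clones the column layout of fls2 (the name tmptab matches the statements below; constraints and indexes are deliberately not copied):
CREATE TEMP TABLE tmptab AS
SELECT * FROM fls2 WHERE 0;   -- copies the column layout only, no rows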
INSERT INTO tmptab
SELECT fls2.fileid,
       fls2.name,
       fls2.size,
       fls2.sha256,
       fls2.fs2,
       ttb.files.inode || 'X' || ttb.files.mtime || 'X' || ttb.files.nlink,
       fls2.fs4,
       fls2.fs7
FROM fls2
INNER JOIN ttb.files
        ON fls2.name = ttb.files.name
       AND fls2.sha256 = ttb.files.sha256
       AND ttb.files.fsno = 3;

DELETE FROM fls2
WHERE EXISTS (SELECT 1 FROM tmptab WHERE tmptab.fileid = fls2.fileid);

INSERT INTO fls2 SELECT * FROM tmptab;

SQLite slow select query - how to make it faster

In SQLite I have a large DB (~35Mb).
It contains a table with the following syntax:
CREATE TABLE "log_temperature" (
"id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
"date" datetime NOT NULL,
"temperature" varchar(20) NOT NULL
)
Now when I want to search for data within a period, it is too slow on an embedded system:
$ time sqlite3 sample.sqlite "select MIN(id) from log_temperature where
date BETWEEN '2019-08-13 00:00:00' AND '2019-08-13 23:59:59';"
331106
real 0m2.123s
user 0m0.847s
sys 0m0.279s
Note1: ids are running from 210610 to 331600.
Note2: if I run 'SELECT id FROM log_temperature ORDER BY id ASC LIMIT 1', it gives the exact same timing as with the 'MIN' function.
I want to have the 'real time of 0m2.123s' to be as close to 0m0.0s as possible.
What are my options for making this faster? (Without removing hundreds of thousands of rows of data?)
ps.: embedded system parameters are not important here. This shall be solved by optimizing the query or the underlying schema.
First, I would recommend that you write the query as:
select MIN(id)
from log_temperature
where date >= '2019-08-13' and date < '2019-08-14';
This doesn't impact performance, but it makes the query easier to write -- and no need to fiddle with times.
Then, you want an index on (date, id):
create index idx_log_temperature_date_id on log_temperature(date, id);
I don't think id is needed in the index, if it is declared as the primary key of the table.
Can you create an index on the date?
CREATE INDEX index_name ON log_temperature(date);
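Either way, you can check that the index is actually being used with EXPLAIN QUERY PLAN (the exact output wording varies by SQLite version):
EXPLAIN QUERY PLAN
SELECT MIN(id)
FROM log_temperature
WHERE date >= '2019-08-13' AND date < '2019-08-14';
-- expect something like: SEARCH TABLE log_temperature USING INDEX ... (date>? AND date<?)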

Cassandra 1.1 composite keys, columns, and filtering in CQL 3

I wish to have a table something as follows:
CREATE TABLE ProductFamilies (
ID varchar,
PriceLow int,
PriceHigh int,
MassLow int,
MassHigh int,
MnfGeo int,
MnfID bigint,
Data varchar,
PRIMARY KEY (ID)
);
There are 13 fields in total. Most of these represent buckets. Data is a JSON of product family IDs, which are then used in a subsequent query. Given how Cassandra works, the column names under the hood will be the values. I wish to filter these.
I wish to run queries as follows:
SELECT Data FROM MyApp.ProductFamilies WHERE ID IN (?, ?, ?) AND PriceLow >= ?
AND PriceHigh <= ? AND MassLow >= ? AND MassHigh <= ? and MnfGeo >= ? AND
MnfGeo <= ?
I read that Cassandra can only execute WHERE predicates against composite row keys or indexed columns. Is this still true? If so, I would have to make the columns < Data part of the PK.
Is it still the case that one has to include all columns from left to right and cannot skip any?
Are there any non-optimum points in my design?
I would like to add a column "Materials", which is an array of possible materials in a product family. Think pizza toppings, and querying "WHERE Materials IN ('Pineapple')". Without creating a separate inverted index of materials and then performing a manual intersection against the above query, is there any other [more elegant] way of handling this in Cassandra?
If you specify the exact PK you are looking up, as you propose here (id IN ...), you can use whatever expressions you like in the remaining predicates. There are no restrictions.
List collections are supported starting in 1.2.0, which is scheduled for release at the end of October. Indexed querying of collection contents may or may not be supported.
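Once collections are available, the Materials column could be added roughly like this (a sketch assuming CQL 3 on Cassandra 1.2+; querying inside the list would still need its own index support):
ALTER TABLE ProductFamilies ADD Materials list<text>;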
Basically, to support your queries you need to have
create column family ProductFamilies with
comparator='CompositeType(UTF8Type, Int32Type, Int32Type, Int32Type, Int32Type, Int32Type, LongType, UTF8Type)'
and key_validation_class='UTF8Type'
or
CREATE TABLE ProductFamilies (
ID varchar,
PriceLow int,
PriceHigh int,
MassLow int,
MassHigh int,
MnfGeo int,
MnfID bigint,
Data varchar,
PRIMARY KEY (ID, PriceLow, PriceHigh, MassLow, MassHigh, MnfGeo, MnfID, Data)
);
Now you can query
SELECT Data FROM MyApp.ProductFamilies WHERE ID IN (?, ?, ?) AND PriceLow >= ?
AND PriceHigh <= ? AND MassLow >= ? AND MassHigh <= ? and MnfGeo >= ? AND
MnfGeo <= ?
Provided you don't skip any column from left to right [if not as a filter, then at least as a *] and all your values are in the column names rather than the value.
One more thing you should understand about composite columns is that a column slice must be contiguous. So pricelow >= 10 AND pricelow <= 40 will return a contiguous slice, but filtering the result set with masslow and other columns will not work, as it is not going to result in a contiguous slice. BTW, pricelow = 10 AND masslow <= 20 AND masslow >= 10 should work [tested with phpcassa], as it will result in a contiguous slice once again.
Otherwise, create one or more secondary indexes on any of your columns. Then you are allowed to query based on column values, provided you always have at least one of the indexed fields in the query.
http://www.datastax.com/docs/1.1/ddl/indexes
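A sketch of such a secondary index in CQL 3 (the index name and column choice are illustrative):
CREATE INDEX idx_productfamilies_mnfgeo ON ProductFamilies (MnfGeo);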
Regarding your Materials question, as far as I know there is no other way than maintaining an inverted index if it is going to be a multi-valued column.
It would be great if #jbellis verifies this

Sybase Check Constraint Evaluation

I'm trying to formulate some check constraints in SQL Anywhere 9.0.
Basically I have a schema like this:
CREATE TABLE limits (
id INT IDENTITY PRIMARY KEY,
count INT NOT NULL
);
CREATE TABLE sum (
user INT,
limit INT,
my_number INT NOT NULL CHECK(my_number > 0),
PRIMARY KEY (user, limit)
);
I'm trying to enforce the constraint that the sum of my_number for each limit is at most count in the limits table.
I've tried
CHECK ((SELECT sum(my_number) FROM sum WHERE limit = limit) <= (SELECT count FROM limits WHERE id = limit))
and
CHECK (((SELECT sum(my_number) FROM sum WHERE limit = limit) + my_number) <= (SELECT count FROM limits WHERE id = limit))
and they both seem not to do the correct thing. They are both off by one (meaning once you get a negative number the insertion will fail, but not before that).
So my question is, with what version of the table are these subqueries being executed against? Is it the table before the insertion happens, or does the subquery check for consistency after the insert happens, and rolls back if it finds it invalid?
I do not really understand what you are trying to enforce here, but based on this help topic:
Using CHECK constraints on columns
Once a CHECK condition is in place, future values are evaluated
against the condition before a row is modified.
I would go for a BEFORE INSERT trigger. You have more options and can produce a better error message.
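A rough, untested sketch of such a trigger in SQL Anywhere's Watcom-SQL dialect (the trigger, variable names, and SQLSTATE value are illustrative; limit, count, and sum are quoted because they collide with keywords/functions):
CREATE TRIGGER trg_check_limit BEFORE INSERT ON "sum"
REFERENCING NEW AS new_row
FOR EACH ROW
BEGIN
    DECLARE current_total INT;
    DECLARE max_count INT;
    DECLARE limit_exceeded EXCEPTION FOR SQLSTATE '99001';
    -- total already booked against this limit
    SELECT COALESCE(SUM(my_number), 0) INTO current_total
      FROM "sum" WHERE "limit" = new_row."limit";
    -- configured maximum for this limit
    SELECT "count" INTO max_count
      FROM limits WHERE id = new_row."limit";
    IF current_total + new_row.my_number > max_count THEN
        SIGNAL limit_exceeded;   -- reject the row with a custom error
    END IF;
END;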

Create a unique primary key (hash) from database columns

I have this table which doesn't have a primary key.
I'm going to insert some records into a new table to analyze them, and I'm thinking of creating a new primary key from the values of all the available columns.
If this were a programming language like Java I would:
int hash = column1 * 31 + column2 * 31 + column3 * 31
Or something like that. But this is SQL.
How can I create a primary key from the values of the available columns? It won't work for me to simply mark all the columns as PK, because what I need to do is compare them with data from another DB table.
My table has 3 numbers and a date.
EDIT What my problem is
I think a bit more of background is needed. I'm sorry for not providing it before.
I have a database (dm) that is being updated every day from another db (original source). It has records from the past two years.
Last month (July) the update process broke, and for a month there was no data being updated into the dm.
I manually created a table with the same structure in my Oracle XE, and I copied the records from the original source into my db (myxe). I copied only records from July to create a report needed by the end of the month.
Finally, on Aug 8 the update process got fixed and the records which had been waiting to be migrated by this automatic process got copied into the database (from originalsource to dm).
This process cleans the data out of the original source once it is copied (into dm).
Everything looked fine, but we have just realized that some of the records got lost (about 25% of July).
So, what I want to do is use my backup (myxe) and insert into the database (dm) all those missing records.
The problems here are:
They don't have a well defined PK.
They are in separate databases.
So I thought that if I could create a unique PK from both tables which gave the same number, I could tell which rows were missing and insert them.
EDIT 2
So I did the following in my local environment:
select a.* from the_table#PRODUCTION a , the_table b where
a.idle = b.idle and
a.activity = b.activity and
a.finishdate = b.finishdate
Which returns all the rows that are present in both databases (the ... intersection?). I've got 2,000 records.
What I'm going to do next is delete them all from the target db and then just insert them all from my db into the target table.
I hope I don't get into something worse :-S :-S
The danger of creating a hash value by combining the 3 numbers and the date is that it might not be unique and hence cannot be used safely as a primary key.
Instead I'd recommend using an autoincrementing ID for your primary key.
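In Oracle (pre-12c, which covers the asker's XE instance) an autoincrementing ID is typically implemented with a sequence plus a BEFORE INSERT trigger; a minimal sketch with illustrative names, reusing the pk_col column from the statements below:
CREATE SEQUENCE mytable_seq;

CREATE OR REPLACE TRIGGER mytable_pk_trg
BEFORE INSERT ON mytable
FOR EACH ROW
BEGIN
  -- populate the surrogate key from the sequence
  SELECT mytable_seq.NEXTVAL INTO :NEW.pk_col FROM dual;
END;
/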
Just create a surrogate key:
ALTER TABLE mytable ADD pk_col INT
UPDATE mytable
SET pk_col = rownum
ALTER TABLE mytable MODIFY pk_col INT NOT NULL
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col)
or this:
ALTER TABLE mytable ADD pk_col RAW(16)
UPDATE mytable
SET pk_col = SYS_GUID()
ALTER TABLE mytable MODIFY pk_col RAW(16) NOT NULL
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col)
The latter uses GUIDs, which are unique across databases, but they consume more space and are much slower to generate (your INSERTs will be slow).
Update:
If you need to create same PRIMARY KEYs on two tables with identical data, use this:
MERGE
INTO mytable v
USING (
SELECT rowid AS rid, rownum AS rn
FROM mytable
ORDER BY
col1, col2, col3
)
ON (v.rowid = rid)
WHEN MATCHED THEN
UPDATE
SET pk_col = rn
Note that the tables should be identical up to a single row (i.e. have the same number of rows with the same data in them).
Update 2:
For your very problem, you don't need a PK at all.
If you just want to select the records missing in dm, use this one (on dm side)
SELECT *
FROM mytable#myxe
MINUS
SELECT *
FROM mytable
This will return all records that exist in mytable#myxe but not in mytable#dm
Note that it will shrink all duplicates if any.
Assuming that you have ensured uniqueness...you can do almost the same thing in SQL. The only problem will be the conversion of the date to a numeric value so that you can hash it.
Select Table2.SomeFields
FROM Table1 LEFT OUTER JOIN Table2 ON
(Table1.col1 * 31) + (Table1.col2 * 31) + (Table1.col3 * 31) +
((DatePart(year,Table1.date) + DatePart(month,Table1.date) + DatePart(day,Table1.date) )* 31) = Table2.hashedPk
The above query would work for SQL Server, the only difference for Oracle would be in terms of how you handle the date conversion. Moreover, there are other functions for converting dates in SQL Server as well, so this is by no means the only solution.
And, you can combine this with Quassnoi's SET statement to populate the new field as well. Just use the left side of the Join condition logic for the value.
If you're loading your new table with values from the old table, and you then need to join the two tables, you can only "properly" do this if you can uniquely identify each row in the original table. Quassnoi's solution will allow you to do this, IF you can first alter the old table by adding a new column.
If you cannot alter the original table, generating some form of hash code based on the columns of the old table would work -- but, again, only if the hash codes uniquely identify each row. (Oracle has checksum functions, right? If so, use them.)
If hash code uniqueness cannot be guaranteed, you may have to settle for a primary key composed of as many columns as are required to ensure uniqueness (e.g. the natural key). If there is no natural key, well, I heard once that Oracle provides a rownum for each row of data; could you use that?
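On the Oracle side, one such checksum-style function is ORA_HASH; a rough sketch using the column names from EDIT 2 (collisions are still possible, so treat the result as a matching aid rather than a guaranteed unique key):
SELECT ORA_HASH(idle || '|' || activity || '|' || TO_CHAR(finishdate, 'YYYYMMDDHH24MISS')) AS row_hash
FROM the_table;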