Efficient inserts with duplicate checks for large tables in Postgres

I'm currently working on a project collecting a very large amount of data from a network of wireless modems out in the field. We have a table 'readings' that looks like this:
CREATE TABLE public.readings (
id INTEGER PRIMARY KEY NOT NULL DEFAULT nextval('readings_id_seq'::regclass),
created TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT now(),
timestamp TIMESTAMP WITHOUT TIME ZONE NOT NULL,
modem_serial CHARACTER VARYING(255) NOT NULL,
channel1 INTEGER NOT NULL,
channel2 INTEGER NOT NULL,
signal_strength INTEGER,
battery INTEGER,
excluded BOOLEAN NOT NULL DEFAULT false
);
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (timestamp, modem_serial);
CREATE INDEX ix_readings_timestamp ON readings USING BTREE (timestamp);
CREATE INDEX ix_readings_modem_serial ON readings USING BTREE (modem_serial);
It's important for the integrity of the system that we never have two readings from the same modem with the same timestamp, hence the unique index.
Our challenge at the moment is to find a performant way of inserting readings. We often have to insert millions of rows as we bring in historical data, and when adding to an existing base of 100 million plus readings, this can get kind of slow.
Our current approach is to import batches of 10,000 readings into a temporary_readings table, which is essentially an unindexed copy of readings. We then run the following SQL to merge it into the main table and remove duplicates:
INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
SELECT DISTINCT ON (timestamp, modem_serial) created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
FROM temporary_readings
WHERE NOT EXISTS (
    SELECT * FROM readings
    WHERE timestamp = temporary_readings.timestamp
      AND modem_serial = temporary_readings.modem_serial
)
ORDER BY timestamp, modem_serial ASC;
This works well, but takes ~20 seconds per 10,000 row block to insert. My question is twofold:
Is this the best way to approach the problem? I'm relatively new to projects with these sorts of performance demands, so I'm curious to know if there are better solutions.
What steps can I take to speed up the insert process?
Thanks in advance!

Your query idea is okay. I would try timing it for 100,000 rows in the batch, to start to get an idea of an optimal batch size.
However, the distinct on is slowing things down. Here are two ideas.
The first is to assume that duplicates in batches are quite rare. If this is true, try inserting the data without the distinct on. If that fails, then run the code again with the distinct on. This complicates the insertion logic, but it might make the average insertion much shorter.
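A minimal sketch of this try-then-fall-back idea, assuming PostgreSQL 9.0 or later (for DO blocks) and the table and column names from the question. The optimistic insert runs first; if a duplicate inside the batch trips the unique index, the block's implicit savepoint undoes that insert and the DISTINCT ON version runs instead.

```sql
DO $$
BEGIN
    -- Optimistic path: assume the batch contains no internal duplicates.
    INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
    SELECT t.created, t.timestamp, t.modem_serial, t.channel1, t.channel2, t.signal_strength, t.battery
    FROM temporary_readings t
    WHERE NOT EXISTS (
        SELECT 1 FROM readings r
        WHERE r.timestamp = t.timestamp
          AND r.modem_serial = t.modem_serial
    );
EXCEPTION WHEN unique_violation THEN
    -- Fallback path: the insert above was rolled back, so retry with de-duplication.
    INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
    SELECT DISTINCT ON (t.timestamp, t.modem_serial)
           t.created, t.timestamp, t.modem_serial, t.channel1, t.channel2, t.signal_strength, t.battery
    FROM temporary_readings t
    WHERE NOT EXISTS (
        SELECT 1 FROM readings r
        WHERE r.timestamp = t.timestamp
          AND r.modem_serial = t.modem_serial
    )
    ORDER BY t.timestamp, t.modem_serial;
END
$$;
```

Whether this pays off depends on how rare in-batch duplicates really are, since every failed attempt costs a wasted insert before the retry.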
The second is to build an index on temporary_readings(timestamp, modem_serial) (not a unique index). Postgres will take advantage of this index for the insertion logic -- and sometimes building an index and using it is faster than alternative execution plans. If this does work, you might try larger batch sizes.
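For illustration, the staging-table index could be created like this. The index name is a placeholder, and the sketch assumes temporary_readings is a regular or temporary table that is reloaded for each batch.

```sql
-- Non-unique index on the staging table, matching the columns of the duplicate check.
CREATE INDEX ix_temporary_readings_ts_serial
    ON temporary_readings (timestamp, modem_serial);

-- Refresh planner statistics so the freshly loaded batch and its index are costed sensibly.
ANALYZE temporary_readings;
```

It is worth comparing timings with and without the index, because building it adds work to every batch.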
There is a third solution, which is to use ON CONFLICT. That would allow the insertion itself to ignore duplicate values. It is only available in Postgres 9.5 and later, though.
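A sketch of the ON CONFLICT variant, assuming Postgres 9.5 or later and the existing unique index on (timestamp, modem_serial) from the question. Rows that would violate that index are skipped instead of raising an error, so neither the NOT EXISTS check nor the DISTINCT ON is needed.

```sql
INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
SELECT created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
FROM temporary_readings
-- The conflict target names the columns of the unique index.
ON CONFLICT (timestamp, modem_serial) DO NOTHING;
```

Note that when a batch contains several rows for the same (timestamp, modem_serial), which of them survives is not controlled here, whereas the original query picks one through DISTINCT ON and ORDER BY.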

Adding to a table that already contains 100 million indexed records will be slow no matter what! You can probably speed things up somewhat by taking a fresh look at your indexes.
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (timestamp, modem_serial);
CREATE INDEX ix_readings_timestamp ON readings USING BTREE (timestamp);
CREATE INDEX ix_readings_modem_serial ON readings USING BTREE (modem_serial);
At the moment you have three indexes, all built from the same two columns. Can't you manage with just the unique index?
I don't know what your other queries are like but your WHERE NOT EXISTS query can make use of this unique index.
If you have queries whose WHERE clause filters only on the modem_serial field, your unique index is unlikely to be used. However, if you flip the columns in that index, it will be!
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (modem_serial, timestamp);
To quote from the manual:
A multicolumn B-tree index can be used with query conditions that
involve any subset of the index's columns, but the index is most
efficient when there are constraints on the leading (leftmost)
columns.
The order of the columns in the index matters.
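Putting the suggestion together, one possible migration is sketched below. The new index name is hypothetical, and the sketch assumes that some queries still filter on timestamp alone, which is why ix_readings_timestamp is kept.

```sql
-- Build the replacement first so uniqueness stays enforced throughout.
CREATE UNIQUE INDEX _modemserial_timestamp_uc
    ON readings USING BTREE (modem_serial, timestamp);

-- modem_serial now leads the unique index, so these two become redundant.
DROP INDEX _timestamp_modemserial_uc;
DROP INDEX ix_readings_modem_serial;

-- ix_readings_timestamp is left in place for queries that filter on timestamp only.
```

On a table with 100 million rows the index build will take a while and blocks writes by default; CREATE UNIQUE INDEX CONCURRENTLY is an option if the table has to stay writable during the build.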

Related

PostgreSQL index reduces data size but makes the query slower

I have a PostgreSQL table with 7.9GB of JSON data. My goal is to perform aggregations on the whole table on a daily basis, the aggregation results will later be used for analytical reports in Google Data Studio.
One of the queries I'm trying to run looks as follows:
explain analyze
select tender->>'procurementMethodType' as procurement_method,
tender->>'status' as tender_status,
sum(cast(tender->'value'->>'amount' as decimal)) as total_expected_value
from tenders
group by 1,2
The query plan and execution time are the following:
The problem is that the database has to scan through all the 7.9GB of data, even though the query uses only 3 field values out of approximately 100. So I decided to create the following index:
create index on tenders((tender->>'procurementMethodType'), (tender->>'status'), (cast(tender->'value'->>'amount' as decimal)))
The size of the index is 44MB, which is much smaller than the size of the entire table, so I expect that the query should be much faster. However, when I run the same query with the index created, I get the following result:
The query with index is slower! How can this be possible?
EDIT: the table itself contains two columns: the ID column and the jsonb data column:
create table tenders (
id uuid primary key,
tender jsonb
)
The code that does an index only scan is somewhat deficient in this case. It thinks it needs "tender" to be available in the index in order to fulfill the demand for cast(tender->'value'->>'amount' as decimal). It fails to realize that having cast(tender->'value'->>'amount' as decimal) itself in the index obviates the need for "tender" itself. So it is doing a regular index scan, in which it has to jump from the index to the table for every row it will return, to fish out "tender" and then compute cast(tender->'value'->>'amount' as decimal). This means it is jumping all over the table doing random io, which is much slower than just reading the table sequentially and then doing a sort.
You could try an index on ((tender->>'procurementMethodType'), (tender->>'status'), tender). This index would be huge (as large as the table) if it can even be built, but would take away the need for a sort.
But your current query finishes in 30 seconds. For a query that is only run once a day, does it really need to be faster than this?
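For reference, the wider index described above would be declared roughly like this. The index name is a placeholder, and the statement may fail outright if individual tender documents are too large to fit in a B-tree index entry, which is the "if it can even be built" caveat.

```sql
-- Expression index that also carries the whole jsonb column as its last key.
CREATE INDEX tenders_method_status_tender_idx
    ON tenders ((tender->>'procurementMethodType'), (tender->>'status'), tender);
```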

SQL index for date range query

For a few days, I've been struggling to improve the performance of my database, and there are some issues regarding indexing in SQL Server that I'm still kind of confused about.
I'll try to be as informative as I can.
My database currently contains about 100k rows and will keep growing, therefore I'm trying to find a way to make it work faster.
I'm also writing to this table, so if your suggestion would drastically affect the writing time, please let me know.
The overall goal is to select all rows with a specific name that fall within a date range.
That will usually mean selecting over 3,000 rows out of a lot lol ...
Table schema:
CREATE TABLE [dbo].[reports]
(
[id] [int] IDENTITY(1,1) NOT NULL,
[IsDuplicate] [bit] NOT NULL,
[IsNotValid] [bit] NOT NULL,
[Time] [datetime] NOT NULL,
[ShortDate] [date] NOT NULL,
[Source] [nvarchar](350) NULL,
[Email] [nvarchar](350) NULL,
CONSTRAINT [PK_dbo.reports]
PRIMARY KEY CLUSTERED ([id] ASC)
) ON [PRIMARY]
This is the SQL query I'm using:
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate BETWEEN '2017-10-13' AND '2017-10-15'
As I understood it, my best approach to improve efficiency without hurting the writing time too much would be to create a nonclustered index on Source and ShortDate.
Which I did like so, index schema:
CREATE NONCLUSTERED INDEX [Source&Time]
ON [dbo].[reports]([Source] ASC, [ShortDate] ASC)
Now we are getting to the tricky part which got me completely lost: the index above sometimes works, sometimes half works and sometimes doesn't work at all....
(not sure if it matters but currently 90% of the database rows has the same Source, although this won't stay like that for long)
With the query below, the index isn't used at all. I'm using SQL Server 2014, and the Execution Plan says it only uses a clustered index scan:
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate BETWEEN '2017-10-10' AND '2017-10-15'
With this query, the index isn't used at all either, although I'm getting a suggestion from SQL Server to create an index with the date first and source second... I read that the index should follow the order of the columns in the query? Also it says to include all the columns I'm selecting, is that a must?... again, I read that I should include in the index only the columns I'm searching on.
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate = '2017-10-13'
SQL Server index suggestion -
/* The Query Processor estimates that implementing the following
index could improve the query cost by 86.2728%. */
/*
USE [db]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[reports] ([ShortDate], [Source])
INCLUDE ([id], [IsDuplicate], [IsNotValid], [Time], [Email])
GO
*/
Now I tried creating the index SQL Server suggested and it works; it seems to use the nonclustered index 100% for both of the queries above.
I tried to use this index without the included columns and it doesn't work... it seems I must include in the index all the columns I'm selecting?
BTW it also works with the index I made if I include all the columns.
To summarize: it seems the order of the columns in the index didn't matter, as it worked with both Source + ShortDate and ShortDate + Source.
But for some reason it's a must to include all the columns... (which will drastically affect writes to this table?)
Thanks a lot for reading. My goal is to understand why this stuff happens and what I should do otherwise (not just the solution, as I'll need to apply it on other projects as well).
Cheers :)
Indexing in SQL Server is part know-how from long experience (and many hours of frustration), and part black magic. Don't beat yourself up over that too much - that's what a place like SO is ideal for - lots of brains, lots of experience from many hours of optimizing, that you can tap into.
I read that the index should be made by the order the query is?
If you read this - it is absolutely NOT TRUE - the order of the columns is relevant - but in a different way: a compound index (made up from multiple columns) will only ever be considered if you specify the n left-most columns in the index definition in your query.
Classic example: a phone book with an index on (city, lastname, firstname). Such an index might be used:
in a query that specifies all three columns in its WHERE clause
in a query that uses city and lastname (find all "Miller" in "Detroit")
or in a query that only filters by city
but it can NEVER EVER be used if you want to search only for firstname ..... that's the trick about compound indexes you need to be aware of. But if you always use all columns from an index, their ordering is typically not really relevant - the query optimizer will handle this for you.
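The phone book example can be written out as T-SQL. The table, its columns and the index name below are hypothetical and exist only to illustrate the left-most-columns rule.

```sql
-- Hypothetical phone book table.
CREATE TABLE dbo.phonebook (
    id        INT IDENTITY(1,1) PRIMARY KEY,
    city      NVARCHAR(100) NOT NULL,
    lastname  NVARCHAR(100) NOT NULL,
    firstname NVARCHAR(100) NOT NULL,
    phone     VARCHAR(30)   NOT NULL
);

CREATE NONCLUSTERED INDEX IX_phonebook_city_last_first
    ON dbo.phonebook (city, lastname, firstname);

-- These filter on the 3, 2 and 1 left-most index columns, so an index seek is possible.
SELECT id FROM dbo.phonebook WHERE city = N'Detroit' AND lastname = N'Miller' AND firstname = N'Anna';
SELECT id FROM dbo.phonebook WHERE city = N'Detroit' AND lastname = N'Miller';
SELECT id FROM dbo.phonebook WHERE city = N'Detroit';

-- firstname is not a leading column, so the index cannot be used to seek;
-- at best the whole index is scanned.
SELECT id FROM dbo.phonebook WHERE firstname = N'Anna';
```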
As for the included columns - those are stored only in the leaf level of the nonclustered index - they are NOT part of the search structure of the index, and you cannot specify filter values for those included columns in your WHERE clause.
The main benefit of these included columns is this: if you search in a nonclustered index, and in the end, you actually find the value you're looking for - what do you have available at that point? The nonclustered index will store the columns in the non-clustered index definition (ShortDate and Source), and it will store the clustering key (if you have one - and you should!) - but nothing else.
So in this case, once a match is found, and your query wants everything from that table, SQL Server has to do what is called a Key lookup (often also referred to as a bookmark lookup) in which it takes the clustered key and then does a Seek operation against the clustered index, to get to the actual data page that contains all the values you're looking for.
If you have included columns in your index, then the leaf level page of your non-clustered index contains
the columns as defined in the nonclustered index
the clustering key column(s)
all those additional columns as defined in your INCLUDE statement
If those columns "cover" your query, e.g. provide all the values that your query needs, then SQL Server is done once it finds the value you searched for in the nonclustered index - it can take all the values it needs from that leaf-level page of the nonclustered index, and it does NOT need to do another (expensive) key lookup into the clustering index to get the actual values.
Because of this, trying to always explicitly specify only those columns you really need in your SELECT can be beneficial - in this case, you might be able to create an efficient covering index that provides all the values for your SELECT - always using SELECT * makes that really hard or next to impossible.....
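As a sketch of such a covering index for the reports table from the question: the index name and the choice of Time and Email as the only extra columns the query needs are assumptions made for illustration.

```sql
-- Key columns are used for searching; INCLUDE columns are stored only at the leaf level.
CREATE NONCLUSTERED INDEX IX_reports_Source_ShortDate_covering
    ON dbo.reports ([Source] ASC, [ShortDate] ASC)
    INCLUDE ([Time], [Email]);

-- Every column referenced here is in the index (id comes along as the clustering key),
-- so no key lookup into the clustered index is needed.
SELECT [id], [Time], [Email]
FROM dbo.reports
WHERE [Source] = N'name1'
  AND [ShortDate] BETWEEN '2017-10-13' AND '2017-10-15';
```

The narrower the SELECT list, the fewer columns have to be included, which keeps the extra write cost of the index down.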
In general, you want the index to be from most selective (i.e. filtering out the most possible records) to least selective; if a column has low cardinality, the query optimizer may ignore it.
That makes intuitive sense - if you have a phone book, and you're looking for people called "smith", with the initial "A", you want to start with searching for "smith" first, and then the "A"s, rather than all people whose initial is "A" and then filter out those called "Smith". After all, the odds are that one in 26 people have the initial "A".
So, in your example, I guess you have a wide range of values in short date - so that's the first column the query optimizer is trying to filter out. You say you have few different values in "source", so the query optimizer may decide to ignore it; in that case, the second column in that index is no use either.
The order of the conditions in the WHERE clause is irrelevant - you can swap them round and get exactly the same results, so the query optimizer ignores it.
EDIT:
So, yes, make the index. Imagine you have a pile of cards to sort - in your first run, you want to remove as many cards as possible. Assuming it's all evenly spread - if you have 1000 separate short_dates over a million rows, that means you end up with 1000 items if your first run starts on short_date; if you sort by source, you have 100000 rows.
The included columns of an index are for the columns you are selecting.
Because you do select * (which isn't good practice), the index won't be used: it would have to look up every matching row in the table to get the values for the columns that are not in the index.
For your scenario, I would drop the default clustered index (if there is one) and create a new clustered index with the following statement:
USE [db]
GO
CREATE CLUSTERED INDEX CIX_reports
ON [dbo].[reports] ([ShortDate],[Source])
GO
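Since the table's existing clustered index is the one backing the primary key, it cannot simply be dropped as an index. One way to carry out the change is sketched below, assuming no foreign keys reference the primary key and that rebuilding the table during the change is acceptable.

```sql
-- Remove the clustered primary key; the table briefly becomes a heap.
ALTER TABLE [dbo].[reports] DROP CONSTRAINT [PK_dbo.reports];

-- Cluster the table on the columns used for filtering.
CREATE CLUSTERED INDEX CIX_reports
    ON [dbo].[reports] ([ShortDate], [Source]);

-- Re-create the primary key on id, this time as a nonclustered index.
ALTER TABLE [dbo].[reports]
    ADD CONSTRAINT [PK_dbo.reports] PRIMARY KEY NONCLUSTERED ([id]);
```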

Is it advised to index the field if I envision retrieving all records corresponding to positive values in that field?

I have a table with definition somewhat like the following:
create table offset_table (
id serial primary key,
"offset" numeric NOT NULL,
... other fields...
);
The table has about 70 million rows in it.
I envision doing the following query many times
select * from offset_table where "offset" > 0;
For speed issues, I am wondering whether it would be advised to create an index like:
create index on offset_table("offset");
I am trying to avoid creation of unnecessary indices on this table as it is pretty big already.
As you mentioned in the comments, it would be ~70% of rows that match the offset > 0 predicate.
In that case the index would not be beneficial, since PostgreSQL (and basically every other DBMS) would prefer a full table scan instead. That is because a full scan would be faster than alternating between reading the index sequentially and reading the table randomly.
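A quick way to check both the selectivity and the planner's choice is sketched below. The column is double-quoted because offset is a reserved word in PostgreSQL; note that both statements read the whole table, so they are best run once rather than routinely.

```sql
-- Fraction of rows that match the predicate (about 0.7 in the case described above).
SELECT avg(("offset" > 0)::int) AS matching_fraction
FROM offset_table;

-- Shows which plan is chosen and how long it actually takes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM offset_table WHERE "offset" > 0;
```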

How to understand and avoid FULL TABLE SCAN

I have a table that is responsible for storing logs.
The DDL is this:
CREATE TABLE LOG(
"ID_LOG" NUMBER(12,0) NOT NULL ENABLE,
"DATA" DATE NOT NULL ENABLE,
"OPERATOR_CODE" VARCHAR2(20 BYTE),
"STRUCTURE_CODE" VARCHAR2(20 BYTE),
CONSTRAINT "LOG_PK" PRIMARY KEY ("ID_LOG")
);
with these two indices:
CREATE INDEX STRUCTURE_CODE ON LOG ("OPERATOR_CODE");
CREATE INDEX LOG_01 ON LOG ("STRUCTURE_CODE", "DATA") ;
but this query produces a FULL TABLE SCAN:
SELECT log.data AS data1,
OPERATOR_CODE,
STRUCTURE_CODE
FROM log
WHERE data BETWEEN to_date('03/03/2008', 'DD-MM-YYYY')
AND to_date('08/03/2015', 'DD-MM-YYYY')
AND STRUCTURE_CODE = '1601';
Why do I always see a FULL TABLE SCAN when filtering on columns DATA and STRUCTURE_CODE?
(I have also tried creating two separate indexes for STRUCTURE_CODE and DATA, but I always get a full table scan)
Did you run stats on your new index and the table?
How much data is in that table and what percentage of it is likely to be returned by that query? Sometimes a full table scan is better for small tables or for queries that will return a large percentage of the data.
How many rows do you have in that table?
How many is returned by this query?
Please include the explain plan.
If loading the table and doing a full table scan (FTS) is cheaper (in IO cost) than utilizing an index, the table will be loaded and an FTS will happen. [Basically the same as what Necreaux said]
This can happen either if the table is small, or if the expected result set is big.
What is small? FTS will almost always happen if the table is smaller than DB_FILE_MULTIBLOCK_READ_COUNT blocks. In this case, the table can usually be loaded into memory with one big read. It's not always an issue; check the IO cost in the explain plan.
What is big? If the table is pretty big and you'll return most of it, it is cheaper to read the whole table in a few large IO calls than to make some index reads and then a lot of tiny IO calls all around the table.
Blind guessing from your query (without the explain plan results), I think it would first consider an index range scan (over LOG_01), followed by a table access by rowid (to get the OPERATOR_CODE as it is not in the index), but either it decides that your table is too small, or that there are so many rows to be returned from that date range/structure_code, that rolling through the table is cheaper (in IO Cost terms).
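To get past guessing, the statistics can be refreshed and the plan inspected along these lines. This is a sketch that assumes the LOG table lives in the current user's schema and that the statements are run from SQL*Plus or a similar client.

```sql
-- Refresh optimizer statistics for the table and, via cascade, its indexes.
BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(
        ownname => USER,
        tabname => 'LOG',
        cascade => TRUE
    );
END;
/

-- Record the plan for the query from the question without executing it.
EXPLAIN PLAN FOR
SELECT log.data AS data1,
       operator_code,
       structure_code
FROM log
WHERE data BETWEEN TO_DATE('03/03/2008', 'DD/MM/YYYY')
               AND TO_DATE('08/03/2015', 'DD/MM/YYYY')
  AND structure_code = '1601';

-- Display the recorded plan, including estimated rows and cost per step.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

The estimated row count for the LOG access step, compared with the total number of rows in the table, usually makes it clear why the optimizer preferred the full scan.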

Index for table in SQL Server 2012

I had a question on indexes. I have a table like this:
id BIGINT PRIMARY KEY NOT NULL,
cust_id VARCHAR(8) NOT NULL,
dt DATE NOT NULL,
sale_type VARCHAR(10) NOT NULL,
sale_type_sub VARCHAR(40),
amount DOUBLE PRECISION NOT NULL
The table has several million rows. Assuming that queries will often filter results by date ranges, sale types, amounts above and below certain values, and that joins will occur on cust_id... what do you all think is the ideal index structure?
I wasn't sure if a clustered index would be best, or individual indexes on each column? Both?
Any serious table in SQL Server should always have a well-chosen, good clustering key - it makes so many things faster and more efficient. From your table structure, I'd use the ID as the clustering key.
Next, you say joins occur on cust_id - so I would put an index on cust_id. This speeds up joins in general and is a generally accepted recommendation.
Next, it really depends on your queries. Are they all using the same columns in their WHERE clauses? Or do you get queries that use dt, and others that use sale_type separately?
The point is: the fewer indices the better - so if at all possible, I'd try to find one compound index that covers all your needs. But if you have an index on three columns, e.g. on (sale_type, dt, amount), then that index can be used for queries
using all three columns in the WHERE clause
using sale_type and dt in the WHERE clause
using only sale_type in the WHERE clause
but it could NOT be used for queries that use dt or amount alone. A compound index always requires you to use the n left-most columns in the index definition - otherwise it cannot be used.
So my recommendation would be:
define the clustering key on ID
define a nonclustered index on cust_id for the JOINs
examine your system to see what other queries you have - what criteria are being used for selection, how often do those queries execute? Don't over-optimize a query that's executed once a month - but do spend time on those that are executed dozens of times every hour.
Add one index at a time - let the system run for a bit - do you measure an improvement in query times? Does it feel faster? If so: leave that index. If not: drop it again. Iterate until you're happy with the overall system performance.
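The recommendations above might translate into something like the following. The table name dbo.sales and the index names are assumptions, since the question does not name the table, and the compound index is only worth keeping if the real workload filters on sale_type first.

```sql
-- Supports joins on cust_id.
CREATE NONCLUSTERED INDEX IX_sales_cust_id
    ON dbo.sales (cust_id);

-- Candidate compound index; usable when the WHERE clause filters on sale_type,
-- on sale_type and dt, or on all three columns.
CREATE NONCLUSTERED INDEX IX_sales_type_dt_amount
    ON dbo.sales (sale_type, dt, amount);

-- Example of a query that filters on the two left-most columns of the compound index.
SELECT id, cust_id, amount
FROM dbo.sales
WHERE sale_type = 'retail'
  AND dt BETWEEN '2017-01-01' AND '2017-03-31';
```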
The best way to find indexes for your table is SQL Server Profiler.