So I have this table:
CREATE TABLE DATA.history (
  modemId text PRIMARY KEY,
  generationDate timestamp PRIMARY KEY,
  eventId integer PRIMARY KEY,
  speed double,
  altitude double INDEX OFF,
  latitude double INDEX OFF,
  longitude double INDEX OFF,
  odometer double,
  month timestamp with time zone GENERATED ALWAYS AS date_trunc('month', generationDate) PRIMARY KEY
) PARTITIONED BY (month) CLUSTERED INTO 10 SHARDS;
It holds data from GPS devices in a 3-node cluster (8 CPU, 32 GB RAM).
I have a few doubts about it:
1. I'm not sure about the number of shards (I'm assuming the shard count is per partition, so for 2 months of data I'd have 2 x 10 shards, right?). What do you think is a good number here? Queries will filter on modemId and generationDate ranges.
2. How can I check the performance of the queries? I know there is an EXPLAIN command like in MySQL, but I'm not sure what to look for in it.
Thank you!
To partially answer your question, look at the docs and general info on shards here. The advice is that there should be at least as many shards for a table as there are CPUs in the cluster.
Your second question is probably a little too subjective to answer.
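For the EXPLAIN part, here is a minimal sketch of what you could run, assuming a CrateDB version that supports EXPLAIN and EXPLAIN ANALYZE (the literal values are just placeholders). The main thing to look for is whether the plan only touches the partitions and shards matching your modemId / generationDate filter instead of scanning the whole table:
-- Show the plan CrateDB picks for a typical modemId + date-range lookup
EXPLAIN
SELECT speed, odometer
FROM DATA.history
WHERE modemId = 'ABC123'
  AND generationDate >= '2019-01-01'
  AND generationDate <  '2019-02-01';

-- EXPLAIN ANALYZE also executes the query and reports per-node / per-shard timings
EXPLAIN ANALYZE
SELECT count(*)
FROM DATA.history
WHERE modemId = 'ABC123'
  AND generationDate >= '2019-01-01';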
Related
I have some business logic that needs to run on a daily basis and will affect all the rows in the table. Once a record is fixed and can't be changed again by the logic, it gets moved to an on-disk table. At its peak there will be approximately 30 million rows in the table. It's very skinny: just the linkage columns to a main table and a key to a flag table. The flag key is what will be updated.
My question is: when I'm preparing a table of this size, what bucket count should I be looking to use on the index?
The table will start off small, with likely only a few hundred thousand rows in April, but by the end of the financial year it will have grown to the maximum mentioned (as previous years have indicated). I'm not sure if the practically empty buckets at the start will cause any issues, or if it is OK to set the count at the 30 million mark.
Thanks in advance for your comments, suggestions, and help.
I've provided the code below. I've tried googling what happens if the bucket count is high but the initial number of rows is low while the table grows over time, but found nothing to help me understand whether there will be a performance issue because of this.
CREATE TABLE [PRD].[CTRL_IN_MEM]
(
[FILE_LOAD_ID] INT NOT NULL,
[RECORD_IDENTIFIER] BIGINT NOT NULL,
[FLAG_KEY] SMALLINT NOT NULL,
[APP_LEVEL_PART] BIT NOT NULL,
--Line I'm not sure about
CONSTRAINT [pk_CTRL_IN_MEM] PRIMARY KEY NONCLUSTERED HASH ([FILE_LOAD_ID], [RECORD_IDENTIFIER]) WITH (BUCKET_COUNT = 30000000),
INDEX cci_CTRL_IN_MEM CLUSTERED COLUMNSTORE
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY=SCHEMA_AND_DATA)
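Not a direct answer to the sizing question, but once the table is populated you can check how well the chosen bucket count fits the data. A minimal sketch using the sys.dm_db_xtp_hash_index_stats DMV (the join to sys.indexes and the aliases are just illustrative):
-- Inspect bucket utilisation of the hash index on the memory-optimized table
SELECT
    OBJECT_NAME(hs.object_id)       AS table_name,
    i.name                          AS index_name,
    hs.total_bucket_count,
    hs.empty_bucket_count,
    CAST(hs.empty_bucket_count * 100.0 / hs.total_bucket_count
         AS decimal(5, 2))          AS empty_bucket_pct,
    hs.avg_chain_length,
    hs.max_chain_length
FROM sys.dm_db_xtp_hash_index_stats AS hs
JOIN sys.indexes AS i
    ON i.object_id = hs.object_id
   AND i.index_id  = hs.index_id
WHERE hs.object_id = OBJECT_ID(N'PRD.CTRL_IN_MEM');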
Currently I'm using the following query:
SELECT
ID,
Key
FROM
mydataset.mytable
where ID = 100077113
and Key='06019'
My data has 100 million rows:
ID - unique
Key - can have ~10,000 distinct values
If I know the Key, looking for the ID can be done on just ~10,000 rows, which is much faster and processes much less data.
How can I use the new clustering capabilities in BigQuery to partition on the field Key?
(I'm going to summarize and expand on what Mikhail, Pentium10, and Pavan said)
I have a table with 12M rows and 76 GB of data. This table has no timestamp column.
This is how to cluster said table - while creating a fake date column for fake partitioning:
CREATE TABLE `fh-bigquery.public_dump.github_java_clustered`
(id STRING, size INT64, content STRING, binary BOOL
, copies INT64, sample_repo_name STRING, sample_path STRING
, fake_date DATE)
PARTITION BY fake_date
CLUSTER BY id AS (
SELECT *, DATE('1980-01-01') fake_date
FROM `fh-bigquery.github_extracts.contents_java`
)
Did it work?
# original table
SELECT *
FROM `fh-bigquery.github_extracts.contents_java`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(3.3s elapsed, 72.1 GB processed)
# clustered table
SELECT *
FROM `fh-bigquery.public_dump.github_java_clustered2`
WHERE id='be26cfc2bd3e21821e4a27ec7796316e8d7fb0f3'
(2.4s elapsed, 232 MB processed)
What I learned here:
Clustering can work with unique ids, even for tables without a date to partition by.
Prefer using a fake date instead of a null date (but only for now - this should be improved).
Clustering made my query 99.6% cheaper when looking for rows by id!
Read more: https://medium.com/@hoffa/bigquery-optimized-cluster-your-tables-65e2f684594b
You can have one field of type DATE with a NULL value, so you will be able to partition by that field, and since the table is partitioned you will be able to enjoy clustering.
You need to recreate your table with an additional date column, with all rows having NULL values. Then you set the partitioning on that date column. This way your table is partitioned.
Once you've done this, you add clustering based on the columns you identified in your query. Clustering will improve processing time, and query costs will be reduced.
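A rough sketch of that recreate step, with a hypothetical target table name (mydataset.mytable_clustered) and the all-NULL date column:
CREATE TABLE `mydataset.mytable_clustered`
PARTITION BY fake_date
CLUSTER BY Key AS (
  SELECT *, CAST(NULL AS DATE) AS fake_date
  FROM `mydataset.mytable`
);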
Now you can partition a table on an integer column, so this might be a good solution; remember there is a limit of 4,000 partitions per table. Because you have ~10,000 keys, I would suggest creating a sort of group_key that bundles keys together, or maybe you have another column you can leverage as an integer with a cardinality < 4,000.
Recently BigQuery introduced support for clustering tables even if they are not partitioned. So you can simply cluster on your field and not use partitioning at all. Although, this solution will not be the most effective for data scan optimisation.
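If you go the clustering-only route, the sketch gets even simpler (the target table name is again hypothetical):
CREATE TABLE `mydataset.mytable_clustered_only`
CLUSTER BY Key AS (
  SELECT *
  FROM `mydataset.mytable`
);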
I'm currently working on a project collecting a very large amount of data from a network of wireless modems out in the field. We have a table 'readings' that looks like this:
CREATE TABLE public.readings (
id INTEGER PRIMARY KEY NOT NULL DEFAULT nextval('readings_id_seq'::regclass),
created TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT now(),
timestamp TIMESTAMP WITHOUT TIME ZONE NOT NULL,
modem_serial CHARACTER VARYING(255) NOT NULL,
channel1 INTEGER NOT NULL,
channel2 INTEGER NOT NULL,
signal_strength INTEGER,
battery INTEGER,
excluded BOOLEAN NOT NULL DEFAULT false
);
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (timestamp, modem_serial);
CREATE INDEX ix_readings_timestamp ON readings USING BTREE (timestamp);
CREATE INDEX ix_readings_modem_serial ON readings USING BTREE (modem_serial);
It's important for the integrity of the system that we never have two readings from the same modem with the same timestamp, hence the unique index.
Our challenge at the moment is to find a performant way of inserting readings. We often have to insert millions of rows as we bring in historical data, and when adding to an existing base of 100 million plus readings, this can get kind of slow.
Our current approach is to import batches of 10,000 readings into a temporary_readings table, which is essentially an unindexed copy of readings. We then run the following SQL to merge it into the main table and remove duplicates:
INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
SELECT DISTINCT ON (timestamp, modem_serial) created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
FROM temporary_readings
WHERE NOT EXISTS(
SELECT * FROM readings
WHERE timestamp=temporary_readings.timestamp
AND modem_serial=temporary_readings.modem_serial
)
ORDER BY timestamp, modem_serial ASC;
This works well, but takes ~20 seconds per 10,000 row block to insert. My question is twofold:
Is this the best way to approach the problem? I'm relatively new to projects with these sorts of performance demands, so I'm curious to know if there are better solutions.
What steps can I take to speed up the insert process?
Thanks in advance!
Your query idea is okay. I would try timing it for 100,000 rows in the batch, to start to get an idea of an optimal batch size.
However, the distinct on is slowing things down. Here are two ideas.
The first is to assume that duplicates in batches are quite rare. If this is true, try inserting the data without the distinct on. If that fails, then run the code again with the distinct on. This complicates the insertion logic, but it might make the average insertion much shorter.
The second is to build an index on temporary_readings(timestamp, modem_serial) (not a unique index). Postgres will take advantage of this index for the insertion logic -- and sometimes building an index and using it is faster than alternative execution plans. If this does work, you might try larger batch sizes.
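For the second idea, the index would be something like this (the index name is just a placeholder):
CREATE INDEX temporary_readings_ts_modem_idx
    ON temporary_readings (timestamp, modem_serial);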
There is a third solution which is to use on conflict. That would allow the insertion itself to ignore duplicate values. This is only available in Postgres 9.5, though.
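A minimal sketch of that third option, assuming Postgres 9.5+ and the existing unique index on (timestamp, modem_serial):
INSERT INTO readings (created, timestamp, modem_serial, channel1, channel2, signal_strength, battery)
SELECT DISTINCT ON (timestamp, modem_serial)
       created, timestamp, modem_serial, channel1, channel2, signal_strength, battery
FROM temporary_readings
ORDER BY timestamp, modem_serial
ON CONFLICT (timestamp, modem_serial) DO NOTHING;
Since DO NOTHING also tolerates duplicates within the same batch, the DISTINCT ON could in principle be dropped as well.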
Adding to a table that already contains 100 million indexed records will be slow no matter what! You can probably speed things up somewhat by taking a fresh look at your indexes.
CREATE UNIQUE INDEX _timestamp_modemserial_uc ON readings USING BTREE (timestamp, modem_serial);
CREATE INDEX ix_readings_timestamp ON readings USING BTREE (timestamp);
CREATE INDEX ix_readings_modem_serial ON readings USING BTREE (modem_serial);
At the moment you have three indexes but they are on the same combination of columns. Can't you manage with just the unique index?
I don't know what your other queries are like but your WHERE NOT EXISTS query can make use of this unique index.
If you have queries where the WHERE clause only filters on the modem_serial field, your unique index is unlikely to be used. However, if you flip the columns in that index it will be:
CREATE UNIQUE INDEX _modemserial_timestamp_uc ON readings USING BTREE (modem_serial, timestamp);
To quote from the manual:
A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.
The order of the columns in the index matters.
Yes, fillfactor again. I spent many hours reading and I can't decide what's best for each case. I don't understand when and how fragmentation happens. I'm migrating a database from MS SQL Server to PostgreSQL 9.2.
Case 1
10-50 inserts / minute in a sequential (serial) PK, 20-50 reads / hour.
CREATE TABLE dev_transactions (
transaction_id serial NOT NULL,
transaction_type smallint NOT NULL,
moment timestamp without time zone NOT NULL,
gateway integer NOT NULL,
device integer NOT NULL,
controler smallint NOT NULL,
token integer,
et_mode character(1),
status smallint NOT NULL,
CONSTRAINT pk_dev_transactions PRIMARY KEY (transaction_id)
);
Case 2
Similar structure, an index for the serial PK, writes in blocks (one shot) of ~50,000 records every 2 months, reads 10-50 / minute.
Does a 50% fillfactor mean that each insert generates a new page and moves 50% of existing rows to a newly generated page?
Does a 50% fillfactor mean free space is allocated between physical rows in new data pages?
A new page is generated only if there is no free space left in existing pages?
As you can see I'm very confused; I would appreciate some help — maybe a good link to read about PostgreSQL and index fillfactor.
FILLFACTOR
With only INSERT and SELECT you should use a FILLFACTOR of 100 for tables (which is the default anyway). There is no point in leaving wiggle room per data page if you are not going to "wiggle" with UPDATEs.
The mechanism behind FILLFACTOR is simple. INSERTs only fill data pages (usually 8 kB blocks) up to the percentage declared by the FILLFACTOR setting. Also, whenever you run VACUUM FULL or CLUSTER on the table, the same wiggle room per block is re-established. Ideally, this allows UPDATE to store new row versions in the same data page, which can provide a substantial performance boost when dealing with lots of UPDATEs. Also beneficial in combination with H.O.T. updates. See:
Redundant data in update statements
Indexes need more wiggle room by design. They have to store new entries at the right position in leaf pages. Once a page is full, a relatively costly "page split" is needed. So indexes tend to bloat more than tables. The default FILLFACTOR for a (default) B-Tree index is 90 (varies per index type). And wiggle room makes sense for just INSERTs, too. The best strategy heavily depends on write patterns.
Example: if new inserts have steadily growing values (the typical case for a serial or timestamp column), then there are basically no page splits, and you might go with FILLFACTOR = 100 (or a bit lower to allow for some noise).
For a random distribution of new values, you might go below the default 90 ...
Basic source of information: the manual for CREATE TABLE and CREATE INDEX.
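For reference, a small sketch of where fillfactor is actually set (the index name is just a placeholder; the values mirror the discussion above):
-- Per-index fillfactor (B-tree default is 90); steadily growing values can take 100
CREATE INDEX idx_dev_transactions_moment
    ON dev_transactions (moment)
    WITH (fillfactor = 100);

-- Per-table fillfactor; 100 is the default and fine for INSERT/SELECT-only tables.
-- It applies to new pages; existing pages are only repacked by VACUUM FULL or CLUSTER.
ALTER TABLE dev_transactions SET (fillfactor = 100);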
Other optimization
But you can do something else - since you seem to be a sucker for optimization ... :)
CREATE TABLE dev_transactions(
transaction_id serial PRIMARY KEY
, gateway integer NOT NULL
, moment timestamp NOT NULL
, device integer NOT NULL
, transaction_type smallint NOT NULL
, status smallint NOT NULL
, controller smallint NOT NULL
, token integer
, et_mode character(1)
);
This optimizes your table with regard to data alignment and avoids padding for a typical 64-bit server, saving a few bytes, probably just 8 bytes on average; you typically can't squeeze out much with "column tetris":
Calculating and saving space in PostgreSQL
Keep NOT NULL columns at the start of the table for a very small performance bonus.
Your table has 9 columns. The initial ("cost-free") 1-byte NULL bitmap covers 8 columns. The 9th column triggers an additional 8 bytes for the extended NULL bitmap - if there are any NULL values in the row.
If you make et_mode and token NOT NULL, all columns are NOT NULL and there is no NULL bitmap, freeing up 8 bytes per row.
This even works per row if some columns can be NULL. If all fields of the same row have values, there is no NULL bitmap for the row. In your special case, this leads to the paradox that filling in values for et_mode and token can make your storage size smaller or at least stay the same:
Do nullable columns occupy additional space in PostgreSQL?
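If you go that route, a one-liner like this (column names from the table above) would do it, provided no existing rows hold NULLs in those columns:
ALTER TABLE dev_transactions
    ALTER COLUMN et_mode SET NOT NULL,
    ALTER COLUMN token   SET NOT NULL;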
Basic source of information: the manual on Database Physical Storage.
Compare the size of rows (filled with values) with your original table to get definitive proof:
SELECT pg_column_size(t) FROM dev_transactions t;
(Plus maybe padding between rows, as the next row starts at a multiple of 8 bytes.)
Today I read some comments and did a little experiment. I imagined a system that stores some coordinates.
Here is the situation:
I have two tables, the first is:
CREATE TABLE Points
(
ID int IDENTITY(1,1) PRIMARY KEY,
X int,
Y int,
Name varchar(20),
Created datetime
)
It just stores coordinates (1 million rows). The second one is a helper table storing some, let's say, often-used points for a select (around 1,100 rows):
CREATE TABLE PointSearchHelper
(
X int,
Y int
)
So far, so good.
I would like to make an easy select:
SELECT p.* FROM Points p
INNER JOIN PointSearchHelper h
ON p.X = h.X AND p.Y = h.Y
I run the script; it gets the 1,100 rows in around 280 ms on average.
When I check the execution plan I see that SQL Server 2008 R2 recommends an index (who would have thought? ;)):
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Points] ([X], [Y])
INCLUDE ([ID], [Name], [Created])
This is a full covering index on the table; it contains every column. Its size is "huge" considering that I'm now storing the data twice!
So the query is now much faster! It runs in around 75 ms(!). A great improvement, BUT I need almost double the space for it.
My question is simple: is there any way to tell SQL Server how to store the values for these columns, or any other trick to save myself from double storage?
UPDATE:
In other words: is there any trick to avoid the "full index" while keeping the same performance?
Change your PointSearchHelper table to just store the id of the point rather than the x, y coordinates:
create table PointSearchHelper (
    points_id int not null primary key
);
When you do the join, do it on points_id instead. This should reduce space and increase performance.
Are your X+Y pairs unique?
If they are, you might consider dropping the identity column and creating a composite primary key on the X+Y pairs. That would remove the need for the additional index and might speed up your query even more.
It largely depends on other queries against this table, but if you did not want to have the full index, you could remove the primary key from ID, and instead place the primary key (and the clustered index) on (X, Y)
Doing this would store the data in the table by X and Y values, so this particular query would be faster, and only need to use the newly created clustered index.
You would have to look for potential problems with performance this might create if you have queries against your Points table that use the ID in WHERE clause, as this column will no longer be stored sorted ASC as it is now. If you see that the majority of your queries are querying this table by X, Y values, you could test this change in a development server and see if it suits your needs.
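A sketch of that change, assuming the existing primary key constraint is named PK_Points (check the actual name first, e.g. with sp_helpconstraint 'dbo.Points'); X and Y must be made NOT NULL before they can form the key:
ALTER TABLE dbo.Points DROP CONSTRAINT PK_Points;   -- constraint name is an assumption
ALTER TABLE dbo.Points ALTER COLUMN X int NOT NULL;
ALTER TABLE dbo.Points ALTER COLUMN Y int NOT NULL;
ALTER TABLE dbo.Points ADD CONSTRAINT PK_Points_XY
    PRIMARY KEY CLUSTERED (X, Y);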
What result do you get when you create the index without INCLUDEing the non-key values? It may be close to the speed you get with the full index.
Additionally, if the X, Y coordinates are guaranteed unique in Points then you could consider dropping the ID column and creating the primary key directly on (X, Y). This will save you some space and also the overhead of indexing that column.
I thought it easier to reply here to all the answers, because I did the "homework", and I'm surprised:
First:
Changing the index to drop the INCLUDEd non-key values -> it does not help; performance is around 280 ms, like the original without the full index.
Second:
Dropping the ID column, making X + Y the primary key (let's say those points are unique), and making another primary key index on the PointSearchHelper table on X + Y. That solution surprised me, because the execution plan then used both indexes, but the speed was still around 280 ms. So it did not help at all.
Third:
Dropping X and Y from the helper table and storing the ID of the point instead; let's say some logic around saving the values looks up the primary key ID of those records.
With this there are only two indexes, the two primary key indexes on Points and PointSearchHelper. (I can see both of them in the execution plan; they are used.)
And it did it!! The speed was around 60-70 ms. So here is the trick.
Now I'm wondering what the difference is between Second and Third. Does comparing two numbers instead of one really cost that many milliseconds?