Query Postgres table by Block Range Index (BRIN) identifier directly - sql

I have N client machines. I want to load each machine with a distinct partition of a BRIN index.
That requires me to:
create a BRIN index with a predefined number of partitions - equal to the number of client machines
send queries from the clients which use a WHERE on the BRIN partition identifier instead of a filter on the indexed column
The main goal is a performance improvement when loading a single table from Postgres into distributed client machines, keeping an equal number of rows between the clients - or close to equal if the row count does not divide evenly by the machine count.
I can currently achieve this by maintaining a new column which chunks my table into a number of buckets equal to the number of client machines (or by using row_number() over (order by datetime) % N on the fly). This is not efficient in time or memory, and the BRIN index looks like a nice feature which could speed up such use cases.
Minimal reproducible example for 3 client machines:
CREATE TABLE bigtable (datetime TIMESTAMPTZ, value TEXT);
INSERT INTO bigtable VALUES ('2015-12-01 00:00:00+00'::TIMESTAMPTZ, 'txt1');
INSERT INTO bigtable VALUES ('2015-12-01 05:00:00+00'::TIMESTAMPTZ, 'txt2');
INSERT INTO bigtable VALUES ('2015-12-02 02:00:00+00'::TIMESTAMPTZ, 'txt3');
INSERT INTO bigtable VALUES ('2015-12-02 03:00:00+00'::TIMESTAMPTZ, 'txt4');
INSERT INTO bigtable VALUES ('2015-12-02 05:00:00+00'::TIMESTAMPTZ, 'txt5');
INSERT INTO bigtable VALUES ('2015-12-02 16:00:00+00'::TIMESTAMPTZ, 'txt6');
INSERT INTO bigtable VALUES ('2015-12-02 23:00:00+00'::TIMESTAMPTZ, 'txt7');
Expected output:
client 1
2015-12-01 00:00:00+00, 'txt1'
2015-12-01 05:00:00+00, 'txt2'
2015-12-02 02:00:00+00, 'txt3'
client 2
2015-12-02 03:00:00+00, 'txt4'
2015-12-02 05:00:00+00, 'txt5'
client 3
2015-12-02 16:00:00+00, 'txt6'
2015-12-02 23:00:00+00, 'txt7'
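For reference, the on-the-fly bucketing workaround mentioned above would look roughly like this for 3 clients. This is only a sketch: ntile() is used here because it yields the contiguous chunks shown in the expected output, whereas row_number() % N would interleave the rows.
SELECT datetime, value
FROM (
    SELECT datetime, value,
           ntile(3) OVER (ORDER BY datetime) AS bucket
    FROM bigtable
) t
WHERE bucket = 1;  -- client 1 reads bucket 1, client 2 bucket 2, client 3 bucket 3
Every client still sorts (or at least scans) the whole table, which is the overhead I am trying to avoid.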
The question:
How can I create a BRIN index with a predefined number of partitions and run queries which filter on partition identifiers instead of filtering on the indexed column?
Optionally: is there any other way that BRIN (or other pg goodies) can speed up the task of loading multiple clients in parallel from a single table?

It sounds like you want to shard a table over many machines, and have each local table (one shard of the global table) have a BRIN index with exactly one bucket. But that does not make any sense. If the single BRIN index range covers the entire (local) table, then it can never be very helpful.
It sounds like what you are looking for is partitioning with CHECK constraints that can be used for partition exclusion. PostgreSQL has supported that for a long time with table inheritance (although not with each partition on a separate machine). Using this method, the range covered in the CHECK constraint has to be set explicitly for each partition. This ability to explicitly specify the bounds sounds like exactly what you are looking for, just using a different technology.
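A minimal sketch of that approach, reusing bigtable from the question (the range bounds and child table names are invented for the example, and routing rows into the right child - via application logic or an insert trigger - is omitted):
CREATE TABLE bigtable_p1 (CHECK (datetime <  '2015-12-02 03:00:00+00')) INHERITS (bigtable);
CREATE TABLE bigtable_p2 (CHECK (datetime >= '2015-12-02 03:00:00+00'
                             AND datetime <  '2015-12-02 06:00:00+00')) INHERITS (bigtable);
CREATE TABLE bigtable_p3 (CHECK (datetime >= '2015-12-02 06:00:00+00')) INHERITS (bigtable);

-- With constraint exclusion enabled, a client that queries only its own range
-- lets the planner skip the other children entirely:
SET constraint_exclusion = partition;
SELECT * FROM bigtable
WHERE datetime >= '2015-12-02 03:00:00+00'
  AND datetime <  '2015-12-02 06:00:00+00';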
But the partition exclusion code doesn't work well with modulus. The code is smart enough to know that WHERE id=5 only needs to check the CHECK (id BETWEEN 1 AND 10) partition, because it knows that id=5 implies that id is between 1 and 10. More accurately, it knows the contrapositive of that.
But the code was never written to know that WHERE id=5 implies that id%10 = 5%10, even though humans know that. So if you build your partitions on modulus operators, like CHECK (id%10=5), rather than on ranges, you would have to sprinkle all your queries with WHERE id = $1 AND id % 10 = $1 % 10 if you wanted them to take advantage of the constraints.
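For example (a sketch only; t, id and the child table name are placeholders following the notation above):
-- A partition defined on a modulus:
CREATE TABLE t_p5 (CHECK (id % 10 = 5)) INHERITS (t);

-- This scans every child, because the planner does not infer id % 10 = 5 from id = 5:
SELECT * FROM t WHERE id = 5;

-- Only the redundant, hand-added predicate lets constraint exclusion kick in:
SELECT * FROM t WHERE id = 5 AND id % 10 = 5 % 10;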

Going by your description and comments I'd say you're looking in the wrong direction. You want to split the table up front so access will be fast and simple, but without actually having to split things up front, because that would require knowing the number of nodes up front, which, if I understand correctly, is variable. And regardless, it takes quite a bit of processing to split things, too.
To be honest, I'd go about your problem differently. Instead of assigning every record to a bucket I'd rather suggest assigning every record a pseudo-random value in a given range. I don't know about Postgres, but in MSSQL I'd use BINARY_CHECKSUM(NewID()) instead of Rand(), mainly because the random function is harder to use in a set-based way there. Instead you could also use some hashing code that returns a reasonable working space. Anyway, in my MSSQL situation the resulting value would then be a signed integer sitting somewhere in the range -2^31 to +2^31 (give or take, check the documentation for the exact boundaries!). As such, when the master machine decides to assign n client machines, each machine can be assigned an exact range that -- given the properties of the randomizer/hashing algo -- will cover a reasonably close approximation of the workload divided by n.
Assuming you have an index on the selection field this should be reasonably fast, regardless of whether you decide to split the table into a thousand or a million chunks.
PS: mind that this approach will only work 'properly' if the number of rows to process (greatly) outnumbers the number of machines that will do the processing. With small numbers you might see several machines not getting anything while others get to do all the work.
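In Postgres terms, a rough equivalent of this idea might look like the following. This is only a sketch: hashtext() is an internal, undocumented Postgres function used here purely for illustration (any stable hash would do), and the column and index names are made up.
-- One-time assignment of a hashed value per row:
ALTER TABLE bigtable ADD COLUMN shard_key integer;
UPDATE bigtable SET shard_key = hashtext(datetime::text || value);
CREATE INDEX bigtable_shard_key_idx ON bigtable (shard_key);

-- hashtext() returns a signed 32-bit integer, so with 3 machines each client takes
-- roughly one third of the -2^31 .. 2^31-1 range, e.g. client 1:
SELECT datetime, value
FROM bigtable
WHERE shard_key >= -2147483648
  AND shard_key <  -715827882;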

Basically, all you need to know is the size of the relation after loading, and then the pages_per_range storage parameter should be set to the divisor that gives you the desired number of partitions.
No need to introduce an artificial partition ID, because there is support for sufficient types and operators. Physical table layout is important here, so if you insist on the partition ID being the key, and end up introducing an out-of-order mapping between the natural load order and the artificial partition ID, make sure you cluster the table on that column's sort order before creating the BRIN index.
However, at the same time, remember that more discrete values have a better chance of hitting the index than fewer, so high cardinality is better - an artificial partition identifier will have 1/n the cardinality of a natural key, where n is the number of distinct values per partition.
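A sketch of the sizing idea for the 3-client example (the page count and boundary timestamps below are illustrative, and the index name is made up):
-- How many pages does the loaded table occupy?
SELECT pg_relation_size('bigtable') / current_setting('block_size')::int AS pages;

-- If that returned, say, roughly 3000 pages and there are 3 clients,
-- make each BRIN range cover one third of the table:
CREATE INDEX bigtable_datetime_brin ON bigtable USING brin (datetime)
  WITH (pages_per_range = 1000);

-- Each client still filters on the indexed column for its own slice of the data;
-- the BRIN index lets it skip the ranges belonging to the other clients, e.g. client 2:
SELECT datetime, value
FROM bigtable
WHERE datetime >= '2015-12-02 03:00:00+00'
  AND datetime <  '2015-12-02 06:00:00+00';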

Related

DB2 v10 zos : identify free index values

My organisation has hundreds of DB2 tables that each have a randomly generated unique integer index. The random values are generated by either COBOL CICS mainframe programs or Java distributed applications. The normal approach taken is to randomly generate an integer value (only positive values are employed), then attempt to insert the data row, retrying when a duplicate index value has already been persisted. I would like to improve the performance of this approach and I'm considering trying to identify integer values that have not been generated and persisted to each table, this would mean we don't ever need to retry. We would know our insert would work. Does db2 have a function that can return unused index values?
The short answer is no.
The slightly longer answer is to point out that, if such a function existed, in your case on the first insert into one of your tables the size of the result set it would return would be 2,147,483,647 (positive) integers. At 4 bytes each, that would be 8,589,934,588 bytes.
Given the constraints of your existing system, what you're doing is probably the best that can be done. If the performance of retrying is unacceptable, I'm afraid redesigning your key scheme is the next step.
I think that's a question to ask: Is this scheme of using random numbers for unique keys causing a performance problem? As the tables fill up the key space you will see more and more retries, but you have a relatively large key space. If you're seeing large numbers of retries maybe your random numbers are less random than you'd like.
Just a thought, but you could use one sequence for a group of tables. That way the value will still appear random (because you wouldn't know which table the next insert goes to), but it is based on a specific sequence, which means that most of the time you won't get a retry because the numbers keep ascending. That same sequence can loop after a few hundred million inserts and start to "fill in the blanks".
As far as other key ideas are concerned, you could also try using a different key, maybe one based on a timestamp or ROWID. That would still be random but not repetitive.
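A minimal sketch of the shared-sequence idea in DB2 syntax (the sequence and table names are made up; the cycle bound matches the positive INTEGER range mentioned above):
CREATE SEQUENCE shared_id_seq AS INTEGER
  START WITH 1
  INCREMENT BY 1
  MINVALUE 1
  MAXVALUE 2147483647
  CYCLE;

-- Every participating table draws its key from the same sequence at insert time:
INSERT INTO some_table (id, payload)
VALUES (NEXT VALUE FOR shared_id_seq, 'example row');
Note that after the sequence cycles, collisions with still-existing rows become possible again, so the retry logic cannot be dropped entirely.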

AWS DynamoDB v2: Do I need secondary index for alternative queries?

I need to create a table that would contain a slice of data produced by a continuously running process. This process generates messages that contain two mandatory components, among other things: a globally unique message UUID, and a message timestamp.
Those messages would be later retrieved by the UUID.
In addition, on a regular basis I would need to delete all messages from that table that are too old, i.e. whose timestamps are more than X away from the current time.
I've been reading the DynamoDB v2 documentation (e.g. Local Secondary Indexes) trying to figure out how to organize my table and whether or not I need a secondary index to perform searches for messages to delete. There might be a simple answer to my question, but I am somehow confused...
So should I just create a table with the UUID as the hash and messageTimestamp as the range key (together with a "message" attribute that would contain the actual message), and then not create any secondary indices? In the examples that I've seen, the hash was something that was not unique (e.g. ForumName under the above link). In my case, the hash would be unique. I am not sure whether that makes any difference.
And if I create the table with hash and range as described, and without a secondary index, then how would I query for all messages that are in a certain time range regardless of their UUIDs?
DynamoDB introduced Global Secondary Indexes, which would solve this problem.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
We've wrestled with this as well. The best solution we've come up with is to create a second table for storing the time series data. To do this:
1) Use the date plus "bucket" id for a hash key
You could just use the date, but then I'm guessing today's date would become a "hot" key - one that is written with excessive frequency. This can create a serious bottleneck, as the total throughput for a particular DynamoDB partition is equal to the total provisioned throughput divided by the number of partitions - that means if all your writes are to a single key (today's key) and you have a throughput of 20 writes per second, then with 20 partitions, your total throughput would be 1 write per second. Any requests beyond this would be throttled. Not a good situation.
The bucket can be a random number from 1 to n, where n should be greater than the number of partitions used by the underlying DB. Determining n is a bit tricky of course because Dynamo does not reveal how many partitions it uses. But we are currently working with the upper limit of 200 based on the example found here. The writeup at this link was the basis for our thinking in coming up with this approach.
2) Use the UUID for the range key
3) Query records by issuing queries for each day and bucket.
This may seem tedious, but it is more efficient than a full scan. Another possibility is to use Elastic Map Reduce jobs, but I have not tried that myself yet so cannot say how easy/effective it is to work with.
We are still figuring this out ourselves, so I'm interested to hear others' comments. I also found this presentation very helpful in thinking through how best to use Dynamo:
Falling In and Out Of Love with Dynamo
-John
In short, you cannot. All DynamoDB queries MUST contain the primary hash key. Optionally, you can also use the range key and/or a local secondary index. With the current DynamoDB functionality you won't be able to use an LSI as an alternative to the primary index. You are also not able to issue a query with only the range key (you can test this out easily in the AWS Console).
A (costly) workaround that I can think of is to issue a scan of the table, adding filters based on the timestamp value in order to find out which items to delete. Note that filtering will not reduce the consumed capacity of the scan, as it still reads the whole table.

How can I improve performance of average method in SQL?

I'm having some performance problems where a SQL query calculating the average of a column is progressively getting slower as the number of records grows. Is there an index type that I can add to the column that will allow for faster average calculations?
The DB in question is PostgreSQL and I'm aware that a particular index type might not be available, but I'm also interested in the theoretical answer: whether this is even possible without some sort of caching solution.
To be more specific, the data in question is essentially a log with this sort of definition:
table log {
int duration
date time
string event
}
I'm doing queries like
SELECT avg(duration) FROM log WHERE event = 'finished'; -- gets average time to completion
SELECT avg(duration) FROM log WHERE event = 'finished' AND date > $yesterday; -- average today
The second one is always fairly fast since it has a more restrictive WHERE clause, but the total average duration is the type of query that is causing the problem. I understand that I could cache the values, using OLAP or something; my question is whether there is a way I can do this entirely with DB-side optimisations such as indices.
The performance of calculating an average will always get slower the more records you have, as it always has to use values from every record in the result.
An index can still help, if the index contains less data than the table itself. Creating an index for the field that you want the average for generally isn't helpful as you don't want to do a lookup, you just want to get to all the data as efficiently as possible. Typically you would add the field as an output field in an index that is already used by the query.
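In Postgres terms that might look like the following (a sketch; the index names are made up, and the INCLUDE form needs PostgreSQL 11 or later):
CREATE INDEX log_event_duration_idx ON log (event, duration);
-- or, on PostgreSQL 11+:
-- CREATE INDEX log_event_incl_idx ON log (event) INCLUDE (duration);

-- With the visibility map reasonably up to date, this can be answered by an
-- index-only scan instead of reading the whole table:
SELECT avg(duration) FROM log WHERE event = 'finished';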
It depends on what you are doing. If you aren't filtering the data then, beyond keeping the clustered index in order, how else is the database supposed to calculate an average of the column?
There are systems which perform online analytical processing (OLAP) which will do things like keeping running sums and averages of the information you wish to examine. It all depends on what you are doing and your definition of "slow".
If you have a web based program for instance, perhaps you can generate an average once a minute and then cache it, serving the cached value out to users over and over again.
Speeding up aggregates is usually done by keeping additional tables.
Assuming a sizeable table detail(id, dimA, dimB, dimC, value), if you would like the performance of AVG (or other aggregate functions) to be nearly constant regardless of the number of records, you could introduce a new table
dimAavg(dimA, avgValue)
The size of this table will depend only on the number of distinct values of dimA. Furthermore, this table could make sense in your design anyway, as it can hold the domain of values available for dimA in detail, along with other attributes related to those domain values; you might/should already have such a table.
This table is only helpful if you analyze by dimA alone; as soon as you need AVG(value) by dimA and dimB it becomes useless. So you need to know which attributes you will want to do fast analysis on. The number of rows required for keeping aggregates on multiple attributes is n(dimA) x n(dimB) x n(dimC) x ..., which may or may not grow quickly.
Maintaining this table increases the costs of updates (incl. inserts and deletes), but there are further optimizations that you can employ...
For example, let us assume that the system predominantly does inserts and only occasionally updates and deletes.
Let's further assume that you want to analyze by dimA only and that ids are increasing. Then having a structure such as
dimA_agg(dimA, Total, Count, LastID)
can help without a big impact on the system.
This is because you could have triggers that do the expensive work not on every insert, but, let's say, only once every 100 inserts.
This way you can still get accurate aggregates from this table and the detail table with
SELECT a.dimA, (SUM(d.value) + MAX(a.Total)) / (COUNT(d.id) + MAX(a.Count)) AS avgDimA
FROM detail d INNER JOIN
     dimA_agg a ON a.dimA = d.dimA AND d.id > a.LastID
GROUP BY a.dimA
The above query, with proper indexes, would fetch one row from dimA_agg and fewer than 100 rows from detail - this performs in near constant time (roughly O(log n)) and does not require an update to dimA_agg for every insert (reducing update penalties).
The value of 100 was just given as an example; you should find the optimal value yourself (or even keep it variable, though triggers alone will not be enough in that case).
Maintenance for deletes and updates must fire on each operation, but you can still check whether the id of the record being deleted or updated has already been folded into the stats, to avoid unnecessary updates (this will save some I/O).
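A rough PostgreSQL sketch of the idea (table and column names follow this answer, the threshold of 100 is illustrative, and a dimA_agg row is assumed to already exist for each dimA value). In Postgres the trigger always fires; it just does nothing until enough unrolled rows have accumulated, and an index on detail(dimA, id) keeps that check cheap:
CREATE OR REPLACE FUNCTION rollup_dima() RETURNS trigger AS $$
BEGIN
  -- Fold detail rows newer than LastID into the aggregate, but only once at
  -- least 100 of them have piled up for this dimA value.
  UPDATE dimA_agg a
     SET Total  = a.Total + s.sum_value,
         Count  = a.Count + s.cnt,
         LastID = s.max_id
    FROM (SELECT d.dimA,
                 SUM(d.value) AS sum_value,
                 COUNT(*)     AS cnt,
                 MAX(d.id)    AS max_id
            FROM detail d
            JOIN dimA_agg ag ON ag.dimA = d.dimA
           WHERE d.dimA = NEW.dimA
             AND d.id   > ag.LastID
           GROUP BY d.dimA
          HAVING COUNT(*) >= 100) s
   WHERE a.dimA = s.dimA;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER detail_rollup
AFTER INSERT ON detail
FOR EACH ROW EXECUTE PROCEDURE rollup_dima();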
Note: The analysis is done for a domain with discrete attributes; when dealing with time series the situation gets more complicated - you have to decide the granularity of the domain in which you want to keep the summary.
EDIT
There are also materialized views.
Just a guess, but indexes won't help much since an average must read all the records (in any order); indexes are useful for finding subsets of rows, but if you have to iterate over all rows with no special ordering, indexes don't help...
This might not be what you're looking for, but if your table has some way to order the data (e.g. by date), then you can just do incremental computations and store the results.
For example, if your data has a date column, you could compute the average for records 1 - Date1 then store the average for that batch along with Date1 and the #records you averaged. The next time you compute, you restrict your query to results Date1..Date2, and add the # of records, and update the last date queried. You have all the information you need to compute the new average.
When doing this, it would obviously be helpful to have an index on the date, or whatever column(s) you are using for the ordering.
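A rough sketch of that incremental bookkeeping for the Postgres example from the question (the state table and its columns are made-up names, the timestamp column is assumed to be called date as in the example queries, and ties on the boundary timestamp as well as later updates/deletes of old rows are glossed over):
CREATE TABLE log_avg_state (last_date timestamptz, total_duration bigint, total_count bigint);
INSERT INTO log_avg_state VALUES ('-infinity', 0, 0);

-- Run periodically: fold in only the rows newer than the stored boundary.
UPDATE log_avg_state s
   SET total_duration = s.total_duration + b.sum_d,
       total_count    = s.total_count    + b.cnt,
       last_date      = b.max_d
  FROM (SELECT COALESCE(SUM(duration), 0) AS sum_d,
               COUNT(*)                   AS cnt,
               MAX(date)                  AS max_d
          FROM log
         WHERE event = 'finished'
           AND date > (SELECT last_date FROM log_avg_state)) b
 WHERE b.cnt > 0;

-- The overall average is then a constant-time lookup:
SELECT total_duration::numeric / NULLIF(total_count, 0) AS avg_duration
FROM log_avg_state;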

Organizing lots of timestamped values in a DB (sql / nosql)

I have a device that I'm polling for lots of different fields. Every x milliseconds
the device returns a list of ids and values, which I need to store with a timestamp in a DB of sorts.
Users of the system need to be able to query this DB for historic logs to create graphs, or query the last timestamp for each value.
A simple approach would be to define a MySQL table with
id,value_id,timestamp,value
and let users select
Select value from t where value_id=x order by timestamp desc limit 1
and just push everything there with an index on timestamp and id. But my question is: what's the best approach, performance- and size-wise, for designing the schema? Or should I use NoSQL? Can anyone comment on possible design trade-offs? Will such a design scale to millions of records?
When you say "... or query the last timestamp for each value" is this what you had in mind?
select max(timestamp) from T where value = ?
If you have millions of records, and the above is what you meant (i.e. value is alone in the WHERE clause), then you'd need an index on the value column, otherwise you'd have to do a full table scan. But if queries will ALWAYS have [timestamp] column in the WHERE clause, you do not need an index on [value] column if there's an index on timestamp.
You need an index on the timestamp column if your users will issue queries where the timestamp column appears alone in the WHERE clause:
select * from T where timestamp > x and timestamp < y
You could index all three columns, but you want to make sure the writes do not slow down because of the indexing overhead.
The rule of thumb when you have a very large database is that every query should be able to make use of an index, so you can avoid a full table scan.
EDIT:
Adding some additional remarks after your clarification.
I am wondering how you will know the id? Is [id] perhaps a product code?
A single simple index on id might not scale very well if there are not many different product codes, i.e. if it's a low-cardinality index. The rebalancing of the trees could slow down the batch inserts that are happening every x milliseconds. A composite index on (id,timestamp) would be better than a simple index.
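For illustration, using the column names from the question (the index name is an invented example, and value_id stands in for what this answer calls the id/product code):
CREATE INDEX idx_t_value_ts ON t (value_id, timestamp);

-- supports the "latest reading for one id" lookup ...
SELECT value FROM t WHERE value_id = 42 ORDER BY timestamp DESC LIMIT 1;

-- ... as well as time-range scans restricted to one id:
SELECT * FROM t WHERE value_id = 42 AND timestamp BETWEEN '2013-01-01' AND '2013-01-02';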
If you rarely need to sort across multiple products but are most often selecting based on a single product code, then a non-traditional DBMS that uses a hashed-key sparse table rather than a b-tree might be a very viable, even superior, alternative for you. In such a database, all of the records for a given key would be found physically on the same set of contiguous "pages"; the hashing algorithm looks at the key and returns the page number where the record will be found. There is no need to rebalance an index, as there isn't an index, and so you completely avoid the related scaling worries.
However, while hashed-file databases excel at low-overhead nearly instant retrieval based on a key value, they tend to be poor performers at sorting large groups of records on an attribute, because the data are not stored physically in any meaningful order, and gathering the records can involve much thrashing. In your case, timestamp would be that attribute. If I were in your shoes, I would base my decision on the cardinality of the id: in a dataset of a million records, how many DISTINCT ids would be found?
YET ANOTHER EDIT SINCE THE SITE IS NOT LETTING ME ADD ANOTHER ANSWER:
The simplest way is to have two tables: one with the ongoing history, which always has new values inserted, and the other containing only 250 records, one per part, where the latest value overwrites/replaces the previous one.
Update latest
set value = x
where id = ?
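One way to keep that second table current in MySQL is an upsert for each new reading (a sketch; the table and column names, and value_id being the primary key of latest, are assumptions):
INSERT INTO latest (value_id, ts, value)
VALUES (?, ?, ?)
ON DUPLICATE KEY UPDATE ts = VALUES(ts), value = VALUES(value);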
You have a choice of
indexes (composite; covering value_id, timestamp and value, or some combination of them): you should test performance with different indexes, composite and non-composite; also be aware that there are quite a few significantly different ways to get 'max per group' (search SO, especially the MySQL version with variables) - one common formulation is sketched after this list
triggers - you might use triggers to maintain max row values in another table (best performance for subsequent selects; this is redundant data and could even be kept in memory)
lazy statistics/triggers - since your database is updated quite often, you can save cycles if you update your statistics only periodically (if you can allow the stats to be y seconds old and you poll 1000 / x times a second, then you potentially save y * 1000 / x updates; this can be noticeable, especially in terms of scalability)
The above is true if you are looking for the last bit of performance; if not, keep it simple.
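For reference, one common 'max per group' formulation against the single-table layout from the question (only a sketch; other formulations may perform better depending on the MySQL version and indexes):
SELECT t.value_id, t.timestamp, t.value
FROM t
JOIN (SELECT value_id, MAX(timestamp) AS max_ts
      FROM t
      GROUP BY value_id) m
  ON m.value_id = t.value_id
 AND m.max_ts = t.timestamp;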

Is there a better/faster method locating a row with the maximum value in a column?

INFORMIX-SE 7.32:
I have a transaction table with about 5,000 rows. The transaction.ticket_number[INT] column gets updated with the next available sequential ticket number every time a specific row is updated. The column has a unique index. I'm currently using the following SELECT statement to locate the max(transaction.ticket_number):
SELECT MAX(transaction.ticket_number) FROM transaction;
Since the row being updated is clustered according to transaction.fk_id[INT], where it is joined to customer.pk_id[SERIAL], the row is not physically located at the end of the transaction table; rather, it resides within the group of transaction rows belonging to each particular customer. I chose to cluster the transactions belonging to each customer because response time is faster when I scroll through each customer's transactions. Is there a faster way of locating max(transaction.ticket_number) than with the above query? Would a 'unique index on transaction(ticket_number) descending' improve access, or is the index fully traversed from beginning to end regardless?
On a table of only 5000 rows on a modern machine, you are unlikely to be able to measure the difference in performance of the various techniques, especially in the single-user scenario which I believe you are facing. Even if the 5000 rows were all at the maximum permissible size (just under 32 KB), you would be dealing with 160 MB of data, which could easily fit into the machine's caches. In practice, I'm sure your rows are far smaller, and you'd never need all the data in the cache.
Unless you have a demonstrable performance problem, go with the index on the ticket number column and rely on the server (Informix SE) to do its job. If you have a demonstrable problem, show the query plans from SET EXPLAIN output. However, there are major limits on how much you can tweak SE performance - it is install-and-go technology with minimal demands on tuning.
I'm not sure whether Informix SE supports the 'FIRST n' (aka 'TOP n') notation that Informix Dynamic Server supports; I believe not.
Due to NULLABLE columns and other factors, use of indexes, etc., you can often find the following to be faster, but normally only negligibly so...
SELECT TOP 1 ticket_number FROM transaction ORDER BY ticket_number DESC
I'm also uncertain as to whether you actually have an Index on [ticket_number]? Or do you just have a UNIQUE constraint? A constraint won't help determine a MAX, but an INDEX will.
In the event that an INDEX exists with ticket_number as the first indexable column:
- An index seek/lookup would likely be used, not needing to scan the other values at all
In the event that an INDEX exists with ticket_number Not as the first indexable column:
- An index scan would likely occur, checking every single unique entry in the index
In the event that no usable INDEX exists:
- The whole table would be scanned
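If it turns out that ticket_number is only covered by a UNIQUE constraint and not an explicit index, the index the answers refer to would be something along these lines (a sketch only; the index name is made up):
CREATE UNIQUE INDEX ix_transaction_ticket ON transaction (ticket_number);

-- With ticket_number as the leading column of an index, the engine can resolve
-- this from the index alone rather than scanning the table:
SELECT MAX(ticket_number) FROM transaction;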