I want to see every value for every user in a time period.
Should my primary key be (user_id, timestamp), such that I hit every node and let the clustering key window things down?
Or should my primary key be (day_of_year, timestamp), such that my partition key finds a subset of nodes, and I use the timestamp clustering key to get more fine-grained control over the time period?
You need to estimate how much data a partition created by (day_of_year, timestamp) will hold. If the partition size goes over 100 MB, that might bring you problems in the future, for example during repairs. So, if you are at risk of going over 100 MB for partitions like that, you should go for (user_id, timestamp). It will also distribute the effort across the nodes instead of concentrating it on just one.
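For reference, the two candidate layouts might look something like this (a sketch; the table and column names are hypothetical, and in the second table user_id is included as a final clustering column so that two users writing at the same instant don't overwrite each other):

CREATE TABLE values_by_user (
user_id UUID,
ts TIMESTAMP,
value TEXT,
PRIMARY KEY (user_id, ts)
);

CREATE TABLE values_by_day (
day_of_year INT,
ts TIMESTAMP,
user_id UUID,
value TEXT,
PRIMARY KEY (day_of_year, ts, user_id)
);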
To get an idea of the partition size you can run nodetool cfstats. In the output, check the value for Compacted partition maximum bytes. It's not guaranteed to be the largest partition everywhere, but it will tell you the largest partition size that was compacted on the node where you are running the command.
I have a table with timeuuid as a clustering key.
CREATE TABLE event (
domain TEXT,
createdAt TIMEUUID,
kind TEXT,
PRIMARY KEY (domain, createdAt)
);
I wish to select the data in order of this clustering key with the following guarantee: if I have selected something, there will be NO inserts before those records in the future (so I can iterate through the records, checking what new events have happened, without the risk of skipping any).
SELECT kind FROM event WHERE domain = ? AND createdAt > lastCreatedAtWeAreAwareOf
If I generate the timeuuid on the client and insert into Scylla in parallel, it's technically possible that a recent timeuuid will get inserted before several older ones (say, due to some networking issue), and I might miss those records in my selects.
What are possible ways to resolve this?
I tried using the currentTimeUUID function and it seems to work (monotonically increasing within the same partition key), but it creates a lot of duplicates (20-40 per partition key), i.e. I end up with lots of records with exactly the same currentTimeUUID. I would really like a way to avoid duplicates; they complicate the select process and consume unnecessary resources.
I'm also curious: is there a threat of backward clock jumps when using the currentTimeUUID function?
EDITED
It seems that there is a bug in Scylla where currentTimeUUID always generates duplicates for writes done at the same time through the same coordinator. I created an issue here. Thanks for bringing this up.
PREVIOUS ANSWER BELOW
If I generate the timeuuid on the client and insert into Scylla in parallel, it's technically possible that a recent timeuuid will get inserted before several older ones (say, due to some networking issue), and I might miss those records in my selects.
Just to clarify, all writes will be stored in the right order. There will be a point in time when you will be able to read old enough writes in the right order. This means that one possible solution would be to make sure the select does not query data that is too recent, thus leaving a window for 'late' writes to arrive and take their place in line. For example, you could use a select like this:
SELECT kind FROM event WHERE domain = ? AND createdAt > lastCreatedAtWeAreAwareOf AND createdAt < now() - 30s
I don't know whether it's OK for you to impose such a delay, though. This strategy won't give you full certainty, because any writes delayed by more than 30s will still be missed.
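One concrete way to express that cutoff is to compute "now minus 30 seconds" on the client as a timestamp and wrap it in minTimeuuid(), which yields the smallest possible timeuuid for that instant (a sketch; the literal below is just an example value):

SELECT kind FROM event
WHERE domain = ?
AND createdAt > lastCreatedAtWeAreAwareOf
AND createdAt < minTimeuuid('2015-12-02 02:59:30+0000');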
I tried using the currentTimeUUID function and it seems to work (monotonically increasing within the same partition key), but it creates a lot of duplicates (20-40 per partition key), i.e. I end up with lots of records with exactly the same currentTimeUUID. I would really like a way to avoid duplicates; they complicate the select process and consume unnecessary resources.
You can reduce the chances of clustering key duplications by introducing additional clustering key column like:
CREATE TABLE event (
domain TEXT,
createdAt TIMEUUID,
randomBit UUID,  -- or INT
kind TEXT,
PRIMARY KEY (domain, createdAt, randomBit)
);
and generate a value for it on the client in some good random way. Maybe there's some aspect of the record that you know is guaranteed to be unique and could be used as the clustering key column; that would work better than a random field.
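A minimal sketch of what an insert could look like with the extra column, assuming you keep currentTimeUUID() for the time component and fill the tiebreaker either with a client-generated random UUID or, as shown here, with the built-in uuid() function (the literal values are purely illustrative):

INSERT INTO event (domain, createdAt, randomBit, kind)
VALUES ('example.com', currentTimeUUID(), uuid(), 'page_view');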
I have a table UNITARCHIVE partitioned by date and clustered by UNIT, DUID.
The total size of the table is 892 MB.
When I try this query:
SELECT * FROM `test-187010.ReportingDataset.UNITARCHIVE` WHERE duid="RRSF1" and unit="DUNIT"
BigQuery tells me it will process 892 MB. I thought clustering was supposed to reduce the scanned size; I understand that when I filter by date the size is reduced dramatically, but I need the whole date range.
Is this by design, or am I doing something wrong?
To get the most benefits out of clustering, each partition needs to have a certain amount of data.
For example, if the minimum size of a cluster is 100MB (decided internally by BigQuery), and you have only 100MB of data per day, then querying 100 days will scan 100*100MB - regardless of the clustering strategy.
As an alternative with this amount of data: instead of partitioning by day, partition by year. Then you'll get the most benefit out of clustering even with a low amount of data per day.
See Partition by week/year/month to get over the partition limit? for a reference table that shows this off.
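A sketch of what the year-partitioned, clustered variant could look like in BigQuery DDL, assuming the table has a DATE column (called settlement_date here purely for illustration):

CREATE TABLE `test-187010.ReportingDataset.UNITARCHIVE_BY_YEAR`
PARTITION BY DATE_TRUNC(settlement_date, YEAR)
CLUSTER BY unit, duid
AS
SELECT * FROM `test-187010.ReportingDataset.UNITARCHIVE`;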
If the table contains lots of different partition-by ids (10K+) and the size of the table is growing (millions of records), will it run into an out-of-memory issue?
When this query runs, does the system need to keep the partition windows (which would be 10K of them) in memory, as well as the row_number for each partition?
It is safe to use. How it will perform depends. If there is a "POC index" (Partition, Order by, Cover), i.e. an index with your PARTITION BY column as the first key column, then your ORDER BY column(s) as the following key column(s), and the selected columns as included columns, it will be the best option for this particular query. But you should consider the pros and cons of such an index. If there is no such index (or something similar), the query will be heavier - think table scans, spills to tempdb, etc. What the impact on your server will be, you should test to see.
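As an illustration only (the table and column names are hypothetical): for a query using ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY CreatedAt) that selects Amount, the POC index would look like:

CREATE NONCLUSTERED INDEX IX_Orders_POC
ON dbo.Orders (CustomerId, CreatedAt)   -- P: partition by column, O: order by column
INCLUDE (Amount);                       -- C: covered (selected) columns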
Current Scenario
Datastore used: DynamoDB.
DB size: 15-20 MB
Problem: for storing data, I am thinking of using a common hash value as the partition key (and a timestamp as the sort key), so that the complete table is stored in a single partition only. This would give me undivided throughput for the table.
But I also intend to create GSIs for querying, so I was wondering whether it would be wrong to use GSIs with a single partition. I could also use local secondary indexes.
Is this the wrong approach?
Under the hood, a GSI is basically just another DynamoDB table. It follows the same partitioning rules as the main table. Partitions in your primary table are not correlated with the partitions of your GSIs. So it doesn't matter whether your table has a single partition or not.
Using a single partition in DynamoDB is a bad architectural choice overall, but I would argue that for a 20 MB database it doesn't matter too much.
DynamoDB manages table partitioning for you automatically, adding new partitions if necessary and distributing provisioned throughput capacity evenly across them.
Which partition an item goes to can't be controlled if the partition key values are different.
I guess what you are going to do is use the same partition key value for all the items, with different sort key values (timestamps). In that case, I believe the data will be stored in a single partition, though I didn't understand your point regarding undivided throughput.
If you want to keep all the items of the index in a single partition, I think an LSI (Local Secondary Index) would be best suited here. An LSI is basically an alternate sort key for the same partition key.
A local secondary index maintains an alternate sort key for a given partition key value.
If your single-partition rule is not applicable to the index and you want a different partition key, then you need a GSI.
I have N client machines. I want to load each machine with a distinct partition of a BRIN index.
That requires me to:
create a BRIN index with a predefined number of partitions, equal to the number of client machines
send queries from the clients that filter with WHERE on a BRIN partition identifier instead of on the indexed column
The main goal is to improve performance when loading a single table from Postgres into distributed client machines, keeping an equal number of rows between the clients - or close to equal if the row count does not divide evenly by the machine count.
I can currently achieve this by maintaining an extra column which chunks my table into a number of buckets equal to the number of client machines (or by using row_number() over (order by datetime) % N on the fly). But that is not efficient in time or memory, and the BRIN index looks like a nice feature which could speed up such use cases.
Minimal reproducible example for 3 client machines:
CREATE TABLE bigtable (datetime TIMESTAMPTZ, value TEXT);
INSERT INTO bigtable VALUES ('2015-12-01 00:00:00+00'::TIMESTAMPTZ, 'txt1');
INSERT INTO bigtable VALUES ('2015-12-01 05:00:00+00'::TIMESTAMPTZ, 'txt2');
INSERT INTO bigtable VALUES ('2015-12-02 02:00:00+00'::TIMESTAMPTZ, 'txt3');
INSERT INTO bigtable VALUES ('2015-12-02 03:00:00+00'::TIMESTAMPTZ, 'txt4');
INSERT INTO bigtable VALUES ('2015-12-02 05:00:00+00'::TIMESTAMPTZ, 'txt5');
INSERT INTO bigtable VALUES ('2015-12-02 16:00:00+00'::TIMESTAMPTZ, 'txt6');
INSERT INTO bigtable VALUES ('2015-12-02 23:00:00+00'::TIMESTAMPTZ, 'txt7');
Expected output:
client 1
2015-12-01 00:00:00+00, 'txt1'
2015-12-01 05:00:00+00, 'txt2'
2015-12-02 02:00:00+00, 'txt3'
client 2
2015-12-02 03:00:00+00, 'txt4'
2015-12-02 05:00:00+00, 'txt5'
client 3
2015-12-02 16:00:00+00, 'txt6'
2015-12-02 23:00:00+00, 'txt7'
The question:
How can I create a BRIN index with a predefined number of partitions, and run queries that filter on partition identifiers instead of filtering on the indexed column?
Optionally: is there any other way that BRIN (or other Postgres goodies) can speed up the task of loading multiple clients in parallel from a single table?
It sounds like you want to shard a table over many machines, and have each local table (one shard of the global table) have a BRIN index with exactly one bucket. But that does not make any sense. If the single BRIN index range covers the entire (local) table, then it can never be very helpful.
It sounds like what you are looking for is partitioning, with CHECK constraints that can be used for partition exclusion. PostgreSQL has supported that for a long time with table inheritance (although not with each partition on a separate machine). Using this method, the range covered by the CHECK constraint has to be set explicitly for each partition. This ability to explicitly specify the bounds sounds like exactly what you are looking for, just using a different technology.
But the partition-exclusion constraint code doesn't work well with modulus. The code is smart enough to know that WHERE id=5 only needs to check the CHECK (id BETWEEN 1 AND 10) partition, because it knows that id=5 implies that id is between 1 and 10. More accurately, it knows the contrapositive of that.
But the code was never written to know that WHERE id=5 implies that id%10 = 5%10, even though humans know that. So if you build your partitions on modulus operators, like CHECK (id%10=5), rather than on ranges, you would have to sprinkle all your queries with WHERE id = $1 AND id % 10 = $1 % 10 if you wanted them to take advantage of the constraints.
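A minimal sketch of that inheritance-based approach, reusing the id ranges from the example above (the table names are hypothetical):

CREATE TABLE items (id int, payload text);

CREATE TABLE items_p1 (CHECK (id BETWEEN 1 AND 10)) INHERITS (items);
CREATE TABLE items_p2 (CHECK (id BETWEEN 11 AND 20)) INHERITS (items);

-- with constraint exclusion enabled, a query on the parent skips items_p2 entirely
SET constraint_exclusion = partition;
SELECT * FROM items WHERE id = 5;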
Going by your description and comments, I'd say you're looking in the wrong direction. You want to split the table upfront so access will be fast and simple, but without actually having to split things upfront, because that would require knowing the number of nodes upfront, which is variable if I understand correctly. And regardless, it takes quite a bit of processing to split things, too.
To be honest, I'd go about your problem differently. Instead of assigning every record to a bucket, I'd suggest assigning every record a pseudo-random value in a given range. I don't know about Postgres, but in MSSQL I'd use BINARY_CHECKSUM(NewID()) instead of Rand(), the main reason being that the random function is harder to use in a set-based way there. Instead you could also use some hashing code that returns a reasonably large working space. Anyway, in my MSSQL situation the resulting value would be a signed integer sitting somewhere in the range -2^31 to +2^31 (give or take; check the documentation for the exact boundaries!). As such, when the master machine decides to assign n client machines, each machine can be given an exact range that, given the properties of the randomizer/hashing algorithm, will cover a reasonably close approximation of the workload divided by n.
Assuming you have an index on the selection field, this should be reasonably fast, regardless of whether you decide to split the table into a thousand or a million chunks.
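In Postgres terms, a rough sketch of that idea (the rnd column name and the count of 3 clients are assumptions; random() stands in for BINARY_CHECKSUM(NewID())):

ALTER TABLE bigtable ADD COLUMN rnd double precision;
UPDATE bigtable SET rnd = random();          -- pseudo-random value in [0, 1), evaluated per row
CREATE INDEX ON bigtable (rnd);

-- with 3 clients, client k takes the range [k/3, (k+1)/3); e.g. client 0:
SELECT datetime, value FROM bigtable WHERE rnd >= 0.0 AND rnd < 1.0/3;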
PS: mind that this approach will only work 'properly' if the number of rows to process (greatly) outnumbers the number of machines doing the processing. With small numbers you might see some machines getting nothing while others do all the work.
Basically, all you need to know is the size of the relation after loading; then the pages_per_range storage parameter should be set to the divisor that gives you the desired number of partitions.
There is no need to introduce an artificial partition ID, because there is support for sufficient types and operators. Physical table layout is important here, so if you insist on the partition ID being the key, and end up introducing an out-of-order mapping between the natural load order and the artificial partition ID, make sure you cluster the table on that column's sort order before creating the BRIN index.
However, at the same time, remember that more discrete values have a better chance of hitting the index than fewer, so higher cardinality is better - an artificial partition identifier will have 1/n the cardinality of a natural key, where n is the number of distinct values per partition.
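For the example table, a sketch of that could look like this (the page count and divisor are purely illustrative):

-- if the loaded table occupies roughly 3,000 heap pages and there are 3 clients,
-- pages_per_range = 1000 yields about one BRIN range per client
CREATE INDEX bigtable_brin ON bigtable USING brin (datetime) WITH (pages_per_range = 1000);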
More here and here.