ScyllaDB: clustering key cartesian product size 600 is greater than maximum 100

I am using the DataStax Java driver to query ScyllaDB, and I see this error while reading data from Scylla:
RequestHandler: ip:9042 replied with server error (clustering key cartesian product size 600 is greater than maximum 100), defuncting connection.

This error is returned to prevent overly large restriction sets from being generated, which could put a strain on your server. The cartesian product here is the number of clustering-key combinations produced by your IN restrictions: for example, IN lists of 20 and 30 values on two clustering columns yield 20 × 30 = 600 combinations. If you're aware of the risks and know a reasonable upper bound on the number of restrictions in your queries, you can manually change the maximum in scylla.yaml, e.g. max_clustering_key_restrictions_per_query: 650. Note, however, that this option carries a warning in its description, which should be acknowledged:
Maximum number of distinct clustering key restrictions per query.
This limit places a bound on the size of IN tuples, especially when multiple
clustering key columns have IN restrictions. Increasing this value can result
in server instability.
In particular, setting this flag above a couple of hundred is risky - 600 should be alright, but at that point you could also consider rephrasing your queries so that they have fewer values in their IN restrictions - perhaps by splitting some queries into multiple smaller ones.
Source from Scylla tracker: https://github.com/scylladb/scylla/pull/4797

It depends on the data shape and concurrency. If your rows are large and the concurrency is high, it is easy to make Scylla run out of memory. If your rows are small and/or the concurrency is low, everything will be fine.
It's okay to increase the parameter value; just be aware that you're on dangerous ground, and you should still try to reduce the cartesian product sizes of your IN queries (see the sketch below).
The maximum value the option accepts is 1000000000.
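If you go the splitting route, here is a minimal sketch with the DataStax Java driver (the keyspace, table, column names, and chunk size are all hypothetical): instead of binding all 600 values into a single IN restriction, issue several smaller queries and collect the rows client-side.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;
import java.util.ArrayList;
import java.util.List;

public class SplitInQuery {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Hypothetical schema: PRIMARY KEY (pk, ck)
            PreparedStatement ps = session.prepare(
                "SELECT * FROM ks.events WHERE pk = ? AND ck IN ?");

            List<Integer> allCks = new ArrayList<>();
            for (int i = 0; i < 600; i++) allCks.add(i);

            // Stay under the default max_clustering_key_restrictions_per_query (100).
            int chunkSize = 100;
            List<Row> rows = new ArrayList<>();
            for (int from = 0; from < allCks.size(); from += chunkSize) {
                List<Integer> chunk =
                    allCks.subList(from, Math.min(from + chunkSize, allCks.size()));
                ResultSet rs = session.execute(ps.bind("some-pk", chunk));
                rs.forEach(rows::add);
            }
            System.out.println("fetched " + rows.size() + " rows");
        }
    }
}

Each chunked query stays under the default limit of 100 clustering-key restrictions, so no scylla.yaml change is needed; the trade-off is more round trips, which you could hide by running the chunks concurrently with executeAsync.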

Related

How to know the minimum cluster size in a BigQuery table?

I'm comparing the performance of clustering with that of partitioning.
Comparing a partitioned table with a clustered table, the amount of data accessed by a query on the clustered table is sometimes bigger than on the partitioned table (e.g., clustering 122.4MB vs partitioning 35.6MB).
I suspect this is due to a minimum data size per cluster.
Is there any way to know that limit? Or is there some other cause of the difference in accessed data size?
Edit
I found posts 1 and 2 by an ex-Googler.
Post 2 said that "each cluster of data in BigQuery has a minimum size.", and Post 1 said that "If you have less than 100MB of data per day, clustering won't do much for you".
From these posts, I inferred that the cause of the clustered table's larger scan size is a minimum size per cluster.
Clusters are not like partitions. In fact, there is no guarantee that there will be one cluster per column value (or, if you cluster on multiple columns, per combination of them). This is also why BigQuery cannot give you a good estimate of how much data a query will scan before running it, as it can for partitions. Different partitions, by contrast, are stored in separate blocks.
Also, consider that BigQuery performs auto-clustering (for free), which reorganizes all the clusters so that the table keeps efficient clustering. This is needed because inserts and deletes gradually skew the clusters, resulting in inefficient queries. A consequence is that the amount of data scanned by the same query can change, even when no data has been inserted or deleted, if BigQuery performed auto-clustering in between.
Another effect of this implementation is that a single table has a maximum number of partitions (4000), whereas there is no restriction on the number of distinct values used for clustering.
So, a single cluster in BigQuery may contain multiple clustering values, and the underlying clustered data blocks may change automatically due to auto-clustering.
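To see how much data a query against a clustered table is expected to scan, you can issue a dry run, which validates the query and reports estimated bytes without actually running it. A minimal sketch with the google-cloud-bigquery Java client (the project, dataset, table, and filter column are hypothetical):

import com.google.cloud.bigquery.*;

public class DryRunEstimate {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Hypothetical query against a clustered table; filtering on the
        // clustering column lets BigQuery prune data blocks.
        String sql = "SELECT COUNT(*) FROM `my_project.my_dataset.clustered_table` "
                   + "WHERE customer_id = 'abc'";

        // A dry run validates the query and returns statistics without running it.
        QueryJobConfiguration config = QueryJobConfiguration.newBuilder(sql)
                .setDryRun(true)
                .setUseQueryCache(false)
                .build();

        Job job = bigquery.create(JobInfo.of(config));
        JobStatistics.QueryStatistics stats = job.getStatistics();

        System.out.println("Estimated bytes processed: " + stats.getTotalBytesProcessed());
    }
}

Keep in mind that, as noted above, for clustered tables this estimate is an upper bound; the actual bytes billed can be lower once block pruning happens at run time, unlike partitioned tables, where pruning is known up front.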

Google Cloud Bigtable Update or Insert with Versioning

I'm wondering if I should use an update query to update my row data, or enable versioning with maxversions and just insert.
I understand it may depend on what kind of data I need to store, but I wanted to know if there is a performance difference between querying (selecting) data that has versioning versus data that doesn't, or between insert and update.
Performance is impacted by the size of the row and the amount of data returned from the server.
Bigtable has to read an entire row for every request. That will be a limiting factor on reads. At some size (100s+ of MB), systemic performance will degrade any time the tablet with that row is loaded. When the row size reaches GBs, you'll have major problems.
At query time, performance is also impacted by how much data is returned from the server. You can still get decent performance at the lower range of "large rows" if you limit your Get or Scan to a small subset of the row. Limits such as cells per row, and/or retrieving only a few qualifiers, will help with the network costs.
In general, it's better to keep your rows smaller if you can. That is generally done with a combination of "insert" and some sort of age/version restriction on the column family.
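As an illustrative sketch of that combination (the project, instance, table, family, and row key are hypothetical), here is how you might set a max-versions garbage-collection rule at table creation and then read back only the newest cell per column, using the google-cloud-bigtable Java client:

import com.google.cloud.bigtable.admin.v2.BigtableTableAdminClient;
import com.google.cloud.bigtable.admin.v2.models.CreateTableRequest;
import com.google.cloud.bigtable.admin.v2.models.GCRules;
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.Filters;
import com.google.cloud.bigtable.data.v2.models.Row;

public class VersionedReadSketch {
    public static void main(String[] args) throws Exception {
        String projectId = "my-project";   // hypothetical
        String instanceId = "my-instance"; // hypothetical

        // One-time setup: keep at most 3 versions per cell; older
        // versions are removed by garbage collection.
        try (BigtableTableAdminClient admin =
                 BigtableTableAdminClient.create(projectId, instanceId)) {
            admin.createTable(
                CreateTableRequest.of("events")
                    .addFamily("cf", GCRules.GCRULES.maxVersions(3)));
        }

        // Read only the newest cell per column to limit the data sent back.
        try (BigtableDataClient data =
                 BigtableDataClient.create(projectId, instanceId)) {
            Row row = data.readRow("events", "row-key-1",
                Filters.FILTERS.limit().cellsPerColumn(1));
            System.out.println(row == null ? "not found" : row.getCells());
        }
    }
}

With this setup you simply insert new cells; old versions are trimmed by garbage collection, and the read filter keeps responses small even before GC has run.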

AWS Redshift column limit?

I've been doing some load testing of AWS Redshift for a new application, and I noticed that it has a column limit of 1600 per table. Worse, queries slow down as the number of columns increases in a table.
What doesn't make any sense here is that Redshift is supposed to be a column-store database, so in theory there shouldn't be an I/O hit from columns that are not used in a particular where clause.
More specifically, when TableName has 1600 columns, I found that the query below is substantially slower than if TableName had, say, 1000 columns and the same number of rows. As the number of columns decreases, performance improves.
SELECT COUNT(1) FROM TableName
WHERE ColumnName LIKE '%foo%'
My three questions are:
What's the deal? Why does Redshift have this limitation if it claims to be a column store?
Any suggestions for working around this limitation? Joins of multiple smaller tables seem to eventually approximate the performance of a single table. I haven't tried pivoting the data.
Does anyone have a suggestion for a fast, real-time performance, horizontally scalable column-store database that doesn't have the above limitations? All we're doing is count queries with simple where restrictions against approximately 10M (rows) x 2500 (columns) data.
I can't explain precisely why it slows down so much but I can verify that we've experienced the same thing.
I think part of the issue is that Redshift stores a minimum of 1MB per column per node. Having a lot of columns creates a lot of disk-seek activity and I/O overhead:
1MB blocks are problematic because most of each block will be empty space, yet it is still read off the disk.
Having lots of blocks means that column data is not located as close together, so Redshift has to do a lot more work to find it.
Also (it just occurred to me), I suspect that Redshift's MVCC controls add a lot of overhead. It tries to ensure you get a consistent read while your query is executing, and presumably that requires making a note of all the blocks for the tables in your query, even blocks for columns that are not used. See: Why is an implicit table lock being released prior to end of transaction in RedShift?
FWIW, our columns were virtually all BOOLEAN, and we've had very good results from compacting them (bit-masking) into INT/BIGINTs and accessing the values with the bit-wise functions. One example table went from 1400 columns (~200GB) to ~60 columns (~25GB), and the query times improved more than 10x (from 30-40 seconds down to 1-2 seconds).
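To make the bit-masking idea concrete (the up-to-64-flag layout is hypothetical), here is a small Java sketch of the packing step: each row's BOOLEAN columns become bits of one BIGINT value, and a query can then test an individual flag with a bitwise AND such as (flags & 4) > 0.

import java.util.List;

public class FlagPacker {
    // Packs up to 64 boolean flags into a single long (one BIGINT column).
    // Bit i is set when flags.get(i) is true.
    static long pack(List<Boolean> flags) {
        if (flags.size() > 64) {
            throw new IllegalArgumentException("at most 64 flags per BIGINT");
        }
        long packed = 0L;
        for (int i = 0; i < flags.size(); i++) {
            if (flags.get(i)) {
                packed |= 1L << i;
            }
        }
        return packed;
    }

    // True when flag i is set -- the same test a query expresses with
    // a bitwise AND, e.g. (flags & (1 << i)) > 0.
    static boolean isSet(long packed, int i) {
        return (packed & (1L << i)) != 0;
    }

    public static void main(String[] args) {
        long packed = pack(List.of(true, false, true)); // bits 0 and 2 set
        System.out.println(packed);           // 5
        System.out.println(isSet(packed, 2)); // true
    }
}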

Is altering the Page Size in SQL Server the best option for handling "Wide" Tables?

I have multiple tables in my application that are both very wide and very tall. The width comes from sometimes 10-20 columns with a variety of datatypes: varchar/nvarchar as well as char/bigint/int/decimal. My understanding is that the default page size in SQL Server is 8k, but can be manually changed. Also, that varchar/nvarchar columns are exempt from this restriction and are often (always?) moved to a separate location, a process called row overflow. Even so, the MS documentation states that row-overflowed data will degrade performance: "querying and performing other select operations, such as sorts or joins on large records that contain row-overflow data slows processing time, because these records are processed synchronously instead of asynchronously".
They recommend moving large columns into joinable metadata tables: "This can then be queried in an asynchronous JOIN operation".
My question is: is it worth enlarging the page size to accommodate the wide columns, and are there other performance problems that'd come up? If instead I partitioned each table into one or more metadata tables, and the tables got "big", as in the 100MM-record range, wouldn't the cost of joining the partitioned tables far outweigh the benefits? Also, if the SQL Server is on a single-core machine (or on SQL Azure), my understanding is that parallelism is disabled, so would that also eliminate the benefit of splitting the tables into partitions, given that the join would no longer be asynchronous? Any other strategies that you'd recommend?
EDIT: Per the great comments below and some additional reading (that I should've done originally), you cannot manually alter the SQL Server page size. Also, related SO post: How do we change the page size of SQL Server?. There is an additional great answer there from @remus-rusanu.
You cannot change the page size.
varchar(x) and (MAX) values are moved off-row when necessary - that is, when there isn't enough space on the page itself. If you have lots of large values, it may indeed be more effective to move them into other tables and then join them onto the base table - especially if you're not always querying for that data.
There is no concept of synchronously and asynchronously reading that off-row data. When you execute a query, it's run synchronously. You may have parallelization but that's a completely different thing, and it's not affected in this case.
Edit: To give you more practical advice, you'll need to show us your schema and some realistic data characteristics.
My understanding is that the default page size in SQL is 8k, but can be manually changed
The 'large pages' setting refers to memory allocations, not to changing the database page size. See SQL Server and Large Pages Explained. I'm afraid your understanding is a little off.
As general, non-specific advice: for wide fixed-length columns the best strategy is to deploy row compression. For nvarchar, Unicode compression can help a lot. For specific advice, you need to measure. What is the exact performance problem you encountered? How did you measure it? Did you use a methodology like Waits and Queues to identify the bottlenecks, and are you positive that row size and off-row storage are the issue? It seems to me that you used the other 'methodology'...
You can't change the default 8k page size.
varchar and nvarchar are treated like any other field unless they are (max), in which case they are stored a little differently, because they can extend past the size of a page - something no other datatype can do.
For example, if you try to execute this statement:
create table test_varchars(
    a varchar(8000),
    b varchar(8001),
    c nvarchar(4000),
    d nvarchar(4001)
)
Columns a and c are fine because both of them are at most 8000 bytes in length.
But you would get the following errors on columns b and d:
The size (8001) given to the column 'b' exceeds the maximum allowed for any data type (8000).
The size (4001) given to the parameter 'd' exceeds the maximum allowed (4000).
because both of them exceed the 8000-byte limit. (Remember that the n in front of varchar or char means Unicode, and each character occupies double the space.)

Is there a practical limit to the number of elements in a sorted set in redis?

I'm currently migrating some data to Redis and I'm considering using a sorted set to store approximately 1.4e6 items (with associated scores/counts). Is this number of items in a set likely to exceed a practical limit, making it too painful to use the set? I plan on running 64 bit redis, so available memory for the data should not be a problem. Does anyone have experience with a sorted set this size? If so, how are your insertion and query times for the set?
It depends what you want to do with the set. The simple operations are mostly O(log n), which means they take only twice as long for a million-item set as they do for a thousand-item set. Unless you have something seriously broken in your config, like a memory limit smaller than the set, performance shouldn't be a problem.
Where you need to be careful is with operations on multiple sets, particularly union - that will take a thousand times longer for the million-item set. In practical terms this isn't necessarily a problem though - either it will be fast enough for your purposes anyway (Redis has commands documented as too slow for production use that are still best measured in milliseconds), or you can adjust the order of operations to avoid running union on really large sets.
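For a feel of the basic operations, here is a minimal sketch with the Jedis Java client (the key name, host, and members are hypothetical). ZADD and ZSCORE stay O(log n) no matter how large the set grows, while a union scales with the total number of members:

import redis.clients.jedis.Jedis;

public class SortedSetSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // O(log n) insert per member, even with millions of members.
            jedis.zadd("email_counts", 42, "alice@example.com");
            jedis.zadd("email_counts", 7, "bob@example.com");

            // O(log n) score lookup for a single member.
            Double score = jedis.zscore("email_counts", "alice@example.com");
            System.out.println("score: " + score);

            // O(log n + m) range query: top 10 members by score.
            System.out.println(jedis.zrevrange("email_counts", 0, 9));

            // ZUNIONSTORE, by contrast, scales with the total number of
            // members -- the kind of call to avoid on very large sets.
        }
    }
}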
Our site has a sorted set with about 2 million items (email addresses) with integer scores, and it took up about 320MB of memory.