How to avoid the maximum storage limit of Cosmos DB for Scalar DL's asset records?

I plan to store asset records in Scalar DL on Azure Cosmos DB.
Cosmos DB limits the maximum storage across all items in a (logical) partition to 20 GB.
Scalar DL uses the asset ID as the partition key on Azure Cosmos DB, so every version of the same asset is stored in the same partition. I therefore think there is an effective limit on the size of a single asset on Azure Cosmos DB.
Can we avoid this limitation?
Does Scalar DL have a feature to chain to another asset?
See also: https://learn.microsoft.com/en-us/azure/cosmos-db/concepts-limits

The short answer is No.
First, most database systems have similar limitations.
For example, a DynamoDB partition is also limited to 10 GB.
A Cassandra partition can hold up to 2 billion cells, but it is recommended to keep partitions smaller than 100 MB for performance reasons (see https://www.instaclustr.com/cassandra-data-partitioning/).
So it's always good practice to model your assets so that they don't grow too large.
If you know what you are doing and there is no other option than having a large asset (partition), split your asset into multiple assets. (e.g., if there is an asset with the ID Asset-A, create Asset-A-1, Asset-A-2, ..., Asset-A-M)
In that case, the application also has to manage how the asset is split.
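For illustration, here is a minimal Python sketch of such application-side splitting. Everything in it is an assumption: put_record/get_records are hypothetical stand-ins for your actual ScalarDL contract calls, and the number of sub-assets and the ID scheme are entirely up to the application.

import zlib

NUM_SUB_ASSETS = 4  # "M": how many sub-assets one logical asset is split into

# Hypothetical in-memory store standing in for your ScalarDL contract calls;
# replace put_record/get_records with your real client code.
_store: dict = {}

def put_record(asset_id: str, payload: dict) -> None:
    _store.setdefault(asset_id, []).append(payload)

def get_records(asset_id: str) -> list:
    return _store.get(asset_id, [])

def sub_asset_id(asset_id: str, record_key: str) -> str:
    # Route each record deterministically to one of the M sub-assets
    # (crc32 is stable across runs, unlike Python's built-in hash()).
    bucket = zlib.crc32(record_key.encode()) % NUM_SUB_ASSETS
    return f"{asset_id}-{bucket + 1}"   # e.g. Asset-A-1 ... Asset-A-4

def append_record(asset_id: str, record_key: str, payload: dict) -> None:
    put_record(sub_asset_id(asset_id, record_key), payload)

def read_all(asset_id: str) -> list:
    # The application has to remember the split and merge the sub-assets on read.
    records = []
    for i in range(1, NUM_SUB_ASSETS + 1):
        records.extend(get_records(f"{asset_id}-{i}"))
    return records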

Related

How to know the minimum cluster size in a bigquery table?

I'm comparing the performance of clustering with that of partitioning.
When I compare a partitioned table with a clustered table, the amount of data scanned for the clustered table is sometimes larger than for the partitioned table (e.g., clustering 122.4 MB vs partitioning 35.6 MB).
I expect this is due to a lower bound on a cluster's size.
Is there any way to know that limit? Or is there some other cause for the difference in the amount of data scanned?
Edit
I found the posts 1, 2 by an ex-Googler.
Post 2 says that "each cluster of data in BigQuery has a minimum size.", and Post 1 says that "If you have less than 100MB of data per day, clustering won't do much for you".
From these posts, I infer that the cause of the larger scan size for the clustered table is the minimum size of a cluster.
Clusters are not like partitions. In fact, there is no guarantee that there will be one cluster per column value (or, if you cluster on multiple columns, per combination of values). This is also why BigQuery cannot give you a good estimate of how much data a query will scan before running it (as it does for partitions). Different partitions, on the other hand, are stored in different blocks.
Also, consider that BigQuery performs auto-clustering (for free), which rewrites the clusters. This is done so that the table ends up with more efficient clusters; it is needed because inserts and deletes gradually skew the clusters, which makes queries inefficient. As a result, the amount of data scanned by the same query can change even if no data has been inserted or deleted, if BigQuery performed auto-clustering in between.
Another effect of this implementation is that a single table has a maximum number of partitions (4,000), whereas there is no such restriction on the number of distinct clustering keys.
So a single cluster in BigQuery may contain multiple clustering values, and the underlying clustered data blocks may change automatically due to auto-clustering.
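One practical way to compare the two tables is to look at BigQuery's scan estimate and then at what a real run actually scanned. A small sketch using the google-cloud-bigquery Python client (the project, dataset, table, and column names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Dry run: BigQuery returns an estimate of the bytes the query would scan
# without executing it. For clustered tables this is an upper bound,
# because cluster pruning is only applied at execution time.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = """
    SELECT *
    FROM `my_project.my_dataset.my_clustered_table`   -- placeholder table
    WHERE customer_id = 'C-123'                        -- clustering column
"""

job = client.query(sql, job_config=job_config)
print("Estimated bytes processed:", job.total_bytes_processed)

# To see what was actually scanned, run the query for real (dry_run=False)
# and read total_bytes_processed / total_bytes_billed from the finished job.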

Oracle SQL database In-Memory - compare compressions sizes

I'm playing with In-Memory storage in Oracle SQL. I would like to compare the results of the different compression options, i.e., the amount of space used. For example, I'm running these queries:
ALTER TABLE RENTING INMEMORY MEMCOMPRESS FOR QUERY LOW(RETURN_DATE);
vs
ALTER TABLE RENTING INMEMORY MEMCOMPRESS FOR CAPACITY HIGH(RETURN_DATE);
Is there any easy way to check the size used by these compression settings in SQL Developer?
I found this article https://blogs.oracle.com/in-memory/database-in-memory-compression, which contains a table with the 'space used' for each type of compression. This is exactly what I am trying to do on my own. Thanks for any advice.
Querying v$im_segments after population will show you how many bytes from the table were loaded and how much of the in-memory store was utilised.
Since the column space is part of the In-Memory Compression Units (IMCU), there is no way to see how much space is consumed by individual columns. It is possible to display the individual column level compression setting in the view v$im_column_level though. The closest you could come would be to compare the populated size between the two compression levels. As Connor said, you can do this with v$im_segments or you can display individual IMCU information for an object with the view v$im_header.
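For example, a small script along these lines would let you record the populated size under each compression setting and compare them. This is a sketch using the python-oracledb driver; the connection details and the RENTING segment name are taken from the question, everything else is an assumption:

import oracledb

# Connection details are placeholders; adjust for your environment.
conn = oracledb.connect(user="scott", password="tiger", dsn="localhost/orclpdb1")

sql = """
    SELECT segment_name,
           populate_status,
           bytes,                -- on-disk size of the segment
           inmemory_size,        -- space used in the In-Memory column store
           bytes_not_populated   -- portion not yet populated
    FROM   v$im_segments
    WHERE  segment_name = :name
"""

with conn.cursor() as cur:
    cur.execute(sql, name="RENTING")
    for seg, status, disk_bytes, im_bytes, not_populated in cur:
        print(seg, status, "disk =", disk_bytes,
              "inmemory =", im_bytes, "not_populated =", not_populated)

# Run this once after populating with MEMCOMPRESS FOR QUERY LOW and again
# after repopulating with MEMCOMPRESS FOR CAPACITY HIGH, then compare the
# inmemory_size values.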

S3 partition (file size) for effecient Athena query

I have a pipeline that loads daily records into S3. I then use an AWS Glue Crawler to create partitions to facilitate AWS Athena queries. However, one partition is much larger than the others.
S3 folders/files are displayed as follows:
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/00/00/2019-00-00.parquet.gzip') 7.8 MB
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/01/11/2019-01-11.parquet.gzip') 29.8 KB
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/01/12/2019-01-12.parquet.gzip') 28.5 KB
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/01/13/2019-01-13.parquet.gzip') 29.0 KB
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/01/14/2019-01-14.parquet.gzip') 43.3 KB
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/01/15/2019-01-15.parquet.gzip') 139.9 KB
with the file size displayed at the end of each line. Note that 2019-00-00.parquet.gzip contains all records before 2019-01-11, which is why it is so large. I have read this and it says that "If your data is heavily skewed to one partition value, and most queries use that value, then the overhead may wipe out the initial benefit."
So I wonder whether I should split 2019-00-00.parquet.gzip into smaller parquet files under different partitions. For example,
key='database/table/2019/00/00/2019-00-01.parquet.gzip',
key='database/table/2019/00/00/2019-00-02.parquet.gzip',
key='database/table/2019/00/00/2019-00-03.parquet.gzip', ......
However, I suppose this partitioning is not very useful, as it does not reflect when the old records were stored. I am open to all workarounds. Thank you.
If the full size of your data is less than a couple of gigabytes in total, you don't need to partition your table at all. Partitioning small datasets hurts performance much more than it helps. Keep all the files in the same directory; deep directory structures in unpartitioned tables also hurt performance.
For small datasets you'll be better off without partitioning, as long as there aren't too many files (try to keep it below a hundred). If for some reason you must have lots of small files, you might get benefits from partitioning, but benchmark it in that case.
When the size of the data is small, as in your case, the overhead of finding the files on S3, opening them, and reading them will be higher than actually processing them.
If your data grows to hundreds of megabytes you can start thinking about partitioning, and aim for a partitioning scheme where partitions are around a hundred megabytes to a gigabyte in size. If there is a time component to your data, which there seems to be in your case, time is the best thing to partition on. Start by looking at using year as the partition key, then month, and so on. Exactly how to partition your data depends on your query patterns, of course.
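To decide whether partitioning is worth it at all, you can total up the data under the table prefix. A quick sketch with boto3, reusing the bucket and key layout from the listing above:

import boto3
from collections import defaultdict

s3 = boto3.resource("s3")
bucket = s3.Bucket("bucket")          # bucket name from the listing above

total = 0
per_day = defaultdict(int)

# Sum object sizes under the table prefix, grouped by the date part of the key.
for obj in bucket.objects.filter(Prefix="database/table/"):
    total += obj.size
    day_prefix = "/".join(obj.key.split("/")[2:5])   # e.g. "2019/01/11"
    per_day[day_prefix] += obj.size

print(f"Total size: {total / 1024**2:.1f} MiB")
for day, size in sorted(per_day.items()):
    print(f"{day}: {size / 1024:.1f} KiB")

# If the total stays well under a few hundred MB, keeping everything in one
# unpartitioned prefix is likely the faster option for Athena.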

Structuring a large DynamoDB table with many searchable attributes?

I've been struggling with the best way to structure my table. It's intended to hold many, many GBs of data (I haven't been given a more detailed estimate). The table will hold claims data (example here), with resourceType as the partition key and id as the sort key (although these could potentially be changed). The end user should be able to search by a number of attributes (institution, provider, payee, etc., totaling ~15).
I've been toying with combining global and local secondary indexes in order to achieve this functionality on the backend. What would be the best way to structure the table to allow a user to search the data by one or more of these attributes in essentially any combination?
If you use resourceType as a partition key you are essentially throwing away the horizontal scaling features that DynamoDB provides out of the box.
The reason to partition your data is such that you distribute it across many nodes in order to be able to scale without incurring a performance penalty.
It sounds like you're looking to put all claim documents into a single partition so you can do "searches" by arbitrary attributes.
You might be better off combining your DynamoDB table with something like ElasticSearch for quick, arbitrary search capabilities.
Keep in mind that DynamoDB can only accommodate approximately 10GB of data in a single partition and that a single partition is limited to up to 3000 reads per second, and up to 1000 writes per second (reads + 3 * writes <= 3000).
Finally, you might consider storing your claim documents directly into ElasticSearch.
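A rough sketch of that pattern, assuming a hypothetical claims table whose key attributes are included in each claim document, a local Elasticsearch endpoint, and the Elasticsearch 8.x Python client; none of these names come from the question:

import boto3
from elasticsearch import Elasticsearch

dynamodb = boto3.resource("dynamodb")
claims_table = dynamodb.Table("claims")          # hypothetical table name
es = Elasticsearch("http://localhost:9200")      # hypothetical endpoint

def save_claim(claim: dict) -> None:
    # DynamoDB remains the system of record.
    claims_table.put_item(Item=claim)
    # Mirror the claim into an Elasticsearch index for arbitrary searches.
    es.index(index="claims", id=claim["id"], document=claim)

def search_claims(filters: dict) -> list:
    # Any combination of the ~15 attributes becomes a set of term filters.
    query = {"bool": {"filter": [{"term": {field: value}}
                                 for field, value in filters.items()]}}
    response = es.search(index="claims", query=query)
    return [hit["_source"] for hit in response["hits"]["hits"]]

# e.g. search_claims({"provider": "PRV-001", "payee": "PAY-042"})
# Depending on your mappings you may need field.keyword for term filters,
# and you can hydrate full documents from DynamoDB by key if needed.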

How to improve the performance of SAS Enterprise Guide 4

I have a table with almost 200,000,000 records. It takes a very long time to query.
Any ideas for improving the performance?
Consider adding indexes on the ID-type columns in that table.
BTW, this has nothing to do with SAS EG performance, but everything to do with the underlying Base SAS engine.
In addition to indexing you should also consider:
Compression. When you build the dataset, make sure you use the compress=yes option if you're not already. This will shrink the size of the table on disk, resulting in less disk I/O (the slowest part of querying).
Check column lengths - make sure you're not using a field length of $255 to store something that only needs a length of $20, etc.
Use the SAS SPDE (Scalable Performance Data Engine). It allows you to partition your SAS datasets into multiple files and optionally spread them across different disks. Once your SAS datasets reach a certain size, you can see performance improvements. I generally tend to use SPD libnames any time a dataset grows beyond 10 GB. No additional SAS modules are required - this is enabled as part of Base SAS.