Cassandra Compacted partition maximum byte size is higher than total space used for the table

I am working on Cassandra version 2.1.13.1218 and cqlsh version 5.0.1.
For a given table, when I run the cfstats command, Compacted partition maximum bytes is greater than Space used (total). For example:
Compacted partition maximum bytes: 4.64 MB
and
Space used (total): 2.28 MB.
The total space used by a table should always be higher, since every partition, large or small, is part of the total space of the given table. How can the compacted partition maximum byte size be higher than the total space used for the table?
The command is: ./cqlsh cfstats keyspace.columnfamilyname -H
Can someone help me understand this, and what is the difference between Space used (live) and Space used (total)?

Space used indicates how much space the table occupies on disk. This depends on the OS and the compression ratio.
Compacted partition maximum bytes, on the other hand, is just the largest partition size encountered (after compaction). This is driven by your data model/schema and the logical record size. For instance, a 100 KB record size times 40 records (each going into the same partition) will give you a 4 MB partition.
When that partition sits on disk it may be compressed further, so you may end up with 2 MB on disk. As for live vs total: Space used (live) counts only the SSTables currently in use, while Space used (total) also includes obsolete SSTables that have not yet been removed from disk. Can you share the rest of the stats too (compression info for example, min and avg partition size, number of keys)?
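For illustration only (keyspace, table, and column names below are hypothetical, not from the question), a data model like this produces exactly that pattern, since every row sharing a partition key is stored in the same partition:

CREATE TABLE ks.sensor_events (
    sensor_id   text,        -- partition key: all rows for one sensor share a partition
    event_time  timeuuid,    -- clustering column: orders rows within the partition
    payload     blob,        -- ~100 KB per row in this example
    PRIMARY KEY (sensor_id, event_time)
);

Forty ~100 KB rows for a single sensor_id add up to a ~4 MB partition, which cfstats reports as Compacted partition maximum bytes, while the compressed SSTables holding it can total closer to 2 MB, which is what Space used reports.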

Related

How to (streaming-) insert many small rows (few bytes per row) into BigQuery cost-efficiently?

I have a BigQuery table with the following properties:
Table size: 1.64 TB
Number of rows: 9,883,491,153
The data is put there using streaming inserts (in batches of 500 rows each).
From the Google Cloud Pricing Calculator the cost for these inserts so far should be roughly $86.
But in reality, it turns out to be around $482.
The explanation is in the pricing docs:
Streaming inserts (tabledata.insertAll): $0.010 per 200 MB (You are charged for rows that are successfully inserted. Individual rows are calculated using a 1 KB minimum size.)
So, in the case of my table, each row is just 182 bytes, but I need to pay for the full 1,024 bytes per row, resulting in ~562% of the originally (incorrectly) estimated cost.
Is there a canonical (and of course legal) way to improve the situation, i.e., reduce cost? (Something like inserting into a temp table with just one array-of-struct column, to hold multiple rows in a row, and then split-moving regularly into the actual target table?)
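For reference, the arithmetic behind those numbers (assuming binary units, i.e. 1 KB = 1,024 bytes and 200 MB = 200 × 2^20 bytes, which is what makes the figures line up):
9,883,491,153 rows × 1,024 bytes ≈ 9.2 TiB billed
9.2 TiB ÷ 200 MiB × $0.010 ≈ $482
1.64 TiB ÷ 200 MiB × $0.010 ≈ $86 (the calculator's estimate based on actual bytes)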
I can suggest you these options:
Use the BigQuery Storage Write API. You can either stream records into BigQuery so that they are available for reading as soon as they are written, or batch a large number of records and commit them in a single operation.
Some advantages are:
Lower cost because you have 2 TB per month free.
It supports exactly-once semantics through the use of stream offset.
If a table schema changes while a client is streaming, BigQuery Storage Write notifies the client.
Here is more information about BigQuery Storage Write.
As another option, you could use Beam/Dataflow to batch the data before loading it into BigQuery, using BigQueryIO with its batch (file loads) write method instead of streaming inserts.
You can see more information here.
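If you do try the idea from the question (a staging table with a single array-of-struct column that is periodically split-moved into the real table), a minimal sketch could look like this; the project, dataset, table, and column names are placeholders:

-- Flatten each batched row from the staging table into individual rows
-- of the real target table, then clear the staging table.
INSERT INTO `my_project.my_dataset.target_table`
SELECT r.*
FROM `my_project.my_dataset.staging_batches` AS b,
     UNNEST(b.batched_rows) AS r;

TRUNCATE TABLE `my_project.my_dataset.staging_batches`;

Each streamed row into the staging table then carries many logical records, so the 1 KB per-row minimum is amortized; whether that is worth the extra moving parts compared to the Storage Write API is for you to judge.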

How clustering works in BigQuery

I have a table UNITARCHIVE partitioned by date and clustered by UNIT, DUID.
The total size of the table is 892 MB.
When I try this query:
SELECT * FROM `test-187010.ReportingDataset.UNITARCHIVE` WHERE duid="RRSF1" and unit="DUNIT"
BigQuery tells me it will process 892 MB. I thought clustering was supposed to reduce the scanned size. I understand that when I filter by date the size is reduced dramatically, but I need the whole date range.
Is this by design, or am I doing something wrong?
To get the most benefit out of clustering, each partition needs to hold a certain amount of data.
For example, if the minimum size of a cluster is 100 MB (decided internally by BigQuery) and you have only 100 MB of data per day, then querying 100 days will scan 100 × 100 MB, regardless of the clustering strategy.
As an alternative with this amount of data, instead of partitioning by day, partition by year. Then you'll get the most benefit out of clustering even with a low amount of data per day.
See "Partition by week/year/month to get over the partition limit?" for a reference table that shows this off.
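A hedged sketch of what that could look like in DDL; the date column name (settlement_date here) is an assumption, not taken from the question:

-- Repartition by year so each partition holds enough data for clustering
-- on UNIT, DUID to prune blocks within it.
CREATE TABLE `test-187010.ReportingDataset.UNITARCHIVE_BY_YEAR`
PARTITION BY DATE_TRUNC(settlement_date, YEAR)
CLUSTER BY UNIT, DUID
AS SELECT * FROM `test-187010.ReportingDataset.UNITARCHIVE`;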

Specify minimum number of generated files from Hive insert

I am using Hive on AWS EMR to insert the results of a query into a Hive table partitioned by date. Although the total output size each day is similar, the number of generated files varies, usually between 6 and 8, but some days it creates just a single big file. I reran the query a couple of times, just in case the number of files happens to be influenced by the availability of nodes in the cluster, but it seems to be consistent.
So my questions are
(a) what determines how many files are generated and
(b) is there a way to specify the minimum number of files or (even better) the maximum size of each file?
The number of files generated during INSERT ... SELECT depends on the number of processes running on the final reducer (the final reducer vertex if you are running on Tez) and on the configured bytes per reducer.
If the table is partitioned and no DISTRIBUTE BY is specified, then in the worst case each reducer creates files in each partition. This puts high pressure on the reducers and may cause an OOM exception.
To make sure each reducer writes files for only one partition, add DISTRIBUTE BY partition_column at the end of your query.
If the data volume is too big and you want more reducers to increase parallelism and to create more files per partition, add a random number to the DISTRIBUTE BY, for example FLOOR(RAND()*100.0)%10 - it will additionally distribute the data into 10 random buckets, so each partition will contain 10 files.
Finally, your INSERT statement will look like this:
INSERT OVERWRITE TABLE your_table PARTITION(part_col)
SELECT *
FROM src
DISTRIBUTE BY part_col, FLOOR(RAND()*100.0)%10; --10 files per partition
Also this configuration setting affects the number of files generated:
set hive.exec.reducers.bytes.per.reducer=67108864;
If you have a lot of data, Hive will start more reducers so that each reducer process handles no more than the specified bytes per reducer. The more reducers, the more files will be generated. Decreasing this setting increases the number of reducers, and each reducer will create at least one file. If the partition column is not in the DISTRIBUTE BY, then each reducer may create files in every partition.
To make a long story short, use
DISTRIBUTE BY part_col, FLOOR(RAND()*100.0)%10 -- 10 files per partition
If you want 20 files per partition, use FLOOR(RAND()*100.0)%20; this will guarantee a minimum of 20 files per partition if you have enough data, but will not guarantee the maximum size of each file.
The bytes-per-reducer setting does not guarantee a fixed minimum number of files; the number of files will depend on total data size / bytes.per.reducer. What this setting does guarantee is the maximum size of each file.
But it is much better to use some evenly distributed key (or combination of keys) with low cardinality instead of a random number, because if containers are restarted, rand() may produce different values for the same rows, which can cause data duplication or loss (data already present in some reducer's output may be distributed once more to another reducer). You can calculate a similar function on whatever keys are available instead of rand() to get a more or less evenly distributed key with low cardinality.
You can combine both methods, the bytes-per-reducer limit plus DISTRIBUTE BY, to control both the minimum number of files and the maximum file size.
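For example, a hedged sketch combining both; the table and column names are placeholders, and pmod(hash(...), ...) stands in for the deterministic alternative to rand() mentioned above:

-- Cap each reducer's input (and hence roughly each output file) at ~64 MB
set hive.exec.reducers.bytes.per.reducer=67108864;

INSERT OVERWRITE TABLE target_table PARTITION(part_col)
SELECT *
FROM src
-- deterministic 10-way bucketing per partition instead of rand(),
-- so a restarted container sends each row to the same reducer again
DISTRIBUTE BY part_col, pmod(hash(id), 10);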
Also read this answer about using distribute by to distribute data evenly between reducers: https://stackoverflow.com/a/38475807/2700344

Unable to upload data even after partitioning in VoltDB

We are trying to upload 80 GB of data to 2 host servers, each with 48 GB RAM (96 GB in total). We have partitioned the table too. But even after partitioning, we are able to upload only up to 10 GB of data. In the VMC interface, we checked the size worksheet. The number of rows in the table is 40,00,00,000 (400 million), the table maximum size is 1,053,200,000 KB and the minimum size is 98,000,000 KB. So what is the issue in uploading 80 GB even after partitioning, and what does this table size mean?
The size worksheet provides the minimum and maximum size in memory that the given number of rows would take, based on the schema of the table. If you have VARCHAR or VARBINARY columns, then the difference between min and max can be quite substantial, and your actual memory use is usually somewhere in between, but it can be difficult to predict because it depends on the actual size of the strings that you load.
But I think the issue is that the minimum size is 98 GB according to the worksheet, i.e. the size even if every nullable string were null and every not-null string were an empty string. Even without taking into account the heap size and any overhead, this is higher than your 96 GB capacity.
What is your kfactor setting? If it is 0, there will be only one copy of each record. If it is 1, there will be two copies of each record, so you would really need 196GB minimum in that configuration.
The size per record in RAM depends on the datatypes chosen and if there are any indexes. Also, VARCHAR values longer than 15 characters or 63 bytes are stored in pooled memory which carries more overhead than fixed-width storage, although it can reduce the wasted space if the values are smaller than the maximum size.
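For illustration only (hypothetical table, column names, and sizes, based on the thresholds above), a definition along these lines keeps string values inline and spreads rows across the cluster:

-- VARCHARs declared within the inline limits avoid pooled-memory overhead;
-- larger declared sizes move the values to pooled storage.
CREATE TABLE readings (
  device_id BIGINT     NOT NULL,
  ts        TIMESTAMP  NOT NULL,
  status    VARCHAR(15),          -- 15 characters or fewer can stay inline
  payload   VARCHAR(63 BYTES)     -- 63 bytes or fewer can stay inline
);
-- Partition on a column used by most queries so rows spread across sites
PARTITION TABLE readings ON COLUMN device_id;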
If you want some advice on how to minimize the per-record size in memory, please share the definition of your table and any indexes, and I might be able to suggest adjustments that could reduce the size.
You can add more nodes to the cluster, or use servers with more RAM to add capacity.
Disclaimer: I work for VoltDB.

Understanding the concept of simple row compression in SQL server

I read that the row compression feature of SQL Server reduces the space needed to store a row by only using the bytes required to store a given value. Without compression, an int column needs 4 bytes. We need the whole 4 bytes only if we are storing the number 1 or 1 million. With row compression turned on, SQL Server looks at the actual value being stored and calculates the amount of storage needed.
What I don't understand: why do 1 and 1 million need the full 4 bytes, and why do other numbers bigger than 1 require less memory?
EDIT: This is taken from the book Delivering Business Intelligence with SQL Server 2008:
Row compression reduces the space required to store a row of data. It does this by only using the bytes required to store a given value. For example, without row compression, a column of type int normally occupies 4 bytes. This is true if we are storing the number 1 or the number 1 million. SQL Server allocates enough space to store the maximum value possible for the data type.
When row compression is turned on, SQL Server makes smarter space allocations. It looks at the actual value being stored for a given column in a given row, and then determines the storage required to represent that value. This is done to the nearest whole byte.
Of course, this added complexity adds some overhead as data is inserted or updated in the table. It also adds a smaller amount of overhead when data is retrieved from that table. In most cases, the time taken for this additional processing is negligible. In fact, in some cases, the saving in disk read and disk write time can even be greater than the calculation time required for data compression.
You have the wrong perception of this. Please read the text thoroughly.
1 and 1 million are just examples of values that require 4 bytes when row compression is off. It does not mean that other numbers require less space.
1 and 1 million were probably picked as examples because 1 is small and 1 million is big.
An int will take 4 bytes regardless of what value it has if row compression is off.
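As a hedged illustration (the table name dbo.Orders is hypothetical), row compression is enabled per table or index, and SQL Server can estimate the savings before you commit to it:

-- Estimate how much space row compression would save (SQL Server 2008+)
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'Orders',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = 'ROW';

-- Rebuild the table with row compression turned on
ALTER TABLE dbo.Orders REBUILD WITH (DATA_COMPRESSION = ROW);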