Disadvantages of using table compression - sql

Are there any disadvantages of using table compression such as Row compression and Page compression, for example:
ALTER TABLE A
REBUILD WITH (DATA_COMPRESSION = PAGE) --or ROW
If the above command can improve query performance, why don't we use it every time we create a new table, even though it may not have much effect on a table with only a few data pages?
Any disadvantages of using this?
Thanks
Summary:
Check #paulbarbin's answer, or see the conclusion of the post quoted here:
As we can see, row- and page-level compression can be powerful tools to help you reduce space taken by your data and improve the execution speed, but at the expense of CPU time. This is because each access of a row or page requires a step to undo the compression (or calculate and match hashes) and this translates directly into compute time. So, when deploying row- or page-level compression, conduct some similar testing (you are welcome to use my framework!) and see how it plays out in your test environment. Your results should inform your decision - if you're already CPU-bound, can you afford to deploy this? If your storage is on fire, can you afford NOT to?

Compression does come with an overhead. There is additional CPU required to complete the compression and based on the limitations of compression, you might find that the gain is less than the pain. However, it's my understanding that most people benefit from page compression for most scenarios and use row compression in specific circumstances. I'd say try it in your dev/test environment, determine your cost on CPU and savings in queries and implement if it makes sense.
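A quick way to gauge the potential savings before you rebuild anything is SQL Server's sp_estimate_data_compression_savings procedure. A minimal sketch, assuming the table dbo.A from the question:

-- Estimate how much space PAGE compression would save on dbo.A
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'A',
    @index_id         = NULL,   -- all indexes
    @partition_number = NULL,   -- all partitions
    @data_compression = 'PAGE'; -- or 'ROW'

Compare the current and estimated sizes it returns against the CPU cost you measure in your dev/test runs.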

When page-level compression is applied to a table, row-level compression is also applied. The benefits of page compression depend on the type of data being compressed: data with many repeating values compresses better than data made up of mostly unique values. One more thing: data compression can change the query plan, because the data is packed into a different number of pages and rows, and additional CPU is required to retrieve the compressed data.
I suggest going with compression only when you have a big warehouse table that contains millions of records and you (or your application) don't need to query the table frequently. You can also use partition-level compression when the table is partitioned.
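For the partitioned case, compression can be applied per partition rather than to the whole table. A hedged sketch (the table name and partition numbers here are made up):

-- Page-compress only the older, rarely queried partitions
ALTER TABLE dbo.FactSales
REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE ON PARTITIONS (1 TO 8));

-- Or rebuild a single partition with a different setting
ALTER TABLE dbo.FactSales
REBUILD PARTITION = 9
WITH (DATA_COMPRESSION = ROW);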

Related

How do I design a data ingestion process in Snowflake that includes updates/inserts and maintains optimum performance

I will be ingesting about 20 years of data in files with millions of rows and about 500 columns. Reading through the Snowflake (SF) documentation I saw that I should load the files in an order that would allow SF to create the micro-partitions (MP) with metadata optimized for pruning. However, I am concerned because I will be updating previously loaded records, which could ruin the integrity of the MP. Is there a best practice for handling updates? Might I at some point need to reorganize the table data to regain its performance structure? Are cluster keys adequate for handling this, or should I consider a combination of the two? I am planning on splitting the load files into logical combinations that would also support the proper metadata definitions, but am also wondering if there is a preferred limit to the number of columns. If there is a known best-practice document please let me know. Thanks. hs
You don't need to worry about any of that with regard to how the data is stored. Snowflake doesn't update data in place; it only writes new micro-partitions.
Performance with updates will obviously be slower than pure inserts, but that’s a different issue.
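If you do end up leaning on clustering keys for pruning, a minimal Snowflake sketch (the table and column names here are hypothetical) would be:

-- Define a clustering key on the columns you filter by most often
ALTER TABLE sales_history CLUSTER BY (sale_date, region);

-- Check how well the table is clustered on those columns
SELECT SYSTEM$CLUSTERING_INFORMATION('sales_history', '(sale_date, region)');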

SQL Server: Bulk Insert Data Loading to Partitioned Table with Multiple File Groups

I am trying to load a series of CSV files, ranging from 100MB to 20GB in size (total of ~3TB), so I need every performance enhancement I can get. I am aiming to use filegroups and partitioning as a means to that end. I performed a series of tests to find the optimum approach.
First, I tried various filegroup combinations; the best result I get is when I load into a table that is on one filegroup with multiple files assigned to it, all sitting on one disk. This combination outperformed the cases where I used multiple filegroups.
The next step was naturally to add partitioning. Oddly, every partitioning combination that I examined had lower performance. I tried defining various partition functions/schemes and various filegroup combinations, but all showed a lower loading speed.
I am wondering what I am missing here!?
So far, I managed to load (using bulk insert) a 1GB csv file in 3 minutes. Any idea is much appreciated.
For optimal data loading speed you first need to understand the SQL Server data load process, which means understanding how SQL Server achieves the optimizations listed below.
Minimal Logging.
Parallel Loading.
Locking Optimization.
These two articles explain in detail how you can achieve all of the above optimizations: Fastest Data Loading using Bulk Load and Minimal Logging, and Bulk Loading data into HEAP versus CLUSTERED Table.
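As a rough illustration of what those articles cover, here is a hedged sketch of a minimally logged load; the file path and table name are made up, the target is assumed to be a heap, and the database is assumed to use the SIMPLE or BULK_LOGGED recovery model:

BULK INSERT dbo.StagingSales
FROM 'D:\loads\sales_001.csv'
WITH (
    TABLOCK,                 -- table-level lock enables minimal logging on a heap
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2      -- skip the header row
);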
Hope this helps.

Why Select SQL queries on tables with blobs are slow, even when the blob is not selected?

SELECT queries on tables with BLOBs are slow, even if I don't include the BLOB column. Can someone explain why, and maybe how to circumvent it? I am using SQL Server 2012, but maybe this is more of a conceptual problem that would be common to other database systems as well.
I found this post: SQL Server: select on a table that contains a blob, which shows the same problem, but the marked answer doesn't explain why is this happening, neither provides a good suggestion on how to solve the problem.
If you are asking for a way to solve the performance drag, there are a number of approaches that you can take. Adding indexes to your table should help massively provided you aren't simply selecting the entire recordset. Creating views over the table may also assist. It's also worth checking the levels of index fragmentation on the table as this can cause poor performance and could be addressed with a regular maintenance job. The suggestion of creating a linked table to store the blob data is also a genuinely good one.
However, if your question is asking why it's happening, this is because of the fundamentals of the way MS SQL Server functions. Essentially your database, and all databases on the server, are split into pages: 8KB chunks of data with a 96-byte header, each page representing what is possible in a single I/O operation. Pages are collected and grouped within extents, 64KB collections of eight contiguous pages; SQL Server therefore uses sixteen extents per megabyte of data. There are a few differing page types; a data page, for example, won't contain what are termed "Large Objects". These include the data types text, image, varbinary(max), xml data, etc., and are also used to store variable-length columns which exceed 8KB (and don't forget the 96-byte header).
At the end of each page is a small amount of free space. Database operations obviously shift these pages around all the time, and free-space allocations can grow massively in a database dealing with large amounts of I/O and random record access/modification. This is why free space in a database can grow massively. There are tools available within the management suite to reduce or remove free space, which basically re-organizes pages and extents.
Now, I may be making a leap here, but I'm guessing that the BLOBs you have in your table exceed 8KB. Bear in mind that if they exceed 64KB they will not only span multiple pages but indeed multiple extents. The net result is that a "normal" table read can generate a massive number of I/O requests: even if you're not interested in the BLOB data, the server may have to read through the pages and extents to get to the other table data. This is only compounded as more transactions cause the pages and extents that make up a table to become non-contiguous.
Where "Large Objects" are used, SQL Server writes row-overflow values which include a 24-byte pointer to where the data is actually stored. If you have several columns in your table which together exceed the 8KB page size, combined with BLOBs and random transactions, you will find that the majority of the work your server is doing is I/O: moving pages in and out of memory, reading pointers, fetching associated row data, etc. All of which represents serious overhead.
One suggestion, then: keep all the BLOBs in a separate table with an identity ID, and save only that ID in your main table.
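A minimal sketch of that layout (the table and column names here are invented):

-- BLOBs live in their own table...
CREATE TABLE dbo.DocumentBlob
(
    BlobId  INT IDENTITY(1,1) PRIMARY KEY,
    Content VARBINARY(MAX) NOT NULL
);

-- ...and the main table only carries the ID
CREATE TABLE dbo.Document
(
    DocumentId INT IDENTITY(1,1) PRIMARY KEY,
    Name       NVARCHAR(200) NOT NULL,
    BlobId     INT NULL REFERENCES dbo.DocumentBlob (BlobId)
);

-- Typical listing queries never touch the BLOB pages
SELECT DocumentId, Name FROM dbo.Document;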
It could be because SQL Server cannot cache the table pages as easily, so you have to go to disk more often. I'm no expert as to why, though.
A lot of people frown at BLOBs/images in databases. In SQL Server 2012 there is a compromise (FILESTREAM/FileTable) where you can configure the database to keep the objects in the file system rather than in the database itself; you might want to look into that.

Why Spark SQL considers the support of indexes unimportant?

Quoting the Spark DataFrames, Datasets and SQL manual:
A handful of Hive optimizations are not yet included in Spark. Some of
these (such as indexes) are less important due to Spark SQL’s
in-memory computational model. Others are slotted for future releases
of Spark SQL.
Being new to Spark, I'm a bit baffled by this for two reasons:
1. Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL's in-memory computational model"? Is Spark SQL recommended only for cases where the data fits in memory?
2. Even assuming the data fits in memory, a full scan over a very large dataset can take a long time. I read this argument against indexing in in-memory databases, but I was not convinced. The example there discusses a scan of a 10,000,000-record table, but that's not really big data. Scanning a table with billions of records can cause simple queries of the "SELECT x WHERE y=z" type to take forever instead of returning immediately.
I understand that Indexes have disadvantages like slower INSERT/UPDATE, space requirements, etc. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.
I'm wondering, then, why the Spark SQL team considers indexes unimportant to the degree that they're off the road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?
Indexing input data
The fundamental reason why indexing over external data sources is not in the Spark scope is that Spark is not a data management system but a batch data processing engine. Since it doesn't own the data it is using, it cannot reliably monitor changes and, as a consequence, cannot maintain indices.
If the data source supports indexing, it can be utilized indirectly by Spark through mechanisms like predicate pushdown.
Indexing Distributed Data Structures:
Standard indexing techniques require a persistent and well-defined data distribution, but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
A high-level data layout achieved by proper partitioning, combined with columnar storage and compression, can provide very efficient distributed access without the overhead of creating, storing and maintaining indices. This is a common pattern used by different in-memory columnar systems, as the sketch below illustrates.
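As a rough illustration of that pattern, a Spark SQL sketch (table and column names are invented) that relies on partition pruning over columnar Parquet files instead of an index:

-- Lay the data out partitioned by the column most queries filter on
CREATE TABLE events
USING parquet
PARTITIONED BY (event_date)
AS SELECT user_id, action, event_date FROM raw_events;

-- Only the matching partition directories are read, no index needed
SELECT user_id, action
FROM events
WHERE event_date = DATE '2024-06-01';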
That being said some forms of indexed structures do exist in Spark ecosystem. Most notably Databricks provides Data Skipping Index on its platform.
Other projects, like Succinct (mostly inactive today), take a different approach and use advanced compression techniques with random access support.
Of course this raises a question - if you require efficient random access, why not use a system which is designed as a database from the beginning? There are many choices out there, including at least a few maintained by the Apache Foundation. At the same time, Spark as a project evolves, and the quote you used might not fully reflect future Spark directions.
In general, the utility of indexes is questionable at best. Instead, data partitioning is more important. They are very different things, and just because your database of choice supports indexes doesn't mean they make sense given what Spark is trying to do. And it has nothing to do with "in memory".
So what is an index, anyway?
Back in the days when permanent storage was crazy expensive (instead of essentially free) relational database systems were all about minimizing usage of permanent storage. The relational model, by necessity, split a record into multiple parts -- normalized the data -- and stored them in different locations. To read a customer record, maybe you read a customer table, a customerType table, take a couple of entries out of an address table, etc. If you had a solution that required you to read the entire table to find what you want, this is very costly, because you have to scan so many tables.
But this is not the only way to do things. If you didn't need to have fixed-width columns, you can store the entire set of data in one place. Instead of doing a full-table scan on a bunch of tables, you only need to do it on a single table. And that's not as bad as you think it is, especially if you can partition your data.
40 years later, the laws of physics have changed. Hard drive random read/write speeds and linear read/write speeds have drastically diverged. You can basically do 350 head movements a second per disk. (A little more or less, but that's a good average number.) On the other hand, a single disk drive can read about 100 MB per second. What does that mean?
Do the math and think about it: at 100 MB per second and 350 head movements per second, that's roughly 285 KB per seek, which means that if you are reading less than about 300KB per disk head move, you are throttling the throughput of your drive.
Seriously. Think about that a second.
The goal of an index is to allow you to move your disk head to the precise location on disk you want and just read that record -- say just the address record joined as part of your customer record. And I say, that's useless.
If I were designing an index based on modern physics, it would only need to get me within 100KB or so of the target piece of data (assuming my data had been laid out in large chunks -- but we're talking theory here anyway). Based on the numbers above, any more precision than that is just a waste.
Now go back to your normalized table design. Say a customer record is really split across 6 rows held in 5 tables. 6 total disk head movements (I'll assume the index is cached in memory, so no disk movement). That means I can read 1.8 MB of linear / de-normalized customer records and be just as efficient.
And what about customer history? Suppose I wanted to not just see what the customer looks like today -- imagine I want the complete history, or a subset of the history? Multiply everything above by 10 or 20 and you get the picture.
What would be better than an index would be data partitioning -- making sure all of the customer records end up in one partition. That way with a single disk head move, I can read the entire customer history. One disk head move.
Tell me again why you want indexes.
Indexes vs ___ ?
Don't get me wrong -- there is value in "pre-cooking" your searches. But the laws of physics suggest a better way to do it than traditional indexes. Instead of storing the customer record in exactly one location, and creating a pointer to it -- an index -- why not store the record in multiple locations?
Remember, disk space is essentially free. Instead of trying to minimize the amount of storage we use -- an outdated artifact of the relational model -- just use your disk as your search cache.
If you think someone wants to see customers listed both by geography and by sales rep, then make multiple copies of your customer records stored in a way that optimizes those searches. Like I said, use the disk like your in-memory cache. Instead of building your in-memory cache by drawing together disparate pieces of persistent data, build your persistent data to mirror your in-memory cache so all you have to do is read it. In fact, don't even bother trying to store it in memory -- just read it straight from disk every time you need it.
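In Spark SQL terms that might look something like the following sketch (table and column names are invented): two copies of the same data, each laid out for a different search.

-- One copy laid out for geography-based lookups
CREATE TABLE customers_by_region
USING parquet
PARTITIONED BY (region)
AS SELECT customer_id, name, sales_rep, region FROM customers_raw;

-- A second copy laid out for sales-rep lookups
CREATE TABLE customers_by_rep
USING parquet
PARTITIONED BY (sales_rep)
AS SELECT customer_id, name, region, sales_rep FROM customers_raw;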
If you think that sounds crazy, consider this -- if you cache it in memory you're probably going to cache it twice. It's likely your OS / drive controller uses main memory as cache. Don't bother caching the data because someone else is already!
But I digress...
Long story short, Spark absolutely does support the right kind of indexing -- the ability to create complicated derived data from raw data to make future uses more efficient. It just doesn't do it the way you want it to.

How can I quickly insert test data into BigQuery?

Inserting large amounts of test data into BigQuery can be slow, especially if the exact details of the data aren't important and you just want to test the performance of a particular shape of query/data.
What's the best way to achieve this without waiting around for many GB of data to upload to GCS?
In general, I'd recommend testing over small amounts of data (to save money and time).
If you really need large amounts of test data, there are several options.
If you care about the exact structure of the data:
You can upload data to GCS in parallel (if a slow single transfer is the bottleneck).
You could create a short-lived Compute Engine VM and use it to insert test data into GCS (which is likely to provide higher throughput than over your local link). This is somewhat involved, but gives you a very fast path for inserting data generated on-the-fly by a script.
If you just want to try out the capabilities of the platform, there are a number of public datasets available for experimentation. See:
https://cloud.google.com/bigquery/docs/sample-tables
If you just need a large amount of data and duplicate rows are acceptable:
You can insert a moderate amount of data via upload to GCS. Then duplicate it by querying the table and appending the result to the original. You can also use the bq command line tool with copy and the --append flag to achieve a similar result without being charged for a query.
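A hedged sketch of the duplicate-and-append approach in BigQuery standard SQL (the project, dataset, and table names are made up); each run doubles the row count:

-- Append the table's current contents back onto itself
INSERT INTO `my_project.test_data.events`
SELECT * FROM `my_project.test_data.events`;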
This method has a bit of a caveat -- to get performance similar to typical production usage, you'll want to load your data in reasonably large chunks. For a 400GB use case, I'd consider starting with 250MB - 1GB of data in a single import. Many tiny insert operations will slow things down (and are better handled via the streaming API, which does the appropriate batching for you).