Google Cloud Bigtable Update or Insert with Versioning

I'm wondering if I should use an update query to update my row data, or enable versioning with maxversions and just insert.
I understand it may depend on what kind of data I need to store, but I just wanted to know whether there is a performance difference between querying (selecting) data that has versioning and data that does not, or between insert and update.

Performance is impacted by the size of the row and the amount of data returned from the server.
Bigtable has to read an entire row for every request. That will be a limiting factor on reads. At some size (100s+ of MB), systemic performance will degrade any time the tablet with that row is loaded. When the row size reaches GBs, you'll have major problems.
At query time, performance is also impacted by how much data is returned from the server. You can still get decent performance in the lower range of "large rows" if you limit your Get or Scan to a small subset of the row. Limits such as cells per row, and/or retrieving only a few column qualifiers, will help with the network costs.
In general, it's better to keep your rows smaller, if you can. That is generally done with a combination of "insert" and some sort of age/version restriction on the column family.

Related

AWS Redshift column limit?

I've been doing some load testing of AWS Redshift for a new application, and I noticed that it has a column limit of 1600 per table. Worse, queries slow down as the number of columns increases in a table.
What doesn't make any sense here is that Redshift is supposed to be a column-store database, and there shouldn't in theory be an I/O hit from columns that are not selected in a particular where clause.
More specifically, when TableName has 1600 columns, I found that the query below is substantially slower than if TableName had, say, 1000 columns and the same number of rows. As the number of columns decreases, performance improves.
SELECT COUNT(1) FROM TableName
WHERE ColumnName LIKE '%foo%'
My three questions are:
What's the deal? Why does Redshift have this limitation if it claims to be a column store?
Any suggestions for working around this limitation? Joins of multiple smaller tables seem to eventually approximate the performance of a single table. I haven't tried pivoting the data.
Does anyone have a suggestion for a fast, real-time performance, horizontally scalable column-store database that doesn't have the above limitations? All we're doing is count queries with simple where restrictions against approximately 10M (rows) x 2500 (columns) data.
I can't explain precisely why it slows down so much but I can verify that we've experienced the same thing.
I think part of the issue is that Redshift stores a minimum of 1MB per column per node. Having a lot of columns creates a lot of disk seek activity and I/O overhead.
1MB blocks are problematic because most of that space will be empty, but it will still be read off the disk
Having lots of blocks means that column data will not be located as close together so Redshift has to do a lot more work to find them.
Also (it just occurred to me), I suspect that Redshift's MVCC controls add a lot of overhead. It tries to ensure you get a consistent read while your query is executing, and presumably that requires making a note of all the blocks for the tables in your query, even blocks for columns that are not used. See: Why is an implicit table lock being released prior to end of transaction in RedShift?
FWIW, our columns were virtually all BOOLEAN and we've had very good results from compacting them (bit masking) into INT/BIGINTs and accessing the values with the bit-wise functions. One example table went from 1400 columns (~200GB) to ~60 columns (~25GB), and the query times improved more than 10x (30-40 seconds down to 1-2 seconds).
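As a sketch of that bit-masking approach (the table and column names are made up, and the exact bit-wise syntax should be checked against your Redshift version), the query against the compacted table might look like this:

-- Wide form: SELECT COUNT(1) FROM events WHERE flag_17 = TRUE;
-- Compacted form: 64 flags packed into one BIGINT column, with bit 17
-- holding the old flag_17 value (131072 = 2^17)
SELECT COUNT(1)
FROM events_packed
WHERE (flags & 131072) <> 0;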

Will the query plan change with different data sizes?

Suppose the data distribution does not change. For the same query, if only the dataset is enlarged by some factor, will the time taken also grow by the same factor? And if the data distribution does not change, will the query plan change, at least in theory?
Yes, the query plan may still change even if the data is completely static, though it probably won't.
The autovacuum daemon will ANALYZE your tables and generate new statistics. This usually happens only when they've changed, but it may happen for other reasons (wrap-around prevention vacuum, etc).
The statistics include a random sampling to collect common values for a histogram. Being random, the outcome may be somewhat different each time.
To reduce the chances of plans shifting for a static dataset, you probably want to increase the statistics target on the table's columns and re-ANALYZE. Don't set it too high though, as the query planner has to read those histograms when it makes planning decisions, and bigger histograms mean slightly more planning time.
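For reference, a minimal sketch of that knob in PostgreSQL (the table and column names are placeholders; 500 is just an example value, the default target is 100):

-- Collect a larger sample for this column's histogram, then refresh statistics
ALTER TABLE my_table ALTER COLUMN my_column SET STATISTICS 500;
ANALYZE my_table;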
If your table is growing continuously but the distribution isn't changing then you want the planner to change plans at various points. A 1000-row table is almost certainly best accessed by doing a sequential scan; an index scan would be a waste of time and effort. You certainly don't want a million row table being scanned sequentially unless you're retrieving a majority of the rows, though. So the planner should - and does - adjust its decisions based not only on the data distribution, but the overall row counts.
Here is an example. Say you have a table where all the records fit on one data page. Consider the query:
select t.*
from table t
where col = x;
And assume you have an index on col. With everything on one page, the fastest way is simply to read the page and check the where clause against each record. There could be 200 records on that page, so the selectivity of the query might be less than 1%.
One of the key considerations that a SQL optimizer makes in choosing an algorithm is the number of expected page reads. So, if you have a query like the above, the engine might think "I have to read all pages in the table anyway, so let me just do a full table scan and ignore the index." Note that this will be true when the data is on a single page.
This generalizes to other operations as well. If all the records in your data fit on one data page, then "slow" algorithms are often the best, or close enough to the best. So nested-loop joins might be better than index-based, hash-based, or sort-merge joins. Similarly, a sort-based aggregation might be better than other methods.
Alas, I am not as familiar with the Postgres query optimizer as I am with SQL Server and Oracle. I have definitely encountered changes in execution plans in those databases as data grew.
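If you want to see this behaviour directly, here is a sketch in PostgreSQL terms (the table, column, and index names are made up):

CREATE TABLE t (id serial PRIMARY KEY, col integer);
CREATE INDEX t_col_idx ON t (col);

-- With only a handful of rows, the planner typically chooses a Seq Scan
EXPLAIN SELECT * FROM t WHERE col = 5;

-- After loading millions of rows and refreshing statistics, the same
-- statement typically switches to an Index Scan using t_col_idx
ANALYZE t;
EXPLAIN SELECT * FROM t WHERE col = 5;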

How Expensive is SQL ORDER BY?

I don't quite understand how a SQL database would sort a large result set. Is it done in memory on the fly (i.e. when a query is performed)?
Is it going to be faster to sort using ORDER BY in SQL rather than sorting, say, a linked list of objects containing the results in a language like Java (assuming a fast built-in sort, probably using quicksort)?
It will almost certainly be more efficient to sort the data in the database. Databases are designed to deal with large data volumes, and there are various optimizations available to the database that would not be available to the middle tier. If you plan on writing a hyper-efficient sort routine in the middle tier that takes advantage of information you have about your data that the database doesn't (e.g. farming the data out to a cluster of dozens of middle-tier machines so that the sort never spills to disk, or taking advantage of the fact that your data is mostly ordered to choose an algorithm that wouldn't normally be particularly efficient), you can probably beat the database's sort speed. But that tends to be rare.
Depending on the query, the database optimizer may choose a query plan that returns the data in order without performing a sort. For example, the database knows that the data in an index is sorted, so it may choose to do an index scan to return the data in order without ever having to materialize and sort the entire result set. If it does have to materialize the entire result, it only needs the columns you are sorting by and some sort of row identifier (e.g. a ROWID in Oracle) rather than sorting entire rows of data like a naive middle-tier implementation is likely to do. For example, if you have a composite index on (col1, col2) and you decide to sort on UPPER(col2), LOWER(col1), the database could read the col1 and col2 values from the index, sort the row identifiers, and then go fetch the data from the table. Of course, the database doesn't have to do this; the optimizer will take into account the cost of doing a sort against the cost of fetching the data from the table or from the various indexes. The database may well conclude that the most efficient approach is to do a table scan, read the entire row into memory, and sort it. It may conclude that leveraging an index results in more I/O to fetch the data but makes up for it by reducing or eliminating the sort costs.
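As a concrete sketch of that first case (PostgreSQL-style syntax; the table and index names are made up), an index on the sort column often lets the planner walk the index in order instead of sorting, especially when combined with a LIMIT:

CREATE INDEX orders_created_at_idx ON orders (created_at);

-- With the index, the plan can be an ordered index scan with no separate Sort
-- node; without it, the plan ends with an explicit Sort that may spill to disk.
EXPLAIN SELECT * FROM orders ORDER BY created_at LIMIT 10;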
The answer is... it depends. If the ORDER BY part can be done by using an index in the database, then the execution plan for the query will use that index and the results will come back in the right order straight from the DB. If not, then the database will perform the sorting, but it's likely better at it than you reading all the results into memory (and certainly better than reading the results into a linked list).
The exact method depends on the product you are using, but normally a fully-featured DBMS has multiple sort algorithms at its disposal. Some work on disk, optimizing for space over time, some work in memory, optimizing for speed. Check the source code of the available open source ones, if you are interested in the gory details.
It's unlikely that you are going to get better results by doing the sorting yourself or using some other library, although there can be pathological cases such as some operating system's qsort() having problems with certain data distributions. Try it out if you must, but prefer using a DBMS to manage your data, because that's what they are good at.
Unless the sort is index-based, using the database sort guarantees that you will wait for the entire result set to be resolved and sorted in the database before you see even a single row of the result set.
If you sort it yourself, data may be streamed incrementally (better for a network-constrained environment) and perhaps be incrementally useful to the application, reducing the perceived delay even if the sorting operation consumes the same amount of total time.
Depending on the deployment scenario, it might make a big difference where the extra costs associated with sorting are paid. In the scenarios I work with, the middle tier is disposable and scalable while the data tier is more expensive to scale out. If the sort costs the same amount of CPU either way, but database CPU costs 5x or 10x as much in operational terms, it becomes cheaper in real terms to do it outside the database.

Performance of returning entire tables containing blog text as opposed to selecting specific columns

I think this is a pretty common scenario: I have a webpage that returns links and excerpts for the 10 most recent blog entries.
If I just queried the entire table, I could use my ORM-mapped object, but I'd be downloading all the blog text.
If I restricted the query to just the columns that I need, I'd be defining another class that'll hold just those required fields.
How bad is the performance hit if I were to query entire rows? Is it worth selecting just what I need?
The answer is "it depends".
There are two things that affect performance as far as column selection is concerned:
Are there covering indexes? E.g. if there is an index containing ALL of the columns in the smaller query, then the smaller column set would be extremely beneficial performance-wise, since the index can be read without reading any rows themselves.
Size of columns. Basically, count how big the size of the entire row is, vs. size of only the columns in smaller query.
If the ratio is significant (e.g. full row is 3x bigger), then you might have significant savings in both IO (for retrieval) and network (for transmission) cost.
If the ratio is more like 10% benefit, it might not be worth it as far as DB performance gain.
It depends, but it will never be as efficient as returning only the columns you need (obviously). If there are few rows and the row sizes are small, then network bandwidth won't be affected too badly.
But, returning only the columns you need increases the chance that there is a covering index that can be used to satisfy the query, and that can make a big difference in the time a query takes to execute.
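To illustrate the covering-index point (SQL Server-style syntax; the table, column, and index names are hypothetical), an index whose key and INCLUDE list contain every column the excerpt query needs means the wide blog-text column is never read at all:

CREATE INDEX ix_posts_recent
    ON posts (published_at DESC)
    INCLUDE (title, excerpt);

-- Served entirely from the index; the full blog_text column is never touched
SELECT TOP (10) title, excerpt, published_at
FROM posts
ORDER BY published_at DESC;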
Since you specify that it's for 10 records, the answer changes from "It Depends" to "Don't spend even a second worrying about this".
Unless your server is in another country on a dialup connection, wire time for 10 records will be zero, regardless of how many bytes you shave off each row. It's simply not something worth optimizing for.
So for this case, you get to set your ORM free to grab you those records in the least efficient manner it can come up with. If your situation changes, and you suddenly need more than, say, 1000 records at once, then you can come back and we'll make fun of you for not specifying columns, but for now you get a free pass.
For extra credit, once you start issuing this homepage query more than 10x per second, you can add caching on the server to avoid repeatedly hitting the database. That'll get you a lot more bang for your buck than optimizing the query.

Does the speed of the query depend on the number of rows in the table?

Let's say I have this query:
select * from table1 r where r.x = 5
Does the speed of this query depend on the number of rows that are present in table1?
There are many factors affecting the speed of a query, one of which is the number of rows.
Others include:
index strategy (if you index column "x", you will see better performance than if it's not indexed)
server load
data caching - once you've executed a query, the data will be added to the data cache, so subsequent reruns will be much quicker as the data comes from memory rather than disk, until the point where the data is removed from the cache
execution plan caching - to a lesser extent. Once a query is executed for the first time, the execution plan SQL Server comes up with will be cached for a period of time, for future executions to reuse.
server hardware
the way you've written the query (often one of the biggest contributors to poor performance!), e.g. writing something using a cursor instead of a set-based operation
For databases with a large number of rows in tables, partitioning is usually something to consider (SQL Server 2005 onwards has built-in support in Enterprise Edition). This is to split the data down into smaller units. Generally, smaller units = smaller tables = smaller indexes = better performance.
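As a sketch of what that looks like in SQL Server (Enterprise Edition; the table, function, and boundary values below are made up):

-- Split rows by year so queries filtering on order_year only touch one slice
CREATE PARTITION FUNCTION pf_order_year (int)
    AS RANGE RIGHT FOR VALUES (2019, 2020, 2021);

CREATE PARTITION SCHEME ps_order_year
    AS PARTITION pf_order_year ALL TO ([PRIMARY]);

CREATE TABLE big_orders
(
    order_id   bigint         NOT NULL,
    order_year int            NOT NULL,
    amount     decimal(10, 2) NOT NULL
) ON ps_order_year (order_year);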
Yes, and it can be very significant.
If there are 100 million rows, SQL Server has to go through each of them and see if it matches.
That takes a lot more time than if there are only 10 rows.
You probably want an index on the x column, in which case SQL Server might check the index rather than going through all the rows, which can be significantly faster since it might not even need to check all the values in the index.
On the other hand, if there are 100 million rows matching x = 5, it will be slower than if only 10 rows match.
Almost always yes. The real question is: what is the rate at which the query slows down as the table size increases? And the answer is: by not much if r.x is indexed, and by a large amount if not.
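A minimal sketch for the query in this question (generic SQL; the index name is made up): with an index on x the engine can seek straight to the matching rows instead of scanning all of table1, and you can confirm which plan is chosen with EXPLAIN (or the execution plan viewer in SQL Server):

CREATE INDEX ix_table1_x ON table1 (x);

-- Compare the plan (and timing) with and without the index
EXPLAIN SELECT * FROM table1 r WHERE r.x = 5;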
Not the rows per se (to a certain degree, of course), but the amount of data (columns) is what can make a query slow. The data also needs to be transferred from the backend to the frontend.
The answer is yes, but the number of rows is not the only factor.
If you do appropriate optimization and tuning, the performance drop will be negligible.
The main performance factors are:
Indexing (clustered or non-clustered)
Data Caching
Table Partitioning
Execution Plan caching
Data Distribution
Hardware specs
There are other factors, but these are the main ones to consider.
Even how you designed your schema affects performance.
You should assume that your query always depends on the number of rows. In fact, you should assume the worst case (linear, or O(N), for the example you provided) and exponential for more complex queries. There are database-specific manuals filled with tricks to help you avoid the worst case, but SQL itself is a language and doesn't specify how to execute your query. Instead, the database implementation decides how to execute any given query: if you have indexed a column or set of columns in your database, then you will get O(log(N)) performance for a simple lookup; if the system has effective query caching, you might get an O(1) response. Here is a good introductory article: High scalability: SQL and computational complexity