Google Bigtable performance under low usage

I have seen the warnings against using Google Bigtable for small data sets.
Does this mean that a workload of 100 QPS could run slower (total time, not per query) than a workload of 8000 QPS?
I understand that 100 QPS is going to be incredibly inefficient on Bigtable, but could it be as drastic as 100 inserts taking 15 seconds to complete, whereas 8000 inserts could run in 1 second?
I'm just looking for an "in theory, from time to time, yes" vs. "probably relatively unlikely" type of answer, as a rough guide for how I structure my performance test cycles.
Thanks

There's a flat start-up cost to running any Cloud Bigtable operation. That start-up cost is generally less than 1 second. I would expect 100 operations to take less time than 8000 operations. When I see extreme slowness, I usually suspect network latency or some other unique condition.
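To put that in concrete terms, the way to avoid paying that flat cost per request is to reuse one client and batch the mutations. A minimal sketch, assuming the google-cloud-bigtable Python client; the project, instance, table, and column-family names are hypothetical:

from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=False)
instance = client.instance("my-instance")
table = instance.table("events")

# Build 100 mutations locally, then send them in a single bulk request
# instead of paying the per-request overhead 100 times.
rows = []
for i in range(100):
    row = table.direct_row("event#{:04d}".format(i).encode())
    row.set_cell("cf1", b"payload", b"some-value")
    rows.append(row)

statuses = table.mutate_rows(rows)
failures = [s for s in statuses if s.code != 0]
print("{} of {} mutations failed".format(len(failures), len(rows)))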

We're having issues running small workloads on our development Bigtable instance (2.5 TB), which has one node instead of three.
We have a row key built on user id, with around 100 rows per user id. Total records in the database are a few million. When we query Bigtable to fetch the rows associated with a single user id key, we see 1.4 seconds of latency. Fewer than 100 records are returned, yet we're seeing well over a second of latency. It seems to me that giant workloads are the only way to use this data store. We're looking at other NoSQL alternatives like Redis.
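For reference, this is roughly what that access pattern looks like with the google-cloud-bigtable Python client (recent versions let you iterate the result directly); the instance, table, and key layout below are hypothetical:

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("dev-instance").table("user_events")

prefix = b"user#12345#"
# Simple prefix-to-range conversion (assumes the last byte isn't 0xff).
end_key = prefix[:-1] + bytes([prefix[-1] + 1])

# Scan the contiguous key range covered by the user-id prefix.
for row in table.read_rows(start_key=prefix, end_key=end_key):
    print(row.row_key)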

Related

PostgreSQL: VACUUM FULL duration estimation

I inherited a PostgreSQL database in production with one table that is around 250 GB in size. It only has around ten thousand live rows which I estimate to be not more than 20 MB.
The table grew to such a size because AUTOVACUUM has been turned off at some time. (I know why this was done. It will be reactivated and the original issue has been fixed, so this is not part of the question.)
Our problem is that many queries take a pretty long time. For example, a SELECT count(*) FROM foo; takes around 15 minutes.
Now, after considering other options, I'd like to run a VACUUM FULL on the table. I am trying to estimate how long this would take to complete so I can plan a maintenance window.
In my understanding, VACUUM FULL creates a new table, copies all live tuples to it and replaces the original table with this copy.
My estimation would be that this process doesn't take much longer than a simple query like the above on this table as the live data is pretty slim in overall size and count.
Would you agree that my expectation of the run time of 'VACUUM FULL' is somewhat realistic? If not, why not?
Are there best practices for estimating VACUUM FULL durations?
The only dependable estimate can be had by restoring a file system backup on a similar machine and testing it there. That's what I would recommend.
The duration will depend not only on the size, but also on the amount of bloat: if there is less real data, it will be faster.
That said, I'd ask for a maintenance window of 2 hours, which should be ample on anything but very questionable hardware.
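If it helps planning, here is a minimal sketch with psycopg2 that checks the on-disk size and the planner's live-row estimate before running the rewrite; the connection string and table name are hypothetical, and VACUUM must run outside a transaction block:

import psycopg2

conn = psycopg2.connect("dbname=mydb")
conn.autocommit = True  # VACUUM cannot run inside a transaction block

with conn.cursor() as cur:
    # On-disk size including indexes and TOAST (the ~250 GB figure).
    cur.execute("SELECT pg_size_pretty(pg_total_relation_size('foo'))")
    print("total size:", cur.fetchone()[0])

    # Rough live-row count from the planner's statistics (cheap, approximate).
    cur.execute("SELECT reltuples::bigint FROM pg_class WHERE relname = 'foo'")
    print("estimated live rows:", cur.fetchone()[0])

    # The rewrite itself; it holds an ACCESS EXCLUSIVE lock for its duration.
    cur.execute("VACUUM (FULL, VERBOSE) foo")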

Bigquery partitioning table performance

I've got a question about BQ performance in various scenarios, especially revolving around parallelization "under the hood".
I am saving 100M records on a daily basis. At the moment, I am rotating tables every 5 days to avoid high charges due to full table scans.
If I were to run a query with a date range of "last 30 days" (for example), I would be scanning between 6 (if I am at the last day of the partition) and 7 tables.
I could, as an alternative, partition my data into a new table daily. In this case, I would optimize my expenses, as I'm never querying more data than I have to. The question is: will I suffer a performance penalty in terms of getting the results back to the client, because I am now querying potentially 30 or 90 or 365 tables in parallel (a UNION)?
To summarize:
More tables = less data scanned
More tables =(?) longer response time to the client
Can anyone shed some light on how to find the balance between cost and performance?
A lot depends on how you write your queries and how much development costs, but that amount of data doesn't seem like a barrier, so you may be trying to optimize too early.
When you JOIN tables larger than 8 MB, you need to use the EACH modifier, and that query is internally parallelized.
This partitioning means that you can get higher effective read bandwidth because you can read from many of these disks in parallel. Dremel takes advantage of this; when you run a query, it can read your data from thousands of disks at once.
Internally, BigQuery stores tables in shards; these are discrete chunks of data that can be processed in parallel. If you have a 100 GB table, it might be stored in 5000 shards, which allows it to be processed by up to 5000 workers in parallel. You shouldn't make any assumptions about the size or number of shards in a table. BigQuery will repartition data periodically to optimize the storage and query behavior.
Go ahead and create tables for every day. One recommendation is to write your create/patch script so that, when it runs, it creates tables far into the future, e.g. I create the next 12 months of daily tables now. This is better than having a script that creates one table each day. Make it part of your deploy/provisioning script.
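A minimal sketch of that provisioning step, assuming the google-cloud-bigquery Python client; the dataset, table prefix, and schema are hypothetical:

from datetime import date, timedelta
from google.cloud import bigquery

client = bigquery.Client()
schema = [
    bigquery.SchemaField("request_time", "TIMESTAMP"),
    bigquery.SchemaField("user_id", "STRING"),
]

# Pre-create the next 12 months of day-sharded tables in one run.
for offset in range(365):
    day = date.today() + timedelta(days=offset)
    table_id = "my-project.logs.events_{}".format(day.strftime("%Y%m%d"))
    client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)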
To read more, check out Chapter 11, "Managing Data Stored in BigQuery", from the book.
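On the query side of the trade-off, a date-bounded standard-SQL wildcard query only scans the shards in range, which is the "more tables = less data scanned" half of the summary above; a hedged sketch with hypothetical table names:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT user_id, COUNT(*) AS events
    FROM `my-project.logs.events_*`
    WHERE _TABLE_SUFFIX BETWEEN
          FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
          AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
    GROUP BY user_id
"""
# Only the shards whose suffix falls in the last 30 days are scanned and billed.
for row in client.query(sql).result():
    print(row.user_id, row.events)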

What is the mathematical relationship between "no. of rows affected" and "execution time" of a SQL query?

The query remains constant, i.e. it will stay exactly the same.
e.g. a select query takes 30 minutes if it returns 10000 rows.
Would the same query take 1 hour if it has to return 20000 rows?
I am interested in knowing the mathematical relation between the number of rows (N) and execution time (T), keeping other parameters constant (K).
i.e. T = N*K, or
T = N*K + C, or
any other formula?
I'm reading http://research.microsoft.com/pubs/76556/progress.pdf in case it helps. Anybody who understands it before me, please do reply. Thanks...
Well, that is a good question :), but there is no exact formula, because it depends on the execution plan.
The SQL query optimizer could choose a different execution plan for a query that returns a different number of rows.
I guess that if the query execution plan is the same for both queries and you have some "lab" conditions, then the time growth could be linear. You should research SQL execution plans and statistics further.
Take the very simple example of reading every row in a single table.
In the worst case, you will have to read every page of the table from your underlying storage, and the worst case for that is having to do a random seek for each page. The seek time will dominate all other factors, so you can estimate the total time:
time ~= seek time x number of data pages
Assuming your rows are of a fairly regular size, then this is linear in the number of rows.
However, databases do a number of things to try to avoid this worst case. For example, in SQL Server, table storage is often allocated in extents of 8 consecutive pages. A hard drive has a much faster streaming IO rate than random IO rate. If you have a clustered index, reading the pages in cluster order tends to involve a lot more streaming IO than random IO.
The best-case time, ignoring memory caching, is (8 KB is the SQL Server page size):
time ~= (8 KB x number of data pages) / (streaming IO rate in KB/s)
This is also linear in the number of rows.
As long as you do a reasonable job managing fragmentation, you could reasonably extrapolate linearly in this simple case. This assumes your data is much larger than the buffer cache. If not, you also have to worry about the cliff edge where your query changes from reading from buffer to reading from disk.
I'm also ignoring details like parallel storage paths and access.
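To make the two bounds concrete, here is a back-of-the-envelope calculation; the disk characteristics (8 ms seek, 150 MB/s streaming) and the table size are assumptions, not measurements:

pages = 1_000_000          # 8 KB pages, i.e. roughly an 8 GB table
seek_time_s = 0.008        # assumed random seek time
stream_rate_kbs = 150_000  # assumed sequential throughput in KB/s

worst_case_s = pages * seek_time_s            # every page a random seek
best_case_s = pages * 8 / stream_rate_kbs     # pure streaming read

print("worst case ~ {:.1f} h, best case ~ {:.0f} s".format(
    worst_case_s / 3600, best_case_s))
# Both estimates scale linearly with the number of pages, and hence rows.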

BigQuery overcharges when selecting just a few rows

select DATE(request_time) from logs.nobids_05 limit 1
gave me "3.48 GB processed" which a bit much considering that request_time is a field that appears in each row.
There are many other cases where just touching column automatically adds its total size to the cost. For example,
select * from logs.nobids_05 limit 1
gives me "This query will process 274 GB when run".
I am sure BigQuery does not need to read 274 GB to output 1 row of data.
2019 update: IF you cluster your tables, the cost of a SELECT * LIMIT 1 will be minimal.
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b
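A minimal sketch of that setup, assuming the google-cloud-bigquery Python client and standard-SQL DDL; the destination table name and the clustering column are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()
ddl = """
    CREATE TABLE `my-project.logs.nobids_05_clustered`
    PARTITION BY DATE(request_time)
    CLUSTER BY user_id
    AS SELECT * FROM `my-project.logs.nobids_05`
"""
client.query(ddl).result()  # small LIMIT queries now touch far fewer blocks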
Running a "SELECT * FROM big_table LIMIT 1" with BigQuery would be the equivalent of doing this: https://www.youtube.com/watch?v=KZ-slvv_ZT4.
BigQuery is an analytical database. Its architecture and pricing are optimized for analysis at scale, not for single-row handling.
Every operation in BigQuery involves a full table scan, but only of the columns mentioned in the query. The goal is to have predictable costs: before running the query, you can know how much data will be involved, and therefore its cost. It might seem a big price to query just one row, but the good news is that the cost remains constant, even when the queries get far more complex and CPU intensive.
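That predictability is easy to check up front: a dry run reports the bytes a query would process without actually running (or billing) it. A minimal sketch, assuming the google-cloud-bigquery Python client and a hypothetical fully-qualified table name:

from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT DATE(request_time) FROM `my-project.logs.nobids_05` LIMIT 1",
    job_config=config,
)
print("Would process {:.2f} GB".format(job.total_bytes_processed / 1024 ** 3))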
Once in a while you might need to run a single-row query, and the costs might seem excessive, but the assumption here is that you are using this tool to analyze data at scale, and the overall cost of having data stored in it should be more than competitive with other tools available. Since you've been working with other tools, I'd love to see a total cost comparison of analytical sessions in real-world scenarios.
By the way, BigQuery has a better way of doing the equivalent of "SELECT * LIMIT x". It's free, and it relies on the REST API instead of querying:
https://developers.google.com/bigquery/docs/reference/v2/tabledata/list
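For example, with the Python client the same preview goes through the tabledata.list API under the hood and is not billed as a query; the table name is hypothetical:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.logs.nobids_05")
for row in client.list_rows(table, max_results=1):
    print(dict(row.items()))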
That being said, thanks for the feedback, as there is a balancing act between making pricing more complex and making the tool better suited for other jobs, and that balance is built on the feedback we get.
I don't think this is a bug. "When you run a query, you're charged according to the total data processed in the columns you select, even if you set an explicit LIMIT on the results." (https://developers.google.com/bigquery/pricing#samplecosts)

What is the performance of HSQLDB with several clients

I would like to use HSQLDB + Hibernate on a server with 5 to 30 clients that will write to the DB fairly intensively.
Each client will persist around a dozen thousand rows in a single table every 30 seconds (24/7, that's roughly 1 billion rows/day), and the clients will also query the database for a few thousand rows at more or less random times, at an average frequency of a couple of requests every 5 to 10 seconds.
Can HSQLDB handle such a use case or should I switch to MySQL/PostgreSQL ?
You are looking at a total of 2000 - 12000 writes and 5000 - 30000 reads per second.
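A quick sanity check of where the write figure comes from, using the numbers stated in the question:

# 5 to 30 clients, each persisting ~12,000 rows every 30 seconds.
for clients in (5, 30):
    writes_per_s = clients * 12_000 / 30
    print("{} clients -> ~{:,.0f} rows written per second".format(
        clients, writes_per_s))
# 5 clients  -> ~2,000 rows/s; 30 clients -> ~12,000 rows/s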
With fast hardware, HSQLDB can probably handle this with persistent memory tables. With CACHED tables, it may be able to handle the lower range with solid state disks (disk seek time is the main parameter).
See this test. You can run it with MySQL and PostgreSQL for comparison.
http://hsqldb.org/web/hsqlPerformanceTests.html
You should switch. HSQLDB is not for critical apps. Be prepared for data corruption and decreasing startup performance over time.
The main negative hype comes from JBoss: https://community.jboss.org/wiki/HypersonicProduction
See also http://www.coderanch.com/t/89950/JBoss/HSQLDB-production
Also see this similar question: Is it safe to use HSQLDB for production? (JBoss AS5.1)