Bigtable/BigQuery pricing when inserts depend on lookups - google-bigquery

I have a simple proof-of-concept application written in traditional SQL. I need to scale it to much larger size (potentially trillions of rows, multiple terabytes or possibly petabytes in size). I'm trying to come up with the pricing model of how this could be done using Google's Bigtable/BigQuery/Dataflow.
From what I gather from Google's pricing documents, Bigtable is priced in terms of nodes needed to handle the necessary QPS and in terms of storage required, whereas the BigQuery is priced in terms of each query's size.
But what happens when your inserts into the table actually require the lookup of that same table? Does that mean that you have to consider an additional cost factor into each insert? If my total column size is 1TB and I have to do a SELECT on that column before each additional insert, will I be charged $5 for each insert operation as a consequence? Do I have to adjust my logic to accommodate this pricing structure? Like breaking the table into a set of smaller tables, etc?
Any clarification much appreciated, as well as links to more detailed and granular pricing examples for Bigtable/BigQuery/Dataflow than what's available on Google's website.

I am the product manager for Google Cloud Bigtable.
It's hard to give a detailed answer without a deeper understanding of the use case. For example, when you need to do a lookup before doing an insert, what's the complexity of the query? Is it an arbitrary SQL query, or can you get by with a lookup by primary key? How big is the data set?
If you only need to do lookups by key, then you may be able to use Bigtable (which, like HBase, only has a single key: the row key), and each lookup by row key is fast and does not require scanning the entire column.
If you need complex lookups, you may be able to use:
Google BigQuery, but note that each lookup on a column is a full scan as per this answer, though as suggested in another answer, you can partition data to scan less data, if that's helpful
Google Cloud Datastore, which is a document database (like MongoDB), allows you to set up indexes on some of the fields, so you can do a search based on those properties
Google Cloud SQL, which is a managed service for MySQL, but while it can scale to TB, it does not scale to PB, so it depends how big your dataset is that you need to query prior to inserting
Finally, if your use case is going into the PB-range, I strongly encourage you to get in touch with Google Cloud Platform folks and speak with our architects and engineers to identify the right overall solution for your specific use cases, as there may be other optimizations that we can make if we can discuss your project in more detail.

Regarding BigQuery, you are able to partition your data based on day. So if you need to query only last days the charge will be for that and not for full table.
On the other hand, you need to rethink your data management. Choosing an append-only and event based data flow could help you to avoid lookups on the same table.

will I be charged $5 for each insert operation as a consequence?
Yes, any time you scan that column - you will be charged for full column's size unless your result is cachable (see query caching) which most likely is not your case
Do I have to adjust my logic ... ?
Yes.
"breaking the table into a set of smaller tables" (Sharding with Table wildcard functions) or Partitioning is the way to go for you

Related

How handle Hash-key, partition and index on Azure?

I've studied Azure Synapse and distribution types.
Hash-distributed table needs a column to distribute the data between different nodes,
For me it's the same idea of partition, I saw some examples that uses a hash-key, partition and index. It's not clear in my mind their differences and how to choose one of them. How Hash-key, partition and index could work together?
Just an analogy which might explain the difference between Hash and Partition
Suppose there exists one massive book about all history of the world. It has the size of a 42 story building.
Now what if the librarian splits that book into 1 book per year. That makes it much easier to find all information you need for some specific years. Because you can just keep the other books on the shelves.
A small book is easier to carry too.
That's what table partitioning is about. (Reference: Data Partitioning in Azure)
Keeping chunks of data together, based on a key (or set of columns) that is usefull for the majority of the queries and has a nice average distribution.
This can reduce IO because only the relevant chunks need to be accessed.
Now what if the chief librarian unbinds that book. And sends sets of pages to many different libraries. When we then need certain information, we ask each library to send us copies of the pages we need.
Even better, those librarians could already summarize the information of their pages and then just send only their summaries to one library that collects them for you.
That's what the table distribution is about. (Reference: Table Distribution Guidance in Azure)
To spread out the data over the different nodes.
For more details:
What is a difference between table distribution and table partition in sql?
https://www.linkedin.com/pulse/partitioning-distribution-azure-synapse-analytics-swapnil-mule
And indexing is physical arrangement within those nodes

What is a difference between table distribution and table partition in sql?

I am still struggling with identifying how the concept of table distribution in azure sql data warehouse differs from concept of table partition in Sql server?
Definition of both seems to be achieving same results.
Azure DW has up to 60 computing nodes as part of it's MPP architecture. When you store a table on Azure DW you are storing it amongst those nodes. Your tables data is distributed across these nodes (using Hash distribution or Round Robin distribution depending on your needs). You can also choose to have your table (preferably a very small table) replicated across these nodes.
That is distribution. Each node has its own distinct records that only that node worries about when interacting with the data. It's a shared-nothing architecture.
Partitioning is completely divorced from this concept of distribution. When we partition a table we decide which rows belong into which partitions based on some scheme (like partitioning an order table by the order.create_date for instance). A chunk of records for each create_date then gets stored in its own table separate from any other create_date set of records (invisibly behind the scenes).
Partitioning is nice because you may find that you only want to select 10 days worth of orders from your table, so you only need to read against 10 smaller tables, instead of having to scan across years of order data to find the 10 days you are after.
Here's an example from the Microsoft website where horizontal partitioning is done on the name column with two "shards" based on the names alphabetical order:
Table distribution is a concept that is only available on MPP type RDBMSs like Azure DW or Teradata. It's easiest to think of it as a hardware concept that is somewhat divorced (to a degree) from the data. Azure gives you a lot of control here where other MPP databases base distribution on primary keys. Partitioning is available on nearly every RDBMS (MPP or not) and it's easiest to think of it as a storage/software concept that is defined by and dependent on the data in the table.
In the end, they do both work to solve the same problem. But... nearly every RDBMS concept (indexing, disk storage, optimization, partition, distribution, etc) are there to solve the same problem. Namely: "How do I get the exact data I need out as quickly as possible?" When you combine these concepts together to match your data retrieval needs you make your SQL requests CRAZY fast even against monstrously huge data.
Just for fun, allow me to explain it with an analogy.
Suppose there exists one massive book about all history of the world. It has the size of a 42 story building.
Now what if the librarian splits that book into 1 book per year. That makes it much easier to find all information you need for some specific years. Because you can just keep the other books on the shelves.
A small book is easier to carry too.
That's what table partitioning is about. (Reference: Data Partitioning in Azure)
Keeping chunks of data together, based on a key (or set of columns) that is usefull for the majority of the queries and has a nice average distribution.
This can reduce IO because only the relevant chunks need to be accessed.
Now what if the chief librarian unbinds that book. And sends sets of pages to many different libraries.
When we then need certain information, we ask each library to send us copies of the pages we need.
Even better, those librarians could already summarize the information of their pages and then just send only their summaries to one library that collects them for you.
That's what the table distribution is about. (Reference: Table Distribution Guidance in Azure)
To spread out the data over the different nodes.
Conceptually they are the same. The basic idea is that the data will be split across multiple stores. However, the implementation is radically different. Under the covers, Azure SQL Data Warehouse manages and maintains the 70 databases that each table you define is created within. You do nothing beyond define the keys. The distribution is taken care of. For partitioning, you have to define and maintain pretty much everything to get it to work. There's even more to it, but you get the core idea. These are different processes and mechanisms that are, at the macro level, arriving at a similar end point. However, the processes these things support are very different. The distribution assists in increased performance while partitioning is primarily a means of improved data management (rolling windows, etc.). These are very different things with different intents even as they are similar.

Why Spark SQL considers the support of indexes unimportant?

Quoting the Spark DataFrames, Datasets and SQL manual:
A handful of Hive optimizations are not yet included in Spark. Some of
these (such as indexes) are less important due to Spark SQL’s
in-memory computational model. Others are slotted for future releases
of Spark SQL.
Being new to Spark, I'm a bit baffled by this for two reasons:
Spark SQL is designed to process Big Data, and at least in my use
case the data size far exceeds the size of available memory.
Assuming this is not uncommon, what is meant by "Spark SQL’s
in-memory computational model"? Is Spark SQL recommended only for
cases where the data fits in memory?
Even assuming the data fits in memory, a full scan over a very large
dataset can take a long time. I read this argument against
indexing in in-memory database, but I was not convinced. The example
there discusses a scan of a 10,000,000 records table, but that's not
really big data. Scanning a table with billions of records can cause
simple queries of the "SELECT x WHERE y=z" type take forever instead
of returning immediately.
I understand that Indexes have disadvantages like slower INSERT/UPDATE, space requirements, etc. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.
I'm wondering then why the Spark SQL team considers indexes unimportant to a degree that it's off their road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?
Indexing input data
The fundamental reason why indexing over external data sources is not in the Spark scope is that Spark is not a data management system but a batch data processing engine. Since it doesn't own the data it is using it cannot reliably monitor changes and as a consequence cannot maintain indices.
If data source supports indexing it can be indirectly utilized by Spark through mechanisms like predicate pushdown.
Indexing Distributed Data Structures:
standard indexing techniques require persistent and well defined data distribution but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
high level data layout achieved by proper partitioning combined with columnar storage and compression can provide very efficient distributed access without an overhead of creating, storing and maintaining indices.This is a common pattern used by different in-memory columnar systems.
That being said some forms of indexed structures do exist in Spark ecosystem. Most notably Databricks provides Data Skipping Index on its platform.
Other projects, like Succinct (mostly inactive today) take different approach and use advanced compression techniques with with random access support.
Of course this raises a question - if you require an efficient random access why not use a system which is design as a database from the beginning. There many choices out there, including at least a few maintained by the Apache Foundation. At the same time Spark as a project evolves, and the quote you used might not fully reflect future Spark directions.
In general, the utility of indexes is questionable at best. Instead, data partitioning is more important. They are very different things, and just because your database of choice supports indexes doesn't mean they make sense given what Spark is trying to do. And it has nothing to do with "in memory".
So what is an index, anyway?
Back in the days when permanent storage was crazy expensive (instead of essentially free) relational database systems were all about minimizing usage of permanent storage. The relational model, by necessity, split a record into multiple parts -- normalized the data -- and stored them in different locations. To read a customer record, maybe you read a customer table, a customerType table, take a couple of entries out of an address table, etc. If you had a solution that required you to read the entire table to find what you want, this is very costly, because you have to scan so many tables.
But this is not the only way to do things. If you didn't need to have fixed-width columns, you can store the entire set of data in one place. Instead of doing a full-table scan on a bunch of tables, you only need to do it on a single table. And that's not as bad as you think it is, especially if you can partition your data.
40 years later, the laws of physics have changed. Hard drive random read/write speeds and linear read/write speeds have drastically diverged. You can basically do 350 head movements a second per disk. (A little more or less, but that's a good average number.) On the other hand, a single disk drive can read about 100 MB per second. What does that mean?
Do the math and think about it -- it means if you are reading less than 300KB per disk head move, you are throttling the throughput of your drive.
Seriouusly. Think about that a second.
The goal of an index is to allow you to move your disk head to the precise location on disk you want and just read that record -- say just the address record joined as part of your customer record. And I say, that's useless.
If I were designing an index based on modern physics, it would only need to get me within 100KB or so of the target piece of data (assuming my data had been laid out in large chunks -- but we're talking theory here anyway). Based on the numbers above, any more precision than that is just a waste.
Now go back to your normalized table design. Say a customer record is really split across 6 rows held in 5 tables. 6 total disk head movements (I'll assume the index is cached in memory, so no disk movement). That means I can read 1.8 MB of linear / de-normalized customer records and be just as efficient.
And what about customer history? Suppose I wanted to not just see what the customer looks like today -- imagine I want the complete history, or a subset of the history? Multiply everything above by 10 or 20 and you get the picture.
What would be better than an index would be data partitioning -- making sure all of the customer records end up in one partition. That way with a single disk head move, I can read the entire customer history. One disk head move.
Tell me again why you want indexes.
Indexes vs ___ ?
Don't get me wrong -- there is value in "pre-cooking" your searches. But the laws of physics suggest a better way to do it than traditional indexes. Instead of storing the customer record in exactly one location, and creating a pointer to it -- an index -- why not store the record in multiple locations?
Remember, disk space is essentially free. Instead of trying to minimize the amount of storage we use -- an outdated artifact of the relational model -- just use your disk as your search cache.
If you think someone wants to see customers listed both by geography and by sales rep, then make multiple copies of your customer records stored in a way that optimized those searches. Like I said, use the disk like your in memory cache. Instead of building your in-memory cache by drawing together disparate pieces of persistent data, build your persistent data to mirror your in-memory cache so all you have to do is read it. In fact don't even bother trying to store it in memory -- just read it straight from disk every time you need it.
If you think that sounds crazy, consider this -- if you cache it in memory you're probably going to cache it twice. It's likely your OS / drive controller uses main memory as cache. Don't bother caching the data because someone else is already!
But I digress...
Long story short, Spark absolutely does support the right kind of indexing -- the ability to create complicated derived data from raw data to make future uses more efficient. It just doesn't do it the way you want it to.

Rule of thumb when to partition tables

I've seen many entries about partitioning tables, but there is not a lot of information on when you should make a partition.
Is there a rule of thumb when you should partition tables in SQL Server.
Thanks
My benchmarks indicate that it depends on the query load.
If the queries you perform ALWAYS contain a filter on the partition field the performance benefit is virtually instant (like 1000 records in the table is already beneficial)
If the queries do NOT always contain a filter on the partition field you really have to benchmark with a good sample of the query load before making the decision.
You also have to account for the partition system you use. if you use "static" partitions there is not much harm in creating them immediately. When you use a "sliding window" system you need to take into account the overhead of creating and merging partitions. (which can take a long time on big tables)
#Filip's post is a great topical guide. When your doing your ontologies and estimating how your application will be used, that is, how your users will interact with the application and how that translates to database access, you should have a good idea of the kind of queries that will be performed and how fast certain tables will grow. If your that confident, then you should immediately partition the tables to defer from any maintenance hapzards.
But if your trying to decide on whether to partition populated tables, or you like to perform partitioning lazily like me, here's a nice little nugget from the PostgreSQL docs:
The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server. [src]

real-time data warehouse for web access logs

We're thinking about putting up a data warehouse system to load with web access logs that our web servers generate. The idea is to load the data in real-time.
To the user we want to present a line graph of the data and enable the user to drill down using the dimensions.
The question is how to balance and design the system so that ;
(1) the data can be fetched and presented to the user in real-time (<2 seconds),
(2) data can be aggregated on per-hour and per-day basis, and
(2) as large amount of data can still be stored in the warehouse, and
Our current data-rate is roughly ~10 accesses per second which gives us ~800k rows per day. My simple tests with MySQL and a simple star schema shows that my quires starts to take longer than 2 seconds when we have more than 8 million rows.
Is it possible it get real-time query performance from a "simple" data warehouse like this,
and still have it store a lot of data (it would be nice to be able to never throw away any data)
Are there ways to aggregate the data into higher resolution tables?
I got a feeling that this isn't really a new question (i've googled quite a lot though). Could maybe someone give points to data warehouse solutions like this? One that comes to mind is Splunk.
Maybe I'm grasping for too much.
UPDATE
My schema looks like this;
dimensions:
client (ip-address)
server
url
facts;
timestamp (in seconds)
bytes transmitted
Seth's answer above is a very reasonable answer and I feel confident that if you invest in the appropriate knowledge and hardware, it has a high chance of success.
Mozilla does a lot of web service analytics. We keep track of details on an hourly basis and we use a commercial DB product, Vertica. It would work very well for this approach but since it is a proprietary commercial product, it has a different set of associated costs.
Another technology that you might want to investigate would be MongoDB. It is a document store database that has a few features that make it potentially a great fit for this use case.
Namely, the capped collections (do a search for mongodb capped collections for more info)
And the fast increment operation for things like keeping track of page views, hits, etc.
http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics
Doesn't sound like it would be a problem. MySQL is very fast.
For storing logging data, use MyISAM tables -- they're much faster and well suited for web server logs. (I think InnoDB is the default for new installations these days - foreign keys and all the other features of InnoDB aren't necessary for the log tables). You might also consider using merge tables - you can keep individual tables to a manageable size while still being able to access them all as one big table.
If you're still not able to keep up, then get yourself more memory, faster disks, a RAID, or a faster system, in that order.
Also: Never throwing away data is probably a bad idea. If each line is about 200 bytes long, you're talking about a minimum of 50 GB per year, just for the raw logging data. Multiply by at least two if you have indexes. Multiply again by (at least) two for backups.
You can keep it all if you want, but in my opinion you should consider storing the raw data for a few weeks and the aggregated data for a few years. For anything older, just store the reports. (That is, unless you are required by law to keep around. Even then, it probably won't be for more than 3-4 years).
Also, look into partitioning, especially if your queries mostly access latest data; you could -- for example -- set-up weekly partitions of ~5.5M rows.
If aggregating per-day and per hour, consider having date and time dimensions -- you did not list them so I assume you do not use them. The idea is not to have any functions in a query, like HOUR(myTimestamp) or DATE(myTimestamp). The date dimension should be partitioned the same way as fact tables.
With this in place, the query optimizer can use partition pruning, so the total size of tables does not influence the query response as before.
This has gotten to be a fairly common data warehousing application. I've run one for years that supported 20-100 million rows a day with 0.1 second response time (from database), over a second from web server. This isn't even on a huge server.
Your data volumes aren't too large, so I wouldn't think you'd need very expensive hardware. But I'd still go multi-core, 64-bit with a lot of memory.
But you will want to mostly hit aggregate data rather than detail data - especially for time-series graphing over days, months, etc. Aggregate data can be either periodically created on your database through an asynchronous process, or in cases like this is typically works best if your ETL process that transforms your data creates the aggregate data. Note that the aggregate is typically just a group-by of your fact table.
As others have said - partitioning is a good idea when accessing detail data. But this is less critical for the aggregate data. Also, reliance on pre-created dimensional values is much better than on functions or stored procs. Both of these are typical data warehousing strategies.
Regarding the database - if it were me I'd try Postgresql rather than MySQL. The reason is primarily optimizer maturity: postgresql can better handle the kinds of queries you're likely to run. MySQL is more likely to get confused on five-way joins, go bottom up when you run a subselect, etc. And if this application is worth a lot, then I'd consider a commercial database like db2, oracle, sql server. Then you'd get additional features like query parallelism, automatic query rewrite against aggregate tables, additional optimizer sophistication, etc.