How does BigQuery store millions of rows of a column with categorical (duplicate) string values? - optimization

We are streaming around a million records per day into BQ and a particular string column has categorical values of "High", "Medium" and "Low".
I am trying to understand whether BigQuery applies storage optimisations beyond compression on its end, and to what extent. I looked for documentation on this but was unable to find an explanation.
For example, if I have:
**Col1**
High
High
Medium
Low
High
Low
**... 100 Million Rows**
Would BQ store it internally as follows?
**Col1**
1
1
2
3
1
3
**... 100 Million Rows**

Summary of noteworthy (and correct!) answers:
As Elliott pointed out in the comments, you can read details on BigQuery's data compression here.
As Felipe notes, there is no need to consider these details as a user of BigQuery. All such optimizations are done behind the scenes, and are being improved continuously as BigQuery evolves without any action on your part.
As Mikhail notes in the comments, you are billed by the logical data size, regardless of any optimizations applied at the storage layer.

BigQuery constantly improves the underlying storage - and this all happens without any user interaction.
To see the original ideas behind BigQuery's columnar storage, read the Dremel paper:
https://ai.google/research/pubs/pub36632
To see the most recent published improvements in storage, see Capacitor:
https://cloud.google.com/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format
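As background, here is a toy Python sketch of dictionary encoding, the general idea behind storing a low-cardinality string column as small integer codes. This is purely illustrative and is not BigQuery's actual on-disk format (Capacitor layers its own encodings, reordering, and compression on top):

```python
# Toy dictionary encoding of a low-cardinality string column.
# Illustrative only - NOT BigQuery's actual storage format.
def dictionary_encode(values):
    dictionary = {}  # string -> small integer code
    codes = []       # one code per row
    for v in values:
        code = dictionary.setdefault(v, len(dictionary))
        codes.append(code)
    return dictionary, codes

col1 = ["High", "High", "Medium", "Low", "High", "Low"]
dictionary, codes = dictionary_encode(col1)
print(dictionary)  # {'High': 0, 'Medium': 1, 'Low': 2}
print(codes)       # [0, 0, 1, 2, 0, 2] - each row stored as a tiny integer
```

Instead of 100 million repeated strings, the column reduces to a small dictionary plus 100 million small integers, which then compress very well.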

BigQuery relies on Colossus, Google’s latest generation distributed file system. Each Google datacenter has its own Colossus cluster, and each Colossus cluster has enough disks to give every BigQuery user thousands of dedicated disks at a time.
You may gather more detail from the "BigQuery under the hood" page.

Related

Fastest way to preview the first 1000 rows in BigQuery

To preview the data, I find myself running the following query quite a bit:
SELECT * FROM mytable ORDER BY row_id ASC LIMIT 1000
Is there a faster way to do this? It is literally the default view we show to the user.
Here's one suggestion for previewing data: https://cloud.google.com/bigquery/docs/best-practices-costs#preview-data, but I wasn't sure how it orders rows.
Is there a faster way to do this?
BigQuery is a big data solution, and its power lies in crunching large sets of records; in fact, the more data you have, the better BigQuery works for you.
Preview, as Mikhail stated, is not ordered and only gives you a quick look at the table's data and structure.
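If an unordered quick look is acceptable, the tabledata.list API behind that preview best practice reads rows without running a billed query. A minimal sketch with the google-cloud-bigquery Python client (the table name is an assumption):

```python
from google.cloud import bigquery

client = bigquery.Client()

# list_rows calls the tabledata.list API: no query job is run, so there is
# no query cost, and rows come back in storage order (no guaranteed ordering).
table = client.get_table("my-project.my_dataset.mytable")  # assumed table id
for row in client.list_rows(table, max_results=1000):
    print(dict(row))
```

As noted above, this is not a substitute for ORDER BY when a specific ordering matters.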
I'm not sure about your use case, but you might consider saving the data in Google Bigtable, a NoSQL database that is super fast (again, it really depends on your use case):
Consistent sub-10ms latency
Replication provides higher availability, higher durability, and resilience in the face of zonal failures
Ideal for Ad Tech, Fintech, and IoT
Storage engine for machine learning applications
Easy integration with open source big data tools
You can then expose those tables as external tables in BigQuery.

InfluxDB max available expiration and performance concerns

I build my metrics on InfluxDB. I want to keep the data forever, so my retention policy is set to inf and my shard retention policy is set to 100 years (the maximum I could set).
My main concern is degrading performance from keeping all this data. My series count will not exceed 100,000 (as advised for low server specs).
Will there be an impact on the memory used for indexing? More specifically, memory used by InfluxDB regardless of any actions such as queries/continuous queries.
Also, if performance does become a problem, is it possible to back up only the data that is about to be deleted?
Based on the InfluxDB hardware sizing guidelines, in a moderate-load situation a single-node InfluxDB deployment on a server with 6 CPU cores and 8-32 GB of RAM can handle 250k writes per second and about 25 queries per second. These numbers will definitely meet your requirements, and by increasing CPU and RAM you can achieve better performance.
Note: if the scale of your workload grows in the future, you can also use continuous queries to down-sample old data, or export part of the data to a backup file.
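As a sketch of that continuous-query down-sampling option, using the influxdb Python client and InfluxQL (the database, measurement, and target names are assumptions, not part of the original question):

```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="mydb")  # assumed database

# Keep raw, high-resolution points for 90 days only (instead of forever).
client.create_retention_policy("raw_90d", "90d", "1", database="mydb", default=True)

# Down-sample raw points into hourly means kept in the infinite-retention
# "autogen" policy, so old high-resolution data can expire without losing trends.
client.query("""
    CREATE CONTINUOUS QUERY "cq_cpu_1h" ON "mydb"
    BEGIN
      SELECT mean("value") AS "value"
      INTO "mydb"."autogen"."cpu_1h"
      FROM "cpu"
      GROUP BY time(1h), *
    END
""")
```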

Bigtable/BigQuery pricing when inserts depend on lookups

I have a simple proof-of-concept application written in traditional SQL. I need to scale it to much larger size (potentially trillions of rows, multiple terabytes or possibly petabytes in size). I'm trying to come up with the pricing model of how this could be done using Google's Bigtable/BigQuery/Dataflow.
From what I gather from Google's pricing documents, Bigtable is priced in terms of nodes needed to handle the necessary QPS and in terms of storage required, whereas the BigQuery is priced in terms of each query's size.
But what happens when your inserts into the table actually require the lookup of that same table? Does that mean that you have to consider an additional cost factor into each insert? If my total column size is 1TB and I have to do a SELECT on that column before each additional insert, will I be charged $5 for each insert operation as a consequence? Do I have to adjust my logic to accommodate this pricing structure? Like breaking the table into a set of smaller tables, etc?
Any clarification much appreciated, as well as links to more detailed and granular pricing examples for Bigtable/BigQuery/Dataflow than what's available on Google's website.
I am the product manager for Google Cloud Bigtable.
It's hard to give a detailed answer without a deeper understanding of the use case. For example, when you need to do a lookup before doing an insert, what's the complexity of the query? Is it an arbitrary SQL query, or can you get by with a lookup by primary key? How big is the data set?
If you only need to do lookups by key, then you may be able to use Bigtable (which, like HBase, only has a single key: the row key), and each lookup by row key is fast and does not require scanning the entire column.
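For illustration, a minimal sketch of such a point lookup by row key with the google-cloud-bigtable Python client (the project, instance, table, and row-key scheme are assumptions):

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project")   # assumed project id
instance = client.instance("my-instance")        # assumed instance id
table = instance.table("events")                 # assumed table id

# A point read by row key touches only that row; it does not scan the table.
row = table.read_row(b"user123#2016-01-01")      # assumed row-key scheme
if row is not None:
    for family, columns in row.cells.items():
        for qualifier, cells in columns.items():
            print(family, qualifier, cells[0].value)
```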
If you need complex lookups, you may be able to use:
Google BigQuery, but note that each lookup on a column is a full scan as per this answer, though as suggested in another answer, you can partition data to scan less data, if that's helpful
Google Cloud Datastore, which is a document database (like MongoDB), allows you to set up indexes on some of the fields, so you can do a search based on those properties
Google Cloud SQL, which is a managed service for MySQL, but while it can scale to TB, it does not scale to PB, so it depends how big your dataset is that you need to query prior to inserting
Finally, if your use case is going into the PB-range, I strongly encourage you to get in touch with Google Cloud Platform folks and speak with our architects and engineers to identify the right overall solution for your specific use cases, as there may be other optimizations that we can make if we can discuss your project in more detail.
Regarding BigQuery, you can partition your data by day. So if you only need to query the last few days, you will be charged for that data and not for the full table.
On the other hand, you need to rethink your data management. Choosing an append-only, event-based data flow could help you avoid lookups on the same table.
will I be charged $5 for each insert operation as a consequence?
Yes, any time you scan that column you will be charged for the full column's size, unless your result is cacheable (see query caching), which most likely is not your case.
Do I have to adjust my logic ... ?
Yes.
"breaking the table into a set of smaller tables" (Sharding with Table wildcard functions) or Partitioning is the way to go for you

Pros & cons of BigQuery vs. Amazon Redshift [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
Comparing Google BigQuery vs. Amazon Redshift shows that both can answer the same set of requirements; they differ mostly in their cost plans. It seems that Redshift is more complex to configure (defining keys and optimization work), whereas Google BigQuery perhaps has an issue with joining tables.
Is there a pros & cons list of Google BigQuery vs. Amazon Redshift?
I posted this comparison on reddit. Quickly enough, a long-term Redshift practitioner came to comment on my statements. Please see https://www.reddit.com/r/bigdata/comments/3jnam1/whats_your_preference_for_running_jobs_in_the_aws/cur518e for the full conversation.
Sizing your cluster:
Redshift will ask you to choose a number of CPUs, RAM, HD, etc. and to turn them on.
BigQuery doesn't care. Use it whenever you want, no provisioning needed.
Hourly costs when doing nothing:
Redshift will ask you to pay per hour of each of these servers running, even when you are doing nothing.
When idle BigQuery only charges you $0.02 per month per GB stored. 2 cents per month per GB, that's it.
Speed of queries:
Redshift performance is limited by the number of CPUs you are paying for.
BigQuery transparently brings in as many resources as needed to run your query in seconds.
Indexing:
Redshift will ask you to index (correction: distribute) your data under certain criteria, and you'll only be able to run fast queries based on this index.
BigQuery has no indexes. Every operation is fast.
Vacuuming:
Redshift requires periodic maintenance and 'vacuum' operations that last hours. You are paying for each of these server hours.
BigQuery does not. Forget about 'vacuuming'.
Data partitioning and distributing:
Redshift requires you to think about how to distribute data within your servers to keep performance up - optimization that works only for certain queries.
BigQuery does not. Just run whatever query you want.
Streaming live data:
Impossible(?) with Redshift.
BigQuery easily handles ingesting up to 100,000 rows per second per table (see the streaming sketch after this list).
Growing your cluster:
If you have more data or more concurrent users, scaling up will be painful with Redshift.
BigQuery will just work.
Multi zone:
You want a multi-zone Redshift for availability and data integrity? Painful.
BigQuery is multi-zoned by default.
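For reference, a minimal sketch of the streaming path mentioned under "Streaming live data", using the Python client's streaming-insert API (the table name and schema are assumptions):

```python
from google.cloud import bigquery

client = bigquery.Client()

# insert_rows_json uses the streaming insert API (tabledata.insertAll),
# so rows become queryable within seconds of being sent.
rows = [
    {"event_time": "2016-01-01T12:00:00Z", "level": "High"},  # assumed schema
    {"event_time": "2016-01-01T12:00:01Z", "level": "Low"},
]
errors = client.insert_rows_json("my-project.my_dataset.events", rows)  # assumed table
if errors:
    print("Insert errors:", errors)
```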
To try BigQuery you don't need a credit card or any setup time. Just try it (quick instructions to try BigQuery).
When you are ready to put your own data into BigQuery, just copy your newline-delimited JSON logs to Google Cloud Storage and import them.
See this in depth guide to data warehouse pricing on the cloud:
Understanding Cloud Pricing Part 3.2 - More Data Warehouses
Amazon Redshift is a standard SQL database (based on Postgres) with MPP features that allow it to scale. These features also require you to conform your data model somewhat to get the best performance. It supports a large amount of the SQL standard and most tools that can speak to Postgres can use it unchanged.
BigQuery is not a database, in the sense that it doesn't use standard SQL and doesn't provide JDBC/ODBC connectivity. It's a unique service with its own API and interfaces. It provides limited support for SQL queries, but most users interact with it via custom code (Java, Python, etc.). Some third-party tools have added support for BigQuery, but existing tools will not work without modification.
tl;dr - Redshift is better for interacting with existing tools and using complex SQL. BigQuery is better for custom coded interactions and teams who dislike SQL.
UPDATE 2017-04-17 - Here's a much more up to date summary of the cost and speed differences (wrapped in a sales pitch so YMMV). TL;DR - Redshift is usually faster and will be cheaper if you query the data somewhat regularly. http://blog.panoply.io/a-full-comparison-of-redshift-and-bigquery
UPDATE - Since I keep getting down votes on this (🤷‍♂️) here's an up-to-date response to the items in the other answer:
Sizing your cluster:
Redshift allows you to tailor your costs to your usage. If you want the fastest possible queries choose SSD nodes and if you want the lowest possible cost per GB choose HDD nodes. Start small and add nodes whenever you want.
Hourly costs when doing nothing:
Redshift keeps your cluster ready for queries, can respond in milliseconds (result cache) and it provides a simple, predictable monthly bill.
For example, even if some script accidentally runs 10,000 giant queries over the weekend your Redshift bill will not increase at all.
Speed of queries:
Redshift performance is absolutely best in class and gets faster all the time. 3-5x faster in the last 6 months.
Indexing:
Redshift has no indexes. It allows you to define sort keys to optimize performance from fast to insanely fast.
Vacuuming:
Redshift now automatically runs routine maintenance such as ANALYZE and VACUUM DELETE when your cluster has free resources.
Data partitioning and distributing:
Redshift never requires distribution. It allows you to define distribution keys which can make even huge joins very fast.
{Ask competitors about join performance…}
Streaming live data:
Redshift has two options:
Stream real-time data into Redshift using Amazon Kinesis Firehose.
Skip ingestion altogether by querying your real-time data on S3 as soon as it lands (and at high speed) using Redshift Spectrum external tables.
Growing your cluster:
Redshift can elastically resize most clusters in a few minutes.
Multi zone:
Redshift seamlessly replaces any failed hardware and continuously backs up your data, including across regions if desired.

What is the best way to store highly parametrized entities?

OK, let me try to explain this in more detail.
I am developing a diagnostic system for airplanes. Imagine that each airplane has 6 to 8 on-board computers, and each computer has more than 200 different parameters. The diagnostic system receives all these parameters in a binary-formatted package, then I convert the data according to formulas (to km, km/h, rpm, min, sec, pascals and so on) and must store it somehow in a database. New data must be handled and persisted every 10-20 seconds.
We store the data for further analytic processing.
Requirements of storage:
support sharding and replication
fast read: support btree-indexing
NOSQL
fast write
So, I calculated the average disk or RAM usage per plane per day: about 10-20 MB of data. The estimated load is therefore 100 airplanes per day, or 2 GB of data per day.
It seems that storing all the data in RAM (memcached-like stores: Redis, Membase) is not suitable (too expensive). Right now I am looking at MongoDB, since it can use both RAM and disk and supports all the requirements listed above.
Please share your experience and advice.
There is a helpful article on NOSQL DBMS Comparison.
You may also find information about their ranking and popularity, by category.
Given your requirements, it seems Apache Cassandra would be a candidate due to its linear scalability, column indexes, MapReduce support, materialized views, and powerful built-in caching.
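If Cassandra ends up being the choice, here is a minimal sketch of a time-series table for this kind of telemetry, using the DataStax Python driver (the keyspace, table, and column names are assumptions, and the keyspace is assumed to already exist):

```python
from datetime import date, datetime
from cassandra.cluster import Cluster

# Connect to a local Cassandra node (assumed contact point and keyspace).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("diagnostics")

# One partition per (plane, computer, day); readings are clustered by time,
# so range scans over a time window stay on a single partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        plane_id    text,
        computer_id int,
        day         date,
        ts          timestamp,
        parameter   text,
        value       double,
        PRIMARY KEY ((plane_id, computer_id, day), ts, parameter)
    )
""")

session.execute(
    "INSERT INTO readings (plane_id, computer_id, day, ts, parameter, value) "
    "VALUES (%s, %s, %s, %s, %s, %s)",
    ("RA-1234", 3, date(2016, 1, 1), datetime(2016, 1, 1, 12, 0, 0), "altitude_km", 10.2),
)
```

Partitioning by day keeps individual partitions bounded in size while still allowing fast per-plane, per-computer time-range queries.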