Why does the BigQuery execution engine Dremel need to load data from the BigQuery file system to local storage before processing it? - google-bigquery

I tried to dive into the BigQuery architecture and got quite confused by the information I gathered. What is described is that the execution engine, Dremel, loads data from the BigQuery file system, Colossus, to its leaf nodes' local storage and processes it from there. Why re-store the data in local storage instead of loading it into memory and processing it directly?
Can anyone help shed some light?

Dremel does not load data from Colossus into local storage; it loads it directly into memory. If you can point out where it says otherwise, we will correct it.

Related

Is there (still) an advantage to staging data on Google Cloud Storage before loading into BigQuery?

I have a data set stored as a local file (~100 GB uncompressed JSON, could still be compressed) that I would like to ingest into BigQuery (i.e. store it there).
Certain guides (for example, https://www.oreilly.com/library/view/google-bigquery-the/9781492044451/ch04.html) suggest first uploading this data to Google Cloud Storage and then loading it from there into BigQuery.
Is there an advantage in doing this over just loading it directly from the local source into BigQuery (using bq load on a local file)? It's been suggested in a few places that this might speed up loading or make it more reliable (Google Bigquery load data with local file size limit, most reliable format for large bigquery load jobs), but I'm unsure whether that's still the case today. For example, according to its documentation, BigQuery supports resumable uploads to improve reliability (https://cloud.google.com/bigquery/docs/loading-data-local#resumable), although I don't know whether those are used by bq load. The only limitation I could find that still holds is that the size of a compressed JSON file is limited to 4 GB (https://cloud.google.com/bigquery/quotas#load_jobs).
Yes, having the data in Cloud Storage is a big advantage during development. In my case, I often create a BigQuery table from the data in Cloud Storage multiple times until I have tuned everything: schema, model, partitioning, error handling, and so on. It would be really time-consuming to upload the data every time.
Cloud Storage to BigQuery
Pros
loading data is incredibly fast
you can delete the BQ table when it is not in use and re-import it when needed (a BQ table is much bigger than the plain, possibly compressed, data in Cloud Storage)
you free up your local storage
table creation is less likely to fail (loading from local storage can run into networking issues, machine issues, etc.)
Cons
you pay some additional cost for storage (if you do not plan to touch the data often, e.g. only about once per month, you can reduce the price by using Nearline storage)
So I would go for storing the data in Cloud Storage first, but of course it depends on your use case.
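
To make the comparison concrete, here is a minimal sketch of both paths using the Python BigQuery client library (google-cloud-bigquery); the project, dataset, bucket, and file names are placeholders, and bq load from the command line would work just as well for either path.

    from google.cloud import bigquery

    client = bigquery.Client()  # uses application-default credentials

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # replace with an explicit schema once it is settled
    )

    # Path 1: load from Cloud Storage. Re-creating the table after a schema
    # or partitioning change only re-runs the load job; nothing is re-uploaded.
    client.load_table_from_uri(
        "gs://my-bucket/export/*.json",
        "my-project.my_dataset.my_table",
        job_config=job_config,
    ).result()

    # Path 2: load directly from a local file. Every retry re-sends the
    # ~100 GB over your own network connection.
    with open("data.json", "rb") as source_file:
        client.load_table_from_file(
            source_file,
            "my-project.my_dataset.my_table",
            job_config=job_config,
        ).result()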

File structure of Apache Beam DynamicDestinations write to BigQuery

I am using DynamicDestinations (from BigQueryIO) to export data from one Cassandra table to multiple Google BigQuery tables. The process consists of several steps, including writing the prepared data to Google Cloud Storage (as files in JSON format) and then loading the files into BQ via load jobs.
The problem is that the export process ended with an out-of-memory error at the last step (loading the files from Google Storage into BQ). However, the prepared files with all of the data are still in GCS. There are 3 directories in the BigQueryWriteTemp location:
And there are a lot of files with non-obvious names:
The questions are: what is the storage structure of these files? How can I match the files with the tables (table names) they were prepared for? How can I use the files to continue the export process from the load-jobs step? Can I use some piece of Beam code for that?
These files, if you're using Beam 2.3.0 or earlier, contain JSON data to be imported into BigQuery using its load job API. However:
This is an implementation detail that you cannot rely on, in general. It is very likely to change in future versions of Beam (JSON is horribly inefficient).
It is not possible to match these files with the tables they are intended for: that mapping was stored in the internal state of the pipeline that failed.
There is also no way to know how much data was written to these files and how much wasn't. The files may contain only partial data: maybe your pipeline failed before creating some of the files, or after some of them were already loaded into BigQuery and deleted.
Basically, you'll need to rerun the pipeline and fix the OOM issue so that it succeeds.
For debugging OOM issues, I suggest using a heap dump. Dataflow can write heap dumps to GCS using --dumpHeapOnOOM --saveHeapDumpsToGcsPath=gs://my_bucket/. You can examine these dumps using any Java memory profiler, such as Eclipse MAT or YourKit. You can also post your code as a separate SO question and ask for advice on reducing its memory usage.
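
Separately, if you just want to confirm first-hand that the leftover files under BigQueryWriteTemp are newline-delimited JSON (even though you cannot map them back to their destination tables), here is a quick inspection sketch with the Python Cloud Storage client; the bucket name and prefix are placeholders for your pipeline's temp location.

    from google.cloud import storage

    client = storage.Client()

    # List the temp files the failed pipeline left behind.
    for blob in client.list_blobs("my-temp-bucket", prefix="BigQueryWriteTemp/"):
        # Fetch just the first few KB and print the first record.
        first_chunk = blob.download_as_bytes(start=0, end=4095)
        print(blob.name)
        print(first_chunk.split(b"\n", 1)[0][:200])
        break  # one file is enough to see the record layout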

Loading data from a Google Persistent Disk into BigQuery?

What's the recommended way of loading data into BigQuery that is currently located on a Google Persistent Disk? Are there any special tools or best practices for this particular use case?
Copy to GCS (Google Cloud Storage), point BigQuery to load from GCS.
There's no current direct connection between a persistent disk and BigQuery. You could send the data straight to BigQuery with the bq CLI, but that makes everything slower if you ever need to retry.
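
A minimal sketch of that two-step path with the Python client libraries; the mount point, bucket, and table names are placeholders, and gsutil cp followed by bq load from the VM accomplishes the same thing.

    from google.cloud import bigquery, storage

    # 1. Copy the file from the mounted persistent disk to Cloud Storage.
    storage.Client().bucket("my-bucket").blob("events.csv").upload_from_filename(
        "/mnt/disks/data/events.csv"
    )

    # 2. Point BigQuery at the GCS object; a retry only re-runs the load job
    #    instead of re-sending the data from the VM.
    bigquery.Client().load_table_from_uri(
        "gs://my-bucket/events.csv",
        "my-project.my_dataset.events",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
        ),
    ).result()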

Running Spark application using HDFS or S3

In my Spark application, I just want to access a big file and distribute the computation across many EC2 nodes.
Initially, my file is stored on S3.
It's very convenient for me to load the file from S3 with the sc.textFile() function.
However, I could put in some effort to load the data into HDFS first and then read it from there.
My question is, will the performance be better with HDFS?
My code involves Spark partitions (the mapPartitions transformation), so does it really matter what my initial file system is?
Obviously when using S3 the latency is higher and the data throughput is lower compared to HDFS on local disk.
But it depends on what you do with your data. It seems most programs are limited more by CPU power than by network throughput, so you should be fine with the 1 Gbps throughput that you get from S3.
Anyway, you can check the slides from Aaron Davidson's talk at Spark Summit 2015, where this topic is discussed:
http://www.slideshare.net/databricks/spark-summit-eu-2015-lessons-from-300-production-users/16
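
For illustration, here is a minimal PySpark sketch of both options (bucket and paths are placeholders); the per-partition work is identical either way, so the choice mainly affects how fast the initial scan is and whether repeated passes re-read from S3.

    from pyspark import SparkContext

    sc = SparkContext(appName="s3-vs-hdfs")

    # Option A: read straight from S3 (requires the Hadoop AWS connector).
    rdd = sc.textFile("s3a://my-bucket/big-file.txt")

    # Option B: copy to HDFS first (e.g. with hadoop distcp), then read locally.
    # rdd = sc.textFile("hdfs:///data/big-file.txt")

    # CPU-bound per-partition work runs the same regardless of the source.
    def parse_partition(lines):
        for line in lines:
            yield len(line.split(","))

    field_counts = rdd.mapPartitions(parse_partition)

    # Cache after the first pass if the job re-reads the input several times.
    field_counts.cache()
    print(field_counts.take(5))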

Which storage is good for read performance

I have a custom data file that I read at high speed on my local computer. The average read time is 0.5 ms in my tests (simple read operations with seeking). I want to do the same thing on Azure, so I tried to use Blob Storage with the following steps:
Create cloud storage account
Create blob client
Get container
Get blob reference
OpenRead stream
These steps take approximately 10-15 seconds. It's a read-only file. What can I do to increase read performance? What is the best storage for a large number of read operations? Right now, read speed is the most important thing for me. I do not want to bundle the data file with a web/worker role; it must be in cloud storage.
You would have to analyze your access patterns to debug this issue further. For example, OpenRead gives you a stream that is easy to work with, but its read-ahead buffering strategy might not be optimal if you are seeking within the file. By default, the stream will buffer 4MB at a time, but it has to discard this buffer if the caller seeks beyond that 4MB range. Depending on how much you read after each seek, you might want to reduce the read-ahead buffer size or use DownloadRangeToStream API directly. Or, if your blob is small enough, you can download it in one shot using DownloadToStream API and then handle it in memory.
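
The API names above are from the .NET storage SDK; as a rough sketch of the same range-read idea in the Python azure-storage-blob (v12) client, with placeholder connection, container, and blob names:

    import os
    from azure.storage.blob import BlobClient

    blob = BlobClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"],
        container_name="data",
        blob_name="custom.dat",
    )

    # Range read (the DownloadRangeToStream idea): after a "seek", fetch only
    # the bytes you actually need instead of a large read-ahead buffer.
    header = blob.download_blob(offset=0, length=4096).readall()

    # Full download (the DownloadToStream idea) for a small enough blob:
    # one round trip, then all further reads and seeks happen in memory.
    whole_file = blob.download_blob().readall()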
I would recommend using Fiddler to watch what requests your application makes to Azure Storage and see whether that is the best approach for your scenario. If you see that each individual request is taking a long time, you can enable Azure Storage Analytics to analyze the E2E latency and Server latency for those requests. Please refer to the Monitor, diagnose, and troubleshoot Microsoft Azure Storage article for more information on how to interpret Analytics data.