Inserting large amounts of test data into BigQuery can be slow, especially if the exact details of the data aren't important and you just want to test the performance of a particular shape of query/data.
What's the best way to achieve this without waiting around for many GB of data to upload to GCS?
In general, I'd recommend testing over small amounts of data (to save money and time).
If you really need large amounts of test data, there are several options.
If you care about the exact structure of the data:
You can upload data to GCS in parallel (if a slow single transfer is the bottleneck).
You could create a short-lived Compute Engine VM and use it to insert test data into GCS (which is likely to provide higher throughput than over your local link). This is somewhat involved, but gives you a very fast path for inserting data generated on-the-fly by a script.
If you just want to try out the capabilities of the platform, there are a number of public datasets available for experimentation. See:
https://cloud.google.com/bigquery/docs/sample-tables
If you just need a large amount of data and duplicate rows are acceptable:
You can insert a moderate amount of data via upload to GCS. Then duplicate it by querying the table and appending the result to the original. You can also use the bq command line tool with copy and the --append flag to achieve a similar result without being charged for a query.
This method has a bit of a caveat -- to get performance similar to typical production usage, you'll want to load your data in reasonably large chunks. For a 400GB use case, I'd consider starting with 250MB - 1GB of data in a single import. Many tiny insert operations will slow things down (and are better handled via the streaming API, which does the appropriate batching for you).
Related
I already have terabytes of data stored on BigQuery and I'm wondering to perform heavy data transformations on it.
Considering COSTS and PERFORMANCE, what the best approach you guys would suggest to perform these transformations for future usage of these data on BigQuery?
I'm considering a few options:
1. Read raw data from DataFlow and then load the transformed data back into BigQuery?
2. Do it directly from BigQuery?
Any ideas about how to proceed with this?
I wrote down some most important things about performance, you can find there consideration regarding your question about using DataFlow.
Best practices considering performance:
Choosing file format:
BigQuery supports a wide variety of file formats for data ingestion. Some are going to be naturally faster than others. When optimizing for load speed, prefer using the AVRO file format, which is binary, row-based format and enables to split it and then read it in parallel with multiple workers.
Loading data from compressed files, specifically CSV and JSON, is going to be slower than loading data in a other format. And the reason being is because, since the compression of Gzip is not splitable, there is a need to take that file, load it onto a slot within BQ, and then do the decompression, and then finally parallelize the load afterwards.
**FASTER**
Avro(Compressed)
Avro(Uncompressed)
Parquet/ORC
CSV
JSON
CSV (Compressed)
JSON(Compressed
**SLOWER**
ELT / ETL:
After loading data into BQ, you can think about transformations (ELT or ETL). So in general, you want to prefer ELT over ETL where possible. BQ is very scalable and can handle large transformations on a ton of data. ELT is also quite a bit simpler, because you could just write some SQL queries, transform some data and then move data around between tables, and not have to worry about managing a separate ETL application.
Raw and staging tables:
Once, you have started loading data into BQ, in general, within your warehouse, you're going to want to leverage raw and staging tables before publishing to reporting tables. The raw table essentially contains the full daily extract, or a full load of the data that they're loading. Staging table then is basically your change data capture table, so you can utilize queries or DML to marge that data into your staging table and have a full history of all the data that's being inserted. And then finally your reporting tables are going to be the ingest that you publish out to your users.
Speeding up pipelines using DataFlow:
When you're getting into streaming loads really complex batch loads (that doesn't really fit into SQL cleanly), you can leverage DataFlow or DataFusion to speed up those pipelines, and do more complex activities on that data. And if you're starting with streaming, I recommend using the DataFlow templates - Google provides it for loading data from multiple different places and moving data around. You can find those templates in DataFlow UI, within Create Job from Template button, you'll find all these templates.
And if you find that it mostly fits your use case, but want to make one slight modification, all those templates are also open sourced (so you can go to repo, modify the code to fit your needs).
Partitioning:
Partition in BQ physically split your data on disk, based on ingestion time or based on a column within your data. Efficiently query over the parts of the table you want. This provides huge cost and performance benefits, especially on large fact tables. Whenever you have a fact table or temporal table, utilize a partition column on your date dimension.
Cluster Frequently Accessed Fields:
Clustering allows you to physically order data within a partition. So you can do Clustering by one or multiple keys. This provide massive performance benefits when used properly.
BQ reservations:
It allows to create reservations of slots, assign project to those reservations, so you can allocate more or less resources to certain types of queries.
Best practices considering saving costs you can find in official documentation.
I hope it helps you.
According to this Google Cloud Documentation, the following questions should be done to choose between DataFlow or BigQuery tool for ELT.
Although the data is small and can quickly be uploaded by using the BigQuery UI, for the purpose of this tutorial you can also use Dataflow for ETL. Use Dataflow for ETL into BigQuery instead of the BigQuery UI when you are performing massive joins, that is, from around 500-5000 columns of more than 10 TB of data, with the following goals:
You want to clean or transform your data as it's loaded into BigQuery, instead of storing it and joining afterwards. As a result,
this approach also has lower storage requirements because data is only
stored in BigQuery in its joined and transformed state.
You plan to do custom data cleansing (which cannot be simply achieved with SQL).
You plan to combine the data with data outside of the OLTP, such as logs or remotely accessed data, during the loading process.
You plan to automate testing and deployment of data-loading logic using continuous integration or continuous deployment (CI/CD).
You anticipate gradual iteration, enhancement, and improvement of the ETL process over time.
You plan to add data incrementally, as opposed to performing a one-time ETL.
I am trying to load a series of CSV files, ranging from 100MB to 20GB in size (total of ~3TB). So, I need every performance enhancement that I can. I am aiming to use filegrouping, and partitioning as a mean. I performed a series of tests to see the optimum approach.
First, I tried various filegroup combination; the best I get is when I am loading into a table that is on 1 filegroup; with multiple files assigned to it, and they are all siting on one disc. This combination outperformed to the case that I have multiple filegroups.
Next step was naturally to have partitioning. ODDLY, all the partitioning combination that I examined have lower performance. I tried defining various partition function/schemes and various filegroup combinations. But all showed a lower loading speed.
I am wondering what I am missing here!?
So far, I managed to load (using bulk insert) a 1GB csv file in 3 minutes. Any idea is much appreciated.
For gaining optimal Data Loading speed you need to first understand SQL Server data load process, which means understanding how SQL Server achieves below mentioned optimizations.
Minimal Logging.
Parallel Loading.
Locking Optimization.
These two article will explain in detail how you can achieve all the above optimizations in detail. Fastest Data Loading using Bulk Load and Minimal Logging and Bulk Loading data into HEAP versus CLUSTERED Table
Hope this helps.
Quoting the Spark DataFrames, Datasets and SQL manual:
A handful of Hive optimizations are not yet included in Spark. Some of
these (such as indexes) are less important due to Spark SQL’s
in-memory computational model. Others are slotted for future releases
of Spark SQL.
Being new to Spark, I'm a bit baffled by this for two reasons:
Spark SQL is designed to process Big Data, and at least in my use
case the data size far exceeds the size of available memory.
Assuming this is not uncommon, what is meant by "Spark SQL’s
in-memory computational model"? Is Spark SQL recommended only for
cases where the data fits in memory?
Even assuming the data fits in memory, a full scan over a very large
dataset can take a long time. I read this argument against
indexing in in-memory database, but I was not convinced. The example
there discusses a scan of a 10,000,000 records table, but that's not
really big data. Scanning a table with billions of records can cause
simple queries of the "SELECT x WHERE y=z" type take forever instead
of returning immediately.
I understand that Indexes have disadvantages like slower INSERT/UPDATE, space requirements, etc. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.
I'm wondering then why the Spark SQL team considers indexes unimportant to a degree that it's off their road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?
Indexing input data
The fundamental reason why indexing over external data sources is not in the Spark scope is that Spark is not a data management system but a batch data processing engine. Since it doesn't own the data it is using it cannot reliably monitor changes and as a consequence cannot maintain indices.
If data source supports indexing it can be indirectly utilized by Spark through mechanisms like predicate pushdown.
Indexing Distributed Data Structures:
standard indexing techniques require persistent and well defined data distribution but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
high level data layout achieved by proper partitioning combined with columnar storage and compression can provide very efficient distributed access without an overhead of creating, storing and maintaining indices.This is a common pattern used by different in-memory columnar systems.
That being said some forms of indexed structures do exist in Spark ecosystem. Most notably Databricks provides Data Skipping Index on its platform.
Other projects, like Succinct (mostly inactive today) take different approach and use advanced compression techniques with with random access support.
Of course this raises a question - if you require an efficient random access why not use a system which is design as a database from the beginning. There many choices out there, including at least a few maintained by the Apache Foundation. At the same time Spark as a project evolves, and the quote you used might not fully reflect future Spark directions.
In general, the utility of indexes is questionable at best. Instead, data partitioning is more important. They are very different things, and just because your database of choice supports indexes doesn't mean they make sense given what Spark is trying to do. And it has nothing to do with "in memory".
So what is an index, anyway?
Back in the days when permanent storage was crazy expensive (instead of essentially free) relational database systems were all about minimizing usage of permanent storage. The relational model, by necessity, split a record into multiple parts -- normalized the data -- and stored them in different locations. To read a customer record, maybe you read a customer table, a customerType table, take a couple of entries out of an address table, etc. If you had a solution that required you to read the entire table to find what you want, this is very costly, because you have to scan so many tables.
But this is not the only way to do things. If you didn't need to have fixed-width columns, you can store the entire set of data in one place. Instead of doing a full-table scan on a bunch of tables, you only need to do it on a single table. And that's not as bad as you think it is, especially if you can partition your data.
40 years later, the laws of physics have changed. Hard drive random read/write speeds and linear read/write speeds have drastically diverged. You can basically do 350 head movements a second per disk. (A little more or less, but that's a good average number.) On the other hand, a single disk drive can read about 100 MB per second. What does that mean?
Do the math and think about it -- it means if you are reading less than 300KB per disk head move, you are throttling the throughput of your drive.
Seriouusly. Think about that a second.
The goal of an index is to allow you to move your disk head to the precise location on disk you want and just read that record -- say just the address record joined as part of your customer record. And I say, that's useless.
If I were designing an index based on modern physics, it would only need to get me within 100KB or so of the target piece of data (assuming my data had been laid out in large chunks -- but we're talking theory here anyway). Based on the numbers above, any more precision than that is just a waste.
Now go back to your normalized table design. Say a customer record is really split across 6 rows held in 5 tables. 6 total disk head movements (I'll assume the index is cached in memory, so no disk movement). That means I can read 1.8 MB of linear / de-normalized customer records and be just as efficient.
And what about customer history? Suppose I wanted to not just see what the customer looks like today -- imagine I want the complete history, or a subset of the history? Multiply everything above by 10 or 20 and you get the picture.
What would be better than an index would be data partitioning -- making sure all of the customer records end up in one partition. That way with a single disk head move, I can read the entire customer history. One disk head move.
Tell me again why you want indexes.
Indexes vs ___ ?
Don't get me wrong -- there is value in "pre-cooking" your searches. But the laws of physics suggest a better way to do it than traditional indexes. Instead of storing the customer record in exactly one location, and creating a pointer to it -- an index -- why not store the record in multiple locations?
Remember, disk space is essentially free. Instead of trying to minimize the amount of storage we use -- an outdated artifact of the relational model -- just use your disk as your search cache.
If you think someone wants to see customers listed both by geography and by sales rep, then make multiple copies of your customer records stored in a way that optimized those searches. Like I said, use the disk like your in memory cache. Instead of building your in-memory cache by drawing together disparate pieces of persistent data, build your persistent data to mirror your in-memory cache so all you have to do is read it. In fact don't even bother trying to store it in memory -- just read it straight from disk every time you need it.
If you think that sounds crazy, consider this -- if you cache it in memory you're probably going to cache it twice. It's likely your OS / drive controller uses main memory as cache. Don't bother caching the data because someone else is already!
But I digress...
Long story short, Spark absolutely does support the right kind of indexing -- the ability to create complicated derived data from raw data to make future uses more efficient. It just doesn't do it the way you want it to.
I'm currently using a 10% sample of a very large dataset (10 vars, over 300m rows) which amounts to over 200 GB of data when stored in .dta format for the full dataset. Stata is able to handle operations like egen, collapse, merging, etc in a reasonable amount of time for the 10% sample when using Stata-MP on a UNIX server with ~50G of RAM and multiple cores.
However, now I want to move on to analyzing the whole sample. Even if I use a machine that has enough RAM to hold the dataset, simply generating a variable takes ages. (I think perhaps the background operations are causing Stata to run into virtual mem)
The problem is also very amenable to parallelization, i.e., the rows in the dataset are independent of each other, so I can just as easily think about the one large dataset as 100 smaller datasets.
Does anybody have any suggestions for how to process/analyze this data or can give me feedback on some suggestions I currently have? I mostly use Stata/SAS/MATLAB so perhaps there are other approaches that I am simply unaware of.
Here are some of my current ideas:
Split the dataset up into smaller datasets and utilize informal parallel processing in Stata. I can run my cleaning/processing/analysis on each partition and then merge the results after without having the store all the intermediate parts.
Use SQL to store the data and also perform some of the data manipulation such as aggregating over certain values. One concern here is that some tasks that Stata can handle fairly easily such as comparing values across time won't work so well in SQL. Also, I'm already running into performance issues when running some queries in SQL on a 30% sample of the data. But perhaps I'm not optimizing by indexing correctly, etc. Also, Shard-Query seems like it could help with this but I have not researched it too thoroughly yet.
R also looks promising, but I'm not sure if it would solve the problem of working with this enormous amount of data.
Thanks to those who have commented and replied. I realized that my problem is similar to this thread. I have re-written some of my data manipulation code in Stata into SQL and the response time is much quicker. I believe I can make large optimization gains by correctly utilizing indexes and using parallel processing via partitions/shards if necessary. After all the data manipulation has been done, I can import that data via ODBC in Stata.
Since you are familiar with Stata there is a well documented FAQ about large data sets in Stata Dealing with Large Datasets: you might find this helpful.
I would clean via columns, splitting those up, running any specific cleaning routines and merge back in later.
Depending on your machine resources, you should be able to hold the individual columns in multiple temporary files using tempfile. Taking care to select only the variables or columns most relevant to your analysis should reduce the size of your set quite a lot.
I was wondering if anyone had any experience with what I am about to embark on. I have several csv files which are all around a GB or so in size and I need to load them into a an oracle database. While most of my work after loading will be read-only I will need to load updates from time to time. Basically I just need a good tool for loading several rows of data at a time up to my db.
Here is what I have found so far:
I could use SQL Loader t do a lot of the work
I could use Bulk-Insert commands
Some sort of batch insert.
Using prepared statement somehow might be a good idea. I guess I was wondering what everyone thinks is the fastest way to get this insert done. Any tips?
I would be very surprised if you could roll your own utility that will outperform SQL*Loader Direct Path Loads. Oracle built this utility for exactly this purpose - the likelihood of building something more efficient is practically nil. There is also the Parallel Direct Path Load, which allows you to have multiple direct path load processes running concurrently.
From the manual:
Instead of filling a bind array buffer
and passing it to the Oracle database
with a SQL INSERT statement, a direct
path load uses the direct path API to
pass the data to be loaded to the load
engine in the server. The load engine
builds a column array structure from
the data passed to it.
The direct path load engine uses the
column array structure to format
Oracle data blocks and build index
keys. The newly formatted database
blocks are written directly to the
database (multiple blocks per I/O
request using asynchronous writes if
the host platform supports
asynchronous I/O).
Internally, multiple buffers are used
for the formatted blocks. While one
buffer is being filled, one or more
buffers are being written if
asynchronous I/O is available on the
host platform. Overlapping computation
with I/O increases load performance.
There are cases where Direct Path Load cannot be used.
With that amount of data, you'd better be sure of your backing store - the dbf disks' free space.
sqlldr is script drive, very efficient, generally more efficient than a sql script.
The only thing I wonder about is the magnitude of the data. I personally would consider several to many sqlldr processes and assign each one a subset of data and let the processes run in parallel.
You said you wanted to load a few records at a time? That may take a lot longer than you think. Did you mean a few files at a time?
You may be able to create an external table on the CSV files and load them in by SELECTing from the external table into another table. Whether this method will be quicker not sure however might be quicker in terms of messing around getting sql*loader to work especially when you have a criteria for UPDATEs.