I want to store product images in a Google BigQuery database so that I can display these images in my reports.
Is there any way to achieve it?
You can store binary data in BigQuery by base64-encoding it and storing it as a string.
That said, make sure you consider the caveats in Incognito's answer. Storing images in BigQuery may not be the most efficient use of your query dollar, since you'll pay to query the entire column, which will likely contain a large amount of data and will therefore make your query much more expensive. You might consider Google Cloud Storage or Google Cloud Datastore as better alternatives for storing binary data for direct lookup.
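If you do go the base64 route, here is a minimal sketch of what that could look like with the Python client. The project, dataset, table and column names are placeholders, and the table is assumed to already exist with a STRING column for the encoded image.

```python
# Minimal sketch: base64-encode an image and store it in a BigQuery STRING column.
# Assumes a table "my-project.demo.product_images(sku STRING, image_b64 STRING)"
# already exists -- all names here are illustrative only.
import base64
from google.cloud import bigquery

client = bigquery.Client()

with open("product.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

# Stream a single row; for bulk loads, prefer a load job over streaming inserts.
errors = client.insert_rows_json(
    "my-project.demo.product_images",
    [{"sku": "SKU-123", "image_b64": image_b64}],
)
if errors:
    raise RuntimeError(errors)

# To display the image later, select the column and reverse the encoding:
# image_bytes = base64.b64decode(row["image_b64"])
```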
I don't think it is possible to store images in Google Storage in a way that lets them be processed by Google BigQuery.
Also, as per the official Google BigQuery documentation, BigQuery supports CSV and JSON data formats, and the supported data types include:
Data types
Your data can include strings, integers, floats, booleans,
nested/repeated records and timestamps
Also, since BigQuery is for analyzing huge datasets, why would you store anything other than the data you actually want to process and harness for gathering more information? You also pay for storage, and storing images obviously takes a lot of space on the server, so keeping them out of BigQuery would cut costs as well.
Related
To avoid questions about why we use Cassandra instead of another database: we have to, because our customer decided so. In my opinion, a completely wrong decision.
In our application we have to deal with PDF documents, i.e. read them and populate them with data.
So my intention was to hold the documents (templates) in the database, read them, and then do what we need to do with them.
I noticed that Cassandra provides a blob column type.
However, it seems to me that this type has nothing to do with a BLOB in Oracle or another relational database.
From what I understand, Cassandra is not meant for storing documents, and therefore it is not possible?
Or is the only way to make a byte array out of the document?
What is the intention of the blob column type?
The blob type in Cassandra is used to store raw bytes, so it could "theoretically" be used to store PDF files as well (as bytes). But there is one thing that should be taken into consideration: Cassandra doesn't work well with big payloads. The usual recommendation is to store tens or hundreds of kilobytes per value, and not more than 1 MB. With bigger payloads, operations such as repair or the addition/removal of nodes can lead to increased overhead and performance degradation. On older versions of Cassandra (2.x/3.0) I have seen situations where people couldn't add new nodes because the join operation failed. The situation is a bit better with newer versions, but it should still be evaluated before jumping into implementation. It's recommended to do performance testing plus some maintenance operations at scale to understand whether it will work for your load. NoSQLBench is a great tool for such things.
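To illustrate, here is a minimal sketch of reading a PDF into bytes and writing it to a blob column with the Python driver. The keyspace, table and file names are illustrative only, and the payload-size caveats above still apply.

```python
# Minimal sketch of writing a (small) PDF into a Cassandra blob column.
# Assumes a keyspace "docs" already exists; names are illustrative.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("docs")

session.execute("""
    CREATE TABLE IF NOT EXISTS templates (
        name text PRIMARY KEY,
        content blob
    )
""")

with open("template.pdf", "rb") as f:
    pdf_bytes = f.read()  # the driver maps Python bytes to the CQL blob type

session.execute(
    "INSERT INTO templates (name, content) VALUES (%s, %s)",
    ("invoice_template", pdf_bytes),
)

row = session.execute(
    "SELECT content FROM templates WHERE name = %s", ("invoice_template",)
).one()
restored = bytes(row.content)  # the original PDF bytes, ready to populate
```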
It is possible to store binary files in a CQL blob column; however, the general recommendation is to only store a small amount of data in blobs, preferably 1 MB or less for optimum performance.
For larger files, it is better to place them in an object store and only save the metadata in Cassandra.
Most large enterprises whose applications hold large amounts of media files (music, video, photos, etc.) typically store them in Amazon S3, Google Cloud Storage or Azure Blob Storage and then store the metadata (such as URLs) of the files in Cassandra. These enterprises are household names in streaming services and social media apps. Cheers!
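Here is a hedged sketch of that pattern: put the PDF in an object store (Google Cloud Storage in this example) and keep only the reference in Cassandra. The bucket, keyspace and table names are illustrative, and the metadata table is assumed to already exist.

```python
# Store the file in an object store, keep only metadata/URL in Cassandra.
import os
from cassandra.cluster import Cluster
from google.cloud import storage

local_path = "invoice_template.pdf"

# 1. Upload the file itself to the object store.
bucket = storage.Client().bucket("my-doc-bucket")
blob = bucket.blob("templates/invoice_template.pdf")
blob.upload_from_filename(local_path)

# 2. Store only the reference and a bit of metadata in Cassandra.
session = Cluster(["127.0.0.1"]).connect("docs")
session.execute(
    "INSERT INTO template_metadata (name, url, content_type, size_bytes) "
    "VALUES (%s, %s, %s, %s)",
    ("invoice_template", f"gs://my-doc-bucket/{blob.name}",
     "application/pdf", os.path.getsize(local_path)),
)
```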
We have data dumped into S3 buckets and we are using this data to pull some reports in QuickSight, some directly accessing S3 as a data source; for other reports, we use Athena to query S3.
At which point does one need to use Redshift? Is there any advantage of using Redshift over S3 + Athena?
No, you might be perfectly fine with just QuickSight, Athena and S3; it will also be relatively cheaper if you keep Redshift out of the equation. Athena is based on Presto and is pretty comprehensive in terms of functionality for most SQL reporting needs.
You would need Redshift if you approach or hit QuickSight's SPICE limits and would still like your reports to be snappy and load quickly. From a data engineering side, if you need to update existing records, it is easier to micro-batch and update records in Redshift. With Athena/S3 you also need to take care of optimising the storage format (use ORC/Parquet file formats, use partitions, avoid small files, etc.); it is not rocket science, but some people prefer paying for Redshift and not having to worry about that at all.
In the end, Redshift will probably scale better when your data grows very large (into the petabyte scale). However, my suggestion would be to keep using Athena and follow its best practices, and only use Redshift if you anticipate huge growth and want to be sure that you can scale the underlying engine on demand (and, of course, are willing to pay extra for it).
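To give a concrete flavour of the storage-format optimisation mentioned above, here is a hedged sketch using an Athena CTAS query (via boto3) to rewrite raw data as partitioned Parquet. The database, table and bucket names are assumptions for illustration only.

```python
# Rewrite a raw table as partitioned Parquet with an Athena CTAS query.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ctas = """
CREATE TABLE reports.events_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-reports-bucket/events_parquet/',
    partitioned_by = ARRAY['event_date']
) AS
SELECT user_id, event_type, event_date  -- partition column must come last
FROM reports.events_raw
"""

athena.start_query_execution(
    QueryString=ctas,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```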
I already have terabytes of data stored in BigQuery and I'm planning to perform heavy data transformations on it.
Considering COSTS and PERFORMANCE, what is the best approach you would suggest to perform these transformations for future usage of this data in BigQuery?
I'm considering a few options:
1. Read the raw data with Dataflow and then load the transformed data back into BigQuery?
2. Do it directly in BigQuery?
Any ideas about how to proceed with this?
I wrote down some of the most important things about performance; you can find considerations there regarding your question about using Dataflow.
Best practices considering performance:
Choosing file format:
BigQuery supports a wide variety of file formats for data ingestion. Some are going to be naturally faster than others. When optimizing for load speed, prefer the Avro file format, which is a binary, row-based format that can be split and then read in parallel by multiple workers (a minimal load sketch follows the list below).
Loading data from compressed files, specifically CSV and JSON, is going to be slower than loading data in other formats. The reason is that gzip compression is not splittable: BigQuery has to take that file, load it onto a single slot, decompress it, and only then parallelize the load.
**FASTER**
Avro (Compressed)
Avro (Uncompressed)
Parquet/ORC
CSV
JSON
CSV (Compressed)
JSON (Compressed)
**SLOWER**
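As an illustration of the recommendation above, here is a minimal sketch of loading Avro files from GCS into BigQuery with the Python client; the URIs and table name are placeholders.

```python
# Load Avro files from GCS into a BigQuery table (names are illustrative).
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/events-*.avro",
    "my-project.analytics.events",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```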
ELT / ETL:
After loading data into BQ, you can think about transformations (ELT or ETL). So in general, you want to prefer ELT over ETL where possible. BQ is very scalable and can handle large transformations on a ton of data. ELT is also quite a bit simpler, because you could just write some SQL queries, transform some data and then move data around between tables, and not have to worry about managing a separate ETL application.
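For example, an ELT transformation can be just a SQL statement that BigQuery runs for you. A minimal sketch, with illustrative dataset and table names:

```python
# Run the transformation as SQL inside BigQuery and write the result to a table.
from google.cloud import bigquery

client = bigquery.Client()

elt_sql = """
CREATE OR REPLACE TABLE analytics.orders_clean AS
SELECT
  order_id,
  LOWER(TRIM(customer_email)) AS customer_email,
  SAFE_CAST(amount AS NUMERIC) AS amount
FROM analytics.orders_raw
WHERE order_id IS NOT NULL
"""

client.query(elt_sql).result()  # BigQuery does the heavy lifting; no separate ETL app
```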
Raw and staging tables:
Once you have started loading data into BQ, in general, within your warehouse, you're going to want to leverage raw and staging tables before publishing to reporting tables. The raw table essentially contains the full daily extract, or a full load of the data you're loading. The staging table then is basically your change data capture table, so you can utilize queries or DML to merge that data into your staging table and keep a full history of all the data that's being inserted. And then finally your reporting tables are the data that you publish out to your users.
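As a sketch of that raw-to-staging flow, a MERGE statement (DML) can fold the daily raw extract into the staging table; the table and column names here are assumptions for illustration.

```python
# Merge the daily raw extract into the staging table (change data capture).
from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE analytics.customers_staging AS s
USING analytics.customers_raw AS r
ON s.customer_id = r.customer_id
WHEN MATCHED THEN
  UPDATE SET s.email = r.email, s.updated_at = r.extracted_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, updated_at)
  VALUES (r.customer_id, r.email, r.extracted_at)
"""

client.query(merge_sql).result()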
Speeding up pipelines using DataFlow:
When you're getting into streaming loads or really complex batch loads (that don't really fit into SQL cleanly), you can leverage Dataflow or Data Fusion to speed up those pipelines and do more complex activities on that data. And if you're starting with streaming, I recommend using the Dataflow templates that Google provides for loading data from multiple different places and moving data around. You can find those templates in the Dataflow UI, under the Create Job from Template button.
And if you find that a template mostly fits your use case but you want to make one slight modification, all those templates are also open sourced (so you can go to the repo and modify the code to fit your needs).
Partitioning:
Partitioning in BQ physically splits your data on disk, based on ingestion time or on a column within your data, so you can efficiently query only the parts of the table you need. This provides huge cost and performance benefits, especially on large fact tables. Whenever you have a fact table or temporal table, utilize a partition column on your date dimension (see the sketch after the clustering section below).
Cluster Frequently Accessed Fields:
Clustering allows you to physically order data within a partition. You can cluster by one or multiple keys. This provides massive performance benefits when used properly.
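Putting the partitioning and clustering advice together, here is a small sketch (with illustrative names) of creating a fact table that is partitioned by a date column and clustered on frequently filtered fields.

```python
# Create a fact table partitioned by date and clustered on common filter keys.
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.analytics.sales_fact",
    schema=[
        bigquery.SchemaField("sale_date", "DATE"),
        bigquery.SchemaField("store_id", "STRING"),
        bigquery.SchemaField("product_id", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="sale_date"
)
table.clustering_fields = ["store_id", "product_id"]

client.create_table(table)
```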
BQ reservations:
It allows you to create reservations of slots and assign projects to those reservations, so you can allocate more or fewer resources to certain types of queries.
Best practices for saving costs can be found in the official documentation.
I hope it helps you.
According to this Google Cloud documentation, the following questions should be considered to choose between Dataflow and BigQuery for ETL.
Although the data is small and can quickly be uploaded by using the BigQuery UI, for the purpose of this tutorial you can also use Dataflow for ETL. Use Dataflow for ETL into BigQuery instead of the BigQuery UI when you are performing massive joins, that is, from around 500-5000 columns of more than 10 TB of data, with the following goals:
You want to clean or transform your data as it's loaded into BigQuery, instead of storing it and joining afterwards. As a result, this approach also has lower storage requirements because data is only stored in BigQuery in its joined and transformed state.
You plan to do custom data cleansing (which cannot be simply achieved with SQL).
You plan to combine the data with data outside of the OLTP, such as logs or remotely accessed data, during the loading process.
You plan to automate testing and deployment of data-loading logic using continuous integration or continuous deployment (CI/CD).
You anticipate gradual iteration, enhancement, and improvement of the ETL process over time.
You plan to add data incrementally, as opposed to performing a one-time ETL.
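For a flavour of what the Dataflow route looks like, here is a hedged sketch of a tiny Apache Beam pipeline that reads raw JSON lines from GCS, applies custom cleansing in Python (the case the documentation describes), and writes to BigQuery. The bucket, table and field names are illustrative.

```python
# Minimal Apache Beam ETL sketch: GCS -> custom Python cleansing -> BigQuery.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def clean(line):
    """Parse a raw JSON line and keep only the fields the target table expects."""
    record = json.loads(line)
    return {
        "user_id": record.get("user_id"),
        "email": (record.get("email") or "").strip().lower(),
    }


# Add --runner=DataflowRunner, project, region, etc. to run this on Dataflow.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/*.json")
        | "Clean" >> beam.Map(clean)
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:analytics.users_clean",
            schema="user_id:STRING,email:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```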
BigQuery (BQ) has its own storage system which is completely separate from Google Cloud Storage (GCS).
My question is: why doesn't BQ directly process data stored on GCS, the way Hadoop Hive does? What is the benefit and necessity of this design?
That is because BigQuery uses a column-oriented storage system and has background processes that constantly check whether the data is stored in the optimal way. Therefore, the data is managed by BigQuery (that's why it has its own storage) and it only exposes the highest layer to the user.
See this article for more details:
When you load bits into BigQuery, the service takes on the full
responsibility of managing that data, and only exposing the logical
database primitives to you
BigQuery gains several benefits from having its own separate storage.
For one, BigQuery is able to constantly optimize the storage of its data by moving and reordering it on the disks it is stored on, and by adding more disks and repeating the process as the database grows larger and larger.
BigQuery also utilizes a separate compute layer to query the storage layer, allowing the storage layer to scale while requiring less overall hardware to run the queries. This gives BigQuery the ability to call on more processing power as it needs it, but not have idle hardware when queries from a specific database are not being executed.
For a more in-depth explanation of BigQuery's structure and optimizations, you can check out this article I wrote for The Data School.
Inserting large amounts of test data into BigQuery can be slow, especially if the exact details of the data aren't important and you just want to test the performance of a particular shape of query/data.
What's the best way to achieve this without waiting around for many GB of data to upload to GCS?
In general, I'd recommend testing over small amounts of data (to save money and time).
If you really need large amounts of test data, there are several options.
If you care about the exact structure of the data:
You can upload data to GCS in parallel (if a slow single transfer is the bottleneck).
You could create a short-lived Compute Engine VM and use it to insert test data into GCS (which is likely to provide higher throughput than over your local link). This is somewhat involved, but gives you a very fast path for inserting data generated on-the-fly by a script.
If you just want to try out the capabilities of the platform, there are a number of public datasets available for experimentation. See:
https://cloud.google.com/bigquery/docs/sample-tables
If you just need a large amount of data and duplicate rows are acceptable:
You can insert a moderate amount of data via upload to GCS. Then duplicate it by querying the table and appending the result to the original. You can also use the bq command line tool with copy and the --append flag to achieve a similar result without being charged for a query.
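As a sketch of that duplicate-and-append idea using the Python client rather than the bq CLI (the table name is a placeholder), each copy-with-append doubles the row count without incurring query charges:

```python
# Repeatedly copy a table onto itself with WRITE_APPEND to multiply test data.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.testing.big_fake_table"

job_config = bigquery.CopyJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND
)

for _ in range(3):  # 3 append copies -> roughly 8x the original rows
    client.copy_table(table_id, table_id, job_config=job_config).result()
```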
This method has a bit of a caveat -- to get performance similar to typical production usage, you'll want to load your data in reasonably large chunks. For a 400GB use case, I'd consider starting with 250MB - 1GB of data in a single import. Many tiny insert operations will slow things down (and are better handled via the streaming API, which does the appropriate batching for you).