I'm creating a Data Warehouse and I'm using Pentaho Data Integration for the ETL.
I have 2 options:
Each dimension ETL in its own transformation (I have to open all the transformation files to run my dimensions), I thought this might be practical if I want to run one dimension only for example.
Gather all the dimensions in one transformation. (Is this considered like creating a new job in PDI that includes all the transformation? If it is, what is the best way to relate the transformations?)
I'm new to all of this, by the way.
Any suggestions?
Thank you.
Related
We have large volumes (10 to 400 billion) of raw data in BigQuery tables. We have a requirement to process this data to convert and create the data in the form of star schema tables (probably a different dataset in bigquery) which can then be accessed by atscale.
Need pros and cons between two options below:
1. Write complex SQL within BigQuery which reads data form source dataset and then loads to target dataset (used by Atscale).
2. Use PySpark or MapReduce with BigQuery connectors from Dataproc and then load the data to BigQuery target dataset.
The complexity of our transformations involve joining multiple tables at different granularity, using analytics functions to get the required information, etc.
Presently this logic is implemented in vertica using multiple temp tables for faster processing and we want to re-write this processing logic in GCP (Big Query or Data Proc)
I went successfully with option 1: Big Query is very capable to run the very complex transformation with SQL, on top of that you can also run them incrementally with time range decorators. Note that it takes a lot of time and resources to take data back and forth to BigQuery. When running BigQuery SQL data never leaves BigQuery in the first place and you already have all raw logs there. So as long your problem can be solved by a series of SQL I believe this is the best way to go.
We moved out Vertica reporting cluster, rewriting successfully ETL last year, with option 1.
Around a year ago, I've written POC comparing DataFlow and series of BigQuery SQL jobs orchestrated by potens.io workflow allowing SQL parallelization at scale.
I took a good month to write DataFlow in Java with 200+ data points and complex transformation with terrible debugging capability at a time.
And a week to do the same using a series of SQL with potens.io utilizing
Cloud Function for Windowed Tables and parallelization with clustering transient tables.
I know there's been bunch improvement in CloudDataFlow since then, but at a time
the DataFlow did fine only at a million scale and never-completed at billions record input (main reason shuffle cardinality went little under billions of records, with each records having 200+ columns). And the SQL approach produced all required aggregation under 2 hours for a dozen billion. Debugging and easiest of troubleshooting with potens.io helped a lot too.
Both BigQuery and DataProc can handle huge amounts of complex data.
I think that you should consider two points:
Which transformation would you like to do in your data?
Both tools can make complex transformations but you have to consider that PySpark will provide you a full programming language processing capability while BigQuery will provide you SQL transformations and some scripting structures. If only SQL and simple scripting structures can handle your problem, BigQuery is an option. If you need some complex scripts to transform your data or if you think you'll need to build some extra features involving transformations in the future, PySpark may be a better option. You can find the BigQuery scripting reference here
Pricing
BigQuery and DataProc have different pricing systems. While in BigQuery you'd need to concern about how much data you would process in your queries, in DataProc you have to concern about your cluster's size and VM's configuration, how much time your cluster would be running and some other configurations. You can find the pricing reference for BigQuery here and for DataProc here. Also, you can simulate the pricing in the Google Cloud Platform Pricing Calculator
I suggest that you create a simple POC for your project in both tools to see which one has the best cost benefit for you.
I hope these information help you.
I already have terabytes of data stored on BigQuery and I'm wondering to perform heavy data transformations on it.
Considering COSTS and PERFORMANCE, what the best approach you guys would suggest to perform these transformations for future usage of these data on BigQuery?
I'm considering a few options:
1. Read raw data from DataFlow and then load the transformed data back into BigQuery?
2. Do it directly from BigQuery?
Any ideas about how to proceed with this?
I wrote down some most important things about performance, you can find there consideration regarding your question about using DataFlow.
Best practices considering performance:
Choosing file format:
BigQuery supports a wide variety of file formats for data ingestion. Some are going to be naturally faster than others. When optimizing for load speed, prefer using the AVRO file format, which is binary, row-based format and enables to split it and then read it in parallel with multiple workers.
Loading data from compressed files, specifically CSV and JSON, is going to be slower than loading data in a other format. And the reason being is because, since the compression of Gzip is not splitable, there is a need to take that file, load it onto a slot within BQ, and then do the decompression, and then finally parallelize the load afterwards.
**FASTER**
Avro(Compressed)
Avro(Uncompressed)
Parquet/ORC
CSV
JSON
CSV (Compressed)
JSON(Compressed
**SLOWER**
ELT / ETL:
After loading data into BQ, you can think about transformations (ELT or ETL). So in general, you want to prefer ELT over ETL where possible. BQ is very scalable and can handle large transformations on a ton of data. ELT is also quite a bit simpler, because you could just write some SQL queries, transform some data and then move data around between tables, and not have to worry about managing a separate ETL application.
Raw and staging tables:
Once, you have started loading data into BQ, in general, within your warehouse, you're going to want to leverage raw and staging tables before publishing to reporting tables. The raw table essentially contains the full daily extract, or a full load of the data that they're loading. Staging table then is basically your change data capture table, so you can utilize queries or DML to marge that data into your staging table and have a full history of all the data that's being inserted. And then finally your reporting tables are going to be the ingest that you publish out to your users.
Speeding up pipelines using DataFlow:
When you're getting into streaming loads really complex batch loads (that doesn't really fit into SQL cleanly), you can leverage DataFlow or DataFusion to speed up those pipelines, and do more complex activities on that data. And if you're starting with streaming, I recommend using the DataFlow templates - Google provides it for loading data from multiple different places and moving data around. You can find those templates in DataFlow UI, within Create Job from Template button, you'll find all these templates.
And if you find that it mostly fits your use case, but want to make one slight modification, all those templates are also open sourced (so you can go to repo, modify the code to fit your needs).
Partitioning:
Partition in BQ physically split your data on disk, based on ingestion time or based on a column within your data. Efficiently query over the parts of the table you want. This provides huge cost and performance benefits, especially on large fact tables. Whenever you have a fact table or temporal table, utilize a partition column on your date dimension.
Cluster Frequently Accessed Fields:
Clustering allows you to physically order data within a partition. So you can do Clustering by one or multiple keys. This provide massive performance benefits when used properly.
BQ reservations:
It allows to create reservations of slots, assign project to those reservations, so you can allocate more or less resources to certain types of queries.
Best practices considering saving costs you can find in official documentation.
I hope it helps you.
According to this Google Cloud Documentation, the following questions should be done to choose between DataFlow or BigQuery tool for ELT.
Although the data is small and can quickly be uploaded by using the BigQuery UI, for the purpose of this tutorial you can also use Dataflow for ETL. Use Dataflow for ETL into BigQuery instead of the BigQuery UI when you are performing massive joins, that is, from around 500-5000 columns of more than 10 TB of data, with the following goals:
You want to clean or transform your data as it's loaded into BigQuery, instead of storing it and joining afterwards. As a result,
this approach also has lower storage requirements because data is only
stored in BigQuery in its joined and transformed state.
You plan to do custom data cleansing (which cannot be simply achieved with SQL).
You plan to combine the data with data outside of the OLTP, such as logs or remotely accessed data, during the loading process.
You plan to automate testing and deployment of data-loading logic using continuous integration or continuous deployment (CI/CD).
You anticipate gradual iteration, enhancement, and improvement of the ETL process over time.
You plan to add data incrementally, as opposed to performing a one-time ETL.
I was wondering if there's any software/tool out there which allows one to design/model a cube and have the software generate the code (at least the framework) necessary for updating dimensions and facts in a relational database?
Sorry if this is a dumb question. But, we have tools like Cognos TM1 which I can design a cube and associated ETL processes to bring data in and store in the TM1 cubes. However, I'm looking for something similar to store/maintain the data in the ROLAP star schema database.
Thanks!
You can use the Turbo Integrator to maintain the cube/dimension structure. See http://www.bedrocktm1.org/ for some help regarding loading and extracting data with TI processes and chores.
I haved used Cubeware Importer and SSIS as ETL tool in the past to load and save the parent child of TM1 projects. Everything from and to relational databases/SAP BW/flatfiles will work to store dimension tables and cube meta data.
I'm trying to figure it if one can build an SSAS cube quickly for prototyping from just one huge and wide table without doing any ETL and custom SQL. Is it even possible?
What we are trying to do, we have a bunch of these tables for different subject areas which were denormalized and a lot of efforts were put to create them and test them. We need a quick way to access this data now and run analytical queries but before we spend time on ETL/dimensional design, we wanted to build a quick cube.
Please do not suggest PowerPivot or any other in-memory tools - these tables are really big and we have very limited RAM at our disposal,
Yes, it's possible. Simply use the same table for creating both dimensions and cubes (measure groups). It's not ideal to do it like this for production, but you should be fine for prototyping.
Another alternative I always use in situations like this, create SQL views on top of the wide table to mimic the dimension and facts (dimensional model). And use views in the data source view. If you've time you can spend on creating the views, this is the best method. Because at the end of the prototype you know the model and functionality is working, and you just need to create physical data warehouse and ETL when you're ready to implement in production.
Is it possible to create a large cube based on existing cube data?
We'd like to test the performance of certain tools in combination with SSAS and currently do not have any cubes large enough.
e.g. We have a year's worth of data and want to expand it to be 10 year's worth.
Mostly I have created my own scripts for growing test data.
I have used Adventure Works as a base for names, address etc, also I have used Red Gate's data generator (was working at a place that had the full Red Gate product suite, you can download an evaluation copy to test it out).
Might be worth writing your own scripts. Then you can tweak the generation scripts to generate additional versions for testing.
To increase the size of your data you need to write custom scripts to copy it. There is no automatic way to "grow data" in SQL.