BigQuery best approach for ETL (external tables and views vs Dataflow) - google-bigquery

CSV files get uploaded to some FTP server (for which I don't have SSH access) in a daily basis and I need to generate weekly data that merges those files with transformations. That data would go into a history table in BQ and a CSV file in GCS.
My approach goes as follows:
Create a Linux VM and set a cron job that syncs the files from the
FTP server with a GCS bucket (I'm using GCSFS)
Use an external table in BQ for each category of CSV files
Create views with complex queries that transform the data
Use another cron job to create a table with the historic data and also the CSV file on a weekly basis.
My idea is to remove as much middle processes as I can and to make the implementation as easy as possible, including dataflow for ETL, but I have some questions first:
What's the problem with my approach in terms of efficiency and money?
Is there anything DataFlow can provide that my approach can't?
any ideas about other approaches?
BTW, I ran into one problem that might be fixable by parsing the csv files myself rather than using external tables, which is invalid characters, like the null char, so I can get rid of them, while as an external table there is a parsing error.

Probably your ETL will be simplified by Google DataFlow Pipeline batch execution job. Upload your files to the GCS bucket. For transforming use pipeline transformation to strip null values and invalid character (or whatever your need is). On those transformed dataset use your complex queries like grouping it by key, aggregating it (sum or combine) and also if you need side inputs data-flow provides ability to merge other data-sets into the current the data-set too. Finally the transformed output can written to BQ or you can write your own custom implementation for writing those results.
So the data-flow gives you very high flexibility to your solution, you can branch the pipeline and work differently on each branch with same data-set. And regarding the cost, if you run your batch job with three workers, which is the default that should not be very costly, but again if you just want to concentrate on your business logic and not worry about the rest, google data-flow is pretty interesting and its very powerful if used wisely.
Data-flow helps you to keep everything on a single plate and manage them effectively. Go through its pricing and determine if it could be the best fit for you (your problem is completely solvable with google data-flow), Your approach is not bad but needs extra maintenance with those pieces.
Hope this helps.

here are a few thoughts.
If you are working with a very low volume of data then your approach may work just fine. If you are working with more data and need several VMs, dataflow can automatically scale up and down the number of workers your pipeline uses to help it run more efficiently and save costs.
Also, is your linux VM always running? Or does it only spin up when you run your cron job? A batch Dataflow job only runs when it needed, which also helps to save on costs.
In Dataflow you could use TextIO to read each line of the file in, and add your custom parsing logic.
You mention that you have a cron job which puts the files into GCS. Dataflow can read from GCS, so it would probably be simplest to keep that process around and have your dataflow job read from GCS. Otherwise you would need to write a custom source to read from your FTP server.
Here are some useful links:
https://cloud.google.com/dataflow/service/dataflow-service-desc#autoscaling

Related

What to use to serve as an intermediary data source in ETL job?

I am creating an ETL pipeline that uses variety of sources and sends the data to Big Query. Talend cannot handle both relational and non relational database components in one job for my use case so here's how i am doing it currently:
JOB 1 --Get data from a source(SQL Server, API etc), transform it and store transformed data in a delimited file(text or csv)
JOB 1 -- Use the stored transformed data from delimited file in JOB 1 as source and then transform it according to big query and send it.
I am using delimited text file/csv as intermediary data storage to achieve this.Since confidentiality of data is important and solution also needs to be scalable to handle millions of rows, what should i use as this intermediary source. Will a relational database help? or delimited files are good enough? or anything else i can use?
PS- I am deleting these files as soon as the job finishes but worried about security till the time job runs, although will run on safe cloud architecture.
Please share your views on this.
In Data Warehousing architecture, it's usually a good practice to have the staging layer to be persistent. This gives you among other things, the ability to trace the data lineage back to source, enable to reload your final model from the staging point when business rules change as well as give a full picture about the transformation steps the data went through from all the way from landing to reporting.
I'd also consider changing your design and have the staging layer persistent under its own dataset in BigQuery rather than just deleting the files after processing.
Since this is just a operational layer for ETL/ELT and not end-user reports, you will be paying only for storage for the most part.
Now, going back to your question and considering your current design, you could create a bucket in Google Cloud Storage and keep your transformation files there. It offers all the security and encryption you need and you have full control over permissions. Big Query works seemingly with Cloud Storage and you can even load a table from a Storage file straight from the Cloud Console.
All things considered, whatever the direction you chose I recommend to store the files you're using to load the table rather than deleting them. Sooner or later there will be questions/failures in your final report and you'll likely need to trace back to the source for investigation.
In a nutshell. The process would be.
|---Extract and Transform---|----Load----|
Source ---> Cloud Storage --> BigQuery
I would do ELT instead of ETL: load the source data as-is and transform in Bigquery using SQL functions.
This allows potentially to reshape data (convert to arrays), filter out columns/rows and perform transform in one single SQL.

How can I load data from BigQuery to Spanner?

I'd like to run a daily job that does some aggregations based on a BigQuery setup. The output is a single table that I write back to BigQuery that is ~80GB over ~900M rows. I'd like to make this dataset available to an online querying usage pattern rather than for analysis.
Querying the data would always be done on specific slices that should be easy to segment by primary or secondary keys. I think Spanner is possibly a good option here in terms of query performance and sharding, but I'm having trouble working out how to load that volume of data into it on a regular basis, and how to handle "switchover" between uploads because it doesn't support table renaming.
Is there a way to perform this sort of bulk loading programatically? We already are using Apache Airflow internally for similar data processing and transfer tasks, so if it's possible to handle it in there that would be even better.
You can use Cloud Dataflow.
In your pipeline, you could read from BigQuery and write to Cloud Spanner.

CAN we run ETL jobs on AWS EFS

I would like to know if we can run ETL jobs on EFS mount files..
if so how? is it using Hive or anyother service?
Our target is to reduce all the files in one mount point to one file...and store that one file in s3 for better processing
EFS in itself does not inherently have a particular data warehouse product included. For data warehousing and ETL you can choose what you want to use that operates in the AWS environment.
On to your problem:
You want to concatenate or in some way combine all of the files currently in your EFS mount into a single file and store that in S3, if I understand it correctly.
You do not mention what type of data you have or what type of files you want to combine. That makes a huge difference in how you would do this. So I will have to give general suggestions. If you have different types of data, SQL tables from different databases, documents, non-sql data; then you need to determine how to combine that data. For that you would be looking at a data integration solution that can accommodate raw data.
Amazon has a few different products that may assist the process such as Redshift, Athena, Snowflake and their ETL solution Glue. Adding products depends on your company's needs and budget.
So, a more flexible data integration approach would be to use ELT (extract, load, transform) instead of ETL. Basically you would create an appropriate file over on your S3 instance. Then you would extract each file on EFS one at a time and load them into your S3 file. Then when you query the data in your S3 file you would perform any transformations needed before seeing the query results. Here's an article that explains the differences in more detail: https://blog.panoply.io/etl-vs-elt-the-difference-is-in-the-how.
There are some vendors supporting the ELT process such as Talend, Hadoop/Hive/Spark, Terradata and Informatica should you want to investigate options.

Beam - Handling failures during huge data load for bigquery

I have recently started with Apache beam. I am sure I am missing something here. I have a requirement to load from a very huge database to bigquery. These tables are huge. I have written sample beam jobs to load minimal rows from simple tables.
How would I able to load n number of rows from tables using JDBCIO? Is there anyway that i can load these data in batches as we do in conventional data migration jobs.?
Can I do batch read from a database and write in batches to bigquery?
Also i have seen that, the suggested approach to load the data to bigquery is by adding the files to the data store buckets. But, in automated environment, the requirement is to write it as a dataflow job to load from db and write it to bigquery. What should my design approach to solve this issue using apache beam?
Please help.!
It looks[1] like BigQueryIO will write batches of data if it comes from a bounded PCollection (otherwise it uses streaming inserts). It also appears to bound the size of each file and batch, so I don't think you'll need to do any manual batching.
I'd just read from your database via JDBCIO, transform it if needed, and write it to BigQueryIO.
[1] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java

Automatic Hive or Cascading for ETL in AWS-EMR

I have a large dataset residing in AWS S3. This data is typically a transactional data (like calling records). I run a sequence of Hive queries to continuously run aggregate and filtering condtions to produce a couple of final compact files (csvs with millions of rows at max).
So far with Hive, I had to manually run one query after another (as sometimes some queries do fail due to some problems in AWS or etc).
I have so far processed 2 months of data so far using manual means.
But for subsequent months, I want to be able to write some workflow which will execute the queries one by one, and if should a query fail , it will rerun it again. This CANT be done by running hive queries in bash.sh file (my current approach at least).
hive -f s3://mybucket/createAndPopulateTableA.sql
hive -f s3://mybucket/createAndPopulateTableB.sql ( this might need Table A to be populated before executing).
Alternatively, I have been looking at Cascading wondering whether it might be the solution to my problem and it does have Lingual, which might fit the case. Not sure though, how it fits into the AWS ecosystem.
The best solution, is if there is some hive query workflow process, it would be optimal. Else what other options do I have in the hadoop ecosystem ?
Edited:
I am looking at Oozie now, though facing a sh!tload of issues setting up in emr. :(
You can use AWS Data Pipeline:
AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available
You can configure it to do or retry some actions when a script fails, and it support Hive scripts : http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-hiveactivity.html