Best practice approaches to ETL on Bigquery? - google-bigquery

I'm wondering what best practices and tools people have found for building and managing ETL jobs on BigQuery.
At the moment I have lots of SQL 'templates' (horribly parameterized by lob, date, etc. using sed-style string replacements into a tmp.sql file and then running that), and I use the command line tool to run sequences of them and send the output to tables. It works fine but is getting a bit unwieldy. I still don't get why I can't run stored-procedure-style parameterized scripts on BigQuery, or even use some sort of GUI to build and manage pipelines.
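(For concreteness, here's roughly the kind of job I run today, but sketched with query parameters via the google-cloud-bigquery Python client instead of sed; the project, table, column and parameter names are made up.)

    # Sketch only: the sed-templated report as a parameterized query.
    # Project/dataset/table/column names here are made up.
    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
        SELECT lob, DATE(created_at) AS day, SUM(revenue) AS revenue
        FROM `my-project.mydataset.orders`
        WHERE lob = @lob
          AND DATE(created_at) = @run_date
        GROUP BY lob, day
    """

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("lob", "STRING", "retail"),
            bigquery.ScalarQueryParameter("run_date", "DATE", "2017-01-31"),
        ],
        destination="my-project.mydataset.daily_revenue",  # send output to a table
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )

    client.query(sql, job_config=job_config).result()  # wait for the job to finish

That covers the parameterization side, but chaining dozens of these together is still the messy part.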
I love BigQuery but really feel like I'm either missing something very obvious here or it's a real gap in the product (e.g. I'm pretty sure Apache Drill is more built out in this regard).
So I'm just wondering if anyone can share any best practice ETL tips or approaches you use yourself.
I do also use Xplenty for some jobs, which is good, but it's also a bit messy in that I can't just write SQL in it, so it can be painful to build and debug complicated pipelines.
I was thinking about looking into Talend as well, but really parameterized stored procedures, macros and SQL are all I'd ideally need.
Sorry if this is more of a discussion question than specific code. Happy to move it to Reddit or something if it's more suited there.

Google Cloud Dataflow is closer to your needs than BigQuery alone, in my opinion. We use it for real-time streaming ETL with automatic scaling. It works great, though you will need to write Java.
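(Beam also ships a Python SDK these days, so Java isn't strictly required; here's a minimal batch read/filter/write sketch, with the bucket, project, dataset and table names made up.)

    # Sketch only: a minimal Apache Beam pipeline (runnable on Dataflow or locally).
    # Bucket/project/dataset/table names and the filter are made up for illustration.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # temp_location (a made-up bucket here) is needed for the BigQuery export/load steps.
    options = PipelineOptions(temp_location="gs://my-bucket/tmp")

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromBigQuery(
                query="SELECT lob, revenue FROM `my-project.mydataset.orders`",
                use_standard_sql=True,
            )
            | "OnlyRetail" >> beam.Filter(lambda row: row["lob"] == "retail")
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:mydataset.orders_retail",
                schema="lob:STRING,revenue:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            )
        )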

Related

BigQuery Testing, Debugging, and Design Patterns

We use BigQuery as the main data warehouse in our company.
We have gotten very efficient with SQL syntax and we write multi-page SQL queries with valid syntax to analyze our data.
The main problem we are struggling with is terrible logic mistakes in our queries. For example, it could be that a > should have been a >=, or that a join was treating NULL values the wrong way.
The effect is that we are getting wrong data out of BigQuery.
The logic within our data structure is so complicated ("what again was the definition of Customer Type ABC?") that it's terribly difficult to actually pull out anything usable. We estimate that up to 50% of the analytics we pull out of BigQuery are plain wrong.
Of course this is a problem that significantly hurts our bottom line and leads to wrong business decisions. It has gotten so bad that we are craving a normalized database structure that could at least be comprehended more easily.
My hope is that maybe we are just missing certain design patterns to properly use BigQuery. However, I find zero guidance about this online. The SQL we are using is so complex that I'm starting to think that although the syntax is correct, SQL was not made for this. What we are doing feels like fitting a complex program into a single function, which in turn becomes untestable and a nightmare to work with.
I would appreciate any input and guidance.
I can empathize here. I don't think your issue is unique, and there isn't one best practice. I can tell you what we have done to help with these same issues.
We are a small team of analysts, and only have a couple TB of data to crunch daily so your mileage will vary with these tips depending on your situation.
We use DBT - https://www.getdbt.com/. It has a free command line version, or you can pay for DBT Cloud if you aren't confident with command line tools. It will help you go from pages-long SQL queries to smaller, digestible chunks that are easier to maintain.
It helps with 3 main use cases for us.
Database normalization/summarization - you can easily write queries, have them depend on each other, and have them scheduled to run at a certain time, while it does a lot of the more complex data engineering tasks for you, such as making sure things run in the right order and that no circular references exist. This part of the tool helped us migrate away from pages-long SQL queries to smaller, digestible chunks that are useful in multiple applications.
Documentation: there is a documentation site built in, so you can document a column and write out the definition of 'customer' easily.
Testing: we write loads of tests. We have a 100% accepted answer for certain metrics. Any time we need to reference such a metric in other queries, or transform data to slice that metric by other dimensions, we write a test to make sure the new transformation matches back to the 100% accepted answer.
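To make that last point concrete, here is the same "reconcile against the accepted answer" idea sketched outside of dbt with the BigQuery Python client (the dataset, table and column names are made up; in dbt itself this would live in a test):

    # Sketch only: assert that a new rollup still matches the accepted metric.
    # Dataset/table/column names are made up.
    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
        WITH accepted AS (
          SELECT SUM(revenue) AS total FROM `mydataset.accepted_daily_revenue`
        ),
        candidate AS (
          SELECT SUM(revenue) AS total FROM `mydataset.revenue_by_customer_type`
        )
        SELECT accepted.total AS accepted_total, candidate.total AS candidate_total
        FROM accepted CROSS JOIN candidate
    """

    row = list(client.query(sql).result())[0]
    assert abs(row.accepted_total - row.candidate_total) < 0.01, (
        f"Rollup drifted: {row.candidate_total} vs accepted {row.accepted_total}"
    )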
We have explored DBT, but unfortunately we didn't have the bandwidth to support it at the company level. As an alternative we use Airflow to build and maintain datasets in BigQuery. We use the BigQuery operators to interface with BQ through Airflow. This helps us in the following ways:
Ability to build custom operators that can help with organizational-level bells and whistles (integration with internal systems, data lifecycle management, lineage management, etc.)
Ability to break down complex pieces of SQL into smaller manageable blocks that can be reused
Ability to incorporate testing in the process. You can build testing into your pipeline DAG or can build out separate DAGs of tests that can monitor your datasets and send out reports.
Ability to replay and recreate datasets
Ability to easily manage schema changes
I am sure there are other use cases where Airflow helps, but these are some of the things that come to mind.
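For a flavour of what that looks like in practice, here is a minimal sketch of such a DAG using the BigQuery operators (the DAG id, SQL and table names are made up, and the imports assume the apache-airflow-providers-google package):

    # Sketch only: one transform step plus a data-quality check.
    # DAG id, SQL and table names are made up; assumes apache-airflow-providers-google.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import (
        BigQueryCheckOperator,
        BigQueryInsertJobOperator,
    )

    with DAG(
        dag_id="daily_revenue",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        build_daily_revenue = BigQueryInsertJobOperator(
            task_id="build_daily_revenue",
            configuration={
                "query": {
                    "query": """
                        SELECT lob, DATE(created_at) AS day, SUM(revenue) AS revenue
                        FROM `mydataset.orders`
                        WHERE DATE(created_at) = '{{ ds }}'
                        GROUP BY lob, day
                    """,
                    "useLegacySql": False,
                    # Write to that day's partition so backfills/replays are idempotent.
                    "destinationTable": {
                        "projectId": "my-project",
                        "datasetId": "mydataset",
                        "tableId": "daily_revenue${{ ds_nodash }}",
                    },
                    "writeDisposition": "WRITE_TRUNCATE",
                }
            },
        )

        check_not_empty = BigQueryCheckOperator(
            task_id="check_not_empty",
            sql="SELECT COUNT(*) FROM `my-project.mydataset.daily_revenue` WHERE day = '{{ ds }}'",
            use_legacy_sql=False,
        )

        build_daily_revenue >> check_not_empty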

Is it possible to write tests for SQL in Bigquery?

Usually when writing code you can write tests, like Unit Tests in Java.
I'm not talking about syntax; the BigQuery editor does a great job with that. I mean the semantics: does the query actually fetch the data I'm looking for in the way I need it?
When analyzing data in BigQuery I don't really know what to expect, so all I can do is say "hmm, looks about right". But a lot of mistakes happen this way.
Often the SQL queries for analysis become very complex, so how do you make sure your SQL gives you the results you are actually looking for?
I'm by myself so I can't ask anybody else to look over it.
Is there a standard for this? If not, what techniques do others use to deal with the uncertainty?
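One technique, sketched below with pytest and the google-cloud-bigquery client rather than any formal standard, is to treat the query like a function and unit-test it against a tiny inline fixture with a hand-computed answer (the function and column names are made up):

    # Sketch only: unit-test a query's semantics against a tiny in-query fixture.
    # No real tables needed: the fixture rows are inlined with UNION ALL.
    from google.cloud import bigquery

    def revenue_per_customer(client: bigquery.Client, source_sql: str) -> dict:
        # The "query under test": the WHERE/GROUP BY logic we want to pin down.
        sql = f"""
            SELECT customer_id, SUM(amount) AS revenue
            FROM ({source_sql})
            WHERE amount > 0           -- was this meant to be >= 0 ? the test decides
            GROUP BY customer_id
        """
        return {row.customer_id: row.revenue for row in client.query(sql).result()}

    def test_revenue_per_customer():
        client = bigquery.Client()
        # Hand-built fixture with a known, hand-computed answer.
        fixture = """
            SELECT 'a' AS customer_id, 10 AS amount
            UNION ALL SELECT 'a', 0
            UNION ALL SELECT 'b', 5
        """
        result = revenue_per_customer(client, fixture)
        assert result == {"a": 10, "b": 5}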

Proper way to move data to a data warehouse

I am in the middle of a small project aimed at eventually creating a data warehouse. I am currently moving data from a flat file system and two SQL Server databases. The project started in C# to automate the processing of data from the flat file system. Along with this, the project executes stored procedures to bring data in from the other databases; these stored procedures access the other databases using linked servers.
I am wondering if this is the wrong approach; even though it does get the job done, there may be a better way. The other approach I have thought about is to use the app to pull data from each DB and then push it to the data warehouse, but I am not sure about performance. Is there another way? Any path that I can look into is appreciated.
'Proper' is a pretty relative term. I have seen a series of stored procedures, SSIS (Microsoft), and third-party tools. They each have some advantages.
Stored procedures
Using a job to schedule a series of stored procedures that insert rows from one server to the next works. I find SQL developers more likely to take this path... it's flexible in design and good SQL programmers can accomplish nearly anything in here. That said, it is exceedingly difficult to support / troubleshoot / maintain / alter (especially if the initial developer(s) are no longer with the company). There is usually very poor error handling here.
SSIS and other tools such as Pentaho or DataStage or... google search it, there are a few.
These give a more graphical design interface, although I've seen SSIS packages that simply called stored procedures in order and might as well have just been a job. These tools are really what you make of them. They give very easy-to-see workflows and are substantially robust when it comes to error handling and troubleshooting ability (trust me, every ETL process is going to have a few bad days and you'll be very happy for any logging you have when identifying what went wrong). I find configuring a server's resources (multiple processors, for example) is significantly easier with these tools. They all come with quite the learning curve though.
I find SQL developers are very much inclined to use the stored procedure route, while people from a DBA background are generally more inclined to use the tools. If you're investing the time into it, SSIS or an equivalent tool is the better way to go from the standpoint of your company's future, though it takes a bit more to implement.
In choosing what to use you need to consider the following factors:
How much data are we talking about moving, and how quickly does it need to be moved? There is a huge difference between using a linked server to move 45,000 records and using it to move 100,000,000 records. Consider also the expected growth of the data set to be moved over time. A process that is fine in the early stages may choke and die once you get more records. Tools like SSIS are much faster once you know how to use them, which brings us to point 2.
How much development time do you have, and what tools do the developer and the person who will maintain the import over time know? SSIS, for instance, is a complex tool; it can take a long time to feel comfortable with it.
How much data cleaning and transforming do you need to do? What kind of error trapping and exception processing do you need, what kind of logging will you need? The more complex the process, the more likely you will need to bite the bullet and learn an ETL specific tool.
Even though there are already a few answers, and I agree with two of them, I have to give my subjective opinion on the wider picture.
I am in the middle of a small project aimed at eventually creating a data warehouse.
The question title perfectly suits your question description, and it could be very helpful to future readers. So, your project should create a data warehouse. Even though it's small, learn to develop projects with scalability in mind. Always!
From that point of view, search for and study what a data warehouse project should look like, and develop each step accordingly.
Custom software vs Stored Procedures (Linked DBs) vs ETL
Custom software (in this case your C# project) should be used in two cases:
Medium-scale projects where a budget ETL tool cannot do everything
You're working for an enterprise-level IT company, so developing your own solution is cheaper and more manageable
And perhaps you'd consider it for tiny, straightforward projects too. But no, because those projects can grow and very quickly outgrow your solution (new tables, new sources, a change of ERP or CRM, etc.).
If you're using just SQL Server, and you have no need for data cleansing, no need for data profiling, and no need for external data, stored procedures are OK. But that's a lot of 'ifs'. And again, you're losing scalability (say your management wants to add some data from a Google Spreadsheet they use internally, KPI targets for example).
ETL tools are a natural step in data warehouse development. In the beginning, there might be a few table copy operations or some SQL, one source, one target. As your project grows, you can add new transformations.
SSIS is perhaps the best fit since you're using SQL Server, but there are some good, free tools as well.

Database approach to use for Dynamic Form Data Collection which is suitable for good Reports and Searching

I am working on a project which involves collecting dynamic form data. These forms are user-defined (think SurveyMonkey), so a fixed schema cannot be defined for them. Data in the form of questions/answers would be retrieved for these forms and then stored in the database. Reporting/searching on these answers (filtering and aggregation) is of utmost importance. There are two approaches which seem feasible.
Use a SQL database and store each field's data as a separate row. Reporting/searching is then done via SQL. My apprehension is that this would result in complicated joins for reporting.
Use a NoSQL database like MongoDB. This seems to be a perfect fit for storing the dynamic data since it is schema-less. However, I am not sure how good its reporting capabilities are.
It seems easier for the target users to learn SQL than to define map/reduce queries. How easy would it be to build a UI for reporting/searching over MongoDB?
Simple things like: a list of users who gave a particular set of answers, or how many such users there were over a period of time, etc.
Thanks,
Pulkit
It's already been mentioned in the comments, but I'll re-iterate that you should look at Mongo's map/reduce functionality for reporting and the aggregation framework.
Having done map/reduce in both Couch and Mongo I can say that they are very similar. It's definitely a barrier to entry for a developer that isn't familiar with it, but once you get a few working examples, it's not too bad.
Consider that Mongo can output a map/reduce job to a collection, which I've found to be really useful. This means you can schedule the jobs, run them periodically, and output to a place that you can then report on. It's not that hard to create a framework that lets developers write simple JavaScript map and reduce functions and then plug them in to be run on a schedule.
The aggregation framework is much easier to understand for a developer coming from SQL. There is still a learning curve, but it's not as bad as map/reduce. It is much better suited to ad-hoc reporting queries, and there is nothing comparable in Couch.
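To give a flavour of the aggregation framework, the "users who gave a particular set of answers, counted over time" report from the question might look roughly like this with pymongo (the collection and field names are made up):

    # Sketch only: count distinct users per month who answered "yes" to question q1.
    # Collection and field names (responses, form_id, answers, user_id) are made up.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    responses = client["forms"]["responses"]

    pipeline = [
        # Keep only the responses we care about.
        {"$match": {"form_id": "survey-42", "answers.q1": "yes"}},
        # One bucket per month, collecting distinct users via $addToSet.
        {"$group": {
            "_id": {"year": {"$year": "$submitted_at"}, "month": {"$month": "$submitted_at"}},
            "users": {"$addToSet": "$user_id"},
        }},
        {"$project": {"user_count": {"$size": "$users"}}},
        {"$sort": {"_id.year": 1, "_id.month": 1}},
    ]

    for bucket in responses.aggregate(pipeline):
        print(bucket["_id"], bucket["user_count"])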
You could maybe make a reporting UI that maps to the aggregation framework, but I wouldn't try to do something similar for map/reduce queries.

Scripting your database first versus building the database via SQL Server Management Studio and then generating the script

I had a (friendly but heated) argument with my lead developer the other day because our project has T-SQL scripts that I code directly into SQL files which I then run against the database. I find that when I do this, it's easy to work out the schema in advance without fiddly pointing and clicking, and there's no opportunity to forget to generate a script to put into source control: generating the script is no longer a chore you have to do after the fact but an implicit part of the process (and it also leads to cleaner scripts without the extra crap that SQL Server Management Studio inserts into the scripts it generates).
My lead developer insists that having to manually script it out is a pain in the arse and that he absolutely refuses to write his scripts by hand when there are perfectly good tools to do it without coding. I've noticed that the copying of his changes into the actual scripts tends to get delayed a bit as a result though.
What are your thoughts on the pros and/or cons of doing it one way vs the other? Am I being too rigid/old-school in my sticking to hand coding schema scripts or is he being too reliant on third party tools and losing something in the process?
I always script stuff myself because the wizards sometimes don't script things the way I like, and they also give funky names to defaults.
Scripting things yourself is also good in case you get laid off and have to go for an interview where they ask you to script DDL on the whiteboard.
As I usually collaborate with a colleague during schema design, I tend to design the schema using the GUI tools, as it's easier to discuss with a diagram of the tables in front of you. I then generate the scripts, being careful to select the exact options I want so as to avoid having to make manual changes post-export.
I think a decision on the relative merits of the two approaches might take into account factors such as
the frequency of changes to the schema
the frequency with which changes need to be propagated to other schemas (test, user acceptance, production, clients * n, etc)
the degree to which the schema may vary across development branches
how well-known in advance your various changes can be scheduled
whether or not you can generate SQL "diff" scripts between schemas.
On balance, I tend to prefer to work with a script for each change (or "migration"). It lets me resequence change releases as priorities shift.
Just because you can create tables in a graphical tool doesn't necessarily mean you should.
I find it's as quick to write a script as it is to use SQLMS. You still have to type names in SQLMS, and the time spent moving between keyboard and mouse could be spent writing the proper script anyway.
The two of you are almost working with two sets of code. Consistency seems to be a key factor in these types of decisions. In your case, if you create a script and your boss uses the GUI to add a field, how do you stay in sync? You can't use your script to rebuild the table without editing it (a chance for error).
Maybe he should pull rank and force you to format your scripts the same way the GUI creates them - just kidding.
I think you should flip a coin on it.