What are some methods of testing data analytics systems and ETL processes? - sql

I work primarily with so-called "Big Data", specifically the ETL and analytics parts. One of the challenges I constantly face is finding a good way to "test my data", so to speak. My MapReduce and ETL scripts have solid unit test coverage, but if the underlying data itself changes unexpectedly (it comes from multiple application systems), the code won't necessarily throw a noticeable error, which leaves me with bad or altered data that I don't know about.
Are there any best practices out there that help people keep an eye on what / how the underlying data may be changing?
Our technology stack is AWS EMR, Hive, Postgres, and Python. We're not really interested in bringing in a big ETL framework like Informatica.

You could create some kind of mapping files (maybe XML or something similar) that follow the standards specific to your systems, and validate your incoming data before putting it into your cluster, or maybe during the process itself. I was facing a similar issue some time ago and ended up doing this.
I don't know how feasible it is for your data and your use case, but it did the trick for us. I had to create the XML files once (I know it's boring and tedious, but worth a try), and now whenever I get new files I use these XML files to validate the data before putting it into my cluster, checking whether the data is correct as per the defined standards. This saves a lot of the time and effort that would be involved if I had to check everything manually every time I got new data.
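To make the mapping-file idea above concrete, here is a minimal sketch in Python (part of the asker's stack). The XML layout, the column names, and the helper names `load_mapping` and `validate_rows` are all assumptions made up for illustration; the answer does not say what its real mapping files looked like.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical mapping file: one <column> element per expected field.
MAPPING_XML = """
<mapping>
  <column name="user_id" type="int" required="true"/>
  <column name="country" type="str" required="true"/>
  <column name="amount" type="float" required="false"/>
</mapping>
"""

CASTERS = {"int": int, "float": float, "str": str}


def load_mapping(xml_text):
    """Parse the mapping into a list of (name, caster, required) tuples."""
    root = ET.fromstring(xml_text)
    return [
        (col.get("name"), CASTERS[col.get("type")], col.get("required") == "true")
        for col in root.findall("column")
    ]


def validate_rows(csv_text, mapping):
    """Return (line_number, problem) pairs for rows that break the mapping."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = [name for name, _, _ in mapping]
    if reader.fieldnames != expected:
        # A changed header usually means the upstream system changed.
        return [(1, f"header mismatch: {reader.fieldnames}")]
    for line_no, row in enumerate(reader, start=2):
        for name, caster, required in mapping:
            value = row.get(name) or ""  # short rows yield None
            if value == "":
                if required:
                    problems.append((line_no, f"{name} is required"))
                continue
            try:
                caster(value)
            except ValueError:
                problems.append(
                    (line_no, f"{name}={value!r} is not {caster.__name__}")
                )
    return problems


incoming = "user_id,country,amount\n1,DE,9.5\nabc,FR,\n3,,1.0\n"
problems = validate_rows(incoming, load_mapping(MAPPING_XML))
```

A file would only be loaded into the cluster when `problems` comes back empty; otherwise the list tells you which lines to inspect.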

Related

BigQuery Testing, Debugging, and Design Patterns

We use BigQuery as the main data warehouse in our company.
We have gotten very efficient with SQL syntax and we write multi-page SQL queries with valid syntax to analyze our data.
The main problem we are struggling with is terrible logic mistakes in our queries. For example, a > should have been a >=, or a join treated NULL values the wrong way.
The effect is that we are getting wrong data out of BigQuery.
The logic within our data structure is so complicated ("what again was the definition of Customer Type ABC?") that it's terribly difficult to actually pull out anything usable. We estimate that up to 50% of the analytics we pull out of BigQuery are plain wrong.
Of course this is a problem that significantly hurts our bottom line and leads to wrong business decisions. It has gotten so bad that we are craving a normalized database structure that could at least be comprehended more easily.
My hope is that maybe we are just missing certain design patterns to properly use BigQuery. However, I find zero guidance about this online. The SQL we are using is so complex that I'm starting to think that although the syntax is correct, SQL was not made for this. What we are doing feels like fitting a complex program into a single function, which in turn becomes untestable and a nightmare to work with.
I would appreciate any input and guidance.
I can empathize here. I don't think your issue is unique, and there isn't one best practice. I can tell you what we have done to help with these same issues.
We are a small team of analysts, and only have a couple TB of data to crunch daily so your mileage will vary with these tips depending on your situation.
We use DBT - https://www.getdbt.com/. It has a free command line version, or you can pay for DBT Cloud if you aren't confident with command line tools. It will help you go from pages-long SQL queries to smaller digestible chunks that are easier to maintain.
It helps with 3 main use cases for us:
Database normalization/summarization: you can easily write queries, have them depend on each other, and have them scheduled to run at a certain time, while the tool handles a lot of the more complex data engineering tasks for you, such as making sure things run in the right order and that no circular references exist. This part of the tool helped us migrate away from pages-long SQL queries to smaller digestible chunks that are useful in multiple applications.
Documentation: there is a documentation site built in, so you can document a column and write out the definition of 'customer' easily.
Testing: we write loads of tests. For certain metrics we have a 100% accepted answer. Any time we need to reference such a metric in other queries, or transform data to slice that metric by other dimensions, we write a test to make sure the new transformation matches back to the 100% accepted answer.
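The "matches back to the accepted answer" test described above can be sketched roughly as follows. This is not how DBT expresses tests; it is a plain Python illustration that uses the standard-library sqlite3 module as a stand-in for BigQuery, with made-up `orders` and `customers` tables and a hypothetical `reconciles` helper. It also shows how the NULL-handling join mistake from the question gets caught.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 100.0), (2, 11, 50.0), (3, NULL, 25.0);
    INSERT INTO customers VALUES (10, 'EU'), (11, 'US');
""")

# The metric everyone has agreed on.
ACCEPTED_REVENUE_SQL = "SELECT SUM(amount) FROM orders"

# Slicing by region with an inner join silently drops the NULL customer_id row.
BY_REGION_INNER = """
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON c.id = o.customer_id
    GROUP BY c.region
"""

# A left join keeps the row and buckets it as 'unknown'.
BY_REGION_LEFT = """
    SELECT COALESCE(c.region, 'unknown'), SUM(o.amount)
    FROM orders o LEFT JOIN customers c ON c.id = o.customer_id
    GROUP BY COALESCE(c.region, 'unknown')
"""


def reconciles(conn, sliced_sql, accepted_sql, tolerance=1e-9):
    """True if the sliced values add back up to the accepted total."""
    accepted = conn.execute(accepted_sql).fetchone()[0]
    sliced_total = sum(value for _, value in conn.execute(sliced_sql))
    return abs(accepted - sliced_total) <= tolerance


inner_ok = reconciles(conn, BY_REGION_INNER, ACCEPTED_REVENUE_SQL)
left_ok = reconciles(conn, BY_REGION_LEFT, ACCEPTED_REVENUE_SQL)
```

Here the inner-join version fails the check because 25.0 of revenue goes missing, while the left-join version passes.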
We have explored DBT, but unfortunately we didn't have the bandwidth to support it at the company level. As an alternative we use Airflow to build and maintain datasets in BigQuery. We use the BigQuery operators to interface with BQ through Airflow. This helps us in the following ways:
Ability to build custom operators that can help with organizational level bells and whistles (integration with internal systems, data lifecycle management, lineage management etc.)
Ability to break down complex pieces of SQL into smaller manageable blocks that can be reused
Ability to incorporate testing in the process. You can build testing into your pipeline DAG or can build out separate DAGs of tests that can monitor your datasets and send out reports.
Ability to replay and recreate datasets
Ability to easily manage schema changes
I am sure there are other use cases where Airflow helps, but these are some of the things that come to mind.

Proper way to move data to a data warehouse

I am in the middle of a small project aimed to eventually create a data warehouse. I am currently moving data from a flat file system and two SQL Server databases. The project started in C# to automate the processing of data from the flat file system. Along with this, the project executes stored procedures to bring data from other databases. They are accessing the data from other databases using linked servers.
I am wondering if this is incorrect; even though it does get the job done, there may be a better approach. The other way I have thought about is to use the app to pull data from each DB and then push it to the data warehouse, but I am not sure about performance. Is there another way? Any path that I can look into is appreciated.
'Proper' is a pretty relative term. I have seen a series of stored procedures, SSIS (Microsoft), and third-party tools used. They each have some advantages.
Stored procedures
Using a job to schedule a series of stored procedures that insert rows from one server to the next works. I find SQL developers more likely to take this path; it's flexible in design, and good SQL programmers can accomplish nearly anything here. That said, it is exceedingly difficult to support, troubleshoot, maintain, and alter (especially if the initial developers are no longer with the company). There is usually very poor error handling here.
SSIS and other tools such as Pentaho or DataStage (search for others, there are a few).
These give you a more graphical design interface, although I've seen SSIS packages that simply called stored procedures in order, which may as well have just been a job. These tools are really what you make of them. They give very easy-to-see workflows and are substantially more robust when it comes to error handling and troubleshooting (trust me, every ETL process is going to have a few bad days and you'll be very happy for any logging you have to identify what you want). I find that configuring a server's resources (multiple processors, for example) is significantly easier with these tools. They all come with quite a learning curve, though.
I find SQL developers are very much inclined to use the stored procedure route, while people from a DBA background are generally more inclined to use the tools. If you're investing the time into it, SSIS or an equivalent tool is the better way to go from the standpoint of your company's future, though it takes a bit more to implement.
In choosing what to use you need to consider the following factors:
How much data are we talking about moving, and how quickly does it need to be moved? There is a huge difference between using a linked server to move 45,000 records and using it to move 100,000,000 records. Consider also the expected growth of the data set to be moved over time. A process that is fine in the early stages may choke and die once you get more records. Tools like SSIS are much faster once you know how to use them, which brings us to point 2.
How much development time do you have and what tools does the developer and the person who will maintain the import over time know? SSIS for instance is a complex tool, it can take a long time to feel comfortable with it.
How much data cleaning and transforming do you need to do? What kind of error trapping and exception processing do you need, what kind of logging will you need? The more complex the process, the more likely you will need to bite the bullet and learn an ETL specific tool.
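As a rough illustration of the error trapping and logging factors above, here is a small Python sketch of a batched copy that rejects bad rows instead of failing the whole load. It uses in-memory sqlite3 databases as stand-ins for the source and the warehouse; the table names (`src_orders`, `dw_orders`, `dw_rejects`) and the function `copy_in_batches` are hypothetical, and a real SSIS package or linked-server procedure would look quite different.

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")


def copy_in_batches(source, target, batch_size=1000):
    """Copy src_orders into dw_orders; return (loaded, rejected) counts."""
    loaded = rejected = 0
    cursor = source.execute("SELECT id, amount FROM src_orders ORDER BY id")
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            try:
                with target:  # commits on success, rolls back on error
                    target.execute(
                        "INSERT INTO dw_orders (id, amount) VALUES (?, ?)", row
                    )
                loaded += 1
            except sqlite3.IntegrityError as exc:
                # Trap the bad row, record why it failed, and keep going.
                rejected += 1
                log.warning("rejected row %r: %s", row, exc)
                with target:
                    target.execute(
                        "INSERT INTO dw_rejects (id, reason) VALUES (?, ?)",
                        (row[0], str(exc)),
                    )
        log.info("batch done: %d loaded, %d rejected so far", loaded, rejected)
    return loaded, rejected


source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE src_orders (id INTEGER, amount REAL);
    INSERT INTO src_orders VALUES (1, 10.0), (2, NULL), (3, 5.0);
""")
target = sqlite3.connect(":memory:")
target.executescript("""
    CREATE TABLE dw_orders (id INTEGER PRIMARY KEY, amount REAL NOT NULL);
    CREATE TABLE dw_rejects (id INTEGER, reason TEXT);
""")
result = copy_in_batches(source, target, batch_size=2)
```

Row-by-row inserts keep the sketch easy to follow but are slow at volume; with 100,000,000 records you would want bulk inserts per batch, which is part of why the dedicated tools tend to win as data grows.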
Even though there are a few answers already, and I agree with two of them, I have to give my subjective opinion about the wider picture.
I am in the middle of a small project aimed to eventually create a data warehouse.
The question title suits its description perfectly, which could be very helpful to future readers. So, your project should create a data warehouse. Even though it is small, learn to develop projects with scalability in mind. Always!
From that point of view, search for and study how a data warehouse project should look, and develop each step.
Custom software vs Stored Procedures (Linked DBs) vs ETL
Custom software (in this case your C# project) should be used in two cases:
Medium-scale projects where budget ETL tools cannot do everything
You're working for an enterprise-level IT company, so developing your own solution is cheaper and more manageable
You might also think of tiny, straightforward projects. But no, because those projects can grow and very quickly outgrow your solution (new tables, new sources, a changing ERP or CRM, etc.).
If you're using just SQL Server, and you have no need for data cleansing, data profiling, or external data, stored procedures are OK. But that is a lot of 'ifs'. And again, you're losing scalability (say your management wants to add some data from a Google Spreadsheet they use internally, KPI targets for example).
ETL tools are one native step in data warehouse development. In the beginning, there may be just a few table copy operations or some SQL statements, with one source and one target. As your project grows, you can add new transformations.
SSIS is perhaps best since you're using SQL Server, but there are some good free tools too.

What to use, Excel or database for data driven framework in Selenium?

I am working on Selenium automation using WebDriver. It uses a keyword- and data-driven approach. I am handling all the object inputs, data, and test configuration from Microsoft Excel.
Now the client wants to use a database. He is asking me which one is better to use in the framework: a database or Microsoft Excel? I have to reply to him with valid points.
Which one is better to use in the framework, and why is the other one not as good?
This question really requires an opinionated answer, but because there are some important benefits from one or the other I will still answer it!
With Excel, data entry is easy and accessible to non-technical testers. It has good sorting ability and the data can be organised in a pretty intuitive fashion. However, you cannot easily add or alter data during a test (I understand you could do this, but I would think this is overboard). This means data must be organised in advance and be specific to the test requirements.
In a database data can come from queries and be altered on the fly. You could write helper functions that populate required data if it can't be found in the db. This means tests can run without worrying about what data is currently in the db (to an extent) and can alleviate some project issues. However, with a database there can be issues importing/exporting and there could be a lot of coding overhead. Also, obviously non-technical testers will find it hard to change the data.
As I said this is really an opinion in the end and is very much a project by project decision.
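The "helper functions that populate required data if it can't be found in the db" mentioned above could look something like this. It is only a sketch in Python with the standard-library sqlite3 module; the `users` table, its columns, and the name `ensure_user` are assumptions for illustration, and a Selenium framework in another language would use its own database driver.

```python
import sqlite3


def ensure_user(conn, username, role="tester"):
    """Return the id of `username`, inserting the row first if it is missing."""
    row = conn.execute(
        "SELECT id FROM users WHERE username = ?", (username,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        "INSERT INTO users (username, role) VALUES (?, ?)", (username, role)
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT UNIQUE, role TEXT)"
)

# Calling the helper twice creates the row once and then reuses it.
first = ensure_user(conn, "alice")
second = ensure_user(conn, "alice")
```

A test can call such a helper in its setup step, so it does not depend on whatever happens to be in the database when the run starts.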
In my experience, efficiency plays an integral role when comparing these two approaches. Using a database tends to slow things down, whereas using a file (Excel) is more efficient. In our company's automation suites we rarely use databases; instead we push data into files and use them as the repository.

What database for crawler/scraper?

I am currently researching what database to use for a project I am working on. Hopefully you guys can give me some hints.
The project is an automated web crawler that checks websites as per a user's request, scrapes data under certain circumstances, and creates log files of what was done.
Requirements:
Only few tables with few columns; predefining columns is no problem
No overly complex associations between models
Huge amount of date & time based queries
Due to logging, database will grow rapidly and use up a lot of space
Should be able to scale over multiple servers
Fields contain mostly ids (int), strings (around 200-500 characters max), and unix timestamps
Two different types of servers will simultaneously read/write data directly to/from it:
One(/later more) rails app that takes user input and displays results upon request
One(/later more) Node.js server that functions as the executing crawler/scraper. It will have enough load to run continuously and make dozens of database queries every second.
I assume it will be neither a graph database (no complex associations) nor a memory-based key/value store (too much data to hold in cache). I'm still on the fence about every other type of database I could find; each seems to have its merits.
So, any advice from the pros how I should decide?
Thanks.
I would agree with Vladimir that you would want to consider a document-based database for this scenario. I am most familiar with MongoDB. My reasons for using it here are as follows:
Your 'schema requirements' of "only a few tables with few columns" fits well with the NoSQL nature of MongoDB.
Same as above for "no overly complex associations between nodes" -- you will want to decide whether you'd prefer nested documents or using dbref (I prefer the former)
Huge amount of time-based data (and other scaling requirements) - MongoDB scales well via sharding or partitioning
Read/write access - this is why I am recommending MongoDB over something like Hadoop. The interactive query requirement is best met by something other than a Hadoop-style store, as this type of storage is designed for batch (rather than interactive query) requirements.
Google built a database called "BigTable" for crawling, indexing and the search related business. They released a paper about it (google for "BigTable" if you're interested). There are several open source implementations for bigtable-like designs, one of them is Hypertable. We have a blog posting describing a crawler/indexer implementation (http://hypertable.com/blog/sehrchcom_a_structured_search_engine_powered_by_hypertable/) written by the guys from sehrch.com. And looking at your requirements: all of them are supported and are common use cases.
(Disclaimer: I work for Hypertable.)
Take a look at a document-oriented database like CouchDB or MongoDB.

I need advice choosing a NoSQL database for a project with a lot of minute related information

I am currently working on a private project that is going to use Google's GTFS spec to get information about 100s of public transit agencies, their routes, stations, times, and other related information. I will be getting my information from here and the google code wiki page with similar info. There is a lot of data and it's partitioned into multiple CSV-formatted text files. These can be huge, some ranging from 80 to 100 MB.
With the data I have, I want to translate it all into a nice solid database that I can build layers on top of to use for my project. I will be using GPS positioning to pinpoint a location and all surrounding stations/stops.
My goal is to access all the information for all these stops and stations with as few calls as possible, while keeping datasets small for queried results.
I am currently leaning towards MongoDB and CouchDB for their GeoSpatial support that can really optimize getting small datasets. But I also need to be sure to link all the stops on a route because I will be propagating information along a transit route for that line. In this case I have found that I can benefit from a Graph DB like Neo4j and OrientDB, but from what I know, neither has GeoSpatial support nor am I 100% sure that a Graph DB would be what I need.
The perfect solution might not exist, but I come here asking for help finding the best possible one for my situation. I know I will possibly have to work around limitations of whatever I choose, but I want to at least have done my research and know that it's the best I can get at the moment.
I have also been suggested to splinter the data into multiple DBs, but that could get very messy because all the information is very tightly interconnected through IDs.
Any help would be appreciated.
Obviously a graph database fits your problem 100%. My advice here is to go for a geospatial module on top of Neo4j or OrientDB, although there are other free and open source implementations.
I think the best one right now, with all the geospatial features implemented, is the neo4j-spatial package. But as far as I know, you can also reproduce most of the geospatial functionality on your own if necessary.
BTW, talking about splitting: if the amount of data/queries will be high, I strongly recommend that you share the load and design the model in those terms. I'm sure you can do something there.
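On reproducing the geospatial part yourself, as suggested above: the core of a "stops near this GPS position" lookup is a great-circle distance filter. The sketch below uses the haversine formula in plain Python; `stop_id`, `stop_lat`, and `stop_lon` are the field names from GTFS stops.txt, while the sample coordinates and the function names are made up for illustration. It scans every stop, so it is a stand-in for, not a replacement of, a real spatial index such as the ones in neo4j-spatial or MongoDB.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def stops_within(stops, lat, lon, radius_km):
    """Return (distance_km, stop_id) pairs inside the radius, nearest first."""
    hits = [
        (haversine_km(lat, lon, s["stop_lat"], s["stop_lon"]), s["stop_id"])
        for s in stops
    ]
    return sorted(hit for hit in hits if hit[0] <= radius_km)


# Hypothetical rows, shaped like parsed lines of a GTFS stops.txt file.
stops = [
    {"stop_id": "A", "stop_lat": 52.5200, "stop_lon": 13.4050},
    {"stop_id": "B", "stop_lat": 52.5300, "stop_lon": 13.4050},
    {"stop_id": "C", "stop_lat": 48.1351, "stop_lon": 11.5820},
]
nearby = stops_within(stops, 52.5200, 13.4050, radius_km=2.0)
```

Linking the stops along a route would still need the graph (or a join on route and trip IDs); this only covers the "all surrounding stations/stops" part of the question.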
I've used Mongo's GeoSpatial features and can offer some guidance if you need help with a C# or javascript implementation - I would recommend it to start because it's super easy to use. I'm learning all about Neo4j right now and I am working on a hybrid approach that takes advantage of both Mongo and Neo4j. You might want to cross reference the documents in Mongo to the nodes in Neo4j using the Mongo object id.
For my hybrid implementation, I'm storing profiles and any other large static data in Mongo. In Neo4j, I'm storing relationships like friend and friend-of-friend. If I wanted to analyze movies two friends are most likely to want to watch together (or really any other relationship I hadn't thought of initially), by keeping that object id reference I can simply add some code instructing each node go out and grab a list of movies from the related profile.
Added 2011-02-12:
Just wanted to follow up on this "hybrid" idea as I created prototypes for and implemented a few more solutions recently where I ended up using more than one database. Martin Fowler refers to this as "Polyglot Persistence."
I'm finding that I am often using a combination of a relational database, document database and a graph database (in my case this is generally SQL Server, MongoDB and Neo4j). Since the question is related to data modeling as much as it is to geospatial, I thought I would touch on that here:
I've used Neo4j for site organization (similar to the idea of hypermedia in the REST model), modeling social data and building recommendations (often based on social data). As a result, I will generally model this part of the application before I begin programming.
I often end up using MongoDB for prototyping the rest of the application because it provides such a simple persistence mechanism. I like to start developing an application with the user interface, so this ends up working well.
When I start moving entities from Mongo to SQL Server, the context is usually important - for instance, if I have an application that allows users to build daily reports based on periodically collected data, it may make sense to run a procedure that builds those reports each night and stores daily report objects in Mongo that may be combined into larger aggregate reports as needed (obviously this doesn't consider a few special cases, but that is not relevant to the point)...on the other hand, if users need to pull on-demand reports limited to very specific time periods, it may make sense to keep everything in SQL server and build those reports as needed.
That said, and this deserves more intense thought, here are some considerations that may be helpful:
I generally try to store entities in a relational database if pulling an entity from the database (in other words, in the context of a relational database, querying the data required to generate an entity or list of entities that fulfills the requested parameters) does not require significant processing (multiple joins, for instance).
Do you require ACID compliance (aside: if you have a graph problem, you can leverage Neo4j for this)? There are document databases with ACID compliance, but there's a reason Mongo is not: What does MongoDB not being ACID compliant really mean?
One use of Mongo I saw in the wild that I thought was worthy of mention: Hadoop was being used to compute massive hash tables that were then stored in Mongo. I believe a similar approach is used by TripAdvisor for user-based customization in terms of targeting offers, advertising, etc.
NoSQL only exists because MySQL users assume that all databases have their performance problems when their database grows large and/or becomes complex.
I suggest that you use PostGIS. You can use the same database for the rest of your data needs as well.
http://postgis.refractions.net/