How to use pure SQL for Exploratory Data Analysis?

I'm an ETL developer using different tools for ETL tasks. The same question arises in all our projects: the importance of data profiling before the data warehouse is built and before the ETL for data movement is built. Usually I have done data profiling (i.e. finding bad data, data anomalies, counts, distinct values, etc.) using pure SQL, because ETL tools do not provide a good alternative (there are some data quality components in our tools, but they are not very sophisticated). One option is to use tools like the R programming language or SPSS Modeler for this kind of Exploratory Data Analysis, but usually such tools are not available or do not cope well with millions of rows of data.
How can this kind of profiling be done using SQL? Are there any helper scripts available? How do you do this kind of Exploratory Data Analysis before data cleaning and ETL?
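To make the question concrete, the kind of profiling queries I mean look roughly like the sketch below (the table and column names are just placeholders):

    -- Basic column profile: row count, distinct values, nulls, min/max
    SELECT
        COUNT(*)                                              AS row_count,
        COUNT(DISTINCT customer_id)                           AS distinct_customer_ids,
        SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)  AS null_customer_ids,
        MIN(order_date)                                       AS min_order_date,
        MAX(order_date)                                       AS max_order_date
    FROM staging.orders;

    -- Frequency distribution of a suspect column, to spot anomalies
    SELECT order_status, COUNT(*) AS cnt
    FROM staging.orders
    GROUP BY order_status
    ORDER BY cnt DESC;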

Load the data into a staging system and use the Data Profiling task in SSIS. This link http://gowdhamand.wordpress.com/2012/07/27/data-profiling-task-in-ssis/ shows how to do the data analysis. Hope this helps.

I found a good tool for this purpose: Datacleaner. It seems to do most of the things I want to do with data in the EDA process.

Use edaSQL, an Exploratory Data Analysis tool for SQL, which can help with data profiling and analysis:
https://pypi.org/project/edaSQL/
source code:
https://github.com/selva221724/edaSQL

Related

NoSQL or SQL or Other Tools for scaling Excel spreadsheets

I am looking to convert an Excel spreadsheet into more of a scalable solution for reporting. The volume of data is not very large: at the moment the spreadsheet is around 5k rows and grows by about 10 rows every day. There are also semi-frequent changes in how we capture information, i.e. new columns as we start to mature the processes. The spreadsheet just stores attribute or dimension data on cases.
I am just unsure whether I should use a traditional SQL database or a NoSQL database (or any other tool). I have no experience with NoSQL, but I understand that it is designed to be very flexible, which is what I want compared to a traditional DB.
Any thoughts would be appreciated :) !
Your dataset is really small and any SQL database (say, PostgreSQL) will work just fine. Stay away from NoSQL DBs as they are more limited in terms of reporting capability.
However, since your fact schema is still not stable ("new columns as we start to mature the processes"), you may simply use your spreadsheet as a data source in BI tools. To keep your reports up to date you may use the following process:
Store your Spreadsheet on cloud storage (like Google Drive or OneDrive)
Use a codeless automation platform (like Zapier) to set up a job that syncs the spreadsheet file with the BI tool when it changes. This is easily possible if the BI tool is SeekTable, for instance.

Data processing - BigQuery vs Data Proc+BigQuery

We have large volumes (10 to 400 billion) of raw data in BigQuery tables. We have a requirement to process this data to convert and create the data in the form of star schema tables (probably a different dataset in BigQuery), which can then be accessed by AtScale.
We need the pros and cons of the two options below:
1. Write complex SQL within BigQuery which reads data from the source dataset and then loads it into the target dataset (used by AtScale).
2. Use PySpark or MapReduce with BigQuery connectors from Dataproc and then load the data into the BigQuery target dataset.
Our transformations involve joining multiple tables at different granularities, using analytic functions to get the required information, etc.
Presently this logic is implemented in Vertica using multiple temp tables for faster processing, and we want to rewrite this processing logic in GCP (BigQuery or Dataproc).
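For illustration, the transformations are roughly of this shape (the dataset, table and column names below are made up):

    -- Hypothetical sketch: build a fact table from sources at different granularities,
    -- using analytic functions to pick the latest customer attributes
    CREATE OR REPLACE TABLE target_dataset.fact_sales AS
    SELECT
        s.sale_id,
        s.sale_ts,
        s.amount,
        c.customer_key,
        SUM(s.amount) OVER (PARTITION BY c.customer_key
                            ORDER BY s.sale_ts) AS running_customer_amount
    FROM source_dataset.sales AS s
    JOIN (
        SELECT
            *,
            ROW_NUMBER() OVER (PARTITION BY customer_id
                               ORDER BY updated_at DESC) AS rn
        FROM source_dataset.customers
    ) AS c
      ON c.customer_id = s.customer_id
     AND c.rn = 1;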
I went successfully with option 1: BigQuery is very capable of running very complex transformations with SQL, and on top of that you can also run them incrementally with time range decorators. Note that it takes a lot of time and resources to move data back and forth between BigQuery and other systems. When running BigQuery SQL, the data never leaves BigQuery in the first place, and you already have all the raw logs there. So as long as your problem can be solved by a series of SQL statements, I believe this is the best way to go.
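As a rough sketch of the incremental pattern (in standard SQL you would typically filter on a partition or timestamp column rather than legacy decorators; all names below are placeholders):

    -- Process only yesterday's slice of the raw data and append it to the target table
    INSERT INTO target_dataset.fact_events
    SELECT
        event_id,
        user_id,
        event_ts,
        payload
    FROM source_dataset.raw_events
    WHERE DATE(event_ts) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);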
We moved off our Vertica reporting cluster last year, successfully rewriting the ETL with option 1.
Around a year ago I wrote a POC comparing Dataflow and a series of BigQuery SQL jobs orchestrated by a potens.io workflow, which allows SQL parallelization at scale.
It took a good month to write the Dataflow job in Java, with 200+ data points and complex transformations, and the debugging capability was terrible at the time.
It took a week to do the same with a series of SQL statements on potens.io, utilizing a Cloud Function for windowed tables and parallelization with clustered transient tables.
I know there have been a bunch of improvements in Cloud Dataflow since then, but at the time Dataflow did fine only at the millions scale and never completed on billions of input records (the main reason was the shuffle cardinality, which was just under a billion records, with each record having 200+ columns). The SQL approach produced all the required aggregations in under 2 hours for a dozen billion records. The ease of debugging and troubleshooting with potens.io helped a lot too.
Both BigQuery and DataProc can handle huge amounts of complex data.
I think that you should consider two points:
Which transformations would you like to do on your data?
Both tools can perform complex transformations, but you have to consider that PySpark will give you full programming-language processing capability, while BigQuery will give you SQL transformations and some scripting structures. If SQL and simple scripting structures can handle your problem, BigQuery is an option. If you need complex scripts to transform your data, or if you think you'll need to build some extra features involving transformations in the future, PySpark may be a better option. You can find the BigQuery scripting reference here.
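To give a flavor of the scripting structures mentioned above, here is a minimal BigQuery scripting sketch (the variable names and the loop body are made up for illustration):

    -- Minimal BigQuery scripting sketch: a variable, a conditional and a loop
    DECLARE batch_date DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
    DECLARE i INT64 DEFAULT 0;

    IF batch_date IS NOT NULL THEN
        WHILE i < 3 DO
            -- a real script would run a transformation step here
            SELECT batch_date, i;
            SET i = i + 1;
        END WHILE;
    END IF;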
Pricing
BigQuery and DataProc have different pricing models. While in BigQuery you need to be concerned about how much data your queries process, in DataProc you have to be concerned about your cluster's size and VM configuration, how long your cluster will be running, and some other settings. You can find the pricing reference for BigQuery here and for DataProc here. You can also simulate the pricing in the Google Cloud Platform Pricing Calculator.
I suggest that you create a simple POC for your project in both tools to see which one has the best cost benefit for you.
I hope this information helps you.

What does ESRI provide that Google BigQuery cannot, and how can these two tools be used together?

Currently I am looking for big data technology that supports big data geospatial analysis. I came across ESRI and found that its main strength is geospatial data analysis and visualization. However, it currently does not have extensive support for big data geospatial analysis, except for ArcGIS GeoAnalytics Server, which requires licensing. At the same time, I found how powerful Google BigQuery is, which recently added support for geospatial processing and analysis (pay for what you use, per second).
What I would like to know is: which tool should I pick for geospatial big data processing, analysis and visualization? And which tool (ESRI vs. BigQuery) is better suited for what?
I would like to run complex queries on a very large temporal geospatial dataset and finally visualize the results on a map.
Please note that I have just started my research on geospatial big data processing and I would like to choose between the alternative tools out there.
Any help is much appreciated!!
(note that Stack Overflow doesn't always welcome this type of question... but you can always come to https://reddit.com/r/bigquery for more discussion)
For "Geospatial big data processing, analysis and visualization" my favorite recommendation right now is Carto+BigQuery. Carto is one of the leading GIS analysis companies, and they recently announced moving one of their backends to BigQuery. They even published some handy notebooks showing how to work with Carto and BigQuery:
https://carto.com/blog/carto-google-bigquery-data/
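To give a sense of BigQuery's native geospatial support, here is a minimal sketch of the kind of query it can run (the dataset, table and column names are made up):

    -- Find events within 10 km of a point of interest, using BigQuery GIS functions
    SELECT
        event_id,
        event_ts,
        ST_DISTANCE(ST_GEOGPOINT(longitude, latitude),
                    ST_GEOGPOINT(-122.4194, 37.7749)) AS meters_from_poi
    FROM my_dataset.gps_events
    WHERE ST_DWITHIN(ST_GEOGPOINT(longitude, latitude),
                     ST_GEOGPOINT(-122.4194, 37.7749),
                     10000);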

Why would you use industry standard ETL?

I've just started work at a new company that has a data warehouse using a bizarre proprietary ETL system built in PHP.
I'm looking for arguments as to why it's worth the investment to move to a standard system such as SSIS or Informatica. The primary reasons I have at the moment are:
A wider and more diverse community of developers available for contract work, replacements etc.
A large online knowledge base/support networks
Ongoing updates and support will be better
What are some other good high-level arguments for bringing in a little standardisation? :)
The only real disadvantage is that a lot of the data sources are web APIs returning individual records row by row, which are more easily looped through in PHP than in a standard ETL tool.
Here are some more:
Simplifies the development and deployment process.
Easier to debug and incorporate changes, which reduces maintenance and enhancement costs.
Industry standard ETL tools perform better on large volumes of data, as they use techniques like grid computing, parallel processing, partitioning, etc.
Can support many types of data as source or target, so there is less impact if source or target systems are migrated to a different data store.
Code is reusable: the same component can be used in multiple processes.

Alternatives to Essbase

I have Essbase as the BI solution (for predictive analytics and data mining) in my current workplace. It's a really clunky tool, hard to configure and slow to use. We're looking at alternatives. Any pointers as to where I can start?
Is Microsoft Analysis Services an option I can look at? SAS or any others?
Essbase's focus and strength are in the information management space, not in predictive analytics and data mining.
The top players (and expensive ones) in this space are SAS (with the Enterprise Miner & Enterprise Guide combination) and IBM with SPSS.
Microsoft SSAS (Analysis Services) is a lot less expensive (it's included with some SQL Server versions) and has good Data Mining capabilities but is more limited in the OR (operations research) and Econometrics/Statistics space.
Also, you could use R, an open-source alternative that is increasing in popularity and capability over time; for example, some strong BI players (SAP, MicroStrategy, Tableau, etc.) are developing R integration for predictive analytics and data mining.
Check www.kpionline.com; it is a cloud-based product from Artus.
It has many prebuilt dashboards, scenarios and functions for analysis.
Another tool you could check is MicroStrategy. It has many functions for analysis.