Hadoop and MS SQL Server Best Practices - sql

I've been following Hadoop for a while, and it seems like a great technology; MapReduce and clustering are just good stuff. But I haven't found any articles about using Hadoop with SQL Server.
Let's say I have a huge claims table (600 million rows) and I want to take advantage of Hadoop. I was thinking, and correct me if I'm wrong, that I could query the table, extract all of the data, and load it into Hadoop in chunks of some format (XML, JSON, CSV). Then I could take advantage of MapReduce and clustering with at least 6 machines and leave my SQL Server free for other tasks. I'm just throwing a bone out here; I want to know whether anybody has done such a thing.

Importing and exporting data to and from a relational database is a very common use-case for Hadoop. Take a look at Cloudera's Sqoop utility, which will aid you in this process:
http://incubator.apache.org/projects/sqoop.html
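As a rough sketch (the connection string, table, column, and directory names below are made up, and the SQL Server JDBC driver has to be on Sqoop's classpath), a Sqoop import that pulls the claims table into HDFS across several parallel mappers would look something like this:

    sqoop import \
      --connect "jdbc:sqlserver://dbhost:1433;databaseName=claimsdb" \
      --username etl_user -P \
      --table Claims \
      --split-by ClaimId \
      --num-mappers 6 \
      --target-dir /data/claims \
      --as-textfile

From there the files on HDFS can be fed into MapReduce jobs, and sqoop export can push aggregated results back into SQL Server.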

Related

What are some workloads where it makes sense to use MapReduce over SQL and vice versa?

It seems like all queries expressed in SQL can be converted into MapReduce jobs. This is in essence what Spark SQL does: it takes in SQL, converts it into a MapReduce-style job, and then executes that job on Spark's runtime.
All questions which can be answered by SQL can be answered by MapReduce jobs. Can all MapReduce jobs also be written as SQL (maybe with custom user defined functions)? When does it make sense to use MapReduce over SQL or vice versa?
SQL is useful when you have structured data (e.g. tables with clearly defined columns and, usually, data types). Using SQL against that structure you can select columns, join tables, and so on.
With MapReduce you can do that (Spark SQL will help you do it), but you can also do much more. A typical example is a word count app that counts the words in text files. Text files do not have any predefined structure that you can use to query them with SQL. Note that this kind of application is usually coded with Spark Core (i.e. RDDs) instead of Spark SQL, since Spark SQL also needs some structure.
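For illustration, here is a minimal PySpark word count over unstructured text using the RDD API (the input path is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count").getOrCreate()
    sc = spark.sparkContext

    # Plain text has no schema, so the RDD API is the natural fit here.
    lines = sc.textFile("hdfs:///data/books/*.txt")  # placeholder input path
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    for word, n in counts.take(10):
        print(word, n)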
Another, perhaps more realistic, use case is processing large amounts of log files with MapReduce (again, log files do not have the kind of relational structure that SQL requires).
SQL and MapReduce each have their own advantages. For data analysts, SQL means they don't need to learn how to write MapReduce programs. From a developer's perspective, writing a MapReduce program leaves enough room to tune it, for example by adding a random prefix to keys to deal with skewed data.
In the long run, as SQL interpreters keep improving, I would favor SQL over hand-written MapReduce/Spark RDD code.
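As a sketch of that tuning point, here is the salting trick expressed in PySpark (the column names, input path, and salt factor of 10 are invented): partially aggregate on a salted key so one hot key is spread across several tasks, then combine the partial results.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("salted-aggregation").getOrCreate()

    # 'key' is assumed to be heavily skewed; the goal is a count per key.
    df = spark.read.parquet("hdfs:///data/events")  # placeholder input path

    salted = df.withColumn("salt", (F.rand() * 10).cast("int"))

    # First aggregate on (key, salt) so a hot key is split over ~10 tasks,
    # then drop the salt and combine the partial counts.
    partial = salted.groupBy("key", "salt").agg(F.count("*").alias("cnt"))
    totals = partial.groupBy("key").agg(F.sum("cnt").alias("cnt"))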

Data processing - BigQuery vs Dataproc + BigQuery

We have large volumes (10 to 400 billion) of raw data in BigQuery tables. We need to process this data into star schema tables (probably a different dataset in BigQuery) which can then be accessed by AtScale.
I need the pros and cons of the two options below:
1. Write complex SQL within BigQuery which reads data from the source dataset and then loads it into the target dataset (used by AtScale).
2. Use PySpark or MapReduce with the BigQuery connectors from Dataproc and then load the data into the BigQuery target dataset.
The complexity of our transformations involves joining multiple tables at different granularities, using analytic functions to get the required information, etc.
Presently this logic is implemented in Vertica using multiple temp tables for faster processing, and we want to rewrite this processing logic on GCP (BigQuery or Dataproc).
I went successfully with option 1: BigQuery is very capable of running very complex transformations in SQL, and on top of that you can also run them incrementally with time range decorators. Note that it takes a lot of time and resources to move data back and forth into and out of BigQuery. When you run BigQuery SQL, the data never leaves BigQuery in the first place, and you already have all the raw logs there. So as long as your problem can be solved by a series of SQL statements, I believe this is the best way to go.
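A minimal sketch of option 1 with the google-cloud-bigquery Python client (the project, dataset, table, and column names are placeholders): the transformation runs entirely inside BigQuery, and only the result is written to the target dataset that AtScale reads.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical star-schema build: join raw tables and aggregate.
    sql = """
    SELECT c.claim_id, c.customer_id, cu.region, SUM(c.amount) AS total_amount
    FROM `my-project.raw.claims` AS c
    JOIN `my-project.raw.customers` AS cu USING (customer_id)
    GROUP BY c.claim_id, c.customer_id, cu.region
    """

    job_config = bigquery.QueryJobConfig(
        destination=bigquery.TableReference.from_string("my-project.star.fact_claims"),
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client.query(sql, job_config=job_config).result()  # wait for completion

A scheduler (Cloud Composer or BigQuery scheduled queries, for example) can then chain such statements into a pipeline.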
We moved off our Vertica reporting cluster last year, successfully rewriting the ETL with option 1.
Around a year ago I wrote a POC comparing Dataflow with a series of BigQuery SQL jobs orchestrated by a potens.io workflow that allows SQL parallelization at scale.
It took me a good month to write the Dataflow job in Java, with 200+ data points and complex transformations, and the debugging experience at the time was terrible.
It took a week to do the same thing with a series of SQL statements using potens.io, utilizing Cloud Functions for windowed tables and parallelization via clustered transient tables.
I know there have been a bunch of improvements in Cloud Dataflow since then, but at the time Dataflow did fine only at the millions scale and never completed on billions of input records (the main reason was the shuffle cardinality of just under a billion records, with each record having 200+ columns). The SQL approach produced all the required aggregations in under 2 hours for a dozen billion records. The easier debugging and troubleshooting with potens.io helped a lot too.
Both BigQuery and Dataproc can handle huge amounts of complex data.
I think that you should consider two points:
Which transformations would you like to apply to your data?
Both tools can perform complex transformations, but you have to consider that PySpark gives you the full processing capability of a programming language, while BigQuery gives you SQL transformations and some scripting structures. If SQL and simple scripting structures can handle your problem, BigQuery is an option. If you need complex scripts to transform your data, or you think you'll need to build extra features involving transformations in the future, PySpark may be the better option. You can find the BigQuery scripting reference here
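For comparison, here is a minimal option 2 sketch with PySpark on Dataproc and the spark-bigquery connector (the table names, staging bucket, and columns are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("star-schema-build").getOrCreate()

    # Read raw tables straight from BigQuery (assumes the spark-bigquery
    # connector is available on the Dataproc cluster).
    claims = spark.read.format("bigquery").option("table", "my-project.raw.claims").load()
    customers = spark.read.format("bigquery").option("table", "my-project.raw.customers").load()

    fact = (claims.join(customers, "customer_id")
                  .groupBy("claim_id", "customer_id", "region")
                  .agg(F.sum("amount").alias("total_amount")))

    # Write the result back to the target dataset; the connector stages the
    # data through a GCS bucket when writing.
    (fact.write.format("bigquery")
         .option("table", "my-project.star.fact_claims")
         .option("temporaryGcsBucket", "my-staging-bucket")
         .mode("overwrite")
         .save())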
Pricing
BigQuery and Dataproc have different pricing systems. With BigQuery you need to think about how much data your queries process, while with Dataproc you have to think about your cluster's size and VM configuration, how long your cluster will be running, and some other settings. You can find the pricing reference for BigQuery here and for Dataproc here. Also, you can simulate the pricing in the Google Cloud Platform Pricing Calculator.
I suggest that you create a simple POC for your project in both tools to see which one has the best cost benefit for you.
I hope this information helps you.

Are there any books about Azure Data Lake internals?

I don't want to use ADL and ADLA as a black box. I need to understand how the gears turn under the hood in order to use them efficiently.
Where can I find information that describes the internals:
how a U-SQL query is processed
how parallelism works
how storage is organized in ADL at a low level
how database storage is organized in ADL at a low level (is it rowstore or columnstore?)
how partitioning is organized
etc.
There are a lot of books and whitepapers that describe RDBMS engine internals. Does anything like that exist for ADL/ADLA?
There are a lot of people who work on Azure here. Could you publish any drafts/whitepapers we could use as is (unofficially)?
Some of that information is available in presentations we have given. For example you can find some of these presentations on my slideshare account at: http://www.slideshare.net/MichaelRys.
To answer some of your questions above:
The current clustered-index version of U-SQL tables is stored in your catalog folder as so-called structured stream files. These are highly compressible, scaled-out files that use a row-oriented structure with self-contained metadata and statistics (more detailed stats can be created). The table construct provides two levels of partitioning: addressable partitions and internal distribution schemes (HASH, RANGE, etc.). Both help with parallelization, although distribution schemes are more for performance while partitions are more for data lifecycle management. There is no limit on them, although the sweet spot is 1 GB to 4 GB per distribution bucket.
1 AU is basically 1 container. And ADLS is NOT HDFS architecturally but offers the WebHDFS API for compatibility.
This is a pretty broad question. I assume you've started with the existing documentation on ADLA and U-SQL?
https://learn.microsoft.com/en-us/azure/data-lake-analytics/
https://msdn.microsoft.com/library/azure/mt591959
ADLA GA'd in November of 2016, compared to SQL Server in 1987 - that's a very apples and oranges comparison.
Maybe we can start with your specific questions?

How to represent Hive data for Mahout's recommendations?

I am a newbie in Hadoop / big data analysis. I am reading the book "Mahout in Action".
There I saw a topic which explains how to represent a recommender's data from a database. The book shows a programmatic approach to connecting MySQL with Mahout.
My question is: is it possible to connect Hive with Mahout the way we connect MySQL? If yes, then how?
These are pretty much completely different things -- you're talking about using a non-distributed recommender with MySQL, versus a Hadoop-based distributed recommender with the output of Hive on HDFS. If you're using the Hadoop recommenders, the simplest thing is to get your Hive output as simple CSV data on HDFS and use that as input.
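As a rough sketch (the paths are made up), once the Hive output is sitting on HDFS as userID,itemID,rating CSV lines, Mahout's distributed item-based recommender job can be pointed straight at it:

    mahout recommenditembased \
      --input /user/hive/warehouse/ratings_csv \
      --output /user/me/recommendations \
      --similarityClassname SIMILARITY_COOCCURRENCE \
      --numRecommendations 10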

MongoDB and PostgreSQL thoughts

I've got an app fully working with PostgreSQL. After reading about MongoDB, I was interested to see how the app would work with it. After a few weeks, I migrated the whole system to MongoDB.
I like a few things about MongoDB. However, I found that certain queries I was doing in PostgreSQL couldn't be done efficiently in MongoDB, especially when I had to join several tables to calculate some logic. For example, this.
Moreover, I am using Ruby on Rails 3, and an ODM called Mongoid. Mongoid is still in beta release. Documentation was good, but again, at times I found the ODM to be very limiting compared to what Active Record offered with traditional (SQL) database systems.
Even to this date, I feel more comfortable working with PostgreSQL than MongoDB. Only because I can join tables and do anything with the data.
I've made two types of backups: one with PostgreSQL and the other with MongoDB. Some say certain apps are better suited to one type of DB or the other. Should I continue with MongoDB and eventually hope for its RoR ODM (Mongoid) to fully mature, or should I go back to PostgreSQL?
A few more questions:
1) Which one would be more suitable for developing a social networking site similar to Facebook?
2) Which one would be more suitable for a 4-page standard layout type of website (Home, Products, About, Contact)?
You dumped a decades-tested, fully featured RDBMS for a young, beta-quality, feature-thin document store with little community support. Unless you're already running tens of thousands of dollars a month in servers and think MongoDB was a better fit for the nature of your data, you probably wasted a lot of time for negative benefit. MongoDB is fun to toy with, and I've built a few apps using it myself for that reason, but it's almost never a better choice than Postgres/MySQL/SQL Server/etc. for production applications.
Let's quote what you wrote and see what it tells us:
"I like a few things with Mongodb. However, I found certain queries I was
doing in PostgreSql, I couldn't do efficiently in Mongodb. Especially,
when I had to join several tables to calculate some logic."
"I found the ODM to be very limiting compared to what Active Record offered
with traditional (SQL) database systems."
"I feel more comfortable working with PostgreSql than Mongodb. Only because
I can join tables and do anything with the data."
Based on what you've said it looks to me like you should stick with PostgreSQL. Keep an eye on MongoDB and use it if and when it's appropriate. But given what you've said it sounds like PG is a better fit for you at present.
Share and enjoy.
I haven't used MongoDB yet, and may never get round to it as I haven't found anything I can't do with Postgres, but just to quote the PostgreSQL 9.2 release notes:
With PostgreSQL 9.2, query results can be returned as JSON data types.
Combined with the new PL/V8 Javascript and PL/Coffee database
programming extensions, and the optional HStore key-value store, users
can now utilize PostgreSQL like a "NoSQL" document database, while
retaining PostgreSQL's reliability, flexibility and performance.
So it looks like in newer versions of Postgres you can have the best of both worlds. I haven't used this yet either, but as a bit of a fan of PostgreSQL (excellent docs / mailing lists) I wouldn't hesitate to use it for almost anything RDBMS related.
First of all, Postgres is an RDBMS and MongoDB is NoSQL.
Stand-alone NoSQL technologies do not meet ACID standards because they sacrifice critical data protections in favor of high-throughput performance for unstructured applications.
Postgres 9.4 provides NoSQL capabilities along with full transaction support, storing JSON documents with constraints on the field data,
so you get the advantages of both RDBMS and NoSQL.
Check out this article for a detailed discussion: http://www.aptuz.com/blog/is-postgres-nosql-database-better-than-mongodb/
To experience Postgres's NoSQL performance for yourself, download pg_nosql_benchmark from GitHub: https://github.com/EnterpriseDB/pg_nosql_benchmark
We also did research on which is better, Postgres or MongoDB, and with all the facts and figures in hand we found Postgres far better to use than MongoDB. Besides eating up memory and CPU, MongoDB also occupies a large amount of disk space; its disk usage kept doubling at certain intervals.
My experience with Postgres and Mongo after working with both databases in my projects:
Postgres(RDBMS)
Postgres is recommended if your future applications have a complicated schema that needs lots of joins or all the data have relations or if we have heavy writing. Postgres is open source, faster, ACID compliant and uses less memory on disk, and is all around good performant for JSON storage also and includes full serializability of transactions with 3 levels of transaction isolation.
The biggest advantage of staying with Postgres is that we have best of both worlds. We can store data into JSONB with constraints, consistency and speed. On the other hand, we can use all SQL features for other types of data. The underlying engine is very stable and copes well with a good range of data volumes. It also runs on your choice of hardware and operating system. Postgres providing NoSQL capabilities along with full transaction support, storing JSON documents with constraints on the fields data.
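A minimal sketch of that JSONB-with-constraints idea using psycopg2 (the table, fields, and connection string are invented):

    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=app user=app")  # placeholder connection string
    with conn, conn.cursor() as cur:
        # A document column with a CHECK constraint on one of its fields,
        # so the "schemaless" data still gets some validation from the database.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS profiles (
                id  serial PRIMARY KEY,
                doc jsonb NOT NULL,
                CHECK (doc ? 'email')
            )
        """)
        cur.execute("INSERT INTO profiles (doc) VALUES (%s)",
                    [Json({"email": "alice@example.com", "interests": ["hiking"]})])
        # Ordinary SQL combined with JSON operators in the same query.
        cur.execute("SELECT doc->>'email' FROM profiles WHERE doc @> %s",
                    [Json({"interests": ["hiking"]})])
        print(cur.fetchall())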
General Constraints for Postgres
Scaling Postgres horizontally is significantly harder, but doable.
Fast read operations cannot be fully achieved with Postgres.
NoSQL Databases
MongoDB (WiredTiger)
MongoDB may beat Postgres on the dimension of "horizontal scale". Storing JSON is what Mongo is optimized to do. Mongo stores its data in a binary format called BSON, which is (roughly) just a binary representation of a superset of JSON. MongoDB stores objects exactly as they were designed. According to MongoDB, for write-intensive applications the newer WiredTiger engine gives users up to a 10x increase in write performance (I should try this), with an 80 percent reduction in storage utilization, helping to lower storage costs and achieve greater utilization of hardware.
General Constraints of MongoDB
The use of a schemaless storage engine leads to the problem of implicit schemas. These schemas aren't defined by the storage engine but are instead defined based on application behavior and expectations.
Stand-alone NoSQL technologies do not meet ACID standards because they sacrifice critical data protections in favor of high-throughput performance for unstructured applications. It's not hard to apply ACID to NoSQL databases, but it would make the database slow and inflexible to some extent.
"Most of the NoSQL limitations have been addressed in newer versions and releases, which have overcome the previous limitations to a great extent."
Which one would be more suitable for developing a social networking site similar to Facebook?
Facebook currently uses a combination of data stores, including Hive and Cassandra.
Which one would be more suitable for 4-page standard layout type of website (Home, Products, About, Contact)
Again, it depends on how you want to store and process your data, but any SQL or NoSQL database would do the job.