How to validate data in Hive HQL while importing from a source

Please explain how to add validation when importing data from a source into a Hive table. For example, in a bulk load, if some records are corrupt and should not be imported, how can I discard that data?

You need to develop an ETL process and have a strategy for discarding the corrupt data. You can either use third-party tools such as Informatica Big Data Edition or Talend, or develop your own custom code. It is a major effort.
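As a rough illustration of the custom-code route, a minimal PySpark sketch could stage the raw feed, keep only the rows that pass validation, and write them into Hive, with the rejects set aside for auditing. The table and column names below (raw_events, clean_events, rejected_events, event_id, amount) are placeholders, not anything from the original question.

```python
# Minimal sketch: stage raw data, discard corrupt rows, load only clean rows into Hive.
# All table, column, and path names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-import-validation")
         .enableHiveSupport()
         .getOrCreate())

# Stage the raw feed as-is (e.g. delimited files landed by the source system).
raw = spark.read.option("header", "true").csv("/landing/raw_events/")

# Define what "corrupt" means for this feed: missing key, non-numeric amount, etc.
is_valid = (F.col("event_id").isNotNull() &
            F.col("amount").cast("double").isNotNull())

# Load only validated rows into the reporting table; keep rejects for auditing.
raw.filter(is_valid).write.mode("append").saveAsTable("clean_events")
raw.filter(~is_valid).write.mode("append").saveAsTable("rejected_events")
```

The same pattern works in plain HQL too: load everything into a staging table first, then INSERT into the final table only the rows that satisfy your validation predicates.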

Related

Automate data transformation (SQL) and then push processed data to Tableau

I have questions about ways to automate the data transformation process.
What I normally do is transform data using Python or PostgreSQL and then export the processed data as CSV. After that, I connect the CSV file to Tableau.
I have done some research and found that ETL can help. However, I've watched some ETL tools' demo videos, and I'm not sure whether their transform features would meet my needs. For example, I have written 100+ lines of SQL for one of my data transformation tasks; it would be better if I could use PostgreSQL to run that query rather than an ETL tool.
The problem is that I don't know the proper way to automate the data transformation process and then push the data to Tableau. The CSV files will be updated on a daily basis, so I'll need to refresh the data.
Data transformation can be done in various ways; it depends on the nature of your data which approach is the right fit.
If you have a huge volume of data and you are comfortable in Python/Java, you can automate your transformation logic using Spark, write the output to a Hive table, and then connect Tableau to read the data from Hive.
Most of the next-gen ETL tools like Pentaho and Talend can be used, but that erodes the flexibility and portability that a framework like Spark or Beam can give.
If you want to know how you can achieve this using cloud provider services like GCP or AWS, please let me know.
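To make the Spark-to-Hive suggestion a bit more concrete, here is a minimal sketch assuming a PostgreSQL source read over JDBC, with a small aggregation standing in for your 100+ lines of SQL; the connection details and table names are placeholders. A daily cron or Airflow schedule can run the script, and Tableau then connects to the resulting Hive table.

```python
# Sketch: pull a table from PostgreSQL, run the transformation as Spark SQL,
# and write the result to a Hive table for Tableau. All names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("daily-transform")
         .enableHiveSupport()
         .getOrCreate())

# Read the source table over JDBC (needs the PostgreSQL JDBC driver on the classpath).
source = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/analytics")
          .option("dbtable", "public.sales_raw")
          .option("user", "etl_user")
          .option("password", "********")
          .load())

# Re-express the existing SQL transformation as Spark SQL against a temp view.
source.createOrReplaceTempView("sales_raw")
transformed = spark.sql("""
    SELECT region,
           date_trunc('day', sold_at) AS sale_day,
           SUM(amount)                AS revenue
    FROM sales_raw
    GROUP BY region, date_trunc('day', sold_at)
""")

# Overwrite the Hive table each day; Tableau reads from this table.
transformed.write.mode("overwrite").saveAsTable("reporting.daily_sales")
```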
Prep is the Tableau tool for preparing data. It can be used for joining, appending, cleaning, pivoting, filtering and other data cleansing activities.
Tableau Prep is available:
for free if you have a Tableau Creator license
in desktop and Tableau Online/Server versions
Scheduling Prep flows is available in Tableau Online/Server. To schedule flows you will need to acquire the Tableau Prep Conductor add-on.

Best way to replicate MongoDB NoSQL into SQL tables

How can I replicate (incremental load) MongoDB (NoSQL) into SQL tables?
We have a web-based solution that loads data into MongoDB. The data size is almost 1 TB. We need to do BI reporting in the Looker BI tool, but Looker doesn't support MongoDB directly, so we have to replicate our data into SQL form; we have Redshift as the target database.
Main requirements for parsing NoSQL to SQL:
The parent node should be the main table
Nested nodes/arrays should be separate tables with a parent key (foreign key)
Whenever a new column is introduced in the MongoDB source, it should automatically start replicating that new field from any document to the target database
Incremental refresh from source to target.
I've seen Stitch Data ETL, which fits my requirements, but I'm looking for an open-source ETL/DB tool or library.
Please help.
Posting an answer to help out others with the same requirements.
I wasn't able to find any open-source ETL tool that could fulfil all four requirements above.
I tried writing Python code to do so (a rough sketch of that approach is below), but in the end a paid tool named Precog helped me fulfil all the above requirements, and it is a little cheaper than Stitch Data ETL.
Thanks
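For anyone attempting the Python route mentioned above, here is a rough sketch of the flattening logic; the collection, field, and file names are placeholders. The parent document becomes a row in the main table, each nested array becomes rows in a child table keyed by the parent _id, and an incremental filter limits the pull to new documents.

```python
# Sketch: flatten MongoDB documents into a parent CSV plus a child CSV per nested array.
# Collection, field, and file names are placeholders.
import csv
from datetime import datetime
from pymongo import MongoClient

orders = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

# Incremental refresh: only pull documents newer than the last successful load.
# In a real job this watermark would be persisted somewhere durable.
last_loaded = datetime(1970, 1, 1)

with open("orders.csv", "w", newline="") as parent_f, \
     open("order_items.csv", "w", newline="") as child_f:
    parent = csv.writer(parent_f)
    child = csv.writer(child_f)
    parent.writerow(["order_id", "customer", "created_at"])
    child.writerow(["order_id", "sku", "qty"])  # order_id acts as the foreign key

    for doc in orders.find({"created_at": {"$gt": last_loaded}}):
        parent.writerow([doc["_id"], doc.get("customer"), doc.get("created_at")])
        for item in doc.get("items", []):  # nested array -> rows in the child table
            child.writerow([doc["_id"], item.get("sku"), item.get("qty")])
```

The resulting CSV files can then be staged to S3 and loaded into Redshift with COPY.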

Which dashboard analytics tool will support a Parse.com data source?

I've developed an app that uses Parse.com as the back end. I now need a dashboard analytics software package (such as iDashboards) that will enable me to pull data from my Parse.com database classes and present some of that data in a pretty dashboard fashion.
iDashboards looks to be the kind of tool I'm after, but it only supports certain data source inputs such as JDBC, ODBC, SQL, MySQL, etc. Not being a database guru by any means, I'm not sure if Parse.com can be classed as any of the above, but from what I've read it doesn't come under any of these categories.
Can anybody recommend a way of either connecting Parse.com to iDashboards, or suggest another dashboard tool that will support Parse.com as a data source?
The main issue you are facing is that data coming out of Parse.com is going to be in JSON format, whereas most dashboards are going to prefer CSV files.
The best dashboard I am aware of is Tableau, and there is a discussion about getting JSON into Tableau here: http://community.tableau.com/ideas/1276
If your preference is iDashboards, then you need to convert the JSON coming out of Parse into a CSV format that iDashboards can consume. You can do that using RJSON as mentioned in the post above, but you'll probably have an easier time with a simple PHP or Python script that periodically connects to Parse, pulls out data updates for you, and then pushes them to your dashboard of choice.
Converting JSON to CSV in PHP is addressed here: Converting JSON to CSV format using PHP
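As a rough sketch of that Python-script option, assuming Parse's REST API and placeholder keys, class name, and columns, something like this pulls a class from Parse and flattens the JSON results into a CSV that iDashboards can consume:

```python
# Sketch: pull a Parse class over the REST API and write it out as CSV.
# The class name, keys, and column list are placeholders.
import csv
import requests

resp = requests.get(
    "https://api.parse.com/1/classes/GameScore",
    headers={
        "X-Parse-Application-Id": "YOUR_APP_ID",
        "X-Parse-REST-API-Key": "YOUR_REST_KEY",
    },
)
resp.raise_for_status()
rows = resp.json()["results"]  # Parse wraps query results in a "results" array

columns = ["objectId", "playerName", "score", "createdAt"]
with open("gamescore.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)  # nested fields would need additional flattening
```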
The difference is much more fundamental than "unsupported file format". In fact, JSON data coming out of Parse is stored in a so-called denormalized form, which means that a single JSON data file may contain the equivalent of arbitrarily many tables in a relational database. Stated differently, one JSON file may translate into potentially many CSV files, and there is no unique choice of how to perform that translation.
This is a so-called ETL problem, where ETL stands for Extract-Transform-Load. As such, you may be interested in open source ETL tools such as Kettle. Kettle is supported by Pentaho and includes functionality that can help you develop a workflow to turn JSON data into multiple CSV files that can then be imported into iDashboards (or similar). Aside from Kettle, Talend is also widely used for this purpose and has the same ability.
Finally, note that Parse is powered by MongoDB, and exports JSON data that is easily stored and manipulated in MongoDB. As such, a natural fit for reporting on Parse data is any reporting tool built for MongoDB.
As of the time of this writing, there are two such options:
JSON Studio, which is a commercial solution that is built explicitly for MongoDB and has your stated capability to produce dashboards.
SlamData, which is an open source solution, also built for MongoDB, which allows native SQL on the database. The current version does not have reporting capabilities (just CSV export), but the 2.09 version due out in June has reporting dashboards baked in.
An advantage of using a MongoDB reporting tool is that you will not have to wrangle your data into relational form. If it's heavily nested, using arrays, and so forth, it can be quite painful to develop an ETL workflow and keep it in sync with how the data is changing. Instead, all you have to do is build a script to pipe the raw data from Parse into a MongoDB instance (perhaps hosted by MongoLab or equivalent, if you don't want to host it yourself), and connect the MongoDB reporting tool on top.
You might also contact Parse and see if they have a recommended solution for this. It occurs to me they should probably bake some sort of analytical / reporting functionality into their APIs as this is such a common use case.
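If you do go the MongoDB-reporting-tool route, the piping script can stay very small. Here is a minimal sketch, again assuming the Parse REST API with placeholder credentials and names, which upserts on objectId so repeated runs stay idempotent:

```python
# Sketch: copy raw Parse documents into a MongoDB collection for reporting tools.
# The class name, keys, and MongoDB URI are placeholders.
import requests
from pymongo import MongoClient, ReplaceOne

resp = requests.get(
    "https://api.parse.com/1/classes/GameScore",
    headers={
        "X-Parse-Application-Id": "YOUR_APP_ID",
        "X-Parse-REST-API-Key": "YOUR_REST_KEY",
    },
)
resp.raise_for_status()
docs = resp.json()["results"]

collection = MongoClient("mongodb://mongolab-host:27017")["reporting"]["gamescore"]

# Upsert on the Parse objectId so the script can be re-run without duplicating rows.
if docs:
    collection.bulk_write([
        ReplaceOne({"_id": d["objectId"]}, dict(d, _id=d["objectId"]), upsert=True)
        for d in docs
    ])
```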
You can use Axibase Time Series Database (ATSD) to ingest your data from Parse.com; it has built-in dashboards and widgets for visualization, or you can just export data from ATSD to CSV and use iDashboards.

How to import data from eXist database to PostgreSQL database?

Is there any extension/tool/script available to import data from an eXist database into a PostgreSQL database automatically?
From the tag description it's pretty clear that you're going to need to use an ETL tool or some custom code. Which is easier depends on the nature of the data and how you want to migrate it.
I'd start by looking at Talend Studio and Pentaho Kettle. See if either of them can meet your needs.
If you can turn the eXist data into structured CSV exports, then you can probably just hand-define tables for it in PostgreSQL and COPY the data into them, or use pgloader.
If not, then I'd suggest picking the language you're most familiar with (Python, Java, whatever) and using the eXist data connector for that language along with the PostgreSQL data connector for the same language. Write a script that fetches data from eXist and feeds it to PostgreSQL. If using Python, I'd use the psycopg2 database connector, as it's fast and supports COPY for bulk data loading.
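Here is a minimal sketch of that scripted approach, assuming eXist's default REST endpoint and placeholder collection, element, credential, and table names: fetch XML from eXist, flatten it into a tab-separated buffer, and bulk-load it into PostgreSQL through psycopg2's COPY support.

```python
# Sketch: query eXist over its REST interface, flatten the XML, COPY into PostgreSQL.
# The eXist URL, XPath, element names, credentials, and target table are placeholders.
import io
import xml.etree.ElementTree as ET

import psycopg2
import requests

# Query eXist (assumes the default REST endpoint and a /db/books collection).
resp = requests.get(
    "http://localhost:8080/exist/rest/db/books",
    params={"_query": "//book", "_howmany": "10000"},
    auth=("admin", "password"),
)
resp.raise_for_status()
root = ET.fromstring(resp.content)

# Build a tab-separated buffer that COPY can ingest.
buf = io.StringIO()
for book in root.iter("book"):
    title = book.findtext("title", default="")
    author = book.findtext("author", default="")
    buf.write(f"{title}\t{author}\n")
buf.seek(0)

conn = psycopg2.connect("dbname=library user=etl")
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS books (title text, author text)")
    cur.copy_from(buf, "books", columns=("title", "author"))
```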

Will SSIS work well for importing to multiple tables?

I won't have access to SSIS until tomorrow so I thought I'd ask for advice before I start work on this project.
We currently use Access to store our data. It's not stored in a relational format, so it's an awful mess. We want to move to a centralized database (SQL Server 2008 R2), which would require rewriting much of our codebase (which, incidentally, is also an awful mess). Due to a time constraint, well before that can be done we are going to need a centralized database set up solely for on-demand report generation for a client. So, our applications will still be running on Access. Instead of:
Receive data -> Import to Access initial file with one table -> Data processing -> Access result file with one table -> Report generation
The goal is:
Receive data -> Import to Access initial file with one table -> Import initial data to multiple tables in SQL Server -> Export Access working file with one table -> Data processing -> Access result file -> Import result to multiple tables in SQL Server -> Report generation whenever
We're going to use SSRS for the reporting component, which seems like it'll be straightforward enough. I'm not sure if SSIS alone would work well for splitting the Access data up into numerous tables, or if everything should be imported into a staging table with SSIS and then split up with stored procedures, or if I'll need to be writing a standalone application for this.
Haven't done much of any work with SQL Server before, so any advice is appreciated.
In an SSIS package you can write code (e.g. C#) to do your own custom data transformations. However, SSIS comes with built-in transformations that may be good enough for your needs. SSIS is very powerful and flexible; you can do pretty much anything you want with the data in it.
The high-level workflow for your task could look like the following:
1. Connect to the data source and pull the data
2. Transform the data
3. Output data to the destination data source
You certainly can split a data flow into two separate branches and send it to two destinations. All you need to do is put a Multicast transformation in the data flow, and then the bulk of the transformations will happen after that.
From what you've said, however, a better solution might be to use the Access tables as a staging database and then grab the data from there and send it to SQL Server. That would mean two data flows, but it would be a cleaner implementation.