Hive Backup and Restore - hive

I want to take a monthly or quarterly backup of both Hive metadata and Hive data at once, for more than 1000 tables, with an easy restore capability. So far I have found the options below, but I am not sure which is best for backing up Hive tables in production. Any tips?
Apache Falcon - http://saptak.in/writing/2015/08/11/mirroring-datasets-hadoop-clusters-apache-falcon
Pro: Easily available as a service in Ambari for install
Con: No community support
Hortonworks Data flow - https://docs.hortonworks.com.s3.amazonaws.com/HDPDocuments/Ambari-2.7.4.0/bk_ambari-upgrade-major/content/prepare_hive_for_upgrade.html
Pro: The most recent option
Con: Not much documentation to test with. Please share any resources on how to back up with Hortonworks Data Flow.
Other ways - Hive data backup with DistCp, Export/Import, or snapshots, plus Hive metadata backup using relational database dumps
Con: Not sure whether both Hive data and Hive metadata get backed up at the same time. Time-consuming to implement a monthly/quarterly scheduler.
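For the "other ways" option, the two pieces can at least be coordinated from one scheduled job. Below is a minimal sketch, assuming a MySQL-backed metastore and hypothetical HDFS paths; it builds a DistCp command for the warehouse data and a mysqldump command for the metadata, stamped with the same month so the pair can be restored together.

```python
from datetime import date

def build_backup_commands(warehouse_dir, backup_root, metastore_db):
    """Build the shell commands for one coordinated monthly Hive backup:
    DistCp for the warehouse data and a dump of the (assumed MySQL)
    metastore database, both tagged with the same date stamp."""
    stamp = date.today().strftime("%Y-%m")
    distcp = [
        "hadoop", "distcp",
        "-update",                       # only copy changed files
        "-p",                            # preserve permissions/timestamps
        warehouse_dir,
        f"{backup_root}/{stamp}/warehouse",
    ]
    # For a Postgres-backed metastore, use pg_dump instead.
    metadump = [
        "mysqldump", "--single-transaction", metastore_db,
        f"--result-file=metastore-{stamp}.sql",
    ]
    return distcp, metadump

# Hypothetical cluster paths and metastore DB name:
distcp_cmd, dump_cmd = build_backup_commands(
    "hdfs://nn:8020/apps/hive/warehouse",
    "hdfs://backup-nn:8020/backups/hive",
    "hive_metastore",
)
# A monthly cron/Oozie job would then run both, e.g.:
# subprocess.run(distcp_cmd, check=True)
# subprocess.run(dump_cmd, check=True)
```

Restoring is the reverse: DistCp the warehouse directory back and replay the matching SQL dump into the metastore.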

Related

Data Migration to Amazon S3 buckets

Experts: I have a situation where I need to transfer incremental data (every 5 minutes) and daily data from an application database with about 500+ tables to S3 for a lakehouse implementation. The data volume per 5-minute interval is less than 0.5 million records. In the current setup, SQL Server CDC copies the data to another SQL ODS, and from there it flows into 2 different data marts used for operational reporting.
I need your expertise to answer the questions below:
If we choose AWS Glue to transfer data to S3, do I need to write 500+ Glue jobs, one for each table? Is this the right way of doing it? Are there other tools or technologies that can transfer the data more easily?
If we had to do both incremental (every 5 minutes) and batch (hourly/daily) transfers, can the same jobs be used? If yes, where and how do I configure the time period for extraction?
If more tables or columns get added in the source database, do I need to keep writing additional jobs, or can I write a template job and call it with parameters?
Are there any other tools (apart from Glue and AWS CloudWatch) to monitor delays, failures, and long-running jobs?
You can use AWS DMS to migrate data to an S3 target. DMS also supports CDC, which means it can sync changes after the initial migration.
To transfer data, for example from on-prem to the cloud, you need a replication instance. This can be any tier, based on the size of the data transfer.
Then a replication task has to be created. This can be executed immediately or scheduled to run at periodic intervals.
This use case can be solved by using AWS Database Migration Service. AWS Database Migration Service (AWS DMS) is a cloud service that makes it easy to migrate relational databases, data warehouses, NoSQL databases, and other types of data stores. You can use AWS DMS to migrate your data into the AWS Cloud or between combinations of cloud and on-premises setups.
See the documentation for more information.
AWS Database Migration Service User Guide
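To connect this back to the "500+ jobs" question: with DMS you write table-mapping rules rather than one job per table, so adding tables means adding rules, not jobs. Below is a sketch using boto3; the schema, table names, and ARNs are placeholders, not values from your environment.

```python
import json

def table_mappings(schema, tables):
    """Build DMS table-mapping rules selecting which source tables to
    replicate -- one selection rule per table, instead of one Glue job
    per table."""
    rules = [
        {
            "rule-type": "selection",
            "rule-id": str(i + 1),
            "rule-name": f"include-{t}",
            "object-locator": {"schema-name": schema, "table-name": t},
            "rule-action": "include",
        }
        for i, t in enumerate(tables)
    ]
    return json.dumps({"rules": rules})

# Hypothetical schema and tables:
mappings = table_mappings("dbo", ["orders", "customers"])

# With real endpoint/instance ARNs, you would then create the task:
# import boto3
# dms = boto3.client("dms")
# dms.create_replication_task(
#     ReplicationTaskIdentifier="sqlserver-to-s3",
#     SourceEndpointArn=SOURCE_ENDPOINT_ARN,
#     TargetEndpointArn=TARGET_ENDPOINT_ARN,
#     ReplicationInstanceArn=REPLICATION_INSTANCE_ARN,
#     MigrationType="full-load-and-cdc",   # initial load + ongoing CDC
#     TableMappings=mappings,
# )
```

The `full-load-and-cdc` migration type covers both the initial batch copy and the continuous 5-minute change stream in a single task.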

Accessing Spark Tables (Parquet on ADLS) using Presto from a Local Linux Machine

I would like to know whether we can access Spark external tables (with MS SQL as the metastore and external files on Azure Data Lake) using the Hive Metastore service, from Presto on a Linux machine.
We are trying to access Spark Delta tables backed by Parquet files on ADLS through Presto. Below is the scenario; I would like to know if there is a way to achieve this. We are doing this as a POC only, and knowing the answer will take us to the next step.
Our central data repository consists of Spark Delta tables created by many pipelines. The data is stored in Parquet format, and MS SQL is the external metastore. Data in these Spark tables is used by other teams/applications, and they would like to access it through Presto.
We learnt that Presto uses the Hive metastore service to access table details, so we tried accessing the tables from Hive first (thinking that if this works, Presto will also work). But we ran into problems with the different file systems. We have set up Hadoop and Hive on a single Linux machine (versions 3.1.2 and 3.1.1, respectively). The Hive service connects to the SQL metastore and shows results for a few basic commands. However, when it comes to accessing the actual data stored as Parquet at an ADLS path, it fails with a file system exception. I understand that the problem involves the interaction of several file systems (ADLS, HDFS, the local Linux filesystem), but I cannot find any blog that guides us through it. Kindly help.
Hive Show Database command:
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> SHOW DATABASES;
OK
7nowrtpocqa
c360
default
digital
Hive Listing tables:
hive> SHOW TABLES;
OK
amzn_order_details
amzn_order_items
amzn_product_details
Query data from Orders table:
hive> select * from dlvry_orders limit 3;
OK
Failed with exception java.io.IOException:org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "dbfs"
Time taken: 3.37 seconds
How can I make my setup access the Data Lake files and bring in the data?
I believe my metastore stores the exact full path of the ADLS location where the files live. If so, how will Hive/Hadoop on my Linux machine understand that path?
Even if it can recognize the path, in which configuration file (which .xml) should I put the credentials for accessing the Data Lake?
How can the different file systems interact?
Kindly help. Thanks for all the inputs.
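Two observations, hedged since we cannot see your metastore contents. First, the exception names the scheme "dbfs", which is Databricks' own filesystem; if the table locations registered in the metastore are dbfs:/ paths, no amount of configuration will make plain Hive or Presto outside Databricks resolve them; the tables would need to be registered with direct ADLS (abfss://) locations instead. Second, assuming abfss:// paths, ADLS Gen2 credentials go into Hadoop's core-site.xml (and the hadoop-azure ABFS connector jars must be on the classpath). A minimal shared-key sketch, with a placeholder storage account name and key:

```xml
<!-- core-site.xml: shared-key access to ADLS Gen2 via the ABFS connector.
     "mystorageacct" and the key value are placeholders. -->
<property>
  <name>fs.azure.account.key.mystorageacct.dfs.core.windows.net</name>
  <value>YOUR_STORAGE_ACCOUNT_KEY</value>
</property>
```

With this in place, paths of the form abfss://container@mystorageacct.dfs.core.windows.net/path become readable; OAuth/service-principal auth is also supported via the fs.azure.account.oauth2.* properties if a raw account key is not acceptable.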

Data migration from teradata to bigquery

My requirement is to migrate data from a Teradata database to Google BigQuery, keeping the table structure and schema unchanged. Later, I want to generate reports from the BigQuery database.
Can anyone suggest how I can achieve this?
I think you should try TDCH to export the data to Google Cloud Storage in Avro format. TDCH runs on top of Hadoop and exports data in parallel. You can then import the data from the Avro files into BigQuery.
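The import half of that suggestion is a single `bq load` per table, since Avro files carry their own schema and BigQuery can create the table structure from it. A small sketch that builds the command (dataset, table, and bucket names are hypothetical):

```python
def bq_load_cmd(dataset, table, gcs_uri):
    """Build the bq CLI command to load TDCH-exported Avro files from
    Cloud Storage into a BigQuery table. Avro embeds its own schema,
    so no explicit schema argument is needed."""
    return [
        "bq", "load",
        "--source_format=AVRO",
        f"{dataset}.{table}",
        gcs_uri,
    ]

# Hypothetical dataset/table/bucket:
cmd = bq_load_cmd("td_migration", "sales", "gs://my-bucket/sales/*.avro")
# A driver script would loop this over all exported tables and run
# each command with subprocess.run(cmd, check=True).
```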
I was part of a team that addressed this issue in a Whitepaper.
The white paper documents the process of migrating data from Teradata Database to Google BigQuery. It highlights several key areas to consider when planning a migration of this nature, including the rationale for Apache NiFi as the preferred data flow technology, pre-migration considerations, details of the migration phase, and post-migration best practices.
Link: How To Migrate From Teradata To Google BigQuery
I think you can also try Cloud Composer (Apache Airflow), or install Apache Airflow on an instance.
If you can open the ports from the Teradata DB, you can run the 'gsutil' command from there and schedule it via Airflow/Composer to run the jobs on a daily basis. It's quick, and you can leverage the scheduling capabilities of Airflow.
BigQuery introduced the Migration Service, a comprehensive solution for migrating a data warehouse to BigQuery. It includes free-to-use tools that help with each phase of the migration, from assessment and planning to execution and verification.
Reference:
https://cloud.google.com/bigquery/docs/migration-intro

Backup Strategy in SQL Azure

Is there a standard way of backing up SQL Azure databases to on-premises storage daily?
I know about the option of making backups to Storage, but on a daily basis, backing up to storage blobs will leave us with many files and incur a high cost with Azure.
We need to back up SQL Azure databases directly to on-premises disks.
Maintain only the two latest backups on disk.
Is there a mechanism to schedule a daily backup to a storage blob and keep only two copies at any moment in time?
Not sure if the new SQL Database point-in-time restore feature works for you. You need to enable this preview feature and select a new edition when creating your database. The "Basic" edition supports daily backups retained for the past 24 hours, while the "Standard" and "Premium" editions provide point-in-time restore with backups retained for the past 7 and 35 days, respectively.
Alternatively, you need to programmatically back up the database to a blob and remove the old copies through a worker role, scheduler, or WebJob.

Tool to "Data Load" or "ETL" -- from SQL Server into Amazon Redshift

I am trying to find a decent but simple tool that I can host myself on AWS EC2 and that will let me pull data out of SQL Server 2005 and push it to Amazon Redshift.
I basically have a view in SQL Server on which I am doing SELECT *, and I just need to put all this data into Redshift. The biggest concern is that there is a lot of data, so this needs to be configurable: I should be able to queue it, run it as a nightly/continuous job, etc.
Any suggestions?
alexeypro,
Once you dump the tables to files, you have two fundamental challenges to solve:
Transporting data to Amazon
Loading data to Redshift tables.
Amazon S3 will help you with both:
S3 supports fast upload of files to Amazon from your SQL Server location. See this great article. It is from 2011, but I did some testing a few months back and saw very similar results. I was testing with gigabytes of data, and 16 uploader threads were fine, as I'm not on a backbone connection. The key thing to remember is that compression and parallel upload are your friends for cutting down upload time.
Once the data is on S3, Redshift supports high-performance parallel load from files on S3 into table(s) via the COPY SQL command. To get the fastest load performance, pre-partition your data based on the table's distribution key and pre-sort it to avoid expensive vacuums. All of this is well documented in Amazon's best practices. I have to say these guys know how to make things neat and simple, so just follow the steps.
If you are a coder, you can orchestrate the whole process remotely using scripts in whatever shell/language you want. You'll need tools/libraries for parallel HTTP upload to S3 and command-line access to Redshift (psql) to launch the COPY command.
Another option is Java; there are libraries for S3 upload and JDBC access to Redshift.
As other posters suggest, you could probably use SSIS (or essentially any other ETL tool) as well. I was testing with CloverETL. It took care of automating the process as well as partitioning/presorting the files for load.
Now Microsoft released SSIS Powerpack, so you can do it natively.
SSIS Amazon Redshift Data Transfer Task
Very fast bulk copy from on-premises data to Amazon Redshift in a few clicks
Load data to Amazon Redshift from traditional DB engines like SQL Server, Oracle, MySQL, DB2
Load data to Amazon Redshift from Flat Files
Automatic file archiving support
Automatic file compression support to reduce bandwidth and cost
Rich error handling and logging support to troubleshoot Redshift data warehouse loading issues
Support for SQL Server 2005, 2008, 2012, 2014 (32 bit and 64 bit)
Why SSIS PowerPack?
High performance suite of Custom SSIS tasks, transforms and adapters
With existing ETL tools, an alternate option to avoid staging data in Amazon (S3/Dynamo) is to use the commercial DataDirect Amazon Redshift Driver which supports a high performance load over the wire without additional dependencies to stage data.
https://blogs.datadirect.com/2014/10/recap-amazon-redshift-salesforce-data-integration-oow14.html
For getting data into Amazon Redshift, I made DataDuck http://dataducketl.com/
It's like Ruby on Rails but for building ETLs.
To give you an idea of how easy it is to set up, here's how you get your data into Redshift.
Add gem 'dataduck' to your Gemfile.
Run bundle install
Run dataduck quickstart and follow the instructions
This will autogenerate files representing the tables and columns you want to migrate to the data warehouse. You can modify these to customize it, e.g. remove or transform some of the columns.
Commit this code to your own ETL project repository
Git pull the code on your EC2 server
Run dataduck etl all on a cron job, from the EC2 server, to transfer all the tables into Amazon Redshift
Why not a Python + boto + psycopg2 script?
It will run on an EC2 Windows or Linux instance.
If the OS is Windows, you could:
Extract data from SQL Server (using sqlcmd.exe)
Compress it (using gzip.GzipFile)
Multipart upload it to S3 (using boto)
Append it to an Amazon Redshift table (using psycopg2)
Similarly, it worked for me when I wrote Oracle-To-Redshift-Data-Loader
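The final "append" step of that pipeline can be sketched as follows; the table name, bucket path, and credentials are placeholders, and the Redshift connection details would come from your own cluster:

```python
def build_copy_sql(table, s3_path, access_key, secret_key):
    """Build the Redshift COPY statement that appends gzipped CSV
    files from S3 to a table (the last step of the pipeline above)."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"CREDENTIALS 'aws_access_key_id={access_key};"
        f"aws_secret_access_key={secret_key}' "
        "GZIP DELIMITER ','"
    )

# Hypothetical table, bucket, and credentials:
sql = build_copy_sql("orders", "s3://my-bucket/orders/", "AKIA_EXAMPLE", "SECRET")

# Executed against the cluster with psycopg2, e.g.:
# import psycopg2
# conn = psycopg2.connect(host="mycluster.example.redshift.amazonaws.com",
#                         port=5439, dbname="dev", user="admin", password="...")
# with conn, conn.cursor() as cur:
#     cur.execute(sql)
```

Pointing COPY at an S3 prefix rather than a single file lets Redshift load all the uploaded parts in parallel, which is what makes the multipart-upload step worthwhile.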