Automate the process of Pig, Hive, Sqoop - hive

I have data in HDFS (Azure HDInsight) in CSV format, and I am using Pig to process it. After processing, the summarized data is stored in Hive, and the Hive table is then exported to an RDBMS using Sqoop. Now I need to automate the whole process. Is it possible to write a method for each of these three tasks in a MapReduce job, then run that job so that the tasks execute one after another?
To create the MapReduce job, I want to use the .NET SDK. So my question is: is this possible, and if yes, can you suggest some steps and reference links?
Thank you.

If you need to run those tasks periodically, I would recommend using Oozie. Check out the existing examples; it has fairly good documentation.
If you don't have this framework on your cloud, you can write your own MR job, but if you have Oozie you can write a DAG flow where each action in the graph can be Pig, bash, Hive, HDFS and more.
It can run every X days/hours/minutes and can email you in case of failure.
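For comparison, here is a minimal Python sketch of the "one task after another" behaviour the question asks for. It is a plain sequential runner, not Oozie itself, and it assumes the `pig`, `hive` and `sqoop` CLIs are on the PATH. The script names, JDBC URL, table name and export directory are all hypothetical placeholders.

```python
import subprocess
from typing import Callable, List, Sequence

# Hypothetical script names and connection details; replace with your own.
STEPS: List[List[str]] = [
    ["pig", "-f", "summarize.pig"],
    ["hive", "-f", "load_summary.hql"],
    ["sqoop", "export",
     "--connect", "jdbc:sqlserver://example-host:1433;database=reports",
     "--table", "summary",
     "--export-dir", "/hive/warehouse/summary"],
]


def run_command(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_pipeline(steps: Sequence[Sequence[str]],
                 runner: Callable[[Sequence[str]], int] = run_command) -> List[str]:
    """Run the steps in order and stop at the first non-zero exit code.

    Returns the tool names of the steps that completed successfully.
    """
    completed: List[str] = []
    for cmd in steps:
        if runner(cmd) != 0:
            break  # later steps depend on this one, so do not continue
        completed.append(cmd[0])
    return completed
```

Calling `run_pipeline(STEPS)` would run Pig, then Hive, then Sqoop, skipping the remaining steps once one fails. Scheduling, email alerts and richer DAGs are what Oozie adds on top of this.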

Related

Running BQ Load command using Google cloud scheduler

I would like to run bq load command once every day at 00:00 UTC. Can I use Google Cloud scheduler for scheduling this command?
As mentioned by @Daniel, there is no direct way to use Cloud Scheduler to execute or schedule queries; however, there are several options you can consider for running queries on a schedule.
Use scheduled queries directly in BQ
Since your source is GCS, you can load data from GCS to BQ and then execute scheduled queries as mentioned here
Use a scheduled Cloud Function to run queries
Schedule using Data Transfer
You can also try what @Graham Polley has mentioned in this blog, which requires an architecture combining Cloud Scheduler, Cloud Source Repositories and Cloud Build
Assuming you have a file that is loaded into Cloud Storage every day before 7am, you may consider a more resilient design: when the file is created in Cloud Storage, create a notification that starts the process to load it. This design gets the information into BigQuery earlier, and it keeps working even if the file creation is delayed.
When the file is created in Cloud Storage get a message in PubSub: https://cloud.google.com/storage/docs/pubsub-notifications
Then, a Cloud Function is invoked that will execute the bq load command.
BTW if you have many files or even some dependencies, consider using Cloud Composer as an orchestrator to keep its complexity under control.
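The notification-driven design above could be sketched roughly as follows, assuming a Pub/Sub-triggered Cloud Function written in Python and the `google-cloud-bigquery` client library. The destination table name, the header row and the use of schema autodetection are assumptions for illustration, not details from the question.

```python
from typing import Optional

# Hypothetical destination table; replace with your own project.dataset.table.
DESTINATION_TABLE = "myproject.dataset_id.daily_load"


def gcs_uri_from_notification(event: dict) -> Optional[str]:
    """Return the gs:// URI of a newly finalized object, or None for other events."""
    attrs = event.get("attributes") or {}
    if attrs.get("eventType") != "OBJECT_FINALIZE":
        return None
    return f"gs://{attrs['bucketId']}/{attrs['objectId']}"


def load_on_notification(event: dict, context=None) -> None:
    """Entry point for the Pub/Sub trigger: load the new file into BigQuery."""
    uri = gcs_uri_from_notification(event)
    if uri is None:
        return  # ignore deletes, metadata updates, and malformed messages

    # Imported lazily so the helper above can be used without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumes the CSV has a header row
        autodetect=True,      # assumes schema inference is acceptable
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    # Wait for the load job so that failures surface in the function logs.
    client.load_table_from_uri(uri, DESTINATION_TABLE, job_config=job_config).result()
```

The function only reacts to `OBJECT_FINALIZE` events, so a late file is still loaded as soon as it lands, which is the point of the design.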
You would not be able to do it directly with Cloud Scheduler; you would need an intermediary, such as a Cloud Function, to execute the command. Alternatively, you could try scheduling a data transfer, depending on the requirements of your load job.
Here is an example from the documentation:
https://cloud.google.com/bigquery/docs/cloud-storage-transfer#setting_up_a_cloud_storage_transfer
Based on your update about wanting to shard the table by date, try scheduled queries in the following manner.
Create an external table pointing to the desired path in GCS as described here
Define your query. I recommend defining it with explicit column names and appropriate casting.
SELECT *
FROM myproject.dataset_id.external_table_name
-- INCLUDE FILTERING ON _FILE_NAME IF NEEDED LIKE FOLLOWING:
-- WHERE _FILE_NAME LIKE SOME_VALUE
Create a scheduled query with the run_date parameter in the destination table name, like new_table_{run_date}
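To make the effect of the table-name template concrete, here is a small Python sketch of how `new_table_{run_date}` resolves for a given run. It assumes `{run_date}` renders as the UTC run date in `YYYYMMDD` form, and the helper name `resolve_destination` is hypothetical; BigQuery performs this substitution itself.

```python
from datetime import datetime, timezone


def resolve_destination(template: str, run_time: datetime) -> str:
    """Substitute {run_date} with the UTC run date formatted as YYYYMMDD."""
    run_date = run_time.astimezone(timezone.utc).strftime("%Y%m%d")
    return template.replace("{run_date}", run_date)


# A run at 00:00 UTC on 5 January 2024 writes to one date-suffixed shard.
shard = resolve_destination(
    "new_table_{run_date}",
    datetime(2024, 1, 5, 0, 0, tzinfo=timezone.utc),
)
```

Each daily run therefore writes to its own table, which gives the date-based sharding asked for in the update.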

How can I write streaming Dataflow pipelines that support schema evolution?

I'm building some data streaming pipelines that read from Kafka and write to various sinks using Google Cloud Dataflow. The pipeline looks something like this (simplified).
// Example pipeline that writes to BigQuery.
Pipeline.create(options)
    .apply(KafkaIO.read().withTopic(options.topic))
    .apply(/* Convert to a Row type */)
    .setRowSchema(schemaRegistry.lookup(options.topic))
    .apply(
        BigQueryIO.write<Row>()
            .useBeamSchema()
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withProject(options.outputProject)
            .withDataset(options.outputDataset)
            .withTable(options.outputTable)
    )
I plan to run a pipeline for each of our Kafka topics, of which there are hundreds. The pipeline looks up the schema for the given topic during the planning stage. This allows BigQueryIO to create the necessary tables before starting the pipeline.
Question: How can I support evolving schemas in my Dataflow pipelines?
I've explored the option of updating an existing Dataflow job (using the --update flag). The thought is that I could automate the process of submitting an updated job whenever a schema changes. But updating a job seems to incur about 3 minutes of downtime. For some of the jobs, that much downtime won't work. I'm looking for other solutions that hopefully have no more than a few seconds of downtime.

Using AWS Glue to Create a Table and move the dataset

I've never used AWS Glue, but I believe it will deliver what I want, and I am after some advice. I have a monthly CSV data upload that I push to S3, with a staging Athena table (all strings) associated with it. I want Glue to perform a Create Table As (with all necessary converts/casts) against this dataset, writing it in Parquet format, and then move that dataset from one S3 bucket to another so that the primary Athena table can access the data.
As stated, never used Glue before, and want a starter for 10, so I don't go down rabbit holes.
I currently perform all these steps manually, so want to understand how to use Glue to automate my manual tasks.
Yes, you can use AWS Glue ETL jobs to do exactly what you described. However, Glue doesn't perform CREATE TABLE AS SELECT queries; instead, it does the work with ETL jobs based on Spark. Here is a GitHub repo that describes such a process in quite a detailed way, and here is the official AWS documentation on ETL programming with the AWS Glue service. After the initial setup, you can define trigger events or a schedule to run your Glue ETL jobs automatically.
However, one thing to remember is the cost of using AWS Glue. Since it is based on execution time, it is not always trivial to forecast the final cost. For the workflow you described, performing CTAS queries with Athena would work just fine to transform your data and write it into a different S3 bucket. In this case you would know the price exactly, since it depends on the size of your data. Then you can use the AWS API to manipulate the metadata catalog, so that the new information is accessible and in one place.
Since you are new to AWS Glue ETL jobs, I would suggest sticking with CTAS queries for simple tasks (although you can come up with quite complicated queries) and looking into the open source project Apache Airflow for automation, scheduling and orchestration. This is the approach that I am using for tasks similar to yours. Airflow is easy to set up on both local and remote machines, has a rich CLI and GUI for task monitoring, and abstracts away all scheduling and retrying logic. It even has hooks to interact with AWS services. Hell, Airflow even provides you with a dedicated operator for sending queries to Athena. I wrote a little bit more about this approach here.
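As a rough illustration of the CTAS route, the Python sketch below builds an Athena CTAS statement that casts the all-string staging columns and writes Parquet to a second bucket, then submits it with `boto3`. The table names, column types and S3 locations are hypothetical, and the helper names `build_ctas` and `submit` are made up for this example.

```python
from typing import Dict


def build_ctas(target: str, source: str, casts: Dict[str, str], location: str) -> str:
    """Build an Athena CTAS statement that casts columns and writes Parquet."""
    select_list = ",\n  ".join(
        f"CAST({col} AS {typ}) AS {col}" for col, typ in casts.items()
    )
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (format = 'PARQUET', external_location = '{location}')\n"
        f"AS SELECT\n  {select_list}\n"
        f"FROM {source}"
    )


def submit(query: str, output_location: str) -> str:
    """Send the query to Athena and return the query execution id."""
    import boto3  # imported lazily; requires AWS credentials at call time

    response = boto3.client("athena").start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical names for illustration only.
query = build_ctas(
    target="analytics.sales_2024_01",
    source="staging.sales_raw",
    casts={"order_id": "bigint", "amount": "double"},
    location="s3://example-target-bucket/sales/2024-01/",
)
```

Because `external_location` points at the target bucket, the CTAS both converts the data and places it where the primary table can read it. A scheduler such as Airflow would then only need to call `submit(query, ...)` once a month.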

Moving data from hdfs to sql

I'm testing my setup and I need to move data from HDFS to a SQL DB as soon as the data is generated. What I mean is: once the MapReduce job completes, it sends an ActiveMQ message. When I receive that message, I need to move the data to SQL automatically using Sqoop. Can someone help me achieve this?
Can someone let me know whether MQ and Sqoop work together?
Thank you.
I am not entirely clear about the use case, but you can set up an Oozie workflow. The Sqoop job will only start once the MapReduce job is complete. You can actually create a complex DAG using Oozie. The Oozie workflow can in turn be invoked from a remote Java client.
Hope this helped.

Automatic Hive or Cascading for ETL in AWS-EMR

I have a large dataset residing in AWS S3. It is typically transactional data (like call records). I run a sequence of Hive queries that apply aggregations and filtering conditions to produce a couple of final compact files (CSVs with millions of rows at most).
So far with Hive, I have had to run one query after another manually (as some queries fail now and then due to problems in AWS, etc.).
I have processed 2 months of data this way so far.
But for subsequent months, I want to write a workflow that executes the queries one by one and, if a query fails, reruns it. This can't be done by running Hive queries in a bash .sh file (my current approach, at least).
hive -f s3://mybucket/createAndPopulateTableA.sql
hive -f s3://mybucket/createAndPopulateTableB.sql ( this might need Table A to be populated before executing).
Alternatively, I have been looking at Cascading wondering whether it might be the solution to my problem and it does have Lingual, which might fit the case. Not sure though, how it fits into the AWS ecosystem.
Ideally there would be some Hive query workflow process. Otherwise, what other options do I have in the Hadoop ecosystem?
Edited:
I am looking at Oozie now, though I am facing a sh!tload of issues setting it up in EMR. :(
You can use AWS Data Pipeline:
AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available
You can configure it to perform or retry actions when a script fails, and it supports Hive scripts: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-hiveactivity.html
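Independently of Data Pipeline, the "run in order, rerun on failure" behaviour described in the question can be sketched in a few lines of Python. This is only an illustration that assumes the `hive` CLI is available on the node; the retry count and delay are arbitrary choices, and the function names are hypothetical.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_hive_script(path: str) -> int:
    """Run one Hive script with `hive -f` and return its exit code."""
    return subprocess.run(["hive", "-f", path]).returncode


def run_with_retries(scripts: Sequence[str],
                     runner: Callable[[str], int] = run_hive_script,
                     max_attempts: int = 3,
                     delay_seconds: float = 60.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run scripts in order, retrying each failed script up to max_attempts times.

    Returns True if every script eventually succeeded, False otherwise.
    """
    for script in scripts:
        for attempt in range(1, max_attempts + 1):
            if runner(script) == 0:
                break  # this script succeeded; move on to the next one
            if attempt < max_attempts:
                sleep(delay_seconds)  # wait before retrying a transient failure
        else:
            # Every attempt failed; later scripts may depend on this one.
            return False
    return True
```

For the two scripts in the question, this would be called as `run_with_retries(["s3://mybucket/createAndPopulateTableA.sql", "s3://mybucket/createAndPopulateTableB.sql"])`, so Table B is only attempted after Table A has been populated.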