I'm testing my setup and need to move data from HDFS to a SQL database as soon as the data is generated. That is, once the MapReduce job completes, it sends an ActiveMQ message, and when I receive that message I need Sqoop to move the data to SQL automatically. Can someone help me achieve this?
Can someone let me know whether ActiveMQ and Sqoop work together?
Thank you.
I am not entirely clear about the use case, but you can set up an Oozie workflow in which the Sqoop job starts only once the MapReduce job is complete. You can build a fairly complex DAG with Oozie, and the Oozie workflow can in turn be invoked from a remote Java client.
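If you would rather react to the ActiveMQ message directly than go through Oozie, the glue code can be quite small. The following is only a sketch under stated assumptions: the message body is assumed to be JSON with hypothetical `status` and `output_dir` fields, the JDBC URL, table name, and password file are placeholders, and the ActiveMQ listener itself is left out. It shows only how a received message could be turned into a `sqoop export` command line.

```python
import json
import shlex


def build_sqoop_export(export_dir, jdbc_url, table, username, password_file, mappers=4):
    """Return the argv list for a `sqoop export` of one HDFS directory."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--export-dir", export_dir,
        "--num-mappers", str(mappers),
    ]


def on_job_finished(message_body, jdbc_url, table, username, password_file):
    """Turn a 'job finished' message into a Sqoop command, or None if the job failed."""
    # Assumption: the producer sends JSON with "status" and "output_dir" fields.
    event = json.loads(message_body)
    if event.get("status") != "SUCCEEDED":
        return None
    return build_sqoop_export(event["output_dir"], jdbc_url, table, username, password_file)


# Hypothetical message and connection details, for illustration only.
sample = json.dumps({"status": "SUCCEEDED", "output_dir": "/data/out/run-42"})
cmd = on_job_finished(
    sample,
    "jdbc:mysql://db.example.com/reports",
    "daily_summary",
    "etl",
    "/user/etl/.db-password",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be handed to `subprocess.run` from whatever callback your message client invokes when a message arrives.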
Hope this helped.
I am running Spark 2.4.4 on AWS EMR and see a long delay after Spark writes Parquet files to S3. I checked, and the S3 write itself completes within a few seconds (the data files and the _SUCCESS file are present in S3). But it still takes around 5 minutes before the following jobs start.
I saw someone call this the "Parquet Tax". I have tried the fixes proposed in those articles but still cannot resolve the issue. Can anyone give me a hand? Thanks so much.
You can start with spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2.
You can set this config by using any of the following methods:
1) When you launch your cluster, put spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 in the Spark config.
2) At runtime, call spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2").
3) When you write data using the Dataset API, set it as a write option, i.e. dataset.write.option("mapreduce.fileoutputcommitter.algorithm.version", "2").
That delay is the overhead of the commit-by-rename committer: S3 has no native rename, so the committer has to fake one by copying and deleting files.
Switch to a higher-performance committer, e.g. ASF Spark's "zero rename committer" or the EMR clone, "fast spark committer".
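For the algorithm-version setting suggested in the first answer, here is a rough sketch of the cluster-launch and per-application routes. It assumes you pass EMR a `spark-defaults` configuration classification when creating the cluster, or add `--conf` to `spark-submit`; the helper names and the `etl_job.py` file name are made up for illustration.

```python
import json

COMMITTER_KEY = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"


def emr_spark_defaults(version="2"):
    """Build an EMR configuration list that sets the committer algorithm cluster-wide."""
    return [{"Classification": "spark-defaults", "Properties": {COMMITTER_KEY: version}}]


def spark_submit_args(app, version="2"):
    """Build spark-submit arguments that set the same property for a single application."""
    return ["spark-submit", "--conf", f"{COMMITTER_KEY}={version}", app]


# JSON to supply as the cluster's software configuration at launch.
print(json.dumps(emr_spark_defaults(), indent=2))
# Command line for one job; "etl_job.py" is a placeholder.
print(" ".join(spark_submit_args("etl_job.py")))
```

Note that this only changes the rename-based committer's algorithm; it does not replace it with one of the rename-free committers mentioned above.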
I am doing some DataFrame work for ETL; the DataFrame is read from Azure Data Warehouse. Somehow the notebook hangs forever, and I don't know where it hangs or why it hangs for so long.
Does anyone have an idea or experience with this?
There are various rare scenarios and corner cases that can cause a streaming or batch job to hang. It is also possible for a job to hang because the Databricks internal metastore has become corrupted.
Often restarting the cluster or creating a new one resolves the problem. In other cases, run the following script to unhang the job and collect notebook information, which can be provided to Databricks Support.
For more details, refer to "How to Resolve Job Hangs and Collect Diagnostic Information".
Hope this helps.
I have a use case.
We have a Java framework that parses real-time data from Kinesis into a Hive table every half hour.
I need to access this Hive table and do some processing in near real time. An hour's delay is fine, as I don't have permission to access the Kinesis stream.
Once processing is done in Spark (PySpark preferably), I have to create a new Kinesis stream and push the data to it.
I will then use Splunk to pull it in near real time.
The question is: has anyone done Spark streaming from Hive using Python? I have to do a POC first and then the actual work.
Any help will be highly appreciated.
Thanks in advance!!
There are 2 ways to go ahead on this:
1) Use Spark Streaming to obtain messages directly from Kinesis. That will give you something that is real time.
2) Once the file drops into your staging area (either your Hive warehouse or some other HDFS location), pick it up for processing using Spark Streaming for files.
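The second option could be sketched roughly as follows, using Structured Streaming's file source, which is one way to do "streaming for files" in PySpark (`foreachBatch` needs Spark 2.4 or later). The function name is invented here, and the storage format, the directory layout, and the 30-minute trigger are assumptions, so adjust them to how the Hive table is actually stored.

```python
def start_file_stream(spark, staging_dir, schema, process_batch, checkpoint_dir):
    """Start a streaming query that picks up new files as they land in staging_dir.

    spark         -- an existing SparkSession
    schema        -- StructType or DDL string; file sources need an explicit schema
    process_batch -- function(batch_df, batch_id) called once per micro-batch
    """
    incoming = (
        spark.readStream
        .schema(schema)
        .format("parquet")  # assumption: change to match the table's storage format
        .load(staging_dir)
    )
    return (
        incoming.writeStream
        .foreachBatch(process_batch)  # per-batch processing and the push to Kinesis go here
        .option("checkpointLocation", checkpoint_dir)
        .trigger(processingTime="30 minutes")  # assumption: matches the half-hourly load
        .start()
    )
```

You would call it with your own session, for example `start_file_stream(spark, "/path/to/table", "id INT, payload STRING", my_batch_fn, "/path/to/checkpoints")`, where the paths, schema, and `my_batch_fn` are placeholders. Writing each batch to the new Kinesis stream is left to `process_batch`.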
Do let us know which approach worked best for you.
I have data in HDFS (Azure HDInsight) in CSV format. I am using Pig to process this data. After processing in Pig, the summarized data is stored in Hive, and then the Hive table is exported to an RDBMS using Sqoop. Now I need to automate this whole process. Is it possible to write a method for each of these 3 tasks in MapReduce, then run that MapReduce job so that the tasks execute one after another?
To create the MapReduce job, I want to use the .NET SDK. So my question is: is this possible, and if yes, can you suggest some steps and reference links?
Thank you.
If you need to run those tasks periodically, I would recommend using Oozie. Check out the existing examples; it has fairly good documentation.
If you don't have this framework on your cloud, you can write your own MR, but if you have Oozie you can write a DAG flow where each action on the graph can be pig/bash/hive/hdfs and more.
It can run every X days/hours/minutes and can email you in case of failure.
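If Oozie is not available, the same "run each step only if the previous one succeeded" behaviour can be approximated with a small driver script. This is a stand-in for an Oozie workflow, not Oozie itself, and it has no scheduling or email alerting. The script names, JDBC URL, table, and paths below are hypothetical.

```python
import subprocess


def run_pipeline(steps, runner=subprocess.call):
    """Run (name, argv) steps in order and stop at the first non-zero exit code.

    Returns the name of the failed step, or None if every step succeeded.
    """
    for name, argv in steps:
        if runner(argv) != 0:
            return name
    return None


# Hypothetical script names and connection details, for illustration only.
STEPS = [
    ("pig-summarize", ["pig", "-f", "summarize.pig"]),
    ("hive-load", ["hive", "-f", "load_summary.hql"]),
    ("sqoop-export", ["sqoop", "export",
                      "--connect", "jdbc:sqlserver://db.example.com;databaseName=reports",
                      "--table", "summary",
                      "--export-dir", "/hive/warehouse/summary"]),
]
```

Calling `run_pipeline(STEPS)` on a cluster node runs Pig, then Hive, then Sqoop. The `runner` parameter exists so the ordering logic can be exercised without a cluster.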
There is plenty of documentation about how to write Pig UDFs in various languages, but I haven't found anything on how they are distributed to the data nodes.
Is this done automatically when a Pig script is invoked? If it makes any difference, I'd be writing the UDF in Java.
Let me make it more clear. When we write a UDF and Pig is running in HDFS mode, the UDF, which initially resides on the local (client) side, is shipped to the cluster by Hadoop's internal machinery. The UDF's work is then performed by a task tracker, and it is the job tracker's duty to assign that work to a task tracker close to the data node where the input file resides.
Note: It is always the job tracker (a separate master daemon, often hosted on the same machine as the name node) that decides which task tracker should execute the UDFs.
If the input file is in the local file system (local mode), then the UDFs are executed in the local JVM.
The fact is that Apache Pig works in two modes:
1) local mode
2) MapReduce mode (referred to here as hdfs mode)
To answer your question, which concerns Pig running in HDFS mode: we only need to make sure that the input file we are loading is present in HDFS (on the data nodes). As for the UDF, it is simply a function used to process the input file, just like Pig Latin itself. We write both the UDFs and the Pig Latin on the client-side node, so all of that code is stored on the client machine.
Above all, we have to configure Pig so that the client can interact with HDFS to produce the required result.
Hope this helps