I am trying to profile a Pig query but haven't gotten anything useful so far.
I am trying to measure CPU, disk I/O, and RAM usage.
Can anyone guide me on this?
Things tried so far:
- Starfish - works with Hadoop jobs but NOT with Pig; it does not support Pig queries.
- Hprof - works with Hadoop jobs but NOT with Pig queries; it generates profile files only for Hadoop jobs.
Both Hadoop and Pig jobs are executed on the same cluster.
Thanks for reading!
You could get some latency data using JXInsight/Opus (which is free) by marking or tagging the cluster before executing the query and then taking a snapshot after the job completes.
http://www.jinspired.com/site/jxinsight-opus-1-0
We will be coming out with "JXInsight/Opus for X" editions for various big data platforms, including Cassandra, Hadoop, Pig, ...
If you need more power and more meters (CPU, I/O, ...), you can always look at the JXInsight/OpenCore product.
You can use HPROF or other profiling tools on the MapReduce jobs generated by Pig. See https://cwiki.apache.org/confluence/display/PIG/HowToProfile
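For a plain MapReduce job you turn HPROF on through the job configuration; since Pig compiles a query into ordinary MapReduce jobs, the same properties apply (in a Pig script you would pass them with Pig's set command, e.g. "set mapred.task.profile true;", which is what the wiki page above describes). Below is a minimal Java sketch, assuming the classic mapred.* property names of that Hadoop generation (newer releases use the mapreduce.task.profile* equivalents):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Enables HPROF for a subset of the job's tasks. The profile output
    // (profile.out) ends up next to the task logs on each node.
    public class ProfiledJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Turn task profiling on and pick which task attempts to profile.
        conf.setBoolean("mapred.task.profile", true);
        conf.set("mapred.task.profile.maps", "0-1");     // first two map tasks
        conf.set("mapred.task.profile.reduces", "0-1");  // first two reduce tasks

        // HPROF agent options; Hadoop substitutes %s with the output file.
        conf.set("mapred.task.profile.params",
            "-agentlib:hprof=cpu=samples,heap=sites,depth=6,"
            + "force=n,thread=y,verbose=n,file=%s");

        Job job = Job.getInstance(conf, "profiled job");
        // ... set mapper, reducer, input and output paths as usual, then submit.
      }
    }

Profiling only the first couple of map and reduce tasks keeps the overhead down; the samples are usually representative of the whole job.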
Related
I have a CDP environment running Hive. For some reason some queries run pretty quickly while others take more than 5 minutes, even a plain select current_timestamp or the like.
I see that my cluster usage is pretty low, so I don't understand why this is happening.
How can I use my cluster fully? I read some posts on the Cloudera website, but they are not helping much; after all the tuning, everything stays the same.
Something to note is that I see the following message in the Hive logs:
"Get Query Coordinator (AM) 350"
After that, the log shows that the time to actually execute the query was pretty low.
I am using Tez; any idea what I can look at?
Besides taking care of the overall tuning (see https://community.cloudera.com/t5/Community-Articles/Demystify-Apache-Tez-Memory-Tuning-Step-by-Step/ta-p/245279), please check my answer to this same issue here: Enable hive parallel processing.
That post explains what you need to do to enable parallel processing; a sketch of the settings follows below.
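If you want to try the settings from code rather than the Hive CLI, here is a minimal JDBC sketch. The host, port, database, and credentials are placeholders, and it assumes the hive-jdbc driver (with its dependencies) is on the classpath; the two SET statements are the hive.exec.parallel settings the linked answer talks about:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Enables parallel execution of independent query stages for this
    // session, then runs a trivial query as a sanity check.
    public class ParallelHiveSession {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hs2-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

          // Let independent stages of a single query run at the same time.
          stmt.execute("SET hive.exec.parallel=true");
          stmt.execute("SET hive.exec.parallel.thread.number=8");

          stmt.execute("SELECT current_timestamp()");  // the query from the question
        }
      }
    }

Note these SET statements are per-session; to apply them globally you would put them in hive-site.xml instead.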
I'm reading the literature on comparing Hive and Impala.
Several sources state some version of the following "cold start" line:
It is well known that MapReduce programs take some time before all nodes are running at full capacity. In Hive, every query suffers this “cold start” problem.
Reference
In my opinion, this is not sufficient to understand what is meant by "cold start". I am looking for more information and clarity on this.
For context, I'm a data scientist. I write queries and have only a basic understanding of big data concepts.
I've referred to questions that explain why Impala is faster (example), but they don't explicitly address or define cold start.
With every Hive query, a MapReduce job is executed, and it takes overhead and time before the nodes in the MapReduce cluster are actually working on the task. This is known as "cold start". Impala, on the other hand, sits directly atop HDFS: it does not invoke a MapReduce job and so avoids that overhead. Instead, Impala daemon processes are active from boot time and are always ready to process queries.
Takeaway: cold start refers to the overhead required to start up and execute a MapReduce job.
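A rough way to see the cold start for yourself is to time a query whose real work is negligible: on Hive-on-MapReduce, almost all of the elapsed time is job startup. A minimal JDBC sketch, assuming a HiveServer2 at hs2-host:10000 (a placeholder), the hive-jdbc driver on the classpath, and a hypothetical tiny table small_table:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Times a trivial aggregation. The scan itself is cheap, so most of
    // the measured time is framework startup overhead, i.e. the cold start.
    public class ColdStartTiming {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hs2-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

          long start = System.nanoTime();
          stmt.executeQuery("SELECT count(*) FROM small_table");  // hypothetical table
          long elapsedMs = (System.nanoTime() - start) / 1_000_000;

          System.out.println("Trivial query took " + elapsedMs + " ms");
        }
      }
    }

Running the same count on Impala against the same table typically returns in a fraction of the time, because no job has to be launched first.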
I want to know the difference between Hive and MapReduce, and whether there is any comparison between them.
Does Hive also use MapReduce under the hood?
Hive and MapReduce have completely different purposes; they are like apples and oranges.
MapReduce is a software framework for writing applications that process big amounts of data on large clusters in parallel.
Hive is a database for processing large datasets residing in a distributed file system using SQL. Hive on MapReduce translates SQL queries into a series of MapReduce jobs, while Hive on Tez uses the Tez execution engine, which builds DAGs instead.
MapReduce is a general-purpose framework (a set of libraries and tools); you can use it to write your own MapReduce application in Java, Python, Scala, or R.
Hive, in contrast, is a SQL database: it has rich SQL and data warehousing features and a cost-based optimizer for building optimal query plans.
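To make the contrast concrete, here is the classic word count written directly against the MapReduce API - the kind of code you write by hand when you use the framework yourself. The Hive equivalent, assuming a hypothetical table words with a column word, is roughly a single statement: SELECT word, count(*) FROM words GROUP BY word;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit (word, 1) for every token
          }
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();           // sum all the counts for this word
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Everything in this class (tokenizing, grouping, summing, job wiring) is what Hive generates for you when it compiles that one GROUP BY statement.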
I am new to Hadoop and am trying to learn it from the data warehousing and analytics side.
Can someone advise me on how to set up my practice machines, especially with regard to:
1. The number of machines/nodes required to start learning
2. Whether it is advisable to set up on Windows
3. What software needs to be installed
4. The availability of test/sample data
I would also like advice on the best way to perform BI tasks with Hive.
Thank you.
I would suggest downloading the Cloudera VM if you are more interested in the Hadoop machinery itself. Another way to jump-start immediately is to use Amazon EMR (Elastic MapReduce); there is an option to create an interactive Hive cluster there and start playing with datasets stored in S3.
Regarding the number of nodes: it depends on your goals. If you are interested in getting a "feel" for Hadoop performance, try at least 4-6 nodes.
Both ways listed above are good if you do not have access to an organization's internal Hadoop/Hive cluster. Even if you do have such access, I would suggest trying them first to gain some hands-on experience before using the shared environment.
I've been following Hadoop for a while; it seems like a great technology. MapReduce, clustering - it's just good stuff. But I haven't found any article regarding the use of Hadoop with SQL Server.
Let's say I have a huge claims table (600 million rows) and I want to take advantage of Hadoop. I was thinking, but correct me if I'm wrong, that I could query my table, extract all of my data, and insert it into Hadoop in chunks of any format (XML, JSON, CSV). Then I could take advantage of MapReduce and clustering with at least 6 machines, and leave my SQL Server free for other tasks. I'm just throwing a bone here; I just want to know if anybody has done such a thing.
Importing and exporting data to and from a relational database is a very common use case for Hadoop. Take a look at Cloudera's Sqoop utility, which will aid you in this process:
http://incubator.apache.org/projects/sqoop.html
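As a sketch of what the import could look like, the snippet below launches a plausible Sqoop 1 invocation from Java. The host, database, credentials, table, and split column are placeholders, and it assumes sqoop is installed on the PATH with the SQL Server JDBC driver jar in Sqoop's lib directory:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;

    // Imports the claims table from SQL Server into HDFS in parallel chunks.
    public class ClaimsImport {
      public static void main(String[] args) throws IOException, InterruptedException {
        List<String> cmd = Arrays.asList(
            "sqoop", "import",
            "--connect", "jdbc:sqlserver://dbhost:1433;databaseName=insurance", // placeholder
            "--username", "hadoop_reader",                                      // placeholder
            "--password-file", "/user/etl/sqlserver.password",  // keep secrets out of argv
            "--table", "claims",
            "--split-by", "claim_id",     // hypothetical numeric key to partition on
            "--num-mappers", "6",         // one parallel chunk per machine, as in your scenario
            "--target-dir", "/data/claims");

        Process p = new ProcessBuilder(cmd).inheritIO().start();
        System.exit(p.waitFor());
      }
    }

The --num-mappers 6 flag splits the extraction into six parallel tasks, each pulling a range of the --split-by key, so the copy itself already exercises the cluster.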