Pig step execution details - apache-pig

I am a newbie to Pig.
I have written a small Pig script in which I first load data from two different tables and right outer join them; later I have another join for two different sets of data. It works fine. But I want to see
the steps of execution, e.g. in which step my data is loaded, so that I can note the time
needed for loading, and then the details of the join step, such as how much time it
takes to join this many records.
Basically, I want to know which part of my Pig script takes the longest to run so
that I can optimize it further.
Is there any way we could println within the script and find which steps have been executed and which have started executing?
Through the JobTracker details link I could not get much info; I could just see that a mapper or a reducer is running, but I could not find which part of the script the mapper is running for.
For example, for a Hive job run we can see in the JobTracker details link which step is currently being executed.
Any information will be really helpful.
Thanks in advance.

I'd suggest you have a look at the following:
Pig's Progress Notification Listener
Penny: this is a monitoring tool, but I'm afraid it hasn't been updated recently (e.g. it won't compile for Pig 0.12.0 unless you make some code changes)
Twitter's Ambrose project: https://github.com/twitter/ambrose
On the other hand, after executing the script you can see detailed statistics about the execution time of each alias (see: Job Stats (time in seconds)).

Have a look at the EXPLAIN operator. This doesn't give you real-time stats as your code is executing, but it should give you enough information about the MapReduce plan your script generates that you'll be able to match up the MR jobs with the steps in your script.
Also, while your script is running you can inspect the configuration of the Hadoop job. Look at the variables "pig.alias" and "pig.job.feature". These tell you, respectively, which of your aliases (tables/relations) are involved in that job and what Pig operations are being used (e.g., HASH_JOIN for a JOIN step, SAMPLER or ORDER_BY for an ORDER BY step, and so on). This information is also available in the job stats that are output to the console upon completion.
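If you save a job's configuration as XML, pulling out those two variables can be scripted. The following is only a minimal Python sketch: it assumes the file uses the usual Hadoop configuration layout (property elements with name and value children), and the function name pig_job_info and the sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET

def pig_job_info(job_conf_xml):
    """Return the pig.alias and pig.job.feature values found in a
    Hadoop job configuration given as an XML string."""
    root = ET.fromstring(job_conf_xml)
    wanted = {"pig.alias", "pig.job.feature"}
    info = {}
    # Each setting is stored as <property><name>..</name><value>..</value></property>
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name in wanted:
            info[name] = prop.findtext("value")
    return info

# Hypothetical sample; the aliases and job name are invented for this example.
SAMPLE_CONF = """<configuration>
  <property><name>mapred.job.name</name><value>PigLatin:myscript.pig</value></property>
  <property><name>pig.alias</name><value>joined,orders,customers</value></property>
  <property><name>pig.job.feature</name><value>HASH_JOIN</value></property>
</configuration>"""

print(pig_job_info(SAMPLE_CONF))
```

Running this over the configuration of each job lets you line up job IDs with the aliases in your script.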

Related

Need to simulate resourceName with full table path in Log Explorer

I need to understand under what circumstances protoPayload.resourceName appears in the Log Explorer with the full table path, i.e. projects/<project_id>/datasets/<dataset_id>/tables/<table_id>, as shown in the example below.
The entries below were generated by a Composer DAG running a KubernetesPodOperator that executes some dbt commands on some models. Based on these entries, I have a sink linked to Pub/Sub for further processing.
As seen in the image, the resourceName value appears as:
projects/gcp-project-name/datasets/dataset-name/tables/table-name
I have shaded the actual values of projectid, datasetid, and tablename.
I can't run a similar DAG job with KubernetesPodOperator on test tables owing to environment restrictions, so I tried running some update and insert queries in the BigQuery editor. There, the value of protoPayload.resourceName comes out as:
projects/gcp-project-name/jobs/bxuxjob_
I tried the same queries from a Composer DAG using BigQueryInsertJobOperator. There, the value of protoPayload.resourceName comes out as:
projects/gcp-project-name/jobs/airflow_<>_
Here is my question: what operation or operations in BigQuery will give me a protoPayload.resourceName in the form I am expecting, i.e.:
projects/<project_id>/datasets/<dataset_id>/tables/<table_id>
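To make the difference between the two shapes concrete for whatever processes the Pub/Sub messages, here is a small Python sketch that classifies a resourceName string. It only covers the two patterns shown in this question; the function name classify_resource_name is hypothetical, and it does not say which BigQuery operations produce which shape.

```python
import re

# The two resourceName shapes seen in this question.
TABLE_RE = re.compile(r"^projects/([^/]+)/datasets/([^/]+)/tables/([^/]+)$")
JOB_RE = re.compile(r"^projects/([^/]+)/jobs/([^/]+)$")

def classify_resource_name(resource_name):
    """Return (kind, parts): kind is 'table', 'job', or 'other';
    parts holds the IDs extracted from the path."""
    match = TABLE_RE.match(resource_name)
    if match:
        return ("table", match.groups())
    match = JOB_RE.match(resource_name)
    if match:
        return ("job", match.groups())
    return ("other", ())

print(classify_resource_name(
    "projects/gcp-project-name/datasets/dataset-name/tables/table-name"))
print(classify_resource_name("projects/gcp-project-name/jobs/bxuxjob_"))
```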

How to execute X times a Job Executor step

Introduction
To keep it simple, let's imagine a simple transformation.
This transformation gets an input of 4 rows, from a Data Grid step.
The stream passes through a Job Executor that references a simple job containing a Write Log component.
Expectations
I would like the simple job to execute 4 times, which means 4 log messages.
Results
It turns out that the Job Executor step launches the simple job only once instead of 4 times: I only get one log message.
Hints
The documentation of the Job Executor component specifies the following:
By default the specified job will be executed once for each input row.
This is parameterized in the "Row grouping" tab, with the following field:
The number of rows to send to the job: after every X rows the job will be executed and these X rows will be passed to the job.
Answer
The step actually works well: an input of X rows executes the "Job Executor" step X times. The problem was that I wasn't able to see it in the logs.
To verify it, I added a simple transformation inside the "Job Executor" step that writes to a text file. When I checked this file, it showed that the "Job Executor" had indeed been executed X times.
Research
Trying to understand why I didn't get X log messages after X executions of the "Job Executor", I added a "Wait for" component inside the initial simple job. Adding a two-second wait finally allowed me to see X log messages appear during the execution.
Hope this helps, because it's pretty tricky. Please feel free to provide further details.
A little late to the party, but as a side note:
Pentaho is a set of programs (Spoon, Kettle, Chef, Pan, Kitchen). The engine is Kettle, and everything inside a transformation is started in parallel, which makes log retrieval a challenging task for Spoon (the UI). You don't actually need a Wait for entry: try writing the logs to a file (by specifying a log file in the Job Executor entry properties) and you'll see everything in place.
Sometimes Spoon needs a little time to get everything in place. That's why I personally recommend not relying on the Logging tab of Spoon's Execution Results; it is better to output the logs to a DB or files.

Hive analyze compute stats query failing

I'm running Hive 1.0, trying to compute column statistics using the built-in analyze command. HQL script looks like:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
use db;
analyze table tbl compute statistics for columns;
This kicks off a map-only MR task as expected. The job runs to 100% for both map and reduce, then reports:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.ColumnStatsTask
But the job is registered as a SUCCESS.
Googling led me to this JIRA ticket, but the resolution says the problem is resolved in Hive 0.14. Is there something simple I'm missing in the query?
EDIT: Five and a half years later, I've changed jobs and industries twice, picked up Spark and then abandoned Hadoop altogether in all my workflows, and the world aligned around efficient cloud data lakes that don't require a new query language. Hive is a distant memory for me, but I hope the other answer seekers found sufficient workarounds. I don't think I ever did.

Error In Query Operation: Cannot start a job without a project id

I keep getting an error using the bq command-line tool. For example, I can easily run this command and it returns the table rows that I want:
head -n 10 xxxx-bq:name_name.Report2
Note that xxxx-bq is the project ID and name_name is the dataset ID. When I try to run a query against this table, say the following:
query "SELECT count(*) FROM xxxx-bq:name_name.Report2
I get an error that says that I cannot start a job without a project id. What am I doing wrong here? How can I specify in the query the project ID? I know people have asked some similar questions. That said, I have been following along and my approach is not working.
Do you have a project id? If not, this page can help you set one up: https://developers.google.com/bigquery/bq-command-line-tool-quickstart
All BigQuery jobs (which include queries) require a project id, which is the project that gets billed for any damage done by the job. (by damage, I mean work)
You should either set your default project id (you can do this by running bq init)
or set the project id that you're running the job under via --project_id=
So if you're running bq shell, you would use bq shell --project_id=myprojectid instead.
Strange... I just started working with bq and got the same error, but it didn't like me passing --project_id=[myprojectid]. Although I was already authed with gcloud auth login, I had to run bq init (which seemingly didn't do anything); after that, my queries worked just fine.
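If you drive bq from a script, it may help to build the command line explicitly so the project ID is always present. This is a rough Python sketch, assuming bq is installed and on your PATH; the helper name bq_query_argv is made up, and the bracketed table reference in the example is an assumption about legacy SQL syntax that you may need to adapt.

```python
import subprocess

def bq_query_argv(project_id, sql):
    """Build the argument list for a bq query billed to project_id."""
    # --project_id is a global bq flag, so it is placed before the
    # "query" subcommand rather than after it.
    return ["bq", "--project_id=" + project_id, "query", sql]

def run_bq_query(project_id, sql):
    """Run the query and return bq's standard output as text."""
    result = subprocess.run(
        bq_query_argv(project_id, sql),
        check=True, capture_output=True, text=True,
    )
    return result.stdout

# Example argument list; the bracketed table reference is an assumption.
print(bq_query_argv("xxxx-bq", "SELECT count(*) FROM [xxxx-bq:name_name.Report2]"))
```

Passing the arguments as a list avoids shell quoting problems with the SQL string.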

Running Elastic Mapreduce Hive Queries from an Application

I've run Hive on elastic mapreduce in interactive mode:
./elastic-mapreduce --create --hive-interactive
and in script mode:
./elastic-mapreduce --create --hive-script --arg s3://mybucket/myfile.q
I'd like to have an application (preferably in PHP, R, or Python) on my own server be able to spin up an elastic mapreduce cluster and run several Hive commands while getting their output in a parsable form.
I know that spinning up a cluster can take some time, so maybe my application might have to do that in a separate step and wait for the cluster to become ready. But is there any way to do something like this somewhat concrete hypothetical example:
create Hive table customer_orders
run Hive query "SELECT dt, count(*) FROM customer_orders GROUP BY dt"
wait for result
parse result in PHP
run Hive query "SELECT MAX(id) FROM customer_orders"
wait for result
parse result in PHP
...
Does anyone have any recommendations on how I might do this?
You may use mrjob. It lets you write MapReduce jobs in Python 2.5+ and run them on several platforms.
An alternative is HiPy, an awesome project that may well be enough for all your needs. The purpose of HiPy is to support programmatic construction of Hive queries in Python and easier management of queries, including queries with transform scripts.
HiPy enables grouping together in a single script of query
construction, transform scripts and post-processing. This assists in
traceability, documentation and re-usability of scripts. Everything
appears in one place and Python comments can be used to document the
script.
Hive queries are constructed by composing a handful of Python objects,
representing things such as Columns, Tables and Select statements.
During this process, HiPy keeps track of the schema of the resulting
query output.
Transform scripts can be included in the main body of the Python
script. HiPy will take care of providing the code of the script to
Hive as well as of serialization and de-serialization of data to/from
Python data types. If any of the data columns contain JSON, HiPy takes
care of converting that to/from Python data types too.
Check out the Documentation for details!
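For the "wait for result, parse result" steps in the question, whichever tool runs the query, the parsing side can be quite small. The following Python sketch assumes you have captured the query output as plain text with one row per line and tab-separated columns, which is how the Hive command-line client prints results by default; the function name parse_hive_output and the sample rows are invented for illustration.

```python
def parse_hive_output(text):
    """Split tab-separated Hive CLI output into a list of rows,
    each row being a list of column strings."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rows.append(line.split("\t"))
    return rows

# Hypothetical output of: SELECT dt, count(*) FROM customer_orders GROUP BY dt
sample = "2013-01-01\t42\n2013-01-02\t17\n"
for dt, order_count in parse_hive_output(sample):
    print(dt, int(order_count))
```

All values come back as strings, so convert numeric columns yourself as in the loop above.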