How to run a map-only query in Apache Hive

When I write a query in Apache Hive, it executes a MapReduce job behind the scenes, but how can I run a map-only job in Hive?
Thanks

Certain optimized queries do in fact require only a map phase. You can provide a MAPJOIN hint in Hive to achieve this; it is recommended when the secondary table is small enough to fit in memory:
SELECT /*+ MAPJOIN(...) */ * FROM ...
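For instance, here is a hedged sketch (table and column names are made up) that joins a large fact table against a small dimension table entirely in the map phase, with the small table loaded into memory on each mapper:
SELECT /*+ MAPJOIN(d) */ f.id, f.amount, d.name
FROM fact_table f
JOIN dim_table d ON f.dim_id = d.id;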

This question was asked to me in an interview; I didn't know the answer at the time, but I figured it out later.
The following query runs a map-only job. Simply selecting column values involves no aggregation, grouping, or ordering, so we don't need a reducer in this scenario.
select id,salary from tableA;
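By contrast, anything that requires aggregation, grouping, or sorting forces a reduce phase; for example, a hypothetical aggregate over the same table:
select id, sum(salary) from tableA group by id; -- needs a reducer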

Related

How to calculate accumulated sum of query timings?

I have a SQL file running many queries. I want to see the accumulated sum of all query timings. I know that if I turn on timing, i.e. call
\timing
query 1;
query 2;
query 3;
...
query n;
at the beginning of the script, it will show the time each query takes to run. However, I need the accumulated total across all queries, without having to add them up manually.
Is there a systematic way? If not, how can I fetch the interim times and store them in a variable?
The pg_stat_statements module is a good option; it provides a means for tracking execution statistics.
First, add pg_stat_statements to shared_preload_libraries in the postgresql.conf file. To find where this .conf file lives on your filesystem, run show config_file;
shared_preload_libraries = 'pg_stat_statements'
Restart the Postgres server.
Create the extension
CREATE EXTENSION pg_stat_statements;
Now the module provides a view, pg_stat_statements, which helps you analyze various query execution metrics.
Reset the statistics collected so far before running your queries.
SELECT pg_stat_statements_reset();
Now, execute your script file containing queries.
\i script_file.sql
You now have timing statistics for all the queries executed. To get the total time taken, simply run
select sum(total_time) from pg_stat_statements
where query !~* 'pg_stat_statements';
The time you get is in milliseconds, which can be converted to the desired format using Postgres interval and timestamp functions. Note that in PostgreSQL 13 and later the column is named total_exec_time instead of total_time.
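For example, a minimal sketch that reports the total as an interval, assuming a pre-13 server where the column is still total_time:
select make_interval(secs => sum(total_time) / 1000.0) as total_runtime
from pg_stat_statements
where query !~* 'pg_stat_statements';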
If you want to time the whole script, on Linux or macOS you can use the time utility to launch it.
The measurement in this case is a bit more than the sum of the raw query times, because it includes some overhead of starting and running the psql command. On my system this overhead is around 20 ms.
$ time psql < script.sql
…
real 0m0.117s
user 0m0.008s
sys 0m0.007s
The real value is the time it took to execute the whole script, including the aforementioned overhead.
The approach in this answer is a crude, simple client-side way to measure the runtime of the overall script. It is not suitable for measuring millisecond-precision server-side execution times, but it may still be sufficient for many use cases.
Kaushik Nayak's solution is a far more precise way to time executions directly on the server. It also provides much more insight into the execution (e.g. per-query times).

BQ PY Client Libraries :: client.run_async_query() vs client.run_sync_query()

I'm looking at BQ PY Client Libraries:
There used to be two different operations to query a table
client.run_async_query()
client.run_sync_query()
But in the latest version (v1.3) it seems there's only one operation to execute a query, Client.query(). Did I understand correctly?
And looking at the GitHub code, it looks like Client.query() just returns the query job, not the actual query results / data, which makes me conclude it works in a similar way to client.run_async_query(). Is there no replacement anymore for the client.run_sync_query() operation, which returned query results (data) synchronously / immediately?
Thanks for the clarification!
Cheers!
Although .run_sync_query() has been removed, the Query reference says that short jobs may return results right away if they don't take long to finish:
query POST /projects/projectId/queries
Runs a BigQuery SQL query and returns results if the query completes within a specified timeout.
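You can still get synchronous behaviour from the current library: Client.query() returns a QueryJob, and calling .result() on it blocks until the job finishes and then iterates over the rows. A minimal sketch (the query itself is a placeholder):
from google.cloud import bigquery

client = bigquery.Client()  # uses default project and credentials

# Client.query() starts the job and returns a QueryJob without waiting.
query_job = client.query("SELECT 17 AS answer")  # placeholder query

# QueryJob.result() blocks until the job completes and returns the rows,
# which is the synchronous behaviour run_sync_query() used to provide.
for row in query_job.result():
    print(row.answer)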

Issue Counting Rows in Hive

I have a table in Hive. When I run the following I always get 0 returned:
select count(*) from <table_name>;
However, if I run something like:
select * from <table_name> limit 10;
I get data returned.
I am on Hive 1.1.0.
I believe the following two issues are related:
https://issues.apache.org/jira/browse/HIVE-11266
https://issues.apache.org/jira/browse/HIVE-7400
Is there anything I can do to workaround this issue?
The root cause is old and outdated table statistics. Try issuing this command, which should solve the problem:
ANALYZE TABLE <table_name> COMPUTE STATISTICS;
When you first import the table, there are various reasons the statistics may not be updated by the Hive services. I am still looking for options and properties to make this happen automatically.
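The count is wrong because Hive answers count(*) directly from those stored statistics rather than scanning the data. As an alternative workaround, you can disable that shortcut so the count always runs a real job; a hedged sketch, assuming your Hive build supports the hive.compute.query.using.stats property:
set hive.compute.query.using.stats=false;
select count(*) from <table_name>;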

Set a timeout on a tNetezzaInput in Talend

I'm building an ETL using Talend that should create an Excel file with some information produced by a tNetezzaInput component executing a dynamic query.
It works perfectly. However, some queries take over 2 hours to finish, depending on the table size (I have more than 1000 queries to execute).
I would like to set a timeout (30 seconds to 1 minute) on my tNetezzaInput.
Is that possible?
Thank you
Not sure about Talend/tNetezzaInput, but you can address this issue on the Netezza side using query timeout limits along with a runaway-query event.
Secondly, how big are the tables for a query to take 2 hours? I hope it's not a query/join issue.
Thanks,
Sanjit
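For reference, Netezza can cap query runtime per user or group via the QUERYTIMEOUT attribute (the value is in minutes). A hedged sketch with a made-up user name for the Talend job; check your Netezza release documentation for the exact syntax:
-- Hypothetical Netezza user running the Talend queries; QUERYTIMEOUT is in minutes
ALTER USER talend_etl WITH QUERYTIMEOUT 1;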
A query on 200 million records against tables with 100+ columns should not take 2 hours unless there is some issue with data distribution, joins, statistics, or the workload manager.
Have you checked the pg and dbos logs and the plan file for this query?
Thanks,
Sanjit

SQL Performance: LIKE vs IN

In my logging database I have several qualifiers specifying the logged action. A few of them are built like 'Item.Action', e.g. 'Customer.Add'. I wonder which approach will be faster if I want to get all logging items that start with 'Customer.':
SELECT * FROM log WHERE action LIKE 'Customer.%'
or
SELECT * FROM log WHERE action IN ('Customer.Add', 'Customer.Delete', 'Customer.Update', 'Customer.Export', 'Customer.Import')
I use PostgreSQL.
It depends on the indexes on the log table. Most likely the queries will have the same performance. To check, use EXPLAIN or EXPLAIN ANALYZE; queries with the same execution plan (the output of EXPLAIN) will have the same performance.
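For example, a prefix LIKE such as 'Customer.%' can only use a plain B-tree index under the C locale; in other locales you can build the index with text_pattern_ops. A minimal sketch (the index name is made up) for comparing the two plans:
CREATE INDEX log_action_text_idx ON log (action text_pattern_ops);
EXPLAIN ANALYZE SELECT * FROM log WHERE action LIKE 'Customer.%';
EXPLAIN ANALYZE SELECT * FROM log WHERE action IN
    ('Customer.Add', 'Customer.Delete', 'Customer.Update', 'Customer.Export', 'Customer.Import');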