How can we test Hive functions without referencing a table - SQL

I wanted to understand the UDF weekofyear and how it decides when the first week starts. I had to artificially hit a table and run
the query. I wanted to compute the values without hitting the table. Secondly, can I look at the UDF source code?
SELECT weekofyear('12-31-2013') FROM a;

You do not need a table to test a UDF since Hive 0.13.0.
See this JIRA: HIVE-178, "SELECT without FROM should assume a one-row table with no columns".
Test:
hive> SELECT weekofyear('2013-12-31');
Result:
1
The source code (master branch) is here: UDFWeekOfYear.java
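The result follows ISO-8601 week numbering: week 1 is the week containing the year's first Thursday, so the last days of December can already belong to week 1 of the next year. You can explore the numbering without Hive using Python's datetime, whose isocalendar() follows the same rule (a sketch of the numbering, not the actual UDF code):

```python
from datetime import date

# ISO-8601 week numbering: week 1 is the week containing the
# year's first Thursday, so 2013-12-31 falls into 2014's week 1.
for d in [date(2013, 12, 31), date(2014, 1, 1), date(2013, 12, 28)]:
    iso_year, iso_week, iso_weekday = d.isocalendar()
    print(d, "-> week", iso_week)
```

This matches the result above: 2013-12-31 is week 1, while 2013-12-28 is still week 52 of 2013.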

If you are a Java developer, you can write JUnit test cases to test the UDFs.
You can search the source code of all Hive built-in functions on grepcode:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-exec/1.0.0/org/apache/hadoop/hive/ql/udf/UDFWeekOfYear.java

I don't think executing a UDF without hitting a table is possible in Hive.
Even Hive developers hit a table when testing UDFs.
To make the query run faster you can:
Create a table with only one row and run the UDF queries on this table
Run Hive in local mode.
Hive source code is located here.
UDFWeekOfYear source is here.

You should be able to use any table with at least one row to test functions.
Here is an example using a few custom functions that perform work and output a string result.
Replace anytable with an actual table.
SELECT ST_AsText(ST_Intersection(ST_Polygon(2,0, 2,3, 3,0), ST_Polygon(1,1, 4,1, 4,4, 1,4))) FROM anytable LIMIT 1;
Hive results:
OK
POLYGON ((2 1, 2.6666666666666665 1, 2 3, 2 1))
Time taken: 0.191 seconds, Fetched: 1 row(s)
hive>

Related

Oracle SQL - subquery works fine, however CREATE TABLE with that subquery appears to hang

I have the following query structure
CREATE TABLE <Table Name> AS
(
SELECT .... FROM ...
)
When I run the SELECT statement on its own, it compiles and returns the results within seconds. However, when I run it with the CREATE TABLE statement, it takes hours, to the point where I believe it has hung and will never complete.
What is the reason for this, and what could a workaround be?
Oracle Database 12c <12.1.0.2.0>
If you ran that SELECT in some GUI, note that most (if not all) of them return only a few hundred rows, not the whole result set. For example, if your query really returns 20 million rows, the GUI displays only the first 50 (or 500, depending on the tool you use) rows, which is confusing - just as it confused you.
If you used current query as an inline view, e.g.
select count(*)
from
(select ... from ...) --> this is your current query
it would "force" Oracle to fetch all rows, so you'd see how long it actually takes.
Apart from that, see whether the SELECT itself can be optimized, e.g.:
see whether the columns used in the WHERE clause are indexed
collect statistics for all involved tables (those used in the FROM clause)
remove the ORDER BY clause (if there is one; it is irrelevant in a CTAS operation)
check the explain plan
Performance tuning goes far beyond what I've suggested; those are just a few things you might want to look at.
Have you tried a direct-path insert: first create the table using CTAS with WHERE 1 = 2, then do the insert? This will at least tell us whether anything is wrong in the data (corrupt data) or whether it is a performance issue.
I had the same problem before when the new data was very large (7 million rows); it took me 3 hours to execute the code.
My best suggestion is to create a view instead of a new table, since it takes less space.
So, the answer to this one:
CREATE TABLE <Table Name> AS
(
SELECT foo
FROM baa
LEFT JOIN
( SELECT foo FROM baa WHERE DATES BETWEEN SYSDATE - 100 AND SYSDATE )
WHERE DATES_1 BETWEEN SYSDATE - 100 AND SYSDATE - 10
)
The problem was that the BETWEEN ranges did not cover the same time period, so the subquery was reading more data than the main query (I suspect this was causing a full scan over the tables).
The query below uses matching BETWEEN ranges, and it returned the results in less than 3 minutes.
CREATE TABLE <Table Name> AS
(
SELECT foo FROM baa
LEFT JOIN ( SELECT foo FROM baa WHERE DATES BETWEEN SYSDATE - 100 AND SYSDATE - 10 )
WHERE DATES_1 BETWEEN SYSDATE - 100 AND SYSDATE - 10
)
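One detail worth keeping in mind: Oracle's BETWEEN is inclusive and expects the lower bound first (x BETWEEN low AND high means low <= x <= high), so a range written with the newer date first matches no rows. A small Python sketch of the same check, over invented dates:

```python
from datetime import date, timedelta

today = date(2013, 6, 1)  # stand-in for SYSDATE
dates = [today - timedelta(days=n) for n in range(0, 120, 5)]

def between(x, low, high):
    """SQL-style inclusive BETWEEN: low <= x <= high."""
    return low <= x <= high

# Lower bound first: matches everything 10 to 100 days old.
good = [d for d in dates if between(d, today - timedelta(100), today - timedelta(10))]

# Bounds reversed: an empty range, matches nothing.
bad = [d for d in dates if between(d, today - timedelta(10), today - timedelta(100))]

print(len(good), len(bad))  # 19 0
```

The same rule applies to the DATES predicates above: the earlier date (e.g. SYSDATE - 100) must come first.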

SQL Server Select Query Is Slow

I have approx 820,000 records in my SQL Server table and it is taking 5 seconds to select the data from it. The table has one clustered index on a time column that is nullable (as of now it does not contain any NULL values). Why is it taking 5 to 6 seconds to fetch only this many records?
What do you mean by 'select the data'? If you are fetching that many records in Management Studio (displaying all the records), most of those 6 seconds are spent showing you all the rows. If that is the case, just insert the records into a temp table instead; it will be much faster.
I recommend the following:
1. Check which clustered and non-clustered indexes your columns use (the easiest way, I think, is sp_help TableName).
2. When you use SELECT, always name the columns explicitly (never use SELECT * FROM ...).
3. If you are using SSMS, check the execution plan; with this tool you can review your T-SQL and see the cost of each query.
4. Don't use CONVERT(...) in the WHERE clause, for example WHERE CONVERT(int, NameColumn) = 100.
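Point 4 is about sargability: wrapping the filtered column in a function forces the engine to evaluate the function for every row instead of seeking the index on the raw values. A rough Python analogy (data invented), comparing a full scan over converted values with a binary search on the sorted raw keys:

```python
from bisect import bisect_left

# A "table" of 100,000 rows, "indexed" (kept sorted) on name_column.
name_column = sorted(str(i) for i in range(100_000))

# WHERE CONVERT(int, NameColumn) = 100: the function hides the key,
# so every row must be converted and compared (full scan).
scan_hits = [v for v in name_column if int(v) == 100]

# WHERE NameColumn = '100': the raw key can be looked up directly
# in the sorted "index" (seek).
i = bisect_left(name_column, '100')
seek_hits = name_column[i:i + 1] if i < len(name_column) and name_column[i] == '100' else []

print(scan_hits == seek_hits)  # same result, very different cost
```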

How to write a UDF (Hive / Spark-Scala) to return a value from a Hive query

I am trying to write a Hive UDF using Scala. This UDF should run a query on another Hive table and return the obtained value.
For example, I have a master table in Hive with columns emp_id, start_date, end_date, salary. I am trying to come up with a Hive UDF in Scala, a function getSal, to which I can pass an id and a date and get the effective salary for that id in another Hive query:
select *, getSal(emp_id, passed_date) as salary from some_table;
Can you tell me how to achieve this?
Note: I can get the details by joining my table with the master table and using a BETWEEN clause, but I would like to explore the UDF solution.
This is simply not possible, nor desirable when you think about it from a performance point of view - at least if I have understood your question, and more so its title. A UDF is invoked once per row, so having it query another table would mean one lookup query per row of the outer table.
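For what it's worth, what the proposed getSal would compute is an effective-dated lookup, which in Hive is normally expressed as the join with a BETWEEN clause that the question already mentions. A minimal Python sketch of that lookup logic (table contents invented), just to make the semantics concrete:

```python
from datetime import date

# Invented master table: (emp_id, start_date, end_date, salary)
master = [
    (1, date(2020, 1, 1), date(2020, 12, 31), 50000),
    (1, date(2021, 1, 1), date(2021, 12, 31), 55000),
    (2, date(2020, 6, 1), date(2021, 5, 31), 40000),
]

def get_sal(emp_id, as_of):
    """Return the salary effective for emp_id on as_of, or None."""
    for eid, start, end, salary in master:
        if eid == emp_id and start <= as_of <= end:
            return salary
    return None

print(get_sal(1, date(2021, 3, 15)))  # 55000
```

In SQL the same thing is a join on emp_id with "passed_date BETWEEN start_date AND end_date", evaluated once for the whole table rather than row by row.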

Hive Query Efficiency

Could you help me with a Hive query efficiency problem? I have two queries for the same problem, and I cannot figure out why one is much faster than the other. If you know, please feel free to provide insight - any info is welcome!
Problem: I am trying to find the minimum value of a number of columns in a Hive Parquet table.
Queries: I tried the following two queries:
query 1
drop table if exists tb_1 purge;
create table if not exists tb_1 as
select 'v1' as name, min(v1) as min_value from src_tb union all
select 'v2' as name, min(v2) as min_value from src_tb union all
select 'v3' as name, min(v3) as min_value from src_tb union all
...
select 'v200' as name, min(v200) as min_value from src_tb
;
query 2
drop table if exists tb_2 purge;
create table if not exists tb_2 as
select min(v1) as min_v1
, min(v2) as min_v2
, min(v3) as min_v3
...
, min(v200) as min_v200
from src_tb
;
Result: Query 2 is much faster than query 1. It took about 5 minutes to finish the second query. I don't know how long query 1 would take: usually after I submit a query, the system starts to analyze it and prints some compilation information in the terminal, but for the first query the system did not even react after submission, so I killed it.
What do you think? Thank you in advance.
Query execution time depends on the environment in which it executes. In MSSQL, for example:
Some people think query execution follows the algorithm they see in theoretical resources, but in practice it depends on other things.
For example, both of your queries contain SELECT statements over a table, and at first glance they need to read all rows, but the database server must analyze each statement to determine the most efficient way to extract the requested data. This is referred to as optimizing the SELECT statement. The component that does this is called the Query Optimizer. The input to the Query Optimizer consists of the query, the database schema (table and index definitions), and the database statistics. The output of the Query Optimizer is a query execution plan, sometimes referred to as a query plan or just a plan. (Please see this for more information about the query-processing architecture.)
You can see the execution plan in MSSQL by reading this article, and I think you will understand better by comparing the execution plans of your two queries.
Edit (Hive)
Hive provides an EXPLAIN command that shows the execution plan for a query. The syntax for this statement is as follows:
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query
A Hive query gets converted into a sequence of stages. The description of the stages itself shows a sequence of operators with the metadata associated with the operators.
Please see LanguageManual Explain for more information.
What is surprising? The first query has to read src_tb a total of 200 times, while the second reads the data once and performs 200 aggregations. It is a no-brainer that the second is faster.
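The cost difference can be sketched outside Hive: query 1 scans the source once per column, while query 2 maintains all the running minimums in a single scan. A toy Python version (column names and data invented):

```python
# Invented "table": rows with columns v1..v5.
rows = [
    {"v1": 3, "v2": 9, "v3": 4, "v4": 7, "v5": 2},
    {"v1": 1, "v2": 8, "v3": 6, "v4": 5, "v5": 9},
    {"v1": 5, "v2": 2, "v3": 3, "v4": 8, "v5": 4},
]
cols = ["v1", "v2", "v3", "v4", "v5"]

# Query 1 style: one full pass over the data per column (N scans).
mins_per_scan = {c: min(row[c] for row in rows) for c in cols}

# Query 2 style: a single pass maintaining all running minimums.
mins_one_pass = {c: float("inf") for c in cols}
for row in rows:
    for c in cols:
        if row[c] < mins_one_pass[c]:
            mins_one_pass[c] = row[c]

print(mins_per_scan == mins_one_pass)  # same answer, 1 scan vs N scans
```

With 200 columns and a large table, the difference between 1 scan and 200 scans dominates everything else.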

When using functions on PostgreSQL partitioned tables it does a full table scan

The tables are partitioned in a PostgreSQL 9 database. When I run the following script:
select * from node_channel_summary
where report_date between '2012-11-01' AND '2012-11-02';
it reads the data from the proper tables (partitions) without doing a full table scan. Yet if I run this script:
select * from node_channel_summary
where report_date between trunc(sysdate)-30 AND trunc(sysdate)-29;
in this case it does a full table scan, whose performance is unacceptable. The -30 and -29 will be replaced by parameters.
After doing some research, it seems Postgres doesn't prune partitions when the predicate is built from functions.
Does somebody know a workaround for this issue?
The issue is that PostgreSQL prepares and caches execution plans for the queries inside a function. This is a problem for partitioned tables because partition elimination happens at planning time, when the parameter values are not yet known, so the planner cannot rule any partitions out. You can get around this by specifying your query as a string with EXECUTE, forcing PostgreSQL to re-parse and re-plan it at run time:
FOR row IN EXECUTE
    'select * from node_channel_summary
     where report_date between trunc(sysdate)-30 AND trunc(sysdate)-29'
LOOP
    -- ...
END LOOP;

-- or
RETURN QUERY EXECUTE 'select * from ...';