Using an UDF to query a table in Hive - sql

I have the following UDF available on Hive to convert a time bigint to date,
to_date(from_utc_timestamp(from_unixtime(cast(listed_time/1000 AS bigint)),'PST'))
I want to use this UDF to query a table on a specific date. Something like,
SELECT * FROM <table_name>
WHERE date = '2020-03-01'
ORDER BY <something>
LIMIT 10

I would suggest to change the logic: avoid applying the function to the column being filtered, because it is an inefficient approach. The function needs to be invoked for every row, which prevents the query from benefiting an index.
On the other hand, you can simply convert the input date to a unix timestamp (possibly with an UDF). This should look like;
SELECT * FROM <table_name>
WHERE date = to_utc_timestamp('2020-03-01', 'PST') * 1000
ORDER BY <something>
LIMIT 10

Related

How to reuse a computed value multiple times?

Basically I just want a simple way of finding the most recent date in a table, saving it as a variable, and reusing that variable in the same query.
Right now this is how I'm doing it:
with recent_date as (
select max(date)
from mytable
)
select *
from mytable
where date = (select * from recent_date)
(For this simple example, a variable is overkill, but in my real-world use-case I reuse the recent date multiple times in the same query.)
But that feels cumbersome. It would be a lot cleaner to save the recent date to a variable rather than a table and having to select from it.
In pseudo-code, something like this would be nice:
$recent_date = (select max(date) from mytable)
select *
from mytable
where date = $recent_date
Is there something like that in Postgres?
Better for the simple case
For the scope of a single query, CTEs are a good tool. In my hands the query would look like this:
WITH recent(date) AS (SELECT max(date) FROM mytable)
SELECT m.*
FROM recent r
JOIN mytable m USING (date)
Except that the actual example query would burn down to this in my hands:
SELECT *
FROM mytable
ORDER BY date DESC NULLS LAST
FETCH FIRST 1 ROWS WITH TIES;
NULLS LAST only if there can be NULL values. See:
Sort by column ASC, but NULL values first?
WITH TIES only if date isn't UNIQUE NOT NULL. See:
Get top row(s) with highest value, with ties
In combination with an index on mytable (date) (or more specific), this produces the best possible query plan. Look no further.
No, I need variables!
If you positively need variables scoped for the same command, transaction, session or more, there are various options.
The closest thing to "variables" in SQL in Postgres are "customized options". See:
User defined variables in PostgreSQL
You can only store text, any other type has to be cast (and cast back on retrieval).
To set and retrieve a value from within a query, use the Configuration Settings Functions set_config() and current_setting():
SELECT set_config('foo.recent', max(date)::text, false) FROM mytable;
SELECT *
FROM mytable
WHERE date = current_setting('foo.recent')::date;
Typically, there are more efficient ways.
If you need that "recent date" a lot, consider a simple function as "global variable", usable by all transactions in all sessions (but each new command sees its own current state):
CREATE FUNCTION f_recent_date()
RETURNS date
LANGUAGE sql STABLE PARALLEL SAFE AS
'SELECT max(date) FROM mytable';
STABLE is a valid volatility setting as the function returns the same result within the same query. Be sure to actually make it STABLE, so Postgres does not evaluate repeatedly. In Postgres 9.6 or later, also make it PARALLEL SAFE. Then your query becomes:
SELECT * FROM mytable WHERE date = f_recent_date();
More options:
Is there a way to define a named constant in a PostgreSQL query?
Passing user id to PostgreSQL triggers
Typically, if I need variables in Postgres, I use a PL/pgSQL code block in a function, a procedure, or a DO statement for ad-hoc use without the need to return rows:
DO
$do$
DECLARE
_recent_date date := (SELECT max(date) FROM mytable);
BEGIN
PERFORM * FROM mytable WHERE date = _recent_date;
-- more queries using _recent_date ...
END
$do$;
PL/pgSQL may be what you should be using to begin with. Further reading:
When to use stored procedure / user-defined function?
Keep in mind that in SQL you cannot directly declare a variable. Basically a CTE is creating variable (or a set of) and in SQL to use a variable you select it. However, if you want to avoid that structure you can just get the variable directl from a subset directly.
select *
from mytable
where date = (select max(date) from mytable);

Optimize performance for date range query in SQL

I have scenario as below: I have a table Tblbalance which contains a date and time column.
When I search using normal between clause in my stored procedure, it will throw error timeout in linq.
I tried to extend default time from 30 sec to 2 min in linq but not worked.
Please help me to get fastest way to return from select query
Sample query I tried is as below:
select
col1,
col2,
col3,
(select top 1 col from tblname where id = tblbalance.id)
from
tblbalance
where
datecol between startdate and todate
order by
col3
I would suggest to include one more column in you table and copy numeric of datetime in that column, Also create a index on the newly created numeric column. Then instead of using the datetime column in where clause, try using numeric column.
This way your query will run on index and will be optimised.

How to split the column with a Timestamp and extract the hours, minutes and seconds into the separate column?

The timestamp in a column looks like this "2020-04-26 17:45:14".
Now I need to extract and insert the "17:45:14" part into another separate column.
How do I do that? I was thinking about using the following code sequence:
SELECT
EXTRACT(____ FROM column)
FROM table
Just cast it to a time value:
select the_column::time as time_only
from the_table
Or compliant with the SQL standard:
select cast(the_column as time) as time_only
from the_table

current_timestamp redshift

When I select current_timestamp from redshift, I am getting a list of values instead of one value.
Here is my SQL query
select current_timestamp
from stg.table;
Does anybody know why I am getting a list of values instead a single value?
This is your query:
select current_timestamp from stg.table
This produces as many rows as there are in table stg.table (since that's the from clause), with a single column that always contains the current date/time on each row. On the other hand, if the table contains no row, the query returns no rows.
If you want just one row, use a scalar subquery without a from clause:
select current_timestamp as my_timestamp
You will receive a row for each row in stg.table. According to the RedShift docs you should be using GETDATE() or SYSDATE() instead. Perhaps you want, e.g.:
select GETDATE() as my_timestamp

Why hive partition not work when i using funciton unix_timestamp()

As i know, hive partition can reduce the number of input file if you use the partition column in where clause. For example, in my table t i define a partition named date_entry(type is string, which stores the timestamp).
select count(*) from t where
date_entry >= (unix_timestamp() - 2 * 24 * 3600) * 1000
I try to execute this query, i expect it will filter some files by the where clause, but it not.
If i don't use function unix_timestamp() , it will work.
Can anybody knows why or give the workaround.