Generating dates in Hive SQL

I want to create a table that contains all of the dates (inclusive) between the min and max date from another table. The simple query below gets these two dates:
-- Get the min and max dates from the table
select min(date(sale_date)) as min_date,
max(date(sale_date)) as max_date
from TABLE;
I've spent the last hour googling this problem and have found attempts at doing this in MySQL and Oracle SQL, but none in Hive SQL, and I've been unable to convert them to Hive SQL. If anyone has any idea how to do this, please let me know. Thank you in advance.

OK, this isn't my answer; a colleague was able to answer it. Still, I think it's important that I show my colleague's solution for your future benefit. It assumes that you've created a table (TABLE_1 below) that contains the min date and max date.
CREATE TABLE TABLE_2
STORED AS AVRO
LOCATION 'xxxxxx'
AS
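-- space(n) returns n spaces; splitting them on ' ' gives n+1 empty strings,
-- so posexplode emits positions 0..n: one row per date from min_date to max_date.
SELECT date_add(t.min_date, pe.i) AS date_key
FROM TABLE_1 t
LATERAL VIEW
posexplode(split(space(datediff(t.max_date, t.min_date)), ' ')) pe AS i, x;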
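
To see why the range comes out inclusive, here is a minimal Python sketch of the same trick. It is not Hive code: the function name date_range_inclusive and the sample dates are made up for illustration, and plain string operations stand in for space, split and posexplode.

```python
from datetime import date, timedelta


def date_range_inclusive(min_date, max_date):
    """Mimic date_add(min_date, i) over posexplode(split(space(n), ' '))."""
    n = (max_date - min_date).days   # plays the role of datediff(max_date, min_date)
    pieces = (" " * n).split(" ")    # n spaces split on ' ' -> n + 1 empty strings
    # enumerate stands in for posexplode: it yields (position, value) pairs
    return [min_date + timedelta(days=i) for i, _ in enumerate(pieces)]


# Hypothetical sample range that crosses a leap day
dates = date_range_inclusive(date(2020, 2, 27), date(2020, 3, 2))
for d in dates:
    print(d.isoformat())
```

Because n spaces always split into n + 1 pieces, a range where min_date equals max_date still produces exactly one row.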

Related

Teradata logs. Statistic/Partitioning usage

Does anyone know whether it is possible to find in the Teradata logs which statistics or, even better, which partition ranges were used by a particular query?
For example, the table definition has this date range:
PARTITION BY (
Range_N(TransactionDate BETWEEN DATE '2012-01-01' AND DATE '2022-12-31' EACH INTERVAL '1' MONTH)
)
So the question is: is it possible to see which range was used by a particular query? I believe not, but maybe there is still some way to do it.
I tried to analyze some of the DBC.DBQ tables, but found nothing.

Clean SQL: Should I reference dates that I could ask for in my database?

I have millions of rows in my SQLite database (I use SQLite Studio).
I would like to ask it which months are available.
When I run my query, it takes a long time (5 minutes, more than half the time it took to get my data for a month!).
-- How I query for the months, by the way
SELECT DISTINCT strftime('%Y-%m', time_UTC) AS month FROM transacts ORDER BY month ASC;
It could be really fast if I created a table with all available months, but a little voice in my head keeps telling me:
YOU MUST NOT MAKE A DATE TABLE
But I don't remember whether it's bad, or why. It sounds pretty good to me.
What do you think?
EDIT:
How to speed up an SQL query on available dates

You can create an index on the expression strftime('%Y-%m', time_UTC) to get better performance:
CREATE INDEX id_transacts_time ON transacts(strftime('%Y-%m', time_UTC));
In the demo, check the query plan to see that it uses the index.
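
As a rough way to try this yourself, the following sketch uses Python's built-in sqlite3 module with an in-memory database. The table layout and sample rows are assumptions made for illustration, and it presumes a reasonably recent SQLite that allows strftime in index expressions. It prints the query plan instead of promising a particular one, since the planner's choice can differ between versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only time_UTC matters for this example
conn.execute("CREATE TABLE transacts (id INTEGER PRIMARY KEY, time_UTC TEXT)")
conn.executemany(
    "INSERT INTO transacts (time_UTC) VALUES (?)",
    [("2021-01-05 10:00:00",), ("2021-01-20 12:30:00",), ("2021-02-03 08:15:00",)],
)

# The expression index from the answer above
conn.execute(
    "CREATE INDEX id_transacts_time ON transacts(strftime('%Y-%m', time_UTC))"
)

query = (
    "SELECT DISTINCT strftime('%Y-%m', time_UTC) AS month "
    "FROM transacts ORDER BY month ASC"
)
months = [row[0] for row in conn.execute(query)]
print(months)

# Inspect the plan to see whether your SQLite version picks the index
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)
```

The index expression has to match the expression in the query exactly; otherwise SQLite cannot use it.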

Optimize SELECT MAX(timestamp) query

I would like to run this query about once every 5 minutes to be able to run an incremental query to MERGE to another table.
SELECT MAX(timestamp) FROM dataset.myTable
-- timestamp is of type TIMESTAMP
My concern is that this will do a full scan of myTable on a regular basis.
What are the best practices for optimizing this query? Will partitioning help even if the SELECT MAX doesn't extract the date from the timestamp? Or will the columnar nature of BigQuery alone make this optimal?
Thank you.

What you can do is, instead of querying your table directly, query the INFORMATION_SCHEMA.PARTITIONS table within your dataset. Doc here.
You can for instance go for:
SELECT LAST_MODIFIED_TIME
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE TABLE_NAME = "myTable"
The PARTITIONS table holds metadata, with one record for each of your partitions. It is therefore much smaller than your table, which makes it an easy way to cut your query costs (it is also much faster to query).

Explain the two conversions used between Hive date functions?

I am trying to count the number of records on a particular date.
I eventually got the query working, but I am confused by these two queries, which look the same to me. Why should I write date_time bare instead of in quotes in the conversion?
When I run this query,
select count(*) from TABLENAME
where FROM_UNIXTIME(UNIX_TIMESTAMP(date_time), 'yyyyMMdd')='20170312';
the result is the count for that particular date.
But when I run
select count(*) from TABLENAME
where FROM_UNIXTIME(UNIX_TIMESTAMP('date_time', 'yyyyMMdd'))='20170312';
the result is 0.
Please explain the difference between these queries.

date_time is a column, while 'date_time' is a string literal; the attempt to parse that literal as a date results in NULL, so no row matches.
If you want to quote the column name, use backticks: `date_time`
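
The same column-versus-literal distinction can be reproduced outside Hive. The sketch below is an analogy, not Hive code: it uses Python's sqlite3 module and SQLite's strftime in place of FROM_UNIXTIME/UNIX_TIMESTAMP, and the events table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with a date_time column, as in the question
conn.execute("CREATE TABLE events (date_time TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [("2017-03-12 09:00:00",), ("2017-03-12 17:45:00",), ("2017-03-13 08:00:00",)],
)

# Unquoted: date_time is the column, so each row's value is formatted.
by_column = conn.execute(
    "SELECT count(*) FROM events "
    "WHERE strftime('%Y%m%d', date_time) = '20170312'"
).fetchone()[0]

# Single-quoted: 'date_time' is a literal string that is not a valid date,
# so strftime returns NULL and the comparison never matches.
by_literal = conn.execute(
    "SELECT count(*) FROM events "
    "WHERE strftime('%Y%m%d', 'date_time') = '20170312'"
).fetchone()[0]

print(by_column, by_literal)
```

In both engines the quoted version compares NULL with '20170312', which is never true, so the count is 0.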

Select record online by max online ordered by date

I need help with SQL:
I need to get the max online value for each day, grouped by day
(http://prntscr.com/a7j2sm)
My SQL select:
SELECT id, date, MAX(online)
FROM `record_online_1`
GROUP BY DAY(date)
and the result: http://prntscr.com/a7j3sp
This result is incorrect: the max online value is right, but the date and id returned with it are wrong. I have no idea how to solve this.
UPD: I am using MySQL (MariaDB)

When you use aggregate functions, every item in the SELECT list that isn't part of an aggregate function has to appear in the GROUP BY clause. In T-SQL, you simply cannot execute the above query unless you also GROUP BY "id", for example. Some database systems let you skip this rule, but they are not smart enough to know which ID to bring back to you. You should only rely on this if, for example, all "ids" in that group are the same.
So what should you do? Do this in two steps. Step one: find the max values. You will lose the ID and DATETIME data.
SELECT DAY(date) AS Date, MAX(online) AS MaxOnline
FROM `record_online_1` GROUP BY DAY(date)
The above gets you a list of days with the max for each day. INNER JOIN this to the original "record_online_1" table, joining on both the day and the max value. You can use a CTE, temp table, subquery, etc. to do this.
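
Here is a minimal sketch of the two-step approach using a subquery, run through Python's sqlite3 module so it can be tried without a MySQL server. The sample rows are made up, and SQLite's date(date) replaces MySQL's DAY(date) as the grouping key (an assumption on my part: it groups by full calendar day, so the same day number in different months is not merged).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE record_online_1 (id INTEGER PRIMARY KEY, date TEXT, online INTEGER)"
)
# Hypothetical sample data: two readings per day
conn.executemany(
    "INSERT INTO record_online_1 (id, date, online) VALUES (?, ?, ?)",
    [
        (1, "2016-02-01 10:00:00", 50),
        (2, "2016-02-01 20:00:00", 80),
        (3, "2016-02-02 09:00:00", 120),
        (4, "2016-02-02 21:00:00", 90),
    ],
)

rows = conn.execute(
    """
    SELECT r.id, r.date, r.online
    FROM record_online_1 AS r
    INNER JOIN (
        -- Step one: the maximum per calendar day
        SELECT date(date) AS day, MAX(online) AS max_online
        FROM record_online_1
        GROUP BY date(date)
    ) AS m
      -- Step two: join back on both the day and the max value
      ON date(r.date) = m.day AND r.online = m.max_online
    ORDER BY r.date
    """
).fetchall()

for row in rows:
    print(row)
```

If two rows on the same day share the maximum, the join returns both of them, so add a tie-breaker if you need exactly one row per day.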
EDIT: I found an answer that is more eloquent than my own.
