Hive query doesn't allow >= in the where clause

I have a Hive query which returns the max_date (single row, single column), which is to be used against the master table for filtering data. I use a temp table to hold the max_date logic:
create temporary table umt (processdate date);
insert into umt (
select
code for max_date from a_table
where processname = 'a_process'
);
I want to use this max_date in my master_table query.
If I use a hardcoded date to filter master_table, I get the results in 1-2 minutes:
select data from master_table where process_date >= '2020-09-01'; -- 1-2 mins
However, if I do a join it takes more than 15 minutes, even though master_table is partitioned on the process_date column:
select data from master_table inner join umt on process_date >= umt.processdate; -- ~15 mins
Is there any way to improve the performance, or any way to substitute the max_date into the query to avoid the join?
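One common workaround is to split this into two steps: compute max_date first, then substitute it as a literal so Hive can prune partitions at compile time, exactly like the fast hardcoded query. A sketch, assuming you can drive Hive from a shell script (the file name and the placeholder query below are illustrative):
-- Step 1 (shell): capture the single value into a shell variable.
--   max_date=$(hive -S -e "<your max_date query against a_table>")
-- Step 2 (shell): pass it back in as a Hive variable.
--   hive --hivevar max_date="$max_date" -f filter_master.sql
-- filter_master.sql: the variable is substituted as plain text before planning,
-- so this behaves like the hardcoded-date query and partitions are pruned.
select data
from master_table
where process_date >= '${hivevar:max_date}';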

Combining overlapping date ranges without using a cross join in BigQuery

If I have this dataset:
create schema if not exists dbo;
create table if not exists dbo.player_history(team_id INT, player_id INT, active_from TIMESTAMP, active_to TIMESTAMP);
truncate table dbo.player_history;
INSERT INTO dbo.player_history VALUES(1,1,'2020-01-01', '2020-01-08');
INSERT INTO dbo.player_history VALUES(1,2,'2020-06-01', '2020-09-08');
INSERT INTO dbo.player_history VALUES(1,3,'2020-06-10', '2020-10-01');
INSERT INTO dbo.player_history VALUES(1,4,'2020-02-01', '2020-02-15');
INSERT INTO dbo.player_history VALUES(1,5,'2021-01-01', '2021-01-08');
INSERT INTO dbo.player_history VALUES(1,6,'2021-01-02', '2022-06-08');
INSERT INTO dbo.player_history VALUES(1,7,'2021-01-03', '2021-06-08');
INSERT INTO dbo.player_history VALUES(1,8,'2021-01-04', '2021-06-08');
INSERT INTO dbo.player_history VALUES(1,9,'2020-01-02', '2021-02-05');
INSERT INTO dbo.player_history VALUES(1,10,'2020-10-01', '2021-04-08');
INSERT INTO dbo.player_history VALUES(1,11,'2020-11-01', '2021-05-08');
and I want to combine overlapping date ranges, so that I can identify 'islands' where at least one player was active. I can do a cross join and a correlated subquery to get the results, as such:
with data_set as (
SELECT
a.team_id
, a.active_from
, ARRAY_AGG(b.active_to ORDER BY b.active_to DESC LIMIT 1)[SAFE_OFFSET(0)] AS active_to
FROM dbo.player_history a
LEFT JOIN dbo.player_history b
on a.team_id = b.team_id
where a.active_from between b.active_from and b.active_to
group by 1,2
)
select team_id
, min(active_from) as active_from
, active_to
from data_set
group by 1,3
order by active_from, active_to
This gives me the desired results; however, with a larger data set this approach is not feasible, and BigQuery does not recommend doing joins in such a manner. Looking at the execution plan, it's mostly the join that causes the slowness. Is there a more efficient way to achieve the desired output?
You can use a partitioned table to get better performance with large amounts of data. Partitioned tables divide a large table into smaller partitions, which improves query performance. Partitions are based on a TIMESTAMP, DATE, or DATETIME column.
An option could be:
Create a partitioned table
Load the data in the partitioned table
Execute the query
You can see this example below. With this query, you create a partitioned table and load the data in one step. The first load may take some time, but subsequent queries against the partitioned table will be much faster.
CREATE TABLE
mydataset.newtable (transaction_id INT64, transaction_date DATE)
PARTITION BY
transaction_date
AS SELECT transaction_id, transaction_date FROM mydataset.mytable
Then execute the query
SELECT transaction_id, transaction_date FROM mydataset.newtable
WHERE transaction_date BETWEEN start_date AND finish_date
There are some limitations when using partitioned tables, for example around cached query results. Also, see the documentation on points to consider for getting the best performance out of a query.
A very fast query to obtain, for each team, a list of time periods with at least one player active:
create temporary function test(a array<date>, b array<date>)
returns array<struct<a date, b date>>
language js
as """
// a: interval start dates sorted ascending; b: the matching end dates.
var out = [];
var start = a[0];
var end = a[0];
for (var i = 0; i < a.length; i++) {
  if (a[i] <= end) {
    // overlaps the current island: extend its end if needed
    if (end < b[i]) end = b[i];
  } else {
    // gap found: close the current island and start a new one
    out.push({"a": start, "b": end});
    start = a[i];
    end = b[i];
  }
}
out.push({"a": start, "b": end});
return out;
""";
-- the sample columns are TIMESTAMP, so cast to match the UDF's array<date> arguments
select team_id,
       test(array_agg(date(active_from) order by active_from),
            array_agg(date(active_to) order by active_from)) as active_periods
from dbo.player_history
group by 1
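For reference, with the sample data above every interval for team 1 chains into the next (player 9 bridges 2020-01-02 to 2021-02-05, player 10 picks up at 2020-10-01, and player 6 runs out to 2022-06-08), so this should return a single island per team: (2020-01-01, 2022-06-08).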
Your posted result is confusing: it shows a start date that falls within the previous time segment.
If your players are only active for a few years on average, this query gives a list of all dates on which a team had one active player or fewer:
with tbl_lst as (
  select team_id,
         date_diff(date(active_to), date(active_from), day) as active_days,
         generate_date_array(date(active_from), date(active_to), interval 1 day) as day_list
  from dbo.player_history
)
select team_id, day, sum(active_players) as active_players
from (
  select team_id, day, count(1) as active_players
  from tbl_lst, unnest(tbl_lst.day_list) as day
  group by 1, 2
  union all
  select team_id, day, 0
  from (
    select team_id, min(active_from) as team_start, max(active_from) as team_end
    from dbo.player_history
    group by 1
  ), unnest(generate_date_array(date(team_start), date(team_end), interval 1 day)) day
)
group by 1, 2
having active_players < 2
The following query needs 16 stages and is slow, but it obtains the number of active players for every time interval. Two tables are joined, and data_set contains only the dates in the interval, so it holds at most 3,650 rows for 10 years.
# generate a list of all dates
with dates as (
  select active_from as start_date from dbo.player_history
  union all select active_to from dbo.player_history
  # union all select timestamp("2050-01-01")
),
# add the next date to this list
data_set as (
  select distinct start_date,
         lead(start_date) over (order by start_date) as end_date
  from dates
)
# count players active in each interval
select team_id, start_date, end_date,
       count(player_id) as active_player,
       string_agg(cast(player_id as string)) as player_list
from dbo.player_history
right join data_set
  on active_from <= start_date and active_to >= end_date
group by 1, 2, 3
having active_player < 2
order by start_date
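For completeness, a pure-SQL alternative worth sketching (untested and not benchmarked; it uses only the dbo.player_history columns above): the classic gaps-and-islands pattern with window functions avoids both the self-join and the UDF. A new island starts whenever an interval begins after the running maximum of all earlier end dates, and a running sum of those start flags numbers the islands:
with flagged as (
  select team_id, active_from, active_to,
         case
           when active_from > max(active_to) over (
                  partition by team_id
                  order by active_from
                  rows between unbounded preceding and 1 preceding)
           then 1 else 0
         end as island_start
  from dbo.player_history
),
numbered as (
  select *,
         sum(island_start) over (partition by team_id order by active_from) as island_id
  from flagged
)
select team_id, min(active_from) as active_from, max(active_to) as active_to
from numbered
group by team_id, island_id
order by active_from
Because everything here is a window over (team_id, active_from), the work is essentially a sort rather than a join, which should scale much better.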

Select difference between two tables

I want to list four columns: date, hourly count, daily count, and the difference between the two counts.
I have used UNION ALL over the two tables, but I am getting two rows per date:
Select a.date, a.hour,b.daily,sum(a.hour-b.daily)
from (select date,count(*) hour,''daily
From table a union all select '' hour,count(*) daily from table b)
Group by date, daily, hourly..
Please suggest a solution.
I see that the code supplied uses a UNION to achieve the output. This would be better served by using a JOIN of some kind.
The result is the per-date row count of table_b subtracted from the per-date row count of table_a.
This code is untested but should give a good indication of how to achieve this:
SELECT a.date,
a.hour,
ISNULL(b.daily, 0) AS daily,
a.hour - ISNULL(b.daily, 0) AS difference
FROM (
SELECT date,
COUNT(*) AS hour
FROM table_a
GROUP BY date
) a
LEFT JOIN (
SELECT date,
COUNT(*) AS daily
FROM table_b
GROUP BY date
) b ON b.date = a.date
ORDER BY a.date;
This works by:
Calculating the count per date in table_a.
Calculating the count per date in table_b.
Joining all results from table_a with those matching in table_b.
Outputting the date, the hour from table_a, the daily (or 0 if NULL) from table_b, and the difference between the two.
Notes:
I have renamed table a and table b to table_a and table_b. I presume these are not the actual table names.
An INNER JOIN may be preferable if you only want results that have matching date columns in both tables. Using the LEFT JOIN will return all results from table_a regardless of whether table_b has an entry.
I'm not convinced that date is an allowed column name, but I have reproduced it in the code as per the example given by the OP.
Your method is fine, but your GROUP BY columns are not correct:
Select date, sum(hourly) as hourly, sum(daily) as daily,
sum(hourly) - sum(daily) as diff
from ((select date, count(*) as hourly, 0 as daily
from table a
group by date
) union all
(select date, 0 as hourly, count(*) as daily
from table b
group by date
)
) ab
group by date;
The key idea is that the outer query aggregates only by date -- and you still need aggregation functions there as well.
You have other errors in your subquery, such as missing group bys and date columns. I assume those are transcription errors.

Finding Max(Date) BEFORE specified date in Redshift SQL

I have a table (Table A) in SQL (AWS Redshift) where I've isolated my beginning population, containing account IDs and dates. I'd like to take the output from that table and LEFT JOIN back to the "accounts" table to ONLY return the start date that comes directly before the date stored in my output table.
Table A (Beg Pop)
-------
select account_id,
min(start_date),
min(end_date)
from accounts
group by 1;
I want to return ONLY the date that precedes the date in my current table where the account_id values match. I'm looking for something like...
Table B
-------
select a.account_id,
a.start_date,
a.end_date,
b.start_date_prev,
b.end_date_prev
from accounts as a
left join accounts as b on a.account_id = b.account_id
where max(b.start_date) < a.start_date;
Ultimately, I want to return everything from table A and only the dates where max(start_date) is less than the start_date from table A. I know aggregation is not allowed in the WHERE clause, and I guess I could use a subquery, but I only want the max date BEFORE the dates in my output. Any suggestions are greatly appreciated.
I want to return ONLY the date that precedes the date in my current table where account_id match
If you want the previous date for a given row, use lag():
select a.*,
lag(start_date) over (partition by account_id order by start_date) as prev_start_date
from accounts a;
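If you also need the previous end date, as in your Table B sketch, lag() can carry any column from the preceding row, so both columns come from the same window (a sketch in standard Redshift window syntax):
select a.account_id,
       a.start_date,
       a.end_date,
       lag(start_date) over (partition by account_id order by start_date) as start_date_prev,
       lag(end_date) over (partition by account_id order by start_date) as end_date_prev
from accounts a;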
As I understand it, the requirement is to display all rows from a base table together with the preceding row's data, sorted on a column and subject to some conditions.
Please check the following example, which I took from the article Select Next and Previous Rows with Current Row using SQL CTE Expression:
WITH CTE as (
SELECT
ROW_NUMBER() OVER (PARTITION BY account_id ORDER BY start_date) as RN,
*
FROM accounts
)
SELECT
PreviousRow.*,
CurrentRow.*,
NextRow.*
FROM CTE as CurrentRow
LEFT JOIN CTE as PreviousRow ON
PreviousRow.RN = CurrentRow.RN - 1 and PreviousRow.account_id = CurrentRow.account_id
LEFT JOIN CTE as NextRow ON
NextRow.RN = CurrentRow.RN + 1 and NextRow.account_id = CurrentRow.account_id
ORDER BY CurrentRow.account_id, CurrentRow.start_date;
I tested with the following sample data and it seems to be working:
create table accounts(account_id int, start_date date, end_date date);
insert into accounts values (1,'20201001','20201003');
insert into accounts values (1,'20201002','20201005');
insert into accounts values (1,'20201007','20201008');
insert into accounts values (1,'20201011','20201013');
insert into accounts values (2,'20201001','20201002');
insert into accounts values (2,'20201015','20201016');

Faster sql query for Oracle database

I have an Oracle table with over 2 million rows and 200 columns. I'm trying to query data in five columns where one of the columns is equal to the most recent date. The query below works but seems to take long (over 2 minutes) to process. Is there different logic I can use to speed up the query?
SELECT a,b,c,date,e FROM table a WHERE a.date = (SELECT MAX(date) FROM table)
Try ROWNUM:
SELECT * FROM (
    SELECT A, B, C, DATE, E FROM TABLE A ORDER BY A.DATE DESC
)
WHERE ROWNUM = 1
You can also use another, similar solution:
with tbl1 as (
    SELECT A, B, C, DATE, E,
           first_value(date) over (order by date desc) maxdate
    FROM TABLE A
)
select A, B, C, DATE, E from tbl1 where date = maxdate
Yes! Create an index on the date column of your table, e.g. I_TABLE_DATE, and then change your query to this:
SELECT --+ index_desc(a I_TABLE_DATE)
a,b,c,date,e
FROM table a
WHERE a.date = (SELECT --+ index_desc(b I_TABLE_DATE)
b.Date
FROM table b
where Rownum = 1)
It will be faster because finding the maximum date is just a descending index scan, and the main query can also work through the index, so you don't need to scan the whole table.
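Another single-pass pattern worth sketching (illustrative names only: my_table stands in for your table and date_col for your date column, since DATE itself is a reserved word in Oracle): rank the rows by date and keep the top rank, which also preserves ties on the most recent date.
SELECT a, b, c, date_col, e
FROM (
    SELECT a, b, c, date_col, e,
           DENSE_RANK() OVER (ORDER BY date_col DESC) AS rnk
    FROM my_table
)
WHERE rnk = 1;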

SQL INSERT INTO with SELECT statement and INNER JOIN with where clause

I wrote this query to insert data from one table into another, with the condition that rows whose CHECKTIME hour is >= 12 should be inserted into the Att_process table. The query executes successfully on SQL Server, but no data gets inserted, even though rows with hours > 12 do exist in the table.
INSERT INTO Att_process(USERID,checkout_time)
SELECT
CHECKINOUT.USERID, CHECKINOUT.CHECKTIME
FROM
CHECKINOUT
INNER JOIN
Att_process ON CHECKINOUT.USERID = Att_process.USERID
WHERE
DATEPART(HOUR, CHECKTIME) >= 12;
Can anyone help me with this? It would be really appreciated.
Is there already data in your Att_process table? You are joining on USERID of the Att_process table while at the same time trying to insert into the very table you are joining; if Att_process is empty, the inner join produces no rows. Please tell us about the Att_process table and its relation to the CHECKINOUT table.
Probably what you need is:
INSERT INTO Att_process (USERID, checkout_time)
SELECT
    CHECKINOUT.USERID, CHECKINOUT.CHECKTIME
FROM
    CHECKINOUT
WHERE
    DATEPART(HOUR, CHECKTIME) >= 12;
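As a quick sanity check after running the insert (illustrative only), querying Att_process for rows before noon should come back empty:
-- Expect zero rows: every inserted checkout_time has hour >= 12.
SELECT USERID, checkout_time
FROM Att_process
WHERE DATEPART(HOUR, checkout_time) < 12;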