Select from a Hive Table where date less than max date - hive

I'm trying to select * from a Hive table where a date column, say TRANS_DATE, is no earlier than 365 days before the maximum TRANS_DATE.
Below is the query I've tried so far:
select * from TABLE
where (TRANS_DATE > DATE_SUB(max(TRANS_DATE), 365)) and
(TRANS_DATE < max(TRANS_DATE));
Below is the error I have gotten:
"Error while compiling statement: FAILED: SemanticException [Error 10128]: Line 2:28 Not yet supported place for UDAF 'max'"
An example of the date format is: "2006-05-30 00:00:00.0"
The query will be used to read data from a Hive table into QlikView, so ideally I'd rather not define variables beforehand and would prefer to do the select dynamically. Apologies if any of this is daft, as I'm new to Hive.

Calculate the max date in a subquery and cross join it with the table:
select *
from TABLE a
cross join --cross join with single row
(select max(TRANS_DATE) as max_trans_date from TABLE) b
where (a.TRANS_DATE > DATE_SUB(b.max_trans_date, 365))
and (a.TRANS_DATE < b.max_trans_date);
With an analytic function:
select a.* from
(select a.*,
max(TRANS_DATE) over() as max_trans_date
from TABLE a) a
where (a.TRANS_DATE > DATE_SUB(a.max_trans_date, 365))
and (a.TRANS_DATE < a.max_trans_date);
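To try the cross-join idea on a few rows without a Hive cluster, here is a minimal sketch in Python using the standard sqlite3 module as a stand-in. The table name `trans`, the `id` column and the sample rows are made up for illustration, and SQLite's `date(..., '-365 days')` plays the role of Hive's `DATE_SUB`; this is not Hive syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows; dates follow the format from the question.
conn.execute("CREATE TABLE trans (id INTEGER, trans_date TEXT)")
conn.executemany(
    "INSERT INTO trans VALUES (?, ?)",
    [
        (1, "2005-01-15 00:00:00.0"),  # more than 365 days before the max
        (2, "2005-12-01 00:00:00.0"),  # inside the window
        (3, "2006-03-10 00:00:00.0"),  # inside the window
        (4, "2006-05-30 00:00:00.0"),  # the max itself, excluded by the strict <
    ],
)

rows = conn.execute(
    """
    SELECT a.id
    FROM trans a
    CROSS JOIN (SELECT max(trans_date) AS max_trans_date FROM trans) b  -- single row
    WHERE substr(a.trans_date, 1, 10)
              > date(substr(b.max_trans_date, 1, 10), '-365 days')
      AND a.trans_date < b.max_trans_date
    ORDER BY a.id
    """
).fetchall()

window_ids = [r[0] for r in rows]
print(window_ids)
```

Because both comparisons are strict, the row holding the maximum date is left out; switch to `<=` if it should be included.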

Related

Restructure a query in Impala/Hive that is using subquery to create new column in table

I am converting a SQL query to Impala. The SQL query uses a subquery in the select list to create a new column, as follows:
select *, (select min(day)
from date_series
where day > t.work_day) as next_work_day
from table1 t
However, Impala does not support a subquery in the select list for creating a new column, so this query fails. Can I please get help rewriting this query in a way Impala can execute?
Purpose of Query: Find the next working day for the work_day column.
table1 is the outer table and contains 4 columns, including the work_day column.
date_series contains all working dates starting from 2019-06-18 up to current_day + 5, like
2019-06-20
2019-06-21
2019-06-24
.
.
I think you can do this:
select t.*, ds.next_day
from table1 t left join
(select ds.*, lead(day) over (order by day) as next_day
from date_series ds
) ds
on t.current_work_day >= ds.day and
(t.current_work_day < ds.next_day or ds.next_day is null);
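You can rewrite your query as follows, computing the next working day once per distinct work day in a subquery and joining the result back:
select
t.*,
ds.next_work_day
from table1 t
left join (
select
w.current_work_day,
min(d.day) as next_work_day
from (select distinct current_work_day from table1) w
join date_series d
on d.day > w.current_work_day
group by w.current_work_day
) ds
on t.current_work_day = ds.current_work_day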
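A quick way to sanity-check a join-based rewrite like this is to run it on the sample dates from the question. The sketch below uses Python's standard sqlite3 module as a stand-in for Impala; the `id` column, the three table1 rows and the column name `work_day` (taken from the question) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE date_series (day TEXT);
    INSERT INTO date_series VALUES ('2019-06-20'), ('2019-06-21'), ('2019-06-24');
    -- Hypothetical contents for table1; only work_day comes from the question.
    CREATE TABLE table1 (id INTEGER, work_day TEXT);
    INSERT INTO table1 VALUES (1, '2019-06-20'), (2, '2019-06-21'), (3, '2019-06-24');
    """
)

rows = conn.execute(
    """
    SELECT t.id, t.work_day, n.next_work_day
    FROM table1 t
    LEFT JOIN (
        -- smallest calendar day strictly after each distinct work_day
        SELECT w.work_day, min(d.day) AS next_work_day
        FROM (SELECT DISTINCT work_day FROM table1) w
        JOIN date_series d ON d.day > w.work_day
        GROUP BY w.work_day
    ) n ON n.work_day = t.work_day
    ORDER BY t.id
    """
).fetchall()

next_days = {work_day: next_day for _, work_day, next_day in rows}
print(next_days)
```

The last date in the series has no later day, so the left join leaves its next_work_day as NULL (None in Python), which matches what the original scalar subquery would return.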

SQL query to sum a column prior to date and show all entries after that date

I have a table where limits were sanctioned to customers.
I am trying to get, for each entry, the total amount sanctioned up to that entry's date.
I am trying the code below, but it sums the total sanction amount:
select gam.id, sum(SANCTION_AMOUNT) from gam
join (select ID,ACCOUNT_OPEN_DATE from gam where ACCOUNT_OPEN_DATE between'01-04-2019' and '30-04-2019' AND SCHEME_CODE IN ('SB','CCKLY')) ) action
on( gam.ACCOUNT_OPEN_DATE <=action.ACCOUNT_OPEN_DATE and gam.id=action.cust_id) group by gam.id;
In Oracle, this can be a way:
select id, sanction_amount, scheme_code, account_open_date,
sum(sanction_amount) over (partition BY ID order by account_open_date) as total_sanction_amount
from gam
order by account_open_date
I'm not sure whether your database is MySQL or Oracle, but the script below should work in most databases. Just adjust the table and column names accordingly.
SELECT *,
(
SELECT SUM(sanction_Amount)
FROM Your_Table B
WHERE B.ID = A.ID
AND B.acc_open_date <= A.acc_open_date
) Total_sanction_Amount
FROM Your_Table A
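As a rough check of the correlated-subquery version, here is a small sketch using Python's standard sqlite3 module. The sample rows are invented, and the dates are stored as ISO strings (YYYY-MM-DD) so that `<=` compares them chronologically; with DD-MM-YYYY strings like those in the question, a plain string comparison would not sort correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical rows; column names follow the question's gam table.
    CREATE TABLE gam (id INTEGER, sanction_amount INTEGER, account_open_date TEXT);
    INSERT INTO gam VALUES
        (1, 100, '2019-04-01'),
        (1, 250, '2019-04-10'),
        (1,  50, '2019-04-20'),
        (2, 300, '2019-04-05');
    """
)

rows = conn.execute(
    """
    SELECT a.id, a.account_open_date, a.sanction_amount,
           (SELECT sum(b.sanction_amount)
            FROM gam b
            WHERE b.id = a.id
              AND b.account_open_date <= a.account_open_date) AS total_sanction_amount
    FROM gam a
    ORDER BY a.id, a.account_open_date
    """
).fetchall()

running_totals = [r[3] for r in rows]
print(running_totals)
```

Each row keeps its own sanction_amount and gains a running total per id, which is the same result the windowed `sum(...) over (partition by id order by account_open_date)` produces when dates are unique within an id.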

Faster sql query for Oracle database

I have an Oracle database table with over 2 million rows and 200 columns. I'm trying to query data in five columns where one of the columns is equal to the most recent date. The query below works but takes a long time (over 2 min) to process. Is there different logic I can use to speed up the query?
SELECT a,b,c,date,e FROM table a WHERE a.date = (SELECT MAX(date) FROM table)
Try ROWNUM:
SELECT * FROM (
SELECT A,B,C,DATE,E FROM TABLE A ORDER BY A.DATE DESC
)
WHERE ROWNUM=1
Note that this returns only a single row, even if several rows share the most recent date.
You can also use another, similar solution:
with tbl1 as
(SELECT A,B,C,DATE,E,
first_value(date) over( order by date desc) maxdate
FROM TABLE A)
select A,B,C,DATE,E from tbl1 where date = maxdate
Yes! Create an index on the date column named I_TABLE_DATE, then change your query to this:
SELECT --+ index_desc(a I_TABLE_DATE)
a,b,c,date,e
FROM table a
WHERE a.date = (SELECT --+ index_desc(b I_TABLE_DATE)
b.Date
FROM table b
where Rownum = 1)
This will be faster because finding the maximum date only requires scanning the index in descending order, and your main query will also work through the index, so you don't need to scan the whole table.

Exclude values from same column depending on date, using Hive SQL

I need to extract a set of IDs from a table using Hive. The table from which I am to extract the data is partitioned by date. What I need are distinct IDs that appear in the table eight days ago but are not in the table for dates that represent the last seven days. I have tried using a subquery:
SELECT DISTINCT id
FROM my_table
WHERE date = '2016-07-14'
AND id NOT IN (
SELECT DISTINCT id
FROM my_table
WHERE date BETWEEN '2016-07-15' AND '2016-07-21'
);
However, I am getting an error message containing Unsupported language features in query (entire error message is too long to post here). Since I cannot use this approach in Hive SQL, what are my options here?
The same functionality can be done using LEFT JOIN:
SELECT a.ID
FROM
(
SELECT DISTINCT ID
FROM my_table
WHERE date = '2016-07-14'
)a
LEFT JOIN (
SELECT DISTINCT ID
FROM my_table
WHERE date BETWEEN '2016-07-15' AND '2016-07-21'
) s on a.ID=s.ID
WHERE s.ID IS NULL;
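The LEFT JOIN ... IS NULL pattern can be tried on a handful of rows with Python's standard sqlite3 module. This is only a sketch: the sample rows are invented, and the partition column is called `dt` here (an assumed name) to avoid any clash with the `date` keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, dt TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [
        (1, "2016-07-14"),  # seen eight days ago only: should be returned
        (2, "2016-07-14"),
        (2, "2016-07-16"),  # id 2 reappears in the last seven days
        (3, "2016-07-14"),
        (3, "2016-07-21"),  # id 3 reappears on the last day of the range
        (4, "2016-07-18"),  # never seen on 2016-07-14
    ],
)

rows = conn.execute(
    """
    SELECT a.id
    FROM (SELECT DISTINCT id FROM my_table WHERE dt = '2016-07-14') a
    LEFT JOIN (
        SELECT DISTINCT id FROM my_table
        WHERE dt BETWEEN '2016-07-15' AND '2016-07-21'
    ) s ON a.id = s.id
    WHERE s.id IS NULL  -- keep only ids with no match in the later range
    ORDER BY a.id
    """
).fetchall()

only_old_ids = [r[0] for r in rows]
print(only_old_ids)
```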

How to get number of invoices for last 12 weeks in Postgres

Invoice database contains invoice dates:
create table dok (
dokumnr serial primary key,
invoicedate date not null
);
The dashboard requires a comma-separated list containing the number of invoices for each of the last 12 weeks, e.g.
4,8,0,6,7,6,0,6,0,4,5,6
The list always contains 12 elements. If there are no invoices for some 7-day interval, 0 should appear.
Every element should contain the number of invoices for 7 days.
The query should find the maximum date before the current date:
select max(invoicedate) as last_date from dok;
And after that probably use count(*) and string_agg() to create list.
Last (12th) element should contain number of invoices for
last_date .. last_date-interval'6days'
11th element (one before last) should contain number of invoices for days
last_date-interval'7days' .. last_date-interval'14days'
etc.
How can I write this query in Postgres 9.1+?
This is an ASP.NET MVC3 C# application, and some parts of the query can also be done in C# code if this helps.
I ended up with
with list as (
SELECT count(d.invoicedate) as cnt
FROM (
SELECT max(invoicedate) AS last_date
FROM dok
WHERE invoicedate< current_date
) l
CROSS JOIN generate_series(0, 11*7, 7) AS g(days)
LEFT JOIN dok d ON d.invoicedate> l.last_date - g.days - 7
AND d.invoicedate<= l.last_date - g.days
GROUP BY g.days
ORDER BY g.days desc
)
SELECT string_agg( cnt::text,',')
from list
CROSS JOIN the latest date to generate_series(), followed by a LEFT JOIN to the main table.
SELECT ARRAY(
SELECT count(d.invoicedate) AS ct
FROM (
SELECT max(invoicedate) AS last_date
FROM dok
WHERE invoicedate < current_date -- "maximum date before current date"
) l
CROSS JOIN generate_series(0, 11*7, 7) AS g(days)
LEFT JOIN dok d ON d.invoicedate > l.last_date - g.days - 7
AND d.invoicedate <= l.last_date - g.days
GROUP BY g.days
ORDER BY g.days
);
Assuming there is at least one valid entry in the table,
this returns an array of bigint (bigint[]) with the latest week first.
current_date depends on the timezone setting of your session.
If you need the result to be a comma-separated string you could use another query layer with string_agg() instead. Or you feed the above to array_to_string():
SELECT array_to_string(ARRAY(SELECT ...), ',');
Your query audited:
It relies on the sorted rows from the CTE reaching string_agg() in that same order. That works as an implementation detail, but it's documented:
The aggregate functions array_agg, json_agg, jsonb_agg,
json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as
well as similar user-defined aggregate functions, produce meaningfully
different result values depending on the order of the input values.
This ordering is unspecified by default, but can be controlled by
writing an ORDER BY clause within the aggregate call, as shown in
Section 4.2.7. Alternatively, supplying the input values from a
sorted subquery will usually work. For example:
SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
Beware that this approach can fail if the outer query level contains
additional processing, such as a join, because that might cause the
subquery's output to be reordered before the aggregate is computed.
Bold emphasis mine.
To stay standard compliant, you could write:
WITH list AS (
SELECT g.days, count(d.invoicedate)::text AS cnt
FROM (
SELECT max(invoicedate) AS last_date
FROM dok
WHERE invoicedate < current_date
) l
CROSS JOIN generate_series(0, 11*7, 7) AS g(days)
LEFT JOIN dok d ON d.invoicedate > l.last_date - g.days - 7
AND d.invoicedate <= l.last_date - g.days
GROUP BY 1
)
SELECT string_agg(cnt, ',' ORDER BY days DESC)
FROM list;
But this is a bit slower. Also, the CTE is not technically necessary and is itself a bit slower than a subquery.
SELECT array_to_string(ARRAY( SELECT ...), ',') like I proposed is fastest because the array constructor is faster for a single result than the aggregate function string_agg().
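To make the bucket boundaries used by the queries above easier to follow, here is a plain-Python sketch of the same logic: the latest invoice date before today is the anchor, and each bucket covers the half-open range (upper - 7 days, upper]. The function name `weekly_counts` and the sample dates are hypothetical; this only illustrates the arithmetic and is not a replacement for the SQL.

```python
from datetime import date, timedelta


def weekly_counts(invoice_dates, today, weeks=12):
    """Return comma-separated invoice counts per 7-day bucket, oldest first."""
    earlier = [d for d in invoice_dates if d < today]
    if not earlier:
        # No anchor date: every bucket is empty.
        return ",".join(["0"] * weeks)
    last_date = max(earlier)  # "maximum date before current date"
    counts = []
    # Offsets 77, 70, ..., 0 days back from last_date, so the latest week is last.
    for offset in range((weeks - 1) * 7, -1, -7):
        upper = last_date - timedelta(days=offset)
        lower = upper - timedelta(days=7)
        counts.append(sum(1 for d in invoice_dates if lower < d <= upper))
    return ",".join(str(c) for c in counts)


sample = [
    date(2024, 1, 10), date(2024, 1, 10),  # latest week: 2 invoices
    date(2024, 1, 3),                      # one week earlier
    date(2023, 12, 1),                     # six buckets back
    date(2024, 1, 25),                     # after "today", ignored
]
result = weekly_counts(sample, today=date(2024, 1, 20))
print(result)
```

With the sample data the anchor is 2024-01-10, so the final element counts invoices dated 2024-01-04 through 2024-01-10 inclusive.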