“Hive” max column value from multiple columns - sql

Hi: I need to find the max value across 3 calculated fields and store it in another field. Is it possible to do this in one SQL query? Below is an example of what I am after:
SELECT Income1 ,
Income1 * 2% as Personal_Income ,
Income2 ,
Income2 * 10% as Share_Income ,
Income3 ,
Income3 * 1% as Job_Income ,
Max(Personal_Income, Share_Income, Job_Income )
From Table
One approach I tried was to calculate Personal_Income, Share_Income, and Job_Income in a first pass, and then in a second pass use
Select
Case when Personal_income > Share_Income and Personal_Income > Job_Income
then Personal_income
when Share_income > Job_Income
then Share_income
Else Job_income End as greatest_income
but this requires two scans of a billion-row table. How can I avoid this and do it in a single pass? Any help is much appreciated.

As of Hive 1.1.0 you can use the greatest() function. This query does it in a single table scan:
select Income1 ,
Personal_Income ,
Income2 ,
Share_Income ,
Income3 ,
Job_Income ,
greatest(Personal_Income, Share_Income, Job_Income ) as greatest_income
from
(
SELECT Income1 ,
Income1 * 0.02 as Personal_Income , -- 2%
Income2 ,
Income2 * 0.10 as Share_Income , -- 10%
Income3 ,
Income3 * 0.01 as Job_Income -- 1%
From Table
)s
;
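To make the single-pass idea concrete, here is a minimal sketch that runs the same query shape against an in-memory SQLite database from Python. It is only an illustration under assumptions: SQLite has no greatest(), so its multi-argument scalar MAX() stands in for Hive's greatest(), and the table name `incomes`, the `id` column, and the sample rows are made up.

```python
import sqlite3

# In-memory database with a hypothetical "incomes" table (not from the question).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incomes (id INTEGER, income1 REAL, income2 REAL, income3 REAL)"
)
conn.executemany(
    "INSERT INTO incomes VALUES (?, ?, ?, ?)",
    [(1, 1000, 100, 500), (2, 100, 1000, 200), (3, 100, 10, 5000)],
)

# The inner query computes the derived columns; the outer query compares them.
# The table is read once. Multi-argument MAX() is SQLite's row-wise maximum and
# is used here in place of Hive's greatest().
query = """
SELECT id,
       MAX(personal_income, share_income, job_income) AS greatest_income
FROM (
    SELECT id,
           income1 * 0.02 AS personal_income,
           income2 * 0.10 AS share_income,
           income3 * 0.01 AS job_income
    FROM incomes
) s
ORDER BY id
"""
rows = conn.execute(query).fetchall()
greatest = [round(value, 6) for _, value in rows]
print(greatest)
```

Both SQLite's multi-argument MAX() and Hive's greatest() (as of Hive 2.0.0, according to the Hive documentation) return NULL when any argument is NULL, so wrap nullable columns in COALESCE if that is not what you want.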

Related

Sql query with group by takes too long

I have a very simple query, but it takes too long to run when I use MAX and GROUP BY. Could you please propose an alternative? I am running this query on Oracle 18c. (a_num_ver, id, site_id) is the primary key.
SELECT id
, site_id
, sub_id
, max(a_num_ver) as a_num_ver
, ae_no
, max(aer_ver) AS aer_ver
FROM table_1
GROUP BY id
, site_id
, sub_id
, ae_no
Try a parallel hint with a degree of 4 or 8, if your DBA allows it. I tried a similar query on a table with about 296,292,720 rows. Without hints it took around 2 minutes to execute; with PARALLEL(8) it came down to about 20 seconds.
SELECT /*+ PARALLEL(8) */
id
, site_id
, sub_id
, max(a_num_ver) as a_num_ver
, ae_no
, max(aer_ver) AS aer_ver
FROM table_1
GROUP BY id
, site_id
, sub_id
, ae_no

Row_number partition by performance

How can I improve performance when ROW_NUMBER() with PARTITION BY is used in a Hive query?
select *
from
(
SELECT
'123' AS run_session_id
, tbl1.transaction_id
, tbl1.src_transaction_id
, tbl1.transaction_created_epoch_time
, tbl1.currency
, tbl1.event_type
, tbl1.event_sub_type
, tbl1.estimated_total_cost
, tbl1.actual_total_cost
, tbl1.tfc_export_created_epoch_time
, tbl1.authorizer
, tbl1.acquirer
, tbl1.processor
, tbl1.company_code
, tbl1.country_of_account
, tbl1.merchant_id
, tbl1.client_id
, tbl1.ft_id
, tbl1.transaction_created_date
, tbl1.event_pst_time
, tbl1.extract_id_seq
, tbl1.src_type
, ROW_NUMBER() OVER(PARTITION by tbl1.transaction_id ORDER BY tbl1.event_pst_time DESC) AS seq_num -- while writing back to the pfit events table, write each event so that event_pst_time populates in right way
FROM nest.nest_cost_events tbl1 --<hiveFinalDB>-- -- DB variables won't work, so the DB needs to be changed accordingly for testing and PROD deployment
WHERE extract_id_seq BETWEEN 275 - 60
AND 275
AND event_type in('ACT','CBR','SKU','CAL','KIT','BXT' )) tbl1
where seq_num=1;
This table is partitioned by src_type.
It currently takes 20 minutes to process 154M records. I want to reduce that to 10 minutes.
Any suggestions?
Thanks

Access Query subtract 2 different column from different row in same table with same ID

I have a table deposite with the columns Refund_amt and Deposit_amt, where different rows share the same GR_no. My question is: I want to subtract the deposit_amt column from Refund_amt across those rows.
I tried various alternatives in the query but didn't succeed.
My query:
SELECT d.Gr_no
, d.Rec_No
, d.Deposite_Amt
, d.penalty_Amt
, d.Refund_Amt - Refund
, s.Name
, s.cur_std
, cur_div
From
( select d.Refund_Amt refund
from deposite d
, std_gr s
where d.Gr_no = s.Gr_no )
Result would look like this in final total column :
Thank you
You are looking for an aggregation per Gr_no: the sum of the deposits minus the sum of the refunds. One way is to do this aggregation in a subquery and join the subquery back to your table.
select
d.*, sums.final_total
from deposite d
inner join
(
select gr_no, nz(sum(deposite_amt),0) - nz(sum(refund_amt),0) as final_total
from deposite
group by gr_no
) as sums on sums.gr_no = d.gr_no
order by d.rec_no;
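The aggregate-then-join pattern can be tried outside Access with a small sketch like the one below. It is an approximation under assumptions: it uses SQLite from Python, COALESCE takes the place of Access's Nz(), the grouping key gr_no is taken from the join condition in the question, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deposite ("
    "rec_no INTEGER, gr_no INTEGER, deposite_amt REAL, refund_amt REAL)"
)
# Hypothetical rows: deposits and refunds for the same gr_no sit on different rows.
conn.executemany(
    "INSERT INTO deposite VALUES (?, ?, ?, ?)",
    [
        (1, 101, 500, None),
        (2, 101, None, 200),
        (3, 102, 400, None),
    ],
)

# The subquery computes one total per gr_no; the join copies that total
# onto every detail row of the same gr_no.
query = """
SELECT d.rec_no, d.gr_no, sums.final_total
FROM deposite d
JOIN (
    SELECT gr_no,
           COALESCE(SUM(deposite_amt), 0) - COALESCE(SUM(refund_amt), 0)
               AS final_total
    FROM deposite
    GROUP BY gr_no
) AS sums ON sums.gr_no = d.gr_no
ORDER BY d.rec_no
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Every row of gr_no 101 carries the same total (500 - 200), which is the point of joining the aggregate back instead of subtracting within a single row.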

SQL query, select from 2 tables random

Hello all, I have a problem that I just can't get to work the way I want.
I want to show news and reviews (2 tables) in random order, so that the output is not the same every time.
Here is my query. I really hope someone can explain what I am doing wrong.
SELECT
anmeldelser.billed_sti ,
anmeldelser.overskrift ,
anmeldelser.indhold ,
anmeldelser.id ,
anmeldelser.godkendt
FROM
anmeldelser
LIMIT 0,6
UNION ALL
SELECT
nyheder.id ,
nyheder.billed_sti ,
nyheder.overskrift ,
nyheder.indhold ,
nyheder.godkendt
FROM nyheder
ORDER BY rand() LIMIT 0,6
First off, it looks like the column order of the two SELECT statements doesn't match, which it needs to for a UNION.
What does the following return?
SELECT
anmeldelser.billed_sti ,
anmeldelser.overskrift ,
anmeldelser.indhold ,
anmeldelser.id ,
anmeldelser.godkendt
FROM
anmeldelser
LIMIT 0,6
UNION ALL
SELECT
nyheder.billed_sti ,
nyheder.overskrift ,
nyheder.indhold ,
nyheder.id ,
nyheder.godkendt
FROM nyheder
ORDER BY rand() LIMIT 0,6
(which RDBMS are you using? the SQL you have is not valid for Sybase but there may be techniques depending on the 'flavour' of SQL you are using)
Since RAND() appears only in the ORDER BY clause, would it not only be evaluated once for the whole query, and not once per row?
The problem is the first table is not selecting random elements
SELECT temp.* FROM
(
SELECT
anmeldelser.id ,
anmeldelser.billed_sti ,
anmeldelser.overskrift ,
anmeldelser.indhold ,
anmeldelser.godkendt,
'Review' as artType
FROM anmeldelser
UNION
SELECT
nyheder.id ,
nyheder.billed_sti ,
nyheder.overskrift ,
nyheder.indhold ,
nyheder.godkendt,
'News' as artType
FROM nyheder
) temp
ORDER BY rand() LIMIT 0,6
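As a rough check of the "union first, then shuffle" approach, the sketch below runs the same query shape on SQLite from Python. The assumptions are stated up front: SQLite spells the function RANDOM() rather than rand(), only two of the original columns are kept, the sample rows are invented, and the artType labels follow the Danish table names (anmeldelser = reviews, nyheder = news).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE anmeldelser (id INTEGER, overskrift TEXT)")
conn.execute("CREATE TABLE nyheder (id INTEGER, overskrift TEXT)")
# Ten made-up rows per table.
conn.executemany(
    "INSERT INTO anmeldelser VALUES (?, ?)",
    [(i, f"review {i}") for i in range(1, 11)],
)
conn.executemany(
    "INSERT INTO nyheder VALUES (?, ?)",
    [(i, f"news {i}") for i in range(1, 11)],
)

# Combine both tables first, then order the combined set randomly and take 6.
# Ordering and limiting happen once, on the outer query, so every row from
# either table has a chance of being picked.
query = """
SELECT temp.*
FROM (
    SELECT id, overskrift, 'Review' AS artType FROM anmeldelser
    UNION ALL
    SELECT id, overskrift, 'News' AS artType FROM nyheder
) temp
ORDER BY RANDOM()
LIMIT 6
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

UNION ALL is used here because the artType column already makes rows from the two tables distinct, so the duplicate elimination that plain UNION performs would only add work.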

SQL query ...multiple max value selection. Help needed

Business World 1256987 monthly 10 2009-10-28
Business World 1256987 monthly 10 2009-09-23
Business World 1256987 monthly 10 2009-08-18
Linux 4 U 456734 monthly 25 2009-12-24
Linux 4 U 456734 monthly 25 2009-11-11
Linux 4 U 456734 monthly 25 2009-10-28
I get this result with the query:
SELECT DISTINCT ljm.journelname, ljm.subscription_id,
ljm.frequency, ljm.publisher, ljm.price, ljd.receipt_date
FROM lib_journals_master ljm,
lib_subscriptionhistory lsh,
lib_journal_details ljd
WHERE ljd.journal_id=ljm.id
ORDER BY ljm.publisher
What I need is the latest date for each journal.
I tried this query:
SELECT DISTINCT ljm.journelname, ljm.subscription_id,
ljm.frequency, ljm.publisher, ljm.price,ljd.receipt_date
FROM lib_journals_master ljm,
lib_subscriptionhistory lsh,
lib_journal_details ljd
WHERE ljd.journal_id=ljm.id
AND ljd.receipt_date = (
SELECT max(ljd.receipt_date)
from lib_journal_details ljd)
But it gives me the maximum over the entire column. The result I need has two dates (the maximum for each magazine), whereas this query gives me only one.
You could change the WHERE clause to look up the last date for each journal:
AND ljd.receipt_date = (
SELECT max(subljd.receipt_date)
from lib_journal_details subljd
where subljd.journal_id = ljd.journal_id)
Make sure to give the table in the subquery a different alias from the table in the main query.
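Here is a small sketch of the correlated-subquery approach, run on SQLite from Python with the sample dates from the question. The schema is trimmed to the columns needed, the ids 1 and 2 are invented, and the subquery correlates on journal_id, which is an assumption based on the join condition ljd.journal_id = ljm.id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lib_journals_master (id INTEGER, journelname TEXT);
    CREATE TABLE lib_journal_details (journal_id INTEGER, receipt_date TEXT);
    INSERT INTO lib_journals_master VALUES (1, 'Business World'), (2, 'Linux 4 U');
    INSERT INTO lib_journal_details VALUES
        (1, '2009-10-28'), (1, '2009-09-23'), (1, '2009-08-18'),
        (2, '2009-12-24'), (2, '2009-11-11'), (2, '2009-10-28');
    """
)

# The subquery is re-evaluated for each outer row: it finds the latest
# receipt_date among the detail rows of that row's own journal.
query = """
SELECT ljm.journelname, ljd.receipt_date
FROM lib_journals_master ljm
JOIN lib_journal_details ljd ON ljd.journal_id = ljm.id
WHERE ljd.receipt_date = (
    SELECT MAX(subljd.receipt_date)
    FROM lib_journal_details subljd
    WHERE subljd.journal_id = ljd.journal_id
)
ORDER BY ljm.journelname
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The distinct alias (subljd) is what lets the inner WHERE refer to the outer row; with the same alias on both sides the comparison would be against the inner table itself.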
You should use Group By if you need the Max from date.
Should look something like this:
SELECT
ljm.journelname
, ljm.subscription_id
, ljm.frequency
, ljm.publisher
, ljm.price
, MAX(ljd.receipt_date) AS receipt_date
FROM
lib_journals_master ljm
, lib_subscriptionhistory lsh
, lib_journal_details ljd
WHERE
ljd.journal_id=ljm.id
GROUP BY
ljm.journelname
, ljm.subscription_id
, ljm.frequency
, ljm.publisher
, ljm.price
Something like this should work for you.
SELECT ljm.journelname
, ljm.subscription_id
, ljm.frequency
, ljm.publisher
, ljm.price
,md.max_receipt_date
FROM lib_journals_master ljm
, ( SELECT journal_id
, max(receipt_date) as max_receipt_date
FROM lib_journal_details
GROUP BY journal_id) md
WHERE ljm.id = md.journal_id
/
Note that I have removed the tables from the FROM clause which don't contribute anything to the query. You may need to put them back if you simplified your scenario for our benefit.
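The effect of an unjoined table in the FROM clause can be seen in a quick sketch like this one (SQLite from Python). The trimmed schema, the ids, and the three rows in lib_subscriptionhistory are all made up for illustration; only the table names and the sample dates come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lib_journals_master (id INTEGER, journelname TEXT);
    CREATE TABLE lib_journal_details (journal_id INTEGER, receipt_date TEXT);
    CREATE TABLE lib_subscriptionhistory (id INTEGER);
    INSERT INTO lib_journals_master VALUES (1, 'Business World'), (2, 'Linux 4 U');
    INSERT INTO lib_journal_details VALUES
        (1, '2009-10-28'), (1, '2009-09-23'), (1, '2009-08-18'),
        (2, '2009-12-24'), (2, '2009-11-11'), (2, '2009-10-28');
    INSERT INTO lib_subscriptionhistory VALUES (1), (2), (3);
    """
)

# Derived table: one row per journal with its latest receipt date.
latest = """
    (SELECT journal_id, MAX(receipt_date) AS max_receipt_date
     FROM lib_journal_details
     GROUP BY journal_id) md
"""

# Only the tables that are actually joined.
clean_rows = conn.execute(
    f"""
    SELECT ljm.journelname, md.max_receipt_date
    FROM lib_journals_master ljm, {latest}
    WHERE ljm.id = md.journal_id
    ORDER BY ljm.journelname
    """
).fetchall()

# Same query with lib_subscriptionhistory listed but never joined:
# every result row is repeated once per row of that table.
cross_rows = conn.execute(
    f"""
    SELECT ljm.journelname, md.max_receipt_date
    FROM lib_journals_master ljm, lib_subscriptionhistory lsh, {latest}
    WHERE ljm.id = md.journal_id
    """
).fetchall()

print(clean_rows)
print(len(cross_rows))
```

This is also a plausible reason the original query needed DISTINCT: the unjoined table multiplies the rows, and DISTINCT then collapses them again.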
Separate this into two queries. The first gets each journal name and its latest date:
create table #table (journalName varchar(255), saleDate datetime)
insert into #table
select journalName, max(saleDate) from JournalTable group by journalName
Then select all the fields you need from your table and join #table to them on journalName.
Sounds like top of group. You can use a CTE in SQL Server:
;WITH journeldata AS
(
SELECT
ljm.journelname
,ljm.subscription_id
,ljm.frequency
,ljm.publisher
,ljm.price
,ljd.receipt_date
,ROW_NUMBER() OVER (PARTITION BY ljm.journelname ORDER BY ljd.receipt_date DESC) AS RowNumber
FROM
lib_journals_master ljm
,lib_subscriptionhistory lsh
,lib_journal_details ljd
WHERE
ljd.journal_id=ljm.id
AND ljm.subscription_id = ljm.subscription_id
)
SELECT
journelname
,subscription_id
,frequency
,publisher
,price
,receipt_date
FROM journeldata
WHERE RowNumber = 1
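The ROW_NUMBER() pattern is not specific to SQL Server. As a hedged illustration, the sketch below runs a cut-down version on SQLite from Python, which assumes SQLite 3.25 or newer for window-function support. The schema is trimmed, the ids are invented, and lib_subscriptionhistory is left out because its join columns are not shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lib_journals_master (id INTEGER, journelname TEXT);
    CREATE TABLE lib_journal_details (journal_id INTEGER, receipt_date TEXT);
    INSERT INTO lib_journals_master VALUES (1, 'Business World'), (2, 'Linux 4 U');
    INSERT INTO lib_journal_details VALUES
        (1, '2009-10-28'), (1, '2009-09-23'), (1, '2009-08-18'),
        (2, '2009-12-24'), (2, '2009-11-11'), (2, '2009-10-28');
    """
)

# Number the rows of each journal from newest to oldest,
# then keep only the row numbered 1 in each partition.
query = """
WITH journeldata AS (
    SELECT ljm.journelname,
           ljd.receipt_date,
           ROW_NUMBER() OVER (
               PARTITION BY ljm.journelname
               ORDER BY ljd.receipt_date DESC
           ) AS RowNumber
    FROM lib_journals_master ljm
    JOIN lib_journal_details ljd ON ljd.journal_id = ljm.id
)
SELECT journelname, receipt_date
FROM journeldata
WHERE RowNumber = 1
ORDER BY journelname
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Unlike the MAX-based answers, this keeps the whole winning row, so other columns from the details table can be selected alongside the latest date without a second join.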