HiveQL: Select value of column paired with Max(value) of another column

Let's assume I have this table:
Date Department Value
2017-01-02 A 30
2017-01-02 B 60
2017-01-02 C 10
2017-01-02 D 40
2017-01-03 C 20
2017-01-03 D 150
2017-01-03 E 100
2017-01-03 F 20
...
I want to get the department with the highest 'value' for each day,
which would result in:
Date Department Value
2017-01-02 B 60
2017-01-03 D 150
How could I achieve this?

Your Base data:
hive> create table tx1(date1 date,department string,value int) row format delimited fields terminated by ',';
OK
Time taken: 1.172 seconds
hive> load data local inpath '/home/vivekanand/vivek/hive/test.dat' into table tx1;
Loading data to table default.tx1
OK
Time taken: 0.727 seconds
hive> select * from tx1;
OK
2017-01-02 A 30
2017-01-02 B 60
2017-01-02 C 10
2017-01-02 D 40
2017-01-03 C 20
2017-01-03 D 150
2017-01-03 E 100
2017-01-03 F 20
Time taken: 1.89 seconds, Fetched: 8 row(s)
You can use an analytic (window) function here, as below:
select date1,department,value
from(select date1,department,value,rank() over(partition by date1 order by value desc) f from tx1) k
where f=1;
Output:
Total MapReduce CPU Time Spent: 0 msec
OK
2017-01-02 B 60
2017-01-03 D 150
Time taken: 1.569 seconds, Fetched: 2 row(s)

Break the problem into two parts. First, get the max value per date in a CTE.
Then join that result set with the base table to get the desired result,
i.e.:
with temp as (
select Date ,max(value) as value
from tableName group by Date
)
select a.Date , a.Department ,a.Value
from tableName a join temp b
on a.Date=b.Date
and a.value=b.value
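The CTE-plus-join idea can be tried out locally. The following is only a sketch: it runs the same query shape against an in-memory SQLite database from Python instead of Hive, reuses the tx1 table and date1 column from the first answer, and adds an ORDER BY so the output order is stable.

```python
import sqlite3

# In-memory SQLite stands in for Hive here (an assumption made for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx1 (date1 TEXT, department TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO tx1 VALUES (?, ?, ?)",
    [
        ("2017-01-02", "A", 30), ("2017-01-02", "B", 60),
        ("2017-01-02", "C", 10), ("2017-01-02", "D", 40),
        ("2017-01-03", "C", 20), ("2017-01-03", "D", 150),
        ("2017-01-03", "E", 100), ("2017-01-03", "F", 20),
    ],
)

# Step 1 (CTE): max value per date. Step 2: join back to pick up the department.
query = """
WITH temp AS (
    SELECT date1, MAX(value) AS value
    FROM tx1
    GROUP BY date1
)
SELECT a.date1, a.department, a.value
FROM tx1 a
JOIN temp b
  ON a.date1 = b.date1
 AND a.value = b.value
ORDER BY a.date1
"""
max_rows = conn.execute(query).fetchall()
print(max_rows)
```

If two departments share the maximum on a given day, this join returns both of them.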

Use the rank() analytic function. rank() assigns 1 to the row(s) with the highest value per day; if several departments tie for the maximum, all of them get rank 1 and are returned.
select Date, Department, Value
from
(
select a.Date, a.Department, a.Value,
rank() over(partition by a.Date order by a.Value desc) as rnk
from tableName a
)s
where rnk=1
;
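To see how rank() treats ties, here is a small sketch using Python's sqlite3 module (it assumes the bundled SQLite is 3.25 or newer, which is when window functions were added). The table name tx1 comes from the first answer; the extra row for department G is hypothetical and exists only to create a tie. row_number() is shown next to it as an alternative when exactly one row per day is wanted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx1 (date1 TEXT, department TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO tx1 VALUES (?, ?, ?)",
    [
        ("2017-01-02", "A", 30), ("2017-01-02", "B", 60),
        ("2017-01-02", "C", 10), ("2017-01-02", "D", 40),
        ("2017-01-03", "C", 20), ("2017-01-03", "D", 150),
        ("2017-01-03", "E", 100), ("2017-01-03", "F", 20),
        ("2017-01-03", "G", 150),  # hypothetical row, added to force a tie with D
    ],
)

# rank(): tied rows share rank 1, so both D and G come back for 2017-01-03.
rank_rows = conn.execute("""
    SELECT date1, department, value
    FROM (SELECT date1, department, value,
                 RANK() OVER (PARTITION BY date1 ORDER BY value DESC) AS rnk
          FROM tx1) s
    WHERE rnk = 1
    ORDER BY date1, department
""").fetchall()

# row_number(): exactly one row per day; the department name breaks the tie.
rn_rows = conn.execute("""
    SELECT date1, department, value
    FROM (SELECT date1, department, value,
                 ROW_NUMBER() OVER (PARTITION BY date1
                                    ORDER BY value DESC, department) AS rn
          FROM tx1) s
    WHERE rn = 1
    ORDER BY date1
""").fetchall()

print(rank_rows)
print(rn_rows)
```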

Related

Netezza add new field for first record value of the day in SQL

I'm trying to add new columns of first values of the day for location and weight.
For instance, the original data format is:
id dttm location weight
--------------------------------------------
1 1/1/20 11:10:00 A 40
1 1/1/20 19:07:00 B 41.1
2 1/1/20 08:01:00 B 73.2
2 1/1/20 21:00:00 B 73.2
2 1/2/20 10:03:00 C 74
I want each id to have only one day record, such as:
id dttm location weight
--------------------------------------------
1 1/1/20 11:10:00 A 40
2 1/1/20 08:01:00 B 73.2
2 1/2/20 10:03:00 C 74
I have other columns in my data set that I'm creating from location and weight, so I don't think I can just filter for the 'first' records of the day. Is it possible to write a query that recognizes the first record of the day for those two columns and creates new columns with those values?
You can use row_number():
select t.*
from (select t.*,
row_number() over (partition by id, dttm::date order by dttm) as seqnum
from t
) t
where seqnum = 1;
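A rough way to check the row_number() approach is to run it against the sample rows in SQLite from Python. This is a sketch under assumptions: SQLite has no `::date` cast, so `date(dttm)` is used instead, the timestamps are rewritten in ISO format, and the table is called t as in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # needs SQLite 3.25+ for window functions
conn.execute("CREATE TABLE t (id INTEGER, dttm TEXT, location TEXT, weight REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, "2020-01-01 11:10:00", "A", 40.0),
        (1, "2020-01-01 19:07:00", "B", 41.1),
        (2, "2020-01-01 08:01:00", "B", 73.2),
        (2, "2020-01-01 21:00:00", "B", 73.2),
        (2, "2020-01-02 10:03:00", "C", 74.0),
    ],
)

# Number the rows within each (id, calendar day) by time and keep the earliest.
first_rows = conn.execute("""
    SELECT id, dttm, location, weight
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY id, date(dttm)
                                    ORDER BY dttm) AS seqnum
          FROM t) s
    WHERE seqnum = 1
    ORDER BY id, dttm
""").fetchall()

for row in first_rows:
    print(row)
```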

Problems with complex query

There are two tables.
In the first I have columns:
id - a person
time - the time of receiving the bonus (timestamp)
money - size of bonus
And the second:
id
time - time of getting a rank (timestamp)
range - military rank (int)
The task is to output the total amount and the number of bonuses received by people while they held the rank of captain (range = 7), aggregated by day.
I have no idea how to build a table with this data. I can summarize the data across all days, like this:
SELECT DISTINCTROW Payment.user_id AS user_id, Sum(IIf(IsNull(Payment.money),0,Payment.money)) AS [Sum - money], Count(Payment.money) AS [Count - Payment], Format(Payment.time, "Short Date") as day
FROM Payment
GROUP BY Payment.user_id, Format (Payment.time, "Short Date")
Having ((Count(Payment.money) > 0));
Can you help me with the second part and summarize them? Thanks.
For example: first table (Payment):
user_id time money
a 01.01.10 00:00:00 15,00
a 01.01.10 10:00:00 2,00
a 03.01.10 00:00:00 3,00
c 04.01.10 00:00:00 4,00
c 04.01.10 00:05:00 5,00
d 06.01.10 00:00:00 6,00
e 07.01.10 00:00:00 7,00
e 08.01.10 00:00:00 8,00
The second one:
user_id time range
a 01.01.10 00:00:00 6
a 01.01.10 09:00:00 7
a 04.01.10 00:00:00 8
b 04.01.10 00:00:00 4
c 04.01.10 00:05:00 7
d 06.01.10 00:00:00 5
e 07.01.10 00:00:00 6
f 08.01.10 00:00:00 6
g 08.01.10 00:00:00 7
I expected:
user_id time sum
a 01.01.10 2
a 03.01.10 3
c 04.01.10 5
Here is one possible method using joins:
select t1.user_id, datevalue(p.time) as [time], sum(p.money) as [sum]
from
(
(select t.user_id, t.time from rank t where t.range = 7) t1
inner join payment p on t1.user_id = p.user_id
)
left join
(select t.user_id, t.time from rank t where t.range > 7) t2 on p.user_id = t2.user_id
where
p.time >= t1.time and (t2.user_id is null or p.time < t2.time)
group by
t1.user_id, datevalue(p.time)
I have assumed that your second table is called rank (this was not stated in your question).
Here, the subquery t1 obtains the set of users with range = 7 (captain), and the subquery t2 obtains the set of users with range > 7. I then select all records with a payment date greater than or equal to the date of promotion to captain, but less than any subsequent promotion (if it exists).
This yields the following result:
+---------+------------+------+
| user_id | time | sum |
+---------+------------+------+
| a | 01/01/2010 | 2.00 |
| a | 03/01/2010 | 3.00 |
| c | 04/01/2010 | 5.00 |
+---------+------------+------+
Unless I have misunderstood, I would argue that your expected result is incorrect as the payment below occurs before user_id = c achieved the rank of captain:
c 04.01.10 00:00:00 4,00
c 04.01.10 00:05:00 7
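The join logic above can be replayed on the sample data. The sketch below is an approximation rather than the Access query itself: it uses SQLite from Python, calls the second table ranks (a hypothetical name), stores timestamps as ISO strings so they compare correctly as text, replaces datevalue() with date(), and adds a COUNT(*) column because the question also asks for the number of bonuses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (user_id TEXT, time TEXT, money REAL)")
conn.execute('CREATE TABLE ranks (user_id TEXT, time TEXT, "range" INTEGER)')
conn.executemany("INSERT INTO payment VALUES (?, ?, ?)", [
    ("a", "2010-01-01 00:00:00", 15.0), ("a", "2010-01-01 10:00:00", 2.0),
    ("a", "2010-01-03 00:00:00", 3.0),  ("c", "2010-01-04 00:00:00", 4.0),
    ("c", "2010-01-04 00:05:00", 5.0),  ("d", "2010-01-06 00:00:00", 6.0),
    ("e", "2010-01-07 00:00:00", 7.0),  ("e", "2010-01-08 00:00:00", 8.0),
])
conn.executemany("INSERT INTO ranks VALUES (?, ?, ?)", [
    ("a", "2010-01-01 00:00:00", 6), ("a", "2010-01-01 09:00:00", 7),
    ("a", "2010-01-04 00:00:00", 8), ("b", "2010-01-04 00:00:00", 4),
    ("c", "2010-01-04 00:05:00", 7), ("d", "2010-01-06 00:00:00", 5),
    ("e", "2010-01-07 00:00:00", 6), ("f", "2010-01-08 00:00:00", 6),
    ("g", "2010-01-08 00:00:00", 7),
])

# t1: promotion to captain; t2: any later promotion above captain (may be absent).
captain_rows = conn.execute("""
    SELECT t1.user_id, date(p.time) AS day,
           SUM(p.money) AS total, COUNT(*) AS bonuses
    FROM ranks t1
    JOIN payment p ON p.user_id = t1.user_id
    LEFT JOIN ranks t2 ON t2.user_id = p.user_id AND t2."range" > 7
    WHERE t1."range" = 7
      AND p.time >= t1.time
      AND (t2.user_id IS NULL OR p.time < t2.time)
    GROUP BY t1.user_id, date(p.time)
    ORDER BY t1.user_id, day
""").fetchall()

for row in captain_rows:
    print(row)
```

Like the original query, this sketch assumes at most one promotion above captain per user; a user with several higher ranks would need the earliest of them picked first.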

Take the last row Group By date

I need to select content statistics group By Date.
Here example of records :
id cid viewCount created_at
1 1 50 31-12-2018 18:00:00
2 1 50 01-01-2019 18:00:00
3 2 50 01-01-2019 18:00:00
4 2 100 01-01-2019 19:00:00
5 2 150 01-01-2019 20:00:00
6 3 1000 01-01-2019 15:00:00
Need to return :
id cid viewCount date
1 1 50 31-12-2018
2 1 50 01-01-2019
5 2 150 01-01-2019
6 3 1000 01-01-2019
I tried the following code
$qb = $this->createQueryBuilder('c');
$qb->select('a.id as id')
->addSelect('COALESCE(SUM(a.viewCount),0) as viewCount')
->addSelect('DATE_FORMAT(a.createdAt, \'%d-%m-%Y\') as date')
->innerJoin('c.analytics', 'a')
->groupBy('c.cid')
->addGroupBy('date')
->orderBy('a.createdAt', 'ASC');
return:
id cid viewCount date
1 1 50 31-12-2018
2 1 50 01-01-2019
3 2 50 01-01-2019
4 2 100 01-01-2019
5 2 150 01-01-2019
6 3 1000 01-01-2019
I have tried to create a subquery :
$qbLastHour = $this->createQueryBuilder('cc');
$qbLastHour->select('MAX(DATE_FORMAT(aa.createdAt, \'%H\'))')
->innerJoin('cc.analytics', 'aa')
->where('cc.id=c.id')
->groupBy('cc.cid')
->addGroupBy('s');
$qb->addSelect(sprintf("(%s) AS r", $qbLastHour->getDQl()));
But something goes wrong because I don't group by date in the subquery.
Can someone help me? Thank you.
Update
Here is an attempt, in sql again, to select only one row per date and cid based on the max time per day
SELECT id, c.cid, viewCount, max_date
FROM content a
JOIN content_analytic c ON a.id = c.content_id
RIGHT JOIN (SELECT c.cid, DATE_FORMAT(created_at, '%d-%m-%Y') dt, MAX(created_at) max_date
FROM content a
JOIN content_analytic c ON a.id = c.content_id
GROUP BY dt, c.cid) x ON x.max_date = a.created_at and x.cid = c.cid
This is how I believe the query should be in pure sql
SELECT c.cid, COALESCE(SUM(a.viewCount), 0), DATE_FORMAT(a.created_at, '%d-%m-%Y') as date
FROM content a
INNER JOIN content_analytic c ON a.id = c.content_id
GROUP BY c.cid, date
ORDER BY date
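Another way to keep only the last row per cid and day is a window function. This is a sketch, not the Doctrine solution: it assumes a single flat table named content_analytic holding the sample columns, uses SQLite (3.25+) from Python, and stores created_at as ISO text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE content_analytic "
    "(id INTEGER, cid INTEGER, viewCount INTEGER, created_at TEXT)"
)
conn.executemany("INSERT INTO content_analytic VALUES (?, ?, ?, ?)", [
    (1, 1, 50, "2018-12-31 18:00:00"),
    (2, 1, 50, "2019-01-01 18:00:00"),
    (3, 2, 50, "2019-01-01 18:00:00"),
    (4, 2, 100, "2019-01-01 19:00:00"),
    (5, 2, 150, "2019-01-01 20:00:00"),
    (6, 3, 1000, "2019-01-01 15:00:00"),
])

# Rank rows within each (cid, day) from latest to earliest and keep the latest.
last_rows = conn.execute("""
    SELECT id, cid, viewCount, strftime('%d-%m-%Y', created_at) AS date
    FROM (SELECT ca.*,
                 ROW_NUMBER() OVER (PARTITION BY cid, date(created_at)
                                    ORDER BY created_at DESC) AS rn
          FROM content_analytic ca) s
    WHERE rn = 1
    ORDER BY id
""").fetchall()

for row in last_rows:
    print(row)
```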

Join tables with dates within intervals of 5 min (get avg)

I want to join two tables on a timestamp. The problem is that the two tables don't have exactly the same timestamps, so I want to join them on nearby timestamps using a 5-minute interval.
This query needs to use two common table expressions; each one needs to get the timestamps and group them with AVG so they can match.
Freezer | Timestamp | Temperature_1
1 2018-04-25 09:45:00 10
1 2018-04-25 09:50:00 11
1 2018-04-25 09:55:00 11
Freezer | Timestamp | Temperature_2
1 2018-04-25 09:46:00 15
1 2018-04-25 09:52:00 13
1 2018-04-25 09:59:00 12
My desired result would be:
Freezer | Timestamp | Temperature_1 | Temperature_2
1 2018-04-25 09:45:00 10 15
1 2018-04-25 09:50:00 11 13
1 2018-04-25 09:55:00 11 12
The current query that i'm working on is:
WITH Temperatures_1 AS (
SELECT Freezer, Temperature_1, Timestamp
FROM TABLE_A
),
Temperatures_2 AS (
SELECT Freezer, Temperature_2, Timestamp
FROM TABLE_B
)
SELECT A.Freezer, A.Timestamp, Temperature_1, Temperature_2
FROM Temperatures_1 as A
RIGHT JOIN Temperatures_2 as B
ON A.FREEZER = B.FREEZER
WHERE A.Timestamp = B.Timestamp
You may want to modify your join criteria instead of filtering the output. Use BETWEEN to bracket your join value on the timestamps. I chose +/- 150 seconds because that's 2-1/2 minutes to either side (a 5-minute range to match). You may need something different.
;WITH Temperatures_1 AS (
SELECT Freezer, Temperature_1, Timestamp
FROM TABLE_A
),
Temperatures_2 AS (
SELECT Freezer, Temperature_2, Timestamp
FROM TABLE_B
)
SELECT A.Freezer, A.Timestamp, Temperature_1, Temperature_2
FROM Temperatures_1 as A
RIGHT JOIN Temperatures_2 as B
ON A.FREEZER = B.FREEZER
AND A.Timestamp BETWEEN DATEADD(SECOND, -150, B.Timestamp)
AND DATEADD(SECOND, 150, B.Timestamp)
You should change the join key to include the timestamp. To do that, round the timestamps in both tables A and B to a common 5-minute boundary.
First, for the left table (A), check whether the datetime is less than 2.5 minutes past a 5-minute mark; if so, round down to that mark, otherwise round up to the next 5-minute mark. Do the same for the right table (B). You can do this rounding inside the CTEs, so the right join stays the same as in your query.
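The bucketing idea can be sketched with two CTEs that each average per 5-minute bucket. Note the assumptions: this version floors each timestamp to the start of its 5-minute bucket rather than rounding to the nearest mark, because with nearest-mark rounding the 09:59 reading would land on 10:00 and no longer pair with 09:55 as the desired output shows. It also runs on SQLite from Python, uses an inner join, and the lowercase table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (freezer INTEGER, ts TEXT, temperature_1 REAL)")
conn.execute("CREATE TABLE table_b (freezer INTEGER, ts TEXT, temperature_2 REAL)")
conn.executemany("INSERT INTO table_a VALUES (?, ?, ?)", [
    (1, "2018-04-25 09:45:00", 10), (1, "2018-04-25 09:50:00", 11),
    (1, "2018-04-25 09:55:00", 11),
])
conn.executemany("INSERT INTO table_b VALUES (?, ?, ?)", [
    (1, "2018-04-25 09:46:00", 15), (1, "2018-04-25 09:52:00", 13),
    (1, "2018-04-25 09:59:00", 12),
])

# Integer division by 300 seconds floors each timestamp to its 5-minute bucket.
query = """
WITH temperatures_1 AS (
    SELECT freezer,
           CAST(strftime('%s', ts) AS INTEGER) / 300 * 300 AS bucket,
           AVG(temperature_1) AS temperature_1
    FROM table_a
    GROUP BY freezer, bucket
),
temperatures_2 AS (
    SELECT freezer,
           CAST(strftime('%s', ts) AS INTEGER) / 300 * 300 AS bucket,
           AVG(temperature_2) AS temperature_2
    FROM table_b
    GROUP BY freezer, bucket
)
SELECT a.freezer, datetime(a.bucket, 'unixepoch') AS ts,
       a.temperature_1, b.temperature_2
FROM temperatures_1 a
JOIN temperatures_2 b
  ON a.freezer = b.freezer
 AND a.bucket = b.bucket
ORDER BY ts
"""
joined = conn.execute(query).fetchall()
for row in joined:
    print(row)
```

Flooring keeps every reading inside the 5-minute window it was taken in, which matches the sample; whether that or nearest-mark rounding is right depends on how the sensors are supposed to line up.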

SQL to capture time periods that a certain condition exists

I have a table that captures daily data on users.
I want to pull the start and end dates for users when IS_AWESOME = 'Y'
I do not know how to do this using SQL
USER_ID DATE IS_AWESOME
123 2017-01-01 Y
123 2017-01-02 Y
123 2017-01-03 Y
123 2017-01-04 N
123 2017-01-05 Y
123 2017-01-06 Y
123 2017-01-07 Y
123 2017-01-08 N
123 2017-01-09 Y
123 2017-01-10 Y
123 2017-01-11 N
If I use MIN(DATE) and MAX(DATE) I will not get the intervals between those two dates.
A typical way to do this uses a difference of row_number()s (an ANSI-standard function supported by most databases):
select user_id, min(date), max(date)
from (select t.*,
row_number() over (partition by user_id order by date) as seqnum_u,
row_number() over (partition by user_id, is_awesome order by date) as seqnum_uia
from t
) t
where is_awesome = 'Y'
group by user_id, is_awesome, (seqnum_u - seqnum_uia) ;
Explaining how this works is a bit tricky. If you run the subquery, you will see how the difference of the row numbers defines each group of sequential values.
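To make the row-number difference visible, the sketch below loads the sample rows into SQLite from Python (window functions need SQLite 3.25+), prints the subquery's difference for the 'Y' rows, and then runs the grouped query. The ORDER BY clauses are additions so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (user_id INTEGER, date TEXT, is_awesome TEXT)")
flags = "YYYNYYYNYYN"  # is_awesome for 2017-01-01 .. 2017-01-11, from the question
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(123, f"2017-01-{day:02d}", flag) for day, flag in enumerate(flags, start=1)],
)

numbered = """
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date) AS seqnum_u,
           ROW_NUMBER() OVER (PARTITION BY user_id, is_awesome
                              ORDER BY date) AS seqnum_uia
    FROM t
"""

# Within a run of consecutive 'Y' days both counters advance together,
# so their difference stays constant; it changes after each 'N' day.
diffs = conn.execute(f"""
    SELECT date, seqnum_u - seqnum_uia AS grp
    FROM ({numbered}) s
    WHERE is_awesome = 'Y'
    ORDER BY date
""").fetchall()
print(diffs)

islands = conn.execute(f"""
    SELECT user_id, MIN(date), MAX(date)
    FROM ({numbered}) s
    WHERE is_awesome = 'Y'
    GROUP BY user_id, is_awesome, (seqnum_u - seqnum_uia)
    ORDER BY MIN(date)
""").fetchall()
print(islands)
```

For the sample data the difference is 0 for the first run, 1 for the second, and 2 for the third, which is why grouping on it separates the three periods.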