Hive: Query to get max count per word per date

Here's the data I have:
date | word | count
01/01/2020 | #abc | 1
01/01/2020 | #xyz | 2
02/05/2020 | #ghi | 2
02/05/2020 | #def | 1
02/04/2020 | #pqr | 4
02/04/2020 | #cde | 3
01/01/2020 | #lmn | 1
Here's the result that I want:
date | word | count
01/01/2020 | #xyz | 2
02/04/2020 | #pqr | 4
02/05/2020 | #ghi | 2
So basically, I want the word with maximum count on each particular date.
Can someone help me out with the query?

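Use the row_number() window function: partition by date, order by count descending, and keep only the first row of each partition. If two words tie for the maximum on a date, row_number() keeps just one of them; use rank() instead to return every tied word.
SELECT date, word, count
FROM (
    SELECT date, word, count,
           row_number() OVER (PARTITION BY date ORDER BY count DESC) AS rn
    FROM <table_name>
) sq
WHERE sq.rn = 1;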
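To check the row_number() approach against the sample data from the question, here is a small sketch that runs the same query through Python's built-in sqlite3 module. SQLite (3.25 or newer, for window functions) only stands in for Hive here, and the table name word_counts is a made-up placeholder.

```python
import sqlite3

# Assumption: SQLite stands in for Hive; word_counts is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_counts (date TEXT, word TEXT, count INTEGER)")
conn.executemany(
    "INSERT INTO word_counts VALUES (?, ?, ?)",
    [
        ("01/01/2020", "#abc", 1),
        ("01/01/2020", "#xyz", 2),
        ("02/05/2020", "#ghi", 2),
        ("02/05/2020", "#def", 1),
        ("02/04/2020", "#pqr", 4),
        ("02/04/2020", "#cde", 3),
        ("01/01/2020", "#lmn", 1),
    ],
)

# Number the rows of each date from highest to lowest count, keep row 1.
top_words = conn.execute(
    """
    SELECT date, word, count
    FROM (
        SELECT date, word, count,
               row_number() OVER (PARTITION BY date ORDER BY count DESC) AS rn
        FROM word_counts
    ) sq
    WHERE sq.rn = 1
    ORDER BY date
    """
).fetchall()

for row in top_words:
    print(row)
```

The final ORDER BY date is only there to make the printed order predictable; it sorts the dates as text, which happens to work for these sample values.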

Related

How to calculate average monthly number of some action in some period in Teradata SQL?

I have a table in Teradata SQL like the one below:
ID  | trans_date
------------------------
123 | 2021-01-01
887 | 2021-01-15
123 | 2021-02-10
45 | 2021-03-11
789 | 2021-10-01
45 | 2021-09-02
I need to calculate the average monthly number of transactions made by customers in the period between 2021-01-01 and 2021-09-01, so the client with "ID" = 789 is excluded because his transaction came later.
In the first month (01) there were 2 transactions
In the second month there was 1 transaction
In the third month there was 1 transaction
In the ninth month there was 1 transaction
So the result should be (2+1+1+1) / 4 = 1.25, shouldn't it?
How can I calculate this in Teradata SQL? The data shown is only a sample.

SELECT ID, AVG(txns)
FROM (
    SELECT ID, TRUNC(trans_date,'MON') AS mth, COUNT(*) AS txns
    FROM mytable
    -- WHERE condition matches the question but likely want to
    -- use end date 2021-09-30 or use mth instead of trans_date
    WHERE trans_date BETWEEN date'2021-01-01' AND date'2021-09-01'
    GROUP BY id, mth
) mth_txn
GROUP BY id;

Your logic translated to SQL:
-- (2+1+1+1) / 4
-- CAST avoids integer division, which would truncate the result in Teradata
SELECT id, CAST(COUNT(*) AS FLOAT) / COUNT(DISTINCT TRUNC(trans_date,'MON')) AS avg_tx
FROM mytable
WHERE trans_date BETWEEN date'2021-01-01' AND date'2021-09-01'
GROUP BY id;
You should compare this to Fred's answer to see which is more efficient on your data.
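Both answers group by id, while the arithmetic in the question, (2+1+1+1) / 4 = 1.25, is a single overall figure. A rough sketch of that overall calculation follows, using Python's sqlite3 as a stand-in for Teradata (so strftime replaces TRUNC). It assumes the end date is 2021-09-30, as the comment in the first answer suggests; with 2021-09-01 the transaction on 2021-09-02 would be left out.

```python
import sqlite3

# Assumption: SQLite stands in for Teradata; strftime('%Y-%m', ...) plays
# the role of TRUNC(trans_date, 'MON').
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, trans_date TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [
        (123, "2021-01-01"),
        (887, "2021-01-15"),
        (123, "2021-02-10"),
        (45, "2021-03-11"),
        (789, "2021-10-01"),
        (45, "2021-09-02"),
    ],
)

# Total transactions divided by the number of months that have any.
# Assumed end date 2021-09-30 so that the September row is included.
avg_monthly = conn.execute(
    """
    SELECT CAST(COUNT(*) AS REAL)
           / COUNT(DISTINCT strftime('%Y-%m', trans_date))
    FROM mytable
    WHERE trans_date BETWEEN '2021-01-01' AND '2021-09-30'
    """
).fetchone()[0]

print(avg_monthly)  # 1.25
```

Months without any transaction are not counted in the denominator, which matches the calculation in the question (four months, not nine).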

Query to find value in column dependent on a different column in table being the minimum date

I have a dataset that looks like this. I would like to pull a distinct id, the minimum date and value on the minimum date.
id date value
1 01/01/2020 0.5
1 02/01/2020 1
1 03/01/2020 2
2 01/01/2020 3
2 02/01/2020 4
2 03/01/2020 5
This code will pull the id and the minimum date
select Distinct(id), min(nav_date)
from table
group by id
How can I get the value on the minimum date so the output of my query looks like this?
id date value
1 01/01/2020 0.5
2 01/01/2020 3

Use distinct on:
select distinct on (id) t.*
from t
order by id, date;
This can take advantage of an index on (id, date) and is typically the fastest way to do this operation in Postgres.

How to get latest records based on two columns of max

I have a table called Inventory with the below columns
item warehouse date sequence number value
111 100 2019-09-25 12:29:41.000 1 10
111 100 2019-09-26 12:29:41.000 1 20
222 200 2019-09-21 16:07:10.000 1 5
222 200 2019-09-21 16:07:10.000 2 10
333 300 2020-01-19 12:05:23.000 1 4
333 300 2020-01-20 12:05:23.000 1 5
Expected Output:
item warehouse date sequence number value
111 100 2019-09-26 12:29:41.000 1 20
222 200 2019-09-21 16:07:10.000 2 10
333 300 2020-01-20 12:05:23.000 1 5
For each item and warehouse, I need to pick the row with the latest date and, within that date, the latest sequence number.
I tried the code below:
select item,warehouse,sequencenumber,sum(value),max(date) as date1
from Inventory t1
where
t1.date IN (select max(date) from Inventory t2
where t1.warehouse=t2.warehouse
and t1.item = t2.item
group by t2.item,t2.warehouse)
group by t1.item,t1.warehouse,t1.sequencenumber
It works for the latest date but not for the latest sequence number.
Can you please suggest how to write a query that gives my expected output?

You can use row_number() for this:
select *
from (
    select
        t.*,
        row_number() over(
            partition by item, warehouse
            order by date desc, sequencenumber desc, value desc
        ) rn
    from Inventory t
) t
where rn = 1
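As a quick check of the row_number() query against the sample Inventory rows, here is a sketch using Python's sqlite3 (SQLite 3.25+ assumed for window functions; the original database is not stated in the question). The column name sequencenumber is taken from the asker's own query.

```python
import sqlite3

# Assumption: SQLite stands in for the asker's database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Inventory "
    "(item INTEGER, warehouse INTEGER, date TEXT, sequencenumber INTEGER, value INTEGER)"
)
conn.executemany(
    "INSERT INTO Inventory VALUES (?, ?, ?, ?, ?)",
    [
        (111, 100, "2019-09-25 12:29:41.000", 1, 10),
        (111, 100, "2019-09-26 12:29:41.000", 1, 20),
        (222, 200, "2019-09-21 16:07:10.000", 1, 5),
        (222, 200, "2019-09-21 16:07:10.000", 2, 10),
        (333, 300, "2020-01-19 12:05:23.000", 1, 4),
        (333, 300, "2020-01-20 12:05:23.000", 1, 5),
    ],
)

# Latest date first, then latest sequence number within that date.
latest = conn.execute(
    """
    SELECT item, warehouse, date, sequencenumber, value
    FROM (
        SELECT t.*,
               row_number() OVER (
                   PARTITION BY item, warehouse
                   ORDER BY date DESC, sequencenumber DESC
               ) AS rn
        FROM Inventory t
    ) t
    WHERE rn = 1
    ORDER BY item
    """
).fetchall()

for row in latest:
    print(row)
```

The second sort key is what separates the two rows for item 222, which share the same timestamp.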

How to get the count of distinct values until a time period Impala/SQL?

I have a raw table recording customer ids coming to a store over a particular time period. Using Impala, I would like to calculate the number of distinct customer IDs coming to the store until each day. (e.g., on day 3, 5 distinct customers visited so far)
Here is a simple example of the raw table I have:
Day ID
1 1234
1 5631
1 1234
2 1234
2 4456
2 5631
3 3482
3 3452
3 1234
3 5631
3 1234
Here is what I would like to get:
Day Count(distinct ID) until that day
1 2
2 3
3 5
Is there a way to easily do this in a single query?

I'm not 100% sure this will work on Impala, but the idea is to join against a table of days, either a real table or a derived table created on the fly.
CREATE TABLE days ("DayC" int);
INSERT INTO days ("DayC")
VALUES (1), (2), (3);
or
CREATE TABLE days AS
SELECT DISTINCT "Day" AS "DayC"
FROM sales;
Then you can use this query (demonstrated in PostgreSQL):
SELECT "DayC", COUNT(DISTINCT "ID")
FROM sales
cross JOIN days
WHERE "Day" <= "DayC"
GROUP BY "DayC"
Output:
| DayC | count |
|------|-------|
| 1 | 2 |
| 2 | 3 |
| 3 | 5 |
Updated version, which derives the days from sales instead of using a separate table:
SELECT T."DayC", COUNT(DISTINCT "ID")
FROM sales
cross JOIN (SELECT DISTINCT "Day" as "DayC" FROM sales) T
WHERE "Day" <= T."DayC"
GROUP BY T."DayC"
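The updated query can be tried on the sample data with Python's sqlite3, as sketched below. SQLite is only a stand-in here, so this does not settle whether Impala accepts the same join. The aliases s and d are added for readability.

```python
import sqlite3

# Assumption: SQLite stands in for Impala/PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE sales ("Day" INTEGER, "ID" INTEGER)')
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        (1, 1234), (1, 5631), (1, 1234),
        (2, 1234), (2, 4456), (2, 5631),
        (3, 3482), (3, 3452), (3, 1234), (3, 5631), (3, 1234),
    ],
)

# Pair every sale with every day on or after it, then count distinct IDs per day.
running = conn.execute(
    """
    SELECT d."DayC", COUNT(DISTINCT s."ID")
    FROM sales s
    CROSS JOIN (SELECT DISTINCT "Day" AS "DayC" FROM sales) d
    WHERE s."Day" <= d."DayC"
    GROUP BY d."DayC"
    ORDER BY d."DayC"
    """
).fetchall()

print(running)  # [(1, 2), (2, 3), (3, 5)]
```

The cross join grows with the number of rows times the number of days, so on a large table it may be worth comparing against other approaches.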

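Try this one:
select day, count(distinct(id)) from yourtable group by day
Note that this counts distinct IDs within each day separately (2, 3, 4 for the sample data), not cumulatively up to each day.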

Select info from table where row has max date

My table looks something like this:
group date cash checks
1 1/1/2013 0 0
2 1/1/2013 0 800
1 1/3/2013 0 700
3 1/1/2013 0 600
1 1/2/2013 0 400
3 1/5/2013 0 200
-- Do not need cash just demonstrating that table has more information in it
I want to get, for each unique group, the row where date is the max and checks is greater than 0. So the result would look something like:
group date checks
2 1/1/2013 800
1 1/3/2013 700
3 1/5/2013 200
Attempted code:
SELECT group,MAX(date),checks
FROM table
WHERE checks>0
GROUP BY group
ORDER BY group DESC
The problem is that it gives me all the dates and checks rather than just the max date row.
I'm using MS SQL Server 2005.

SELECT group,MAX(date) as max_date
FROM table
WHERE checks>0
GROUP BY group
That works to get the max date. Join it back to your data to get the other columns:
SELECT t.group, a.max_date, t.checks
FROM table t
INNER JOIN
    (SELECT group, MAX(date) AS max_date
     FROM table
     WHERE checks>0
     GROUP BY group) a
ON a.group = t.group AND a.max_date = t.date
The inner join acts as the filter that keeps only the max-date record.
FYI, your column names are horrid; don't use reserved words such as group, date, or table as identifiers.

You can use a window MAX() like this:
SELECT
*,
max_date = MAX(date) OVER (PARTITION BY group)
FROM table
to get max dates per group alongside other data:
group date cash checks max_date
----- -------- ---- ------ --------
1 1/1/2013 0 0 1/3/2013
2 1/1/2013 0 800 1/1/2013
1 1/3/2013 0 700 1/3/2013
3 1/1/2013 0 600 1/5/2013
1 1/2/2013 0 400 1/3/2013
3 1/5/2013 0 200 1/5/2013
Using the above output as a derived table, you can then get only rows where date matches max_date:
SELECT
group,
date,
checks
FROM (
SELECT
*,
max_date = MAX(date) OVER (PARTITION BY group)
FROM table
) AS s
WHERE date = max_date
;
to get the desired result.
Basically, this is similar to @Twelfth's suggestion but avoids a join and may thus be more efficient.
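Here is a sketch of the window MAX() approach run through Python's sqlite3 (SQLite 3.25+ assumed, standing in for SQL Server 2005). Several things are assumptions made for the example: the table name deposits, the ISO-formatted dates (reading the originals as month/day/year), the AS alias syntax in place of max_date = ..., and the WHERE checks > 0 filter, which the question asks for but the query above omits. The reserved word group is quoted so it can be used as a column name.

```python
import sqlite3

# Assumptions: SQLite stands in for SQL Server; deposits is a hypothetical
# table name; dates are rewritten as ISO strings (month/day/year assumed).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE deposits ("group" INTEGER, "date" TEXT, cash INTEGER, checks INTEGER)'
)
conn.executemany(
    "INSERT INTO deposits VALUES (?, ?, ?, ?)",
    [
        (1, "2013-01-01", 0, 0),
        (2, "2013-01-01", 0, 800),
        (1, "2013-01-03", 0, 700),
        (3, "2013-01-01", 0, 600),
        (1, "2013-01-02", 0, 400),
        (3, "2013-01-05", 0, 200),
    ],
)

# Attach each group's max date to every row, then keep rows on that date.
# The checks > 0 filter is an addition taken from the question's requirement.
rows = conn.execute(
    """
    SELECT "group", "date", checks
    FROM (
        SELECT *, MAX("date") OVER (PARTITION BY "group") AS max_date
        FROM deposits
        WHERE checks > 0
    ) AS s
    WHERE "date" = max_date
    ORDER BY "group"
    """
).fetchall()

for row in rows:
    print(row)
```

If two rows in a group share the max date, both are returned, unlike the row_number() pattern, which keeps exactly one.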

Using IN can have a performance impact. Joining two subqueries usually avoids it and can be done like this:
SELECT t1.*
FROM (SELECT msisdn
            ,callid
            ,Change_color
            ,play_file_name
            ,date_played
      FROM insert_log
      WHERE play_file_name NOT IN('Prompt1','Conclusion_Prompt_1','silent')) t1
JOIN (SELECT callid
            ,MAX(date_played) AS date_played
      FROM insert_log GROUP BY callid) t2
ON t1.callid = t2.callid AND t1.date_played = t2.date_played
ORDER BY t1.callid ASC

SELECT distinct
group,
max_date = MAX(date) OVER (PARTITION BY group), checks
FROM table
Should work.
