HIVEQL, COUNT AMOUNT OF RECORDS PER DAY - hive

I have a table in Hive with this structure:
+--------+------------------+---------+
| rating | date | version |
+--------+------------------+---------+
| 3 | 2021-07-01 12:13 | 2.1.9 |
| 5 | 2021-07-01 10:39 | 2.2.6 |
| 4 | 2021-07-02 10:24 | 2.2.7 |
| 5 | 2021-07-02 05:37 | 3.2.4 |
| 1 | 2021-07-02 21:40 | 3.2.5 |
+--------+------------------+---------+
How do I get the number of records per day and month with HiveQL?

Count per day:
select substr(`date`,1,10) as `day`,
count(*) cnt
from table_name
group by substr(`date`,1,10);
Monthly:
select substr(`date`,1,7) as `month`,
count(*) cnt
from table_name
group by substr(`date`,1,7);
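To check both queries against the sample rows, here is a minimal sketch that runs them in an in-memory SQLite database. SQLite is only a stand-in for Hive here (an assumption for illustration); `substr` and `GROUP BY` behave the same way for this case, and the table name `table_name` is taken from the answer.

```python
import sqlite3

# In-memory SQLite stands in for Hive (assumption, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE table_name (rating INTEGER, "date" TEXT, version TEXT)')
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?)",
    [
        (3, "2021-07-01 12:13", "2.1.9"),
        (5, "2021-07-01 10:39", "2.2.6"),
        (4, "2021-07-02 10:24", "2.2.7"),
        (5, "2021-07-02 05:37", "3.2.4"),
        (1, "2021-07-02 21:40", "3.2.5"),
    ],
)

# The first 10 characters of the timestamp string are the day (YYYY-MM-DD).
per_day = conn.execute(
    'SELECT substr("date", 1, 10) AS day, count(*) AS cnt '
    'FROM table_name GROUP BY substr("date", 1, 10) ORDER BY day'
).fetchall()

# The first 7 characters are the month (YYYY-MM).
per_month = conn.execute(
    'SELECT substr("date", 1, 7) AS month, count(*) AS cnt '
    'FROM table_name GROUP BY substr("date", 1, 7) ORDER BY month'
).fetchall()

print(per_day)
print(per_month)
```

This only works because the column is stored as a string in `YYYY-MM-DD HH:MM` form; a real timestamp column would need a date function instead of `substr`.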

Related

Determine closest date to another date value teradata

My dataset looks like this. For every combination of customer ID, order ID and ship date, I would like to retrieve one process date that is less than or equal to the ship date. If the process date is greater than the ship date and no lower process date exists, then use the ship date as the process date.
+-------------+----------+------------+--------------+--+
| Customer ID | Order ID | Ship Date | Process Date | |
+-------------+----------+------------+--------------+--+
| 1000 | 100 | 9/17/2020 | 9/17/2020 | |
| 1000 | 100 | 9/17/2020 | 10/16/2020 | |
| 1000 | 100 | 9/17/2020 | 9/16/2020 | |
| 2000 | 200 | 8/15/2020 | 8/13/2020 | |
| 2000 | 300 | 10/14/2020 | 10/13/2020 | |
| 3000 | 400 | 3/4/2020 | 4/2/2020 | |
| 3000 | 400 | 3/4/2020 | 3/3/2020 | |
| 3000 | 400 | 3/4/2020 | 3/5/2020 | |
| 4000 | 500 | 5/1/2020 | 5/3/2020 | |
| 5000 | 600 | 6/1/2020 | 7/1/2020 | |
| 5000 | 600 | 6/1/2020 | 7/2/2020 | |
| 6000 | 700 | 7/14/2020 | 7/13/2020 | |
| 6000 | 700 | 7/14/2020 | 6/10/2020 | |
+-------------+----------+------------+--------------+--+
Desired Output
+-------------+----------+------------+--------------+--+
| Customer ID | Order ID | Ship Date | Process Date | |
+-------------+----------+------------+--------------+--+
| 1000 | 100 | 9/17/2020 | 9/17/2020 | |
| 2000 | 200 | 8/15/2020 | 8/13/2020 | |
| 2000 | 300 | 10/14/2020 | 10/13/2020 | |
| 3000 | 400 | 3/4/2020 | 3/3/2020 | |
| 4000 | 500 | 5/1/2020 | 5/1/2020 | |
| 5000 | 600 | 6/1/2020 | 6/1/2020 | |
| 6000 | 700 | 7/14/2020 | 7/13/2020 | |
+-------------+----------+------------+--------------+--+
I tried using ROWNUM and a date difference, but I'm stuck after getting the row numbers in ascending order and am not sure how to proceed.
"If the process date is greater than the ship date and no lower process date exists, then use the ship date as the process date."
Do a GROUP BY. You can use MAX() to return the latest ProcessDate <= ShipDate. If no such ProcessDate exists, return ShipDate.
select CustomerID, orderID, ShipDate,
coalesce(MAX(case when ProcessDate <= ShipDate then ProcessDate end), ShipDate)
from tablename
group by CustomerID, orderID, ShipDate
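As a quick check of this GROUP BY approach, the sketch below loads the sample rows into an in-memory SQLite database and runs the same query. SQLite stands in for Teradata (an assumption), the dates are stored as ISO strings so that `<=` and `MAX` compare correctly, and the output alias `ChosenProcessDate` is a name introduced here for illustration.

```python
import sqlite3

# SQLite stands in for Teradata; dates are ISO strings so they sort correctly.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename "
    "(CustomerID INTEGER, OrderID INTEGER, ShipDate TEXT, ProcessDate TEXT)"
)
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?, ?, ?)",
    [
        (1000, 100, "2020-09-17", "2020-09-17"),
        (1000, 100, "2020-09-17", "2020-10-16"),
        (1000, 100, "2020-09-17", "2020-09-16"),
        (2000, 200, "2020-08-15", "2020-08-13"),
        (2000, 300, "2020-10-14", "2020-10-13"),
        (3000, 400, "2020-03-04", "2020-04-02"),
        (3000, 400, "2020-03-04", "2020-03-03"),
        (3000, 400, "2020-03-04", "2020-03-05"),
        (4000, 500, "2020-05-01", "2020-05-03"),
        (5000, 600, "2020-06-01", "2020-07-01"),
        (5000, 600, "2020-06-01", "2020-07-02"),
        (6000, 700, "2020-07-14", "2020-07-13"),
        (6000, 700, "2020-07-14", "2020-06-10"),
    ],
)

# The CASE yields NULL for process dates after the ship date, so MAX ignores
# them; COALESCE falls back to the ship date when nothing qualifies.
closest = conn.execute(
    """
    SELECT CustomerID, OrderID, ShipDate,
           coalesce(max(CASE WHEN ProcessDate <= ShipDate THEN ProcessDate END),
                    ShipDate) AS ChosenProcessDate
    FROM tablename
    GROUP BY CustomerID, OrderID, ShipDate
    ORDER BY CustomerID, OrderID
    """
).fetchall()

for row in closest:
    print(row)
```

On the sample data this reproduces the desired output, including the two orders (500 and 600) that have no process date on or before the ship date.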
I think you want filtering and row_number():
select t.*
from (select t.*,
row_number() over (partition by customer_id, order_id, ship_date order by process_date desc) as seqnum
from t
where process_date <= ship_date
) t
where seqnum = 1;
I'm not sure if customer_id and ship_date are really needed in the partition by clause. order_id seems sufficient.
This should return the expected result:
select CustomerID, orderID, ShipDate,
-- If the process date is greater than the ship date and no lower
-- process date exist, then use the ship date as the process date
least(ProcessDate, ShipDate)
from tablename
qualify
-- retrieve 1 process date that is less than or equal to the ship date
row_number()
over (partition by CustomerID, orderID, ShipDate
order by case when ProcessDate <= ShipDate then ProcessDate end desc nulls last) = 1

SQL (Redshift) get start and end values for consecutive data in a given column

I have a table that has the subscription state of users on any given day. The data looks like this
+------------+------------+--------------+
| account_id | date | current_plan |
+------------+------------+--------------+
| 1 | 2019-08-01 | free |
| 1 | 2019-08-02 | free |
| 1 | 2019-08-03 | yearly |
| 1 | 2019-08-04 | yearly |
| 1 | 2019-08-05 | yearly |
| ... | | |
| 1 | 2020-08-02 | yearly |
| 1 | 2020-08-03 | free |
| 2 | 2019-08-01 | monthly |
| 2 | 2019-08-02 | monthly |
| ... | | |
| 2 | 2019-08-31 | monthly |
| 2 | 2019-09-01 | free |
| ... | | |
| 2 | 2019-11-26 | free |
| 2 | 2019-11-27 | monthly |
| ... | | |
| 2 | 2019-12-27 | monthly |
| 2 | 2019-12-28 | free |
+------------+------------+--------------+
I would like to have a table that gives the start and end dates of each subscription. It would look something like this:
+------------+------------+------------+-------------------+
| account_id | start_date | end_date | subscription_type |
+------------+------------+------------+-------------------+
| 1 | 2019-08-03 | 2020-08-02 | yearly |
| 2 | 2019-08-01 | 2019-08-31 | monthly |
| 2 | 2019-11-27 | 2019-12-27 | monthly |
+------------+------------+------------+-------------------+
I started with a LAG window function and a series of WHERE conditions to grab the "state changes", but this makes it difficult to see when customers move in and out of subscriptions, and I'm not sure this is the best method.
with lag as (
select *, LAG(current_plan) OVER (PARTITION BY account_id ORDER BY date ASC) AS previous_plan
, LAG(date) OVER (PARTITION BY account_id ORDER BY date ASC) AS previous_plan_date
from data
)
SELECT *
FROM lag
where (current_plan = 'free' and previous_plan in ('monthly', 'yearly'))
This is a gaps-and-islands problem. I think a difference of row numbers works:
select account_id, current_plan, min(date) as start_date, max(date) as end_date
from (select d.*,
row_number() over (partition by account_id order by date) as seqnum,
row_number() over (partition by account_id, current_plan order by date) as seqnum_2
from data d
) d
where current_plan <> 'free'
group by account_id, current_plan, (seqnum - seqnum_2);
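The difference-of-row-numbers idea can be tried on a small, made-up dataset. The sketch below uses an in-memory SQLite database as a stand-in for Redshift (an assumption; it needs SQLite 3.25 or later for window functions), and the eleven rows are hypothetical, shortened versions of the runs in the question.

```python
import sqlite3

# SQLite (3.25+ for window functions) stands in for Redshift.
# The rows are hypothetical, shortened runs of daily plan states.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE data (account_id INTEGER, "date" TEXT, current_plan TEXT)')
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [
        (1, "2019-08-01", "free"),
        (1, "2019-08-02", "yearly"),
        (1, "2019-08-03", "yearly"),
        (1, "2019-08-04", "free"),
        (2, "2019-08-01", "monthly"),
        (2, "2019-08-02", "monthly"),
        (2, "2019-08-03", "free"),
        (2, "2019-08-04", "free"),
        (2, "2019-08-05", "monthly"),
        (2, "2019-08-06", "monthly"),
        (2, "2019-08-07", "free"),
    ],
)

# Within one unbroken run of the same plan, both row numbers advance together,
# so seqnum - seqnum_2 stays constant; it changes once another plan intervenes.
islands = conn.execute(
    """
    SELECT account_id,
           min("date") AS start_date,
           max("date") AS end_date,
           current_plan AS subscription_type
    FROM (SELECT d.*,
                 row_number() OVER (PARTITION BY account_id
                                    ORDER BY "date") AS seqnum,
                 row_number() OVER (PARTITION BY account_id, current_plan
                                    ORDER BY "date") AS seqnum_2
          FROM data d) d
    WHERE current_plan <> 'free'
    GROUP BY account_id, current_plan, (seqnum - seqnum_2)
    ORDER BY account_id, start_date
    """
).fetchall()

for row in islands:
    print(row)
```

Account 2 comes back as two separate monthly periods because the two free days between them shift `seqnum` but not `seqnum_2`. The technique assumes one row per account per day with no missing days.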

max(sum(field query in Hive/SQL

I have a table with lots of transactions for users across a month.
I need to take the hour from each day where Sum(cost) is at its highest.
I've tried MAX(SUM(Cost)) but get an error.
How would I go about doing this please?
here is some sample data
+-------------+------+----------+------+
| user id | hour | date | Cost |
+-------------+------+----------+------+
| 343252 | 13 | 20170101 | 21.5 |
| 32532532 | 13 | 20170101 | 22.5 |
| 35325325 | 13 | 20170101 | 30.5 |
| 325325325 | 13 | 20170101 | 10 |
| 64643643 | 12 | 20170101 | 22 |
| 643643643 | 12 | 20170101 | 31 |
| 436325234 | 13 | 20170101 | 15 |
| 213213213 | 13 | 20170101 | 12 |
| 53265436436 | 17 | 20170101 | 19 |
+-------------+------+----------+------+
Expected Output:
I need just one row per day, where it shows the total cost from the 'most expensive' hour. In this case, 13:00 had a total cost of 111.5
select hr
,dt
,total_cost
from (select dt
,hr
,sum(cost) as total_cost
,row_number () over
(
partition by dt
order by sum(cost) desc
) as rn
from mytable
group by dt,hr
) t
where rn = 1
+----+------------+------------+
| hr | dt | total_cost |
+----+------------+------------+
| 13 | 2017-01-01 | 111.5 |
+----+------------+------------+
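The same result can be reproduced on the sample rows with the sketch below. It uses an in-memory SQLite database in place of Hive (an assumption; SQLite 3.25 or later is needed for window functions) and splits the work into two steps, summing per day and hour in a CTE named `hourly` (a name introduced here) and ranking afterwards, which is logically equivalent to ordering by `sum(cost)` inside the window.

```python
import sqlite3

# SQLite (3.25+) stands in for Hive; column names dt/hr follow the answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (user_id INTEGER, hr INTEGER, dt TEXT, cost REAL)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (343252, 13, "20170101", 21.5),
        (32532532, 13, "20170101", 22.5),
        (35325325, 13, "20170101", 30.5),
        (325325325, 13, "20170101", 10),
        (64643643, 12, "20170101", 22),
        (643643643, 12, "20170101", 31),
        (436325234, 13, "20170101", 15),
        (213213213, 13, "20170101", 12),
        (53265436436, 17, "20170101", 19),
    ],
)

# Step 1: total cost per (day, hour). Step 2: rank hours within each day by
# that total and keep only the top-ranked one.
top_hours = conn.execute(
    """
    WITH hourly AS (
        SELECT dt, hr, sum(cost) AS total_cost
        FROM mytable
        GROUP BY dt, hr
    )
    SELECT hr, dt, total_cost
    FROM (SELECT hourly.*,
                 row_number() OVER (PARTITION BY dt
                                    ORDER BY total_cost DESC) AS rn
          FROM hourly) t
    WHERE rn = 1
    ORDER BY dt
    """
).fetchall()

print(top_hours)
```

If two hours tie for the highest total on a day, `row_number()` picks one of them arbitrarily; `rank()` would return both.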
Try this:
select AVG(hour) as 'Hour',date as 'Date',sum(cost) as 'TotalCost' from dbo.Table_3 group by date

Postgresql transpose rows to columns

I have this query
select * from sales
shop | date | hour | row_no | amount
-----------+------------+-----------+--------+-----------
shop_1 | 2012-08-14 | 00:08:00 | P01 | 10
shop_2 | 2012-08-12 | 00:12:00 | O05 | 40
shop_2 | 2012-08-12 | 00:12:00 | A01 | 20
I have 1 million rows. I can run this query:
select shop, SUM(amount)
from sales
group by shop
shop | amount |
-----------+------------+
shop_1 | 5666 |
shop_2 | 4044 |
shop_3 | 4044 |
But I need the days as columns, and I don't know how to do this. Can anyone help?
shop | 2012-08-1 | 2012-08-2 | 2012-08-3 |
-----------+------------+-----------+-----------+
shop_1 | 4005 | 5667 | 9987 |
shop_2 | 4333 | 4554 | 1234 |
shop_3 | 4555 | 6778 | 6677 |
The rows would be grouped by shop and the columns by day, in PostgreSQL.
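First, install the tablefunc extension. Since version 9.1 you can do this with CREATE EXTENSION:
CREATE EXTENSION tablefunc;
-- Both arguments to crosstab are SQL strings: the source query must return
-- (row name, category, value) ordered by row name; the second query lists the categories.
select * from crosstab (
'select shop, date, SUM(amount)
from sales
group by shop, date
order by 1, 2',
'select distinct date from sales order by 1')
AS ct(shop text, "2012-08-1" text, "2012-08-2" text, "2012-08-3" text);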
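When the set of days is known in advance, the same pivot can also be written without tablefunc, using conditional aggregation. This is a different technique from crosstab, shown here only as a sketch: it runs in an in-memory SQLite database rather than PostgreSQL (an assumption), and the five rows and the column aliases `d_2012_08_01` and so on are made up for illustration.

```python
import sqlite3

# SQLite stands in for PostgreSQL; the rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE sales (shop TEXT, "date" TEXT, amount INTEGER)')
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("shop_1", "2012-08-01", 10),
        ("shop_1", "2012-08-01", 5),
        ("shop_1", "2012-08-02", 7),
        ("shop_2", "2012-08-01", 40),
        ("shop_2", "2012-08-03", 20),
    ],
)

# One SUM(CASE ...) per output column: each sums only the rows for its day.
# A shop with no sales on a day gets NULL (None in Python) in that column.
pivot = conn.execute(
    """
    SELECT shop,
           sum(CASE WHEN "date" = '2012-08-01' THEN amount END) AS d_2012_08_01,
           sum(CASE WHEN "date" = '2012-08-02' THEN amount END) AS d_2012_08_02,
           sum(CASE WHEN "date" = '2012-08-03' THEN amount END) AS d_2012_08_03
    FROM sales
    GROUP BY shop
    ORDER BY shop
    """
).fetchall()

for row in pivot:
    print(row)
```

The trade-off is that every day has to be spelled out as its own column, so this suits a short, fixed date range; for a changing set of days the column list has to be generated.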

Return if more than x day of the week

I have a table with a list of transactions per day and per customer. I need to find the customers/transaction date that had more than x occurrences of transactions on Sundays over a 6 month period.
Note, there might be more than 1 transaction per customer per day but as long as they have even 1 transaction on a Sunday then that Sunday counts towards the Sunday count for the 6 month period.
This is the code I have so far. I used the sum(transactionvalue) as a method of combining possible multiple transactions on a day into 1 record:
select customernumber,sum(transactionvalue),date from transactions
where date between '2015-01-01' and '2015-06-01'
and datename(weekday, date) = 'Sunday'
group by customernumber,date
having count(date) >= x
However, as I increase the count value 'x', fewer records are returned for a given customer. If a customer has 7 Sundays over the time period, then I expect 7 records to be returned whether x is 1 or 7. Only when x is greater than 7 should none of that customer's transactions be returned.
Here is some sample data:
+-----------------+------------+--------------------+
| Customer Number | Date | Transaction Amount |
+-----------------+------------+--------------------+
| 1 | 17/05/2015 | 11.00 |
| 2 | 17/05/2015 | 21.00 |
| 2 | 17/05/2015 | 22.00 |
| 3 | 17/05/2015 | 31.00 |
| 3 | 17/05/2015 | 32.00 |
| 3 | 17/05/2015 | 33.00 |
| 1 | 24/05/2015 | 11.00 |
| 2 | 24/05/2015 | 21.00 |
| 3 | 24/05/2015 | 31.00 |
| 2 | 31/05/2015 | 21.00 |
+-----------------+------------+--------------------+
In this example I'm looking to have the following returned if x = 1:
+-----------------+------------+--------------------+
| Customer Number | Date | Transaction Amount |
+-----------------+------------+--------------------+
| 1 | 17/05/2015 | 11.00 |
| 2 | 17/05/2015 | 43.00 |
| 3 | 17/05/2015 | 96.00 |
| 1 | 24/05/2015 | 11.00 |
| 2 | 24/05/2015 | 21.00 |
| 3 | 24/05/2015 | 31.00 |
| 2 | 31/05/2015 | 21.00 |
+-----------------+------------+--------------------+
But this returned if x = 3:
+-----------------+------------+--------------------+
| Customer Number | Date | Transaction Amount |
+-----------------+------------+--------------------+
| 2 | 17/05/2015 | 43.00 |
| 2 | 24/05/2015 | 21.00 |
| 2 | 31/05/2015 | 21.00 |
+-----------------+------------+--------------------+
Thanks
Try this:
select customernumber, date, sum(transactionvalue) as transaction_amt
from transactions
where date >= '2015-01-01' and date < '2015-07-01'
and datename(weekday, date) = 'Sunday'
and customernumber in (
select customernumber
from transactions
where date >= '2015-01-01' and date < '2015-07-01'
and datename(weekday, date) = 'Sunday'
group by customernumber
having count(distinct date) >= 3
)
group by customernumber, date
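To see how the threshold behaves on the sample data, the sketch below runs the same query shape in an in-memory SQLite database. SQLite stands in for SQL Server (an assumption), so `datename(weekday, date) = 'Sunday'` is replaced by `strftime('%w', date) = '0'`, the dates are stored as ISO strings, and the helper `sunday_rows` is a name introduced here.

```python
import sqlite3

# SQLite stands in for SQL Server; strftime('%w', ...) = '0' means Sunday.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE transactions '
    '(customernumber INTEGER, "date" TEXT, transactionvalue REAL)'
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, "2015-05-17", 11.0),
        (2, "2015-05-17", 21.0),
        (2, "2015-05-17", 22.0),
        (3, "2015-05-17", 31.0),
        (3, "2015-05-17", 32.0),
        (3, "2015-05-17", 33.0),
        (1, "2015-05-24", 11.0),
        (2, "2015-05-24", 21.0),
        (3, "2015-05-24", 31.0),
        (2, "2015-05-31", 21.0),
    ],
)

# The subquery finds customers with at least x distinct Sundays; the outer
# query then returns one summed row per Sunday for those customers only.
QUERY = """
SELECT customernumber, "date", sum(transactionvalue) AS transaction_amt
FROM transactions
WHERE "date" >= '2015-01-01' AND "date" < '2015-07-01'
  AND strftime('%w', "date") = '0'
  AND customernumber IN (
      SELECT customernumber
      FROM transactions
      WHERE "date" >= '2015-01-01' AND "date" < '2015-07-01'
        AND strftime('%w', "date") = '0'
      GROUP BY customernumber
      HAVING count(DISTINCT "date") >= ?
  )
GROUP BY customernumber, "date"
ORDER BY "date", customernumber
"""


def sunday_rows(x):
    """Return the per-Sunday totals for customers with at least x Sundays."""
    return conn.execute(QUERY, (x,)).fetchall()


rows_x1 = sunday_rows(1)
rows_x3 = sunday_rows(3)
print(rows_x1)
print(rows_x3)
```

Counting distinct dates in the subquery is what keeps several transactions on the same Sunday from inflating a customer's Sunday count.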