Hive - Error: missing EOF at 'WHERE'

I'm trying to learn Hive, especially functions like unix_timestamp and from_unixtime.
I have three tables
emp (employee table)
+---+----------------+
| id| name|
+---+----------------+
| 1| James Gordon|
| 2| Harvey Bullock|
| 3| Kristen Kringle|
+---+----------------+
txn (transaction table)
+------+----------+---------+
|acc_id|trans_date|trans_amt|
+------+----------+---------+
| 101| 20180105| 951|
| 102| 20180205| 800|
| 103| 20180131| 100|
| 101| 20180112| 50|
| 102| 20180126| 800|
| 103| 20180203| 500|
+------+----------+---------+
acc (account table)
+---+------+--------+
| id|acc_id|cred_lim|
+---+------+--------+
| 1| 101| 1000|
| 2| 102| 1500|
| 3| 103| 800|
+---+------+--------+
I want to find the people whose total trans_amt exceeded their cred_lim in the month of Jan 2018.
The query I'm trying to use is:
WITH tabl as
(
SELECT e.id, e.name, a.acc_id, t.trans_amt, a.cred_lim, from_unixtime(unix_timestamp(t.trans_date, 'yyyyMMdd'), 'MMM yyyy') month
FROM emp e JOIN acc a on e.id = a.id JOIN txn t on a.acc_id = t.acc_id
)
SELECT acc_id, sum(trans_amt) total_amt
FROM tabl
GROUP BY tabl.acc_id, tabl.month
WHERE tabl.month = 'Jan 2018' AND tabl.total_amt > cred_lim;
But when I run it, I get an error saying
FAILED: ParseException line 9:2 missing EOF at 'WHERE' near 'month'
This error persists even when I change the where clause to
WHERE tabl.total_amt > cred_lim;
This makes me think the error comes from the GROUP BY clause but I can't seem to figure this out.
Could someone help me with this?

Your query has several problems.
The WHERE clause must come before GROUP BY.
There is an extra ')' after the GROUP BY columns.
tabl.total_amt > cred_lim cannot go in the WHERE clause, because the alias total_amt refers to an aggregate that has not been computed yet at that point. Use a HAVING clause for that condition instead.
I've made these changes in the query below, which should work for you.
WITH tabl
AS (
SELECT e.id
,e.name
,a.acc_id
,t.trans_amt
,a.cred_lim
,from_unixtime(unix_timestamp(t.trans_date, 'yyyyMMdd'), 'MMM yyyy') month
FROM emp e
INNER JOIN acc a ON e.id = a.id
INNER JOIN txn t ON a.acc_id = t.acc_id
)
SELECT acc_id
,sum(trans_amt) total_amt
FROM tabl
WHERE month = 'Jan 2018'
GROUP BY acc_id
,month
HAVING SUM(trans_amt) > MAX(cred_lim);
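As a side note, since trans_date is already stored in yyyyMMdd form, the January 2018 filter can also be applied before aggregation, without the unix_timestamp round trip. This is only a sketch, assuming trans_date is a string (cast it first if it is numeric):
WITH tabl
AS (
SELECT a.acc_id
,t.trans_amt
,a.cred_lim
FROM emp e
INNER JOIN acc a ON e.id = a.id
INNER JOIN txn t ON a.acc_id = t.acc_id
WHERE substr(t.trans_date, 1, 6) = '201801'  -- keep only Jan 2018 rows before grouping
)
SELECT acc_id
,sum(trans_amt) total_amt
FROM tabl
GROUP BY acc_id
HAVING SUM(trans_amt) > MAX(cred_lim);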

Related

Customer life cycle status analysis based on monthly activity

Hi, my company wants to better track how many users are active on our platform. We are using Microsoft SQL Server 2019 as the database, connected to Azure Data Studio.
Below are the DDLs for two tables from our DB:
CALENDAR TABLE
COLUMN               |DATA TYPE        |DETAILS
---------------------+-----------------+---------------------------
CALENDAR_DATE        |DATE NOT NULL    |Base date (YYYY-MM-DD)
CALENDAR_YEAR        |INTEGER NOT NULL |2010, 2011 etc
CALENDAR_MONTH_NUMBER|INTEGER NOT NULL |1-12
CALENDAR_MONTH_NAME  |VARCHAR(100)     |January, February etc
CALENDAR_DAY_OF_MONTH|INTEGER NOT NULL |1-31
CALENDAR_DAY_OF_WEEK |INTEGER NOT NULL |1-7
CALENDAR_DAY_NAME    |INTEGER NOT NULL |Monday, Tuesday etc
CALENDAR_YEAR_MONTH  |INTEGER NOT NULL |201011, 201012, 201101 etc
REVENUE ANALYSIS
Column             |Data Type                   |Details
-------------------+----------------------------+--------------------------------
ACTIVITY_DATE      |DATE NOT NULL               |Date Wager was made
MEMBER_ID          |INTEGER NOT NULL            |Unique Player identifier
GAME_ID            |SMALLINT NOT NULL           |Unique Game identifier
WAGER_AMOUNT       |REAL NOT NULL               |Total amount wagered on the game
NUMBER_OF_WAGERS   |INTEGER NOT NULL            |Number of wagers on the game
WIN_AMOUNT         |REAL NOT NULL               |Total amount won on the game
ACTIVITY_YEAR_MONTH|INTEGER NOT NULL            |YYYYMM
BANK_TYPE_ID       |SMALLINT DEFAULT 0 NOT NULL |0=Real money, 1=Bonus money
Long story short "active" means that the member has made a minimum of one real money wager in the month.
Every month a member has a certain lifecycle type. This status will change on a monthly basis on their previous and current months activity. The statuses are the following:
NEW - First time they placed a real money wager
RETAINED - Active in the prior calendar month and the current calendar month
UNRETAINED - Active in the prior calendar month but not active in the current calendar month
REACTIVATED - Not active in the prior calendar month, but active in the current calendar month
LAPSED - Not active in the prior calendar month or the current calendar month
We would like initially to get to a view with the columns below:
MEMBER_ID | CALENDAR_YEAR_MONTH | MEMBER_LIFECYCLE_STATUS | LAPSED_MONTHS
Also the view should display one row per member per month, starting from the month in which they first placed a real money wager. This view should give their lifecycle status for that month, and if the member has lapsed, it should show a rolling count of the number of months since they were last active.
So far I have come up with the following CTE to give me a basis for the view. However, I am not sure about the UNRETAINED and REACTIVATED statuses. Any ideas?
with all_activities as (
select a.member_id, activity_date, calendar_month_number as month_activity, calendar_year as year_activity,
datepart(month,CURRENT_TIMESTAMP) as current_month, datepart(year,CURRENT_TIMESTAMP) as current_year,
datepart(month,CONVERT(DATE, DATEADD(DAY,-DAY(GETDATE()),GETDATE()))) as previous_month, datepart(year,CONVERT(DATE, DATEADD(DAY,-DAY(GETDATE()),GETDATE()))) as year_last_month,
a.NUMBER_OF_WAGERS, (case when datepart(month,CURRENT_TIMESTAMP) = calendar_month_number and datepart(year,CURRENT_TIMESTAMP) = calendar_year then 'active' else 'inactive' end) as status,
case when (case when datepart(month,CURRENT_TIMESTAMP) = calendar_month_number and datepart(year,CURRENT_TIMESTAMP) = calendar_year then 'active' else 'inactive' end) = 'active' and number_of_wagers = 1 then 'New'
when (LAG((case when datepart(month,CURRENT_TIMESTAMP) = calendar_month_number and datepart(year,CURRENT_TIMESTAMP) = calendar_year then 'active' else 'inactive' end) ,1,0) OVER(PARTITION BY member_id ORDER BY calendar_month_number desc) = 'active' and calendar_month_number = datepart(month,CONVERT(DATE, DATEADD(DAY,-DAY(GETDATE()),GETDATE())))) then 'Retained'
when (calendar_month_number = datepart(month,CURRENT_TIMESTAMP) and year_activity = datepart(year,CURRENT_TIMESTAMP) and calendar_month_number = datepart(month,CONVERT(DATE, DATEADD(DAY,-DAY(GETDATE()),GETDATE())))) then 'Unretained'
end as lifecycle_status
from [dbo].[REVENUE_ANALYSIS] a
join CALENDAR b on a.ACTIVITY_DATE= b.CALENDAR_DATE
)
select * from all_activities
This is about customer lifecycle status analysis, which requires a couple of things:
customer acquisition date (it is worth storing this, because some customers' history may go back years or even decades). For this question, we assume revenue_analysis has everything we need and derive the acquisition month from it.
lapsed vs churned: a churned customer is usually defined as one with no activity for some period of time. For this question we don't have that definition, so a user is reported as lapsed indefinitely.
For the lifecycle status calculation, we gather (member_id, calendar_month, acquisition_month, activity_month, prior_activity_month) per member per month, so that we can derive the final result.
with cte_new_user_monthly as (
select member_id,
min(activity_year_month) as acquisition_month
from revenue_analysis
group by 1),
cte_user_monthly as (
select u.member_id,
u.acquisition_month,
m.yyyymm as calendar_month
from cte_new_user_monthly u,
calendar_month m
where u.acquisition_month <= m.yyyymm),
cte_user_activity_monthly as (
select f.member_id,
f.activity_year_month as activity_month
from revenue_analysis f
group by 1,2),
cte_user_lifecycle as (
select u.member_id,
u.calendar_month,
u.acquisition_month,
m.activity_month
from cte_user_monthly u
left
join cte_user_activity_monthly m
on u.member_id = m.member_id
and u.calendar_month = m.activity_month),
cte_user_status as (
select member_id,
calendar_month,
acquisition_month,
activity_month,
lag(activity_month,1) over (partition by member_id order by calendar_month) as prior_activity_month
from cte_user_lifecycle),
user_status_monthly as (
select member_id,
calendar_month,
activity_month,
case
when calendar_month = acquisition_month then 'NEW'
when prior_activity_month is not null and activity_month is not null then 'RETAINED'
when prior_activity_month is not null and activity_month is null then 'UNRETAINED'
when prior_activity_month is null and activity_month is not null then 'REACTIVATED'
when prior_activity_month is null and activity_month is null then 'LAPSED'
else null
end as user_status
from cte_user_status)
select member_id,
calendar_month,
activity_month,
user_status,
row_number() over (partition by member_id, user_status order by calendar_month) as months
from user_status_monthly
order by 1,2;
Result (activity_month included for easier understanding):
member_id|calendar_month|activity_month|user_status|months|
---------+--------------+--------------+-----------+------+
1001| 201701| 201701|NEW | 1|
1001| 201702| |UNRETAINED | 1|
1001| 201703| |LAPSED | 1|
1001| 201704| |LAPSED | 2|
1001| 201705| 201705|REACTIVATED| 1|
1001| 201706| 201706|RETAINED | 1|
1001| 201707| |UNRETAINED | 2|
1001| 201708| |LAPSED | 3|
1001| 201709| 201709|REACTIVATED| 2|
1001| 201710| |UNRETAINED | 3|
1001| 201711| |LAPSED | 4|
1001| 201712| 201712|REACTIVATED| 3|
1002| 201703| 201703|NEW | 1|
1002| 201704| |UNRETAINED | 1|
1002| 201705| |LAPSED | 1|
1002| 201706| |LAPSED | 2|
1002| 201707| |LAPSED | 3|
1002| 201708| |LAPSED | 4|
1002| 201709| |LAPSED | 5|
1002| 201710| |LAPSED | 6|
1002| 201711| |LAPSED | 7|
1002| 201712| |LAPSED | 8|
EDIT:
The code was tested in MySQL because I didn't notice the 'mysql' tag had been removed.
calendar_month in the code can be derived from the calendar dimension.
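A minimal sketch of that derivation (an assumption on my part, using the CALENDAR table from the question, whose CALENDAR_YEAR_MONTH column already holds the YYYYMM integer):
-- hypothetical month-level dimension used as calendar_month above
create view calendar_month as
select distinct calendar_year_month as yyyymm
from calendar;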

Running total between two dates SQL

I have a problem building an efficient query to get a running total of sales between two dates.
Right now I have this query:
select SalesId,
sum(Sales) as number_of_sales,
Sales_DATE as SalesDate,
ADD_MONTHS(Sales_DATE , -12) as SalesDatePrevYear
from DWH.L_SALES
group by SalesId, Sales_DATE
With the result:
SalesId|number_of_sales|SalesDate|SalesDatePrevYear|
-------+---------------+---------+-----------------+
   1000|              1| 20200101|         20190101|
   1001|              1| 20220101|         20210101|
   1002|              1| 20220201|         20210201|
   1003|              1| 20220301|         20210301|
The preferred result is the following:
SalesId|number_of_sales|running total of sales|SalesDate|SalesDatePrevYear|
-------+---------------+----------------------+---------+-----------------+
   1000|              1|                     1| 20200101|         20190101|
   1001|              1|                     1| 20220101|         20210101|
   1002|              1|                     2| 20220201|         20210201|
   1003|              1|                     3| 20220301|         20210301|
As you can see, I want the running total of sales between the two dates, but because I also need the lower level (SalesId) in the grouping, it always stays at 1.
How can I get this efficiently?
You have already got a result that gives you the start and end dates you care about, so you just need to join that result back to the original data with an inequality join and then sum the matches. I suggest looking into CTEs (Common Table Expressions), a style that is helpful for learning and debugging.
For example,
WITH CTE_BASE_RESULT AS
(
your query goes here
)
SELECT CTE_BASE_RESULT.SalesId, CTE_BASE_RESULT.SalesDate, SUM(Sales) AS Total_Sales_Prior_Year
FROM CTE_BASE_RESULT
INNER JOIN DWH.L_Sales
ON CTE_BASE_RESULT.SalesId = L_Sales.SalesId
AND CTE_BASE_RESULT.SalesDate >= L_Sales.Sales_DATE
AND CTE_BASE_RESULT.SalesDatePrevYear <= L_Sales.Sales_DATE
GROUP BY CTE_BASE_RESULT.SalesId, CTE_BASE_RESULT.SalesDate
I also recommend a website like SQL Generator that can help write complex operations; this pattern, for example, is called a Timeseries Aggregate.
This syntax works for Snowflake; I didn't see which system you're on.
Alternatively,
WITH BASIC_OFFSET_1YEAR AS (
SELECT
A.SalesId,
A.Sales_DATE,
SUM(B.Sales) as SUM_SALES_PAST1YEAR
FROM
DWH.L_SALES A
INNER JOIN DWH.L_SALES B ON A.SalesId = B.SalesId
WHERE
B.Sales_DATE >= DATEADD(YEAR, -1, A.Sales_DATE)
AND B.Sales_DATE <= A.Sales_DATE
GROUP BY
A.SalesId,
A.Sales_DATE
)
SELECT
src.*, BASIC_OFFSET_1YEAR.SUM_SALES_PAST1YEAR
FROM
DWH.L_SALES src
LEFT OUTER JOIN BASIC_OFFSET_1YEAR
ON BASIC_OFFSET_1YEAR.Sales_DATE = src.Sales_DATE
AND BASIC_OFFSET_1YEAR.SalesId = src.SalesId
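A window-function alternative (not taken from either query above): on engines that accept interval bounds in RANGE frames, such as Oracle, which also has the ADD_MONTHS used in the question, the rolling 12-month total can be computed without a self join. This is only a sketch, assuming Sales_DATE is a genuine DATE column; note that it totals across all SalesId values, in line with the desired output shown in the question:
select SalesId,
       sum(Sales) as number_of_sales,
       Sales_DATE as SalesDate,
       ADD_MONTHS(Sales_DATE, -12) as SalesDatePrevYear,
       -- rolling total of all sales in the 12 months up to and including this date
       sum(sum(Sales)) over (
           order by Sales_DATE
           range between interval '12' month preceding and current row
       ) as running_total_of_sales
from DWH.L_SALES
group by SalesId, Sales_DATE;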

How to return all records with the latest datetime value [PostgreSQL]

How can I return only the records with the latest upload_date(s) from the data below?
My data is as follows:
upload_date |day_name |rows_added|row_count_delta|days_since_last_update|
-----------------------+---------+----------+---------------+----------------------+
2022-05-01 00:00:00.000|Sunday | 526043| | |
2022-05-02 00:00:00.000|Monday | 467082| -58961| 1|
2022-05-02 15:58:54.094|Monday | 421427| -45655| 0|
2022-05-02 18:19:22.894|Monday | 421427| 0| 0|
2022-05-03 16:54:04.136|Tuesday | 496021| 74594| 1|
2022-05-03 18:17:27.502|Tuesday | 496021| 0| 0|
2022-05-04 18:19:26.392|Wednesday| 487154| -8867| 1|
2022-05-05 18:18:15.277|Thursday | 489713| 2559| 1|
2022-05-06 16:15:39.518|Friday | 489713| 0| 1|
2022-05-07 16:18:00.916|Saturday | 482955| -6758| 1|
My desired results should be:
upload_date |day_name |rows_added|row_count_delta|days_since_last_update|
-----------------------+---------+----------+---------------+----------------------+
2022-05-01 00:00:00.000|Sunday | 526043| | |
2022-05-02 18:19:22.894|Monday | 421427| 0| 0|
2022-05-03 18:17:27.502|Tuesday | 496021| 0| 0|
2022-05-04 18:19:26.392|Wednesday| 487154| -8867| 1|
2022-05-05 18:18:15.277|Thursday | 489713| 2559| 1|
2022-05-06 16:15:39.518|Friday | 489713| 0| 1|
2022-05-07 16:18:00.916|Saturday | 482955| -6758| 1|
NOTE only the latest upload_date for 2022-05-02 and 2022-05-03 should be in the result set.
You can use a window function to PARTITION BY the day (casting the timestamp to a date) and sort each partition with the most recent record first by ordering by upload_date descending. ROW_NUMBER() then assigns 1 to the most recent record per date, so you just filter on that row number. Note that I am assuming the data type of upload_date is TIMESTAMP here.
SELECT
*
FROM (
SELECT
your_table.*,
ROW_NUMBER() OVER (PARTITION BY CAST(upload_date AS DATE)
ORDER BY upload_date DESC) rownum
FROM your_table
) sub
WHERE rownum = 1
demo
WITH cte AS (
SELECT
max(upload_date) OVER (PARTITION BY upload_date::date) AS max_upload_date,
upload_date,
day_name,
rows_added,
row_count_delta,
days_since_last_update
FROM test101 ORDER BY 1
)
SELECT
upload_date,
day_name,
rows_added,
row_count_delta,
days_since_last_update
FROM
cte
WHERE
max_upload_date = upload_date;
This is more verbose but I find it easier to read and build:
SELECT *
FROM mytable t1
JOIN (
SELECT CAST(upload_date AS DATE) day_date, MAX(upload_date) max_date
FROM mytable
GROUP BY day_date) t2
ON t1.upload_date = t2.max_date AND
CAST(upload_date AS DATE) = t2.day_date;
I don't know about performance right away, but I suspect the window function is worse because you will need to order by, which is usually a slow operation unless your table already has an index for it.
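For what it's worth, here is a sketch of the kind of index that last remark alludes to (hypothetical, since the real table definition isn't shown). In PostgreSQL an expression index on the day plus the timestamp descending supports both the per-day grouping used here and the per-day ordering in the window-function answer:
-- hypothetical index; mytable and upload_date follow the names used above
CREATE INDEX mytable_upload_day_idx
ON mytable ((CAST(upload_date AS DATE)), upload_date DESC);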
Use DISTINCT ON:
SELECT DISTINCT ON (date_trunc('day', upload_date))
to_char(upload_date, 'Day') AS weekday, * -- added weekday optional
FROM tbl
ORDER BY date_trunc('day', upload_date), upload_date DESC;
db<>fiddle here
For few rows per day (like your sample data suggests) it's the simplest and fastest solution possible. See:
Select first row in each GROUP BY group?
I dropped the column day_name from the table; it's just a redundant representation of the timestamp. Storing it only adds cost, noise, and opportunities for inconsistent data. If you need the weekday displayed, use to_char(upload_date, 'Day') AS weekday as demonstrated above.
The query works for any number of days, not restricted to 7 weekdays.

Selecting the most recent row before a certain timestamp

I have a table like this called tt
ID|Name|Date|Value|
------------------------------------
0| S1| 2017-03-05 00:00:00| 1.5|
1| S1| 2017-04-05 00:00:00| 1.2|
2| S2| 2017-04-06 00:00:00| 1.2|
3| S3| 2017-04-07 00:00:00| 1.1|
4| S3| 2017-05-07 00:00:00| 1.2|
I need to select the row with the highest timestamp for each Name that is < theTime.
theTime is just a variable holding the timestamp. In the example you could hardcode a date string, e.g. < DATE '2017-05-01'; I will inject the value of the variable later programmatically from another language.
I'm having a difficult time figuring out how to do this... does anyone know?
Also, I would like to know how to select what I described above but limited to a specific name, e.g. name='S3'
It would be nice if hsqldb really supported row_number():
select t.*
from (select tt.*,
row_number() over (partition by name order by date desc) as seqnum
from tt
where . . .
) t
where seqnum = 1;
Lacking that, use a group by and join:
select tt.*
from tt join
(select name, max(date) as maxd
from tt
where date < THETIME
group by name
) ttn
on tt.name = ttn.name and tt.date = ttn.maxd;
Note: this will return duplicates if the maximum date has duplicates for a given name.
The WHERE clause carries the restriction on your timestamp.
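For the second part of the question (limiting to a specific name such as 'S3'), it is enough to add the name filter inside the derived table; the join on name then restricts the outer rows as well. A sketch, with THETIME still standing in for the injected value:
select tt.*
from tt join
     (select name, max(date) as maxd
      from tt
      where date < THETIME and name = 'S3'   -- restrict to one name here
      group by name
     ) ttn
     on tt.name = ttn.name and tt.date = ttn.maxd;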

Aggregate by aggregate (ARRAY_AGG)?

Let's say I have a simple table agg_test with 3 columns - id, column_1 and column_2. Dataset, for example:
id|column_1|column_2
--------------------
1| 1| 1
2| 1| 2
3| 1| 3
4| 1| 4
5| 2| 1
6| 3| 2
7| 4| 3
8| 4| 4
9| 5| 3
10| 5| 4
A query like this (with self join):
SELECT
a1.column_1,
a2.column_1,
ARRAY_AGG(DISTINCT a1.column_2 ORDER BY a1.column_2)
FROM agg_test a1
JOIN agg_test a2 ON a1.column_2 = a2.column_2 AND a1.column_1 <> a2.column_1
WHERE a1.column_1 = 1
GROUP BY a1.column_1, a2.column_1
Will produce a result like this:
column_1|column_1|array_agg
---------------------------
1| 2| {1}
1| 3| {2}
1| 4| {3,4}
1| 5| {3,4}
We can see that for values 4 and 5 from the joined table we have the same result in the last column. So, is it possible to somehow group the results by it, e.g:
column_1|column_1|array_agg
---------------------------
1| {2}| {1}
1| {3}| {2}
1| {4,5}| {3,4}
Thanks for any answers. If anything isn't clear or can be presented in a better way - tell me in the comments and I'll try to make this question as readable as I can.
I'm not sure if you can group by an array. If you can, here is one approach:
select col1, array_agg(col2), ar
from (SELECT a1.column_1 as col1, a2.column_1 as col2,
ARRAY_AGG(DISTINCT a1.column_2 ORDER BY a1.column_2) as ar
FROM agg_test a1 JOIN
agg_test a2
ON a1.column_2 = a2.column_2 AND a1.column_1 <> a2.column_1
WHERE a1.column_1 = 1
GROUP BY a1.column_1, a2.column_1
) t
group by col1, ar
The alternative is to use array_to_string to convert the array values into a string, so the grouping key is a plain scalar.
You could also try something like this:
SELECT DISTINCT
a1.column_1,
ARRAY_AGG(a2.column_1) OVER (
PARTITION BY
a1.column_1,
ARRAY_AGG(DISTINCT a1.column_2 ORDER BY a1.column_2)
) AS "a2.column_1 agg",
ARRAY_AGG(DISTINCT a1.column_2 ORDER BY a1.column_2)
FROM agg_test a1
JOIN agg_test a2 ON a1.column_2 = a2.column_2 AND a1.column_1 <> a2.column_1
WHERE a1.column_1 = 1
GROUP BY a1.column_1, a2.column_1
;
(The DISTINCT and the window ARRAY_AGG are the parts that differ from the query posted in your question.)
The above uses a window ARRAY_AGG to combine the values of a2.column_1 alongside the other ARRAY_AGG, using the latter's result as one of the partitioning criteria. Without the DISTINCT, it would produce two {4,5} rows for your example. So, DISTINCT is needed to eliminate the duplicates.
Here's a SQL Fiddle demo: http://sqlfiddle.com/#!1/df5c3/4
Note, though, that the window ARRAY_AGG cannot have an ORDER BY like its "normal" counterpart. That means the order of a2.column_1 values in the list is indeterminate, although in the linked demo it does happen to match the one in your expected output.
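If a deterministic order for those values ever matters, a generic workaround (a standalone sketch, not tied to the exact query above) is to sort a finished array by unnesting and re-aggregating it:
-- sorts an already-built array; replace the literal with your array column
SELECT ARRAY(SELECT v FROM unnest(ARRAY[4,5,2]) AS v ORDER BY v) AS sorted_array;
-- result: {2,4,5}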