HiveQL: select all accounts for a given customer who has AT LEAST ONE account that satisfies a certain criterion - hive

So I have a large dataset that contains credit card accounts. A customer can have multiple credit card accounts, so the account is unique, but the customer is most certainly not (customer '1234' can have 5 accounts). I want to select a customer's entire account list if any of the accounts satisfies a particular requirement. The requirement looks at the last cycle date (when the account last cycled). So let's look at this dataset...
account|customer|last_cycle_dt
4839|1|20190114
9522|1|20190103
1195|1|20181227
5461|2|20190112
1178|2|20190108
2229|2|20181218
8723|3|20181227
5692|3|20181227
0392|4|20190113
1847|5|20190113
0389|5|20190112
3281|5|20190101
2008|5|20181222
3948|5|20181216
I have this data sorted in a particular way so that it's easier to see. In fact, maybe the data needs to be sorted this way to do the extract most efficiently, but I'm not sure.
The criterion in our extract is to select all accounts of every customer who has at least 1 account whose last_cycle_dt field is GREATER THAN 20190112.
So...
We would select ALL of customer 1's accounts
We would select NONE of customer 2's accounts
We would select NONE of customer 3's accounts
We would select ALL of customer 4's accounts
We would select ALL of customer 5's accounts
Because there exists at least 1 account for each selected customer whose last cycle date is greater than 20190112.
What's the best approach to achieve this in Hive?

Using MAX as a window function, get the latest last_cycle_dt for each customer and check whether it is greater than the required date.
select account, customer, last_cycle_dt
from (select t.*,
             max(last_cycle_dt) over (partition by customer) as latest_last_cycle_dt
      from tbl t
     ) t
where latest_last_cycle_dt > '20190112'
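As a quick sanity check, the same approach can be reproduced in SQLite (3.25+ supports window functions). This sketch uses the question's sample rows and the cutoff 20190112 that matches the expected output above:

```python
import sqlite3

# Build the sample table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (account TEXT, customer INTEGER, last_cycle_dt TEXT)")
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", [
    ("4839", 1, "20190114"), ("9522", 1, "20190103"), ("1195", 1, "20181227"),
    ("5461", 2, "20190112"), ("1178", 2, "20190108"), ("2229", 2, "20181218"),
    ("8723", 3, "20181227"), ("5692", 3, "20181227"),
    ("0392", 4, "20190113"),
    ("1847", 5, "20190113"), ("0389", 5, "20190112"), ("3281", 5, "20190101"),
    ("2008", 5, "20181222"), ("3948", 5, "20181216"),
])

# Compute each customer's max last_cycle_dt via a window function, then
# keep every account of any customer whose max exceeds the cutoff.
result = conn.execute("""
SELECT account, customer, last_cycle_dt
FROM (SELECT t.*,
             MAX(last_cycle_dt) OVER (PARTITION BY customer) AS latest_last_cycle_dt
      FROM tbl t) t
WHERE latest_last_cycle_dt > '20190112'
""").fetchall()

kept_customers = sorted({r[1] for r in result})
print(kept_customers)  # [1, 4, 5] -- customers 2 and 3 are dropped entirely
print(len(result))     # 9 accounts: all of customers 1, 4 and 5
```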

Related

Code to look at rows and return the values for the highest number for a user

I have a table (below is a sample of what it contains) that shows the user ID plus various milestones and admission stages. I need to look at the highest number in milestone_stage_number for each user and return the values of milestone, admission_stage and milestone_latest_stage. So in the example below, the query should return only one line for user 1, with milestone_stage_number = 4 (the max number for that person), admission_stage = accepted, milestone_latest_stage = emailed and milestone = emailed. In my actual table I have over 12,000 users, but I need the query to return just one row per user with the information for that user's maximum stage number. I hope it is clear what I need to achieve: if user 2 appears five times, the query returns only the row with the highest number in milestone_stage_number, so after running the query I get one row for user 1 and one row for user 2.
my table is called applicants
Person_id|Milestone|admission_stage|milestone_latest_stage|milestone_stage_number
1|Under Review|Accepted|Accepted|2
1|emailed|accepted|emailed|4
1|offered|accepted|accepted|3
1|submitted|reviewed|offered|1
You could use QUALIFY with a window function:
SELECT * FROM applicants
QUALIFY MAX(milestone_stage_number) OVER (PARTITION BY Person_id) = milestone_stage_number
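QUALIFY is Teradata/Snowflake/BigQuery syntax and is not available everywhere; here is a minimal sketch of the equivalent filtered-subquery form, run against the question's sample rows in SQLite:

```python
import sqlite3

# Load the question's sample applicants table.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE applicants (
    person_id INTEGER, milestone TEXT, admission_stage TEXT,
    milestone_latest_stage TEXT, milestone_stage_number INTEGER)""")
conn.executemany("INSERT INTO applicants VALUES (?, ?, ?, ?, ?)", [
    (1, "Under Review", "Accepted", "Accepted", 2),
    (1, "emailed", "accepted", "emailed", 4),
    (1, "offered", "accepted", "accepted", 3),
    (1, "submitted", "reviewed", "offered", 1),
])

# Without QUALIFY, compute the per-person max in a subquery and filter on it.
rows = conn.execute("""
SELECT person_id, milestone, admission_stage, milestone_latest_stage,
       milestone_stage_number
FROM (SELECT a.*,
             MAX(milestone_stage_number) OVER (PARTITION BY person_id) AS max_stage
      FROM applicants a)
WHERE milestone_stage_number = max_stage
""").fetchall()
print(rows)  # [(1, 'emailed', 'accepted', 'emailed', 4)] -- one row per person
```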

COUNT with multiple LEFT joins [duplicate]

This question already has answers here:
Two SQL LEFT JOINS produce incorrect result
I am having some trouble with a COUNT function. The problem is caused by a LEFT JOIN that I am not sure I am doing correctly.
Variables are:
Customer_name (buyer)
Product_code (what the customer buys)
Store (where the customer buys)
The datasets are:
Customer_df (list of customers and product codes of their purchases)
Store1_df (list of product codes per week, for Store 1)
Store2_df (list of product codes per day, for Store 2)
Final output desired:
I would like to have a table with:
col1: Customer_name;
col2: Count of items purchased in store 1;
col3: Count of items purchased in store 2;
Filters: date range
My query looks like this:
SELECT DISTINCT
    C.customer_name,
    C.product_code,
    COUNT(S1.product_code) AS s1_sales,
    COUNT(S2.product_code) AS s2_sales
FROM customer_df C
LEFT JOIN store1_df S1 USING (product_code)
LEFT JOIN store2_df S2 USING (product_code)
GROUP BY
    customer_name, product_code
HAVING
    s1_sales > 0
    OR s2_sales > 0
The output I expect is something like this:
Customer_name|Product_code|Store1_weekly_sales|Store2_weekly_sales
Luigi|120012|4|8
James|100022|6|10
But instead, I get:
Customer_name|Product_code|Store1_weekly_sales|Store2_weekly_sales
Luigi|120012|290|60
James|100022|290|60
It works when I use COUNT(DISTINCT product_code) instead of COUNT(product_code), but I would like to avoid that because I want to be able to aggregate over different timespans (e.g. if I do a count distinct over more than one week of data, I will not get the right numbers).
My hypotheses are:
I am joining the tables in the wrong way
There is a problem when joining two datasets with different time aggregations
What am I doing wrong?
The reason, as Philipxy indicated, is a common one: you are getting a Cartesian result from your joins, which inflates your counts. To simplify, consider a single customer purchasing one item from two stores, with 3 purchase rows in the first store and 5 in the second. Your total count is 3 * 5 = 15, because each row from the first store is joined to every matching row in the second: the first store-1 purchase is joined to store-2 rows 1-5, then the second purchase is joined to store-2 rows 1-5, and so on. By having each store pre-aggregate per customer (and per product, as your desired output requires), each side contributes AT MOST one record per customer per store.
select
c.customer_name,
AllCustProducts.Product_Code,
coalesce( PQStore1.SalesEntries, 0 ) Store1SalesEntries,
coalesce( PQStore2.SalesEntries, 0 ) Store2SalesEntries
from
customer_df c
-- now, we need all possible UNIQUE instances of
-- a given customer and product to prevent duplicates
-- for subsequent queries of sales per customer and store
JOIN
( select distinct customerid, product_code
from store1_df
union
select distinct customerid, product_code
from store2_df ) AllCustProducts
on c.customerid = AllCustProducts.customerid
-- NOW, we can join to a pre-query of sales at store 1
-- by customer id and product code. You may also want to
-- get sum( SalesDollars ) if available, just add respectively
-- to each sub-query below.
LEFT JOIN
( select
s1.customerid,
s1.product_code,
count(*) as SalesEntries
from
store1_df s1
group by
s1.customerid,
s1.product_code ) PQStore1
on AllCustProducts.customerid = PQStore1.customerid
AND AllCustProducts.product_code = PQStore1.product_code
-- now, same pre-aggregation to store 2
LEFT JOIN
( select
s2.customerid,
s2.product_code,
count(*) as SalesEntries
from
store2_df s2
group by
s2.customerid,
s2.product_code ) PQStore2
on AllCustProducts.customerid = PQStore2.customerid
AND AllCustProducts.product_code = PQStore2.product_code
No need for a GROUP BY or HAVING, since each pre-aggregate already returns at most one record per unique combination. As for your need to filter by date ranges: just add a WHERE clause inside each of the AllCustProducts, PQStore1, and PQStore2 subqueries.
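A minimal sketch of the fan-out and the fix, using SQLite and a single hypothetical product code with three store-1 rows and five store-2 rows:

```python
import sqlite3

# One product: 3 purchase rows in store 1, 5 in store 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store1_df (product_code INTEGER);
CREATE TABLE store2_df (product_code INTEGER);
INSERT INTO store1_df VALUES (100), (100), (100);
INSERT INTO store2_df VALUES (100), (100), (100), (100), (100);
""")

# Naive double join: every store-1 row pairs with every store-2 row,
# so both counts are inflated to 3 * 5 = 15.
naive = conn.execute("""
SELECT COUNT(s1.product_code), COUNT(s2.product_code)
FROM store1_df s1 LEFT JOIN store2_df s2 USING (product_code)
""").fetchone()
print(naive)  # (15, 15)

# Pre-aggregating each store first yields one row per product per store,
# so the join can no longer multiply the counts.
fixed = conn.execute("""
SELECT pq1.n, pq2.n
FROM (SELECT product_code, COUNT(*) AS n FROM store1_df GROUP BY product_code) pq1
JOIN (SELECT product_code, COUNT(*) AS n FROM store2_df GROUP BY product_code) pq2
     USING (product_code)
""").fetchone()
print(fixed)  # (3, 5)
```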

Update using loop taking too long

I'm trying to update a table inside a loop, and it is taking too long. I need help making this more efficient.
A little background on the problem and the approach being used -
I have the following table,
Gift_Earned_Used:
customer|earned_id|earned_day|earned_type|used_id|used_day|used_type
6832|1234|'01-JAN-19'|Free Pizza|null|null|null
6832|1771|'03-JAN-19'|Free Pizza|null|null|null
6506|1901|'07-JAN-19'|Free Coffee|null|null|null
The table currently has 33 million rows, with NULLs for used_id, used_day and used_type. It contains all the customers that have earned a gift of any type (free pizza, free coffee, free bread), along with the respective transaction id (earned_id) and transaction day (earned_day).
The other table,
Gift_Used:
customer|used_id|used_day|used_type|ear_pos_earned_day
6832|1339|'31-DEC-18'|Free Pizza|'02-DEC-18'
6832|1821|'03-JAN-19'|Free Pizza|'04-DEC-18'
6506|2454|'07-JAN-19'|Free Coffee|'08-JAN-19'
currently has 19 million rows.
The problem is that when a customer uses a gift, there is no way to tie that particular used gift to an earned gift; the earned_id and used_id are merely transaction ids. In an effort to make that tie, we are assuming a first-in, first-out approach.
In this case that means the first used gift ties to the first earned gift matching on customer and gift type. There is also a need to ensure that the used_day is not less than the earned_day (you simply cannot use a gift you haven't already earned). More specifically, the earned_day has to be between the ear_pos_earned_day and the used_day.
To achieve that, I am looping over the Gift_Used table to update the nulls in the Gift_Earned_Used table where there is a match, such that my table Gift_Earned_Used after the update would look like:
customer|earned_id|earned_day|earned_type|used_id|used_day|used_type
6832|1234|'01-JAN-19'|Free Pizza|1821|'03-JAN-19'|Free Pizza
6832|1771|'03-JAN-19'|Free Pizza|null|null|null
6506|1901|'07-JAN-19'|Free Coffee|2454|'07-JAN-19'|Free Coffee
I took several use cases into consideration, and I am able to achieve what I want through my code.
DECLARE
var_earned_id NUMBER;
--looping through all the customers in the gift_used table
--and ordering it by used_day, used_id such that if there
--are two used gifts of the same type, the one with the lesser
--transaction id gets assigned first
BEGIN
FOR v_used IN
(
SELECT /*+PARALLEL(8)*/
Customer
,Used_Type
,Used_Id
,Used_Day
,ear_pos_earned_day
FROM
gift_used
ORDER BY
Customer,Used_Day,Used_Id
)
LOOP
BEGIN
--this is the part where i am getting the earned_id that matches
--the criteria. If more than one earned_id matches the criteria
--, the top one is picked (one with lesser transaction id)
SELECT Earned_Id INTO Var_Earned_Id FROM
(
SELECT Earned_Id FROM gift_earned_used
WHERE 1=1
AND Customer = v_used.Customer
AND Earned_Type = v_used.Used_Type
AND Used_Id IS NULL
AND Earned_Day BETWEEN v_used.ear_pos_earned_day AND v_used.used_day ORDER BY Earned_Day,Earned_Id
)
WHERE ROWNUM=1
;
--for the earned_id picked above that matched the criteria
--the values in the used_id and used_day are updated from loop
UPDATE /*+PARALLEL(8)*/ gift_earned_used u
SET u.used_id = v_used.Used_Id
,u.used_day = v_used.used_day
WHERE 1=1
AND u.earned_id = Var_Earned_Id
;
EXCEPTION
WHEN NO_DATA_FOUND THEN
Var_Earned_Id := 0;
END;
END LOOP;
COMMIT;
END;
I am able to achieve the desired output as shown above. I tried several ways of doing it, but could only implement this logic with a loop construct.
I tried it on small data sets and it seems to work fine. But when I run it for the entire data set -- 33 million rows in gift_earned_used to be updated from gift_used (19 million rows) where there is a match -- it just never stops; it takes too long.
I really need suggestions on how I can make this more efficient.
This addresses the original version of the question.
You can write a query to get the used_id for each earned by interleaving the rows and using window functions.
The idea is to assign a grouping using a cumulative count of earned/redeemed for each customer/type and then use that to assign the used_id. This is tricky, because the cumulative count is a sum that ignores the current row for the redemption (it needs to be associated with the most recent earned value).
with eu as (
      select earned_id, customer, earned_day as dt, earned_type as type, null as used_id, 1 as earned
      from gift_earned_used geu
      union all
      select null, customer, used_day as dt, used_type as type, used_id, -1 as earned
      from gift_used gu
     ),
     eu2 as (
      select eu.*,
             (sum(earned) over (partition by customer, type
                                order by dt
                               ) -
              greatest(earned, 0)  -- ignore current row for redemptions
             ) as earned_grouping
      from eu
     )
select eu2.*
from (select eu2.*,
             lead(used_id ignore nulls) over (partition by customer, type, earned_grouping order by dt) as new_used_id
      from eu2
     ) eu2
where used_id is null; -- only select the earned rows
When you have verified that this works, you have two approaches:
Use merge to update the original table.
Join in the additional columns you want and replace the original table.
I would use the second method, because updating essentially every row in the table can be quite expensive.
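The first-in-first-out matching rule itself can be sketched in plain Python. The rows below are hypothetical, loosely based on the question's sample, with ear_pos_earned_day values chosen so each intended match window is valid:

```python
from collections import defaultdict, deque

# FIFO matching sketch: each used gift is assigned to the oldest unmatched
# earned gift of the same customer and gift type whose earned_day falls in
# [ear_pos_earned_day, used_day]. ISO date strings are used so that
# lexicographic order equals date order.
earned = [  # (earned_id, customer, earned_day, earned_type)
    (1234, 6832, "2019-01-01", "Free Pizza"),
    (1771, 6832, "2019-01-03", "Free Pizza"),
    (1901, 6506, "2019-01-07", "Free Coffee"),
]
used = [  # (used_id, customer, used_day, used_type, ear_pos_earned_day)
    (1339, 6832, "2018-12-31", "Free Pizza", "2018-12-02"),
    (1821, 6832, "2019-01-03", "Free Pizza", "2018-12-04"),
    (2454, 6506, "2019-01-07", "Free Coffee", "2019-01-06"),
]

# Queue the earned gifts per (customer, type), oldest first.
queues = defaultdict(deque)
for eid, cust, eday, etype in sorted(earned, key=lambda r: (r[2], r[0])):
    queues[(cust, etype)].append((eid, eday))

matches = {}  # earned_id -> used_id
# Process used gifts in (used_day, used_id) order, like the cursor loop does.
for uid, cust, uday, utype, ear_pos in sorted(used, key=lambda r: (r[2], r[0])):
    q = queues[(cust, utype)]
    for i, (eid, eday) in enumerate(q):
        if ear_pos <= eday <= uday:  # earned within the allowed window
            matches[eid] = uid
            del q[i]  # each earned gift can be used only once
            break

print(matches)  # {1234: 1821, 1901: 2454}; used_id 1339 finds no match
```

The set-based SQL above computes the same assignment in one pass, which is why it scales so much better than a row-by-row cursor loop.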

SQL - find prior string value

I have a DB which 'tracks' the customer shopping journey. What I want to do is recall the previous value if their final destination or 'shop' is a particular value.
For example say the shops are named like this:
Shop 1
Shop 2
Shop 3
Shop 4
If my select query returns Shop 4 (for any customer), then I want the extra column to show the previous shop they last shopped at. There is no natural order to my data, so I can't literally state that Shop 4 = Shop 3; it just needs to return whatever shop they last shopped at if the last one is Shop 4 (their previous shop could be any 'shop').
This is what I have so far, but it's probably way off the mark. I have a date column in my table but don't know how to use it this way.
Select ...
case
when TableShop.ShopName LIKE 'Shop4' then
cast(TableShop.ShopName -1 AS nvarchar(50))
end
From ...
Presumably, you have some column that specifies the ordering of the visits -- say a visitDatetime column.
Then, you can use the ANSI standard LAG() function:
select s.*,
(case when s.shopName = 'Shop4'
then lag(s.shopName) over (partition by customerId order by visitDateTime)
end) as prev_ShopName
from tableshop s;
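A minimal sketch of the LAG() approach in SQLite (3.25+); the customerId and visitDateTime columns are the answer's assumptions, and the rows are hypothetical:

```python
import sqlite3

# Hypothetical shop-visit history for two customers.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tableshop (
    customerId INTEGER, shopName TEXT, visitDateTime TEXT)""")
conn.executemany("INSERT INTO tableshop VALUES (?, ?, ?)", [
    (1, "Shop 2", "2019-01-01"),
    (1, "Shop 4", "2019-01-05"),
    (2, "Shop 1", "2019-01-02"),
    (2, "Shop 3", "2019-01-06"),
])

# LAG() looks one visit back per customer, ordered by visit time; the CASE
# only surfaces the previous shop when the current shop is 'Shop 4'.
rows = conn.execute("""
SELECT s.customerId, s.shopName,
       CASE WHEN s.shopName = 'Shop 4'
            THEN LAG(s.shopName) OVER (PARTITION BY s.customerId
                                       ORDER BY s.visitDateTime)
       END AS prev_shopName
FROM tableshop s
ORDER BY s.customerId, s.visitDateTime
""").fetchall()
print(rows)
# [(1, 'Shop 2', None), (1, 'Shop 4', 'Shop 2'),
#  (2, 'Shop 1', None), (2, 'Shop 3', None)]
```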

counting and numbering in a select statement in Access SQL

Could you please help me figure out how to accomplish the following.
I have a table containing the number of products available between one date and another as per below:
TABLE MyProducts
DateProduct ProductId Quantity Price
26/02/2016 7 2 100
27/02/2016 7 3 100
28/02/2016 7 4 100
I have created a form where users need to select a date range and the number of products they are looking for (in my example, the number of products is going to be 1).
In this example, let's say that a user makes the following selection:
SELECT SUM(MyProducts.Price) As TotalPrice
FROM MyProducts WHERE MyProducts.DateProduct
Between #2/26/2016# And #2/29/2016#-1 AND MyProducts.Quantity>=1
Now the user can see the total amount that 1 product costs: 300
For this date range, however, I also want to allow users to select from a combobox the number of products that they can still buy: if you look at the Quantity for this date range, a user can only buy a maximum of 2 products, because 2 is the lowest quantity available in common across all the dates listed in the query.
First question: how can I feed the combobox with a "1 to 2" list (in this case), considering that 2 is the lowest quantity available in common for all the dates queried by this user?
Second question: how can I manage the products that a user has purchased?
Let's say that a user has purchased 1 product within this date range, and a second user has purchased the same quantity (1) for the very same date range, for a total of 2 products already purchased in this date range. How can I see that, for this date range and in this case, the number of products actually available is:
DateProduct ProductId Quantity Price
26/02/2016 7 0 100
27/02/2016 7 1 100
28/02/2016 7 2 100
Thank you in advance and please let me know should you need further information.
You could create a table with an integer field counting from 1 up to whatever maximum quantity you expect. Then create a query that returns rows from that table only up to the Min() quantity in the MyProducts table, and use that query as the control source of your combobox.
EDIT: You will actually need two queries. The first should be:
SELECT Min(MyProducts.Quantity) AS MinQty FROM MyProducts;
which I called "qryMinimumProductQty". I created a table called "Numbering" with a single integer field called "Sequence". The second query:
SELECT Numbering.Sequence FROM Numbering, qryMinimumProductQty WHERE Numbering.Sequence<=qryMinimumProductQty.MinQty;
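A sketch of how the two queries combine, using SQLite in place of Access (the Numbering table is pre-filled with 1 through 5, and the inner subquery stands in for qryMinimumProductQty):

```python
import sqlite3

# The question's MyProducts rows plus a small Numbering table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MyProducts (DateProduct TEXT, ProductId INTEGER,
                         Quantity INTEGER, Price INTEGER);
INSERT INTO MyProducts VALUES
    ('2016-02-26', 7, 2, 100),
    ('2016-02-27', 7, 3, 100),
    ('2016-02-28', 7, 4, 100);
CREATE TABLE Numbering (Sequence INTEGER);
INSERT INTO Numbering VALUES (1), (2), (3), (4), (5);
""")

# Cross-join Numbering against the minimum quantity and keep only the
# sequence values up to that minimum -- the combobox's "1 to 2" list.
seq = conn.execute("""
SELECT Numbering.Sequence
FROM Numbering, (SELECT MIN(Quantity) AS MinQty FROM MyProducts) q
WHERE Numbering.Sequence <= q.MinQty
ORDER BY Numbering.Sequence
""").fetchall()
print(seq)  # [(1,), (2,)]
```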
AFAIK there is no Access function/feature that will fill a series of numbers into a combobox control source; you have to build the control source yourself. (Someone with more VBA experience might have a solution, but I do not.)
It makes me ache to think of an entire table with a single integer column used only for a combobox, though. A simpler approach would be to show the quantity available in a control on your form, give the user an unbound text box to enter their order quantity, and add a validation rule that stops the order and notifies them if they choose a number greater than the quantity on hand. (Just a thought.)
As for your second question, I don't really understand what you're looking for either. It sounds like there may be another table of purchases? It should be a simple query to relate MyProducts to Purchases and take the difference between MyProducts!Quantity and the Purchases quantity. If you don't have a table to store purchases, one might be warranted, based on my cursory understanding of your system.