Code to look at rows and return the values for the highest number for a user - google-bigquery

I have a table (a sample of its contents is below) that shows the user ID plus various milestones and admission stages. I need to find the highest number in milestone_stage_number for each user and return the values of milestone, admission_stage and milestone_latest_stage from that row. In the example below, the query should return only one row for user ID 1, with milestone_stage_number = 4 (the maximum for that person), admission_stage = accepted, milestone_latest_stage = emailed and milestone = emailed. My actual table has over 12000 users, but I need the query to return just one row per user, with the information for that user's maximum stage number. So if user 2 appears five times, the query should return only the row with the highest milestone_stage_number, and after running the query I get one row for user 1 and one row for user 2.
My table is called applicants.
Person_id  Milestone     admission_stage  milestone_latest_stage  milestone_stage_number
1          Under Review  Accepted         Accepted                2
1          emailed       accepted         emailed                 4
1          offered       accepted         accepted                3
1          submitted     reviewed         offered                 1

You could use QUALIFY with a window function:
SELECT * FROM applicants
QUALIFY MAX(milestone_stage_number) OVER (PARTITION BY Person_id) = milestone_stage_number
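
To try the idea without a BigQuery project, here is a minimal sketch that runs the same logic in SQLite from Python. SQLite has no QUALIFY, so the window function sits in a subquery and the filter moves to an outer WHERE; this is an equivalent rewrite, not BigQuery syntax. The rows for person 2 are made up for illustration, and window functions need SQLite 3.25 or later.

```python
import sqlite3

# In-memory stand-in for the applicants table. Person 1's rows come from the
# question; the rows for person 2 are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE applicants ("
    "Person_id INTEGER, Milestone TEXT, admission_stage TEXT, "
    "milestone_latest_stage TEXT, milestone_stage_number INTEGER)"
)
conn.executemany(
    "INSERT INTO applicants VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Under Review", "Accepted", "Accepted", 2),
        (1, "emailed", "accepted", "emailed", 4),
        (1, "offered", "accepted", "accepted", 3),
        (1, "submitted", "reviewed", "offered", 1),
        (2, "submitted", "reviewed", "offered", 1),  # hypothetical row
        (2, "offered", "accepted", "accepted", 3),   # hypothetical row
    ],
)

# Compute each person's maximum stage with a window function, then keep only
# the rows whose stage equals that maximum.
query = """
SELECT Person_id, Milestone, admission_stage,
       milestone_latest_stage, milestone_stage_number
FROM (
    SELECT a.*,
           MAX(milestone_stage_number) OVER (PARTITION BY Person_id) AS max_stage
    FROM applicants AS a
)
WHERE milestone_stage_number = max_stage
ORDER BY Person_id
"""
top_rows = conn.execute(query).fetchall()
for row in top_rows:
    print(row)
```

Note that comparing against MAX() keeps every row that ties for the maximum. If a person can have two rows with the same top stage number and you need exactly one row, filtering on ROW_NUMBER() OVER (PARTITION BY Person_id ORDER BY milestone_stage_number DESC) = 1 instead picks a single row per person.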

Related

HIVEQL select all accounts for a given customer who has AT LEAST ONE account that satisfies a certain criteria

I have a large dataset that contains credit card accounts. A customer can have multiple credit card accounts, so the account is unique but the customer is not (customer '1234' can have 5 accounts). I want to select a customer's entire account list if any of the accounts satisfies a particular requirement. The requirement looks at the last cycle date (when the account last cycled). Consider this dataset:
account|customer|last_cycle_dt
4839|1|20190114
9522|1|20190103
1195|1|20181227
5461|2|20190112
1178|2|20190108
2229|2|20181218
8723|3|20181227
5692|3|20181227
0392|4|20190113
1847|5|20190113
0389|5|20190112
3281|5|20190101
2008|5|20181222
3948|5|20181216
I have sorted this data in a particular way so that it's easier to read. Perhaps the data needs to be sorted this way to do the extract most efficiently, but I'm not sure.
The criterion in our extract selects all accounts of every customer who has at least 1 account whose last_cycle_dt field is GREATER THAN 20180112.
So...
We would select ALL of customer 1's accounts
We would select NONE of customer 2's accounts
We would select NONE of customer 3's accounts
We would select ALL of customer 4's accounts
We would select ALL of customer 5's accounts
because there exists at least 1 account for that customer whose last cycle date is greater than 20180112.
What's the best approach to achieve this in Hive?
Using max as a window function, get the latest last_cycle_dt for each customer and check whether it is greater than the required date.
select account, customer, last_cycle_dt
from (select t.*, max(last_cycle_dt) over (partition by customer) as latest_last_cycle_dt
from tbl t
) t
where latest_last_cycle_dt > '20180112'
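
As a rough check of this approach, the sketch below loads the question's sample rows into SQLite from Python and runs the same windowed MAX. One assumption to flag: the expected selection in the question (customers 1, 4 and 5, but not 2 or 3) only comes out with a cutoff of 20190112, not 20180112 as written, so the sketch uses 20190112. Dates are stored as yyyymmdd text, which compares correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (account TEXT, customer INTEGER, last_cycle_dt TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        ("4839", 1, "20190114"), ("9522", 1, "20190103"), ("1195", 1, "20181227"),
        ("5461", 2, "20190112"), ("1178", 2, "20190108"), ("2229", 2, "20181218"),
        ("8723", 3, "20181227"), ("5692", 3, "20181227"),
        ("0392", 4, "20190113"),
        ("1847", 5, "20190113"), ("0389", 5, "20190112"), ("3281", 5, "20190101"),
        ("2008", 5, "20181222"), ("3948", 5, "20181216"),
    ],
)

# Assumed cutoff: the value that reproduces the question's expected selection.
CUTOFF = "20190112"

# Attach each customer's latest cycle date to every one of their rows, then
# keep all rows of customers whose latest date is past the cutoff.
query = """
SELECT account, customer, last_cycle_dt
FROM (
    SELECT t.*,
           MAX(last_cycle_dt) OVER (PARTITION BY customer) AS latest_last_cycle_dt
    FROM tbl AS t
)
WHERE latest_last_cycle_dt > ?
ORDER BY customer, last_cycle_dt DESC
"""
selected = conn.execute(query, (CUTOFF,)).fetchall()
selected_customers = sorted({customer for _, customer, _ in selected})
print(selected_customers)
```

Every account of a qualifying customer is returned, including accounts whose own last_cycle_dt is older than the cutoff, which is the point of comparing against the per-customer maximum rather than the row's own date.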

SQL - find prior string value

I have a DB which 'tracks' the customer shopping journey. What I want to do is recall the previous value if their final destination or 'shop' is a particular value.
For example say the shops are named like this:
Shop 1
Shop 2
Shop 3
Shop 4
If my select query returns Shop 4 (for any customer), then I want the extra column to show the shop they shopped at immediately before. There is no natural order to my data, so I can't simply state that Shop 4 = Shop 3; it just needs to return whatever shop they last shopped at if the latest one is Shop 4 (their previous shop could be any 'shop').
This is what I have so far, but it's probably way off the mark. I have a date column in my table but don't know how to use it in this way.
Select ...
case
when TableShop.ShopName LIKE 'Shop4' then
cast(TableShop.ShopName -1 AS nvarchar(50))
end
From ...
Presumably, you have some column that specifies the ordering of the visits -- say a visitDatetime column.
Then, you can use the ANSI standard LAG() function:
select s.*,
(case when s.shopName = 'Shop4'
then lag(s.shopName) over (partition by customerId order by visitDateTime)
end) as prev_ShopName
from tableshop s;
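
Here is a small sketch of the LAG() approach, run in SQLite from Python so it can be tried locally (window functions need SQLite 3.25 or later). The table contents are invented for illustration, customerId and visitDateTime are the assumed column names from the answer, and the shop is written as 'Shop 4' to match the names listed in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tableshop (customerId INTEGER, shopName TEXT, visitDateTime TEXT)"
)
# Hypothetical visits; ISO-formatted timestamps sort correctly as text.
conn.executemany(
    "INSERT INTO tableshop VALUES (?, ?, ?)",
    [
        (1, "Shop 2", "2019-01-01 09:00"),
        (1, "Shop 1", "2019-01-02 10:30"),
        (1, "Shop 4", "2019-01-03 14:15"),
        (2, "Shop 3", "2019-01-01 11:00"),
        (2, "Shop 4", "2019-01-05 16:45"),
        (3, "Shop 4", "2019-01-02 08:20"),  # first visit, so no previous shop
        (3, "Shop 1", "2019-01-04 12:00"),
    ],
)

# LAG() looks one row back within each customer's visits ordered by time;
# the CASE only exposes that value on rows where the shop is 'Shop 4'.
query = """
SELECT s.customerId, s.shopName, s.visitDateTime,
       CASE WHEN s.shopName = 'Shop 4'
            THEN LAG(s.shopName) OVER (PARTITION BY s.customerId
                                       ORDER BY s.visitDateTime)
       END AS prev_ShopName
FROM tableshop AS s
ORDER BY s.customerId, s.visitDateTime
"""
lag_rows = conn.execute(query).fetchall()
for row in lag_rows:
    print(row)
```

prev_ShopName is NULL (None in Python) on rows for other shops, and also on a 'Shop 4' row that is the customer's first visit, since there is nothing to look back to.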

In an SSRS report builder expression, I am trying to get the sum of a conditional count

I want the sum of a count IF the count is >= 3. This gives me a sum of all the counts, regardless of whether they are >= 3:
=Sum(Iif(CountDistinct(Fields!ENCOUNTER.Value)>=3,1,0))
This produces the same result, the total number of distinct encounters:
=Sum(Iif(CountDistinct(Fields!ENCOUNTER.Value)>=3,CountDistinct(Fields!ENCOUNTER.Value),Nothing))
I want the total number of distinct encounters if there are 3 or more per person. I am grouping on person first, then encounter id.
Ex:
Person Enc
John 1
Bob 4
Sue 2
Ann 3
Total Enc>=3: 2
Based on your requirement, if there are no detail rows under ENCOUNTER, you should compare Fields!ENCOUNTER.Value directly instead of using CountDistinct():
=Sum(IIF(Fields!ENCOUNTER.Value>=3,1,0))
If you have multiple detail rows under the ENCOUNTER group level, your requirement can't be achieved, because we can't use an aggregate function within an aggregate function. That means we can't get the distinct ENCOUNTER IDs first and then calculate the total.
I found a workaround: I created another query that selects only those people with 3 or more encounters and added it to the report as a subreport.
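
The workaround query itself is not shown, so the following is only a guess at its shape: a GROUP BY with a HAVING filter on the distinct encounter count, run here in SQLite from Python. The table name encounters, its columns and the encounter IDs are all hypothetical; the per-person counts (John 1, Bob 4, Sue 2, Ann 3) follow the example in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE encounters (person TEXT, encounter INTEGER)")
# Hypothetical detail rows; Ann's encounter 403 appears twice to show that
# COUNT(DISTINCT ...) ignores repeated detail rows.
detail_rows = [
    ("John", 101),
    ("Bob", 201), ("Bob", 202), ("Bob", 203), ("Bob", 204),
    ("Sue", 301), ("Sue", 302),
    ("Ann", 401), ("Ann", 402), ("Ann", 403), ("Ann", 403),
]
conn.executemany("INSERT INTO encounters VALUES (?, ?)", detail_rows)

# Aggregate per person first, then filter on the aggregate with HAVING.
query = """
SELECT person, COUNT(DISTINCT encounter) AS enc
FROM encounters
GROUP BY person
HAVING COUNT(DISTINCT encounter) >= 3
ORDER BY person
"""
qualifying = conn.execute(query).fetchall()

people_with_3_plus = len(qualifying)                   # "Total Enc>=3" in the example
total_encounters = sum(enc for _, enc in qualifying)   # encounters among those people
print(qualifying, people_with_3_plus, total_encounters)
```

Doing the per-person count in the dataset query means the report only has to count or sum plain field values, which avoids nesting one aggregate inside another in the expression.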

Row blocks in Hive (how to group rows by certain criteria and count these groups)

Here is a sample of the data I have:
Date_key UserID
20140401 a
20140402 a
20140406 a
20140407 a
20140408 a
20140409 a
20140404 b
20140408 b
20140409 b
20140414 b
20140415 b
... ...
Each row is a (Date, User ID) pair, which indicates that the user was active on that day. A user can appear on multiple dates and a date can have multiple users, just like in the example.
I want to get the number of consecutive-day groups (i.e. blocks of activity). For example, this value for 'User a' would be 2: they were active on 20140401 and 20140402, which is the first group of consecutive days. After 20140402, they waited for a while before becoming active again (i.e. they were not active the following day). On 20140406, their second block of activity started and continued without any break until 20140409. For 'User b', this value would be 3 because they have been active during three consecutive-day periods: 1) 20140404; 2) 20140408, 20140409; 3) 20140414, 20140415.
I use Hive. I am not sure if this is possible in Hive, but if the data needs to be carried over to a RDBMS to perform this task, I can do that too. Your recommendations are greatly appreciated. Thank you!
Cheers
When you use the distribute by clause, i.e. ... distribute by user_id sort by user_id, date_key desc ..., all the records for a particular user go to the same reducer, where they are sorted by date_key in descending order. You could then write a UDF that iterates through the records, increments a counter whenever there is a break in the continuity, and returns the result along with the user_id.
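
The iteration such a UDF would perform can be sketched in plain Python. This is not a Hive UDF, only the counting logic, under the assumption that date_key is a yyyymmdd string; the function name count_activity_blocks is made up for the example. It sorts ascending rather than descending, which gives the same count.

```python
from datetime import datetime, timedelta
from itertools import groupby


def count_activity_blocks(rows):
    """Return {user_id: number of blocks of consecutive active days}.

    rows is an iterable of (date_key, user_id) pairs, date_key as 'yyyymmdd'.
    """
    blocks = {}
    # Group each user's records together and order them by date, similar to
    # "distribute by user_id sort by user_id, date_key". Duplicates are dropped.
    ordered = sorted(set(rows), key=lambda r: (r[1], r[0]))
    for user_id, group in groupby(ordered, key=lambda r: r[1]):
        previous_day = None
        count = 0
        for date_key, _ in group:
            day = datetime.strptime(date_key, "%Y%m%d").date()
            # A new block starts on the first record or after a gap of more
            # than one day since the previous active day.
            if previous_day is None or day - previous_day > timedelta(days=1):
                count += 1
            previous_day = day
        blocks[user_id] = count
    return blocks


sample = [
    ("20140401", "a"), ("20140402", "a"), ("20140406", "a"),
    ("20140407", "a"), ("20140408", "a"), ("20140409", "a"),
    ("20140404", "b"), ("20140408", "b"), ("20140409", "b"),
    ("20140414", "b"), ("20140415", "b"),
]
block_counts = count_activity_blocks(sample)
print(block_counts)
```

Parsing the keys into real dates matters at month boundaries: 20140430 and 20140501 are consecutive days even though the integers differ by 71.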

MDX query to count number of rows that match a certain condition (newest row for each question, client group)

I have the following fact table:
response_history_id client_id question_id answer
1 1 2 24
2 1 2 27
3 1 3 12
4 1 2 43
5 2 2 39
It holds the history of client answers to some questions. The largest response_history_id for each client_id, question_id combination is the latest answer for that question and client.
What I want to do is count the number of clients whose latest answer falls within a specific range.
I have some dimensions:
question associated with question_id
client associated with client_id
response_history_id associated with response_history_id
range associated with answer: 0-20 = low, 20-40 = medium, >40 = high
and some measures:
max_history_id as max(response_history_id)
clients_count as distinct count(client_id)
Now, I want to group only the latest answers by range:
select
[ranges].members on 0,
{[Measures].[clients_count]} on 1
from (select [question].[All].[2] on 1 from [Cube])
What I get is:
Measures All low medium high
clients_count 2 0 2 1
But what I wanted (and I can't get) is the calculation based on the latest answer:
Measures All low medium high
clients_count 2 0 1 1
I understand why my query doesn't give me the desired result; it's more for demonstration purposes. But I have tried a lot of more complex MDX queries and still couldn't get a good result.
Also, I can't generate a static view from my fact table, because later on I would like to limit the search by another column in the fact table, a timestamp. My queries must eventually be able to get the number of clients whose latest answer to a question before a given timestamp falls within a specific range.
Can anyone help me with this, please?
I can define other dimensions and measures, and I am using icCube.
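
To pin down the expected numbers, here is the intended calculation sketched in plain Python rather than MDX; it is a reference for checking results, not a cube solution. The range boundaries are an assumption (low below 20, medium from 20 to 40 inclusive, high above 40), since the question leaves the edges ambiguous. The sample has no timestamp column, so the optional cutoff on response_history_id is only a stand-in for the "before a given timestamp" filter.

```python
# (response_history_id, client_id, question_id, answer), from the question
fact_rows = [
    (1, 1, 2, 24),
    (2, 1, 2, 27),
    (3, 1, 3, 12),
    (4, 1, 2, 43),
    (5, 2, 2, 39),
]


def answer_range(answer):
    """Map an answer to a range name (boundary handling is assumed)."""
    if answer < 20:
        return "low"
    if answer <= 40:
        return "medium"
    return "high"


def latest_answer_counts(rows, question_id, max_history_id=None):
    """Count clients per range, using only each client's latest answer."""
    latest = {}  # client_id -> (response_history_id, answer)
    for history_id, client_id, q_id, answer in rows:
        if q_id != question_id:
            continue
        # Stand-in for the timestamp filter: ignore rows after the cutoff.
        if max_history_id is not None and history_id > max_history_id:
            continue
        if client_id not in latest or history_id > latest[client_id][0]:
            latest[client_id] = (history_id, answer)

    counts = {"All": len(latest), "low": 0, "medium": 0, "high": 0}
    for _, answer in latest.values():
        counts[answer_range(answer)] += 1
    return counts


counts_q2 = latest_answer_counts(fact_rows, question_id=2)
print(counts_q2)
```

With the cutoff applied before picking the latest row, an earlier answer can become the "latest" one again, which is why a static view of latest answers would not be enough.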