Find the highest (max) date, then find the highest value from the results of that query - SQL

Here is a table called packages:
id | packages_sent | date | sent_order
1 | 10 | 2017-02-11 | 1
2 | 25 | 2017-03-15 | 1
3 | 5 | 2017-04-08 | 1
4 | 20 | 2017-05-21 | 1
5 | 25 | 2017-05-21 | 2
6 | 5 | 2017-06-19 | 1
This table shows the number of packages sent on a given date; if there were multiple packages sent on the same date (as is the case with rows 4 and 5), then the sent_order keeps track of the order in which they were sent.
I am trying to make a query that will return sum(packages_sent) given the following conditions: first, return the row with the max(date) (given some provided date), and second, if there are multiple rows with the same max(date), return the row with the max(sent_order) (the highest sent_order value).
Here is the query I have so far:
SELECT sum(packages_sent)
FROM packages
WHERE date IN
(SELECT max(date)
FROM packages
WHERE date <= '2017-05-29');
This query correctly finds the max date, which is 2017-05-21, but then for the sum it returns 45 because it is adding rows 4 and 5 together.
I want the query to return the max(date), and if there are multiple rows with the same max(date), then return the row with the max(sent_order). Using the example above with the date 2017-05-29, it should only return 25.

I don't see where a sum() comes into play. You seem to only want the last row:
select p.*
from packages p
order by date desc, sent_order desc
fetch first 1 row only;
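If you also need the "given some date provided" condition from the question, a minimal sketch (assuming the cutoff is '2017-05-29', as in the example) just adds a filter before ordering:
select p.*
from packages p
where p.date <= '2017-05-29'
order by p.date desc, p.sent_order desc
fetch first 1 row only;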

If your data is truly ordered ascending, as you show it, then it's easier to use the surrogate key ID field.
SELECT packages_sent
FROM packages
WHERE ID =
(SELECT max(ID)
FROM packages
WHERE date <= '2017-05-29');
Since the ID always increases with date and sent order, finding the max ID also finds the max of the other two in one step.

Related

Running sum of unique users in redshift

I have a table as follows with user visits by day -
| date | user_id |
|:-------- |:-------- |
| 01/31/23 | a |
| 01/31/23 | a |
| 01/31/23 | b |
| 01/30/23 | c |
| 01/30/23 | a |
| 01/29/23 | c |
| 01/28/23 | d |
| 01/28/23 | e |
| 01/01/23 | a |
| 12/31/22 | c |
I am looking to get a running total of unique user_ids for the last 30 days. Here is the expected output -
| date | distinct_users|
|:-------- |:-------- |
| 01/31/23 | 5 |
| 01/30/23 | 4 |
.
.
.
Here is the query I tried -
SELECT date
, SUM(COUNT(DISTINCT user_id)) over (order by date rows between 30 preceding and current row) AS unique_users
FROM mytable
GROUP BY date
ORDER BY date DESC
The problem I am running into is that this query is not counting unique user_ids - for instance, the result I am getting for 01/31/23 is 9 instead of 5, because it counts user_id 'a' every time it occurs.
Thank you, appreciate your help!
Not the most performant approach, but you could use a correlated subquery to find the distinct count of users over a window of the past 30 days:
SELECT DISTINCT
t1.date,
-- the subquery yields the same value for every row sharing t1.date, so DISTINCT leaves one row per date
(SELECT COUNT(DISTINCT t2.user_id)
FROM mytable t2
WHERE t2.date BETWEEN t1.date - INTERVAL '30 day' AND t1.date) AS distinct_users
FROM mytable t1
ORDER BY t1.date;
There are a few things going on here. First, window functions run after GROUP BY and aggregation, so COUNT(DISTINCT user_id) gives the count of user_ids for each date and only then does the window function run. Also, a window function set up like this works over the past 30 rows, not 30 days, so you would need to fill in missing dates to use it.
As to how to do this - I can only think of the "expand the data so each date and id has a row" method. This will require a CTE to generate the last 2 years of dates plus 30 days, so that the look-back window works for the first dates. Then window over the past 30 days for each user_id and date to see which rows have an occurrence of this user_id within the past 30 days, setting the value to NULL if no uses of the user_id are present within the window. Then count the non-NULL user_ids, grouping by just date, to get the number of unique user_ids for that date.
This means expanding the data significantly, but I see no other way to get truly unique user_ids over the past 30 days. I can help code this up if you need, but it will look something like the outline below (a rough sketch follows it):
WITH RECURSIVE CTE to generate the needed dates,
CTE to cross join these dates with a distinct set of all the user_ids in use for the past 2 years,
CTE to join the date/user_id data set with the table of real data for the past 2 years and 30 days and window back counting non-NULL user_ids, partitioned by user_id and ordered by date, setting any zero counts to NULL with a DECODE() or CASE statement,
SELECT, grouping by just date, count the user_ids by date;
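Not part of the original answer, but here is a rough, untested sketch of that outline in Redshift-flavored SQL. It assumes the table and columns from the question (mytable, date, user_id) and plugs in example range boundaries; adjust both dates to your real window:
WITH RECURSIVE dates (dt) AS (
    -- assumed start of the 2-year range, pushed back 30 extra days for the look-back
    SELECT '2021-01-01'::date - 30
    UNION ALL
    SELECT dt + 1
    FROM dates
    WHERE dt < '2023-01-31'::date   -- assumed end of the range
),
ids AS (
    SELECT DISTINCT user_id
    FROM mytable
),
expanded AS (
    -- one row per calendar day per user_id
    SELECT d.dt, i.user_id
    FROM dates d
    CROSS JOIN ids i
),
windowed AS (
    SELECT e.dt, e.user_id,
           -- visits by this user in the 30 days ending at e.dt; a ROWS frame is safe
           -- only because the cross join guarantees exactly one row per user per day
           SUM(CASE WHEN v.user_id IS NOT NULL THEN 1 ELSE 0 END)
               OVER (PARTITION BY e.user_id ORDER BY e.dt
                     ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS hits_last_30_days
    FROM expanded e
    LEFT JOIN (SELECT DISTINCT date, user_id FROM mytable) v
           ON v.date = e.dt AND v.user_id = e.user_id
)
SELECT dt, COUNT(CASE WHEN hits_last_30_days > 0 THEN 1 END) AS distinct_users
FROM windowed
GROUP BY dt
ORDER BY dt DESC;
The DECODE()/CASE step mentioned in the outline is folded into the CASE expressions here; the zero-vs-NULL check becomes the hits_last_30_days > 0 test.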

SQL query to get a unique id for a row in Oracle based on its continuity

I have a problem that needs to be solved using SQL in Oracle.
I have a dataset like given below:
value | date
-------------
1 | 01/01/2017
2 | 02/01/2017
3 | 03/01/2017
3 | 04/01/2017
2 | 05/01/2017
2 | 06/01/2017
4 | 07/01/2017
5 | 08/01/2017
I need to show the result in the below format:
value | date | Group
1 | 01/01/2017 | 1
2 | 02/01/2017 | 2
3 | 03/01/2017 | 3
3 | 04/01/2017 | 3
2 | 05/01/2017 | 4
2 | 06/01/2017 | 4
4 | 07/01/2017 | 5
5 | 08/01/2017 | 6
The logic is that whenever the value changes over date, it gets assigned a new group/id, but if it's the same as the previous one, then it's part of the same group.
Here is one method using lag() and cumulative sum:
select t.*,
sum(case when value = prev_value then 0 else 1 end) over (order by date) as grp
from (select t.*,
lag(value) over (order by date) as prev_value
from t
) t;
The logic here is to simply count the number of times that the value changes from one month to the next.
This assumes that date is actually stored as a date and not a string. If it is a string, then the ordering will not be correct. Either convert to a date or use a column that specifies the correct ordering.
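For example, a hypothetical conversion (the column name date_col and the 'MM/DD/YYYY' format are assumptions; as the notes below point out, date itself is a reserved word) would order by the converted value:
lag(value) over (order by to_date(date_col, 'MM/DD/YYYY')) as prev_value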
Here is a solution using the MATCH_RECOGNIZE clause, introduced in Oracle 12.*
select value, dt, mn as grp
from inputs
match_recognize (
order by dt
measures match_number() as mn
all rows per match
pattern ( a b* )
define b as value = prev(value)
)
order by dt -- if needed
;
Here is how this works: Other than SELECT, FROM and ORDER BY, the query has only one clause, MATCH_RECOGNIZE. What this clause does is: it takes the rows from inputs and it orders them by dt. Then it searches for patterns: one row, marked as a, with no constraints, followed by zero or more rows b, where b is defined by the condition that the value is the same as for the prev[ious] row. What the clause calculates or measures is the match_number() - first "match" of the pattern, second match etc. We use this match number as the group number (grp) in the outer query - that's all we needed!
*Notes: The existence of solutions like this shows why it is important for posters to state their Oracle version. (Run the statement select * from v$version to find out.) Also: date and group are reserved words in Oracle and shouldn't be used as column names - not even for posting made-up sample data. (There are workarounds, but they aren't needed in this case.) Also, whenever using dates like 03/01/2017 in a post, please indicate whether that is March 1 or January 3; there's no way for "us" to tell. (It wasn't important in this case, but it is in the vast majority of cases.)

Finding the maximum value within a given interval

Let's say I have a table like so, where the amount is some arbitrary amount of something (like fruit, but we don't care about the type):
row | amount
_______________
1 | 54
2 | 2
3 | 102
4 | 102
5 | 1
And I want to select the rows that have the maximum value within a given interval. For instance, if I only wanted to select from rows 2-5, what would be returned would be:
row | amount
_______________
3 | 102
4 | 102
Because they both contain the max value within the interval, which is 102. Or if I chose to only look at rows 1-2, it would return:
row | amount
_______________
1 | 54
Because the maximum value in the interval 1-2 only exists in row 1
I tried to use a variety of:
amount= (select MAX(amount) FROM arbitraryTable)
But that will only ever return
row | amount
_______________
3 | 102
4 | 102
Because 102 is the absolute max of the table. Can you find the maximum value within a given interval?
I would use rank() or max() as a window function:
select t.row, t.amount
from (select t.*, max(amount) over () as maxamount
from t
where row between 2 and 5
) t
where amount = maxamount;
You can use a subquery to get the max value and use it in WHERE clause:
SELECT
row,
amount
FROM
arbitraryTable
WHERE
row BETWEEN 2 AND 5 AND
amount = (
SELECT
MAX(amount)
FROM
arbitraryTable
WHERE
row BETWEEN 2 AND 5
);
Just remember to use the same conditions in the main and sub query: row BETWEEN 2 AND 5.
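If you would rather not repeat the interval condition, one alternative (a sketch against the same arbitraryTable) is to filter once in a common table expression:
WITH interval_rows AS (
    SELECT row, amount
    FROM arbitraryTable
    WHERE row BETWEEN 2 AND 5
)
SELECT row, amount
FROM interval_rows
WHERE amount = (SELECT MAX(amount) FROM interval_rows);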

SQL: the most effective way to get row number of one element

I have a table of persons:
id | Name | Age
1 | Alex | 18
2 | Peter| 30
3 | Zack | 25
4 | Bim | 30
5 | Ken | 20
And I have the following interval of rows: WHERE ID>1 AND ID<5. I know that in this interval there is a person whose id=3. What is the most efficient (the fastest) way to get its row number in this interval (in my example, rownumber=2)? I don't need any other data - only the row position of the person with id=3 in the interval WHERE ID>1 AND ID<5.
If possible, I would like a general SQL solution rather than a vendor-specific one. If that's not possible, then I need a solution for PostgreSQL and H2.
The row number would be the number of rows between the first row in the interval and the row you're looking for. For interval ID>1 AND ID<5 and target row ID=3, this is:
select count(*)
from YourTable
where id between 2 and 3
For interval ID>314 AND ID<1592 and target row ID=1000, you'd use:
where id between 315 and 1000
To be sure that there is an element with ID=3, use:
select count(*)
from YourTable
where id between 2 and
(
select id
from YourTable
where id = 3
)
This will return 0 if the row doesn't exist.
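Another fairly portable option (a sketch against the same YourTable; ROW_NUMBER() is standard SQL and available in both PostgreSQL and H2) is to number the rows in the interval and keep the one for id=3:
SELECT rn
FROM (
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM YourTable
    WHERE id > 1 AND id < 5
) t
WHERE id = 3;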

Find last (first) instance in table but exclude most recent (oldest) date

I have a table that reflects a monthly census of a certain population. Each month, on an unpredictable day early in that month, the population is polled. Any member who existed at that point is included in that month's poll; any member who didn't is not.
My task is to look through an arbitrary date range and determine which members were added or lost during that time period. Consider the sample table:
ID | Date
2 | 1/3/2010
3 | 1/3/2010
1 | 2/5/2010
2 | 2/5/2010
3 | 2/5/2010
1 | 3/3/2010
3 | 3/3/2010
In this case, member with ID "1" was added between Jan and Feb, and member with ID 2 was lost between Feb and Mar.
The problem I am having is that if I just poll to try and find the most recent entry, I will capture all the members that were dropped, but also all the members that exist on the last date. For example, I could run this query:
SELECT
ID,
Max(Date)
FROM
tableName
WHERE
Date BETWEEN '1/1/2010' AND '3/27/2010'
GROUP BY
ID
This would return:
ID | Date
1 | 3/3/2010
2 | 2/5/2010
3 | 3/3/2010
What I actually want, however, is just:
ID | Date
2 | 2/5/2010
Of course I can manually filter out the last date, but since the start and end dates are parameters, I want to generalize that. One way would be to run sequential queries: in the first query I'd find the last date, and then use that to filter in the second query. It would really help, however, if I could wrap this logic into a single query.
I'm also having a related problem when I try to find when a member was first added to the population. In that case I'm using a different type of query:
SELECT
ID,
Date
FROM
tableName i
WHERE
Date BETWEEN '1/1/2010' AND '3/27/2010'
AND
NOT EXISTS(
SELECT
ID,
Date
FROM
tableName ii
WHERE
ii.ID=i.ID
AND
ii.Date < i.Date
AND
Date BETWEEN '1/1/2010' AND '3/27/2010'
)
This returns:
ID | Date
1 | 2/5/2010
2 | 1/3/2010
3 | 1/3/2010
But what I want is:
ID | Date
1 | 2/5/2010
I would like to know:
1. Which approach (the MAX() or the subquery with NOT EXISTS) is more efficient and
2. How to fix the queries so that they only return the rows I want, excluding the first (last) date.
Thanks!
You could do something like this:
SELECT
ID,
Max(Date)
FROM
tableName
WHERE
Date BETWEEN '1/1/2010' AND '3/27/2010'
GROUP BY
ID
having max(date) < '3/1/2010'
This filters out anyone polled in March.
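Since the start and end dates are parameters, a way to generalize this (my sketch, not part of the original answer) is to compare against the latest poll date inside the range instead of hard-coding '3/1/2010':
SELECT
ID,
Max(Date)
FROM
tableName
WHERE
Date BETWEEN '1/1/2010' AND '3/27/2010'
GROUP BY
ID
HAVING
Max(Date) < (
SELECT Max(Date)
FROM tableName
WHERE Date BETWEEN '1/1/2010' AND '3/27/2010'
);
The symmetric "first added" question works the same way: use MIN(Date) and require it to be greater than the earliest poll date in the range.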