query from/to, newest within a specific timerange - sql

I have the following table:
PersNumber | Property | From | To
XXX | 34 | 20180101 | 20180630
XXX | 38 | 20180701 | 20190330
XXX | 39 | 20180401 | 20201231
I have a time period, e.g. from 2018-01-01 to 2019-12-31.
I need to query the last row (actually only the first two columns). The criterion is: the from/to range falls within the time period, and the newest row wins if there is more than one. Meaning:
row : out because not in the period scope
row : a part is in the period scope but not the newest
row : a part is in the period scope, and this is the newest
I don't know whether the problem is understandable; if not, do not hesitate to tell me.

You seem to want:
select t.*
from t
where date_from <= '2019-12-31' and date_to >= '2018-01-01'
order by date_to desc
limit 1;
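The question also counts rows that fall only partly inside the period, which reads as an overlap test, and the row it expects (Property 39) is the one with the latest To value. Here is a minimal sketch of that reading, run against the sample rows with Python's sqlite3. The table name t, the column names, and the ISO date strings are assumptions (the question shows dates as 20180101-style values), and ordering by date_to is one interpretation of "newest".

```python
import sqlite3

# Hypothetical schema: table `t` with ISO-formatted date strings (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (pers_number TEXT, property INTEGER, date_from TEXT, date_to TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("XXX", 34, "2018-01-01", "2018-06-30"),
        ("XXX", 38, "2018-07-01", "2019-03-30"),
        ("XXX", 39, "2018-04-01", "2020-12-31"),
    ],
)


def newest_in_period(conn, period_start, period_end):
    """Return (pers_number, property) of the overlapping row with the latest date_to."""
    return conn.execute(
        """
        SELECT pers_number, property
        FROM t
        WHERE date_from <= ?   -- the row starts before the period ends
          AND date_to   >= ?   -- the row ends after the period starts
        ORDER BY date_to DESC
        LIMIT 1
        """,
        (period_end, period_start),
    ).fetchone()


result = newest_in_period(conn, "2018-01-01", "2019-12-31")
print(result)
```

If "newest" should instead mean the latest From value, ordering by date_from DESC would pick Property 38 for this data.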

Related

Querying the retention rate on multiple days with SQL

Given a simple data model that consists of a user table and a check_in table with a date field, I want to calculate the retention rate of my users. So for example, for all users with one or more check-ins, I want the percentage of users who did a check-in on their 2nd day, on their 3rd day and so on.
My SQL skills are pretty basic as it's not a tool that I use that often in my day-to-day work, and I know that this is beyond the types of queries I am used to. I've been looking into pivot tables to achieve this but I am unsure if this is the correct path.
Edit:
The user table does not have a registration date. One can assume it only contains the ID for this example.
Here is some sample data for the check_in table:
| user_id | date |
=====================================
| 1 | 2020-09-02 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 12:00:00 |
-------------------------------------
| 1 | 2020-09-04 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 11:00:00 |
-------------------------------------
| ... |
-------------------------------------
And the expected output of the query would be something like this:
| day_0 | day_1 | day_2 | day_3 |
=================================
| 70% | 67 % | 44% | 32% |
---------------------------------
Please note that I've used random numbers for this output just to illustrate the format.
Oh, I see. Assuming you mean days between checkins for users -- and users might have none -- then just use aggregation and window functions:
select sum( (ci.date = ci.min_date)::int )::numeric / max(u.num_users) as day_0,
sum( (ci.date = ci.min_date + interval '1 day')::int )::numeric / max(u.num_users) as day_1,
sum( (ci.date = ci.min_date + interval '2 day')::int )::numeric / max(u.num_users) as day_2
from (select u.*, count(*) over () as num_users
from users u
) u left join
(select ci.user_id, ci.date::date as date,
min(min(ci.date::date)) over (partition by ci.user_id) as min_date
from checkins ci
group by ci.user_id, ci.date::date
) ci
on ci.user_id = u.id; -- assumes users.id is the key that checkins.user_id refers to
Note that this aggregates the checkins table by user id and date. This ensures that there is only one row per date.
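To see the idea run end to end, here is a rough sketch of the same aggregation in SQLite via Python's sqlite3, without window functions. It uses the four sample check-ins from the question. The users table with ids 1 and 4 and the name retention are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY);          -- assumed: only an id
    CREATE TABLE check_in (user_id INTEGER, date TEXT);
    INSERT INTO users VALUES (1), (4);
    INSERT INTO check_in VALUES
        (1, '2020-09-02 13:00:00'),
        (4, '2020-09-04 12:00:00'),
        (1, '2020-09-04 13:00:00'),
        (4, '2020-09-04 11:00:00');
    """
)

retention = conn.execute(
    """
    WITH daily AS (               -- one row per user and calendar day
        SELECT DISTINCT ci.user_id, date(ci.date) AS day
        FROM check_in ci
    ),
    first_day AS (                -- each user's first check-in day
        SELECT user_id, MIN(day) AS min_day
        FROM daily
        GROUP BY user_id
    )
    SELECT
        1.0 * SUM(d.day = f.min_day)
            / (SELECT COUNT(*) FROM users) AS day_0,
        1.0 * SUM(d.day = date(f.min_day, '+1 days'))
            / (SELECT COUNT(*) FROM users) AS day_1,
        1.0 * SUM(d.day = date(f.min_day, '+2 days'))
            / (SELECT COUNT(*) FROM users) AS day_2
    FROM daily d
    JOIN first_day f ON f.user_id = d.user_id
    """
).fetchone()

print(retention)  # fractions of all users, one value per day offset
```

With this data both users count for day_0, nobody returns on day 1, and user 1 returns on day 2, so the row is (1.0, 0.0, 0.5).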

Retrieving 52 weeks after the result of a subquery

From a table that contains sales, I retrieved the last week of that table. That gives me the last week where there are sales being made. 'Date' is always the first day of the month but it doesn't matter, the real important data is week and partial_week.
The result is simple :
+------------+---------+--------------+
| Date | Week | Partial_week |
+------------+---------+--------------+
| 2020-02-01 | 2020-09 | 2020M02W09 |
+------------+---------+--------------+
Let's call it t1
I have a table with the first day of each month, every week and partial week from 2015 to 2025
(when a week is on two months, it's split in two partial weeks that have the same number but different month). It looks like this :
+------------+---------+--------------+
| Date | Week | Partial_week |
+------------+---------+--------------+
| 2020-02-01 | 2020-05 | 2020M02W05 |
| 2020-02-01 | 2020-06 | 2020M02W06 |
| 2020-02-01 | 2020-07 | 2020M02W07 |
| 2020-02-01 | 2020-08 | 2020M02W08 |
| 2020-02-01 | 2020-09 | 2020M02W09 |
| 2020-03-01 | 2020-09 | 2020M03W09 |
+------------+---------+--------------+
Let's call it t2
I now need to retrieve everything in t2 that is between 1 and 52 weeks after the week retrieved in t1. (This should get me every week and partial week until 2021-09 or so.)
I thought about having a
'select top 52 distinct week from t2'
joining on t1 and having a where clause 'where t1.week < t2.week'
then joining everything on t2 again to get every partial week too,
but that doesn't work because on every week t1.week is equal to null (I wish t1.week could just be a variable since it only has one row...)
Any ideas would be appreciated.
Your logic seems to be close. Put the initial query in a Scalar Subquery to handle it like a variable:
select *
from t2
where t2.week >=
( select week from t1 -- i.e. your existing query to return the latest week
)
qualify
dense_rank()
over (order by week) <= 52
You can also switch to a join:
select *
from t2
join
( select week from t1 -- i.e. your existing query to return the latest week
) as t1
on t2.week >= t1.week
qualify
dense_rank() -- next 52 week & partial weeks
over (order by t2.week) <= 52
The Explain plan for the scalar subquery version might be better.
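QUALIFY is not available in every database, so here is a hedged sketch of the same "scalar subquery as a variable" idea in SQLite via Python's sqlite3, using LIMIT on the distinct weeks instead of dense_rank. The t2 rows are the sample from the question. The t1 row uses a hypothetical start week of 2020-07 (not the 2020-09 from the question) so that the small sample has weeks after it. The comparison relies on week being a zero-padded YYYY-WW string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t1 (date TEXT, week TEXT, partial_week TEXT);
    CREATE TABLE t2 (date TEXT, week TEXT, partial_week TEXT);
    -- hypothetical start week, chosen so the sample below has later weeks
    INSERT INTO t1 VALUES ('2020-02-01', '2020-07', '2020M02W07');
    INSERT INTO t2 VALUES
        ('2020-02-01', '2020-05', '2020M02W05'),
        ('2020-02-01', '2020-06', '2020M02W06'),
        ('2020-02-01', '2020-07', '2020M02W07'),
        ('2020-02-01', '2020-08', '2020M02W08'),
        ('2020-02-01', '2020-09', '2020M02W09'),
        ('2020-03-01', '2020-09', '2020M03W09');
    """
)


def next_weeks(conn, n_weeks):
    """Rows of t2 for the n_weeks distinct weeks that follow t1.week."""
    return conn.execute(
        """
        SELECT date, week, partial_week
        FROM t2
        WHERE week IN (
            SELECT DISTINCT week
            FROM t2
            WHERE week > (SELECT week FROM t1)  -- scalar subquery acts like a variable
            ORDER BY week
            LIMIT ?
        )
        ORDER BY week, partial_week
        """,
        (n_weeks,),
    ).fetchall()


rows = next_weeks(conn, 2)
for row in rows:
    print(row)
```

Asking for 2 weeks returns three rows here, because week 2020-09 is split into two partial weeks. In the real case n_weeks would be 52.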

SQL Query X Days back excluding date ranges (Confusing!)

Ok, I have a tough SQL query, and I'm not sure how to go about writing it.
I am summing the number of "bananas collected" by an employee within the last X days, but what I could really use help on is determining X.
The "last X days" value is defined to be the last 100 days that the employee was NOT out due to Purple Fever, starting from some ChosenDate (we'll say today, 6/24/14). That is to say, if the person was sick with Purple Fever for 3 days, then I want to look back over the last 103 days from ChosenDate rather than the last 100 days. Any other reason the employee may have been out does not affect our calculation.
Table PersonOutIncident
+----------------------+----------+-------------+
| PersonOutIncidentID | PersonID | ReasonOut |
+----------------------+----------+-------------+
| 1 | Sarah | PurpleFever |
| 2 | Sarah | PaperCut |
| 3 | Jon | PurpleFever |
| 4 | Sarah | PurpleFever |
+----------------------+----------+-------------+
Table PersonOutDetail
+-------------------+----------------------+-----------+-----------+
| PersonOutDetailID | PersonOutIncidentID | BeginDate | EndDate |
+-------------------+----------------------+-----------+-----------+
| 1 | 1 | 1/1/2014 | 1/3/2014 |
| 2 | 1 | 1/7/2014 | 1/13/2014 |
| 3 | 2 | 2/1/2014 | 2/3/2014 |
| 4 | 3 | 1/15/2014 | 1/20/2014 |
| 5 | 4 | 5/1/2014 | 5/15/2014 |
+-------------------+----------------------+-----------+-----------+
The tables are established. Many PersonOutDetail records can be associated with one PersonOutIncident record and there may be multiple PersonOutIncident records for a single employee (That is to say, there could be two or three PersonOutIncident records with an identical ReasonOut column, because they represent a particular incident or event and the not-necessarily-continuous days lost due to that particular incident)
The nature of this requirement complicates things, even conceptually to me.
The best I can think of is to check for a BeginDate/EndDate pair within the 100 day base period, then determine the number of days from BeginDate to EndDate and add that to the base 100 days. But then I would have to check again that this new range doesn't overlap or contain additional BeginDate/EndDate pairs and, if so, add those days as well. I can tell already that this isn't the method I want to use, but I can't quite wrap my mind around how to start or structure this query. Does anyone have an idea that might steer me in the correct direction? I realize this might not be clear and I apologize if I'm just confusing things.
One way to do this is to work with a table or WITH CLAUSE that contains a list of days. Let's say days is a table with one column that contains the last 200 days. (This means the query will break if the employee had more than 100 sick days in the last 200 days).
Now you can get a list of all working days of an employee like this (replace ? with the employee id):
WITH t1 AS
(
SELECT day,
ROW_NUMBER() OVER (ORDER BY day DESC) AS 'RowNumber'
FROM days d
WHERE NOT EXISTS (SELECT * FROM PersonOutDetail pd
INNER JOIN PersonOutIncident po ON po.PersonOutIncidentID = pd.PersonOutIncidentID
WHERE d.day BETWEEN pd.BeginDate AND pd.EndDate
AND po.ReasonOut = 'PurpleFever'
AND po.PersonID = ?)
)
SELECT * FROM t1
WHERE RowNumber <= 100;
Alternatively, you can obtain the '100th day' by replacing RowNumber <= 100 with RowNumber = 100.
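For anyone who wants to try the days-list approach on the sample data, here is a minimal sketch in SQLite via Python's sqlite3. The days list is generated with a recursive CTE instead of a real table, and ORDER BY ... LIMIT stands in for ROW_NUMBER. The function name counted_days, the 200-day lookback default, and the ISO date strings (the question shows 1/1/2014-style dates) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE PersonOutIncident (
        PersonOutIncidentID INTEGER, PersonID TEXT, ReasonOut TEXT);
    CREATE TABLE PersonOutDetail (
        PersonOutDetailID INTEGER, PersonOutIncidentID INTEGER,
        BeginDate TEXT, EndDate TEXT);
    INSERT INTO PersonOutIncident VALUES
        (1, 'Sarah', 'PurpleFever'), (2, 'Sarah', 'PaperCut'),
        (3, 'Jon', 'PurpleFever'), (4, 'Sarah', 'PurpleFever');
    INSERT INTO PersonOutDetail VALUES
        (1, 1, '2014-01-01', '2014-01-03'),
        (2, 1, '2014-01-07', '2014-01-13'),
        (3, 2, '2014-02-01', '2014-02-03'),
        (4, 3, '2014-01-15', '2014-01-20'),
        (5, 4, '2014-05-01', '2014-05-15');
    """
)


def counted_days(conn, person_id, chosen_date, n_days=100, lookback=200):
    """Newest-first list of the last n_days days the person was not out with PurpleFever."""
    cursor = conn.execute(
        """
        WITH RECURSIVE days(day) AS (      -- chosen_date and the days before it
            SELECT date(:chosen)
            UNION ALL
            SELECT date(day, '-1 days') FROM days
            WHERE day > date(:chosen, :offset)
        )
        SELECT d.day
        FROM days d
        WHERE NOT EXISTS (
            SELECT 1
            FROM PersonOutDetail pd
            JOIN PersonOutIncident po
              ON po.PersonOutIncidentID = pd.PersonOutIncidentID
            WHERE d.day BETWEEN pd.BeginDate AND pd.EndDate
              AND po.ReasonOut = 'PurpleFever'
              AND po.PersonID = :person
        )
        ORDER BY d.day DESC
        LIMIT :n
        """,
        {"chosen": chosen_date, "offset": f"-{lookback - 1} days",
         "person": person_id, "n": n_days},
    )
    return [row[0] for row in cursor]


sarah = counted_days(conn, "Sarah", "2014-06-24")
jon = counted_days(conn, "Jon", "2014-06-24")
print(sarah[-1], jon[-1])  # start of each person's look-back window
```

The last element of each list is the date where the "last X days" window starts, so the bananas sum would filter on dates between that value and ChosenDate. For Sarah the 15 sick days in May push the start back to 2014-03-02, while Jon's window starts on 2014-03-17.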

Select Earliest Date and Time from List of Distinct User Sessions

I have a table of user access sessions which records website visitor activity:
accessid, userid, date, time, url
I'm trying to retrieve all distinct sessions for userid 1234, as well as the earliest date and time for each of those distinct sessions.
SELECT
DISTINCT accessid,
date,
time
FROM
accesslog
WHERE userid = '1234'
GROUP BY accessid
This gives me the date and time of a random row within each distinct accessid. I've read a number of posts recommending the use of min() and max(), so I tried:
SELECT DISTINCT accessid, MIN(DATE) AS date, MIN(TIME) AS time FROM accesslog WHERE userid = '1234' GROUP BY accessid ORDER BY date DESC, time DESC
... and even...
SELECT DISTINCT accessid, MIN(CONCAT(DATE, ' ', TIME)) AS datetime FROM accesslog WHERE userid = '1234' GROUP BY accessid ORDER BY date DESC, time DESC
... but I never get the correct result of the earliest date and time.
What is the trick to ordering this kind of query?
EDIT -
Something weird is happening....
The code posted below by Bill Karwin correctly retrieves the earliest date and time for sessions that started in 2009-09. But, for sessions that began on some day in 2009-08, the time and date for the first hit occurring in the current month is what is returned. In other words, the query does not appear to be spanning months!
Example data set:
accessid | userid | date | time
1 | 1234 | 2009-08-15 | 01:01:01
1 | 1234 | 2009-09-01 | 12:01:01
1 | 1234 | 2009-09-15 | 13:01:01
2 | 1234 | 2009-09-01 | 14:01:01
2 | 1234 | 2009-09-15 | 15:01:01
At least on my actual data table, the query posted below finds the following earliest date and time for each of the two accessids:
accessid | userid | date | time
1 | 1234 | 2009-09-01 | 12:01:01
2 | 1234 | 2009-09-01 | 14:01:01
... and I would guess that the only reason the result for accessid 2 appears correct is because it has no hits in a previous month.
Am I going crazy?
EDIT 2 -
The answer is yes, I am going crazy. The query works on the above sample data when placed in a table of duplicate structure.
Here is the (truncated) original data. I included the very first hit, another hit in the same month, the first hit of the next month, and then the last hit of the month. The original data set has many more hits in between these points, for a total of 462 rows.
accessid | date | time
cbb82c08d3103e721a1cf0c3f765a842 | 2009-08-18 | 04:01:42
cbb82c08d3103e721a1cf0c3f765a842 | 2009-08-23 | 23:18:52
cbb82c08d3103e721a1cf0c3f765a842 | 2009-09-17 | 05:12:16
cbb82c08d3103e721a1cf0c3f765a842 | 2009-09-18 | 06:29:59
... the query returns the 2009-09-17 value as the earliest value when the original table is queried. But, when I copy the ........ oh, balls.
It's because the hits from 2009-08% have an empty userid field.
This is a variation of the "greatest-n-per-group" problem that comes up on StackOverflow several times per week.
SELECT
a1.accessid,
a1.date,
a1.time
FROM
accesslog a1
LEFT OUTER JOIN
accesslog a2
ON (a1.accessid = a2.accessid AND a1.userid = a2.userid
AND (a1.date > a2.date OR a1.date = a2.date AND a1.time > a2.time))
WHERE a1.userid = '1234'
AND a2.accessid IS NULL;
The way this works is that we try to find a row (a2) that has the same accessid and userid, and an earlier date or time than the row a1. When we can't find an earlier row, then a1 must be the earliest row.
Re your comment, I just tried it with the sample data you provided. Here's what I get:
+----------+------------+----------+
| accessid | date | time |
+----------+------------+----------+
| 1 | 2009-08-15 | 01:01:01 |
| 2 | 2009-09-01 | 14:01:01 |
+----------+------------+----------+
I'm using MySQL 5.0.75 on Mac OS X.
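The anti-join can be reproduced outside MySQL as well. Below is a small sketch in SQLite via Python's sqlite3, loaded with the example data set from the question. It also replays the cause found in EDIT 2 by blanking the userid of the 2009-08 hit. The function name first_hits is just a label for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accesslog (accessid INTEGER, userid TEXT, date TEXT, time TEXT);
    INSERT INTO accesslog VALUES
        (1, '1234', '2009-08-15', '01:01:01'),
        (1, '1234', '2009-09-01', '12:01:01'),
        (1, '1234', '2009-09-15', '13:01:01'),
        (2, '1234', '2009-09-01', '14:01:01'),
        (2, '1234', '2009-09-15', '15:01:01');
    """
)


def first_hits(conn, userid):
    """Earliest (accessid, date, time) per session: keep rows with no earlier match."""
    return conn.execute(
        """
        SELECT a1.accessid, a1.date, a1.time
        FROM accesslog a1
        LEFT OUTER JOIN accesslog a2
          ON a1.accessid = a2.accessid
         AND a1.userid = a2.userid
         AND (a1.date > a2.date OR (a1.date = a2.date AND a1.time > a2.time))
        WHERE a1.userid = ?
          AND a2.accessid IS NULL       -- no earlier row was found for a1
        ORDER BY a1.accessid
        """,
        (userid,),
    ).fetchall()


before = first_hits(conn, "1234")

# Replay the situation from EDIT 2: the 2009-08 hit has an empty userid.
conn.execute("UPDATE accesslog SET userid = '' WHERE date = '2009-08-15'")
after = first_hits(conn, "1234")

print(before)
print(after)
```

With the userid intact the earliest hit for accessid 1 is 2009-08-15. Once that row's userid is empty, the filter on userid drops it and 2009-09-01 12:01:01 comes back instead, which matches what the question reported.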
Try this
SELECT
accessid,
MIN(TIMESTAMP(date, time)) AS first_hit
FROM
accesslog
WHERE userid = '1234'
GROUP BY accessid
It will return each unique accessid with its earliest date and time for userid = '1234'.

Finding correlated values from second table without resorting to PL/SQL

I have the following two tables in my database:
a) A table containing values acquired at a certain date (you may think of these as, say, temperature readings):
sensor_id | acquired | value
----------+---------------------+--------
1 | 2009-04-01 10:00:00 | 20
1 | 2009-04-01 10:01:00 | 21
1 | 2009-04-01 10:02:00 | 20
1 | 2009-04-01 10:09:00 | 20
1 | 2009-04-01 10:11:00 | 25
1 | 2009-04-01 10:15:00 | 30
...
The interval between the readings may differ, but the combination of (sensor_id, acquired) is unique.
b) A second table containing time periods and a description (you may think of these as, say, periods when someone turned on the radiator):
sensor_id | start_date | end_date | description
----------+---------------------+---------------------+------------------
1 | 2009-04-01 10:00:00 | 2009-04-01 10:02:00 | some description
1 | 2009-04-01 10:10:00 | 2009-04-01 10:14:00 | something else
Again, the length of the period may differ, but there will never be overlapping time periods for any given sensor.
I want to get a result that looks like this for any sensor and any date range:
sensor id | start date | v1 | end date | v2 | description
----------+---------------------+----+---------------------+----+------------------
1 | 2009-04-01 10:00:00 | 20 | 2009-04-01 10:02:00 | 20 | some description
1 | 2009-04-01 10:10:00 | 25 | 2009-04-01 10:14:00 | 30 | something else
Or in text form: given a sensor_id and a date range of range_start and range_end,
find me all time periods which overlap the date range (that is, start_date < range_end and end_date > range_start) and for each of these rows, find the corresponding values from the value table for the time period's start_date and end_date (the first row with acquired >= start_date and the first row with acquired >= end_date).
If it wasn't for the start_value and end_value columns, this would be a textbook trivial example of how to join two tables.
Can I somehow get the output I need in one SQL statement without resorting to writing a PL/SQL function to find these values?
Unless I have overlooked something blatantly obvious, this can't be done with simple subselects.
Database is Oracle 11g, so any Oracle-specific features are acceptable.
Edit: yes, looping is possible, but I want to know if this can be done with a single SQL select.
You can give this a try. Note the caveats at the end though.
SELECT
RNG.sensor_id,
RNG.start_date,
RDG1.value AS v1,
RNG.end_date,
RDG2.value AS v2,
RNG.description
FROM
Ranges RNG
INNER JOIN Readings RDG1 ON
RDG1.sensor_id = RNG.sensor_id AND
RDG1.acquired >= RNG.start_date
LEFT OUTER JOIN Readings RDG1_NE ON
RDG1_NE.sensor_id = RDG1.sensor_id AND
RDG1_NE.acquired >= RNG.start_date AND
RDG1_NE.acquired < RDG1.acquired
INNER JOIN Readings RDG2 ON
RDG2.sensor_id = RNG.sensor_id AND
RDG2.acquired >= RNG.end_date
LEFT OUTER JOIN Readings RDG2_NE ON
RDG2_NE.sensor_id = RDG2.sensor_id AND
RDG2_NE.acquired >= RNG.end_date AND
RDG2_NE.acquired < RDG2.acquired
WHERE
RDG1_NE.sensor_id IS NULL AND
RDG2_NE.sensor_id IS NULL
This uses the first reading after the start date of the range and the first reading after the end date (personally, I'd think using the last date before the start and end would make more sense or the closest value, but I don't know your application). If there is no such reading then you won't get anything at all. You can change the INNER JOINs to OUTER and put additional logic in to handle those situations based on your own business rules.
It seems pretty straightforward.
Find the sensor values for each range: find a row (call its acquired value X) where X > start_date and no other row for the same sensor has acquired > start_date and acquired < X. Do the same for the end date.
Select only the ranges that overlap the date range supplied by the query, that is, start_date < range_end and end_date > range_start.
In SQL this would be something like the following (:range_start and :range_end are bind variables).
SELECT R1.*, SV1.acquired, SV1.value AS v1, SV2.acquired, SV2.value AS v2
FROM ranges R1
INNER JOIN sensor_values SV1 ON SV1.sensor_id = R1.sensor_id
INNER JOIN sensor_values SV2 ON SV2.sensor_id = R1.sensor_id
WHERE SV1.acquired > R1.start_date
AND NOT EXISTS (
SELECT *
FROM sensor_values SV3
WHERE SV3.sensor_id = R1.sensor_id
AND SV3.acquired > R1.start_date
AND SV3.acquired < SV1.acquired)
AND SV2.acquired > R1.end_date
AND NOT EXISTS (
SELECT *
FROM sensor_values SV4
WHERE SV4.sensor_id = R1.sensor_id
AND SV4.acquired > R1.end_date
AND SV4.acquired < SV2.acquired)
AND R1.start_date < :range_end
AND R1.end_date > :range_start
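As a way to check the expected output from the question, here is a hedged sketch in SQLite via Python's sqlite3 that looks up the first reading at or after each period's start_date and end_date with correlated scalar subqueries. The table names readings and ranges and the function name periods_with_values are assumptions. It is also not Oracle syntax: Oracle 11g has no LIMIT, so there the same idea would need ROWNUM, an analytic function, or the NOT EXISTS form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE readings (sensor_id INTEGER, acquired TEXT, value INTEGER);
    CREATE TABLE ranges (
        sensor_id INTEGER, start_date TEXT, end_date TEXT, description TEXT);
    INSERT INTO readings VALUES
        (1, '2009-04-01 10:00:00', 20),
        (1, '2009-04-01 10:01:00', 21),
        (1, '2009-04-01 10:02:00', 20),
        (1, '2009-04-01 10:09:00', 20),
        (1, '2009-04-01 10:11:00', 25),
        (1, '2009-04-01 10:15:00', 30);
    INSERT INTO ranges VALUES
        (1, '2009-04-01 10:00:00', '2009-04-01 10:02:00', 'some description'),
        (1, '2009-04-01 10:10:00', '2009-04-01 10:14:00', 'something else');
    """
)


def periods_with_values(conn, sensor_id, range_start, range_end):
    """Periods overlapping the range, with the first reading at/after start and end."""
    return conn.execute(
        """
        SELECT r.sensor_id,
               r.start_date,
               (SELECT rd.value FROM readings rd
                 WHERE rd.sensor_id = r.sensor_id
                   AND rd.acquired >= r.start_date
                 ORDER BY rd.acquired LIMIT 1) AS v1,
               r.end_date,
               (SELECT rd.value FROM readings rd
                 WHERE rd.sensor_id = r.sensor_id
                   AND rd.acquired >= r.end_date
                 ORDER BY rd.acquired LIMIT 1) AS v2,
               r.description
        FROM ranges r
        WHERE r.sensor_id = ?
          AND r.start_date < ?     -- period starts before the range ends
          AND r.end_date   > ?     -- period ends after the range starts
        ORDER BY r.start_date
        """,
        (sensor_id, range_end, range_start),
    ).fetchall()


rows = periods_with_values(conn, 1, "2009-04-01 09:00:00", "2009-04-01 11:00:00")
for row in rows:
    print(row)
```

A period with no reading at or after one of its dates still appears, with NULL in v1 or v2, which differs from the INNER JOIN versions that drop such rows.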