SQL Server: I have multiple records per day and I want to return only the first of the day - sql

I have some records track inquires by DATETIME. There is an glitch in the system and sometimes a record will enter multiple times on the same day. I have a query with a bunch of correlated subqueries attached to these but the numbers are off because when there were those glitches in the system then these leads show up multiple times. I need the first entry of the day, I tried fooling around with MIN but I couldn't quite get it to work.
I currently have this, I am not sure if I am on the right track though.
SELECT SL.UserID, MIN(SL.Added) OVER (PARTITION BY SL.UserID)
FROM SourceLog AS SL

Here's one approach using row_number():
select *
from (
select *,
row_number() over (partition by userid, cast(added as date) order by added) rn
from sourcelog
) t
where rn = 1

You could use group by along with min to accomplish this.
Depending on how your data is structured if you are assigning a unique sequential number to each record created you could just return the lowest number created per day. Otherwise you would need to return the ID of the record with the earliest DATETIME value per day.
--Assumes sequential IDs
select
min(Id)
from
[YourTable]
group by
--the conversion is used to stip the time value out of the date/time
convert(date, [YourDateTime]

Related

Filtering for Max Date in SQL when Multiple Records have the Same Date

I have a dataset that includes the following fields: building_id, lease_id, date_signed
I am trying to get the most recent lease signed for each building_id. The issue I'm running into is that several of the leases have the same date signed in the building and I only want to include one of them. I have been using SnowSQL and was looking into using the Rank window function, but that gives the same rank value to records that have the same date in the building.
Is there a way to pull only one value for the most recent lease date for each building even if there are multiple leases with that same date? I will also need to know which lease it is associated with. Thanks!
Use row_number() instead of rank(). Something like:
select t.*
from t
qualify row_number() over (partition by building_id order by date_signed desc) = 1;

How to select 1 row per id?

I'm working with a table that has multiple rows for each order id (e.g. variations in spelling for addresses and different last_updated dates), that in theory shouldn't be there (not my doing). I want to select just 1 row for each id and so far I figured I can do that using partitioning like so:
SELECT dp.order_id,
MAX(cr.updated_at) OVER(PARTITION BY dp.order_id) AS updated_at
but I have seen other queries which only use MAX and list every other column like so
SELECT dp.order_id,
MAX(dp.ship_address) as address,
MAX(cr.updated_at) as updated_at
etc...
this solution looks more neat but I can't get it to work (still returns multiple rows per single order_id). What am I doing wrong?
If you want one row per order_id, then window functions are not sufficient. They don't filter the data. You seem to want the most recent row. A typical method uses row_number():
select t.*
from (select t.*,
row_number() over (partition by order_id order by created_at desc) as seqnum
from t
) t
where seqnum = 1;
You can also use aggregation:
select order_id, max(ship_address), max(created_at)
from t
group by order_id;
However, the ship_address may not be from the most recent row and that is usually not desirable. You can tweak this using keep syntax:
select order_id,
max(ship_address) keep (dense_rank first order by created_at desc),
max(created_at)
from t
group by order_id;
However, this gets cumbersome for a lot of columns.
The 2nd "solution" doesn't care about values in other columns - it selects their MAX values. It means that you'd get ORDER_ID and - possibly - "mixed" values for other columns, i.e. those ADDRESS and UPDATED_AT might belong to different rows.
If that's OK with you, then go for it. Otherwise, you'll have to select one MAX row (using e.g. row_number analytic function), and fetch data that is related only to it (i.e. doesn't "mix" values from different rows).
Also, saying that you
can't get it to work (still returns multiple rows per single order_id)
is kind of difficult to believe. The way you put it, it can't be true.

Efficiently find last date in a table - Teradata SQL

Say I have a rather large table in a Teradata database, "Sales" that has a daily record for every sale and I want to write a SQL statement that limits this to the latest date only. This will not always be the previous day, for example, if it was a Monday the latest date would be the previous Friday.
I know I can get the results by the following:
SELECT s.*
FROM Sales s
JOIN (
SELECT MAX(SalesDate) as SalesDate
FROM Sales
) sd
ON s.SalesDate=sd.SalesDt
I am not knowledgable on how it would process the subquery and since Sales is a large table would there be a more efficient way to do this given there is not another table I could use?
Another (more flexible) way to get the top n utilizes OLAP-functions:
SELECT *
FROM Sales s
QUALIFY
RANK() OVER (ORDER BY SalesDate DESC) = 1
This will return all rows with the max date. If you want only one of them switch to ROW_NUMBER.
That is probably fine, if you have an index on salesdate.
If there is only one row, then I would recommend:
select top 1 s.*
from sales s
order by salesdate desc;
In particular, this should make use of an index on salesdate.
If there is more than one row, use top 1 with ties.

SQL to find best row in group based on multiple columns?

Let's say I have an Oracle table with measurements in different categories:
CREATE TABLE measurements (
category CHAR(8),
value NUMBER,
error NUMBER,
created DATE
)
Now I want to find the "best" row in each category, where "best" is defined like this:
It has the lowest errror.
If there are multiple measurements with the same error, the one that was created most recently is the considered to be the best.
This is a variation of the greatest N per group problem, but including two columns instead of one. How can I express this in SQL?
Use ROW_NUMBER:
WITH cte AS (
SELECT m.*, ROW_NUMBER() OVER (PARTITION BY category ORDER BY error, created DESC) rn
FROM measurements m
)
SELECT category, value, error, created
FROM cte
WHERE rn = 1;
For a brief explanation, the PARTITION BY clause instructs the DB to generate a separate row number for each group of records in the same category. The ORDER BY clause places those records with the smallest error first. Should two or more records in the same category be tied with the lowest error, then the next sorting level would place the record with the most recent creation date first.

Select latest and earliest times within a time group and a pivot statement

I have attandance data that contains a username, time, and status (IN or OUT). I want to show attendance data that contains a name, and the check in/out times. I expect a person to check in and out no more than twice a day. The data looks like this:
As you can see, my problem is that one person can have multiple data entries in different seconds for the same login attempt. This is because I get data from a fingerprint attendace scanner, and the machine in some cases makes multiple entries, sometimes just within 5-10 seconds. I want to select the data to be like this:
How can I identify the proper time for the login attempt, and then select the data with a pivot?
First, you need to normalize your data by removing the duplicate entries. In your situation, that's a challenge because the duplicated data isn't easily identified as a duplicate. You can make some assumptions though. Below, I assume that no one will make multiple login attempts in a two minute window. You can do this by first using a Common Table Expression (CTE, using the WITH clause).
Within the CTE, you can use the LAG function. Essentially what this code is saying is "for each partition of user and entry type, if the previous value was within 2 minutes of this value, then put a number, otherwise put null." I chose null as the flag that will keep the value because LAG of the first entry is going to be null. So, your CTE will just return a table of entry events (ID) that were distinct attempts.
Now, you prepare another CTE that a PIVOT will pull from that has everything from your table, but only for the entry IDs you cared about. The PIVOT is going to look over the MIN/MAX of your IN/OUT times.
WITH UNIQUE_LOGINS AS (
SELECT ID FROM LOGIN_TABLE
WHERE CASE WHEN LAG(TIME, 1, 0) OVER (PARTITION BY USERNAME, STATUS ORDER BY TIME)
+ (2/60/24) < TIME THEN NULL ELSE 1 END IS NULL ), -- Times within 2 minutes
TEMP_FOR_PIVOT AS (
SELECT USERNAME, TIME, STATUS FROM LOGIN_TABLE WHERE ID IN (SELECT ID FROM UNIQUE_LOGINS)
)
SELECT * FROM TEMP_FOR_PIVOT
PIVOT (
MIN(TIME), MAX(TIME) FOR STATUS IN ('IN', 'OUT')
)
From there, if you need to rearrange or rename your columns, then you can just put that last SELECT into yet another CTE and then select your values from it. There is some more about PIVOT here: Rotate/pivot table with aggregation in Oracle