SQL Query to return one record for a column with duplicated values based on the date of another column


I am sure this question may have been answered before, but it's hard to phrase, and I've spent a couple of hours on Google and still not found a solution.
I have a table (view) that has a record of device serial numbers that are rented out. The row is created by a sales order when we ship out a device (Status = Shipped) and then when the device is returned the same record is updated (Status = Returned). The TransDate is updated when shipped and then when returned.
Once a device is back in our warehouse, we rent it out again, this time on a new order (because it will almost certainly go to a different customer). So we get a new row in the table, with a new order number but, of course, with the same serial number.
Here is some example data (but there are many additional fields in the real table)
SerNo  TransDate  OrdNo  Status    (record id for describing the issue)
1111   20170105   1234   Returned  1
2222   20161220   1235   Shipped   2
3333   20170105   1235   Returned  3
4444   20170105   1236   Returned  4
1111   20170115   1311   Returned  5
4444   20170110   1312   Shipped   6
6666   20170110   1313   Shipped   7
1111   20170125   1401   Shipped   8
My challenge is that I need a Select query that will return just one record for every serial number that is in the table... and where there is more than one record for the same serial number, I need the one with the latest date.
In other words the result would include record id's:
2, 6, 7, 8 (devices out at customers) AND 3 (because this has been returned but not re-shipped).
(Records 1, 4, 5 have been returned, but then rented out again, so they are now just historical records and do not represent current status).
I know GROUP BY will not work because I have to aggregate the other fields, and I need all the other fields (there are many more) from the record with the latest date for a given serial number.
We are running this on SQL Server 2012.
Thank you in advance!

You can use the row_number function to get the first row for each serno ordered by descending transdate.
select serno, transdate, ordno, status
from (
    select t.*,
           row_number() over (partition by serno order by transdate desc) as rnum
    from data t
) x
where rnum = 1
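To check the row_number approach against the sample data from the question, here is a minimal sketch that runs the same query in an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for SQL Server 2012 here (it needs SQLite 3.25 or later for window functions), and the rec_id column is an assumption added so the result can be compared with the record ids listed in the question.

```python
import sqlite3

# In-memory SQLite database standing in for the SQL Server view (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (rec_id INTEGER, serno TEXT, transdate TEXT, "
    "ordno INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?, ?)",
    [
        (1, "1111", "20170105", 1234, "Returned"),
        (2, "2222", "20161220", 1235, "Shipped"),
        (3, "3333", "20170105", 1235, "Returned"),
        (4, "4444", "20170105", 1236, "Returned"),
        (5, "1111", "20170115", 1311, "Returned"),
        (6, "4444", "20170110", 1312, "Shipped"),
        (7, "6666", "20170110", 1313, "Shipped"),
        (8, "1111", "20170125", 1401, "Shipped"),
    ],
)

# Number the rows of each serial number from newest to oldest, keep row 1.
latest = conn.execute(
    """
    SELECT rec_id, serno, transdate, ordno, status
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY serno
                                  ORDER BY transdate DESC) AS rnum
        FROM data t
    ) x
    WHERE rnum = 1
    ORDER BY rec_id
    """
).fetchall()

latest_ids = [row[0] for row in latest]
print(latest_ids)  # [2, 3, 6, 7, 8]
```

The surviving record ids match the ones the question asks for: 2, 6, 7, 8 and 3.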

Related

Query to select appropriate row and calculate elapsed time

I need some help in coming up with a query that will return the answer to the question “How long has a Help Desk Ticket been owned by the currently assigned group?” Following is a subset of the data model with some sample data:
Help Desk Cases
Case ID (PK)  Assigned Person  Assigned Group
123456        Robert           Hardware
Help Desk Case Assignment History
Case ID (PK)  Seq # (PK)  Assigned Group  Assigned Person  Elapsed Time  Row Added Date/Time
123456        1           Hardware                         10
123456        2           Software                         2
123456        3           Hardware        Sam              1
123456        4           Software        Sophie           6
123456        5           Hardware                         8
123456        6           Hardware        Sam              3
123456        7           Hardware        Robert
The Elapsed Time column for the most recent row (Seq #7) is not updated until a subsequent row (Seq #8) is written, so I don’t think I can use an aggregate function. For the sample data above, I need to get the Row Added column from Seq # 5 and subtract it from the current date to get the total amount of time the case has been most recently assigned to the Hardware group (we ignore previous assignments such as Seq # 1 and Seq # 3).
The Query output for the example above should be:
Case ID Assigned Group Assigned Person Time Owned
123456 Hardware Robert Current Date - Seq #5 Row Added Date/Time

With Oracle 12c and higher...
select case_id,
last_assigned_group as assigned_group,
last_assigned_person as assigned_person,
nvl(last_row_added, systimestamp) - first_row_added as time_owned
from help_desk_case_assignment_history
match_recognize (
partition by case_id
order by seq#
measures
first(row_added) as first_row_added,
last(row_added) as last_row_added,
last(assigned_group) as last_assigned_group,
last(assigned_person) as last_assigned_person
one row per match
after match skip past last row
pattern (
assignment_run* case_end
)
define
assignment_run as (assigned_group = next(assigned_group)),
case_end as (elapsed_time is null or next(assigned_group) is null)
)
;
In human words: for each help desk case ID, find the last uninterrupted "run" of assignments within the same group. For that last run, identify its starting time, ending time, and ending person, and display the values found.
With Oracle 11g and lower...
with xyz as (
select X.*,
case when lnnvl(assigned_group = lag(assigned_group) over (partition by case_id order by seq#)) then seq# end as assignment_run_start
from help_desk_case_assignment_history X
),
xyz2 as (
select X.*,
last_value(assignment_run_start) ignore nulls over (partition by case_id order by seq#) as assignment_run_id
from xyz X
),
xyz3 as (
select case_id, assigned_group, assignment_run_id,
max(assigned_person) keep (dense_rank last order by seq#) as last_assigned_person,
nvl(max(row_added) keep (dense_rank last order by seq#), systimestamp)
- min(row_added) keep (dense_rank first order by seq#)
as time_owned,
row_number() over (partition by case_id order by assignment_run_id desc) as last_group_ind
from xyz2 X
group by case_id, assigned_group, assignment_run_id
)
select case_id, assigned_group, last_assigned_person as assigned_person, time_owned
from xyz3
where last_group_ind = 1
;
Perhaps ugly, but pretty straightforward and working.
In human words:
Identify the boundaries (starts) of assignment runs as increasing numeric IDs.
Extend the found assignment run starts to the whole assignment runs.
Calculate the assignments' run times and last assigned persons.
Restrict the previous calculation to the last (by their ID) assignment run only.
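The four steps above can be sketched on a small data set. This is not the Oracle query itself: it is a simplified variant run in in-memory SQLite (3.25 or later) through Python's sqlite3 module, the row_added timestamps are hypothetical because the question's sample leaves that column blank, and it only returns the start of the latest run rather than the full elapsed time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (case_id INTEGER, seq INTEGER, "
    "assigned_group TEXT, assigned_person TEXT, row_added TEXT)"
)
# row_added values are made up for illustration (assumption).
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?, ?, ?)",
    [
        (123456, 1, "Hardware", None, "2020-01-01"),
        (123456, 2, "Software", None, "2020-01-11"),
        (123456, 3, "Hardware", "Sam", "2020-01-13"),
        (123456, 4, "Software", "Sophie", "2020-01-14"),
        (123456, 5, "Hardware", None, "2020-01-20"),
        (123456, 6, "Hardware", "Sam", "2020-01-28"),
        (123456, 7, "Hardware", "Robert", "2020-01-31"),
    ],
)

owned_since = conn.execute(
    """
    WITH with_prev AS (
        -- Step 1: look at the previous row's group within each case.
        SELECT h.*,
               LAG(assigned_group) OVER (PARTITION BY case_id
                                         ORDER BY seq) AS prev_group
        FROM history h
    ),
    runs AS (
        -- Step 2: a row starts a run when its group differs from the
        -- previous one; carry the latest run start forward as run_id.
        SELECT p.*,
               MAX(CASE WHEN assigned_group IS NOT prev_group THEN seq END)
                   OVER (PARTITION BY case_id ORDER BY seq) AS run_id
        FROM with_prev p
    )
    -- Steps 3 and 4: aggregate per run, keep only the last run per case.
    SELECT case_id, assigned_group, MIN(row_added) AS owned_since
    FROM runs r
    WHERE run_id = (SELECT MAX(run_id) FROM runs r2
                    WHERE r2.case_id = r.case_id)
    GROUP BY case_id, assigned_group
    """
).fetchall()

print(owned_since)  # [(123456, 'Hardware', '2020-01-20')]
```

The run that starts at Seq #5 is picked, so the time owned would be the current date minus that row's timestamp.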

Count Distinct values in one column based on other columns

I have a table that looks like the following:
app_id supplier_reached creation_date platform
10001 1 9/11/2018 iOS
10001 2 9/18/2018 iOS
10002 1 5/16/2018 android
10003 1 5/6/2018 android
10004 1 10/1/2018 android
10004 1 2/3/2018 android
10004 2 2/2/2018 web
10005 4 1/5/2018 web
10005 2 5/1/2018 android
10006 3 10/1/2018 iOS
10005 4 1/1/2018 iOS
The objective is to find the unique number of app_id submitted per month.
If I just do a count(distinct app_id) I will get the following results:
Group by month count(app number)
Jan 1
Feb 1
may 3
september 1
october 2
However, an application is considered unique based on a combination of other fields as well. For example, for the month of January, the app_id is the same however a combination of app_id, supplier_reached and platform show different values and hence the app_id should be counted twice.
Following the same pattern, the desired result should be:
Group by month Desired answer
Jan 2
Feb 2
may 3
september 2
october 2
Lastly, there can be many other columns in the table which may or may not contribute to the uniqueness of an application.
Is there a way to do this type of count in SQL?
I am using Redshift.

As pointed out above, in Redshift count(distinct ...) does not work with multiple fields.
You can first group by the columns that you want to be unique and then count the records like this:
select month, count(1) as app_number
from (
    select month, app_id, supplier_reached, platform
    from your_table
    group by 1, 2, 3, 4
) t
group by 1
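To see the group-then-count approach reproduce the desired result, here is a minimal sketch using the sample rows from the question. It runs in in-memory SQLite through Python's sqlite3 module as a stand-in for Redshift; the table name apps, the ISO date format, and deriving the month with strftime are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE apps (app_id INTEGER, supplier_reached INTEGER, "
    "creation_date TEXT, platform TEXT)"
)
conn.executemany(
    "INSERT INTO apps VALUES (?, ?, ?, ?)",
    [
        (10001, 1, "2018-09-11", "iOS"),
        (10001, 2, "2018-09-18", "iOS"),
        (10002, 1, "2018-05-16", "android"),
        (10003, 1, "2018-05-06", "android"),
        (10004, 1, "2018-10-01", "android"),
        (10004, 1, "2018-02-03", "android"),
        (10004, 2, "2018-02-02", "web"),
        (10005, 4, "2018-01-05", "web"),
        (10005, 2, "2018-05-01", "android"),
        (10006, 3, "2018-10-01", "iOS"),
        (10005, 4, "2018-01-01", "iOS"),
    ],
)

# Inner query: one row per distinct (month, app_id, supplier, platform).
# Outer query: count those rows per month.
per_month = conn.execute(
    """
    SELECT month, COUNT(*) AS app_number
    FROM (
        SELECT strftime('%m', creation_date) AS month,
               app_id, supplier_reached, platform
        FROM apps
        GROUP BY month, app_id, supplier_reached, platform
    ) AS t
    GROUP BY month
    ORDER BY month
    """
).fetchall()

print(per_month)
# [('01', 2), ('02', 2), ('05', 3), ('09', 2), ('10', 2)]
```

These counts match the desired answer in the question (Jan 2, Feb 2, May 3, September 2, October 2).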

I don't think Postgres or Redshift supports COUNT(DISTINCT) with multiple arguments. One workaround is to use concatenation:
count(distinct app_id || ':' || supplier_reached || ':' || platform)

Your stated objective is wrong.
You don't want
to find the unique number of app_id submitted per month
you want
to find the unique number of app_id + supplier_reached + platform submitted per month.
And so, you need to use either a) a combination of columns, like count(distinct col1||col2||col3), or b)
select t1.month, count(*)
from
(select distinct
app_id,
supplier_reached,
platform,
month
from sometable) t1
group by t1.month

Actually, you can count distinct ROW values conveniently in Postgres:
SELECT month, count(DISTINCT (app_id, supplier_reached, platform)) AS dist_apps
FROM tbl
GROUP BY 1;
The ROW keyword would be just noise here:
count(DISTINCT ROW(app_id, supplier_reached, platform))
I would discourage concatenating columns for the purpose. This is comparatively expensive, error prone (think of distinct data types and locale-dependent text representation) and introduces corner-case errors if the used separator can be contained in column values.
Alas, row constructors are not supported by Redshift. Its list of unsupported PostgreSQL features includes:
...
Value expressions
Subscripted expressions
Array constructors
Row constructors
...
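The separator corner case mentioned above is easy to reproduce. The following is a small sketch with made-up values in a hypothetical table t, run in in-memory SQLite through Python's sqlite3 module: two different (a, b) pairs collapse into the same concatenated string, so the concatenation count comes out too low while grouping on the columns does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b TEXT)")
# Two distinct pairs whose values happen to contain the separator ':'.
conn.executemany("INSERT INTO t VALUES (?, ?)", [("x:y", "z"), ("x", "y:z")])

# Both rows concatenate to 'x:y:z', so they are counted as one value.
concat_count = conn.execute(
    "SELECT COUNT(DISTINCT a || ':' || b) FROM t"
).fetchone()[0]

# Grouping on the columns themselves keeps the two pairs apart.
grouped_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT a, b FROM t GROUP BY a, b) AS g"
).fetchone()[0]

print(concat_count, grouped_count)  # 1 2
```

Picking a separator that cannot occur in the data avoids this particular collision, but the grouped subquery sidesteps the question entirely.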

Time between date. (More advanced than just Datediff)

I have a table that contains Guest_ID and Trip_Date. I have been tasked with trying to find out for each Guest_ID how many times they have had over 365 days between trips. I know that for the time between the dates I can use datediff formula but I am unsure of how to get the dates plugged in properly. I think if I can get help with this part I can do the rest.
For each time this happened I need to report back Guest_ID, Prior_Last_Trip, New_Trip, days between. This data goes back for over a decade so it is possible for a Guest to have multiple periods of over a year between visits.
I was thinking of just loading a table with that data that can be queried later. That way once I figure out how to make this work the first time I can setup a stored procedure or trigger to check for new occurrences of this and populate the table.
I was not sure where to begin on this code. I was thinking recursion might be the answer, but I do not know recursion, just that it exists.
This table is quite large. Around 1.5 million unique Guest_ID's with over 30 million trips.
I am using SQL Server 2012. If there is anything else I can add to help this let me know. I will edit and update this as I have ideas on how to make this work myself.
Edit 1: Sample Data and Desired Results
Guest_ID Trip_Date
1 1/1/2013
1 2/5/2013
1 12/5/2013
1 1/1/2015
1 6/5/2015
1 8/1/2017
1 10/2/2017
1 1/6/2018
1 6/7/2018
1 7/1/2018
1 7/5/2018
2 1/1/2018
2 2/6/2018
2 4/2/2018
2 7/3/2018
3 1/1/2014
3 6/5/2014
3 9/4/2014
Guest_ID Prior_Last_Trip New_Trip DaysBetween
1 12/5/2013 1/1/2015 392
1 6/5/2015 8/1/2017 788
So you can see that Guest 1 had 2 different times where they did not have a trip for over a year, and those two instances are recorded in the results. Guest 2 never had a gap of over a year and therefore has no records in the results. Guest 3 has not had a trip in over a year, but without a return trip they do not currently qualify for the result set. Should Guest 3 ever make another trip, they would then be added to the result set.
Edit 2: Working Query
Thanks to @Code4ml I got this working. Here is the complete query.
Select
Guest_ID, CurrentTrip, DaysBetween, Lasttrip
From (
Select
Guest_ID
,Lag(Trip_Date,1) Over(Partition by Guest_ID Order by Trip_Date) as LastTrip
,Trip_Date as CurrentTrip
,DATEDIFF(d,Lag(Trip_Date,1) Over(Partition by Guest_ID Order by Trip_Date),Trip_Date) as DaysBetween
From UCS
) as A
Where DaysBetween > 365

You can use the SQL LAG function to access the previous trip date, like below. Order the window by trip_date ascending so that LAG returns the earlier trip:
SELECT guest_id, trip_date,
LAG (trip_date,1) OVER (PARTITION BY guest_id ORDER BY trip_date) AS prev_trip_date
FROM tripsTable
Now you can use this as a subquery to calculate the number of days between trips and filter the data as required.
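As a rough check of the LAG approach, the sketch below loads the sample trips from the question into in-memory SQLite (3.25 or later) through Python's sqlite3 module and filters for gaps over 365 days. SQLite has no DATEDIFF, so the day difference is computed with julianday here; the table name trips and the ISO date format are assumptions for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (guest_id INTEGER, trip_date TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [
        (1, "2013-01-01"), (1, "2013-02-05"), (1, "2013-12-05"),
        (1, "2015-01-01"), (1, "2015-06-05"), (1, "2017-08-01"),
        (1, "2017-10-02"), (1, "2018-01-06"), (1, "2018-06-07"),
        (1, "2018-07-01"), (1, "2018-07-05"),
        (2, "2018-01-01"), (2, "2018-02-06"), (2, "2018-04-02"),
        (2, "2018-07-03"),
        (3, "2014-01-01"), (3, "2014-06-05"), (3, "2014-09-04"),
    ],
)

gaps = conn.execute(
    """
    SELECT guest_id, prior_last_trip, new_trip,
           CAST(julianday(new_trip) - julianday(prior_last_trip) AS INTEGER)
               AS days_between
    FROM (
        -- Pair every trip with the guest's previous trip (ascending order).
        SELECT guest_id,
               LAG(trip_date) OVER (PARTITION BY guest_id
                                    ORDER BY trip_date) AS prior_last_trip,
               trip_date AS new_trip
        FROM trips
    ) a
    WHERE julianday(new_trip) - julianday(prior_last_trip) > 365
    ORDER BY guest_id, new_trip
    """
).fetchall()

for row in gaps:
    print(row)
# (1, '2013-12-05', '2015-01-01', 392)
# (1, '2015-06-05', '2017-08-01', 788)
```

Only Guest 1 appears, with the two gaps listed in the question's desired results. The first trip of each guest has no previous trip, so its difference is NULL and it is filtered out.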

SQL Find latest record only if COMPLETE field is 0

I have a table with multiple records submitted by a user. In each record is a field called COMPLETE to indicate if a record is fully completed or not.
I need a way to get the latest records of the user where COMPLETE is 0, LOCATION, DATE are the same and no additional record exist where COMPLETE is 1. In each record there are additional fields such as Type, AMOUNT, Total, etc. These can be different, even though the USER, LOCATION, and DATE are the same.
There is a SUB_DATE field and ID field that denote the day the submission was made and auto incremented ID number. Here is the table:
ID NAME LOCATION DATE COMPLETE SUB_DATE TYPE1 AMOUNT1 TYPE2 AMOUNT2 TOTAL
1 user1 loc1 2017-09-15 1 2017-09-10 Food 12.25 Hotel 65.54 77.79
2 user1 loc1 2017-09-15 0 2017-09-11 Food 12.25 NULL 0 12.25
3 user1 loc2 2017-08-13 0 2017-09-05 Flight 140 Food 5 145.00
4 user1 loc2 2017-08-13 0 2017-09-10 Flight 140 NULL 0 140
5 user1 loc3 2017-07-14 0 2017-07-15 Taxi 25 NULL 0 25
6 user1 loc3 2017-08-25 1 2017-08-26 Food 45 NULL 0 45
The result I would like to retrieve is ID 4, because its SUB_DATE is later than that of ID 3, which has the same Name, Location, and Date information, and there is no record for that set with COMPLETE set to 1.
I would also like to retrieve ID 5, since it is the latest record for the User, Location, Date, and Complete is 0.
I would also appreciate it if you could explain your answer to help me understand what is happening in the solution.

Not sure if I fully understood but try this
SELECT *
FROM (
SELECT *,
MAX(CONVERT(INT,COMPLETE)) OVER (PARTITION BY NAME,LOCATION,DATE) AS CompleteForNameLocationAndDate,
MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE) AS LastSubDate
FROM your_table t
) a
WHERE CompleteForNameLocationAndDate = 0 AND
SUB_DATE = LastSubDate
So what we have done here:
First, if you run just the inner query in Management Studio, you will see what that does:
The first max function will partition the data in the table by each unique Name,Location,Date set.
In the case of your data, ID 1 & 2 are the first partition, 3&4 are the second partition, 5 is the 3rd partition and 6 is the 4th partition.
So for each of these partitions it will get the max value in the COMPLETE column. Therefore any partition with a 1 as its max value has been completed.
Note also the convert function. This is because COMPLETE is of datatype BIT (1 or 0) and the max function does not work with that datatype. We therefore convert to INT. If your COMPLETE column is type INT, you can take the convert out.
The second max function partitions by unique Name, Location and Date again, but this time we get the max SUB_DATE, which gives us the date of the latest record for each Name, Location, Date set.
So we take that query and wrap it in a derived table, which for simplicity we call a. We need to do this because SQL Server doesn't allow windowed functions in the WHERE clause of a query. A windowed function is one that makes use of the OVER keyword, as we have done. In an ideal world, SQL would let us do
SELECT *,
MAX(CONVERT(INT,COMPLETE)) OVER (PARTITION BY NAME,LOCATION,DATE) AS CompleteForNameLocationAndDate,
MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE) AS LastSubDate
FROM your_table t
WHERE MAX(CONVERT(INT,COMPLETE)) OVER (PARTITION BY NAME,LOCATION,DATE) = 0 AND
SUB_DATE = MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE)
But it doesn't allow it so we have to use the derived table.
So then we basically SELECT everything from our derived table Where
CompleteForNameLocationAndDate = 0
Which are Name,Location, Date partitions which do not have a record marked as complete.
Then we filter further asking for only the latest record for each partition
SUB_DATE = LastSubDate
Hope that makes sense, not sure what level of detail you need?
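To see the two window functions at work on the question's data, here is a minimal sketch in in-memory SQLite (3.25 or later) through Python's sqlite3 module. Only the columns the query needs are loaded, COMPLETE is stored as an integer so no CONVERT is required, and the table name submissions and column name event_date (standing in for DATE) are assumptions for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions (id INTEGER, name TEXT, location TEXT, "
    "event_date TEXT, complete INTEGER, sub_date TEXT)"
)
conn.executemany(
    "INSERT INTO submissions VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "user1", "loc1", "2017-09-15", 1, "2017-09-10"),
        (2, "user1", "loc1", "2017-09-15", 0, "2017-09-11"),
        (3, "user1", "loc2", "2017-08-13", 0, "2017-09-05"),
        (4, "user1", "loc2", "2017-08-13", 0, "2017-09-10"),
        (5, "user1", "loc3", "2017-07-14", 0, "2017-07-15"),
        (6, "user1", "loc3", "2017-08-25", 1, "2017-08-26"),
    ],
)

rows = conn.execute(
    """
    SELECT id
    FROM (
        SELECT t.*,
               -- 1 if any record in the partition is complete, else 0
               MAX(complete) OVER (PARTITION BY name, location, event_date)
                   AS complete_for_partition,
               -- submission date of the newest record in the partition
               MAX(sub_date) OVER (PARTITION BY name, location, event_date)
                   AS last_sub_date
        FROM submissions t
    ) a
    WHERE complete_for_partition = 0
      AND sub_date = last_sub_date
    ORDER BY id
    """
).fetchall()

open_ids = [r[0] for r in rows]
print(open_ids)  # [4, 5]
```

IDs 4 and 5 come back, which is the result the question asks for. ID 2 is dropped because its partition also contains the completed ID 1.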
As an aside, I would look at restructuring your tables (unless of course you have simplified to better explain this problem) as follows:
(Assuming the table in your examples is called Booking)
tblBooking
    BookingID
    PersonID
    LocationID
    Date
    Complete
    SubDate
tblPerson
    PersonID
    PersonName
tblLocation
    LocationID
    LocationName
tblType
    TypeID
    TypeName
tblBookingType
    BookingTypeID
    BookingID
    TypeID
    Amount
This way if you ever want to add Type3 or Type4 to your booking information, you don't need to alter your table layout

select and delete query based on older entries

I have an Excel sheet that is pushing data to an Access database using ADO. It is essentially putting invoices into a database. Sometimes I will revise my invoice and therefore the database will end up with the same invoice twice. I need to make a select and delete query that will find duplicates based on the invoice number, and delete the older version of the invoice (older record), for a simple example:
id invoice# total item datestamp
1 1234 456.29$ shoes 06/06/2016 03:51
2 1234 78.58$ boots 06/06/2016 03:51
3 1234 22.74$ scarf 06/06/2016 03:51
4 1234 539.34$ shoes 06/07/2016 12:44
5 1234 66.24$ pants 06/07/2016 12:44
As you can see, rows 4 and 5 are my new invoice for this customer. I want every previous version of the same invoice # to be deleted. Please note: they are not actually duplicates, only the invoice number is duplicated. The query needs to find duplicates based on invoice number and delete the rows whose datestamp is older than the most recent one for that invoice.
At that point it is way beyond me. I would appreciate the help.

Consider using a correlated aggregate subquery in WHERE clause:
DELETE *
FROM InvoiceTable
WHERE NOT datestamp IN
(SELECT Max(datestamp)
FROM InvoiceTable sub
WHERE sub.InvoiceNumber = InvoiceTable.InvoiceNumber)
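Before running a delete like this against real data, it can help to try it on a copy. The sketch below reproduces the correlated-subquery delete in in-memory SQLite through Python's sqlite3 module as a stand-in for Access (SQLite uses DELETE FROM rather than DELETE * FROM). The second invoice 5678 is a hypothetical row added to show that other invoices are left alone, and the ISO datestamp format is an assumption so that string comparison matches date order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE InvoiceTable (id INTEGER, InvoiceNumber INTEGER, "
    "total REAL, item TEXT, datestamp TEXT)"
)
conn.executemany(
    "INSERT INTO InvoiceTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1234, 456.29, "shoes", "2016-06-06 03:51"),
        (2, 1234, 78.58, "boots", "2016-06-06 03:51"),
        (3, 1234, 22.74, "scarf", "2016-06-06 03:51"),
        (4, 1234, 539.34, "shoes", "2016-06-07 12:44"),
        (5, 1234, 66.24, "pants", "2016-06-07 12:44"),
        (6, 5678, 10.00, "hat", "2016-06-01 09:00"),  # hypothetical row
    ],
)

# Delete every row whose datestamp is not the newest for its invoice number.
conn.execute(
    """
    DELETE FROM InvoiceTable
    WHERE datestamp NOT IN (
        SELECT MAX(datestamp)
        FROM InvoiceTable sub
        WHERE sub.InvoiceNumber = InvoiceTable.InvoiceNumber
    )
    """
)

remaining = [r[0] for r in conn.execute(
    "SELECT id FROM InvoiceTable ORDER BY id"
)]
print(remaining)  # [4, 5, 6]
```

Rows 1 to 3 (the older version of invoice 1234) are removed, while both lines of the revised invoice and the unrelated invoice survive.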

As I said, try being conservative and not deleting. Instead, select the rows that carry the maximum date stamp for a given invoice number:
SELECT
invoices.id, invoices.invoice, invoices.total, invoices.item, invoices.datestamp
FROM
invoices
INNER JOIN
-- group by invoice number, not id: id is unique per row
(SELECT
invoice, MAX(datestamp) AS maxdate
FROM
invoices
GROUP BY
invoice) lastinv
ON invoices.invoice = lastinv.invoice AND
invoices.datestamp = lastinv.maxdate
This is untested code, but it should pretty much do what you want. It is written as T-SQL, so you will have to adapt it for Microsoft Access.
