Combine duplicate rows to output a single row - sql

I have a table for a ticket system that has the following columns:
order<string> startDate<datetime> endDate<datetime>
Most rows have a unique 'order' value. However, when an 'order' crosses from one day to the next it is split into 2 rows: the 1st has an endDate of 5pm (end of day) and the 2nd has a startDate of 8am (start of day). Their other start and end dates are as needed.
Some orders can be >2 days long and so will be split into >2 rows.
example
order startDate endDate
1 2016-03-29 11:00:53.000 2016-03-29 17:00:53.000
1 2016-03-30 08:00:53.000 2016-03-30 12:48:53.000
2 2016-03-30 10:17:53.000 2016-03-30 13:08:53.000
would transform to
1 2016-03-29 11:00:53.000 2016-03-30 12:48:53.000
2 2016-03-30 10:17:53.000 2016-03-30 13:08:53.000
I need to combine all rows to give me a table of unique 'order' ids with starts and ends, i.e. one row per order with the lowest startDate and the highest endDate of its duplicates.
I plan to do this by creating a new table and populating it. I can choose one of the duplicate rows based on a certain value, but I'm not sure how to create a new row based on values from multiple rows.

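-- "order" is a reserved word, so it must be quoted (some dialects use [order] or `order`)
SELECT "order", MIN(startDate) AS startDate, MAX(endDate) AS endDate
FROM your_table_name
GROUP BY "order"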
There may be no need to create a new table for this. GROUP BY queries are extremely common in production usage, and there's no inherent harm in simply running that query to get the results you need when you need them.
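To check the GROUP BY approach against the example rows from the question, here is a minimal sketch that runs it in an in-memory SQLite database from Python. The table name `tickets` and the function name are made up for illustration, and the dates are stored as ISO-formatted text, which sorts chronologically; a real datetime column in another database would behave the same way for MIN and MAX.

```python
import sqlite3

def merge_split_orders(rows):
    """Collapse split rows into one row per order (earliest start, latest end)."""
    conn = sqlite3.connect(":memory:")
    # "order" is a reserved word, so it is quoted everywhere it appears.
    # The table name "tickets" is hypothetical.
    conn.execute('CREATE TABLE tickets ("order" TEXT, startDate TEXT, endDate TEXT)')
    conn.executemany("INSERT INTO tickets VALUES (?, ?, ?)", rows)
    merged = conn.execute(
        'SELECT "order", MIN(startDate), MAX(endDate) '
        'FROM tickets GROUP BY "order" ORDER BY "order"'
    ).fetchall()
    conn.close()
    return merged

# Example rows taken from the question.
sample_rows = [
    ("1", "2016-03-29 11:00:53.000", "2016-03-29 17:00:53.000"),
    ("1", "2016-03-30 08:00:53.000", "2016-03-30 12:48:53.000"),
    ("2", "2016-03-30 10:17:53.000", "2016-03-30 13:08:53.000"),
]
merged_orders = merge_split_orders(sample_rows)
for row in merged_orders:
    print(row)
```

Order 1 comes back as a single row spanning both days, and order 2 is unchanged.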

Related

Start date and end date assigning based on date ranges and value change

I have two tables, temp_N and temp_C. The table script and data are given below. I am using Teradata.
Table Script and data
First image is temp_N and second one is temp_C
Now I will try to explain my requirement. The key column for these two tables is 'nbr'. The two tables contain all the changes for a particular period of time (this is sample data; the tables are loaded daily based on the updates). I need to merge the two tables into one table with the date ranges assigned correctly. The expected result is given below. To explain the logic behind it: in the first row of the expected result, fstrtdate is the earliest date from the two tables, 2022-01-31, and the end date is 2022-07-10 because cpnrate changes on 2022-07-11. The second row starts on 2022-07-11 with the changed cpnrate. In the third row, ntr changes on 2022-08-31 and the data is updated accordingly. Please note that all of these are date fields with no time component; please ignore the timestamps in the screenshots.
How can I achieve this in SQL, if it is possible at all?
You can combine all the changes into a single table and order by effective start date (fstrtdate). Then you can compute effective end date as day prior to next change, and where one of the data values is NULL use LAG IGNORE NULLS to "bring down" the desired previous not-NULL value:
SELECT nbr, fstrtdate,
PRIOR(LEAD(fstrtdate) OVER (PARTITION BY nbr ORDER BY fstrtdate)) as fenddate,
COALESCE(ntr,LAG(ntr) IGNORE NULLS OVER (PARTITION BY nbr ORDER BY fstrtdate)) as ntr,
COALESCE(cpnrate,LAG(cpnrate) IGNORE NULLS OVER (PARTITION BY nbr ORDER BY fstrtdate)) as cpnrate
FROM (
SELECT nbr, fstrtdate, max(ntr) as ntr, max(cpnrate) as cpnrate
FROM (
SELECT nbr, fstrtdate, ntr, NULL (DECIMAL(9,2)) as cpnrate
from temp_n
UNION ALL
SELECT nbr, fstrtdate, NULL (DECIMAL(9,2)) as ntr, cpnrate
from temp_c
) AS COMBINED
GROUP BY 1, 2
) AS UNIQUESTART
ORDER BY fstrtdate;
The innermost SELECTs make the structure the same for data from both tables with NULLs for the values that come from the other table, so we can do a UNION to form one COMBINED derived table with rows for both types of change events. Note that you should explicitly assign datatype for those added NULL columns to match the datatype for the corresponding column in the other table; I somewhat arbitrarily chose DECIMAL(9,2) above since I didn't know the real data types. They can't be INT as in the example, though, since that would truncate the decimal part. There's no reason to carry along the original fenddate; a new non-overlapping fenddate will be computed in the outer query.
The intermediate GROUP BY is only to combine what would otherwise be two rows in the special case where both ntr and cpnrate changed on the same day for the same nbr. That case is not present in the example data - the dates are already unique - but it might be necessary to do this when processing the full table. The syntax requires an aggregate function, but there should be at most two rows for a (nbr, fstrtdate) group; and when there are two rows, in each of the other columns one row has NULL and the other row does not. In that case either MIN or MAX will return the non-NULL value.
In the outer query, the COALESCEs will return the value for that column from the current row in the UNIQUESTART derived table if it's not NULL, otherwise LAG is used to obtain the value from a previous row.
The first two rows in the result won't match the screenshot above but they do accurately reflect the data provided - specifically, the example does not identify a cpnrate for any date prior to 2022-05-11.
nbr fstrtdate fenddate ntr cpnrate
233 2022-01-31 2022-05-10 311,000.00 NULL
233 2022-05-11 2022-07-10 311,000.00 3.31
... - - - -
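The same merge-and-carry-forward logic can be sketched outside the database. The Python below is only an illustration for a single nbr, not the Teradata query itself: the function name is made up, and the values attached to the 2022-07-11 and 2022-08-31 changes are hypothetical, since the original data is only available as screenshots. It merges the two change logs by start date, carries the last known value forward the way LAG ... IGNORE NULLS does, and ends each range the day before the next change.

```python
from datetime import date, timedelta

def merge_changes(ntr_changes, cpnrate_changes):
    """Merge two change logs for one nbr into non-overlapping date ranges.

    Each input is a list of (start_date, value) pairs. Returns a list of
    (start, end, ntr, cpnrate); end is None for the open-ended last range.
    """
    # One event per start date, holding whichever values changed that day
    # (this plays the role of the UNION ALL plus GROUP BY).
    events = {}
    for start, value in ntr_changes:
        events.setdefault(start, {})["ntr"] = value
    for start, value in cpnrate_changes:
        events.setdefault(start, {})["cpnrate"] = value

    starts = sorted(events)
    ranges = []
    ntr = cpnrate = None
    for i, start in enumerate(starts):
        # Keep the previous value when this event did not change the column.
        ntr = events[start].get("ntr", ntr)
        cpnrate = events[start].get("cpnrate", cpnrate)
        # The range ends the day before the next change, if there is one.
        end = starts[i + 1] - timedelta(days=1) if i + 1 < len(starts) else None
        ranges.append((start, end, ntr, cpnrate))
    return ranges

# 320000.00 and 3.50 are hypothetical values; the dates come from the question.
ntr_changes = [(date(2022, 1, 31), 311000.00), (date(2022, 8, 31), 320000.00)]
cpnrate_changes = [(date(2022, 5, 11), 3.31), (date(2022, 7, 11), 3.50)]
ranges = merge_changes(ntr_changes, cpnrate_changes)
for r in ranges:
    print(r)
```

The first two ranges agree with the result rows shown above, including the missing cpnrate before 2022-05-11.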

Get latest cumulative sales amount for various evaluation dates in SAS

I have a list of evaluation dates stored in a table, datelist. It's technically two columns, start_date and end_date, for each evaluation period. The end_date will definitely need to be used, but the start_date may not. I only care about periods that are completed, so, for example, the period from 2016-01-01 to 2016-07-01 is in progress but not complete. So, it's not in the table.
start_date end_date
2012-01-01 2012-07-01
2012-07-01 2013-01-01
2013-01-01 2013-07-01
2013-07-01 2014-01-01
2014-01-01 2014-07-01
2014-07-01 2015-01-01
2015-01-01 2015-07-01
2015-07-01 2016-01-01
I have a separate table that lists cumulative sales by customer, sales_table with three columns, customer_ID, cumul_sales, transaction_date. For example, let's say customer 4793 bought $100 worth of stuff on 2/14/2014 and $200 worth of stuff on 3/30/2014 and $75 on 7/27/2014, the table will have the following rows:
customer_ID cumul_sales transaction_date
4793 100 2014-02-14
4793 300 2014-03-30
4793 375 2014-07-27
Now, for each evaluation date and for each customer, I want to know what's the cumulative sales as of the evaluation date for that customer? If a customer hadn't purchased anything by an evaluation date, then I wouldn't want a row for that customer at all corresponding to said evaluation date. This would be stored in a new table, called sales_by_eval, with columns customer_ID, cumul_sales, eval_date. For the example customer above, I'd have the following rows:
customer_ID cumul_sales eval_date
4793 300 2014-07-01
4793 375 2015-01-01
4793 375 2015-07-01
4793 375 2016-01-01
I can do this, but I'm looking to do it in an efficient way so I don't have to read through the data once for each evaluation date. If there are a lot of rows in the sales_table and 40 evaluation dates, that would be a large waste to read through the data 40 times, once for each evaluation date. Would it be possible with only one read through the data, for example?
The basic idea of the current process is a macro loop that loops once per evaluation period. Each loop has a data step that creates a new table (one table per loop) to check each transaction to see if it has occurred before or on the end_date of that corresponding evaluation period. That is, each table has all the transactions that occur before or on that evaluation date but none of the ones that occur after. Then, a later data step uses "last." to get only the last transaction for each customer before that evaluation date. And, finally, all the various tables created are put back together in another data step where they are all listed in the SET statement.
This is in SAS, so anything SAS can do, including SQL and macros, is fine with me.
In SAS, when you use a GROUP BY clause, you can still select variables that are not part of the grouping, like this:
proc sql;
create table sales_by_eval as
select s.customer_ID, s.cumul_sales, d.end_date as eval_date
from datelist d
join sales_table s
on d.end_date >= s.transaction_date
group by s.customer_ID, d.end_date
having max(s.transaction_date) = s.transaction_date
;
quit;
This means that SAS remerges the group summary back onto the detail rows, so every joined row is kept together with the statistics for its group. To limit the result to the latest cumulative value, use a HAVING condition that keeps only those records whose transaction_date equals max(transaction_date) within the s.customer_ID, d.end_date group.
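Outside SAS, most SQL engines do not remerge summary values this way, so here is a rough sketch of the same idea written with a correlated subquery and run in SQLite from Python. The function name is made up, the dates are stored as ISO text, and the comparison treats a transaction on the evaluation date itself as included, as the question describes; it is meant to check the logic on the example customer, not to settle the performance question.

```python
import sqlite3

def sales_by_eval(eval_dates, sales):
    """Return (customer_ID, cumul_sales, eval_date) using the latest
    transaction on or before each evaluation date."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE datelist (end_date TEXT)")
    conn.execute(
        "CREATE TABLE sales_table "
        "(customer_ID INTEGER, cumul_sales INTEGER, transaction_date TEXT)"
    )
    conn.executemany("INSERT INTO datelist VALUES (?)", [(d,) for d in eval_dates])
    conn.executemany("INSERT INTO sales_table VALUES (?, ?, ?)", sales)
    rows = conn.execute(
        """
        SELECT s.customer_ID, s.cumul_sales, d.end_date AS eval_date
        FROM datelist d
        JOIN sales_table s ON s.transaction_date <= d.end_date
        WHERE s.transaction_date = (
            -- latest transaction for this customer as of this evaluation date
            SELECT MAX(s2.transaction_date)
            FROM sales_table s2
            WHERE s2.customer_ID = s.customer_ID
              AND s2.transaction_date <= d.end_date
        )
        ORDER BY s.customer_ID, d.end_date
        """
    ).fetchall()
    conn.close()
    return rows

# Evaluation dates and transactions taken from the question.
eval_dates = ["2012-07-01", "2013-01-01", "2013-07-01", "2014-01-01",
              "2014-07-01", "2015-01-01", "2015-07-01", "2016-01-01"]
sales = [(4793, 100, "2014-02-14"), (4793, 300, "2014-03-30"), (4793, 375, "2014-07-27")]
result = sales_by_eval(eval_dates, sales)
for row in result:
    print(row)
```

Evaluation dates before the customer's first purchase produce no row, which is what the question asks for.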

Possible to calculate iterated count of timestamps relative to one another?

This question is a bit complicated but to make it as simple as possible:
I have a list of timestamps (it is in the millions but let's say for simplicity sake it is much smaller):
order_times
-----------
2014-10-11 15:00:00
2014-10-11 15:02:00
2014-10-11 15:03:31
2014-10-11 15:07:00
2014-10-11 16:00:00
2014-10-11 16:04:00
I am trying to build a query (in PostgreSQL) that would allow me to determine the number of times an order_time occurs within 10 minutes of the 2 order_times prior to it (and no more).
In the sample data above:
first timestamp is considered 0 as there were no orders before it
second timestamp is considered 0 as it was within 10 minutes of the one prior, but there was only 1 order before it
third timestamp is considered 1 because there were at least 2 orders within the 10 minutes before it
(and so on)...
I hope this is clear!
You don't need to look at the immediately preceding order, just the one 2 prior to each. If that one is within 10 minutes, then the one between them will be too.
The best way is to get the data that matters to you onto a single row, so you can do set operations on it. For that, use the window function ROW_NUMBER() and a self join. This is the MS SQL way of doing what you want.
WITH T1 AS (
    SELECT ID, Order_Time,
           ROW_NUMBER() OVER (ORDER BY Order_Time) AS RowNumber
    FROM myTest
)
SELECT T1.ID, T1.Order_Time, T2.ID AS CompareID, T2.Order_Time AS CompareTime
FROM T1
LEFT OUTER JOIN T1 AS T2 ON T1.RowNumber - 2 = T2.RowNumber
WHERE DATEDIFF(n, T2.Order_Time, T1.Order_Time) <= 10
First we create a query that has the row numbers, then use it as an inline table to do a self join to build a row that contains each order, and the one that happened 2 orders prior to it. Then just do a simple date comparison to select out the rows you want.
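The "compare each row with the one 2 positions earlier" idea can be checked on the sample timestamps with a small Python sketch. The function name is made up for illustration. Note one difference from the query above: this compares exact elapsed time, whereas SQL Server's DATEDIFF counts minute boundaries crossed, so the two can disagree for gaps that are slightly over 10 minutes.

```python
from datetime import datetime, timedelta

def flag_third_in_window(order_times, window=timedelta(minutes=10)):
    """Return 1 for each order that has at least 2 earlier orders within
    `window` before it, else 0 (in chronological order)."""
    times = sorted(order_times)
    flags = []
    for i, current in enumerate(times):
        # Only the order 2 positions back matters: if it is inside the
        # window, the one in between is inside it as well.
        has_two_prior = i >= 2 and current - times[i - 2] <= window
        flags.append(1 if has_two_prior else 0)
    return flags

# Sample timestamps from the question.
raw = [
    "2014-10-11 15:00:00",
    "2014-10-11 15:02:00",
    "2014-10-11 15:03:31",
    "2014-10-11 15:07:00",
    "2014-10-11 16:00:00",
    "2014-10-11 16:04:00",
]
order_times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in raw]
flags = flag_third_in_window(order_times)
print(flags)       # one flag per timestamp
print(sum(flags))  # number of orders that qualify
```

The first two flags are 0 and the third is 1, matching the walkthrough in the question.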

Join to Calendar Table - 5 Business Days

So this is somewhat of a common question on here but I haven't found an answer that really suits my specific needs. I have 2 tables. One has a list of ProjectClosedDates. The other table is a calendar table that goes through like 2025 which has columns for if the row date is a weekend day and also another column for is the date a holiday.
My end goal is to find out based on the ProjectClosedDate, what date is 5 business days post that date. My idea was that I was going to use the Calendar table and join it to itself so I could then insert a column into the calendar table that was 5 Business days away from the row-date. Then I was going to join the Project table to that table based on ProjectClosedDate = RowDate.
If I was just going to check the actual business-date table for one record, I could use this:
SELECT actual_date from
(
SELECT actual_date, ROW_NUMBER() OVER(ORDER BY actual_date) AS Row
FROM DateTable
WHERE is_holiday= 0 and actual_date > '2013-12-01'
) X
WHERE row = 65
from here:
sql working days holidays
However, this is just one date and I need a column of dates based off of each row. Any thoughts of what the best way to do this would be? I'm using SQL-Server Management Studio.
Completely untested and not thought through:
If the concept of "business days" is common and important in your system, you could add a column "Business Day Sequence" to your table. The column would be a simple unique sequence, incremented by one for every business day and null for every day not counting as a business day.
The data would look something like this:
Date BDAY_SEQ
========== ========
2014-03-03 1
2014-03-04 2
2014-03-05 3
2014-03-06 4
2014-03-07 5
2014-03-08
2014-03-09
2014-03-10 6
Now it's a simple task to find the Nth business day from any date.
You simply do a self join with the calendar table, adding the offset in the join condition.
select a.actual_date
,b.actual_date as nth_business_day
from DateTable a
join DateTable b on(
b.bday_seq = a.bday_seq + 5
);
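Here is a small sketch of how the sequence column could be filled and the self join tried out, using Python and an in-memory SQLite database. The helper names are made up, and the calendar treats only Saturdays and Sundays as non-business days unless holidays are passed in; a real calendar table would take that from its weekend and holiday columns.

```python
import sqlite3
from datetime import date, timedelta

def build_calendar(first_day, num_days, holidays=()):
    """Return (iso_date, bday_seq) rows; bday_seq is None on non-business days."""
    rows, seq = [], 0
    for offset in range(num_days):
        day = first_day + timedelta(days=offset)
        is_business_day = day.weekday() < 5 and day not in holidays
        if is_business_day:
            seq += 1
        rows.append((day.isoformat(), seq if is_business_day else None))
    return rows

def nth_business_day_map(calendar_rows, n=5):
    """Map each business day to the date n business days later."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE DateTable (actual_date TEXT, bday_seq INTEGER)")
    conn.executemany("INSERT INTO DateTable VALUES (?, ?)", calendar_rows)
    pairs = conn.execute(
        "SELECT a.actual_date, b.actual_date "
        "FROM DateTable a "
        "JOIN DateTable b ON b.bday_seq = a.bday_seq + ? "
        "ORDER BY a.actual_date",
        (n,),
    ).fetchall()
    conn.close()
    return dict(pairs)

# 2014-03-03 is a Monday; 19 days covers three working weeks.
calendar_rows = build_calendar(date(2014, 3, 3), 19)
five_days_later = nth_business_day_map(calendar_rows, n=5)
print(five_days_later["2014-03-03"])
print(five_days_later["2014-03-07"])
```

Dates with a null sequence (weekends here) never match the join, so a project closed on a non-business day would need an extra rule, for example starting from the previous or next business day.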

making sure "expiration_date - X" falls on a valid "date_of_price" (if not, use the next valid date_of_price)

I have two tables. The first table has two columns: ID and date_of_price. The date_of_price field skips weekend days and bank holidays when markets are closed.
table: trading_dates
ID date_of_price
1 8/7/2008
2 8/8/2008
3 8/11/2008
4 8/12/2008
The second table also has two columns: ID and expiration_date. The expiration_date field is the one day in each month when options expire.
table: expiration_dates
ID expiration_date
1 9/20/2008
2 10/18/2008
3 11/22/2008
I would like to do a query that subtracts a certain number of days from the expiration dates, with the caveat that the resulting date must be a valid date_of_price. If not, then the result should be the next valid date_of_price.
For instance, say we are subtracting 41 days from the expiration_date. 41 days prior to 9/20/2008 is 8/10/2008. Since 8/10/2008 is not a valid date_of_price, we have to skip 8/10/2008. The query should return 8/11/2008, which is the next valid date_of_price.
Any advice would be appreciated! :-)
This will subtract 41 days from the date with ID = 1 in expiration_dates and find the first date_of_price on or after the result.
Query
Modify the ID and the 41 for different dates.
SELECT date_of_price
FROM trading_dates
WHERE date_of_price >= (
SELECT DATE_SUB(expiration_date, INTERVAL 41 DAY)
FROM expiration_dates
WHERE ID=1
)
ORDER BY date_of_price ASC
LIMIT 1;
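To get a result for every expiration date in one pass instead of one ID at a time, the same lookup can be written as a correlated subquery. The sketch below runs it in SQLite from Python, so it uses SQLite's date() modifier rather than MySQL's DATE_SUB and assumes the dates are stored in ISO format; the function name is made up for illustration.

```python
import sqlite3

def first_trading_date_before_expiry(trading_dates, expiration_dates, days_before):
    """For each expiration, return the first date_of_price on or after
    (expiration_date - days_before), or None if no such trading date exists."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trading_dates (ID INTEGER PRIMARY KEY, date_of_price TEXT)")
    conn.execute("CREATE TABLE expiration_dates (ID INTEGER PRIMARY KEY, expiration_date TEXT)")
    conn.executemany("INSERT INTO trading_dates VALUES (?, ?)", trading_dates)
    conn.executemany("INSERT INTO expiration_dates VALUES (?, ?)", expiration_dates)
    rows = conn.execute(
        """
        SELECT e.ID,
               (SELECT MIN(t.date_of_price)
                FROM trading_dates t
                WHERE t.date_of_price >= date(e.expiration_date, ?)) AS adjusted_date
        FROM expiration_dates e
        ORDER BY e.ID
        """,
        (f"-{days_before} days",),
    ).fetchall()
    conn.close()
    return rows

# Sample rows from the question, rewritten as ISO dates.
trading_dates = [(1, "2008-08-07"), (2, "2008-08-08"), (3, "2008-08-11"), (4, "2008-08-12")]
expiration_dates = [(1, "2008-09-20"), (2, "2008-10-18"), (3, "2008-11-22")]
adjusted = first_trading_date_before_expiry(trading_dates, expiration_dates, 41)
for row in adjusted:
    print(row)
```

With only the four sample trading dates loaded, the later expirations have no matching date_of_price and come back as None.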
Performance
To get the best performance from this query, your trading_dates table should have a clustered index on date_of_price (this will make the ORDER BY a no-op) and of course a primary key index on expiration_dates.ID (to look up the expiration date quickly).
Don't put in the clustered index blindly though. If you update or insert values more often than you look up expirations like this, then don't put it in since all your inserts/updates will have an added overhead to keep the clustered index.