SQL statement to match dates that are the closest? - sql

I have the following table, let's call it Names:
Name Id Date
Dirk 1 27-01-2015
Jan 2 31-01-2015
Thomas 3 21-02-2015
Next I have another table called Consumption:
Id Date Consumption
1 26-01-2015 30
1 01-01-2015 20
2 01-01-2015 10
2 05-05-2015 20
I think doing this in SQL will be the fastest option, since the table contains about 1.5 million rows.
The problem is as follows: I would like to match each Id from the Names table with the row in the Consumption table whose date is closest, i.e. where the difference between the two dates is the smallest. For example, Dirk consumes about 30 on 27-01-2015, because 26-01-2015 is the closest date. If two dates have the same difference, I would like to calculate the average consumption over those two dates.
While I know how to join, I do not know how to code the difference part.
Thanks.
DBMS is Microsoft SQL Server 2012.
I believe that my question differs from the one mentioned in the comments, because it is much more complicated since it involves comparison of dates between two tables rather than having one date and comparing it with the rest of the dates in the table.

This is how you could do it in SQL Server:
SELECT Id, Name, AVG(Consumption)
FROM (
SELECT n.Id, Name, Consumption,
RANK() OVER (PARTITION BY n.Id
ORDER BY ABS(DATEDIFF(d, n.[Date], c.[Date]))) AS rnk
FROM Names AS n
INNER JOIN Consumption AS c ON n.Id = c.Id ) t
WHERE t.rnk = 1
GROUP BY Id, Name
Using RANK with PARTITION BY n.Id and ORDER BY ABS(DATEDIFF(d, n.[Date], c.[Date])) you can locate all matching records per Id: all records with the smallest difference in days are going to have rnk = 1.
Then, using AVG in the outer query, you are calculating the average value of Consumption between all matching records.
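For readers who want to check the logic outside the database, the rank-then-average step can be sketched in Python. This is only an illustration using the sample rows from the question; the function name closest_consumption is made up for this sketch and is not part of the query above.

```python
from collections import defaultdict
from datetime import date

# Sample rows from the question: (Name, Id, Date) and (Id, Date, Consumption).
names = [
    ("Dirk", 1, date(2015, 1, 27)),
    ("Jan", 2, date(2015, 1, 31)),
    ("Thomas", 3, date(2015, 2, 21)),
]
consumption = [
    (1, date(2015, 1, 26), 30),
    (1, date(2015, 1, 1), 20),
    (2, date(2015, 1, 1), 10),
    (2, date(2015, 5, 5), 20),
]


def closest_consumption(names, consumption):
    """Average the consumption over the rows whose date is closest, per Id."""
    by_id = defaultdict(list)
    for cid, cdate, value in consumption:
        by_id[cid].append((cdate, value))

    result = {}
    for name, nid, ndate in names:
        rows = by_id.get(nid)
        if not rows:
            continue  # like the INNER JOIN: Ids without consumption rows drop out
        # Smallest absolute difference in days (the rows that would get rnk = 1).
        best = min(abs((cdate - ndate).days) for cdate, _ in rows)
        tied = [v for cdate, v in rows if abs((cdate - ndate).days) == best]
        result[name] = sum(tied) / len(tied)  # the AVG over all tied rows
    return result


result = closest_consumption(names, consumption)
print(result)
```

With the sample data this gives 30 for Dirk and 10 for Jan; Thomas has no Consumption rows and drops out, just as with the INNER JOIN. Note that in SQL Server, AVG over an integer column returns an integer, so Consumption may need to be cast to a decimal type first if ties should average to a fractional value.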

Related

Impala get the difference between 2 dates excluding weekends

I'm trying to get the day difference between 2 dates in Impala but I need to exclude weekends.
I know it should be something like this but I'm not sure how the weekend piece would go...
DATEDIFF(resolution_date,created_date)
Thanks!
One approach to such a task is to enumerate every day in the range, and then filter out the weekends before counting.
Some databases have specific features to generate date series, while others offer recursive common table expressions. Impala does not support recursive queries, so we need to look at alternative solutions.
If you have a table with at least as many rows as the maximum number of days in a range, you can use row_number() to offset the starting date, and then conditional aggregation to count working days.
Assuming that your table is called mytable, with column id as primary key, and that the big table is called bigtable, you would do:
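select
    t.id,
    -- dayofweek() returns 1 for Sunday and 7 for Saturday, so 2..6 is Monday to Friday
    sum(
        case when dayofweek(date_add(t.created_date, n.rn)) between 2 and 6
        then 1 else 0 end
    ) no_days
from mytable t
-- n.rn enumerates the day offsets 0, 1, 2, ... (one per row of bigtable)
inner join (select row_number() over(order by 1) - 1 rn from bigtable) n
    on t.resolution_date > date_add(t.created_date, n.rn)
group by t.id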
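The enumerate-then-filter idea can be checked with a small Python sketch. It is an illustration only, not Impala code: the helper name working_days is hypothetical, and it assumes the same convention as the query above, counting from created_date up to but not including resolution_date.

```python
from datetime import date, timedelta


def working_days(created, resolved):
    """Count Monday-to-Friday days in the half-open range [created, resolved)."""
    count = 0
    day = created
    while day < resolved:          # same bound as resolution_date > created_date + rn
        if day.weekday() < 5:      # Python: Monday = 0 ... Friday = 4
            count += 1
        day += timedelta(days=1)
    return count


# Monday 26 Jan 2015 up to (not including) Monday 2 Feb 2015: five working days.
print(working_days(date(2015, 1, 26), date(2015, 2, 2)))
```

Comparing a few rows against a sketch like this is a quick way to confirm that the day-of-week numbering in the SQL matches what you expect.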

How many customers upgraded from Product A to Product B?

I have a "daily changes" table that records when a customer "upgrades" or "downgrades" their membership level. In the table, let's say field 1 is customer ID, field 2 is membership type and field 3 is the date of change. Customers 123 and ABC each have two rows in the table. Values in field 1 (ID) are the same, but values in field 2 (TYPE) and 3 (DATE) are different. I'd like to write a SQL query to tell me how many customers "upgraded" from membership type 1 to membership type 2 and how many customers "downgraded" from membership type 2 to membership type 1 in any given time frame.
The table also shows other types of changes. To identify the records with changes in the membership type field, I've created the following code:
SELECT *
FROM member_detail_daily_changes_new
WHERE customer IN (
SELECT customer
FROM member_detail_daily_changes_new
GROUP BY customer
HAVING COUNT(distinct member_type_cd) > 1)
I'd like to see an end report which tells me:
For Fiscal 2018,
X,XXX customers moved from Member Type 1 to Member Type 2 and
X,XXX customers moved from Member Type 2 to Member type 1
Sounds like a good time to use the LEAD() analytic function to look ahead to a given customer's next member_Type_Code, compare it to the current record, evaluate whether that's an upgrade or a downgrade, and then sum the results.
WITH CTE AS (
    SELECT case when lead(Member_Type_Code) over (partition by Customer order by date asc) > member_Type_Code then 1 else 0 end as Upgrade
         , case when lead(Member_Type_Code) over (partition by Customer order by date asc) < member_Type_Code then 1 else 0 end as DownGrade
    FROM member_detail_daily_changes_new
    WHERE Date between '20190101' and '20190201')
SELECT sum(Upgrade) upgrades, sum(downgrade) downgrades
FROM CTE
Using my sample data, this gives us:
+----+----------+------------+
| | upgrades | downgrades |
+----+----------+------------+
| 1 | 3 | 2 |
+----+----------+------------+
I'm not sure if SQL express on rex tester just doesn't support the sum() on the analytic itself which is why I had to add the CTE or if that's a rule in non-SQL express versions too.
Some other notes:
I let the system implicitly cast the dates in the where clause
I assume the member_Type_Code itself tells me if it's an upgrade or downgrade which long term probably isn't right. Say we add membership type 3 and it goes between 1 and 2... now what... So maybe we need a decimal number outside of the Member_Type_Code so we can handle future memberships and if it's an upgrade/downgrade or a lateral...
I assumed all upgrades/downgrades are counted and a user can be counted multiple times if membership changed that often in time period desired.
I assume an upgrade/downgrade can't occur on the same date/time. Otherwise the sorting for lead may not work right. (but if it's a timestamp field we shouldn't have an issue)
So how does this work?
We use a Common table expression (CTE) to generate the desired evaluations of downgrade/upgrade per customer. This could be done in a derived table as well in-line but I find CTE's easier to read; and then we sum it up.
Lead(Member_Type_Code) over (partition by customer order by date asc) does the following
It organizes the data by customer and then sorts it by date in ascending order.
So we end up getting all the same customers records in subsequent rows ordered by date. Lead(field) then starts on record 1 and Looks ahead to record 2 for the same customer and returns the Member_Type_Code of record 2 on record 1. We then can compare those type codes and determine if an upgrade or downgrade occurred. We then are able to sum the results of the comparison and provide the desired totals.
And now we have a long winded explanation for a very small query :P
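The look-ahead comparison described above can also be sketched in Python to make the row pairing explicit. The rows below are hypothetical sample data (not the data behind the result table above), and count_changes is a name invented for this sketch.

```python
from itertools import groupby
from operator import itemgetter


def count_changes(rows):
    """rows are (customer, member_type_code, change_date) tuples.

    Mirrors LEAD(...) OVER (PARTITION BY customer ORDER BY date): each row is
    compared with the next row of the same customer.
    """
    upgrades = downgrades = 0
    ordered = sorted(rows, key=itemgetter(0, 2))            # partition, then order by date
    for _, group in groupby(ordered, key=itemgetter(0)):
        codes = [code for _, code, _ in group]
        # zip pairs each row with its successor; the last row has no successor,
        # which corresponds to LEAD() returning NULL.
        for current, following in zip(codes, codes[1:]):
            if following > current:
                upgrades += 1
            elif following < current:
                downgrades += 1
    return upgrades, downgrades


# Hypothetical rows; ISO-formatted date strings sort chronologically.
rows = [
    ("123", 1, "2019-01-05"),
    ("123", 2, "2019-01-20"),
    ("ABC", 2, "2019-01-03"),
    ("ABC", 1, "2019-01-15"),
    ("ABC", 2, "2019-01-28"),
]
print(count_changes(rows))
```

Here customer 123 contributes one upgrade, and customer ABC contributes one downgrade followed by one upgrade, so a customer who changes several times is counted several times, as noted above.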
You want to use lag() for this, but you need to be careful about the date filtering. So, I think you want:
SELECT prev_membership_type, membership_type,
COUNT(*) as num_changes,
COUNT(DISTINCT customer_id) as num_members
FROM (SELECT mddc.*,
LAG(mddc.membership_type) OVER (PARTITION BY mddc.customer_id ORDER BY mddc.date) as prev_membership_type
FROM member_detail_daily_changes_new mddc
) mddc
WHERE prev_membership_type <> membership_type AND
date >= '2018-01-01' AND
date < '2019-01-01'
GROUP BY membership_type, prev_membership_type;
Notes:
The filtering on date needs to occur after the calculation of lag().
This takes into account that members may have a certain type in 2017 and then change to a new type in 2018.
The date filtering is compatible with indexes.
Two values are calculated. One is the overall number of changes. The other counts each member only once for each type of change.
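Why the date filter has to come after the lag() calculation can be seen in a small Python sketch. The two rows are hypothetical, and changes_in_window is a name made up for this illustration.

```python
def changes_in_window(rows, start, end):
    """rows are (customer_id, membership_type, date) tuples with ISO date strings.

    Computes the previous type over ALL rows first (like LAG), and only then
    keeps changes whose date falls in [start, end).
    """
    changes = []
    previous = {}
    for customer, mtype, day in sorted(rows, key=lambda r: (r[0], r[2])):
        prev = previous.get(customer)      # None on a customer's first row, like LAG()
        previous[customer] = mtype
        if prev is not None and prev != mtype and start <= day < end:
            changes.append((customer, prev, mtype))
    return changes


# Hypothetical customer: type 1 in 2017, changed to type 2 in 2018.
rows = [("123", 1, "2017-11-02"), ("123", 2, "2018-03-10")]

# Filter after computing the previous type: the 2018 change is found.
late_filter = changes_in_window(rows, "2018-01-01", "2019-01-01")

# Filter first: the 2017 row is gone, so the 2018 row has no predecessor.
only_2018 = [r for r in rows if "2018-01-01" <= r[2] < "2019-01-01"]
early_filter = changes_in_window(only_2018, "2018-01-01", "2019-01-01")

print(late_filter, early_filter)
```

Filtering first loses the 2017 row, and with it the information that the 2018 row was a change at all.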
With conditional aggregation after self joining the table:
select
2018 fiscal,
sum(case when m.member_type_cd > t.member_type_cd then 1 else 0 end) upgrades,
sum(case when m.member_type_cd < t.member_type_cd then 1 else 0 end) downgrades
from member_detail_daily_changes_new m inner join member_detail_daily_changes_new t
on
t.customer = m.customer
and
t.changedate = (
select max(changedate) from member_detail_daily_changes_new
where customer = m.customer and changedate < m.changedate
)
where year(m.changedate) = 2018
This will work even if there are more than 2 types of membership level.

Understanding a Correlated Subquery

I want to create a query that returns the most recent date in a date field and the highest value of an integer field for each "assessment" record. What I think is required is a correlated subquery using the MAX function.
Example data is shown below. The date field can have duplicate dates for each assessment, but within each group of duplicate dates the integer field has a different value, e.g.:
1256 2/6/14 0
1256 2/6/14 1
1256 1/6/14 0
4534 3/6/14 0
4534 3/6/14 1
4534 3/6/14 2
select assessment, Max(correctnum) maxofcorrectnum, dateeffect
from lraassm outerassm
where dateeffect =
(select MAX(dateeffect) maxofdateeffect
from pthdbo.lraassm innerassm
where innerassm.assessment = outerassm.assessment
group by innerassm.assessment)
group by assessment, dateeffect
My theory is that the inner query executes and gives the outer query the criterion for its dateeffect field; the outer query then returns the maximum of the correctnum field for that dateeffect, together with the corresponding assessment and dateeffect.
Could someone please confirm this is correct? How does the subquery handle the rows? What other ways are there to solve this problem? Thanks.
Your query is doing the right thing, but granted, the correlated subquery is a little difficult to understand. What the subquery does is, it filters the records based on assessment from the outer query and then returns the maximum dateeffect for that assessment. In fact, you don't need the group by clause on the correlated query.
These types of queries are very common when working with data in ERP systems, when you're only interested in the "latest" records, etc. This is also known as a "top segment" type of query (which the query optimizer is sometimes able to figure out by itself). I've found that on SQL Server 2005 or newer, it is a lot easier to use the ROW_NUMBER() function. The following query should return the same as yours, namely one record from lraassm for each assessment, the one that has the highest value of dateeffect and correctnum.
select * from (
select
assessment, dateeffect, correctnum,
ROW_NUMBER() OVER (
PARTITION BY assessment
ORDER BY dateeffect DESC, correctnum DESC
) AS segment
from lraassm) AS innerQuery
where segment = 1
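To see which row each assessment ends up with, the same "latest date, then highest correctnum" rule can be sketched in Python using the sample rows from the question. The dates are assumed here to be day/month/year, which the question does not state, and the function name latest_best is invented for this sketch.

```python
from datetime import date

# (assessment, dateeffect, correctnum); dates assumed to be day/month/year.
rows = [
    (1256, date(2014, 6, 2), 0),
    (1256, date(2014, 6, 2), 1),
    (1256, date(2014, 6, 1), 0),
    (4534, date(2014, 6, 3), 0),
    (4534, date(2014, 6, 3), 1),
    (4534, date(2014, 6, 3), 2),
]


def latest_best(rows):
    """Keep, per assessment, the row that would get segment = 1.

    Tuples compare element by element, so (dateeffect, correctnum) ordering is
    the same as ORDER BY dateeffect DESC, correctnum DESC taking the first row.
    """
    best = {}
    for assessment, dateeffect, correctnum in rows:
        key = (dateeffect, correctnum)
        if assessment not in best or key > best[assessment]:
            best[assessment] = key
    return best


result = latest_best(rows)
print(result)
```

For the sample data this keeps correctnum 1 on the later date for assessment 1256 and correctnum 2 for assessment 4534.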
This is the query I worked out using my tables. But it will get you on the right track and you should be able to substitute your fields/tables in.
Select * from Decode
where updated_time = (Select MAX(updated_time) from DECODE)
That query gives you every record that has the most recent updated_time. The next query returns the greatest entry_id value as well as the most recent updated_time from those records:
Select MAX(entry_id), updated_time from Decode
where updated_time = (Select MAX(updated_time) from DECODE)
group by updated_time
The result is 2 columns 1 record, 1st column is the Maximum value of entry id, the second is the most recent updated_time. Is that what you wanted to return?

Obtain maximum row_number inside a cross apply

I am having trouble in calculating the maximum of a row_number in my sql case.
I will explain it directly on the SQL Fiddle example, as I think it will be faster to understand: SQL Fiddle
Columns 'OrderNumber', 'HourMinute' and 'Code' are just to represent my table and hence, should not be relevant for coding purposes
Column 'DateOnly' contains the dates
Column 'Phone' contains the phones of my customers
Column 'Purchases' contains the number of times customers have bought in the last 12 months. Note that this value is provided for each date, so the 12 months time period is relative to the date we're evaluating.
Finally, the column I am trying to produce is the 'PREVIOUSPURCHASES' which counts the number of times the figure provided in the column 'Purchases' has appeared in the previous 12 months (for each phone).
You can see on the SQL Fiddle example what I have achieved so far. The column 'PREVIOUSPURCHASES' is producing what I want, however, it is also producing lower values (e.g. only the maximum one is the one I need).
For instance, you can see that rows 4 and 5 are duplicated, one with a 'PREVIOUSPURCHASES' of 1 and the other with 2. I don't want to have the 4th row, in this case.
I have thought about replacing the row_number with something like max(row_number), but I haven't been able to produce it (I have already looked at similar posts on Stack Overflow...).
This should be implemented in SQL Server 2012.
Thanks in advance.
I'm not sure what kind of result set you want to see but is there anything wrong with what's returned with this?
SELECT c.OrderNumber, c.DateOnly, c.HourMinute, c.Code, c.Phone, c.Purchases, MAX(o.PreviousPurchases)
FROM cte c CROSS APPLY (
SELECT t2.DateOnly, t2.Phone,t2.ordernumber, t2.Purchases, ROW_NUMBER() OVER(PARTITION BY c.DateOnly ORDER BY t2.DateOnly) AS PreviousPurchases
FROM CurrentCustomers_v2 t2
WHERE c.Phone = t2.Phone AND t2.purchases<=c.purchases AND DATEDIFF(DAY, t2.DateOnly, c.DateOnly) BETWEEN 0 AND 365
) o
WHERE c.OrderNumber = o.OrderNumber
GROUP BY c.OrderNumber, c.DateOnly, c.HourMinute, c.Code, c.Phone, c.Purchases
ORDER BY c.DateOnly

Query return rows whose sum of column value match given sum

I have tables with:
id desc total
1 baskets 25
2 baskets 15
3 baskets 75
4 noodles 10
I would like a query that returns the rows whose total values sum to 40.
The output would be like:
id desc total
1 baskets 25
2 baskets 15
I believe this will get you a list of the results you're looking for, but not with your example dataset because nothing in your example dataset can provide a total sum of 40.
SELECT id, desc, total
FROM mytable
WHERE desc IN (
SELECT desc
FROM mytable
GROUP BY desc
HAVING SUM(total) = 40
)
Select Desc,SUM(Total) as SumTotal
from Table
group by desc
having SUM(Total) > = 40
Not quite sure what you want, but this may get you started
SELECT `desc`, SUM(Total) Total
FROM TableName
GROUP BY `desc`
HAVING SUM(Total) = 40
From reading your question, it sounds like you want a query that returns any subset of rows that have the same description and whose totals sum to a certain target value.
There is no simple way to do this. This migrates into algorithmic territory.
Assuming I am correct in what you are after, GROUP BY and aggregate functions will not solve your problem. SQL cannot indicate that a query should be performed on subsets of data until it exhausts all possible combinations and finds the sums that match your requirements.
You will have to intermix an algorithm into your SQL, i.e. a stored procedure.
Or simply get all the data that fits the desc from the database, then perform your algorithm on it in code.
I recall a CS algorithms class I took where this was a known problem, the subset sum problem.
I believe you could just adapt a working version of that algorithm to solve your problem:
http://en.wikipedia.org/wiki/Subset_sum_problem
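As a rough illustration of the "fetch the rows, then search in code" route, here is a brute-force Python sketch over the sample rows from the question. The function name subsets_matching is made up, and the approach is an assumption about what is wanted (subsets of rows sharing a desc); it tries every combination, so it is only practical for small groups.

```python
from itertools import combinations

# (id, desc, total) rows from the question.
rows = [
    (1, "baskets", 25),
    (2, "baskets", 15),
    (3, "baskets", 75),
    (4, "noodles", 10),
]


def subsets_matching(rows, target):
    """Return (desc, [ids]) for every subset of same-desc rows summing to target."""
    by_desc = {}
    for rid, desc, total in rows:
        by_desc.setdefault(desc, []).append((rid, total))

    matches = []
    for desc, items in by_desc.items():
        # Exhaustive search: every subset size, every combination of that size.
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                if sum(total for _, total in combo) == target:
                    matches.append((desc, [rid for rid, _ in combo]))
    return matches


result = subsets_matching(rows, 40)
print(result)
```

For a target of 40 this finds the baskets rows with ids 1 and 2 (25 + 15), which is the output shown in the question. The number of combinations doubles with every extra row in a group, so larger groups need a proper subset-sum algorithm such as dynamic programming.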
select desc
from (select desc, sum(total) as ct group by desc)