How to pull the last N number of dates in SAS SQL - sql

I am dealing with a large dataset (30 million rows) and I need to pull the most recent three dates (which may have an indeterminate number of rows attached to them) so like 03MAR2016 might have 2 rows 27FEB2016 might have ten and 25FEB2016 might have 3. How do I say "Select everything that falls within the last X number of values in this set regardless of how many rows there are"?

As you can not sort in an in-line view/subquery you will have to split your SQL statement in two parts:
Sort the date DESCENDING and get the distinct values
Join back to the original data and limit to first 3
But as stated before, SQL is not good at this kind of operation.
DATA input_data ;
INPUT date value ;
CARDS ;
20160101 1
20160101 2
20160101 3
20160102 1
20160103 1
20160104 1
20160105 1
20160105 2
20160105 3
;
proc sql _method;
create table DATE_ID as
select distinct DATE
from input_data
order by DATE DESC;
create table output_data as
select data.*
from (select *
from DATE_ID
where monotonic() <= 3
) id
inner join input_data data
on id.DATE = data.DATE
;
quit;

You need to break this down into two tasks.
Determine which dates are the last three dates
Pull all rows from those dates
Both are possible in SQL, though the first is much easier using other methods (SAS's SQL isn't very good at getting the "first X things").
I would suggest using something like PROC FREQ or PROC TABULATE to generate the list of dates (just a PROC FREQ on the date variable), really any proc you're comfortable with - even PROC SORT would work (though that's probably less efficient). Then once you have that table, limit it to the three highest observations, and then you can use it in a SQL step to join to the main table and filter to those three dates - or you can use other options, like creating a custom format or hash tables or whatever works for you. 30 million rows isn't so many that a SQL join should be a problem, though, I'd think.

Related

Sum Column for Running Total where Overlapping Date

I have a table with about 3 million rows of Customer Sales by Date.
For each CustomerID row I need to get the sum of the Spend_Value
WHERE Order_Date BETWEEN Order_Date_m365 AND Order_Date
Order_Date_m365 = OrderDate minus 365 days.
I just tried a self join but of course, this gave the wrong results due to rows overlapping dates.
If there is a way with Window Functions this would be ideal but I tried and can't do the between dates in the function, unless I missed a way.
Tonly way I can think now is to loop so process all rank 1 rows into a table, then rank 2 in a table, etc., but this will be really inefficient on 3 million rows.
Any ideas on how this is usually handled in SQL?
SELECT CustomerID,Order_Date_m365,Order_Date,Spend_Value
FROM dbo.CustomerSales
Window functions likely won't help you here, so you are going to need to reference the table again. I would suggest that you use an APPLY with a subquery to do this. Provided you have relevant indexes then this will likely be the more efficient approach:
SELECT CS.CustomerID,
CS.Order_Date_m365,
CS.Order_Date,
CS.Spend_Value,
o.output
FROM dbo.CustomerSales CS
CROSS APPLY (SELECT SUM(Spend_Value) AS output
FROM dbo.CustomerSales ca
WHERE ca.CustomerID = CS.CustomerID
AND ca.Order_Date >= CS.Order_Date_m365 --Or should this is >?
AND ca.Order_Date <= CS.Order_Date) o;

SQL statement to match dates that are the closest?

I have the following table, let's call it Names:
Name Id Date
Dirk 1 27-01-2015
Jan 2 31-01-2015
Thomas 3 21-02-2015
Next I have the another table called Consumption:
Id Date Consumption
1 26-01-2015 30
1 01-01-2015 20
2 01-01-2015 10
2 05-05-2015 20
Now the problem is, that I think that doing this using SQL is the fastest, since the table contains about 1.5 million rows.
So the problem is as follows, I would like to match each Id from the Names table with the Consumption table provided that the difference between the dates are the lowest, so we have: Dirk consumes on 27-01-2015 about 30. In case there are two dates that have the same "difference", I would like to calculate the average consumption on those two dates.
While I know how to join, I do not know how to code the difference part.
Thanks.
DBMS is Microsoft SQL Server 2012.
I believe that my question differs from the one mentioned in the comments, because it is much more complicated since it involves comparison of dates between two tables rather than having one date and comparing it with the rest of the dates in the table.
This is how you could it in SQL Server:
SELECT Id, Name, AVG(Consumption)
FROM (
SELECT n.Id, Name, Consumption,
RANK() OVER (PARTITION BY n.Id
ORDER BY ABS(DATEDIFF(d, n.[Date], c.[Date]))) AS rnk
FROM Names AS n
INNER JOIN Consumption AS c ON n.Id = c.Id ) t
WHERE t.rnk = 1
GROUP BY Id, Name
Using RANK with PARTITION BY n.Id and ORDER BY ABS(DATEDIFF(d, n.[Date], c.[Date])) you can locate all matching records per Id: all records with the smallest difference in days are going to have rnk = 1.
Then, using AVG in the outer query, you are calculating the average value of Consumption between all matching records.
SQL Fiddle Demo

Oracle SQL last n records

i have read tons of articles regarding last n records in Oracle SQL by using rownum functionality, but on my case it does not give me the correct rows.
I have 3 columns in my table: 1) message (varchar), mes_date (date) and mes_time (varchar2).
Inside lets say there is 3 records:
Hello world | 20-OCT-14 | 23:50
World Hello | 21-OCT-14 | 02:32
Hello Hello | 20-OCT-14 | 23:52
I want to get the last 2 records ordered by its date and time (first row the oldest, and second the newest date/time)
i am using this query:
SELECT *
FROM (SELECT message
FROM messages
ORDER
BY MES_DATE, MES_TIME DESC
)
WHERE ROWNUM <= 2 ORDER BY ROWNUM DESC;
Instead of getting row #3 as first and as second row #2 i get row #1 and then row #3
What should i do to get the older dates/times on top follow by the newest?
Maybe that helps:
SELECT *
FROM (SELECT message,
mes_date,
mes_time,
ROW_NUMBER() OVER (ORDER BY TO_DATE(TO_CHAR(mes_date, 'YYYY-MM-DD') || mes_time, 'YYYY-MM-DD HH24:MI') DESC) rank
FROM messages
)
WHERE rank <= 2
ORDER
BY rank
I am really sorry to disappoint - but in Oracle there's no such thing as "the last two records".
The table structure does not allocate data at the end, and does not keep a visible property of time (the only time being held is for the sole purpose of "flashback queries" - supplying results as of point in time, such as the time the query started...).
The last inserted record is not something you can query using the database.
What can you do? You can create a trigger that orders the inserted records using a sequence, and select based on it (so SELECT * from (SELECT * FROM table ORDER BY seq DESC) where rownum < 3) - that will assure order only if the sequence CACHE value is 1.
Notice that if the column that contains the message date does not have many events in a second, you can use that column, as the other solution suggested - e.g. if you have more than 2 events that arrive in a second, the query above will give you random two records, and not the actual last two.
AGAIN - Oracle will not be queryable for the last two rows inserted since its data structure do not managed orders of inserts, and the ordering you see when running "SELECT *" is independent of the actual inserts in some specific cases.
If you have any questions regarding any part of this answer - post it down here, and I'll focus on explaining it in more depth.
select * from table
minus
select * from table
where rownum<=(select count(*) from table)-n

How to have GROUP BY and COUNT include zero sums?

I have SQL like this (where $ytoday is 5 days ago):
$sql = 'SELECT Count(*), created_at FROM People WHERE created_at >= "'. $ytoday .'" AND GROUP BY DATE(created_at)';
I want this to return a value for every day, so it would return 5 results in this case (5 days ago until today).
But say Count(*) is 0 for yesterday, instead of returning a zero it doesn't return any data at all for that date.
How can I change that SQLite query so it also returns data that has a count of 0?
Without convoluted (in my opinion) queries, your output data-set won't include dates that don't exist in your input data-set. This means that you need a data-set with the 5 days to join on to.
The simple version would be to create a table with the 5 dates, and join on that. I typically create and keep (effectively caching) a calendar table with every date I could ever need. (Such as from 1900-01-01 to 2099-12-31.)
SELECT
calendar.calendar_date,
Count(People.created_at)
FROM
Calendar
LEFT JOIN
People
ON Calendar.calendar_date = People.created_at
WHERE
Calendar.calendar_date >= '2012-05-01'
GROUP BY
Calendar.calendar_date
You'll need to left join against a list of dates. You can either create a table with the dates you need in it, or you can take the dynamic approach I outlined here:
generate days from date range

How do I count data from 2 different tables by date

I have 2 tables with no relations, both tables have different number of columns, but there are a few columns that are the same but hold different data. I was able to create a function or view of only the data I wanted, but when I try to count the data by filtering the date, I always get the wrong count in return. Let me explain by showing the 2 functions and what I try to do:
Function 1
ID - number from 1 to 8
data sent - YES or NO
Date - date value
Function 2
ID - number from 1 to 8
data sent - yes or no
date - date value
Upon running both separately, I get all the rows from the tables and everything looks good.
Then I try to add the following to each function:
select
count([data sent]), ID
from function1
Where (date between #date1 and #date2)
group by ID
The above statement works great and gives me the right result for each function.
Now I thought what if I want to add those 2 functions into one and get the count from both functions on 1 page.
So I created the following function:
Function 3
select
count(Function1.[data sent]) as Expr1,
Function1.id,
count(Function2.[data sent]) as Expr2,
Function1.date
from
Function1
LEFT OUTER JOIN
Function2 on Function1.id = Function2.id
Where
(Function1.date between #date1 and #date2)
group by
Function1.id
Upon running the above, I get the following table:
ID Expr1 Expr2
On both Expr1 and Expr2, I get results which I am not sure where they come from. I guess something is being multiplied by 100000 since one table holds almost 15000 rows and the other around 5000 rows.
What I would like to know first is if it possible at all to be able to filter by date and count records from both table at the same time. If anyone need more information please let me know and I will be glad to share and explain more.
Thank you
The LEFT OUTER JOIN is taking each row of the left table, finding ALL of the rows in the right table with the same id field, and creating that many rows in the result table. Since id isn't what we usually think of as an identity field (it looks more like a "deviceId" or something), you'll get lots of matches for each one. Repeat 15000 times and you get your combinatorial explosion.
Tip: To debug things like this, you can create sample tables with a tiny subset of the real data, say 10 rows from each, and run your query on them. You'll see the issue immediately.
It's possible to filter by date. It's hard to recommend an actual solution without better understanding your phrase "I want to add those 2 functions into one and get the count from both functions on 1 page".
Why can't you create a temporary table for each function then join them together?
Maybe subqueries can help you to achieve what you want:
SELECT
ID = COALESCE(f1.ID, f2.ID),
Date = COALESCE(f1.Date, f2.Date),
f1.Expr1,
f2.Expr2
FROM (
SELECT
ID,
Date,
Expr1 = COUNT([data sent])
FROM Function1
WHERE Date BETWEEN #date1 AND #date2
GROUP BY
ID,
Date
) f1
FULL JOIN (
SELECT
ID,
Date,
Expr2 = COUNT([data sent])
FROM Function2
WHERE Date BETWEEN #date1 AND #date2
GROUP BY
ID,
Date
) f2
ON f1.ID = f2.ID AND f1.Date = f2.Date
This query also uses full (outer) join instead of left join, in case the right side of the join contains rows that have no match in the left side (and you want those rows).