Counting date and time for historical reporting - SQL

I am currently working on a query that will be used in conjunction with SharePoint to run reports. I have a query that I know will work with Oracle, but the company I am working for is running SQL Server 2005.
What the report will do is give the person the ability to select any date and time, and give the count for that specific operation. The problem is that there are large gaps in the timestamps (because it takes a little while for the product to get to the next operation). The date type is varchar, so I used substrings to parse out the year, month, day, and time. I have sample data available.
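For illustration only, substring parsing along those lines might look something like this in SQL Server (the column name payment_date_str and the 'YYYY-MM-DD HH:MI:SS' layout are assumptions, since the sample data is not shown here):
-- Hypothetical sketch only: column name and 'YYYY-MM-DD HH:MI:SS' layout are assumed
SELECT SUBSTRING(payment_date_str, 1, 4)  AS yr,   -- year
       SUBSTRING(payment_date_str, 6, 2)  AS mon,  -- month
       SUBSTRING(payment_date_str, 9, 2)  AS dy,   -- day
       SUBSTRING(payment_date_str, 12, 8) AS tm    -- time of day
FROM INVOICE_ARCHIVE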
The people looking at the reports want the ability to say at this time and day how many units went through this operation.
I know this is confusing, let me know if you need any clarification.
Here is the Oracle syntax:
SELECT T3.PAYMENT_DATE, T3."Hr", T3."Min",
(SELECT COUNT(*)
FROM INVOICE_ARCHIVE T4
WHERE TO_NUMBER(TO_CHAR(T4.PAYMENT_DATE, 'MM')) <= T3."Hr"
AND TO_NUMBER(TO_CHAR(T4.PAYMENT_DATE, 'DD')) <= T3."Min") AS "NUM"
FROM(SELECT T1.PAYMENT_DATE, T2."Hr", T2."Min"
FROM (SELECT (FLOOR((LEVEL + 359)/60)) AS "Hr",
MOD((LEVEL + 359), 60) AS "Min"
FROM dual CONNECT BY LEVEL <= 961) T2, INVOICE_ARCHIVE T1
ORDER BY T1.PAYMENT_DATE, T2."Hr", T2."Min") T3

The answer to your question is the datepart() function in SQL Server. This will allow you to extract minutes and hours from dates.
The harder part is the "connect by level" portion. How is this being used? You might need to use recursive CTEs to handle this.
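For example, here is a minimal sketch (assuming all that CONNECT BY LEVEL is doing here is generating the numbers 1 through 961) of a recursive CTE that produces the same Hr/Min pairs in SQL Server:
-- Sketch only: emulate Oracle's CONNECT BY LEVEL with a recursive CTE
WITH numbers AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < 961
)
SELECT (n + 359) / 60 AS "Hr",   -- integer division floors, like FLOOR() in the Oracle version
       (n + 359) % 60 AS "Min"
FROM numbers
OPTION (MAXRECURSION 961);       -- default limit is 100 recursions, so raise it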
With the little hint from spencer, the following may suffice for your query:
SELECT T3.PAYMENT_DATE, T3."Hr", T3."Min",
(SELECT COUNT(*)
FROM INVOICE_ARCHIVE T4
WHERE datepart(month, T4.PAYMENT_DATE) <= T3."Hr" AND
datepart(day, T4.PAYMENT_DATE) <= T3."Min"
) AS "NUM"
FROM (SELECT T1.PAYMENT_DATE, T2."Hr", T2."Min"
FROM (SELECT top 961 (FLOOR((LEVEL + 359)/60)) AS "Hr",
(LEVEL + 359) % 60 AS "Min"
FROM (select top 961 row_number() over (order by (select NULL)) as level
from invoice_archive
) t
) T2 cross join
INVOICE_ARCHIVE T1
) T3
ORDER BY T3.PAYMENT_DATE, T3."Hr", T3."Min"
I made the following changes:
Changed the date arithmetic to use datepart() instead of to_char().
Replaced the method for getting a list of numbers, by using row_number() instead of connect by level
Made the cross join explicit
Moved the order by to the outer query, since neither SQL Server nor Oracle guarantees the results of an order by in a subquery (and SQL Server does not allow it unless you have a "TOP" query)

Related

Two queries returning different results when they should be equivalent?

Our dataset is fundamentally joining a set of dates (weeks from the current week into the past) to a set of sections based on whether those sections started on or before and ended on or after that week. While originally this query gave us the results we expected, this week it began providing us incorrect results. After a bunch of tinkering, we discovered that if we changed the query to a LEFT JOIN and then filtered the query using a WHERE clause, it gave us correct results again.
What's the difference? Why does one work and the other doesn't? (Bonus points: why did the original query work for weeks before suddenly experiencing this error?) Performing the same inner join on Redshift delivers correct results, so it seems to be a Snowflake nuance that we don't understand.
Original query:
WITH week_list AS
(
SELECT DATEADD(week, -4, DATE_TRUNC(week, CURRENT_DATE())) AS week_value
UNION ALL
SELECT DATEADD(week, 1, week_value)
FROM week_list
WHERE DATEADD(week, 1, week_value) < CURRENT_DATE()
),
active_sections_per_week AS
(
SELECT
wl.week_value, s.id section_id
FROM week_list wl
JOIN schema.sections s ON wl.week_value >= DATE_TRUNC(week, s.starts_at)
AND wl.week_value <= DATE_TRUNC(week, s.ends_at)
)
SELECT
aspw.week_value,
COUNT(DISTINCT aspw.section_id) count_sections
FROM
active_sections_per_week aspw
GROUP BY 1
ORDER BY 1 DESC
Results: One row, dated 2019-12-30 (4 weeks ago). No data for the past three weeks.
Note: If you adjust the DATEADD in the first CTE, whatever is the first date returned will always seem to join successfully. This behavior started only within the last week; previously, this query provided the expected number of rows (in other words, the number of weeks specified in that first DATEADD).
"Fixed" query:
WITH week_list AS
(
SELECT DATEADD(week, -4, DATE_TRUNC(week, CURRENT_DATE())) AS week_value
UNION ALL
SELECT DATEADD(week, 1, week_value)
FROM week_list
WHERE DATEADD(week, 1, week_value) < CURRENT_DATE()
),
active_sections_per_week AS
(
SELECT wl.week_value, s.id section_id
FROM week_list wl
LEFT JOIN schema.sections s ON wl.week_value >= DATE_TRUNC(week, s.starts_at)
AND wl.week_value <= DATE_TRUNC(week, s.ends_at)
WHERE s.id IS NOT NULL
)
SELECT aspw.week_value, COUNT(DISTINCT aspw.section_id) count_sections
FROM active_sections_per_week aspw
GROUP BY 1
ORDER BY 1 DESC
Results: returns four rows, weeks dated 2019-12-30 to 2020-01-20, with appropriate section counts.
This is a recursive CTE on "week_list". Redshift does not support recursive CTEs.
Snowflake does support recursive CTEs, which would explain the difference in behavior.
It's hard to test this without the underlying data. If you're getting correct results in Redshift, then chances are you do not need or want a recursive CTE. You can modify it so that "week_list" does not reference itself.
As for why it worked before, it is likely that the table state and the recursive CTE only lined up under special cases, and when CURRENT_DATE() advanced it moved out of that special case. Also, the inner join and the left outer join with WHERE s.id IS NOT NULL would be equivalent if they were not in a recursive CTE.
You can read more about recursive CTEs here:
https://docs.snowflake.net/manuals/user-guide/queries-cte.html#recursive-ctes-and-hierarchical-data
The recursive CTE can be avoided, if the -4 weeks is a constant, with code like this:
WITH week_list AS (
SELECT DATEADD(week, column1, DATE_TRUNC(week, CURRENT_DATE())) AS week_value
FROM VALUES (-4),(-3),(-2),(-1),(0)
)
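(column1 is the name Snowflake assigns to the first column produced by an inline FROM VALUES list, and the AS week_value alias keeps the CTE compatible with the rest of the original query, which refers to week_value.)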
With the JOIN, Snowflake will move the filters higher in the execution stack, so you might have found a bug. Whereas with the LEFT JOIN (even though it has an equivalent WHERE clause), it is most likely avoiding the aggressive, broken optimization.
There was a software release last night for us, but we are on an Enterprise account, so you might have been upgraded two days prior. This release had a number of bugs that impacted us; we had it rolled back (for us).
Thank you for all of the feedback! The good news is you all helped me get to a solution that I think I am satisfied with. I have also followed up with Snowflake so they can investigate this behavior and see if it was user error on my part due to not understanding how recursive CTEs process, or whether it is possibly a bug introduced in a recent release.
Here's what I found: while recursion works for the use case I was applying it to (generating a list of dates based on CURRENT_DATE), it is not strictly necessary. Since we want a list of dates, I could just as easily generate a table and use the row numbers to perform the DATEADD adjustments.
It looks like this:
SELECT DATEADD(week, '-' || ROW_NUMBER() OVER (ORDER BY NULL),
DATEADD(week, 1, DATE_TRUNC(week, CURRENT_DATE()))) AS week_value
FROM table (generator(rowcount => 200))
One of the big benefits to this approach is I am no longer limited by the MAX_RECURSIONS setting in Snowflake (which is set to 100 by default). Since I am using this data to create graphs of activity over time, having 200 values gives me more than three years of history rather than just shy of 2 years of history. I also don't have to contact my Snowflake rep if I want to expand it.
Changing the week_list CTE to this non-recursive approach seems to fix whatever issue was causing the INNER JOIN to perform incorrectly. We still don't understand why the recursive CTE seemed to work for several weeks and then suddenly started misbehaving, but if Snowflake can shed light on that via our support ticket, I will double back here to provide an update. Thank you all for your help and guidance!

SQL WITH AS statements in Ecto Subquery

I have an SQL query that uses the PostgreSQL WITH AS to act as an XOR or "not" LEFT JOIN. The goal is to return what is unique between the two queries.
In this instance, I want to know what users have transactions within a certain time period AND do not have transactions in another time period. The SQL Query does this by using WITH to select all the transactions for a certain date range in new_transactions, then select all transactions for another date range in older_transactions. From those, we will select from new_transactions what is NOT in older_transactions.
My query in SQL is:
/* New Customers */
WITH new_transactions AS (
select * from transactions
where merchant_id = 1 and inserted_at > date '2017-11-01'
), older_transactions AS (
select * from transactions
where merchant_id = 1 and inserted_at < date '2017-11-01'
)
SELECT * from new_transactions
WHERE user_id NOT IN (select user_id from older_transactions);
I'm trying to replicate this in Ecto via Subquery. I know I can't do a subquery in the where: statement, which leaves me with a left_join. How do I replicate that in Elixir/Ecto?
What I've replicated in Elixir/Ecto throws an (Protocol.UndefinedError) protocol Ecto.Queryable not implemented for [%Transaction....
Elixir/Ecto Code:
def new_merchant_transactions_query(merchant_id, date) do
from t in MyRewards.Transaction,
where: t.merchant_id == ^merchant_id and fragment("?::date", t.inserted_at) >= ^date
end
def older_merchant_transactions_query(merchant_id, date) do
from t in MyRewards.Transaction,
where: t.merchant_id == ^merchant_id and fragment("?::date", t.inserted_at) <= ^date
end
def new_customers(merchant_id, date) do
from t in subquery(new_merchant_transactions_query(merchant_id, date)),
left_join: ot in subquery(older_merchant_transactions_query(merchant_id, date)),
on: t.user_id == ot.user_id,
where: t.user_id != ot.user_id,
select: t.id
end
Update:
I tried changing it to where: is_nil(ot.user_id), but get the same error.
This maybe should be a comment instead of an answer, but it's too long and needs too much formatting, so I went ahead and posted it as an answer. With that out of the way, here we go.
What I would do is re-write the query to avoid the Common Table Expression (or CTE; this is what a WITH AS is really called) and the IN() expression, and instead I'd do an actual JOIN, like this:
SELECT n.*
FROM transactions n
LEFT JOIN transactions o ON o.user_id = n.user_id and o.merchant_id = 1 and o.inserted_at < date '2017-11-01'
WHERE n.merchant_id = 1 and n.inserted_at > date '2017-11-01'
AND o.inserted_at IS NULL
You might also choose to do a NOT EXISTS(), which on SQL Server at least will often produce a better execution plan.
This is probably a better way to handle the query anyway, but once you do that you may also find this solves your problem by making it much easier to translate to ecto.
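For reference, a sketch of the NOT EXISTS() variant mentioned above, using the same tables and filters as the rewritten join (untested against the poster's schema):
-- Anti-join written with NOT EXISTS instead of LEFT JOIN ... IS NULL
SELECT n.*
FROM transactions n
WHERE n.merchant_id = 1
  AND n.inserted_at > date '2017-11-01'
  AND NOT EXISTS (
        SELECT 1
        FROM transactions o
        WHERE o.user_id = n.user_id
          AND o.merchant_id = 1
          AND o.inserted_at < date '2017-11-01'
      );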

Can't access CTE via inner join SQL Server

I know I'm missing something obvious but it's not so obvious to me!
I've got a table valued function that produces a nice interval range of dates given a start, end, interval (thanks to another SO answer!).
I've another TVF that produces the latest part transaction given a date.
However, I was after being able to produce the last parts transaction in a series of dates lying between the start and end dates given. So, given March to May and an interval of, say, 2 days, I'd get a sort of time series between the two.
However, I've hit a wall now with CTEs and was trying to avoid going into procedural/cursor-style looping to do this.
This is the code:
WITH datesTbl(DateValue)
AS (SELECT DateValue
FROM [dbo].[DateRange]('2016-03-18', '2016-04-27', 1))
SELECT *
FROM datesTbl dr
INNER JOIN dbo.MoveDateDiff(dr.Datevalue, DATEADD(day, 1, dr.DateValue), 14792) pm
ON DATEDIFF(Day, dr.dateValue, pm.MovementDate) <= 1;
I know I have other conceptual errors in the underlying TVFs; however, here I'm wanting to find a way past the fact that I can't seem to access the CTE in the first part of the INNER JOIN statement (there is no syntax error after the ON declaration!).
Any guidance would be gratefully received!
When you use a TVF, you want APPLY, not JOIN:
WITH datesTbl(DateValue) as (
SELECT DateValue
FROM [dbo].[DateRange]('2016-03-18', '2016-04-27', 1)
)
SELECT *
FROM datesTbl dr CROSS APPLY
dbo.MoveDateDiff(dr.Datevalue, DATEADD(day, 1, dr.DateValue), 14792) pm
WHERE DATEDIFF(Day, dr.dateValue, pm.MovementDate) <= 1;
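The reason the JOIN version fails is that the arguments of a table-valued function on one side of a JOIN cannot reference columns from the other side of the join; CROSS APPLY evaluates the function once per row of datesTbl, so dr.DateValue is in scope as an argument.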

20 Day moving average with joins alone

There are questions like this all over the place so let me specify where I specifically need help.
I have seen moving averages in SQL with Oracle analytic functions, MSSQL APPLY, or a variety of other methods. I have also seen this done with self joins (one join for each day of the average, such as here: How do you create a Moving Average Method in SQL?).
I am curious as to whether there is a way (only using self joins) to do this in SQL (preferably Oracle, but since my question is geared towards joins alone, this should be possible for any RDBMS). The way would have to be scalable (for a 20 or 100 day moving average, in contrast to the link I researched above, which required a join for each day in the moving average).
My thoughts are
select customer, a.tradedate, a.shares, avg(b.shares)
from trades a, trades b
where b.tradedate between a.tradedate-20 and a.tradedate
group by customer, a.tradedate
But when I tried it in the past, it didn't work. To be more specific, I am trying a smaller but similar example (5 day avg instead of 20 day) with this fiddle demo and can't find out where I am going wrong. http://sqlfiddle.com/#!6/ed008/41
select a.ticker, a.dt_date, a.volume, avg(b.volume)
from yourtable a, yourtable b
where b.dt_date between a.dt_date-5 and a.dt_date
and a.ticker=b.ticker
group by a.ticker, a.dt_date, a.volume
I don't see anything wrong with your second query. I think the only reason it's not what you're expecting is that the volume field is an integer data type, so when you calculate the average the resulting output will also be an integer. For an average you have to cast it, because the result won't necessarily be a whole number:
select a.ticker, a.dt_date, a.volume, avg(cast(b.volume as float))
from yourtable a
join yourtable b
on a.ticker = b.ticker
where b.dt_date between a.dt_date - 5 and a.dt_date
group by a.ticker, a.dt_date, a.volume
Fiddle:
http://sqlfiddle.com/#!6/ed008/48/0 (thanks to #DaleM for DDL)
I don't know why you would ever do this vs. an analytic function though, especially since you mention wanting to do this in Oracle (which has analytic functions). It would be different if your preferred database were MySQL or a database without analytic functions.
Just to add to the answer, this is how you would achieve the same result in Oracle using analytic functions. Notice how the PARTITION BY acts as the join you're using on ticker. That splits up the results so that the same dates shared across multiple tickers don't interfere.
select ticker,
dt_date,
volume,
avg(cast(volume as decimal)) over( partition by ticker
order by dt_date
rows between 5 preceding
and current row ) as mov_avg
from yourtable
order by ticker, dt_date, volume
Fiddle:
http://sqlfiddle.com/#!4/0d06b/4/0
Analytic functions will likely run much faster.
http://sqlfiddle.com/#!6/ed008/45 would appear to be what you need.
select a.ticker,
a.dt_date,
a.volume,
(select avg(cast(b.volume as float))
from yourtable b
where b.dt_date between a.dt_date-5 and a.dt_date
and a.ticker=b.ticker)
from yourtable a
order by a.ticker, a.dt_date
Not a join, but a subquery.

SQL Average Inter-arrival Time, Time Between Dates

I have a table with sequential timestamps:
2011-03-17 10:31:19
2011-03-17 10:45:49
2011-03-17 10:47:49
...
I need to find the average time difference between each of these (there could be dozens) in seconds or whatever is easiest; I can work with it from there. So, for example, the inter-arrival time for only the first two times above would be 870 (14m 30s). For all three times it would be: (870 + 120)/2 = 495 (8m 15s).
A note: I am using PostgreSQL 8.1.22.
EDIT: The table I mention above is from a different query that is literally just a one-column list of timestamps
Not sure I understood your question completely, but this might be what you are looking for:
SELECT avg(difference)
FROM (
SELECT timestamp_col - lag(timestamp_col) over (order by timestamp_col) as difference
FROM your_table
) t
The inner query calculates the distance between each row and the preceding row. The result is an interval for each row in the table.
The outer query simply does an average over all differences.
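With the three sample timestamps above, for instance, the inner query would yield intervals of 00:14:30 and 00:02:00 (plus a NULL for the first row, which AVG() ignores), so the outer query would return 00:08:15.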
I think you want to find avg(timestamptz).
My solution is avg(current - min value), but since the result is an interval, add it to the min value again.
SELECT avg(target_col - (select min(target_col) from your_table))
+ (select min(target_col) from your_table)
FROM your_table
If you cannot upgrade to a version of PG that supports window functions, you
may compute your table's sequential steps "the slow way."
Assuming your table is "tbl" and your timestamp column is "ts":
SELECT AVG(t1 - t0)
FROM (
-- All this silliness would be moot if we could use
-- `` lead(ts) over (order by ts) ''
SELECT tbl.ts AS t0,
next.ts AS t1
FROM tbl
CROSS JOIN
tbl next
WHERE next.ts = (
SELECT MIN(ts)
FROM tbl subquery
WHERE subquery.ts > tbl.ts
)
) derived;
But don't do that. Its performance will be terrible. Please do what
a_horse_with_no_name suggests, and use window functions.