SQL Query for count of records matching day in a date range?

I have a table with records that look like this:
CREATE TABLE sample (
ix int unsigned auto_increment primary key,
start_active datetime,
last_active datetime
);
I need to know how many records were active on each of the last 30 days. The days should also be sorted incrementing so they are returned oldest to newest.
I'm using MySQL and the query will be run from PHP but I don't really need the PHP code, just the query.
Here's my start:
SELECT COUNT(1) cnt, DATE(?each of last 30 days?) adate
FROM sample
WHERE adate BETWEEN start_active AND last_active
GROUP BY adate;

Do an outer join.
No table? Make a table. I always keep a dummy table around just for this.
create table artificial_range(
id int not null primary key auto_increment,
name varchar( 20 ) null ) ;
-- or whatever your database requires for an auto increment column
insert into artificial_range( name ) values ( null );
-- create one row.
insert into artificial_range( name ) select name from artificial_range;
-- you now have two rows
insert into artificial_range( name ) select name from artificial_range;
-- you now have four rows
insert into artificial_range( name ) select name from artificial_range;
-- you now have eight rows
--etc. (keep doubling)
insert into artificial_range( name ) select name from artificial_range;
-- after ten doubling inserts you have 1024 rows, with ids 1-1024
Now make it convenient to use, and limit it to 30 days, with a view:
Edit: JR Lawhorne notes:
You need to change "date_add" to "date_sub" to get the previous 30 days in the created view.
Thanks JR!
create view each_of_the_last_30_days as
select date( date_sub( now(), interval (id - 1) day ) ) as adate
from artificial_range where id < 31;
Now use this in your query (I haven't actually tested your query, I'm just assuming it works correctly):
Edit: I should be joining the other way:
SELECT COUNT(a.ix) cnt, b.adate
FROM each_of_the_last_30_days b
left outer join sample a
on ( b.adate BETWEEN a.start_active AND a.last_active)
GROUP BY b.adate
ORDER BY b.adate;
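If you are on MySQL 8.0 or later, a recursive CTE can stand in for the dummy table entirely. A sketch (untested), reusing the sample table from the question:
-- Generate the last 30 days inline and join against it, no helper table needed.
WITH RECURSIVE each_of_the_last_30_days (adate) AS (
    SELECT CURDATE()
    UNION ALL
    SELECT adate - INTERVAL 1 DAY
    FROM each_of_the_last_30_days
    WHERE adate > CURDATE() - INTERVAL 29 DAY
)
SELECT COUNT(a.ix) AS cnt, b.adate
FROM each_of_the_last_30_days b
LEFT JOIN sample a
  ON b.adate BETWEEN a.start_active AND a.last_active
GROUP BY b.adate
ORDER BY b.adate;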

SQL is great at matching sets of values that are stored in the database, but it isn't so great at matching sets of values that aren't in the database. So one easy workaround is to create a temp table containing the values you need:
CREATE TEMPORARY TABLE days_ago (d SMALLINT);
INSERT INTO days_ago (d) VALUES
(0), (1), (2), ... (29), (30);
Now you can compare a date that is d days ago to the span between start_active and last_active of each row. Count how many matching rows in the group per value of d and you've got your count.
SELECT CURRENT_DATE - INTERVAL d DAY AS adate, COUNT(s.ix) AS cnt
FROM days_ago
LEFT JOIN sample s ON (CURRENT_DATE - INTERVAL d DAY BETWEEN s.start_active AND s.last_active)
GROUP BY d
ORDER BY d DESC; -- oldest to newest
Another note: you can't use column aliases defined in your select-list in expressions until you get to the GROUP BY clause. Actually, in standard SQL you can't use them until the ORDER BY clause, but MySQL supports using aliases in GROUP BY and HAVING clauses as well.
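If you'd rather not type out all 31 VALUES rows above, one hedged way to generate them is a small digits cross join (any row generator works; the derived-table aliases t and u are arbitrary):
-- Fill days_ago with 0..30 without listing every value.
INSERT INTO days_ago (d)
SELECT t.n * 10 + u.n
FROM (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3) t
CROSS JOIN (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
            UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) u
WHERE t.n * 10 + u.n <= 30;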

Turn the dates into Unix timestamps (seconds) in your query, and then just check that the difference is <= the number of seconds in a month.
You can find more information here:
http://dev.mysql.com/doc/refman/5.1/en/date-and-time-functions.html#function_unix-timestamp
If you need help with the query please let me know, but MySQL has nice functions for dealing with datetime.
[Edit] I was confused about the real question. I need to go finish the lawn, but before I forget I want to write this down.
To get a count per day, keep the WHERE clause as described above to limit the rows to the past 30 days, then group by day: convert each start to a date, group on that, and count the rows in each group.
This assumes each record is active for a single day; if the start and end dates can span several days it gets trickier.
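A minimal sketch of what this describes, using UNIX_TIMESTAMP for the 30-day cutoff and grouping on the start date (assumes the sample table from the question and single-day activity, as noted):
-- Count records per day over roughly the last 30 days.
SELECT DATE(start_active) AS adate, COUNT(*) AS cnt
FROM sample
WHERE UNIX_TIMESTAMP(start_active) >= UNIX_TIMESTAMP(NOW()) - 30 * 86400
GROUP BY adate
ORDER BY adate;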

Related

For loop with output arrays

In Snowflake:
I have two tables available:
"SEG_HISTO": This is a segmentation run once a month.
columns: Client ID /date (1st of each month) /segment.
"TCK": a table that contains the tickets with the columns: Ticket ID / Customer ID / Date / Amount.
For each customer ID in the "SEG_HISTO" table, I searched for all the customer's tickets over a rolling year and associated the sum of the amount spent:
SELECT SEG_OMNI.*, TCK_12M.TOTAL_AMOUNT_HT
FROM "SHARE"."DATAMARTS_DATASCIENCE"."SEG_OMNI" SEG_OMNI
LEFT OUTER JOIN
(
SELECT DISTINCT PR_ID_BU,
SUM(TOTAL_AMOUNT_HT) AS "TOTAL_AMOUNT_HT",
COUNT(*) "NB_ACHAT"
FROM
(
SELECT * FROM "SHARE"."RAW_BDC"."TCK"
WHERE TO_DATE(DT_SALE) >= DATEADD(YEAR, -1, '2022-07-01') -- <<<===== date add manually
)
GROUP BY PR_ID_BU
) TCK_12M
ON SEG_OMNI."pr_id_bu" = TCK_12M.PR_ID_BU
Now I need to create a for loop that iterates this for each date in the SEG_OMNI table (SELECT DISTINCT TO_DATE(DT_MAJ) DT FROM "SHARE"."DATAMARTS_DATASCIENCE"."SEG_HISTO") and stack the output in a view.
And this is where I'm stuck.
Thank you in advance for your help.
As Dave said in the comments, it would be better if you could figure out how to run all this in one query, instead of running the same query multiple times.
But as you are asking how to output the results of multiple queries out of one stored procedure I'm going to give you the pattern for that here. I'm also assuming you want this in a SQL script (we could use Python/Java/JS instead):
declare
your_var string;
all_dates cursor for (
select dates
from your_table
);
begin
-- create a table to store results
create or replace temp table discovery_results(x string, y string, z int);
for record in all_dates do
-- for each date, run the query and insert its results into the table created above
insert into discovery_results
select x, y, z
from the_query
where (:dates_cursor_data)
;
end for;
return 'run [select * from discovery_results] to find the results';
end;
select *
from discovery_results
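For reference, a hedged sketch of the single-query approach suggested in the comments. It assumes SEG_HISTO carries the same "pr_id_bu" customer-id column that SEG_OMNI uses above, and the DT_MAJ, DT_SALE, TOTAL_AMOUNT_HT columns from the question (adjust to the real names):
-- One pass: for every segmentation date, sum each customer's tickets over the
-- preceding rolling year, instead of looping date by date.
SELECT
    seg."pr_id_bu",
    TO_DATE(seg.DT_MAJ)      AS seg_date,
    SUM(tck.TOTAL_AMOUNT_HT) AS total_amount_ht,
    COUNT(tck.PR_ID_BU)      AS nb_achat
FROM "SHARE"."DATAMARTS_DATASCIENCE"."SEG_HISTO" seg
LEFT JOIN "SHARE"."RAW_BDC"."TCK" tck
  ON  tck.PR_ID_BU = seg."pr_id_bu"
  AND TO_DATE(tck.DT_SALE) >= DATEADD(YEAR, -1, TO_DATE(seg.DT_MAJ))
  AND TO_DATE(tck.DT_SALE) <= TO_DATE(seg.DT_MAJ)
GROUP BY seg."pr_id_bu", TO_DATE(seg.DT_MAJ);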

SQL Query to show attendance between two dates

I've got a .NET application with an attendance table that has fields for a Start and End date. I'm struggling to show a graph of attendance for a given period. I can easily find how many rows are applicable on any given day using BETWEEN, but I can't get my head around pivoting the results so that I can graph a count of rows per day. I could run a SQL query for every day individually and then graph the results, but is there any way of doing this with T-SQL that I could then use to graph with?
Edit:-
Apologies, as this is the first time I've asked a question here; as huMpty duMpty has stated, the question probably needs more clarification. I've got both a startdate and an enddate column in the SQL db, and I need to count, per day, the rows whose range between these dates falls within the range of the selection criteria. E.g. if I've got a start date of 2013-01-01 and an end date of 2013-01-10 and I report on the period 2013-01-09 to 2013-01-11, then I'm looking at getting a result of 1 for 2013-01-09 and 1 for 2013-01-10. Hope this makes more sense, and thanks for your assistance.
I think you have a table with start and end dates; for a given date range, you would like to know the number of records that fall on each date.
I believe this problem may be solved with a numbers table. I created a numbers table on the fly in a stored procedure, but I recommend creating a permanent numbers table in your production code. Here's the SQL Fiddle.
Create Table Attendance (
id int primary key identity(1,1) not null
,start_date date not null
,end_date date not null
);
Go
Insert Attendance(start_date, end_date)
Values ('1/1/2013', '1/10/2013')
,('1/10/2013', '1/15/2013')
,('2/20/2013', '3/1/2013');
Go
-- Create numbers table. See: Method 3 of http://stackoverflow.com/a/1407488/772086
With Numbers(Number) As
(
Select 1 As Number
Union All
Select Number + 1 From Numbers Where Number < 10000
)
Select
AttendanceDate = Convert(date, DateAdd(day,Numbers.number, '1/1/2000'))
,AttendanceCount = Count(*)
From dbo.Attendance
Join Numbers
On Numbers.Number
Between DateDiff(day, '1/1/2000', Attendance.start_date)
And DateDiff(day, '1/1/2000', Attendance.end_date)
-- Reporting range between 1/9 and 1/11
Where DateAdd(day,Numbers.number, '1/1/2000') Between '1/9/2013'
And '1/11/2013'
Group By Convert(date, DateAdd(day,Numbers.number, '1/1/2000'))
Option(MaxRecursion 10000);
All dates are in US format (m/d/yyyy) - you may want to switch those to the ISO standard (yyyy-mm-dd) in your production code.
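If you do create a permanent numbers table as recommended above, a minimal sketch (the name dbo.Numbers and the 10,000-row size are assumptions):
-- Build the numbers table once; reporting queries can then join to it
-- instead of using a recursive CTE.
Create Table dbo.Numbers (Number int not null primary key);

Insert dbo.Numbers (Number)
Select Top (10000) Row_Number() Over (Order By (Select Null))
From sys.all_objects a Cross Join sys.all_objects b;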
You said you wanted a count by day in a date range. That can be done with a COUNT with a GROUP BY clause.
I don't know your schema, but a solution might look like this:
declare @MyTable table
(
ID int identity(1,1) primary key clustered,
MyDate smalldatetime
)
insert into @MyTable (MyDate)
values
('2012-12-31'), -- before the date range, so not included in results
('2013-01-10'),
('2013-01-10'), -- appears twice
('2013-01-11'), -- appears once
('2013-01-12') -- after the date range, so not included in results
select * from @MyTable
select
MyDate,
count(*)
from @MyTable
where MyDate between '2013-01-09' and '2013-01-11'
group by MyDate
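With the sample rows above, that query returns a count of 2 for 2013-01-10 and 1 for 2013-01-11; the other rows fall outside the selected range.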

finding consecutive date pairs in SQL

I have a question here that looks a little like some of the ones that I found in search, but with solutions for slightly different problems and, importantly, ones that don't work in SQL 2000.
I have a very large table with a lot of redundant data that I am trying to reduce down to just the useful entries. It's a history table, and the way it works, if two entries are essentially duplicates and consecutive when sorted by date, the latter can be deleted. The data from the earlier entry will be used when historical data is requested from a date between the effective date of that entry and the next non-duplicate entry.
The data looks something like this:
id user_id effective_date important_value useless_value
1 1 1/3/2007 3 0
2 1 1/4/2007 3 1
3 1 1/6/2007 NULL 1
4 1 2/1/2007 3 0
5 2 1/5/2007 12 1
6 3 1/1/1899 7 0
With this sample set, we would consider two consecutive rows duplicates if the user_id and the important_value are the same. From this sample set, we would only delete the row with id=2, preserving the information from 1/3/2007, showing that the important_value changed on 1/6/2007, and then showing the relevant change again on 2/1/2007.
My current approach is awkward and time-consuming, and I know there must be a better way. I wrote a script that uses a cursor to iterate through the user_id values (since that breaks the huge table up into manageable pieces), and creates a temp table of just the rows for that user. Then to get consecutive entries, it takes the temp table, joins it to itself on the condition that there are no other entries in the temp table with a date between the two dates. In the pseudocode below, UDF_SameOrNull is a function that returns 1 if the two values passed in are the same or if they are both NULL.
WHILE (##fetch_status <> -1)
BEGIN
SELECT * FROM History INTO #history WHERE user_id = #UserId
--return entries to delete
SELECT h2.id
INTO #delete_history_ids
FROM #history h1
JOIN #history h2 ON
h1.effective_date < h2.effective_date
AND dbo.UDF_SameOrNull(h1.important_value, h2.important_value)=1
WHERE NOT EXISTS (SELECT 1 FROM #history hx WHERE hx.effective_date > h1.effective_date and hx.effective_date < h2.effective_date)
DELETE h1
FROM History h1
JOIN #delete_history_ids dh ON
h1.id = dh.id
FETCH NEXT FROM UserCursor INTO #UserId
END
It also loops over the same set of duplicates until there are none, since taking out rows creates new consecutive pairs that are potentially dupes. I left that out for simplicity.
Unfortunately, I must use SQL Server 2000 for this task and I am pretty sure that it does not support ROW_NUMBER() for a more elegant way to find consecutive entries.
Thanks for reading. I apologize for any unnecessary backstory or errors in the pseudocode.
OK, I think I figured this one out, excellent question!
First, I made the assumption that the effective_date column will not be duplicated for a user_id. I think it can be modified to work if that is not the case - so let me know if we need to account for that.
The process basically takes the table of values and self-joins on equal user_id and important_value and prior effective_date. Then, we do 1 more self-join on user_id that effectively checks to see if the 2 joined records above are sequential by verifying that there is no effective_date record that occurs between those 2 records.
It's just a select statement for now - it should select all records that are to be deleted. So if you verify that it is returning the correct data, simply change the select * to delete tcheck.
Let me know if you have questions.
select
*
from
History tcheck
inner join History tprev
on tprev.[user_id] = tcheck.[user_id]
and tprev.important_value = tcheck.important_value
and tprev.effective_date < tcheck.effective_date
left join History checkbtwn
on tcheck.[user_id] = checkbtwn.[user_id]
and checkbtwn.effective_date < tcheck.effective_date
and checkbtwn.effective_date > tprev.effective_date
where
checkbtwn.[user_id] is null
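For reference, the delete form described above, with the select * swapped for delete tcheck as suggested (run it only after verifying the select output):
delete tcheck
from
History tcheck
inner join History tprev
on tprev.[user_id] = tcheck.[user_id]
and tprev.important_value = tcheck.important_value
and tprev.effective_date < tcheck.effective_date
left join History checkbtwn
on tcheck.[user_id] = checkbtwn.[user_id]
and checkbtwn.effective_date < tcheck.effective_date
and checkbtwn.effective_date > tprev.effective_date
where
checkbtwn.[user_id] is null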
OK guys, I did some thinking last night and I think I found the answer. I hope this helps someone else who has to match consecutive pairs in data and for some reason is also stuck in SQL Server 2000.
I was inspired by the other results that say to use ROW_NUMBER(), and I used a very similar approach, but with an identity column.
--create table with identity column
CREATE TABLE #history (
id int,
user_id int,
effective_date datetime,
important_value int,
useless_value int,
idx int IDENTITY(1,1)
)
--insert rows ordered by effective_date and now indexed in order
INSERT INTO #history
SELECT * FROM History
WHERE user_id = @user_id
ORDER BY effective_date
--get pairs where consecutive values match
SELECT *
FROM #history h1
JOIN #history h2 ON
h1.idx+1 = h2.idx
WHERE h1.important_value = h2.important_value
With this approach, I still have to iterate over the results until it returns nothing, but I can't think of any way around that and this approach is miles ahead of my last one.

How can I optimize this SQL query to get rid of the filesort and temp table?

Here's the query:
SELECT
count(id) AS count
FROM `numbers`
GROUP BY
MONTH(created_at),
YEAR(created_at)
ORDER BY
YEAR(created_at),
MONTH(created_at)
That query throws a 'Using temporary' and 'Using filesort' when doing EXPLAIN.
Ultimately what I'm doing is looking at a table of user-submitted tracking numbers and counting the number of submitted rows, grouping the counts by month/year.
ie. In November 2008 there were 11,312 submitted rows.
UPDATE, here's the DESCRIBE for the numbers table.
id int(11) NO PRI NULL auto_increment
tracking varchar(255) YES NULL
service varchar(255) YES NULL
notes text YES NULL
user_id int(11) YES NULL
active tinyint(1) YES 1
deleted tinyint(1) YES 0
feed text YES NULL
status varchar(255) YES NULL
created_at datetime YES NULL
updated_at datetime YES NULL
scheduled_delivery date YES NULL
carrier_service varchar(255) YES NULL
Give this a shot:
SELECT x.created_year, x.created_month, COUNT(x.id) AS count
FROM (SELECT t.id,
MONTH(t.created_at) 'created_month',
YEAR(t.created_at) 'created_year'
FROM NUMBERS t) x
GROUP BY x.created_year, x.created_month
ORDER BY x.created_year, x.created_month
It's not a good habit to use functions in the WHERE, GROUP BY and ORDER BY clauses because indexes can't be used.
...query throws a 'Using temporary' and 'Using filesort' when doing EXPLAIN.
From what I found, that's to be expected when using DISTINCT/GROUP BY.
Make sure you have a covering index over YEAR and MONTH (that is, both fields within the same index) so that the ORDER BY component of your query can use an index. This should remove the need for a filesort, although a temporary table may still be needed to handle the grouping.
SELECT
count(`id`) AS count, MONTH(`created_at`) as month, YEAR(`created_at`) as year
FROM `numbers`
GROUP BY month, year
ORDER BY year, month
This will be the best you can get, as far as I can tell. I created a table with an id and a datetime column and filled it with 10,000 rows. The query in the other answer uses a sub select, but it really doesn't buy you anything and has the overhead of a sub select. The resulting time for mine was 0.015s and the sub-select version was 0.016s.
Make sure that you have an index on created_at, this will help your initial query out. It is pretty rare to not end up with a filesort when the GROUP BY comes about, but it may be possible in other situations. MySQL's docs have an article about this if you feel so inclined. I do not see how those methods can be applied here, with the information you have provided.
Whenever MySQL builds an in-memory temporary table that exceeds the allowed size (tmp_table_size / max_heap_table_size), it starts using the disk to store the temporary work. You could increase those variables, but setting them too high could cause performance problems in other areas. The InnoDB buffer pool (innodb_buffer_pool_size) is a separate setting; on a dedicated server it is commonly set to ~50-75% of RAM.
The best method would be creating a helper column containing the numeric values of YEAR and MONTH concatenated together:
YEAR(created_at) * 100 + MONTH(created_at)
Grouping on this column (with an index on it) can use the index for GROUP BY.
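On MySQL 5.7 and later, that helper column can be made concrete with a generated column plus an index. A sketch; the column and index names here are assumptions:
-- Persist YEAR*100+MONTH and index it so the GROUP BY can use the index.
ALTER TABLE numbers
    ADD COLUMN created_ym INT
        GENERATED ALWAYS AS (YEAR(created_at) * 100 + MONTH(created_at)) STORED;
ALTER TABLE numbers ADD INDEX idx_created_ym (created_ym);

SELECT created_ym, COUNT(*) AS cnt
FROM numbers
GROUP BY created_ym
ORDER BY created_ym;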
However, you can create two helper tables, the first one containing a reasonable number of years (say, from 1900 to 2100), the second one containing months (from 0 to 11), and use these tables to generate the sets:
SELECT y, m,
(
SELECT COUNT(*)
FROM numbers
WHERE created_at >= MAKEDATE(y, 1) + INTERVAL m MONTH
AND created_at < MAKEDATE(y, 1) + INTERVAL m + 1 MONTH
) AS cnt
FROM year_table
CROSS JOIN
month_table
WHERE y BETWEEN 2008 AND 2010
ORDER BY y, m
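A sketch of the two helper tables that query assumes (the names year_table/month_table and columns y/m come from the query above; the population trick for the years is just one option, any row generator works):
CREATE TABLE month_table (m TINYINT NOT NULL PRIMARY KEY);
CREATE TABLE year_table (y SMALLINT NOT NULL PRIMARY KEY);

INSERT INTO month_table (m)
VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11);

-- Populate years 1900-2100 by treating three month digits as a base-12 number.
INSERT INTO year_table (y)
SELECT 1900 + a.m + 12 * b.m + 144 * c.m
FROM month_table a, month_table b, month_table c
WHERE 1900 + a.m + 12 * b.m + 144 * c.m <= 2100;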
I'm sorry, but I have to disagree with the other answers.
I think what you need is to add an index to your table, preferably a covering index.
If you add an index on the column you are searching on (created_at) plus the column you want in the result (id), it will be dramatically faster than before.
The reason you are getting a temp table is the GROUP BY.
To speed up the GROUP BY, you can change the MySQL server settings to increase the temporary table size (tmp_table_size) and the max heap table size (max_heap_table_size) so that the temp table stays in memory.
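For reference, those two settings (a sketch; the values here are illustrative only, not recommendations):
-- The in-memory limit is the smaller of the two, so raise them together.
SET GLOBAL tmp_table_size      = 256 * 1024 * 1024;
SET GLOBAL max_heap_table_size = 256 * 1024 * 1024;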

MySQL LEFT JOIN SELECT not selecting all the left side records?

I'm getting odd results from a MySQL SELECT query involving a LEFT JOIN, and I can't work out whether my understanding of LEFT JOIN is wrong or whether I'm seeing genuinely odd behavior.
I have two tables with a many-to-one relationship: for every record in table 1 there are 0 or more records in table 2. I want to select all the records in table 1 with a column that counts the number of related records in table 2. As I understand it, LEFT JOIN should always return all records on the LEFT side of the statement.
Here's a test database that exhibits the problem:
CREATE DATABASE Test;
USE Test;
CREATE TABLE Dates (
dateID INT UNSIGNED NOT NULL AUTO_INCREMENT,
date DATE NOT NULL,
UNIQUE KEY (dateID)
) ENGINE=MyISAM;
CREATE TABLE Slots (
slotID INT UNSIGNED NOT NULL AUTO_INCREMENT,
dateID INT UNSIGNED NOT NULL,
UNIQUE KEY (slotID)
) ENGINE=MyISAM;
INSERT INTO Dates (date) VALUES ('2008-10-12'),('2008-10-13'),('2008-10-14');
INSERT INTO Slots (dateID) VALUES (3);
The Dates table has three records, and Slots has one - and that record points to the third record in Dates.
If I do the following query..
SELECT d.date, count(s.slotID) FROM Dates AS d LEFT JOIN Slots AS s ON s.dateID=d.dateID GROUP BY s.dateID;
..I expect to see a table with 3 rows in - two with a count of 0, and one with a count of 1. But what I actually see is this:
+------------+-----------------+
| date | count(s.slotID) |
+------------+-----------------+
| 2008-10-12 | 0 |
| 2008-10-14 | 1 |
+------------+-----------------+
The first zero-count record (2008-10-12) appears, but the other zero-count record (2008-10-13) is missing.
Am I doing something wrong, or do I just not understand what LEFT JOIN is supposed to do?
You need to GROUP BY d.dateID. In two of your cases, s.DateID is NULL (LEFT JOIN) and these are combined together.
I think you will also find that this is invalid (ANSI) SQL, because d.date is not part of a GROUP BY or the result of an aggregate operation, and should not be able to be SELECTed.
I think you mean to group by d.dateId.
Try removing the GROUP BY s.dateID
The rows for 10-12 and 10-13 are grouped together, because s.dateID is NULL for both; since they are two NULL values, the count evaluates to 0 for that single combined row.
I don't know if this is valid in MySQL but you could probably avoid this mistake in the future by using the following syntax instead
SELECT date, count(slotID) as slotCount
FROM Dates LEFT OUTER JOIN Slots USING (dateID)
GROUP BY (date)
By using the USING clause you don't get two dateID's to keep track of.
replace GROUP BY s.dateID with d.dateID.
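Putting the fix together, a minimal corrected version of the original query (group on the left-hand table's key and count the right-hand table's key so unmatched dates show 0):
SELECT d.date, COUNT(s.slotID) AS slotCount
FROM Dates AS d
LEFT JOIN Slots AS s ON s.dateID = d.dateID
GROUP BY d.dateID;
-- with the sample data: 2008-10-12 -> 0, 2008-10-13 -> 0, 2008-10-14 -> 1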