Recursive SQL query to speed up non-indexed query - sql

This question is largely driven by curiosity, as I do have a working query (it just takes a little longer than I would like).
I have a table with 4 million rows. The only index on this table is an auto-increment BigInt ID. The query is looking for distinct values in one of the columns, but only going back 1 day. Unfortunately, the ReportDate column that is evaluated is not of the DateTime type, or even a BigInt, but is char(8) in the format of YYYYMMDD. So the query is a bit slow.
SELECT Category
FROM Reports
where ReportDate = CONVERT(VARCHAR(8), GETDATE(), 112)
GROUP BY Category
Note that the date conversion in the above statement simply produces a YYYYMMDD string for comparison.
I was wondering if there was a way to optimize this query based on the fact that I know that the only data I am interested in is at the "bottom" of the table. I was thinking of some sort of recursive SELECT function which gradually grew a temporary table that could be used for the final query.
For example, in pseudo-SQL:
N = 128
TemporaryTable = SELECT TOP {N} *
FROM Reports
ORDER BY ID DESC
/* Once we hit a date < Today, we can stop */
if(TemporaryTable does not contain ReportDate < Today)
N = N**2
Repeat Select
/* We now have a smallish table to do our query */
SELECT Category
FROM TemporaryTable
where ReportDate = CONVERT(VARCHAR(8), GETDATE(), 112)
GROUP BY Category
Does that make sense? Is something like that possible?
This is on MS SQL Server 2008.

I might suggest that you do not need to convert the date that is stored as char data in YYYYMMDD format; that format is inherently sortable all by itself. I would instead convert the date you are searching for into that format.
Also, the way you have the conversion written, the current datetime is converted for every individual row, so even storing that value once for the whole query could speed things up... but I think just converting the search date to a char(8) in that format would help.
I would also suggest getting the index(es) you need created, of course... but that's not the question you asked :P
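A quick check of the sortability claim, sketched in Python: YYYYMMDD strings compare in the same order as the dates they encode, so the stored char(8) values can be compared directly against one converted search value.

```python
from datetime import date, timedelta

def yyyymmdd(d):
    """Format a date the way CONVERT(varchar(8), ..., 112) does: YYYYMMDD."""
    return d.strftime("%Y%m%d")

dates = [date(2011, 5, 27) - timedelta(days=k) for k in range(10)]
as_strings = sorted(yyyymmdd(d) for d in dates)
chronological = [yyyymmdd(d) for d in sorted(dates)]
# Lexicographic order of the strings matches chronological order of the dates.
print(as_strings == chronological)  # True
```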

Why not just create the index you need?
create index idx_Reports_ReportDate
on Reports(ReportDate, Category)

No, that doesn't make sense. The only way to optimize this query is to have a covering index for it:
CREATE INDEX ndxReportDateCategory ON Reports (ReportDate, Category);
Update
Considering your comment that you cannot modify the schema, then you should modify the schema. If you still can't, then the answer still applies: the solution is to have an index.
And finally, to answer more directly your question, if you have a strong correlation between ID and ReportData: the ID you seek is the biggest one that has a ReportDate smaller than the date you're after:
SELECT MAX(Id)
FROM Reports
WHERE ReportDate < 'YYYYMMDD';
This will do a reverse scan on the ID index and stop at the first ID prior to your desired date (i.e. it will not scan the entire table). You can then filter your reports based on this found max ID.
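A sketch of that two-step idea using Python's sqlite3 (table contents are invented; in SQL Server the MAX(Id) probe would be a reverse seek on the clustered index rather than a scan):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Reports (ID INTEGER PRIMARY KEY, ReportDate TEXT, Category TEXT)")
conn.executemany(
    "INSERT INTO Reports VALUES (?, ?, ?)",
    [(i, "20110526" if i <= 3000 else "20110527", "cat%d" % (i % 5))
     for i in range(1, 4001)],
)

# Step 1: find the boundary -- the biggest ID with a ReportDate before today.
(boundary,) = conn.execute(
    "SELECT MAX(ID) FROM Reports WHERE ReportDate < '20110527'"
).fetchone()

# Step 2: only rows past the boundary are grouped, not the whole table.
cats = [c for (c,) in conn.execute(
    "SELECT Category FROM Reports WHERE ID > ? GROUP BY Category", (boundary,)
)]
print(boundary, sorted(cats))
```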

I think you will find the discussion of SARGability on Rob Farley's blog to be very interesting reading in relation to your post topic.
http://blogs.lobsterpot.com.au/2010/01/22/sargable-functions-in-sql-server/
An interesting alternative approach that does not require you to modify the existing column data type would be to leverage computed columns.
alter table REPORTS
add castAsDate as CAST(ReportDate as date)
create index rf_so2 on REPORTS(castAsDate) include (ReportDate)

One of the query patterns I occasionally use against a log table with indexing similar to yours is to limit by subquery:
DECLARE @ReportDate varchar(8)
SET @ReportDate = CONVERT(varchar(8), GETDATE(), 112)
SELECT *
FROM
(
    SELECT TOP 20000 *
    FROM Reports
    ORDER BY ID DESC
) sub
WHERE sub.ReportDate = @ReportDate
20k/4M = 0.5% of the table is read.
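The shape of that pattern, sketched with Python's sqlite3 (LIMIT stands in for TOP; the data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Reports (ID INTEGER PRIMARY KEY, ReportDate TEXT)")
conn.executemany(
    "INSERT INTO Reports VALUES (?, ?)",
    [(i, "20110526" if i <= 900 else "20110527") for i in range(1, 1001)],
)

# Read only a fixed-size tail of the table (newest IDs), then filter it.
todays = conn.execute("""
    SELECT * FROM (
        SELECT ID, ReportDate FROM Reports ORDER BY ID DESC LIMIT 200
    ) sub
    WHERE sub.ReportDate = '20110527'
""").fetchall()
print(len(todays))  # 100 matching rows found inside the 200-row tail
```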
Here's a loop solution. Note: you might want to make ID the primary key and index ReportDate in the temp table.
DECLARE @ReportDate varchar(8)
SET @ReportDate = CONVERT(varchar(8), GETDATE(), 112)
DECLARE @CurrentDate varchar(8), @MinKey bigint
SELECT TOP 2000 * INTO #MyTable
FROM Reports ORDER BY ID DESC
SELECT @CurrentDate = MIN(ReportDate), @MinKey = MIN(ID)
FROM #MyTable
WHILE @ReportDate <= @CurrentDate
BEGIN
    INSERT INTO #MyTable
    SELECT TOP 2000 *
    FROM Reports WHERE ID < @MinKey ORDER BY ID DESC
    SELECT @CurrentDate = MIN(ReportDate), @MinKey = MIN(ID)
    FROM #MyTable
END
SELECT * FROM #MyTable
WHERE ReportDate = @ReportDate
DROP TABLE #MyTable

Related

SQL Server - Looking for a way to shorten code

I'm basically very new to SQL Server, so please bear with me. Here is my problem:
I have a table with (let's say) 10 columns and 80k rows. I have 1 column called Date in the format YYYY-MM-DD, of type varchar(50) (I can't convert it to the date or datetime type; I tried, but the initial source of the data is not good).
Example: Table [dbo].[TestDates]
Code     SellDate
XS4158   2019-11-26
DE7845   2020-02-06
What I need to do is turn the YYYY-MM-DD format into DD/MM/YYYY format. After a lot of tries (I tried the functions DATE_FORMAT, CONVERT, TO_DATE etc.), this is my solution:
1- I added a primary key (ID) for join purposes later
2- I split my date column into 3 columns in a whole new table
3- I merged the 3 columns in the order I need, with the delimiter of my choice (/), in the same new table
4- I copied the good column back to my initial table using the primary key ID I created before
alter table [dbo].[TestDates]
add ID int not null IDENTITY primary key;
SELECT ID,
FORMAT(DATEPART(month, [SellDate]),'00') AS Month,
FORMAT(DATEPART(day, [SellDate]),'00') AS Day,
FORMAT(DATEPART(year, [SellDate]),'0000') AS Year
INTO [dbo].[TestDates_SPLIT]
FROM [dbo].[TestDates]
GO
ALTER TABLE [dbo].[TestDates_SPLIT]
ADD SellDate_OK varchar(50)
UPDATE [dbo].[TestDates_SPLIT]
SET SellDate_OK = [Day] + '/' + [Month] + '/' + [Year]
ALTER TABLE [dbo].[TestDates_SPLIT]
DROP COLUMN Month, Day, Year
ALTER TABLE [dbo].[TestDates]
ADD SellDate_GOOD varchar(50)
UPDATE [dbo].[TestDates]
SET [TestDates].SellDate_GOOD = [TestDates_SPLIT].SellDate_OK
FROM [dbo].[TestDates]
INNER JOIN [dbo].[TestDates_SPLIT]
ON [TestDates].ID = [TestDates_SPLIT].ID
This code works but I find it too heavy and long, considering I have 6 more date columns to work on. Is there a way to make it shorter or more efficient? Maybe with SET SellDate = (some query of sorts that doesn't require creating and deleting a table).
Thank you for your help
I tried the usual SQL functions, but since my column is of varchar type, the conversion was impossible.
You should not be storing dates as text. But, that being said, we can try doing a round-trip conversion from text YYYY-MM-DD to date to text DD/MM/YYYY:
WITH cte AS (
SELECT '2022-11-08' AS dt
)
SELECT dt, -- 2022-11-08
CONVERT(varchar(10), CONVERT(datetime, dt, 121), 103) -- 08/11/2022
FROM cte;
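The same round trip (text to date to text), sketched in Python to show the idea outside SQL: parse once into a real date value, then re-format.

```python
from datetime import datetime

def to_ddmmyyyy(s):
    """Parse a 'YYYY-MM-DD' string into a real date, then re-format it."""
    return datetime.strptime(s, "%Y-%m-%d").strftime("%d/%m/%Y")

print(to_ddmmyyyy("2019-11-26"))  # 26/11/2019
print(to_ddmmyyyy("2020-02-06"))  # 06/02/2020
```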

Declare Big Query Variable with Scheduled Query and Destination Table

I use a scheduled query in Big Query which appends data from the previous day to a Big Query table. The data from the previous day is not always available when my query runs, so, to make sure that I have all the data, I need to calculate the last date available in my Big Query table.
My first attempt was to write the following query :
SELECT *
FROM sourceTable
WHERE date >= (SELECT Max(date) from destinationTable)
When I run this query, only the rows with date >= max(date) are correctly exported. However, the query processes the entire sourceTable, and not only the days since max(date). Therefore, the cost is higher than expected.
I also tried declaring a variable using "DECLARE" & "SET" (https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting). This solution works fine and only the days since max(date) are processed. However, BQ interprets a query with "DECLARE" as a script, so the results can't be exported automatically to a BQ table using scheduled queries.
DECLARE maxDate date;
SET maxDate = (SELECT Max(date) from destinationTable);
SELECT *
FROM sourceTable
WHERE date >= maxDate
Is there another way of doing what I would like? Or a way to declare a variable using "DECLARE" & "SET" in a scheduled query with a destination table?
Thanks!
A scripting query, when scheduled, doesn't support setting a destination table for now. You need to use DDL/DML to make changes to the existing table instead:
DECLARE maxDate date;
SET maxDate = (SELECT Max(date) from destinationTable);
CREATE OR REPLACE TABLE destinationTable AS
SELECT *
FROM sourceTable
WHERE date >= maxDate
Is destinationTable partitioned? If not, can you recreate it as a partitioned table? If it is a partitioned table, and partitioned on the destinationTable.date column, you could do something like:
SELECT *
FROM sourceTable
WHERE date >= (SELECT MAX(_PARTITIONTIME) from destinationTable)
Since _PARTITIONTIME is a pseudo-column, there is no cost in running the subquery.
We still can't use scheduled scripting in BigQuery along with destination table.
But if your source table is date-sharded, then there is a workaround that can be used to achieve the desired result (scanning only the required data, based on an initial value from another table).
SELECT *
FROM sourceTable*
WHERE _TABLE_SUFFIX >= (
  SELECT IFNULL(MAX(date), '<default_date>')
  FROM destinationTable
)
This will scan only shards that are greater than or equal to the maximum date of the destination table.
P.S. - source table is date sharded.

How to assign variable to insert into a Table in SQL Server?

I have two tables.
NewTransaction_tb
OldTransaction_tb
I want to move the records from the old table to the new one, including a date. But OldTransaction_tb doesn't have the Date column.
This is what I am trying.
For example
DECLARE @VarDate datetime = CONVERT(datetime, GETDATE(), 102)
INSERT INTO HQMatajer.dbo.NewTransaction_tb
SELECT
    Name, class, Qualification, @VarDate -- this @VarDate is not in OldTransaction_tb
FROM
    HQMatajer.dbo.OldTransaction_tb
What is the solution for this scenario? Thanks,
You don't have to declare a variable to do this; you can put the conversion directly in the SELECT that inserts into the new table.
INSERT into HQMatajer.dbo.NewTransaction_tb
SELECT Name,class,Qualification,CONVERT(datetime,GETDATE(),102)
FROM HQMatajer.dbo.OldTransaction_tb
You can use the date expression directly in the select column list:
SELECT
Name,class,Qualification,CONVERT(datetime,GETDATE(),102)
See, the answer to this can be what you wrote or what others suggested.
The question is what you want in your result set.
For example, if you are processing a data set all at once, say the entire OldTransaction table, and you want all the rows being transferred to NewTransaction to have the same datetime, then it is preferable to first declare a variable and then use it.
This is better than using the function in the SELECT clause, because there the function is called once for every row. So if you have a billion rows in the OldTransaction table, the function will be called a billion times and you will see a small speed impact.
But if you require each row to have the exact datetime of its insertion, in case your insert takes a prolonged time of an hour or more, then there is no choice but to use the function within the SELECT statement.
SELECT
Name, Class, Qualification, CONVERT(Datetime, GETDATE(), 102)
FROM HQMatajer.dbo.OldTransaction_tb
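A small sketch of that trade-off in Python (row data is invented): capturing the timestamp once gives every row the same value, while reading the clock per row lets values drift during a long-running insert.

```python
from datetime import datetime

rows = [("Alice", "A", "MSc"), ("Bob", "B", "BSc"), ("Cara", "C", "PhD")]

# Variable style (DECLARE @VarDate ...): one call, one shared value.
stamp = datetime.now()
batch_shared = [(name, cls, qual, stamp) for name, cls, qual in rows]
distinct_stamps = {d for _, _, _, d in batch_shared}
print(len(distinct_stamps))  # 1: every row carries the same timestamp

# Per-row style (function call in the SELECT list): the clock is read per row,
# so timestamps may differ if the batch takes a long time.
batch_per_row = [(name, cls, qual, datetime.now()) for name, cls, qual in rows]
```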
Check this: if the table does not yet exist, then use:
Select *
Into NewTransaction_tb
From
(select
* ,CONVERT(datetime, GETDATE(), 102) as Date
from
OldTransaction_tb) a
OR
if the table already exists, then use:
insert into NewTransaction_tb
select
*, CONVERT(datetime, GETDATE(), 102) as Date
from
OldTransaction_tb

How to automatically add 1 year to an existing date in SQL Server

I have a task to automatically bill all registered patients in the PatientsInfo table an annual bill of N2,500, based on the DateCreated column.
Certainly I will use a stored procedure to insert these records into the PatientDebit table and create a SQL Job to perform this procedure.
How do I select the patients in the PatientsInfo table whose DateCreated is now 1 year old, so I can insert them into the PatientDebit table?
I have my algorithm like this:
select date of registration for patients from PatientsInfo table
Add 1 year to their DateCreated
Is date added today? if yes,
Insert record into PatientDebit table with the bill of N2,500
If no, do nothing.
Please how do I write the script?
Use DATEADD, i.e.:
SELECT DATEADD(year, 1, '2006-08-30')
Ref.: http://msdn.microsoft.com/en-us/library/ms186819.aspx
Assuming the columns of the 2 tables are the same:
INSERT INTO PatientDebit
SELECT * from PatientsInfo WHERE DateCreated<DATEADD(year, -1, GETDATE())
Make sure you have an index on DateCreated if PatientsInfo has a lot of records, as the query could otherwise be slow.
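The cutoff logic, sketched in Python with invented patient rows (DATEADD(year, -1, GETDATE()) becomes a small helper here; Feb 29 is clamped to Feb 28):

```python
from datetime import date

def one_year_before(d):
    """Rough analogue of DATEADD(year, -1, d)."""
    try:
        return d.replace(year=d.year - 1)
    except ValueError:          # Feb 29 in a leap year
        return d.replace(year=d.year - 1, day=28)

today = date(2024, 6, 1)
patients = [("P1", date(2023, 5, 20)), ("P2", date(2023, 8, 1))]

# Same shape as: WHERE DateCreated < DATEADD(year, -1, GETDATE())
due_for_billing = [pid for pid, created in patients
                   if created < one_year_before(today)]
print(due_for_billing)  # ['P1']: registered more than a year before 'today'
```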
There is a DATEADD function in SQL for this; you call it like DATEADD(year, number, date). Read up on adding years, months and seconds to a SQL datetime; it is pretty straightforward, much like C#'s date methods. The current time is GETDATE().

More efficient way to select rows by date stored as nvarchar

Hi all, I wonder if someone could advise a more efficient way to select rows from a table that has roughly 60 million records in it. Each row has a date stored as an nvarchar, for example '20110527030000.106'. I want to select all rows that are 3 months or older based on this date field, so for example I'm only interested in the first part of the date field, '20110527'. I have the following code to do that, however it's a bit slow and I wonder if there is a better way.
DECLARE @tempDate varchar(12)
SET @tempDate = CONVERT(varchar(12), DATEADD(m, -3, GETDATE()), 112)
SELECT *
FROM [TABLE A]
WHERE SUBSTRING([DATE_FIELD], 0, 8) < @tempDate
Your query not only can't use any index on [DATE_FIELD] and does a full scan, it also applies the SUBSTRING() function to every value of the DATE_FIELD column.
Don't apply any function to the column; then the index on [DATE_FIELD] can be used and the function is applied only once, when calculating @tempDate:
SELECT *
FROM [TABLE A]
WHERE [DATE_FIELD] < @tempDate
The < comparison works for varchar values. The following will evaluate to True:
'20110526030000.106' < '20110527'
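Checking that comparison directly in Python, where string comparison follows the same lexicographic rules:

```python
cutoff = "20110527"

# An older row compares less than the cutoff even with its time suffix...
print("20110526030000.106" < cutoff)   # True

# ...while a same-day row does not, since it extends the equal prefix.
print("20110527030000.106" < cutoff)   # False
```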
Is there any reason that the datetime is not stored as datetime type?
If you can modify the table you could add a datetime column and then run an update to populate it with the correct data.
If you can't modify the table, then you could create a new table with a datetime column, extract the keys from the table you want to query into it, and enforce a foreign key constraint across the tables. Then you can populate the datetime column as before and join the tables when querying.
If you can't modify anything, then I guess you could try benchmarking your solution against one where you cast the varchar date to a datetime on the fly (with a user-defined function, for example). This may actually run faster.
Hope this helps you some.
If you can modify the database, you could add a new flag column, isolder3months, set to 0 for each new entry. With a trigger or a daily job you can then set the flag to 1 for every entry that has passed the 3-month mark, so each day you only check/update a fraction of your entries.
This solution is only practical if the 3-month window is fixed and this query is used often.
Then your query would look like
SELECT *
FROM [TABLE A]
WHERE [isolder3months] = 1