Find the longest sequence of a value in a table - sql

This is an SQL question, and I think it is a difficult one - I'm not sure it is possible to achieve with a simple SQL statement or a stored procedure:
I want to find the length of the longest run of the same (known) number in a column of a table:
example:
TABLE:
DATE SALEDITEMS
1/1/09 4
1/2/09 3
1/3/09 3
1/4/09 4
1/5/09 3
Calling the SP/statement for 4 will give 1; calling it for 3 will give 2, as the number 3 appeared twice in a row.
I'm running SQL server 2008.

UPDATE: I generated a million rows of random data and abandoned the recursive CTE solution, as the optimizer produced a query plan for it that didn't make good use of indexes.
But the non-recursive solution I originally posted turned out to work great, as long as there was an additional non-clustered index on (SALEDITEMS, [DATE]). This makes sense, since the query needs to filter in both directions (both by date and by SALEDITEMS). With this additional index, queries on a million rows return in under 2 seconds on my (not very beefy) desktop machine. Without this index, the query was dog-slow.
BTW, this is a great example of how SQL Server's cost-based query optimization totally breaks down in some cases. The recursive CTE solution has a cost (on my PC) of 42 and takes at least several minutes to finish. The non-recursive solution has a cost of 15,446 (!!!) and completes in 1.5 seconds. Moral of the story: when comparing SQL Server query plans, don't assume that cost necessarily correlates to query performance!
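If you want to reproduce this kind of comparison yourself, measure the actual execution statistics rather than trusting the estimated plan cost. A minimal sketch (standard session settings, nothing specific to this schema):
-- compare actual CPU/elapsed time and logical reads, not estimated cost
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
-- run the candidate query here, then check the Messages tab for the numbers
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;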
Anyway, here's the solution I'd recommend (the same non-recursive CTE I posted earlier) :
DECLARE @SALEDITEMS INT = 3;
WITH SalesNoMatch ([DATE], SALEDITEMS, NoMatchDate)
AS
(
SELECT [DATE], SALEDITEMS,
(SELECT MIN([DATE]) FROM Sales s2 WHERE s2.SALEDITEMS <> @SALEDITEMS
AND s2.[DATE] > s1.[DATE]) as NoMatchDate
FROM Sales s1
)
, SalesMatchCount ([DATE], ConsecutiveCount) AS
(
SELECT [DATE], 1+(SELECT COUNT(1) FROM Sales s2 WHERE s2.[DATE] > s1.[DATE] AND s2.[DATE] < NoMatchDate)
FROM SalesNoMatch s1
WHERE s1.SALEDITEMS = @SALEDITEMS
)
SELECT MAX(ConsecutiveCount)
FROM SalesMatchCount;
Here's the DDL I used to test this, including indexes you'll need:
CREATE TABLE [Sales](
[DATE] date NOT NULL,
[SALEDITEMS] int NOT NULL
);
CREATE UNIQUE CLUSTERED INDEX IX_Sales ON Sales ([DATE]);
CREATE UNIQUE NONCLUSTERED INDEX IX_Sales2 ON Sales (SALEDITEMS, [DATE]);
And here's how I created my test data -- 1,000,001 rows with ascending dates and SALEDITEMS randomly set between 1 and 10.
INSERT INTO Sales ([DATE], SALEDITEMS)
VALUES ('1/1/09', 5)
DECLARE @i int = 0;
WHILE (@i < 1000000)
BEGIN
INSERT INTO Sales ([DATE], SALEDITEMS)
SELECT DATEADD (d, 1, (SELECT MAX ([DATE]) FROM Sales)), ABS(CHECKSUM(NEWID())) % 10 + 1
SET @i = @i + 1;
END
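(As an aside, if you need to regenerate the test data often, a set-based insert is much faster than this row-by-row loop. A sketch under the same schema, using a tally built from system views; the start date and row count are just the ones above:)
-- sketch: set-based generation of ~1,000,000 rows of test data
;WITH n AS (
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO Sales ([DATE], SALEDITEMS)
SELECT DATEADD(d, i, '20090101'), ABS(CHECKSUM(NEWID())) % 10 + 1
FROM n;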
Here's the recursive-CTE solution that I abandoned:
DECLARE @SALEDITEMS INT = 3;
-- recursive CTE solution (remember to set MAXRECURSION!)
WITH SalesRowNum ([DATE], SALEDITEMS, RowNum)
AS
(
SELECT [DATE], SALEDITEMS, ROW_NUMBER() OVER (ORDER BY s1.[DATE]) as RowNum
FROM Sales s1
)
, SalesCTE (RowNum, [DATE], ConsecutiveCount)
AS
(
SELECT s1.RowNum, s1.[DATE], 1 AS ConsecutiveCount
FROM SalesRowNum s1
WHERE SALEDITEMS = @SALEDITEMS
UNION ALL
SELECT s1.RowNum, s1.[DATE], ConsecutiveCount + 1 AS ConsecutiveCount
FROM SalesRowNum s1
INNER JOIN SalesCTE s2 ON s1.RowNum = s2.RowNum + 1
WHERE SALEDITEMS = @SALEDITEMS
)
SELECT MAX(ConsecutiveCount)
FROM SalesCTE
OPTION (MAXRECURSION 0);

Untested, because you did not provide DDL and sample data:
DECLARE @SALEDITEMS INT;
SET @SALEDITEMS=3;
SELECT MAX(cnt) FROM(
SELECT COUNT(*) AS cnt FROM YourTable JOIN (
SELECT y1.[Date] AS d1, y2.[Date] AS d2
FROM YourTable AS y1 JOIN YourTable AS y2
ON y1.SALEDITEMS=@SALEDITEMS AND y2.SALEDITEMS=@SALEDITEMS
AND NOT EXISTS(SELECT 1 FROM YourTable AS y
WHERE y.SALEDITEMS<>@SALEDITEMS
AND y1.[Date] < y.[Date] AND y.[Date] < y2.[Date])
) AS t
WHERE [Date] BETWEEN t.d1 AND t.d2
) AS t;

Related

How can I improve the native query for a table with 7 million rows?

I have the below view (table) in my database (SQL Server).
I want to retrieve 2 things from this table.
The object which has the latest booking date for each Product number.
It will return the objects = {0001, 2, 2019-06-06 10:39:58} and {0003, 2, 2019-06-07 12:39:58}.
If no step number has a booking date for a Product number, it will return the object with Step number = 1. It will return the object = {0002, 1, NULL}.
The view has 7,000,000 rows. I must do it using a native query.
The first query that retrieves the product with the latest booking date:
SELECT DISTINCT *
FROM TABLE t
WHERE t.BOOKING_DATE = (SELECT max(tbl.BOOKING_DATE) FROM TABLE tbl WHERE t.PRODUCT_NUMBER = tbl.PRODUCT_NUMBER)
The second query that retrieves the product with booking date NULL and Step number = 1;
SELECT DISTINCT *
FROM TABLE t
WHERE (SELECT max(tbl.BOOKING_DATE) FROM TABLE tbl WHERE t.PRODUCT_NUMBER = tbl.PRODUCT_NUMBER) IS NULL AND t.STEP_NUMBER = 1
I tried using a single query, but it takes too long.
For now I use 2 queries to get this information, but for the future I need to improve this. Do you have an alternative? I also cannot use a stored procedure or function inside SQL Server; I must do it with a native query from Java.
Try this,
Declare @p table(pumber int,step int,bookdate datetime)
insert into @p values
(1,1,'2019-01-01'),(1,2,'2019-01-02'),(1,3,'2019-01-03')
,(2,1,null),(2,2,null),(2,3,null)
,(3,1,null),(3,2,null),(3,3,'2019-01-03')
;With CTE as
(
select pumber,max(bookdate)bookdate
from @p p1
where bookdate is not null
group by pumber
)
select p.* from @p p
where exists(select 1 from CTE c
where p.pumber=c.pumber and p.bookdate=c.bookdate)
union all
select p1.* from @p p1
where p1.bookdate is null and step=1
and not exists(select 1 from CTE c
where p1.pumber=c.pumber)
If performance is the main concern, then whether you use 1 or 2 queries does not matter; in the end only performance matters.
Create NonClustered index ix_Product on Product (ProductNumber,BookingDate,Stepnumber)
Go
If more than 90% of the data falls on one side (BookingDate is not null, or BookingDate is null), then you can create a filtered index on it.
Create NonClustered index ix_Product on Product (ProductNumber,BookingDate,Stepnumber)
where BookingDate is not null
Go
Try row_number() with a proper ordering. NULL values are treated as the lowest possible values by SQL Server's ORDER BY.
SELECT TOP(1) WITH TIES *
FROM myTable t
ORDER BY row_number() over(partition by PRODUCT_NUMBER order by BOOKING_DATE DESC, STEP_NUMBER);
Pay attention to the indexes SQL Server advises in order to get good performance.
Possibly the most efficient method is a correlated subquery:
select t.*
from t
where t.step_number = (select top (1) t2.step_number
from t t2
where t2.product_number = t.product_number
order by t2.booking_date desc, t2.step_number
);
In particular, this can take advantage of an index on (product_number, booking_date desc, step_number).
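For reference, that index would be created roughly like this (a sketch; t stands in for the real table name, and for a view the index would go on the underlying table):
-- sketch: supporting index for the correlated subquery above
CREATE NONCLUSTERED INDEX IX_t_product_booking
ON t (product_number, booking_date DESC, step_number);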

Update record for the last week

I'm building a report that needs to show how many users were upgraded from account status 1 to account status 2 each hour for the last week (and omit hours where there were no upgrades). My table has an updated date; however, it isn't certain that the account status is what was updated (it could be contact information, etc.).
The basic table config that I'm working with is below. There are other columns but they aren't needed for my query.
account_id, account_status, updated_date.
My initial idea was to first filter and look at the data for the current week, then find if they were at account_status = 1 and later account_status = 2.
What's the best way to tackle this?
This is the kind of thing that you would use a SELF JOIN for. It's tough to say exactly how to do this without getting any kind of example data, but hopefully you can build off of this at least. There are a lot of tutorials on how to write a successful self join, so I'd refer to those if you're having difficulties.
select a.account_id
from tableName a, tableName b
where a.account_id= b.account_id
and
(a.DateModified > 'YYYY-MM-DD' and a.account_status = 1)
and
(b.DateModified < 'YYYY-MM-DD' and b.account_status= 2)
Maybe you could try to rank, for each update with a status of 2, all the older updates of that account by timestamp descending. If such an entry with status 1 and rank 1 exists, you know that the respective younger update changed the status from 1 to 2.
SELECT *
FROM elbat t1
WHERE t1.account_status = 2
AND EXISTS (SELECT *
FROM (SELECT rank() OVER (ORDER BY t2.updated_date DESC) r,
t2.account_status
FROM elbat t2
WHERE t2.account_id = t1.account_id
AND t2.updated_date <= t1.updated_date) x
WHERE x.account_status = 1
AND x.r = 1);
Then, to get the hours, you could create a table variable and fill it with a week's worth of hours (unless you already have a suitable calendar/time table). Then INNER JOIN that table (variable) to the result from above. Since it's an INNER JOIN, hours where no status update exists won't be in the result.
DECLARE @current_time datetime = getdate();
DECLARE @current_hour datetime = dateadd(hour,
datepart(hour,
@current_time),
convert(datetime,
convert(date,
@current_time)));
DECLARE @hours
TABLE (hour datetime);
DECLARE @interval_size integer = 7 * 24;
WHILE @interval_size > 0
BEGIN
INSERT INTO @hours
(hour)
VALUES (dateadd(hour,
-1 * @interval_size,
@current_hour));
SET @interval_size = @interval_size - 1;
END;
SELECT *
FROM @hours h
INNER JOIN (SELECT *
FROM elbat t1
WHERE t1.account_status = 2
AND EXISTS (SELECT *
FROM (SELECT rank() OVER (ORDER BY t2.updated_date DESC) r,
t2.account_status
FROM elbat t2
WHERE t2.account_id = t1.account_id
AND t2.updated_date <= t1.updated_date) x
WHERE x.account_status = 1
AND x.r = 1)) y
ON convert(date,
y.updated_date) = convert(date,
h.hour)
AND datepart(hour,
y.updated_date) = datepart(hour,
h.hour);
If you use this often and/or performance is important, you might consider introducing persisted, computed and indexed columns for the convert(...) and datepart(...) expressions and using them in the query instead. Indexing the calendar/time table and the columns used in the subqueries is also worth considering.
(Disclaimer: Since you didn't provide DDL of the table nor any sample data this is totally untested.)
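For illustration, such a persisted computed column and index might look something like this (only a sketch; the table and column names are the ones assumed in the answer above):
-- sketch: persist the hour bucket of each update and index it
ALTER TABLE elbat
ADD updated_hour AS dateadd(hour, datepart(hour, updated_date),
convert(datetime, convert(date, updated_date))) PERSISTED;
CREATE NONCLUSTERED INDEX IX_elbat_account_hour
ON elbat (account_id, updated_hour)
INCLUDE (account_status, updated_date);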

How to calculate the RowTotal of a CTE so that it runs in less time

I have a stored procedure with the following structure:
WITH ItemsContact (
IsCostVariantItem
,ItemID
,AttributeSetID
,ItemTypeID
,HidePrice
,HideInRSSFeed
,HideToAnonymous
,IsOutOfStock
,AddedOn
,BaseImage
,AlternateText
,SKU
,[Name]
,DownloadableID
,[Description]
,ShortDescription
,[Weight]
,Quantity
,Price
,ListPrice
,IsFeatured
,IsSpecial
,ViewCount
,SoldItem
,TotalDiscount
,RatedValue
,RowNumber
)
AS (
SELECT ------,
ROW_NUMBER() OVER (
ORDER BY i.[ItemID] DESC
) AS RowNumber
FROM -------
)
,rowTotal (RowTotal)
AS (
SELECT MAX(RowNumber)
FROM ItemsContact
)
SELECT CONVERT(INT, r.RowTotal) AS RowTotal
,c.*
FROM ItemsContact c
,rowTotal r
WHERE RowNumber >= @offset
AND RowNumber <= (@offset + @limit - 1)
ORDER BY ItemID
When I execute this, I get the following execution times:
SQL Server Execution Times:
CPU time = 344 ms, elapsed time = 362 ms.
Now I remove the second CTE, i.e. rowTotal:
WITH ItemsContact (
IsCostVariantItem
,ItemID
,AttributeSetID
,ItemTypeID
,HidePrice
,HideInRSSFeed
,HideToAnonymous
,IsOutOfStock
,AddedOn
,BaseImage
,AlternateText
,SKU
,[Name]
,DownloadableID
,[Description]
,ShortDescription
,[Weight]
,Quantity
,Price
,ListPrice
,IsFeatured
,IsSpecial
,ViewCount
,SoldItem
,TotalDiscount
,RatedValue
,RowNumber
)
AS (
SELECT ------,
ROW_NUMBER() OVER (
ORDER BY i.[ItemID] DESC
) AS RowNumber
FROM -------
)
SELECT c.*
FROM ItemsContact c
,rowTotal r
WHERE RowNumber >= @offset
AND RowNumber <= (@offset + @limit - 1)
ORDER BY ItemID
And it shows the execution times as:
SQL Server Execution Times:
CPU time = 63 ms, elapsed time = 61 ms.
My first query, which calculates the row total, works fine but takes more time. My question is: why does MAX(RowNumber) take so much longer, and how can I optimize this code? Thanks in advance for any help.
Since MAX(RowNumber) will always be equal to the total number of rows, try just having:
SELECT ------,
ROW_NUMBER() OVER (
ORDER BY i.[ItemID] DESC
) AS RowNumber,
COUNT(*) OVER () as RowTotal
FROM -------
As your first CTE.
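To make that concrete, the whole paging query could then look roughly like this (only a sketch -- the original column list and source table are elided in the question, so dbo.Items and i.* are placeholders):
-- sketch: paging with a windowed total, reusing the same @offset/@limit variables
WITH ItemsContact AS (
SELECT i.*, -- whatever columns the original CTE selects
ROW_NUMBER() OVER (ORDER BY i.[ItemID] DESC) AS RowNumber,
COUNT(*) OVER () AS RowTotal
FROM dbo.Items i
)
SELECT *
FROM ItemsContact
WHERE RowNumber BETWEEN @offset AND @offset + @limit - 1
ORDER BY ItemID;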
The way you have posted it, your SQL has syntax errors. Another thing is that if you removed rowTotal from your second query, it simply wouldn't work, because it still references it. So I don't know where these second execution times come from.
However, if I use the code blocks as templates and remove the errors, the execution plan for this query should be quite simple: you should have a (clustered) index scan on your ------- table and a sort operator, along with some other operators (sequence projection for the ranking ROW_NUMBER function, some join operator like a nested loop, etc.). The clustered index scan and the sort should be the most processor-intensive operations.
SQL Server here has to calculate row numbers for each row, find the maximum of them, and constrain the results between the two row numbers calculated from the input variables. Obviously there is paging functionality built on this query, and there is a lot about paging in SQL Server on SO, so look for it and you will find plenty of related information.
If there is a known layer built on this query, you should change it. It uses an additional, unnecessary column for max(row_number(ID)) that is constant across all rows (38k?) and logically is just a scalar value. Instead you should return count(*) as @Damien_The_Unbeliever suggested in his solution, but separate it from the resultset. This way you would simplify the query and have something like this instead:
SELECT *
FROM (
SELECT N = ROW_NUMBER() OVER (ORDER BY ItemID DESC),
*
FROM YourTable
) x
WHERE N BETWEEN @offset AND @offset + @limit - 1
ORDER BY ItemID
It should be easy to get the result count in a separate query. And if you have a really big table, you can count the approximate number of rows instead of the exact count.
P.S. If you haven't already checked your execution plan for index problems, do it.
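Regarding the approximate count mentioned above, one cheap way is to read the partition metadata instead of scanning the table (a sketch; the table name is a placeholder):
-- approximate row count from partition metadata (heap or clustered index)
SELECT SUM(p.rows) AS approx_rows
FROM sys.partitions AS p
WHERE p.[object_id] = OBJECT_ID('dbo.YourTable')
AND p.index_id IN (0, 1);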

Multiple Running Totals with Group By

I am struggling to find a good way to compute running totals with a group by in them, or the equivalent. The cursor-based running total below works on a complete table, but I would like to expand this to add a "Client" dimension. So I would get running totals like the ones the code below creates, but for each company (i.e. Company A, Company B, Company C, etc.) in one table.
CREATE TABLE test (tag int, Checks float, AVG_COST float, Check_total float, Check_amount float, Amount_total float, RunningTotal_Check float,
RunningTotal_Amount float)
DECLARE @tag int,
@Checks float,
@AVG_COST float,
@check_total float,
@Check_amount float,
@amount_total float,
@RunningTotal_Check float ,
@RunningTotal_Check_PCT float,
@RunningTotal_Amount float
SET @RunningTotal_Check = 0
SET @RunningTotal_Check_PCT = 0
SET @RunningTotal_Amount = 0
DECLARE aa_cursor CURSOR fast_forward
FOR
SELECT tag, Checks, AVG_COST, check_total, check_amount, amount_total
FROM test_3
OPEN aa_cursor
FETCH NEXT FROM aa_cursor INTO @tag, @Checks, @AVG_COST, @check_total, @Check_amount, @amount_total
WHILE @@FETCH_STATUS = 0
BEGIN
SET @RunningTotal_Check = @RunningTotal_Check + @Checks
set @RunningTotal_Amount = @RunningTotal_Amount + @Check_amount
INSERT test VALUES (@tag, @Checks, @AVG_COST, @check_total, @Check_amount, @amount_total, @RunningTotal_Check, @RunningTotal_Amount )
FETCH NEXT FROM aa_cursor INTO @tag, @Checks, @AVG_COST, @check_total, @Check_amount, @amount_total
END
CLOSE aa_cursor
DEALLOCATE aa_cursor
SELECT *, RunningTotal_Check/Check_total as CHECK_RUN_PCT, round((RunningTotal_Check/Check_total *100),0) as CHECK_PCT_BIN, RunningTotal_Amount/Amount_total as Amount_RUN_PCT, round((RunningTotal_Amount/Amount_total * 100),0) as Amount_PCT_BIN
into test_4
FROM test ORDER BY tag
create clustered index IX_TESTsdsdds3 on test_4(tag)
DROP TABLE test
----------------------------------
I can get the running total for any one company, but I would like to do it for multiple companies to produce something like the results below.
CLIENT COUNT Running Total
Company A 1 6.7%
Company A 2 20.0%
Company A 3 40.0%
Company A 4 66.7%
Company A 5 100.0%
Company B 1 3.6%
Company B 2 10.7%
Company B 3 21.4%
Company B 4 35.7%
Company B 5 53.6%
Company B 6 75.0%
Company B 7 100.0%
Company C 1 3.6%
Company C 2 10.7%
Company C 3 21.4%
Company C 4 35.7%
Company C 5 53.6%
Company C 6 75.0%
Company C 7 100.0%
This is finally simple to do in SQL Server 2012, where SUM and COUNT support OVER clauses that contain ORDER BY. Using Cris's #Checks table definition:
SELECT
CompanyID,
count(*) over (
partition by CompanyID
order by Cleared, ID
) as cnt,
str(100.0*sum(Amount) over (
partition by CompanyID
order by Cleared, ID
)/
sum(Amount) over (
partition by CompanyID
),5,1)+'%' as RunningTotalForThisCompany
FROM #Checks;
SQL Fiddle here.
I originally started posting the SQL Server 2012 equivalent (since you didn't mention what version you were using). Steve has done a great job of showing the simplicity of this calculation in the newest version of SQL Server, so I'll focus on a few methods that work on earlier versions of SQL Server (back to 2005).
I'm going to take some liberties with your schema, since I can't figure out what all these #test and #test_3 and #test_4 temporary tables are supposed to represent. How about:
USE tempdb;
GO
CREATE TABLE dbo.Checks
(
Client VARCHAR(32),
CheckDate DATETIME,
Amount DECIMAL(12,2)
);
INSERT dbo.Checks(Client, CheckDate, Amount)
SELECT 'Company A', '20120101', 50
UNION ALL SELECT 'Company A', '20120102', 75
UNION ALL SELECT 'Company A', '20120103', 120
UNION ALL SELECT 'Company A', '20120104', 40
UNION ALL SELECT 'Company B', '20120101', 75
UNION ALL SELECT 'Company B', '20120105', 200
UNION ALL SELECT 'Company B', '20120107', 90;
Expected output in this case:
Client Count Running Total
--------- ----- -------------
Company A 1 17.54
Company A 2 43.86
Company A 3 85.96
Company A 4 100.00
Company B 1 20.55
Company B 2 75.34
Company B 3 100.00
One way:
;WITH gt(Client, Totals) AS
(
SELECT Client, SUM(Amount)
FROM dbo.Checks AS c
GROUP BY Client
), n (Client, Amount, rn) AS
(
SELECT c.Client, c.Amount,
ROW_NUMBER() OVER (PARTITION BY c.Client ORDER BY c.CheckDate)
FROM dbo.Checks AS c
)
SELECT n.Client, [Count] = n.rn,
[Running Total] = CONVERT(DECIMAL(5,2), 100.0*(
SELECT SUM(Amount) FROM n AS n2
WHERE Client = n.Client AND rn <= n.rn)/gt.Totals
)
FROM n INNER JOIN gt ON n.Client = gt.Client
ORDER BY n.Client, n.rn;
A slightly faster alternative - more reads but shorter duration and simpler plan:
;WITH x(Client, CheckDate, rn, rt, gt) AS
(
SELECT Client, CheckDate, rn = ROW_NUMBER() OVER
(PARTITION BY Client ORDER BY CheckDate),
(SELECT SUM(Amount) FROM dbo.Checks WHERE Client = c.Client
AND CheckDate <= c.CheckDate),
(SELECT SUM(Amount) FROM dbo.Checks WHERE Client = c.Client)
FROM dbo.Checks AS c
)
SELECT Client, [Count] = rn,
[Running Total] = CONVERT(DECIMAL(5,2), rt * 100.0/gt)
FROM x
ORDER BY Client, [Count];
While I've offered set-based alternatives here, in my experience a cursor is often the fastest supported way to perform running totals. There are other methods, such as the quirky update, which perform marginally faster, but the result is not guaranteed. The set-based approach where you perform a self-join becomes more and more expensive as the source row counts go up - so what seems to perform okay in testing with a small table degrades as the table gets larger.
I have a blog post almost fully prepared that goes through a slightly simpler performance comparison of various running totals approaches. It is simpler because it is not grouped and it only shows the totals, not the running total percentage. I hope to publish this post soon and will try to remember to update this space.
There is also another alternative to consider that doesn't require reading previous rows multiple times. It's a concept Hugo Kornelis describes as "set-based iteration." I don't recall where I first learned this technique, but it makes a lot of sense in some scenarios.
DECLARE @c TABLE
(
Client VARCHAR(32),
CheckDate DATETIME,
Amount DECIMAL(12,2),
rn INT,
rt DECIMAL(15,2)
);
INSERT @c SELECT Client, CheckDate, Amount,
ROW_NUMBER() OVER (PARTITION BY Client
ORDER BY CheckDate), 0
FROM dbo.Checks;
DECLARE @i INT, @m INT;
SELECT @i = 2, @m = MAX(rn) FROM @c;
UPDATE @c SET rt = Amount WHERE rn = 1;
WHILE @i <= @m
BEGIN
UPDATE c SET c.rt = c2.rt + c.Amount
FROM @c AS c
INNER JOIN @c AS c2
ON c.rn = c2.rn + 1
AND c.Client = c2.Client
WHERE c.rn = @i;
SET @i = @i + 1;
END
SELECT Client, [Count] = rn, [Running Total] = CONVERT(
DECIMAL(5,2), rt*100.0 / (SELECT TOP 1 rt FROM @c
WHERE Client = c.Client ORDER BY rn DESC)) FROM @c AS c;
While this does perform a loop, and everyone tells you that loops and cursors are bad, one gain with this method is that once the previous row's running total has been calculated, we only have to look at the previous row instead of summing all prior rows. The other gain is that in most cursor-based solutions you have to go through each client and then each check. In this case, you go through all clients' 1st checks once, then all clients' 2nd checks once. So instead of (client count * avg check count) iterations, we only do (max check count) iterations. This solution doesn't make much sense for the simple running totals example, but for the grouped running totals example it should be tested against the set-based solutions above. Not a chance it will beat Steve's approach, though, if you are on SQL Server 2012.
UPDATE
I've blogged about various running totals approaches here:
http://www.sqlperformance.com/2012/07/t-sql-queries/running-totals
I didn't exactly understand the schema you were pulling from, but here is a quick query using a temp table that shows how to do a running total in a set based operation.
CREATE TABLE #Checks
(
ID int IDENTITY(1,1) PRIMARY KEY
,CompanyID int NOT NULL
,Amount float NOT NULL
,Cleared datetime NOT NULL
)
INSERT INTO #Checks
VALUES
(1,5,'4/1/12')
,(1,5,'4/2/12')
,(1,7,'4/5/12')
,(2,10,'4/3/12')
SELECT Info.ID, Info.CompanyID, Info.Amount, RunningTotal.Total, Info.Cleared
FROM
(
SELECT main.ID, SUM(other.Amount) as Total
FROM
#Checks main
JOIN
#Checks other
ON
main.CompanyID = other.CompanyID
AND
main.Cleared >= other.Cleared
GROUP BY
main.ID) RunningTotal
JOIN
#Checks Info
ON
RunningTotal.ID = Info.ID
DROP TABLE #Checks

Getting multiple records on year wise

I have a Patient information table with ~50 million records. I need to check some samples for each year, which may be in any order. Here are the sample dates available in the database: "20090722", "20080817", ... "19980301". I also have a primary-key column called "PID". My requirement is to get 2 or 3 samples for each year with a query.
I tried to get 2 samples for each year using sub-queries, but I did not succeed.
If anyone in this forum has an idea about this kind of requirement, please help me.
I tried the following query in SQL Server and it worked fine, but I need the query in MySQL. Please help me out.
select pid,studydate
FROM (SELECT ROW_NUMBER() OVER ( PARTITION BY studydate ORDER BY pid DESC ) AS
'RowNumber', pid,studydate
FROM patient
) pt
WHERE RowNumber <= 2
If I understand you correctly you could do something like this:
select year(datecolumn) as Year,
(select id from PatiendRecords pr2 where pr2.id>=min(pr.id)+rand()*max(pr.id) LIMIT 1),
(select id from PatiendRecords pr2 where pr2.id>=min(pr.id)+rand()*max(pr.id) LIMIT 1),
(select id from PatiendRecords pr2 where pr2.id>=min(pr.id)+rand()*max(pr.id) LIMIT 1)
from PatiendRecords pr
group by year(datecolumn);
EDIT
delimiter //
CREATE PROCEDURE RandomRecordsPerYear(n INT)
BEGIN
CREATE TEMPORARY TABLE lookup
(id INT) ENGINE = MEMORY;
SET @x = 0;
REPEAT SET @x = @x + 1;
INSERT INTO lookup (id)
SELECT (SELECT id FROM PatientRecords pr2 WHERE pr2.id>=min(pr.id)+rand()*max(pr.id) LIMIT 1) AS Id FROM PatientRecords pr GROUP BY year(created_at);
UNTIL @x >= n END REPEAT;
SELECT * FROM PatientRecords pr JOIN lookup l ON l.id=pr.id;
DROP TABLE lookup;
END
//
call RandomRecordsPerYear(3)//
PS. I find it pretty cool that you have 50 million patient records in a MySQL database. DS.
SELECT md.*
FROM (
SELECT @r := @r + 1 AS y
FROM (
SELECT @r := 0
) vars
CROSS JOIN
mytable
LIMIT 200
) years
JOIN mytable md
ON md.datecol >= CAST('1900-01-01' AS DATETIME) + INTERVAL y YEAR
AND md.datecol < CAST('1900-01-01' AS DATETIME) + INTERVAL (y + 1) YEAR
AND md.id <=
COALESCE(
(
SELECT id
FROM mytable mi
WHERE mi.datecol >= CAST('1900-01-01' AS DATETIME) + INTERVAL y YEAR
AND mi.datecol < CAST('1900-01-01' AS DATETIME) + INTERVAL (y + 1) YEAR
ORDER BY
id
LIMIT 2, 1
), 0xFFFFFFFF)
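If you are on MySQL 8.0 or later, window functions are available, so the SQL Server query from the question translates almost directly (a sketch; it partitions by the year of studydate rather than by the full date, and uses the table and column names from the question):
-- MySQL 8.0+: up to 2 samples per year using ROW_NUMBER()
SELECT pid, studydate
FROM (
SELECT pid, studydate,
ROW_NUMBER() OVER (PARTITION BY YEAR(studydate) ORDER BY pid DESC) AS rn
FROM patient
) pt
WHERE rn <= 2;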