I have a table containing ~5,000,000 rows of SCADA data, described by the following:
create table data (o int, m money).
Where:
- o is the PK with a clustered index on it; its fill factor is close to 100%. o represents the date of the meter reading and can be thought of as the X axis.
- m is a decimal value lying within the 1..500 region and is the actual meter reading; it can be thought of as the Y axis.
I need to find certain patterns, i.e. when, how often and for how long they have been occurring.
Example: looking for all occurrences of m changing by 500 to 510 within 5 units (well, from 1 to 5) of o, I run the following query:
select d0.o as tFrom, d1.o as tTo, d1.m - d0.m as dValue
from data d0
inner join data d1
on (d1.o = d0.o + 1 or d1.o = d0.o + 2 or d1.o = d0.o + 3 or d1.o = d0.o + 4)
and (d1.m - d0.m) between 500 and 510
The query takes 23 seconds to execute.
The previous version took 30 minutes (roughly 80 times slower). I managed to optimize it with a naive approach, replacing on (d1.o - d0.o) between 1 and 4 with on (d0.o = d1.o - 1 or d0.o = d1.o - 2 or d0.o = d1.o - 3 or d0.o = d1.o - 4).
It's clear to me why it's faster: on the one hand an indexed column scan should work fast enough, and on the other I can afford it because dates are discrete (and I always give 5 minutes of grace time to any o region, so for 120 minutes it's the 115..120 region). I can't use the same approach with m values though, as they aren't integers.
Things I've tried so far:
Soft sharding, by applying where o between @oRegionStart and @oRegionEnd at the bottom of my script and running it within a loop, fetching results into a temp table (a rough sketch is shown after this list). Execution time: 25 seconds.
Hard sharding, by splitting the data into a number of physical tables. The result is 2 minutes, never mind the maintenance hassle.
Using some precooked data structures, like:
create table data_change_matrix (o int, dM5Min money, dM5Max money, dM10Min money, dM10Max money ... dM{N}Min money, dM{N}Max money)
where N is the max depth for which I run the analysis. Having such a table I could easily write a query:
select * from data_change_matrix where dM5Min between 500 and 510
The result: it went nowhere due to the tremendous size requirements (5M x ~250) and the maintenance-related costs; I would need to keep that matrix up to date in close to real time.
SQL CLR: don't even ask me what went wrong, it just didn't work out.
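For reference, here is roughly what the soft-sharding loop from the first item above looked like; the batch size, bounds and temp table name are illustrative rather than my exact script:

-- Rough sketch of the soft-sharding loop (names and batch size illustrative)
create table #results (tFrom int, tTo int, dValue money);

declare @oRegionStart int = 1;
declare @batchSize    int = 500000;
declare @oMax         int = (select max(o) from data);

while @oRegionStart <= @oMax
begin
    insert into #results (tFrom, tTo, dValue)
    select d0.o, d1.o, d1.m - d0.m
    from data d0
    inner join data d1
        on (d1.o = d0.o + 1 or d1.o = d0.o + 2 or d1.o = d0.o + 3 or d1.o = d0.o + 4)
       and (d1.m - d0.m) between 500 and 510
    where d0.o between @oRegionStart and @oRegionStart + @batchSize - 1;

    set @oRegionStart += @batchSize;
end

select * from #results;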
Right now I'm out of inspiration and looking for help.
All in all: is it possible to get close-to-instant response times running this type of query on large volumes of data?
Everything runs on MS SQL Server 2012. I didn't try MS SQL Server 2014, but I'm happy to if it makes sense.
Update - execution plan: http://pastebin.com/PkSSGHvH.
Update 2 - While I really love the LAG function suggested by usr, I wonder if there's a LAGS function allowing for
select o, MIN(LAGS(o, 4)) over(...) - or what's its shortest implementation in T-SQL?
I tried something very similar using SQL CLR and got it working but the performance was awful.
I assume you meant to write "on (d1.o = ..." and not "on (d.o = ...". Anyway, I got pretty drastic improvements just by simplifying the statement (making it easier for the query optimizer to pick a better plan, I guess):
select d0.o as tFrom, d1.o as tTo, d1.m - d0.m as dValue
from data d0
inner join data d1
on d1.o between d0.o + 1 and d0.o + 4
and (d1.m - d0.m) between 500 and 510
Good luck with your query!
You say you've already tried CLR but don't give any code.
It was fastest in my test for my sample data.
CREATE TABLE data
(
o INT PRIMARY KEY,
m MONEY
);
INSERT INTO data
SELECT TOP 5000000 ROW_NUMBER() OVER (ORDER BY @@SPID),
                   1 + ABS(CAST(CRYPT_GEN_RANDOM(4) AS INT) % 500)
FROM   master..spt_values v1,
       master..spt_values v2
None of the versions actually return any results (it is impossible for m to be a decimal value lying within 1..500 and simultaneously for two m values to have a difference of 500 or more), but disregarding this, the typical timings I got for the code submitted so far are:
+-----------------+--------------------+
| | Duration (seconds) |
+-----------------+--------------------+
| Lag/Lead | 39.656 |
| Original code | 40.478 |
| Between version | 21.037 |
| CLR | 13.728 |
+-----------------+--------------------+
The CLR code I used was based on that here
To call it use
EXEC [dbo].[WindowTest]
    @WindowSize = 5,
    @LowerBound = 500,
    @UpperBound = 510
Full code listing
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class StoredProcedures
{
    public struct DataRow
    {
        public int o;
        public decimal m;
    }

    [Microsoft.SqlServer.Server.SqlProcedure]
    public static void WindowTest(SqlInt32 WindowSize, SqlInt32 LowerBound, SqlInt32 UpperBound)
    {
        int windowSize = (int)WindowSize;
        int lowerBound = (int)LowerBound;
        int upperBound = (int)UpperBound;

        // Ring buffer holding the last windowSize rows read so far
        DataRow[] window = new DataRow[windowSize];

        using (SqlConnection conn = new SqlConnection("context connection=true;"))
        {
            SqlCommand comm = new SqlCommand();
            comm.Connection = conn;
            comm.CommandText = @"
                SELECT o,m
                FROM data
                ORDER BY o";

            SqlMetaData[] columns = new SqlMetaData[3];
            columns[0] = new SqlMetaData("tFrom", SqlDbType.Int);
            columns[1] = new SqlMetaData("tTo", SqlDbType.Int);
            columns[2] = new SqlMetaData("dValue", SqlDbType.Money);

            SqlDataRecord record = new SqlDataRecord(columns);
            SqlContext.Pipe.SendResultsStart(record);

            conn.Open();
            SqlDataReader reader = comm.ExecuteReader();

            int counter = 0;

            while (reader.Read())
            {
                DataRow thisRow = new DataRow() { o = (int)reader[0], m = (decimal)reader[1] };

                // Compare the current row against every buffered row and stream
                // out the pairs whose difference falls within the bounds
                int i = 0;
                while (i < windowSize && i < counter)
                {
                    DataRow previousRow = window[i];
                    var diff = thisRow.m - previousRow.m;

                    if (((thisRow.o - previousRow.o) <= windowSize - 1) && (diff >= lowerBound) && (diff <= upperBound))
                    {
                        record.SetInt32(0, previousRow.o);
                        record.SetInt32(1, thisRow.o);
                        record.SetDecimal(2, diff);
                        SqlContext.Pipe.SendResultsRow(record);
                    }
                    i++;
                }

                // Overwrite the oldest slot in the ring buffer with the current row
                window[counter % windowSize] = thisRow;
                counter++;
            }
            SqlContext.Pipe.SendResultsEnd();
        }
    }
}
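To deploy the procedure, something along these lines should work; the assembly name and DLL path below are placeholders, and CLR integration has to be enabled on the instance:

-- Enable CLR integration (once per instance)
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO

-- Register the compiled assembly (placeholder name and path)
CREATE ASSEMBLY WindowTestAssembly
FROM 'C:\path\to\WindowTest.dll'
WITH PERMISSION_SET = SAFE;
GO

-- Expose the CLR method as a T-SQL procedure
CREATE PROCEDURE dbo.WindowTest
    @WindowSize INT,
    @LowerBound INT,
    @UpperBound INT
AS EXTERNAL NAME WindowTestAssembly.StoredProcedures.WindowTest;
GO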
This looks like a great case for windowed aggregate functions or LAG. Here's a version using LAG:
select *
from (
select o
, lag(m, 4) over (order by o) as m4
, lag(m, 3) over (order by o) as m3
, lag(m, 2) over (order by o) as m2
, lag(m, 1) over (order by o) as m1
, m as m0
from data
) x
where 0=1
or (m1 - m0) between 500 and 510
or (m2 - m0) between 500 and 510
or (m3 - m0) between 500 and 510
or (m4 - m0) between 500 and 510
Using a windowed aggregate function you should be able to remove the manual expansion of those LAG calls.
SQL Server implements these things using a special execution plan operator called Window Spool. That makes it quite efficient.
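For example, an untested sketch of that idea: compare the current reading against the minimum of the previous four readings using a window frame. It is not a drop-in replacement for the pairwise query above (it only tests the largest rise within the window), but it shows the frame that Window Spool services:

select o as tTo,
       m - minPrev as dValueMax
from (
    select o, m,
           -- smallest of the previous four readings
           min(m) over (order by o
                        rows between 4 preceding and 1 preceding) as minPrev
    from data
) x
where m - minPrev between 500 and 510;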
I'm writing a function that should add up each line item's quantity multiplied by its unit cost, iterating through the entire pick ticket (PT). I don't get an error when altering the function in SQL Server or when running it, but it gives me 0 as the output each time.
Here is an example:
[PT 1]
[Line 1 - QTY: 10 Unit Cost: $5.00] total should be = $50.00
[Line 2 - QTY: 5 Unit Cost: $2.50] total should be = $12.50
The function should output - $62.50
Not really sure what I'm missing here, but would appreciate the help.
Alter Function fn_CalculateAllocatedPTPrice
(@psPickTicket TPickTicketNo)
-------------------------------
Returns TInteger
As
Begin
    Declare
        @iReturn            TInteger,
        @iTotalLineNumbers  TInteger,
        @iIndex             TInteger,
        @fTotalCost         TFloat;

    set @iIndex = 1;
    set @iTotalLineNumbers = (ISNULL((select top 1 PickLineNo
                                      from tblPickTicketDtl
                                      where PickTicketNo = @psPickTicket
                                      order by PickLineNo desc), 0)) /* This returns the highest line number */

    while(@iIndex <= @iTotalLineNumbers)
    BEGIN
        /* This should be adding up the total cost of each line item on the PT */
        set @fTotalCost = @fTotalCost + (ISNULL((select SUM(P.RetailUnitPrice*P.UnitsOrdered)
                                                 from tblPickTicketDtl P
                                                 left outer join tblCase C on (P.PickTicketNo = C.PickTicketNo)
                                                 where P.PickTicketNo = @psPickTcket
                                                 and P.PickLineNo = @iIndex
                                                 and C.CaseStatus in ('A','G','K','E','L','S')), 0))
        set @iIndex = @iIndex + 1;
    END

    set @iReturn = @fTotalCost;

_Return:
    Return(@iReturn);
End /* fn_CalculateAllocatedPTPrice */
It seems simple aggregation should suffice.
A few points to note:
WHILE loops and cursors are very rarely needed in SQL. You should stick to set-based solutions, and if you find yourself writing a loop you should question your code from its beginnings.
Scalar functions are slow and inefficient. Use an inline table function, which you can correlate with your main query with either an APPLY or a subquery (a usage sketch follows the function below).
Your left join becomes an inner join because of the where predicate
User defined types are not normally a good idea (when they are just aliasing system types)
CREATE OR ALTER FUNCTION fn_CalculateAllocatedPTPrice
(@psPickTicket TPickTicketNo)
RETURNS TABLE AS RETURN
    SELECT fTotalCost = ISNULL((
        SELECT SUM(P.RetailUnitPrice * P.UnitsOrdered)
        from tblPickTicketDtl P
        join tblCase C on (P.PickTicketNo = C.PickTicketNo)
        where P.PickTicketNo = @psPickTicket
        and C.CaseStatus in ('A','G','K','E','L','S')
    ), 0);
GO
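It can then be called either standalone or correlated per ticket with CROSS APPLY; the ticket number and the tblPickTicketHdr header table below are made up for illustration:

-- Standalone, for a single pick ticket (ticket number is a placeholder)
SELECT fTotalCost
FROM dbo.fn_CalculateAllocatedPTPrice('PT0001');

-- Correlated per ticket against a header table (tblPickTicketHdr assumed for illustration)
SELECT H.PickTicketNo, F.fTotalCost
FROM tblPickTicketHdr H
CROSS APPLY dbo.fn_CalculateAllocatedPTPrice(H.PickTicketNo) F;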
Note: I'm not looking for paid software to do this job (too expensive).
We have an issue with cash management to match the values.
I have two SQL Tables, let's call it SHOP_CASH and BANK_CASH
1) The matching should happen based on ShopName-CashAmount-Date.
2) Here I faced two issues:
The cash should be rounded to the nearest £50; ideally, 12,400 and 12,499 should both round to 12,450. OR, instead of that ideal, a match based on a cash difference of less than 50: if the difference between two values is less than 50, match them. But the question is how to match the values up; these are just rough ideas and I'm stuck.
Dates: the shop can cash up a few days later, so I need to join the cash-up date (for example 2018-10-26) against a bank date RANGE from 2018-10-26 to 2018-11-02 (+7 days).
Currently I do not see a logical way of matching in these circumstances. Any logical path of calculation/joining would be extremely appreciated.
TRY:
Let's say I can join two tables by SHOPNAME - Cool
Then I will try to join by date, which potentially will be:
SELECT * FROM SHOP_CASH AS SC
LEFT JOIN BANK_CASH AS BC
ON SC.SHOP_NAME_SC = BC.SHOP_NAME_BC
AND SC.DATE_SC = (ANY DATE FROM SC.DATE_SC TO SC.DATE_SC (+7 DAYS) = TO DATE_BC - not sure how)
AND FLOOR(SC.CASH_SC / 50) * 50 = FLOOR(BC.CASH_BC / 50) * 50
P.S. For this project I will be using Google BigQuery.
This is my (temporary) solution:
WITH MAIN AS(SELECT
CMS.Store_name AS STORE_NAME,
CMS.Date AS SHOP_DATE,
CMB.ENTRY_DATE AS BANK_DATE,
SUM(CMS.Cash) AS STORE_CASH,
SUM(CMB.AMOUNT) AS BANK_CASH
FROM `store_data` CMS
LEFT JOIN `bank_data` AS CMB
ON CMS.store_name = CMB.STRAIGHT_LOOKUP
AND FLOOR(CMS.Cash / 50) * 50 = FLOOR(CMB.AMOUNT / 50) * 50
AND CAST(FORMAT_DATE("%F",CMB.ENTRY_DATE) AS STRING) > CAST(FORMAT_DATE("%F",CMS.Date) AS STRING)
AND CAST(FORMAT_DATE("%F",CMB.ENTRY_DATE) AS STRING) <= CAST(FORMAT_DATE("%F",DATE_ADD(CMS.Date, INTERVAL 4 day)) AS STRING)
GROUP BY STORE_NAME,SHOP_DATE,BANK_DATE)
SELECT
MAIN2.*
FROM (
SELECT
ARRAY_AGG(MAIN ORDER BY MAIN.SHOP_DATE ASC LIMIT 1)[OFFSET(0)] AS MAIN2
FROM
MAIN AS MAIN
GROUP BY MAIN.SHOP_DATE, MAIN.STORE_CASH)
This is quite an interesting case.
You haven't provided any sample data so I'm not able to test it, but this may work. Some modification may be required since I'm not sure about the date format. Let me know if there is an issue.
SELECT * FROM SHOP_CASH AS SC
LEFT JOIN BANK_CASH AS BC
ON SC.SHOP_NAME_SC = BC.SHOP_NAME_BC
AND SC.DATE_SC BETWEEN BC.DATE_BC AND DATE_ADD(BC.DATE_BC, INTERVAL 7 DAY)
AND TRUNC(SC.CASH_SC, -2) + 50 = TRUNC(BC.CASH_BC, -2) + 50
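If the £50 banding turns out to be too strict (for example, 12,449 and 12,451 fall into different bands), an alternative is to match on an absolute cash difference of less than 50; a sketch in BigQuery syntax, reusing the column names from the question:

SELECT *
FROM SHOP_CASH AS SC
LEFT JOIN BANK_CASH AS BC
  ON SC.SHOP_NAME_SC = BC.SHOP_NAME_BC
 -- bank entry may land up to 7 days after the shop's cash-up date
 AND BC.DATE_BC BETWEEN SC.DATE_SC AND DATE_ADD(SC.DATE_SC, INTERVAL 7 DAY)
 -- tolerance-based amount match instead of rounding to £50 bands
 AND ABS(SC.CASH_SC - BC.CASH_BC) < 50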
I've got a very interesting issue currently with a very simple query:
SELECT TOP 2 ID, StoreID, UserID, Lat, Long
FROM EntityLocation
WHERE UserID = 'NS4089'
This query (in Enterprise Manager) runs in < 1 ms timed using:
SELECT GETDATE();
SELECT TOP 2 ID, StoreID, UserID, Lat, Long
FROM EntityLocation
WHERE UserID = 'NS4089';
SELECT GETDATE();
The data returned is:
3196 NULL NS4089 -33.720801 151.019014 (so not big at all)
The table schema is:
ID (PK, int not null)
StoreID (int, null)
UserID (nvarchar(50), null)
Lat (nvarchar(50), not null)
Long (nvarchar(50), not null)
Using ADO.Net this query takes 213ms:
using (var conn = new SqlConnection("connection_string")) {
var start = DateTime.Now;
var command = new SqlCommand("SELECT TOP 2 ID, StoreID, UserID, Lat, Long FROM EntityLocation WHERE UserID='NS4089'", conn);
conn.Open();
var reader = command.ExecuteReader();
while (reader.Read())
Console.WriteLine("\t{0}\t{1}\t{2}", reader[0], reader[1], reader[2]);
Console.WriteLine($"\nQuery Took: {DateTime.Now.Subtract(start).TotalSeconds}s\n");
}
Using a custom query in Entity Framework, the query takes 1642ms:
var start = DateTime.Now;
var e = db.EntityLocation.SqlQuery("SELECT TOP 2 * FROM EntityLocation WHERE UserID='NS4089'").Single();
Console.WriteLine($"\nQuery Took: {DateTime.Now.Subtract(start).TotalSeconds}s\n");
Using the generated (DB first) classes the query takes: 2746ms
var start = DateTime.Now;
db.EntityLocation.Single(r => r.UserID == "NS4089");
Console.WriteLine($"\nQuery Took: {DateTime.Now.Subtract(start).TotalSeconds}s\n");
My question is how is it possible that:
ADO.NET takes > 200 ms to run the query and transfer such a small amount of data. Latency could explain this since the DB is in the cloud, so let's assume 200 ms is OK. I could test further, but this is not the big issue.
Entity Framework is so slow. Logging the queries run by EF shows the query executing in 50 ms, but my timing shows > 2 seconds.
I have tried disabling lazy loading, change detection, etc.; nothing improves the performance.
Any hints/pointers on this one? It's driving me nuts.
I have a query that returns the probability that a token has a certain classification.
token class probPaired
---------- ---------- ----------
potato A 0.5
potato B 0.5
potato C 1.0
potato D 0.5
time A 0.5
time B 1.0
time C 0.5
I need to aggregate the probabilities of each class by multiplying them together.
-- Imaginary MUL operator
select class, MUL(probPaired) from myTable group by class;
class probability
---------- ----------
A 0.25
B 0.5
C 0.5
D 0.5
How can I do this in SQLite? SQLite doesn't have features like LOG/EXP or variables, which are the solutions mentioned in other questions.
In general, if SQLite can't do it you can write a custom function instead. The details depend on what programming language you're using; here it is in Perl using DBD::SQLite. Note that functions created this way are not stored procedures: they exist for that connection and must be recreated each time you connect.
For an aggregate function, you have to create a class which handles the aggregation. MUL is pretty simple, just an object to store the product.
{
    package My::SQLite::MUL;

    sub new {
        my $class = shift;
        my $mul = 1;
        return bless \$mul, $class;
    }

    sub step {
        my $self = shift;
        my $num = shift;

        $$self *= $num;

        return;
    }

    sub finalize {
        my $self = shift;

        return $$self;
    }
}
Then you'd install that as the aggregate function MUL which takes a single argument and uses that class.
my $dbh = ...doesn't matter how the connection is made...
$dbh->sqlite_create_aggregate("MUL", 1, "My::SQLite::MUL");
And now you can use MUL in queries.
my $rows = $dbh->selectall_arrayref(
"select class, MUL(probPaired) from myTable group by class"
);
Again, the details will differ with your particular language, but the basic idea will be the same.
This is significantly faster than fetching each row and taking the aggregate product.
You can calculate row numbers and then use a recursive cte for multiplication. Then get the max rnum (calculated row_number) value for each class which contains the final result of multiplication.
--Calculating row numbers
with rownums as (select t1.*,
(select count(*) from t t2 where t2.token<=t1.token and t1.class=t2.class) as rnum
from t t1)
--Getting the max rnum for each class
,max_rownums as (select class,max(rnum) as max_rnum from rownums group by class)
--Recursive cte starts here
,cte(class,rnum,probPaired,running_mul) as
(select class,rnum,probPaired,probPaired as running_mul from rownums where rnum=1
union all
select t.class,t.rnum,t.probPaired,c.running_mul*t.probPaired
from cte c
join rownums t on t.class=c.class and t.rnum=c.rnum+1)
--Final value selection
select c.class,c.running_mul
from cte c
join max_rownums m on m.max_rnum=c.rnum and m.class=c.class
SQL Fiddle
Calculating geometrically linked returns.
How do you multiply record2 * record1?
The desire is to return a value for the actual rate and the annualized rate.
Given this interval table:
EndDate PctReturn
-------------------------------
1. 05/31/06 -0.2271835
2. 06/30/06 -0.1095986
3. 07/31/06 0.6984908
4. 08/31/06 1.4865360
5. 09/30/06 0.8938896
The desired output should look like this:
EndDate PctReturn Percentage UnitReturn
05/31/06 -0.2271835 -0.002272 0.997728
06/30/06 -0.1095986 -0.001096 0.996634669
07/31/06 0.6984908 0.006985 1.00359607
08/31/06 1.4865360 0.014865 1.018514887
09/30/06 0.8938896 0.008939 1.027619286
Percentage = PctReturn / 100
UnitReturn = running product of (1 + Percentage), i.e. (1 + S1) x (1 + S2) x ... x (1 + Sn)
Aggregating values desired:
Actual Rate 2.761928596
Annualized 6.757253223
Mathematics on the aggregated values:
Actual rate: (1.027619 - 1) * 100 = 2.761928596
Annualized rate: (1.027619^(12 / number of intervals) - 1) * 100 = 6.757253223
Number of intervals in the example = 5 (there are only 5 records, i.e. intervals).
I did try using SUM in the select statement, but that does not allow multiplying record2 by record1 to link returns. I thought a WHILE loop would allow stepping record by record to multiply up the UnitReturn values. My starter level in SQL has me looking for help.
You have two options for getting a product in SQL Server.
1. Simulate using logs and exponents:
SQL Fiddle
create table returns
(
returnDate date,
returnValue float
)
insert into returns values('05/31/06', -0.002271835)
insert into returns values('06/30/06', -0.001095986)
insert into returns values('07/31/06', 0.006984908)
insert into returns values('08/31/06', 0.014865360)
insert into returns values('09/30/06', 0.008938896)
select totalReturn = power
(
cast(10.0 as float)
, sum(log10(returnValue + 1.0))
) - 1
from returns;
with tr as
(
select totalReturn = power
(
cast(10.0 as float)
, sum(log10(returnValue + 1.0))
) - 1
, months = cast(count(1) as float)
from returns
)
select annualized = power(totalReturn + 1, (1.0 / (months / 12.0))) - 1
from tr;
This leverages logs and exponents to simulate a product calculation: because log10(a) + log10(b) = log10(a * b), raising 10 to the SUM of the logs yields the product of the (1 + return) terms. More info: User defined functions.
The one issue here is that it will fail for returns < -100%. If you don't expect these it's fine; otherwise you'll need to cap any values < -100% at -100%.
You can then use this actual return to get an annualized return as required.
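If you also want the running UnitReturn column shown in the question, the same log/exponent trick works as a windowed sum (SQL Server 2012 and later); an untested sketch against the returns table above:

select returnDate,
       returnValue,
       power(cast(10.0 as float),
             sum(log10(returnValue + 1.0))
                 over (order by returnDate
                       rows between unbounded preceding and current row)
            ) as unitReturn   -- cumulative product of (1 + return)
from returns;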
2. Define a custom aggregate with CLR:
See Books Online.
You can create a CLR custom function and then link it as an aggregate for use in your queries. This is more work and you'll have to enable CLR on your server, but once it's done you can use it as much as required.
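For illustration only, registration and usage might look roughly like this once such an aggregate is compiled; the assembly and class names are placeholders, not an existing library:

-- Registration: assembly and class names are placeholders for your compiled CLR aggregate
CREATE AGGREGATE dbo.Product (@value FLOAT)
RETURNS FLOAT
EXTERNAL NAME ProductAssembly.[ProductAggregate];
GO

-- Usage: the geometric link then becomes a single aggregate call
SELECT totalReturn = dbo.Product(returnValue + 1.0) - 1
FROM returns;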