Quickly update and do statistical normalization of data SQL

Quickly update and do statistical normalization of data SQL - sql

I have a SQL-Server table which needs to be updated via a SQLCLR function. An update to a single row will need to trigger table-wide update. I was wondering how to properly perform the update and normalize the table. The main problem I see is that it's linked to a website, so there could be multiple updates coming in at any time. The website will only see a few hundred visitors and after a short period of time will be closed (collecting data for research).
To get an idea of the SQLClr call:
DECLARE #ID INT = 501;
SELECT dbo.fn_ComputeJaccard(
RectangleTwo.MinX, RectangleTwo.MinY, RectangleTwo.MaxX, RectangleTwo.MaxY,
RectangleOne.MinX, RectangleOne.MinY, RectangleOne.MaxX, RectangleOne.MaxY) as RunningTotal
FROM PreProcessed RectangleOne
INNER JOIN PreProcessed RectangleTwo
ON RectangleTwo.ID <> RectangleOne.ID
WHERE dbo.fn_ComputeJaccard(
RectangleTwo.MinX, RectangleTwo.MinY, RectangleTwo.MaxX, RectangleTwo.MaxY,
RectangleOne.MinX, RectangleOne.MinY, RectangleOne.MaxX, RectangleOne.MaxY) > .97
AND RectangleTwo.ID = #ID
I would need to select this data into a temp table, normalize that table and then perform an update to the original table with the values (newValue*.5 + oldValue*.9) then renormalize the whole table. I imagine this would take a while to process, so I'm looking for the most efficient way of doing that, plus a solution to the multiple updates flying in issue.
Any advice you could give me would be great!
Thanks

Related

SQL Server 2 Table Sync Recursion

I'm using triggers to keep two identical tables in a single database in sync. One is for an internal proprietary system, the other is used to expose a subset of data to the outside world. I'm not able to use the same table for both.
I need updates, inserts, deletes in either of the tables to be applied to the other.
So far I'm using triggers on both tables instead of a scheduled stored procedure because I wanted immediate updates. The problem is, an update to table A fires the trigger to update table B, which fires a trigger to update table A, which fires the trigger to table B.... and so on.
What's the best way to stop the recursion?
One way is to check the data first to see if it is different, something like this:
SELECT #cempno = inserted.cempno FROM inserted
SELECT #count = COUNT(*)
FROM jcempy j INNER JOIN zhhjcempy z ON j.cempno = z.cempno AND j.cempno = #cempno
WHERE (j.ccostcode <> z.ccostcode)
OR (j.cimearnreg <> z.cimearnreg)
OR (j.cimearnot <> z.cimearnot)
OR (j.cimearndt <> z.cimearndt)
OR (j.cimearnl1 <> z.cimearnl1)
IF #count = 1
BEGIN
-- Update the record
END
Another way could be to use a third table that holds status flags to show which table first initiated the update, but I've a feeling managing that will be a problem once I have 100 users hammering away at the system.
Any ideas, or comments on what sort of performance penalties the data check will incur?
Thanks!

If you can't avoid triggers look at ##NESTLEVEL within your trigger.
It should be always the same number. If number is bigger, just do nothing.

SQL Query on table with 30mill records

I have been having problems building a table in my local SQL Server. Orginally it was causing the tempdb table to become full and throw an exception. This has a lot of joins and outer applies, and so to find specifically where the problem lay I did a select on the first table in the sql query to determine how long it took, that was fast so I then added the next table that was the first join in the query and reran, I continued to do this until I found the table that stalled.
I found the problem (or at least the first problem) was with the shipper_container table. This table is huge and pulling it alone gets a System.OutOfMemoryException just showing a select on the results of that table alone (it has only 5 columns). It cuts out at 16 million records but has 30 million rows. It is 1.2GB in size. This doesn't seem so big for me that SQL Management studio couldn't handle it.
Using a WHERE statement to collect values between 1 January - 10 January 2015 still resulted in a search that took over 5 minutes and was still executing when I cancelled. I have also added indexes on each of the select parameters and this did not increase performance either.
Here is the SQL Query. You can see I have commented out the other parameters that have yet to be added in other joins and outer applies.
DECLARE #startDate DATETIME
DECLARE #endDate DATETIME
DECLARE #Shipper_Key INT = NULL
DECLARE #Part_Key INT = NULL
SET #startDate = '2015-01-01'
SET #endDate = '2015-01-10'
SET NOCOUNT ON;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
INSERT Shipped_Container
(
Ship_Date,
Invoice_Quantity,
Shipper_No,
Serial_No,
Truck_Key,
Shipper_Key
)
SELECT
S.Ship_Date,
SC.Quantity,
S.Shipper_No,
SC.Serial_No,
S.Truck_Key,
S.Shipper_Key
FROM Shipper AS S
JOIN Shipper_Line AS SL
--ON SL.PCN = S.PCN
ON SL.Shipper_Key = S.Shipper_Key
JOIN Shipper_Container AS SC
--ON SC.PCN = SL.PCN
ON SC.Shipper_Line_Key = SL.Shipper_Line_Key
WHERE S.Ship_Date >= #startDate AND S.Ship_Date <= #endDate
AND S.Shipper_Key = ISNULL(#Shipper_Key, S.Shipper_Key)
AND SL.Part_Key = ISNULL(#Part_Key, SL.Part_Key)
The server instance is run on the local network - could this be an issue? I really have minimal experience at this and would really appreciate help and as detailed and clear as possible. Often in SQL forums people jump right into technical details I don't follow so well.

Don't do a Select ... From yourtable in SS Management Studio when it return
hundrend of thousand or millions of row. 1GB of data gets a lot bigger when the system has to draw and show it on screen in the Management Studio data sheet
The server instance is run on the local network
When you do a Select ... From yourtable in SSMS, the server must send all the data to your laptop/desktop. This is quite a lot of uneeded presure on the network.
It should not be an issue when you insert because everything stays on the server. However, staying on the server does not mean it will be fast if your data model is not good enough.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
You may get dirty data is you use that... It may be better to remove it unless you know why it is there and why you need it.
I have also added indexes on each of the select parameters and this did not increase performance either
If you mean indexes on :
S.Ship_Date,
SC.Quantity,
S.Shipper_No,
SC.Serial_No,
S.Truck_Key,
S.Shipper_Key
What are their definitions ?
If they are individual indexes on 1 column, you can drop indexes on SC.Quantity, S.Shipper_No, SC.Serial_No and S.Truck_Key. They are not used.
Ship_Date and Shipper_key may be usefull. It all depends on your model and existing primary keys. (which you need to describe, see below)
It will help to give a more accurate answer if you could tell us:
the relation between your 3 tables (which field link A to B and in which direction)
the primary key on your 3 tables
a complete list of all your indexes(and columns) on your 3 tables
If none of your indexes are usefull or if they are missing, it will most likely read the whole 3 tables and try to match them. Because it is pretty big, it does not have enough memory to process it and it uses tempdb to store intermediary data.
For now I will suppose that shipper_key + PCN is the primary key on each tables.
I think you can try that:
You can create an index on S.Ship_Date
Create Index Shipper_Line_Ship_Date(Ship_Date) -- subject to updates according to your Primary Key
The query optimizer may not use the indexes (if they exists) with such a where clause:
AND S.Shipper_Key = ISNULL(#Shipper_Key, S.Shipper_Key)
AND SL.Part_Key = ISNULL(#Part_Key, SL.Part_Key)
you can use:
AND (S.Shipper_Key = #Shipper_Key or #Shipper_Key is null)
AND (SL.Part_Key = #Part_Key or #Part_Keyis null)
It would help to have indexes on Shipper_Key and PCN
Finally
As I already said above, we need to know more about your data model (create table...), primary keys and indexes (create indexes). You can create a modele here http://sqlfiddle.com/ with all 3 create tables and their indexes. Then go to link and add the link here.
In SSMS, you can right click on a table and go to Script Table as / Create To / New Query Window and add it here or in http://sqlfiddle.com/. Only keep the CREATE TABLE ... part down to the first GO.
You can then do the same thing on all you indexes.
You should also add a copy of you query plan.
In SSMS, go to Query menu / Display Estimated Execution Plan and right click to save it as xml (xml is better). It is only an estimation and it won't execute the whole query. It should be pretty fast.

How to convert a loop in SQL to Set-based logic

I have spent a good portion of today and yesterday attempting to decide whether to utilize a loop or cursor in SQL or to figure out how to use set based logic to solve the problem. I am not new to set logic, but this problem seems to be particularly complex.
The Problem
The idea is that if I have a list of all transactions (10's, 100's of millions) and a date they occurred, I can start combining some of that data into a daily totals table so that it is more rapidly view able by reporting and analytic systems. The pseudocode for this is as such:
foreach( row in transactions_table )
if( row in totals_table already exists )
update totals_table, add my totals to the totals row
else
insert into totals_table with my row as the base values
delete ( or archive ) row
As you can tell, the block of the loop is relatively trivial to implement, and as is the cursor/looping iteration. However, the execution time is quite slow and unwieldy and my question is: is there a non-iterative way to perform such a task, or is this one of the rare exceptions where I just have to "suck it up" and use a cursor?
There have been a few discussions on the topic, some of which seem to be similar, but not usable due to the if/else statement and the operations on another table, for instance:
How to merge rows of SQL data on column-based logic? This question doesn't seem to be applicable because it simply returns a view of all sums, and doesn't actually make logical decisions about additions or updates to another table
SQL Looping seems to have a few ideas about selection with a couple of cases statements which seems possible, but there are two operations that I need done dependent upon the status of another table, so this solution does not seem to fit.
SQL Call Stored Procedure for each Row without using a cursor This solution seems to be the closest to what I need to do, in that it can handle arbitrary numbers of operations on each row, but there doesn't seem to be a consensus among that group.
Any advice how to tackle this frustrating problem?
Notes
I am using SQL Server 2008
The schema setup is as follows:
Totals: (id int pk, totals_date date, store_id int fk, machine_id int fk, total_in, total_out)
Transactions: (transaction_id int pk, transaction_date datetime, store_id int fk, machine_id int fk, transaction_type (IN or OUT), transaction_amount decimal)
The totals should be computed by store, by machine, and by date, and should total all of the IN transactions into total_in and the OUT transactions into total_out. The goal is to get a pseudo data cube going.

You would do this in two set-based statements:
BEGIN TRANSACTION;
DECLARE #keys TABLE(some_key INT);
UPDATE tot
SET totals += tx.amount
OUTPUT inserted.some_key -- key values updated
INTO #keys
FROM dbo.totals_table AS tot WITH (UPDLOCK, HOLDLOCK)
INNER JOIN
(
SELECT t.some_key, amount = SUM(amount)
FROM dbo.transactions_table AS t WITH (HOLDLOCK)
INNER JOIN dbo.totals_table AS tot
ON t.some_key = tot.some_key
GROUP BY t.some_key
) AS tx
ON tot.some_key = tx.some_key;
INSERT dbo.totals_table(some_key, amount)
OUTPUT inserted.some_key INTO #keys
SELECT some_key, SUM(amount)
FROM dbo.transactions_table AS tx
WHERE NOT EXISTS
(
SELECT 1 FROM dbo.totals_table
WHERE some_key = tx.some_key
)
GROUP BY some_key;
DELETE dbo.transactions_table
WHERE some_key IN (SELECT some_key FROM #keys);
COMMIT TRANSACTION;
(Error handling, applicable isolation level, rollback conditions etc. omitted for brevity.)
You do the update first so you don't insert new rows and then update them, performing work twice and possibly double counting. You could use output in both cases to a temp table, perhaps, to then archive/delete rows from the tx table.
I'd caution you to not get too excited about MERGE until they've resolved some of these bugs and you have read enough about it to be sure you're not lulled into any false confidence about how much "better" it is for concurrency and atomicity without additional hints. The race conditions you can work around; the bugs you can't.
Another alternative, from Nikola's comment
CREATE VIEW dbo.TotalsView
WITH SCHEMABINDING
AS
SELECT some_key_column(s), SUM(amount), COUNT_BIG(*)
FROM dbo.Transaction_Table
GROUP BY some_key_column(s);
GO
CREATE UNIQUE CLUSTERED INDEX some_key ON dbo.TotalsView(some_key_column(s));
GO
Now if you want to write queries that grab the totals, you can reference the view directly or - depending on query and edition - the view may automatically be matched even if you reference the base table.
Note: if you are not on Enterprise Edition, you may have to use the NOEXPAND hint to take advantage of the pre-aggregated values materialized by the view.

I do not think you need the loop.
You can just
Update all rows/sums that match your filters/ groups
Archive/ delete previous.
Insert all rows that do not match your filter/ groups
Archive/ delete previous.
SQL is supposed to use mass data not rows one by one.

Performing SQL updates in single statements vs batches

I'm working with large databases and need advice on how to optimize my selects/updates. Here's an ex:
create table Book (
BookID int,
Description nvarchar(max)
)
-- 8 million rows
create table #BookUpdates (
BookID int,
Description nvarchar(max)
)
-- 2 million rows
Let's assume that there's 8 million Books and I have to update the genre for 2 million of them.
Problem: the time to run these updates is very long. It will occasionally cause blocking for the users who are also trying to run statements off the database. I've come up with a solution but want to know if there's a better one out there. I have to prepare one-off random updates like this alot (for whatever reason)
-- normal update
update b set b.Description = bu.Description
from Book b
join #BookUpdates bu
on bu.BookID = b.BookID
-- batch update
while (#BookID < #MaxBookID)
begin
update b set b.Description = bu.Description
from Book b
join #BookUpdates bu
on bu.BookID = b.BookID
where bu.BookID >= #BookID
and bu.BookID < #BookID + 5000
set #BookID = #BookID + 5000
end
The second update works a lot faster. I like this solution because I can print status updates to myself on how long it has left and it doesn't cause performance issues on our customers.
Question: am I missing something important here? Indexes on the temp tables?
I updated the EXAMPLE tables so I don't get more normalization comments. Only 1 description per book :)

You can prevent blocking on the query side by using NOLOCK or READUNCOMITTED hints on the SQL queries.
The real issue with performance is probably the accumulation of changes in the log. Your method of batching the changes in groups of 5,000 is quite reasonable. Because you are setting up the updates in a batch table, you might as well calculate the batch number in the table and then do the looping based on that.

I would try your own suggestion first and index the temp table before you run the update:
CREATE INDEX IDX_BookID ON #BookUpdates(BookID)
Try it with the index and without the index and see what the impact on the runtime is. If you want to avoid impacting your users for this test, run it outside working hours (if you can) or copy Book to another temp table first and test against that.
Regardless, given the volume, I expect you will still cause blocking for other processes. If you are unable to schedule your updates at a time when no other processes are running against this table (which would be the ideal solution), your existing batch update appears to be a perfectly valid solution. Indexing the temp table will likely help with that too so you may be able to increase the batch size without causing blocking.

Updating 100k records in one update query

Is it possible, or recommended at all, to run one update query, that will update nearly 100k records at once?
If so, how can I do that? I am trying to pass an array to my stored proc, but it seems not to work, this is my SP:
CREATE PROCEDURE [dbo].[UpdateAllClients]
#ClientIDs varchar(max)
AS
BEGIN
DECLARE #vSQL varchar(max)
SET #vSQL = 'UPDATE Clients SET LastUpdate=GETDATE() WHERE ID IN (' + #ClientIDs + ')';
EXEC(#vSQL);
END
I have not idea whats not working, but its just not updating the relevant queries.
Anyone?

The UPDATE is reading your #ClientIDs (as a Comma Separated Value) as a whole. To illustrate it more, you are doing like this.
assume the #ClientIDs = 1,2,3,4,5
your UPDATE command is interpreting it like this
UPDATE Clients SET LastUpdate=GETDATE() WHERE ID IN ('1,2,3,4,5')';
and not
UPDATE Clients SET LastUpdate=GETDATE() WHERE ID IN (1,2,3,4,5)';
One suggestion to your question is by using subquery on your UPDATE, example
UPDATE Clients
SET LastUpdate = GETDATE()
WHERE ID IN
(
SELECT ID
FROM tableName
-- where condtion
)
Hope this makes sense.

A few notes to be aware of.
Big updates like this can lock up the target table. If > 5000 rows are affected by the operation, the individual row locks will be promoted to a table lock, which would block other processes. Worth bearing in mind if this could cause an issue in your scenario. See: Lock Escalation
With a large number of rows to update like this, an approach I'd consider is (basic):
bulk insert the 100K Ids into a staging table (e.g. from .NET, use SqlBulkCopy)
update the target table, using a join onto the above staging table
drop the staging table
This gives some more room for controlling the process, but breaking the workload up into chunks and doing it x rows at a time.

There is a limit for the number of items you pass to 'IN' if you are giving an array.
So, if you just want to update the whole table, skip the IN condition.
If not specify an SQL inside IN. That should do the job

The database will very likely reject that SQL statement because it is too long.
When you need to update so many records at once, then maybe your database schema isn't appropriate. Maybe the LastUpdate datum should not be stored separately for each client but only once globally or only once for a constant group of clients?
But it's hard to recommend a good course of action without seeing the whole picture.

What version of sql server are you using? If it is 2005+ I would recommend using TVPs (table valued parameters - http://msdn.microsoft.com/en-us/library/bb510489.aspx). The transfer of data will be faster (as opposed to building a huge string) and your query would look nicer:
update c
set lastupdate=getdate()
from clients c
join #mytvp t on c.Id = t.Id

Each SQL statement on its own is a transaction statement . This means sql server is going to grab locks for all these million of rows .It can really degrade the performance of a table .So you really don’t tend to update a table which has million of rows in it which hurts the performance.So the workaround is to set rowcount before DML operation
set rowcount=100
UPDATE Clients SET LastUpdate=GETDATE()
WHERE ID IN ('1,2,3,4,5')';
set rowcount=0
or from SQL server 2008 you can parametrize Top keyword
Declare #value int
set #value=100000
again:
UPDATE top (#value) Clients SET LastUpdate=GETDATE()
WHERE ID IN ('1,2,3,4,5')';
if ##rowcount!=0 goto again
See how fast the above query takes and then adjust and change the value of the variable .You need to break the tasks for smaller units as suggested in the above answers

Method 1:
Split the #clientids with delimiters ','
put in array and iterate over that array
update clients table for each id.
OR
Method 2:
Instead of taking #clientids as a varchar2, follow below steps
create object type table for ids and use join.
For faster processing u can also create index on clientid as well.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas