This is a best practice/other approach question about using an ADO Enumerator ForEach loop.
My data is financial accounts, coming from a source system into a data warehouse.
The current structure of the data is a list of financial transactions, e.g.:
+-----------------------+----------+-----------+------------+------+
| AccountGUID | Increase | Decrease | Date | Tags |
+-----------------------+----------+-----------+------------+------+
| 00000-0000-0000-00000 | 0 | 100.00 | 01-01-2018 | Val1 |
| 00000-0000-0000-00000 | 200.00 | 0 | 03-01-2018 | Val3 |
| 00000-0000-0000-00000 | 400.00 | 0 | 06-01-2018 | Val1 |
| 00000-0000-0000-00000 | 0 | 170.00 | 08-01-2018 | Val1 |
| 00000-0000-0000-00002 | 200.00 | 0 | 04-01-2018 | Val1 |
| 00000-0000-0000-00002 | 0 | 100.00 | 09-01-2018 | Val1 |
+-----------------------+----------+-----------+------------+------+
My SSIS package currently has two ForEach loops:
All Time Balances
End Of Month Balances
All Time Balances
Passes the AccountGUID into the loop and selects all transactions for that account. It then orders them by date, earliest first, and assigns each one a sequence number.
Once the sequence numbers are assigned, it calculates the running balances from the Increase and Decrease columns, using the Tags column to work out which balance it is dealing with.
It finishes by marking the latest record with a Current flag.
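To make the per-account logic above concrete, here is a minimal Python sketch of it run over the sample rows. It is only an illustration, not the actual script component: it assumes the dates are day-month-year, keeps a single balance per account, and ignores the Tags column, whose rules are not described here. The function name `running_balances` and the output keys are made up for the example.

```python
from collections import defaultdict
from datetime import datetime
from decimal import Decimal

# Sample rows copied from the table above: (account, increase, decrease, date, tag).
TRANSACTIONS = [
    ("00000-0000-0000-00000", "0", "100.00", "01-01-2018", "Val1"),
    ("00000-0000-0000-00000", "200.00", "0", "03-01-2018", "Val3"),
    ("00000-0000-0000-00000", "400.00", "0", "06-01-2018", "Val1"),
    ("00000-0000-0000-00000", "0", "170.00", "08-01-2018", "Val1"),
    ("00000-0000-0000-00002", "200.00", "0", "04-01-2018", "Val1"),
    ("00000-0000-0000-00002", "0", "100.00", "09-01-2018", "Val1"),
]


def running_balances(transactions):
    """Return one dict per transaction with sequence, balance and current flag."""
    by_account = defaultdict(list)
    for account, increase, decrease, date_text, tag in transactions:
        # Assumption: dates are day-month-year.
        when = datetime.strptime(date_text, "%d-%m-%Y").date()
        by_account[account].append((when, Decimal(increase), Decimal(decrease), tag))

    results = []
    for account, rows in by_account.items():
        rows.sort(key=lambda row: row[0])  # oldest transaction first
        balance = Decimal("0")
        for sequence, (when, increase, decrease, tag) in enumerate(rows, start=1):
            balance += increase - decrease
            results.append({
                "account": account,
                "sequence": sequence,
                "date": when,
                "tag": tag,
                "balance": balance,
                "is_current": sequence == len(rows),  # flag the latest record
            })
    return results
```

Calling `running_balances(TRANSACTIONS)` processes every account in one pass over the data instead of one round trip per account.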
All Time Balances - Work Flow
->Get All Account IDs in Staging table
|-> Write all Account GUIDs to object variable
|--> ADO Enumerator ForEach - Loop Account GUID List - Write GUID to variable
|---> (Data Flow) Select all transactions for Account GUID
|----> (Data Flow) Order all transactions by date and assign Sequence number
|-----> (Data Flow) Run each row through a script component transformation to calculate running totals for each record
|------> (Data Flow) Insert balance data into staging table
End Of Month Balances
The second package, End of Month, does something very similar, with the exception of a second loop. The select finds the earliest transactional record and the latest transactional record. Using those two dates, it works out all the months between them and loops once for each of those months.
Inside the date loop, it does much the same thing: it works out the balances based on tags and stamps the end-of-month record for each account.
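The "figure out all the months between those two dates" step could look roughly like the following Python sketch. The helper name `month_ends` is hypothetical, and the real package may represent months differently.

```python
import calendar
from datetime import date


def month_ends(first, last):
    """Yield the last day of every month from first's month to last's month."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        last_day = calendar.monthrange(year, month)[1]
        yield date(year, month, last_day)
        # Advance to the following month, rolling over the year in December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
```

For example, `list(month_ends(date(2017, 11, 20), date(2018, 2, 3)))` gives the month ends for November 2017 through February 2018.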
The Issue/Question
All of this currently works, but the performance is horrible.
In one database with approximately 8,000 accounts and 500,000 transactions, this process takes upwards of a day to run. As this is one of our smaller clients, I tremble at the idea of running it for our heavy databases.
Is there a better approach to doing this, using SQL cursors or some other neat way I have not seen?
Ok, so I have managed to take my package execution from around 3 days to about 11 minutes all up.
I ran a profiler and the standard Windows stats while running the loops and found a few interesting things.
Firstly, there was almost no utilization of HDD, CPU, RAM or network during the execution of the packages. That told me what I kind of already knew: it was not running as quickly as it could.
What I did notice was a 1 to 2 ms delay between each execution of the loop, before the next iteration started.
Eventually I found that every time a new iteration of the loop began, SSIS created a new connection to the SQL database. This appears to be SSIS's default behavior, so whenever you add a Source or Destination, you are adding a connection delay to your project.
The Fix:
Now this was an odd fix. You need to go into your connection manager, and (the odd bit) it must be the one in the on-screen window, not the one in the right-hand project manager window.
If you select the connection that is referenced in the loop, the properties window on the right side (in my layout anyway) shows an option called "RetainSameConnection", which by default is set to false.
By setting this to true, I eliminated the 2ms delay.
Considerations:
In doing this I created a heap of other issues, which really just highlighted areas of my package that I had not thought out well.
Some things that appear to be impacted by this change were stored procedures that used temp tables; these seemed to break instantly. I assume that is because of how SQL handles temp tables: when the connection is closed and reopened, you can be pretty certain that the temp table is gone, but with the same connection retained, a temp table left over from an earlier step can still be there when the next step tries to create it.
I removed all temp tables and replaced them with CTEs, which appears to have fixed this issue.
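The connection-scoped behavior of temp tables can be demonstrated with a small Python sketch. Note that this uses SQLite purely as a stand-in, not SQL Server or SSIS, so the exact rules differ. It only illustrates the general idea that a temp table lives as long as the connection that created it.

```python
import os
import sqlite3
import tempfile

# A file-backed database so that two separate connections can open it.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

retained = sqlite3.connect(db_path)
retained.execute("CREATE TEMP TABLE work (id INTEGER)")

# Same connection reused: the temp table from the earlier step is still there,
# so creating it again fails.
try:
    retained.execute("CREATE TEMP TABLE work (id INTEGER)")
    collided = False
except sqlite3.OperationalError:
    collided = True

# A brand-new connection starts with no temp tables, so the same statement works.
fresh = sqlite3.connect(db_path)
fresh.execute("CREATE TEMP TABLE work (id INTEGER)")
fresh_rows = fresh.execute("SELECT COUNT(*) FROM work").fetchone()[0]

retained.close()
fresh.close()
```

Here `collided` ends up true for the reused connection, while the fresh connection creates its own empty table without complaint.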
The second major issue I found was with tasks that ran in parallel and used the same connection manager. From these I received an error saying that SQL was still trying to run the previous statement, which bombed out my package.
To get around this, I created duplicate connection managers (all up I made three connection managers for the same database).
Once I had my connections set up, I went into each of my parallel Sources and Destinations and assigned each its own connection manager. This appears to have resolved the last error I received.
Conclusion:
There may be more unforeseen issues in doing this, but for now my packages are lightning quick, and this highlighted some faults in my design.
Related
I don't know how to phrase my question right, but to provide further details about the problem I am trying to solve, let me describe my application. Suppose I am trying to implement a queue reservation application, and I maintain the number of slots in a table roughly like this:
id | appointment | slots_available | slots_total
---------------------------------------------------
1 | apt 1 | 30 | 30
2 | apt 2 | 1 | 5
.. | .. | .. | ..
So, in a competitive scenario, assuming that everything works on the application side of things, the following can happen:
user 1 -> reserves apt 2 -> [validate if slot exists] -> update slots_available to 0 -> reserve (insert a record)
user 2 -> reserves apt 2 -> validate if slot exists -> [update slots_available to 0] -> reserve (insert a record)
What if users 1 and 2 both happen to find a slot available for apt 2 at the same time in the user interface? (Of course I would validate first that there is a slot, but they would see the same value in the UI as long as neither of them has clicked yet.) Then the two submit a reservation at the same time.
Now what if user 1 validates that a slot is available even though user 2 has already taken it, because user 2's update operation is not yet done? Then there will be two inserts.
In any case, how do I ensure that only one of them gets the reservation at the database level? I'm sure this is a common scenario, but I have no idea yet how to implement something like this. A suggestion to remodel would also be acceptable as long as it solves the scenario.
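One common way to get this guarantee is to fold the "validate" and "update" steps into a single conditional UPDATE and check how many rows it changed. The sketch below shows the idea with Python and SQLite. The `reservations` table, its columns and the `reserve` function are assumptions made for the example, and other databases may need different locking or isolation settings.

```python
import sqlite3


def reserve(conn, appointment_id, user):
    """Try to take one slot; return True only if this caller got it."""
    with conn:  # one transaction: commits on success, rolls back on error
        cursor = conn.execute(
            "UPDATE appointments SET slots_available = slots_available - 1 "
            "WHERE id = ? AND slots_available > 0",
            (appointment_id,),
        )
        if cursor.rowcount == 0:
            return False  # no slot left: someone else took the last one
        conn.execute(
            "INSERT INTO reservations (appointment_id, user_name) VALUES (?, ?)",
            (appointment_id, user),
        )
        return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE appointments (id INTEGER PRIMARY KEY, appointment TEXT, "
    "slots_available INTEGER, slots_total INTEGER)"
)
conn.execute("CREATE TABLE reservations (appointment_id INTEGER, user_name TEXT)")
conn.execute("INSERT INTO appointments VALUES (2, 'apt 2', 1, 5)")
conn.commit()

first = reserve(conn, 2, "user 1")
second = reserve(conn, 2, "user 2")
```

Because the check and the decrement happen in one statement, the second caller finds no row matching `slots_available > 0` and never reaches the insert.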
I have a requirement to calculate productivity in our issue tracking software (Jira). The idea is to capture the data as an issue progresses through different development stages (like an audit trail of the issue for a specific field in it).
Now I want to build a data model that enables me to capture metrics like the average amount of time it took for issues to move between two stages (e.g. In Progress > UAT), the average time for each developer, etc.
The audit trail view would give me data in this format:
Audit ID | Issue ID| Developer | Issue-stage| Data_Update_dt
A001 | 101 | D01 | In Progress| 31-May-17 00:25:00
A002 | 101 | D01 | UAT | 31-May-17 06:25:00
I am trying to understand the design needed to calculate the difference between A002 and A001, i.e. the time required to move from In Progress to UAT. What is the best way to do it?
Please advise.
I would create a dimensional model/star schema that has an 'accumulating snapshot fact' as its central fact, with one row per issue as it moves through the system.
Accumulating Snapshot Facts
In the fact table you'd include key dates/times for each stage. You'd also be able to add 'lag' measures to the fact, precalculated as the gap between one stage and the next, and/or time spent in a stage.
The fact would be surrounded by dimensions for dates, times and developers.
Then you'd be able to calculate averages on those lags and be able to analyse by developer.
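As a rough illustration of how the audit rows collapse into one accumulating-snapshot row per issue with a precalculated lag, here is a small Python sketch using the two sample rows from the question. The structure and the names `build_snapshot` and `lag_hours` are assumptions for the example; in a real warehouse this would be done in the load process.

```python
from datetime import datetime

# Audit rows from the example: (audit_id, issue_id, developer, stage, updated_at).
AUDIT_ROWS = [
    ("A001", 101, "D01", "In Progress", datetime(2017, 5, 31, 0, 25)),
    ("A002", 101, "D01", "UAT", datetime(2017, 5, 31, 6, 25)),
]


def build_snapshot(audit_rows):
    """Collapse the audit trail into one row per issue with a date per stage."""
    snapshot = {}
    for _audit_id, issue_id, developer, stage, updated_at in audit_rows:
        row = snapshot.setdefault(issue_id, {"developer": developer, "stages": {}})
        # Keep the first time the issue entered each stage.
        row["stages"].setdefault(stage, updated_at)
    return snapshot


def lag_hours(row, from_stage, to_stage):
    """Hours between entering two stages, or None if either is missing."""
    stages = row["stages"]
    if from_stage not in stages or to_stage not in stages:
        return None
    return (stages[to_stage] - stages[from_stage]).total_seconds() / 3600


snapshot = build_snapshot(AUDIT_ROWS)
in_progress_to_uat = lag_hours(snapshot[101], "In Progress", "UAT")
```

With the lag stored per issue, averages by developer or by stage pair become simple aggregations over the fact rows.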
Folks, as this question relates to IDENTITY columns and merge replication, if I may ask you to refrain from answering with "use GUIDs instead". I'm acutely aware of the benefits and limitations of both and have been using SQL replication with CE since SQL Server 2000. Very occasionally I get surprised. This is such a case.
This is a complex description of the problem so please bear with me.
Below is an extract from https://msdn.microsoft.com/en-us/library/ms152543.aspx, which is what I've always understood with regard to identity ranges and thresholds.
"Subscribers running SQL Server Compact or previous versions of SQL Server are assigned only the primary range; assignment of new ranges is controlled by the @threshold parameter. Additionally, a republishing Subscriber has only the range specified in the @identity_range parameter; it must use this range for local changes and for changes at Subscribers that synchronize with the republishing Subscriber. For example, you could specify 10000 for @pub_identity_range, 500000 for @identity_range and 80 percent for @threshold. After 8000 inserts at a Subscriber (80 percent of 10000), the Publisher is assigned a new range. When a new range is assigned, there will be a gap in the identity range values in the table. Specifying a higher threshold results in smaller gaps, but the system is less fault-tolerant: if the Merge Agent cannot run for some reason, a Subscriber could more easily run out of identities."
If we assume this is true for the moment, we get to the start of my problem.
To help users using our application we have been using a variation of the following query to let clients know they might run out of identities if they keep going and to initiate a sync to get a new range.
SELECT
AUTOINC_MAX, AUTOINC_NEXT, AUTOINC_MAX-AUTOINC_NEXT
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE
TABLE_NAME = N'Asset'
AUTOINC_MAX | AUTOINC_NEXT | AUTOINC_MAX-AUTOINC_NEXT
------------+--------------+-------------------------
3081898 | 3080899 | 999
By evaluating AUTOINC_MAX - AUTOINC_NEXT (= 999) we can see when we're getting low on IDs
In code we are looking at AUTOINC_MAX - AUTOINC_MIN which gives the allocated range. Using the default threshold of 80% and the remaining range we can advise clients to sync if they look like running out.
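The check described above amounts to the following arithmetic, sketched here in Python. The function name `identity_usage` is made up for illustration, and the AUTOINC_MIN value used in the example call (3080898) is an assumption, since only AUTOINC_MAX and AUTOINC_NEXT are shown in the query output.

```python
def identity_usage(autoinc_min, autoinc_max, autoinc_next, threshold=0.8):
    """Return (remaining, fraction_used, should_sync) for one identity range."""
    allocated = autoinc_max - autoinc_min    # size of the allocated range
    remaining = autoinc_max - autoinc_next   # IDs still available
    fraction_used = (allocated - remaining) / allocated
    return remaining, fraction_used, fraction_used >= threshold


# Assumed AUTOINC_MIN of 3080898, i.e. a range of 1000 IDs.
remaining, fraction_used, should_sync = identity_usage(3080898, 3081898, 3081699)
```

With 199 IDs remaining out of an assumed 1000, the used fraction is just over the 80% threshold, so the client would be advised to sync.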
However, this is where what I've thought to be true fails in practice. Referring to the Microsoft details above, this sentence stands out: "When a new range is assigned, there will be a gap in the identity range values in the table."
I take this to mean the following:
If our user has an identity range of 0-1000 and uses IDs up to 801, then on the next sync the user will be allocated the next range of 1001-2000 (we are assuming one subscriber for illustration). As a result of the sync, the next ID used will be 1001, leaving a gap from 802-1000.
Firstly please let me know if my understanding is wrong.
Secondly though, this is not what we're seeing in practice.
In practice what we are seeing, based on the example above, is that after the sync, subsequent inserts use up the balance of the IDs until the original range is fully expended. THEN, upon expending the range, AUTOINC_MIN, AUTOINC_MAX and AUTOINC_NEXT are all updated to the new range. No additional syncs have occurred.
Below is an example.
In the target table the last ID used is 3080899
To simulate usage the following query was used
INSERT INTO Asset (lInstID, lTypeID, sUsrName, lUsrID, dCreated, dAudit, sStatus)
SELECT
lInstID, lTypeID, sUsrName, lusrID, dCreated, dAudit, sStatus
FROM
Asset
WHERE
lAssetID = 3080899
After the insert, the next ID value used is 3080900 (as expected, i.e. AUTOINC_NEXT = 3080899, + 1 = 3080900).
We repeat this insert until we reach 80% of the allocated identity range.
SELECT
AUTOINC_MAX, AUTOINC_NEXT, AUTOINC_MAX-AUTOINC_NEXT
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE
TABLE_NAME = N'Asset'
AUTOINC_MAX | AUTOINC_NEXT | AUTOINC_MAX-AUTOINC_NEXT
------------+--------------+-------------------------
3081898 | 3081699 | 199
We sync. We note 800 Subscriber changes. We query and this is what we see. No change from pre-sync.
AUTOINC_MAX | AUTOINC_NEXT | AUTOINC_MAX-AUTOINC_NEXT
------------+--------------+-------------------------
3081898 | 3081699 | 199
We continue to insert until there are zero IDs remaining
AUTOINC_MAX | AUTOINC_NEXT | AUTOINC_MAX-AUTOINC_NEXT
------------+--------------+-------------------------
3081898 | 3081898 | 0
One more insert and the results are this
AUTOINC_MAX | AUTOINC_NEXT | AUTOINC_MAX-AUTOINC_NEXT
------------+--------------+-------------------------
3082898 | 3081899 | 999
This is completely unexpected and contrary to "When a new range is assigned, there will be a gap in the identity range values in the table." In fact the identity range is contiguous. This is somewhat desirable as we don't waste IDs.
I cannot find where in the .SDF the next allocated range is stored.
I'm presuming it's next_range_start and next_range_end from the sysmergearticles server table, but I can find no documentation that exposes these values in the .SDF.
If someone knows what's going on here I'd greatly appreciate it.
As a point to note, if you fully expend this "next range" without a sync, the database returns an error as expected.
A sync after the error shows 1200 new records uploaded by the subscriber (200 from the previous range plus the 1000 from the "next range").
AUTOINC_MAX | AUTOINC_NEXT | AUTOINC_MAX-AUTOINC_NEXT
------------+--------------+-------------------------
3083898 | 3082898 | 999
Kind regards
Andrew
For those interested I received this answer from a chap at Microsoft.
Hi Andrew,
Based on testing, I believe this should be a document bug. The CE
synchronization should use same method subscriber running SQL Server
2005 or a later version. That is,
subscriber will receive two identity ranges. The secondary range is
equal in size to the primary range; when the primary range is
exhausted, the secondary range is used, and the Merge Agent assigns a
new range to the Subscriber. The new range becomes the secondary
range, and the process continues as the Subscriber uses identity
values.
So, let us say, you set subscriber identity range to 10. Then the
publisher starts from 1 to 11 for primary range and 11 to 21 for the
secondary range. And subscriber starts from 21 to 31 for the primary
range and 31 to 41 for the secondary range. If there is no
synchronization before you consume all subscriber range, you will hit
the error, it means you could insert double number record of
subscriber range you set to subscriber database before you hit the
error message.
For publisher end, if you consume all range but not in a single batch,
insert trigger will help you re-allocate a new range. For example,
before synchronization, if publisher end reaches 21 and next insert
will start from 42 since 22 to 41 belongs to subscriber.
The mechanism is the same for multiple subscribers, if you are still
in primary range for subscriber, new range will not be allocated. If
you are already in the secondary range, new range will be allocated
and makes your secondary range to new primary and the new range to the
new secondary range.
You could skip re-publisher scenario since CE subscriber could not act
as publisher.
Best regards,
Peter https://social.msdn.microsoft.com/profile/sql%20team%20-%20msft/?ws=usercard-mini
This confirms precisely what we are seeing in our testing. So, to be clear, SQL CE 3.1+ does receive a primary and a secondary range on first sync. After the first range is expended, the secondary becomes the primary, and on the next sync a new secondary range is assigned.
Just need the queries to identify what the secondary range is in SQL CE...
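For anyone trying to picture the primary/secondary behavior described above, here is a toy Python model of it. This is only a sketch of the mechanism as explained in the reply, not how SQL CE stores or manages ranges internally. It assumes a single subscriber, so that new ranges are handed out contiguously, and the class and attribute names are made up.

```python
class SubscriberRanges:
    """Toy model of a subscriber holding a primary and a secondary identity range."""

    def __init__(self, start, size):
        self.size = size
        self.primary = (start, start + size)            # half-open: [first, end)
        self.secondary = (start + size, start + 2 * size)
        self.next_id = start
        self.next_free = start + 2 * size               # next range the publisher hands out

    def insert(self):
        """Consume one identity value, switching ranges when the primary runs out."""
        if self.next_id >= self.primary[1]:
            if self.secondary is None:
                raise RuntimeError("identity range exhausted; sync required")
            # The secondary range is promoted to primary.
            self.primary, self.secondary = self.secondary, None
            self.next_id = self.primary[0]
        value = self.next_id
        self.next_id += 1
        return value

    def sync(self):
        """Assign a new secondary range only if the old one has been promoted."""
        if self.secondary is None:
            self.secondary = (self.next_free, self.next_free + self.size)
            self.next_free += self.size
```

With a range size of 1000, this reproduces the pattern in the tests above: a sync after 800 inserts changes nothing, the IDs stay contiguous into the second range, and 1200 further inserts are possible before the error.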
I have a table that looks like the following:
game_stats table:
id | game_id | player_id | stats | (many other cols...)
----------------------
1 | 'game_abc' | 8 | 'R R A B S' | ...
2 | 'game_abc' | 9 | 'S B A S' | ...
A user uploads data for a given game in bulk, submitting both players' data at once. For example:
"game": {
id: 'game_abc',
player_stats: {
8: {
stats: 'R R A B S'
},
9: {
stats: 'S B A S'
}
}
}
Submitting this to my server should result in the first table.
Instead of updating the existing rows when the same data is submitted again (with revisions, for example) what I do in my controller is first delete all existing rows in the game_stats table that have the given game_id:
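class GameStatController
  def update
    game_id = params[:game][:id]
    # Remove every existing row for this game before re-inserting.
    GameStat.where("game_id = ?", game_id).destroy_all
    # player_stats maps player_id => { stats: '...' }.
    params[:game][:player_stats].each do |player_id, stats|
      game_stat = GameStat.new(game_id: game_id, player_id: player_id, stats: stats[:stats])
      game_stat.save
    end
  end
end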
This works fine with a single threaded or single process server. The problem is that I'm running Unicorn, which is a multi-process server. If two requests come in at the same time, I get a race condition:
Request 1: GameStat.where(...).destroy_all
Request 2: GameStat.where(...).destroy_all
Request 1: Save new game_stats
Request 2: Save new game_stats
Result: Multiple game_stat rows with the same data.
I believe somehow locking the rows or table is the way to go to prevent multiple updates at the same time - but I can't figure out how to do it. Combining with a transaction seems the right thing to do, but I don't really understand why.
EDIT
To clarify why I can't figure out how to use locking: I can't lock a single row at a time, since the row is simply deleted and not modified.
ActiveRecord (AR) doesn't support table-level locking by default. You'll have to either execute DB-specific SQL or use a gem like Monogamy.
Wrapping up the save statements in a transaction will speed things up if nothing else.
Another alternative is to implement the lock with Redis. Gems like redis-lock are also available. This will probably be less risky as it doesn't touch the DB, and you can set Redis keys to expire.
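To illustrate the "DB-specific SQL" option in general terms, here is a minimal Python sketch that serializes the delete-and-reinsert step inside one transaction. It uses SQLite's `BEGIN IMMEDIATE` as a stand-in for a table-level lock. That statement is SQLite-specific and other databases have their own locking statements, so treat it as a sketch of the idea rather than something to drop into a Rails app. The function name `replace_game_stats` is made up.

```python
import sqlite3


def replace_game_stats(conn, game_id, player_stats):
    """Delete and re-insert a game's rows as one serialized unit of work."""
    # BEGIN IMMEDIATE takes SQLite's write lock up front, so a second writer
    # waits (or times out) instead of interleaving its own delete and inserts.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("DELETE FROM game_stats WHERE game_id = ?", (game_id,))
        conn.executemany(
            "INSERT INTO game_stats (game_id, player_id, stats) VALUES (?, ?, ?)",
            [(game_id, player_id, stats) for player_id, stats in player_stats.items()],
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None puts the sqlite3 module in autocommit mode so that the
# explicit BEGIN/COMMIT statements above control the transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE game_stats (id INTEGER PRIMARY KEY, game_id TEXT, "
    "player_id INTEGER, stats TEXT)"
)
replace_game_stats(conn, "game_abc", {8: "R R A B S", 9: "S B A S"})
replace_game_stats(conn, "game_abc", {8: "R R A B S", 9: "S B A S"})
row_count = conn.execute("SELECT COUNT(*) FROM game_stats").fetchone()[0]
```

Because each call holds the write lock from before the delete until after the last insert, submitting the same game twice leaves exactly one row per player.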
I am having trouble coming up with a good way to store a dataset that continually changes.
I want to track and periodically report on the contents of specific websites. For example, for a certain website I want to keep track of all the PDF documents that are available. Then I want to report periodically (say, quarterly) on the number of documents, PDF version number and various other statistics. In addition, I want to track the change of these metrics over time, e.g. I want to graph the increase in PDF documents offered on the website over time.
My input is basically a long list of URLs that point to all the PDF documents on the website. These inputs arrive intermittently, but they may not coincide with the dates I want to run the reports on. For example, in Q4 2010 I may get two lists of URLs, several weeks apart. In Q1 2011 I may get just one.
I am having trouble figuring out how to efficiently store this input data in a database of some sort so that I can easily generate the correct reports.
On the one hand, I could simply insert the complete list into a table each time I receive a new list, along with a date of import. But I fear that the table will grow quite big in a short time, and most of it will be duplicate URLs.
On the other hand, I fear that it may get quite complicated to maintain a list of unique URLs or documents, especially when documents are added, removed and then re-added over time. I fear I might get into the complexities of creating a temporal database. And I shudder to think what happens when the document itself is updated but the URL stays the same (in that case the metadata might change, such as the PDF version, file size, etc.).
Can anyone recommend a good way to store this data so I can generate reports from it? I would especially like to have the ability to retroactively generate reports. E.g. when I want to track a new website in Q1 2011, I would like to be able to generate a report from the Q4 2010 data as well, even though the Q1 2011 data has already been imported.
Thanks in advance!
Why not just a single table, called something like URL_HISTORY:
URL VARCHAR (PK)
START_DATE DATE (PK)
END_DATE DATE
VERSION VARCHAR
Have END_DATE as either NULL or a suitable dummy date (e.g. 31-Dec-9999) where the version has not been superseded; where the version has been superseded, set END_DATE to the last date on which it was valid and create a new record for the new version, e.g.:
+------------------+-------------+--------------+---------+
|URL | START_DATE | END_DATE | VERSION |
|..\Harry.pdf | 01-OCT-2009 | 31-DEC-9999 | 1.1.0 |
|..\SarahJane.pdf | 01-OCT-2009 | 31-DEC-2009 | 1.1.0 |
|..\SarahJane.pdf | 01-JAN-2010 | 31-DEC-9999 | 1.1.1 |
+------------------+-------------+--------------+---------+
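To show how such a table supports retroactive reports, here is a small Python/SQLite sketch that loads the three example rows and asks which version of each URL was valid on a given date. It assumes dates are stored as ISO strings (so that plain comparison orders them correctly) and uses the dummy end date rather than NULL. The function name `versions_as_of` is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE url_history (url TEXT, start_date TEXT, end_date TEXT, "
    "version TEXT, PRIMARY KEY (url, start_date))"
)
conn.executemany(
    "INSERT INTO url_history VALUES (?, ?, ?, ?)",
    [
        ("..\\Harry.pdf", "2009-10-01", "9999-12-31", "1.1.0"),
        ("..\\SarahJane.pdf", "2009-10-01", "2009-12-31", "1.1.0"),
        ("..\\SarahJane.pdf", "2010-01-01", "9999-12-31", "1.1.1"),
    ],
)


def versions_as_of(conn, as_of):
    """Return {url: version} for every document valid on the given ISO date."""
    rows = conn.execute(
        "SELECT url, version FROM url_history "
        "WHERE start_date <= ? AND end_date >= ? ORDER BY url",
        (as_of, as_of),
    )
    return dict(rows.fetchall())
```

Running `versions_as_of(conn, "2009-12-15")` and `versions_as_of(conn, "2010-03-31")` returns the same two URLs but a different version for SarahJane.pdf, which is the kind of point-in-time report the question asks for.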
What about using a document database, and instead of saving each URL, saving a document that holds a collection of URLs? Then, whenever you run the process that iterates over all the URLs, you fetch all of the documents that fall within a time frame (or meet whatever other qualifications you have) and run through the URLs in each of those documents.
This could also be emulated in SQL Server by serializing your object to JSON or XML and storing the output in a suitable column.