SQL Server CE: what is really happening with @threshold and identity ranges?

Folks, as this question relates to IDENTITY columns and merge replication, if I may ask you to refrain from answering with "use GUIDs instead". I'm acutely aware of the benefits and limitations of both and have been using SQL replication with CE since SQL Server 2000. Very occasionally I get surprised. This is such a case.
This is a complex description of the problem so please bear with me.
Below is an extract from https://msdn.microsoft.com/en-us/library/ms152543.aspx; it reflects what I've always understood about identity ranges and thresholds.
"Subscribers running SQL Server Compact or previous versions of SQL Server are assigned only the primary range; assignment of new ranges is controlled by the #threshold parameter. Additionally, a republishing Subscriber has only the range specified in the #identity_range parameter; it must use this range for local changes and for changes at Subscribers that synchronize with the republishing Subscriber. For example, you could specify 10000 for #pub_identity_range, 500000 for #identity_range and 80 percent for #threshold. After 8000 inserts at a Subscriber (80 percent of 10000), the Publisher is assigned a new range. When a new range is assigned, there will be a gap in the identity range values in the table. Specifying a higher threshold results in smaller gaps, but the system is less fault-tolerant: if the Merge Agent cannot run for some reason, a Subscriber could more easily run out of identities."
If we assume this is true for the moment, we'll get to the start of my problem.
To help users of our application we have been using a variation of the following query to warn clients that they might run out of identities if they keep going, and to prompt them to initiate a sync to get a new range.
SELECT
AUTOINC_MAX, AUTOINC_NEXT, AUTOINC_MAX-AUTOINC_NEXT
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE
TABLE_NAME = N'Asset'
AUTOINC_MAX | AUTOINC_NEXT | AUTOINC_MAX-AUTOINC_NEXT
------------+--------------+-------------------------
3081898 | 3080899 | 999
By evaluating AUTOINC_MAX - AUTOINC_NEXT (= 999) we can see when we're getting low on IDs.
In code we look at AUTOINC_MAX - AUTOINC_MIN, which gives the allocated range. Using the default threshold of 80% and the remaining range, we can advise clients to sync if they look like they will run out.
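For illustration, that check could be expressed roughly as follows. This is only a sketch: the 0.8 literal mirrors the default 80% threshold, and it assumes lAssetID is the identity column on Asset.
-- Client-side headroom check (SQL CE); 0.8 mirrors the default 80% @threshold,
-- and lAssetID is assumed to be the identity column on Asset.
SELECT
    AUTOINC_MIN,
    AUTOINC_MAX,
    AUTOINC_NEXT,
    AUTOINC_MAX - AUTOINC_NEXT AS IdsRemaining,
    CASE
        WHEN AUTOINC_NEXT - AUTOINC_MIN >= 0.8 * (AUTOINC_MAX - AUTOINC_MIN)
        THEN 1 ELSE 0
    END AS SyncRecommended
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE
TABLE_NAME = N'Asset' AND COLUMN_NAME = N'lAssetID'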
However, this is where what I've always understood to be true fails in practice. Referring to the Microsoft details above, this sentence stands out: "When a new range is assigned, there will be a gap in the identity range values in the table."
I take this to mean the following:
If our user has an identity range of 0-1000 IDs and uses IDs up to 801, then on the next sync the user will be allocated the next range of 1001-2000 (we are assuming one subscriber for illustration). As a result of the sync, the next ID used will be 1001, leaving a gap from 802-1000.
Firstly please let me know if my understanding is wrong.
Secondly though, this is not what we're seeing in practice.
In practice what we are seeing, based on the example above, is that post sync and subsequent inserts the balance of the IDs keeps being used until we fully expend the original range. THEN, upon expending the range, AUTOINC_MIN, AUTOINC_MAX and AUTOINC_NEXT are all updated to the new range. No additional syncs have occurred.
Below is an example.
In the target table the last ID used is 3080899
To simulate usage the following query was used
INSERT INTO Asset (lInstID, lTypeID, sUsrName, lUsrID, dCreated, dAudit, sStatus)
SELECT
lInstID, lTypeID, sUsrName, lUsrID, dCreated, dAudit, sStatus
FROM
Asset
WHERE
lAssetID = 3080899
After the insert, the next ID value used is 3080900 (as expected: the previous AUTOINC_NEXT of 3080899 + 1 = 3080900).
We repeat this insert until we reach 80% of the allocated identity range.
SELECT
AUTOINC_MAX, AUTOINC_NEXT, AUTOINC_MAX-AUTOINC_NEXT
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE
TABLE_NAME = N'Asset'
AUTOINC_MAX | AUTOINC_NEXT | AUTOINC_MAX-AUTOINC_NEXT
------------+--------------+-------------------------
3081898 | 3081699 | 199
We sync. We note 800 Subscriber changes. We query, and this is what we see: no change from pre-sync.
AUTOINC_MAX | AUTOINC_NEXT | AUTOINC_MAX-AUTOINC_NEXT
------------+--------------+-------------------------
3081898 | 3081699 | 199
We continue to insert until there are zero IDs remaining
AUTOINC_MAX | AUTOINC_NEXT | AUTOINC_MAX-AUTOINC_NEXT
------------+--------------+-------------------------
3081898 | 3081898 | 0
One more insert and the results are this
AUTOINC_MAX | AUTOINC_NEXT | AUTOINC_MAX-AUTOINC_NEXT
------------+--------------+-------------------------
3082898 | 3081899 | 999
This is completely unexpected and contrary to "when a new range is assigned, there will be a gap in the identity range values in the table". In fact the IDENTITY RANGE is contiguous. This is somewhat desirable, as we don't waste IDs.
I cannot find in the .SDF where the next allocated range is stored.
I'm presuming it's next_range_start and next_range_end from the sysmergearticles server table, but I can find no documentation that exposes these values in the .SDF.
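On the publisher, a query along these lines should show the ranges the server has handed out (my assumption is that MSmerge_identity_range is where they are tracked; nothing equivalent seems to be exposed in the .SDF):
-- Publisher side only, not the .SDF; table/column names are my assumption of
-- the server-side metadata that tracks allocated and next identity ranges.
SELECT a.name AS article,
       r.range_begin, r.range_end,
       r.next_range_begin, r.next_range_end,
       r.is_pub_range
FROM dbo.MSmerge_identity_range AS r
JOIN dbo.sysmergearticles AS a ON a.artid = r.artid
WHERE a.name = N'Asset'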
If someone knows what's going on here I'd greatly appreciate it.
As a point to note, if you fully expend this "next range" without a sync, the database returns an error as expected.
A sync post-error shows 1200 new records uploaded by the subscriber (200 from the previous range plus the 1000 from the "next range").
AUTOINC_MAX | AUTOINC_NEXT | AUTOINC_MAX-AUTOINC_NEXT
------------+--------------+-------------------------
3083898 | 3082898 | 999
Kind regards
Andrew

For those interested I received this answer from a chap at Microsoft.
Hi Andrew,
Based on testing, I believe this is a documentation bug. CE synchronization uses the same method as a subscriber running SQL Server 2005 or a later version. That is, the subscriber receives two identity ranges. The secondary range is equal in size to the primary range; when the primary range is exhausted, the secondary range is used, and the Merge Agent assigns a new range to the Subscriber. The new range becomes the secondary range, and the process continues as the Subscriber uses identity values.
So, let us say you set the subscriber identity range to 10. Then the publisher gets 1 to 11 as its primary range and 11 to 21 as its secondary range, and the subscriber gets 21 to 31 as its primary range and 31 to 41 as its secondary range. If there is no synchronization before you consume the whole subscriber range, you will hit the error; in other words, you can insert double the number of records of the configured subscriber range into the subscriber database before you hit the error message.
At the publisher end, if you consume the whole range, but not in a single batch, an insert trigger will re-allocate a new range for you. For example, before synchronization, if the publisher end reaches 21, the next insert will start from 42, since 22 to 41 belongs to the subscriber.
The mechanism is the same with multiple subscribers: if you are still in the primary range at a subscriber, a new range will not be allocated. If you are already in the secondary range, a new range will be allocated; your secondary range becomes the new primary, and the new range becomes the new secondary range.
You can skip the republisher scenario, since a CE subscriber cannot act as a publisher.
Best regards,
Peter https://social.msdn.microsoft.com/profile/sql%20team%20-%20msft/?ws=usercard-mini
This confirms precisely what we are seeing in our testing. So, to be clear: SQL CE 3.1+ does receive a primary and a secondary range on first sync. After the first range is expended the secondary becomes the primary, and on the next sync a new secondary range is allocated.
Just need the queries to identify what the secondary range is in SQL CE...

Related

Data structure for tableau: replace only some rows (dates) via filter for n number of cases/scenarios

I am using Tableau Desktop and Tableau Prep. In the best case I can solve this issue in Prep to optimize dashboard performance (Prep is able to execute custom SQL, for instance). Nevertheless, if there is an easy solution in Tableau Desktop that might work as well.
Overall goal:
Visualize all project-requests per pool over time. Include several scenarios per pool.
Data source "Requests":
Date
ProjectID
Pool
Request
March 25
6234
PoolA
1
April 24
92345
PoolB
0,5
April 23
123
PoolB
0,5
Data source "Scenarios":
Date
ProjectID
Pool
Scenario
March 26
6234
PoolA
rabbit
April 22
92345
PoolB
duck
Restrictions:
Key is ProjectID
len(Requests) >>> len(Scenarios)
"Scenario" will only replace some request-dates. To check the complete simulated overview, you need to include replaced rows but also some original rows per pool
Number of scenarios is dynamic and not fixed
Several scenarios might replace/affect the very same row from "Requests"
To avoid conflicts, it is only possible to select one scenario at a time
Questions:
What's the most elegant data structure to visualize this data and these scenarios in Tableau?
One solution: don't join but concatenate, and use one filter to "activate" the scenario and another filter to "deactivate" the rows it replaces (see the sketch after this list). Bad from a user perspective (two filters for just one scenario)
Brute force: duplicate every row per scenario (updated rows AND untouched rows). Should work in theory but will blow up my table even more
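Since Prep can execute custom SQL, the concatenate option could be sketched roughly as follows. This is only an illustration: the table names Requests and Scenarios, and the join back to Requests to pick up the request value for a scenario row, are assumptions about how the two sources relate.
-- Baseline rows get a fixed label; scenario rows carry their scenario name.
SELECT Date, ProjectID, Pool, Request, 'Baseline' AS Scenario
FROM Requests
UNION ALL
SELECT s.Date, s.ProjectID, s.Pool, r.Request, s.Scenario
FROM Scenarios s
JOIN Requests r ON r.ProjectID = s.ProjectID
-- One filter then "activates" a scenario; a second filter (or a calculated field)
-- hides the baseline rows whose ProjectID appears in the selected scenario.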

report scheduler system design using database as master

Problem
we have ~50k scheduled financial reports that we periodically deliver to clients via email
reports have their own delivery frequency (date&time format - as configured by clients)
weekly
daily
hourly
weekdays only
etc.
Current architecture
we have a table called report_metadata that holds report information
report_id
report_name
report_type
report_details
next_run_time
last_run_time
etc...
every week, all 6 instances of our scheduler service poll the report_metadata database, extract metadata for all reports that are to be delivered in the following week, and put them in an in-memory timed queue.
Only in the master/leader instance (which is one of the 6 instances):
data in the timed-queue is popped at the appropriate time
processed
a few API calls are made to get a fully-complete and current/up-to-date report
and the report is emailed to clients
the other 5 instances do nothing - they simply exist for redundancy
Proposed architecture
Numbers:
db can handle up to 1000 concurrent connections - which is good enough
total existing report number (~50k) is unlikely to get much larger in the near/distant future
Solution:
instead of polling the report_metadata db every week and storing data in a timed-queue in-memory, all 6 instances will poll the report_metadata db every 60 seconds (with a 10 s offset for each instance)
on average the scheduler will attempt to pick up work every 10 seconds
data for any single report whose next_run_time is in the past is extracted, the table row is locked, and the report is processed/delivered to clients by that specific instance
after the report is successfully processed, the table row is unlocked and the next_run_time, last_run_time, etc. for the report are updated
In general, the database serves as the master; individual instances of the process can work independently, and the database ensures they do not overlap.
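To make the row-locking idea concrete, here is a rough sketch of the claim-and-update step, assuming SQL Server syntax and a hypothetical status column added to report_metadata (the other columns are the ones listed above):
-- Claim one due report; READPAST lets other instances skip rows that are
-- already locked instead of blocking behind them.
DECLARE @claimed TABLE (report_id INT);

UPDATE TOP (1) report_metadata WITH (ROWLOCK, READPAST)
SET status = 'PROCESSING'
OUTPUT inserted.report_id INTO @claimed
WHERE next_run_time <= GETUTCDATE()
  AND status = 'IDLE';

-- ... build and email the report for the claimed report_id ...

-- Release the row and schedule the next run (the interval depends on the
-- report's configured frequency; one hour is only a placeholder here).
UPDATE r
SET r.status        = 'IDLE',
    r.last_run_time = GETUTCDATE(),
    r.next_run_time = DATEADD(HOUR, 1, GETUTCDATE())
FROM report_metadata AS r
JOIN @claimed AS c ON c.report_id = r.report_id;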
It would help if you could let me know:
whether the proposed architecture is a good/correct solution
which table columns can/should be indexed
any other considerations
I have worked on a different kind of scheduler for a program that reported analyses at specific moments of the month/week. What I did was combine the reports into so-called business-cycle-based time moments. These moments are the "start of a new week", "start of the month", and the "start/end of a D/W/M/Q/Y". So I standardised the moments of sending the reports and added the IDs to a table that carries the details of the report. Now you add things to the cycle or remove them when needed; you could do this by adding a tag like EOD (end of day), EOM (end of month), SOW (start of week), etc.
So you could index the moments when the clients want to receive the reports and build on that. Hope this comment can help you with your challenge.
It seems good to simply query that metadata table from all 6 instances to check which is the next report to process, as you are suggesting.
It seems odd, though, to have a staggered approach with a check once every 60 seconds offset by 10 seconds per server. You have 6 servers now, but that may change. Also, I don't understand the "locking" you are suggesting; why not simply set a flag on the row such as [State] = 'processing', so the next scheduler knows to skip that row and move on to the next available one? Once a run is processed, you can simply update a [Date_last_processed] column, or maybe something like [last_cycle_complete] = 'YES'.
Alternatively you could have one server process go through the table and, for each available row, send it off to one of the instances in a round-robin fashion (or keep track of who is busy and who isn't).

Auto-incrementing a Firebird field value when using UPDATE OR INSERT INTO

I have been a Delphi programmer for 25 years, but managed to avoid SQL until now. I was a dBase expert back in the day. I am using Firebird 3.0 SuperServer as a service on a Windows Server 2012 box. I run a UDP listener service written in Delphi 2007 to receive status info from a software product we publish.
The FB database is fairly simple. I use the user's IP address as the primary key and record reports as they come in. I am currently getting about 150,000 reports a day and they are logged in a text file.
Rather than insert every report into a table, I would like to increment an integer value in a single record with a "running total" of reports received from each IP address. It would save a LOT of data.
The table has fields for IP address (Primary Key), LastSeen (timestamp), and Hits (integer). There are a few other fields but they aren't important.
I use UPDATE OR INSERT INTO when the report is received. If the IP address does not exist, a new row is inserted. If it does exist, then the record is updated.
I would like it to increment the "Hits" field by +1 every time I receive a report. In other words, if "Hits" already = 1, then I want to inc(Hits) on UPDATE to 2. And so on. Basically, the "Hits" field would be a running total of the number of times an IP address sends a report.
Adding 3 million rows a month just so I can get a COUNT for a specific IP address does not seem efficient at all!
Is there a way to do this?
The UPDATE OR INSERT statement is not suitable for this, as you need to specify the values to update or insert, so you will end up with the same value for both the insert and the update. You could address this by creating a before insert trigger that always assigns 1 to the field that holds the count (ignoring the value provided by the statement for the insert), but it is probably better to use MERGE, as it gives you more control over the resulting action.
For example:
merge into user_stats
using (
  -- single-row source representing the incoming report
  select '127.0.0.1' as ipaddress, timestamp '2021-05-30 17:38' as lastseen
  from rdb$database
) as src
on user_stats.ipaddress = src.ipaddress
when matched then
  -- existing IP: bump the counter and keep the most recent timestamp
  update set
    user_stats.hits = user_stats.hits + 1,
    user_stats.lastseen = maxvalue(user_stats.lastseen, src.lastseen)
when not matched then
  -- new IP: start the counter at 1
  insert (ipaddress, hits, lastseen) values (src.ipaddress, 1, src.lastseen)
However, if you get a lot of updates for the same IP address, and those updates are processed concurrently, this can be rather error-prone due to update conflicts. You can address this by inserting individual hits, and then have a background process to summarize those records into a single record (e.g. daily).
Also keep in mind that having only a single record per IP removes the possibility of doing further analysis later (e.g. distribution of hits, number of hits on day X or at time HH:mm, etc.).
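If you go the summarization route, a rough sketch could look like the following. The hit_log detail table is hypothetical, and in practice you would restrict the clean-up delete to the rows you actually aggregated (e.g. by an upper bound on seen_at).
-- hypothetical detail table: one row per received report, so concurrent
-- reports never have to update the same row
create table hit_log (
    ipaddress varchar(45) not null,
    seen_at   timestamp   not null
);

-- each incoming report is just an insert
insert into hit_log (ipaddress, seen_at) values ('127.0.0.1', current_timestamp);

-- background job (e.g. daily): fold the detail rows into user_stats
merge into user_stats
using (
    select ipaddress, count(*) as cnt, max(seen_at) as lastseen
    from hit_log
    group by ipaddress
) as src
on user_stats.ipaddress = src.ipaddress
when matched then
  update set
    user_stats.hits = user_stats.hits + src.cnt,
    user_stats.lastseen = maxvalue(user_stats.lastseen, src.lastseen)
when not matched then
  insert (ipaddress, hits, lastseen) values (src.ipaddress, src.cnt, src.lastseen);

-- then clear the summarized detail rows
delete from hit_log;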

SSIS ForEach ADO Enumerator - Performance Issues

This is a best practice / alternative approach question about using an ADO Enumerator ForEach loop.
My data is financial accounts, coming from a source system into a data warehouse.
The current structure of the data is a list of financial transactions, e.g.:
+-----------------------+----------+-----------+------------+------+
| AccountGUID | Increase | Decrease | Date | Tags |
+-----------------------+----------+-----------+------------+------+
| 00000-0000-0000-00000 | 0 | 100.00 | 01-01-2018 | Val1 |
| 00000-0000-0000-00000 | 200.00 | 0 | 03-01-2018 | Val3 |
| 00000-0000-0000-00000 | 400.00 | 0 | 06-01-2018 | Val1 |
| 00000-0000-0000-00000 | 0 | 170.00 | 08-01-2018 | Val1 |
| 00000-0000-0000-00002 | 200.00 | 0 | 04-01-2018 | Val1 |
| 00000-0000-0000-00002 | 0 | 100.00 | 09-01-2018 | Val1 |
+-----------------------+----------+-----------+------------+------+
My SSIS package currently has two ForEach loops:
All Time Balances
End Of Month Balances
All Time Balances
Passes AccountGUID into the loop and selects all transactions for that account. It then orders them by date, with the earliest transaction first, and assigns each a sequence number.
Once the sequence number is assigned, it begins to calculate the running balances based on the Increase and Decrease columns, along with the Tags column, to work out which balance it's dealing with.
It finishes by flagging the latest record as Current.
All Time Balances - Work Flow
->Get All Account ID's in Staging table
|-> Write all Account GUID's to object variable
|--> ADO Enumerator ForEach - Loop Account GUID List - Write GUID to variable
|---> (Data Flow) Select all transactions for Account GUID
|----> (Data Flow) Order all transactions by date and assign Sequence number
|-----> (Data Flow) Run each row through a script component transformation to calculate running totals for each record
|------> (Data Flow) Insert balance data into staging table
End Of Month Balances
The second package, End of Month, does something very similar, with the exception of a second loop. The select finds the earliest transactional record and the latest transactional record. Using those two dates it works out all the months between them and loops over each of those months.
Inside the date loop, it does pretty much the same thing, works out the balances based on tags and stamps the end of month record for each account.
The Issue/Question
All of this currently works fine, but the performance is horrible.
In one database with approximately 8,000 accounts and 500,000 transactions, this process takes upwards of a day to run. This being one of our smaller clients, I tremble at the idea of running it against our heavier databases.
Is there a better approach to doing this, using SQL cursors or some other neat way I have not seen?
Ok, so I have managed to take my package execution from around 3 days to about 11 minutes all up.
I ran a profiler and standard windows stats while running the loops and found a few interesting things.
Firstly, there was almost no utilization of HDD, CPU, RAM or network during the execution of the packages. It told me what I kind of already knew, that it was not running as quickly as it could.
What I did notice was that between each execution of the loop there was a 1 to 2 ms delay before the next iteration of the loop started executing.
Eventually I found that every time a new iteration of the loop began, SSIS created a new connection to the SQL database; it appears that this is SSIS's default behavior. Whenever you create a Source or Destination, you are adding a connection delay to your project.
The Fix:
Now this was an odd fix: you need to go into your connection manager, and (the odd bit) it must be the on-screen window, not the right-hand project manager window.
If you select the connection that is referenced in the loop, then in the properties window on the right side (in my layout anyway) you will see an option called "RetainSameConnection", which by default is set to false.
By setting this to true, I eliminated the 2ms delay.
Considerations:
In doing this I created a heap of other issues, which really just highlighted areas of my package that I had not thought out well.
Some things that appeared to be impacted by this change were stored procedures that used temp tables; these seemed to break instantly. I assume that is because of how SQL handles temp tables: when the connection is closed and reopened, you can be pretty certain the temp table is gone. With the connection retained, running into a leftover temp table becomes an issue again.
I removed all temp tables and replaced them with CTE statements, which appears to fix this issue.
The second major issue I found was with tasks that ran in parallel and both used the same connection manager. From this I received an error that SQL was still trying to run the previous statement. This bombed out my package.
To get around this, I created a duplicate connection manager (All up I made three connection managers for the same database).
Once I had my connections set up, I went into each of my parallel Source and Destinations and assigned them their own connection manager. This appears to have resolved the last error I received.
Conclusion:
There may be more unforeseen issues in doing this, but for now my packages are lightning quick and this exercise highlighted some faults in my design.

Store and Retrieve Records in Redis by Hour

I'm trying to store phone call logs in Redis in a manner that lets me quickly look them up by phone number and by timestamp (which is actually the number of hours since the Unix epoch).
So the data is something like this:
Phone # | Hour | Data
------------+--------+-------------------
15551231234 | 386615 | "call record 1..."
15551231234 | 386615 | "call record 2..."
And I need to be able to get all the data for a specified phone number and for hours within a range (e.g. the last 24 hours).
What is the best way in Redis to store and retrieve this form of data?
Redis is not a very good solution for arbitrary range queries: you would probably be better served by a relational database or something like MongoDB.
That said, because your range queries apply to numerical values, you can use a zset (sorted set) to represent your data.
# adding a new call
ZADD <phone> <hour> <call data>
# find calls for a given phone number in a range
ZRANGEBYSCORE <phone> <hour begin> <hour end> WITHSCORES
Each returned item will be a pair with time stamp and call data.
Please note you cannot find all phone calls within a given range with this data structure (only calls for a given phone number within a range).