Should the first step of an SSIS solution be to check connections? - sql

In SSIS 2012, how would I check whether a connection and credentials are working?
In a scenario where you must pull and aggregate data from different sources and load it into some target databases, should the whole project be preceeded by a special package that checks whether the connections are up and running?
If so, how would I check whether the connections to which the package needs access are up and ready to be connected to and that the credentials are right?
I understand that if I omit the step of checking the connections, the whole thing will just error out, but I feel like the best practice would be to explicitly check whether the connections and credentials are working.
Thank you very much for your kind attention.

Apparently your check would be as Siyual suggested a select 1.. But the bigger question here is do all the connections need to be active for you to get the target data or would you have some data to load even if some connections faulted - which obviously would depend on the nature of your requirement. Here is where the concept of staging tables, concept of Delta and using the completion workflow instead of the success work flow would help. For example and simplicity sake lets say I am working on pulling sales data from 3 different point of sale locations for my company (Brazil, US and India), I store the dates I last pulled from in a table and update them when the extraction is successful. In this case if one extraction fails I would still want to go to the next location and to try to pull the data.

Related

HA Database configuration that avoids split-brain issues?

I am looking for a (SQL/RDB) database setup that works something like this:
I will have 3+ databases in an active/active/active configuration
prior to doing any insert, the database will communicate with atleast a majority of the others, such that they all either insert at the same time or rollback (transaction)
this way I can write and read from any of the databases, and always get the same results (as long as the field wasn't updated very recently)
note: this is for a use case that will be very read-heavy and have few writes (and delay on the writes is an OK situation)
does anything like this exist? I see all sorts of solutions with database HA configurations, but most of them suggest writing to a primary node or having a passive backup
alternatively I could setup a custom application, and have each application talk to exactly 1 database, and achieve a similar result, but I was hoping something similar would already exist
So my questions is: does something like this exist? if not, are there any technical/architectural reasons why not?
P.S. - I will NOT be using a SAN where all databases can store/access the same data
edit: more clarifications as far as what I am looking for:
1. I have no database picked out yet, but I am more familiar with MySQL / SQL Server / Oracle, so I would have a minor inclination towards on of those
2. If a majority of the nodes are down (or a single node can't communicate with the collective), then I expect all writes from that node to fail, and accept that it may provide old/outdated information
failure / recover scenario expectations:
1. A node goes down: it will query and get updates from the other nodes when it comes back up
2. A node loses connection with the collective: it will provide potentially old data to read request, and refuse any writes
3. A node is in disagreement with the data stores in others: majority rule
4. 4. majority rule does not work: go with whomever has the latest data (although this really shouldn't happen)
5. The entries are insert/update/read only, i.e. there will be no deletes (except manually ofc), so I shouldn't need to worry about an update after a delete, however in that case I would choose to delete the record and ignore the update
6. Any other scenarios I missed?
update: I the closest I see to what I am looking for seems to be using a quorum + 2 DBs, however I would prefer if I could have 3 DBs instead, such that I can query any of them at any time (to further distribute the reads, and also to keep another copy of the data)
You need to define "very recently". In most environments with replication for inserts, all the databases will have the same data within a few seconds of an insert (and a few seconds seems pessimistic).
An alternative approach is a "read-from-one/write-to-all" approach. In this case, reads are spread through the system. Writes are then sent to all nodes by the application (or a common layer that the application uses).
Remember, though, that the issue with database replication is not how it works when it works. The issue is how it recovers when it fails and even how failures are identified. You need to decide what happens when nodes go down, how they recover lost transactions, how you decide that nodes are really synchronized. I would suggest that you peruse the documentation of the database that you are actually using and understand the replication mechanisms provided by that platform.

How should data be provided to a web server using a data warehouse?

We have data stored in a data warehouse as follows:
Price
Date
Product Name (varchar(25))
We currently only have four products. That changes very infrequently (on average once every 10 years). Once every business day, four new data points are added representing the day's price for each product.
On the website, a user can request this information by entering a date range and selecting one or more products names. Analytics shows that the feature is not heavily used (about 10 users requests per week).
It was suggested that the data warehouse should daily push (SFTP) a CSV file containing all data (currently 6718 rows of this data and growing by four each day) to the web server. Then, the web server would read data from the file and display that data whenever a user made a request.
Usually, the push would only be once a day, but more than one push could be possible to communicate (infrequent) price corrections. Even in the price correction scenario, all data would be delivered in the file. What are problems with this approach?
Would it be better to have the web server make a request to the data warehouse per user request? Or does this have issues such as a greater chance for network errors or performance issues?
Would it be better to have the web server make a request to the data warehouse per user request?
Yes it would. You have very little data, so there is no need to try and 'cache' this in some way. (Apart from the fact that CSV might not be the best way to do this).
There is nothing stopping you from doing these requests from the webserver to the database server. With as little information as this you will not find performance an issue, but even if it would be when everything grows, there is a lot to be gained on the database-side (indexes etc) that will help you survive the next 100 years in this fashion.
The amount of requests from your users (also extremely small) does not need any special treatment, so again, direct query would be the best.
Or does this have issues such as a greater chance for network errors or performance issues?
Well, it might, but that would not justify your CSV method. Examples and why you need not worry, could be
the connection with the databaseserver is down.
This is an issue for both methods, but with only one connection per day the change of a 1-in-10000 failures might seem to be better for once-a-day methods. But these issues should not come up very often, and if they do, you should be able to handle them. (retry request, give a message to user). This is what enourmous amounts of websites do, so trust me if I say that this will not be an issue. Also, think of what it would mean if your daily update failed? That would present a bigger problem!
Performance issues
as said, this is due to the amount of data and requests, not a problem. And even if it becomes one, this is a problem you should be able to catch at a different level. Use a caching system (non CSV) on the database server. Use a caching system on the webserver. Fix your indexes to stop performance from being a problem.
BUT:
It is far from strange to want your data-warehouse separated from your web system. If this is a requirement, and it surely could be, the best thing you can do is re-create your warehouse-database (the one I just defended as being good enough to query directly) on another machine. You might get good results by doing a master-slave system
your datawarehouse is a master-database: it sends all changes to the slave but is inexcessible otherwise
your 2nd database (on your webserver even) gets all updates from the master, and is read-only. you can only query it for data
your webserver cannot connect to the datawarehouse, but can connect to your slave to read information. Even if there was an injection hack, it doesn't matter, as it is read-only.
Now you don't have a single moment where you update the queried database (the master-slave replication will keep it updated always), but no chance that the queries from the webserver put your warehouse in danger. profit!
I don't really see how SQL injection could be a real concern. I assume you have some calendar type field that the user fills in to get data out. If this is the only form just ensure that the only field that is in it is a date then something like DROP TABLE isn't possible. As for getting access to the database, that is another issue. However, a separate file with just the connection function should do fine in most cases so that a user can't, say open your webpage in an HTML viewer and see your database connection string.
As for the CSV, I would have to say querying a database per user, especially if it's only used ~10 times weekly would be much more efficient than the CSV. I just equate the CSV as overkill because again you only have ~10 users attempting to get some information, to export an updated CSV every day would be too much for such little pay off.
EDIT:
Also if an attack is a big concern, which that really depends on the nature of the business, the data being stored, and the visitors you receive, you could always create a backup as another option. I don't really see a reason for this as your question is currently stated, but it is a possibility that even with the best security an attack could happen. That mainly just depends on if the attackers want the information you have.

How to look at past DML on your table?

a friend asked me if there is a way to see the past dml statements and I wasn't really sure on how to go about answering that question. What he wants to see is the last set of insert statements. So that means it could be more than 1 record. At first I was just saying to check the latest identity, but then he asked what if more inserts were performed at the same time. Can you guys help me out? Is there a DMV I should use that I just don't know about? Thanks.
If you did not prepare for this question then there is no build in way to get to that information. However, you could use third party log reader tools to recover (all) the last statements that where executed against the database. This requires the database to be in Full recovery mode. You could potentially go back as far as you have log backups with this method.
If you want to prepare for that question being ask in the future, you have several options.
The most obvious one is Change Data Capture. You also could write a trigger yourself that records data changes.
You could also run a trace capturing SQL Batch Started.
Finally you could use a third party network sniffer/logger to capture all statements send to the server (this however requires that connection encryption is not used).

SyncFramework 2.1 updates & deletes do not seem to apply properly

I'm synchronizing SQL Server 2008 with ~6 SQL Server 2008 Express clients (everything R2 I believe), using the SyncOrchestrator or specifically using http://code.msdn.microsoft.com/windowsdesktop/Database-SyncSQL-Server-e97d1208 as a base with slight modifications. To my knowledge this means all connections are peers or nodes.
I have 2 scopes. One is download only and the other is upload only. The download only scope is ridden with identity columns primarily because I didn't know any better and still couldn't wrap my head around introducing Guids as the PK on the client side. It doesn't totally matter as all clients should have exact replicas of about 8 or so tables and these machines don't touch this data in any way, only read it.
The upload only scope uses Guids as fortunately I can control that portion of the database and there would be no way 10 clients all using the same identity seed could sync back to the server properly. Both scopes use the default provisioning with bulk inserts and the whole 9 yards so there shouldn't be anything I'm doing on the provisioning end to screw this up.
I initially set everything up not using PerformPostRestoreFixup AND the initial database would be manually synchronized with insert statements from the host. This seemed fine but no updates or deletes seemed to ever be applied. You can safely ignore this (only used for historical accuracy and to prove my ineptness) as I then used VS2010 Database Projects to rebuild the database down to schema only & synchronized. I then used the steps outlined here (http://social.microsoft.com/Forums/br/syncdevdiscussions/thread/9ac6d1a1-1565-4b82-a8d8-3d4a9ff5d07b) (sync, backup, restore, call performpostrestorefixup, sync on x clients) and on my dev box where I'm setting all this up I could see updates and deletes just fine. Its when I deploy this to the x clients that I'm not seeing a mirror of the database as I think I should.
The initial sync will complain and try to synchronize all records again. I believe this is expected. During ApplyChangeFailed event on the client I set everything other than DbConflictType.ErrorsOccurred to ApplyAction.RetryWithForceWrite. This may be a source of problems as I initially thought this should be done to force the change down to the client. I want the server to always win in this scenario but during trace I always see the phrase "Local wins" during the bulk insert/update calls. It's possible I'm seeing the error before the re-apply happens but it's awkward to look at.
The only problem I seem to be having is with the download only scope. The initial client database is about a week old now and if I use the performpostrestorefixup steps I don't see any of the updates that have applied between now and then as I think I should. It's as if SyncFx almost prefers a blank database on the client side to kick off the initial sync then all the updates seem to apply just fine with no ApplyChangesFailed events kicking off.
If anyone has seen this before or has a clue where to go I would greatly appreciate it. My brain has fried trying to determine what it is that's going on. My last ditch effort will be to deploy blank databases to all the clients and have them start the sync. I've had no issues with this on the dev side but I can only test one other client to know if that'll do anything different. Aside from that I don't know what to do other than to keep doing manual syncs which would defeat this purpose entirely. I thought PerformPostRestoreFixup would alleviate the issue entirely but I seem to be having the same problems with or without it or perhaps I'm not looking at what I need to be.
Thanks
I wanted to report and close the entry with my findings.
When I would deploy a previously configured client database, I'd often get ApplyChangeFailed events in the form of this log:
"[05:30:41 PM] - ApplyChange Failed: TableName: , Stage: ApplyingInserts, ConflictType: LocalInsertRemoteInsert, Action: RetryWithForceWrite"
This is what I thought would be expected as it tried to reinsert the data that is already there. What this should've been changed to was an update statement during RetryWithForceWrite but I found the data was not updating with what was being sent down.
Once I started each client with a completely blank database and provisioned locally, all of these errors went away. It's as if every client expects some unique id only it sets. I'm also using x64 builds versus x86 which may have some or no bearing on the results. I wish I could determine what exactly happened but it seems that when in doubt, and whenever possible, starting from absolute zero and letting sync fill in the data is your safest option.

question about frequency of updating access

i have a table in an access database
this access database is used on a regular basis, basically from 9-5
someone else has a copy of this exact table. sometimes records are added, sometimes deleted, and sometimes data within the records is updated.
i need to update the access database table with the offsite table every hour or so. what is the best algorithm of updating the data? there are about 5000 records.
would it severely lock up the table for a few seconds every hour?
i would like to publicly apologize for my rude comment to david fenton
My impression is that this question ties together pieces you've been exploring with your previous questions:
a file "listener" to detect the presence of a new file and do something with it when found
list files with some extension in a folder
DoCmd.TransferText to pull file data into your database
Insert, Update, Delete records in a table based on an imported set of records
Maybe it's time to give us a more detailed picture of what you're dealing with.
Tony asked if both sites are on the same WAN (Wide Area Network). You replied they are on Windows. Elsewhere you said you're using a network. Please tell us about the network.
I'm still unsure whether you need a one-way or two-way data exchange. You've talked about importing changes from the remote table into the local master table. Do you need to do the same type of operation at the remote site: import changes made to the table at the master site?
Tell us what needs to happen regarding the issue James raised. Can local and remote users ever edit the same record? If they can, how will you resolve the conflict? Similarly, what should happen if a remote user updates a record and a local user deletes their copy of that record?
Based on what you've told us so far, this sounds like a real challenge for Access, made more challenging by the rate of record changes (5,000 per hour). I like the outline Kevin suggested. However your challenge will be more complicated since you also need to account for record deletions at both sites.
It seems like you may have to create something which duplicates Access' Replication feature. Maybe you should look at the Jet Replication Wiki to see if you can modify your design to take advantage of Replication. I can't help you there, and unfortunately you appear to have frustrated David Fenton who is a leading authority on Jet Replication.
If a few seconds performance is critical, you'd rather move to a better database engine (like Sqlite, MySQL, MS SQL server). If you want a single file, then Sqlite is the best for you. All these use by-single-record locks, so you can read and write simultaneously.
If you stay with access, you will probably have to implement a timer to update only a few records at a time.
Before you do anything else you need to establish the "rules" as far as collisions go.
If a row in the local copy is updated and the same row in the remote copy is updated which one is the "correct" version? Ditto for deletions, inserts are even more of a pain as you can have the "same" set of values but perhaps a different key.
After you have worked out how to handle each of these cases you can then go on to thinking about the implementation.
As other posters have suggested the way to completely avoid these issues is switch to SQLServer or any other "proper" database which can be updated over the network by all users and where concurrency issues are handled by the DBMS when the updates are applied.
Other users have already suggested switching to a server based database i.e. SQL server etc. I would echo this and say it is the best way to go however if you are stuck with access and have no choice then I would suggest you add a field (with an index) along the lines of “Last Updated”. You could then export all records that have been modified within a particular time frame. Export this file as a CSV, ship it over to the remote site and import it into the “master” access database. With a bit of scripting you could automate this process.