I have a SQL Server 2012 database which is the back end to an ASP.NET MVC application, storing customer and order information. This database is accessed under high load and heavy usage.
I now have a requirement to be able to generate ad hoc reports from the database, accessing the same data the MVC application works with. I am concerned about the impact this would have on the database server and the database itself, around locking etc. There is also a distinction in how the data is used: for the app it is operational, but for the reports it is more data warehouse oriented.
Therefore I am looking at my options for the best approach to avoid these problems.
I am considering creating another database on a different server and archiving the data to it with a SQL job that runs at regular intervals during the day. My only concern is that this would require maintenance and creates a dependency: any changes to the source database would also have to be made to the target database.
What other options are open to me in this situation, and what advice could be given regarding them? What is the best approach?
You don't have to invent your own solution to keep the databases in sync. SQL Server has built-in ways to achieve this:
Database Mirroring
Replication
Always On Availability Groups
If you're using the Enterprise Edition of SQL Server 2012, I would look into Always On Availability Groups; if not, then (transactional) replication. Both of these solutions can maintain a second read-only, near real-time copy of the database.
As Steve McConnell suggests, you should make no assumptions about performance; you should measure it before making any decisions. It is not wise to make design choices without knowing the actual performance overhead. So I would suggest measuring, or simulating, the performance overhead before even considering a complex architecture, because otherwise you will not know whether it is worth the trouble.
Anyway, I think your approach is right. I would create a Windows service which periodically retrieves the data I need from the operational database and stores it in the warehouse (the new database). I don't think you will ever find a tool that keeps the two schemas consistent, unless you want one schema to be an exact copy of the other.
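To make that concrete, here is a minimal sketch of the kind of periodic extract such a service could run. The table, the LastModified watermark column, and the connection strings are assumptions invented for the example, not anything from the original question:

    // Sketch of the periodic extract the Windows service would run on a timer.
    // Table names, the LastModified watermark column and the connection strings
    // are placeholders for illustration only.
    using System;
    using System.Data.SqlClient;

    public static class OrderExtract
    {
        // Called from the service's timer callback, e.g. every 15 minutes.
        public static void CopyNewAndChangedOrders(DateTime lastRunUtc)
        {
            const string sourceCs = "Server=OLTP01;Database=Sales;Integrated Security=true";
            const string targetCs = "Server=REPORT01;Database=SalesWarehouse;Integrated Security=true";

            using (var source = new SqlConnection(sourceCs))
            using (var target = new SqlConnection(targetCs))
            {
                source.Open();
                target.Open();

                // Only pull rows touched since the last run, to keep load on the OLTP box low.
                using (var cmd = new SqlCommand(
                    "SELECT OrderId, CustomerId, OrderDate, Total, LastModified " +
                    "FROM dbo.Orders WHERE LastModified > @since", source))
                {
                    cmd.Parameters.AddWithValue("@since", lastRunUtc);

                    using (var reader = cmd.ExecuteReader())
                    using (var bulk = new SqlBulkCopy(target))
                    {
                        bulk.DestinationTableName = "dbo.Orders_Staging";
                        bulk.WriteToServer(reader); // merge from staging into the warehouse tables afterwards
                    }
                }
            }
        }
    }

Writing into a staging table and merging from there into the warehouse keeps the copy step simple, so when the source schema changes there is only one place to adjust.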
I don't know your exact needs and perhaps my suggestion is overkill, but I would encourage you to consider an OLAP approach for the data warehouse your reporting data will come from. I have to warn you that these systems are oriented toward really large data sets and advanced reporting needs, but perhaps you can take some ideas from them. Since you are familiar with the Microsoft ecosystem, I would suggest using Business Intelligence Development Studio. There you could build an OLAP cube using your operational database as the data source and integrate advanced reporting.
Hope I helped.
Related
I am in the middle of a small project aimed at eventually creating a data warehouse. I am currently moving data from a flat file system and two SQL Server databases. The project started in C# to automate the processing of data from the flat file system. Along with this, the project executes stored procedures to bring in data from the other databases; those stored procedures access the other databases through linked servers.
I am wondering if this is the wrong approach: even though it gets the job done, there may be a better way. The alternative I have thought about is to use the app to pull data from each DB and then push it to the data warehouse, but I am not sure about the performance of that. Is there another way? Any path I can look into is appreciated.
'Proper' is a pretty relative term. I have seen a series of stored procedures, SSIS (Microsoft), and third-party tools. They each have some advantages.
Stored procedures
Using a job to schedule a series of stored procedures that insert rows from one server to the next works. I find SQL developers more likely to take this path; it's flexible in design, and good SQL programmers can accomplish nearly anything in here. That said, it is exceedingly difficult to support, troubleshoot, maintain and alter (especially if the initial developer(s) are no longer with the company), and there is usually very poor error handling.
SSIS and other tools such as Pentaho or DataStage (a Google search will turn up a few more)
These give a more graphical design interface, although I've seen SSIS packages that simply called stored procedures in order and might as well have been a job. These tools are really what you make of them. They give very easy-to-see workflows and are substantially more robust when it comes to error handling and troubleshooting (trust me, every ETL process is going to have a few bad days, and you'll be very happy for any logging that helps you identify what went wrong). I find configuring a server's resources (multiple processors, for example) is significantly easier with these tools. They all come with quite a learning curve, though.
I find SQL developers are very much inclined to use the stored procedure route, while people from a DBA background are generally more inclined to use the tools. If you're investing the time into it, SSIS or an equivalent tool is the better way to go from the standpoint of your company's future, though it takes a bit more to implement.
In choosing what to use you need to consider the following factors:
How much data are we talking about moving, and how quickly does it need to be moved? There is a huge difference between using a linked server to move 45,000 records and using it to move 100,000,000 records. Consider also the expected growth of the data set over time; a process that is fine in the early stages may choke and die once you get more records. Tools like SSIS are much faster once you know how to use them, which brings us to point 2.
How much development time do you have, and what tools do the developer and the person who will maintain the import over time know? SSIS, for instance, is a complex tool; it can take a long time to feel comfortable with it.
How much data cleaning and transforming do you need to do? What kind of error trapping and exception processing do you need, what kind of logging will you need? The more complex the process, the more likely you will need to bite the bullet and learn an ETL specific tool.
Even though there are already a few answers, and I agree with two of them, I have to give my subjective opinion about the wider picture.
I am in the middle of a small project aimed to eventually create a data warehouse.
The question title suits your description well, and it could be very helpful to future readers. So, your project should create a data warehouse. Even though it is small, learn to develop projects with scalability in mind. Always!
From that point of view, research what a data warehouse project should look like, and develop each step accordingly.
Custom software vs Stored Procedures (Linked DBs) vs ETL
Custom software (in this case your C# project) should be used in two cases:
Medium-scale projects where an affordable ETL tool cannot do everything
You're working for an enterprise-level IT company, so developing your own solution is cheaper and more manageable
And perhaps you would consider it for tiny, straightforward projects. But no, because those projects can grow and very quickly outgrow your solution (new tables, new sources, a change of ERP or CRM, etc.).
If you're using just SQL Server, and you have no need for data cleansing, data profiling or external data, stored procedures are OK. But that is a lot of 'ifs'. And again, you're losing scalability (say your management wants to add some data from a Google Spreadsheet they use internally, KPI targets for example).
ETL tools are a natural step in data warehouse development. In the beginning there may be just a few table-copy operations or some SQL statements, one source, one target. As your project grows, you can add new transformations.
SSIS is perhaps the best fit since you're using SQL Server, but there are some good free tools as well.
In one of my processes I have a SQL query that takes 10-20% of the total execution time. The query filters my database and loads a list of PricingGrid objects.
So I want to improve its performance.
So far I have thought of two solutions:
Use a NoSQL solution; AFAIK these are good for improving read performance.
But the migration seems hard and needs a lot of work (like importing the data from SQL Server to NoSQL on a regular basis)
I don't have any knowledge of them; I don't even know which one I should use (the first I'd try is RavenDB, because I follow Ayende and it's built by the .NET community)
I might have some things to change in my model to make my objects suitable for a NoSQL database
Load all my PricingGrid objects into memory (in a static IEnumerable); a rough sketch of this idea follows this list
This might be a problem if my server doesn't have enough memory to load everything
I might end up reinventing the wheel (indexes, etc.) that the NoSQL providers have already built
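For what option 2 could look like in practice, here is a rough sketch of a static, periodically refreshed cache. PricingGrid is the existing class; LoadGridsFromSql() and the 10-minute refresh interval are placeholders invented for the example:

    // Option 2 sketch: keep the PricingGrid rows in memory and refresh them periodically.
    // LoadGridsFromSql() and the 10-minute interval are placeholders.
    using System;
    using System.Collections.Generic;

    public static class PricingGridCache
    {
        private static readonly object _lock = new object();
        private static List<PricingGrid> _grids;
        private static DateTime _loadedAtUtc;
        private static readonly TimeSpan MaxAge = TimeSpan.FromMinutes(10);

        public static IEnumerable<PricingGrid> All
        {
            get
            {
                lock (_lock)
                {
                    // Reload when the cache is empty or too old.
                    if (_grids == null || DateTime.UtcNow - _loadedAtUtc > MaxAge)
                    {
                        _grids = LoadGridsFromSql();   // the existing SQL query goes here
                        _loadedAtUtc = DateTime.UtcNow;
                    }
                    return _grids;
                }
            }
        }

        private static List<PricingGrid> LoadGridsFromSql()
        {
            // ... run the current query via ADO.NET or your ORM and materialise the list ...
            return new List<PricingGrid>();
        }
    }

    public class PricingGrid { /* existing model class */ }

This keeps the hot data in memory without any new infrastructure, but it makes the memory-pressure and staleness concerns listed above very real, so they would need measuring.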
I don't think I'm the first one to wonder about this, so what would be the best solution? Are there any tools that could help me?
.NET 3.5, SQL Server 2005, Windows Server 2005
Migrating your data from SQL is only the first step.
Moving to a document store (like RavenDB or MongoDB) also means that you need to:
Denormalize your data (a small example of what this means follows this list)
Perform schema validation in your code
Handle concurrency of complex operations in your code since you no longer have transactions (at least not the same way)
Perform rollbacks in the event of partial commits (changes)
Depending on your updates, reads and network model you might also need to handle conflicts
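On the first point, denormalizing usually means that rows which used to live in separate, joined tables become one self-contained document. A rough illustration, with invented class names:

    // In a relational model Order, OrderLine and Address live in separate tables.
    // In a document store the order is usually stored as one self-contained document,
    // so the related data is embedded rather than joined. Names here are illustrative.
    using System;
    using System.Collections.Generic;

    public class OrderDocument
    {
        public string Id { get; set; }                 // e.g. "orders/1234"
        public string CustomerName { get; set; }       // copied (denormalized) from the customer
        public Address ShippingAddress { get; set; }   // embedded, not a foreign key
        public List<OrderLine> Lines { get; set; }     // embedded child collection
        public DateTime PlacedOn { get; set; }
    }

    public class OrderLine
    {
        public string ProductName { get; set; }        // denormalized product description
        public int Quantity { get; set; }
        public decimal UnitPrice { get; set; }
    }

    public class Address
    {
        public string Street { get; set; }
        public string City { get; set; }
        public string PostalCode { get; set; }
    }

Loading an order then becomes a single document read, but anything copied in (customer name, product names) has to be kept up to date by your own code.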
You provided very limited information but it sounds like your needs include a single database server and that your data fits well in the relational model.
In such a case I would vote against a NoSQL solution; it is more likely that you can speed up your queries with database optimizations and still retain all the added value of an RDBMS.
Non-relational databases are tools for a specific job (no matter how they sell them), if you need them it is usually because your data doesn't fit well in the relational model or if you have a need to distribute your data over multiple machines (size or availability). For instance, I use MongoDB for a write-intensive high throughput job management application. It is centralized and the data is very transient so the "cost" of having low durability is acceptable. This doesn't sound like the case for you.
If you prefer a NoSQL-style solution, perhaps you should try Memcached + MySQL (InnoDB). This will give you the speed benefits of an in-memory cache (in the form of a memcached daemon plugin) with the underlying protection and capabilities of an RDBMS (MySQL). It should also ease data migration and somewhat reduce the amount of changes required in your code.
I have never used it myself; I find that I either need NoSQL for the reasons I stated above, or that I can optimize the RDBMS using stored procedures, indexes and views in a way that is sufficient for my needs.
Asaf has provided great information regarding the usage of NoSQL and when it is most appropriate. Given that your main concern is performance, I tend to agree with him: it would take you much more time and effort to adopt a completely new (and very different) data persistence platform than it would to trick out your SQL Server setup. That said, my answer is mainly to address the "how" part of your question.
Addressing misunderstandings:
Denormalizing Data - You do not need to manually denormalize your existing data. This will be done for you when it is migrated over. More than anything you need to simply think about your data in a different fashion - root aggregates, entity and value types, etc.
Concurrency/Transactions - Transactions are possible in both Mongo and Raven; they are simply done in a different fashion. One of the ways Raven handles this is by using an ORM-like "unit of work" pattern with its session objects (a small sketch follows). Yes, your data validation needs to be done in code, but you should already be doing it there anyway. In my experience this is an over-hyped con.
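As a sketch of that unit-of-work style, the RavenDB .NET client of that era works roughly like this; the URL, the document id and the Order class are made-up example values:

    // Sketch of RavenDB's session-based unit of work (RavenDB 1.x/2.x era client).
    // The URL, document id and Order class are assumptions for illustration.
    using System.Collections.Generic;
    using Raven.Client.Document;

    public class Order
    {
        public string Id { get; set; }
        public List<string> Items { get; set; }
    }

    public static class RavenUnitOfWorkExample
    {
        public static void AddItem()
        {
            // One DocumentStore per application.
            var store = new DocumentStore { Url = "http://localhost:8080" };
            store.Initialize();

            // The session tracks every loaded object, like an ORM unit of work.
            using (var session = store.OpenSession())
            {
                var order = session.Load<Order>("orders/1234");
                order.Items.Add("SKU-42");

                // All tracked changes are sent to the server as a single batch.
                session.SaveChanges();
            }
        }
    }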
How:
Install Raven or Mongo on a primary server, run it as a service.
Create or extend an existing application that uses the database you intend to port. This application needs all the model classes/libraries that your SQL database provides persistence for.
a. In your "data layer" you likely have a repository class somewhere. Extract an interface from this, and use it to build another repository class for your Raven/Mongo persistence (a rough sketch of this shape appears after these steps). Both DBs have plenty of good documentation for using their APIs to push/pull/update changes in the document graphs. It's pretty damn simple.
b. Load your SQL data into C# objects in memory. Pull back your top-level objects (just the entities) and load their inner collections and related data in memory. Your repository is probably already doing this (e.g. when fetching an Order object, ensure not only its properties but also associated collections like Items are loaded in memory).
c. Instantiate your Raven/Mongo repository and push the data to it. Primary entities become "top level documents" or "root aggregates" serialized in JSON, and their collections' data nested within. Save changes and close the repository. Note: You may break this step down into as many little pieces as your data deems necessary.
Once your data is migrated, play around with it and ensure you are satisfied. You may want to modify your application models a little to adjust the way they are persisted to Raven/Mongo - for instance you may want to make both Orders and Items top-level documents and simply use reference values (much like relationships in RDBMS systems). Watch out here though, as doing so somewhat goes against the principle of (and the performance behind) NoSQL, since now you have to hit the DB twice to get the Order and the Items.
If satisfied, shard/replicate your mongo/raven servers across your remaining available server boxes.
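Steps 2a-2c in very rough code. IOrderRepository, the Order class and the connection details are illustrative names for the shape of the approach, not anything either database prescribes:

    // Rough shape of steps 2a-2c: one repository interface, the existing SQL-backed
    // implementation (not shown), and a Raven-backed writer the migration pushes into.
    using System.Collections.Generic;
    using Raven.Client.Document;

    // The aggregate pulled back in step 2b: the entity plus its child collections.
    public class Order
    {
        public string Id { get; set; }
        public string CustomerName { get; set; }
        public List<string> Items { get; set; }
    }

    // 2a: the interface extracted from the existing data layer (name is illustrative).
    public interface IOrderRepository
    {
        IEnumerable<Order> GetAllWithItems();
    }

    // 2a/2c: a second persistence class, this one backed by RavenDB.
    public class RavenOrderWriter
    {
        private readonly DocumentStore _store;

        public RavenOrderWriter(string url)
        {
            _store = new DocumentStore { Url = url };
            _store.Initialize();
        }

        public void Save(Order order)
        {
            using (var session = _store.OpenSession())
            {
                // 2c: the order is serialized as one JSON document, items nested inside it.
                session.Store(order);
                session.SaveChanges();
            }
        }
    }

    // 2b/2c: read fully populated entities from the existing SQL repository and push them across.
    public static class Migration
    {
        public static void Run(IOrderRepository sqlSource, RavenOrderWriter target)
        {
            foreach (var order in sqlSource.GetAllWithItems())
                target.Save(order);
        }
    }

Keeping the shared interface also means the application consuming the data can be pointed at either implementation while you compare the two.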
Obviously there are tons of little details I did not explain, but that is the general process, and much of it depends on the applications already consuming the database and may be tricky if more than one app/system talks to it.
Lastly, just to reiterate what Asaf said: learn as much as you can about NoSQL and its best use cases. It is an amazing tool, but not a golden solution for all data persistence. In your case, try to really find the bottlenecks in your current solution and see if they are solvable. As one of my systems guys says, "technology for technology's sake is bullshit".
We have many systems that talk to each other and it's become a bit of a mess, e.g. System B gets data from System A, System A gets data from System C, which also gets data from System B, and so on. The data is passed around using a variety of methods. Some of the data is copied across periodically using SQL, thus duplicating it. Some of the data is pulled using views, locally and remotely, in real time. We want to come up with a better solution. My plan is to create a central repository that the systems dump data into and get data from. Does this sound like a good idea? What is the best practice for handling data between remote systems?
Thanks in advance.
You mean like a data warehouse? This is pretty standard as long as you don't want to update the data, and just want to use it for reporting/driving other applications.
You have a variety of options for getting the data in there, including linked servers, SSIS packages and replication (if between Oracle servers or MS SQL servers).
You can read Microsoft's recommendation: http://technet.microsoft.com/en-us/library/dd459147(SQL.100).aspx
As Martin Booth and Dalex say, if the data is used only for reporting, a data warehouse is the obvious solution.
If you use the data in transactional systems, there are some other options.
If your system is primarily about data, I'd consider using ETL tools (http://en.wikipedia.org/wiki/Extract,_transform,_load) to manage the copying around of data.
If your system is not just about data, you should look at a service-oriented architecture; this is a brilliantly vague term, and can result in many billable consulting hours, so it's worth doing your homework. In general, the idea is to decouple the underlying implementation (views, replication, dump/restore etc.) from the conceptual "services". This might be too big a jump from where you are now - but the principles are useful when designing your solution.
I have started working on a project in the financial services industry that is based (mainly) on SQL Server (2000), ColdFusion (8), and some Access/.NET applications. This project started as some simple Access forms/VBA and was slowly converted to web interfaces.
I could say that the database design and application coding was done by people that were learning on the job and didn't have the opportunity to learn about good design principles from the start. Many of the business rules are set in a myriad of cascading functions and stored procedures as well as in the web server templates. There is a huge amount of special case handling deep within complex 500-line SQL UDFs that use uncommented constants. It is very difficult to trace all of the interactions between the 10-20 UDFs that might be involved in a query. Some of the queries seem to take way too long to run (up to 15 minutes).
While the tables are fairly well indexed, there is a lack of FK relationships and almost no referential integrity. The DB is updated infrequently with daily batches of low volume (1,000 records in multiple tables.) It is primarily used to serve as a data repository - I suppose a data warehouse. We get very infrequent deadlocks or delays.
So, my question is: If I want to re-implement the whole project including the database and front-end would it make sense to look at non-relational implementations? The primary DB is only about 1GB (.mdf) so it could fit easily in memory. I would like to move from the SQL query structure to some declarative model that could be efficiently compiled and executed. If necessary, I could use the SQL DB just as a data store.
Why do you want to move from the relational approach? By moving away from it, you are only going to bury business logic deeper into the code. As you pointed out, the data model is fairly simple. You could first look at improving the data model itself. The reason there may not be any referential integrity constraints could be that the initial designers assumed they would hurt performance, and they may be doing the checks in code that is itself inefficient.
Your DB is small; adding referential integrity constraints will not affect performance in any way. If required, you can rewrite some of the UDFs. Why don't you use Query Analyzer to look at the performance metrics? That will give you a good starting point for analysis.
If I want to re-implement the whole project including the database and front-end would it make sense to look at non-relational implementations?
In general, most developers, even those who breathe map/reduce and wear NoSQL T-shirts, feel a LOT more comfortable with SQL.
If your application follows the classic MVC/MVP model, then most of the frameworks (e.g. Spring, Rails, Grails, Django, Webmachine, etc.) come with first-class support for a SQL back end, and some support for a NoSQL one.
In case you see no actual benefit that NoSQL can bring to your system ( here are the benefits I posted to another question ), why bother?
I would like to have a set of "english-language" rules that describe the transformations from the underlying raw data to a form that can be directly consumed by the application (web)
It seems that you are talking about a classic persistence layer with a service layer on top of it, where the "english-language" rules are simply well-named methods in your service layer. Unless you need a more sophisticated rules engine, but most of the time that is not needed.
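A minimal illustration of what "english-language methods in a service layer" might look like; the service, repository and rule here are invented names for the example, not anything from your system:

    // Sketch of a service layer whose method names read like the business rules
    // they implement. OrderReportService, IOrderRepository and the rule itself
    // are invented for illustration.
    using System.Collections.Generic;
    using System.Linq;

    public class OrderReportService
    {
        private readonly IOrderRepository _orders;

        public OrderReportService(IOrderRepository orders)
        {
            _orders = orders;
        }

        // "An order is ready to invoice once every item has shipped and it has not
        // already been invoiced" lives in one named method, instead of being buried
        // in a 500-line UDF.
        public IEnumerable<Order> GetOrdersReadyToInvoice()
        {
            return _orders.GetAll()
                          .Where(o => o.AllItemsShipped && !o.Invoiced)
                          .ToList();
        }
    }

    public interface IOrderRepository
    {
        IEnumerable<Order> GetAll();
    }

    public class Order
    {
        public bool AllItemsShipped { get; set; }
        public bool Invoiced { get; set; }
    }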
When developing, whether for web or desktop, at what point should a developer switch from SQLite, MySQL, MS SQL, etc.?
It depends on what you are doing. You might switch if:
You need more scalability or better performance - say from SQLite to SQL Server or Oracle.
You need access to more specific datatypes.
You need to support a customer that only runs a particular database.
You need better DBA tools.
Your application moves to a different platform where your database no longer runs, or its libraries do not run.
You have the ability/time/budget to actually make the change. Depending on the situation, the migration could be a bigger project than everything in the project up to that point. Migrations like these are great places to introduce inconsistencies, or to lose data, so a lot of care is required.
There are many more reasons for switching and it all depends on your requirements and the attributes of the databases.
You should switch databases at milestone 2.3433, 3ps prior to the left branch of dendrite 8,151,215.
You should switch databases when you have a reason to do so, would be my advice. If your existing database is performing to your expectations, supports the load that is being placed on it by your production systems, has the features you require in your applications and you aren't bored with it, why change? However, if you find your application isn't scaling, or you are designing an application that has high load or scalability requirements and your research tells you your current database platform is weak in that area, or, as was already mentioned, you need some spatial analysis or feature that a particular database has, well there you go.
Another consideration might be taking up the use of a database agnostic ORM tool that can allow you to experiment freely with different database platforms with a simple configuration setting. That was the trigger for us to consider trying out something new in the DB department. If our application can handle any DB the ORM can handle, why pay licensing fees on a commercial database when an open source DB works just as well for the levels of performance we require?
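An ORM does this for you, but for reference, the ADO.NET building block it sits on looks roughly like this, where the provider name comes from configuration; the provider and connection-string values shown are just examples:

    // A bare-bones look at swapping database platforms via configuration.
    // An ORM such as NHibernate or Entity Framework does the same thing at a
    // higher level; the provider and connection string values are examples only.
    using System.Data.Common;

    public static class Db
    {
        public static DbConnection OpenConnection(string providerName, string connectionString)
        {
            // providerName might be "System.Data.SqlClient", "System.Data.SQLite", ...
            DbProviderFactory factory = DbProviderFactories.GetFactory(providerName);

            DbConnection connection = factory.CreateConnection();
            connection.ConnectionString = connectionString;
            connection.Open();
            return connection;   // caller disposes
        }
    }

Because both values can live in a config file, trying a different database platform becomes a configuration change rather than a code change, as long as you avoid vendor-specific SQL.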
The bottom line, though, is that with databases or any other technology, I think there are no "business rules" that will tell you when it is time to switch - your scenario will tell you it is time to switch because something in your solution won't be quite right, and if you aren't at that point, no need to change.
BrianLy hit the nail on the head, but I'd also add that you may end up using different databases at different levels of development. It's not uncommon for developers to use SQLite on their workstation when they're coding against their personal development server, and then have the staging and/or production sites using a different database tool.
Of course, if you're using extensions or capabilities specific to a certain database tool (say, PostGIS in PostgreSQL), then obviously that wouldn't work.