We currently have a solution where we increasingly need to store unstructured data of various kinds. For example, clients can define their own workflows, in which they specify what kind of data should be captured (of various types, some simple, some complex). This data then needs to be stored and displayed in a web application that offers some functionality to modify it.
Until now the workflows have been defined internally, so an MS SQL database was designed to cater for these specific workflows and their data. Now that clients can define workflows, we need to relax the structure of our DB. At first I thought that a key-value table in MS SQL might be a good idea, but then I lose the type information of the captured data and need to deserialize all of it in the website (MVC.NET). I am also considering something like RavenDB, but I am not sure whether it would be a good fit.
So my question is: what would be the best way to store this unstructured data, bearing in mind that users must also be able to search, edit and display it?
How about combining two types of database? Use a NoSQL database for your unstructured data, and use the relational MS SQL database to save references to the data for each workflow so you can retrieve it later.
The data type will always be a problem, and you will always have to deserialize it. Searching can be done by taking the string representation of each value in your workflow and combining them into a searchable field in your MS SQL row.
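As a rough illustration of the searchable-field idea, here is a minimal Python sketch. It uses an in-memory SQLite database purely as a stand-in for MS SQL, and it keeps the JSON payload in the same row for brevity; in the design described above, that column would instead hold a reference into the NoSQL store. The table name, column names and sample values are all assumptions made for the example.

```python
import json
import sqlite3
from datetime import date


def to_search_text(values):
    """Join the string form of every captured value into one lowercase string."""
    return " ".join(str(v) for v in values.values()).lower()


def save_workflow_item(conn, workflow_id, values):
    """Store the raw values as JSON plus a combined search field."""
    # default=str turns dates and other non-JSON types into strings,
    # so type information still has to be restored when reading back.
    payload = json.dumps(values, default=str)
    cur = conn.execute(
        "INSERT INTO workflow_item (workflow_id, payload, search_text) VALUES (?, ?, ?)",
        (workflow_id, payload, to_search_text(values)),
    )
    return cur.lastrowid


def search(conn, term):
    """Return (id, values) for every item whose search field contains the term."""
    rows = conn.execute(
        "SELECT id, payload FROM workflow_item WHERE search_text LIKE ?",
        (f"%{term.lower()}%",),
    )
    return [(row_id, json.loads(payload)) for row_id, payload in rows]


# Hypothetical table standing in for the MS SQL row described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflow_item ("
    "id INTEGER PRIMARY KEY, workflow_id INTEGER, payload TEXT, search_text TEXT)"
)
save_workflow_item(conn, 1, {"customer": "Acme Ltd", "amount": 125.5, "due": date(2024, 3, 1)})
save_workflow_item(conn, 1, {"customer": "Globex", "approved": True})
```

Note that the date comes back from `search` as a plain string, which is the deserialization problem mentioned above: the type has to be reapplied by the application.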
Related
In short: I have a client who wishes to be able to add domain tables without adding SQL tables.
I am working with an application in which data is organized and made available through a PostgreSQL catalogue. By catalogue I mean that the database holds the path to the actual data file(s) as well as some metadata.
Adding a new table means that the (Java class of the) client application has to be updated. This is a costly process for the client, who wants us to find a way to let him add new kinds of data to the catalogue without having to change the schema.
I don't have many more specifics about the DB itself and its configuration, as I'm usually just a client of it.
My idea to solve this was to have a generic table with the most often used columns (date, comment, etc.) and a column containing a domain key. The client application would use the domain key to request the kind of generic data it needs (the key would have no meaning whatsoever to the DB provider). Additional metadata could be stored in a companion file within the catalogue, and further filtering would have to be done on the client side.
Question: as I am by no means an SQL expert, I would like to know whether this is an acceptable solution and what limitations I could face. I'm thinking of performance, data volume, etc. Or is a different approach advisable?
Regarding expected volume: for a single domain data type, it could be around 30 new entries per day.
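The generic-table idea described above could be sketched roughly as follows. This is a Python example using an in-memory SQLite database only as a stand-in for PostgreSQL; the table name, column names, domain keys and file paths are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE generic_entry (
        id         INTEGER PRIMARY KEY,
        domain_key TEXT NOT NULL,   -- opaque to the database, meaningful to the client
        entry_date TEXT NOT NULL,
        comment    TEXT,
        file_path  TEXT NOT NULL    -- path to the actual data file
    )
    """
)
# An index on the domain key keeps per-domain lookups cheap as rows accumulate.
conn.execute(
    "CREATE INDEX idx_generic_entry_domain ON generic_entry (domain_key, entry_date)"
)


def add_entry(domain_key, entry_date, file_path, comment=None):
    """Insert one catalogue row for the given domain."""
    conn.execute(
        "INSERT INTO generic_entry (domain_key, entry_date, comment, file_path) "
        "VALUES (?, ?, ?, ?)",
        (domain_key, entry_date, comment, file_path),
    )


def entries_for(domain_key):
    """Return (entry_date, file_path) pairs for one domain, oldest first."""
    rows = conn.execute(
        "SELECT entry_date, file_path FROM generic_entry "
        "WHERE domain_key = ? ORDER BY entry_date",
        (domain_key,),
    )
    return rows.fetchall()


add_entry("soil_samples", "2024-01-02", "/data/soil/0002.dat")
add_entry("soil_samples", "2024-01-01", "/data/soil/0001.dat", "first sample")
add_entry("weather", "2024-01-01", "/data/weather/0001.dat")
```

At about 30 entries per day, one domain adds roughly 11,000 rows per year, which is a small volume for an indexed table of this shape.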
How do I describe the partition of client data when all data is stored in one place and separated via programming?
If a collection of data from various clients is stored in a variety of SQL tables and is separated via the code (E.g. members from different orgs defined by organisation table) at which layer is the data separation defined?
Sorry if this question is a bit poorly worded.
In terms of how to explain it, I'd need more information on how you're actually separating the data for consumption by different members, but we've done a similar thing using SQL views. In our case, it's pretty easy to explain because each role (i.e., a set of user permissions determined by their need-to-know) has a set of SQL views they have permissions to view and query but not modify. Then users can query the views as needed to make their own reports and datasets.
If you're looking for more technical jargon, this was one of the documents we came across when setting up our security.
It might be easiest to explain that each data element has a set of roles that have access to that data element. Your role within the multitude of client organizations determines which data elements you can work with in your reports. Then you would just want to use very strong language indicating how you have implemented safeguards ensuring that users cannot, in any way, access data that is not relevant to their need-to-know.
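To make the "each data element has a set of roles" explanation concrete, here is a small Python sketch of such a mapping. The view names and role names are invented for illustration; in the setup described above, the real enforcement lives in the database permissions on the views, not in application code like this.

```python
# Hypothetical mapping: each data element (here, a view name) lists the
# roles that are allowed to read it.
ELEMENT_ROLES = {
    "v_member_contact": {"org_admin", "support"},
    "v_member_billing": {"org_admin", "finance"},
    "v_usage_summary": {"org_admin", "finance", "support"},
}


def elements_for(role):
    """Return the sorted names of the data elements a role may read."""
    return sorted(name for name, roles in ELEMENT_ROLES.items() if role in roles)


def can_read(role, element):
    """Deny by default: unknown elements or roles get no access."""
    return role in ELEMENT_ROLES.get(element, set())
```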
I am evaluating Amazon SimpleDB at this time. SimpleDB is very flexible in the sense that it does not have to have table (or domain) schemas. The schema evolves as the create / update commands flow in. All this is good but while I am using a modeling tool (evaluating MindScape LightSpeed) I require the schema upfront, in order for the tool to generate models based on the schema. I can handcraft domains in SimpleDB and that does help but for that I have to perform at least one create operation on the domain. I am looking for the ability to create domain schema only. Any clues?
There is no schema in SimpleDB.
This is the reason why the NoSQL people suggest that you "unlearn" relational databases before shifting to these non-relational data stores.
So, you cannot do what you describe. Without the data, there will be nothing.
While it's true that SimpleDB has no schema support, keeping some type information turns out to be crucial if you run queries on numeric data or dates*. Most NoSQL products have both queries and types, or else no-queries and no-types, but SimpleDB has chosen queries and no-types.
As a result, integrating with any tool outside of your main application will require you to either:
1. store duplicate type information in different places, or
2. create your own simple schema system to store the type information
Option 2 seems much better and choosing it, despite what some suggest, does not mean that you "don't have your mind right."
S3 can be a good option for this data: you can keep it in a file with the same name as your domain, and it will be accessible from anywhere with the same AWS credentials as your SimpleDB account.
Storing the data as a list of attributename=formatname is the extent of what I have needed to do. You can, in fact, store all this in an item in your domain. The only issue is that this special item could unintentionally come back from a domain query where you are expecting live data not type information.
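A list of attributename=formatname pairs like the one described above is simple to read and write. The following Python sketch shows one way to do it; the attribute names and format names are made up for the example, and loading or saving the text (from S3 or from a special item) is left out.

```python
def parse_type_info(text):
    """Parse lines of 'attributename=formatname' into a dict, skipping blank lines."""
    formats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, fmt = line.partition("=")
        formats[name.strip()] = fmt.strip()
    return formats


def dump_type_info(formats):
    """Render the dict back into one 'attributename=formatname' line per attribute."""
    return "\n".join(f"{name}={fmt}" for name, fmt in sorted(formats.items()))


# Hypothetical type information for one domain; the format names are
# whatever conventions your application has settled on.
SCHEMA_TEXT = """
price=int-pad10-offset0
created=iso8601
title=string
"""
```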
I'm not familiar with MindScape LightSpeed, but this is a general strategy I have found beneficial when using SimpleDB, and if the product is able to load/store a file in S3 then all the better.
*Note: just to be clear, I'm not talking about reinventing the wheel or trying to use SimpleDB as a relational database. I'm talking about the fact that numeric data must be stored with both zero padding (to a length of your choosing) and an offset value (depending on whether it is signed or unsigned) in order to work with SimpleDB's string-based query language. Once you decide on a format, or a set of formats to be used in your application, it would be folly to leave that information hidden in and scattered across your source files when it is also needed by source code tools, query tools, reporting tools or any other code.
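The zero padding and offset mentioned in the note can be sketched in a few lines of Python. The offset and width below are assumptions chosen for the example; the point is only that, once every number is shifted to be non-negative and padded to the same length, comparing the strings gives the same order as comparing the numbers.

```python
def encode_int(value, offset, width):
    """Shift by offset so the number is non-negative, then left-pad with zeros."""
    shifted = value + offset
    if shifted < 0 or len(str(shifted)) > width:
        raise ValueError("value does not fit the chosen offset/width")
    return str(shifted).zfill(width)


def decode_int(text, offset):
    """Reverse encode_int: parse the digits and remove the offset."""
    return int(text) - offset


OFFSET = 1_000_000  # assumed: values never go below -1,000,000
WIDTH = 8           # assumed: shifted values never need more than 8 digits

values = [42, -7, 1000, -999_999, 0]
# Sorting the encoded strings is a plain string comparison, which is
# the only kind of comparison a string-based query language performs.
encoded = sorted(encode_int(v, OFFSET, WIDTH) for v in values)
decoded = [decode_int(e, OFFSET) for e in encoded]
```

Without this encoding, the string "10" sorts before "9", and negative numbers sort unpredictably, so range queries on raw numeric strings return wrong results.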
I have a large and complex SQL Server 2005 DB used by multiple applications. I want to create a data dictionary for maintaining not only my DB objects but also a cross-reference of them against the applications that use a specific object.
For example, if a stored procedure is used by 15 different applications, I want to record that additional data too.
What are the key elements to keep in mind so that I get an efficient and scalable data dictionary?
So, I recently helped to build a data dictionary for a very large product. We were dealing with documenting more than one-thousand tables using a change request process. I can send you a scrubbed version of the spreadsheet we used if you want. Basically, we captured the following:
Column Name
Data Type
Length
Scale (for decimals)
Whether the column is custom for the application(s) or a default column
Which application(s)/component(s) the column is used in
Release the column was introduced in
Business definition
We also captured information about who requested the addition, their contact information, etc. Our primary focus was on business definition, and clearly identifying why a column was being used or created.
We didn't have stored procedures in our solution, but bear in mind that these would be pretty easy to add to the system.
We used Access for our front-end, even though SQL Server was on the back end. It made it pretty easy for us to build out a rich user interface without much work, using the schema we had already built out.
Hope this helps you get started--feel free to ask if you have additional questions.
I've always been a fan of using the 'extended properties' within SQL Server for storing this kind of meta data. In this way the description of each object lives alongside the object and is accessible by anyone with access to the database itself. I'm sure there are also tools out there that can read these extended properties and turn them into a nicely formatted document.
As far as being "scalable", I don't know of any issues related to adding large amounts of data as extended properties; or I should say I've never had any issues with this.
You can set these extended properties using the SQL Server Management Studio 'property' dialog for each table/proc/function/etc., or by calling the 'sp_addextendedproperty' stored procedure.
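For reference, here is a small Python sketch that builds the T-SQL for such a call, assuming a schema-level object such as a stored procedure. The property name "UsedBy", the procedure name and the property value are hypothetical. The string building is only meant to show the shape of the call; when running it through a database driver, passing the values as parameters would be the safer choice.

```python
def quote(value):
    """Render a Python string as a T-SQL Unicode literal, doubling single quotes."""
    return "N'" + value.replace("'", "''") + "'"


def add_property_sql(name, value, schema, object_type, object_name):
    """Build an EXEC sp_addextendedproperty call for a schema-level object."""
    return (
        "EXEC sp_addextendedproperty "
        f"@name = {quote(name)}, @value = {quote(value)}, "
        f"@level0type = N'SCHEMA', @level0name = {quote(schema)}, "
        f"@level1type = {quote(object_type)}, @level1name = {quote(object_name)};"
    )


# Query that lists object-level properties from the catalog view.
LIST_PROPERTIES_SQL = (
    "SELECT OBJECT_NAME(ep.major_id) AS object_name, ep.name, ep.value "
    "FROM sys.extended_properties AS ep "
    "WHERE ep.class = 1 AND ep.minor_id = 0;"
)

# Hypothetical example: record which application file uses a procedure.
statement = add_property_sql(
    "UsedBy", "Alpha: Beta.aspx", "dbo", "PROCEDURE", "usp_GetOrders"
)
```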
Without going into specifics...I have a large SQL Server 2005 database with umpteen stored-procedures.
I have multiple applications from WinForm apps to WebServices all of which use this DB.
My simple objective now is to create a meta-database: a prospective data dictionary in which I can maintain details of which specific application file uses which SP.
For example, my application Alpha has a file Beta.aspx that uses 3 SPs, which are physically configured for use in BetaDAL.cs.
As you might have inferred by now, this will make life easier for me later when there is a migration or deprecation: I can just query this DB SP-wise to get all apps/files that use a given SP, or vice versa.
I can establish this as a single denormalized table, or structure it in a better way.
Does some schema already exist for this purpose?
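As one possible alternative to the single denormalized table mentioned above, a normalized layout could look like the following Python sketch. It uses an in-memory SQLite database only as a stand-in for SQL Server, and all table names, column names and procedure names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE application (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE app_file (
        id INTEGER PRIMARY KEY,
        application_id INTEGER NOT NULL REFERENCES application(id),
        path TEXT NOT NULL
    );
    CREATE TABLE stored_procedure (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    -- Link table: one row per (file, procedure) usage.
    CREATE TABLE file_uses_procedure (
        app_file_id INTEGER NOT NULL REFERENCES app_file(id),
        procedure_id INTEGER NOT NULL REFERENCES stored_procedure(id),
        PRIMARY KEY (app_file_id, procedure_id)
    );
    """
)
conn.execute("INSERT INTO application (id, name) VALUES (1, 'Alpha')")
conn.execute("INSERT INTO app_file (id, application_id, path) VALUES (1, 1, 'BetaDAL.cs')")
conn.executemany(
    "INSERT INTO stored_procedure (id, name) VALUES (?, ?)",
    [(1, "usp_GetBeta"), (2, "usp_SaveBeta"), (3, "usp_DeleteBeta")],
)
conn.executemany(
    "INSERT INTO file_uses_procedure (app_file_id, procedure_id) VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3)],
)


def users_of(procedure_name):
    """Return (application, file) pairs that use the given stored procedure."""
    rows = conn.execute(
        """
        SELECT a.name, f.path
        FROM stored_procedure AS p
        JOIN file_uses_procedure AS u ON u.procedure_id = p.id
        JOIN app_file AS f ON f.id = u.app_file_id
        JOIN application AS a ON a.id = f.application_id
        WHERE p.name = ?
        ORDER BY a.name, f.path
        """,
        (procedure_name,),
    )
    return rows.fetchall()
```

The reverse lookup (all SPs used by a given file or application) is the same join filtered on the other side.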
SQL Server supports what are called extended properties, basically a key-value dictionary attached to every object in the catalog. You can add whatever custom information about the catalog (comments on tables, columns, stored procedures, ...) you wish to store as extended properties and query them along with the normal catalog views.
Here's one overview (written for SQL Server 2005, but roughly the same techniques should apply for 2000 or 2008).