for a personal project I am searching for the "most suitable" database engine to hit the following key issues
need to store large amounts of single different document files (PDF)
need to perform full-text search onto PDF (for this I plan to use OCR and save the processed data/metadata additionally to the database)
need to get pieces/chunks of the saved documents (for example from a specific year) and show a preview of lots of them within a nice web UI
as much performance as possible
Up to now I did work a lot with SQL (MySql) and have some theoretical knowledge about other systems (MemCached, Redis, PostgreSQ, MongoDb). But I`ve never used them in combination and never hit the point WHEN they should be used for WHAT exactly or how they can be combined.
I think especially for a project like this it`s very important to select the right engine from beginning not to hit performance issues later.
So especially to all experienced developers out there, what would be your favourite choiche for this kind of (I gues SQL may not be the only right solution) ?
Or at the end will it be better to store files within filesystem and keep only metadata in database ?
BTW my planned API backend for this is Laravel 7+, frontend will be Vue 2+.
Thank you very much !
The problem
Ok, sorry that my question is somewhat abstract and subjective, but will try to make it as specific as possible. So, the situation I am in is simple - I am remaking a very old MS Access application on a new website using ASP.NET MVC. As currently the MVC site is using SQL Server 2008 (for many well known reasons) I need to find a way to migrate the tables AND the data, because the information in the old database will be used in the new application.
Alright, so far so good, however there are a few problems. The old application is written in a different language, meaning that I want to translate table names, field names, and all other names that are there to English. Furthermore, I will be making some changes on the models themselves (change the type of some fields, add additional fields to some tables, remove old unnecessary ones and more). So technically I'll be 'having my way' with everything.
Researched solutions
With those things in mind I researched for the ways to migrate data from Access database to a SQL Server. Of course, there is a lot of information on the matter, in Stack Overflow alone there are more than a few questions and solutions. So why am I struggling to find the answer ? Well I found a few solutions that will be sufficient to some extend (actually will definitely solve my problems) but I am writing to ask if someone experienced has a better perspective on it than I do. Alright, the solutions and why I am still looking for advice: /I'll be listing just a couple of the most common and popular ones that I found, many of the others share the same capabilities and/or results /
Upsize Wizzard (Access) - this is a tool devised specifically for migrating tables and data from Access. It is my most favourite one for the moment as I find it kind of straightforward to work with and it provides good overall results. I was able to migrate the tables to SQL Server (along with the data of course) which more or less is what I am intending to do. It is fast, it seems like it allows you to migrate indexes, primary keys and even to my knowledge foreign keys (table relationships). The downsides of this tool, however, include that it ignores your queries (which I don't really need honestly) and it doesn't provide a way to change the model, names or types of the properties of the table you migrate - which is the thing I kind of prefer, because I will have to make more than a few changes, adding, renaming, deleting, etc. And then continue with the development process (of the application) which will lead to a few additional minor changes. And finally I would need to apply all changes (migration + all changes) on the production server, which overall is prone to mistakes as I will be doing it by hand (and there are more than a few tables).
SQL Server Migration Assistant (SSMA) - ok, this is a separate tool (not included in Access) with again the same idea - to migrate data from Access to ... possibly everywhere, haven't researched that. Overall it offers more functionality and customizing from the Upsize Wizard, but of course it does it in a more complicated way. I haven't put enough effort to make a migration with this tool yet, as it involves a lot of installations and additional work, but according to my research it provides almost all (if not all) of the functionality I require. The downside however comes with the naming. As I mentioned it allows you to apply changes on the tables, schema, fields, indexes, keys and probably everything, but the articles advice that I change the names in Access first, as it will be easier and the migration process will run more smoothly. I am not allowed to make changes on the original Access database, as it will remain functional until the publish of the 'renewed' project, and the data inside it is being used, so a mere copy of the file is a solution I am not particularly fond of, because I might loose new records. Also I cant predict the changes I would want to make in the development process (as I said I believe I would want/need to apply some additional changes later on when I find 'weaknesses' in my data design in the development process) so I find it to be a little half baked solution.
Conclusion
The options presented, the way I see them, are two:
Use the Upsize Wizard to migrate the access tables, then write a script that applies the changes I want to make. Then in the development process add any additional changes to the script. When ready to publish on the production server, reapply the migration with the wizard, run the changes script and pray everything is fine.
Get more involved with the SSMA tool and try producing an updated version of the tables with the migration process. (See how efficient the renaming is and decide whether to use copied file to rename and then find a way to migrate only new records or do it all in the SSMA). Then again write a script for the changes that occur in the development process and re-do and apply it all on the production server when ready and then pray everything is fine.
Option I have not yet seen, apply it and then pray everything is fine.
I have researched the matter for a couple of days now, and found a few more solutions that I do not believe are better by the mentioned. However I include the possibility of missing the 'big red X on the map', a practical and easy solution which seems like it was designed specifically for me (though I doubt that a little). Anyway, reducing all the madness that I have written so far to a few simple questions will look like:
Is anyone aware if my conclusions are correct? I am leaning towards option one as it is easier to accomplish.
Has anyone experienced/found a better way to do that, or just found some 'logic-leaps' in my writings as I am overthinking the entire thing a little and may be doing some obvious miscalculation.
Very sorry for asking a trivial question and one that includes decision making that may involve deeper understanding of my project and situation, yet I am working with rather sensitive data and would appreciate feedback, even if only to improve my confidence into the chosen approach.
There is one other tool/method you might want to consider that seems to cater to your specific needs more. This would be to use the data import/export tool that ships with sqlserver to do a complete copy of all data into a temporary location within sql server and then write custom queries to reorganize the names and other changes you want to make. Is a bit more work but you could use the end product as a seed method for your migrations ;) (if you are doing code first anyway)
I have a webpage where i have to allow the users to customize their header and footer.
i.e. I should store the Users header and footer HTML and should dynamically add it to the webpage. I have two ways of storing in database and storing in a files. Please suggest me which approach is better.
Solution with files get messier with time. With databases, it is easier to scale.
With databases, you can add bookkeeping fields (like last-modified, tags, or something else depending on your need). Backup is easier also perhaps.
With files, you have to worry about directory structure (having too many files in single directory is not good), permissions, etc.
If you are worried about efficiency, stop worrying :). MySQLqueries are pretty fast especially with the caching mechanisms/modules in apache.
Ther's not a better approach (there are pro and cons in general) but in this specific case I would store these snippet as a file because you have less complexity for sure (because you don't need to query a database and fetching result) and you don't relies on database connection for including header and footer
If you're using .Net you've something called Portals which does the same thing. There are things like master pages also that you may want to read. But all these are in .Net. Even if you're not doing this in .Net it'd be time consuming to handle all this stuff on your own as you need to take care of cross-site scripting and a few other issues.
Check the platform for features that you're working on to find out if this is possible by them. (Let me know the platform that you're using so I may help in that). Also, if the changes are just cosmetic you may store just css settings instead of complete html.
Finally it'd be better to use sql if the number of changes to store are more than 100 as the complexity will bug you down. But if you're fewer users and don't expect any scaling up then sure go for a file system.
:
Here are a couple of links for understanding portals and web parts in .Net:
http://msdn.microsoft.com/en-us/magazine/cc300767.aspx
http://msdn.microsoft.com/en-us/magazine/cc163965.aspx
What would be the best approach for versioning my whole database ?
Creating a file for each database object (table,view,procedsure..) or rather having one file for all DDL scripts and any new change will be put in a separate file ?
What about handling changes made in a Database manager tool ?
I'd like to have a generic solutions for any kind of RDBMS.
Are there any other options ?
I'm a huge VCS fan in general and a big Mercurial booster, but I really think you're going down the wrong path.
VCSs aren't just about iterative changes, the "what", they're also about answering the "who", "when", and "why". For a database those answers are a lot less interesting or hard to provide to the VCS. If you're doing nightly exports and commits the "who" will always be "cron" and the "why" will always be "midnight".
The other thing modern VCSs do really well is helping you merge changes from multiple branches. That's less applicable in the database world. Very seldom do you say "I want this table structure, but this data", and if you do the text/diff merge isn't going to help you much.
The thing that does do "what" and "when" very well is an incremental backup system, and that's probably the better fit.
At work we use Tivoli and at home I use rdiff-backup and duplicity, but there are plenty of great options.
I guess my general rule of thumb is "if it was typed by hand by a human then it does into source control, and if it was generated/exported then it goes in the incremental backups"
Certainly you can make this work, but I don't think it will buy you much over the more traditional backup solutions.
Have a look at this post
If you need generic solution - put everything in the scripts (simple text files) and put under Version Control system (can be used any of VCS).
Grouping similar database objects into scripts will be depend on your requirement.
So you may for example:
Store table/indexes/ in one or several script
Each procedure store in individual script or combine small procedures into one script.
However need to remember one important thing with this approach: don't forget change scripts if you changed table/view/procedure directly in databases and don't create/recreate/compile you db objects in database after changing scripts.
SQL Source Control currently supports SVN and TFS, but Mercurial requests are increasing rapidly and we're hoping to have a story for this very soon.
We use UserVoice to measure demand so please vote accordingly if you're interesting in this: http://redgate.uservoice.com/forums/39019-sql-source-control
Can you please point to alternative data storage tools and give good reasons to use them instead of good-old relational databases? In my opinion, most applications rarely use the full power of SQL--it would be interesting to see how to build an SQL-free application.
Plain text files in a filesystem
Very simple to create and edit
Easy for users to manipulate with simple tools (i.e. text editors, grep etc)
Efficient storage of binary documents
XML or JSON files on disk
As above, but with a bit more ability to validate the structure.
Spreadsheet / CSV file
Very easy model for business users to understand
Subversion (or similar disk based version control system)
Very good support for versioning of data
Berkeley DB (Basically, a disk based hashtable)
Very simple conceptually (just un-typed key/value)
Quite fast
No administration overhead
Supports transactions I believe
Amazon's Simple DB
Much like Berkeley DB I believe, but hosted
Google's App Engine Datastore
Hosted and highly scalable
Per document key-value storage (i.e. flexible data model)
CouchDB
Document focus
Simple storage of semi-structured / document based data
Native language collections (stored in memory or serialised on disk)
Very tight language integration
Custom (hand-written) storage engine
Potentially very high performance in required uses cases
I can't claim to know anything much about them, but you might also like to look into object database systems.
Matt Sheppard's answer is great (mod up), but I would take account these factors when thinking about a spindle:
Structure : does it obviously break into pieces, or are you making tradeoffs?
Usage : how will the data be analyzed/retrieved/grokked?
Lifetime : how long is the data useful?
Size : how much data is there?
One particular advantage of CSV files over RDBMSes is that they can be easy to condense and move around to practically any other machine. We do large data transfers, and everything's simple enough we just use one big CSV file, and easy to script using tools like rsync. To reduce repetition on big CSV files, you could use something like YAML. I'm not sure I'd store anything like JSON or XML, unless you had significant relationship requirements.
As far as not-mentioned alternatives, don't discount Hadoop, which is an open source implementation of MapReduce. This should work well if you have a TON of loosely structured data that needs to be analyzed, and you want to be in a scenario where you can just add 10 more machines to handle data processing.
For example, I started trying to analyze performance that was essentially all timing numbers of different functions logged across around 20 machines. After trying to stick everything in a RDBMS, I realized that I really don't need to query the data again once I've aggregated it. And, it's only useful in it's aggregated format to me. So, I keep the log files around, compressed, and then leave the aggregated data in a DB.
Note I'm more used to thinking with "big" sizes.
The filesystem's prety handy for storing binary data, which never works amazingly well in relational databases.
Try Prevayler:
http://www.prevayler.org/wiki/
Prevayler is alternative to RDBMS. In the site have more info.
If you don't need ACID, you probably don't need the overhead of an RDBMS. So, determine whether you need that first. Most of the non-RDBMS answers provided here do not provide ACID.
Custom (hand-written) storage engine / Potentially very high performance in required uses cases
http://www.hdfgroup.org/
If you have enormous data sets, instead of rolling your own, you might use HDF, the Hierarchical Data Format.
http://en.wikipedia.org/wiki/Hierarchical_Data_Format:
HDF supports several different data models, including multidimensional arrays, raster images, and tables.
It's also hierarchical like a file system, but the data is stored in one magic binary file.
HDF5 is a suite that makes possible the management of extremely large and complex data collections.
Think petabytes of NASA/JPL remote sensing data.
G'day,
One case that I can think of is when the data you are modelling cannot be easily represented in a relational database.
Once such example is the database used by mobile phone operators to monitor and control base stations for mobile telephone networks.
I almost all of these cases, an OO DB is used, either a commercial product or a self-rolled system that allows heirarchies of objects.
I've worked on a 3G monitoring application for a large company who will remain nameless, but whose logo is a red wine stain (-: , and they used such an OO DB to keep track of all the various attributes for individual cells within the network.
Interrogation of such DBs is done using proprietary techniques that are, usually, completely free from SQL.
HTH.
cheers,
Rob
Object databases are not relational databases. They can be really handy if you just want to stuff some objects in a database. They also support versioning and modify classes for objects that already exist in the database. db4o is the first one that comes to mind.
In some cases (financial market data and process control for example) you might need to use a real-time database rather than a RDBMS. See wiki link
There was a RAD tool called JADE written a few years ago that has a built-in OODBMS. Earlier incarnations of the DB engine also supported Digitalk Smalltalk. If you want to sample application building using a non-RDBMS paradigm this might be a start.
Other OODBMS products include Objectivity, GemStone (You will need to get VisualWorks Smalltalk to run the Smalltalk version but there is also a java version). There were also some open-source research projects in this space - EXODUS and its descendent SHORE come to mind.
Sadly, the concept seemed to die a death, probably due to the lack of a clearly visible standard and relatively poor ad-hoc query capability relative to SQL-based RDMBS systems.
An OODBMS is most suitable for applications with core data structures that are best represented as a graph of interconnected nodes. I used to say that the quintessential OODBMS application was a Multi-User Dungeon (MUD) where rooms would contain players' avatars and other objects.
You can go a long way just using files stored in the file system. RDBMSs are getting better at handling blobs, but this can be a natural way to handle image data and the like, particularly if the queries are simple (enumerating and selecting individual items.)
Other things that don't fit very well in a RDBMS are hierarchical data structures and I'm guessing geospatial data and 3D models aren't that easy to work with either.
Services like Amazon S3 provide simpler storage models (key->value) that don't support SQL. Scalability is the key there.
Excel files can be useful too, particularly if users need to be able to manipulate the data in a familiar environment and building a full application to do that isn't feasible.
There are a large number of ways to store data - even "relational databse" covers a range of alternatives from a simple library of code that manipulates a local file (or files) as if it were a relational database on a single user basis, through file based systems than can handle multiple-users to a generous selection of serious "server" based systems.
We use XML files a lot - you get well structured data, nice tools for querying same the ability to do edits if appropriate, something that's human readable and you don't then have to worry about the db engine working (or the workings of the db engine). This works well for stuff that's essentially read only (in our case more often than not generated from a db elsewhere) and also for single user systems where you can just load the data in and save it out as required - but you're creating opportunities for problems if you want multi-user editing - at least of a single file.
For us that's about it - we're either going to use something that will do SQL (MS offer a set of tools that run from a .DLL to do single user stuff all the way through to enterprise server and they all speak the same SQL (with limitations at the lower end)) or we're going to use XML as a format because (for us) the verbosity is seldom an issue.
We don't currently have to manipulate binary data in our apps so that question doesn't arise.
Murph
One might want to consider the use of an LDAP server in the place of a traditional SQL database if the application data is heavily key/value oriented and hierarchical in nature.
BTree files are often much faster than relational databases. SQLite contains within it a BTree library which is in the public domain (as in genuinely 'public domain', not using the term loosely).
Frankly though, if I wanted a multi-user system I would need a lot of persuading not to use a decent server relational database.
Full-text databases, which can be queried with proximity operators such as "within 10 words of," etc.
Relational databases are an ideal business tool for many purposes - easy enough to understand and design, fast enough, adequate even when they aren't designed and optimized by a genius who could "use the full power," etc.
But some business purposes require full-text indexing, which relational engines either don't provide or tack on as an afterthought. In particular, the legal and medical fields have large swaths of unstructured text to store and wade through.
Also:
* Embedded scenarios - Where usually it is required to use something smaller then a full fledged RDBMS. Db4o is an ODB that can be easily used in such case.
* Rapid or proof-of-concept development - where you wish to focus on the business and not worry about persistence layer
CAP theorem explains it succinctly. SQL mainly provides "Strong Consistency: all clients see the same view, even in presence of updates".
K.I.S.S: Keep It Small and Simple
I would offer RDBMS :)
If you do not wont to have troubles with set up/administration go for SQLite.
Built in RDBMS with full SQL support. It even allows you to store any type of data in any column.
Main advantage against for example log file: If you have huge one, how are you going to search in it? With SQL engine you just create index and speed up operation dramatically.
About full text search: SQLite has modules for full text search too..
Just enjoy nice standard interface to your data :)
One good reason not to use a relational database would be when you have a massive data set and want to do massively parallel and distributed processing on the data. The Google web index would be a perfect example of such a case.
Hadoop also has an implementation of the Google File System called the Hadoop Distributed File System.
I would strongly recommend Lua as an alternative to SQLite-kind of data storage.
Because:
The language was designed as a data description language to begin with
The syntax is human readable (XML is not)
One can compile Lua chunks to binary, for added performance
This is the "native language collection" option of the accepted answer. If you're using C/C++ as the application level, it is perfectly reasonable to throw in the Lua engine (100kB of binary) just for the sake of reading configs/data or writing them out.