SQL and Flat Files... In harmony? - sql

I was just thinking about how fast it would be to store the actual data of an application in a flat file.
Now, you can't just go storing everything in flat files... sometimes sorts and searches are required, and walking through directories and files recursively could be a pain.
But imagine you stored all your searchable data in a database, and had a pointer field that pointed to a data file.
This would be very specific per app, but so long as all my searchable data is stored in the database, why should I store the actual data in a database too?
(Locking and data integrity aside) it would be faster, I am sure... but how much faster, and is it worth doing?
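To make the idea concrete, here is a minimal sketch of the layout I have in mind, using sqlite3 from Python's standard library (all table, column, and path names are just for illustration):

```python
import sqlite3
import uuid
from pathlib import Path

conn = sqlite3.connect("app.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,   -- searchable fields live in the database
        author    TEXT,
        data_path TEXT NOT NULL    -- pointer to the flat file with the bulk data
    )
""")

def add_document(title, author, body):
    # Write the bulk payload to a flat file...
    path = Path("data") / f"{uuid.uuid4().hex}.txt"
    path.parent.mkdir(exist_ok=True)
    path.write_text(body)
    # ...and store only the searchable fields plus the pointer in the database.
    conn.execute(
        "INSERT INTO documents (title, author, data_path) VALUES (?, ?, ?)",
        (title, author, str(path)),
    )
    conn.commit()

def find_by_author(author):
    # The search runs in SQL; payloads are read from disk only on demand.
    rows = conn.execute(
        "SELECT title, data_path FROM documents WHERE author = ?", (author,)
    ).fetchall()
    return [(title, Path(p).read_text()) for title, p in rows]
```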

Well, you often want to do things in queries beyond searching the data. For instance, you might not search on a field called cost_center, but you might have a CASE statement that processes things differently depending on the information in that field. Or you might need to concatenate information together. You might update one field based on the information in another field. And you might not search on a field today but need to search on it tomorrow.
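To make that concrete, all three of those operations are one-liners in SQL but would mean scanning and parsing every flat file by hand. A runnable sketch with sqlite3 (the tables and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, cost_center TEXT, amount REAL, priority TEXT)")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT)")

# A CASE expression that treats rows differently depending on cost_center:
conn.execute("""
    SELECT order_id,
           CASE cost_center
               WHEN 'R&D'   THEN amount * 1.10
               WHEN 'SALES' THEN amount * 0.95
               ELSE amount
           END AS adjusted_amount
    FROM orders
""")

# Concatenating fields together:
conn.execute("SELECT first_name || ' ' || last_name AS full_name FROM people")

# Updating one field based on the information in another field:
conn.execute("UPDATE orders SET priority = 'high' WHERE amount > 10000")
```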
A properly designed relational database can easily perform well with terabytes of data.
And frankly you should never even consider "data integrity aside". If you don't have data integrity you don't have data.
As to whether what you want is a good idea, it depends on the type of data you are storing and the types of things you intend to do with it. There isn't enough information to say for sure.

Well "Locking, Data integrity aside" should mean a faster system. If you drop constraints you should improve performance.
But in practical terms, I don't think it's going to be faster. There's lot of development time behind RDBMSs and that's why they are quick. Sure, non-relational databases are performing better than them in highly parallel situations and scenarios which take advantage of their qualities, for instance. However, your idea does not offer an improvement such as exploiting parallelism... any performance advantage would come from dropping the qualities of RDBMSs...

As well as the other answers, consider:
Sharing of data: how are multiple clients going to access the data on a share?
Backup/restore: keeping the text files and the "searchable" database in sync
Security/permissions on the text data
Change anomalies

There is no need to implement an SQL database just to perform searches. Lots of applications store their data in XML, and you can search it in many ways, e.g., using Lucene. How fast it is depends entirely on the quantity of data and how you structure it - just like a database.
It can perform very fast, but it can complicate things when you want to run more than one app server.
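A toy example of that approach, searching XML entirely in memory with Python's standard library (a real site with lots of data would hand this off to an index like Lucene; the document contents here are invented):

```python
import xml.etree.ElementTree as ET

# A stand-in for application data stored as XML instead of in a database.
doc = ET.fromstring("""
    <catalog>
        <book author="Codd">The Relational Model</book>
        <book author="Date">An Introduction to Database Systems</book>
    </catalog>
""")

# Searching is just filtering elements in memory - no SQL involved.
matches = [b.text for b in doc.iter("book") if b.get("author") == "Date"]
print(matches)  # ['An Introduction to Database Systems']
```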

BTrieve was essentially what you describe. Back in the DOS days it was a very fast database.

Related

SQL database structure for housing historical data and displaying changes

Good morning,
This is more of a concept question than anything.
I am looking to design a database and interface that will track changes to the entries (in this case people) and display those changes readily.
(user experience would look something like this)
for user A
Date     Category            Activity
8/8/14   change position     position 1 -> position 2
8/9/14   change department   department a -> department b
...
The visual experience seems like it would benefit from an E-A-V design; however, I am designing the database to be easy to data mine, and from my reading I think that E-A-V is not the right way to go.
Does it make sense to duplicate data just to display it?
If not, does anyone have a suggestion of how to query the history table and display it? (Currently using jQuery and PHP to leverage the DB... I suppose I could do something interesting from a coding perspective to get it done.)
Thank you for your help,
Travis
Creating an efficient operational database environment and creating an 'easy-to-data-mine' environment are two separate (and often opposing) goals.
Others might disagree with me, but in my opinion it is best to create your database based on operational readiness (this means using your E-A-V design as mentioned above) and then worry about data transformation later. This may make it inconvenient later to transform the data to allow for easy mining, but it accomplishes an incredibly important goal: eliminating the possibility of data error.
Once you have a good system in place where you can collect data appropriately, then you can create a warehouse or datamart environment to more conveniently extract that data.
This may sound like a lot of work but from a data integrity perspective, it is much safer than trying to create some system that is designed entirely for reporting. That's my personal opinion at least.
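Either way, note that displaying the history does not require duplicating any data: a plain history table can build the "a -> b" strings on the fly. A runnable sketch with sqlite3 (all names invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_history (
        user_id    INTEGER NOT NULL,
        changed_on TEXT NOT NULL,   -- date of the change
        category   TEXT NOT NULL,   -- e.g. 'change position'
        old_value  TEXT,
        new_value  TEXT
    )
""")
conn.executemany(
    "INSERT INTO user_history VALUES (?, ?, ?, ?, ?)",
    [(1, "2014-08-08", "change position", "position 1", "position 2"),
     (1, "2014-08-09", "change department", "department a", "department b")],
)

# One row per change; the 'a -> b' display string is built in the query,
# so nothing is stored twice just for display purposes.
for row in conn.execute("""
        SELECT changed_on, category, old_value || ' -> ' || new_value
        FROM user_history
        WHERE user_id = ?
        ORDER BY changed_on
    """, (1,)):
    print(row)
```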
(sorry cannot comment yet)
You have to analyse the data you need to persist.
If you have only a couple of tables, with no relationships between them, you probably don't need a database.
In that case the database solution will probably be slower (connection/transmission/security overhead...).
Well, if it's a few MBs of data, I would keep everything in one table.
You can easily load the whole data set in memory and do what you need to do.
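Something like this (a toy sketch; the table and its contents are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO settings VALUES ('theme', 'dark')")

# For a few MBs, pull the entire table into a Python structure once...
lookup = dict(conn.execute("SELECT key, value FROM settings"))
# ...and serve every read from memory after that.
print(lookup["theme"])  # dark
```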

Transferring data into a reporting database

We have a relational database with data. Currently the queries take too long for reporting, so we would like to use that data in a reporting database.
I want to analyze the queries that we use, run them at night, and put the denormalized results into read-only tables. I do not believe that there are enough queries to necessitate a data mart.
My boss wants to write everything into xml packages and then whatever database we design, we will read in the xml files.
Like every person on earth, I think that my way is right. But, I have been wrong before. Can I get some advice and the pluses and minuses of each method?
This question is mostly opinion based, but reading from XML is pretty slow and very memory intensive, so I believe that creating a reporting database would be better. Of course there are downsides, like the denormalized database occupying more space, but I think that with a good design it may work fairly well.
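The nightly load itself can be as simple as a CREATE TABLE ... AS SELECT. A runnable sketch with sqlite3 (the schema is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-ins for the normalized operational tables:
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, product_name TEXT, unit_price REAL);
    CREATE TABLE orders    (order_id INTEGER PRIMARY KEY, order_date TEXT,
                            customer_id INTEGER, product_id INTEGER, quantity INTEGER);

    -- The nightly step: rebuild a read-only, denormalized reporting table,
    -- paying for the joins once instead of on every report query.
    DROP TABLE IF EXISTS report_orders;
    CREATE TABLE report_orders AS
    SELECT o.order_id,
           o.order_date,
           c.customer_name,
           p.product_name,
           o.quantity * p.unit_price AS line_total
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN products  p ON p.product_id  = o.product_id;
""")
```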

Database design - Sharing data between two databases?

I am thinking about and exploring options for designing the database for my new application. In general, I will have registered users and info about them. They will be able to do some things in the app, and that data will be in the same DB as the user data (so I can have shared FKs and such).
But then I plan to have a second database that will be logically totally independent of the first database, except that it will share userID as an FK.
I don't know whether I should even put that second set of logic in an extra DB or have everything in the same database. I plan to have a subdomain in my app for the second set of logic (it is like an app within an app), but what if I discover they should share more data? Will that cross-querying hurt my performance? And is that the way to go, actually - is there a real reason to separate databases?
As soon as you have two databases you have potential complexity. You have not given any particular reason why you need two databases. So keep it simple until you have a reason.
An example of what folks do: have a "current" database, small, holding just the data needed right now. That might be where orders are taken and fulfilled. Once the data is no longer current, say some days or weeks after the order is filled, move it to a "historic" database. There, marketing and management folks can look at overall trends in the history without affecting performance of the "current" database, whose performance might be critical to keeping your customers happy.
As an example of complexity: any time you have two databases you need to consider consistency between them, and this is much harder to ensure than it might appear. Databases do offer two-phase commit capabilities, or you can devise batch processes, but there are always subtleties that are hard to catch.
I would just keep it all in one database. Unless you have dozens of tables there should be no real performance problems, imho. It will, however, make your life much easier, only having to work with one database connection and not having to worry about merging information from two queries.
I also agree that unless the volume of your data is going to be huge (judging by the question, that doesn't seem to be the case here), you can use a single database to store your data without performance issues.
For "visual" separation of the data structure, you can always create the tables in two schemas of a single database.

Which would be better? Storing/accessing data in a local text file, or in a database?

Basically, I'm still working on a puzzle-related website (micro-site really), and I'm making a tool that lets you input a word pattern (e.g. "r??n") and get all the matching words (in this case: rain, rein, ruin, etc.). Should I store the words in local text files (such as words5.txt, which would have a return-delimited list of 5-letter words), or in a database (such as the table Words5, which would again store 5-letter words)?
I'm looking at the problem in terms of data retrieval speeds and CPU server load. I could definitely try it both ways and record the times taken for several runs with both methods, but I'd rather hear it from people who might have had experience with this.
Which method is generally better overall?
The database will give you the best performance with the least amount of work. The built-in index support and query analyzers will give you good performance for free, while a text file might give you excellent performance for a ton of work.
In the short term, I'd recommend creating a generic interface that hides the difference between a database and a flat file. Later on, you can benchmark which one provides the best performance, but I think the database will give you the best bang per hour of development.
For fast retrieval you certainly want some kind of index. If you don't want to write index code yourself, it's certainly easiest to use a database.
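It helps that the '?' pattern maps directly onto SQL's LIKE wildcard '_', so the database version needs almost no code. A runnable sketch with sqlite3 (the table name and contents are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [("rain",), ("rein",), ("ruin",), ("rant",)])

def match(pattern):
    # 'r??n' -> 'r__n': LIKE's '_' wildcard matches exactly one character.
    like = pattern.replace("?", "_")
    return [w for (w,) in conn.execute(
        "SELECT word FROM words WHERE word LIKE ? ORDER BY word", (like,))]

print(match("r??n"))  # ['rain', 'rein', 'ruin']
```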
If you are using Java or .NET for your app, consider looking into db4o. It just stores any object as is with a single line of code and there are no setup costs for creating tables.
Storing data in a local text file (when you add new records to the end of the file) is always faster than storing it in a database. So if you are building a high-load application, you can save the data to a text file and copy it to a database later. However, in most applications you should use a database instead of a text file, because the database approach has many benefits.

MySQL design question - which is better, long tables or multiple databases?

So I have an interesting problem that's been the fruit of lots of good discussion in my group at work.
We have some scientific software producing SQLite files, and this software is basically a black box. We don't control its table designs, formats, etc. It's entirely conceivable that this black box's output could change, and our design needs to be able to handle that.
The SQLite files are entire databases which our users would like to query across. There are two ways (we see) of implementing this: one, create a single database and a backend in Python that appends tables from each database to the master database; two, query across the separate databases' tables and unify the results in Python.
Both methods run into trouble when the black box alters its table structures, say for example renaming a column, splitting up a table, etc. We have to take this into account, and we've discussed translation tables that translate queries of columns from one table format to another.
We're interested in ease of implementation, how well the design handles a change in database/table layout, and speed. Also, a last dimension is how well it would work with existing Python web frameworks (Django doesn't support cross-database queries, and neither does SQLAlchemy, so we know we are in for a lot of programming).
If you find yourself querying across databases, you should look into consolidating. Cross-database queries are evil.
If your queries are essentially relegated to individual databases, then you may want to stick with multiple databases, as clearly their separation is necessary.
You cannot accommodate arbitrary changes in a database's schema without categorizing and anticipating that change in some way. In the very best case with nontrivial changes, you can sometimes simply ignore new data or tables; in the worst case, your interpretation of the data will entirely break down.
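For what it's worth, SQLite can also split the difference: one connection can ATTACH several database files and query across them in a single statement, so the unifying happens in SQL rather than in Python. A sketch (the file and table names are invented):

```python
import sqlite3

# One connection can see several SQLite files at once via ATTACH.
conn = sqlite3.connect("run1.db")
conn.execute("ATTACH DATABASE 'run2.db' AS run2")

# A single SQL statement spanning tables from both files, so the
# unification happens in the database rather than in Python:
rows = conn.execute("""
    SELECT sample_id, value FROM measurements
    UNION ALL
    SELECT sample_id, value FROM run2.measurements
""").fetchall()
```

Note that SQLite caps the number of simultaneously attached databases (10 by default), so this only stretches so far.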
I've encountered similar issues where users need data pivoted out of a normalized schema. The schema does NOT change. However, their required output format requires a fixed number of hierarchical levels. So although the database design accommodates all the changes they want to make, their chosen view of that data cannot be maintained in the face of those changes: it is impossible to maintain the output schema in the face of data change (not even schema change). This is not to say that it's not a valid output or input schema, but that there are limits beyond which their chosen schema cannot be used. At that point, they have to revise the output contract; the pivoting program (which CAN anticipate this and generate new columns) then has a place to put the data in the output schema.
My point being: the semantics and interpretation of new columns and new tables (or removal of columns and tables which existing logic may depend on) is nontrivial unless new columns or tables can be anticipated in some way. However, in these cases, there are usually good database designs which eliminate those strategies in the first place:
For instance, a particular database schema can contain any number of tables, all with the same structure (although there is no theoretical reason they could not be consolidated into a single table). A particular kind of table could have a set of columns all similarly named (although this "array" violates normalization principles and could be normalized into a common key/code/value schema).
Even in a data warehouse ETL situation, a new column has to be classified as either a fact or a dimensional attribute, and if it is a dimensional attribute, it has to be decided which dimension table it is best assigned to. This could somewhat be automated for facts (obvious candidates would be scalars like decimal/numeric) by inspecting the metadata for unmapped columns, altering the DW table (yikes) and then loading appropriately. But for dimensions, I would be very leery of automating something like this.
So, in summary, I would say that schema changes in a good normalized database design are the least likely to be able to be accommodated because: 1) the database design already anticipates and accommodates a good deal of change and flexibility and 2) schema changes to such a database design are unlikely to be able to be anticipated very easily. Conversely, schema changes in a poorly normalized database design are actually more easy to anticipate as shortcomings in the database design are more visible.
So, my question to you is: How well-designed is the database you are working from?
You say that you know that you are in for a lot of programming...
I'm not sure about that. I would go for a quick and dirty solution, not a 'generic' solution, because generic solutions like the entity-attribute-value model often have bad performance. Don't do client-side joining (unifying the results) inside your Python code, because that is very slow; use SQL for joining, it is designed for that purpose. Users can also make their own reports with all kinds of reporting tools that generate SQL statements. You don't have to do everything in your app; just start with solving 80% of the problems, not 100%.
If something breaks because something inside the black box changes, you can define views for backward compatibility that keep your app functioning.
Maybe the scientific software will add a lot of new features, and maybe it will change its data model because of those new features. That is possible, but then you will have to change your application anyway to take advantage of those new features.
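For example, if the black box renames a column, a view can re-expose the old name so existing queries keep working. A sketch with sqlite3 (names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Suppose a new release of the black box renamed 'conc' to 'concentration':
conn.execute("CREATE TABLE measurements (sample_id INTEGER, concentration REAL)")

# A view re-exposes the old column name, so existing queries keep working:
conn.execute("""
    CREATE VIEW measurements_compat AS
    SELECT sample_id, concentration AS conc
    FROM measurements
""")
print(conn.execute("SELECT conc FROM measurements_compat").fetchall())
```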
It sounds to me as if your problem isn't really about MySQL or SQLite. It's about the sharing of data, and the contract that needs to exist between the supplier of the data and the user of that same data.
To the extent that databases exist so that data can be shared, that contract is fundamental to everything about databases. When databases were first being built, and database theory was first being solidified, in the 1960s and 1970s, the sharing of data was the central purpose in building databases. Today, databases are frequently used where files would have served equally well. Your situation may be a case in point.
In your situation, you have a beggar's contract with your data suppliers. They can change the format of the data, and maybe even the semantics, and all you can do is suck it up and deal with it. This situation is by no means uncommon.
I don't know the specifics of your situation, so what follows could be way off target.
If it were up to me, I would want to build a database that was as generic, as flexible, and as stable as possible, without losing the essential features of structured and managed data. Maybe some design like a star schema would make sense, but I might adopt a very different design if I were actually in your shoes.
This leaves the problem of extracting the data from the databases you are given, transforming the data into the stable format the central database supports, and loading it into the central database. You are right in guessing that this involves a lot of programming. This process, known as "ETL" in data warehousing texts, is not the simplest of programming challenges.
At least ETL collects all the hard problems in one place. Once you have the data loaded into a database that's built for your needs, and not for the needs of your suppliers, turning the data into valuable information should be relatively easy, at least at the programming or SQL level. There are even OLAP tools that make using the data as simple as a video game. There are challenges at that level, but they aren't the same kind of challenges I'm talking about here.
Read up on data warehousing, and especially data marts. The description may seem daunting to you at first, but it can be scaled down to meet your needs.