Distributed Database Solution? [closed] - sql

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
Hey. I am going to be setting up a database which could get really, really huge.
I've been using standard MySQL for most of my stuff, but this particular problem will get up to the TBs and I will want to be able to do hundreds of queries a second.
So aside from designing my database schema such that it's not going to chug, and fast hard drive speeds, what is my biggest bottleneck and what sort of solution is recommended for this?
Does it make sense to spread the database over multiple computers on my intranet so it can scale with CPU/RAM etc., and if so is there software or a database solution for this?
Thanks for any help!
I did a search for questions related to this and couldn't find anything, so sorry if it has already been asked.

Database scalability is a VERY complicated issue; there are a LOT of issues that come into the whole process.
First, consider the lowest-hanging fruit; do you have individual tables (or columns) that are going to be containing the bulk of your data? Columns which will contain BLOBs which are > 4MB each? Those can be extracted from the database and stored on a flat-file storage system, and merely referred to from the database; right there, that can take many unwieldy solutions down to a manageable level.
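As a rough illustration of keeping large BLOBs outside the database, here is a minimal sketch in Python using the standard-library sqlite3 module. The table name, the content-hash file naming and the helper names are all assumptions made for the example, not something from the original answer.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical layout: blobs live in a directory, the database keeps only the path.
blob_dir = Path(tempfile.mkdtemp())
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT, blob_path TEXT)")

def store_document(name: str, payload: bytes) -> int:
    """Write the payload to a flat file and record only its path in the table."""
    digest = hashlib.sha256(payload).hexdigest()
    path = blob_dir / digest  # content-addressed file name (an assumption)
    path.write_bytes(payload)
    cur = db.execute(
        "INSERT INTO documents (name, blob_path) VALUES (?, ?)", (name, str(path))
    )
    return cur.lastrowid

def load_document(doc_id: int) -> bytes:
    """Look up the path in the database, then read the bytes from disk."""
    (path,) = db.execute(
        "SELECT blob_path FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return Path(path).read_bytes()

doc_id = store_document("scan.bin", b"\x00" * 1024)
```

The rows stay small, so the table itself remains cheap to scan, back up and replicate; the trade-off is that the files are no longer covered by the database's transactions.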
If not, do you have deeply different usage patterns for different subgroupings of tables? If so, there's an opportunity right there for segmenting your database into different functional databases which can be partitioned onto different servers. A good example of this is read-mostly data, such as on webservers, which gets generated rarely (think user-specific home page data) but read frequently; that type of data can be segregated into a database (or, again, flat files with references) that's separate from the rest of the user data.
Consider the transactional requirements of your database; can you isolate your transaction boundaries cleanly, or will there be deeply mingled transactions going on all through your database? If you can isolate your transaction boundaries, there's another potential useful boundary.
This is just touching on some of the issues involved with this sort of thing. One thing worth considering is whether or not you really need to have a database that is actually going to be huge, or if you're just trying to use the database as a persistence layer. If you're using the database just as a persistence layer, you might reconsider whether you actually need the relational nature of a database at all, or if you can get away with a smaller relational overlay on top of a simpler persistence layer. (I say this because a large quantity of solutions seem like they could get away with a thin relational layer over a large persistence layer; it's worth considering.)

Ok, first I need to point you to here. I don't think MySQL is going to perform like you want. I have a bad feeling that when I say you need to look into an Oracle installation, you're going to say, "We don't have the cash for that." But when I say get the latest/greatest SQL Server, you're going to say, "We don't have the hardware it'll take to implement that." I'm afraid that terabytes is just flat out going to crush your MySQL installation.

A new breed of NewSQL databases is being built to solve exactly the problem of distributing resources over multiple servers. The Clustrix database (which was built from the ground up to be a MySQL replacement) is one example that provides near-linear scale -- as you run out of CPU/memory, you can simply add nodes.

Database scalability is a tough problem and you should consider solutions that can address it for you. I believe that MySQL can be used as the foundation for a solution to your problem.
Horizontal scalability, the ability to scale a database horizontally (aka scale-out), is a good technique for addressing the problem of very large tables and databases.
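To make scale-out a little more concrete, here is a minimal sketch of hash-based sharding in Python. Four in-memory SQLite databases stand in for four servers; the users table and the function names are hypothetical, and a real deployment would use a sharding layer or a clustered product rather than hand-rolled routing.

```python
import sqlite3
import zlib

# Four in-memory SQLite databases stand in for four separate servers.
SHARDS = [sqlite3.connect(":memory:") for _ in range(4)]
for shard in SHARDS:
    shard.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def shard_for(user_id: int) -> sqlite3.Connection:
    """Pick a shard with a stable hash so the same key always lands on the same node."""
    return SHARDS[zlib.crc32(str(user_id).encode()) % len(SHARDS)]

def insert_user(user_id: int, name: str) -> None:
    shard_for(user_id).execute("INSERT INTO users VALUES (?, ?)", (user_id, name))

def get_user(user_id: int):
    row = shard_for(user_id).execute(
        "SELECT name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None

for uid in range(100):
    insert_user(uid, f"user{uid}")
```

Each lookup touches exactly one shard. Note that with plain modulo routing, changing the number of shards moves most keys, which is one reason real systems tend to use consistent hashing or a lookup directory instead.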

Related

How to separate writes from reads to minimize effect of heavy read queries? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 7 years ago.
I have write-heavy tables in my database. There is a need to run read-only queries by someone else. I have no idea about complexity and volume of their queries but I do know when they start doing it, writes become superslow. So separating writes from reads seems the way to go.
Is replication the answer? What else might I try?
As with anything related to performance: "It depends".
In general you are probably overthinking this, because generally speaking the isolation level will take care of that kind of problem for you. You can hit the books to see how it works. In general it's not wise to meddle with it if you don't know exactly what you are doing.
If you do end up having to deal with it, you can:
1) Replicate (but you need to delve into the details).
Advantage: simplicity. Disadvantages: wasted server disk and CPU.
2) Create staging tables.
This is a simple solution, suitable when you get a lot of heavy writes on heavily read tables. Example: you have a web service where users sometimes upload large CSV files, and that data is persisted in staging tables. Those simple, unindexed tables act as a buffer (or queue) for the raw data. Later, in a "window of opportunity", the data is inserted into the real tables. The disadvantage is that uploaded data is not immediately available to query. The advantages are that it is easy to handle badly formatted data and to let only sanitized data into your DB. It is also very easy to implement: you can create a SQL service to do it after or before the daily full backup, for example.
3) Fine-tune the isolation level query by query. Advantage: if you really know what you are doing, the system will shine. Disadvantages: it is hard to get the tweaks right, and it is prone to dragging your system into a hell of deadlocks, phantom and dirty reads, and lost data. It also demands a lot of time to implement and maintain properly (you must keep an eye on the tuned queries to be sure).
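The staging-table idea from option 2 can be sketched roughly as follows. This uses Python's sqlite3 module purely for illustration (the original discussion is about SQL Server), and the table names and validation rule are made up for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE staging_readings (sensor TEXT, value TEXT);  -- no indexes, cheap appends
    CREATE TABLE readings (sensor TEXT NOT NULL, value REAL NOT NULL);
    CREATE INDEX idx_readings_sensor ON readings (sensor);
""")

def upload(rows) -> None:
    """Fast path: dump raw, unvalidated rows into the staging table."""
    db.executemany("INSERT INTO staging_readings VALUES (?, ?)", rows)

def _is_number(text) -> bool:
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False

def flush_staging() -> int:
    """Run in a quiet window: move only well-formed rows, then empty the buffer."""
    raw = db.execute("SELECT sensor, value FROM staging_readings").fetchall()
    clean = [(s, float(v)) for s, v in raw if s and _is_number(v)]
    with db:  # the move and the cleanup are committed together
        db.executemany("INSERT INTO readings VALUES (?, ?)", clean)
        db.execute("DELETE FROM staging_readings")
    return len(clean)

upload([("a", "1.5"), ("b", "oops"), ("", "2.0"), ("c", "3")])
moved = flush_staging()
```

Writers only ever touch the unindexed staging table, so their cost stays flat; readers only see rows that passed validation, at the price of a delay until the next flush.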
EDIT about the WITH(NOLOCK) comment: Seriously, guys? It has been discouraged since SQL Server 2000! It's the silver bullet for the lazy and doesn't work well. Consider the scenario where you make a dirty read, process some data and persist more data related to the dirty row. Now a rollback undoes the dirty row and you have an orphan row or, worse, data integrity hell. Don't use it anymore unless you are still working with SQL Server 7. Study isolation levels to see how bad and useless NOLOCK has become (in the last 15 years!).
For me the correct answer is replication. You can have snapshot replication, with one set of indexes in your insert database and another in your read database: one focused on fast inserts and the other on fast searches.

NoSQL vs. SQL when scalability is irrelevant

Recently I have read a lot about different NoSQL databases and how they are being effectively deployed by some major websites out there. I'm starting a project in which I think the schema-free nature of a database such as MongoDB would be tremendously useful. Everything I have read though seems to indicate that the main advantage of a NoSQL database is scalability. Is choosing a NoSQL database for the schema-free design just as legitimate a design decision as that of scalability?
Yes, sometimes an RDBMS is not the best solution. Although there are ways to accommodate user-defined fields (see the XML data type, the EAV design pattern, or just spare generic columns), sometimes a schema-free database is a good choice.
However, you need to nail down your requirements before choosing to go with a document database, as you will lose a lot of the power you may be used to with the relational model,
e.g.:
If you would otherwise have multiple tables in your RDBMS database, you will need to research the features MongoDB affords you to accommodate these needs.
If you will need to query the data in specific ways, again you need to research what MongoDB offers you.
I wouldn't think of NoSQL as a replacement for RDBMS, but rather as a slightly different tool that brings its own set of advantages and disadvantages, making it more suitable for some projects than others.
(Both databases may be used in some circumstances. Also if you decide to go down the route of possibly using MongoDB, once you have researched the websites out there and have more specific questions, you can visit Freenode IRC #mongodb channel)
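For readers who have not met the EAV (entity-attribute-value) pattern mentioned above, a minimal sketch of it in a relational database might look like this. It uses Python's sqlite3 module; the products and product_attrs tables and the helper names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- One row per (entity, attribute) pair instead of one column per attribute.
    CREATE TABLE product_attrs (
        product_id INTEGER NOT NULL REFERENCES products(id),
        attr TEXT NOT NULL,
        value TEXT,
        PRIMARY KEY (product_id, attr)
    );
""")

def set_attr(product_id: int, attr: str, value: str) -> None:
    db.execute(
        "INSERT OR REPLACE INTO product_attrs VALUES (?, ?, ?)",
        (product_id, attr, value),
    )

def get_attrs(product_id: int) -> dict:
    rows = db.execute(
        "SELECT attr, value FROM product_attrs WHERE product_id = ?", (product_id,)
    )
    return dict(rows.fetchall())

db.execute("INSERT INTO products VALUES (1, 'widget')")
set_attr(1, "color", "blue")
set_attr(1, "weight_kg", "0.4")
set_attr(1, "color", "red")  # overwrites the earlier value
```

New attributes need no schema change, but every value is stored as text and the database can no longer enforce per-attribute types or constraints, which is part of the trade-off the answer describes.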
There are a lot of other conditions that I've been hearing about with non-relational systems vs. relational. (I prefer this terminology over SQL/NoSQL, as I personally think it describes the differences better, and several of the "NoSQL" servers have SQL add-ons.) For one, what sort of concurrency pattern or transaction isolation is required in your system? One of the purported differences between relational and non-relational DBs is whether they are "consistent-always", "consistent-mostly" or "consistent-eventually". Relational DBs by default usually fall into the "consistent-mostly" category and, with some work and a whole lot of locking and race conditions ;), can be "consistent-always", so everyone is always looking at the most correct representation of a given piece of data. Most of what I've read/heard about non-relational DBs is that they are mainly "consistent-eventually". This means that there may be many instances of our data floating around, so user "A" may see that we have 92 widgets in inventory, whereas user "B" may see 79, and they may not get reconciled until someone actually goes to pull stuff from the warehouse. Another issue is mutability of data: how often does it need to be updated? The particular non-relational DBs I've been exposed to have more overhead for updates, some of them having to regenerate the entire dataset to incorporate any updates.
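The widget example can be reproduced with a toy simulation of eventual consistency. This is only a sketch in plain Python: the Replica class and the shared log are invented for illustration and do not model any particular product.

```python
class Replica:
    """A copy of the data that applies updates from a shared log at its own pace."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # how many log entries this replica has applied

    def catch_up(self, log) -> None:
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)

log = []  # ordered list of (key, value) writes
replica_a, replica_b = Replica(), Replica()

log.append(("widgets", 92))
replica_a.catch_up(log)
replica_b.catch_up(log)

log.append(("widgets", 79))  # a new write arrives
replica_b.catch_up(log)      # B has seen it, A has not yet

stale = replica_a.data["widgets"]  # user "A" still sees the old count
fresh = replica_b.data["widgets"]  # user "B" sees the new count

replica_a.catch_up(log)  # once A catches up, the replicas agree again
```

Between the second write and A's catch-up, two users reading from different replicas get different answers, which is exactly the window an application has to tolerate under this model.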
Now mind, I think non-rel/nosql are great tools if they really match your use case. I've got several I'm looking into now for projects I've got. But you've got to look at all the trade offs when making the decision, otherwise it just turns into more resume driven development.
I don't think you should choose a NoSQL datastore for its schema-free design. Schema-free design has always existed in RDBMSs via XML, and some databases have good XML support. It is a lot easier to deal with a database than with a NoSQL datastore. Scalability and big data should be the primary drivers for choosing a NoSQL datastore; otherwise, giving up ACID and SQL is a lot to trade away for the switch to NoSQL.
The most important thing to notice when distinguishing between NoSQL and SQL is this:
NoSQL is useful when the database scales hugely, as in a social network.
For example:
Stack Overflow: each question has multiple answers, and an answer without a question is unimaginable, so NoSQL can ensure that each question includes its answers.
As a result, when we need the answers to a question we can fetch them all without joining, because joins are among the most expensive queries in a relational database.
Thanks a lot.
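To show the difference being described, here is a small side-by-side sketch: a question stored as one document with its answers embedded, and the same data in two relational tables that need a join. A plain Python dict stands in for a document store, and the sample titles, authors and table names are invented.

```python
import sqlite3

# Document style: the answers are stored inside the question itself.
question_doc = {
    "id": 1,
    "title": "Distributed database solution?",
    "answers": [
        {"author": "alice", "body": "Consider sharding."},
        {"author": "bob", "body": "Look at replication."},
    ],
}
doc_answers = [a["body"] for a in question_doc["answers"]]  # one lookup, no join

# Relational style: two tables, reassembled with a join at query time.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE answers (
        id INTEGER PRIMARY KEY,
        question_id INTEGER NOT NULL REFERENCES questions(id),
        author TEXT,
        body TEXT
    );
    INSERT INTO questions VALUES (1, 'Distributed database solution?');
    INSERT INTO answers VALUES (1, 1, 'alice', 'Consider sharding.');
    INSERT INTO answers VALUES (2, 1, 'bob', 'Look at replication.');
""")
sql_answers = [
    body
    for (body,) in db.execute(
        "SELECT a.body FROM questions q JOIN answers a ON a.question_id = q.id "
        "WHERE q.id = ? ORDER BY a.id",
        (1,),
    )
]
```

Both forms return the same answers. The embedded form is cheaper to read as a unit, while the relational form makes it easier to query answers on their own, for example all answers by one author.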
What raises this issue is having a large server farm where you need to manage the distribution of your data and load balancing, which is harder to implement using an RDBMS and requires strong IT skills to design, plan and deploy (and performance is still lower).
But if you have only 3 or 4 servers and a small project, I don't think you have an issue. NoSQL databases are usually considered for large server farms, not for a small number of servers.

database for scientific calculation [closed]

I need to include a database in a C# project with a lot of data.
This database should be free even in a commercial use.
What database should I use?
[EDIT]
I would like to know which databases are to be avoided.
When I say "a lot", it's for scientific calculation, so it will be a huge amount of data.
I've used a PostgreSQL database for my own "semi-scientific" proof-of-concept project and stored 50GB+ of data. My experience is positive. You should be careful about the partitioning scheme and indexing. It is free and supported by a large online community.
I think this question is still kind of vague. When you need to choose a database system, you may also need to consider some important factors, other than the input data size and freeware (budget).
1) You mentioned the purpose is scientific computation, do you need to support some complex and ad-hoc analytics/operations? For example, do you need to support time travel or multi-dimensional histogram generation? If so, you'd better choose a database specifically designed for scientific computing rather than a more general-purpose one, and SciDB/MonetDB/RasDaMan may be a good option.
2) You mentioned the data size, but what's the data type? Is it relational data (e.g., CSV), array-based data (HDF5/NetCDF), or spatial data? One size does not fit all, and there are different types of databases specifically designed for different types of input data: relational, array, spatial and so on. Note that before you use any database, you have to load your data into it, and the data loading may be very painful if there is a mismatch between your input data type and the database type.
3) Performance may be a very crucial factor in your case, and you may also need to consider scalability if distributed computing is in your plan. For example, since scientific data is usually read-only or append-only, do you really need to guarantee ACID properties during all query execution? You may consider sacrificing ACID for more performance. If so, SciDB may be much better than SQL Server.
Well, you can actually use almost anything: PostgreSQL, SQLite, and even Microsoft's SQL Server Compact or SQL Server.
It also depends on what "a lot of data" is.
http://www.fakenamegenerator.com/ might help you out. Depends on the type of data you need.

How to write a simple database engine [closed]

The community reviewed whether to reopen this question 12 months ago and left it closed:
Original close reason(s) were not resolved
I am interested in learning how a database engine works (i.e. the internals of it). I know most of the basic data structures taught in CS (trees, hash tables, lists, etc.) as well as a pretty good understanding of compiler theory (and have implemented a very simple interpreter) but I don't understand how to go about writing a database engine. I have searched for tutorials on the subject and I couldn't find any, so I am hoping someone else can point me in the right direction. Basically, I would like information on the following:
How the data is stored internally (i.e. how tables are represented, etc.)
How the engine finds data that it needs (e.g. run a SELECT query)
How data is inserted in a way that is fast and efficient
And any other topics that may be relevant to this. It doesn't have to be an on-disk database - even an in-memory database is fine (if it is easier) because I just want to learn the principles behind it.
Many thanks for your help.
If you're good at reading code, studying SQLite will teach you a whole boatload about database design. It's small, so it's easier to wrap your head around. But it's also professionally written.
SQLite 2.5.0 for Code Reading
http://sqlite.org/
The answer to this question is a huge one. Expect a PhD thesis to have it answered 100% ;)
But we can think of the problems one by one:
How to store the data internally:
You should have a data file containing your database objects, and a caching mechanism to load the data in focus and some data around it into RAM.
Assume you have a table with some data. We would create a data format to convert this table into a binary file by agreeing on the definition of a column delimiter and a row delimiter, and making sure that the delimiter pattern is never used in your data itself. I.e., if you have selected <*> to separate columns, you should validate that the data you are placing in this table does not contain this pattern. You could also use a row header and a column header, specifying the size of the row and some internal indexing number to speed up your search, and putting the length of each column at its start.
For example, the row "Adam", 1, 11.1, "123 ABC Street POBox 456"
could be stored as
<&RowHeader, 1><&Col1,CHR, 4>Adam<&Col2, num,1,0>1<&Col3, Num,2,1>111<&Col4, CHR, 24>123 ABC Street POBox 456<&RowTrailer>
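The header-with-length idea can be sketched in binary form as below. This is only one possible encoding, assuming a one-byte type tag and a four-byte length in front of each field; the tags and function names are invented and differ from the textual headers in the example above.

```python
import struct

# Hypothetical type tags; the example above uses CHR and num headers instead.
TYPE_STR, TYPE_INT, TYPE_FLOAT = b"S", b"I", b"F"

def encode_row(values) -> bytes:
    """Prefix every field with a type tag and its byte length, so no delimiter is needed."""
    out = bytearray()
    for v in values:
        if isinstance(v, str):
            tag, payload = TYPE_STR, v.encode("utf-8")
        elif isinstance(v, int):
            tag, payload = TYPE_INT, struct.pack("<q", v)
        elif isinstance(v, float):
            tag, payload = TYPE_FLOAT, struct.pack("<d", v)
        else:
            raise TypeError(f"unsupported type: {type(v).__name__}")
        out += tag + struct.pack("<I", len(payload)) + payload
    # Row header: total length of the encoded fields.
    return struct.pack("<I", len(out)) + bytes(out)

def decode_row(buf: bytes) -> list:
    (row_len,) = struct.unpack_from("<I", buf, 0)
    pos, end, values = 4, 4 + row_len, []
    while pos < end:
        tag = buf[pos:pos + 1]
        (length,) = struct.unpack_from("<I", buf, pos + 1)
        payload = buf[pos + 5:pos + 5 + length]
        pos += 5 + length
        if tag == TYPE_STR:
            values.append(payload.decode("utf-8"))
        elif tag == TYPE_INT:
            values.append(struct.unpack("<q", payload)[0])
        else:
            values.append(struct.unpack("<d", payload)[0])
    return values

row = ["Adam", 1, 11.1, "123 ABC Street POBox 456"]
encoded = encode_row(row)
```

Because every field announces its own length, the data may contain any byte sequence, and a reader can skip a whole row by jumping ahead by the row length without parsing its fields.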
How to find items quickly
Try using hashing and indexing to point at the stored and cached data based on different criteria.
Taking the same example as above, you could sort the values of the first column and store them in a separate object that points at the row IDs of the items in alphabetical order, and so on.
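A sorted index over one column could be sketched like this, using Python's bisect module for the binary search. The in-memory rows dictionary is a stand-in for the table's stored rows, and the sample data is invented.

```python
import bisect

# Row id -> row, standing in for the table's stored rows.
rows = {
    1: ("Adam", 1, 11.1),
    2: ("Zoe", 2, 3.5),
    3: ("Bea", 3, 7.0),
    4: ("Adam", 4, 9.9),
}

# The index: (column value, row id) pairs kept sorted by value.
name_index = sorted((row[0], row_id) for row_id, row in rows.items())

def find_by_name(name: str) -> list:
    """Binary-search the index, then fetch the matching rows by id."""
    start = bisect.bisect_left(name_index, (name,))
    matches = []
    for value, row_id in name_index[start:]:
        if value != name:
            break
        matches.append(rows[row_id])
    return matches
```

The lookup costs a binary search plus one step per match instead of a scan over every row. Real engines typically use B-trees for the same purpose, since they stay sorted under inserts without rebuilding the whole structure.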
How to speed up inserts
What I know from Oracle is that they insert data into a temporary place, both in RAM and on disk, and do housekeeping on a periodic basis. The database engine is busy all the time optimizing its structure, but at the same time we do not want to lose data in case of a power failure or something like that.
So try to keep new data in this temporary place with no sorting, append it to your main storage, and later on, when the system is free, re-sort your indexes and clear the temp area when done.
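The append-now, sort-later approach can be sketched as follows. BufferedTable is a made-up class for illustration and keeps everything in memory; the on-disk log that would protect against power failure is deliberately left out.

```python
import bisect

class BufferedTable:
    """Appends go to an unsorted buffer; the sorted index is rebuilt later in bulk."""

    def __init__(self):
        self.rows = []     # main storage, append-only
        self.index = []    # sorted (key, position) pairs for indexed rows
        self.pending = []  # positions of rows not yet indexed

    def insert(self, key, value) -> None:
        self.rows.append((key, value))  # cheap: no sorting on the write path
        self.pending.append(len(self.rows) - 1)

    def housekeeping(self) -> None:
        """Run when the system is idle: fold pending rows into the index."""
        self.index.extend((self.rows[pos][0], pos) for pos in self.pending)
        self.index.sort()
        self.pending.clear()

    def lookup(self, key) -> list:
        start = bisect.bisect_left(self.index, (key,))
        found = []
        for k, pos in self.index[start:]:
            if k != key:
                break
            found.append(self.rows[pos][1])
        # Rows still in the buffer are not indexed, so scan them linearly.
        found += [self.rows[pos][1] for pos in self.pending if self.rows[pos][0] == key]
        return found

table = BufferedTable()
table.insert("b", 2)
table.insert("a", 1)
before = table.lookup("a")  # served from the unsorted buffer
table.housekeeping()
after = table.lookup("a")   # served from the sorted index
```

Inserts stay constant-time, and reads remain correct before housekeeping runs because the pending buffer is scanned as well; they just get slower the longer the buffer grows.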
good luck, great project.
There are books on the topic; a good place to start would be Database Systems: The Complete Book by Garcia-Molina, Ullman, and Widom.
SQLite was mentioned before, but I want to add something.
I personally learned a lot by studying SQLite. The interesting thing is that I did not go into the source code (though I did have a short look). I learned much by reading the technical material and especially by looking at the internal commands it generates. It has its own stack-based interpreter inside, and you can read the P-code it generates internally just by using EXPLAIN. Thus you can see how various constructs are translated to the low-level engine (which is surprisingly simple, but that is also the secret of its stability and efficiency).
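If you want to try this yourself, a short Python script is enough to dump the program SQLite generates for a query. The people table and index are hypothetical; the exact opcodes printed depend on the SQLite version bundled with your Python, so no particular output is shown here.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE INDEX idx_people_name ON people (name)")

# EXPLAIN returns one row per virtual-machine instruction:
# (addr, opcode, p1, p2, p3, p4, p5, comment)
program = db.execute("EXPLAIN SELECT id FROM people WHERE name = 'Adam'").fetchall()
opcodes = [row[1] for row in program]

for addr, opcode, p1, p2, p3, *_ in program:
    print(f"{addr:>3} {opcode:<14} {p1:>4} {p2:>4} {p3:>4}")
```

Comparing the listing with and without the index is a quick way to see how the engine switches from a full table scan to an index lookup.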
I would suggest focusing on www.sqlite.org
It's recent, small (source code 1MB), open source (so you can figure it out for yourself)...
Books have been written about how it is implemented:
http://www.sqlite.org/books.html
It runs on a variety of operating systems for both desktop computers and mobile phones so experimenting is easy and learning about it will be useful right now and in the future.
It even has a decent community here: https://stackoverflow.com/questions/tagged/sqlite
Okay, I have found a site which has some information on SQL and implementation - it is a bit hard to link to the page which lists all the tutorials, so I will link them one by one:
http://c2.com/cgi/wiki?CategoryPattern
http://c2.com/cgi/wiki?SliceResultVertically
http://c2.com/cgi/wiki?SqlMyopia
http://c2.com/cgi/wiki?SqlPattern
http://c2.com/cgi/wiki?StructuredQueryLanguage
http://c2.com/cgi/wiki?TemplateTables
http://c2.com/cgi/wiki?ThinkSqlAsConstraintSatisfaction
Maybe you can learn from HSQLDB. I think they offer a small and simple database for learning. You can look at the code since it is open source.
If MySQL interests you, I would also suggest this wiki page, which has got some information about how MySQL works. Also, you might want to take a look at Understanding MySQL Internals.
You might also consider looking at a non-SQL interface for your database engine. Please take a look at Apache CouchDB. It's what you would call a document-oriented database system.
Good Luck!
I am not sure whether it would fit your requirements, but I implemented a simple file-oriented database with support for simple operations (SELECT, INSERT, UPDATE) using Perl.
What I did was store each table as a file on disk, with entries following a well-defined pattern, and manipulate the data using built-in Linux tools like awk and sed. To improve efficiency, frequently accessed data was cached.

When shouldn't you use a relational database? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
Apart from the google/bigtable scenario, when shouldn't you use a relational database? Why not, and what should you use? (did you learn 'the hard way'?)
In my experience, you shouldn't use a relational database when any one of these criteria is true:
your data is structured as a hierarchy or a graph (network) of arbitrary depth,
the typical access pattern emphasizes reading over writing, or
there’s no requirement for ad-hoc queries.
Deep hierarchies and graphs do not translate well to relational tables. Even with the assistance of proprietary extensions like Oracle's CONNECT BY, chasing down trees is a mighty pain using SQL.
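For comparison, here is what walking a tree looks like with a standard recursive common table expression, the portable alternative to CONNECT BY. The sketch runs on SQLite through Python's sqlite3 module, and the staff table and its rows are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO staff VALUES (1, 'root', NULL), (2, 'ann', 1), (3, 'bob', 1),
                             (4, 'cy', 2), (5, 'dee', 4);
""")

def subtree(root_id: int) -> list:
    """Return (name, depth) for every node at or below root_id."""
    return db.execute(
        """
        WITH RECURSIVE tree(id, name, depth) AS (
            SELECT id, name, 0 FROM staff WHERE id = ?
            UNION ALL
            SELECT s.id, s.name, t.depth + 1
            FROM staff s JOIN tree t ON s.manager_id = t.id
        )
        SELECT name, depth FROM tree ORDER BY depth, name
        """,
        (root_id,),
    ).fetchall()
```

It works, but each level of the tree costs another round of the self-join, and the query is noticeably harder to read than following child pointers in an in-memory tree, which is the pain the paragraph above is pointing at.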
Relational databases add a lot of overhead for simple read access. Transactional and referential integrity are powerful, but overkill for some applications. So for read-mostly applications, a file metaphor is good enough.
Finally, you simply don’t need a relational database with its full-blown query language if there are no unexpected queries anticipated. If there are no suits asking questions like "how many 5%-discounted blue widgets did we sell on the east coast, grouped by salesperson?", and there never will be, then you, sir, can live free of DB.
The relational database paradigm makes some assumptions about usage of data.
A relation consists of an unordered set of rows.
All rows in a relation have the same set of columns.
Each column has a fixed name and data type and semantic meaning on all rows.
Rows in a relation are identified by unique values in primary key column(s).
etc.
These assumptions support simplicity and structure, at the cost of some flexibility. Not all data management tasks fit into this kind of structure. Entities with complex attributes or variable attributes do not, for instance. If you need flexibility in areas where a relational database solution doesn't support it, you need to use a different kind of solution.
There are other solutions for managing data with different requirements. Semantic Web technology, for example, allows each entity to define its own attributes and to be self-describing, by treating metadata as attributes just like data. This is more flexible than the structure imposed by a relational database, but that flexibility comes with a cost of its own.
Overall, you should use the right tool for each job.
See also my other answer to "The Next-gen databases."
There are three main data models (C.J.Date, E.F.Codd) and I am adding a flat file to this:
flat file(s) (structure varies - from 'stupid' flat text to files conforming to grammars which coupled with clever tools do very clever things, think compilers and what they can do, narrow application in modelling new things)
hierarchical (trees, nested sets - examples: xml and other markup languages, registry, organizational charts, etc; anything can be modelled, but integrity rules are not easy to express and retrieval is hard to optimize automatically, some retrieval is fast and some is very slow )
network (networks, graphs - examples: navigational databases, hyperlinks, semantic web, again almost anything can be modelled but automatic optimizing of retrieval is a problem)
relational (first order predicate logic - example: relational databases, automatic optimization of retrieval)
Both hierarchical and network can be represented in relational and relational can be expressed in the other two.
The reason that relational is considered 'better' is the declarative nature and standardization on not only the data retrieval language but also on the data definition language, including the strong declarative data integrity, backed up with stable, scalable, multi-user management system.
Benefits come at a cost, which most projects find to be a good ratio for systems (multi-application) that store long-term data in a form that will be usable in the foreseeable future.
If you are not building a system, but a single application, perhaps for a single user, and you are fairly certain that you will not want multiple applications using your data, nor multiple users, any time soon then you'll probably find faster approaches.
Also if you don't know what kind of data you want to store and how to model it then relational model strengths are wasted on it.
Or if you simply don't care about integrity of your data that much (which can be fine).
All data structures are optimized for a certain kind of use, only relational if properly modelled tries to represent the 'reality' in semantically unbiased way. People who had bad experience with relational databases usually don't realize that their experience would have been much worse with other types of data models. Horrible implementations are possible, and especially with relational databases, where it is relatively easy to build complex models, you could end up with quite a monster on your hands. Still I always feel better when I try to imagine the same monster in xml.
One indication of how good the relational model is, IMO, is the ratio of complexity to brevity in the questions you will find that involve SQL.
I suggest you visit the High Scalability blog, which discusses this topic almost on a daily basis and has many articles about projects that chose distributed hashes, etc. over RDBMS.
The quick (but very incomplete) answer is that not all data translates well to tables in efficient ways. For example, if your data is essentially one big dictionary, there are probably much faster alternatives than a plain old RDBMS. Having said that, it is mostly a matter of performance, and if performance isn't a huge concern in a project, while stability, consistency and reliability are, then I don't see much point in delving into these technologies when RDBMS is a much more mature and well-developed scheme, with support in all languages and platforms and a huge set of solutions to choose from.
Fifteen years ago I was working on a credit risk system (basically a big tree-walking system). We were using Sybase on HP-UX and Solaris, and performance was killing us. We hired in consultants direct from Sybase who said it couldn't be done. Then we switched to an OO database (ObjectStore in this case) and got about a 100x performance increase (and the code was about 100x easier to write too).
But such situations are quite rare - a relational database is a good first choice.
When your schema varies a lot you will have a hard time with relational databases. This is where XML databases or key-value pair databases work best. Or you could use IBM DB2 and have both relational data and XML data managed by a single database engine.
About 7-8 years ago I worked on a web site that grew in popularity beyond our initial expectations, and it got us into trouble performance-wise. Since we were all relatively inexperienced in web-based projects, it put a significant strain on us to work out what to do beyond the usual moving of the database onto a separate server, load balancing, etc.
One day I thought of something pretty simple. Since the site was based on users, their profiles were stored in a database table the usual way (user ID, lots of info variables and stuff like that), which would show up as a user's profile page that other users could look up. I flushed all that data into a simple HTML file, already prepared as a user's profile page, and got a significant boost: basically a cache. I even made a system so that when users edited their profile info, it would parse the original HTML file, put it up for editing, and then flush the HTML back to the file system, which gave an even bigger boost.
I did something similar with the messages users sent to each other. Basically, wherever I could make the system bypass the database altogether, avoiding an INSERT or UPDATE, I got a significant boost. It may sound like common sense, but it was an enlightening moment. It is not an avoidance of the relational setup per se, but an avoidance of the database altogether: KISS.
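A pre-rendered profile page cache along these lines could be sketched as below. Unlike the setup described above, this version still writes to a database on edits and only takes the database out of the read path; the paths, table and function names are assumptions for the example.

```python
import html
import sqlite3
import tempfile
from pathlib import Path

cache_dir = Path(tempfile.mkdtemp())  # hypothetical location for pre-rendered pages
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE profiles (user_id INTEGER PRIMARY KEY, name TEXT, bio TEXT)")

def render(name: str, bio: str) -> str:
    return f"<html><body><h1>{html.escape(name)}</h1><p>{html.escape(bio)}</p></body></html>"

def save_profile(user_id: int, name: str, bio: str) -> None:
    """Write path: update the table, then re-render the static page once."""
    with db:
        db.execute(
            "INSERT OR REPLACE INTO profiles VALUES (?, ?, ?)", (user_id, name, bio)
        )
    (cache_dir / f"{user_id}.html").write_text(render(name, bio), encoding="utf-8")

def view_profile(user_id: int) -> str:
    """Read path: serve the file directly, with no database query."""
    return (cache_dir / f"{user_id}.html").read_text(encoding="utf-8")

save_profile(7, "Ann <admin>", "Hello")
save_profile(7, "Ann <admin>", "Updated bio")
```

Profile pages are read far more often than they are edited, so paying the rendering cost once per edit rather than once per view is where the boost comes from.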