I need to include a database in a C# project with a lot of data.
This database should be free even for commercial use.
What database should I use?
[EDIT]
I would like to know which databases should be avoided.
When I say "a lot", it's for scientific calculation, so the data will be huge.
I've used a PostgreSQL database for my own "semi-scientific" proof-of-concept project. It stored 50 GB+ of data. My experience is positive. You should be careful about the partition scheme and indexing. It is free and supported by a large online community.
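To make the partitioning/indexing point concrete, here is a rough sketch using psycopg2; the table, column names and connection string are made up, and declarative partitioning like this needs PostgreSQL 10 or newer:

import psycopg2  # assumes access to a PostgreSQL server

conn = psycopg2.connect("dbname=science")  # hypothetical DSN
cur = conn.cursor()

# Range-partition a large fact table by time and index each partition
# on the lookup key, so queries only touch the partitions they need.
cur.execute("""
    CREATE TABLE measurements (
        sensor_id  integer          NOT NULL,
        taken_at   timestamptz      NOT NULL,
        value      double precision
    ) PARTITION BY RANGE (taken_at);
""")
cur.execute("""
    CREATE TABLE measurements_2015 PARTITION OF measurements
        FOR VALUES FROM ('2015-01-01') TO ('2016-01-01');
""")
cur.execute("CREATE INDEX ON measurements_2015 (sensor_id, taken_at);")
conn.commit()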
I think this question is still kind of vague. When you need to choose a database system, you may also need to consider some important factors other than the input data size and the budget (it must be free).
1) You mentioned the purpose is scientific computation; do you need to support some complex and ad-hoc analytics/operations? For example, do you need to support time travel or multi-dimensional histogram generation? If so, you'd better choose a database specifically designed for scientific computing rather than a more general-purpose one, and SciDB/MonetDB/RasDaMan may be good options.
2) You mentioned the data size, but what's the data type? Is it relational data (e.g., CSV), array-based data (HDF5/NetCDF), or spatial data? One size does not fit all, and there are different types of databases specifically designed for different types of input data: relational/array/spatial... databases. Note that before you use any database, you have to load your data into it, and the data loading may be very painful if there is a mismatch between your input data type and the database type.
3) Performance may be a very crucial factor in your case, and you may also need to consider scalability if distributed computing is in your plan. For example, since scientific data is usually read-only or append-only, do you really need to guarantee ACID properties during query execution? You may consider sacrificing ACID for a performance improvement. If so, SciDB may be a much better fit than SQL Server.
Well, you can actually use almost anything: PostgreSQL, SQLite, and even Microsoft's SQL Server Compact or SQL Server.
It also depends on what "a lot of data" means.
http://www.fakenamegenerator.com/ might help you out. Depends on the type of data you need.
Suppose I have a table in a RDBMS having 26 columns, say A - Z.
With relational databases I can write queries which involve conditions on multiple columns. For example,
Select A, B
from table
where C > 12
and D = 'john'
and E between 3 and 6
order by F;
However, if I have the same table in a NoSQL database, all they provide is lookups based on primary keys, or on some predefined GSI (taking DynamoDB as an example).
I can issue a scan against the table in a NoSQL DB, but that is a lot slower compared to a table in an RDBMS, even if the columns involved are not indexed.
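To make the contrast concrete, here is roughly what the two access paths look like with boto3 (the table and index names are just placeholders):

import boto3
from boto3.dynamodb.conditions import Attr, Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("MyTable")  # hypothetical table

# Query: fast, but only usable against the table key or a predefined GSI
by_name = table.query(
    IndexName="D-index",  # hypothetical GSI whose partition key is D
    KeyConditionExpression=Key("D").eq("john"),
)

# Scan: reads every item and filters afterwards, which is why it is slow
filtered = table.scan(
    FilterExpression=Attr("C").gt(12) & Attr("D").eq("john") & Attr("E").between(3, 6)
)
print(len(filtered["Items"]))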
I wanted to understand the reasons why NoSQL databases scale very well but fail to provide a query language like SQL. Can someone throw some light on it?
You should be more specific about which database(s) you're asking about. You mention DynamoDB, but it's not clear from your question whether this is just one example or whether you're asking only about DynamoDB.
There are over 220 products that call themselves NoSQL, and they have different characteristics.
Some have an SQL-like language, some don't.
Some support queries to search by secondary attributes, some don't.
It's more a question of why a specific product didn't implement a SQL-like language, not a limitation of "NoSQL" as a broad category of products.
Your question is like asking "why don't non-motorcycles have a clutch?" The answer is that non-motorcycles is a broad category of vehicles, some of which actually do have a clutch, whereas some others were designed not to need a clutch.
NoSQL databases are designed on the premise that the data contained within them is schemaless. Thus, there is no pre-defined structure for the data which a database engine can easily use to determine how to execute an ad-hoc query. However, some NoSQL database engines (e.g. Couchbase) do indeed offer such a capability.
The issue with database management systems in general has rarely been about storage and retrieval efficiency, but rather query plan optimization. In general, computers are not very good about dealing with issues created by poor designs. Also in general, most developers are not good at properly structuring data such that it can be queried quickly and easily by an automatically-generated query plan. Thus, most systems which rely upon automatically generated query plans tend to suffer performance issues.
In my opinion, the reason why a NoSQL technology might not want to provide automatic query plan generation is that it forces the developer to give actual thought to the process of retrieving the data, such that an efficient and effective plan might be devised in the code. Indeed, I have found that I am usually better at writing queries than the computer is. Could I restructure the data in such a way that the computer can write a good query plan the first time? Yes, but that takes more time than doing it myself to begin with.
Hey. I am going to be setting up a database which could get really, really huge.
I've been using standard MySQL for most of my stuff, but this particular problem will get up into the terabytes and I will want to be able to do hundreds of queries a second.
So aside from designing my database schema so that it's not going to chug, and fast hard drive speeds, what is my biggest bottleneck, and what sort of solution is recommended for this?
Does it make sense to spread the database over multiple computers on my intranet so it can scale with CPU/RAM etc., and if so, is there software or a database solution for this?
Thanks for any help!
I did a search for questions to related to this and couldn't find anything so sorry if it has already been asked.
Database scalability is a VERY complicated issue; a LOT of factors come into the whole process.
First, consider the lowest-hanging fruit; do you have individual tables (or columns) that are going to be containing the bulk of your data? Columns which will contain BLOBs which are > 4MB each? Those can be extracted from the database and stored on a flat-file storage system, and merely referred to from the database; right there, that can take many unwieldy solutions down to a manageable level.
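As a rough sketch of that "store the file, keep only a reference" idea (the storage root and hashing scheme here are just one possible layout, not a specific product's):

import hashlib
import shutil
from pathlib import Path

BLOB_ROOT = Path("blob_store")  # hypothetical flat-file storage location

def store_blob(src_path: str) -> str:
    """Copy a large file into the flat-file store and return the reference
    (a content-hash path) to keep in a plain varchar column instead of a BLOB."""
    digest = hashlib.sha256(Path(src_path).read_bytes()).hexdigest()
    dest = BLOB_ROOT / digest[:2] / digest
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src_path, dest)
    return str(dest)  # store this string in the database row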
If not, do you have deeply different usage patterns for different subgroupings of tables? If so, there's an opportunity right there for segmenting your database into different functional databases which can be partitioned onto different servers. A good example of this is read-mostly data, such as on webservers, which gets generated rarely (think user-specific home page data) but read frequently; that type of data can get segregated into a database (or, again, a flatfile with references) that's separate from the rest of the user data.
Consider the transactional requirements of your database; can you isolate your transaction boundaries cleanly, or will there be deeply mingled transactions going on all through your database? If you can isolate your transaction boundaries, there's another potential useful boundary.
This is just touching on some of the issues involved with this sort of thing. One thing worth considering is whether or not you really need to have a database that is actually going to be huge, or if you're just trying to use the database as a persistence layer. If you're using the database just as a persistence layer, you might reconsider whether you actually need the relational nature of a database at all, or if you can get away with a smaller relational overlay on top of a simpler persistence layer. (I say this because a large quantity of solutions seem like they could get away with a thin relational layer over a large persistence layer; it's worth considering.)
Ok, first I need to point you here. I don't think MySQL is going to perform like you want. I have a bad feeling that when I say you need to look into an Oracle installation, you're going to say, "We don't have the cash for that." But when I say get the latest/greatest SQL Server, you're going to say, "We don't have the hardware it'll take to implement that." I'm afraid that terabytes is just flat out going to crush your MySQL installation.
A new breed of NewSQL databases is being built to solve exactly the problem of distributing resources over multiple servers. The Clustrix database (which was built from the ground up to be a MySQL replacement) is one example that provides near-linear scale -- as you run out of CPU/memory, you can simply add nodes.
Database scalability is a tough problem and you should consider solutions that can address it for you. I believe that MySQL can be used as the foundation for a solution to your problem.
Horizontal scalability: the ability to scale a database horizontally (aka scale-out) is a good technique to address the problem of very large tables and databases.
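As a toy illustration of scale-out routing (node names are placeholders; real products also handle rebalancing, replication, and cross-shard queries for you):

import hashlib

SHARDS = ["db-node-1", "db-node-2", "db-node-3"]  # hypothetical MySQL hosts

def shard_for(user_id: int) -> str:
    """Hash-based routing: the same key always maps to the same node,
    so one user's reads and writes stay on one shard."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for(12345))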
I am interested in learning how a database engine works (i.e. the internals of it). I know most of the basic data structures taught in CS (trees, hash tables, lists, etc.) as well as a pretty good understanding of compiler theory (and have implemented a very simple interpreter) but I don't understand how to go about writing a database engine. I have searched for tutorials on the subject and I couldn't find any, so I am hoping someone else can point me in the right direction. Basically, I would like information on the following:
How the data is stored internally (i.e. how tables are represented, etc.)
How the engine finds data that it needs (e.g. run a SELECT query)
How data is inserted in a way that is fast and efficient
And any other topics that may be relevant to this. It doesn't have to be an on-disk database - even an in-memory database is fine (if it is easier) because I just want to learn the principles behind it.
Many thanks for your help.
If you're good at reading code, studying SQLite will teach you a whole boatload about database design. It's small, so it's easier to wrap your head around. But it's also professionally written.
SQLite 2.5.0 for Code Reading
http://sqlite.org/
The answer to this question is a huge one; expect a PhD thesis to answer it 100% ;)
But we can think through the problems one by one:
How to store the data internally:
You should have a data file containing your database objects, and a caching mechanism that loads the data in focus (and some data around it) into RAM.
Assume you have a table with some data. You would create a data format to convert this table into a binary file, by agreeing on the definition of a column delimiter and a row delimiter, and making sure such delimiter patterns are never used in your data itself (i.e. if you have selected <*> to separate columns, you should validate that the data you are placing in this table does not contain this pattern). You could also use a row header and a column header, specifying the size of the row and some internal indexing number to speed up your search, and at the start of each column the length of that column.
like "Adam", 1, 11.1, "123 ABC Street POBox 456"
you can have it like
<&RowHeader, 1><&Col1,CHR, 4>Adam<&Col2, num,1,0>1<&Col3, Num,2,1>111<&Col4, CHR, 24>123 ABC Street POBox 456<&RowTrailer>
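A minimal Python sketch of the same idea, using length prefixes instead of delimiter characters so the data can never collide with the format (the exact layout is illustrative, not a fixed standard):

import struct

def encode_row(values):
    """Encode one row as: row header (column count) followed by
    length-prefixed UTF-8 column values."""
    parts = [struct.pack("<H", len(values))]        # row header: column count
    for v in values:
        data = str(v).encode("utf-8")
        parts.append(struct.pack("<I", len(data)))  # column header: byte length
        parts.append(data)
    return b"".join(parts)

def decode_row(buf):
    (ncols,) = struct.unpack_from("<H", buf, 0)
    offset, values = 2, []
    for _ in range(ncols):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        values.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return values

row = encode_row(["Adam", 1, 11.1, "123 ABC Street POBox 456"])
print(decode_row(row))  # ['Adam', '1', '11.1', '123 ABC Street POBox 456']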
How to find items quickly
Try using hashing and indexing to point at data stored and cached based on different criteria.
Taking the same example above, you could sort the values of the first column and store them in a separate object that points at the row IDs of the items, sorted alphabetically, and so on.
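For example, a toy secondary index over the first column might look like this (just an illustration of the idea, not production code):

import bisect

# toy secondary index: column values kept sorted, with parallel row ids
names   = ["Adam", "Beth", "Carl"]  # sorted values of the indexed column
row_ids = [0, 1, 2]                 # row id of each value above

def lookup(name):
    """Binary-search the sorted column instead of scanning every row."""
    i = bisect.bisect_left(names, name)
    matches = []
    while i < len(names) and names[i] == name:
        matches.append(row_ids[i])
        i += 1
    return matches

print(lookup("Beth"))  # [1]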
How to speed up inserting data
What I know from Oracle is that it inserts data into a temporary place, both in RAM and on disk, and does housekeeping on a periodic basis. The database engine is busy all the time optimizing its structure, but at the same time we do not want to lose data in case of a power failure or something like that.
So try to keep data in this temporary place with no sorting, append it to your original storage, and later on, when the system is free, re-sort your indexes and clear the temp area.
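A toy sketch of that write path (the log format and flush policy are simplified placeholders):

class Table:
    """Toy write path: inserts go to an unsorted in-memory buffer and an
    append-only log (so a crash does not lose them); a periodic flush()
    merges the buffer into the sorted main storage and clears it."""

    def __init__(self, log_path):
        self.rows, self.buffer = [], []
        self.log = open(log_path, "a", encoding="utf-8")

    def insert(self, row):
        self.buffer.append(row)            # no sorting on the hot path
        self.log.write(repr(row) + "\n")   # durability in case of power failure
        self.log.flush()

    def flush(self):
        self.rows = sorted(self.rows + self.buffer)  # periodic housekeeping
        self.buffer.clear()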
Good luck, great project.
There are books on the topic; a good place to start would be Database Systems: The Complete Book by Garcia-Molina, Ullman, and Widom.
SQLite was mentioned before, but I want to add something.
I personally learned a lot by studying SQLite. The interesting thing is that I did not go to the source code (though I just had a short look). I learned much by reading the technical material and especially by looking at the internal commands it generates. It has its own stack-based interpreter inside, and you can read the P-code it generates internally just by using EXPLAIN. Thus you can see how various constructs are translated to the low-level engine (which is surprisingly simple -- but that is also the secret of its stability and efficiency).
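For example, with Python's built-in sqlite3 module (the table here is just a throwaway example):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")

# EXPLAIN returns the low-level VDBE opcodes (the "P-code") SQLite
# would execute for this query, one opcode per row
for opcode in conn.execute("EXPLAIN SELECT a FROM t WHERE b = 'john'"):
    print(opcode)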
I would suggest focusing on www.sqlite.org
It's recent, small (source code 1MB), open source (so you can figure it out for yourself)...
Books have been written about how it is implemented:
http://www.sqlite.org/books.html
It runs on a variety of operating systems for both desktop computers and mobile phones so experimenting is easy and learning about it will be useful right now and in the future.
It even has a decent community here: https://stackoverflow.com/questions/tagged/sqlite
Okay, I have found a site which has some information on SQL and implementation - it is a bit hard to link to the page which lists all the tutorials, so I will link them one by one:
http://c2.com/cgi/wiki?CategoryPattern
http://c2.com/cgi/wiki?SliceResultVertically
http://c2.com/cgi/wiki?SqlMyopia
http://c2.com/cgi/wiki?SqlPattern
http://c2.com/cgi/wiki?StructuredQueryLanguage
http://c2.com/cgi/wiki?TemplateTables
http://c2.com/cgi/wiki?ThinkSqlAsConstraintSatisfaction
Maybe you can learn from HSQLDB. I think it offers a small and simple database for learning. You can look at the code since it is open source.
If MySQL interests you, I would also suggest this wiki page, which has got some information about how MySQL works. Also, you might want to take a look at Understanding MySQL Internals.
You might also consider looking at a non-SQL interface for your database engine. Please take a look at Apache CouchDB. It's what you would call a document-oriented database system.
Good Luck!
I am not sure whether it would fit your requirements, but I had implemented a simple file-oriented database with support for simple operations (SELECT, INSERT, UPDATE) using Perl.
What I did was store each table as a file on disk, with entries following a well-defined pattern, and manipulate the data using built-in Linux tools like awk and sed. To improve efficiency, frequently accessed data was cached.
Are User Defined Data Types in SQL Server something that an intermediate SQL user should know and use?
What are pros and cons of using UDTs?
Never use them is my advice. You are in a world of hurt if you ever have to change the definition. Perhaps this has improved since SQL Server 2000 and someone with more familiarity with the newer versions can tell you whether it is now safe to get in the water, but until I had confirmation of this and had checked it out myself with a test, I wouldn't put it on my production system.
Check out this question for details:
How to change the base type of a UDT in Sql Server 2005?
I do not use code-based UDTs because I don't think that the extra complexity warrants the advantages. I do use T-SQL UDTs because there's very little extra complexity so that the advantages are worth the effort. (Thanks go to Marc_s for pointing out that my original post was incomplete!)
Regarding Code-based UDTs
Think of it this way: if your project has a managed code component (your app) and a database component (SQL Server) what real advantage do you gain from defining managed code in the database? In my experience? None.
Deployment is more difficult because you'll have to add assemblies to your DB deployment and alter these assemblies, add files, etc. within SQL Server. You'll also have to turn on the CLR in SQL Server (not a big deal but no one's proven to me that this won't have a performance/memory penalty). In the end, you'll have exactly what you would have had if you had simply designed this into your application's code. There may be some performance enhancement but it really strikes me as a case of premature optimization - especially since I don't know if the overall performance suffers due to having the CLR on versus off.
Note: I'm assuming that you would be using SQL Server's CLR to define your types. HLGEM talks about SQL Server 2000 but I'm not familiar with 2000 and thought it only had UDFs and not UDTs in externally-defined dlls (but don't quote me...I really am not familiar with it!).
Regarding T-SQL UDTs
T-SQL UDTs can be defined in SQL alone (go to "Programmability | Types | User-defined Data Types" in SQL Server Management Studio). For standard UDTs I would in fact recommend that you master them. They are quite easy and can make your DDL more self-documenting and can enforce integrity constraints. For example, I define a "GenderType" (char(1), not nullable, holding "M" or "F") that ensures that only appropriate data is permitted in the Gender field.
UDTs are pretty easy overall but this article gives a pretty good example of how to take it to the next level by defining a Rule to constrain the data permitted in your UDT.
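As a rough sketch of what that looks like from client code (the connection string and table are hypothetical, and a plain CHECK constraint is shown instead of a Rule, since rules are deprecated in later SQL Server versions):

import pyodbc  # assumes an ODBC driver for SQL Server is installed

conn = pyodbc.connect("DSN=MyServer;Trusted_Connection=yes")  # hypothetical DSN
cur = conn.cursor()

# T-SQL alias type: documents intent and forbids NULLs wherever it is used
cur.execute("CREATE TYPE dbo.GenderType FROM char(1) NOT NULL;")

# Use the type in a table; the CHECK constraint enforces the 'M'/'F' domain
cur.execute("""
    CREATE TABLE dbo.Person (
        PersonId int IDENTITY PRIMARY KEY,
        Gender   dbo.GenderType CHECK (Gender IN ('M', 'F'))
    );
""")
conn.commit()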
When I originally answered this question I was fixed on the idea of complex, code-defined types (smacks palm to forehead). So...thanks Marc.
The pro of user defined types is addressed quite well by Alex Papadimoulis. The cons have been well stated here.
I would also like to point out that the sp_bindrule function has been deprecated, as noted by Alex's post. I'm not sure when it was deprecated but it is now. In fact, rules are deprecated as a whole.
Were I to want to create a type with a restriction, I'd consider using a user defined table type with a check constraint on the appropriate column(s). This also gives me a way of building a complex data type.
I can't really recommend the use of any SQL-implementation-specific features that make it harder when you are growing out of MSSQL and migrating to another DBMS. For our DWH databases we started on MSSQL, migrated to Oracle, and have since last year moved on to HP Vertica.
Apart from the google/bigtable scenario, when shouldn't you use a relational database? Why not, and what should you use? (did you learn 'the hard way'?)
In my experience, you shouldn't use a relational database when any one of these criteria is true:
your data is structured as a hierarchy or a graph (network) of arbitrary depth,
the typical access pattern emphasizes reading over writing, or
there’s no requirement for ad-hoc queries.
Deep hierarchies and graphs do not translate well to relational tables. Even with the assistance of proprietary extensions like Oracle's CONNECT BY, chasing down trees is a mighty pain using SQL.
Relational databases add a lot of overhead for simple read access. Transactional and referential integrity are powerful, but overkill for some applications. So for read-mostly applications, a file metaphor is good enough.
Finally, you simply don't need a relational database with its full-blown query language if there are no unexpected queries anticipated. If there are no suits asking questions like "how many 5%-discounted blue widgets did we sell on the east coast, grouped by salesperson?", and there never will be, then you, sir, can live free of a DB.
The relational database paradigm makes some assumptions about usage of data.
A relation consists of an unordered set of rows.
All rows in a relation have the same set of columns.
Each column has a fixed name and data type and semantic meaning on all rows.
Rows in a relation are identified by unique values in primary key column(s).
etc.
These assumptions support simplicity and structure, at the cost of some flexibility. Not all data management tasks fit into this kind of structure. Entities with complex attributes or variable attributes do not, for instance. If you need flexibility in areas where a relational database solution doesn't support it, you need to use a different kind of solution.
There are other solutions for managing data with different requirements. Semantic Web technology, for example, allows each entity to define its own attributes and to be self-describing, by treating metadata as attributes just like data. This is more flexible than the structure imposed by a relational database, but that flexibility comes with a cost of its own.
Overall, you should use the right tool for each job.
See also my other answer to "The Next-gen databases."
There are three main data models (C.J.Date, E.F.Codd) and I am adding a flat file to this:
flat file(s) (structure varies - from 'stupid' flat text to files conforming to grammars which coupled with clever tools do very clever things, think compilers and what they can do, narrow application in modelling new things)
hierarchical (trees, nested sets - examples: xml and other markup languages, registry, organizational charts, etc; anything can be modelled, but integrity rules are not easy to express and retrieval is hard to optimize automatically, some retrieval is fast and some is very slow )
network (networks, graphs - examples: navigational databases, hyperlinks, semantic web, again almost anything can be modelled but automatic optimizing of retrieval is a problem)
relational (first order predicate logic - example: relational databases, automatic optimization of retrieval)
Both hierarchical and network can be represented in relational and relational can be expressed in the other two.
The reason that relational is considered 'better' is its declarative nature and the standardization not only of the data retrieval language but also of the data definition language, including strong declarative data integrity, backed by a stable, scalable, multi-user management system.
Benefits come at a cost, which most projects find to be a good ratio for systems (multi-application) that store long-term data in a form that will be usable in the foreseeable future.
If you are not building a system, but a single application, perhaps for a single user, and you are fairly certain that you will not want multiple applications using your data, nor multiple users, any time soon then you'll probably find faster approaches.
Also, if you don't know what kind of data you want to store and how to model it, then the relational model's strengths are wasted on it.
Or if you simply don't care about integrity of your data that much (which can be fine).
All data structures are optimized for a certain kind of use; only relational, if properly modelled, tries to represent 'reality' in a semantically unbiased way. People who have had a bad experience with relational databases usually don't realize that their experience would have been much worse with other types of data models. Horrible implementations are possible, and especially with relational databases, where it is relatively easy to build complex models, you could end up with quite a monster on your hands. Still, I always feel better when I try to imagine the same monster in XML.
One example of how good the relational model is, IMO, is the ratio of complexity vs. shortness of the questions that you will find that involve SQL.
I suggest you visit the High Scalability blog, which discusses this topic almost on a daily basis and has many articles about projects that chose distributed hashes, etc. over RDBMS.
The quick (but very incomplete) answer is that not all data translates well to tables in efficient ways. For example, if your data is essentially one big dictionary, there are probably much faster alternatives than a plain old RDBMS. Having said that, it's mostly a matter of performance, and if performance isn't a huge concern in a project, while stability, consistency and reliability, for example, are, then I don't see much point in delving into these technologies when an RDBMS is a much more mature and well-developed scheme, with support in all languages and platforms and a huge set of solutions to choose from.
Fifteen years ago I was working on a credit risk system (basically a big tree-walking system). We were using Sybase on HP-UX & Solaris and performance was killing us. We hired in consultants direct from Sybase who said it couldn't be done. Then we switched to an OO database (ObjectStore in this case) and got about a 100x performance increase (and the code was about 100x easier to write too).
But such situations are quite rare - a relational database is a good first choice.
When your schema varies a lot you will have a hard time with relational databases. This is where XML databases or key-value pair databases work best. Or you could use IBM DB2 and have both relational data and XML data managed by a single database engine.
About 7-8 years ago I worked on a web site that grew in popularity beyond our initial expectations, and it got us into trouble performance-wise. Since we were all relatively inexperienced in web-based projects, it put a significant strain on us to figure out what to do beyond the usual database separation onto a separate server, load balancing, etc.
One day I thought of something pretty simple. Since the site was based on users, their profiles were stored in a database table the usual way someone would do it - user id, lots of info variables and stuff like that - which would show up as a user's profile page which other users could look up. I flushed all that data into a simple HTML file, already prepared as a user's profile page, and got a significant boost - basically a cache. I even made a system where, when a user edited their profile info, it would parse the original HTML file, put it up for edit, and then flush the HTML back out to the file system - and got even more of a boost.
I made something similar with the messages users sent to each other. Basically, wherever I could make the system bypass the database altogether, avoiding an INSERT or UPDATE, I got a significant boost. It may sound like common sense, but it was an enlightening moment. It is not an avoidance of a relational setup per se, but an avoidance of the database altogether - KISS.
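A stripped-down sketch of that idea (the file location and page template are made up; the point is that profile reads never touch the database):

from pathlib import Path

CACHE_DIR = Path("profiles")  # hypothetical directory served directly by the web server

def render_profile_html(user):
    """Render a user's profile page once and write it to a static file,
    so later page views are plain file reads instead of database queries."""
    html = "<html><body><h1>{name}</h1><p>{bio}</p></body></html>".format(**user)
    CACHE_DIR.mkdir(exist_ok=True)
    (CACHE_DIR / "{id}.html".format(**user)).write_text(html, encoding="utf-8")

def on_profile_edit(user):
    # after (or instead of) the UPDATE, regenerate the cached page
    render_profile_html(user)

render_profile_html({"id": 42, "name": "Ada", "bio": "Likes databases."})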