How to write a simple database engine [closed]

I am interested in learning how a database engine works (i.e. the internals of it). I know most of the basic data structures taught in CS (trees, hash tables, lists, etc.) as well as a pretty good understanding of compiler theory (and have implemented a very simple interpreter) but I don't understand how to go about writing a database engine. I have searched for tutorials on the subject and I couldn't find any, so I am hoping someone else can point me in the right direction. Basically, I would like information on the following:
How the data is stored internally (i.e. how tables are represented, etc.)
How the engine finds data that it needs (e.g. run a SELECT query)
How data is inserted in a way that is fast and efficient
And any other topics that may be relevant to this. It doesn't have to be an on-disk database - even an in-memory database is fine (if it is easier) because I just want to learn the principles behind it.
Many thanks for your help.

If you're good at reading code, studying SQLite will teach you a whole boatload about database design. It's small, so it's easier to wrap your head around. But it's also professionally written.
SQLite 2.5.0 for Code Reading
http://sqlite.org/

The answer to this question is a huge one; expect a PhD thesis to answer it 100% ;)
But we can think through the problems one by one:
How to store the data internally:
You should have a data file containing your database objects, plus a caching mechanism that loads the data in focus (and some data around it) into RAM.
Assume you have a table with some data. You would define a data format to convert this table into a binary file, by agreeing on a column delimiter and a row delimiter and by making sure that delimiter pattern never appears in the data itself. For example, if you selected <*> to separate columns, you should validate that the data you place in this table never contains this pattern. You could also use a row header and a column header, specifying the size of the row and an internal indexing number to speed up your search, and placing the length of each column at its start.
like "Adam", 1, 11.1, "123 ABC Street POBox 456"
you can have it like
<&RowHeader, 1><&Col1,CHR, 4>Adam<&Col2, num,1,0>1<&Col3, Num,2,1>111<&Col4, CHR, 24>123 ABC Street POBox 456<&RowTrailer>
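A minimal sketch of that encoding idea in Python; the tag layout (<&RowHeader,...>, <&ColN,TYPE,...>), the type codes, and the scaling of floats are all illustrative assumptions, not a real on-disk format:

    # Toy row encoder for the delimiter/header idea described above.
    def encode_row(row_id, values):
        """Encode one row as a tagged string with per-column type and length headers."""
        parts = [f"<&RowHeader,{row_id}>"]
        for i, value in enumerate(values, start=1):
            if isinstance(value, str):
                parts.append(f"<&Col{i},CHR,{len(value)}>{value}")
            elif isinstance(value, float):
                # Store floats as scaled integers: digits plus a scale of 1.
                text = f"{value:.1f}".replace(".", "")
                parts.append(f"<&Col{i},NUM,{len(text)},1>{text}")
            else:
                text = str(value)
                parts.append(f"<&Col{i},NUM,{len(text)},0>{text}")
        parts.append("<&RowTrailer>")
        return "".join(parts)

    print(encode_row(1, ["Adam", 1, 11.1, "123 ABC Street POBox 456"]))
    # <&RowHeader,1><&Col1,CHR,4>Adam<&Col2,NUM,1,0>1<&Col3,NUM,3,1>111<&Col4,CHR,24>123 ABC Street POBox 456<&RowTrailer>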
How to find items quickly
Try using hashing and indexing to point at stored and cached data based on different criteria.
Taking the same example above, you could sort the values of the first column and store them in a separate object that points at the row ids of the items, sorted alphabetically, and so on.
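A minimal sketch of that kind of secondary index, assuming rows live in an in-memory dict keyed by row id (the names and data are hypothetical):

    import bisect

    # Primary storage: row id -> row values.
    rows = {
        1: ("Adam", 1, 11.1, "123 ABC Street POBox 456"),
        2: ("Zoe", 2, 3.5, "9 Oak Lane"),
        3: ("Bob", 3, 7.0, "77 Pine Road"),
    }

    # Secondary index on the first column: (value, row_id) pairs kept sorted,
    # so lookups and range scans can use binary search instead of a full scan.
    name_index = sorted((values[0], row_id) for row_id, values in rows.items())

    def find_by_name(name):
        """Return all row ids whose first column equals `name`."""
        pos = bisect.bisect_left(name_index, (name, -1))
        result = []
        while pos < len(name_index) and name_index[pos][0] == name:
            result.append(name_index[pos][1])
            pos += 1
        return result

    print(find_by_name("Bob"))   # [3]

The same shape (sorted keys pointing back at row ids) is what a B-tree index gives you on disk, just with better update behaviour.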
How to speed up inserts
What I know from Oracle is that it inserts data into a temporary area, both in RAM and on disk, and does housekeeping on a periodic basis. The database engine is busy optimizing its structure all the time, but at the same time we do not want to lose data in case of a power failure or something like that.
So try to keep data in this temporary area with no sorting, append it to your original storage, and later on, when the system is free, re-sort your indexes and clear the temp area.
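A rough sketch of that append-now, reorganize-later idea, using a hypothetical in-memory buffer and flush threshold (real engines add write-ahead logging and background threads, which this ignores):

    # Sketch of the "append fast, housekeep later" insert strategy described above.
    class TinyTable:
        def __init__(self, flush_threshold=1000):
            self.sorted_rows = []        # main storage, kept sorted by key
            self.insert_buffer = []      # unsorted temp area for fast appends
            self.flush_threshold = flush_threshold

        def insert(self, key, value):
            # Appending is O(1); no index maintenance on the hot path.
            self.insert_buffer.append((key, value))
            if len(self.insert_buffer) >= self.flush_threshold:
                self.housekeeping()

        def housekeeping(self):
            # Done "when the system is free": merge the buffer into sorted
            # storage and clear the temp area in one pass.
            self.sorted_rows = sorted(self.sorted_rows + self.insert_buffer)
            self.insert_buffer = []

        def lookup(self, key):
            # Search both areas; the sorted part could use binary search.
            return ([v for k, v in self.sorted_rows if k == key] +
                    [v for k, v in self.insert_buffer if k == key])

    t = TinyTable(flush_threshold=2)
    t.insert(5, "five"); t.insert(1, "one"); t.insert(3, "three")
    print(t.lookup(1))   # ['one']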
good luck, great project.

There are books on the topic; a good place to start would be Database Systems: The Complete Book by Garcia-Molina, Ullman, and Widom.

SQLite was mentioned before, but I want to add something.
I personally learned a lot by studying SQLite. The interesting thing is that I did not go to the source code (though I did have a short look). I learned much by reading the technical material and especially by looking at the internal commands it generates. It has its own stack-based interpreter inside, and you can read the P-code it generates internally just by using EXPLAIN. Thus you can see how various constructs are translated to the low-level engine (which is surprisingly simple; but that is also the secret of its stability and efficiency).
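For example, with Python's built-in sqlite3 module (the table here is made up), EXPLAIN returns the VDBE opcodes SQLite will execute for a query:

    import sqlite3

    # Peek at the P-code (VDBE opcodes) SQLite generates for a query.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
    con.execute("INSERT INTO people (name) VALUES ('Adam')")

    for row in con.execute("EXPLAIN SELECT name FROM people WHERE id = 1"):
        # Each row is one opcode: addr, opcode, p1..p5, comment.
        print(row)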

I would suggest focusing on www.sqlite.org
It's recent, small (source code 1MB), open source (so you can figure it out for yourself)...
Books have been written about how it is implemented:
http://www.sqlite.org/books.html
It runs on a variety of operating systems for both desktop computers and mobile phones so experimenting is easy and learning about it will be useful right now and in the future.
It even has a decent community here: https://stackoverflow.com/questions/tagged/sqlite

Okay, I have found a site which has some information on SQL and implementation - it is a bit hard to link to the page which lists all the tutorials, so I will link them one by one:
http://c2.com/cgi/wiki?CategoryPattern
http://c2.com/cgi/wiki?SliceResultVertically
http://c2.com/cgi/wiki?SqlMyopia
http://c2.com/cgi/wiki?SqlPattern
http://c2.com/cgi/wiki?StructuredQueryLanguage
http://c2.com/cgi/wiki?TemplateTables
http://c2.com/cgi/wiki?ThinkSqlAsConstraintSatisfaction

Maybe you can learn from HSQLDB. I think it offers a small and simple database for learning. You can look at the code, since it is open source.

If MySQL interests you, I would also suggest this wiki page, which has got some information about how MySQL works. Also, you might want to take a look at Understanding MySQL Internals.
You might also consider looking at a non-SQL interface for your database engine. Please take a look at Apache CouchDB. It's what you would call a document-oriented database system.
Good Luck!

I am not sure whether it fits your requirements, but I implemented a simple file-oriented database with support for simple queries (SELECT, INSERT, UPDATE) using Perl.
What I did was store each table as a file on disk, with entries following a well-defined pattern, and manipulate the data using built-in Linux tools like awk and sed. To improve efficiency, frequently accessed data was cached.
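A rough Python equivalent of that approach, as a sketch only; the file layout, delimiter, and directory name are assumptions, not what the original Perl code used:

    import csv
    from pathlib import Path

    # Each "table" is a tab-delimited file on disk, one row per line.
    DATA_DIR = Path("tables")          # hypothetical storage location
    DATA_DIR.mkdir(exist_ok=True)

    def insert(table, row):
        with open(DATA_DIR / f"{table}.tsv", "a", newline="") as f:
            csv.writer(f, delimiter="\t").writerow(row)

    def select(table, predicate=lambda row: True):
        path = DATA_DIR / f"{table}.tsv"
        if not path.exists():
            return []
        with open(path, newline="") as f:
            return [row for row in csv.reader(f, delimiter="\t") if predicate(row)]

    insert("users", ["1", "Adam", "adam@example.com"])
    print(select("users", lambda r: r[1] == "Adam"))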

Related

NoSql, Sql or Flatfile [closed]

I've just started playing around with Node.js and Socket.io and I'm planning on building a little multi-player game. Probably something simple like each player has a character that they can run around in an arena and try and kill each other.
However I'm unsure how best to store the data. I can imagine there would be some loose relationships between objects such as a character and its weapon but that I would likely load these into the system by id as and when they are required and save them back out when I no longer need them.
In these terms, would it be simpler to write 'objects' out to file instead of getting a database involved, to use a NoSQL document database, or just to stick to good old SQL Server?
My advice would be to start with NoSQL.
Flatfile is difficult because you'll want to read and write this data very, very often. One file per player is not a terrible place to start - and might be OK for the very first prototype - but you're going to be writing a huge amount. File systems are not good at this. The one benefit at prototype stage is you can debug quick - just cat out the current state of a user. Using a .json file, or similar .yaml format, will start you on your way very rapidly (and you can convert to the NoSQL approach as the prototype starts coming together).
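A minimal sketch of that one-file-per-player prototype (the directory, file names, and fields are made up):

    import json
    from pathlib import Path

    SAVE_DIR = Path("players")        # one .json file per player (prototype only)
    SAVE_DIR.mkdir(exist_ok=True)

    def save_player(player_id, state):
        # Easy to debug: `cat players/42.json` shows the current state.
        (SAVE_DIR / f"{player_id}.json").write_text(json.dumps(state, indent=2))

    def load_player(player_id):
        path = SAVE_DIR / f"{player_id}.json"
        return json.loads(path.read_text()) if path.exists() else None

    save_player(42, {"name": "Red Baron", "hp": 100, "weapon_id": 7, "x": 3, "y": 9})
    print(load_player(42)["weapon_id"])   # 7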
SQL isn't a terrible approach. If you're familiar with this, you'll end up building a real schema, and creating a variety of tables and joining the user data against them quite a bit. This can be a benefit for helping you think through your game, but I think you'll end up spending a lot of time trying to figure out how to normalize your data and writing joins. Since it seems you're unfamiliar with the problem (thus are asking the question), you're likely to do this wrong (and get in the way of gaming awesomeness) and/or just spend too much time at it.
NoSQL - using a document store model - is much like just reading and writing a user object. You'll end up re-writing your user object every time - but this kind of access (key-value, accessed by the user id) is hyper efficient. You'll probably get into a prototype really, really quickly, and to the important aspect of building out your play mechanism. Key-value access is highly scalable in the long run.
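That read-the-whole-object-by-id pattern can be mocked with Python's standard shelve module before committing to a real document store; the keys and fields below are illustrative:

    import shelve

    # A stand-in for a document store: read/write the whole user object by id.
    with shelve.open("players.db") as store:
        store["42"] = {"name": "Red Baron", "hp": 100, "weapon_id": 7}

        player = store["42"]          # key-value access by user id
        player["hp"] -= 25            # mutate the document...
        store["42"] = player          # ...and write the whole object back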
If you want to store player information, use SQL. However, if you have a connection-based system, as in something where you only need to store information while the player is connected and you don't need to "save" after the connection is lost, then just store it in memory.
Otherwise, I would say that you should stick with SQL. Databases are optimized, quick, tried, tested and true. You can't go wrong with a SQL database.

Distributed Database Solution? [closed]

Hey. I am going to be setting up a database which could get really, really huge.
I've been using standard MySQL for most of my stuff, but this particular problem will get up to the TBs and I will want to be able to do hundreds of queries a second.
So aside from designing my database schema so that it's not going to chug, and fast hard drive speeds, what is my biggest bottleneck, and what sort of solution is recommended for this?
Does it make sense to spread the database over multiple computers on my intranet so it can scale with CPU/Ram etc and if so is there software for this or database solutions for this?
Thanks for any help!
I did a search for questions to related to this and couldn't find anything so sorry if it has already been asked.
Database scalability is a VERY complicated issue; there are a LOT of issues that come into the whole process.
First, consider the lowest-hanging fruit; do you have individual tables (or columns) that are going to contain the bulk of your data? Columns which will contain BLOBs larger than 4MB each? Those can be extracted from the database and stored on a flat-file storage system, and merely referred to from the database; right there, that can take many unwieldy solutions down to a manageable level.
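A minimal sketch of that extraction, assuming SQLite and an invented blobs/ directory: the large payload lives on the file system and the table keeps only a reference to it:

    import hashlib
    import sqlite3
    from pathlib import Path

    BLOB_DIR = Path("blobs")                 # flat-file storage for large payloads
    BLOB_DIR.mkdir(exist_ok=True)

    con = sqlite3.connect("app.db")
    con.execute("CREATE TABLE IF NOT EXISTS documents "
                "(id INTEGER PRIMARY KEY, title TEXT, blob_path TEXT)")

    def store_document(title, payload: bytes):
        # Content-addressed file name; the database row only holds the reference.
        name = hashlib.sha256(payload).hexdigest()
        path = BLOB_DIR / name
        path.write_bytes(payload)
        con.execute("INSERT INTO documents (title, blob_path) VALUES (?, ?)",
                    (title, str(path)))
        con.commit()

    store_document("big video", b"\x00" * (5 * 1024 * 1024))   # 5 MB stays out of the DB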
If not, do you have deeply different usage patterns for different subgroupings of tables? If so, there's an opportunity right there for segmenting your database into different functional databases which can be partitioned onto different servers. A good example of this is read-mostly data, such as on webservers, which gets generated rarely (think user-specific home page data) but read frequently; that type of data can be segregated into a database (or, again, a flat file with references) that's separate from the rest of the user data.
Consider the transactional requirements of your database; can you isolate your transaction boundaries cleanly, or will there be deeply mingled transactions going on all through your database? If you can isolate your transaction boundaries, there's another potential useful boundary.
This is just touching on some of the issues involved with this sort of thing. One thing worth considering is whether or not you really need to have a database that is actually going to be huge, or if you're just trying to use the database as a persistence layer. If you're using the database just as a persistence layer, you might reconsider whether you actually need the relational nature of a database at all, or if you can get away with a smaller relational overlay on top of a simpler persistence layer. (I say this because a large quantity of solutions seem like they could get away with a thin relational layer over a large persistence layer; it's worth considering.)
OK, first I need to point you here. I don't think MySQL is going to perform like you want. I have a bad feeling that when I say you need to look into an Oracle installation, you're going to say, "We don't have the cash for that." But when I say get the latest/greatest SQL Server, you're going to say, "We don't have the hardware it'll take to implement that." I'm afraid that terabytes are just flat out going to crush your MySQL installation.
A new breed of NewSQL databases are being built to solve exactly the problem of distributing resources over multiple servers. The Clustrix database (which was built from the ground up to be a MySQL replacement) is one example that provides near-linear scaling -- as you run out of CPU/memory, you can simply add nodes.
Database scalability is a tough problem and you should consider solutions that can address it for you. I believe that MySQL can be used as the foundation for a solution to your problem.
Horizontal scalability, the ability to scale a database horizontally (aka scale-out), is a good technique to address the problem of very large tables and databases.

cleaning datasources [closed]

I'm project managing a development that's pulling data from all kinds of data sources (SQL, MySQL, FileMaker, Excel) before installing it into a new database structure with a record base spanning 10 years. Obviously I need to clean all this before exporting, and I am wondering if there are any apps that can simplify this process for me, or any guides that I can follow.
Any help would be great
I do this all the time and, like Tom, I do it in SQL Server using DTS or SSIS, depending on the version of the final database.
Some things I strongly recommend:
Archive all files received before you process them, especially if you are getting this data from outside sources; you may have to research old imports and go back to the raw data. After the archive is successful, copy the file to the processing location.
For large files especially, it is helpful to get some sort of flag file that is only copied after the other file is completed, or even better, which contains the number of records in the file. This can help prevent problems from corrupted or incomplete files.
Keep a log of number of records and start failing your jobs if the file size or number of records is suspect. Put in a method to process anyway if you find the change is correct. Sometimes they really did mean to cut the file in half but most of the time they didn't.
If possible get column headers in the file. You would be amazed at how often data sources change the columns, column names or order of the columns without advance warning and break imports. It is easier to check this before processing data if you have column headers.
Never import directly to a production table. Always better to use a staging table where you can check and clean data before putting it into prod.
Log each step of your process, so you can easily find what caused a failure.
If you are cleaning lots of files, consider creating functions to do specific types of cleaning (phone number formatting, for instance; see the sketch at the end of this answer); then you can use the same function in multiple imports.
Excel files are evil. Look for places where leading zeros have been stripped in the import process.
I write my processes so I can run them as a test with a rollback at the end. Much better to do this than realize your dev data is so hopelessly messed up that you can't even do a valid test to be sure everything can be moved to prod.
Never do a new import on prod without doing it on dev first. Eyeball the records directly when you are starting a new import (not all of them if it is a large file of course, but a good sampling). If you think you should get 20 columns and it imports the first time as 21 columns, look at the records in that last column, many times that means the tab delimited file had a tab somewhere in the data and the column data is off for that record.
Don't assume the data is correct, check it first. I've had first names in the last name column, phones in the zip code column etc.
Check for invalid characters, string data where there should just be numbers etc.
Any time it is possible, get the identifier from the people providing the data. Put this in a table that links to your identifier. This will save you from much duplication of records because the last name changed or the address changed.
There's lots more but this should get you started on thinking about building processes to protect your company's data by not importing bad stuff.
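As an example of those reusable cleaning functions, here is a sketch of a phone-number normalizer that could run against staging data before anything reaches production; the formatting rules and default area code are assumptions:

    import re

    def clean_phone(raw, default_area_code=None):
        """Normalize US-style phone numbers to XXX-XXX-XXXX; return None if unusable."""
        digits = re.sub(r"\D", "", raw or "")
        if len(digits) == 11 and digits.startswith("1"):
            digits = digits[1:]                      # drop leading country code
        if len(digits) == 7 and default_area_code:
            digits = default_area_code + digits
        if len(digits) != 10:
            return None                              # flag for manual review
        return f"{digits[0:3]}-{digits[3:6]}-{digits[6:]}"

    # The same function can be reused in every import that touches the staging table.
    for raw in ["(555) 123-4567", "1-555-123-4567", "123 4567", "bad data"]:
        print(raw, "->", clean_phone(raw, default_area_code="555"))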
I work mostly with Microsoft SQL Server, so that's where my expertise is, but SSIS can connect to a pretty big variety of data sources and is very good for ETL work. You can use it even if none of your data sources are actually MS SQL Server. That said, if you're not using MS SQL Server there is probably something out there that's better for this.
To provide a really good answer one would need to have a complete list of your data sources and destination(s) as well as any special tasks which you might need to complete along with any requirements for running the conversion (is it a one-time deal or do you need to be able to schedule it?)
Not sure about tools, but you're going to have to deal with:
synchronizing generated keys
synchronizing/normalizing data formats (e.g. different date formats; see the sketch after this list)
synchronizing record structures.
orphan records
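For the date-format point in that list, a rough sketch of normalizing mixed source formats to ISO dates (the list of accepted formats is an assumption and would grow with your sources):

    from datetime import datetime

    # Candidate formats seen across sources; order matters for ambiguous
    # dates (here dd/mm/yyyy is tried before mm/dd/yyyy).
    KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d-%b-%Y", "%Y%m%d"]

    def normalize_date(raw):
        """Return an ISO yyyy-mm-dd string, or None so the record can be flagged."""
        for fmt in KNOWN_FORMATS:
            try:
                return datetime.strptime(raw.strip(), fmt).date().isoformat()
            except ValueError:
                continue
        return None

    print(normalize_date("03-Jan-2010"))   # 2010-01-03
    print(normalize_date("20100103"))      # 2010-01-03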
If the data is running/being updated while you're developing this process or moving data you're also going to need to capture the updates. When I've had to do this sort of thing in the past the best, not so great answer I had was to develop a set of scripts that ran in multiple iterations, so that I could develop and test the process iteratively before I moved any of the data. I found it helpful to have a script (I used a schema and an ant script, but it could be anything) that could clean/rebuild the destination database. It's also likely that you'll need to have some way of recording dirty/mismatched data.
In similar situations I personally have found Emacs and Python mighty useful but, I guess, any text editor with good searching capabilities and a language with powerful string manipulation features should do the job. I first convert the data into flat text files and then
Eyeball either the whole data set or a representative true random sample of the data.
Based on that, make conjectures about different columns ("doesn't allow nulls", "contains only values 'Y' and 'N'", "'start date' always precedes 'end date'", etc.).
Write scripts to check the conjectures.
Obviously this kind of method tends to focus on one table at a time and therefore only complements the checks made after uploading the data into a relational database.
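A minimal example of step 3, assuming a tab-delimited export with a header row; the file and column names are hypothetical, and the date check assumes ISO-formatted date strings:

    import csv

    # Check a few conjectures about a tab-delimited export before loading it.
    violations = []
    with open("customers.tsv", newline="") as f:
        for line_no, row in enumerate(csv.DictReader(f, delimiter="\t"), start=2):
            if not row["email"]:                                  # "doesn't allow nulls"
                violations.append((line_no, "empty email"))
            if row["active"] not in ("Y", "N"):                   # "only 'Y' and 'N'"
                violations.append((line_no, f"bad flag {row['active']!r}"))
            if row["start_date"] > row["end_date"]:               # ISO dates compare as strings
                violations.append((line_no, "start_date after end_date"))

    for line_no, problem in violations:
        print(f"line {line_no}: {problem}")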
One trick that comes in useful for me with this is to find a way for each type of data source to output a single column plus a unique identifier at a time, in tab-delimited form say, so that you can clean it up using text tools (sed, awk, or TextMate's grep search), and then re-import it / update the (copy of!) original source.
It then becomes much quicker to clean up multiple sources, as you can re-use tools across them (e.g. capitalising last names: McKay, O'Leary, O'Neil, Da Silva, Von Braun, etc.; fixing date formats; trimming whitespace) and to some extent automate the process (depending on the source).
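A rough sketch of that column-plus-identifier round trip; the file name, field names, and the cleanup step in between are assumptions about workflow, not a prescribed tool:

    import csv

    # 1. Export one column plus its unique identifier as tab-delimited text,
    #    clean that file with whatever text tool you like (sed, awk, an editor),
    #    then 2. read the cleaned pairs back and update a copy of the source.

    def export_column(rows, id_field, column, out_path):
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f, delimiter="\t")
            for row in rows:
                writer.writerow([row[id_field], row[column]])

    def reimport_column(rows, id_field, column, in_path):
        with open(in_path, newline="") as f:
            cleaned = {key: value for key, value in csv.reader(f, delimiter="\t")}
        for row in rows:
            row[column] = cleaned.get(row[id_field], row[column])

    rows = [{"id": "1", "last_name": "mckay"}, {"id": "2", "last_name": "O'LEARY"}]
    export_column(rows, "id", "last_name", "last_name.tsv")
    # ... clean last_name.tsv with your text tool of choice ...
    reimport_column(rows, "id", "last_name", "last_name.tsv")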

Tools for Generating Mock Data? [closed]

I'm looking for recommendations of a good, free tool for generating sample data for the purpose of loading into test databases. By analogy, something that produces "lorem ipsum" text for any RDBMS. Features I'm looking for include:
Flexibility to generate data for an existing table definition.
Ability to generate small and large data sets (> 1 million rows or more).
Generate in SQL script format (INSERT statements) or else in a flat file format suitable for bulk import (which is usually faster).
A command-line interface for easy scripting.
Extensible, open source, written in a dynamic language (these are nice-to-haves, not strong requirements).
PS: I did search for a duplicate question on StackOverflow, but I didn't find one. If there is one, I'll be grateful to get a pointer to it.
Thanks for the great responses everyone! I should amend my requirements that I use Mac OS X as my primary development environment, not Windows (though I did say command-line interface is desirable, and that practically rules out Windows). The Windows-specific suggestions will no doubt be useful to other readers of this question, though, so thanks.
Here is my conclusion:
GenerateData:
PHP web app interface, not command line
limited to generating 200 records (or pay $20 for a license to generate 5,000 records)
RedGate SQL Data Generator
not free, price $295
requires Windows, .NET, SQL Server
Visual Studio 2008 Database Edition
requires Windows
requires costly MSDN or ISV subscription
Banner Datatect
not free, price $595
requires Windows (?)
no support for MySQL (?)
GUI, not command line or scriptable
Ruby Faker gem
way too slow to use ActiveRecord for bulk data load
Super Smack
chiefly a load-testing tool, with a random data generator built in
pretty simple to use nevertheless
overall a good runner-up tool
Databene Benerator
best solution for my needs
XML scripts, compatible with DbUnit
open source (GPL) Java code
command-line usage
access many databases directly via JDBC
Take a look at databene benerator, a test data generator that looks close to your requirements.
it can generate data for an existing table definition (or even anonymize production data)
it can generate large data sets (unlimited size)
it supports various input formats (CSV, flat files, DbUnit) and output formats (CSV, flat files, DbUnit, XML, Excel, scripts)
it can be used on the command line or through a maven plugin
it's open source and customizable
I would give it a try.
BTW, a list of similar products is available on databene benerator's web site.
This looks quite promising: generatedata.com. Open-source, has lots of built-in data types.
There are several others listed here: Test (Sample) Data Generators. I don't have experience with any of them, but a few on that list look like they could be pretty decent.
Try http://www.mockaroo.com
This is a tool my company made to help test our own applications. We've made it free for anyone to use. It's basically the Forgery ruby gem with a web app wrapped around it. You can generate data in CSV, txt, or SQL formats. Hope this helps.
I know you said you were looking for a free tool, but this is one case where I would suggest that spending $295 will pay you back quickly in time saved. I've been using the RedGate tool SQL Data Generator for the last year and it is, to be short, an awesome tool. It allows for setting dependencies between columns, generates realistic data for business objects such as phone numbers, urls, names, etc. I can honestly state that this tool has paid for itself time and time again.
If you are looking or willing to use something MySQL-specific, you could take a look at Super Smack. It is currently maintained by Tony Bourke.
Super Smack allows you to generate random data to insert into your database tables. It is customizable, allowing you to use the packaged words.dat file, or any test data of your choice.
One of the nice things about it is that it is command-line based and highly customizable. There are some fairly decent examples of usage in the book High Performance MySQL, which is also excerpted here.
Not sure if that is along the lines of what you are looking for, but just a thought.
A Ruby script with one of the available fake data generators should do you just fine.
http://faker.rubyforge.org/ is one such gem. Unfortunately, this doesn't fulfill all your requirements.
Here is another: http://random-data.rubyforge.org/
And a tutorial for using Faker: http://www.rubyandhow.com/how-to-generate-fake-names-addresses-in-ruby/
RE: Flexibility to generate data for an existing table definition. Combine the Faker gem with one of the available ORMs. ActiveRecord would probably be easiest.
Normally very costly, but if you are a small ISV you can get Visual Studio 2008 Database Edition very cheaply; see the Empower and BizSpark promotions. It provides a lot more functionality than just generating test data (integration with SCC, unit testing, DB refactoring, etc.).
As I like the fact that Red Gate tools are so easy to learn, I would still look at SQL Data Generator.
A tool that really should not be missing from the list is the Data Generator from Datanamic, which populates databases directly or generates insert scripts, has a large collection of pre-installed generators, and supports multiple databases...
http://www.datanamic.com/datagenerator/index.html
I know you're not looking for actual lorem ipsum text; but in case anyone else searches for an actual lorem ipsum generator and finds this thread: lipsum.com does a great job of it.
Not free, but Visual Studio 2008 Database Edition is a good alternative and it provides a lot more functionality (Integration with SCC, Unit Testing, DB Refactoring, etc...)
I use a tool called Datatect:
Generates data to flat files or any ODBC compliant database.
Extensible via VBScript.
Referentially aware; will populate foreign keys with values from parent table.
Data is context aware; city, state and phone numbers for given zip codes, first names and titles with gender.
Can create custom, complex data types.
Generate over 2 billion proper names, business names, street addresses, cities, states, and zip codes.
I've used this tool to generate as many as 40,000,000 rows of data to a SQLServer database, and 8,000,000 rows of data to an Oracle database.
I am in no way affiliated with Banner Systems, just a satisfied customer.
Here is the list of such tools (both free and commercial):
http://c2.com/cgi/wiki?TestDataGenerator
For OS X there is Data Creator (US $7). The download is free for test purposes, so you can use it to evaluate the software and its features.
It requires OS X Lion or later. It can generate a lot of different field types and has a custom export mode plus some presets (TSV, CSV, HTML table, web page with a table inside).
http://www.tensionsoftware.com/osx/datacreator/
here at the App Store:
https://itunes.apple.com/us/app/data-creator/id491686136?mt=12
You can use DbSchema (www.dbschema.com); it's a database management tool and it has a random data generator to populate your database.
Not a direct answer to your question, but this can be helpful for certain kinds of data:
Fake Name Generator can be useful - http://www.fakenamegenerator.com/ - not for everything, but for user accounts or stuff like that. AFAIK they provide support for bulk orders.
+1 for Benerator: I tried 3 or 4 of the other tools on offer (including dbmonster) but found Benerator to be very quick, to deliver realistic data and to be flexible. I also got very quick & helpful feedback from the tool's creator when I posted on the forum.

A beginner's guide to SQL database design [closed]

Do you know a good source to learn how to design SQL solutions?
Beyond the basic language syntax, I'm looking for something to help me understand:
What tables to build and how to link them
How to design for different scales (small client APP to a huge distributed website)
How to write effective / efficient / elegant SQL queries
I started with this book: Relational Database Design Clearly Explained (The Morgan Kaufmann Series in Data Management Systems) (Paperback) by Jan L. Harrington and found it very clear and helpful
and as you get up to speed this one was good too Database Systems: A Practical Approach to Design, Implementation and Management (International Computer Science Series) (Paperback)
I think SQL and database design are different (but complementary) skills.
I started out with this article
http://en.tekstenuitleg.net/articles/software/database-design-tutorial/intro.html
It's pretty concise compared to reading an entire book and it explains the basics of database design (normalization, types of relationships) very well.
Experience counts for a lot, but in terms of table design you can learn a lot from how ORMs like Hibernate and Grails operate to see why they do things. In addition:
Keep different types of data separate - don't store addresses in your order table, link to an address in a separate addresses table, for example.
I personally like having an integer or long surrogate key on each table (that holds data, not those that link different tables together, e.g., m:n relationships) as the primary key.
I also like having a created and modified timestamp column.
Ensure that every column that you do "where column = val" in any query has an index. Maybe not the most perfect index in the world for the data type, but at least an index.
Set up your foreign keys. Also set up ON DELETE and ON UPDATE rules where relevant, to either cascade or set null, depending on your object structure (so you only need to delete once at the 'head' of your object tree, and all of that object's sub-objects get removed automatically); see the sketch at the end of this answer.
If you want to modularise your code, you might want to modularise your DB schema - e.g., this is the "customers" area, this is the "orders" area, and this is the "products" area, and use join/link tables between them, even if they're 1:n relations, and maybe duplicate the important information (i.e., duplicate the product name, code, price into your order_details table). Read up on normalisation.
Someone else will recommend exactly the opposite for some or all of the above :p - never one true way to do some things eh!
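Putting a few of those points together, a hedged sketch using SQLite from Python (surrogate keys, created/modified timestamps, a separate addresses table, a foreign key with an ON DELETE rule, and an index on a frequently filtered column; the table names are invented and any other engine's syntax will differ slightly):

    import sqlite3

    con = sqlite3.connect("shop.db")
    con.execute("PRAGMA foreign_keys = ON")          # SQLite enforces FKs per connection

    con.executescript("""
    CREATE TABLE IF NOT EXISTS addresses (
        id          INTEGER PRIMARY KEY,             -- surrogate key
        street      TEXT NOT NULL,
        city        TEXT NOT NULL,
        created_at  TEXT DEFAULT CURRENT_TIMESTAMP,
        modified_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    CREATE TABLE IF NOT EXISTS customers (
        id          INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        address_id  INTEGER REFERENCES addresses(id) ON DELETE SET NULL,
        created_at  TEXT DEFAULT CURRENT_TIMESTAMP,
        modified_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    -- Index every column used in a WHERE clause, e.g. lookups by name.
    CREATE INDEX IF NOT EXISTS idx_customers_name ON customers(name);
    """)
    con.commit()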
I really liked this article..
http://www.codeproject.com/Articles/359654/important-database-designing-rules-which-I-fo
Head First SQL is a great introduction.
These are questions which, in my opinion, require different knowledge from different domains.
You just can't know in advance "which" tables to build; you have to know the problem you have to solve and design the schema accordingly.
This is a mix of database design decisions and your database vendor's custom capabilities (i.e. you should check the documentation of your (R)DBMS and eventually learn some "tips & tricks" for scaling); the configuration of your DBMS is also crucial for scaling (replication, data partitioning and so on).
Again, almost every RDBMS comes with a particular "dialect" of the SQL language, so if you want efficient queries you have to learn that particular dialect. (By the way, writing elegant queries that are also efficient is quite probably a big deal: elegance and efficiency are frequently conflicting goals.)
That said, maybe you want to read some books. Personally, I used this book in my database university course (and found it a decent one, but I've not read other books in this field), so my advice is to check out some good books on database design.
It's been a while since I read it (so, I'm not sure how much of it is still relevant), but my recollection is that Joe Celko's SQL for Smarties book provides a lot of info on writing elegant, effective, and efficient queries.