Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
Warning: This is a not very serious question/discussion that I am posting... but I am willing to bet that most developers have pondered this "issue"...
Always wanted to get other opinions regarding naming conventions for methods that went and got data from somewhere and returned it...
Most method names are somewhat simple and obvious... SaveEmployee(), DeleteOrder(), UploadDocument(). Of course, with classes, you would most likely use the short form...Save(), Delete(), Upload() respectively.
However, I have always struggled with initial action...how to get the data. It seems that for every project I end up jumping between different naming conventions because I am never quite happy with the last one I used. As far as I can tell these are the possibilities -->
GetBooks()
FetchBooks()
RetrieveBooks()
FindBooks()
LoadBooks()
What is your thought?
It is all about consistent semantics;
In your question title you use getting data. This is extremely
general in a sense that you need to define what getting means
semantically significantly unambiguous way. I offer the follow
examples to hopefully put you on the right track when thinking about
naming things.
getBooks() is when you are getting
all the books associated with an
object, it implies the criteria for the set is
already defined and where they are coming from is a hidden detail.
findBooks(criteria)
is when are trying to find a sub-set
of the books based on parameters to
the method call, this will usually
be overloaded with different search
criteria
loadBooks(source) is when you are
loading from an external source,
like a file or db.
I would not use
fetch/retrieve because they are too vague and get conflated with get and there is no unambiguous semantic associated with the terms.
Example: fetch implies that some entity needs to go and get something that is remote and bring it back. Dogs fetch a stick, and retrieve is a synonym for fetch with the added semantic that you may have had possession of the thing prior as well. get is a synonym for obtain as well which implies that you have sole possession of something and no one else can acquire it simultaneously.
Semantics are extremely important:
the branch of linguistics and logic concerned with meaning
The comments are proof that generic terms like get and fetch have
no specific semantic and are interpreted differently by different
people. Pick a semantic for a term, document what it is intended to
imply if the semantic is not clear and be consistent with its use.
words with vague or ambigious meanings are given different semantics by different people because of their predjudices and preconceptions based on their personal opinions and that will never end well.
Honestly you should just decide with your team which naming convention to use. But for fun, lets see what your train of thought would be to decide on any of these:
GetBooks()
This method belongs to a data source, and we don't care how it is obtaining them, we just want to Get them from the data source.
FetchBooks()
You treat your data source like a bloodhound, and it is his job to fetch your books. I guess you should decide on your own how many he can fit in his mouth at once.
FindBooks()
Your data source is a librarian and will use the Dewey Decimal system to find your books.
LoadBooks()
These books belong in some sort of "electronic book bag" and must be loaded into it. Be sure to call ZipClosed() after loading to prevent losing them.
RetrieveBooks()
I have nothing.
The answer is just stick to what you are comfortable with and be consistant.
If you have a barnes and nobles website and you use GetBooks(), then if you have another item like a Movie entity use GetMovies(). So whatever you and your team likes and be consistant.
It is not clear by what you mean for "getting the data". From the database? A file? Memory?
My view about method naming is that its role is to eliminate any ambiguities and ideally a need to look up documentation. I believe that this should be done even at the cost of longer method names. According to studies, most intermediate+ developers are able to read multiple words in camel case. With IDE and auto completions, writing long method names is also not a problem.
Thus, when I see "fetchBooks", unless the context is very clear (e.g., a class named BookFetcherFromDatabase), it is ambiguous. Fetch it from where? What is the difference between fetch and find? You're also risking the problem that some developers will associate semantics with certain keywords. For example, fetch for database (or memory) vs. load (from file) or download (from web).
I would rather see something like "fetchBooksFromDatabase", "loadBookFromFile", "findBooksInCollection", etc. It is less sightly, but once you get over the length, it is clear. Everyone reading this would right away get what it is that you are trying to do.
In OO (C++/Java) I tend to use getSomething and setSomething because very often if not always I am either getting a private attribute from the class representing that data object or setting it - the getter/setter pair. As a plus, Eclipse generates them for you.
I tend to use Load only when I mean files - as in "load into memory" and that usually implies loading into primitives, structs (C) or objects. I use send/receive for web.
As said above, consistency is everything and that includes cross-developers.
Related
The main examples of the words I mean are "object", "value" etc. In many (well, not really, but the chances are on some occasions at least) cases you may happen to find yourself willing to name a variable etc. of yours this way.
Another example I have stumbled upon in my practice is "try" which represents both the keyword (in many C-like and other languages) used in exception handling and the currency of Turkey. But this is an example just for fun, I doubt there are any common practices known for this particular case (though I feel like there may be for the previous).
What do people do in such cases? What are some synonyms for an object, a value etc reasonable in the programming and data modelling context?
For example imagine you are developing an object database, manipulating objects, properties and values (rather than documents, fields and... eh... values) is, for some reason, among the key ideas of its philosophy and you really don't want to use words too distinct from these semantically. What words would you use to replace the reserved ones while keeping the sense very close to that of theirs?
The easiest solution to come into my mind so far it to use misspelled (or spelled in a different language orthography) varieties of the same words like "objekt", "walue" etc. but although this can do the job this just disgusts me so much I really don't want to accept going this way ever.
UPDATE: Indeed, in some specific cases (particular languages) using a different case (which, some times, may go against the case aspect of the commynity and/or the company naming convention by the way) and/or namespaces (which have been introduced almost exactly for this) may solve the problem at least partially but I am still interested in alternatives as I believe actually duplicating a system keyword is a thing one should at least think about avoiding (might there be a way to do it easily without accepting compromises considered too serious) in every case.
I am even considering writing script that would scrape through GitHub to analyse the common code elements naming vocabulary but I think it is always a good idea to ask first rather than to "reinvent a bicycle", perhaps somebody has done something like this already.
UPDATE2: Please do me a favour and consider the following with applicable degree of objectivity before voting to close. With all do respect I would like to emphasise that the actual degree of subjectivity of this question is excusably low (though, I admit, somewhat above zero anyway). The only real flaw of it is that it might perhaps fit the English site better but I believe the audience of StackOverflow is much more relevant (generally informed in a much more relevant way) to the context. The actual goal of publishing this question is to highlight a problem that is fairly easy to understand clearly enough and which can not be denied of existence (though its importance may be questionable so far) but is spoken of too little (as importance of code clarity and semantic relevance is increasing, IMHO, code as a media is quickly moving towards obtaining bigger cultural (in the broad meaning of this word) importance than of books). And to let people share the ways of addressing it in practice they know of.
Capitalization: Often, a different capitalization instead of a synonym does the trick, as most language keywords are case-sensitive. E.g. object = new Object();
Prefix / Postfix: Another often encountered solution is to write myObject = new Object(). Which one you chose really depends on the naming conventions you follow. For private class fields, some developers use an underscore, e.g. this._object indicating a private access modifier.
Specification: In most cases however, you can find a more specific word describing the role - such as instance, parent, child or argument - or the subclass - such as integer or n instead of a generic number datatype - of your object.
In addition to the above, many language communities follow de-facto conventions such as cls for Class, obj for Object, me or self for this etc.
Background
I have 2 resources: courses and professors.
A course has the following attributes:
id
topic
semester_id
year
section
professor_id
A professor has the the following attributes:
id
faculty
super_user
first_name
last_name
So, you can say that a course has one professor and a professor may have many courses.
If I want to get all courses or all professors I can: GET /api/courses or GET /api/professors respectively.
Quandary
My quandary comes when I want to get all courses that a certain professor teaches.
I could use either of the following:
GET /api/professors/:prof_id/courses
GET /api/courses?professor_id=:prof_id
I'm not sure which to use though.
Current solution
Currently, I'm using an augmented form of the latter. My reasoning is that it is more scale-able if I want to add in filtering/sorting criteria.
I'm actually encoding/embedding JSON strings into the query parameters. So, a (decoded) example might be:
GET /api/courses?where={professor_id: "teacher45", year: 2016}&order={attr: "topic", sort: "asc"}
The request above would retrieve all courses that were (or are currently being) taught by the professor with the provided professor_id in the year 2016, sorted according to topic title in ascending ASCII order.
I've never seen anyone do it this way though, so I wonder if I'm doing something stupid.
Closing Questions
Is there a standard practice for using the query string vs the resource path for filtering criteria? What have some larger API's done in the past? Is it acceptable, or encouraged to use use both paradigms at the same time (make both endpoints available)? If I should indeed be using the second paradigm, is there a better organization method I could use besides encoding JSON? Has anyone seen another public API using JSON in their query strings?
Edited to be less opinion based. (See comments)
As already explained in a previous comment, REST doesn't care much about the actual form of the link that identifies a unique resource unless either the RESTful constraints or the hypertext transfer protocol (HTTP) itself is violated.
Regarding the use of query or path (or even matrix) parameters is completely up to you. There is no fixed rule when to use what but just individual preferences.
I like to use query parameters especially when the value is optional and not required as plenty of frameworks like JAX-RS i.e. allow to define default values therefore. Query parameters are often said to avoid caching of responses which however is more an urban legend then the truth, though certain implementations might still omit responses from being cached for an URI containing query strings.
If the parameter defines something like a specific flavor property (i.e. car color) I prefer to put them into a matrix parameter. They can also appear within the middle of the URI i.e. /api/professors;hair=grey/courses could return all cources which are held by professors whose hair color is grey.
Path parameters are compulsory arguments that the application requires to fulfill the request in my sense of understanding otherwise the respective method handler will not be invoked on the service side in first place. Usually this are some resource identifiers like table-row IDs ore UUIDs assigned to a specific entity.
In regards to depicting relationships I usually start with the 1 part of a 1:n relationship. If I face a m:n relationship, like in your case with professors - cources, I usually start with the entity that may exist without the other more easily. A professor is still a professor even though he does not hold any lectures (in a specific term). As a course wont be a course if no professor is available I'd put professors before cources, though in regards to REST cources are fine top-level resources nonetheless.
I therefore would change your query
GET /api/courses?where={professor_id: "teacher45", year: 2016}&order={attr: "topic", sort: "asc"}
to something like:
GET /api/professors/teacher45/courses;year=2016?sort=asc&onField=topic
I changed the semantics of your fields slightly as the year property is probably better suited on the courses rather then the professors resource as the professor is already reduced to a single resource via the professors id. The courses however should be limited to only include those that where held in 2016. As the sorting is rather optional and may have a default value specified, this is a perfect candidate for me to put into the query parameter section. The field to sort on is related to the sorting itself and therefore also belongs to the query parameters. I've put the year into a matrix parameter as this is a certain property of the course itself, like the color of a car or the year the car was manufactured.
But as already explained previously, this is rather opinionated and may not match with your or an other folks perspective.
I could use either of the following:
GET /api/professors/:prof_id/courses
GET /api/courses?professor_id=:prof_id
You could. Here are some things to consider:
Machines (in particular, REST clients) should be treating the URI as an opaque thing; about the closest they ever come to considering its value is during resolution.
But human beings, staring that a log of HTTP traffic, do not treat the URI opaquely -- we are actually trying to figure out the context of what is going on. Staying out of the way of the poor bastard that is trying to track down a bug is a good property for a URI Design to have.
It's also a useful property for your URI design to be guessable. A URI designed from a few simple consistent principles will be a lot easier to work with than one which is arbitrary.
There is a great overview of path segment vs query over at Programmers
https://softwareengineering.stackexchange.com/questions/270898/designing-a-rest-api-by-uri-vs-query-string/285724#285724
Of course, if you have two different URI, that both "follow the rules", then the rules aren't much help in making a choice.
Supporting multiple identifiers is a valid option. It's completely reasonable that there can be more than one way to obtain a specific representation. For instance, these resources
/questions/38470258/answers/first
/questions/38470258/answers/accepted
/questions/38470258/answers/top
could all return representations of the same "answer".
On the /other hand, choice adds complexity. It may or may not be a good idea to offer your clients more than one way to do a thing. "Don't make me think!"
On the /other/other hand, an api with a bunch of "general" principles that carry with them a bunch of arbitrary exceptions is not nearly as easy to use as one with consistent principles and some duplication (citation needed).
The notion of a "canonical" URI, which is important in SEO, has an analog in the API world. Mark Seemann has an article about self links that covers the basics.
You may also want to consider which methods a resource supports, and whether or not the design suggests those affordances. For example, POST to modify a collection is a commonly understood idiom. So if your URI looks like a collection
POST /api/professors/:prof_id/courses
Then clients are more likely to make the associate between the resource and its supported methods.
POST /api/courses?professor_id=:prof_id
There's nothing "wrong" with this, but it isn't nearly so common a convention.
GET /api/courses?where={professor_id: "teacher45", year: 2016}&order={attr: "topic", sort: "asc"}
I've never seen anyone do it this way though, so I wonder if I'm doing something stupid.
I haven't either, but syntactically it looks a little bit like GraphQL. I don't see any reason why you couldn't represent a query that way. It would make more sense to me as a single query description, rather than breaking it into multiple parts. And of course it would need to be URL encoded, etc.
But I would not want to crazy with that right unless you really need to give to your clients that sort of flexibility. There are simpler designs (see Roman's answer)
I'm building an API and I have a question about how to represent objects.
Imagine we have a system with Articles that have a bunch of properties. Some of these properties are complex, for example the Author of the Article refers to another object. We have an URL to fetch all the articles in the system, and another URL to fetch a particular Article.
My first approach to implement this would be to create two representations of the same object Article, because when you request all the articles, it makes sense that you don't retrieve all the information about the Articles, but for example just the title, the date and the name of the author (instead of the whole Author object), excluding other properties like tags, or the content. The idea beneath this is to try to make the response of all the Articles a little bit lighter.
Now I'm going to the client side, and I decide to implement a SDK for Android, for example. So the first step would be to create the objects to store the information that I retrieve from the API. Now a problem pops up, because I want to define the Article object, but I would need two versions of it and it's not only more difficult to implement, but it's going to be more difficult to use.
So my question is, when defining an API, is it a good practice to have multiple versions of the same object (maybe a light one, and a full one) to save some bandwidth when sending the result of a request but generating a more difficult to use service, or it's not worth it and you should retrieve always the same version of the object, generating heavier responses but making the service easier to use?
I work at a company that deals with Articles as well and we also have a REST API to expose the data.
I think you're on the right track, but I'll even take it one step further. These are the potential three calls for large entities in an API:
Index. For the articles, this would be something like /articles. It just returns a list of article ids. You can add parameters to filter, sort, etc. It's very lightweight and I've found it to be very useful.
Header/Mini/Light version. These are only the crucial fields that you think will meet the widest variety of use cases. For us, we have a lot of use cases where we might want to display the top 5 articles, and in those cases, only title, author and maybe publication date. Those fields belong in a "header" article, or a "light" article. This is especially useful for AJAX calls as you don't want to return the entire article (for us the object is quite large.)
Full version. This is the full article. All the text/paragraphs/image references - everything. It's a heavy call to make, but you will be guaranteed to get whatever is available.
Then it just takes discipline to leave the objects the way they are. Ideally users are able to get the version described in (2) to save time over the wire, but if they have to, they go with (3).
I've considered having a dynamic way to return only fields people are interested in, but it would be a lot of implementation. Basically the idea was to let the user go to /article and then show them a sample JSON result. Then the user could click on the fields they wanted returned and get a token. Then they'd pass the token as a parameter to the API and the API would then know which fields to return.
Creates a dynamic schema. Lots of work and I never got around to it, but you can see that if you want to be creative, you can.
Consider whether your data (for one API client) is changing a lot or not. If it's possible to cache data on the client, that'll improve performance by not contacting the API as much. Otherwise I think it's a good idea to have a light-weight and full-scale object type (or more like two views of the same object type).
In the client you should implement it as one object type (to keep it DRY; Don't Repeat Yourself) with all the properties. When fetching a light-weight object, you only store a few of the properties, the rest being null (or similar “undefined” value for the given property type). It should be possible to determine whether all or only a partial subset of the properties are loaded.
When making API requests in the client on a given model (ie. authors) you should be explicit about whether the light-weight or full-scale object is needed and whether cached data is acceptable. This makes it possible to control the data in the UI layer. For example a list of authors might only need to display a name and a number of articles connected with that author. When displaying the author screen, more properties are needed. Also, if using cached data, you should provide a way for the user to refresh it.
When the app works you can start to implement optimizations like: Don't fetch light-weight data if full-scala data is already known & Don't fetch data at all if a recent cache copy exists. I think the best is to look at the actual use cases and improve performance with the highest value for the user.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
How can I train myself to give better variable and function names (any user-defined name in a program).
Practice.
Always be thinking about it whenever you write or read code, including other people's code. Try to figure out what you would do differently in their code and talk to them about it, when possible (but don't harp on it, that would be a nuisance). Ask them questions about why they picked a certain name.
See how well your code reads without comments. If you ever need to comment on the basic purpose of something you named, consider whether it could have a better name.
The biggest thing is active mental participation: practice.
Thinking of names seems to be something that some people are extraordinarily bad at, and I'm not sure what the cure is. Back when I was an instructor working in commercial training, I'd often see situations like this:
Me: OK, now you need to create an integer variable to contain the value returned by getchar().
[Trainees start typing, and I wander round the training room. Most are doing fine, but one is sitting like a deer frozen the headlights]
Me: What's the problem?
Him: I can't think of a name for the variable!
So, I'd give them a name for it, but I have a feeling that people with this problem are not going to go far in programming. Or perhaps the problem is they go too far...
This is a subjective question.
I strive to make my code align with the libraries (or, at the least the standard ones) so that the code has an uniformity. I'd suggest: See how the standard library functions are named. Look for patterns. Learn what different naming conventions exist. See which one makes the most sense. E.g: most Java code uses really big identifier names, Camel casing etc. C uses terse/short names. There is then the Hungarian notation -- which was worth the trouble when editors weren't smart enough to provide you with type information. Nowadays, you probably don't need it.
Joel Spolsky wrote a helpful article on Hungarian notation a few years back. His key insight is this:
Let’s try to come up with a coding
convention that will ensure that if
you ever make this mistake, the code
will just look wrong. If wrong code,
at least, looks wrong, then it has a
fighting chance of getting caught by
someone working on that code or
reviewing that code.
He goes on to show how naming variables in a rigourous fashion can improve our code. The point being, that avoiding bugs has a quicker and more obvious ROI than making our code more "maintainable".
Read some good code and imitate it. That's the universal way of learning anything; just replace "read" and "code" with appropriate words.
A good way to find expressive names is starting with a textual description what your piece of software actually does.
Good candidates for function (method) names are verbs, for classes nouns.
If you do design first, one method is textual analysis.
(Even if you are only a team of 1) agree a coding standard with your colleagues so you all use the same conventions for naming. (e.g. It is common to use properties for values that are returned quickly, but a GetXXX or CalculateXXX method for values that will take time to calculate. This convention gives the caller a much better idea about whether they need to cache the results etc). Try to use the same names for equivalent concepts (e.g. Don't mix Array.Count and List.Length as Microsoft did in .net!)
Try to read your code as if somebody else wrote it (i.e. forget everything you know and just read it). Does it make sense? Does it explain everything they need to know to understand it? (Probably not, because we all forget to describe the stuff we "know" or which is "obvious". Go back and clarify the naming and documentation so that someone else can pick up your code file and easily understand it)
Keep names short but descriptive. There is no point writing a whole sentence, but with auto-completion in most IDEs, there is also little point in abbreviating anything unless it's a very standard abbreviation.
Don't waste characters telling somebody that this string is a string (the common interpretation of hungarian notation). Use names that describe what something does, and how it is used. e.g. I use prefixes to indicate the usage (m=member, i=iterator/index, p=pointer, v=volatile, s=static, etc). This is important information when accessing the variable so it's a useful addition to the name. It also allows you to copy a line of code into an email and the receiver can understand exactly what all the variables are meant to do - the difference in usage between a static volatile and a parameter is usually very important.
Describe the contents of a variable or the purpose of a method in its name, avoiding technical terms unless you know that all the readers of your code will know what those terms mean. Use the simplest description you can think of - complex words and technical terms sound intelligent and impressive, but are much more open to misinterpretation (e.g. off the top of my head: Collation or SortOrder, Serialise or Save - though these are well known words in programming so are not very good cases).
Avoid vague and near-meaningless terms like "value", "type". This is especially true of base class properties, because you end up with a "Type" in a derived class and you have no idea what kind if type it is. Use "JoystickType" or "VehicleType" and the meaning is immediately much clearer.
If you use a value with units, tell people what they are in the name (angleDegrees rather than angle). This simple trick will stop your spacecraft smashing into Mars.
For C#, C++, C in Visual Studio, try using AtomineerUtils to add documentation comments to methods, classes etc. This tool derives automatic documentation from your names, so the better your names are, the better the documentation is and the less effort you need to put in the finish the documentation off.
Read "Code Complete" book, more specifically Chapter 11 about Naming. This is the checklist (from here, free registration required):
General Naming Considerations
Does the name fully and accurately describe what the variable represents?
Does the name refer to the real-world problem rather than to the programming-language solution?
Is the name long enough that you don't have to puzzle it out?
Are computed-value qualifiers, if any, at the end of the name?
Does the name use Count or Index instead of Num?
Naming Specific Kinds Of Data
Are loop index names meaningful (something other than i, j, or k if the loop is more than one or two lines long or is nested)?
Have all "temporary" variables been renamed to something more meaningful?
Are boolean variables named so that their meanings when they're True are clear?
Do enumerated-type names include a prefix or suffix that indicates the category-for example, Color_ for Color_Red, Color_Green, Color_Blue, and so on?
Are named constants named for the abstract entities they represent rather than the numbers they refer to?
Naming Conventions
Does the convention distinguish among local, class, and global data?
Does the convention distinguish among type names, named constants, enumerated types, and variables?
Does the convention identify input-only parameters to routines in languages that don't enforce them?
Is the convention as compatible as possible with standard conventions for the language?
Are names formatted for readability?
Short Names
Does the code use long names (unless it's necessary to use short ones)?
Does the code avoid abbreviations that save only one character?
Are all words abbreviated consistently?
Are the names pronounceable?
Are names that could be mispronounced avoided?
Are short names documented in translation tables?
Common Naming Problems: Have You Avoided...
...names that are misleading?
...names with similar meanings?
...names that are different by only one or two characters?
...names that sound similar?
...names that use numerals?
...names intentionally misspelled to make them shorter?
...names that are commonly misspelled in English?
...names that conflict with standard library-routine names or with predefined variable names?
...totally arbitrary names?
...hard-to-read characters?
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
The community reviewed whether to reopen this question 12 months ago and left it closed:
Original close reason(s) were not resolved
Improve this question
I am interested in learning how a database engine works (i.e. the internals of it). I know most of the basic data structures taught in CS (trees, hash tables, lists, etc.) as well as a pretty good understanding of compiler theory (and have implemented a very simple interpreter) but I don't understand how to go about writing a database engine. I have searched for tutorials on the subject and I couldn't find any, so I am hoping someone else can point me in the right direction. Basically, I would like information on the following:
How the data is stored internally (i.e. how tables are represented, etc.)
How the engine finds data that it needs (e.g. run a SELECT query)
How data is inserted in a way that is fast and efficient
And any other topics that may be relevant to this. It doesn't have to be an on-disk database - even an in-memory database is fine (if it is easier) because I just want to learn the principals behind it.
Many thanks for your help.
If you're good at reading code, studying SQLite will teach you a whole boatload about database design. It's small, so it's easier to wrap your head around. But it's also professionally written.
SQLite 2.5.0 for Code Reading
http://sqlite.org/
The answer to this question is a huge one. expect a PHD thesis to have it answered 100% ;)
but we can think of the problems one by one:
How to store the data internally:
you should have a data file containing your database objects and a caching mechanism to load the data in focus and some data around it into RAM
assume you have a table, with some data, we would create a data format to convert this table into a binary file, by agreeing on the definition of a column delimiter and a row delimiter and make sure such pattern of delimiter is never used in your data itself. i.e. if you have selected <*> for example to separate columns, you should validate the data you are placing in this table not to contain this pattern. you could also use a row header and a column header by specifying size of row and some internal indexing number to speed up your search, and at the start of each column to have the length of this column
like "Adam", 1, 11.1, "123 ABC Street POBox 456"
you can have it like
<&RowHeader, 1><&Col1,CHR, 4>Adam<&Col2, num,1,0>1<&Col3, Num,2,1>111<&Col4, CHR, 24>123 ABC Street POBox 456<&RowTrailer>
How to find items quickly
try using hashing and indexing to point at data stored and cached based on different criteria
taking same example above, you could sort the value of the first column and store it in a separate object pointing at row id of items sorted alphabetically, and so on
How to speed insert data
I know from Oracle is that they insert data in a temporary place both in RAM and on disk and do housekeeping on periodic basis, the database engine is busy all the time optimizing its structure but in the same time we do not want to lose data in case of power failure of something like that.
so try to keep data in this temporary place with no sorting, append your original storage, and later on when system is free resort your indexes and clear the temp area when done
good luck, great project.
There are books on the topic a good place to start would be Database Systems: The Complete Book by Garcia-Molina, Ullman, and Widom
SQLite was mentioned before, but I want to add some thing.
I personally learned a lot by studying SQlite. The interesting thing is, that I did not go to the source code (though I just had a short look). I learned much by reading the technical material and specially looking at the internal commands it generates. It has an own stack based interpreter inside and you can read the P-Code it generates internally just by using explain. Thus you can see how various constructs are translated to the low-level engine (that is surprisingly simple -- but that is also the secret of its stability and efficiency).
I would suggest focusing on www.sqlite.org
It's recent, small (source code 1MB), open source (so you can figure it out for yourself)...
Books have been written about how it is implemented:
http://www.sqlite.org/books.html
It runs on a variety of operating systems for both desktop computers and mobile phones so experimenting is easy and learning about it will be useful right now and in the future.
It even has a decent community here: https://stackoverflow.com/questions/tagged/sqlite
Okay, I have found a site which has some information on SQL and implementation - it is a bit hard to link to the page which lists all the tutorials, so I will link them one by one:
http://c2.com/cgi/wiki?CategoryPattern
http://c2.com/cgi/wiki?SliceResultVertically
http://c2.com/cgi/wiki?SqlMyopia
http://c2.com/cgi/wiki?SqlPattern
http://c2.com/cgi/wiki?StructuredQueryLanguage
http://c2.com/cgi/wiki?TemplateTables
http://c2.com/cgi/wiki?ThinkSqlAsConstraintSatisfaction
may be you can learn from HSQLDB. I think they offers small and simple database for learning. you can look at the codes since it is open source.
If MySQL interests you, I would also suggest this wiki page, which has got some information about how MySQL works. Also, you might want to take a look at Understanding MySQL Internals.
You might also consider looking at a non-SQL interface for your Database engine. Please take a look at Apache CouchDB. Its what you would call, a document oriented database system.
Good Luck!
I am not sure whether it would fit to your requirements but I had implemented a simple file oriented database with support for simple (SELECT, INSERT , UPDATE ) using perl.
What I did was I stored each table as a file on disk and entries with a well defined pattern and manipulated the data using in built linux tools like awk and sed. for improving efficiency, frequently accessed data were cached.