Best practices for configuring a Solr schema - schema

I'm currently configuring my schema.xml file and trying to figure out what's the best way to set up my documents. I use a RMDBS and thus many objects are relational.
Take this site for instance; a document typically consists of a question, followed by 0 or more answers. Say you'd want to set up fields for this, you would have to declare all question and answer fields in the same document, the way I see it. But given the fact that there can be more than one answer, you would have to create a document for each answer. So that means each question and each answer is stored in a separate document, which contains fields for both.
I don't see a different approach for this kind of problem, however I am relatively new to Solr and document DB's, so I may be wrong.
In short: what are the best practices if I would implement such a schema?

Another way to do it would be to have a question field and a multi-valued field for answers and have them in the same document. This is probably the best way to start unless you have specific requirements that favor the document-per-answer approach.
For example, if you needed to match individual answers as stand-alone search results, you may get better results and performance with the document-per-answer approach, as the "answer" documents will be scored, ranked and loaded in isolation.
But that would be an unconventional use of this type of data. Normally when you search a site like stack overflow, you are searching for a question and set of answers that cover a particular topic, so having everything in one document makes more sense.

Related

Download OEIS sequences with known algorithm to produce them

I was reading some interesting questions about the topic "Can we make a program that, given a particular sequence, produces the next terms", like this one, and I really like the detailed answer of this one. I understand that the answer is "That's impossible without more restrictions", and that given some restrictions (polynomials, rational function or boolean map) we know some good algorithms, as the second answer I linked explains.
Now, a natural question is how much can we solve, trying our best even if we can't always solve it, to answer the original, general question. What I usually do when facing a hard sequence is trying to see if it's in OEIS, and if it seems to be there, seeing if there is any formula or algorithm to produce it in there. You can download a small version of OEIS with the first terms of each sequence, and you can make queries to find formulas or maple algorithms for a particular sequence. My question is, do you think it's feasible to download a small version of OEIS that includes, with the first terms, a little algorithm to produce it?
The natural problem here is that I haven't seen any link to download the entire database of OEIS with all the details, which maybe deserves its own question. Even if we had this, you need to read the formulas/algorithms (that can be written in different languages, from what I've seen) and interpret them correctly. But I thought maybe someone here knows how to solve this, in any case thanks in advance.
You could, as you note, download the sequences and their A-numbers from the link mentioned here: https://oeis.org/wiki/Welcome#Compressed_Versions
After searching that and finding one sequence (or a small number of sequences) of interest, you could scrape the respective page(s) for formulas. There are specific fields for Maple and Mathematica, which may be helpful, and otherwise, an entry in the PROGRAM field should include identifying information when it is not one of the standard languages with its own field in the database. See: http://oeis.org/wiki/Style_Sheet
Unofficially, but with the interests of the OEIS in mind, I would not recommend trying to download or scrape the OEIS in its entirety. Whether it's one person, or a whole host of people, we would certainly recommend using the compressed version of the database to identify sequences of interest by A-number first, then pulling their entire entry by scraping the site or querying the OEIS using methods that you have already mentioned: Programmatic access to On-Line Encyclopedia of Integer Sequences
If this sounds laborious, perhaps an alternative is the Wolfram Cloud, which actives this through other means. For example, you can navigate to the cloud (you may have to register just to get access) at: https://www.wolframcloud.com/
Typing in something like FindSequenceFunction[{1, 2, 3, 5, 17, 305, 34865}] will give you a formula, if Wolfram/Mathematica can find one. The documentation for FindSequenceFunction can be found here: https://reference.wolfram.com/language/ref/FindSequenceFunction.html
Wolfram/Mathematica can also invoke the OEIS using packages like the one described here: https://mathematica.stackexchange.com/questions/40/is-it-possible-to-invoke-the-oeis-from-mathematica

Question Answering with Lucene

For a toy project, I want to implement an automated question answering system with Lucene and I'm trying to figure out a reasonable way to implement it. The basic operation is as follows:
1) The user will enter a question.
2) The system will identify the keywords in the question.
3) The keywords will be searched in a large knowledgebase and matching sentences will be shown as answers.
My knowledgebase (i.e., corpus) is not structured. It is just a large, continuous text (say, a user manual without any chapters). I mean that the only structure is that sentences and paragraphs are identified.
I plan to treat each sentence or paragraph as a separate document. To present the answer in a context, I may consider keeping one sentence/paragraph before/after the indexed one as payload. I would like to know if that makes sense. Also, I'm wondering if there are other tried and well-known approaches for that kind of systems. As an example, another approach that comes to mind is to index large chunks of the corpus as documents with the token positions, then process the vicinity of found keywords to construct my answers.
I would appreciate direct recommendations based on experience or intuition, but also tutorials or introductory materials to question-answering systems with Lucene in mind.
Thanks.
It's not an unreasonable approach to take.
One enhancement you might consider is incorporating learning feedback, so that you can continually improve the scoring of content vs search terms. To do this you would ask users to rate the answers that come back ('helpful vs unhelpful'), that way you can start to rank documents against keywords based on the historical data. You could classify potential documents as helpful/unhelpful for given keywords by using a simple Bayesian classifier.
Indexing each sentence as a document will give you some problems. You've pointed out one: you would need to store the surrounding texts a payloads. That means you'll need to store each sentence three times (before, during and after), and you'll have to manually get into the payload.
If you want to go the route of each sentence being a document, I would recommend coming up with an ID for each sentence and storing that as a separate field. Then you can display [ID-1, ID, ID+1] in each result.
The bigger question though is: how should you break up the text into documents? Identifying semantically related areas seems difficult, so doing it by sentence/paragraph might be the only way to go. A better way would be if you could find which text is the header of a section, and then put everything in that section as a document.
You might also want to use the index (if your corpus has one). The terms there could be boosted, as they are presumably more important.
Instead of luncene which does text indexing, search and retrieval, I think using something like Apache Mahout would help with this. Mahout considers text as knowledge and doing that makes the answering the question better than just text matching. Mahout is a machine learning and data mining f/w which fits this domain better. Just a very high level thought.
--Sai

Method names for getting data [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
Warning: This is a not very serious question/discussion that I am posting... but I am willing to bet that most developers have pondered this "issue"...
Always wanted to get other opinions regarding naming conventions for methods that went and got data from somewhere and returned it...
Most method names are somewhat simple and obvious... SaveEmployee(), DeleteOrder(), UploadDocument(). Of course, with classes, you would most likely use the short form...Save(), Delete(), Upload() respectively.
However, I have always struggled with initial action...how to get the data. It seems that for every project I end up jumping between different naming conventions because I am never quite happy with the last one I used. As far as I can tell these are the possibilities -->
GetBooks()
FetchBooks()
RetrieveBooks()
FindBooks()
LoadBooks()
What is your thought?
It is all about consistent semantics;
In your question title you use getting data. This is extremely
general in a sense that you need to define what getting means
semantically significantly unambiguous way. I offer the follow
examples to hopefully put you on the right track when thinking about
naming things.
getBooks() is when you are getting
all the books associated with an
object, it implies the criteria for the set is
already defined and where they are coming from is a hidden detail.
findBooks(criteria)
is when are trying to find a sub-set
of the books based on parameters to
the method call, this will usually
be overloaded with different search
criteria
loadBooks(source) is when you are
loading from an external source,
like a file or db.
I would not use
fetch/retrieve because they are too vague and get conflated with get and there is no unambiguous semantic associated with the terms.
Example: fetch implies that some entity needs to go and get something that is remote and bring it back. Dogs fetch a stick, and retrieve is a synonym for fetch with the added semantic that you may have had possession of the thing prior as well. get is a synonym for obtain as well which implies that you have sole possession of something and no one else can acquire it simultaneously.
Semantics are extremely important:
the branch of linguistics and logic concerned with meaning
The comments are proof that generic terms like get and fetch have
no specific semantic and are interpreted differently by different
people. Pick a semantic for a term, document what it is intended to
imply if the semantic is not clear and be consistent with its use.
words with vague or ambigious meanings are given different semantics by different people because of their predjudices and preconceptions based on their personal opinions and that will never end well.
Honestly you should just decide with your team which naming convention to use. But for fun, lets see what your train of thought would be to decide on any of these:
GetBooks()
This method belongs to a data source, and we don't care how it is obtaining them, we just want to Get them from the data source.
FetchBooks()
You treat your data source like a bloodhound, and it is his job to fetch your books. I guess you should decide on your own how many he can fit in his mouth at once.
FindBooks()
Your data source is a librarian and will use the Dewey Decimal system to find your books.
LoadBooks()
These books belong in some sort of "electronic book bag" and must be loaded into it. Be sure to call ZipClosed() after loading to prevent losing them.
RetrieveBooks()
I have nothing.
The answer is just stick to what you are comfortable with and be consistant.
If you have a barnes and nobles website and you use GetBooks(), then if you have another item like a Movie entity use GetMovies(). So whatever you and your team likes and be consistant.
It is not clear by what you mean for "getting the data". From the database? A file? Memory?
My view about method naming is that its role is to eliminate any ambiguities and ideally a need to look up documentation. I believe that this should be done even at the cost of longer method names. According to studies, most intermediate+ developers are able to read multiple words in camel case. With IDE and auto completions, writing long method names is also not a problem.
Thus, when I see "fetchBooks", unless the context is very clear (e.g., a class named BookFetcherFromDatabase), it is ambiguous. Fetch it from where? What is the difference between fetch and find? You're also risking the problem that some developers will associate semantics with certain keywords. For example, fetch for database (or memory) vs. load (from file) or download (from web).
I would rather see something like "fetchBooksFromDatabase", "loadBookFromFile", "findBooksInCollection", etc. It is less sightly, but once you get over the length, it is clear. Everyone reading this would right away get what it is that you are trying to do.
In OO (C++/Java) I tend to use getSomething and setSomething because very often if not always I am either getting a private attribute from the class representing that data object or setting it - the getter/setter pair. As a plus, Eclipse generates them for you.
I tend to use Load only when I mean files - as in "load into memory" and that usually implies loading into primitives, structs (C) or objects. I use send/receive for web.
As said above, consistency is everything and that includes cross-developers.

How to write a simple database engine [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
The community reviewed whether to reopen this question 12 months ago and left it closed:
Original close reason(s) were not resolved
Improve this question
I am interested in learning how a database engine works (i.e. the internals of it). I know most of the basic data structures taught in CS (trees, hash tables, lists, etc.) as well as a pretty good understanding of compiler theory (and have implemented a very simple interpreter) but I don't understand how to go about writing a database engine. I have searched for tutorials on the subject and I couldn't find any, so I am hoping someone else can point me in the right direction. Basically, I would like information on the following:
How the data is stored internally (i.e. how tables are represented, etc.)
How the engine finds data that it needs (e.g. run a SELECT query)
How data is inserted in a way that is fast and efficient
And any other topics that may be relevant to this. It doesn't have to be an on-disk database - even an in-memory database is fine (if it is easier) because I just want to learn the principals behind it.
Many thanks for your help.
If you're good at reading code, studying SQLite will teach you a whole boatload about database design. It's small, so it's easier to wrap your head around. But it's also professionally written.
SQLite 2.5.0 for Code Reading
http://sqlite.org/
The answer to this question is a huge one. expect a PHD thesis to have it answered 100% ;)
but we can think of the problems one by one:
How to store the data internally:
you should have a data file containing your database objects and a caching mechanism to load the data in focus and some data around it into RAM
assume you have a table, with some data, we would create a data format to convert this table into a binary file, by agreeing on the definition of a column delimiter and a row delimiter and make sure such pattern of delimiter is never used in your data itself. i.e. if you have selected <*> for example to separate columns, you should validate the data you are placing in this table not to contain this pattern. you could also use a row header and a column header by specifying size of row and some internal indexing number to speed up your search, and at the start of each column to have the length of this column
like "Adam", 1, 11.1, "123 ABC Street POBox 456"
you can have it like
<&RowHeader, 1><&Col1,CHR, 4>Adam<&Col2, num,1,0>1<&Col3, Num,2,1>111<&Col4, CHR, 24>123 ABC Street POBox 456<&RowTrailer>
How to find items quickly
try using hashing and indexing to point at data stored and cached based on different criteria
taking same example above, you could sort the value of the first column and store it in a separate object pointing at row id of items sorted alphabetically, and so on
How to speed insert data
I know from Oracle is that they insert data in a temporary place both in RAM and on disk and do housekeeping on periodic basis, the database engine is busy all the time optimizing its structure but in the same time we do not want to lose data in case of power failure of something like that.
so try to keep data in this temporary place with no sorting, append your original storage, and later on when system is free resort your indexes and clear the temp area when done
good luck, great project.
There are books on the topic a good place to start would be Database Systems: The Complete Book by Garcia-Molina, Ullman, and Widom
SQLite was mentioned before, but I want to add some thing.
I personally learned a lot by studying SQlite. The interesting thing is, that I did not go to the source code (though I just had a short look). I learned much by reading the technical material and specially looking at the internal commands it generates. It has an own stack based interpreter inside and you can read the P-Code it generates internally just by using explain. Thus you can see how various constructs are translated to the low-level engine (that is surprisingly simple -- but that is also the secret of its stability and efficiency).
I would suggest focusing on www.sqlite.org
It's recent, small (source code 1MB), open source (so you can figure it out for yourself)...
Books have been written about how it is implemented:
http://www.sqlite.org/books.html
It runs on a variety of operating systems for both desktop computers and mobile phones so experimenting is easy and learning about it will be useful right now and in the future.
It even has a decent community here: https://stackoverflow.com/questions/tagged/sqlite
Okay, I have found a site which has some information on SQL and implementation - it is a bit hard to link to the page which lists all the tutorials, so I will link them one by one:
http://c2.com/cgi/wiki?CategoryPattern
http://c2.com/cgi/wiki?SliceResultVertically
http://c2.com/cgi/wiki?SqlMyopia
http://c2.com/cgi/wiki?SqlPattern
http://c2.com/cgi/wiki?StructuredQueryLanguage
http://c2.com/cgi/wiki?TemplateTables
http://c2.com/cgi/wiki?ThinkSqlAsConstraintSatisfaction
may be you can learn from HSQLDB. I think they offers small and simple database for learning. you can look at the codes since it is open source.
If MySQL interests you, I would also suggest this wiki page, which has got some information about how MySQL works. Also, you might want to take a look at Understanding MySQL Internals.
You might also consider looking at a non-SQL interface for your Database engine. Please take a look at Apache CouchDB. Its what you would call, a document oriented database system.
Good Luck!
I am not sure whether it would fit to your requirements but I had implemented a simple file oriented database with support for simple (SELECT, INSERT , UPDATE ) using perl.
What I did was I stored each table as a file on disk and entries with a well defined pattern and manipulated the data using in built linux tools like awk and sed. for improving efficiency, frequently accessed data were cached.

Relevant Search Results Across Multiple Databases

I have three databases that all have the contents of several web pages in them. What would be the best way to go about searching all three and having the most relevant web page at the top of the search results?
The only way I can think of is break down content by word count and/or creating a complex set of search rules to give one content priority over another. This might be more trouble than what it's worth, but I was wondering if anybody knows a way or product out there that would be able to help me.
To further support Ivans answer above Lucene is the way to go. You haven't mentioned what platform you're on so I'll point out that you can use a .NET port of this too.
If you do use Lucene there is a very good book from Manning on the subject which I recommend you look at.
When it comes to populating your index, you have a couple of choices. For starters you can just dump all of your text into the index and allow the engine to just search on it. However, I'd recommend adding fixed fields to your index which will allow you to support things such as partitioned searches or searches against those fields only.
To explain, lets say you have a field for the website. Then you can partition your index by restricting the index search to those documents that have that website in that field.
The other process is to extract points of interest from your document and allow searches on those without searching the entire index entry. Your mileage may vary with this as the lucene engine is very well written so it may simply allow you to collect your searches into more logical units which helps you with your solution.
I've done this myself and it helps when answering management questions about what exactly is searched and indexed.
HTH!
If you're using MS SQL Server then the full text search can return a ranking for you. I haven't used it, so you'll need to check the documentation or online for specifics.