Developing English query-based search for an embodied agent - sql

I'm looking to create an Embodied Agent for handling search requests on a website. The agent needs to be able to handle simple questions, and provide a series of website links for an answer.
All the articles are in a database. Each article has a title field, and a series of tags to categorize the article.
At this point, my simple algorithm would be:
Split the question up into a series of words.
Remove all common words like "a", "the", "how", etc.
Create a "where" clause, searching the article body, article title, and tags for the remaining words.
Display the list, possibly ranked with those articles with matches in the title first, tags second, and article body third.
Is there a better algorithm for converting an English question into a SQL query? Are there specific details that should be tracked along with each article by the article author to further improve search results? Are there details that should be recorded over time while the search is in use to further improve search results?
UPDATE: The website will be running on IIS, with the latest ASP.NET. The backend database will be a SQL Server.

There really isn't an easy solution for true English query parsing. Most search engines simply eliminate noise words, as you're proposing, and look for the remaining terms. If you're using Microsoft SQL Server, you may want to look at Full-Text Search (SQL Server). You may also want to read Semantic Search (SQL Server), if you can use Microsoft SQL Server 2012. If you're using MySQL, see 12.9. Full-Text Search Functions.
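For what it's worth, the noise-word approach from the question can be sketched in a few lines. This is only an illustration, not a recommendation over full-text search: it uses Python with an in-memory SQLite database as a stand-in for SQL Server, and the `articles` table, its columns, the noise-word list, and the 3/2/1 weights are all assumptions made up for the example.

```python
import re
import sqlite3

# Hypothetical noise-word list; a real one would be much longer.
NOISE_WORDS = {"a", "an", "the", "how", "do", "i", "is", "to", "in", "of", "what"}

def extract_terms(question):
    """Lowercase the question, split it into words, and drop noise words."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return [w for w in words if w not in NOISE_WORDS]

def search(conn, question):
    """Return article titles, best match first."""
    terms = extract_terms(question)
    if not terms:
        return []
    score_parts, params = [], []
    for term in terms:
        pattern = f"%{term}%"
        # Title hits weigh 3, tag hits 2, body hits 1 (arbitrary weights).
        score_parts.append(
            "(CASE WHEN lower(title) LIKE ? THEN 3 ELSE 0 END"
            " + CASE WHEN lower(tags) LIKE ? THEN 2 ELSE 0 END"
            " + CASE WHEN lower(body) LIKE ? THEN 1 ELSE 0 END)"
        )
        params.extend([pattern, pattern, pattern])
    score_sql = " + ".join(score_parts)
    sql = (
        f"SELECT title FROM (SELECT title, {score_sql} AS score FROM articles) "
        "WHERE score > 0 ORDER BY score DESC, title"
    )
    # Terms are passed as parameters, never concatenated into the SQL text.
    return [row[0] for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, tags TEXT, body TEXT)")
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", [
    ("Resetting your password", "account,login", "Open the settings page."),
    ("Account settings", "account,password", "Change your email here."),
    ("Billing FAQ", "billing", "If you forget your password, contact support."),
])
results = search(conn, "How do I reset the password?")
print(results)
```

A `LIKE '%term%'` scan cannot use an ordinary index, which is one reason the full-text features mentioned above are usually the better fit once the table grows.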

You might find Kueri.me relevant.
Kueri converts natural language to SQL. It comes with a JavaScript library out of the box that can be integrated into a website.
You will be able to ask:
show me articles
top 10 articles by rating
bottom 5 articles by creation date
last 7 articles added in the last week and with description containing "xx" or "yy"
show all articles with more than 2 rankings
how many articles with no rating per section
etc

Related

Wikipedia API Extraction of abstracts in 2 languages

I am trying to connect 2 API queries.
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro=&explaintext=&titles=Albert+Einstein&format=json
Where I search for article descriptions and
https://en.wikipedia.org/w/api.php?action=query&prop=langlinks&format=json&lllang=de&titles=Companion%20dog
Where I retrieve the name of the article in another language (here German).
Is there a way to connect them to retrieve description data both in English and German?
I have tried connecting them via "generators", but I don't seem to understand how to apply them here.
I also tried inputting another query after extracting names in 2 languages (searching for descriptions). However, the names are sometimes formatted so that I cannot reuse them in the query.
No. The description is a snippet from the start of the article. If you want a German description, you need to get it from the German Wikipedia (i.e. a different API endpoint).
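A rough sketch of that two-step flow, in Python: ask the English endpoint for the language link, then ask the German endpoint for the extract. The helper names are hypothetical, the response shape in `sample` is an assumption based on the default JSON format (where the linked title sits under the `"*"` key), and the German title is a placeholder rather than real data. Building the second URL with `urlencode` should also take care of titles that are awkward to paste into a query by hand.

```python
from urllib.parse import urlencode

def api_url(lang, params):
    """Build an api.php URL for the Wikipedia in the given language."""
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)

def langlink_url(title, target_lang, source_lang="en"):
    """URL that asks source_lang Wikipedia for the title in target_lang."""
    return api_url(source_lang, {
        "action": "query", "prop": "langlinks", "format": "json",
        "lllang": target_lang, "titles": title,
    })

def extract_url(title, lang):
    """URL that asks lang Wikipedia for the plain-text intro of title."""
    return api_url(lang, {
        "action": "query", "prop": "extracts", "exintro": "",
        "explaintext": "", "format": "json", "titles": title,
    })

def linked_title(response):
    """Return the first language-link title in a decoded langlinks response."""
    for page in response.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            return link["*"]
    return None

# Assumed response shape with a placeholder title (not real API output).
sample = {"query": {"pages": {"123": {
    "title": "Companion dog",
    "langlinks": [{"lang": "de", "*": "Beispiel (Titel)"}],
}}}}

german_title = linked_title(sample)
print(langlink_url("Companion dog", "de"))
print(extract_url(german_title, "de"))
```

Fetching and decoding the JSON (for example with `urllib.request` and `json`) is left out so the sketch stays runnable offline.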

MediaWiki api for Wikipedia - is it possible to search by title on ALL languages?

I know that to look up the page ID of a Wikipedia article with a known title, I can do:
https://en.wikipedia.org/w/api.php?action=query&titles=7_Studios
However, in this case, 7_Studios is a French Wikipedia article, so the above link would not work. Instead I need to try
https://fr.wikipedia.org/w/api.php?action=query&titles=7_Studios
My question is: if I do not know which language's Wikipedia the article is in, only the title itself, how can I make sure I can find it using the API?
As Bergi mentioned, you can use Wikidata for this: it contains the database of interwiki links, so it's possible some article titles won't be there, but most should be.
To do this, you can use the wbgetentities module: you specify the title to search for and a list of wikis to search. For example:
https://www.wikidata.org/w/api.php?action=wbgetentities&titles=7_Studios&sites=enwiki|frwiki|nlwiki|dewiki
You can specify up to 50 wikis in one query. Currently, there are around 300 Wikipedias, so if you really need to query all of them, you may need up to 6 requests for each title.
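The batching could look something like this in Python. It is a sketch only: `wbgetentities_urls` is a made-up helper, and the 300 site IDs are placeholders generated for the example, not real wiki IDs.

```python
from urllib.parse import urlencode

MAX_SITES_PER_REQUEST = 50  # the per-query limit mentioned above

def wbgetentities_urls(title, site_ids, batch_size=MAX_SITES_PER_REQUEST):
    """Return one wbgetentities URL per batch of at most batch_size sites."""
    urls = []
    for start in range(0, len(site_ids), batch_size):
        batch = site_ids[start:start + batch_size]
        query = urlencode({
            "action": "wbgetentities",
            "titles": title,
            "sites": "|".join(batch),  # urlencode escapes "|" as %7C
            "format": "json",
        })
        urls.append("https://www.wikidata.org/w/api.php?" + query)
    return urls

# The four wikis from the example above fit in a single request.
print(wbgetentities_urls("7_Studios", ["enwiki", "frwiki", "nlwiki", "dewiki"]))

# Placeholder IDs standing in for roughly 300 Wikipedias.
all_sites = [f"site{i}wiki" for i in range(300)]
urls = wbgetentities_urls("7_Studios", all_sites)
print(len(urls))
```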

How to design a database for efficient search-ability?

I am trying to design a database with search-ability at its core. My knowledge of database design and SQL is all self-taught and still fairly beginner-level, so my questions may possibly have easy answers.
Suppose I have a single table containing a large number of records. For example, suppose that each record contains details of a different computer application (name, developer, version number, etc.). A list of keywords is associated with each record, such as a list of programming languages used to write the application.
I wish to be able to enter one or more keywords (each separated by a space) into a search box, and I wish to have all associated records returned. How should I design the database to store the keywords, and what SQL query would I need to apply to the search text? (The search should be uppercase/lowercase independent.)
My next challenge would then be to order search results by relevance, and to allow entire key-phrases as well as keywords to be associated with each record. For example, if I type "Visual Basic" into the search field, I want the first results to have exactly the key-phrase "Visual Basic" associated with them. The next results should all have both keywords "Visual" and "Basic" associated with them, and the remaining results should have only one of these keywords. Again, please could anyone advise on how to implement this?
The final challenge I believe would be much harder: how much 'intelligent interpretation' can I design my database and SQL code to handle? For example, if I search for "CSS", can I get the records with the key-phrase "Cascading Style Sheets" to appear? Can I also get SQL to identify and search for similar words, such as plurals of search phrases or, for example, "programmer" or "programming" when "program" is input? Thanks!
Learn relational algebra, normalization rules, and SQL.
Start with entity relationships. Sounds like you could have an APPLICATION table as parent for a FEATURE child table, with a one-to-many relationship between the two. You'll query them by JOINing one to the other:
SELECT A.NAME, F.NAME
FROM APPLICATION AS A
JOIN FEATURE AS F
ON F.APP_ID = A.ID
Your challenges would not suggest SQL and relations to me. I would think more in terms of a parser, an indexer and search engine like Lucene, and a NoSQL document database like MongoDB.
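That said, if you do stay with plain SQL, the parent/child layout above is enough for the keyword and ranking parts of the question. Here is a minimal sketch using Python's built-in SQLite, purely as an illustration: the `application` and `keyword` tables, the sample rows, and the `search` helper are all invented for the example. Phrases are stored in lowercase so matching is case-independent, and results are ordered with exact key-phrase matches first, then by how many of the individual words matched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE application (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE keyword (
    app_id INTEGER REFERENCES application(id),
    phrase TEXT  -- stored lowercase so matching is case-insensitive
);
""")

# Made-up sample data: application name -> associated keywords/key-phrases.
apps = {
    "Editor One": ["visual basic", "css"],
    "Chart Tool": ["visual", "basic"],
    "Paint App": ["visual", "c++"],
}
for name, phrases in apps.items():
    app_id = conn.execute(
        "INSERT INTO application (name) VALUES (?)", (name,)
    ).lastrowid
    conn.executemany(
        "INSERT INTO keyword VALUES (?, ?)", [(app_id, p) for p in phrases]
    )

def search(conn, text):
    """Exact key-phrase matches first, then most matching keywords."""
    words = text.lower().split()
    if not words:
        return []
    phrase = " ".join(words)
    placeholders = ", ".join("?" for _ in words)
    sql = f"""
        SELECT a.name
        FROM application AS a
        JOIN keyword AS k ON k.app_id = a.id
        WHERE k.phrase = ? OR k.phrase IN ({placeholders})
        GROUP BY a.id, a.name
        ORDER BY MAX(k.phrase = ?) DESC, COUNT(DISTINCT k.phrase) DESC, a.name
    """
    return [row[0] for row in conn.execute(sql, [phrase, *words, phrase])]

print(search(conn, "Visual Basic"))
```

The "CSS" versus "Cascading Style Sheets" case is not covered here; that would need a synonym table or one of the search engines mentioned above.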
I've come to the conclusion, after a LOT of research, that @duffymo's answer is hinting in the right direction. For the benefit of other n00bs like me, here's the conclusion I've drawn:
Many open source search engine server apps are out there to install for free. Lucene was the first of them I had ever heard of, but others do exist and I think my favourite at the moment is Sphinx. As far as I can tell, the 'indexer' that @duffymo mentions is built into it. I have learnt that the indexer is the program that will examine my database for keywords and will automatically keep a record of which results should be returned for different input queries. I have also now learnt that the terminology for the behaviour I was looking for (and which Sphinx has) is 'stemming'. I'm still not sure what role a parser plays in all this...
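To make 'stemming' a bit more concrete, here is a deliberately naive Python sketch. It is not how Sphinx or Lucene do it (real stemmers follow far more careful rules); `naive_stem` and its suffix list are invented purely to show the idea that the indexer and the query both reduce words to a common stem before comparing them.

```python
# Toy suffix list, checked in this order; real stemmers are much more careful.
SUFFIXES = ("ing", "ers", "er", "ed", "s")

def naive_stem(word):
    """Strip one common suffix, then collapse a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # "programm" -> "program"
    if len(word) > 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

# A tiny "indexer": map each stem to the records whose keywords share it.
index = {}
for record_id, keyword in [(1, "programming"), (2, "programmer"),
                           (3, "programs"), (4, "painting")]:
    index.setdefault(naive_stem(keyword), set()).add(record_id)

# The query is stemmed the same way before the lookup.
matches = index.get(naive_stem("program"), set())
print(sorted(matches))
```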
A more basic approach would be to use SQL itself. Whilst I was already aware of the most basic of these (i.e. using the LIKE keyword with 'wildcards'), I also discovered something a little more powerful: natural language / full-text search. For anyone not interested in installing a server app, I recommend you look this up.
Also, I see no reason why I would need to use NoSQL instead of SQL (as @duffymo has suggested), and so I'm going to stick with SQL for the moment (at least until I come across some good entry-level books to learn NoSQL from). Furthermore, I have very little intention of learning relational algebra until I know why I should and how it would be useful. The message here is that other beginners shouldn't be put off by these things, as I don't think Sphinx requires any knowledge of them.
While I like @duffymo's answer, I will also suggest you research SPARQL and the WordNet project for your semantic equivalence questions.
If you choose Oracle, you can use the spatial option triple store to implement the SPARQL endpoint and do some very nice searching, like your CSS = Cascading Style Sheets example.

search a database

Let's say I have a large database with product information. I want to create a search engine for that database, preferably with indexing and autocorrect features. How do I go about doing this? Are there any good libraries I could use, so that I don't have to start from scratch with basic SQL? Just some basic recommendations, links, would be much appreciated.
I am familiar with PHP, C#, VB, and Java, but I know very little about databases.
If your product database creates web pages, you would be best served using Lucene or htdig. Those will do really good text searching based on your content.
Otherwise you will want to search the large fields of your database using the full-text search capabilities in MySQL.
To do the autocomplete you will need to have an offline indexing process that works similarly to Google's. Create another table called wordIndex. It contains words and the number of occurrences of each in your product database.
When a user starts to type, you do an ajax lookup on this table and autocomplete based on that.
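As a rough illustration of that wordIndex idea, here is a Python sketch using an in-memory SQLite database in place of MySQL. Only the table name `wordIndex` comes from the description above; the column names, the sample product descriptions, and both helper functions are assumptions for the example. The `autocomplete` function is what the ajax endpoint would call.

```python
import re
import sqlite3
from collections import Counter

def build_word_index(conn, descriptions):
    """Offline step: count every word and store the totals in wordIndex."""
    counts = Counter()
    for text in descriptions:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wordIndex "
        "(word TEXT PRIMARY KEY, occurrences INTEGER)"
    )
    conn.execute("DELETE FROM wordIndex")  # rebuild from scratch each run
    conn.executemany("INSERT INTO wordIndex VALUES (?, ?)", counts.items())

def autocomplete(conn, prefix, limit=5):
    """Online step: most frequent words starting with the typed prefix."""
    # Keep letters only, so LIKE wildcards cannot be typed in by the user.
    prefix = re.sub(r"[^a-z]", "", prefix.lower())
    if not prefix:
        return []
    rows = conn.execute(
        "SELECT word FROM wordIndex WHERE word LIKE ? "
        "ORDER BY occurrences DESC, word LIMIT ?",
        (prefix + "%", limit),
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
build_word_index(conn, [
    "Wireless mouse with USB receiver",
    "Wired keyboard",
    "Wireless keyboard and mouse bundle",
    "Mouse pad",
])
print(autocomplete(conn, "wi"))
print(autocomplete(conn, "mo"))
```

Because the pattern is anchored at the start (`prefix%`), the database can typically serve it from an index on `word`, unlike a `%term%` search.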
If MySQL FULLTEXT searching doesn't do all you need it to (databases have indexes of their own you can set up), two good choices are Solr (based on Lucene) and Sphinx. Both are often used to provide a full-featured search index on top of a MySQL database. Here's a comparison of the two.

Relevant Search Results Across Multiple Databases

I have three databases that all have the contents of several web pages in them. What would be the best way to go about searching all three and having the most relevant web page at the top of the search results?
The only way I can think of is to break down content by word count and/or create a complex set of search rules to give one piece of content priority over another. This might be more trouble than it's worth, but I was wondering if anybody knows of a way or a product out there that would be able to help me.
To further support Ivan's answer above, Lucene is the way to go. You haven't mentioned what platform you're on, so I'll point out that you can use a .NET port of this too.
If you do use Lucene there is a very good book from Manning on the subject which I recommend you look at.
When it comes to populating your index, you have a couple of choices. For starters you can just dump all of your text into the index and allow the engine to just search on it. However, I'd recommend adding fixed fields to your index which will allow you to support things such as partitioned searches or searches against those fields only.
To explain, let's say you have a field for the website. Then you can partition your index by restricting the index search to those documents that have that website in that field.
The other process is to extract points of interest from your document and allow searches on those without searching the entire index entry. Your mileage may vary with this, as the Lucene engine is very well written, so it may simply allow you to collect your searches into more logical units, which helps you with your solution.
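The idea of a fixed field can be shown with a toy in-memory index in Python. To be clear, this is not Lucene's API; `TinyIndex`, the `website` field values, and the sample documents are all made up to illustrate restricting a search to one partition.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index with fixed fields stored next to each document."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of docs containing it
        self.fields = {}                  # doc id -> fixed fields

    def add(self, doc_id, text, **fields):
        self.fields[doc_id] = fields
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query, **filters):
        """Docs containing every query term and matching every filter."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(
            doc_id for doc_id in hits
            if all(self.fields[doc_id].get(k) == v for k, v in filters.items())
        )

index = TinyIndex()
index.add(1, "pricing and plans", website="site-a")
index.add(2, "pricing for teams", website="site-b")
index.add(3, "contact us", website="site-a")

print(index.search("pricing"))                    # searches everything
print(index.search("pricing", website="site-a"))  # one partition only
```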
I've done this myself and it helps when answering management questions about what exactly is searched and indexed.
HTH!
If you're using MS SQL Server then the full text search can return a ranking for you. I haven't used it, so you'll need to check the documentation or online for specifics.