SPARQL best approach for generating flat data for search - sparql

We have a triple store holding information such as drugs, and I'm unsure how to extract this information so that it can be indexed by our search engine, Elasticsearch. I had envisaged running a SPARQL query to extract the following information:
Title
Body
Href
Please note that the triple store does not contain the above structure; it's a lot more complicated than that.
One of the requirements is to be able to format the titles using different triples from the triple store. For drugs, for example, something like this would be needed:
Paracetamol | Introduction | Drug
(Paracetamol refers to the drug name, Introduction is a subsection and Drug refers to the type)
For the body I was thinking of extracting all the text values from all the triples related to drugs.
And for the href, simply using the URI of the resource (the drug).
I would then convert this information to JSON-LD so that it can be indexed by Elasticsearch. In the end the JSON-LD will simply contain the title, body and href.
So my question is: is SPARQL the right approach for what I want to do, or should I look at a different approach to extract the data I need based on the requirements above?
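Whichever way the rows are fetched, the flattening step described above is small. A minimal sketch, assuming the drug name, subsection, type, text values and URI have already been read from the store for one resource; the class FlatSearchDoc, its method names and the example URI are hypothetical, and the output is plain JSON with only title, body and href:

```java
import java.util.List;

// Hypothetical helper: turns values already extracted from the triple store
// into one flat JSON document with title, body and href.
class FlatSearchDoc {
    // Joins the title parts (e.g. name, subsection, type) with " | ".
    static String title(String... parts) {
        return String.join(" | ", parts);
    }

    // Concatenates all text values of the resource into one body string.
    static String body(List<String> texts) {
        return String.join(" ", texts);
    }

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
            }
        }
        return out.toString();
    }

    // Serializes the three flat fields as a single JSON object.
    static String toJson(String title, String body, String href) {
        return "{\"title\":\"" + escape(title)
             + "\",\"body\":\"" + escape(body)
             + "\",\"href\":\"" + escape(href) + "\"}";
    }

    public static void main(String[] args) {
        String t = title("Paracetamol", "Introduction", "Drug");
        String b = body(List.of("First text value.", "Second text value."));
        System.out.println(toJson(t, b, "http://example.org/drug/paracetamol"));
    }
}
```

Keeping the title format in one small function makes it easy to use a different combination of values per resource type.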

Related

Lucene difference between Term and Fields

I've read a lot about Lucene indexing and searching and still can't understand what a term is. What is the difference between terms and fields?
A very rough analogy would be that fields are like columns in a database table, and terms are like the contents in each database column.
More specifically to Lucene:
Terms
Terms are indexed tokens. See here:
Lucene Analyzers are processing pipelines that break up text into indexed tokens, a.k.a. terms
So, for example, if you have the following sentence in a document...
"This is a list of terms"
...and you pass it through a whitespace tokenizer, this will generate the following terms:
This
is
a
list
of
terms
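The splitting step above can be imitated in plain Java. This is only an illustrative stand-in, not Lucene's tokenizer API; the class name WhitespaceTerms is hypothetical:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in for a whitespace tokenizer (plain Java, not Lucene).
class WhitespaceTerms {
    // Splits the text on runs of whitespace and returns the tokens (terms).
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // prints [This, is, a, list, of, terms]
        System.out.println(tokenize("This is a list of terms"));
    }
}
```

A real analyzer usually does more than this (lowercasing, stop-word removal, stemming), so the indexed terms can differ from the raw words.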
Terms are therefore also what you place into queries, when performing searches. See here for a definition of how they are used in the classic query parser.
Fields
A field is a section of a document.
A simple example is the title of a document versus the body (the remaining text/content) of the document. These can be defined as two separate Lucene fields within a Lucene index.
(You obviously need to be able to parse the source document so that you can separate the title from the body - otherwise you cannot populate each separate field correctly, while building your Lucene index.)
You can then place all of the title's terms into the title field; and the body's terms into the body field.
Now you can search title data separately from body data.
You can read about fields here and here. There are various different types of fields, specific to the type of data (terms) they will be holding.
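To make the field/term relationship concrete, here is a toy in-memory index that keeps the terms of each field separately. It is a sketch of the idea only, not how Lucene stores its index; FieldIndex and its methods are hypothetical names:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy index: field name -> term -> ids of documents with that term in that field.
class FieldIndex {
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    // Tokenizes the text on whitespace and records each term under the given field.
    void add(int docId, String field, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(field, f -> new HashMap<>())
                    .computeIfAbsent(term, t -> new TreeSet<>())
                    .add(docId);
        }
    }

    // Looks a term up in one field only, so title and body are searched separately.
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field, Map.of())
                       .getOrDefault(term.toLowerCase(Locale.ROOT), Set.of());
    }

    public static void main(String[] args) {
        FieldIndex index = new FieldIndex();
        index.add(1, "title", "Lucene in Action");
        index.add(1, "body", "A book about search");
        index.add(2, "title", "Search basics");
        System.out.println(index.search("title", "search"));
        System.out.println(index.search("body", "search"));
    }
}
```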

Hibernate Search: how to query for embeded entities

I would like to use Hibernate Search to implement a sophisticated autosuggestion feature across multiple input fields on a web page.
Each input field is for its own entity, let's say Country and City. There is a many-to-one relationship between the two entities (countries contain cities).
The autosuggestion should work such that when you type, e.g., a country name prefix and the city field is already filled, you get only suggestions for countries that have such a city (and vice versa).
The server-side autosuggestion service should return a list of projections (entityId, entityName), which are rendered into the input field (dropdown, whatever).
According to the schema and after having read the manual I tried the following index schema:
SearchMapping mapping = new SearchMapping();
mapping.analyzerDef(...
    .entity(City.class).indexed().indexName("MyIndex")
        .property("cityId", ElementType.FIELD)
            .documentId()
            .name("id")
        .property("name", ElementType.FIELD)
            .documentId()
            .name("id")
        .property("country", ElementType.METHOD)
            .indexEmbedded()
    .entity(Country.class).indexed()
        .property("id", ElementType.FIELD)
            .documentId()
            .name("id")
        .property("name", ElementType.METHOD)
            .field()
            .name("name")
This mapping defines City to be the main entity, right?
I have indexed all cities and am able to query for them (also by combining both fields). However, I only get matches when querying for cities, i.e. when querying like
fullTextSession.getSearchFactory().buildQueryBuilder().forEntity(City.class).get();
This is not useful for the country field because when I type in "Spain", I get a single row for each city of Spain. (Spain, Spain, Spain, Spain, ... ;-))
The question is: How is it possible to search for country entities? Changing the index structure? The indexing procedure? Or how to query?
The only way I found was to set up a facet for country and use the different possible facet values as autosuggestions. However, this is also not perfect, since it is not possible to sort facets alphabetically.
Of course, in this example I could switch both entities in the mapping, but consider scenarios with more complex entity graphs.
UPDATE: adding queries requested in comment
For building queries, I employ the QueryBuilder. The following produces a result set like in the Spain example:
fullTextSession.getSearchFactory().buildQueryBuilder().forEntity(City.class).get();
with query:
country.name:Spain
If I try to use a query builder for countries
fullTextSession.getSearchFactory().buildQueryBuilder().forEntity(Country.class).get();
and query:
name:Spain
I get no results.
You are not showing your actual query. You don't have to use the query DSL; you can also write native Lucene queries. In both cases (DSL or native Lucene) you can combine queries via boolean logic. Embedded entities follow the JavaBean notation: in a city query, for example, the country name is reached as country.name. Again, without your actual query it is hard to give more specific feedback.
Last but not least, facets can also be sorted alphabetically. Check FacetSortOrder: besides COUNT_DESC it offers FIELD_VALUE, which orders facets by their field value.

How do I include other fields in a Lucene search?

Let's use emails as an example of a document. You have the subject, the body and the person it's from, and let's say we can also tag them (as Gmail does).
From my understanding of QueryParser, I give it ONE field and the parser type, so if a user enters text, only the field I set is searched. I notice it will look in the subject or body field if I write fieldName: text to search. However, how do I make a regular query such as "funny SO question unicorn" find results with some of those strings in the subject and the others in the body? At the moment, because I knew it would be easy, I made a field called ALL and combined all the other fields into it, but I would like to know how to do it properly, especially since my next app depends on text search.
Use MultiFieldQueryParser. You can specify the list of fields to be searched using the following constructor.
MultiFieldQueryParser(Version matchVersion, String[] fields, Analyzer analyzer)
This generates a query as if each term had been searched in every listed field. According to the Lucene documentation, a two-term query over two fields is expanded to a form like (field1:term1 field2:term1) (field1:term2 field2:term2), so one term can match in the subject while another matches in the body. Combining all the fields into one single field and searching that, as you have done, remains a valid alternative, but MultiFieldQueryParser saves you from maintaining the extra field.
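For comparison, the combined ALL field from the question amounts to very little code at indexing time. A minimal sketch in plain Java, with the document modelled as a map of field name to text; CatchAllField and the field name "all" are hypothetical, and no Lucene classes are used:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a catch-all field: every field's text is copied into one extra field.
class CatchAllField {
    // Returns a copy of the document with an added "all" field that
    // concatenates the values of the other fields in insertion order.
    static Map<String, String> withAllField(Map<String, String> doc) {
        Map<String, String> copy = new LinkedHashMap<>(doc);
        copy.put("all", String.join(" ", doc.values()));
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> email = new LinkedHashMap<>();
        email.put("subject", "funny unicorn");
        email.put("body", "an SO question");
        System.out.println(withAllField(email).get("all"));
    }
}
```

The cost of this approach is that the text is indexed twice, once per original field and once in the combined field.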

How would you reproduce a tagging system like the one StackOverflow uses?

I am trying to produce a tagging system for a recruitment agency model and love the way SO separates tags and searches for the remaining phrases.
How would you compare the tags in a table to the search query etc...
I have come up with the following, but it has some hiccups...
User enters search query
Full text SQL contains() search on tbl_tags
Returns 5 results
Check if each "exact tag phrase" exists in original query string.
If it does exist then add tagID to array.
Remove tag names from original search string...
Search in tbl_people for people with linked TagIDs and search text fields with remaining text.
Example search : French Project Managers with Oracle experience
Tags : [French] [Project Manager]s with [Oracle] experience
Remaining text : s with experience
Now the problem comes when I search for Project Managers it leaves me with a surplus "s"... and there are probably other bugs with this logic too that I cannot account for...
Any ideas on how to make the logic perfect?
Thanks in advance, I understand this might be a bit of an abstract question...
You're missing a key ingredient of how StackOverflow does its search. SO requires that the user delineate the tags in the search string by explicitly putting brackets around them. The (probably simplified) logic would then be:
Extract marked tags using regex to detect contents inside brackets
Using list of most common tags, scan string for unmarked tags and extract them.
Remove tag meta characters
Perform full-text search, filtered by tags
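The first and third steps above (extracting bracketed tags and removing the tag meta characters) can be sketched with a regular expression. This is an illustration only, not StackOverflow's actual code: the class TagQuery is hypothetical, and the scan for unmarked common tags and the search itself are left out:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Splits a search string into explicitly bracketed tags and the remaining free text.
class TagQuery {
    // Matches one bracketed tag such as [oracle] and captures its contents.
    private static final Pattern TAG = Pattern.compile("\\[([^\\]]+)\\]");

    final List<String> tags = new ArrayList<>();
    final String text;

    TagQuery(String query) {
        Matcher m = TAG.matcher(query);
        while (m.find()) {
            tags.add(m.group(1).trim());
        }
        // replaceAll resets the matcher, removes every bracketed tag,
        // and the leftover whitespace is then collapsed.
        text = m.replaceAll(" ").trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        TagQuery q = new TagQuery("[french] [project-manager] with [oracle] experience");
        System.out.println(q.tags + " / " + q.text);
    }
}
```

Because the user marks the tags explicitly, the surplus "s" problem from the question does not arise: nothing outside the brackets is ever treated as part of a tag.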

Pattern for searching entire DB record, not specific field

More and more, I'm seeing searches that not only find a substring in a specific column, but appear to search in all columns. An example is Amazon, where you can search for "Arnold" and it finds both the movie Running Man starring Arnold Schwarzenegger, and the Gund toy Arnold the Snoring Pig. I don't know what the term is for this type of search (Wide search? Global search?), and that bugs me. But what I really want to know is what the normal pattern is for accomplishing this type of search in a QUICK way.
The obvious, and slow, way to do it would be to search for the substring "Arnold" in the title, "Arnold" in the author, "Arnold" in the description, etc.
The first quick solution that comes to mind is to store a mapping for each word used to describe a product to the product itself, and then search that word mapping. That could be quick, but doesn't seem very space-efficient to me.
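The word mapping described above is usually called an inverted index. A minimal in-memory sketch, assuming each product is identified by a string id and described by a few text columns; WordIndex and the sample ids are hypothetical:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Maps each word used in any column of a product to the ids of the products using it.
class WordIndex {
    private final Map<String, Set<String>> wordToProducts = new HashMap<>();

    // Indexes every word of every given column value under the product id.
    void add(String productId, String... columnValues) {
        for (String value : columnValues) {
            for (String word : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!word.isEmpty()) {
                    wordToProducts.computeIfAbsent(word, w -> new TreeSet<>()).add(productId);
                }
            }
        }
    }

    // One map lookup answers the query regardless of which column held the word.
    Set<String> search(String word) {
        return wordToProducts.getOrDefault(word.toLowerCase(Locale.ROOT), Set.of());
    }

    public static void main(String[] args) {
        WordIndex index = new WordIndex();
        index.add("movie-1", "The Running Man", "Arnold Schwarzenegger");
        index.add("toy-7", "Arnold the Snoring Pig", "Gund");
        System.out.println(index.search("Arnold"));
    }
}
```

This version matches whole words only; substring matching would need a different structure.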
There are probably a hundred ways to accomplish this, some of which probably don't even use a database. But what is the norm?
I've done this in the past by storing an XML version of items in an XML column in the table, then searching in that column instead of the others.
Maybe they're not storing the data the way you expect.
They could, for example, store all titles, authors, descriptions, and every other searchable field in one table with an attribute to distinguish the field's type.