It seems that by default a Lucene index that indexes "uris" will index both nodes and properties. How can properties be excluded from search results?
The documentation shows a setting:
luc:exclude luc:setParam "bnode".
However, its only valid values are "literal", "bnode", and "uri". How can property URIs be excluded? (They are not something a search would be interested in.)
I assume that you're using https://graphdb.ontotext.com/documentation/standard/full-text-search.html and not https://graphdb.ontotext.com/documentation/standard/lucene-graphdb-connector.html ?
The doc doesn't show what you show above, but shows
luc:exclude luc:setParam "hello.*"
which means "exclude strings that match the regex".
Which things to index is controlled by
luc:include luc:setParam "literal" # literal, uri, centre
If I understand correctly, you want to index URIs of nodes, but not URIs of outgoing properties? Then the answer would depend on the kind of molecule you are traversing.
luc:include luc:setParam "literal centre" will index literals plus only the central node's URI, which is probably what you want
With luc:excludePredicates you can list all the properties you want to exclude, but that will also cut out the nodes that they reach...
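If it helps, here is a minimal sketch of applying that configuration programmatically, assuming RDF4J and a GraphDB repository at a hypothetical URL (the repository id "myrepo" and the index name luc:myIndex are placeholders):

import org.eclipse.rdf4j.query.QueryLanguage;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.http.HTTPRepository;

public class CreateFtsIndex {
    public static void main(String[] args) throws Exception {
        // Hypothetical repository URL; adjust host, port, and repository id.
        Repository repo = new HTTPRepository("http://localhost:7200/repositories/myrepo");
        repo.init();
        try (RepositoryConnection conn = repo.getConnection()) {
            // Index literals plus the central node's URI, leaving property URIs out.
            conn.prepareUpdate(QueryLanguage.SPARQL,
                "PREFIX luc: <http://www.ontotext.com/owlim/lucene#>\n"
                + "INSERT DATA { luc:include luc:setParam \"literal centre\" . }").execute();
            // Materialize the index under a name of your choice.
            conn.prepareUpdate(QueryLanguage.SPARQL,
                "PREFIX luc: <http://www.ontotext.com/owlim/lucene#>\n"
                + "INSERT DATA { luc:myIndex luc:createIndex \"true\" . }").execute();
        }
    }
}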
Related
I want to search for an exact combination of words in all nodes in AEM using QueryBuilder.
When I try to debug the query at http://localhost:4502/libs/cq/search/content/querydebug.html, it returns results that don't match my query.
For example, if I want to search for 'foo bar' in all nodes, I need to receive all nodes that contain 'Foo Bar', 'foo Bar', 'Foo bar', or 'FOO BAR', but not nodes with only 'foo', only 'bar', or 'foo-bar'. The query in the service is done using QueryBuilder.
QueryBuilder is useful when you try to perform a query similar to SQL, where you search against a property and its value. The full-text search capabilities of the query debug interface are very limited, as you have experienced.
However, remember that AEM uses an underlying Lucene and/or Solr index, and it does provide a way to perform a native Solr/Lucene query.
First, create an embedded Solr index (embedded is sufficient for a local development AEM instance) as described under "Configuring AEM with an embedded SOLR server" in https://docs.adobe.com/docs/en/aem/6-0/deploy/upgrade/queries-and-indexing.html. This will trigger Solr indexing of your JCR content.
Once indexing is complete (as seen in the logs), you can perform native queries using the crx/de query interface.
Example query: select [jcr:path] from [nt:base] where native('solr', '<filter>?<solr_query_goes_here>'). Quite obviously, you need to be familiar with Solr queries. Thanks to the following SlideShare (slide 50 talks about native queries within AEM): http://www.slideshare.net/justinedelson/demystifying-oak-search
AEM support for native Solr queries is a bit patchy. You might need to edit the Solr schema.xml file manually (created under the crx-quickstart folder) to add additional filters, custom fields, etc. We successfully tuned Solr within AEM to perform a spatial search using the above method.
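For completeness, here is a sketch of running such a native query from Java through the plain JCR API (exception handling and session acquisition are omitted, and the Solr query string text:"foo bar" is just an assumed example):

import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;
import javax.jcr.query.RowIterator;

// 'session' is an existing javax.jcr.Session, e.g. adapted from a ResourceResolver in AEM.
QueryManager qm = session.getWorkspace().getQueryManager();
Query q = qm.createQuery(
    "select [jcr:path] from [nt:base] where native('solr', 'text:\"foo bar\"')",
    Query.JCR_SQL2);
QueryResult result = q.execute();
for (RowIterator rows = result.getRows(); rows.hasNext(); ) {
    System.out.println(rows.nextRow().getPath());
}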
If you need all sorts of combinations for "foo bar", then you have to query:
fulltext=foo bar
By default you will only get the first 10 results. To get all of them, add:
p.limit=-1
You may want to specify the path:
path=/content/website/
Visit the Adobe Query Builder API documentation for more info.
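The same predicates can also be passed to QueryBuilder programmatically; a minimal sketch, assuming you already have a QueryBuilder instance and a JCR Session (the path is the placeholder from above):

import java.util.HashMap;
import java.util.Map;

import com.day.cq.search.PredicateGroup;
import com.day.cq.search.Query;
import com.day.cq.search.result.SearchResult;

Map<String, String> predicates = new HashMap<>();
predicates.put("path", "/content/website");
predicates.put("fulltext", "foo bar");
predicates.put("p.limit", "-1"); // -1 means return all hits

// 'queryBuilder' and 'session' are assumed to be available, e.g. via @Reference and a ResourceResolver.
Query query = queryBuilder.createQuery(PredicateGroup.create(predicates), session);
SearchResult result = query.getResult();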
Behind the scenes, AEM creates an XPath query and then executes it. Then, for any part of the query that doesn't map to XPath, it runs through the results and filters them.
You should also consider whether there is a property to match, as opposed to any text. That will give you much better results, since you want accuracy. Right now you are casting an overly wide net, and I think you should consider restricting it, if for nothing other than performance reasons. Just a suggestion.
You say the results don't match your query; can you give us some idea of what comes back? And can you please put your actual query here? That will make it much easier to help.
This is a minimal example that provides a full-text search:
Query query = queryBuilder.createQuery(...);
// limit path
Predicate path = new Predicate(PathPredicateEvaluator.PATH);
path.set(PathPredicateEvaluator.PATH, "/content/where/ever");
query.getPredicates().add(path);
// Fulltext
Predicate fulltextSearch = new Predicate(FulltextPredicateEvaluator.FULLTEXT);
fulltextSearch.set(FulltextPredicateEvaluator.FULLTEXT, "foo bar");
fulltextSearch.set(FulltextPredicateEvaluator.REL_PATH, "jcr:content");
query.getPredicates().add(fulltextSearch);
// can I haz excerpt?
query.setExcerpt(true);
// Paging?
query.setStart(...);
query.setHitsPerPage(-1);
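To consume the result afterwards, something like this should work with the standard QueryBuilder result API (a sketch; exception handling omitted):

import com.day.cq.search.result.Hit;
import com.day.cq.search.result.SearchResult;

SearchResult result = query.getResult();
for (Hit hit : result.getHits()) {
    String path = hit.getPath();       // path of the matched node
    String excerpt = hit.getExcerpt(); // populated because of setExcerpt(true)
}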
Note: it's not required to configure a Solr index or anything like that; you should be fine out of the box.
But if you limit the search to specific fields, you should create an index entry under oak:index. You can find a great cheat sheet here.
I'm not sure if this helps, but to get all the combinations of nodes that have the text I'm looking for, I use jcr:like in XPath.
For example, if I want to search all the nodes that have any property with "Foo bar" in its value or key, my query looks like:
/jcr:root/content/yourpath//*[jcr:like(@*, '%FOO bar%')]
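To run that XPath statement from code, a sketch via the JCR query API (exception handling omitted; XPath is deprecated in JCR 2.0 but still supported by AEM):

import javax.jcr.NodeIterator;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

// 'session' is an existing javax.jcr.Session.
QueryManager qm = session.getWorkspace().getQueryManager();
Query q = qm.createQuery(
    "/jcr:root/content/yourpath//*[jcr:like(@*, '%FOO bar%')]",
    Query.XPATH); // deprecated constant, still functional
NodeIterator nodes = q.execute().getNodes();
while (nodes.hasNext()) {
    System.out.println(nodes.nextNode().getPath());
}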
You will not get that flexibility in QueryBuilder, but you can still get what you want by using JCR-SQL2.
The following query will return all entries with "Foo Bar", "foo bar", "foo Bar", and "Foo bar", but not "foo", "bar", or "foo-bar", when your value is "foo bar":
SELECT * FROM [nt:unstructured] WHERE ISDESCENDANTNODE('/content/yourpath') AND LOWER([prop]) LIKE '%foo bar%' ORDER BY [cq:lastModified] DESC
Just ensure that when checking for values in the repository you send the value in lowercase, for a case-insensitive search.
For a case-sensitive search you can use:
SELECT * FROM [nt:unstructured] WHERE ISDESCENDANTNODE('/content/yourpath') AND [prop] LIKE '%foo bar%' ORDER BY [cq:lastModified] DESC
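And a sketch of running the case-insensitive variant from Java, lowercasing the search value on the way in via a bind variable (exception handling omitted; the property name "prop" is a placeholder):

import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;

String searchTerm = "Foo Bar";
QueryManager qm = session.getWorkspace().getQueryManager();
Query q = qm.createQuery(
    "SELECT * FROM [nt:unstructured] AS n"
    + " WHERE ISDESCENDANTNODE(n, '/content/yourpath')"
    + " AND LOWER(n.[prop]) LIKE $pattern"
    + " ORDER BY n.[cq:lastModified] DESC",
    Query.JCR_SQL2);
// Lowercase the value so the comparison against LOWER([prop]) is case-insensitive.
q.bindValue("pattern", session.getValueFactory()
    .createValue("%" + searchTerm.toLowerCase() + "%"));
QueryResult result = q.execute();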
I'm new to Solr and I have a very specific problem that I need to solve:
I have a CSV file that contains my Solr documents. One of the columns (fields) is not only multiValued but also contains 'subfields',
for example:
"id":"0101",
"addMaterials":[{"name":"Mat1", "property":"prop1"},
{"name":"Mat2","property":"prop2"},
{"name":"Mat3","property":"prop3"}],
"mainProperty":"mainproperty1",
"URL":"http://www.mySite..."
where id, addMaterials, mainProperty, and URL are my main fields, while 'name' and 'property' are my subfields. I know that Solr is designed to handle denormalized documents, but denormalizing is not a possible solution for my application.
What I'm thinking is to just separate my data set and move the fields (that have subfields) to another document, and somehow make a new field to link it to the original document (e.g. fromIdField).
Is there any other solution for this? My minimum goal is to index the values of the addMaterials field (even without indexing the subfields), going
from:
"addMaterials":[{"name":"Mat1", "property":"prop1"},
{"name":"Mat2","property":"prop2"},
{"name":"Mat3","property":"prop3"}],
to
"addMaterials":{"name":"Mat1", "property":"prop1"}
"addMaterials":{"name":"Mat2", "property":"prop2"}
"addMaterials":{"name":"Mat3", "property":"prop3"}
Thanks in advance.
I have found a solution to my problem. Instead of separating my data set, I kept addMaterials as a multiValued field and ignored the subfields, so I have only one multiValued field to index. What I did was use Solr's CSV update request handler to index my CSV file, with },{ as the separator for the addMaterials multiValued field. The indexed document looks like this:
"addMaterials": ["[{\"name\":\"Mat1\", \"property\":\"prop1\"",
"\"name\":\"Mat2\", \"property\":\"prop2\"",
"\"name\":\"Mat3\", \"property\":\"prop3\"}]"]
I indexed my document using this:
curl "http://localhost:8983/solr/<coreName>/update/csv?
stream.file=C:/userName/Solr/solr-5.2.0/documentFolder/myFile.csv&
f.addMaterials.split=true&
f.addMaterials.separator=\},\{&
stream.contentType=text/plain;charset=utf-8"
Also, this assumes that the addMaterials field is a multiValued field, so make sure you modify your schema before indexing your document using the procedure above. Otherwise, it will give an error saying that the field is not multiValued.
Of course, if you need to query against the subfields, then I guess you can use Solr's {!join} query parser.
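If you do end up splitting the materials into separate documents linked by something like fromIdField, a join query from SolrJ might look like this (a sketch; the core name, the link field, and the name:Mat1 filter are all hypothetical):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class JoinExample {
    public static void main(String[] args) throws Exception {
        // Solr 5.x style client; newer SolrJ versions use HttpSolrClient.Builder instead.
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/coreName");
        // Return parent documents whose linked material documents match name:Mat1.
        SolrQuery query = new SolrQuery("{!join from=fromIdField to=id}name:Mat1");
        QueryResponse response = client.query(query);
        System.out.println(response.getResults().getNumFound() + " matching parents");
        client.close();
    }
}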
If I put messages into the index "[logstash-example-]YYYY.MM.DD" then Kibana can show the log messages in charts, but if they are in "[example-]YYYY.MM.DD" it won't find them.
(A curl query gives back the correct result in the latter case.)
According to documentation it should work:
"For example [web-]YYYY.MM.DD,[mail-]YYYY.MM.DD Please also note that indices should rollover at midnight UTC."
(Elasticsearch 1.3.4, Kibana 3.1.0)
You have to modify your Kibana dashboard settings:
Click Configure dashboard at the top right in Kibana.
Select the Index tab.
Change the Index pattern to your new index pattern, for example: [example-]YYYY.MM.DD
Hope this can help you.
So I have this .rdf file that I have loaded into Stardog, and I am using Pubby running over Jetty to browse the triple store.
In my RDF file I have several blank nodes, which are given blank node identifiers by Stardog. This is a snippet of the RDF file:
<kbp:ORG rdf:about="http://somehostname/resource/res1">
<kbp:canonical_mention>
<rdf:Description>
<kbp:mentionStart rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1234</kbp:mentionStart>
<kbp:mentionEnd rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1239</kbp:mentionEnd>
</rdf:Description>
</kbp:canonical_mention>
</kbp:ORG>
So basically I have some resource "res1" which links to a blank node that has mention start and mention end offset values.
The snippet of the config.ttl file for Pubby is shown below.
conf:dataset [
# SPARQL endpoint URL of the dataset
conf:sparqlEndpoint <http://localhost:5822/xxx/query>;
#conf:sparqlEndpoint <http://localhost:5822/foaf/query>;
# Default graph name to query (not necessary for most endpoints)
conf:sparqlDefaultGraph <http://dbpedia.org>;
# Common URI prefix of all resource URIs in the SPARQL dataset
conf:datasetBase <http://somehostname/>;
...
...
So the key thing is the datasetBase setting, which maps URIs to URLs.
When I try to map this, there is an "Anonymous node" link, but upon clicking it, nothing is displayed. My guess is that this is because the blank node has some identifier like _:bnode1234 which is not mapped by Pubby.
I wanted to know if anyone out there knows how to map these blank nodes.
(Note: if I load this RDF as a static RDF file directly into Pubby, it works fine. But when I use Stardog as the triple store, this mapping doesn't quite work.)
It probably works in Pubby because it keeps the bnode ids available; generally, the SPARQL spec does not guarantee or require that bnode identifiers be persistent. That is, you can issue the same query multiple times, getting back the same result set (including bnodes), and the identifiers can be different each time. Similarly, a bnode identifier in a query is treated like a variable; it does not mean you are querying for that specific bnode.
Thus, Pubby is probably being helpful and making that work, which is why using it directly works, as opposed to going through a third-party database.
Stardog does support the Jena/ARQ trick of putting a bnode identifier in angle brackets, that is, <_:bnode1234>, which is taken to mean the bnode with the identifier "bnode1234". If you can get Pubby to use that syntax in queries for bnodes, it will probably work.
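For illustration, a sketch of that trick with Jena/ARQ against the Stardog endpoint from the config above (the bnode id is a placeholder, and the <_:...> syntax is an ARQ extension, not standard SPARQL):

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class BnodeLookup {
    public static void main(String[] args) {
        // <_:bnode1234> addresses a specific bnode by identifier.
        String query = "SELECT ?p ?o WHERE { <_:bnode1234> ?p ?o }";
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(
                "http://localhost:5822/xxx/query", query)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("p") + " " + row.get("o"));
            }
        }
    }
}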
But generally, I think this is something you will have to take up with the Pubby developers.
Some of the documents I store in Lucene have fields that contain file paths or URIs. I'd like users to be able to retrieve these documents if their query terms contain a path or URI segment.
For example, if the path is
C:\home\user\research\whitepapers\analysis\detail.txt
I'd like the user to be able to find it by querying for path:whitepapers.
Likewise, if the URI is
http://www.stackoverflow.com/questions/ask
a query containing uri:questions would retrieve it.
Do I need to use a special analyzer for these fields, or will StandardAnalyzer do the job? Will I need to do any pre-processing of these fields (to replace the forward slashes or backslashes with spaces, for example)?
Suggestions welcome!
You can use StandardAnalyzer.
I tested this by adding the following function to Lucene's TestStandardAnalyzer.java:
public void testBackslashes() throws Exception {
assertAnalyzesTo(a, "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt", new String[]{"c","home", "user", "research","whitepapers", "analysis", "detail.txt"});
assertAnalyzesTo(a, "http://www.stackoverflow.com/questions/ask", new String[]{"http", "www.stackoverflow.com","questions","ask"});
}
This unit test passed using Lucene 2.9.1. You may want to try it with your specific Lucene distribution. I guess it does what you want, while keeping domain names and file names unbroken. Did I mention that I like unit tests?
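If you want to eyeball the tokens outside of a unit test, here is a sketch against a more recent Lucene API (the 2.9 API differs slightly; the field name "path" is arbitrary):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShowTokens {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        try (TokenStream ts = analyzer.tokenStream(
                "path", "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // one analyzed token per line
            }
            ts.end();
        }
    }
}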