I need to fetch a document in RavenDB by Id plus a secondary parameter.
To be more precise, I'd like to load a document by document Id and by ApiKey; if the ApiKey of the given document does not match, I want null back.
My question is: is it faster to do a Query comparing both Id and ApiKey, or is it faster to do a Load by Id and throw away the document in code if the ApiKey does not match? My documents are roughly 20k in size.
Do a Load by Id, then compare. In RavenDB a Load goes straight to the document store by key (and can be served from the session cache), whereas a Query has to go through an index, so the Load is both cheaper and never stale.
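For illustration, a minimal sketch of that load-then-compare pattern, written here against the RavenDB JVM client session API (the C# client is analogous); ApiDocument and its apiKey field are hypothetical stand-ins for your own document type:

// ApiDocument is a hypothetical POJO with an apiKey property.
public ApiDocument loadIfKeyMatches(IDocumentSession session, String id, String apiKey) {
    // load() fetches directly by document key; it never touches an index,
    // so it cannot return stale results.
    ApiDocument doc = session.load(ApiDocument.class, id);
    if (doc == null || !apiKey.equals(doc.getApiKey())) {
        return null; // a wrong ApiKey is treated the same as "not found"
    }
    return doc;
}

Since the whole 20k document comes back either way, the saving from the Load is in skipping the index, not in bytes transferred.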
I have a workflow with a layer of pre-processing that extracts fields; these are later handed to another process to be ingested into Solr. The original files are documents made up of records; think tabular data.
Some of these columns are indexed in Solr so that you can retrieve the relevant documentID for a given value of the field. I.e. you query like
q=indexedField:indexedValue1
fl=documentId
and have a response like:
... response: {documentID1, documentID3}
assuming indexedValue1 is present in field indexedField in documents documentID1 and documentID3.
Each record will then have a value in one of the fields we want to index. The pre-processing concatenates these values into one (long) text field, with each value as a token, so you can later search by them. The indexed field, when handed to Morphlines, looks like this:
...
value1 value2 ... valueN
...
Some fields are extracted and then regrouped into one field, so that if you search by a value you can know which document it is in.
(fairly simple until here)
However, how could I also store in Solr, along with each token that I want to search by, the offset (or record number) in the original file? The problem is not extracting this information (that is another problem, but one we can solve).
I.e. you would query like above, but would get, for each document ID, the original record number or file offset where the record is located, something like:
... response:{ {documentID1, [1234, 5678]}, { documentID3, [] } }
Is this possible at all? If so, what is the correct Solr data structure to model it efficiently?
It sounds like what you are looking for is payloads. This functionality is present in Solr, but it often requires custom code to fully benefit from it.
The challenge, however, is that you seem to want to return the payloads associated with the tokens that matched during the search. That is even more complicated, as search focuses on returning documents; extracting what matched within a specific document is a separate challenge, usually solved by highlighters.
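To make the indexing side concrete, here is a hedged sketch in plain Lucene (which is what Solr builds on): pre-process each value into the form value|offset, let DelimitedPayloadTokenFilter attach the offset to the token as a payload, and read it back per term via PostingsEnum. The field name indexedField and the token format are assumptions for illustration; this targets the Lucene 5+ API (classes from org.apache.lucene.analysis.payloads and org.apache.lucene.index):

// Analyzer that turns "value1|1234 value2|5678" into whitespace tokens
// whose payload is the integer after the '|'.
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream filter =
            new DelimitedPayloadTokenFilter(source, '|', new IntegerEncoder());
        return new TokenStreamComponents(source, filter);
    }
};

// Reading payloads back for one term, per segment (LeafReader):
PostingsEnum postings =
    leafReader.postings(new Term("indexedField", "value1"), PostingsEnum.PAYLOADS);
while (postings != null && postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
    for (int i = 0; i < postings.freq(); i++) {
        postings.nextPosition();                  // positions must be advanced first
        BytesRef payload = postings.getPayload(); // the encoded offset, may be null
    }
}

Mapping the per-document payload lists into the Solr response itself is the part that needs custom code (e.g. a custom SearchComponent); the sketch only shows what the index can store.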
Does Elastic/Lucene really need to store all the indexed data in a document? Couldn't you just pass data through it, so that Lucene indexes the words into its inverted index, and keep a single field for each document with the URL (or whatever pointer makes sense for you) telling you where the document came from?
A quick example would be indexing Wikipedia.org. If I pass each webpage to Elastic/Lucene to index, why do I need to save each webpage's main text in a field, if Lucene has indexed it and has a corresponding URL field to reply to searches with?
We pay cloud providers a lot of money to store largely redundant data. I'm just wondering: if Lucene searches its inverted index rather than the actual fields we save data into, why save that data if we don't want it?
Is there a way to index full-text documents in Elastic without having to save all of the full-text data from those documents?
There are a number of options for the _source field, which is the field that actually stores the original document. You can disable it completely, or decide which fields to keep. More information can be found in the docs:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
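For example, to keep a field searchable while excluding it from the stored _source (the index name my-index and field body_text are made up for illustration), the mapping looks roughly like this:

curl -X PUT 'localhost:9200/my-index' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "_source": {
      "excludes": [ "body_text" ]
    }
  }
}'

The excluded field is still fully indexed and searchable; it just isn't stored, so it cannot be returned in results or highlighted, and you lose the ability to reindex or update documents from _source. Setting "enabled": false on _source drops the stored document entirely, with the same trade-offs.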
I would like to query for a list of particular documents with one call to CouchDB.
With SQL I would do something like
SELECT *
FROM database.table
WHERE database.table.id
IN (2,4,56);
What is a recipe for doing this in CouchDB by either _id or another field?
You need to use the view's keys query parameter to get records whose keys are in the specified set.
function (doc) {
  emit(doc.table.id, null);
}
And then
GET /db/_design/ddoc_name/_view/by_table_id?keys=[2,4,56]
To retrieve the document contents at the same time, just add the include_docs=true query parameter to your request (note that CouchDB expects lowercase true).
UPD: You might instead be interested in retrieving the documents referenced by these ids (2, 4, 56). By default, CouchDB views map the emitted keys to the documents they were emitted from. To tweak this behaviour you can use the linked-documents trick:
function (doc) {
  emit(doc.table.id, {'_id': doc.table.id});
}
And now request
GET /db/_design/ddoc_name/_view/by_table_id?keys=[2,4,56]&include_docs=true
will return rows whose id field points to the document that holds the key (2, 4 or 56) and whose doc field contains the content of the referenced document.
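The response shape is roughly the following (ids, revisions and contents are made up for illustration); id is the emitting document, doc is the linked one:

{"total_rows": 3, "offset": 0, "rows": [
  {"id": "record-a", "key": 2, "value": {"_id": "2"},
   "doc": {"_id": "2", "_rev": "1-abc", "name": "some content"}},
  {"id": "record-b", "key": 4, "value": {"_id": "4"},
   "doc": {"_id": "4", "_rev": "1-def", "name": "more content"}}
]}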
In CouchDB, the bulk document API is used for this:
curl -d '{"keys":["2","4", "56"]}' -X POST http://127.0.0.1:5984/foo/_all_docs?include_docs=true
http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
We have data in our graph that is indexed by Lucene, and we need to query it with a Field Grouping.
The example in the link above shows the Lucene syntax to be:
title:(+return +"pink panther")
I can't figure out how to send a request like that over HTTP to the REST interface. Specifically, the space in the second term is causing me problems.
Our data is actually a list and I need to match on multiple items:
regions:(+asia +"north america")
Anyone have any ideas?
Update: For the record, the following URL-encoded string works for this particular query:
regions%3A%28%2Basia+%2B%22north+america%22%29
Isn't it enough to just URL-encode the query using java.net.URLEncoder or something similar?
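For instance, using the exact query from the question, this sketch reproduces the working string above; note that URLEncoder targets application/x-www-form-urlencoded, which is why spaces come out as '+':

import java.net.URLEncoder;

public class EncodeQuery {
    public static void main(String[] args) throws Exception {
        String query = "regions:(+asia +\"north america\")";
        String encoded = URLEncoder.encode(query, "UTF-8");
        System.out.println(encoded);
        // prints: regions%3A%28%2Basia+%2B%22north+america%22%29
    }
}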
When I search for a query in Lucene, I receive a list of documents as the result. But how can I get the hits within those documents? I want to access the payloads of the words that were found by the query.
If the query contains only one term, you can simply use TermPositions to access the payload of that term. But if you have a more complex query, with phrase search, proximity search and so on, you can't just look up the individual terms in TermPositions.
I would like to receive a List<Token>, a TokenStream or something similar that contains all the tokens found by the query. Then I could iterate over the list and access the payload of each token.
I solved my problem by using SpanQueries. Nearly every Query can be expressed as a SpanQuery. A SpanQuery gives access to the spans where the hits are within a document. Because the normal QueryParser doesn't produce SpanQueries, I had to write my own parser that only creates SpanQueries. Another option would be the SurroundParser from Lucene-Contrib, which also creates SpanQueries.
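As a hedged sketch against the 3.x-era Spans API (the same generation as the TermPositions mentioned in the question); the field and terms are placeholders:

// Build a SpanQuery by hand; a phrase like "pink panther" becomes a
// SpanNearQuery with slop 0 and in-order terms.
SpanQuery query = new SpanNearQuery(
    new SpanQuery[] {
        new SpanTermQuery(new Term("title", "pink")),
        new SpanTermQuery(new Term("title", "panther"))
    },
    0,      // slop: the terms must be adjacent
    true);  // inOrder

Spans spans = query.getSpans(reader); // reader is an open IndexReader
while (spans.next()) {
    // spans.start()/end() are the token positions of the hit in spans.doc()
    if (spans.isPayloadAvailable()) {
        for (byte[] payload : spans.getPayload()) {
            // decode the payload bytes however they were encoded at index time
        }
    }
}

In later Lucene versions the equivalent is SpanWeight.getSpans per leaf reader, but the idea is the same: the span tells you where the hit is, and the payload rides along with it.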
I think you'll want to start by looking at the Lucene Highlighter, as it highlights the matching terms in the document.