Duplicates when using nutch -> elasticsearch solution - indexing

I have crawled some data using Nutch and managed to inject it into Elasticsearch. But I have one problem: if I inject the crawled data again, it creates duplicates. Is there any way to prevent this?
Has anyone managed to solve this or have any suggestions on how to solve it?
/Samus

If you index each crawled page/document with the same id, Elasticsearch won't duplicate it; re-indexing simply overwrites the existing document. You could use a checksum/hash function to turn the page's URL into a deterministic ID.
You can also use the op_type parameter to ensure that a document is not reindexed if that id already exists:
The index operation also accepts an op_type that can be used to force
a create operation, allowing for “put-if-absent” behavior. When create
is used, the index operation will fail if a document by that id
already exists in the index.
ElasticSearch index API
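
For illustration, here is a minimal sketch of that idea, assuming the Java transport Client from the 1.x/2.x-era API; the index and type names ("crawl"/"page") are made up, and the id is derived deterministically from the URL:

import java.nio.charset.StandardCharsets;
import java.util.UUID;

import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;

public class CrawlIndexer {

    // Derive a deterministic document id from the page URL, so re-injecting
    // the same page always targets the same document instead of creating a new one.
    static String idForUrl(String url) {
        return UUID.nameUUIDFromBytes(url.getBytes(StandardCharsets.UTF_8)).toString();
    }

    // Index a crawled page; with op_type=create the request fails (instead of
    // overwriting) if a document with this id already exists.
    static void indexPage(Client client, String url, String pageJson) {
        client.prepareIndex("crawl", "page", idForUrl(url))   // index/type names are placeholders
              .setSource(pageJson)
              .setOpType(IndexRequest.OpType.CREATE)
              .execute()
              .actionGet();                                    // throws if the id already exists
    }
}

With CREATE, re-injecting the same URL makes the request fail instead of silently creating a second copy, so duplicates surface as errors you can catch or log.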

One way: you can keep a checksum of every piece of data you have entered into Elasticsearch in some external db and cross-reference those checksums before attempting to send data to Elasticsearch.
Alternatively, you can run a "more like this" query to find similar documents and take a decision based on that.
LINK - http://www.elasticsearch.org/guide/reference/query-dsl/mlt-field-query.html
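
A rough sketch of the checksum idea, pure JDK; the in-memory Set stands in for whatever external db table you would actually use, and sendToElasticsearch is a hypothetical hook for your indexing code:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class ChecksumDeduper {

    // Stand-in for the external db/table where you keep the checksums of
    // everything already sent to Elasticsearch.
    private final Set<String> seenChecksums = new HashSet<>();

    static String sha1Hex(String content) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                                     .digest(content.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Returns true if the document was sent, false if it was skipped as a duplicate.
    boolean sendIfNew(String docJson) throws NoSuchAlgorithmException {
        String checksum = sha1Hex(docJson);
        if (!seenChecksums.add(checksum)) {
            return false;               // already indexed earlier, skip it
        }
        // sendToElasticsearch(docJson); // hypothetical hook: your actual indexing call
        return true;
    }
}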

Related

Performance difference in Couchbase's get by Key and select by index

As we are doing benchmark tests on our Couchbase DB, we tried to compare searching for an item by its id/key with searching for items by a query that uses a secondary index.
Following this article about indexing and performance in Couchbase, we thought the performance of the two would be the same.
However, in our tests we discovered that sometimes the search by key/id was much faster than the search that uses the secondary index.
E.g. ~3 ms to search using the index and ~0.3 ms to search by the key (a factor of 10).
The point is that this difference is not consistent: the search by key varies from 0.3 ms to 15 ms.
We are wondering:
Should search by key perform better than search by secondary index?
Should there be such a time difference between key searches?
The results you get are consistent with what I would expect. Couchbase works as a key-value store when you do any operation using the id. A key-value store is roughly a big distributed hashmap, and in this data structure you get very good performance on get/save/delete when using the id.
Whenever you store a new document, Couchbase hashes the key and assigns a virtual bucket to it (something similar to a shard). When you need to get this document back, it uses the same algorithm to find out in which virtual bucket the document is located; since the SDK has the cluster map and knows exactly which node owns which virtual buckets, your application requests the document directly from the node that owns it.
On the other hand, when you query the database, Couchbase internally has to do a map/reduce to find out where the documents are located, which is why operations by id are faster.
About your results ranging from 0.3 ms to 15 ms, it is hard to tell without debugging your environment. However, there are a number of factors that could contribute to it, e.g. whether the document is cached or not, whether the node is undersized, etc.
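
A rough sketch of the two access paths, assuming the Couchbase Java SDK 2.x API; the cluster address, credentials, bucket name, key, and query are made up for illustration:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.query.N1qlQuery;
import com.couchbase.client.java.query.N1qlQueryResult;

public class KvVsIndexLookup {

    public static void main(String[] args) {
        // Cluster address, credentials, bucket and key names are placeholders.
        CouchbaseCluster cluster = CouchbaseCluster.create("localhost");
        cluster.authenticate("user", "password");
        Bucket bucket = cluster.openBucket("items");

        // Key/value path: the SDK hashes the key, finds the virtual bucket and
        // asks the owning node directly -- no index involved.
        JsonDocument byKey = bucket.get("item::0001");

        // Query path: the query service consults the secondary index first and
        // then fetches the documents, which is why it is typically slower.
        N1qlQueryResult byIndex = bucket.query(
                N1qlQuery.simple("SELECT * FROM items WHERE type = 'widget'"));

        System.out.println(byKey);
        System.out.println(byIndex.allRows());

        cluster.disconnect();
    }
}

Timing both calls in a loop (after a warm-up) is a reasonable way to reproduce the 10x gap described in the question.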
To add to @deniswrosa's answer, the secondary index will always be slower, because the index must first be traversed based on your query to find the document key, and then a key lookup is performed. Doing just the key lookup is faster if you already have the key. The amount of work to traverse the index can vary depending on how selective the index is, whether the entire index is in memory, etc. Memory-optimized indexes can ensure that the whole index is in memory, if you have enough memory to support that.
Of course, even a simple key lookup can be slower if the document in question is not in the cache and needs to be brought into memory from storage.
It is possible to achieve sub-millisecond secondary lookups at scale, but it requires some tuning of your query, your index, and possibly some of Couchbase's system parameters. Consider the following simple example:
Sample document in userBucket:
"user::000000000001" : {
"email" : "benjamin1#couchbase.com",
"userId" : "000000000001"
}
This query:
SELECT userId
FROM userBucket
WHERE
email = "benjamin1@couchbase.com"
AND userId IS NOT NULL
;
...should be able to achieve sub-millisecond performance with a properly tuned secondary index:
CREATE INDEX idx01 ON userBucket(email, userId);
Since the index completely covers the query, there is no need for the Query engine to FETCH the document from the K/V store. However, "SELECT * ..." will always cause the Query service to FETCH the document and thus will be slower than a simple K/V GET("user::000000000001").
For the best latencies, review your query plan (using the EXPLAIN syntax) and make sure your query is not FETCHing. https://docs.couchbase.com/server/6.0/n1ql/n1ql-language-reference/explain.html

how to incrementally update nested objects in elasticsearch?

I have 2 document types (in normal form in relational db):
1: post (with title, text and author fields)
2: comment (with text, author, post_id fields)
I have only one type in Elasticsearch (post) that aggregates each post with all its comments in nested form.
I want to index posts with their comments as nested objects to decrease query response time, but it will increase indexing cost significantly if I reindex the whole "post" document every time a new "comment" is added. How can I handle this efficiently? It is acceptable for me to have comment data delayed by 1 hour.
In fact there are three questions:
1. How can I update a post document with only the added comment data (without needing to reconstruct the whole post document and send it to Elasticsearch)?
2. How can I aggregate index commands related to a document and send them as one single command to Elasticsearch?
3. Is a river plugin a solution for this? Does it index comments without reconstructing the whole post document? Does it aggregate all updates related to one document and apply them with one index request?
I think this post answers your questions:
elastic search, is it possible to update nested objects without updating the entire document?
Updating multiple items at once can be done using the bulk API.
There is no river that can help you not to reindex the whole document. With nested documents you always reindex the complete document. If this happens a lot and becomes a problem, parent-child mappings are the way to go.
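
As a sketch of the batching part (question 2), assuming the 1.x/2.x-era Java Client: buffer the posts whose comments changed during your 1h window, rebuild each one's full nested document once, and send them all in a single bulk request. The index/type names ("posts"/"post") are made up:

import java.util.Map;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

public class PostBulkReindexer {

    // postsWithNewComments maps post id -> full post JSON (post fields plus the
    // complete, up-to-date nested comments array). Since nested docs are always
    // reindexed as a whole, each changed post is rebuilt once per batch window
    // (e.g. hourly) and all of them go out in one bulk request.
    static void flush(Client client, Map<String, String> postsWithNewComments) {
        BulkRequestBuilder bulk = client.prepareBulk();
        for (Map.Entry<String, String> entry : postsWithNewComments.entrySet()) {
            bulk.add(client.prepareIndex("posts", "post", entry.getKey())  // names are placeholders
                           .setSource(entry.getValue()));
        }
        BulkResponse response = bulk.execute().actionGet();
        if (response.hasFailures()) {
            // inspect the failures and retry/log as needed
            System.err.println(response.buildFailureMessage());
        }
    }
}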

Obtain all keys of a Neo4j index

I have a Neo4j database whose content is generated dynamically from a big dataset.
All “entry points” nodes are indexed on a named index (IndexManager.forNodes(…)). I can therefore look up a particular “entry point” node.
However, I would now like to enumerate all those specific nodes, but I don't know under which keys they were indexed.
Is there any way to enumerate all keys of a Neo4j index?
If not, what would be the best way to store those keys, given that a plain list of keys is an eminently non-graph-oriented data type?
UPDATE (thanks for asking details :) ): the list would be more than 2 million entries. The main use case would be to never update it after an initialization step, but other use cases might need it, so it has to be somewhat scalable.
Also, I would really prefer avoiding killing my current resilience abilities, so storing all keys at once, as opposed to adding them incrementally, would be a last-resort solution.
I would either use a different data store to supplement Neo4j (I like Redis) or try @MattiasPersson's suggestion and store the list on a node.
Is it just one list of keys or is it a list per node? You could store such a list on a specific node, say the reference node.
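
A rough sketch of that suggestion, assuming the embedded Java API of the same vintage as the question (getReferenceNode existed in 1.x and was removed later; on newer versions you would use a dedicated labeled node instead). The ENTRY_POINT relationship type name is made up; adding one relationship per key keeps the updates incremental rather than rewriting one big list:

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

public class EntryPointRegistry {

    // Relationship type name is made up for this sketch.
    private static final RelationshipType ENTRY_POINT =
            DynamicRelationshipType.withName("ENTRY_POINT");

    // Register an entry point incrementally.
    static void register(GraphDatabaseService graphDb, Node entryPoint) {
        Transaction tx = graphDb.beginTx();
        try {
            graphDb.getReferenceNode().createRelationshipTo(entryPoint, ENTRY_POINT);
            tx.success();
        } finally {
            tx.finish();
        }
    }

    // Enumerate all entry points by following the relationships back out.
    static void printEntryPoints(GraphDatabaseService graphDb) {
        Node ref = graphDb.getReferenceNode();
        for (Relationship rel : ref.getRelationships(ENTRY_POINT, Direction.OUTGOING)) {
            Node entryPoint = rel.getEndNode();
            System.out.println(entryPoint.getId());  // ...process the node...
        }
    }
}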
Instead of using a different storage, which increases complexity, you could try again with:
1) Lucene indices. Normally Lucene is able to handle this easily, especially now that the MatchAllDocsQuery is better, but one problem is that the Neo4j guys are using a very old Lucene version.
2) A special "reference" field in every node, especially for this key-traversal case, linking to the next node, where you easily get ALL properties :)
If you want to get all nodes that were indexed in a particular index, you can just do:
// graphDb is your GraphDatabaseService; <INDEX_NAME> is the name you indexed under
IndexHits<Node> hits = graphDb.index().forNodes(<INDEX_NAME>).query("*:*");
try {
    while (hits.hasNext()) {
        Node n = hits.next();
        // ...process the node...
    }
} finally {
    hits.close();
}

How do I drop an index using Lucandra?

I am using Lucandra and want to drop an entire index. The IndexReader and IndexWriter don't have all methods implemented, so even iterating through calls to deleteDocument(int docNum) isn't possible.
Has anyone run up against this and figured out how to either:
hack the Cassandra keyspace, or
make additions to the Lucandra code, or
construct an iterator to delete all docs?
The current version of Lucandra doesn't store documents numbered 1..N, so deleteDocument(int) doesn't work.
What I've done is index a field with the same term in all documents, so you can match all documents and then delete them with a deleteDocuments(Term) call.
Another option (if you only have one index per keyspace) is to truncate the Cassandra column families.
The next version of Lucandra (in development) does store documents in 1..N fashion.
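
A sketch of that catch-all-term approach, written against the Lucene 3.x-era field API that Lucandra targeted; the field name is made up, and the writer is typed as a plain Lucene IndexWriter only to keep the sketch compilable. With Lucandra you would pass its own IndexWriter, which per the answer supports the same deleteDocuments(Term) call:

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class DropAllDocs {

    // The field name is arbitrary; it just has to carry the same term in every document.
    private static final String CATCH_ALL_FIELD = "__all";
    private static final String CATCH_ALL_VALUE = "1";

    // At indexing time, tag every document with the catch-all term.
    static void tag(Document doc) {
        doc.add(new Field(CATCH_ALL_FIELD, CATCH_ALL_VALUE,
                          Field.Store.NO, Field.Index.NOT_ANALYZED));
    }

    // Later, "drop" the index by deleting everything that carries that term.
    static void dropAll(IndexWriter writer) throws IOException {
        writer.deleteDocuments(new Term(CATCH_ALL_FIELD, CATCH_ALL_VALUE));
        writer.commit();
    }
}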

How to deal with constantly changing data and SOLR indexes?

Afternoon guys,
I'm using a SOLR index for searching through items on my site. The search results contain the item's average rating and the number of comments the item has. The results can be sorted by both rating and number of comments.
But obviously with the Solr index, these numbers aren't updated until the db (~2 million rows) is reindexed (probably done nightly).
What would you guys think is the best way to approach this?
Well, I think you should change your db-index sync policy:
First approach: when committing database changes, also post the changes (a batch of them) to the index. You should write a mapper tier to map your domain objects to Solr docs (remember: persist first, and if that goes OK, then index; this works fine for us ;-)). If you want to achieve near-real-time index updates, you should look at solutions like Zoie (LinkedIn's Lucene-based search framework).
Second approach: take a look at delta imports (and schedule more frequent index updates).
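
A rough sketch of the first approach using SolrJ (client class names have changed across Solr versions; HttpSolrClient is the newer one). The URL, core name, field names, and the Item class are made up; note that add() replaces the whole document, so either send every indexed field or use atomic updates on versions that support them:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ItemIndexSync {

    // Called right after the db transaction commits: map the changed domain
    // objects to Solr docs and push them as one batch.
    static void pushChanges(SolrClient solr, List<Item> changedItems) throws Exception {
        List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
        for (Item item : changedItems) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", item.getId());               // field names are made up;
            doc.addField("avg_rating", item.getAvgRating()); // include every field your
            doc.addField("num_comments", item.getNumComments()); // schema indexes for the item
            docs.add(doc);
        }
        solr.add(docs);
        solr.commit();  // or rely on autoCommit/commitWithin instead of hard commits
    }

    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/items").build();
        // pushChanges(solr, itemsTouchedByTheLastDbTransaction);
        solr.close();
    }
}

// Minimal stand-in for the domain object.
class Item {
    private final String id; private final double avgRating; private final int numComments;
    Item(String id, double avgRating, int numComments) {
        this.id = id; this.avgRating = avgRating; this.numComments = numComments;
    }
    String getId() { return id; }
    double getAvgRating() { return avgRating; }
    int getNumComments() { return numComments; }
}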