When clustering with OpenRefine, is there a way to "exclude" a string from a cluster? Right now it feels like it either clusters everything or nothing - openrefine

When using the clustering function in OpenRefine, you can select the "Merge?" option to merge the strings that were grouped together by the method of your choice. But what if the method clusters most of them correctly except for one string that I can manually identify as not belonging in the cluster? Is there a way to exclude that specific string from the rest of the cluster?

Unfortunately there is currently no way of excluding or selecting a subset of terms from a cluster. The only two options I can think of are:
a) Modify the clustering algorithm you are using to try to get better clustering which doesn't include the incorrect terms.
b) Go to 'browse cluster' and mark the rows containing the terms you don't want in the cluster (e.g. by flagging them), exclude the flagged rows with a facet, and re-cluster - this will then not include any of the terms you didn't want.

Related

RedisBloom: Option to add items (bit strings) as is with no hashing?

I'm considering Redis for my next project (in-memory, fast), but now I have the problem of figuring out how, and whether at all, it could actually achieve my goal. The goal is to store a "large" amount (millions) of fixed-length bit strings and then search over the database with an input (query) bit string. Search means returning everything that fulfills the condition below:
query & value = query
E.g. if all bits set in the query are also set in the value, return that key. This is essentially a Bloom filter, although in my domain of work it isn't usually called that.
I found the RedisBloom module, but I already have my Bloom filters (bit strings) available from an external program and would simply like to use RedisBloom to store them and search over them (the exists command). Therefore, in my case, the "Add" command should take the input as-is and not hash it again.
Is that possible? And if not, are there other suggestions?
Nope, that isn't possible as RedisBloom is a "black box" in that sense - it manages its own data structures.
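For completeness, a minimal sketch of one workaround, assuming client-side checking is acceptable: store each fixed-length bit string as a plain Redis string value (shown here with the Jedis client) and evaluate the query & value = query test on the client. Finding matches still means fetching and testing every stored value, so this is not an index; key names and the sample bytes are purely illustrative.
import redis.clients.jedis.Jedis;

// Sketch: fixed-length bit strings stored as plain Redis string values,
// containment test (query & value == query) done client-side.
public class BitStringStore {

    // True if every bit set in query is also set in value.
    static boolean contains(byte[] query, byte[] value) {
        if (query.length != value.length) return false;
        for (int i = 0; i < query.length; i++) {
            if ((query[i] & value[i]) != query[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            byte[] stored = {(byte) 0b10110010, (byte) 0b01000001};
            jedis.set("fp:42".getBytes(), stored);   // hypothetical key

            byte[] query = {(byte) 0b10100000, (byte) 0b00000001};
            byte[] value = jedis.get("fp:42".getBytes());
            System.out.println(contains(query, value)); // true
        }
    }
}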

Alphabetical index with millions of rows in redis

For my application, I need an alphabetical index on a set with millions of rows.
When I use a sorted set, and give all members the same score, the result looks perfect.
Performance is also great: with a test set of 2 million rows, the last third does not perform noticeably worse than the first third of the set.
However, I need to query those results. For example, get the first (max) 100 items that start with "goo". I played around with zscan and sort, but it does not give me a working and performant result.
Since Redis is very fast when inserting a new member into the sorted set, it must be technically possible to go immediately (well, very quickly) to the right memory location. I suppose Redis uses some kind of skip-list mechanism to accomplish this.
But.. I don't seem to get the result when I just want to query the data, and not write to it.
We use replicated slaves for read actions, and we prefer the (default) read-only config switch. So creating a dummy key and deleting it afterwards (however inelegant) is not really an option.
I'm stuck a bit, and I'm thinking about writing a ZLEX command in redis-server itself. Which I could use like this:
HELP "ZLEX" -> (ZLEX set score startswith)
-- Query the lexicographical index of a sorted set, supplying a 'startswith' string.
127.0.0.1:12345> ZLEX myset 0 goo LIMIT 0 100
1) goo
2) goof
3) goons
4) goozer
What are your thoughts? Am I missing something in the standard redis commands?
We're using Redis 2.8.4 x64 on Debian.
Kind regards, TW
Edits:
Related issue: indexing-using-redis-sorted-sets -> At least the name I gave to ZLEX seems to conform to Antirez' (Salvatore's) standards. As of 24-1-2014, I'm working on implementing ZLEX. It seems to be the easiest and most straightforward solution for this use case, and Antirez could merge it into the main branch for everyone's benefit.
I've implemented ZLEX.
Here are the full specs.
You can grab the new functionality from here: github tw-bert
I also posted a pull request to Antirez here.
Kind regards, TW
Have you had a look at this?
It can be useful depending on the length of the field by which you sort. This method requires b*(a^2) keys, where a is the length of the field and b is the number of rows for that field.
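For reference: later Redis releases (2.8.9 and up) added ZRANGEBYLEX, which covers this prefix-query use case natively when all members share the same score. Below is a minimal sketch using the Jedis client; the key name and data are illustrative, and "(gop" is simply an exclusive upper bound just past the "goo" prefix range.
import redis.clients.jedis.Jedis;

// Prefix queries on a same-score sorted set via ZRANGEBYLEX (Redis 2.8.9+).
public class PrefixQuery {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.zadd("myset", 0, "goo");
            jedis.zadd("myset", 0, "goof");
            jedis.zadd("myset", 0, "goons");
            jedis.zadd("myset", 0, "goozer");
            jedis.zadd("myset", 0, "toad");

            // "[goo" is inclusive; "(gop" excludes everything at or beyond the
            // next prefix, so only members starting with "goo" are returned.
            // The trailing 0, 100 pair is the LIMIT offset/count.
            for (String member : jedis.zrangeByLex("myset", "[goo", "(gop", 0, 100)) {
                System.out.println(member);
            }
        }
    }
}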

Solr sort different criteria for each subset

We are using Apache Solr for full-text search. We have a specific requirement for sorting the search results - basically, when querying for data, we need 2 sets of data, A and B, but each set should have its own sorting criteria, and we cannot make 2 different calls. We can get the 2 sets by using an OR condition, but how do we sort each set differently? To illustrate, if:
Set A = {3,1,2}
Set B = {8,5,9}
then the expected response could have set A returned in ascending order {1,2,3} but set B returned in descending order {9,8,5}.
I believe the default sort in Solr will sort the entire result set. Any suggestions? If the question is not clear, let me know.
You can possibly achieve this using FieldCollapsing.
You might need to do a little more work - i.e. have a display order field (it could be an integer) so that Solr has one field that it needs to sort by.
Next you could use a query like this -
&q=*&group=true&group.field=set&group.sort=display_order
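If it helps, here is a rough sketch of that grouped query issued through SolrJ. The collection URL and the set / display_order field names are assumptions carried over from the suggestion above, and note that display_order must already encode the order you want within each set, because group.sort applies the same sort expression to every group.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.Group;
import org.apache.solr.client.solrj.response.GroupCommand;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// Sketch of the grouped query above, executed via SolrJ.
public class GroupedQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build();

        SolrQuery query = new SolrQuery("*:*");
        query.set("group", "true");
        query.set("group.field", "set");
        query.set("group.sort", "display_order asc");

        QueryResponse rsp = solr.query(query);
        for (GroupCommand command : rsp.getGroupResponse().getValues()) {
            for (Group group : command.getValues()) {
                System.out.println("set = " + group.getGroupValue());
                for (SolrDocument doc : group.getResult()) {
                    System.out.println("  " + doc.getFieldValue("id"));
                }
            }
        }
        solr.close();
    }
}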
I would recommend keeping logic such as this out of Solr; it isn't meant to be a substitute for relational databases, and getting it to do complex SQL-like operations (while some are possible) is going to be tricky.
By the way, there is an open issue in Solr's JIRA that addresses batch processing of multiple queries, which means that, when it is merged into a release, you could fire n different queries to fetch these sets in one call to Solr.
If you are keen to have SOLR perform this task for you, the patch is available in the JIRA card, you could create a build for yourself and let us all know how it goes :)

lucene index match

I am trying to use Lucene for dedup (or "undup") matching. Essentially I have a file of records that I want to group based on certain fields (fuzzy search) and get back a result with a match key that tells me which records within that file matched each other.
Is this possible?
This can be done (if I understand this correctly). You would index the terms/records that will be searched on in one pass. In the second pass, you would search for each term and log the results.
While pre-processing the document you can generate a hash that aggregates those fields and store it (as NOT_ANALYZED); this way you only have to search by one field of a known size - take a look at MessageDigest. This is what I normally do for duplicate detection of file content (since the content might be too big for a single query).
If what you are looking for is creating a more complex query, try using CachingWrapperFilter, this way subsequent calls to your deduplication algorithm will be much faster.
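To make the hash-aggregation suggestion above concrete, here is a rough sketch (not the answerer's exact code): the match fields are normalized, run through MessageDigest, and stored in a StringField, which is the non-analyzed field type in current Lucene versions. The field names ("name", "address", "dedup_hash") are illustrative.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// Pre-processing step: aggregate the match fields into one hash and store it
// un-analyzed, so duplicate candidates are found with one exact-match query.
public class DedupDocumentBuilder {

    static String aggregateHash(String... fields) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (String f : fields) {
            // Normalize before hashing so trivial variations collapse together.
            md.update(f.trim().toLowerCase().getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // field separator to avoid accidental collisions
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    static Document buildDocument(String name, String address) throws NoSuchAlgorithmException {
        Document doc = new Document();
        doc.add(new TextField("name", name, Field.Store.YES));
        doc.add(new TextField("address", address, Field.Store.YES));
        // StringField is indexed verbatim - the equivalent of NOT_ANALYZED.
        doc.add(new StringField("dedup_hash", aggregateHash(name, address), Field.Store.YES));
        return doc;
    }
}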

How to implement autocomplete on a massive dataset

I'm trying to implement something like Google Suggest on a website I am building and am curious how to go about doing it on a very large dataset. Sure, if you've got 1,000 items, you cache the items and just loop through them. But how do you go about it when you have a million items? Further, suppose that the items are not one word. Specifically, I have been really impressed by Pandora.com. For example, if you search for "wet" it brings back "Wet Sand" but it also brings back Toad The Wet Sprocket. And their autocomplete is FAST. My first idea was to group the items by the first two letters, so you would have something like:
Dictionary<string,List<string>>
where the key is the first two letters. That's OK, but what if I want to do something similar to Pandora and allow the user to see results that match the middle of the string? With my idea, "Wet" would never match "Toad the Wet Sprocket" because it would be in the "TO" bucket instead of the "WE" bucket. So then perhaps you could split the string up so that "Toad the Wet Sprocket" goes into the "TO", "WE" and "SP" buckets (stripping out the word "THE"), but when you're talking about a million entries, each of which may contain, say, a few words, it seems like you'd quickly start using a lot of memory. OK, that was a long question. Thoughts?
As I pointed out in How to implement incremental search on a list you should use structures like a Trie or Patricia trie for searching patterns in large texts.
And for discovering patterns in the middle of some text there is one simple solution. I am not sure if it is the most efficient solution, but I usually do it as follows.
When I insert some new text into the trie, I just insert it, then remove the first character, insert again, remove the second character, insert again... and so on until the whole text is consumed. Then you can discover every substring of every inserted text with just one search from the root. The resulting structure is called a suffix tree, and there are a lot of optimizations available.
And it is really incredibly fast. To find all texts that contain a given sequence of n characters, you have to inspect at most n nodes and perform a search on the list of children for each node. Depending on the implementation of the child node collection (array, list, binary tree, skip list), you might be able to identify the required child node in as few as 5 search steps, assuming case-insensitive Latin letters only. Interpolation search might be helpful for large alphabets and for nodes with many children, such as those usually found near the root.
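For illustration, here is a rough sketch of the suffix-insertion idea described above: every suffix of each entry is inserted into a plain (uncompressed) trie, so a substring query becomes a single prefix walk from the root. This is a memory-hungry baseline, not an optimized suffix tree.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Every suffix of each entry goes into the trie; each node remembers which
// entries pass through it, so a substring lookup is one walk from the root.
public class SuffixTrie {

    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        Set<String> entries = new LinkedHashSet<>();
    }

    private final Node root = new Node();

    public void add(String entry) {
        String text = entry.toLowerCase();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.computeIfAbsent(text.charAt(i), c -> new Node());
                node.entries.add(entry);
            }
        }
    }

    // Returns every entry containing the query as a substring.
    public List<String> search(String query) {
        Node node = root;
        for (char c : query.toLowerCase().toCharArray()) {
            node = node.children.get(c);
            if (node == null) return new ArrayList<>();
        }
        return new ArrayList<>(node.entries);
    }

    public static void main(String[] args) {
        SuffixTrie trie = new SuffixTrie();
        trie.add("Wet Sand");
        trie.add("Toad The Wet Sprocket");
        System.out.println(trie.search("wet")); // both entries match
    }
}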
Don't try to implement this yourself (unless you're just curious). Use something like Lucene or Endeca - it will save you time and hair.
Not algorithmically related to what you are asking, but make sure you have a delay (lag) of 200 ms or more after the keypress(es), so you ensure that the user has stopped typing before issuing the asynchronous request. That way you will reduce redundant HTTP requests to the server.
I would use something along the lines of a trie, and have the value of each leaf node be a list of the possibilities that contain the word represented by the leaf node. You could sort them in order of likelihood, or dynamically sort/filter them based on other words the user has entered into the search box, etc. It will execute very quickly and in a reasonable amount of RAM.
You keep the items on the server side (perhaps in a DB, if the dataset is really large and complex) and you send AJAX calls from the client's browser that return the results using json/xml. You can do this in response to the user typing, or with a timer.
If you don't want a trie and you want matches from the middle of the string, you generally want to run some sort of edit distance function (Levenshtein distance), which will give you a number indicating how well two strings match up. It's not a particularly efficient algorithm, but it doesn't matter too much for things like words, as they're relatively short. If you're running comparisons on, say, 8000-character strings, it'll probably take a few seconds. Most languages have an implementation, or you can find code/pseudocode for it pretty easily on the internet.
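For reference, here is a standard dynamic-programming implementation of the Levenshtein distance mentioned above (two-row version to keep memory linear):
// Standard dynamic-programming Levenshtein distance: the smaller the result,
// the closer the two strings.
public class EditDistance {

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("wet", "west"));       // 1
        System.out.println(levenshtein("kitten", "sitting")); // 3
    }
}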
I've built AutoCompleteAPI for exactly this scenario.
Sign up to get a private index, then upload your documents.
Example upload using curl on document "New York":
curl -X PUT -H "Content-Type: application/json" -H "Authorization: [YourSecretKey]" -d '{
"key": "New York",
"input": "New York"
}' "http://suggest.autocompleteapi.com/[YourAccountKey]/[FieldName]"
After indexing all documents, to get autocomplete suggestions, use:
http://suggest.autocompleteapi.com/[YourAccountKey]/[FieldName]?prefix=new
You can use any client autocomplete library to show these results to the user.