My problem is as follows: let's say I have three files: A, B, and C. Each of these files contains 100-150 million strings (one per line). Each string is a hierarchical path like /e/d/f. For example:
File A (RTL):
/arbiter/par0/unit1/sigA
/arbiter/par0/unit1/sigB
...
/arbiter/par0/unit2/sigA
File B (SCH):
/arbiter_sch/par0/unit1/sigA
/arbiter_sch/par0/unit1/sigB
...
/arbiter_sch/par0/unit2/sigA
File C (Layout):
/top/arbiter/par0/unit1/sigA
/top/arbiter/par0/unit1/sigB
...
/top/arbiter/par0/unit2/sigA
We can think of file A as circuit signals in a hardware modeling language, file B as the same signals in a schematic netlist, and file C as the signals in a layout (for manufacturing).
Now a signal will have a mapping between File A <-> File B <-> File C. In this example, /arbiter/par0/unit1/sigA == /arbiter_sch/par0/unit1/sigA == /top/arbiter/par0/unit1/sigA. Of course, this association (equivalence) is established by me; I don't expect the matcher to figure it out for me.
Now say I query for '/arbiter/par0/unit1/sigA'. The matcher should return a direct match from file A, since the string is found there. For files B and C a direct match is not possible, so it should return the best possible matches (by edit distance, perhaps?). In this example that would be /arbiter_sch/par0/unit1/sigA from file B and /top/arbiter/par0/unit1/sigA from file C.
Instead of giving the full string, I could also give something like *par0*unit1*sigA, and it should return all possible matches from files A, B, and C.
I am looking for solutions, and came across Apache Lucene. However, I am not totally sure if this would work. I am going through the docs to get some idea.
My main requirements are the following:
There will be 3 text files with full paths to signals. (I can adjust the format to make it more compact if that helps build the index more quickly.)
Building the index should be fairly fast (a couple of hours at most). The files above are static (no modifications).
Searching should be comprehensive. It is OK if it takes ~1 s per search, but the matching should support direct match, regex match, and edit-distance matching. The main challenge is that each file can have 100-150 million signals.
Can someone tell me if such a use case can be easily addressed by Lucene? What would be the correct way to go about building an index and doing quick/fast searching? I would like to write some proof-of-concept code and test the performance. Thanks.
I think that, based on your requirements, the best approach would be a PoC with a given test set of entries. Based on this you can evaluate whether the indexing time you'd like to achieve is realistic. Because you only use static information it's easier: you don't have to care about topics like NRT (near-real-time search).
Personally I have never used Lucene on such a big data set, but I think Lucene is able to handle it.
How I would do it:
Read tutorials and best practices about Lucene, indexing, and searching, and understand how it works.
Define a data set for indexing, say 1,000 lines from each file.
Define your Lucene document structure. This is really important, because your searches will be built on it. Take care with analyzer choices such as tokenization (whether you need it, and how). If you need full-text search, use a TextField.
Write code for simple indexing.
Run small tests with the indexing and inspect your index with Luke.
Write code for simple searching (a combined indexing/searching sketch follows after this list).
Define queries and your expected results, then execute the searches and check the results.
Try to structure your code and separate indexing from searching; it will be easier to refactor.
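To make the indexing/searching steps concrete, here is a minimal sketch of what the PoC code could look like (Java, a recent Lucene version assumed; the field names "path" and "source", the file name fileA.txt, and the index directory are placeholders). It indexes each signal path as one untokenized document and then runs an exact, a wildcard, and an edit-distance query against it:

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.FSDirectory;

public class SignalIndexPoc {
    public static void main(String[] args) throws Exception {
        // Index: one document per signal path. StringField keeps the whole path as a single token.
        try (FSDirectory dir = FSDirectory.open(Paths.get("signal-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()));
             BufferedReader reader = Files.newBufferedReader(Paths.get("fileA.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                Document doc = new Document();
                doc.add(new StringField("path", line, Field.Store.YES));   // the full signal path
                doc.add(new StringField("source", "A", Field.Store.YES));  // which file it came from
                writer.addDocument(doc);
            }
        }

        // Search: direct match, wildcard match, and edit-distance (fuzzy) match on the same field.
        try (FSDirectory dir = FSDirectory.open(Paths.get("signal-index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query exact = new TermQuery(new Term("path", "/arbiter/par0/unit1/sigA"));
            Query wildcard = new WildcardQuery(new Term("path", "*par0*unit1*sigA"));
            Query fuzzy = new FuzzyQuery(new Term("path", "/arbiter/par0/unit1/sigA"), 2);
            for (Query query : new Query[] { exact, wildcard, fuzzy }) {
                System.out.println(query);
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.println("  " + searcher.doc(hit.doc).get("path"));
                }
            }
        }
    }
}

Two caveats worth measuring in the PoC: FuzzyQuery is capped at an edit distance of 2 over the whole term, so for "best possible match" across different hierarchy prefixes (e.g. /top/... vs /...) you may want to index path segments separately or strip known prefixes first; and wildcard/regex queries with a leading * can be slow over 100-150 million entries, which is exactly the kind of number you want the PoC to give you.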
I'd like to model autonomous systems and their relationships in a graph database (Memgraph).
There are two different kinds of relationships that can exist between nodes:
undirected peer2peer relationships (edges without arrows in the image)
directed provider2customer relationships (arrows pointing to the provider in the image)
The following image shows valid paths that I want to find with some query
They can be described as
(s)-[:provider*0..n]->()-[:peer*0..n]-()<-[:provider*0..n]-(d)
or in other words
0-n c2p edges followed by 0-n p2p edges followed by 0-n p2c edges
I can fix the first and last node and would like to find a (shortest/cheapest) path. As I understand it, I can do a BFS if there is only ONE relationship pattern on the path.
Is there a way to query for paths of such form in Cypher?
As an alternative I could do individual queries where I specify the length of each of the segments and then do a query for every length of path until a path is found.
i.e.
MATCH (s)-[]-(d) // all one-hop paths
MATCH (s)-[:provider]->()-[:peer]-(d)
MATCH (s)-[:provider]->()<-[:provider]-(d)
...
Since it's viable to have 7 different path sections, I don't see how 3 BFS patterns (... BFS*0..n) would yield a valid solution. It's impossible to have an empty path because the pattern contains some nodes between them (I have to double-check that).
Writing individual patterns is not great.
Some options are:
MATCH path=(s)-[:BFS*0..n]-(d) WHERE {{filter_expression}} -> The expression would have to be quite complex in order to yield valid paths.
MATCH path=(s)-[:BFS*0..n]-(d) CALL module.filter_procedure(path) -> The module.filter_procedure(path) could be implemented in Python or C/C++. Please take a look here. I would recommend starting with Python since it's much easier, and Python should be fine for the PoC. I would also recommend starting with this option because I'm pretty confident the solution will work, plus it's modular: the filter_procedure can easily be extended later while the query stays the same.
Could you please provide a sample dataset in a format of a Cypher query (a couple of nodes and edges / a small graph)? I'm glad to come up with a solution.
This might end up being a very general question, but hopefully it will be useful to others as well.
I want to be able to request a word that has x syllables with the stress on the y-th syllable. I've found plenty of APIs that return both of these, such as Wordnik, but I'm not sure how to approach the search aspect. The URL to get the syllables is
GET /word.json/{word}/hyphenation
but I won't know the word ahead of time to make this request. They also have this:
GET /words.json/randomWords
which returns a list of random words.
Is there a way to achieve what I want with this API without asking for random words over and over and checking if they meet my needs? That just seems like it would be really slow and push me over my usage limits.
Do I need to build my own data structure with the words and syllables to query locally?
I doubt you'll find this kind of specialized query on any of the big dictionary APIs. You'll need to download an English dictionary and create your own data structure to do this kind of thing.
The Moby Project has a hyphenated dictionary with about 185,000 words in it. There are many other dictionary projects available. A good place to start looking is http://www.dicts.info/dictionaries.php.
Once you've downloaded the dictionary, you'll need to preprocess it to build your data structure. You should be able to construct a dictionary or hash map that is indexed by (syllables, emphasis), and whose data member is a list of words. So you'd have an entry like (4, 2) (4-syllable word with emphasis on the 2nd syllable), and a list of all such words.
To query it, then, you'd just pack the query into a structure and look up that key in the hash map. Then pick a random word from the resulting list.
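A rough sketch of that lookup structure (Java; the words, syllable counts, and stress positions below are just illustrative examples, and parsing the Moby hyphenated dictionary to fill the map is left out):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class SyllableIndex {
    // Key is "syllableCount:stressedSyllable"; value is every word with that shape.
    private final Map<String, List<String>> byShape = new HashMap<>();
    private final Random random = new Random();

    // Syllable count and stress position come from your preprocessed dictionary.
    public void add(String word, int syllables, int stressedSyllable) {
        byShape.computeIfAbsent(syllables + ":" + stressedSyllable, k -> new ArrayList<>())
               .add(word);
    }

    // Returns a random word with the requested shape, or null if none is known.
    public String randomWord(int syllables, int stressedSyllable) {
        List<String> words = byShape.get(syllables + ":" + stressedSyllable);
        if (words == null || words.isEmpty()) {
            return null;
        }
        return words.get(random.nextInt(words.size()));
    }

    public static void main(String[] args) {
        SyllableIndex index = new SyllableIndex();
        index.add("information", 4, 3);  // in-for-MA-tion
        index.add("calculator", 4, 1);   // CAL-cu-la-tor
        System.out.println(index.randomWord(4, 3));
    }
}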
Looking for a shortcut here. I am pretty adept with SQL database engines and ERPs. I should clarify... I mean databases like MS SQL, MySQL, PostgreSQL, etc.
One of the things that I like to do when I am working on a new project is to get a feel for what is being utilized and what isn't. In T-SQL this is pretty easy. I just query the information schema and get a row count of all the tables and filter out the ones having rowcount = 0. I know this isn't truly a precise row count, but it does give me an idea of what is in use.
So I recently started at a new company and one of their systems is running on UniData. This is a pretty radical shift from mainstream databases and there isn't a lot of help out there. I was wondering if anybody knew of a command to do the same thing listed above in UniBasic/UniQuery/whatever else.
Which tables (files) are heavily populated and which ones are not?
You can start with a special "table" (or file in Unidata terminology) named VOC - it will have a list of all the other files that are in your current "database" (aka account), as well as a bunch of other things.
To get a list of files in (or pointed to) the current account:
:SORT VOC WITH F1 = "F]" "L]" "DIR" F1 F2
Try HELP CREATE.FILE if you're curious about the difference between F and LF and DIR.
Once you have a list of files, weed out the ones named *TEMP* or *WORK* and start digging into the ones that seem important. There are other ways to get at what's important (e.g. using triggers or timestamps), but browsing isn't a bad idea to see what conventions are used.
Once you have a file that looks interesting (let's say CUSTOMERS), you can look at the dictionary of that file to see what's in it:
:SORT DICT CUSTOMERS F1 F2 BY F1 BY F2 USING DICT VOC
It can help to create something like F2.LONG in DICT VOC to increase the display size up from 15 characters.
Now that you have a list of "columns" (aka fields or attributes), you're looking for the D-type attributes, which tell you what columns are in the file. V- or I-types are calculations.
https://github.com/ianmcgowan/SCI.BP/blob/master/PIVOT is helpful with profiling when you see an attribute that looks interesting and you want to see what the data looks like.
http://docs.rocketsoftware.com/nxt/gateway.dll/RKBnew20/unidata/previous%20versions/v8.1.0/unidata_userguide_v810.pdf has some generally good information on the concepts and there are many other online manuals available there. It can take a lot of reading to get to the right thing if you don't know the terminology.
Can anyone help me?
For my project I use Lucene for indexing files. It only gives me the file name and location, with no mention of the line number or page number.
Is it possible with Lucene to find the line number or page number? Please help me understand how to do it.
This ended up being too long for a comment so I just made it an answer.
Are you thinking of grep (the *nix tool) output, where you grep a set of documents and get a result set that contains matches with a line number and text? E.g.:
46: I saw the brown fox jumping over the lazy dog
If so, Lucene doesn't work like that. To simplify, grep opens each document serially and runs your specified pattern against each line of the contents of each document. Hence it can produce output like the example above, because it's working on the file as it exists on the machine. Lucene behaves differently.
When you index a file with Lucene, Lucene creates an inverted index, combining the contents of each document into a highly efficient structure that lets you quickly look up documents containing specific pieces of information. When you run a query against that inverted index, Lucene returns its internal representation of all the documents that matched your query, along with a relevancy score indicating how useful each document might be to you for that query. It does this by operating against its own internal inverted-index structure, not by iterating over all the files in place like grep. Lucene has no knowledge of line or page numbers, so no, it's not possible to replicate grep with Lucene right out of the box.
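To make the difference concrete, the usual workaround when you really need line-level results is to change the unit of indexing: index each line as its own Lucene document and store the line number alongside it, so every hit already carries its file name and line number. A minimal sketch (Java; the field names and the input path are made up):

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class LineIndexer {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("line-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
             BufferedReader reader = Files.newBufferedReader(Paths.get("some-file.txt"))) {
            String line;
            int lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                Document doc = new Document();
                doc.add(new StringField("file", "some-file.txt", Field.Store.YES)); // which file
                doc.add(new StoredField("lineNumber", lineNumber));                 // stored for display only
                doc.add(new TextField("contents", line, Field.Store.YES));          // the searchable text
                writer.addDocument(doc);
            }
        }
    }
}

The trade-offs: the index gets much larger, "document" now means "line", and page numbers would need the same trick with whatever page boundaries your text extractor can give you. It still isn't grep, but each hit can now be reported as file plus line number.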
I have a Lucene index whose documents are in around 20 different languages, all in the same index, and I have a field 'lng' which I use to filter the results to only one language.
Based on this index I implemented a spell-checker. The issue is that I get suggestions from all languages, which are irrelevant (if I am searching in English, suggestions in German are not what I need). My first idea was to create a different spell-check index for each language and then select the index based on the language of the query, but I do not like this. Is it possible to add an additional column to the spell-check index and use that, or is there some better way to do this?
Another question is how I could improve suggestions for 2 or more terms in a search query. Currently I just do it for the first term, which could be strongly improved by using the terms in combination, but I could not find any samples or implementations which could help me solve this issue.
thanks
almir
As far as I know, it's not possible to add a 'language' field to the spellchecker index. I think you would need to define several separate SpellCheckers to achieve this.
EDIT: as it turned out in the comments, the language of the query is entered by the user as well, so my answer is limited to: define multiple spellcheckers. As for the second question that you added, I think it was discussed before, for example here.
However, even if it were possible, it wouldn't solve the biggest problem, which is detecting the language of the query. That is a highly non-trivial task for very short messages that can include acronyms, proper nouns, and slang terms. Simple n-gram based methods can be inaccurate (as is, e.g., the language detector from Tika). So I think the most challenging part is how to use the certainty scores from both the language detector and the spellchecker, and what threshold should be chosen to provide meaningful corrections (e.g. the language detector prefers German, but the spellchecker has a good match in Danish...).
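If you do go the multiple-spellcheckers route, a minimal sketch looks like this (Java; it assumes you build one spellchecker directory per value of the 'lng' field, and the directory naming is made up):

import java.io.IOException;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.FSDirectory;

public class PerLanguageSpellCheckers {
    private final Map<String, SpellChecker> byLanguage = new HashMap<>();

    // One spellchecker index per language, e.g. built from the subset of the main index where lng == language.
    public void open(String language) throws IOException {
        byLanguage.put(language, new SpellChecker(FSDirectory.open(Paths.get("spell-" + language))));
    }

    // Suggestions come only from the index of the requested (user-supplied) language.
    public String[] suggest(String language, String word, int count) throws IOException {
        SpellChecker checker = byLanguage.get(language);
        return checker == null ? new String[0] : checker.suggestSimilar(word, count);
    }
}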
If you look at the source of SpellChecker.SuggestSimilar you can see:
BooleanQuery query = new BooleanQuery();
String[] grams;
String key;
for (int ng = GetMin(lengthWord); ng <= GetMax(lengthWord); ng++)
{
<...>
if (bStart > 0)
{
Add(query, "start" + ng, grams[0], bStart); // matches start of word
}
<...>
I.e., the suggestion search is just a bunch of OR'd boolean queries. You can certainly modify this code with something like:
query.Add(new BooleanClause(new TermQuery(new Term("Language", "German")),
BooleanClause.Occur.MUST));
which will only look for suggestions in German. There is no way to do this without modifying your code though, apart from having multiple spellcheckers.
To deal with multiple terms, use QueryTermExtractor to get an array of your terms. Do spellcheck for each, and cartesian join. You may want to run a query on each combo and then sort based on the frequency they occur (like how the single-word spellchecker works).
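A sketch of that multi-term idea (Java; it simply splits the query on whitespace rather than using QueryTermExtractor, and the frequency-based re-ranking of the combinations is left out):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.spell.SpellChecker;

public class MultiTermSuggestions {
    // Builds every combination of per-term suggestions (the "cartesian join").
    public static List<String> suggestPhrases(SpellChecker checker, String query, int perTerm)
            throws IOException {
        List<String> phrases = new ArrayList<>();
        phrases.add("");
        for (String term : query.split("\\s+")) {
            String[] suggestions = checker.suggestSimilar(term, perTerm);
            if (suggestions.length == 0) {
                suggestions = new String[] { term }; // keep the original term if nothing better is found
            }
            List<String> extended = new ArrayList<>();
            for (String phrase : phrases) {
                for (String suggestion : suggestions) {
                    extended.add(phrase.isEmpty() ? suggestion : phrase + " " + suggestion);
                }
            }
            phrases = extended;
        }
        return phrases;
    }
}

Each resulting phrase can then be run as a query and the combinations sorted by how many hits they return, as described above.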
After implementing two different search features on two different sites with both Lucene and Sphinx, I can say that Sphinx is the clear winner.
Consider using http://sphinxsearch.com/ instead of lucene. It's used by craigslist, among others.
They have a feature called morphology preprocessors:
# a list of morphology preprocessors to apply
# optional, default is empty
#
# builtin preprocessors are 'none', 'stem_en', 'stem_ru', 'stem_enru',
# 'soundex', and 'metaphone'; additional preprocessors available from
# libstemmer are 'libstemmer_XXX', where XXX is algorithm code
# (see libstemmer_c/libstemmer/modules.txt)
#
# morphology = stem_en, stem_ru, soundex
# morphology = libstemmer_german
# morphology = libstemmer_sv
morphology = none
There are many stemmers available, and as you can see, German is among them.
UPDATE:
Elaboration on why I feel that Sphinx has been the clear winner for me.
Speed: Sphinx is stupid fast, both in indexing and in serving search queries.
Relevance: Though it's hard to quantify, I felt I was able to get more relevant results with Sphinx than with my Lucene implementation.
Dependence on the filesystem: With Lucene, I was unable to break the dependence on the filesystem, and while there are workarounds, like creating a RAM disk, I felt it was easier to just select the "run only in memory" option of Sphinx. This has implications for websites with more than one web server, adding dynamic data to the index, reindexing, etc.
Yes, these are just points of opinion. However, it is the opinion of someone who has tried both systems.
Hope that helps...