Detecting duplicate or similar PDFs using Elasticsearch

I'm trying to find a good way to identify whether I have duplicate/highly similar PDFs in a system when they're not exactly alike (i.e. checksums differ because the pages in the PDF have been rearranged, deleted, or merged with other pages).
So a simple example would be:
Original PDF contains pages (A, B, C, D)
New PDF entering the system contains pages (D, B, C, A, E, F) or (D, G, H, I, B) or any other combination where some of the content also resides somewhere in the original PDF.
Any suggestions for robust methods to determine match/similarity thresholds? Such as identifying that a new PDF is 80% similar to the original.
We are using Elasticsearch for search in our system, but I haven't found a good way to query it or use its score to come up with a useful percentage/number to use as a similarity threshold.
Any thoughts/ideas/suggestions would be most appreciated.
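For what it's worth, one possible way to turn page overlap into a percentage (this is an assumption of mine, not something established in the question) is to fingerprint each page's extracted text and compare the resulting page sets with Jaccard similarity. A minimal Kotlin sketch, where extractPageTexts is a hypothetical helper (e.g. backed by Apache PDFBox):

import java.security.MessageDigest

// Hypothetical helper: returns the extracted plain text of each page
// (e.g. via Apache PDFBox); not part of the question.
fun extractPageTexts(pdfPath: String): List<String> = TODO("PDF text extraction")

// Fingerprint every page so rearranged, deleted or merged pages still match.
fun pageFingerprints(pdfPath: String): Set<String> =
    extractPageTexts(pdfPath)
        .map { it.lowercase().replace(Regex("\\s+"), " ").trim() }
        .map { text ->
            MessageDigest.getInstance("SHA-256")
                .digest(text.toByteArray())
                .joinToString("") { "%02x".format(it) }
        }
        .toSet()

// Jaccard similarity of the two page sets: 1.0 = same pages, 0.0 = disjoint.
fun pageSimilarity(original: String, candidate: String): Double {
    val a = pageFingerprints(original)
    val b = pageFingerprints(candidate)
    if (a.isEmpty() && b.isEmpty()) return 1.0
    return a.intersect(b).size.toDouble() / a.union(b).size
}

On the example above, (A, B, C, D) against (D, B, C, A, E, F) shares 4 pages out of 6 distinct ones, roughly 67%, so an "80% similar" rule becomes a simple comparison of this ratio against 0.8. Pages that are near-identical rather than byte-identical would need fuzzier fingerprints (shingling/MinHash) instead of exact hashes.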

Determine if List of Strings contains substring of all Strings in other list

I have this situation:
val a = listOf("wwfooww", "qqbarooo", "ttbazi")
val b = listOf("foo", "bar")
I want to determine if all items of b are contained in substrings of a, so the desired function should return true in the situation above. The best I can come up with is this:
return a.any { it.contains("foo") } && a.any { it.contains("bar") }
But it iterates over a twice. a.containsAll(b) doesn't work either because it compares on string equality and not substrings.
I'm not sure there is any way of doing that without iterating over a as many as b.size times. If you only want a single iteration of a, you have to check all the elements of b for each element of a, so you end up iterating over b a.size times instead; in that scenario you also need to keep track of which items in b already had a match so you don't check them again, either by removing them from a copy of b or by keeping another list of matches to compare against the original b, which might be worse than just iterating over a.
So I think that you are on the right track with your code there, but there are some issues. For example, you don't have any reference to b, just hardcoded strings, and doing it like that for all elements in b will result in quite a big function if you have more than 2, or worse, won't work at all if you don't already know the values.
This code does the same thing as the one you put above, but it actually uses elements from b, not hardcoded strings that happen to match b (it iterates over b once, and partially over a up to b.size times).
return b.all { bItem ->
    a.any { it.contains(bItem) }
}
Alex's answer is by far the simplest approach, and is almost certainly the best one in most circumstances.
However, it has complexity A*B (where A and B are the sizes of the two lists) — which means that it doesn't scale: if both lists get big, it'll get very slow.
So for completeness, here's a way that's more involved, and slower for the small cases, but has complexity proportional to A+B and so can cope efficiently with much larger lists.
The idea is to preprocess the a list, to generate a set of all the possible substrings, and then scan through the b list just checking for inclusion in that set.  (The preprocessing step takes time proportional* to A.  Converting the substrings into a set means that it can check whether a string is present in constant time, using its hash code; so the rest then takes time proportional to B.)
I think this is clearest using a helper function:
/**
 * Generates a list of all possible substrings, including
 * the string itself (but excluding the empty string).
 */
fun String.substrings() =
    indices.flatMap { start ->
        ((start + 1)..length).map { end ->
            substring(start, end)
        }
    }
For example, "1234".substrings() gives [1, 12, 123, 1234, 2, 23, 234, 3, 34, 4].
Then we can generate the set of all substrings of items from a, and check that every item of b is in it:
return a.flatMap { it.substrings() }
    .toSet()
    .containsAll(b)
(* Actually, the complexity is also affected by the lengths of the strings in the a list.  Alex's version is directly proportional to the average length, while the preprocessing part of the algorithm above is proportional to its square (as indicated by the map nested in the flatMap).  That's not good, of course; but in practice while the lists are likely to get longer, the strings within them probably won't, so that's unlikely to be significant.  Worth knowing about, though.
And there are probably other, still more complex algorithms, that scale even better…)
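To tie the two approaches together, here is a small sketch (the wrapper function names are mine, not from the answers) that exercises both on the question's lists, reusing the substrings() helper above:

fun containsAllSubstringsSimple(a: List<String>, b: List<String>): Boolean =
    b.all { bItem -> a.any { it.contains(bItem) } }

fun containsAllSubstringsPreprocessed(a: List<String>, b: List<String>): Boolean =
    a.flatMap { it.substrings() }.toSet().containsAll(b)

fun main() {
    val a = listOf("wwfooww", "qqbarooo", "ttbazi")
    val b = listOf("foo", "bar")
    println(containsAllSubstringsSimple(a, b))        // true
    println(containsAllSubstringsPreprocessed(a, b))  // true
    println(containsAllSubstringsSimple(a, listOf("foo", "nope")))  // false
}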

Why are dead keys not working with some letters in AutoHotkey?

In an AutoHotkey script, why do dead keys not work with some letters?
As an example, when running AutoHotkey with the following script:
#InstallKeybdHook
EndKeys = {LControl}{RControl}{LAlt}{RAlt}{LShift}{RShift}{LWin}{RWin}{AppsKey}{F1}{F2}{F3}{F4}{F5}{F6}{F7}{F8}{F9}{F10}{F11}{F12}{Left}{Right}{Up}{Down}{Home}{End}{PgUp}{PgDn}{Del}{Ins}{BS}{Capslock}{Numlock}{PrintScreen}{Pause}
<^>!`::
Input, SingleKey, L1, EndKeys
IfInString,SingleKey,a
Send,{U+00E0} ;à
IfInString,SingleKey,e
Send,{U+00E8} ;è
return
return
then pressing the combination of
Alt-Gr & Grave, followed by an 'a', I get à, OK, but
Alt-Gr & Grave, followed by an 'e' does NOT produce è.
The issue is not specific to grave (`); the same thing happens with any other dead key (circumflex, acute, macron, etc.).
In my particular case, the letters not working are: e y s d k n. Could it have something to do with the keyboard layout? (I am using a UK English layout.) Are there any ways of approaching the issue to ensure the dead keys will work?
Thank you!
In my particular case, the letters not working are: e y s d k n
Try reorganizing these letters. I find this very hilarious indeed. Please insert any expression of laughter yourself, for it would not be welcomed on Stack Overflow if I did.
You forgot to include your %'s. It should be
Input, SingleKey, L1, %EndKeys%
Otherwise, the literal string "EndKeys" is used as the end-key list, so only the letters e, n, d, k, y, s will be recognized as end keys (which is exactly why those letters terminate the Input instead of being captured).

CGAL corefinement demo: cutting mesh A's surface with mesh B, then removing the part of A inside B

I posted a CGAL question some time ago that was kindly answered by pointing to the Polyhedron demo and the corefinement plugin. The basic idea is that one open polyhedron A is cut by another open polyhedron B, and I need the list of intersection halfedges owned by A, or better, A minus the part of A in B.
The corefinement demo does this, but I want to select, as a result, all parts of A not in B. This does not match the available predicates in the demo (A - B (leaves parts of B inside A), B - A (leaves parts of B outside A), A inter B, A union B). I tried combining/modifying them to get what I want, but I must be missing something. The information on the 'darts' seems to be mutually exclusive.
The picture below illustrates this: A has been cut by B (I have a hole with the shape of B), but some parts of B are still in A (the facets on the hole border).
(edit: sorry, not enough reputation to post an image here :-( )
Any advice on how to write a predicate that selects only A with a hole, and leaves out any face coming from B?
Thank you!

How are records stored in Erlang, and how are they mutated?

I recently came across some code that looked something like the following:
-record(my_rec, {f0, f1, f2...... f711}).
update_field({f0, Val}, R) -> R#my_rec{f0 = Val};
update_field({f1, Val}, R) -> R#my_rec{f1 = Val};
update_field({f2, Val}, R) -> R#my_rec{f2 = Val};
....
update_field({f711, Val}, R) -> R#my_rec{f711 = Val}.
generate_record_from_proplist(Props) ->
    lists:foldl(fun update_field/2, #my_rec{}, Props).
My question is about what actually happens to the record. Let's say the record has 711 fields and I'm generating it from a proplist: since the record is immutable, we are, at least semantically, generating a new full record on every step of the foldl. That turns what looks like a function linear in the number of arguments into one that is actually quadratic, since every insert performs work proportional to the length of the record. Am I correct in this assumption, or is the compiler intelligent enough to save me?
Records are tuples whose first element contains the name of the record, and the following elements the record fields.
The names of the fields are not stored; they are a facility for the compiler, and of course the programmer. I think they were introduced only to avoid errors in the field order when writing programs, and to allow tuple extension when releasing a new version without rewriting all pattern matches.
Your code will make 712 copies of a 713-element tuple.
I am afraid the compiler is not smart enough.
You can read more in this SO answer. If you have such a big number of fields and you want to update them in O(1) time, you should use ETS tables.
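As a rough cross-language analogue (Kotlin rather than Erlang, purely to illustrate the copy-per-update cost, not Erlang internals): folding a list of updates over an immutable value allocates a complete new copy at every step, just like each R#my_rec{...} update builds a new tuple:

// Illustrative only: an immutable "record" with a handful of fields.
data class MyRec(val f0: Int = 0, val f1: Int = 0, val f2: Int = 0)

fun fromProplist(props: List<Pair<String, Int>>): MyRec =
    props.fold(MyRec()) { rec, (field, value) ->
        // Each copy() allocates a brand-new object carrying all the fields,
        // so n updates over an m-field record cost O(n * m).
        when (field) {
            "f0" -> rec.copy(f0 = value)
            "f1" -> rec.copy(f1 = value)
            else -> rec.copy(f2 = value)
        }
    }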

Building dictionary of words from large text

I have a text file containing posts in English/Italian. I would like to read the posts into a data matrix so that each row represents a post and each column a word. The cells in the matrix are the counts of how many times each word appears in the post. The dictionary should consist of all the words in the whole file, or a non-exhaustive English/Italian dictionary.
I know this is a common, essential preprocessing step for NLP. And I know it's pretty trivial to code; still, I'd like to use some NLP domain-specific tool so that stop-words get trimmed, etc.
Does anyone know of a tool/project that can perform this task?
Someone mentioned Apache Lucene; do you know whether a Lucene index can be serialized to a data structure similar to what I need?
Maybe you want to look at GATE. It is an infrastructure for text-mining and processing. This is what GATE does (I got this from the site):
open source software capable of solving almost any text processing problem
a mature and extensive community of developers, users, educators, students and scientists
a defined and repeatable process for creating robust and maintainable text processing workflows
in active use for all sorts of language processing tasks and applications, including: voice of the customer; cancer research; drug research; decision support; recruitment; web mining; information extraction; semantic annotation
the result of a €multi-million R&D programme running since 1995, funded by commercial users, the EC, BBSRC, EPSRC, AHRC, JISC, etc.
used by corporations, SMEs, research labs and Universities worldwide
the Eclipse of Natural Language Engineering, the Lucene of Information Extraction, the ISO 9001 of Text Mining
What you want is so simple that, in most languages, I would suggest you roll your own solution using an array of hash tables that map from strings to integers. For example, in C#:
foreach (var post in posts)
{
    // One row per post, mapping each word to its count in that post;
    // collect the rows into your matrix as needed.
    var row = new Dictionary<string, int>();
    foreach (var word in GetWordsFromPost(post))
    {
        IncrementContentOfRow(row, word);
    }
}

// ...

private void IncrementContentOfRow(IDictionary<string, int> row, string word)
{
    int oldValue;
    if (!row.TryGetValue(word, out oldValue))
    {
        oldValue = 0;
    }
    row[word] = oldValue + 1;
}
You can check out:
bow - a veteran C library for text classification; I know it stores the matrix, though it may require some hacking to get at it.
Weka - a Java machine learning framework that can handle text and build the matrix
Sujit Pal's blog post on building the term-document matrix from scratch
If you insist on using Lucene, you should create an index using term vectors, and use something like a loop over getTermFreqVector() to get the matrix.
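For illustration, a sketch of that loop in Kotlin, assuming the older (pre-4.0) Lucene API that exposes IndexReader.getTermFreqVector and an index whose "content" field was indexed with term vectors enabled (both the field name and the Lucene version are my assumptions, not from the answer):

import org.apache.lucene.index.IndexReader

// Builds one word-count row per document from the stored term vectors.
fun termDocumentMatrix(reader: IndexReader, field: String = "content"): List<Map<String, Int>> {
    val rows = mutableListOf<Map<String, Int>>()
    for (doc in 0 until reader.maxDoc()) {
        if (reader.isDeleted(doc)) continue
        val row = mutableMapOf<String, Int>()
        val vector = reader.getTermFreqVector(doc, field)
        if (vector != null) {
            val terms = vector.terms                // String[]
            val freqs = vector.termFrequencies      // int[]
            for (i in terms.indices) row[terms[i]] = freqs[i]
        }
        rows.add(row)
    }
    return rows
}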
Thanks to @Mikos' comment, I googled the term "term-document matrix" and found TMG (Text to Matrix Generator).
I found it suitable for my needs.