I'm working on a small design project, part of which involves writing out text in a given font such that the letters of a word are just touching each other on their left and right sides.
I've thought of implementing this as follows - create GlyphVectors of two letters, create Shape objects using vector.getOutline(), then create Area objects and intersect them.
The only thing I'm missing with this method is the ability to shift the second letter to the right until the intersection is empty.
Is there a way to do this, or do I need to use a different approach?
TIA
ETA: OK, I've figured out I can use AffineTransform. Now, is there a way to tell the size (surface area) of the Area created by the intersection of two letters?
How precise do you want this to be? Pixel precision is much easier to attain than vector precision. Have you considered linearising the outlines (usually done through

public PathIterator getPathIterator(AffineTransform at, double flatness)

) and then searching in opposite directions among all the resulting points? This seems to be the most obvious solution, even though it is not vector-precise.
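For the follow-up about surface area: once the outlines are flattened into straight segments, the intersection is just a set of polygons, and each polygon's area follows from the shoelace formula. A minimal sketch of that computation (shown in Python for brevity; in Java the points would come from the flattening PathIterator above, and you would sum the signed areas over all sub-paths):

    # Shoelace formula: signed area of a simple polygon given as (x, y) points.
    # This is a sketch: a real Area may have several sub-paths, and holes have
    # opposite winding, so summing *signed* sub-path areas handles them.
    def polygon_area(points):
        signed = 0.0
        n = len(points)
        for i in range(n):
            x1, y1 = points[i]
            x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
            signed += x1 * y2 - x2 * y1
        return signed / 2.0

    # Example: a unit square, counter-clockwise, has signed area 1.0.
    print(polygon_area([(0, 0), (1, 0), (1, 1), (0, 1)]))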
Here is what I want to do:
- keep a reference curve unchanged (only shift and stretch a query curve)
- constrain how many elements are duplicated
- keep both start and end open
I tried:
dtw(ref_curve,query_curve,step_pattern=asymmetric,open_end=True,open_begin=True)
but I cannot constrain how the query curve is stretched
dtw(ref_curve,query_curve,step_pattern=mvmStepPattern(10))
it didn’t do anything to the curves!
dtw(ref_curve,query_curve,step_pattern=rabinerJuangStepPattern(4, "c"),open_end=True, open_begin=True)
I liked this one the most but in some cases it shifts the query curve more than needed...
I read the paper (https://www.jstatsoft.org/article/view/v031i07) and the API but still don't quite understand how to achieve what I want. Any other options to constrain number of elements that are duplicated? I would appreciate your help!
To clarify: we are talking about functions provided by the DTW suite packages at dynamictimewarping.github.io. The question is in fact language-independent (and may be better suited to the Cross Validated Stack Exchange).
The pattern rabinerJuangStepPattern(4, "c") you have found does in fact satisfy your requirements:
- it's asymmetric, and each step advances the reference by exactly one step
- it's slope-limited between 1/2 and 2
- it's type "c", so it can be normalized in a way that allows open-begin and open-end
If you haven't already, check out dtw.rabinerJuangStepPattern(4, "c").plot().
It goes without saying that in all cases what you are getting is the optimal alignment, i.e. the one with the least accumulated distance among all allowed paths.
As an alternative, you may consider the simpler asymmetric recursion -- as your first attempt above -- constrained with a global warping window: see dtw.window and the window_type argument. This provides constraints of a different shape (and flexible size), which might suit your specific case.
PS: edited to add that the asymmetricP2 recursion is also similar to RJ-4c, but with a more constrained slope.
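To make the two options concrete, here is a minimal sketch with the Python incarnation of the DTW suite (the two curves are made-up stand-ins for your data; note that the query series is the first argument to dtw()):

    import numpy as np
    from dtw import dtw, rabinerJuangStepPattern

    # Made-up stand-ins for the real curves (equal lengths, query shifted).
    reference = np.sin(np.linspace(0, 2 * np.pi, 100))
    query = np.sin(np.linspace(0.5, 2 * np.pi + 0.5, 100))

    # Option 1: slope-limited Rabiner-Juang recursion with both ends open.
    alignment = dtw(query, reference,
                    step_pattern=rabinerJuangStepPattern(4, "c"),
                    open_begin=True, open_end=True,
                    keep_internals=True)
    print(alignment.distance, alignment.normalizedDistance)
    alignment.plot(type="twoway")

    # Option 2: plain asymmetric recursion constrained by a global
    # Sakoe-Chiba window (the half-width of 10 is an arbitrary example).
    alignment2 = dtw(query, reference,
                     step_pattern="asymmetric",
                     window_type="sakoechiba",
                     window_args={"window_size": 10})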
I am creating a machine learning model that essentially scores the correctness of one text against another.
For example; “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how correct text 1 is to text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two above sentences should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit for your problem, because:

- Words that occur in many documents (say, 90% of your sentences/documents contain the conjunction word 'and') receive much less emphasis, which essentially gives more weight to the more document-specific phrasing (this is the IDF part).
- Word order does not matter for the Term Frequency (TF) part, as opposed to methods using sliding windows etc.
- It is very lightweight when compared to representation-oriented methods like the one mentioned above.
- Big drawback: your data, depending on the size of the corpus, may have too many dimensions (one dimension per unique word); you could use stemming/lemmatization to mitigate this problem to some degree.

You may calculate the similarity between two TF-IDF vectors using cosine similarity, for example.
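As a rough sketch of that pipeline in Python with scikit-learn (the two example sentences are the ones from the question):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    texts = ["the cat and a dog", "a dog and the cat"]

    # Fit TF-IDF on the corpus; with a realistically sized corpus, frequent
    # words like "and"/"the" receive a low IDF weight automatically.
    tfidf = TfidfVectorizer().fit_transform(texts)

    # Cosine similarity ignores word order, so these two sentences score as
    # a perfect match (1.0): their bags of words are identical.
    print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])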
EDIT: Woops, this question is 8 months old, sorry for the bump, maybe it will be of use to someone else though.
I'm working with a lot of name data where the following events are happening:
In one stream the data is submitted as "Sung" and in the other stream as "Snug". My initial thought was to convert each name so that every character maps to a number; the sums would then be the same, so even if two characters are transposed, I'd be able to bucket these appropriately.
The other case is where one stream has "Lillly" as opposed to "Lilly" in the other stream. I'd like to figure out how to fuzzy match these so that I can identify them. I'm not sure if this is possible in Oracle.
I'm working with many millions of data points and trying to figure out how to write these classification buckets so that I can cut down the noise in my primary task: finding records that are truly different people as opposed to clerical errors.
Any thoughts would be very appreciated.
A common measure for such distance is the Levenshtein distance (see Wikipedia). This measures the "edit" distance between two strings: the number of edit operations needed to convert one into the other.
That's the good news. More good news is that Oracle even has an implementation in the UTL_MATCH package.
The bad news is that it is really, really expensive on millions of data points, and unfortunately I cannot help you much there. One idea is to first determine which names are "close enough" because they already share a certain minimum number of characters.
Another method is to convert the strings to what they sound like; that is called Soundex. You may be able to use the two together -- assuming your names are predominantly English (the Soundex algorithm was developed for the US Census Bureau, so it works best on names common in America).
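To sketch how the two combine (Python used here just to show the shape of the idea; inside Oracle the building blocks would be SOUNDEX() and UTL_MATCH.EDIT_DISTANCE): bucket the names by their soundex code first, then run the expensive edit-distance comparison only within each bucket.

    from collections import defaultdict
    from itertools import combinations

    def soundex(name):
        # Simplified classic soundex (h/w are treated like vowels here).
        groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
                  "l": "4", "mn": "5", "r": "6"}
        def code(ch):
            return next((d for letters, d in groups.items() if ch in letters), "")
        name = name.lower()
        out, prev = name[0].upper(), code(name[0])
        for ch in name[1:]:
            d = code(ch)
            if d and d != prev:
                out += d
            prev = d
        return (out + "000")[:4]

    names = ["Lilly", "Lillly", "Sung", "Snug", "Smith"]

    # Block on the soundex code; "Sung"/"Snug" and "Lilly"/"Lillly" share codes.
    buckets = defaultdict(list)
    for n in names:
        buckets[soundex(n)].append(n)

    # Only these within-bucket pairs need the costly edit-distance check.
    for key, group in buckets.items():
        for a, b in combinations(group, 2):
            print(key, a, b)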
No keyboard patterns, i.e. keys that are adjacent vertically or horizontally on a keyboard. For example, 'ZXCVBN123' should be rejected.
No commonly used words and no words written backwards or disguised with special characters. For example 'Universe1' and 'Un1ver$e' should be rejected.
Well, first you need to define exactly what you want: what is a keyboard pattern, really? Is 'jk' a keyboard pattern, or just 'jkl'? What's the shortest pattern there is? Is 'gy' a pattern?
Then you should make a list of all the available patterns (there aren't all that many: you have 36 starting points and 4 directions to go from each one). When you get a password, try to locate each of the patterns in it. Note that if you decide the shortest pattern is 3 letters long, you don't need to search for 4-letter patterns, since every 4-letter pattern already contains a 3-letter one.
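A sketch of that enumeration, assuming a US QWERTY layout and a minimum pattern length of 3 (the rows and columns below are illustrative; adjust them to whatever adjacency you settle on):

    # Horizontal runs come from keyboard rows; vertical runs from columns.
    ROWS = ["1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm"]
    COLS = ["1qaz", "2wsx", "3edc", "4rfv", "5tgb", "6yhn", "7ujm", "8ik", "9ol"]
    MIN_LEN = 3  # longer runs always contain a banned 3-key run

    PATTERNS = set()
    for line in ROWS + COLS:
        for i in range(len(line) - MIN_LEN + 1):
            run = line[i:i + MIN_LEN]
            PATTERNS.add(run)
            PATTERNS.add(run[::-1])  # also ban the run typed backwards

    def has_keyboard_pattern(password):
        lowered = password.lower()
        return any(p in lowered for p in PATTERNS)

    print(has_keyboard_pattern("ZXCVBN123"))  # True: contains "zxc", "123", ...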
As for words, that's easier, but first you need to make a list of all disallowed transformations ($ -> S, 1 -> i, etc.). When you get a password, apply all the transformations to obtain a 'normalized' word. Compare the normalized password against a dictionary of all legal words twice -- the second time, reverse the password.
You will probably need to do something a little more complicated than that, because you sometimes need to ignore digits at the edges of the word: '1ncredible' can be a substitute for 'incredible', even though 'ncredible' is not a word.
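And a sketch of the normalisation-plus-dictionary check (the transformation table and the word list here are only small examples):

    # Example transformation table; extend with every substitution you disallow.
    TRANSFORMS = str.maketrans({"$": "s", "1": "i", "0": "o", "3": "e", "@": "a"})
    DICTIONARY = {"universe", "incredible", "password"}  # stand-in word list

    def contains_disallowed_word(password):
        normalized = password.lower().translate(TRANSFORMS)
        # Also try the word with leftover digits stripped from the ends
        # ("incredible7" -> "incredible").
        stripped = normalized.strip("0123456789")
        for candidate in (normalized, stripped):
            # Check the dictionary twice: as-is, and with the word reversed.
            if candidate in DICTIONARY or candidate[::-1] in DICTIONARY:
                return True
        return False

    print(contains_disallowed_word("Un1ver$e"))    # True: normalizes to "universe"
    print(contains_disallowed_word("1ncredible"))  # True: the 1 reads as an i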
If you inspect the code of http://howsecureismypassword.net, you can see that the password is compared to a large array of common passwords.
On the page there is a reference to http://xato.net/passwords/more-top-worst-passwords/, which lists the top 10,000 most common passwords.
One approach would be to download that list and check the users' passwords against it, or at least against the top 100 or so.
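The check itself then reduces to a set lookup over the downloaded list (the file name and the cut-off of 100 are placeholders):

    # One password per line, most common first; keep only the top 100.
    with open("common_passwords.txt") as f:
        common = {line.strip().lower() for _, line in zip(range(100), f)}

    def is_too_common(password):
        return password.lower() in common

    print(is_too_common("password"))  # True for any list of common passwords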
I have a tricky question that I am not sure how to approach. I have put together a plist containing dictionaries, each of which contains two objects:
- the country name
- the plug size of the country
There are only 210 countries/facts though.
I have also enabled searching through a list of many, many countries, each of which may or may not have a fact. But here is my problem: I am using a web service called Geonames, the user can search for countries through a search bar display controller, and the plist country names paired with plug sizes actually come from a Wikipedia article.
Now, the country names in Geonames and in my plist from Wikipedia might differ slightly: maybe an extra space, an extra dash, an extra letter. This is why I want to check whether the Geonames country string is very similar to the one in the plist.
So this would not be isEqualToString:, because that only checks for an exact match. Could the compare: method work?
How can I approach this? Here is an example:
Geoname returns (not a real country just an example):
Yiting
But plist may return:
Yitting
So with one extra 't', but there are other circumstances too. I would like these to be compared as exact, or at least similar, so I could consider them a match.
Are there any tutorials, resources, projects etc. you could point me towards?
Thank you! Bye!
The Soundex algorithm is useful in cases like this.
I found a sample implementation on GitHub.
You need to implement an algorithm for approximate string matching. One of the most popular is the Levenshtein distance, one of several edit-distance algorithms. The distance is calculated as the number of editing operations required to transform string A into string B; inserting, deleting, or changing a character counts as one edit operation. The closer the strings, the smaller the edit distance between them. You can calculate pairwise edit distances and pick the smallest one to identify the match.
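A compact sketch of that pairwise search (Python for brevity; the same dynamic-programming recurrence translates directly to Objective-C, and the example names echo the question's made-up 'Yiting'/'Yitting'):

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance, computed row by row.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    # Pick the plist entry closest to the Geonames name and accept it only
    # below a small threshold (2 here is an arbitrary cut-off).
    plist_names = ["Yitting", "Freedonia", "Genovia"]
    geoname = "Yiting"
    best = min(plist_names, key=lambda name: levenshtein(geoname, name))
    if levenshtein(geoname, best) <= 2:
        print("matched:", best)  # matched: Yitting (distance 1)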
You may find this post about auto update/complete useful:
I've tested that UITextView works well when your UIViewController class adopts the UITextViewDelegate protocol, and it will produce a result similar to what you'll find in the Messages app. I haven't checked whether UITextField and UITextFieldDelegate do as well.