Finding a pattern in a set of strings - vb.net

Let's say I have a set of documents that contain a person's name, such as a driver's license, a passport, an invoice, etc.
For each document, I have a process that uses OCR (Optical Character Recognition) to extract the person's name. Since the extraction may contain errors, I need to find the "correct name" in that set of strings.
So let's say I have the following strings as a person's name: "John"; "J0hn"; "JOHN"; "10hn"; "+o-."; "john smith".
As a person, you can tell that the name is John, as it is the most common occurrence.
What is the best way to do this? Is there an algorithm to find the most common occurrence in a set of strings?
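One possible way to frame "most common occurrence" for noisy OCR output is to pick the candidate that is, on average, most similar to all the others (a medoid). The sketch below is in Python rather than VB.NET and is only an illustration under assumptions: the helper names are hypothetical, case is ignored, and similarity is measured with the standard library's difflib, which may or may not suit real OCR errors.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def most_representative(candidates):
    """Return the case-folded candidate that is most similar to all the others.

    Assumption: the correct name is the string closest to the bulk of the
    OCR readings, so outliers such as "+o-." get a low total score.
    """
    normalized = [c.casefold().strip() for c in candidates]

    def total_similarity(i):
        # Sum the similarity of candidate i to every other candidate.
        return sum(
            similarity(normalized[i], other)
            for j, other in enumerate(normalized)
            if j != i
        )

    best_index = max(range(len(normalized)), key=total_similarity)
    return normalized[best_index]


names = ["John", "J0hn", "JOHN", "10hn", "+o-.", "john smith"]
print(most_representative(names))
```

Mapping common OCR confusions (for example "0" to "o", "1" to "l") before comparing would probably improve the result, but which substitutions are safe depends on the OCR engine.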

Related

How to display only searched string in column in postgresql

I want to display only the searched string from a table. As an example, this is my table:
Table name: guidelines
id content
1 An individual is accused “of” a crime, not “with” or “for” a crime. Accused, often as “the accused”, refers to the individual or individuals standing trial. EXAMPLES: The prosecutor accused the politician of bribery. The accused politician stood trial for bribery. See alleged, charged, suspected.
2 There were a lot of people getting accused on this particular town.
If I use a search query for "accused", it shows the full content:
SELECT content FROM "guidelines" WHERE "content" ILIKE '%accused%';
Result:
content
An individual is accused “of” a crime, not “with” or “for” a crime. Accused, often as “the accused”, refers to the individual or individuals standing trial. EXAMPLES: The prosecutor accused the politician of bribery. The accused politician stood trial for bribery. See alleged, charged, suspected.
There were a lot of people getting accused on this particular town.
How can I get only the first match, followed by the rest of the data in the column? As an example, this is my goal:
content
Accused, often as “the accused”, refers to th...
accused on this particular to...
Update: I changed the table and column names to make it easier to tell the table and the column apart.
In PostgreSQL, you can do that with the position and substring functions. See the following query as an example:
SELECT
id,
substring(content, position ('accused' in content)) as matched
FROM
guidelines
WHERE
content LIKE '%accused%'
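For readers who want to see the same position-then-substring logic outside the database, here is a minimal Python sketch. The function name `snippet` and the `width` parameter are hypothetical, and unlike the query above (where both position and LIKE are case-sensitive) this version matches case-insensitively, closer to the ILIKE in the question. It also truncates the result and appends "..." as in the desired output.

```python
from typing import Optional


def snippet(content: str, term: str, width: int = 30) -> Optional[str]:
    """Return the text starting at the first match of `term`, cut to `width` characters.

    Returns None when the term does not occur. Assumes lowercasing does not
    change the length of the string (true for plain ASCII text).
    """
    start = content.lower().find(term.lower())
    if start == -1:
        return None
    tail = content[start:]
    # Only add the ellipsis when something was actually cut off.
    return tail if len(tail) <= width else tail[:width] + "..."


row = "There were a lot of people getting accused on this particular town."
print(snippet(row, "accused", 29))
```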
Try this:
SELECT substring(content from '%#"accused%#"%' for '#') from guidelines;
The # is the escape character declared by the trailing for '#', and each occurrence must be followed by a double quote (").
The function returns whatever is matched between the two #" markers. In this case that is accused followed by %, i.e. the rest of the string after accused.

Filter parameters for importing users from AD

I import users using this filter:
(&(objectClass=user)(objectCategory=PERSON))
I want to add a RealName parameter to the filter.
RealName should contain any three words.
For example, if RealName contains "name middle_name surname", that's good and the user should be imported.
If RealName contains "name surname" (only two words), that's wrong and the user should not be imported.
Can you help me with the filter?
LDAP queries can only use attributes that exist in Active Directory, and there is no attribute called "RealName".
You will have to split the input string yourself. So, for example, if you were given the string "Necro The Human", you would have to split that into 3 strings using whatever programming language you're using.
Then you will have to insert those into an LDAP query that matches the three name attributes: givenName, initials, and sn (surname).
Your finished query would look something like this:
(&(objectClass=user)(objectCategory=person)(givenName=Necro)(initials=The)(sn=Human))
Check if you're using initials or the middleName attribute for the middle name. It's the initials attribute that is labelled as "Initials" in Active Directory Users and Computers, so that may be what's used, even though the documentation says it's just for the initials of the full name, or middle initials (not the full middle name). It's also limited to only 6 characters, so you may be using middleName if you're storing full middle names.
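The split-and-insert step described above could be sketched as follows in Python. This is an illustration only: the function names are hypothetical, the middle-name attribute is a parameter because it may be initials or middleName in your directory, and values are escaped following RFC 4515 so that characters like * or ( in a name cannot change the filter.

```python
from typing import Optional


def escape_filter_value(value: str) -> str:
    """Escape characters that are special inside an LDAP filter value (RFC 4515)."""
    replacements = {
        "\\": r"\5c",
        "*": r"\2a",
        "(": r"\28",
        ")": r"\29",
        "\0": r"\00",
    }
    return "".join(replacements.get(ch, ch) for ch in value)


def build_name_filter(real_name: str, middle_attr: str = "initials") -> Optional[str]:
    """Build a filter from a three-word name, or return None if it is not three words."""
    parts = real_name.split()
    if len(parts) != 3:
        return None  # e.g. "name surname" is rejected
    given, middle, surname = (escape_filter_value(p) for p in parts)
    return (
        "(&(objectClass=user)(objectCategory=person)"
        f"(givenName={given})({middle_attr}={middle})(sn={surname}))"
    )


print(build_name_filter("Necro The Human"))
print(build_name_filter("Necro Human"))
```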
If your company has the standard of setting the displayName to the user's full name, including middle name, then you could just match against that. But I think it would be pretty rare that the middle name would be in the display name.
(&(objectClass=user)(objectCategory=person)(displayName=Necro The Human))
There is also ambiguous name resolution, but it searches other attributes (not just the first/last name) and it does not include initials or middleName. I mention it only because it's not well known and you may find some other use for it one day.

How can I search lucene for "John J" and get people like "John Jameson" not just people with John?

For reasons out of my control, I must do this with a global search. I've taken to converting a search term "John J" into (John AND J), which works for anyone whose last name doesn't start with the same letter as their first.
How can I make the search for "John J" become "find all people who have John and then another, different J in the field"?
Thanks for your time.
You may want to try out Wildcard Query. For example:
Term term = new Term("secondName", "J*");
Query query = new WildcardQuery(term);
I am assuming you have different fields for first and second name. You can create a boolean query with a combination of queries for first and second names.
Documentation for WildcardQuery: http://lucene.apache.org/core/6_2_0/core/org/apache/lucene/search/WildcardQuery.html
I hope this helps.
Since you mentioned it is a type-ahead input, PrefixQuery might help:
new PrefixQuery(new Term("lastName","J"));
This will return all documents with lastName starting with "J".
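To get results where firstName starts with "John" and lastName starts with "J", you can use:
// Both clauses are required, so each is added with Occur.MUST
BooleanQuery.Builder booleanQueryBuilder = new BooleanQuery.Builder();
booleanQueryBuilder.add(new PrefixQuery(new Term("firstName", "John")), BooleanClause.Occur.MUST);
booleanQueryBuilder.add(new PrefixQuery(new Term("lastName", "J")), BooleanClause.Occur.MUST);
BooleanQuery query = booleanQueryBuilder.build();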

Am I training my wit.ai bot correctly?

I'm trying to train my Wit.ai bot to recognize someone's first name. I'm not sure I understand how the NLP works, so I'll give you an example.
I defined a lot of expressions like "My name is XXXX", "Everybody calls me XXXX".
In the "Understanding" tab I added an entity named "contact_name" and added almost 50 keywords like "Michel, John, Mary...".
I set the trait to "free-text" and "keywords".
I'm not sure if this process is correct. So, I ask you:
Does the context, like "My name is...", matter for the NLP? I mean, will it help the bot predict that a first name will probably come after this expression?
Is it right to add something like 50 values to an entity, or is it completely wrong?
What do you suggest as a training process for getting someone's first name?
You have done it right by keeping the entity's search strategy as "free-text" and "keywords". But adding keyword examples to the entity doesn't make sense, because a person's name is not a keyword.
So, I would recommend the following training strategy:
Create various templates of the message, like "My name is XYZ", "I am XYZ", "This is XYZ", etc. (all the introduction messages you can think of).
Remove all keywords and expressions for the entity you created and add these two keywords:
"a b c d e f g h i j k l m n o p q r s t u v w x y z"
"XYZ" (this can be any name, but keep it the same when validating the templates)
In the 'Understanding' tab, enter the messages, extract the name into the entity ("contact_name" in your case), and validate them.
Similarly, validate all message templates, keeping the name as "XYZ".
After you have done this for all templates, your bot will be able to recognise any name in a given message template.
The logic behind this is that your entity is both free-text and keyword, which means it first tries to match a keyword and, if none matches, tries to find the word in the same position as in the templates. Keeping the name the same across validations helps train the bot on the templates so it learns the position where the name is usually found.
Hope this works. I have tried this and it worked for me. I am not sure how the bot trains in the background. I recommend you start a new app and do this exercise.
Comment if there is any problem.
wit.ai has a pre-trained entity extraction method called wit/contact, which
Captures free text that's either the name or a clear reference to a
person, like "Paul", "Paul Smith", "my husband", "the dentist".
It works well even without any training data.
To read about the method, refer to duckling.

Need to extract information such as location, course, etc. from free text

I need to write a text parser for the education domain which can extract information like institute, location, course, etc. from free text.
Currently I am doing it with Lucene. The steps are as follows:
Index all the data related to institutes, courses and locations.
Make shingles of the free text, search each shingle in the location, course and institute index directories, and then try to work out which part of the text represents a location, course, etc.
With this approach I am missing a lot of cases; for example, B.tech can be written as btech, b-tech or b.tech.
I want to know whether there is anything available that can do all these kinds of things. I have heard about LingPipe and GATE but don't know how efficient they are.
You definitely need GATE. GATE has two main, most frequently used features (among thousands of others): rules and dictionaries. Dictionaries (gazetteers in GATE's terms) allow you to put all possible cases like "B.tech", "btech" and so on in a single text file and let GATE find and mark them all. Rules (more precisely, JAPE rules) allow you to define patterns in text. For example, here's a pattern to catch MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):
{Token.kind == number}(SP){Token.orth == uppercase}(SP){Lookup.majorType == avenue}(COMMA)(SP)
{Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
{Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}
where (SP) and (COMMA) are macros (just to make the text shorter), {Something} is an annotation, {Token.kind == number} is the annotation "Token" with the feature "kind" equal to "number" (i.e. just a number in the text), and {Lookup} is an annotation that captures values from a dictionary (BTW, GATE already has dictionaries for such things as US cities). This is quite a simple example, but you should see how easily you can cover even very complicated cases.
I haven't used Lucene, but in your case I would leave the different forms of the same keyword as they are and just keep a link table or similar. In this table I'd keep the relation between these different forms.
You may need to write a regular expression to cover each possible form of your vocabulary.
Be careful about your choice of analyzer / tokenizer, because words like B.tech can easily be split into two different words (i.e. B and tech).
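As an illustration of one regular expression covering several written forms, here is a small Python sketch for the B.tech example. The pattern and the helper name `normalize_degree` are assumptions for this example only; a real vocabulary would need one such pattern (or a link-table entry) per term.

```python
import re

# Matches "btech", "b.tech", "b-tech" and "b tech" in any letter case.
# The word boundaries keep it from matching inside words such as "biotech".
BTECH_PATTERN = re.compile(r"\bb[\s.\-]?tech\b", re.IGNORECASE)


def normalize_degree(text: str) -> str:
    """Rewrite every recognised spelling of B.Tech to the canonical form 'btech'."""
    return BTECH_PATTERN.sub("btech", text)


print(normalize_degree("B.tech in Pune, b-tech or B Tech, but not biotech"))
```

Normalizing the text before it reaches the tokenizer sidesteps the problem of "B.tech" being split into "B" and "tech".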
You may want to check out UIMA. Like LingPipe and GATE, this framework features text annotation, which is what you are trying to do. Here is a tutorial which will help you write an annotator for UIMA:
http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.aae.developing_annotator_code
UIMA has addons, in particular one for Lucene integration.
You can try http://code.google.com/p/graph-expression/
Example of address parsing rules:
GraphRegExp.Matcher Token = match("Token");
GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
GraphRegExp.Matcher Postcode =
    mark("Postcode", seq(GraphUtils.regexp("^\\d{5}$", Token), opt(GraphUtils.regexp("^\\d{4}$", Token))));
//mark(String, Matcher) -- means creating chunk over sub matcher
GraphRegExp.Matcher streetAddress = mark("StreetAddress", seq(Number, times(Token, 2, 5).reluctant()));
//without new lines
streetAddress = regexpNot("\n", streetAddress);
GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));
Chunker chunker = Chunkers.pipeline(
    Chunkers.regexp("Token", "\\w+"),
    Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
    new GraphExpChunker("Address",
        seq(
            opt(streetAddress),
            opt(Postoffice),
            City,
            StateLike,
            Postcode,
            Country
        )
    ).setDebugString(true)
);
B.tech can be written as btech, b-tech or b.tech
Lucene will let you do fuzzy searches based on the Levenshtein Distance. A query for roam~ (note the ~) will find terms like foam and roams.
That might allow you to match the different cases.
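To make the idea of Levenshtein distance concrete, here is a minimal Python sketch of the classic dynamic-programming computation. It is not Lucene's implementation, only an illustration of the measure: the distance is the number of single-character insertions, deletions or substitutions needed to turn one string into the other.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between two strings."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


for other in ["foam", "roams"]:
    print("roam", other, levenshtein("roam", other))
for variant in ["b.tech", "b-tech"]:
    print("btech", variant, levenshtein("btech", variant))
```

Each pair here is one edit apart, which is why a fuzzy query can treat "b.tech" and "b-tech" as close to "btech", provided the analyzer keeps each form as a single term.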