Extract text from documents like PAN and Aadhaar - laravel-8

I am using the Google Cloud Vision API to extract text from Aadhaar and PAN cards. How can I get the exact user details like name, father's name, and address?
Raw Data
ଭାରତ ସରକାର
Government of India
ଜିତ୍ୟାନନ୍ଦ ଖେମୁକୁ
NITYANANDA KHEMUDU
ପିତା : ସୀତାରାମ ଖେମୁକୁ
Father: Sitaram Khemudu
ଜନ୍ମ ତାରିଖ / DOB : 01.07.1999
ପୁରୁଷ / Male
ମୋ ଆଧାର, ମୋ ପରିଚୟ (My Aadhaar, My Identity)

I have built 5-6 OCR pipelines to date (Aadhaar, PAN, ITR, driving licence, etc.) using the Google Cloud Vision API. I think you are looking for a response like:
{"pan_card_no":"ECXXXXXX123",
"name":"fshksj"
}
To get such a response you need to build your own logic. Here is some of the logic I can share with you:
1. Perform OCR on your document using the Google Cloud Vision API and store the response in one array (Google returns the text line by line).
2. In the case above, if you want to grab the DOB, build logic like: if "DOB" appears in an item, grab the numeric value from that item.
3. To get the name, drop the unnecessary items from the list using conditions like (if "India" in i) or (if i.isdigit()); by dropping such items from the main list you are left with the name.
4. To grab the address: 95% of the time the address ends with a PIN code, so treat the PIN code as the last index of the address, look for an "Address"-like keyword, and then join all the elements from the keyword's index to the PIN code's index (this is easily done with a list). To validate the PIN code you can use a library like Pyzipin.
There are multiple conditions you can use; the above are just the very basic ones, and a minimal sketch of them follows. If you need any specific logic, you can ask me.
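Here is a minimal Python sketch of that line-by-line logic, based on the sample output shown in the question. The field labels ("DOB", "Father") and the noise-filtering rules are assumptions you would tune per document type:

import re

def parse_aadhaar(lines):
    """Extract fields from line-by-line OCR output (sketch, not production code)."""
    details = {}
    for line in lines:
        if "DOB" in line:
            # grab the numeric value next to the DOB label
            m = re.search(r"\d{2}\.\d{2}\.\d{4}", line)
            if m:
                details["dob"] = m.group()
        elif line.startswith("Father"):
            details["father_name"] = line.split(":", 1)[1].strip()
    # drop noise lines; what survives as an all-uppercase line is usually the name
    noise = ("india", "government")
    candidates = [l for l in lines
                  if l.isupper()
                  and not any(c.isdigit() for c in l)
                  and not any(n in l.lower() for n in noise)]
    if candidates:
        details["name"] = candidates[0]
    return details

print(parse_aadhaar(["Government of India", "NITYANANDA KHEMUDU",
                     "Father: Sitaram Khemudu", "DOB : 01.07.1999", "Male"]))
# -> {'father_name': 'Sitaram Khemudu', 'dob': '01.07.1999', 'name': 'NITYANANDA KHEMUDU'}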

Related

Twitter Premium API Profile location operators profile_country: and profile_region: not working

I am using a premium account (not sandbox) for data collection.
I want to collect:
All tweets in English that contain ‘china’ or ‘chinese’ that are user geolocated to US and not geolocated at tweet level, excluding all retweets
All tweets in English that contain ‘china’ or ‘chinese’ that are user geolocated to ‘Minnesota’ and not geolocated at tweet level, excluding all retweets
The code is as follows:
from searchtweets import ResultStream, gen_rule_payload, load_credentials

premium_search_args = load_credentials('twitter_API.yaml',
                                       yaml_key='search_tweets_premium_api',
                                       env_overwrite=False)
# keywords for the search
# keyword 1
keywords = '(China OR Chinese) lang:en profile_country:US -place_country:US -is:retweet'
# keyword 2
keywords = '(China OR Chinese) lang:en -place_country:US profile_region:"Minnesota" -is:retweet'
# define search rule
rule = gen_rule_payload(keywords, from_date='2019-12-01',
                        to_date='2019-12-10', results_per_call=500)
# create result stream and print before start
rs = ResultStream(rule_payload=rule, max_results=1250000,
                  **premium_search_args)
My problems are:
For the first one, a large portion of the results I get don't satisfy the query. First, some don't have the Profile Geo enrichment, i.e. the user.derived.locations attribute is not in the user object. Second, when it is present, many don't have country code US, i.e. they are identified as other countries.
For the second one, the result I get from this method is a smaller subset of the results I can get from 1). That is, when I filter all tweets user-geolocated to Minnesota (by user.derived.locations.region) out of profile_country:US, it gives a larger sample than using profile_region:"Minnesota". A considerable amount of data is missing using this method.
I have tried several times, but it seems that the user geolocation operators don't do exactly what I want. Does anyone have any idea why this is the case? I would very much appreciate any answers/suggestions/comments.
Thank you!
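One way to quantify the first problem is to post-filter the stream and count how many tweets actually carry the enrichment. A sketch, assuming tweets arrive as dicts and that user.derived.locations is a list of location objects as described in the Profile Geo enrichment docs:

def is_us_profile(tweet):
    # the Profile Geo enrichment may be absent entirely
    locations = tweet.get("user", {}).get("derived", {}).get("locations", [])
    return any(loc.get("country_code") == "US" for loc in locations)

tweets = list(rs.stream())
us_tweets = [t for t in tweets if is_us_profile(t)]
print(len(us_tweets), "of", len(tweets), "tweets are user-geolocated to the US")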

Google Maps API returns NotFound result for valid postal code for distance calculation

We are using the Google Maps API to find the distance between two postal codes, and we can get the distance between any two postal codes using it (see the working Map API example). For some postal codes, however, the API returns a NotFound error when the postal code is sent without a space, while for the same postal code, if we add a space after the first 3 characters, it works and returns the distance correctly.
API URL
http://maps.googleapis.com/maps/api/distancematrix/json?origins=H3W3C4&destinations=H1C0A6&mode=driving&language=en-EN&sensor=false
Example 1:
url: http://maps.googleapis.com/maps/api/distancematrix/json?origins=H3W3C4&destinations=H1A0C2&mode=driving&language=en-EN&sensor=false
postal-code: H1A0C2
api result: NotFound
Example 2:
url: http://maps.googleapis.com/maps/api/distancematrix/json?origins=H3W3C4&destinations=H1A 0C2&mode=driving&language=en-EN&sensor=false
postal-code: H1A 0C2
api result: returns correct distance
Below is a list of postal codes for which the API returns NotFound if the postal code is given without a space (but returns results if a space is given, like 'H1A 0C2'):
H1A0C2
H1A0C3
H1A0C4
H1B0B7
H1C0E3
H1C0E4
H1C0E5
H1C0E6
H1C0E7
H1C0E8
H1C0E9
H1C0G1
H1C0G2
Below is a list of postal codes for which the API returns the distance correctly with or without a space (e.g. both 'H1C 0A7' and 'H1C0A7' work):
H1C0A7
H1C0A8
H1C0A9
Though it works with or without a space for most postal codes, for a few it does not return values without the space. What could be the reason?
I would suggest checking if the postal code exists in the Google database.
For example, the postal code H1A0C2 seems to be missing:
https://developers-dot-devsite-v2-prod.appspot.com/maps/documentation/utils/geocoder/#q%3D%26options%3Dtrue%26in_country%3DCA%26in_postal_code%3DH1A0C2
As you can see, the geocoder tool returns only the postal code prefix H1A, but not the postal code itself.
For the postal code H1C0A9 the geocoder returns a complete postal code:
https://developers-dot-devsite-v2-prod.appspot.com/maps/documentation/utils/geocoder/#q%3D%26in_country%3DCA%26in_postal_code%3DH1C0A9
I think the Distance Matrix cannot find distances for missing postal codes. However, when you add a space it can find the postal code prefix and calculates the distance based on the coordinates of that prefix, so the result might not be precise enough in this case.
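As a practical workaround, you could normalize Canadian postal codes to their spaced form before calling the Distance Matrix API. A minimal sketch, assuming the requests library (the endpoint and parameters are taken from the question; note that current versions of the API also require a key):

import requests

def format_ca_postal_code(code):
    # insert the space after the FSA (first 3 characters): 'H1A0C2' -> 'H1A 0C2'
    code = code.replace(" ", "").upper()
    return code[:3] + " " + code[3:]

resp = requests.get(
    "https://maps.googleapis.com/maps/api/distancematrix/json",
    params={
        "origins": format_ca_postal_code("H3W3C4"),
        "destinations": format_ca_postal_code("H1A0C2"),
        "mode": "driving",
        # "key": "YOUR_API_KEY",  # required on current API versions
    },
)
print(resp.json()["rows"][0]["elements"][0]["status"])  # e.g. "OK" or "NOT_FOUND"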
You can report missing postal codes to Google as described in the following support doc:
https://support.google.com/maps/answer/3094088
Hope this helps!

Am I training my wit.ai bot correctly?

I'm trying to train my Wit.ai bot to recognize someone's first name. I'm not sure I fully understand how the NLP works, so I'll give you an example.
I defined a lot of expressions like "My name is XXXX" or "Everybody calls me XXXX".
In the "Understanding" table I added an entity named "contact_name" and added almost 50 keywords like "Michel, John, Mary...".
I set the trait as "free-text" and "keywords".
I'm not sure if this process is correct, so I ask you:
does the context, like "My name is...", matter for the NLP? I mean, will it help the bot predict that a first name will probably come after this expression?
is it right to add some 50 values to an entity, or is it completely wrong?
what do you suggest as a training process in order to get someone's first name?
You have done it right by keeping the entity's search strategy as "free-text" and "keywords". But adding keyword examples to the entity doesn't make much sense, because a person's name is not a keyword.
So, I would recommend a training strategy as follows:
1. Create various templates of the message, like "My name is XYZ", "I am XYZ", "This is XYZ", etc. (all possible introduction messages you can think of).
2. Remove all keywords and expressions for the entity you created and add these two keywords:
"a b c d e f g h i j k l m n o p q r s t u v w x y z"
"XYZ" (you can use any name, but keep it the same for validating the templates)
3. In the 'Understanding' tab, enter the messages, extract the name into the entity ("contact_name" in your case), and validate them.
4. Similarly, validate all message templates, keeping the name as "XYZ".
After you have done this for all templates, your bot will be able to recognise any name in a given template of the message.
The logic behind this is that your entity is both free-text and keyword: the bot first tries to match a keyword, and if nothing matches, it tries to find the word in the same position as in the templates. Keeping the name the same during validation helps the bot learn the templates and the position where the name is usually found.
Hope this works; I tried this and it worked for me. I am not sure how the bot trains in the background, so I recommend you start a new app and do this exercise.
Comment if there is any problem.
wit.ai has a pre-trained entity extraction method called wit/contact, which
Captures free text that's either the name or a clear reference to a
person, like "Paul", "Paul Smith", "my husband", "the dentist".
It works well even without any training data.
To read about the method, refer to Duckling.
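To inspect what either approach extracts, you can query Wit.ai's /message endpoint directly. A sketch, assuming a server access token and that the extracted values are returned under "entities" (check the current Wit.ai docs for the exact response shape):

import requests

WIT_TOKEN = "YOUR_SERVER_ACCESS_TOKEN"  # placeholder

resp = requests.get(
    "https://api.wit.ai/message",
    headers={"Authorization": "Bearer " + WIT_TOKEN},
    params={"q": "My name is Paul Smith"},
)
# extracted names appear keyed by entity name
# ("contact_name" in your case, or the built-in wit/contact)
print(resp.json().get("entities"))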

VK API - return city names in Latin

I am using the VK API to get a list of cities in a specific country. Does anyone know how to show Russian city names (which are in Cyrillic) in Latin script?
Example of JSON response:
http://api.vk.com/method/places.getCities?lang=en&country_id=1&count=1000&need_all=1
I am trying to check if a city exists, but if someone enters a city name in Latin script the check works only in some cases: for example, Vladivostok is Владивосток, but Moscow is Москва.
I found one solution that works for me: first get all the city ids via the places.getCities (database.getCities) method and then use database.getCitiesById, providing the saved city ids, like the following:
database.getCitiesById?lang=en&city_ids=1,2,123
In the request to this API method you can specify the needed language via the "lang" parameter (e.g. lang=en) and pass up to 1,000 comma-separated city ids (e.g. city_ids=1,2,123,...).
database.getCitiesById official documentation
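A minimal sketch of that two-step flow, assuming an access token and API version (response shapes follow the VK docs; verify against the version you use):

import requests

API = "https://api.vk.com/method"
COMMON = {"access_token": "YOUR_TOKEN", "v": "5.131", "lang": "en"}  # placeholder token

# step 1: collect city ids for a country (country_id=1 is Russia in VK's database)
cities = requests.get(API + "/database.getCities",
                      params={**COMMON, "country_id": 1, "need_all": 1, "count": 1000}).json()
city_ids = [c["id"] for c in cities["response"]["items"]]

# step 2: resolve up to 1,000 ids per call to English titles via lang=en
latin = requests.get(API + "/database.getCitiesById",
                     params={**COMMON, "city_ids": ",".join(map(str, city_ids[:1000]))}).json()
print(latin["response"][:5])  # e.g. [{'id': ..., 'title': 'Moscow'}, ...]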
As an alternative, you can get a static list of Russian cities here. This JSON array contains the slug name of each city in English. It's a free list, so anyone can edit it and add correct information for any locality.

Need to extract information from free text, like location, course, etc.

I need to write a text parser for the education domain that can extract information like institute, location, course, etc. from free text.
Currently I am doing it with Lucene; the steps are as follows:
Index all the data related to institutes, courses, and locations.
Make shingles of the free text, search each shingle in the location, course, and institute index directories, and then try to find out which part of the text represents a location, course, etc.
With this approach I am missing a lot of cases, e.g. B.tech can be written as btech, b-tech, or b.tech.
I want to know whether anything is available that can do all these kinds of things. I have heard about LingPipe and GATE but don't know how effective they are.
You definitely need GATE. GATE has two main, most frequently used features (among thousands of others): rules and dictionaries. Dictionaries (gazetteers in GATE's terms) allow you to put all possible cases like "B.tech", "btech" and so on in a single text file and let GATE find and mark them all. Rules (more precisely, JAPE rules) allow you to define patterns in text. For example, here's a pattern to catch MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):
{Token.kind == number}(SP){Token.orth == uppercase}(SP){Lookup.majorType == avenue}(COMMA)(SP)
{Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
{Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}
where (SP) and (COMMA) are macros (just to make the text shorter), {Something} is an annotation, {Token.kind == number} is an annotation "Token" with feature "kind" equal to "number" (i.e. just a number in the text), and {Lookup} is an annotation that captures values from a dictionary (BTW, GATE already has dictionaries for such things as US cities). This is a quite simple example, but you should see how easily you can cover even very complicated cases.
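For instance, the "B.tech" variants could live in a plain gazetteer list file (a sketch; in GATE each .lst file is registered in lists.def with a major type):

degrees.lst:
B.tech
B.Tech
btech
b-tech

lists.def:
degrees.lst:degree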
I didn't use Lucene, but in your case I would leave the different forms of the same keyword as they are and just keep a link table or similar that records the relation between these different forms.
You may need to write a regular expression to cover each possible form of your vocabulary.
Be careful about your choice of analyzer/tokenizer, because words like B.tech can easily be split into two different words (i.e. B and tech).
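A minimal sketch of that link-table idea, assuming you normalize tokens both at index time and at query time (the names here are my own illustration):

# map every surface form of a keyword to one canonical form
CANONICAL = {
    "b.tech": "btech",
    "b-tech": "btech",
    "btech": "btech",
}

def normalize(token):
    token = token.lower()
    return CANONICAL.get(token, token)

print(normalize("B.Tech"))  # -> "btech", so every variant hits the same index term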
You may want to check UIMA. Like LingPipe and GATE, this framework features text annotation, which is what you are trying to do. Here is a tutorial that will help you write an annotator for UIMA:
http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.aae.developing_annotator_code
UIMA has add-ons, in particular one for Lucene integration.
You can try http://code.google.com/p/graph-expression/. Here is an example of address parsing rules:
GraphRegExp.Matcher Token = match("Token");
GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
// mark(String, Matcher) creates a chunk over the sub-matcher
GraphRegExp.Matcher Postcode = mark("Postcode",
        seq(GraphUtils.regexp("^\\d{5}$", Token),
            opt(GraphUtils.regexp("^\\d{4}$", Token))));
GraphRegExp.Matcher streetAddress = mark("StreetAddress",
        seq(Number, times(Token, 2, 5).reluctant()));
// disallow new lines inside the street address
streetAddress = regexpNot("\n", streetAddress);
GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));

Chunker chunker = Chunkers.pipeline(
        Chunkers.regexp("Token", "\\w+"),
        Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
        new GraphExpChunker("Address",
                seq(
                        opt(streetAddress),
                        opt(Postoffice),
                        City,
                        StateLike,
                        Postcode,
                        Country
                )
        ).setDebugString(true)
);
"B.tech can be written as btech, b-tech or b.tech"
Lucene will let you do fuzzy searches based on the Levenshtein distance. A query for roam~ (note the ~) will find terms like foam and roams.
That might allow you to match the different cases.
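To see why that helps here, a tiny pure-Python edit-distance check (my own illustration; Lucene's classic fuzzy matching uses the same distance):

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("roam", "foam"))     # 1, so roam~ finds foam
print(levenshtein("b.tech", "btech"))  # 1, so btech~ could match the "B.tech" variants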