How to retrieve book's information in XML/JSON from library of congress by ISBN - api

The Library of Congress has a site to search books by ISBN. A simple way to retrive book's information is using a URL like:
http://lccn.loc.gov/2009019559/mods
where it returns a XML structure that may parse easily. The URL requires a unique LCCN number in the the following format:
http://lccn.loc.gov/[lccn]/mods
I have a batch of books that has ISBN encoded in barcode. How may I retrieve/convert ISBN to LCCN in order to retrieve the XML data of the book?

You can use the SRU catalog from the Library of Congress. The query would look something like this:
lx2.loc.gov:210/lcdb?version=1.1&operation=searchRetrieve&query=bath.isbn=[ISBN]&maximumRecords=1&recordSchema=mods
Replacing [ISBN] with the ISBN you want to look up
Within that response is an LCCN element. However, the catalog already returns MODS, so it might not be necessary to do anything at all.

You may use the Google Books API, for example: https://www.googleapis.com/books/v1/volumes?q=LCCN2001051058
Answer is in JSON format. It includes both ISBN-10 and ISBN-13 identifiers. You will have to batch the requests using your favorite programming language, in Pharo Smalltalk with PetitJson parser and Zinc with HTTPS support it would be:
| parser lccnCollection |
parser := PPParserResource current parserAt: PPJsonParser.
lccnCollection := #('2001051058' '2001051058').
lccnCollection do: [: lccnNumber |
| json jsonObject |
json := (Url absoluteFromText: 'https://www.googleapis.com/books/v1/volumes?q=LCCN' , lccnNumber) retrieveContents contents.
jsonObject := parser parse: json.
" ... retrieve ISSN from jsonObject, etc ... "].
Beware you may need an API key to make batch requests to Google.

Related

extract text from documents like PAN and Aadhaar

I am using cloud google vision API to extract text from Aadhaar and PAN. How can I get exact user details like name, father's name, and address?
Raw Data
ଭାରତ ସରକାର
Government of India
ଜିତ୍ୟାନନ୍ଦ ଖେମୁକୁ
NITYANANDA KHEMUDU
ପିତା : ସୀତାରାମ ଖେମୁକୁ
Father: Sitaram Khemudu
ଜନ୍ମ ତାରିଖ / DOB : 01.07.1999
ପୁରୁଷ / Male
ମୋ ଆଧାର, ମୋ ପରିଚୟ
I have built 5-6 OCR till date like aadhar, pan, ITR, Driving Linces etc., using google cloud vision API, I think you are looking for response like
{"pan_card_no":"ECXXXXXX123",
"name":"fshksj"
}
to get such response you need to built your own logic, here are some logic's i can share with you
Perform OCR on your document using Google_cloud_vision API and store that response into one array (Goggle gives logic line by line)
Like in above case if you want to grab DOB first you can build logic like i) if "DOB" in (list of item) then grab the numeric values
To get the name what you can do is dropping the unnecessary items from list by if using if condition like (if "India" in i) or (if i.isdigit()) then drop it likewise you can drop the unnesseary items from main list to get the Name
to grab the Address what you can do is, 95% of the time address come with pincode at last, so what you can do is treat pincode as a last index of address and look of "Address" kind of keyword then add all the elements from "Add keyword index" to "pincode index" ( this can be easily done in list) to validate whether the pincode is valid or not you can use library like Pyzipin
There are multiple conditions that you can use, above are the very basic one i mentioned, if you need any specific logic then then you can ask me

How can I put several extracted values from a Json in an array in Kusto?

I'm trying to write a query that returns the vulnerabilities found by "Built-in Qualys vulnerability assessment" in log analytics.
It was all going smoothly I was getting the values from the properties Json and turning then into separated strings but I found out that some of the terms posses more than one value, and I need to get all of them in a single cell.
My query is like this right now
securityresources | where type =~ "microsoft.security/assessments/subassessments"
| extend assessmentKey=extract(#"(?i)providers/Microsoft.Security/assessments/([^/]*)", 1, id), IdAzure=tostring(properties.id)
| extend IdRecurso = tostring(properties.resourceDetails.id)
| extend NomeVulnerabilidade=tostring(properties.displayName),
Correcao=tostring(properties.remediation),
Categoria=tostring(properties.category),
Impacto=tostring(properties.impact),
Ameaca=tostring(properties.additionalData.threat),
severidade=tostring(properties.status.severity),
status=tostring(properties.status.code),
Referencia=tostring(properties.additionalData.vendorReferences[0].link),
CVE=tostring(properties.additionalData.cve[0].link)
| where assessmentKey == "1195afff-c881-495e-9bc5-1486211ae03f"
| where status == "Unhealthy"
| project IdRecurso, IdAzure, NomeVulnerabilidade, severidade, Categoria, CVE, Referencia, status, Impacto, Ameaca, Correcao
Ignore the awkward names of the columns, for they are in Portuguese.
As you can see in the "Referencia" and "CVE" columns, I'm able to extract the values from a specific index of the array, but I want all links of the whole array
Without sample input and expected output it's hard to understand what you need, so trying to guess here...
I think that summarize make_list(...) by ... will help you (see this to learn how to use make_list)
If this is not what you're looking for, please delete the question, and post a new one with minimal sample input (using datatable operator), and expected output, and we'll gladly help.

Google API doesn't return data for ISBN

I am using the Google API to search for books by ISBN.
I am trying with these 3 codes
0716604892, 0716604892, 0544506723
like this
https://www.googleapis.com/books/v1/volumes?q=isbn:0716604892
https://www.googleapis.com/books/v1/volumes?q=isbn:0716604892
https://www.googleapis.com/books/v1/volumes?q=isbn:0544506723
but the last one doesn't return anything when the book exists, and the ISBN code is right.
Why is that?
Use ISBN in uppercase.
type it as:
https://www.googleapis.com/books/v1/volumes?q=ISBN:0544506723
instead of:
https://www.googleapis.com/books/v1/volumes?q=isbn:0544506723

read all document by using particular category name using alfresco search.luceneSearch or search.lib.js

Category Name
|
Geograpy (8)
Study Db (18)
i am implement my own advance search in alfresco. i need to read all files which related with particular category.
example:
if there is 20 file under geograpy, lucene query should read particular document under search key word "banana".
Further explanation -
I am using search.lib.js to search. I would like to analyze the result to find out to which category the documents belong to. For example I would like to know how many documents belong to the category under Languages and the subcategories. I experimented with the Classification API but I don't get the result I want. Any Idea how to go through the result to get the category name of each document?
is there any simple method like node.properties["cm:creator"]?
thanks
janaka
I think you should specify more your question:
Are you using cm:content or a customized content?
Are you going to search the keyword inside the content of the file? or are you going to search the keyword in a specific metadata(s)?
Do you want to create a webscript (java or javascript)?
One thing to take in consideration:
if you use +PATH:"cm:generalclassifiable/...." for the categorization in your lucene queries, the performance will be slow (following my experince)
You can use for example the next query to find all nodes at any depth below /cm:Languages:
var results = search.luceneSearch("+PATH:\"cm:generalclassifiable/cm:Languages//*\");
Take a look to this url: https://wiki.alfresco.com/wiki/Search#Path_Queries
Once you have all the elements, you can loop all, and get to which category below. Of course you need to create some counter per each category/subcategory:
for(i = 0; i < results.length; i++){
var node = results[i];
var categoryNodeRef = node.properties["cm:categories"];
var categoryDesc = categoryNodeRef.properties["cm:description"];
var categoryName = categoryNodeRef.properties["cm:name"];
}
This is not exactly the solution, but can be a useful idea to start.
Sorry if it's not what you're asking for, I have just arrived from my holidays.

Need to extract information from free text, information like location, course etc

I need to write a text parser for the education domain which can extract out the information like institute, location, course etc from the free text.
Currently i am doing it through lucene, steps are as follows:
Index all the data related to institute, courses and location.
Making shingles of the free text and searching each shingle in location, course and institute index dir and then trying to find out which part of text represents location, course etc.
In this approach I am missing lot of cases like B.tech can be written as btech, b-tech or b.tech.
I want to know is there any thing available which can do all these kind of things, I have heard about Ling-pipe and Gate but don't know how efficient they are.
You definitely need GATE. GATE has 2 main most frequently used features (among thousands others): rules and dictionaries. Dictionaries (gazetteers in GATE's terms) allow you to put all possible cases like "B.tech", "btech" and so on in a single text file and let GATE find and mark them all. Rules (more precisely, JAPE-rules) allow you to define patterns in text. For example, here's pattern to catch MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):
{Token.kind == number}(SP){Token.orth == uppercase}(SP){Lookup.majorType == avenue}(COMMA)(SP)
{Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
{Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}
where (SP) and (COMMA) - macros (just to make text shorter), {Somthing} - is annotation, , {Token.kind == number} - annotation "Token" with feature "kind" equal to "number" (i.e. just number in the text), {Lookup} - annotation that captures values from dictionary (BTW, GATE already has dictionaries for such things as US cities). This is quite simple example, but you should see how easily you can cover even very complicated cases.
I didn't use Lucene but in your case I would leave different forms of the same keyword as they are and just hold a link table or such. In this table I'd keep the relation of these different forms.
You may need to write a regular expression to cover each possible form of your vocabulary.
Be careful about your choice of analyzer / tokenizer, because words like B.tech can be easily split into 2 different words (i.e. B and tech).
You may want to check UIMA. As Lingpipe and Gate, this framework features text annotation, which is what you are trying to do. Here is a tutorial which will help you write an annotator for UIMA:
http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.aae.developing_annotator_code
UIMA has addons, in particular one for Lucene integration.
You can try http://code.google.com/p/graph-expression/
example of Adress parsing rules
GraphRegExp.Matcher Token = match("Token");
GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
GraphRegExp.Matcher Postcode =
mark("Postcode", seq(GraphUtils.regexp("^\\d{5}$", Token), opt(GraphUtils.regexp("^\\d{4}$", Token))))
;
//mark(String, Matcher) -- means creating chunk over sub matcher
GraphRegExp.Matcher streetAddress = mark("StreetAddress", seq(Number, times(Token, 2, 5).reluctant()));
//without new lines
streetAddress = regexpNot("\n", streetAddress);
GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));
Chunker chunker = Chunkers.pipeline(
Chunkers.regexp("Token", "\\w+"),
Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
new GraphExpChunker("Address",
seq(
opt(streetAddress),
opt(Postoffice),
City,
StateLike,
Postcode,
Country
)
).setDebugString(true)
);
B.tech can be written as btech, b-tech or b.tech
Lucene will let you do fuzzy searches based on the Levenshtein Distance. A query for roam~ (note the ~) will find terms like foam and roams.
That might allow you to match the different cases.