Decoding a PDF page stream (I don't know the specific name)

I'm looking for how to decode a PDF page stream (as in the title, I don't know its specific name).
It looks like this
/OC /MC0 BDC ./Artifact <</O /Layout >>BDC .BT./CS0 cs 0.075 0.463 0.78 scn./GS0 gs./T1_0 1 Tf.18.75 0 0 18.75 40.1772 552.638 Tm.[(AF t)15(oolkit )]TJ.ET.EMC ./Artifact <</O /Layout >>BDC .BT./T1_1 1 Tf.18.75 0 0 18.75 140.6188 552.638 Tm.[(Det)15(ect, Pr)25(ot)15(ect a
I could find some keywords (BT, Tm, etc.) on Google.
However, I can't find other keywords like /OC, /MC0, BDC, ...
So, does anyone know how to look up what all these keywords mean?
Thanks.

You should simply look up the specification, i.e. ISO 32000. Adobe published a copy of the first version, ISO 32000-1:2008, on their website to download for free. In this copy the ISO page headers have been replaced (so you may not use it for audits etc.), but the technical contents are untouched. Simply google for PDF32000; currently it's at https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf but that may not be a permalink.
Be aware, though, that some of the words you're looking for are names which are defined in your PDF itself. E.g. in your
/OC /MC0 BDC
...
EMC
the MC0 is an arbitrary name defined in the resources of your content stream, so googling for that name or searching for it in the specification won't help. Instead search for the instructions, BDC and EMC here; their explanations will tell you how to interpret those names.
But actually this example also illustrates an exception to the advice above, because the name OC is special, so searching for it will help you along.
In this example the BDC and EMC pair of instructions envelops marked content, which here is used to define optional content (thus the OC name), while the name MC0 is simply the name of the properties resource that describes the optional content group in question.
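If you just want to dump a page's content stream as a list of operands and operators, so that you can look each operator up in ISO 32000, a small Python sketch may help. This one uses the third-party pikepdf library; the library choice and the file name are my own assumptions, not something from the question:
import pikepdf

# Open the PDF and walk the first page's content stream, printing each
# instruction's operator (BT, Tm, BDC, EMC, ...) and its operands.
with pikepdf.open("sample.pdf") as pdf:  # hypothetical file name
    page = pdf.pages[0]
    for operands, operator in pikepdf.parse_content_stream(page):
        print(operator, [str(op) for op in operands])
Each printed operator is then a term you can search for in the specification, and names like /MC0 will show up among the operands.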


Downloading all full-text articles in PMC and PubMed databases

According to one of the questions answered by the NCBI Help Desk, we cannot "bulk-download" PubMed Central. However, can I use the NCBI E-utilities to download all full-text papers in the PMC database using Efetch, or at least find all corresponding PMCIDs using Esearch in the Entrez Programming Utilities? If yes, then how? If the E-utilities cannot be used, is there any other way to download all full-text articles?
First of all, before you go downloading files in bulk, I highly recommend you read the E-utilities usage guidelines.
If you want full-text articles, you're going to want to limit your search to open access files. Furthermore, I suggest also restricting your search to Medline articles if you want articles that are any good. Then you can do the search.
Using Biopython, this gives us:
from Bio import Entrez
Entrez.email = "your.email@example.com"  # E-utilities asks requesters to identify themselves

search_query = 'medline[sb] AND "open access"[filter]'
# getting search results for the query
search_results = Entrez.read(Entrez.esearch(db="pmc", term=search_query, retmax=10, usehistory="y"))
You can use the search function on the PMC website and it will display the generated query that you can copy/paste into your code.
Now that you've done the search, you can actually download the files:
handle = Entrez.efetch(db="pmc", rettype="full", retmode="xml", retstart=0,
                       retmax=int(search_results["Count"]),
                       webenv=search_results["WebEnv"],
                       query_key=search_results["QueryKey"])
You might want to download in batches instead, turning retstart and retmax into variables in a loop, in order to avoid flooding the servers; a sketch of such a loop follows.
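This is only a minimal sketch of that batching loop; the batch size, the output file names, and the one-second pause are arbitrary choices of mine, not anything prescribed by the E-utilities:
import time

batch_size = 100
count = int(search_results["Count"])
for start in range(0, count, batch_size):
    handle = Entrez.efetch(db="pmc", rettype="full", retmode="xml",
                           retstart=start, retmax=batch_size,
                           webenv=search_results["WebEnv"],
                           query_key=search_results["QueryKey"])
    # hypothetical file naming scheme: one XML file per batch
    with open("pmc_batch_%d.xml" % start, "w") as out:
        out.write(handle.read())
    handle.close()
    time.sleep(1)  # be gentle with NCBI's servers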
If handle contains only one file, handle.read() contains the whole XML file as a string. If it contains more, the articles are contained in <article></article> nodes.
The full text is only available as XML, and the default parser doesn't handle XML namespaces, so you're going to be on your own with ElementTree (or another parser) to parse your XML.
Here, the articles are found thanks to the internal history of the E-utilities, which is accessed with the webenv argument and enabled by the usehistory="y" argument in Entrez.esearch().
A few tips about XML parsing with ElementTree: you can't delete a grandchild node, so you're probably going to want to delete some nodes recursively. Also, node.text returns the text in node, but only up to the first child, so you'll need to do something along the lines of "".join(node.itertext()) if you want to get all the text in a given node.
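To make the itertext() tip concrete, here is a tiny self-contained illustration (the XML snippet is made up and is not PMC output):
import xml.etree.ElementTree as ET

node = ET.fromstring("<p>The <italic>full</italic> text</p>")
print(node.text)                  # 'The ' -- stops at the first child element
print("".join(node.itertext()))   # 'The full text'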
According to one of the questions answered by the NCBI Help Desk, we cannot "bulk-download" PubMed Central.
https://www.nlm.nih.gov/bsd/medline.html + https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ will download a good portion of it (I don't know the percentage). It will indeed miss the PMC full-text articles whose license doesn't allow redistribution, as explained at https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/.

Alfresco FTS - why does the first digit of a folder's name have to be escaped?

I have a question regarding the Alfresco FTS/Lucene search. It is known that some special characters have to be escaped in the search query, like the space (by _x0020_).
But it turns out that if a folder name's first character is a digit, it also has to be escaped. This can easily be tested in the Node Browser by creating a folder such as 123456 and navigating to its parent folder (in my case I have the following folder structure: */2017/123456/):
Primary Path: /app:company_home/st:sites/<some-folders>/cm:_x0032_017/cm:_x0031_23456
(here _x0032_ encodes the leading 2 and _x0031_ encodes the leading 1)
If I don't escape the first character of the folder name, I get a 500 error back.
Why is that? I tried to find something relevant in the Alfresco documentation, but didn't manage to.
Alfresco v.4.2.0
Lucene search uses the ISO 9075 encoding (SQL), like similar frameworks, so we need to encode the path elements. It would be nice if the API hid this requirement, like the browser URL does, but you can use ISO9075Encode to do the job.
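Purely as an illustration of the escaping seen in the path above, here is a minimal Python sketch that only handles the leading-digit case (it is not the full ISO 9075 rule set; inside Alfresco you would use the provided encoder rather than rolling your own):
def iso9075_encode_leading(name):
    """Escape a leading digit, which is not allowed to start an XML name,
    e.g. '2017' -> '_x0032_017' and '123456' -> '_x0031_23456'."""
    if name and name[0].isdigit():
        return "_x%04X_%s" % (ord(name[0]), name[1:])
    return name

print(iso9075_encode_leading("2017"))    # _x0032_017
print(iso9075_encode_leading("123456"))  # _x0031_23456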

Extract data with HTMLAgilityPack – simple example

I've searched the net and cannot find a simple HTMLAgilityPack example for extracting one piece of information from a webpage. Most of the examples are in C# and code converters don't work properly. The developer's forum wasn't helpful either.
Anyways, I am trying to extract “Consumer Defensive” string from this URL “http://quotes.morningstar.com/stock/c-company-profile?t=dltr” and this text “Dollar Tree Stores, Inc., operates discount variety stores in United States and Canada. Its stores offer merchandise at fixed price of $1.00 and C$1.25. The company operates stores under the names of Dollar Tree, Deal$, Dollar Tree Canada, etc. “ from same webpage.
I tried the code at this link: https://stackoverflow.com/questions/13147749/html-agility-pack-with-vb-net-parsing but GetPageHTML is not declared.
This one is in C#: HTML Agility Pack - parsing tables
and so on.
Thanks.
The HTML returned from that URL is translated to XML with two root nodes, so it cannot be transformed directly into an XML document.
For the values you wish to retrieve, it may be easier to simply retrieve the HTML document and search for the start and end tags of the strings you wish to extract, as in the sketch below.
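As a rough, language-agnostic illustration of that approach (shown here in Python rather than VB.NET; the marker strings are placeholders that would have to be read off the actual page source):
import urllib.request

def extract_between(text, start_marker, end_marker):
    """Return the substring between start_marker and the next end_marker, or None."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end].strip()

url = "http://quotes.morningstar.com/stock/c-company-profile?t=dltr"
html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

# Hypothetical markers -- inspect the page source for the tags that actually
# surround the sector name and the company profile text.
print(extract_between(html, '<span class="sector">', '</span>'))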

AppleScript that can scan a PDF document and copy all the annotated/highlighted text to the clipboard with a page reference

When doing research I usually find myself annotating a PDF document (highlighting, adding notes); then I will create a note in Evernote and index all my annotations.
For example,
p 3 - "is it possible for schools to change their practices and thereby have a strongly positive effect on student achievement?"
p 10 - "the district boldly moved forward with several new reforms"
My hope is to work with a PDF document, annotate it, then run the applet, which would copy all my annotations (highlights and notes) to the clipboard, where I could then paste them into a note, thereby having an index of all the points I found useful.
I am using a Mac, and am open to using whichever language would be simplest for creating this. My thought is that an AppleScript would be best.
Skim can export notes as text, and it also has an AppleScript dictionary.
tell application "Skim" to tell document 1 to save as "notes as text" in "/Users/username/Desktop/notes.txt"
The output looks like this:
* Highlight, page 1
ocument (highlighting
* Text Note, page 1
aa
* Highlight, page 1
ent, annotate it,
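If you then want the 'p N - "…"' layout from the question, a small post-processing script can reshape that file. This is only a sketch and assumes the exact format shown above (a '* <type>, page <n>' header line followed by the note text):
import re

with open("/Users/username/Desktop/notes.txt") as f:
    lines = [line.rstrip("\n") for line in f]

entries = []
page = None
for line in lines:
    m = re.match(r"\* (?:Highlight|Text Note), page (\d+)", line)
    if m:
        page = m.group(1)          # remember the page of the current note
    elif line.strip() and page is not None:
        entries.append('p %s - "%s"' % (page, line.strip()))
        page = None

print("\n".join(entries))          # paste-ready index of annotations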

Docx4J Open XML

I am reading some .docx files that contain hyperlinks with Docx4J.
I can see the URLs when clicking on those hyperlinks manually, but when I try to read those files with Docx4J I get only the text and nothing about the hyperlinks and URLs.
Document Text -
Infosys Chairman, KV Kamath said that IT services were facing challenges of scalability. Speaking at the 31st Annual General Meeting of the company in Bangalore, Kamath said the management has met all the challenges successfully and demonstrated leadership. Infosys, India's second largest IT services company announced a final dividend of Rs 22/share. The company also announced a special dividend of Rs 10/share on account of the 10th year of operations of the Infosys BPO. Speaking at the AGM, S D Shibulal, CEO of Infosys said that transformation is complete and the company is now focussed on growth. "Infosys 3.0 will help company address challenges," said Shibulal. Shibulal said: "We had a choice between commoditization and re-defining the industry. We chose to redefine the industry."..more
The hyperlink is on "more".
Docx4J is giving me the text 'more' only. It is not giving any information regarding that hyperlink.
Is there any way to get that URL?
Please Help...
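Not a Docx4J answer, but some background on where the URL actually lives may help: in OOXML the hyperlink element in word/document.xml only carries an r:id attribute, and the target URL is stored separately in word/_rels/document.xml.rels, so extracting only the run text will never show it. A quick way to see this with plain Python on the zip container (purely as an illustration of the file format, not of the Docx4J API; the file name is hypothetical):
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

with zipfile.ZipFile("sample.docx") as docx:
    rels = ET.fromstring(docx.read("word/_rels/document.xml.rels"))

# Print every hyperlink relationship: its id and its target URL.
for rel in rels.findall("{%s}Relationship" % REL_NS):
    if rel.get("Type", "").endswith("/hyperlink"):
        print(rel.get("Id"), "->", rel.get("Target"))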