Downloading all full-text articles in PMC and PubMed databases

According to one of the questions answered by the NCBI Help Desk, we cannot "bulk-download" PubMed Central. However, can I use "NCBI E-utilities" to download all full-text papers in the PMC database using Efetch, or at least find all corresponding PMCids using Esearch in Entrez Programming Utilities? If yes, then how? If E-utilities cannot be used, is there any other way to download all full-text articles?

First of all, before you go downloading files in bulk, I highly recommend you read the E-utilities usage guidelines.
If you want full-text articles, you're going to want to limit your search to open access files. Furthermore, I suggest also restricting your search to Medline articles if you want articles that are any good. Then you can do the search.
Using Biopython, this gives us:
from Bio import Entrez
Entrez.email = "you@example.com"  # placeholder; NCBI asks for a contact address
search_query = 'medline[sb] AND "open access"[filter]'
# getting search results for the query
search_results = Entrez.read(Entrez.esearch(db="pmc", term=search_query, retmax=10, usehistory="y"))
You can use the search function on the PMC website and it will display the generated query that you can copy/paste into your code.
Now that you've done the search, you can actually download the files:
handle = Entrez.efetch(db="pmc", rettype="full", retmode="xml", retstart=0, retmax=int(search_results["Count"]), webenv=search_results["WebEnv"], query_key=search_results["QueryKey"])
You might want to download in batches, replacing retstart and retmax with variables updated in a loop, in order to avoid flooding the servers.
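The batching advice above can be sketched roughly as follows. This is a minimal sketch, assuming Biopython is installed and that search_results comes from an esearch call made with usehistory="y". The helper names batch_ranges and fetch_in_batches are hypothetical, and the batch size of 100 and the one-second pause are arbitrary assumptions, not NCBI recommendations.

```python
import time


def batch_ranges(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def fetch_in_batches(search_results, batch_size=100, pause=1.0):
    """Yield one XML payload per batch, reusing the esearch history."""
    # Imported here so batch_ranges stays usable without Biopython.
    from Bio import Entrez

    total = int(search_results["Count"])
    for retstart, retmax in batch_ranges(total, batch_size):
        handle = Entrez.efetch(db="pmc", rettype="full", retmode="xml",
                               retstart=retstart, retmax=retmax,
                               webenv=search_results["WebEnv"],
                               query_key=search_results["QueryKey"])
        try:
            yield handle.read()
        finally:
            handle.close()
        time.sleep(pause)  # assumed pause between requests
```

Iterating over fetch_in_batches(search_results) then gives one chunk of XML per request, which can be written to disk or parsed before the next request is sent.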
If handle contains only one file, handle.read() contains the whole XML file as a string. If it contains more, the articles are contained in <article></article> nodes.
The full text is only available in XML, and the default parser available in Biopython doesn't handle XML namespaces, so you're going to be on your own with ElementTree (or another parser) to parse your XML.
Here, the articles are found thanks to the internal history of E-utilities, which is accessed with the webenv and query_key arguments and enabled by the usehistory="y" argument passed to Entrez.esearch().
A few tips about XML parsing with ElementTree: Element.remove() only removes direct children, so you can't delete a grandchild node from its grandparent, and you're probably going to want to delete some nodes recursively. node.text returns the text in node, but only up to the first child, so you'll need to do something along the lines of "".join(node.itertext()) if you want to get all the text in a given node.
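The two ElementTree tips above can be illustrated with a small sketch. The helper names full_text and remove_descendants are hypothetical, and the sample document in the usage note is made up for illustration rather than taken from a real PMC article. Note that Element.remove() also discards the removed element's tail text.

```python
import xml.etree.ElementTree as ET


def full_text(node):
    """Return all text inside `node`, including text nested in children."""
    return "".join(node.itertext())


def remove_descendants(root, tag):
    """Remove every descendant of `root` whose tag equals `tag`.

    Element.remove() only works on direct children, so each potential
    parent is visited and asked to drop its own matching children.
    """
    # Snapshot the parents first so the tree is not mutated while iterating.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
```

For example, given root = ET.fromstring("<article><body><p>Hello <b>bold</b> world</p><table-wrap><p>skip</p></table-wrap></body></article>"), root.find("body/p").text is only "Hello ", while full_text(root.find("body/p")) is "Hello bold world". Calling remove_descendants(root, "table-wrap") drops the nested table wrapper even though it is not a direct child of root.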

According to one of the answered questions by NCBI Help Desk, we cannot "bulk-download" PubMed Central.
Using https://www.nlm.nih.gov/bsd/medline.html together with https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ will download a good portion of it (I don't know the percentage). It will indeed miss the PMC full-text articles whose license doesn't allow redistribution, as explained on https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/.

Creating Configuration File for DDS Recording Service

I'm a beginner looking for some clarity on how to create configuration files for the DDS Recording Service in two areas.
If you want to record a set of specific topics from a domain, how do you set up the topic group? Can you list the topics as individual <topic_expr> elements, i.e.
<topic_group name="SomeTopics">
<topics>
<topic_expr>topic2</topic_expr>
<topic_expr>topic8</topic_expr>
</topics>
<field_expr>*</field_expr>
</topic_group>
When I tried something like this, not all the listed topics were recorded. Is there something I am overlooking?
Secondly, when you use -deserialize, do you need to make any changes to the configuration file you used to record the database? I sometimes get errors saying "rti dds failed to find" followed by something like X::Y::Z. Thanks.
The XSD schema for the configuration file does not expect you to use multiple <topic_expr> tags, but a single tag with a comma-separated list of Topic names. The RTI Recording Service User's Manual explains it as follows:
<topic_expr>POSIX fn expression</topic_expr>
Required.
A comma-separated list of POSIX expressions that specify the names of Topics to be included in the TopicGroup.
The syntax and semantics are the same as for Partition matching.
Default: Null
Note: Keep in mind that spaces are valid first characters in topic names, thus they can affect the matching process. For example, this will match both Triangle and Square topics (notice there is no space before Square):
<topic_expr>Triangle,Square</topic_expr>
However the following will only match Triangle topics (because there is a space before Square):
<topic_expr>Triangle, Square</topic_expr>
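One way to avoid stray spaces is to generate the element from a list of topic names. The following is only a sketch: build_topic_group is a hypothetical helper, and the element layout simply mirrors the snippet in the question rather than being checked against the XSD schema.

```python
import xml.etree.ElementTree as ET


def build_topic_group(name, topics, field_expr="*"):
    """Build a <topic_group> with a single comma-separated <topic_expr>."""
    group = ET.Element("topic_group", {"name": name})
    topics_el = ET.SubElement(group, "topics")
    # Join with a bare comma: a space after the comma would become part of
    # the next topic name and change what is matched.
    ET.SubElement(topics_el, "topic_expr").text = ",".join(topics)
    ET.SubElement(group, "field_expr").text = field_expr
    return ET.tostring(group, encoding="unicode")
```

Calling build_topic_group("SomeTopics", ["topic2", "topic8"]) produces a group whose only <topic_expr> contains topic2,topic8 with no space after the comma.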
With regard to the -deserialize option, this is not applicable to the Recording Service but to the Converter tool (rtirecconv). If you want to record data in deserialized form, you will have to indicate that in the Recording Service configuration, via the <deserialize_mode> tag. Again, see the User's Manual for details.

Paginating a long Markdown document?

I am writing a handbook for hotplate. It's going to be a lot bigger than I expected. So, I wanted to "break up" the document into several sub-documents.
I am thinking about slicing the document according to its # titles. So:
# Main title
Under main title
## Installation
Under installation
## Initial use
Under initial use
Would generate three files:
maintitle.html -- with a bullet list linking to installation.html and initialuse.html ("next")
installation.html -- with a link to maintitle.html ("prev") and one to initialuse.html ("next")
initialuse.html -- with a link to installation.html ("prev")
It basically breaks up a Markdown file into sections.
Does something like this already exist?
"no" (3 years later) I guess this will help people in the future with the same question!

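For anyone who wants to roll their own, the splitting scheme described in the question can be sketched in a few lines. This is a rough sketch under stated assumptions: the helper names slugify, split_sections and plan_pages are hypothetical, headings are assumed to use the # syntax, every page is simply chained to its neighbours with prev/next, and # characters inside fenced code blocks are not handled. It only plans the pages; converting each body to HTML is left to whatever Markdown renderer you already use.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def slugify(title):
    """Turn 'Initial use' into 'initialuse' (lowercase, alphanumerics only)."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def split_sections(markdown):
    """Split the text into sections, starting a new one at every heading."""
    sections = []
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            sections.append({"title": match.group(2).strip(), "body": []})
        elif sections:
            sections[-1]["body"].append(line)
    return sections


def plan_pages(markdown):
    """Return one dict per page with its file name, prev/next links and body."""
    sections = split_sections(markdown)
    names = [slugify(section["title"]) + ".html" for section in sections]
    pages = []
    for i, section in enumerate(sections):
        pages.append({
            "file": names[i],
            "prev": names[i - 1] if i > 0 else None,
            "next": names[i + 1] if i + 1 < len(names) else None,
            "body": "\n".join(section["body"]).strip(),
        })
    return pages
```

Run on the example from the question, plan_pages yields maintitle.html, installation.html and initialuse.html, each with the Markdown body that followed its heading.
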
How do I access the "See Also" Field in the Wiktionary API?

Many of the Wiktionary pages for Chinese Characters (Hanzi) include links at the top of the page to other similar-looking characters. I'd like to use the Wiktionary API to send a single character in the query and receive a list of similar characters as the response. Unfortunately, I can't seem to find any query that includes the "See Also" field. Is this kind of query possible?
The “see also” field is just a line of wiki code in the page source, and there is no way for the API to know that it's different from any other piece of text on the page.
If you are happy with using only the English version of Wiktionary, you can fetch the wikicode: index.php?title=太&action=raw, and then parse the result for the template also. In this case, the line you are looking for is {{also|大|犬}}.
To check if the template is used on the page at all, query the API for titles=太&prop=templates&tltemplates=Template:also
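Fetching the raw wikicode and pulling out the template arguments could look roughly like this. It is a sketch under assumptions: parse_also and fetch_wikitext are hypothetical helper names, the base URL https://en.wiktionary.org/w/index.php is assumed for the English edition, the User-Agent string is a placeholder, and arguments containing "=" are skipped on the assumption that they are named parameters rather than page titles.

```python
import re
import urllib.parse
import urllib.request

# Matches {{also|...}} and captures everything between the first pipe and }}.
ALSO_RE = re.compile(r"\{\{also\|([^{}]*)\}\}")


def parse_also(wikitext):
    """Return the page titles listed in the first {{also|...}} template."""
    match = ALSO_RE.search(wikitext)
    if match is None:
        return []
    return [part.strip() for part in match.group(1).split("|")
            if part.strip() and "=" not in part]


def fetch_wikitext(title):
    """Download the raw wikicode of a page via index.php?action=raw."""
    query = urllib.parse.urlencode({"title": title, "action": "raw"})
    url = "https://en.wiktionary.org/w/index.php?" + query
    request = urllib.request.Request(
        url, headers={"User-Agent": "also-template-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With these helpers, parse_also(fetch_wikitext("太")) would return the characters named in the template, and an empty list for pages that don't use it.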
Similar templates are available in other language editions of Wiktionary, in case you want to use sources other than the English one. The current list is:
br:Patrom:gwelet
ca:Plantilla:vegeu
cs:Šablona:Viz
de:Vorlage:Siehe auch
el:Πρότυπο:δείτε
es:Plantilla:desambiguación
eu:Txantiloi:Esanahi desberdina
fi:Malline:katso
fr:Modèle:voir
gl:Modelo:homo
id:Templat:lihat
is:Snið:sjá einnig
it:Template:Vedi
ja:テンプレート:see
no:Mal:se også
oc:Modèl:veire
pl:Szablon:podobne
pt:Predefinição:ver também
ru:Шаблон:Cf
sk:Šablóna:See
sv:Mall:se även
It has been suggested that the Wikidata project be expanded to cover Wiktionary. If and when that happens, you might be able to query the Wikidata API for that kind of stuff!

SharePoint crawl rule to exclude AllItems.aspx, but still get an item/document in search results if queried in the search box

I followed this blog (Tips 1) and created a crawl rule http://.*forms/allitems.aspx and ran a full crawl. I no longer get results with AllItems.aspx. However, if there is a document named Something.doc in a Document Library, it is no longer returned in the search results either.
I think what I desire is a basic functionality, like the user should not get to see Allitems.aspx in the search results but should get the item/document with names entered in the search box.
Please let me know if I am missing anything. I have already put in 24 hours...googled the max I could.
It seems that an Index Reset is required. Here are the steps I followed:
1. Add the following crawl rule to exclude: *://*allitems.aspx.
2. Index Reset.
3. Full Crawl.
I could not find a good way to do this using crawl rules. Instead, I opted to set up a restriction on the search results web part.
In the search results web part properties, select "Change Query"
Add a property filter to exclude anything with "AllItems" (and any other exclusions you want in place).
Used Steve Mann's blog as a reference and for the images: http://stevemannspath.blogspot.com/2013/04/sharepoint-2013-search-removing-junk.html

citeseerx search api

Is there a way to access CiteSeerX programmatically (e.g. search by author and/or title?) Surprisingly I cannot find anything relevant; surely others too are trying to get scholarly article metadata without resorting to scraping?
EDIT: note that CiteSeerX supports OAI PMH, but that seems to be an API geared towards digital libraries keeping up to date with each other ("content dissemination") and does not specifically support search. Moreover the citeseer info on that page is very sparse and even says "Currently, there are difficulties with the OAI".
There is another SO question about CiteSeerX API (though not specifically search); the 2 answers do not resolve the problem (one talks about Mendeley, another piece of software, and the other says OAI-PMH implementations are free to offer extensions to the minimal spec).
Alternatively, can anyone suggest a good way to obtain citations from authors/titles programmatically?
As suggested by one of the commenters, I tried jabref first:
jabref -n -f "citeseer:title:(lessons from) author:(Brewer)"
However jabref seems to not realize that the query string needs to contain colons and so throws an error.
For search results, I ended up scraping the CiteSeerX results with Python's BeautifulSoup:
# Python 3 port of the original urllib2 snippet; needs the beautifulsoup4 package.
# author_last and title are assumed to be defined already.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://citeseerx.ist.psu.edu/search?q="
q = "title%3A%28{1}%29+author%3A%28{0}%29&submit=Search&sort=cite&t=doc"
url += q.format(author_last, title.replace(" ", "+"))
soup = BeautifulSoup(urlopen(url).read(), "html.parser")
# first hit in the result list
result = soup.html.body("div", id="result_list")[0].div
title = result.h3.a.string.strip()
authors = result("span", "authors")[0].string
authors = authors[len("by "):].strip()
date = result("span", "pubyear")[0].string.strip(", ")
It is possible to get a document ID from the results (the misleadingly-named "doi=..." part in the summary link URL) and then pass that to the CiteSeerX OAI engine to get Dublin Core XML (e.g. http://citeseerx.ist.psu.edu/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:CiteSeerX.psu:10.1.1.42.2177); however that XML ends up containing multiple dc:date elements, which makes it less useful than the scrape output.
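If you do go the OAI route, the Dublin Core record can be read with ElementTree and the multiple dc:date elements collected into a list. This is a sketch only: parse_dublin_core is a hypothetical helper, the exact shape of the CiteSeerX response is an assumption, and picking the lexicographically smallest date is just one possible heuristic that is only meaningful when the dates are ISO-style strings.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core elements namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"


def parse_dublin_core(xml_text):
    """Extract title, creators and all dates from an oai_dc record."""
    root = ET.fromstring(xml_text)

    def values(tag):
        # Collect the text of every dc:<tag> element anywhere in the record.
        return [el.text.strip()
                for el in root.iter("{%s}%s" % (DC_NS, tag))
                if el.text and el.text.strip()]

    titles = values("title")
    dates = values("date")
    return {
        "title": titles[0] if titles else None,
        "creators": values("creator"),
        "dates": dates,
        # Heuristic: lexicographic minimum, sensible for ISO-style dates only.
        "earliest_date": min(dates) if dates else None,
    }
```

The returned dictionary keeps every date so the caller can decide which one to trust instead of relying on the heuristic.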
Too bad CiteSeerX makes people resort to scraping in spite of all the open archives / open access rhetoric.