Is there a way to access CiteSeerX programmatically (e.g. search by author and/or title?) Surprisingly I cannot find anything relevant; surely others too are trying to get scholarly article metadata without resorting to scraping?
EDIT: note that CiteSeerX supports OAI PMH, but that seems to be an API geared towards digital libraries keeping up to date with each other ("content dissemination") and does not specifically support search. Moreover the citeseer info on that page is very sparse and even says "Currently, there are difficulties with the OAI".
There is another SO question about CiteSeerX API (though not specifically search); the 2 answers do not resolve the problem (one talks about Mendeley, another piece of software, and the other says OAI-PMH implementations are free to offer extensions to the minimal spec).
Alternatively, can anyone suggest a good way to obtain citations from authors/titles programmatically?
As suggested by one of the commenters, I tried jabref first:
jabref -n -f "citeseer:title:(lessons from) author:(Brewer)"
However jabref seems to not realize that the query string needs to contain colons and so throws an error.
For search results, I ended up scraping the CiteSeerX results with Python's BeautifulSoup:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# author_last and title are the search inputs, e.g. "Brewer" and "lessons from"
url = "http://citeseerx.ist.psu.edu/search?q="
q = "title%3A%28{1}%29+author%3A%28{0}%29&submit=Search&sort=cite&t=doc"
url += q.format(author_last, title.replace(" ", "+"))
soup = BeautifulSoup(urlopen(url).read(), "html.parser")
# take the first hit in the result list
result = soup.html.body("div", id="result_list")[0].div
title = result.h3.a.string.strip()
authors = result("span", "authors")[0].string
authors = authors[len("by "):].strip()  # drop the leading "by "
date = result("span", "pubyear")[0].string.strip(", ")
It is possible to get a document ID from the results (the misleadingly-named "doi=..." part in the summary link URL) and then pass that to the CiteSeerX OAI engine to get Dublin Core XML (e.g. http://citeseerx.ist.psu.edu/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:CiteSeerX.psu:10.1.1.42.2177); however that XML ends up containing multiple dc:date elements, which makes it less useful than the scrape output.
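One way to cope with the repeated dc:date elements is to collect every Dublin Core field into a list and choose a date afterwards. This is only a minimal sketch: it assumes the record uses the standard Dublin Core element namespace, the helper names are made up for illustration, and picking the earliest year is a guess at what "the" publication date should be, not something CiteSeerX documents.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Standard Dublin Core element namespace used by oai_dc records.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def parse_oai_dc(xml_text):
    """Collect every non-empty Dublin Core field of a record into a dict of lists."""
    fields = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag.startswith(DC_NS) and elem.text and elem.text.strip():
            fields[elem.tag[len(DC_NS):]].append(elem.text.strip())
    return dict(fields)


def earliest_year(fields):
    """Assumption: the smallest 4-digit year among the dc:date values is the one wanted."""
    years = [int(d[:4]) for d in fields.get("date", []) if d[:4].isdigit()]
    return min(years) if years else None
```

Feeding the body of a GetRecord response to parse_oai_dc() and then calling earliest_year() on the result gives a single year, or None when no date could be read.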
Too bad CiteSeerX makes people resort to scraping in spite of all the open archives / open access rhetoric.
Related
This is how to retrieve a particular pull-request's comments according to bitbucket's documentation:
While I do have the pull-request ID and format a correct URL I still get a 400 response error. I am able to make a POST request to comment but I cannot make a GET. After further reading I noticed the six parameters listed for this endpoint do not say 'optional'. It looks like these need to be supplied in order to retrieve all the comments.
But what exactly are these parameters? I don't find their descriptions to be helpful in the slightest. Any and all help would be greatly appreciated!
fromHash and toHash are only required if diffType isn't set to EFFECTIVE. state also seems optional to me (I didn't get an error when leaving it out), and anchorState specifies which kind of comments to fetch - you'd probably want ALL there. As far as I understand it, path contains the path of the file to read comments from. (ex: src/a.py and src/b.py were changed -> specify which of them to fetch comments for)
However, that's probably not what you want. I'm assuming you want to fetch all comments.
You can do that via /rest/api/1.0/projects/{projectKey}/repos/{repositorySlug}/pull-requests/{pullRequestId}/activities which also includes other activities like reviews, so you'll have to do some filtering.
I won't paste example data from the documentation or from the bitbucket instance I tested this on, since the json response is quite long. As I've said, there is an example response on the linked page. I also think you'll figure out how to get to the data you want once it is downloaded, since this is a Q&A forum and not a "program this for me" page :b
As a small quickstart: you can use curl like this
curl -u <your_username>:<your_password> https://<bitbucket-url>/rest/api/1.0/projects/<project-key>/repos/<repo-name>/pull-requests/<pr-id>/activities
which will print the response json.
Python version of that curl snippet using the requests module:
import requests

url = "<your-url>"  # see above on how to assemble your url
r = requests.get(
    url,
    params={},  # you'll need this later
    auth=requests.auth.HTTPBasicAuth("your-username", "your-password")
)
Note that the result is paginated according to the api documentation, so you'll have to do some extra work to build a full list: either set an obnoxiously high limit (dirty) or keep making requests until you've fetched everything. I strongly recommend the latter.
You can control which data you get using the start and limit parameters which you can either append to the url directly (e.g. https://bla/asdasdasd/activity?start=25) or - more cleanly - add to the params dict like so:
requests.get(
    url,
    params={
        "start": 25,
        "limit": 123
    }
)
Putting it all together:
def get_all_pr_activity(url):
    start = 0
    values = []
    while True:
        r = requests.get(url, params={
            "limit": 10,  # adjust this limit to your liking - 10 is probably too low
            "start": start
        }, auth=requests.auth.HTTPBasicAuth("your-username", "your-password"))
        page = r.json()
        values.extend(page["values"])
        if page["isLastPage"]:
            return values
        start = page["nextPageStart"]

print([x["id"] for x in get_all_pr_activity("my-bitbucket-url")])
will print a list of activity ids, e.g. [77190, 77188, 77123, 77136] and so on. Of course, you should probably not hardcode your username and password there - it's just meant as an example, not production-ready code.
Finally, to filter by action inside the function, you can replace the return values with something like
return [activity for activity in values if activity["action"] == "COMMENTED"]
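Once the activities are downloaded, pulling out the actual comment text is a small extra step. The sketch below assumes that each COMMENTED activity carries a comment object with text and author.name fields, as in the example response in the documentation; the function name is made up, and you should check the field names against the json from your own instance.

```python
def extract_comments(activities):
    """Return (author name, comment text) pairs for every COMMENTED activity."""
    comments = []
    for activity in activities:
        if activity.get("action") != "COMMENTED":
            continue  # skip reviews, merges and other activity types
        comment = activity.get("comment", {})
        # Assumed layout: {"comment": {"text": ..., "author": {"name": ...}}}
        author = comment.get("author", {}).get("name")
        comments.append((author, comment.get("text")))
    return comments
```

It works on the list returned by get_all_pr_activity(), so no additional requests are needed.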
I want to get minimal information of a Wikipedia page using MediaWiki API like DuckDuckGo. For example for Steve Carell: https://duckduckgo.com/?q=steve+carell&t=hp&ia=news&iax=about
How can I get this information with a Wikipedia url (eg https://en.wikipedia.org/wiki/Steve_Carell) in HTML format?
You can use the MediaWiki API for that. There's an extension, TextExtracts, which is exactly for that (and it is installed on Wikipedia).
In your case, e.g.:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exsentences=1&titles=Steve%20Carell
will return something like:
<p class=\"mw-empty-elt\">\n</p>\n\n<p class=\"mw-empty-elt\">\n \n</p>\n<p><b>Steven John Carell</b> (<span></span>; born August 16, 1962) is an American actor, comedian, producer, writer and director.</p>
You can also customize how many sentences (or characters) the API returns; please consult the API documentation for that.
There's also the way to retrieve the short description, which is saved at Wikidata (and visible in the mobile view of Wikipedia). This call would be:
https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&titles=Steve_Carell
This returns the following property in the pageprops of the page:
"wikibase-shortdesc": "American actor"
This may fit better depending on your use case.
You can even get both of the results with a single, combined, request:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts|pageprops&exsentences=1&titles=Steve_Carell
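For completeness, here is a rough Python sketch of the combined request. It assumes format=json is added to the query (the URLs above omit it) and that the response uses the default layout with pages keyed by page ID; the function names and the User-Agent string are placeholders, not anything prescribed by the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_summary_url(title, sentences=1):
    """Build the combined extracts + pageprops query for one page title."""
    params = {
        "action": "query",
        "prop": "extracts|pageprops",
        "exsentences": sentences,
        "titles": title,
        "format": "json",  # assumption: JSON output is wanted
    }
    return API_URL + "?" + urlencode(params)


def parse_summary(response):
    """Return (extract HTML, short description) for the first page in the response."""
    page = next(iter(response["query"]["pages"].values()))
    shortdesc = page.get("pageprops", {}).get("wikibase-shortdesc")
    return page.get("extract"), shortdesc


def fetch_summary(title, user_agent="example-script/0.1 (you@example.org)"):
    """Perform the request; the User-Agent value is a placeholder to replace."""
    request = Request(build_summary_url(title), headers={"User-Agent": user_agent})
    with urlopen(request) as resp:
        return parse_summary(json.load(resp))
```

Calling fetch_summary("Steve Carell") should then give the first sentence as HTML together with the short description, provided the page has one.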
According to one of the answered questions by NCBI Help Desk , we cannot "bulk-download" PubMed Central. However, can I use "NCBI E-utilities" to download all full-text papers in PMC database using Efetch or at least find all corresponding PMCids using Esearch in Entrez Programming Utilities? If yes, then how? If E-utilities cannot be used, is there any other way to download all full-text articles?
First of all, before you go downloading files in bulk, I highly recommend you read the E-utilities usage guidelines.
If you want full-text articles, you're going to want to limit your search to open access files. Furthermore, I suggest also restricting your search to Medline articles if you want articles that are any good. Then you can do the search.
Using Biopython, this gives us :
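from Bio import Entrez

Entrez.email = "you@example.com"  # NCBI asks for a contact address; replace with your own
search_query = 'medline[sb] AND "open access"[filter]'
# getting search results for the query
search_results = Entrez.read(Entrez.esearch(db="pmc", term=search_query, retmax=10, usehistory="y"))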
You can use the search function on the PMC website and it will display the generated query that you can copy/paste into your code.
Now that you've done the search, you can actually download the files :
handle = Entrez.efetch(db="pmc", rettype="full", retmode="xml", retstart=0, retmax=int(search_results["Count"]), webenv=search_results["WebEnv"], query_key=search_results["QueryKey"])
You might want to download in batches, replacing retstart and retmax with variables updated in a loop, in order to avoid flooding the servers.
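The batching idea could look roughly like this. The helper is hypothetical and deliberately takes the download step as a callable, so the windowing logic does not depend on Biopython; the batch size and pause are arbitrary values, so check the usage guidelines for what is appropriate.

```python
import time


def fetch_in_batches(fetch, total, batch_size=100, pause=0.5):
    """Call fetch(retstart, retmax) for consecutive windows covering `total` records."""
    results = []
    for retstart in range(0, total, batch_size):
        # The last window may be smaller than batch_size.
        retmax = min(batch_size, total - retstart)
        results.append(fetch(retstart, retmax))
        time.sleep(pause)  # wait between requests to avoid flooding the servers
    return results


# Possible use with the search above (not executed here):
# fetch_in_batches(
#     lambda start, size: Entrez.efetch(
#         db="pmc", rettype="full", retmode="xml",
#         retstart=start, retmax=size,
#         webenv=search_results["WebEnv"],
#         query_key=search_results["QueryKey"]).read(),
#     total=int(search_results["Count"]))
```

Keeping every batch in memory is only sensible for small result sets; for a large download you would rather write each batch to disk inside the callable.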
If handle contains only one file, handle.read() contains the whole XML file as a string. If it contains more, the articles are contained in <article></article> nodes.
The full text is only available in XML, and the default parser available in pubmed doesn't handle XML namespaces, so you're going to be on your own with ElementTree (or another parser) to parse your XML.
Here, the articles are found thanks to the internal history of E-utilities, which is enabled by the usehistory="y" argument in Entrez.esearch() and accessed through the webenv and query_key arguments of Entrez.efetch().
A few tips about XML parsing with ElementTree : You can't delete a grandchild node, so you're probably going to want to delete some nodes recursively. node.text returns the text in node, but only up to the first child, so you'll need to do something along the lines of "".join(node.itertext()) if you want to get all the text in a given node.
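Both tips can be sketched in a few lines. The helper names are made up, and the example in the usage note uses a tiny invented document rather than real PMC output.

```python
import xml.etree.ElementTree as ET


def full_text(node):
    """Return all text inside node, including the text of nested children."""
    return "".join(node.itertext())


def remove_all(root, tag):
    """Remove every descendant element named `tag`, at any depth."""
    # Element.remove() only works on direct children, so visit every
    # potential parent and remove the matching children from it.
    # Note: removing an element also drops its tail text.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
```

For instance, after root = ET.fromstring(xml_string), calling remove_all(root, "table") followed by full_text(root) gives the article text without any table contents.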
According to one of the answered questions by NCBI Help Desk , we cannot "bulk-download" PubMed Central.
Combining https://www.nlm.nih.gov/bsd/medline.html and https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ will let you download a good portion of it (I don't know the percentage). It will miss the PMC full-text articles whose license doesn't allow redistribution, as explained on https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/.
I am entirely new to API, so sorry if the question is silly.
I would like to get all images in a category in Commons let's say X, but exclude those which are also in another one (Y). I do not understand if I can actually do this.
https://commons.wikimedia.org/w/api.php?action=query&list=categorymembers&cmtype=file&cmtitle=Category:X
will get all of them, how to exclude some?
moreover I would like in the result to have the description of the images, not just the name of the file, is that possible?
By default, MediaWiki has no built-in support for building and querying category intersections. To accomplish this task, extensions, external tools, or multiple API queries with result processing are required.
CirrusSearch API
On Wikimedia Commons, like on the whole Wikimedia Wiki farm, CirrusSearch powers filtered search, including search for category intersections and is also available through API (action=query&list=search&srsearch=incategory:A+-incategory:B, this is Category:A minus Category:B).
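As a rough illustration, the search request could be assembled like this in Python. The function name is made up; restricting the search to namespace 6 (files), quoting the category names and asking for JSON are my own additions on top of the query shown above.

```python
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def category_difference_url(include, exclude, limit=50):
    """Build a search URL for files in category `include` but not in `exclude`."""
    # Quoting lets category names contain spaces; names go without "Category:".
    search = 'incategory:"{}" -incategory:"{}"'.format(include, exclude)
    params = {
        "action": "query",
        "list": "search",
        "srsearch": search,
        "srnamespace": 6,  # assumption: only the File namespace is wanted
        "srlimit": limit,
        "format": "json",
    }
    return COMMONS_API + "?" + urlencode(params)
```

The matching pages are then listed under query.search in the JSON response.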
FastCCI
One tool I can recommend (because it's a dedicated high-performance solution and actually running) is fastcci, developed by Daniel Schwen. For Wikimedia Commons there is already a maintained database and a running web service, but it can be set up for any wiki, provided the tool has a host to run on and database access.
Query
Consider the following query URL:
https://fastcci.wmflabs.org/?c1=3302993&c2=15516712&d1=0&d2=0&s=200&a=not&t=js
https://fastcci.wmflabs.org/ - Host Wikimedia Commons fastcci runs on
c1 - ID of category 1
c2 - ID of category 2
d1 - depth of category 1 to search in (fastcci by default considers sub-categories)
d2 - depth of category 2 to search in (fastcci by default considers sub-categories)
s - Number of results to return
o - Offset
a - conjunction
t - connection type (t=js for a JSONP response; otherwise assumes being used as websocket)
Response
fastcciCallback( [ 'RESULT 27572680,0,0|1675043,0,0|27577015,0,0|27577043,0,0|27577106,0,0|27576896,0,0|27576790,0,0|23481936,0,0|17560964,0,0|11009066,0,0', 'OUTOF 10', 'DBAGE 378310', 'DONE'] );
RESULT followed by a | separated list of up to 50 integer triplets of the form pageId,depth,tag. Each triplet stands for one image or category
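Parsing such a RESULT line is straightforward. A small sketch, assuming the line has exactly the format described above (the function name is my own):

```python
def parse_fastcci_result(line):
    """Parse one 'RESULT ...' line into a list of (page_id, depth, tag) tuples."""
    prefix = "RESULT "
    if not line.startswith(prefix):
        raise ValueError("not a RESULT line: %r" % line)
    triplets = []
    for item in line[len(prefix):].split("|"):
        # Each item looks like "27572680,0,0".
        page_id, depth, tag = (int(part) for part in item.split(","))
        triplets.append((page_id, depth, tag))
    return triplets
```

The page IDs in the first position of each tuple can then be turned into titles through the regular API.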
Resources
Sample client side implementation - to see it in action, visit any category page and look next to the Good pictures button.
Example is FilesOf('Category:Saaleck') - FilesOf('Category:Rapeseed fields in Saxony-Anhalt')
Server application
Presentation on YouTube
Slides
A note on pageIDs
page IDs → page titles: GET /w/api.php?action=query&pageids=page_IDs_separated_by_pipe
page titles → page IDs: GET /w/api.php?action=query&titles=Titles_separated_by_pipe
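Since a single request only accepts a limited number of IDs (50 for ordinary users, as far as I know), a longer fastcci result has to be split into several pageids values. A minimal sketch with a made-up helper name:

```python
def pageid_batches(page_ids, batch_size=50):
    """Split page IDs into pipe-separated strings, one per API request."""
    ids = [str(pid) for pid in page_ids]
    return ["|".join(ids[i:i + batch_size]) for i in range(0, len(ids), batch_size)]
```

Each returned string can be used as the pageids parameter of one action=query request.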
AFAIK, there is no way to get that directly using the API. But, assuming both categories are reasonably small, you could get all images from both of them and then compute the complement in your code.
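Computing the difference on the client is a one-liner once both responses are parsed. The sketch assumes each response is the decoded JSON of a list=categorymembers query that fits in a single request; for larger categories you would have to follow the continuation parameters first. The function names are made up.

```python
def member_titles(response):
    """Return the set of page titles in a list=categorymembers response."""
    return {member["title"] for member in response["query"]["categorymembers"]}


def only_in_first(response_x, response_y):
    """Titles that are in category X but not in category Y, sorted."""
    return sorted(member_titles(response_x) - member_titles(response_y))
```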
To retrieve the description, you can use prop=imageinfo&iiprop=extmetadata&iiextmetadatafilter=ImageDescription.
In the context of your example query, it would look like this:
https://commons.wikimedia.org/w/api.php?action=query&generator=categorymembers&gcmtype=file&gcmtitle=Category:X&prop=imageinfo&iiprop=extmetadata&iiextmetadatafilter=ImageDescription
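Reading the descriptions out of the response could look like this. It assumes format=json with the default layout, where pages are keyed by page ID and the description sits under imageinfo[0].extmetadata.ImageDescription.value; the function name is made up, and the value may contain HTML.

```python
def image_descriptions(response):
    """Map each file title to its ImageDescription value (None if missing)."""
    descriptions = {}
    for page in response.get("query", {}).get("pages", {}).values():
        info = page.get("imageinfo", [{}])[0]
        description = info.get("extmetadata", {}).get("ImageDescription", {})
        descriptions[page["title"]] = description.get("value")
    return descriptions
```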
Many of the Wiktionary pages for Chinese Characters (Hanzi) include links at the top of the page to other similar-looking characters. I'd like to use the Wiktionary API to send a single character in the query and receive a list of similar characters as the response. Unfortunately, I can't seem to find any query that includes the "See Also" field. Is this kind of query possible?
The “see also” field is just a line of wiki code in the page source, and there is no way for the API to know that it's different from any other piece of text on the page.
If you are happy with using only the English version of Wiktionary, you can fetch the wikicode: index.php?title=太&action=raw, and then parse the result for the template also. In this case, the line you are looking for is {{also|大|犬}}.
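Extracting the characters from the raw wikicode can be done with a regular expression. This is only a sketch: it looks at the first {{also|...}} call, and it skips anything that looks like a named parameter, which is an assumption about how the template may be invoked. The function name is my own.

```python
import re

# Matches {{also|...}} and captures everything between the first pipe and the closing braces.
ALSO_TEMPLATE = re.compile(r"\{\{also\|([^{}]*)\}\}")


def similar_characters(wikitext):
    """Return the entries listed in the first {{also|...}} template, if any."""
    match = ALSO_TEMPLATE.search(wikitext)
    if match is None:
        return []
    parts = match.group(1).split("|")
    # Skip empty entries and named parameters such as "key=value".
    return [p.strip() for p in parts if p.strip() and "=" not in p]
```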
To check if the template is used on the page at all, query the API for titles=太&prop=templates&tltemplates=Template:also
Similar templates are available in more language editions of Wiktionary, in case you want to use other sources than the English one. The current list is:
br:Patrom:gwelet
ca:Plantilla:vegeu
cs:Šablona:Viz
de:Vorlage:Siehe auch
el:Πρότυπο:δείτε
es:Plantilla:desambiguación
eu:Txantiloi:Esanahi desberdina
fi:Malline:katso
fr:Modèle:voir
gl:Modelo:homo
id:Templat:lihat
is:Snið:sjá einnig
it:Template:Vedi
ja:テンプレート:see
no:Mal:se også
oc:Modèl:veire
pl:Szablon:podobne
pt:Predefinição:ver também
ru:Шаблон:Cf
sk:Šablóna:See
sv:Mall:se även
It has been suggested that the WikiData project be expanded to cover Wiktionary. If and when that happens, you might be able to query the WikiData API for that kind of stuff!