How to get all URLs (not just titles) in a Wikipedia article using the MediaWiki API?

I am using the MediaWiki API to retrieve all possible URLs from a Wikipedia article, using 'https://en.wikipedia.org/w/api.php?action=query&prop=links&redirects&pllimit=500&format=json', but it only gives a list of link titles. For example, the Artificial intelligence Wikipedia page has a link titled "delivery networks", but the actual URL is "https://en.wikipedia.org/wiki/Content_delivery_network", which is what I want.

Use a generator:
action=query&
format=jsonfm&
titles=Estelle_Morris&
redirects&
generator=links&
gpllimit=500&
prop=info&
inprop=url
See API docs on generators and the info module.
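Put together, that request looks like this (format=jsonfm is the human-readable debug format; use format=json in code):
https://en.wikipedia.org/w/api.php?action=query&format=jsonfm&titles=Estelle_Morris&redirects&generator=links&gpllimit=500&prop=info&inprop=url
Each page object in the response then carries a fullurl field with the complete URL.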

I have replaced most of my previous answer, including the code, to use the information provided in Tgr's answer, in case someone else would like sample Python code. This code is heavily based on MediaWiki's sample code for so-called 'raw continuations'.
I have deliberately limited the number of links requested per invocation to five so that one more parameter possibility could be demonstrated.
import requests

def query(request):
    request['action'] = 'query'
    request['format'] = 'json'
    request['prop'] = 'info'
    request['generator'] = 'links'
    request['inprop'] = 'url'
    previousContinue = {}
    while True:
        req = request.copy()
        req.update(previousContinue)
        result = requests.get('https://en.wikipedia.org/w/api.php', params=req).json()
        if 'error' in result:
            raise RuntimeError(result['error'])
        if 'warnings' in result:
            print(result['warnings'])
        if 'query' in result:
            yield result['query']
        if 'continue' in result:
            # 'raw continuation': pass gplcontinue back on the next request
            previousContinue = {'gplcontinue': result['continue']['gplcontinue']}
        else:
            break

for result in query({'titles': 'Estelle Morris', 'gpllimit': '5'}):
    # each batch maps page ids to page info objects; inprop=url adds 'fullurl'
    for url in [page['fullurl'] for page in result['pages'].values()]:
        print(url)
I mentioned in my first answer that, if the OP wanted to do something similar with artificial intelligence, they should begin with 'Artificial intelligence', noting the capitalisation. Otherwise the search would start with a disambiguation page, with all of the complications that could arise from that.

Related

How to get the contributors you've coincided with the most when editing Wikipedia

I'm doing a gamification web app to help with Wikimedia's community health.
I want to find which editors have edited the same pages as 'Jake' the most, in the last week, or over the last 100 edits, or something like that.
I roughly know my query, but I can't figure out which tables I need because the Wikimedia DB layout is a mess.
So, I want to obtain something like:
Username | Occurrences | Pages
Mikey    | 13          | Obama, ...
So the query would be something like (I'm accepting suggestions):
Get the pages that the user 'Jake' has edited in the last week.
Get the contributors of that page in last week.
For each of these contributors, get the pages they have edited in the last week and see if they match with the pages 'Jake' has edited and count them.
I've tried doing something similar but simpler in Pywikibot, but it's very, very slow (20 seconds for the last 500 contributions of Jake).
I only get the edited pages, get the contributors of each page, and just count them, and it's still very slow.
My pywikibot code is:
site = Site(langcode, 'wikipedia')
user = User(site, username)
contributed_pages = set()
for page, oldid, ts, comment in user.contributions(total=100, namespaces=[0]):
    contributed_pages.add(page)
return get_contributor_ocurrences(contributed_pages, site, username)
And the function
def get_contributor_ocurrences(contributed_pages, site, username):
    contributors = []
    for page in contributed_pages:
        for editor in page.contributors():
            if APISite.isBot(self=site, username=editor) or editor == username:
                continue
            contributors.append(editor)
    return Counter(contributors)
PS: I have access to DB replicas, which I guess are way faster than Wikimedia API or Pywikibot
You can filter the data to be retrieved with timestamp parameters. This decreases the time needed a lot. Refer to the documentation for their usage. Here is a code snippet to get the data with Pywikibot using timestamps:
from collections import Counter
from datetime import timedelta

import pywikibot
from pywikibot.tools import filter_unique

site = pywikibot.Site()
user = pywikibot.User(site, username)  # username must be a string

# Set up the generator for the last 7 days.
# Do not care about the timestamp format if using pywikibot.Timestamp
stamp = pywikibot.Timestamp.now() - timedelta(days=7)
contribs = user.contributions(end=stamp)

contributors = []
# filter_unique is used to remove duplicates.
# The key uses the page title
for page, *_ in filter_unique(contribs, key=lambda x: str(x[0])):
    # note: editors is a Counter
    editors = page.contributors(endtime=stamp)
    print('{:<35}: {}'.format(page.title(), editors))
    contributors.extend(editors.elements())

total = Counter(contributors)
This prints a list of pages and, for each page, shows the editors and their contribution counts within the given time range. Finally, total should have the same content as your get_contributor_ocurrences function above.
It requires some additional work to get the table you mentioned above.
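For what it's worth, here is a minimal sketch of that additional work, reusing the same Pywikibot setup. The names occurrences and pages_by_editor are mine, and 'Jake' is just a placeholder username; bots are not filtered out here.
from collections import Counter, defaultdict
from datetime import timedelta

import pywikibot
from pywikibot.tools import filter_unique

site = pywikibot.Site()
username = 'Jake'  # placeholder example user
user = pywikibot.User(site, username)

stamp = pywikibot.Timestamp.now() - timedelta(days=7)
contribs = user.contributions(end=stamp)

occurrences = Counter()             # editor -> number of co-edited pages
pages_by_editor = defaultdict(set)  # editor -> titles of the shared pages

for page, *_ in filter_unique(contribs, key=lambda x: str(x[0])):
    for editor in page.contributors(endtime=stamp):
        if editor == username:
            continue
        occurrences[editor] += 1
        pages_by_editor[editor].add(page.title())

# Rough Username | Occurrences | Pages table
for editor, count in occurrences.most_common():
    print('{:<25} {:>4}  {}'.format(editor, count, ', '.join(sorted(pages_by_editor[editor]))))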

Need to get issuelink types with Jira Python

I am writing a script with the Jira Python library and I have encountered a big obstacle here.
I need to access one of the issue links under "is duplicated by", but I don't have any idea about the attributes I can use.
I can get to the issuelinks field, but I can't go further from there.
This is what I've got so far:
issue = jira.issue(ISSUE_NUM)  # this is the issue I am handling
link = issue.fields.issuelinks  # I've accessed the issuelinks field
if hasattr(link, "inwardIssue"):
    inwardIssue = link.inwardIssue
and I want to do this from here:
if(str(inwardIssue.type(?)) == "is duplicated by"):
Inward issues can be:
is cloned by
is duplicated by
and so on.
How can I get the type of inward issues?
There seem to be a few types of issue links. So far I've seen: Blocker, Cause, Duplicate and Reference.
In order to identify the type of the issue link, you can do the following:
issue = jira.issue(ISSUE_NUM)
all_issue_links = issue.fields.issuelinks
for link in all_issue_links:
    if link.type.name == 'Duplicate':
        inward_issue = link.inwardIssue
        # Do something with link
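If you would rather match on the human-readable direction text ("is duplicated by") than on the type name, the link type object in the REST payload also carries inward and outward description strings, which the Python client exposes as attributes; that is an assumption worth verifying against your instance. A rough sketch, with placeholder server URL, credentials, and issue key:
from jira import JIRA

# Placeholder connection details; substitute your own server and credentials.
jira = JIRA(server="https://your-domain.atlassian.net", basic_auth=("user", "api_token"))

issue = jira.issue("PROJ-123")  # hypothetical issue key
for link in issue.fields.issuelinks:
    # Only links pointing *into* this issue expose an inwardIssue attribute.
    # link.type.inward is assumed to hold the direction text, e.g. "is duplicated by".
    if hasattr(link, "inwardIssue") and link.type.inward == "is duplicated by":
        print("Duplicated by:", link.inwardIssue.key)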

How to get the latest papers from PubMed

This is a bit of a specific question, but somebody must have done this before. I would like to get the latest papers from PubMed. Not papers about a certain subject, but all of them. I thought to query by modification date (mdat). I use Biopython and my code looks like this:
from Bio import Entrez

handle = Entrez.egquery(mindate='2015/01/10', maxdate='2017/02/19', datetype='mdat')
results = Entrez.read(handle)
for row in results["eGQueryResult"]:
    if row["DbName"] == "nuccore":
        print(row["Count"])
However, this results in zero papers. If I add term='cancer' I get heaps of papers. So the query seems to need the term keyword, but I want all papers, not papers on a certain subject. Any ideas how to do this?
Thanks,
Carl
term is a required parameter, so you can't omit it in your call to Entrez.egquery.
If you need all the papers within a specified timeframe, you will probably need a local copy of MEDLINE and PubMed Central: for MEDLINE, this involves getting a license; for PubMed Central, you can download the Open Access subset without a license via FTP.
EDIT for Python 3. The idea is that the latest PubMed ID is the same thing as the latest paper (which I'm not sure is true). Basically it does a binary search for the latest PMID, then gives a list of the n most recent. This does not look at dates, and it only returns PMIDs.
There is an issue, however, in that not all PMIDs exist. For example, https://pubmed.ncbi.nlm.nih.gov/34078719/ exists, https://pubmed.ncbi.nlm.nih.gov/34078720/ does not (a retraction?), and https://pubmed.ncbi.nlm.nih.gov/34078721/ exists. This ruins the binary search, since it can't know whether it has found a PMID that hasn't been used yet or one that previously existed.
CODE:
import urllib.request
import urllib.error

def pmid_exists(pmid):
    url_stem = 'https://www.ncbi.nlm.nih.gov/pubmed/'
    query = url_stem + str(pmid)
    try:
        urllib.request.urlopen(query)
        return True
    except urllib.error.HTTPError:
        return False

def get_latest_pmid(guess=27239557, _min_guess=None, _max_guess=None):
    # print(_min_guess, '<=', guess, '<=', _max_guess)
    if _min_guess and _max_guess and _max_guess - _min_guess <= 1:
        # recursive base case, this guess must be the largest PMID
        return guess
    elif pmid_exists(guess):
        # guess PMID exists, search for larger ids
        _min_guess = guess
        next_guess = (_min_guess + _max_guess) // 2 if _max_guess else guess * 2
    else:
        # guess PMID does not exist, search for smaller ids
        _max_guess = guess
        next_guess = (_min_guess + _max_guess) // 2 if _min_guess else guess // 2
    return get_latest_pmid(next_guess, _min_guess, _max_guess)

# Start of program
n = 5
latest_pmid = get_latest_pmid()
most_recent_n_pmids = list(range(latest_pmid - n, latest_pmid))
print(most_recent_n_pmids)
OUTPUT:
[28245638, 28245639, 28245640, 28245641, 28245642]
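If you then want more than the bare PMIDs, one possible follow-up (a sketch using Biopython's Entrez module; the email address is a placeholder, and the "Id"/"Title" keys are the field names ESummary normally returns for PubMed) is to pass those IDs to ESummary and print the titles:
from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder; NCBI asks for a contact address

pmids = [28245638, 28245639, 28245640, 28245641, 28245642]
handle = Entrez.esummary(db="pubmed", id=",".join(str(p) for p in pmids))
for doc in Entrez.read(handle):
    print(doc["Id"], doc["Title"])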

What is the best way to call the Channel 9 API to get a video?

Let's say that I have a URL to a Channel 9 video;
Ex: https://channel9.msdn.com/Series/Office-365-Tips--Tricks/01-Wprowadzenie
And I want to display this video on my site, along with some information about it, e.g. its duration.
All I know so far is that I can get a list of videos by calling
https://channel9.msdn.com/odata/Entries and skipping ahead by 25 to show the next 25 results.
My implementation right now is something like:
Get the first 25 elements from the API
Iterate through them
Compare my URL with elementFromApi[i].url
It's working, but I don't like this solution; it is inelegant and slow as hell. I have no knowledge about the API, so I don't know how to refactor this.
Maybe someone can help me.
PS: I need the information from the API; embedding an iframe with the given URL is not the solution here :)
PS2: Sorry for my English.
I got this requirement from my client and I ended up doing it using RSS! In your use case the URL is https://channel9.msdn.com/Series/Office-365-Tips--Tricks/01-Wprowadzenie, so we can simply read the RSS feed at https://s.ch9.ms/Series/Office-365-Tips--Tricks/rss/mp4. In PowerShell we can explore the content with the piece of code below:
$Sessions = Invoke-RestMethod -Uri 'https://s.ch9.ms/Series/Office-365-Tips--Tricks/rss/mp4' -UseDefaultCredentials
foreach ($Session in $Sessions) {
    $Duration = [timespan]::FromSeconds($Session.duration)
    [pscustomobject]@{
        Title          = $Session.title
        Duration       = ("{0:0}:{1:00}:{2:00}" -f ($Duration.Hours, $Duration.Minutes, $Duration.Seconds))
        Creator        = $Session.creator
        "URl(MP3)"     = $Session.group.content.url[0]
        "URl(MP4)"     = $Session.group.content.url[1]
        "URl(webm)"    = $Session.group.content.url[2]
        "URl(MP4High)" = $Session.group.content.url[3]
    }
}
Note: the code may need further improvement!
The same can be achieved using C#, but I am not a certified developer, so I used PowerShell to meet the client's requirement.
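For completeness, here is a rough Python equivalent of the PowerShell above. It is only a sketch: the dc:creator and itunes:duration element names (and duration being given in seconds) are inferred from the PowerShell output, not confirmed against the feed.
import xml.etree.ElementTree as ET
from datetime import timedelta

import requests

NS = {
    "itunes": "http://www.itunes.com/dtds/podcast-1.0.dtd",
    "dc": "http://purl.org/dc/elements/1.1/",
}

rss = requests.get("https://s.ch9.ms/Series/Office-365-Tips--Tricks/rss/mp4")
root = ET.fromstring(rss.content)

for item in root.iter("item"):
    title = item.findtext("title")
    creator = item.findtext("dc:creator", namespaces=NS)        # assumed element name
    duration = item.findtext("itunes:duration", namespaces=NS)  # assumed to be seconds
    if duration is not None:
        duration = str(timedelta(seconds=int(duration)))
    print(title, "|", duration, "|", creator)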

How to get all URLs in a Wikipedia page

It seems like the Wikipedia API's definition of a link is different from a URL? I'm trying to use the API to return all the URLs in a specific wiki page.
I have been playing around with this query that I found from this page under generators and redirects.
I'm not sure why exactly you are confused (it would help if you explained that), but I'm quite sure that query is not what you want. It lists links (prop=links) on pages that are linked (generator=links) from the page “Title” (titles=Title). It also lists only the first page of links on the first page of links (with the page size at the tiny default value of 10).
If you want to get all the links on the page “Title”:
Use just prop=links, you don't want the generator.
Increase the limit to the maximum possible by adding pllimit=max (pl is the “prefix” for links).
Use the value given in the query-continue element to get to the second (and following) page of results.
So, the query for the first page would be:
http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links&pllimit=max
And the second (and in this case, final) page:
http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links&pllimit=max&plcontinue=226160|0|Lieutenant_General
Another thing that might be confusing you is that links returns only internal links (to other Wikipedia pages). To get external links, use prop=extlinks. You can also combine the two into one query:
http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links|extlinks
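For external links specifically, a minimal sketch of the same retrieval loop in Python, using prop=extlinks (ellimit and elcontinue are the corresponding limit and continuation parameters; "Title" is a placeholder page name):
import requests

session = requests.Session()
url = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "titles": "Title",
    "prop": "extlinks",
    "ellimit": "max",
}

while True:
    data = session.get(url=url, params=params).json()
    for page in data["query"]["pages"].values():
        for ext in page.get("extlinks", []):
            print(ext["*"])  # the external URL itself
    if "continue" not in data:
        break
    params["elcontinue"] = data["continue"]["elcontinue"]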
Here's a Python solution that gets (and prints) all the pages linked to from a particular page. It gets the maximum number of links in the first request, then looks to see if the returned JSON object has a "continue" property. If it does, it adds the "plcontinue" value to the params dictionary and makes another request. (The last page of results returned will not have this property.)
import requests

session = requests.Session()
url = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "titles": "Albert Einstein",
    "prop": "links",
    "pllimit": "max"
}

response = session.get(url=url, params=params)
data = response.json()
pages = data["query"]["pages"]

pg_count = 1
page_titles = []

print("Page %d" % pg_count)
for key, val in pages.items():
    for link in val["links"]:
        print(link["title"])
        page_titles.append(link["title"])

while "continue" in data:
    plcontinue = data["continue"]["plcontinue"]
    params["plcontinue"] = plcontinue

    response = session.get(url=url, params=params)
    data = response.json()
    pages = data["query"]["pages"]

    pg_count += 1
    print("\nPage %d" % pg_count)
    for key, val in pages.items():
        for link in val["links"]:
            print(link["title"])
            page_titles.append(link["title"])

print("%d titles found." % len(page_titles))
This code was adapted from the code in the MediaWiki API:Links example.