How to get all URLs in a Wikipedia page - wikipedia-api

It seems like the Wikipedia API's definition of a link is different from a URL? I'm trying to use the API to return all the URLs in a specific wiki page.
I have been playing around with this query, which I found from this page under generators and redirects.

I'm not sure what exactly is confusing you (it would help if you explained that), but I'm quite sure that query is not what you want. It lists links (prop=links) on pages that are linked (generator=links) from the page “Title” (titles=Title). It also lists only the first page of links on the first page of linked pages (with the page size at the tiny default value of 10).
If you want to get all the links on the page “Title”:
Use just prop=links, you don't want the generator.
Increase the limit to the maximum possible by adding pllimit=max (pl is the “prefix” for links)
Use the value given in the query-continue element to get to the second (and following) page of results.
So, the query for the first page would be:
http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links&pllimit=max
And the second (and in this case, final) page:
http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links&pllimit=max&plcontinue=226160|0|Lieutenant_General
Another thing that might be confusing you is that links returns only internal links (to other Wikipedia pages). To get external links, use prop=extlinks. You can also combine the two into one query:
http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links|extlinks
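For illustration, here is a minimal requests-based sketch of that combined query (the title "Title" is just a placeholder, and continuation handling is omitted here, though it works the same way as for plain links):
import requests

# Sketch: fetch internal links and external links for one page in a single query.
# "Title" is a placeholder page name; continuation is not handled here.
params = {
    "action": "query",
    "format": "json",
    "titles": "Title",
    "prop": "links|extlinks",
    "pllimit": "max",
    "ellimit": "max",
}
data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
for page in data["query"]["pages"].values():
    for link in page.get("links", []):
        print("internal:", link["title"])
    for ext in page.get("extlinks", []):
        print("external:", ext["*"])  # external URLs come back under the "*" key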

Here's a Python solution that gets (and prints) all the pages linked to from a particular page. It gets the maximum number of links in the first request, then looks to see if the returned JSON object has a "continue" property. If it does, it adds the "plcontinue" value to the params dictionary and makes another request. (The last page of results returned will not have this property.)
import requests

session = requests.Session()
url = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "titles": "Albert Einstein",
    "prop": "links",
    "pllimit": "max"
}

response = session.get(url=url, params=params)
data = response.json()
pages = data["query"]["pages"]

pg_count = 1
page_titles = []

print("Page %d" % pg_count)
for key, val in pages.items():
    for link in val["links"]:
        print(link["title"])
        page_titles.append(link["title"])

while "continue" in data:
    plcontinue = data["continue"]["plcontinue"]
    params["plcontinue"] = plcontinue

    response = session.get(url=url, params=params)
    data = response.json()
    pages = data["query"]["pages"]

    pg_count += 1
    print("\nPage %d" % pg_count)
    for key, val in pages.items():
        for link in val["links"]:
            print(link["title"])
            page_titles.append(link["title"])

print("%d titles found." % len(page_titles))
This code was adapted from the code in the MediaWiki API:Links example.

Related

Lucene calculate term vectors for existing index

With Lucene.net I would like to get the term vectors as described in this stackoverflow question.
The problem is, the index is already generated with the field indexed and stored, but without term vectors.
FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true);
type.setStoreTermVectors(false);
Theoretically, it should be possible to re-calculate the term vectors for each document and then store them in the index.
Do you know how this could be possible, without deleting the complete Lucene index?
As mentioned in my comments in the question, you can generate term vector data on-the-fly, which may help you to avoid a complete rebuild of your indexed data.
In my scenario, I want to find the offset positions of my search term in the matched document.
I don't want to oversell this approach - it's absolutely not a substitute for re-indexing - but if your queries are basic, it may help.
Step 1: Perform whatever query you are currently performing.
For each document in the list of hits, you will then need to re-process the relevant field from that document - so, either you already have the field data stored in your existing index, or you will need to retrieve it from its original source.
Step 2: For each such field, you can re-use the same analyzer to build a token stream on-the-fly. The token stream can be configured with different attributes, such as:
token attributes
offset attributes
and others (see the interfaces in the Lucene.Net.Analysis.TokenAttributes namespace)
Example:
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;

String? fieldName = null;
String fieldContent = "Foo Bar Baz Bar Bat";
String searchTerm = "bar";

var analyzer = new StandardAnalyzer(AppLuceneVersion);
var ts = analyzer.GetTokenStream(fieldName, fieldContent);

var charTermAttr = ts.AddAttribute<ICharTermAttribute>();
var offsetAttr = ts.AddAttribute<IOffsetAttribute>();

try
{
    ts.Reset();
    Console.WriteLine("");
    Console.WriteLine("Token: " + searchTerm);
    while (ts.IncrementToken())
    {
        if (searchTerm.Equals(charTermAttr.ToString()))
        {
            var start = offsetAttr.StartOffset;
            var end = offsetAttr.EndOffset;
            Console.WriteLine(String.Format(" > offset: {0}-{1}", start, end));
        }
    }
    ts.End();
}
catch (Exception)
{
    throw;
}
The above example assumes one of the hits from step 1 was a field containing "Foo Bar Baz Bar Bat" - with a search term of bar.
The output generated is:
Token: bar
> offset: 4-7
> offset: 12-15
So, as you can see, you are not re-executing a query - you are just re-processing a token stream. The more complex the original search term is, the harder it will be to make this approach work the way you probably need it to.

How to get the contributors you've coincided editing the most in Wikipedia

I'm doing a gamification web app to help Wikimedia's community health.
I want to find what editors have edited the same pages as 'Jake' the most in the last week or 100 last edits or something like that.
I know my query, but I can't figure out what tables I need because the Wikimedia DB layout is a mess.
So, I want to obtain something like:
Username    Occurrences    Pages
Mikey       13             Obama, ...
So the query would be something like (I'm accepting suggestions):
Get the pages that the user 'Jake' has edited in the last week.
Get the contributors of those pages in the last week.
For each of these contributors, get the pages they have edited in the last week and see if they match with the pages 'Jake' has edited and count them.
I've tried doing something similar but simpler in Pywikibot, and it's very, very slow (20 seconds for the last 500 contributions of Jake).
I only get the edited pages, then get the contributors of each page and count them, and even that is very slow.
My pywikibot code is:
site = Site(langcode, 'wikipedia')
user = User(site, username)

contributed_pages = set()
for page, oldid, ts, comment in user.contributions(total=100, namespaces=[0]):
    contributed_pages.add(page)

return get_contributor_ocurrences(contributed_pages, site, username)
And the function
def get_contributor_ocurrences(contributed_pages, site, username):
    contributors = []
    for page in contributed_pages:
        for editor in page.contributors():
            if APISite.isBot(self=site, username=editor) or editor == username:
                continue
            contributors.append(editor)
    return Counter(contributors)
PS: I have access to the DB replicas, which I guess are way faster than the Wikimedia API or Pywikibot.
You can filter the data to be retrieved with timestamp parameters. This decreases the time needed a lot. Refer to the documentation for their usage. Here is a code snippet to get the data with Pywikibot using timestamps:
from collections import Counter
from datetime import timedelta

import pywikibot
from pywikibot.tools import filter_unique

site = pywikibot.Site()
user = pywikibot.User(site, username)  # username must be a string

# Setup the generator for the last 7 days.
# Do not care about the timestamp format if using pywikibot.Timestamp
stamp = pywikibot.Timestamp.now() - timedelta(days=7)
contribs = user.contributions(end=stamp)

contributors = []
# filter_unique is used to remove duplicates.
# The key uses the page title
for page, *_ in filter_unique(contribs, key=lambda x: str(x[0])):
    # note: editors is a Counter
    editors = page.contributors(endtime=stamp)
    print('{:<35}: {}'.format(page.title(), editors))
    contributors.extend(editors.elements())

total = Counter(contributors)
This prints a list of pages, and for each page it shows the editors and their contribution counts within the given time range. Finally, total should have the same content as your get_contributor_ocurrences function above.
It requires some additional work to get the table you mentioned above.
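For example, a rough sketch of that last step, assuming you also collect a pages_by_editor dict inside the loop above (mapping each editor to the set of shared page titles; that dict and the output formatting are illustrative additions, not part of Pywikibot):
# Hypothetical helper: fill pages_by_editor inside the loop above, e.g.
#     for editor in editors:
#         pages_by_editor.setdefault(editor, set()).add(page.title())
for editor, occurrences in total.most_common():
    if editor == username:
        continue  # skip the user themselves
    pages = ', '.join(sorted(pages_by_editor.get(editor, set())))
    print('{:<25} {:>5}   {}'.format(editor, occurrences, pages))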

How do I get the correct path to a JSON file from a URL?

How do I get the "request url" part of the GET request?
The number part is the time in milliseconds, but the part before the ".dat" in the URL changes for every game, so I need a way to get the whole URL using requests and BeautifulSoup4.
Link to the page: https://www.oddsportal.com/soccer/germany/bundesliga/1-fc-koln-holstein-kiel-0IRBLw8b/
This was an interesting challenge so I decided to have a look.
You can construct the url from various parts of the initial response, with the inclusion of a tab mapping for football (shown in the dictionary below). It may be possible to derive the mappings for the dictionary dynamically from the onmousedown arguments and the associated uid function. I started looking into it and may carry on if time permits. Hardcoding for football, for the full/1st half/2nd half tabs, seems to be ok for now.
import requests
import re, urllib.parse, time

time_lkup = {
    'full_time': '1-2',
    'first_half': '1-3',
    'second_half': '1-4'
}

with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0',
                 'referer': 'https://www.oddsportal.com'}
    r = s.get('https://www.oddsportal.com/soccer/germany/bundesliga/1-fc-koln-holstein-kiel-0IRBLw8b/')
    version_id = re.search(r'"versionId":(\d+)', r.text).group(1)
    sport_id = re.search(r'"sportId":(\d+)', r.text).group(1)
    xeid = re.search(r'"id":"(.*?)"', r.text).group(1)
    xhash = urllib.parse.unquote(re.search(r'"xhash":"(.*?)"', r.text).group(1))
    unix = int(time.time())
    url = f'https://fb.oddsportal.com/feed/match/{version_id}-{sport_id}-{xeid}-{time_lkup["full_time"]}-{xhash}.dat?_={unix}'
    print(url)
    r = s.get(url)
    print(r.text)

How to get all URLs (not just titles) in a Wikipedia article using the MediaWiki API?

I am using the MediaWiki API to retrieve all possible URLs from a Wikipedia article ('https://en.wikipedia.org/w/api.php?action=query&prop=links&redirects&pllimit=500&format=json'), but it only gives a list of link titles. For example, the Artificial Intelligence Wikipedia page has a link titled "delivery networks", but the actual URL is "https://en.wikipedia.org/wiki/Content_delivery_network", which is what I want.
Use a generator:
action=query&
format=jsonfm&
titles=Estelle_Morris&
redirects&
generator=links&
gpllimit=500&
prop=info&
inprop=url
See API docs on generators and the info module.
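Put together as a single request, that query is:
https://en.wikipedia.org/w/api.php?action=query&format=jsonfm&titles=Estelle_Morris&redirects&generator=links&gpllimit=500&prop=info&inprop=url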
I have replaced most of my previous answer, including the code, to use the information provided in Tgr's answer, in case someone else would like sample Python code. This code is heavily based on code from Mediawiki for so-called 'raw continuations'.
I have deliberately limited the number of links requested per invocation to five so that one more parameter possibility could be demonstrated.
import requests

def query(request):
    request['action'] = 'query'
    request['format'] = 'json'
    request['prop'] = 'info'
    request['generator'] = 'links'
    request['inprop'] = 'url'
    previousContinue = {}
    while True:
        req = request.copy()
        req.update(previousContinue)
        result = requests.get('http://en.wikipedia.org/w/api.php', params=req).json()
        if 'error' in result:
            raise RuntimeError(result['error'])
        if 'warnings' in result:
            print(result['warnings'])
        if 'query' in result:
            yield result['query']
        if 'continue' in result:
            previousContinue = {'gplcontinue': result['continue']['gplcontinue']}
        else:
            break

for result in query({'titles': 'Estelle Morris', 'gpllimit': '5'}):
    for url in [page['fullurl'] for page in result['pages'].values()]:
        print(url)
I mentioned in my first answer that, if the OP wanted to do something similar with artificial intelligence, they should begin with 'Artificial intelligence' (noting the capitalisation). Otherwise the search would start with a disambiguation page, with all of the complications that could arise from those.

Get ALL tweets, not just recent ones via twitter API (Using twitter4j - Java)

I've built an app using twitter4j which pulls in a bunch of tweets when I enter a keyword, takes the geolocation out of the tweet (or falls back to profile location) then maps them using ammaps. The problem is I'm only getting a small portion of tweets, is there some kind of limit here? I've got a DB going collecting the tweet data so soon enough it will have a decent amount, but I'm curious as to why I'm only getting tweets within the last 12 hours or so?
For example if I search by my username I only get one tweet, that I sent today.
Thanks for any info!
EDIT: I understand Twitter doesn't allow public access to the firehose... my question is more about why I'm limited to only finding recent tweets.
You need to keep redoing the query, resetting the maxId every time, until you get nothing back. You can also use setSince and setUntil.
An example:
Query query = new Query();
query.setCount(DEFAULT_QUERY_COUNT);
query.setLang("en");

// set the bounding dates
query.setSince(sdf.format(startDate));
query.setUntil(sdf.format(endDate));

QueryResult result = searchWithRetry(twitter, query); // searchWithRetry is my function that deals with rate limits
while (result.getTweets().size() != 0) {
    List<Status> tweets = result.getTweets();
    System.out.print("# Tweets:\t" + tweets.size());

    Long minId = Long.MAX_VALUE;
    for (Status tweet : tweets) {
        // do stuff here
        if (tweet.getId() < minId)
            minId = tweet.getId();
    }

    query.setMaxId(minId - 1);
    result = searchWithRetry(twitter, query);
}
Really it depends on which API you are using, i.e. the Streaming API or the Search API. In the Search API there is an optional parameter, result_type. The possible values of this parameter are:
* mixed: include both popular and real-time results in the response.
* recent: return only the most recent results in the response.
* popular: return only the most popular results in the response.
The default is mixed.
As far as I understand, you are getting the recent ones, which is why you only see a recent set of tweets. Another issue is the low volume of tweets that carry geolocation information: because very few users add location information to their profiles, you end up with very few usable tweets.
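For comparison, here is a minimal Python sketch that calls the standard v1.1 Search API endpoint directly with requests instead of going through twitter4j, just to show where result_type fits; the bearer token is a placeholder you would obtain via application-only auth:
import requests

BEARER_TOKEN = "YOUR_APP_ONLY_BEARER_TOKEN"  # placeholder credential

params = {
    "q": "your keyword",          # same search term you would pass to twitter4j
    "result_type": "recent",      # or "mixed" (the default) / "popular"
    "count": 100,
}
resp = requests.get(
    "https://api.twitter.com/1.1/search/tweets.json",
    headers={"Authorization": "Bearer " + BEARER_TOKEN},
    params=params,
)
for status in resp.json().get("statuses", []):
    print(status["created_at"], status["text"])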