How to embed or scrape the "About This Dataset" section shown on a Socrata dataset API Docs page? - beautifulsoup

This is the link to the Socrata API Docs page for a public NYC dataset: https://dev.socrata.com/foundry/data.cityofnewyork.us/qiz3-axqb
On the top right of the page there is a colophon/cartouche that lists some information about the dataset:
"About This Dataset"
Its last line lets you copy the code needed to "Embed These Docs".
I have tried it; it works, but the entire page gets embedded.
I would like to embed just this colophon every time I access the dataset, so that I can show this information in my report.
I know where this information is located in the html tree:
<body class="dev foundry 200" ...>
  ...
  <div class="container-fluid content">
    ...
    <div id="foundry-docs">
      ...
      <div class="pull-right sidebar metadata">
        <div class="panel panel-info about">
          ...
          <div class="panel-body">
            <ul>
              <li>...</li>  ==> the 9 items needed
So, I tried to scrape that information using beautifulsoup:
from bs4 import BeautifulSoup
import requests

data_api_page = 'https://dev.socrata.com/foundry/data.cityofnewyork.us/qiz3-axqb'
page = requests.get(data_api_page)
print(page.status_code)
print(page.headers['content-type'])

soup = BeautifulSoup(page.text, 'html.parser')
all_divs = soup.find(name='div', attrs={'class': 'panel panel-info about'})
for tag in all_divs.children:
    print(tag)
Nothing is returned (even with find_all): what am I doing wrong?
Thanks for your help!
PS: The other reason, besides annotating a report with this info, is that I want to retrieve the dataset row count before accessing the dataset, in order to bypass the 1000-record limit of the Socrata API (v2.1 has the same default limit as the prior version) and retrieve the entire dataset.

A couple of things might be useful here, and neither involves scraping. There is a metadata API endpoint where you can retrieve a lot of the descriptions of the data. Here is the metadata endpoint for that NYC dataset: http://data.cityofnewyork.us/api/views/metadata/v1/qiz3-axqb.
Unfortunately, the metadata API does not include the row count. For that, it might be simpler to assemble a SoQL query that simply returns the count, such as: https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=count(date)
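For example, a minimal sketch with requests (unauthenticated, so it may be throttled without an app token; the exact field names in the responses, such as name, attribution and count_date, are assumptions worth checking against the live output):
import requests

# Dataset description from the metadata endpoint mentioned above
meta = requests.get(
    'https://data.cityofnewyork.us/api/views/metadata/v1/qiz3-axqb').json()
print(meta.get('name'), '-', meta.get('attribution'))

# Row count via the SoQL aggregate query mentioned above
count = requests.get(
    'https://data.cityofnewyork.us/resource/qiz3-axqb.json',
    params={'$select': 'count(date)'}).json()
print(count)  # expected shape: something like [{'count_date': '...'}]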

Yes Tom: the metadata API has a ton of info, except the row count...
This is what I am doing to retrieve the entire dataset via the authenticated client:
# First call to obtain the dataset size (row count):
LIM = client.get(MVD_collisions, select='COUNT(*) as tot')
LIM = int(LIM[0]['tot'])
# Retrieval call:
results = client.get(MVD_collisions, limit=LIM)
One day, I'll figure out how to scrape the row count from the summary data at the bottom of DATAPAGE = 'https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95'...
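If a single huge request ever times out, a paginated variant of the retrieval call might be safer. This is only a sketch reusing client, MVD_collisions and LIM from above, with an arbitrary page size (limit and offset are standard sodapy get() parameters):
PAGE = 50000  # arbitrary page size
results = []
offset = 0
while offset < LIM:
    results.extend(client.get(MVD_collisions, limit=PAGE, offset=offset))
    offset += PAGE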

I have this better solution, which does not involve sending a count query over all the rows:
def get_rows_from_metadata(metadata):
    rows = -1  # -1 is returned if something went wrong
    for c in metadata['columns']:
        if c['name'] == 'UNIQUE KEY':
            try:
                rows = int(c['cachedContents']['not_null'])
            except (KeyError, TypeError, ValueError):
                rows = 0
            break
    return rows

dataset_rows = get_rows_from_metadata(metadata)
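For reference, the metadata dict this function expects (columns with cachedContents) matches what the legacy views endpoint returns; here is a sketch under that assumption, where fetch_views_metadata is just a hypothetical helper wrapping requests:
import requests

def fetch_views_metadata(domain, dataset_id):
    # Hypothetical helper: the legacy "views" metadata carries
    # per-column cachedContents statistics
    url = 'https://{}/api/views/{}.json'.format(domain, dataset_id)
    return requests.get(url).json()

metadata = fetch_views_metadata('data.cityofnewyork.us', 'qiz3-axqb')
print(get_rows_from_metadata(metadata))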

Related

How to get the contributors you've coincided with the most when editing Wikipedia

I'm building a gamification web app to help Wikimedia's community health.
I want to find which editors have edited the same pages as 'Jake' the most, over the last week or the last 100 edits or something like that.
I know roughly what I want to query, but I can't figure out which tables I need because the Wikimedia DB layout is a mess.
So, I want to obtain something like:

Username | Occurrences | Pages
Mikey    | 13          | Obama, ...
So the query would be something like (I'm accepting suggestions):
Get the pages that the user 'Jake' has edited in the last week.
Get the contributors of that page in last week.
For each of these contributors, get the pages they have edited in the last week and see if they match with the pages 'Jake' has edited and count them.
I've tried doing something similar (a bit simpler) in Pywikibot, but it's very, very slow (20 seconds for the last 500 contributions of Jake).
I only get the edited pages, fetch the contributors of each page and count them, and even that is very slow.
My pywikibot code is:
site = Site(langcode, 'wikipedia')
user = User(site, username)
contributed_pages = set()
for page, oldid, ts, comment in user.contributions(total=100, namespaces=[0]):
    contributed_pages.add(page)
return get_contributor_ocurrences(contributed_pages, site, username)
And the function
def get_contributor_ocurrences(contributed_pages, site, username):
    contributors = []
    for page in contributed_pages:
        for editor in page.contributors():
            if APISite.isBot(self=site, username=editor) or editor == username:
                continue
            contributors.append(editor)
    return Counter(contributors)
PS: I have access to DB replicas, which I guess are way faster than Wikimedia API or Pywikibot
You can filter the data to be retrieved with timestamp parameters. This decreases the time needed a lot. Refer to the documentation for their usage. Here is a code snippet to get the data with Pywikibot using timestamps:
from collections import Counter
from datetime import timedelta

import pywikibot
from pywikibot.tools import filter_unique

site = pywikibot.Site()
user = pywikibot.User(site, username)  # username must be a string

# Setup the Generator for the last 7 days.
# Do not care about the timestamp format if using pywikibot.Timestamp
stamp = pywikibot.Timestamp.now() - timedelta(days=7)
contribs = user.contributions(end=stamp)

contributors = []
# filter_unique is used to remove duplicates.
# The key uses the page title
for page, *_ in filter_unique(contribs, key=lambda x: str(x[0])):
    # note: editors is a Counter
    editors = page.contributors(endtime=stamp)
    print('{:<35}: {}'.format(page.title(), editors))
    contributors.extend(editors.elements())

total = Counter(contributors)
This prints a list of pages and, for each page, shows the editors and their contribution counts within the given time range. Finally, total should have the same content as your get_contributor_ocurrences function above.
It requires some additional work to get the table you mentioned above.
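One way to get closer to that table is to rework the loop above so it also records which pages each co-editor shares with the user; here is a sketch reusing contribs, stamp and filter_unique from the snippet (coincided_pages is my own helper name):
from collections import Counter, defaultdict

coincided_pages = defaultdict(set)  # editor -> set of shared page titles
contributors = []
for page, *_ in filter_unique(contribs, key=lambda x: str(x[0])):
    editors = page.contributors(endtime=stamp)  # Counter of editor names
    for editor in editors:
        coincided_pages[editor].add(page.title())
    contributors.extend(editors.elements())

total = Counter(contributors)
# Username / Occurrences / Pages, most frequent co-editors first
for editor, occurrences in total.most_common():
    print('{:<25} {:>5}  {}'.format(
        editor, occurrences, ', '.join(sorted(coincided_pages[editor]))))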

How do I find a specific tag's value (which could be anything) with beautifulsoup?

I am trying to get the job IDs from the tags of Indeed listings. So far, I have taken Indeed search results and put each job into its own "bs4.element.Tag" object, but I don't know how to extract the value of the tag (or is it a class?) "data-jk". Here is what I have so far:
import requests
import bs4
import re
# 1: scrape (5?) pages of search results for listing ID's
results = []
for start in range(0, 50, 10):
    results.append(requests.get(
        "https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=" + str(start)))
# each search page has a query "q", location "l", and a "start" = 10*int
# the search results are contained in a "td" with ID = "resultsCol"
justjobs = []
for eachResult in results:
    soup_jobs = bs4.BeautifulSoup(eachResult.text, "lxml")  # this is for IDs
    justjobs.extend(soup_jobs.find_all(attrs={"data-jk": True}))  # re.compile("data-jk")

# each "card" is a div object
# each has the class "jobsearch-SerpJobCard unifiedRow row result clickcard"
# as well as a specific tag "data-jk"
# "data-jk" seems to be the actual IDs used in each listing's URL

# Now, each div element has a data-jk. I will try to get data-jk from each one:
jobIDs = []
print(type(justjobs[0]))  # DEBUG
for eachJob in justjobs:
    jobIDs.append(eachJob.find("data-jk"))
print("Length: " + str(len(jobIDs)))  # DEBUG
print("Example JobID: " + str(jobIDs[1]))  # DEBUG
The examples I've seen online generally try to get the information contained between an opening and a closing tag, but I am not sure how to get the info from inside the (first) tag itself. I've tried doing it by parsing it as a string instead:
print(justjobs[0])
for eachJob in justjobs:
    jobIDs.append(str(eachJob)[115:131])
print(jobIDs)
but the website is also inconsistent in how the tags are laid out, and I think that using beautifulsoup would be more flexible than handling multiple cases with substrings.
Any pointers would be greatly appreciated!
It looks like you can regex them out of a script tag:
import re
import requests

html = requests.get('https://www.indeed.com/jobs?q=data+analyst&l=United+States&start=0').text
p = re.compile(r"jk:'(.*?)'")
ids = p.findall(html)
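As an aside, since the original question was about reading the value out of the tag itself: a bs4.element.Tag exposes its attributes dict-style, so reusing the justjobs list from the question, something like this should also work:
jobIDs = []
for eachJob in justjobs:
    # Tag attributes behave like a dict; .get() returns None if the attribute is missing
    jk = eachJob.get("data-jk")
    if jk:
        jobIDs.append(jk)
print(jobIDs)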

How to get all URLs (not just titles) in a Wikipedia article using the MediaWiki API?

I am using the MediaWiki API ('https://en.wikipedia.org/w/api.php?action=query&prop=links&redirects&pllimit=500&format=json') to retrieve all possible URLs from a Wikipedia article, but it only gives a list of link titles. For example, the Artificial Intelligence Wikipedia page has a link titled "delivery networks," but the actual URL is "https://en.wikipedia.org/wiki/Content_delivery_network", which is what I want.
Use a generator:
action=query&
format=jsonfm&
titles=Estelle_Morris&
redirects&
generator=links&
gpllimit=500&
prop=info&
inprop=url
See API docs on generators and the info module.
I have replaced most of my previous answer, including the code, to use the information provided in Tgr's answer, in case someone else would like sample Python code. This code is heavily based on code from Mediawiki for so-called 'raw continuations'.
I have deliberately limited the number of links requested per invocation to five so that one more parameter possibility could be demonstrated.
import requests

def query(request):
    request['action'] = 'query'
    request['format'] = 'json'
    request['prop'] = 'info'
    request['generator'] = 'links'
    request['inprop'] = 'url'
    previousContinue = {}
    while True:
        req = request.copy()
        req.update(previousContinue)
        result = requests.get('http://en.wikipedia.org/w/api.php', params=req).json()
        if 'error' in result:
            raise RuntimeError(result['error'])
        if 'warnings' in result:
            print(result['warnings'])
        if 'query' in result:
            yield result['query']
        if 'continue' in result:
            previousContinue = {'gplcontinue': result['continue']['gplcontinue']}
        else:
            break

for result in query({'titles': 'Estelle Morris', 'gpllimit': '5'}):
    for url in [_['fullurl'] for _ in list(result.values())[0].values()]:
        print(url)
I mentioned in my first answer that, if the OP wanted to do something similar with artificial intelligence, then he should begin with 'Artificial intelligence' (note the capitalisation). Otherwise the search would start with a disambiguation page, with all of the complications that could arise from those.

How to get the N latest data and display that in the angularjs nvd3 stacked area chart

I'm using angularjs-nvd3-directives to create a stacked area chart.
Now my problem is that I'm polling hundreds of data points from the server and I need to display only the N latest. How do I do that?
Here's the HTML file
<div ng-controller="GraphController as viewAll">
  <nvd3-stacked-area-chart data="viewAll.data"
      id="graph" showXAxis="true" showYAxis="true"
      showLegend="true" interactive="true"
      tooltips="true" forcex="[xFunction()]">
    <svg></svg>
  </nvd3-stacked-area-chart>
</div>
Try to define a second array that holds only the last N values and pass that as the 'data' parameter instead of the full data array.
$scope.viewAll.visData = $scope.viewAll.data.slice(Math.max($scope.viewAll.data.length - N, 1))
In your html:
<nvd3-stacked-area-chart
data="viewAll.visData"
...
>

How to get all URLs in a Wikipedia page

It seems like the Wikipedia API's definition of a link is different from a URL? I'm trying to use the API to return all the URLs in a specific wiki page.
I have been playing around with this query, which I found on this page under generators and redirects.
I'm not sure why exactly you are confused (it would help if you explained that), but I'm quite sure that query is not what you want. It lists links (prop=links) on pages that are linked (generator=links) from the page “Title” (titles=Title). It also lists only the first page of links on the first page of links (with a page size of the tiny default value of 10).
If you want to get all the links on the page “Title”:
Use just prop=links; you don't want the generator.
Increase the limit to the maximum possible by adding pllimit=max (pl is the “prefix” for links).
Use the value given in the query-continue element to get to the second (and following) pages of results.
So, the query for the first page would be:
http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links&pllimit=max
And the second (and in this case, final) page:
http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links&pllimit=max&plcontinue=226160|0|Lieutenant_General
Another thing that might be confusing you is that links returns only internal links (to other Wikipedia pages). To get external links, use prop=extlinks. You can also combine the two into one query:
http://en.wikipedia.org/w/api.php?action=query&titles=Title&prop=links|extlinks
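If it is really the external URLs you are after, here is a minimal requests sketch for the extlinks variant (ellimit follows the usual prefix+limit pattern, and the "*" result key is what the default JSON format uses for each URL; treat both as assumptions to verify against the live API):
import requests

session = requests.Session()
params = {
    "action": "query",
    "format": "json",
    "titles": "Title",
    "prop": "extlinks",
    "ellimit": "max",
}
ext_urls = []
while True:
    data = session.get("https://en.wikipedia.org/w/api.php", params=params).json()
    for page in data["query"]["pages"].values():
        for link in page.get("extlinks", []):
            ext_urls.append(link["*"])
    if "continue" not in data:
        break
    # generic continuation: merge whatever the API hands back
    params.update(data["continue"])
print("%d external URLs found." % len(ext_urls))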
Here's a Python solution that gets (and prints) all the pages linked to from a particular page. It gets the maximum number of links in the first request, then looks to see if the returned JSON object has a "continue" property. If it does, it adds the "plcontinue" value to the params dictionary and makes another request. (The last page of results returned will not have this property.)
import requests

session = requests.Session()
url = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "titles": "Albert Einstein",
    "prop": "links",
    "pllimit": "max"
}

response = session.get(url=url, params=params)
data = response.json()
pages = data["query"]["pages"]

pg_count = 1
page_titles = []

print("Page %d" % pg_count)
for key, val in pages.items():
    for link in val["links"]:
        print(link["title"])
        page_titles.append(link["title"])

while "continue" in data:
    plcontinue = data["continue"]["plcontinue"]
    params["plcontinue"] = plcontinue

    response = session.get(url=url, params=params)
    data = response.json()
    pages = data["query"]["pages"]

    pg_count += 1
    print("\nPage %d" % pg_count)
    for key, val in pages.items():
        for link in val["links"]:
            print(link["title"])
            page_titles.append(link["title"])

print("%d titles found." % len(page_titles))
This code was adapted from the code in the MediaWiki API:Links example.