Web-scraping using python - pandas

I am trying to extract data from this website. It is hard to scrape because the URL does not change after a search.
I want to search by PUBLISHER IPI '00144443097' and extract all the data inside class="items-container".
My code:
import urllib.request
from bs4 import BeautifulSoup

quote_page = 'https://portal.themlc.com/search'
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('section', attrs={'class': 'items-container'})
name = name_box.text
print(name)
Because the URL doesn't change after the search, this doesn't return any value.
After extracting the values I want to sort them with pandas.

When the URL doesn't change, you can use the browser's developer tools to see whether an API is being called. In this case there are two APIs: one gives basic information about the writer and the other gives the information on the works. You can parse the JSON responses however you wish from here.
Note: these are POST requests, not GET requests.
import requests

# Writer search
url = 'https://api.ptl.themlc.com/api/search/writer?page=1&limit=10'
payload = {'writerIpi': "00144443097"}
requests.post(url, json=payload).json()

# Works for that writer
url = 'https://api.ptl.themlc.com/api/search/work?page=1&limit=10'
payload = {'writerIpi': "00144443097"}
requests.post(url, json=payload).json()

# Publisher search
url = 'https://api.ptl.themlc.com/api/search/publisher?page=1&limit=10'
payload = {"publisherIpi": "00144443097"}
requests.post(url, json=payload).json()

# This url gets the 161 works for the publisherIpId you want. It's convoluted and you may be
# able to automate it, but I used the developer tools to find the right publisherIpId.
url = 'https://api.ptl.themlc.com/api/search/work?page=1&limit=10'
payload = {'publisherIpId': "7305902"}
requests.post(url, json=payload).json()

To find the publisherIpId, you need to open some of the works within the author and look for the work endpoint in the developer tools' network tab.
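Since you want to sort the results with pandas, here is a minimal sketch of loading the work results into a DataFrame. The key names ('works' and 'title') are assumptions for illustration - inspect the actual JSON response to find the real ones:
import pandas as pd
import requests

url = 'https://api.ptl.themlc.com/api/search/work?page=1&limit=10'
payload = {'publisherIpId': "7305902"}
data = requests.post(url, json=payload).json()

# 'works' and 'title' are assumed key names -- check the real response structure
works = pd.json_normalize(data.get('works', []))
if 'title' in works.columns:
    works = works.sort_values('title')
print(works.head())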

Related

GET API call returns non-text content

I am making an API call and reading the response as JSON:
import requests
from pprint import pprint

r = requests.get(query_url, headers=headers)
pprint(r.json())
But the 'content' field is not in text format:
'content': 'CmRlbGV0ZSBmcm9tICB7eyBwYXJhbXMuY3VyYXRlZF9kYXRhYmFzZV9uYW1l\n'
'IH19LmNybS5BRkZJTElBVEVfUFJJQ0lORzsKCklOU0VSVCBJTlRPICB7eyBw\n'
'TkcKICB3aGVyZSAxID0gMQogIAogIDsKICAKICAKCg==\n'
How do I convert the 'content' to text?
For full context, I am trying to download code from the GitHub repo as text to store in our database.
Are you using Python and, in particular, the requests package? If so, I think you could use r.text.
docs: https://pypi.org/project/requests/
Thanks to the article #CherryDT pointed me to, I was able to get the text:
import base64

r_dict = r.json()
content = r_dict['content']
# The content is base64-encoded; decode it to bytes, then to a string
text = base64.b64decode(content).decode('utf-8')
print(text)
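For the fuller context (downloading a file from a GitHub repo as text), an end-to-end sketch against the GitHub contents API could look like this; the owner, repo and path values are placeholders:
import base64
import requests

owner, repo, path = "octocat", "Hello-World", "README"  # placeholders - substitute your own
url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
headers = {"Accept": "application/vnd.github+json"}  # add an Authorization header for private repos

r = requests.get(url, headers=headers)
r.raise_for_status()
payload = r.json()

# The contents API returns the file body base64-encoded
text = base64.b64decode(payload["content"]).decode("utf-8")
print(text)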

How to count the number of ads on a website

I've been looking around but can't find anything. Is it possible to scrape a page and identify the use of ads (and presumably count them) for any given site?
As an example, this page has 13 ads.
I'm currently using BeautifulSoup to obtain the page
import requests
from bs4 import BeautifulSoup

headers = {'Content-Type': 'application/json'}
url = "https://www.worthofweb.com/website-value/wikipedia.com/"
response = requests.get(url, headers=headers, timeout=5)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
The problem is parsing the page.
You can consider analyzing every element in the DOM and checking it against the standard ad sizes. Here's a list:
https://www.creatopy.com/blog/banner-standard-sizes/
Briefly: get the width/height from each element's style and see whether it matches a standard ad unit size. If it does, you can argue it's an ad (false positives are possible, though).
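A rough sketch of that idea is below. It assumes the ad containers carry inline width/height styles, which many pages don't, so treat it as a starting point rather than a reliable detector; the size list is a small, non-exhaustive sample of common banner sizes.
import re
from bs4 import BeautifulSoup

# A few common banner sizes (width, height) in pixels - extend as needed
STANDARD_AD_SIZES = {(300, 250), (728, 90), (336, 280), (300, 600), (320, 50), (160, 600), (970, 250)}

def count_probable_ads(html):
    soup = BeautifulSoup(html, "html.parser")
    count = 0
    for tag in soup.find_all(style=True):
        width = re.search(r"width:\s*(\d+)px", tag["style"])
        height = re.search(r"height:\s*(\d+)px", tag["style"])
        if width and height and (int(width.group(1)), int(height.group(1))) in STANDARD_AD_SIZES:
            count += 1
    return count

print(count_probable_ads(response.text))  # reuses the response fetched above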

Fetch All Pull-Request Comments Via Bitbucket REST API

This is how to retrieve a particular pull request's comments according to Bitbucket's documentation:
Although I have the pull request ID and format a correct URL, I still get a 400 response error. I am able to make a POST request to comment, but I cannot make a GET. After further reading I noticed that the six parameters listed for this endpoint are not marked 'optional'. It looks like they need to be supplied in order to retrieve all the comments.
But what exactly are these parameters? I don't find their descriptions to be helpful in the slightest. Any and all help would be greatly appreciated!
fromHash and toHash are only required if diffType isn't set to EFFECTIVE. state also seems optional to me (it didn't give me an error when I left it out), and anchorState specifies which kind of comments to fetch - you'd probably want ALL there. As far as I understand it, path contains the path of the file to read comments from (e.g. if src/a.py and src/b.py were changed, it specifies which of them to fetch comments for).
However, that's probably not what you want. I'm assuming you want to fetch all comments.
You can do that via /rest/api/1.0/projects/{projectKey}/repos/{repositorySlug}/pull-requests/{pullRequestId}/activities which also includes other activities like reviews, so you'll have to do some filtering.
I won't paste example data from the documentation or the Bitbucket instance I tested this on, since the JSON response is quite long. As I've said, there is an example response on the linked page. I also think you'll figure out how to get to the data you want once you've downloaded it, since this is a Q&A forum and not a "program this for me" page :b
As a small quickstart: you can use curl like this
curl -u <your_username>:<your_password> https://<bitbucket-url>/rest/api/1.0/projects/<project-key>/repos/<repo-name>/pull-requests/<pr-id>/activities
which will print the response json.
Python version of that curl snippet using the requests module:
import requests

url = "<your-url>"  # see above on how to assemble your url
r = requests.get(
    url,
    params={},  # you'll need this later
    auth=requests.auth.HTTPBasicAuth("your-username", "your-password")
)
Note that the result is paginated according to the API documentation, so you'll have to do some extra work to build a full list: either set an obnoxiously high limit (dirty) or keep making requests until you've fetched everything. I strongly recommend the latter.
You can control which data you get using the start and limit parameters which you can either append to the url directly (e.g. https://bla/asdasdasd/activity?start=25) or - more cleanly - add to the params dict like so:
requests.get(
    url,
    params={
        "start": 25,
        "limit": 123
    }
)
Putting it all together:
def get_all_pr_activity(url):
    start = 0
    values = []
    while True:
        r = requests.get(url, params={
            "limit": 10,  # adjust this limit to your liking - 10 is probably too low
            "start": start
        }, auth=requests.auth.HTTPBasicAuth("your-username", "your-password"))
        values.extend(r.json()["values"])
        if r.json()["isLastPage"]:
            return values
        start = r.json()["nextPageStart"]

print([x["id"] for x in get_all_pr_activity("my-bitbucket-url")])
will print a list of activity ids, e.g. [77190, 77188, 77123, 77136] and so on. Of course, you should probably not hardcode your username and password there - it's just meant as an example, not production-ready code.
Finally, to filter by action inside the function, you can replace the return values with something like
return [activity for activity in values if activity["action"] == "COMMENTED"]
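For example, to print just the comment texts (the "comment" and "text" keys below match what I saw in the activities payload on the instance I tested, but verify them against your own response):
activities = get_all_pr_activity("my-bitbucket-url")
comments = [a for a in activities if a["action"] == "COMMENTED"]
for c in comments:
    # key names based on the response I saw - adjust if your payload differs
    print(c["comment"]["text"])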

Is there a way to get the URL that a link is scraped from?

I've got a spider written out that crawls my website and scrapes a bunch of tags. I'm now trying to have it return the URL that the link was discovered on.
For example:
www.example.com/product/123 was found on www.example.com/page/2.
When scrapy scrapes information from /product/123, I want a field called "Scraped From" that returns /page/2. For every URL that is scraped, I want to find the originating page the URL was found on. I've been poring over the docs and can't seem to figure this out. Any help would be appreciated!
The easiest way is to use the response.headers. There should be a referer header.
referer = response.headers['Referer']  # Scrapy header values are bytes; call .decode() if you need a str
You can also use meta to pass information along to the next URL.
def parse(self, response):
    product_url = response.css('#url').get()
    yield scrapy.Request(product_url, callback=self.parse_product, meta={'referer': response.url})

def parse_product(self, response):
    referer = response.meta['referer']
    item = ItemName()
    item['referer'] = referer
    yield item
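Putting that together, a minimal spider sketch might look like the following; the start URL and the selector are placeholders, not taken from your site:
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://www.example.com/page/1"]  # placeholder

    def parse(self, response):
        # Follow every product link on the listing page, remembering where it was found
        for product_url in response.css("a.product::attr(href)").getall():  # placeholder selector
            yield response.follow(product_url, callback=self.parse_product,
                                  meta={"referer": response.url})

    def parse_product(self, response):
        yield {
            "url": response.url,
            "scraped_from": response.meta["referer"],
        }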

Having trouble with Python Web Scraper

I'm new to scraping and would love some help, or just a push in the right direction. I've tried using Scrapy but could not get it working at all.
What I'm trying to do is get the titles, episodes and HTML5 video player links plus the different qualities (480p, 720p, etc.) from this page. I'm not sure how I'm meant to get the video srcs from the iframe elements, though.
As mentioned before, any help would be very helpful.
Thanks.
I don't have previous experience with Scrapy, but I'm in the middle of a Python Web Scraping project myself. I'm using BeautifulSoup for scraping.
I've written part of the code - it gets all of the titles, episodes and thumbnails, and loads the link to each episode page for further processing. If you run into more trouble, leave a message ;)
from bs4 import BeautifulSoup
from urllib import request

url = "http://getanime.to/recent"
h = {'User-Agent': 'Mozilla/5.0'}
req = request.Request(url, headers=h)
data = request.urlopen(req)
soup = BeautifulSoup(data, 'html.parser')
# print(soup.prettify()[:1000])  # For testing purposes - should print the first 1000 characters of the HTML document

links = soup.find_all('a', class_="episode-release")
for link in links:
    # Get required info from this link
    thumbnail = link.find('div', class_="thumbnail")["style"]
    thumbnail = thumbnail[22:len(thumbnail) - 3]  # strip the background-image: url(...) wrapper, leaving the image URL
    title = link.find('div', class_="title-text").contents[0].strip()
    episode = link.find('div', class_="super-block").span.contents[0]
    href = link["href"]
    # print(thumbnail, title, episode, href)  # For testing purposes

    # Load the link to this episode for further processing
    req2 = request.Request(href, headers=h)
    data2 = request.urlopen(req2)
    soup2 = BeautifulSoup(data2, 'html.parser')
    vid_sources = soup2.find('ul', class_="dropdown-menu dropdown-menu--top video-sources")
    # TODO repeat the above process to find all video sources
Edit: for clarification, the above code is for Python 3.
(posting as another answer, since comments remove linebreaks):
Sure, happy to help ;) You're very much on the right track, so keep at it. I am wondering why you're using find_all('iframe'), since I couldn't find any examples with multiple iframes, but it'll work just as well, I guess. If you know there's only one, it saves some time to use soup.find().
Using type(iframexx) shows me that it points to a list which contains the actual data we want. Then
for iframe in iframexx:
    print(type(iframe))  # each element is a bs4 Tag

ifr = iframexx[0]  # take the first iframe
print(ifr)
print(ifr["data-src"])
allowed me to get the data-src.
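To finish off the TODO from the first answer, a rough sketch that pulls the quality labels and the embedded player URLs from one episode page might look like this. The class names and the data-src attribute come from the snippets above; how the qualities map to individual player URLs is an assumption, so verify it against the real markup:
# Gather the quality labels from the video-sources dropdown and the player URLs from the iframes
qualities = [li.get_text(strip=True) for li in vid_sources.find_all('li')] if vid_sources else []
player_urls = [frame.get('data-src') for frame in soup2.find_all('iframe')]
print(qualities)    # e.g. ['480p', '720p', ...]
print(player_urls)  # the iframe data-src values, i.e. the HTML5 player links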