Error 403 Forbidden not User-Agent - beautifulsoup

I've tried looking at previous posts on the same subject but none of the solutions seem to be working and I'd like to confirm that there is indeed nothing I can do to get around this.
I'm a journalist trying to download permit data from off the planning authority's website. I could do this no problem up till a few months ago but the website has been changed and after adapting my code to the new site, I now seem to be getting an Error 403 every time I try to follow links on the site.
Any help would be greatly appreciated.
My code -not the best looking or most efficient, but I'm self taught and use coding mainly for scraping data for work - stats on the page: http://www.pa.org.mt/padecisionSearch?date=1/31/2018%2012:00:00%20AM
In the bit of code I have pasted beneath I am trying to access each link permit link (first one on page: http://www.pa.org.mt/PACaseDetails?Systemkey=200414&CaseType=PA/10351/17%27) in order to scrape permit details.
While I can generate the link addresses without a problem (they are accessible by clicking the link), sending a request to the address returns:
b'\r\nForbidden\r\n\r\nForbidden URL\r\nHTTP Error 403. The request URL is forbidden.\r\n\r\n'
I've tried changing the User-Agent, and I've also tried to put in a timer between requests but nothing seems to have any effect.
Any suggestions would be very welcome
My code:
import requests
import pandas as pd
import csv
from bs4 import BeautifulSoup
from datetime import date, timedelta as td
import pandas as pd
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import urllib
with requests.Session() as s:
#s.headers.update(head)
r= s.get("http://www.pa.org.mt",data=None, headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"})
page = (s.get("http://www.pa.org.mt/padecisionSearch?date=1/31/2018%2012:00:00%20AM", data=None, headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/63.0.3239.132 Safari/537.36"}).content)
soup = BeautifulSoup(page, 'html.parser')
search_1 = soup.find_all('table')
for item in search_1:
item1 = item.find_all('tr')
for item2 in item1:
item3 = item2.find_all('td', class_ = 'fieldData')
for element in item3:
list2.append(element.text)
zejt_number = (len(list2)/6)
zi = element.find_all('a')
if len(zi) == 0 and ((len(list2)-1)%5 == 0 or len(list2) == 1):
case_status.append("")
applicant.append("")
architect.append("")
application_type.append("")
case_category.append("")
case_officer.append("")
case_officer2.append("")
date_approved.append("")
application_link.append("")
elif len(zi) != 0:
for li in zi:
hyperlink = "http://www.pa.org.mt/"+li.get('href')
application_link.append(hyperlink)
print(hyperlink)
z = (s.get(hyperlink, data=None, headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}).content)
print(z)

first of all your code is a bit messy. is it all your code? or just a part of it? e.g. you are importing pandas twice. nevertheless your main problem why this is not working is the hyperlinks you are generating:
for li in zi:
hyperlink = "http://www.pa.org.mt/"+li.get('href')
print(hyperlink)
the result looks like this:
http://www.pa.org.mt/../PACaseDetails?Systemkey=200414&CaseType=PA/10351/17'
this is link won't work. a quick workaround would be to edit the hyperlink before you do the request:
for li in zi:
hyperlink = "http://www.pa.org.mt/"+li.get('href')
hyperlink = hyperlink.replace('../', '')
print(hyperlink)
z = (s.get(hyperlink, data=None, headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}).content)
print(z)
the hyperlinks now should look like this:
http://www.pa.org.mt/PACaseDetails?Systemkey=200414&CaseType=PA/10351/17'
and the request should pass through.

Related

How to collect all comments with scrapy?

I have to use a data scraper to scrape all the comments from newspaper articles. I have very little experience with any kind of coding. A very kind person on reddit gave me this code:
import json
import scrapy
class NewsCommentsSpider(scrapy.Spider):
name = "newscomments"
headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36." }
def start_requests(self):
with open("news.txt") as file:
lines = [line.rstrip() for line in file]
for article_id in lines:
url = f"https://www.dailymail.co.uk/reader-comments/p/asset/readcomments/{article_id}?max=500&order=desc"
yield scrapy.Request(
url=url,
callback=self.parse_comments,
headers=self.headers,
meta={"article_id": article_id},
)
def parse_comments(self, response):
comments_dict = json.loads(response.text)
valid_comments = []
for comment in comments_dict["payload"]["page"]:
if comment["replies"]["totalCount"] >= 3:
valid_comments.append(comment)
with open(f"{response.meta.get('article_id')}.json", "w") as f:
json.dump(valid_comments, f)
I tested it, and it works! However, I think he only designed it to download comments with three or replies, which was my origial query. So I was wondering if anyone here can help me change the variables in what's written here so that it will download all the comments, not just the one's that got replies, but the one's that got replies as well as the one's that didn't.
Quick aside: the data I got from this also contained alot of repeated words, like it repeated the title of the article before every comment, and there were words like "userid" infront of every username, this made it kind of difficult to read, and I was wondering if anyone here could help change the code so it downloads less information, all I really need is the comments, usernames and dates the things comments were made.
Thanks a bunch!
Here's the code once again:
import json
import scrapy
class NewsCommentsSpider(scrapy.Spider):
name = "newscomments"
headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36." }
def start_requests(self):
with open("news.txt") as file:
lines = [line.rstrip() for line in file]
for article_id in lines:
url = f"https://www.dailymail.co.uk/reader-comments/p/asset/readcomments/{article_id}?max=500&order=desc"
yield scrapy.Request(
url=url,
callback=self.parse_comments,
headers=self.headers,
meta={"article_id": article_id},
)
def parse_comments(self, response):
comments_dict = json.loads(response.text)
valid_comments = []
for comment in comments_dict["payload"]["page"]:
if comment["replies"]["totalCount"] >= 3:
valid_comments.append(comment)
with open(f"{response.meta.get('article_id')}.json", "w") as f:
json.dump(valid_comments, f)

Scrapy - Splash - Not rendering everything on the site

I try to scrape the odds comparison site from www.raingpost.com
Example from racingpost -> these sites are only working until the race is over, so if you can not see it anymore, pick a race that is still to come :)
So I scraped this site for some info using different spiders, but it seems the odds from the bookmakers are not rendered by splash - at least I can not see the odds in my local splash or the html returned.
I tried:
Increasing the wait time up to 20sec
deactivating the private mode
using scroll down
But it is still not rendering.
How do I scrape these odds?
I tried some solutions from answers here on stackoverflow, the last code I tried was this one:
class DailyoddSpider(scrapy.Spider):
name = 'dailyodd'
allowed_domains = ['www.racingpost.com']
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
script = '''
function main(splash, args)
splash.private_mode_enabled = false
url = args.url
assert(splash:go(url))
assert(splash:wait(5))
return splash:html()
end
'''
def start_requests(self):
yield SplashRequest(url="https://www.racingpost.com/racecards/394/southwell-aw/2022-03-05/804308/odds-comparison", callback=self.parse, endpoint='execute', args={
'lua_source': self.script
})

BeautifulSoup: How to get only one part of <p> output?

I am trying to find the solution how to get only the price without text from the paragraph.
from bs4 import BeautifulSoup
import requests
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
p = requests.get(url = 'https://www.tia-mobiteli.hr/detaljan-prikaz.aspx?gid=11-appise_64wheu', headers = headers)
soup = BeautifulSoup(p.content,'lxml')
price = soup.find('div', class_='widget widget-info widget-price').p.text
price2 = price.strip()
print(price2)
My output is:
Naša najniža cijena za gotovinsko/virmansko plaćanje: 3.649,00 kn
I want to get only:
3.649,00 kn
Or if it is possible:
3649.00
The price is inside <b> tag:
import requests
from bs4 import BeautifulSoup
headers = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"
}
p = requests.get(
url="https://www.tia-mobiteli.hr/detaljan-prikaz.aspx?gid=11-appise_64wheu",
headers=headers,
)
soup = BeautifulSoup(p.content, "lxml")
price = soup.find("div", class_="widget widget-info widget-price").b.text
price = float(price.split()[0].replace(".", "").replace(",", "."))
print(price)
Prints:
3649.0
You can use parse module that acts like reverse format().
Usage:
import parse
...
float(parse.parse('Naša najniža cijena za gotovinsko/virmansko plaćanje: {} kn',price2)[0].replace('.','').replace(',','.'))

Unable to login into PSN using Python requests module

I am trying to login into PSN https://www.playstation.com/en-in/sign-in-and-connect/ using python requests module and API got from the inspect element of browser. Below is the code
import requests
login_data = {
'password': "mypasswordhere",
'username': "myemailhere",
}
header = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
}
with requests.Session() as s1:
url = "https://auth.api.sonyentertainmentnetwork.com/2.0/oauth/token"
r = s1.post(url, data = login_data, headers = header)
print(r.text)
With this, I got below response from the server.
{"error":"invalid_client","error_description":"Bad client credentials","error_code":4102,"docs":"https://auth.api.sonyentertainmentnetwork.com/docs/","parameters":[]}
Can I know any alternative method to login into PSN network? Preferably using API model instead of selenium? My objective is to login into PSN network with my credentials and change password but seems got stuck in login page only...

Why is Beautifulsoup find_all not returning complete results?

I am trying to parse an Amazon search results page. I want to access the data contained in an <li> tag with <id=result_0>, <id=result_1>, <id=result_2>, etc. The find_all('li') function only returns 4 results (up to result_3), which I thought was odd, since when viewing the webpage in my browser, I see 12 results.
When I print parsed_html, I see it contains all the way to result_23. Why isn't find_all returning all 24 objects? A snippet of my code is below.
import requests
try:
from BeautifulSoup import bsoup
except ImportError:
from bs4 import BeautifulSoup as bsoup
search_url = 'https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-
alias%3Dstripbooks&field-keywords=data+analytics'
response = requests.get(search_url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"})
parsed_html = bsoup(response.text)
results_tags = parsed_html.find_all('div',attrs={'id':'atfResults'})
results_html = bsoup(str(results_tags[0]))
results_html.find_all('li')
For what it's worth, the results_tags object also only contains the 4 results. Which is why I am thinking the issue is in the find_all step, rather than with the BeautifulSoup object.
If anyone can help me figure out what is happening here and how I can access all of the search results on this webpage, I will really appreciate it!!
import requests, re
try:
from BeautifulSoup import bsoup
except ImportError:
from bs4 import BeautifulSoup as bsoup
search_url = 'https://www.amazon.com/s/?url=search-%20alias%3Dstripbooks&field-keywords=data+analytics' #delete the irrelevant part from url
response = requests.get(search_url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" }) # add 'Accept' header
parsed_html = bsoup(response.text, 'lxml')
lis = parsed_html.find_all('li', class_='s-result-item' ) # use class to find li tag
len(lis)
out:
25
Can access the li elements directly through class instead of id. This will print the text from each li element.
results_tags = parsed_html.find_all('li',attrs={'class':'s-result-item'})
for r in results_tags:
print(r.text)