Scrapy - Splash - Not rendering everything on the site

I am trying to scrape the odds comparison pages on www.racingpost.com.
Example from racingpost -> these pages only work until the race is over, so if you can't see it anymore, pick a race that is still to come :)
I scraped this site for some info using different spiders, but it seems the odds from the bookmakers are not rendered by Splash - at least I can't see the odds in my local Splash or in the HTML it returns.
I tried:
Increasing the wait time up to 20 seconds
Deactivating private mode
Scrolling down
But it is still not rendering.
How do I scrape these odds?
I tried some solutions from answers here on Stack Overflow; the last code I tried was this one:
import scrapy
from scrapy_splash import SplashRequest  # pip install scrapy-splash


class DailyoddSpider(scrapy.Spider):
    name = 'dailyodd'
    allowed_domains = ['www.racingpost.com']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'

    # Lua script executed by Splash: disable private mode, load the
    # page, wait five seconds, and return the rendered HTML.
    script = '''
    function main(splash, args)
        splash.private_mode_enabled = false
        url = args.url
        assert(splash:go(url))
        assert(splash:wait(5))
        return splash:html()
    end
    '''

    def start_requests(self):
        yield SplashRequest(
            url="https://www.racingpost.com/racecards/394/southwell-aw/2022-03-05/804308/odds-comparison",
            callback=self.parse,
            endpoint='execute',
            args={'lua_source': self.script},
        )
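One thing worth checking when Splash returns the page skeleton but a widget like this stays empty: the data is often fetched by the page's own JavaScript from a separate XHR endpoint, which can be requested directly without rendering at all. A minimal sketch, assuming such a JSON endpoint exists (the URL below is a placeholder; the real one can be found in the Network tab of the browser's developer tools while the odds-comparison page loads):
import json
import scrapy


class OddsApiSpider(scrapy.Spider):
    name = 'oddsapi'

    def start_requests(self):
        # Hypothetical endpoint: copy the real XHR URL from the
        # browser's Network tab while the odds page is loading.
        yield scrapy.Request(
            url="https://www.racingpost.com/PLACEHOLDER/odds-endpoint",
            callback=self.parse_odds,
        )

    def parse_odds(self, response):
        # Such endpoints usually answer with JSON rather than HTML.
        data = json.loads(response.text)
        yield {'odds': data}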

Related

How to collect all comments with scrapy?

I have to use a data scraper to scrape all the comments from newspaper articles. I have very little experience with any kind of coding. A very kind person on Reddit gave me this code:
import json
import scrapy


class NewsCommentsSpider(scrapy.Spider):
    name = "newscomments"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36."
    }

    def start_requests(self):
        with open("news.txt") as file:
            lines = [line.rstrip() for line in file]
        for article_id in lines:
            url = f"https://www.dailymail.co.uk/reader-comments/p/asset/readcomments/{article_id}?max=500&order=desc"
            yield scrapy.Request(
                url=url,
                callback=self.parse_comments,
                headers=self.headers,
                meta={"article_id": article_id},
            )

    def parse_comments(self, response):
        comments_dict = json.loads(response.text)
        valid_comments = []
        for comment in comments_dict["payload"]["page"]:
            if comment["replies"]["totalCount"] >= 3:
                valid_comments.append(comment)
        with open(f"{response.meta.get('article_id')}.json", "w") as f:
            json.dump(valid_comments, f)
I tested it, and it works! However, I think he only designed it to download comments with three or more replies, which was my original query. So I was wondering if anyone here can help me change the variables in what's written here so that it will download all the comments: the ones that got replies as well as the ones that didn't.
Quick aside: the data I got from this also contained a lot of repeated words. For example, it repeated the title of the article before every comment, and there were words like "userid" in front of every username. This made it kind of difficult to read, and I was wondering if anyone here could help change the code so it downloads less information; all I really need is the comments, the usernames, and the dates the comments were made.
Thanks a bunch!
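A minimal sketch of a changed parse_comments that keeps every comment and stores only a few fields; the key names "userAlias", "message" and "dateCreated" are guesses about the JSON layout, so check one raw response for the real names:
def parse_comments(self, response):
    comments_dict = json.loads(response.text)
    all_comments = []
    for comment in comments_dict["payload"]["page"]:
        # No reply-count filter any more: every comment is kept.
        all_comments.append({
            # These three key names are assumptions; inspect one raw
            # JSON response to confirm what they are really called.
            "user": comment.get("userAlias"),
            "text": comment.get("message"),
            "date": comment.get("dateCreated"),
        })
    with open(f"{response.meta.get('article_id')}.json", "w") as f:
        json.dump(all_comments, f)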

I send a POST request with Scrapy and the response is 'too frequently', but when I send the same request with Postman I get the data I want

This is my Scrapy code. I also send the same request with Postman; no matter how many times I send it, I receive the data I want. But when I send it with Scrapy, the response is always 'too frequently, forbid visit'. Maybe there are many possible causes, but I want to know what they might be.
import scrapy
from scrapy import FormRequest


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false']

    def start_requests(self):
        yield FormRequest(
            self.start_urls[0],
            callback=self.parse,
        )

    def parse(self, response):
        print(response.text)
You need to show the website that you are an actual user, not a bot. Try sending a user-agent in the header:
yield FormRequest(
    url=self.start_urls[0],
    callback=self.parse,
    headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'},
)
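If every request needs the same header, Scrapy can also apply it project-wide through its standard USER_AGENT setting, so no per-request headers argument is needed:
# settings.py of the Scrapy project
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'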

Scrapy. How to resolve 520?

This website responds with:
DEBUG: Crawled (520) <GET https://ddlfr.pw/> (referer: None)
How can I resolve this? I post my code to explain:
import scrapy
from scrapy import Spider


class LoginSpider(Spider):
    name = 'ddlfr.pw'
    start_urls = ['https://ddlfr.pw/index.php?do=search']
    numero = 0  # class attribute, read below via self.numero

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            headers={'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'},
            formdata={'dosearch': 'Rechercher', 'story': 'musso', 'do': 'search', 'subaction': 'search', 'search_start': str(self.numero), 'full_search': '0', 'result_form': '1'},
            callback=self.after_login,
            dont_filter=True,
        )

    def after_login(self, response):
        for title in response.xpath('//div[@class="short nl nl2"]'):
            yield {'roman': title.extract()}
Yes, because the website requires valid browser headers, while Scrapy sends headers that identify it as a bot.
Try to use these headers:
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
You should then see a normal crawled status for your website.
I suggest that you monitor what your browser does when you send the form (Network tab of the developer tools) and try to reproduce the request with Scrapy.
In Firefox, for example, you can copy the successful request from the Network tab as a curl command, which is a clear representation of the request.
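Recent Scrapy versions can build a request straight from such a copied curl command via scrapy.Request.from_curl. A minimal sketch (the curl string is a placeholder for whatever you copied from the browser):
import scrapy


class ReproSpider(scrapy.Spider):
    name = 'ddlfr_repro'

    def start_requests(self):
        # Placeholder command: paste the one copied from the browser's
        # Network tab here instead.
        yield scrapy.Request.from_curl(
            "curl 'https://ddlfr.pw/index.php?do=search' -d 'do=search&story=musso'",
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("status: %s", response.status)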

Error 403 Forbidden not User-Agent

I've tried looking at previous posts on the same subject but none of the solutions seem to be working and I'd like to confirm that there is indeed nothing I can do to get around this.
I'm a journalist trying to download permit data from the planning authority's website. I could do this without a problem until a few months ago, but the website has been changed, and after adapting my code to the new site I now get an Error 403 every time I try to follow links on the site.
Any help would be greatly appreciated.
My code (not the best looking or most efficient, but I'm self-taught and use coding mainly for scraping data for work) starts at the page: http://www.pa.org.mt/padecisionSearch?date=1/31/2018%2012:00:00%20AM
In the bit of code I have pasted below, I am trying to access each permit link (first one on the page: http://www.pa.org.mt/PACaseDetails?Systemkey=200414&CaseType=PA/10351/17%27) in order to scrape the permit details.
While I can generate the link addresses without a problem (they are accessible by clicking the link), sending a request to the address returns:
b'\r\nForbidden\r\n\r\nForbidden URL\r\nHTTP Error 403. The request URL is forbidden.\r\n\r\n'
I've tried changing the User-Agent, and I've also tried putting a timer between requests, but nothing seems to have any effect.
Any suggestions would be very welcome
My code:
import requests
import pandas as pd
import csv
from bs4 import BeautifulSoup
from datetime import date, timedelta as td
import pandas as pd
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import urllib

with requests.Session() as s:
    #s.headers.update(head)
    r = s.get("http://www.pa.org.mt", data=None, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"})
    page = s.get("http://www.pa.org.mt/padecisionSearch?date=1/31/2018%2012:00:00%20AM", data=None, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}).content
    soup = BeautifulSoup(page, 'html.parser')
    search_1 = soup.find_all('table')
    for item in search_1:
        item1 = item.find_all('tr')
        for item2 in item1:
            item3 = item2.find_all('td', class_='fieldData')
            for element in item3:
                list2.append(element.text)
                zejt_number = (len(list2)/6)
                zi = element.find_all('a')
                if len(zi) == 0 and ((len(list2)-1)%5 == 0 or len(list2) == 1):
                    case_status.append("")
                    applicant.append("")
                    architect.append("")
                    application_type.append("")
                    case_category.append("")
                    case_officer.append("")
                    case_officer2.append("")
                    date_approved.append("")
                    application_link.append("")
                elif len(zi) != 0:
                    for li in zi:
                        hyperlink = "http://www.pa.org.mt/"+li.get('href')
                        application_link.append(hyperlink)
                        print(hyperlink)
                        z = s.get(hyperlink, data=None, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}).content
                        print(z)
First of all, your code is a bit messy. Is it all your code, or just a part of it? For example, you are importing pandas twice. Nevertheless, the main reason this is not working is the hyperlinks you are generating:
for li in zi:
    hyperlink = "http://www.pa.org.mt/"+li.get('href')
    print(hyperlink)
The result looks like this:
http://www.pa.org.mt/../PACaseDetails?Systemkey=200414&CaseType=PA/10351/17'
This link won't work. A quick workaround would be to edit the hyperlink before you make the request:
for li in zi:
    hyperlink = "http://www.pa.org.mt/"+li.get('href')
    hyperlink = hyperlink.replace('../', '')
    print(hyperlink)
    z = s.get(hyperlink, data=None, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}).content
    print(z)
The hyperlinks should now look like this:
http://www.pa.org.mt/PACaseDetails?Systemkey=200414&CaseType=PA/10351/17'
and the request should pass through.
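A more general fix for such relative links is to resolve them against the page URL with the standard library's urljoin, which handles the '../' segments for you:
from urllib.parse import urljoin

page_url = "http://www.pa.org.mt/padecisionSearch?date=1/31/2018%2012:00:00%20AM"
for li in zi:
    # '../PACaseDetails?...' resolved against page_url becomes
    # 'http://www.pa.org.mt/PACaseDetails?...'
    hyperlink = urljoin(page_url, li.get('href'))
    print(hyperlink)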

CasperJS fetchText() function echoing blank output

I'm trying to use fetchText() to print the URL of a Google search result (the cite element of the first result) to the terminal.
It only prints blank output though! I don't see what I'm doing wrong.
Code:
phantom.casperPath = "/usr/local/Cellar/casperjs/1.0.3/libexec/";
phantom.injectJs(phantom.casperPath + '/bootstrap.js');

var utils = require('utils');
var casper = require('casper').create();

casper.start('https://www.google.com/search?q=amazon+shoes');
casper.wait(3000, function () {
    this.echo(this.fetchText('#rso > div:nth-child(1) > li:nth-child(1) > div > div > div > div.f.kv._TD > cite'));
}).run();
Google changes the page depending on the user agent string, so you need to set one during creation (with an example string):
var casper = require("casper").create({
    pageSettings: {
        userAgent: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
    }
});
or with the dedicated function:
casper.userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36");
Sometimes it is also necessary to set the viewport to something desktop-like, because PhantomJS' default viewport is 400x300 and Google might render a different site based on the viewport.
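For example, the viewport can be set right after creating the casper instance:
casper.viewport(1366, 768);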