This website responds with:
DEBUG: Crawled (520) <GET https://ddlfr.pw/> (referer: None)
How can I resolve this?
Here is my code to illustrate:
from scrapy import Spider, FormRequest


class LoginSpider(Spider):
    name = 'ddlfr.pw'
    start_urls = ['https://ddlfr.pw/index.php?do=search']
    numero = 0

    def parse(self, response):
        # Submit the search form found on the page.
        return FormRequest.from_response(
            response,
            headers={'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'},
            formdata={'dosearch': 'Rechercher', 'story': 'musso', 'do': 'search', 'subaction': 'search', 'search_start': str(self.numero), 'full_search': '0', 'result_form': '1'},
            callback=self.after_login,
            dont_filter=True,
        )

    def after_login(self, response):
        for title in response.xpath('//div[@class="short nl nl2"]'):
            yield {'roman': title.extract()}
Yes, this happens because the website requires valid browser headers, while Scrapy identifies itself as a bot by default.
Try using these headers:
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
With them, you should see a successful crawled status for your website.
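If you don't want to repeat the headers= argument on every request, here is a minimal sketch of setting the user agent spider-wide via custom_settings (the value is the one suggested above):

from scrapy import Spider


class LoginSpider(Spider):
    name = 'ddlfr.pw'
    start_urls = ['https://ddlfr.pw/index.php?do=search']

    # The USER_AGENT setting is applied to every request this spider makes,
    # so per-request headers= arguments are no longer necessary.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'),
    }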
I suggest that you monitor what your web browser does when you send the form from the web browser (Network tab of the developer tools), and try to reproduce the request with Scrapy.
In Firefox, for example, you can copy the successful request from the Network tab as a curl command, which is a clear representation of the request.
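If your Scrapy version supports it (Request.from_curl() was added in Scrapy 2.0), you can paste the copied curl command straight into a request. A rough sketch, where the curl string is only a placeholder for the one you actually copy from the Network tab:

import scrapy


class ReplaySearchSpider(scrapy.Spider):
    # Hypothetical spider name, for illustration only.
    name = 'replay_search'

    def start_requests(self):
        # Replace this placeholder with the exact curl command copied from the
        # browser, so that headers, cookies and form data match the real request.
        curl_command = (
            "curl 'https://ddlfr.pw/index.php?do=search' "
            "-H 'user-agent: Mozilla/5.0 (X11; Linux x86_64)' "
            "--data 'do=search&subaction=search&story=musso'"
        )
        yield scrapy.Request.from_curl(curl_command, callback=self.parse)

    def parse(self, response):
        self.logger.info('Crawled with status %s', response.status)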
Related
I am trying to scrape the odds comparison pages from www.racingpost.com.
Example from racingpost -> these pages only work until the race is over, so if you can no longer see one, pick a race that is still to come :)
I have scraped this site for other info using different spiders, but the bookmakers' odds do not seem to be rendered by Splash - at least I cannot see the odds in my local Splash or in the HTML returned.
I tried:
- Increasing the wait time up to 20 seconds
- Deactivating private mode
- Scrolling down
But it is still not rendering.
How do I scrape these odds?
I tried some solutions from answers here on Stack Overflow; the last code I tried was this one:
import scrapy
from scrapy_splash import SplashRequest


class DailyoddSpider(scrapy.Spider):
    name = 'dailyodd'
    allowed_domains = ['www.racingpost.com']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'

    script = '''
    function main(splash, args)
        splash.private_mode_enabled = false
        url = args.url
        assert(splash:go(url))
        assert(splash:wait(5))
        return splash:html()
    end
    '''

    def start_requests(self):
        yield SplashRequest(
            url="https://www.racingpost.com/racecards/394/southwell-aw/2022-03-05/804308/odds-comparison",
            callback=self.parse,
            endpoint='execute',
            args={'lua_source': self.script},
        )

    def parse(self, response):
        # Inspect the HTML that Splash rendered.
        self.logger.info(response.text[:500])
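For completeness, here is a sketch of a Lua script that combines the tweaks listed above (a longer wait, private mode disabled, and scrolling to the bottom of the page). The 20-second wait and the scroll call are illustrative assumptions, not a confirmed fix:

# Drop-in replacement for the `script` attribute above.
script = '''
function main(splash, args)
    splash.private_mode_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(20))
    -- scroll to the bottom in case the odds are rendered lazily
    splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
    assert(splash:wait(5))
    return splash:html()
end
'''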
I am trying to build a web crawler using Scrapy. I want to change the user agent for a single request in the spider. I tried the code below, but the user agent is not updated during the crawl.
def start_requests(self):
    request = Request(
        "url",
        callback=self.parse_search,
        meta={'xpaths': self.xpaths},
        headers={
            "User-Agent": "Googlebot-Image/1.0"
        }
    )
    return [request]
Your code works perfectly (see my code). But some middleware on your side may affect your User-Agent header:
import scrapy


class UserAgentSpider(scrapy.Spider):
    name = 'useragent_spider'

    user_agents = [
        {'title': 'Galaxy S9', 'value': 'Mozilla/5.0 (Linux; Android 8.0.0; SM-G960F Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36'},
        {'title': 'iPhone', 'value': 'Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/69.0.3497.105 Mobile/15E148 Safari/605.1'},
        {'title': 'Edge', 'value': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246'},
    ]

    def start_requests(self):
        for user_agent in self.user_agents:
            yield scrapy.Request(
                url="https://www.myip.com/",
                headers={
                    'user-agent': user_agent['value'],
                },
                cb_kwargs={
                    'user_agent': user_agent['title']
                },
                callback=self.parse,
                dont_filter=True,
            )

    def parse(self, response, user_agent):
        # Save each response under the name of the user agent that fetched it.
        with open(f"Samples/{user_agent}.htm", 'wb') as f:
            f.write(response.body)
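If the header still comes out wrong in your crawl, check your project configuration for anything that rewrites it. A minimal sketch of the settings.py entries worth reviewing (the middleware path is a hypothetical example of a custom middleware you might have):

# settings.py

# A project-wide USER_AGENT does not override per-request headers
# (headers set on the Request itself take precedence), but a custom
# downloader middleware can rewrite them after the request is built.
USER_AGENT = 'my-project (+http://www.example.com)'

DOWNLOADER_MIDDLEWARES = {
    # 'myproject.middlewares.CustomUserAgentMiddleware': 543,  # hypothetical
}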
I am trying to log in to PSN (https://www.playstation.com/en-in/sign-in-and-connect/) using the Python requests module and the API endpoint I found in the browser's developer tools (Inspect Element). Below is the code:
import requests

login_data = {
    'password': "mypasswordhere",
    'username': "myemailhere",
}
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
}
with requests.Session() as s1:
    url = "https://auth.api.sonyentertainmentnetwork.com/2.0/oauth/token"
    r = s1.post(url, data=login_data, headers=header)
    print(r.text)
With this, I got the following response from the server:
{"error":"invalid_client","error_description":"Bad client credentials","error_code":4102,"docs":"https://auth.api.sonyentertainmentnetwork.com/docs/","parameters":[]}
Is there an alternative method to log in to the PSN network, preferably via the API rather than Selenium? My goal is to log in with my credentials and change my password, but I am stuck at the login step.
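For what it's worth, the invalid_client / "Bad client credentials" error is the standard OAuth2 complaint that the token request lacks client authentication (a client_id/client_secret identifying the application), not that the username or password is wrong. A generic OAuth2 password-grant request looks roughly like the sketch below; the client credentials and grant parameters are placeholders, since the values PSN actually expects are not shown here:

import requests

# Placeholder values; the real client credentials and grant type used by the
# PSN token endpoint are not known from the question.
token_url = "https://auth.api.sonyentertainmentnetwork.com/2.0/oauth/token"
payload = {
    'grant_type': 'password',
    'username': "myemailhere",
    'password': "mypasswordhere",
}

with requests.Session() as s1:
    r = s1.post(
        token_url,
        data=payload,
        # In OAuth2, client credentials are commonly sent as HTTP Basic auth.
        auth=('client_id_placeholder', 'client_secret_placeholder'),
        headers={'User-Agent': 'Mozilla/5.0'},
    )
    print(r.status_code, r.text)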
This is my Scrapy code. I also send the same request with Postman; no matter how many times I send it there, I receive the data I want. But when I send it with Scrapy, the response is always 'too frequently, forbid visit'. There may be many causes, but I want to know what the possible causes are.
import scrapy
from scrapy import FormRequest


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false']

    def start_requests(self):
        yield FormRequest(
            self.start_urls[0],
            callback=self.parse,
        )

    def parse(self, response):
        print(response.text)
You need to show the website that you are an actual user, not a bot. Try sending a user-agent header:
yield FormRequest(
    url=self.start_urls[0],
    callback=self.parse,
    headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'},
)
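Since the message itself complains about request frequency, slowing the spider down may also help. A minimal sketch of throttling settings, assuming nothing else in the project overrides them:

# settings.py (or custom_settings on the spider)
DOWNLOAD_DELAY = 3            # wait a few seconds between requests
AUTOTHROTTLE_ENABLED = True   # back off automatically when the site slows down
COOKIES_ENABLED = True        # keep session cookies between requests (the default)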
start_urls = ['https://www.qichacha.com/search?key=北京证大向上']
def parse(self, response):
    # the start_url is a list page; company_url is a detail-page URL taken from that list page
    yield scrapy.Request(url=company_url, meta={"infos": info}, callback=self.parse_basic_info, dont_filter=True)
When I request the company_url, the response is 405.
But if I use
response = requests.get(company_url, headers=headers)
print(response.status_code)
print(response.text)
then the response is 200 and I can parse the HTML page. Or, with
start_urls = [company_url]

def parse(self, response):
    print(response.status)
    print(response.text)
the response is also 200. I don't know why I get 405.
When the response is 405, the printed request looks like this:
{'_encoding': 'utf-8', 'method': 'GET', '_url': 'https://www.qichacha.com/firm_b18bf42ee07d7961e91a0edaf1649287.html', '_body': b'', 'priority': 0, 'callback': None, 'errback': None, 'cookies': {}, 'headers': {b'User-Agent': [b'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20']}, 'dont_filter': False, '_meta': {'depth': 1}, 'flags': []}
What's wrong with it?
It seems that the page blocks requests that use Scrapy's default user-agent string. Running the spider like this works for me:
scrapy runspider -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36" spider.py
Alternatively, you can set USER_AGENT in your project's settings.py. Or, use something like scrapy-fake-useragent to handle this automatically.
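For a more permanent fix than the command-line flag, here is a minimal sketch of the equivalent settings, either project-wide in settings.py or per spider (the spider class name is hypothetical):

# settings.py
USER_AGENT = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36')

# or, per spider:
import scrapy

class QichachaSpider(scrapy.Spider):   # hypothetical spider name
    name = 'qichacha'
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'),
    }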