Start URL headers changed after converting Scrapy to scrapy-redis - scrapy

I have a Scrapy project and I want to convert it to scrapy-redis.
The main spider file is below:
import scrapy
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'ScrapyBot'
    redis_key = 'myspider:start_urls'
    start_urls = []
    my_header = {
        "Host": "jd.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0",
    }

    def start_requests(self):
        for url in MySpider.start_urls:
            yield scrapy.Request(
                url=url,
                headers=MySpider.my_header,
                callback=self.parse,
            )
The request works fine in plain Scrapy, but after adding the scrapy-redis part, the headers of the start requests (captured with Fiddler) changed to the defaults:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en
User-Agent: Scrapy/1.6.0 (+https://scrapy.org)
Accept-Encoding: gzip,deflate
which causes the server to return a 403 error. How can I fix the headers for the start URLs in scrapy-redis?

You can set default headers in the settings.py file this way:
DEFAULT_REQUEST_HEADERS = {
    "Host": "jd.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0",
}
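Alternatively, the root cause is that scrapy-redis builds start requests from the redis key itself instead of calling start_requests(), so your custom headers never get attached. A minimal sketch, assuming scrapy-redis's RedisSpider and its make_request_from_data() hook, that attaches the headers to every URL popped from redis:
import scrapy
from scrapy_redis.spiders import RedisSpider
from scrapy_redis.utils import bytes_to_str

class MySpider(RedisSpider):
    name = 'ScrapyBot'
    redis_key = 'myspider:start_urls'
    my_header = {
        "Host": "jd.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0",
    }

    def make_request_from_data(self, data):
        # data is the raw bytes value popped from the redis list
        url = bytes_to_str(data, self.redis_encoding)
        return scrapy.Request(url, headers=self.my_header, dont_filter=True)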

Related

Reformat HTTP request from JSON file to raw

I have multiple files, each containing an HTTP request in JSON format, and every file's content looks like the following:
{
    "http://testphp.vulnweb.com/search.php": {
        "headers": {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "en-US,en;q=0.5",
            "Connection": "close",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:83.0) Gecko/20100101 Firefox/83.0"
        },
        "method": "POST",
        "params": [
            "goButton",
            "searchFor"
        ]
    }
}
The desired output is a raw request like the following:
POST /search.php HTTP/1.1
Host: testphp.vulnweb.com
Content-Length: 23
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.30
Connection: close

searchFor=a&goButton=go
By the way, the line terminator is \r\n, as that is the default format of a raw HTTP request, and there is an extra \r\n on the line before the parameters.
I really don't know whether awk can handle this, but if it can, how hard would it be?
Please also recommend any online solution if I have to go that way, because I'm struggling to find a way to reformat JSON into a raw request.
Thanks
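As a starting point, here is a short sketch in Python rather than awk (awk has no real JSON parser), assuming each file follows the layout above; the parameter values are placeholders, since the JSON stores only the parameter names:
import json
from urllib.parse import urlsplit, urlencode

def to_raw_request(path):
    with open(path) as f:
        data = json.load(f)
    for url, req in data.items():
        parts = urlsplit(url)
        # Placeholder values: the JSON lists only the parameter names.
        body = urlencode({name: "a" for name in req.get("params", [])})
        lines = [
            f'{req["method"]} {parts.path or "/"} HTTP/1.1',
            f'Host: {parts.netloc}',
            f'Content-Length: {len(body)}',
            'Content-Type: application/x-www-form-urlencoded',
        ]
        lines += [f'{k}: {v}' for k, v in req.get("headers", {}).items()]
        # Blank line (the extra \r\n) separates the headers from the body.
        return '\r\n'.join(lines) + '\r\n\r\n' + body

print(to_raw_request('request.json'))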

How do I deserialize to an object with the default formatter in a .NET Core 5 Web API when supporting multipart/form-data?

If I make a typical controller and action with a parameter that comes from the body, the Web API will deserialize the body content into an instance of the class type using the appropriate deserializer, based on the Content-Type header:
[ApiController]
[Route("[controller]")]
public class TestController : ControllerBase
{
    [System.Runtime.Serialization.DataContract]
    public class MyObject
    {
        [System.Runtime.Serialization.DataMember]
        public string Text { get; set; }
    }

    [HttpPost()]
    public void Test([FromBody] MyObject value)
    {
    }
}
If I make a request to it using JSON, the request looks like this:
POST https://localhost:44380/Message HTTP/1.1
Host: localhost:44380
Connection: keep-alive
Content-Length: 22
accept: */*
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36
Content-Type: application/json
Referer: https://localhost:44380/swagger/index.html
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
{"Text":"123 abc 123"}
Or like this if using XML (my Web API is set up to support both JSON and XML):
POST https://localhost:44380/Message HTTP/1.1
Host: localhost:44380
Connection: keep-alive
Content-Length: 82
accept: */*
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36
Content-Type: application/*+xml
Referer: https://localhost:44380/swagger/index.html
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
<?xml version="1.0" encoding="UTF-8"?>
<MyObject>
<Text>string</Text>
</MyObject>
If I change my Web API to use multipart MIME, I can't rely on the automatic deserialization of the content body. As I loop through each part, how would I deserialize each section's content based on its Content-Type header and the configured formatters of the Web API? I could hard-code a specific serializer, but I'd like to use what's configured.
var contentType = MediaTypeHeaderValue.Parse(Request.ContentType);
var boundary = HeaderUtilities.RemoveQuotes(contentType.Boundary).Value;
MultipartReader reader = new MultipartReader(boundary, HttpContext.Request.Body);
string fileName = null, name = null;

var section = await reader.ReadNextSectionAsync();
while (section != null)
{
    ContentDispositionHeaderValue contentDispositionHeaderValue = section.GetContentDispositionHeader();
    Stream strData = null;
    if (contentDispositionHeaderValue.IsFileDisposition())
    {
        FileMultipartSection fileSection = section.AsFileSection();
        strData = fileSection.FileStream;
        fileName = fileSection.FileName;
        name = fileSection.Name;
    }
    else if (contentDispositionHeaderValue.IsFormDisposition())
    {
        FormMultipartSection formSection = section.AsFormDataSection();
        name = formSection.Name;
        strData = section.Body;
    }
    // How to deserialize strData to MyObject using the Web API's configured formatters?
    section = await reader.ReadNextSectionAsync();
}

How to change the header just for a specific request in a Scrapy spider?

I am trying to build a web crawler using Scrapy. I want to change the user agent for a single request in the spider. I tried the code below, but the user agent is not being updated during the crawl.
def start_requests(self):
    request = Request(
        "url",
        callback=self.parse_search,
        meta={'xpaths': self.xpaths},
        headers={
            "User-Agent": "Googlebot-Image/1.0"
        }
    )
    return [request]
Your code works perfectly (see my code below), but some middleware on your side may be affecting your User-Agent header:
class UserAgentSpider(scrapy.Spider):
    name = 'useragent_spider'
    user_agents = [
        {'title': 'Galaxy S9', 'value': 'Mozilla/5.0 (Linux; Android 8.0.0; SM-G960F Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36'},
        {'title': 'iPhone', 'value': 'Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/69.0.3497.105 Mobile/15E148 Safari/605.1'},
        {'title': 'Edge', 'value': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246'},
    ]

    def start_requests(self):
        for user_agent in self.user_agents:
            yield scrapy.Request(
                url="https://www.myip.com/",
                headers={
                    'user-agent': user_agent['value'],
                },
                cb_kwargs={
                    'user_agent': user_agent['title']
                },
                callback=self.parse,
                dont_filter=True,
            )

    def parse(self, response, user_agent):
        with open(f"Samples/{user_agent}.htm", 'wb') as f:
            f.write(response.body)
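For illustration, a downloader middleware like the hypothetical one below would silently overwrite the per-request header and produce exactly this symptom; check the DOWNLOADER_MIDDLEWARES setting in your project for anything similar:
# Hypothetical example of a middleware that clobbers per-request headers.
class ForcedUserAgentMiddleware:
    def process_request(self, request, spider):
        # Overwrites whatever User-Agent was set in start_requests()
        request.headers['User-Agent'] = 'MyCrawler/1.0'

# To rule it out, disable suspect entries in settings.py:
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.ForcedUserAgentMiddleware': None,
# }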

I send a POST request with Scrapy and the response data is 'too frequently', but when I send the same request with Postman the response is what I want

This is the code of my Scrapy spider. I also send the same request with Postman; no matter how many times I send it there, I receive the data I want. But when I send it with Scrapy, the response is always 'too frequently, forbid visit'. There may be many causes, but I want to know what the possible ones are.
import scrapy
from scrapy.http import FormRequest

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false']

    def start_requests(self):
        yield FormRequest(
            self.start_urls[0],
            callback=self.parse,
        )

    def parse(self, response):
        print(response.text)
You need to show the website that you are an actual user, not a bot. Try sending a user agent in the headers:
yield FormRequest(
    url=self.start_urls[0],
    callback=self.parse,
    headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'},
)
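If a user agent alone is not enough, a further sketch (assuming the site also checks Referer and AJAX markers, headers Postman may have been sending for you) is to copy more of the browser's headers into the request; the Referer value here is a placeholder:
yield FormRequest(
    url=self.start_urls[0],
    callback=self.parse,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36',
        'Referer': 'https://www.lagou.com/',  # placeholder: use the value the browser actually sends
        'X-Requested-With': 'XMLHttpRequest',
    },
)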

Scrapy | How to get a response from a request without urllib?

I believe there is a better way to get a response using scrapy.Request than what I do:
...
import urllib.request
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
...

class MatchResultsSpider(scrapy.Spider):
    name = 'match_results'
    allowed_domains = ['site.com']
    start_urls = ['url.com']

    def get_detail_page_data(self, detail_url):
        req = urllib.request.Request(
            detail_url,
            data=None,
            headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
                'Accept': 'application/json, text/javascript, */*; q=0.01',
                'Referer': 'site.com',
            }
        )
        page = urllib.request.urlopen(req)
        response = HtmlResponse(url=detail_url, body=page.read())
        target = Selector(response=response)
        return target.xpath('//dd[@data-first_name]/text()').extract_first()
I get all the information inside the parse function, but in one place I need to get a little piece of data from inside a detail page.
# Lineups
lineup_team_tables = lineups_container.xpath('.//tbody')
for i, table in enumerate(lineup_team_tables):
    # lineup players
    line_up = []
    lineup_players = table.xpath('./tr[not(contains(string(), "Coach"))]')
    for lineup_player in lineup_players:
        line_up_entries = {}
        lineup_player_url = lineup_player.xpath('.//a/@href').extract_first()
        line_up_entries['player_id'] = get_id(lineup_player_url)
        line_up_entries['jersey_num'] = lineup_player.xpath('./td[@class="shirtnumber"]/text()').extract_first()
        abs_lineup_player_url = response.urljoin(lineup_player_url)
        line_up_entries['position_id_detail'] = self.get_detail_page_data(abs_lineup_player_url)
        line_up.append(line_up_entries)
    # team_lineup['line_up'] = line_up
    self.write_to_scuard(i, 'line_up', line_up)
Can I get data from another page using scrapy.Request(detail_url, callback_func)?
Thanks for your help!
Too much extra code. Use the standard Scrapy parsing scheme:
class ********(scrapy.Spider):
    name = '*******'
    domain = '****'
    allowed_domains = ['****']
    start_urls = ['https://******']
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
        'DEFAULT_REQUEST_HEADERS': {
            'ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'ACCEPT_ENCODING': 'gzip, deflate, br',
            'ACCEPT_LANGUAGE': 'en-US,en;q=0.9',
            'CONNECTION': 'keep-alive',
        }
    }
    def parse(self, response):
        # You already have the response HTML for start_urls = ['https://******'] here.
        yield scrapy.Request(url, callback=self.parse_details)
Then you can parse further (nested) and return back to the parse callback:
    def parse_details(self, response):
        ************
        yield scrapy.Request(url_2, callback=self.parse)
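Applied to the original spider, a minimal sketch (the XPaths are carried over from the question; parse_detail and the entry dict are assumed names) that replaces the urllib call with a chained scrapy.Request, passing the partially built item along via cb_kwargs:
def parse(self, response):
    for lineup_player in response.xpath('.//tbody/tr'):
        entry = {
            'jersey_num': lineup_player.xpath('./td[@class="shirtnumber"]/text()').get(),
        }
        detail_url = response.urljoin(lineup_player.xpath('.//a/@href').get())
        # Fetch the detail page through Scrapy itself and carry `entry` along.
        yield scrapy.Request(detail_url, callback=self.parse_detail,
                             cb_kwargs={'entry': entry})

def parse_detail(self, response, entry):
    # Pick up the little piece of data from the detail page, then emit the item.
    entry['position_id_detail'] = response.xpath('//dd[@data-first_name]/text()').get()
    yield entry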