Why is jsessionid appended to the URL when calling extract()? - scrapy

Why is jsessionid appended to the URL when calling extract()?
response.css('.pan-con dl > dt > h2 > a::attr(href)').extract()
['/xf/jintianxincheng2qi;jsessionid=TBg8OmdAj97Als+x9wVX2UOu.undefined', '/xf/lufuzangyuexiaoqu;jsessionid=TBg8OmdAj97Als+x9wVX2UOu.undefined', '/xf/dongchengzhixingguandi4qi;jsessionid=TBg8OmdAj97Als+x9wVX2UOu.undefined', '/xf/changjianxingyuecheng2qi;jsessionid=TBg8OmdAj97Als+x9wVX2UOu.undefined', '/xf/huaxiyishucunxidou;jsessionid=TBg8OmdAj97Als+x9wVX2UOu.undefined', '/xf/wanjingfeng2qi;jsessionid=TBg8OmdAj97Als+x9wVX2UOu.undefined', '/xf/gongyuanshijia;jse
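If the site is a Java servlet application, the ;jsessionid=... part is most likely embedded into the href attributes by server-side URL rewriting (used when the client has no session cookie), so extract() simply returns the attribute as rendered. A minimal sketch for stripping it, reusing the selector from the question:
# Drop the ;jsessionid=... path parameter from each extracted href.
hrefs = response.css('.pan-con dl > dt > h2 > a::attr(href)').extract()
clean_hrefs = [href.split(';jsessionid=')[0] for href in hrefs]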

Related

Getting header with selenium-wire python

I need to get a header from a request with selenium-wire. I'm taking the cookie from the header:
result_cookies = ""
for request in self.browser.iter_requests():
    result = request.headers.get("cookie")
    if result:
        if len(result_cookies) < len(result):
            result_cookies = result
But there I get different cookies, some with userCategory=PU and some with userCategory=LIM. How can I get the cookie where userCategory is LIM?
Part of the cookie:
g2usersessionid=e792fb427388bfbfd2f6d8c82163d763; G2JSESSIONID=A27799E20A8A42058502875CD1617C61-n1; userLang=en; visid_incap_242093=ivG0wNBcTX2qbGGfef/Kzsp15WAAAAAAQUIPAAAAAAALk1ZNmpA0m0hbUNh68t/9; incap_ses_260_242093=g/XKTK5T3TznIlAEpLSbA8p15WAAAAAALxw6bB7fGHsf1fHHG2AGPg==; userCategory=LIM; copartTimezonePref=%7B%22displayStr%22%3A%22CEST%22%2C%22offset%22%3A2%2C%22dst%22%3Atrue%2C%22windowsTz%22%3A%22Europe%2FBerlin%22%7D; timezone=Europe%2FBerlin;
This is how I make use of selenium-wire to get the cookie from the headers:
for request in self.driver.requests:
    if request.response:
        print(
            request.url,
            request.response.status_code,
            request.response.headers['Content-Type'],
            request.response.headers['set-cookie']
        )
        if "ivc_id" in request.response.headers['set-cookie']:
            cookie = request.response.headers['set-cookie']
            print(cookie)
Is this what you are looking for?
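If the goal is specifically the cookie that carries userCategory=LIM, a plain substring check on each request's cookie header should also work; this is a sketch adapted from the asker's own loop rather than part of the original answer:
lim_cookie = None
for request in self.driver.iter_requests():
    cookie_header = request.headers.get("cookie")
    # Keep the first cookie header that contains the LIM category.
    if cookie_header and "userCategory=LIM" in cookie_header:
        lim_cookie = cookie_header
        break
print(lim_cookie)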

How to navigate and validate through all the pages of an API response

I have a scenario where the API returns the response in pages if the payload has a lot of data.
Request:
Background:
  * url url
  * call read('classpath:examples/common.feature')
  And header accesstoken = accessToken
  And header accept = '*/*'
  And header Accept-Encoding = 'gzip, deflate, br'

Scenario: Get Scores
  * param start = '2020-07-01'
  Given path '/scores'
  When method Get
  Then status 200
  * def totalPages = response.totalPages
  * def response = {"requestId": "6a4287f35112",
                    "timestampMs": 1595228005245,
                    "totalMs": 51,
                    "page": 1,
                    "totalPages": 100,
                    "data": [.......]}
After this I get the total number of pages, and I need to navigate through all the pages by sending the same request with an additional * param page = #page_number, and validate that the response is 200. page_number has to be iterated from 2 to 100.
I thought of using a Karate loop, or calling a feature file and building dynamic data for a data-driven feature, but I am not sure how to proceed.
Please advise.
I think the easiest option is to write a second feature file and call it in a loop.
* def totalPages = 10
* def pages = karate.repeat(totalPages, function(i){ return { page: i } })
* call read('second.feature') pages
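The contents of second.feature are not shown in the answer; a minimal sketch of what it might contain, assuming the same '/scores' endpoint and start parameter from the question (note that karate.repeat passes a zero-based index, so the generator may need something like { page: i + 2 } to cover pages 2 to 100):
Feature: fetch and validate one page of scores

Scenario:
  Given path '/scores'
  And param start = '2020-07-01'
  And param page = page
  When method get
  Then status 200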

Update response.body in scrapy (without reload)

I use scrapy and selenium for crawling. My site uses AJAX for pagination: the URL does not change, so response.body does not change either. I want to click with selenium (for pagination), get self.driver.page_source and use it instead of response.body.
So I wrote this code:
res = scrapy.http.TextResponse(url=self.driver.current_url,
                               body=self.driver.page_source,
                               encoding='utf-8')
print(str(res))  # nothing to print!
for quote in res.css("#ctl00_ContentPlaceHolder1_Grd_Dr_DXMainTable > tr.dxgvDataRow_Office2003Blue"):
    i = i + 1
    item = dict()
    item['id'] = int(quote.css("td.dxgv:nth-child(1)::text").extract_first())
And there is no error!
You can replace the body of the original response in scrapy by using the response.replace() method:
def parse(self, response):
    response = response.replace(body=driver.page_source)
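A slightly fuller sketch of how this could fit the asker's spider (the pagination click is assumed to have already been performed on self.driver before parse uses the page source):
def parse(self, response):
    # Replace the original body with the AJAX-updated page from selenium.
    response = response.replace(body=self.driver.page_source)
    rows = response.css("#ctl00_ContentPlaceHolder1_Grd_Dr_DXMainTable > tr.dxgvDataRow_Office2003Blue")
    for quote in rows:
        # Selector taken from the question.
        yield {'id': int(quote.css("td.dxgv:nth-child(1)::text").extract_first())}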

Test file upload in Flask

I have a Flask controller (POST) to upload a file:
f = request.files['external_data']
filename = secure_filename(f.filename)
f.save(filename)
I have tried to test it:
handle = open(filepath, 'rb')
fs = FileStorage(stream=handle, filename=filename, name='external_data')
payload['files'] = fs
url = '/my/upload/url'
test_client.post(url, data=payload)
But in the controller request.files contains:
ImmutableMultiDict: ImmutableMultiDict([('files', <FileStorage: u'myfile.png' ('image/png')>)])
My tests pass if I replace 'external_data' with 'files'.
How is it possible to create a Flask test request so that request.files contains 'external_data'?
You're not showing where payload comes from, which is the issue.
payload should probably be a .copy() of a dict() version of your original object.
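Not from the original answer, but for reference: the Werkzeug test client uses the key of the data dict as the form field name, so posting the file under an 'external_data' key should make it show up as request.files['external_data']. A minimal sketch, reusing filepath and test_client from the question:
with open(filepath, 'rb') as handle:
    # The dict key ('external_data') becomes the name of the uploaded field.
    payload = {'external_data': (handle, 'myfile.png')}
    response = test_client.post('/my/upload/url', data=payload,
                                content_type='multipart/form-data')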

Scrapy FormRequest return 400 error code

I am trying to scrape the following website, in which the pagination is through AJAX requests.
http://studiegids.uva.nl/xmlpages/page/2014-2015/zoek-vak
I am sending a FormRequest to access the different pages; however, I am getting the following error:
Retrying <POST http://studiegids.uva.nl/xmlpages/plspub/uva_search.courses_pls> (failed 1 times): 400 Bad Request
I am not able to understand what is wrong. Following is the code:
class Spider(BaseSpider):
    name = "zoek"
    allowed_domains = ["studiegids.uva.nl"]
    start_urls = ["http://studiegids.uva.nl/xmlpages/page/2014-2015/zoek-vak"]

    def parse(self, response):
        base_url = "http://studiegids.uva.nl/xmlpages/page/2014-2015/zoek-vak"
        for i in range(1, 10):
            data = {'p_fetch_size': unicode(20),
                    'p_page:': unicode(i),
                    'p_searchpagetype': u'courses',
                    'p_site_lang': u'nl',
                    'p_strip': u'/2014-2015',
                    'p_ctxparam': u'/xmlpages/page/2014-2015/',
                    'p_rsrcpath': u'/xmlpages/resources/TXP/studiegidswebsite/'}
            yield FormRequest.from_response(response,
                                            formdata=data,
                                            callback=self.fetch_details,
                                            dont_click=True)
            # yield FormRequest(base_url,
            #                   formdata=data,
            #                   callback=self.fetch_details)

    def fetch_details(self, response):
        # print response.body
        hxs = HtmlXPathSelector(response)
        item = ZoekItem()
        Studiegidsnummer = hxs.select("//div[@class='item-info']//tr[1]/td[2]/p/text()")
        Studielast = hxs.select("//div[@class='item-info']//tr[2]/td[2]/p/text()")
        Voertaal = hxs.select("//div[@class='item-info']//tr[3]/td[2]/p/text()")
        Ingangseis = hxs.select("//div[@class='item-info']//tr[4]/td[2]/p/text()")
        Studiejaar = hxs.select("//div[@class='item-info']//tr[5]/td[2]/p/text()")
        Onderwijsinstituut = hxs.select("//div[@class='item-info']//tr[6]/td[2]/p/text()")
        for i in range(20):
            item['Studiegidsnummer'] = Studiegidsnummer
            item['Studielast'] = Studielast
            item['Voertaal'] = Voertaal
            yield item
Also try checking the headers using Firebug.
400 Bad Request usually means that your request does not fully match the expected request format. Common causes include missing or invalid cookies, headers or parameters.
On your web browser, open the Network tab of the Developer Tools and trigger the request. When you see the request in the Network tab, inspect it fully (parameters, headers, etc.). Try to match such a request in your code.
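As an illustration only (the exact headers this site expects are not known here), any missing headers found in the Network tab can be supplied directly on the FormRequest:
# Sketch: mirror the browser's AJAX request once its headers are known.
# The header values below are typical placeholders, not confirmed values.
yield FormRequest(
    "http://studiegids.uva.nl/xmlpages/plspub/uva_search.courses_pls",
    formdata=data,
    headers={
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "http://studiegids.uva.nl/xmlpages/page/2014-2015/zoek-vak",
    },
    callback=self.fetch_details,
)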