I'm trying to change some headers, but nothing is working:
var casper = require('casper').create({
    stepTimeout: 15000,
    verbose: false,
    logLevel: 'error',
    pageSettings: {
        loadImages: true,
        loadPlugins: true,
        userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.364',
        customHeaders: {
            Connection: 'keep-alive',
            Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
        }
    }
});
I also tried:
phantom.page.customHeaders = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive"
};
And for a single request:
this.open('http://localhost/post.php', {
    method: 'post',
    headers: { 'Accept': 'application/json' }
});
None of them are working. Am I doing something wrong?
Thanks
I cannot reproduce your problem; it seems to work for me. Maybe you have an issue with a redirection somewhere, like the one discussed here.
May I suggest you do what this user did and try the following code?
casper.on('started', function () {
    this.page.customHeaders = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive"
    };
});
Related
I am new to Scrapy. I am getting None instead of the item. Here is my code:
class IndiaSpider(scrapy.Spider):
    name = 'espace'
    allowed_domains = ['worldwide.espacenet.com']
    search_value = 'laptop'
    start_urls = [f'https://worldwide.espacenet.com/patent/search?q={search_value}']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'

    def request_header(self):
        yield scrapy.Request(url=self.start_urls, callback=self.parse, headers={'User-Agent': self.user_agent})

    def parse(self, response):
        title = response.xpath("//span[@class='h2--2VrrSjFb item__content--title--dYTuyzV6']/text()").extract_first()
        yield {
            'title': title
        }
I am getting
2023-01-17 15:58:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://worldwide.espacenet.com/patent/search?q=laptop> (referer: None)
2023-01-17 15:58:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://worldwide.espacenet.com/patent/search?q=laptop>
{'title': None}
2023-01-17 15:58:54 [scrapy.core.engine] INFO: Closing spider (finished)
Can anyone help me?
See the comments in the code. Read this, and this.
Basically, when data is loaded with JavaScript, you'll want to get it from the API. If you open devtools in your browser, you can see where the data is loaded from, recreate that request with Scrapy, and then parse the data from the JSON response.
Lose the request_header method; it's not one of the Spider's methods and you never call it. You probably wanted start_requests (see the sketch below).
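Just as a sketch of that one fix (it uses only your own spider's attributes, and it still won't fix the None title, because the titles are rendered with JavaScript):

def start_requests(self):
    # start_urls is a list, so iterate over it instead of passing it as a single url
    for url in self.start_urls:
        yield scrapy.Request(url=url, callback=self.parse,
                             headers={'User-Agent': self.user_agent})

The working version, going through the API instead: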
import json
import scrapy

class IndiaSpider(scrapy.Spider):
    name = 'espace'
    allowed_domains = ['worldwide.espacenet.com']
    search_value = 'laptop'

    # browser devtools -> network tab -> JSON url -> headers
    headers = {
        "Accept": "application/json,application/i18n+xml",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.5",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Content-Type": "application/json",
        "DNT": "1",
        "EPO-Trace-Id": "YOUR ID",  # <------ copy it from your browser
        "Host": "worldwide.espacenet.com",
        "Origin": "https://worldwide.espacenet.com",
        "Pragma": "no-cache",
        "Referer": "https://worldwide.espacenet.com/patent/search?q=laptop",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
        "X-EPO-PQL-Profile": "cpci"
    }
    api_url = f'https://worldwide.espacenet.com/3.2/rest-services/search?lang=en,de,fr&q={search_value}&qlang=cql&'

    def start_requests(self):
        # browser devtools -> network tab -> JSON url -> Request
        payload = {
            "filters": {
                "publications.patent": [
                    {
                        "value": ["true"]
                    }
                ]
            },
            "query": {
                "fields": [
                    "publications.ti_*",
                    "publications.abs_*",
                    "publications.pn_docdb",
                    "publications.in",
                    "publications.inc",
                    "publications.pa",
                    "publications.pac",
                    "publications.pd",
                    "publications.pr_docdb",
                    "publications.app_fdate.untouched",
                    "publications.ipc_ic",
                    "publications.ipc_icci",
                    "publications.ipc_iccn",
                    "publications.ipc_icai",
                    "publications.ipc_ican",
                    "publications.ci_cpci",
                    "publications.ca_cpci",
                    "publications.cl_cpci",
                    "biblio:pa;pa_orig;pa_unstd;in;in_orig;in_unstd;pac;inc;pd;pn_docdb;allKindCodes;",
                    "oprid_full.untouched",
                    "opubd_full.untouched"
                ],
                "from": 0,
                "highlighting": [
                    {
                        "field": "publications.ti_en",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    },
                    {
                        "field": "publications.abs_en",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    },
                    {
                        "field": "publications.ti_de",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    },
                    {
                        "field": "publications.abs_de",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    },
                    {
                        "field": "publications.ti_fr",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    },
                    {
                        "field": "publications.abs_fr",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    },
                    {
                        "field": "publications.pn_docdb",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    },
                    {
                        "field": "publications.pa",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    }
                ],
                "size": 20
            },
            "widgets": {}
        }
        yield scrapy.Request(url=self.api_url, headers=self.headers, method='POST', body=json.dumps(payload))

    def parse(self, response):
        # browser devtools -> network tab -> JSON url -> Response
        json_data = response.json()
        if json_data:
            for hit in json_data['hits']:
                if 'publications.ti_en' in hit['hits'][0]['fields']:
                    title = hit['hits'][0]['fields']['publications.ti_en']
                    yield {'title': title}
Output:
{'title': ['METHOD AND DEVICE FOR CHECKING THE DETERMINATION OF THE POSITION OF A MOBILE STATION CARRIED OUT BY A RADIO COMMUNICATION SYSTEM']}
{'title': ['Laptop']}
{'title': ['PRESENTATION LAPTOP']}
{'title': ['LAPTOP COMPUTER']}
{'title': ['Laptop comprises an integrated flat bed scanner containing a composite glass plate made from a mineral glass pane and a plastic layer']}
...
...
...
I'm having an annoying issue.
I want to send a request with Cypress, as follows:
cy.request({
    method: 'POST',
    url: 'http://myUrl.com/a/nice/path',
    body: {email: email},
});
And it fails:
The request we sent was:
Method: POST
URL: http://backend-openpay-pe.test.geopagos.com/api/registrations/send-code
Headers: {
    "Connection": "keep-alive",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/106.0.5249.119 Safari/537.36",
    "accept": "*/*",
    "accept-encoding": "gzip, deflate",
    "referer": "aNiceReferer"
}
Redirects: [
    "308: http://myUrl.com/a/nice/path"
]
And the status code from the response is 404.
When I try to set the host header as follows:
cy.request({
    method: 'POST',
    headers: {
        host: 'myUrl.com',
    },
    url: 'http://myUrl.com/a/nice/path',
})
The log from Cypress was:
The request we sent was:
Method: POST
URL: http://backend-openpay-pe.test.geopagos.com/api/registrations/send-code
Headers: {
    "Connection": "keep-alive",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/106.0.5249.119 Safari/537.36",
    "accept": "*/*",
    "accept-encoding": "gzip, deflate",
    "referer": "http://myUrl.com/a/nice/path"
}
Redirects: [
    "308: http://myUrl.com/a/nice/path"
]
And again, a response with 404. What I can see in the previous log is that Cypress is not sending the host header, even when I set it manually. I tried some other headers:
cy.request({
    method: 'POST',
    headers: {
        host: 'myUrl.com',
        banana: 'MonkeyLikesBananas',
    },
    url: 'http://myUrl.com/a/nice/path',
})
And the log was:
The request we sent was:
Method: POST
URL: http://backend-openpay-pe.test.geopagos.com/api/registrations/send-code
Headers: {
    "Connection": "keep-alive",
    "banana": "MonkeyLikesBananas",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/106.0.5249.119 Safari/537.36",
    "accept": "*/*",
    "accept-encoding": "gzip, deflate",
    "referer": "http://myUrl.com/a/nice/path"
}
Redirects: [
    "308: http://myUrl.com/a/nice/path"
]
Again, Cypress seems to send every header I set correctly, except for host, which for some reason is missing.
Cypress version:
$ npx cypress -v
Cypress package version: 10.9.0
Cypress binary version: 10.9.0
Electron version: 19.0.8
Bundled Node version: 16.14.2
Any help will be welcome.
Thanks!
I'm learning to build a scraper that scrapes search results but first needs to log in. I read the documentation and this article here. Unfortunately, I'm still stuck. My spider reports the following: <403 https://github.com/login>: HTTP status code is not handled or not allowed.
class GitHubSpider(CrawlSpider):
    name = "github"
    start_urls = [
        "https://github.com/search?p=1&q=React+Django&type=Users",
    ]
    rules = (
        Rule(
            LinkExtractor(restrict_css="a.mr-1"),
            callback="parse_engineer",
        ),
        Rule(LinkExtractor(restrict_css=".next_page")),
    )

    def start_requests(self):
        return [
            scrapy.FormRequest(
                url="https://github.com/login",
                formdata={
                    "login": "scrapy",
                    "password": "12345",
                },
                callback=self.parse,
            )
        ]

    def parse_engineer(self, response):
        yield {
            "username": response.css(".vcard-username::text").get().strip(),
        }
Edit: Answering @SuperUser's suggestion.
headers = {
    [...]
}

def start_requests(self):
    # Do I have access to response here?
    token = response.xpath('//form/input[@name="authenticity_token"]/@value').get()
    return [
        scrapy.FormRequest(
            url="https://github.com/login",
            formdata={
                "login": "scrapy",
                "password": "12345",
                "authenticity_token": token,  # <-------------
            },
            headers=self.headers,
            callback=self.parse,
        )
    ]
Go to settings.py and set ROBOTSTXT_OBEY = False.
Replace the default user agent with another one (see the settings.py sketch below).
Add the request headers from the requested page; you can get them with your browser's devtools.
Just know that they can block your IP, and also block your account.
I suggest you use PyGithub instead.
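For the first two points, a minimal settings.py sketch (the values here are only illustrative, not an official recommendation):

# settings.py -- illustrative values
ROBOTSTXT_OBEY = False  # stop honoring robots.txt
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"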
Edit:
The request headers:
class GitHubSpider(CrawlSpider):
    name = "github"
    start_urls = [
        "https://github.com/search?p=1&q=React+Django&type=Users",
    ]
    rules = (
        Rule(
            LinkExtractor(restrict_css="a.mr-1"),
            callback="parse_engineer",
        ),
        Rule(LinkExtractor(restrict_css=".next_page")),
    )
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.5",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "DNT": "1",
        "Host": "github.com",
        "Pragma": "no-cache",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Sec-GPC": "1",
        "TE": "trailers",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36",
    }

    def start_requests(self):
        return [
            scrapy.FormRequest(
                url="https://github.com/login",
                formdata={
                    "login": "scrapy",
                    "password": "12345",
                },
                headers=self.headers,
                callback=self.parse,
            )
        ]

    def parse_engineer(self, response):
        yield {
            "username": response.css(".vcard-username::text").get().strip(),
        }
Also notice that you need to get the CSRF token:
token = response.xpath('//form/input[@name="authenticity_token"]/@value').get()
Pass the token with the username and password.
formdata={
    "login": ...,
    "password": ...,
    "authenticity_token": token,
}
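As for the question in the edit ("do I have access to response here?"): no, not inside start_requests itself. You have to request the login page first and read the token from the response in a callback. A minimal sketch of that flow, reusing the headers attribute from above (credentials are the placeholder ones from the question):

def start_requests(self):
    # The login form (and its CSRF token) is only available once the page is fetched.
    yield scrapy.Request(
        url="https://github.com/login",
        headers=self.headers,
        callback=self.login,
    )

def login(self, response):
    # Now the token can be extracted from the login page's response.
    token = response.xpath('//form/input[@name="authenticity_token"]/@value').get()
    yield scrapy.FormRequest(
        url="https://github.com/login",
        formdata={
            "login": "scrapy",
            "password": "12345",
            "authenticity_token": token,
        },
        headers=self.headers,
        callback=self.parse,
    )

Scrapy's FormRequest.from_response(response, formdata={...}) can also copy the hidden form fields, including the token, for you.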
I'm trying to POST the following but I keep getting an error:
"http: error: argument REQUEST_ITEM: "with" is not a valid value"
http POST https://someurl.com fields:='{\"example-api-identifier\":\"String with spaces\"}' Token:randomnumbers
How do I escape these spaces? I'm assuming that's the issue here?
I don't personally know about PowerShell, but HTTPie should be fine with spaces without needing the := syntax:
$ http POST http://httpbin.org/post example-api-identifier="String with spaces"
yields
HTTP/1.1 200 OK
Access-Control-Allow-Credentials: true
Access-Control-Allow-Origin: *
Connection: close
Content-Length: 413
Content-Type: application/json
Date: Sat, 01 Feb 2020 00:25:41 GMT
Server: gunicorn/19.9.0
{
    "args": {},
    "data": "{\"example-api-identifier\": \"String with spaces\"}",
    "files": {},
    "form": {},
    "headers": {
        "Accept": "application/json, */*",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
        "Content-Length": "48",
        "Content-Type": "application/json",
        "Host": "httpbin.org",
        "User-Agent": "HTTPie/1.0.0"
    },
    "json": {
        "example-api-identifier": "String with spaces"
    },
    "origin": "127.0.0.1",
    "url": "http://httpbin.org/post"
}
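As far as I know, key=value sends the value as a JSON string, which is why no quoting gymnastics are needed; := is only required for raw JSON values, e.g. http POST http://httpbin.org/post count:=5 active:=true sends a number and a boolean.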
I am trying to pass a value to the PHP server side. My Store code is as follows:
Ext.define('MyApp.store.MyArrayStore', {
    extend: 'Ext.data.Store',
    requires: [
        'MyApp.model.MyMOD'
    ],
    config: {
        autoLoad: true,
        model: 'MyApp.model.MyMOD',
        storeId: 'MyArrayStore',
        proxy: {
            type: 'ajax',
            actionMethods: 'POST',
            url: 'http://localhost/mm/app/php/res.php',
            reader: {
                type: 'json'
            }
        },
        listeners: [
            {
                fn: 'onArraystoreBeforeLoad',
                event: 'beforeload'
            }
        ]
    },

    onArraystoreBeforeLoad: function(store, operation, eOpts) {
        this.proxy.extraParams.VALUES1 = "pass some name here";
    }
});
PHP Code
<?php
error_reporting(E_ERROR | E_PARSE);
require_once 'conn.php'; // contains the connection
$v = $_POST['VALUES1'];
echo json_encode($v);
?>
What gets returned is null, not the value I am passing from the store (which is "pass some name here").
How can I correct this?
UPDATE
Request URL:http://localhost/test/app/php/res.php?_dc=1373343459447
Request Method:POST
Status Code:200 OK
Request Headers
Accept:*/*
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Content-Length:23
Content-Type:application/x-www-form-urlencoded; charset=UTF-8
Host:localhost
Origin:http://localhost
Referer:http://localhost/test/app.html
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36
X-Requested-With:XMLHttpRequest
Query String Parameters
_dc:1373343459447
Form Data
page:1
start:0
limit:25
Response Headers
Connection:Keep-Alive
Content-Length:24
Content-Type:text/html
Date:Tue, 09 Jul 2013 04:17:39 GMT
Keep-Alive:timeout=5, max=96
Server:Apache/2.2.14 (Unix) DAV/2 mod_ssl/2.2.14 OpenSSL/0.9.8l PHP/5.3.1 mod_perl/2.0.4 Perl/v5.10.1
X-Powered-By:PHP/5.3.1
You need to change the way you're setting extraParams. In this case I would use:
store.getProxy().setExtraParam('VALUES1', 'pass some name here');
If you need to send more than one parameter, then use setExtraParams:
var param = { VALUES1: 'param1', VALUES2: 'param2' };
store.getProxy().setExtraParams(param);
So the full Store code:
Ext.define('MyApp.store.MyArrayStore', {
    extend: 'Ext.data.Store',
    requires: [
        'MyApp.model.MyMOD'
    ],
    config: {
        autoLoad: true,
        model: 'MyApp.model.MyMOD',
        storeId: 'MyArrayStore',
        proxy: {
            type: 'ajax',
            actionMethods: 'POST',
            url: 'http://localhost/mm/app/php/res.php',
            reader: {
                type: 'json'
            }
        },
        listeners: [
            {
                fn: 'onArraystoreBeforeLoad',
                event: 'beforeload'
            }
        ]
    },

    onArraystoreBeforeLoad: function(store, operation, eOpts) {
        store.getProxy().setExtraParam('VALUES1', 'pass some name here');
    }
});
Instead of this.proxy try this.getProxy().
I find the console very useful for this sort of thing. In my own app, running Ext.getStore('MyStore').proxy gets me undefined, whereas Ext.getStore('MyStore').getProxy() gets me my proxy.
Use the console; for me it is the most valuable development tool next to the API docs.
Good luck, Brad