Scrapy: Login to page pre-crawling - scrapy

I'm learning to build a scraper that scrapes search results but needs to log in first. I read the documentation and this article here. Unfortunately, I'm still stuck. My spider reports the following: <403 https://github.com/login>: HTTP status code is not handled or not allowed.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class GitHubSpider(CrawlSpider):
    name = "github"
    start_urls = [
        "https://github.com/search?p=1&q=React+Django&type=Users",
    ]
    rules = (
        Rule(
            LinkExtractor(restrict_css="a.mr-1"),
            callback="parse_engineer",
        ),
        Rule(LinkExtractor(restrict_css=".next_page")),
    )

    def start_requests(self):
        return [
            scrapy.FormRequest(
                url="https://github.com/login",
                formdata={
                    "login": "scrapy",
                    "password": "12345",
                },
                callback=self.parse,
            )
        ]

    def parse_engineer(self, response):
        yield {
            "username": response.css(".vcard-username::text").get().strip(),
        }
Edit: Following up on @SuperUser's suggestion.
headers = {
    [...]
}

def start_requests(self):
    # Do I have access to a response here?
    token = response.xpath('//form/input[@name="authenticity_token"]/@value').get()
    return [
        scrapy.FormRequest(
            url="https://github.com/login",
            formdata={
                "login": "scrapy",
                "password": "12345",
                "authenticity_token": token,  # <-------------
            },
            headers=self.headers,
            callback=self.parse,
        )
    ]

Go to settings.py and set ROBOTSTXT_OBEY = False.
Replace the default user_agent with another one.
Add the request headers from the requested page; you can get them with your browser's devtools.
Just know that they can block your IP, and also block your account.
I suggest you use PyGithub instead.
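For example, a minimal PyGithub sketch for the same search might look something like this (the access token is a placeholder, and the ten-result slice is just for illustration):

    from github import Github  # pip install PyGithub

    gh = Github("YOUR_ACCESS_TOKEN")
    # Search users matching the same query the Scrapy spider above is crawling.
    for user in gh.search_users("React Django")[:10]:
        print(user.login)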
Edit:
The request headers:
class GitHubSpider(CrawlSpider):
    name = "github"
    start_urls = [
        "https://github.com/search?p=1&q=React+Django&type=Users",
    ]
    rules = (
        Rule(
            LinkExtractor(restrict_css="a.mr-1"),
            callback="parse_engineer",
        ),
        Rule(LinkExtractor(restrict_css=".next_page")),
    )
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.5",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "DNT": "1",
        "Host": "github.com",
        "Pragma": "no-cache",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Sec-GPC": "1",
        "TE": "trailers",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36",
    }

    def start_requests(self):
        return [
            scrapy.FormRequest(
                url="https://github.com/login",
                formdata={
                    "login": "scrapy",
                    "password": "12345",
                },
                headers=self.headers,
                callback=self.parse,
            )
        ]

    def parse_engineer(self, response):
        yield {
            "username": response.css(".vcard-username::text").get().strip(),
        }
Also notice that you need to get the CSRF token:
    token = response.xpath('//form/input[@name="authenticity_token"]/@value').get()
Pass the token together with the username and password:
    formdata={
        "login": ...,
        "password": ...,
        "authenticity_token": token,
    }
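Putting the pieces together, here is a minimal sketch of the login flow (the field names mirror the GitHub form above; login and after_login are just illustrative callback names, and the credentials are placeholders): request the login page first so there is a response to extract the token from, then submit the form, and only then start crawling the search pages.

    def start_requests(self):
        # Fetch the login page first so we have a response to pull the CSRF token from.
        yield scrapy.Request("https://github.com/login", callback=self.login)

    def login(self, response):
        token = response.xpath('//form/input[@name="authenticity_token"]/@value').get()
        # from_response() picks up the form action and hidden fields from the page.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                "login": "your_username",        # placeholder credentials
                "password": "your_password",
                "authenticity_token": token,
            },
            headers=self.headers,
            callback=self.after_login,
        )

    def after_login(self, response):
        # Kick off the search crawl only once the login request has completed.
        for url in self.start_urls:
            yield scrapy.Request(url)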

Related

Scrapy return output `None`

I am new to Scrapy. I am getting None instead of the item. Here is my code:
class IndiaSpider(scrapy.Spider):
    name = 'espace'
    allowed_domains = ['worldwide.espacenet.com']
    search_value = 'laptop'
    start_urls = [f'https://worldwide.espacenet.com/patent/search?q={search_value}']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'

    def request_header(self):
        yield scrapy.Request(url=self.start_urls, callback=self.parse, headers={'User-Agent': self.user_agent})

    def parse(self, response):
        title = response.xpath("//span[@class='h2--2VrrSjFb item__content--title--dYTuyzV6']/text()").extract_first()
        yield {
            'title': title
        }
I am getting
2023-01-17 15:58:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://worldwide.espacenet.com/patent/search?q=laptop> (referer: None)
2023-01-17 15:58:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://worldwide.espacenet.com/patent/search?q=laptop>
{'title': None}
2023-01-17 15:58:54 [scrapy.core.engine] INFO: Closing spider (finished)
Can anyone help me?
See the comments in the code. Read this, and this.
Basically, when the data is loaded with JavaScript you'll want to get it from the underlying API. If you open devtools in your browser you can see where the data is loaded from, recreate that request with Scrapy, and then parse the data from the JSON response.
Lose the request_header method; it's not one of the Spider's methods and you never call it. You probably wanted start_requests.
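For reference, if you only wanted to fix that mistake, a minimal start_requests would look roughly like this (a sketch only; note that start_urls is a list, so it has to be iterated):

    def start_requests(self):
        # start_urls is a list; the original code passed the whole list as the url argument.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse,
                                 headers={'User-Agent': self.user_agent})

That alone would still yield None for the title, though, because the page is rendered with JavaScript, which is why the spider below goes straight to the JSON API.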
import json
import scrapy


class IndiaSpider(scrapy.Spider):
    name = 'espace'
    allowed_domains = ['worldwide.espacenet.com']
    search_value = 'laptop'

    # browser devtools -> network tab -> JSON url -> headers
    headers = {
        "Accept": "application/json,application/i18n+xml",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.5",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Content-Type": "application/json",
        "DNT": "1",
        "EPO-Trace-Id": "YOUR ID",  # <------ copy it from your browser
        "Host": "worldwide.espacenet.com",
        "Origin": "https://worldwide.espacenet.com",
        "Pragma": "no-cache",
        "Referer": "https://worldwide.espacenet.com/patent/search?q=laptop",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
        "X-EPO-PQL-Profile": "cpci"
    }
    api_url = f'https://worldwide.espacenet.com/3.2/rest-services/search?lang=en,de,fr&q={search_value}&qlang=cql&'

    def start_requests(self):
        # browser devtools -> network tab -> JSON url -> Request
        payload = {
            "filters": {
                "publications.patent": [
                    {
                        "value": [
                            "true"
                        ]
                    }
                ]
            },
            "query": {
                "fields": [
                    "publications.ti_*",
                    "publications.abs_*",
                    "publications.pn_docdb",
                    "publications.in",
                    "publications.inc",
                    "publications.pa",
                    "publications.pac",
                    "publications.pd",
                    "publications.pr_docdb",
                    "publications.app_fdate.untouched",
                    "publications.ipc_ic",
                    "publications.ipc_icci",
                    "publications.ipc_iccn",
                    "publications.ipc_icai",
                    "publications.ipc_ican",
                    "publications.ci_cpci",
                    "publications.ca_cpci",
                    "publications.cl_cpci",
                    "biblio:pa;pa_orig;pa_unstd;in;in_orig;in_unstd;pac;inc;pd;pn_docdb;allKindCodes;",
                    "oprid_full.untouched",
                    "opubd_full.untouched"
                ],
                "from": 0,
                "highlighting": [
                    {
                        "field": "publications.ti_en",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    },
                    {
                        "field": "publications.abs_en",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    },
                    {
                        "field": "publications.ti_de",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    },
                    {
                        "field": "publications.abs_de",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    },
                    {
                        "field": "publications.ti_fr",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    },
                    {
                        "field": "publications.abs_fr",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    },
                    {
                        "field": "publications.pn_docdb",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    },
                    {
                        "field": "publications.pa",
                        "fragment_words_number": 20,
                        "hits_only": True,
                        "number_of_fragments": 3
                    }
                ],
                "size": 20
            },
            "widgets": {}
        }
        yield scrapy.Request(url=self.api_url, headers=self.headers, method='POST', body=json.dumps(payload))

    def parse(self, response):
        # browser devtools -> network tab -> JSON url -> Response
        json_data = response.json()
        if json_data:
            for hit in json_data['hits']:
                if 'publications.ti_en' in hit['hits'][0]['fields']:
                    title = hit['hits'][0]['fields']['publications.ti_en']
                    yield {'title': title}
Output:
{'title': ['METHOD AND DEVICE FOR CHECKING THE DETERMINATION OF THE POSITION OF A MOBILE STATION CARRIED OUT BY A RADIO COMMUNICATION SYSTEM']}
{'title': ['Laptop']}
{'title': ['PRESENTATION LAPTOP']}
{'title': ['LAPTOP COMPUTER']}
{'title': ['Laptop comprises an integrated flat bed scanner containing a composite glass plate made from a mineral glass pane and a plastic layer']}
...
...
...

How to configure KrakenD so it returns an HTTP redirect response as-is instead of following the redirect?

I am currently using the KrakenD (https://krakend.io) API gateway to proxy requests to my backend service. One of my backend service APIs responds with an HTTP 303 redirect, which looks like this:
HTTP/1.1 303 See Other
content-length: 48
content-type: text/plain; charset=utf-8
date: Thu, 16 Jul 2020 10:25:41 GMT
location: https://www.detik.com/
vary: Accept
x-powered-by: Express
x-envoy-upstream-service-time: 17
server: istio-envoy
The problem is that, instead of returning the HTTP 303 response (with its Location header) to the client as-is, KrakenD follows the redirect and returns the response of the redirect URL, i.e. the HTML of https://www.detik.com/.
My current KrakenD configuration looks like this:
{
  "version": 2,
  "extra_config": {
    "github_com/devopsfaith/krakend-cors": {
      "allow_origins": [],
      "expose_headers": [
        "Content-Length",
        "Content-Type",
        "Location"
      ],
      "allow_headers": [
        "Content-Type",
        "Origin",
        "X-Requested-With",
        "Accept",
        "Authorization",
        "secret",
        "Host"
      ],
      "max_age": "12h",
      "allow_methods": [
        "GET",
        "POST",
        "PUT"
      ]
    },
    "github_com/devopsfaith/krakend-gologging": {
      "level": "ERROR",
      "prefix": "[GATEWAY]",
      "syslog": false,
      "stdout": true,
      "format": "default"
    },
    "github_com/devopsfaith/krakend-logstash": {
      "enabled": false
    }
  },
  "timeout": "10000ms",
  "cache_ttl": "300s",
  "output_encoding": "json",
  "name": "api-gateway",
  "port": 8080,
  "endpoints": [
    {
      "endpoint": "/ramatestredirect",
      "method": "GET",
      "extra_config": {},
      "output_encoding": "no-op",
      "concurrent_calls": 1,
      "backend": [
        {
          "url_pattern": "/",
          "encoding": "no-op",
          "sd": "static",
          "extra_config": {},
          "method": "GET",
          "host": [
            "http://ramatestredirect.default.svc.cluster.local"
          ],
          "disable_host_sanitize": false
        }
      ]
    }
  ]
}
So how can I make KrakenD return the original HTTP 303 response from my backend service to the client unaltered?
Thank you.
I assume that you're calling the /ramatestredirect endpoint.
To get the backend's HTTP status code (as you said, it returns a 303), you can do it this way:
{
  "endpoint": "/ramatestredirect",
  "method": "GET",
  "extra_config": {},
  "output_encoding": "no-op",
  "concurrent_calls": 1,
  "backend": [
    {
      "url_pattern": "/",
      "encoding": "no-op",
      "sd": "static",
      "extra_config": {
        "github.com/devopsfaith/krakend/http": {
          "return_error_details": "authentication"
        }
      },
      "method": "GET",
      "host": [
        "http://ramatestredirect.default.svc.cluster.local"
      ],
      "disable_host_sanitize": false
    }
  ]
}
So, basically, with this plugin you can get the original backend HTTP status code:
  "github.com/devopsfaith/krakend/http": {
    "return_error_details": "authentication"
  }
If you use the Lura framework (formerly known as the KrakenD framework), then you may have to disable redirects for your HTTP client:
client := &http.Client{
    CheckRedirect: func(req *http.Request, via []*http.Request) error {
        return http.ErrUseLastResponse
    },
}

API Gateway POST method working during tests, but not with postman

I will try to explain my problem clearly.
I have an API that writes to DynamoDB via a Lambda function written in Node.js. When I call it from within the AWS console, the API works as expected. I send a body like this:
{
  "user-id": "4dz545zd",
  "name": "Bush",
  "firstname": "Gerard",
}
And that creates the entry in my DynamoDB table. But when I call the same API (freshly deployed) with Postman, I get this error:
{
  "statusCode": "400",
  "body": "One or more parameter values were invalid: An AttributeValue may not contain an empty string",
  "headers": {
    "Content-Type": "application/json"
  }
}
When I check in CloudWatch why it fails, I see:
Method request body before transformations: [Binary Data]
This is weird, because I sent JSON with the two headers:
Content-Type:application/json
Accept:application/json
And then in CloudWatch, I see that what is actually being processed is:
{
  "user-id": "",
  "name": "",
  "firstname": "",
}
That explains the error, but I don't understand why, when I send it with Postman (not empty, and in JSON format), it still arrives as "binary" data and therefore isn't processed by my mapping rule (so Lambda ends up processing an empty JSON):
#set($inputRoot = $input.path('$'))
{
  "httpMethod": "POST",
  "body": {
    "TableName": "user",
    "Item": {
      "user-id": "$inputRoot.get('user-id')",
      "name": "$inputRoot.get('name')",
      "firstname": "$inputRoot.get('firstname')",
    }
  }
}
Thank you in advance!
EDIT: I'm adding the Lambda function code.
'use strict';

console.log('Function Prep');
const doc = require('dynamodb-doc');
const dynamo = new doc.DynamoDB();

exports.handler = (event, context, callback) => {
    const done = (err, res) => callback(null, {
        statusCode: err ? '400' : '200',
        body: err ? err.message : res,
        headers: {
            'Content-Type': 'application/json'
        },
    });

    switch (event.httpMethod) {
        case 'DELETE':
            dynamo.deleteItem(event.body, done);
            break;
        case 'HEAD':
            dynamo.getItem(event.body, done);
            break;
        case 'GET':
            if (event.queryStringParameters !== undefined) {
                dynamo.scan({ TableName: event.queryStringParameters.TableName }, done);
            }
            else {
                dynamo.getItem(event.body, done);
            }
            break;
        case 'POST':
            dynamo.putItem(event.body, done);
            break;
        case 'PUT':
            dynamo.putItem(event.body, done);
            break;
        default:
            done(new Error(`Unsupported method "${event.httpMethod}"`));
    }
};
That's because when testing from AWS Lambda's console, you're sending the JSON you actually expect. But when this is invoked from API Gateway, the event looks different.
You'll have to access the event.body property in order to get your JSON; however, the body is stringified JSON, meaning you'll have to parse it first.
You didn't specify what language you're coding in, but if you're using NodeJS you can parse the body like this:
JSON.parse(event.body).
If you're using Python, then you can do this:
json.loads(event["body"])
If you're using any other language, I suggest you look up how to parse JSON from a string.
That gives you what you need.
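As an illustration, a minimal Python handler for a proxy-integration event might look something like this (a sketch only; the "user" table and field names are taken from the question's mapping template):

    import json
    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("user")  # table name taken from the mapping template above

    def lambda_handler(event, context):
        # With Lambda proxy integration the body arrives as a JSON string, not an object.
        item = json.loads(event["body"])
        table.put_item(Item=item)
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"message": "Success"}),
        }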
This is what an event from API Gateway looks like:
{
  "path": "/test/hello",
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, lzma, sdch, br",
    "Accept-Language": "en-US,en;q=0.8",
    "CloudFront-Forwarded-Proto": "https",
    "CloudFront-Is-Desktop-Viewer": "true",
    "CloudFront-Is-Mobile-Viewer": "false",
    "CloudFront-Is-SmartTV-Viewer": "false",
    "CloudFront-Is-Tablet-Viewer": "false",
    "CloudFront-Viewer-Country": "US",
    "Host": "wt6mne2s9k.execute-api.us-west-2.amazonaws.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36 OPR/39.0.2256.48",
    "Via": "1.1 fb7cca60f0ecd82ce07790c9c5eef16c.cloudfront.net (CloudFront)",
    "X-Amz-Cf-Id": "nBsWBOrSHMgnaROZJK1wGCZ9PcRcSpq_oSXZNQwQ10OTZL4cimZo3g==",
    "X-Forwarded-For": "192.168.100.1, 192.168.1.1",
    "X-Forwarded-Port": "443",
    "X-Forwarded-Proto": "https"
  },
  "pathParameters": {
    "proxy": "hello"
  },
  "requestContext": {
    "accountId": "123456789012",
    "resourceId": "us4z18",
    "stage": "test",
    "requestId": "41b45ea3-70b5-11e6-b7bd-69b5aaebc7d9",
    "identity": {
      "cognitoIdentityPoolId": "",
      "accountId": "",
      "cognitoIdentityId": "",
      "caller": "",
      "apiKey": "",
      "sourceIp": "192.168.100.1",
      "cognitoAuthenticationType": "",
      "cognitoAuthenticationProvider": "",
      "userArn": "",
      "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36 OPR/39.0.2256.48",
      "user": ""
    },
    "resourcePath": "/{proxy+}",
    "httpMethod": "GET",
    "apiId": "wt6mne2s9k"
  },
  "resource": "/{proxy+}",
  "httpMethod": "GET",
  "queryStringParameters": {
    "name": "me"
  },
  "stageVariables": {
    "stageVarName": "stageVarValue"
  },
  "body": "'{\"user-id\":\"123\",\"name\":\"name\", \"firstname\":\"firstname\"}'"
}
EDIT
After further discussion in the comments, one more problem is that you're using the DynamoDB API rather than the DocumentClient API. When using the DynamoDB API, you must specify the types of your attributes. DocumentClient, on the other hand, abstracts this complexity away.
I have also refactored your code a little bit (only dealing with POST at the moment for the sake of simplicity), so you can make use of async/await:
'use strict';

console.log('Function Prep');
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
    switch (event.httpMethod) {
        case 'POST':
            await dynamo.put({TableName: 'users', Item: JSON.parse(event.body)}).promise();
            break;
        default:
            throw new Error(`Unsupported method "${event.httpMethod}"`);
    }
    return {
        statusCode: 200,
        body: JSON.stringify({message: 'Success'})
    }
};
Here's the Item in DynamoDB, my Postman request, and the request headers (screenshots omitted here).
When creating the API Gateway, I checked the box Use Lambda Proxy integration.
If you reproduce these steps it should just work.
I got the exact same problem; the solution for me was deploying the API to make my changes available through Postman!
Hope it helps, even one year later.
You need to deploy your Amazon API Gateway! It took me forever to figure this out.
Deploy API
I encountered the same problem while working with Java, and I fixed it by just checking Use Lambda Proxy integration for the POST method.

traefik expose internal metrics

I would like to expose internal metrics of traefik.
After reading the documentation I created the following configuration file:
logLevel = "INFO"
[entryPoints]
[entryPoints.http]
address = ":80"
[entryPoints.dashboard]
address = ":16081"
# API definition
[api]
entryPoint = "dashboard"
dashboard = true
debug = false
[api.statistics]
recentErrors = 10
# Metrics definition
[metrics]
# DataDog metrics exporter type
[metrics.datadog]
address = "172.17.0.1:8125"
pushInterval = "10s"
################################################################
# Mesos/Marathon Provider
################################################################
# Enable Marathon Provider.
[marathon]
endpoint = "http://mesos.lan:8080/"
watch = true
domain = "service.lan"
exposedByDefault = false
When I query the dashboard entrypoint, I get a 404 error on /metrics:
curl -s http://localhost:16081/health | jq
{
  "pid": 1,
  "uptime": "3h31m3.5252748s",
  "uptime_sec": 12663.5252748,
  "time": "2018-09-04 16:53:17.7128687 +0000 UTC m=+12663.602939001",
  "unixtime": 1536079997,
  "status_code_count": {},
  "total_status_code_count": {
    "404": 5
  },
  "count": 0,
  "total_count": 5,
  "total_response_time": "390.7µs",
  "total_response_time_sec": 0.0003907,
  "average_response_time": "78.14µs",
  "average_response_time_sec": 7.814e-05,
  "recent_errors": [
    {
      "status_code": 404,
      "status": "Not Found",
      "method": "GET",
      "host": "localhost:16081",
      "path": "/metrics",
      "time": "2018-09-04T16:53:12.0232879Z"
    },
    {
      "status_code": 404,
      "status": "Not Found",
      "method": "GET",
      "host": "localhost:16081",
      "path": "/metrics",
      "time": "2018-09-04T13:18:52.7206202Z"
    },
    {
      "status_code": 404,
      "status": "Not Found",
      "method": "GET",
      "host": "localhost:16081",
      "path": "/metrics",
      "time": "2018-09-04T13:18:51.853093Z"
    },
    {
      "status_code": 404,
      "status": "Not Found",
      "method": "GET",
      "host": "localhost:16081",
      "path": "/metrics",
      "time": "2018-09-04T13:18:50.9894516Z"
    },
    {
      "status_code": 404,
      "status": "Not Found",
      "method": "GET",
      "host": "localhost:16081",
      "path": "/metrics",
      "time": "2018-09-04T13:18:49.8598176Z"
    }
  ]
}
curl -s http://localhost:16081/metrics
404 page not found
Did I miss something?
My main objective is to get metrics per frontend/backend: I would like to know the number of requests and the returned status codes per frontend.
Thanks,
Renaud
This is solved. Long story short, /metrics is only exposed when the Prometheus provider is enabled; when the Datadog provider is enabled, all metrics are sent to Datadog instead.
Details can be found here: github.com/containous/traefik/issues/3877

BigCommerce Create Shipment - no response

I am writing an API integration to create shipments on BigCommerce.
I am getting responses from the 'GET' URLs - I just can't seem to get the API to respond to the 'PUT'.
I fired up a 'Web Responder' and it returned the following (tokens etc. removed for security).
Header:
{
  "VERSION": "HTTP/1.1",
  "CONNECTION": "close",
  "ACCEPT-ENCODING": "gzip",
  "CONTENT-TYPE": "application/json",
  "AUTHORIZATION": "Bearer ---------------------",
  "X-AUTH-CLIENT": "======================",
  "X-AUTH-TOKEN": "=========================",
  "ACCEPT": "application/json;",
  "ACCEPT-CHARSET": "UTF-8;",
  "USER-AGENT": "West Wind Internet Protocols 5.56",
  "CACHE-CONTROL": "no-cache",
  "COOKIE": "__cfduid=dfebfa0729eeaf50601b1fe187807c6fc1529278210; owner_token=cdc79c402c05c15d01ce0996dcc40654e3a0fe75a256eae3",
  "CONTENT-LENGTH": "171"
}
The 'PUT' has:
PUT /b7ezoY2bqq2DKg0soyMy
{
  "tracking_number": "PBT0000124",
  "comments": "Shipped by PBT",
  "order_address_id": 392,
  "shipping_provider": "",
  "items": [
    {
      "order_product_id": 1540,
      "quantity": 1
    }
  ]
}
As far as I can tell, all the details are correct. I just get no response. Please note this is a 'Desktop' application, not a website.
Any clues?