Change redirected URL in Scrapy


Is it possible to change a redirected URL in Scrapy?
For example, I crawl a URL:
http://someurl.com/A
which redirects to
http://redirectedurl.com:8080/A
This URL fails because of the port number. The working URL has no port number, so I would like to change it to
http://redirectedurl.com/A
I tried updating request.meta so that redirect_urls holds the new URL without the port.
The docs say that MetaRefreshMiddleware obeys redirect_urls, but this had no effect:
meta.update({'redirect_urls': ['http://redirectedurl.com/A '] })
r = Request(url=url, callback=callback, meta=meta)
Any ideas?

There is no need to dig "under the hood". In your callback you can check whether the response was redirected, and then create a new request with the modified URL:
import re
from scrapy import Request
# inside the spider callback:
if 'redirect_urls' in response.meta:
    # drop the ":<port>" part of the final URL
    new_url = re.sub(r":\d+", "", response.url)
    yield Request(new_url)
Of course, you would add further checks there; this is just a minimal example.
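As one of those additional checks, the port can be removed by parsing the URL instead of using a regex, so that a ":digits" sequence elsewhere in the URL (for example in the query string) is left alone. The following is a minimal sketch using only the standard library; strip_port is a hypothetical helper name, not part of Scrapy.
```python
from urllib.parse import urlsplit, urlunsplit


def strip_port(url: str) -> str:
    """Return url with any explicit port removed (hypothetical helper)."""
    parts = urlsplit(url)
    # hostname is the host without port; note that it is lowercased.
    host = parts.hostname or ""
    # IPv6 literals come back without brackets, so restore them.
    if ":" in host:
        host = f"[{host}]"
    # Keep any "user:pass@" prefix that preceded the host.
    userinfo, at, _ = parts.netloc.rpartition("@")
    return urlunsplit(parts._replace(netloc=f"{userinfo}{at}{host}"))


print(strip_port("http://redirectedurl.com:8080/A"))
```
In the callback above, new_url = strip_port(response.url) could then stand in for the re.sub call.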

Related

Our base URL redirects to another URL, visible in the logs, that contains a value we need to store and pass to subsequent requests

Using karate-UI
Given driver 'http://localhost:8080/auth/realms/auth?scope=openid&state=eferov08J37HlzbycjxHGs4.xzyoGFvM3QQ.test&response_type=code&client_id=hetg&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2Fauth%2Frealms%2Fendpoint'
* fullscreen()
And delay(2000)
And input('#username', 'username')
And input('#password', 'password')
When submit().click("#kc-login")
Then waitForUrl('http://localhost:8080/auth/realms/endpoint')
This URL contains a value that we want to extract. waitForUrl waits for this URL to load, but how do we proceed once it has? Is it possible to store the URL in a variable? All the demos I have seen enter input or click a button on the page; none of them extract a value from the URL. How can I store this URL so that the value can be extracted?
http://localhost:8080/auth/realms/endpoint?state=abbv&code=t6002231-3031-459f-b4c4-2e8a25223550.64f22bbc-6c28-49e4-bc2c-ca0ed40060de.36aee969-73e3-4bc5-bc5e-a4b68

Please read the documentation. waitForUrl() actually returns the value of the URL: https://github.com/karatelabs/karate/tree/master/karate-core#waitforurl
* def actualUrl = waitForUrl('/some/path')
Also see driver.url: https://github.com/karatelabs/karate/tree/master/karate-core#driverurl
* def actualUrl = driver.url

python requests login with redirect

I'd like to automate my log in to my bank to automatically fetch my transactions to stay up-to-date with spendings and earnings, but I am stuck.
The bank's login webpage is: https://login.bancochile.cl/bancochile-web/persona/login/index.html#/login
I am using Python's requests module with sessions:
import requests
urlLoginPage = 'https://login.bancochile.cl/bancochile-web/persona/login/index.html'
urlLoginSubmit = 'https://login.bancochile.cl/oam/server/auth_cred_submit'
username = '11.111.111-1' # this is the format of a Chilean national ID ("RUT")
usernameFormatted = '111111111' # same id but formatted
pw = "password"
payload = [
("username2", usernameFormatted),
("username2", username),
("userpassword", pw),
("request_id", ''),
("ctx", "persona"),
("username", usernameFormatted),
("password", pw),
]
with requests.Session() as session:
    login = session.get(urlLoginPage)
    postLogin = session.post(
        urlLoginSubmit,
        data=payload,
        allow_redirects=False,
    )
    redirectUrl = postLogin.headers["Location"]
First, I found that the form data has duplicated keys, so I pass the payload as a list of tuples. In Chrome's inspector the form data looks like this:
username2=111111111&username2=11.111.111-1&userpassword=password&request_id=&ctx=persona&username=111111111&password=password
I've checked the page's source code to look for the use of a csrf token, but couldn't find any hint of it.
The site redirects after the login data is submitted. I set allow_redirects=False so that I can read the redirect URL of the POST from the Location header. Here is the problem: from the web browser I know that the redirect URL should be https://portalpersonas.bancochile.cl/mibancochile/rest/persona/perfilamiento/home, but with the method above I always end up on an error page (https://login.bancochile.cl/bancochile-web/contingencia/error404.html). I am using my own, correct login credentials for this.
If I submit the payload in a wrong format (e.g. by dropping a key) I am redirected to the same error-page. This tells me that probably something with the payload is incorrect, but I don't know how to find out what may be wrong.
I am kind of stuck and don't know how I can figure out where/how to look for errors and possible solutions. Any suggestions on how to debug this and continue or ideas for other approaches would be very welcome!
Thanks!
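One way to check the payload format described above is to encode the list of tuples with the standard library and compare the result with the form data copied from Chrome. This is only a sketch with the placeholder credentials from the question; it assumes requests encodes simple string values the same way as urllib.parse.urlencode, which can be verified by printing postLogin.request.body after the POST.
```python
from urllib.parse import urlencode

# Placeholder values from the question, in the same order as the form.
payload = [
    ("username2", "111111111"),
    ("username2", "11.111.111-1"),
    ("userpassword", "password"),
    ("request_id", ""),
    ("ctx", "persona"),
    ("username", "111111111"),
    ("password", "password"),
]

# Form data as shown by the browser's inspector.
browser_form_data = (
    "username2=111111111&username2=11.111.111-1&userpassword=password"
    "&request_id=&ctx=persona&username=111111111&password=password"
)

encoded = urlencode(payload)
# True when the encoded body matches the browser's form data byte for byte.
print(encoded == browser_form_data)
```
If the two strings match, the body encoding is probably not what differs from the browser's request.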

Include request parameters in URL when using Postman

I need to fire some requests using Postman but I need to include the parameter in the URL.
What I need:
https://serveraddress/v1/busride/user/favorites/route/RanDOMid
What I currently can configure in Postman:
https://serveraddress/v1/busride/user/favorites/route/?id=RanDOMid
I do not control the server, so I need to work out how to craft the request in Postman so that the input data is part of the URL path, not a query parameter. How can I specify input data in Postman so that it is included in the URL?

Click on Manage Environments.
Add a variable named path, with both the initial and the current value set to RanDOMid.
Reference the variable in the URL:
https://serveraddress/v1/busride/user/favorites/route/{{path}}

@User7294900's answer should be enough if all you want is to include a variable in your request URL.
However, if you want to generate a random ID for every request, you can use {{$guid}} or {{$randomInt}} directly in your URL as follows:
https://serveraddress/v1/busride/user/favorites/route/{{$guid}}
This will generate a random GUID every time your request is fired and the generated GUID will replace {{$guid}} in your URL.
or
https://serveraddress/v1/busride/user/favorites/route/{{$randomInt}}
This will generate a random integer between 0 and 1000 every time your request is fired and the generated integer will replace {{$randomInt}} in your URL.
Refer to the Postman documentation for more details: https://www.getpostman.com/docs/v6/postman/environments_and_globals/variables
Hope this helps!

Crawler4J seed URL gets encoded and an error page is crawled instead of the actual page

I am using Crawler4J to crawl user profiles on GitHub. For instance, I want to crawl the URL https://github.com/search?q=java+location:India&p=1
For now I am adding this hard-coded URL in my crawler controller like this:
String url = "https://github.com/search?q=java+location:India&p=1";
controller.addSeed(url);
When Crawler4J starts, the URL crawled is:
https://github.com/search?q=java%2Blocation%3AIndia&p=1
which gives me an error page.
What should I do? I have tried passing an already encoded URL, but that doesn't work either.

I eventually had to make a very small change to the Crawler4J source code:
File name: URLCanonicalizer.java
Method: percentEncodeRfc3986
I commented out the first line of this method, and then I was able to crawl and fetch my results:
//string = string.replace("+", "%2B");
My URL contained a + character that was being replaced by %2B, which led to the error page. I wonder why they specifically replace the + character before encoding the entire URL.
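To see why the rewritten URL can behave differently, it helps to decode both query strings. The sketch below uses Python's standard library rather than Crawler4J, and it assumes the server decodes the query form-style, as parse_qs does, where a literal + stands for a space while %2B stands for an actual plus sign. query_value is a hypothetical helper name.
```python
from urllib.parse import parse_qs, urlsplit

original = "https://github.com/search?q=java+location:India&p=1"
rewritten = "https://github.com/search?q=java%2Blocation%3AIndia&p=1"


def query_value(url: str, key: str) -> str:
    """Return the first decoded value of key in the URL's query string."""
    return parse_qs(urlsplit(url).query)[key][0]


print(repr(query_value(original, "q")))   # 'java location:India'
print(repr(query_value(rewritten, "q")))  # 'java+location:India'
```
Under that assumption the two URLs carry different search terms, which is consistent with the rewritten one not returning the expected page.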

CXF will hijack the /services URL; how to disable or configure it?

See http://cxf.apache.org/docs/jaxrs-services-description.html#JAXRSServicesDescription-ServicelistingsandWADLqueries,
If you request a URL like http://localhost:8080/store/books/services, CXF will hijack the URL and return a service description.
But in my case, http://localhost:8080/store/books/services should be one of my own web service URLs. How can I disable this behavior?

After carefully reading the CXF documentation http://cxf.apache.org/docs/jaxrs-services-description.html#JAXRSServicesDescription-ServicelistingsandWADLqueries again,
I found that CXF lets you configure the service list URL:
Note that you can override the location at which listings are provided (in case you would like '/services' be available to your resources) using 'service-list-path' CXFServlet parameter, example:
'service-list-path' = '/listings'
That is a parameter of org.apache.cxf.transport.servlet.CXFServlet.
