Scrapy 503 Service unavailable for CloudFlare protection - scrapy

I'm trying to fetch articles from https://journals.sagepub.com/. The website is accessible through my browser, but I keep getting a 503 error when I try to crawl it in the Scrapy shell. When I view the response in a browser, it shows the generic Cloudflare DDoS protection page. I have tried changing the user agent and the download delay, but nothing works. I am new to Scrapy and web scraping in general, so some help would be much appreciated.

Related

Google Load Balancer and App Engine 400 errors - how to find more detail?

I am looking at my Google App Engine dashboard, I have a .NET Core Web API deployed and am processing somewhere between 5-10 requests per second, but I am also reporting a handful of 400 errors from the GCP HTTP Load Balancer and I don't know why. I look at the Google Load Balancer logs and I also see a bunch of 400 errors on my primary POST endpoint. This only happens on the POST endpoint. I try to see the full body of the HTTP Request but I can't seem to find it, it looks like Google doesn't log it by default.
I have a .NET Core API running on Google App Engine Flex connected to Google Cloud SQL running PostgreSQL.
How can I get more detail either from the Load Balancer to find out exactly what requests are coming in and what is happening? I have attached a sample of the requests/400 status responses below.
Thanks in advance.
See Google App Engine Requests VS 400 errors graph
See HTTP 400 Errors with HTTP Load Balancer

Frontend Cloud Run app cannot access my backend Cloud Run app due to a Mixed Content problem

I have two cloud services up and running.
frontend (URL: https://frontend-abc-ez.a.run.app/)
backend (URL: http://backend-abc-ez.a.run.app/)
The frontend calls the backend through a nuxt.js server middleware proxy to avoid CORS problems.
The call is going through; I can see that in the backend log files. However, the response is not really coming back through because of CORS. I see this error in the console:
Mixed Content: The page at 'https://frontend-abc-ez.a.run.app/' was loaded over HTTPS, but requested an insecure XMLHttpRequest endpoint 'http://backend-abc-ez.a.run.app/login'. This request has been blocked; the content must be served over HTTPS.
What I find weird is that I configured the backend URL with https, but it is enforced as http; at least that is what the error is telling me. I also see a /login path segment in the insecure URL. Why is that? I never explicitly defined that endpoint. Is it the security layer proxy of the Run service itself?
Anyway, I need to get through this properly and am having a hard time understanding the source of the problem.
For some reason, when I rechecked the applications this morning, everything worked fine. I really have no idea why it is working now. I did not change a thing; I was waiting for answers here before continuing.
Very weird. But the solution so far seems to be waiting. Maybe Cloud Run had some troubles.

Occasional SSL error serving GitHub Pages site over custom domain

Using GitHub Pages' built-in SSL, I have been serving a GitHub Pages site over HTTPS since it was announced. I set this up by following the tutorial GitHub provides. However, perhaps 25% of the time when I try to access the site, I get an SSL error that says the browser cannot find the certificate. Reloading the page one or more times resolves the issue for a while. The site uses enforced HTTPS. Due to the intermittent nature of the issue, I have failed to determine the cause. I am unsure what information I could provide to help diagnose the error.
Edit: the error is NET::ERR_CERT_COMMON_NAME_INVALID in Chrome.
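One way to gather the missing diagnostic information is to record which names the served certificate covers on each connection. The following is only a rough sketch using the Python standard library; the helper names (`fetch_cert`, `covers`, and so on) are made up for illustration, and the wildcard matching is deliberately simplified to a single left-most label.

```python
import socket
import ssl


def dns_names(cert):
    """DNS entries from a certificate dict as returned by SSLSocket.getpeercert()."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def name_matches(hostname, pattern):
    """Simplified match: a leading '*.' covers exactly one left-most label."""
    hostname, pattern = hostname.lower(), pattern.lower()
    if pattern.startswith("*."):
        head, _, rest = hostname.partition(".")
        return bool(head) and rest == pattern[2:]
    return hostname == pattern


def covers(hostname, cert):
    """True if any DNS name in the certificate matches the hostname."""
    return any(name_matches(hostname, name) for name in dns_names(cert))


def fetch_cert(hostname, port=443, timeout=10):
    """Fetch the certificate served for hostname.

    The chain is still validated, but the hostname check is disabled so that
    a certificate issued for a different name can be inspected instead of
    aborting the handshake.
    """
    context = ssl.create_default_context()
    context.check_hostname = False
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()
```

Calling `fetch_cert` for the custom domain a few dozen times and logging `dns_names(cert)` whenever `covers(domain, cert)` is false would show which certificate is served on the failing attempts, which is the kind of detail that helps others diagnose an intermittent name mismatch.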

How to ensure my website loads all resources via https?

URL in question: https://newyorkliquorgiftshop.com/admin/
When you open the above page, you can see in the console that there are lots of error messages saying "...was loaded over HTTPS, but requested an insecure stylesheet.."
This website was working well until this problem suddenly showed up. I am not very familiar with HTTPS, but I have contacted GoDaddy and the SSL certificate is valid, and there is no obvious problem with "https://newyorkliquorgiftshop.com". I am stuck here. In my previous experience with HTTPS websites, if the URL of the homepage is "https", then every resource it loads is fetched via "https" too. I don't know why my website behaves differently, and I don't know where to start to solve the problem. Any hint is appreciated, especially articles about HTTPS that relate to my problem. (I have done some brief research on HTTPS, but most of the articles I found cover only the basic concepts.)
If you have access to the code (not sure what you built the website with), try using https instead of http for the URLs you use to load your style sheets and script files.
For example one of the errors is
Mixed Content: The page at 'https://newyorkliquorgiftshop.com/admin/' was loaded over HTTPS, but requested an insecure script 'http://www.newyorkliquorgiftshop.com/admin/view/javascript/common.js'. This request has been blocked; the content must be served over HTTPS.
You are requesting the .js file over HTTP. Try using HTTPS instead, like so:
https://www.newyorkliquorgiftshop.com/admin/view/javascript/common.js
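To find every such reference rather than fixing them one at a time from the console, the page source can be scanned for resources that still use http://. This is a minimal sketch with the Python standard library; the class and function names are hypothetical, and it only looks at src/href attributes on a few common resource tags, so it will not catch URLs built by scripts or set in CSS.

```python
from html.parser import HTMLParser


class InsecureResourceFinder(HTMLParser):
    """Collects http:// URLs referenced by resource-loading tags."""

    RESOURCE_TAGS = {"script", "link", "img", "iframe"}
    URL_ATTRS = {"src", "href"}

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        # Ordinary <a href> links are ignored; they do not trigger mixed content.
        if tag not in self.RESOURCE_TAGS:
            return
        for name, value in attrs:
            if name in self.URL_ATTRS and value and value.lower().startswith("http://"):
                self.insecure.append(value)


def find_insecure_resources(html):
    """Return the insecure resource URLs found in an HTML string."""
    finder = InsecureResourceFinder()
    finder.feed(html)
    finder.close()
    return finder.insecure


def to_https(url):
    """Rewrite an http:// URL to https://, leaving other URLs unchanged."""
    if url.lower().startswith("http://"):
        return "https://" + url[len("http://"):]
    return url
```

Running `find_insecure_resources` over the saved HTML of the admin page lists the URLs to change in the templates or in the site's base URL setting; `to_https` shows the rewritten form each one should take.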

Google compute load balancer throws 400 Bad Request on DELETE

I created an instance group from an instance template and attached this instance group to a backend service, which is used by an HTTP load balancer.
When I open a URL pointing directly at a VM instance from the instance group, I can do GET, POST, and DELETE requests; all of the requests are fast and everything works as expected.
When I open the URL for the load balancer's static IP, I can do GET and POST requests, but DELETE requests throw a 400 BAD REQUEST with a response page saying:
That’s an error.
Your client has issued a malformed or illegal request. That’s all we know.
Other load balancer issues:
The site is quite slow through the load balancer. Perhaps there is a setting I'm missing; I'm pretty sure I set everything to us-central1-b.
Sometimes the site doesn't even show up. It will work for http but not for https, and vice versa. The load balancer behaves very strangely.
My VM API access is set to "This instance has full API access to all Google Cloud services".
I'm using Django as my API layer. I turned on debugging on this host and saw that the DELETE requests weren't even coming through when made via the load balancer's static IP. Is there a firewall setting I'm missing?
Please help me make this fast again and allow the DELETE requests to happen.
Thanks!
Are you sending anything in the body of the request?
The Google load balancer will respond with 400 BAD REQUEST if you try to send anything in the body of a DELETE request. An easy way to check whether this is the problem is to fire up Chrome Developer Tools and check that the Request Payload section is empty or doesn't exist.
The HTTP spec doesn't explicitly say whether a DELETE request may carry a body, so this isn't wrong, just undefined.
Is the load balancer slow for all requests or just for pages with lots of elements?
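If the body does turn out to be the cause, one workaround is to identify the resource in the URL and send the DELETE with no payload at all. Here is a small sketch using Python's urllib; the base URL and the `build_delete_request` / `has_body` helpers are assumptions for illustration, not part of the original setup.

```python
import urllib.request
from urllib.parse import quote


def build_delete_request(base_url, item_id):
    """Build a DELETE request that carries the id in the path and has no body."""
    url = f"{base_url.rstrip('/')}/{quote(str(item_id), safe='')}"
    # No data argument is passed, so the request has no payload.
    return urllib.request.Request(url, method="DELETE")


def has_body(request):
    """True if the request would send a payload."""
    return request.data is not None
```

A request built this way can be sent with `urllib.request.urlopen(request)`. On the Django side, the view would then read the id from the URL rather than from the request body.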