Using Scrapy spider results for a website

I've experimented with some crawlers to pull web data from within a Python environment on my local machine. Ideally, I'd like to host a website that can initiate crawlers to aggregate content and display that on the site.
My question is, is it possible to do this from a web environment and not my local machine?

Sure, there are many services that do exactly this task.
Scrapinghub is the best-known example: https://scrapinghub.com/
You can deploy your spiders there and run them periodically (paid service). Deploy the spider, call it via the Scrapinghub API from your website, and use the spider's output in your host website.
You can also achieve the same idea on your own server, with your website triggering crawls via an API call.
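For the self-hosted route, here is a minimal sketch of triggering a crawl from your site's backend, assuming a Scrapyd daemon on the same host at its default port 6800; the project and spider names are placeholders:

    import requests

    SCRAPYD = "http://localhost:6800"  # assumed Scrapyd address

    # schedule.json starts a crawl and returns a job id.
    resp = requests.post(
        f"{SCRAPYD}/schedule.json",
        data={"project": "myproject", "spider": "myspider"},  # placeholder names
    )
    job_id = resp.json()["jobid"]

    # Poll listjobs.json until the job appears under "finished", then read
    # the items your spider wrote (e.g. a JSON feed export) for display.
    jobs = requests.get(
        f"{SCRAPYD}/listjobs.json", params={"project": "myproject"}
    ).json()
    print(job_id in [job["id"] for job in jobs["finished"]])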

Related

Vue Vite and Wordpress hosted on the same server

I'm tinkering around in a hobby project and would like to know: is it possible to host a WordPress backend as my headless CMS and my Vue Vite frontend on the same server?
I would like to do this because when WordPress is on a completely separate server, it takes about 800-1000 ms for the API to load the content (REST API cache enabled with a plugin), even though it's only about 6 posts, each containing one mid-size (around 1000x1000) image and 100-200 words.
Or should I look for another solution, e.g. moving completely to another CMS like ButterCMS?
If the latter, can I host that CMS on the same server to cut down the loading/response times of REST API requests?
Thank you for your time,
Have a nice one!

Is it possible to use Selenium from within a web app?

I am building a website in Django that scrapes data from another site, so people can visit it, set custom data filters, and view the scraped data in a friendly format.
The problem is that the requests and Beautiful Soup modules will not be enough for my scraping purposes, since I also need some automation (loading JavaScript or clicking buttons).
Since Selenium requires a webdriver to be downloaded and placed on the PATH, is it possible to use it from within a web app? Like hosting the webdriver somewhere?
I am also open to solutions other than Selenium, if there are any.
I think what you want is a Selenium Grid server.
https://www.seleniumhq.org/docs/07_selenium_grid.jsp
Basically, you host it on a remote server, then connect to it, spin up web drivers remotely, and use them in code as needed. It also comes with a handy interface for checking on current browser instances, and you can even take screenshots or execute scripts from the web UI.
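To connect to such a grid from code, here is a minimal sketch using the Selenium 4 Python client; the hub address is a placeholder for wherever you host the grid:

    from selenium import webdriver

    # Ask the remote grid to spin up a Chrome session for us.
    options = webdriver.ChromeOptions()
    driver = webdriver.Remote(
        command_executor="http://my-grid-host:4444/wd/hub",  # placeholder hub URL
        options=options,
    )
    driver.get("https://example.com")
    print(driver.title)
    driver.quit()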

Set locale/region while crawling

I am trying to crawl information from this website on an AWS machine. Since the machine is hosted in the US, it gives me the price of the product in USD. How can I get the price in INR, the way I see it when I crawl from my local machine?
I normally use Scrapy to crawl the information, but am open to using Selenium or any other tool for this.
I tried using Selenium and setting the browser locale to "en-IN", but that did not help.
I'd use Tor (you can set up a proxy in Selenium and select the desired exit node in Tor).
Changing the locale helps only in some cases, because it's more likely the site is tracking your IP.
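A rough sketch of that setup, assuming a Tor daemon running locally on its default SOCKS port 9050 (the exit node's country is pinned in Tor's own torrc config, e.g. ExitNodes {in}, not in Selenium):

    from selenium import webdriver

    # Route all browser traffic through the local Tor SOCKS proxy.
    options = webdriver.ChromeOptions()
    options.add_argument("--proxy-server=socks5://127.0.0.1:9050")
    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com")  # pages should now see the Tor exit node's IP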

Import.io: some crawlers don't have the button to crawl locally

I was creating some crawlers using import.io, but it seems that for some of them the option to run locally is not showing. Does anyone know why they don't have the run-locally button, or how I can get it added to those crawlers?
If you don't see the option to run the crawler [remotely or locally], it means that your crawler is already running locally only.
When you save a crawler, import.io runs a few checks to see if it can be run remotely on our servers; in some cases this increases the chances of the crawler working, as the servers do additional processing.
If those checks fail, the crawler can only run locally, and will therefore run locally by default.

What are the advantages of using Scrapyd?

The Scrapy docs say:
Scrapy comes with a built-in service, called “Scrapyd”, which allows you to deploy (aka. upload) your projects and control their spiders using a JSON web service.
Are there advantages to using Scrapyd?
Scrapyd allows you to run Scrapy on a different machine than the one you are using, via a handy web API. That means you can just use curl or even a web browser to upload new project versions and run them. Otherwise, if you wanted to run Scrapy in the cloud somewhere, you would have to copy the new spider code over with scp, log in with ssh, and spawn your scrapy crawl myspider by hand.
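As a sketch of that workflow against Scrapyd's JSON API (the host, project name, version, and egg file below are placeholders; the egg is what a tool like scrapyd-client builds for you on deploy):

    import requests

    SCRAPYD = "http://scrapyd-host:6800"  # placeholder address

    # Upload a new project version -- this is what "deploy" means to Scrapyd.
    with open("myproject.egg", "rb") as egg:
        requests.post(
            f"{SCRAPYD}/addversion.json",
            data={"project": "myproject", "version": "r23"},
            files={"egg": egg},
        )

    # Run the freshly deployed spider.
    requests.post(
        f"{SCRAPYD}/schedule.json",
        data={"project": "myproject", "spider": "myspider"},
    )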
Scrapyd will also manage processes for you if you want to run many spiders in parallel. But if you have Scrapy on your local machine, have access to the command line, and just want to run one spider at a time, you're better off running the spider manually.
If you are developing spiders, you certainly don't want to use Scrapyd for quick compile/test iterations, as it just adds a layer of complexity.