I have a process that runs daily and crawls a list of movie prices on Amazon. Because Amazon doesn't expose all prices in their product search API, web crawling is the only way to get the ASINs; I then use those ASINs with their product search API to get the prices.
However, after a few thousand web crawls, Amazon starts throttling and serving a CAPTCHA page, from which I can no longer parse the ASINs.
I'm thinking that a good solution would be to switch IPs to get around the throttling. My service runs Rails on Heroku; is there a good way to implement IP switching?
Found a few solutions with proxies:
https://www.ruby-forum.com/topic/510798
Get past request limit in crawling a web site
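Not Rails-specific, but the core of the proxy approach those links describe is just routing each outbound request through a different egress IP. A minimal sketch, shown in Python for brevity (the proxy hosts and credentials are placeholders for whatever proxy pool or service you use; the same pattern works in Ruby via Net::HTTP's proxy arguments):

```python
import itertools
import requests

# Placeholder proxy endpoints -- substitute whatever proxy pool/service you use.
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXIES)

def fetch(url):
    proxy = next(proxy_cycle)  # different egress IP on each request
    resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    resp.raise_for_status()
    return resp.text
```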
Related
I am trying to scrape LinkedIn profiles and posts. I tried Selenium with WebDriver. It works perfectly, but after some attempts LinkedIn blocks the account or IP address.
Then I tried tools like Phantombuster and Scraper API; they scrape LinkedIn data without getting blocked. So how do these paid tools manage to not get blocked? Any idea?
Commercial scrapers have many ways of scraping websites without getting blocked.
A few methods (a rough sketch of a couple of them follows below):
Change your user agent
Use rotating proxies
Use residential proxies
Captcha farms
Bypassing JS challenges
Throttle requests
Use an API whenever possible
P.S. I do not recommend breaking the ToS and bypassing LinkedIn's blocks.
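Purely as an illustration of two of the items above (rotating the user agent and throttling requests), a minimal sketch might look like this; the user-agent strings and delay values are arbitrary examples, not what any of these paid tools actually use:

```python
import random
import time
import requests

# Example user-agent strings only; real scrapers keep a larger, current pool.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def polite_get(url, min_delay=2.0, max_delay=6.0):
    time.sleep(random.uniform(min_delay, max_delay))       # throttle requests
    headers = {"User-Agent": random.choice(USER_AGENTS)}   # rotate user agent
    return requests.get(url, headers=headers, timeout=30)
```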
I need to develop a Vue.js SPA where some of its pages need to be indexed by search engines.
I've read about multiple ways to make SPAs SEO-friendly and found the following solutions:
Server-rendered pages
Prerendering
Since we have a lot of dynamic content to index, prerendering a static page for each "row" in the database doesn't seem acceptable: we have hundreds, if not thousands, of content pages.
Creating multiple routes (one for users to view and one for bots to crawl)
This solution has been proposed by my manager and it interests me since it's more suitable for our case.
I found this article that illustrates the idea using another SPA framework.
My question here is: how can I detect that a crawler or an indexing bot has accessed our SPA, so I can redirect it to our server-rendered pages, and how do I actually achieve that in Vue.js 2 (Webpack)?
If you are concerned about SEO, the solution is to use SSR. Read through https://v2.vuejs.org/v2/guide/ssr.html#The-Complete-SSR-Guide.
If you already have server-rendered webpages, as you mentioned in your question, then using SSR would be less work than keeping the Vue SPA and server app in sync. Not to mention, if the Vue SPA and server app show different content, this could hurt SEO and frustrate users.
Any method you use to target web crawlers can be bypassed, so there is no generic solution for this. Instead, try to focus on specific web crawlers, like Google's.
For Google, start by registering your app at https://www.google.com/webmasters. If you need to determine if the Google crawler visited your server, look at the user agent string; there are already other answers for that: Is it possible to find when google bot is crawling any urls on my site and record the last access time to a text file on server, https://support.google.com/webmasters/answer/80553?hl=en.
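For the detection part itself, the usual approach is a server-side check on the User-Agent header before deciding which version to serve. A minimal sketch, with an illustrative (not exhaustive) list of bot signatures; note that user agents can be spoofed, which is why Google also suggests verifying Googlebot via a reverse DNS lookup:

```python
# Illustrative bot substrings; not an exhaustive list, and user agents can be spoofed.
BOT_SIGNATURES = ("googlebot", "bingbot", "yandex", "duckduckbot", "baiduspider")

def is_crawler(user_agent):
    ua = (user_agent or "").lower()
    return any(bot in ua for bot in BOT_SIGNATURES)

# In whatever server fronts the SPA (pseudocode):
#   if is_crawler(request.headers.get("User-Agent", "")):
#       return the server-rendered page
#   else:
#       return the SPA shell / bundle
```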
I am asking this here because SoundCloud does not have support. I am going to build a website where people can purchase audio files, using SoundCloud to download the files (and stream before buying). I want to be able to access the download file link through the SoundCloud API without the download link being enabled and showing in the SoundCloud UI. I can't seem to find this info in the SoundCloud API docs. I am going to have a PayPal redirect after the payment to the download link. I know this is a weird way of doing this, but I have certain criteria I have to meet. I would host the audio files on my own server, but they are huge. Anyone have experience with this or can help?
I'm not sure it's possible to do what you want (very easily, at least).
There would be no way for the purchaser to access the 'download' track on SoundCloud directly unless downloads are specifically enabled for that track.
Really, the only way to not host the files and still provide the download would be to use the API to download or proxy the track from SoundCloud to your server using your credentials (because you always have access to your own tracks, download or stream). Mind you, this would use 2x the bandwidth (the server getting the track from SoundCloud, and the client downloading it from you), and storage space would only be impacted on a temporary basis. But this is a pretty hacky way and not really a good/proper solution.
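Just to make that proxy idea concrete (a sketch, not a recommendation): assuming you can obtain an authenticated download or stream URL for your own track through the API, the server-side pass-through would look roughly like this; the URL here is a placeholder, not an actual SoundCloud endpoint:

```python
import requests

def proxy_track(download_url, out_path, chunk_size=256 * 1024):
    """Stream a track you own from SoundCloud to temporary storage on your
    server so it can then be served to the paying customer.
    `download_url` is a placeholder for an authenticated URL obtained via
    the API, not an actual SoundCloud endpoint."""
    with requests.get(download_url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f.write(chunk)
```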
You can:
- compress/re-encode the audio so as not to use as much disk space
- pay for more storage space at your web host; it's usually pretty cheap these days.
So you want to charge for something free? Well, I think all the downloaders out there are middleware that stream the track from SoundCloud and return it to the client as an attachment upon request; one of many examples is http://wittysound.com. The cheapest way to get this done is to provide a direct link to SoundCloud's servers, like http://soundflush.com does.
My SoundCloud app (which makes quite a lot of requests) recently started receiving the error:
429 - Too Many Requests
for all DELETE commands (when I attempt to delete a follow relationship). This was working perfectly last week. Do throttling limits exist for SoundCloud API requests? I looked through the developer's guide but haven't found anything yet. The support page said to post a question here.
A 429 status indicates that your application is making too many requests to the /me/followings endpoint. This isn't throttling per se, but an attempt to deter follow-spam applications.
How many requests are you making and with what frequency? If you back off for a while, are you able to get a successful response?
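If it does turn out to be rate limiting, the usual workaround is to back off and retry when a 429 comes back, honoring the Retry-After header if one is sent; a rough sketch (retry counts and delays are arbitrary):

```python
import time
import requests

def delete_with_backoff(url, headers=None, max_retries=5):
    """Retry a DELETE on 429, honoring Retry-After if the server sends it."""
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.delete(url, headers=headers, timeout=30)
        if resp.status_code != 429:
            return resp
        wait = float(resp.headers.get("Retry-After", delay))
        time.sleep(wait)
        delay *= 2  # exponential backoff between attempts
    return resp
```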
I am working on a research project in which I need to find the in-links for approx. 170K URIs. My first thoughts were to use the Google or Yahoo APIs, but then I realized that Yahoo BOSS is now at Bing, which doesn't seem to support inlink queries, and that Google deprecated its search API a while ago and replaced it with Custom Search, which doesn't seem to support inlink queries over their whole web index.
Are there any alternative solutions I missed? I am looking for a simple Web API that accepts a given URI and returns the inlinks for that URI.
Thx,
B
Update:
Google: to retrieve XML feeds of search results, I need to convert a given custom search engine to Google Site Search, which is a commercial service. Even then, I am only allowed to retrieve inlinks for a pre-defined set of sites, not for the whole web.
Yahoo: Site Explorer unfortunately shut down and moved to Bing.
Bing: in Bing's Webmaster Tools app you can view the inlinks for a specific site, but you cannot query inlinks for arbitrary URIs, because Bing Webmaster Tools doesn't provide an API yet.
The SEOmoz API - if you just need to get the total number of links pointing to a URL, you should be able to use the free version.
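Whichever link-data API you settle on, with ~170K URIs the practical problem is mostly batching and rate limiting. A rough outline, where get_total_inlinks is a hypothetical stand-in for a call to your chosen API (for example the SEOmoz URL-metrics endpoint):

```python
import time

def get_total_inlinks(uri):
    """Hypothetical stand-in for a call to your chosen link-data API
    (e.g. the SEOmoz URL-metrics endpoint); returns an integer count."""
    raise NotImplementedError

def collect_inlink_counts(uris, delay=1.0):
    # With ~170K URIs, go slowly and persist results as you go:
    # free API tiers are typically limited to a few requests per second.
    results = {}
    for uri in uris:
        results[uri] = get_total_inlinks(uri)
        time.sleep(delay)  # stay inside the rate limit
    return results
```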