How many requests can I make to Gmail/Google? - selenium

I have a question; if someone can answer it, I would really appreciate it.
I am taking screenshots of emails in my Gmail inbox, as shown in the picture below:
https://ibb.co/KNMvFsh
These screenshots cannot be taken using the Gmail API, so I am using Selenium for this.
So the question is: how many screenshots can I take from one account in a day? I don't know how many requests I can make before Gmail blocks me.
I don't want to get blocked or hit a CAPTCHA. I am not an experienced guy and am relatively new to this, so I have no idea how many requests I can make without getting blocked.
If any of you know, or have any idea, I'd appreciate it.

You can potentially make an unlimited number of requests if you use proxies.
Simply pass a proxy mapping (one entry per scheme) to your GET request and enjoy:
import requests

# Map each URL scheme to the proxy that should handle requests for it
proxyDict = {
    "http": http_proxy,
    "https": https_proxy,
    "ftp": ftp_proxy,
}
r = requests.get(url, headers=headers, proxies=proxyDict)
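To actually rotate through a list of proxies, here is a minimal sketch (the proxy addresses below are placeholders, not working endpoints):
import random
import requests

# Placeholder proxy addresses -- substitute your own
proxy_list = [
    "http://1.111.111.1:8080",
    "http://2.222.222.2:8080",
]

def get_via_random_proxy(url, headers=None):
    # Pick a different proxy per request so traffic is spread across IPs
    proxy = random.choice(proxy_list)
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy}, timeout=10)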
Also for Selenium, from this answer:
PROXY = "1.111.111.1:8080" #your proxy
chrome_options = WebDriverWait.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
chrome = webdriver.Chrome(chrome_options=chrome_options)
chrome.get("gmail.com")

Related

How to crawl websites without getting blocked?

I crawl websites very often, at a rate of hundreds of requests per hour.
How can I make my crawler's behavior more human-like?
How do I stay off the radar of bot-detection systems?
I am currently crawling sites with Selenium and Chrome.
Kindly suggest.
Well, you will have to pause the script between loop iterations.
import time
time.sleep(1)  # pause for 1 second
time.sleep(N)  # pause for N seconds
So, it could hypothetically work like this.
import json
import time
from string import ascii_lowercase

import pandas as pd
import requests

alldata = []
for c in ascii_lowercase:
    response = requests.get('https://reservia.viarail.ca/GetStations.aspx?q=' + c)
    # Parse the JSON payload into a DataFrame
    df = pd.DataFrame(json.loads(response.text), columns=['sc', 'sn', 'pv'])  # etc.
    time.sleep(3)  # wait 3 seconds before the next request
    alldata.append(df)
Or, look for an API to grab data from the URL you are targeting. You didn't post an actual URL, so it's impossible to say for sure if an API is exposed or not.
There are a lot of ways that sites can detect that you are trying to crawl them. The easiest is probably your IP address: if you make requests too fast from the same IP, you might get blocked. You can introduce (random) delays into your script to appear slower.
To keep going as fast as possible, you will have to use different IP addresses. There are many proxy and VPN services that you can use to accomplish this.
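For illustration, a minimal sketch of randomized delays (the bounds are arbitrary choices, not known-safe rates):
import random
import time

def polite_sleep(low=2.0, high=6.0):
    # Sleep for a random interval so the request cadence looks less mechanical
    time.sleep(random.uniform(low, high))

Call polite_sleep() between requests instead of a fixed time.sleep(1).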

Accessing Metacritic API and/or Scraping

Does anybody know where the documentation for the Metacritic API is, or whether it still works? There used to be a Metacritic API at https://market.mashape.com/byroredux/metacritic-v2#get-user-details which disappeared today.
Otherwise I'm trying to scrape the site myself, but I keep getting blocked by a 429 Slow Down. I got data about 3 times this hour and haven't been able to get any more in the last 20 minutes, which is making testing difficult and the application possibly useless. Please let me know if there's anything else I could be doing to scrape that I don't know about.
I was using that API as well for an app I wrote a while ago. Looks like the creator removed it from Mashape. I just sent him an email to ask whether it'll be back up. I did find this scraper online. It only has a few endpoints but following the examples given you could easily add more. Let me know if you make any progress!
Edit: Looks like CBS requested it to be taken down. The ToS prohibits scraping:
[…] you agree not to do the following, or assist others to do the following:
Engage in unauthorized spidering, “scraping,” data mining or harvesting of Content, or use any other unauthorized automated means to gather data from or about the Services;
Though I was hoping for a JavaScript way of doing this, the creator of the API also told me some info.
He says I was getting blocked for not having a User-Agent in the header, and that I should use a 429-handling procedure, i.e., re-request with progressively longer pauses in between.
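As a rough sketch of that procedure in Python (the User-Agent string is a placeholder, and the backoff schedule is an assumption, not the creator's exact advice):
import time
import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (example scraper)"}  # placeholder UA

def get_with_backoff(url, max_tries=5):
    delay = 5  # initial pause, in seconds
    for _ in range(max_tries):
        resp = requests.get(url, headers=HEADERS)
        if resp.status_code != 429:
            return resp
        time.sleep(delay)  # the server said "slow down": wait, then retry
        delay *= 2         # lengthen the pause after every 429
    resp.raise_for_status()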
A PHP plugin is available as well: http://datalinx.io/shop/metacritic-api/
I had to add a user agent, as JCDJulian said, and now it allows me to scrape. So, for Ruby:
agent = Mechanize.new
agent.user_agent_alias = "Mac Firefox"  # present a browser-like User-Agent
Then it stopped giving me the 403 Forbidden error.

Scrapy response different from browser response

I am trying to scrape this page with Scrapy:
http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=7391
and the response I get is different from what I see in the browser. The browser shows the correct page, while the Scrapy response is the
http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=1
page. I have tried with urllib2 but still have the same issue. Any help is much appreciated.
I don't really understand the issue, but usually a different response for a browser and Scrapy is caused by one of these:
the server analyzes your User-Agent header, and returns a specially crafted page for mobile clients or bots;
the server analyzes the cookies, and does something special when it looks like you are visiting for the first time;
you are trying to make a POST request via Scrapy the way the browser does, but you forgot some form fields or supplied wrong values;
etc.
There is no universal way to determine what's wrong, because it depends on the server logic, which you don't know. If you are lucky, you will analyze and fix all the mentioned issues and get it working.
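For the User-Agent case, a minimal sketch of presenting a browser-like agent in Scrapy (the UA string is a placeholder, and the spider name is hypothetical):
import scrapy

class BarnesSpider(scrapy.Spider):
    name = "barnes"  # hypothetical spider name
    # Replace Scrapy's default User-Agent with a browser-like one
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    }
    start_urls = ["http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=7391"]

    def parse(self, response):
        self.log(response.url)  # check whether the server served the requested page

Cookies and POST bodies can be inspected in the browser's developer tools and replicated the same way.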

How do I get HTTP Headers [Links Only] using a Web Browser in VB.NET?

What I'm trying to achieve is something similar to a Firefox add-on called Live HTTP Headers. I'm not trying to get the headers or cookies, but the links that load on the page itself. Let us assume I visited Mail.Yahoo.com; the add-on would list every URL the page loads.
How can I achieve something similar? Only the links that load on the page itself!
I'm looking forward to reading your suggestions; please enlighten me if you know!
You can download the web page using a WebClient instance.
Then, with the resulting string, you can extract the URLs using a regular expression:
http://www.geekzilla.co.uk/view2D3B0109-C1B2-4B4E-BFFD-E8088CBC85FD.htm
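The question asks for VB.NET, but the same download-then-regex approach is easy to sketch in Python for illustration (the pattern is deliberately naive, and the URL is just an example):
import re
import urllib.request

# Download the page body as text (the WebClient.DownloadString equivalent)
html = urllib.request.urlopen("https://mail.yahoo.com").read().decode("utf-8", "ignore")

# Naive pattern: grab anything that looks like an http(s) URL
urls = re.findall(r'https?://[^\s"\'<>]+', html)
for u in urls:
    print(u)

Note that this only finds URLs present in the HTML source; requests triggered by scripts at runtime will not appear.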

How to prevent an entire website from being downloaded?

There is one IP (from China) that is trying to download my entire website. It downloads all my pages and loads the server significantly (I have more than 500,000 pages). Looking at the access logs, I can tell it's definitely not a Google bot or any other search-engine bot.
I've banned it temporarily (using iptables rules), but that's not a solution for me, because some of my real users share the same IP, so they are also banned and cannot access the website.
Is there any way to prevent this kind of "user activity"? Maybe a mechanism that shows a captcha if you make more than 5 requests a second, or something similar?
P.S. I'm using the Yii framework (PHP).
Any suggestions are greatly appreciated.
Thank you!
You have answered your own question!
Make a captcha appear if the requests exceed a certain number per second or per minute!
You should use CCaptchaAction to implement this.
I guess the best way to monitor for suspicious user activity is really the user session, via CWebUser's getState()/setState(). Store the current request time in the user session, compare it to several previous values, and show a captcha if the user makes requests too often.
Create a new component, preload it via CWebApplication::$preload, and check user activity in the component's init() function. This way you'll be able to turn the bot check on and off easily.
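A framework-agnostic sketch of that timing check, in Python for illustration (in Yii the history list would live in the user session via getState()/setState(), and the thresholds below are arbitrary assumptions):
import time

WINDOW = 10.0      # seconds to look back over
MAX_REQUESTS = 20  # more requests than this inside the window looks like a bot

def should_show_captcha(history):
    # history: timestamps of this user's recent requests, kept in the session
    now = time.time()
    history.append(now)
    # Drop timestamps that have fallen outside the window
    history[:] = [t for t in history if now - t <= WINDOW]
    return len(history) > MAX_REQUESTS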