I crawl websites very often, at a rate of hundreds of requests per hour.
How can I make my crawler behave more like a human?
How can I stay off the radar of bot-detection systems?
I am currently crawling the site with Selenium and Chrome.
Kindly suggest.
Well, you will have to pause the script between loops.
import time

time.sleep(1)  # pause for 1 second
time.sleep(N)  # or pause for N seconds between iterations
So, it could hypothetically work like this.
import json
import time

import pandas as pd
import requests
from string import ascii_lowercase

alldata = []
for c in ascii_lowercase:
    response = requests.get('https://reservia.viarail.ca/GetStations.aspx?q=' + c)
    df = pd.DataFrame(json.loads(response.text), columns=['sc', 'sn', 'pv'])  # etc.
    time.sleep(3)  # wait 3 seconds between requests
    alldata.append(df)
Or, look for an API to grab data from the URL you are targeting. You didn't post an actual URL, so it's impossible to say for sure if an API is exposed or not.
There are a lot of ways that sites can detect that you are trying to crawl them. The easiest is probably your IP address: if you make requests too quickly from the same IP, you might get blocked. You can introduce (random) delays into your script to appear slower.
To keep going as fast as possible, you will have to use different IP addresses. There are many proxy and VPN services that you can use to accomplish this.
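As a minimal sketch of randomized delays with requests (the URLs and the 2-6 second range below are just placeholders):

import random
import time

import requests

urls = ['https://example.com/page/{}'.format(i) for i in range(1, 11)]  # placeholder URLs

for url in urls:
    response = requests.get(url)
    # ... process the response here ...
    time.sleep(random.uniform(2, 6))  # sleep a random 2-6 seconds so the timing looks less robotic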
Related
I have one question; if someone can answer it, I would really appreciate it.
I am taking screenshots of emails in my Gmail inbox, as shown in the picture below:
https://ibb.co/KNMvFsh
These screenshots cannot be taken using the Gmail API, so I am using Selenium for this.
So the question is: how many screenshots can I take per day from one account? I don't know how many requests I can make before it blocks me.
I don't want to get blocked or hit a captcha. I am not an experienced guy, relatively new to this, so I have no idea how many requests I can make without getting blocked.
If any of you know or have any idea, I'd appreciate it.
You can potentially make unlimited requests if you use proxies.
Simply add a list of proxies to your GET request and enjoy:
http_proxy  = "http://10.10.1.10:3128"    # example placeholder proxies
https_proxy = "https://10.10.1.11:1080"
ftp_proxy   = "ftp://10.10.1.10:3128"

proxyDict = {
    "http"  : http_proxy,
    "https" : https_proxy,
    "ftp"   : ftp_proxy
}

r = requests.get(url, headers=headers, proxies=proxyDict)
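If you have a whole list of proxies, a rough sketch of rotating through them per request (the proxy addresses below are placeholders) could look like this:

import random

import requests

# Placeholder proxy list; replace with proxies you actually control or rent
proxies = [
    "http://10.10.1.10:3128",
    "http://10.10.1.11:3128",
    "http://10.10.1.12:3128",
]

def get_with_random_proxy(url):
    proxy = random.choice(proxies)
    proxy_dict = {"http": proxy, "https": proxy}
    # Each request goes out through a different IP, spreading the load
    return requests.get(url, proxies=proxy_dict, timeout=10)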
Also for Selenium, from this answer:
PROXY = "1.111.111.1:8080" #your proxy
chrome_options = WebDriverWait.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
chrome = webdriver.Chrome(chrome_options=chrome_options)
chrome.get("gmail.com")
I would like to limit the number of pages CrawlSpider visits on a website.
How can I stop the Scrapy CrawlSpider after 100 requests?
I believe you can use the closespider extension for that, with the CLOSESPIDER_PAGECOUNT setting. According to the docs:
... specifies the maximum number of responses to crawl. If the spider
crawls more than that, the spider will be closed with the reason
closespider_pagecount
All you would need to do is set this in your settings.py:
CLOSESPIDER_PAGECOUNT = 100
If this doesn't suit your needs, another approach could be writing your own extension using Scrapy's stats module to keep track of the number of requests.
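A rough sketch of such an extension, following the pattern the built-in closespider extension uses (the setting name MAX_RESPONSES and the class name RequestLimit are made up for this example):

from scrapy import signals
from scrapy.exceptions import NotConfigured


class RequestLimit:
    """Close the spider once a configured number of responses has been seen."""

    def __init__(self, crawler, max_responses):
        self.crawler = crawler
        self.max_responses = max_responses
        crawler.signals.connect(self.response_received, signal=signals.response_received)

    @classmethod
    def from_crawler(cls, crawler):
        max_responses = crawler.settings.getint('MAX_RESPONSES', 0)  # made-up setting name
        if not max_responses:
            raise NotConfigured
        return cls(crawler, max_responses)

    def response_received(self, response, request, spider):
        self.crawler.stats.inc_value('custom/responses_seen')
        if self.crawler.stats.get_value('custom/responses_seen', 0) >= self.max_responses:
            self.crawler.engine.close_spider(spider, 'custom_response_limit')

You would then enable it in settings.py via the EXTENSIONS dict, e.g. EXTENSIONS = {'myproject.extensions.RequestLimit': 500}, and set MAX_RESPONSES = 100 (the module path and numbers here are illustrative).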
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep


class Instabot:
    def __init__(self, username, password):
        self.driver = webdriver.Chrome("/Users/Aleti Sunil/Downloads/chromedriver_win32/chromedriver.exe")
        self.driver.get("https://www.instagram.com/")
        sleep(2)
        self.driver.find_element_by_xpath("//input[@name=\"username\"]").send_keys(username)
        self.driver.find_element_by_xpath("//input[@name=\"password\"]").send_keys(password)
        self.driver.find_element_by_xpath("//button[@type=\"submit\"]").click()
        sleep(10)
        self.driver.find_element_by_xpath("//button[text()='Not Now']").click()
        self.driver.find_element_by_xpath("//a[contains(@href,'/{}')]".format(username)).click()
        usrlink = "https://www.instagram.com/" + username + "/"
        self.driver.get(usrlink)
        self.driver.find_element_by_xpath("//a[contains(@href,'following')]").click()
        self.driver.execute_script("window.scrollTo(0,500)")
        sleep(20)


Instabot('sunil_aleti', 'password')
I'm unable to scroll that dialog box, so I couldn't get the followers list.
Please help me.
Use this code, it may help you:
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(self.driver, 15)
wait.until(lambda d: d.find_element_by_css_selector('div[role="dialog"]'))

# now scroll the followers dialog
self.driver.execute_script('''
    var fDialog = document.querySelector('div[role="dialog"] .isgrP');
    fDialog.scrollTop = fDialog.scrollHeight;
''')
To scroll the list down, use the code below. If it's not working, change the XPath to match the parent div of the list.
followers_panel = browser.find_element_by_xpath('/html/body/div[5]/div/div/div[2]')
browser.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight",followers_panel)
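If a single scroll doesn't load everyone, one option (reusing the same browser and followers_panel from above) is to keep scrolling until the scroll height stops growing; this is only a sketch:

import time

# Keep scrolling the panel until no new followers are loaded
last_height = browser.execute_script("return arguments[0].scrollHeight", followers_panel)
while True:
    browser.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight", followers_panel)
    time.sleep(2)  # give Instagram time to load the next batch
    new_height = browser.execute_script("return arguments[0].scrollHeight", followers_panel)
    if new_height == last_height:
        break
    last_height = new_height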
This is considered bad practice in Selenium (please check the SeleniumHQ site):
For multiple reasons, logging into sites like Gmail and Facebook using WebDriver is not recommended. Aside from being against the usage terms for these sites (where you risk having the account shut down), it is slow and unreliable.
The ideal practice is to use the APIs that email providers offer, or in the case of Facebook the developer tools service which exposes an API for creating test accounts, friends, and so forth. Although using an API might seem like a bit of extra hard work, you will be paid back in speed, reliability, and stability. The API is also unlikely to change, whereas webpages and HTML locators change often and require you to update your test framework.
Logging in to third-party sites using WebDriver at any point of your test increases the risk of your test failing because it makes your test longer. A general rule of thumb is that longer tests are more fragile and unreliable.
WebDriver implementations that are W3C conformant also annotate the navigator object with a WebDriver property so that Denial of Service attacks can be mitigated.
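As an illustration, page scripts can read that flag directly; here is a small sketch that checks it from Selenium itself (the URL is just an example):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")

# A W3C-conformant driver exposes navigator.webdriver = true,
# which page scripts can inspect to spot automated browsers
print(driver.execute_script("return navigator.webdriver"))  # typically True under Selenium

driver.quit()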
I am trying to scrape this page with Scrapy:
http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=7391
and the response I get is different from what I see in the browser. The browser shows the correct page, while the Scrapy response is the
http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=1
page. I have tried with urllib2 but still have the same issue. Any help is much appreciated.
I don't really understand the issue, but usually a different response for a browser and Scrapy is caused by one of these:
the server analyzes your User-Agent header and returns a specially crafted page for mobile clients or bots;
the server analyzes the cookies and does something special when it looks like you are visiting for the first time;
you are trying to make a POST request via Scrapy like the browser does, but you forgot some form fields or put in wrong values;
etc.
There is no universal way to determine what's wrong, because it depends on the server logic, which you don't know. If you are lucky, you will analyze and fix all the mentioned issues and make it work. The User-Agent theory is usually the quickest to test, as sketched below.
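A minimal way to test that from Scrapy (the spider name is made up; the URL is the one from the question) is to send a browser-like User-Agent and compare the result:

import scrapy


class BnTestSpider(scrapy.Spider):
    name = "bn_test"  # illustrative name

    def start_requests(self):
        url = "http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=7391"
        headers = {
            # A desktop browser User-Agent, so the server is less likely to serve a bot/mobile page
            "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                           "Chrome/100.0.0.0 Safari/537.36"),
        }
        yield scrapy.Request(url, headers=headers, callback=self.parse)

    def parse(self, response):
        # Compare this with what the browser shows
        self.logger.info("Got %s", response.url)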
I started with Scrapy some days ago and learned about scraping particular sites, i.e. the dmoz.org example; so far it's fine and I like it. As I want to learn about search engine development, I aim to build a crawler (and storage, indexer, etc.) for a large number of websites of any "color" and content.
So far I have also tried depth-first-order and breadth-first-order crawling.
At the moment I use just one Rule; I set some paths to skip and some domains to deny.
Rule(SgmlLinkExtractor(deny=path_deny_base, deny_domains=deny_domains),
     callback='save_page', follow=True),
I have one pipeline, a MySQL storage, to store the URL, body, and headers of the downloaded pages, done via a PageItem with these fields.
My questions for now are:
Is it fine to use an Item for simply storing pages?
How does the spider check the database to see whether a page has already been crawled (in the last six months, for example)? Is that built in somehow?
Is there something like a blacklist for useless domains, i.e. placeholder domains, link farms, etc.?
There are many other issues, like storage, but I guess I'll stop here, with just one more general search engine question:
Is there a way to obtain crawl result data from other professional crawlers? Of course it would have to be done by shipping hard disks, otherwise the data volume would be the same as if I crawled them myself (compression left aside).
I will try to answer only two of your questions:
Is it fine to use an Item for simply storing pages?
AFAIK, Scrapy doesn't care what you put into an Item's fields. Only your pipelines will be dealing with them.
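For example, a minimal PageItem along the lines you describe (the field names are assumed from your question) is just:

import scrapy


class PageItem(scrapy.Item):
    # Plain fields; Scrapy does not constrain what you store in them
    url = scrapy.Field()
    headers = scrapy.Field()
    body = scrapy.Field()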
How does the spider check the database to see whether a page has already been crawled (in the last six months, for example)? Is that built in somehow?
Scrapy has a duplicates middleware, but it filters duplicates only within the current session. You have to prevent Scrapy yourself from re-crawling sites you crawled six months ago, for example with a custom duplicates filter backed by your database, as sketched below.
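A rough sketch of that, assuming you already store crawled URLs with a timestamp in your MySQL table (the helper crawled_recently and the column names are placeholders):

from datetime import datetime, timedelta

from scrapy.dupefilters import RFPDupeFilter


def crawled_recently(url, cutoff):
    # Placeholder: replace with a real MySQL lookup, e.g.
    # SELECT 1 FROM pages WHERE url = %s AND crawled_at > %s LIMIT 1
    return False


class DbAwareDupeFilter(RFPDupeFilter):
    """Also skip requests whose URL was stored in the DB within the last six months."""

    def request_seen(self, request):
        if super().request_seen(request):
            return True
        cutoff = datetime.utcnow() - timedelta(days=180)
        return crawled_recently(request.url, cutoff)

You would point Scrapy at it with DUPEFILTER_CLASS = 'myproject.dupefilters.DbAwareDupeFilter' in settings.py (the module path is illustrative).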
As for questions 3 and 4: I don't understand them.