How to detect captchas when scraping Google? - BeautifulSoup

I'm using the requests package with BeautifulSoup to scrape Google News for the number of search results for a query. I'm getting two types of IndexError, which I want to distinguish between:
The first occurs when the search results are empty. Here selecting #resultStats returns an empty list, []. What seems to be going on is that when a query string is too long, Google doesn't even say "0 search results"; it just doesn't say anything.
The second IndexError occurs when Google serves me a captcha.
I need to distinguish between these cases, because I want my scraper to wait five minutes when Google sends me a captcha, but not when the results are simply empty.
I currently have a jury-rigged approach: I send a second query with a known nonzero number of search results, which lets me tell the two IndexErrors apart. I'm wondering whether there's a more elegant and direct way to do this with BeautifulSoup.
Here's my code:
import requests, bs4, lxml, re, time, random
import pandas as pd
import numpy as np

URL = 'https://www.google.com/search?tbm=nws&q={query}&tbs=cdr%3A1%2Ccd_min%3A{year}%2Ccd_max%3A{year}&authuser=0'

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}

def tester():  # test for a captcha
    test = requests.get('https://www.google.ca/search?q=donald+trump&safe=off&client=ubuntu&espv=2&biw=1910&bih=969&source=lnt&tbs=cdr%3A1%2Ccd_min%3A2016%2Ccd_max%3A&tbm=nws', headers=headers)
    dump = bs4.BeautifulSoup(test.text, "lxml")
    result = dump.select('#resultStats')
    num = result[0].getText()  # raises IndexError when captcha'd
    num = re.search(r"\b\d[\d,.]*\b", num).group()  # extract the hit count
    num = int(num.replace(',', ''))
    return num > 0

def search(**params):
    response = requests.get(URL.format(**params), headers=headers)
    print(response.content, response.status_code)  # check this for Google requiring a captcha
    soup = bs4.BeautifulSoup(response.text, "lxml")
    elems = soup.select('#resultStats')
    try:  # want code to flag if I get a captcha
        hits = elems[0].getText()
        hits = re.search(r"\b\d[\d,.]*\b", hits).group()  # extract the hit count
        hits = int(hits.replace(',', ''))
        print(hits)
        return hits
    except IndexError:
        try:
            tester()  # if captcha'd, this raises another IndexError
            print("Empty results!")
            return 0
        except IndexError:
            print("Captcha'd!")
            time.sleep(300)  # back off for five minutes when captcha'd
            return 0

for qry in queries:  # queries: your iterable of query strings (renamed from `list`)
    hits = search(query=qry, year=2016)

I'd just search for a captcha-specific element. For example, if this is Google reCAPTCHA, you can look for the hidden input containing the token:
is_captcha_on_page = soup.find("input", id="recaptcha-token") is not None
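Building on that check, a minimal stdlib-only sketch of telling the three cases apart before any parsing; the helper name is hypothetical, and a plain regex test stands in for the BeautifulSoup lookup:

```python
import re

def classify_google_page(html):
    """Classify a Google results page as 'captcha', 'empty', or 'results'.

    reCAPTCHA pages carry a hidden input with id="recaptcha-token";
    normal result pages carry the #resultStats counter.
    """
    if re.search(r'id="recaptcha-token"', html):
        return 'captcha'
    if re.search(r'id="resultStats"', html):
        return 'results'
    return 'empty'  # no result counter and no captcha marker

# Usage: branch on the page type instead of catching IndexError twice.
page_type = classify_google_page('<input type="hidden" id="recaptcha-token" value="x">')
```

This removes the need for the extra tester() request: the page itself tells you which case you are in.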

Related

BeautifulSoup: loop only does it once

I'm trying to finalize my scraping loop, but as I'm not a programmer it's hard to debug.
My code grabs some information from a site and puts it in a list. That works fine for the first page of the loop, but what I expect is a list covering, let's say, the first 5 web pages. Unfortunately it stops after the first iteration. I tried various indentations but nothing worked.
My code is the following:
headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
n_pages = 0
for page in range(0,5):
    n_pages += 1
    sapo_url = 'https://casa.sapo.pt/comprar-moradias/?lp=20000&gp=80000'+'&pn='+str(page)
    print(sapo_url)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = []
    for e in page_html.select('div.property'):
        d = {'link':'https://casa.sapo.pt'+e.a.get('href')}
        d.update(json.loads(e.script.string))
        house_containers.append(d)
    else:
        break
    sleep(randint(1,2))
print('You scraped {} pages containing {} properties.'.format(n_pages, len(name)))
What do I have to change to loop through the five pages stated with range(0,5)?
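The likely culprit is the else attached to the inner for loop: in Python, a for/else's else block runs whenever the loop finishes without hitting break, so here the break fires after every page and kills the outer loop on its first pass. A minimal, self-contained demonstration of the pitfall (no scraping, just the control flow):

```python
pages_buggy = []
# Buggy shape: the else belongs to the inner for, and a for/else's
# else runs whenever the inner loop completes WITHOUT break --
# so the break below fires after the very first page.
for page in range(5):
    for item in ['a', 'b']:
        pass  # pretend to scrape the items on this page
    else:
        break
    pages_buggy.append(page)

# Fixed shape: drop the else/break entirely.
pages_fixed = []
for page in range(5):
    for item in ['a', 'b']:
        pass
    pages_fixed.append(page)
```

Removing the else/break (and moving house_containers = [] above the outer loop, so each page's results are kept) lets all five pages be scraped.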

Not able to scrape Forbes with BeautifulSoup

I am trying to scrape the table from this website: https://www.forbes.com/fintech/2019/#ef5484d2b4c6
When I tried to find the main table, it came back without the table in it; all I get is an effectively empty div:
url = 'https://www.forbes.com/fintech/2019/#ef5484d2b4c6'
source = urlopen(url).read()
soup = BeautifulSoup(source)
div1 = soup.find('div', attrs={'id':'main-content'})
div1  # returns "" - no table inside
Here's the data I am looking for: [screenshots of the high-level section and the table I would like to scrape]
You'd have to use Selenium to let the page render first, then grab that HTML source and parse it with BeautifulSoup. Or you can access that data through their API's JSON response.
The link you're trying to get is not in that JSON response, but it appears to follow the same format/structure:
import requests

url = 'https://www.forbes.com/forbesapi/org/fintech/2019/position/true.json?limit=2000'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}

data = requests.get(url, headers=headers).json()
for each in data['organizationList']['organizationsLists']:
    orgName = each['organizationName']
    link = 'https://www.forbes.com/companies/%s/?list=fintech' % each['uri']
    print('%-20s %s' % (orgName, link))
Output:
Acorns https://www.forbes.com/companies/acorns/?list=fintech
Addepar https://www.forbes.com/companies/addepar/?list=fintech
Affirm https://www.forbes.com/companies/affirm/?list=fintech
Axoni https://www.forbes.com/companies/axoni/?list=fintech
Ayasdi https://www.forbes.com/companies/ayasdi/?list=fintech
Behavox https://www.forbes.com/companies/behavox/?list=fintech
Betterment https://www.forbes.com/companies/betterment/?list=fintech
Bitfury https://www.forbes.com/companies/bitfury/?list=fintech
Blend https://www.forbes.com/companies/blend/?list=fintech
Bolt https://www.forbes.com/companies/bolt/?list=fintech
Brex https://www.forbes.com/companies/brex/?list=fintech
Cadre https://www.forbes.com/companies/cadre/?list=fintech
Carta https://www.forbes.com/companies/carta/?list=fintech
Chime https://www.forbes.com/companies/chime/?list=fintech
Circle https://www.forbes.com/companies/circle/?list=fintech
Coinbase https://www.forbes.com/companies/coinbase/?list=fintech
Credit Karma https://www.forbes.com/companies/credit-karma/?list=fintech
Cross River https://www.forbes.com/companies/cross-river/?list=fintech
Digital Reasoning https://www.forbes.com/companies/digital-reasoning/?list=fintech
Earnin https://www.forbes.com/companies/earnin/?list=fintech
Enigma https://www.forbes.com/companies/enigma/?list=fintech
Even https://www.forbes.com/companies/even/?list=fintech
Flywire https://www.forbes.com/companies/flywire/?list=fintech
Forter https://www.forbes.com/companies/forter/?list=fintech
Fundrise https://www.forbes.com/companies/fundrise/?list=fintech
Gemini https://www.forbes.com/companies/gemini/?list=fintech
Guideline https://www.forbes.com/companies/guideline/?list=fintech
iCapital Network https://www.forbes.com/companies/icapital-network/?list=fintech
IEX Group https://www.forbes.com/companies/iex-group/?list=fintech
Kabbage https://www.forbes.com/companies/kabbage/?list=fintech
Lemonade https://www.forbes.com/companies/lemonade/?list=fintech
LendingHome https://www.forbes.com/companies/lendinghome/?list=fintech
Marqeta https://www.forbes.com/companies/marqeta/?list=fintech
Nova Credit https://www.forbes.com/companies/nova-credit/?list=fintech
Opendoor https://www.forbes.com/companies/opendoor/?list=fintech
Personal Capital https://www.forbes.com/companies/personal-capital/?list=fintech
Plaid https://www.forbes.com/companies/plaid/?list=fintech
Poynt https://www.forbes.com/companies/poynt/?list=fintech
Remitly https://www.forbes.com/companies/remitly/?list=fintech
Ripple https://www.forbes.com/companies/ripple/?list=fintech
Robinhood https://www.forbes.com/companies/robinhood/?list=fintech
Roofstock https://www.forbes.com/companies/roofstock/?list=fintech
Root Insurance https://www.forbes.com/companies/root-insurance/?list=fintech
Stash https://www.forbes.com/companies/stash/?list=fintech
Stripe https://www.forbes.com/companies/stripe/?list=fintech
Symphony https://www.forbes.com/companies/symphony/?list=fintech
Tala https://www.forbes.com/companies/tala/?list=fintech
Toast https://www.forbes.com/companies/toast/?list=fintech
Tradeshift https://www.forbes.com/companies/tradeshift/?list=fintech
TransferWise https://www.forbes.com/companies/transferwise/?list=fintech
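The traversal and link-building logic can be exercised offline against a small stub of the response shape. The keys (organizationList, organizationsLists, organizationName, uri) are taken from the answer's code; the stub data itself is invented:

```python
# Stub mimicking the shape of the Forbes API JSON (values are invented).
data = {
    'organizationList': {
        'organizationsLists': [
            {'organizationName': 'Acorns', 'uri': 'acorns'},
            {'organizationName': 'Stripe', 'uri': 'stripe'},
        ]
    }
}

links = []
for each in data['organizationList']['organizationsLists']:
    org_name = each['organizationName']
    link = 'https://www.forbes.com/companies/%s/?list=fintech' % each['uri']
    links.append((org_name, link))
```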

Need to return Scrapy callback method data to calling function

In the code below I am trying to collect email ids from a website. They can be on the contact or about-us page.
From the parse method I follow up with the extemail method for all those pages.
From every page I collect a few email ids.
Now I need to print them together with the original record passed to the __init__ method.
For example:
record = "https://www.wockenfusscandies.com/"
I want to print the output as:
https://www.wockenfusscandies.com/|abc#gamil.com|def#outlook.com
I am not able to store them in self.emails and deliver them back to the __init__ method.
Please help.
import scrapy
from scrapy.crawler import CrawlerProcess

class EmailSpider(scrapy.Spider):
    def __init__(self, record):
        self.record = record
        self.emails = []
        url = record.split("|")[4]
        if not url.startswith("http"):
            url = "http://{}".format(url)
        if url:
            self.start_urls = ["https://www.wockenfusscandies.com/"]
        else:
            self.start_urls = []

    def parse(self, response):
        contact_list = [a.attrib['href'] for a in response.css('a')
                        if 'contact' in a.attrib['href'] or 'about' in a.attrib['href']]
        contact_list.append(response.request.url)
        for fllink in contact_list:
            yield response.follow(fllink, self.extemail)

    def extemail(self, response):
        emails = response.css('body').re('[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
        yield {
            'emails': emails
        }

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

f = open("/Users/kalpesh/work/data/test.csv")
for rec in f:
    process.crawl(EmailSpider, record=rec)
f.close()
process.start()
If I understand your intent correctly, you could try the following approach:
a) collect the mail ids in self.emails, like
def extemail(self, response):
    emails = response.css('body').re('[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')
    self.emails.extend(emails)  # accumulate across pages instead of overwriting
    yield {
        'emails': emails
    }
(or however else you extract the email ids from emails)
b) add a close(self, reason) method, as in the GitHub example, which is called when the spider has finished:
def close(self, reason):
    mails_for_record = ""
    for mail in self.emails:
        mails_for_record += mail + "|"
    print(self.record + mails_for_record)
Please also note: I read somewhere that for some versions of Scrapy it is def close(self, reason), for others it is def closed(self, reason).
Hope this approach helps you.
You should visit all of a site's pages before yielding the result for that site.
This means you should have a queue of pages to visit and a results storage.
It can be done using meta.
Some pseudocode:
def parse(self, response):
    meta = response.meta
    if not meta.get('seen'):
        # -- find the urls of the contact and about-us pages --
        # -- put them into meta['queue'] --
        # -- set meta['seen'] = True
    page_emails_found = ...getting emails here...
    # --- extend the already discovered emails (from other pages,
    # --- or an initial empty list) with the new ones
    meta['emails'].extend(page_emails_found)
    # if the queue isn't empty - yield a new request
    if meta['queue']:
        next_url = meta['queue'].pop()
        yield Request(next_url, callback=self.parse, meta=copy(meta))
    # if the queue is empty - yield the result from meta
    else:
        yield {'url': current_domain, 'emails': meta['emails']}
Something like this.
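The queue-and-accumulator idea can be checked outside Scrapy with a small simulation: a made-up site map stands in for the responses, and the walk mirrors the pseudocode's meta handling. All names and data here are invented for illustration:

```python
def crawl_site(site, start_url):
    """Walk a site dict {url: (linked_urls, emails)} the way the
    pseudocode walks meta['queue'], accumulating emails as it goes."""
    queue = []    # plays the role of meta['queue']
    emails = []   # plays the role of meta['emails']
    seen = set()  # plays the role of meta['seen']
    url = start_url
    while url is not None:
        links, page_emails = site[url]
        seen.add(url)
        for link in links:  # discover contact/about pages once
            if link not in seen and link not in queue:
                queue.append(link)
        emails.extend(page_emails)
        url = queue.pop() if queue else None  # yield result when queue empties
    return {'url': start_url, 'emails': emails}

# Invented site map: the start page links to two subpages with emails.
site = {
    'https://example.com/': (['https://example.com/contact', 'https://example.com/about'], []),
    'https://example.com/contact': ([], ['a@example.com']),
    'https://example.com/about': ([], ['b@example.com']),
}
result = crawl_site(site, 'https://example.com/')
```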

How to get website to consistently return content from a GET request when it's inconsistent?

I posted a similar question earlier but I think this is a more refined question.
I'm trying to scrape: https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=&PlayerMovementChkBx=yes&submit=Search&start=0
My code randomly throws errors when I send a GET request to the URL. After debugging, I saw the following happen. A GET request will be sent for a URL like this (an example; it could happen on any page): https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=&PlayerMovementChkBx=yes&submit=Search&start=2400
The web page will then say "There were no matching transactions found." However, if I refresh the page, the content then loads. I'm using BeautifulSoup and Selenium and have put sleep statements in my code in the hope that it would work, but to no avail. Is this a problem on the website's end? It doesn't make sense to me how one GET request returns nothing while the exact same request returns something. Also, is there anything I can do to fix it, or is it out of my control?
Here is a sample of my code:
def scrapeWebsite(url, start, stop):
    driver = webdriver.Chrome(executable_path='/Users/Downloads/chromedriver')
    print(start, stop)
    madeDict = {"Date": [], "Team": [], "Name": [], "Relinquished": [], "Notes": []}
    #for i in range(0, 214025, 25):
    for i in range(start, stop, 25):
        print("Current Page: " + str(i))
        currUrl = url + str(i)
        driver.get(currUrl)
        # Sleep to give the dynamic content time to load
        time.sleep(1)
        soupPage = BeautifulSoup(driver.page_source, 'html.parser')
        info = soupPage.find("table", attrs={'class': 'datatable center'})
        time.sleep(1)
        extractedInfo = info.findAll("td")
The error occurs at the last line: the findAll call fails because info is None when the GET request returned an empty page.
I worked around it and scraped every page using try/except.
The loop probably sends requests faster than the page can handle.
See the example below; it worked like a charm:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=' \
      '&PlayerMovementChkBx=yes&submit=Search&start=%s'

def scrape(start=0, stop=214525):
    for page in range(start, stop, 25):
        current_url = URL % page
        print('scrape: current %s' % page)
        while True:  # retry until the page actually returns the table
            try:
                response = requests.request('GET', current_url)
                if response.ok:
                    soup = BeautifulSoup(response.content.decode('utf-8'), features='html.parser')
                    table = soup.find("table", attrs={'class': 'datatable center'})
                    trs = table.find_all('tr')
                    slice_pos = 1 if page > 0 else 0  # skip the header row after the first page
                    for tr in trs[slice_pos:]:
                        yield tr.find_all('td')
                    break
            except Exception as exception:
                print(exception)  # table was None - retry this page

for columns in scrape():
    values = [column.text.strip() for column in columns]
    # Continue your code ...
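A bare while True/except loop can hammer the server while the empty response keeps coming back. A small generic retry helper with a delay and an attempt cap is one way to soften that; the helper and its names are illustrative, not part of the original answer, and the fake fetcher below stands in for the real request:

```python
import time

def retry(fn, attempts=5, delay=0.01):
    """Call fn() until it returns a non-None value or attempts run out.

    Sleeps `delay` seconds between tries; returns None if every try fails.
    An exception from fn() is treated the same as an empty page.
    """
    for _ in range(attempts):
        try:
            result = fn()
            if result is not None:
                return result
        except Exception:
            pass
        time.sleep(delay)
    return None

# Fake fetcher standing in for the real request: fails twice, then succeeds.
calls = {'n': 0}
def flaky_fetch():
    calls['n'] += 1
    return None if calls['n'] < 3 else '<table>...</table>'

page = retry(flaky_fetch)
```

In the real scraper, fn would wrap the requests.get call and return None whenever the table is missing from the parsed page.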

Web page is showing weird unicode(?) letters: \u200e

How can I remove that? I've tried so many things and I'm exhausted from trying to beat this error on my own. I spent the last 3 hours looking at it, and I surrender to this code. Please help.
The first for loop grabs article titles from news.google.com.
The second for loop grabs the time of submission of each of those articles on news.google.com.
This is on Django, by the way, and the page shows the article titles and their submission times in a list going down. The weird unicode letters pop up from the second for loop, i.e. the submission times. Here is my views.py:
def articles(request):
    """ Grabs the most recent articles from the main news page """
    import bs4, requests
    list = []
    list2 = []
    url = 'https://news.google.com/'
    r = requests.get(url)
    try:
        r.raise_for_status() == True
    except ValueError:
        print('Something went wrong.')
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    for listarticles in soup.find_all('h2', 'esc-lead-article-title'):
        if listarticles is not None:
            a = listarticles.text
            list.append(a)
    for articles_times in soup.find_all('span', 'al-attribution-timestamp'):
        if articles_times is not None:
            b = articles_times.text
            list2.append(b)
    list = zip(list, list2)
    context = {'list': list}
    return render(request, 'newz/articles.html', context)
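\u200e is U+200E LEFT-TO-RIGHT MARK, an invisible bidirectional control character that Google embeds in the timestamp text. One common fix is simply stripping it (and its siblings) from the scraped string before storing it; a minimal sketch, with an invented sample string:

```python
# U+200E (LEFT-TO-RIGHT MARK) and similar bidi/zero-width control
# characters are invisible; map them to None so translate() removes them.
BIDI_MARKS = dict.fromkeys(map(ord, '\u200e\u200f\u200b'))

def clean_text(s):
    return s.translate(BIDI_MARKS).strip()

b = clean_text('\u200e3 hours ago\u200e')
```

In the second for loop this would become b = clean_text(articles_times.text) before list2.append(b).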