BeautifulSoup: loop only does it once - beautifulsoup

I try to finalize my scrapping loop but as I am a non-programmer it is hard to debug.
My code grabs some information from a site and puts it in a list. That works fine for the first page of the loop. But what I expect is a list for lets say the first 5 webpages. Unfortunately it stops after the first loop. I tried various indent but nothing worked.
My code is the following:
headers = ({'User-Agent':
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
n_pages = 0
for page in range(0,5):
n_pages += 1
sapo_url = 'https://casa.sapo.pt/comprar-moradias/?lp=20000&gp=80000'+'&pn='+str(page)
print(sapo_url)
r = get(sapo_url, headers=headers)
page_html = BeautifulSoup(r.text, 'html.parser')
house_containers = []
for e in page_html.select('div.property'):
d = {'link':'https://casa.sapo.pt'+e.a.get('href')}
d.update(json.loads(e.script.string))
house_containers.append(d)
else:
break
sleep(randint(1,2))
print('You scraped {} pages containing {} properties.'.format(n_pages, len(name)))
What do I have to change to loop through the five pages stated with range(0,5)?

Related

How do i handle this dropdown using selenium with no select class?

I am trying to access this website and after that, I want to click on the Status dropdown and select Active from the dropdown. I think there are no select tags to be used hence the traditional select method is not working. Would be really helpful if anyone can have a look at it!
"""
PATH = "/usr/local/bin/chromedriver"
driver = webdriver.Chrome(PATH)
driver.get("https://www.silveroaksp.com/portfolio")
driver.find_element_by_xpath("//div[#class='btn-group show']").click()
time.sleep(2)
driver.close()
"""
*To capture all the Webelements in dropdown we use driver.findElements for multiple elements(this return element in the list).
*You can use to method for this. click on Feild and then click on active Web element, Second is you put it in a for each loop and in IFELSE condition put click
1.
PATH = "/usr/local/bin/chromedriver"
driver = webdriver.Chrome(PATH)
driver.get("https://www.silveroaksp.com/portfolio")
driver.find_element_by_xpath("//select[#id='StatusMultiselect']/following-
sibling::div").click()
time.sleep(2)
driver.find_element_by_xpath("//select[#id='StatusMultiselect']/following-
sibling::div/ul/li[2]/a/label/input").click()
time.sleep(2)
driver.close()
*This'll work I guess. IF doesn't let me know I'll provide a second option
This problem can be solved with Requests & Pandas:
import requests
from bs4 import BeautifulSoup
import pandas as pd
final_list = []
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:23.0) Gecko/20131011 Firefox/23.0',
'Host': 'www.silveroaksp.com',
'Accept': 'text/html, */*; q=0.01',
'Accept-Encoding': 'gzip, deflate, br',
'Referer': 'https://www.silveroaksp.com/portfolio',
'Content-Type': 'application/json; charset=utf-8',
'X-Requested-With': 'XMLHttpRequest'
}
data = '{"prmStatusIds":"PortStatus1","prmIndustryIds":"","prmFundIds":"","AllStatusClickChecker":"0","AllIndustryClickChecker":"1","AllFundClickChecker":"1"}'
with requests.Session() as s:
s.get('https://www.silveroaksp.com/portfolio')
r = s.post('https://www.silveroaksp.com/CommonItem/FilterPort', data=data, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
actives = soup.select('div.col-lg-4')
for active in actives:
business = active.select_one('img')['alt']
link = active.get('id')
r = s.post('https://www.silveroaksp.com/CommonItem/PortfolioInfo', data = '{"PortUrl":"' + link + '","Status":"all"}', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
try:
headq = soup.find("label", string="Headquarters").find_next_sibling('span').text.strip()
except Exception as e:
headq = 'unkown'
try:
industry = soup.find("label", string="Industry").find_next_sibling('span').text.strip()
except Exception as e:
industry = 'unkown'
try:
aquired = soup.find("label", string="Acquired").find_next_sibling('span').text.strip()
except Exception as e:
aquired = 'unkown'
try:
website = soup.find("label", string="Website").find_next_sibling('span').text.strip()
except Exception as e:
website = 'unkown'
try:
title = soup.select_one('div.about-author').select_one('h4').text.strip()
except Exception as e:
title = 'unkown'
try:
description = ' '.join([x.text.strip() for x in soup.select_one('div.about-author').select('p')])
except Exception as e:
description = 'unkown'
try:
headline = soup.select_one('div.about-author').select_one('strong').text.strip()
except Exception as e:
headline = 'unkown'
final_list.append([headq, industry, aquired, website, title, description, headline])
final_df = pd.DataFrame(final_list, columns = ['headq', 'industry', 'aquired', 'website', 'title', 'description', 'headline'])
final_df.head()
This returns:
headq industry aquired website title description headline
0 Chicago, Illinois Business Services March 2018 www.brilliantfs.com Brilliant Brilliant is a leading provider of temporary a... Brilliant is actively looking for add-on acqui...
1 New York, New York Consumer Services November 2017 caringpeopleinc.com Caring People Founded in 1998, Caring People is a leading ho... Caring People is actively looking for add-on a...
2 Denver, Colorado Business Services December 2021 ccsbts.com CCS Facility Services CCS Facility Services is a leading provider of... CCS is actively looking for add-on acquisition...
3 Atlanta, GA Consumer Services August 2021 unkown Drive Automotive Services Drive, headquartered in Atlanta, GA, is a lead... Drive is actively looking for add-on acquisiti...
4 Woburn, Massachusetts Healthcare Services December 2020 www.geriatricmedical.com Geriatric Medical Founded in 1945, Geriatric Medical is a leadin... Geriatric Medical is actively looking for add-...

scrapy splash take screen shot of entire page

How can I modify the args so it captures the entire page
def start_requests(self):
url =#some url
splash_args = {
'html': 1,
'png': 1,
'width': 600,
}
yield SplashRequest(url=url, callback=self.parse,
endpoint="render.json",
args=splash_args)
def parse(self, response):
imgdata = base64.b64decode(response.data['png'])
filename = 'image.png'
with open(filename, 'wb') as f:
f.write(imgdata)
I tried adding 'height' in splash_args the image does get width*height but the extra height is blank, is there any way to solve this?
You can capture the entire page by adding following line to your Lua script
splash:set_viewport_full()
You can add it to the main spider:
from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
class DmozSpider(Spider):
name = "dmoz"
allowed_domains = ["www.dmoz.org"]
script = '''
function main(splash, args)
headers = {
["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
}
splash:set_custom_headers(headers)
splash.private_mode_enabled = false
url=args.url
assert(splash:go(url))
assert(splash:wait(5))
splash:set_viewport_full()
return {
image = splash:png(),
html = splash:html()
}
end
'''
def start_requests(self):
yield SplashRequest(url="http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
callback=self.parse,
endpoint="execute",
args={"lua_source": self.script})

Not able to scrape Forbes with BeautifulSoup

I am trying to scrape the table from this website "https://www.forbes.com/fintech/2019/#ef5484d2b4c6"
When i tried to find the main table, it returned the following without the table
Blockquote ""
error message received
url = 'https://www.forbes.com/fintech/2019/#ef5484d2b4c6'
source = urlopen(url).read()
soup = BeautifulSoup(source)
div1 = soup.find('div', attrs={'id':'main-content'})
div1
quote:""
Here's the data i am looking for:
high level section of data
Data i would like to scrape
You'd have to use Selenium to let the page render first, then you could grab that html source and parse with beautifulsoup. Or, you can access that data through their api json response.
the link you'are trying to get is is not in that json response though, but it appears to follow the same format/structure:
import requests
url = 'https://www.forbes.com/forbesapi/org/fintech/2019/position/true.json?limit=2000'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
data = requests.get(url, headers=headers).json()
for each in data['organizationList']['organizationsLists']:
orgName = each['organizationName']
link = 'https://www.forbes.com/companies/%s/?list=fintech' %each['uri']
print ('%-20s %s' %(orgName, link ))
Output:
Acorns https://www.forbes.com/companies/acorns/?list=fintech
Addepar https://www.forbes.com/companies/addepar/?list=fintech
Affirm https://www.forbes.com/companies/affirm/?list=fintech
Axoni https://www.forbes.com/companies/axoni/?list=fintech
Ayasdi https://www.forbes.com/companies/ayasdi/?list=fintech
Behavox https://www.forbes.com/companies/behavox/?list=fintech
Betterment https://www.forbes.com/companies/betterment/?list=fintech
Bitfury https://www.forbes.com/companies/bitfury/?list=fintech
Blend https://www.forbes.com/companies/blend/?list=fintech
Bolt https://www.forbes.com/companies/bolt/?list=fintech
Brex https://www.forbes.com/companies/brex/?list=fintech
Cadre https://www.forbes.com/companies/cadre/?list=fintech
Carta https://www.forbes.com/companies/carta/?list=fintech
Chime https://www.forbes.com/companies/chime/?list=fintech
Circle https://www.forbes.com/companies/circle/?list=fintech
Coinbase https://www.forbes.com/companies/coinbase/?list=fintech
Credit Karma https://www.forbes.com/companies/credit-karma/?list=fintech
Cross River https://www.forbes.com/companies/cross-river/?list=fintech
Digital Reasoning https://www.forbes.com/companies/digital-reasoning/?list=fintech
Earnin https://www.forbes.com/companies/earnin/?list=fintech
Enigma https://www.forbes.com/companies/enigma/?list=fintech
Even https://www.forbes.com/companies/even/?list=fintech
Flywire https://www.forbes.com/companies/flywire/?list=fintech
Forter https://www.forbes.com/companies/forter/?list=fintech
Fundrise https://www.forbes.com/companies/fundrise/?list=fintech
Gemini https://www.forbes.com/companies/gemini/?list=fintech
Guideline https://www.forbes.com/companies/guideline/?list=fintech
iCapital Network https://www.forbes.com/companies/icapital-network/?list=fintech
IEX Group https://www.forbes.com/companies/iex-group/?list=fintech
Kabbage https://www.forbes.com/companies/kabbage/?list=fintech
Lemonade https://www.forbes.com/companies/lemonade/?list=fintech
LendingHome https://www.forbes.com/companies/lendinghome/?list=fintech
Marqeta https://www.forbes.com/companies/marqeta/?list=fintech
Nova Credit https://www.forbes.com/companies/nova-credit/?list=fintech
Opendoor https://www.forbes.com/companies/opendoor/?list=fintech
Personal Capital https://www.forbes.com/companies/personal-capital/?list=fintech
Plaid https://www.forbes.com/companies/plaid/?list=fintech
Poynt https://www.forbes.com/companies/poynt/?list=fintech
Remitly https://www.forbes.com/companies/remitly/?list=fintech
Ripple https://www.forbes.com/companies/ripple/?list=fintech
Robinhood https://www.forbes.com/companies/robinhood/?list=fintech
Roofstock https://www.forbes.com/companies/roofstock/?list=fintech
Root Insurance https://www.forbes.com/companies/root-insurance/?list=fintech
Stash https://www.forbes.com/companies/stash/?list=fintech
Stripe https://www.forbes.com/companies/stripe/?list=fintech
Symphony https://www.forbes.com/companies/symphony/?list=fintech
Tala https://www.forbes.com/companies/tala/?list=fintech
Toast https://www.forbes.com/companies/toast/?list=fintech
Tradeshift https://www.forbes.com/companies/tradeshift/?list=fintech
TransferWise https://www.forbes.com/companies/transferwise/?list=fintech

How to get website to consistently return content from a GET request when it's inconsistent?

I posted a similar question earlier but I think this is a more refined question.
I'm trying to scrape: https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=&PlayerMovementChkBx=yes&submit=Search&start=0
My code randomly throws errors when I send a GET request to the URL. After debugging, I saw the following happen. A GET request for the following url will be sent(Example URL, could happen on any page): https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=&PlayerMovementChkBx=yes&submit=Search&start=2400
The webpage will then say "There were no matching transactions found.". However, if I refresh the page, the content will then be loaded. I'm using BeautifulSoup and Selenium and have put sleep statements in my code in hopes that it'll work but to no avail. Is this a problem on the website's end? It doesn't make sense to me how one GET request will return nothing but the exact same request will return something. Also, is there anything I could to fix it or is it out of control?
Here is a sample of my code:
t
def scrapeWebsite(url, start, stop):
driver = webdriver.Chrome(executable_path='/Users/Downloads/chromedriver')
print(start, stop)
madeDict = {"Date": [], "Team": [], "Name": [], "Relinquished": [], "Notes": []}
#for i in range(0, 214025, 25):
for i in range(start, stop, 25):
print("Current Page: " + str(i))
currUrl = url + str(i)
#print(currUrl)
#r = requests.get(currUrl)
#soupPage = BeautifulSoup(r.content)
driver.get(currUrl)
#Sleep program for dynamic refreshing
time.sleep(1)
soupPage = BeautifulSoup(driver.page_source, 'html.parser')
#page = urllib2.urlopen(currUrl)
#time.sleep(2)
#soupPage = BeautifulSoup(page, 'html.parser')
info = soupPage.find("table", attrs={'class': 'datatable center'})
time.sleep(1)
extractedInfo = info.findAll("td")
The error occurs at the last line. "findAll" complains because it can't find findAll when the content is null(meaning the GET request returned nothing)
I did some workaround to scrape all the page using try except.
Probably the requests loop it is so fast and the page can't support it.
See the example below, worked like a charm:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=' \
'&PlayerMovementChkBx=yes&submit=Search&start=%s'
def scrape(start=0, stop=214525):
for page in range(start, stop, 25):
current_url = URL % page
print('scrape: current %s' % page)
while True:
try:
response = requests.request('GET', current_url)
if response.ok:
soup = BeautifulSoup(response.content.decode('utf-8'), features='html.parser')
table = soup.find("table", attrs={'class': 'datatable center'})
trs = table.find_all('tr')
slice_pos = 1 if page > 0 else 0
for tr in trs[slice_pos:]:
yield tr.find_all('td')
break
except Exception as exception:
print(exception)
for columns in scrape():
values = [column.text.strip() for column in columns]
# Continuous your code ...

How to detect captchas when scraping google?

I'm using the requests package with BeautifulSoup to scrape Google News for the number of search results for a query. I'm getting two types of IndexError, which I want to distinguish between:
When the number of search results is empty. Here #resultStats returns the empty string '[]'. What seems to be going on is that when a query string is too long, google doesn't even say "0 search results"; it just doesn't say anything.
The second IndexError is when google gives me a captcha.
I need to distinguish between these cases, because I want my scraper to wait five minutes when google sends me a captcha, but not when it's just an empty results string.
I currently have a jury-rigged approach, where I send another query with a known nonzero number of search results, which allows me to distinguish between the two IndexErrors. I'm wondering if there's a more elegant and direct approach to doing this, using BeautifulSoup.
Here's my code:
import requests, bs4, lxml, re, time, random
import pandas as pd
import numpy as np
URL = 'https://www.google.com/search?tbm=nws&q={query}&tbs=cdr%3A1%2Ccd_min%3A{year}%2Ccd_max%3A{year}&authuser=0'
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}
def tester(): # test for captcha
test = requests.get('https://www.google.ca/search?q=donald+trump&safe=off&client=ubuntu&espv=2&biw=1910&bih=969&source=lnt&tbs=cdr%3A1%2Ccd_min%3A2016%2Ccd_max%3A&tbm=nws', headers=headers)
dump = bs4.BeautifulSoup(test.text,"lxml")
result = dump.select('#resultStats')
num = result[0].getText()
num = re.search(r"\b\d[\d,.]*\b",num).group() # regex
num = int(num.replace(',',''))
num = (num > 0)
return num
def search(**params):
response = requests.get(URL.format(**params),headers=headers)
print(response.content, response.status_code) # check this for google requiring Captcha
soup = bs4.BeautifulSoup(response.text,"lxml")
elems = soup.select('#resultStats')
try: # want code to flag if I get a Captcha
hits = elems[0].getText()
hits = re.search(r"\b\d[\d,.]*\b",hits).group() # regex
hits = int(hits.replace(',',''))
print(hits)
return hits
except IndexError:
try:
tester() > 0 # if captcha, this will throw up another IndexError
print("Empty results!")
hits = 0
return hits
except IndexError:
print("Captcha'd!")
time.sleep(120) # should make it rotate IP when captcha'd
hits = 0
return hits
for qry in list:
hits = search(query= qry, year=2016)
I'd just search for the "captcha" element, for example, if this is Google Recaptcha, you can search for the hidden input containing the token:
is_captcha_on_page = soup.find("input", id="recaptcha-token") is not None