Not able to scrape Forbes with BeautifulSoup

I am trying to scrape the table from this website: "https://www.forbes.com/fintech/2019/#ef5484d2b4c6"
When I tried to find the main table, the result came back empty, without the table (output: ""). This is the code I ran:
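from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.forbes.com/fintech/2019/#ef5484d2b4c6'
source = urlopen(url).read()
soup = BeautifulSoup(source, 'html.parser')  # name the parser explicitly
div1 = soup.find('div', attrs={'id': 'main-content'})
div1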
Here's the data I am looking for: [screenshot: high-level section of the data] [screenshot: data I would like to scrape]

You'd have to use Selenium to let the page render first; then you could grab the rendered HTML source and parse it with BeautifulSoup. Alternatively, you can access the data through the site's API, which returns JSON.
The link you are trying to get is not in that JSON response, but it appears to follow the same format/structure for every company:
import requests

url = 'https://www.forbes.com/forbesapi/org/fintech/2019/position/true.json?limit=2000'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}
data = requests.get(url, headers=headers).json()

for each in data['organizationList']['organizationsLists']:
    orgName = each['organizationName']
    # Build the company link from the organization's URI slug
    link = 'https://www.forbes.com/companies/%s/?list=fintech' % each['uri']
    print('%-20s %s' % (orgName, link))
Output:
Acorns https://www.forbes.com/companies/acorns/?list=fintech
Addepar https://www.forbes.com/companies/addepar/?list=fintech
Affirm https://www.forbes.com/companies/affirm/?list=fintech
Axoni https://www.forbes.com/companies/axoni/?list=fintech
Ayasdi https://www.forbes.com/companies/ayasdi/?list=fintech
Behavox https://www.forbes.com/companies/behavox/?list=fintech
Betterment https://www.forbes.com/companies/betterment/?list=fintech
Bitfury https://www.forbes.com/companies/bitfury/?list=fintech
Blend https://www.forbes.com/companies/blend/?list=fintech
Bolt https://www.forbes.com/companies/bolt/?list=fintech
Brex https://www.forbes.com/companies/brex/?list=fintech
Cadre https://www.forbes.com/companies/cadre/?list=fintech
Carta https://www.forbes.com/companies/carta/?list=fintech
Chime https://www.forbes.com/companies/chime/?list=fintech
Circle https://www.forbes.com/companies/circle/?list=fintech
Coinbase https://www.forbes.com/companies/coinbase/?list=fintech
Credit Karma https://www.forbes.com/companies/credit-karma/?list=fintech
Cross River https://www.forbes.com/companies/cross-river/?list=fintech
Digital Reasoning https://www.forbes.com/companies/digital-reasoning/?list=fintech
Earnin https://www.forbes.com/companies/earnin/?list=fintech
Enigma https://www.forbes.com/companies/enigma/?list=fintech
Even https://www.forbes.com/companies/even/?list=fintech
Flywire https://www.forbes.com/companies/flywire/?list=fintech
Forter https://www.forbes.com/companies/forter/?list=fintech
Fundrise https://www.forbes.com/companies/fundrise/?list=fintech
Gemini https://www.forbes.com/companies/gemini/?list=fintech
Guideline https://www.forbes.com/companies/guideline/?list=fintech
iCapital Network https://www.forbes.com/companies/icapital-network/?list=fintech
IEX Group https://www.forbes.com/companies/iex-group/?list=fintech
Kabbage https://www.forbes.com/companies/kabbage/?list=fintech
Lemonade https://www.forbes.com/companies/lemonade/?list=fintech
LendingHome https://www.forbes.com/companies/lendinghome/?list=fintech
Marqeta https://www.forbes.com/companies/marqeta/?list=fintech
Nova Credit https://www.forbes.com/companies/nova-credit/?list=fintech
Opendoor https://www.forbes.com/companies/opendoor/?list=fintech
Personal Capital https://www.forbes.com/companies/personal-capital/?list=fintech
Plaid https://www.forbes.com/companies/plaid/?list=fintech
Poynt https://www.forbes.com/companies/poynt/?list=fintech
Remitly https://www.forbes.com/companies/remitly/?list=fintech
Ripple https://www.forbes.com/companies/ripple/?list=fintech
Robinhood https://www.forbes.com/companies/robinhood/?list=fintech
Roofstock https://www.forbes.com/companies/roofstock/?list=fintech
Root Insurance https://www.forbes.com/companies/root-insurance/?list=fintech
Stash https://www.forbes.com/companies/stash/?list=fintech
Stripe https://www.forbes.com/companies/stripe/?list=fintech
Symphony https://www.forbes.com/companies/symphony/?list=fintech
Tala https://www.forbes.com/companies/tala/?list=fintech
Toast https://www.forbes.com/companies/toast/?list=fintech
Tradeshift https://www.forbes.com/companies/tradeshift/?list=fintech
TransferWise https://www.forbes.com/companies/transferwise/?list=fintech
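If the JSON layout ever changes or an entry is incomplete, the loop above would raise a KeyError. A minimal sketch of a more defensive version, assuming the same organizationList / organizationsLists layout shown in the answer; organization_links is a hypothetical helper name and the sample payload is made up for illustration:

```python
def organization_links(data):
    """Return (name, link) pairs from a decoded list response.

    Assumes the layout used in the answer above:
    data['organizationList']['organizationsLists'] is a list of dicts
    carrying 'organizationName' and 'uri'.
    """
    entries = data.get('organizationList', {}).get('organizationsLists', [])
    links = []
    for entry in entries:
        name = entry.get('organizationName')
        uri = entry.get('uri')
        if not name or not uri:
            continue  # skip incomplete entries instead of raising KeyError
        links.append((name, 'https://www.forbes.com/companies/%s/?list=fintech' % uri))
    return links


# Made-up sample payload in the assumed shape (not real API output).
sample = {'organizationList': {'organizationsLists': [
    {'organizationName': 'Acorns', 'uri': 'acorns'},
    {'organizationName': 'Entry without uri'},
]}}
print(organization_links(sample))
```

Keeping the parsing in its own function also means it can be checked against a saved response without hitting the site again.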

Related

BeautifulSoup: loop only does it once

I am trying to finish my scraping loop, but as a non-programmer I find it hard to debug.
My code grabs some information from a site and puts it in a list. That works fine for the first page of the loop, but what I expect is a list covering, let's say, the first 5 webpages. Unfortunately it stops after the first iteration. I tried various indentation levels, but nothing worked.
My code is the following:
from requests import get
from bs4 import BeautifulSoup
from time import sleep
from random import randint
import json

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
n_pages = 0
for page in range(0,5):
    n_pages += 1
    sapo_url = 'https://casa.sapo.pt/comprar-moradias/?lp=20000&gp=80000'+'&pn='+str(page)
    print(sapo_url)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = []
    for e in page_html.select('div.property'):
        d = {'link':'https://casa.sapo.pt'+e.a.get('href')}
        d.update(json.loads(e.script.string))
        house_containers.append(d)
    else:
        break
    sleep(randint(1,2))
print('You scraped {} pages containing {} properties.'.format(n_pages, len(name)))
What do I have to change to loop through the five pages stated with range(0,5)?
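One likely cause, assuming the else: in the code above is attached to the inner for loop: a for ... else branch runs whenever the loop finishes without hitting break, so the break inside it leaves the page loop after the first page. Creating house_containers inside the page loop would also discard the results of earlier pages, and len(name) refers to a variable that is never defined. A minimal sketch of the loop shape with these points addressed; fetch_page and parse_page are hypothetical callables standing in for the get(...) call and the div.property parsing, and the fake pages exist only so the sketch runs without network access:

```python
from time import sleep


def scrape_pages(fetch_page, parse_page, n_pages=5, delay=0.0):
    """Collect parsed items from pages 0..n_pages-1 into one list."""
    house_containers = []              # created once, before the page loop
    pages_done = 0
    for page in range(n_pages):
        html = fetch_page(page)        # stands in for get(sapo_url, headers=headers).text
        house_containers.extend(parse_page(html))
        pages_done += 1
        if delay:
            sleep(delay)               # pause between requests
    return pages_done, house_containers


# Stand-in callables and data so the sketch runs offline.
fake_pages = {0: "a,b", 1: "c", 2: "", 3: "d,e", 4: "f"}
pages_done, items = scrape_pages(
    fetch_page=lambda page: fake_pages[page],
    parse_page=lambda html: [x for x in html.split(",") if x],
)
print('You scraped {} pages containing {} properties.'.format(pages_done, len(items)))
```

Note that there is no else: break after the inner work, so a page with no results (page 2 in the fake data) does not end the run.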

How do i handle this dropdown using selenium with no select class?

I am trying to access this website and then click on the Status dropdown and select Active from it. I think there are no select tags to use, so the traditional Select method is not working. It would be really helpful if anyone could have a look at it!
import time
from selenium import webdriver

PATH = "/usr/local/bin/chromedriver"
driver = webdriver.Chrome(PATH)
driver.get("https://www.silveroaksp.com/portfolio")
driver.find_element_by_xpath("//div[@class='btn-group show']").click()
time.sleep(2)
driver.close()

To capture all the web elements in a dropdown, use the find_elements family of methods (driver.findElements in Java), which returns the matching elements as a list.
You can use two methods for this. The first is to click on the field and then click on the Active web element. The second is to put the elements in a for-each loop and click inside an if/else condition.
1.
PATH = "/usr/local/bin/chromedriver"
driver = webdriver.Chrome(PATH)
driver.get("https://www.silveroaksp.com/portfolio")
# Open the dropdown div that follows the StatusMultiselect select element
driver.find_element_by_xpath("//select[@id='StatusMultiselect']/following-sibling::div").click()
time.sleep(2)
# Click the input inside the second list item
driver.find_element_by_xpath("//select[@id='StatusMultiselect']/following-sibling::div/ul/li[2]/a/label/input").click()
time.sleep(2)
driver.close()
This should work, I guess. If it doesn't, let me know and I'll provide a second option.
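The second approach mentioned above (loop over the options and click the matching one) could look roughly like this. It is a sketch, not the answerer's own second option: click_option_by_text is a hypothetical helper, and it relies only on each element exposing .text and .click(), which Selenium WebElement objects do. The FakeOption class and the option labels are made up so the example runs without a browser:

```python
def click_option_by_text(options, wanted):
    """Click the first option whose visible text equals `wanted`.

    `options` is any iterable of objects with a `.text` attribute and a
    `.click()` method. Returns True if an option was clicked.
    """
    target = wanted.strip().lower()
    for option in options:
        if option.text.strip().lower() == target:
            option.click()
            return True
    return False


class FakeOption:
    """Stand-in for a WebElement so the sketch runs without a browser."""

    def __init__(self, text):
        self.text = text
        self.clicked = False

    def click(self):
        self.clicked = True


options = [FakeOption("All"), FakeOption(" Active "), FakeOption("Other")]
found = click_option_by_text(options, "Active")
print(found, [o.clicked for o in options])
```

With a real driver, the options list would come from a find_elements call on the list items under the dropdown; which element carries the visible label on this page is an assumption that needs checking in the browser's inspector.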

This problem can be solved with Requests & Pandas:
import requests
from bs4 import BeautifulSoup
import pandas as pd

final_list = []
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:23.0) Gecko/20131011 Firefox/23.0',
    'Host': 'www.silveroaksp.com',
    'Accept': 'text/html, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.silveroaksp.com/portfolio',
    'Content-Type': 'application/json; charset=utf-8',
    'X-Requested-With': 'XMLHttpRequest'
}
data = '{"prmStatusIds":"PortStatus1","prmIndustryIds":"","prmFundIds":"","AllStatusClickChecker":"0","AllIndustryClickChecker":"1","AllFundClickChecker":"1"}'

with requests.Session() as s:
    s.get('https://www.silveroaksp.com/portfolio')
    # Ask the site for the filtered (active) portfolio list
    r = s.post('https://www.silveroaksp.com/CommonItem/FilterPort', data=data, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    actives = soup.select('div.col-lg-4')
    for active in actives:
        business = active.select_one('img')['alt']
        link = active.get('id')
        # Fetch the detail panel for this portfolio company
        r = s.post('https://www.silveroaksp.com/CommonItem/PortfolioInfo', data = '{"PortUrl":"' + link + '","Status":"all"}', headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')
        try:
            headq = soup.find("label", string="Headquarters").find_next_sibling('span').text.strip()
        except Exception as e:
            headq = 'unkown'
        try:
            industry = soup.find("label", string="Industry").find_next_sibling('span').text.strip()
        except Exception as e:
            industry = 'unkown'
        try:
            aquired = soup.find("label", string="Acquired").find_next_sibling('span').text.strip()
        except Exception as e:
            aquired = 'unkown'
        try:
            website = soup.find("label", string="Website").find_next_sibling('span').text.strip()
        except Exception as e:
            website = 'unkown'
        try:
            title = soup.select_one('div.about-author').select_one('h4').text.strip()
        except Exception as e:
            title = 'unkown'
        try:
            description = ' '.join([x.text.strip() for x in soup.select_one('div.about-author').select('p')])
        except Exception as e:
            description = 'unkown'
        try:
            headline = soup.select_one('div.about-author').select_one('strong').text.strip()
        except Exception as e:
            headline = 'unkown'
        final_list.append([headq, industry, aquired, website, title, description, headline])

final_df = pd.DataFrame(final_list, columns = ['headq', 'industry', 'aquired', 'website', 'title', 'description', 'headline'])
final_df.head()
This returns:
   headq                  industry             aquired        website                   title                      description                                        headline
0  Chicago, Illinois      Business Services    March 2018     www.brilliantfs.com       Brilliant                  Brilliant is a leading provider of temporary a...  Brilliant is actively looking for add-on acqui...
1  New York, New York     Consumer Services    November 2017  caringpeopleinc.com       Caring People              Founded in 1998, Caring People is a leading ho...  Caring People is actively looking for add-on a...
2  Denver, Colorado       Business Services    December 2021  ccsbts.com                CCS Facility Services      CCS Facility Services is a leading provider of...  CCS is actively looking for add-on acquisition...
3  Atlanta, GA            Consumer Services    August 2021    unkown                    Drive Automotive Services  Drive, headquartered in Atlanta, GA, is a lead...  Drive is actively looking for add-on acquisiti...
4  Woburn, Massachusetts  Healthcare Services  December 2020  www.geriatricmedical.com  Geriatric Medical          Founded in 1945, Geriatric Medical is a leadin...  Geriatric Medical is actively looking for add-...
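The two request bodies in the answer above are hand-written JSON strings, one of them built by string concatenation. A minimal sketch of producing the same strings with json.dumps, which also escapes any quotes in the id; filter_payload and portfolio_info_payload are hypothetical helper names, and "brilliant" is a made-up id used only for the demonstration:

```python
import json


def filter_payload(status_ids="PortStatus1"):
    """Body for the FilterPort request, mirroring the hand-written string."""
    return json.dumps({
        "prmStatusIds": status_ids,
        "prmIndustryIds": "",
        "prmFundIds": "",
        "AllStatusClickChecker": "0",
        "AllIndustryClickChecker": "1",
        "AllFundClickChecker": "1",
    }, separators=(",", ":"))


def portfolio_info_payload(port_url, status="all"):
    """Body for the PortfolioInfo request; json.dumps handles the escaping."""
    return json.dumps({"PortUrl": port_url, "Status": status}, separators=(",", ":"))


print(portfolio_info_payload("brilliant"))
```

The compact separators are used so the output is character-for-character the same as the strings in the answer; whether the server cares about whitespace is not something the source confirms.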

How can I get all coins in USD parity to the Binance API?

I need Binance data to build a mobile app. USDT pairs alone are sufficient. The link below returns all trading pairs, but I only want the USDT pairs. Which link should I use for this?
https://api.binance.com/api/v3/ticker/price

You can use the Binance Exchange API. There is no need to register.
The API call used is: https://api.binance.com/api/v3/exchangeInfo
I recommend using Google Colab with Python, or any other Python environment:
import requests

def get_response(url):
    response = requests.get(url)
    response.raise_for_status()  # raises exception when not a 2xx response
    if response.status_code != 204:
        return response.json()

def get_exchange_info():
    base_url = 'https://api.binance.com'
    endpoint = '/api/v3/exchangeInfo'
    return get_response(base_url + endpoint)

def create_symbols_list(filter='USDT'):
    info = get_exchange_info()
    pairs_data = info['symbols']
    # Keeps every pair whose symbol contains the filter text anywhere in its name
    full_data_dic = {s['symbol']: s for s in pairs_data if filter in s['symbol']}
    return list(full_data_dic.keys())

create_symbols_list('USDT')
Result:
['BTCUSDT', 'ETHUSDT', 'BNBUSDT', 'BCCUSDT', 'NEOUSDT', 'LTCUSDT',...
The API call returns a very large response filled with interesting data about the exchange. Inside create_symbols_list, the full data for each matching pair is kept in the full_data_dic dictionary.

There is also a Python Binance client library (python-binance), which you can use to list the tickers that are quoted in USDT and whose status is TRADING:
from binance.client import Client

client = Client()
info = client.get_exchange_info()
for c in info['symbols']:
    if c['quoteAsset']=='USDT' and c['status']=="TRADING":
        print(c['symbol'])
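To get prices for only the USDT pairs, the symbol list from exchangeInfo can be combined with the ticker/price response from the question. A minimal sketch, assuming the field names used above (symbol, quoteAsset, status) and that each ticker entry carries symbol and price; usdt_prices is a hypothetical helper, and the sample data is made up so the example runs offline:

```python
def usdt_prices(exchange_info, tickers):
    """Map symbol -> price (float) for pairs quoted in USDT and currently trading."""
    usdt_symbols = {
        s["symbol"]
        for s in exchange_info["symbols"]
        if s["quoteAsset"] == "USDT" and s["status"] == "TRADING"
    }
    return {t["symbol"]: float(t["price"]) for t in tickers if t["symbol"] in usdt_symbols}


# Made-up sample data in the assumed shape of the two responses.
sample_info = {"symbols": [
    {"symbol": "BTCUSDT", "quoteAsset": "USDT", "status": "TRADING"},
    {"symbol": "ETHBTC", "quoteAsset": "BTC", "status": "TRADING"},
    {"symbol": "BCCUSDT", "quoteAsset": "USDT", "status": "BREAK"},
]}
sample_tickers = [
    {"symbol": "BTCUSDT", "price": "100.50"},
    {"symbol": "ETHBTC", "price": "0.05"},
    {"symbol": "BCCUSDT", "price": "1.00"},
]
print(usdt_prices(sample_info, sample_tickers))
```

Filtering on quoteAsset rather than on the symbol text avoids picking up pairs where USDT appears elsewhere in the name.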

HTTP POST to Google Maps Geolocation API fails on NodeMCU

I am using a Wemos D1 Mini board and am trying to get the board's location from its surrounding APs using the Google Maps Geolocation API. The board runs NodeMCU and is programmed in Lua.
This is my code so far:
function listap(t) -- (SSID : Authmode, RSSI, BSSID, Channel)
  body = {}
  body["wifiAccessPoints"] = {}
  for bssid,v in pairs(t) do
    this_m = {}
    this_m.macAddress = bssid
    table.insert(body.wifiAccessPoints, this_m)
  end
  ok, json = pcall(sjson.encode, body)
  if ok then
    http.post('https://www.googleapis.com/geolocation/v1/geolocate?key=AIzaSyC4IZU8CEB0jvSblOHqYm********', 'Content-Type: application/json\r\n', json,
      function(code, data)
        if (code < 0) then
          print("HTTP request failed")
        else
          print(code, data)
        end
      end)
  else
    print("failed to encode!")
  end
end
wifi.sta.getap(1, listap)
The JSON it creates looks good, and I can make the same request using Postman, for example, but every time I run this script the request fails. Any idea why?
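For comparing against what Postman sends, the request body that listap builds can be reproduced off the board. A minimal Python sketch, assuming only the wifiAccessPoints / macAddress structure used in the Lua code; geolocation_body is a hypothetical helper and the BSSIDs are made up. It does not diagnose the failure itself:

```python
import json


def geolocation_body(bssids):
    """Build the same JSON structure that the Lua listap function encodes."""
    return json.dumps({"wifiAccessPoints": [{"macAddress": b} for b in bssids]})


# Made-up BSSIDs for illustration only.
body = geolocation_body(["00:11:22:33:44:55", "66:77:88:99:aa:bb"])
print(body)
```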

How to detect captchas when scraping google?

I'm using the requests package with BeautifulSoup to scrape Google News for the number of search results for a query. I'm getting two types of IndexError, which I want to distinguish between:
1. The first occurs when there are no search results. Here the #resultStats selector returns an empty list, []. What seems to be going on is that when a query string is too long, Google doesn't even say "0 search results"; it just doesn't say anything.
2. The second IndexError occurs when Google gives me a captcha.
I need to distinguish between these cases, because I want my scraper to wait five minutes when Google sends me a captcha, but not when it's just an empty results string.
I currently have a jury-rigged approach, where I send another query with a known nonzero number of search results, which allows me to distinguish between the two IndexErrors. I'm wondering if there's a more elegant and direct way of doing this using BeautifulSoup.
Here's my code:
import requests, bs4, lxml, re, time, random
import pandas as pd
import numpy as np

URL = 'https://www.google.com/search?tbm=nws&q={query}&tbs=cdr%3A1%2Ccd_min%3A{year}%2Ccd_max%3A{year}&authuser=0'
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}

def tester(): # test for captcha
    test = requests.get('https://www.google.ca/search?q=donald+trump&safe=off&client=ubuntu&espv=2&biw=1910&bih=969&source=lnt&tbs=cdr%3A1%2Ccd_min%3A2016%2Ccd_max%3A&tbm=nws', headers=headers)
    dump = bs4.BeautifulSoup(test.text,"lxml")
    result = dump.select('#resultStats')
    num = result[0].getText()
    num = re.search(r"\b\d[\d,.]*\b",num).group() # regex
    num = int(num.replace(',',''))
    num = (num > 0)
    return num

def search(**params):
    response = requests.get(URL.format(**params),headers=headers)
    print(response.content, response.status_code) # check this for google requiring Captcha
    soup = bs4.BeautifulSoup(response.text,"lxml")
    elems = soup.select('#resultStats')
    try: # want code to flag if I get a Captcha
        hits = elems[0].getText()
        hits = re.search(r"\b\d[\d,.]*\b",hits).group() # regex
        hits = int(hits.replace(',',''))
        print(hits)
        return hits
    except IndexError:
        try:
            tester() > 0 # if captcha, this will throw up another IndexError
            print("Empty results!")
            hits = 0
            return hits
        except IndexError:
            print("Captcha'd!")
            time.sleep(120) # should make it rotate IP when captcha'd
            hits = 0
            return hits

for qry in list:
    hits = search(query= qry, year=2016)

I'd just search for the captcha element. For example, if this is Google reCAPTCHA, you can look for the hidden input containing the token:
is_captcha_on_page = soup.find("input", id="recaptcha-token") is not None
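Building on that suggestion, a single response could be classified without sending a second test query. A minimal sketch using only the standard library's html.parser, so it runs without BeautifulSoup; classify_page is a hypothetical helper, and it assumes (as the answer does) that the captcha page carries an element with id recaptcha-token and that a normal results page carries an element with id resultStats:

```python
from html.parser import HTMLParser


class _IdCollector(HTMLParser):
    """Collect the id attribute of every start tag in the document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def classify_page(html):
    """Return 'captcha', 'results', or 'empty' for one response body."""
    collector = _IdCollector()
    collector.feed(html)
    if "recaptcha-token" in collector.ids:
        return "captcha"     # assumed captcha marker, per the answer above
    if "resultStats" in collector.ids:
        return "results"
    return "empty"


print(classify_page('<input type="hidden" id="recaptcha-token" value="x">'))
print(classify_page('<div id="resultStats">About 1,230 results</div>'))
print(classify_page('<html><body></body></html>'))
```

In search(), the five-minute wait would then be applied only when the classification is 'captcha', and hits set to 0 directly when it is 'empty'.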
