I am trying to access this website, and then I want to click on the Status dropdown and select Active. I think there are no select tags, hence the traditional Select method is not working. It would be really helpful if anyone can have a look at it!
"""
PATH = "/usr/local/bin/chromedriver"
driver = webdriver.Chrome(PATH)
driver.get("https://www.silveroaksp.com/portfolio")
driver.find_element_by_xpath("//div[#class='btn-group show']").click()
time.sleep(2)
driver.close()
"""
*To capture all the web elements in the dropdown, use driver.find_elements for multiple elements (the plural form returns the matches in a list).
*You can use two methods for this: click on the field and then click the Active web element directly, or put the options in a for-each loop and click inside an if/else condition.
1.
from selenium import webdriver
import time

PATH = "/usr/local/bin/chromedriver"
driver = webdriver.Chrome(PATH)
driver.get("https://www.silveroaksp.com/portfolio")
driver.find_element_by_xpath("//select[@id='StatusMultiselect']/following-sibling::div").click()
time.sleep(2)
driver.find_element_by_xpath("//select[@id='StatusMultiselect']/following-sibling::div/ul/li[2]/a/label/input").click()
time.sleep(2)
driver.close()
*This should work. If it doesn't, let me know and I'll provide a second option.
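The loop-based second method mentioned above can be sketched as follows. The option-matching logic is a plain function; the commented Selenium wiring is an assumption based on the same multiselect markup used in option 1:

```python
def pick_option(labels, target):
    """Return the index of the option whose label matches target, or -1."""
    for i, label in enumerate(labels):
        if label.strip().lower() == target.strip().lower():
            return i
    return -1

# Hypothetical Selenium wiring (assumes the markup from option 1 above):
# from selenium import webdriver
# driver = webdriver.Chrome("/usr/local/bin/chromedriver")
# driver.get("https://www.silveroaksp.com/portfolio")
# dropdown = driver.find_element_by_xpath("//select[@id='StatusMultiselect']/following-sibling::div")
# dropdown.click()
# options = dropdown.find_elements_by_xpath("./ul/li/a/label")
# idx = pick_option([o.text for o in options], "Active")
# if idx >= 0:
#     options[idx].click()
```

Keeping the matching logic separate from the driver calls makes it easy to test without a browser.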
This problem can be solved with Requests, BeautifulSoup & Pandas:
import requests
from bs4 import BeautifulSoup
import pandas as pd

final_list = []
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:23.0) Gecko/20131011 Firefox/23.0',
           'Host': 'www.silveroaksp.com',
           'Accept': 'text/html, */*; q=0.01',
           'Accept-Encoding': 'gzip, deflate, br',
           'Referer': 'https://www.silveroaksp.com/portfolio',
           'Content-Type': 'application/json; charset=utf-8',
           'X-Requested-With': 'XMLHttpRequest'
           }
data = '{"prmStatusIds":"PortStatus1","prmIndustryIds":"","prmFundIds":"","AllStatusClickChecker":"0","AllIndustryClickChecker":"1","AllFundClickChecker":"1"}'

with requests.Session() as s:
    s.get('https://www.silveroaksp.com/portfolio')
    r = s.post('https://www.silveroaksp.com/CommonItem/FilterPort', data=data, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    actives = soup.select('div.col-lg-4')
    for active in actives:
        business = active.select_one('img')['alt']
        link = active.get('id')
        r = s.post('https://www.silveroaksp.com/CommonItem/PortfolioInfo', data='{"PortUrl":"' + link + '","Status":"all"}', headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')
        try:
            headq = soup.find("label", string="Headquarters").find_next_sibling('span').text.strip()
        except Exception:
            headq = 'unknown'
        try:
            industry = soup.find("label", string="Industry").find_next_sibling('span').text.strip()
        except Exception:
            industry = 'unknown'
        try:
            acquired = soup.find("label", string="Acquired").find_next_sibling('span').text.strip()
        except Exception:
            acquired = 'unknown'
        try:
            website = soup.find("label", string="Website").find_next_sibling('span').text.strip()
        except Exception:
            website = 'unknown'
        try:
            title = soup.select_one('div.about-author').select_one('h4').text.strip()
        except Exception:
            title = 'unknown'
        try:
            description = ' '.join([x.text.strip() for x in soup.select_one('div.about-author').select('p')])
        except Exception:
            description = 'unknown'
        try:
            headline = soup.select_one('div.about-author').select_one('strong').text.strip()
        except Exception:
            headline = 'unknown'
        final_list.append([headq, industry, acquired, website, title, description, headline])

final_df = pd.DataFrame(final_list, columns=['headq', 'industry', 'acquired', 'website', 'title', 'description', 'headline'])
final_df.head()
This returns:
headq industry acquired website title description headline
0 Chicago, Illinois Business Services March 2018 www.brilliantfs.com Brilliant Brilliant is a leading provider of temporary a... Brilliant is actively looking for add-on acqui...
1 New York, New York Consumer Services November 2017 caringpeopleinc.com Caring People Founded in 1998, Caring People is a leading ho... Caring People is actively looking for add-on a...
2 Denver, Colorado Business Services December 2021 ccsbts.com CCS Facility Services CCS Facility Services is a leading provider of... CCS is actively looking for add-on acquisition...
3 Atlanta, GA Consumer Services August 2021 unknown Drive Automotive Services Drive, headquartered in Atlanta, GA, is a lead... Drive is actively looking for add-on acquisiti...
4 Woburn, Massachusetts Healthcare Services December 2020 www.geriatricmedical.com Geriatric Medical Founded in 1945, Geriatric Medical is a leadin... Geriatric Medical is actively looking for add-...
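A small side note on the POST bodies above: building the JSON payload by string concatenation works, but json.dumps avoids quoting bugs if a portfolio id ever contains special characters. A minimal sketch:

```python
import json

def portfolio_payload(link):
    """Build the PortfolioInfo POST body from a portfolio id."""
    return json.dumps({"PortUrl": link, "Status": "all"})

print(portfolio_payload("brilliantfs"))
```

The resulting string can be passed to s.post via the data= argument exactly like the hand-built one.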
I'm trying to make a scraper for Capterra. I keep getting blocked, so I think I need a proxy for my driver.get calls. I am also having trouble exporting a dataframe to a CSV. The first half of my code (not attached) collects all the links and stores them in a list, which I then access with Selenium to get the information I want; the second part is where I am having trouble.
For example, these are the types of links stored in the plinks list that the driver accesses:
https://www.capterra.com/p/212448/Blackbaud-Altru/
https://www.capterra.com/p/80509/Volgistics-Volunteer-Management/
https://www.capterra.com/p/179048/One-Earth/
for link in plinks:
    driver.get(link)
    #driver.implicitly_wait(20)
    companyProfile = bs(driver.page_source, 'html.parser')
    try:
        name = companyProfile.find("h1", class_="sm:nb-type-2xl nb-type-xl").text
    except AttributeError:
        name = "couldn't find"
    try:
        reviews = companyProfile.find("div", class_="nb-ml-3xs").text
    except AttributeError:
        reviews = "couldn't find"
    try:
        location = driver.find_element(By.XPATH, "//*[starts-with(., 'Located in')]").text
    except NoSuchElementException:
        location = "couldn't find"
    try:
        url = driver.find_element(By.XPATH, "//*[starts-with(., 'http')]").text
    except NoSuchElementException:
        url = "couldn't find"
    try:
        features = [x.get_text() for x in companyProfile.select('[id="LoadableProductFeaturesSection"] li span')]
    except AttributeError:
        features = "couldn't find"
    companyInfo.append([name, reviews, location, url, features])

companydf = pd.DataFrame(companyInfo, columns=["Name", "Reviews", "Location", "URL", "Features"])
companydf.to_csv("wmtest.csv", sep='\t')
driver.close()
I am using Firefox for the webdriver, and I am happy to change to Chrome if it works better, but is it possible to have the webdriver pick from a random set of proxies for each get request?
Thanks!
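On the proxy question: a single webdriver instance can't swap proxies per request, but one common workaround is to create a fresh driver with a randomly chosen proxy for each link or batch of links. A minimal sketch, assuming a hypothetical proxy pool (the Firefox preference names in the comments are the standard network.proxy.* ones):

```python
import random

# Hypothetical proxy pool; replace with real host:port entries.
PROXIES = ["203.0.113.1:8080", "203.0.113.2:8080", "203.0.113.3:8080"]

def pick_proxy(pool):
    """Pick a random host:port pair from the pool."""
    host, port = random.choice(pool).split(":")
    return host, int(port)

# Hypothetical Selenium wiring (Firefox):
# from selenium import webdriver
# host, port = pick_proxy(PROXIES)
# opts = webdriver.FirefoxOptions()
# opts.set_preference("network.proxy.type", 1)
# opts.set_preference("network.proxy.http", host)
# opts.set_preference("network.proxy.http_port", port)
# opts.set_preference("network.proxy.ssl", host)
# opts.set_preference("network.proxy.ssl_port", port)
# driver = webdriver.Firefox(options=opts)
```

Recreating the driver is slower than reusing one, but it guarantees each session really uses a different exit IP.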
I actually want to use the Africa's Talking API in my USSD app. I am from Bangladesh and I am confused about whether it supports Bangladesh or not. There are mainly four service providers in Bangladesh, namely Grameenphone, Robi, Banglalink, and Airtel. I want to send USSD from one of these operators to my application.
Here is the Python code:
from flask import Flask, request
import africastalking
import os

app = Flask(__name__)

username = "sandbox"
api_key = "*384*89376#"
africastalking.initialize(username, api_key)
sms = africastalking.SMS

@app.route('/', methods=['POST', 'GET'])
def ussd_callback():
    global response
    session_id = request.values.get("sessionId", None)
    service_code = request.values.get("serviceCode", None)
    phone_number = request.values.get("phoneNumber", None)
    text = request.values.get("text", "default")
    sms_phone_number = []
    sms_phone_number.append(phone_number)

    # ussd logic
    if text == "":
        # main menu
        response = "CON What would you like to do?\n"
        response += "1. Check account details\n"
        response += "2. Check phone number\n"
        response += "3. Send me a cool message"
    elif text == "1":
        # sub menu 1
        response = "CON What would you like to check on your account?\n"
        response += "1. Account number\n"
        response += "2. Account balance"
    elif text == "2":
        # sub menu 2
        response = "END Your phone number is {}".format(phone_number)
    elif text == "3":
        try:
            # sending the sms
            sms_response = sms.send("Thank you for going through this tutorial", sms_phone_number)
            print(sms_response)
        except Exception as e:
            # show us what went wrong
            print(f"Houston, we have a problem: {e}")
    elif text == "1*1":
        # ussd menus are split using *
        account_number = "1243324376742"
        response = "END Your account number is {}".format(account_number)
    elif text == "1*2":
        account_balance = "100,000"
        response = "END Your account balance is USD {}".format(account_balance)
    else:
        response = "END Invalid input. Try again."
    return response

if __name__ == "__main__":
    app.run()
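For what it's worth, the text value in the callback accumulates the user's inputs joined by "*", which is why the deeper menus match on strings like "1*1". The dispatch logic can be sketched as a small standalone helper (the response strings here are illustrative placeholders, not the exact menus above):

```python
def ussd_route(text):
    """Map the accumulated USSD input string to a response prefix."""
    steps = text.split("*") if text else []
    if not steps:
        return "CON main menu"       # first contact: show the menu
    if steps == ["1"]:
        return "CON account sub-menu"
    if steps == ["1", "1"]:
        return "END account number"  # END terminates the session
    if steps == ["1", "2"]:
        return "END account balance"
    if steps == ["2"]:
        return "END phone number"
    return "END invalid"
```

Keeping the routing pure like this makes the menu tree testable without running Flask or the gateway.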
Africa's Talking USSD services are available in the following countries:
Kenya,
Uganda,
Tanzania,
Rwanda,
Nigeria,
Côte d'Ivoire,
Malawi,
Zambia,
South Africa.
Bangladesh is not on that list, so the operators you mention are not currently supported. Hope this helps. Reach out if you have any more questions.
I am trying to scrape the table from this website "https://www.forbes.com/fintech/2019/#ef5484d2b4c6"
When I tried to find the main table, the following returned an empty result without the table:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.forbes.com/fintech/2019/#ef5484d2b4c6'
source = urlopen(url).read()
soup = BeautifulSoup(source, 'html.parser')
div1 = soup.find('div', attrs={'id': 'main-content'})
div1
This returns "".
Here's the data I am looking for (screenshots omitted: a high-level section of the page, and the table data I would like to scrape).
You'd have to use Selenium to let the page render first, then grab the HTML source and parse it with BeautifulSoup. Alternatively, you can access that data through their API JSON response.
The link you're trying to get is not in that JSON response, but it appears to follow the same format/structure:
import requests

url = 'https://www.forbes.com/forbesapi/org/fintech/2019/position/true.json?limit=2000'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'}

data = requests.get(url, headers=headers).json()
for each in data['organizationList']['organizationsLists']:
    orgName = each['organizationName']
    link = 'https://www.forbes.com/companies/%s/?list=fintech' % each['uri']
    print('%-20s %s' % (orgName, link))
Output:
Acorns https://www.forbes.com/companies/acorns/?list=fintech
Addepar https://www.forbes.com/companies/addepar/?list=fintech
Affirm https://www.forbes.com/companies/affirm/?list=fintech
Axoni https://www.forbes.com/companies/axoni/?list=fintech
Ayasdi https://www.forbes.com/companies/ayasdi/?list=fintech
Behavox https://www.forbes.com/companies/behavox/?list=fintech
Betterment https://www.forbes.com/companies/betterment/?list=fintech
Bitfury https://www.forbes.com/companies/bitfury/?list=fintech
Blend https://www.forbes.com/companies/blend/?list=fintech
Bolt https://www.forbes.com/companies/bolt/?list=fintech
Brex https://www.forbes.com/companies/brex/?list=fintech
Cadre https://www.forbes.com/companies/cadre/?list=fintech
Carta https://www.forbes.com/companies/carta/?list=fintech
Chime https://www.forbes.com/companies/chime/?list=fintech
Circle https://www.forbes.com/companies/circle/?list=fintech
Coinbase https://www.forbes.com/companies/coinbase/?list=fintech
Credit Karma https://www.forbes.com/companies/credit-karma/?list=fintech
Cross River https://www.forbes.com/companies/cross-river/?list=fintech
Digital Reasoning https://www.forbes.com/companies/digital-reasoning/?list=fintech
Earnin https://www.forbes.com/companies/earnin/?list=fintech
Enigma https://www.forbes.com/companies/enigma/?list=fintech
Even https://www.forbes.com/companies/even/?list=fintech
Flywire https://www.forbes.com/companies/flywire/?list=fintech
Forter https://www.forbes.com/companies/forter/?list=fintech
Fundrise https://www.forbes.com/companies/fundrise/?list=fintech
Gemini https://www.forbes.com/companies/gemini/?list=fintech
Guideline https://www.forbes.com/companies/guideline/?list=fintech
iCapital Network https://www.forbes.com/companies/icapital-network/?list=fintech
IEX Group https://www.forbes.com/companies/iex-group/?list=fintech
Kabbage https://www.forbes.com/companies/kabbage/?list=fintech
Lemonade https://www.forbes.com/companies/lemonade/?list=fintech
LendingHome https://www.forbes.com/companies/lendinghome/?list=fintech
Marqeta https://www.forbes.com/companies/marqeta/?list=fintech
Nova Credit https://www.forbes.com/companies/nova-credit/?list=fintech
Opendoor https://www.forbes.com/companies/opendoor/?list=fintech
Personal Capital https://www.forbes.com/companies/personal-capital/?list=fintech
Plaid https://www.forbes.com/companies/plaid/?list=fintech
Poynt https://www.forbes.com/companies/poynt/?list=fintech
Remitly https://www.forbes.com/companies/remitly/?list=fintech
Ripple https://www.forbes.com/companies/ripple/?list=fintech
Robinhood https://www.forbes.com/companies/robinhood/?list=fintech
Roofstock https://www.forbes.com/companies/roofstock/?list=fintech
Root Insurance https://www.forbes.com/companies/root-insurance/?list=fintech
Stash https://www.forbes.com/companies/stash/?list=fintech
Stripe https://www.forbes.com/companies/stripe/?list=fintech
Symphony https://www.forbes.com/companies/symphony/?list=fintech
Tala https://www.forbes.com/companies/tala/?list=fintech
Toast https://www.forbes.com/companies/toast/?list=fintech
Tradeshift https://www.forbes.com/companies/tradeshift/?list=fintech
TransferWise https://www.forbes.com/companies/transferwise/?list=fintech
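If you want the table rather than printed lines, the same JSON records can be loaded straight into pandas. A minimal sketch using illustrative records shaped like entries in data['organizationList']['organizationsLists'] (field names taken from the loop above; the sample values are assumptions):

```python
import pandas as pd

# Illustrative records mirroring the API response structure.
records = [
    {"organizationName": "Acorns", "uri": "acorns"},
    {"organizationName": "Addepar", "uri": "addepar"},
]

df = pd.DataFrame(records)
df["link"] = "https://www.forbes.com/companies/" + df["uri"] + "/?list=fintech"
print(df[["organizationName", "link"]])
```

From there df.to_csv(...) gives you the table as a file in one call.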
I am using the NYTimes API to scrape news articles for my text analysis. Here is the code.
from nytimesarticle import articleAPI

api = articleAPI('<API key>')
articles = api.search(q='President',
                      fq={'source': ['Reuters', 'AP', 'The New York Times']},
                      begin_date=20110829,
                      end_date=20161203)
print("the response is ", articles)
However, it does not return any results. This is the sample response with a null dataset:
{'response': {'meta': {'offset': 0, 'time': 227, 'hits': 0}, 'docs': []}, 'status': 'OK', 'copyright': 'Copyright (c) 2013 The New York Times Company. All Rights Reserved.'}
Should there be any additional parameters when sending the request?
I was actually using a completely different API wrapper in my initial code. Switching to the requests library and the Archive API solved the problem here:
import requests

uri = "http://api.nytimes.com/svc/archive/v1/" + str(year) + "/" + str(month) + ".json?api-key=<>"
resp = requests.get(uri)
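Since the Archive API URI only varies by year and month, a tiny helper keeps the string building in one place (the endpoint is the one from the snippet above; the key argument is still your placeholder to fill in):

```python
def archive_uri(year, month, api_key):
    """Build the NYTimes Archive API URI for a given year and month."""
    return ("http://api.nytimes.com/svc/archive/v1/"
            f"{year}/{month}.json?api-key={api_key}")

print(archive_uri(2016, 12, "YOUR_KEY"))
```

The JSON body of the response then carries the articles under response -> docs.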
I'm using the requests package with BeautifulSoup to scrape Google News for the number of search results for a query. I'm getting two types of IndexError, which I want to distinguish between:
When the number of search results is empty. Here #resultStats returns an empty list []. What seems to be going on is that when a query string is too long, Google doesn't even say "0 search results"; it just doesn't say anything.
The second IndexError is when google gives me a captcha.
I need to distinguish between these cases, because I want my scraper to wait five minutes when google sends me a captcha, but not when it's just an empty results string.
I currently have a jury-rigged approach, where I send another query with a known nonzero number of search results, which allows me to distinguish between the two IndexErrors. I'm wondering if there's a more elegant and direct approach to doing this, using BeautifulSoup.
Here's my code:
import requests, bs4, lxml, re, time, random
import pandas as pd
import numpy as np

URL = 'https://www.google.com/search?tbm=nws&q={query}&tbs=cdr%3A1%2Ccd_min%3A{year}%2Ccd_max%3A{year}&authuser=0'
headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}

def tester():  # test for captcha
    test = requests.get('https://www.google.ca/search?q=donald+trump&safe=off&client=ubuntu&espv=2&biw=1910&bih=969&source=lnt&tbs=cdr%3A1%2Ccd_min%3A2016%2Ccd_max%3A&tbm=nws', headers=headers)
    dump = bs4.BeautifulSoup(test.text, "lxml")
    result = dump.select('#resultStats')
    num = result[0].getText()
    num = re.search(r"\b\d[\d,.]*\b", num).group()  # regex
    num = int(num.replace(',', ''))
    num = (num > 0)
    return num

def search(**params):
    response = requests.get(URL.format(**params), headers=headers)
    print(response.content, response.status_code)  # check this for google requiring Captcha
    soup = bs4.BeautifulSoup(response.text, "lxml")
    elems = soup.select('#resultStats')
    try:  # want code to flag if I get a Captcha
        hits = elems[0].getText()
        hits = re.search(r"\b\d[\d,.]*\b", hits).group()  # regex
        hits = int(hits.replace(',', ''))
        print(hits)
        return hits
    except IndexError:
        try:
            tester() > 0  # if captcha, this will throw another IndexError
            print("Empty results!")
            hits = 0
            return hits
        except IndexError:
            print("Captcha'd!")
            time.sleep(120)  # should make it rotate IP when captcha'd
            hits = 0
            return hits

for qry in queries:  # queries: the list of query strings
    hits = search(query=qry, year=2016)
I'd just search for the captcha element. For example, if this is Google reCAPTCHA, you can check for the hidden input containing the token:
is_captcha_on_page = soup.find("input", id="recaptcha-token") is not None
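Putting the two cases together, a small helper can classify a page as a captcha, an empty result, or a normal result without a second tester request. A sketch assuming the same #resultStats and recaptcha-token markers used above:

```python
import bs4

def classify_page(html):
    """Return 'captcha', 'empty', or 'results' for a Google search page."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # Captcha page: hidden reCAPTCHA token input is present.
    if soup.find("input", id="recaptcha-token") is not None:
        return "captcha"
    # No result-count element at all: empty results, not a block.
    if not soup.select("#resultStats"):
        return "empty"
    return "results"
```

The scraper can then sleep only on the "captcha" branch and move on immediately on "empty", replacing both IndexError paths.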