Scraping: scrape multiple pages in a loop (BeautifulSoup / Selenium)

I am trying to scrape real estate data using Beautifulsoup, but when I save the result of the scrape to a .csv file, it only contains the information from the first page. I would like to scrape the number of pages I have set in the "pages_number" variable.
# How many pages
pages_number = int(input('How many pages? '))

# start the execution timer
tic = time.time()

# Chromedriver
chromedriver = "./chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

# initial link
link = 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=1'
driver.get(link)

# loop over the pages
for page in range(1, pages_number + 1):
    time.sleep(15)
    data = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    soup_complete_source = BeautifulSoup(data.encode('utf-8'), "lxml")
I already tried this solution but got an error:
link = 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page={}.format(page)'
Does anyone have any idea what can be done?
COMPLETE CODE
https://github.com/arturlunardi/webscraping_vivareal/blob/main/scrap_vivareal.ipynb

I see that the url you are using belongs to page 1 only.
https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=1
Are you changing it anywhere in your code? If not, then every request will fetch page 1 only.
You should do something like this:
for page in range(1, pages_number + 1):
    chromedriver = "./chromedriver"
    os.environ["webdriver.chrome.driver"] = chromedriver
    driver = webdriver.Chrome(chromedriver)

    # build the link for the current page
    link = f"https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page={page}"
    driver.get(link)
    time.sleep(15)

    data = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    soup_complete_source = BeautifulSoup(data.encode('utf-8'), "lxml")

    driver.close()
Test output (not the soup part) for pages_number = 3 (URLs stored in a list, for easy viewing):
['https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=1', 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=2', 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=3']
Process finished with exit code 0
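To end up with every page in the final .csv rather than just the last soup, the per-page results also have to be accumulated and written out once at the end. A minimal sketch of that idea, not the notebook's exact parsing code: parse_listings is a hypothetical helper, and the CSS selector inside it is an assumption about the page markup.

import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

def parse_listings(soup):
    # hypothetical placeholder: extract whatever fields the notebook actually scrapes
    return [{'title': card.get_text(strip=True)}
            for card in soup.select('a.property-card__title')]  # selector is an assumption

pages_number = int(input('How many pages? '))
driver = webdriver.Chrome("./chromedriver")

all_rows = []
for page in range(1, pages_number + 1):
    driver.get(f"https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page={page}")
    time.sleep(15)
    soup = BeautifulSoup(driver.page_source, "lxml")
    all_rows.extend(parse_listings(soup))  # accumulate instead of overwriting

driver.quit()
pd.DataFrame(all_rows).to_csv("vivareal.csv", index=False)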

Related

How can I keep scraping data from my dataset of LinkedIn job description links with Selenium and a for loop, after it stopped and no longer works?

I made a dataset of LinkedIn links for data science jobs (with 1000 rows) and wrote a for-loop to open each link with Selenium and extract the job description with BeautifulSoup.
It worked until 121 rows had been extracted, then it displayed an error and stopped. When I try to start again from row 122, it displays this error for the link:
the link (name of the dataset column), has problem
and does not open LinkedIn. I tested, and Selenium can open Google, for example.
my loop is:
# continue the loop from row 122 (index 121)
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome(r"C:\Users\shahi\chromedriver.exe")

jobdescriptions2 = []
for link in links[121:1000]:
    try:
        # This instance will be used to log into LinkedIn
        driver.get(link)
        time.sleep(5)
        src = driver.page_source
    except:
        src = 'problem in scraping data'
        print(f'the link {link}, has problem')
    try:
        # Now using beautiful soup
        soup = BeautifulSoup(src, 'lxml')
        # Extracting the HTML of the complete introduction box
        job = soup.find("div", {'class': 'decorated-job-posting__details'})
        jobdescription = job.find('div', {'class': 'show-more-less-html__markup'})
    except:
        jobdescription = 'an error in data parsing'
    jobdescriptions2.append(jobdescription)
Could you please advise me how I can scrape all 1000 rows?
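A minimal sketch of one way to harden such a loop so a failing row does not stop the whole run, assuming links is a list of URL strings; it does not explain why LinkedIn stops responding after ~120 requests (rate limiting is a common cause), it only keeps the loop moving and records which rows failed:

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome(r"C:\Users\shahi\chromedriver.exe")

jobdescriptions = []
for i, link in enumerate(links):  # `links` is assumed to be a list of URLs
    try:
        driver.get(link)
        time.sleep(5)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        job = soup.find("div", {'class': 'decorated-job-posting__details'})
        markup = job.find('div', {'class': 'show-more-less-html__markup'}) if job else None
        jobdescriptions.append(markup.get_text(strip=True) if markup else None)
    except Exception as exc:
        # skip the bad link instead of aborting, and note which row it was
        print(f'row {i}: {link} failed ({exc})')
        jobdescriptions.append(None)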

selenium get abs url from href attribute

When I download a page with Selenium and process it with Java's jsoup, I get the hrefs in the source code like this:
Technical Trading
Is there a way to get the absolute URL from this, or to force Selenium to transform it into an absolute URL? Updating the links after getting the page doesn't sound like a clean solution.
If you get the href just with selenium, this works as expected:
yourElement.get_attribute('href')
This is a quick sample:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # note this is my webdriver
driver.implicitly_wait(10)

url = "https://www.duckduckgo.co.uk"
driver.get(url)

aList = driver.find_elements(By.TAG_NAME, 'a')
for a in aList:
    print(a.get_attribute('href'))
Output contains:
https://duckduckgo.com/spread
https://duckduckgo.com/spread
https://duckduckgo.com/app
https://duckduckgo.com/app
https://duckduckgo.com/newsletter
https://duckduckgo.com/newsletter
This is how the DOM looks: the href in the markup is relative, but get_attribute('href') returns the full path.
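If you only have the raw relative href (for example when parsing driver.page_source yourself instead of asking Selenium for the attribute), you can resolve it against the page URL with the standard library. A minimal sketch; the relative path shown is just an illustrative assumption:

from urllib.parse import urljoin

base_url = driver.current_url        # the page the link was found on
relative_href = "/html.php?id=42"    # hypothetical relative href taken from the page source
absolute_url = urljoin(base_url, relative_href)
print(absolute_url)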

Selenium creating bulk email addresses

I want to use Selenium to create several email addresses at once. I suppose they can be random, but I already have a list of the email account names I want to create.
I know how to create one email address using webdriver, but how would I go about signing up several, one after the other automatically, without having to change the code every time?
Simple code for creating 1 email:
from selenium import webdriver
import time

url = 'https://hotmail.com/'
driver = webdriver.Chrome('C:/Users/Desktop/chromedriver')
driver.get(url)

driver.find_element_by_xpath("//a[contains(@class, 'linkButtonSigninHeader')]").click()
time.sleep(2)
driver.find_element_by_id('MemberName').send_keys('usernameexample')
time.sleep(1)
driver.find_element_by_id('iSignupAction').click()
time.sleep(2)
driver.find_element_by_id('PasswordInput').send_keys('Passwordexample1')
time.sleep(1)
driver.find_element_by_id('iSignupAction').click()
time.sleep(2)
driver.find_element_by_id('FirstName').send_keys('john')
time.sleep(1)
driver.find_element_by_id('LastName').send_keys('wayne')
time.sleep(1)
driver.find_element_by_id('iSignupAction').click()
As others have pointed out, you could iterate over a data collection, such as an array:
array_of_usernames = ['username_one', 'username_two']

for username in array_of_usernames:
    url = 'https://hotmail.com/'
    driver = webdriver.Chrome('C:/Users/Desktop/chromedriver')
    driver.get(url)
    driver.find_element_by_xpath("//a[contains(@class, 'linkButtonSigninHeader')]").click()
    driver.find_element_by_id('MemberName').send_keys(username)  # current username from the list
    driver.find_element_by_id('iSignupAction').click()
    driver.find_element_by_id('PasswordInput').send_keys('Passwordexample1')
    driver.find_element_by_id('iSignupAction').click()
    driver.find_element_by_id('FirstName').send_keys('john')
    driver.find_element_by_id('LastName').send_keys('wayne')
    driver.find_element_by_id('iSignupAction').click()
    # some step to log out so that the next username can register
If you aren't familiar with arrays or iteration, then I'd suggest looking at the docs to get your head around it: https://ruby-doc.org/core-2.6.1/Array.html#method-i-each
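Since the question says the list of account names already exists, it could equally be loaded from a file instead of hard-coding it. A minimal sketch, assuming a hypothetical usernames.txt with one account name per line:

# hypothetical file name; one account name per line
with open('usernames.txt', encoding='utf-8') as f:
    array_of_usernames = [line.strip() for line in f if line.strip()]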

Threading and Selenium

I'm trying to make multiple tabs in Selenium and open a page on each tab simultaneously. Here is the code.
CHROME_DRIVER_PATH = "C:/chromedriver.exe"
from selenium import webdriver
import threading

driver = webdriver.Chrome(CHROME_DRIVER_PATH)

links = ["https://www.google.com/",
         "https://stackoverflow.com/",
         "https://www.reddit.com/",
         "https://edition.cnn.com/"]

def open_page(url, tab_index):
    driver.switch_to_window(handles[tab_index])
    driver.get(url)
    return

# open a blank tab for every link in the list
for link in range(len(links) - 1):  # 1 less because the first tab is already open
    driver.execute_script("window.open();")

handles = driver.window_handles  # get handles

all_threads = []
for i in range(0, len(links)):
    current_thread = threading.Thread(target=open_page, args=(links[i], i,))
    all_threads.append(current_thread)
    current_thread.start()

for thr in all_threads:
    thr.join()
Execution goes without errors, and from what I understand this should logically work correctly. But the effect of the program is not what I imagined. It only opens one page at a time, and sometimes it doesn't even switch the tab... Is there a problem in my code that I'm not aware of, or does threading not work with Selenium?
There is no need to switch to a new window just to load a URL; you can try the below to open each URL in a new tab, one by one:
links = ["https://www.google.com/",
         "https://stackoverflow.com/",
         "https://www.reddit.com/",
         "https://edition.cnn.com/"]

# Open all URLs in new tabs
for link in links:
    driver.execute_script("window.open('{}');".format(link))

# Closing main (empty) tab
driver.close()
Now you can handle (if you want) all the windows from driver.window_handles as usual
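A minimal sketch of what that follow-up handling could look like, switching to each tab in turn and reading its title and URL (the driver set up above is assumed to still be open):

# assumes `driver` already has the tabs opened as above
for handle in driver.window_handles:
    driver.switch_to.window(handle)  # focus this tab
    print(driver.title, driver.current_url)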

How to poll Selenium Hub for the number of its registered nodes?

I searched for an answer in the Selenium Grid documentation but couldn't find anything.
Is it somehow possible to poll the Selenium Hub and get back the number of nodes that are registered to it?
If you check the Grid Console (http://selenium.hub.ip.address:4444/grid/console) you will find valuable information about the Grid's nodes, browsers, IPs, etc.
This is my grid. I have two nodes, one Linux and one Windows.
If you go through the links (Configuration, View Config,...) you will find information about each node and browser.
I finally put together this:
def grid_nodes_num(grid_console_url="http://my_super_company.com:8080/grid/console#"):
    import requests
    from bs4 import BeautifulSoup

    r = requests.get(grid_console_url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc, "html.parser")
    # print(soup.prettify())  # for debugging
    grid_nodes = soup.find_all("p", class_="proxyid")
    if grid_nodes == []:
        print("-No Nodes detected. Grid is down!-")
    else:
        nodes_num = len(grid_nodes)
        print("-Detected", nodes_num, "node(s)-")
        return nodes_num
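On newer Grids (Selenium 4.x) the console page scraped above is no longer available, but the hub exposes a JSON status endpoint that can be polled directly. A hedged sketch, assuming the hub runs on the default port 4444 and that the response carries the node list under value.nodes:

import requests

def grid_nodes_num_v4(status_url="http://localhost:4444/status"):
    # assumption: Selenium Grid 4 style JSON with a value.nodes list
    data = requests.get(status_url).json()
    nodes = data.get("value", {}).get("nodes", [])
    print(f"-Detected {len(nodes)} node(s)-")
    return len(nodes)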