How can I keep scraping data from my dataset of LinkedIn job-description links with Selenium and a for-loop after it stops working?

I made a dataset of LinkedIn links for data science jobs (1000 rows) and wrote a for-loop that opens each link with Selenium and extracts the job description with BeautifulSoup.
It worked until 121 rows had been extracted, then it raised an error and stopped. When I try to resume from row 122, it prints this error for every link:
the link (name of the dataset column), has problem
and LinkedIn does not open. I tested and Selenium can still open Google, for example.
My loop is:
# keep the loop from 123
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome(r"C:\Users\shahi\chromedriver.exe")
jobdescriptions2 = []
for link in links[121:1000]:
    try:
        # This instance will be used to log into LinkedIn
        driver.get(link)
        time.sleep(5)
        src = driver.page_source
    except:
        scr = 'problem in scraping data'
        print(f'the link {link}, has problem')
    try:
        # Now using beautiful soup
        soup = BeautifulSoup(src, 'lxml')
        # Extracting the HTML of the complete introduction box
        job = soup.find("div", {'class': 'decorated-job-posting__details'})
        jobdescription = job.find('div', {'class': 'show-more-less-html__markup'})
    except:
        jobdescription = 'an error in data parcing'
    jobdescriptions2.append(jobdescription)
Could you please advise me how I can scrape all 1000 rows?
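A minimal sketch of a more resilient version of this loop, assuming `links` holds the URLs loaded from the dataset. It resets the page source on every iteration (the original loop assigns `scr` in the except branch but then parses `src`, so a failed request silently re-parses the previous page), skips links that fail, and writes progress to disk so an interrupted run can resume from the last saved row:

import csv
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(r"C:\Users\shahi\chromedriver.exe")
jobdescriptions2 = []

for i, link in enumerate(links[121:1000], start=121):
    src = None  # reset so a failed request never re-parses the previous page
    try:
        driver.get(link)
        time.sleep(5)
        src = driver.page_source
    except Exception as exc:
        print(f'the link {link}, has problem: {exc}')

    jobdescription = 'an error in data parsing'
    if src:
        soup = BeautifulSoup(src, 'lxml')
        job = soup.find('div', {'class': 'decorated-job-posting__details'})
        if job:
            node = job.find('div', {'class': 'show-more-less-html__markup'})
            if node:
                jobdescription = node.get_text(strip=True)
    jobdescriptions2.append(jobdescription)

    # Save progress after every row so the run can restart from the last saved index.
    with open('jobdescriptions_partial.csv', 'a', newline='', encoding='utf-8') as f:
        csv.writer(f).writerow([i, link, jobdescription])

driver.quit()

If every request starts failing around the same row, LinkedIn is most likely rate-limiting or blocking the session, so longer or randomized pauses between requests (or an authenticated session) may matter more than any code change.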

Related

Code returns NULL

import requests  # request img from web
from bs4 import BeautifulSoup

imglinks = []
for i in range(1, 26):
    resp = requests.get('https://******/************.html')
    soup = BeautifulSoup(resp.content, "html.parser")
    links = soup.select('.container > .****** > .img')
    for link in links:
        imglinks.append(link['src'])
print(links)
My code returns NULL (actually it returns []). Maybe there is a problem with my soup.select?
Sometimes BeautifulSoup is unable to find the selected elements within a webpage. You might either need to render the page's JavaScript (for example with HTMLSession from the requests-html library) or you may have been blocked by the website for trying to access it with a bot.
If it isn't either of those reasons, I suggest scraping it manually by searching for offsets in the raw response text (resp.text).
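A hedged sketch of the JavaScript-rendering route with requests-html; the URL and CSS selector below are placeholders, since the originals are redacted, and render() downloads a Chromium build via pyppeteer on first use:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://example.com/page.html')  # placeholder for the redacted URL
r.html.render()  # executes the page's JavaScript before parsing

imglinks = []
for img in r.html.find('.container img'):  # placeholder for the redacted selector
    src = img.attrs.get('src')
    if src:
        imglinks.append(src)
print(imglinks)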

Web Scraping/ Web Crawling

Can somebody help me figure out how to scrape/crawl this website? https://www.arkansasonline.com/i/lrcrime/
I've downloaded the page source with requests and parsed it with BeautifulSoup, but I can't figure out what's going on.
Here is what I have so far:
#####################################################
import requests
from bs4 import BeautifulSoup
url = 'https://www.arkansasonline.com/i/lrcrime/'
r = requests.get(url, headers=headers).text  # headers is defined earlier in my script (not shown here)
soup = BeautifulSoup(r,'html.parser' )
data = [x.get_text() for x in soup.find_all('td')]
#####################################################
Typically this would do the trick and I'd get a list of all the table data cells.
But I'm getting
['SEARCH | DISPATCH LOG | STORIES | HOMICIDES | OLD MAP',
'\n\n\n\nClick here to load this Caspio Cloud Database\nCloud Database by Caspio\n']
which is far from what I need.
Also, how do I crawl the 3000 pages?
I also tried to do it with a macro, just recording my keystrokes and saving to Google Drive, but the page moves around as you go through the pages, which makes that basically impossible. They are trying to hide the crime data, in my opinion. I want to scrape it all into one database and release it to the public.
What happens?
Most of the content is provided dynamically, so you won't get it with requests; only the first table with some navigation is served statically, which is why you get this result.
There is also a "big" hint you should respect:
This Content is Only Available to Subscribers.
How to fix?
NOTE: Being kind and subscribing to consume the content would be the best solution.
Assuming you still want to scrape it, you should go with Selenium, which will render the page as a browser does and also provide the table you are looking for.
You have to wait until the table is loaded:
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table[data-cb-name="cbTable"]')))
Grab the page_source and load it with pandas as a DataFrame.
Example (selenium 4)
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
service = Service(executable_path='ENTER YOUR PATH TO CHROMEDRIVER')
driver = webdriver.Chrome(service=service)
driver.get('https://www.arkansasonline.com/i/lrcrime/')
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table[data-cb-name="cbTable"]')))
pd.read_html(driver.page_source)[1]
Output
INCIDENT NUMBER   OFFENSE DESCRIPTION        Location/address          ZIP     INCIDENT DATE
2021-134824       ALL OTHER LARCENY          8109 W 34TH ST            72204   11/1/2021
2021-134815       BURGLARY/B&E               42 NANDINA LN             72210   11/1/2021
2021-134790       ROBBERY                    1800 BROADWAY ST          72206   11/1/2021
2021-134788       AGGRAVATED ASSAULT         11810 PLEASANT RIDGE RD   72223   11/1/2021
2021-134778       THEFT FROM MOTOR VEHICLE   4 HANOVER DR              72209   11/1/2021
...               ...                        ...                       ...     ...
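For the "how do I crawl the 3000 pages?" part, the same wait-then-read pattern can be repeated while clicking the table's next-page control. This is only a hedged sketch continuing from the Selenium 4 example above (driver, WebDriverWait, By and EC are already set up there); the next-page selector is a guess, since the Caspio widget's markup isn't shown here, and has to be checked in the browser's dev tools:

import time

import pandas as pd
from selenium.common.exceptions import NoSuchElementException

frames = []
for _ in range(3000):  # the question mentions roughly 3000 pages; also acts as a safety cap
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'table[data-cb-name="cbTable"]'))
    )
    frames.append(pd.read_html(driver.page_source)[1])
    try:
        # 'a[data-cb-name="JumpToNext"]' is a hypothetical selector for the widget's
        # next-page link; verify it in dev tools before relying on it.
        driver.find_element(By.CSS_SELECTOR, 'a[data-cb-name="JumpToNext"]').click()
        time.sleep(2)  # crude pause so the widget can swap in the next page
    except NoSuchElementException:
        break  # no next-page link found, assume this was the last page

all_rows = pd.concat(frames, ignore_index=True)
all_rows.to_csv('lrcrime.csv', index=False)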

Scraping: scrape multiple pages in looping (Beautifulsoup)

I am trying to scrape real estate data using BeautifulSoup, but when I save the results to a .csv file it only contains the information from the first page. I would like to scrape the number of pages I set in the "pages_number" variable.
# imports used by this snippet
import os
import time

from bs4 import BeautifulSoup
from selenium import webdriver

# How many pages
pages_number = int(input('How many pages? '))

# start the execution timer
tic = time.time()

# Chromedriver
chromedriver = "./chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

# initial link
link = 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=1'
driver.get(link)

# loop over the pages
for page in range(1, pages_number + 1):
    time.sleep(15)
    data = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    soup_complete_source = BeautifulSoup(data.encode('utf-8'), "lxml")
I already tried this solution but got an error:
link = 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page={}.format(page)'
Does anyone have any idea what can be done?
COMPLETE CODE
https://github.com/arturlunardi/webscraping_vivareal/blob/main/scrap_vivareal.ipynb
I see that the URL you are using belongs to page 1 only.
https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=1
Are you changing it anywhere in your code? If not, then no matter what you fetch, it will only fetch from page 1.
You should do something like this:
for page in range(1, pages_number + 1):
    chromedriver = "./chromedriver"
    os.environ["webdriver.chrome.driver"] = chromedriver
    driver = webdriver.Chrome(chromedriver)

    # build the link for the current page
    link = f"https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page={page}"
    driver.get(link)
    time.sleep(15)
    data = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    soup_complete_source = BeautifulSoup(data.encode('utf-8'), "lxml")
    driver.close()
Test output (not the soup part) for pages_number = 3 (URLs stored in a list, for easy viewing):
['https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=1', 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=2', 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=3']
Process finished with exit code 0
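Since the original symptom was that only page 1 ended up in the .csv, here is a hedged variation of the same idea that reuses a single driver and collects every page's parsed source in a list before writing the file (the `pages` list is a name introduced here for illustration):

import os
import time

from bs4 import BeautifulSoup
from selenium import webdriver

pages_number = 3
chromedriver = "./chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)

pages = []  # one BeautifulSoup object per scraped page
for page in range(1, pages_number + 1):
    link = f"https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page={page}"
    driver.get(link)
    time.sleep(15)
    html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    pages.append(BeautifulSoup(html, "lxml"))
driver.quit()

# `pages` now holds every scraped page; extract the listings from each soup and
# write them to the .csv in one pass instead of overwriting it per page.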

Web Scraping app using Scrapy / Selenium generates Error: "ModuleNotFoundError 'selenium'"

Good morning!
I have recently started learning Python and moved on to applying the little I know to create the monstrosity seen below.
In brief:
I am attempting to scrape SEC EDGAR (https://www.sec.gov/edgar/searchedgar/cik.htm) for the CIK codes of companies I want to study in more detail (for now just one company, to see if this is the right approach).
To scrape the CIK code, I created a Scrapy spider, imported Selenium and wrote three functions: the first inserts the company name into the input field, the second activates the "submit" button, and the third scrapes the CIK code once submit is activated and returns the item.
Apart from adding the item to items.py, I haven't changed the middlewares or settings.
For some reason, I am getting ModuleNotFoundError for 'selenium', although I have installed the package and imported selenium and webdriver along with everything else.
I have tried to mess around with the indentation and rephrased the code, but achieved no improvement.
import selenium
from selenium import webdriver
import scrapy
from ..items import Sec1Item
from scrapy import Selector


class SecSpSpider(scrapy.Spider):
    name = 'SEC_sp'
    start_urls = ['http://https://www.sec.gov/edgar/searchedgar/cik.htm/']

    def parse(self, response):
        company_name = 'INOGEN INC'
        return scrapy.FormRequest.from_response(response, formdata={
            'company': company_name
        }, callback=self.start_requests())

    def start_requests(self):
        driver = webdriver.Chrome()
        driver.get(self.start_urls)
        while True:
            next_url = driver.find_element_by_css_selector(
                '.search-button'
            )
            try:
                self.parse(driver.page_source)
                next_url.click()
            except:
                break
        driver.close()

    def parse_page(self, response):
        items = Sec1Item()
        CIK_code = response.css('a::text').extract()
        items["CIK Code: "] = Sec1Item
        yield items
I can't seem to get past the import selenium error, so I am not sure how much of the rest of my spider needs adjusting.
Error message:
"File/Users/user1/PycharmProjects/Scraper/SEC_1/SEC_1/spiders/SEC_sp.py", line 1, in <module>
import selenium
ModuleNotFoundError: No module named 'selenium'
Thank you for any assistance and help!
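A ModuleNotFoundError after a seemingly successful install usually means Scrapy is being run by a different Python interpreter or virtual environment than the one Selenium was installed into (a common situation with PyCharm-managed virtualenvs). A hedged sanity check, run with the same interpreter that launches the spider:

# Hedged check: print which interpreter is running and whether it can see selenium.
import sys

print(sys.executable)  # the Python binary actually being used

try:
    import selenium
    print("selenium", selenium.__version__, "found at", selenium.__file__)
except ModuleNotFoundError:
    print("selenium is not installed for this interpreter; install it with:")
    print(sys.executable, "-m pip install selenium")

If the printed path is not the environment where pip install selenium was run, either install Selenium with that exact interpreter (python -m pip install selenium) or point the project interpreter in PyCharm at the environment that already has it.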

How to poll Selenium Hub for the number of its registered Nodes?

I searched for an answer in the Selenium Grid documentation but couldn't find anything.
Is it somehow possible to poll the Selenium Hub and get back the number of nodes that are registered to it?
If you check the Grid Console (http://selenium.hub.ip.address:4444/grid/console) you will find valuable information about the Grid's nodes, browsers, IPs, etc.
This is my grid; I have two nodes, one Linux and one Windows.
If you go through the links (Configuration, View Config, ...) you will find information about each node and browser.
I finally put together this:
def grid_nodes_num(grid_console_url="http://my_super_company.com:8080/grid/console#"):
    import requests
    from bs4 import BeautifulSoup

    r = requests.get(grid_console_url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc, "html.parser")
    # print(soup.prettify())  # for debugging
    grid_nodes = soup.find_all("p", class_="proxyid")
    if not grid_nodes:
        print("-No Nodes detected. Grid is down!-")
    else:
        nodes_num = len(grid_nodes)
        print(f"-Detected {nodes_num} node(s)-")
        return nodes_num
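If the hub is a newer Selenium Grid (4.x) there is no /grid/console page, but the hub exposes a JSON /status endpoint that can be polled directly. A hedged sketch, assuming the default hub address and that the node list lives under value.nodes (the exact key layout can vary between Grid versions):

import requests

def grid4_nodes_num(hub_url="http://localhost:4444"):
    # Grid 4 serves its status as JSON at /status; count the registered nodes.
    status = requests.get(f"{hub_url}/status").json()
    nodes = status.get("value", {}).get("nodes", [])
    print(f"-Detected {len(nodes)} node(s)-")
    return len(nodes)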