I have this code scraping each job title and company name from:
https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt
This is for every job title:
job_titles = browser.find_elements_by_css_selector("a.job-card-list__title")
c = []
for title in job_titles:
    c.append(title.text)
print(c)
print(len(c))
This is for every company name:
Company_Names = browser.find_elements_by_css_selector("a.job-card-container__company-name")
d = []
for name in Company_Names:
    d.append(name.text)
print(d)
print(len(d))
I provided the URL above; there are many pages! How can I make Selenium auto-open each page and scrape all of the 4,000 results available? I have found a way to paginate to each page, but I don't yet know how to scrape each page.
So the URL is:
https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start=25
The start parameter in the URL increments by 25 from one page to the next, so we add this piece of code, which navigates us successfully to the other pages:
page = 25
browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(page))
for i in range(1, 40):
    page = i * 25
    browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(page))
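To combine the two, the scraping code can run inside the pagination loop, once per page. Below is a minimal sketch, assuming the CSS selectors above still match the rendered job cards and that a fixed sleep is enough for them to load (an explicit wait would be more robust):

import time

all_titles, all_companies = [], []
for i in range(0, 40):                              # start = 0, 25, 50, ...
    start = i * 25
    browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(start))
    time.sleep(3)                                   # crude wait for the job cards to render

    titles = browser.find_elements_by_css_selector("a.job-card-list__title")
    companies = browser.find_elements_by_css_selector("a.job-card-container__company-name")
    all_titles.extend(t.text for t in titles)
    all_companies.extend(comp.text for comp in companies)

print(len(all_titles), len(all_companies))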
I am working to scrape the website "https://www.moglix.com/automotive/car-accessories/216110000?page=101" (NOTE: 101 is the page number, and this site has 783 pages).
I wrote this code to get all the URLs of the products mentioned on the page using BeautifulSoup:
import requests
from bs4 import BeautifulSoup

prod_url = []
for i in range(1, 400):
    r = requests.get(f'https://www.moglix.com/automotive/car-accessories/216110000?page={i}')
    soup = BeautifulSoup(r.content, 'lxml')
    for link in soup.find_all('a', {"class": "ng-tns-c100-0"}):
        prod_url.append(link.get('href'))
There are 40 products on each page, so this should give me about 16,000 product URLs, but I am getting roughly 7,600.
After checking, I can see that the class on the a tags changes from page to page.
How can I get the href for all the products on all the pages?
You can use the find_all method with the attrs argument to get all the a tags, then filter them further using split and startswith to get the exact product link URLs:
res = requests.get(f"https://www.moglix.com/automotive/car-accessories/216110000?page={i}")  # i = page number from the surrounding loop
soup = BeautifulSoup(res.text, "html.parser")
anchors = soup.find_all("a", attrs={"target": "_blank"})
lst = [a['href'] for a in anchors if len(a['href'].split("/")) > 2 and a['href'].startswith("/")]
Output:
['/love4ride-steel-tubeless-tyre-puncture-repair-kit-tyre-air-inflator-with-gauge/mp/msnv5oo7vp8d56',
'/allextreme-exh4hl2-2-pcs-36w-9000lm-h4-led-headlight-bulb-conversion-kit/mp/msnekpqpm0zw52',
'/love4ride-2-pcs-35-inch-fog-angel-eye-drl-led-light-set-for-car/mp/msne5n8l6q1ykl',..........]
I am doing LinkedIn web scraping as part of my college project. This is the code to locate the skills & endorsements, recommendations, and accomplishments sections:
skills = driver.find_elements_by_css_selector('#ember661')
recom = driver.find_elements_by_css_selector('#ember679')
acc = driver.find_elements_by_css_selector('#ember695')
But I am getting an empty list in all three variables. Please help!
There are a couple of reasons.
The id is generated dynamically and is not the same for all profiles.
You should not expect a list of elements. There is a single section of each type on a profile page, so a single element would be returned.
Those sections might be loaded asynchronously, so the page is loaded but the sections are not yet present and the locators find nothing. In that case you need to use an explicit wait, like:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

waiter = WebDriverWait(driver, 10)
skills = waiter.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.pv-skill-categories-section')))
recom = waiter.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.pv-recommendations-section')))
acc = waiter.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.pv-accomplishments-section')))
Hi, I am currently looking to scrape a page such as "https://www.tennis24.com/match/ABiALWlt/#match-statistics;0" every time the score changes. Currently I have the ability to scrape it using Selenium and BS with the code below:
from selenium import webdriver

Chrom_path = r"C:\Users\Dan1\Desktop\chromedriver.exe"
driver = webdriver.Chrome(Chrom_path)
driver.get("https://www.tennis24.com/match/zVrM3ySQ/#match-statistics;0")

data = driver.find_elements_by_class_name("statTextGroup")
for d in data:
    sub_data = d.find_elements_by_xpath(".//*")
    assert len(sub_data) == 3
    for s_d in sub_data:
        print(s_d.get_attribute('class')[19:], s_d.get_attribute('innerText'))
But I have no idea how to automate it so that once the score at the top of the page, shown here as "Medical timeout6 : 6 ( 0 : 0 )", changes, the scraper scrapes the new data. The change to monitor, though, is only visible when the match is in play and is not always there.
If you need any more info, please let me know and I'll be happy to add it.
You can scrape the "scoreboard" class in a while loop, and when it is not the same as its old value, the score has changed and you can scrape the other things you wanted.
Hope it helps.
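A minimal polling sketch of that idea, assuming the page exposes the current score under a "scoreboard" class as the answer suggests (verify the class name against the live page), and reusing the statTextGroup scraping from the question:

import time
from selenium.common.exceptions import NoSuchElementException

last_score = None
while True:
    try:
        score = driver.find_element_by_class_name("scoreboard").text
    except NoSuchElementException:
        score = None                      # scoreboard not rendered, e.g. match not in play
    if score is not None and score != last_score:
        last_score = score                # score changed: re-scrape the statistics
        for d in driver.find_elements_by_class_name("statTextGroup"):
            for s_d in d.find_elements_by_xpath(".//*"):
                print(s_d.get_attribute('class')[19:], s_d.get_attribute('innerText'))
    time.sleep(5)                         # poll every few seconds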
Here is my Scrapy spider:
import scrapy

class Spider(scrapy.Spider):
    name = "name"
    start_urls = ["https://www.aurl"]

    def parse(self, response):
        links_page_urls = response.css("a.dynamic-linkset::attr(href)").extract()
        for url in links_page_urls:
            yield response.follow(url, callback=self.parse_list_page)

        next_data_cursor = response.css("li.next").css("a::attr(href)").extract_first()
        if next_data_cursor:
            self.log("going to next page - {}".format(next_data_cursor))
            yield response.follow(next_data_cursor, callback=self.parse)

    def parse_list_page(self, response):
        urls = response.css("div.row div.list-group a.list-group-item").css("a::attr(href)").extract()
        for url in urls:
            self.log("url - {}".format(url))
            yield scrapy.Request(url=self.base_url + url, callback=self.parse_page)

    def parse_page(self, response):
        # Lots of code for parsing elements from a page,
        # building an item, and returning it
        pass
My observation is that on my own machine at home, with no download delay set, the actual pages are visited in rapid succession and saved to Mongo. When I move this code to an EC2 instance and set the download delay to 60, what I now notice is that the individual web pages aren't visited for scraping; instead, the first page is visited, a next-data cursor is scraped, and that is visited. Then I see a lot of printouts related to scraping the list pages rather than each individual page.
The desired behavior is to visit the initial URL, get the page list, visit each page and scrape it, then move on to the next data cursor and repeat the process.
You could try setting a much lower DOWNLOAD_DELAY (probably 2), setting CONCURRENT_REQUESTS = 1, and assigning a higher priority to the requests from the page list, something like:
yield response.follow(url, callback=self.parse_list_page, priority=1)
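For illustration, those settings could be applied per spider via Scrapy's custom_settings attribute; the values below just mirror the suggestion above and would need tuning against the target site:

class Spider(scrapy.Spider):
    name = "name"
    start_urls = ["https://www.aurl"]

    # Spider-level overrides of the project settings (values are illustrative).
    custom_settings = {
        "DOWNLOAD_DELAY": 2,
        "CONCURRENT_REQUESTS": 1,
    }

    def parse(self, response):
        for url in response.css("a.dynamic-linkset::attr(href)").extract():
            # priority=1 so list pages are scheduled ahead of the next-cursor page
            yield response.follow(url, callback=self.parse_list_page, priority=1)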
My Scrapy script has rules specified as below:
rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=<xpath for next page>), callback='parse_website', follow=True),)
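For reference, here is a minimal sketch of how a rule of that shape sits inside a complete CrawlSpider; the spider name, start URL, and next-page XPath below are placeholders, not taken from the question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "my_spider"                        # placeholder name
    start_urls = ["https://www.example.com"]  # placeholder start URL

    # With follow=True, every page matched by the link extractor is both passed
    # to the callback and re-scanned for further "next page" links, so the chain
    # of next links is followed automatically.
    rules = (
        Rule(
            LinkExtractor(allow=(), restrict_xpaths='//a[@rel="next"]'),  # placeholder XPath
            callback='parse_website',
            follow=True,
        ),
    )

    def parse_website(self, response):
        # parse items from each page here
        pass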
The website itself has navigation, but each page only shows the link to the next page, i.e. as page 1 loads, I can get the link to page 2, and so on.
How do I get my spider to navigate through all of the n pages?
Thank you!