BeautifulSoup.find_all does not return the class shown by "Inspect" under "div" - beautifulsoup

Goal: Query all the job records as a collection, as part of scraping a job website.
Steps: There are 100 job records. Using "Inspect" in Google Chrome on a single job record shows the following:
<div class="coveo-list-layout CoveoResult">
  <div class="coveo-result-frame item-wrap">
    <div class="content-main">
      <div class="coveo-result-cell content-wrap">
Problem: The following code does not return a count of 100; it returns 0. Each of the classes above was passed to find_all, but none returns the 100 records. A snip of the "Inspect" output is attached to show the class associated with a single record. Session transcript:
>>> response = requests.get(url)
>>> print(response)
<Response [200]>
>>> response.reason
'OK'
>>> soup = BeautifulSoup(response.text, 'html.parser')
>>> cards = soup.find_all('div', 'content-list-layout CoveoResult')
>>> len(cards)
0
>>> cards = soup.find_all('div')
>>> len(cards)
86
Code tried as follows (none of them works):
cards = soup.find_all('div','content-list-layout CoveoResult')
cards = soup.find_all('div','content-list-layout')
cards = soup.find_all('div','coveo-result-frame item-wrap')
cards = soup.find_all('div','coveo-result-frame')
cards = soup.find_all('div','content-main')
cards = soup.find_all('div','coveo-result-cell content-wrap')
cards = soup.find_all('div','coveo-result-cell')
Next Steps: Need help finding the class associated with a single record. As a debugging step I have generated the output of "cards = soup.
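Two things stand out in the attempts above. First, they pass 'content-list-layout ...' while "Inspect" shows coveo-list-layout, so the string can never match. Second, even the correct classes may be absent from response.text: Coveo result lists are typically rendered client-side by JavaScript, so requests would only see the pre-render page (the 86 plain divs), and a browser-driven tool such as Selenium would be needed. A minimal sketch of class matching with BeautifulSoup, using an inline snippet copied from the "Inspect" output above in place of the real fetched page:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the fetched page, mirroring the "Inspect" output above.
html = """
<div class="coveo-list-layout CoveoResult">
  <div class="coveo-result-frame item-wrap">
    <div class="content-main">
      <div class="coveo-result-cell content-wrap">Job 1</div>
    </div>
  </div>
</div>
<div class="coveo-list-layout CoveoResult">
  <div class="coveo-result-frame item-wrap">
    <div class="content-main">
      <div class="coveo-result-cell content-wrap">Job 2</div>
    </div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Matching on a single class name is more robust than the full multi-class string:
cards = soup.find_all("div", class_="CoveoResult")
print(len(cards))  # 2

# Equivalent CSS selector requiring both classes:
both = soup.select("div.coveo-list-layout.CoveoResult")
```

If these selectors return 0 against the live response.text, that confirms the records are injected by JavaScript after page load.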

Related

How to scrape a data table that has multiple pages using selenium?

I'm extracting NBA stats from my Yahoo fantasy account. Below is the code I wrote in a Jupyter notebook using Selenium. Each page shows 25 players, out of 720 players in total, so I wrote a for loop that scrapes players in increments of 25 instead of one by one.
for k in range(0, 725, 25):
    Players = driver.find_elements_by_xpath('//tbody/tr/td[2]/div/div/div/div/a')
    Team_Position = driver.find_elements_by_xpath('//span[@class="Fz-xxs"]')
    Games_Played = driver.find_elements_by_xpath('//tbody/tr/td[7]/div')
    Minutes_Played = driver.find_elements_by_xpath('//tbody/tr/td[11]/div')
    FGM_A = driver.find_elements_by_xpath('//tbody/tr/td[12]/div')
    FTM_A = driver.find_elements_by_xpath('//tbody/tr/td[14]/div')
    Three_Points = driver.find_elements_by_xpath('//tbody/tr/td[16]/div')
    PTS = driver.find_elements_by_xpath('//tbody/tr/td[17]/div')
    REB = driver.find_elements_by_xpath('//tbody/tr/td[18]/div')
    AST = driver.find_elements_by_xpath('//tbody/tr/td[19]/div')
    ST = driver.find_elements_by_xpath('//tbody/tr/td[20]/div')
    BLK = driver.find_elements_by_xpath('//tbody/tr/td[21]/div')
    TO = driver.find_elements_by_xpath('//tbody/tr/td[22]/div')
    NBA_Stats = []
    for i in range(len(Players)):
        players_stats = {'Name': Players[i].text,
                         'Position': Team_Position[i].text,
                         'GP': Games_Played[i].text,
                         'MP': Minutes_Played[i].text,
                         'FGM/A': FGM_A[i].text,
                         'FTM/A': FTM_A[i].text,
                         '3PTS': Three_Points[i].text,
                         'PTS': PTS[i].text,
                         'REB': REB[i].text,
                         'AST': AST[i].text,
                         'ST': ST[i].text,
                         'BLK': BLK[i].text,
                         'TO': TO[i].text}
    driver.get('https://basketball.fantasysports.yahoo.com/nba/28951/players?status=ALL&pos=P&cut_type=33&stat1=S_AS_2021&myteam=0&sort=AR&sdir=1&count=' + str(k))
The browser goes page by page after it's done, and I print out the results, but it only scrapes 1 player. What did I do wrong?
A picture of my code and the printed results is attached.
It's hard to see what the issue is without looking at the original page (can you provide a URL?), but looking at this:
next = driver.find_element_by_xpath('//a[@id="yui_3_18_1_1_1636840807382_2187"]')
"1636840807382" looks like a JavaScript timestamp, so my guess is that the reference you've hardcoded there is dynamically generated, and the element "yui_3_18_1_1_1636840807382_2187" no longer exists by the time you look for it.
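Beyond the dynamic id, the posted loop also resets NBA_Stats on every page, never appends players_stats to it, and only navigates with driver.get at the end of each pass. Since the count= query parameter already pages the table by 25 rows, one way around both problems (a sketch, not the asker's code) is to build the page URLs up front, load each page first, then scrape and append:

```python
# Build one URL per page; the count= parameter pages the table by 25 rows.
BASE = ("https://basketball.fantasysports.yahoo.com/nba/28951/players"
        "?status=ALL&pos=P&cut_type=33&stat1=S_AS_2021&myteam=0"
        "&sort=AR&sdir=1&count=")
page_urls = [BASE + str(k) for k in range(0, 725, 25)]  # 29 pages

# Sketch of the corrected loop order (Selenium calls commented out so the
# snippet stands alone): initialise the result list once, load the page
# *before* scraping it, and append every player's dict.
NBA_Stats = []
for url in page_urls:
    pass
    # driver.get(url)
    # players = driver.find_elements_by_xpath('//tbody/tr/td[2]/div/div/div/div/a')
    # for p in players:
    #     NBA_Stats.append({'Name': p.text})  # plus the other columns

print(len(page_urls))  # 29
```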

function with loop for urls won't return all of them

I am working on a project to pull details from state reports on restaurant inspections. Each inspection has its own URL. I am able to gather the values into a dictionary, but only one at a time. If I don't specify a specific list entry in the call to the function, I get an error: "'list' object has no attribute 'timeout'". If I ask for a specific entry, I get a good return. How can I get them all?
# loop through the url list to gather inspection details
from urllib.request import urlopen
import bs4

detailsLib = {}
def get_inspect_detail(urlList):
    html = urlopen(urlList)
    soup = bs4.BeautifulSoup(html.read(), 'lxml')
    details = soup.find_all('font', {'face': 'verdana'})[10:]
    result = []
    for detail in details:
        siteName = details[0].text
        licNum = details[2].text
        siteRank = details[4].text
        detailsLib = {
            'Restaurant': siteName,
            'License': licNum,
            'Rank': siteRank,
        }
        result.append(detailsLib)
    return result

get_inspect_detail(urlList[21])
So I can get the 21st restaurant on the list, repeated 36 times, but not all of them.
Another question for another day is where to do the clean-up. The details will need some regex work but I'm unsure whether to do that inside the function (one at a time), or outside the function by calling all values from a specific key in the library.
Call get_inspect_detail() once per item in urlList, and save all the results:
all_results = []
for url in urlList:
    details = get_inspect_detail(url)
    all_results.extend(details)
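There is a second issue, separate from the one the answer fixes: the inner loop indexes details[0]/[2]/[4] on every pass, which is why the same restaurant comes back 36 times. A sketch of one fix, assuming each inspection record occupies six consecutive <font> tags (an assumption; check the real page), stepping through the tag texts in chunks:

```python
def build_records(texts, span=6):
    """Build one dict per record, assuming each record spans `span`
    consecutive tag texts (hypothetical layout -- verify on the real page)."""
    result = []
    for i in range(0, len(texts) - span + 1, span):
        result.append({
            'Restaurant': texts[i],
            'License': texts[i + 2],
            'Rank': texts[i + 4],
        })
    return result

# Stand-in for [d.text for d in details]; the real values come from the scrape.
texts = ['Cafe A', '', 'L-1', '', 'A', '',
         'Diner B', '', 'L-2', '', 'B', '']
print(build_records(texts))
```

On the regex clean-up question: doing it inside the function, as each record is built, keeps the dictionary values clean from the start and avoids a second pass over the results.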

Scrapy only show the first result of each page

I need to scrape the items on the first page, then click the next button to go to the second page, scrape it, and so on.
This is my code, but it only scrapes the first item of each page: if there are 20 pages, it visits every page and scrapes only the first item.
Could anyone please help me? Thank you, and apologies for my English.
class CcceSpider(CrawlSpider):
    name = 'ccce'
    item_count = 0
    allowed_domain = ['www.example.com']
    start_urls = ['https://www.example.com./afiliados value=&categoria=444&letter=']

    rules = {
        # Rule for each item
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//li[@class="pager-next"]/a')), callback='parse_item', follow=True),
    }

    def parse_item(self, response):
        ml_item = CcceItem()
        # product info
        ml_item['nombre'] = response.xpath('normalize-space(//div[@class="news-col2"]/h2/text())').extract()
        ml_item['url'] = response.xpath('normalize-space(//div[@class="website"]/a/text())').extract()
        ml_item['correo'] = response.xpath('normalize-space(//div[@class="email"]/a/text())').extract()
        ml_item['descripcion'] = response.xpath('normalize-space(//div[@class="news-col4"]/text())').extract()
        self.item_count += 1
        if self.item_count > 5:
            # insert_table(ml_item)
            raise CloseSpider('item_exceeded')
        yield ml_item
As you haven't given a working target URL, I'm guessing a bit here, but most probably this is the problem:
parse_item should be a parse_page (and act accordingly)
Scrapy is downloading a full page which has - according to your description - multiple items and then passes this as a response object to your parse method.
It's your parse method's responsibility to process the whole page by iterating over the items displayed on the page and creating multiple scraped items accordingly.
The scrapy documentation has several good examples for this, one is here: https://doc.scrapy.org/en/latest/topics/selectors.html#working-with-relative-xpaths
Basically your code structure in def parse_XYZ should look like this:
def parse_page(self, response):
    items_on_page = response.xpath('//...')
    for sel_item in items_on_page:
        ml_item = CcceItem()
        # product info
        ml_item['nombre'] = # ...
        # ...
        yield ml_item
Insert the right xpaths for getting all items on the page and adjust your item xpaths and you're ready to go.
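One subtle trap when filling in those xpaths: an expression that starts with // still searches the whole document even when called on sel_item, which reproduces exactly the "only the first item" symptom; inside the loop the paths must start with .// (the point of the relative-XPaths link above). A minimal stand-alone demonstration using the standard library's ElementTree; in Scrapy itself the equivalent would be sel_item.xpath('.//h2/text()'):

```python
import xml.etree.ElementTree as ET

# Toy page with two result items standing in for the real listing.
page = ET.fromstring("""
<html><body>
  <div class="item"><h2>first</h2></div>
  <div class="item"><h2>second</h2></div>
</body></html>
""")

items = page.findall('.//div[@class="item"]')
# Relative path (leading dot): searched from each item, one match apiece.
names = [item.find('.//h2').text for item in items]
print(names)  # ['first', 'second']
```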

Web page is showing weird unicode(?) letters: \u200e

How can I remove that? I've tried so many things and I'm exhausted from fighting this error by myself; I've spent the last 3 hours on it and I surrender. Please help.
The first "for" statement grabs article titles from news.google.com.
The second "for" statement grabs the time of submission of each article on news.google.com.
This is on Django, by the way, and the page shows the article titles and their submission times in a list, going down. The weird Unicode letters come from the second "for" statement, the submission times. Here is my views.py:
def articles(request):
    """ Grabs the most recent articles from the main news page """
    import bs4, requests
    list = []
    list2 = []
    url = 'https://news.google.com/'
    r = requests.get(url)
    try:
        r.raise_for_status() == True
    except ValueError:
        print('Something went wrong.')
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    for listarticles in soup.find_all('h2', 'esc-lead-article-title'):
        if listarticles is not None:
            a = listarticles.text
            list.append(a)
    for articles_times in soup.find_all('span', 'al-attribution-timestamp'):
        if articles_times is not None:
            b = articles_times.text
            list2.append(b)
    list = zip(list, list2)
    context = {'list': list}
    return render(request, 'newz/articles.html', context)
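U+200E is the invisible LEFT-TO-RIGHT MARK character that Google News embeds around its timestamp text. A minimal sketch of the fix, assuming the mark only needs to be stripped from the scraped string (it would be applied to b = articles_times.text in the second loop):

```python
# Stand-in for articles_times.text; the real value comes from the scrape.
raw = '\u200e12 minutes ago\u200e'

# Remove the left-to-right marks, then trim ordinary whitespace.
clean = raw.replace('\u200e', '').strip()
print(clean)  # 12 minutes ago
```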

Scraping tabulated paginated data

A page with data I need has changed its structure to a new paginated format. I'm working on updating my scraper for the page.
I can't understand how to collect the data from all of the different pages.
The page to be scraped is: http://eserver.goutsi.com:8080/DPW230.cgi
I know how to collect the data in the tables but I can't figure out how to handle the pagination.
This is my original script:
scrape_actor = Mechanize.new
page = scrape_actor.get("http://loads.goutsi.com:8080/wntv5/BKLoad")
rows = page.body.to_s.split("</tr>")
rows.each do |row|
  if row.include? "bgcolor='#f5f5f5'"
    columns = row.split("</td>")
    i = 0
    while i < columns.count
      columns[i] = columns[i].gsub(%r{</?[^>]+?>}, '').gsub(/[\n\t\r ]+/, '').gsub("&nbsp;", '')
      i += 1
    end
    username = "UTSI"
    origin = columns[0].gsub("&nbsp;", "")
    pickup = Chronic.parse(columns[1] + "/" + Time.now.strftime("%Y"))
    dest = columns[3]
    comments = "miles: #{columns[4]}, phone: #{columns[9]}, other: #{columns[11]}"
    equipment = columns[6]
    ltl = false
    ltl = true if columns[7] == "LTL"
    Scrape.post_load(username, origin, dest, pickup, '', ltl, equipment, comments, '', '', '')
  end
end
When you click one of the page links, JavaScript on the page triggers a POST request to the same path with the new page number.
This can be found in their js file at http://eserver.goutsi.com:8080/js/LoadBoard.js
function gotoPage(pageNumber)
{
    document.getElementById("PageNbr").value = pageNumber;  // Set new page number
    document.getElementById("PageDir").value = "R";         // Refresh
    document.getElementById("theForm").submit();
}
That code submits this form:
<form action="/DPW230.cgi" method="post" id="theForm">...
Which has the field:
<input type="hidden" id="PageDir" name="PageDir" value=" "><input type="hidden" id="PageNbr" name="PageNbr" value="1">
whose values are set by that same JavaScript.
This means you need to make POST requests to this URL with consecutive page numbers as params, then parse each page in turn, aggregating the results.
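In Python terms (the original script is Ruby/Mechanize, but the idea is identical), the payload just mirrors the two hidden fields that gotoPage() fills in before submitting the form. The helper below is a sketch; the total page count would have to be read from the page itself:

```python
def page_payload(page_number):
    """Form fields that gotoPage() sets on #theForm before submitting."""
    return {"PageNbr": str(page_number), "PageDir": "R"}

# With requests, the aggregation loop would look like (total_pages must be
# discovered from the first response):
#   import requests
#   for n in range(1, total_pages + 1):
#       resp = requests.post("http://eserver.goutsi.com:8080/DPW230.cgi",
#                            data=page_payload(n))
#       ... parse the table rows out of resp.text ...

print(page_payload(3))  # {'PageNbr': '3', 'PageDir': 'R'}
```

The same payload works from Ruby by posting the form Mechanize already parsed from the page.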