BS4 Beautiful Soup extract text from find_all

BS4 Beautiful Soup extract text from find_all - beautifulsoup

I am scraping a website and would like to create a list of prices.
prices = soup.find_all("li", class_="price")
However, this returns:
<li class="price">€13.99</li>,
<li class="price">€12.99</li>,
.....
How do I extract just the price? I tried
prices = soup.find_all("li", class_="price", text=True)
but it did not work.
I know I can go through the list manually and extract the text but this isn't ideal.

Assuming content is not dynamically added, which it appears it is not, I would use .text to extract from elements returned by using select
prices = [item.text for item in soup.select('li.price')]

find_all() returns list of element.You need to iterate this to get each element and then get the text of the element.
prices = soup.find_all("li", class_="price", text=True)
for price in prices:
print(price.text)

Related

HREF Class changing on every page

I am working to scrape the website:- "https://www.moglix.com/automotive/car-accessories/216110000?page=101" NOTE: 101 is the page number and this site has 783 pages.
I wrote this code to get all the URL's of the product mentioned on the page using beautifulsoup:-
prod_url = []
for i in range(1,400):
r = requests.get(f'https://www.moglix.com/automotive/car-accessories/216110000?page={i}')
soup = BeautifulSoup(r.content,'lxml')
for link in soup.find_all('a',{"class":"ng-tns-c100-0"}):
prod_url.append(link.get('href'))
There are 40 products on each page, and this should give me 16000 URLs for the products but I am getting 7600(approx)
After checking I can see that the class for a tag is changing on pages. For Eg:-
How to get this href for all the products on all the pages.

You can use find_all method and specified attrs to get all a tags also further filter it by using split and startswith method to get exact product link URL's
res=requests.get(f"https://www.moglix.com/automotive/car-accessories/216110000?page={i}")
soup=BeautifulSoup(res.text,"html.parser")
x=soup.find_all("a",attrs={"target":"_blank"})
lst=[i['href'] for i in x if (len(i['href'].split("/"))>2 and i['href'].startswith("/"))]
Output:
['/love4ride-steel-tubeless-tyre-puncture-repair-kit-tyre-air-inflator-with-gauge/mp/msnv5oo7vp8d56',
'/allextreme-exh4hl2-2-pcs-36w-9000lm-h4-led-headlight-bulb-conversion-kit/mp/msnekpqpm0zw52',
'/love4ride-2-pcs-35-inch-fog-angel-eye-drl-led-light-set-for-car/mp/msne5n8l6q1ykl',..........]

XPath selector returns empty list

I'm trying to scrape data from store: https://www.tibia.com/charactertrade/?subtopic=currentcharactertrades&page=details&auctionid=12140&source=overview
There is no problem with getting data from 1st and 2nd table, but when I goes down, xpath returns only empty lists.
even tried to save response in file:
scrapy fetch --nolog "https://www.tibia.com/charactertrade/?subtopic=currentcharactertrades&page=details&auctionid=3475&source=overview" > response.html
for table with skills everything works good
sword = response.xpath('//div [#class="AuctionHeader"]/a/text()').get()
but when it comes to getting for example gold value, I get only empty list:
gold = response.xpath('/html/body/div[3]/div[1]/div[2]/div/div[2]/div/div[1]/div[2]/div[5]/div/div/div[3]/div[2]/div[2]/table/tbody/tr/td/div/table/tbody/tr[2]/td/div[2]/div/table/tbody/tr[3]/td/div/text()').get()
In chrome/firefox both selectors works smooth, but in scrapy only 1st one
I know there might be some problems with data updated by javascript, but it doesn't look like this case

Doesn't look like it's a javascript problem. Think you're not getting your XPATH selectors correct. It's best to be as specific as possible and not to use multiple nodes down. Here we can select the attribute TableContent to get the tables you want. There you can select each individual table that you require if needed.
Code Example
table = response.xpath('//table[#class="TableContent"]')[3]
gold_title = table.xpath('tr/td/span/text()')[2].get()
gold_value = table.xpath('tr/td/div/text()')[2].get()
output
'Gold: '
'31,030'
Explanation
Using the class attribute TableContent, you can select which table you want. Here I've selected the table with the gold values. I've then selected each row and the specific element which has the gold value. The values are hidden behind span and div elements. get() returns a string, getall() returns a list.

How to identify element by text when text is spanned over multiple lines using Selenium

Is it possible to write xpath using contains text such as(Below is what I want but does not work)
//ul[#role='listbox']/..//span[contains(text(),'Fast-Food Restaurent')]
Page Code:
<span class="item-title" md-highlight-text="searchText" md-highlight-flags="i">Fast-<span class="highlight">Food</span> Restaurant</span>
It is an auto complete text box when I enter the word food, there are some options and I want to select Fast-Food Restaurant from it.
Thanks in advance.

To identify the element you can use either of the following xpath based Locator Strategies:
Using the texts Food and Fast:
//ul[#role='listbox']/..//span[.//span[text()='Food']][contains(.,'Fast-')]
Using the texts Food and Restaurant:
//ul[#role='listbox']/..//span[.//span[text()='Food']][contains(.,'Restaurant')]
Using the texts Food, Fast and Restaurant:
//ul[#role='listbox']/..//span[.//span[text()='Food']][contains(.,'Fast-') and contains(.,'Restaurant')]

You can select required span node based on its string representation:
//span[.='Fast-Food Restaurant']

You can amend your path to the following:
//ul[#role='listbox']/..//span[contains(normalize-space(), 'Fast-Food Restaurant')]

Scraping categories and subcategories using beautifulsoup

I am trying to retrieve all categories and subcategories within a website. I am able to use BeautifulSoup to pull every single product in the category once I am in it. However, I am struggling with the loop for categories. I'm using this as a test website: http://www.shophive.com.
How do I loop through each category as well as the subcategories on the left side of the website? I would like to extract all products within the category/subcategory and display on my page.

from bs4 import BeautifulSoup
import user_agent
import requests
useragent = user_agent.generate_user_agent(device_type='desktop')
headers = {'User-Agent': useragent}
req = requests.get('http://www.shophive.com/', headers=headers)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
main_category_links = []
for div in soup.find_all('div', class_='parentMenu arrow'):
main_category_links.append(soup.find('a').get('href'))
print(main_category_links)
subcategory_links = []
for link in soup.find_all('a', class_='itemMenuName'):
subcategory_links.append(link.get('href'))
print(subcategory_links)
I'll break this down for you piece by piece.
useragent = user_agent.generate_user_agent(device_type='desktop')
headers = {'User-Agent': useragent}
req = requests.get('http://www.shophive.com/', headers=headers)
html = req.text
Here we just make the request and store the HTML. I use a module called "user_agent" to generate a User Agent to use in the headers, just my preference.
<div class="parentMenu arrow">
<a href="http://www.shophive.com/year-end-clearance-sale">
<span>New Year's Clearance Sale</span>
</a>
</div>
The links for the main categories are stored like so, so in order to extract just the links we do this:
main_category_links = []
for div in soup.find_all('div', class_='parentMenu arrow'):
main_category_links.append(soup.find('a').get('href'))
We iterate over the results of soup.find_all('div', class_='parentMenu arrow') since elements the links we want are children of these elements. Then we append soup.find('a').get('href') to our list of main category links. We use soup.find this time because we only want one result, then we get the contents of the href.
<a class="itemMenuName level1" href="http://www.shophive.com/apple/mac">
<span>Mac / Macbooks</span>
</a>
The subcategories are stored like this, notice the "a" tag has a class this time, this makes it a little easier for us to find it.
subcategory_links = []
for link in soup.find_all('a', class_='itemMenuName'):
subcategory_links.append(link.get('href'))
Here we iterate over soup.find_all('a', class_='itemMenuName'). When you search for classes in BeautifulSoup, you can just search for part of the class name. This is helpful to us in this case since the class name varies from itemMenuName level1 to itemMenuName level2. These elements have the link inside of them already, so we just extract the contents of the href that holds the URL with link.get('href') and append it to our list of subcategory links.

How to identify individual controls within a search result using Selenium Webdriver?

I have a list of search results in the following link and would like to know on how can I identify the individual controls using dynamic xpath
http://www.bigbasket.com/cl/fruits-vegetables/?nc=nb
I'm able to get the list of product names displayed using the below line
List<WebElement> productResults = browser.findElements(By.xpath("//*[contains(#id,'product')]/div[2]/span[2]/a"));
I'm able to print the product names displayed in Page 1 using the below code, but however the list size is not matching with the list of results displayed in Page 1 so which I see blank lines in between when printing
System.out.println(productResults.size());
for(int i=0;i<productResults.size();i++){
System.out.println(productResults.get(i).getText());
}
Also I tried to locate the individual controls such as Qty text box, Add button in a similar like how I located the product names but the list count is not matching so which I cannot specify the quantity, add the required product to the cart.
Could you please help me with this one?

The first step is get only the visible itens (that is displayed), sou you can use this xpath:
"//*[contains(#id,'product')][not(contains(#style,'display:none'))]/div[2]/span[2]/a"
Now, you need to return the main iten div, that allows you to acess other functions. You can get the tag parents in this way:
"//*[contains(#id,'product')][not(contains(#style,'display:none'))]/div[2]/span[2]/a/../../.."
The elements that you recieve in this last XPath have all html itens that you want, as set quantity, select the dropdown etc. You can acess each using a findElement() in each IWebElement of the list. Example:
List<WebElement> productResults = browser.findElements(By.xpath("//*[contains(#id,'product')][not(contains(#style,'display:none'))]/div[2]/span[2]/a/../../.."));
for(WebElement element : productResults ){
IWebElement quantityInput = element.findElement(By.XPath("//input[contains(#id, '_qty')]"));
string quantityValue = quantityInput.getAttribute("value"); // if you want to know the current value. YOu can also parse it in an int
IWebElement addButton = element.findElement(By.XPath("//a[contains(#class, 'add-button')]"));
// etc to all elements inside element.
// Remember: Element is yout complete card of the item, that contains Value, name, image, buttons and all it.
}
Sorry for some Java syntax error. I am not a Java developer / tester. My piece of cake is C#.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

BS4 Beautiful Soup extract text from find_all - beautifulsoup

Assuming content is not dynamically added, which it appears it is not, I would use .text to extract from elements returned by using select prices = [item.text for item in soup.select('li.price')]

find_all() returns list of element.You need to iterate this to get each element and then get the text of the element. prices = soup.find_all("li", class_="price", text=True) for price in prices: print(price.text)

Related

HREF Class changing on every page

XPath selector returns empty list

How to identify element by text when text is spanned over multiple lines using Selenium

Scraping categories and subcategories using beautifulsoup

How to identify individual controls within a search result using Selenium Webdriver?

Categories

Resources