In Scrapy CSS selectors, how do I get a single string ' ' instead of a list [ ]?

I can't figure out how to get a string out of a selector
I've tried
response.css('.size_list a::text').extract()
I get
['L', '1X', '2X', '3X', '4X', '5X']
Here is the HTML:
<span class="size_list">
<a href="javascript:void(0)" class="itemAttr current" title="L" data-value="L">L</a>
<a href="javascript:void(0)" class="itemAttr" title="1X" data-value="1X">1X</a>
<a href="javascript:void(0)" class="itemAttr" title="2X" data-value="2X">2X</a>
<a href="javascript:void(0)" class="itemAttr" title="3X" data-value="3X">3X</a>
<a href="javascript:void(0)" class="itemAttr" title="4X" data-value="4X">4X</a>
<a href="javascript:void(0)" class="itemAttr" title="5X" data-value="5X">5X</a>
</span>
What I want is "'L', '1X', '2X', '3X', '4X', '5X'"

This is not a job for the extraction code; do it with regular Python once you have the extracted data:
>>> extracted_data = ['L', '1X', '2X', '3X', '4X', '5X']
>>> ', '.join("'%s'" % value for value in extracted_data)
"'L', '1X', '2X', '3X', '4X', '5X'"

Not sure if it's possible to do it directly in the selector. An alternative could be to get it first as a list and to transform it into a string with something like this:
size_list = response.css('.size_list a::text').extract()
string_size_list = ', '.join(size_list)

To get only the first matching element:
response.css('.size_list a::text').extract_first()
# or
response.css('.size_list a::text').get()
This should work:
item_list = response.css('.size_list a::text').extract()
one_string = ', '.join(item_list)  # this works
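The answers above join the list in two slightly different ways, and the results are not the same string. A minimal sketch in plain Python, assuming the list is exactly what .extract() returned in the question (the names sizes, plain, and quoted are only for illustration):

```python
# The list that .extract() returned in the question.
sizes = ['L', '1X', '2X', '3X', '4X', '5X']

# Plain join: the items separated by ", " with no quotes around them.
plain = ', '.join(sizes)

# Quoted join: repr() wraps each item in single quotes, which is the
# form the question asked for.
quoted = ', '.join(repr(size) for size in sizes)

print(plain)   # L, 1X, 2X, 3X, 4X, 5X
print(quoted)  # 'L', '1X', '2X', '3X', '4X', '5X'
```

Which one to use depends on whether the quotes are really needed in the output or were just how the question wrote the string.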

Related

Scraping data from p tags with Selenium

I searched a lot online but couldn't find an example similar to the one below. I'm trying to pull text from a web page. The first p tag has no location line; the second p tag does. When pulling the data, I can only get the contents of the p tag that has the location row, not the other one. How can I pull the data inside both the first and the second p tag?
HTML codes of Page Source:
<div class=" col-md-8">
<p>
<i class='fa fa-home main-color'></i> ORHAN MAH.İBRAHİM CAD. NO:35
<br>
<i class='fa fa-phone main-color'></i>
<a class="gri" href="tel:0508-2920344">0508-2920344 </a>
<br />
<i class='fa fa-clock-o main-color'></i>
<span class="red">19.01.2022</span>
</p>
<p>
<i class='fa fa-home main-color'></i> HAZAN MAH.ÖKTEM CAD. NO:13/B
<br>
<i class='fa fa-phone main-color'></i>
<a class="gri" href="tel:0584 837 23 70">0584 837 23 70 </a>
<br>
<i class="fa fa-map-marker main-color"></i>
<a class="gri" href="https://www.google.com/maps?q=35.554433,25.887766" target="_blank">Haritada</a>
<br />
<i class='fa fa-clock-o main-color'></i>
<span class="red">20.01.2022</span>
</p>
</div>
Here is the selenium code I used to pull the data from the HTML source above:
item = browser.find_elements_by_class_name("col-md-10")
urls = browser.find_elements_by_xpath("//div[@class=' col-md-10']/p/a[2]")
for i in zip(item,urls):
    try:
        address = i[0].find_element_by_css_selector("p").text.split("\n")[:2]
    except:
        address = None
    try:
        phone = i[0].find_element_by_xpath("//a[@class='gri'][1]").text
    except:
        phone = None
    print(address)
    print(phone)
    try:
        url = i[1].get_attribute('href').replace("https://www.google.com/maps?q=","")
    except:
        url = None
    try:
        date = i[0].find_element_by_xpath("//span[@class='red'][1]").text
    except:
        date = None
    print(url)
    print(date)
Use the XPath //div[@class=' col-md-8']/p. This returns both p tags.
You can then loop over the p tags and apply whatever string operations you need to each one.
The first p tag block has no location section; the second one does. For the first block, I want to print None in place of the location. When I try to pull the data with zip_longest, the location is not pulled correctly.
#1.p tag block
ORHAN MAH.İBRAHİM CAD. NO:35
0508-2920344
19.01.2022
#2.p tag block
HAZAN MAH.ÖKTEM CAD. NO:13/B
0584 837 23 70
Haritada
20.01.2022
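One way to get None for a missing location is to treat each p tag as its own record and look up every part relative to that p, rather than zipping two separately collected lists. The sketch below is only an illustration of that idea: it uses Python's standard-library xml.etree.ElementTree on a simplified, well-formed copy of the markup (an assumption; the real page would be read through Selenium, and the extra icons are omitted), and PAGE, MAPS_PREFIX, and parse_block are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the markup from the question (assumption:
# <br> tags are closed so the stdlib XML parser accepts it).
PAGE = """
<div class=" col-md-8">
  <p>
    <i class="fa fa-home main-color"></i> ORHAN MAH.İBRAHİM CAD. NO:35
    <br/>
    <a class="gri" href="tel:0508-2920344">0508-2920344 </a>
    <br/>
    <span class="red">19.01.2022</span>
  </p>
  <p>
    <i class="fa fa-home main-color"></i> HAZAN MAH.ÖKTEM CAD. NO:13/B
    <br/>
    <a class="gri" href="tel:0584 837 23 70">0584 837 23 70 </a>
    <br/>
    <a class="gri" href="https://www.google.com/maps?q=35.554433,25.887766">Haritada</a>
    <br/>
    <span class="red">20.01.2022</span>
  </p>
</div>
"""

MAPS_PREFIX = "https://www.google.com/maps?q="


def parse_block(p):
    """Build one record from a single <p>, using None for missing parts."""
    # The address is the text that follows the first <i> icon.
    icon = p.find("./i")
    address = icon.tail.strip() if icon is not None and icon.tail else None

    phone = None
    location = None
    # Relative path: only the links inside this <p> are inspected.
    for link in p.findall("./a"):
        href = link.get("href", "")
        if href.startswith("tel:"):
            phone = (link.text or "").strip()
        elif href.startswith(MAPS_PREFIX):
            location = href[len(MAPS_PREFIX):]

    date = p.findtext("./span[@class='red']")
    return {"address": address, "phone": phone,
            "location": location, "date": date}


records = [parse_block(p) for p in ET.fromstring(PAGE).findall("./p")]
for record in records:
    print(record)
```

In Selenium the same approach would presumably be a loop over the p elements with a relative lookup such as p.find_elements(By.XPATH, ".//a[contains(@href, 'maps')]"), which returns an empty list when the link is absent, so no try/except is needed for the missing location.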

How can I get a value from an attribute inside a tag

I have a soup object like:
<a class="love-action js-add-to-favorites" data-id="415953" data-price="715.00" href="#">
</a>
I did
soup = BeautifulSoup(src, 'lxml') # pass the variable to BeautifulSoup
price = soup.find(class_='col-5 col-sm-4 col-lg-7 mob-position detail-top-actions').find('a', class_='love-action js-add-to-favorites')
print(price)
I'd like to get only: 715.00
How to fix?
You can access a tag's attributes by treating it like a dictionary, so get the value of the data-price attribute with:
price['data-price']
Example based on your question
soup = BeautifulSoup(src, 'lxml') # pass the variable to BeautifulSoup
price = soup.find(class_='col-5 col-sm-4 col-lg-7 mob-position detail-top-actions').find('a', class_='love-action js-add-to-favorites')
print(price['data-price'])
Output
715.00

Beautiful Soup - How to find tags after a specific item in HTML?

I need to find tags after a specific item on a website. So, is there a way to skip the tag objects until this specific one, then find the matching ones to given criteria? I need all p with class XYZ after the div with class ABC.
response = requests.get(url).text
soup = BeautifulSoup(response)
items = soup.find_all('p', {'class': 'MessageTextSize js-message-text message-text'}) # only return the ones after the div with class of "Text 2"
Edit: Below is a sample of the response. The aim is to find the last two paragraphs (Text 3 and Text 4), even though the first one (Text 1) has the same p class. So I need find_all to match that class (MessageTextSize js-message-text message-text) only after Text 2.
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 1</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize MessageTextSize--jumbo js-message-text message-text" data-aria-label-part="0">Text 2</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 3</p>
</div>
<div class="js-message-text-container">
<p class="MessageTextSize js-message-text message-text" data-aria-label-part="0">Text 4</p>
</div>
p.s. bs4 version is 4.8.1, which is the latest release.
You can always use a custom function (or a lambda expression) inside find_all. The following is self-explanatory (IMO).
result = soup.find_all(
    lambda x: x.name == 'p' and
    'XYZ' in x.get('class', '') and
    x.find_previous('div', class_='ABC')
)
Example
from bs4 import BeautifulSoup
html = """
<p class="XYZ">Text 1</p>
<p class="XYZ">Text 2</p>
<div class="ABC"></div>
<p class="XYZ">Text 3</p>
<p class="XYZ">Text 4</p>
"""
soup = BeautifulSoup(html, 'html.parser')
result = soup.find_all(
    lambda x: x.name == 'p' and
    'XYZ' in x.get('class', '') and
    x.find_previous('div', class_='ABC')
)
print(result)
Output
[<p class="XYZ">Text 3</p>, <p class="XYZ">Text 4</p>]
EDIT
MessageTextSize js-message-text message-text represents three classes, not one.
x.get('class', '') returns a list of classes -
['MessageTextSize', 'js-message-text', 'message-text']
In your particular case, you have to target a p tag not a div, if I understood correctly.
So, you have to use
result = soup.find_all(
    lambda x: x.name == 'p' and
    'MessageTextSize js-message-text message-text' in ' '.join(x.get('class', ''))
    and x.find_previous('p', class_='MessageTextSize MessageTextSize--jumbo js-message-text message-text')
)
Ref:
find_previous()
Function as filter
If I understand you correctly, this should work:
item = soup.select_one('p[class*="MessageTextSize--jumbo"]')
sibs = item.parent.find_next_siblings()
for sib in sibs:
    print(sib.text.strip())
Output:
Text 3
Text 4

How to select specific text to scrape

I'm trying to scrape the following HTML. I want to get just the Some Header part and not the additional info.
<li class="media">
<div class="media-body">
<h4> Some Header <span class="label label-info"> additional Info </span> </h4> Address info
<br>
</div> </li>
I'm trying the following:
val li: Elements = ul.select("li")
val list: Elements = li.select("a")
val headers: Elements = list.select("h4")
Then, when I try to get the inner text via headers.text(), I get both Some Header and additional Info.
How can I only scrape the Some Header part?
You are close to the solution. You are probably looking for ownText:
String s = "<li class=\"media\"> \n" +
" <div class=\"media-body\"> \n" +
" <h4> Some Header <span class=\"label label-info\"> additional Info </span> </h4> Address info\n" +
" <br> \n" +
" </div> </li>";
Document document = Jsoup.parse(s);
Elements items = document.select("li");
// The sample markup has no <a> element, so select the <h4> directly inside the <li>.
System.out.println(items.select("h4").first().ownText());
Output:
Some Header

Scrapy CSV export shows the same data in all rows

I'm trying to scrape the following html code:
<ul class="results-list" id="search-results">
<li>
<h3 class="name">First John</h3>
<div class="details">
email
<span class="phone">999999999</span>
</div>
</li>
<li>
<h3 class="name">Second John</h3>
<div class="details">
email
<span class="phone">999999999</span>
</div>
</li>
</ul>
When I run my spider, I get 2 rows, containing the same information. I have name,email,phone columns and for example in the name column for both I would get:
First John,Second John.
My Scrapy code is the following:
people = response.xpath('//ul[@class="results-list"]/li')
for person in people:
    item = SpiderItem()
    item['Name'] = person.xpath(
        '//h3/text()').extract()
    item['Email'] = person.xpath(
        '//div[@class="details"]/a/@href').extract()
    item['Phone'] = person.xpath(
        '//div[@class="details"]/span[@class="phone"]/text()').extract()
    yield item
However when I run scrapy crawl MySpider -o output.csv I get the same information in all rows.
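You are using absolute paths in your XPath expressions. An expression that starts with // searches the whole document, even when it is called on person. Make the expressions relative with a leading dot: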
for person in people:
    item = SpiderItem()
    item['Name'] = person.xpath(
        './/h3/text()').extract_first()
    item['Email'] = person.xpath(
        './/div[@class="details"]/a/@href').extract_first()
    item['Phone'] = person.xpath(
        './/div[@class="details"]/span[@class="phone"]/text()').extract_first()
    yield item
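The effect of the leading dot can be seen without running a spider. The sketch below is an illustration only, not Scrapy code: it uses Python's standard-library xml.etree.ElementTree on the markup from the question, and because ElementTree rejects absolute paths on elements, the document-wide search is imitated by querying the root element. The names HTML, document_wide, and relative are hypothetical.

```python
import xml.etree.ElementTree as ET

# Markup from the question (already well-formed, so the stdlib parser accepts it).
HTML = """
<ul class="results-list" id="search-results">
  <li>
    <h3 class="name">First John</h3>
    <div class="details">
      email
      <span class="phone">999999999</span>
    </div>
  </li>
  <li>
    <h3 class="name">Second John</h3>
    <div class="details">
      email
      <span class="phone">999999999</span>
    </div>
  </li>
</ul>
"""

root = ET.fromstring(HTML)

document_wide = []  # one entry per row when the query ignores `person`
relative = []       # one entry per row when the query starts at `person`
for person in root.findall("./li"):
    # Searching from the root imitates '//h3/text()': every row sees every name.
    document_wide.append([h3.text for h3 in root.findall(".//h3")])
    # Searching from the current <li> imitates './/h3/text()': one name per row.
    relative.append(person.findtext(".//h3"))

print(document_wide)
print(relative)
```

The first list repeats both names in every row, which matches the duplicated CSV rows described in the question; the second list has one name per row.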