Capture links in scrapy using regex as selector - scrapy

<svg version="1.1" id="Calque_1" xmlns="&ns_svg;" xmlns:xlink="&ns_xlink;" width="700" height="700" viewBox="0 0 300 300" overflow="visible" enable-background="new 0 0 300 300" xml:space="preserve">
<a xlink:href="https://www.pros-locations-de-voitures.fr/location-de-voiture-ain-01/" onmouseover="TipFunction('Ain')" onmouseout="TipFunction('')"><path id="Z1" title="Ain" d="M237.125,152.725l-1.7-1l-2.4,3.3l-2.7,1.6l-2,0.1l-0.2-1.4l-1.6-0.8l-2,2.2l-1.5,0.1v-1.5h-1.5l-2.1-3.9 l-2.5-1.6l-2.7,0.6l-2.9-0.8l-2.9,10.5l-0.8,4l1.5,4.6l1.5-0.3l1.8,2.9l3.2-0.3l3,1l1.5-2.5l1.4-0.4l5.6,7.6l2.9-3.3l1.1-6.8 l-0.4-4.7h1.5l1.3-1.4h-0.1l0.3-2.6l2.8-1.7L237.125,152.725z" fill="red" stroke="#EEEEEE" stroke-width="0.9"></path> </a>
<a xlink:href="https://www.pros-locations-de-voitures.fr/location-de-voiture-aisne-02/" onmouseover="TipFunction('Aisne')" onmouseout="TipFunction('')"><path id="Z2" title="Aisne" d="M179.025,42.325l-6.3,0.4l-0.2,1.8l-1.9,4.1l1.1,3.5l0.2,5.1l-0.3,2.2l1.1,0.9l-1.3,0.6l-1.2,2.8l-1.3,0.8 l1.4,2.3l-1.5-0.1l0.4,1.5l1.2-0.8l1.4,0.6l0.3,1.4l-1.1,0.8l1.3,0.4l0.9,1.2l-0.3,1.4l1.9,2.1l4.7,3l3.8-5.1l-1.3-0.6l0.5-1.4 l-0.8-1.2l2.7-1.1l-1.6-4l0.6-1.4l4-2l2.7,1l0.4-1.5l-0.1-7.1l1.4-0.1l2.5-3.6l-0.7-1.6l0.7-1.7l-0.4-2.9h-0.2l-1.8-0.6v-0.1 l-7.8-2.1l-2.6,0.9l-1.2-0.9L179.025,42.325z " fill="#094353" stroke="#EEEEEE" stroke-width="0.9"></path> </a>
When I test the regex pattern on its own, it works and matches the links, but when I apply it in code it returns an empty list.
import scrapy
class scraper(scrapy.Spider):
    name = "scraper"
    start_urls = ["https://www.pros-locations-de-voitures.fr/"]
    def parse(self, response):
        yield {
            'Links' : response.selector.re('(?<=xlink:href=").*?(?=")')
        }

The data you are looking for is loaded via JavaScript, so to access it you will have to pre-render the page using scrapy-splash, Selenium, or scrapy-playwright. You can then use the XPath selector below to obtain the URLs; there is no need for a regex in this case.
response.xpath("//*/@*[name()='xlink:href']").getall()
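Once the rendered markup is available, the attribute can be read directly. As a rough offline sketch, assuming a trimmed SVG fragment in which the XLink namespace is declared with its standard URI (the real page declares it through an entity, and the paths are shortened here for illustration), Python's standard xml.etree.ElementTree can collect the xlink:href values without any regex. This swaps Scrapy's selector for the standard library purely so the snippet runs on its own:

```python
import xml.etree.ElementTree as ET

# Standard XLink namespace URI; ElementTree exposes namespaced attributes
# under the key "{namespace-uri}localname".
XLINK_NS = "http://www.w3.org/1999/xlink"

# Simplified, hypothetical fragment modelled on the SVG map in the question.
svg = """
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
  <a xlink:href="https://www.pros-locations-de-voitures.fr/location-de-voiture-ain-01/"><path id="Z1"/></a>
  <a xlink:href="https://www.pros-locations-de-voitures.fr/location-de-voiture-aisne-02/"><path id="Z2"/></a>
</svg>
""".strip()

root = ET.fromstring(svg)
href_key = f"{{{XLINK_NS}}}href"

# Walk every element and keep the ones that carry an xlink:href attribute.
links = [el.get(href_key) for el in root.iter() if el.get(href_key) is not None]
print(links)
```

In a spider, the same idea is what the XPath above expresses: select the attribute by name instead of pattern-matching the raw text.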

Related

Can BeautifulSoup locate an element based on its contained text? [duplicate]

Observe the following problem:
import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""")
# This returns the <a> element
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
# This returns None
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*")
)
For some reason, BeautifulSoup will not match the text, when the <i> tag is there as well. Finding the tag and showing its text produces
>>> a2 = soup.find(
...     'a',
...     href="/customer-menu/1/accounts/1/update"
... )
>>> print(repr(a2.text))
'\n Edit\n'
Right. According to the Docs, soup uses the match function of the regular expression, not the search function. So I need to provide the DOTALL flag:
pattern = re.compile('.*Edit.*')
pattern.match('\n Edit\n') # Returns None
pattern = re.compile('.*Edit.*', flags=re.DOTALL)
pattern.match('\n Edit\n') # Returns MatchObject
Alright. Looks good. Let's try it with soup
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
soup.find(
    'a',
    href="/customer-menu/1/accounts/1/update",
    text=re.compile(".*Edit.*", flags=re.DOTALL)
) # Still returns None... Why?!
Edit
My solution, based on geckon's answer: I implemented these helpers:
import re
MATCH_ALL = r'.*'
def like(string):
    """
    Return a compiled regular expression that matches the given
    string with any prefix and postfix, e.g. if string = "hello",
    the returned regex matches r".*hello.*"
    """
    string_ = string
    if not isinstance(string_, str):
        string_ = str(string_)
    regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
    return re.compile(regex, flags=re.DOTALL)
def find_by_text(soup, text, tag, **kwargs):
    """
    Find the tag in soup that matches all provided kwargs, and contains the
    text.
    If no match is found, return None.
    If more than one match is found, raise ValueError.
    """
    elements = soup.find_all(tag, **kwargs)
    matches = []
    for element in elements:
        if element.find(text=like(text)):
            matches.append(element)
    if len(matches) > 1:
        # Tags are not strings, so convert them before joining
        raise ValueError("Too many matches:\n" + "\n".join(str(m) for m in matches))
    elif len(matches) == 0:
        return None
    else:
        return matches[0]
Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update')
The problem is that your <a> tag with the <i> tag inside doesn't have the string attribute you expect it to have. First, let's take a look at what the text="" argument of find() does.
NOTE: text is the old name of this argument; since BeautifulSoup 4.4.0 it is called string.
From the docs:
Although string is for finding strings, you can combine it with
arguments that find tags: Beautiful Soup will find all tags whose
.string matches your value for string. This code finds the tags
whose .string is “Elsie”:
soup.find_all("a", string="Elsie")
# [Elsie]
Now let's take a look what Tag's string attribute is (from the docs again):
If a tag has only one child, and that child is a NavigableString, the
child is made available as .string:
title_tag.string
# u'The Dormouse's story'
(...)
If a tag contains more than one thing, then it’s not clear what
.string should refer to, so .string is defined to be None:
print(soup.html.string)
# None
This is exactly your case. Your <a> tag contains both text and an <i> tag, so its .string is None. find() therefore has nothing to match your pattern against and returns None.
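The difference is easy to check directly. The snippet below is a minimal sketch, assuming the built-in html.parser and two single-line variants of the markup from the question (the /x href is a placeholder); it prints .string for a tag with one text child and for a tag that also holds an <i> child:

```python
from bs4 import BeautifulSoup

# One child (a NavigableString): .string is that string.
simple = BeautifulSoup('<a href="/x">Edit</a>', "html.parser").a

# Two children (an <i> tag and a string): .string is None.
mixed = BeautifulSoup(
    '<a href="/x"><i class="fa fa-edit"></i> Edit</a>', "html.parser"
).a

print(repr(simple.string))     # the single text child
print(repr(mixed.string))      # None, because there is more than one child
print(repr(mixed.get_text()))  # all descendant text joined together
```

get_text() (or .text) still sees the word, which is why the function-based answers further down work where string= does not.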
How to solve this?
Maybe there is a better solution but I would probably go with something like this:
import re
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
""")
links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")
thelink = None  # stays None if no link contains the text
for link in links:
    if link.find(text=re.compile("Edit")):
        thelink = link
        break
print(thelink)
I think there are not too many links pointing to /customer-menu/1/accounts/1/update so it should be fast enough.
In one line, using a lambda:
soup.find(lambda tag: tag.name == "a" and "Edit" in tag.text)
You can pass .find a function that returns True if the tag's text contains "Edit":
In [51]: def Edit_in_text(tag):
   ....:     return tag.name == 'a' and 'Edit' in tag.text
   ....:
In [52]: soup.find(Edit_in_text, href="/customer-menu/1/accounts/1/update")
Out[52]:
<a href="/customer-menu/1/accounts/1/update">
<i class="fa fa-edit"></i> Edit
</a>
EDIT:
You can use the .get_text() method instead of .text in your function, which gives the same result:
def Edit_in_text(tag):
    return tag.name == 'a' and 'Edit' in tag.get_text()
With soupsieve 2.1.0 or later, you can use the :-soup-contains CSS pseudo-class selector to target a node's text. This replaces the deprecated :contains() form.
from bs4 import BeautifulSoup as BS
soup = BS("""
<a href="/customer-menu/1/accounts/1/update">
Edit
</a>
""")
single = soup.select_one('a:-soup-contains("Edit")').text.strip()
multiple = [i.text.strip() for i in soup.select('a:-soup-contains("Edit")')]
print(single, '\n', multiple)
Method 1: Matching the tag's string (works only when the tag has a single text child)
import re
pattern = re.compile('Edit')
a2 = soup.find('a', string=pattern)
Method 2: Using a lambda to check the text of every element
a2 = soup.find(lambda tag: tag.name == "a" and "Edit" in tag.text)
Good Luck

compare the 'class' of container tag

Let's say I extract some classes from some HTML:
p_standards = soup.find_all("p",attrs={'class':re.compile(r"Standard|P3")})
for p_standard in p_standards:
    print(p_standard)
And the output looks like this:
<p class="P3">a</p>
<p class="Standard">b</p>
<p class="P3">c</p>
<p class="Standard">d</p>
And let's say I only wanted to print the text inside the P3 classes so that the output looks like:
a
c
I thought this code below would work, but it didn't. How can I compare the class name of the container tag to some value?
p_standards = soup.find_all("p",attrs={'class':re.compile(r"Standard|P3")})
for p_standard in p_standards:
    if p_standard.get("class") == "P3":
        print(p_standard.get_text())
I'm aware that in my first line, I could have simply done r"P3" instead of r"Standard|P3", but this is only a small fraction of the actual code (not the full story), and I need to leave that first line as it is.
Note: doing something like .find("p", class_ = "P3") only works for descendants, not for the container tag.
OK, so after playing around with the code, it turns out that
p_standard.get("class")[0] == "P3"
works. (I was missing the [0])
So this code works:
p_standards = soup.find_all("p",attrs={'class':re.compile(r"Standard|P3")})
for p_standard in p_standards:
    if p_standard.get("class")[0] == "P3":
        print(p_standard.get_text())
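The [0] is needed because BeautifulSoup treats class as a multi-valued attribute and returns a list, so comparing it to a plain string is always False. Indexing works only when the class of interest happens to come first. A slightly more defensive sketch, assuming the built-in html.parser and a made-up third paragraph that carries two classes, tests membership instead:

```python
import re

from bs4 import BeautifulSoup

# The last <p> is hypothetical: it shows a tag whose first class is not "P3".
html = '<p class="P3">a</p><p class="Standard">b</p><p class="note P3">c</p>'
soup = BeautifulSoup(html, "html.parser")

p_standards = soup.find_all("p", attrs={"class": re.compile(r"Standard|P3")})

# tag.get("class") is a list such as ['note', 'P3'], so use "in" rather than ==.
texts = [p.get_text() for p in p_standards if "P3" in p.get("class", [])]
print(texts)
```

Passing a default of [] to get() keeps the check safe for tags that have no class attribute at all.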
I think the following is more efficient. Use select with the CSS "or" (comma) syntax to gather a list of elements matching either class.
from bs4 import BeautifulSoup as bs
html = '''
<html>
<head></head>
<body>
<p class="P3">a</p>
<p class="Standard">b</p>
<p class="P3">c</p>
<p class="Standard">d</p>
</body>
</html>
'''
soup = bs(html, 'lxml')
p_standards = soup.select('.Standard,.P3')
for p_standard in p_standards:
    if 'P3' in p_standard['class']:
        print(p_standard.text)

Getting width/attributes out of a tag in Beautifulsoup instead of text

The BeautifulSoup documentation I can find talks about finding a specific tag using id, class, etc., but it doesn't talk about how to extract data from within the tag itself rather than the text it surrounds.
My issue:
<img src=yellowbar.png width=63.94 height=10><img src=redbar.png width=36.0632181423 height=10><br />
Power:</b> 1480 / 1480<br />
<img src=yellowbar.png width=100 height=10><img src=redbar.png width=0 height=10><br />
I have this HTML. There are around 20 tags in total on the page, of which 3 have src=yellowbar.png.
My goal is to select the second one and get its width back. So I am guessing it would go:
Find tags -> find src=yellowbar.png -> select second one -> print width back.
How would I go about this?
So far I've managed to print a list of all tags.
soup = BeautifulSoup(element, "lxml")
tag = soup.find_all('img')
print(tag)
which returns
[<img height="10" src="yellowbar.png" width="77"/>, <img height="10" src="redbar.png" width="0"/>]
If I understand your question correctly, this should solve your issue.
from bs4 import BeautifulSoup
content = """
<img src=yellowbar.png width=63.94 height=10><img src=redbar.png width=36.0632181423 height=10><br />
Power:</b> 1480 / 1480<br />
<img src=yellowbar.png width=100 height=10><img src=redbar.png width=0 height=10><br />
"""
soup = BeautifulSoup(content,"lxml")
for tags in soup.find_all("img",{"src":"yellowbar.png"}): #use the attributes as well to specify the item you look for
    print(tags['width']) #access the value using attribute
Output:
63.94
100
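The loop above prints every matching width. To pick out only the second yellowbar.png image, as the question asks, one option is to index into the result list. This is a small sketch that reuses the HTML from the question and assumes the built-in html.parser instead of lxml; the variable names are illustrative:

```python
from bs4 import BeautifulSoup

content = """
<img src=yellowbar.png width=63.94 height=10><img src=redbar.png width=36.0632181423 height=10><br />
Power:</b> 1480 / 1480<br />
<img src=yellowbar.png width=100 height=10><img src=redbar.png width=0 height=10><br />
"""

soup = BeautifulSoup(content, "html.parser")

# Filter on the src attribute, then take the second match if there is one.
bars = soup.find_all("img", src="yellowbar.png")
second_width = float(bars[1]["width"]) if len(bars) > 1 else None
print(second_width)
```

Attribute values come back as strings, so the conversion to float is only needed if the width is used as a number.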

How to use scrapy to crawl multiple pages? (two level)

On my site I created two simple pages:
Here is their HTML:
test1.html :
<head>
<title>test1</title>
</head>
<body>
<a href="test2.html" onclick="javascript:return xt_click(this, "C", "1", "Product", "N");" indepth="true">
<span>cool</span></a>
</body></html>
test2.html :
<head>
<title>test2</title>
</head>
<body></body></html>
I want to scrape the text in the title tag of both pages, i.e. "test1" and "test2".
But I am a novice with Scrapy, and I only manage to scrape the first page.
My Scrapy script:
from scrapy.spider import Spider
from scrapy.selector import Selector
from testscrapy1.items import Website
class DmozSpider(Spider):
    name = "bill"
    allowed_domains = ["http://exemple.com"]
    start_urls = [
        "http://www.exemple.com/test1.html"
    ]
    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//head')
        items = []
        for site in sites:
            item = Website()
            item['title'] = site.xpath('//title/text()').extract()
            items.append(item)
        return items
How do I get past the onclick?
And how can I successfully scrape the text of the title tag of the second page?
Thank you in advance
STEF
To use multiple functions in your code, send multiple requests and parse them, you're going to need: 1) yield instead of return, 2) callback.
Example:
import scrapy
def parse(self,response):
    for site in response.xpath('//head'):
        item = Website()
        item['title'] = site.xpath('//title/text()').extract()
        yield item
    # schedule a second request; its response is handled by other_function
    yield scrapy.Request(url="http://www.domain.com", callback=self.other_function)
def other_function(self,response):
    for other_thing in response.xpath('//this_xpath'):
        item = Website()
        item['title'] = other_thing.xpath('//this/and/that').extract()
        yield item
You cannot execute JavaScript with Scrapy, but you can work out what the JavaScript does and do the same: http://doc.scrapy.org/en/latest/topics/firebug.html

Extracting href from attribute with BeautifulSoup

I use this method
allcity = dom.body.findAll(attrs={'id' : re.compile("\d{1,2}")})
to return a list like this:
[<a onmousedown="return c({'fm':'as','F':'77B717EA','F1':'9D73F1E4','F2':'4CA6DE6B','F3':'54E5243F','T':'1279189248','title':this.innerHTML,'url':this.href,'p1':1,'y':'B2D76EFF'})" href="http://www.ylyd.com/showurl.asp?id=6182" target="_blank"><font size="3">掳虏驴碌路驴碌脴虏煤脨脜脧垄脥酶 隆煤 脢脦脝路脦露脕卢陆脫</font></a>,
掳脵露脠驴矛脮脮]
How do I extract this href?
http://www.ylyd.com/showurl.asp?id=6182
Thanks. :)
You can use:
for a in dom.body.findAll(attrs={'id' : re.compile(r"\d{1,2}")}, href=True):
    print(a['href'])
In this example, there's no real need to use a regex; it can be as simple as selecting the <a> tag and then reading its ['href'] attribute, like so:
get_me_url = soup.a['href'] # http://www.ylyd.com/showurl.asp?id=6182
# cached URL
get_me_cached_url = soup.find('a', class_='m')['href']
You can always use prettify() method to better see the HTML code.
from bs4 import BeautifulSoup
string = '''
[
<a href="http://www.ylyd.com/showurl.asp?id=6182" onmousedown="return c({'fm':'as','F':'77B717EA','F1':'9D73F1E4','F2':'4CA6DE6B','F3':'54E5243F','T':'1279189248','title':this.innerHTML,'url':this.href,'p1':1,'y':'B2D76EFF'})" target="_blank">
<font size="3">
掳虏驴碌路驴碌脴虏煤脨脜脧垄脥酶 隆煤 脢脦脝路脦露脕卢陆脫
</font>
</a>
,
<a class="m" href="http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece763105392230e54f728629c86027fa3c215cc791a1b1a23a4fb7935107380843e7000db120afdf14076340920a3de95c81cd2ace52f38fb5023716c914b19c46ea8dc4755d650e34d99aa0ee6cae74596b9a1d6c85523dd58716df7f49c5b7003c065e76445&p=8b2a9403c0934eaf5abfc8385864&user=baidu" target="_blank">
掳脵露脠驴矛脮脮
</a>
]
'''
soup = BeautifulSoup(string, 'html.parser')
href = soup.a['href']
cache_href = soup.find('a', class_='m')['href']
print(f'{href}\n{cache_href}')
# output:
'''
http://www.ylyd.com/showurl.asp?id=6182
http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece763105392230e54f728629c86027fa3c215cc791a1b1a23a4fb7935107380843e7000db120afdf14076340920a3de95c81cd2ace52f38fb5023716c914b19c46ea8dc4755d650e34d99aa0ee6cae74596b9a1d6c85523dd58716df7f49c5b7003c065e76445&p=8b2a9403c0934eaf5abfc8385864&user=baidu
'''
Alternatively, you can do the same thing using Baidu Organic Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Essentially, the main difference in this example is that you don't have to figure out how to grab certain elements since it's already done for the end-user with a JSON output.
Code to grab href/cached href from first page results:
from serpapi import BaiduSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "baidu",
"q": "ylyd"
}
search = BaiduSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    # try/except used since sometimes there's no link/cached link
    try:
        link = result['link']
    except KeyError:
        link = None
    try:
        cached_link = result['cached_page_link']
    except KeyError:
        cached_link = None
    print(f'{link}\n{cached_link}\n')
# Part of the output:
'''
http://www.baidu.com/link?url=7VlSB5iaA1_llQKA3-0eiE8O9sXe4IoZzn0RogiBMCnJHcgoDDYxz2KimQcSDoxK
http://cache.baiducontent.com/c?m=LU3QMzVa1VhvBXthaoh17aUpq4KUpU8MCL3t1k8LqlKPUU9qqZgQInMNxAPNWQDY6pkr-tWwNiQ2O8xfItH5gtqxpmjXRj0m2vEHkxLmsCu&p=882a9646d5891ffc57efc63e57519d&newp=926a8416d9c10ef208e2977d0e4dcd231610db2151d6d5106b82c825d7331b001c3bbfb423291505d3c77e6305a54d5ceaf13673330923a3dda5c91d9fb4c57479c77a&s=c81e728d9d4c2f63&user=baidu&fm=sc&query=ylyd&qid=e42a54720006d857&p1=1
'''
Disclaimer, I work for SerpApi.