Using regular expressions with BeautifulSoup's select_one - beautifulsoup

I have read answers on this website that describe using regular expressions with 'find' queries in BeautifulSoup. However, the answers are less clear about using regular expressions to match multiple tags with 'select_one'.
Specifically I have two tags, shown below.
'#CommitYear14'
'#CommitYear12'
So I just need a query that looks for matches with '#CommitYear'.
My query right now is
college_info = beautiful_soup_parsing.select_one(tag)
where tag is either '#CommitYear14' or '#CommitYear12'. I don't know how to get both 14 and 12.

select_one() takes a CSS selector, so you cannot pass it a regular expression. However, you can use the CSS attribute selector ^=, which matches elements whose attribute value begins with the given string:
data = '''
<div id="CommitYear12">CommitYear12</div>
<div id="CommitYear14">CommitYear14</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('[id^="CommitYear"]'))
Prints:
<div id="CommitYear12">CommitYear12</div>
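Note that select_one() stops at the first match. To get both tags, a minimal sketch (assuming the same two divs plus one unrelated div added for contrast, and using the built-in html.parser instead of lxml) could use select(), or find_all() with a compiled regular expression:

```python
import re
from bs4 import BeautifulSoup

data = '''
<div id="CommitYear12">CommitYear12</div>
<div id="CommitYear14">CommitYear14</div>
<div id="Other">Other</div>'''

# html.parser ships with Python, so lxml is not required for this sketch
soup = BeautifulSoup(data, 'html.parser')

# select() returns every element matching the CSS selector, in document order
by_css = [tag['id'] for tag in soup.select('[id^="CommitYear"]')]

# find_all() accepts a compiled regular expression as an attribute filter
by_regex = [tag['id'] for tag in soup.find_all(id=re.compile(r'^CommitYear\d+$'))]

print(by_css)
print(by_regex)
```

Both lists contain the two CommitYear ids and skip the unrelated div.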

Selenium finding elements returns incorrect elements

I'm using Selenium to try and get some elements on a web page but I'm having trouble getting the ones I want. I'm getting some, but they're not the ones I want.
So what I have on my page are five divs that look like this:
<div class="membershipDetails">
Inside each one is something like this:
<div class="membershipDetail">
<h3>
VIP Membership
</h3>
</div>
They DO all have this same link, but they don't have the same text ('VIP Membership' would be replaced by something else)
So the first thing was to get all the divs above in a list. This is the line I use:
listElementsMembership = driver.find_elements_by_css_selector(div[class^='membershipDetail'])
This gives me five elements, just as I would expect. I checked the 'class' attribute name and they are what I would expect. At this point I should say that they aren't all EXACTLY the same name 'membershipDetail'. Some have variations. But I can see that I have all five.
The next thing is to go through these elements and try and get that element which contains the href ('VIP Membership').
So I did that like this:
for elem in listElementsMembership:
    elemDetailsLink = elem.find_element_by_xpath('//a[contains(@href,"EditMembership")]')
Now this does return something, but it always got me the element from the FIRST of the five elements. It's as if the 'elem.find_element_by_xpath' line is going up a level first before finding these hrefs. I kind of confirmed this by switching this to a 'find_elements_by_xpath' (plural) and getting, you guessed it, five elements.
So is this line:
elemDetailsLink = elem.find_element_by_xpath('//a[contains(@href,"EditMembership")]')
going up a level before getting its results? If it is, how can I make it not do that and just restrict itself to the children?
If you are trying to find an element within another element, start the XPath with a dot (.), as below:
listElementsMembership = driver.find_elements_by_css_selector("div[class^='membershipDetail']")
for elem in listElementsMembership:
    elemDetailsLink = elem.find_element_by_xpath('.//a')  # Finds the "a" tag relative to "elem"
Suppose you are looking for VIP Membership:
listElementsMembership = driver.find_elements_by_css_selector("div[class^='membershipDetail']")
for elem in listElementsMembership:
    value = elem.find_element_by_xpath('.//a').get_attribute("innerText")
    if "VIP Membership" in value:
        print(value)
And if you don't want to iterate over all five elements, try an XPath like one of the following (based on the HTML you have shared):
//div[@class='membershipDetail']//a[text()='VIP Membership']
Or
//div[@class='membershipDetail']//a[contains(text(),'VIP Membership')]
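The effect of the leading dot can be checked without a browser. Here is a rough sketch that uses the standard-library ElementTree as a stand-in for Selenium (it supports only a subset of XPath, and the markup below is made up for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two membership blocks, each with its own link
html = """
<body>
  <div class="membershipDetail"><h3><a href="/EditMembership/1">VIP Membership</a></h3></div>
  <div class="membershipDetail"><h3><a href="/EditMembership/2">Basic Membership</a></h3></div>
</body>
""".strip()
root = ET.fromstring(html)

containers = root.findall(".//div[@class='membershipDetail']")

# './/a' is evaluated relative to each container, so every iteration
# yields that container's own link rather than the first link on the page
links = [elem.find(".//a") for elem in containers]
texts = [a.text for a in links]
hrefs = [a.get("href") for a in links]

print(texts)
print(hrefs)
```

Each container yields its own link, which is the behaviour you want from elem.find_element_by_xpath('.//a') in Selenium.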
You have a few mistakes in that CSS selector.
The quotes are missing.
^ means starts-with; I'm not sure you really need that. For partial matching, use * instead of ^.
Also, I do not see any logic in your code attempt for the statement below.
The next thing is to go through these elements and try and get that
element which contains the href ('VIP Membership').
Code :
from selenium.webdriver.common.by import By
listElementsMembership = driver.find_elements_by_css_selector("div[class*='membershipDetail']")
for ele in listElementsMembership:
    e = ele.find_element(By.XPATH, ".//descendant::a")
    if "VIP Membership" in e.get_attribute('href'):
        print(e.text, e.get_attribute('href'))
You can give an index using a square bracket like this.
elemDetailsLink = elem.find_element_by_xpath('(//a[contains(@href,"EditMembership")])[1]')
If you are trying to get an element using XPath, the index should start with 1, not 0.

Python BeautifulSoup get text from class

How can I get the text "Lionel Messi" from this HTML code?
<a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a>
This is my code so far:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
page = requests.get('https://www.futbin.com/players')
soup = BeautifulSoup(page.content, 'lxml')
pool = soup.find(id='repTb')
player_names = pool.find_all(class_='player_name_players_table')
print(player_names[0])
When I print player_names I get this result:
/Users/ejps/PycharmProjects/scraper_players/venv/bin/python /Users/ejps/PycharmProjects/scraper_players/scraper.py
<a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a>
Process finished with exit code 0
But what code would I have to add to get only the text?
I want to scrape all player names from that page. But first I think I need to find a way to extract that text.
Unfortunately I can't find a way to make it work.
I am new to Python and am trying some projects to learn it.
EDIT:
With the help from comments I was able to get the text I need.
I only have one more question here.
Is it possible to find class_ by partial text only?
Like this:
prating = pool.find_all(class_='form rating ut20')
The full class would be
class="form rating ut20 toty gold rare"
but it changes. The part that is always the same is "form rating ut20", so I thought maybe there is some kind of placeholder that lets me search for all class names including "form rating ut20".
Could you maybe help me with this as well?
To select a specific class you can use either a regular expression or, if you have bs4 version 4.7.1 or above installed, a CSS selector.
Using a regular expression returns a list of elements.
import re
prating = pool.find_all(class_=re.compile("form rating ut20"))
Using a CSS selector also returns a list of elements. The first selector below means contains and the second means starts-with.
prating = pool.select('[class*="form rating ut20"]')
OR
prating = pool.select('[class^="form rating ut20"]')
Get text using the getText() method.
player_names[0].getText()
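Putting both parts together, here is a small sketch against made-up markup (the second player row and the span elements are assumptions, and the real page may differ). Besides the substring selectors above, a compound class selector such as .form.rating.ut20 matches elements that carry all three classes, whatever other classes follow and in whatever order:

```python
from bs4 import BeautifulSoup

# Hypothetical rows; the real page's markup may differ
html = '''
<table id="repTb">
  <tr><td><a class="player_name_players_table" href="/20/player/44079/lionel-messi">Lionel Messi</a></td>
      <td><span class="form rating ut20 toty gold rare">99</span></td></tr>
  <tr><td><a class="player_name_players_table" href="/20/player/1/example-player">Example Player</a></td>
      <td><span class="form rating ut20 gold rare">90</span></td></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')
pool = soup.find(id='repTb')

# get_text() returns only the text inside each tag
names = [a.get_text(strip=True)
         for a in pool.find_all(class_='player_name_players_table')]

# Compound class selector: the element must have all three classes
ratings = [s.get_text(strip=True) for s in pool.select('.form.rating.ut20')]

print(names)
print(ratings)
```

Looping over find_all() this way gives you every player name on the page instead of only player_names[0].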

Why does this Python 2.7 code produce no output?

This is an example from a python book. When I run it I don't get any output. Can someone help me? Thanks!!!
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
text = urlopen('https://python.org/community/jobs').read()
soup = BeautifulSoup(text)
jobs = set()
for header in soup('h3'):
    links = header('a', 'reference')
    if not links: continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))
    print jobs.add('%s (%s)' % (link.string, link['href']))
print '\n'.join(sorted(jobs, key=lambda s: s.lower()))
Edit:
At first I only considered that the URL might be wrong and overlooked that the HTML I wanted to extract did not exist. Maybe this is why I get empty output.
If you open the page and inspect the html you'll notice there are no <h3> tags containing links. This is why you have no output.
So if not links: continue always continues.
This is probably because the page has moved to https://www.python.org/jobs/ so the <h3> tags containing links on the page are no longer present.
If you point this code's URL to the new page, I'd suggest taking some time to familiarize yourself with the page source. For instance, it uses <h2> instead of <h3> tags for its links.
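For reference, here is roughly how the same loop could look in Python 3 with bs4. This is only a sketch: the HTML excerpt is invented, and the 'reference' class on the links is carried over from the book's example, so check the live page source before relying on either.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt; the live page's markup may differ
html = '''
<h2><a class="reference" href="/jobs/2/">Backend Developer</a></h2>
<h2>No link here</h2>
<h2><a class="reference" href="/jobs/1/">data Engineer</a></h2>
'''

soup = BeautifulSoup(html, 'html.parser')

jobs = set()
for header in soup.find_all('h2'):
    link = header.find('a', class_='reference')
    if link is None:
        continue  # heading without a matching link
    jobs.add('%s (%s)' % (link.string, link['href']))

# Case-insensitive sort, as in the original example
listing = '\n'.join(sorted(jobs, key=str.lower))
print(listing)
```

If listing comes back empty on the real page, the tag name or class in the loop does not match the page source.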

How to get All Text in robot framework ?

Consider the following source code,
<div id="groupContainer" class="XXXXXX">
<ul id="GroupContactListWrapper" class="list-wrapper">
<li class="contactNameItemContainer">
<div class="contactNameItem">
<span class="name">Name1</span>
</div>
</li>
<li class="contactNameItemContainer">
<div class="contactNameItem">
<span class="name">Name2</span>
</div>
</li>
</ul>
</div>
How do I retrieve the two names (Name1, Name2) in a list variable?
I tried the following XPath with the "Get Text" keyword, but it only returns the first one.
//div[@id='groupContainer']//li[@class='contactNameItemContainer']//span
Please suggest.
You could iterate over the elements as below:
${xpath}=    Set Variable    //div[@id='groupContainer']//li[@class='contactNameItemContainer']//span
${count}=    Get Matching Xpath Count    ${xpath}
${names}=    Create List
:FOR    ${i}    IN RANGE    1    ${count} + 1
\    ${name}=    Get Text    xpath=(${xpath})[${i}]
\    Append To List    ${names}    ${name}
This works but is rather slow when there are many matches.
You could extend Selenium2Library and write your own keyword for this purpose. Save the following as Selenium2LibraryExt.py:
from Selenium2Library import Selenium2Library
class Selenium2LibraryExt(Selenium2Library):
    def get_all_texts(self, locator):
        """Returns the text value of elements identified by `locator`.
        See `introduction` for details about locating elements.
        """
        return self._get_all_texts(locator)
    def _get_all_texts(self, locator):
        elements = self._element_find(locator, False, True)
        texts = []
        for element in elements:
            if element is not None:
                texts.append(element.text)
        return texts if texts else None
Then you can use your new Get All Texts keyword in your tests like this:
*** Settings ***
Library    Selenium2LibraryExt
*** Test Cases ***
Get All Texts Test
    Open Browser    http://www.example.com    chrome
    @{texts}    Get All Texts    css=.name
    Log Many    ${texts}
Though the top-rated answer is fully working (and the most xpath-ish :), let me add an option I don't see proposed yet: using the Get Webelements keyword. E.g.:
@{locators}=    Get Webelements    xpath=//div[@id='groupContainer']//li[@class='contactNameItemContainer']//span
${result}=    Create List
:FOR    ${locator}    IN    @{locators}
\    ${name}=    Get Text    ${locator}
\    Append To List    ${result}    ${name}
It returns a list of all matching elements, which you then just iterate over. It might be a tad faster than XPath's [index] reference because the DOM is evaluated once, but don't hold me accountable if that's not fully true :)
Get Text returns the content of the first element that matches the locator. When using XPath you can specify the index of the element you want, counting from 1, like this:
${name1}    Get Text    xpath=//div[@id='groupContainer']//li[@class='contactNameItemContainer'][1]//span
${name2}    Get Text    xpath=//div[@id='groupContainer']//li[@class='contactNameItemContainer'][2]//span
@{names}    Create List    ${name1}    ${name2}
@Velapanur
I had a similar requirement, where I had to enter text into the textareas on a page.
I wrote the code below based on what Todor suggested, and it worked. Many thanks to Todor, Velapanur
@{Texts}=    Get WebElements    ${AllTextboxes}
:FOR    ${EachTextarea}    IN    @{Texts}
\    Input Text    ${EachTextarea}    ${RandomTextdata}
Here is the logic to get all elements in Java. You can adapt it to your needs.
You have to use findElements() and not findElement() to get all elements.
List<WebElement> items = driver.findElements(By.cssSelector("ul#GroupContactListWrapper div.contactNameItem"));
for (WebElement item : items) {
    System.out.println(item.getText());
}
If you want a particular element from the list, use items.get(index). The index is zero-based, so items.get(1) returns the second element.

Is there a shorter syntax than soup.select("#visitor_stats")[0]?

I'm using BeautifulSoup (import bs4) to read some information from a web page. Several lines in my script look like
stats = soup.select("#visitor_stats")[0]
Is there a shorter syntax for this?
select() lets you select HTML elements with a CSS selector (for example by id or class). In this case it returns every element whose id attribute is visitor_stats, and [0] then takes the first element of the returned list.
The BeautifulSoup method find() returns the first match for the search criteria, so you can drop the [0] index by using find():
stats = soup.find(attrs={'id':'visitor_stats'})
But I am not sure if this is any shorter :)
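Two shorter spellings exist in bs4: select_one() takes the same CSS selector and returns the first match, and find() accepts id as a keyword argument. A quick sketch on made-up markup that compares the three:

```python
from bs4 import BeautifulSoup

html = '<div id="visitor_stats">42 visitors</div>'  # hypothetical markup
soup = BeautifulSoup(html, 'html.parser')

first_by_index = soup.select('#visitor_stats')[0]
first_by_select_one = soup.select_one('#visitor_stats')
first_by_find = soup.find(id='visitor_stats')

# All three names refer to the same element in the tree
same = first_by_index is first_by_select_one is first_by_find

# On a miss, select_one() and find() return None, whereas indexing
# the empty list returned by select() would raise IndexError
missing = soup.select_one('#no_such_id')

print(same, missing)
```

The difference on a miss matters if the element is not always present: with select_one() or find() you check for None instead of catching IndexError.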