cannot extract text from tag Beautifulsoup - beautifulsoup

The following command correctly extracts a table from an HTML page:
[tr.findAll('td') for tr in table.findAll('tr',{'class': "js-file-line"})]
[[<td class="blob-num js-line-number" data-line-number="1" id="L1"></td>],
[<td class="blob-num js-line-number" data-line-number="2" id="L2"></td>,
<td>Arsenal</td>,
<td>38</td>,
<td>26</td>,
<td>9</td>,
<td>3</td>,
<td>79</td>,
<td>36</td>,
<td>87</td>],
[<td class="blob-num js-line-number" data-line-number="3" id="L3"></td>,
<td>Liverpool</td>,
etc.
I would like to modify the command to extract the content of each td, but I cannot get the text from each row because .text raises an error.
I use the following command:
[tr.findAll('td').text[1:] for tr in table.findAll('tr',{'class': "js-file-line"})][1:]
Here the [1:] slices are used to skip the headers (they work fine; I tested them). The problem is .text, which produces the following error:
ResultSet object has no attribute 'text'.
You're probably treating a list of items like a single item.
Did you call find_all() when you meant to call find()?
I am actually using findAll, which from my understanding is equivalent to find_all.
Sorry if this is too basic a question...

The find_all method returns a ResultSet object, which is basically a list of Tag objects.
text is an attribute of Tag, not of ResultSet, so you need one more list comprehension.
txt = [
[td.text for td in tr.find_all('td')][1:]
for tr in table.find_all('tr', {'class': "js-file-line"})
][1:]
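Or, if the rows contain only 'td' tags, you can use the strings generator. Note that it yields text nodes rather than cells, so an empty cell contributes nothing and whitespace between tags shows up as its own string; check that the [1:] slice still drops what you intend.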
txt = [list(tr.strings)[1:] for tr in table.find_all('tr', {'class': "js-file-line"})][1:]
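For reference, here is a minimal, self-contained sketch of the nested list comprehension. The sample markup is made up to resemble the structure in the question (a header row whose only td is the empty line-number cell), so treat the data and the html.parser choice as assumptions, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the page in the question: every row starts
# with an empty line-number cell, and the header row uses <th> for its labels.
html = """
<table>
  <tr class="js-file-line"><td id="L1"></td><th>Team</th><th>Played</th></tr>
  <tr class="js-file-line"><td id="L2"></td><td>Arsenal</td><td>38</td></tr>
  <tr class="js-file-line"><td id="L3"></td><td>Liverpool</td><td>38</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Inner comprehension: one text value per <td>, minus the line-number cell.
# Outer slice: drop the header row.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")][1:]
    for tr in table.find_all("tr", class_="js-file-line")
][1:]

print(rows)
```

The key point is that .text (or get_text()) is called on each individual td, never on the ResultSet returned by find_all.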

Related

Selenium web scraping elements from tag

I'm looping over several different URLs, trying to get some information about some movies.
I'm trying to get the writers. I am not selecting each one with its own CSS selector because another movie may not have the same number of scriptwriters, and that would give an error. For this reason I want to extract the elements by tag instead. For example, I want to get all the "a" elements.
I have the following code but it's not working:
driver.find_element(By.TAG_NAME,"a")
I don't know if there is any other way without using tag
url movie = "https://www.imdb.com/title/tt7740496/?ref_=watch_fanfav_tt_t_4"
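I think you are using Python. Try one of these methods (the first is the older find_elements_by_* form, which recent Selenium 4 releases no longer provide; the second is the By-based form):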
driver.find_elements_by_xpath('(//span[contains(text(),"Guión")])[1]/../div//a')
driver.find_elements(By.XPATH,'(//span[contains(text(),"Guión")])[1]/../div//a')
Check the Selenium documentation: Locating Elements
With Java code, this returns 3 elements, as you want.

Selenium finding elements returns incorrect elements

I'm using Selenium to try and get some elements on a web page but I'm having trouble getting the ones I want. I'm getting some, but they're not the ones I want.
So what I have on my page are five divs that look like this:
<div class="membershipDetails">
Inside each one is something like this:
<div class="membershipDetail">
<h3>
VIP Membership
</h3>
</div>
They DO all have this same link, but they don't have the same text ('VIP Membership' would be replaced by something else)
So the first thing was to get all the divs above in a list. This is the line I use:
listElementsMembership = driver.find_elements_by_css_selector(div[class^='membershipDetail'])
This gives me five elements, just as I would expect. I checked the 'class' attribute name and they are what I would expect. At this point I should say that they aren't all EXACTLY the same name 'membershipDetail'. Some have variations. But I can see that I have all five.
The next thing is to go through these elements and try and get that element which contains the href ('VIP Membership').
So I did that like this:
for elem in listElementsMembership:
    elemDetailsLink = elem.find_element_by_xpath('//a[contains(@href,"EditMembership")]')
Now this does return something, but it always got me the element from the FIRST of the five elements. It's as if the 'elem.find_element_by_xpath' line is going up a level first before finding these hrefs. I kind of confirmed this by switching this to a 'find_elements_by_xpath' (plural) and getting, you guessed it, five elements.
So is this line:
elemDetailsLink = elem.find_element_by_xpath('//a[contains(@href,"EditMembership")]')
going up a level before getting its results? If it is, how can I make it not do that and just restrict itself to the children?
If you are trying to find an element within another element, start the XPath with a . as below. An XPath that begins with // is evaluated from the document root, even when it is called on an element.
listElementsMembership = driver.find_elements_by_css_selector("div[class^='membershipDetail']")
for elem in listElementsMembership:
    elemDetailsLink = elem.find_element_by_xpath('.//a')  # Finds the "a" tag relative to "elem"
Suppose you are looking for VIP Membership:
listElementsMembership = driver.find_elements_by_css_selector("div[class^='membershipDetail']")
for elem in listElementsMembership:
    value = elem.find_element_by_xpath('.//a').get_attribute("innerText")
    if "VIP Membership" in value:
        print(value)
And if you don't want to iterate over all five elements, try an XPath like one of those below (based on the HTML you have shared):
//div[@class='membershipDetail']//a[text()='VIP Membership']
Or
//div[@class='membershipDetail']//a[contains(text(),'VIP Membership')]
Your CSS selector has a few mistakes.
The quotes are missing.
^ is for starts-with; I am not sure you really need that. For partial matching, use * instead of ^.
Also, I do not see any logic for the below statement in your code attempt.
The next thing is to go through these elements and try and get that
element which contains the href ('VIP Membership').
Code:
from selenium.webdriver.common.by import By

listElementsMembership = driver.find_elements_by_css_selector("div[class*='membershipDetail']")
for ele in listElementsMembership:
    e = ele.find_element(By.XPATH, ".//descendant::a")
    if "VIP Membership" in e.get_attribute('href'):
        print(e.text, e.get_attribute('href'))
You can give an index using a square bracket like this.
elemDetailsLink = elem.find_element_by_xpath('(//a[contains(@href,"EditMembership")])[1]')
If you are trying to get an element using XPath, the index should start with 1, not 0.
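To see the difference between '//a' and './/a' without starting a browser, here is a small sketch that uses lxml instead of Selenium. That is a substitution on my part: the XPath scoping rule it shows is the same, but the markup, hrefs, and class names below are invented for illustration.

```python
from lxml import html

# Hypothetical markup: two membership blocks, each with its own edit link.
page = html.fromstring("""
<div>
  <div class="membershipDetail">
    <h3><a href="/EditMembership?id=1">VIP Membership</a></h3>
  </div>
  <div class="membershipDetailAlt">
    <h3><a href="/EditMembership?id=2">Basic Membership</a></h3>
  </div>
</div>
""")

blocks = page.xpath("//div[starts-with(@class, 'membershipDetail')]")

# '//a' starts at the document root, so every block gets the same first link.
absolute = [b.xpath("//a[contains(@href, 'EditMembership')]")[0].text for b in blocks]

# './/a' starts at the current block, so each block gets its own link.
relative = [b.xpath(".//a[contains(@href, 'EditMembership')]")[0].text for b in blocks]

print(absolute)
print(relative)
```

The first list repeats the link from the first block, which matches the symptom described in the question; the second list has one distinct link per block.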

Using regular expressions with BeautifulSoup's select_one

I have read answers on this website that describe using reg ex for 'find' queries in BeautifulSoup. However answers are less clear regarding the use of reg ex and querying on multiple tags while using 'select_one'.
Specifically I have two tags, shown below.
'#CommitYear14'
'#CommitYear12'
So I just need a query that looks for matches with '#CommitYear'.
My query right now is
college_info = beautiful_soup_parsing.select_one(tag)
where tag is either '#CommitYear14' or '#CommitYear12'. I don't know how to get both 14 and 12.
The select_one() function applies CSS selectors, so you cannot use re with it. However, you can use the CSS attribute selector ^=, which selects elements whose attribute value begins with the given string:
data = '''
<div id="CommitYear12">CommitYear12</div>
<div id="CommitYear14">CommitYear14</div>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('[id^="CommitYear"]'))
Prints:
<div id="CommitYear12">CommitYear12</div>
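Since select_one() stops at the first match, getting both the 12 and the 14 element needs a call that returns every match. A short sketch, assuming the same kind of markup as above (the extra "Other" div and the html.parser choice are my additions): select() takes the same CSS selector, and find_all() accepts a compiled regular expression if you really want one.

```python
import re

from bs4 import BeautifulSoup

data = '''
<div id="CommitYear12">CommitYear12</div>
<div id="CommitYear14">CommitYear14</div>
<div id="Other">Other</div>'''

soup = BeautifulSoup(data, "html.parser")

# CSS route: select() returns every element matching the selector.
via_css = [tag["id"] for tag in soup.select('[id^="CommitYear"]')]

# Regex route: find_all() accepts a compiled pattern as an attribute filter.
via_regex = [tag["id"] for tag in soup.find_all(id=re.compile(r"^CommitYear\d+$"))]

print(via_css)
print(via_regex)
```

Both lists contain the two CommitYear ids and skip the unrelated div.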

Need to pass data to hidden text field in Selenium

I found the solution below:
jse.executeScript("document.getElementsByName('body')[0].setAttribute('type', 'text');");
and then I pass the data using SendKeys.
But it creates a duplicate text field with the text attribute, and the hidden text field still exists.
You have two input tags. I am assuming you want to execute the script against the second one not the first.
Also, I am using querySelector and that allows you to pass cssSelector to identify the element you want.
Note: Make sure the format of dateToPass is correct
String dateToPass = "01/01/2015";
String scriptText = "document.querySelector('.propertyYear.require').setAttribute('value','" + dateToPass + "')";
((JavascriptExecutor)driver).executeScript(scriptText);

Is there a shorter syntax than soup.select("#visitor_stats")[0]?

I'm using BeautifulSoup (import bs4) to read some information from a web page. Several lines in my script look like
stats = soup.select("#visitor_stats")[0]
Is there a shorter syntax for this?
select() lets you select HTML elements using CSS selectors, which can match on attributes such as id and class. In this case you are looking for all elements whose id attribute is visitor_stats, and then taking the first element of the returned list.
The BeautifulSoup method find() returns the first match for the search criteria, so you can drop the [0] index by using find():
stats = soup.find(attrs={'id':'visitor_stats'})
But I am not sure if this is any shorter :)
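Another option that keeps the CSS selector is select_one(), which returns the first match directly. A small sketch comparing the three spellings; the one-line document here is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical one-element document, just enough to compare the calls.
soup = BeautifulSoup('<div id="visitor_stats">42 visitors</div>', "html.parser")

a = soup.select("#visitor_stats")[0]   # original form
b = soup.select_one("#visitor_stats")  # same selector, no index
c = soup.find(id="visitor_stats")      # keyword form of find()

print(a is b is c)

# When nothing matches, select_one() and find() return None,
# whereas select(...)[0] raises IndexError.
missing = soup.select_one("#no_such_id")
print(missing)
```

All three return the same Tag, so the choice is mostly about which reads best and how you want a missing element to be reported.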