Is there a shorter syntax than soup.select("#visitor_stats")[0]? - beautifulsoup

I'm using BeautifulSoup (import bs4) to read some information from a web page. Several lines in my script look like
stats = soup.select("#visitor_stats")[0]
Is there a shorter syntax for this?

select() returns a list of all HTML elements that match a CSS selector (for example by id or class). In this case you are asking for every element whose id is visitor_stats and then taking the first item of the returned list.
The BeautifulSoup method find() returns the first element that matches the search criteria, so you can drop the [0] index by using find():
stats = soup.find(attrs={'id':'visitor_stats'})
But I am not sure if this is any shorter :)
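For comparison, BeautifulSoup also offers select_one(), which takes the same CSS selector as select() but returns only the first match, and find() accepts id as a keyword argument. The sketch below assumes bs4 is installed; the HTML fragment is invented purely for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, used only for illustration.
html = '<div id="visitor_stats"><span>42 visitors</span></div>'
soup = BeautifulSoup(html, "html.parser")

# Three ways to get the first element with id="visitor_stats".
via_select = soup.select("#visitor_stats")[0]        # list, then index
via_select_one = soup.select_one("#visitor_stats")   # first match or None
via_find = soup.find(id="visitor_stats")             # first match or None

print(via_select_one.get_text())
```

One difference worth keeping in mind: when nothing matches, select(...)[0] raises IndexError, while select_one() and find() return None.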

Related

Selenium web scraping elements from tag

I'm looping over different URLs, trying to get some information about movies.
I'm trying to get the writers. I am not extracting each one by its own CSS selector, because another movie may not have the same number of scriptwriters and that would raise an error. For this reason I want to extract the elements by tag. For example, I want to get all the "a" elements (image attached).
I have the following code but it's not working:
driver.find_element(By.TAG_NAME,"a")
I don't know if there is any other way without using the tag name.
Movie URL: "https://www.imdb.com/title/tt7740496/?ref_=watch_fanfav_tt_t_4"
I think you are using Python. Try one of these methods (the first is the older form, which was removed in Selenium 4.3; the second is the current form):
driver.find_elements_by_xpath('(//span[contains(text(),"Guión")])[1]/../div//a')
driver.find_elements(By.XPATH,'(//span[contains(text(),"Guión")])[1]/../div//a')
Check the Selenium documentation: Locating Elements
With Java code, this XPath returns 3 elements for me, as you want.

Selenium finding elements returns incorrect elements

I'm using Selenium to try and get some elements on a web page but I'm having trouble getting the ones I want. I'm getting some, but they're not the ones I want.
So what I have on my page are five divs that look like this:
<div class="membershipDetails">
Inside each one is something like this:
<div class="membershipDetail">
<h3>
VIP Membership
</h3>
</div>
They DO all have this same link, but they don't have the same text ('VIP Membership' would be replaced by something else)
So the first thing was to get all the divs above in a list. This is the line I use:
listElementsMembership = driver.find_elements_by_css_selector(div[class^='membershipDetail'])
This gives me five elements, just as I would expect. I checked the 'class' attribute name and they are what I would expect. At this point I should say that they aren't all EXACTLY the same name 'membershipDetail'. Some have variations. But I can see that I have all five.
The next thing is to go through these elements and try and get that element which contains the href ('VIP Membership').
So I did that like this:
for elem in listElementsMembership:
    elemDetailsLink = elem.find_element_by_xpath('//a[contains(@href,"EditMembership")]')
Now this does return something, but it always got me the element from the FIRST of the five elements. It's as if the 'elem.find_element_by_xpath' line is going up a level first before finding these hrefs. I kind of confirmed this by switching this to a 'find_elements_by_xpath' (plural) and getting, you guessed it, five elements.
So is this line:
elemDetailsLink = elem.find_element_by_xpath('//a[contains(@href,"EditMembership")]')
going up a level before getting its results? If it is, how can I make it not do that and just restrict itself to the children?
If you are trying to find an element within another element, start the XPath with a . like below:
listElementsMembership = driver.find_elements_by_css_selector("div[class^='membershipDetail']")
for elem in listElementsMembership:
    elemDetailsLink = elem.find_element_by_xpath('.//a')  # Finds the "a" tag relative to "elem"
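The difference between a leading // and a leading .// can be reproduced without a browser. The sketch below uses lxml rather than Selenium, on the assumption that lxml is installed, and the markup (including the href values) is hypothetical, modelled loosely on the question. The XPath semantics are the same: // starts from the document root, .// starts from the context element.

```python
from lxml import html

# Hypothetical markup modelled on the question: two blocks, each with a link.
page = html.fromstring("""
<div>
  <div class="membershipDetail"><h3><a href="/EditMembership/1">VIP Membership</a></h3></div>
  <div class="membershipDetail"><h3><a href="/EditMembership/2">Basic Membership</a></h3></div>
</div>
""")

blocks = page.xpath('//div[@class="membershipDetail"]')
second = blocks[1]

# A leading "//" ignores the context node and searches the whole document.
absolute = [a.text for a in second.xpath('//a[contains(@href, "EditMembership")]')]
# A leading ".//" searches only the descendants of the context node.
relative = [a.text for a in second.xpath('.//a[contains(@href, "EditMembership")]')]

print(absolute)
print(relative)
```

Called on the second block, the absolute path still returns both links, which matches the behaviour described in the question, while the relative path returns only the link inside that block.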
Suppose you are looking for VIP Membership:
listElementsMembership = driver.find_elements_by_css_selector("div[class^='membershipDetail']")
for elem in listElementsMembership:
    value = elem.find_element_by_xpath('.//a').get_attribute("innerText")
    if "VIP Membership" in value:
        print(value)
And if you don't want to iterate over all five elements, try an XPath like the ones below (based on the HTML you have shared):
//div[@class='membershipDetail']//a[text()='VIP Membership']
Or
//div[@class='membershipDetail']//a[contains(text(),'VIP Membership')]
You have a few mistakes in that CSS selector.
The quotes are missing.
^ means starts-with; I'm not sure you really need that. If you want partial matching, use * instead of ^.
Also, I do not see any logic in your code attempt for the statement below.
The next thing is to go through these elements and try and get that
element which contains the href ('VIP Membership').
Code:
from selenium.webdriver.common.by import By
listElementsMembership = driver.find_elements_by_css_selector("div[class*='membershipDetail']")
for ele in listElementsMembership:
    e = ele.find_element(By.XPATH, ".//descendant::a")
    if "VIP Membership" in e.get_attribute('href'):
        print(e.text, e.get_attribute('href'))
You can give an index using square brackets, like this:
elemDetailsLink = elem.find_element_by_xpath('(//a[contains(@href,"EditMembership")])[1]')
Note that XPath indexes start at 1, not 0.

XPath selector returns empty list

I'm trying to scrape data from store: https://www.tibia.com/charactertrade/?subtopic=currentcharactertrades&page=details&auctionid=12140&source=overview
There is no problem getting data from the 1st and 2nd tables, but further down the page XPath returns only empty lists.
I even tried saving the response to a file:
scrapy fetch --nolog "https://www.tibia.com/charactertrade/?subtopic=currentcharactertrades&page=details&auctionid=3475&source=overview" > response.html
For the table with skills everything works fine:
sword = response.xpath('//div[@class="AuctionHeader"]/a/text()').get()
But when it comes to getting, for example, the gold value, I get nothing back:
gold = response.xpath('/html/body/div[3]/div[1]/div[2]/div/div[2]/div/div[1]/div[2]/div[5]/div/div/div[3]/div[2]/div[2]/table/tbody/tr/td/div/table/tbody/tr[2]/td/div[2]/div/table/tbody/tr[3]/td/div/text()').get()
In Chrome and Firefox both selectors work smoothly, but in Scrapy only the 1st one does.
I know there might be problems with data updated by JavaScript, but that doesn't look like the case here.
This doesn't look like a JavaScript problem. I think your XPath selectors are not correct. It's best to be as specific as possible and to avoid long chains of nodes. Here we can use the class attribute TableContent to get the tables you want, and then pick out each individual table as needed.
Code Example
table = response.xpath('//table[@class="TableContent"]')[3]
gold_title = table.xpath('tr/td/span/text()')[2].get()
gold_value = table.xpath('tr/td/div/text()')[2].get()
Output
'Gold: '
'31,030'
Explanation
Using the class attribute TableContent, you can select which table you want. Here I've selected the table with the gold values. I've then selected each row and the specific element which has the gold value. The values are hidden behind span and div elements. get() returns a string, getall() returns a list.
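One common cause of this kind of browser-versus-Scrapy mismatch is that browsers insert tbody elements into tables when they build the DOM, so an XPath copied from developer tools contains tbody steps that do not exist in the raw HTML Scrapy downloads. Whether that is what happened on this particular page is an assumption, although the working selectors above do omit tbody. The sketch below uses lxml (the parser underneath Scrapy's selectors, assumed to be installed) on invented markup.

```python
from lxml import html

# Hypothetical raw HTML as a server might send it: no <tbody> in the table.
raw = (
    '<html><body><table class="TableContent">'
    '<tr><td><div>31,030</div></td></tr>'
    '</table></body></html>'
)
doc = html.fromstring(raw)

# Path in the style copied from a browser's developer tools (includes tbody).
with_tbody = doc.xpath('//table[@class="TableContent"]/tbody/tr/td/div/text()')
# The same path without the tbody step.
without_tbody = doc.xpath('//table[@class="TableContent"]/tr/td/div/text()')

print(with_tbody)
print(without_tbody)
```

If the saved response.html contains no tbody tags, removing those steps from the long absolute path is worth trying before rewriting the selector.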

How to select one from duplicate tag in page in java in selenium webdriver

I am using Selenium WebDriver and I have a number of items on a page, and each item on the page is a separate form.
I have saved all of these form elements in a list and I am iterating over every item in an attempt to get the name of the element by using the "alt" attribute.
However when I try to get the "name" attribute from the input element it is always returning the first input tag found on that page, not the name attribute of the element I have currently selected.
The syntax I am using is:
((Webdriver imgtags.get(i)).findelement(By.xpath("//input[@name='qty']")).sendKeys ("100");
I have also tried to get the id from the tag by using:
((Webdriver imgtags.get(i)).getAttribute("id");
It's returning a blank value, but it should return the value of the id attribute in that input tag.
I also tried to get the id by using .bytagname but as id is an attribute it is not accessible
Try:
driver.findElement(By.xpath("//*[contains(local-name(), 'input') and contains(@name, 'qty')]")).sendKeys("100");
To answer the comment by @rrd: to be honest, I have no idea why OP uses ((Webdriver imgtags.get(i)). I don't know what that is. Normally, I just use driver.findElement[...]
Hoping that he knows what works in his framework :D
A leading // in an XPath is evaluated from the document root, even when findElement is called on a WebElement, so it matches the first input on the page rather than one inside the current element.
Instead, try changing your code to use the following XPath:
((Webdriver imgtags.get(i)).findElement(By.xpath("./descendant-or-self::input[@name='qty']")).sendKeys("100");
This will base your search off the currently selected WebElement and then look for any descendants that have a name attribute with a value of "qty".
I would also suggest storing your imgtags as a list of WebElement, e.g.
List<WebElement> imgtags = new ArrayList<>();
This is a much better idea than casting to WebDriver to be able to use .findElement(). The cast will cause you problems at some point in the future.

cannot extract text from tag Beautifulsoup

The following command correctly extracts a table from an HTML page:
[tr.findAll('td') for tr in table.findAll('tr',{'class': "js-file-line"})]
[[<td class="blob-num js-line-number" data-line-number="1" id="L1"></td>],
[<td class="blob-num js-line-number" data-line-number="2" id="L2"></td>,
<td>Arsenal</td>,
<td>38</td>,
<td>26</td>,
<td>9</td>,
<td>3</td>,
<td>79</td>,
<td>36</td>,
<td>87</td>],
[<td class="blob-num js-line-number" data-line-number="3" id="L3"></td>,
<td>Liverpool</td>,
etc.
I would like to modify the command to extract the content of each td,
but I cannot extract the text from each row because .text raises an error.
I use the following command:
[tr.findAll('td').text[1:] for tr in table.findAll('tr',{'class': "js-file-line"})][1:]
The [1:] slices are used to skip the headers (and they work fine; tested). The problem is .text, which results in the following error:
ResultSet object has no attribute 'text'.
You're probably treating a list of items like a single item.
Did you call find_all() when you meant to call find()?
I am actually using findAll, which from my understanding is equivalent to find_all.
Sorry if this is too basic a question...
The find_all method returns a ResultSet object, which is basically a list of Tag objects.
text is a Tag attribute, so you need one more list comprehension:
txt = [
    [td.text for td in tr.find_all('td')][1:]
    for tr in table.find_all('tr', {'class': "js-file-line"})
][1:]
Or, if the rows contain only 'td' tags you can use the strings generator.
txt = [list(tr.strings)[1:] for tr in table.find_all('tr', {'class': "js-file-line"})][1:]
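To see both the error and the nested-comprehension fix in one place, here is a small self-contained sketch. It assumes bs4 is installed, and the table markup is invented to resemble the output shown in the question (a line-number cell first, header cells as th).

```python
from bs4 import BeautifulSoup

# Hypothetical table shaped like the output in the question.
html = """
<table>
<tr class="js-file-line"><td id="L1"></td><th>Team</th><th>Played</th></tr>
<tr class="js-file-line"><td id="L2"></td><td>Arsenal</td><td>38</td></tr>
<tr class="js-file-line"><td id="L3"></td><td>Liverpool</td><td>38</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")
rows = table.find_all("tr", {"class": "js-file-line"})

# .text on the ResultSet (the whole list of cells) fails...
try:
    rows[1].find_all("td").text
    failed = False
except AttributeError as exc:
    failed = True
    print("AttributeError:", exc)

# ...so take .text from each Tag instead, skipping the line-number cell
# in every row ([1:] inside) and the header row ([1:] outside).
txt = [[td.text for td in tr.find_all("td")][1:] for tr in rows][1:]
print(txt)
```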