BeautifulSoup filtering tags based on class value

The HTML code I get from BeautifulSoup looks a little like this:
<tr class="A">
[various content]
</tr>
<tr class="B">
[various content]
</tr>
<tr class="C">
[various content]
</tr>
...
<tr class="NOT WANTED">
[various content]
</tr>
...
<tr class="A">
[various content]
</tr>
<tr class="B">
[various content]
</tr>
<tr class="C">
[various content]
</tr>
Say I want to save the code in a variable x, but without the tr tag with the unwanted class in the middle. How would I go about doing it?
I know I can do
x = whatever.findAll('tr', {'class' : 'A'})
if I just want class A, but how do I include every tr tag except the one with the "NOT WANTED" class value?

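Try CSS selectors together with the decompose() method:
from bs4 import BeautifulSoup as bs
soup = bs(whatever, 'html.parser')
# select_one returns the first tr whose class attribute is exactly "NOT WANTED"
target = soup.select_one('tr[class="NOT WANTED"]')
# decompose() removes that tag from the tree in place
target.decompose()
soup
The soup should now contain every row except the unwanted one, which is your expected output.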
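If you would rather leave the parsed tree untouched, a non-destructive variant is to collect every tr except the unwanted one. The following is only a minimal sketch: the html string and the names unwanted and kept_classes are illustrative stand-ins, not part of the question. It relies on bs4 storing class as a list of tokens, so class="NOT WANTED" shows up as ['NOT', 'WANTED'].

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the markup described in the question.
html = """
<table>
<tr class="A"><td>a</td></tr>
<tr class="B"><td>b</td></tr>
<tr class="NOT WANTED"><td>skip</td></tr>
<tr class="C"><td>c</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# bs4 splits the class attribute on whitespace into a list of tokens.
unwanted = ["NOT", "WANTED"]

# Keep every row whose class list differs from the unwanted one.
rows = [tr for tr in soup.find_all("tr") if tr.get("class") != unwanted]

kept_classes = [" ".join(tr["class"]) for tr in rows]
print(kept_classes)
```

Unlike decompose(), this leaves soup as it was, so the skipped row is still available if it is needed later.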

Related

XPATH to use preceding and following sibling in a single statement

I would like to scrape the name and address information that sits between the tag containing the text "Defendant" and the next heading tag.
My HTML structure is:
<hr>
<H5>Defendant/Respondent Information</H5>
<span class="InfoChargeStatement">(Each Defendant/Respondent is displayed below)</span>
<table>
<tr>
<td><span class="FirstColumnPrompt">Party Type:</span></td><td><span class="Value">Defendant</span><span class="Prompt">Party No.:</span><span class="Value">1</span></td>
</tr>
</table>
<table>
<tr>
<td><span class="FirstColumnPrompt">Name:</span></td><td><span class="Value">Name 1</span></td>
</tr>
</table>
<table>
<tr>
<td><span class="FirstColumnPrompt">Address:</span></td><td><span class="Value">Addr 1</span></td>
</tr>
<tr>
<td><span class="FirstColumnPrompt">City:</span></td><td><span class="Value">city1</span><span class="Prompt">State:</span><span class="Value">aa</span><span class="Prompt">Zip Code:</span><span class="Value">Zip1</span></td>
</tr>
</table>
<hr>
<table>
<tr>
<td><span class="FirstColumnPrompt">Party Type:</span></td><td><span class="Value">Defendant</span><span class="Prompt">Party No.:</span><span class="Value">2</span></td>
</tr>
</table>
<table>
<tr>
<td><span class="FirstColumnPrompt">Name:</span></td><td><span class="Value">Name 2</span></td>
</tr>
</table>
<table>
<tr>
<td><span class="FirstColumnPrompt">Address:</span></td><td><span class="Value">Addr2</span></td>
</tr>
<tr>
<td><span class="FirstColumnPrompt">City:</span></td><td><span class="Value">City2</span><span class="Prompt">State:</span><span class="Value">st2</span><span class="Prompt">Zip Code:</span><span class="Value">zip2</span></td>
</tr>
</table>
<hr>
<H5>Related Persons Information</H5>
<span class="InfoChargeStatement">(Each Related person is displayed below)</span>
<table>
<tr>
<td><span class="FirstColumnPrompt">Name:</span></td><td><span class="Value">Unwanted Name</span></td>
</tr>
</table>
<table>
<tr>
<td><span class="FirstColumnPrompt">Address:</span></td><td><span class="Value">un addr</span></td>
</tr>
<tr>
<td><span class="FirstColumnPrompt">City:</span></td><td><span class="Value">Unwanted City</span><span class="Prompt">State:</span><span class="Value">Unwanted city</span><span class="Prompt">Zip Code:</span><span class="Value">12345</span></td>
</tr>
</table>
<table></table>
<hr>
My current XPath captures the first occurrence of the name and address properly, but when I need to extract multiple occurrences, it also scrapes the information under the unwanted h5 tags.
My current XPath is:
"//*[contains(text(),'Defendant')]//following-sibling::table//span[text()='Name:' or text()='Business or Organization Name:']/ancestor-or-self::td/following-sibling::td//text()"
I tried including preceding-sibling and following-sibling, but nothing gives my expected output.
My current output is:
names - [
Name1,
Name2
Unwanted Name,
]
Expected output is,
[
Name1
Name2
]
Kindly help.
Try this:
"//H5[contains(text(),'Defendant')]/following-sibling::table[not(preceding-sibling::H5[not(contains(text(),'Defendant'))])]/tr[td[1][span[text()[.='Name:' ]]]]/td[2]/span/text()"
It first selects the tables that have no preceding-sibling::H5 whose text() does not contain 'Defendant', and then
selects from those tables the tr whose first td meets your requirements and returns the text of the second td.
There is no need for the double slashes, which are bad for performance.
EDIT 1
Since there are more preceding-sibling::h5 than the example shows, this XPath will deal with that:
"//H5[contains(text(),'Defendant')]/following-sibling::table[preceding-sibling::H5[1][contains(text(),'Defendant')]]//tr[td[1][span[text()[.='Name:' ]]]]/td[2]/span/text()"
This selects only those tables whose first preceding-sibling::H5 is the H5 we are interested in.
EDIT 2
Actually, the initial H5 selection is now redundant. This XPath will do:
"//table[preceding-sibling::H5[1][contains(text(),'Defendant')]]//tr[td[1][span[text()[.='Name:' ]]]]/td[2]/span/text()"

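The "nearest preceding H5" rule used by the last XPath can also be checked procedurally. The following is a rough sketch using only Python's standard library; xml.etree.ElementTree does not support the preceding-sibling axis, so the rule is written as a loop instead. The doc string is a simplified, well-formed stand-in for the page (wrapped in a hypothetical div), and current_heading and names are illustrative names.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the page in the question.
doc = """
<div>
  <H5>Defendant/Respondent Information</H5>
  <table><tr><td><span>Name:</span></td><td><span>Name 1</span></td></tr></table>
  <hr/>
  <table><tr><td><span>Name:</span></td><td><span>Name 2</span></td></tr></table>
  <hr/>
  <H5>Related Persons Information</H5>
  <table><tr><td><span>Name:</span></td><td><span>Unwanted Name</span></td></tr></table>
</div>
"""

root = ET.fromstring(doc)

names = []
current_heading = ""  # text of the closest H5 seen so far, in document order

for child in root:
    if child.tag == "H5":
        current_heading = child.text or ""
    elif child.tag == "table" and "Defendant" in current_heading:
        # Same test as tr[td[1][span[text()='Name:']]]/td[2]/span/text()
        for row in child.iter("tr"):
            cells = row.findall("td")
            if len(cells) >= 2 and cells[0].findtext("span") == "Name:":
                names.append(cells[1].findtext("span"))

print(names)
```

Tracking the most recent heading while walking the siblings in order plays the role of preceding-sibling::H5[1]: a table belongs to whichever H5 came last before it.
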
Extracting data from table with Scrapy

I have this table
<table class="specs-table">
<tbody>
<tr>
<td colspan="2" class="group">Sumary</td>
</tr>
<tr>
<td class="specs-left">Name</td>
<td class="specs-right">ROG GL552JX </td>
</tr>
<tr class="noborder-bottom">
<td class="specs-left">Category</td>
<td class="specs-right">Gaming </td>
</tr>
<tr>
<td colspan="2" class="group">Technical Details</td>
</tr>
<tr>
<td class="specs-left">Name</td>
<td class="specs-right">Asus 555 </td>
</tr>
<tr>
<td class="specs-left">Resolution </td>
<td class="specs-right">1920 x 1080 pixels </td>
</tr>
<tr class="noborder-bottom">
<td class="specs-left"> Processor </td>
<td class="specs-right"> 2.1 GHz </td>
</tr>
</tbody>
</table>
From this table I want my Scrapy to find the first occurrence of the text "Name" and to copy the value from the next cell (In this case "ROG GL552JX") and find the next occurrence of the text "Name" and copy the value "Asus 555".
The result I need:
'Name': [u'ROG GL552JX'],
'Name': [u'Asus 555'],
The problem is that in this table I have two occurrences of the text "Name" and Scrapy copies the value of both occurrences.
My result is:
'Name': [u'ROG GL552JX', u'Asus 555'],
My bot:
def parse(self, response):
    # Follow the "next page" link ("Pagina urmatoare").
    next_selector = response.xpath('//*[@aria-label="Pagina urmatoare"]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))
    # Follow each product link to its detail page.
    item_selector = response.xpath('//*[contains(@class, "pb-name")]//@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url), callback=self.parse_item)

def parse_item(self, response):
    l = ItemLoader(item=PcgItem(), response=response)
    l.add_xpath('Name', '//tr/td[contains(text(), "Name")]/following-sibling::td/text()', MapCompose(unicode.strip, unicode.title))
    return l.load_item()
How can I solve this problem?
Thank you
If you need one item per Name, then you should do something like:
for sel in response.xpath('//tr/td[contains(text(), "Name")]/following-sibling::td/text()'):
    l = ItemLoader(...)
    # each sel is a single Selector, so extract() returns one string
    l.add_value('Name', sel.extract())
    ...
    yield l.load_item()
If you want everything inside a single item instead, I would recommend leaving it as it is (a list): a scrapy.Item behaves like a dictionary, so it cannot hold two Name keys.
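To make the "one item per Name" idea concrete without a running spider, here is a small sketch using only the standard library. Plain dicts stand in for scrapy.Item, the table is a trimmed copy of the one in the question, and iter_name_items is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the specs table from the question.
table = ET.fromstring("""
<table class="specs-table">
  <tr><td class="specs-left">Name</td><td class="specs-right">ROG GL552JX </td></tr>
  <tr><td class="specs-left">Category</td><td class="specs-right">Gaming </td></tr>
  <tr><td class="specs-left">Name</td><td class="specs-right">Asus 555 </td></tr>
</table>
""")


def iter_name_items(table):
    """Yield one dict per row whose left cell is 'Name'."""
    for row in table.iter("tr"):
        cells = row.findall("td")
        if len(cells) == 2 and (cells[0].text or "").strip() == "Name":
            # A new dict per match avoids the duplicate-key problem.
            yield {"Name": (cells[1].text or "").strip()}


items = list(iter_name_items(table))
print(items)
```

Each match gets its own dict, which mirrors yielding a separate loaded item inside the for loop of the answer.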

Can we use Selenium when a table does not have proper HTML, as shown below?

Here is the table from which I want to get the row that contains a specific element (for example, the href with 'Harvest' in its text) and also check whether the text 'running' exists in the same row.
<table id="execTable" class="tableHistory jobtable translucent">
<colgroup>
<col class="execid">
<col class="titlecol">
</colgroup>
<tbody>
<tr>
<th>Id</th>
<th>Name</th>
</tr>
</tbody>
<tr id="8571">
<td>8571</td>
<td class="titlecol">
<div id="hitdiv-8571" class="arrow"></div>
Harvest
</td>
<td>09-03-2015 09:45:04</td>
<td>-</td>
<td>2m 6s</td>
<td>running</td>
<td>view/restart</td>
</tr>
<tr id="8571-child" class="childRow" style="display: none;"></tr>
<tr id="8566">
<td>8566</td>
<td class="titlecol">
<div id="hitdiv-8566" class="arrow"></div>
mk
</td>
<td>09-03-2015 03:30:00</td>
<td>09-03-2015 04:16:50</td>
<td>46m 50s</td>
<td>succeeded</td>
<td>view/restart</td>
</tr>
<tr id="8555-child" class="childRow" style="display: none;"></tr>
</table>
I am not able to get the TRs.
WebElement table = driver.findElement(By.id("execTable"));
List<WebElement> trows = table.findElements(By.tagName("tr"));
List<WebElement> all = driver.findElements(By.xpath(".//*[@id='execTable']/*"));
for (WebElement a : all) {
    if (a.getTagName().equalsIgnoreCase("tr")) { .... }
}
I was able to get the above code working. Thank you!

How to "firmly" locate an element in a table? Selenium

How can I locate an element "1988" (the fourth line) in the following table:
<table border="0" width="820" cellpadding="2" cellspacing="0">
<tbody>
<tr valign="top">
<td class="default" width="100%">Results <b>1</b> to <b>10</b> of <b>1988</b></td>
</tr>
<tr valign="top">
<td class="default" bgcolor="#C0C0C0"> <font class="resultsheader"> ...etc
</tr>
</tbody>
</table>
IMPORTANT: I know one way that works (By.xpath):
driver.findElement(By.xpath("//td[@width='100%']")).getText();
However, this way does not ALWAYS work. The page is dynamic, so I need a way to locate that element no matter what changes happen to the page.
I tried the following but I am not sure:
By.xpath("//html//body//table//tbody//tr[3]//td//table//tbody//tr//td[2]//table[4]//tbody//tr[1]//td//b[3]")
If you can't change the HTML and want to use attributes for selection, you can write something like this:
//table[@border=0][@width=820]//tbody//tr[1]//td//b[3]
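An alternative that does not depend on layout attributes at all is to find the cell by its visible text and then read its last b element. This is a different technique from the attribute-based XPath above, shown only as a standard-library sketch on a trimmed, well-formed copy of the table; total is an illustrative name, and it assumes the cell text keeps starting with "Results".

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the table from the question.
table = ET.fromstring(
    "<table><tbody>"
    '<tr><td class="default">Results <b>1</b> to <b>10</b> of <b>1988</b></td></tr>'
    '<tr><td class="default">other content</td></tr>'
    "</tbody></table>"
)

total = None
for td in table.iter("td"):
    # Join all text inside the cell, including the text of nested <b> tags.
    text = "".join(td.itertext()).strip()
    if text.startswith("Results"):
        # The total is the last <b> inside the matching cell.
        total = td.findall("b")[-1].text
        break

print(total)
```

In XPath terms the same idea would be roughly //td[starts-with(normalize-space(.),'Results')]/b[3], which keeps working if widths, borders, or the nesting of outer tables change.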

Selenium, Unable to find element by containing text following child node

A little bit of background:
The HTML looks like this:
<table>
<thead>
<tr>
<th>Head1</th>
<th>Head2</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<span>
<span class="icon">S</span>
"Auto"
</span>
</td>
<td>
<span>
Cost, Time
</span>
</td>
</tr>
</tbody>
</table>
A simple version of the code looks like this: (run in pry)
[69] pry> e = driver.find_element(:xpath, "//table/tbody/tr/td/span")
=> #<Selenium::WebDriver::Element:0x7ba9a4d694458ec id=":wdc:1361791490676">
[70] pry> e.text
=> "SAutomotive"
[71] pry> e = driver.find_element(:xpath, "//table/tbody/tr/td/span[contains(text(),'Auto')]")
Selenium::WebDriver::Error::NoSuchElementError: The element could not be found
from /Users/ben/.rvm/gems/ruby-1.9.3-p194/gems/selenium-webdriver-2.29.0/lib/selenium/webdriver/remote/response.rb:52:in `assert_ok'
I cannot change the HTML code.
Although there is only one row in this table, more may be added, and I cannot predict the position of the row; this is why I am trying to find it by name.
My normal code is:
e = driver.find_element(:xpath, "//table/tbody/tr[td/span[contains(text(),'Auto')]]")
The problem I am having is that I cannot find any way of getting the row in the table by the name given in the text of the first table cell.
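Use the XPath below. contains(text(),'Auto') tests only the first text node of the span, which here is the whitespace before the inner icon span, whereas contains(.,'Auto') tests the whole string value of the element: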
"//table/tbody/tr/td/span[contains(.,'Auto')]"
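The difference between text() and . can be reproduced outside the browser. Below is a small illustrative sketch with Python's standard library: for an element with child tags, ElementTree's .text is the text before the first child (comparable to the first text node that contains(text(), ...) ends up testing in XPath 1.0), while joining itertext() gives the full string value that . refers to. The variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# The outer span from the question: whitespace, an icon span, then the label.
span = ET.fromstring("""<span>
  <span class="icon">S</span>
  "Auto"
</span>""")

# Text before the first child element: only whitespace here.
first_text_node = span.text

# Concatenation of every text node below the element, as with '.' in XPath.
string_value = "".join(span.itertext())

print(repr(first_text_node))
print("Auto" in first_text_node)
print("Auto" in string_value)
```

The same reasoning suggests that the row itself could be matched with //table/tbody/tr[td/span[contains(.,'Auto')]], which is the original row-level XPath with text() swapped for the dot.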