Extracting data from a table with Scrapy

I have this table
<table class="specs-table">
<tbody>
<tr>
<td colspan="2" class="group">Sumary</td>
</tr>
<tr>
<td class="specs-left">Name</td>
<td class="specs-right">ROG GL552JX </td>
</tr>
<tr class="noborder-bottom">
<td class="specs-left">Category</td>
<td class="specs-right">Gaming </td>
</tr>
<tr>
<td colspan="2" class="group">Technical Details</td>
</tr>
<tr>
<td class="specs-left">Name</td>
<td class="specs-right">Asus 555 </td>
</tr>
<tr>
<td class="specs-left">Resolution </td>
<td class="specs-right">1920 x 1080 pixels </td>
</tr>
<tr class="noborder-bottom">
<td class="specs-left"> Processor </td>
<td class="specs-right"> 2.1 GHz </td>
</tr>
</tbody>
</table>
From this table I want my spider to find the first occurrence of the text "Name" and copy the value from the next cell (in this case "ROG GL552JX"), then find the next occurrence of "Name" and copy the value "Asus 555".
The result I need:
'Name': [u'ROG GL552JX'],
'Name': [u'Asus 555'],
The problem is that the table contains two occurrences of the text "Name", and Scrapy copies the values of both into a single field.
My result is:
'Name': [u'ROG GL552JX', u'Asus 555'],
My bot:
def parse(self, response):
    # Follow the "next page" link.
    next_selector = response.xpath('//*[@aria-label="Pagina urmatoare"]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))
    # Follow each product link and parse it with parse_item.
    item_selector = response.xpath('//*[contains(@class, "pb-name")]//@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url), callback=self.parse_item)

def parse_item(self, response):
    l = ItemLoader(item=PcgItem(), response=response)
    l.add_xpath('Name', '//tr/td[contains(text(), "Name")]/following-sibling::td/text()', MapCompose(unicode.strip, unicode.title))
    return l.load_item()
How can I solve this problem?
Thank you

If you need one item per Name, you should do something like this:
for sel in response.xpath('//tr/td[contains(text(), "Name")]/following-sibling::td/text()'):
    l = ItemLoader(...)
    l.add_value('Name', sel.extract())
    ...
    yield l.load_item()
If you want it all inside a single item, I would recommend leaving it as it is (a list): a scrapy.Item behaves like a dictionary, so it cannot hold two Name keys.
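To make the "one item per Name" idea concrete without running a spider, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors, and plain dicts in place of PcgItem; the function name items_per_name and the trimmed sample markup are assumptions made for illustration only. It also matches the label exactly after stripping whitespace, which is stricter than contains(text(), "Name").

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the table from the question.
HTML = """
<table class="specs-table">
  <tr><td colspan="2" class="group">Summary</td></tr>
  <tr><td class="specs-left">Name</td><td class="specs-right">ROG GL552JX </td></tr>
  <tr><td colspan="2" class="group">Technical Details</td></tr>
  <tr><td class="specs-left">Name</td><td class="specs-right">Asus 555 </td></tr>
  <tr><td class="specs-left">Resolution </td><td class="specs-right">1920 x 1080 pixels </td></tr>
</table>
"""


def items_per_name(markup):
    """Return one dict per "Name" row instead of one dict holding a list."""
    items = []
    for row in ET.fromstring(markup).iter("tr"):
        cells = row.findall("td")
        # Label/value rows have two cells; group header rows have one.
        if len(cells) == 2 and (cells[0].text or "").strip() == "Name":
            items.append({"Name": (cells[1].text or "").strip()})
    return items


print(items_per_name(HTML))
```

Each matching row becomes its own record, which is the same shape the loop over the selector produces when it yields one loaded item per iteration.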

Related

Using Playwright how to select next td of a given inner text

This one has me baffled. Basically, using playwright, I'm trying to verify values on a table. Given, "Cat", I should see if "Dog" exists, or if given "Space", I should see if "Rocket" exists.
I tried
const planet = (await page.locator('tr:has(td.col_d:has-text("Saturn")) >> a')).innerText();
but that didn't work. I thought of grabbing the innerText of all the <td> elements, putting it into an array, then finding the index of the initial text (Cat) and checking whether the text at the next index is correct (i.e. Dog). Isn't there an easier way that I don't know of yet?
<tbody>
<tr>
<td class="labelCol"> Title A < /td>
<td class="dataCol col02"><span>
<a href="/00578000000VqXe" title="POS ""</a>
Data A
</td>
<td class="labelCol">Title X</td>
<td class="dataCol">Data X</td>
</tr>
<tr>
<td class="labelCol"> Cat </td>
<td class="dataCol col02">Dog</td>
<td class="labelCol"> Saturn </td>
<td class="dataCol">Jupiter</td>
</tr>
<tr>
<td class="labelCol"> Blue </td>
<td class="dataCol col02">Red</td>
<td class="labelCol"> Reason </td>
<td class="dataCol">Space</td > </tr>
Rocket
</td>
</tr >
</tbody>
I don't particularly like this, but if you don't care whether they are in the very next cell, you could just assert that 'Dog' is to the right of 'Cat' and 'Rocket' is to the right of 'Space', like this:
await expect(page.locator(`td:right-of(:text-is("Cat"))`).first()).toHaveText('Dog');
await expect(page.locator(`td:right-of(:text-is("Space"))`).first()).toHaveText('Rocket');
Or, if Dog needs to immediately follow Cat, you could do something like this:
const cat = page.locator(`td:text-is("Cat")`);
await expect(cat.locator(`//following-sibling::td`).first()).toHaveText('Dog');
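The "immediately following cell" check can also be sketched outside the browser. The snippet below is only an illustration of the sibling logic, written in Python with the standard library's xml.etree.ElementTree rather than Playwright; the helper name next_cell_text and the cleaned-up sample row are assumptions, not part of the original page.

```python
import xml.etree.ElementTree as ET

# One row from the question, with the markup repaired so it parses.
ROW = """
<tr>
  <td class="labelCol"> Cat </td>
  <td class="dataCol col02">Dog</td>
  <td class="labelCol"> Saturn </td>
  <td class="dataCol">Jupiter</td>
</tr>
"""


def next_cell_text(row, label):
    """Return the text of the cell right after the one whose text is `label`."""
    cells = list(row)  # direct children of the row, in document order
    for current, following in zip(cells, cells[1:]):
        if (current.text or "").strip() == label:
            return (following.text or "").strip()
    return None  # label not found, or it sits in the last cell


row = ET.fromstring(ROW)
print(next_cell_text(row, "Cat"))
print(next_cell_text(row, "Saturn"))
```

Pairing each cell with its successor is the same idea as taking the first following-sibling::td of the matched cell.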

XPATH to use preceding and following sibling in a single statement

I would like to scrape the name and address information that lies between the tag containing the "Defendant" text and the next heading tag.
My HTML structure is:
<hr>
<H5>Defendant/Respondent Information</H5>
<span class="InfoChargeStatement">(Each Defendant/Respondent is displayed below)</span>
<table>
<tr>
<td><span class="FirstColumnPrompt">Party Type:</span></td><td><span class="Value">Defendant</span><span class="Prompt">Party No.:</span><span class="Value">1</span></td>
</tr>
</table>
<table>
<tr>
<td><span class="FirstColumnPrompt">Name:</span></td><td><span class="Value">Name 1</span></td>
</tr>
</table>
<table>
<tr>
<td><span class="FirstColumnPrompt">Address:</span></td><td><span class="Value">Addr 1</span></td>
</tr>
<tr>
<td><span class="FirstColumnPrompt">City:</span></td><td><span class="Value">city1</span><span class="Prompt">State:</span><span class="Value">aa</span><span class="Prompt">Zip Code:</span><span class="Value">Zip1</span></td>
</tr>
</table>
<hr>
<table>
<tr>
<td><span class="FirstColumnPrompt">Party Type:</span></td><td><span class="Value">Defendant</span><span class="Prompt">Party No.:</span><span class="Value">2</span></td>
</tr>
</table>
<table>
<tr>
<td><span class="FirstColumnPrompt">Name:</span></td><td><span class="Value">Name 2</span></td>
</tr>
</table>
<table>
<tr>
<td><span class="FirstColumnPrompt">Address:</span></td><td><span class="Value">Addr2</span></td>
</tr>
<tr>
<td><span class="FirstColumnPrompt">City:</span></td><td><span class="Value">City2</span><span class="Prompt">State:</span><span class="Value">st2</span><span class="Prompt">Zip Code:</span><span class="Value">zip2</span></td>
</tr>
</table>
<hr>
<H5>Related Persons Information</H5>
<span class="InfoChargeStatement">(Each Related person is displayed below)</span>
<table>
<tr>
<td><span class="FirstColumnPrompt">Name:</span></td><td><span class="Value">Unwanted Name</span></td>
</tr>
</table>
<table>
<tr>
<td><span class="FirstColumnPrompt">Address:</span></td><td><span class="Value">un addr</span></td>
</tr>
<tr>
<td><span class="FirstColumnPrompt">City:</span></td><td><span class="Value">Unwanted City</span><span class="Prompt">State:</span><span class="Value">Unwanted city</span><span class="Prompt">Zip Code:</span><span class="Value">12345</span></td>
</tr>
</table>
<table></table>
<hr>
My current XPath captures the first occurrence of the name and address properly, but when I need to extract multiple occurrences, it also scrapes the information under the unwanted h5 tags.
My current XPath is:
"//*[contains(text(),'Defendant')]//following-sibling::table//span[text()='Name:' or text()='Business or Organization Name:']/ancestor-or-self::td/following-sibling::td//text()"
I tried combining preceding-sibling and following-sibling, but nothing gives the expected output.
My current output is:
names - [
Name1,
Name2
Unwanted Name,
]
The expected output is:
[
Name1
Name2
]
Kindly help.
Try this:
"//H5[contains(text(),'Defendant')]/following-sibling::table[not(preceding-sibling::H5[not(contains(text(),'Defendant'))])]/tr[td[1][span[text()[.='Name:' ]]]]/td[2]/span/text()"
It first selects the tables that have no preceding-sibling::h5 whose text() does not contain 'Defendant'. Then, within those tables, it selects the tr whose first td meets your requirements and returns the text of the second td.
There is no need for double slashes, which are bad for performance.
EDIT 1
Since there are more preceding-sibling::h5 than the example shows, this XPath will deal with that:
"//H5[contains(text(),'Defendant')]/following-sibling::table[preceding-sibling::H5[1][contains(text(),'Defendant')]]//tr[td[1][span[text()[.='Name:' ]]]]/td[2]/span/text()"
This will only select those tables whose first preceding-sibling::h5 is the h5 we are interested in.
EDIT 2
Actually, the leading h5 selection is now redundant. This XPath will do:
"//table[preceding-sibling::H5[1][contains(text(),'Defendant')]]//tr[td[1][span[text()[.='Name:' ]]]]/td[2]/span/text()"

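The rule behind the final XPath above (keep a table only when its nearest preceding h5 mentions 'Defendant') can be sketched as a single pass over the siblings. This is a hedged illustration using Python's standard xml.etree.ElementTree, which does not support the preceding-sibling axis, so the nearest heading is tracked by hand. The wrapping div, the lowercase tag names, and the function name defendant_names are assumptions for the sketch.

```python
import xml.etree.ElementTree as ET

# Reduced, well-formed version of the page structure from the question.
PAGE = """
<div>
  <h5>Defendant/Respondent Information</h5>
  <table><tr><td><span>Name:</span></td><td><span>Name 1</span></td></tr></table>
  <hr/>
  <table><tr><td><span>Name:</span></td><td><span>Name 2</span></td></tr></table>
  <hr/>
  <h5>Related Persons Information</h5>
  <table><tr><td><span>Name:</span></td><td><span>Unwanted Name</span></td></tr></table>
</div>
"""


def defendant_names(markup):
    """Collect Name values from tables whose nearest preceding h5 mentions Defendant."""
    names = []
    nearest_h5 = ""
    for node in ET.fromstring(markup):  # siblings in document order
        if node.tag == "h5":
            nearest_h5 = node.text or ""
        elif node.tag == "table" and "Defendant" in nearest_h5:
            for row in node.iter("tr"):
                cells = row.findall("td")
                if len(cells) >= 2 and "".join(cells[0].itertext()).strip() == "Name:":
                    names.append("".join(cells[1].itertext()).strip())
    return names


print(defendant_names(PAGE))
```

One thing worth checking in practice: many HTML parsers lowercase tag names, in which case H5 in an XPath would need to be written as h5.
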
iterate with v-for and data-attribute

I have a vuejs-datatable, and now I want to add an options column with edit and delete links.
This is the table body, which iterates over the rows returned by the function get_rows():
<tbody>
<tr v-for="(row, idr) in get_rows()" v-bind:key="idr">
<td>{{row.id}}</td>
<td>{{row.email}}</td>
<td>
<b-icon-pencil-square></b-icon-pencil-square>
<b-icon-trash></b-icon-trash>
</td>
</tr>
</tbody>
The cells with {{row.id}} and {{row.email}} render correctly. However, :data-id="row.id" only ever shows the id of the first entry: the links in every row of my table have the same data-id. I do not understand why this is happening or what I am doing wrong.
Use the code below (note that it does not use data-id):
<tbody>
<tr v-for="(row, idr) in get_rows()" v-bind:key="idr">
<td>{{row.id}}</td>
<td>{{row.email}}</td>
<td>
<b-icon-pencil-square></b-icon-pencil-square>
<b-icon-trash></b-icon-trash>
</td>
</tr>
</tbody>

Can we use Selenium when a table does not have proper HTML, as shown below?

Here is the table I am working with. I want to get the table row that contains a specific element, such as the link whose text contains 'Harvest', and also check that the text 'running' exists in the same row.
<table id="execTable" class="tableHistory jobtable translucent">
<colgroup>
<col class="execid">
<col class="titlecol">
</colgroup>
<tbody>
<tr>
<th>Id</th>
<th>Name</th>
</tr>
</tbody>
<tr id="8571">
<td>8571</td>
<td class="titlecol">
<div id="hitdiv-8571" class="arrow"></div>
Harvest
</td>
<td>09-03-2015 09:45:04</td>
<td>-</td>
<td>2m 6s</td>
<td>running</td>
<td>view/restart</td>
</tr>
<tr id="8571-child" class="childRow" style="display: none;"></tr>
<tr id="8566">
<td>8566</td>
<td class="titlecol">
<div id="hitdiv-8566" class="arrow"></div>
mk
</td>
<td>09-03-2015 03:30:00</td>
<td>09-03-2015 04:16:50</td>
<td>46m 50s</td>
<td>succeeded</td>
<td>view/restart</td>
</tr>
<tr id="8555-child" class="childRow" style="display: none;"></tr>
</table>
I am not able to get the TRs.
WebElement table = driver.findElement(By.id("execTable"));
List<WebElement> trows = table.findElements(By.tagName("tr"));
List<WebElement> all = driver.findElements(By.xpath(".//*[@id='execTable']/*"));
for (WebElement a : all) {
    if (a.getTagName().equalsIgnoreCase("tr")) { ....}
}
I was able to get the above code working. Thank you!
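The approach in that code (take the direct children of the table, keep only the tr elements, then inspect each row's cells) can be sketched as follows. This is an illustration in Python with the standard xml.etree.ElementTree rather than Selenium; the function name matching_row_ids and the shortened sample table are assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the table: data rows sit directly under <table>,
# outside the <tbody> that holds only the header row.
TABLE = """
<table id="execTable">
  <tbody><tr><th>Id</th><th>Name</th></tr></tbody>
  <tr id="8571"><td>8571</td><td class="titlecol">Harvest</td><td>running</td></tr>
  <tr id="8571-child" class="childRow"></tr>
  <tr id="8566"><td>8566</td><td class="titlecol">mk</td><td>succeeded</td></tr>
</table>
"""


def matching_row_ids(markup, name, status):
    """Return ids of rows directly under the table that contain both texts."""
    ids = []
    for child in ET.fromstring(markup):  # direct children only
        if child.tag.lower() != "tr":
            continue  # skips the tbody that wraps the header row
        texts = [(td.text or "").strip() for td in child.findall("td")]
        if name in texts and status in texts:
            ids.append(child.get("id"))
    return ids


print(matching_row_ids(TABLE, "Harvest", "running"))
```

Checking both texts on the same row object is what guarantees that 'Harvest' and 'running' belong to the same tr.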

How to find a certain link under a tr that is underneath several identical trs

I have another problem reaching a particular part of the HTML in order to click a link. I need to click the link in the row that contains "Classify Item #, IFW #QA GM 04012014 1424-1, Supplier One".
The only part of the TD text that I know is "QA GM 04012014 1424". I can get to the table with:
//*[@id='openTasksTable']/tbody
What I'm left with then is an unknown number of TDs that all look the same. I'm not sure how to find the right one from that point.
Code:
<table id="openTasksTable" cellspacing="1" cellpadding="1">
<thead>
<tbody>
<tr class="evenRow tableControlDataRow twTableTR">
<td>
Run Task
</td>
<td>Classify Item #, IFW #QA GM 04012014 0911-1, Supplier One</td>
<td>Apr 24, 2014</td>
</tr>
<tr class="oddRow tableControlDataRow twTableTR">
<td>
Run Task
</td>
<td>Classify Item #, IFW #QA GM 04012014 1012-1, Supplier One</td>
<td>Apr 24, 2014</td>
</tr>
<tr class="evenRow tableControlDataRow twTableTR">
<td>
Run Task
</td>
<td>Classify Item #, IFW #QA GM 04012014 1414-1, Supplier One</td>
<td>Apr 24, 2014</td>
</tr>
<tr class="oddRow tableControlDataRow twTableTR">
<td>
Run Task
</td>
<td>Classify Item #, IFW #QA GM 04012014 1420-1, Supplier One</td>
<td>Apr 24, 2014</td>
</tr>
<tr class="evenRow tableControlDataRow twTableTR">
<td>
Run Task
</td>
<td>Classify Item #, IFW #QA GM 04012014 1422-1, Supplier One</td>
<td>Apr 24, 2014</td>
</tr>
<tr class="oddRow tableControlDataRow twTableTR">
<td>
Run Task
</td>
<td>Classify Item #, IFW #QA GM 04012014 1424-1, Supplier One</td>
<td>Apr 24, 2014</td>
</tr>
Thanks!
Greg
The following xpath should work for you:
//td[contains(text(),'QA GM 04012014 1424')]/..//a
It will find only the <a> that is in the same <tr> as the <td> you mentioned.
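As a rough illustration of what that XPath does (match the cell by partial text, step up to its row, then look for a link inside that row), here is a sketch in Python using only the standard library. The href values and the function name link_for are hypothetical; the page's real link markup is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Two rows modelled on the question; the <a> elements and hrefs are made up.
TABLE = """
<table id="openTasksTable">
  <tbody>
    <tr><td><a href="/task/1">Run Task</a></td>
        <td>Classify Item #, IFW #QA GM 04012014 1422-1, Supplier One</td></tr>
    <tr><td><a href="/task/2">Run Task</a></td>
        <td>Classify Item #, IFW #QA GM 04012014 1424-1, Supplier One</td></tr>
  </tbody>
</table>
"""


def link_for(markup, fragment):
    """Return the first <a> in the row that has a cell containing `fragment`."""
    for row in ET.fromstring(markup).iter("tr"):
        if any(fragment in (td.text or "") for td in row.findall("td")):
            return row.find(".//a")
    return None


link = link_for(TABLE, "QA GM 04012014 1424")
print(link.get("href"))
```

Because the search for the link starts from the matched row, rows with identical "Run Task" links are never confused with each other.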