Why am I not able to scrape just this particular P tag? - scrapy

I am using scrapy shell to check that the selectors for my spider are correct. I can extract every other section I need except one p tag, which contains the cross-reference part numbers. I am scraping from this particular page.
When I try response.css('div.col-1-2-2 > div.rpr-help m-chm > div > p::text').extract() it returns an empty result.
When I try response.css('div > p::text').extract() the results include the section I am looking for, plus a lot of data I do not want.
I have a feeling this has a very easy answer, but I have no idea what I am missing here.
This is a snippet of the HTML from the page I am trying to scrape. The part I want is the last 'p' tag, starting with "Part Number":
<div class="col-1-2-2">
<div id="img-detail" style="text-align:center;">
<div id="img-detail-main">
<a id="ctl00_cphMain_imgenlarge" rel="nofollow" href="/detail-img.aspx?id=3094537&i=" class="cboxElement"><img id="ctl00_cphMain_iMain" src="https://cdn.appliancepartspros.com/images/product/cache/whirlpool-clutch-assembly-285785-ap3094537_01_l.jpg" style="border-width:0px;outline:none;">
<div class="img-overlay" style="display:none;"><img src="/images/play.png" style="height:107px;"></div>
<div id="main-text-overlay" style="display:none;"></div>
</a>
</div>
<div class="img-help">Click image to open expanded view</div>
<div id="img-detail-thumb">
<div class="a-button a-active">
<img id="ctl00_cphMain_rImgTh_ctl01_imgTh" src="https://cdn.appliancepartspros.com/images/product/cache/whirlpool-clutch-assembly-285785-ap3094537_01_tt.jpg" style="border-width:0px;">
</div>
<div class="a-button">
<img id="ctl00_cphMain_rImgTh_ctl02_imgTh" src="https://cdn.appliancepartspros.com/images/product/cache/whirlpool-clutch-assembly-285785-ap3094537_02_tt.jpg" style="border-width:0px;">
</div>
<div class="a-button">
<img id="ctl00_cphMain_rImgTh_ctl03_imgTh" src="https://cdn.appliancepartspros.com/images/product/cache/whirlpool-clutch-assembly-285785-ap3094537_03_tt.jpg" style="border-width:0px;">
</div>
<div class="a-button">
<img id="ctl00_cphMain_rImgTh_ctl04_imgTh" src="https://cdn.appliancepartspros.com/images/product/cache/whirlpool-clutch-assembly-285785-ap3094537_04_tt.jpg" style="border-width:0px;">
</div>
<div class="a-button">
<img id="ctl00_cphMain_rImgTh_ctl05_imgTh" src="https://cdn.appliancepartspros.com/images/product/cache/whirlpool-clutch-assembly-285785-ap3094537_05_tt.jpg" style="border-width:0px;">
</div>
<div class="a-button">
<img id="ctl00_cphMain_rImgTh_ctl06_imgTh" class="diagram" data-dcmt="Clutch assembly AP3094537 is number 5 on this diagram. This is to give you an idea of the appearance and the location of the part. Your appliance model may be slightly different." src="https://483cda5f439700fab03b-6195bc77e724f6265ff507b1dc015ddb.ssl.cf1.rackcdn.com/0029384112_4.gif" style="border-width:0px;">
</div>
<div class="a-button">
<img id="ctl00_cphMain_rImgTh_ctl07_imgTh" class="video" src="https://img.youtube.com/vi/7RS1l6t8efc/hqdefault.jpg" style="border-width:0px;">
<div class="img-overlay"><img src="/images/play.png"></div>
</div>
</div>
</div>
<div class="rpr-help m-chm">
<div class="header">
<h2 class="h6">Repair Help</h2>
</div><!-- /end .header -->
<div class="inner m-bsc">
<ul>
<li>Repair Video</li>
<li>Repair Q&A</li>
</ul>
</div>
<div>
<br>
<span class="h4">Cross Reference Information</span><br>
<p>Part Number 285785 (AP3094537) replaces 2670, 285331, 285380, 285422, 285540, 285761, 285785VP, 3350015, 3350114, 3350115, 3351342, 3351343, 387888, 388948, 388949, 3946794, 3946847, 3951311, 3951312, 62699, 63174, 63765, 64176, AH334641, EA334641, J27-662, LP326, PS334641.
<br>
</p>
</div>
</div>
</div>

Hope this works
response.xpath('//div[@class="col-1-2-2"]//p/text()').extract_first()

You can also try this: response.xpath('(//div[@class="rpr-help m-chm"]//p//text())[1]').get()
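A likely reason the first selector in the question comes back empty: in CSS, a space is the descendant combinator, so `div.rpr-help m-chm` means "an `<m-chm>` element somewhere inside `div.rpr-help`". An element that carries both classes is written `div.rpr-help.m-chm`, so in Scrapy something like `response.css('div.rpr-help.m-chm > div > p::text').get()` should match. The XPath predicate `[@class="rpr-help m-chm"]` instead compares the whole attribute value. Below is a minimal sketch of that predicate using only the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors. The markup is a trimmed, well-formed copy of the snippet (the `<br>` tags are self-closed and the part list is shortened), and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the question's markup (an assumption made so
# that the standard-library XML parser accepts it; real pages need an HTML parser).
SNIPPET = """
<div class="col-1-2-2">
  <div class="rpr-help m-chm">
    <div class="header"><h2 class="h6">Repair Help</h2></div>
    <div>
      <span class="h4">Cross Reference Information</span><br/>
      <p>Part Number 285785 (AP3094537) replaces 2670, 285331, 285380.<br/></p>
    </div>
  </div>
</div>
"""

root = ET.fromstring(SNIPPET)

# [@class='...'] matches the full attribute value, both class names included.
paragraph = root.find(".//div[@class='rpr-help m-chm']/div/p")

# .text is the text that comes before the first child element (the <br/>).
cross_reference = paragraph.text.strip()
print(cross_reference)
```

Because the predicate is an exact string comparison, it stops matching if the site ever reorders or adds class names; the CSS form `div.rpr-help.m-chm` does not have that limitation.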

Related

How to clean up pulled data from BeautifulSoup, Pandas, Python

Hello everyone, I have pulled the information I want using BeautifulSoup, but I can't seem to get it printed out correctly to send to pandas and Excel.
html_f ='''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">
07/01/2022 Date
<span class="caret"> </span>
</p>
</div>
<div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;">
<p class="col-sm-12 col-md-12">
<span class="text-muted">Comment</span><br>
[1] Comments
</p>
</div>
</div>
</li>'''
This is the code I use to pull the data I want:
soup = BeautifulSoup(html_f,'html.parser')
for child in soup.findAll('li',class_='list-group-item')[0]:
    print (child.text)
Here is what it pulls, but it prints with a lot of extra spacing:
07/01/2022 Date
Comment
[1] Comments
Ideally, I only need the top portion of (date and File Date) printed out but at the very least I need help getting it into a list format like:
07/01/2022 Date
Comment
[1] Comments
To get your information printed as expected in your question, you could use stripped_strings and iterate over its elements:
for e in soup.find_all('li',class_='list-group-item'):
    for t in list(e.stripped_strings):
        print(t)
Note: In new code use find_all() instead of old syntax findAll().
Example
html='''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">
07/01/2022 Date
<span class="caret">
</span>
</p>
</div>
<div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;">
<p class="col-sm-12 col-md-12">
<span class="text-muted">
Comment
</span>
<br/>
[1] Comments
</p>
</div>
</div>
</li>
'''
from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
for e in soup.find_all('li',class_='list-group-item'):
    for t in list(e.stripped_strings):
        print(t)
Output
07/01/2022 Date
Comment
[1] Comments
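To make it clearer what stripped_strings is doing, here is a rough standard-library sketch of the same idea: visit every text node, strip the surrounding whitespace, and skip whatever ends up empty. This is not how BeautifulSoup implements it; the `StrippedStrings` class is a made-up helper built on `html.parser`, and unlike the loop above it collects strings for the whole document rather than per `li`.

```python
from html.parser import HTMLParser


class StrippedStrings(HTMLParser):
    """Hypothetical helper: collect non-empty text nodes with whitespace stripped."""

    def __init__(self):
        super().__init__()
        self.strings = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        text = data.strip()
        if text:  # whitespace-only nodes are dropped
            self.strings.append(text)


HTML = '''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">
07/01/2022 Date
<span class="caret"> </span>
</p>
</div>
<div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;">
<p class="col-sm-12 col-md-12">
<span class="text-muted">Comment</span><br>
[1] Comments
</p>
</div>
</div>
</li>'''

collector = StrippedStrings()
collector.feed(HTML)
collector.close()
print(collector.strings)
```

The "tons of spacing" in the question comes from exactly those whitespace-only text nodes and the newlines around the real text, which is why stripping each string individually cleans the output up.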
Since you mention pandas, you could also pick out each piece of information, clean it up, and append it to a list of dicts:
import pandas as pd

data = []
for e in soup.find_all('li',class_='list-group-item'):
    data.append({
        'date': e.p.text.strip().replace(' Date',''),
        'comment': e.select_one('.tyler-toggle-container br').next_sibling.strip()
    })
pd.DataFrame(data)
or
data = [{
    'date':soup.select_one('li.list-group-item .text-primary').text.strip().replace(' Date',''),
    'comment':soup.select_one('li.list-group-item .tyler-toggle-container br').next_sibling.strip()
}]
Output
date        comment
07/01/2022  [1] Comments
Here is my attempt:
doc='''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">
07/01/2022 Date
<span class="caret">
</span>
</p>
</div>
<div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;">
<p class="col-sm-12 col-md-12">
<span class="text-muted">
Comment
</span>
<br/>
[1] Comments
</p>
</div>
</div>
</li>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(doc, 'html.parser')
text=[' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]
print(text)
Output:
['07/01/2022, Comments']
These variants should also work:
text=' '.join([' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]).strip()
#Or
text= [' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]
# text holds one string per li, so split the first entry into its two parts
date, comment = text[0].split(', ')
final_text= date + ',' + comment
final_text= [date, comment]#if you want to make list

Defined Array Variable working fine but child element not working in vuejs

I am facing a weird issue in Vue.js. I am using Firestore to fetch data, using a prop as the id for the single post.
The array that comes back is fine: there are no errors in the console and I can see the data. But accessing its child keys does not work. I would expect that if {{temple}} works, then {{temple.templename}} should work too.
<div class="appCapsule">
{{temple}}
<div class="section mt-2">
<div class="card text-center">
<div class="card-header">
{{temple.templename}}
</div>
<div class="card-body">
<img v-bind:src="'https://awesong.in/jain/storage/temples/' + temple.fileToUpload1" style="width:100%">
<p class="card-text">Temple Type : {{temple.templetype}}</p>
<p class="card-text">Near By City : {{temple.nearbycity}}</p>
<p class="card-text">Built in : {{temple.built}}</p>
<p class="card-text">Address : {{temple.address}}</p>
<p class="card-text">Location : {{temple.location}}</p>
<p class="card-text">District : {{temple.district}}</p>
<p class="card-text">State : {{temple.state}}</p>
<p class="card-text">Phone : {{temple.phone}}</p>
<p class="card-text">Email : {{temple.email}}</p>
<p class="card-text">Website : {{temple.website}}</p>
<p class="card-text">Views : {{temple.clicks}}</p>
</div>
</div>
</div>
</div>
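temple is an array of JSON objects, which is why {{temple}} renders but {{temple.templename}} does not: the array itself has no templename property. You could use temple[0].property (for example temple[0].templename), or you could loop over it:
<div v-for="t in temple" :key="JSON.stringify(t)">
  {{t.templename}} <!-- etc. -->
</div>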

Angular 5 - Bindings cannot contain assignments

<li class="tabRow tabRowLeft" *ngFor="let gene of filteredgene = (seq.genes) | limitTo:filteredgene.length/2+filteredgene.length%2">
<div class="displayFlex" (click)="showGeneRecord(gene.geneName,'ATTRIBUTE','.addPopup.attributeRisk')">
<div class="tabCell">
<div class="cellItem displayFlex">
<h4 class="flex1">{{gene.geneName}}</h4>
</div>
</div>
<div class="tabCell">
<div class="cellItem displayFlex">
<h4 class="flex1">{{gene.geneScore}}</h4>
</div>
</div>
</div>
</li>
I am trying to store the value in filteredgene and then loop over filteredgene with ngFor. I get the error "Bindings cannot contain assignments". Does anyone know how I can resolve this?
I have also created a custom pipe for limitTo.

know number of elements with vba scraping web

Hello everyone, I would like to know the number of elements in markup like the following:
<div id="datatable">
<form id="theForm" name="theForm" >......</form>
<div class="no_data_dd" id="no_data" >....
<div class= ....>.....
<div class= ....>
</div>
</div>
</div>
<div class="score_row score_header">...</div>
<div class="score_row match_line e_true" >..</div>
<div class="score_row padded_date ">..</div>
<div class= ....>.....</div>
</div>
I tried with
Set itemEle = objIE.document.getElementById("scoretable")
Length = itemEle.getElementsByTagName("class").Length
Length comes back as 0, not 5.
Why?
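What if you use one of the plural "getElements..." methods that matches on what you actually want to count, rather than a "getElement..." call? getElementsByTagName("class") looks for <class> tags, and the page has none, which may be why the count is 0.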
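The difference between counting by tag name and counting by class name can be sketched outside VBA. The Python below uses only the standard library's `html.parser`; the `TagAndClassCounter` class and the sample markup are assumptions for illustration (the sample is a simplified version of the snippet in the question, with the elided parts left out).

```python
from collections import Counter
from html.parser import HTMLParser


class TagAndClassCounter(HTMLParser):
    """Hypothetical helper: count start tags by tag name and by class name."""

    def __init__(self):
        super().__init__()
        self.by_tag = Counter()
        self.by_class = Counter()

    def handle_starttag(self, tag, attrs):
        self.by_tag[tag] += 1
        # A class attribute can hold several space-separated class names.
        class_value = dict(attrs).get("class") or ""
        for name in class_value.split():
            self.by_class[name] += 1


SAMPLE = """
<div id="datatable">
  <form id="theForm" name="theForm"></form>
  <div class="no_data_dd" id="no_data"></div>
  <div class="score_row score_header"></div>
  <div class="score_row match_line e_true"></div>
  <div class="score_row padded_date"></div>
</div>
"""

counter = TagAndClassCounter()
counter.feed(SAMPLE)

tags_named_class = counter.by_tag["class"]       # there is no <class> element
score_rows = counter.by_class["score_row"]       # elements carrying that class
print(tags_named_class, score_rows)
```

In the browser DOM the class-based count corresponds to getElementsByClassName rather than getElementsByTagName, assuming the document mode in use supports it.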

Is there a Microformat for the Hours a Business is open?

I was wondering if there was yet a Microformat for a business's hours of operation.
If not, who do I submit a standard to?
After submitting the same question to the Microformats mailing list, I received a reply from someone named Martin Hepp who apparently has come up with a specification for this.
He provided me with the following links:
The GoodRelations vocabulary provides
a standard way for business hours of
operation, see:
http://www.ebusiness-unibw.org/wiki/Rdfa4google
http://www.ebusiness-unibw.org/wiki/GoodRelations_and_Yahoo_SearchMonkey
#
the full spec and other materials are at
http://www.ebusiness-unibw.org/wiki/GoodRelations
This is used e.g. by Bestbuy to expose the opening hours of their 1000k
stores in the US.
Best
Martin
The most widely used markup for opening hours on the Web is GoodRelations.
Here is an example:
<div xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns="http://www.w3.org/1999/xhtml"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:gr="http://purl.org/goodrelations/v1#"
xmlns:vcard="http://www.w3.org/2006/vcard/ns#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
<div about="#store" typeof="gr:LocationOfSalesOrServiceProvisioning">
<div property="rdfs:label" content="Pizzeria La Mamma"></div>
<div rel="vcard:adr">
<div typeof="vcard:Address">
<div property="vcard:country-name" content="Germany"></div>
<div property="vcard:locality" content="Munich"></div>
<div property="vcard:postal-code" content="85577"></div>
<div property="vcard:street-address" content="1234 Main Street"></div>
</div>
</div>
<div property="vcard:tel" content="+33 408 970-6104"></div>
<div rel="foaf:depiction" resource="http://www.pizza-la-mamma.com/image_or_logo.png"></div>
<div rel="vcard:geo">
<div>
<div property="vcard:latitude" content="48.08" datatype="xsd:float"></div>
<div property="vcard:longitude" content="11.64" datatype="xsd:float"></div>
</div>
</div>
<div rel="gr:hasOpeningHoursSpecification">
<div about="#mon_fri" typeof="gr:OpeningHoursSpecification">
<div property="gr:opens" content="08:00:00" datatype="xsd:time"></div>
<div property="gr:closes" content="18:00:00" datatype="xsd:time"></div>
<div rel="gr:hasOpeningHoursDayOfWeek" resource="http://purl.org/goodrelations/v1#Friday"></div>
<div rel="gr:hasOpeningHoursDayOfWeek" resource="http://purl.org/goodrelations/v1#Thursday"></div>
<div rel="gr:hasOpeningHoursDayOfWeek" resource="http://purl.org/goodrelations/v1#Wednesday"></div>
<div rel="gr:hasOpeningHoursDayOfWeek" resource="http://purl.org/goodrelations/v1#Tuesday"></div>
<div rel="gr:hasOpeningHoursDayOfWeek" resource="http://purl.org/goodrelations/v1#Monday"></div>
</div>
</div>
<div rel="gr:hasOpeningHoursSpecification">
<div about="#sat" typeof="gr:OpeningHoursSpecification">
<div property="gr:opens" content="08:30:00" datatype="xsd:time"></div>
<div property="gr:closes" content="14:00:00" datatype="xsd:time"></div>
<div rel="gr:hasOpeningHoursDayOfWeek" resource="http://purl.org/goodrelations/v1#Saturday"></div>
</div>
</div>
<div rel="foaf:page" resource=""></div>
</div>
</div>
Note that the Microformats suggestion from Ton does not really model that this is an opening hour, so a client cannot do a lot with it. GoodRelations markup is supported by many major companies. For example, BestBuy is using GoodRelations on all of their 1000+ store pages for indicating opening hours.
An HTML microformat could look like this:
<ol class="business_hours">
<li class="monday">Maandag <span class="dtstart" title="08:00:00+01">9.00</span> - <span class="dtend" title="17:00:00+01">18.00</span> uur</li>
<li class="tuesday">Dinsdag <span class="dtstart" title="08:00:00+01">9.00</span> - <span class="dtend" title="17:00:00+01">18.00</span> uur</li>
<li class="wednesday">Woensdag <span class="dtstart" title="08:00:00+01">9.00</span> - <span class="dtend" title="17:00:00+01">18.00</span> uur</li>
<li class="thursday">Donderdag <span class="dtstart" title="08:00:00+01">9.00</span> - <span class="dtend" title="17:00:00+01">18.00</span> uur</li>
<li class="friday">Vrijdag <span class="dtstart" title="08:00:00+01">9.00</span> - <span class="dtend" title="17:00:00+01">18.00</span> uur</li>
<li class="saturday">Zaterdag <span class="dtstart" title="08:00:00+01">9.00</span> - <span class="dtend" title="15:00:00+01">16.00</span> uur</li>
<li>Zondag Gesloten</li>
</ol>
Excuse my Dutch :)
My 2 cents.
The Microformats community has updated its wiki with a suggested way of implementing operating hours based on hCalendar.
http://microformats.org/wiki/operating-hours
See https://schema.org/openingHours
Schema.org is an initiative launched on 2 June 2011 by Bing, Google and Yahoo.
An example:
<strong>Opening Hours:</strong>
<time itemprop="openingHours" datetime="Tu,Th 16:00-20:00">
Tuesdays and Thursdays 4-8pm
</time>
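For a sense of how a consumer might read that datetime value, here is a minimal sketch that splits it into day names and opening and closing times. It only handles the simple shape used in the example above (a comma-separated list of two-letter day codes, a space, then start-end); day ranges such as Mo-Fr are not covered. The function name `parse_opening_hours` and the `DAY_NAMES` table are made up for illustration.

```python
from datetime import time

# Two-letter day codes as used in the example value (illustrative lookup table).
DAY_NAMES = {
    "Mo": "Monday", "Tu": "Tuesday", "We": "Wednesday", "Th": "Thursday",
    "Fr": "Friday", "Sa": "Saturday", "Su": "Sunday",
}


def parse_opening_hours(value):
    """Parse a value shaped like 'Tu,Th 16:00-20:00' into (days, opens, closes)."""
    days_part, hours_part = value.split(" ", 1)
    opens_text, closes_text = hours_part.split("-", 1)
    days = [DAY_NAMES[code] for code in days_part.split(",")]
    return days, time.fromisoformat(opens_text), time.fromisoformat(closes_text)


days, opens, closes = parse_opening_hours("Tu,Th 16:00-20:00")
print(days, opens, closes)
```

Keeping the machine-readable value in the datetime attribute and the friendly wording in the element text is what lets a parser like this stay simple while visitors still see "Tuesdays and Thursdays 4-8pm".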
Perhaps http://microformats.org/ may be of use...
If this is still useful, you should submit your proposal to the microformats community through their wiki: microformats.org.
That link describes the full process for proposing a new microformat specification.
Hope that helps.