How to clean up pulled data from BeautifulSoup, Pandas, Python

How to clean up pulled data from BeautifulSoup, Pandas, Python - pandas

Hello everyone I have the information I want pulled using BeautiuflSoup but I can't seem to get it printed out correctly to send to pandas and excel.
html_f ='''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">
07/01/2022 Date
<span class="caret"> </span>
</p>
</div>
<div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;">
<p class="col-sm-12 col-md-12">
<span class="text-muted">Comment</span><br>
[1] Comments
</p>
</div>
</div>
</li>'''
My code used to pull the data I want:
soup = BeautifulSoup(html_f,'html.parser')
for child in soup.findAll('li',class_='list-group-item')[0]:
print (child.text)
Here is the info it pulls But it prints it out weird with tons of spacing
07/01/2022 Date
Comment
[1] Comments
Ideally, I only need the top portion of (date and File Date) printed out but at the very least I need help getting it into a list format like:
07/01/2022 Date
Comment
[1] Comments

To get your information printed as expected in your question, you could use stripped_strings and iterate over its elements:
for e in soup.find_all('li',class_='list-group-item'):
for t in list(e.stripped_strings):
print(t)
Note: In new code use find_all() instead of old syntax findAll().
Example
html='''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">
07/01/2022 Date
<span class="caret">
</span>
</p>
</div>
<div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;">
<p class="col-sm-12 col-md-12">
<span class="text-muted">
Comment
</span>
<br/>
[1] Comments
</p>
</div>
</div>
</li>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
for e in soup.find_all('li',class_='list-group-item'):
for t in list(e.stripped_strings):
print(t)
Output
07/01/2022 Date
Comment
[1] Comments
Not sure cause you are talking about pandas, you also could pick each information, clean it up and append to a list of dicts:
data = []
for e in soup.find_all('li',class_='list-group-item'):
data.append({
'date': e.p.text.strip().replace(' Date',''),
'comment': e.select_one('.tyler-toggle-container br').next_sibling.strip()
})
pd.DataFrame(data)
or
data = [{
'date':soup.select_one('li.list-group-item .text-primary').text.strip().replace(' Date',''),
'comment':soup.select_one('li.list-group-item .tyler-toggle-container br').next_sibling.strip()
}]
Output
date
comment
07/01/2022
[1] Comments

So far so good, it's my trying
doc='''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">
07/01/2022 Date
<span class="caret">
</span>
</p>
</div>
<div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;">
<p class="col-sm-12 col-md-12">
<span class="text-muted">
Comment
</span>
<br/>
[1] Comments
</p>
</div>
</div>
</li>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(doc, 'html.parser')
text=[' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]
print(text)
Output:
['07/01/2022, Comments']
Try this ways,must work
text=' '.join([' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]).strip()
#Or
text= [' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]
final_text= text[1]+ ',' +text[2]
final_text= text[1]+text[2].split()#if you want to make list

Related

Why am I not able to scrape just this particular P tag?

I am using scrapy shell just to make sure my selectors for my spider are correct. I am able to get all other sections I need except this one p tag that contains the cross ref part numbers. I am scraping from this particular page here
When I try response.css('div.col-1-2-2' > div.rpr-help m-chm > div > p::text').extract() it returns blank
When I try response.css('div > p::text').extract() the results have the section I am looking for plus a bunch of data I do not want.
I have a feeling this is going to be a super easy answer, but I have no idea what I am missing here
This is a snippet of the html section of the page I am trying to scrape, the last 'p' tag starting with Part Number
<div class="col-1-2-2">
<div id="img-detail" style="text-align:center;">
<div id="img-detail-main">
<a id="ctl00_cphMain_imgenlarge" rel="nofollow" href="/detail-img.aspx?id=3094537&i=" class="cboxElement"><img id="ctl00_cphMain_iMain" src="https://cdn.appliancepartspros.com/images/product/cache/whirlpool-clutch-assembly-285785-ap3094537_01_l.jpg" style="border-width:0px;outline:none;">
<div class="img-overlay" style="display:none;"><img src="/images/play.png" style="height:107px;"></div>
<div id="main-text-overlay" style="display:none;"></div>
</a>
</div>
<div class="img-help">Click image to open expanded view</div>
<div id="img-detail-thumb">
<div class="a-button a-active">
<img id="ctl00_cphMain_rImgTh_ctl01_imgTh" src="https://cdn.appliancepartspros.com/images/product/cache/whirlpool-clutch-assembly-285785-ap3094537_01_tt.jpg" style="border-width:0px;">
</div>
<div class="a-button">
<img id="ctl00_cphMain_rImgTh_ctl02_imgTh" src="https://cdn.appliancepartspros.com/images/product/cache/whirlpool-clutch-assembly-285785-ap3094537_02_tt.jpg" style="border-width:0px;">
</div>
<div class="a-button">
<img id="ctl00_cphMain_rImgTh_ctl03_imgTh" src="https://cdn.appliancepartspros.com/images/product/cache/whirlpool-clutch-assembly-285785-ap3094537_03_tt.jpg" style="border-width:0px;">
</div>
<div class="a-button">
<img id="ctl00_cphMain_rImgTh_ctl04_imgTh" src="https://cdn.appliancepartspros.com/images/product/cache/whirlpool-clutch-assembly-285785-ap3094537_04_tt.jpg" style="border-width:0px;">
</div>
<div class="a-button">
<img id="ctl00_cphMain_rImgTh_ctl05_imgTh" src="https://cdn.appliancepartspros.com/images/product/cache/whirlpool-clutch-assembly-285785-ap3094537_05_tt.jpg" style="border-width:0px;">
</div>
<div class="a-button">
<img id="ctl00_cphMain_rImgTh_ctl06_imgTh" class="diagram" data-dcmt="Clutch assembly AP3094537 is number 5 on this diagram. This is to give you an idea of the appearance and the location of the part. Your appliance model may be slightly different." src="https://483cda5f439700fab03b-6195bc77e724f6265ff507b1dc015ddb.ssl.cf1.rackcdn.com/0029384112_4.gif" style="border-width:0px;">
</div>
<div class="a-button">
<img id="ctl00_cphMain_rImgTh_ctl07_imgTh" class="video" src="https://img.youtube.com/vi/7RS1l6t8efc/hqdefault.jpg" style="border-width:0px;">
<div class="img-overlay"><img src="/images/play.png"></div>
</div>
</div>
</div>
<div class="rpr-help m-chm">
<div class="header">
<h2 class="h6">Repair Help</h2>
</div><!-- /end .header -->
<div class="inner m-bsc">
<ul>
<li>Repair Video</li>
<li>Repair Q&A</li>
</ul>
</div>
<div>
<br>
<span class="h4">Cross Reference Information</span><br>
<p>Part Number 285785 (AP3094537) replaces 2670, 285331, 285380, 285422, 285540, 285761, 285785VP, 3350015, 3350114, 3350115, 3351342, 3351343, 387888, 388948, 388949, 3946794, 3946847, 3951311, 3951312, 62699, 63174, 63765, 64176, AH334641, EA334641, J27-662, LP326, PS334641.
<br>
</p>
</div>
</div>
</div>

Hope this works
response.xpath('//div[#class="col-1-2-2"]//p/text()').extract_first()

You can try this also, response.xpath('(//div[#class="rpr-help m-chm"]//p//text())[1]').get()

Angular 5 - Bindings cannot contain assignments

<li class="tabRow tabRowLeft" *ngFor="let gene of filteredgene = (seq.genes) | limitTo:filteredgene.length/2+filteredgene.length%2">
<div class="displayFlex" (click)="showGeneRecord(gene.geneName,'ATTRIBUTE','.addPopup.attributeRisk')">
<div class="tabCell">
<div class="cellItem displayFlex">
<h4 class="flex1">{{gene.geneName}}</h4>
</div>
</div>
<div class="tabCell">
<div class="cellItem displayFlex">
<h4 class="flex1">{{gene.geneScore}}</h4>
</div>
</div>
</div>
</li>
I am trying to get the value in filteredgene and then for loop on filteredgene . I am getting error Bindings cannot contain assignments. Any one knows, what should I do resolve and get this thing done.
For limit to I have created a pipe too.

Clarification on relative xpath

I want an xpath which represents the element 2 w.r.t element 1, as shown in the picture below. Can I build it using following-sibling?
Any help appreciated.
<div id="leads_data" class="uk-width-large-8-12 uk-container-center">
<div class="md-card-list leads_cards_list">
<ul class="">
<li class="item-shown" style="margin: 40px -20px; min-height: 90px;">
<div class="md-card-list-item-menu" data-uk-dropdown="{mode:'click',pos:'bottom-right'}">
<span>12:35:22 PM</span>
<a class="md-btn md-btn-flat md-btn-wave waves-effect waves-button" onclick="showModal(446)" data-uk-modal="{ center:true }" href="#lead_info">Edit</a>
</div>
<div class="md-card-list-item-sender">
<span>
<i class="material-icons">face</i>
CDYCUQYGNKYNFNFCWOUO
</span>
</div>
<div class="md-card-list-item-sender">
<span>
<i class="material-icons">call</i>
+91-1119771441
</span>
</div>
<div class="md-card-list-item-subject">
<span>
<i class="material-icons">email</i>
rnkntsoorh#jd.com
</span>
</div>
<div class="md-card md-card-list-item-content-wrapper uk-margin-top" onclick="showModal(446)" style="display: block; opacity: 1;">
<div class="md-card-content leads_card_content">
<div class="uk-text-large uk-margin-remove">
<b>Remarks:</b>
Some remarks of TSYNTORSAC
</div>
</div>
</div>
<br/>
</li>
I want to start with div having the text "CDYCUQYGNKYNFNFCWOUO" and locate the div having the text "Some remarks of TSYNTORSAC".

You can use below xPath :-
//span[contains(.,'CDYCUQYGNKYNFNFCWOUO')]/parent::div/following::div[contains(.,'Some remarks of TSYNTORSAC')][last()]
or
//span[contains(.,'CDYCUQYGNKYNFNFCWOUO')]/parent::div/following::div[contains(.,'Some remarks of TSYNTORSAC') and contains(#class, 'uk-text-large uk-margin-remove')]
or
//span[contains(.,'CDYCUQYGNKYNFNFCWOUO')]/parent::div/following-sibling::div[contains(.,'Some remarks of TSYNTORSAC')]/descendant::div[contains(#class, 'uk-text-large uk-margin-remove')]
or
//span[contains(.,'CDYCUQYGNKYNFNFCWOUO')]/parent::div/following-sibling::div[contains(.,'Some remarks of TSYNTORSAC')]/div/div
Hope it will help you..:)

In addition to the ways described by Saurabh gaur you can also use following-sibling to locate the desired element.
.//span[text()='CDYCUQYGNKYNFNFCWOUO']/../following-sibling::div[position()=3]//div[contains(.,'Some remarks of TSYNTORSAC')]
or
.//span[text()='CDYCUQYGNKYNFNFCWOUO']/../following-sibling::div[position()=3]/div/div[contains(text(),'Some remarks of TSYNTORSAC']
Hope this helps

You could simply get the element containing the desired text instead of using following:
//li[div[normalize-space(.)='face CDYCUQYGNKYNFNFCWOUO']]/div[last()]

try this :-
//div[contains(#class,'md-card-list-item-sende') and contains(.,'CDYCUQYGNKYNFNFCWOUO')]/following::div[contains(#class,'md-card-list-item-content') and descendant::b[contains(.,'Remarks')]]

Autocomplete with Selenium

I'm having some (actually, a lot) of trouble automating selection from autocomplete options with Selenium. Currently, I am able to automate the inputting of the text, though, I am not able to select anything from the appearing drop-down suggestions list that pops up. I tried searching here for some answers to my problem, but nothing has worked. Below is the element that appears with the suggestions that I am trying to select:
<div class="cs-autocomplete-popup">
<div class="inner">
<div class="cs-autocomplete-Matches csc-autocomplete-Matches">
<ul>
<li class="cs-autocomplete-matchItem csc-autocomplete-matchItem">
<span class="csc-autocomplete-matchItem-content cs-autocomplete-matchItem-content" id="matchItem::matchItemContent">john doe</span>
</li>
</ul>
</div>
<div class="csc-autocomplete-addToPanel cs-autocomplete-addToPanel">
<hr>
<div class="content csc-autocomplete-addTermTo cs-autocomplete-addTermTo">Add "John Doe" to:</div>
<ul>
<li class="cs-autocomplete-authorityItem csc-autocomplete-authorityItem" id="authorityItem:">Local Persons</li>
</ul>
</div>
</div>
<div class="cs-autocomplete-popup-miniView csc-autocomplete-popup-miniView" style="top: 2px; left: 149px; display: none;"><div class="cs-miniView">
john doe
<div>
<span class="csc-autocomplete-popup-miniView-field1Label cs-autocomplete-popup-miniView-field1Label">b.</span>
<span class="csc-autocomplete-popup-miniView-field1 cs-autocomplete-popup-miniView-field1" id="field1"></span>
</div>
<div>
<span class="csc-autocomplete-popup-miniView-field2Label cs-autocomplete-popup-miniView-field2Label">d.</span>
<span class="csc-autocomplete-popup-miniView-field2 cs-autocomplete-popup-miniView-field2" id="field2"></span>
</div>
<div>
<span class="csc-autocomplete-popup-miniView-field3 cs-autocomplete-popup-miniView-field3" id="field3"></span>
</div>
<div>
</div>
</div></div>
</div>
From this, I am trying to select "john doe". Does anyone know an efficient/complete way of doing this? I would very much appreciate the help.

driver.findElement(By.xpath("get the exact field address of autocomplete textbox");
Thread.sleep(5000);
//for xpath: need to take the common xpath from the list of the elements.
List<WebElement> link=driver.findElements(By.xpath("//span[contains(#class, 'cs-autocomplete-matchItem-content') and .='john doe']");
for(int i=0; i<=link.size(); i++)
{
if(link.get(i).getText().equalsIgnoreCase("john doe");
{
link.get(i).click();
}
}

extracting currency values using scrapy/xpath

Trying to get currency values from some html using scrapy.Code is
links = hxs.select('//a[#class="product-image"]/div[#class="price-box"]//span[#class="price"]/text()').extract()')
And the HTML
<div>
<span>
<sub>
<li class="item first">
<a href="http://www.xtra-vision.ie/dvd-blu-ray/to-rent/new-release/dvd/pitch-perfect-dvd.html" title="Image for Pitch Perfect" class="product-image">
<span class="exclusive-star">
</span>
<img src="http://www.xtra-vision.ie/media/catalog/product/cache/3/small_image/124x173/5b02ab93946615b958c913185aae2414/i/w/iws_5167c10c906b57.33524324.JPG.jpg" alt="Image for Pitch Perfect" />
<h2 class="product-name">Pitch Perfect</h2>
<div class="price-box">
<span class="regular-price" id="product-price-5174">
<span class="price">
€15
<sub class="price-bit">.99</sub>
</span>
</span>
</div>
</a>
</li>
</sub>
</span>
</div>
The resulting price i get is \u20ac15\t\t\t\t\t\t
Is there some way I can extract 15.99 from this html using xpath

I used a combination of xpath and Python so might not be quite what you were after, although this was mainly employed to get rid of the extraneous tabs added to the end of the "price".
price = hxs.select('//span[#class="price"]/text()').extract()
pricebit = hxs.select('//span[#class="price"]/sub[#class="price-bit"]/text()').extract()
totalprice = price + price-bit
totalstr = ''.join(totalprice).replace('\t','')

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How to clean up pulled data from BeautifulSoup, Pandas, Python - pandas

Related

Why am I not able to scrape just this particular P tag?

Angular 5 - Bindings cannot contain assignments

Clarification on relative xpath

Autocomplete with Selenium

extracting currency values using scrapy/xpath

Categories

Resources