Data scraping by selenium p tag

I searched a lot online but couldn't find an example similar to the one below. I'm trying to pull text from a web page. The first p tag has no location line; the second p tag does. When pulling data, I can only get the contents of the p tag that has the location row, not the contents of the other p tag. How can I pull the data inside both the first and the second p tag?
HTML from the page source:
<div class=" col-md-8">
<p>
<i class='fa fa-home main-color'></i> ORHAN MAH.İBRAHİM CAD. NO:35
<br>
<i class='fa fa-phone main-color'></i>
<a class="gri" href="tel:0508-2920344">0508-2920344 </a>
<br />
<i class='fa fa-clock-o main-color'></i>
<span class="red">19.01.2022</span>
</p>
<p>
<i class='fa fa-home main-color'></i> HAZAN MAH.ÖKTEM CAD. NO:13/B
<br>
<i class='fa fa-phone main-color'></i>
<a class="gri" href="tel:0584 837 23 70">0584 837 23 70 </a>
<br>
<i class="fa fa-map-marker main-color"></i>
<a class="gri" href="https://www.google.com/maps?q=35.554433,25.887766" target="_blank">Haritada</a>
<br />
<i class='fa fa-clock-o main-color'></i>
<span class="red">20.01.2022</span>
</p>
</div>
Here is the selenium code I used to pull the data from the HTML source above:
item = browser.find_elements_by_class_name("col-md-10")
urls = browser.find_elements_by_xpath("//div[@class=' col-md-10']/p/a[2]")
for i in zip(item, urls):
    try:
        address = i[0].find_element_by_css_selector("p").text.split("\n")[:2]
    except:
        address = None
    try:
        phone = i[0].find_element_by_xpath("//a[@class='gri'][1]").text
    except:
        phone = None
    print(address)
    print(phone)
    try:
        url = i[1].get_attribute('href').replace("https://www.google.com/maps?q=", "")
    except:
        url = None
    try:
        date = i[0].find_element_by_xpath("//span[@class='red'][1]").text
    except:
        date = None
    print(url)
    print(date)

Use the XPath //div[@class=' col-md-8']/p. It returns both p tags.
You can then loop over the matched elements and apply whatever string operations you need to the text of each p tag.

The first p tag block has no location section; the second p tag block does. For the first block, I want to print None in place of the location. When I try to pull the data with zip_longest, the location is not pulled correctly.
#1.p tag block
ORHAN MAH.İBRAHİM CAD. NO:35
0508-2920344
19.01.2022
#2.p tag block
HAZAN MAH.ÖKTEM CAD. NO:13/B
0584 837 23 70
Haritada
20.01.2022
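One way to get None for a missing location is to build one record per p tag instead of zipping separate lists. The sketch below is a minimal illustration, not a tested scraper: `parse_block` and `MAPS_PREFIX` are hypothetical names, and it assumes the visible text of each p tag comes back as the lines shown above (address first, phone second, date last).

```python
MAPS_PREFIX = "https://www.google.com/maps?q="


def parse_block(text, hrefs):
    """Build one record from a <p> block's visible text and its link targets."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    # The map link is optional, so fall back to None when no href matches.
    location = next(
        (h[len(MAPS_PREFIX):] for h in hrefs if h and h.startswith(MAPS_PREFIX)),
        None,
    )
    return {
        "address": lines[0] if lines else None,
        "phone": lines[1] if len(lines) > 1 else None,
        "location": location,
        "date": lines[-1] if len(lines) > 2 else None,
    }


# Hypothetical wiring with Selenium 4 (not run here; `browser` is your driver):
#   from selenium.webdriver.common.by import By
#   for p in browser.find_elements(By.XPATH, "//div[@class=' col-md-8']/p"):
#       hrefs = [a.get_attribute("href") for a in p.find_elements(By.XPATH, "./a")]
#       print(parse_block(p.text, hrefs))

# The two blocks from the question, fed in as plain strings.
first = parse_block(
    "ORHAN MAH.İBRAHİM CAD. NO:35\n0508-2920344\n19.01.2022",
    ["tel:0508-2920344"],
)
second = parse_block(
    "HAZAN MAH.ÖKTEM CAD. NO:13/B\n0584 837 23 70\nHaritada\n20.01.2022",
    ["tel:0584 837 23 70", "https://www.google.com/maps?q=35.554433,25.887766"],
)
print(first)
print(second)
```

Because the location is looked up inside each p tag, a block without a map link cannot shift the values of the next block, which is the failure mode of zipping two lists of different lengths.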

Related

Selenium Automation - Need to combine 1 or more xpath locators to form a single locator

The scenario: I need to assert that the status of a job name has changed to Completed. The problem is that on the UI page the job status HTML element looks the same for every job name.
Below is the sample HTML code:
<div class="flex-primary">
<i class="gray hwx-test test-status fa provider-logo hwx-test na na-testing " title="Completed"></i>
<span class="hwx-title" title="job1">job1</span>
</div>
<div class="flex-primary">
<i class="gray hwx-test test-status fa provider-logo hwx-test na na-testing " title="Completed"></i>
<span class="hwx-title" title="job2">job2</span>
</div>
<div class="flex-primary">
<i class="gray hwx-test test-status fa provider-logo hwx-test na na-testing " title="Completed"></i>
<span class="hwx-title" title="job3">job3</span>
</div>
I want a locator that uniquely points to a job title in the Completed state,
i.e. an XPath or any other locator that combines the results of the two XPaths below into a single result:
//span[@title='job1'] and //i[@title='Completed']

The XPath below returns the div that contains the job name job1 with the title Completed:
//span[@title='job1']/preceding-sibling::i[@title='Completed']/parent::div
The XPath below points to the i tag of job1 when that job is Completed:
//span[@title='job1']/parent::div/i[@title='Completed']
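The same idea can be tried offline. The sketch below uses Python's standard-library ElementTree as a stand-in for the browser, and an equivalent expression that anchors on the div (div[span='job1']/i[@title='Completed']), because ElementTree does not support the preceding-sibling axis. The helper name `is_completed` is hypothetical, and job2 is set to "Running" here purely as an assumption so the negative case is visible.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample markup; job2's status is changed to "Running"
# as an assumption, to show a job that should NOT match.
SAMPLE = """<root>
<div class="flex-primary">
  <i class="test-status" title="Completed"></i>
  <span class="hwx-title" title="job1">job1</span>
</div>
<div class="flex-primary">
  <i class="test-status" title="Running"></i>
  <span class="hwx-title" title="job2">job2</span>
</div>
<div class="flex-primary">
  <i class="test-status" title="Completed"></i>
  <span class="hwx-title" title="job3">job3</span>
</div>
</root>"""


def is_completed(root, job_name):
    """Return True if the div holding this job name also holds a Completed icon."""
    # Select the div whose <span> text is the job name, then its Completed <i>.
    xpath = f".//div[span='{job_name}']/i[@title='Completed']"
    return root.find(xpath) is not None


root = ET.fromstring(SAMPLE)
for name in ("job1", "job2", "job3"):
    print(name, is_completed(root, name))
```

In Selenium the two XPaths given above can be used directly; the point of the sketch is only that the job name and the status are tied together through their shared parent div.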

In scrapy css selectors how do i get a strings ' ' instead of a sub-string [ ]

I can't figure out how to get a string out of a selector
I've tried
response.css('.size_list a::text').extract()
I get
['L', '1X', '2X', '3X', '4X', '5X']
Here is the code
<span class="size_list">
<a href="javascript:void(0)" class="itemAttr current" title="L" data-value="L">L</a>
<a href="javascript:void(0)" class="itemAttr" title="1X" data-value="1X">1X</a>
<a href="javascript:void(0)" class="itemAttr" title="2X" data-value="2X">2X</a>
<a href="javascript:void(0)" class="itemAttr" title="3X" data-value="3X">3X</a>
<a href="javascript:void(0)" class="itemAttr" title="4X" data-value="4X">4X</a>
<a href="javascript:void(0)" class="itemAttr" title="5X" data-value="5X">5X</a>
</span>
What I want is "'L', '1X', '2X', '3X', '4X', '5X'"

This is not something for the extraction code to do, this is something you should do with regular Python code once you have the extracted data:
>>> extracted_data = ['L', '1X', '2X', '3X', '4X', '5X']
>>> ', '.join("'%s'" % value for value in extracted_data)
"'L', '1X', '2X', '3X', '4X', '5X'"

Not sure if it's possible to do it directly in the selector. An alternative could be to get it first as a list and to transform it into a string with something like this:
size_list = response.css('.size_list a::text').extract()
string_size_list = ', '.join(size_list)

To get only the first matching element:
response.css('.size_list a::text').extract_first()
# or
response.css('.size_list a::text').get()
To get all of the values as one string, this should work:
item_list = response.css('.size_list a::text').extract()
one_string = ', '.join(item_list)  # joins the list into one comma-separated string

How to select specific text to scrape

I'm trying to scrape the following HTML. I want to get just the Some Header part and not the additional info.
<li class="media">
<div class="media-body">
<h4> Some Header <span class="label label-info"> additional Info </span> </h4> Address info
<br>
</div> </li>
I'm trying the following:
val li: Elements = ul.select("li")
val list: Elements = li.select("a")
val headers: Elements = list.select("h4")
and then when I try to get the inner text via headers.text(), I get both Some Header and additional Info.
How can I only scrape the Some Header part?

You are close to the solution. You are probably looking for ownText():
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element
String s = "<li class=\"media\"> \n" +
" <div class=\"media-body\"> \n" +
" <h4> Some Header <span class=\"label label-info\"> additional Info </span> </h4> Address info\n" +
" <br> \n" +
" </div> </li>";
Document document = Jsoup.parse(s);
// The sample markup has no <a> inside the <li>, so select the <h4> directly.
Element header = document.select("li.media h4").first();
// ownText() returns only the text owned by <h4>, excluding the child <span>.
System.out.println(header.ownText());
Output:
Some Header

Scrapy CSV export shows the same data in all rows

I'm trying to scrape the following html code:
<ul class="results-list" id="search-results">
<li>
<h3 class="name">First John</h3>
<div class="details">
email
<span class="phone">999999999</span>
</div>
</li>
<li>
<h3 class="name">Second John</h3>
<div class="details">
email
<span class="phone">999999999</span>
</div>
</li>
</ul>
When I run my spider, I get 2 rows, containing the same information. I have name,email,phone columns and for example in the name column for both I would get:
First John,Second John.
My Scrapy code is the following:
people = response.xpath('//ul[@class="results-list"]/li')
for person in people:
    item = SpiderItem()
    item['Name'] = person.xpath(
        '//h3/text()').extract()
    item['Email'] = person.xpath(
        '//div[@class="details"]/a/@href').extract()
    item['Phone'] = person.xpath(
        '//div[@class="details"]/span[@class="phone"]/text()').extract()
    yield item
However when I run scrapy crawl MySpider -o output.csv I get the same information in all rows.

You are using absolute paths in your XPath expressions. A path that starts with // searches the whole document, even when it is called on person. Start each path with .// so it is evaluated relative to the current li:
for person in people:
    item = SpiderItem()
    item['Name'] = person.xpath(
        './/h3/text()').extract_first()
    item['Email'] = person.xpath(
        './/div[@class="details"]/a/@href').extract_first()
    item['Phone'] = person.xpath(
        './/div[@class="details"]/span[@class="phone"]/text()').extract_first()
    yield item
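The difference between a document-wide and a scoped search can be reproduced without Scrapy. The sketch below is only an illustration using the standard-library ElementTree as a stand-in for Scrapy selectors: `root.findall(".//h3")` plays the role of '//h3' (every h3 in the document), while `li.findtext(".//h3")` plays the role of './/h3' called on one li. The markup is a trimmed copy of the sample, and the email link is left out because its href is not shown in the question.

```python
import xml.etree.ElementTree as ET

HTML = """<ul class="results-list" id="search-results">
<li>
  <h3 class="name">First John</h3>
  <div class="details"><span class="phone">999999999</span></div>
</li>
<li>
  <h3 class="name">Second John</h3>
  <div class="details"><span class="phone">999999999</span></div>
</li>
</ul>"""

root = ET.fromstring(HTML)
people = root.findall("./li")

# Document-wide search: returns every name, no matter which <li> you are on.
all_names = [h3.text for h3 in root.findall(".//h3")]

# Scoped search: each lookup starts at the current <li>, so rows stay distinct.
rows = [
    {
        "Name": li.findtext(".//h3"),
        "Phone": li.findtext(".//div[@class='details']/span[@class='phone']"),
    }
    for li in people
]

print(all_names)
for row in rows:
    print(row)
```

The first list is what ends up repeated in every CSV row when the path is absolute; the scoped version yields one name per row.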

How to enable the Google Trusted Stores without having Platinum Level in Bigcommerce

I would like to enable the Google Trusted Stores code without having to subscribe to the Platinum Level (I'm on a Gold Level plan). I have successfully set up automated daily Shipping and Cancellation Feeds through ShipWorks. I believe I set up the "Badge" code correctly on the footer.html:
<!-- BEGIN: Google Trusted Stores -->
<script type="text/javascript">
var gts = gts || [];
gts.push(["id", "######"]);
gts.push(["badge_position", "BOTTOM_RIGHT"]);
gts.push(["locale", "en_AU"]);
gts.push(["google_base_offer_id", "%%GLOBAL_ProductId%%"]);
gts.push(["google_base_subaccount_id", "8669332"]);
gts.push(["google_base_country", "AU"]);
gts.push(["google_base_language", "en_AU"]);
(function() {
var gts = document.createElement("script");
gts.type = "text/javascript";
gts.async = true;
gts.src = "https://www.googlecommerce.com/trustedstores/api/js";
var s = document.getElementsByTagName("script")[0];
s.parentNode.insertBefore(gts, s);
})();
</script>
<!-- END: Google Trusted Stores -->
I have to put the Order Confirmation Module Code on the website. The issue is figuring out the Est. Ship Date and Est. Delivery Date and putting in a "loop" to get the requested data for each item in the order. I have placed the following code on the order.html page:
<!-- start order and merchant information -->
<span id="gts-o-id">%%GLOBAL_OrderId%%</span>
<span id="gts-o-domain">www.****.com.au</span>
<span id="gts-o-email">%%GLOBAL_CurrentCustomerEmail%%</span>
<span id="gts-o-country">%%GLOBAL_ShipCountry%%</span>
<span id="gts-o-currency">%%GLOBAL_CurrencyName%%</span>
<span id="gts-o-total">%%GLOBAL_OrderTotal%%</span>
<span id="gts-o-discounts">%%GLOBAL_CouponDiscount%%</span>
<span id="gts-o-shipping-total">%%GLOBAL_ShippingPrice%%</span>
<span id="gts-o-tax-total">%%GLOBAL_TaxCost%%</span>
<span id="gts-o-est-ship-date">ORDER_EST_SHIP_DATE</span>
<span id="gts-o-est-delivery-date">ORDER_EST_DELIVERY_DATE</span>
<span id="gts-o-has-preorder">N</span>
<span id="gts-o-has-digital">N</span>
<!-- end order and merchant information -->
<!-- start repeated item specific information -->
<!-- item example: this area repeated for each item in the order -->
<span class="gts-item">
<span class="gts-i-name">%%GLOBAL_ProductName%%</span>
<span class="gts-i-price">%%GLOBAL_ProductPrice%%</span>
<span class="gts-i-quantity">%%GLOBAL_ProductQuantity%%</span>
<span class="gts-i-prodsearch-id">%%GLOBAL_ProductId%%</span>
<span class="gts-i-prodsearch-store-id">######</span>
<span class="gts-i-prodsearch-country">AU</span>
<span class="gts-i-prodsearch-language">en_AU</span>
</span>
<!-- end item 1 example -->
<!-- end repeated item specific information -->
</div>
<!-- END Google Trusted Stores Order -->

I tried the badge code and was successfully approved. As for the conversion module, I had to do a JavaScript "hack" for the estimated ship date and estimated delivery date:
<!-- Include the conversion tracking code for all analytics packages -->
<!-- START Google Trusted Stores Order -->
<div id="gts-order" style="display:none;" translate="no">
<!-- start order and merchant information -->
<span id="gts-o-id">%%ORDER_ID%%</span>
<span id="gts-o-domain">www.doubletakeshapewear.com</span>
<span id="gts-o-email">%%ORDER_EMAIL%%</span>
<span id="gts-o-country">%%GLOBAL_ShipCountry%%</span>
<span id="gts-o-currency">%%GLOBAL_CurrencyName%%</span>
<span id="gts-o-total">%%ORDER_AMOUNT%%</span>
<span id="gts-o-discounts">%%GLOBAL_CouponDiscount%%</span>
<span id="gts-o-shipping-total">%%GLOBAL_ShippingPrice%%</span>
<span id="gts-o-tax-total">%%GLOBAL_TaxCost%%</span>
<span id="gts-o-est-ship-date"></span>
<script>
// Estimated ship date: today + 3 days, formatted as YYYY-MM-DD.
var shipDate = new Date();
shipDate.setDate(shipDate.getDate() + 3);
var shipMonth = shipDate.getMonth() + 1, shipDay = shipDate.getDate();
// Zero-pad both month and day so the result is always YYYY-MM-DD.
var fecha = shipDate.getFullYear() + '-' + (shipMonth < 10 ? '0' : '') + shipMonth + '-' + (shipDay < 10 ? '0' : '') + shipDay;
document.getElementById("gts-o-est-ship-date").innerHTML = fecha;
</script>
<span id="gts-o-est-delivery-date"></span>
<script>
// Format a Date as YYYY-MM-DD, zero-padding month and day.
function gtsFormatDate(d) {
  var m = d.getMonth() + 1, day = d.getDate();
  return d.getFullYear() + '-' + (m < 10 ? '0' + m : m) + '-' + (day < 10 ? '0' + day : day);
}
var deliveryDate = new Date();
var country = document.getElementById("gts-o-country").innerHTML;
// Offsets kept from the original post: 4 days when the country is not 'US', 20 days otherwise.
deliveryDate.setDate(deliveryDate.getDate() + (country != 'US' ? 4 : 20));
document.getElementById("gts-o-est-delivery-date").innerHTML = gtsFormatDate(deliveryDate);
</script>
<span id="gts-o-has-preorder">N</span>
<span id="gts-o-has-digital">N</span>
<!-- end order and merchant information -->
<!-- start repeated item specific information -->
<!-- item example: this area repeated for each item in the order -->
<span class="gts-item">
<span class="gts-i-name">%%GLOBAL_ProductName%%</span>
<span class="gts-i-price">%%GLOBAL_ProductPrice%%</span>
<span class="gts-i-quantity">%%GLOBAL_ProductQuantity%%</span>
<span class="gts-i-prodsearch-id">%%GLOBAL_ProductId%%</span>
<span class="gts-i-prodsearch-store-id">483911</span>
<span class="gts-i-prodsearch-country">US</span>
<span class="gts-i-prodsearch-language">en_US</span>
</span>
<!-- end item 1 example -->
<!-- end repeated item specific information -->
</div>
<!-- END Google Trusted Stores Order -->
I put this in order.html. I'm checking some other options, because your code is incomplete for the delivery date and ship date. When I spoke with some Bigcommerce people, they said that these variables are populated from customer information that is provided when you are on Platinum (which means that they don't even exist on Bigcommerce).
Please let me know if you find something else or if it works for you. Also, please don't forget to purchase and install your own SSL certificate.

In my experience with Bigcommerce this is not possible. They have restricted the data required for GTS so that it is only available in the code output by their system, and not in other files. Basically, even if we knew what the variables were, I don't believe they would work because they aren't in the global scope.
I would guess that if you can use GTS to sell your products, the small price increase from the Gold to the Platinum level will quickly be made up in your sales.
