Scrapy: How to get pagination links?

I am trying to get the pagination links on this site, but to no avail.
spider:
for next_page in response.css('._3ZWfj::attr(href)').getall():
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.profile_link)
html:
<nav class="_2uKgC" aria-label="Page navigation" data-qa-target="pagination">
<p>Page 1 of 660</p>
<ul>
<li class="ktQcN"><span class="CxgVm _3ZWfj"><svg width="24px" height="24px" viewBox="0 0 24 24" version="1.1" class="USD7b MBqXj _3NE5i _2bkBT" role="img" aria-label="Previous page" focusable="false"><polyline fill="none" stroke-linecap="round" points="16 20 8 12 16 4"></polyline></svg></span></li>
<li><span class="HWGOs _3ZWfj">1</span></li>
<li class="_35eJJ _2Sm6k"><span class="CxgVm _3ZWfj">...</span></li>
<li class="">2</li>
<li class="">3</li>
</ul>
</nav>

Your selector does not include the a tag: response.css('._3ZWfj::attr(href)').getall()
It should be response.css('._3ZWfj a::attr(href)').getall()
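To illustrate the point of this answer, here is a minimal sketch that uses only Python's standard library rather than Scrapy itself. It assumes the real page wraps each page number in an a element inside the ._3ZWfj span; the markup, the pagination_links helper and the example.com URL are made up for illustration. The href attribute lives on the a element, so reading it from the span finds nothing.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed markup: assumes each page link is an <a> inside the span.
MARKUP = """<ul>
<li><span class="HWGOs _3ZWfj">1</span></li>
<li><span class="CxgVm _3ZWfj"><a href="/list?page=2">2</a></span></li>
<li><span class="CxgVm _3ZWfj"><a href="/list?page=3">3</a></span></li>
</ul>"""


def pagination_links(markup, base_url, css_class="_3ZWfj"):
    """Return absolute URLs of <a> elements nested in spans carrying css_class."""
    root = ET.fromstring(markup)
    links = []
    for span in root.iter("span"):
        if css_class not in span.get("class", "").split():
            continue
        # The span has no href of its own; look at the nested <a> elements.
        for anchor in span.iter("a"):
            href = anchor.get("href")
            if href:
                links.append(urljoin(base_url, href))
    return links


root = ET.fromstring(MARKUP)
# Reading href from the spans themselves yields nothing useful.
span_hrefs = [span.get("href") for span in root.iter("span")]
links = pagination_links(MARKUP, "https://example.com/list?page=1")
print(span_hrefs)
print(links)
```

In a spider, the equivalent step would be the corrected selector from the answer followed by response.urljoin, as in the question's loop.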


Email inserted twice in the database with jquery-unobtrusive-ajax on ASP.NET Core

I have an Ajax form in my view for the newsletter sign-up:
<div class="d-flex flex-column">
<div class="d-flex align-items-center">
<div class="px-lg-8">
<div class="d-flex align-items-center">
<div class=" ml-3">
<svg xmlns="http://www.w3.org/2000/svg" width="21.238" height="15.291" viewBox="0 0 21.238 15.291">
<path d="M0 0v15.292h21.238V0zm.849.849h19.54v1.062l-8.31 7.244-.04.04a2.194 2.194 0 0 1-1.42.571 2.2 2.2 0 0 1-1.42-.571c-.158-.138-1.293-1.118-2.111-1.832C4.661 5.255.974 2.021.849 1.912zm0 2.19c.737.642 3.35 2.917 5.575 4.858L.849 12.305zm19.539 0v9.266l-5.575-4.407c2.226-1.943 4.839-4.216 5.576-4.858zM7.075 8.469l1.566 1.367.013.013a3.057 3.057 0 0 0 1.965.77 3.051 3.051 0 0 0 1.978-.783c.12-.1 1.059-.913 1.58-1.367l6.212 4.911v1.062H.849V13.38z" data-name="Path 319"></path>
</svg>
</div>
<h5 class="mb-0"> @_localizer["NewsLetterMemberShip"]</h5>
</div>
<p class="my-3 ">@_localizer["ReceiveNewsLetter"]</p>
<div class="textbox-footer ">
<form asp-controller="Home" asp-action="NewsLetter" data-ajax="true" data-ajax-mode="replace" data-ajax-success="NewLetterPostSuccess">
<input asp-for="Email" id="NewsLetterEmail" class="form-control input-textbox-footer "
placeholder="@_localizer["EnterEmail"]">
<span asp-validation-for="Email"></span>
<button type="submit" class="btn btn--orange footer__send-btn">@_localizer["SendButton"]</button>
</form>
</div>
</div>
</div>
</div>
It works, but I noticed that after I click the newsletter button, the email is inserted twice.
The page has this script reference:
<script src="/lib/jquery-unobtrusive-ajax/jquery.unobtrusive-ajax.min.js"></script>
If I delete this reference, it works correctly and inserts the email only once.
This is my script:
function NewLetterPostSuccess(result) {
    swal('info', result.text, 'info');
    $('#NewsLetterEmail').val('');
}
It seems the way to resolve this is to delete the reference.
But isn't it needed for the Ajax form?
And if I delete the reference, I get another error too.
How can I resolve this problem?
I found my mistake.
The page referenced the "unobtrusive" script twice, from two different routes:
<script src="/lib/jquery-unobtrusive-ajax/jquery.unobtrusive-ajax.min.js"></script>
I deleted one of them and it works well now.
Always check your whole project for duplicate references! :)

Getting nested <li> tags as well; I just need the direct <li> items

I have a page structure that includes hyperlinks in list items nested at different levels. I am writing the XPath for just the second-level list items (level 2), but it also returns all the list items below level 2 (level 3 and deeper), which I don't need. It's getting really frustrating, because I have tried different XPaths and CSS selectors, and every time I get the level 3 list items along with level 2.
The page structure is exactly as follows:
<div>
<ul class="menuBar_menu_lvl_0">
<li class="item_lvl_1">
<li class="item_lvl_1">
<ul class="menu_lvl_1">
<li class="item_lvl_2"></li>
<li class="item_lvl_2"></li>
<li class="item_lvl_2">
<ul class="menu_lvl_2">
<li class="item_lvl_3"></li>
<li class="item_lvl_3"></li>
<li class="item_lvl_3"></li>
</ul>
</li>
</ul>
<li class="item_lvl_1">
<li class="item_lvl_1">
xpath = //ul[@class="menuBar_menu_lvl_0"]//li[@class="item_lvl_1"]//ul[@class="menu_lvl_1"]//li[@class="item_lvl_2"]
This also returns the level 3 items along with the level 2 items. I want to get the items of each level correctly and separately. Any help would be highly appreciated.
Welcome to SO.
Here is the sample that I used.
<html>
<body>
<div>
<ul class="menuBar_menu_lvl_0">
<li class="item_lvl_1">
<li class="item_lvl_1">
<ul class="menu_lvl_1">
<li class="item_lvl_2">Level1-li1</li>
<li class="item_lvl_2">Level1-li2</li>
<li class="item_lvl_2">
<ul class="menu_lvl_2">
<li class="item_lvl_3">Level2-li1</li>
<li class="item_lvl_3">Level2-li2</li>
<li class="item_lvl_3">Level2-li1</li>
</ul>
Level1-li3
</li>
</ul>
<li class="item_lvl_1">
<li class="item_lvl_1">
</ul>
</div>
</body>
</html>
And below is the XPath:
//ul[@class='menu_lvl_1']//li[not(parent::ul[@class='menu_lvl_2'])]
Here is the output:
Here is a method to get the text of the parent element only, excluding the text of its children.
def get_text_exclude_children(element):
    return driver.execute_script(
        """
        var parent = arguments[0];
        var child = parent.firstChild;
        var textValue = "";
        while(child) {
            if (child.nodeType === Node.TEXT_NODE)
                textValue += child.textContent;
            child = child.nextSibling;
        }
        return textValue;""",
        element).strip()
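As a rough illustration of why the question's XPath also returned the level 3 items, here is a small sketch using Python's xml.etree.ElementTree. The markup is a simplified, well-formed version of the sample, made up for illustration. It compares // (any descendant) with / (direct child only); the direct-child step is an alternative to the not(parent::...) filter above, not the same expression.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed version of the sample markup (an assumption for this sketch).
MARKUP = """<div>
  <ul class="menuBar_menu_lvl_0">
    <li class="item_lvl_1">
      <ul class="menu_lvl_1">
        <li class="item_lvl_2">Level1-li1</li>
        <li class="item_lvl_2">Level1-li2</li>
        <li class="item_lvl_2">Level1-li3
          <ul class="menu_lvl_2">
            <li class="item_lvl_3">Level2-li1</li>
            <li class="item_lvl_3">Level2-li2</li>
          </ul>
        </li>
      </ul>
    </li>
  </ul>
</div>"""

root = ET.fromstring(MARKUP)

# '//' before li matches li elements at any depth below the level 1 menu,
# so the level 3 items come along as well.
all_descendants = root.findall(".//ul[@class='menu_lvl_1']//li")

# '/' before li matches only the direct children of the level 1 menu.
direct_children = root.findall(".//ul[@class='menu_lvl_1']/li")

# li.text is the text before the first child element, i.e. the item's own text.
own_texts = [li.text.strip() for li in direct_children]

print(len(all_descendants), len(direct_children))
print(own_texts)
```

The same idea should carry over to Selenium locators: ending the path with /li instead of //li restricts the match to one level.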

Xpath is invalid: TypeError: The expression cannot be converted to return the specified type

I am using Selenium with Robot Framework and I am getting this error:
xpath is invalid: TypeError: The expression cannot be converted to return the specified type.
The code I used is:
Set Test Variable    ${xpathIP}    xpath=//ul/li/div[@class="segmentName"]
${IPSegmentsCnt}=    Get Matching Xpath Count    ${xpathIP}
Log    ${IPSegmentsCnt}
:For    ${i}    IN RANGE    1    ${IPSegmentsCnt} + 1
\    ${name}=    Get Text    xpath=(${xpathIP})[${i}]
\    Log    ${name}
\    Should Not Match Regexp    ${name}    \\(DS:.+\\)
I don't understand what exactly the error is.
HTML:
<li _ngcontent-ats-90="">
<span _ngcontent-ats-90="" ng-reflect-class-name="arrow collapse-false" class="arrow collapse-false"></span>
<md-checkbox _ngcontent-ats-90="" class="mat-accent mat-checkbox ng-untouched ng-pristine ng-valid" ng-reflect-name="4INFO IP Segments"><label class="mat-checkbox-layout"><div class="mat-checkbox-inner-container"><input class="mat-checkbox-input cdk-visually-hidden" type="checkbox" ng-reflect-id="input-md-checkbox-1" id="input-md-checkbox-1" ng-reflect-name="4INFO IP Segments" name="4INFO IP Segments" tabindex="0" aria-label="">
<div class="mat-checkbox-ripple mat-ripple" md-ripple="" ng-reflect-trigger="[object HTMLElement]" ng-reflect-centered="true" ng-reflect-speed-factor="0.3"></div><div class="mat-checkbox-frame"></div><div class="mat-checkbox-background"><svg xml:space="preserve" class="mat-checkbox-checkmark" version="1.1" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path class="mat-checkbox-checkmark-path" d="M4.1,12.7 9,17.6 20.3,6.3" fill="none" stroke="white"></path></svg><div class="mat-checkbox-mixedmark"></div></div></div><span class="mat-checkbox-label">
</span></label></md-checkbox>
<div _ngcontent-ats-90="" class="datasourceName">4INFO IP Segments false</div>
<ul _ngcontent-ats-90="">
<li _ngcontent-ats-90="">
<md-checkbox _ngcontent-ats-90="" class="mat-accent mat-checkbox ng-untouched ng-pristine ng-valid" ng-reflect-name="4Info_Age_18-30"><label class="mat-checkbox-layout"><div class="mat-checkbox-inner-container"><input class="mat-checkbox-input cdk-visually-hidden" type="checkbox" ng-reflect-id="input-md-checkbox-2" id="input-md-checkbox-2" ng-reflect-name="4Info_Age_18-30" name="4Info_Age_18-30" tabindex="0" aria-label=""><div class="mat-checkbox-ripple mat-ripple" md-ripple="" ng-reflect-trigger="[object HTMLElement]" ng-reflect-centered="true" ng-reflect-speed-factor="0.3"></div><div class="mat-checkbox-frame"></div><div class="mat-checkbox-background"><svg xml:space="preserve" class="mat-checkbox-checkmark" version="1.1" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path class="mat-checkbox-checkmark-path" d="M4.1,12.7 9,17.6 20.3,6.3" fill="none" stroke="white"></path></svg><div class="mat-checkbox-mixedmark"></div></div></div><span class="mat-checkbox-label">
</span></label></md-checkbox>
<div _ngcontent-ats-90="" class="segmentName">4Info_Age_18-30 (DS: 0) (CPM: $1.00)</div>
</li><li _ngcontent-ats-90="">
<md-checkbox _ngcontent-ats-90="" class="mat-accent mat-checkbox ng-untouched ng-pristine ng-valid" ng-reflect-name="4Info_Age_35_and_over"><label class="mat-checkbox-layout"><div class="mat-checkbox-inner-container"><input class="mat-checkbox-input cdk-visually-hidden" type="checkbox" ng-reflect-id="input-md-checkbox-3" id="input-md-checkbox-3" ng-reflect-name="4Info_Age_35_and_over" name="4Info_Age_35_and_over" tabindex="0" aria-label=""><div class="mat-checkbox-ripple mat-ripple" md-ripple="" ng-reflect-trigger="[object HTMLElement]" ng-reflect-centered="true" ng-reflect-speed-factor="0.3"></div><div class="mat-checkbox-frame"></div><div class="mat-checkbox-background"><svg xml:space="preserve" class="mat-checkbox-checkmark" version="1.1" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path class="mat-checkbox-checkmark-path" d="M4.1,12.7 9,17.6 20.3,6.3" fill="none" stroke="white"></path></svg><div class="mat-checkbox-mixedmark"></div></div></div><span class="mat-checkbox-label">
</span></label></md-checkbox>
<div _ngcontent-ats-90="" class="segmentName">4Info_Age_35_and_over (DS: 0) (CPM: $1.00)</div>
</li><li _ngcontent-ats-90="">
<md-checkbox _ngcontent-ats-90="" class="mat-accent mat-checkbox ng-untouched ng-pristine ng-valid" ng-reflect-name="4Info_Age_50_plus"><label class="mat-checkbox-layout"><div class="mat-checkbox-inner-container"><input class="mat-checkbox-input cdk-visually-hidden" type="checkbox" ng-reflect-id="input-md-checkbox-4" id="input-md-checkbox-4" ng-reflect-name="4Info_Age_50_plus" name="4Info_Age_50_plus" tabindex="0" aria-label=""><div class="mat-checkbox-ripple mat-ripple" md-ripple="" ng-reflect-trigger="[object HTMLElement]" ng-reflect-centered="true" ng-reflect-speed-factor="0.3"></div><div class="mat-checkbox-frame"></div><div class="mat-checkbox-background"><svg xml:space="preserve" class="mat-checkbox-checkmark" version="1.1" viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path class="mat-checkbox-checkmark-path" d="M4.1,12.7 9,17.6 20.3,6.3" fill="none" stroke="white"></path></svg><div class="mat-checkbox-mixedmark"></div></div></div><span class="mat-checkbox-label">
</span></label></md-checkbox>
<div _ngcontent-ats-90="" class="segmentName">4Info_Age_50_plus (DS: 0) (CPM: $1.00)</div>
</li><li _ngcontent-ats-90="">
</ul>
</li>
I tried multiple times, but it didn't work.
The problem could be that the xpath= prefix is applied twice: ${xpathIP} already starts with xpath=, so the Get Text call for ${name} ends up looking for something like xpath=(xpath=//ul/li/div[@class="segmentName"])[1]
Change
Set Test Variable    ${xpathIP}    xpath=//ul/li/div[@class="segmentName"]
${IPSegmentsCnt}=    Get Matching Xpath Count    ${xpathIP}
to
Set Test Variable    ${xpathIP}    //ul/li/div[@class="segmentName"]
${IPSegmentsCnt}=    Get Matching Xpath Count    xpath=${xpathIP}
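The string expansion behind this answer can be sketched in plain Python. This is only an illustration of how the locator text is composed, not Robot Framework code; indexed_locator is a hypothetical helper name that does not exist in any library.

```python
PREFIX = "xpath="


def indexed_locator(base_xpath: str, index: int) -> str:
    """Build a locator for the index-th match (1-based) of base_xpath."""
    # Strip an existing prefix so that it is never applied twice.
    if base_xpath.startswith(PREFIX):
        base_xpath = base_xpath[len(PREFIX):]
    return f"{PREFIX}({base_xpath})[{index}]"


stored = 'xpath=//ul/li/div[@class="segmentName"]'

# What the question's loop effectively builds: the prefix ends up inside the expression.
broken = f"xpath=({stored})[1]"

# What the corrected version builds: a single prefix around a plain XPath.
fixed = indexed_locator(stored, 1)

print(broken)
print(fixed)
```

The broken string contains the text xpath= inside the expression itself, which is not valid XPath and would explain the TypeError reported by the browser.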

Web scraping Linkedin Job posts using Python, Selenium & Phantomjs

Some LinkedIn job posts contain a see more button that expands the whole job description:
https://www.linkedin.com/jobs/view/401243784/?refId=3024203031501300167509&trk=d_flagship3_search_srp_jobs
I tried to expand it using element.click(), but the source I get after the expansion contains some placeholder divs instead of the original div. How can I scrape that hidden text?
This is what I get from driver.page_source:
<div class="jobs-ghost-placeholder jobs-ghost-placeholder--medium jobs-ghost-placeholder--thin mb2"></div>
<div class="jobs-ghost-placeholder jobs-ghost-placeholder--x-small jobs-ghost-placeholder--thin mb2"></div>
<div class="jobs-ghost-placeholder jobs-ghost-placeholder--small jobs-ghost-placeholder--thin"></div>
instead of the source I see in Chrome's inspector:
<div id="ember7189" class="jobs-description-details pt5 ember-view"> <h3 class="jobs-box__sub-title js-formatted-exp-title">Seniority Level</h3>
<p class="jobs-box__body js-formatted-exp-body">Associate</p>
<!---->
<h3 class="jobs-box__sub-title js-formatted-industries-title">Industry</h3>
<ul class="jobs-box__list jobs-description-details__list js-formatted-industries-list">
<li class="jobs-box__list-item jobs-description-details__list-item">Real Estate</li>
<li class="jobs-box__list-item jobs-description-details__list-item">Information Technology and Services</li>
</ul>
<h3 class="jobs-box__sub-title js-formatted-employment-status-title">Employment Type</h3>
<p class="jobs-box__body js-formatted-employment-status-body">Full-time</p>
<h3 class="jobs-box__sub-title js-formatted-job-functions-title">Job Functions</h3>
<ul class="jobs-box__list jobs-description-details__list js-formatted-job-functions-list">
<li class="jobs-box__list-item jobs-description-details__list-item">Information Technology</li>
<li class="jobs-box__list-item jobs-description-details__list-item">Project Management</li>
<li class="jobs-box__list-item jobs-description-details__list-item">Product Management</li>
</ul>
</div>
I also tried different timeout values for the wait, e.g. WebDriverWait(driver, 3), but in vain.
code:
employment_type = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, 'div.jobs-description__details>div.jobs-description-details>p.js-formatted-employment-status-body'))).text
This raises a timeout exception, as the page only contains the jobs-ghost-placeholder divs instead of the element described by the CSS selector.

Rails 3.2 will_paginate: change the style of the current page

I am trying to style the current page so that it stands out from the other page links, but my code does not seem to work: all the numbers look the same!
<section id="pagination">
<nav>
<ul>
<li><a class='<%= "active" if (params[:page]).to_i == @myths.current_page %>'>
<!-- checking current page and params page -->
<%= @myths.current_page.to_i %> <br> <%= (params[:page]).to_i %> <br>
<%= will_paginate @myths, :inner_window => 1, :outer_window => 1, :previous_label => '← previous', :next_label => 'next →' %>
</a></li>
</ul>
</nav>
</section>
Any help is appreciated.
Update
CSS:
a.active{text-decoration:underline;}
Picture of the pagination: the current page is 3 (the style is applied when I check for the current page).
I am not using any helper.
Here is the generated HTML:
<section id="pagination">
<nav>
<ul>
<li><a class='active'>
3 <br> 3 <br>
<div class="pagination">
<ul>
<li class="prev previous_page ">
<a rel="prev" href="/tags/Justice?page=2">← previous</a>
</li>
<li><a rel="start" href="/tags/Justice?page=1">1</a></li>
<li><a rel="prev" href="/tags/Justice?page=2">2</a></li>
<li class="active">3</li>
<li><a rel="next" href="/tags/Justice?page=4">4</a></li>
<li>5</li>
<li class="next next_page ">
<a rel="next" href="/tags/Justice?page=4">next →</a>
</li>
</ul>
</div>
</a></li>
</nav>
</section>
This is the problem:
<li class="active">3</li>
It should be
<li><a class="active" href="/tags/Justice?page=3">3</a></li>
for it to work. How can I fix that?
Since I am on page 3, the 3 should be underlined, and the pagination should look like: <- previous 1 2 3 4 5 next ->
Consider reading the following wiki page. Just simplify your view:
<section id="pagination">
<%= will_paginate @myths, :inner_window => 1, :outer_window => 1, :previous_label =>
</section>
To give the current page link a special look and feel, just set a CSS style for the .current class. will_paginate will do the rest for you and automatically add the .current class to the current page link.
Update:
To make your particular case work, just change the CSS to match the generated markup:
li.active a {
text-decoration:underline;
}