Scrapy Rules - Navigating one page at a time

My scrapy script has rules specified as below:
rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=<xpath for next page>), callback=parse_website, follow=True),)
The website itself has navigation, but each page only shows the link to the next page; i.e. as page 1 loads, I can get the link to page 2, and so on and so forth.
How do I get my spider to navigate through all of the n pages?
Thank you!
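For reference, with follow=True the LinkExtractor is re-applied to every response the rule fetches, so a site that only ever exposes the next-page link still gets crawled page by page: page 1 yields the link to page 2, page 2 yields the link to page 3, and so on. Here is a stand-in for that traversal in plain Python (made-up page data, no Scrapy), just to show the shape of what the rule does:

```python
# Hypothetical pages: each one exposes only the link to the next,
# mirroring the site described in the question.
pages = {
    "page1": {"items": ["a", "b"], "next": "page2"},
    "page2": {"items": ["c"], "next": "page3"},
    "page3": {"items": ["d"], "next": None},
}

def crawl(start):
    """Follow the 'next' link page by page, like Rule(..., follow=True)."""
    scraped, url = [], start
    while url is not None:
        page = pages[url]
        scraped.extend(page["items"])  # plays the role of callback=parse_website
        url = page["next"]             # plays the role of the LinkExtractor
    return scraped

print(crawl("page1"))  # -> ['a', 'b', 'c', 'd']
```

The rule as written should already behave this way, provided restrict_xpaths matches the next-page link on every page; note that the usual convention is to pass the callback as a string, callback='parse_website', since the method is not yet bound when the class body is evaluated.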

HREF Class changing on every page

I am working to scrape this website: "https://www.moglix.com/automotive/car-accessories/216110000?page=101" (NOTE: 101 is the page number; the site has 783 pages).
I wrote this code to get all the URLs of the products listed on a page using BeautifulSoup:
import requests
from bs4 import BeautifulSoup

prod_url = []
for i in range(1, 400):
    r = requests.get(f'https://www.moglix.com/automotive/car-accessories/216110000?page={i}')
    soup = BeautifulSoup(r.content, 'lxml')
    for link in soup.find_all('a', {"class": "ng-tns-c100-0"}):
        prod_url.append(link.get('href'))
There are 40 products on each page, so this should give me about 16,000 product URLs, but I am only getting around 7,600.
After checking, I can see that the class of the a tag changes from page to page.
How can I get this href for all the products on all the pages?
You can use the find_all method with the attrs argument to get all a tags, then filter them further using split and startswith to keep only the product link URLs:
import requests
from bs4 import BeautifulSoup

prod_url = []
for i in range(1, 784):  # the site has 783 pages
    res = requests.get(f"https://www.moglix.com/automotive/car-accessories/216110000?page={i}")
    soup = BeautifulSoup(res.text, "html.parser")
    x = soup.find_all("a", attrs={"target": "_blank"})
    prod_url += [a['href'] for a in x if len(a['href'].split("/")) > 2 and a['href'].startswith("/")]
Output:
['/love4ride-steel-tubeless-tyre-puncture-repair-kit-tyre-air-inflator-with-gauge/mp/msnv5oo7vp8d56',
'/allextreme-exh4hl2-2-pcs-36w-9000lm-h4-led-headlight-bulb-conversion-kit/mp/msnekpqpm0zw52',
'/love4ride-2-pcs-35-inch-fog-angel-eye-drl-led-light-set-for-car/mp/msne5n8l6q1ykl', ...]
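Since the hrefs in the output are relative paths, they can be turned into full product URLs with urljoin from the standard library:

```python
from urllib.parse import urljoin

base = "https://www.moglix.com"
relative = "/love4ride-steel-tubeless-tyre-puncture-repair-kit-tyre-air-inflator-with-gauge/mp/msnv5oo7vp8d56"

# urljoin resolves a root-relative path against the site origin
full = urljoin(base, relative)
print(full)
```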

Robotframework: Keep on refreshing browser page until page contains the element

With Robot Framework and Selenium, I want to create a keyword that keeps refreshing the page until the page contains a given element.
I do not think I can use Wait Until Page Contains Element <xpath> <time>, because the page needs to be refreshed before the element appears.
How can I write a FOR loop to do this?
*** Keywords ***
Refresh Page until page contains the element
    Reload Page
    Page Should Contain Element    <xpath>
Or maybe I can somehow loop this?
${Reload}=    Run Keyword And Return Status    Page Should Contain Element    <xpath>
Run Keyword If    ${Reload}    <don't know how to write here>    ELSE    Reload Page
Hi, this can be achieved with something similar to the snippet below, using a WHILE loop; there are additional examples here: RoboCorp WHILE Loops.
Refresh Page Until Page Contains Element
    ${Reload}=    Run Keyword And Return Status    Page Should Contain Element    <xpath>
    WHILE    ${Reload} != ${TRUE}    limit=60
        Reload Page
        ${Reload}=    Run Keyword And Return Status    Page Should Contain Element    <xpath>
    END
Hope this helps
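The same refresh-and-check pattern in plain Python, for comparison (a sketch: check and reload are stand-in callables; with Selenium they would be an element lookup and driver.refresh()):

```python
def refresh_until(check, reload, max_attempts=30):
    """Reload until check() succeeds, mirroring the WHILE keyword above."""
    for _ in range(max_attempts):
        if check():
            return True
        reload()
    return False  # element never appeared within max_attempts reloads

# Stand-in for a page that only shows the element after two refreshes.
state = {"reloads": 0}
result = refresh_until(
    check=lambda: state["reloads"] >= 2,  # "Page Should Contain Element"
    reload=lambda: state.__setitem__("reloads", state["reloads"] + 1),  # "Reload Page"
)
print(result)  # -> True
```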

Make Selenium scroll LinkedIn to scrape jobs

I have this code scraping each job title and company name from:
https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt
This is for every job title:
job_titles = browser.find_elements_by_css_selector("a.job-card-list__title")
c = []
for title in job_titles:
    c.append(title.text)
print(c)
print(len(c))
This is for every company name:
Company_Names = browser.find_elements_by_css_selector("a.job-card-container__company-name")
d = []
for name in Company_Names:
    d.append(name.text)
print(d)
print(len(d))
At the URL above there are many, many pages! How can I make Selenium automatically open each page and scrape each of the roughly 4,000 results available?
I have found a way to paginate to each page, but I don't yet know how to scrape each page.
So the URL is :
https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start=25
The start parameter in the URL increases by 25 from one page to the next,
so we add this piece of code, which successfully navigates us to the other pages:
# start=25 is the second page; the loop then walks through the rest
for i in range(1, 40):
    page = i * 25
    browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(page))
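Putting the two halves together: for each paginated URL, call browser.get(url) and then rerun the find_elements calls from the question against that page. Only the URL generation is shown runnable below, since the Selenium calls need a live browser (page_urls is a hypothetical helper, not part of the original code):

```python
BASE = "https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}"

def page_urls(total_results, per_page=25):
    """One search URL per result page; the start offset grows by 25 each time."""
    return [BASE.format(start) for start in range(0, total_results, per_page)]

urls = page_urls(4000)
print(len(urls))  # -> 160
print(urls[1])    # second page, with start=25

# For each url: browser.get(url), then run the find_elements calls
# from the question and extend c / d with the results.
```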

Unable to select elements using Scrapy shell

I'm trying to print out all the titles of the products of this website using scrapy shell: 'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas'
Once it is open I start fetching:
fetch('https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas')
But when I try to print out the title of each product, nothing is selected:
>>> response.css('.shelfProductTile-descriptionLink::text')
output: []
Also tried:
>>> response.css('a')
output: []
How can I do this? Thanks
Your code is correct. What happens is that there are no a elements in the HTML retrieved by Scrapy. When you visit the page with your browser, the product list is populated with JavaScript, on the browser side; the products are not in the HTML source.
In the Scrapy documentation you'll find techniques for handling dynamically-loaded (JavaScript-rendered) content. Maybe you should try that.
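One common workaround when content is rendered client-side (a sketch using made-up HTML, not this site's actual markup): many JavaScript-driven pages embed their data as JSON inside a script tag in the raw HTML, which can be extracted without executing any JavaScript:

```python
import json
import re

# Hypothetical page source: the data the browser renders is shipped as JSON.
html = '<script id="state">{"products": [{"title": "Iced Tea 500ml"}]}</script>'

# Pull the JSON payload out of the script tag and parse it.
match = re.search(r'<script id="state">(.*?)</script>', html, re.S)
data = json.loads(match.group(1))
titles = [p["title"] for p in data["products"]]
print(titles)  # -> ['Iced Tea 500ml']
```

Inspect the actual page source (not the browser-rendered DOM) to see whether such a payload exists and what its tag and structure look like.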

How to identify page boundaries in a .prn file of a multi-page job

Can someone help me identify the page boundaries in a multi-page .prn print job?
My goal is to insert a PJL command, #PJL SET MEDIASOURCE=TRAYX, at the start of every page of a multi-page job file,
where X is the page number. For example:
for Page 1 : #PJL SET MEDIASOURCE=TRAY1
for Page 2 : #PJL SET MEDIASOURCE=TRAY2
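There is no single universal page marker in a .prn file; it depends on the driver and the page description language. In plain-text and many PCL jobs, pages end with a form feed byte (0x0C), so one way to locate the boundaries (a sketch under that assumption; verify against a hex dump of your own file first) is:

```python
FF = b"\x0c"  # form feed: a common end-of-page marker in text/PCL print data

def split_pages(prn: bytes):
    """Split raw print data on form feeds, keeping the FF with each page."""
    chunks = prn.split(FF)
    return [c + FF for c in chunks[:-1]]  # drop any trailing data after the last FF

sample = b"PAGE-ONE-DATA\x0cPAGE-TWO-DATA\x0c"
print(len(split_pages(sample)))  # -> 2
```

Note that PCL5 jobs may instead signal page breaks with escape sequences, and PJL commands generally have to be wrapped between UEL sequences (ESC %-12345X) rather than dropped into the middle of the page data, so check what your target printer actually accepts.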