I want to return the links to all posts from a specific subreddit on my Reddit homepage. My intuition is to do this by looking for the next link after finding an href = r/whatever.
I was using https://www.reddit.com/r/programming/
I would recommend triggering the infinite scroll first so that all posts are loaded (a sketch follows the snippet below).
Then use this to grab all the links:
links = [x.get_attribute("href") for x in driver.find_elements(By.XPATH, "//a[@href and @data-click-id='body']")]
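A minimal sketch of that scroll-then-collect flow (the loop shape and the 2-second pause are assumptions to tune; driver is assumed to already be on the subreddit page):

import time
from selenium.webdriver.common.by import By

# Scroll until the page height stops growing, so lazy-loaded posts enter the DOM.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # assumed pause; give Reddit time to load the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

links = [x.get_attribute("href")
         for x in driver.find_elements(By.XPATH, "//a[@href and @data-click-id='body']")]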
You can find all a tags with an href attribute and then iterate through the resulting list. Python implementation:
driver = webdriver.WhateverDriver  # e.g. webdriver.Chrome() or webdriver.Firefox()
links = driver.find_elements(By.XPATH, "//a[@href]")  # This will return all links
I'm trying to get the links of all the posts in an Instagram profile.
How can I get to the href="/p/CX067tNhZ8i/" shown in the screenshot?
What I'm trying to do is find the href= blabla of all posts.
All the posts are in class="v1Nh3 kIKUG _bz0w".
I tried to get the href= blabla value from this class with the get_attribute command, but it didn't work.
Thank you for your help.
browser.get("https://www.instagram.com/lightning.mcqueen34/")
links = []
elements = browser.find_element_by_xpath('//*[#id="react-root"]/div/div/section/main/div/div[4]/article/div[1]/div/div[1]/div[3]')
for i in elements:
links.append(i.get_attribute('href'))
I thought this would work but the elements value is not a list . It gave an error.
This should work:
elements = browser.find_elements_by_tag_name('a')
The answer below will not work in all cases; it depends on how the DOM of the page is loaded.
Replace this line:
elements = browser.find_element_by_xpath('//*[@id="react-root"]/div/div/section/main/div/div[4]/article/div[1]/div/div[1]/div[3]')
With:
elements = browser.find_elements_by_xpath("//a[@href]")
This will let you retrieve all links with an href from the page.
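Putting that together with the loop from the question, a minimal sketch (the /p/ filter is an assumption based on the href="/p/CX067tNhZ8i/" format shown above; Instagram lazy-loads, so only currently rendered posts will appear):

browser.get("https://www.instagram.com/lightning.mcqueen34/")

links = []
for el in browser.find_elements_by_xpath("//a[@href]"):  # note: find_elements, plural
    href = el.get_attribute("href")
    if href and "/p/" in href:  # keep only post links like /p/CX067tNhZ8i/
        links.append(href)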
Try changing the XPath to target the DIV's class or ID after trying the //a[@href] XPath to get all hrefs.
There are some posts about this topic, but I cannot find any solution for my case. This is the situation:
I click on a link (to the next page):
ActionChains(driver).move_to_element(next_el).click().perform()
Then I get the content of the new page (I'm interested in some script sections inside the body):
html = driver.find_element_by_xpath("//*").get_attribute("outerHTML")
But that content is always the same, no matter how long I wait.
The only way to get the driver to see the new DOM information is to do a refresh(), but in this case that is not a valid option.
Thanks and regards.
I am not sure exactly what you are looking for here, but if I am right, you want to capture the content of a script tag from the page.
If that is the case, capture the page source in a string variable: source_code = driver.page_source. Once you have the string, you can extract the value with any of the available string methods. I hope it helps.
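For example, a minimal sketch that pulls every script body out of the page source (the regex is an assumption and an HTML parser would be more robust; "someMarker" is a hypothetical string identifying the script you want):

import re

source_code = driver.page_source  # snapshot of the DOM as currently rendered

# Naive extraction; assumes no "</script>" appears inside a script body.
scripts = re.findall(r"<script[^>]*>(.*?)</script>", source_code, re.DOTALL)
for script in scripts:
    if "someMarker" in script:  # hypothetical marker for the data you want
        print(script)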
I am trying to scrape Nordstrom product descriptions. I got all the item links (stored in a local MongoDB database) and am now iterating through them. Here is an example link: https://www.nordstrom.ca/s/leith-ruched-body-con-tank-dress/5420732?origin=category-personalizedsort&breadcrumb=Home%2FWomen%2FClothing%2FDresses&color=001
My code for the spider is:
def parse(self, response):
    items = NordstromItem()
    description = response.css("div._26GPU").css("div::text").extract()
    items['description'] = description
    yield items
I also tried scrapy shell, and the returned page is blank.
I am also using Scrapy random user agents.
I suggest you use a CSS or XPath selector to get the info you want. Here's more about it: https://docs.scrapy.org/en/latest/topics/selectors.html
You can also use a CSS/XPath checker to help verify that the selector gets the info you want, like this Chrome extension: https://autonomiq.io/chropath/
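For instance, you can sanity-check the selector from the question in scrapy shell (the _26GPU class comes from the question's spider; if these return nothing while the browser shows the text, the page is probably rendered by JavaScript, which would also explain the blank page you saw):

# In a terminal: scrapy shell "https://www.nordstrom.ca/s/leith-ruched-body-con-tank-dress/5420732"
response.css("div._26GPU div::text").getall()  # CSS version of the spider's selector
response.xpath("//div[contains(@class, '_26GPU')]//text()").getall()  # XPath equivalent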
My code:
import stuff...
driver = webdriver.Firefox(executable_path='C:\\...\\geckodriver.exe')
driver.get('https://webpage.com/')
elems = driver.find_elements_by_CLASS_ID_TEXT_XPATH_WHATEVER_ELSE_BADLY_DOCUMENTED_STUFF
How do I get a list of ALL CSS selectors, such as class, ID, p, span, input, button, and all other elements, from webpage.com?
If you know a link to a brief and clear resource with plenty of examples explaining find_elements_by or locating elements in detail, please share it here.
edit:
OK, a bit more specific question then: how could I get a list of ALL class selectors from a webpage into elems?
elems = driver.find_elements_by_class_name('????')
You could use XPath to search the DOM; there's a good tutorial on using XPath here. You may also find this resource useful, which covers using XPath queries with Selenium.
You can use XPath queries with Selenium like so:
elements = driver.find_elements(By.XPATH, '//button')
This code would return all buttons on the page.
elements = driver.find_elements(By.XPATH, '//*[@class]')
This would return all the elements that have a class attribute.
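To get the list of ALL class selectors asked about in the edit, here is a sketch that builds on the //*[@class] query above and collects every distinct class name (splitting on whitespace matches how HTML stores multiple classes in one attribute):

from selenium.webdriver.common.by import By

class_names = set()
for el in driver.find_elements(By.XPATH, "//*[@class]"):
    class_names.update(el.get_attribute("class").split())  # "a b c" -> {"a", "b", "c"}

print(sorted(class_names))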
I hope this helps.
I have a test where I click through a list of links on a page. If I open the links in new pages, I can iterate through the list using browser.switchTo().window and the original window's handle, clicking on each one.
But if I open a link in the same page (_self) and navigate back to the original list of links using browser.navigate().back(), I get the following error when the test iterates to click the next link:
StaleElementReferenceError: stale element reference: element is not attached to the page document
What's the proper way to access elements on a prior page once you've navigated away?
When you request an object from Selenium, Selenium identifies that object with an internal unique ID. See the command below:
>>> driver.find_element_by_tag_name("a")
<selenium.webdriver.firefox.webelement.FirefoxWebElement
(session="93fc2bec-c9f8-0c46-aec3-1939af00c917",
element="5173f7fb-63ca-e447-b176-4a226d956834")>
As you can see, the element has a unique UUID. Selenium maintains this list internally, so when you take an action like click, it fetches the element from its cache and acts on it.
Once the page is refreshed or a new page is loaded, this cache is no longer valid. But the object you created in your language binding still exists. If I try to execute some action on it:
>>> elem.is_displayed()
selenium.common.exceptions.StaleElementReferenceException:
Message: The element reference of [object Null] null
stale: either the element is no longer attached to the DOM or the page has been refreshed
So, in short, there is no way to reuse the same object, which means you need to alter your approach. Consider the code below:
for elem in driver.find_elements_by_tag_name("a"):
    elem.click()
    driver.back()
The code above will fail on the second attempt of elem.click(). So the fix is to make sure not to reuse a collection object; instead, use a number-based loop. I can write the above code in many different ways. Consider a few approaches below.
Approach 1
elems = driver.find_elements_by_tag_name("a")
count = len(elems)
for i in range(0, count):
    elems[i].click()
    driver.back()
    elems = driver.find_elements_by_tag_name("a")  # re-fetch the collection after navigating back
This is not a great approach, as I am fetching a whole collection of objects on every iteration and only using one of them. On a page with 500-odd links, this code will be quite slow.
Approach 2
elems = driver.find_elements_by_tag_name("a")
count = len(elems)
for i in range(1, count + 1):
    elem = driver.find_element_by_xpath("(//a)[{}]".format(i))
    elem.click()  # the element was just re-located, so the click is safe
    driver.back()
This is better than Approach 1, as I fetch the full collection of objects just once; after that, I fetch one element at a time and use it.
Approach 3
elems = driver.find_elements_by_tag_name("a")
links = []
for elem in elems:
    links.append(elem.get_attribute("href"))
for link in links:
    driver.get(link)
    # do some action
This approach will only work when the links are href-based. So, depending on the situation, I would choose or adapt my approach.
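As a small refinement to Approach 3, you can skip anchors that have no href up front, since some a tags carry JavaScript handlers instead of hrefs and would otherwise put None into the list:

links = [a.get_attribute("href")
         for a in driver.find_elements_by_tag_name("a")
         if a.get_attribute("href")]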