I tried the following XPath in XPath Helper in Chrome and XPather in Firefox, and it always displays all the snippets (i.e. the descriptions of the search results) on the Google search results page, but it does not work in the Scrapy shell:
//span[@class='st']
In case it matters, I invoke scrapy shell like this:
scrapy shell "http://www.google.com/search?q=myQuery"
and then I run hxs.select("//span[@class='st']"). This always returns an empty list.
Any clues as to why this could be happening?
Scrapy cannot "parse" sites that need JavaScript execution. What the various developer consoles show you is the already interpreted and executed page, with all JavaScript applied.
Since Google displays its results with the help of JavaScript, Scrapy on its own can't handle this.
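To make the difference concrete, here is a minimal sketch using only the Python standard library. Both HTML strings are made up for illustration (they are not Google's real markup): one stands for what a plain HTTP client downloads, the other for the DOM after a browser has run the scripts.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup as a plain HTTP client receives it: the results
# container is empty and a script is expected to fill it in later.
RAW_HTML = """
<html><body>
  <div id="search"></div>
  <script>/* fills the search div with result snippets at runtime */</script>
</body></html>
""".strip()

# Hypothetical markup after a browser has run the script: this is what
# a developer console or an XPath browser extension inspects.
RENDERED_HTML = """
<html><body>
  <div id="search"><span class="st">First snippet</span><span class="st">Second snippet</span></div>
</body></html>
""".strip()


def snippets(markup):
    """Return the text of every <span class="st"> in the given markup."""
    root = ET.fromstring(markup)
    return [span.text for span in root.findall(".//span[@class='st']")]


print(snippets(RAW_HTML))       # nothing to match in the downloaded source
print(snippets(RENDERED_HTML))  # matches exist only after rendering
```

The same expression is correct in both cases; only the document it runs against differs, which is why the browser extensions find matches and the Scrapy shell does not.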
Sometimes sites will not work with JavaScript disabled (Applebees.com, for example), so you have to drive an actual browser with a tool like Selenium.
In the Firefox URL bar, type:
about:config
Find the line javascript.enabled and change its value to false
Install FireFinder extension
Open Firebug (F12)
and then enjoy scraping Google with an XPath expression like:
//*[@id="search"]//li[@class="g"]/div[@class="s"]//cite
I have been following the scrapy tutorial trying to create a very simple web scraper for warframe.market. I have about a year of coding experience from school, but no python experience. I simply want to get the price of an item from the website. I used the following to scrape the page:
scrapy shell "https://warframe.market/items/hydroid_prime_set"
then I inspected the web page to find the individual elements that I am trying to scrape. I used this command to try to view the results I wanted:
response.css("div.order-row.d-flex.col-12").extract()
This did not work, so I used view(response) to see what I had scraped, and my cmd just waits endlessly at this point.
Is HTTPS stopping me from scraping? Am I selecting the wrong css in my response? Is the webpage too big? Could someone please show me where I went wrong?
Thanks
The response isn't empty, but the page is rendered using JavaScript (you can verify this by inspecting response.body). For example, try this in the shell:
import json
# The order data is embedded as JSON in the element with id "application-state"
data = json.loads(response.css('#application-state::text').extract_first())
for order in data.get('payload', {}).get('orders', []):
    # Print the seller's in-game name, then the price in platinum
    print('"{}" price: {}'.format(order.get('user', {}).get('ingame_name'),
                                  order.get('platinum')))
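If you want to try the same idea outside the Scrapy shell, here is a self-contained sketch with the standard library only. The page source and the seller names below are invented for illustration; the only things taken from the answer above are the application-state id and the payload/orders/platinum/user/ingame_name keys.

```python
import json
from html.parser import HTMLParser

# Hypothetical page source: the order data sits in a JSON <script> tag
# instead of in visible HTML rows.
PAGE = """
<html><body>
<script type="application/json" id="application-state">
{"payload": {"orders": [
  {"platinum": 95, "user": {"ingame_name": "seller_one"}},
  {"platinum": 110, "user": {"ingame_name": "seller_two"}}
]}}
</script>
</body></html>
"""


class ScriptText(HTMLParser):
    """Collects the text of the <script> element with a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self.capturing = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self.capturing = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.chunks.append(data)


def extract_orders(page):
    """Return (ingame_name, platinum) pairs from the embedded JSON state."""
    parser = ScriptText("application-state")
    parser.feed(page)
    parser.close()
    state = json.loads("".join(parser.chunks))
    return [(order["user"]["ingame_name"], order["platinum"])
            for order in state.get("payload", {}).get("orders", [])]


for name, price in extract_orders(PAGE):
    print('"{}" price: {}'.format(name, price))
```

The point is that no JavaScript has to run: the data is already in the downloaded source, just not in the div rows you see in the browser inspector.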
When I type an XPath into the FirePath text field and the XPath is correct, it should display the corresponding HTML code. This was working fine previously.
But now it doesn't display the corresponding HTML code even though the XPath is correct.
Can anyone help me find a solution to this problem? I even uninstalled FirePath and installed it again, but it's still not working.
If you visit the GitHub Page of FirePath, it clearly mentions:
FirePath is a Firebug extension that adds a development tool to edit, inspect and generate XPath expressions and CSS3 Selectors
Now if you visit the home page of FireBug, it clearly mentions that :
The Firebug extension isn't being developed or maintained any longer. We invite you to use the Firefox DevTools instead, which ship with Firebug.next
So the direction is clear: we have to use DevTools [F12], which comes integrated with Mozilla Firefox 56.x and later releases.
Example Usage :
Now, let us assume we have to identify the xpath of the Search Box on Google Home Page.
Open Mozilla Firefox 56.x browser and browse to the url https://www.google.co.in
Press F12 to open the DevTools
Within the DevTools section, on the Inspector tab, use the Inspector to identify the Search Box WebElement.
Copy the xpath (absolute) and paste it in a text pad.
Construct a logical unique xpath.
Within the DevTools section, on the Console tab, within JS sub menu, paste the logical unique xpath you have constructed in the following format and hit Enter or Return as follows:
$x("logical_unique_xpath_of_search_box")
The WebElement identified by the xpath will be reflected.
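As a rough offline analogue of the "absolute versus logical" distinction in the steps above, the sketch below uses Python's xml.etree.ElementTree, which supports only a limited XPath subset. The markup is a made-up stand-in for a search form, not Google's actual page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified search form.
PAGE = """
<html><body>
  <div><form action="/search">
    <input type="hidden" name="hl" value="en"/>
    <input type="text" name="q" title="Search"/>
  </form></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Absolute-style path, similar to what a "Copy XPath" menu produces:
# it breaks as soon as the nesting or the element order changes.
absolute = root.findall("./body/div/form/input[2]")

# Logical path keyed on a stable attribute of the search box.
logical = root.findall(".//input[@name='q']")

print(len(logical))                  # how many elements the logical path matches
print(absolute[0] is logical[0])     # whether both paths reach the same element
```

A logical XPath is "unique" when it matches exactly one element, which is the same thing you check by eye when $x(...) echoes a single node in the console.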
Newer versions of Firefox do not support Firebug.
You can use Chrome DevTools instead if you like.
I personally write XPath using Chrome DevTools.
For more info, refer to my answer here:
Is there a way to get the xpath in google chrome?
I am trying to retrieve the ad URLs for this website:
http://www.appledaily.com
The ad URLs are loaded using JavaScript, so a standard CrawlSpider does not work. The ads also change each time you refresh the page.
I found this question here, and what I gathered is that we first need to use Selenium to load the page in a browser and then use Scrapy to retrieve the URL. I have some experience with Scrapy but none at all with Selenium. Can anyone show me, or point me to a resource on, how to write a script to do that?
Thank you very much!
EDIT:
I tried the following but neither works in opening the ad banner. Can anyone help?
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://appledaily.com')
adBannerElement = driver.find_element_by_id('adHeaderTop')
adBannerElement.click()
2nd try:
adBannerElement = driver.find_element_by_css_selector("div[@id='adHeaderTop']")
adBannerElement.click()
A CSS selector should not contain the @ symbol (that is XPath syntax). It should be div[id='adHeaderTop'], or more briefly div#adHeaderTop.
Actually on observing and analyzing the site and the event that you are trying to carry out, I find that the noscript tag is what should interest you. Just get the HTML source of this node, parse the href attribute and fire this URL.
It will be equivalent to clicking the banner.
<noscript>
"<a href="http://adclick.g.doubleclick.net/aclk%253Fsa%...</a>"
</noscript>
(This is not the complete node information, just inspect the banner in Chrome and you will find this tag).
EDIT: Here is a working snippet that gives you the URL without clicking on the ad banner, by reading it from the noscript tag mentioned above.
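// Needs java.util.List, org.openqa.selenium.* and org.openqa.selenium.firefox.FirefoxDriver
WebDriver driver = new FirefoxDriver();
driver.navigate().to("http://www.appledaily.com");
// findElements returns an empty list when nothing matches;
// findElement would throw NoSuchElementException rather than return null
List<WebElement> hidden = driver.findElements(By.cssSelector("div#adHeaderTop_ad_container noscript"));
if (!hidden.isEmpty()) {
    String innerHTML = hidden.get(0).getAttribute("innerHTML");
    // The href value is the first double-quoted string in the markup
    String adURL = innerHTML.split("\"")[1];
    System.out.println("** " + adURL); // URL you would reach by clicking the ad
} else {
    System.out.println("<noscript> element not found...");
}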
Though this is written in Java, the page source won't change, so the same approach works from any language.
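Since the question uses Python, here is a minimal sketch of just the parsing step in Python. It assumes you already have the noscript element's innerHTML as a string (for example from Selenium's get_attribute("innerHTML")); the sample markup and the example.invalid URLs are placeholders, and ad_url_from_noscript is a hypothetical helper name.

```python
from html.parser import HTMLParser


class FirstHref(HTMLParser):
    """Remembers the href of the first <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")


def ad_url_from_noscript(inner_html):
    """Return the first link target in the markup, or None if there is none."""
    parser = FirstHref()
    parser.feed(inner_html)
    parser.close()
    return parser.href


# Placeholder for the real innerHTML of the <noscript> node.
SAMPLE_INNER_HTML = (
    '<a href="http://ads.example.invalid/click?id=123">'
    '<img src="http://ads.example.invalid/banner.png"></a>'
)

print(ad_url_from_noscript(SAMPLE_INNER_HTML))
```

Using an HTML parser instead of splitting on double quotes means the result does not depend on href being the first quoted attribute in the tag.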
I am using Selenium to test a web app for which most of the Selenium test cases are already written. I have no idea how it works internally; I just build the project, go to the link provided in the browser, and the tests start running. All the tests are written by hand, not generated.
I am using Ruby, and doing something like this to click a link/button in a JavaScript popup:
def methodName()
  clickAndWait("<Id of the link in js popup that I want to click>")
  assertText("<text I need to check>")
end
This method is then called in a '.test' file, but it never works for a JavaScript popup; for everything else it works fine!
help !
Popups are often in a different context, either a frame or a separate window. When you call assertText, Selenium does not look inside them. Use the switchTo function (I'm not sure of the exact syntax in Ruby) to switch to the popup before calling assertText.
In Ruby, you would probably use something like this:
@driver.switch_to.alert.accept
Has anyone a good solution on how to test auto-complete combo box in Selenium?
Thanks for the help,
Manjide
Take a look at my other answer: Selenium - verify the list of suggestions is displayed
Using TestPlan with the Selenium back-end, this code grabs the suggestions from Google, which is an example of an auto-complete combo box.
GotoURL http://www.google.com/webhp?hl=en
ClickReplaceType //input[@name='q'] what is my
# This is where the suggestions appear
set %ResultsXPath% //table[@class='gac_m']//td[@class='gac_c']
# Check that they are there (that is, wait for them, since they are dynamic)
Check %ResultsXPath%
# Then iterate over the suggestions
foreach %Cell% in (response %ResultsXPath%)
  Notice %Cell%
end
This produces the results:
00000000-00 GOTOURL http://www.google.com/webhp?hl=en
00000001-00 NOTICE Starting a new browser (0:0:0:0:1) com.thoughtworks.selenium.DefaultSelenium@332611a7
00000002-00 NOTICE what is my ip
00000003-00 NOTICE what is my ip address
00000004-00 NOTICE what is my bmi
00000005-00 NOTICE what is my house worth
00000006-00 NOTICE what is my
Usually such tests work in both the Selenium and HTMLUnit back-ends in TestPlan, but Google currently works only with Selenium.