So far my experience with Scrapy spiders has been with focused scraping. In other words, I first do a manual keyword search on a target website, which returns a URL containing the keywords; an example is http://www.simplyhired.com/search?q=Anesthesiologist. This link lets my spider "see" what I see in a browser.
Now I have noticed this method doesn't work on some websites, such as this one: http://www.physicianjobboard.com/. Keyword searching works in a browser, but it only produces the generic URL http://www.mdjobsite.com/Index2.cfm?Page=JobsSearchResults. This generic link points to a .cfm page and does not directly tell my spiders which keywords I am interested in.
One inefficient method would be to scrape all the posts from the website and filter out the ones I need. Is there another way to let my spiders see what I see in my browser and perform a focused scrape? My guess is to have the spider send a request mimicking the keyword search and then analyze the response page. I have zero experience with this. Could anyone give some hints on whether my guess is correct?
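Your guess is the usual approach: submit the search form programmatically and parse the page it returns. Below is a minimal Scrapy sketch of that idea; the form page URL, the "Keywords" field name, and the result selector are all assumptions to confirm in your browser's developer tools, not values taken from the real site.

import scrapy


class KeywordSearchSpider(scrapy.Spider):
    """Submit a site's keyword-search form instead of hard-coding a results URL."""

    name = "keyword_search"
    # Hypothetical form page -- replace with the page that hosts the search form.
    start_urls = ["http://www.mdjobsite.com/Index2.cfm?Page=JobsSearch"]

    def parse(self, response):
        # from_response() locates the <form> in the page and posts it back with
        # our values filled in. "Keywords" is a guessed field name; check the
        # real one in the developer tools' Network tab.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"Keywords": "Anesthesiologist"},
            callback=self.parse_results,
        )

    def parse_results(self, response):
        # This response is the same results page the browser shows.
        for href in response.css("a.job-title::attr(href)").getall():  # placeholder selector
            yield {"url": response.urljoin(href)}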
Related
Why am I able to google messages on (for example) gitter.im? How did Google index all of this: https://gitter.im/neoclide/coc.nvim?at=5ea00cdda3612210839689f1?
Does gitter.im return its content to Google in another format, or via some specific interface/protocol declared somewhere in a special section for web crawlers? Did Google spend development resources building a gitter.im-specific crawler that is able to make specific XHR requests?
Simple:
Google requests https://gitter.im/gitter/developers.
The N most recent messages are already embedded in the HTML, say 50. Google just extracts all the links from that HTML (from the time tags such as "18:15", for example). Each time tag gives a URL of the form https://gitter.im/gitter/developers?at=610011abc9f8852a970e808e, and Google doesn't care why. It just remembers the URLs.
Google then requests each of those 50 grabbed URLs of the form https://gitter.im/gitter/developers?at=610011abc9f8852a970e808e.
Each such URL returns ~50 messages around that exact message. So the search engine thinks: "OK, this URL gives you THIS text."
So when you search for THIS text, it just gives you the URL closest to that text, or maybe just any URL containing that text...
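In other words, the whole thing can be reproduced with an ordinary HTTP client. Here is a rough Python sketch of that crawl pattern; it illustrates the idea only, it is not Google's actual crawler.

# Fetch a room page, harvest the "?at=" permalinks, then fetch each one.
import requests
from lxml import html

room_url = "https://gitter.im/gitter/developers"
page = html.fromstring(requests.get(room_url).text)

# Every timestamp link carries an "?at=<message-id>" permalink.
permalinks = {href for href in page.xpath("//a/@href") if "?at=" in href}

for link in permalinks:
    # Each permalink returns plain HTML with ~50 messages around that message,
    # so the text can be indexed without executing any JavaScript.
    message_html = requests.get(requests.compat.urljoin(room_url, link)).text
    print(link, len(message_html))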
I'm studying Scrapy and am trying to crawl this website: http://bananarepublic.gap.com/browse/category.do?cid=1055063&sop=true
However, my Scrapy code cannot find the product links listed on this website. Could anyone tell me why? The XPath I'm using is //a[@class="product-card--link"]/@href
Is this because of JavaScript? Suspecting so, I tried Scrapy-Splash, but I still cannot find the product links. Could someone please help?
Thank you!
The items are generated via AJAX requests. When you open the page, a JavaScript script runs and makes some extra HTTP requests to retrieve JSON data. However, Scrapy does not execute any JavaScript, so you need to manually find and call those AJAX requests.
See the related question Can scrapy be used to scrape dynamic content from websites that are using AJAX? for how to inspect network traffic and solve such cases.
In this particular case, you can see that the first XHR request being made returns a huge JSON file with all of the item data:
http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063&isFacetsEnabled=true&globalShippingCountryCode=&globalShippingCurrencyCode=&locale=en_US&
As you can see, the URL takes some arguments. Most importantly, it takes cid, which stands for category ID; the other arguments are mostly for calculating shipping prices, so if you don't care about those, this works just as well:
http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063
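So a spider can skip the HTML entirely and request the JSON directly. A minimal sketch, assuming the endpoint above; the JSON keys used here ("products", "name", "url") are illustrative guesses you should verify against the actual response:

import json

import scrapy


class GapProductsSpider(scrapy.Spider):
    """Call the productSearch JSON endpoint directly instead of rendering the page."""

    name = "gap_products"
    start_urls = [
        "http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063"
    ]

    def parse(self, response):
        data = json.loads(response.text)
        # The exact layout of the JSON has to be inspected by hand; the keys
        # below are illustrative guesses, not a documented schema.
        for product in data.get("products", []):
            yield {
                "name": product.get("name"),
                "url": product.get("url"),
            }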
An alternative that avoids digging deep into the AJAX requests would be to use Splash (https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/) to scrape the page after the AJAX has been processed.
This can be a bit easier to implement, and your XPath expression should work fine with Splash, but the scraper will be slower as it has to render each page.
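For completeness, here is what the Splash route might look like, assuming scrapy-splash is installed and wired up in settings.py (SPLASH_URL plus the middlewares from its README):

import scrapy
from scrapy_splash import SplashRequest  # needs a running Splash instance


class GapSplashSpider(scrapy.Spider):
    """Render the page with Splash first, then apply the original XPath."""

    name = "gap_splash"

    def start_requests(self):
        url = "http://bananarepublic.gap.com/browse/category.do?cid=1055063&sop=true"
        # args={"wait": 2} gives the AJAX calls time to finish before the render.
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # The AJAX-inserted links now exist in the rendered HTML.
        for href in response.xpath('//a[@class="product-card--link"]/@href').getall():
            yield {"url": response.urljoin(href)}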
I need to fetch company addresses (cím) from the site http://www.ceginfo.hu/.
Example Company Name: AB-KONTÍR Szolgáltató Bt.
I know how to do this using the WinHttp.WinHttpRequest object and Firebug.
But I am not able to work out which URL I should send the request to.
When I analyze the requests/responses using Firebug, I get the following URL:
http://www.ceginfo.hu/company/search/4221638
I think 4221638 is a company ID here. But in my case I will have the company name only, and that is my problem.
So can anybody please tell me how, using Firebug or any other tool, I can find a URL that takes the company name as a parameter and that I can use in my VBA code?
Thanks in advance!
So can anybody please tell me how, using Firebug or any other tool, I can find a URL that takes the company name as a parameter and that I can use in my VBA code?
No. Unless there is a publicly available database (I would suggest calling them, if you can) or an API that allows for programmatic access, the only way to arrive at this link slug is by executing the search.
Further, the URL slug is not as relevant as you think. If you search for simply "Kontir", this is the resulting page, which has many results:
http://www.ceginfo.hu/company/search/4222407
You're going to have to automate the "search": pass the criteria to the web page, execute the button click and/or HTTP POST, and then parse the result(s). For the example company name there is only one result, but as my example above shows, some queries may return multiple matches, and then you will need a method of dealing with these, or of ignoring them. A sketch of that flow follows below.
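The sketch is in Python here, but the same request/parse sequence translates directly to WinHttpRequest in VBA. The form action URL, the "searchtext" field name, and the result selector below are guesses to confirm in Firebug's Network tab when you submit the search by hand; they are not the site's documented interface.

import requests
from lxml import html

resp = requests.post(
    "http://www.ceginfo.hu/company/search",            # hypothetical form action
    data={"searchtext": "AB-KONTÍR Szolgáltató Bt."},  # hypothetical field name
)

page = html.fromstring(resp.text)
# A query may match several companies, so collect every result row.
for row in page.xpath('//div[@class="company-result"]'):   # placeholder selector
    print(row.text_content().strip())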
I am trying to scrape this page with Scrapy:
http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=7391
and the response I get is different from what I see in the browser. The browser shows the correct page, while the Scrapy response is the
http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=1
page. I have tried urllib2 as well, but I still have the same issue. Any help is much appreciated.
I don't really understand the issue, but usually a different response for a browser and Scrapy is caused by one of these:
the server analyzes your User-Agent header, and returns a specially crafted page for mobile clients or bots;
the server analyzes the cookies, and does something special when it looks like you are visiting for the first time;
you are trying to make a POST request via Scrapy like the browser does, but you forgot some form fields or supplied wrong values;
etc.
There is no universal way to determine what's wrong, because it depends on the server logic, which you don't know. If you are lucky, you will analyze and fix all the issues mentioned and make it work. The first two cases are the cheapest to rule out, as in the sketch below.
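A minimal sketch that presents a desktop browser User-Agent and keeps cookies enabled; the UA string is just an illustrative example:

import scrapy


class BarnesAndNobleSpider(scrapy.Spider):
    """Rule out User-Agent sniffing by presenting a desktop browser UA."""

    name = "bn"
    custom_settings = {
        # Any current desktop browser string will do; this one is illustrative.
        "USER_AGENT": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        ),
        # Keep cookies on so the server sees a consistent session (case 2 above).
        "COOKIES_ENABLED": True,
    }
    start_urls = ["http://www.barnesandnoble.com/s?dref=4815&sort=SA&startat=7391"]

    def parse(self, response):
        # Compare this with the URL/page the browser lands on.
        self.logger.info("Landed on %s", response.url)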
I am looking for a way to use the FireShot API with JS: given a URL (or perhaps a list of them), use the FireShot API to take a screenshot, upload it to Imgur, and then return the URLs to the user, or perhaps something like Markdown to use quickly in forums.
Method 1: Open new window
I tried opening the URL in a new window, but found that I can't control that page with JS due to cross-domain problems. The same goes for iframes.
Method 2: simple $.get()
A simple $.get() won't work because of the same cross-domain issues, I guess?
http://jsfiddle.net/t6aeq/
// $url is a jQuery-wrapped <input> holding the target address (see the fiddle).
$.get($url.val(), function(data) {
    // For cross-origin URLs the browser blocks the response, so without CORS
    // headers from the server this callback never runs.
    console.log(data);
});
Method 3: Via PHP "Proxy"
So I tried creating a simple PHP script that gets the HTML of the URL and returns it to my JS (using file_get_contents($url)). But some sites, like Microsoft's, will detect that I am using an automated method and serve an error page of sorts. I also can't seem to find a way to use jQuery to query that returned HTML for link[rel=stylesheet], script, style, and body to append to the head and a div respectively. I posted about that in another question.
A new idea: embed scripts at the browser level
So I thought a way of getting around these issues would be to use iMacros or Greasemonkey or something to insert scripts into pages at the browser level instead. Any guidance or tips on how I can do that? Also, I'd prefer a pure JS/PHP method if available, so users are not limited to using browser plugins/scripts (though I will be the only user for now).
It suddenly occurred to me that this may not work because the FireShot API key and Imgur are limited to the domain. Any solutions?
You might be able to inject the FireShot script using Greasemonkey. But first, use GM_xmlhttpRequest() to fetch an API key for that page's domain from the "Create FireShot API Key" page.
Note that GM_xmlhttpRequest() does not have the same cross-domain issues that $.get() has.
However, at this point you might be better off just writing your own Firefox add-on. Maybe start with FireShot's code for ideas. Also see the Screengrab add-on.