I have an Elasticsearch pipeline which indexes all scraped content into Elasticsearch. My problem is that the contents scraped from the start_urls pages are not indexed; those items never even pass through my Elasticsearch pipeline. What am I missing? Is there a setting in Scrapy to achieve this? Does Scrapy only consider content scraped from pages crawled from the start_urls pages?
Your question is not entirely clear to me, but first make sure that you have registered your pipeline in settings.py:
ITEM_PIPELINES = {
    "your_pipeline": 100,
}
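One thing worth ruling out: pipelines only ever see items, not responses, so if the callback handling your start_urls pages yields only Requests, nothing from those pages will reach any pipeline. A minimal diagnostic pipeline can confirm whether items are flowing at all (a sketch; the class name is a placeholder, not from the question):

```python
# Minimal sanity-check pipeline: a plain Python class, no scrapy import needed.
# Every *item* yielded by any callback passes through process_item.
class LoggingPipeline:
    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(item)  # record the item to confirm the pipeline ran
        return item
```

Register it temporarily at a low number in ITEM_PIPELINES and compare what it records against what lands in Elasticsearch; if items from the start pages never show up here, the issue is in the spider callbacks rather than in the pipeline.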
I'm working on a web-scraping project where I scrape products in a specific niche, and this is the URL I scrape data from:
https://www.aliexpress.com/af/category/200010058.html?categoryBrowse=y&origin=n&CatId=200010058&spm=a2g0o.home.108.18.650c2145CobYJm&catName=backpacks
I used Selenium as a middleware to render JavaScript, and this is the code for that:
def process_request(self, request, spider):
    self.driver.get(request.url)
    WebDriverWait(self.driver, 10).until(
        EC.element_to_be_clickable((By.CLASS_NAME, '_3t7zg'))
    )
    # self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(10)
    links = self.driver.find_elements("xpath", '//div[1]/div/div[2]/div[2]/div/div[2]/a[@href]')
The problem is that there are 60 products on the page, but when I count the products I have scraped I only find 11. I don't know what the problem is.
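Not a definitive diagnosis, but pages like this typically lazy-load products as you scroll, so a single page load only renders the first batch (your 11 of 60). One generic pattern is to keep scrolling until the element count stops growing. Here is a sketch of that loop written against plain callables, so it stays independent of Selenium specifics (the function name is my own, not a Selenium API):

```python
def scroll_until_stable(get_count, do_scroll, max_rounds=20):
    """Keep scrolling until the number of loaded elements stops growing.

    get_count: callable returning the current number of product elements
    do_scroll: callable that scrolls the page and waits for new content
    """
    last = get_count()
    for _ in range(max_rounds):
        do_scroll()
        current = get_count()
        if current == last:  # nothing new loaded, assume we are done
            return current
        last = current
    return last
```

With a real driver you would pass something like get_count=lambda: len(self.driver.find_elements("xpath", '//a[@href]')) and a do_scroll that runs window.scrollTo(0, document.body.scrollHeight) followed by a short wait.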
I'm pretty new to BeautifulSoup, and I'm trying to figure out why it doesn't work as expected.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.globes.co.il/news/article.aspx?did=1001285710")
bsObj = BeautifulSoup(html.read(), features="html.parser")
print(bsObj.find_all('iframe'))
I get a list of only 2 iframes.
However, when I open this page with a browser and type:
document.getElementsByTagName("iframe")
in dev-tools I get a list of 14 elements.
Can you please help me?
This is because that site dynamically adds more iframes once the page is loaded. Additionally, iframe content is dynamically loaded by a browser, and won't be downloaded via urlopen either. You may need to use Selenium to allow JavaScript to load the additional iframes, and then may need to search for the iframe and download the content via the src url.
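To illustrate the static side of this: urlopen only ever sees the markup the server sends, so only iframes present in that markup can be found. A self-contained sketch using just the standard library, with a made-up HTML snippet standing in for the server response:

```python
from html.parser import HTMLParser

class IframeCollector(HTMLParser):
    """Collect src attributes of <iframe> tags found in static HTML."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            self.srcs.append(dict(attrs).get("src"))

# Static markup as the server sends it: only two iframes exist here.
# The script that would inject more never runs outside a browser.
html = """
<html><body>
  <iframe src="/ads/banner.html"></iframe>
  <iframe src="/embed/video.html"></iframe>
  <script>/* would add more iframes at runtime */</script>
</body></html>
"""
collector = IframeCollector()
collector.feed(html)
print(collector.srcs)  # only the two static iframes are found
```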
So, I'm trying to scrape a website with infinite scrolling.
I'm following this tutorial on scraping infinite scrolling web pages: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
But the example given looks pretty easy, it's an orderly JSON object with the data you want.
I want to scrape this https://www.bahiablancapropiedades.com/buscar#/terrenos/venta/bahia-blanca/todos-los-barrios/rango-min=50.000,rango-max=350.000
The XHR response for each page is weird; it looks like corrupted HTML code.
This is how the Network tab looks
I'm not sure how to navigate the items inside "view". I want the spider to enter each item and crawl some information for every one.
In the past I've successfully done this with normal pagination and rules guided by XPaths.
https://www.bahiablancapropiedades.com/buscar/resultados/0
This is the XHR URL. While scrolling, the page loads 8 records per request, so take the total record count from the page via XPath and divide it by 8; that gives you the number of XHR requests to make. I ran into the same issue, and the logic below resolved it:
pagination_count = ...  # total number of results, read from the page via an XPath
pages = int(pagination_count) // 8
for page in range(pages):
    url = "https://www.bahiablancapropiedades.com/buscar/resultados/" + str(page)
    # pass this url to your Scrapy callback, e.g. with scrapy.Request(url, ...)
It is not corrupted HTML, it is escaped to prevent it from breaking the JSON. Some websites will return simple JSON data and others, like this one, will return the actual HTML to be added.
To get the elements you need to get the HTML out of the JSON response and create your own parsel Selector (this is the same as when you use response.css(...)).
You can try the following in scrapy shell to get all the links in one of the "next" pages:
scrapy shell https://www.bahiablancapropiedades.com/buscar/resultados/3
import json
import parsel
json_data = json.loads(response.text)
sel = parsel.Selector(json_data['view']) # view contains the HTML
sel.css('a::attr(href)').getall()
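To make the decode step concrete without hitting the site, here is a standard-library sketch with a made-up response body shaped like the site's JSON (parsel would work the same way on the real 'view' value):

```python
import json
from html.parser import HTMLParser

# A made-up XHR body shaped like the site's response: the HTML lives,
# escaped, inside the "view" field of a JSON object.
raw = '{"view": "<div class=\\"item\\"><a href=\\"/propiedad/123\\">Lote</a></div>"}'

html = json.loads(raw)["view"]  # after decoding it is plain HTML

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))

collector = LinkCollector()
collector.feed(html)
print(collector.hrefs)  # → ['/propiedad/123']
```

The "corruption" in the Network tab is just this escaping; once the JSON layer is stripped off, any HTML parser can walk the markup.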
I have a spider set up using link extractor rules. The spider crawls and scrapes the items that I expect, but it only follows the 'Next' pagination button to the 3rd page, where the spider then finishes without any errors; there are 50 pages in total to crawl via 'Next'. Here is my code:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider

class MySpider(CrawlSpider):
    name = 'my_spider'
    start_urls = [some_base_url]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@data-test="productGridContainer"]//a[contains(@data-test, "product-title")]'), callback='parse_item'),
        Rule(LinkExtractor(restrict_xpaths='//div[@data-test="productGridContainer"]//a[contains(@data-test, "next")]'), follow=True),
    )

    def parse_item(self, response):
        # inspect_response(response, self)
        ...  # items are being scraped here
        return scraped_info
It feels like I may be missing a setting or something, as the code functions as expected for the first 3 iterations. My settings file does not override the DEPTH_LIMIT default of 0. Any help is greatly appreciated, thank you.
EDIT 1
It appears it may not have anything to do with my code: if I start from a different product page I can get up to 8 pages scraped before the spider exits. I'm not sure whether it's the site I am crawling, or how to troubleshoot further.
EDIT 2
Troubleshooting it some more, it appears that my 'Next' link disappears from the web page. When I start on the first page, the pagination element is present to go to the next page; but when I view the response for the next page there are no products and no next-link element, so the spider thinks it is done. I have tried enabling cookies to see if the site requires a cookie in order to paginate; that doesn't have any effect. Could it be a timing thing?
EDIT 3
I have adjusted the download delay and concurrent request values to see if that makes a difference. I get the same results whether I pull the page of data in 1 second or 30 minutes. I am assuming 30 minutes is slow enough as I can manually do it faster than that.
EDIT 4
Tried to enable cookies along with the cookie middleware in debug mode to see if that would make a difference. The cookies are fetched and sent with the requests but I get the same behavior when trying to navigate pages.
To check whether the site denies too many requests in a short time, you can add the following code to your spider (e.g. before your rules statement) and play around with the value. See also the excellent documentation.
custom_settings = {
'DOWNLOAD_DELAY': 0.4
}
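If a fixed delay doesn't change anything, Scrapy's AutoThrottle extension adjusts the delay dynamically based on server latency instead. These are the standard AutoThrottle settings from the Scrapy documentation; the values here are illustrative and would need tuning for your target:

```python
custom_settings = {
    # Enable AutoThrottle so Scrapy adapts the delay to server response times.
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_START_DELAY': 1.0,         # initial download delay (seconds)
    'AUTOTHROTTLE_MAX_DELAY': 10.0,          # ceiling for the adaptive delay
    'AUTOTHROTTLE_TARGET_CONCURRENCY': 1.0,  # avg. concurrent requests per site
    'AUTOTHROTTLE_DEBUG': True,              # log every throttling decision
}
```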
I am trying to build a scraper using scrapy and I plan to use deltafetch to enable incremental refresh but I need to parse javascript based pages which is why I need to use splash as well.
In the settings.py file, we need to add
SPIDER_MIDDLEWARES = {'scrapylib.deltafetch.DeltaFetch': 100,}
to enable deltafetch, whereas for Splash we need to add
SPIDER_MIDDLEWARES = {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,}
I wanted to know how would both of them work together if both of them use some kind of spider middleware.
Is there some way in which I could use both of them?
For other answers see here and here. Essentially you can use the request meta parameter to manually set the deltafetch_key for the requests you are making. In this way you can request the same page with Splash even after you've successfully scraped items from that page with Scrapy and vice versa. Hope that helps!
from scrapy_splash import SplashRequest
from scrapy.utils.request import request_fingerprint

# (your spider code here)
yield SplashRequest(url, meta={'deltafetch_key': request_fingerprint(response.request)})
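On the settings side, SPIDER_MIDDLEWARES is a single dict, so both middlewares can simply be registered together under distinct priority numbers (the exact priorities below are illustrative):

```python
# Both spider middlewares coexist in one dict; the numbers only control
# the order in which Scrapy invokes them.
SPIDER_MIDDLEWARES = {
    'scrapylib.deltafetch.DeltaFetch': 100,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 101,
}
```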