Schedule job for SQL

I have a list of URLs that I need to crawl so that I can index the pages and add them to my searchindex.db:
urls = ['http://consequenceofsound.net/', 'http://www.tinymixtapes.com/', 'https://www.residentadvisor.net/']
this is how I've initialized my crawler class:

import sqlite3

class crawler:
    # Initialize the crawler with the name of the database
    def __init__(self, dbname):
        self.con = sqlite3.connect(dbname)

    def __del__(self):
        self.con.close()

    def dbcommit(self):
        self.con.commit()
this is the crawling method:

def crawl(self, pages, depth=2):
    (...)
    # code here that opens pages and adds links to the database
    (...)
    self.dbcommit()
    pages = newpages
here I instantiate my crawler class:

crawler = crawler('searchindex.db')
pagelist = [urls[0]]   # crawl() expects a list of pages
crawler.crawl(pagelist)
How do I schedule the URL crawling and page indexing so that each crawl-and-index run starts only after the previous one has finished or been interrupted for any reason?

Related

Scrapy concurrent spiders instance variables

I have a number of Scrapy spiders running and recently hit a strange bug. I have a base class and a number of subclasses:
class MyBaseSpider(scrapy.Spider):
    new_items = []

    def spider_closed(self):
        # Email any new items that weren't in the last run
        ...

class MySpiderImpl1(MyBaseSpider):
    def parse(self, response):
        # Implement site specific checks
        self.new_items.append(new_found_item)

class MySpiderImpl2(MyBaseSpider):
    def parse(self, response):
        # Implement site specific checks
        self.new_items.append(new_found_item)
This seems to have been running well: new items get emailed to me on a per-site basis. However, I've recently had some emails from MySpiderImpl1 which contain items from Site 2.
I'm following the documentation to run from a script:
scraper_settings = get_project_settings()
runner = CrawlerRunner(scraper_settings)
configure_logging()
sites = get_spider_names()
for site in sites:
    runner.crawl(site.spider_name)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()
I suspect the solution here is to switch to a pipeline which collates the items for a site and emails them out when pipeline.close_spider is called, but I was surprised to see the new_items variable leaking between spiders.
Is there any documentation on concurrent runs? Is it bad practice to keep variables on a base class? I also track other pieces of information on the spiders in variables, such as the run number; should this be tracked elsewhere?
In Python, class attributes are shared between all instances and all subclasses, so MyBaseSpider.new_items is the exact same list object that is used by MySpiderImpl1.new_items and MySpiderImpl2.new_items.
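A quick illustration of that sharing, independent of Scrapy:

class Base:
    items = []          # one list object, created when the class is defined

class A(Base):
    pass

class B(Base):
    pass

A().items.append(1)     # mutates the single list on Base
print(B().items)        # prints [1] -- every instance and subclass sees the same list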
As you suggested, you could implement a pipeline, although this might require significant refactoring of your current code. It could look something like this:
pipelines.py
class MyPipeline:
    def process_item(self, item, spider):
        if spider.name == 'site1':
            ... # email item
        elif spider.name == 'site2':
            ... # do something different
I am assuming all of your spiders have names; the name attribute is in fact required for every spider.
Another option that probably requires less effort is to override the start_requests method in your base class to assign a unique list at the start of the crawling process.
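For example, a minimal sketch of that second option, reusing the base class from the question:

import scrapy

class MyBaseSpider(scrapy.Spider):
    def start_requests(self):
        # shadow the shared class attribute with a fresh per-instance list
        self.new_items = []
        return super().start_requests()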

Scrapy Broad Crawl: Quickstart Example Project

Is there any code example showing the minimal structure of a broad crawl with Scrapy?
Some desirable requirements:
crawl in BFO order; (DEPTH_PRIORITY?)
crawl only from URLs that follow certain patterns; and (LinkExtractor?)
URLs must have a maximum depth. (DEPTH_LIMIT)
I am starting with:
import json

import scrapy
from scrapy import Request
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class WebSpider(scrapy.Spider):
    name = "webspider"

    def __init__(self):
        super().__init__()
        self.link_extractor = LxmlLinkExtractor(allow=r"\.br/")
        self.collection_file = open("collection.jsonl", 'w')

    start_urls = [
        "https://www.uol.com.br/"
    ]

    def parse(self, response):
        data = {
            "url": response.request.url,
            "html_content": response.text
        }
        self.collection_file.write(json.dumps(data) + "\n")
        for link in self.link_extractor.extract_links(response):
            yield Request(link.url, callback=self.parse)
Is that the correct approach to crawl and store the raw HTML on disk?
How to stop the spider when it has already collected n pages?
How to show some stats (pages/min, errors, number of pages collected so far) instead of the standard log?
It should work, but you are writing to a file manually instead of letting Scrapy do that for you via its feed exports.
Use CLOSESPIDER_PAGECOUNT. But mind that the spider only starts shutting down once it reaches the specified number of pages, so it will still finish processing some additional pages.
I suspect you just want to set LOG_LEVEL to something less verbose than the default.
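Putting those hints together, a sketch of the relevant settings (these are standard Scrapy setting names; the values are only illustrative):

# settings.py
DEPTH_PRIORITY = 1                       # positive value + FIFO queues => breadth-first (BFO) order
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
DEPTH_LIMIT = 3                          # maximum crawl depth
CLOSESPIDER_PAGECOUNT = 1000             # begin shutting down after ~1000 crawled pages
LOG_LEVEL = 'INFO'                       # less verbose than the default DEBUG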

Use scrapyd job id in scrapy pipelines

I've implemented a web application that is triggering scrapy spiders using scrapyd API (web app and scrapyd are running on the same server).
My web application is storing job ids returned from scrapyd in DB.
My spiders are storing items in DB.
Question is: how could I link in DB the job id issued by scrapyd and items issued by the crawl?
I could trigger my spider with an extra parameter, let's say an ID generated by my web application, but I'm not sure it's the best solution. After all, there is no need to create that ID if scrapyd already issues one...
Thanks for your help
The question should be phrased as "How can I get the job id of a scrapyd task at runtime?"
When scrapyd runs a spider it actually passes the job id to the spider as an argument.
It should always be the last element of sys.argv.
Also, os.environ['SCRAPY_JOB'] should do the trick.
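For example, read it anywhere in your spider code (a minimal sketch):

import os

job_id = os.environ.get('SCRAPY_JOB')  # scrapyd sets this for every scheduled job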
In the spider constructor (inside __init__), add the line:

self.jobId = kwargs.get('_job')

then in the parse function pass it along in the item:

def parse(self, response):
    data = {}
    ......
    data['jobId'] = self.jobId
    yield data

in the pipeline, add this:

def process_item(self, item, spider):
    self.jobId = item['jobId']
    .......

scrapyd multiple spiders writing items to same file

I have a scrapyd server with several spiders running at the same time; I start the spiders one by one using the schedule.json endpoint. All spiders are writing their contents to a common file using a pipeline:
class JsonWriterPipeline(object):
    def __init__(self, json_filename):
        # self.json_filepath = json_filepath
        self.json_filename = json_filename
        self.file = open(self.json_filename, 'wb')

    @classmethod
    def from_crawler(cls, crawler):
        save_path = '/tmp/'
        json_filename = crawler.settings.get('json_filename', 'FM_raw_export.json')
        completeName = os.path.join(save_path, json_filename)
        return cls(completeName)

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
Once the spiders are running I can see that they are collecting data correctly: items are stored in the per-spider XXXX.jl files and the spiders work fine. However, the crawled contents are not reflected in the common file; the pipeline does not seem to be doing its job of collecting data into the common file.
I also noticed that only one spider at a time writes to the file.
I don't see any good reason to do what you do :) You can change the json_filename setting by passing arguments in your scrapyd schedule.json request. Then you can make each spider generate slightly different files that you merge with post-processing or at query time. You can also write JSON files similar to what you have by just setting the FEED_URI value. If you write to a single file simultaneously from multiple processes (especially when you open it in 'wb' mode) you're asking for corrupted data.
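For example, a sketch of such a request using the requests library (the project and spider names are placeholders):

import requests

# schedule.json accepts a `setting` parameter that overrides a Scrapy setting for that job only
requests.post('http://localhost:6800/schedule.json', data={
    'project': 'myproject',
    'spider': 'spider1',
    'setting': 'json_filename=spider1_export.json',
})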
Edit:
After understanding a bit better what you need: in this case scrapyd is starting multiple crawls running different spiders, where each one crawls a different website, and a consumer process is monitoring a single file continuously.
There are several solutions, including:

named pipes: relatively easy to implement, and OK for very small items only
RabbitMQ or some other queueing mechanism: a great solution, but might be a bit of an overkill
a database, e.g. an SQLite-based solution: nice and simple, but likely requires some coding (a custom consumer)
an inotifywait-based or other filesystem-monitoring solution: nice and likely easy to implement
The last one seems like the most attractive option to me. When a scrapy crawl finishes (spider_closed signal), move, copy or soft-link the FEED_URI file to a directory that you monitor with a small script; mv and ln are atomic unix operations, so you should be fine. Then hack the script to append each new file to the tmp file that you feed once to your consumer program.
This way you use the default feed exporters to write your files. The end solution is so simple that you don't even need a pipeline; a simple extension should fit the bill.
On an extensions.py in the same directory as settings.py:
from scrapy import signals
from scrapy.exceptions import NotConfigured

class MoveFileOnCloseExtension(object):
    def __init__(self, feed_uri):
        self.feed_uri = feed_uri

    @classmethod
    def from_crawler(cls, crawler):
        # instantiate the extension object
        feed_uri = crawler.settings.get('FEED_URI')
        ext = cls(feed_uri)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        # return the extension object
        return ext

    def spider_closed(self, spider):
        # Move the file to the proper location
        # os.rename(self.feed_uri, ... destination path...)
On your settings.py:
EXTENSIONS = {
    'myproject.extensions.MoveFileOnCloseExtension': 500,
}
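On the consumer side, a hedged sketch that simply polls the monitored directory instead of using inotifywait (all paths are placeholders):

import os
import shutil
import time

WATCH_DIR = '/tmp/feeds_done'   # where spider_closed moves the finished feed files
COMBINED = '/tmp/combined.jl'   # the single file your consumer reads

seen = set()
while True:
    for name in sorted(os.listdir(WATCH_DIR)):
        path = os.path.join(WATCH_DIR, name)
        if path not in seen:
            # append the whole finished feed to the combined file
            with open(path, 'rb') as src, open(COMBINED, 'ab') as dst:
                shutil.copyfileobj(src, dst)
            seen.add(path)
    time.sleep(5)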

Is Scrapy able to scrape the data once I initialise its object?

Is it possible for Scrapy to work like this: when I call a function such as scrape.crawl("website") from a class, it redirects to the class where the scraping code lives and executes it?
I tried to find this in various sources, and most of them told me to write it in script form, but I couldn't find a working example showing how to initialise the object so as to call the script.
I came close with this code, but it's not working:
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()
Calling the object?
spider = DmozSpider()
Any kind souls with a working example of what I want?
For this you need quite a complex structure, if I understand your question right.
If you have your instance of the spider you need to set up a Crawler and start it afterwards. For example:
crawler = Crawler(get_project_settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
This is the base, but you should be able to get started with it. However, as I've said, it is quite complex and you need some configuration besides this to get it running.
Update
If you have a URL and want Scrapy to crawl that site you could do it like this:
def __init__(self, url, *args, **kwargs):
    super(DmozSpider, self).__init__(*args, **kwargs)
    self.start_urls = [url]
And then start crawling as described above. Because Scrapy spiders begin crawling as soon as they are started, you need the right sequence: set the start URL first, then start the crawl.
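With current Scrapy versions the same idea is simpler with CrawlerProcess; a sketch, assuming the DmozSpider above with the url constructor argument:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# keyword arguments are passed through to the spider's __init__
process.crawl(DmozSpider, url="http://www.dmoz.org/Computers/Programming/Languages/Python/Books/")
process.start()  # blocks until the crawl finishes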