When I use Scrapy with scrapy-redis to write a distributed crawler, only the request queue is stored in Redis; no deduplication fingerprint set is ever created.
# Spider code
import scrapy
from datetime import datetime
from machinedigikey.items import Detail_Item
from scrapy_redis.spiders import RedisSpider
class DgkUpdateDetailSpider(RedisSpider):
# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.PriorityQueue"
SCHEDULER_PERSIST = True
REDIS_URL = "redis://127.0.0.1:6379"
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
MONGO_URI = 'mongodb://localhost:27017'
# Startup method
scrapy crawl dgk_update_detail
There are no errors, but when I check Redis there is no deduplication fingerprint set, only dgk_update_detail:requests.
I would also like to ask: if the crawl breaks while running and I delete all of the crawler code without clearing the data in Redis, then modify and redeploy it, will the crawl continue from where it left off?
The crawl is stopped with:
kill -s 9 <pid>
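For reference, this is roughly how the Redis keys can be inspected (a minimal sketch using redis-py, assuming the REDIS_URL above and scrapy-redis's default key names):
# Minimal sketch: list the keys scrapy-redis has created for this spider.
# Assumes redis-py is installed and the REDIS_URL from the settings above.
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

for key in r.scan_iter("dgk_update_detail:*"):
    # With the scheduler and RFPDupeFilter configured, one would normally
    # expect both dgk_update_detail:requests (the request queue) and
    # dgk_update_detail:dupefilter (the fingerprint set) to appear here.
    print(key, r.type(key))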
We are trying to take a backup of Splunk dashboards and reports source code for versioning. We are on an enterprise implementation where our REST calls are restricted. We can create and access dashboards and reports via the Splunk UI, but would like to know if we can automatically back them up and store them in our versioning system.
Automated versioning will be quite a challenge without REST access. I assume you don't have CLI access or you wouldn't be asking.
There are apps available to do this for you. See https://splunkbase.splunk.com/app/4355/ and https://splunkbase.splunk.com/app/4182/ .
There's also a .conf presentation on the topic. See https://conf.splunk.com/files/2019/slides/FN1315.pdf
For the time being, I have written a Python script that reads/intercepts the browser's Inspect -> Performance -> Network responses for Splunk's reports URL, which lists the complete set of reports under my app along with their full details.
from time import sleep
import json
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
# make chrome log requests
capabilities = DesiredCapabilities.CHROME
capabilities["goog:loggingPrefs"] = {"performance": "ALL"} # newer: goog:loggingPrefs
driver = webdriver.Chrome(
desired_capabilities=capabilities, executable_path="/Users/username/Downloads/chromedriver_92"
)
spl_reports_url="https://prod.aws-cloud-splunk.com/en-US/app/sre_app/reports"
driver.get(spl_reports_url)
sleep(5) # wait for the requests to take place
# extract requests from logs
logs_raw = driver.get_log("performance")
logs = [json.loads(lr["message"])["message"] for lr in logs_raw]
# create directory to save all reports as .json files
from pathlib import Path
main_bkp_folder='splunk_prod_reports'
# Create a main directory in which all dashboards will be downloaded to
Path(f"./{main_bkp_folder}").mkdir(parents=True, exist_ok=True)
# Function to write json content to file
def write_json_to_file(filenamewithpath, json_source):
    with open(filenamewithpath, 'w') as jsonfileobj:
        json_string = json.dumps(json_source, indent=4)
        jsonfileobj.write(json_string)
def log_filter(log_):
    return (
        # is an actual response
        log_["method"] == "Network.responseReceived"
        # and json
        and "json" in log_["params"]["response"]["mimeType"]
    )
counter = 0
# extract Network entries from each log event
for log in filter(log_filter, logs):
    # print(log)
    request_id = log["params"]["requestId"]
    resp_url = log["params"]["response"]["url"]
    # process only the saved-searches (reports) responses
    if "searches" in resp_url:
        print(f"Caught {resp_url}")
        counter = counter + 1
        nw_resp_body = json.loads(
            driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id})["body"]
        )
        for each_report in nw_resp_body["entry"]:
            report_name = each_report['name']
            print(f"Extracting report source for {report_name}")
            report_filename = f"./{main_bkp_folder}/{report_name.replace(' ', '_')}.json"
            write_json_to_file(report_filename, each_report)
        print("Completed.")
print("All reports source code exported successfully.")
The above code is far from a production version; I have yet to add error handling, logging, and modularization. Also note that the above script drives the browser UI; in production the script will run in a Docker image with ChromeOptions set to headless mode.
Instead of
driver = webdriver.Chrome(
desired_capabilities=capabilities, executable_path="/Users/username/Downloads/chromedriver_92"
)
Use:
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--window-size=1420,2080')
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
driver = webdriver.Chrome(
    desired_capabilities=capabilities, options=chrome_options, executable_path="/Users/username/Downloads/chromedriver_92"
)
Here you can customize as per your need.
I am running a scrapy spider to export some football data and using the scrapy splash plugin.
For development I am running the spider against cached results, so as not to hit the website too much. The strange thing is that I am consistently missing some items in the export when running the spider with multiple start_urls. The number of missing items differs slightly each time.
However, when I comment out all but one of the start_urls and run the spider for each one separately, I get all the results. I am not sure if this is a bug in Scrapy or if I am missing something about Scrapy here, as this is my first project with the framework.
Here is my caching configuration:
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = [403]
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
These are my start_urls:
start_urls = [
'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2017',
'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2018',
'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2019',
'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2020',
'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2017',
'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2018',
'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2019',
'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2020'
]
I have a standard setup with an export pipeline and my spider yielding splash requests multiple times for each relevant url on the page from a parse method. Each parse method fills the same item passed via cb_kwargs or creates a new one with data from the passed item.
Please let me know if further code from my project, like the spider, pipelines or item loaders might be relevant to the issue and I will edit my question here.
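For illustration, this is roughly the pattern (a minimal sketch, not my actual spider; the selectors and field names are placeholders):
# Hedged sketch of the described setup: an item is passed to the next
# callback via cb_kwargs and filled in there before being yielded.
import scrapy
from scrapy_splash import SplashRequest


class SeasonSpider(scrapy.Spider):
    name = "season_example"  # placeholder name
    start_urls = [
        "https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2017",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, callback=self.parse, args={"wait": 1})

    def parse(self, response):
        # placeholder selector: one detail link per relevant row
        for href in response.css("td.hauptlink a::attr(href)").getall():
            item = {"season": response.url.split("saison_id=")[-1]}
            yield SplashRequest(
                response.urljoin(href),
                callback=self.parse_detail,
                cb_kwargs={"item": item},
                args={"wait": 1},
            )

    def parse_detail(self, response, item):
        # fill the item passed in via cb_kwargs, then hand it to the pipeline
        item["name"] = response.css("h1::text").get()  # placeholder selector
        yield item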
Would there be any code example showing a minimal structure of a broad crawl with Scrapy?
Some desirable requirements:
crawl in BFO order; (DEPTH_PRIORITY?)
crawl only from URLs that follow certain patterns; and (LinkExtractor?)
URLs must have a maximum depth. (DEPTH_LIMIT)
I am starting with:
import scrapy
from scrapy import Request
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
class WebSpider(scrapy.Spider):
    name = "webspider"

    def __init__(self):
        super().__init__()
        self.link_extractor = LxmlLinkExtractor(allow=r"\.br\/")
        self.collection_file = open("collection.jsonl", 'w')

    start_urls = [
        "https://www.uol.com.br/"
    ]

    def parse(self, response):
        data = {
            "url": response.request.url,
            "html_content": response.body
        }
        self.collection_file.write(f"{data}\n")
        for link in self.link_extractor.extract_links(response):
            yield Request(link.url, callback=self.parse)
Is that the correct approach to crawl and store the raw HTML on disk?
How to stop the spider when it has already collected n pages?
How to show some stats (pages/min, errors, number of pages collected so far) instead of the standard log?
It should work, but you are writing to a file manually instead of using Scrapy for that.
CLOSESPIDER_PAGECOUNT. But mind that it starts shutting down once you reach the specified number of pages, while some additional pages that are already in flight will still be processed.
I suspect you just want to set LOG_LEVEL to something else.
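Putting those pieces together, a hedged settings sketch for the crawl described above might look like this (the values are illustrative, not a definitive configuration):
# settings.py -- illustrative values only
# Crawl in BFO (breadth-first) order.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"

# Maximum crawl depth.
DEPTH_LIMIT = 3

# Start shutting down after roughly this many crawled pages
# (pages already in flight will still be processed).
CLOSESPIDER_PAGECOUNT = 10000

# Let Scrapy write the output instead of writing the file manually
# (FEEDS needs Scrapy 2.1+; older versions use FEED_URI / FEED_FORMAT).
FEEDS = {
    "collection.jsonl": {"format": "jsonlines"},
}

# Quieter logging; the LogStats extension still reports pages/min
# and items/min every LOGSTATS_INTERVAL seconds.
LOG_LEVEL = "INFO"
LOGSTATS_INTERVAL = 60.0
Note that the FEEDS entry only helps if the spider yields its data as items (e.g. {"url": response.url, "html_content": response.text}) instead of writing collection.jsonl itself.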
I have a scrapyd server with several spiders running at the same time; I start the spiders one by one using the schedule.json endpoint. All spiders write their content to a common file using a pipeline:
import json
import os


class JsonWriterPipeline(object):

    def __init__(self, json_filename):
        # self.json_filepath = json_filepath
        self.json_filename = json_filename
        self.file = open(self.json_filename, 'wb')

    @classmethod
    def from_crawler(cls, crawler):
        save_path = '/tmp/'
        json_filename = crawler.settings.get('json_filename', 'FM_raw_export.json')
        completeName = os.path.join(save_path, json_filename)
        return cls(
            completeName
        )

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line.encode('utf-8'))  # file is opened in binary mode
        return item
After the spiders are running I can see that they are collecting data correctly: items are stored in the XXXX.jl files and the spiders work fine. However, the crawled content is not reflected in the common file; the pipeline does not seem to be doing its job of collecting data into that file.
I also noticed that only one spider writes to the file at a time.
I don't see any good reason to do what you do :) You can change the json_filename setting by setting arguments on your scrapyd schedule.json request. Then you can make each spider generate slightly different files that you merge with post-processing or at query time. You can also write JSON files similar to what you have by just setting the FEED_URI value (example). If you write to a single file simultaneously from multiple processes (especially when you open with 'wb' mode) you're asking for corrupted data.
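For the first suggestion, scheduling each spider with its own json_filename could look roughly like this (a hedged sketch; the project and spider names are placeholders):
# Hedged sketch: give every scheduled crawl its own output file by passing
# a per-job setting to scrapyd's schedule.json endpoint.
import requests

SCRAPYD = "http://localhost:6800"

for spider in ["spider_a", "spider_b", "spider_c"]:  # placeholder names
    response = requests.post(
        f"{SCRAPYD}/schedule.json",
        data={
            "project": "myproject",  # placeholder project name
            "spider": spider,
            # overrides the json_filename setting read in from_crawler()
            "setting": f"json_filename={spider}_export.json",
        },
    )
    print(spider, response.json())
The per-spider files can then be merged with post-processing or at query time.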
Edit:
After understanding a bit better what you need: in this case scrapyd is starting multiple crawls running different spiders, where each one crawls a different website, and the consumer process monitors a single file continuously.
There are several solutions, including the following:
Named pipes: relatively easy to implement and OK for very small items only (see here).
RabbitMQ or some other queueing mechanism: a great solution, but might be a bit of an overkill.
A database, e.g. an SQLite-based solution: nice and simple, but likely requires some coding (a custom consumer).
An inotifywait-based or other filesystem-monitoring solution: nice and likely easy to implement.
The last one seems like the most attractive option to me. When the crawl finishes (spider_closed signal), move, copy or soft-link the FEED_URI file into a directory that you monitor with a script like this; mv and ln are atomic Unix operations, so you should be fine. Then hack the script to append the new file to the tmp file that you feed to your consumer program.
This way you use the default feed exporters to write your files, and the end solution is so simple that you don't need a pipeline; a simple extension fits the bill.
On an extensions.py in the same directory as settings.py:
from scrapy import signals
from scrapy.exceptions import NotConfigured


class MoveFileOnCloseExtension(object):

    def __init__(self, feed_uri):
        self.feed_uri = feed_uri

    @classmethod
    def from_crawler(cls, crawler):
        # instantiate the extension object
        feed_uri = crawler.settings.get('FEED_URI')
        ext = cls(feed_uri)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        # return the extension object
        return ext

    def spider_closed(self, spider):
        # Move the file to the proper location, e.g.:
        # os.rename(self.feed_uri, ... destination path...)
        pass
On your settings.py:
EXTENSIONS = {
'myproject.extensions.MoveFileOnCloseExtension': 500,
}
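The monitoring script mentioned above is inotifywait-based; as a rough Python stand-in for that consumer side (the paths and the append-everything-into-one-file behaviour are assumptions, not part of the original answer):
# Hedged sketch of the consumer side: poll a drop directory and append any
# newly arrived feed files to the single file the consumer reads.
import os
import time

WATCH_DIR = "/tmp/finished_feeds"      # where spider_closed moves the feeds (placeholder)
COMBINED = "/tmp/FM_raw_export.json"   # the single file the consumer follows (placeholder)

seen = set()
while True:
    for name in sorted(os.listdir(WATCH_DIR)):
        path = os.path.join(WATCH_DIR, name)
        if path in seen or not os.path.isfile(path):
            continue
        with open(path, "rb") as src, open(COMBINED, "ab") as dst:
            dst.write(src.read())
        seen.add(path)
    time.sleep(5)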
I am new to Python; I just learnt how to create an API using Flask-Restless and Flask-SQLAlchemy. I would however like to seed the database with random values. How do I achieve this? Please help.
Here is the API code:
import flask
import flask.ext.sqlalchemy
import flask.ext.restless
import datetime
DATABASE = 'sqlite:///tmp/test.db'
#Create the Flask application and the FLask-SQLALchemy object
app = flask.Flask(__name__)
app.config ['DEBUG'] = True
app.config['SQLALCHEMY_DATABASE_URI'] = DATABASE
db = flask.ext.sqlalchemy.SQLAlchemy(app)
#create Flask-SQLAlchemy models
class TodoItem(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    todo = db.Column(db.Unicode)
    priority = db.Column(db.SmallInteger)
    due_date = db.Column(db.Date)
#Create database tables
db.create_all()
#Create Flask restless api manager
manager = flask.ext.restless.APIManager(app, flask_sqlalchemy_db = db)
#Create api end points
manager.create_api(TodoItem, methods = ['GET','POST','DELETE','PUT'], results_per_page = 20)
#Start flask loop
app.run()
I had a similar question and did some research, found something that worked.
The pattern I am seeing is based on registering a Flask CLI custom command, something like: flask seed.
This would look like this given your example. First, import the following into your api code file (let's say you have it named server.py):
import click
from flask.cli import with_appcontext
(I see you do import flask, but I would just add that you should change these to from flask import what_you_need.)
Next, create a function that does the seeding for your project:
@click.command("seed")
@with_appcontext
def seed():
    """Seed the database."""
    todo1 = TodoItem(...).save()
    todo2 = TodoItem(...).save()
    todo3 = TodoItem(...).save()
Finally, register these command with your flask application:
def register_commands(app):
    """Register CLI commands."""
    app.cli.add_command(seed)
After you've configured your application, make sure you call register_commands to register the commands:
register_commands(app)
At this point, you should be able to run: flask seed. You can add more functions (maybe a flask reset) using the same pattern.
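For example, a hypothetical flask reset command following the same pattern might look like this (what "reset" does here, dropping and recreating the tables, is an assumption):
# Hypothetical companion command; the body is an assumption about what a
# "reset" should do for this project.
@click.command("reset")
@with_appcontext
def reset():
    """Drop and recreate all database tables."""
    db.drop_all()
    db.create_all()
It is registered the same way as seed, with app.cli.add_command(reset).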
From another newbie, the forgerypy and forgerypy3 libraries are available for this purpose (though they look like they haven't been touched in a bit).
A simple example of how to use them by adding them to your model:
class TodoItem(db.Model):
    # ... columns as defined above ...

    @staticmethod
    def generate_fake_data(records=10):
        import forgery_py
        from random import randint

        for record in range(records):
            todo = TodoItem(todo=forgery_py.lorem_ipsum.word(),
                            due_date=forgery_py.date.date(),
                            priority=randint(1, 4))
            db.session.add(todo)
            try:
                db.session.commit()
            except:
                db.session.rollback()
You would then call the generate_fake_data method in a shell session.
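For instance, a hedged sketch of such a shell session (it assumes the API code above lives in an importable module named server.py and that the app.run() call is guarded by if __name__ == "__main__":):
# Hedged usage sketch for a plain Python shell session.
from server import app, db, TodoItem   # module name is an assumption

with app.app_context():                # be explicit about the app context
    TodoItem.generate_fake_data(records=25)
    print(TodoItem.query.count())      # quick check that rows were inserted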
And Miguel Grinberg's Flask Web Development (the O'Reilly book, not blog) chapter 11 is a good resource for this.