Running Scrapy daily and tracking changes in the data

So I've written my first scraper with Scrapy and I'm having some trouble with the next steps. I want to run the scraper daily, probably with cron, and track the changes in the values I've scraped. When I export to a json or csv file, then run the scraper again, the new data gets dumped into the same file. Is there a way to make each scrape export into a separate file? Any insight would be great, thanks!

Tell Scrapy the name of the file to write to using -o:
$ scrapy crawl -h | grep output=
--output=FILE, -o FILE dump scraped items into FILE (use - for stdout)
You can use the current date as the file name, for example:
$ scrapy crawl <spider-name> -t json -o "$(date '+%Y-%m-%d').json"
(use -t csv and a .csv extension for CSV output)
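Since you mention running it daily with cron, here is a minimal sketch of a crontab entry (the project path, virtualenv path, and spider name are placeholders; note that % must be escaped as \% inside crontab):
# run every day at 03:00 from the project directory, writing a date-stamped file
0 3 * * * cd /home/you/myproject && /home/you/venv/bin/scrapy crawl <spider-name> -o "items-$(date +\%Y-\%m-\%d).json"
Outside of cron, Scrapy's feed exports also accept the %(time)s placeholder, e.g. -o 'items-%(time)s.json', which gives each run its own file without calling date.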

Related

Extract huge tar.gz archives from S3 without copying archives to a local system

I'm looking for a way to extract a huge dataset (18 TB+, found here: https://github.com/cvdfoundation/open-images-dataset#download-images-with-bounding-boxes-annotations). With this in mind, I need the process to be fast (i.e. I don't want to spend twice the time by first copying and then extracting the files). I also don't want the archives to take up extra space, not even one 20 GB+ archive.
Any thoughts on how one can achieve that?
If you can arrange to pipe the data straight into tar, it can uncompress and extract it without needing a temporary file.
Here is an example. First, create a tar file to play with:
$ echo abc >one
$ echo def >two
$ tar cvf test.tar one two
one
two
$ gzip test.tar
Remove the test files
$ rm one two
$ ls one two
ls: cannot access one: No such file or directory
ls: cannot access two: No such file or directory
Now extract the contents by piping the compressed tar file into the tar command.
$ cat test.tar.gz | tar xzvf -
one
two
$ ls one two
one two
The only part missing now is how to download the data and pipe it into tar. Assuming you can access the URL with wget, you can get it to send the data to stdout, so you end up with this:
wget -qO- https://yourdata | tar xzvf -
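Since the data lives on S3, here is a sketch using the AWS CLI (the bucket, key, and destination path are placeholders, not the real Open Images locations): aws s3 cp can stream an object to stdout when the destination is -, so the archive never touches local disk before tar sees it:
# stream the object from S3 straight into tar, extracting into the target directory
aws s3 cp s3://your-bucket/your-archive.tar.gz - | tar xzvf - -C /path/to/destination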

running python program on virtual machine

I created a GCP VM (Ubuntu) and installed Python and Scrapy.
I would like to run my spider from there: scrapy crawl test -o test1.csv
I opened the terminal from GCP and ran the spider (it worked), but it will take at least 3 hours.
How can I make sure that when I exit the terminal (browser), the script will continue running?
You can use nohup to make sure the crawling continues:
nohup scrapy crawl test -o test1.csv &
When you log off, the crawler will continue until it finishes. The & at the end makes the process execute in the background.
To redirect the output to a log file, you can execute it as follows:
nohup scrapy crawl test -o test1.csv &> test.log &
For a better way to run and deploy spiders on a server, you can check out Scrapyd.
You can create a run.py file in the spiders directory.
File contents:
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'test', '-o', 'test1.csv'])
After that
nohup python -u run.py > spider_house.log 2>&1 &
If logging is configured inside the crawler, the log will be written according to the crawler's own log settings, and the output redirected by nohup will not be used.
If a resumable (pause/resume) crawl is configured via the JOBDIR= setting, you can gracefully pause the crawler so that the next run resumes from where it left off. To close the crawler gracefully, send it SIGINT:
kill -2 <pid>
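Putting it together, a minimal sketch of the pause/resume workflow (the spider name test and the JOBDIR path crawls/test-1 are placeholders):
# start a resumable crawl in the background so it survives logout
nohup scrapy crawl test -s JOBDIR=crawls/test-1 -o test1.csv &> test.log &
# pause it gracefully: a single SIGINT lets Scrapy finish in-flight requests and save its state
kill -2 $(pgrep -f "scrapy crawl test")
# resume later with the same JOBDIR; the crawl picks up where it was paused
nohup scrapy crawl test -s JOBDIR=crawls/test-1 -o test1.csv &> test.log &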

How to save output of python-swiftclient to file when downloading a directory?

Sometimes I get errors when I download files from a cloud with python-swiftclient, like this one:
Error downloading object 'uploads/1/image.png': Object GET failed: https://orbit.brightbox.com/v1/acc-12345/uploads/1/image.png 500 Internal Error b'An error occurred'
To search for all the errors and re-download the failed files, I want to save the output of the swift command to a file.
I tried the following:
swift-cli -A https://orbit.brightbox.com/v1/acc-12345 \
-U user -K secret download uploads 2>&1 | tee uploads.log
# and
swift-cli -A https://orbit.brightbox.com/v1/acc-12345 \
-U user -K secret download uploads > uploads.log
But this didn't work. man swift describes the -o option:
For a single object download, you may use the -o [--output]
option to redirect the output to a specific file or if "-" then just redirect to stdout or with --no-download actually not to write anything to disk.
but when I try to download a directory with the -o option, it fails with:
-o option only allowed for single file downloads
How can I save log to a file when I download a directory with swift CLI?
Actually redirecting output to a file works with swift-client:
swift-cli -A https://orbit.brightbox.com/v1/acc-12345 \
-U user -K secret download uploads > uploads.log
I was confused because after I started the command above, I ran the following in another terminal window:
tail -f uploads.log
But it didn't give me any output (unlike what I was seeing when running the download command without redirection).
It seems that swift-client writes to the file in batches, and I needed to wait about a minute until tail -f dumped a hundred lines like this into the console:
uploads/documents/1/image.png [auth 0.000s, headers 0.390s, total 14.361s, 0.034 MB/s]
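With the output captured, here is a sketch for re-downloading just the failed objects (this assumes the error lines keep the exact format shown in the question and that your client accepts the same flags and command name used above):
# capture both stdout and stderr, since the error messages may go to stderr
swift-cli -A https://orbit.brightbox.com/v1/acc-12345 \
    -U user -K secret download uploads 2>&1 | tee uploads.log
# pull the object names out of the error lines and retry them one by one
grep "^Error downloading object" uploads.log \
    | sed "s/^Error downloading object '\([^']*\)'.*/\1/" \
    | while read -r obj; do
        swift-cli -A https://orbit.brightbox.com/v1/acc-12345 \
            -U user -K secret download uploads "$obj"
      done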

Disable Scrapyd item storing in .jl feed

Question
I want to know how to disable Item storing in scrapyd.
What I tried
I deployed a spider to the Scrapy daemon Scrapyd. The deployed spider stores the scraped data in a database, and that works fine.
However, Scrapyd records each scraped item. You can see this when examining the Scrapyd web interface.
This item data is stored in ..../items/<project name>/<spider name>/<job name>.jl
I have no clue how to disable this. I run Scrapyd in a Docker container and it uses way too much storage.
I have tried the approach from "Suppress Scrapy Item printed in logs after pipeline", but this seems to do nothing for Scrapyd. All spider logging settings seem to be ignored by Scrapyd.
Edit
I found this entry in the documentation about item storing. It seems that if you omit the items_dir setting, item storing will not happen. It is said to be disabled by default. I do not have a scrapyd.conf file, so item storing should be disabled, yet it is not.
After writing my answer I re-read your question, and I see that what you want has nothing to do with logging but is about not writing to the (default-ish) .jl feed (maybe update the title to: "Disable Scrapyd item storing"). To override Scrapyd's default, just set FEED_URI to an empty string, like this:
$ curl http://localhost:6800/schedule.json -d project=tutorial -d spider=example -d setting=FEED_URI=
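Alternatively, if you would rather fix it at the Scrapyd level than per scheduled job, here is a sketch of a minimal scrapyd.conf (unlisted settings fall back to Scrapyd's defaults; whether an empty items_dir is already the default depends on your Scrapyd version):
[scrapyd]
# an empty items_dir tells Scrapyd not to write the per-job .jl item feeds
items_dir =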
For other people who are looking into logging... Let's see an example. We do the usual:
$ scrapy startproject tutorial
$ cd tutorial
$ scrapy genspider example example.com
then edit tutorial/spiders/example.py to contain the following:
import scrapy

class TutorialItem(scrapy.Item):
    name = scrapy.Field()
    surname = scrapy.Field()

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        for i in range(100):
            t = TutorialItem()
            t['name'] = "foo"
            t['surname'] = "bar %d" % i
            yield t
Notice the difference between running:
$ scrapy crawl example
# or
$ scrapy crawl example -L DEBUG
# or
$ scrapy crawl example -s LOG_LEVEL=DEBUG
and
$ scrapy crawl example -s LOG_LEVEL=INFO
# or
$ scrapy crawl example -L INFO
By trying such combinations on your spider, confirm that it doesn't print item info for log levels above DEBUG.
Now it's time, after you deploy to Scrapyd, to do exactly the same:
$ curl http://localhost:6800/schedule.json -d setting=LOG_LEVEL=INFO -d project=tutorial -d spider=example
Confirm that the logs don't contain item data when run this way.
Note that if your items are still printed at the INFO level, it likely means that your code or some pipeline is printing them. You could raise the log level further and/or investigate to find the code that prints them and remove it.

Mahout seqdirectory not making a new file

I am trying to convert a text file into a sequence file that I can run mahout kmeans on. When I run the seqdirectory utility, I do not get any errors and it says that the program is completed. However, when I look in the output directory, it is empty. I've looked around and can't find any solutions to this. Thoughts?
Here is what I run in the terminal:
hduser@ubuntu:~$ $MAHOUT_HOME/bin/mahout seqdirectory --input Downloads/google/ --output Downloads/sparsefiles/ -c UTF-8
This is the output I get:
12/07/06 06:24:19 INFO driver.MahoutDriver: Program took 1091 ms (Minutes: 0.018183333333333333)
I think it may be producing the output on hdfs. Try checking:
hadoop dfs -ls Downloads/sparsefiles/
Also, to ensure it writes the output to your local filesystem, you can modify the command like this:
$MAHOUT_HOME/bin/mahout seqdirectory --input file://<home path>/Downloads/google/ --output file://<home path>/Downloads/sparsefiles/ -c UTF-8