Scrapy not exporting to csv

I have just created a new scrapy project after ages, and seem to be forgetting something. In any case, my spider runs great, but does not store the output to a csv. Is there something that needs to go into the pipeline or settings files? I am using this command:
scrapy crawl ninfo --set FEED_URI=myinfo.csv --set FEED_FORMAT=csv
Any help is appreciated, Thanks.
TM

Try with this command:
$ scrapy crawl ninfo -o myinfo.csv -t csv
See http://doc.scrapy.org/en/latest/intro/tutorial.html#storing-the-scraped-data (the only difference being that they use it to generate JSON data, but Scrapy also ships with a CSV exporter: http://doc.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-format-csv)
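If you prefer not to type the feed options on every run, you can also configure the export in your project's settings.py. A minimal sketch, assuming a reasonably recent Scrapy version that has the FEEDS setting (older releases use FEED_URI / FEED_FORMAT instead):
# settings.py -- sketch; FEEDS is available in newer Scrapy releases
FEEDS = {
    "myinfo.csv": {
        "format": "csv",     # use the built-in CSV exporter
        "overwrite": True,   # replace the file on each run (newer Scrapy only)
    },
}
With that in place, a plain scrapy crawl ninfo should write myinfo.csv every time.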

Related

How can I use scrapy middleware in the scrapy Shell?

In a Scrapy project one uses middleware quite often. Is there a generic way of enabling middleware in the scrapy shell during interactive sessions as well?
Middlewares set up in settings.py are enabled by default in the Scrapy shell; you can see this in the logs when running scrapy shell.
So to answer your question: yes, you can do so with this command:
scrapy shell -s DOWNLOADER_MIDDLEWARES='<<your custom middleware>>'
You can override settings using the -s parameter.
Remember to run scrapy shell inside a folder that contains a Scrapy project; it will then load the default settings from settings.py.
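As a concrete illustration (the project and class names below are made up), a middleware registered like this in settings.py will show up under "Enabled downloader middlewares" in the scrapy shell startup logs:
# middlewares.py -- minimal, hypothetical downloader middleware for illustration
class LoggingDownloaderMiddleware:
    def process_request(self, request, spider):
        spider.logger.info(f"Fetching {request.url}")
        return None  # returning None lets Scrapy continue processing the request

# settings.py -- register it; scrapy shell started inside the project picks it up
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.LoggingDownloaderMiddleware": 543,
}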
Happy Scraping :)

Recursively go through folders and load the csv files in each folder into BigQuery

So I have Google Cloud Storage Bucket which follows this style of directory:
gs://mybucket/{year}/{month}/{day}/{a csv file here}
The CSV files all follow the same schema, so that shouldn't be an issue. I was wondering if there was an easier method of loading all the files into one table in BigQuery with a single command, or even a Cloud Function. I've been using bq load to accomplish this for now, but since I have to do this about every week, I'd like to automate it.
Inspired by this answer, you can recursively load your files with the following command:
gsutil ls gs://mybucket/**.csv | \
xargs -I{} echo {} | \
awk -F'[/.]' '{print "yourdataset."$7"_"$4"_"$5"_"$6" "$0}' | \
xargs -I{} sh -c 'bq --location=YOUR_LOCATION load --replace=false --autodetect --source_format=CSV {}'
This loads your CSV files into independent tables in your target dataset, with the naming convention "filename_year_month_day"; for example, gs://mybucket/2023/01/15/sales.csv ends up in yourdataset.sales_2023_01_15.
The "recursively" part is ensured by the double wildcard (**).
That covers the manual part.
For the automation part, you have a choice between several options:
the easiest one is probably to trigger a Cloud Function with Cloud Scheduler. There is no Bash runtime available, so you would for instance have to do it in Python (a rough sketch follows after this list).
it is possible to do that with an orchestrator (Cloud Composer) if you already have the infrastructure (if you don't, it isn't worth setting up just for this).
another solution is to use Cloud Run, triggered either by Cloud Scheduler (on a regular schedule), or through Eventarc triggers when your CSV files are uploaded to GCS.
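As a rough sketch of the Cloud Function route (the project, dataset and function names here are placeholders, and it assumes the gs://mybucket/{year}/{month}/{day}/file.csv layout above), a function triggered on object finalize could look like this, using the BigQuery client library with the same autodetect behaviour as the bq command:
# main.py -- hypothetical Cloud Function (GCS "finalize" trigger), sketch only
from google.cloud import bigquery

client = bigquery.Client()

def load_csv_to_bq(event, context):
    # With the layout above, the object name looks like "2023/01/15/sales.csv".
    name = event["name"]
    if not name.endswith(".csv"):
        return
    year, month, day, filename = name.split("/")
    table_id = f"yourproject.yourdataset.{filename[:-4]}_{year}_{month}_{day}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        write_disposition="WRITE_APPEND",  # mirrors --replace=false
    )
    uri = f"gs://{event['bucket']}/{name}"
    client.load_table_from_uri(uri, table_id, job_config=job_config).result()
Deploy it with a GCS trigger on the bucket (or wire the same code to Cloud Run behind an Eventarc trigger), and every new CSV gets loaded as it arrives, without a schedule at all.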

Why running a specific spider doesn't work (but runs all the spiders)

I created a Scrapy project and want to have two separate spiders (with two different names).
I'm trying to run only the "listing_enseigne.py" spider with the command scrapy crawl nameofthespider, but it seems that this command also runs the other spider (from the file "detail_enseigne.py")...
However, according to the Scrapy documentation, this command should only run the named spider.
If anyone can help me, thanks!
Edit 1:
Indeed, Scrapy won't run them both, but it will execute any code sitting at module level in every spider file before the requested spider actually runs (thanks wishmaster for the answer).
I don't really understand how to organise my spiders then.
I want a first spider to collect URLs from a website (in fact the first spider should export CSV files containing various information, including the URLs).
Then I want a second spider to find the latest file in the export folder, collect all the URLs from that latest file, and then parse those URLs to collect other information...
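One way to organise this (a sketch with hypothetical folder and column names) is to keep everything inside the spider class so that nothing runs at import time, and have the second spider pick up the latest export in start_requests:
import csv
import glob
import os
import scrapy

class DetailEnseigneSpider(scrapy.Spider):
    name = "detail_enseigne"

    def start_requests(self):
        # "exports/" is an assumed folder written by the first spider.
        exports = glob.glob("exports/*.csv")
        latest = max(exports, key=os.path.getmtime)  # most recently modified file
        with open(latest, newline="") as f:
            for row in csv.DictReader(f):
                # "url" is an assumed column name in the first spider's CSV.
                yield scrapy.Request(row["url"], callback=self.parse)

    def parse(self, response):
        # Collect the additional information here.
        yield {"url": response.url, "title": response.css("title::text").get()}
Because nothing happens outside the class body, importing this file while running the first spider has no side effects.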

Python Scrapy Error. Running 'scrapy crawl' with more than one spider is no longer supported

I made a script in Scrapy (Python) which had been working fine for months (without changes). Recently, when I execute the script in Windows PowerShell, it raises the following error:
scrapy crawl spider –o 'filename.csv' –t 'csv'
...
Running 'scrapy crawl' with more than one spider is no longer supported
I wonder what the problem is.
Thanks in advance.
When you experience this, most probably you have an extra space somewhere in the parameters and the crawler sees more arguments than expected.
Remove it and it should work.
Make sure that you write the command option with the short dash: -o and not –o.
I tried copying and pasting your command and it did not work, but it works with short dashes.
I had this issue and I fixed it by changing:
scrapy crawl s-mart -o test 2.csv -t csv
To:
scrapy crawl s-mart -o test2.csv -t csv
So I am guessing that the whitespace was causing this?
Run the command without -t; it's a shortcut for generating new spiders based on pre-defined templates:
scrapy genspider [-t template] <name> <domain>
Try something like:
scrapy crawl <yourspidername> -o filename.csv
Documentation https://doc.scrapy.org/en/0.10.3/topics/commands.html#available-tool-commands
Try
scrapy crawl spider -o filename.csv
Possible solution:
try changing the name of your spider in its module. Maybe you have created a spider with the same name somewhere else or copied it, and Scrapy keeps track of what you have run in the past; it then encounters two or more spiders with the same name, and since names must be unique, it can't crawl.
Changing the name solved my problem.
I had this error after I renamed my spider.
The solution was to delete all the *.pyc files in the spiders folder.
These get regenerated by the Python compiler the next time you run Scrapy.
I had a "\n" in the parameter; removing it fixed the error:
...replace("\n","")
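Since most of these answers come down to stray characters (extra spaces, en dashes, newlines) sneaking into the arguments, one defensive option when the crawl is launched from another script is to sanitise the pieces first. A small sketch, assuming you start the crawl from Python with subprocess (not something the original post necessarily does):
import subprocess

def run_crawl(spider_name, output_file):
    # Strip stray whitespace/newlines (the "\n" case above); passing a list
    # instead of a shell string also keeps any spaces in the filename intact.
    args = ["scrapy", "crawl", spider_name.strip(), "-o", output_file.strip()]
    subprocess.run(args, check=True)

run_crawl("spider", "filename.csv\n")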

How do I get my CSV file?

I have made the following changes in the jmeter.properties file:
jmeter.save.saveservice.output_format=csv
jmeter.save.saveservice.assertion_results_failure_message=true
jmeter.save.saveservice.default_delimiter=|
But I still cannot find my .csv file.
Can anyone please help me?
Please see the first answers to these posts:
How to save JMeter Aggregate Report results to a CSV file using command prompt?
How do I save my Apache jMeter results to a CSV file?
In addition to your configuration done in jmeter.properties:
1) GUI: add a listener (e.g. Summary Report or Aggregate Report) and enter a filename ending in .csv in its "Write results to file / Read from file" field.
2) CLI:
jmeter -n -t test.jmx -l test.csv
In test.csv you'll get the results in CSV format.