Rotating IPs and user agents using Scrapy CrawlerRunner - scrapy

I haven't set up my spider as a Scrapy project, so I don't have a settings.py file. But I still want to implement as many methods as I can to avoid being blocked/blacklisted. Is there any way to add these rotations and settings outside of a Scrapy project, using only CrawlerRunner?
Thanks!

If anyone else is wondering: as far as I could tell, there isn't a way to do this when using CrawlerRunner on its own. I just had to set up a Scrapy project, which didn't end up being as hard as I thought it would be. There's plenty of info and "hello world" tutorials to follow online.
Good luck!

Related

Scraping Blogs - avoid already scraped items by checking urls from json/csv in advance

I'd like to scrape news pages / blogs (anything that publishes new information on a daily basis).
My crawler works fine and does everything I ask it to do.
But I cannot find a proper way to make it ignore already scraped URLs (or items, to keep it more general) and only add new URLs/items to an already existing JSON/CSV file.
I've seen many solutions here for checking whether an item exists in a CSV file, but none of these "solutions" really worked.
Scrapy DeltaFetch can't seem to be installed on my system. I get lots of errors, and none of the usual hints (e.g. $ sudo pip install bsddb3, upgrade this, update that) do the trick. I've tried for 3 hours now and I'm fed up with hunting for a fix for a package that hasn't been updated since 2017.
I hope that you have a handy and practical solution.
Thank you very much in advance!
Best regards!
An option could be a custom downloader middleware with the following:
A process_response method that stores the URL you crawled in a database.
A process_request method that checks whether the URL is already in the database. If it is, raise an IgnoreRequest so the request is dropped instead of being downloaded again.
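A minimal sketch of that middleware idea, under some assumptions: the class name SeenUrlMiddleware, the seen_urls.db file, and the table name are made up for illustration, and SQLite stands in for "a database". The try/except around the import is only there so the snippet can be tried without Scrapy installed.

```python
import sqlite3

try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:
    # Stand-in so the sketch can run without Scrapy; inside a real
    # crawl the Scrapy exception above is the one that matters.
    class IgnoreRequest(Exception):
        pass


class SeenUrlMiddleware:
    """Hypothetical downloader middleware that skips URLs crawled before."""

    def __init__(self, db_path="seen_urls.db"):
        # One table with the URL as primary key is enough for a lookup.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)"
        )
        self.conn.commit()

    def process_request(self, request, spider=None):
        # Drop the request if its URL was recorded on an earlier crawl.
        row = self.conn.execute(
            "SELECT 1 FROM seen WHERE url = ?", (request.url,)
        ).fetchone()
        if row is not None:
            raise IgnoreRequest(f"already scraped: {request.url}")
        return None  # let the request continue to the downloader

    def process_response(self, request, response, spider=None):
        # Record the URL once a response has come back for it.
        self.conn.execute(
            "INSERT OR IGNORE INTO seen (url) VALUES (?)", (request.url,)
        )
        self.conn.commit()
        return response
```

It would be enabled through the DOWNLOADER_MIDDLEWARES setting, using whatever module path you put the class in. Note that this records every URL, including listing pages, so in practice you would probably only store article URLs and let index pages through on each run.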

CasperJS: Disable remote page's javascript but still use casper.evaluate?

Thanks for reading my topic, I'd be really grateful if anyone could suggest any other avenues I should explore to achieve the below.
Using CasperJS or PhantomJS, I need to stop all JavaScript that belongs to the pages I navigate to from being executed, while still being able to run my own using casper.evaluate.
Does anyone know a way I can do this?
Is it possible to modify the HTTP headers or bodies using onResourceRequested or onResourceReceived? or cancel a request conditionally? or are they read only?
Can you modify the raw HTML source before it's offered for parsing?
I've tried hacking in a window.stop() via an early casper.evaluate, but this works inconsistently between pages.
Is the Phantom WebServer module used for this kind of thing? Could/should I route requests/responses through that and modify them as they pass through?
Thanks for any help - I appreciate this is a weird use case.
This is possible, but not with the current PhantomJS master branch. You need to build from a specific dev branch (https://github.com/Vitallium/phantomjs/tree/allow-to-disable-js); look for the latest commit for the disable-javascript option.

Meteor File Uploads

I see that this has been asked here before, but nothing since Meteor.http has been available. I'm still grasping the concepts of Meteor and file uploads are totally eluding me.
Here's my question:
So, in what I believe to be the right method,
Meteor.http.call("POST", url, [options], [asyncCallback]), what do you put for the url? With the client/server JavaScript relationship in Meteor, it doesn't seem like it really uses URLs that much.
If anyone has a basic example of a file upload in meteor, that would just be extra awesome.
Well, I've been playing a bit with Meteor. I made collectionFS, a mix of Meteor and GridFS (it could be compatible).
Test it here: http://collectionfs.meteor.com/
It supports quite large files, multiple files, users, etc. I've tested a 50 MB file and it seems OK. If the connection is lost or the browser dies, the user can resume the upload.
It should even be possible to have multiple users upload to the exact same file. I haven't quite found a use case for it, but it's possible.
Accounts, publishing, etc. work as with collections. The test is in autopublish mode, though only metadata is available; chunks of data are served in the background via blobs.
I'll try getting it on GitHub.
Take a look at filepicker.io. They handle the upload, store it into your S3, and return to you the url that you can dump into your db.
Wget the filepicker script into your client folder.
wget https://api.filepicker.io/v0/filepicker.js
Insert a filepicker input tag:
<input type="filepicker" id="attachment">
In the startup, initialize it:
Meteor.startup(function () {
    filepicker.setKey("YOUR FILEPICKER API KEY");
    filepicker.constructWidget(document.getElementById('attachment'));
});
Attach an event handler:
Template.templateNameHere.events({
    'change #attachment': function (evt) {
        console.log(evt.files);
    }
});
(I had posted on How would one handle a file upload with Meteor? Sorry. I'm new here. Is it kosher to copy the same answer twice? Anyone who knows better can feel free to edit this.)
Check out how to accomplish this using a Meteor method on the server and the FileReader API on the client:
https://gist.github.com/dariocravero/3922137
After several searches, this looks to me like the easiest (and, for the moment, the most Meteor-style) way to handle a file upload with no extra dependencies.
Since Meteor includes jQuery by default, you can use a jQuery plugin for that. I presume something like https://github.com/blueimp/jQuery-File-Upload/wiki/Options can do the trick for you; it supports both POST and PUT.
Otherwise it would be a pain to get it to work, but not impossible, since you can access PUT in Meteor.
If you would prefer a more pure JS solution, maybe you can look at: http://igstan.ro/posts/2009-01-11-ajax-file-upload-with-pure-javascript.html
And adapt it.
There is no ready-made support for file uploads, so share what you come up with. I would be very interested!
Alternatively (if you'd rather not use a third-party solution like filepicker), you could use the Meteor router package.
It handles the HTTP requests on the server side.

Can I use/adapt the Kohana userguide module to create help pages for my application?

I'd like to create a userguide for the application I'm building using the Kohana framework, and I'm wondering if there's a way I can use the Kohana userguide module for this purpose.
I understand how to add userguide info for new modules that I create, and how to include my classes in the API, but I want to build a second, separate userguide for the actual application user, as opposed to the app developers.
At first, I thought I'd just try adding app help pages to the main userguide at APPPATH/guide. I tried adding an "application/guide" directory and put a file in there called menu.md, but that just ended up replacing the Kohana menu in the userguide. After renaming the file to menu.myapp.md, it doesn't show up at all.
So then it occurred to me that I could simply edit modules/userguide/guide/menu.md to add sections for my app, and likewise add markdown files for each app component. But really it would be much better to have a completely separate userguide for app users, since the Kohana documentation isn't relevant for them.
What's the best way to go about this? Should I create a duplicate of the entire userguide module and modify the routing, etc.? Or is there some way to set up both userguides using the one version of the module? Or am I barking up the wrong tree altogether? Is there some other module/approach that would be better for building "Help" pages for the app?
Thanks in advance for your help!
Yes, you can make docs for your application with the userguide. If you want examples, check out these links:
https://github.com/zombor/Auto-Modeler/blob/master/config/userguide.php
https://github.com/zombor/Auto-Modeler/tree/master/guide/auto-modeler
Note that you'll still get "api docs" and everything else, unless you change the config to hide them.

Automate adding entries to a wiki

Once I have my renamed files I need to add them to my project's wiki page. This is a fairly repetitive manual task, so I guess I could script it but I don't know where to start.
The process is:
Go to the appropriate page on the wiki
for each team member (DeveloperA, DeveloperB, DeveloperC)
{
    for each of two files ('*_current.jpg', '*_lastweek.jpg')
    {
        Select the 'Attach' link on the page
        Select the 'manage' link next to the file to be updated
        Click the 'Browse' button
        Browse to the relevant file (which has the same name as the previous version)
        Click the 'Upload file' button
    }
}
Not necessarily looking for the full solution as I'd like to give it a go myself.
Where to begin? What language could I use to do this and how difficult would it be?
Check if the wiki you mean to talk to supports XMLRPC, because if it does it should be a snap. I wrote a tool called WikiUp to solve a similar problem (updating a delineated section on a wiki page).
If you're writing in C#, the WebClient classes might be a good place to start. I bet people could give more specific advice if you mentioned which wiki platform you are using, and whether it requires authentication, though.
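One way to check for XML-RPC support, as suggested above, is to ask the endpoint which methods it advertises. This is only a sketch: system.listMethods is the usual XML-RPC introspection call, but a given wiki may not enable it, and the endpoint URL and the "attach" name filter are assumptions for illustration.

```python
import xmlrpc.client
from xml.parsers.expat import ExpatError


def list_methods(endpoint):
    """Return the method names the server advertises, or [] if it has none.

    An empty list can mean no XML-RPC at all, or introspection disabled.
    """
    proxy = xmlrpc.client.ServerProxy(endpoint)
    try:
        return list(proxy.system.listMethods())
    except (xmlrpc.client.Error, OSError, ExpatError):
        # Fault/ProtocolError, connection problems, or a non-XML reply.
        return []


def attachment_methods(methods):
    """Pick out method names that look related to attachments."""
    return sorted(m for m in methods if "attach" in m.lower())


if __name__ == "__main__":
    # Hypothetical endpoint; replace with your wiki's XML-RPC URL.
    found = list_methods("https://wiki.example.com/xmlrpc")
    print(attachment_methods(found))
```

If something attachment-related shows up in that list, the wiki's own XML-RPC documentation should say what arguments it expects.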
I'd probably start by downloading Fiddler and watching the HTTP requests made while doing it manually. Then you could use some simple scripts and regexes to build your HTTP requests for automating the process.
Of course, if you're wildly lucky, your wiki would have a backend simple enough that you could just plug them into its DB directly. :)
You might find CoScripter useful -- it's a Firefox extension that allows you to automate tasks you perform on websites. I'm not certain how you'd integrate this with the list of files you're changing on your local system, but it can certainly handle the file uploading through a web form.
A better bet is probably using cURL or a similar HTTP library with your programming language of choice. If you're on *nix, you can use the cURL command-line program inside your shell script to get this done fairly easily. (As jsight said, you will need to analyze the actual forms you're using on the webpage, using Fiddler or just looking at the form elements, and re-create the POST through cURL.)
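As a rough illustration of the "HTTP library" route, here is a sketch in Python using only the standard library. Everything specific in it is an assumption: the file naming (DeveloperA_current.jpg and so on), the form field name "file", and whatever upload URL you pass in would all have to be replaced with what you actually see in the wiki's upload form (or in Fiddler). Authentication is not handled.

```python
import itertools
import urllib.request
import uuid
from pathlib import Path

DEVELOPERS = ["DeveloperA", "DeveloperB", "DeveloperC"]
SUFFIXES = ["_current.jpg", "_lastweek.jpg"]


def upload_jobs(folder):
    """List (developer, path) pairs, mirroring the nested loops in the question."""
    return [
        (dev, Path(folder) / f"{dev}{suffix}")
        for dev, suffix in itertools.product(DEVELOPERS, SUFFIXES)
    ]


def encode_multipart(field_name, filename, data, boundary=None):
    """Build a multipart/form-data body holding a single file field."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: image/jpeg\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload(url, path, field_name="file"):
    """POST one file to the (hypothetical) attachment form at `url`."""
    body, content_type = encode_multipart(field_name, path.name, path.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A driver loop would then call upload(url, path) for each pair returned by upload_jobs(folder). Real wiki forms usually carry extra hidden fields (page name, tokens), which would need to be added as further parts of the multipart body.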