How to pass some information between parse_item calls? - scrapy

Ok, imagine a website with some list. The items of this list have one piece of information needed. The second piece is located at some other url, which is unique from item to item.
Currently our crawler opens a list page, scrapes each item, and for each item opens that second URL to get the second piece of information. We use the requests library, which is excellent in almost all cases, but here it seems slow and inefficient: it looks like the whole Twisted reactor is blocked until each requests call finishes.
pseudo-code:
def parse_item():
    for item in item_list:
        content2 = requests.get(item['url'])
We can't just let Scrapy parse these 2nd urls because we need to 'connect' the first and the second url somehow. Something like Redis would work, but hey, is there any better (simpler, faster) way to do that in Scrapy? I can't believe the things must be so complicated.

You can do this by passing a variable in meta.
For example:
from scrapy import Request
req = Request(url='http://somedonain.com/path', callback=myfunc)
req.meta['var1'] = 'some value'
yield req
And in your myfunc, you read the passed variable as follows (response.meta is a shortcut for the same dict):
myval = response.request.meta['var1']


Fetch All Pull-Request Comments Via Bitbucket REST API

Bitbucket's documentation describes an endpoint for retrieving a particular pull request's comments.
Although I have the pull-request ID and build a correct URL, I still get a 400 response. I am able to make a POST request to comment, but I cannot make a GET. After further reading I noticed that the six parameters listed for this endpoint are not marked 'optional'. It looks like these need to be supplied in order to retrieve all the comments.
But what exactly are these parameters? I don't find their descriptions to be helpful in the slightest. Any and all help would be greatly appreciated!
fromHash and toHash are only required if diffType isn't set to EFFECTIVE. state also seems optional to me (I didn't get an error when I left it out), and anchorState specifies which kind of comments to fetch; you'd probably want ALL there. As far as I understand it, path contains the path of the file to read comments from. (For example, if src/a.py and src/b.py were changed, you specify which of them to fetch comments for.)
However, that's probably not what you want. I'm assuming you want to fetch all comments.
You can do that via /rest/api/1.0/projects/{projectKey}/repos/{repositorySlug}/pull-requests/{pullRequestId}/activities which also includes other activities like reviews, so you'll have to do some filtering.
I won't paste example data from the documentation or from the Bitbucket instance I tested this on, since the JSON response is quite long. As I've said, there is an example response on the linked page. I also think you'll figure out how to get to the data you want once it's downloaded, since this is a Q&A forum and not a "program this for me" page :b
As a small quickstart: you can use curl like this
curl -u <your_username>:<your_password> https://<bitbucket-url>/rest/api/1.0/projects/<project-key>/repos/<repo-name>/pull-requests/<pr-id>/activities
which will print the response json.
Python version of that curl snippet using the requests module:
import requests
url = "<your-url>" # see above on how to assemble your url
r = requests.get(
    url,
    params={},  # you'll need this later
    auth=requests.auth.HTTPBasicAuth("your-username", "your-password")
)
Note that the result is paginated according to the API documentation, so you'll have to do some extra work to build a full list: either set an obnoxiously high limit (dirty) or keep making requests until you've fetched everything. I strongly recommend the latter.
You can control which data you get using the start and limit parameters which you can either append to the url directly (e.g. https://bla/asdasdasd/activity?start=25) or - more cleanly - add to the params dict like so:
requests.get(
    url,
    params={
        "start": 25,
        "limit": 123
    }
)
Putting it all together:
def get_all_pr_activity(url):
    start = 0
    values = []
    while True:
        r = requests.get(url, params={
            "limit": 10,  # adjust this limit to your liking - 10 is probably too low
            "start": start
        }, auth=requests.auth.HTTPBasicAuth("your-username", "your-password"))
        page = r.json()  # parse the response body once per page
        values.extend(page["values"])
        if page["isLastPage"]:
            return values
        start = page["nextPageStart"]
print([x["id"] for x in get_all_pr_activity("my-bitbucket-url")])
will print a list of activity ids, e.g. [77190, 77188, 77123, 77136] and so on. Of course, you should probably not hardcode your username and password there - it's just meant as an example, not production-ready code.
Finally, to filter by action inside the function, you can replace the return values with something like
return [activity for activity in values if activity["action"] == "COMMENTED"]

Extract portion of HTML from website?

I'm trying to use VBA in Excel, to navigate a site with Internet explorer, to download an Excel file for each day.
After looking through the HTML code of the site, it looks like each day's page has a similar structure, but there's a portion of the website link that seems completely random. But this completely random part stays constant and does not change each time you want to load the page.
The following portion of the HTML code contains the unique string:
<a href="#" onClick="showZoomIn('222698519','b1a9134c02c5db3c79e649b7adf8982d', event);return false;
The part starting with "b1a" is what is used in the website link. Is there any way to extract this part of the page and assign it as a variable that I then can use to build my website link?
Since you don't show your code, I will also talk in general terms:
1) You get all the elements of type link (<a>) with Set allLinks = ie.document.getElementsByTagName("a"). It will be a collection of length n containing all the links you scraped from the document.
2) You detect the precise link containing the information you want. Let's imagine it's the 4th one (you can parse the properties to check which one it is, in case it's dynamic):
Set myLink = allLinks(3) '<- 4th : index = 3 (starts from zero)
3) You get your token with a simple split function:
myToken = Split(myLink.onClick, "'")(3)
Of course you can be more concise if the position of the link containing the token is always the same, for example always the 4th link:
myToken = Split(ie.document.getElementsByTagName("a")(3).onClick,"'")(3)
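To make the indexing in the Split call easier to follow, here is the same logic sketched in Python (not VBA), applied to the onClick string from the question. The function names are made up for illustration, and the regex variant is an alternative I'm suggesting, not part of the answer above.

```python
import re

# The onClick attribute value quoted in the question.
ONCLICK = "showZoomIn('222698519','b1a9134c02c5db3c79e649b7adf8982d', event);return false;"


def token_by_split(onclick):
    # Splitting on the single quote gives:
    # 0: "showZoomIn("   1: first argument   2: ","   3: second argument (the token)
    return onclick.split("'")[3]


def token_by_regex(onclick):
    # More defensive: returns None instead of raising if the pattern is absent.
    match = re.search(r"showZoomIn\('[^']*','([^']*)'", onclick)
    return match.group(1) if match else None


print(token_by_split(ONCLICK))
print(token_by_regex(ONCLICK))
```

Index 3 works because every quote character opens or closes an argument, so the odd-numbered pieces are the quoted values.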

Setting a custom long list of starting URLS in Scrapy

The crawling starts from the list included in start_urls = []
I need a long list of these starting URLs, and I have tried 2 methods of solving this problem:
Method 1: Using pandas to define the starting_urls array
#Array of Keywords
keywords = pandas.Keyword
urls = {}
count = 0
while(count < 100):
    urls[count]='google.com?q=' + keywords[count]
    count = count + 1
#Now I have the starting urls in urls array.
However, it doesn't seem to define starting_urls = urls because when I run:
scrapy crawl SPIDER
I get the error:
Error: Request url must be str or unicode, got int:
Method 2:
Each starting URL contains paginated content and in the def parse method I have the following code to crawl all linked pages.
next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
yield response.follow(next_page, callback=self.parse)
I want to add additional pages to crawl from the urls array defined above.
count=0
while(count < 100):
    yield response.follow(urls[count], callback=self.parse)
    count=count + 1
But it seems that neither of these 2 methods works. Maybe I can't add this code to the spider.py file?
A first note: obviously I can't say I've run your entire script, since it's incomplete, but the first thing I noticed is that your base URL needs to be in the proper format ("http://ect.ect") for Scrapy to make a proper request.
Also, not to question your skills, but in case you weren't aware: with the strip, split and join functions (and str and int conversions) you can turn lists, strings, dictionaries and integers into one another to achieve the desired effect.
WHAT'S HAPPENING TO YOU:
I'll be using range instead of count, but this mimics your issue:
lis = range(11)
site = "site.com/page="
for i in lis:
    print(site + i)
----------
TypeError: Can't convert 'int' object to str implicitly
#TURNING MY INT INTO STR:
lis = range(11)
site = "site.com/page="
for i in lis:
    print(site + str(i))
--------------------
site.com/page=0
site.com/page=1
site.com/page=2
site.com/page=3
site.com/page=4
site.com/page=5
site.com/page=6
site.com/page=7
site.com/page=8
site.com/page=9
site.com/page=10
As to the error: when you increment the count with "+ 1" and then use it to build the URL, you are trying to combine a string with an integer. I'd imagine the fix is simply to turn the integer into a string when constructing your URL, while keeping the count itself an integer so that you can still add one to it.
My go-to way to keep my code as clean as possible is much simpler. Add an extra file, at the root or in the current working folder from which you start the crawl, containing all the URLs you wish to scrape. You can then use Python's file functions to open and read that file inside your spider script, like this:
class xSpider(BaseSpider):
    name = "w.e"
    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()
What really bothers me is that your error says you're combining a string with an integer. If you need further help, please post a complete snippet of your spider and, in the spirit of coders' kinship, your settings.py as well. I'll tell you right now what you'll end up finding out: despite any adjustments to the settings.py file, you won't be able to scrape Google search pages, or at least not the entire set of result pages. For that I would recommend using Scrapy in conjunction with Beautiful Soup.
The immediate problem I see is that you are making a DICT when it expects a list. :). Change it to a list.
There are also all kinds of interactions depending on which underlying spider you inherited from (if you did at all). Try switching to a list, then update the question with more data if you are still having problems.
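Pulling the points from the answers together (use a list rather than a dict, make every entry a string, include the scheme), a minimal sketch of building start_urls from keywords might look like this. The function name and the example.com base URL are placeholders, not anything from the question.

```python
from urllib.parse import urlencode


def build_start_urls(keywords, base="http://example.com/search"):
    # Returns a list (not a dict) of complete URL strings.
    # str(kw) guards against non-string keywords such as integers,
    # and urlencode escapes spaces and other special characters.
    return [base + "?" + urlencode({"q": str(kw)}) for kw in keywords]


start_urls = build_start_urls(["scrapy", "web crawling", 42])
print(start_urls)
```

In a spider, the resulting list can be assigned directly to the start_urls class attribute.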

Constrain Wikipedia Search API to generate only NS:0 pages

I am calling the Wikipedia API from Java using the following search query to generate redirected pages:
https://en.wikipedia.org//w/api.php?format=json&action=query&generator=allpages&gapfilterredir=redirects&prop=links&continue=&gapfrom=D
where the final 'D' is just an example for the continue-from.
I am interested in only iterating over items in namespace:0. In fact, if I don't, the continue return value includes category pages, which break the next query iteration.
Thank you in advance.
The parameter you need from the Allpages api is
…&gapnamespace=0&…
but notice that when you omit it, then 0 is the default anyway.
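As a rough sketch of where the parameter goes, the query from the question can be assembled as below. It is written in Python rather than Java purely for illustration; the function name is made up, and the parameter names are the ones already shown above.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def redirects_query_url(gapfrom, namespace=0):
    # Same parameters as the query in the question, plus gapnamespace.
    params = {
        "format": "json",
        "action": "query",
        "generator": "allpages",
        "gapfilterredir": "redirects",
        "gapnamespace": namespace,  # 0 = main (article) namespace
        "prop": "links",
        "continue": "",
        "gapfrom": gapfrom,
    }
    return API + "?" + urlencode(params)


print(redirects_query_url("D"))
```

Setting the namespace explicitly on every request keeps it unambiguous which namespace each continued query iterates over.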

REST API: How to search for other attribute

I use node.js as REST API.
There are following actions available:
/contacts, GET, finds all contacts
/contacts, POST, creates a new contact
/contacts/:id, GET, gets a specific contact by its id
/contacts/:id, PUT, updates a specific contact
/contacts/:id, DELETE, removes a specific contact
What would be a logical route for searching or querying for a user?
Should I add this to the 3rd route, or should I create an extra route?
I'm sure you will get a lot of different opinions on this question. Personally I would see "searching" as filtering on "all contacts" giving:
GET /contacts?filter=your_filter_statement
You probably already have filtering-parameters on GET /contacts to allow pagination that works well with the filter-statement.
EDIT:
Use this for parsing your querystring:
var url = require('url');
and in your handler ('request' being your nodejs http-request object):
var parsedUrl = url.parse(request.url, true);
var filterStatement = parsedUrl.query.filter;
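To show what the handler might do with the parsed filter, here is a small sketch of the "search is filtering on all contacts" idea. It is in Python rather than Node.js, and the in-memory contact list, the function name, and the substring-on-name matching rule are all assumptions made for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical in-memory data standing in for a real contact store.
CONTACTS = [
    {"id": 1, "name": "Dave"},
    {"id": 2, "name": "Susan"},
]


def find_contacts(request_url, contacts=CONTACTS):
    query = parse_qs(urlparse(request_url).query)
    # parse_qs maps each key to a list of values; use the first filter, if any.
    needle = query.get("filter", [""])[0].lower()
    # An empty filter matches everything, so plain GET /contacts still lists all.
    return [c for c in contacts if needle in c["name"].lower()]


print(find_contacts("/contacts?filter=dav"))
print(len(find_contacts("/contacts")))
```

With this shape, the same route serves both the full listing and the search, and pagination parameters can sit alongside filter in the query string.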
Interesting question. This is a discussion that I have had several times.
I don't think there is a clear answer, or maybe there is and I just don't know it or don't agree with it. I would say that you should add a new route, /contacts/_search, which performs an action on the contacts list, in this case a search. That makes it clear and well defined what the route does.
GET /contacts finds all contacts. You want a subset of all contacts. What delimiter in a URI represents subsets? It's not "?"; that's non-hierarchical. The "/" character is used to delimit hierarchical path segments. So for a subset of contacts, try a URI like /contacts/like/dave/ or /contacts/by_name/susan/.
This concept of subsetting data by path segments is for more than collections--it applies more broadly. Your whole site is a set, and each top-level path segment defines a subset of it: http://yoursite.example/contacts is a subset of http://yoursite.example/. It also applies more narrowly: /contacts/:id is a subset of /contacts, and /contacts/:id/firstname is a subset of /contacts/:id.