I just started looking at the Python version of Mechanize today. I took most of this code from the first example on http://wwwsearch.sourceforge.net/mechanize/. The documentation of this module is very sparse and I have no idea how to debug this.
I am trying to find and follow the first link with the text "Careers". When I run this I get the error "mechanize._mechanize.LinkNotFoundError". Can anyone tell me what I am doing wrong?
import re
import mechanize
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.open("http://www.amazon.com/")
response1 = br.follow_link(text_regex=r"Careers", nr=1)
assert br.viewing_html()
print br.title()
I just tried the sample code myself, and it looks like the problem is with the nr argument. It's not documented anywhere but in the source code (which is far more informative than the documentation!), and it states that:
nr: matches the nth link that matches all other criteria (default 0)
Because the nr argument is 0-based, when you passed 1 it was looking for the second link matching "Careers", which does not exist.
Since nr defaults to 0, i.e. the first matching link, you can either set it to 0 explicitly or leave it off entirely.
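For reference, here is the same snippet with nr left at its default; this is only a sketch of the fix described above, with the rest of the code unchanged from the question:

import mechanize

br = mechanize.Browser(factory=mechanize.RobustFactory())
br.open("http://www.amazon.com/")
# nr defaults to 0, i.e. the first link whose text matches "Careers"
response1 = br.follow_link(text_regex=r"Careers")
assert br.viewing_html()
print(br.title())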
I am trying to run this script: enter link description here
The only difference is that instead of the hard-coded TEST_SENTENCES I need to read my own dataset (the text column), and I need to apply stop word removal to that column before passing it to the rest of the code.
import pandas as pd

df = pd.DataFrame({
    'text': [
        'the "superstar breakfast" is shrink wrapped muffins that can be bought at a convenience store.',
        'The wireless internet was unreliable. ',
        'i am still her . :). ',
        'I appreciate your help ',
        'I appreciate your help '
    ],
    'sentiment': ['positive', 'negative', 'neutral', 'positive', 'neutral']
})
The error is not raised when I build the data frame this way, but it is raised when I use a csv file that contains the exact same data.
When I add this line of code to remove stop words
df['text_without_stopwords'] = df['text'].apply(lambda x: ' '.join([word.encode('latin1', 'ignore').decode('latin1') for word in x.split() if word not in (stop)]))
TEST_SENTENCES = df['text_without_stopwords']
it keeps raising this error:
ValueError: All sentences should be Unicode-encoded!
Also, the error is raised in the tokenization step:
tokenized, _, _ = st.tokenize_sentences(TEST_SENTENCES)
I want to know what is happening here that causes this error, and the correct way to fix the code.
(I have tried different encodings like utf-8, etc., but none worked.)
I don't know the reason yet but when I did
df['text_without_stopwords'] = df['text_without_stopwords'].astype('unicode')
it worked.
I am still very curious to know why this happens only when I do stop word removal.
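For anyone hitting the same thing, a minimal sketch of the order of operations that worked for me, assuming the same df, stop list, and st tokenizer from the script above; the astype call is the only real change:

df['text_without_stopwords'] = df['text'].apply(
    lambda x: ' '.join([word.encode('latin1', 'ignore').decode('latin1')
                        for word in x.split() if word not in stop])
)
# forcing the column dtype to unicode is what made tokenize_sentences accept it
df['text_without_stopwords'] = df['text_without_stopwords'].astype('unicode')

TEST_SENTENCES = df['text_without_stopwords']
tokenized, _, _ = st.tokenize_sentences(TEST_SENTENCES)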
This is how to retrieve a particular pull request's comments according to Bitbucket's documentation:
Although I have the pull request ID and format a correct URL, I still get a 400 response error. I am able to make a POST request to add a comment, but I cannot make a GET. After further reading I noticed that the six parameters listed for this endpoint do not say 'optional'. It looks like they need to be supplied in order to retrieve all the comments.
But what exactly are these parameters? I don't find their descriptions to be helpful in the slightest. Any and all help would be greatly appreciated!
fromHash and toHash are only required if diffType isn't set to EFFECTIVE. state also seems optional to me (it didn't give me an error when I left it out), and anchorState specifies which kind of comments to fetch - you'd probably want ALL there. As far as I understand it, path contains the path of the file to read comments from (e.g. if src/a.py and src/b.py were changed, it specifies which of them to fetch comments for).
However, that's probably not what you want. I'm assuming you want to fetch all comments.
You can do that via /rest/api/1.0/projects/{projectKey}/repos/{repositorySlug}/pull-requests/{pullRequestId}/activities which also includes other activities like reviews, so you'll have to do some filtering.
I won't paste example data from the documentation or from the Bitbucket instance I tested this on, since the JSON response is quite long. As I've said, there is an example response on the linked page. I also think you'll figure out how to get to the data you want once it's downloaded, since this is a Q&A forum and not a "program this for me" page :b
As a small quickstart: you can use curl like this
curl -u <your_username>:<your_password> https://<bitbucket-url>/rest/api/1.0/projects/<project-key>/repos/<repo-name>/pull-requests/<pr-id>/activities
which will print the response json.
Python version of that curl snippet using the requests module:
import requests

url = "<your-url>"  # see above on how to assemble your url
r = requests.get(
    url,
    params={},  # you'll need this later
    auth=requests.auth.HTTPBasicAuth("your-username", "your-password")
)
Note that the result is paginated according to the api documentation, so you'll have to do some extra work to build a full list: either set an obnoxiously high limit (dirty) or keep making requests until you've fetched everything. I strongly recommend the latter.
You can control which data you get using the start and limit parameters which you can either append to the url directly (e.g. https://bla/asdasdasd/activity?start=25) or - more cleanly - add to the params dict like so:
requests.get(
    url,
    params={
        "start": 25,
        "limit": 123
    }
)
Putting it all together:
def get_all_pr_activity(url):
    start = 0
    values = []
    while True:
        r = requests.get(url, params={
            "limit": 10,  # adjust this limit to your liking - 10 is probably too low
            "start": start
        }, auth=requests.auth.HTTPBasicAuth("your-username", "your-password"))
        values.extend(r.json()["values"])
        if r.json()["isLastPage"]:
            return values
        start = r.json()["nextPageStart"]

print([x["id"] for x in get_all_pr_activity("my-bitbucket-url")])
will print a list of activity ids, e.g. [77190, 77188, 77123, 77136] and so on. Of course, you should probably not hardcode your username and password there - it's just meant as an example, not production-ready code.
Finally, to filter by action inside the function, you can replace the return values line with something like
return [activity for activity in values if activity["action"] == "COMMENTED"]
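For example, if all you want is the comment bodies, you could call the original (unfiltered) function and do the filtering outside it, like this; note that the "comment" and "text" keys below are taken from the example payload in the documentation, so double-check them against your instance:

activities = get_all_pr_activity("my-bitbucket-url")
# keep only comment activities and pull out the text of each comment
# (the "comment"/"text" key names are assumed from the documented example payload)
comments = [a["comment"]["text"] for a in activities if a["action"] == "COMMENTED"]
print(comments)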
I'm struggling to get Scrapy to output only "hits" to a json file. I'm new at this, so if there is just a link I should review, that might help (I've spent a fair amount of time googling around, still struggling), though code correction tips are more welcome :).
I'm working off of the scrapy tutorial (https://doc.scrapy.org/en/latest/intro/overview.html), with the original code outputting a long list including field names, in the form "field: output", where both blanks and found items appear. I'd like to include only the links that are found, and write them without the field name to a file.
For the following code I am trying, if I issue "scrapy crawl quotes2 -o quotes.json > output.json" it runs, but quotes.json is always blank (i.e., including if I just do "scrapy crawl quotes2 -o quotes.json").
In this case, as an experiment, I only want to return the URL if the string "Jane" is in the URL (e.g., /author/Jane-Austen):
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('a'):
            for i in quote.css('a[href*=Jane]::attr(href)').extract():
                if i is not None:
                    print(i)
I've tried "yield" and the items options, but am not up to speed enough to make them work. My longer-term ambition is to go to sites without having to understand the html tree (which may in and of itself be the wrong approach) and look for URLs with specific text in the URL string.
Thoughts? Am guessing this is not too hard, but is beyond me.
Well, this is happening because you are printing the items; you have to tell Scrapy explicitly to yield them.
But before that, I don't see why you are looping through the anchor nodes. Instead, you should loop over the quotes using CSS or XPath selectors, extract the author link inside each quote, and finally check whether that URL contains a specific string (Jane in your case).
for quote in response.css('.quote'):
    jane_url = quote.xpath('.//a[contains(@href, "Jane")]/@href').extract_first()
    if jane_url is not None:
        yield {
            'url': jane_url
        }
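Putting that into your spider, a minimal sketch (same spider name and start URL as yours); with the yield in place, scrapy crawl quotes2 -o quotes.json should write only the matched URLs:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        # loop over the quote blocks, grab the author link, and keep it only if it mentions Jane
        for quote in response.css('.quote'):
            jane_url = quote.xpath('.//a[contains(@href, "Jane")]/@href').extract_first()
            if jane_url is not None:
                yield {'url': jane_url}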
I am very new to Python, but I want to extract some data about job postings from an online job portal.
With the following code I wanted to extract the title of the job posting of a particular website:
def jobtitle(soup):
    jobs=[]
        for div in soup.find_all(name="div", attrs={"class": "row"}):
            for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
                jobs.append(a["title"])
    return(jobs)

jobtitle(soup)
I receive this error message:
    for div in soup.find_all(name="div", attrs={"class": "row"}):
    ^
IndentationError: unexpected indent
I tried many different things that were recommended on other sites, but nothing worked. I just don't know what the problem is. I tried different whitespace, but I just don't understand what I am doing wrong.
Any ideas? I would be really grateful!
Thanks a lot :-)
Remove the indent on the first for line.
The first for statement should be directly under the jobs=[] declaration.
def jobtitle(soup):
    jobs = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs

jobtitle(soup)
I am trying to extract data from craigslist using BeautifulSoup. As a preliminary test, I wrote the following:
import urllib2
from bs4 import BeautifulSoup, NavigableString
link = 'http://boston.craigslist.org/search/jjj/index100.html'
print link
soup = BeautifulSoup(urllib2.urlopen(link).read())
print soup
x=soup.body.find("div",class_="content")
print x
Upon printing soup, I can see the entire webpage. However, when I try to find something more specific, such as the class called "content", it prints None. I know the class exists in the page source because I checked in my own browser, but for some reason BeautifulSoup is not finding it when it parses the page. Any ideas?
Edit:
I also added in the following to see what would happen:
print soup.body.article
When I do so, it prints out some information between the article tags, but not all. Is it possible that when I am using the find function, it is somehow skipping some information? I'm really not sure why this is happening when it prints the whole thing for the general soup, but not when I try to find particulars within it.
The find method on the BeautifulSoup instance (your soup variable) is not the same as the find method on a Tag (your soup.body).
This:
soup.body.find("div",class_="content")
is only searching through the direct children of the body tag.
If you call find on the BeautifulSoup instance, it does what you want and searches the whole document:
soup.find("div",class_="content")
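In context, using the same link variable and imports from your snippet, that would look something like this; it's just a sketch to show where the call goes, not a full scraper:

soup = BeautifulSoup(urllib2.urlopen(link).read())
# search the whole document instead of starting from soup.body
x = soup.find("div", class_="content")
print(x)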