newspaper3k, find author name in visible text after first "by" word - beautifulsoup

Newspaper3k is a good Python library for news content extraction, and it works mostly well.
I want to extract names after the first "by" word in the visible text. This is my code; it does not work well, so could somebody please help:
import re
from newspaper import Config
from newspaper import Article
from newspaper import ArticleException
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

html1 = 'https://saugeentimes.com/new-perspectives-a-senior-moment-food-glorious-food-part-2/'
article = Article(html1.strip(), config=config)
article.download()
article.parse()

soup = BeautifulSoup(article)
## I want to take only visible text
[s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
visible_text = soup.getText()

for line in visible_text:
    # Capture one-or-more words after the first "By"
    match = re.search(r'By (\S+)', line)
    # Did we find a match?
    if match:
        # Yes, process it to print
        By = match.group(1)
        print('By {}'.format(By))

This is not a comprehensive answer, but it is one that you can build from. You will need to expand this code as you add additional sources. Like I stated before, my Newspaper3k overview document has lots of extraction examples, so please review it thoroughly.
Regular expressions should be a last-ditch effort; first try these extraction methods with newspaper3k:
article.authors
meta tags
json
soup
from newspaper import Config
from newspaper import Article
from newspaper.utils import BeautifulSoup

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

urls = ['https://saugeentimes.com/new-perspectives-a-senior-moment-food-glorious-food-part-2',
        'https://www.macleans.ca/education/what-college-students-in-canada-can-expect-during-covid',
        'https://www.cnn.com/2021/02/12/asia/india-glacier-raini-village-chipko-intl-hnk/index.html',
        'https://www.latimes.com/california/story/2021-02-13/wildfire-santa-cruz-boulder-creek-residents-fear-water-quality',
        'https://foxbaltimore.com/news/local/maryland-lawmakers-move-ahead-with-first-tax-on-internet-ads-02-13-2021']

for url in urls:
    try:
        article = Article(url, config=config)
        article.download()
        article.parse()
        author = article.authors
        if author:
            print(author)
        else:
            soup = BeautifulSoup(article.html, 'html.parser')
            author_tag = soup.find(True, {'class': ['td-post-author-name', 'byline']}).find(['a', 'span'])
            if author_tag:
                print(author_tag.get_text().replace('By', '').strip())
            else:
                print('no author found')
    except AttributeError:
        pass
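The list above also mentions meta tags and JSON. As a rough sketch of those two fallbacks (the meta tag names and the JSON-LD layout vary from site to site, so the keys below are assumptions to adapt per source), a helper like this could be called with article.html whenever article.authors comes back empty:

import json
from bs4 import BeautifulSoup

def authors_from_html(html):
    # hypothetical helper: try common author <meta> tags first, then JSON-LD blocks
    soup = BeautifulSoup(html, 'html.parser')
    meta = (soup.find('meta', attrs={'name': 'author'})
            or soup.find('meta', attrs={'property': 'article:author'}))
    if meta and meta.get('content'):
        return [meta['content']]
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string or '')
        except json.JSONDecodeError:
            continue
        author = data.get('author') if isinstance(data, dict) else None
        if isinstance(author, dict) and author.get('name'):
            return [author['name']]
        if isinstance(author, list):
            return [a.get('name') for a in author if isinstance(a, dict) and a.get('name')]
    return []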

Related

How to take user argument and pass it to Rule extractor in Scrapy

I have a config file that contains the details of many websites. I take a user input argument in Scrapy with the -a parameter and look up the matching allowed_domains and start_urls in the config file. Since this is a generic spider, I am using a rule extractor.
Below is my code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup
import yaml
import re
import scrapy

with open("/scrapyConfig.yaml", "r") as f:
    config = yaml.load(f, Loader=yaml.FullLoader)

def cleanHtml(raw_html):
    CLEANR = re.compile('<.*?>')
    cleanText = str(re.sub(CLEANR, '', raw_html))
    return cleanText

def remove_tags(html):
    soup = BeautifulSoup(html, "html.parser")
    for data in soup(['style', 'script']):
        data.decompose()
    noTagsData = str(' '.join(soup.stripped_strings))
    return noTagsData

class SpiderSpider(CrawlSpider):
    name = 'spider1'

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        userInp = self.site
        self.allowed_domains = config[userInp]['allowed_domain']
        self.start_urls = config[userInp]['start_url']

    rules = [(Rule(LinkExtractor(unique=False, allow=(config[self.site]['regex1'], config[self.site]['regex2'])), callback='parse_item', follow=True))]

    def parse_item(self, response):
        uncleanText = response.xpath(config[self.site]['xpath1']).extract()
        cleanText = [x.replace("\n", "") for x in uncleanText]
        cleanText = [x.replace("\t", " ") for x in cleanText]
        cleanText = [x.replace("\r", "") for x in cleanText]
        cleanText = [x.replace("\xa0", "") for x in cleanText]
        cleanText = [x.replace(":", " ") for x in cleanText]
        cleanText = remove_tags(str(cleanText))
        finalCleanJD = cleanHtml(str(cleanText))
        yield {"URL": response.url, "Job Description": finalCleanJD}
I am able to take the user input and fetch the corresponding allowed_domains and start_urls from the config file in the __init__ function, but when I pass the same argument to the rule extractor it does not recognise self.site, and if I put the rule extractor inside __init__ the spider does not scrape the page: it just prints "crawled" in the terminal and then exits. The rules variable is not even highlighted when it is inside __init__, which suggests it is not used anywhere, whereas outside __init__ it is highlighted but self.site is not recognised. How can I make this generic spider take a user input argument, pull the matching details from the config file, and start scraping?
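One workaround that is often suggested for CrawlSpider (a sketch under the assumption that the spider is started with -a site=<key> and that the config keys used above exist in scrapyConfig.yaml) is to build self.rules as an instance attribute before calling super().__init__(), because CrawlSpider compiles its rules during initialisation; rules defined afterwards are ignored unless self._compile_rules() is called again:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import yaml

with open("/scrapyConfig.yaml", "r") as f:
    config = yaml.load(f, Loader=yaml.FullLoader)

class SpiderSpider(CrawlSpider):
    name = 'spider1'

    def __init__(self, site=None, **kwargs):
        # take the -a site=... argument explicitly so it is available here
        self.site = site
        self.allowed_domains = config[site]['allowed_domain']
        self.start_urls = config[site]['start_url']
        # rules must exist before CrawlSpider.__init__ runs, because they are compiled there
        self.rules = (
            Rule(LinkExtractor(unique=False,
                               allow=(config[site]['regex1'], config[site]['regex2'])),
                 callback='parse_item', follow=True),
        )
        super().__init__(**kwargs)

    def parse_item(self, response):
        # ... original cleaning logic from parse_item goes here
        yield {"URL": response.url}

The spider would then be run as before, e.g. scrapy crawl spider1 -a site=somekey, where somekey is a placeholder for a top-level key in the YAML config.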

List values inside for loop in python beautifulsoup

I am doing some scraping with BeautifulSoup. I use a for loop to scrape values from the subsequent pages. Everything is fine, but when I make a list of the scraped values I only get the values from the last page. Below is my code.
from bs4 import BeautifulSoup as bs
import requests

params = []
for page_number in range(0, 4):
    p = page_number * 10
    params.append(p)
print(params)

gymname_list = []
gymratings_list = []
gymnumreviews_list = []
gymcat_list = []

for i in params:
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
    url = f'https://www.yelp.com.au/search?find_desc=gyms&find_loc=Berlin%2C%20Germany&start={i}'
    response = requests.get(url, headers=headers)
    page_soup = bs(response.content, 'lxml')
    mains = page_soup.find_all("div", {"class": "mainAttributes__09f24__26-vh arrange-unit__09f24__3IxLD arrange-unit-fill__09f24__1v_h4 border-color--default__09f24__1eOdn"})
    for main in mains:
        try:
            gymname = main.find("a", {"class": "css-166la90"}).text
            print(gymname)
        except:
            print(None)
gymname_list.append(gymname)
In the code above, as you can see, I am trying to scrape the first four pages, but all I end up with is the gym name from the last, i.e. the fourth, page of results. I want all results in my list gymname_list. Please help.
The problem is indentation in your last for loop: gymname_list.append(gymname) has to sit inside the for main in mains: loop, like this:
for main in mains:
    try:
        gymname = main.find("a", {"class": "css-166la90"}).text
        print(gymname)
    except:
        print(None)
    gymname_list.append(gymname)
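One caveat with the snippet above: if main.find(...) fails, gymname keeps its value from the previous iteration and that stale name is appended anyway. If you would rather record a placeholder, a small variant (same loop, only the appends moved) could be:

for main in mains:
    try:
        gymname = main.find("a", {"class": "css-166la90"}).text
        print(gymname)
        gymname_list.append(gymname)
    except AttributeError:
        print(None)
        gymname_list.append(None)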

Beautiful Soup List of Words

I am trying to find certain words within a website. Right now my code can only check for one word, but I want it to be able to check for multiple words (say, instead of just checking for 'dog', I want it to check for ["dog", "cat", "adult"]).
# Import packages
import requests
from bs4 import BeautifulSoup

def count_words(url, the_word):
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    words = soup.find(text=lambda text: text and the_word in text)
    print(words)

def main():
    url = 'https://patch.com/illinois/alsip-crestwood/pet-adoption-alsip-crestwood-area-see-latest-dogs-cats-more'
    word = 'dog'
    count = count_words(url, word)
    print(url, count, word)

if __name__ == '__main__':
    main()
Basically, I do not know how to pass in a list of words instead of a single string!
I believe you're making it a bit more complicated than necessary. Try something like this:
url = "https://patch.com/illinois/alsip-crestwood/pet-adoption-alsip-crestwood-area-see-latest-dogs-cats-more"
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
pets = ["dog","cat"]
for pet in pets:
print(pet, len(soup.find_all(text=lambda text: text and pet in text)))
Output:
dog 13
cat 76
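If you want to keep the original count_words/main structure, a minimal sketch of the same idea (assuming it is acceptable for the function to return a dict of counts rather than print inside it) could look like this:

import requests
from bs4 import BeautifulSoup

def count_words(url, words):
    # fetch the page once, then count matching text nodes for each word
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    return {word: len(soup.find_all(text=lambda t, w=word: t and w in t)) for word in words}

def main():
    url = 'https://patch.com/illinois/alsip-crestwood/pet-adoption-alsip-crestwood-area-see-latest-dogs-cats-more'
    counts = count_words(url, ["dog", "cat", "adult"])
    for word, count in counts.items():
        print(url, count, word)

if __name__ == '__main__':
    main()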

Python facebook chatbook download from google

I'm trying to write a Python script that downloads Google Images results, but it gives the following error:
"C:\Users\marco\Desktop\Scripts Python\venv\Scripts\python.exe" "C:/Users/marco/Desktop/Scripts Python/ChatBot.py"
Traceback (most recent call last):
File "C:/Users/marco/Desktop/Scripts Python/ChatBot.py", line 4, in
from urllib import FancyURLopener
ImportError: cannot import name 'FancyURLopener' from 'urllib' (C:\Users\marco\AppData\Local\Programs\Python\Python37-32\lib\urllib__init__.py)
My code:
import os
import sys
import time
from urllib import FancyURLopener
import urllib2
import simplejson

# Define search term
searchTerm = "william shatner"

# Replace spaces ' ' in the search term with '%20' in order to comply with the request
searchTerm = searchTerm.replace(' ', '%20')

# Start FancyURLopener with defined version
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

myopener = MyOpener()

# Set count to 0
count = 0

for i in range(0, 10):
    # Notice that the start changes for each iteration in order to request a new set of images for each loop
    url = ('https://ajax.googleapis.com/ajax/services/search/images?' + 'v=1.0&q=' + searchTerm + '&start=' + str(i * 4) + '&userip=MyIP')
    print(url)
    request = urllib2.Request(url, None, {'Referer': 'testing'})
    response = urllib2.urlopen(request)

    # Get results using JSON
    results = simplejson.load(response)
    data = results['responseData']
    dataInfo = data['results']

    # Iterate over each result and get the unescaped url
    for myUrl in dataInfo:
        count = count + 1
        print(myUrl['unescapedUrl'])
        myopener.retrieve(myUrl['unescapedUrl'], str(count) + '.jpg')

    # Sleep for one second to prevent IP blocking from Google
    time.sleep(1)
As the error message says, FancyURLopener is not where you are looking for it. This is the correct import statement:
from urllib.request import FancyURLopener
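Note that the import is only the first problem on Python 3: urllib2 was replaced by urllib.request, and the Google image-search endpoint used in the script was shut down long ago. A minimal sketch of the Python 3 equivalents of the request and retrieve calls (the URL below is only a placeholder, not the old API):

import json
import urllib.request

url = 'https://example.com/some-json-endpoint'  # placeholder URL for illustration
request = urllib.request.Request(url, None, {'Referer': 'testing'})
with urllib.request.urlopen(request) as response:
    results = json.load(response)

# urllib.request.urlretrieve(image_url, '1.jpg') would replace MyOpener.retrieve for saving files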

Why does my CrawlerProcess not have the function "crawl"?

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import BackpageItem, CityvibeItem
from scrapy.shell import inspect_response
import re
import time
import sys

class MySpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.example.com']

    # Set last_page to decide how many pages are crawled
    last_page = 10
    start_urls = ['http://www.example.com/washington/?page=%s' % page for page in xrange(1, last_page)]

    rules = (
        # Follow all links inside <div class="cat"> and call parse_item on each link
        Rule(LinkExtractor(
            restrict_xpaths=('//a[@name="listing_link"]')),
            callback='parse_item'),
    )

    # Extract relevant text from the website into an ExampleItem
    def parse_item(self, response):
        item = ExampleItem()
        item['title'] = response.xpath('string(//h2[@class="post-title"]/text())').extract()
        item['desc'] = response.xpath('string(//div[@class="section post-body"]/text())').extract()
        item['url'] = response.url
        item['location'] = response.xpath('string(//div[@class="posting"]/div[2]/text())').extract()
        item['posted_date'] = response.xpath('string(//div[@class="post-date"]/span/text())').extract()  # .re("(?<=Posted\s*).*")
        item['crawled_date'] = time.strftime("%c")
        # not sure how to get the other image urls right now
        item['image_urls'] = response.xpath('string(//div[@class="section post-contact-container"]/div/div/img/@src)').extract()
        # I can't find this section on any pages right now
        item['other_ad_urls'] = response.xpath('//a[@name="listing_link"]/@href').extract()
        item['phone_number'] = "".join(response.xpath('//div[@class="post-info"]/span[contains(text(), "Phone")]/following-sibling::a/text()').extract())
        item['email'] = "".join(response.xpath('//div[@class="post-info"]/span[contains(text(), "Email")]/following-sibling::a/text()').extract())
        item['website'] = "".join(response.xpath('//div[@class="post-info limit"]/span[contains(text(), "Website")]/following-sibling::a/text()').extract())
        item['name'] = response.xpath('//div[@class="post-name"]/text()').extract()
        # uncomment for debugging
        # inspect_response(response, self)
        return item

# process1 = CrawlerProcess({
#     'ITEM_PIPELINES': {
#         # 'scrapy.contrib.pipeline.images.ImagesPipeline': 1
#         'backpage.pipelines.GeolocationPipeline': 4,
#         'backpage.pipelines.LocationExtractionPipeline': 3,
#         'backpage.pipelines.BackpagePipeline': 5
#     }
# })

process1 = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process1.crawl(MySpider)
process1.start()
My spider works perfectly when I run it from the command line with
scrapy crawl example
but I will need to run multiple spiders, so I want to put them all in a script and use CrawlerProcess. When I try to run this I get the error,
AttributeError: 'CrawlerProcess' object has no attribute 'crawl'
This is Scrapy version 0.24.6.
All items and pipelines are correct, because the spider works from the command line.
There is (was?) a compatibility problem between Scrapy and Scrapyd. I needed to run Scrapy 0.24 and Scrapyd 1.0.1.
Here is the issue on GitHub:
https://github.com/scrapy/scrapyd/issues/100#issuecomment-115268880
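For reference, the crawl/start pattern used in the question is the one documented for later Scrapy releases, so once the version conflict is resolved the script itself should not need to change much. A minimal sketch against a modern Scrapy (the spider, site, and settings below are placeholders, not the question's code):

import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    # placeholder spider used only to illustrate the CrawlerProcess API
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('span.text::text').getall():
            yield {'quote': quote}

process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
process.crawl(QuotesSpider)  # repeated crawl() calls register multiple spiders
process.start()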