Add Filter in Python - Selenium

I want to visit the advertisement links I collected from a site one by one and send an e-mail for each. I wrote a system, but it doesn't work, and it doesn't raise an error either. The rule is: if the advertisement contains certain words or a phone number, no e-mail should be sent; otherwise an e-mail should be sent. I wrote filters for this and am sharing my code. If you see any mistakes (logic errors, code errors…), I would be very happy if you could point them out.
Word Filter
try:
    re = driver.find_element(By.XPATH, "/html/body")
    search_words = re.findAll(
        "Import",
        "Tree",
        "Seven",
    )
    print("\nCannot send mail because it contains the word.Index :", i)
    print(re.findAll)
except:
    search_words = " "
time.sleep(1)

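find_element returns a WebElement, not a string, so your variable re (which also shadows the re module) has no findAll method. The call raises an AttributeError, and the bare except swallows it, which is why you see no error. I think you are looking for the text of the element? If so, you need to use the text property:
re = driver.find_element(By.XPATH, "/html/body").text
But to make the filter work properly you can do it like this: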
import re

try:
    # .text returns the visible page text as a string
    web_text = driver.find_element(By.XPATH, "/html/body").text
    words = ["Import", "Tree", "Seven"]
    # Keep only the filter words that actually occur in the page text
    search_words = [word for word in words if re.findall(word, web_text)]
    text_words = ""
    if search_words:
        # Build "word1, word2." for the message
        text_words = ", ".join(search_words) + "."
        print(f"\nCannot send mail because it contains the word(s): {text_words}")
except Exception:
    text_words = ""
time.sleep(1)
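The filter described in the question (skip the e-mail when the advertisement contains certain words or a phone number) can also be separated from Selenium and checked on plain strings. The following is only a minimal sketch: the function names find_blocked_words and should_send_mail are hypothetical, and PHONE_PATTERN is an assumed, deliberately loose pattern (ten or more digits, optionally separated by spaces or dashes) that would need adjusting to the phone formats the site actually uses.

```python
import re

BLOCKED_WORDS = ["Import", "Tree", "Seven"]

# Assumption: a phone number is ten or more digits, each optionally
# followed by a single space or dash. Adjust to the real formats.
PHONE_PATTERN = re.compile(r"(?:\d[\s-]?){10,}")


def find_blocked_words(text, words=BLOCKED_WORDS):
    """Return the blocked words that occur in text as whole words."""
    found = []
    for word in words:
        # re.escape guards against regex metacharacters in a word;
        # \b makes "Tree" match "tree" but not "trees" or "street".
        if re.search(r"\b" + re.escape(word) + r"\b", text, flags=re.IGNORECASE):
            found.append(word)
    return found


def should_send_mail(text):
    """Send only if the text has no blocked word and no phone number."""
    return not find_blocked_words(text) and not PHONE_PATTERN.search(text)
```

With Selenium, the text argument would be the value of driver.find_element(By.XPATH, "/html/body").text for each advertisement.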

Related

Webscraping customer review - Invalid selector error using XPath

I am trying to extract the user ID, rating and review from the following site using Selenium, and it raises an "invalid selector" error. I think the XPath I defined to get the review text is the cause, but I have been unable to resolve the issue. The site link is below:
teslamotor review
The code I used is the following:
#Class for Review webscraping from consumeraffairs.com site
class CarForumCrawler():
    def __init__(self, start_link):
        self.link_to_explore = start_link
        self.comments = pd.DataFrame(columns = ['rating','user_id','comments'])
        self.driver = webdriver.Chrome(executable_path=r'C:/Users/mumid/Downloads/chromedriver/chromedriver.exe')
        self.driver.get(self.link_to_explore)
        self.driver.implicitly_wait(5)
        self.extract_data()
        self.save_data_to_file()
    def extract_data(self):
        ids = self.driver.find_elements_by_xpath("//*[contains(@id,'review-')]")
        comment_ids = []
        for i in ids:
            comment_ids.append(i.get_attribute('id'))
        for x in comment_ids:
            #Extract dates from for each user on a page
            user_rating = self.driver.find_elements_by_xpath('//*[@id="' + x +'"]/div[1]/div/img')[0]
            rating = user_rating.get_attribute('data-rating')
            #Extract user ids from each user on a page
            userid_element = self.driver.find_elements_by_xpath('//*[@id="' + x +'"]/div[2]/div[2]/strong')[0]
            userid = userid_element.get_attribute('itemprop')
            #Extract Message for each user on a page
            user_message = self.driver.find_elements_by_xpath('//*[@id="' + x +'"]]/div[3]/p[2]/text()')[0]
            comment = user_message.text
            #Adding date, userid and comment for each user in a dataframe
            self.comments.loc[len(self.comments)] = [rating,userid,comment]
    def save_data_to_file(self):
        #we save the dataframe content to a CSV file
        self.comments.to_csv ('Tesla_rating-6.csv', index = None, header=True)
    def close_spider(self):
        #end the session
        self.driver.quit()
try:
    url = 'https://www.consumeraffairs.com/automotive/tesla_motors.html'
    mycrawler = CarForumCrawler(url)
    mycrawler.close_spider()
except:
    raise
The error I am getting is as follows:
Also, the XPath I tried to trace comes from the following HTML:
You are seeing the classic error of...
as find_elements_by_xpath('//*[@id="' + x +'"]]/div[3]/p[2]/text()')[0] ends in text(), which selects text nodes, whereas you need to pass an XPath expression that selects elements. The expression also contains a stray extra ] after the id predicate, which is not valid XPath.
You need to change it to:
user_message = self.driver.find_elements_by_xpath('//*[@id="' + x +'"]/div[3]/p[2]')[0]
References
You can find a couple of relevant detailed discussions in:
invalid selector: The result of the xpath expression "//a[contains(@href, 'mailto')]/@href" is: [object Attr] getting the href attribute with Selenium
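The idea in the answer (select the element with XPath, then read its text in Python instead of ending the expression in text()) can be tried without a browser. This is only a sketch using the standard library's xml.etree.ElementTree in place of Selenium; the SAMPLE markup and the review_text helper are hypothetical and much simpler than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the review page.
SAMPLE = """
<html><body>
  <div id="review-1">
    <div><p>header</p><p>Great car.</p></div>
  </div>
  <div id="review-2">
    <div><p>header</p><p>Service was slow.</p></div>
  </div>
</body></html>
"""


def review_text(root, review_id):
    """Select the second <p> element of a review, then read its text."""
    # The path ends at an element (p[2]), not at a text node.
    path = ".//*[@id='{}']/div/p[2]".format(review_id)
    node = root.find(path)
    return node.text if node is not None else None


root = ET.fromstring(SAMPLE.strip())
messages = [review_text(root, rid) for rid in ("review-1", "review-2")]
```

In Selenium the equivalent step is to locate the element and then read its .text property, as the question's code already does on the next line.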

Better way to clean product description using BeautifulSoup?

I have written the following code to fetch a product description from a site using BeautifulSoup:
def get_soup(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            html = response.content
            return BeautifulSoup(html, "html.parser")
    except Exception as ex:
        print("error from " + url + ": " + str(ex))
def get_product_details(url):
    try:
        soup = get_soup(url)
        prod_details = dict()
        desc_list = soup.select('p ~ ul')
        prod_details['description'] = ''.join(desc_list)
        return prod_details
    except Exception as ex:
        logger.warning('%s - %s', ex, url)
if __name__ == '__main__':
    get_product_details("http://www.aprisin.com.sg/p-748-littletikespoptunesguitar.html")
In the above code I am trying to convert the description (a list) to a string, but I get the issue below:
[WARNING] aprisin.py:82 get_product_details() : sequence item 0: expected str instance, Tag found - http://www.aprisin.com.sg/p-748-littletikespoptunesguitar.html
Output of description without converting it to a string:
[<ul>
<li>Freestyle</li>
<li>Play along with 5 pre-set tunes: </li>
</ul>, <ul>
<li>Each string will play a note</li>
<li>Guitar has a whammy bar</li>
<li>2-in-1 volume control and power button </li>
<li>Simple and easy to use </li>
<li>Helps develop music appreciation </li>
<li>Requires 3 "AA" alkaline batteries (included)</li>
</ul>]
You are passing a list of Tag objects instead of strings to join(), and join() only works with a list of strings. Change the join call as follows:
prod_details['description'] = ''.join([tag.get_text() for tag in desc_list])
or, only when every tag contains a single string (tag.string is None otherwise, as it is for the <ul> tags above, and join() would fail again):
prod_details['description'] = ''.join([tag.string for tag in desc_list])
In case you want the description along with html content, you can use the following:-
# this will preserve the html tags and indentation.
prod_details['description'] = ''.join([tag.prettify() for tag in desc_list])
or
# this will return the html content as string.
prod_details['description'] = ''.join([str(tag) for tag in desc_list])
desc_list is a list of bs4.element.Tag objects. You should convert a tag to a string before using it, for example for the first one:
desc_list = soup.select('p ~ ul')
prod_details['description'] = str(desc_list[0])
You're trying to join a list of Tags, but the join method needs str arguments. Try:
''.join([str(i) for i in desc_list])
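The point shared by all three answers is that str.join() accepts only strings, so each item has to be converted first. The sketch below reproduces that behaviour without BeautifulSoup: FakeTag is a hypothetical stand-in for bs4.element.Tag, written only to show why the original call fails and how the two conversions differ.

```python
class FakeTag:
    """Hypothetical stand-in for bs4.element.Tag (illustration only)."""

    def __init__(self, markup, text):
        self.markup = markup
        self._text = text

    def __str__(self):
        # Like str(tag): the HTML markup as a string.
        return self.markup

    def get_text(self):
        # Like tag.get_text(): the text without the markup.
        return self._text


desc_list = [
    FakeTag("<ul><li>Freestyle</li></ul>", "Freestyle"),
    FakeTag("<ul><li>Guitar has a whammy bar</li></ul>", "Guitar has a whammy bar"),
]

# Joining the objects directly fails, as in the question.
try:
    "".join(desc_list)
    join_error = None
except TypeError as exc:
    join_error = str(exc)

# Converting each item first works.
as_html = "".join(str(tag) for tag in desc_list)
as_text = "; ".join(tag.get_text() for tag in desc_list)
```

Using a separator such as "; " in the text version is a design choice, not something the answers prescribe; it keeps the items of separate lists from running together.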

All the links in the same row

I'm building a database table with 2 columns: event_name and event_URL. My code doesn't store the name and puts all the URLs in the event_URL column. Screenshot: https://prnt.sc/fru1tr
Code:
import urllib2
from bs4 import BeautifulSoup
import psycopg2
page = urllib2.urlopen('https://www.meetup.com/find/outdoors-adventure/?allMeetups=false&radius=50&userFreeform=London%2C+&mcId=c1012717&change=yes&sort=default')
soup = BeautifulSoup(page, 'lxml')
events = soup.find('ul', class_='j-groupCard-list searchResults tileGrid tileGrid--3col tileGrid_atMedium--2col tileGrid_atSmall--1col')
A = []
B = []
try:
    conn = psycopg2.connect("dbname='meetup' user='postgres' host='localhost' password='root'")
except:
    print 'Unable to connect to the database.'
cur = conn.cursor()
for event in events.findAll('li'):
    text = event.findAll('h3')
    if len(text) != 0:
        A.append(text[0].find(text = True))
    url = event.find('a', href=True)
    if len(url) != 0:
        B.append(url['href'])
cur.execute("""INSERT INTO outdoors_adventure(event_name,event_url) VALUES(%s,%s)""", (tuple(A),tuple(B)))
conn.commit()
del A[:]
del B[:]
If the indentation is right in the posted code, the problem might be in the nested for-loop: for every event, you append the "B" list with all the links on the page. You could try:
for event in events.findAll('li'):
    text = event.findAll('h3')
    if len(text) != 0:
        A.append(text[0].find(text = True))
for link in events.findAll('li'):
    url = link.find('a', href=True)
    if len(url) != 0:
        B.append(url['href'])
Or better, keep the event-name and event-URL search in a single for-loop, fetching first the text and then the URL of each event.
EDIT: You can simplify the name extraction by using:
for event in events.findAll('li'):
    text = event.h3.string.strip()
    if len(text) != 0:
        A.append(text)
    url = event.find('a', href=True)
    ...
Let me know if that does the trick for you (it does on my side).
EDIT2: The problem might just be that the extracted string starts with tabs (maybe that's why your DB seems "not to show the name": it's there, but you only see the tabs in the preview?). Just use strip() to remove them.
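The two suggestions above (pair each name with its URL in one pass, and strip the leading tabs) lead naturally to inserting one row per event instead of one row holding every value. The following is only a sketch under assumptions: it uses the standard library's sqlite3 with an in-memory database as a stand-in for PostgreSQL, and the events list is hypothetical sample data rather than real scraped output.

```python
import sqlite3

# Hypothetical (name, url) pairs as they might come out of the loop,
# including the leading tabs mentioned in EDIT2.
events = [
    ("\t\tLondon Hikers\n", "https://www.meetup.com/example-hikers/"),
    ("\t\tThames Kayak Club\n", "https://www.meetup.com/example-kayak/"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outdoors_adventure (event_name TEXT, event_url TEXT)")

# One tuple per event: the name is stripped, the URL is kept as found.
rows = [(name.strip(), url) for name, url in events if name.strip() and url]

# executemany inserts one row per tuple.
conn.executemany(
    "INSERT INTO outdoors_adventure (event_name, event_url) VALUES (?, ?)",
    rows,
)
conn.commit()

stored = conn.execute(
    "SELECT event_name, event_url FROM outdoors_adventure ORDER BY rowid"
).fetchall()
```

sqlite3 uses ? placeholders, whereas psycopg2 uses %s as in the question; the row-per-event structure is the part meant to carry over.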

Scrapy only show the first result of each page

I need to scrape the items on the first page, then follow the "next" button to the second page, scrape it, and so on.
This is my code, but it only scrapes the first item of each page: if there are 20 pages, it visits every page and scrapes only the first item.
Could anyone please help me?
Thank you.
Apologies for my English.
class CcceSpider(CrawlSpider):
    name = 'ccce'
    item_count = 0
    allowed_domain = ['www.example.com']
    start_urls = ['https://www.example.com./afiliados value=&categoria=444&letter=']
    rules = {
        # Rules for each item
        Rule(LinkExtractor(allow = (), restrict_xpaths = ('//li[@class="pager-next"]/a')), callback = 'parse_item', follow = True),
    }
    def parse_item(self, response):
        ml_item = CcceItem()
        # product info
        ml_item['nombre'] = response.xpath('normalize-space(//div[@class="news-col2"]/h2/text())').extract()
        ml_item['url'] = response.xpath('normalize-space(//div[@class="website"]/a/text())').extract()
        ml_item['correo'] = response.xpath('normalize-space(//div[@class="email"]/a/text())').extract()
        ml_item['descripcion'] = response.xpath('normalize-space(//div[@class="news-col4"]/text())').extract()
        self.item_count += 1
        if self.item_count > 5:
            #insert_table(ml_item)
            raise CloseSpider('item_exceeded')
        yield ml_item
As you haven't given a working target URL, I'm guessing a bit here, but most probably this is the problem:
parse_item should be a parse_page (and act accordingly).
Scrapy is downloading a full page which has - according to your description - multiple items and then passes this as a response object to your parse method.
It's your parse method's responsibility to process the whole page by iterating over the items displayed on the page and creating multiple scraped items accordingly.
The scrapy documentation has several good examples for this, one is here: https://doc.scrapy.org/en/latest/topics/selectors.html#working-with-relative-xpaths
Basically your code structure in def parse_XYZ should look like this:
def parse_page(self, response):
    items_on_page = response.xpath('//...')
    for sel_item in items_on_page:
        ml_item = CcceItem()
        # product info
        ml_item['nombre'] = # ...
        # ...
        yield ml_item
Insert the right xpaths for getting all items on the page and adjust your item xpaths and you're ready to go.
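The structure described above (select every item node on the page, then read each field relative to that node and yield one item per node) can be tried outside Scrapy. This is a sketch only: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, and the PAGE markup, the member class and the field paths are hypothetical because the real page is not available.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing page with three items.
PAGE = """
<div class="listing">
  <div class="member"><h2>Alpha</h2><a href="https://alpha.example">site</a></div>
  <div class="member"><h2>Beta</h2><a href="https://beta.example">site</a></div>
  <div class="member"><h2>Gamma</h2><a href="https://gamma.example">site</a></div>
</div>
"""


def parse_page(root):
    """Yield one item per listing node, using paths relative to the node."""
    for node in root.findall(".//div[@class='member']"):
        link = node.find("a")
        yield {
            "nombre": node.findtext("h2", default="").strip(),
            "url": link.get("href") if link is not None else None,
        }


items = list(parse_page(ET.fromstring(PAGE.strip())))
```

In the spider, the same shape applies with sel_item.xpath('./...') calls inside the loop. An absolute path starting with // inside the loop would search the whole page again, which is consistent with getting only the first item per page.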

Web page is showing weird unicode(?) letters: \u200e

How can I remove that? I have tried many things over the last 3 hours and cannot get past this problem on my own. Please help.
The first "for" statement grabs article titles from news.google.com.
The second "for" statement grabs the time of submission of each article on news.google.com.
This is a Django project, and the page shows the article titles with their submission times as a list. The weird unicode characters come from the second "for" statement, which collects the submission times. Here is my views.py:
def articles(request):
    """ Grabs the most recent articles from the main news page """
    import bs4, requests
    list = []
    list2 = []
    url = 'https://news.google.com/'
    r = requests.get(url)
    try:
        r.raise_for_status() == True
    except ValueError:
        print('Something went wrong.')
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    for (listarticles) in soup.find_all('h2', 'esc-lead-article-title'):
        if listarticles is not None:
            a = listarticles.text
            list.append(a)
    for articles_times in soup.find_all('span','al-attribution-timestamp'):
        if articles_times is not None:
            b = articles_times.text
            list2.append(b)
    list = zip(list,list2)
    context = {'list':list}
    return render(request, 'newz/articles.html', context)