Using Keyword to Print Sentence - beautifulsoup

I am trying to write a program that takes a list of keywords and prints out the sentence containing each word from a website the user enters. Right now my output prints a whole lot of extra stuff, such as symbols; I want it to print just the sentence for each occurrence. How do I go about that?
Code so far:
# Import packages
import requests
from bs4 import BeautifulSoup

url = input('Enter URL:')
reg = requests.get(url, allow_redirects=False)
soup = BeautifulSoup(reg.content, "lxml")
words = ["technology", "wireless"]
for word in words:
    print(word, soup.find(text=lambda text: text and word in text))

You can use NLTK (Natural Language Toolkit) http://www.nltk.org/ to split the text into sentences. Install it with:
pip install nltk
then run the following lines once in Python:
import nltk.data
nltk.download('punkt')
Then the code looks like this:
import requests
from bs4 import BeautifulSoup
import nltk.data

words = ["technology", "wireless", "people"]
url = 'https://marketbusinessnews.com/financial-glossary/wireless-technology/'
reg = requests.get(url, allow_redirects=False)
soup = BeautifulSoup(reg.content, "lxml")

# get rid of unwanted tags
for unwanted_tag in soup(['script', 'style', 'head', 'title', 'meta']):
    unwanted_tag.decompose()

# get the text from the soup
texts = " ".join(soup.stripped_strings)

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
for word in words:
    print("######", word, "######")
    for text in tokenizer.tokenize(texts):
        if word in text.lower():  # lower-case to make the search case insensitive
            print(text)
This solution is not perfect: you may end up with spaces where you don't expect them, but the alternative is having no spaces where you do expect them.
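As a related note (a sketch, assuming the same soup object as above), BeautifulSoup's get_text can do the joining and stripping in one call, which gives an alternative to stripped_strings:

# join all visible strings with a single space, stripping whitespace first
texts = soup.get_text(" ", strip=True)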

It is better to select the <p> tags, which define paragraphs:
for word in words:
    for p in soup.select('p'):
        if word in p.text:
            print(word, p.text)
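Since the question asks for the sentence rather than the whole paragraph, you could combine this with the NLTK tokenizer from the answer above (a sketch, reusing the words, soup, and tokenizer defined there):

for p in soup.select('p'):
    # split each paragraph into sentences, then match keywords per sentence
    for sentence in tokenizer.tokenize(p.text):
        for word in words:
            if word in sentence.lower():
                print(word, sentence)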

Related

Space between paragraphs in Selenium

I have a text file with a text message (in 5 paragraphs) and a bot that sends messages to Instagram users. But it doesn't put spaces between paragraphs like there are in the txt file, just straight text. Any ideas how to fix it?
If you are having issues with Selenium not typing spaces when sending a string, you can use the split method and, for each word in the resulting list, send a SPACE with Keys in Selenium, like this:
parg = "Hello World!"
list_words = parg.split()
for word in list_words:
    input_selected.send_keys(word)
    input_selected.send_keys(Keys.SPACE)
I believe something like this can work too, and it will be better since it doesn't require any lists, just one line of code:
input_selected.send_keys(parg.replace(" ",Keys.SPACE))
NOTE:
in order to import Keys in selenium:
from selenium.webdriver.common.keys import Keys
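For context, a minimal self-contained sketch (the driver setup and the input_selected locator are assumptions; adjust the selector to your page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL
input_selected = driver.find_element(By.TAG_NAME, "textarea")  # assumed locator
parg = "Hello World!"
input_selected.send_keys(parg.replace(" ", Keys.SPACE))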

Recognize newline (\n) in text as end of sentence in Spacy

I'd like to recognize a newline in text as the end of a sentence. I've tried setting it up in the nlp object like this:
import spacy

text = 'Guest Blogging\nGuest Blogging allows the user to collect backlinks'
nlp = spacy.load("en_core_web_lg")
config = {"punct_chars": ['\n']}
nlp.add_pipe("sentencizer", config=config)
for sent in nlp(text).sents:
    print('next sentence:')
    print(sent)
The output of this is:
next sentence:
Guest Blogging
Guest Blogging allows the user to collect backlinks
I don't understand why Spacy isn't recognizing the newline as a sentence end. My desired output is:
next sentence:
Guest Blogging
next sentence:
Guest Blogging allows the user to collect backlinks
Does anyone know how to achieve this?
The reason the sentencizer isn't doing anything here is that the parser has run first and already set all the sentence boundaries, and then the sentencizer doesn't modify any existing sentence boundaries.
The sentencizer with \n is only the right option if you know you have exactly one sentence per line in your input text. Otherwise a custom component that adds sentence starts after newlines (but doesn't set all sentence boundaries) is probably what you want.
If you want to set some custom sentence boundaries before running the parser, you need to be sure you add your custom component before the parser in the pipeline:
nlp.add_pipe("my_component", before="parser")
Your custom component would set token.is_sent_start = True for the tokens right after newlines and leave all other tokens unmodified.
Check out the second example here: https://spacy.io/usage/processing-pipelines#custom-components-simple
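A minimal sketch of such a component (the component name here is my own; the pattern follows the linked docs):

import spacy
from spacy.language import Language

@Language.component("newline_sent_start")
def newline_sent_start(doc):
    # mark the token right after each newline as a sentence start
    for i, token in enumerate(doc[:-1]):
        if "\n" in token.text:
            doc[i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("newline_sent_start", before="parser")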
You can do this by excluding the parser when loading the model:
import spacy

nlp = spacy.load('en_core_web_sm', exclude=["parser"])
text = 'Guest Blogging\nGuest Blogging allows the user to collect backlinks'
config = {"punct_chars": ['\n']}
nlp.add_pipe("sentencizer", config=config)
for sent in nlp(text).sents:
    print("next sentence")
    print(sent)
Output:
next sentence
Guest Blogging
next sentence
Guest Blogging allows the user to collect backlinks
You could also break the text into sentences on \n before feeding it to spaCy.
from spacy.lang.en import English

def get_sentences(_str):
    chunks = _str.split('\n')
    sentences = []
    nlp = English()
    nlp.add_pipe("sentencizer")
    for chunk in chunks:
        doc = nlp(chunk)
        sentences += [sent.text.strip() for sent in doc.sents]
    return sentences
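For example, with the same input text as above:

text = 'Guest Blogging\nGuest Blogging allows the user to collect backlinks'
print(get_sentences(text))
# ['Guest Blogging', 'Guest Blogging allows the user to collect backlinks']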

Python function print output redirect

It's my first attempt at coding and I've hit a bit of a snag.
I was wondering if one of you would be able to point me in the right direction for this case.
What I'm trying to do is convert a string from hexadecimal format into readable text.
Following this, I want to run the converted string/text against some regex commands in order to pick out things like email addresses and domains.
I know the pieces of code work individually when testing, but the problem is that when running them together, I am unable to properly assign the print output of the converted string to a variable that I can run the regex commands against.
Any tips or suggestions on how I can get around to properly assigning the converted string into a variable and work on it?
This is part of the code; the rest consists of options to convert other format types into readable text:
#!/usr/bin/env python3
import base64
import codecs
import re
import sys
print(" If the data is Hexadecimal and looks similar to this: 48656C6C6F20686F77206172, enter: 2 ")
print("")
decision = int(input("enter the number here: "))
print("")
message = input("Enter the data you wish to have decoded: ")
def decode_hex1(encoded_text):
    information = ''
    for i in range(len(encoded_text)//2):
        information = information + print(codecs.decode(encoded_text[i*2:i*2+2], "hex").decode('utf-8'), end="")
    return information
if decision == 2:
    output = decode_hex1(message)
    match_emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', output)
    print("The following emails may be of interest to you: ", match_emails)
    print("")
    domain_regex = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
    match_domains = re.findall(domain_regex, output)
    print("The following domains may be of interest to you: ", match_domains)
else:
    print("Invalid Choice - please check the number and try again")
    print("")
print("Done")

How to scrape NON-ASCII characters alone

I am using BeautifulSoup to scrape data. The text I want to scrape is "€ 48,50", which contains a non-ASCII character. However, I would like to replace the euro sign with nothing so that the final output is "48,50". I have been getting errors because the console cannot print it. I am using Python 2.7 on Windows for this. I will appreciate a solution.
I was basically getting errors and do not know how to go about this. Or is there a way I can just extract the non-ASCII characters alone?
w = item.find_all("div", {"class": "product-price"}).find(
    "strong", {"class": "product-price__money"}).text.replace("\\u20ac", " ")
print w
You need to decode the string and pass the replace function a unicode string.
text = "€ 48,50"
w = text.decode("utf-8").replace(u"\u20ac"," ")
print w
See How to replace unicode characters in string with something else python? for more details.
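For reference, in Python 3 strings are Unicode by default, so no decode step is needed (a sketch; the strip() call is an addition to remove the leftover space):

text = "€ 48,50"
w = text.replace("\u20ac", "").strip()
print(w)  # 48,50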

Extract text from PDF in respect to formatting (font size, type etc)

Is it possible to extract text from a PDF file based on a specific font/font size/font colour etc.? I prefer Perl, Python or *nix command-line utilities. My goal is to extract all headlines from a PDF file so I will have a nice index of the articles contained in a single PDF.
Text and font/font size/position (no colour, as far as I checked) can be obtained from Ghostscript's txtwrite device (try the -dTextFormat=0 | 1 options), as well as from mudraw (MuPDF) with the -tt option. Then parse the XML-like output with e.g. Perl.
I have working code that extracts text from a PDF along with the size of the font. I achieved this with the help of pdfminer, and it has worked with many PDFs.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
import os

path = r'path\to\your\pdf\folder'
os.chdir(path)
Extract_Data = []
for PDF_file in os.listdir():
    if PDF_file.endswith('.pdf'):
        for page_layout in extract_pages(PDF_file):
            for element in page_layout:
                if isinstance(element, LTTextContainer):
                    for text_line in element:
                        for character in text_line:
                            if isinstance(character, LTChar):
                                Font_size = character.size
                    Extract_Data.append([Font_size, element.get_text()])
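To get from this toward the original goal of indexing headlines, one rough follow-up (a sketch; the 1.5x-average threshold is an arbitrary assumption) is to keep only the text whose font size is well above the average:

# keep entries whose font size is well above the document's average
sizes = [size for size, text in Extract_Data]
threshold = 1.5 * sum(sizes) / len(sizes)
headlines = [text.strip() for size, text in Extract_Data if size > threshold]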
I have used fitz (PyMuPDF) to accomplish the required task, as it is much faster compared to pdfminer. You can find my duplicate answer to a similar question here.
An example code snippet is shown below.
import fitz  # PyMuPDF

def scrape(keyword, filePath):
    results = []  # list of tuples that store the information as (text, font size, font name)
    pdf = fitz.open(filePath)  # filePath is a string that contains the path to the pdf
    for page in pdf:
        text_dict = page.get_text("dict")
        blocks = text_dict["blocks"]
        for block in blocks:
            if "lines" in block.keys():
                for line in block["lines"]:
                    for span in line["spans"]:
                        if keyword in span["text"].lower():  # only store font information of a specific keyword
                            # span["text"] -> string, span["size"] -> font size, span["font"] -> font name
                            results.append((span["text"], span["size"], span["font"]))
    pdf.close()
    return results
If you wish to find the font information of every line, you may omit the if condition that checks for a specific keyword.
You can extract the text information in any desired format by understanding the structure of dictionary outputs that we obtain by using get_text("dict"), as mentioned in the documentation.
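A quick usage sketch (the keyword and file name are placeholders):

results = scrape("wireless", "example.pdf")
for text, size, font in results:
    print(font, size, text)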