Recognize newline (\n) in text as end of sentence in spaCy

I'd like to recognize a newline in text as the end of a sentence. I've tried inputting it into the nlp object like this:
import spacy

text = 'Guest Blogging\nGuest Blogging allows the user to collect backlinks'
nlp = spacy.load("en_core_web_lg")
config = {"punct_chars": ['\n']}
nlp.add_pipe("sentencizer", config=config)
for sent in nlp(text).sents:
    print('next sentence:')
    print(sent)
The output of this is:
next sentence:
Guest Blogging
Guest Blogging allows the user to collect backlinks
I don't understand why Spacy isn't recognizing the newline as a sentence end. My desired output is:
next sentence:
Guest Blogging
next sentence:
Guest Blogging allows the user to collect backlinks
Does anyone know how to achieve this?

The reason the sentencizer isn't doing anything here is that the parser has run first and already set all the sentence boundaries, and then the sentencizer doesn't modify any existing sentence boundaries.
The sentencizer with \n is only the right option if you know you have exactly one sentence per line in your input text. Otherwise a custom component that adds sentence starts after newlines (but doesn't set all sentence boundaries) is probably what you want.
If you want to set some custom sentence boundaries before running the parser, you need to make sure you add your custom component before the parser in the pipeline:
nlp.add_pipe("my_component", before="parser")
Your custom component would set token.is_sent_start = True for the tokens right after newlines and leave all other tokens unmodified, as in the sketch below.
Check out the second example here: https://spacy.io/usage/processing-pipelines#custom-components-simple
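A minimal sketch of such a component for spaCy 3 (the component name newline_sentence_starts is mine; adapt the newline check to your data if needed):

import spacy
from spacy.language import Language

@Language.component("newline_sentence_starts")
def newline_sentence_starts(doc):
    # Mark the token right after each newline token as a sentence start,
    # and leave every other token alone so the parser can still decide
    # the remaining boundaries.
    for token in doc[:-1]:
        if "\n" in token.text:
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("newline_sentence_starts", before="parser")

text = 'Guest Blogging\nGuest Blogging allows the user to collect backlinks'
for sent in nlp(text).sents:
    print('next sentence:')
    print(sent)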

You can do this by excluding the parser when loading the model:
import spacy

nlp = spacy.load('en_core_web_sm', exclude=["parser"])
text = 'Guest Blogging\nGuest Blogging allows the user to collect backlinks'
config = {"punct_chars": ['\n']}
nlp.add_pipe("sentencizer", config=config)
for sent in nlp(text).sents:
    print("next sentence")
    print(sent)
Output:
next sentence
Guest Blogging
next sentence
Guest Blogging allows the user to collect backlinks

You could also split the text on \n yourself before feeding it to spaCy.
from spacy.lang.en import English

def get_sentences(_str):
    chunks = _str.split('\n')
    sentences = []
    nlp = English()
    nlp.add_pipe("sentencizer")
    for chunk in chunks:
        doc = nlp(chunk)
        sentences += [sent.text.strip() for sent in doc.sents]
    return sentences
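For example, fed the question's text it returns one sentence per line (expected output shown as a comment):

print(get_sentences('Guest Blogging\nGuest Blogging allows the user to collect backlinks'))
# ['Guest Blogging', 'Guest Blogging allows the user to collect backlinks']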


Python function print output redirect

It's my first attempt at coding and I've hit a bit of a snag.
I was wondering if one of you would be able to point me in the right direction for this case.
What I'm trying to do is convert a string from hexadecimal format into readable text.
Following this, I want to run the converted string/text against some regex commands in order to pick out things like email addresses and domains.
I know the pieces work individually when testing, but the problem is that when I try to run them together, I am unable to properly assign the print output of the converted string to the variable I want to run the regex commands against.
Any tips or suggestions on how I can properly assign the converted string to a variable and work on it?
This is part of the code; the rest are options available to convert other format types into readable text:
#!/usr/bin/env python3
import base64
import codecs
import re
import sys

print(" If the data is Hexadecimal and looks similar to this: 48656C6C6F20686F77206172, enter: 2 ")
print("")
decision = int(input("enter the number here: "))
print("")
message = input("Enter the data you wish to have decoded: ")

def decode_hex1(encoded_text):
    information = ''
    for i in range(len(encoded_text)//2):
        information = information + \
            print(codecs.decode(encoded_text[i*2:i*2+2], "hex").decode('utf-8'), end="")
    return information

if decision == 2:
    output = decode_hex1(message)
    match_emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', output)
    print("The following emails may be of interest to you: ", match_emails)
    print("")
    domain_regex = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
    match_domains = re.findall(domain_regex, output)
    print("The following domains may be of interest to you: ", (match_domains))
else:
    print("Invalid Choice - please check the number and try again")
    print("")
    print("Done")

How to break a text line into multiple events

I am new to Splunk and I'm trying to play around a little with source types and their regex settings. Let's say I put the following events into HEC:
curl -k https://utu:8088/services/collector/event/1.0 -H "Authorization: Splunk 21755979-ed43-4a1a-8962-e6e45ccf3ccf" -d '{"event": "splunk splunk splunk dog", "sourcetype": "hec_st"}'
curl -k https://utu:8088/services/collector/event/1.0 -H "Authorization: Splunk 21755979-ed43-4a1a-8962-e6e45ccf3ccf" -d '{"event": "splunk splunk splunk cat", "sourcetype": "hec_st"}'
hec_st is the source type with regex:
(splunk)\s+
with SHOULD_LINEMERGE=false
Can you please explain why the mentioned settings don't break the string "splunk splunk splunk cat" into multiple events like this:
splunk
splunk
splunk
cat
The string always shows up as one single event. Thanks a lot in advance.
T.
Firstly, the correct regex for your requirement is
([\s]+)
If you want to add it to props.conf, you can use the following stanza:
[hec_st]
BREAK_ONLY_BEFORE_DATE =
DATETIME_CONFIG =
LINE_BREAKER = (.)\s*
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = false
category = Miscellaneous
description = Split events by space
pulldown_type = 1
Below are the guidelines to follow when creating a regex for a source type that defines event boundaries for incoming data (LINE_BREAKER):
Specifies a regex that determines how the raw text stream is broken into initial events, before line merging takes place.
This sets SHOULD_LINEMERGE = false and LINE_BREAKER to the user-provided regular expression.
Defaults to ([\r\n]+), meaning data is broken into an event for each line, delimited by any number of carriage return or newline characters.
The regex must contain a capturing group -- a pair of parentheses which defines an identified subcomponent of the match.
Wherever the regex matches, Splunk considers the start of the first capturing group to be the end of the previous event and considers the end of the first capturing group to be the start of the next event.
The contents of the first capturing group are discarded, and will not be present in any event. You are telling Splunk that this text comes between lines.
Input data:
My name is SV.
Output data:
My
name
is
SV.
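As a rough illustration in Python of the event breaking described above (an analogy only; Splunk itself is not implemented this way):

import re

raw = "splunk splunk splunk cat"
# splitting on runs of whitespace and dropping the delimiters mirrors Splunk
# discarding the contents of the first capturing group at each boundary
events = [e for e in re.split(r"\s+", raw) if e]
print(events)  # ['splunk', 'splunk', 'splunk', 'cat']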
Alright, for anyone solving a similar problem: I was successful in the end with the following sourcetype stanza:
[hec_type]
BREAK_ONLY_BEFORE_DATE =
DATETIME_CONFIG =
LINE_BREAKER = ([\s+])
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = false
category = Miscellaneous
description = Split events by space
pulldown_type = 1
With this stanza, a sentence like "My name is Tomas" is broken up so that each word becomes one event, as I wanted. What really helps is to play with the regex in an online editor and then put it into props.conf.

Extract text from PDF with respect to formatting (font size, type, etc.)

Is it possible to extract text from a PDF file along with specific font/font size/font colour etc.? I prefer Perl, Python or *nix command-line utilities. My goal is to extract all headlines from a PDF file so I will have a nice index of the articles contained in a single PDF.
Text and font/font size/position (no colour, as far as I checked) can be obtained from Ghostscript's txtwrite device (try the -dTextFormat=0 | 1 options), as well as from mudraw (MuPDF) with the -tt option. Then parse the XML-like output with e.g. Perl.
I have working code which extracts text from a PDF along with the font size. I achieved this with the help of pdfminer, and it works across many PDFs.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar, LTLine, LAParams
import os

path = r'path\whereyour pdffile'
os.chdir(path)

Extract_Data = []
for PDF_file in os.listdir():
    if PDF_file.endswith('.pdf'):
        for page_layout in extract_pages(PDF_file):
            for element in page_layout:
                if isinstance(element, LTTextContainer):
                    for text_line in element:
                        for character in text_line:
                            if isinstance(character, LTChar):
                                Font_size = character.size
                        Extract_Data.append([Font_size, (element.get_text())])
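Since the question's goal was an index of headlines, a possible follow-up (my own suggestion, with an arbitrary size threshold) is to filter the collected data by font size:

# keep only text whose measured character size exceeds an arbitrary threshold
headlines = [text for size, text in Extract_Data if size > 14]
print(headlines)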
I have used fitz to accomplish the required task, as it is much faster compared to pdfminer. You can find my duplicate answer to a similar question here.
An example code snippet is shown below.
import fitz

def scrape(keyword, filePath):
    results = []  # list of tuples that store the information as (text, font size, font name)
    pdf = fitz.open(filePath)  # filePath is a string that contains the path to the pdf
    for page in pdf:
        dict = page.get_text("dict")
        blocks = dict["blocks"]
        for block in blocks:
            if "lines" in block.keys():
                spans = block['lines']
                for span in spans:
                    data = span['spans']
                    for lines in data:
                        if keyword in lines['text'].lower():  # only store font information of a specific keyword
                            results.append((lines['text'], lines['size'], lines['font']))
                            # lines['text'] -> string, lines['size'] -> font size, lines['font'] -> font name
    pdf.close()
    return results
If you wish to find the font information of every line, you may omit the if condition that checks for a specific keyword.
You can extract the text information in any desired format by understanding the structure of dictionary outputs that we obtain by using get_text("dict"), as mentioned in the documentation.
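A hypothetical usage example (the keyword and file name are mine, not from the answer):

results = scrape("budget", "report.pdf")
for text, size, font in results:
    print(size, font, text)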

making a list of traditional Chinese characters from a string

I am currently trying to estimate the number of times each character is used in a large sample of traditional Chinese characters. I am interested in characters not words. The file also includes punctuation and western characters.
I am reading in an example file of traditional Chinese characters. The file contains a large sample of traditional Chinese characters. Here is a small subset:
首映鼓掌10分鐘 評語指不及《花樣年華》
該片在柏林首映,完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本
增減20處 趙本山香港戲分被刪
在柏林影展放映的《一代宗師》版本
教李小龍武功 葉問決戰散打王
另一增加的戲分是開場時葉問(梁朝偉飾)
My strategy is to read each line, split each line into a list, and go through and check each character to see if it already exists in a list or a dictionary of characters. If the character does not yet exist in my list or dictionary, I will add it; if it does exist, I will increase the counter for that specific character. I will probably use two lists: a list of characters and a parallel list containing the counts. This will be more processing, but should also be much easier to code.
I have not gotten anywhere near this point yet.
I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.
However, I run into trouble when I try to make a list of each character on a particular line.
I've read through the following article. I understood many of the comments, but unfortunately, was unable to understand enough of it to solve my problem.
How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?
My code looks like the following
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
wordfile = open('Chinese_example.txt', 'r')
output = open('Chinese_output_python.txt', 'w')
LINES = wordfile.readlines()
Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.
A_LINE = list(LINES[0])
output.write(A_LINE[0])
You want to use this, from answerer 'flow' at How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?:
from re import compile as _Re
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split
def split_unicode_chrs( text ):
    return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
To successfully split a line of traditional Chinese characters, I just had to know the proper syntax to handle encoded characters... pretty basic (Python 2):
my_new_list = list(unicode(LINE[0].decode('utf8')));
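For the counting step described in the question, here is a minimal Python 3 sketch (collections.Counter stands in for the parallel-lists idea; the file name is the question's own):

from collections import Counter

counts = Counter()
with open('Chinese_example.txt', encoding='utf-8') as wordfile:
    for line in wordfile:
        # iterating over a str yields individual characters in Python 3,
        # so no special splitting is needed
        for char in line.strip():
            counts[char] += 1

for char, count in counts.most_common(20):
    print(char, count)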

Extract text from corrupt (?) pdf document

In a project I'm working on we scrape legal documents from various government sites and then make them searchable online.
Every now and then we encounter a PDF that seems to be corrupt. Here's an example of one.
If you open it in a PDF reader, it looks fine, but:
If you try to copy and paste it, you get corrupted text
If you run it through any tools like pdftotext, you get corrupted text
If you do just about anything else to it -- you guessed it -- you get corrupted text
Yet, if you open it in a reader, it looks fine! So I know the text is there, but something is wrong, wrong wrong! The result is that on my site it looks really bad.
Is there anything I can do?
Update: I did more research today. Thanks to @Andrew Cash's observation that this is essentially a Caesar cipher, I realized I could search for the documents. This link will show you about 200 of these in my system. Looking through the larger sample set, it looks like these are all created by the same software, pdfFactory v. 3.51! So I blame a bug, not deliberate obfuscation.
Update 2: The link above won't provide any results anymore. These are purged from my system using my solution below.
The PDF is using subsetted fonts where the characters are remapped to other characters, in the same way as a simple World War II-era substitution cipher.
A = G,
B = 1,
C = #,
D = W,
...
... and so on. Every character is remapped.
The font is mapped this way, and in order to get the correct characters displayed in the PDF you need to send "G1#W" in for it to print out ABCD. Normally PDFs will have a ToUnicode table to help with text extraction, but this table has been left out, on purpose I suspect.
I have seen a few of these documents myself where they are deliberately obfuscated to prevent text extraction. I have seen a document with about 5 different fonts and they were all mapped using a different sequence.
One sure way to tell if this is the problem is to load the PDF into Acrobat and copy/paste the text into a text editor. If Acrobat cannot decode the text back to English, then there is no way to extract the text other than remapping it manually, provided you know the translation mappings.
The only way to extract text easily from these types of documents is to OCR the full document and discard the original text. You would convert each page to a TIFF image and then OCR it, so the original garbled text doesn't affect the OCR; a rough sketch of that route follows.
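A rough sketch of the OCR route (entirely my own suggestion, not from the answer; it assumes a reasonably recent PyMuPDF plus pytesseract with Tesseract installed, and the input file name is hypothetical):

import fitz                      # PyMuPDF
import pytesseract
from PIL import Image

pdf = fitz.open("garbled.pdf")   # hypothetical input file
pages_text = []
for page in pdf:
    # render the page to an image at roughly 300 dpi and OCR it,
    # ignoring the garbled embedded text entirely
    pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
    pix.save("page.png")
    pages_text.append(pytesseract.image_to_string(Image.open("page.png")))
pdf.close()
print("\n".join(pages_text))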
Weary of this issue and not wanting to deal with OCR, I manually sorted out the cipher. Here she be, as a python dict along with some rudimentary code that I was using to test it. I'm sure this could be improved, but it does work for all letters except uppercase Q and uppercase X, which I haven't yet been able to find.
It's missing a fair bit of punctuation too, at least for now (all of these are missing, for example: <>?{}\|!~`@#$%^_=+).
# -*- coding: utf-8 -*-
import re
import sys
letter_map = {
u'¿':'a',
u'regex':'b',
u'regex':'c',
u'regex':'d',
u'»':'e',
u'o':'f',
u'1':'g',
u'regex':'h',
u'·':'i',
u'¶':'j',
u'μ':'k',
u'regex':'l',
u'3':'m',
u'2':'n',
u'±':'o',
u'°':'p',
u'regex':'q',
u'®':'r',
u'-':'s',
u'¬':'t',
u'«':'u',
u'a':'v',
u'©':'w',
u'regex':'x',
u'§':'y',
u'¦':'z',
u'ß':'A',
u'Þ':'B',
u'Ý':'C',
u'Ü':'D',
u'Û':'E',
u'Ú':'F',
u'Ù':'G',
u'Ø':'H',
u'×':'I',
u'Ö':'J',
u'Õ':'K',
u'Ô':'L',
u'Ó':'M',
u'Ò':'N',
u'Ñ':'O',
u'Ð':'P',
u'':'Q', # Missing
u'Î':'R',
u'Í':'S',
u'Ì':'T',
u'Ë':'U',
u'Ê':'V',
u'É':'W',
u'':'X', # Missing
u'Ç':'Y',
u'Æ':'Z',
u'ð':'0',
u'ï':'1',
u'î':'2',
u'í':'3',
u'ì':'4',
u'ë':'5',
u'ê':'6',
u'é':'7',
u'è':'8',
u'ç':'9',
u'ò':'.',
u'ô':',',
u'æ':':',
u'å':';',
u'Ž':"'",
u'•':"'",
u'•':"'", # s/b double quote, but identical to single.
u'Œ':"'", # s/b double quote, but identical to single.
u'ó':'-', # dash
u'Š':'-', # n-dash
u'‰':'--', # em-dash
u'ú':'&',
u'ö':'*',
u'ñ':'/',
u'÷':')',
u'ø':'(',
u'Å':'[',
u'Ã':']',
u'‹':'•',
}
ciphertext = u'''YOUR STUFF HERE'''
plaintext = ''
for letter in ciphertext:
    try:
        plaintext += letter_map[letter]
    except KeyError:
        plaintext += letter
# These are multi-length replacements
plaintext = re.sub(u'm⁄4', 'b', plaintext)
plaintext = re.sub(u'g⁄n', 'c', plaintext)
plaintext = re.sub(u'g⁄4', 'd', plaintext)
plaintext = re.sub(u' ́', 'l', plaintext)
plaintext = re.sub(u' ̧', 'h', plaintext)
plaintext = re.sub(u' ̈', 'x', plaintext)
plaintext = re.sub(u' ̄u', 'qu', plaintext)
for letter in plaintext:
    try:
        sys.stdout.write(letter)
    except UnicodeEncodeError:
        continue