I want to read and print the data shown in the picture using the code below, but I am running into trouble with this program. How can I fix this code? - python-re

import re

file_path = 'D:/Speech/data/test2.txt'
# Match any line wrapped in square brackets, including its trailing newline.
useful_regex = re.compile(r'\[.+\]\n', re.IGNORECASE)

with open(file_path) as f:
    file_content = f.read()

info_lines = useful_regex.findall(file_content)
print(len(info_lines))
for l in info_lines[1:10]:
    print(l.strip().split('\t'))
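Without the actual error message it is hard to say what fails, but a common pitfall with this pattern is that \[.+\]\n never matches the file's last line when that line lacks a trailing newline. Below is a hedged sketch using line anchors instead; the path comes from the question, and the utf-8 encoding is an assumption:

import re

file_path = 'D:/Speech/data/test2.txt'  # path taken from the question

# Anchor on whole lines instead of requiring a trailing '\n', so the last
# line of the file matches even when there is no final newline.
useful_regex = re.compile(r'^\[.+\]$', re.IGNORECASE | re.MULTILINE)

with open(file_path, encoding='utf-8') as f:  # utf-8 is an assumption
    file_content = f.read()

info_lines = useful_regex.findall(file_content)
print(len(info_lines))
for l in info_lines[:10]:
    print(l.strip().split('\t'))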

Related

Using string output from pytesseract to do a vlookup in pandas dataframe

I'm very new to Python, and I'm trying to make a simple image-to-song-title-to-BPM program. My approach is to use pytesseract to generate a string output and then use that string to do a vlookup in a dataframe created by pandas. However, it always returns zero even though the song does exist in the data.
from PIL import ImageGrab
import numpy as np
import pytesseract
import pandas as pd

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

def getTitleImage(left, top, width, height):
    # Grab the screen region that contains the song title as an RGB array.
    printscreen_pil = ImageGrab.grab((left, top, left + width, top + height))
    printscreen_numpy = np.array(printscreen_pil.getdata(), dtype='uint8') \
        .reshape((printscreen_pil.size[1], printscreen_pil.size[0], 3))
    return printscreen_numpy

# Printscreen (x, y, w, h are the capture coordinates, defined elsewhere):
titleImage = getTitleImage(x, y, w, h)

# pytesseract to string:
songTitle = pytesseract.image_to_string(titleImage)
print('Name of the song: ', songTitle)

# Importing the csv data via pandas.
songTable = pd.read_csv(r'C:\Users\leech\Desktop\songList.csv')

# A simple vlookup formula that returns the BPM of the song by taking data from the same row.
bpmSong = songTable[songTable['Song Title'] == songTitle]['BPM'].sum()
print('The BPM of the song is: ', bpmSong)
Output:
Name of the song: Macarena
The BPM of the song is: 0
However, when I tried to forcefully provide the string to the songTitle variable, it works:
songTitle = 'Macarena'
print('Name of the song: ', songTitle)
songTable = pd.read_csv(r'C:\Users\leech\Desktop\songList.csv')
bpmSong = songTable[songTable['Song Title'] == songTitle]['BPM'].sum()
print('The BPM of the song is: ', bpmSong)
Output:
Name of the song: Macarena
The BPM of the song is: 103
I have checked the string generated by pytesseract: it has no extra space at the front or the back and looks identical to the forced string, yet they still produce different results. What could be the problem?
I found the answer.
It is because the songTitle coming from:
songTitle = pytesseract.image_to_string(titleImage)
...is actually 'Macarena\n' instead of 'Macarena'.
They might look the same when printed, except the former adds a new line after it.
A great lesson learned for me.
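In other words, stripping the OCR output right away fixes the lookup. A minimal sketch of just that change, reusing the names from the question:

# titleImage is the screenshot array returned by getTitleImage() above.
# .strip() removes the trailing '\n' that pytesseract appends to its output.
songTitle = pytesseract.image_to_string(titleImage).strip()

bpmSong = songTable[songTable['Song Title'] == songTitle]['BPM'].sum()
print('The BPM of the song is: ', bpmSong)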

How to open Russian-language PDFs for NLTK processing

I'm trying to extract text from a pdf file in Russian, and use this text as data for tokenisation, lemmatisation etc. with NLTK on Jupyter Notebook. I'm using PyPDF2, but I keep running into problems.
I am creating a function and passing to it the pdf as the input:
from PyPDF2 import PdfFileReader

def getTextPDF(pdfFileName):
    # Extract the text of every page and join the pages with newlines.
    pdf_file = open(pdfFileName, "rb")
    read_pdf = PdfFileReader(pdf_file)
    text = []
    for i in range(0, read_pdf.getNumPages()):
        text.append(read_pdf.getPage(i).extractText())
    return "\n".join(text)
Then I call the function:
pdfFile = "sample_russian.pdf"
print("PDF: \n", myreader_pdf.getTextPDF(pdfFile))
But I get a long pink list of the same error warning:
PdfReadWarning: Superfluous whitespace found in object header b'1' b'0' [pdf.py:.....]
Any ideas would be very helpful! Thanks in advance!
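I cannot verify this against your exact file, but PdfReadWarning is usually non-fatal, and PdfFileReader accepts a strict flag: strict=False tells PyPDF2 to tolerate minor syntax problems such as the superfluous whitespace it is complaining about. A sketch of the same function with that flag:

from PyPDF2 import PdfFileReader

def getTextPDF(pdfFileName):
    # strict=False relaxes PyPDF2's parser so cosmetic problems in the
    # PDF's object headers are tolerated instead of warned about.
    with open(pdfFileName, "rb") as pdf_file:
        read_pdf = PdfFileReader(pdf_file, strict=False)
        text = [read_pdf.getPage(i).extractText()
                for i in range(read_pdf.getNumPages())]
    return "\n".join(text)

print("PDF: \n", getTextPDF("sample_russian.pdf"))

If the extracted Cyrillic text then comes out garbled, that is a separate font-encoding issue, and trying a different extractor (e.g. pdfminer.six) would be the next thing to test.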

How to calculate tf-idf when working on .txt files in python 3.7?

I have books in PDF and I want to do NLP tasks such as preprocessing, tf-idf calculation, word2vec, etc. on those books. So I converted them into .txt files and tried to get tf-idf scores. I had previously computed tf-idf on a CSV file, so I made some changes to that code and tried to use it for the .txt file, but my attempt was unsuccessful.
Below is my code:
import pandas as pd
import numpy as np
from itertools import islice
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
data = open('jungle book.txt', 'r+')
# print(data.read())
cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1,2))
cvec.fit(data)
list(islice(cvec.vocabulary_.items(), 20))
len(cvec.vocabulary_)
cvec_count = cvec.transform(data)
print('Sparse Matrix Shape : ', cvec_count.shape)
print('Non Zero Count : ', cvec_count.nnz)
print('sparsity: %.2f%%' % (100 * cvec_count.nnz / (cvec_count.shape[0] * cvec_count.shape[1])))
occ = np.asarray(cvec_count.sum(axis=0)).ravel().tolist()
count_df = pd.DataFrame({'term': cvec.get_feature_names(), 'occurrences' : occ})
term_freq = count_df.sort_values(by='occurrences', ascending=False).head(20)
print(term_freq)
transformer = TfidfTransformer()
transformed_weights = transformer.fit_transform(cvec_count)
weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
weight_df = pd.DataFrame({'term' : cvec.get_feature_names(), 'weight' : weights})
tf_idf = weight_df.sort_values(by='weight', ascending=False).head(20)
print(tf_idf)
This code works up to print('Non Zero Count : ', cvec_count.nnz) and prints:
Sparse Matrix Shape : (0, 7132)
Non Zero Count : 0
Then it is giving error:
ZeroDivisionError: division by zero
Even if I run this code ignoring the ZeroDivisionError, it is still wrong, as it is not counting any frequencies.
I have no idea how to work with a .txt file here. What is the proper way to work on .txt files for NLP tasks?
Thanks in advance!
You are getting the error because the data variable holds an exhausted file iterator by the time transform runs: cvec.fit(data) reads the open file line by line and consumes it, so cvec.transform(data) sees no documents at all, which is exactly why the matrix has 0 rows. Just opening the text file is not enough. You have to read the contents first, and CountVectorizer expects an iterable of documents rather than a bare string (a string would be iterated character by character). Try replacing
data = open('jungle book.txt', 'r+')
# print(data.read())
with
with open('jungle book.txt', 'r') as file:
    data = [file.read()]  # a one-element list: the whole book as one document
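One caveat about that fix: the whole book then counts as a single document, so corpus-level settings such as max_df=.5 and the averaged tf-idf weights are degenerate. A sketch of one way around that, assuming blank-line-separated paragraphs are an acceptable document unit:

from sklearn.feature_extraction.text import CountVectorizer

with open('jungle book.txt', 'r') as file:
    # Treat each blank-line-separated paragraph as a separate document so
    # that min_df / max_df and the idf statistics have something to compare.
    docs = [p for p in file.read().split('\n\n') if p.strip()]

cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1, 2))
cvec_count = cvec.fit_transform(docs)
print('Sparse Matrix Shape : ', cvec_count.shape)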

Cannot replace spaCy lemmatized pronouns (-PRON-) through text

I'm trying to lemmatise a text with spaCy. Since spaCy uses -PRON- as lemma for personal pronouns, I want to keep the original text in all those cases.
Here's the relevant section of my code:
...
fout = open('test.txt', 'w+')
doc = nlp(text)
for word in doc:
    if word.lemma_ == "-PRON-":
        write = word.text
        print(write)
    else:
        write = word.lemma_
    fout.write(str(write))
    fout.write(" ")
...
The print statement does print the original words for the cases where spaCy attributes the lemma '-PRON-'.
However, my output file (test.txt) always contains '-PRON-' for those cases, even though I would expect it to write the original words for those cases (I, us etc.)
What am I missing?
I tried different versions, including using the pos_ tag to identify the pronouns, but always with the same result: my output contains '-PRON-'s.
Try this somewhat altered code snippet to see what you get...
import spacy

nlp = spacy.load('en_core_web_sm')
text = 'Did he write the code for her?'
doc = nlp(text)

# Keep the original token text wherever the lemma is '-PRON-'.
out_sent = [w.lemma_ if w.lemma_ != '-PRON-' else w.text for w in doc]
out_sent = ' '.join(out_sent)
print(out_sent)

with open('out_sent.txt', 'w') as f:
    f.write(out_sent + '\n')
This should produce...
do he write the code for her ?
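A side note, hedged because it depends on your spaCy version: the -PRON- placeholder was removed in spaCy v3, where pronouns simply lemmatise to themselves, so the string comparison above never fires there. Checking the part-of-speech tag instead (which the asker already experimented with) works on both versions; a sketch:

import spacy

nlp = spacy.load('en_core_web_sm')

def lemmatise_keeping_pronouns(text):
    # Keep the raw token text for pronouns, the lemma for everything else.
    # w.pos_ == 'PRON' is version-independent, unlike the '-PRON-' lemma.
    doc = nlp(text)
    return ' '.join(w.text if w.pos_ == 'PRON' else w.lemma_ for w in doc)

print(lemmatise_keeping_pronouns('Did he write the code for her?'))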

I want to parse multiple HTML documents with beautiful soup but I can't make it work

Is there a way to use Beautiful Soup to parse multiple HTML documents at the same time? I am modifying code I found online that extracts HTML .txt files from EDGAR with Beautiful Soup so they can be downloaded as formatted files. However, my code now only writes one EDGAR document (it is intended to produce five) and I don't know what's wrong with it.
import csv
import requests
import re
from bs4 import BeautifulSoup

with open('General Motors Co 11-15.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        fn1 = line[0]
        fn2 = re.sub(r'[/\\]', '', line[1])
        fn3 = re.sub(r'[/\\]', '', line[2])
        fn4 = line[3]
        # Reorganize to rename the output filename.
        saveas = '-'.join([fn1, fn2, fn3, fn4])
        url = 'https://www.sec.gov/Archives/' + line[4].strip()
        bodytext = requests.get(url).text
        parsedContent = BeautifulSoup(bodytext, 'html.parser')
        # Drop script and style tags before extracting the visible text.
        for script in parsedContent(["script", "style"]):
            script.extract()
        text = parsedContent.get_text()
        lines = (line.strip() for line in text.splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
        text = '\n'.join(chunk for chunk in chunks if chunk)
        with open(saveas, 'wb') as f:
            f.write(text.encode('utf-8'))  # save the cleaned filing text
        print(saveas, 'downloaded and wrote to text file')
Do you know what's wrong with my code?
I would guess that you're overwriting the existing document every time you write to the file. Try changing with open(saveas, 'wb') as f: to with open(saveas, 'ab') as f:.
Opening a file with 'wb' creates a new file with the same name as saveas, essentially clearing the existing document.
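If the CSV really does contain rows whose first four fields coincide, an alternative to appending is to keep one file per row by making each name unique. A sketch that suffixes the row index so names cannot collide (the CSV filename and column layout come from the question):

import csv

with open('General Motors Co 11-15.csv', newline='') as csvfile:
    for i, line in enumerate(csv.reader(csvfile, delimiter=',')):
        # Suffix the row index so two rows with identical name fields
        # still produce distinct output files.
        saveas = '-'.join([line[0], line[1], line[2], line[3]]) + '-%d.txt' % i
        print(saveas)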