Translation with the Google Translate API

I'm trying to write a program that takes the text of a file (a PDF, for example) and translates the extracted text with the Google API, but the API doesn't work with my code and I have no clue why.
I've already tried modifying my code, but nothing I've done works.
from tika import parser
# from googletrans import Translator
import os
from textblob import TextBlob

#os.remove("arifureta.txt")
#os.remove("arifureta-formater.txt")
#os.remove("arifureta-traduit.txt")

# Extract the raw text from the PDF with tika and strip some artifacts
raw = parser.from_file('/home/tom/Téléchargements/Arifureta_ From Commonplace to World_s Strongest Vol. 1.pdf')
text = raw['content']
text = text.replace('https://mp4directs.com','')
text = text.replace('\t','')
text = text.replace('\r','')

fichier = open("arifureta.txt", "a")
fichier.write(text)
fichier.close()

# Collapse runs of blank lines while reformatting
fic = open("arifureta-formater.txt", "a")
cpt = 0
with open("arifureta.txt") as f:
    for line in f:
        if len(line) == 1:
            cpt += 1
        else:
            cpt = 0
        if cpt < 2:
            fic.write(line)
fic.close()

# Translate the reformatted text in batches of lines
nbLigneTraité = 0
fic2 = open("arifureta-traduit.txt", "a")
compteur = 0
textPasTraduit = ''
with open("arifureta-formater.txt") as f:
    for line in f:
        if len(line) > 1:
            textPasTraduit += line
            compteur += 1
        if compteur % 1000 == 0:
            blob = TextBlob(textPasTraduit)
            try:
                fic2.write(str(blob.translate(from_lang='en', to='fr')))
                print(blob.translate(from_lang='en', to='fr'))
            except Exception as e:
                pass
        nbLigneTraité += 1
        print(nbLigneTraité)
        if len(line) == 1:
            fic2.write('\n')
fic2.close()
I expect to have the entire translation of the PDF's text in the result file, but instead the answer is 'broken link'. I think it is due to the quantity of text, but I haven't found a way to try any other method.
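If the request size is the culprit, one thing worth trying is sending the text in smaller chunks. This is only a rough sketch, assuming the commented-out googletrans package is installed and that its Translator.translate(text, src=..., dest=...) call is available in your version; the 4000-character chunk size is an arbitrary guess, not a documented limit:

from googletrans import Translator  # pip install googletrans

def translate_in_chunks(text, chunk_size=4000):
    translator = Translator()
    # split the text into pieces small enough for one request each
    pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    translated = []
    for piece in pieces:
        translated.append(translator.translate(piece, src='en', dest='fr').text)
    return ''.join(translated)

with open("arifureta-formater.txt") as f:
    result = translate_in_chunks(f.read())
with open("arifureta-traduit.txt", "w") as out:
    out.write(result)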

Related

How to use mapFieldType with gdal.VectorTranslate

I'm trying to export a PostgreSQL database into a .gpkg file, but some of my fields are lists, and ogr2ogr sends me this message:
Warning 1: The output driver does not natively support StringList type for field my_field_name. Misconversion can happen. -mapFieldType can be used to control field type conversion.
But, as noted in the documentation, -mapFieldType is not a -lco, and I can't find how to use it with the Python version of gdal.VectorTranslate.
Here is my config:
gdal_conn = gdal.OpenEx(f"PG:service={my_pgsql_service}", gdal.OF_VECTOR)
gdal.VectorTranslate("path_to_my_file.gpkg", gdal_conn,
    SQLStatement=my_sql_query,
    layerName=my_layer_name,
    format="GPKG",
    accessMode='append',
)
So I tried to add it in the -lco:
layerCreationOptions=["-mapFieldType StringList=String"]
but it didn't work.
So I dug into the GDAL code, added a mapFieldType=None parameter to the VectorTranslateOptions function, and added the following lines to its code:
if mapFieldType is not None:
    mapField_str = ''
    i = 0
    for k, v in mapFieldType.items():
        i += 1
        mapField_str += f"{k}={v}" if i == len(mapFieldType) else f"{k}={v},"
    new_options += ['-mapFieldType', mapField_str]
And it worked, but is there another way? And if not, where can I propose this feature?
Thank you for your help.
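For what it's worth, recent GDAL Python bindings let you pass raw ogr2ogr-style flags through the options argument of gdal.VectorTranslateOptions, which may avoid patching GDAL at all. This is only a sketch under that assumption (the placeholder names mirror the config above), not a tested setup:

from osgeo import gdal

gdal_conn = gdal.OpenEx(f"PG:service={my_pgsql_service}", gdal.OF_VECTOR)
translate_opts = gdal.VectorTranslateOptions(
    format="GPKG",
    accessMode="append",
    SQLStatement=my_sql_query,
    layerName=my_layer_name,
    options=["-mapFieldType", "StringList=String"],  # raw CLI-style flags, assumed to be forwarded as-is
)
gdal.VectorTranslate("path_to_my_file.gpkg", gdal_conn, options=translate_opts)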

How to convert PDF with images which I don't care about to text?

I'm trying to convert PDF files to text files. The problem is that those PDFs contain images, which I don't care about (this is the type of file I want to extract: https://www.sia.aviation-civile.gouv.fr/pub/media/store/documents/file/l/f/lf_sup_2020_213_fr.pdf). Note that if I do copy/paste with my mouse, it works quite well (except for the line breaks), so I'd guess that it's possible. Most of the answers I found online work pretty well on dummy PDFs with text only, but give especially bad results on the map.
For instance, something like this
from tika import parser # pip install tika
raw = parser.from_file('test2.pdf')
print(raw['content'])
works well for retrieving the text, but I get a lot of trash like this:
ERY
CTR
3
CH
A
which appear because of the map.
Something like this, which works by converting the PDF to images and then reading the images, faces the same problem (I found it in a very similar thread on Stack Overflow, but there is no answer):
import pytesseract as pt
from PIL import Image
from pdf2image import convert_from_path  # pip install pdf2image
import sys

def convert(name):
    pages = convert_from_path(name, dpi=200)
    for idx, page in enumerate(pages):
        page.save('page' + str(idx) + '.jpg', 'JPEG')
        quote = Image.open('page' + str(idx) + '.jpg')
        text = pt.image_to_string(quote, lang="fra")
        file_ex = open('page' + str(idx) + '.text', "w")
        file_ex.write(text)
        file_ex.close()

if __name__ == '__main__':
    convert(sys.argv[1])
Finally, I tried to remove the images first and then use one of the solutions above, but it didn't work any better:
from tika import parser # pip install tika
from PyPDF2 import PdfFileWriter, PdfFileReader
# Remove the images
inputStream = open("lf_sup_2020_213_fr.pdf", "rb")
outputStream = open("test3.pdf", "wb")
src = PdfFileReader(inputStream)
output = PdfFileWriter()
[output.addPage(src.getPage(i)) for i in range(src.getNumPages())]
output.removeImages()
output.write(outputStream)
outputStream.close()
# Read from pdf without images
raw = parser.from_file('test2.pdf')
print(raw['content'])
Do you know how to solve this? It can be in any language.
Thanks
One approach you could try is to use a toolkit capable of parsing the text characters in the PDF, then use the object properties to try to remove the unwanted map labels while keeping the text characters you need.
For example, the ParsePages method from the LEADTOOLS PDF toolkit (which is what I am familiar with, since I work for the vendor of this toolkit) can be used to obtain the text from the PDF:
using (PDFDocument document = new PDFDocument(pdfFileName))
{
   PDFParsePagesOptions options = PDFParsePagesOptions.All;
   document.ParsePages(options, 1, -1);
   using (StreamWriter writer = File.CreateText(txtFileName))
   {
      IList<PDFObject> objects = document.Pages[0].Objects;
      writer.WriteLine("Objects: {0}", objects.Count);
      foreach (PDFObject obj in objects)
      {
         if (obj.TextProperties.IsEndOfLine)
            writer.WriteLine(obj.Code);
         else
            writer.Write(obj.Code);
      }
      writer.WriteLine("---------------------");
   }
}
This will obtain all the text in the PDF for the first page, with the unwanted results as you mentioned. Here is an excerpt below:
Objects: 3918
5
91L
F5
4
1 LF
N
OY
L2
1AM
TService
8
26
1de l’Information
0
B09SUP AIP 213/20
7
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite#aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
141
17˚
82
N20
9Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
More code can be used to examine the properties for each parsed character:
writer.WriteLine(" ObjectType: {0}", obj.ObjectType.ToString());
writer.WriteLine(" Bounds: {0}, {1}, {2}, {3}", obj.Bounds.Left, obj.Bounds.Top, obj.Bounds.Right, obj.Bounds.Bottom);
writer.WriteLine(" TextProperties.FontHeight: {0}", obj.TextProperties.FontHeight.ToString());
writer.WriteLine(" TextProperties.FontIndex: {0}", obj.TextProperties.FontIndex.ToString());
writer.WriteLine(" Code: {0}", obj.Code);
writer.WriteLine("------");
This will give the properties for each character:
Objects: 3918
ObjectType: Text
Bounds: -60.952693939209, 1017.25231933594, -51.8431816101074, 1023.71826171875
TextProperties.FontHeight: 7.10454273223877
TextProperties.FontIndex: 48
Code: 5
------
Using these properties, the unwanted text can be filtered out. For example, I noticed that the FontHeight for a good portion of the unwanted text is around 7 PDF units, so the first code sample might be altered to avoid extracting any text smaller than 7.25 PDF units:
foreach (PDFObject obj in objects)
{
   if (obj.TextProperties.FontHeight > 7.25)
   {
      if (obj.TextProperties.IsEndOfLine)
         writer.WriteLine(obj.Code);
      else
         writer.Write(obj.Code);
   }
}
The extracted output would give a better result, an excerpt follows:
Objects: 3918
Service
de l’Information
SUP AIP 213/20
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite#aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
Lieu : FIR : Marseille LFMM - AD : Chambéry Aix-Les-Bains LFLB, Chambéry Challes les Eaux LFLE
ZRT LE SIRE, MOTTE CASTRALE, ALLEVARD
*
C
D
E
In the end, you will have to come up with good criteria to filter out the unwanted text without removing the text you need to keep, using this approach.
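The same font-size filtering idea can be tried with a free Python library as well. Below is only a rough sketch using PyMuPDF (fitz); the 7.25 threshold is copied from the reasoning above and would need tuning per file, and span["size"] is the font size PyMuPDF reports for each run of text:

import fitz  # pip install pymupdf

def extract_text_above_size(path, min_size=7.25):
    doc = fitz.open(path)
    kept = []
    for page in doc:
        # "dict" output groups a page into blocks / lines / spans with font info
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):  # image blocks have no "lines"
                for span in line["spans"]:
                    if span["size"] > min_size:  # drop the small map labels
                        kept.append(span["text"])
    return " ".join(kept)

print(extract_text_above_size("lf_sup_2020_213_fr.pdf"))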

SpaCy: How to manually set POS tag for vertical bar "|"?

When text is tagged by spaCy, the vertical bar "|" is assigned different POS tags depending on the context, such as "ADV", "DEL"..., while I want "|" to be recognized as "PUNCT". Is there a way to force this POS for "|"?
I tried this command and it didn't work.
nlp.tokenizer.add_special_case('|', [{ORTH: '|', POS: PUNC}])
I would add a simple pipe into the pipeline, right after the tagger:
def pos_postprocessor_pipe(doc):
    for token in doc:
        if token.text == '|':
            token.pos_ = 'PUNCT'
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(pos_postprocessor_pipe, name="pos_postprocessor", after='tagger')
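A quick way to check the custom component (this is just an illustrative snippet, sticking to the spaCy 2.x add_pipe style used above):

doc = nlp("red | green | blue")
for token in doc:
    print(token.text, token.pos_)  # '|' should now come out as PUNCT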
It looks like you're using the method from this question to overwrite the tags, but that seems to be for an old version of spaCy; it doesn't work in 2.3.1.
You can set values on tokens, so you can do something like this:
import spacy
from spacy.symbols import NOUN

nlp = spacy.load('en')
words = nlp("...")
for word in words:
    if word.text == "|":
        word.pos = NOUN
However, it's likely that it's easier to just add an exception for the pipe wherever it's causing you issues.
Adding an exception for the pipe would look like this.
for word in nlp(...):
    pos = word.pos_
    if word.text == "|":
        pos = "PUNCT"
    do_stuff(word, pos)

Combine two TTS outputs in a single mp3 file not working

I want to combine two requests to the Google Cloud Text-to-Speech API in a single MP3 output. The reason I need to combine two requests is that the output should contain two different languages.
The code below works fine for many language pair combinations, but unfortunately not for all. If I request e.g. a sentence in English and one in German and combine them, everything works. If I request one in English and one in Japanese, I can't combine the two files in a single output: the output only contains the first sentence, and instead of the second sentence it outputs silence.
I have tried multiple ways to combine the two outputs, but the result stays the same. The code below should show the issue.
Please run the code first with:
python synthesize_bug.py --t1 'Hallo' --code1 de-De --t2 'August' --code2 de-De
This works perfectly.
python synthesize_bug.py --t1 'Hallo' --code1 de-De --t2 'こんにちは' --code2 ja-JP
This doesn't work. The single files are OK, but the combined file contains silence instead of the Japanese part.
Also, if used with two Japanese sentences, everything works.
I already filed a bug report at Google with no response yet, but maybe it's just me who is doing something wrong here with encoding assumptions. Hope someone has an idea.
#!/usr/bin/env python
import argparse

# [START tts_synthesize_text_file]
def synthesize_text_file(text1, text2, code1, code2):
    """Synthesizes speech from the input file of text."""
    from apiclient.discovery import build
    import base64

    service = build('texttospeech', 'v1beta1')
    collection = service.text()

    # First request: a 2-second pause used between the two sentences
    data1 = {}
    data1['input'] = {}
    data1['input']['ssml'] = '<speak><break time="2s"/></speak>'
    data1['voice'] = {}
    data1['voice']['ssmlGender'] = 'FEMALE'
    data1['voice']['languageCode'] = code1
    data1['audioConfig'] = {}
    data1['audioConfig']['speakingRate'] = 0.8
    data1['audioConfig']['audioEncoding'] = 'MP3'
    request = collection.synthesize(body=data1)
    response = request.execute()
    audio_pause = base64.b64decode(response['audioContent'].decode('UTF-8'))
    raw_pause = response['audioContent']

    # Second request: the first sentence, in the first language
    ssmlLine = '<speak>' + text1 + '</speak>'
    data1 = {}
    data1['input'] = {}
    data1['input']['ssml'] = ssmlLine
    data1['voice'] = {}
    data1['voice']['ssmlGender'] = 'FEMALE'
    data1['voice']['languageCode'] = code1
    data1['audioConfig'] = {}
    data1['audioConfig']['speakingRate'] = 0.8
    data1['audioConfig']['audioEncoding'] = 'MP3'
    request = collection.synthesize(body=data1)
    response = request.execute()

    # The response's audio_content is binary.
    with open('output1.mp3', 'wb') as out:
        out.write(base64.b64decode(response['audioContent'].decode('UTF-8')))
        print('Audio content written to file "output1.mp3"')
    audio_text1 = base64.b64decode(response['audioContent'].decode('UTF-8'))
    raw_text1 = response['audioContent']

    # Third request: the second sentence, in the second language
    ssmlLine = '<speak>' + text2 + '</speak>'
    data2 = {}
    data2['input'] = {}
    data2['input']['ssml'] = ssmlLine
    data2['voice'] = {}
    data2['voice']['ssmlGender'] = 'MALE'
    data2['voice']['languageCode'] = code2  # 'ko-KR'
    data2['audioConfig'] = {}
    data2['audioConfig']['speakingRate'] = 0.8
    data2['audioConfig']['audioEncoding'] = 'MP3'
    request = collection.synthesize(body=data2)
    response = request.execute()

    # The response's audio_content is binary.
    with open('output2.mp3', 'wb') as out:
        out.write(base64.b64decode(response['audioContent'].decode('UTF-8')))
        print('Audio content written to file "output2.mp3"')
    audio_text2 = base64.b64decode(response['audioContent'].decode('UTF-8'))
    raw_text2 = response['audioContent']

    # Concatenate the decoded MP3 payloads
    result = audio_text1 + audio_pause + audio_text2
    with open('result.mp3', 'wb') as out:
        out.write(result)
        print('Audio content written to file "result.mp3"')

    # Concatenate the base64 payloads, then decode once
    raw_result = raw_text1 + raw_pause + raw_text2
    with open('raw_result.mp3', 'wb') as out:
        out.write(base64.b64decode(raw_result.decode('UTF-8')))
        print('Audio content written to file "raw_result.mp3"')
# [END tts_synthesize_text_file]

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument('--t1')
    parser.add_argument('--code1')
    parser.add_argument('--t2')
    parser.add_argument('--code2')
    args = parser.parse_args()

    synthesize_text_file(args.t1, args.t2, args.code1, args.code2)
You can find the answer here:
https://issuetracker.google.com/issues/120687867
Short answer: it's not clear why it is not working, but Google suggests a workaround: first write the files as .wav, combine them, and then re-encode the result to MP3.
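For reference, here is a rough sketch of that workaround in Python using pydub. It assumes the two requests are re-run with audioEncoding set to 'LINEAR16' and saved as output1.wav / output2.wav (these filenames are just placeholders), and that ffmpeg is installed for the MP3 export:

from pydub import AudioSegment  # pip install pydub

part1 = AudioSegment.from_wav("output1.wav")   # first sentence, e.g. German
pause = AudioSegment.silent(duration=2000)     # 2-second gap between the sentences
part2 = AudioSegment.from_wav("output2.wav")   # second sentence, e.g. Japanese

# Concatenate the raw audio, then re-encode the combined result once as MP3
(part1 + pause + part2).export("result.mp3", format="mp3")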
I have managed to do this in Node.js with just one function (I don't know how optimal it is, but at least it works). Maybe you can take inspiration from it.
I used the memory-streams dependency from npm:
var streams = require('memory-streams');

function mergeAudios(audios) {
    var reader = new streams.ReadableStream();
    var writer = new streams.WritableStream();
    audios.forEach(element => {
        if (element instanceof streams.ReadableStream) {
            element.pipe(writer)
        } else {
            writer.write(element)
        }
    });
    reader.append(writer.toBuffer())
    return reader
}
The input parameter is a list containing ReadableStream objects or response.audioContent values from the synthesizeSpeech operation. If an item is a ReadableStream, it uses the pipe operation; if it is audio content, it uses the write method. At the end, all content is appended to a single ReadableStream, which is returned.

How to extract Highlighted Parts from PDF files

Is there any way to extract highlighted text from a PDF file programmatically? Any language is welcome. I have found several libraries with Python, Java, and also PHP but none of them do the job.
To extract highlighted parts, you can use PyMuPDF. Here is an example which works with this pdf file:
Direct download
# Based on https://stackoverflow.com/a/62859169/562769
from typing import List, Tuple

import fitz  # install with 'pip install pymupdf'


def _parse_highlight(annot: fitz.Annot, wordlist: List[Tuple[float, float, float, float, str, int, int, int]]) -> str:
    points = annot.vertices
    quad_count = int(len(points) / 4)
    sentences = []
    for i in range(quad_count):
        # where the highlighted part is
        r = fitz.Quad(points[i * 4 : i * 4 + 4]).rect
        words = [w for w in wordlist if fitz.Rect(w[:4]).intersects(r)]
        sentences.append(" ".join(w[4] for w in words))
    sentence = " ".join(sentences)
    return sentence


def handle_page(page):
    wordlist = page.get_text("words")  # list of words on page
    wordlist.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x
    highlights = []
    annot = page.first_annot
    while annot:
        if annot.type[0] == 8:  # 8 == highlight annotation
            highlights.append(_parse_highlight(annot, wordlist))
        annot = annot.next
    return highlights


def main(filepath: str) -> List:
    doc = fitz.open(filepath)
    highlights = []
    for page in doc:
        highlights += handle_page(page)
    return highlights


if __name__ == "__main__":
    print(main("PDF-export-example-with-notes.pdf"))
OK, after looking around, I found a solution for exporting highlighted text from a PDF to a text file. It's not very hard:
First, highlight your text with the tool you like to use (in my case, I highlight while I'm reading on an iPad using the GoodReader app).
Transfer your PDF to a computer and open it using Skim (a PDF reader, free and easy to find on the web).
In FILE, choose CONVERT NOTES and convert all the notes of your document to SKIM NOTES.
That's all: simply go to EXPORT and choose EXPORT SKIM NOTES. It will export a list of your highlighted text. Once opened, this list can be exported again to a txt file.
Not much work to do, and the result is fantastic.