I have something like this that is converting fonts
i=1
while ( i<$argc )
Open($argv[i])
# edit meta somehow
Generate($argv[i]:r + type)
i = i+1
endloop
that prints this metadata
Created by FontForge 20141024 at Wed Nov 12 16:59:42 2014
By Jimmy Wärting
that i would like to remove or change
You can use built-in function SetFontNames.
It has such signature:
SetFontNames(fontname[,family[,fullname[,weight[,copyright-notice[,fontversion]]]]])
i=1
while ( i<$argc )
Open($argv[i])
#edit meta
SetFontNames('fontName', 'fontFamilyName', 'fullName', 'weight',
'copyright', 'version')
Generate($argv[i]:r + type)
i = i+1
endloop
If some parameters are not necessary, just write empty string:
SetFontNames('', '', '', '', 'copyright', 'version')
For more details, please, see https://fontforge.github.io/scripting-alpha.html
Related
I'm trying to convert pdf to text files. The problem is that those pdf contain images, which I don't care about (this is the type of file I want to extract (https://www.sia.aviation-civile.gouv.fr/pub/media/store/documents/file/l/f/lf_sup_2020_213_fr.pdf). Note that if I do copy/paste with my mouse, it work quite well (except the line break), so I'd guess that it's possible. Most of the answer I found online work pretty well on dummy pdf with text only, but give especially bad result on the map.
For instance, something like this
from tika import parser # pip install tika
raw = parser.from_file('test2.pdf')
print(raw['content'])
works well for retrieving the text, but I have a lot of trash like this :
ERY
CTR
3
CH
A
which appear because of the map.
Something like this, which work by converting the pdf to images and then reading the images, face the same problem (I found it on a very similar thread on stackoverflow, but there is no answer) :
import pytesseract as pt
from PIL import Image
import sys
def convert(name):
pages = convert_from_path(name, dpi=200)
for idx,page in enumerate(pages):
page.save('page'+str(idx)+'.jpg', 'JPEG')
quote = Image.open('page'+str(idx)+'.jpg')
text = pt.image_to_string(quote, lang="fra")
file_ex = open('page'+str(idx)+'.text',"w")
file_ex.write(text)
file_ex.close()
if __name__ == '__main__':
convert(sys.argv[1])
Finally, I tried to remove the image first, and then using one of the solutions above, but it didn't work better :
from tika import parser # pip install tika
from PyPDF2 import PdfFileWriter, PdfFileReader
# Remove the images
inputStream = open("lf_sup_2020_213_fr.pdf", "rb")
outputStream = open("test3.pdf", "wb")
src = PdfFileReader(inputStream)
output = PdfFileWriter()
[output.addPage(src.getPage(i)) for i in range(src.getNumPages())]
output.removeImages()
output.write(outputStream)
outputStream.close()
# Read from pdf without images
raw = parser.from_file('test2.pdf')
print(raw['content'])
Do you know how to solve this ? It can be in any language.
Thanks
One approach you could try is to use a toolkit capable of parsing the text characters in the PDF then use the object properties to try and remove the unwanted map labels while keeping the text characters required.
For example, the ParsePages method from LEADTOOLS PDF toolkit (which is what I am familiar with since I work for the vendor of this toolkit) can be used to obtain the text from the PDF:
using (PDFDocument document = new PDFDocument(pdfFileName))
{
PDFParsePagesOptions options = PDFParsePagesOptions.All;
document.ParsePages(options, 1, -1);
using (StreamWriter writer = File.CreateText(txtFileName))
{
IList<PDFObject> objects = document.Pages[0].Objects;
writer.WriteLine("Objects: {0}", objects.Count);
foreach (PDFObject obj in objects)
{
if (obj.TextProperties.IsEndOfLine)
writer.WriteLine(obj.Code);
else
writer.Write(obj.Code);
}
writer.WriteLine("---------------------");
}
}
This will obtain all the text in the PDF for the first page, with the unwanted results as you mentioned. Here is an excerpt below:
Objects: 3918
5
91L
F5
4
1 LF
N
OY
L2
1AM
TService
8
26
1de l’Information
0
B09SUP AIP 213/20
7
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite#aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
141
17˚
82
N20
9Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
More code can be used to examine the properties for each parsed character:
writer.WriteLine(" ObjectType: {0}", obj.ObjectType.ToString());
writer.WriteLine(" Bounds: {0}, {1}, {2}, {3}", obj.Bounds.Left, obj.Bounds.Top, obj.Bounds.Right, obj.Bounds.Bottom);
writer.WriteLine(" TextProperties.FontHeight: {0}", obj.TextProperties.FontHeight.ToString());
writer.WriteLine(" TextProperties.FontIndex: {0}", obj.TextProperties.FontIndex.ToString());
writer.WriteLine(" Code: {0}", obj.Code);
writer.WriteLine("------");
This will give the properties for each character:
Objects: 3918
ObjectType: Text
Bounds: -60.952693939209, 1017.25231933594, -51.8431816101074, 1023.71826171875
TextProperties.FontHeight: 7.10454273223877
TextProperties.FontIndex: 48
Code: 5
------
Using these properties, the unwanted text might be filtered using their properties. For example, I noticed that the FontHeight for a good portion of the unwanted text is around 7 PDF units, so the first code might be altered to avoid extracting any text smaller than 7.25 PDF units:
foreach (PDFObject obj in objects)
{
if (obj.TextProperties.FontHeight > 7.25)
{
if (obj.TextProperties.IsEndOfLine)
writer.WriteLine(obj.Code);
else
writer.Write(obj.Code);
}
}
The extracted output would give a better result, an excerpt follows:
Objects: 3918
Service
de l’Information
SUP AIP 213/20
Aéronautique
Date de publication : 05 NOV
e-mail : sia.qualite#aviation-civile.gouv.fr
Internet : www.sia.aviation-civile.gouv.fr
Objet : Création de 4 zones réglementées temporaires (ZRT) pour l’exercice VOLOPS en région de Chambéry
En vigueur : Du mercredi 25 Novembre 2020 au vendredi 04 décembre 2020
Lieu : FIR : Marseille LFMM - AD : Chambéry Aix-Les-Bains LFLB, Chambéry Challes les Eaux LFLE
ZRT LE SIRE, MOTTE CASTRALE, ALLEVARD
*
C
D
E
In the end, you will have to try and come up with a good criteria to filter out the unwanted text without removing the text you need to keep, using this approach.
In my TACL, I'm trying to create a variable used as input to an SQLCI command. I want to use a LIKE clause with a % as a wildcard. Every time it replaces the % with a ? causing the SQL statement to not return the desired results.
Code snippitz:
?TACL macro
#Frame
#Set #informat plain
#Set #outformat pretty
#Push stoprun DC fidata var1 mailmast sqlvol sqlsvol IsOpen EMLFile ans emlline
#Push mailfile mfile likeit charaddr sqlin sqlout test
[#Def True text |body|-1]
[#Def False text |body|0]
Intervening code cut out to reduce length - the cutout code works
== select <program name> from <full mailmast filename> where mm_file_prefix
== like "<likename>%" for browse access;
[#If Not [StopRun]
|then|
#Setv test "~%"
#Set #trace -1
#Set SqlIn select mm_program_name from [mailmast] where mm_file_prefix
#appendv sqlin "like ""[LikeIt]~%"" for browse access;"
SQLCI/Inv Sqlin,outv sqlout/
] == end of if
When I run the code, I display the variables, and it has replaced the % with ?
-TRACE-
-19-st 1 v
Invoking variable :MAILMAST.1
#Set SqlIn select mm_program_name from $DATA5.SQL2510.MAILMAST where mm_file_p
refix
-TRACE-
-20-
#appendv sqlin "like ""[LikeIt]
^
-TRACE-
-20-
Invoking variable :LIKEIT.1
#appendv sqlin "like ""ED?"" for browse access;"
-TRACE-
-20-d test
?
-22-st 1 v
SQLCI/Inv Sqlin,outv sqlout/
-TRACE-
-23-d sqlin
select mm_program_name from $DATA5.SQL2510.MAILMAST where mm_file_prefix
like "ED?" for browse access;
-24-
Since the % is not there as a wildcard, this SQL statement fails to bring up the proper record.
The question is, how do I put a % into a TACL variable and not get it changed to a ?
Edited to replace duplicate code with the start of the TACL Macro - MEH
I really don't like answering my own question because that looks like I was just setting things up to make me look smart, but in continuing my research while waiting, I found the answer. See code snippitz below:
?TACL macro
#Frame
#Set #informat plain
#Set #outformat pretty
#Push stoprun DC fidata var1 mailmast sqlvol sqlsvol IsOpen EMLFile ans emlline
#Push mailfile mfile likeit charaddr sqlin sqlout test
[#Def True text |body|-1]
[#Def False text |body|0]
[#Def ascii struct
Begin
BYTE byt0 value 37;
CHAR pcent REDEFINES byt0;
End;
] == end of struct
Note the addition of the struct. This is to be able to assign a byte a decimal value of 37 (%) and redefine it as a character to be used in the like statement.
== select <program name> from <full mailmast filename> where mm_file_prefix
== like "<likename>%" for browse access;
[#If Not [StopRun]
|then|
#Set test [ascii:pcent]
#Set #trace -1
#Set SqlIn select mm_program_name from [mailmast] where mm_file_prefix
#append sqlin like "[LikeIt][ascii:pcent]" for browse access;
SQLCI/Inv Sqlin,outv sqlout/
] == end of if
Note the use of the struct [ascii:pcent] and this does work.
Thanks to all who read my question.
I'm writing an LPeg-based parser. How can I make it so a parsing error returns nil, errmsg?
I know I can use error(), but as far as I know that creates a normal error, not nil, errmsg.
The code is pretty long, but the relevant part is this:
local eof = lpeg.P(-1)
local nl = (lpeg.P "\r")^-1 * lpeg.P "\n" + lpeg.P "\\n" + eof -- \r for winblows compat
local nlnoeof = (lpeg.P "\r")^-1 * lpeg.P "\n" + lpeg.P "\\n"
local ws = lpeg.S(" \t")
local inlineComment = lpeg.P("`") * (1 - (lpeg.S("`") + nl * nl)) ^ 0 * lpeg.P("`")
local wsc = ws + inlineComment -- comments count as whitespace
local backslashEscaped
= lpeg.P("\\ ") / " " -- escaped spaces
+ lpeg.P("\\\\") / "\\" -- escaped escape character
+ lpeg.P("\\#") / "#"
+ lpeg.P("\\>") / ">"
+ lpeg.P("\\`") / "`"
+ lpeg.P("\\n") -- \\n newlines count as backslash escaped
+ lpeg.P("\\") * lpeg.P(function(_, i)
error("Unknown backslash escape at position " .. i) -- this error() is what I wanna get rid of.
end)
local Line = lpeg.C((wsc + (backslashEscaped + 1 - nl))^0) / function(x) return x end * nl * lpeg.Cp()
I want Line:match(...) to return nil, errmsg when there's an invalid escape.
LPeg itself doesn't provide specific functions to help you with error reporting. A quick fix to your problem would be to make a protected call (pcall) to match like this:
local function parse(text)
local ok, result = pcall(function () return Line:match(text) end)
if ok then
return result
else
-- `result` will contain the error thrown. If it is a string
-- Lua will add additional information to it (filename and line number).
-- If you do not want this, throw a table instead like `{ msg = "error" }`
-- and access the message using `result.msg`
return nil, result
end
end
However, this will also catch any other error, which you probably don't want. A better solution would be to use LPegLabel instead. LPegLabel is an extension of LPeg that adds support for labeled failures. Just replace require"lpeg" with require"lpeglabel" and then use lpeg.T(L) to throw labels where L is an integer from 1-255 (0 is used for regular PEG failures).
local unknown_escape = 1
local backslashEscaped = ... + lpeg.P("\\") * lpeg.T(unknown_escape)
Now Line:match(...) will return nil, label, suffix if there is a label thrown (suffix is the remaining unprocessed input, which you can use to compute for the error position via its length). With this, you can print out the appropriate error message based on the label. For more complex grammars, you would probably want a more systematic way of mapping the error labels and messages. Please check the documentation found in the readme of the LPegLabel repository to see examples of how one may do so.
LPegLabel also allows you to catch the labels in the grammar by the way (via labeled choice); this is useful for implementing things like error recovery. For more information on labeled failures and examples, please check the documentation.
I'm trying to extract some text using BeautifulSoup. I'm using get_text() function for this purpose.
My problem is that the text contains </br> tags and I need to convert them to end lines. how can I do this?
You can do this using the BeautifulSoup object itself, or any element of it:
for br in soup.find_all("br"):
br.replace_with("\n")
As official doc says:
You can specify a string to be used to join the bits of text together: soup.get_text("\n")
A regex should do the trick.
import re
s = re.sub('<br\s*?>', '\n', yourTextHere)
Hope this helps!
Also you can use get_text(separator = '\n', strip = True) :
from bs4 import BeautifulSoup
bs=BeautifulSoup('<td>some text<br>some more text</td>','html.parser')
text=bs.get_text(separator = '\n', strip = True)
print(text)
>>
some text
some more text
it works for me.
Adding to Ian's and dividebyzero's post/comments you can do this to efficiently filter/replace many tags in one go:
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
elem.replace_with(elem.text + "\n\n")
Instead of replacing the tags with \n, it may be better to just add a \n to the end of all of the tags that matter.
To steal the list from #petezurich:
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
elem.append('\n')
If you call element.text you'll get the text without br tags.
Maybe you need define your own custom method for this purpose:
def clean_text(elem):
text = ''
for e in elem.descendants:
if isinstance(e, str):
text += e.strip()
elif e.name == 'br' or e.name == 'p':
text += '\n'
return text
# get page content
soup = BeautifulSoup(request_response.text, 'html.parser')
# get your target element
description_div = soup.select_one('.description-class')
# clean the data
print(clean_text(description_div))
I would like to retrieve logs from a mercurial repository using mercurial commands api. Unfortunately, mercurial.commands.log prints the messages to the stdout, instead of returning some nice list of revisions, like e.g. pysvn does. Can the be achieved easily? I would like to add mercurial support to my program and would like to do this as easily, as it's possible.
You should do something along the lines of this:
from mercurial import ui, hg
u = ui.ui()
repo = hg.repo()
for rev in repo:
print repo[rev]
the subscripted object is a context object. It has useful methods like description(), branch(), and user(). For a complete list of what it can do, see the source (or do a dir() on it).
The simple answer is to use ui.pushbuffer() right before you call the log command and log_output = ui.popbuffer() right after you call it. By doing that log_output will contain the output of the log command.
Are you actually looking for the straight log output though, or do you really want the diff or some other kind of data? If we know what exactly you're trying to get (for example: "the commit messages of every changeset between X and Y") we might be able to show you a better way.
EDIT: Take a look at the Mercurial API wiki page to see how to get most of the common information from repo and ctx objects.
yeah I had the same problem.. seems as its as designed to disallow retrieving logs remotely. The web interface gives a little rss feed but it wasn't enough of a history for me. So we created our own customised rss feed...
its not the most elaborate of things and is customised to our liking, you can mix the fields around in print_item() to change the look of the feed. You could also mod it to return log info on specific changesets if needed.
You will have to add a script alias to apache, something like (See http://httpd.apache.org/docs/2.0/howto/cgi.html for more info):
ScriptAlias /feed.cgi /usr/local/systems/hg/script/feed.cgi
feed.cgi file contents:
#!/usr/bin/env python2.5
# -*- python -*-
"""
Creates a rss feed from commit log messages in a repository/branch.
Can be filtered on commit logs from a set date eg date=2009-12-12
or by a number of days previous eg. days=7
Usage:
* retrieve all logs: http://hg.server/feed.cgi?repository=MyRepo
* retrieve logs from set date: http://hg.server/feed.cgi?repository=DMyRepo&date=2009-11-11
* retrieve logs from last 77 days: http://hg.server/feed.cgi?repository=DMyRepo&days=77
* retrieve all logs from a branch: http://hg.server/feed.cgi?repository=MyRepo&branch=myBranch
Script Location on server: /usr/local/systems/hg/script/feed.cgi
"""
defaultdateformats = (
'%Y-%m-%d %H:%M:%S',
'%Y-%m-%d %I:%M:%S%p',
'%Y-%m-%d %H:%M',
'%Y-%m-%d %I:%M%p',
'%Y-%m-%d',
'%m-%d',
'%m/%d',
'%m/%d/%y',
'%m/%d/%Y',
'%a %b %d %H:%M:%S %Y',
'%a %b %d %I:%M:%S%p %Y',
'%a, %d %b %Y %H:%M:%S', # GNU coreutils "/bin/date --rfc-2822"
'%b %d %H:%M:%S %Y',
'%b %d %I:%M:%S%p %Y',
'%b %d %H:%M:%S',
'%b %d %I:%M:%S%p',
'%b %d %H:%M',
'%b %d %I:%M%p',
'%b %d %Y',
'%b %d',
'%H:%M:%S',
'%I:%M:%S%p',
'%H:%M',
'%I:%M%p',
)
import os, sys, cgi, cgitb, datetime, time
cgitb.enable()
from mercurial import ui, hg, util
from mercurial.node import short
def find_repository(name):
base = '/usr/local/systems/hg/repos/'
path = os.path.join(base, name)
repos = hg.repository(None, path)
return repos
def find_changes(repos, branch, date):
# returns true if d2 is newer than d1
def newerDate(d1, d2):
d1 = datetime.datetime.fromtimestamp(d1)
d2 = datetime.datetime.fromtimestamp(d2)
return d1 < d2
#for ctx in repos.changelog:
# print ctx
changes = repos.changelog
out = []
# filter on branch
if branch != '':
changes = [change for change in changes if repos.changectx(change).branch() == branch ]
# filter on date
if date != '':
changes = [change for change in changes if newerDate(date, repos.changectx(change).date()[0]) ]
return changes
def print_item(change, link_template):
def _element(name, content):
content = cgi.escape(content)
print " <%(name)s>%(content)s</%(name)s>" % {
'name': name,
'content': content
}
link = link_template % {'node': short(change.node())}
print " <item>"
_element('title', str(change.rev()))
_element('description', change.description())
_element('guid', str(change.rev()))
_element('author', change.user())
_element('link', link)
_element('pubdate', str(datetime.datetime.fromtimestamp(change.date()[0])))
print " </item>"
def print_rss(changes, repos, template):
print """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<link>N/A</link>
<language>en-us</language>
<title>Changelog</title>
<description>Changelog</description>
"""
for change in changes:
ctx = repos.changectx(change)
print_item(ctx, template)
print """
</channel>
</rss>
"""
if __name__=="__main__":
# -*- python -*-
print "Content-Type: application/rss+xml; charset=UTF-8"
print
f = cgi.FieldStorage()
if not f.has_key("repository"):
print "Need to specify repository."
sys.exit()
repository = f['repository'].value
branch = ''
if f.has_key('branch'):
branch = f['branch'].value
date = ''
if f.has_key('date') and not f.has_key('days'):
try:
#date = datetime.datetime.strptime(f['date'].value, '%Y-%m-%d')
date = util.parsedate(f['date'].value)[0]
except:
print 'Error in date format, use one of the following formats:', defaultdateformats
sys.exit()
elif f.has_key('days') and not f.has_key('date'):
days = int(f['days'].value)
try:
date = datetime.datetime.now() - datetime.timedelta(days=days)
date = time.mktime(date.timetuple())
except:
print 'Error in days, please use a standard number eg. days=7'
sys.exit()
elif f.has_key('days') and f.has_key('date'):
print 'Error, please only supply a dayrange OR a date, not both'
sys.exit()
repos = find_repository(repository)
changes = find_changes(repos, branch, date)
rev_link_template = 'http://hg.server/hg/%(repos)s/rev/%%(node)s' % {
'repos': repository
}
print_rss(changes, repos, rev_link_template)