I am having a strange problem with iText and AcroFields. I created a PDF and added the AcroFields. Now when I call form.setField('a field name', "a value") and display or print the PDF, the value appears twice (once in a smaller font and once in the font intended for that document). I checked the structure of the document and it doesn't look as if my AcroFields are duplicated. What could be the cause of this?
Thanks in advance
Pascal
Please find link here: https://drive.google.com/file/d/0B8O5n5QFSSNrSGVlNllOcEJHRzQ/edit?usp=sharing
I am on Ubuntu. Maybe that's why! I am using evince to look at the file, however I get the same result when I print it. I included a screenshot of what I see. https://drive.google.com/file/d/0B8O5n5QFSSNrWXJyY2VpSkt5NE0/edit?usp=sharing
When I say duplicated, I should say shadowed. The value of the field is first displayed without font styling then overwritten with the required font.
The code I showed is pretty straightforward. The 2 arrays are the names of the fields and their associated values. If the value is xxxx I set the field value to its index in that array. As you can see on the screenshot it gets shadowed too. My printout looks exactly like the screenshot. I haven't tried it yet on another platform.
Here is the code written in groovy
File mergeForm (String path, Map fields, Map values, String newFile) {
println "Merge Form: $path"
def file = grailsApplication.mainContext.getResource(path)?.inputStream
if (file == null)
return null
def reader = new PdfReader(file)
def stamper = new PdfStamper(reader, new FileOutputStream(newFile))
def form = stamper.getAcroFields()
fields.eachWithIndex { k, v, i ->
def val = ""
if (v instanceof Closure) {
val = v(values)
}
else if (v == '_xxxx_') {
val = "${i + 1}"
}
else if (values[v]) {
val = values."$v"
}
println "setting value[$i]: ${val} to: $k"
form.setField (k, val)
}
stamper.close()
return new File (newFile)
}
Summing it up
The issue seems to be due to multiple field annotations in the PDF at hand for each field, which differ somewhat and therefore have different appearances.
In detail
Looking at the document version BOE-267-L1-Rev-1.unlocked-with-fields.pdf we will inspect the topmost field on the first page, "This Claim is Filed for Fiscal Year 20". We see that the page object 9 in its annotations array (in object 265) has (among many others) object 304 and object 180 which both are annotations of that field!
304 0 obj
<<
/Ff 12582912
/MaxLen 2
/F 4
/Type/Annot
/Subtype/Widget
/T(This Claim is Filed for Fiscal Year 20)
/P 9 0 R
/Q 1
/MK<<>>
/FT/Tx
/Rect[166.765 693.57 188.965 701.479]
/DA(/Arial 8 Tf 0 g)
/AA<</F 333 0 R/K 334 0 R>>
>>
endobj
...
180 0 obj
<<
/Ff 0
/F 4
/Type/Annot
/Subtype/Widget
/DR<</Font<</Helv 2 0 R>>>>
/T(This Claim is Filed for Fiscal Year 20)
/V()
/AP<</N 179 0 R>>
/P 9 0 R
/BS<</W 0.5/S/S>>
/FT/Tx
/Rect[165.4 706.28 187.6 714.19]
/DA(/Helv 0 Tf 0 g )
>>
endobj
The definitions of these describe slightly different positions on the page:
/Rect[166.765 693.57 188.965 701.479]
...
/Rect[165.4 706.28 187.6 714.19]
and different default appearance strings
/DA(/Arial 8 Tf 0 g)
...
/DA(/Helv 0 Tf 0 g )
Thus, it is not a surprise that you get multiple, non-identical appearances of this field. The actual surprise is that the version filled by iText on Adobe Reader does not display double values.
@Bruno someone might want to look into this as soon as there is some time.
The other fields have duplicate annotations, too. Most often their page positions are nearly identical, but the default appearance strings still differ, which results in multiple, non-identical appearances for them as well.
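One way to check a form for this situation is to group the widget annotations by their /T name and flag names that have more than one widget with differing /Rect or /DA entries. The following is a minimal Python sketch, not iText code: the `widgets` list is hand-copied from objects 304 and 180 shown above, and `find_conflicting_widgets` is a hypothetical helper name. Reading the annotations out of a real PDF is left to whatever PDF library you use.

```python
from collections import defaultdict

# Hand-copied from objects 304 and 180 above; a real check would read
# these entries from each page's /Annots array with a PDF library.
widgets = [
    {"obj": 304, "T": "This Claim is Filed for Fiscal Year 20",
     "Rect": (166.765, 693.57, 188.965, 701.479), "DA": "/Arial 8 Tf 0 g"},
    {"obj": 180, "T": "This Claim is Filed for Fiscal Year 20",
     "Rect": (165.4, 706.28, 187.6, 714.19), "DA": "/Helv 0 Tf 0 g "},
]


def find_conflicting_widgets(widgets):
    """Return {field name: [object numbers]} for fields that have several
    widgets whose /Rect or /DA entries differ."""
    by_name = defaultdict(list)
    for widget in widgets:
        by_name[widget["T"]].append(widget)

    conflicts = {}
    for name, group in by_name.items():
        if len(group) < 2:
            continue  # a single widget cannot conflict with itself
        rects = {w["Rect"] for w in group}
        appearances = {w["DA"].strip() for w in group}
        if len(rects) > 1 or len(appearances) > 1:
            conflicts[name] = [w["obj"] for w in group]
    return conflicts


print(find_conflicting_widgets(widgets))
```

Any field name reported here is a candidate for the doubled rendering described above.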
My PHP code, below, attempts to download all the photos for a property listing. It successfully queries the RETS server, and creates a file for each photo, but the file does not seem to be a functional image. (MATRIX requires files to be downloaded, instead of URLs.)
The list of photos below suggests that it successfully queries one listing id (47030752) for all photos that exist, (20 photos in this case). In a web browser, the files appear only as a small white square on a black background: e.g. (https://photos.atlantarealestate-homes.com/photos/PHOTO-47030752-9.jpg). The file size (4) also seems to be very low, as compared to that of a real photo.
du -s PHOTO*
4 PHOTO-47030752-10.jpg
4 PHOTO-47030752-11.jpg
4 PHOTO-47030752-12.jpg
4 PHOTO-47030752-13.jpg
4 PHOTO-47030752-14.jpg
4 PHOTO-47030752-15.jpg
4 PHOTO-47030752-16.jpg
4 PHOTO-47030752-17.jpg
4 PHOTO-47030752-18.jpg
4 PHOTO-47030752-19.jpg
4 PHOTO-47030752-1.jpg
4 PHOTO-47030752-20.jpg
4 PHOTO-47030752-2.jpg
4 PHOTO-47030752-3.jpg
4 PHOTO-47030752-4.jpg
4 PHOTO-47030752-5.jpg
4 PHOTO-47030752-6.jpg
4 PHOTO-47030752-7.jpg
4 PHOTO-47030752-8.jpg
4 PHOTO-47030752-9.jpg
script I'm using:
#!/usr/bin/php
<?php
date_default_timezone_set('this/area');
require_once("composer/vendor/autoload.php");
$config = new \PHRETS\Configuration;
$config->setLoginUrl('https://myurl/login.ashx')
->setUsername('myser')
->setPassword('mypass')
->setRetsVersion('1.7.2');
$rets = new \PHRETS\Session($config);
$connect = $rets->Login();
$system = $rets->GetSystemMetadata();
$resources = $system->getResources();
$classes = $resources->first()->getClasses();
$classes = $rets->GetClassesMetadata('Property');
$host="localhost";
$user="db_user";
$password="db_pass";
$dbname="db_name";
$tablename="db_table";
$link=mysqli_connect ($host, $user, $password, $dbname);
$query="select mlsno, matrix_unique_id, photomodificationtimestamp from fmls_homes left join fmls_images on (matrix_unique_id=mls_no and photonum='1') where photomodificationtimestamp <> last_update or last_update is null limit 1";
print ("$query\n");
$result= mysqli_query ($link, $query);
$num_rows = mysqli_num_rows($result);
print "Fetching Images for $num_rows Homes\n";
while ($Row= mysqli_fetch_array ($result)) {
$matrix_unique_id="$Row[matrix_unique_id]";
$objects = $rets->GetObject('Property', 'LargePhoto', $matrix_unique_id);
foreach ($objects as $object) {
// does this represent some kind of error
$object->isError();
$object->getError(); // returns a \PHRETS\Models\RETSError
// get the record ID associated with this object
$object->getContentId();
// get the sequence number of this object relative to the others with the same ContentId
$object->getObjectId();
// get the object's Content-Type value
$object->getContentType();
// get the description of the object
$object->getContentDescription();
// get the sub-description of the object
$object->getContentSubDescription();
// get the object's binary data
$object->getContent();
// get the size of the object's data
$object->getSize();
// does this object represent the primary object in the set
$object->isPreferred();
// when requesting URLs, access the URL given back
$object->getLocation();
// use the given URL and make it look like the RETS server gave the object directly
$object->setContent(file_get_contents($object->getLocation()));
$listing = $object->getContentId();
$number = $object->getObjectId();
$url = $object->getLocation();
//$photo = $object->getContent();
$size = $object->getSize();
$desc = $object->getContentDescription();
if ($number >= '1') {
file_put_contents("/bigdirs/fmls_pics/PHOTO-{$listing}-{$number}.jpg", "$object->getContent();");
print "$listing - $number - $size $desc\n";
} //end if
} //end foreach
} //end while
mysqli_close ($link);
fclose($f);
php?>
Are there any suggested changes to capture photos into the created files? This command creates the photo files:
file_put_contents("/bigdirs/fmls_pics/PHOTO-{$listing}-{$number}.jpg", "$object->getContent();");
There may be some parts of this script that wouldn't work in live production, but are sufficient for testing. This script seems to successfully query for the information needed from the RETS server. The problem is just simply that the actual files created do not seem to be functional photos.
Thanks in Advance! :)
Your code sample is a mix of the official documentation and a usable implementation. The problem is with this line:
$object->setContent(file_get_contents($object->getLocation()));
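You should take that line out completely. It overwrites the image you downloaded with nothing before you get a chance to save the contents to a file. Also note that the file_put_contents call wraps "$object->getContent();" in double quotes; PHP's simple string interpolation does not call methods, so pass $object->getContent() unquoted. With both changes, it should work as expected.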
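To confirm whether the saved files are real photos, it can help to look at their first bytes rather than only at `du` output. The following is a small Python sketch under stated assumptions: `looks_like_jpeg` and `report` are hypothetical helper names, the `PHOTO-*.jpg` pattern mirrors the file names in the question, and the check only tests for the two-byte JPEG Start Of Image marker, so it is a sanity check and not a full image validation.

```python
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # JPEG files begin with this Start Of Image marker


def looks_like_jpeg(path):
    """Return True if the file at `path` begins with the JPEG SOI marker."""
    with open(path, "rb") as handle:
        return handle.read(2) == JPEG_SOI


def report(directory, pattern="PHOTO-*.jpg"):
    """Map each matching file name to (size in bytes, looks_like_jpeg)."""
    return {
        p.name: (p.stat().st_size, looks_like_jpeg(p))
        for p in sorted(Path(directory).glob(pattern))
    }
```

Calling `report("/bigdirs/fmls_pics")` after a download run would list every file with its byte size and whether it starts like a JPEG; files of only a few bytes that fail the check were not written from the binary photo data.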
I'm trying to clean up text inside rectangle in pdf document using iText.
Following is the piece of code I’m using:
PdfReader pdfReader = null;
PdfStamper stamper = null;
try
{
int pageNo = 1;
List<Float> linkBounds = new ArrayList<Float>();
linkBounds.add(0, (float) 202.3);
linkBounds.add(1, (float) 588.6);
linkBounds.add(2, (float) 265.8);
linkBounds.add(3, (float) 599.7);
pdfReader = new PdfReader("Test1.pdf");
stamper = new PdfStamper(pdfReader, new FileOutputStream("Test2.pdf"));
Rectangle linkLocation = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
cleanUpLocations.add(new PdfCleanUpLocation(pageNo, linkLocation, BaseColor.GRAY));
PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
cleaner.cleanUp();
}
catch (Exception e)
{
e.printStackTrace();
}
finally
{
try {
stamper.close();
}
catch (Exception e) {
e.printStackTrace();
}
pdfReader.close();
}
After executing this piece of code, it clears the entire line of text instead of cleaning up only the text inside the given rectangle.
To explain things in a better way I have attached pdf documents.
input PDF
output PDF
In the input pdf, I have highlighted the text to show the rectangle I’m specifying for cleaning up.
In the output pdf, you can clearly see that there is a grey rectangle, but if you look closely the whole line of text has been cleaned up.
Any help will be appreciated.
The files input.pdf and output.pdf that the OP originally presented did not allow the issue to be reproduced; they did not even seem to match each other. Thus, the original answer essentially demonstrated that the issue could not be reproduced.
The second set of files, Test1.pdf and Test2.pdf, did allow the issue to be reproduced, giving rise to the updated answer.
Updated answer referring to the OP's second set of sample files
There indeed is an issue in the current (up to 5.5.8) iText clean-up code: In case of tagged files some methods of PdfContentByte used here introduced extra instructions into the content stream which actually damaged it and relocated some text in the eyes of PDF viewers which ignored the damage.
In more detail:
PdfCleanUpContentOperator.writeTextChunks used canvas.setCharacterSpacing(0) and canvas.setWordSpacing(0) to initially set the character and word spacing to 0. Unfortunately these methods in case of tagged files check whether the canvas under construction currently is in a text object and (if not) start a text object. This check depends on a local flag set by beginText; but during clean-up text objects are not started using that method. Thus, writeTextChunks here inserts an extra "BT 1 0 0 1 0 0 Tm" sequence damaging the stream and relocating the following text.
private void writeTextChunks(Map<Integer, Float> structuredTJoperands, List<PdfCleanUpContentChunk> chunks, PdfContentByte canvas,
float characterSpacing, float wordSpacing, float fontSize, float horizontalScaling) throws IOException {
canvas.setCharacterSpacing(0);
canvas.setWordSpacing(0);
...
PdfCleanUpContentOperator.writeTextChunks instead should use hand-crafted Tc and Tw instructions to not trigger this side effect.
private void writeTextChunks(Map<Integer, Float> structuredTJoperands, List<PdfCleanUpContentChunk> chunks, PdfContentByte canvas,
float characterSpacing, float wordSpacing, float fontSize, float horizontalScaling) throws IOException {
if (Float.compare(characterSpacing, 0.0f) != 0 && Float.compare(characterSpacing, -0.0f) != 0) {
new PdfNumber(0).toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
canvas.getInternalBuffer().append(Tc);
}
if (Float.compare(wordSpacing, 0.0f) != 0 && Float.compare(wordSpacing, -0.0f) != 0) {
new PdfNumber(0).toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
canvas.getInternalBuffer().append(Tw);
}
canvas.getInternalBuffer().append((byte) '[');
With this change in place the OP's new sample file "Test1.pdf" is properly redacted by the sample code
@Test
public void testRedactJavishsTest1() throws IOException, DocumentException
{
try ( InputStream resource = getClass().getResourceAsStream("Test1.pdf");
OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "Test1-redactedJavish.pdf")) )
{
PdfReader reader = new PdfReader(resource);
PdfStamper stamper = new PdfStamper(reader, result);
List<Float> linkBounds = new ArrayList<Float>();
linkBounds.add(0, (float) 202.3);
linkBounds.add(1, (float) 588.6);
linkBounds.add(2, (float) 265.8);
linkBounds.add(3, (float) 599.7);
Rectangle linkLocation1 = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
cleanUpLocations.add(new PdfCleanUpLocation(1, linkLocation1, BaseColor.GRAY));
PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
cleaner.cleanUp();
stamper.close();
reader.close();
}
}
(RedactText.java)
Original answer referring to the OP's original sample files
I just tried to reproduce your issue using this test method
@Test
public void testRedactJavishsText() throws IOException, DocumentException
{
try ( InputStream resource = getClass().getResourceAsStream("input.pdf");
OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "input-redactedJavish.pdf")) )
{
PdfReader reader = new PdfReader(resource);
PdfStamper stamper = new PdfStamper(reader, result);
List<Float> linkBounds = new ArrayList<Float>();
linkBounds.add(0, (float) 200.7);
linkBounds.add(1, (float) 547.3);
linkBounds.add(2, (float) 263.3);
linkBounds.add(3, (float) 558.4);
Rectangle linkLocation1 = new Rectangle(linkBounds.get(0), linkBounds.get(1), linkBounds.get(2), linkBounds.get(3));
List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
cleanUpLocations.add(new PdfCleanUpLocation(1, linkLocation1, BaseColor.GRAY));
PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
cleaner.cleanUp();
stamper.close();
reader.close();
}
}
(RedactText.java)
For your source PDF looking like this
the result was
and not your
I even re-tested using the iText version 5.5.5 you mention in a comment and also 5.5.4, but in all cases I got the correct result.
Thus, I cannot reproduce your issue.
I had a closer look at your output.pdf. It is a bit peculiar, in particular it does not contain certain blocks typical for PDFs created or manipulated by current iText versions. Furthermore the content streams look extremely different.
Thus, I assume that after iText redacted your file some other tool post-processed it and, in doing so, damaged it.
In particular the page content instructions preparing the insertion of the redacted line look like this in your input.pdf:
q
0.24 0 0 0.24 113.7055 548.04 cm
BT
0.0116 Tc
45 0 0 45 0 0 Tm
/TT5 1 Tf
[...] TJ
and like this in the version I received directly from iText:
q
0.24 0 0 0.24 113.7055 548.04 cm
BT
0.0116 Tc
45 0 0 45 0 0 Tm
/TT5 1 Tf
0 Tc
0 Tw
[...] TJ
but the corresponding lines in your output.pdf look like this
BT
1 0 0 1 113.3 548.5 Tm
0 Tc
BT
1 0 0 1 0 0 Tm
0 Tc
[...] TJ
Here the instructions in your output.pdf are
invalid, as inside a text object BT ... ET there may be no other text object, but you have two BT operations following each other without an ET in between;
effectively positioning the text at 0, 0 if a PDF viewer ignores the error mentioned above.
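The nesting problem can be spotted mechanically. The following is a minimal Python sketch, assuming the content stream has already been decompressed to text; `find_nested_bt` is a hypothetical helper, and splitting on whitespace is a simplification that would misfire if a string operand happened to contain "BT" or "ET" as a separate word. The two sample strings are the instruction sequences quoted above.

```python
def find_nested_bt(content):
    """Return the 0-based token indices of BT operators that open a text
    object while another one is still open (no ET in between)."""
    nested = []
    in_text = False
    for index, token in enumerate(content.split()):
        if token == "BT":
            if in_text:
                nested.append(index)  # second BT before any ET
            in_text = True
        elif token == "ET":
            in_text = False
    return nested


# Instruction sequence from the result produced directly by iText
from_itext = """q
0.24 0 0 0.24 113.7055 548.04 cm
BT
0.0116 Tc
45 0 0 45 0 0 Tm
/TT5 1 Tf
0 Tc
0 Tw
[...] TJ"""

# Instruction sequence from the damaged output.pdf
from_output = """BT
1 0 0 1 113.3 548.5 Tm
0 Tc
BT
1 0 0 1 0 0 Tm
0 Tc
[...] TJ"""

print(find_nested_bt(from_itext))
print(find_nested_bt(from_output))
```

An empty list means no text object was opened inside another one; any index in the list points at an offending BT.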
And indeed, if you look at the bottom of your output.pdf page you'll see:
So if my assumption is correct that some other program post-processes the iText result, you should repair that post-processor.
If there is no such post-processor, you seem not to have the officially published iText version but something altogether different.
I've got the following code to delete duplicate images from a perceptual hash I calculated.
images = Image.objects.all()
images_deleted = 0
for image in images:
    duplicates = Image.objects.filter(hash=image.hash).exclude(pk=image.pk).exclude(hash="ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff")
    for duplicate in duplicates:
        duplicate_tags = duplicate.tags.all()
        image.tags.add(*duplicate_tags)
        duplicate.delete()
        images_deleted+=1
print(str(images_deleted))
Running it, I get the following exception:
django.db.utils.IntegrityError: insert or update on table
"crawlers_image_tags" violates foreign key constraint
"crawlers_image_t_image_id_72a28d1d54e11b5f_fk_crawlers_image_id"
DETAIL: Key (image_id)=(5675) is not present in table
"crawlers_image".
Can anyone shed some light on what exactly the problem is?
edit:
models:
class Tag(models.Model):
    name = models.CharField(max_length=100)

    def __str__(self):
        return self.name

class Image(models.Model):
    origins = (
        ('PX', 'Pexels'),
        ('MG', 'Magdeleine'),
        ('FC', 'FancyCrave'),
        ('SS', 'StockSnap'),
        ('PB', 'PixaBay'),
        ('TP', 'tookapic'),
        ('KP', 'kaboompics'),
        ('PJ', 'picjumbo'),
        ('LS', 'LibreShot')
    )
    source_url = models.URLField(max_length=400)
    page_url = models.URLField(unique=True, max_length=400)
    thumbnail = models.ImageField(upload_to='thumbs', null=True)
    origin = models.CharField(choices=origins, max_length=2)
    tags = models.ManyToManyField(Tag)
    hash = models.CharField(max_length=200)

    def __str__(self):
        return self.page_url

    def create_hash(self):
        thumbnail = Imagelib.open(self.thumbnail.path)
        thumbnail = thumbnail.convert('RGB')
        self.hash = blockhash(thumbnail, 24)
        self.save(update_fields=["hash"])

    def create_thumbnail(self, image_url):
        if not self.thumbnail:
            if not image_url:
                image_url = self.source_url
            headers = {
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
            }
            for i in range(5):
                r = requests.get(image_url, stream=True, headers=headers)
                if r.status_code != 200 and r.status_code!= 304:
                    print("error loading image url status code: {}".format(r.status_code))
                    time.sleep(2)
                else:
                    break
            if r.status_code != 200 and r.status_code!= 304:
                print("giving up on this image, final status code: {}".format(r.status_code))
                return False
            # Create the thumbnail of dimension size
            size = 500, 500
            img = Imagelib.open(r.raw)
            thumb = ImageOps.fit(img, size, Imagelib.ANTIALIAS)
            # Get the image name from the url
            img_name = os.path.basename(image_url.split('?', 1)[0])
            file_path = os.path.join(djangoSettings.MEDIA_ROOT, "thumb" + img_name)
            thumb.save(file_path, 'JPEG')
            # Save the thumbnail in the media directory, prepend thumb
            self.thumbnail.save(
                img_name,
                File(open(file_path, 'rb')))
            os.remove(file_path)
            return True
Let's examine your code step by step.
Say you have 3 images in your database (for simplicity I've skipped irrelevant fields):
Image(pk=1, hash="d2ffacb...e3")
Image(pk=2, hash="afcbdee...77")
Image(pk=3, hash="d2ffacb...e3")
As we can see, the first and third images have exactly the same hash. Let's assume all your images have some tags. Now back to your code. Let's check what happens in the first iteration (image pk=1):
All images with the same hash are fetched from the database; this is only image pk=3.
Iterating through those images copies all tags from the duplicates to the original one. There is nothing wrong with that.
Iterating through those images also removes them.
So after the first iteration, the image with pk=3 doesn't exist anymore.
Next iteration, image pk=2. Nothing happens because there are no duplicates.
Next iteration, image pk=3.
All images with the same hash are fetched from the database; this is only image pk=1.
Iterating through those images copies all tags from the duplicates to the original one. But wait... there is no image pk=3 in the database anymore, so we can't assign any tags to it. That throws your IntegrityError.
To avoid that, you should fetch only the original images in the outer for loop. To do that, you can use:
images = Image.objects.distinct('hash')
You can also add ordering so that, for example, the image with the lowest ID is always fetched as the original one. Note that distinct() with field names works only on PostgreSQL, and the order_by() fields must start with the distinct() fields:
images = Image.objects.order_by('hash', 'id').distinct('hash')
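The idea of keeping one original per hash can be tried out without a database. The following is a minimal Python sketch using plain tuples instead of Django models; `plan_merge` is a hypothetical helper, the sample hashes are shortened stand-ins, and the exclusion of the all-"f" placeholder hash from the question is left out for brevity.

```python
from collections import defaultdict


def plan_merge(images):
    """images: iterable of (pk, hash, tags) tuples.

    Returns (merged_tags, to_delete): the combined tag set for every
    surviving pk, and the pks of the duplicates to delete afterwards."""
    by_hash = defaultdict(list)
    for pk, image_hash, tags in images:
        by_hash[image_hash].append((pk, set(tags)))

    merged_tags = {}
    to_delete = []
    for group in by_hash.values():
        group.sort(key=lambda item: item[0])  # lowest pk becomes the original
        original_pk, original_tags = group[0]
        for duplicate_pk, duplicate_tags in group[1:]:
            original_tags |= duplicate_tags   # copy tags onto the original
            to_delete.append(duplicate_pk)    # never visited as an original
        merged_tags[original_pk] = original_tags
    return merged_tags, to_delete


images = [
    (1, "d2ff-e3", {"sky"}),
    (2, "afcb-77", {"sea"}),
    (3, "d2ff-e3", {"sun"}),
]
print(plan_merge(images))
```

Because every duplicate is only ever handled as a member of its original's group, no step tries to attach tags to a row that has already been scheduled for deletion.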
This is to do with the evaluation strategy of the queryset.
Image.objects.all() returns a thunk - that is, a sort of promise of an iterable sequence of images. The SQL query is not executed at this stage.
When you start iterating over it - for image in images - the SQL query is evaluated. You now have a list of image objects in memory.
Now, say you have four images in the database - ids 0, 1, 2, and 3. 0 and 3 are duplicates. The first image is processed, turning up 3 as a duplicate. You delete 3. Image 3 is still in the images iterator, however. When you get there, you're going to try to add tags from image 0 to image 3's tags collection. This will trigger the integrity error, since image 3 has already been deleted.
The simple fix is to keep an accumulator of images to be deleted, and do them all at the end.
images = Image.objects.all()
images_to_delete = []
for image in images:
    if image.pk in images_to_delete:
        pass
    else:
        duplicates = Image.objects.filter(hash=image.hash).exclude(pk=image.pk).exclude(hash="ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff")
        for duplicate in duplicates:
            duplicate_tags = duplicate.tags.all()
            image.tags.add(*duplicate_tags)
            images_to_delete.append(duplicate.pk)
print(len(images_to_delete))
for pk in images_to_delete:
    Image.objects.get(pk=pk).delete()
EDIT: corrected proximate cause of the error, as pointed out by GwynBleidD.
The game is a word search game in an advanced lingo book and the lingo code is using [cc] which is coming up as a code fault. What is wrong or is this use of [cc] obsolete? And if so, how can it be corrected?
on getPropertyDescriptionList me
list = [:]
-- the text member with the words in it
addProp list, #pWordSource,[cc]
[#comment: "Word Source",[cc]
#format: #text,[cc]
#default: VOID]
addProp list, #pEndGameFrame,[cc]
[#comment: "End Game Frame",[cc]
#format: #marker,[cc]
#default: #next]
return list
end
I guess this is code from here, right?
That seems like an older version of Lingo syntax. [cc], apparently, stands for "continuation character". It basically makes the compiler ignore the linebreak right after it, so that it sees everything from [#comment: to #default: VOID] as one long line, which is the syntactically correct way to write it.
If I remember correctly, once upon a time, the guys who made Lingo made one more crazy decision and made the continuation character look like this: ¬ Of course, this didn't print in lots of places, so some texts like your book used things like [cc] in its place.
In modern versions of Lingo, the continuation character is \, just like in C.
I programmed in early Director but have moved on to other languages in the many years since. I understand this code. The function attempts to generate a dictionary of dictionaries. In quasi-JSON:
{
'pWordSource': { ... } ,
'pEndGameFrame': { ... }
}
It is creating a hash, then storing "pWordSource" as a new key pointing to a 3-item hash of its own. The system then repeats the process with a new key "pEndGameFrame", providing yet another 3-item hash. So just to expand the ellipses ... from the above code example:
{
'pWordSource': { 'comment': 'Word Source', 'format': 'text', 'default': null } ,
'pEndGameFrame': { 'comment': 'End Game Frame', 'format': 'marker', 'default': 'next' }
}
So I hope that explains the hash characters. It's Lingo's way of saying "this is not just a string, it's a special Director-specific thing we're calling a symbol." It can be described in more conventional programming terms as a constant. The Lingo compiler will replace your #string1 with an integer, and it's always going to be the same integer associated with #string1. Because the hash keys are actually integers rather than strings, we can change the JSON model to look something more like this:
{
0: { 2: 'Word Source', 3: 'text', 4: null } ,
1: { 2:'End Game Frame', 3: 'marker', 4: 'next' }
}
where:
0 -> pWordSource
1 -> pEndGameFrame
2 -> comment
3 -> format
4 -> default
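The symbol-to-integer mapping above can be modelled in a few lines. The following is a Python sketch of the idea only, not of how Director actually stores symbols: `SymbolTable` and its `intern` method are hypothetical names, and the integer values simply follow the order in which names are first seen.

```python
class SymbolTable:
    """Hands out one stable integer per symbol name."""

    def __init__(self):
        self._ids = {}

    def intern(self, name):
        # Return the existing id, or store and return the next free one.
        return self._ids.setdefault(name, len(self._ids))


symbols = SymbolTable()

# Register the names in the same order as the table above.
for name in ("pWordSource", "pEndGameFrame", "comment", "format", "default"):
    symbols.intern(name)

sym = symbols.intern
description = {
    sym("pWordSource"): {
        sym("comment"): "Word Source", sym("format"): "text", sym("default"): None,
    },
    sym("pEndGameFrame"): {
        sym("comment"): "End Game Frame", sym("format"): "marker", sym("default"): "next",
    },
}
print(description)
```

Asking for the same name twice yields the same integer, which is what lets the integer-keyed structure stand in for the name-keyed one.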
So to mimic the same construction behavior in 2016 Lingo, we use the newer object-oriented dot syntax for calling addProp on property lists.
on getPropertyDescriptionList me
list = [:]
-- the text member with the words in it
list.addProp(#pWordSource,[ \
#comment: "Word Source", \
#format: #text, \
#default: void \
])
list.addProp(#pEndGameFrame,[ \
#comment: "End Game Frame", \
#format: #marker, \
#default: #next \
])
return list
end
Likewise, the same reference shows examples of how to use square brackets to "access" properties, then initialize them by setting their first value.
on getPropertyDescriptionList me
list = [:]
-- the text member with the words in it
list[#pWordSource] = [ \
#comment: "Word Source", \
#format: #text, \
#default: void \
]
list[#pEndGameFrame] = [ \
#comment: "End Game Frame", \
#format: #marker, \
#default: #next \
]
return list
end
And if you are still confused about what the backslashes are doing, there are other ways to make the code more vertical.
on getPropertyDescriptionList me
list = [:]
-- the text member with the words in it
p = [:]
p[#comment] = "Word Source"
p[#format] = #text
p[#default] = void
list[#pWordSource] = p
p = [:] -- allocate new dict to avoid pointer bug
p[#comment] = "End Game Frame"
p[#format] = #marker
p[#default] = #next
list[#pEndGameFrame] = p
return list
end
The above screenshot shows it working in Director 12.0 on OS X Yosemite.
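The "pointer bug" mentioned in the comment of the last listing is ordinary aliasing: if the same list object is stored under both keys, later changes show up in both places. The following is a Python sketch of that effect, with dicts standing in for Lingo property lists; the function names are hypothetical and only the comment property is shown.

```python
def build_shared():
    """Reuses one dict for both entries, which is the bug being avoided."""
    result = {}
    p = {}
    p["comment"] = "Word Source"
    result["pWordSource"] = p
    p["comment"] = "End Game Frame"  # mutates the dict already stored above
    result["pEndGameFrame"] = p
    return result


def build_separate():
    """Allocates a fresh dict for the second entry, as the listing does."""
    result = {}
    p = {"comment": "Word Source"}
    result["pWordSource"] = p
    p = {"comment": "End Game Frame"}  # new object, the first one is untouched
    result["pEndGameFrame"] = p
    return result


print(build_shared())
print(build_separate())
```

In the shared version both keys end up describing "End Game Frame"; in the separate version each key keeps its own description.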
Is there any way to extract highlighted text from a PDF file programmatically? Any language is welcome. I have found several libraries with Python, Java, and also PHP but none of them do the job.
To extract highlighted parts, you can use PyMuPDF. Here is an example which works with this pdf file:
Direct download
# Based on https://stackoverflow.com/a/62859169/562769
from typing import List, Tuple

import fitz  # install with 'pip install pymupdf'


def _parse_highlight(annot: fitz.Annot, wordlist: List[Tuple[float, float, float, float, str, int, int, int]]) -> str:
    points = annot.vertices
    quad_count = int(len(points) / 4)
    sentences = []
    for i in range(quad_count):
        # where the highlighted part is
        r = fitz.Quad(points[i * 4 : i * 4 + 4]).rect
        words = [w for w in wordlist if fitz.Rect(w[:4]).intersects(r)]
        sentences.append(" ".join(w[4] for w in words))
    sentence = " ".join(sentences)
    return sentence


def handle_page(page):
    wordlist = page.get_text("words")  # list of words on page
    wordlist.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x
    highlights = []
    annot = page.first_annot
    while annot:
        if annot.type[0] == 8:
            highlights.append(_parse_highlight(annot, wordlist))
        annot = annot.next
    return highlights


def main(filepath: str) -> List:
    doc = fitz.open(filepath)
    highlights = []
    for page in doc:
        highlights += handle_page(page)
    return highlights


if __name__ == "__main__":
    print(main("PDF-export-example-with-notes.pdf"))
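The core of the script is the test of which words overlap a highlight rectangle. The following is a library-free Python sketch of that selection step, assuming rectangles are (x0, y0, x1, y1) tuples and words are shaped like the tuples returned by get_text("words"); `intersects` and `words_in_rect` are hypothetical helpers, and their handling of rectangles that merely touch at an edge may differ from PyMuPDF's own `intersects`.

```python
def intersects(a, b):
    """True if rectangles a and b, given as (x0, y0, x1, y1), overlap
    with a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def words_in_rect(words, rect):
    """Join the text of all words whose bounding box overlaps rect.

    words: tuples shaped like (x0, y0, x1, y1, text, ...)."""
    return " ".join(w[4] for w in words if intersects(w[:4], rect))


# Made-up coordinates for illustration only
words = [
    (10, 10, 40, 20, "Hello"),
    (45, 10, 80, 20, "world"),
    (10, 30, 40, 40, "below"),
]
highlight = (5, 8, 60, 22)
print(words_in_rect(words, highlight))
```

Because the test is a plain overlap check, a word that is only partly covered by the highlight is still included, which matches the behavior of the script above.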
OK, after some searching I found a solution for exporting highlighted text from a PDF to a text file. It is not very hard:
First, highlight your text with the tool you like to use (in my case, I highlight while I'm reading on an iPad using the GoodReader app).
Transfer your PDF to a computer and open it using Skim (a PDF reader, free and easy to find on the web).
In the FILE menu, choose CONVERT NOTES and convert all the notes of your document to SKIM NOTES.
That's all: simply go to EXPORT and choose EXPORT SKIM NOTES. It will export a list of your highlighted text. Once opened, this list can be exported again to a .txt file.
Not much work to do, and the result is fantastic.