How to change the text item delimiter in Keynote rich text so periods and commas are not ignored (using JavaScript for Automation)? - osx-yosemite

In short:
with iWork rich text objects, breaking the text up in words goes from:
"This... he said, is a sentence!"
to:
["This", "he", "said", "is", "a", "sentence"]
So: the periods, the comma and the exclamation point have disappeared.
This is similar to the AppleScript situation, but with JavaScript for Automation it is unclear to me how to set the text item delimiter (plus: I am hoping it can be simpler than in the old days).
In detail:
I would like to modify rich text like:
testing [value] units <ignore this>
>>>
also ignore this
<<<
etc.
The text can contain variations in size/color/weight, which should be kept.
The result should be e.g.:
testing 123 units
etc.
When I go through the words (in my case: presenter notes in Keynote), I get:
["testing", "value", "units", "ignore", "this", "also", "ignore", "this", "etc"]
instead of:
["testing", "[value]", "units", "<ignore", "this>", ">>>", "also", "ignore", "this", "<<<", "etc."]
So: characters like ., [, and > don't show up, which makes it impossible to search/replace.
To get the words, I use:
words = Application("Keynote").documents[0].slides[0].presenterNotes.words
I also tried using whose() in combination with ignoring/considering (case, hyphens, punctuation), but the result is the same.
How can I get a list of words that include the non-alphanumeric characters?

To get the full text of a slide's notes, use the presenterNotes() method:
note = Application("Keynote").documents[0].slides[0].presenterNotes()
It's not exactly intuitive for me, but it works fine.
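Once the notes are available as one plain string, splitting on whitespace keeps the punctuation attached to each word. The following is a minimal sketch in Python rather than JXA (a JavaScript string can be split on whitespace in the same way). The sample text is the one from the question, and the sketch works on the plain string only, so on its own it does not preserve rich-text formatting.

```python
# Sample notes text from the question; in practice this would be the
# string returned by presenterNotes().
notes = "testing [value] units <ignore this>\n>>>\nalso ignore this\n<<<\netc."

# str.split() with no arguments splits on any run of whitespace,
# so punctuation stays attached to the neighbouring word.
words = notes.split()
print(words)
```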

Related

How do I replace part of a string with a lua filter in Pandoc, to convert from .md to .pdf?

I am writing markdown files in Obsidian.md and trying to convert them via Pandoc and LaTeX to PDF. Plain text converts fine; however, in Obsidian I use ==equal signs== to highlight something, and this doesn't work in LaTeX.
So I'd like to create a filter that either removes the equal signs entirely, or replaces it with something LaTeX can render, e.g. \hl{something}. I think this would be the same process.
I have a filter that looks like this:
return {
  {
    Str = function (elem)
      if elem.text == "hello" then
        return pandoc.Emph {pandoc.Str "hello"}
      else
        return elem
      end
    end,
  }
}
This works: it replaces any instance of "hello" with an italicized version of the word. However, it only works with whole words; if "hello" were part of a word, the filter wouldn't touch it. Since the equal signs are read as part of one word, it won't touch those either.
How do I modify this (or, please, suggest another filter) so that it CAN replace and change parts of a word?
Thank you!
A string like Hello, World! becomes a list of inlines in pandoc: [ Str "Hello,", Space, Str "World!" ]. Lua filters don't make matching on that particularly convenient: the best method is currently to write a filter for Inlines and then iterate over the list to find matching items.
For a complete example, see https://gist.github.com/tarleb/a0646da1834318d4f71a780edaf9f870.
Assume we have already found the highlighted text and converted it to a Span with class mark. Then we can convert that to LaTeX with
function Span (span)
  if span.classes:includes 'mark' then
    return {pandoc.RawInline('latex', '\\hl{')} ..
      span.content ..
      {pandoc.RawInline('latex', '}')}
  end
end
Note that the current development version of pandoc, which will become pandoc 3 at some point, supports highlighted text out of the box when called with
pandoc --from=markdown+mark ...
E.g.,
echo '==Hi Mom!==' | pandoc -f markdown+mark -t latex
⇒ \hl{Hi Mom!}
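Independently of pandoc filters, the same ==text== to \hl{text} rewrite can be sketched as a plain-text preprocessing step on the markdown source. This is a swapped-in technique, not the Lua filter approach described above: the pattern and the function name highlight_to_latex are hypothetical, and a plain regex like this would also rewrite == pairs inside code blocks, which a proper filter would avoid.

```python
import re

# Matches ==text== non-greedily on a single line; a hypothetical pattern
# for illustration, not pandoc's own parsing rule.
HIGHLIGHT = re.compile(r"==(.+?)==")

def highlight_to_latex(text):
    """Replace each ==text== span with \\hl{text}."""
    # In the replacement template, \\ is a literal backslash and \1 is
    # the captured highlighted text.
    return HIGHLIGHT.sub(r"\\hl{\1}", text)

print(highlight_to_latex("==Hi Mom!=="))
```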

VBA replace certain carriage

All.
I am used to programming VBA in Excel, but am new to the structures in Word.
I am working through a library of text files to update them. Many of them are either OCR documents, or were manually entered.
Each has a recurring pattern, the most common of which is unnecessary carriage returns.
For example, I am looking at several text files where there is a double return after each line. A search and replace of all double carriage returns removes all paragraph distinctions.
However, each line is approximately 30 characters long, and if I manually perform the following logic, it gives me a functional document.
If there is a double carriage return after 30+ characters, I replace them with a space.
If there were less than 30 characters prior to the double return, I replace them with a single return.
Can anyone help me with some rudimentary code that would help me get started on that? I could then modify it for each "pattern" of text documents I have.
e.g.
In this case, there are more than
thirty characters per line. And I
will keep going to illustrate this
example.
This would be a new paragraph, and
would be separated by another of
the single returns.
I want code that would return:
In this case, there are more than thirty characters per line. And I will keep going to illustrate this example.
This would be a new paragraph, and would be separated by another of the single returns.
Let me know if anyone can throw something out that I can play with!
You can do this without code (which RegEx requires), simply using Word's own wildcard Find/Replace tools, where:
Find = ([!^13]{30,})[^13]{1,}
Replace = \1^32
and, to clean up the residual multi-paragraph breaks:
Find = [^13]{2,}
Replace = ^p
You could, of course, record the above as a macro...
Here is a RegEx that might work for you:
(\n\n)(?<!\.(\n\n))
The substitution is just a plain space, you can try it out (and modify / tweak it) here: https://regex101.com/r/zG9GPw/4
This 'pattern' tells the RegEx engine to look for the newline character \n occurring twice in a row, like this: \n\n (worth noting this is taken from your question and might be different in your files, e.g. it could be \r\n). It assumes that a valid line break will be preceded by a full stop: \.
In RegEx the full stop symbol is a single character wild card so it needs to be escaped with the '\' (n and r are normal characters, escaping them tells the RegEx engine they represent newline and return characters).
So... the expression is looking for a group of x2 newline characters but then uses a negative look-behind to exclude any matches where the previous character was a full stop.
Anyway, it's all explained on the site:
Here is how you could do a RegEx find and replace using Notepad++ (I'm not sure if it comes with RegEx or if a plugin is needed; either way it is easy). You can also set a location, filters (to target specific file types), and other options (such as searching in sub-directories).
Other than that, as @MacroPod pointed out, you could also do this with MS Word, document by document, without using any code :)
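The rule stated in the question (a double return after 30 or more characters becomes a space, otherwise a single return) can also be sketched outside Word. The following is a minimal Python sketch, not VBA; the function name unwrap and the min_len parameter are hypothetical, and it assumes the double returns are plain \n\n sequences.

```python
import re

def unwrap(text, min_len=30):
    """Join short-wrapped lines, keeping paragraph breaks."""
    # Split on runs of two or more newlines (the "double returns").
    lines = re.split(r"\n{2,}", text.strip())
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        if i < len(lines) - 1:
            # A long line was wrapped mid-paragraph: rejoin with a space.
            # A short line ended a paragraph: keep one return.
            out.append(" " if len(line) >= min_len else "\n")
    return "".join(out)

sample = (
    "In this case, there are more than\n\n"
    "thirty characters per line. And I\n\n"
    "will keep going to illustrate this\n\n"
    "example.\n\n"
    "This would be a new paragraph, and\n\n"
    "would be separated by another of\n\n"
    "the single returns."
)
print(unwrap(sample))
```

The threshold is the only tunable part, so each "pattern" of document could be handled by passing a different min_len.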

The separator between keywords in PDF meta data

I cannot find an "official" documentation on whether the keywords and keyword phrases in the meta data of a PDF file are to be separated by a comma or by a comma with space.
The following example demonstrates the difference:
keyword,keyword phrase,another keyword phrase
keyword, keyword phrase, another keyword phrase
Any high-quality references?
The online sources I found are of low quality.
E.g., an Adobe press web page says "keywords must be separated by commas or semicolons", but in the example we see a semicolon with a following space before the first keyword and a semicolon with a following space between each two neighbor keywords. We don't see keyword phrases in the example.
The keywords metadata field is a single text field - not a list. You can choose whatever is visually pleasing to you. The search engine which operates on the keyword data may have other preferences, but I would imagine that either comma or semicolon would work with most modern search engines.
Reference: PDF 32000-1:2008, page 550 (available from Adobe and from The Internet Archive).
ExifTool, for example, parses for comma-separated values, but if it does not find a comma it will split on spaces:
# separate tokens in comma or whitespace delimited lists
my @values = ($val =~ /,/) ? split /,+\s*/, $val : split ' ', $val;
foreach $val (@values) {
    $et->FoundTag($tagInfo, $val);
}
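For readers less familiar with Perl, the splitting rule in that snippet can be approximated in Python. This is a sketch of the same logic, not ExifTool's code; the function name split_keywords is hypothetical, and trailing empty fields are dropped to mimic the default behaviour of Perl's split.

```python
import re

def split_keywords(val):
    """Split on commas if any are present, otherwise on whitespace."""
    if "," in val:
        # One or more commas followed by optional whitespace.
        parts = re.split(r",+\s*", val)
        # Perl's split discards trailing empty fields by default.
        while parts and parts[-1] == "":
            parts.pop()
        return parts
    # No comma: split on runs of whitespace.
    return val.split()

print(split_keywords("keyword,keyword phrase,another keyword phrase"))
print(split_keywords("keyword, keyword phrase, another keyword phrase"))
```

Under this rule both forms from the question produce the same list, since the whitespace after a comma is consumed by the separator.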
I don't have a "high-quality reference", but when I generate a PDF using LaTeX I do it in the following way.
I add the following line to my main.tex:
\usepackage[a-1b]{pdfx}
Then I write a file main.xmpdata and add these lines:
\Title{My Title}
\Author{My Name}
\Copyright{Copyright \copyright\ 2018 "My Name"}
\Keywords{KeywordA\sep
KeywordB\sep
KeywordC}
\Subject{My Short Description}
After generating the PDF with pdflatex, I used a Python script based on "pdfminer.six" to extract the metadata:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

# Open the PDF in binary mode; the file is closed when the block ends
with open('main.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    parser.set_document(doc)
    # The XMP packet, if present, is stored under the catalog's Metadata key
    if 'Metadata' in doc.catalog:
        metadata = resolve1(doc.catalog['Metadata']).get_data()
        print(metadata)  # The raw XMP metadata
The part with the keywords then looks like this:
...<rdf:Bag><rdf:li>KeywordA</rdf:li>\n <rdf:li>KeywordB...
Looking at the properties of "main.pdf" in "Adobe Acrobat Reader DC", I find the following entry in the keywords section:
;KeywordA;KeywordB;KeywordC
CommonLook claim to be "a global leader in electronic document accessibility, providing software products and professional services enabling faster, more cost-efficient, and more reliable processes for achieving compliance with the leading PDF and document accessibility standards, including WCAG, PDF/UA and Section 508."
They provide the following advice on PDF metadata:
Pro Tip: When you’re entering Keywords into the metadata, separate
them with semicolons as opposed to commas.
although they give no further reasoning as to why this is the preferred choice.

How do I auto-indent a long method call after typing the final semicolon in Xcode 5?

I have a long method, after I type it in, it looks like this:
[someObj action1:param1 action2:param2 action3:param3 action4:param4];
But I want it to become like this... :
[someObj action1:param1
action2:param2
action3:param3
action4:param4];
...automatically after I type in the last semicolon.
I just saw a video that did this, so how do I do that?
(It is a paid video so can't give link here)
For the code in the question:
[someObj action1:param1 action2:param2 action3:param3 action4:param4];
if you have already entered it on one line, select the space between two parameters and press Return.
That will tend to align the colon ":" characters across the lines.
To auto-indent code that is already entered, select the text and press Control-I; the selection is re-indented according to Xcode's rules.
I use that all the time.
To shift a selected block of code to the left, press Command-[; to shift it to the right, press Command-].
Have a look at https://github.com/travisjeffery/ClangFormat-Xcode.
It will reformat your code to adhere to style rules, like Chromium's for example, where the line length is 80 chars max.
In this case, the method will be wrapped as you mentioned if it's more than 80 chars.
You can either reformat on a particular key combination, or on each save.

making a list of traditional Chinese characters from a string

I am currently trying to estimate the number of times each character is used in a large sample of traditional Chinese characters. I am interested in characters not words. The file also includes punctuation and western characters.
I am reading in an example file of traditional Chinese characters. The file contains a large sample of traditional Chinese characters. Here is a small subset:
首映鼓掌10分鐘 評語指不及《花樣年華》
該片在柏林首映,完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本
增減20處 趙本山香港戲分被刪
在柏林影展放映的《一代宗師》版本
教李小龍武功 葉問決戰散打王
另一增加的戲分是開場時葉問(梁朝偉飾)
My strategy is to read each line, split each line into a list, and go through and check each character to see if it already exists in a list or a dictionary of characters. If the character does not yet exist in my list or dictionary I will add it to that list, if it does exist in my list or dictionary, I will increase the counter for that specific character. I will probably use two lists, a list of characters, and a parallel list containing the counts. This will be more processing, but should also be much easier to code.
I have not gotten anywhere near this point yet.
I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.
However, I run into trouble when I try to make a list of each character on a particular line.
I've read through the following article. I understood many of the comments, but unfortunately, was unable to understand enough of it to solve my problem.
How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?
My code looks like the following
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
wordfile = open('Chinese_example.txt', 'r')
output = open('Chinese_output_python.txt', 'w')
LINES = wordfile.readlines()
Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.
A_LINE = list(LINES[0])
output.write(A_LINE[0])
I think you want to use this, from answerer 'flow' at How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?:
from re import compile as _Re
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split
def split_unicode_chrs( text ):
    return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
To successfully split a line of traditional Chinese characters, I just had to know the proper syntax for handling encoded characters; it turned out to be pretty basic. In Python 2, decode the byte string to unicode before turning it into a list:
my_new_list = list(LINES[0].decode('utf8'))
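The counting step described in the question can be sketched with a dictionary-like counter instead of two parallel lists. This is a minimal sketch assuming Python 3, where iterating over a str yields whole characters, so no manual decoding is needed. The function name count_cjk is hypothetical, and the range check covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so rarer characters in extension blocks would be skipped.

```python
from collections import Counter

def count_cjk(text):
    """Count characters in the basic CJK Unified Ideographs block."""
    # Digits, Latin letters, spaces and punctuation fall outside the
    # range and are therefore ignored.
    return Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")

sample = "首映鼓掌10分鐘 完場後獲全場鼓掌10分鐘"
counts = count_cjk(sample)

# Show the most frequent characters with their counts.
for ch, n in counts.most_common(3):
    print(ch, n)
```

For a whole file, the same function could be applied to the text read with open(path, encoding='utf-8'), or the counter could be updated line by line.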