How can I know which text document maps to which ID in a sequence file

I am learning Mahout with 'Mahout in Action' and am currently on chapter 8. I just downloaded the Reuters-21578 file and used the following command to convert all the documents to a SequenceFile:
bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles
and I got chunk-0 in the 'reuters-seqfiles' folder.
My question is: How can I know which document has been assigned to which ID in this sequence file?

Related

Remove PDF metadata (removing complete PDF metadata)

I want to remove metadata from PDF files. I have already tried "exiftool", "pdftk" and "qpdf" (the method proposed at https://gist.github.com/hubgit/6078384). These tools claim to remove the metadata, but it is still in the file afterwards: running "grep -a metadata_fieldname file.pdf" still retrieves the metadata value.
Is there a way to completely delete the metadata from PDF files (that is, delete all the objects containing metadata information)?
I am using Ubuntu. When I create a PDF file with a LaTeX tool (e.g. pdfTeX) or LibreOffice, the tool automatically writes the Producer, the Creator and sometimes the full banner, etc. into the metadata of the PDF file. I am looking to remove this information from PDF files (basically the metadata stored by the PDF creator tool).
To blank out the whole PDF information dictionary with pdftk from your Ubuntu terminal, you can use the following command:
pdftk file.pdf dump_data |sed -e 's/\(InfoValue:\)\s.*/\1\ /g' | pdftk file.pdf update_info - output file_no_meta.pdf
Here file.pdf is the source file and file_no_meta.pdf is the output file.
Next, use the following command to remove XMP metadata:
exiftool -all:all= -overwrite_original file_no_meta.pdf
Finally, use the following command on your terminal to check for the file metadata again:
pdfinfo file_no_meta.pdf
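To see what the sed step in the first command does, the sketch below runs an equivalent substitution on hand-written text shaped like pdftk's dump_data output. The sample keys and values and the function name blank_info_values are made up for illustration, and the expression is a portable variant of the one above (it avoids \s, which is a GNU sed extension), so treat it as an illustration rather than a replacement.

```shell
#!/bin/sh
# Blank the value on every InfoValue line; all other lines pass through unchanged.
blank_info_values() {
    sed -e 's/^\(InfoValue:\).*/\1 /'
}

# Hypothetical sample shaped like the output of: pdftk file.pdf dump_data
printf 'InfoBegin\nInfoKey: Producer\nInfoValue: pdfTeX-1.40.21\nNumberOfPages: 3\n' |
    blank_info_values
```

The InfoKey and NumberOfPages lines come out unchanged, while the InfoValue line keeps only its label. That edited text is what the pipeline feeds to update_info.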
You can use pdftk to strip all Info and XMP metadata from a document by copying its pages into a new PDF, like this:
pdftk A=mydoc.pdf cat A output mydoc.no_metadata.pdf
For pdftk A=mydoc.pdf cat A output mydoc.no_metadata.pdf to work, you need an older version of pdftk; pdftk-java messes things up.

Rename ttf/woff/woff2 file to PostScript Font Name with Script

I am a typographer working with many fonts that have incorrect or incomplete filenames. I am on a Mac and have been using Hazel, AppleScript, and Automator workflows to try to automate renaming these files*. I need a script that replaces the existing filename of ttf, woff, or woff2 files in Finder with the font's postscriptName. I know of tools (fc-scan/fontconfig, TTX, etc.) that can retrieve the PostScript name values I require, but I lack the programming know-how to code a script for my purposes. I've only managed to set up a watched directory that can run a script when any files matching certain parameters are added.
*To clarify, I am talking about changing the filename only, not the actual names stored within the font. Also I am open to a script of any compatible language or workflow of scripts if possible, e.g. this post references embedding AppleScript within Shell scripts via osascript.
StackExchange Posts I've Consulted:
How to get Fontname from OTF or TTF File?
How to get PostScript name of TTF font in OS X?
How to Change Name of Font?
Automate Renaming Files in macOS
Others:
https://github.com/dtinth/JXA-Cookbook/wiki/Using-JavaScript-for-Automation
https://github.com/fonttools/fonttools
https://github.com/devongovett/fontkit
https://www.npmjs.com/package/rename-js
https://opentype.js.org/font-inspector.html
http://www.fontgeek.net/blog/?p=343
https://www.lantean.co/osx-renaming-fonts-for-free
Edit: Added the following by request.
1) Screenshot of a somewhat typical webfont, illustrating how the form fields for font family and style names are often incomplete, blank, or contain illegal characters.
2) The woff file depicted (also, as base64).
Thank you all in advance!
Since you mentioned Automator in your question, I thought I'd try and solve this while using that to rename the file, along with standard Mac bash to get the font name. Hopefully, it beats learning a whole programming language.
I don't know what your workflow is so I'll leave any deviations to you but here is a method to select a font file and from Services, rename the file to the font's postscript name… based on Apple's metadata, specifically "com_apple_ats_name_postscript". This is one of the pieces of data retrieved using 'mdls' from the Terminal on the font file. To focus on the postscript name, grep the output for name_postscript. For simplicity here, I'll exclude the path to the selected file.
Font Name Acquisition
So… running this command…
mdls GenBkBasBI.ttf | grep -A1 name_postscript
… generates this output, which contains Font Book's PostScript name. The '-A1' option in grep returns the matching line and the first line after it, which is the one containing the actual font name.
com_apple_ats_name_postscript = (
"GentiumBookBasic-BoldItalic"
Clean this up with some more bash (tr, tail)…
tr -d \ | tail -n 1 | tr -d \"
In order, these strip spaces, drop all lines except the last, and strip quotation marks. Note that in the first 'tr' instance the backslash is followed by an extra space: the escaped space is the character to delete.
In a single line, it looks like this…
mdls GenBkBasBI.ttf | grep -A1 name_postscript | tr -d \ | tail -n 1 | tr -d \"
…and produces this…
GentiumBookBasic-BoldItalic
Now, here is the workflow that includes the above bash command. I got the idea for variable usage from the answer to this question…
Apple Automator “New PDF from Images” maintaining same filename
Automator Workflow
At the top: Service receives selected 'files or folders' in 'Finder'.
1) Get Selected Finder Items. This (or Get Specified…) is there to allow testing. It is obviated by using this as a Service.
2) Set Value of Variable (File). This is to remember which file you want to rename.
3) Run Shell Script. This is where we use the bash stuff. The $f is the selected/specified file. I'm running 'zsh' for whatever reason. You can set it to whatever you're running, presumably 'bash'.
4) Set Value of Variable (Text). Assign the bash output to a variable. This will be used by the last action for the new filename.
5) Get Value of Variable (File). Recall the specified/selected file to rename.
6) Rename Finder Items: Name Single Item. I have it set to 'Basename only' so it will leave the extension alone. Enter the 'Text' variable from action 4 in here.
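If you would rather stay in the Terminal, the same pipeline can also be wrapped in a small shell script. This is only a sketch: the function names are made up, it assumes macOS (for mdls) and that the file's metadata actually contains com_apple_ats_name_postscript, and the answer above does not confirm that mdls reports this attribute for woff/woff2 files.

```shell
#!/bin/sh
# Read `mdls` output on stdin and print the PostScript name, using the same
# grep/tr/tail steps as the one-liner in the answer.
postscript_name() {
    grep -A1 name_postscript | tr -d ' ' | tail -n 1 | tr -d '"'
}

# Rename one font file to <PostScript name>.<original extension>.
# Hypothetical helper; needs macOS because it calls mdls.
rename_to_postscript_name() {
    f=$1
    name=$(mdls "$f" | postscript_name)
    # Leave the file alone if no name was found.
    [ -n "$name" ] || return 1
    # -n: never overwrite an existing file with the same name.
    mv -n "$f" "$(dirname "$f")/$name.${f##*.}"
}

# Demo on the sample mdls output from the answer (runs without mdls).
printf 'com_apple_ats_name_postscript = (\n    "GentiumBookBasic-BoldItalic"\n)\n' |
    postscript_name
```

Calling rename_to_postscript_name GenBkBasBI.ttf would then rename that file to GentiumBookBasic-BoldItalic.ttf, and a for loop over *.ttf would cover a whole folder.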

Batch edit in OpenRefine

So, I have a bunch of .csv files which need cleaning. They all need to go through the same steps, so I've extracted OpenRefine's operation history in order to apply it to other ones.
I could open each file one by one in OpenRefine and apply the extracted JSON history. But there are a lot of files...
Also, I don't have enough memory to open them all at once in OpenRefine (multiple selecting when opening the files).
Is there any way I could edit them all automatically using the JSON I extracted from OpenRefine?
That's what we created BatchRefine for. The README should be pretty much self-explanatory; if not, let me know.
I recently converted 4 million CSV records to RDF using BatchRefine; it took less than 10 minutes on my MacBook Pro.
I execute BatchRefine with this simple shell script:
#!/bin/bash
# Send every input file through BatchRefine, skipping files already transformed.
for file in ./input/*.tsv
do
    filename=$(basename "$file")
    if [ ! -f "target/$filename-transformed" ]
    then
        echo "Processing $filename..."
        # "@$file" makes curl upload the contents of the file as the request body
        curl -XPOST -H 'Accept: text/turtle' -H 'Content-Type:text/csv' --data-binary "@$file" -o "target/$filename-transformed" 'localhost:8310/?refinejson=http://localhost:8000/bar-config.json'
    else
        echo "Found target/$filename-transformed, skipping $file"
    fi
done
Note that you need to adjust the Accept header in the script; I guess you want CSV as output again, not RDF.
You can automate some OpenRefine operations using one of the existing libraries:
python
Another python library
ruby
javascript - nodejs

How to get the hidden text layout that tesseract creates for pdf files?

I don't have much experience with OCR. Here's what I run:
tesseract -l eng -psm 1 image_str007_0001.jpg image_str007_tess pdf
The result is a perfectly structured hidden text layout - the words are on their exact places when searching the pdf.
My question is: can I get this layout as a file (hocr or html)?
(Config parameters preferred, not API.)
What I've tried:
tesseract -l eng -psm 1 image_str007_0001.jpg output hocr
and
hocr2pdf -i image_str007_001 -o output.pdf < output.hocr
In the file output.pdf the words are badly misplaced when searching through the text. Is the tesseract command with hocr not the correct way to create the hOCR layout file, or does hocr2pdf not create the PDF correctly?

In texinfo, how to specify a bash single quote?

I am writing a package using the GNU build system. The documentation hence is in the texinfo format. As a result, executing make converts the texinfo file into the info format, and executing make pdf automatically produces a pdf file.
In the texinfo file, I have something like this:
@verbatim
awk '{...}' data.txt
@end verbatim
However, in the pdf, the "basic" single quotes (U+0027) in the awk command above are transformed into "curvy" single quotes (U+2019) so that, if one does a copy-paste of the command from the pdf into a terminal, bash complains ("syntax error"). This forces the user to edit the command he just copy-pasted. The same problem occurs if I replace @verbatim by @example. I searched the texinfo manual but couldn't find a way to specify apostrophes. I am using texinfo version 5.2.
Karl Berry (via the bug-texinfo mailing list) told me to add 2 lines to my texi file (more info):
@codequoteundirected on
@codequotebacktick on
as well as add the latest version of texinfo.tex to my package.
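A quick way to check that the change took effect is to look for curly quotes in the text extracted from the generated PDF. The sketch below assumes you have some way to dump the PDF text to stdout (pdftotext from poppler-utils is one option, and is an assumption here, not part of Karl Berry's advice); the function name is made up.

```shell
#!/bin/sh
# Succeed if stdin contains a curly single quote (U+2018 or U+2019).
has_curly_quotes() {
    # UTF-8 byte sequences for U+2018 and U+2019, written as octal escapes.
    left=$(printf '\342\200\230')
    right=$(printf '\342\200\231')
    # LC_ALL=C makes grep compare raw bytes regardless of the current locale.
    LC_ALL=C grep -q -e "$left" -e "$right"
}

# Demo: the straight-quoted command from the question passes the check.
if printf "awk '{...}' data.txt\n" | has_curly_quotes; then
    echo "curly quotes found"
else
    echo "only straight quotes"
fi
```

With a real document this could be run as pdftotext manual.pdf - | has_curly_quotes, where manual.pdf is a placeholder name.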