How to assign taxonomy to trnL sequences using BLAST

I am using the trnL chloroplast gene to identify plants from herbivore dung, and am currently trying to assign taxonomy to trnL sequences from my Illumina output. Here is the QIIME script and options I would like to run:
assign_taxonomy.py -i rep_set_numbered.fa -r sequence.fasta -t id_to_taxonomy.txt -e 0.01 -m blast
I have the input file from our data pipeline and the reference file from NCBI GenBank (205,703 sequences). However, I do not have a tab-delimited taxonomy text file. Normally I would generate one in Excel, but because the FASTA file is so large (over 500 MB), it cannot be fully loaded in Excel and therefore cannot be reliably edited.
My question is: is there a command-line method for generating my own tab-delimited taxonomy file from my reference FASTA file, and if so, how would I do that? If not, what are my other options for satisfying this required option of the assign_taxonomy.py QIIME script?
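To make the target format concrete, this is the kind of one-liner I imagine (a sketch only; it assumes each FASTA header carries the accession first, followed by a taxonomy string, which my real headers may not match):
# Sketch: build a tab-delimited id_to_taxonomy.txt from headers like
#   >KJ123456.1 Viridiplantae;Streptophyta;Poaceae;Poa
grep '^>' sequence.fasta \
  | sed 's/^>//' \
  | awk '{id=$1; $1=""; sub(/^ +/,""); print id "\t" $0}' \
  > id_to_taxonomy.txt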

Related

Rename ttf/woff/woff2 file to PostScript Font Name with Script

I am a typographer working with many fonts that have incorrect or incomplete filenames. I am on a Mac and have been using Hazel, AppleScript, and Automator workflows in an attempt to automate renaming these files*. I need a script that replaces the existing filename of ttf, woff, or woff2 files in Finder with the font's postscriptName. I know of tools (fc-scan/fontconfig, TTX, etc.) that can retrieve the PostScript name values I need, but I lack the programming know-how to write a script for my purposes. So far I have only managed to set up a watched directory that runs a script when files matching certain parameters are added.
*To clarify, I am talking about changing the filename only, not the actual names stored within the font. Also, I am open to a script in any compatible language, or a workflow of scripts if necessary; e.g. this post references embedding AppleScript within shell scripts via osascript.
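For reference, this is the sort of lookup I mean with those tools (a hypothetical invocation, assuming fontconfig's fc-scan is installed; whether it reads woff/woff2 depends on the FreeType build underneath):
# Prints the PostScript name stored in the font file
fc-scan --format '%{postscriptname}\n' SomeFont.ttf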
StackExchange Posts I've Consulted:
How to get Fontname from OTF or TTF File?
How to get PostScript name of TTF font in OS X?
How to Change Name of Font?
Automate Renaming Files in macOS
Others:
https://github.com/dtinth/JXA-Cookbook/wiki/Using-JavaScript-for-Automation
https://github.com/fonttools/fonttools
https://github.com/devongovett/fontkit
https://www.npmjs.com/package/rename-js
https://opentype.js.org/font-inspector.html
http://www.fontgeek.net/blog/?p=343
https://www.lantean.co/osx-renaming-fonts-for-free
Edit: Added the following by request.
1) Screenshot of a somewhat typical webfont, illustrating how the form fields for font family and style names are often incomplete, blank, or contain illegal characters.
2) The woff file depicted (also, as base64).
Thank you all in advance!
Since you mentioned Automator in your question, I thought I'd try and solve this while using that to rename the file, along with standard Mac bash to get the font name. Hopefully, it beats learning a whole programming language.
I don't know what your workflow is, so I'll leave any deviations to you, but here is a method to select a font file and, from Services, rename the file to the font's PostScript name, based on Apple's metadata, specifically "com_apple_ats_name_postscript". This is one of the pieces of data retrieved by running 'mdls' on the font file from the Terminal. To focus on the PostScript name, grep the output for name_postscript. For simplicity, I'll exclude the path to the selected file.
Font Name Acquisition
So… running this command…
mdls GenBkBasBI.ttf | grep -A1 name_postscript
… generates this output, which contains Font Book's PostScript name. The '-A1' in grep returns the matched line and one line after it, which is the line containing the actual font name.
com_apple_ats_name_postscript = (
"GentiumBookBasic-BoldItalic"
Clean this up with some more bash (tr, tail)…
tr -d \ | tail -n 1 | tr -d \"
In order, these strip spaces, drop every line except the last, and remove the quotation marks. Note that in the first 'tr' command there is an escaped space after the backslash.
In a single line, it looks like this…
mdls GenBkBasBI.ttf | grep -A1 name_postscript | tr -d \ | tail -n 1 | tr -d \"
…and produces this…
GentiumBookBasic-BoldItalic
Now, here is the workflow that includes the above bash command. I got the idea for variable usage from the answer to this question…
Apple Automator “New PDF from Images” maintaining same filename
Automator Workflow
Automator Workflow screenshot
At the top: the Service receives selected 'files or folders' in 'Finder'.
Get Selected Finder Items
This (or Get Specified…) is there to allow testing. It is obviated by using this as a Service.
Set Value of Variable (File)
This is to remember which file you want to rename.
Run Shell Script
This is where we use the bash stuff; $f is the selected/specified file (see the sketch after this list). I'm running 'zsh' for whatever reason. You can set it to whatever you're running, presumably 'bash'.
Set Value of Variable (Text)
Assign the bash output to a variable. This will be used by the last action for the new filename.
Get Value of Variable (File)
Recall the specified/selected file to rename.
Rename Finder Items: Name Single Item
I have it set to 'Basename only' so it will leave the extension alone. Enter the 'Text' variable from action 4 in here.
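For completeness, the body of the 'Run Shell Script' action could be something like this sketch (assuming the action's 'Pass input' option is set to 'as arguments'):
for f in "$@"
do
    # Same pipeline as above: extract the PostScript name from the metadata
    mdls "$f" | grep -A1 name_postscript | tr -d ' ' | tail -n 1 | tr -d '"'
done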

Sejda merging PDFs from CSV filelist names

I recently installed sejda-console for merging PDF files from the command line.
The names of the input pdf files are in a CSV file named filelist-inputs.csv like this:
./Temp/source/046032.pdf,./Temp/source/048155.pdf
./Temp/source/049278.pdf,./Temp/source/050818.pdf,./Temp/source/052962.pdf
./Temp/source/052962.pdf,./Temp/source/054117.pdf
I need one output PDF file for the first line of the CSV file list, another output PDF for the second line, another for the third line, and so on.
I tried a command line like this:
~$ sejda-console merge -l filelist-inputs.csv -o ./Temp/target/merged[FILENUMBER####].pdf
But it only creates a single file named literally merged[FILENUMBER####].pdf, when I want 3 files:
merged0001.pdf
merged0002.pdf
merged0003.pdf
I've simplified the problem; I actually need to merge more than 3500 PDF files into 700 output files.
Sejda takes all the values in the CSV and generates a single merged PDF; there isn't any option or setting in Sejda to achieve what you asked. You will need some scripting to loop through the CSV lines, create a CSV per line, and feed it to Sejda.
The output file name merged[FILENUMBER####].pdf is used literally because the PDF merge task generates one output file and expects an explicit output file name. Prefixes like [CURRENTPAGE] or [FILENUMBER] are only valid as the -p argument of tasks that generate multiple output PDFs (split tasks etc.).
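To sketch the scripting part (untested; it assumes the merge -l/-o syntax from the question and that Sejda recognizes the file list by its .csv extension):
#!/bin/bash
# Feed each line of the CSV to its own merge call
n=1
while IFS= read -r line; do
    printf '%s\n' "$line" > /tmp/one-line.csv
    sejda-console merge -l /tmp/one-line.csv \
        -o "./Temp/target/merged$(printf '%04d' "$n").pdf"
    ((n++))
done < filelist-inputs.csv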

How to rename photo files using awk, such that they are named (and hence ordered) by "date taken"?

I have 3 groups of photos from 3 different cameras (with time synchronised across all cameras) but with different naming schemes (e.g. IMG_3142.jpg, DCM_022.jpg). I would like to rename every photo file with the following naming convention:
1_yyyy_mm_dd_hh_mm_ss.jpg for earliest
2_yyyy_mm_dd_hh_mm_ss.jpg for next earliest, and so on,
until we reach around 5000_yyyy_mm_dd_hh_mm_ss.jpg for the last one (i.e. the most recent)
I would like the yyyy_mm_dd_hh_mm_ss field to be replaced by the "date and time taken" value for when the photo was taken, which is saved in each file's metadata/properties.
I have seen awk used to carry out similar operations but I'm not familiar enough to know how to access the “time taken” metadata, etc.
Also, not that this should make a difference: my computer is a mac.
You can use jhead for this. The command is:
jhead -n%Y_%m_%d_%H_%M_%S *.jpg
Make a COPY of your files first before running it! You can install jhead with homebrew using:
brew install jhead
Or, if you don't have homebrew, download here for OS X.
That will get you the date into the filename as you wish. The sequence number is a little more difficult. Try what I am suggesting above and, if you are happy with that, we can work on the sequence number. Basically, you would run jhead again to set the file modification times to match the times the photos were shot; then the files can be listed in date order and we can put your sequence number on the front.
So, to get the file's date set on the computer to match the time it was taken, do:
jhead -ft *.jpg
Now all the files will be dated on your computer to match the time the photos were taken. Then we need to whizz through them in a loop with our script adding in the sequence number:
#!/bin/bash
seq=1
# List files in order, oldest first
for f in $(ls -rt *jpg)
do
# Work out new name; braces stop the shell reading "$seq_" as one variable name
new="${seq}_$f"
echo Rename $f as $new
# Remove "#" from start of following command if things look good so the renaming is actually done
# mv "$f" "$new"
((seq++))
done
You would save that in your HOME directory as renamer, then you would go into Terminal and make the script executable like this:
chmod +x renamer
Then you need to go to where your photos are, say Desktop/Photos
cd "$HOME/Desktop/Photos"
and run the script
$HOME/renamer
That should do it.
By the way, I wonder how wise it is to use a simple sequence number at the start of your filenames, because that will not make them come up in order when you look at them in Finder.
Think of file 20, i.e. 20_2015_02_03_11_45_52.jpg. Files starting with 100-199 will be listed BEFORE file 20, and files 1000-1999 will also be listed before file 20, because their leading 1s sort before file 20's leading 2. So you may want to name your files:
0001_...
0002_...
0003_...
...
0019_...
0020_...
then they will come up in sequential order in Finder. If you want that, use this script instead:
#!/bin/bash
seq=1
for f in $(ls -rt *jpg)
do
# Generate new name with zero-padded sequence number
new="$(printf "%04d" $seq)_$f"
echo Rename $f as $new
# Remove "#" from start of following command if things look good so the renaming is actually done
# mv "$f" "$new"
((seq++))
done

How can I drop metadata fields (e.g., PageLabel fields) from PDFs?

I have used pdftk to change the "Info" metadata associated with a PDF. I currently have several PDFs with extraneous page labels and I cannot figure out how to drop them. This is what I am currently doing:
$ pdftk example_orig.pdf dump_data output page_labels.orig
$ grep -v PageLabel page_labels.orig > page_labels.new
$ pdftk example_orig.pdf update_info page_labels.new output example_new.pdf
This does not remove the PageLabel* metadata which can be verified with:
$ pdftk example_orig.pdf dump_data | grep PageLabel
How can I programmatically remove this metadata from the PDF? It would be nice to do this with pdftk, but if there is another tool or way to do this on GNU/Linux, that would also work for me.
I need this because I am using LaTeX Beamer to generate presentations with the \setbeameroption{show notes on second screen} option, which generates a double-width PDF for showing notes on a second screen. Unfortunately, there seems to be a bug in pgfpages which results in incorrect and extraneous PageLabels in these files (example). If I generate a slides-only PDF, it generates the correct PageLabels (example). Since I can generate a correct set of PageLabels, one solution would be to replace the page labels in the first example with those in the second. That said, since there are extra page labels in the first example, I would need to remove them first.
Using a text editor to remove PDF metadata
If it is the first time you edit a PDF, make a backup copy first.
Open your PDF with a text editor that can handle binary blobs. vim -b will be fine.
Locate the /Info dictionary. Completely overwrite all the entries you do not want any more with blanks (an entry consists of a /Key name plus the (some value) string following it).
Be careful not to use more spaces than there were characters initially. Otherwise your xref table (the ToC of PDF objects) will be invalidated, and some viewers will report the PDF as corrupted.
For good measure, locate the /XML string in your PDF. It should show you where your XMP/XML metadata section is (not all PDFs have one). Locate all the key values (not the <something keys>!) in there that you want to remove. Again, just overwrite them with blanks and be careful not to change the total length (neither longer nor shorter).
In case your PDF does not make the /Info dictionary accessible, transform it with the help of qpdf.
Use this command:
qpdf --qdf --object-streams=disable orig.pdf qdf---orig.pdf
Apply the procedure outlined above. (The qdf---orig.pdf should now be much better suited for editing in a text editor.)
Re-compact your edited file:
qpdf qdf---orig.pdf edited---orig.pdf
Done! Enjoy your edited---orig.pdf. Check if it has all the data removed:
pdfinfo -meta edited---orig.pdf
Update
After looking at the sample PDF files provided, it became clear to me that the /PageLabels key is not part of the /Info dictionary (PDF's Document Information Dictionary), but of the /Root object.
That's probably one reason why pdftk was unable to update it with the method the OP described.
The other reason is the following: the PDF which the OP quoted as containing the correct page labels does in fact contain incorrect ones!
Logical Page No. | Page Label
-----------------+------------
1 | 1
2 | 2
3 | 2
4 | 2
5 | 2
6 | 4
The other PDF (which supposedly contains extraneous page labels) is incorrect in a different way:
Logical Page No. | Page Label
-----------------+------------
1 | 1
2 | 1
3 | 2
4 | 2
5 | 2
6 | 4
My original advice about how to manually edit the classical metadata of a PDF remains valid. For the case of editing page labels you can apply the same method with a slight variation.
In the case of the OP's example files, a complication comes into play: the /Root object is not directly accessible, because it is hidden inside a compressed object stream (PDF object type /ObjStm). That means one has to decompress it with the help of qpdf first:
Use qpdf:
qpdf --qdf --object-streams=disable example_presentation-NOTES.pdf q-notes.pdf
Open the resulting file in binary mode with vim:
vim -b q-notes.pdf
Locate the 1 0 obj marker for the beginning of the /Root object, containing a dictionary named /PageLabels.
(a) To disable page labels altogether, just replace the /PageLabels string with /Pagelabels, using a lowercase 'l' (PDF is case-sensitive and will no longer recognize the keyword; you could restore the original version later should you need it).
(b) To edit the page labels, first see how the consecutive labels for pages 1--6 are being referred to as
<feff0031>
[....]
<feff0032>
[....]
<feff0032>
[....]
<feff0032>
[....]
<feff0033>
[....]
<feff0034>
(These values are hex-encoded UTF-16 strings with a byte-order mark: feff is the BOM and 0031 through 0034 are the digit codes, so the sequence reads 1, 2, 2, 2, 3, 4.)
Edit these values to read:
<feff0031>
[....]
<feff0032>
[....]
<feff0033>
[....]
<feff0034>
[....]
<feff0035>
[....]
<feff0036>
Save the file and run qpdf again in order to re-compress the PDF:
qpdf q-notes.pdf notes.pdf
These now hopefully are the page labels the OP is looking for....
Since the OP seems to be familiar with editing pdftk's dump_data output, he can possibly edit that output and use update_info to apply the fix to the PDF without needing to resort to qpdf and vim.
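That route would look roughly like this (a sketch, untested; it assumes update_info honors edits to the PageLabel* lines that dump_data emitted):
pdftk example_presentation-NOTES.pdf dump_data output data.txt
# ...hand-edit the PageLabel* entries in data.txt...
pdftk example_presentation-NOTES.pdf update_info data.txt output notes-fixed.pdf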
Update 2:
User @Iserni posted a very good, short, working answer that limits itself to one command, pdftk (which the OP is already familiar with), plus sed -- no need to open the PDF in a text editor, and no additional utility like the qpdf my answer introduced.
Unfortunately @Iserni deleted it again after a comment of mine. I think his answer deserves to get the bounty, and I call on you to vote to "undelete" his answer!
So temporarily, I'll include a copy of @Iserni's answer here, until his is undeleted again:
Not sure if I correctly understood the problem. You can try with a butcher's solution: brute force replace the /PageLabels block with a different one which will not be recognized.
# Get a readable/writable PDF
pdftk file1.pdf output temp.pdf uncompress
# Mangle the PDF. Keep same length
sed -e 's|^/PageLabels|/BageLapels|g' < temp.pdf > mangled.pdf
# Recompress
pdftk mangled.pdf output final.pdf compress
# Remove temp file
rm -f temp.pdf mangled.pdf

How to create sequence files from tsv file for text classification

I have a tsv file which is separated into class, id and text, e.g.
positive 2342 This is very good.
negative 4343 I hate it.
and I'm trying to feed it to Mahout's nbayes to classify the text part as either pos or neg.
My first attempt was to use the mahout seqdirectory command, with every line as a separate file in its class directory. This works well with a small amount of data but eventually fails at around 30 gigabytes with an OutOfMemoryException. Increasing the heap size fails with "GC overhead limit exceeded", probably because of the large number of separate files.
My second attempt was to load the data into a Hive table and convert it to a sequence file, as described here [0]. This seems to work fine at first, but after creating the vector file and splitting up the data set, the trainnb step fails with an ArrayIndexOutOfBoundsException.
[0] http://files.meetup.com/6195792/Working%20With%20Mahout.pdf
Right now I'm out of ideas about what to look for. Any ideas how I can convert the tsv file or Hive table to a sequence file as generated by the seqdirectory command on a directory?
Going to answer this myself in case someone else needs a solution to the same or a similar problem:
I found this code snippet on GitHub and modified it to my needs. Additionally, I had to trim the value string to get proper results.
This may be a simpler implementation for those searching for this answer in the future. This can be done completely from the command line (I tested it in EMR):
hadoop jar \
/home/hadoop/contrib/streaming/hadoop-streaming.jar \
-D mapred.reduce.tasks=0 \
-inputformat TextInputFormat \
-input {input_directory}/* \
-mapper '/bin/cat' \
-outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
-output {output_directory}
/home/hadoop/contrib/streaming/hadoop-streaming.jar is the location of hadoop-streaming.jar on Amazon EMR (AMI 3.4.0); it may be in a different location depending on your configuration.
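If your setup puts it elsewhere, one way to track it down (this may take a while on a large filesystem):
find / -name 'hadoop-streaming*.jar' 2>/dev/null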