Sejda merging PDFs from CSV filelist names - pdf

I recently installed sejda-console to merge PDF files from the command line.
The names of the input pdf files are in a CSV file named filelist-inputs.csv like this:
./Temp/source/046032.pdf,./Temp/source/048155.pdf
./Temp/source/049278.pdf,./Temp/source/050818.pdf,./Temp/source/052962.pdf
./Temp/source/052962.pdf,./Temp/source/054117.pdf
I need one output PDF file for the first line of the CSV file list, another for the second line, another for the third line, and so on.
I tried a command line like this:
~$ sejda-console merge -l filelist-inputs.csv -o ./Temp/target/merged[FILENUMBER####].pdf
But it only creates a single file, literally named merged[FILENUMBER####].pdf, when I want 3 files:
merged0001.pdf
merged0002.pdf
merged0003.pdf
I've simplified the problem: I actually need to merge more than 3500 PDF files into 700 output files.

Sejda takes all the values in the CSV and generates a single merged PDF. There isn't any option or setting in Sejda to achieve what you asked; you will need some scripting to loop through the CSV lines, create a CSV per line, and feed each one to Sejda.
The output file name merged[FILENUMBER####].pdf is used literally because the PDF merge task generates one output file and expects an explicit output file name. Prefixes like [CURRENTPAGE] or [FILENUMBER] are valid when used as the -p argument in tasks that generate multiple output PDF files (split tasks, etc.).
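The loop described above could be sketched roughly as follows in Python. This is only a minimal sketch: the function names, the per-line file names (line0001.csv, ...) and the work directory are assumptions made for illustration, and only the -l and -o options already shown in the question are used.

```python
import csv
import subprocess
from pathlib import Path


def build_merge_jobs(filelist, work_dir, target_dir):
    """Write one single-line CSV per row of `filelist` and return one command per row."""
    work_dir, target_dir = Path(work_dir), Path(target_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    target_dir.mkdir(parents=True, exist_ok=True)
    with open(filelist, newline="") as handle:
        rows = [row for row in csv.reader(handle) if row]  # skip blank lines
    jobs = []
    for number, row in enumerate(rows, start=1):
        # Hypothetical naming: one-line CSV holding only this output's inputs.
        line_csv = work_dir / f"line{number:04d}.csv"
        line_csv.write_text(",".join(row) + "\n")
        output_pdf = target_dir / f"merged{number:04d}.pdf"
        # Same -l and -o options as in the question's command.
        jobs.append(["sejda-console", "merge", "-l", str(line_csv), "-o", str(output_pdf)])
    return jobs


def run_merge_jobs(jobs):
    """Run each merge in turn; raises on the first command that fails."""
    for command in jobs:
        subprocess.run(command, check=True)
```

Calling run_merge_jobs(build_merge_jobs("filelist-inputs.csv", "./Temp/lists", "./Temp/target")) would then produce merged0001.pdf, merged0002.pdf, and so on, assuming sejda-console is on the PATH.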

Related

Mass extract part of a text file using Windows batch

I have thousands of txt files that are actually in JSON format.
Each file has the same string, with different values, namely:
"Name":"xxx","Email":"yyy#zzz.com"
I want to extract the values of these two strings from all the txt files that I put in the same folder.
I've found these lines of code:
Extract part of a text file using Windows batch
but it only applies to one txt file, whereas I need it to process all the files in one folder.
You can use the FORFILES command to loop through each file.
Syntax
FORFILES [/p Path] [/m SrchMask] [/s] [/c Command] [/d [+ | -] {date | dd}]
From the following webpage:
https://ss64.com/nt/forfiles.html
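FORFILES only supplies the per-file loop; the extraction itself still has to be done for each file. As an illustration of the whole task, here is a sketch in Python rather than Windows batch (a swapped-in technique, not the FORFILES approach itself). It assumes every file contains the "Name":"..." and "Email":"..." pairs shown in the question; the function name and the regular expressions are hypothetical.

```python
import re
from pathlib import Path

# Assumed layout: the text contains "Name":"..." and "Email":"..." pairs.
NAME_RE = re.compile(r'"Name"\s*:\s*"([^"]*)"')
EMAIL_RE = re.compile(r'"Email"\s*:\s*"([^"]*)"')


def extract_contacts(folder):
    """Return (file name, name, email) for each .txt file in `folder` that has both fields."""
    results = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        name = NAME_RE.search(text)
        email = EMAIL_RE.search(text)
        if name and email:
            # Only the first match of each field is kept.
            results.append((path.name, name.group(1), email.group(1)))
    return results
```

Files that lack either field are skipped silently, which may or may not be what you want for thousands of inputs.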

Importing a *random* csv file from a folder into pandas

I have a folder with several csv files, with file names between 100 and 400 (Eg. 142.csv, 278.csv etc). Not all the numbers between 100-400 are associated with a file, for example there is no 143.csv. I want to write a loop that imports 5 random files into separate dataframes in pandas instead of manually searching and typing out the file names over and over. Any ideas to get me started with this?
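You can use glob to list all the CSV files in the directory, then pick five of them at random.
import glob
import numpy as np
import pandas as pd
files = glob.glob('*.csv')
# replace=False so the same file is not picked twice
random_files = np.random.choice(files, 5, replace=False)
dataframes = []
for fp in random_files:
    dataframes.append(pd.read_csv(fp))
This chooses 5 random files from the directory and reads each one into its own dataframe.
Hope this answers your question.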

While read loop and command with file output

I have run into an issue making a while loop (yes, I am new at this..).
I have a file $lines_to_find.txt, containing a list of names which I would like to find in another (large) file $file_to_search.fasta.
When the lines in lines_to_find.txt are found in file_to_search.fasta, I would like the lines with search hits to be printed to a new file: output_file.fasta.
So I have a command similar to grep that takes the sequences (for that is what's in the large file) and prints them to a new file:
obigrep -D SEARCHWORD INPUTFILE.fasta > OUTPUTFILE.fasta
Now I would like the searchword to be replaced with the file lines_to_find.txt, and each line should be read and matched to the file_to_search.fasta. Output should preferably be one file, containing the sequence-hits from all lines in file lines_to_find.txt.
I tried this:
while read line
do
obigrep -D '$line' file_to_search.fasta >> outputfile.fasta
done < lines_to_find.txt
But my output file just ends up empty.
What am I doing wrong?
Am I just building the while read loop wrong?
Are there other ways to do it?
I'm open to all suggestions, and as I am new, please point out obvious beginner flaws.

How to assign taxonomy to trnL sequences using BLAST

I am using the trnL chloroplast gene to identify plants from herbivore dung, and am currently trying to assign taxonomy to trnL sequences from my Illumina output. Here is the QIIME script and options I would like to run:
assign_taxonomy.py -i rep_set_numbered.fa -r sequence.fasta -t id_to_taxonomy.txt -e 0.01 -m blast
I have the input file from our data pipeline, and the reference file from NCBI GenBank (205,703 sequences). However, I do not have a tab-delimited taxonomy text file. Normally I would generate one from Excel, but because the FASTA file is so large (over 500 MB), it cannot be fully viewed in Excel, and therefore cannot be reliably edited.
My question is, is there a command line method for generating my own tab-delimited taxonomy file from my reference FASTA file, and if so, how would I do that? If not, what are my other options for handling this required option on the "assign_taxonomy.py" QIIME script?

Programmatically splitting PDF pages into their own PDFs in UNIX

I am trying to write a program that takes a .pdf file as input and separates each page into its own .pdf file from the UNIX command line. I have tried SplitPdf but for some reason I keep getting errors.
update: I have already tried pdftk but it has poor performance and a limitation on the size of the pdf file.
Use pdftk.
The burst command is what you are after.
Man page section: http://www.pdflabs.com/docs/pdftk-man-page/#dest-op-burst
burst
Splits a single, input PDF document into individual pages. Also creates a report named doc_data.txt which is the same as the output from dump_data. If the output section is omitted, then PDF pages are named: pg_%04d.pdf, e.g.: pg_0001.pdf, pg_0002.pdf, etc. To name these pages yourself, supply a printf-styled format string in the output section. For example, if you want pages named: page_01.pdf, page_02.pdf, etc., pass output page_%02d.pdf to pdftk. Encryption can be applied to the output by appending output options such as owner_pw, e.g.: pdftk in.pdf burst owner_pw foopass
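To see what names a given printf-style pattern would produce before running pdftk, a small Python sketch can help. The helper names below are hypothetical; the command built here uses only the burst operation and the output section quoted from the man page above, and Python's % operator is used as a stand-in for printf formatting.

```python
def burst_page_names(pattern, page_count):
    """Expand a printf-style pattern into one file name per page, starting at page 1."""
    return [pattern % page for page in range(1, page_count + 1)]


def burst_command(input_pdf, pattern=None):
    """Build the pdftk burst command; the output section is added only if a pattern is given."""
    command = ["pdftk", input_pdf, "burst"]
    if pattern is not None:
        command += ["output", pattern]
    return command


if __name__ == "__main__":
    print(burst_page_names("page_%02d.pdf", 3))
    print(" ".join(burst_command("in.pdf", "page_%02d.pdf")))
```

The resulting list can be passed to subprocess.run to execute the split, assuming pdftk is installed.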