Grab X Amount Of Files and Zip Using DotNetZip Library - vb.net

So I've been using the DotNetZip library for some time now, and it works pretty well, up until yesterday when I maxed out the zip file size. On any given day, I need to zip PDFs and transfer them to an SFTP site that only accepts zip files. The number of PDFs ranges from a couple hundred, to a couple thousand, to well over 10K. I had about 24K PDFs yesterday when the DotNetZip process broke. There is a way to split zip files using the DotNetZip library, but for some reason the system being used on the SFTP server can't handle zip files that are split.
What's the best way to grab, say, 5K (or any other set number of files), zip them, delete those files, then grab another 5K, zip, delete, and repeat the process until all files are zipped?
Here is my current code for the zip process:
Dim PathToPDFs As String = "C:\Temp" 'PDF location
Using Zip As ZipFile = New ZipFile()
    Zip.AddSelectedFiles("(name = *.pdf)", PathToPDFs, "", True) 'True = recurse into subfolders
    Zip.CompressionMethod = CompressionMethod.Deflate
    Zip.CompressionLevel = Ionic.Zlib.CompressionLevel.BestCompression
    Zip.Save("C:\Temp\Zipfile.zip")
End Using

Try enumerating through all the files first, getting a list of FileInfo, then going through them in a loop and creating a ZIP file every 5K (or whatever your batch size is). You don't need to delete anything; just keep a batch id in memory, so your zip file names derive from it (i.e. pdf_batch_01.zip).
So when your batch size is reached, you Save and create a new ZipFile, and keep adding files in the loop. Don't forget to also "commit" at the last file (the last batch will most likely be incomplete). To sum up, you "commit" when the batch size is reached OR when processing the last entry (a variation of i = FileCount - 1).
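A minimal sketch of that approach, assuming the same C:\Temp folder from the question and a made-up batch size of 5,000 (the batch id drives names like pdf_batch_01.zip):

Imports System.IO
Imports Ionic.Zip

Module BatchZip
    Sub Main()
        Dim PathToPDFs As String = "C:\Temp"    'PDF location, as in the question
        Dim BatchSize As Integer = 5000         'assumed batch size
        Dim Files As String() = Directory.GetFiles(PathToPDFs, "*.pdf")
        Dim BatchId As Integer = 0

        For i As Integer = 0 To Files.Length - 1 Step BatchSize
            BatchId += 1
            Using Zip As New ZipFile()
                Zip.CompressionMethod = CompressionMethod.Deflate
                Zip.CompressionLevel = Ionic.Zlib.CompressionLevel.BestCompression
                'Add up to BatchSize files to this batch.
                For j As Integer = i To Math.Min(i + BatchSize - 1, Files.Length - 1)
                    Zip.AddFile(Files(j), "")   '"" puts entries at the archive root
                Next
                '"Commit" the batch; the final batch will most likely be smaller.
                Zip.Save(Path.Combine(PathToPDFs, String.Format("pdf_batch_{0:00}.zip", BatchId)))
            End Using
        Next
    End Sub
End Module

Because the file list is snapshotted into an array before the loop, the zip files written into the same folder don't feed back into later batches.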

Related

How to add disk in zip using DotNetZip

I use DotNetZip for creating zips. It has many options, but I couldn't find whether it is possible to store the disk the file is located on in the archive, e.g. like the Absolute mode in 7-Zip. As far as I can see I can only do this:
zip.AddFile(cFileFull, cPath);
Where cFileFull is e.g. "c:\temp\SomeFile.txt" and cPath = "c:\temp", opening the zip file shows
temp
while I would like to see
C
and then, when I click on C
temp
This allows storing the same path/file found on different drives. Is this possible?
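There is no built-in absolute mode in DotNetZip as far as I know, but one way to approximate it is to build the directoryPathInArchive argument yourself so it starts with the drive letter. A sketch (reusing cFileFull from above; the path manipulation is just one way to do it):

using System.IO;
using Ionic.Zip;

string cFileFull = @"c:\temp\SomeFile.txt";
// "c:\temp" -> "C\temp": drive letter becomes the top-level archive folder.
string drive = Path.GetPathRoot(cFileFull).Substring(0, 1).ToUpper();
string dirInArchive = Path.Combine(drive,
    Path.GetDirectoryName(cFileFull).Substring(3));

using (var zip = new ZipFile())
{
    zip.AddFile(cFileFull, dirInArchive);   // stored as C\temp\SomeFile.txt
    zip.Save(@"c:\temp\Archive.zip");
}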

Split CSV file into records and save in CSV file format - Apache NiFi

What I want to do is the following...
I want to divide the input file into records, convert each record into a file, and leave all the files in a directory.
My .csv file has the following structure:
ERP,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,
ERP,FRANK,DIETSCH,5064 E METAIRIE AVE.,BRANDSVILLA,MO,65687,252-5592,1176 E THAYER ST.,COLUMBIA,MO,65215,557,291-9571,217-38-5525,129-10-0407,1/13/35,M,
As you can see, it doesn't have a header row.
Here is my flow.
My problem is that when the Split processor divides my CSV into flowfiles of 400 lines each, they aren't saved in my output directory.
It's my first time using NiFi, sorry.
Make sure your RecordReader controller service is configured correctly (delimiter, etc.) to read the incoming flowfile, and set the Records Per Split value to 1.
You need to use an UpdateAttribute processor before the PutFile processor to change the filename to a unique value (like a UUID), unless you have configured the PutFile processor's Conflict Resolution Strategy as Ignore.
The reason for changing the filename is that the SplitRecord processor gives the same filename to all of the split flowfiles.
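For example, the UpdateAttribute processor just needs one added property; the value uses NiFi's expression language to generate a fresh UUID for each flowfile:
filename = ${UUID()}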
I tried your case and the flow worked as expected. Use this template for your reference, upload it to your NiFi instance, and make changes as per your requirements.

PDF File Merge Based on Filename

I have large batches of pdf files that must be merged.
Folder1 filename explanation: in invoice12-105767-1510781492.pdf, 105767 is the component that will match a PDF filename in Folder2.
"invoice12-" is the first section of the filename. This can sometimes be "invoice11-" or "invoice6-", so merging based on character length became challenging. The "invoicexx-" prefix is based on where in the system the file came from.
"105767" is the second part of the filename. This is the key component for matching and merging; it is the filename in Folder2 it belongs with.
"-1510781492.pdf" is the third part of the filename, a system-generated unique ID, which can contain more or fewer characters.
Folder1:
invoice12-105767-1510781492.pdf
invoice12-105768-1510781484.pdf
invoice12-105769-1510781469.pdf
Folder2:
105767.pdf
105768.pdf
105769.pdf
OutputFolder:
For example: I don't want to merge all the files in both folders into one huge file. I need them merged based on the Folder2 filename (105767.pdf + invoice12-105767-1510781492.pdf), and specifically in that order.
The final output should be three pdf files merged in order as follows:
105767.pdf + invoice12-105767-1510781492.pdf to make 1 file named 105767.pdf
105768.pdf + invoice12-105768-1510781484.pdf to make 1 file named 105768.pdf
105769.pdf + invoice12-105769-1510781469.pdf to make 1 file named 105769.pdf
I would appreciate any assistance with a way to automate this process. I merge over 800 files per day. This small automation would shave hours off my day and spare my wrist from carpal tunnel.
I primarily use Mac OS 10.13.1. I have looked around in Mac's Automator program and cannot figure out how to get it to do what I need. (I did figure out a great way to split files into single pages.)
I downloaded pdftk server (since that is Mac compatible) but cannot figure out whether this type of match and merge is possible with it.
I have Adobe Acrobat DC Professional and it does not seem to have this match and merge function.
I am even open to other paid programs. I just need a fairly future-proof way of getting this mundane task done through automation on my Mac.
You can take a look at the APDFL library examples, which are provided with sample code. The library is supported on Mac, but is not free.
https://dev.datalogics.com/adobe-pdf-library/sample-program-descriptions/c1samples/#mergedocuments
Here is a snippet of the code you would need to use:
APDFLDoc doc1 ( csInputFileName1.c_str(), true);
APDFLDoc doc2 ( csInputFileName2.c_str(), true);

// Insert doc2's pages into doc1.
// PDLastPage adds the pages just before the last page of the target;
// specifying PDBeforeFirstPage instead would insert doc2's pages at the head of doc1.
PDDocInsertPages ( doc1.getPDDoc(),
                   PDLastPage,
                   doc2.getPDDoc(),
                   0,
                   PDAllPages,
                   PDInsertAll,
                   NULL, NULL, NULL, NULL);

doc1.saveDoc ( csOutputFileName.c_str(), PDSaveFull | PDSaveLinearized);
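Alternatively, since pdftk server was mentioned in the question: pdftk itself only concatenates the files it is given, but the filename matching can be scripted around it. A sketch in Python (the folder names and the glob pattern are assumptions based on the examples above; pdftk's cat syntax is standard):

import glob
import os
import subprocess

folder1, folder2, outdir = "Folder1", "Folder2", "OutputFolder"  # assumed paths

for path2 in sorted(glob.glob(os.path.join(folder2, "*.pdf"))):
    key = os.path.splitext(os.path.basename(path2))[0]           # e.g. "105767"
    # Match invoiceXX-<key>-<uniqueid>.pdf regardless of prefix/suffix lengths.
    matches = glob.glob(os.path.join(folder1, "invoice*-%s-*.pdf" % key))
    if len(matches) == 1:
        # Folder2 file first, then the invoice file, in that specific order.
        subprocess.run(["pdftk", path2, matches[0], "cat", "output",
                        os.path.join(outdir, key + ".pdf")], check=True)
    else:
        print("Skipping %s: found %d matches" % (key, len(matches)))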

Reading MANY files at once in Fortran

I have 500,000 files which I need to read in Fortran and each file has ~14,000 entries in it (each entry is only about 100 characters long). I need to process each line for each file at a time. For example, I need to process line 1 for all 500,000 files before moving on to line 2 from the files and so forth.
I cannot open them all at once (I tried making an array of file pointers and opening them all) because there will be too many files open at once. Instead, I would like to do something as follows:
do iline = 1,Nlines
   do ifile = 1,Nfiles
      ! open the file
      ! read a line
      ! close the file
   enddo
enddo
In hopes that this would allow me to read one line at a time (from each file) and then move on to the next line (in each file). Unfortunately, each time I open the file it starts me off at line 1 again. Is there any way to open/close a file and then open it again where you left off previously?
Thanks
Unfortunately, this is not possible that way in standard Fortran. Even if you specify
position="ASIS"
the actual position is unspecified for a unit that was not already connected, and will in fact be the beginning of the file on most systems.
That means you would have to call
read(u,*)
enough times to get to the right place in the file.
You could also use stream access. The file would again be opened at the beginning, but you can use
read(u,*,pos=n) number
where n is the position saved before you last closed the file. You can get that position from
inquire(unit=u, pos=n)
You would open the file with access="STREAM".
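A minimal sketch of the stream approach (formatted stream I/O is Fortran 2003, newunit is Fortran 2008; the file names and counts here are made up):

program resume_reads
  implicit none
  integer, parameter :: Nfiles = 3, Nlines = 5
  integer :: u, ifile, iline
  integer :: pos(Nfiles)
  character(len=100) :: line
  character(len=20) :: fname

  pos = 1                                        ! formatted stream positions start at 1
  do iline = 1, Nlines
     do ifile = 1, Nfiles
        write(fname, '(A,I0,A)') 'file', ifile, '.txt'
        open(newunit=u, file=fname, access='stream', form='formatted', status='old')
        read(u, '(A)', pos=pos(ifile)) line      ! resume where we left off
        inquire(unit=u, pos=pos(ifile))          ! remember the position for the next round
        close(u)
        ! ... process line here ...
     end do
  end do
end program resume_reads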
Also, 500,000 open files is indeed too many. There are ways to inquire about the system limits and to control them, but your compiler may also have its own limits: http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
Another solution: couldn't you store the contents of the files in memory? A couple of gigabytes is OK today, but it may not be enough for you.
You can try using fseek and ftell in something like the following.
! initialize an array of 0's
do iline = 1,Nlines
   do ifile = 1,Nfiles
      ! open the file
      ! fseek(fd, array(ifile))
      ! read a line
      ! array(ifile) = ftell(fd)
      ! close the file
   enddo
enddo
The (untested) idea is to store the offset of each file in an array and position the cursor at that place upon opening the file. Then, once a line is read, the ftell retrieves the current position which is saved to memory for next round. If all entries have the same length, you can spare the array and just store one value.
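A sketch of that idea using gfortran's FSEEK/FTELL (these are GNU extensions, not standard Fortran, and their exact form varies between compilers; names and counts are made up):

program fseek_resume
  implicit none
  integer, parameter :: Nfiles = 3, Nlines = 5
  integer :: u, ifile, iline, ierr
  integer :: offset(Nfiles)                      ! byte offset per file
  character(len=100) :: line
  character(len=20) :: fname

  offset = 0                                     ! start at the beginning of each file
  do iline = 1, Nlines
     do ifile = 1, Nfiles
        write(fname, '(A,I0,A)') 'file', ifile, '.txt'
        open(newunit=u, file=fname, status='old')
        call fseek(u, offset(ifile), 0, ierr)    ! whence=0: offset from file start
        read(u, '(A)') line
        call ftell(u, offset(ifile))             ! save the position for the next pass
        close(u)
     end do
  end do
end program fseek_resume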
If the files have fixed, i.e., constant, record lengths, you could use direct access. Then you could "directly" read a specific record. A big "if" however.
The overhead of all the file opening/closing will be a big performance bottleneck.
You should try to read as much as you can on each open, given whatever memory you have:
pseudocode:
loop until done:
    loop over all files:
        open
        fseek                       ! as in damien's answer
        read N lines into array     ! N = 100, e.g.
        save ftell value for file
        close
    end file loop
    loop over N output files:
        open
        write array data
        close
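Sketched with standard stream positions instead of fseek/ftell (sizes and names are made up; the point is that each open now pays for N lines instead of one):

program batched_reads
  implicit none
  integer, parameter :: Nfiles = 1000, Nlines = 14000, N = 100
  integer :: u, ifile, iline, j, nread
  integer :: pos(Nfiles)
  character(len=100) :: buffer(N)
  character(len=20) :: fname

  pos = 1
  do iline = 1, Nlines, N
     nread = min(N, Nlines - iline + 1)          ! the last pass may be short
     do ifile = 1, Nfiles
        write(fname, '(A,I0,A)') 'file', ifile, '.txt'
        open(newunit=u, file=fname, access='stream', form='formatted', status='old')
        read(u, '(A)', pos=pos(ifile)) buffer(1) ! reposition on the first read only
        do j = 2, nread
           read(u, '(A)') buffer(j)
        end do
        inquire(unit=u, pos=pos(ifile))
        close(u)
        ! ... write buffer(1:nread) to this file's output here ...
     end do
  end do
end program batched_reads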

Can you split a PDF 'package' into separate files with CF8 or CF9?

The cfpdf tag has lots of options but I can't seem to find one for splitting apart a PDF package into separate files which can be saved to the file system.
Is this possible?
There's not a direct command, but you can achieve what you want in very few lines of code by using action="merge" with the pages attribute. So if you wanted to take a 20-page PDF and create 20 separate files, you could use getInfo to get the number of pages in the input document, then loop from 1 to that number, and in each iteration do a merge from your input document to a new output document with pages="#currentPage#" (or whatever your loop counter is).
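A sketch of that loop in CFML (the input and output file names are placeholders; cfpdf action="getinfo" exposes the page count as TotalPages):

<cfpdf action="getinfo" source="input.pdf" name="pdfInfo">
<cfloop from="1" to="#pdfInfo.TotalPages#" index="currentPage">
    <!--- Merge a single page of the input into its own output file. --->
    <cfpdf action="merge" destination="page_#currentPage#.pdf" overwrite="yes">
        <cfpdfparam source="input.pdf" pages="#currentPage#">
    </cfpdf>
</cfloop>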