PDFBox does not clean up tmp files after convertToImage method - pdfbox

I use PDFBox functions such as convertToImage and everything works fine, but PDFBox does not clean up the temporary files after the conversion. In my system's temporary directory "/tmp" there are many files such as +~JF132216249314633400.tmp; they are deleted only after restarting my application, but while the application keeps running the temporary files are never removed.
PDFBox version - 1.8.15
When I use this
page.convertToImage(BufferedImage.TYPE_INT_RGB, 300)
the PDFBox library creates tmp files such as "+~JF132216249314633400.tmp".
My method:
def splitPdfToImages(file: File): List[File] = {
  val document = PDDocument.load(file)
  val pages = (for (i <- 0 until document.getNumberOfPages)
    yield document.getDocumentCatalog.getAllPages.get(i).asInstanceOf[PDPage]).toList
  val imgFiles = pages.zipWithIndex.map { case (page, i) =>
    val baos = IOUtils.createBAOS
    ImageIO.write(page.convertToImage(BufferedImage.TYPE_INT_RGB, 300), "jpg", baos)
    val bais = IOUtils.createBAIS(baos.toByteArray)
    try {
      val img = Image.fromStream(bais)
      implicit val writer = JpegWriter().withCompression(100)
      val tmpFile = File.createTempFile(s"""${file.getName.split("\\.").head}_$i""", file.getName.split("\\.").last)
      img.output(tmpFile)
    } finally {
      baos.close()
      bais.close()
    }
  }
  document.close()
  imgFiles
}
Please help me to solve this issue.
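One possible workaround, sketched here only as an assumption (it is not a confirmed fix): if the +~JF*.tmp files live in java.io.tmpdir and are no longer needed once a page has been rendered, they could be swept after each call to splitPdfToImages. The helper name cleanPdfBoxTempFiles is hypothetical.
import java.io.File

// Hypothetical cleanup helper (assumption, not part of PDFBox): deletes the
// "+~JF*.tmp" files left in java.io.tmpdir. Only safe if no other conversion
// is producing such files at the same time.
def cleanPdfBoxTempFiles(): Unit = {
  val tmpDir = new File(System.getProperty("java.io.tmpdir"))
  Option(tmpDir.listFiles()).getOrElse(Array.empty[File])
    .filter(f => f.isFile && f.getName.startsWith("+~JF") && f.getName.endsWith(".tmp"))
    .foreach(_.delete())
}

// Usage: call it right after splitPdfToImages(file) returns.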

Related

Is it possible to load constraints from a file (csv, txt) into Deequ Checks?

Is it possible to save suggested constraints to a file and then load them as checks? I was able to do it without saving them with the following code:
val allConstraints = suggestionResult.constraintSuggestions.flatMap {
  case (_, suggestions) =>
    suggestions.map {
      _.constraint
    }
}.toSeq
val generatedCheck = Check(CheckLevel.Error, "generated constraints", allConstraints)
val verificationResult: VerificationResult = {
  VerificationSuite()
    .onData(tested_df)
    .addCheck(generatedCheck)
    .run()
}
However, I want to save them to a file and apply them later when needed. Is there any way to do this?
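One possible direction, sketched under the assumption that each ConstraintSuggestion exposes a codeForConstraint string (not shown in the question): write the generated constraint code snippets to a plain text file so they can be reviewed and pasted into a Check later.
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Sketch only: persist the suggested constraint code snippets as text.
// codeForConstraint is assumed to be available on each suggestion.
val constraintCode: Seq[String] = suggestionResult.constraintSuggestions.flatMap {
  case (_, suggestions) => suggestions.map(_.codeForConstraint)
}.toSeq

Files.write(Paths.get("suggested_constraints.txt"), constraintCode.asJava)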

Why does loading 4000 images into Redis using spark-submit take longer (9 minutes) than loading the same images into HBase (2.5 minutes)?

Loading images into Redis should be much faster than doing the same thing with HBase, since Redis works in RAM while HBase stores its data on HDFS. I was surprised that loading 4000 images into Redis took 9 minutes to finish, while the same process using HBase took only 2.5 minutes. Is there an explanation for this? Any suggestions to improve my code? Here is my code:
// The code for loading the images into HBase (adapted from NIST)
val conf = new SparkConf().setAppName("Fingerprint.LoadData")
val sc = new SparkContext(conf)
Image.dropHBaseTable()
Image.createHBaseTable()
val checksum_path = args(0)
println("Reading paths from: %s".format(checksum_path.toString))
val imagepaths = loadImageList(checksum_path)
println("Got %s images".format(imagepaths.length))
imagepaths.foreach(println)
println("Reading files into RDD")
val images = sc.parallelize(imagepaths).map(paths => Image.fromFiles(paths._1, paths._2))
println(s"Saving ${images.count} images to HBase")
Image.toHBase(images)
println("Done")
}
def toHBase(rdd: RDD[T]): Unit = {
  val cfg = HBaseConfiguration.create()
  cfg.set(TableOutputFormat.OUTPUT_TABLE, tableName)
  val job = Job.getInstance(cfg)
  job.setOutputFormatClass(classOf[TableOutputFormat[String]])
  rdd.map(Put).saveAsNewAPIHadoopDataset(job.getConfiguration)
}
// The code for loading the images into Redis
val images = sc.parallelize(imagepaths).map(paths => Image.fromFiles(paths._1, paths._2)).collect
for (i <- images) {
  val stringRdd = sc.parallelize(Seq((i.uuid, new String(i.Png, StandardCharsets.UTF_8))))
  sc.toRedisKV(stringRdd)(redisConfig)
  stringRdd.collect
}
println("Done")

Detecting file size with MultipartFormDataStreamProvider before file is saved?

We are using the MultipartFormDataStreamProvider to save files uploaded by clients. I have a hard requirement that the file size must be greater than 1KB. The easiest thing to do would of course be to save the file to disk and then look at it; unfortunately I can't do it like this. After I save the file to disk I don't have the ability to access it, so I need to look at the file before it's saved to disk. I've been looking at the properties of the stream provider to try to figure out the size of the file, but so far I've been unsuccessful.
The test file I'm using is 1025 bytes.
MultipartFormDataStreamProvider.BufferSize is 4096
Headers.ContentDisposition.Size is null
ContentLength is null
Is there a way to determine file size before it's saved to the file system?
Thanks to Guanxi I was able to formulate a solution. I used his code in the link as the basis; I just added a little more async/await goodness :). I wanted to add the solution in case it helps anyone else:
private async Task SaveMultipartStreamToDisk(Guid guid, string fullPath)
{
    var user = HttpContext.Current.User.Identity.Name;
    var multipartMemoryStreamProvider = await Request.Content.ReadAsMultipartAsync();
    foreach (var content in multipartMemoryStreamProvider.Contents)
    {
        using (content)
        {
            if (content.Headers.ContentDisposition.FileName != null)
            {
                var existingFileName = content.Headers.ContentDisposition.FileName.Replace("\"", string.Empty);
                Log.Information("Original File name was {OriginalFileName}: {guid} {user}", existingFileName, guid, user);
                using (var st = await content.ReadAsStreamAsync())
                {
                    var ext = Path.GetExtension(existingFileName.Replace("\"", string.Empty));
                    List<string> validExtensions = new List<string>() { ".pdf", ".jpg", ".jpeg", ".png" };
                    //1024 = 1KB
                    if (st.Length > 1024 && validExtensions.Contains(ext, StringComparer.OrdinalIgnoreCase))
                    {
                        var newFileName = guid + ext;
                        using (var fs = new FileStream(Path.Combine(fullPath, newFileName), FileMode.Create))
                        {
                            await st.CopyToAsync(fs);
                            Log.Information("Completed writing {file}: {guid} {user}", Path.Combine(fullPath, newFileName), guid, HttpContext.Current.User.Identity.Name);
                        }
                    }
                    else
                    {
                        if (st.Length < 1025)
                        {
                            Log.Warning("File of length {FileLength} bytes was attempted to be uploaded: {guid} {user}", st.Length, guid, user);
                        }
                        else
                        {
                            Log.Warning("A file of type {FileType} was attempted to be uploaded: {guid} {user}", ext, guid, user);
                        }
                        var responseMessage = new HttpResponseMessage(HttpStatusCode.BadRequest)
                        {
                            Content =
                                st.Length < 1025
                                    ? new StringContent(
                                        $"file of length {st.Length} does not meet our minimum file size requirements")
                                    : new StringContent($"a file extension of {ext} is not an acceptable type")
                        };
                        throw new HttpResponseException(responseMessage);
                    }
                }
            }
        }
    }
}
You can also read the request contents without using MultipartFormDataStreamProvider. In that case all of the request contents (including files) would be in memory. I have given an example of how to do that at this link.
In this case you can read the header for the file size, or read the stream and check the file size. Only if it satisfies your criteria do you write it to the desired location.

How to set the log filename in Flume

I am using Apache Flume for log collection. This is my config file:
httpagent.sources = http-source
httpagent.sinks = local-file-sink
httpagent.channels = ch3
#Define source properties
httpagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
httpagent.sources.http-source.channels = ch3
httpagent.sources.http-source.port = 8082
# Local File Sink
httpagent.sinks.local-file-sink.type = file_roll
httpagent.sinks.local-file-sink.channel = ch3
httpagent.sinks.local-file-sink.sink.directory = /home/avinash/log_dir
httpagent.sinks.local-file-sink.sink.rollInterval = 21600
# Channels
httpagent.channels.ch3.type = memory
httpagent.channels.ch3.capacity = 1000
My application is working fine. My problem is that the files in log_dir are named with some random number (I guess it's a timestamp) by default.
How can I give the log files a proper filename suffix?
Having a look at the documentation, it seems there is no parameter for configuring the name of the files that are going to be created. I've gone through the sources looking for some hidden parameter, but there is none :)
Going into the details of the implementation, it seems the name of the file is managed by the PathManager class:
private PathManager pathController;
...
@Override
public Status process() throws EventDeliveryException {
    ...
    if (outputStream == null) {
        File currentFile = pathController.getCurrentFile();
        logger.debug("Opening output stream for file {}", currentFile);
        try {
            outputStream = new BufferedOutputStream(new FileOutputStream(currentFile));
    ...
}
Which, as you already noticed, is based on the current timestamp (showing the constructor and the next file getter):
public PathManager() {
    seriesTimestamp = System.currentTimeMillis();
    fileIndex = new AtomicInteger();
}

public File nextFile() {
    currentFile = new File(baseDirectory, seriesTimestamp + "-" + fileIndex.incrementAndGet());
    return currentFile;
}
So, I think the only possibility you have is to extend the File Roll sink and override the process() method in order to use a custom path controller.
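Since RollingFileSink keeps its PathManager in a private field (as shown above), here is only a rough alternative sketch, written in Scala to match the rest of this page even though Flume sinks are normally written in Java: a minimal custom sink that writes events to a file whose name you choose. The class name NamedFileSink and the sink.filePrefix property are made up for this example.
import java.io.{File, FileOutputStream}
import org.apache.flume.{Context, Event, Sink}
import org.apache.flume.conf.Configurable
import org.apache.flume.sink.AbstractSink

// Sketch of a custom sink with a configurable file name. Batching and most
// error handling are stripped down to keep the example short.
class NamedFileSink extends AbstractSink with Configurable {
  private var directory: String = _
  private var filePrefix: String = _

  override def configure(context: Context): Unit = {
    directory = context.getString("sink.directory", "/tmp")
    filePrefix = context.getString("sink.filePrefix", "flume")
  }

  override def process(): Sink.Status = {
    val channel = getChannel
    val tx = channel.getTransaction
    tx.begin()
    try {
      val event: Event = channel.take()
      if (event != null) {
        // Append the event body to a file named after the configured prefix.
        val out = new FileOutputStream(new File(directory, filePrefix + ".log"), true)
        try out.write(event.getBody) finally out.close()
      }
      tx.commit()
      if (event == null) Sink.Status.BACKOFF else Sink.Status.READY
    } catch {
      case e: Exception =>
        tx.rollback()
        throw e
    } finally {
      tx.close()
    }
  }
}
The agent config would then point the sink type at the fully qualified class name (e.g. httpagent.sinks.local-file-sink.type = com.example.NamedFileSink), assuming the jar is on Flume's plugin path.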
For sources, you can use exec commands to tail files and prepend or append details, based on shell scripting. Below is a sample:
# Describe/configure the source for tailing file
httpagent.sources.source.type = exec
httpagent.sources.source.shell = /bin/bash -c
httpagent.sources.source.command = tail -F /path/logs/*_details.log
httpagent.sources.source.restart = true
httpagent.sources.source.restartThrottle = 1000
httpagent.sources.source.logStdErr = true

Adobe Illustrator - scripting crashes when calling the fit-artboard-to-selected-art command

activeDocument.fitArtboardToSelectedArt()
When calling this command, AI crashes on AI 5.1/6, both 32-bit and 64-bit versions. I can use the command from the menu. Has anyone encountered this? Does anyone know of a workaround?
The full code:
function exportFileToJPEG (dest) {
    if ( app.documents.length > 0 ) {
        activeDocument.selectObjectsOnActiveArtboard()
        activeDocument.fitArtboardToSelectedArt() //crashes here
        activeDocument.rearrangeArtboards()
        var exportOptions = new ExportOptionsJPEG();
        var type = ExportType.JPEG;
        var fileSpec = new File(dest);
        exportOptions.antiAliasing = true;
        exportOptions.qualitySetting = 70;
        app.activeDocument.exportFile( fileSpec, type, exportOptions );
    }
}
var file_name = 'some eps file.eps'
var eps_file = File(file_name)
var fileRef = eps_file;
if (fileRef != null) {
    var optRef = new OpenOptions();
    optRef.updateLegacyText = true;
    var docRef = open(fileRef, DocumentColorSpace.RGB, optRef);
}
exportFileToJPEG ("output_file.jpg")
I can reproduce the bug with AI CS5.
It seems that fitArtboardToSelectedArt() takes the index of an artboard as an optional parameter. When the parameter is set, Illustrator doesn't crash (probably a bug in the code handling the case where no parameter is passed).
As a workaround you could use:
activeDocument.fitArtboardToSelectedArt(
activeDocument.artboards.getActiveArtboardIndex()
);
to pass the index of the active artboard to the function. Hope this works for you too.
Also, it's good practice never to omit the semicolon at the end of a statement.