Apache Tika Api consuming given stream

Apache Tika Api consuming given stream - apache

I use Apache Tika bundle dependency for a Project to find out MimeTypes for Files. due to some issues we have to find out through InputStream. it is actually guaranteed to mark / reset given InputStream. Tika-Bundle includes core and parser api and uses PoifscontainerDetector , ZipContainerDetector, OggDetector, MimeTypes and Magic for detection. I have been debugging for 3 hours and all of Detectors mark and reset after detection. I did it in following way.
TikaInputStream tis = null;
try {
TikaConfig config = new TikaConfig();
tikaDetector = config.getDetector();
tis = TikaInputStream.get(in);
MediaType mediaType = tikaDetector.detect(tis, new Metadata());
if (mediaType != null) {
String[] types = mediaType.toString().split(",");
for (int i = 0; i < types.length; i++) {
mimeTypes.add(new MimeType(types[i]));
}
}
} catch (Exception e) {
logger.error("Mime Type for given Stream could not be resolved: ", e);
}
But Stream is consumed. Does anyone know how to find out MimeTypes without consuming Stream?

This problem bugged me for a while too before I finally solved it. The problem is that, while Detector.detect() methods are required to mark and reset the stream, this resetting will have no effect on your original stream (the in variable) if marking is not supported in that stream.
In order to get this to work, I had to first convert my stream to a BufferedInputStream before doing anything else. I would then pass that buffered stream to the detect algorithm, and I would use that same buffered stream later for parsing, reading, or whatever I needed to do.
BufferedInputStream buffStream = new BufferedInputStream(in);
TikaInputStream tis = null;
try {
TikaConfig config = new TikaConfig();
tikaDetector = config.getDetector();
tis = TikaInputStream.get(buffStream);
MediaType mediaType = tikaDetector.detect(tis, new Metadata());
if (mediaType != null) {
String[] types = mediaType.toString().split(",");
for (int i = 0; i < types.length; i++) {
mimeTypes.add(new MimeType(types[i]));
}
}
} catch (Exception e) {
logger.error("Mime Type for given Stream could not be resolved: ", e);
}
// further along in my code...
doSomething(buffStream); // rather than doSomething(in)

Related

deleteObjects using AWS SDK V2?

I'm trying to migrate from AWS SDK V1.x to V2.2. I can't figure out the deleteObjects method though. I've found a bunch of examples - all the same one :-( that doesn't appear to ever use the list of objects to delete (i.e. the list is present, but never set in the DeleteObjectsRequest object - I assume that is where it should be set, but don't see where). How/where do I provide the object list? The examples I find are:
System.out.println("Deleting objects from S3 bucket: " + bucket_name);
for (String k : object_keys) {
System.out.println(" * " + k);
}
Region region = Region.US_WEST_2;
S3Client s3 = S3Client.builder().region(region).build();
try {
DeleteObjectsRequest dor = DeleteObjectsRequest.builder()
.bucket(bucket_name)
.build();
s3.deleteObjects(dor);
} catch (S3Exception e) {
System.err.println(e.awsErrorDetails().errorMessage());
System.exit(1);
}

The previously accepted answer was very helpful even though it wasn't 100% complete. I used it to write the following method. It basically works though I haven't tested its error handling particularly well.
An array of String keys is passed in, which are converted into the array of
ObjectIdentifier's that deleteObjects requires.
s3Client and log are assumed to have been initialized elsewhere. Feel free to remove the logging if you don't need it.
The method currently returns the number of successfully deletes.
public int deleteS3Objects(String bucketName, String[] keys) {
List<ObjectIdentifier> objIds = Arrays.stream(keys)
.map(key -> ObjectIdentifier.builder().key(key).build())
.collect(Collectors.toList());
try {
DeleteObjectsRequest dor = DeleteObjectsRequest.builder()
.bucket(bucketName)
.delete(Delete.builder().objects(objIds).build())
.build();
DeleteObjectsResponse delResp = s3client.deleteObjects(dor);
if (delResp.errors().size() > 0) {
String err = String.format("%d errors returned while deleting %d objects",
delResp.errors().size(), objIds.size());
log.warn(err);
}
if (delResp.deleted().size() < objIds.size()) {
String err = String.format("%d of %d objects deleted",
delResp.deleted().size(), objIds.size());
log.warn(err);
}
return delResp.deleted().size();
}
catch(AwsServiceException e) {
// The call was transmitted successfully, but Amazon S3 couldn't process
// it, so it returned an error response.
log.error("Error received from S3 while attempting to delete objects", e);
}
catch(SdkClientException e) {
// Amazon S3 couldn't be contacted for a response, or the client
// couldn't parse the response from Amazon S3.
log.error("Exception occurred while attempting to delete objects from S3", e);
}
return 0;
}
(Gratuitous commentary: I find it odd that in AWS SDK v2.3.9, deleteObjects requires Delete.Builder and ObjectIdentifier keys but getObject and putObject accept String keys. Why doesn't DeleteObjectsRequest.Builder just have a keys() method? They haven't officially said the SDK is production-ready AFAIK so some of this may be subject to change.)

Looks some more objects are needed to assign the key of the object in s3. This is untested, I put the links to the methods at the end.
System.out.println("Deleting objects from S3 bucket: " + bucket_name);
for (String k : object_keys) {
System.out.println(" * " + k);
}
Region region = Region.US_WEST_2;
S3Client s3 = S3Client.builder().region(region).build();
try {
ObjectIdentifier objectToDelete = ObjectIdentifier.Builder()
.key(KEY_OBJECT_TO_DELETE);
Delete delete_me Delete.Builder.objects(objectToDelete)
DeleteObjectsRequest dor = DeleteObjectsRequest.builder()
.bucket(bucket_name)
.delete(delete_me)
.build();
s3.deleteObjects(dor);
} catch (S3Exception e) {
System.err.println(e.awsErrorDetails().errorMessage());
System.exit(1);
}
Key to delete
https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/model/ObjectIdentifier.html#key--
Delete object
https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/model/ObjectIdentifier.Builder.html

AmazonS3: Getting warning: S3AbortableInputStream:Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection

Here's the warning that I am getting:
S3AbortableInputStream:Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
I tried using try with resources but S3ObjectInputStream doesn't seem to close via this method.
try (S3Object s3object = s3Client.getObject(new GetObjectRequest(bucket, key));
S3ObjectInputStream s3ObjectInputStream = s3object.getObjectContent();
BufferedReader reader = new BufferedReader(new InputStreamReader(s3ObjectInputStream, StandardCharsets.UTF_8));
){
//some code here blah blah blah
}
I also tried below code and explicitly closing but that doesn't work either:
S3Object s3object = s3Client.getObject(new GetObjectRequest(bucket, key));
S3ObjectInputStream s3ObjectInputStream = s3object.getObjectContent();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(s3ObjectInputStream, StandardCharsets.UTF_8));
){
//some code here blah blah
s3ObjectInputStream.close();
s3object.close();
}
Any help would be appreciated.
PS: I am only reading two lines of the file from S3 and the file has more data.

Got the answer via other medium. Sharing it here:
The warning indicates that you called close() without reading the whole file. This is problematic because S3 is still trying to send the data and you're leaving the connection in a sad state.
There's two options here:
Read the rest of the data from the input stream so the connection can be reused.
Call s3ObjectInputStream.abort() to close the connection without reading the data. The connection won't be reused, so you take some performance hit with the next request to re-create the connection. This may be worth it if it's going to take a long time to read the rest of the file.

Following option #1 of Chirag Sejpal's answer I used the below statement to drain the S3AbortableInputStream to ensure the connection can be reused:
com.amazonaws.util.IOUtils.drainInputStream(s3ObjectInputStream);

I ran into the same problem and the following class helped me
#Data
#AllArgsConstructor
public class S3ObjectClosable implements Closeable {
private final S3Object s3Object;
#Override
public void close() throws IOException {
s3Object.getObjectContent().abort();
s3Object.close();
}
}
and now you can use without warning
try (final var s3ObjectClosable = new S3ObjectClosable(s3Client.getObject(bucket, key))) {
//same code
}

To add an example to Chirag Sejpal's answer (elaborating on option #1), the following can be used to read the rest of the data from the input stream before closing it:
S3Object s3object = s3Client.getObject(new GetObjectRequest(bucket, key));
try (S3ObjectInputStream s3ObjectInputStream = s3object.getObjectContent()) {
try {
// Read from stream as necessary
} catch (Exception e) {
// Handle exceptions as necessary
} finally {
while (s3ObjectInputStream != null && s3ObjectInputStream.read() != -1) {
// Read the rest of the stream
}
}
// The stream will be closed automatically by the try-with-resources statement
}

I ran into the same error.
As others have pointed out, the /tmp space in lambda is limited to 512 MB.
And if the lambda context is re-used for a new invocation, then the /tmp space is already half-full.
So, when reading the S3 objects and writing all the files to the /tmp directory (as I was doing),
I ran out of disk space somewhere in between.
Lambda exited with error, but NOT all bytes from the S3ObjectInputStream were read.
So, two things one need to keep in mind:
1) If the first execution causes the problem, be stingy with your /tmp space.
We have only 512 MB
2) If the second execution causes the problem, then this could be resolved by attacking the root problem.
Its not possible to delete the /tmp folder.
So, delete all the files in the /tmp folder after the execution is finished.
In java, here is what I did, which successfully resolved the problem.
public String handleRequest(Map < String, String > keyValuePairs, Context lambdaContext) {
try {
// All work here
} catch (Exception e) {
logger.error("Error {}", e.toString());
return "Error";
} finally {
deleteAllFilesInTmpDir();
}
}
private void deleteAllFilesInTmpDir() {
Path path = java.nio.file.Paths.get(File.separator, "tmp", File.separator);
try {
if (Files.exists(path)) {
deleteDir(path.toFile());
logger.info("Successfully cleaned up the tmp directory");
}
} catch (Exception ex) {
logger.error("Unable to clean up the tmp directory");
}
}
public void deleteDir(File dir) {
File[] files = dir.listFiles();
if (files != null) {
for (final File file: files) {
deleteDir(file);
}
}
dir.delete();
}

This is my solution. I'm using spring boot 2.4.3
Create an amazon s3 client
AmazonS3 amazonS3Client = AmazonS3ClientBuilder
.standard()
.withRegion("your-region")
.withCredentials(
new AWSStaticCredentialsProvider(
new BasicAWSCredentials("your-access-key", "your-secret-access-key")))
.build();
Create an amazon transfer client.
TransferManager transferManagerClient = TransferManagerBuilder.standard()
.withS3Client(amazonS3Client)
.build();
Create a temporary file in /tmp/{your-s3-key} so that we can put the file we download in this file.
File file = new File(System.getProperty("java.io.tmpdir"), "your-s3-key");
try {
file.createNewFile(); // Create temporary file
} catch (IOException e) {
e.printStackTrace();
}
file.mkdirs(); // Create the directory of the temporary file
Then, we download the file from s3 using transfer manager client
// Note that in this line the s3 file downloaded has been transferred in to the temporary file that we created
Download download = transferManagerClient.download(
new GetObjectRequest("your-s3-bucket-name", "your-s3-key"), file);
// This line blocks the thread until the download is finished
download.waitForCompletion();
Now that the s3 file has been successfully transferred into the temporary file that we created. We can get the InputStream of the temporary file.
InputStream input = new DataInputStream(new FileInputStream(file));
Because the temporary file is not needed anymore, we just delete it.
file.delete();

Text Extraction, Not Image Extraction

Please help me understand if my solution is correct.
I'm trying to extract text from a PDF file with a LocationTextExtractionStrategy parser. I'm getting exceptions because the ParseContentMethod tries to parse inline images? The code is simple and looks similar to this:
RenderFilter[] filter = { new RegionTextRenderFilter(cropBox) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, strategy);
I realize the images are in the content stream but I have a PDF file failing to extract text because of inline images. It returns an UnsupportedPdfException of "The filter /DCTDECODE is not supported" and then it finally fails with and InlineImageParseException of "Could not find image data or EI", when all I really care about is the text. The BI/EI exists in my file so I assume this failure is because of the /DCTDECODE exception. But again, I don't care about images, I'm looking for text.
My current solution for this is to add a filterHandler in the InlineImageUtils class that assigns the Filter_DoNothing() filter to the DCTDECODE filterHandler dictionary. This way I don't get exceptions when I have InlineImages with DCTDECODE. Like this:
private static bool InlineImageStreamBytesAreComplete(byte[] samples, PdfDictionary imageDictionary) {
try {
IDictionary<PdfName, FilterHandlers.IFilterHandler> handlers = new Dictionary<PdfName, FilterHandlers.IFilterHandler>(FilterHandlers.GetDefaultFilterHandlers());
handlers[PdfName.DCTDECODE] = new Filter_DoNothing();
PdfReader.DecodeBytes(samples, imageDictionary, handlers);
return true;
} catch (IOException e) {
return false;
}
}
public class Filter_DoNothing : FilterHandlers.IFilterHandler
{
public byte[] Decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary)
{
return b;
}
}
My problem with this "fix" is that I had to change the iTextSharp library. I'd rather not do that so I can try to stay compatible with future versions.
Here's the PDF in question:
https://app.box.com/s/7eaewzu4mnby9ogpl2frzjswgqxn9rz5

Can I have simultaneous streams on one physical file

I have a wcf service that allow clients to download some files. Although there is a new instance of service for every client's request, if two clients try to download same file at the same time, first request to arrive locks the file until it is finished with it. So the other client is actually waiting for first client to finish as there is no multiple services. There must be a way to avoid this.
Is there anyone who knows how I can avoid this without having multiple files on servers hard disk? Or am I doing something totally wrong?
this is server side code:
`public Stream DownloadFile(string path)
{
System.IO.FileInfo fileInfo = new System.IO.FileInfo(path);
// check if exists
if (!fileInfo.Exists) throw new FileNotFoundException();
// open stream
System.IO.FileStream stream = new System.IO.FileStream(path, System.IO.FileMode.Open, System.IO.FileAccess.Read);
// return result
return stream;
}`
this is client side code:
public void Download(string serverPath, string path)
{
Stream stream;
try
{
if (System.IO.File.Exists(path)) System.IO.File.Delete(path);
serviceStreamed = new ServiceStreamedClient("NetTcpBinding_IServiceStreamed");
SimpleResult<long> res = serviceStreamed.ReturnFileSize(serverPath);
if (!res.Success)
{
throw new Exception("File not found: \n" + serverPath);
}
// get stream from server
stream = serviceStreamed.DownloadFile(serverPath);
// write server stream to disk
using (System.IO.FileStream writeStream = new System.IO.FileStream(path, System.IO.FileMode.CreateNew, System.IO.FileAccess.Write))
{
int chunkSize = 1 * 48 * 1024;
byte[] buffer = new byte[chunkSize];
OnTransferStart(new TransferStartArgs());
do
{
// read bytes from input stream
int bytesRead = stream.Read(buffer, 0, chunkSize);
if (bytesRead == 0) break;
// write bytes to output stream
writeStream.Write(buffer, 0, bytesRead);
// report progress from time to time
OnProgressChanged(new ProgressChangedArgs(writeStream.Position));
} while (true);
writeStream.Close();
stream.Dispose();
}
}
catch (Exception ex)
{
throw ex;
}
finally
{
if (serviceStreamed.State == System.ServiceModel.CommunicationState.Opened)
{
serviceStreamed.Close();
}
OnTransferFinished(new TransferFinishedArgs());
}
}

I agree with Mr. Kjörling, it's hard to help without seeing what you're doing. Since you're just downloading files from your server, why are you opening it as R/W (causing the lock). If you open it as read only, then it won't lock. Please don't mod down if my suggestion is lacking as it is only my interpretation of the problem w/o a lot of information.

Try this, it should enable two threads to read the file concurrently and independently:
System.IO.FileStream stream = new System.IO.FileStream(path, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read);

Returning binary content from a JPF action with Weblogic Portal 10.2

One of the actions of my JPF controller builds up a PDF file and I would like to return this file to the user so that he can download it.
Is it possible to do that or am I forced to write the file somewhere and have my action forward a link to this file? Note that I would like to avoid that as much as possible for security reasons and because I have no way to know when the user has downloaded the file so that I can delete it.
I've tried to access the HttpServletResponse but nothing happens:
getResponse().setContentLength(file.getSize());
getResponse().setContentType(file.getMimeType());
getResponse().setHeader("Content-Disposition", "attachment;filename=\"" + file.getTitle() + "\"");
getResponse().getOutputStream().write(file.getContent());
getResponse().flushBuffer();

We have something similar, except returning images instead of a PDF; should be a similar solution, though, I'm guessing.
On a JSP, we have an IMG tag, where the src is set to:
<c:url value="/path/getImage.do?imageId=${imageID}" />
(I'm not showing everything, because I'm trying to simplify.) In your case, maybe it would be a link, where the href is done in a similar way.
That getImage.do maps to our JPF controller, obviously. Here's the code from the JPF getImage() method, which is the part you're trying to work on:
#Jpf.Action(forwards = {
#Jpf.Forward(name = FWD_SUCCESS, navigateTo = Jpf.NavigateTo.currentPage),
#Jpf.Forward(name = FWD_FAILURE, navigateTo = Jpf.NavigateTo.currentPage) })
public Forward getImage(final FormType pForm) throws Exception {
final HttpServletRequest lRequest = getRequest();
final HttpServletResponse lResponse = getResponse();
final HttpSession lHttpSession = getSession();
final String imageIdParam = lRequest.getParameter("imageId");
final long header = lRequest.getDateHeader("If-Modified-Since");
final long current = System.currentTimeMillis();
if (header > 0 && current - header < MAX_AGE_IN_SECS * 1000) {
lResponse.setStatus(HttpServletResponse.SC_NOT_MODIFIED);
return null;
}
try {
if (imageIdParam == null) {
throw new IllegalArgumentException("imageId is null.");
}
// Call to EJB, which is retrieving the image from
// a separate back-end system
final ImageType image = getImage(lHttpSession, Long
.parseLong(imageIdParam));
if (image == null) {
lResponse.sendError(404, IMAGE_DOES_NOT_EXIST);
return null;
}
lResponse.setContentType(image.getType());
lResponse.addDateHeader("Last-Modified", current);
// public: Allows authenticated responses to be cached.
lResponse.setHeader("Cache-Control", "max-age=" + MAX_AGE_IN_SECS
+ ", public");
lResponse.setHeader("Expires", null);
lResponse.setHeader("Pragma", null);
lResponse.getOutputStream().write(image.getContent());
} catch (final IllegalArgumentException e) {
LogHelper.error(this.getClass(), "Illegal argument.", e);
lResponse.sendError(404, IMAGE_DOES_NOT_EXIST);
} catch (final Exception e) {
LogHelper.error(this.getClass(), "General exception.", e);
lResponse.sendError(500);
}
return null;
}
I've actually removed very little from this method, because there's very little in there that I need to hide from prying eyes--the code is pretty generic, concerned with images, not with business logic. (I changed some of the data type names, but no big deal.)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Apache Tika Api consuming given stream - apache

Related

deleteObjects using AWS SDK V2?

AmazonS3: Getting warning: S3AbortableInputStream:Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection

Text Extraction, Not Image Extraction

Can I have simultaneous streams on one physical file

Returning binary content from a JPF action with Weblogic Portal 10.2

Categories

Resources