Use Tika in a Nutch plugin - Apache

In Nutch I'm implementing a plugin that fetches the content of web pages and processes it in a special way.
My main problem is that I want to convert web pages to plain text so that they can be processed. I read that the Tika toolkit can do that,
so I found this code that uses Tika to parse URLs and put it in the filter method:
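public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
{
    byte[] raw = content.getContent();
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    try {
        parser.parse(new ByteArrayInputStream(raw), handler, metadata, new ParseContext());
    } catch (IOException | SAXException | TikaException e) {
        // parse() declares these checked exceptions; log and fall through
        LOG.error("Tika parse failed", e);
    }
    String plainText = handler.toString();
    LOG.info("Mime: " + metadata.get(Metadata.CONTENT_TYPE));
    LOG.info("content: " + plainText);
    // filter() must return a ParseResult; pass the incoming one through unchanged
    return parseResult;
}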
The result of metadata.get(Metadata.CONTENT_TYPE) is text/html,
but handler.toString() is empty!
Update:
I also tried adding this line after the call to parser.parse:
LOG.info("Status : " + new ParseStatus().toString());
and I get this result:
Status : notparsed(0,0)

Since version 1.1 Nutch includes a Tika plugin (see also NUTCH-766) that should cover your need. I don't know if there's more comprehensive documentation available. You might want to ask the Nutch users mailing list for more details (or someone here on SO can fill in).

As Jukka Zitting said, Tika is already used inside Nutch. In the code you pasted, nothing copies the Tika metadata or a parse status into any Nutch-specific data structure. new ParseStatus() just creates a fresh, default status object that has no connection to your parser.parse call, which is presumably why it prints notparsed(0,0).

Using a local image with EmbedBuilder

According to the Discord.NET documentation page for the EmbedBuilder class, the syntax (converted to VB) to add a local image to an EmbedBuilder object should look something like this:
Dim fileName = "image.png"
Dim embed = New EmbedBuilder() With {
    .ImageUrl = $"attachment://{fileName}"
}.Build()
I'm trying to use something like this to add a dynamically created image to the EmbedBuilder, but I can't seem to get it to work properly. Here's basically what I've got:
Dim Builder As New Discord.EmbedBuilder
Dim DynamicImagePath As String = CreateDynamicImage()
Dim AttachURI As String = $"attachment:///" & DynamicImagePath.Replace("\", "/").Replace(" ", "%20")
With Builder
    .Description = "SAMPLE DESCRIPTION"
    .ImageUrl = AttachURI
End With
MyClient.GetGuild(ServerID).GetTextChannel(PostChannelID).SendMessageAsync("THIS IS A TEST", False, Builder.Build)
My CreateDynamicImage method returns the full path to the locally created image (e.g., C:\Folder\Another Folder\image.png). I've done a fair amount of "fighting"/testing with this to get past the Url must be a well-formed URI exception I was initially getting because of the [SPACE] in the path.
MyClient is a Discord.WebSocket.SocketClient object set elsewhere.
The SendMessageAsync method does send the Embed to Discord on the correct channel, but without the embedded image.
If I instead send the image using the SendFileAsync method (like so):
MyClient.GetGuild(ServerID).GetTextChannel(PostChannelID).SendFileAsync(DynamicImagePath, "THIS IS A TEST", False, Builder.Build)
the image is sent, but as a part of the message rather than as a part of the Embed (this is expected behavior; I only bring it up because it was a part of my testing to ensure that there wasn't a problem with actually sending the image to Discord).
I've tried using the file:/// scheme instead of the attachment:/// scheme, but that results in the entire post never making it to Discord at all.
Additionally, I've tried setting the ImageUrl property to a Web resource (e.g., https://www.somesite.com/someimage.png) and the Embed looks exactly as expected with the image and everything when it successfully posts to Discord.
So, I'm just wondering at this point if I'm just missing something, or if I'm just doing it completely wrong?
I cross-posted this to issue #1609 in the Discord.Net GitHub project to get a better idea of what options are available for this and received a good explanation of the issue:
The Embed (and EmbedImage) objects don't do anything with files. They simply pass the URI as configured straight into Discord. Discord then expects a URI in the form attachment://filename.ext if you want to refer to an attached image.
What you need to do is use SendFileAsync with the embed. You have two options here:
Use SendFileAsync with the Stream stream, string filename overload. I think this makes it clear what you need to do: you provide a file stream (via File.OpenRead or similar) and a filename. The provided filename does not have to match any file on disk. So, for example:
var embed = new EmbedBuilder()
    .WithImageUrl("attachment://myimage.png")
    .Build();
await channel.SendFileAsync(stream, "myimage.png", embed: embed);
Alternatively, you can use SendFileAsync with the string filePath overload. Internally, this gets a stream of the file at the path, and sets filename (as sent to Discord) to the last part of the path. So it's equivalent to:
using var stream = File.OpenRead(filePath);
var filename = Path.GetFileName(filePath);
await channel.SendFileAsync(stream, filename);
From here, you can see that if you want to use the string filePath overload, you need to set embed image URI to something like $"attachment://{Path.GetFileName(filePath)}", because the attachment filename must match the one sent to Discord.
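As a small illustration of that last point, the filename used in the embed URI can be derived from the path in one place, so the two values cannot drift apart. The sketch below is in Java purely to show the string logic; AttachmentUri, fileName and attachmentUri are hypothetical names, and in .NET the Path.GetFileName call quoted above does the same job.

```java
final class AttachmentUri {
    /** Returns the last path segment, accepting both '\' and '/' as separators. */
    static String fileName(String path) {
        int cut = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        return path.substring(cut + 1);
    }

    /** Builds the attachment:// reference for the filename that will be uploaded. */
    static String attachmentUri(String path) {
        // Only the bare filename goes into the URI, never the directory part.
        return "attachment://" + fileName(path);
    }
}
```

The same fileName result would be passed as the upload filename, which is what makes the reference resolve.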
I almost had it with my code above, but I misunderstood the intention and usage of the method and property. I guess I thought the .ImageUrl property somehow "automatically" initiated a Stream in the background. Additionally, I missed one very important piece:
As it's an async method, you must await (or whatever the VB.NET equivalent is) on SendFileAsync.
So, after making my calling method into an async method, my code now looks like this:
Private Async Sub TestMessageToDiscord()
    Dim Builder As New Discord.EmbedBuilder
    Dim AttachmentPath As String = CreateDynamicImage() '<-- Returns the full, local path to the created file
    With Builder
        .Description = "SAMPLE DESCRIPTION"
        .ImageUrl = $"attachment://{IO.Path.GetFileName(AttachmentPath)}"
    End With
    Using AttachmentStream As IO.Stream = IO.File.OpenRead(AttachmentPath)
        Await MyClient.GetGuild(ServerID).GetTextChannel(PostChannelID).SendFileAsync(AttachmentStream, IO.Path.GetFileName(AttachmentPath), "THIS IS A TEST", False, Builder.Build)
    End Using
End Sub
Now, everything works exactly as expected and I didn't have to resort to uploading the image to a hosting site and using the new URL (I actually had that working before I got the response on GitHub. I'm sure that code won't go to waste).
EDIT
Okay, so I still ended up going back to my separately hosted image option for one reason: I have a separate event method that modifies the original Embed object during which I want to remove the image and replace the text. However, when that event fired, while the text was replaced, the image was "moved" to the body of the Discord message. While I may have been able to figure out how to get rid of the image entirely, I decided to "drop back and punt" since I had already worked out the hosted image solution.
I've tried everything I could, but I got stuck at the same point where you are now.
My guess is that Discord doesn't like embedded images from https://cdn.discordapp.com/attachments and only accepts new files from https://media.discordapp.net. I might be wrong, though; this is just how it worked for me.
I believe it's only a visual glitch: I found that if you send a link to an image from cdn.discordapp.com/attachments in your regular Discord client, it bugs out and shows an empty embed for some reason.
That would make sense since the default link used in an embedded image actually starts with https://cdn.discordapp.com/attachments/...
You could solve this issue by using https://media.discordapp.net, but it seems like Discord.net is configured to use the old domain.

OutOfMemory on custom extractor

I have stitched a lot of small XML files into one file, and then made a custom extractor to return rows with one byte array that corresponds to each file.
Run on remote/master:
Running it for one file (gzipped, 11 MB) works fine.
Running it for more than one file throws a System.OutOfMemoryException.
Run on local/master:
Running it for one or more files (gzipped, 500+ MB) works fine.
Extractor looks like this:
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
{
    using (var stream = new StreamReader(input.BaseStream))
    {
        var xml = stream.ReadToEnd();
        // Clean stitched XML
        xml = UtilsXml.CleanXml(xml);
        // Get nodes - one for each stitched file
        var d = new XmlDocument();
        d.LoadXml(xml);
        var root = d.FirstChild;
        for (int i = 0; i < root.ChildNodes.Count; i++)
        {
            output.Set<object>(1, Encoding.ASCII.GetBytes(root.ChildNodes[i].OuterXml.ToString()));
            yield return output.AsReadOnly();
        }
        yield break;
    }
}
and error message looks like this:
==== Caught exception System.OutOfMemoryException
at System.Xml.XmlDocument.CreateTextNode(String text)
at System.Xml.XmlLoader.LoadAttributeNode()
at System.Xml.XmlLoader.LoadNode(Boolean skipOverWhitespace)
at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
at System.Xml.XmlDocument.Load(XmlReader reader)
at System.Xml.XmlDocument.LoadXml(String xml)
at Microsoft.Analytics.Tools.Formats.Text.XmlByteArrayRowExtractor.<Extract>d__0.MoveNext()
at ScopeEngine.SqlIpExtractor<ScopeEngine::GZipInput,Extract_0_Data0>.GetNextRow(SqlIpExtractor<ScopeEngine::GZipInput\,Extract_0_Data0>* , Extract_0_Data0* output) in d:\data\ccs\jobs\bc367467-ef86-43d2-a937-46ba2d4cc524_v0\sqlmanaged.h:line 1924
So what am I doing wrong? And how do I debug this on remote?
Thanks!
Unfortunately, a local run does not enforce memory limits, so you would have to check memory usage in local vertex debugging yourself.
Looking at your code above, I see that you are loading XML documents into a DOM. Please note that an XML DOM can explode the data size from the string representation up to a factor of 10 or more (I have seen 2 to 12 in my times as the resident SQL XML guru).
Each UDO today only gets 1/2 GB of RAM to play with. So what I assume is that your XML DOM document(s) start going beyond that.
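To make the arithmetic concrete, here is a rough back-of-the-envelope check, written in Java as a sketch. The 512 MB budget and the 2 to 12 expansion range come from the paragraphs above; the 100 MB input size is a made-up figure for illustration, and DomMemoryEstimate is a hypothetical helper, not part of any U-SQL API.

```java
final class DomMemoryEstimate {
    /** The "1/2 GB" per-UDO budget mentioned above, in bytes. */
    static final long UDO_BUDGET_BYTES = 512L * 1024 * 1024;

    /** Rough DOM footprint: size of the XML text times an expansion factor. */
    static long estimateDomBytes(long xmlTextBytes, int expansionFactor) {
        return xmlTextBytes * expansionFactor;
    }

    /** True when the estimated DOM would still fit inside the UDO budget. */
    static boolean fitsInBudget(long xmlTextBytes, int expansionFactor) {
        return estimateDomBytes(xmlTextBytes, expansionFactor) <= UDO_BUDGET_BYTES;
    }
}
```

With a hypothetical 100 MB of uncompressed XML, a factor of 2 gives about 200 MB and fits, while a factor of 10 gives about 1000 MB and does not. The string returned by ReadToEnd is held in memory at the same time as the DOM, which tightens the budget further.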
The usual recommendation is to use the XmlReader interface (there is a reader extractor in the samples on http://usql.io as well) and scan through the document(s) to find the information you are looking for.
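The streaming idea can be sketched as follows. This uses Java's StAX XMLStreamReader as a stand-in for .NET's XmlReader (both are forward-only pull parsers), so treat it as an illustration of the approach rather than code for a U-SQL extractor. StreamingChildScan and childTexts are hypothetical names, and collecting the text of each direct child of the root is just one example of "the information you are looking for".

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StreamingChildScan {
    /** Collects the text of each direct child of the root without building a DOM. */
    static List<String> childTexts(Reader in) {
        List<String> texts = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            StringBuilder current = null; // text of the child currently being read
            int depth = 0;                // 1 = root element, 2 = direct child of root
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        depth++;
                        if (depth == 2) current = new StringBuilder();
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        if (current != null) current.append(reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (depth == 2) {
                            texts.add(current.toString());
                            current = null; // release the finished child
                        }
                        depth--;
                        break;
                    default:
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return texts;
    }

    /** Convenience overload for XML that is already held in a string. */
    static List<String> childTexts(String xml) {
        return childTexts(new StringReader(xml));
    }
}
```

In a real extractor each child would be emitted as a row as soon as it is complete instead of being added to a list, so the working memory is bounded by the largest single child rather than by the whole stitched file.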
If your documents are always small enough (e.g., <20MB), you may want to make sure that you release the memory of the other documents and operate one document at a time.
We do have plans to allow you to annotate your UDO with memory needs, but that is still a bit out.

Save an image present in PDF on local File System

This is my first experience of using PDFBox jar files. Also, I have recently started working on TestComplete. In short, all these things are new for me and I have been stuck on one issue for last few hours. I will try to explain as much as I can. Would really appreciate any help!
Objective:
To save an image present in a PDF file on the file system
Issue:
When this line gets executed objImage.write2file_2(strSavePath);, I get the error Object doesn't support this property or method.
I am taking some help from here
Code:
function fn_PDFImage()
{
    var objPdfFile, strPdfFilePath, strSavePath, objPages, objPage, objImages, objImage, imgbuffer;
    strPdfFilePath = "C:\\Users\\aabb\\Desktop\\name.pdf";
    strSavePath = "C:\\Users\\aabb\\Desktop\\abc";
    objPdfFile = JavaClasses.org_apache_pdfbox_pdmodel.PDDocument.load_3(strPdfFilePath);
    objPages = objPdfFile.getDocumentCatalog().getAllPages();
    //getting a page with index=1
    objPage = objPages.get(1);
    objImages = objPage.getResources().getXObjects().values().toArray();
    Log.Message(objImages.length); //This is returning 14. i.e, 14 images
    //getting an image with index=1
    objImage = objImages.items(1);
    Log.Message(typeof objImage); //returns "Object" which means it is not null
    //saving the image
    objImage.write2file_2(strSavePath); //<---GETTING AN ERROR HERE
}
ERROR: Object doesn't support this property or method (the message quoted above).
If you are bothered by the method name write2file_2, please read this excerpt from the link I shared:
In Java, the constructor of a class has the name of this class. TestComplete changes the constructor names to newInstance(). If a class has overloaded constructors, TestComplete names them like newInstance, newInstance_2, newInstance_3 and so on.
Additional Info:
I have imported the JAR file (pdfbox-app-1.8.13.jar) and its classes into TestComplete. I am not sure whether I need to import some other JAR file or class as well.
XObjects are not always image XObjects. And write2file is in the class PDXObjectImage so you need to check your object type first.
Re the second question asked in the comment: a form XObject isn't something you can save. Form XObjects are content streams with resources etc., similar to pages. What you can do, however, is explore these too and check whether their resources contain images. See how this is done in the ExtractImages source code of PDFBox 1.8.
However there are other places where there can be images (e.g. patterns, soft masks, inline images); this is only available in PDFBox 2.*, see the ExtractImages source code there. (Note that the class names are different).

Remove PDFont caching with Apache Tika

I am trying to extract only the text from a number of different documents (rtf, doc, pdf). I naturally turned to Apache Tika because it can autodetect the document type and extract text accordingly. I am only interested in the text and not in formatting etc.
My application ends up with a big memory leak, and on investigating it, the leak comes from caching in the PDFont class of the PDFBox dependency. I am not interested in caching font metrics and other font formatting details from PDFs, as I only want to extract the text.
I am using Tika 1.12. Does anyone know how to get around this caching issue? This is how I am using AutoDetectParser:
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File(child.getPath()));
ParseContext context = new ParseContext();
parser.parse(inputstream, handler, metadata, context);
String s=null;
s =handler.toString();
handler=null;
context=null;
inputstream.close();
PDFont.clearResources();
So I fudged a workaround and just called System.gc(); every time a file had finished being processed, which works a treat but doesn't really answer the question.

Does springfox-swagger2 UI support choosing multiple files at once?

I use Spring Boot with swagger-ui integrated (springfox-swagger2), and I want to be able to choose multiple files to upload at once. Unfortunately the Swagger UI doesn't appear to allow this, at least not given my controller method.
My controller method signature:
@ApiOperation(
    value = "batch upload goods cover image",
    notes = "batch upload goods cover image",
    response = UploadCoverResultDTO.class,
    responseContainer = "List"
)
public Result<?> uploadGoodsCover(@ApiParam(value = "Image array", allowMultiple = true,
        required = true) @RequestPart("image") MultipartFile[] files) throws IOException {
Swagger UI generated:
But I was expecting a UI similar to this:
It's more convenient to choose all pictures in a folder in one go rather than choose one at a time e.g.:
<input type="file" name="img" multiple="multiple"/>
Does springfox-swagger2 support this? If so, what changes do I need to make?
Update: as pointed out by @Helen, this is now supported in Swagger UI 3.26.0 with OpenAPI 3 and should be in the next release of Springfox 3.
Springfox 2: unfortunately the answer is no.
Springfox Swagger2 does not support this because it's not yet supported by Swagger: https://github.com/springfox/springfox/issues/1072
Relevant Swagger issues:
https://github.com/swagger-api/swagger-ui/issues/4600 (fixed in 3.26.0)
https://github.com/OAI/OpenAPI-Specification/issues/254