nutch to extract only pdf files


Is there any way to apply one urlfilter for crawl levels 1-5 and a different urlfilter from level 5 onwards? I need to extract PDF files that appear only after a given level (just as an experiment).
The PDF files are stored in binary form in the crawl/segment folder. I would like to extract them and store them all in one folder. I have written a Java program that identifies the PDF files, but I can't figure out how to write each one out as a PDF file with its content intact (same fonts, page numbers, images, etc.).
My steps so far:
Perform the crawl
Merge the segment data
Run makePDF.java
This code only identifies the PDF files:
String uri = "/usr/local/nutch/framework/apache-nutch-1.6/merged572/20130407131335";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
Path path = new Path(uri, Content.DIR_NAME + "/part-00000/data");
SequenceFile.Reader reader = null;
try {
reader = new SequenceFile.Reader(fs, path, conf);
Text key = new Text();
Content content = new Content();
while (reader.next(key, content)) {
String contentType = content.getContentType();
if (contentType.equalsIgnoreCase("application/pdf")) {
//System.out.write( content.getContent(), 0, content.getContent().length );
System.out.println(key);
}
}
}
finally {
// close the reader even if reading fails, then release the file system
if (reader != null) {
reader.close();
}
fs.close();
}

content.getContent() returns the content as a byte array.
Just write those bytes to a file using a BufferedOutputStream and save it with a .pdf extension. The bytes are the original PDF, so fonts, pages and images are preserved.
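A minimal sketch of that approach, using only the JDK. The class name SegmentPdfSaver, the methods writePdf and fileNameFor, and the URL-to-file-name scheme are hypothetical choices for illustration and are not part of Nutch.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes raw PDF bytes into a single output folder.
class SegmentPdfSaver {

    // Assumed naming scheme: replace characters that are unsafe in file names
    // and make sure the name ends in ".pdf".
    static String fileNameFor(String url) {
        String name = url.replaceAll("[^A-Za-z0-9._-]", "_");
        return name.toLowerCase().endsWith(".pdf") ? name : name + ".pdf";
    }

    // Writes the bytes unchanged, so the file on disk is the PDF as fetched.
    static Path writePdf(byte[] bytes, Path outDir, String url) throws IOException {
        Files.createDirectories(outDir);
        Path target = outDir.resolve(fileNameFor(url));
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            out.write(bytes);
        }
        return target;
    }
}
```

In the loop from the question, the System.out.println(key) line could then become something like SegmentPdfSaver.writePdf(content.getContent(), Paths.get("pdfs"), key.toString()), where the "pdfs" folder is an assumed output location.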

Related

Save picture directly to stream? [duplicate]

I have a filename pointing to a text file, including its path, as a string. Now I'd like to load this .csv file into memory stream. How should I do that?
For example, I have this:
Dim filename as string="C:\Users\Desktop\abc.csv"

Dim stream As New MemoryStream(File.ReadAllBytes(filename))

You don't need to load a file into a MemoryStream.
You can simply call File.OpenRead to get a FileStream containing the file.
If you really want the file to be in a MemoryStream, you can call CopyTo to copy the FileStream to a MemoryStream.

I had an XML file being read from disk, using the old XmlReader API. How to read the XML file into memory, and then work with it in memory, instead of reading the disk repeatedly? Based on VB answer from Centro (upvoted) but with a Using block, and in C#.
The key line:
MemoryStream myXMLDocument = new MemoryStream(File.ReadAllBytes(@"c:\temp\myDemoXMLDocument.xml"));
Re the OP's question, if you wanted to load a CSV file into a MemoryStream:
MemoryStream myCSVDataInMemory = new MemoryStream(File.ReadAllBytes(@"C:\Users\Desktop\abc.csv"));
The following snippet reads through the XML document now that it is in a MemoryStream. It is basically the same code as when the data came from a FileStream pointing to a file on disk. Yes, the XmlTextReader API is old and clunky, but it's what I had to work with in this app.
string myXMLFileName = @"c:\temp\myDemoXMLDocument.xml";
using (MemoryStream myXMLDocument = new MemoryStream(File.ReadAllBytes(myXMLFileName)))
{
XmlTextReader myXmlTextReader = new XmlTextReader(myXMLDocument);
myXmlTextReader.WhitespaceHandling = WhitespaceHandling.None;
myXmlTextReader.Read(); // read the XML declaration node, advance to <Batch> tag
while (!myXmlTextReader.EOF)
{
if (myXmlTextReader.Name == "xml" && !myXmlTextReader.IsStartElement()) break;
// advance to <Batch> tag
while (myXmlTextReader.Name == "Batch" && myXmlTextReader.IsStartElement())
{
string BatchIdentifier = myXmlTextReader.GetAttribute("BatchIdentifier");
myXmlTextReader.Read(); // advance to next tag
while (!myXmlTextReader.EOF)
{
if (myXmlTextReader.Name == "Transaction" && myXmlTextReader.IsStartElement())
{
// Start a new set of items
string transactionID = myXmlTextReader.GetAttribute("ID");
myXmlTextReader.Read(); // Read next element, possibly another Transaction tag
}
}
//All Batch tags are completed.Move to next tag
myXmlTextReader.Read();
}
// Close the XML memory stream.
myXmlTextReader.Close();
myXMLDocument.Close();
}
}

You can copy it to a file stream like so:
string fullPath = Path.Combine(filePath, fileName);
FileStream fileStream = new FileStream(fullPath, FileMode.Open);
Image image = Image.FromStream(fileStream);
MemoryStream memoryStream = new MemoryStream();
image.Save(memoryStream, ImageFormat.Jpeg);
//Close File Stream
fileStream.Close();

Xamarin Android: How to Share PDF File From Assets Folder? Via WhatsApp I get message that the file you picked was not a document

I use Xamarin Android. I have a PDF File stored in Assets folder from Xamarin Android.
I want to share this file in WhatsApp, but I receive the message:
The file you picked was not a document.
I tried two ways:
This is the first way
var SendButton = FindViewById<Button>(Resource.Id.SendButton);
SendButton.Click += (s, e) =>
{
// Create a new file in external storage and copy the file from the assets folder into it
Java.IO.File dstFile = new Java.IO.File(Environment.ExternalStorageDirectory.Path + "/my-pdf-File--2017.pdf");
dstFile.CreateNewFile();
var inputStream = new FileInputStream(Assets.OpenFd("my-pdf-File--2017.pdf").FileDescriptor);
var outputStream = new FileOutputStream(dstFile);
CopyFile(inputStream, outputStream);
// let the system scan the new file so that it is detected
Intent intent = new Intent(Intent.ActionMediaScannerScanFile);
intent.SetData(Uri.FromFile(dstFile));
this.SendBroadcast(intent);
//share the Uri of the file
var sharingIntent = new Intent();
sharingIntent.SetAction(Intent.ActionSend);
sharingIntent.PutExtra(Intent.ExtraStream, Uri.FromFile(dstFile));
sharingIntent.SetType("application/pdf");
this.StartActivity(Intent.CreateChooser(sharingIntent, "@string/QuotationShare"));
};
This is the second
//Other way
var SendButton2 = FindViewById<Button>(Resource.Id.SendButton2);
SendButton2.Click += (s, e) =>
{
Intent intent = new Intent(Intent.ActionSend);
intent.SetType("application/pdf");
Uri uri = Uri.Parse(Environment.ExternalStorageDirectory.Path + "/my-pdf-File--2017.pdf");
intent.PutExtra(Intent.ExtraStream, uri);
try
{
StartActivity(Intent.CreateChooser(intent, "Share PDF file"));
}
catch (System.Exception ex)
{
Toast.MakeText(this, "Error: Cannot open or share created PDF report. " + ex.Message, ToastLength.Short).Show();
}
};
Also, when I share via email, the PDF file arrives empty (corrupt).
What can I do?

The solution is to copy the .pdf file from the assets folder to local storage. Then we are able to share the file.
First copy the file:
string fileName = "my-pdf-File--2017.pdf";
var localFolder = Android.OS.Environment.ExternalStorageDirectory.AbsolutePath;
var MyFilePath = System.IO.Path.Combine(localFolder, fileName);
using (var streamReader = new StreamReader(Assets.Open(fileName)))
{
using (var memstream = new MemoryStream())
{
streamReader.BaseStream.CopyTo(memstream);
var bytes = memstream.ToArray();
//write to local storage
System.IO.File.WriteAllBytes(MyFilePath, bytes);
MyFilePath = $"file://{localFolder}/{fileName}";
}
}
Then share the file, from local storage:
var fileUri = Android.Net.Uri.Parse(MyFilePath);
var intent = new Intent();
intent.SetFlags(ActivityFlags.ClearTop);
intent.SetFlags(ActivityFlags.NewTask);
intent.SetAction(Intent.ActionSend);
intent.SetType("*/*");
intent.PutExtra(Intent.ExtraStream, fileUri);
intent.AddFlags(ActivityFlags.GrantReadUriPermission);
var chooserIntent = Intent.CreateChooser(intent, title);
chooserIntent.SetFlags(ActivityFlags.ClearTop);
chooserIntent.SetFlags(ActivityFlags.NewTask);
Android.App.Application.Context.StartActivity(chooserIntent);

I had this issue when trying to share a .pdf file from the assets folder via WhatsApp, and I got the same error as in your question:
the file you picked was not a document
I finally found a solution: copy the .pdf file from the assets folder to the Download folder. It works fine:
var pathFile = Android.OS.Environment.GetExternalStoragePublicDirectory(Android.OS.Environment.DirectoryDownloads);
Java.IO.File dstFile = new Java.IO.File(pathFile.AbsolutePath + "/my-pdf-File--2017.pdf");

How to open PDF/A-1 documents with a PdfCopy/PdfStamper that has a PDFA-2 or PDF/A-3 Conformance Level

I'm still trying to convert from PDF to PDF/A, from PDF/A-1 to PDF/A-2, and from PDF/A-2 to PDF/A-3. As you can see, my aim is to produce a PDF/A-3 conformant file from an existing PDF file.
The problem is that converting PDF/A-1 to PDF/A-2b doesn't work. I am trying to open a PDF/A-1 conformant file with a PdfACopy that should create a PDF/A-2 conformant file, but this error occurs:
Different PDF/A versions.
Here is a little extract from my code:
using (Document doc = new Document())
{
using (FileStream fs = new FileStream(destPdfA, FileMode.Create, FileAccess.ReadWrite))
{
using (PdfReader reader = new PdfReader(pdfPath))
{
using (PdfACopy copy = new PdfACopy(doc, fs, PdfAConformanceLevel.PDF_A_2B))
{
copy.SetPdfVersion(PdfCopy.PDF_VERSION_1_7);
copy.SetTagged();
copy.CreateXmpMetadata();
doc.Open();
ICC_Profile icc = ICC_Profile.GetInstance(ICC);
copy.SetOutputIntents("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);
copy.AddDocument(reader);
PdfStructureTreeRoot s = copy.StructureTreeRoot;
Boolean a = PdfStructTreeController.CheckTagged(reader);
doc.Close();
}
}
}
}
How can I create or convert those PDFs? Do I need to read the metadata from the original file, change the PDF/A version, read it again and then change it?
Please tell me how to convert those files. I just want a PDF/A-3 conformant file.

Stream pdfs from url and add it to Zip

I have an MVC 4.5 application where I show a grid. The first column of the grid is a document name. The document name is a hyperlink to the actual document, which is hosted on our site and is available via a URL. The documents can be PDF, DOC or PPT files. I can access these documents only via URL; I do not have access to the physical files on our server.
I give users the option to select one or more of these documents from the grid and download them. What I am trying to achieve is to read each selected document via its URL, write it to a zip file and make the zip file downloadable, so users download one file instead of several.
I have tried to stream the documents from their URLs into memory and then add them to the zip file using the ZipArchive library from Microsoft. This is not working for me.
I was able to add documents that were on disk to a zip file using ZipArchive, and that works great. But I do not have access to the physical documents, only to their URLs. My next option is to download each document to a temp location on the server and then add it to the zip file using ZipArchive, but I am trying to avoid downloading files to a temp location.
Please suggest how I can read the documents via URL into memory, add each of them to a zip file and make the zip file downloadable.
Any help will be appreciated.

Thank you Cbroe for commenting. I figured out the answer. The problem was that I was reading the PDF from the URL, converting it to a memory stream and then trying to add the memory stream to the ZipArchive, which was not working. Instead, I extracted the byte array and wrote it to the zip archive entry, and it worked.
Here is the code snippet, which might be useful for someone. My first contribution to Stack Overflow.
public FileResult DownloadZip()
{
MemoryStream memoryStream = new MemoryStream();
using (var archive = new ZipArchive(memoryStream, ZipArchiveMode.Create, true))
{
var demoFile = archive.CreateEntry("Pdf123.pdf");
var convertedStream = ConvertTobyte("http://www.example.com/Pdf123.pdf");
using (var entryStream = demoFile.Open())
{
entryStream.Write(convertedStream, 0, convertedStream.Length);
}
demoFile = archive.CreateEntry("Pdf456.pdf");
convertedStream = ConvertTobyte("http://www.example.com/Pdf456.pdf");
using (var entryStream = demoFile.Open())
{
entryStream.Write(convertedStream, 0, convertedStream.Length);
}
}
//This option is to write the zip to your local disk
using (var fileStream = new FileStream(@"C:\Temp\test.zip", FileMode.Create))
{
memoryStream.Seek(0, SeekOrigin.Begin);
memoryStream.CopyTo(fileStream);
}
//This option is to download the zip via the browser
memoryStream.Seek(0, SeekOrigin.Begin);
return new FileStreamResult(memoryStream, "application/zip")
{
FileDownloadName = "Archive.zip"
};
}
private static byte[] ConvertTobyte(string fileUrl)
{
byte[] imageData = null;
using (var wc = new System.Net.WebClient())
imageData = wc.DownloadData(fileUrl);
return imageData;
}

Upload file with metadata and check in to SharePoint folder using Client Object Model

Hi, I'm trying to upload a file with metadata to SharePoint 2010 using the client API, and to check the file in when I'm done. Below is my code:
public void UploadDocument(SharePointFolder folder, String filename, Boolean overwrite)
{
var fileInfo = new FileInfo(filename);
var targetLocation = String.Format("{0}{1}{2}", folder.ServerRelativeUrl,
Path.AltDirectorySeparatorChar, fileInfo.Name);
using (var fs = new FileStream(filename, FileMode.Open))
{
SPFile.SaveBinaryDirect(mClientContext, targetLocation, fs, overwrite);
}
// doesn't work
SPFile newFile = mRootWeb.GetFileByServerRelativeUrl(targetLocation);
mClientContext.Load(newFile);
mClientContext.ExecuteQuery();
//check out to make sure not to create multiple versions
newFile.CheckOut();
// use OverwriteCheckIn type to make sure not to create multiple versions
newFile.CheckIn("test", CheckinType.OverwriteCheckIn);
mClientContext.Load(newFile);
mClientContext.ExecuteQuery();
//SPFile uploadFile = mRootWeb.GetFileByServerRelativeUrl(targetLocation);
//uploadFile.CheckOut();
//uploadFile.CheckIn("SOME VERSION COMMENT I'D LIKE TO ADD", CheckinType.OverwriteCheckIn);
//mClientContext.ExecuteQuery();
}
I'm able to upload the file, but I can't add any metadata and the file remains checked out. I want to add some metadata and check the file in when I'm done.
My SharePointFolder class holds the ServerRelativeUrl of the folder to upload to. Any help is greatly appreciated.

You need to set credentials on the client context before calling ExecuteQuery() and SaveBinaryDirect().
For example:
mClientContext.Credentials = new NetworkCredential("LoginID","LoginPW", "LoginDomain");
SPFile newFile = mRootWeb.GetFileByServerRelativeUrl(targetLocation);
mClientContext.Load(newFile);
mClientContext.ExecuteQuery();
