OutOfMemory on custom extractor - azure-data-lake

I have stitched a lot of small XML files into one file, and then made a custom extractor to return rows with one byte array that corresponds to each file.
Run on remote/master:
Run it for one file (gzipped, 11 MB): it works fine.
Run it for more than one file: I get a System.OutOfMemoryException.
Run on local/master:
Run it for one or more files (gzipped, 500+ MB): it works fine.
Extractor looks like this:
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
{
    using (var stream = new StreamReader(input.BaseStream))
    {
        var xml = stream.ReadToEnd();
        // Clean stitched XML
        xml = UtilsXml.CleanXml(xml);
        // Get nodes - one for each stitched file
        var d = new XmlDocument();
        d.LoadXml(xml);
        var root = d.FirstChild;
        for (int i = 0; i < root.ChildNodes.Count; i++)
        {
            output.Set<object>(1, Encoding.ASCII.GetBytes(root.ChildNodes[i].OuterXml.ToString()));
            yield return output.AsReadOnly();
        }
        yield break;
    }
}
and error message looks like this:
==== Caught exception System.OutOfMemoryException
at System.Xml.XmlDocument.CreateTextNode(String text)
at System.Xml.XmlLoader.LoadAttributeNode()
at System.Xml.XmlLoader.LoadNode(Boolean skipOverWhitespace)
at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
at System.Xml.XmlDocument.Load(XmlReader reader)
at System.Xml.XmlDocument.LoadXml(String xml)
at Microsoft.Analytics.Tools.Formats.Text.XmlByteArrayRowExtractor.<Extract>d__0.MoveNext()
at ScopeEngine.SqlIpExtractor<ScopeEngine::GZipInput,Extract_0_Data0>.GetNextRow(SqlIpExtractor<ScopeEngine::GZipInput\,Extract_0_Data0>* , Extract_0_Data0* output) in d:\data\ccs\jobs\bc367467-ef86-43d2-a937-46ba2d4cc524_v0\sqlmanaged.h:line 1924
So what am I doing wrong? And how do I debug this on remote?
Thanks!

Unfortunately, the local run does not enforce memory limits, so you would have to check memory usage yourself in local vertex debug.
Looking at your code above, I see that you are loading XML documents into a DOM. Please note that an XML DOM can inflate the data size relative to its string representation by a factor of 10 or more (I have seen 2 to 12 in my time as the resident SQL XML guru).
Each UDO today only gets 0.5 GB of RAM to play with, so I assume that your XML DOM document(s) grow beyond that.
The usual recommendation is to use the XmlReader interface (there is a reader extractor in the samples on http://usql.io as well) and scan through the document(s) to find the information you are looking for.
If your documents are always small enough (e.g., <20MB), you may want to make sure that you release the memory of the other documents and operate one document at a time.
We do have plans to allow you to annotate your UDO with memory needs, but that is still a bit out.
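The streaming recommendation above can be sketched as follows. This is only an illustration in Python, not U-SQL or .NET code: xml.etree.ElementTree.iterparse stands in for the XmlReader interface, and the function name iter_child_documents and the sample data are hypothetical. The point it shows is that each stitched file is emitted and released before the next one is read, so the whole document never sits in memory as one DOM.

```python
import io
import xml.etree.ElementTree as ET


def iter_child_documents(stream):
    """Yield each direct child of the root element as bytes, one at a time.

    Hypothetical helper: a streaming stand-in for loading the whole
    stitched document into a DOM and walking root.ChildNodes.
    """
    depth = 0
    root = None
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first "start" event is the root element
            depth += 1
        else:
            depth -= 1
            if depth == 1:
                # elem is a complete direct child of the root
                elem.tail = None  # drop whitespace between siblings
                yield ET.tostring(elem)
                root.remove(elem)  # release it before reading further


# Made-up sample input: two small "files" stitched under one root.
sample = io.BytesIO(b"<files><file><a>1</a></file><file><a>2</a></file></files>")
for doc in iter_child_documents(sample):
    print(doc)
```

In a real extractor the same shape would apply: read one child, emit one row, drop the child, continue.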

Related

Batch read from DBs

I'm a bit confused about how Go's sql package reads large datasets into memory. In this previous Stack Overflow question - How to set fetch size in golang? - there seem to be conflicting ideas on whether large datasets are batched on read or not.
I am writing a Go binary that connects to different remote DBs based on the input params given, fetches the results, and then converts them to a CSV file. Suppose I have a query that returns a lot of rows, say 20 million. Loading all of this into memory at once would be very expensive. Does the library batch the results automatically and load the next batch into memory only on rows.Next()?
If the database/sql package does not handle it, are there options in the various driver packages?
https://github.com/golang/go/issues/13067 - From this issue and discussion, I understand that the general idea is to have the driver packages handle this. As mentioned in the issue and also in this blog https://oralytics.com/2019/06/17/importance-of-setting-fetched-rows-size-for-database-query-using-golang/, I found out that Go's Oracle driver package has an option that I can pass for batching. But I am not able to find an equivalent in the other driver packages.
To summarize -
1. Does database/sql batch-read results automatically? If yes, then my 2nd and 3rd questions do not matter.
2. If no, are there options that I can pass to the various driver packages to set the batch size, and where can I find what these options are? I have already tried looking at the pgx docs and cannot find anything there that sets a batch size.
3. Is there any other way to batch reads, like a prepared statement with configuration specifying the batch size?
Some clarifications:
My question is: when a query returns a large dataset, is the entire dataset loaded into memory, or is it batched internally by some code that is called downstream from rows.Next?
From what I can see, a chunk reader is created with a default size of 8 KB and is used to chunk the results. Are there cases where this does not happen, or are the results from the DB always chunked?
Is the 8 KB buffer size that the chunk reader uses configurable?
For more clarity, I am adding the existing Java code, which I am looking to rewrite in Go.
private static final int RESULT_SIZE = 10000;
private void generate() {
    ... //connection and other code...
    Statement stmt = connection.createStatement(ResultSet.TYPE_FORWARD_ONLY,
            ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(RESULT_SIZE);
    ResultSet resultset = stmt.executeQuery(dataQuery);
    String fileInHome = getFullFileName(filePath, manager, parentDir);
    rsToCSV(resultset, new BufferedWriter(new FileWriter(fileInHome)));
}
private void rsToCSV(ResultSet rs, BufferedWriter os) throws SQLException {
    ResultSetMetaData metaData = rs.getMetaData();
    int columnCount = metaData.getColumnCount();
    try (PrintWriter pw = new PrintWriter(os)) {
        readHeaders(metaData, columnCount, pw);
        if (rs.next()) {
            readRow(rs, metaData, columnCount, pw);
            while (rs.next()) {
                pw.println();
                readRow(rs, metaData, columnCount, pw);
            }
        }
    }
}
The stmt.setFetchSize(RESULT_SIZE); call gives the driver a hint for how many rows to fetch from the database per round trip; the rows are then processed one by one and written to a CSV file.
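For comparison, the bounded-batch pattern in the Java code above has roughly the following shape in Python's DB-API. This is a sketch under stated assumptions, not a statement about how database/sql or any Go driver behaves: query_to_csv, the table t and its contents are hypothetical, and sqlite3 is used only because it ships with Python. Whether rows are also fetched lazily over the network depends on the driver; fetchmany here only bounds how many rows the program holds at once.

```python
import csv
import io
import sqlite3

BATCH_SIZE = 10000  # plays the role of RESULT_SIZE in the Java code


def query_to_csv(conn, query, out, batch_size=BATCH_SIZE):
    """Write the query result to `out` as CSV, holding at most
    `batch_size` rows in memory at a time. Returns the row count."""
    cur = conn.cursor()
    cur.execute(query)
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])  # header row
    total = 0
    while True:
        rows = cur.fetchmany(batch_size)  # next batch, or [] when exhausted
        if not rows:
            break
        writer.writerows(rows)
        total += len(rows)
    return total


# Made-up demo data: 25 rows read in batches of 10.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, f"row{i}") for i in range(25)])
buf = io.StringIO()
written = query_to_csv(conn, "SELECT id, name FROM t ORDER BY id", buf, batch_size=10)
```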

MVC4 - How to upload a file partially (only the first 10 lines, for e.g.)

ASP.NET MVC - Is it possible to upload only the first 10 lines of a file? Basically, we have some files that can range from 1 to 10 GB, but the data that we need is present only in the first 10 rows of the file. Using the typical web development approach, we'd upload the whole file to the server and then read the first 10 rows, but uploading a 10 GB file just to read a few bytes of data seems a big waste of resources. Is it possible to read such a file without uploading all of it to the web server?
Solution - The File API's slice function solved this problem (thanks to Chris below). The simplified code is below for anyone interested -
var sampleFile = document.getElementById('yourfileelement').files[0];
var reader = new FileReader();
var fileData = sampleFile.slice(0, 500000); // Read the first 500000 bytes
reader.onprogress = function (evt) { /* Show progress bar etc. */ };
reader.onloadend = function (evt) { alert(evt.target.result); }; // evt.target.result contains the file data that was read
reader.readAsText(fileData);
No, but you may be able to accomplish it using the File API client-side to read just the first 10 lines and send them to the server via AJAX. However, note that the File API is only supported in modern browsers, so this won't work with IE 9 or earlier. You might be able to create a more comprehensive solution using a Flash or Java applet, but ugh.

Cannot read second page scanned via ADF

I have a Brother multifunction networked printer/scanner/fax (model MFC-9140CDN). I am trying to use the following code with WIA to retrieve items scanned in with the document feeder:
const int FEEDER = 1;
var manager=new DeviceManager();
var deviceInfo=manager.DeviceInfos.Cast<DeviceInfo>().First();
var device=deviceInfo.Connect();
device.Properties["Pages"].set_Value(1);
device.Properties["Document Handling Select"].set_Value(1);
var morePages=true;
var counter=0;
while (morePages) {
    counter++;
    var item=device.Items[1];
    item.Properties["Bits Per Pixel"].set_Value(1);
    item.Properties["Horizontal Resolution"].set_Value(300);
    item.Properties["Vertical Resolution"].set_Value(300);
    var img=(WIA.ImageFile)item.Transfer();
    var path=String.Format(@"C:\Users\user1\Documents\test_{0}.tiff",counter);
    img.SaveFile(path);
    var status=(int)device.Properties["Document Handling Status"].get_Value();
    morePages = (status & FEEDER) > 0;
}
When the Transfer method is reached for the first time, all the pages go through the document feeder. The first page gets saved with img.SaveFile to the passed-in path, but all the subsequent pages are not available - device.Items.Count is 1, and trying device.Items[2] raises an exception.
In the next iteration, calling Transfer raises an exception -- understandably, because there are now no pages in the feeder.
How can I get the subsequent images that have been scanned into the feeder?
(N.B. Iterating through all the device properties, there is an additional unnamed property with the id of 38922. I haven't been able to find any reference to this property.)
Update
I couldn't find a property on the device corresponding to WIA_IPS_SCAN_AHEAD or WIA_DPS_SCAN_AHEAD_PAGES, but that makes sense because this property is optional according to the documentation.
I tried using TWAIN (via the NTwain library, which I highly recommend) with the same problem.
I have recently experienced a similar error with an HP MFC.
It seems that a property was being changed by the driver. The previous developer of the software I'm working on just kept reinitialising the driver each time in the for loop.
In my case the property was 'Media Type' being set to FLATBED (0x02) even though I was doing a multi-page scan and needed it to be NEXT_PAGE (0x80).
The way I found this was by storing every property (both device and item properties) before I scanned, and again after scanning the first page. I then had my application print out any properties that had changed and was able to identify my problem.
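The before/after comparison described above can be sketched like this. It is a language-neutral illustration written in Python rather than WIA code: the properties are assumed to have been copied into plain name-to-value dictionaries, the function name changed_properties is hypothetical, and the sample values reuse only the 'Media Type' codes mentioned above.

```python
def changed_properties(before, after):
    """Return {name: (old, new)} for every property whose value differs.

    A property present in only one snapshot shows up with None on the
    other side.
    """
    changes = {}
    for name in sorted(before.keys() | after.keys()):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


# Hypothetical snapshots taken before the scan and after the first page.
before = {"Media Type": 0x80, "Horizontal Resolution": 300}
after = {"Media Type": 0x02, "Horizontal Resolution": 300}

for name, (old, new) in changed_properties(before, after).items():
    print(f"{name}: {old!r} -> {new!r}")
```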
This is a networked scanner, and I was using the WSD driver.
Once I installed the manufacturer's driver, the behavior is as expected -- one page goes through the ADF, after which control is returned to the program.
(Even now, when I use WIA's CommonDialog.ShowSelectDevice method, the scanner is available twice, once using the Windows driver and once using the Brother driver; when I choose the WSD driver, I still see the issue.)
This bug did cost me hours...
So thanks a lot Zev.
I also had two scanners shown in the dialog for physically one machine. One driver scans only the first page and then empties the feeder without any chance to intercept. The other one works as expected.
BTW: It is not necessary to initialize the scanner for each page. I call my initialization routines prior to the Transfer() loop. Works just fine.
Another hiccup I ran into was the initialization order (page sizes first, then the feeder). So if you do not get it to work, try switching the order in which you change the properties for your WIA driver. As mentioned on MSDN, some properties also influence others, potentially resetting your changes.
So praise to ZEV SPITZ for the answer on Aug. 09, 2015.
You should instantiate and set up the device inside the 'while' loop. See:
const int FEEDER = 1;
var morePages=true;
var counter=0;
while (morePages) {
    counter++;
    var manager=new DeviceManager();
    var deviceInfo=manager.DeviceInfos.Cast<DeviceInfo>().First();
    var device=deviceInfo.Connect();
    //device.Properties["Pages"].set_Value(1);
    device.Properties["Document Handling Select"].set_Value(1);
    var item=device.Items[1];
    item.Properties["Bits Per Pixel"].set_Value(1);
    item.Properties["Horizontal Resolution"].set_Value(300);
    item.Properties["Vertical Resolution"].set_Value(300);
    var img=(WIA.ImageFile)item.Transfer();
    var path=String.Format(@"C:\Users\user1\Documents\test_{0}.tiff",counter);
    img.SaveFile(path);
    var status=(int)device.Properties["Document Handling Status"].get_Value();
    morePages = (status & FEEDER) > 0;
}
I got this looking into this free project, which I believe is able to help you too: adfwia.codeplex.com

GhostScript .NET not continuing past certain pages

I've created a program which needs to convert PDF files into image files, and for this GhostScript is the best choice. But once in a while, the library stalls completely on a page and doesn't continue; it just keeps using CPU power and working, as though it were caught in an infinite loop. The error is easily reproducible, as it happens every time on the specific PDF files affected, though GhostScript reports no error of any kind, and nothing is out of the ordinary in the PDF files themselves as far as I can see.
I have however been able to find out that the stalling is due to a specific element or elements in the pdf files, and by deleting the elements the pdf will easily render in GhostScript, but this is not a solution, nor an answer I can use.
PDF link* - http://www.filedropper.com/usjunis1-32webtest
*saved with free version of PDF-XChange Editor, so it has watermarks at the top, but it is the square that creates the stalling. I've also seen it happen on vector graphics objects, so it is not limited to squares.
Code -
private void startImageProcessing(String pdfFile)
{
    GhostscriptVersionInfo gvi = new GhostscriptVersionInfo(new Version(0, 0, 0), Directory.GetCurrentDirectory() + @"\gsdll32.dll", string.Empty, GhostscriptLicense.GPL);
    Ghostscript.NET.Processor.GhostscriptProcessor processor = new Ghostscript.NET.Processor.GhostscriptProcessor(gvi, true);
    processor.StartProcessing(CreateTestArgs(pdfFile, pdfFile.Substring(0, pdfFile.Length - 4) + "\\"+prefix+"-%03d.jpg", 72 * scale), new ConsoleStdIO(true));
}
private static string[] CreateTestArgs(string inputPath, string outputPath, int dpi)
{
    List<string> gsArgs = new List<string>();
    gsArgs.Add("-dSAFER");
    gsArgs.Add("-dBATCH");
    gsArgs.Add("-dNOPAUSE");
    gsArgs.Add("-sDEVICE=jpeg");
    gsArgs.Add("-r" + dpi);
    gsArgs.Add("-dJPEGQ=100");
    gsArgs.Add("-dNumRenderingThreads=" + Environment.ProcessorCount.ToString());
    gsArgs.Add("-dTextAlphaBits=4");
    gsArgs.Add("-dGraphicsAlphaBits=4");
    gsArgs.Add(@"-sOutputFile=" + outputPath);
    gsArgs.Add(@"-f" + inputPath);
    return gsArgs.ToArray();
}
I've also created a PDF file containing only one of the offending elements for testing, and the error occurs both when it is saved by Adobe Acrobat and when it is saved by PDF-XChange Editor, so the error is not due to the specific program that I used to save the PDF either.

How to open local bitcoin database

I am trying to extract data from the local Bitcoin database. As far as I know, bitcoin-qt uses Berkeley DB. I have installed Berkeley DB from the Oracle web site and found a DLL for .NET there: libdb_dotnet60.dll. I am trying to open a file, but I get a DatabaseException. Here is my code:
using BerkeleyDB;
class Program
{
    static void Main(string[] args)
    {
        var btreeConfig = new BTreeDatabaseConfig();
        var btreeDb = BTreeDatabase.Open(@"c:\Users\<user>\AppData\Roaming\Bitcoin\blocks\blk00000.dat", btreeConfig);
    }
}
Does anyone have examples how to work with a Bitcoin database (in any other language)?
What are you trying to extract? Only the wallet.dat file is a Berkeley DB database.
Blocks are stored one after the other in the blkxxxxx.dat files; each block is preceded by four bytes representing a network identifier and four bytes giving the block size.
An index of unspent outputs is stored as a LevelDB database.
Knowing what type of information you are looking for would help.
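The block-file layout described above (a network identifier, a size, then the block itself) can be walked with a few lines of code. The following is a minimal sketch in Python, not part of any Bitcoin library: iter_raw_blocks is a hypothetical name, the size is read as a little-endian 32-bit integer, the default identifier is the main-network value f9 be b4 d9, and the file is assumed to be stored as plain, unobfuscated bytes. Parsing the block contents themselves is left out.

```python
import io
import struct

MAINNET_MAGIC = bytes.fromhex("f9beb4d9")  # main-network identifier


def iter_raw_blocks(stream, magic=MAINNET_MAGIC):
    """Yield the raw bytes of each block in a blk*.dat-style stream.

    Stops at end of data, at a record that does not start with `magic`
    (for example zero padding), or at a truncated block.
    """
    while True:
        header = stream.read(8)
        if len(header) < 8 or header[:4] != magic:
            return
        (size,) = struct.unpack("<I", header[4:])  # little-endian uint32
        payload = stream.read(size)
        if len(payload) < size:
            return  # truncated final record
        yield payload


# Made-up stream with two tiny fake "blocks" followed by zero padding.
fake = (MAINNET_MAGIC + struct.pack("<I", 3) + b"abc"
        + MAINNET_MAGIC + struct.pack("<I", 5) + b"hello"
        + b"\x00" * 16)
for raw in iter_raw_blocks(io.BytesIO(fake)):
    print(len(raw), raw)
```

With a real file, the stream would come from open(path, "rb"), and each yielded payload would be handed to a block parser.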
There is a library, NBitcoin: https://github.com/MetacoSA/NBitcoin
How to enumerate blocks:
var store = new BlockStore(@"C:\Bitcoin\blocks\", Network.Main);
// this loop will enumerate all blocks ordered by height starting with genesis block
foreach (var block in store.EnumerateFolder())
{
    var item = block.Item;
    string blockID = item.Header.ToString();
    foreach (var tx in item.Transactions)
    {
        string txID = tx.GetHash().ToString();
        string raw = tx.ToHex();
    }
}
In .NET you could use something like BitcoinBlockchain, which is available as a NuGet package at https://www.nuget.org/packages/BitcoinBlockchain/. Its usage is trivial. If you want to see how it is implemented, the sources are available on GitHub.
If you want to store the blockchain in a SQL database that you could query faster and in more ways than the raw blockchain, you could use something like the BitcoinDatabaseGenerator tool available at https://github.com/ladimolnar/BitcoinDatabaseGenerator.