Custom U-SQL extractor - How to process a JSON object larger than 4 MB - azure-data-lake

We use a custom U-SQL extractor to flatten a JSON structure. The sample code below works fine as long as each line (JSON object) of the input is smaller than 4 MB. If a line exceeds 4 MB, we get the error "A record in the input file is longer than 4194304 bytes." The same code was tried in a standalone C# application with lines larger than 4 MB, and it works fine there. Is there a restriction on JSON size with a U-SQL custom extractor? How do we handle JSON messages larger than 4 MB?
The error is thrown from the following line in the code below:
string line = lineReader.ReadToEnd();
Custom extractor sample code:
using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using Newtonsoft.Json;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace Company.DataLakeAnalytics
{
    [SqlUserDefinedExtractor(AtomicFileProcessing = false)]
    public class CustomJSONExtractor : IExtractor
    {
        private readonly Encoding _encoding;
        private readonly byte[] _row_delim;
        private string DELIMITER = "~";

        public CustomJSONExtractor(Encoding encoding = null, string row_delim = "\r\n")
        {
            _encoding = encoding ?? Encoding.UTF8;
            _row_delim = _encoding.GetBytes(row_delim);
        }

        // Every JSON line in the raw file is transformed to a flat structure
        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            // Read the input line by line
            foreach (Stream current in input.Split(_row_delim))
            {
                using (StreamReader lineReader = new StreamReader(current, this._encoding))
                {
                    // Reads the entire line; this is where the 4 MB error is thrown
                    string line = lineReader.ReadToEnd();
                    // Break the line into multiple columns
                    output.Set(1, "A~1");
                    yield return output.AsReadOnly();
                }
            }
        }
    }
}
Sample U-SQL code:
DECLARE @INPUT_FILE string = "sample-data.txt";
DECLARE @flattenedOutputFile string = "flattened-output.txt"; // placeholder output path

@jsonDatafile =
    EXTRACT key string, jsonObjStr string
    FROM @INPUT_FILE
    USING new Company.DataLakeAnalytics.CustomJSONExtractor(null, row_delim:"\n");

@dataJsonObject =
    SELECT jsonObjStr AS rawData
    FROM @jsonDatafile;

OUTPUT @dataJsonObject
TO @flattenedOutputFile
USING Outputters.Text(outputHeader:false, quoting:false, delimiter:'~');

The max size is indeed 4 MB for a row and 128 KB for a single string. What you can do is use the solution provided in this similar answer:
What is the maximum allowed size for String in U-SQL?
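That answer boils down to keeping every emitted cell under the limits yourself. A minimal, untested sketch of one way to do that: switch the extractor to atomic file processing (so input.Split and its 4 MB record limit are bypassed) and emit each oversized JSON line as numbered chunks below the 128 KB string limit. The three-column schema (rowId, part, jsonChunk) is an assumption for reassembly downstream:

[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class LargeJsonExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        // Atomic processing presents the whole file as one stream, so we read
        // lines of arbitrary length ourselves instead of using input.Split
        using (var reader = new StreamReader(input.BaseStream, Encoding.UTF8))
        {
            string line;
            long rowId = 0;
            while ((line = reader.ReadLine()) != null)
            {
                // Stay well under the 128 KB string-cell limit (which is
                // measured in bytes, hence the conservative chunk size)
                const int chunkChars = 64 * 1024;
                for (int pos = 0, part = 0; pos < line.Length; pos += chunkChars, part++)
                {
                    output.Set(0, rowId);
                    output.Set(1, part);
                    output.Set(2, line.Substring(pos, Math.Min(chunkChars, line.Length - pos)));
                    yield return output.AsReadOnly();
                }
                rowId++;
            }
        }
    }
}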

Related

Detecting file size with MultipartFormDataStreamProvider before file is saved?

We are using the MultipartFormDataStreamProvider to save files uploaded by clients. I have a hard requirement that the file size must be greater than 1 KB. The easiest thing to do would of course be to save the file to disk and then look at it; unfortunately, I can't do it that way. After I save the file to disk I don't have the ability to access it, so I need to look at the file before it's saved. I've been looking at the properties of the stream provider to try to figure out the size of the file, but I've been unsuccessful.
The test file I'm using is 1025 bytes.
MultipartFormDataStreamProvider.BufferSize is 4096
Headers.ContentDisposition.Size is null
ContentLength is null
Is there a way to determine file size before it's saved to the file system?
Thanks to Guanxi I was able to formulate a solution. I used his code in the link as the basis; I just added a little more async/await goodness :). I wanted to add the solution just in case it helps anyone else:
private async Task SaveMultipartStreamToDisk(Guid guid, string fullPath)
{
    var user = HttpContext.Current.User.Identity.Name;
    var multipartMemoryStreamProvider = await Request.Content.ReadAsMultipartAsync();
    foreach (var content in multipartMemoryStreamProvider.Contents)
    {
        using (content)
        {
            if (content.Headers.ContentDisposition.FileName != null)
            {
                var existingFileName = content.Headers.ContentDisposition.FileName.Replace("\"", string.Empty);
                Log.Information("Original File name was {OriginalFileName}: {guid} {user}", existingFileName, guid, user);
                using (var st = await content.ReadAsStreamAsync())
                {
                    var ext = Path.GetExtension(existingFileName);
                    List<string> validExtensions = new List<string>() { ".pdf", ".jpg", ".jpeg", ".png" };
                    // 1024 = 1KB
                    if (st.Length > 1024 && validExtensions.Contains(ext, StringComparer.OrdinalIgnoreCase))
                    {
                        var newFileName = guid + ext;
                        using (var fs = new FileStream(Path.Combine(fullPath, newFileName), FileMode.Create))
                        {
                            await st.CopyToAsync(fs);
                            Log.Information("Completed writing {file}: {guid} {user}", Path.Combine(fullPath, newFileName), guid, user);
                        }
                    }
                    else
                    {
                        if (st.Length < 1025)
                        {
                            Log.Warning("File of length {FileLength} bytes was attempted to be uploaded: {guid} {user}", st.Length, guid, user);
                        }
                        else
                        {
                            Log.Warning("A file of type {FileType} was attempted to be uploaded: {guid} {user}", ext, guid, user);
                        }
                        var responseMessage = new HttpResponseMessage(HttpStatusCode.BadRequest)
                        {
                            Content = st.Length < 1025
                                ? new StringContent($"file of length {st.Length} does not meet our minimum file size requirements")
                                : new StringContent($"a file extension of {ext} is not an acceptable type")
                        };
                        throw new HttpResponseException(responseMessage);
                    }
                }
            }
        }
    }
}
You can also read the request contents without using MultipartFormDataStreamProvider. In that case all of the request contents (including files) would be held in memory. I have given an example of how to do that at this link.
In this case you can read the header for the file size, or read the stream and check the file size yourself. Only write it to the desired location if it satisfies your criteria.
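For illustration, a minimal sketch of that in-memory approach (the 1 KB threshold comes from the question; targetDir is a placeholder):

// Buffers the entire request in memory, so only suitable for small uploads
var provider = await Request.Content.ReadAsMultipartAsync();
foreach (HttpContent content in provider.Contents)
{
    var disposition = content.Headers.ContentDisposition;
    if (disposition?.FileName == null)
    {
        continue; // not a file part
    }
    // Reading the content reveals its real size before anything touches disk
    byte[] bytes = await content.ReadAsByteArrayAsync();
    if (bytes.Length > 1024)
    {
        var fileName = disposition.FileName.Replace("\"", string.Empty);
        File.WriteAllBytes(Path.Combine(targetDir, fileName), bytes); // targetDir is a placeholder
    }
}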

Convert g722 audio to WAV using NAudio

I'm starting to write a Windows Service that will convert G.722 audio files into WAV files and I'm planning on using the NAudio library.
After looking at the NAudio demos, I've found that I will need to use the G722Codec to decode the audio data from the file, but I'm having trouble figuring out how to read the G722 file. Which reader should I use?
The G722 files are 7 kHz.
I'm working my way through the Pluralsight course for NAudio but it would be great to get a small code sample.
I got it working with the RawSourceWaveStream, but then tried to simply read the bytes of the file, decode using the G722Codec, and write the bytes out to a wave file. It worked.
private readonly G722CodecState _state = new G722CodecState(64000, G722Flags.SampleRate8000);
private readonly G722Codec _codec = new G722Codec();
private readonly WaveFormat _waveFormat = new WaveFormat(8000, 1);

public MainWindow()
{
    InitializeComponent();
    var data = File.ReadAllBytes(@"C:\Recordings\000-06Z_chunk00000.g722");
    var output = Decode(data, 0, data.Length);
    using (WaveFileWriter waveFileWriter = new WaveFileWriter(@"C:\Recordings\000-06Z_chunk00000.wav", _waveFormat))
    {
        waveFileWriter.Write(output, 0, output.Length);
    }
}

private byte[] Decode(byte[] data, int offset, int length)
{
    if (offset != 0)
    {
        throw new ArgumentException("G722 does not yet support non-zero offsets");
    }
    // Each encoded byte carries two samples, each decoded to a 16-bit PCM
    // value, so the output is four times the size of the G.722 input
    int decodedLength = length * 4;
    var outputBuffer = new byte[decodedLength];
    var wb = new WaveBuffer(outputBuffer);
    int decoded = _codec.Decode(_state, wb.ShortBuffer, data, length);
    return outputBuffer;
}

Process a CSV file starting at a predetermined line/row using LumenWorks parser

I am using the awesome LumenWorks CSV reader to process CSV files. Some files have over 1 million records.
What I want is to process the file in sections. E.g. I want to process 100,000 records first, validate the data, and then send these records over an Internet connection. Once sent, I then reopen the file and continue from record 100,001, and so on until I finish processing the file. In my application I have already created the logic to keep track of which record I am currently processing.
Does the LumenWorks parser support processing from a predetermined line in the CSV, or does it always have to start from the top? I see it has a buffer variable. Is there a way to use this buffer variable to achieve my goal?
my_csv = New CsvReader(New StreamReader(file_path), False, ",", buffer_variable)
It seems the LumenWorks CSV reader needs to start at the top. I needed to ignore the first n lines in a file and attempted to pass in a StreamReader that was already at the correct position/row, but got a "Key already exists" Dictionary error when I attempted to get the FieldCount (there were no duplicates).
However, I have had some success by first reading the pre-trimmed file into a StringBuilder and then into a StringReader so the CSV reader can read it. Your mileage may vary with huge files, but it does help to trim a file:
using (StreamReader sr = new StreamReader(filePath))
{
    string line = sr.ReadLine();
    StringBuilder sbCsv = new StringBuilder();
    int lineNumber = 0;
    do
    {
        lineNumber++;
        // Ignore the start rows of the CSV file until we reach the header
        if (lineNumber >= Constants.HeaderStartingRow)
        {
            // Place into StringBuilder
            sbCsv.AppendLine(line);
        }
    }
    while ((line = sr.ReadLine()) != null);

    // Use a StringReader to read the trimmed CSV file into a CSV Reader
    using (StringReader str = new StringReader(sbCsv.ToString()))
    {
        using (CsvReader csv = new CsvReader(str, true))
        {
            int fieldCount = csv.FieldCount;
            string[] headers = csv.GetFieldHeaders();
            while (csv.ReadNextRecord())
            {
                for (int i = 0; i < fieldCount; i++)
                {
                    // Do Work
                }
            }
        }
    }
}
You might be able to adapt this solution to read the file in chunks - e.g. as you read through the StreamReader, assign different "chunks" to a collection of StringBuilder objects, and also prepend the header row if you want it.
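As a rough, untested sketch of that chunking idea (ChunkSize and filePath are illustrative names):

const int ChunkSize = 100000;
var chunks = new List<StringBuilder>();
using (var sr = new StreamReader(filePath))
{
    string header = sr.ReadLine(); // keep the header row for every chunk
    StringBuilder current = null;
    int rows = 0;
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        if (rows % ChunkSize == 0)
        {
            current = new StringBuilder();
            current.AppendLine(header); // prepend the header
            chunks.Add(current);
        }
        current.AppendLine(line);
        rows++;
    }
}
// Each chunk can then be parsed independently:
// using (var csv = new CsvReader(new StringReader(chunks[0].ToString()), true)) { ... }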
Try using CachedCsvReader instead of CsvReader, with its MoveTo(long recordNumber), MoveToStart, etc. methods.
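A minimal sketch of that suggestion (untested; whether MoveTo is zero-based should be verified, and note that CachedCsvReader caches records in memory, which may matter for million-record files):

using (var csv = new CachedCsvReader(new StreamReader(file_path), true))
{
    // Jump straight to where processing left off
    csv.MoveTo(100000);
    while (csv.ReadNextRecord())
    {
        string firstField = csv[0]; // process fields by index
    }
}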

How to find uncompressed size of ionic zip file

I have a zip file compressed using Ionic Zip. Before extracting, I need to verify the available disk space. But how do I find the uncompressed size beforehand? Is there any header information in the zip file (written by Ionic) that I can read?
This should do the trick:
Option 1
long totalUncompressedSize = 0;
using (ZipFile zip = ZipFile.Read(zipFileName))
{
    // Sum the uncompressed size of every entry in the archive
    foreach (ZipEntry e in zip)
    {
        totalUncompressedSize += e.UncompressedSize;
    }
}
Or option 2 - you will need to sift through the mass of info:
using (ZipFile zip = ZipFile.Read(zipFileName))
{
    string info = zip.Info;
}
A separate approach, using the built-in System.IO.Compression ZipArchive rather than Ionic (entry.Length is the uncompressed size):
public static long GetTotalUnzippedSize(string zipFileName)
{
    using (ZipArchive zipFile = ZipFile.OpenRead(zipFileName))
    {
        return zipFile.Entries.Sum(entry => entry.Length);
    }
}
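Either way, the resulting total can then be compared against the free space on the target drive before extracting, e.g. (paths here are placeholders):

long required = GetTotalUnzippedSize(@"C:\archive.zip");   // placeholder archive path
long available = new DriveInfo(@"C:\").AvailableFreeSpace; // drive you extract to
if (available > required)
{
    // enough room to extract safely
}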

ResourceWriter data formatting

I have a .resx file in which I want to update some data. I can read the data from the file via a ResXResourceSet object, but when I save the data back, the saved format is unrecognizable. How do I edit .resx files? Thanks.
ResXResourceSet st = new ResXResourceSet(@"thepath");
var entries = new List<DictionaryEntry>();
DictionaryEntry curEntry;
foreach (DictionaryEntry ent in st)
{
    if (ent.Key.ToString() == "Page.Title")
    {
        curEntry = ent;
        curEntry.Value = "change this one";
        entries.Add(curEntry);
    }
    else
    {
        entries.Add(ent);
    }
}
st.Close();

System.Resources.ResourceWriter wr = new ResourceWriter(@"thepath");
foreach (DictionaryEntry entry in entries)
{
    wr.AddResource(entry.Key.ToString(), entry.Value.ToString());
}
wr.Close();
Hi again, I searched some more and found the following:
ResourceWriter writes data in binary format
ResourceReader reads data in binary format
ResXResourceWriter writes data in XML format
ResXResourceReader reads data in XML format
So the example above will manipulate the resources as XML if you use ResXResourceWriter and ResXResourceReader instead of ResourceWriter and ResourceReader.
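Applied to the code above, a minimal sketch of that XML round trip using the ResX classes (same placeholder path and key as in the question) might look like:

using System.Collections;
using System.Collections.Generic;
using System.Resources; // ResXResourceReader/Writer live in System.Windows.Forms.dll

var entries = new List<DictionaryEntry>();
using (var reader = new ResXResourceReader(@"thepath"))
{
    foreach (DictionaryEntry entry in reader)
    {
        // Swap in the new value for the key we want to change
        entries.Add(entry.Key.ToString() == "Page.Title"
            ? new DictionaryEntry(entry.Key, "change this one")
            : entry);
    }
}
using (var writer = new ResXResourceWriter(@"thepath"))
{
    foreach (DictionaryEntry entry in entries)
    {
        writer.AddResource(entry.Key.ToString(), entry.Value);
    }
    writer.Generate(); // writes the XML-based .resx format
}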