Parsing Request Payload string

Can anyone help me parse data from a Request Payload string like the following one:
You can call the Split() method on strings to split them at a certain character. Alternatively, you can use Regex.Split(value, "<pattern>"); for splitting, e.g. if you have multiple characters you want to split at. <pattern> is a string here, so you can provide more than one character (e.g. "\r\n" to find line breaks).
using System;
class Program {
    static void Main() {
        string s = "7|0|5||B8CC86B6E3BFEAF758DE5845F8EBEA08||getAssetDailyTicks|J|1|2|3|4|2|5|5|CB|U9mc4GQ|";
        // Split string at pipe character
        string[] parts = s.Split('|');
        // Process segments
        foreach (string segment in parts) {
            // Use the segmented data...
        }
    }
}


Hive : accented characters to their non accented counterparts

How can I replace non-ASCII characters with their ASCII counterparts in a SELECT request sent to Hive? That is, have accents removed (é, ê, è => e) and have other non-alphanumeric characters (``) removed.
I know I can use regexp_replace(), but I'd have to deal with every accented/non-accented pair there is. Surely there is something more practical?
It seems that you want to use
String subjectString = "öäü";
subjectString = Normalizer.normalize(subjectString, Normalizer.Form.NFD);
As described in
Replace non ASCII character from string
I have tried using reflect but couldn't make it work due to the Normalizer.Form enum parameter.
So, it seems that you have to define a one-line UDF:
import java.text.Normalizer;
import org.apache.hadoop.hive.ql.exec.UDF;
public class NormalizerUDF extends UDF {
    public String evaluate(String in) {
        return Normalizer.normalize(in, Normalizer.Form.NFD);
    }
}
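Note that NFD normalization on its own only decomposes a character such as é into a base letter plus a combining mark; the mark still has to be removed afterwards. A minimal sketch of that extra step in plain Java follows. The class name AccentStripper, the method name stripAccents and the final whitelist of letters, digits and spaces are assumptions for illustration, not part of the original answer.

```java
import java.text.Normalizer;

final class AccentStripper {
    private AccentStripper() {}

    // Decompose accented characters (NFD), drop the combining marks,
    // then drop anything that is still not an ASCII letter, digit or space.
    static String stripAccents(String in) {
        if (in == null) {
            return null;
        }
        String decomposed = Normalizer.normalize(in, Normalizer.Form.NFD);
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        return withoutMarks.replaceAll("[^A-Za-z0-9 ]", "");
    }
}
```

The same two replaceAll calls could be applied to the return value of evaluate() in the UDF. Characters that have no canonical decomposition (for example ø or ß) are not mapped to a base letter by this approach; the last filter simply removes them.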

How to enable parallelism for a custom U-SQL Extractor

I’m implementing a custom U-SQL Extractor for our internal file format (binary serialization). It works well in the "Atomic" mode:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class BinaryExtractor : IExtractor
If I switch off the “Atomic“ mode, it looks like U-SQL splits the file at an arbitrary place (I guess just into 250 MB chunks). This is not acceptable for me. The file format has a special row delimiter. Can I define a custom row delimiter in my Extractor and enable parallelism for it? Technically, I can change our row delimiter to a new one if that helps.
Could anyone help me with this question?
The file is indeed split into chunks (I think it is 1 GB at the moment, but the exact value is implementation defined and may change for performance reasons).
If the file is indeed row delimited, and assuming your raw input data for the row is less than 4MB, you can use the input.Split() function inside your UDO to do the splitting into rows. The call will automatically handle the case if the raw input data spans the chunk boundary (assuming it is less than 4MB).
Here is an example:
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow) {
    // this._row_delim = this._encoding.GetBytes(row_delim); in class ctor
    foreach (Stream current in input.Split(this._row_delim)) {
        using (StreamReader streamReader = new StreamReader(current, this._encoding)) {
            int num = 0;
            string[] array = streamReader.ReadToEnd().Split(new string[]{this._col_delim}, StringSplitOptions.None);
            for (int i = 0; i < array.Length; i++) {
                // Process column value array[i] here (the loop body is not shown in this example)
            }
        }
        yield return outputrow.AsReadOnly();
    }
}
Please note that you cannot read across chunk boundaries yourself and you should make sure your data is indeed splittable into rows.

Regenerate source from Antlr4 ParseTree preserving whitespaces

I am able to use Java.g4 grammar and generate lexers and parsers. This grammar is used to get started. I have a ParseTree and I walk it and change whatever I want and write it back to a Java source code file. The ParseTree is not directly changed though.
So for example this is the code. But I also modify method names and add annotations to methods. In some cases I also remove method parameters
public void enterClassDeclaration(JavaParser.ClassDeclarationContext ctx) {
    printer.write( intervals );
    printer.writeList(importFilter );
    ParseTree pt = ctx.getChild(1);
    // The class name is changed by a visitor because
    // we get a ParseTree back
    pt.accept(new ParseTreeVisitor<Object>() {
        public Object visitChildren(RuleNode ruleNode) {
            return null;
        }
        public Object visitErrorNode(ErrorNode errorNode) {
            return null;
        }
        public Object visit(ParseTree parseTree) {
            return null;
        }
        public Object visitTerminal(TerminalNode terminalNode) {
            String className = terminalNode.getText();
            System.out.println("Name of the class is [ " + className + "]");
            printer.writeText( classModifier.get() + " class " + NEW_CLASS_IDENTIFIER );
            return null;
        }
    });
}
But I am not sure how to print the changed Java code while preserving all the original whitespaces.
How is that done ?
Update : It seems that the whitespaces and comments are there but not accessible easily. So it looks like I need to specifically keep track of them and write them along with the code. Not sure though.
So more specifically the code is this.
package x;
import java.util.Enumeration;
import java.util.*;
As I hit the first ImportDeclarationContext I need to store all the hidden space tokens. When I write this code back I want to include those spaces too.
Solution :
Don't skip but add to a HIDDEN channel
// Whitespace and comments
WS : [ \t\r\n\u000C]+ -> channel(HIDDEN) ;
COMMENT : '/*' .*? '*/' -> channel(HIDDEN) ;
Use code like this to get them back
I use this to get the whitespace and comments before each method, but I think it should be possible to get whitespace from other places too.
((CommonTokenStream) tokens).getHiddenTokensToLeft( classOrInterfaceModifierContext.getStart().getTokenIndex(),
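To make the idea concrete, here is a library-free sketch of why routing whitespace to a hidden channel preserves the layout. It does not use the ANTLR API: the Tok class, the sample() token list and both helper methods are hypothetical stand-ins, and only the channel numbers (0 for default, 1 for hidden) mirror ANTLR's constants. Because hidden tokens stay in the token list, re-emitting every token in order reproduces the source, and swapping the text of one visible token leaves the surrounding whitespace untouched.

```java
import java.util.ArrayList;
import java.util.List;

final class HiddenChannelSketch {
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 1;

    // Hypothetical stand-in for a lexer token: its text and its channel.
    static final class Tok {
        final String text;
        final int channel;

        Tok(String text, int channel) {
            this.text = text;
            this.channel = channel;
        }
    }

    // Collects the contiguous run of hidden tokens directly left of `index`.
    static List<Tok> hiddenToLeft(List<Tok> tokens, int index) {
        List<Tok> result = new ArrayList<>();
        int i = index - 1;
        while (i >= 0 && tokens.get(i).channel == HIDDEN_CHANNEL) {
            result.add(0, tokens.get(i));
            i--;
        }
        return result;
    }

    // Re-emits all tokens in order, replacing the text of one token.
    static String regenerate(List<Tok> tokens, int renameIndex, String newText) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            out.append(i == renameIndex ? newText : tokens.get(i).text);
        }
        return out.toString();
    }

    // Token list for the source text "class Old  {\n}".
    static List<Tok> sample() {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("class", DEFAULT_CHANNEL));
        tokens.add(new Tok(" ", HIDDEN_CHANNEL));
        tokens.add(new Tok("Old", DEFAULT_CHANNEL));
        tokens.add(new Tok("  ", HIDDEN_CHANNEL));
        tokens.add(new Tok("{", DEFAULT_CHANNEL));
        tokens.add(new Tok("\n", HIDDEN_CHANNEL));
        tokens.add(new Tok("}", DEFAULT_CHANNEL));
        return tokens;
    }
}
```

With real ANTLR tokens the same effect may be easier to get from the runtime's TokenStreamRewriter class, which records replacements against the original token stream and emits all other tokens, hidden ones included, unchanged.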

How can I Extract words with its coordinates from pdf using .net?

I'm working with a PDF in Hebrew that contains diacritical marks. I want to extract all the words with their coordinates. I tried ITextSharp and pdfClown, and neither gave me what I want.
With pdfClown, letters/characters are missing; with ITextSharp, I don't get the word coordinates.
Is there a way to do it? (I'm looking for a free framework\code)
PDFClown Code:
File file = new File(PDFFilePath);
TextExtractor te = new TextExtractor();
IDictionary<RectangleF?, IList<ITextString>> strs = te.Extract(file.Document.Pages[0].Contents);
List<string> correctText = new List<string>();
foreach (var key in strs.Keys) {
    foreach (var value in strs[key]) {
        string reversedText = new string(value.Text.Reverse().ToArray());
        string cleanText = RemoveDiacritics(reversedText);
    }
}
You aren't showing how you are trying to extract text using iText(Sharp). I am assuming that you are following the official documentation and that your code looks like this:
public string ExtractText(byte[] src) {
    PdfReader reader = new PdfReader(src);
    MyTextRenderListener listener = new MyTextRenderListener();
    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
    PdfDictionary pageDic = reader.GetPageN(1);
    PdfDictionary resourcesDic = pageDic.GetAsDict(PdfName.RESOURCES);
    processor.ProcessContent(
        ContentByteUtils.GetContentBytesForPage(reader, 1), resourcesDic);
    return listener.Text.ToString();
}
If your code doesn't look like this, that already explains the first thing you're doing wrong.
In this method, there is one class that isn't part of iTextSharp: MyTextRenderListener. This is a class you should write and that looks for instance like this:
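public class MyTextRenderListener : IRenderListener {
    public StringBuilder Text { get; set; }
    public MyTextRenderListener() {
        Text = new StringBuilder();
    }
    public void BeginTextBlock() {
        Text.Append("<"); // start-of-block marker (missing lines reconstructed from the output format described after this class)
    }
    public void EndTextBlock() {
        Text.Append(">"); // end-of-block marker
    }
    public void RenderImage(ImageRenderInfo renderInfo) { }
    public void RenderText(TextRenderInfo renderInfo) {
        Text.Append("<");
        Text.Append(renderInfo.GetText());
        LineSegment segment = renderInfo.GetBaseline();
        Vector start = segment.GetStartPoint();
        Text.Append("| x=");
        Text.Append(start[Vector.I1]); // x of the baseline start
        Text.Append("; y=");
        Text.Append(start[Vector.I2]); // y of the baseline start
        Text.Append(">");
    }
}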
When you run this code and look at what's inside Text, you'll notice that a PDF document doesn't store words. Instead, it stores text blocks. In our special IRenderListener, we indicate the start and the end of text blocks using < and >. Inside these text blocks, you'll find text snippets. We'll mark text snippets like this: <text snippet| x=36.0000; y=806.0000> where the x and y values give you the coordinates of the start of the baseline (as opposed to the ascent and descent position). You can also get the end position of the baseline (and the ascent/descent).
Now how do you distill words out of all of this? The problem with the text snippets you get, is that they don't correspond with words. See for instance this file: hello_reverse.pdf
When you open it in Adobe Reader, you read "Hello World Hello People." You'd hope you'd find four words in the content stream, wouldn't you? In reality, this is what you'll find:
<<Hello People>>
To distill the words, "World" and "Hello" from the first line, you need to do plenty of Math. Instead of getting the base line of the TextRenderInfo object returned in the RenderText() method of your render listener, you have to use the GetCharacterRenderInfos() method. This will return a list of TextRenderInfo objects that gives you more info about every character (including the position of those characters). You then need to compose the words from those different characters.
This is explained in mkl's answer to this question: Retrieve the respective coordinates of all words on the page with itextsharp
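As a rough illustration of the "compose the words from those different characters" step, here is a library-free sketch of the grouping logic. Everything in it is an assumption for illustration: Glyph and Word are hypothetical types standing in for the per-character information, the glyphs are assumed to be already sorted by line and then left to right, and the gap threshold maxGap and the 0.5 baseline tolerance are arbitrary values that would need tuning. Right-to-left text such as Hebrew would also need the character order of each word reversed afterwards.

```java
import java.util.ArrayList;
import java.util.List;

final class WordGrouper {
    // Hypothetical per-character record: the character, the x range of its
    // baseline, and the y position of that baseline.
    static final class Glyph {
        final String text;
        final float startX;
        final float endX;
        final float baselineY;

        Glyph(String text, float startX, float endX, float baselineY) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.baselineY = baselineY;
        }
    }

    // A word plus the coordinates of the start of its first character.
    static final class Word {
        final String text;
        final float x;
        final float y;

        Word(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Starts a new word at a space character, at a baseline change,
    // or when the horizontal gap to the previous glyph exceeds maxGap.
    static List<Word> group(List<Glyph> glyphs, float maxGap) {
        List<Word> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float wordX = 0f;
        float wordY = 0f;
        Glyph previous = null;
        for (Glyph g : glyphs) {
            boolean isSpace = g.text.trim().isEmpty();
            boolean newLine = previous != null
                    && Math.abs(g.baselineY - previous.baselineY) > 0.5f;
            boolean bigGap = previous != null
                    && g.startX - previous.endX > maxGap;
            if (current.length() > 0 && (isSpace || newLine || bigGap)) {
                words.add(new Word(current.toString(), wordX, wordY));
                current.setLength(0);
            }
            if (!isSpace) {
                if (current.length() == 0) {
                    wordX = g.startX;
                    wordY = g.baselineY;
                }
                current.append(g.text);
            }
            previous = g;
        }
        if (current.length() > 0) {
            words.add(new Word(current.toString(), wordX, wordY));
        }
        return words;
    }
}
```

In an iTextSharp listener, the Glyph values would presumably be filled from the objects returned by GetCharacterRenderInfos(), and the sorting step matters because, as the hello_reverse.pdf example shows, the order of snippets in the content stream need not match the reading order.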
We've done similar projects. One of them is described here:
You'll need to do quite some coding to get it right. One word about PdfClown: your text is probably stored as UNICODE in your PDF. To retrieve the correct characters, the parser needs to examine the mapping of the glyphs stored in the font and the corresponding UNICODE character. If PdfClown can't do this, this means that PdfClown doesn't do this task correctly. PdfClown is a one man project, so you'll have to ask that developer to fix this (if he has the time).
As you can tell from the video, iText could help you out, but iText is a company with subsidiaries in the US, Belgium and Singapore. It is a company with many employees and to keep that company running, we need to make money (that's how we pay our employees). Hence you shouldn't expect that we help you for free. Surely you can understand this as you wouldn't want to work for free either, would you?

WCF WebInvoke with query string parameters AND a post body

I'm fairly new to web services and especially WCF so bear with me.
I'm writing an API that takes a couple of parameters like username, apikey and some options, but I also need to send it a string which can be a few thousand words long, which gets manipulated and passed back as a stream. It didn't make sense to just put it in the query string, so I thought I would just have the message body POSTed to the service.
There doesn't seem to be an easy way to do this...
My operation contract looks like this
[WebInvoke(Method = "POST", BodyStyle = WebMessageBodyStyle.Bare,
"&text={text}&quality={qual}", BodyStyle = WebMessageBodyStyle.Bare)]
Stream Method1(string email, string apikey, string text, string qual);
And this works. But it is the 'text' parameter I want to pull out and have in the post body. One thing I read said to have a stream as another parameter, like this:
Stream Method1(string email, string apikey, string qual, Stream text);
which I could then read in. But that throws an error saying that if I want to have a stream parameter, that it has to be the only parameter.
So how can I achieve what I am trying to do here, or is it no big deal to send up a few thousand words in the query string?
This is the best answer I could find that tackles this issue; it worked for me and let me adhere to the RESTful standards correctly.
A workaround is to not declare the query parameters within the method signature and just manually extract them from the raw uri.
Dictionary<string, string> queryParameters = WcfUtils.QueryParameters();
queryParameters.TryGetValue("email", out string email);
// (Inside WcfUtils):
public static Dictionary<string, string> QueryParameters() {
    // raw url including the query parameters
    // (UriTemplateMatch is an object, not a string, so read the URL text from its RequestUri)
    string uri = WebOperationContext.Current.IncomingRequest.UriTemplateMatch.RequestUri.OriginalString;
    return uri.Split('?')
        .SelectMany(s => s.Split('&'))
        .Select(pv => pv.Split('='))
        .Where(pv => pv.Length == 2)
        .ToDictionary(pv => pv[0], pv => pv[1].TrimSingleQuotes());
}
// (Inside string extension methods)
public static string TrimSingleQuotes(this string s) {
    return (s != null && s.Length >= 2 && s[0] == '\'' && s[s.Length - 1] == '\'')
        ? s.Substring(1, s.Length - 2).Replace("''", "'")
        : s;
}
I ended up solving this simply by using WebServiceHostFactory.