Text extraction from table cells - pdf

I have a pdf. The pdf contains a table. The table contains many cells (>100). I know the exact position (x,y) and dimension (w,h) of every cell of the table.
I need to extract text from cells using itextsharp. Using PdfReaderContentParser + FilteredTextRenderListener (using a code like this http://itextpdf.com/examples/iia.php?id=279 ) I can extract text but I need to run the whole procedure for each cell. My pdf have many cells and the program needs too much time to run. Is there a way to extract text from a list of "rectangle"? I need to know the text of each rectangle. I'm looking for something like PDFTextStripperByArea by PdfBox (you can define as many regions as you need and the get text using .getTextForRegion("region-name") ).

This option is not immediately included in the iTextSharp distribution but it is easy to realize. In the following I use the iText (Java) class, interface, and method names because I am more at home with Java. They should easily be translatable into iTextSharp (C#) names.
If you use the LocationTextExtractionStrategy, you can can use its a posteriori TextChunkFilter mechanism instead of the a priori FilteredRenderListener mechanism used in the sample you linked to. This mechanism has been introduced in version 5.3.3.
For this you first parse the whole page content using the LocationTextExtractionStrategy without any FilteredRenderListener filtering applied. This makes the strategy object collect TextChunk objects for all PDF text objects on the page containing the associated base line segment.
Then you call the strategy's getResultantText overload with a TextChunkFilter argument (instead of the regular no-argument overload):
public String getResultantText(TextChunkFilter chunkFilter)
You call it with a different TextChunkFilter instance for each table cell. You have to implement this filter interface which is not too difficult as it only defines one method:
public static interface TextChunkFilter
{
/**
* #param textChunk the chunk to check
* #return true if the chunk should be allowed
*/
public boolean accept(TextChunk textChunk);
}
So the accept method of the filter for a given cell must test whether the text chunk in question is inside your cell.
(Instead of separate instances for each cell you can of course also create one instance whose parameters, i.e. cell coordinates, can be changed between getResultantText calls.)
PS: As mentioned by the OP, this TextChunkFilter has not yet been ported to iTextSharp. It should not be hard to do so, though, only one small interface and one method to add to the strategy.
PPS: In a comment sschuberth asked
Do you then still call PdfTextExtractor.getTextFromPage() when using getResultantText(), or does it somehow replace that call? If so, how to you then specify the page to extract to?
Actually PdfTextExtractor.getTextFromPage() internally already uses the no-argument getResultantText() overload:
public static String getTextFromPage(PdfReader reader, int pageNumber, TextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
return parser.processContent(pageNumber, strategy, additionalContentOperators).getResultantText();
}
To make use of a TextChunkFilter you could simply build a similar convenience method, e.g.
public static String getTextFromPage(PdfReader reader, int pageNumber, LocationTextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators, TextChunkFilter chunkFilter) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
return parser.processContent(pageNumber, strategy, additionalContentOperators).getResultantText(chunkFilter);
}
In the context at hand, though, in which we want to parse the page content only once and apply multiple filters, one for each cell, we might generalize this to:
public static List<String> getTextFromPage(PdfReader reader, int pageNumber, LocationTextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators, Iterable<TextChunkFilter> chunkFilters) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
parser.processContent(pageNumber, strategy, additionalContentOperators)
List<String> result = new ArrayList<>();
for (TextChunkFilter chunkFilter : chunkFilters)
{
result.add(strategy).getResultantText(chunkFilter);
}
return result;
}
(You can make this look fancier by using Java 8 collection streaming instead of the old'fashioned for loop.)

Here's my take on how to extract text from a table-like structure in a PDF using itextsharp. It returns a collection of rows and each row contains a collection of interpreted columns. This may work for you on the premise that there is a gap between one column and the next which is greater than the average width of a single character. I also added an option to check for wrapped text within a virtual column. Your mileage may vary.
using (PdfReader pdfReader = new PdfReader(stream))
{
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
TableExtractionStrategy tableExtractionStrategy = new TableExtractionStrategy();
string pageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, tableExtractionStrategy);
var table = tableExtractionStrategy.GetTable();
}
}
public class TableExtractionStrategy : LocationTextExtractionStrategy
{
public float NextCharacterThreshold { get; set; } = 1;
public int NextLineLookAheadDepth { get; set; } = 500;
public bool AccomodateWordWrapping { get; set; } = true;
private List<TableTextChunk> Chunks { get; set; } = new List<TableTextChunk>();
public override void RenderText(TextRenderInfo renderInfo)
{
base.RenderText(renderInfo);
string text = renderInfo.GetText();
Vector bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
Rectangle rectangle = new Rectangle(bottomLeft[Vector.I1], bottomLeft[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Chunks.Add(new TableTextChunk(rectangle, text));
}
public List<List<string>> GetTable()
{
List<List<string>> lines = new List<List<string>>();
List<string> currentLine = new List<string>();
float? previousBottom = null;
float? previousRight = null;
StringBuilder currentString = new StringBuilder();
// iterate through all chunks and evaluate
for (int i = 0; i < Chunks.Count; i++)
{
TableTextChunk chunk = Chunks[i];
// determine if we are processing the same row based on defined space between subsequent chunks
if (previousBottom.HasValue && previousBottom == chunk.Rectangle.Bottom)
{
if (chunk.Rectangle.Left - previousRight > 1)
{
currentLine.Add(currentString.ToString());
currentString.Clear();
}
currentString.Append(chunk.Text);
previousRight = chunk.Rectangle.Right;
}
else
{
// if we are processing a new line let's check to see if this could be word wrapping behavior
bool isNewLine = true;
if (AccomodateWordWrapping)
{
int readAheadDepth = Math.Min(i + NextLineLookAheadDepth, Chunks.Count);
if (previousBottom.HasValue)
for (int j = i; j < readAheadDepth; j++)
{
if (previousBottom == Chunks[j].Rectangle.Bottom)
{
isNewLine = false;
break;
}
}
}
// if the text was not word wrapped let's treat this as a new table row
if (isNewLine)
{
if (currentString.Length > 0)
currentLine.Add(currentString.ToString());
currentString.Clear();
previousBottom = chunk.Rectangle.Bottom;
previousRight = chunk.Rectangle.Right;
currentString.Append(chunk.Text);
if (currentLine.Count > 0)
lines.Add(currentLine);
currentLine = new List<string>();
}
else
{
if (chunk.Rectangle.Left - previousRight > 1)
{
currentLine.Add(currentString.ToString());
currentString.Clear();
}
currentString.Append(chunk.Text);
previousRight = chunk.Rectangle.Right;
}
}
}
return lines;
}
private struct TableTextChunk
{
public Rectangle Rectangle;
public string Text;
public TableTextChunk(Rectangle rect, string text)
{
Rectangle = rect;
Text = text;
}
public override string ToString()
{
return Text + " (" + Rectangle.Left + ", " + Rectangle.Bottom + ")";
}
}
}

Related

How to set variable inside of a Coroutine after yielding a webrequest

Okay I will try and explain this to the best of my ability. I have searched and searched all day for a solution to this issue but can't seem to find it. The problem that I am having is that I have a list of scriptable objects that I am basically using for custom properties to create gameobjects off of. One of those properties that I need to get is a Texture2D that I turn into a sprite. Therefor, I am using UnityWebRequest in a Coroutine and am having to yield the response. After I get the response I am trying to set the variable. However even using Lambdas it seems to me that if I yield return the response before the result it will not set the variable. So every time I check the variable after the Coroutine it comes back null. If someone could enlighten me with what I am missing here that would be just great!
Here is the Scriptable Object Class I am using.
[CreateAssetMenu(fileName = "new movie",menuName = "movie")]
public class MovieTemplate : ScriptableObject
{
public string Title;
public string Description;
public string ImgURL;
public string mainURL;
public string secondaryURL;
public Sprite thumbnail;
}
Here is the call to the Coroutine
foreach (var item in nodes)
{
templates.Add(GetMovieData(item));
}
foreach (MovieTemplate movie in templates)
{
StartCoroutine(GetMovieImage(movie.ImgURL, result =>
{
movie.thumbnail = result;
}));
}
Here is the Coroutine itself
IEnumerator GetMovieImage(string url, System.Action<Sprite> result)
{
using (UnityWebRequest web = UnityWebRequestTexture.GetTexture(url))
{
yield return web.SendWebRequest();
var img = DownloadHandlerTexture.GetContent(web);
result(Sprite.Create(img, new Rect(0, 0, img.width, img.height), Vector2.zero));
}
}
From what you desribe it still seems that the texture is somehow disposed as soon as the routine finishes. My guess would be that it happens due to the using block.
I would store the original texture reference
[CreateAssetMenu(fileName = "new movie",menuName = "movie")]
public class MovieTemplate : ScriptableObject
{
public string Title;
public string Description;
public string ImgURL;
public string mainURL;
public string secondaryURL;
public Sprite thumbnail;
public Texture texture;
public void SetSprite(Sprite newSprite, Texture newTexture)
{
if(texture) Destroy(texture);
texture = newTexture;
var tex = (Texture2D) texture;
thumbnail = Sprite.Create(tex, new Rect(0, 0, tex.width, tex.height), Vector2.zero);
}
}
So you can keep track of the texture itself as well, let it not be collected by the GC but also destroy it when not needed anymore. Usually Texture2D is removed by the GC as soon as there is no reference to it anymore but Texture2D created by UnityWebRequest might behave different.
Than in the webrequest return the texture and don't use using
IEnumerator GetMovieImage(string url, System.Action<Texture> result)
{
UnityWebRequest web = UnityWebRequestTexture.GetTexture(url));
yield return web.SendWebRequest();
if(!web.error)
{
result?.Invoke(DownloadHandlerTexture.GetContent(web));
}
else
{
Debug.LogErrorFormat(this, "Download error: {0} - {1}", web.responseCode, web.error);
}
}
and finally use it like
for (int i = 0; i < templates.Count; i++)
{
int index = i;//If u use i, it will be overriden too so we make a copy of it
StartCoroutine(
GetMovieImage(
templates[index].ImgURL,
result =>
{
templates[index].SetSprite(result);
})
);
}
The problem is with this section of your code :
foreach (MovieTemplate movie in templates)
{
StartCoroutine(GetMovieImage(movie.ImgURL, result =>
{
movie.thumbnail = result;//wrong movie obj
}));
}
Here you will loose refrence to movie object(override by foreach) before the result of callback arrive .
Change it to something like this :
foreach (int i=0;i<templates.Length;i++)
{
int index= i;//If u use i, it will be overriden too so we make a copy of it
StartCoroutine(GetMovieImage(movie.ImgURL, result =>
{
templates[index].thumbnail = result;
}));
}

How to replace / remove text from a PDF file

How would I go about replacing / removing text from a PDF file?
I have a PDF file that I obtained somewhere, and I want to be able to replace some text within it.
Or, I have a PDF file that I want to obscure (redact) some of the text within it so that it's no longer visible [and so that it looks cool, like the CIA files].
Or, I have a PDF that contains global Javascript that I want to stop from interrupting my use of the PDF.
This is possible in a limited fashion with the use of iText / iTextSharp.
It will only work with Tj/TJ opcodes (i.e. standard text, not text embedded in images, or drawn with shapes).
You need to override the default PdfContentStreamProcessor to act on the page content streams, as presented by Mkl here Removing Watermark from PDF iTextSharp. Inherit from this class, and in your new class look for the Tj/TJ opcodes, the operand(s) will generally be the text element(s) (for a TJ this may not be straightforward text, and may require further parsing of all the operands).
A pretty basic example of some of the flexibility around iTextSharp is available from this github repository https://github.com/bevanweiss/PdfEditor (code excerpts below also)
NOTE: This utilises the AGPL version of iTextSharp (and is hence also AGPL), so if you will be distributing executables derived from this code or allowing others to interact with those executables in any way then you must also provide your modified source code. There is also no warranty, implied or expressed, related to this code. Use at your own peril.
PdfContentStreamEditor
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFCleaner
{
public class PdfContentStreamEditor : PdfContentStreamProcessor
{
/**
* This method edits the immediate contents of a page, i.e. its content stream.
* It explicitly does not descent into form xobjects, patterns, or annotations.
*/
public void EditPage(PdfStamper pdfStamper, int pageNum)
{
var pdfReader = pdfStamper.Reader;
var page = pdfReader.GetPageN(pageNum);
var pageContentInput = ContentByteUtils.GetContentBytesForPage(pdfReader, pageNum);
page.Remove(PdfName.CONTENTS);
EditContent(pageContentInput, page.GetAsDict(PdfName.RESOURCES), pdfStamper.GetUnderContent(pageNum));
}
/**
* This method processes the content bytes and outputs to the given canvas.
* It explicitly does not descent into form xobjects, patterns, or annotations.
*/
public virtual void EditContent(byte[] contentBytes, PdfDictionary resources, PdfContentByte canvas)
{
this.Canvas = canvas;
ProcessContent(contentBytes, resources);
this.Canvas = null;
}
/**
* This method writes content stream operations to the target canvas. The default
* implementation writes them as they come, so it essentially generates identical
* copies of the original instructions the {#link ContentOperatorWrapper} instances
* forward to it.
*
* Override this method to achieve some fancy editing effect.
*/
protected virtual void Write(PdfContentStreamProcessor processor, PdfLiteral operatorLit, List<PdfObject> operands)
{
var index = 0;
foreach (var pdfObject in operands)
{
pdfObject.ToPdf(null, Canvas.InternalBuffer);
Canvas.InternalBuffer.Append(operands.Count > ++index ? (byte) ' ' : (byte) '\n');
}
}
//
// constructor giving the parent a dummy listener to talk to
//
public PdfContentStreamEditor() : base(new DummyRenderListener())
{
}
//
// constructor giving the parent a dummy listener to talk to
//
public PdfContentStreamEditor(IRenderListener renderListener) : base(renderListener)
{
}
//
// Overrides of PdfContentStreamProcessor methods
//
public override IContentOperator RegisterContentOperator(string operatorString, IContentOperator newOperator)
{
var wrapper = new ContentOperatorWrapper();
wrapper.SetOriginalOperator(newOperator);
var formerOperator = base.RegisterContentOperator(operatorString, wrapper);
return (formerOperator is ContentOperatorWrapper operatorWrapper ? operatorWrapper.GetOriginalOperator() : formerOperator);
}
public override void ProcessContent(byte[] contentBytes, PdfDictionary resources)
{
this.Resources = resources;
base.ProcessContent(contentBytes, resources);
this.Resources = null;
}
//
// members holding the output canvas and the resources
//
protected PdfContentByte Canvas = null;
protected PdfDictionary Resources = null;
//
// A content operator class to wrap all content operators to forward the invocation to the editor
//
class ContentOperatorWrapper : IContentOperator
{
public IContentOperator GetOriginalOperator()
{
return _originalOperator;
}
public void SetOriginalOperator(IContentOperator op)
{
this._originalOperator = op;
}
public void Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
{
if (_originalOperator != null && !"Do".Equals(oper.ToString()))
{
_originalOperator.Invoke(processor, oper, operands);
}
((PdfContentStreamEditor)processor).Write(processor, oper, operands);
}
private IContentOperator _originalOperator = null;
}
//
// A dummy render listener to give to the underlying content stream processor to feed events to
//
class DummyRenderListener : IRenderListener
{
public void BeginTextBlock() { }
public void RenderText(TextRenderInfo renderInfo) { }
public void EndTextBlock() { }
public void RenderImage(ImageRenderInfo renderInfo) { }
}
}
}
TextReplaceStreamEditor
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFCleaner
{
public class TextReplaceStreamEditor : PdfContentStreamEditor
{
public TextReplaceStreamEditor(string MatchPattern, string ReplacePattern)
{
_matchPattern = MatchPattern;
_replacePattern = ReplacePattern;
}
private string _matchPattern;
private string _replacePattern;
protected override void Write(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
{
var operatorString = oper.ToString();
if ("Tj".Equals(operatorString) || "TJ".Equals(operatorString))
{
for(var i = 0; i < operands.Count; i++)
{
if(!operands[i].IsString())
continue;
var text = operands[i].ToString();
if(Regex.IsMatch(text, _matchPattern))
{
operands[i] = new PdfString(Regex.Replace(text, _matchPattern, _replacePattern));
}
}
}
base.Write(processor, oper, operands);
}
}
}
TextRedactStreamEditor
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFCleaner
{
public class TextRedactStreamEditor : PdfContentStreamEditor
{
public TextRedactStreamEditor(string MatchPattern) : base(new RedactRenderListener(MatchPattern))
{
_matchPattern = MatchPattern;
}
private string _matchPattern;
protected override void Write(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
{
base.Write(processor, oper, operands);
}
public override void EditContent(byte[] contentBytes, PdfDictionary resources, PdfContentByte canvas)
{
((RedactRenderListener)base.RenderListener).SetCanvas(canvas);
base.EditContent(contentBytes, resources, canvas);
}
}
//
// A pretty simple render listener, all we care about it text stuff.
// We listen out for text blocks, look for our text, and then put a
// black box over it.. text 'redacted'
//
class RedactRenderListener : IRenderListener
{
private PdfContentByte _canvas;
private string _matchPattern;
public RedactRenderListener(string MatchPattern)
{
_matchPattern = MatchPattern;
}
public RedactRenderListener(PdfContentByte Canvas, string MatchPattern)
{
_canvas = Canvas;
_matchPattern = MatchPattern;
}
public void SetCanvas(PdfContentByte Canvas)
{
_canvas = Canvas;
}
public void BeginTextBlock() { }
public void RenderText(TextRenderInfo renderInfo)
{
var text = renderInfo.GetText();
var match = Regex.Match(text, _matchPattern);
if(match.Success)
{
var p1 = renderInfo.GetCharacterRenderInfos()[match.Index].GetAscentLine().GetStartPoint();
var p2 = renderInfo.GetCharacterRenderInfos()[match.Index+match.Length].GetAscentLine().GetEndPoint();
var p3 = renderInfo.GetCharacterRenderInfos()[match.Index+match.Length].GetDescentLine().GetEndPoint();
var p4 = renderInfo.GetCharacterRenderInfos()[match.Index].GetDescentLine().GetStartPoint();
_canvas.SaveState();
_canvas.SetColorStroke(BaseColor.BLACK);
_canvas.SetColorFill(BaseColor.BLACK);
_canvas.MoveTo(p1[Vector.I1], p1[Vector.I2]);
_canvas.LineTo(p2[Vector.I1], p2[Vector.I2]);
_canvas.LineTo(p3[Vector.I1], p3[Vector.I2]);
_canvas.LineTo(p4[Vector.I1], p4[Vector.I2]);
_canvas.ClosePathFillStroke();
_canvas.RestoreState();
}
}
public void EndTextBlock() { }
public void RenderImage(ImageRenderInfo renderInfo) { }
}
}
Using them with iTextSharp
var reader = new PdfReader("SRC FILE PATH GOES HERE");
var dstFile = File.Open("DST FILE PATH GOES HERE", FileMode.Create);
pdfStamper = new PdfStamper(reader, output, reader.PdfVersion, false);
// We don't need to auto-rotate, as the PdfContentStreamEditor will already deal with pre-rotated space..
// if we enable this we will inadvertently rotate the content.
pdfStamper.RotateContents = false;
// This is for the Text Replace
var replaceTextProcessor = new TextReplaceStreamEditor(
"TEXT TO REPLACE HERE",
"TEXT TO SUBSTITUTE IN HERE");
for(int i=1; i <= reader.NumberOfPages; i++)
replaceTextProcessor.EditPage(pdfStamper, i);
// This is for the Text Redact
var redactTextProcessor = new TextRedactStreamEditor(
"TEXT TO REDACT HERE");
for(int i=1; i <= reader.NumberOfPages; i++)
redactTextProcessor.EditPage(pdfStamper, i);
// Since our redacting just puts a box over the top, we should secure the document a bit... just to prevent people copying/pasting the text behind the box.. we also prevent text to speech processing of the file, otherwise the 'hidden' text will be spoken
pdfStamper.Writer.SetEncryption(null,
Encoding.UTF8.GetBytes("ownerPassword"),
PdfWriter.AllowDegradedPrinting | PdfWriter.AllowPrinting,
PdfWriter.ENCRYPTION_AES_256);
// hey, lets get rid of Javascript too, because it's annoying
pdfStamper.Javascript = "";
// and then finally we close our files (saving it in the process)
pdfStamper.Close();
reader.Close();
You can use GroupDocs.Redaction (available for .NET) for replacing or removing the text from PDF documents. You can perform the exact phrase, case-sensitive and regular expression redaction (removal) of the text. The following code snippet replaces the word "candy" with "[redacted]" in the loaded PDF document.
C#:
using (Document doc = Redactor.Load("D:\\candy.pdf"))
{
doc.RedactWith(new ExactPhraseRedaction("candy", new ReplacementOptions("[redacted]")));
// Save the document to "*_Redacted.*" file.
doc.Save(new SaveOptions() { AddSuffix = true, RasterizeToPDF = false });
}
Disclosure: I work as Developer Evangelist at GroupDocs.

Telerik Mail Merge - Friendly Names (using RadRichTextBox)

I think i'm missing something obvious...
I'm using the Telerik Rad controls for WPF but i assume that the Rich text box uses some similar implementation for the mail merge functionality.
I want to have some friendly names on my mail merge fields. (namely spaces in the field names)
So i have a class for instance
Public Class someclass
{
<DisplayName("This is the complex description of the field")>
Public property thisfieldnamehasacomplexdescription as string
Public property anothercomplexfield as string
}
This is the only way i know to get "Friendly" names in the dropdown that is the mail merge.
So the two fields turn up okay as :
"This is the complex description of the field"
"anothercomplexfield"
but only anothercomplexfield actually populates with data when you do the merge.
Am i going to have to template the raddropdownbutton that holds the mail merge fields?
Is there an example of this somewhere?
Also a sub question. How do i add a scroll bar on these things?
(also i know this board is not a TELERIK specific board (duh!) but this might be useful to someone in the future. So i'll copy the answer i get from Telerik into here!
http://www.telerik.com/community/forums/wpf/richtextbox/558428-radrichtextbox-mailmerge---using-displayname-to-create-a-friendly-name-with-spaces.aspx )
This is what telerik gave me:
With the default MergeFields, it is not possible to change the display name fragment of the field in order to achieve a more friendly look. This should be possible if you implement a custom MergeField by deriving from the MergeField class. Here is a sample implementation that shows how this can be done:
public class CustomMergeField : MergeField
{
private const string CustomFieldName = "CustomField";
static CustomMergeField()
{
CodeBasedFieldFactory.RegisterFieldType(CustomMergeField.CustomFieldName, () => { return new CustomMergeField(); });
}
public override string FieldTypeName
{
get
{
return CustomMergeField.CustomFieldName;
}
}
public override Field CreateInstance()
{
return new CustomMergeField();
}
protected override DocumentFragment GetDisplayNameFragment()
{
return base.CreateFragmentFromText(string.Format(Field.DisplayNameFragmentFormat, this.GetFriendlyFieldName(this.PropertyPath)));
}
private string GetFriendlyFieldName(string fieldName)
{
int lettersInEnglishAlphabet = 26;
List<char> separators = new List<char>(lettersInEnglishAlphabet);
for (int i = 0; i < lettersInEnglishAlphabet; i++)
{
separators.Add((char)('A' + i));
}
StringBuilder newFieldName = new StringBuilder();
int previousIndex = 0;
for (int i = 1; i < fieldName.Length; i++)
{
if (separators.Contains(fieldName[i]))
{
if (previousIndex > 0)
{
newFieldName.Append(" ");
}
newFieldName.Append(fieldName.Substring(previousIndex, i - previousIndex));
previousIndex = i;
}
}
newFieldName.Append(" " + fieldName.Substring(previousIndex));
return newFieldName.ToString();
}
}
Note that the fragment that is shown when the DisplayMode is Code cannot be changed.
As for your other question, you can change the content of the dropdown button to show the friendly name of the fields and to include a scrollbar in the following way:
1. First, remove the binding of the button to the InsertMergeFieldEmptyCommand from XAML and give it a name (e.g. insertMergeField).
2. Next, add the following code in code-behind:
AddMergeFieldsInDropDownContent(this.insertMergeFieldButton);
private void AddMergeFieldsInDropDownContent(RadRibbonDropDownButton radRibbonDropDownButton)
{
Grid grid = new Grid();
grid.RowDefinitions.Add(new RowDefinition() { Height = new GridLength(100, GridUnitType.Pixel) });
ScrollViewer scrollViewer = new ScrollViewer();
scrollViewer.VerticalScrollBarVisibility = ScrollBarVisibility.Auto;
StackPanel stackPanel = new StackPanel();
foreach (string fieldName in this.editor.Document.MailMergeDataSource.GetColumnNames())
{
RadRibbonButton fieldButton = new RadRibbonButton()
{
Text = this.GetFriendlyFieldName(fieldName),
Size = ButtonSize.Medium,
HorizontalAlignment = HorizontalAlignment.Stretch,
HorizontalContentAlignment = HorizontalAlignment.Left
};
fieldButton.Command = this.editor.Commands.InsertFieldCommand;
fieldButton.CommandParameter = new MergeField() { PropertyPath = fieldName };
//or
//fieldButton.CommandParameter = new CustomMergeField() { PropertyPath = fieldName };
stackPanel.Children.Add(fieldButton);
}
stackPanel.HorizontalAlignment = System.Windows.HorizontalAlignment.Stretch;
scrollViewer.Content = stackPanel;
grid.Children.Add(scrollViewer);
radRibbonDropDownButton.DropDownContent = grid;
}
You can, of course optimize the code of the GetFriendlyName method and add it in a way that will be available by both classes.

How to customize data serialization based on content in WCF?

Trying to serialize a union-like data-type. There is an enum field indicating the type of data stored in the union, and a variety of possible field types.
The desired result is DataContractSerializer produced XML which contains just the enum, and the relevant field.
Possible solutions, none of which have been attempted yet, are:
Use a custom serializer and mark the union properties with a custom attribute, similar to this question. The custom serializer would strip out the members not required.
Use ISerializationSurrogate and serialize a different object which just contains the relevant data.
Don't use separate fields in the union, use one object field (this could be used as part of the implementation of the ISerializationSurrogate approach).
Other... ?
For example:
[DataContract]
public class WCFTestUnion
{
public enum EUnionType
{
[EnumMember]
Bool,
[EnumMember]
String,
[EnumMember]
Dictionary,
[EnumMember]
Invalid
};
EUnionType unionType = EUnionType.Invalid;
bool boolValue = true;
string stringValue = "Hello";
IDictionary<object, object> dictionaryValue = null;
// Could use custom attribute here ?
[DataMember]
public bool BoolValue
{
get { return this.boolValue; }
set { this.boolValue = value; }
}
// Could use custom attribute here ?
[DataMember]
public string StringValue
{
get { return this.stringValue; }
set { this.stringValue = value; }
}
// Could use custom attribute here ?
[DataMember]
public IDictionary<object, object> DictionaryValue
{
get { return this.dictionaryValue; }
set { this.dictionaryValue = value; }
}
[DataMember]
public EUnionType UnionType
{
get { return this.unionType; }
set { this.unionType = value; }
}
} // Ends class WCFTestUnion
Test
class TestSerializeUnion
{
internal static void Test()
{
Console.WriteLine("===TestSerializeUnion.Test()===");
WCFTestUnion u = new WCFTestUnion();
u.UnionType = WCFTestUnion.EUnionType.Dictionary;
u.DictionaryValue = new Dictionary<object, object>();
u.DictionaryValue[1] = "one";
u.DictionaryValue["two"] = 2;
System.Runtime.Serialization.DataContractSerializer serialize = new System.Runtime.Serialization.DataContractSerializer(typeof(WCFTestUnion));
System.IO.Stream stream = new System.IO.MemoryStream();
serialize.WriteObject(stream, u);
stream.Seek(0, System.IO.SeekOrigin.Begin);
byte[] buffer = new byte[stream.Length];
int length = checked((int)stream.Length);
int read = stream.Read(buffer, 0, length);
while (read < stream.Length)
{
read += stream.Read(buffer, 0, length - read);
}
string xml = Encoding.Default.GetString(buffer);
System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
doc.LoadXml(xml);
System.Xml.XmlTextWriter xmlwriter = new System.Xml.XmlTextWriter(Console.Out);
xmlwriter.Formatting = System.Xml.Formatting.Indented;
doc.WriteContentTo(xmlwriter);
xmlwriter.Flush();
Console.WriteLine();
}
} // Ends class TestSerializeUnion
Output:
<WCFTestUnion xmlns="http://schemas.datacontract.org/2004/07/WCFTestServiceContracts" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<BoolValue>true</BoolValue>
<DictionaryValue xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
<a:KeyValueOfanyTypeanyType>
<a:Key i:type="b:int" xmlns:b="http://www.w3.org/2001/XMLSchema">1</a:Key>
<a:Value i:type="b:string" xmlns:b="http://www.w3.org/2001/XMLSchema">one</a:Value>
</a:KeyValueOfanyTypeanyType>
<a:KeyValueOfanyTypeanyType>
<a:Key i:type="b:string" xmlns:b="http://www.w3.org/2001/XMLSchema">two</a:Key>
<a:Value i:type="b:int" xmlns:b="http://www.w3.org/2001/XMLSchema">2</a:Value>
</a:KeyValueOfanyTypeanyType>
</DictionaryValue>
<StringValue>Hello </StringValue>
<UnionType>Dictionary</UnionType>
</WCFTestUnion>
Desired Output (only field being used is serialized, along with enum):
<WCFTestUnion xmlns="http://schemas.datacontract.org/2004/07/WCFTestServiceContracts" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<DictionaryValue xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
<a:KeyValueOfanyTypeanyType>
<a:Key i:type="b:int" xmlns:b="http://www.w3.org/2001/XMLSchema">1</a:Key>
<a:Value i:type="b:string" xmlns:b="http://www.w3.org/2001/XMLSchema">one</a:Value>
</a:KeyValueOfanyTypeanyType>
<a:KeyValueOfanyTypeanyType>
<a:Key i:type="b:string" xmlns:b="http://www.w3.org/2001/XMLSchema">two</a:Key>
<a:Value i:type="b:int" xmlns:b="http://www.w3.org/2001/XMLSchema">2</a:Value>
</a:KeyValueOfanyTypeanyType>
</DictionaryValue>
<UnionType>Dictionary</UnionType>
</WCFTestUnion>
You do have several options here. What you use depends on the complexity of this scenario (where else you have to do something like this, how often and in what ways you have to serialize this data, performance, etc.) Take a look at these options, ask away if you have more questions, but mostly, I recommend you just play and experiment with multiple strategies from the list below before picking one or a hybrid solution.
Use a data contract resolver. Provides a mechanism for dynamically mapping types to and from wire representations during serialization and deserialization, giving you flexibility to support far more types than you can out-of-the-box.
Use IObjectReference. You can have a class which implements and returns a reference to a different object after it has been deserialized.
Use a data contract surrogate. This is different from the serialization surrogates you're referring to, but also similar. I think these might work out nicely for you

How can I use Lucene's PriorityQueue when I don't know the max size at create time?

I built a custom collector for Lucene.Net, but I can't figure out how to order (or page) the results. Everytime Collect gets called, I can add the result to an internal PriorityQueue, which I understand is the correct way to do this.
I extended the PriorityQueue, but it requires a size parameter on creation. You have to call Initialize in the constructor and pass in the max size.
However, in a collector, the searcher just calls Collect when it gets a new result, so I don't know how many results I have when I create the PriorityQueue. Based on this, I can't figure out how to make the PriorityQueue work.
I realize I'm probably missing something simple here...
PriorityQueue is not SortedList or SortedDictionary.
It is a kind of sorting implementation where it returns the top M results(your PriorityQueue's size) of N elements. You can add with InsertWithOverflow as many items as you want, but it will only hold only the top M elements.
Suppose your search resulted in 1000000 hits. Would you return all of the results to user?
A better way would be to return the top 10 elements to the user(using PriorityQueue(10)) and
if the user requests for the next 10 result, you can make a new search with PriorityQueue(20) and return the next 10 elements and so on.
This is the trick most search engines like google uses.
Everytime Commit gets called, I can add the result to an internal PriorityQueue.
I can not undestand the relationship between Commit and search, Therefore I will append a sample usage of PriorityQueue:
public class CustomQueue : Lucene.Net.Util.PriorityQueue<Document>
{
public CustomQueue(int maxSize): base()
{
Initialize(maxSize);
}
public override bool LessThan(Document a, Document b)
{
//a.GetField("field1")
//b.GetField("field2");
return //compare a & b
}
}
public class MyCollector : Lucene.Net.Search.Collector
{
CustomQueue _queue = null;
IndexReader _currentReader;
public MyCollector(int maxSize)
{
_queue = new CustomQueue(maxSize);
}
public override bool AcceptsDocsOutOfOrder()
{
return true;
}
public override void Collect(int doc)
{
_queue.InsertWithOverflow(_currentReader.Document(doc));
}
public override void SetNextReader(IndexReader reader, int docBase)
{
_currentReader = reader;
}
public override void SetScorer(Scorer scorer)
{
}
}
searcher.Search(query,new MyCollector(10)) //First page.
searcher.Search(query,new MyCollector(20)) //2nd page.
searcher.Search(query,new MyCollector(30)) //3rd page.
EDIT for #nokturnal
public class MyPriorityQueue<TObj, TComp> : Lucene.Net.Util.PriorityQueue<TObj>
where TComp : IComparable<TComp>
{
Func<TObj, TComp> _KeySelector;
public MyPriorityQueue(int size, Func<TObj, TComp> keySelector) : base()
{
_KeySelector = keySelector;
Initialize(size);
}
public override bool LessThan(TObj a, TObj b)
{
return _KeySelector(a).CompareTo(_KeySelector(b)) < 0;
}
public IEnumerable<TObj> Items
{
get
{
int size = Size();
for (int i = 0; i < size; i++)
yield return Pop();
}
}
}
var pq = new MyPriorityQueue<Document, string>(3, doc => doc.GetField("SomeField").StringValue);
foreach (var item in pq.Items)
{
}
The reason Lucene's Priority Queue is size limited is because it uses a fixed size implementation that is very fast.
Think about what is the reasonable maximum number of results to get back at a time and use that number, the "waste" for when the results are few is not that bad for the benefit it gains.
On the other hand, if you have such a huge number of results that you cannot hold them, then how are you going to be serving/displaying them? Keep in mind that this is for "top" hits so as you iterate through the results you will be hitting less and less relevant ones anyway.