What am I doing wrong with my algorithm to compare sound samples? - naudio

I want to compare and recognize two sound streams. I created my own algorithm, but it is not working as well as I would like. For example, when I compare the letters "A, B, C" with "D, E, F", or the word "facebook" with "music", the algorithm reports a match even though these are not the same words. Is my algorithm simply too imprecise, or is it the quality of the sound recorded with a laptop microphone?
The concept of my comparison algorithm:
I take, for example, 100 samples from one stream (possibly from the middle of the track) and slide them along the second stream in a loop, comparing against samples 0-99 first, then 1-100, then 2-101, and so on.
My program has several tracks to compare against one input track, so the algorithm picks the best match (the most similar segment) from each track. Unfortunately, it returns wrong results.
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.ComponentModel;
using System.IO;
using System.Runtime.CompilerServices;
using System.Windows;
using Controller.Annotations;
using NAudio.Wave;
namespace Controller.Models
{
class DecompositionOfSound
{
private int _numberOfSimilarSamples;
private String _stream;
public string Stream
{
get { return _stream; }
set { _stream = value; }
}
public int IloscPodobnychProbek
{
get { return _numberOfSimilarSamples; }
set { _numberOfSimilarSamples = value; }
}
public DecompositionOfSound(string stream)
{
_stream = stream;
SaveSamples(stream);
}
private void SaveSamples(string stream)
{
var wave = new WaveChannel32(new WaveFileReader(stream));
Samples = new byte[wave.Length];
wave.Read(Samples, 0, (int) wave.Length);
}
private byte[] _samples;
public byte[] Samples
{
get { return _samples; }
set { _samples = value; }
}
}
class Sample: INotifyPropertyChanged
{
#region Fields
private IList<DecompositionOfSound> _listoOfSoundSamples = new ObservableCollection<DecompositionOfSound>();
private string[] _filePaths;
#endregion
#region Property
public string[] FilePaths
{
get { return _filePaths; }
set { _filePaths = value; }
}
public IList<DecompositionOfSound> ListaSciezekDzwiekowych
{
get { return _listoOfSoundSamples; }
set { _listoOfSoundSamples = value; }
}
#endregion
#region Methods
public Sample()
{
LoadSamples(); // must be refreshed after every new recording !!!
}
public void DisplayMatchingOfSamples()
{
foreach (var decompositionOfSound in ListaSciezekDzwiekowych)
{
MessageBox.Show(decompositionOfSound.IloscPodobnychProbek.ToString());
}
}
public DecompositionOfSound BestMatchingOfSamples()
{
int max=0;
DecompositionOfSound referenceToObject = null;
foreach (var numberOfMatching in _listoOfSoundSamples)
{
if (numberOfMatching.IloscPodobnychProbek > max)
{
max = numberOfMatching.IloscPodobnychProbek;
referenceToObject = numberOfMatching;
}
}
return referenceToObject;
}
public void LoadSamples()
{
int i = 0;
_filePaths = Directory.GetFiles(@"Samples", "*.wav");
while (i < _filePaths.Length)
{
ListaSciezekDzwiekowych.Add(new DecompositionOfSound(_filePaths[i]));
i++;
}
}
public void CheckMatchingOfWord(byte[] inputSound,double eps)
{
foreach (var probka in _listoOfSoundSamples)
{
CompareBufforsOfSamples(inputSound, probka, eps);
}
}
public void CheckMatchingOfWord(String inputSound,int iloscProbek, double eps)
{
var wave = new WaveChannel32(new WaveFileReader(inputSound));
var samples = new byte[wave.Length];
wave.Read(samples, 0, (int)wave.Length);
var licznik = 0;
var samplesTmp = new byte[iloscProbek];
while (licznik < iloscProbek)
{
samplesTmp[licznik] = samples[licznik + (wave.Length >> 1)];
licznik++;
}
foreach (var probka in _listoOfSoundSamples)
{
CompareBufforsOfSamples(samplesTmp, probka, eps);
}
}
private void CompareBufforsOfSamples(byte[] inputSound, DecompositionOfSound samples, double eps)
{
int max = 0;
for (int i = 0; i < (samples.Samples.Length - inputSound.Length); i++)
{
int counter = 0;
for (int j = 0; j < inputSound.Length; j++)
{
if (inputSound[j] * eps <= samples.Samples[i + j] &&
(inputSound[j] + inputSound[j] *(1 - eps)) >= samples.Samples[i + j])
{
counter++;
}
}
if (counter > max) max = counter;
}
samples.IloscPodobnychProbek = max;
}
#endregion
#region INotifyPropertyChange
public event PropertyChangedEventHandler PropertyChanged;
[NotifyPropertyChangedInvocator]
protected virtual void OnPropertyChanged([CallerMemberName] string propertyName = null)
{
PropertyChangedEventHandler handler = PropertyChanged;
if (handler != null) handler(this, new PropertyChangedEventArgs(propertyName));
}
#endregion
}
}
When comparing all of the sound samples, the algorithm finds the track with the highest number of matching samples, but it is not the correct recording. Does my way of comparing the two recordings make sense at all, and how can I fix it to get the expected result? Could you help me find a solution to this problem? Sorry for my English.
Kind Regards

You simply cannot do sample-level comparisons on recordings to determine your match. Even given two recordings on the same computer of the same word spoken by the same person - ie: every detail exactly the same - the recorded samples will differ. Digital audio is like that. It can sound the same, but the actual recorded samples will not match.
Speech To Text is not simple, and neither is voice recognition (ie: checking the identity of a person from their voice).
Instead of samples you need to examine the frequency profiles of the recordings. Various sounds in natural speech have distinctive frequency distributions. Sibilants - the s sound - have a wide distribution of higher frequencies for instance, so are easy to spot - which is why they used to use sibilant detection for the old yes/no response detection on phone systems.
You can get the frequency profile of a waveform by using a Fast Fourier Transform on a block of samples. Run through the audio stream and do a series of FFTs to get a 2D map of the frequencies of the waveform, then look for interesting things like sibilants (lots of high frequency, very little low frequency).
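As a rough illustration of that idea (not a drop-in solution), here is a minimal sketch using NAudio's built-in FFT helper, NAudio.Dsp.FastFourierTransform. It assumes the byte buffer holds 32-bit IEEE float samples, as the WaveChannel32 in the question produces; the block size and the high-frequency band threshold are arbitrary illustration values, not tuned constants.
using System;
using NAudio.Dsp; // FastFourierTransform and Complex

static class FrequencyProfile
{
    // Returns the magnitude spectrum of one block of 32-bit float PCM starting at byteOffset.
    // fftSize must be a power of two (e.g. 1024).
    public static float[] MagnitudeSpectrum(byte[] floatPcm, int byteOffset, int fftSize)
    {
        var buffer = new Complex[fftSize];
        for (int i = 0; i < fftSize; i++)
        {
            float sample = BitConverter.ToSingle(floatPcm, byteOffset + i * 4);
            // Apply a Hann window to reduce spectral leakage before the FFT.
            buffer[i].X = (float)(sample * FastFourierTransform.HannWindow(i, fftSize));
            buffer[i].Y = 0f;
        }
        FastFourierTransform.FFT(true, (int)Math.Round(Math.Log(fftSize, 2.0)), buffer);
        // Only the first half of the bins carries information for a real-valued signal.
        var magnitudes = new float[fftSize / 2];
        for (int i = 0; i < magnitudes.Length; i++)
            magnitudes[i] = (float)Math.Sqrt(buffer[i].X * buffer[i].X + buffer[i].Y * buffer[i].Y);
        return magnitudes;
    }

    // Crude sibilant indicator: share of spectral energy above some bin index.
    public static float HighFrequencyRatio(float[] magnitudes, int highBandStartBin)
    {
        float total = 0f, high = 0f;
        for (int i = 0; i < magnitudes.Length; i++)
        {
            total += magnitudes[i];
            if (i >= highBandStartBin) high += magnitudes[i];
        }
        return total > 0f ? high / total : 0f;
    }
}
Running MagnitudeSpectrum over consecutive blocks gives the 2D time/frequency map described above, and a block with a high HighFrequencyRatio is a candidate sibilant.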
Of course you could just use one of the web-based Speech to Text APIs.
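If a web API is not an option, the same speech-to-text route can be tried offline with the System.Speech recognizer that ships with .NET on Windows (add a reference to the System.Speech assembly). This is only a sketch of the idea, not a tested solution, and the file path is hypothetical:
using System;
using System.Speech.Recognition; // requires a reference to the System.Speech assembly

class SpeechToTextSketch
{
    static void Main()
    {
        using (var recognizer = new SpeechRecognitionEngine())
        {
            recognizer.LoadGrammar(new DictationGrammar());     // free-form dictation
            recognizer.SetInputToWaveFile(@"Samples\word.wav"); // hypothetical path to a recorded word
            RecognitionResult result = recognizer.Recognize();  // synchronous, single utterance
            Console.WriteLine(result != null ? result.Text : "(nothing recognized)");
        }
    }
}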

Related

How to replace / remove text from a PDF file

How would I go about replacing / removing text from a PDF file?
I have a PDF file that I obtained somewhere, and I want to be able to replace some text within it.
Or, I have a PDF file that I want to obscure (redact) some of the text within it so that it's no longer visible [and so that it looks cool, like the CIA files].
Or, I have a PDF that contains global Javascript that I want to stop from interrupting my use of the PDF.
This is possible in a limited fashion with the use of iText / iTextSharp.
It will only work with Tj/TJ opcodes (i.e. standard text, not text embedded in images, or drawn with shapes).
You need to override the default PdfContentStreamProcessor to act on the page content streams, as presented by Mkl here: Removing Watermark from PDF iTextSharp. Inherit from this class, and in your new class look for the Tj/TJ opcodes; the operand(s) will generally be the text element(s) (for a TJ this may not be straightforward text and may require further parsing of all the operands).
A pretty basic example of some of the flexibility around iTextSharp is available from this GitHub repository: https://github.com/bevanweiss/PdfEditor (code excerpts are also included below).
NOTE: This utilises the AGPL version of iTextSharp (and is hence also AGPL), so if you will be distributing executables derived from this code or allowing others to interact with those executables in any way then you must also provide your modified source code. There is also no warranty, implied or expressed, related to this code. Use at your own peril.
PdfContentStreamEditor
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFCleaner
{
public class PdfContentStreamEditor : PdfContentStreamProcessor
{
/**
* This method edits the immediate contents of a page, i.e. its content stream.
* It explicitly does not descend into form xobjects, patterns, or annotations.
*/
public void EditPage(PdfStamper pdfStamper, int pageNum)
{
var pdfReader = pdfStamper.Reader;
var page = pdfReader.GetPageN(pageNum);
var pageContentInput = ContentByteUtils.GetContentBytesForPage(pdfReader, pageNum);
page.Remove(PdfName.CONTENTS);
EditContent(pageContentInput, page.GetAsDict(PdfName.RESOURCES), pdfStamper.GetUnderContent(pageNum));
}
/**
* This method processes the content bytes and outputs to the given canvas.
* It explicitly does not descend into form xobjects, patterns, or annotations.
*/
public virtual void EditContent(byte[] contentBytes, PdfDictionary resources, PdfContentByte canvas)
{
this.Canvas = canvas;
ProcessContent(contentBytes, resources);
this.Canvas = null;
}
/**
* This method writes content stream operations to the target canvas. The default
* implementation writes them as they come, so it essentially generates identical
* copies of the original instructions the {@link ContentOperatorWrapper} instances
* forward to it.
*
* Override this method to achieve some fancy editing effect.
*/
protected virtual void Write(PdfContentStreamProcessor processor, PdfLiteral operatorLit, List<PdfObject> operands)
{
var index = 0;
foreach (var pdfObject in operands)
{
pdfObject.ToPdf(null, Canvas.InternalBuffer);
Canvas.InternalBuffer.Append(operands.Count > ++index ? (byte) ' ' : (byte) '\n');
}
}
//
// constructor giving the parent a dummy listener to talk to
//
public PdfContentStreamEditor() : base(new DummyRenderListener())
{
}
//
// constructor giving the parent a dummy listener to talk to
//
public PdfContentStreamEditor(IRenderListener renderListener) : base(renderListener)
{
}
//
// Overrides of PdfContentStreamProcessor methods
//
public override IContentOperator RegisterContentOperator(string operatorString, IContentOperator newOperator)
{
var wrapper = new ContentOperatorWrapper();
wrapper.SetOriginalOperator(newOperator);
var formerOperator = base.RegisterContentOperator(operatorString, wrapper);
return (formerOperator is ContentOperatorWrapper operatorWrapper ? operatorWrapper.GetOriginalOperator() : formerOperator);
}
public override void ProcessContent(byte[] contentBytes, PdfDictionary resources)
{
this.Resources = resources;
base.ProcessContent(contentBytes, resources);
this.Resources = null;
}
//
// members holding the output canvas and the resources
//
protected PdfContentByte Canvas = null;
protected PdfDictionary Resources = null;
//
// A content operator class to wrap all content operators to forward the invocation to the editor
//
class ContentOperatorWrapper : IContentOperator
{
public IContentOperator GetOriginalOperator()
{
return _originalOperator;
}
public void SetOriginalOperator(IContentOperator op)
{
this._originalOperator = op;
}
public void Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
{
if (_originalOperator != null && !"Do".Equals(oper.ToString()))
{
_originalOperator.Invoke(processor, oper, operands);
}
((PdfContentStreamEditor)processor).Write(processor, oper, operands);
}
private IContentOperator _originalOperator = null;
}
//
// A dummy render listener to give to the underlying content stream processor to feed events to
//
class DummyRenderListener : IRenderListener
{
public void BeginTextBlock() { }
public void RenderText(TextRenderInfo renderInfo) { }
public void EndTextBlock() { }
public void RenderImage(ImageRenderInfo renderInfo) { }
}
}
}
TextReplaceStreamEditor
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFCleaner
{
public class TextReplaceStreamEditor : PdfContentStreamEditor
{
public TextReplaceStreamEditor(string MatchPattern, string ReplacePattern)
{
_matchPattern = MatchPattern;
_replacePattern = ReplacePattern;
}
private string _matchPattern;
private string _replacePattern;
protected override void Write(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
{
var operatorString = oper.ToString();
if ("Tj".Equals(operatorString) || "TJ".Equals(operatorString))
{
for(var i = 0; i < operands.Count; i++)
{
if(!operands[i].IsString())
continue;
var text = operands[i].ToString();
if(Regex.IsMatch(text, _matchPattern))
{
operands[i] = new PdfString(Regex.Replace(text, _matchPattern, _replacePattern));
}
}
}
base.Write(processor, oper, operands);
}
}
}
TextRedactStreamEditor
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFCleaner
{
public class TextRedactStreamEditor : PdfContentStreamEditor
{
public TextRedactStreamEditor(string MatchPattern) : base(new RedactRenderListener(MatchPattern))
{
_matchPattern = MatchPattern;
}
private string _matchPattern;
protected override void Write(PdfContentStreamProcessor processor, PdfLiteral oper, List<PdfObject> operands)
{
base.Write(processor, oper, operands);
}
public override void EditContent(byte[] contentBytes, PdfDictionary resources, PdfContentByte canvas)
{
((RedactRenderListener)base.RenderListener).SetCanvas(canvas);
base.EditContent(contentBytes, resources, canvas);
}
}
//
// A pretty simple render listener, all we care about is text stuff.
// We listen out for text blocks, look for our text, and then put a
// black box over it.. text 'redacted'
//
class RedactRenderListener : IRenderListener
{
private PdfContentByte _canvas;
private string _matchPattern;
public RedactRenderListener(string MatchPattern)
{
_matchPattern = MatchPattern;
}
public RedactRenderListener(PdfContentByte Canvas, string MatchPattern)
{
_canvas = Canvas;
_matchPattern = MatchPattern;
}
public void SetCanvas(PdfContentByte Canvas)
{
_canvas = Canvas;
}
public void BeginTextBlock() { }
public void RenderText(TextRenderInfo renderInfo)
{
var text = renderInfo.GetText();
var match = Regex.Match(text, _matchPattern);
if(match.Success)
{
var p1 = renderInfo.GetCharacterRenderInfos()[match.Index].GetAscentLine().GetStartPoint();
var p2 = renderInfo.GetCharacterRenderInfos()[match.Index+match.Length].GetAscentLine().GetEndPoint();
var p3 = renderInfo.GetCharacterRenderInfos()[match.Index+match.Length].GetDescentLine().GetEndPoint();
var p4 = renderInfo.GetCharacterRenderInfos()[match.Index].GetDescentLine().GetStartPoint();
_canvas.SaveState();
_canvas.SetColorStroke(BaseColor.BLACK);
_canvas.SetColorFill(BaseColor.BLACK);
_canvas.MoveTo(p1[Vector.I1], p1[Vector.I2]);
_canvas.LineTo(p2[Vector.I1], p2[Vector.I2]);
_canvas.LineTo(p3[Vector.I1], p3[Vector.I2]);
_canvas.LineTo(p4[Vector.I1], p4[Vector.I2]);
_canvas.ClosePathFillStroke();
_canvas.RestoreState();
}
}
public void EndTextBlock() { }
public void RenderImage(ImageRenderInfo renderInfo) { }
}
}
Using them with iTextSharp
var reader = new PdfReader("SRC FILE PATH GOES HERE");
var dstFile = File.Open("DST FILE PATH GOES HERE", FileMode.Create);
var pdfStamper = new PdfStamper(reader, dstFile, reader.PdfVersion, false);
// We don't need to auto-rotate, as the PdfContentStreamEditor will already deal with pre-rotated space..
// if we enable this we will inadvertently rotate the content.
pdfStamper.RotateContents = false;
// This is for the Text Replace
var replaceTextProcessor = new TextReplaceStreamEditor(
"TEXT TO REPLACE HERE",
"TEXT TO SUBSTITUTE IN HERE");
for(int i=1; i <= reader.NumberOfPages; i++)
replaceTextProcessor.EditPage(pdfStamper, i);
// This is for the Text Redact
var redactTextProcessor = new TextRedactStreamEditor(
"TEXT TO REDACT HERE");
for(int i=1; i <= reader.NumberOfPages; i++)
redactTextProcessor.EditPage(pdfStamper, i);
// Since our redacting just puts a box over the top, we should secure the document a bit... just to prevent people copying/pasting the text behind the box.. we also prevent text to speech processing of the file, otherwise the 'hidden' text will be spoken
pdfStamper.Writer.SetEncryption(null,
Encoding.UTF8.GetBytes("ownerPassword"),
PdfWriter.AllowDegradedPrinting | PdfWriter.AllowPrinting,
PdfWriter.ENCRYPTION_AES_256);
// hey, lets get rid of Javascript too, because it's annoying
pdfStamper.Javascript = "";
// and then finally we close our files (saving it in the process)
pdfStamper.Close();
reader.Close();
You can use GroupDocs.Redaction (available for .NET) for replacing or removing the text from PDF documents. You can perform the exact phrase, case-sensitive and regular expression redaction (removal) of the text. The following code snippet replaces the word "candy" with "[redacted]" in the loaded PDF document.
C#:
using (Document doc = Redactor.Load("D:\\candy.pdf"))
{
doc.RedactWith(new ExactPhraseRedaction("candy", new ReplacementOptions("[redacted]")));
// Save the document to "*_Redacted.*" file.
doc.Save(new SaveOptions() { AddSuffix = true, RasterizeToPDF = false });
}
Disclosure: I work as Developer Evangelist at GroupDocs.

Saving and displaying a high score with a variable across classes in Unity

I need to get the variable score from my other class and save it as a PlayerPrefs key. It doesn't seem to be getting set, because I have an in-game GUI label which shows "None" when there is no PlayerPrefs key "Highscore".
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using System.Threading;
public class Player : MonoBehaviour {
public Vector2 jumpForce = new Vector2(0, 300);
private Rigidbody2D rb2d;
Generator SwagScript;
GameObject generator;
// Use this for initialization
void Start () {
rb2d = gameObject.GetComponent<Rigidbody2D>();
generator = GameObject.FindGameObjectWithTag("Generator");
SwagScript = generator.GetComponent<Generator>();
}
// Update is called once per frame
void Update () {
if (Input.GetKeyUp("space"))
{
rb2d.velocity = Vector2.zero;
rb2d.AddForce(jumpForce);
}
Vector2 screenPosition = Camera.main.WorldToScreenPoint(transform.position);
if (screenPosition.y > Screen.height || screenPosition.y < 0)
{
Die();
}
}
void OnCollisionEnter2D(Collision2D other)
{
Die();
}
void Die()
{
if (PlayerPrefs.HasKey("HighScore"))
{
if (PlayerPrefs.GetInt("Highscore") < SwagScript.score)
{
PlayerPrefs.SetInt("HighScore", SwagScript.score);
}
else
{
PlayerPrefs.SetInt("HighScore", SwagScript.score);
}
}
Application.LoadLevel(Application.loadedLevel);
}
}
Unless you're creating the HighScore PlayerPrefs entry somewhere else, the Die() method will never create it, since if (PlayerPrefs.HasKey("HighScore")) will always return false.
Just remove that if.
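For illustration, here is a minimal sketch of Die() with that guard removed. Note also that the original code mixes the key names "HighScore" and "Highscore"; PlayerPrefs keys are case-sensitive, so one consistent name is used below.
void Die()
{
    // GetInt returns the supplied default (0) when the key does not exist yet,
    // so no HasKey check is needed before the first write.
    if (SwagScript.score > PlayerPrefs.GetInt("HighScore", 0))
    {
        PlayerPrefs.SetInt("HighScore", SwagScript.score);
    }
    Application.LoadLevel(Application.loadedLevel);
}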

My Akka.Net Demo is incredibly slow

I am trying to get a proof of concept running with akka.net. I am sure that I am doing something terribly wrong, but I can't figure out what it is.
I want my actors to form a graph of nodes. Later this will be a complex graph of business objects, but for now I want to try a simple linear structure: ten nodes, #0 through #9, each connected to the previous one.
I want to ask a node for a neighbour that is 9 steps away. I am trying to implement this recursively: I ask node #9 for a neighbour that is 9 steps away, it asks node #8 for a neighbour that is 8 steps away, and so on. Finally, this should return node #0 as the answer.
Well, my code works, but it takes more than 4 seconds to execute. Why is that?
This is my full code listing:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using Akka;
using Akka.Actor;
namespace AkkaTest
{
class Program
{
public static Stopwatch stopwatch = new Stopwatch();
static void Main(string[] args)
{
var system = ActorSystem.Create("MySystem");
IActorRef[] current = new IActorRef[0];
Console.WriteLine("Initializing actors...");
for (int i = 0; i < 10; i++)
{
var current1 = current;
var props = Props.Create<Obj>(() => new Obj(current1, Guid.NewGuid()));
var actorRef = system.ActorOf(props, i.ToString());
current = new[] { actorRef };
}
Console.WriteLine("actors initialized.");
FindNeighboursRequest r = new FindNeighboursRequest(9);
stopwatch.Start();
var response = current[0].Ask(r);
FindNeighboursResponse result = (FindNeighboursResponse)response.Result;
stopwatch.Stop();
foreach (var d in result.FoundNeighbours)
{
Console.WriteLine(d);
}
Console.WriteLine("Search took " + stopwatch.ElapsedMilliseconds + "ms.");
Console.ReadLine();
}
}
public class FindNeighboursRequest
{
public FindNeighboursRequest(int distance)
{
this.Distance = distance;
}
public int Distance { get; private set; }
}
public class FindNeighboursResponse
{
private IActorRef[] foundNeighbours;
public FindNeighboursResponse(IEnumerable<IActorRef> descendants)
{
this.foundNeighbours = descendants.ToArray();
}
public IActorRef[] FoundNeighbours
{
get { return this.foundNeighbours; }
}
}
public class Obj : ReceiveActor
{
private Guid objGuid;
readonly List<IActorRef> neighbours = new List<IActorRef>();
public Obj(IEnumerable<IActorRef> otherObjs, Guid objGuid)
{
this.neighbours.AddRange(otherObjs);
this.objGuid = objGuid;
Receive<FindNeighboursRequest>(r => handleFindNeighbourRequest(r));
}
public Obj()
{
}
private async void handleFindNeighbourRequest (FindNeighboursRequest r)
{
if (r.Distance == 0)
{
FindNeighboursResponse response = new FindNeighboursResponse(new IActorRef[] { Self });
Sender.Tell(response, Self);
return;
}
List<FindNeighboursResponse> responses = new List<FindNeighboursResponse>();
foreach (var actorRef in neighbours)
{
FindNeighboursRequest req = new FindNeighboursRequest(r.Distance - 1);
var response2 = actorRef.Ask(req);
responses.Add((FindNeighboursResponse)response2.Result);
}
FindNeighboursResponse response3 = new FindNeighboursResponse(responses.SelectMany(rx => rx.FoundNeighbours));
Sender.Tell(response3, Self);
}
}
}
The reason for such slow behavior is the way you use Ask (and the fact that you use it at all, but I'll cover that later). In your example, you're asking each neighbour in a loop and then immediately reading response2.Result, which actively blocks the current actor (and the thread it resides on). So you're essentially creating a synchronous, blocking flow.
The easiest fix is to collect all the tasks returned from Ask and await them together with Task.WhenAll, instead of waiting for each one inside the loop. Take this example:
public class Obj : ReceiveActor
{
private readonly IActorRef[] _neighbours;
private readonly Guid _id;
public Obj(IActorRef[] neighbours, Guid id)
{
_neighbours = neighbours;
_id = id;
Receive<FindNeighboursRequest>(async r =>
{
if (r.Distance == 0) Sender.Tell(new FindNeighboursResponse(new[] {Self}));
else
{
var request = new FindNeighboursRequest(r.Distance - 1);
var replies = _neighbours.Select(neighbour => neighbour.Ask<FindNeighboursResponse>(request));
var ready = await Task.WhenAll(replies);
var responses = ready.SelectMany(x => x.FoundNeighbours);
Sender.Tell(new FindNeighboursResponse(responses.ToArray()));
}
});
}
}
This one is much faster.
NOTE: In general you shouldn't use Ask inside of an actor:
Each Ask allocates a listener for the reply, so in general using Ask is a lot heavier than passing messages with Tell.
When sending messages through a chain of actors, Ask additionally costs transporting each message twice (once for the request and once for the reply) through every actor. A popular pattern is: when a request travels A⇒B⇒C⇒D and the response goes from D back to A, you can reply directly D⇒A without passing the message back through the whole chain. Usually a combination of Forward/Tell works better (see the sketch after these notes).
In general, don't use the async version of Receive if it's not necessary; at the moment it's slower for an actor than the sync version.
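To make the Forward/Tell point concrete, here is a rough sketch (not the code from the question) of a chain node that forwards the request one hop further instead of asking. It reuses the FindNeighboursRequest/FindNeighboursResponse types from the question and assumes the simple linear chain; because Forward preserves the original Sender, only the last node replies, and the reply goes straight back to the original asker:
using Akka.Actor;

public class ChainNode : ReceiveActor
{
    private readonly IActorRef _next; // null for the last node in the chain

    public ChainNode(IActorRef next)
    {
        _next = next;
        Receive<FindNeighboursRequest>(r =>
        {
            if (r.Distance == 0 || _next == null)
            {
                // Replies directly to whoever originally asked, however long the chain is.
                Sender.Tell(new FindNeighboursResponse(new[] { Self }));
            }
            else
            {
                // Forward keeps the original Sender, so no intermediate node has to relay the reply.
                _next.Forward(new FindNeighboursRequest(r.Distance - 1));
            }
        });
    }
}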

Text extraction from table cells

I have a pdf. The pdf contains a table. The table contains many cells (>100). I know the exact position (x,y) and dimension (w,h) of every cell of the table.
I need to extract text from the cells using itextsharp. Using PdfReaderContentParser + FilteredTextRenderListener (with code like this: http://itextpdf.com/examples/iia.php?id=279) I can extract text, but I need to run the whole procedure for each cell. My PDF has many cells, and the program takes too much time to run. Is there a way to extract text from a list of "rectangles"? I need to know the text of each rectangle. I'm looking for something like PDFTextStripperByArea from PdfBox (you can define as many regions as you need and then get the text using .getTextForRegion("region-name")).
This option is not immediately included in the iTextSharp distribution but it is easy to realize. In the following I use the iText (Java) class, interface, and method names because I am more at home with Java. They should easily be translatable into iTextSharp (C#) names.
If you use the LocationTextExtractionStrategy, you can use its a posteriori TextChunkFilter mechanism instead of the a priori FilteredRenderListener mechanism used in the sample you linked to. This mechanism was introduced in version 5.3.3.
For this you first parse the whole page content using the LocationTextExtractionStrategy without any FilteredRenderListener filtering applied. This makes the strategy object collect TextChunk objects for all PDF text objects on the page containing the associated base line segment.
Then you call the strategy's getResultantText overload with a TextChunkFilter argument (instead of the regular no-argument overload):
public String getResultantText(TextChunkFilter chunkFilter)
You call it with a different TextChunkFilter instance for each table cell. You have to implement this filter interface which is not too difficult as it only defines one method:
public static interface TextChunkFilter
{
/**
* @param textChunk the chunk to check
* @return true if the chunk should be allowed
*/
public boolean accept(TextChunk textChunk);
}
So the accept method of the filter for a given cell must test whether the text chunk in question is inside your cell.
(Instead of separate instances for each cell you can of course also create one instance whose parameters, i.e. cell coordinates, can be changed between getResultantText calls.)
PS: As mentioned by the OP, this TextChunkFilter has not yet been ported to iTextSharp. It should not be hard to do so, though, only one small interface and one method to add to the strategy.
PPS: In a comment sschuberth asked
Do you then still call PdfTextExtractor.getTextFromPage() when using getResultantText(), or does it somehow replace that call? If so, how do you then specify the page to extract from?
Actually PdfTextExtractor.getTextFromPage() internally already uses the no-argument getResultantText() overload:
public static String getTextFromPage(PdfReader reader, int pageNumber, TextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
return parser.processContent(pageNumber, strategy, additionalContentOperators).getResultantText();
}
To make use of a TextChunkFilter you could simply build a similar convenience method, e.g.
public static String getTextFromPage(PdfReader reader, int pageNumber, LocationTextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators, TextChunkFilter chunkFilter) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
return parser.processContent(pageNumber, strategy, additionalContentOperators).getResultantText(chunkFilter);
}
In the context at hand, though, in which we want to parse the page content only once and apply multiple filters, one for each cell, we might generalize this to:
public static List<String> getTextFromPage(PdfReader reader, int pageNumber, LocationTextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators, Iterable<TextChunkFilter> chunkFilters) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
parser.processContent(pageNumber, strategy, additionalContentOperators);
List<String> result = new ArrayList<>();
for (TextChunkFilter chunkFilter : chunkFilters)
{
result.add(strategy.getResultantText(chunkFilter));
}
return result;
}
(You can make this look fancier by using Java 8 collection streaming instead of the old-fashioned for loop.)
Here's my take on how to extract text from a table-like structure in a PDF using itextsharp. It returns a collection of rows and each row contains a collection of interpreted columns. This may work for you on the premise that there is a gap between one column and the next which is greater than the average width of a single character. I also added an option to check for wrapped text within a virtual column. Your mileage may vary.
using (PdfReader pdfReader = new PdfReader(stream))
{
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
TableExtractionStrategy tableExtractionStrategy = new TableExtractionStrategy();
string pageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, tableExtractionStrategy);
var table = tableExtractionStrategy.GetTable();
}
}
public class TableExtractionStrategy : LocationTextExtractionStrategy
{
public float NextCharacterThreshold { get; set; } = 1;
public int NextLineLookAheadDepth { get; set; } = 500;
public bool AccomodateWordWrapping { get; set; } = true;
private List<TableTextChunk> Chunks { get; set; } = new List<TableTextChunk>();
public override void RenderText(TextRenderInfo renderInfo)
{
base.RenderText(renderInfo);
string text = renderInfo.GetText();
Vector bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
Rectangle rectangle = new Rectangle(bottomLeft[Vector.I1], bottomLeft[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Chunks.Add(new TableTextChunk(rectangle, text));
}
public List<List<string>> GetTable()
{
List<List<string>> lines = new List<List<string>>();
List<string> currentLine = new List<string>();
float? previousBottom = null;
float? previousRight = null;
StringBuilder currentString = new StringBuilder();
// iterate through all chunks and evaluate
for (int i = 0; i < Chunks.Count; i++)
{
TableTextChunk chunk = Chunks[i];
// determine if we are processing the same row based on defined space between subsequent chunks
if (previousBottom.HasValue && previousBottom == chunk.Rectangle.Bottom)
{
if (chunk.Rectangle.Left - previousRight > 1)
{
currentLine.Add(currentString.ToString());
currentString.Clear();
}
currentString.Append(chunk.Text);
previousRight = chunk.Rectangle.Right;
}
else
{
// if we are processing a new line let's check to see if this could be word wrapping behavior
bool isNewLine = true;
if (AccomodateWordWrapping)
{
int readAheadDepth = Math.Min(i + NextLineLookAheadDepth, Chunks.Count);
if (previousBottom.HasValue)
for (int j = i; j < readAheadDepth; j++)
{
if (previousBottom == Chunks[j].Rectangle.Bottom)
{
isNewLine = false;
break;
}
}
}
// if the text was not word wrapped let's treat this as a new table row
if (isNewLine)
{
if (currentString.Length > 0)
currentLine.Add(currentString.ToString());
currentString.Clear();
previousBottom = chunk.Rectangle.Bottom;
previousRight = chunk.Rectangle.Right;
currentString.Append(chunk.Text);
if (currentLine.Count > 0)
lines.Add(currentLine);
currentLine = new List<string>();
}
else
{
if (chunk.Rectangle.Left - previousRight > 1)
{
currentLine.Add(currentString.ToString());
currentString.Clear();
}
currentString.Append(chunk.Text);
previousRight = chunk.Rectangle.Right;
}
}
}
return lines;
}
private struct TableTextChunk
{
public Rectangle Rectangle;
public string Text;
public TableTextChunk(Rectangle rect, string text)
{
Rectangle = rect;
Text = text;
}
public override string ToString()
{
return Text + " (" + Rectangle.Left + ", " + Rectangle.Bottom + ")";
}
}
}

UDA generating error, insufficient buffer size

I have a UDA in SQL 2005 that keeps generating the error below. I'm guessing this is most likely due to the limitation of the maximum byte size of 8000. Is there any workaround I can use to get around this? Any suggestions for avoiding this limitation in 2005? I know 2008 supposedly removed these limitations, but I cannot upgrade for the time being.
A .NET Framework error occurred during execution of user-defined routine or aggregate "CommaListConcatenate":
System.Data.SqlTypes.SqlTypeException: The buffer is insufficient. Read or write operation failed.
System.Data.SqlTypes.SqlTypeException:
at System.Data.SqlTypes.SqlBytes.Write(Int64 offset, Byte[] buffer, Int32 offsetInBuffer, Int32 count)
at System.Data.SqlTypes.StreamOnSqlBytes.Write(Byte[] buffer, Int32 offset, Int32 count)
at System.IO.BinaryWriter.Write(String value)
at TASQLCLR.CommaListConcatenate.Write(BinaryWriter w)
For SQL 2005 you can solve the 8000 byte limit by turning one parameter into multiple parameters by using a delimiter. I didn't need to go into the details myself, but you can find the answer here: http://www.mssqltips.com/tip.asp?tip=2043
For SQL 2008 you need to pass MaxByteSize as -1. If you try and pass a number greater than 8000, SQL will not let you create the aggregate, complaining that there is an 8000 byte limit. If you pass in -1, it seems to get around this issue and lets you create the aggregate (which I have also tested with > 8000 bytes).
The error is:
The size (100000) for "Class.Concatenate" is not in the valid range. Size must be -1 or a number between 1 and 8000.
Here is the working class definition for VB.NET to support > 8000 bytes in SQL 2008.
<Serializable(), SqlUserDefinedAggregate(Format.UserDefined, _
    IsInvariantToNulls:=True, _
    IsInvariantToDuplicates:=False, _
    IsInvariantToOrder:=False, MaxByteSize:=-1)> _
<System.Runtime.InteropServices.ComVisible(False)> _
Public Class Concatenate
    Implements IBinarySerialize
    ' Init, Accumulate, Merge, Terminate, Read and Write omitted here; the point is MaxByteSize:=-1.
End Class
The code below shows how to calculate the median of a set of decimal numbers within a SQL aggregate. It works around the size parameter limitation by using a shared in-memory dictionary. The idea is taken from Expert SQL Server 2005 Development.
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Linq.Expressions;
using SafeDictionary;
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
Format.UserDefined, MaxByteSize=16)]
public struct CMedian2 : IBinarySerialize
{
readonly static SafeDictionary<Guid , List<String>> theLists = new SafeDictionary<Guid , List<String>>();
private List<String> theStrings;
//Make sure to use SqlChars if you use
//VS deployment!
public SqlString Terminate()
{
List<Decimal> ld = new List<Decimal>();
foreach(String s in theStrings){
ld.Add(Convert.ToDecimal(s));
}
Decimal median;
Decimal tmp;
int halfIndex;
int numberCount;
ld.Sort();
Decimal[] allvalues = ld.ToArray();
numberCount = allvalues.Count();
if ((numberCount % 2) == 0)
{
halfIndex = (numberCount) / 2;
tmp = Decimal.Add(allvalues[halfIndex-1], allvalues[halfIndex]);
median = Decimal.Divide(tmp,2);
}
else
{
halfIndex = (numberCount + 1) / 2;
median = allvalues[halfIndex - 1];
tmp = 1;
}
return new SqlString(Convert.ToString(median));
}
public void Init()
{
theStrings = new List<String>();
}
public void Accumulate(SqlString Value)
{
if (!(Value.IsNull))
theStrings.Add(Value.Value);
}
public void Merge(CMedian2 Group)
{
foreach (String theString in Group.theStrings)
this.theStrings.Add(theString);
}
public void Write(System.IO.BinaryWriter w)
{
Guid g = Guid.NewGuid();
try
{
//Add the local collection to the static dictionary
theLists.Add(g, this.theStrings);
//Persist the GUID
w.Write(g.ToByteArray());
}
catch
{
//Try to clean up in case of exception
if (theLists.ContainsKey(g))
theLists.Remove(g);
}
}
public void Read(System.IO.BinaryReader r)
{
//Get the GUID from the stream
Guid g = new Guid(r.ReadBytes(16));
try
{
//Grab the collection of strings
this.theStrings = theLists[g];
}
finally
{
//Clean up
theLists.Remove(g);
}
}
}
You also need to implement the dictionary, as the book does:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
namespace SafeDictionary
{
public class SafeDictionary<K, V>
{
private readonly Dictionary<K, V> dict = new Dictionary<K,V>();
private readonly ReaderWriterLock theLock = new ReaderWriterLock();
public void Add(K key, V value)
{
theLock.AcquireWriterLock(2000);
try
{
dict.Add(key, value);
}
finally
{
theLock.ReleaseLock();
}
}
public V this[K key]
{
get
{
theLock.AcquireReaderLock(2000);
try
{
return (this.dict[key]);
}
finally
{
theLock.ReleaseLock();
}
}
set
{
theLock.AcquireWriterLock(2000);
try
{
dict[key] = value;
}
finally
{
theLock.ReleaseLock();
}
}
}
public bool Remove(K key)
{
theLock.AcquireWriterLock(2000);
try
{
return (dict.Remove(key));
}
finally
{
theLock.ReleaseLock();
}
}
public bool ContainsKey(K key)
{
theLock.AcquireReaderLock(2000);
try
{
return (dict.ContainsKey(key));
}
finally
{
theLock.ReleaseLock();
}
}
}
}
The dictionary must be deployed in a separate assembly with the UNSAFE permission set. The idea is to avoid serializing all the numbers by keeping them in an in-memory dictionary instead. I recommend the chapter "SQLCLR: Architecture and Design Considerations" (Chapter 6) of that book.
By the way, this solution didn't work well for me: the many conversions from Decimal to String and vice versa make it slow when working with millions of rows. Still, I enjoyed trying it, and for other use cases it is a good pattern.