RTF file contains strange formatting data preventing field merging

I created an RTF file using Word 2007. I want to insert merge fields that can be parsed and merged with database info at a later stage. The document contained the phrase 'Dear [salutation] [surname] How are you?'. I then edited the [surname] part to say [lastname]. If I now view the rtf source it contains loads of unwanted characters as follows:
Dear [salutation] [}{\rtlch\fcs1 \af31507 \ltrch\fcs0 \insrsid6575321 last}{\rtlch\fcs1 \af31507 \ltrch\fcs0 \insrsid2040086 name}{\rtlch\fcs1 \af31507 \ltrch\fcs0 \insrsid2434881 ]}{\rtlch\fcs1 \af31507 \ltrch\fcs0 \insrsid2040086 \r\n\par How are you?
This means that when I try to merge, the [lastname] placeholder is too mangled to be found.
Does anyone know what's going on here, and how I can prevent Word from embedding all this unwanted stuff?
Thanks.

In the end I used the System.Windows.Forms.RichTextBox to solve the problem, as below:
// Requires: using System.Data; using System.IO; using System.Windows.Forms;
public class RTF
{
    /// <summary>
    /// Merge the merge data with the target RTF document
    /// </summary>
    /// <param name="byteStream">Original RTF document</param>
    /// <param name="mergeDatatable">Merge Data (as per sproc_GetDocumentMergeData)</param>
    /// <returns>String representation of the RTF document</returns>
    public static string GetMergedRTFDocument(byte[] byteStream, DataTable mergeDatatable)
    {
        using (RichTextBox rtb = new RichTextBox())
        using (MemoryStream stream = new MemoryStream(byteStream))
        {
            rtb.LoadFile(stream, RichTextBoxStreamType.RichText); // Use for RTF
            DataRow mergerow = mergeDatatable.Rows[0];
            foreach (DataColumn col in mergeDatatable.Columns)
            {
                string findTerm = "[" + col.ColumnName + "]";
                int selstart = rtb.Find(findTerm);
                while (selstart > -1)
                {
                    // Select the whole placeholder, then overwrite it with the merge value
                    rtb.Select(selstart, findTerm.Length);
                    rtb.SelectedText = mergerow[col].ToString();
                    selstart = rtb.Find(findTerm);
                }
            }
            return rtb.Rtf;
        }
    }
}

You can use regular expressions to complete this merge process. I have written a blog post about how this can be done:
"Using regular expression to merge database content into Rich Text format template documents"
An example written in PHP can be found on Github:
https://github.com/olekrisek/PHP_RTF_RegExMerge
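To illustrate the regular-expression idea, here is a minimal sketch in Python. It is not the method from the blog post or the GitHub project; it only assumes that Word splits a placeholder with run boundaries of the form `}{` followed by control words, as in the RTF source quoted in the question. The names `merge_rtf` and `escape_rtf` are made up for this example.

```python
import re

# A run boundary as seen in the question: "}{" followed by control words such as
# \rtlch, \fcs1 or \insrsid6575321, each optionally ended by one delimiting space.
RUN_BREAK = r"\}\{(?:\\[a-zA-Z]+-?\d* ?)+"

# A placeholder is "[", then name characters possibly interrupted by run
# boundaries, then "]".
PLACEHOLDER = re.compile(r"\[((?:[A-Za-z0-9_]|" + RUN_BREAK + r")+)\]")


def escape_rtf(value):
    """Escape the three characters that are special in RTF text."""
    return value.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def merge_rtf(rtf, values):
    """Replace [name] placeholders, even when Word has split them into runs."""
    def substitute(match):
        # Drop the run boundaries to recover the plain placeholder name.
        name = re.sub(RUN_BREAK, "", match.group(1))
        if name not in values:
            return match.group(0)  # unknown placeholder: leave it untouched
        return escape_rtf(str(values[name]))
    return PLACEHOLDER.sub(substitute, rtf)


sample = (
    r"Dear [salutation] [}{\rtlch\fcs1 \af31507 \ltrch\fcs0 \insrsid6575321 last}"
    r"{\rtlch\fcs1 \af31507 \ltrch\fcs0 \insrsid2040086 name}"
    r"{\rtlch\fcs1 \af31507 \ltrch\fcs0 \insrsid2434881 ]}"
    r"{\rtlch\fcs1 \af31507 \ltrch\fcs0 \insrsid2040086 \par How are you?"
)
merged = merge_rtf(sample, {"salutation": "Mr", "lastname": "Smith"})
```

Each removed run boundary contains one closing and one opening brace, so the group nesting of the document stays balanced. Values with non-ASCII characters would need additional RTF encoding, which this sketch does not attempt.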

Related

How to get NPOI Excel RichStringCellValue?

I am using DotNetCore.NPOI (1.2.1) in order to read an MS Excel file.
Some of the cells are of type text and contain formatted strings (e.g. some words in bold).
How do I get the formatted cell value? My final goal: Retrieve the cell text as HTML.
I tried
var cell = row.GetCell(1);
var richStringCellValue = cell.RichStringCellValue;
But this won't let me access the formatted string (just the plain string without formatting).
Does anybody have an idea or solution?
I think you'll have to take the longer route in this case. First you'll have to preserve the formatting of the cell value (date, currency, etc.), then extract the style from the cell value and wrap the cell value in that style.
The best option is to write an extension method that returns the format and style values.
To get the format, please see this link: "How to get the value of cell containing a date and keep the original formatting using NPOI"
For styling, you'll first have to find the exact style of each run of text and then return the value inside the matching HTML tag. The method below should give you an idea of how to extract styling from a cell value. The code is untested, and you may have to include missing libraries.
public void GetStyleOfCellValue()
{
    XSSFWorkbook wb = new XSSFWorkbook("YourFile.xlsx");
    ISheet sheet = wb.GetSheetAt(0);
    ICell cell = sheet.GetRow(0).GetCell(0);
    XSSFRichTextString richText = (XSSFRichTextString)cell.RichStringCellValue;
    int formattingRuns = richText.NumFormattingRuns;
    for (int i = 0; i < formattingRuns; i++)
    {
        int startIdx = richText.GetIndexOfFormattingRun(i);
        int length = richText.GetLengthOfFormattingRun(i);
        // Substring takes (start, length), not (start, end)
        Console.WriteLine("Text: " + richText.String.Substring(startIdx, length));
        if (i == 0)
        {
            // First run: use the font of the cell style
            short fontIndex = cell.CellStyle.FontIndex;
            IFont font = wb.GetFontAt(fontIndex);
            Console.WriteLine("Bold: " + font.IsBold); // if true, return <b>my string</b>
            Console.WriteLine("Italics: " + font.IsItalic + "\n"); // if true, return <i>my string</i>
            Console.WriteLine("UnderLine: " + font.Underline + "\n"); // if set, return <u>my string</u>
        }
        else
        {
            // Later runs: use the font attached to the run itself
            IFont fontFormat = richText.GetFontOfFormattingRun(i);
            Console.WriteLine("Bold: " + fontFormat.IsBold); // if true, return <b>my string</b>
            Console.WriteLine("Italics: " + fontFormat.IsItalic + "\n"); // if true, return <i>my string</i>
        }
    }
}
Font formatting in XLSX files is stored according to the schema http://schemas.openxmlformats.org/spreadsheetml/2006/main, which has no direct relationship to HTML tags. Therefore your task is not that straightforward.
ICellStyle style = cell.CellStyle;
IFont font = style.GetFont(workbook);
// use the IFont object to query font attributes, e.g. font.IsItalic
Then you will have to build the HTML by appending the relevant HTML tags.
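The tag-building step can be sketched independently of NPOI. The Python example below is only an illustration under assumptions: the `Run` class and `runs_to_html` function are hypothetical, and each run is assumed to carry the start index, length and font flags you would read from the formatting runs of the rich string. Text outside any run is emitted unstyled here; applying the cell-level font to it is left out.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Run:
    """One formatting run: where it starts, how long it is, and its font flags."""
    start: int
    length: int
    bold: bool = False
    italic: bool = False
    underline: bool = False


def runs_to_html(text, runs):
    """Wrap each run in <b>/<i>/<u> tags and HTML-escape all text."""
    parts = []
    pos = 0
    for run in sorted(runs, key=lambda r: r.start):
        if run.start > pos:
            parts.append(escape(text[pos:run.start]))  # text between runs
        piece = escape(text[run.start:run.start + run.length])
        if run.underline:
            piece = "<u>" + piece + "</u>"
        if run.italic:
            piece = "<i>" + piece + "</i>"
        if run.bold:
            piece = "<b>" + piece + "</b>"
        parts.append(piece)
        pos = run.start + run.length
    parts.append(escape(text[pos:]))  # trailing text after the last run
    return "".join(parts)


html_text = runs_to_html("Hello bold world", [Run(start=6, length=4, bold=True)])
```

Escaping the text before adding tags keeps characters such as `<` in the cell value from being read as markup.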

Import CSV File Error : Column Value containing column delimiter

I am trying to import a CSV file into SQL Server using SSIS.
Here's an example of what the data looks like:
Student_Name,Student_DOB,Student_ID,Student_Notes,Student_Gender,Student_Mother_Name
Joseph Jade,2005-01-01,1,Good listener,Male,Amy
Amy Jade,2006-01-01,1,Good in science,Female,Amy
....
The CSV columns do not contain text qualifiers (quotation marks).
I created a simple SSIS package to import the file into SQL Server, but sometimes the data in SQL Server looks like this:
Student_Name | Student_DOB | Student_ID | Student_Notes | Student_Gender | Student_Mother_Name
Ali Jade | 2004-01-01 | 1 | Good listener | Bad in science | Male,Lisa
The reason is that the [Student_Notes] column sometimes contains a comma (,), which is also the column delimiter, so those rows are not imported correctly.
Any suggestions?
A word of warning: I'm not a regular C# coder.
But anyway this code does the following:
It opens a file called C:\Input.TXT
It searches each line. If the line has more than 5 commas, it takes all the extra commas out of the third last field (notes)
It writes the result to C:\Output.TXT - that's the one you need to actually import
There are many improvements that could be made:
Get file paths from connection managers
Error handling
An experienced C# programmer could probably do this in half the code
Keep in mind your package will need write access to the appropriate folder
public void Main()
{
    // Search the file and remove extra commas from the third last field
    // Extended from code at
    // http://stackoverflow.com/questions/1915632/open-a-file-and-replace-strings-in-c-sharp
    // Nick McDermaid
    // Requires: using System.IO;
    string sInputLine;
    string sOutputLine;
    string sDelimiter = ",";
    string[] sData;
    int iIndex;
    // Open the file for read
    using (FileStream inputStream = File.OpenRead("C:\\Input.txt"))
    using (StreamReader inputReader = new StreamReader(inputStream))
    // Open the output file (AppendText adds to an existing file, so delete it between runs)
    using (StreamWriter outputWriter = File.AppendText("C:\\Output.txt"))
    {
        // Read each line
        while (null != (sInputLine = inputReader.ReadLine()))
        {
            // Grab each field out
            sData = sInputLine.Split(sDelimiter[0]);
            if (sData.Length <= 6)
            {
                // 6 or fewer fields - just echo it out
                sOutputLine = sInputLine;
            }
            else
            {
                // Line has more than 6 pieces
                // We assume all of the extra commas are in the notes field
                // Put the first three fields together
                sOutputLine =
                    sData[0] + sDelimiter +
                    sData[1] + sDelimiter +
                    sData[2] + sDelimiter;
                // Put the middle notes fields together, replacing the delimiter with a space
                for (iIndex = 3; iIndex <= sData.Length - 3; iIndex++)
                {
                    sOutputLine = sOutputLine + sData[iIndex] + " ";
                }
                // Tack on the last two fields
                sOutputLine = sOutputLine +
                    sDelimiter + sData[sData.Length - 2] +
                    sDelimiter + sData[sData.Length - 1];
            }
            // We've evaluated the correct line, now write it out
            outputWriter.WriteLine(sOutputLine);
        }
    }
    Dts.TaskResult = (int)Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success;
}
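For comparison, the same repair can be sketched in a few lines of Python. This is an illustration, not part of the SSIS package: `repair_fields` and `repair_csv` are hypothetical names, and the sketch assumes, like the code above, that every surplus comma belongs to the notes field (the fourth of six columns). Unlike the code above, it keeps the commas and lets the CSV writer add quotation marks around the notes.

```python
import csv
import io


def repair_fields(line, expected=6, notes_index=3, delimiter=","):
    """Split a line and fold any surplus fields back into the notes column."""
    parts = line.split(delimiter)
    surplus = len(parts) - expected
    if surplus <= 0:
        return parts  # nothing to repair
    notes = delimiter.join(parts[notes_index:notes_index + surplus + 1])
    return parts[:notes_index] + [notes] + parts[notes_index + surplus + 1:]


def repair_csv(text):
    """Return repaired file content; fields that contain commas get quoted."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if line:  # skip blank lines
            writer.writerow(repair_fields(line))
    return out.getvalue()


# Input line reconstructed from the broken row shown in the question.
fixed = repair_csv("Ali Jade,2004-01-01,1,Good listener,Bad in science,Male,Lisa\n")
```

If the repaired file is imported with SSIS, the Flat File Connection Manager would then need `"` set as its text qualifier.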
In the Flat File Connection Manager, define the file as a single column (DT_STR 8000).
Then add a Script Component to the Data Flow Task and add the output columns (the same as in the example shown).
In the Script Component, split each row using the following code:
' Student_Name,Student_DOB,Student_ID,Student_Notes,Student_Gender,Student_Mother_Name
Dim strCells() As String = Row.Column0.Split(CChar(","))
Row.StudentName = strCells(0)
Row.StudentDOB = strCells(1)
Row.StudentID = strCells(2)
Row.StudentMother = strCells(strCells.Length - 1)
Row.StudentGender = strCells(strCells.Length - 2)
' Everything between the third field and the last two fields belongs to the notes
Dim strNotes As String = String.Empty
For i As Integer = 3 To strCells.Length - 3
    strNotes &= strCells(i)
Next
Row.StudentNotes = strNotes
It worked fine for me.
If importing the CSV file is not a routine task:
Import the CSV file into Excel
Find the broken rows with an Excel row filter and rewrite them
Save the Excel file as tab-delimited TXT
Import the TXT file with SSIS
Otherwise, write a script that searches for commas in the Student_Notes column range

How to modify footnote placeholder in docx4j

I have a docx file which contains a footnote. The footnote text contains a placeholder that needs to be replaced. When I extract the nodes and modify their text values, that placeholder is skipped. I think that, for some reason, the text in the footnote is not being picked up.
Can you please guide me on how to replace a placeholder in the footnote?
Approach 1
faster if you haven't yet caused unmarshalling to occur:
FootnotesPart fp = wordMLPackage.getMainDocumentPart().getFootnotesPart();
fp.variableReplace(mappings);
Approach 2
FootnotesPart fp = wordMLPackage.getMainDocumentPart().getFootnotesPart();
// unmarshallFromTemplate requires string input
String xml = XmlUtils.marshaltoString(fp.getJaxbElement(), true);
// Do it...
Object obj = XmlUtils.unmarshallFromTemplate(xml, mappings);
// Inject result into docx
fp.setJaxbElement((CTFootnotes) obj);
Since @JasonPlutext's answer did not work in my case, I am posting what worked for me:
FootnotesPart fp = template.getMainDocumentPart().getFootnotesPart();
List<Object> texts = fp.getJAXBNodesViaXPath("//w:t", true);
for (Object obj : texts) {
    Text text = (Text) ((JAXBElement) obj).getValue();
    String textValue = text.getValue();
    // do variable replacement
    text.setValue(textValue);
}
But I still face an issue when exporting this as PDF using Docx4J.toPDF(..);
The output does not pick up the footnote reference.
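The `// do variable replacement` step in the loop above is left open. A minimal sketch of it follows, written in Python only for brevity. It assumes the placeholders use the `${name}` syntax that docx4j's variableReplace expects and that each placeholder sits entirely inside one w:t element; the function name `replace_variables` is made up for this example.

```python
import re

# Matches ${name}; the name is everything up to the closing brace.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def replace_variables(text, mappings):
    """Replace every ${name} that has a mapping; leave unknown names as they are."""
    return PLACEHOLDER.sub(lambda m: mappings.get(m.group(1), m.group(0)), text)


footnote = replace_variables("See ${source}, p. ${page}",
                             {"source": "Smith 2010", "page": "12"})
```

A placeholder that Word has split across several runs would not be matched by a per-element replacement like this one.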

How to rename PDF form fields using PDF Sharp?

I am using PDF Sharp and have one issue only. I cannot rename form fields. We have a field called 'x' and after an operation is applied to field 'x', it needs to be renamed to field 'y'.
I have seen tons of documentation on how to do this using itextSharp. Unfortunately my firm cannot use them and so I am looking for a solution using PDF Sharp.
Any ideas?
This can give you an idea of how to perform the field renaming:
var uniqueIndex = Guid.NewGuid();
var fields = pdfDocument.AcroForm.Fields;
var fieldNames = fields.Names;
for (int idx = 0; idx < fieldNames.Length; ++idx)
{
    var fieldName = fieldNames[idx];
    var field = fields[fieldName];
    field.Elements.SetName($"/{fieldName}", $"{fieldName}_{uniqueIndex}");
}
I was able to rename a form field via PdfSharp as follows:
public void RenameAcroField(PdfAcroField field, string newFieldName)
{
    // "/T" is the key that holds the (partial) field name
    field.Elements.SetString("/T", newFieldName);
}
A little bit tricky, but it worked in my case. Hope it helps.
VB.NET version for PDFsharp 1.50.5147
Dim i = 0
While i < pdfDoc.AcroForm.Fields.Count
    pdfDoc.AcroForm.Fields(i).Elements.SetString("/T", "formField" & i)
    i += 1
End While

Issue with itextsharp

I have a PDF document that has several hundred fields. All of the field names have periods in them, such as "page1.line1.something"
I want to remove these periods and replace them with either an underscore or (better) nothing at all.
There appears to be a bug in the iTextSharp libraries where the RenameField method does not work if the field name contains a period, so the following does not work (it always returns false):
Dim formfields As AcroFields = stamper.AcroFields
Dim renametest As Boolean
renametest = formfields.RenameField("page1.line1.something", "page1_line1_something")
If the field does not have a period in it, it works fine.
Has anyone come across this and is there a workaround?
Is this an AcroForm form or a LiveCycle Designer (xfa) form?
If it's XFA (which is likely given the field names), iText can't help you. It can only get/set field values when working with XFA.
Okay, an AcroForm. Rather than go the route used in your source, I suggest you directly manipulate the existing field dictionaries and the acroForm field list.
I'm a Java native when it comes to iText, so you'll have to do some translation, but here goes:
A) Delete the AcroForm's field array. Leave the calculation order alone if present (/CO). I think.
PdfDictionary acroDict = reader.getCatalog().getAsDictionary(PdfName.ACROFORM);
acroDict.remove(PdfName.FIELDS);
B) Attach all the 'top level' fields to a new FIELDS array.
PdfArray newFldArray = new PdfArray();
acroDict.put(PdfName.FIELDS, newFldArray);
// you could wipe this between pages to speed things up a bit
Set<PdfIndirectReference> radioFieldsAdded = new HashSet<PdfIndirectReference>();
int numPages = reader.getNumberOfPages();
for (int curPg = 1; curPg <= numPages; ++curPg) {
    PdfDictionary curPageDict = reader.getPageN(curPg);
    PdfArray annotArray = curPageDict.getAsArray(PdfName.ANNOTS);
    if (annotArray == null)
        continue;
    for (int annotIdx = 0; annotIdx < annotArray.size(); ++annotIdx) {
        PdfIndirectReference fieldReference = annotArray.getAsIndirectObject(annotIdx);
        PdfDictionary field = (PdfDictionary) PdfReader.getPdfObject(fieldReference);
        // if it's a radio button (the /Ff entry may be absent)
        PdfNumber flags = field.getAsNumber(PdfName.FF);
        if (flags != null && (PdfFormField.FF_RADIO & flags.intValue()) != 0) {
            fieldReference = field.getAsIndirectObject(PdfName.PARENT);
            field = field.getAsDict(PdfName.PARENT); // looks up indirect reference for you.
            // only add each radio field once.
            if (radioFieldsAdded.contains(fieldReference)) {
                continue;
            } else {
                radioFieldsAdded.add(fieldReference);
            }
        }
        // you'll need to assemble the original field name manually and replace the bits
        // you don't like. Parent.T + '.' + child.T + '.' + ...
        // Do this before removing PARENT, or the parent names are no longer reachable.
        String newFieldName = SomeFunction(field);
        field.remove(PdfName.PARENT);
        field.put(PdfName.T, new PdfString(newFieldName));
        // add the reference, not the dictionary
        newFldArray.add(fieldReference);
    }
}
C) Clean up
reader.removeUnusedObjects();
Disadvantage:
More Work.
Advantages:
Maintains all field types, attributes, appearances, and doesn't change the file as a whole all that much. Less CPU & memory.
Your existing code ignores field script, all the field flags (read only, hidden, required, multiline text, etc), lists/combos, radio buttons, and quite a few other odds and ends.
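The "assemble the original field name manually" step (SomeFunction in the loop above) can be sketched as follows. This is a Python illustration of the logic only, not iText code: each field is modelled as a plain dictionary with hypothetical "T" and "Parent" keys that stand in for the /T and /Parent entries of a PDF field dictionary.

```python
def qualified_name(field):
    """Walk up the parent chain and join the partial names with periods."""
    parts = []
    node = field
    while node is not None:
        if "T" in node:  # intermediate nodes may lack a partial name
            parts.append(node["T"])
        node = node.get("Parent")
    return ".".join(reversed(parts))


def flattened_name(field, separator="_"):
    """Build the full name and replace the periods with another separator."""
    return qualified_name(field).replace(".", separator)


# The hierarchy behind a field called "page1.line1.something".
page1 = {"T": "page1"}
line1 = {"T": "line1", "Parent": page1}
something = {"T": "something", "Parent": line1}

new_name = flattened_name(something)
```

Passing an empty separator removes the periods altogether, which is the variant the question says it would prefer.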
If you use periods in your field name, only the last part can be renamed; e.g. in page1.line1.something only "something" can be renamed. This is because "page1" and "line1" are treated by Adobe as parents of the "something" field.
I needed to delete this hierarchy and replace it with a flattened structure.
I did this by
creating a pdfdictionary object for each field
reading the annotations I needed for each field into an array
deleting the field hierarchy in my (pdfstamper) document
creating a new set of fields from my array data
I have created some sample code for this if you want to see how I did it.