Short Summary:
Open an Excel file in a GUI showing one row and its records, analyze it, and
write back to a file.
I am working on a project that has lots of records in an Excel file. Each record has a web URL that I have to analyze and write appropriate comments about.
Copying and pasting is tedious, as there are hundreds and hundreds of records.
So I am thinking of automating the process.
What I'd like is a GUI that shows one record at a time and opens its URL in IE. It would have some extra fields (drop-down, input box) in addition to the original columns so I can record the analysis data.
Based on the drop-down option, it would create a document (or append to it if it already exists) containing that record. Once I click save, it would show the next record.
What would be the best way to go? I thought of using Visual Basic because of its GUI, but everyone knows about VB and why I should avoid it.
I'm also thinking about a web app, so it would not be OS dependent, but I am not sure how Excel files work with PHP or other web scripting languages.
Any input would be greatly appreciated. Pointers to any tutorial that gives some insight would also help.
I would do this with some JavaScript on an HTML page. Here's an outline of a general strategy with a few examples; none of them have been tested. There are probably better ways to do some of this.
Step 1 - Convert spreadsheet to CSV
File->Save As->Save as Type->CSV
Step 2 - An HTML page to act as a viewer for the URLs
Example:
<html>
  <head>
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js"></script>
    <script src="myAutomationScript.js"></script>
  </head>
  <body>
    <table id="csvRecord">
      <tr id="header"></tr>
      <tr id="data"></tr>
    </table>
    <input id="nextRecord" type="submit" />
    <iframe id="viewer"></iframe>
  </body>
</html>
Step 3 - Parse CSV
There's a jQuery plugin from this answer that will parse the CSV for you. You can read in the file (described here), or if this is a one-off thing you could just copy and paste the csv data into a string variable in your javascript file.
var csv = "",//populate this from file, or paste in data or whatever
records = $.csv.toObjects(csv),
keys = Object.keys(records[0]),
currentRecord = 0;
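To make the shape of records and keys concrete, here is a rough, untested-in-a-browser stand-in for the plugin call. The function name simpleCsvToObjects and the sample data are made up for illustration, and this version only copes with plain values (no quoted fields, embedded commas, or embedded line breaks), so a real parser is still the better choice for actual data.

```javascript
// Simplified, hypothetical stand-in for a CSV-to-objects parser.
// It splits on line breaks and commas only; it does NOT handle quoted
// values, embedded commas, or embedded line breaks.
function simpleCsvToObjects(text) {
  var lines = text.split(/\r?\n/).filter(function (line) {
    return line.length > 0;
  });
  if (lines.length === 0) {
    return [];
  }
  var headers = lines[0].split(',');
  return lines.slice(1).map(function (line) {
    var values = line.split(',');
    var record = {};
    headers.forEach(function (header, i) {
      record[header] = values[i];
    });
    return record;
  });
}

// Made-up sample data: the first line supplies the column names.
var sample = 'id,url\n1,http://example.com\n2,http://example.org';
var sampleRecords = simpleCsvToObjects(sample);
var sampleKeys = Object.keys(sampleRecords[0]);
console.log(sampleKeys);
console.log(sampleRecords[1].url);
```

Each record is a plain object keyed by column name, which is why the later steps can read record.url directly.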
Step 4 - Display record in form
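function displayHeaders (keys) {
  //create a header cell for each key
  var $header = $('#header').empty();
  keys.forEach(function (key) {
    $header.append($('<th>').text(key));
  });
  //add a header (or headers) for your additional fields
  //("Comments" is just an example name)
  $header.append($('<th>').text('Comments'));
}
function displayRecord (record) {
  //populate a td cell for each piece of data
  var $data = $('#data').empty();
  keys.forEach(function (key) {
    $data.append($('<td>').text(record[key]));
  });
  //add input element(s) for your additional review fields
  $data.append($('<td>').append($('<input>', { id: 'comments', type: 'text' })));
  //set the source of the iframe to the url (assumes a column named "url")
  $('#viewer').attr('src', record.url);
}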
Step 5 - Save the record and move on to the next
//this would be a handler for the click event of nextRecord
function moveOn() {
  //read in data from the csvRecord table
  //add the record to another collection, or read in the output file and add a line
  //save the collection, or string or whatever to a local file using the html5 local file api
  //advance currentRecord and display that record, stopping after the last one
  currentRecord = currentRecord + 1;
  if (currentRecord < records.length) {
    displayRecord(records[currentRecord]);
  }
}
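For the "save the collection" part, one option is to turn the reviewed records back into CSV text before writing it out. The following is only a sketch: escapeCsvValue, recordsToCsv, and the "comments" column are names I made up, and how the resulting string gets written to disk is left open.

```javascript
// Quote a single value for CSV: wrap it in double quotes when it contains
// a comma, a double quote, or a line break, and double any embedded quotes.
function escapeCsvValue(value) {
  var text = value === null || value === undefined ? '' : String(value);
  if (/[",\r\n]/.test(text)) {
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

// Turn an array of record objects into CSV text: a header line built from
// the given column names, then one line per record.
function recordsToCsv(columns, reviewed) {
  var lines = [columns.map(escapeCsvValue).join(',')];
  reviewed.forEach(function (record) {
    lines.push(columns.map(function (column) {
      return escapeCsvValue(record[column]);
    }).join(','));
  });
  return lines.join('\r\n');
}

// Made-up example; "comments" is a hypothetical review column.
var output = recordsToCsv(['url', 'comments'], [
  { url: 'http://example.com', comments: 'Looks fine, no issues' }
]);
console.log(output);
```

Quoting values on the way out matters here because free-text comments will often contain commas, and Excel needs them quoted to reopen the file with the columns intact.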
I'm using docx4j to convert a Word template to several HTML files, one per chapter.
The Word template has several custom properties mapped by several fields (DOCPROPERTY ...) represented as both simple and complex fields. I populate those properties to obtain Freemarker code when the word document is converted to HTML (like ${...} or [#... /] directives).
In a later step I look for "heading 1" paragraphs to identify chapters and then split the document into several Word documents before conversion; these documents are then converted to HTML and written to temporary files.
Each document is successfully converted to HTML and fields are correctly replaced with my markers, but the header and footer parts are written incorrectly: field codes are written before field values (e.g. DOCPROPERTY "PROPERTY_NAME" \* MERGEFORMAT ${constants['PROPERTY_NAME']} ) instead of field values only (e.g. ${constants['PROPERTY_NAME']} ).
If I write the updated document to a docx file instead, nothing seems wrong in the generated document.
If it's useful to solve the problem, this is what I do to split the document (per chapter):
clone the updated WordprocessingMLPackage (clone method)
delete every root element before the chapter's "heading 1" element
delete every root element from the "heading 1" element of the next chapter
convert the cloned and cleaned document
(Actually I don't use the clone method every time; I write the updated document to a ByteArrayOutputStream and then read it back for every chapter, inspired by the source of the clone method.)
I suspect this is a docx4j bug. Has anybody else tried something similar?
Finally these are my platform details:
JDK 1.6
Docx4J v3.2.2
Thanks in advance for any help
EDIT
To produce freemarker markers in place of Word fields, I set document property values as follows:
traverse the document looking for simple or complex fields with new TraversalUtil(wordMLPackage.getMainDocumentPart().getContent(), visitor);, where visitor is my custom callback for looking for fields and set properties
traversing the document I look for
FldChar elements with type BEGIN and parse them using FieldsPreprocessor.canonicalise((P) ((R) fc.getParent()).getParent(), fields); (I don't use the return value of canonicalise) where fc is the found FldChar and fields is an empty ArrayList<FieldRef>; then I extract and parse the field's instrText attribute
CTSimpleField elements and parse them using FldSimpleModel fldSimpleModel = new FldSimpleModel(); fldSimpleModel.build((CTSimpleField) o, null);; then I use fldSimpleModel.getFldArgument() to get the property name
I look for the freemarker code to show in place of the current field and set it as property value using wordMLPackage.getDocPropsCustomPart().setProperty(propertyName, finalValue);
finally I do the same from step 1 for headers and footers as follows:
List<Relationship> rels = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getRelationships().getRelationship();
for (Relationship rel : rels) {
    Part p = wordMLPackage.getMainDocumentPart().getRelationshipsPart().getPart(rel);
    if (p == null) {
        continue;
    }
    if (p instanceof ContentAccessor) {
        new TraversalUtil(((ContentAccessor) p).getContent(), visitor);
    }
}
Finally I update fields as follows:
FieldUpdater updater = new FieldUpdater(wordMLPackage);
try {
    updater.update(true);
} catch (Docx4JException ex) {
    Logger.getLogger(WorkerDocx4J.class.getName()).log(Level.SEVERE, null, ex);
}
After filling all field properties, I clone the document as previously described and convert filtered cloned instances using
HTMLSettings settings = Docx4J.createHTMLSettings();
settings.setWmlPackage(wordDoc);
settings.setImageHandler(new InlineImageHandler(myDataModel));
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
ByteArrayOutputStream os = new ByteArrayOutputStream();
os.write("[#ftl]\r\n".getBytes("UTF-8"));
Docx4J.toHTML(settings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
String template = new String(os.toByteArray(), "UTF-8");
The template variable then holds the resulting Freemarker template.
The following XML is the content of footer1.xml part of the document generated after updating the document properties as described: footer1.xml after field updates
The very strange thing (in my opinion) is that if some properties are not found, step 5 throws an Exception (ok), field updating stops at the offending field (ok), and all fields in the header and footer are rendered correctly. In this case, this is the content of footer1.xml.
In that last case, fields are defined in a different way. I think the HTML converter handles the last case well and does something wrong in the first one.
Is there something I am doing wrong, or something I could do better?
VS 2013, VB, EF6
I am creating an object that will keep user input in one of its properties. I would like that user input to be stored as rich text. What's involved to make that stored text be rich text format? So,
Public Property Text as <what?>
I thought I would post my own answer for others who might ask the question the same way I did. I begin by stating that my question was poorly formed because I didn't understand that I'm not really storing RTF; I'm storing WYSIWYG text with HTML tags. But I think the question as phrased is useful because that's how many people think until they are taught by others.
Ultimately this process opens a serious XSS vector, but first we have to at least collect the WYSIWYG text.
First step: using a script-based editor capture the text with html tags. I used CKEditor which is easy to download on NuGet. It comes in 3 flavors: basic, standard and full. Another popular one seems to be TinyMCE also available through NuGet.
CKEditor must be 'wired in' to replace the existing input element. I replaced @Html.EditorFor with a <textarea> directly as follows. Model.UserPost.Body is the property into which I want to place the WYSIWYG text. The Raw helper is required so the output is NOT encoded, allowing us to see our WYSIWYG text.
<textarea name="model.UserPost.Body" id="model_UserPost_Body" class="form-control text-box multi-line">
@Html.Raw(Model.UserPost.Body)
</textarea>
CKEditor is 'wired in' using a script element to replace the <textarea> element.
@Section Scripts
<script src="~/scripts/ckeditor/ckeditor.js"></script>
<script>
CKEDITOR.replace('model.UserPost.Body');
</script>
End Section
The script above can be added to all pages via _Layout.vbhtml, or just to the target page via a @Section Scripts section as shown above, which is often recommended and is what I did. That may also require adding a line such as the following to the <head> section of the standard _Layout.
@RenderSection("Styles", False)
In the controller POST method for the view the following code is needed to capture the WYSIWYG text otherwise the default filter will raise an exception when it detects anything that looks like an html tag.
Dim rawBody = Request.Unvalidated.Form("model.UserPost.Body")
userPost.Body = rawBody
There are some possible gotchas: the 'body' property has to be removed from the Include:= list of the <Bind> attribute in the method parameter list if <Bind> is being used. Also, although not directly related to this solution, you can't have a Data Annotation like <Required()> on this property in the model, because background checking won't be able to confirm that condition, so the ModelState.IsValid flag won't ever go true.
Second step: before saving the input it MUST be checked for XSS. Microsoft has a nice video explaining basic XSS that I recommend viewing; it's only 11 minutes.
Mikesdotnetting has a nice explanation for dealing with XSS and shows a whitelisting algorithm toward the bottom of this page. The following code is based on his work.
To create a whitelisting approach, the HTML Agility Pack is useful to catalogue the HTML nodes for review. This is easily loaded from NuGet as well. This is the code I used in the POST method to invoke the whitelist methods (yes, it could be more compact, but this is easier to read for us novices):
Dim tempDoc = New HtmlDocument()
tempDoc.LoadHtml(rawBody)
RemoveNodes(tempDoc.DocumentNode, allowedTags)
userPost.Body = tempDoc.DocumentNode.OuterHtml
The allowed tags are what you will allow, which means everything else is rejected, hence whitelisting. This is just a sample list:
Dim allowedTags As New List(Of String)() From {"p", "em", "s", "ol", "ul", "li", "h1", "h2", "h3", "h4", "h5", "h6", "strong"}
These are the methods based on Mikesdotnetting page:
Private Sub RemoveNodes(ByVal node As HtmlNode, allowedTags As List(Of String))
    If (node.NodeType = HtmlNodeType.Element) Then
        If Not allowedTags.Contains(node.Name) Then
            node.ParentNode.RemoveChild(node)
            Exit Sub
        End If
    End If
    If (node.HasChildNodes) Then
        RemoveChildren(node, allowedTags)
    End If
End Sub
Private Sub RemoveChildren(ByVal parent As HtmlNode, allowedTags As List(Of String))
    For i = parent.ChildNodes.Count() - 1 To 0 Step -1
        RemoveNodes(parent.ChildNodes(i), allowedTags)
    Next
End Sub
So basically, (1) CKEditor captures user input with HTML tags that looks nice, (2) the raw input is specially requested in the Controller POST method and then (3) cleaned using a whitelist. After that it can be output directly to the page using @Html.Raw() because it can be trusted.
That's it. I've not really posted solutions like this before, so if I've missed something let me know and I'll correct or add it.
Rich Text is stored in the Rich Text Format.
The Rich Text Format specifications can be found here:
http://www.microsoft.com/en-us/download/details.aspx?id=10725
It is just an ordinary string. You can extract the string from a RichTextBox using the SaveFile function:
Private Function GetRTF(ByRef Box As RichTextBox) As String
    Using ms As New IO.MemoryStream
        Box.SaveFile(ms, RichTextBoxStreamType.RichText)
        Return System.Text.Encoding.ASCII.GetString(ms.ToArray)
    End Using
End Function
You can load text in the Rich Text Format into a RichTextBox using the LoadFile method of the RichTextBox. The text needs to be in the correct format:
Dim rtf As String = "{\rtf1 {\colortbl;\red0\green0\blue255;\red255\green0\blue0;}Guten Tag!\line{\i Dies} ist ein\line formatierter {\b Text}.\line Das {\cf1 Ende}.}"
Using ms As New IO.MemoryStream(System.Text.Encoding.ASCII.GetBytes(rtf))
RichTextBox1.LoadFile(ms, RichTextBoxStreamType.RichText)
End Using
Ordinary controls usually will not interpret this format in their text property.
How do I get .Dump() to work without showing the number of results as the first row?
I've switched a manual report to run via lprun and be emailed to a client. However, before this I was removing that row manually when I saved the Excel file.
I need to keep the HTML formatting, so I don't want to use CSV. I also use .Dump() in the report (one call at the end); I'm not writing with the HTML, CSV, or XML writers manually.
Looking at the source of the generated HTML, the header that contains the number of items is located in td.typeheader. I guess you can inject a simple CSS style into the generated HTML to hide it:
td.typeheader { display: none; }
The injection can be a simple replace:
File.WriteAllText(pathToReport, File.ReadAllText(pathToReport)
    .Replace("</head>", "<style type=\"text/css\">td.typeheader { display: none; }</style></head>"));
I am trying to configure this in v5 and cannot find any documentation. This is what I have so far.
I followed the V4 documentation as closely as possible, but cannot get the form to allow me to choose multiple files!
Under HTML Render I have Form Tag Attachment = enctype="multipart/form-data"
Under the designer tab I have file field element on the form. Under this I have
Field Name = file1[]
Field ID = file1
Multiple=Yes
Under the settings tab I have the files upload action in the on submit event. In the files upload action I have
Enabled=Yes
Files config = file1:jpg-png-gif-txt
Array fields = file1
Is there anything else I need to do?
It turns out this is a bug and will be fixed in the next update. As a workaround, create a custom file upload element with the word multiple as a parameter.
I have a dojo editor on a JSP page. The dojo editor is one of the required fields and I have validation in place for it. There is a scenario in which some tags get appended. I cannot find a particular pattern for when they get appended, but most of the time it occurs after one selects and copies all the content and pastes it into the editor. So the editor content in this case was
<div id="dijitEditorBody">content which user entered</div>
Issue: When the user deletes all the content that was entered, the tags are still there and get submitted. In this case the editor visually has no content, but the field holds the following value:
<div id="dijitEditorBody"></div>
or
<div id="dijitEditorBody"><br /></div>
So it skips validation and displays an empty editor when the data is retrieved from the DB.
Why are these tags getting appended?
In RichText.js, this snippet:
if(dojo.isIE || dojo.isWebKit || (!this.height && !dojo.isMoz)){
    // In auto-expand mode, need a wrapper div for AlwaysShowToolbar plugin to correctly
    // expand/contract the editor as the content changes.
    html = "<div id='dijitEditorBody'></div>";
    setBodyId = false;
}else if(dojo.isMoz){
    // workaround bug where can't select then delete text (until user types something
    // into the editor)... and/or issue where typing doesn't erase selected text
    this._cursorToStart = true;
    html = " ";
}
explains why that tag is added.
Although you see it in your alert box, I believe it is not present in the posted content, right?
The editor should take care of removing the extra tags. I have not tested this, but I am fairly sure.
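If the wrapper div does end up in the value you validate, one defensive option is to treat markup-only content as empty. This is just a sketch under that assumption: isEditorContentEmpty is a name I made up, it is not part of dojo, and it would also treat content consisting only of an image or other non-text element as empty.

```javascript
// Hypothetical helper, not part of dojo: returns true when the editor HTML
// contains no visible text once tags and whitespace are removed.
function isEditorContentEmpty(html) {
  var text = String(html || '')
    .replace(/<[^>]*>/g, '')          // drop tags such as the wrapper div and <br />
    .replace(/&nbsp;|&#160;/gi, ' ')  // treat non-breaking space entities as whitespace
    .replace(/\u00a0/g, ' ');         // same for the literal non-breaking space character
  return text.trim().length === 0;
}

console.log(isEditorContentEmpty('<div id="dijitEditorBody"></div>'));
console.log(isEditorContentEmpty('<div id="dijitEditorBody"><br /></div>'));
console.log(isEditorContentEmpty('<div id="dijitEditorBody">content which user entered</div>'));
```

The required-field validation could call this on the editor value instead of checking for an empty string, and the same check could be repeated on the server before saving.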