Keeping count of and storing text instances - VB.NET

I would like to write a simple program that counts the top three most recurring lines of text in a txt file and then saves each of those lines to another text file (these will in turn be read into AutoCAD's variable system).
Setting aside the AutoCAD part, which I can manage: how do I, in VB.NET, save the three most recurring lines of text, each to its own text file? See the example below.
The text file to be read contains the following:
APG
BTR
VTS
VTS
VTS
VTS
BTR
BTR
APG
PNG
The VB.NET program would then save the text VTS to mostused.txt, BTR to 2ndmostused.txt, and APG to 3rdmostused.txt.
How can this be best achieved?

Since I'm a C# developer, I'll use it:
var dict = new Dictionary<string, int>();
using (var sr = new StreamReader(file))
{
    var line = string.Empty;
    while ((line = sr.ReadLine()) != null)
    {
        var words = line.Split(' '); // get the words
        foreach (var word in words)
        {
            if (!dict.ContainsKey(word)) dict.Add(word, 0);
            dict[word]++; // count them
        }
    }
}
var query = from d in dict orderby d.Value descending select d; // now you have it sorted, most frequent first
int counter = 1;
foreach (var pair in query)
{
    using (var sw = new StreamWriter("file" + counter + ".txt"))
        sw.Write(pair.Key);
    counter++;
}
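Since the question asks for whole lines rather than single words, and names the output files mostused.txt, 2ndmostused.txt and 3rdmostused.txt, here is a minimal self-contained sketch of the same idea using LINQ's GroupBy (the input path input.txt is just a placeholder):
using System.IO;
using System.Linq;

class TopThreeLines
{
    static void Main()
    {
        // Read every line of the source file (placeholder path).
        var lines = File.ReadAllLines("input.txt");

        // Group identical lines, order by how often they occur, keep the three most common.
        var topThree = lines
            .GroupBy(l => l)
            .OrderByDescending(g => g.Count())
            .Take(3)
            .Select(g => g.Key)
            .ToList();

        // File names taken from the question, most frequent first.
        var fileNames = new[] { "mostused.txt", "2ndmostused.txt", "3rdmostused.txt" };
        for (int i = 0; i < topThree.Count; i++)
        {
            File.WriteAllText(fileNames[i], topThree[i]);
        }
    }
}
Translating this to VB.NET is mostly mechanical; File.ReadAllLines, GroupBy and OrderByDescending are all available there as well.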

Related

PDF merge issue in iTextSharp - PDF looks distorted after merge

I have a simple scenario where I extract pages from a PDF document (or split the document in two parts, if you will) and merge the parts back to a new document, with an option to add new pages in between.
However, in one particular case the resulting document differs from the original in that a couple of pages (in this case pages 4 and 5) look distorted compared to the source document.
How can I circumvent the distortion of the pages? The reproduction code below has been tested with iTextSharp versions 5.5.0.0 and 5.5.6.0 (latest at the moment).
You can find the input file I used here.
void Main()
{
    var pathPrefix = @"C:\temp"; // TODO change
    var inputDocPath = @"input.pdf";
    var part1 = ExtractPages(Path.Combine(pathPrefix, inputDocPath), 1, 2);
    var outputPath1 = Path.Combine(pathPrefix, "part1.pdf");
    File.WriteAllBytes(outputPath1, part1);
    var part2 = ExtractPages(Path.Combine(pathPrefix, inputDocPath), 3);
    var outputPath2 = Path.Combine(pathPrefix, "part2.pdf");
    File.WriteAllBytes(outputPath2, part2);
    var merged = Merge(new[] {
        outputPath1,
        outputPath2
    });
    var mergedPath = Path.Combine(pathPrefix, "output.pdf");
    File.WriteAllBytes(mergedPath, merged);
}
//Page sizes:
// input: 8,26x11,68; 8,26x11,69; 8,26x11,69; 8,26x11,69; 8,26x11,69; 8,26x11,68; 8,26x11,68
// output: 8,26x11,68; 8,26x11,69; 8,26x11,69; 8,26x11,69; 8,26x11,69; 8,26x11,68; 8,26x11,68
public static byte[] Merge(string[] documentPaths)
{
    byte[] mergedDocument;
    using (MemoryStream memoryStream = new MemoryStream())
    using (Document document = new Document())
    {
        PdfSmartCopy pdfSmartCopy = new PdfSmartCopy(document, memoryStream);
        document.Open();
        foreach (var docPath in documentPaths)
        {
            PdfReader reader = new PdfReader(docPath);
            try
            {
                reader.ConsolidateNamedDestinations();
                var numberOfPages = reader.NumberOfPages;
                for (int page = 0; page < numberOfPages;)
                {
                    PdfImportedPage pdfImportedPage = pdfSmartCopy.GetImportedPage(reader, ++page);
                    pdfSmartCopy.AddPage(pdfImportedPage);
                }
            }
            finally
            {
                reader.Close();
            }
        }
        document.Close();
        mergedDocument = memoryStream.ToArray();
    }
    return mergedDocument;
}
public static byte[] ExtractPages(string pdfDocument, int startPage, int? endPage = null)
{
    var reader = new PdfReader(pdfDocument);
    var numberOfPages = reader.NumberOfPages;
    var endPageResolved = endPage.HasValue ? endPage.Value : numberOfPages;
    if (startPage > numberOfPages || endPageResolved > numberOfPages)
        string.Format("Error: page indices ({0}, {1}) out of bounds. Document has {2} pages.",
            startPage, endPageResolved, numberOfPages).Dump();
    byte[] outputDocument;
    using (var doc = new Document()) // NOTE use reader.GetPageSizeWithRotation(startPage) ?
    using (var msOut = new MemoryStream())
    {
        var pdfCopyProvider = new PdfCopy(doc, msOut);
        doc.Open();
        for (var i = startPage; i <= endPageResolved; i++)
        {
            var page = pdfCopyProvider.GetImportedPage(reader, i);
            pdfCopyProvider.AddPage(page);
        }
        doc.Close();
        reader.Close();
        outputDocument = msOut.ToArray();
    }
    return outputDocument;
}
I could reproduce the issue using your code and your test file with iTextSharp 5.5.6. Actually, though, the images are not merely distorted, they have been replaced by other ones! Inspecting the result PDF internally, one observes:
Originally, pages 3 through 5 each had their own Resource dictionary, each containing different entries from the others.
After the split, as pages 1 through 3 of part2.pdf, they still had different Resource dictionaries.
In the final merged result, though, pages 3 through 5 all refer to the same Resource dictionary object, a copy of the resources of the original page 3!
(As page 3 contains images with the same names as the images on pages 4 and 5, this results in page 3's images being shown on pages 4 and 5.)
Somehow PdfSmartCopy seems to outsmart itself here; using PdfCopy instead creates the expected result.
I assume PdfSmartCopy falsely considers those source dictionaries identical, probably due to a hash collision without an actual equality check.
It might be of interest to note that an equivalent test using Java and iText, SmartMerging.java, does not show the same issue; its result is as expected.
Thus, this looks like an issue of the iTextSharp port or .NET in general.
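To make the workaround concrete, here is the Merge method from the question with PdfSmartCopy swapped for PdfCopy and nothing else changed (a sketch against the iTextSharp 5.5.x API used above; the name MergeWithPdfCopy is just for illustration):
// using System.IO; using iTextSharp.text; using iTextSharp.text.pdf;
// Identical to Merge above, but PdfCopy does not try to de-duplicate resources.
public static byte[] MergeWithPdfCopy(string[] documentPaths)
{
    byte[] mergedDocument;
    using (MemoryStream memoryStream = new MemoryStream())
    using (Document document = new Document())
    {
        PdfCopy pdfCopy = new PdfCopy(document, memoryStream);
        document.Open();
        foreach (var docPath in documentPaths)
        {
            PdfReader reader = new PdfReader(docPath);
            try
            {
                reader.ConsolidateNamedDestinations();
                var numberOfPages = reader.NumberOfPages;
                for (int page = 1; page <= numberOfPages; page++)
                {
                    pdfCopy.AddPage(pdfCopy.GetImportedPage(reader, page));
                }
            }
            finally
            {
                reader.Close();
            }
        }
        document.Close();
        mergedDocument = memoryStream.ToArray();
    }
    return mergedDocument;
}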

Creating multiple PDFs from multiple Excel files (supporting both formats) in Java

Below is my code to convert Excel to PDF, but I don't understand how to generate multiple PDFs from multiple Excel files.
String files;
File folder = new File(dirpath);
File[] listOfFiles = folder.listFiles();
for (int i = 0; i < listOfFiles.length; i++) {
    if (listOfFiles[i].isFile()) {
        files = listOfFiles[i].getName();
        if (files.endsWith(".xls") || files.endsWith(".xlsx")) {
            // inputting files one by one
            // here it should take an input one by one
            System.out.println(files);
            String inputR = files.toString();
            FileInputStream input_document = new FileInputStream(new File("D:\\ExcelToPdfProject\\" + inputR));
            // Read workbook into HSSFWorkbook
            Workbook workbook = null;
            if (inputR.endsWith(".xlsx")) {
                workbook = new XSSFWorkbook(input_document);
                System.out.println("1");
            } else if (inputR.endsWith(".xls")) {
                workbook = new HSSFWorkbook(input_document);
                System.out.println("GO TO HELL ######");
            } else {
                System.out.println("GO TO HELL");
            }
            Sheet my_worksheet = workbook.getSheetAt(2);
            // Read worksheet into HSSFSheet
            // To iterate over the rows
            Iterator<Row> rowIterator = my_worksheet.iterator();
            //Iterator<Row> rowIterator1 = my_worksheet.iterator();
            // We will create output PDF document objects at this point
            Document iText_xls_2_pdf = new Document();
            PdfWriter writer = PdfWriter.getInstance(iText_xls_2_pdf, new FileOutputStream("D:\\Output.pdf"));
            iText_xls_2_pdf.open();
            // we have two columns in the Excel sheet, so we create a PDF table with two columns
            // Note: There are ways to make this dynamic in nature, if you want to.
            Row row = rowIterator.next();
            row.setHeight((short) 2);
            int count = row.getPhysicalNumberOfCells();
            PdfPTable my_table = new PdfPTable(count);
            float[] columnWidths = new float[count];
            my_table.setWidthPercentage(100f);
            // We will use the object below to dynamically add new data to the table
            PdfPCell table_cell;
I want something that can help me create a folder full of PDFs.

How do you select a specific line to read in VB.NET?

I was wondering if/how you can read a specific line in VB.NET using a System.IO.StreamReader.
Dim streamreader as system.io.streamreader
streamreader.selectline(linenumber as int).read
streamreader.close()
Is this possible, or is there a similar function to this one?
I'd use File.ReadAllLines to read the lines into an array, then just use the array to select the line.
Dim allLines As String() = File.ReadAllLines(filePath)
Dim lineTwo As String = allLines(1) '0-based index
Note that ReadAllLines will read the entire text file into memory; I assume this isn't a problem, but if it is, then I suggest you take an alternative approach to jumping to a specific line.
ReadLines is pretty fast as it doesn't load everything into memory. It returns an IEnumerable<string>, which allows you to skip to a line easily. Take this 5GB file:
var data = new string('A', 1022);
using (var writer = new StreamWriter(@"d:\text.txt"))
{
    for (int i = 1; i <= 1024 * 1024 * 5; i++)
    {
        writer.WriteLine("{0} {1}", i, data);
    }
}
var watch = Stopwatch.StartNew();
var line = File.ReadLines(@"d:\text.txt").Skip(704320).Take(1).FirstOrDefault();
watch.Stop();
Console.WriteLine("Elapsed time: {0}", watch.Elapsed); // Elapsed time: 00:00:02.0507396
Console.WriteLine(line); // 704320 AAAAAA...
Console.ReadLine();

Lucene: how to preserve whitespaces etc when tokenizing stream?

I am trying to perform a "translation" of sorts of a stream of text. More specifically, I need to tokenize the input stream, look up every term in a specialized dictionary and output the corresponding "translation" of the token. However, I also want to preserve all the original whitespace, stopwords etc. from the input so that the output is formatted the same way as the input, instead of ending up as a bare stream of translations. So if my input is
Term1: Term2 Stopword! Term3
Term4
then I want the output to look like
Term1': Term2' Stopword! Term3'
Term4'
(where Termi' is the translation of Termi) instead of simply
Term1' Term2' Term3' Term4'
Currently I am doing the following:
PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
        PatternAnalyzer.WHITESPACE_PATTERN,
        false,
        WordlistLoader.getWordSet(new File(stopWordFilePath)));
TokenStream ts = pa.tokenStream(null, in);
CharTermAttribute charTermAttribute = ts.getAttribute(CharTermAttribute.class);
while (ts.incrementToken()) { // loop over tokens
    String termIn = charTermAttribute.toString();
    ...
}
But this, of course, loses all the whitespace etc. How can I modify this to be able to re-insert it into the output? Thanks much!
============ UPDATE!
I tried splitting the original stream into "words" and "non-words". It seems to work fine. Not sure whether it's the most efficient way, though:
public ArrayList<Token> splitToWords(String sIn)
{
    if (sIn == null || sIn.length() == 0) {
        return null;
    }
    char[] c = sIn.toCharArray();
    ArrayList<Token> list = new ArrayList<Token>();
    int tokenStart = 0;
    boolean curIsLetter = Character.isLetter(c[tokenStart]);
    for (int pos = tokenStart + 1; pos < c.length; pos++) {
        boolean newIsLetter = Character.isLetter(c[pos]);
        if (newIsLetter == curIsLetter) {
            continue;
        }
        TokenType type = TokenType.NONWORD;
        if (curIsLetter == true)
        {
            type = TokenType.WORD;
        }
        list.add(new Token(new String(c, tokenStart, pos - tokenStart), type));
        tokenStart = pos;
        curIsLetter = newIsLetter;
    }
    TokenType type = TokenType.NONWORD;
    if (curIsLetter == true)
    {
        type = TokenType.WORD;
    }
    list.add(new Token(new String(c, tokenStart, c.length - tokenStart), type));
    return list;
}
Well, it doesn't really lose whitespace; you still have your original text :)
So I think you should make use of OffsetAttribute, which contains the startOffset() and endOffset() of each term in your original text. This is what Lucene uses, for example, to highlight snippets of search results from the original text.
I wrote up a quick test (uses EnglishAnalyzer) to demonstrate:
The input is:
Just a test of some ideas. Let's see if it works.
The output is:
just a test of some idea. let see if it work.
// just for example purposes, not necessarily the most performant.
public void testString() throws Exception {
    String input = "Just a test of some ideas. Let's see if it works.";
    EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_35);
    StringBuilder output = new StringBuilder(input);
    // in some cases, the analyzer will make terms longer or shorter.
    // because of this we must track how much we have adjusted the text so far
    // so that the offsets returned will still work for us via replace()
    int delta = 0;
    TokenStream ts = analyzer.tokenStream("bogus", new StringReader(input));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        String term = termAtt.toString();
        int start = offsetAtt.startOffset();
        int end = offsetAtt.endOffset();
        output.replace(delta + start, delta + end, term);
        delta += (term.length() - (end - start));
    }
    ts.close();
    System.out.println(output.toString());
}

How do I save RichTextBox content into SQL varbinary (byte array) column?

I want to save the content of a RichTextBox to varbinary (= byte array) in XamlPackage format.
I need technical advice on how to do it.
I actually need to know how to convert between a FlowDocument and a byte array.
Is it even recommended to store it as varbinary, or is this a bad idea?
Update
Code snippet:
///Load
byte[] document = GetDocumentFromDataBase();
RichTextBox tb = new RichTextBox();
TextRange tr = new TextRange(tb.Document.ContentStart, tb.Document.ContentEnd);
tr.Load(--------------------------); //Load from the byte array.

///Save
int maxAllowed = 1024;
byte[] document;
RichTextBox tb = new RichTextBox();
//User entered text and designs in the rich text
TextRange tr = new TextRange(tb.Document.ContentStart, tb.Document.ContentEnd);
tr.Save(--------------------------); //Save to byte array
if (document.Length > maxAllowed)
{
    MessageBox.Show((document.Length - maxAllowed) + " Exceeding limit.");
    return;
}
SaveToDataBase();
I can't find my full example right now, but you can use XamlReader and XamlWriter to get the document into and out of a string. From there, you can use UnicodeEncoding, AsciiEncoding or whatever encoder you want to get it into and out of bytes.
My shorter example for setting the document from a string...
docReader is my flow document reader
private void SetDetails(string detailsString)
{
    if (docReader == null)
        return;

    if (String.IsNullOrEmpty(detailsString))
    {
        this.docReader.Document = null;
        return;
    }

    using (StringReader stringReader = new StringReader(detailsString))
    {
        using (System.Xml.XmlReader reader = System.Xml.XmlReader.Create(stringReader))
        {
            this.docReader.Document = XamlReader.Load(reader) as FlowDocument;
        }
    }
}
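To complete the save side that the answer above could not locate, here is a minimal sketch of the round trip it describes: XamlWriter/XamlReader for the document and a text encoding for the bytes (UTF8 here is only an assumption; any encoding works as long as both directions use the same one). For the XamlPackage format mentioned in the question, TextRange.Save and TextRange.Load with DataFormats.XamlPackage and a MemoryStream are the alternative route.
// using System.IO; using System.Text;
// using System.Windows.Documents; using System.Windows.Markup;
public static class FlowDocumentBytes
{
    // Serialize the FlowDocument to a XAML string, then encode it to bytes
    // suitable for a varbinary column.
    public static byte[] DocumentToBytes(FlowDocument doc)
    {
        string xaml = XamlWriter.Save(doc);
        return Encoding.UTF8.GetBytes(xaml); // assumption: UTF8 on both sides
    }

    // Decode the bytes back to a XAML string and parse it into a FlowDocument.
    public static FlowDocument BytesToDocument(byte[] data)
    {
        string xaml = Encoding.UTF8.GetString(data);
        using (var stringReader = new StringReader(xaml))
        using (var xmlReader = System.Xml.XmlReader.Create(stringReader))
        {
            return XamlReader.Load(xmlReader) as FlowDocument;
        }
    }
}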