How to sort through a dictionary (a real world dictionary) that is in a .csv file? - sql

I haven't read enough theory or had enough practice in CS, but there must be a simpler, faster way to look up data from a file. I'm working with a literal, real-world dictionary in a .csv file, and I'm wondering how I can speed up the lookup of every word. Scanning the whole list for each word clearly doesn't make sense; splitting the file by first letter (a-z) and only searching the relevant section does.
But what else? Should I learn SQL and try to convert the text database into an SQL database? Are there methods in SQL that would let me do what I want? Please give me ideas!

SQLite sounds like a good fit for this task.
Create a table, import your csv file, create an index, and you're done.
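A minimal C# sketch of that workflow, assuming the Microsoft.Data.Sqlite package; the file name words.csv, the table name entries, and the two-column comma-separated layout are placeholders for your actual data:
using System;
using System.IO;
using Microsoft.Data.Sqlite;

class DictionaryImporter
{
    static void Main()
    {
        using var conn = new SqliteConnection("Data Source=dictionary.db");
        conn.Open();

        // Create the table and an index on the word column so lookups use a B-tree search.
        var setup = conn.CreateCommand();
        setup.CommandText =
            "CREATE TABLE IF NOT EXISTS entries (word TEXT, definition TEXT);" +
            "CREATE INDEX IF NOT EXISTS idx_entries_word ON entries(word);";
        setup.ExecuteNonQuery();

        // Import the csv inside one transaction (much faster than row-by-row autocommit).
        using (var tx = conn.BeginTransaction())
        {
            var insert = conn.CreateCommand();
            insert.Transaction = tx;
            insert.CommandText = "INSERT INTO entries (word, definition) VALUES ($w, $d)";
            var w = insert.Parameters.Add("$w", SqliteType.Text);
            var d = insert.Parameters.Add("$d", SqliteType.Text);

            foreach (var line in File.ReadLines("words.csv"))
            {
                var parts = line.Split(new[] { ',' }, 2); // naive split; use a CSV parser for quoted fields
                if (parts.Length < 2) continue;
                w.Value = parts[0];
                d.Value = parts[1];
                insert.ExecuteNonQuery();
            }
            tx.Commit();
        }

        // Indexed lookup of a single word.
        var query = conn.CreateCommand();
        query.CommandText = "SELECT definition FROM entries WHERE word = $w";
        query.Parameters.AddWithValue("$w", "aardvark");
        Console.WriteLine(query.ExecuteScalar());
    }
}
With the index in place, each lookup becomes an indexed search rather than a scan of the whole file.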

I just did this using Excel interop with a moderate-size .csv file given to me by a supply company. It worked well, but there is still a considerable delay due to the overhead of the interop/COM layer.
class Excel
{
    // assumes: using excel = Microsoft.Office.Interop.Excel;
    private excel.Application application;
    private excel.Workbook excelWorkBook;

    protected const string WORD_POSITION = "A"; // whichever column the word is located in when loaded on the Excel spreadsheet
    protected const string DEFINITION_POSITION = "B"; // whichever column the definition is loaded into on the Excel spreadsheet

    Dictionary<string, string> myDictionary = new Dictionary<string, string>();

    public Excel(string path) // where path is the file name
    {
        try
        {
            application = new excel.Application();
            excelWorkBook = application.Workbooks.Add(path);

            int row = 1;
            while (application.Cells[++row, WORD_POSITION].Value != null)
            {
                myDictionary[GetValue(row, WORD_POSITION)] = GetValue(row, DEFINITION_POSITION);
            }
        }
        catch (Exception ex)
        {
            Debug.WriteLine(ex.ToString());
        }
        finally
        {
            excelWorkBook?.Close();
            application?.Quit();
        }
    }

    private string GetValue(int row, string columnName)
    {
        string returnValue = application.Cells[row, columnName].Value2;
        if (returnValue == null) return string.Empty;
        return returnValue;
    }
}
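If the interop layer itself is the bottleneck, note that a plain .csv does not need Excel at all. A minimal sketch of loading it straight into a Dictionary for constant-time lookups, assuming a simple two-column, comma-separated file with no quoted commas (words.csv is a placeholder name):
using System;
using System.Collections.Generic;
using System.IO;

class CsvDictionary
{
    // Loads the whole csv into a hash table; lookups are then O(1),
    // so there is no need to split the file into a-z sections.
    public static Dictionary<string, string> Load(string path)
    {
        var entries = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var line in File.ReadLines(path)) // streams the file line by line
        {
            var parts = line.Split(new[] { ',' }, 2); // word, definition; use a real CSV parser if fields contain quoted commas
            if (parts.Length == 2)
                entries[parts[0].Trim()] = parts[1].Trim();
        }
        return entries;
    }

    static void Main()
    {
        var dict = Load("words.csv");
        Console.WriteLine(dict.TryGetValue("aardvark", out var def) ? def : "not found");
    }
}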

Create a new SQL database, import the csv into a new table, place an index on the column that stores the word values, then search against that table... That is the approach I would take.

Related

How to enable parallelism for a custom U-SQL Extractor

I’m implementing a custom U-SQL Extractor for our internal file format (binary serialization). It works well in the "Atomic" mode:
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class BinaryExtractor : IExtractor
If I switch off the "Atomic" mode, it looks like U-SQL is splitting the file at an arbitrary place (I guess just into 250 MB chunks). This is not acceptable for me. The file format has a special row delimiter. Can I define a custom row delimiter in my Extractor and still enable parallelism for it? Technically I can change our row delimiter to a new one if that would help.
Could anyone help me with this question?
The file is indeed split into chunks (I think it is 1 GB at the moment, but the exact value is implementation defined and may change for performance reasons).
If the file is indeed row delimited, and assuming your raw input data for the row is less than 4MB, you can use the input.Split() function inside your UDO to do the splitting into rows. The call will automatically handle the case if the raw input data spans the chunk boundary (assuming it is less than 4MB).
Here is an example:
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
{
    // this._row_delim = this._encoding.GetBytes(row_delim); in class ctor
    foreach (Stream current in input.Split(this._row_delim))
    {
        using (StreamReader streamReader = new StreamReader(current, this._encoding))
        {
            string[] array = streamReader.ReadToEnd().Split(new string[] { this._col_delim }, StringSplitOptions.None);
            for (int i = 0; i < array.Length; i++)
            {
                // DO YOUR PROCESSING
            }
        }
        yield return outputrow.AsReadOnly();
    }
}
Please note that you cannot read across chunk boundaries yourself and you should make sure your data is indeed splittable into rows.
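For context, a rough sketch of how the surrounding extractor class might look with atomic processing switched off. The _row_delim, _col_delim, and _encoding fields and the constructor arguments are assumptions inferred from the snippet above, not a fixed API:
using System.Collections.Generic;
using System.Text;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor(AtomicFileProcessing = false)] // allow U-SQL to parallelize over file chunks
public class BinaryExtractor : IExtractor
{
    private readonly byte[] _row_delim;
    private readonly string _col_delim;
    private readonly Encoding _encoding;

    public BinaryExtractor(string row_delim = "\r\n", string col_delim = "\t")
    {
        _encoding = Encoding.UTF8;
        _row_delim = _encoding.GetBytes(row_delim);
        _col_delim = col_delim;
    }

    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
    {
        // body as in the Split()-based example above
        yield break;
    }
}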

reading large semicolon separated files via streamreader and inserting in sql db in vb.net

I need to read large csv files and insert them into SQL. My idea was to use a StreamReader and read the file line by line, because if I store the whole content in a variable the program crashes. So that's what I thought:
Dim list As New List(Of String)
Try
    Dim MyFile As String = "C:\Test.txt"
    Using fs As FileStream = File.Open(MyFile, FileMode.Open, FileAccess.ReadWrite, FileShare.None) 'file is opened in a protected mode
        Using sr As New StreamReader(fs)
            Dim firstline As String = sr.ReadLine() 'treat the first line as column names
            Dim rest As String = sr.ReadLine() 'the rest as rest
            Do While rest IsNot Nothing 'read the complete file
                list.Add(rest)
                'the delimiter settings below belong on a TextFieldParser, not a FileStream
                'Filestream.TextFieldType = FileIO.FieldType.Delimited
                'Filestream.SetDelimiters(";")
                rest = sr.ReadLine()
            Loop
        End Using
    End Using
Catch
    ResultBlock.Text = "File not readable"
End Try
I wrote list.Add(rest), which is actually a bad idea because the whole content is then stored in a variable, but I need to read and insert line by line into a SQL database, which seems to be pretty complicated. Does anyone have an idea how I could handle that?
If you can't read the file into memory because it's too big then what you need is some sort of buffer that holds the records in memory and writes to the database when the list gets to a certain size.
If you really want to keep it manageable then the reader, the writer, and the buffer should all be completely separate from each other. That sounds like more work because it's more classes, but it's actually simpler because each class only does one thing.
I would create a class that represents the item that you're reading from the file, with a property for each field. For example, if each line in the file represents a person with a name and employee number, create a class like
public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string EmployeeNumber { get; set; }
}
You'll need a buffer. The job of the buffer is to have items put into it, and flush to a writer when it reaches its maximum size. Perhaps like this:
public interface IBuffer<T>
{
    void AddItem(T item);
}

public interface IWriter<T>
{
    void Write(IEnumerable<T> items);
}

public class WriterBuffer<T> : IBuffer<T>
{
    private readonly IWriter<T> _writer;
    private readonly int _maxSize;
    private readonly List<T> _buffer;

    public WriterBuffer(IWriter<T> writer, int maxSize)
    {
        _writer = writer;
        _maxSize = maxSize;
        _buffer = new List<T>();
    }

    public void AddItem(T item)
    {
        _buffer.Add(item);
        if (_buffer.Count >= _maxSize)
        {
            _writer.Write(_buffer);
            _buffer.Clear();
        }
    }
}
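One gap worth noting in this sketch: whatever is still sitting in the buffer when the reader finishes is never written. A small addition of my own (not part of the original answer) would be a flush method on WriterBuffer<T> that the calling code invokes once reading is done:
public void Flush()
{
    if (_buffer.Count > 0)
    {
        _writer.Write(_buffer);
        _buffer.Clear();
    }
}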
Then, your reader class doesn't know about the writer at all. All it knows is that it writes to the buffer.
public class PersonFileReader
{
    private readonly string _filename;
    private readonly IBuffer<Person> _buffer;

    public PersonFileReader(string filename, IBuffer<Person> buffer)
    {
        _filename = filename;
        _buffer = buffer;
    }

    public void ReadFile()
    {
        // Reads from the file.
        // Creates a new Person for each record.
        // Calls _buffer.AddItem(person) for each Person.
    }
}

public class PersonSqlWriter : IWriter<Person>
{
    private readonly string _connectionString;

    public PersonSqlWriter(string connectionString)
    {
        _connectionString = connectionString;
    }

    public void Write(IEnumerable<Person> items)
    {
        // Writes the list of items to the database
        // using _connectionString.
    }
}
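The answer leaves ReadFile and Write as stubs; a rough sketch of what they might contain, assuming a semicolon-delimited file with first name, last name, and employee number columns, and SqlBulkCopy as the write mechanism (the column layout, dbo.Person table name, and bulk-copy choice are my assumptions, not part of the original answer):
// Inside PersonFileReader (requires System.IO):
public void ReadFile()
{
    foreach (var line in File.ReadLines(_filename)) // streams one line at a time
    {
        var fields = line.Split(';');
        if (fields.Length < 3) continue;
        _buffer.AddItem(new Person
        {
            FirstName = fields[0],
            LastName = fields[1],
            EmployeeNumber = fields[2]
        });
    }
}

// Inside PersonSqlWriter (requires System.Data and System.Data.SqlClient):
public void Write(IEnumerable<Person> items)
{
    var table = new DataTable();
    table.Columns.Add("FirstName", typeof(string));
    table.Columns.Add("LastName", typeof(string));
    table.Columns.Add("EmployeeNumber", typeof(string));
    foreach (var p in items)
        table.Rows.Add(p.FirstName, p.LastName, p.EmployeeNumber);

    using (var bulkCopy = new SqlBulkCopy(_connectionString))
    {
        bulkCopy.DestinationTableName = "dbo.Person"; // hypothetical table name
        bulkCopy.WriteToServer(table);
    }
}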
The result is that each of these classes does only one thing. You can use them separately from the others and test them separately from the others. That applies the Single Responsibility Principle. No one class is too complicated because each one has only one responsibility. It also applies the Dependency Inversion principle. The reader doesn't know what the buffer does. It just depends on the interface. The buffer doesn't know what the writer does. And the writer doesn't care where the data comes from.
Now the complexity is in creating the objects. You need a file name, a connection string, and a maximum buffer size. That means something like
var filename = "your file name";
var maxBufferSize = 50;
var connectionString = "your connection string";

var reader = new PersonFileReader(
    filename,
    new WriterBuffer<Person>(
        new PersonSqlWriter(connectionString),
        maxBufferSize));
Your classes are simpler, but wiring them all together has gotten a little more complicated. That's where dependency injection comes in. It manages this for you. I won't go into that yet because it might be information overload. But if you mention what sort of application this is - web, WCF service, etc., then I might be able to provide a concrete example of how a dependency injection container like Windsor, Autofac, or Unity can manage this for you.
This was all new to me several years ago. At first it just looked like more code. But it actually makes it easier to write small, simple classes, which in turn makes building complex applications much easier.
Have a look at the links below:
SqlBulkCopy: How can I insert 10 million records in the shortest time possible?
This one contains code samples: http://www.sqlteam.com/article/use-sqlbulkcopy-to-quickly-load-data-from-your-client-to-sql-server
You can also use Import Wizard (https://msdn.microsoft.com/en-us/library/ms141209.aspx?f=255&MSPPError=-2147217396).

replace string in PDF document (ITextSharp or PdfSharp)

We use a non-managed DLL that has a function to replace text in a PDF document (http://www.debenu.com/docs/pdf_library_reference/ReplaceTag.php).
We are trying to move to a managed solution (iTextSharp or PdfSharp).
I know that this question has been asked before and that the answers are "you should not do it" or "it is not easily supported by PDF".
However there exists a solution that works for us and we just need to convert it to C#.
Any ideas how I should approach it?
According to your library reference link, you use the Debenu PDFLibrary function ReplaceTag. According to this Debenu knowledge base article
the ReplaceTag function simply replaces text in the page’s content stream, so for most documents it wouldn’t have any effect. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed. Essentially it’s the same as doing:
DPL.CombineContentStreams();
string content = DPL.GetContentStreamToString();
DPL.SetPageContentFromString(content.Replace("Moby", "Mary"));
That should be possible with any general-purpose PDF library; it definitely is with iText(Sharp):
void VerySimpleReplaceText(string OrigFile, string ResultFile, string origText, string replaceText)
{
    using (PdfReader reader = new PdfReader(OrigFile))
    {
        byte[] contentBytes = reader.GetPageContent(1);
        string contentString = PdfEncodings.ConvertToString(contentBytes, PdfObject.TEXT_PDFDOCENCODING);
        contentString = contentString.Replace(origText, replaceText);
        reader.SetPageContent(1, PdfEncodings.ConvertToBytes(contentString, PdfObject.TEXT_PDFDOCENCODING));
        new PdfStamper(reader, new FileStream(ResultFile, FileMode.Create, FileAccess.Write)).Close();
    }
}
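A hypothetical call, reusing the same "Moby"/"Mary" example as the Debenu snippet above (file names are placeholders):
VerySimpleReplaceText("original.pdf", "result.pdf", "Moby", "Mary");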
WARNING: Just like in case of the Debenu function, for most documents this code wouldn’t have any effect or would even be destructive. For some simple documents it might be able to replace content, but it really depends on how the PDF was constructed.
By the way, the Debenu knowledge base article continues:
If you created a PDF using Debenu Quick PDF Library and a standard font then the ReplaceTag function should work – however, for PDFs created with tools that do subsetted fonts or even kerning (where words will be split up) then the search text probably won’t be in the content in a simple format.
So in short, the ReplaceTag function will only work in some limited scenarios and isn’t a function that you can rely on for searching and replacing text.
Thus, if during your move to managed solution you also change the way the source documents are created, chances are that neither the Debenu PDFLibrary function ReplaceTag nor the code above will be able to change the content as desired.
For PdfSharp users, here is a somewhat usable function. I copied it from my project; it uses a utility method that is also consumed by other methods, hence the unused result.
It ignores whitespace created by kerning, and therefore may mess up the result (all characters in the same space), depending on the source material.
public static void ReplaceTextInPdfPage(PdfPage contentPage, string source, string target)
{
    ModifyPdfContentStreams(contentPage, stream =>
    {
        if (!stream.TryUnfilter())
            return false;
        var search = string.Join("\\s*", source.Select(c => c.ToString()));
        var stringStream = Encoding.Default.GetString(stream.Value, 0, stream.Length);
        if (!Regex.IsMatch(stringStream, search))
            return false;
        stringStream = Regex.Replace(stringStream, search, target);
        stream.Value = Encoding.Default.GetBytes(stringStream);
        stream.Zip();
        return false;
    });
}

public static void ModifyPdfContentStreams(PdfPage contentPage, Func<PdfDictionary.PdfStream, bool> Modification)
{
    for (var i = 0; i < contentPage.Contents.Elements.Count; i++)
        if (Modification(contentPage.Contents.Elements.GetDictionary(i).Stream))
            return;

    var resources = contentPage.Elements?.GetDictionary("/Resources");
    var xObjects = resources?.Elements.GetDictionary("/XObject");
    if (xObjects == null)
        return;

    foreach (var item in xObjects.Elements.Values.OfType<PdfReference>())
    {
        var stream = (item.Value as PdfDictionary)?.Stream;
        if (stream != null)
            if (Modification(stream))
                return;
    }
}
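A hypothetical usage, assuming the document is opened with PdfSharp in modify mode (file names and the search/replace strings are placeholders):
var document = PdfReader.Open("original.pdf", PdfDocumentOpenMode.Modify);
ReplaceTextInPdfPage(document.Pages[0], "Moby", "Mary");
document.Save("result.pdf");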

Sorting an ArrayList of NotesDocuments using a CustomComparator

I'm trying to sort a Documents Collection using a java.util.ArrayList.
var myarraylist:java.util.ArrayList = new java.util.ArrayList();
var doc:NotesDocument = docs.getFirstDocument();
while (doc != null) {
    myarraylist.add(doc);
    doc = docs.getNextDocument(doc);
}
The reason I'm trying an ArrayList and not a TreeMap or HashMap is that the field I need to sort on is not unique, which is a limitation for those two objects (I can't create my own key).
The problem I'm facing is calling the CustomComparator.
Here is how I'm trying to sort my ArrayList:
java.util.Collections.sort(myarraylist, new CustomComparator());
Here my class:
import java.util.Comparator;
import lotus.notes.NotesException;

public class CustomComparator implements Comparator<lotus.notes.Document> {
    public int compare(lotus.notes.Document doc1, lotus.notes.Document doc2) {
        try {
            System.out.println("Here");
            System.out.println(doc1.getItemValueString("Form"));
            return doc1.getItemValueString("Ranking").compareTo(doc2.getItemValueString("Ranking"));
        } catch (NotesException e) {
            e.printStackTrace();
        }
        return 0;
    }
}
Error:
Script interpreter error, line=44, col=23: Error calling method
'sort(java.util.ArrayList, com.myjavacode.CustomComparator)' on java
class 'java.util.Collections'
Any help will be appreciated.
I tried to run your SSJS code in a try-catch block, printing the error from the exception in the catch block, and I got the following message: java.lang.ClassCastException: lotus.domino.local.Document incompatible with lotus.notes.Document
I think you have the incorrect fully qualified class names for Document and NotesException. They should be lotus.domino.Document and lotus.domino.NotesException respectively.
Here is the SSJS from the repeat control:
var docs:NotesDocumentCollection = database.search(query, null, 0);
var myarraylist:java.util.ArrayList = new java.util.ArrayList();
var doc:NotesDocument = docs.getFirstDocument();
while (doc != null) {
    myarraylist.add(doc);
    doc = docs.getNextDocument(doc);
}
java.util.Collections.sort(myarraylist, new com.mycode.CustomComparator());
return myarraylist;
Here is my class:
package com.mycode;

import java.util.Comparator;

public class CustomComparator implements Comparator<lotus.domino.Document> {
    public int compare(lotus.domino.Document doc1, lotus.domino.Document doc2) {
        try {
            // Numeric comparison
            Double num1 = doc1.getItemValueDouble("Ranking");
            Double num2 = doc2.getItemValueDouble("Ranking");
            return num1.compareTo(num2);

            // String comparison
            // return doc1.getItemValueString("Description").compareTo(doc2.getItemValueString("Description"));
        } catch (lotus.domino.NotesException e) {
            e.printStackTrace();
        }
        return 0;
    }
}
Not that this answer is necessarily the best practice for you, but the last time I tried to do the same thing, I realized I could instead grab the documents as a NotesViewEntryCollection, via SSJS:
var col:NotesViewEntryCollection = database.getView("myView").getAllEntriesByKey(mtgUnidVal)
instead of a NotesDocumentCollection. I just ran through each entry, grabbed the UNIDs of those that met my criteria, added them to a java.util.ArrayList, then sent them onward to their destination. I was already sorting the documents for display elsewhere, using a column categorized by parent UNID, so this is probably what I should have done first; I'm still on the leading edge of the XPages/Notes learning curve, so every day brings something new.
Again, if your collection can't be mapped to a piece of a Notes view, sorry; but for those with a simple approach available, KISS. I remind myself of that frequently.

How do I read a large file from disk to database without running out of memory

I feel embarrassed to ask this question, as I feel like I should already know. However, given I don't... I want to know how to read large files from disk into a database without getting an OutOfMemory exception. Specifically, I need to load CSV (or really tab-delimited) files.
I am experimenting with CSVReader and specifically this code sample, but I'm sure I'm doing it wrong. Some of their other code samples show how you can read streaming files of any size, which is pretty much what I want (only I need to read from disk), but I don't know what type of IDataReader I could create to allow this.
I am reading directly from disk, and my attempt to ensure I never run out of memory by reading too much data at once is below. I can't help thinking that I should be able to use a BufferedFileReader or something similar, where I can point it to the location of the file and specify a buffer size, and since CsvDataReader expects an IDataReader as its first parameter, it could just use that. Please show me the error of my ways, rid me of my GetData method with its arbitrary file-chunking mechanism, and help me out with this basic problem.
private void button3_Click(object sender, EventArgs e)
{
    totalNumberOfLinesInFile = GetNumberOfRecordsInFile();
    totalNumberOfLinesProcessed = 0;

    while (totalNumberOfLinesProcessed < totalNumberOfLinesInFile)
    {
        TextReader tr = GetData();
        using (CsvDataReader csvData = new CsvDataReader(tr, '\t'))
        {
            csvData.Settings.HasHeaders = false;
            csvData.Settings.SkipEmptyRecords = true;
            csvData.Settings.TrimWhitespace = true;

            for (int i = 0; i < 30; i++) // known number of columns for testing purposes
            {
                csvData.Columns.Add("varchar");
            }

            using (SqlBulkCopy bulkCopy = new SqlBulkCopy(@"Data Source=XPDEVVM\XPDEV;Initial Catalog=MyTest;Integrated Security=SSPI;"))
            {
                bulkCopy.DestinationTableName = "work.test";
                for (int i = 0; i < 30; i++)
                {
                    bulkCopy.ColumnMappings.Add(i, i); // map First to first_name
                }
                bulkCopy.WriteToServer(csvData);
            }
        }
    }
}

private TextReader GetData()
{
    StringBuilder result = new StringBuilder();
    int totalDataLines = 0;

    using (FileStream fs = new FileStream(pathToFile, FileMode.Open, System.IO.FileAccess.Read, FileShare.ReadWrite))
    {
        using (StreamReader sr = new StreamReader(fs))
        {
            string line = string.Empty;
            while ((line = sr.ReadLine()) != null)
            {
                if (line.StartsWith("D\t"))
                {
                    totalDataLines++;
                    if (totalDataLines < 100000) // Arbitrary method of restricting how much data is read at once.
                    {
                        result.AppendLine(line);
                    }
                }
            }
        }
    }

    totalNumberOfLinesProcessed += totalDataLines;
    return new StringReader(result.ToString());
}
Actually, your code reads all the data from the file and keeps it in the TextReader (in memory). Then you read the data from the TextReader to save it to the server.
If the data is big, the amount of data held in the TextReader causes the out-of-memory error. Please try it this way.
1) Read the data (each line) from the file.
2) Then insert each line into the server.
The out-of-memory problem will be solved because only one record is in memory while processing.
Pseudo code
begin tran
while (data = FileReader.ReadLine())
{
    insert into Table[col0, col1, etc] values (data[0], data[1], etc)
}
end tran
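A minimal C# sketch of that pseudocode, reusing pathToFile from the question and assuming a tab-delimited file and a hypothetical three-column work.test table (column and parameter names are placeholders; requires System.Data, System.Data.SqlClient, and System.IO):
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (var transaction = connection.BeginTransaction())
    using (var command = new SqlCommand(
        "INSERT INTO work.test (col0, col1, col2) VALUES (@c0, @c1, @c2)", connection, transaction))
    {
        command.Parameters.Add("@c0", SqlDbType.VarChar);
        command.Parameters.Add("@c1", SqlDbType.VarChar);
        command.Parameters.Add("@c2", SqlDbType.VarChar);

        foreach (var line in File.ReadLines(pathToFile)) // only one line in memory at a time
        {
            var fields = line.Split('\t');
            command.Parameters["@c0"].Value = fields[0];
            command.Parameters["@c1"].Value = fields[1];
            command.Parameters["@c2"].Value = fields[2];
            command.ExecuteNonQuery();
        }
        transaction.Commit();
    }
}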
Probably not the answer you're looking for but this is what BULK INSERT was designed for.
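For reference, a minimal sketch of invoking it from C#, assuming the file lives on (or is reachable from) the SQL Server machine and targeting the work.test table from the question; the path and delimiters are placeholders:
var bulkInsertSql =
    @"BULK INSERT work.test
      FROM 'C:\data\input.txt'
      WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n')";

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(bulkInsertSql, connection))
{
    connection.Open();
    command.CommandTimeout = 0; // large loads can exceed the default 30-second timeout
    command.ExecuteNonQuery();
}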
I would just add using a BufferedFileReader with the readLine method, doing it exactly in the fashion above.
Basically, understand the responsibilities here:
BufferedFileReader is the class reading data from the file (buffer-wise). There should be a LineReader too.
CSVReader is a utility class for reading the data, assuming it is in the correct format.
SqlBulkCopy you are using anyway.
Second option:
You can go to the import facility of the database directly. If the format of the file is correct, and the whole point of the program is just this, that would be faster too.
I think you may have a red herring with the size of the data. Every time I come across this problem, it's not the size of the data but the number of objects created when looping over the data.
Look at the while loop adding records to the db within the method button3_Click(object sender, EventArgs e):
TextReader tr = GetData();
using (CsvDataReader csvData = new CsvDataReader(tr, '\t'))
Here you declare and instantiate two objects each iteration - meaning for each chunk of file you read you will instantiate 200,000 objects; the garbage collector will not keep up.
Why not declare the objects outside of the while loop?
TextReader tr = null;
CsvDataReader csvData = null;
This way, the gc will stand half a chance. You could prove the difference by benchmarking the while loop; you will no doubt notice a huge performance degradation after you have created just a couple of thousand objects.
pseudo code:
while (!EOF) {
    while (chosenRecords.size() < WRITE_BUFFER_LIST_SIZE) {
        MyRecord record = chooseOrSkipRecord(file.readln());
        if (record != null) {
            chosenRecords.add(record)
        }
    }
    insertRecords(chosenRecords) // <== writes data and clears the list
}
WRITE_BUFFER_LIST_SIZE is just a constant that you set... bigger means bigger batches and smaller means smaller batches. A size of 1 is RBAR :).
If your operation is big enough that failing partway through is a realistic possibility, or if failing partway through could cost someone a non-trivial amount of money, you probably also want to write the total number of records processed so far from the file (including the ones you skipped) to a second table, as part of the same transaction, so that you can pick up where you left off in the event of partial completion.
Instead of reading csv rows one by one and inserting them into the db one by one, I suggest reading a chunk and inserting it into the database. Repeat this process until the entire file has been read.
You can buffer in memory, say, 1000 csv rows at a time, then insert them into the database.
int MAX_BUFFERED = 1000;
int counter = 0;
List<List<String>> bufferedRows = new ArrayList<>();
while (scanner.hasNextLine()) {
    List<String> rowEntries = getData(scanner.nextLine());
    bufferedRows.add(rowEntries);
    counter++;
    if (counter == MAX_BUFFERED) {
        // INSERT INTO DATABASE
        // append all contents to a string buffer and create your SQL INSERT statement
        bufferedRows.clear(); // remove data so it can be GCed when GC kicks in
        counter = 0;
    }
}
// insert any rows still buffered once the loop ends