Reading large semicolon-separated files via StreamReader and inserting into a SQL DB in VB.NET - vb.net

I need to read large CSV files and insert them into SQL. My idea was to use a StreamReader and read the file line by line, because if I store the whole content in a variable the program crashes. So this is what I came up with:
Dim list As New List(Of String)
Try
    Dim MyFile As String = "C:\Test.txt"
    Using fs As FileStream = File.Open(MyFile, FileMode.Open, FileAccess.ReadWrite, FileShare.None) 'file is opened in a protected mode
        Using reader As New StreamReader(fs)
            Dim firstline As String = reader.ReadLine() 'treat the first line as the column names
            Dim rest As String = reader.ReadLine() 'the rest as data rows
            Do While (Not rest Is Nothing) 'read the complete file
                list.Add(rest)
                Dim fields As String() = rest.Split(";"c) 'split each row on the semicolon delimiter
                rest = reader.ReadLine()
            Loop
        End Using
    End Using
Catch
    ResultBlock.Text = "File not readable"
End Try
I wrote list.Add(rest), which is actually a bad idea because the whole content ends up stored in a variable again, but I need to read and insert line by line into a SQL database, which seems to be pretty complicated. Does anyone have an idea how I could handle that?

If you can't read the file into memory because it's too big then what you need is some sort of buffer that holds the records in memory and writes to the database when the list gets to a certain size.
If you really want to keep it manageable then the reader, the writer, and the buffer should all be completely separate from each other. That sounds like more work because it's more classes, but it's actually simpler because each class only does one thing.
I would create a class that represents the item you're reading from the file, with a property for each field. For example, if each line in the file represents a person with a name and an employee number, create a class like
public class Person
{
public string FirstName {get;set;}
public string LastName {get;set;}
public string EmployeeNumber {get;set;}
}
You'll need a buffer. The job of the buffer is to have items put into it, and flush to a writer when it reaches its maximum size. Perhaps like this:
public interface IBuffer<T>
{
void AddItem(T item);
}
public interface IWriter<T>
{
void Write(IEnumerable<T> items);
}
public class WriterBuffer<T> : IBuffer<T>
{
private readonly IWriter<T> _writer;
private readonly int _maxSize;
private readonly List<T> _buffer;
public WriterBuffer(IWriter<T> writer, int maxSize)
{
_writer = writer;
_maxSize = maxSize;
_buffer = new List<T>();
}
public void AddItem(T item)
{
_buffer.Add(item);
if(_buffer.Count >= _maxSize)
{
_writer.Write(_buffer);
_buffer.Clear();
}
}
}
Then, your reader class doesn't know about the writer at all. All it knows is that it writes to the buffer.
public class PersonFileReader
{
private readonly string _filename;
private readonly IBuffer<Person> _buffer;
public PersonFileReader(string filename, IBuffer<Person> buffer)
{
_filename = filename;
_buffer = buffer;
}
public void ReadFile()
{
//Reads from file.
//Creates a new Person for each record
//Calls _buffer.Add(person) for each Person.
}
}
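Just to sketch the idea, ReadFile might look something like this for a semicolon-delimited file with a header row. The column order (first name; last name; employee number) is an assumption, not something given in the question, and it assumes a using System.IO directive at the top of the file.
public void ReadFile()
{
    using (var reader = new StreamReader(_filename))
    {
        // Skip the header line (the column names).
        reader.ReadLine();

        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Assumed column order: FirstName;LastName;EmployeeNumber
            var fields = line.Split(';');
            if (fields.Length < 3) continue; // skip malformed rows

            var person = new Person
            {
                FirstName = fields[0],
                LastName = fields[1],
                EmployeeNumber = fields[2]
            };
            _buffer.AddItem(person);
        }
    }
}
One thing to watch: the WriterBuffer above only writes when it reaches its maximum size, so any items still buffered when the file ends are never written. In practice you would probably add a Flush method to the buffer and call it once ReadFile finishes.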
public class PersonSqlWriter : IWriter<Person>
{
private readonly string _connectionString;
public PersonSqlWriter(string connectionString)
{
_connectionString = connectionString;
}
public void Write(IEnumerable<Person> items)
{
//Writes the list of items to the database
//using _connectionString;
}
}
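For illustration, Write could look something like this, inserting each batch with a parameterized command inside a transaction. The People table name, column names, and sizes are assumptions, and it assumes using System.Data and System.Data.SqlClient; for very large loads, SqlBulkCopy (mentioned in the next answer) is usually faster.
public void Write(IEnumerable<Person> items)
{
    using (var connection = new SqlConnection(_connectionString))
    {
        connection.Open();
        using (var transaction = connection.BeginTransaction())
        using (var command = new SqlCommand(
            "INSERT INTO People (FirstName, LastName, EmployeeNumber) " +
            "VALUES (@firstName, @lastName, @employeeNumber)",
            connection, transaction))
        {
            // Assumed column names and sizes.
            var firstName = command.Parameters.Add("@firstName", SqlDbType.NVarChar, 100);
            var lastName = command.Parameters.Add("@lastName", SqlDbType.NVarChar, 100);
            var employeeNumber = command.Parameters.Add("@employeeNumber", SqlDbType.NVarChar, 50);

            foreach (var item in items)
            {
                firstName.Value = item.FirstName;
                lastName.Value = item.LastName;
                employeeNumber.Value = item.EmployeeNumber;
                command.ExecuteNonQuery();
            }
            transaction.Commit();
        }
    }
}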
The result is that each of these classes does only one thing. You can use them separately from the others and test them separately from the others. That applies the Single Responsibility Principle. No one class is too complicated because each one has only one responsibility. It also applies the Dependency Inversion principle. The reader doesn't know what the buffer does. It just depends on the interface. The buffer doesn't know what the writer does. And the writer doesn't care where the data comes from.
Now the complexity is in creating the objects. You need a file name, a connection string, and a maximum buffer size. That means something like
var filename = "your file name";
var maxBufferSize = 50;
var connectionString = "your connection string";
var reader = new PersonFileReader(
filename,
new WriterBuffer<Person>(
new PersonSqlWriter(connectionString),
maxBufferSize));
Your classes are simpler, but wiring them all together has gotten a little more complicated. That's where dependency injection comes in. It manages this for you. I won't go into that yet because it might be information overload. But if you mention what sort of application this is - web, WCF service, etc., then I might be able to provide a concrete example of how a dependency injection container like Windsor, Autofac, or Unity can manage this for you.
This was all new to me several years ago. At first it just looked like more code. But it actually makes it easier to write small, simple classes, which in turn makes building complex applications much easier.

Have a look at the links below:
BulkCopy: How can I insert 10 million records in the shortest time possible?
This one contains code samples: http://www.sqlteam.com/article/use-sqlbulkcopy-to-quickly-load-data-from-your-client-to-sql-server
You can also use Import Wizard (https://msdn.microsoft.com/en-us/library/ms141209.aspx?f=255&MSPPError=-2147217396).
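To give a rough idea of the SqlBulkCopy approach from the first two links, here is a minimal sketch. The DataTable columns, the dbo.People destination table, and the people collection (reusing the Person shape from the earlier answer) are assumptions for illustration; it assumes using System.Data and System.Data.SqlClient.
// Build an in-memory table matching the destination schema (assumed names).
var table = new DataTable();
table.Columns.Add("FirstName", typeof(string));
table.Columns.Add("LastName", typeof(string));
table.Columns.Add("EmployeeNumber", typeof(string));

foreach (var person in people) // 'people' is whatever batch you currently hold in memory
{
    table.Rows.Add(person.FirstName, person.LastName, person.EmployeeNumber);
}

using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "dbo.People";
        bulkCopy.ColumnMappings.Add("FirstName", "FirstName");
        bulkCopy.ColumnMappings.Add("LastName", "LastName");
        bulkCopy.ColumnMappings.Add("EmployeeNumber", "EmployeeNumber");
        bulkCopy.WriteToServer(table);
    }
}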

Related

Ignite CacheJdbcPojoStoreFactory using Enum fields

I am using the CacheJdbcPojoStoreFactory.
I want to have a VARCHAR field in the database which maps to an Enum in Java.
The way I am trying to achieve this is something like the following. I want the application code to work with the enum, but the persistence to use the string so that it is human readable in the database. I do not want to use int values in the database.
This seems to work fine for creating new objects, but not for reading them out. It seems that it tries to set the field directly, and the setter (setSideAsString) is not called. Of course there is no field called sideAsString. Should this work? Any suggestions?
Here is the code excerpt
In some application code I would do something like
trade.setSide(OrderSide.Buy);
And this will persist fine. I can read "Buy" in the side column as a VARCHAR.
In Trade
private OrderSide side; // OrderSide is an enum with Buy,Sell
public OrderSide getSide() {
return side;
}
public void setSide(OrderSide side) {
this.side = side;
}
public String getSideAsString() {
return this.side.name();
}
public void setSideAsString(String s) {
this.side = OrderSide.valueOf(s);
}
Now when configuring the store, I do this
Collection<JdbcTypeField> vals = new ArrayList<>();
vals.add(new JdbcTypeField(Types.VARCHAR, "side", String.class, "sideAsString"));
After a clean start, if I query Trade using an Ignite SQL query and call trade.getSide(), it will be null. Other (directly mapped) columns are fine.
Thanks,
Gordon
BinaryMarshaller deserializes only the fields that are used in the query.
Please try to use OptimizedMarshaller:
IgniteConfiguration cfg = new IgniteConfiguration();
...
cfg.setMarshaller(new OptimizedMarshaller());
Here's the ticket for supporting enum mapping in CacheJdbcPojoStore.

Java 8 map with Map.get nullPointer Optimization

public class StartObject{
private Something something;
private Set<ObjectThatMatters> objectThatMattersSet;
}
public class Something{
private Set<SomeObject> someObjecSet;
}
public class SomeObject {
private AnotherObject anotherObjectSet;
}
public class AnotherObject{
private Set<ObjectThatMatters> objectThatMattersSet;
}
public class ObjectThatMatters{
private Long id;
}
private void someMethod(StartObject startObject) {
    Map<Long, ObjectThatMatters> objectThatMattersMap = startObject.getSomething()
        .getSomeObject().stream()
        .map(SomeObject::getAnotherObject)
        .flatMap(anotherObject -> anotherObject.getObjectThatMattersSet().stream())
        .collect(Collectors.toMap(ObjectThatMatters::getId, Function.identity()));
    Set<ObjectThatMatters> dbObjectThatMatters = new HashSet<>();
    try {
        dbObjectThatMatters.addAll(startObject.getObjectThatMatters().stream()
            .map(objectThatMatters -> objectThatMattersMap.get(objectThatMatters.getId()))
            .collect(Collectors.toSet()));
    } catch (NullPointerException e) {
        throw new SomeCustomException();
    }
    startObject.setObjectThatMattersSet(dbObjectThatMatters);
}
Given a StartObject that contains a set of ObjectThatMatters
And a Something that contains the database structure already fetched filled with all valid ObjectThatMatters.
When I want to swap the StartObject set of ObjectThatMatters to the valid corresponding db objects that only exist in the scope of the Something
Then I compare the set of ObjectThatMatters on the StartObject
And replace every one of them with the valid ObjectThatMatters inside the Something object
And if some ObjectThatMatters doesn't have a valid ObjectThatMatters, I throw a SomeCustomException
This someMethod seems pretty horrible, how can I make it more readable?
I already tried changing the try/catch to an Optional, but that doesn't actually help.
I used a Map instead of a List with List.contains because of performance; was this a good idea? The total number of ObjectThatMatters will usually be around 500.
I'm not allowed to change the other classes' structure, and I'm only showing the fields that affect this method, not every field, since they are extremely rich objects.
You don’t need a mapping step at all. The first operation, which produces a Map, can be used to produce the desired Set in the first place. Since there might be more objects than you are interested in, you may perform a filter operation.
So first, collect the IDs of the desired objects into a set, then collect the corresponding db objects, filtering by the Set of IDs. You can verify whether all IDs have been found, by comparing the resulting Set’s size with the ID Set’s size.
private void someMethod(StartObject startObject) {
Set<Long> id = startObject.getObjectThatMatters().stream()
.map(ObjectThatMatters::getId).collect(Collectors.toSet());
HashSet<ObjectThatMatters> objectThatMattersSet =
startObject.getSomething().getSomeObject().stream()
.flatMap(so -> so.getAnotherObject().getObjectThatMattersSet().stream())
.filter(obj -> id.contains(obj.getId()))
.collect(Collectors.toCollection(HashSet::new));
if(objectThatMattersSet.size() != id.size())
throw new SomeCustomException();
startObject.setObjectThatMattersSet(objectThatMattersSet);
}
This code produces a HashSet; if this is not a requirement, you can just use Collectors.toSet() to get an arbitrary Set implementation.
It’s even easy to find out which IDs were missing:
private void someMethod(StartObject startObject) {
Set<Long> id = startObject.getObjectThatMatters().stream()
.map(ObjectThatMatters::getId)
.collect(Collectors.toCollection(HashSet::new));// ensure mutable Set
HashSet<ObjectThatMatters> objectThatMattersSet =
startObject.getSomething().getSomeObject().stream()
.flatMap(so -> so.getAnotherObject().getObjectThatMattersSet().stream())
.filter(obj -> id.contains(obj.getId()))
.collect(Collectors.toCollection(HashSet::new));
if(objectThatMattersSet.size() != id.size()) {
objectThatMattersSet.stream().map(ObjectThatMatters::getId).forEach(id::remove);
throw new SomeCustomException("The following IDs were not found: "+id);
}
startObject.setObjectThatMattersSet(objectThatMattersSet);
}

Does this saving/loading pattern have a name?

There's a variable persistence concept I have integrated multiple times:
// Standard initialization
boolean save = true;
Map<String, Object> dataHolder;
// variables to persist
int number = 10;
String text = "I'm saved";
// Use the variables in various ways in the project
void useVariables() { ... number ... text ...}
// Function to save the variables into a datastructure and for example write them to a file
public Map<String, Object> getVariables()
{
Map<String, Object> data = new LinkedHashMap<String, Object>();
persist(data);
return(data);
}
// Function to load the variables from the datastructure
public void setVariables(Map<String, Object> data)
{
persist(data);
}
void persist(Map<String, Object> data)
{
// If the given datastructure is empty, it means data should be saved
save = (data.isEmpty());
dataHolder = data;
number = handleVariable("theNumber", number);
text = handleVariable("theText", text);
...
}
private Object handleVariable(String name, Object value)
{
// If currently saving
if(save)
dataHolder.put(name, value); // Just add to the datastructure
else // If currently loading
return(dataHolder.get(name)); // Read and return from the datastruct
return(value); // Return the given variable (no change)
}
The main benefit of this approach is that there is only a single place where you have to mention new variables added during development, and it's one simple line per variable.
Of course you can move the handleVariable() function to a different class which also contains the "save" and "dataHolder" variables, so they won't live in the main application.
Additionally, you could attach meta-information for each variable (for example, whatever is required to persist the data structure to a file) by storing a small custom class that holds this information plus the value, instead of the value itself.
Performance could be improved by keeping track of the order (in another data structure, the first time persist() runs) and using a "dataHolder" backed by an array instead of a search-based map, i.e. using an index instead of a name string.
However, now I have to document this for the first time, and so I wondered whether this function-reuse principle has a name.
Does someone recognize this idea?
Thank you very much!

How to add options for Analyze in Apache Lucene?

Lucene has Analyzers that basically tokenize and filter the corpus when indexing. Operations include converting tokens to lowercase, stemming, removing stopwords, etc.
I'm running an experiment where I want to try all possible combinations of analysis operations: stemming only, stopping only, stemming and stopping, ...
In total, there are 36 combinations that I want to try.
How can I easily and gracefully do this?
I know that I can extend the Analyzer class and implement the tokenStream() function to create my own Analyzer:
public class MyAnalyzer extends Analyzer
{
    public TokenStream tokenStream(String field, final Reader reader){
        return new NameFilter(
            new CaseNumberFilter(
                new StopFilter(
                    new LowerCaseFilter(
                        new StandardFilter(
                            new StandardTokenizer(reader)
                        )
                    ), StopAnalyzer.ENGLISH_STOP_WORDS)
                )
            );
    }
}
What I'd like to do is write one such class, which can somehow take boolean values for each of the possible operations (doStopping, doStemming, etc.). I don't want to have to write 36 different Analyzer classes that each perform one of the 36 combinations. What makes it difficult is the way the filters are all combined together in their constructors.
Any ideas on how to do this gracefully?
EDIT: By "gracefully", I mean that I can easily create a new Analyzer in some sort of loop:
analyzer = new MyAnalyzer(doStemming, doStopping, ...)
where doStemming and doStopping change with each loop iteration.
Solr solves this problem by using Tokenizer and TokenFilter factories. You could do the same, for example:
public interface TokenizerFactory {
Tokenizer newTokenizer(Reader reader);
}
public interface TokenFilterFactory {
TokenFilter newTokenFilter(TokenStream source);
}
public class ConfigurableAnalyzer {
private final TokenizerFactory tokenizerFactory;
private final List<TokenFilterFactory> tokenFilterFactories;
public ConfigurableAnalyzer(TokenizerFactory tokenizerFactory, TokenFilterFactory... tokenFilterFactories) {
this.tokenizerFactory = tokenizerFactory;
this.tokenFilterFactories = Arrays.asList(tokenFilterFactories);
}
public TokenStream tokenStream(String field, Reader source) {
TokenStream sink = tokenizerFactory.newTokenizer(source);
for (TokenFilterFactory tokenFilterFactory : tokenFilterFactories) {
sink = tokenFilterFactory.newTokenFilter(sink);
}
return sink;
}
}
This way, you can configure your analyzer by passing a factory for one tokenizer and 0 to n filters as constructor arguments.
Add some class variables to the custom Analyzer class which can be easily set and unset on the fly. Then, in the tokenStream() function, use those variables to determine which filters to perform.
public class MyAnalyzer extends Analyzer {
private Set customStopSet;
public static final String[] STOP_WORDS = ...;
private boolean doStemming = false;
private boolean doStopping = false;
public MyAnalyzer(){
super();
customStopSet = StopFilter.makeStopSet(STOP_WORDS);
}
public void setDoStemming(boolean val){
this.doStemming = val;
}
public void setDoStopping(boolean val){
this.doStopping = val;
}
public TokenStream tokenStream(String fieldName, Reader reader) {
// First, convert to lower case
TokenStream out = new LowerCaseTokenizer(reader);
if (this.doStopping){
out = new StopFilter(true, out, customStopSet);
}
if (this.doStemming){
out = new PorterStemFilter(out);
}
return out;
}
}
There is one gotcha: LowerCaseTokenizer takes as input the reader variable, and returns a TokenStream. This is fine for the following filters (StopFilter, PorterStemFilter), because they take TokenStreams as input and return them as output, and so we can chain them together nicely. However, this means you can't have a filter before the LowerCaseTokenizer that returns a TokenStream. In my case, I wanted to split camelCase words into parts, and this has to be done before converting to lower case. My solution was to perform the splitting manually in the custom Indexer class, so by the time MyAnalyzer sees the text, it has already been split.
(I have also added a boolean flag to my custom Indexer class, so now both can work based solely on flags.)
Is there a better answer?

Using JSON with VB.NET ASP.NET 2.0

Total newbie question here, I've been struggling with it for hours!
I'm trying to understand how to actually use and create JSON data. I've been Googling all afternoon, trying to understand what I find here: http://james.newtonking.com/projects/json/help/, having downloaded the Newtonsoft DLLs.
StringBuilder sb = new StringBuilder();
StringWriter sw = new StringWriter(sb);
using (JsonWriter jsonWriter = new JsonTextWriter(sw))
{
jsonWriter.Formatting = Formatting.Indented;
jsonWriter.WriteStartObject();
jsonWriter.WritePropertyName("CPU");
jsonWriter.WriteValue("Intel");
jsonWriter.WritePropertyName("PSU");
jsonWriter.WriteValue("500W");
jsonWriter.WritePropertyName("Drives");
jsonWriter.WriteStartArray();
jsonWriter.WriteValue("DVD read/writer");
jsonWriter.WriteComment("(broken)");
jsonWriter.WriteValue("500 gigabyte hard drive");
jsonWriter.WriteValue("200 gigabyte hard drive");
jsonWriter.WriteEnd();
jsonWriter.WriteEndObject();
}
Should create something that looks like:
{
"CPU": "Intel",
"PSU": "500W",
"Drives": [
"DVD read/writer"
/*(broken)*/,
"500 gigabyte hard drive",
"200 gigabype hard drive" ]
}
and I am sure it does... but how do I view it? How do I turn it into something the browser can output?
It seems to me that the first stage I need to resolve is how to create JSON files/strings; the next stage will be how to actually use them. If it helps answer the question, what I'm aiming for initially is AJAX autocomplete on a search page backed by my MySQL database. I was hoping I could write a simple SQL query and have the results returned using something similar to the above, but I'm clearly going about it all wrong!
BTW, the example above is in C#. I have successfully converted the process to VB, as that's what I am using, but VB examples in any responses would be much appreciated!
I came across this post about two years after it was posted, but I had the exact same question and noticed that the question wasn't really answered. To answer OP's question, this will get you the JSON string in his example.
sb.ToString()
The upshot is that you need to get the JSON string back to the browser. You can either place it in a JavaScript variable (be sure to clean up line endings and single quotes if you do this) or pass it back as the result of an ajax call.
We actually use the built-in Javascript serializer since it has support both on the server and the client side and is quite easy to use. Assuming that you have an existing object, this code goes on the server side:
''' <summary>
''' This method safely serializes an object for JSON by removing all of the special characters (i.e. CRLFs, quotes, etc)
''' </summary>
''' <param name="oObject"></param>
''' <param name="fForScript">Set this to true when the JSON will be embedded directly in web page (as opposed to being passed through an ajax call)</param>
''' <returns></returns>
''' <remarks></remarks>
Public Function SerializeObjectForJSON(ByVal oObject As Object, Optional ByVal fForScript As Boolean = False) As String
If oObject IsNot Nothing Then
Dim sValue As String
sValue = (New System.Web.Script.Serialization.JavaScriptSerializer).Serialize(oObject)
If fForScript Then
' If this serialized object is being placed directly on the page, then we need to ensure that its CRLFs are not interpreted literally (i.e. as the actual JS values)
' If we don't do this, the script will not deserialize correctly if there are any embedded crlfs.
sValue = sValue.Replace("\r\n", "\\r\\n")
' Fix quote marks
Return CleanString(sValue)
Else
Return sValue
End If
Else
Return String.Empty
End If
End Function
On the client side, deserialization is trivial:
// The result should be a json-serialized record
oRecord = Sys.Serialization.JavaScriptSerializer.deserialize(result.value);
Once you have deserialized the object, you can use its properties directly in javascript:
alert('CPU = ' + oRecord.CPU);
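To make the "result of an ajax call" route above concrete, here is a minimal generic handler sketch. It is in C# like the earlier snippets in this thread, so translate to VB as needed; the handler name, the "term" parameter, and the hard-coded suggestions are purely illustrative assumptions (in a real page you would query MySQL at that point).
using System.Web;
using System.Web.Script.Serialization;

public class AutocompleteHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        // In a real handler you would query the database using the incoming term.
        string term = context.Request.QueryString["term"];
        string[] suggestions = { "example one", "example two" };

        // Serialize with the same built-in JavaScriptSerializer used above.
        string json = new JavaScriptSerializer().Serialize(suggestions);

        context.Response.ContentType = "application/json";
        context.Response.Write(json);
    }

    public bool IsReusable
    {
        get { return true; }
    }
}
The client-side deserialization shown above then works unchanged on whatever this handler returns.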
In terms of generating the JSON, try
public class HardwareInfo
{
[JsonProperty(PropertyName = "CPU")]
public string Cpu { get; set; }
[JsonProperty(PropertyName = "PSU")]
public string Psu { get; set; }
[JsonProperty]
public ICollection<string> Drives { get; set; }
}
public string SerializeHardwareInfo()
{
var info = new HardwareInfo
{
Cpu = "Intel",
Psu = "500W",
Drives = new List<string> { "DVD read/writer", "500 gigabyte hard drive", "200 gigabyte hard drive" }
};
var json = JsonConvert.SerializeObject(info, Formatting.Indented);
// {
// "CPU": "Intel",
// "PSU": "500W",
// "Drives": [
// "DVD read/writer",
// "500 gigabyte hard drive",
// "200 gigabype hard drive"
// ]
// }
return json;
}
The formatting argument is optional. Best of luck.