Adding a New Element/Field with an Increment Integer as Value - vb.net

After using mongoimport to import a CSV file into my database, I want to add a new field or element to each document. The value of this new field should be the document's index number plus 2.
Dim documents = DB.GetCollection(Of BsonDocument)(collectionName).Find(filterSelectedDocuments).ToListAsync().Result
For Each doc In documents
    DB.GetCollection(Of BsonDocument)(collectionName).UpdateOneAsync(
        Builders(Of BsonDocument).Filter.Eq(Of ObjectId)("_id", doc.GetValue("_id").AsObjectId),
        Builders(Of BsonDocument).Update.Set(Of Integer)("increment.value", documents.IndexOf(doc) + 2)).Wait()
Next
If I have over a million documents to import, is there a better way to achieve this, for example using UpdateManyAsync?

Just as a side note: since you call Wait() and Result everywhere, the Async methods don't make an awful lot of sense. Also, your logic appears flawed, since there is no .Sort() anywhere, so you have no guarantee about the order of the returned documents. Is it intended that every document just gets a kind of random, but unique and increasing, number assigned?
Anyway, to make this faster, you'd really want to patch your CSV file and write the increasing "increment.value" field straight into it before the import. This way, you've got your value directly in MongoDB and do not need to query and update the imported data again.
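A minimal sketch of that pre-processing step, assuming a plain comma-separated file with a header line and no quoted fields or embedded commas (the file names and the dotted column header are placeholders, and it is worth verifying on a small sample that your mongoimport version turns "increment.value" into the nested field you expect; assumes Imports System.IO):
' Assumes "input.csv" has a header line and contains no quoted fields or embedded commas.
Dim lines As String() = File.ReadAllLines("input.csv")
Using writer As New StreamWriter("output.csv")
    ' extend the header row with the new column
    writer.WriteLine(lines(0) & ",increment.value")
    ' data rows start at index 1; the counter starts at 2, matching the question
    For i As Integer = 1 To lines.Length - 1
        writer.WriteLine(lines(i) & "," & (i + 1).ToString())
    Next
End Using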
If this is not an option you could optimize your code like this:
Only retrieve the _id of your documents - that's all you need, and it will majorly improve your .Find() performance since far less data needs to be transferred and deserialized from MongoDB.
Iterate over the Enumerable of your result instead of materializing a fully populated list.
Use bulk writes to avoid a round trip to MongoDB for every single document: buffer the update commands and flush them in chunks of, say, 1000 documents.
Theoretically, you could go further using multithreading or yield semantics for nicer streaming. However, that's getting a little complicated and may not even be needed.
The following should get you going faster already:
' just some cached values
Dim filterDefinitionBuilder = Builders(Of BsonDocument).Filter
Dim updateDefinitionBuilder = Builders(Of BsonDocument).Update
Dim collection = DB.GetCollection(Of BsonDocument)(collectionName)

' load only the _id field
Dim documentIds = collection.Find(filterSelectedDocuments).Project(Function(doc) doc.GetValue("_id")).ToEnumerable()

' bulk write buffer (pre-sized to 1000 to avoid memory traffic upon list expansion)
Dim updateModelsBuffer = New List(Of UpdateOneModel(Of BsonDocument))(1000)

' starting value for our update counter
Dim i As Long = 2

For Each objectId In documentIds
    ' for every document we want one update command...
    ' ...that finds exactly one document identified by its _id field
    Dim filterDefinition = filterDefinitionBuilder.Eq(Of ObjectId)("_id", objectId.AsObjectId)
    ' ...and sets "increment.value" to our running counter
    Dim updateDefinition = updateDefinitionBuilder.Set(Of Integer)("increment.value", i)
    updateModelsBuffer.Add(New UpdateOneModel(Of BsonDocument)(filterDefinition, updateDefinition))

    ' every 1000 documents...
    If updateModelsBuffer.Count = 1000 Then
        ' ...we flush the buffered commands to the database
        collection.BulkWrite(updateModelsBuffer)
        ' ...and empty our buffer list
        updateModelsBuffer.Clear()
    End If

    i = i + 1
Next

' flush leftover commands that have not been written yet, in case the document count is not a multiple of 1000
If updateModelsBuffer.Count > 0 Then
    collection.BulkWrite(updateModelsBuffer)
End If

Related

I want my database not to fill up the RAM. What should I do with images that are currently not being viewed?

What does my program do?
As of 05 May 2021
This program was developed in VB.NET with .NET Framework 4.8 and Visual Studio 2019 CE. The point of the program is to run a rudimentary database. The view is similar to a classic Internet forum: there are threads, each thread contains a number of posts, and each post contains a number of pictures and long texts. If a thread is selected in the ComboBox, all of its posts with their images and texts are displayed one below the other. When you click on a specific post, only its images are displayed. Since the database is only intended for the company's products, it was decided not to use categories (e.g. images vs. videos vs. off-topic, because that wouldn't make sense here) or sub-categories (e.g. electrical vs. wood products).
When the program is closed, you are asked whether the data should be saved (still in the beta version). This data is read in again when the program is loaded. If images are not found, their paths are displayed in a window.
The user also has the option of searching through all threads and viewing the results with various sorting options. In this case, only the posts found are listed in the ListBox, and here, too, the user can select individual posts and have them displayed enlarged.
The program reads in the user data when it starts. A user can log in and, depending on their role, has certain permissions. A "normal" user can create threads and posts, but only an administrator or moderator can edit and delete posts and block a user. If you are not logged in, or if your account is blocked, you can only read threads and posts.
The number of contributions is counted for each user. In the future, it should be possible to give a user stars.
About the classes
There is the Form1.vb class and three other important classes: Class_Forum, Class_Post and Class_Thread. There is also the Class_User class. If a new post is created, that instance of Class_Post is added to a List(Of Class_Post) which lives in Class_Thread ("the thread knows which posts it has"). Class_Post has a member 'Made_by', which is an instance of Class_User ("every post knows which user made it"). Class_Post also contains the member 'Bilder' (= images), which is a List(Of Bitmap). That is, every instance of Class_Post has its own List(Of Bitmap).
There are also several additional forms: 1) for editing and deleting posts, 2) for blocking or unblocking users, 3) for displaying enlarged images, 4) for logging in, 5) for listing images that could not be found, 6) for opening a thread, 7) for creating a post.
When the program is started, i.e. when data is read in, the thread instances and post instances are created.
To do list:
1.)
However, I would like only the images that belong to the thread currently selected with ComboBox1 to be kept in RAM. My question is: do I have to dispose of all unneeded images and read them in again when required? Can that be built in?
This is my code to load the data from the formatted .txt file. I have a feeling that somewhere in here, or immediately after it, I have to do something.
Private Sub Daten_laden()
Dim Pfad As String 'file path
Using OFD1 As New CommonOpenFileDialog
OFD1.Title = "Textdatei auswählen"
OFD1.Filters.Add(New CommonFileDialogFilter("Textdatei", ".txt"))
OFD1.InitialDirectory = Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments)
Dim Result As CommonFileDialogResult
Me.Invoke(Sub() Result = OFD1.ShowDialog())
If Result = CommonFileDialogResult.Ok Then
Pfad = OFD1.FileName
Else
Return
End If
End Using
Dim Pruef_Anzahl_Posts_in_dem_Thread As Integer = 0 'Check number of posts in the thread
Dim Liste_mit_den_Pfaden As List(Of String) 'List with file paths
Dim Liste_mit_den_Bildern As List(Of System.Drawing.Bitmap)
'read all Text
Dim RAT() As String = System.IO.File.ReadAllLines(Pfad, System.Text.Encoding.UTF8)
Pfade_von_nicht_gefundenen_Bildern = New List(Of String) ' Paths of not found images
For i As Integer = 3 To RAT.Length - 2 Step 1
Liste_mit_allen_Threads.Add(New Class_Thread(RAT(i)))
Me.Invoke(Sub() RaiseEvent Es_wurde_ein_neuer_Thread_eroeffnet()) 'A new thread has been opened
Pruef_Anzahl_Posts_in_dem_Thread = CInt(RAT(i + 1))
For j As Integer = (i + 2) To RAT.Length - 2 Step 1
Liste_mit_den_Pfaden = New List(Of String)
Liste_mit_den_Bildern = New List(Of Bitmap)
Dim Index As Integer
Dim Die_ID_des_Nutzers_der_den_Post_erstellt_hat As ULong = CULng(RAT(j + 4)) 'The ID of the user who created the post
For u As Integer = 0 To alle_Nutzer_Liste.Count - 1 Step 1
If alle_Nutzer_Liste(u).ID = Die_ID_des_Nutzers_der_den_Post_erstellt_hat Then
Index = u
Exit For
End If
Next
Dim neuerPost As New Class_Post(RAT(j), RAT(j + 1), CUShort(RAT(j + 2)), Liste_mit_den_Bildern, CDate(RAT(j + 3)), Liste_mit_den_Pfaden, alle_Nutzer_Liste(Index))
'how many threads are there already
Dim wie_viele_Threads_gibt_es_bereits As Integer = Liste_mit_allen_Threads.Count
Liste_mit_allen_Threads(wie_viele_Threads_gibt_es_bereits - 1).Posts_in_diesem_Thread.Add(neuerPost)
' Set the index to the last possible one in the ComboBox. This causes the program to run into the Selected Index event and SI becomes the selected index.
Me.Invoke(Sub() ComboBox1.SelectedIndex = Liste_mit_allen_Threads.Count - 1)
j += 5
Do
Liste_mit_den_Pfaden.Add(RAT(j))
If System.IO.File.Exists(RAT(j)) Then
Liste_mit_den_Bildern.Add(New Bitmap(RAT(j)))
Else
Pfade_von_nicht_gefundenen_Bildern.Add(RAT(j))
End If
j += 1
Loop Until RAT(j) = "#" ' Marker: a post is over
If RAT(j + 1) = "" AndAlso RAT(j + 2) = "" Then ' A new thread is marked with 2 blank lines one below the other.
i = (j + 2)
Exit For
End If
Next
Next
Me.Invoke(Sub() alle_Posts_in_diesem_Thread_anzeigen()) 'show all posts in this thread
If Pfade_von_nicht_gefundenen_Bildern.Count > 0 Then ' In Case something went wrong
Using FBNG As New Form_Bild_nicht_gefunden
FBNG.Datei_anzeigen(Pfade_von_nicht_gefundenen_Bildern)
FBNG.ShowDialog()
End Using
End If
End Sub
In the screenshot, the thread is switched using the ComboBox, which changes the variable SI that I use in many places. In this example, the thread "Caucasian" contains 1 post with 2 images, which means that (at this moment) I don't need the images from the "Shepherds" thread.
It is not wise to keep all data in RAM. When your database grows, at some point there will be too much data to hold in memory.
To overcome this problem, people invented databases: the data is saved on disk, and only the data that you request is loaded into memory. Smart databases keep important and frequently used values in memory to minimize request time.
If I look at your program, it seems that you only have to change the display after operator input. Operator input is relatively slow: you're a good typist if you can type more than 3 characters per second. Usually the response time after operator input is not a problem: if you get the data within half a second, no one will complain.
For a modern computer half a second is enough time to examine a million records. In your application a database won't be a problem.
So my advice would be: start using a database and load only the data that is needed right now, instead of reading all data at startup. Only if you experience long request times should you consider pre-loading data that you expect to need very soon.
Alas, to use a database you will need to learn something new: at the very least how to structure a database. If I look at your tables, I have the impression that you have already mastered this. Furthermore, you'll have to learn how to add / query / update / remove data. This is usually done using SQL, or software that supports LINQ, like Entity Framework.
It seems to me that your queries are quite limited in number: you won't have hundreds of different queries. If you already know SQL, and you don't think you will need Entity Framework in the near future, I would go for accessing the database using SQL.
If you don't know SQL very well, or if you need to do an awful lot of different queries, consider accessing the database using LINQ. This requires Entity Framework.
If you haven't got a database already, my first choice would be SQLite: a database in a single file, fast enough for your application.
If you properly hide the fact that you use SQLite, migrating to a bigger database later, should the need arise, won't require a big change in your application.
class Repository
{
public long AddPost(Post post) {...} // add Post to the database, returns the Id
public long AddUser(User user) {...}
...
// fetch all Posts of a User:
public User FetchUserWithHisPosts(int userId);
// fetch Posts of a User after a certain date:
public User FetchUserWithHisPosts(int userId, DateTime startDate);
...
}
A Repository is some kind of warehouse: all you know is that you can store items in it, and later retrieve them, even after your computer is restarted.
The Repository hides how it does this: the constructor might load everything into memory (like your current application), or the repository might use SQLite, a smarter database, or even Entity Framework.
A good way to migrate would be to first translate your current application such that everyone only accesses the data using the Repository. The Repository accesses your "in-memory data" which is in a separate class that is loaded at startup.
Later you can change the repository such that it doesn't use the "in-memory data" anymore, but accesses the database: users of your repository won't have to change.
About loading pictures
No, you don't have to load all pictures at startup. It will be fast enough to load the pictures only when shown: after all, you won't be showing 1000 pictures on the screen at once.
As a picture uses a lot of memory, it is wise to Dispose() the picture as soon as you don't need it anymore:
Image GetImage(long imageId)
{
Repository repository = new Repository();
return repository.GetImageById(imageId);
}
void DisposeImage(Image image)
{
image.Dispose();
}
You hide how you load the image and how you free up memory once an Image is not needed anymore. This makes it easier to change later, should the need arise. It also makes your code easier to read and to unit test.
Hello @Harald Coppoolse,
Thank you for your detailed and understandable answer. Admittedly, I still have to look into and learn SQLite and the Entity Framework approach.
For now, I've temporarily made sure that I only load the pictures that are shown and dispose of all the others. I coded this as follows:
Private Sub ComboBox1_SelectedIndexChanged(sender As Object, e As EventArgs) Handles ComboBox1.SelectedIndexChanged
If ComboBox1.SelectedIndex <> (-1) Then
SI = ComboBox1.SelectedIndex
If Programm_fertig_geladen Then
Load_the_pictures_only_when_shown_and_free_up_memory_after_an_Image_is_not_needed()
End If
alle_Posts_in_diesem_Thread_anzeigen()
End If
End Sub
Private Sub Load_the_pictures_only_when_shown_and_free_up_memory_after_an_Image_is_not_needed()
For g As Integer = 0 To Liste_mit_allen_Threads.Count - 1 Step 1
If g <> SI Then 'dispose of these images
For i As Integer = 0 To Liste_mit_allen_Threads(g).Posts_in_diesem_Thread.Count - 1 Step 1
For j As Integer = 0 To Liste_mit_allen_Threads(g).Posts_in_diesem_Thread(i).Bilder.Count - 1 Step 1
Liste_mit_allen_Threads(g).Posts_in_diesem_Thread(i).Bilder(j).Dispose()
Liste_mit_allen_Threads(g).Posts_in_diesem_Thread(i).BoolArray(j) = False
Next
Dim CNT As Integer = Liste_mit_allen_Threads(g).Posts_in_diesem_Thread(i).Bilder.Count
For z As Integer = (CNT - 1) To 0 Step -1
Liste_mit_allen_Threads(g).Posts_in_diesem_Thread(i).Bilder.RemoveAt(z)
Next
Next
Else 'load these images
'for every post
For i As Integer = 0 To Liste_mit_allen_Threads(g).Posts_in_diesem_Thread.Count - 1 Step 1
'the images in that post
For j As Integer = 0 To Liste_mit_allen_Threads(g).Posts_in_diesem_Thread(i).Pfade_der_Bilder.Count - 1 Step 1
'only create the Bitmap if it is not already loaded; otherwise the freshly created Bitmap would leak
If Liste_mit_allen_Threads(g).Posts_in_diesem_Thread(i).BoolArray(j) = False Then
Dim neu_geladenes_Bild As New Bitmap(Liste_mit_allen_Threads(g).Posts_in_diesem_Thread(i).Pfade_der_Bilder(j))
Liste_mit_allen_Threads(g).Posts_in_diesem_Thread(i).Bilder.Add(neu_geladenes_Bild)
Liste_mit_allen_Threads(g).Posts_in_diesem_Thread(i).BoolArray(j) = True
End If
Next
Next
End If
Next
End Sub
I built a Boolean array into Class_Post that has as many elements as there are images and paths in the post. Its purpose is to remember which pictures are currently loaded and which are not.
When disposing of the images of a post, the respective array member is set to false.
When reloading the images of a post, the respective array member is set to True.
Dispose alone is not enough because, funnily enough, the slot in the List(Of Bitmap) remains occupied. You also have to remove the picture from the list completely with RemoveAt, in a backwards-running For loop, because the list gets smaller while you are deleting. ;)
Using cannot be used here, because a Using block would dispose of the bitmaps at the end of the block, while they have to stay alive for as long as they are displayed.
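A minimal sketch of that dispose-and-remove pattern for a single post; post is a stand-in for Liste_mit_allen_Threads(g).Posts_in_diesem_Thread(i):
' Dispose each bitmap and remove it from the list in one backwards pass,
' so the shrinking list never invalidates the loop index.
For z As Integer = post.Bilder.Count - 1 To 0 Step -1
    post.Bilder(z).Dispose()
    post.Bilder.RemoveAt(z)
Next
' Alternatively: dispose all bitmaps first, then call post.Bilder.Clear()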

How to improve Mongo document retrieval performance

I am retrieving documents from a MongoDB database and copying them to internal storage. I'm finding that it takes more than a few seconds to retrieve and store a hundred of these documents. Is there anything I can do to improve the performance? Some of the collections have more than 1000 documents. Here's what I have (written in VB.NET):
' get the documents from collection "reqitems" and put them in "totalCollection"
Dim totalCollection As List(Of BsonDocument) = _
    reqitems.Find(Builders(Of BsonDocument).Filter.Empty).ToList()
ReDim model.ReqItems(totalCollection.Count) ' storage for the processed documents
For Each item As BsonDocument In totalCollection
    ' note: given a string "a=x", GetRHS returns "x"
    Dim parentuid As String = GetRHS(item.GetElement("parentuid").ToString)
    Dim nodename As String = GetRHS(item.GetElement("nodename").ToString)
    ' .... about a dozen of these elements
    ' .... process the elements and copy them to locations in model.ReqItems
Next
You can add indexes to your collection if you haven't done so already. Please refer to: https://docs.mongodb.com/manual/indexes/
Also, I would suggest running the particular MongoDB query with execution stats, e.g. db.mycollection.find().explain("executionStats"), which will give you more detail about the performance of the query: https://docs.mongodb.com/manual/reference/explain-results/#executionstats
Adding indices didn't really help. What slows it down is accessing the elements in the document one at a time (GetRHS in the posted code). So, as a fix, I converted each document to a string and then parsed the string for keyword-value pairs. Hopefully what I found can help someone with the same problem.
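The fix described above parses one big string per document. An alternative with the same effect, touching each element once instead of looking it up a dozen times by name, is to copy the document's elements into a dictionary in a single pass. A minimal sketch, reusing item, parentuid and nodename from the question:
' one pass over the document's elements...
Dim values As New Dictionary(Of String, String)
For Each el As BsonElement In item.Elements
    ' el.Value.ToString() already yields the plain value, so GetRHS is not needed here
    values(el.Name) = el.Value.ToString()
Next
' ...then constant-time lookups by field name
Dim parentuid As String = values("parentuid")
Dim nodename As String = values("nodename")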

Getting a count of "find" from multiple .txt files

I am currently running a find-and-replace through hundreds of .txt files at a time. I am looking for a way to get a count of the number of matches for my search value.
Here is the code I am currently running; I'm hoping to be able to add to or modify it.
Dim flatfiles As String() = IO.Directory.GetFiles("C:\DATA\TEST\", "*.txt").Where(Function(x) File.ReadAllText(x).Contains("Bob")).ToArray
For Each f As String In flatfiles
Dim contents As String = File.ReadAllText(f)
File.WriteAllText(f, contents.Replace("Bob", "Bill"))
Next
One (inefficient) way to do this would be to declare a counter outside the For Each loop (Dim itemsFound As Integer = 0), then increment it by the number of matches in each file, using something like:
itemsFound = itemsFound + (Regex.Split(contents, find).Length - 1)
Regex.Split splits the string wherever it finds find, which means the count you're looking for is one less than the number of items in the resulting array.
I would also point out that you're calling File.ReadAllText twice in your code. You could improve that by getting rid of the Where call and doing the check inside your For Each loop (since you're now counting the occurrences per file anyway, it's easy enough to check for zero occurrences). Alternatively, you could change the Where call to store the contents of the files rather than their names (although this can be dangerous if the files are large), or you could even do it all in LINQ if you want some obfuscated code...
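A minimal sketch combining those suggestions: each file is read only once, occurrences are counted via the Regex.Split trick, and only files that actually contain the value are rewritten ("Bob"/"Bill" and the folder are taken from the question; assumes Imports System.IO and Imports System.Text.RegularExpressions at the top of the file):
Dim find As String = "Bob"
Dim replacement As String = "Bill"
Dim itemsFound As Integer = 0

For Each f As String In Directory.GetFiles("C:\DATA\TEST\", "*.txt")
    ' read the file only once
    Dim contents As String = File.ReadAllText(f)
    ' splitting on the search value yields one more piece than there are matches
    Dim countInFile As Integer = Regex.Split(contents, Regex.Escape(find)).Length - 1
    If countInFile > 0 Then
        itemsFound += countInFile
        File.WriteAllText(f, contents.Replace(find, replacement))
    End If
Next
' itemsFound now holds the total number of replacements made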

Lucene.net result as string / parsing to slow

I need some help to speed up the lucene.net search.
We need the result in a string with ";" as a separator.
The parsing of the TopDocs takes too long:
Dim resultDocs As TopDocs = indexSearch.Search(query, indexReader.MaxDoc())
Dim hits As ScoreDoc() = resultDocs.ScoreDocs
Dim strGetDocIDList As String = ""
For Each sDoc As ScoreDoc In hits
Dim documentFromSearcher As Document = indexSearch.Doc(sDoc.Doc)
Dim contentValue As String = documentFromSearcher.Get("id")
strGetDocIDList = strGetDocIDList + Path.GetFileName(contentValue) + ";"
Next
Return strGetDocIDList
How can we speed this up?
Regards
Ingo
There are a few ways to tune performance when loading stored fields in Lucene.
First, by default Lucene loads every stored field of a Document when you load it. Only store what you need, and do not systematically store everything.
If you don't need all the stored fields for this particular query, try writing a FieldSelector to gain finer control over field loading.
Finally, add the fields whose stored data you load most often before the other stored fields in your Documents. Fields are loaded sequentially, and in some cases adding them first can speed things up a little.
FieldSelector API link
An article that may help you with implementing a FieldSelector
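A sketch of what that could look like for the loop in the question, assuming Lucene.Net 3.0.3, where MapFieldSelector and the Doc(docId, fieldSelector) overload are available (worth verifying against your library version); it also replaces the repeated string concatenation with a StringBuilder. Assumes Imports System.IO, System.Text, Lucene.Net.Documents and Lucene.Net.Search:
' load only the stored "id" field of each hit instead of every stored field
Dim idOnly As New MapFieldSelector("id")
Dim resultDocs As TopDocs = indexSearch.Search(query, indexReader.MaxDoc())
Dim sb As New StringBuilder()
For Each sDoc As ScoreDoc In resultDocs.ScoreDocs
    Dim documentFromSearcher As Document = indexSearch.Doc(sDoc.Doc, idOnly)
    sb.Append(Path.GetFileName(documentFromSearcher.Get("id"))).Append(";"c)
Next
Return sb.ToString()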

Saving multiple items to SQL database

I have some items in my ListBox, each of which I want to store in my database. What I am doing now is looping over the box and calling my database-saving logic to save every single item. I think this is pretty inefficient; is there any way I can batch-save my items so that I don't open and close the connection as many times as there are items? Thanks.
For Each item In outletToBox.Items
.CamCode = item.ToString
.CamCampaignAutoID = retID
.CamRemarks = uitxtCamRemarks.Text.Trim
'---use savetable object to save to database table---
Next
I had to do something similar yesterday. My approach was to build up a List of whatever I had to save in my database, then serialize it to XML. I passed the XML as a parameter to a stored procedure which then processed it and saved the data.
Overall I would say this is a much more effective solution than calling the database multiple times, as your application gathers the data it wants saved and gives it to your database in one transaction.
As an example of how you can do this, here is my code, which loops through a CheckBoxList, creates a List of the selected items and then serializes it to XML. You should easily be able to adapt this to work with your ListBox.
' requires Imports System.IO and Imports System.Xml.Serialization
' This is the list that will hold each of our selected items
Dim listOfSelectedItems As New List(Of ListItem)
' Loop through the CheckBoxList control and add all selected items to
' the listOfSelectedItems List if the item has its Selected property
' set to true
For Each item As ListItem In chkNotify.Items
If (item.Selected = True) Then
listOfSelectedItems.Add(item)
End If
Next
' Declare a new XMLSerializer
Dim serializer As New XmlSerializer(listOfSelectedItems.GetType)
' Declare a StringWriter
Dim writer As StringWriter = New StringWriter()
' Serialize the listOfSelectedItems List
serializer.Serialize(writer, listOfSelectedItems)
' Store our XML in a String variable
Dim serializedXML As String = writer.ToString()
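To complete the picture, here is a sketch of handing that XML to the database in a single call, assuming SQL Server and a hypothetical stored procedure dbo.SaveCampaignItems with an @ItemsXml parameter (the procedure, parameter and connectionString are placeholders to adapt to your schema; assumes Imports System.Data and Imports System.Data.SqlClient):
Using connection As New SqlConnection(connectionString)
    Using command As New SqlCommand("dbo.SaveCampaignItems", connection)
        command.CommandType = CommandType.StoredProcedure
        ' the whole batch travels to the server as one parameter
        command.Parameters.Add("@ItemsXml", SqlDbType.Xml).Value = serializedXML
        connection.Open()
        command.ExecuteNonQuery()
    End Using
End Using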
Normally you don't have to close the connection: you keep it open and do multiple inserts. You can either commit after every insert, or do groups of inserts and only then commit. You are right that closing and opening connections is expensive.
More information about the database you are using is needed to say whether it offers a way to do multiple inserts in batches.
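As an illustration of that approach for SQL Server with plain ADO.NET, one open connection and one transaction are reused for all inserts (the table and column names are made up; item, retID and uitxtCamRemarks come from the question; assumes Imports System.Data and Imports System.Data.SqlClient):
Using connection As New SqlConnection(connectionString)
    connection.Open()
    ' one connection, one transaction, many inserts
    Using transaction As SqlTransaction = connection.BeginTransaction()
        Using command As New SqlCommand(
            "INSERT INTO CampaignItems (CamCode, CamCampaignAutoID, CamRemarks) VALUES (@code, @campaignId, @remarks)",
            connection, transaction)
            command.Parameters.Add("@code", SqlDbType.NVarChar, 50)
            command.Parameters.Add("@campaignId", SqlDbType.Int)
            command.Parameters.Add("@remarks", SqlDbType.NVarChar, 500)
            For Each item In outletToBox.Items
                command.Parameters("@code").Value = item.ToString()
                command.Parameters("@campaignId").Value = retID
                command.Parameters("@remarks").Value = uitxtCamRemarks.Text.Trim()
                command.ExecuteNonQuery() ' reuses the open connection and the prepared parameters
            Next
        End Using
        transaction.Commit()
    End Using
End Using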
It depends on whether you have access to change the DB procedure. If you can change your SQL procedure to accept multiple values in one parameter, you're all set. You can create a delimited string (using any character that wouldn't occur in the text: comma, pipe, etc.) or pass XML, then parse out the values in the SQL procedure.
Can you access the savetable object?