The task is to extract data from multiple CSV files according to a criterion. The file contains a sampleId (this is the criterion) and other columns. At the end of the file there are the measurement values, under columns named 0...100 (the numbers are the actual names of the columns). To make it a bit more interesting, there can be variations between CSV files, depending on the customer's needs. This means the measurement data count can be 15, 25, 50 etc., but no more than 100, and there are no variations within one file. This data is always placed at the end of the line, so there is a fixed set of columns before the numbers.
I'd like to have a SQL statement which can accept parameters:
SELECT {0} FROM {1} WHERE sampleId = {2}
{0} is the list of number columns, {1} is the CSV file name and {2} is the sampleId we are looking for. The other solution that came to mind is to take all the columns after the last fixed column. I don't know whether that is possible or not, just thinking out loud.
Please be descriptive; my SQL knowledge is basic. Any help is really appreciated.
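To make the idea concrete, here is a rough, untested sketch of how such a statement could be assembled for the Jet text driver (which treats a folder as the database and each CSV file as a table). The folder, file name, sampleId value and column count here are made-up placeholders:

```vbnet
' Sketch only: assumes the measurement columns are literally named "0", "1", ...
' and that the number of measurement columns in this file is known (here 25).
' Requires Imports System.Linq for Enumerable.Range.
Dim columnCount As Integer = 25
Dim numberColumns As String = String.Join(",",
    Enumerable.Range(0, columnCount).Select(Function(n) "[" & n & "]"))

' {0} = column list, {1} = CSV file name, {2} = sampleId
Dim sql As String = String.Format("SELECT {0} FROM [{1}] WHERE sampleId = {2}",
                                  numberColumns, "data.csv", 42)
' Builds: SELECT [0],[1],...,[24] FROM [data.csv] WHERE sampleId = 42
```

The brackets matter: the Jet SQL dialect needs them around purely numeric column names and around file names used as table names.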
So I finally managed to solve it. The code is in VB.NET, but the logic is quite clear.
Private Function GetDataFromCSV(sampleIds As Integer()) As List(Of KeyValuePair(Of String, List(Of Integer)))
    Dim dataFiles() As String = System.IO.Directory.GetFiles(OutputFolder(), "*.CSV")
    Dim results As New List(Of KeyValuePair(Of String, List(Of Integer)))
    If dataFiles.Length > 0 AndAlso sampleIds.Length > 0 Then
        For index As Integer = 0 To sampleIds.Length - 1
            If sampleIds(index) > 0 Then
                For Each file In dataFiles
                    If System.IO.File.Exists(file) Then
                        Dim currentId As String = sampleIds(index).ToString()
                        Dim filename As String = Path.GetFileName(file)
                        Dim strPath As String = Path.GetDirectoryName(file)
                        Using conn As New OleDb.OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0; Data Source=" & strPath & "; Extended Properties='text; HDR=Yes; FMT=Delimited'")
                            Dim command As OleDb.OleDbCommand = conn.CreateCommand()
                            'the column name contains a space, so it must be bracketed
                            command.CommandText = "SELECT * FROM [" & filename & "] WHERE [Sample ID] = " & currentId
                            conn.Open()
                            Using reader As OleDb.OleDbDataReader = command.ExecuteReader()
                                Dim numberOfFields = reader.FieldCount
                                While reader.Read()
                                    If reader("Sample ID").ToString() = currentId Then 'If found, collect the particle data
                                        Dim particles As New List(Of Integer)
                                        For field As Integer = 0 To numberOfFields - 1
                                            particles.Add(CInt(reader(field.ToString())))
                                        Next field
                                        results.Add(New KeyValuePair(Of String, List(Of Integer))(currentId, particles))
                                    End If
                                End While
                            End Using
                        End Using
                    End If
                Next file
            End If
        Next index
        Return results
    Else
        MessageBox.Show("Missing csv files or invalid sample Id(s)", "Internal error", MessageBoxButtons.OK, MessageBoxIcon.Exclamation)
        Return results
    End If
End Function
Retrieve data from CSV
Asking permission: I created a bot to input data to the web using VB.NET and Selenium, and I need to retrieve data from a CSV. How can I retrieve only the data that is needed? For example, if there are 100 rows, only rows 30-50 should be taken; the loop should not run over the entire file.
Dim textFieldParser As TextFieldParser = New TextFieldParser(TextBox1.Text) With
    {
        .TextFieldType = FieldType.Delimited,
        .Delimiters = New String() {","}
    }
drv = New ChromeDriver(options)
While Not textFieldParser.EndOfData
    Try
        Dim strArrays As String() = textFieldParser.ReadFields()
        Dim name As String = strArrays(0)
        Dim alamat As String = strArrays(1)
        Dim notlp As String = strArrays(2)
        drv.Navigate().GoToUrl("URL")
        Dim Nm = drv.FindElement(By.XPath("/html/body/div[1]/div[3]/form/div[1]/div[1]/div[1]/div/div[2]/input"))
        Nm.SendKeys(name)
        Threading.Thread.Sleep(3000)
    Catch ex As Exception
        MsgBox("Line " & ex.Message & " is not valid and will be skipped.")
    End Try
End While
Thank you
Here's an example of using TextFieldParser to read one specific line and a specific range of lines. Note that I am using zero-based indexes for the lines. You can adjust as required if you want to use 1-based line numbers.
Public Function GetLine(filePath As String, index As Integer) As String()
    Using parser As New TextFieldParser(filePath) With {.Delimiters = {","}}
        Dim linesDiscarded = 0
        Do Until linesDiscarded = index
            parser.ReadLine()
            linesDiscarded += 1
        Loop
        Return parser.ReadFields()
    End Using
End Function

Public Function GetLines(filePath As String, startIndex As Integer, count As Integer) As List(Of String())
    Using parser As New TextFieldParser(filePath) With {.Delimiters = {","}}
        Dim linesDiscarded = 0
        Do Until linesDiscarded = startIndex
            parser.ReadLine()
            linesDiscarded += 1
        Loop
        Dim lines As New List(Of String())
        Do Until lines.Count = count
            lines.Add(parser.ReadFields())
        Loop
        Return lines
    End Using
End Function
Simple loops to skip and to take lines.
I have a requirement where I need to query a DB and fetch the records into a DataTable. The DataTable has 20,000 records.
I need to batch these records into batches of 100 records each and write each batch into an individual text file.
So far I have been able to batch the records into batches of 100 using IEnumerable(Of DataRow).
I am now facing an issue writing the IEnumerable(Of DataRow) to a text file.
My code is as below:
Dim strsql = "Select * from myTable;"
Dim dt As New DataTable
Using cnn As New SqlConnection(connectionString)
    cnn.Open()
    Using dad As New SqlDataAdapter(strsql, cnn)
        dad.Fill(dt)
    End Using
    cnn.Close()
End Using

Dim chunks = getChunks(dt, 100)
For Each chunk As IEnumerable(Of DataRow) In chunks
    Dim path As String = "myFilePath"
    If Not File.Exists(path) Then
        '** Here I will write my Batch into the File.
    End If
Next
Public Iterator Function getChunks(ByVal Tab As DataTable, ByVal size As Integer) As IEnumerable(Of IEnumerable(Of DataRow))
    Dim chunk As List(Of DataRow) = New List(Of DataRow)(size)
    For Each row As DataRow In Tab.Rows
        chunk.Add(row)
        If chunk.Count = size Then
            Yield chunk
            chunk = New List(Of DataRow)(size)
        End If
    Next
    If chunk.Any() Then Yield chunk
End Function
Need your help to write the IEnumerable of DataRows into a text file for each batch of records.
Thanks
:)
Your existing code is needlessly complex. If this is all you're doing, then using a DataTable is unnecessary/unwise; this is one of the few occasions where I would advocate using a lower-level DataReader to keep the memory impact low.
Writing a DB table to files, quick, easy and with low memory consumption:
Dim dr = sqlCommand.ExecuteReader()
Dim sb As New StringBuilder
Dim lineNum = -1
Dim batchSize = 100

While dr.Read()
    'turn the row into a string for our file
    For x = 0 To dr.FieldCount - 1
        sb.Append(dr.GetString(x)).Append(",")
    Next x
    sb.Length -= 1 'remove trailing comma
    sb.AppendLine()

    'keep track of lines written so we can batch accordingly
    lineNum += 1
    Dim fileNum = lineNum \ batchSize
    File.AppendAllText($"c:\temp\file{fileNum}.csv", sb.ToString())

    'clear the stringbuilder
    sb.Length = 0
End While
If you really want to use a DataTable, there isn't anything stopping you swapping this While dr.Read() for a For Each r As DataRow In myDataTable.Rows.
Please note, this isn't an exercise in creating a fully escaped CSV, nor in formatting the data. It demonstrates the concept of taking a firehose of data and simply writing it to N different files in a single pass, by utilising the fact that an integer divide by 100 on every number from 0 to 99 results in 0 (and hence those lines go in file 0), every number from 100 to 199 results in 1 (and hence those lines go in file 1), and so on.
You could instead build the file lines up in the StringBuilder and write them out once every batchSize lines (when lineNum Mod batchSize = batchSize - 1), if you feel that would be more efficient than calling File.AppendAllText for every row (which opens and closes the file each time).
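For illustration, a hedged, untested sketch of that buffered variant, reusing the sb, lineNum and batchSize names from the snippet above; note the final partial batch still has to be flushed after the loop:

```vbnet
'inside the While dr.Read() loop, after appending the row to sb:
If lineNum Mod batchSize = batchSize - 1 Then
    'flush one whole batch to its file in a single call
    File.AppendAllText($"c:\temp\file{lineNum \ batchSize}.csv", sb.ToString())
    sb.Length = 0
End If

'after the loop: flush whatever partial batch remains
If sb.Length > 0 Then
    File.AppendAllText($"c:\temp\file{lineNum \ batchSize}.csv", sb.ToString())
End If
```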
Tested this with a table of a little over 1,500 records and 10 fields; the file creation took a little over 5 seconds (excluding data access). All things being equal (which I know they are not), that would be over 13 seconds writing the files.
Since your problem was with the iterator, I assume there were no memory issues with the DataTable.
You can include more than one database object in a Using block by using a comma to designate a list of objects in the Using.
Private Sub OPCode()
    Dim myFilePath = "C:\Users\xxx\Documents\TestLoop\DataFile"
    Dim strsql = "Select * from myTable;"
    Dim dt As New DataTable

    Using cnn As New SqlConnection(connectionString),
          cmd As New SqlCommand(strsql, cnn)
        cnn.Open()
        dt.Load(cmd.ExecuteReader)
    End Using

    Dim sw As New Stopwatch
    sw.Start()
    Dim StartRow = 0
    Dim EndRow = 99
    Dim FileNum = 1
    Dim TopIndex = dt.Rows.Count - 1
    Do
        'clamp before writing so small/ragged tables don't index past the end
        If EndRow > TopIndex Then
            EndRow = TopIndex
        End If
        For i = StartRow To EndRow
            Dim s = String.Join("|", dt.Rows(i).ItemArray)
            File.AppendAllText(myFilePath & FileNum & ".txt", s & Environment.NewLine)
        Next
        FileNum += 1
        StartRow += 100
        EndRow += 100
    Loop Until StartRow > TopIndex
    sw.Stop()
    MessageBox.Show(sw.ElapsedMilliseconds.ToString)
End Sub
I thought your code was a great use of the iteration function.
Here is the code for your iterator.
Public Iterator Function getChunks(ByVal Tab As DataTable, ByVal size As Integer) As IEnumerable(Of IEnumerable(Of DataRow))
    Dim chunk As List(Of DataRow) = New List(Of DataRow)(size)
    For Each row As DataRow In Tab.Rows
        chunk.Add(row)
        If chunk.Count = size Then
            Yield chunk
            chunk = New List(Of DataRow)(size)
        End If
    Next
    If chunk.Any() Then Yield chunk
End Function
Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
    Dim dt = LoadDataTable()
    Dim myFilePath As String = "C:\Users\xxx\Documents\TestLoop\DataFile"
    Dim FileNum = 1
    For Each chunk As IEnumerable(Of DataRow) In getChunks(dt, 100)
        For Each row As DataRow In chunk
            Dim s = String.Join("|", row.ItemArray)
            File.AppendAllText(myFilePath & FileNum & ".txt", s & Environment.NewLine)
        Next
        FileNum += 1
    Next
    MessageBox.Show("Done")
End Sub
You just needed to nest the For Each to get at the data rows.
I'm reading an Excel sheet that has two sections, an input section and an output section, and importing it into two DataGridViews (one for the input section, another for the output section) in a VB.NET 2008 Windows application.
If I have 10 rows and 10 columns in the input section, then my 11th row contains a text like 'End of Input Data', and likewise my 11th column.
So, by finding the row and column number of this string, I can import the data into the two data grid views; I can import only those rows and columns into the input grid view.
Below is the code for reading the Excel sheet. I don't know how to find a string in a DataTable. Or is there another way to do this?
Private Sub ImpGrid_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles ImpGrid.Click
    Dim xlApp As Excel.Application
    Dim xlWorkBook As Excel.Workbook
    Dim conStr As String, sheetName As String
    Dim filePath As String = "C:\SIG.XLS"
    Dim extension As String = ".xls"
    conStr = String.Empty
    conStr = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & filePath & ";Extended Properties=""Excel 8.0;HDR=No;IMEX=1"""
    Using con As New OleDbConnection(conStr)
        Using cmd As New OleDbCommand()
            Using oda As New OleDbDataAdapter()
                Dim dt As New DataTable()
                ' cmd.CommandText = (Convert.ToString("SELECT * From [") & sheetName) + "]"
                cmd.CommandText = "select * from [Sheet1$]"
                cmd.Connection = con
                con.Open()
                oda.SelectCommand = cmd
                oda.Fill(dt)
                con.Close()
                'Populate DataGridView.
                Loggridview.DataSource = dt
            End Using
        End Using
    End Using
End Sub
Here's a function that returns a list of hits. Each result is a SearchHit structure, where RowIndex is the row index and CellIndex is the column index:
Public Structure SearchHit
    Public ReadOnly RowIndex As Integer
    Public ReadOnly CellIndex As Integer

    Public Sub New(rowIndex As Integer, cellIndex As Integer)
        Me.RowIndex = rowIndex
        Me.CellIndex = cellIndex
    End Sub
End Structure

Public Function FindData(ByVal value As String, table As DataTable) As List(Of SearchHit)
    Dim results As New List(Of SearchHit)
    For rowId = 0 To table.Rows.Count - 1
        Dim row = table.Rows(rowId)
        For colId = 0 To table.Columns.Count - 1
            Dim cellValue = row(colId)
            If value.Equals(cellValue) Then
                results.Add(New SearchHit(rowId, colId))
            End If
        Next
    Next
    Return results
End Function
I've written a function which reads CSV files and parameterizes them accordingly. For that I have a function GetTypesSQL which first queries an SQL table to get the data types, so that the columns can be adjusted before the data is later inserted into SQL. My problem is that when I set HDR=Yes in the Jet OLE DB connection string, I only get column names like F1, F2, F3. To circumvent this I set HDR=No and wrote some For loops, but now I get only empty strings. What is actually the problem? Here is my code:
Private Function GetCSVFile(ByVal file As String, ByVal min As Integer, ByVal max As Integer) As DataTable
    Dim ConStr As String = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & TextBox1.Text & ";Extended Properties=""TEXT;HDR=NO;IMEX=1;FMT=Delimited;CharacterSet=65001"""
    Dim conn As New OleDb.OleDbConnection(ConStr)
    Dim dt As New DataTable
    Dim da As OleDb.OleDbDataAdapter = Nothing
    getData = Nothing
    Try
        'the table name contains a dot, so it must be bracketed in the query
        Dim CMD As String = "Select * from [" & _table & ".csv]"
        da = New OleDb.OleDbDataAdapter(CMD, conn)
        da.Fill(min, max, dt)
        getData = New DataTable(_table)

        'with HDR=NO the first data row holds the column names
        Dim firstRow As DataRow = dt.Rows(0)
        For i As Integer = 0 To dt.Columns.Count - 1
            Dim columnName As String = firstRow(i).ToString()
            Dim newColumn As New DataColumn(columnName, mListOfTypes(i))
            getData.Columns.Add(newColumn)
        Next

        For i As Integer = 1 To dt.Rows.Count - 1
            Dim row As DataRow = dt.Rows(i)
            Dim newRow As DataRow = getData.NewRow()
            For j As Integer = 0 To getData.Columns.Count - 1
                If row(j).GetType Is GetType(String) Then
                    Dim colValue As String = row(j).ToString()
                    colValue = ChangeEncoding(colValue)
                    colValue = ParseString(colValue)
                    colValue = ReplaceChars(colValue)
                    newRow(j) = colValue
                Else
                    newRow(j) = row(j)
                End If
            Next
            getData.Rows.Add(newRow)
            Application.DoEvents()
        Next
    Catch ex As OleDbException
        MessageBox.Show(ex.Message)
    Catch ex As Exception
        MessageBox.Show(ex.Message)
    Finally
        dt.Dispose()
        da.Dispose()
    End Try
    Return getData
End Function
and here is GetTypesSQL; this one doesn't convert properly, especially doubles:
Private Sub GetTypesSQL()
    If (mListOfTypes Is Nothing) Then
        mListOfTypes = New List(Of Type)()
    End If
    mListOfTypes.Clear()
    Dim dtTabelShema As DataTable = db.GetDataTable("SELECT TOP 0 * FROM " & _table)
    Using dtTabelShema
        For Each col As DataColumn In dtTabelShema.Columns
            mListOfTypes.Add(col.DataType)
        Next
    End Using
End Sub
I think you have made it more complicated than it needs to be. For instance, you get the dbSchema by creating an empty DataTable and harvesting the Datatypes from it. Why not just use that first table rather than creating a new table from the Types? The table also need not be reconstructed over and over for each batch of rows imported.
Generally since OleDb will try to infer types from the data, it seems unnecessary and may even get in the way in some cases. Also, you are redoing everything that OleDB does and copying data to a different DT. Given that, I'd skip the overhead OleDB imposes and work with the raw data.
This creates the destination table using the CSV column name and the Type from the Database. If the CSV is not in the same column order as those delivered in a SELECT * query, it will fail.
The following uses a class to map csv columns to db table columns so the code is not depending on the CSVs being in the same order (since they may be generated externally). My sample data CSV is not in the same order:
Public Class CSVMapItem
    Public Property CSVIndex As Int32
    Public Property ColName As String = ""
    'optional
    Public Property DataType As Type

    Public Sub New(ndx As Int32, csvName As String,
                   dtCols As DataColumnCollection)
        CSVIndex = ndx
        For Each dc As DataColumn In dtCols
            If String.Compare(dc.ColumnName, csvName, True) = 0 Then
                ColName = dc.ColumnName
                DataType = dc.DataType
                Exit For
            End If
        Next
        If String.IsNullOrEmpty(ColName) Then
            Throw New ArgumentException("Cannot find column: " & csvName)
        End If
    End Sub
End Class
The code to parse the CSV uses CsvHelper, but in this case TextFieldParser could be used instead, since the code just reads the CSV rows into string arrays.
Dim SQL = String.Format("SELECT * FROM {0} WHERE ID<0", DBTblName)
Dim rowCount As Int32 = 0
Dim totalRows As Int32 = 0
Dim sw As New Stopwatch
sw.Start()

Using dbcon As New MySqlConnection(MySQLConnStr)
    Using cmd As New MySqlCommand(SQL, dbcon)
        Dim dtSample As New DataTable
        dbcon.Open()

        ' load empty DT, create the insert command
        Dim daSample As New MySqlDataAdapter(cmd)
        Dim cb = New MySqlCommandBuilder(daSample)
        daSample.InsertCommand = cb.GetInsertCommand
        dtSample.Load(cmd.ExecuteReader())
        ' dtSample is not only empty, but has the columns we need

        Dim csvMap As New List(Of CSVMapItem)
        Using sr As New StreamReader(csvfile, False),
              parser = New CsvParser(sr)

            ' col names from CSV
            Dim csvNames = parser.Read()

            ' create a map of CSV index to DT column name SEE NOTE
            For n As Int32 = 0 To csvNames.Length - 1
                csvMap.Add(New CSVMapItem(n, csvNames(n), dtSample.Columns))
            Next

            ' line data read as string
            Dim data As String()
            data = parser.Read()
            Dim dr As DataRow

            Do Until data Is Nothing OrElse data.Length = 0
                dr = dtSample.NewRow()
                For Each item In csvMap
                    ' optional/as needed type conversion
                    If item.DataType = GetType(Boolean) Then
                        ' "1" wont convert to bool, but (int)1 will
                        dr(item.ColName) = Convert.ToInt32(data(item.CSVIndex).Trim)
                    Else
                        dr(item.ColName) = data(item.CSVIndex).Trim
                    End If
                Next
                dtSample.Rows.Add(dr)
                rowCount += 1

                data = parser.Read()
                If rowCount = 50000 OrElse (data Is Nothing OrElse data.Length = 0) Then
                    totalRows += daSample.Update(dtSample)
                    ' empty the table so we never hold more than 50K rows
                    dtSample.Rows.Clear()
                    rowCount = 0
                End If
            Loop
        End Using
    End Using
End Using
sw.Stop()

Console.WriteLine("Parsed and imported {0} rows in {1} minutes", totalRows,
                  sw.Elapsed.TotalMinutes)
The processing loop updates the DB every 50K rows in case there are very many rows, and it does everything in one pass rather than reading N rows at a time through OleDb. CsvParser reads one row at a time, so there should never be more than 50,001 rows' worth of data on hand at once.
There may be special cases to handle for type conversions, as shown with If item.DataType = GetType(Boolean) Then. A Boolean value read in as the string "1" can't be assigned directly to a Boolean column, but converted to an integer it can. There could be other conversions, such as for funky dates.
Time to process 250,001 rows: 3.7 minutes. An app which needs to apply those string transforms to every single string column will take much longer. I'm pretty sure that using the CsvReader in CsvHelper you could have those transforms applied as part of parsing to a type.
There is a potential disaster waiting to happen since this is meant to be an all-purpose importer/scrubber.
For i As Integer = 0 To dt.Columns.Count - 1
    Dim columnName As String = firstRow(i).ToString()
    Dim newColumn As New DataColumn(columnName, mListOfTypes(i))
    getData.Columns.Add(newColumn)
Next
Both the question and the self-answer build the new table using the column names from the CSV and the DataTypes from a SELECT * query on the destination table. So, it assumes the CSV Columns are in the same order that SELECT * will return them, and that all CSVs will always use the same names as the tables.
The answer above is marginally better in that it finds and matches based on name.
A more robust solution is to write a little utility app where a user maps a DB column name to a CSV index. Save the results to a List(Of CSVMapItem) and serialize it; there could be a whole collection of these saved to disk. Then, rather than creating a map based on dead reckoning, just deserialize the desired one for use as the csvMap in the above code.
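As a sketch of that idea (the names here are invented; note that the DataType member of CSVMapItem is a Type, which XmlSerializer cannot round-trip, so only the name/index pairs are persisted and the types can be looked up from the table schema after loading):

```vbnet
' Hypothetical persisted mapping: DB column name -> CSV column index.
Public Class ColumnMapEntry
    Public Property ColName As String
    Public Property CSVIndex As Int32
End Class

Public Sub SaveMap(map As List(Of ColumnMapEntry), path As String)
    Dim ser As New Xml.Serialization.XmlSerializer(GetType(List(Of ColumnMapEntry)))
    Using fs As New IO.FileStream(path, IO.FileMode.Create)
        ser.Serialize(fs, map)
    End Using
End Sub

Public Function LoadMap(path As String) As List(Of ColumnMapEntry)
    Dim ser As New Xml.Serialization.XmlSerializer(GetType(List(Of ColumnMapEntry)))
    Using fs As New IO.FileStream(path, IO.FileMode.Open)
        Return DirectCast(ser.Deserialize(fs), List(Of ColumnMapEntry))
    End Using
End Function
```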
I am trying to read fields from a query into a string array. In VB6 I could simply declare the array and then read the fields into it, without it caring what type of values were in it. Now when I try to do the same thing I get "Unable to cast COM object of type 'dao.FieldClass' to type 'System.String'". Do I need to read the field value into a separate variable and then convert it to a string? The seqNum is what I am having the problem with.
Public dbEngine As dao.DBEngine
Public db As dao.Database, recSet As dao.Recordset
dbEngine = New dao.DBEngine
Dim seqNum As Long
scExportTemplatePath = "M:\robot\scTemplates\"
db = dbEngine.OpenDatabase(scExportStuffPath & "scExport.mdb")
tsOut = fso.CreateTextFile(wildePath & dte & "-" & fle.Name & ".csv", True)
With recSet
.MoveFirst()
Do While Not .EOF
seg = .Fields("segmentID")
If seg <> segHold Then
seqNum = 1
End If
arrOut(0) = .Fields("jobnum_AM")
Loop
End With
You have several problems with this code. In addition to the points mentioned by Jeremy:
What was Long in VB6 is now Integer in VB.NET. Long is a 64 bit integer now.
Use System.IO.Path.Combine in order to combine path strings. Combine automatically adds missing backslashes and removes superfluous ones. Path.Combine(scExportTemplatePath, "scExport.mdb")
The Field property does not have a default property any more. Non-indexed properties are never default properties in VB.NET. Get the field value with .Fields("segmentID").Value.
Convert its value to the appropriate type: seg = Convert.ToInt32(.Fields("segmentID").Value)
Note: VB's Integer type is just an alias for System.Int32.
You are always adding to the same array field. I don't know exactly what you have in mind. If you want to add one field only, you could just use a List(Of String). If you are adding several fields for each record, then a List(Of String()) (i.e. a list of string arrays) would be appropriate. Lists have the advantage that they grow automatically.
Dim list As New List(Of String())
Do While Not .EOF
    Dim values = New String(2) {}
    values(0) = Convert.ToString(.Fields("field_A").Value)
    values(1) = Convert.ToString(.Fields("field_B").Value)
    values(2) = Convert.ToString(.Fields("field_C").Value)
    list.Add(values)
    recSet.MoveNext()
Loop
But it is more comprehensible if you create a custom class for storing your field values:
Console.WriteLine("{0} {1} ({2})", user.FirstName, user.LastName, user.DateOfBirth)
... reads much better than:
Console.WriteLine("{0} {1} ({2})", values(0), values(1), values(2))
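A minimal sketch of such a class (the property names are invented here to match the WriteLine above):

```vbnet
Public Class User
    Public Property FirstName As String
    Public Property LastName As String
    Public Property DateOfBirth As Date
End Class

' Inside the loop, assuming field_A/field_B/field_C hold those values:
' list.Add(New User With {
'     .FirstName = Convert.ToString(.Fields("field_A").Value),
'     .LastName = Convert.ToString(.Fields("field_B").Value),
'     .DateOfBirth = Convert.ToDateTime(.Fields("field_C").Value)})
```

The list then becomes a List(Of User) instead of a List(Of String()).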
In VB.NET you have other possibilities to work with databases:
Dim list As New List(Of String())
Using conn = New OleDb.OleDbConnection( _
        "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=MyPath\MyDb.mdb")
    Dim sql = "SELECT myStuff FROM myTable"
    Dim command = New OleDbCommand(sql, conn)
    conn.Open()
    Using reader As OleDbDataReader = command.ExecuteReader()
        While reader.Read()
            Dim values = New String(reader.FieldCount - 1) {}
            For i = 0 To reader.FieldCount - 1
                values(i) = Convert.ToString(reader.GetValue(i))
            Next
            list.Add(values)
        End While
    End Using
End Using
Note that the Using statement closes the resources automatically at the end, even if an error occurs and the code is terminated prematurely.
In VB.NET you can write to files like this (without using fso, which is not .NET-like):
Using writer As New StreamWriter("myFile.txt", False)
    writer.WriteLine("line 1")
    writer.WriteLine("line 2")
    writer.WriteLine("line 3")
End Using
1) You don't show how you open the Recordset, e.g.:
recSet = db.OpenRecordset("query_name or SQL")
2) You don't have a .MoveNext in the loop:
With recSet
    .MoveFirst()
    Do While Not .EOF
        seg = .Fields("segmentID")
        If seg <> segHold Then
            seqNum = 1
        End If
        arrOut(0) = .Fields("jobnum_AM")
        .MoveNext()
    Loop
End With