Using Regex to match multiple html lines - vb.net

As the title says, I have HTML source that contains the code below. I am trying to grab the number 25, which can change, so I think I would use the wildcard .*, but I am not sure how to grab it because the number is on its own line. I usually work with a single line of HTML, split it on the double quotes, and take the value. I know how to do the WebRequest and WebResponse, so I have no problem getting the source. Any help with the regex would be appreciated.
<td class="number">
25
</td>
Edit: This is what I used to get the source.
Dim strURL As String = "mysite"
Dim strOutput As String = ""
Dim wrResponse As WebResponse
Dim wrRequest As WebRequest = HttpWebRequest.Create(strURL)
wrResponse = wrRequest.GetResponse()
Using sr As New StreamReader(wrResponse.GetResponseStream())
strOutput = sr.ReadToEnd()
'No explicit Close is needed: End Using disposes the StreamReader
End Using

Instead of a Regex, it is possible to use a string search to pull out the text, e.g.:
Sub Main()
Dim html = "<td class=""number"">" + vbCrLf + "25" + vbCrLf + "</td>"
Dim startText = "<td class=""number"">"
Dim startIndex = html.IndexOf(startText) + startText.Length
Dim endText = "</td>"
Dim endIndex = html.IndexOf(endText, startIndex)
Dim text = html.Substring(startIndex, endIndex - startIndex)
text = text.Trim({CChar(vbLf), CChar(vbCr), " "c})
Console.WriteLine(text)
End Sub
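Since the question asked for a regex, here is a minimal sketch of that route as well. It assumes the cell always has exactly the form shown in the question (the same class attribute and only digits inside); the module name and the sample string are made up for illustration. The \s* parts absorb the line breaks around the number, so no multiline option is needed.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexExample
    Sub Main()
        ' Sample input mirroring the question: the number sits on its own line
        Dim html As String = "<td class=""number"">" & vbCrLf & "25" & vbCrLf & "</td>"

        ' \s* matches any whitespace, including CR and LF; (\d+) captures the digits
        Dim m As Match = Regex.Match(html, "<td class=""number"">\s*(\d+)\s*</td>")

        If m.Success Then
            Console.WriteLine(m.Groups(1).Value) ' prints 25
        End If
    End Sub
End Module
```

For anything beyond a single well-known cell, an HTML parser is usually more robust than a regex.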

Related

How to remove extra line feeds within double quoted fields

Newbie here. The code below removes ALL line feeds in my file, including the end-of-record (EOR) line feeds. Can somebody please help me fix it so that it only removes the extra line feeds within double-quoted fields? Any help will be greatly appreciated. Thanks
Public Sub Main()
'
Dim objReader As IO.StreamReader
Dim contents As String
objReader = New IO.StreamReader("testfile.csv")
contents = objReader.ReadToEnd()
objReader.Close()
Dim objWriter As New System.IO.StreamWriter("testfile.csv")
MsgBox(contents)
'contents = Replace(contents, vbCr, "")
contents = Replace(contents, vbLf, "")
MsgBox(contents)
objWriter.Write(contents)
objWriter.Close()
'
Dts.TaskResult = ScriptResults.Success
End Sub
I forgot to mention that the input file name changes daily. How do I code this so that it doesn't care about the file name, as long as it is a CSV file? The test file name contains the current date and changes daily. I've tried passing just the folder path and it errored out. I also tried *.csv and it didn't like that either.
objReader = New IO.StreamReader("\FolderA\FolderB\TestFile09212022.csv")
If you are sure there are no double quote characters inside the quoted fields, you can do it like this:
Dim sNewString As String = ""
Dim s As String
Dim bFirstQuoted As Boolean = False
Dim i As Integer
Dim objWriter As New System.IO.StreamWriter("testfile.csv")
MsgBox(contents)
For i = 1 To contents.Length
s = Mid(contents, i, 1)
If s = """" Then bFirstQuoted = Not bFirstQuoted
If Not bFirstQuoted OrElse (s <> vbLf AndAlso bFirstQuoted) Then
sNewString += s
else
sNewString += " "
End If
Next
MsgBox(sNewString )
objWriter.Write(sNewString )
objWriter.Close()
Dts.TaskResult = ScriptResults.Success
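The follow-up about the changing file name is not covered by the code above. A minimal sketch of one way to handle it, assuming every *.csv file in the folder should be cleaned in place: Directory.GetFiles accepts a search pattern, whereas StreamReader needs one concrete file name. The folder path is the one from the question; the module and function names are made up for illustration. Unlike the loop above, this version also replaces carriage returns inside quoted fields, so a CR+LF pair becomes two spaces.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module CsvLineFeedCleaner
    ' Replaces CR and LF characters that occur inside double-quoted fields with a space.
    ' Line breaks outside quotes (the end-of-record markers) are kept unchanged.
    Function RemoveQuotedLineFeeds(contents As String) As String
        Dim sb As New StringBuilder(contents.Length)
        Dim inQuotes As Boolean = False
        For Each c As Char In contents
            If c = """"c Then inQuotes = Not inQuotes
            If inQuotes AndAlso (c = ControlChars.Lf OrElse c = ControlChars.Cr) Then
                sb.Append(" "c)
            Else
                sb.Append(c)
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Dim folder As String = "\FolderA\FolderB" ' folder from the question
        If Not Directory.Exists(folder) Then Return

        ' GetFiles expands the wildcard into full file names
        For Each csvFile As String In Directory.GetFiles(folder, "*.csv")
            Dim cleaned As String = RemoveQuotedLineFeeds(File.ReadAllText(csvFile))
            File.WriteAllText(csvFile, cleaned)
        Next
    End Sub
End Module
```

A StringBuilder is used instead of repeated string concatenation because the file is processed one character at a time.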

vb.net how do i add long text into csv

Hello, this is my first thread.
I'm trying to extract the description from this page (https://www.tokopedia.com/indoislamicstore/cream-zaitun-arofah)
with a regex, replace each <br/> tag with a new line, and write the result to a CSV file.
The DataGridView looks all right, but the CSV gets screwed up.
This is my code:
Dim dskrip As New System.Text.RegularExpressions.Regex("<p itemprop=""description"" class=""mt-20"">(.*?)\<\/p>\<\/div>")
Dim dskripm As MatchCollection = dskrip.Matches(rssourcecode0)
For Each itemdskrm As Match In dskripm
getdeskripsinew = itemdskrm.Groups(1).Value
Next
Dim deskripsinew As String = Replace(getdeskripsinew, ",", ";")
Dim deskripsitotal As String = Replace(deskripsinew, "<br/>", Environment.NewLine)
' ListView1.s = Environment.NewLine & deskripsinew
txtDeskripsi.Text = deskripsitotal
datascrapes.ColumnCount = 5
datascrapes.Columns(0).Name = "Title"
datascrapes.Columns(1).Name = "Price"
datascrapes.Columns(2).Name = "Deskripsi"
datascrapes.Columns(3).Name = "Gambar"
datascrapes.Columns(4).Name = "Total Produk"
Dim row As String() = New String() {getname, totalprice, deskripsitotal, directoryme + getfilename, "10"}
datascrapes.Rows.Add(row)
Dim filePath As String = Environment.GetFolderPath(Environment.SpecialFolder.Desktop) & "\" & "Tokopedia_Upload.csv"
Dim delimeter As String = ","
Dim sb As New StringBuilder
For i As Integer = 0 To datascrapes.Rows.Count - 1
Dim array As String() = New String(datascrapes.Columns.Count - 1) {}
If i.Equals(0) Then
For j As Integer = 0 To datascrapes.Columns.Count - 1
array(j) = datascrapes.Columns(j).HeaderText
Next
sb.AppendLine(String.Join(delimeter, array))
End If
For j As Integer = 0 To datascrapes.Columns.Count - 1
If Not datascrapes.Rows(i).IsNewRow Then
array(j) = datascrapes(j, i).Value.ToString
End If
Next
If Not datascrapes.Rows(i).IsNewRow Then
sb.AppendLine(String.Join(delimeter, array))
End If
Next
File.WriteAllText(filePath, sb.ToString)
this is the csv file
Looking at the CSV file, I'm not sure exactly where your problem is, but there are cases where you need to quote the values in a CSV, for example when a value contains the delimiter, a double quote, or a line break. There is no official spec, but RFC 4180 is often used as a de facto standard. I would recommend using a library such as CsvHelper.
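If a library is not an option, the quoting rule from RFC 4180 is small enough to sketch by hand. The function name QuoteCsvField and the sample values below are hypothetical, not taken from the question's code. A field is wrapped in double quotes when it contains a comma, a double quote, or a line break, and any double quote inside it is doubled.

```vbnet
Imports System

Module CsvQuoting
    ' Quotes a single CSV field following the RFC 4180 convention
    Function QuoteCsvField(value As String) As String
        If value Is Nothing Then Return ""
        Dim special As Char() = {","c, """"c, ControlChars.Cr, ControlChars.Lf}
        If value.IndexOfAny(special) < 0 Then Return value
        ' Double every embedded quote, then wrap the whole field in quotes
        Return """" & value.Replace("""", """""") & """"
    End Function

    Sub Main()
        ' Hypothetical row: the second field contains a line break and a comma
        Dim fields As String() = {"Title", "Line one" & Environment.NewLine & "Line two, with comma", "10"}
        For i As Integer = 0 To fields.Length - 1
            fields(i) = QuoteCsvField(fields(i))
        Next
        Console.WriteLine(String.Join(",", fields))
    End Sub
End Module
```

With quoting in place, the description can keep its commas and line breaks, so replacing "," with ";" would no longer be needed.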

console write line every 8 characters

I'm converting text to binary, and sending that information to the console. However, I want the console to write a new line for every 8 characters.
My convert to binary code looks like this (where Result.Text contains random text, that will be converted)
Dim Resultconvert As String = ""
For Each C As Char In Result.Text
Dim s As String = System.Convert.ToString(AscW(C), 2).PadLeft(8, "0")
Resultconvert &= s
Next
Console.WriteLine(Resultconvert)
So it should look like this:
01110010
01111110
01101111
01011100
01100100
01010001
01001101
00111010
01010100
instead of:
011100100111111001101111010111000110010001010001010011010011101001010100
Any help is appreciated. Thank you in advance.
Dim Resultconvert As String = String.Empty
For Each C As Char In Result.Text
Dim s As String = System.Convert.ToString(AscW(C), 2).PadLeft(8, "0")
Debug.Print(s)
Resultconvert &= s
Next
The simplest solution is to just call WriteLine separately for each byte, like this:
For Each C As Char In Result.Text
Dim s As String = System.Convert.ToString(AscW(C), 2).PadLeft(8, "0"c) ' "0"c is a Char literal, which also compiles with Option Strict On
Console.WriteLine(s)
Next
However, if you want to write it all at once, you need to append a new line to the string after each byte, like this:
Dim Resultconvert As String = ""
For Each C As Char In Result.Text
Dim s As String = System.Convert.ToString(AscW(C), 2).PadLeft(8, "0"c) ' "0"c is a Char literal, which also compiles with Option Strict On
Resultconvert &= s & Environment.NewLine
Next
Console.WriteLine(Resultconvert)
But, at that point, it would be easier and more efficient to use a StringBuilder:
Dim builder As New StringBuilder()
For Each C As Char In Result.Text
Dim s As String = System.Convert.ToString(AscW(C), 2).PadLeft(8, "0"c) ' "0"c is a Char literal, which also compiles with Option Strict On
builder.AppendLine(s)
Next
Console.WriteLine(builder.ToString())
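If the binary string has already been built as one long run, it can also be printed in 8-character chunks afterwards. A minimal sketch, using the sample string from the question as input (the module and variable names are made up):

```vbnet
Imports System

Module ChunkExample
    Sub Main()
        ' The concatenated output shown in the question
        Dim bits As String = "011100100111111001101111010111000110010001010001010011010011101001010100"

        ' Step through the string 8 characters at a time
        For i As Integer = 0 To bits.Length - 1 Step 8
            ' Math.Min guards the last chunk when the length is not a multiple of 8
            Dim length As Integer = Math.Min(8, bits.Length - i)
            Console.WriteLine(bits.Substring(i, length))
        Next
    End Sub
End Module
```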

Issue in splitting an array of strings

I'm using webrequests to retrieve data in a .txt file that's on my dropbox using this "format".
SomeStuff
AnotherStuff
StillAnother
And i'm using this code to retrieve each line and read it:
Private Sub DataCheck()
Dim datarequest As HttpWebRequest = CType(HttpWebRequest.Create("https://dl.dropboxusercontent.com.txt"), HttpWebRequest)
Dim dataresponse As HttpWebResponse = CType(datarequest.GetResponse(), HttpWebResponse)
Dim sr2 As System.IO.StreamReader = New System.IO.StreamReader(dataresponse.GetResponseStream())
Dim datastring() As String = sr2.ReadToEnd().Split(CChar(Environment.NewLine))
If datastring(datastring.Length - 1) <> String.Empty Then
For Each individualdata In datastring
MessageBox.Show(individualdata)
Console.WriteLine(individualdata)
Next
End If
End Sub
The problem is the output. Every line after the first starts with an extra line break character (I see it as " " at the start of each string except the first), like this:
http://img203.imageshack.us/img203/1296/gejb.png
Why does this happen? I also tried replacing Environment.NewLine with nothing, like this:
Dim newstring as String = individualdata.Replace(Environment.Newline, String.Empty)
But the result was the same. What's the problem here? I tried multiple newline strings and constants such as vbNewLine, and all gave the same result. Any ideas?
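You are not splitting by NewLine. Environment.NewLine is a String (carriage return plus line feed on Windows), and CChar keeps only its first character, the carriage return, so the line feed is left at the start of every following line. You just have to use the overload of String.Split that takes a String() and a StringSplitOptions value: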
So instead of
Dim text = sr2.ReadToEnd()
Dim datastring() As String = text.Split(CChar(Environment.NewLine))
this
Dim datastring() As String = text.Split({Environment.NewLine}, StringSplitOptions.None)
I suspect that your file contains a mix of CarriageReturn+LineFeed (vbCrLf) and plain LineFeed (vbLf) line endings.
If this is the case then you could create an array of the possible separators
Dim seps(1) As Char ' upper bound 1 gives exactly two elements
seps(0) = CChar(vbLf)
seps(1) = CChar(vbCr)
Dim datastring() As String = sr2.ReadToEnd().Split(seps, StringSplitOptions.RemoveEmptyEntries)
The StringSplitOptions.RemoveEmptyEntries is required because a vbCrLf creates an empty string between the two separators
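A minimal sketch of a third option that keeps empty lines: pass the full line-ending strings as separators, with vbCrLf listed first so that a CR+LF pair is treated as one separator. The sample text is made up to mimic a file with mixed line endings.

```vbnet
Imports System

Module SplitLinesExample
    Sub Main()
        ' Sample data with one CR+LF ending and one bare LF ending
        Dim text As String = "SomeStuff" & vbCrLf & "AnotherStuff" & vbLf & "StillAnother"

        ' vbCrLf comes first so it is matched before the single-character separators
        Dim separators As String() = {vbCrLf, vbLf, vbCr}
        Dim entries As String() = text.Split(separators, StringSplitOptions.None)

        For Each entry As String In entries
            ' Brackets make any stray leading or trailing character visible
            Console.WriteLine("[" & entry & "]")
        Next
    End Sub
End Module
```

StringSplitOptions.None is enough here because no empty string is produced between the CR and the LF.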

How to add a string to multiple string for printing external

This is going to be a long one, but it should be an easy fix.
I've managed to convert a PDF to a string, and I can print an external PDF simply by putting the file name in a textbox.
I've also figured out how to extract certain text from the PDF string. The extracted values are themselves the names of files in an external location (I use c:\temp\ for testing).
That leaves one problem. I print the extracted file name with ShellExecute, which works fine when there is one string. However, when more than one file name is extracted, they are treated as a single string, so the location and .pdf are added once around the whole thing instead of to each name, and that combined string is what gets sent to the printer.
I want to send the files to the printer one at a time.
I've tried an ArrayList and various other methods, but with my lack of knowledge I cannot figure it out.
I'm thinking a For loop will help me out. Any ideas?
Below is my code.
Dim pdffilename As String = Nothing
pdffilename = RawTextbox.Text
Dim filepath = "c:\temp\" & RawTextbox.Text & ".pdf"
Dim thetext As String
thetext = GetTextFromPDF(filepath) ' converts pdf to text from a function I didnt show.
Dim re As New Regex("[\t ](?<w>((asm)|(asy)|(717)|(ssm)|(715)|(818))[a-z0-9]*)[\t ]", RegexOptions.ExplicitCapture Or RegexOptions.IgnoreCase Or RegexOptions.Compiled) ' This filters out and extract certain keywords from the PDF
Dim Lines() As String = {thetext}
Dim words As New List(Of String)
For Each s As String In Lines
Dim mc As MatchCollection = re.Matches(s)
For Each m As Match In mc
words.Add(m.Groups("w").Value)
Next
RawRich4.Text = String.Join(Environment.NewLine, words.ToArray)
Next
'This is where I need help with the code. how to have "words" putout "c:\temp\" & RawRich4.Text & ".pdf" with each keyword name
Dim rawtoprint As String = String.Join(Environment.NewLine, words.ToArray)
Dim defname As String = Nothing
defname = RawRich4.Text
rawtoprint = "c:\temp\" & RawRich4.Text & ".pdf"
Dim psi As New System.Diagnostics.ProcessStartInfo()
psi.UseShellExecute = True
psi.Verb = "print"
psi.WindowStyle = ProcessWindowStyle.Hidden
psi.Arguments = PrintDialog1.PrinterSettings.PrinterName.ToString()
psi.FileName = (rawtoprint) ' this is where the error occurs it doesn't send both files separately to the printer, it tries to send it as one name
MessageBox.Show(rawtoprint) ' This is just to test the output, this will be removed.
'Process.Start(psi)
End Sub
Updated.
Imports System.Text.RegularExpressions
Module Program
Sub Main()
Dim pdffilename As String = RawTextbox.Text
Dim filepath = "c:\temp\" & RawTextbox.Text & ".pdf"
Dim thetext As String
thetext = GetTextFromPDF(filepath) ' converts pdf to text from a function I didnt show.
'thetext = "Random text here and everywhere ASM00200207 1 1 same here bah boom 12303 doh hel232 ASM00200208 1 2 "
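Dim pattern As String = "(?i)[\t ](?<w>((asm)|(asy)|(717)|(ssm)|(715)|(818))[a-z0-9]*)[\t ]"
'Regex.Matches is a Shared method, so no Regex instance is needed
For Each m As Match In Regex.Matches(thetext, pattern)
'Console.WriteLine("C:\temp\" & m.Groups("w").Value & ".pdf")
'The named group w holds the keyword without the surrounding space or tab
RawPrintFunction("C:\temp\" & m.Groups("w").Value & ".pdf")
Next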
End Sub
Function RawPrintFunction(ByVal rawtoprint As String) As Integer
Dim psi As New System.Diagnostics.ProcessStartInfo()
psi.UseShellExecute = True
psi.Verb = "print"
psi.WindowStyle = ProcessWindowStyle.Hidden
psi.Arguments = PrintDialog1.PrinterSettings.PrinterName.ToString()
psi.FileName = rawtoprint ' the single file to print on this call
MessageBox.Show(rawtoprint) ' This will be removed, this is just for testing to see what files will be printed
'Process.Start(psi) ' This will be uncommented.
Return 0
End Function
End Module
If I understand the code correctly (I can't test and run it here), you can iterate through the file names stored in the words variable and send each one to the printer. The following example shows how:
....
....
Dim Lines() As String = {thetext}
Dim words As New List(Of String)
For Each s As String In Lines
Dim mc As MatchCollection = re.Matches(s)
For Each m As Match In mc
words.Add(m.Groups("w").Value)
Next
RawRich4.Text = String.Join(Environment.NewLine, words.ToArray)
Next
For Each fileName As String In words
Dim rawtoprint As String
rawtoprint = "c:\temp\" & fileName & ".pdf"
Dim psi As New System.Diagnostics.ProcessStartInfo()
psi.UseShellExecute = True
psi.Verb = "print"
psi.WindowStyle = ProcessWindowStyle.Hidden
psi.Arguments = PrintDialog1.PrinterSettings.PrinterName.ToString()
psi.FileName = rawtoprint ' inside the loop, each iteration uses exactly one file name
MessageBox.Show(rawtoprint) ' This is just to test the output, this will be removed.
'Process.Start(psi)
Next
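To check which paths the loop would produce without involving the printer, the path-building step can be tried on its own. A minimal sketch, using the regex from the question and the sample text from the commented-out line in the updated code; the function name BuildPdfPaths is made up for illustration, and nothing is printed or started here.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module KeywordPathsExample
    ' Returns one full path per keyword found in the text
    Function BuildPdfPaths(text As String, folder As String) As List(Of String)
        Dim re As New Regex("[\t ](?<w>((asm)|(asy)|(717)|(ssm)|(715)|(818))[a-z0-9]*)[\t ]",
                            RegexOptions.ExplicitCapture Or RegexOptions.IgnoreCase)
        Dim paths As New List(Of String)
        For Each m As Match In re.Matches(text)
            ' The named group w excludes the surrounding space or tab
            paths.Add(folder & m.Groups("w").Value & ".pdf")
        Next
        Return paths
    End Function

    Sub Main()
        Dim sample As String = "Random text here and everywhere ASM00200207 1 1 same here bah boom 12303 doh hel232 ASM00200208 1 2 "
        For Each pdfPath As String In BuildPdfPaths(sample, "c:\temp\")
            Console.WriteLine(pdfPath)
        Next
    End Sub
End Module
```

For the sample text this lists c:\temp\ASM00200207.pdf and c:\temp\ASM00200208.pdf as two separate entries, which is the one-file-per-call behavior the print loop needs.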