vb.net how to split text file - vb.net

I have a file with the comma-separated lines as below:
PS23456789,08/2023,2011,LAM CHIUE MONG JP
sdad,08/2023,2011,LAM CHIUE MONG JP
xvczxcssf,08/2023,2011,LAM CHIUE MONG JP
42432,08/2023,2011,LAM CHIUE MONG JP
fdsafs,08/2023,2011,LAM CHIUE MONG JP
I want to convert this data to fixed-length values and no commas, like this:
PS23456789 08/2023 2011 LAM CHIUE MONG JP
sdad 08/2023 2011 LAM CHIUE MONG JP
xvczxcssf 08/2023 2011 LAM CHIUE MONG JP
42432 08/2023 2011 LAM CHIUE MONG JP
fdsafs 08/2023 2011 LAM CHIUE MONG JP
Unfortunately, I can only get the first row to look right. The others do not work. Here is what it looks like:
PS23456789 08/2023 2011 LAM CHIUE MONG JP
sdad08/2023 2011 LAM CHIUE MONG JP
xvczxcssf08/2023 2011 LAM CHIUE MONG JP
4243208/2023 2011 LAM CHIUE MONG JP
fdsafs08/2023 2011 LAM CHIUE MONG JP
This is my code:
Dim splitFile = File.ReadAllText(Result).Split(",")
For count = 0 To splitFile.Length - 1
While (splitFile(count).Length < 20)
splitFile(count) = splitFile(count) + " "
End While
totalFile += splitFile(count)
Next
My.Computer.FileSystem.WriteAllText("C:test.txt", totalFile, False)
How can I fix this?

This should do as you want:
''' <summary>
''' Removes delimiters from a file and makes all but the last column a fixed width.
''' </summary>
''' <param name="inputFilePath">
''' The path of the input file.
''' </param>
''' <param name="outputFilePath">
''' The path of the output file. Can be the same as the input file.
''' </param>
''' <param name="delimiter">
''' The field delimiter in the input file.
''' </param>
''' <param name="fieldWidth">
''' The width of the columns in the output file.
''' </param>
Private Sub MakeDelimitedFileFixedWidth(inputFilePath As String,
                                        outputFilePath As String,
                                        delimiter As String,
                                        fieldWidth As Integer)
    'The lines to write to the output file.
    Dim outputLines As New List(Of String)
    For Each line In File.ReadLines(inputFilePath)
        'Split the existing line on the delimiter.
        Dim fields = line.Split({delimiter}, StringSplitOptions.None)
        'Pad all but the last column to the specified width.
        For i = 0 To fields.Length - 2
            fields(i) = fields(i).PadRight(fieldWidth)
        Next
        outputLines.Add(String.Concat(fields))
    Next
    'Write out the processed data.
    File.WriteAllLines(outputFilePath, outputLines)
End Sub
That will not pad the last column. If you want that then change fields.Length - 2 to fields.Length - 1 or fields.GetUpperBound(0).
In your case, you would call it like so:
MakeDelimitedFileFixedWidth(Result, "C:test.txt", ",", 20)
While I haven't examined your existing code in detail, the issue is probably that you were reading the existing data as a single block, then splitting that on the delimiters. That would mean that you would have the last field from one line and the first field from the next line together as one value. That would play havoc with your field length calculations. The code I have provided reads the existing data line by line, so that issue doesn't exist.
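The effect described above is easy to reproduce. Here is a small Python sketch (not the asker's VB.NET code) using two of the sample rows; the width of 20 comes from the question, and the helper name to_fixed_width is made up for illustration.

```python
# Two of the sample rows from the question, as they would sit in the file.
text = "PS23456789,08/2023,2011,LAM CHIUE MONG JP\nsdad,08/2023,2011,LAM CHIUE MONG JP\n"

# Splitting the whole file on commas glues the last field of row 1 to the
# first field of row 2, because a newline separates them, not a comma.
whole_file_fields = text.split(",")
print(repr(whole_file_fields[3]))  # 'LAM CHIUE MONG JP\nsdad'


def to_fixed_width(line, width=20, delimiter=","):
    """Pad every field except the last to `width` characters (hypothetical helper)."""
    fields = line.split(delimiter)
    padded = [field.ljust(width) for field in fields[:-1]]
    return "".join(padded) + fields[-1]


# Splitting line by line keeps each row's fields separate.
fixed_lines = [to_fixed_width(line) for line in text.splitlines()]
for line in fixed_lines:
    print(line)
```

The glued value is 22 characters long, which is already over the 20-character limit, so the padding loop in the question never adds a space to it.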

I think that's because you are reading all the lines as a whole. Each line ends with a newline character (\n) rather than a comma, so the last field of one line is never split from the first field of the next line. The resulting combined string is longer than the length you have defined (i.e. 20), therefore no space is added. You should read the file line by line and process each line separately.
BTW, to add trailing spaces and get fixed-length strings, you may want to try String.Format with an alignment component instead of the padding loop. A negative width left-aligns the value and pads it on the right.
Example:
String.Format("{0,-20}", splitFile(count))

Related

pandas string manipulation with regular expressions

I have a pandas DataFrame containing a column of strings.
I want to get rid of the empty space at the beginning; the "1."; the 1st and 2nd numbers in a string; the / in the middle of a word; and the \ between words.
I could not remove the 2nd number, though.
How can I do it all in one go and also remove the second number in a string?
I could do it step by step (not sure if it is correct, but it works):
import pandas as pd
st = {'string': ['155555 11111 hhhh 15-0850tcx cord\\with plastic end / light mustard -82cm шнур нужд вес 07 кг',
                 ' 1. 06900000027899 non woven 12 grid socks']}
s = pd.DataFrame(st)
s['string'] = s['string'].str.replace(r'\d\.', '', regex=True)  # removes "1."
s['string'] = s['string'].str.replace(r"\\", " ", regex=True)  # replaces backslash with a space
s['string'] = s['string'].str.replace(r"\/", "", regex=True)  # removes forward slash
s['string'] = s['string'].str.replace(r"^\d*", "", regex=True)  # removes digits at the beginning of the string
s['string'] = s['string'].str.strip()  # removes leading and trailing spaces
An equivalent of your commands could be:
s['string'] = s['string'].str.replace(r'(^\s*\d+\.?\s*|\\|/)', '', regex=True)
Output:
string
0 11111 hhhh 15-0850tcx cordwith plastic end light mustard -82cm шнур нужд вес 07 кг
1 06900000027899 non woven 12 grid socks
regex demo
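If you don't have pandas at hand, the same pattern can be checked with Python's built-in re module. This is only a sketch: clean is an illustrative name, and the first sample string is shortened and written on one line.

```python
import re

# Same pattern as in the answer: a leading number (with an optional dot and
# surrounding whitespace), any backslash, or any forward slash.
PATTERN = re.compile(r'(^\s*\d+\.?\s*|\\|/)')


def clean(text):
    """Remove every match of PATTERN from text (illustrative helper)."""
    return PATTERN.sub('', text)


samples = [
    '155555 11111 hhhh 15-0850tcx cord\\with plastic end / light mustard',
    ' 1. 06900000027899 non woven 12 grid socks',
]
cleaned = [clean(s) for s in samples]
for row in cleaned:
    print(row)
```

Note that the pattern strips only the first number of each string. The second number (11111 in the first sample) is left in place, just as in the output shown above.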

How do I test if an enum is contained in a number?

In VB6 I combine several DT_DRAW_FLAG values like this:
Dim lMyFlags As DT_DRAW_FLAG
lMyFlags = DT_CENTER Or DT_VCENTER Or DT_RTLREADING
This would result in lMyFlags = 131077
Now to test if a certain flag is contained in such a combined flags value, I would do the following:
If (131077 And DT_RTLREADING) = DT_RTLREADING Then
    'DT_RTLREADING is contained.
Else
    'DT_RTLREADING is NOT contained.
End If
How would I do this in VB.NET?
Do I still have to use this "pure math" approach, or is there a method like...
lMyFlags.ContainsEnum(DT_RTLREADING)
... which I have not found yet?
Thank you!
If you have an enum declaration like this
<Flags()> Public Enum DT_DRAW_FLAG As Integer
DT_CENTER = 1
DT_VCENTER = 2
DT_RTLREADING = 4
End Enum
Then you can use HasFlag to do your logic
Dim lMyFlags As DT_DRAW_FLAG = DT_DRAW_FLAG.DT_CENTER Or DT_DRAW_FLAG.DT_RTLREADING
lMyFlags.HasFlag(DT_DRAW_FLAG.DT_RTLREADING) ' => true
The important point here is that the single enum values are powers of two.
The accepted answer is good; it works. In your case, your enum may well already have the Flags attribute, because 131077 is a combination of three powers of 2: 2^0=1, 2^2=4, and 2^17=131072
Your enum may look like this
<Flags>
Public Enum DT_DRAW_FLAG As Long
''' <summary>
''' 2 ^ 0
''' </summary>
DT_CENTER = 1
' 2 ^ 1 in here
''' <summary>
''' 2 ^ 2
''' </summary>
DT_VCENTER = 4
' 2 ^ 3 through 2 ^ 16 in here
''' <summary>
''' 2 ^ 17
''' </summary>
DT_RTLREADING = 131072
End Enum
The Flags Attribute
Indicates that an enumeration can be treated as a bit field; that is, a set of flags.
However, whether or not it has the Flags attribute, you can treat it the same way using bitwise And. I believe the HasFlag method is just shorthand for the bitwise logic:
' Bitwise logic
If (lMyFlags And DT_DRAW_FLAG.DT_RTLREADING) = DT_DRAW_FLAG.DT_RTLREADING Then
' Reading is contained
Else
' Reading is not contained
End If
' HasFlag
If lMyFlags.HasFlag(DT_DRAW_FLAG.DT_RTLREADING) Then
' Reading is contained
Else
' Reading is not contained
End If
It is certainly less code.
Note, you don't combine enums in the way that you have shown, without some additional conversion. Use bitwise Or to do that
Dim lMyFlags = DT_DRAW_FLAG.DT_CENTER Or DT_DRAW_FLAG.DT_VCENTER Or DT_DRAW_FLAG.DT_RTLREADING
Also, you can use Enum.HasFlag regardless of whether you used the Flags Attribute. As far as I know, the attribute is just used to signal to the consumer that the values are distinct powers of two, and bitwise logic can be performed on them. There is nothing strictly going on with the flags attribute so there's some trust in the consumer to know what to do with it (and we assume the original author knew about it, too)
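The arithmetic behind the flag test can also be worked through outside VB. The sketch below uses Python's enum.IntFlag purely as an illustration, with the three values quoted above (1, 4 and 131072); the class name DrawFlag is made up.

```python
from enum import IntFlag


class DrawFlag(IntFlag):
    """Illustrative stand-in for DT_DRAW_FLAG; each value is a distinct power of two."""
    CENTER = 1           # 2 ** 0
    VCENTER = 4          # 2 ** 2
    RTLREADING = 131072  # 2 ** 17


# Combining flags with bitwise OR sets one bit per flag.
combined = DrawFlag.CENTER | DrawFlag.VCENTER | DrawFlag.RTLREADING
print(int(combined))  # 131077

# A flag is contained when AND-ing it with the combination gives back the flag itself.
has_rtl = (combined & DrawFlag.RTLREADING) == DrawFlag.RTLREADING
has_rtl_in_center_only = (DrawFlag.CENTER & DrawFlag.RTLREADING) == DrawFlag.RTLREADING
print(has_rtl, has_rtl_in_center_only)  # True False
```

Because every flag occupies its own bit, 1 + 4 + 131072 = 131077 can be taken apart again without ambiguity, which is what both the bitwise test and HasFlag rely on.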

Cycling through two csv's in Python/Pandas same code work with only one of the files

I wrote some code that takes the company number field from a "company partners" file and compares it with a file listing the companies of a specific state of the country, writing the result to a third file (final result: the partners of all the companies of that state).
The code is simple:
import pandas as pd
dsocio = pd.read_csv('D:/CNPJ-full-master/cnpj-csv/socios.csv', chunksize=262144, low_memory=False)
duf = pd.read_csv('D:/pyData/receita/empES.csv', usecols = ['cnpj'], low_memory=False)
for chunk in dsocio:
    result = chunk[chunk['cnpj'].isin(duf.cnpj)]
    result.to_csv('D:/CNPJ-full-master/cnpj-csv/UFs/socioES.csv', index=False, header=True, mode='a')
The problem is that I have two versions of the "empES.csv" file. They have a different number of columns, but both have the field 'cnpj' as the first column, and this is the only field I need. When I run the code with the version 1 file, it runs perfectly. But when I use the version 2 file instead, my output file is populated with only the header. Many lines with the header!
Here are some snippets of the first lines of:
The partners file (socios.csv, the one I will copy the matching lines from):
'''
"cnpj","tipo_socio","nome_socio","cnpj_cpf_socio","cod_qualificacao","perc_capital","data_entrada","cod_pais_ext","nome_pais_ext","cpf_repres","nome_repres","cod_qualif_repres"
"00000000000191","2","MARCIO HAMILTON FERREIRA","*923641","10",0.0,"20101117","","","","","00"
"00000000000191","2","NILSON MARTINIANO MOREIRA","*491386","10",0.0,"20101117","","","","","00"
"00000000002135","2","DEBORA CRISTINA FONSECA","*314628","08",0.0,"20200312","","","","","00"
"00000000002216","2","WALDERY RODRIGUES JUNIOR","*025913","08",0.0,"20200312","","","","","00"
"00000000002216","2","ERIK DA COSTA BREYER","*093217","10",0.0,"20191209","","","","","00"
"00000000002216","2","THOMPSON SOARES PEREIRA CESAR","*503187","10",0.0,"20191209","","","","","00"
"00000000002569","2","WALTER MALIENI JUNIOR","*718468","10",0.0,"20101117","","","","","00"
"00000000002569","2","NILSON MARTINIANO MOREIRA","*491386","10",0.0,"20101117","","","","","00"
"00000000002640","2","WALDERY RODRIGUES JUNIOR","*025913","08",0.0,"20200312","","","","","00"
'''
The working companies file (empES.csv), from which I read only the 'cnpj' field:
'''
cnpj,identificador_matriz_filial,razao_social,nome_fantasia,situacao_cadastral,data_situacao_cadastral,motivo_situacao_cadastral,nome_cidade_exterior,codigo_natureza_juridica,data_inicio_atividade,cnae_fiscal,descricao_tipo_logradouro,logradouro,numero,complemento,bairro,cep,uf,codigo_municipio,municipio,ddd_telefone_1,ddd_telefone_2,ddd_fax,qualificacao_do_responsavel,capital_social,porte,opcao_pelo_simples,data_opcao_pelo_simples,data_exclusao_do_simples,opcao_pelo_mei,situacao_especial,data_situacao_especial
2135,2,BANCO DO BRASIL SA,VITORIA - ES,2,2005-11-03,0,,2038,1966-08-01,6421200,PRACA,PIO XII,30,,CENTRO,29010340.0,ES,5705,VITORIA,,,,10,0.0,5,0,,,0,,
8338,2,BANCO DO BRASIL SA,CACHOEIRO DE ITAPEMIRIM-ES-EST UNIF,2,2005-11-03,0,,2038,1966-08-01,6421200,PRACA,JERONIMO MONTEIRO,26,,CENTRO,29300902.0,ES,5623,CACHOEIRO DE ITAPEMIRIM,,,,10,0.0,5,0,,,0,,
11207,2,BANCO DO BRASIL SA,COLATINA-ES-EST.UNIF,2,2005-11-03,0,,2038,1966-08-01,6421200,RUA,EXPED ABILIO DOS SANTOS,124,,CENTRO,29700070.0,ES,5629,COLATINA,,,,10,0.0,5,0,,,0,,
18643,2,BANCO DO BRASIL SA,,2,2005-11-03,0,,2038,1966-08-01,6421200,RUA,PRESIDENTE VARGAS,29,,CENTRO,29400000.0,ES,5667,MIMOSO DO SUL,,,,10,0.0,5,0,,,0,,
19615,2,BANCO DO BRASIL SA,,2,2005-11-03,0,,2038,1982-05-04,6421200,AVENIDA,SENADOR EURICO RESENDE,994,,CENTRO,29845000.0,ES,5619,BOA ESPERANCA,,,,10,0.0,5,0,,,0,,
20974,2,BANCO DO BRASIL SA,SANTA TERESA ES-EST UNIF,2,2005-11-03,0,,2038,1966-08-01,6421200,RUA,JERONIMO VERVLOET,178,,CENTRO,29650000.0,ES,5691,SANTA TERESA,,,,10,0.0,5,0,,,0,,
'''
The new companies file (empES.csv), which gives me the weird behavior:
'''
cnpj,matriz_filial,razao_social,nome_fantasia,situacao,data_situacao,motivo_situacao,nm_cidade_exterior,cod_pais,nome_pais,cod_nat_juridica,data_inicio_ativ,cnae_fiscal,tipo_logradouro,logradouro,numero,complemento,bairro,cep,uf,cod_municipio,municipio,ddd_1,telefone_1,ddd_2,telefone_2,ddd_fax,num_fax,email,qualif_resp,capital_social,porte,opc_simples,data_opc_simples,data_exc_simples,opc_mei,sit_especial,data_sit_especial
2135,2,BANCO DO BRASIL SA,VITORIA - ES,2,20051103,0,,,,2038,19660801,6421200,PRACA,PIO XII,30,,CENTRO,29010340.0,ES,5705,VITORIA,,,,,,,AGE0021#BB.COM.BR,10,0.0,5,0,,,,,
8338,2,BANCO DO BRASIL SA,CACHOEIRO DE ITAPEMIRIM-ES-EST UNIF,2,20051103,0,,,,2038,19660801,6421200,PRACA,JERONIMO MONTEIRO,26,,CENTRO,29300902.0,ES,5623,CACHOEIRO DE ITAPEMIRIM,,,,,,,,10,0.0,5,0,,,,,
11207,2,BANCO DO BRASIL SA,COLATINA-ES-EST.UNIF,2,20051103,0,,,,2038,19660801,6421200,RUA,EXPED ABILIO DOS SANTOS,124,,CENTRO,29700070.0,ES,5629,COLATINA,,,,,,,,10,0.0,5,0,,,,,
18643,2,BANCO DO BRASIL SA,,2,20051103,0,,,,2038,19660801,6421200,RUA,PRESIDENTE VARGAS,29,,CENTRO,29400000.0,ES,5667,MIMOSO DO SUL,,,,,,,,10,0.0,5,0,,,,,
19615,2,BANCO DO BRASIL SA,,2,20051103,0,,,,2038,19820504,6421200,AVENIDA,SENADOR EURICO RESENDE,994,,CENTRO,29845000.0,ES,5619,BOA ESPERANCA,,,,,,,,10,0.0,5,0,,,,,
20974,2,BANCO DO BRASIL SA,SANTA TERESA ES-EST UNIF,2,20051103,0,,,,2038,19660801,6421200,RUA,JERONIMO VERVLOET,178,,CENTRO,29650000.0,ES,5691,SANTA TERESA,,,,,,,,10,0.0,5,0,,,,,
22241,2,BANCO DO BRASIL SA,SAO MATEUS ES EST UNIF,2,20051103,0,,,,2038,19660801,6421200,AVENIDA,JONES DOS SANTOS NEVES,324,,CENTRO,29930010.0,ES,5697,SAO MATEUS,,,,,,,,10,0.0,5,0,,,,,
28100,2,BANCO DO BRASIL SA,,2,20051103,0,,,,2038,19660801,6421200,AVENIDA,JERONIMO MONTEIRO,38/46,,CENTRO,29500000.0,ES,5603,ALEGRE,,,,,,,,10,0.0,5,0,,,,,
37001,2,BANCO DO BRASIL SA,,2,20051103,0,,,,2038,19660801,6421200,RUA,DEMERVAL AMARAL,35,,CENTRO,29560000.0,ES,5645,GUACUI,,,,,,,,10,0.0,5,0,,,,,
'''
Here's a sample of the output when I pass the first empES.csv file:
'''
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
2135,2,WALDERY RODRIGUES JUNIOR,*025913,8,0.0,20200312,,,,,0
2135,2,ERIK DA COSTA BREYER,*093217,10,0.0,20191209,,,,,0
2135,2,THOMPSON SOARES PEREIRA CESAR,*503187,10,0.0,20191209,,,,,0
2135,2,MAURICIO NOGUEIRA,*894537,10,0.0,20191209,,,,,0
2135,2,DANIEL ANDRE STIELER,*145110,10,0.0,20190910,,,,,0
2135,2,ENIO MATHIAS FERREIRA,*078106,10,0.0,20181107,,,,,0
2135,2,RONALDO SIMON FERREIRA,*685018,10,0.0,20190729,,,,,0
2135,2,IVANDRE MONTIEL DA SILVA,*975660,10,0.0,20190403,,,,,0
2135,2,FABIO AUGUSTO CANTIZANI BARBOSA,*379967,10,0.0,20190403,,,,,0
2135,2,CARLOS MOTTA DOS SANTOS,*876287,10,0.0,20190403,,,,,0
2135,2,CAMILO BUZZI,*569178,10,0.0,20190403,,,,,0
'''
And here's what happens when I try to use the other "empES.csv" file:
'''
j,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
cnpj,tipo_socio,nome_socio,cnpj_cpf_socio,cod_qualificacao,perc_capital,data_entrada,cod_pais_ext,nome_pais_ext,cpf_repres,nome_repres,cod_qualif_repres
'''
...and goes like this forever.
I have no clue why the first one goes fine through the code and why the second gives that output, it's like the .isin is not iterating in that case!
Any thoughts?
ps: All the data presented here is public domain from Brazil's government.
Well, at the end it was a column with bad values. Basically I exported a file with only the 'cnpj' column:
import pandas as pd
duf = pd.read_csv('D:/CNPJ-full-master/cnpj-csv/UFs/empES.csv', usecols = ['cnpj'], low_memory=False)
duf.to_csv('D:/CNPJ-full-master/cnpj-csv/UFs/empES-cnpj.csv', index=False)
Then I looked at it in Notepad++. I saw there was a line in the middle of it with 'cnpj' again, instead of a value. Then I searched for it and found 200 more lines with the same 'cnpj' in place of values. Well, in approximately 900,000 lines, 200 is not a lot, so I just removed them, and it finally works.
Anyway, although the problem is fixed, I don't know why a non-numeric value broke the code that way. It must have something to do with the fact that the string value is the same as the column name.
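One plausible explanation, which the thread does not confirm: a stray 'cnpj' cell forces the whole key column to be read as text, while the partners file's cnpj values are read as integers. An integer never equals a string, so isin matches nothing, every chunk produces an empty result, and to_csv with header=True and mode='a' appends only the header each time. The sketch below shows the mismatch and one way to filter such rows, using only the standard library; load_company_ids and the miniature sample are made up for illustration.

```python
import csv
import io


def load_company_ids(handle, column="cnpj"):
    """Return (ids, skipped): integer IDs from `column` plus a count of non-numeric cells."""
    ids = set()
    skipped = 0
    for row in csv.DictReader(handle):
        value = (row.get(column) or "").strip()
        if value.isdigit():
            ids.add(int(value))
        else:
            skipped += 1  # for example a repeated header line, where the cell is "cnpj"
    return ids, skipped


# Made-up miniature of a companies file with a stray header line in the middle.
sample = io.StringIO("cnpj,uf\n2135,ES\ncnpj,uf\n8338,ES\n")
ids, skipped = load_company_ids(sample)

# If the column had been kept as text, an integer key would never match it.
as_text = {"2135", "cnpj", "8338"}
print(2135 in as_text, 2135 in ids, skipped)  # False True 1
```

Counting the skipped cells instead of dropping them silently makes it obvious when a file contains repeated header lines like the 200 found here.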

Find Each Occurrence of X and Insert a Carriage Return

A colleague has some data he is putting into a flat file (.txt) and needs to insert a carriage return before EACH occurrence of 'POL01', 'SUB01','VEH01','MCO01'.
I did use:
For Each line1 As String In System.IO.File.ReadAllLines(BodyFileLoc)
    If line1.Contains("POL01") Or line1.Contains("SUB01") Or line1.Contains("VEH01") Or line1.Contains("MCO01") Then
        Writer.WriteLine(Environment.NewLine & line1)
    Else
        Writer.WriteLine(line1)
    End If
Next
But unfortunately it turns out that the file is not formatted in 'lines' by SSIS but as one whole string.
How can I insert a carriage return before every occurrence of the above?
Test Text
POL01CALT302276F 332 NBPM 00101 20151113201511130001201611132359 2015111300010020151113000100SUB01CALT302276F 332 NBPMP01 Akl Abi-Khalil 19670131 M U33 Stoford Close SW19 6TJ 2015111300010020151113000100VEH01CALT302276F 332 NBPM001LV56 LEJ N 2006VAUXHALL CA 2015111300010020151113000100MCO01CALT302276F 332 NBPM0101 0 2015111300010020151113000100POL01CALT742569N
You can use regular expressions for this, specifically by using Regex.Replace to find and replace each occurrence of the strings you're looking for with a newline followed by the matching text:
'Requires Imports System.Text.RegularExpressions at the top of the file.
Dim str As String = "xxxPOL01xxxSUB01xxxVEH01xxxMCO01xxx"
Dim output As String = Regex.Replace(str, "((?:POL|SUB|VEH|MCO)01)", Environment.NewLine + "$1")
'output contains:
'xxx
'POL01xxx
'SUB01xxx
'VEH01xxx
'MCO01xxx
There may be a better way to construct this regular expression, but this is a simple alternation on the different letters, followed by 01. This matched text is represented by the $1 in the replacement string.
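For comparison, here is the same replacement as a Python sketch with the built-in re module. The record string is a shortened, made-up stand-in for the test text, and Python writes the backreference as \1 where .NET uses $1.

```python
import re

# Shortened, made-up record in the same shape as the test text.
record = "POL01aaaSUB01bbbVEH01cccMCO01dddPOL01eee"

# Put a newline in front of every marker; \1 is the marker that was matched.
marked = re.sub(r"((?:POL|SUB|VEH|MCO)01)", r"\n\1", record)

# Drop the empty first piece produced by the newline before the very first marker.
lines = marked.lstrip("\n").split("\n")
print(lines)
```

When the data starts with a marker, as the test text does, the replacement leaves a leading newline, so you may want to trim it before writing the file.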
If you're new to regular expressions, there are a number of tools that help you understand them - for example, regex101.com will show you an explanation of the one I have used here.

Best Way To Find Pattern 6 digit space 7 digit ###### ####### with VB.NET

I am parsing a text file in VB.NET and need to locate the latitude and longitude in these two sections of text. The pattern is 6 digits, a space, then 7 digits (364800 0953600). The samples are from two different map files and have slightly different formats.
I 2H02 364800 0953600 ' SEC72 10496300-
I 2H05 360100 0953645 ' ZFW J602 ZME 2A93 10496400-
I 2H06 361215 0952400 ' SEC72 ZME 2A75 10496500-
I 2H07 361715 0951145 ' SEC27/72 ZME 2A78 10496600-
I 2H08 362025 0950100 ' TUL ZME 2A69 10496700-
I 2H10 360800 0952915 ' ZME 2A85 10496800-
I 2H11 362500 0955015 ' SEC62/72 10496900-
I 2H14 364145 0954315 ' TUL 10497000-
I A85A 'AL851 50591 REF 33393944
391500 0831100 ' 50591 REF 33393945
I A85B 'AL851 50591 REF 33393946
374500 0825700 ' 50591 REF 33393947
I A87A 'AL871 111592 REF 33393948
402050 0814420 ' 111592 REF 33393949
I A87B 'AL871 111592 REF 33393950
400449 0814400 ' 111592 REF 33393951
I A87C 'AL872 '030394 GDK 33393952
392000 0810000 ' '030394 GDK 33393953
Thanks,
Dave
Dim matches As MatchCollection
Dim regex As New Regex("\d{6} \d{7}")
matches = regex.Matches(your_text_string)
A simple regex should do it:
[0-9]{6} [0-9]{7}
.....
(?<First>\d{6})\s(?<Second>\d{7})
Do a simple group capture. It appears your regex will be simple enough to handle both scenarios (be a little loose on the space detection). Then you can access the groups of the match (either by name or by index) and get the data you need.
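As a rough illustration of the named-group idea, here is a Python sketch run against three of the sample lines. Python spells named groups (?P<name>...) where .NET uses (?<name>...). The group names lat and lon, the \b anchors and the [ \t]+ separator are assumptions added here, not part of the answers above.

```python
import re

# \b keeps the pattern from matching part of a longer digit run;
# [ \t]+ tolerates more than one space or tab between the two numbers.
COORD = re.compile(r"\b(?P<lat>\d{6})[ \t]+(?P<lon>\d{7})\b")

sample = """I 2H02 364800 0953600 ' SEC72 10496300-
I A85A 'AL851 50591 REF 33393944
391500 0831100 ' 50591 REF 33393945"""

# Collect each latitude/longitude pair by group name.
pairs = [(m.group("lat"), m.group("lon")) for m in COORD.finditer(sample)]
print(pairs)
```

The 8-digit numbers at the end of each line are skipped because of the word-boundary anchors; without them, a longer digit run could be matched in part.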