This is my sample file
#%cty_id1,#%ccy_id2,#%cty_src,#%cty_cd3,#%cty_nm4,#%cty_reg5,#%cty_natnl6,#%cty_bus7,#%cty_data8
690,ALL2,,AL,ALBALODMNIA,,,,
90,ALL2,,,AQ,AKNTARLDKCTICA,,,
161,IDR2,,AZ,AZLKFMERBALFKIJAN,,,,
252,LTL2,,BJ,BENLFMIN,,,,
206,CVE2,,BL,SAILFKNT BAFSDRTHLEMY,,,,
360,,,BW2,BOPSLFTSWLSOANA,,,,
The problem: #%cty_cd3 is a standard NOT NULL column that must contain exactly two letters, but in SQL Server the record shifts into the next column (because of an extra comma in between). How do I validate the CSV file to make sure the two-character code always sits in the fourth column?
There are around 10,000 records.
Set of rules defined:
Each row should have a standard number of delimiters.
If not, check whether any NOT NULL column holds a null value.
If a null is found, remove the delimiter at that position
(at the moment the three commas ,,, are not being replaced with two ,,).
UPDATED: Can this be done using a script?
UPDATE 2: I only need a function that operates on records like
90,ALL2,,,AQ,AKNTARLDKCTICA,,,
corrects them using a regex (or any other method), and puts them back into the source file.
Your best bet here may be to use the tSchemaComplianceCheck component in Talend.
If you read the file in with a tFileInputDelimited component and then check it with a tSchemaComplianceCheck where cty_cd is set to not nullable, it will reject your Antarctica row simply because there is a null where you expect none.
From here you can use a tMap and simply map each field to the one above it.
You should be able to tweak this as necessary, potentially with further tSchemaComplianceCheck components down the reject lines and mappings to suit. This method is a lot more self-explanatory, and you don't have to deal with complicated regexes that need careful management whenever you want to accommodate different variations of your file structure, with the benefit that you will always capture all of the well-formatted rows.
You could try deleting the empty field in column 4 whenever column 4 is not a two-character field, as follows:
awk 'BEGIN {FS=OFS=","}
{
    for (i=1; i<=NF; i++) {
        if (!(i==4 && length($4)!=4))
            printf "%s%s", $i, (i<NF) ? OFS : ORS
    }
}' file.csv
Output:
"id","cty_ccy_id","cty_src","cty_nm","cty_region","cty_natnl","cty_bus_load","cty_data_load"
6,"ALL",,"AL","ALBANIA",,,,
9,"ALL",,"AQ","ANTARCTICA",,,
16,"IDR",,"AZ","AZERBAIJAN",,,,
25,"LTL",,"BJ","BENIN",,,,
26,"CVE",,"BL","SAINT BARTH�LEMY",,,,
36,,,"BW","BOTSWANA",,,,
41,"BNS",,"CF","CENTRAL AFRICAN REPUBLIC",,,,
47,"CVE",,"CL","CHILE",,,,
50,"IDR",,"CO","COLOMBIA",,,,
61,"BNS",,"DK","DENMARK",,,,
Note:
We use length($4)!=4 since we expect two characters in column 4, plus two extra characters for the surrounding double quotes.
The solution is to use a look-ahead regex, as suggested before. To reproduce your issue I used this:
"\\,\\,\\,(?=\\\"[A-Z]{2}\\\")"
which matches three commas followed by two quoted uppercase letters, without including the letters in the match. Of course you may need to adjust it a bit for your needs (e.g. an arbitrary number of commas rather than exactly three).
But you cannot use it in Talend directly without tons of errors. Here's how to design your job:
In other words, you need to read the file line by line, no fields yet. Then, inside the tMap, do the match&replace, like:
row1.line.replaceAll("\\,\\,\\,(?=\\\"[A-Z]{2}\\\")", ",,")
and finally tokenize the line using "," as separator to get your final schema. You probably need to manually trim out the quotes here and there, since tExtractDelimitedFields won't.
Here's an output example (it needs some cleaning, of course):
You don't need to enter the schema for tExtractDelimitedFields by hand. Use the wizard to record a Delimited File schema in the metadata repository, as you probably already did. You can use this schema as a Generic Schema too, fitting it to the outgoing connection of tExtractDelimitedFields. Not something the purists would approve of, but it works and saves time.
About your UI problems, they are often related to file encodings and locale settings. Don't worry too much, they (usually) won't affect the job execution.
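For reference, outside of Talend the whole transformation looks like this in plain Java. It is only a minimal sketch: it assumes quoted two-letter codes exactly as the regex above does (the sample line is a quoted variant of the Antarctica record), and the quote stripping is the manual trimming mentioned earlier:

public class FixLine {
    public static void main(String[] args) {
        String line = "90,\"ALL\",,,\"AQ\",\"ANTARCTICA\",,,";
        // same look-ahead as above, without the extra escaping Talend needs
        String fixed = line.replaceAll(",,,(?=\"[A-Z]{2}\")", ",,");
        String[] fields = fixed.split(",", -1);        // -1 keeps the trailing empty fields
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].replace("\"", "");   // manual quote trimming
        }
        System.out.println(String.join("|", fields));  // 90|ALL||AQ|ANTARCTICA|||
    }
}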
EDIT: here's a sample TOS job which shows the solution, just import in your project: TOS job archive
EDIT2: added some screenshots
Coming to the party late with a VBA-based approach. An alternative to regex is to parse the file and remove a comma when the 4th field is empty. Using the Microsoft Scripting Runtime, the code opens the file and reads each line, copying it to a new temporary file. If the 4th element is empty, it writes the line with the extra comma removed. The cleaned data is then copied back to the original file and the temporary file is deleted. It seems a bit of a long way round, but when I tested it on a file of 14,000 rows based on your sample it took under 2 seconds to complete.
Sub Remove4thFieldIfEmpty()
    Const iNUMBER_OF_FIELDS As Integer = 9

    Dim str As String
    Dim fileHandleInput As Scripting.TextStream
    Dim fileHandleCleaned As Scripting.TextStream
    Dim fsoObject As Scripting.FileSystemObject
    Dim sPath As String
    Dim sFilenameCleaned As String
    Dim sFilenameInput As String
    Dim vFields As Variant
    Dim iCounter As Integer
    Dim sNewString As String

    sFilenameInput = "Regex.CSV"
    sFilenameCleaned = "Cleaned.CSV"

    Set fsoObject = New FileSystemObject
    sPath = ThisWorkbook.Path & "\"

    Set fileHandleInput = fsoObject.OpenTextFile(sPath & sFilenameInput)
    If fsoObject.FileExists(sPath & sFilenameCleaned) Then
        Set fileHandleCleaned = fsoObject.OpenTextFile(sPath & sFilenameCleaned, ForWriting)
    Else
        Set fileHandleCleaned = fsoObject.CreateTextFile((sPath & sFilenameCleaned), True)
    End If

    ' Copy each line to the temporary file, dropping the empty 4th field where present
    Do While Not fileHandleInput.AtEndOfStream
        str = fileHandleInput.ReadLine
        vFields = Split(str, ",")
        If vFields(3) = "" Then
            sNewString = vFields(0)
            For iCounter = 1 To UBound(vFields)
                If iCounter <> 3 Then sNewString = sNewString & "," & vFields(iCounter)
            Next iCounter
            str = sNewString
        End If
        fileHandleCleaned.WriteLine str
    Loop

    fileHandleInput.Close
    fileHandleCleaned.Close

    ' Copy the cleaned data back over the original file
    Set fileHandleInput = fsoObject.OpenTextFile(sPath & sFilenameInput, ForWriting)
    Set fileHandleCleaned = fsoObject.OpenTextFile(sPath & sFilenameCleaned)
    Do While Not fileHandleCleaned.AtEndOfStream
        fileHandleInput.WriteLine fileHandleCleaned.ReadLine
    Loop
    fileHandleInput.Close
    fileHandleCleaned.Close

    Set fileHandleCleaned = Nothing
    Set fileHandleInput = Nothing

    ' Delete the temporary file
    fsoObject.DeleteFile sPath & sFilenameCleaned
    Set fsoObject = Nothing
End Sub
If that's the only problem (and if you never have a comma in the field bt_cty_ccy_id), then you could remove such an extra comma by loading your file into an editor that supports regexes and have it replace
^([^,]*,[^,]*,[^,]*,),(?="[A-Z]{2}")
with \1.
I would question the source system that is sending you this file as to why there is an extra comma in between for some rows. I guess you are using a comma as the delimiter when importing this .csv file into Talend.
(Another suggestion would be to ask for a semicolon as the column separator in the input file.)
9,"ALL",,,"AQ","ANTARCTICA",,,,
will be
9;"ALL";,;"AQ";"ANTARCTICA";;;;
Related
I have a whole bunch of text files exported from Photoshop that I need to import into an Excel document. I wrote a macro to get the job done, and it seemed to work just fine for my test document, but when I tried loading in some of the actual files produced by Photoshop, Excel started putting all the data in a separate column except for the first line.
My code that reads the text file:
Open currentDocPath For Input As stream
Do Until EOF(stream)
    Input #stream, currentLine
    columnContents = Split(currentLine, vbTab)
    For n = 0 To UBound(columnContents)
        ActiveSheet.Cells(row, Chr(64 + colum + n)).Value = columnContents(n)
    Next n
    row = row + 1
Loop
Close stream
The text files I am reading look like this, only with much more data:
"Name" "Data" "Info" "blah"
"Name1" "Data1" "Info1" "blah1"
"Name2" "Data2" "Info2" "blah2"
The problem seemed pretty trivial, but when I load it into Excel, instead of looking like it does above, it looks like this:
ÿþ"Name" "Data" "Info" "blah"
Name1
Data1
Info1
blah1
Name2
Data2
Info2
blah2
Now I am not sure why this is happening. It seems like the first two characters in the first row are there because those bytes declare the text encoding. Somehow those characters keep the first row formatted correctly while the remaining rows lose their quotation marks and all get moved to new lines.
Could someone who understands UCS-2 Little Endian text encoding explain how I can work around this? When I convert the files to ASCII it works fine.
Cheers!
edit: Okay so I understand now that the encoding is UTF-16 (I don't know a whole lot about character encoding). My main issue is that it's formatting strangely and I don't understand why or how to fix it. Thanks!
As I mentioned in my comment, it appears the file you're trying to import is encoded in UTF-16.
In this vbaexpress.com article, someone suggested that the following should work:
Dim GetOpenFile As String
Dim MyData As String
Dim r As Long

GetOpenFile = Application.GetOpenFilename
r = 1

Open GetOpenFile For Input As #1
Do While Not EOF(1)
    Line Input #1, MyData
    Cells(r, 1).Value = MyData
    r = r + 1
Loop
Close #1
Obviously I can't test it myself, but maybe it'll help you.
Why not just tell Excel to import the file? MS has probably put hundreds of thousands of person-hours into that code. Record the importation to get easy code.
Remember, Excel is a tool for non-programmers to do programming things. Use it instead of trying to replace it.
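Recording a simple import of one of these files gives you something along these lines. This is only a rough sketch with a made-up path; Excel's own importer handles the tab-delimited, quoted layout, and it normally recognises the Unicode byte-order mark by itself:

Sub ImportPhotoshopExport()
    ' Roughly what the macro recorder produces when opening a tab-delimited text file
    Workbooks.OpenText Filename:="C:\exports\layers.txt", _
                       DataType:=xlDelimited, _
                       TextQualifier:=xlTextQualifierDoubleQuote, _
                       Tab:=True
End Sub

Note that Workbooks.OpenText opens the text file as a new workbook, so copy the resulting sheet across if you need the data in your own workbook.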
These are the replacement file functions that you use for new code. Add a reference to Microsoft Scripting Runtime.
Opens a specified file and returns a TextStream object that can be used to read from, write to, or append to the file.
object.OpenTextFile(filename[, iomode[, create[, format]]])
Arguments
object
Required. Object is always the name of a FileSystemObject.
filename
Required. String expression that identifies the file to open.
iomode
Optional. Can be one of three constants: ForReading, ForWriting, or ForAppending.
create
Optional. Boolean value that indicates whether a new file can be created if the specified filename doesn't exist. The value is True if a new file is created, False if it isn't created. If omitted, a new file isn't created.
format
Optional. One of three Tristate values used to indicate the format of the opened file. If omitted, the file is opened as ASCII.
The format argument can have any of the following settings:
Constant             Value   Description
TristateUseDefault     -2    Opens the file using the system default.
TristateTrue           -1    Opens the file as Unicode.
TristateFalse           0    Opens the file as ASCII.
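If you would rather keep your own reading loop, the format argument above is the piece you were missing. A minimal sketch, assuming a tab-delimited UTF-16 file like the one in the question (the path is a placeholder, and it needs a reference to Microsoft Scripting Runtime):

Sub ReadUnicodeExport()
    Dim fso As New Scripting.FileSystemObject
    Dim ts As Scripting.TextStream
    Dim parts As Variant
    Dim r As Long, c As Long

    ' TristateTrue opens the file as Unicode, so the BOM bytes and the odd line breaks go away
    Set ts = fso.OpenTextFile("C:\exports\layers.txt", ForReading, False, TristateTrue)
    r = 1
    Do While Not ts.AtEndOfStream
        parts = Split(ts.ReadLine, vbTab)       ' split on tabs instead of using Input #
        For c = 0 To UBound(parts)
            ActiveSheet.Cells(r, c + 1).Value = Replace(parts(c), """", "")   ' strip the quotes
        Next c
        r = r + 1
    Loop
    ts.Close
End Sub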
At the company I work at, I have a piece of software that I am developing in VB.NET. This software uses a web browser control to load an Excel file that the employee can modify. It then saves a copy of the Excel file as an Excel file for future modification, saves it as a PDF file to send to the customer, and then prints the first page twice. I am trying to create a quote list. Quote file names are structured as follows...
12345 My Company Name Here 10-25-2013.pdf
Is there any way to "extract" just the "My Company Name Here" in the above example. I tried removing all numbers, and then the - and .pdf from the string, but it actually makes it where fewer results appear in the list view control. Any Ideas?
Dim di As New IO.DirectoryInfo("Z:\Quotes\" & Today.Year & "\" & Today.Month _
    & " " & MonthName(Today.Month))
Dim diar1 As IO.FileInfo() = di.GetFiles("*.pdf")
Dim dra As IO.FileInfo

ListView1.View = View.Details
ListView1.Columns.Clear()
ListView1.Columns.Add("Quote Number")
ListView1.Columns.Add("Customer Name")
ListView1.Columns(0).Width = -2
ListView1.Columns(1).Width = -2

For Each dra In diar1
    If dra.ToString.Contains("Product") = False Or dra.ToString.Contains("Thumbs.db") Then
        Dim newIrm() = dra.ToString.Split(" ")
        Dim NumericCharacters As New System.Text.RegularExpressions.Regex("\d")
        Dim nonNumericOnlyString As String = NumericCharacters.Replace(newIrm(2), String.Empty)
        ListView1.Items.Add(New ListViewItem({newIrm(0), newIrm(1) & newIrm(2)}))
    End If
Next
Filename Format:
Z:\Quotes\2013\10 October\12345-RR My Company Name Here 10-25-2013.pdf
By assuming that the company name is always surrounded by blank spaces and that all the surrounding text does not contain any, you can use IndexOf and LastIndexOf. Sample code:
Dim input As String = "Z:\Quotes\2013\10 October\12345-RR My Company Name Here 10-25-2013.pdf"
Dim companyName As String = System.IO.Path.GetFileNameWithoutExtension(input)
companyName = companyName.Substring(companyName.IndexOf(" "), companyName.LastIndexOf(" ") - companyName.IndexOf(" ")).Trim()
If these conditions do not fully apply, you would have to describe the constraints clearly so the code can be updated. Without systematically applied constraints, there is no way to deliver an accurate solution to this problem.
The postfix (the date plus .pdf) is a constant size, assuming your date format uses leading zeros.
The prefix is a variable size; however, the first space in the complete file name always comes right before the first character of the company name.
Using these two facts, you can easily find the indexes of the first and last characters of the company name and "extract" it using that information.
Alternatively, you can split the file name into an array using the space as your delimiter. You can then grab every element of the array, excluding the first and last, and combine these elements separated by a space.
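A minimal sketch of that split-based alternative, using the example path from the question:

Dim input As String = "Z:\Quotes\2013\10 October\12345-RR My Company Name Here 10-25-2013.pdf"
Dim parts() As String = System.IO.Path.GetFileNameWithoutExtension(input).Split(" "c)

' Drop the first element (the quote number) and the last (the date),
' then rejoin the middle elements with spaces.
Dim companyName As String = String.Join(" ", parts, 1, parts.Length - 2)
' companyName is now "My Company Name Here"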
I am not a VBA programmer. However, I have the 'unpleasant' task of re-implementing someone's VBA code in another language. The VBA code consists of 75 modules which use one massive 'calculation sheet' to store all 'global variables'. So instead of using descriptive variable names, it often uses:
= Worksheets("bla").Cells(100, 75).Value
or
Worksheets("bla").Cells(100, 75).Value =
To make things worse, the 'calculation sheet' also contains some formulas.
Are there any (free) tools which allow you to reverse engineer such code (e.g. create Nassi–Shneiderman diagram, flowcharts)? Thanks.
I think #JulianKnight 's suggestion should work
Building on this, you could:
Copy all the code to a text editor capable of RegEx search/replace (Eg. Notepad++).
Then use the RegEx search/Replace with a search query like:
Worksheets\(\"Bla\"\).Cells\((\d*), (\d*)\).Value
And replace with:
Var_\1_\2
This will convert all the sheet stored values to variable names with row column indices.
Example:
Worksheets("bla").Cells(100, 75).Value To Var_100_75
These variables still need to be initialized.
This may be done by writing a VBA code which simply reads every (relevant) cell in the "Bla" worksheet and writes it out to a text file as a variable initialization code.
Example:
Dim FSO As FileSystemObject
Dim FSOFile As TextStream
Dim FilePath As String
Dim col As Integer, row As Integer

FilePath = "c:\WriteTest.txt" ' create a test.txt file or change this

Set FSO = New FileSystemObject
' open the file in write mode
Set FSOFile = FSO.OpenTextFile(FilePath, 2, True)

' loop round adding lines
For col = 1 To Whatever_is_the_column_limit
    For row = 1 To Whatever_is_the_row_limit
        ' Construct the output line, e.g. Var_100_75 = <cell value>
        ' (CStr avoids the leading space that Str adds to positive numbers)
        FSOFile.WriteLine ("Var_" & CStr(row) & "_" & CStr(col) & _
            " = " & CStr(Worksheets("Bla").Cells(row, col).Value))
    Next row
Next col

FSOFile.Close
Obviously you need to correct the output line syntax and variable name structure for whatever other language you need to use.
P.S. If you are not familiar with RegEx (Regular Expressions), you will find a plethora of articles on the web explaining it.
I'm quite new to VB and doing simple basics. I have managed to access and read a specific file line by line. If I wanted to split the information by a comma or space and then sort it alphabetically or numerically, how would I go about this? Would I create a loop within the reading loop to parse the information? A simple-to-follow example would really help... Thanks!
Dim file As String = "C:\Users\test.txt"
Dim Line As String

If System.IO.File.Exists(file) = True Then
    Dim objReader As New System.IO.StreamReader(file)
    Do While objReader.Peek() <> -1
        Line = Line & objReader.ReadLine() & vbNewLine
    Loop
    Label1.Text = Line
    objReader.Close()
Else
    MsgBox("File Does Not Exist")
End If
It really depends on what you want to do with the text you split.
The Split() function will return an array of strings with the results of your split; from there it depends on the data.
Here is an example of using split http://www.dotnetperls.com/split-vbnet
Since you mentioned you want to sort the data alphabetically you may wish to look at http://www.codepedia.com/1/VBNET_ArraySort or look into using LINQ.
It is quite acceptable to nest a loop within your main loop if you want to do something more complex with the data.
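For example, here is a minimal sketch built around the same file as your snippet; the comma delimiter is just an assumption, so adjust it to your data:

Dim file As String = "C:\Users\test.txt"
Dim words As New System.Collections.Generic.List(Of String)

If System.IO.File.Exists(file) Then
    ' Split each line on commas and collect every piece
    For Each fileLine As String In System.IO.File.ReadAllLines(file)
        words.AddRange(fileLine.Split(","c))
    Next

    words.Sort()   ' alphabetical, in place
    Label1.Text = String.Join(vbNewLine, words.ToArray())
Else
    MsgBox("File Does Not Exist")
End If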
I am creating a sql query for an access database that will be exported to a text file. The requirements include a line feed separating each line. Does that happen by default, or its something that I need to add in?
If I need to add it, how do I do that?
TIA
TransferText includes line feeds, and I am fairly sure most methods of getting text out of Access will include them, unless you do something to stop it. It is not too difficult to check:
Dim fs As New FileSystemObject
Dim f As TextStream
Dim s As String
Dim a As String
Dim aa As Variant
Dim itm As Variant

s = "c:\docs\test.txt"
DoCmd.TransferText acExportDelim, , "query6", s

Set f = fs.OpenTextFile(s)
a = f.ReadAll

''Split at linefeed: Chr(10)
aa = Split(a, Chr(10))

''Test 1: number of lines
Debug.Print UBound(aa)

''Test 2: print each line
For Each itm In aa
    Debug.Print itm
Next