Java: StringTokenizer does not respect separator - tokenize

I have the following code that extracts tab-separated strings into a string array:
static public List<String> getContents(File aFile, String separator){
    // all strings, split based on separator
    List<String> contentList = new ArrayList<String>();
    StringTokenizer tokenizer = new StringTokenizer(Util.getContents(aFile), separator);
    while (tokenizer.hasMoreTokens()){
        contentList.add(tokenizer.nextToken());
    }
    return contentList;
}
The separator in this case is "\t".
As long as two strings are separated by one tab, everything works. However, my dataset sometimes has two strings separated by two tabs. This means that one parameter is missing and an empty string should be added to the list. Instead, the method ignores the gap and returns a list with one string fewer.
In my particular case, I always want a list of 5 strings back. That means a text containing only 4 tabs and no other text should return a list of 5 empty strings (a parsing job depends on that). Unfortunately, I have no control over the content, and I am working with millions of files that are generated outside my control.
Is there a better way to do this with StringTokenizer? Or do I have to implement something on my own?
Here are some examples:
String ok = "a\tb\tc\td\te";
String nok = "a\tb\tc\t\te";
Ralf

Found this: How to split a string in Java
and that I can do it with
"myString".split("\t", -1);
to obtain the empty strings when multiple separators cluster in one place (the negative limit also keeps trailing empty strings).
Thanks anyway!
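For anyone comparing the two approaches, here is a minimal sketch of the difference. The class and method names (TabSplitDemo, splitKeepingEmpty, tokenize) are made up for illustration and are not from the original code; the file-reading part is left out and the methods take the line directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// Hypothetical demo class: contrasts String.split(..., -1) with StringTokenizer.
final class TabSplitDemo {

    // Splits on the literal separator and keeps empty fields, including trailing ones.
    static List<String> splitKeepingEmpty(String line, String separator) {
        // split() expects a regex, so Pattern.quote makes the separator literal.
        return Arrays.asList(line.split(Pattern.quote(separator), -1));
    }

    // The StringTokenizer approach from the question: adjacent separators are collapsed.
    static List<String> tokenize(String line, String separator) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line, separator);
        while (tokenizer.hasMoreTokens()) {
            out.add(tokenizer.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String nok = "a\tb\tc\t\te";
        System.out.println(splitKeepingEmpty(nok, "\t").size());        // 5 fields, one of them empty
        System.out.println(tokenize(nok, "\t").size());                 // 4 tokens, the gap is lost
        System.out.println(splitKeepingEmpty("\t\t\t\t", "\t").size()); // 5 empty fields
    }
}
```

If the job really needs exactly 5 fields, it is probably worth checking the size of the result and rejecting or logging lines that do not match.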

Related

Finding best delimiter by size of resulting array after split Kotlin

I am trying to determine the best delimiter for my CSV file. I've seen answers that pick the delimiter that yields the most columns in the header row. Now, instead of using the standard method, which would look something like this:
val supportedDelimiters: Array<Char> = arrayOf(',', ';', '|', '\t')
fun determineDelimiter(headerRow: String): Char {
    var headerLength = 0
    var chosenDelimiter = ' '
    supportedDelimiters.forEach {
        if (headerRow.split(it).size > headerLength) {
            headerLength = headerRow.split(it).size
            chosenDelimiter = it
        }
    }
    return chosenDelimiter
}
I've been trying to do it with some in-built Kotlin collections methods like filter or maxOf, but to no avail (the code below does not work).
fun determineDelimiter(headerRow: String): Char {
return supportedDelimiters.filter({a,b -> headerRow.split(a).size < headerRow.split(b)})
}
Is there any way I could do it without forEach?
Edit: The header row could look something like this:
val headerRow = "I;am;delimited;with;'semi,colon'"
I put the '' over an entry that could contain other potential delimiter
You're mostly there, but this seems simpler than you think!
Here's one answer:
fun determineDelimiter(headerRow: String)
= supportedDelimiters.maxByOrNull{ headerRow.split(it).size } ?: ' '
maxByOrNull() does all the hard work: you just tell it the number of headers that a delimiter would give, and it searches through each delimiter to find which one gives the largest number.
It returns null if the list is empty, so the method above returns a space character, like your standard method. (In this case we know that the list isn't empty, so you could replace the ?: ' ' with !! if you wanted that impossible case to give an error, or you could drop it entirely if you wanted it to give a null which would be handled elsewhere.)
As mentioned in a comment, there's no foolproof way to guess the CSV delimiter in general, and so you should be prepared for it to pick the wrong delimiter occasionally. For example, if the intended delimiter was a semicolon but several headers included commas, it could wrongly pick the comma. Without knowing any more about the data, there's no way around that.
With the code as it stands, there could be multiple delimiters which give the same number of headers; it would simply pick the first. You might want to give an error in that case, and require that there's a unique best delimiter. That would give you a little more confidence that you've picked the right one — though there's still no guarantee. (That's not so easy to code, though…)
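To give an idea of what the "unique best delimiter" check could look like, here is a rough sketch in Java rather than Kotlin. Everything in it (the UniqueDelimiter class, the count helper, returning an empty Optional on a tie) is an assumption made for illustration, not part of the code above. It counts occurrences of each delimiter instead of splitting, which ranks the candidates the same way because each occurrence adds one field.

```java
import java.util.Optional;

// Hypothetical sketch: choose a delimiter only if it is the unique best candidate.
final class UniqueDelimiter {
    static final char[] SUPPORTED = {',', ';', '|', '\t'};

    // Counts how many times c occurs in s.
    static int count(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    // Returns the delimiter with the strictly highest count,
    // or empty when there is a tie for first place or no delimiter occurs at all.
    static Optional<Character> determineDelimiter(String headerRow) {
        int best = 0;
        char chosen = 0;
        boolean tie = false;
        for (char d : SUPPORTED) {
            int n = count(headerRow, d);
            if (n > best) {
                best = n;
                chosen = d;
                tie = false;      // a new leader clears any earlier tie
            } else if (n == best && n > 0) {
                tie = true;       // another candidate matches the current leader
            }
        }
        return (best == 0 || tie) ? Optional.empty() : Optional.of(chosen);
    }

    public static void main(String[] args) {
        System.out.println(determineDelimiter("I;am;delimited;with;'semi,colon'")); // Optional[;]
        System.out.println(determineDelimiter("a,b;c"));                            // Optional.empty
    }
}
```

The caller then decides what an empty result means (fail, ask the user, fall back to a default). This still does nothing about delimiters inside quoted headers.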
Just like gidds said in the comment above, I would advise against choosing the delimiter based on how many times each delimiter appears. You would get the wrong answer for a header row like this:
Type of shoe, regardless of colour, even if black;Size of shoe, regardless of shape
In the above header row, the delimiter is obviously ; but your method would erroneously pick ,.
Another problem is that a header column may itself contain a delimiter, if it is enclosed in quotes. Your method doesn't take any notice of possible quoted columns. For this reason, I would recommend that you give up trying to parse CSV files yourself, and instead use one of the many available Open Source CSV parsers.
Nevertheless, if you still want to know how to pick the delimiter based on its frequency, there are a few optimizations to readability that you can make.
First, note that Kotlin strings are iterable; therefore you don't have to use a List of Char. Use a String instead.
Secondly, all you're doing is counting the number of times a character appears in the string, so there's no need to break the string up into pieces just to do that. Instead, count the number of characters directly.
Third, instead of finding the maximum value by hand, take advantage of what the standard library already offers you.
const val supportedDelimiters = ",;|\t"
fun determineDelimiter(headerRow: String): Char =
    supportedDelimiters.maxBy { delimiter -> headerRow.count { it == delimiter } }
fun main() {
    val headerRow = "one,two,three;four,five|six|seven"
    val chosenDelimiter = determineDelimiter(headerRow)
    println(chosenDelimiter) // prints ',' as expected
}

How to copy one string's n number of characters to another string in Kotlin?

Let's take a string var str = "Hello Kotlin". I want to copy first 5 character of str to another variable strHello. I was wondering is there any function of doing this or I have to apply a loop and copy characters one by one.
As Tim commented, there's a substring() method which does exactly this, so you can simply do:
val strHello = str.substring(0, 5)
(The first parameter is the 0-based index of the first character to take; and the second is the index of the character to stop before.)
There are many, many methods available on most of the common types. If you're using an IDE such as IDEA or Eclipse, you should see a list of them pop up after you type "str." (the variable name followed by a dot). (That's one of many good reasons for using an IDE.) Or check the official documentation.
Please use the string.take(n) utility.
More details at
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/take.html
I was using substring in my project, but it threw an exception when the string was shorter than the end index passed to substring.
val name1 = "This is a very very long name"
// To copy to another string
val name2 = name1.take(5)
println(name1.substring(0..5)) // "This i" (the range is inclusive, so this is 6 characters)
println(name1.substring(0..50)) // throws an exception (StringIndexOutOfBoundsException on the JVM)
println(name1.take(5)) // "This "
println(name1.take(50)) // no exception; returns the whole string
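If you ever need the same behaviour from plain Java, which has no take(n) on String, one way is to clamp the end index yourself. This is only a sketch; the PrefixDemo class and its take helper are made-up names, not a library API.

```java
// Hypothetical helper mirroring the behaviour of Kotlin's take(n) in plain Java.
final class PrefixDemo {

    // Returns the first n characters of s, or all of s when it is shorter than n.
    static String take(String s, int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        // Clamping the end index avoids StringIndexOutOfBoundsException.
        return s.substring(0, Math.min(n, s.length()));
    }

    public static void main(String[] args) {
        System.out.println(take("Hello Kotlin", 5));  // Hello
        System.out.println(take("Hi", 50));           // Hi
    }
}
```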

Java - Index a String (Substring)

I have this string:
201057&channelTitle=null_JS
I want to be able to cut out the '201057' and make it a new variable. But I don't always know how long the digits will be, so can I somehow use the '&' as a reference? Something like:
myDigits.substring(0, position of &)?
Thanks
Sure, you can split the string along the &.
String s = "201057&channelTitle=null_JS";
String[] parts = s.split("&");
String newVar = parts[0];
The expected result here is
parts[0] = "201057";
parts[1] = "channelTitle=null_JS";
In production code you should of course check the length of the parts array, in case no "&" was present.
Several programming languages also support the useful inverse operation
String s2 = String.join("&", parts); // should have the same value as s
Since Java 8 this is part of the standard library as String.join; for older versions, Apache Commons Lang offers an equivalent.
Always read the API first. There is an indexOf method in String that will return you the first index of the character/String you gave it.
You can use myDigits.substring(0, myDigits.indexOf('&'));
However, if you want to get all of the arguments in the query separately, then you should use mvw's answer.
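One thing to watch with the indexOf approach: indexOf returns -1 when there is no '&', and substring(0, -1) throws. A small guarded version might look like the sketch below. The class and method names are invented for the example, and returning the whole string when no '&' is found is just one possible choice.

```java
// Hypothetical helper: text before the first '&', with a guard for a missing '&'.
final class PrefixBeforeAmpersand {

    // Returns the text before the first '&', or the whole string if there is no '&'.
    static String beforeAmpersand(String s) {
        int pos = s.indexOf('&');
        // indexOf returns -1 when the character is absent; substring(0, -1) would throw.
        return pos < 0 ? s : s.substring(0, pos);
    }

    public static void main(String[] args) {
        System.out.println(beforeAmpersand("201057&channelTitle=null_JS")); // 201057
        System.out.println(beforeAmpersand("201057"));                      // 201057
    }
}
```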

Convert a StringBuilder to a Jagged Array

I have built a VB.Net class that will be used in VBA for reading text files. I've set it up so the user can specify what tables in the file he wants to return. What I have done is build a StringBuilder of the tables, then return it as a jagged array, but I can't quite get the conversion of the builder to array part right. I'd like the first level to be split on "NewLine" and the second level to be split on ",".
Is this possible without having to use multiple arrays and/or loops?
This will create the jagged array:
Dim myArray = (From row In myStringBuilder.ToString().Split({vbCrLf}, StringSplitOptions.None)
Select (From col In row.Split(","c)
Select col
).ToArray()
).ToArray()
Explanation:
First, we convert the StringBuilder to a String: myStringBuilder.ToString()
Then we split on line breaks: Split({vbCrLf}, StringSplitOptions.None). Since a line break consists of two characters in Windows, we use the Split overload that accepts a string array (hence the braces).
Within the row we split the line on commas: Split(","c). The c specifies that this is a character instead of a string.
Finally, we convert this enumerable of enumerables into an array of arrays by applying ToArray to the outer as well as the inner LINQ expression.
You could represent your jagged array using nested lists and generics. The outer (row) would be a generic list and the inner (col) could be a list of strings.
Other approaches could leverage XML or LINQ but would be less efficient.
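To make the nested-list idea concrete, here is a rough sketch of the shape of that structure. It is written in Java rather than VB.NET, so treat it purely as an illustration; the NestedRows class and toRows method are hypothetical names. The outer list holds rows and each inner list holds the comma-separated values of one row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: rows as an outer list, columns as inner lists of strings.
final class NestedRows {

    // Splits text into rows on line breaks, then each row into columns on commas.
    static List<List<String>> toRows(String text) {
        List<List<String>> rows = new ArrayList<>();
        // "\r?\n" accepts both Windows and Unix line breaks; -1 keeps empty trailing fields.
        for (String line : text.split("\r?\n", -1)) {
            rows.add(Arrays.asList(line.split(",", -1)));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<String>> rows = toRows("a,b\r\nc,d,e");
        System.out.println(rows); // [[a, b], [c, d, e]]
    }
}
```

Rows can have different lengths, which is the same property a jagged array gives you.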

Weird results when splitting strings in VB.NET

I was getting weird results when doing multiple splits on a string, so I decided to make a simple test to figure out what was going on
testString = "1234567891011121314151617181920"
If I wanted to get whats between 10 to 20 in Javascript I would do this:
var results = testString.split("10")[1].split("20")[0]
Which would return 111213141516171819
However when I do this in VB I get 111
Split(testString,"10")(1).Split("20")(0)
It seems the 2nd split is only recognizing the first character no matter what I put.
So it's stopping when it finds the next "2" in the string, even "2abc" would have the same outcome even though that string doesn't even exist.
String.Split does not have an overload that takes only a String. The argument is a Char array or String array. Your string is probably being converted to a char array. Explicitly pass a string array like so:
testString.Split(New String() { "10" }, StringSplitOptions.None)
Try wrapping the second split so it's fashioned like the first one, i.e.:
Split( Split(testString,"10")(1), "20" )(0)
Vb treats the delimiter argument only as a single character.
This is a tricky scenario that I have seen trip people up before, so I think it is worth a little more explanation than the other answers give. In your original format Split(testString,"10")(1).Split("20")(0), you are unknowingly using two DIFFERENT Split functions.
The first Split(testString,"10") is using the Microsoft.VisualBasic.Strings.Split function, which takes String type parameters. http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.strings.split(v=vs.110).aspx
The second .Split("20")(0) is using System.String.Split method, which does not have an overload that takes a String parameter. http://msdn.microsoft.com/en-us/library/System.String.Split(v=vs.110).aspx
So what was happening is:
1. Split(testString,"10") uses Microsoft.VisualBasic.Strings.Split, which returns New String() {"123456789", "11121314151617181920"}
2. (1) takes the element at index 1 of the returned array (the second element), which is "11121314151617181920"
3. "11121314151617181920".Split("20")(0) uses System.String.Split, and attempts to split on string separator "20"
NOTE: The string "20" param gets implicitly converted to a char "2" because the only single parameter overload of String.Split has a signature of Public Function Split (ParamArray separator As Char()) As String(). The ParamArray parameter option allows you to pass a comma delimited list of values into the function, similar to how String.Format works with a variable number of replacement values. http://msdn.microsoft.com/en-us/library/538f81ec.aspx
The code in step 3 becomes "11121314151617181920".Split(new Char() {CChar("20")})(0), which using literal values is "11121314151617181920".Split(new Char() {"2"c})(0). The result is {"111", "13141516171819", "0"}. Taking the element at index 0 returns "111".
So to avoid confusion, you should convert your code to use the same version of Split on both sides.
Either of the 2 examples below should work:
Example 1: Using Microsoft.VisualBasic.Strings.Split:
Split( Split(testString,"10")(1), "20" )(0)
Example 2: Using System.String.Split:
testString _
.Split(New String() {"10"}, StringSplitOptions.None)(1) _
.Split(New String() {"20"}, StringSplitOptions.None)(0)
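For comparison, the two outcomes described above (splitting on the whole string "20" versus on the single character "2") can be reproduced with a small Java sketch. This is not VB.NET, and the SplitBetween class and between method are names I made up for the demonstration. Java's split takes a regex, so Pattern.quote is used to treat the separators literally.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: chained splits with a full-string versus a one-character separator.
final class SplitBetween {

    // Returns the text between the first occurrence of start and the next occurrence of end.
    // Assumes start occurs in s; otherwise indexing [1] throws ArrayIndexOutOfBoundsException.
    static String between(String s, String start, String end) {
        String afterStart = s.split(Pattern.quote(start), -1)[1];
        return afterStart.split(Pattern.quote(end), -1)[0];
    }

    public static void main(String[] args) {
        String testString = "1234567891011121314151617181920";
        System.out.println(between(testString, "10", "20")); // 111213141516171819
        System.out.println(between(testString, "10", "2"));  // 111
    }
}
```

The second call shows the same "111" that the VB code produced once "20" had been narrowed to the character "2".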