How to skip a header row in a csv file? - vb.net

I have a CSV file generated with headings on the first row and data on the rest. The file varies each time, and I need all of the data values for further processing. I'm using File.ReadAllLines(path) but can't work out how to ignore the header row. How can I accomplish this? Please help.

You should just start from the second line (index 1 of the returned string[]).

EDIT This is better:
File.ReadAllLines(@"c:\test.txt").Skip(1); // returns an IEnumerable<string>
File.ReadAllLines(@"c:\test.txt").Skip(1).ToArray(); // returns an array of strings (string[])
OLD
bool first = true;
StringBuilder sb = new StringBuilder();
File.ReadLines(@"c:\test.txt").ToList().ForEach(c =>
{
    if (first) first = false;   // skip the header line
    else sb.AppendLine(c);      // keep each data line on its own line
});
string res = sb.ToString();
This will essentially skip the first line. I don't know if there is a better way to do it.

Related

How can I load in a pipe (|) delimited text file that has columns that sometimes contain line breaks?

I have built an SSIS package that loads several delimited text files into a SQL database. One of the files often contains line breaks inside a field, which breaks the standard data flow task (a flat file source mapped to an ADO.NET destination), because each line break is treated as the start of a new row. The vendor sending the files does not want to edit them and can't provide XML at this time. Is there any way to fix this? I was thinking of writing a small VB.NET program to correct the files so they would work in the SSIS package, but I'm not sure how to write that logic. The file has 5 columns: the first 2 are big integers and always contain a long integer ID, then there is a small text column that contains one short word, then a date, and then a long comments field that is causing the problem. The comments field is sometimes blank (which is OK); the problem is the rows that have line breaks. I never know how many line breaks are in the comments: some have none, some have several, even multiple line breaks in a row, so I was wondering if this is even possible.
5787626|6547599|Approved|1/10/2017|Applicant request for fee waiver approved
5443221|7742812|Active|11/5/2013|
3430962|7643957|Re-Scheduled|5/25/2016|REVISED TERMS AND CONDITIONS REJECTED
Applicant has 30 DAYS To submit paperwork for extension.
34433624|7673715|Denied|1/24/2017|
34113575|7653748|Active|1/8/2014|New terms have been granted.
Sample File Format.
As long as there is logic that you can program/predict, it will be possible.
I would do it using a Script Component as a source, which means you don't need to rewrite the file before processing it. It also provides a lot of flexibility, e.g., you can store values in variables while iterating over multiple lines in the file, etc.
I posted another answer recently that gives a lot of detail on how to go about this: SSIS import a Flat File to SQL with the first row as header and last row as a total.
An example of holding the values in variables until the row is ready to be written:
For this example I am writing three columns, ID1, ID2 and Comments. The file looks like this:
1|2|Comment1
Comment2
4|5|Comment3
Comment4
Comment5
6|7|Comment6
The Script Component contains the following method.
public override void CreateNewOutputRows()
{
    System.IO.StreamReader reader = null;
    try
    {
        bool readFirstLine = false;
        int id1 = 0;
        int id2 = 0;
        string comments = null;
        reader = new System.IO.StreamReader(Variables.FilePath); // package variable that contains the file path
        while (!reader.EndOfStream)
        {
            string line = reader.ReadLine();
            if (line.Contains("|"))
            {
                // A delimiter marks the start of a new record, so write out the previous one first.
                if (readFirstLine)
                {
                    Output0Buffer.AddRow();
                    Output0Buffer.ID1 = id1;
                    Output0Buffer.ID2 = id2;
                    Output0Buffer.Comments = comments;
                }
                else
                {
                    readFirstLine = true;
                }
                string[] fields = line.Split('|');
                id1 = Convert.ToInt32(fields[0]);
                id2 = Convert.ToInt32(fields[1]);
                comments = fields[2];
            }
            else
            {
                // No delimiter: this line continues the previous record's comments.
                comments += " " + line;
            }
            if (reader.EndOfStream)
            {
                // Last line of the file: write out the record still being held.
                Output0Buffer.AddRow();
                Output0Buffer.ID1 = id1;
                Output0Buffer.ID2 = id2;
                Output0Buffer.Comments = comments;
            }
        }
    }
    finally
    {
        // Release the file handle whether or not an exception was thrown.
        if (reader != null)
        {
            reader.Dispose();
        }
    }
}
The result set is:
ID1 ID2 Comments
=== === ========
1 2 Comment1 Comment2
4 5 Comment3 Comment4 Comment5
6 7 Comment6
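The hold-until-the-next-record logic above does not depend on SSIS. As a rough sketch only, here is the same idea in Java, working on a list of lines that has already been read. The class and method names (ContinuationLines, mergeRecords) are made up for illustration, and the rule "a line containing | starts a new record" is the same assumption the Script Component makes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ContinuationLines {
    // Joins physical lines into logical records. A line that contains the
    // delimiter starts a new record; any other line is treated as a
    // continuation of the previous record's last field (the comments).
    static List<String> mergeRecords(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null; // record being assembled, if any
        for (String line : lines) {
            if (line.contains("|")) {
                if (current != null) {
                    records.add(current.toString()); // previous record is complete
                }
                current = new StringBuilder(line);
            } else if (current != null) {
                current.append(' ').append(line);
            }
        }
        if (current != null) {
            records.add(current.toString()); // flush the last record
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "1|2|Comment1", "Comment2",
            "4|5|Comment3", "Comment4", "Comment5",
            "6|7|Comment6");
        for (String record : mergeRecords(lines)) {
            System.out.println(record);
        }
    }
}
```

One caveat with this rule: a continuation line that happens to contain a | character would be mistaken for the start of a new record, so a stricter test (for example, checking that the first two fields are numeric) may be needed for real data.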

String is to Substring, as ArrayList is to?

In Java, and many other languages, one can grab a subsection of a string by saying something like String.substring(begin, end). My question is, Does there exist a built-in capability to do the same with Lists in Java that returns a sublist from the original?
This method is called subList and exists for both array and linked lists. Beware that the list it returns is backed by the existing list, so updating the original one will update the slice.
The answer can be found in the List API: List#subList(int, int).
Be warned, though, that this is a view of the underlying list, so if you change the original list, you'll change the sublist, and the semantics of the sublist are undefined if you structurally modify the original list. So I suppose it isn't strictly what you're looking for...
If you want a structurally independent subsection of the list, I believe you'll have to do something like:
ArrayList<something> copy = new ArrayList<>(oldList.subList(begin, end));
However, this will retain references to the original objects in the sublist. You'll probably have to manually clone everything if you want a completely new list.
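To make the view-versus-copy difference concrete, here is a small sketch. The class name SubListDemo and the helper copyOfRange are hypothetical names chosen for this example; only subList and the ArrayList copy constructor come from the standard library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SubListDemo {
    // Returns an independent copy of the range [begin, end).
    static <T> List<T> copyOfRange(List<T> source, int begin, int end) {
        return new ArrayList<>(source.subList(begin, end));
    }

    public static void main(String[] args) {
        List<String> original = new ArrayList<>(Arrays.asList("a", "b", "c", "d"));
        List<String> view = original.subList(1, 3);      // backed by original
        List<String> copy = copyOfRange(original, 1, 3); // independent list

        original.set(1, "B"); // non-structural change to the original
        System.out.println(view); // prints [B, c]
        System.out.println(copy); // prints [b, c]

        view.clear(); // structural change made through the view
        System.out.println(original); // prints [a, d]
    }
}
```

The copy is shallow: both lists still point at the same element objects, which only matters when the elements are mutable.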
The method is called subList and can be found here in the javadocs
http://docs.oracle.com/javase/7/docs/api/java/util/ArrayList.html#subList(int, int)
You can use subList(start, end)
ArrayList<String> arrl = new ArrayList<String>();
//adding elements to the end
arrl.add("First");
arrl.add("Second");
arrl.add("Third");
arrl.add("Random");
arrl.add("Click");
System.out.println("Actual ArrayList:"+arrl);
List<String> list = arrl.subList(2, 4);
System.out.println("Sub List: "+list);
Output:
Actual ArrayList:[First, Second, Third, Random, Click]
Sub List: [Third, Random]
You might just want to make a new method if you want it to be exactly like substring is to String.
public static List<String> sub(List<String> strs, int start, int end) {
    List<String> ret = new ArrayList<>(); //Make a new empty ArrayList with String values
    for (int i = start; i < end; i++) { //From start inclusive to end exclusive
        ret.add(strs.get(i)); //Append the value of strs at the current index to the end of ret
    }
    return ret;
}
public static List<String> sub(List<String> strs, int start) {
    List<String> ret = new ArrayList<>(); //Make a new empty ArrayList with String values
    for (int i = start; i < strs.size(); i++) { //From start inclusive to the end of strs
        ret.add(strs.get(i)); //Append the value of strs at the current index to the end of ret
    }
    return ret;
}
If myStrings is an ArrayList of the following Strings: {"do","you","really","think","I","am","addicted","to","coding"}, then sub(myStrings,1,6) would return {"you", "really", "think", "I", "am"} and sub(myStrings,4) would return {"I", "am", "addicted", "to", "coding"}. Also, sub(myStrings, 0) returns a copy of myStrings as a new ArrayList, which can help avoid problems with shared references.

How to get the last input ID in a textfile?

Can someone help me with my problem?
I'm having a hard time working out how to get the last input ID in a text file. My back end is a text file.
Thanks.
This is sample content from my program's text file:
ID|CODE1|CODE2|EXPLAIN|CODE3|DATE|PRICE1|PRICE2|PRICE3|
02|JKDHG|hkjd|Hfdkhgfdkjgh|264|56.46.54|654 654.87|878 643.51|567 468.46|
03|DEJSL|hdsk|Djfglkdfjhdlf|616|46.54.56|654 654.65|465 465.46|546 546.54|
01|JANE|jane|Jane|251|56.46.54|534 654.65|654 642.54|543 468.74|
How would I get the last input ID so that the ID of the next input line doesn't go back to 1?
Make a function that reads the file and returns a list of lines (strings), like this:
// Requires: using System.Collections.Generic; using System.IO;
public static List<string> ReadTextFileReturnListOfLines(string strPath)
{
    List<string> MyLineList = new List<string>();
    // The using block closes the reader even if reading throws.
    using (StreamReader sr = new StreamReader(strPath))
    {
        string line;
        // Read lines until the end of the file is reached.
        while ((line = sr.ReadLine()) != null)
        {
            MyLineList.Add(line);
        }
    }
    return MyLineList;
}
I am not sure whether
ID|CODE1|CODE2|EXPLAIN|CODE3|DATE|PRICE1|PRICE2|PRICE3|
is part of the file, so you may have to adjust the index of the line you want; then take the first field of that line:
MyStringList[1].Split('|')[0]; // MyStringList is the list returned by the function above
If you're looking for the last (highest) number in the ID field, you could do it with a single line in LINQ:
Dim MaxID = (From line in File.ReadAllLines("file.txt")
Skip 1
Select line.Split("|")(0)).Max()
This code reads the file into an array via File.ReadAllLines, skips the first line (which appears to be a header), splits each line on the delimiter (|), takes the first element of each split (the ID), and selects the maximum value.
In the case of your sample input, the result is "03".
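Because the IDs are compared as strings in the query above, the result depends on them being zero-padded to the same width ("9" would sort above "10"). A numeric comparison avoids that. The following is a minimal sketch in Java, assuming the header occupies the first line and the ID is the first pipe-delimited field; the names LastId and maxId are invented for the example, and the sample rows are shortened versions of the ones in the question.

```java
import java.util.Arrays;
import java.util.List;

class LastId {
    // Returns the largest numeric ID found in the first pipe-delimited field,
    // skipping the header line. Returns 0 when there are no data lines.
    static int maxId(List<String> lines) {
        return lines.stream()
                .skip(1)                                 // header row
                .filter(line -> !line.trim().isEmpty())  // ignore blank lines
                .mapToInt(line -> Integer.parseInt(line.split("\\|")[0].trim()))
                .max()
                .orElse(0);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "ID|CODE1|CODE2|",
            "02|JKDHG|hkjd|",
            "03|DEJSL|hdsk|",
            "01|JANE|jane|");
        int next = maxId(lines) + 1;
        System.out.println(String.format("%02d", next)); // prints 04
    }
}
```

In a real program the lines would come from the file, for example via java.nio.file.Files.readAllLines, and the next ID would be formatted to whatever width the file uses.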

Lucene: how to preserve whitespaces etc when tokenizing stream?

I am trying to perform a "translation" of sorts of a stream of text. More specifically, I need to tokenize the input stream, look up every term in a specialized dictionary and output the corresponding "translation" of the token. However, I also want to preserve all the original whitespace, stopwords, etc. from the input so that the output is formatted in the same way as the input instead of ending up as a plain stream of translations. So if my input is
Term1: Term2 Stopword! Term3
Term4
then I want the output to look like
Term1': Term2' Stopword! Term3'
Term4'
(where Termi' is translation of Termi) instead of simply
Term1' Term2' Term3' Term4'
Currently I am doing the following:
PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
PatternAnalyzer.WHITESPACE_PATTERN,
false,
WordlistLoader.getWordSet(new File(stopWordFilePath)));
TokenStream ts = pa.tokenStream(null, in);
CharTermAttribute charTermAttribute = ts.getAttribute(CharTermAttribute.class);
while (ts.incrementToken()) { // loop over tokens
String termIn = charTermAttribute.toString();
...
}
but this, of course, loses all the whitespace etc. How can I modify this to be able to re-insert it into the output? Thanks much!
============ UPDATE!
I tried splitting the original stream into "words" and "non-words". It seems to work fine. Not sure whether it's the most efficient way, though:
public ArrayList<Token> splitToWords(String sIn)
{
    if (sIn == null || sIn.length() == 0) {
        return null;
    }
    char[] c = sIn.toCharArray();
    ArrayList<Token> list = new ArrayList<Token>();
    int tokenStart = 0;
    boolean curIsLetter = Character.isLetter(c[tokenStart]);
    for (int pos = tokenStart + 1; pos < c.length; pos++) {
        boolean newIsLetter = Character.isLetter(c[pos]);
        if (newIsLetter == curIsLetter) {
            continue;
        }
        // The character class changed, so the current token ends here.
        TokenType type = curIsLetter ? TokenType.WORD : TokenType.NONWORD;
        list.add(new Token(new String(c, tokenStart, pos - tokenStart), type));
        tokenStart = pos;
        curIsLetter = newIsLetter;
    }
    // Emit the final token, which runs to the end of the input.
    TokenType type = curIsLetter ? TokenType.WORD : TokenType.NONWORD;
    list.add(new Token(new String(c, tokenStart, c.length - tokenStart), type));
    return list;
}
Well it doesn't really lose whitespace, you still have your original text :)
So I think you should make use of OffsetAttribute, which contains the startOffset() and endOffset() of each term within your original text. This is what Lucene uses, for example, to highlight snippets of search results from the original text.
I wrote up a quick test (uses EnglishAnalyzer) to demonstrate:
The input is:
Just a test of some ideas. Let's see if it works.
The output is:
just a test of some idea. let see if it work.
// just for example purposes, not necessarily the most performant.
public void testString() throws Exception {
    String input = "Just a test of some ideas. Let's see if it works.";
    EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_35);
    StringBuilder output = new StringBuilder(input);
    // in some cases, the analyzer will make terms longer or shorter.
    // because of this we must track how much we have adjusted the text so far
    // so that the offsets returned will still work for us via replace()
    int delta = 0;
    TokenStream ts = analyzer.tokenStream("bogus", new StringReader(input));
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        String term = termAtt.toString();
        int start = offsetAtt.startOffset();
        int end = offsetAtt.endOffset();
        output.replace(delta + start, delta + end, term);
        delta += (term.length() - (end - start));
    }
    ts.close();
    System.out.println(output.toString());
}
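The offset-and-delta bookkeeping can be tried without Lucene at all. The sketch below is not Lucene code: it swaps the analyzer for a plain java.util.regex matcher that treats runs of letters and digits as terms, purely to show how start/end offsets plus a running delta keep the surrounding whitespace and punctuation intact. The class name OffsetReplace, the method translateWords, and the toy translation function are all assumptions made for the example.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetReplace {
    // Replaces every run of letters/digits with translate(term), leaving all
    // other characters (whitespace, punctuation) exactly where they were.
    static String translateWords(String input, Function<String, String> translate) {
        StringBuilder output = new StringBuilder(input);
        Matcher m = Pattern.compile("[\\p{L}\\p{N}]+").matcher(input);
        int delta = 0; // how far offsets have shifted because of earlier edits
        while (m.find()) {
            String replacement = translate.apply(m.group());
            // Offsets refer to the original input, so shift them by delta.
            output.replace(delta + m.start(), delta + m.end(), replacement);
            delta += replacement.length() - (m.end() - m.start());
        }
        return output.toString();
    }

    public static void main(String[] args) {
        String input = "Term1: Term2 Stopword! Term3\nTerm4";
        // Toy "dictionary": append a prime to everything except the stopword.
        String result = translateWords(input,
                w -> w.equals("Stopword") ? w : w + "'");
        System.out.println(result);
    }
}
```

With a real analyzer the terms and offsets would come from CharTermAttribute and OffsetAttribute as in the answer above; only the source of the (start, end, term) triples changes.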

searching a list object

I have a list:
Dim list As New List(Of String)
with the following items:
290-7-11
1255-7-12
222-7-11
290-7-13
What's an easy and fast way to check whether a duplicate of "first block" plus "-" plus "second block" is already in the list? For example, the prefix 290-7 appears twice, in 290-7-11 and 290-7-13.
I am using .NET 2.0.
If you only want to know if there are duplicates but don't care what they are...
The easiest way (assuming exactly two dashes).
Boolean hasDuplicatePrefixes = list
    .GroupBy(i => i.Substring(0, i.LastIndexOf('-')))
    .Any(g => g.Count() > 1);
The fastest way (at least for large sets of strings).
HashSet<String> hashSet = new HashSet<String>();
Boolean hasDuplicatePrefixes = false;
foreach (String item in list)
{
    String prefix = item.Substring(0, item.LastIndexOf('-'));
    if (hashSet.Contains(prefix))
    {
        hasDuplicatePrefixes = true;
        break;
    }
    else
    {
        hashSet.Add(prefix);
    }
}
If there are cases with more than two dashes, use the following. This will still fail with a single dash.
String prefix = item.Substring(0, item.IndexOf('-', item.IndexOf('-') + 1));
In .NET 2.0 use Dictionary<TKey, TValue> instead of HashSet<T>.
Dictionary<String, Boolean> dictionary= new Dictionary<String, Boolean>();
Boolean hasDuplicatePrefixes = false;
foreach (String item in list)
{
    String prefix = item.Substring(0, item.LastIndexOf('-'));
    if (dictionary.ContainsKey(prefix))
    {
        hasDuplicatePrefixes = true;
        break;
    }
    else
    {
        dictionary.Add(prefix, true);
    }
}
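For comparison, the same single-pass idea can be sketched in Java, where Set.add reports whether the element was already present, so the lookup and the insert collapse into one call. The names DuplicatePrefix and hasDuplicatePrefixes are illustrative, and skipping items without any dash is a choice made for this sketch rather than something the answer specifies.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DuplicatePrefix {
    // Returns true if two items share the same "first-second" prefix,
    // i.e. everything before the last dash.
    static boolean hasDuplicatePrefixes(List<String> items) {
        Set<String> seen = new HashSet<>();
        for (String item : items) {
            int lastDash = item.lastIndexOf('-');
            if (lastDash < 0) {
                continue; // no dash at all: skipped in this sketch
            }
            // Set.add returns false when the prefix was already present.
            if (!seen.add(item.substring(0, lastDash))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("290-7-11", "1255-7-12", "222-7-11", "290-7-13");
        System.out.println(hasDuplicatePrefixes(items)); // prints true
    }
}
```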
If you don't care about readability and speed, use an array instead of a list, and are a real fan of regular expressions, you can do the following, too.
Boolean hasDuplicatePrefixes = Regex.IsMatch(
    String.Join("#", list), @".*(?:^|#)([0-9]+-[0-9]+-).*#\1");
Do you want to stop the user from adding it?
If so, a Hashtable keyed on "first block-second block" could be of use.
If not, LINQ is the way to go.
But it will have to traverse the list to check.
How big can this list be?
EDIT: I don't know if Hashtable has a generic version.
You could also use SortedDictionary, which can take generic arguments.
If your list contains only strings, then you can simply write a method that takes the string you want to find along with the list:
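Boolean isStringDuplicated(String find, List<String> list)
{
    if (list == null)
        throw new System.ArgumentNullException("list", "Given list is null.");
    int count = 0;
    foreach (String s in list)
    {
        // Note: Contains also matches inside other blocks (e.g. "290-7" is found in "1290-7-1");
        // use StartsWith(find + "-") for a strict prefix test.
        if (s.Contains(find))
            count += 1;
        if (count == 2)
            return true;
    }
    return false;
}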
If your numbers have a special significance in your program, don't be afraid to use a class to represent them instead of sticking with strings. Then you would have a place to write all the custom functionality you want for said numbers.