I have a file (around 10k entries) with the following format:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
A;B;C;n/a;n/a
D;E;F;56.011;13.099
D;E;F;56.01;13.01
D;E;F;n/a;n/a
I;B;C;n/a;n/a
It contains duplicates: some rows have no coordinates, and others have slightly contradictory LAT;LONG values.
I only want to keep the first record [$1;$2;$3;$4;$5] for each unique combination of $1;$2;$3, so the desired output should look like:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
I'd assume that I want to create an array, but I struggle with formatting it properly, so any help is appreciated!
I'm glad you have it working, but personally, I would suggest something a little more along the lines of:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
Example Use/Output
With your data in file, you could then do:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
You can shorten it to roughly match your example (it simply checks whether the index formed from the first three fields combined has been set yet, and relies on the default print action to output the first record with each unique combination):
$ awk -F";" '!seen[$1,$2,$3]++' file
However, using the joined fields $1,$2,$3 as the index is about the only way you can ensure uniqueness.
If you say yours works, then it is certainly shorter. Let me know if you have further questions.
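To make the array indexing explicit: awk joins a subscript list such as seen[$1,$2,$3] into a single string using the built-in SUBSEP separator. The sketch below spells out that key by hand and runs on the sample data from the question; the temporary directory and the output name deduped are only assumptions for the demonstration.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Sample data from the question
cat > file <<'EOF'
text1;text2;text3;lat;long
A;B;C;55.01;12.01
A;B;C;n/a;n/a
D;E;F;56.011;13.099
D;E;F;56.01;13.01
D;E;F;n/a;n/a
I;B;C;n/a;n/a
EOF

awk -F';' '
{
    key = $1 SUBSEP $2 SUBSEP $3      # same key awk builds for seen[$1,$2,$3]
    if (!(key in seen)) {             # first time this combination appears
        seen[key] = 1
        print
    }
}' file > deduped

cat deduped
```

Because only the first three fields form the key, the first record of each group wins regardless of what its lat/long columns contain.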
Found it once I stopped looking for how to create arrays.
I had created a new $1 by merging $1, $2 and $3, but the other solution is indeed more elegant. Here is the command I came up with after merging the fields in the file (and setting them as the new $1), which it turns out I didn't have to do:
awk -F';' '!seen[($1)]++' file1.csv > file2.csv
I'm trying to parse out all of the lines in between different headers and footers to different files using an awk script in a for loop. For example, I have a file with a list of mismatches with sample-name headers (compiled.csv) that looks like this:
19-T00,,,,,,,,,,,,,,,,
1557,WT,,,,,,,,,,,,,,,
6,109-G->A,110-G->A,,,,,,,,,,,,,,
3,183-G->A,,,,,,,,,,,,,,,
19-T10,,,,,,,,,,,,,,,,
642,WT,,,,,,,,,,,,,,,
206,24->G,,,,,,,,,,,,,,,
19-T21,,,,,,,,,,,,,,,,
464,24->G,,,,,,,,,,,,,,,
19-TSpl,,,,,,,,,,,,,,,,
2219,24->G,,,,,,,,,,,,,,,
20-T00,,,,,,,,,,,,,,,,,,
...
...
My goal for the lines above would be to write all the lines from 19-T00 through 2219,24->G,,,,,,,,,,,,,,, to an output file called sample-19.csv.
The sample names all share the pattern [0-9][0-9]-T*. My first approach was to create an array with all 20 sample names (i.e. 19, 20, 21...). I am trying to execute the following loop; output files are created, but they are blank.
for i in {0,19}
do a="$i"
b=`echo $i+1 | bc`
header="${array[$a]}-T"; footer="${array[$b]}-T"
name=`echo $header | cut -d"-" -f1`
awk -F, -v start="$header" -v finish="$footer" '/^start*/,/^finish*/' compiled.csv >"sample-"$name".csv"
done
If I do this manually with the one-liner:
awk '/^19-T*/,/^20-T*/' compiled.csv >sample-19.csv
it works fine. So I think there may be a problem in the variable passing, but I don't know how to fix it.
I know there are some other threads discussing the header-footer approach using awk, but I just think my syntax needs some help. If anyone has any advice by way of more experienced eyes, it would be much appreciated. Let me know if anything isn't clear.
Thanks,
Matt
All you need is something like this (untested):
awk '
    /^[0-9][0-9]-T00,/ {
        close(out)
        out = "sample-" $0
        sub(/-T00.*/,".csv",out)
    }
    { print > out }
' compiled.csv
If you're ever again considering processing text with a shell loop, make sure to read why-is-using-a-shell-loop-to-process-text-considered-bad-practice first.
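On the variable-passing problem itself: inside /.../ awk treats start and finish as literal characters, not as the variables set with -v, so /^start*/ looks for lines beginning with "star". A variable has to be matched with the ~ operator instead. Separately, in bash {0,19} expands to just 0 and 19, while {0..19} gives the whole sequence. A minimal sketch of matching against -v variables follows; the shortened sample rows and the flag name inside are assumptions for illustration, and it stops before the footer line rather than including it as a range pattern would.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Shortened stand-in for compiled.csv
cat > compiled.csv <<'EOF'
19-T00,,
1557,WT,
19-T10,,
642,WT,
20-T00,,
7,WT,
EOF

awk -v start="19-T" -v finish="20-T" '
    $0 ~ ("^" finish) { inside = 0 }   # next sample begins: stop printing
    $0 ~ ("^" start)  { inside = 1 }   # wanted sample begins: start printing
    inside                             # default action prints the line
' compiled.csv > sample-19.csv

cat sample-19.csv
```

The string "^" start is used as a dynamic regular expression, so any regex metacharacters in the variable keep their special meaning.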
Using awk:
awk --posix '/[0-9]{2}-T00/{split($0,a,"-"); name=a[1]} {print $0 > ("sample-" name ".csv")}' file
Output will be two files "sample-19.csv" and "sample-20.csv" for your contents
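To try the one-file-per-sample idea on a small input, here is a sketch under a few assumptions: the data is a shortened stand-in for compiled.csv, each sample's first row ends in -T00, and the guards on out are my addition so that rows appearing before any header are skipped instead of causing an error.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

cat > compiled.csv <<'EOF'
19-T00,,
1557,WT,
19-T10,,
642,WT,
20-T00,,
7,WT,
EOF

awk '
    /^[0-9][0-9]-T00,/ {
        if (out != "") close(out)       # finish the previous sample file
        split($0, a, "-")               # a[1] is the sample number, e.g. 19
        out = "sample-" a[1] ".csv"
    }
    out != "" { print > out }           # skip rows before the first header
' compiled.csv

wc -l sample-19.csv sample-20.csv
```

Closing each file before opening the next keeps the number of simultaneously open files at one, which matters when there are many samples.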
I have a file with the following records:
a,1
a,1,2
a,1,2,3
b,4
b,4,5
b,4,5,6
I want the output like this:
a,1,2,3
b,4,5,6
It's really unclear what you are trying to do here. It's even less clear what you have tried so far (good StackOverflow questions usually involve some code)! You've read the FAQ, right?
If your input is in a file called input_file.csv, then the following awk program will give you the output you have said you want. Whether it will work for your real data is anyone's guess.
% awk -F',' '{
    lines[$1] = $0
}
END {
    for (line in lines) {
        print lines[line]
    }
}' input_file.csv
I offer no explanation of what this simple script does, only a handy reference for awk.
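One caveat worth noting: awk does not define the order in which for (line in lines) visits the keys, so the groups may come out in a different order than they appear in the input. A sketch of a variant that keeps first-seen order is below; the helper names order and n are hypothetical, and it still assumes the longest record of each group comes last.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

printf '%s\n' 'a,1' 'a,1,2' 'a,1,2,3' 'b,4' 'b,4,5' 'b,4,5,6' > input_file.csv

awk -F',' '
    !($1 in lines) { order[++n] = $1 }   # remember keys in first-seen order
    { lines[$1] = $0 }                   # later records overwrite earlier ones
    END {
        for (i = 1; i <= n; i++)
            print lines[order[i]]
    }
' input_file.csv > last_per_key.csv

cat last_per_key.csv
```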
Thanks for your appreciation!
As requested
awk '/......./' input
a,1,2,3
b,4,5,6