I wish to delete one column of my data in awk, but the solutions I found use a command like $column_A="". Is column_A really deleted this way?
For example, I wish to delete the second column, and I found a solution: awk 'BEGIN{FS="\t";OFS="\t"}!($2="")', which prints results like $1\t\t$3. It seems that the content of the second column is deleted, but not the second column itself.
After reading dev-null's comment, I got an idea of what you are asking...
My answer is: it depends on how you define "a column is deleted".
See this example:
kent$ echo "foo,bar,blah"|awk -F, -v OFS="," '{$2="";print}'
foo,,blah
kent$ echo "foo,bar,blah"|awk -F, -v OFS="," '{print $1,$3}'
foo,blah
You see the difference? If you set $x="", the column is still there, it just becomes an empty string, so the field separators before and after it stay. If this is what you want, fine. Otherwise, just skip outputting the target column, as the second example shows.
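If you really want the field itself gone, a minimal sketch is to shift the remaining fields left and drop the last one. Note that decrementing NF is a common extension (it works in GNU awk) but is not guaranteed by every awk implementation:
kent$ echo "foo,bar,blah"|awk -F, -v OFS="," '{for(i=2;i<NF;i++)$i=$(i+1);NF--;print}'
foo,blah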
I would use cut for that:
cut -d$'\t' -f1,3- file
-f1,3- selects the first field, skips field 2 and then selects fields 3 to end.
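For example, with tab-separated input (tab is also cut's default delimiter, so -d$'\t' could even be omitted):
$ printf 'foo\tbar\tblah\n' | cut -d$'\t' -f1,3-
foo	blah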
I have a file (around 10k entries) with following format:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
A;B;C;n/a;n/a
D;E;F;56.011;13.099
D;E;F;56.01;13.01
D;E;F;n/a;n/a
I;B;C;n/a;n/a
It contains duplicates: some rows repeat exactly, others repeat with mildly contradicting LAT;LONG coordinates.
I only want to keep the first occurrence of each unique [$1;$2;$3] combination, so the desired output should look like:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
I'd assume that I want to create an array, but I struggle with the proper formatting of it... so any help is appreciated!
I'm glad you have it working, but personally, I would suggest something a little more along the lines of:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
Example Use/Output
With your data in file, you could then do:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
You can shorten it to roughly your example, which simply checks whether the unique index made from the first three fields has been set yet, and relies on awk's default print action to output the first record having each unique combination:
$ awk -F";" '!seen[$1,$2,$3]++' file
However, using the joined fields $1,$2,$3 as the index is about the only way you can ensure uniqueness.
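For reference, awk joins the comma-separated subscripts with the built-in SUBSEP character (\034 by default), so seen[$1,$2,$3] is a single string index that still keeps the fields distinct; plain concatenation would not be equivalent:
$ awk 'BEGIN{ seen["A","B"]=1
  if (("A","B") in seen) print "comma index found"
  if (("A" "B") in seen) print "concatenated index found" }'
comma index found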
If you say yours works, then it is certainly shorter. Let me know if you have further questions.
Found it by giving up on looking for how to create arrays: I created a new $1 consisting of $1, $2 and $3 merged. The other solution is indeed more elegant, though. Here is the command I came up with, after merging the fields in the file (and setting them as the new $1), which it turns out I didn't have to do:
awk -F';' '!seen[($1)]++' file1.csv > file2.csv
I have a file that looks like this
ON,111111,TEN000812,Super,7483747483,767,Free
ON,262762,BOB747474,SuperMan,4347374,676,Free
ON,454644,FRED84848,Super Man,65757,555,Free
I need to match the values in the fourth column exactly as they are written. So if I am searching for "Super" I need it to return the line with "Super" only.
ON,111111,TEN000812,Super,7483747483,767,Free
Likewise, if I'm looking for "Super Man" I need that exact line returned.
ON,454644,FRED84848,Super Man,65757,555,Free
I have tried using grep, but grep will match all instances that contain Super. So if I do this:
grep -i "Super" file.txt
It returns all lines, because they all contain "Super":
ON,111111,TEN000812,Super,7483747483,767,Free
ON,262762,BOB747474,SuperMan,4347374,676,Free
ON,454644,FRED84848,Super Man,65757,555,Free
I have also tried with awk, and I believe I'm close, but when I do:
awk '$4==Super' file.txt
I still get output like this:
ON,111111,TEN000812,Super,7483747483,767,Free
ON,262762,BOB747474,SuperMan,4347374,676,Free
I have been at this for hours, and any help would be greatly appreciated at this point.
You were close, or I should say very close: just set the field delimiter to comma in your solution and you are all set.
awk 'BEGIN{FS=","} $4=="Super"' Input_file
Also, one more thing about OP's attempt: when comparing the 4th field with a string value, the string should be wrapped in double quotes. Unquoted, Super is treated as an (uninitialized, hence empty) awk variable, not as the string "Super".
OR in case you want to pass the value to be compared as an awk variable, try the following.
awk -v value="Super" 'BEGIN{FS=","} $4==value' Input_file
You are quite close actually; you can try:
awk -F, '$4=="Super" {print}' file.txt
I find this form easier to grasp. Slightly longer than @RavinderSingh13's, though.
-F is the field separator, in this case a comma.
Next you have a condition followed by an action.
The condition checks whether the fourth field is exactly the string Super.
If the string matches, the line is printed.
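If you would rather stay with grep, one alternative is to anchor the match by field position: skip exactly three comma-terminated fields, then require Super followed by a comma. This sketch assumes the fourth field is never the last field on the line:
$ grep -E '^([^,]*,){3}Super,' file.txt
ON,111111,TEN000812,Super,7483747483,767,Free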
I want to delete the records in a file once a pattern is found, i.e. delete all lines from that pattern until the end of the file.
I tried doing it in awk, but I haven't found whether there is a simpler way to do it.
So I want to match the pattern in the second column and then delete records from that line until the end of the file.
awk -F"," '$2 ~ /100000/ {next} {print}' file.csv
The above code skips those lines; however, as you can see, I would need to add multiple match patterns to ignore every line after the one whose 2nd column is 100000. Please note that the values in the 2nd column appear sequentially, so after 100000 comes 100001, and there is no fixed end number.
Not sure if I got your problem correctly, but IMHO I believe you need the following.
awk -F',' '$2==100000{print;exit} 1' Input_file
This prints up to and including the line whose 2nd column is 100000 (note the -F',' since your file is comma-separated), then exits. Since you don't want to print anything after that point, exiting immediately (rather than reading the rest of the file just to skip it) additionally saves time.
OR as per Ed Morton sir's nice suggestion:
awk -F',' '1; $2==100000{exit}' Input_file
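And if the line containing 100000 should itself be deleted too, swap the order so the exit fires before the print:
awk -F',' '$2==100000{exit} 1' Input_file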
I want to use awk to find sequence patterns in DNA data, but I cannot figure out how to do it. I have a text file "test.txt" which contains a lot of data, and I want to be able to match any sequence that starts with ATG and ends with TAA, TGA or TAG, and print all such matches.
For instance, if my text file has data that looks like the following, I want to find and match all the sequences present and output them as shown below it.
AGACGCCGGAAGGTCCGAACATCGGCCTTATTTCGTCGCTCTCTTGCTTTGCTCGAATAAACGAGTTTGGCTTTATCGAATCTCCGTACCGTAAGGTCGAAAACGGCCGGGTCATTGAGTACGTGAAAGTACAAAATGG
GTCCGCGAATTTTTCGGTTCGTCTCAGCTTTCGCAGTTTATGGATCAGACGAACCCGCTCTCTGAAATTACTCATAAACGCAGGCTCTCGGCGCTCGGGCCCGGCGGACTCTCGCGGGAGCGTGCAGGTTTCGAAGTTC
GGATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAGGCGAACTGCTCGAAAATCAATTCCGAATCGGGCTTGAGCGAATGGAGCGGGCCATCAAGGAAAAAATGTCTATCCAGCAGGATATGCAAACGACG
AAAGTATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAAATTCAATATCAAAATGGGACGCCCCGAGCGCGACCGTATAGACGATCCGCTGCTTGCGCCGATGGATTTCATCGACGTTGTGAA
ATGAGACCGGGCGATCCGCCGACTGTGCCAACCGCCTACCGGCTTCTGG
Print out matches:
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAA
I tried something like this, but it only displays the rows that start with ATG; it doesn't actually solve my problem.
awk '/^ATG/{print $0}' test.txt
Assuming the records do not span multiple lines:
$ grep -oP 'ATG.*?T(AA|AG|GA)' file
ATGGATCAGACGAACCCGCTCTCTGA
ATGATATCGACCATCTCGGCAATCGACGCGTTCGGGCCGTAG
ATGTTTTTCGATCCGCGCCGATTCGACCTCTCAAGAGTCGGAAGGCTTAA
ATGGGACGCCCCGAGCGCGACCGTATAG
ATGGATTTCATCGACGTTGTGA
Non-greedy match; requires the -P (Perl regex) switch, so each match ends at the first stop codon rather than extending to the longest possible span.
Could you please try the following.
awk 'match($0,/ATG.*TAA|ATG.*TGA|ATG.*TAG/){print substr($0,RSTART,RLENGTH)}' Input_file
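One caveat: match() reports a single leftmost-longest match, so with the greedy .* this prints at most one (and the longest possible) span per line. If you need every short match, as in the grep -oP output above, a portable loop is possible; the following is only a sketch (pure regex scanning, not frame-aware biology):
awk '{
  s = $0
  while (match(s, /ATG/)) {
    s = substr(s, RSTART)               # jump to the next ATG
    rest = substr(s, 4)                 # text after that ATG
    if (!match(rest, /T(AA|GA|AG)/))    # nearest stop codon, if any
      break
    print substr(s, 1, 3 + RSTART + 2)  # ATG ... stop codon, inclusive
    s = substr(rest, RSTART + 3)        # resume after the stop codon
  }
}' test.txt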
This is related to the questions:
awk - Remove line if field is duplicate
sed/awk + regex delete duplicate lines where first field matches (ip address)
I have a file like this:
FOO,BAR,100,200,300
BAZ,TAZ,500,600,800
FOO,BAR,900,1000,1000
HERE,THERE,1000,200,100
FOO,BAR,100,10000,200
BAZ,TAZ,100,40,500
The duplicates are determined by the first two fields. In addition, the more "recent" record (lower in the file / higher line number) is the one that should be retained.
What is an awk script that will output:
BAZ,TAZ,100,40,500
FOO,BAR,100,10000,200
HERE,THERE,1000,200,100
Output order is not so important.
Explanation of awk syntax would be great.
This is easy in awk: we just need to feed an array keyed on the 1st and 2nd columns joined, with the rest of the record as the value. Each later record overwrites the earlier one under the same key, so the most recent record wins:
$ awk -F, '{a[$1","$2]=$3","$4","$5}END{for(i in a)print i,a[i]}' OFS=, file.txt
BAZ,TAZ,100,40,500
HERE,THERE,1000,200,100
FOO,BAR,100,10000,200
This might work for you (tac and GNU sort):
tac file | sort -sut, -k1,2
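Here tac reverses the file so the most recent occurrence of each key comes first; sort -s (stable) -u (unique) -t, (comma as field delimiter) -k1,2 (compare on fields 1 through 2 only) then keeps exactly that first occurrence per key. With the sample file:
$ tac file | sort -sut, -k1,2
BAZ,TAZ,100,40,500
FOO,BAR,100,10000,200
HERE,THERE,1000,200,100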