View duplicate based on column value - awk

I have a CSV file with multiple records:
$ cat input.csv
Name;region;latitude;longitude;X;Y;code;N
USE;Cal;78676;8576;FDG;265;V;763
UTYE;Cal;56;8899;FDG;265;V;763
UKV;Cal;46;8576;FDG;265;R;763
UMV;Cal;785754;763;FDG;67;V;763
UBE;Cal;78676;8576;FDG;2651;V;763
UMV;mag;785754;763;FDG;67;V;763
UBE;mag;78676;8576;FDG;265;V;763
I need to view lines that have the same value in the second column and different values in the 6th and 7th columns.
For example, the first two rows should be deleted, as they are duplicates on $2;$6;$7:
USE;Cal;78676;8576;FDG;265;V;763
UTYE;Cal;56;8899;FDG;265;V;763
Rows 3, 4, 5, 6 and 7 should be kept: duplicate value in $2, but a different value in $6 and/or $7:
UKV;Cal;46;8576;FDG;265;R;763
UMV;Cal;785754;763;FDG;67;V;763
UBE;Cal;78676;8576;FDG;2651;V;763
UMV;mag;785754;763;FDG;67;V;763
UBE;mag;78676;8576;FDG;265;V;763
Expected output:
$ cat output.csv
Name;region;latitude;longitude;X;Y;code;N
UKV;Cal;46;8576;FDG;265;R;763
UMV;Cal;785754;763;FDG;67;V;763
UBE;Cal;78676;8576;FDG;2651;V;763
UMV;mag;785754;763;FDG;67;V;763
UBE;mag;78676;8576;FDG;265;V;763
I tried something like this:
awk -F\; 'NR==1; $NF in a{if (a[$NF]!=0){print a[$NF];a[$NF]=0}print;next}{a[$NF]=$0}' input.csv
It didn't work properly: it doubled the records I have, and I couldn't filter the result based on $6 and $7.

Seems like awk -F';' 'NR==1{print;next} FNR==NR{a[$2,$6,$7]+=1;next;} a[$2,$6,$7]==1' input.csv input.csv should do the trick, e.g.
$ cat input.csv
Name;region;latitude;longitude;X;Y;code;N
USE;Cal;78676;8576;FDG;265;V;763
UTYE;Cal;56;8899;FDG;265;V;763
UKV;Cal;46;8576;FDG;265;R;763
UMV;Cal;785754;763;FDG;67;V;763
UBE;Cal;78676;8576;FDG;2651;V;763
UMV;mag;785754;763;FDG;67;V;763
UBE;mag;78676;8576;FDG;265;V;763
$ awk -F';' 'NR==1{print;next} FNR==NR{a[$2,$6,$7]+=1;next;} a[$2,$6,$7]==1' input.csv input.csv
Name;region;latitude;longitude;X;Y;code;N
UKV;Cal;46;8576;FDG;265;R;763
UMV;Cal;785754;763;FDG;67;V;763
UBE;Cal;78676;8576;FDG;2651;V;763
UMV;mag;785754;763;FDG;67;V;763
UBE;mag;78676;8576;FDG;265;V;763
This works by reading the file twice: the first time creating a map from the distinct 2nd/6th/7th field values to a count of their occurrences, the second time printing only those lines with a count of 1. More specifically:
-F';' tells awk the delimiter is a semicolon (not a comma as in a typical CSV, and not whitespace, which is awk's default).
NR==1{print;next;} NR is the current line number for all files being read, so the block of code here will only execute for the first line of the first file. In this case, we print the header and go to the next line.
FNR==NR{a[$2,$6,$7]+=1;next;} FNR is the current file's line number, so FNR==NR means this will only execute for the first file, as once the second file begins being read, FNR resets. For the first file, this maps the distinct 2nd/6th/7th fields to a count of occurrences. a[$2,$6,$7] uses $2,$6,$7 as a key in the array a and adds 1 to its value. Following this, we immediately go to the next line.
a[$2,$6,$7]==1 can implicitly be read as a[$2,$6,$7]==1{print}, and it will only be reached for the second file's rows, since for all of the first file we've called next before this could be run. It looks up the mapped count for $2,$6,$7 and prints the row if it is equal to 1, thus printing only the desired rows.
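As a variation, the same filtering can be done in a single pass by buffering the rows and deciding at END. This is only a sketch: it trades the second read of the file for memory proportional to the file size, and it recreates the sample input.csv so it runs standalone.

```shell
# Recreate the sample data so the sketch is self-contained
cat > input.csv <<'EOF'
Name;region;latitude;longitude;X;Y;code;N
USE;Cal;78676;8576;FDG;265;V;763
UTYE;Cal;56;8899;FDG;265;V;763
UKV;Cal;46;8576;FDG;265;R;763
UMV;Cal;785754;763;FDG;67;V;763
UBE;Cal;78676;8576;FDG;2651;V;763
UMV;mag;785754;763;FDG;67;V;763
UBE;mag;78676;8576;FDG;265;V;763
EOF
# Single pass: buffer each data row, count its $2/$6/$7 key,
# then print only the rows whose key occurred exactly once.
awk -F';' '
NR==1 { print; next }                                   # header straight through
{ lines[NR] = $0; key[NR] = $2 SUBSEP $6 SUBSEP $7; cnt[key[NR]]++ }
END {
    for (i = 2; i <= NR; i++)
        if (cnt[key[i]] == 1) print lines[i]            # unique keys only
}' input.csv
```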
edit:
Re-reading your question, it could be that you only want the rows with distinct $2/$6/$7 values that also have more than one row with the 2nd field value.
That is, given this input with two rows added that have "Unq" and "Ant" for their second field, you would not want those rows, since there is only one row with each of those 2nd field values.
$ cat input.csv
Name;region;latitude;longitude;X;Y;code;N
USE;Cal;78676;8576;FDG;265;V;763
UTYE;Cal;56;8899;FDG;265;V;763
UKV;Cal;46;8576;FDG;265;R;763
UKV;Unq;46;8576;FDG;265;R;763
UMV;Cal;785754;763;FDG;67;V;763
UBE;Cal;78676;8576;FDG;2651;V;763
UMV;mag;785754;763;FDG;67;V;763
UMV;Ant;785754;763;FDG;67;V;763
UBE;mag;78676;8576;FDG;265;V;763
If that is the case, a small addition to the awk does it: a second array, b, in which we keep a count of the 2nd field value. Then we only print if $2,$6,$7 is unique AND the count of $2 in the file is >1.
$ awk -F';' 'NR==1{print;next} FNR==NR{b[$2]+=1;a[$2,$6,$7]+=1;next;} a[$2,$6,$7]==1&&b[$2]>1' input.csv input.csv
Name;region;latitude;longitude;X;Y;code;N
UKV;Cal;46;8576;FDG;265;R;763
UMV;Cal;785754;763;FDG;67;V;763
UBE;Cal;78676;8576;FDG;2651;V;763
UMV;mag;785754;763;FDG;67;V;763
UBE;mag;78676;8576;FDG;265;V;763

awk choose a line with $1 present in a file and output with a changed field

I've tried to use Awk to do the following:
I have a large txt file with first column the name of a gene and different values, essentially numeric, in each column.
Now I have a file with a list of genes (not all genes, just a subset) that I want to modify.
Initially I just removed lines using something I found in a forum
awk -F '\t' ' FILENAME=="gene_list" {arr[$1]; next} # create an array without values
!($1 in arr)' gene_list original_file.txt > modified_file.txt
This worked great but now I need to keep all rows (in the same order) but modify these genes to do something like:
if ($1 in arr) {print $1, $2, $3-($4/10), $4}
else {print $0}
So you see, this time, if it is different (the gene is not in my list), I want to keep the whole line, otherwise I want to keep the whole line but modify the value in one column by a given number.
If you could include something so that the value remains an integer, that would be great. I'll also have to replace the value with 0 if it becomes negative, but that I know how to do, at least in a separate command.
Edit: minimal example:
list of genes in a txt file, one under the other:
ccl5
cxcr4
setx
File to modify: (I put commas as field separators here, but the fields should be tab-separated)
ccl4,3,18000,50000
ccl5,4,400,5000
cxcr4,5,300,2500
apoe,4,100,90
setx,3,200,1903
Expected output: (I subtract a tenth of the 4th column from the 3rd column when the gene in the first column matches a gene in my separate txt file; otherwise I keep the full line unchanged)
ccl4,3,18000,50000
ccl5,4,0,5000
cxcr4,5,50,2500
apoe,4,100,90
setx,3,10,1903
Just spell out the arithmetic constraints.
The following is an attempt to articulate it in idiomatic Awk.
if (something) { print } can be rewritten as just something. So just 1 (which is always true) is a common idiom for "print all lines (if you reach this point in the script before hitting next)".
Rounding a floating-point number can be done with sprintf("%1.0f", n), which rounds to the nearest integer (int(n) would simply truncate the fraction).
awk 'BEGIN { FS=OFS="\t" }
FILENAME=="gene_list" {arr[$1]; next}
$1 in arr { x=sprintf("%1.0f", $3-($4/10));
if (x<0) x=0; print $1, $2, x, $4; next }
1' gene_list original_file.txt > modified_file.txt
Demo: https://ideone.com/oDjKhf
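A quick way to see the difference between the two rounding approaches, using illustrative values in a BEGIN block so no input file is needed:

```shell
# sprintf("%1.0f", ...) rounds to the nearest integer,
# while int() truncates the fraction.
awk 'BEGIN {
    printf "%s %s\n", sprintf("%1.0f", 2.6), int(2.6)   # nearest vs truncated
    printf "%s %s\n", sprintf("%1.0f", 2.4), int(2.4)
}'
```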

How do I match a pattern and then copy multiple lines?

I have two files that I am working with. The first file is a master database file that I am having to search through. The second file is a file that I can make that allows me to name the items from the master database that I would like to pull out. I have managed to make an AWK solution that will search the master database and extract the exact line that matches the second file. However, I cannot figure out how to copy the lines after the match to my new file.
The master database looks something like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40006X/50006/60006/3/10/3/
10047A/20047/30047/0/5/23./XXXX/
10048A/20048/30048/0/5/23./XXXX/
10049A/20049/30049/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
40009X/50009/60009/3/10/3/
10054A/20054/30054/0/5/23./XXXX/
10055A/20055/30055/0/5/23./XXXX/
10056A/20056/30056/0/5/23./XXXX/
40010X/50010/60010/3/10/3/
10057A/20057/30057/0/5/23./XXXX/
10058A/20058/30058/0/5/23./XXXX/
10059A/20059/30059/0/5/23./XXXX/
In my example, the lines that start with 4000 are the header lines that I am matching against. The last number in such a row tells me how many lines there are to copy. So for the first line, 40005X/50005/60005/3/10/9/, I would be matching on 40005X, and the 9 in that line tells me that there are 9 lines underneath that I need to copy with it.
The second file is very simple and looks something like this:
40005X
40007X
40008X
As the script finds each match, I would like to move the information from the first file to a new file for analysis. The end result would look like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
The code that I currently have that will match the first line is this:
#! /bin/ksh
file1=input_file
file2=input_masterdb
file3=output_test
awk -F'/' 'NR==FNR {id[$1]; next} $1 in id' $file1 $file2 > $file3
I have had the most success with AWK, however I am open to any suggestion. However, I am working on this on a UNIX system. I would like to keep it as a KSH script, since most of the other scripts that I use with this are written in that format, and I am most familiar with it.
Thank you for your help!!
Your existing awk correctly matches the rows from the ids file; you now need to add a condition to print the next N lines, where N is read from the sixth field (the count) of the matching row. So we set a variable p to the number of lines to print plus one (for the current line), and decrement it as each row is printed.
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p-->0{print}' file1 file2
or the same with the last condition written more "awkish" (by Ed Morton), which also avoids decrementing p without bound on a huge file
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p&&p--' file1 file2
here the print action is omitted, as it is the default, and the condition stays true as long as p is positive, decrementing it each time.
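The p&&p-- idiom in isolation, on a toy input (the word "start" and p=3 are illustrative): the match sets p, so the matching line plus the two following lines print.

```shell
# "start" sets p=3; p&&p-- then prints 3 lines (the match and 2 after),
# decrementing p only while it is still positive.
printf '%s\n' a start b c d | awk '/start/{p=3} p&&p--'
```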
another one
$ awk -F/ 'NR==FNR {a[$1]; next}
!n && $1 in a {n=$(NF-1)+1}
n&&n--' file2 file1
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
this handles the case where a content line itself matches one of the given ids: it only looks for another id after the specified number of lines has been printed.
Could you please try the following, written and tested in GNU awk with the shown samples. It assumes you want to start printing from lines that begin with digits followed by X. Here Input_file2 is the file having only the ids and Input_file1 is the master file, as per the OP's question.
awk '
{
  sub(/ +$/,"")
}
FNR==NR{
  a[$0]
  next
}
/^[0-9]+X/{
  match($0,/[0-9]+\/$/)
  no_of_lines_to_print=substr($0,RSTART,RLENGTH-1)
  found=count=""
}
{
  if(count==no_of_lines_to_print){ count=found="" }
  for(i in a){
    if(match($0,i)){
      found=1
      print
      next
    }
  }
}
found{
  ++count
}
count<=no_of_lines_to_print && count!=""
' Input_file2 Input_file1

If two columns match change the third

File 1.txt:
13002:1:3:6aw:4:g:Dw:S:5342:dsan
13003:5:3s:6s:4:g:D:S:3456:fdsa
13004:16:t3:6:4hh:g:D:S:5342:inef
File 2.txt:
13002:6544
13003:5684
I need to replace the old data in column 9 of 1.txt with the new data from column 2 of 2.txt when a matching key exists. I think this can be done line by line, as both files key on the same column 1 field. The file is about 3 GB. I have been playing about with awk but can't achieve the following.
I was trying the following:
awk 'NR==FNR{a[$1]=$2;} {$9a[b[2]]}' 1.txt 2.txt
Expected result:
13002:1:3:6aw:4:g:Dw:S:6544:dsan
13003:5:3s:6s:4:g:D:S:5684:fdsa
13004:16:t3:6:4hh:g:D:S:5342:inef
You seem to have a couple of odd typos in your attempt. You want to replace $9 with the value from the array if it is defined. Also, you want to make sure Awk uses colon as separator both on input and output.
awk -F : 'BEGIN { OFS=FS }
NR==FNR{a[$1]=$2; next}
$1 in a {$9 = a[$1] } 1' 2.txt 1.txt
Notice how 2.txt is first, so that NR==FNR is true when you are reading this file, but not when you start reading 1.txt. The next in the first block prevents Awk from executing the second condition while you are reading the first file. And the final 1 is a shorthand for an unconditional print which of course will be executed for every line in the second file, regardless of whether you replaced anything.
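To sanity-check, the same command can be run against the sample data from the question, recreated here with here-documents:

```shell
# Recreate the two sample files from the question
cat > 2.txt <<'EOF'
13002:6544
13003:5684
EOF
cat > 1.txt <<'EOF'
13002:1:3:6aw:4:g:Dw:S:5342:dsan
13003:5:3s:6s:4:g:D:S:3456:fdsa
13004:16:t3:6:4hh:g:D:S:5342:inef
EOF
# Read the lookup file first, then rewrite $9 where a key matches
awk -F : 'BEGIN { OFS=FS }
NR==FNR{a[$1]=$2; next}
$1 in a {$9 = a[$1] } 1' 2.txt 1.txt
```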

AWK - get value between two strings over multiple lines

input.txt:
>block1
111111111111111111111
>block2
222222222222222222222
>block3
333333333333333333333
AWK command:
awk '/>block2.*>/' input.txt
Expected output
222222222222222222222
However, AWK is returning nothing. What am I misunderstanding?
Thanks!
If you want to print the line after the line containing >block2, then you could use:
awk '/^>block2$/ { nr=NR+1 } NR == nr { print }'
Track the record number plus 1 when you find the match; when the current record number matches the remembered one, print the current record.
If you want all the lines between the line >block2 and >block3, then you'd use:
awk '/^>block2$/,/^>block3/ {if ($0 !~ /^>block[23]$/) print }'
For all lines between the two markers, if the line doesn't match either marker, print it. The output is the same with the sample data file.
another awk
$ awk 'c&&c--; /^>block2/{c=1}' file
222222222222222222222
c specifies how many lines you want to print after the match. If you want the text between two markers
$ awk '/^>block3/{exit} s; /^>block2/{s=1}' file
222222222222222222222
if there are multiple instances and you want them all, just change exit to s=0
You probably meant:
$ awk '/>/{f=/^>block2$/;next} f' file
222222222222222222222
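The flag technique spelled out with comments (same behavior), run against the sample input recreated here so it is self-contained:

```shell
cat > input.txt <<'EOF'
>block1
111111111111111111111
>block2
222222222222222222222
>block3
333333333333333333333
EOF
awk '
/>/ { f = /^>block2$/; next }  # at each ">" header, raise the flag only for >block2
f                              # print body lines while the flag is set
' input.txt
```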

awk to ignore double quote and compare two files

I have two input file
FILE 1
123
125
123
129
and file 2
"a"|"123"|"anc"
"b"|"124"|"ind"
"c"|"123"|"su"
"d"|"122"|"aus"
OUTPUT:
"b"|"124"|"ind"
"d"|"122"|"aus"
Now how can I compare $1 from file1 against $2 from file2 and print the lines of file2 that don't match? I'm having trouble because of the double quotes (").
So how can I compare them, ignoring the double quotes?
$ awk 'FNR==NR{a[$1]=1;next} a[$3]==0' file1 FS='["|]+' file2
"b"|"124"|"ind"
"d"|"122"|"aus"
How it works:
file1 FS='["|]+' file2
This list of files tells awk to read file1 first, then change the field separator to any combination of double-quotes and vertical bars and then read file2.
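A toy demonstration of this mechanism, with hypothetical files f1 and f2: var=value arguments in the file list are processed when awk reaches them, so the separator changes between files.

```shell
printf 'x y\n' > f1     # two whitespace-separated fields
printf 'a,b,c\n' > f2   # three comma-separated fields
# f1 is read with the default FS; FS=',' takes effect before f2 is read
awk '{ print NF }' f1 FS=',' f2
```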
FNR==NR{a[$1]=1;next}
FNR is the number of lines that awk has read from the current file and NR is the total number of lines read. Consequently, FNR==NR is true only while reading the first file. The commands which follow in braces are only executed for the first file.
This creates an associative array a whose keys are the first fields of file1 and whose values are 1. The next command tells awk to skip the rest of the commands and start over on the next line.
a[$3]==0
This is true only if the number in field 3 did not occur in file1. If it is true, then the default action is taken which is to print the line. (With the field separator that we have chosen, the number you are interested in is in field 3.)
Alternative
$ awk 'FNR==NR{a[$1]=1;next} a[substr($2,2,length($2)-2)]==0' file1 FS='|' file2
"b"|"124"|"ind"
"d"|"122"|"aus"
This is similar to the above except that the field separator is just a vertical bar. In this case, the number that you are interested in is in field 2. We use substr to remove one character from either end of field 2 which has the effect of removing the double-quotes.
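The substr arithmetic in isolation, on an illustrative quoted value: dropping the first character and the last leaves just the number.

```shell
# length(s)=5 for "\"123\"", so substr(s, 2, 3) strips both quotes
awk 'BEGIN { s = "\"123\""; print substr(s, 2, length(s) - 2) }'
```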