Add new column with times same value was found in 2 columns - awk

Add a new column holding the number of times the pair of values in columns 1 and 2 occurs in the file.
input file
46849,39785,2,012,023,351912.29,2527104.70,174.31
46849,39785,2,012,028,351912.45,2527118.70,174.30
46849,39785,3,06,018,351912.12,2527119.51,174.33
46849,39785,3,06,020,351911.80,2527105.83,174.40
46849,39797,2,012,023,352062.45,2527118.50,173.99
46849,39797,2,012,028,352062.51,2527105.51,174.04
46849,39797,3,06,020,352063.29,2527116.71,174.13,
46849,39809,2,012,023,352211.63,2527104.81,173.74
46849,39809,2,012,028,352211.21,2527117.94,173.69
46849,39803,2,012,023,352211.63,2527104.81,173.74
46849,39803,2,012,028,352211.21,2527117.94,173.69
46849,39801,2,012,023,352211.63,2527104.81,173.74
Expected output file:
4,46849,39785,2,012,023,351912.29,2527104.70,174.31
4,46849,39785,2,012,028,351912.45,2527118.70,174.30
4,46849,39785,3,06,018,351912.12,2527119.51,174.33
4,46849,39785,3,06,020,351911.80,2527105.83,174.40
3,46849,39797,2,012,023,352062.45,2527118.50,173.99
3,46849,39797,2,012,028,352062.51,2527105.51,174.04
3,46849,39797,3,06,020,352063.29,2527116.71,174.13,
2,46849,39809,2,012,023,352211.63,2527104.81,173.74
2,46849,39809,2,012,028,352211.21,2527117.94,173.69
2,46849,39803,2,012,023,352211.63,2527104.81,173.74
2,46849,39803,2,012,028,352211.21,2527117.94,173.69
1,46849,39801,2,012,023,352211.63,2527104.81,173.74
attempt:
awk -F, '{x[$1 $2]++}END{ for(i in x) {print i,x[i]}}' file
4684939785 4
4684939797 3
4684939801 1
4684939803 2
4684939809 2

Could you please try following.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
a[$1,$2]++
next
}
{
print a[$1,$2],$0
}
' Input_file Input_file
Explanation: Input_file is read twice. On the first pass, an array named a, indexed by the first and second fields, counts each occurrence of that pair. On the second pass, each line is printed preceded by the total count for its first two fields.
One liner code:
awk 'BEGIN{FS=OFS=","} FNR==NR{a[$1,$2]++;next} {print a[$1,$2],$0}' Input_file Input_file
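When the data arrives on a pipe and cannot be read twice, the same count can be produced in one pass by buffering the lines. This is only a sketch: the function name count_pairs and the arrays count, line and keyof are made up for illustration, and the whole input is held in memory.

```shell
# count_pairs (hypothetical helper): reads CSV on stdin and prefixes every
# line with the number of lines that share its first two fields.
count_pairs() {
  awk 'BEGIN { FS = OFS = "," }
       {
         key = $1 SUBSEP $2   # same composite key that a[$1,$2] builds
         count[key]++         # tally this pair
         line[NR] = $0        # remember the line in input order
         keyof[NR] = key      # and which pair it belongs to
       }
       END {
         for (i = 1; i <= NR; i++)
           print count[keyof[i]], line[i]
       }'
}

printf '%s\n' '46849,39785,2' '46849,39785,3' '46849,39801,2' | count_pairs
# prints:
# 2,46849,39785,2
# 2,46849,39785,3
# 1,46849,39801,2
```

Reading the file twice, as in the answer above, keeps memory use low; buffering is mainly useful when the input is a stream.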

Related

Printing lines of file2 when two fields from file1 match substrings of a single field in file2

Goal: To print lines of File2 when field 1 ($1) and field 4 ($4) of File1 both match a substring in field 4 ($4) on lines beginning with ">" in File2.
Important note #1: The lines being printed to output include the line being searched and all the lines following it until the next line with a ">".
Example: When fields 1 and 4 of File1 are 2776 & 2968 respectively, these should be searched against field 4 of File2 to eventually find the match 2776-2968(+) (because both numbers of File1 match a substring in field 4 of File2). The order of the numbers in the string does not matter - 2968-2776(+) should also be considered a match. Since they match, that line of File2 is printed with all lines below it until another line with ">" is encountered.
Important Note #2: File1 is tab-delimited: \t. File 2 is colon-delimited: :.
File1:
Transcription_Start Translation_Start Translation_Stop Transcription_Stop Strand Expression
2776 2968 + 920
17374 17563 + 1959
2968 2786 - 802
17563 17375 - 1694
19606 19395 - 1914
File2:
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA
Desired Output:
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
This is what I've tried so far (it outputs the full contents of File2, thus failing to produce the desired output):
$ awk -F"\t|:" 'NR==FNR{a[$4]; next} ($1 in a) || ($4 in a)' File1 File2 > Output
>-::NC_013316.1:2776-2968(+)
ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTC
TCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAAC
GCTCGGAAACGGACGCTAATACCGCATAC
>-::NC_013316.1:17374-17563(+)
AAAATTAAAGAAAATTCTAAAAAAATAAAAGATAGAATTTCAATTAAGTAAAAAAGTGAA
>-::NC_013316.1:2786-2968(-)
GTTCCTCCTTGTCACTATTTTAAACAAATTCCTATTGATACACTAAAAGTATATTATTTC
>antisense_CDR20291_r27::NC_013316.1:10830-11707(-)
TATTTCTTGTTCCTTTTTTCAAGGACAAGTAAATAAATTAACCTACTGTTTAATTTTCAA
>antisense_CDR20291_r27::NC_013316.1:11814-11874(-)
TTCCTTTGAGTTTCACTCTTGCGAGCGTACTTCCCAGGCGGA
How can I process my files with awk (or similar) to achieve my goal?
With your shown samples, please try following. Written and tested with GNU awk.
awk '
FNR==NR{
arr[$1,$2]
next
}
/^>/{
found=""
if((($5,$6) in arr) || (($6,$5) in arr)){
found=1
}
}
found
' file1 FS=":|-|\\\\(" file2
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when file1 is being read.
arr[$1,$2] ##Creating arr with index of 1st and 2nd field.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if line starts from > then do following.
found="" ##Nullifying found here.
if((($5,$6) in arr) || (($6,$5) in arr)){ ##Checking condition if either 5th 6th field is present in arr OR 6th 5th field as a key present in arr then do following.
found=1 ##Setting found to 1 here.
}
}
found ##Checking condition if found is set then print that line.
' file1 FS=":|-|\\\\(" file2 ##Mentioning Input_file(s) and setting field separator before Input_file2 to get exact values to match.
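To see why $5 and $6 hold the two coordinates, it can help to split one header line and list its fields. The sketch below uses a bracket expression, [(:-], which should split on the same three characters as the separator above while avoiding the backslash escaping; the helper name show_fields is made up for illustration.

```shell
# show_fields (hypothetical helper): prints every field of each input line
# with its number, splitting on "(", ":" or "-".
show_fields() {
  awk -F '[(:-]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
}

printf '%s\n' '>-::NC_013316.1:2776-2968(+)' | show_fields
# prints:
# 1=[>]
# 2=[]
# 3=[]
# 4=[NC_013316.1]
# 5=[2776]
# 6=[2968]
# 7=[+)]
```

The empty fields 2 and 3 come from the adjacent separators in ">-::", which is why the coordinates land in fields 5 and 6 for headers of this shape.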

How to compare two columns of two csv files with awk?

I have two csv files I need to compare against one column.
My member.csv file looks like:
ID|lastName|firstName
01|Lastname01|Firstname01
02|Lastname02|Firstname02
The second file check-ID.csv looks like:
Lastname01|Name01|pubID01|Hash01
Lastname02|Name02|pubID02|Hash02a
Lastname03|Name03|pubID03|Hash03
Lastname02|Name02|pubID02|Hash02b
Lastname01|Name01|pubID01|Hash01b
--> Lastname03 is not in my member.csv !
What I want is to check if the value of the first column of check-ID.csv is equal to value of second column in member.csv.
My attempt with script.awk is
NR==FNR{a[$1]=$1; b[$1]=$0; next}
$2==a[$1]{ delete b[$1]}
END{for (i in b ) print b[i]}
executing with
awk -f script.awk check-ID.csv member.csv
The problem is that the result is not filtered.
I like to get a filtered and sorted output so only members are listed like this:
Lastname01|Name01|pubID01|Hash01
Lastname01|Name01|pubID01|Hash01b
Lastname02|Name02|pubID02|Hash02a
Lastname02|Name02|pubID02|Hash02b
Any help appreciated!
Could you please try the following. I think you were close; the only thing to change is the order in which the input files are read. Here member.csv is read first and check-ID.csv second, because the latter holds all the details that need to be printed, and only the 2nd field of member.csv needs to be checked.
awk '
BEGIN{
FS="|"
}
FNR==NR{
a[$2]
next
}
($1 in a)
' member.csv check-ID.csv |
sort -t'|' -k1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS="|" ##Setting field separator as | here.
}
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when first Input_file named member.csv is being read.
a[$2] ##Creating array a with index 2nd field here.
next ##next will skip all further statements from here.
}
($1 in a) ##Checking condition if 1st field is present in a then print that line.
' member.csv check-ID.csv | ##Mentioning Input_file names here and sending its output to sort command.
sort -t'|' -k1 ##Sorting output (which we got from awk command above) with | as separator; -k1 without an end field makes the key run from the first field to the end of the line.
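The question also points out that Lastname03 is not in member.csv. Negating the same lookup lists such rows. A minimal sketch, assuming the file layouts shown in the question; the function name list_non_members and the array name seen are made up for illustration.

```shell
# list_non_members MEMBER_FILE CHECK_FILE (hypothetical helper):
# prints rows of CHECK_FILE whose 1st field is not a last name in MEMBER_FILE.
list_non_members() {
  awk -F'|' '
    FNR == NR { if (FNR > 1) seen[$2]; next }  # skip the header, remember last names
    !($1 in seen)                              # print rows with an unknown last name
  ' "$1" "$2"
}

# Small demonstration on temporary copies of the sample data.
demo_dir=$(mktemp -d)
printf '%s\n' 'ID|lastName|firstName' '01|Lastname01|Firstname01' \
  '02|Lastname02|Firstname02' > "$demo_dir/member.csv"
printf '%s\n' 'Lastname01|Name01|pubID01|Hash01' \
  'Lastname03|Name03|pubID03|Hash03' > "$demo_dir/check-ID.csv"
list_non_members "$demo_dir/member.csv" "$demo_dir/check-ID.csv"
# prints: Lastname03|Name03|pubID03|Hash03
rm -rf "$demo_dir"
```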

How to count values between empty cells

I'm facing one problem which is bigger than me. I have 18 relatively large text files (ca. 30k lines each) and I need to count the values between the empty cells in the second column. Here is a simple example of my file:
Metabolism
line_1 10.2
line_2 10.1
line_3 10.3
TCA_cycle
line_4 10.7
line_5 10.8
Pyruvate_metab
line_6 100.8
In reality, I have circa 500 description lines (Metabolism, TCA_cycle, etc.) and the range of lines is between zero to a few hundred.
I would like to count values for each block (block starts with a description and corresponding lines are always below), e.g.
Metabolism 30.6
line_1 10.2
line_2 10.1
line_3 10.3
TCA_cycle 21.5
line_4 10.7
line_5 10.8
Pyruvate_metab 100.8
line_6 100.8
Or just
30.6
21.5
100.8
It won't be a problem if results will be printed line by line into an additional file... Or another alternative way.
There is one tricky thing and it's descriptions without lines with numbers.
Transport
line_1000 100.1
line_1001 100.2
Cell_signal
Motility
Processing
Translation
line_1002 500.1
line_1003 200.2
Even for those descriptions I would like to get a 0 value.
Transport 200.3
line_1000 100.1
line_1001 100.2
Cell_signal 0
Motility 0
Processing 0
Translation 700.3
line_1002 500.1
line_1003 200.2
The rest of the file looks same and it's consistent - 2 columns, tab separators, descriptions in the first column, values in the second, no spaces (only underlines).
Actually I have no experience with more sophisticated coding so I really don't know how to solve it in the command line. I've already tried some Excel ways but it was painful and unsuccessful.
With tac and any awk:
tac file | awk 'NF==2{sum+=$2; print; next} {print $1 "\t" sum; sum=0}' | tac
With two improvements proposed by kvantour and Ed Morton. See the comments.
tac file | awk '($NF+0==$NF){sum+=$2; print; next} {print $1 "\t" sum+0; sum=0}' | tac
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
Could you please try following, written and tested with shown samples in GNU awk.
awk '
FNR==NR{
if($0!~/line/){ a[$0]; prev=$0 }
else { a[prev]+=$NF }
next
}
!/line/{
$0=$0 OFS (a[$0]?a[$0]:0)
}
1' Input_file Input_file
OR in case you want nicely aligned output, pipe the above command into column -t as follows:
awk '
FNR==NR{
if($0!~/line/){ a[$0]; prev=$0 }
else { a[prev]+=$NF }
next
}
!/line/{
$0=$0 OFS (a[$0]?a[$0]:0)
}
1' Input_file Input_file | column -t
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking FNR==NR which will be TRUE when Input_file is being read first time.
if($0!~/line/){ a[$0]; prev=$0 } ##Checking condition if line does NOT contain string line, then creating index of current line in a and setting prev value to current line.
else { a[prev]+=$NF } ##Else (line contains string line) adding last field value to array a at index prev.
next ##next will skip all further statements from here.
}
!/line/{ ##Checking if current line does not have line keyword in it then do following.
$0=$0 OFS (a[$0]?a[$0]:0) ##Re-creating current line with its current value, then OFS (which is space by default), then value of a[$0] if it is non-zero, else 0.
}
1 ##Printing current line here.
' Input_file Input_file ##Mentioning Input_file names here.
In plain awk:
awk '{
if (NF == 1) {
if (blockname)
printf("%s\t%.2f\n%s", blockname, sum, lines)
blockname = $0
sum = 0
lines=""
} else if (NF == 2) {
sum += $2
lines = lines $0 "\n"
}
next
}
END { printf("%s\t%.2f\n%s", blockname, sum, lines) }
' input.txt
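For the shorter "Or just" output from the question (one sum per block and nothing else), a single pass is enough. A minimal sketch, assuming as the question states that description lines have one field and value lines have two; the function name block_sums is made up for illustration.

```shell
# block_sums (hypothetical helper): prints one total per description line,
# including 0 for descriptions that have no value lines.
block_sums() {
  awk 'NF == 1 { if (seen) print sum + 0; seen = 1; sum = 0; next }  # new block: flush the previous one
       NF == 2 { sum += $2 }                                          # value line: accumulate
       END     { if (seen) print sum + 0 }'                           # flush the last block
}

printf 'Transport\nline_1000\t100.1\nline_1001\t100.2\nCell_signal\nTranslation\nline_1002\t500.1\nline_1003\t200.2\n' | block_sums
# prints:
# 200.3
# 0
# 700.3
```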

Display range of rows from first column with awk

I have a table as shown below, here I just want to print all the Emp_Name from the first column starting from the second row.
Emp_Name Position Experience
Cara Senior 12
Doc Junior 6
Quinn Lead 14
Cedric Manager 18
Collen Junior 8
I know that awk '{print $1}' will print the first column from the table but how to skip first row or field i.e. Emp_Name and print all the names from the second row to the last field? Here last field or row number could be any number (not known).
Any help would be appreciated.
Not fully clear though, in case you want to skip only first row then try following.
awk 'FNR>1' Input_file
OR to print 1st column use:
awk 'FNR>1{print $1}' Input_file
In case you do not know in which field Emp_Name will appear, and you want to find its column number from the 1st row AND DO NOT want to print that column for the rest of the rows, then try the following.
awk '
BEGIN{
OFS="\t"
}
FNR==1{
for(i=1;i<=NF;i++){
if($i=="Emp_Name"){
val=i
next
}
}
}
{
for(i=1;i<=NF;i++){
if(i==val){
continue
}
else{
value=(value?value OFS:"")$i
}
}
print value
value=""
}
' Input_file
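The opposite case, printing only the named column from the second row on wherever that column happens to sit, could be sketched as follows. The function name column_by_name and the variable names are made up for illustration; whitespace-separated input with a header row is assumed.

```shell
# column_by_name NAME (hypothetical helper): reads a table on stdin, finds the
# column whose header equals NAME and prints that column for all later rows.
column_by_name() {
  awk -v name="$1" '
    FNR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }  # locate the column, skip the header
    col      { print $col }                                                # print it only if it was found
  '
}

printf '%s\n' 'Emp_Name Position Experience' 'Cara Senior 12' 'Doc Junior 6' | column_by_name Emp_Name
# prints:
# Cara
# Doc
```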

Select current and previous line if values are the same in 2 columns

Check the values in columns 2 and 3: if they are the same in the previous line and the current line (for example lines 2-3 and 6-7), print both lines joined on one line, with all fields separated by commas.
Input file
1 1 2 35 1
2 3 4 50 1
2 3 4 75 1
4 7 7 85 1
5 8 6 100 1
8 6 9 125 1
4 6 9 200 1
5 3 2 156 2
Desired output
2,3,4,50,1,2,3,4,75,1
8,6,9,125,1,4,6,9,200,1
I tried to modify this code, but got no results:
awk '{$6=$2 $3 - $p2 $p3} $6==0{print p0; print} {p0=$0;p2=p2;p3=$3}'
Thanks in advance.
$ awk -v OFS=',' '{$1=$1; cK=$2 FS $3} pK==cK{print p0, $0} {pK=cK; p0=$0}' file
2,3,4,50,1,2,3,4,75,1
8,6,9,125,1,4,6,9,200,1
With your own code and its mechanism updated:
awk '(($2=$2) $3) - (p2 p3)==0{printf "%s", p0; print} {p0=$0;p2=$2;p3=$3}' OFS="," file
2,3,4,50,12,3,4,75,1
8,6,9,125,14,6,9,200,1
But it has underlying problem, so better use this simplified/improved way:
awk '($2=$2) FS $3==cp{print p0,$0} {p0=$0; cp=$2 FS $3}' OFS=, file
The FS is needed, check the comments under Mr. Morton's answer.
Why your code fails:
Minus (-) has higher priority than concatenation (what the space does), so $2 $3 - $p2 $p3 is evaluated as $2 ($3 - $p2) $p3, not as ($2 $3) - ($p2 $p3).
You used $6 to save the value you want to compare, and then it becomes a part of $0 the line.(last column). -- You can change it to a temporary variable name.
You have a typo (p2=p2), and you used $p2 and $p3, which means to get p2's value and find the corresponding column. So if p2==3 then $p2 equals $3.
You didn't set OFS, so even if your code works, the output will be separated by spaces.
print will add a trailing newline\n, so even if above problems don't exist, you will get 4 lines instead of the 2 lines output you wanted.
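The precedence point can be checked in isolation. In awk's grammar, binary minus binds more tightly than concatenation, so an unparenthesised expression groups differently from what the original attempt intended. A small sketch with made-up variable names a, b, p and q:

```shell
# precedence_demo (hypothetical helper): contrasts the unparenthesised and
# the parenthesised form of "concatenate, then subtract".
precedence_demo() {
  awk 'BEGIN {
    a = 5; b = 7; p = 5; q = 7
    print a b - p q       # grouped as a (b - p) q, i.e. "5" "2" "7"
    print (a b) - (p q)   # 57 - 57
  }'
}

precedence_demo
# prints:
# 527
# 0
```

This is why the corrected commands above wrap each concatenation in parentheses or compare strings directly instead of subtracting.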
Could you please try following too.
awk 'prev_2nd==$2 && prev_3rd==$3{$1=$1;print prev_line,$0} {prev_2nd=$2;prev_3rd=$3;$1=$1;prev_line=$0}' OFS=, Input_file
Explanation: Adding explanation for above code now.
awk '
prev_2nd==$2 && prev_3rd==$3{ ##Checking if previous lines variable prev_2nd and prev_3rd are having same value as current line 2nd and 3rd field or not, if yes then do following.
$1=$1 ##Resetting $1 value of current line to $1 only why because OP needs output field separator as comma and to apply this we need to reset it to its own value.
print prev_line,$0 ##Printing value of previous line and current line here.
} ##Closing this condition block here.
{
prev_2nd=$2 ##Setting current line $2 to prev_2nd variable here.
prev_3rd=$3 ##Setting current line $3 to prev_3rd variable here.
$1=$1 ##Resetting value of $1 to $1 so that the comma OFS is applied to the line.
prev_line=$0 ##Now setting prev_line value to the current line, rebuilt with comma as separator.
}
' OFS=, Input_file ##Setting OFS(output field separator) value as comma here and mentioning Input_file name here.