Printing multiple rows (NR) from one file based on values from another file using awk

I want to print out multiple rows from one file based on the input values from the other.
Following is the representation of file 1:
2
4
1
Following is the representation of file 2:
MANCHKLGO
kflgklfdg
fhgjpiqog
fkfjdkfdg
fghjshdjs
jgfkgjfdk
ghftrysba
gfkgfdkgj
jfkjfdkgj
Based on the first column of the first file, the code should first print the second row of the second file followed by fourth row and then the first row of the second file. Hence, the output should be following:
kflgklfdg
fkfjdkfdg
MANCHKLGO
This is the code I tried:
awk 'NR==FNR{a[$1];next}FNR in a{print $0}' file1.txt file2.txt
However, as expected, the output is not in the required order: it prints the first row, then the second, and the fourth row last. How can I print the rows of the second file in exactly the order given in the first file?

Try:
$ awk 'NR==FNR{a[NR]=$0;next} {print a[$1]}' file2 file1
kflgklfdg
fkfjdkfdg
MANCHKLGO
How it works
NR==FNR{a[NR]=$0;next}
This saves the contents of file2 in array a.
print a[$1]
For each number in file1, we print the desired line of file2.
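If file1 may list a line number that file2 does not contain, print a[$1] emits an empty line for it. A minimal sketch of one way to guard against that, assuming such requests should be skipped and reported on stderr (this policy and the shortened sample data are assumptions, not part of the answer above):

```shell
tmp=$(mktemp -d)
printf '%s\n' 2 7 1 > "$tmp/file1"                          # 7 is out of range on purpose
printf '%s\n' MANCHKLGO kflgklfdg fhgjpiqog > "$tmp/file2"

result=$(awk '
    NR==FNR { a[NR] = $0; next }              # first file (file2): index lines by number
    $1 in a { print a[$1]; next }             # second file (file1): print the requested line
    { print "no line " $1 > "/dev/stderr" }   # assumed policy: report and skip
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this sample the sketch prints kflgklfdg and MANCHKLGO, and reports the missing line 7 on stderr.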
Solution to earlier version of question
$ awk 'NR==FNR{a[NR]=$0;next} {print a[2*$1];print a[2*$1+1]}' file2 file1
fkfjdkfdg
fghjshdjs
gfkgfdkgj
jfkjfdkgj
kflgklfdg
fhgjpiqog

Another take:
awk '
NR==FNR {a[$1]; order[n++] = $1; next}
FNR in a {lines[FNR] = $0}
END {for (i=0; i<n; i++) print lines[order[i]]}
' file1.txt file2.txt
This version keeps only the requested lines of file2 in memory rather than the whole file, which helps if your files are huge.
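For a huge file2 it may also help to stop reading once the highest requested line has been seen. A sketch of that idea, assuming file1.txt is non-empty and holds positive line numbers; the names want and max are illustrative additions:

```shell
tmp=$(mktemp -d)
printf '%s\n' 2 4 1 > "$tmp/file1.txt"
printf '%s\n' MANCHKLGO kflgklfdg fhgjpiqog fkfjdkfdg fghjshdjs jgfkgjfdk > "$tmp/file2.txt"

result=$(awk '
    # first file: remember wanted line numbers, their order, and the largest one
    NR==FNR { want[$1]; order[n++] = $1; if ($1+0 > max) max = $1+0; next }
    FNR in want { lines[FNR] = $0 }           # second file: keep only wanted lines
    FNR >= max  { exit }                      # nothing further is needed; exit still runs END
    END { for (i = 0; i < n; i++) print lines[order[i]] }
' "$tmp/file1.txt" "$tmp/file2.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Here awk stops after line 4 of file2.txt and prints kflgklfdg, fkfjdkfdg, MANCHKLGO.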

Related

Use awk to remove lines based on a column from another file

I have the following code that works to extract lines from the multiple-column file_1 that have a value in the first column that appears in the single-column file_2:
awk 'NR==FNR{a[$1][$0];next} $0 in a {for (i in a[$0]) print i}' file_1 file_2
I got this code from the answer to this question: AWK to filter a file based upon columns of another file
I want to change the code to do the opposite, namely to remove every line from file_1 where the first column matches any value that appears in the single-column file_2. How do I do this?
awk 'NR==FNR { arr[$0]="1";next } arr[$1]!="1" { print $0 }' file2 file1
Process the second file first (NR==FNR) and create an array called arr, with the line ($0) as the key and 1 the value. Then when processing the next file (file1), check if the first space delimited field ($1) exists as a key in the array arr and if it doesn't, print the line.
Direct the output to a file if you want to store the results:
awk 'NR==FNR { arr[$0]="1";next } arr[$1]!="1" { print $0 }' file2 file1 > file3
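The same filter can be sketched with the in operator, which tests membership without creating an array entry for every key it looks up (arr[$1]!="1" adds an empty entry each time). The sample data below is hypothetical:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'a 1 x' 'b 2 y' 'c 3 z' > "$tmp/file1"   # hypothetical multi-column data
printf '%s\n' b > "$tmp/file2"                         # keys whose lines should be removed

result=$(awk '
    NR==FNR { skip[$0]; next }    # first file (file2): remember each key
    !($1 in skip)                 # second file (file1): print lines whose key is not listed
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this input the lines starting with a and c are kept and the b line is dropped.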

Awk Remove lines if one column matches another column, and keep line if max value from another column

I have a file of ~8,000 lines. I am trying to remove the lines where the 5th column matches (in this case ga2016mldlzd), but keep only the line with the max value in the 6th column. For example, if given this:
-25.559,129.8529,6674.560547,2.0,ga2016mldlzd,6
-25.5596,129.8565,6902.750651,2.0,ga2016mldlzd,7
-25.5450,129.830,969.8079427,2.0,ga2016mldlzd,8
-25.5450,129.834,57.04752604,2.0,ga2016mldlzd,9
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
remove all lines except the final line with 10 as the max value, to get this. I'm stumped as to how this could be done either in awk or sed?
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
I tried this:
awk -F, '!a[$5]++'
but I want to keep the last line, i.e. the line with '10', rather than the line with '6'. Thanks
Keep track of the max and line associated with that max and print at the end:
awk -F, '
{
    if ($6>max[$5]) {
        max[$5]=$6
        tl[$5]=$0
    }
}
END{
    for (l in tl) print tl[l]
}' file
Prints:
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
The order of the file will be lost; i.e., the groups may be reordered compared to the original file.
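One caveat: the comparison $6>max[$5] treats a group's unset maximum as 0, so a group whose values are all zero or negative is never stored. A sketch that seeds the maximum from the first line seen for each key; the sample data with negative values is made up for illustration:

```shell
tmp=$(mktemp -d)
printf '%s\n' '1,2,3,4,k1,-5' '1,2,3,4,k1,-2' '1,2,3,4,k1,-9' > "$tmp/file"

result=$(awk -F, '
    # store the line if the key is new or the value beats the current maximum
    !($5 in max) || $6+0 > max[$5] { max[$5] = $6+0; tl[$5] = $0 }
    END { for (k in tl) print tl[k] }
' "$tmp/file")

printf '%s\n' "$result"
rm -rf "$tmp"
```

For this input the sketch prints 1,2,3,4,k1,-2.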
If you are dealing with a file with many different keys for $5 and not all of them could fit in memory, you could sort into blocks grouped by the fifth field and then by the numeric value of the sixth. Then have awk print the last line every time the fifth field changes. Since it is sorted, that will be the max:
sort -t , -k 5,5 -k 6n file |
awk -F, '
FNR==1{lf=$5;ll=$0}
lf!=$5{print ll}
{ll=$0; lf=$5}
END{print $0}'
# same print out
The second approach will be much slower, but it uses far less memory when there are many unique $5 values.
If you want to maintain original order of lines then use this awk:
awk -F, 'NR==FNR {if ($6 > max[$5]) max[$5] = $6; next} $5 in max && max[$5] == $6' file file
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
If you want to filter for ga2016mldlzd while maintaining original order of lines then use this awk:
awk -F, '
NR==FNR {
    if ($5 == "ga2016mldlzd" && $6 > max[$5]) {
        max[$5] = $6
        n = FNR
    }
    next
}
FNR == n' file file
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10

Awk command has unexpected results when comparing two files

I am using an awk command to compare the first column in two files.
I want to take col1 of file1 and if there is a match in col1 of file2, update the "date updated" in the last column. If there is no match, I want to append the entire line of file1 to file2 and append a "date updated" value to that line as well. Here is the command I'm currently using:
awk 'FNR == NR { f1[$1] = $0; next }
$1 in f1 { print; delete f1[$1] }
END { for (user in f1) print f1[user] }' file1 file2
File1:
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,100.0,plasma-de+,serv01,datetimeNEW
mschrode,161.1,plasma-de+,serv01,datetimeNEW
File2:
jbomba,114.0,plasma-de+,serv01,datetimeOLD
mschrode,104.0,plasma-de+,serv01,datetimeOLD
deleteme,192.0,random,serv01,datetimeOLD #<---- Needs to be removed: WORKS!
Expected Output:(order does not matter)
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,100.0,plasma-de+,serv01,datetimeOLD #<---- NEED THIS VALUE
mschrode,161.1,plasma-de+,serv01,datetimeOLD #<---- NEED THIS VALUE
Current Output:(order does not matter)
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,100.0,plasma-de+,serv01,datetimeNEW #<----WRONG OUTPUT
mschrode,161.1,plasma-de+,serv01,datetimeNEW #<----WRONG OUTPUT
The Logic Broken Down:
If $usr/col1 in file2 does NOT exist in file1
remove entire line from file2
(ex: line3 in file2, user: deleteme)
If $usr/col1 in file1 does NOT exist in file2
append entire line to file2
(ex: lines 1-5 in file1)
So the issue is, when there IS a match between the two files, I need to keep the information from file2, not the information from file1. In the output examples above you'll see I need to keep the datetimeOLD from file2 along with the new information from file1.
Set field separator to comma, and read file2 first:
$ awk -F',' 'FNR==NR{a[$1]=$0;next} $1 in a{print a[$1];next} 1' file2 file1
tnash,172.2,plasma-de+,serv01,datetimeNEW
jhwagner,169.4,plasma-de+,serv01,datetimeNEW
steadmah,161.1,plasma-de+,serv01,datetimeNEW
guillwt,158.3,plasma-de+,serv01,datetimeNEW
mwinebra,122.2,plasma-de+,serv01,datetimeNEW
jbomba,114.0,plasma-de+,serv01,datetimeOLD
mschrode,104.0,plasma-de+,serv01,datetimeOLD
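Note that the command above prints the whole file2 line for matched users (114.0 and 104.0), whereas the question's expected output shows file1's values combined with file2's date (100.0 with datetimeOLD). If that is the intent, a sketch along these lines may be closer, assuming the date is always the last field; the sample is a shortened version of the question's data:

```shell
tmp=$(mktemp -d)
printf '%s\n' 'tnash,172.2,plasma-de+,serv01,datetimeNEW' \
              'jbomba,100.0,plasma-de+,serv01,datetimeNEW' > "$tmp/file1"
printf '%s\n' 'jbomba,114.0,plasma-de+,serv01,datetimeOLD' \
              'deleteme,192.0,random,serv01,datetimeOLD' > "$tmp/file2"

result=$(awk -F, -v OFS=, '
    FNR==NR   { old[$1] = $NF; next }   # file2: remember the stored date per user
    $1 in old { $NF = old[$1] }         # file1: matched user, restore the old date
    1                                   # print every file1 line
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Users that exist only in file2 (deleteme) are dropped, new users keep datetimeNEW, and jbomba comes out as jbomba,100.0,plasma-de+,serv01,datetimeOLD.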

(g)awk next file on partially blank line

The Problem
I just need to combine a whole bunch of files and strip out the header (line 1) from the 1st file.
The Data
Here are the last three lines (with line 1: header) from three of these files:
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"","","","","9.76",""
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"","","","","11.59",""
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
"","","","","9.15",""
Problem (Continued)
As you can see, the last line has a number (it's a column total) in column 5. Of course, I don't want that last line. But it's (obviously) on a different line number in each file.
(G)awk is clearly the solution, but I don't know (g)awk.
What I've Tried
I've tried a number of combinations of things, but I guess the one that I'm most surprised does not work is:
gawk '
{ if (!$1 ) nextfile }
NR == 1 {$0 = "Filename" "StartDate" OFS $0; print}
FNR > 1 {$0 = FILENAME StartDate OFS $0; print}
' OFS=',' */*.csv > ../path/file.csv
Expected Output (by request)
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
And, of course, I've tried searching both Google and SO. Most of the answers I see require much more awk knowledge than I have, just to understand them. (I'm not a data wrangler, but I have a data wrangling task.)
Thanks for any help!
This should do it:
awk 'NR==1; FNR==1{next} FNR>2{print p} {p=$0}' file{1..3}
This prints the first header, then skips the other headers and each file's last line.
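The one-liner works by printing one line behind, so the last line of each file is buffered but never printed. An expanded, commented sketch of the same technique, assuming every file has a header, at least one data row, and a trailing total line; the two small CSV files are invented for the demonstration:

```shell
tmp=$(mktemp -d)
printf '%s\n' '"H1","H2"' '"a","1"' '"b","2"' '"","3"' > "$tmp/f1.csv"
printf '%s\n' '"H1","H2"' '"c","4"' '"","4"'          > "$tmp/f2.csv"

result=$(awk '
    NR == 1  { print }        # very first line overall: the header, printed once
    FNR == 1 { next }         # first line of every file: a header, never buffered
    FNR > 2  { print prev }   # print the previous data line, one step behind
             { prev = $0 }    # buffer the current line; the last line of a file is never printed
' "$tmp/f1.csv" "$tmp/f2.csv")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The result is the header followed by the a, b and c rows, with both total lines dropped.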
Another awk approach:-
awk -F, '
NR == 1 {
header = $0
print
next
}
FNR > 1 && $1 != "\"\""
' *.csv
Something like the following should do the trick:
awk -F"," 'NR==1{header=$0; print $0} $0!=header && $1!="\"\""{print $0}' */*.csv > ../path/file.csv
Here awk will:
Split the records by comma: -F","
If this is the first record awk encounters, set the variable header to the entire contents of the line and then print the header: NR==1{header=$0; print $0}
If the current line is not a header and the first field is not an empty quoted string "" (which indicates a "total" line), print the line: $0!=header && $1!="\"\""{print $0}
As mentioned in my comment below, if the first field of your records always begins with an 8-digit date, then you could simplify (this is less generic than the code above):
awk -F"," 'NR == 1 || $1 ~ /"[0-9]{8}"/ {print $0}' */*.csv > outfile.csv
Essentially that says if this is the first record to process then print it (it's a header) OR || if the first field is an 8 digit number surrounded by double quotes then print it.
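Some awk implementations (older mawk builds, for instance) do not support interval expressions such as {8}. A sketch of an equivalent test that avoids them by checking the field length instead; it reads a single stream from stdin, so NR == 1 is the only header, and the three sample lines are abbreviated from the question's data:

```shell
result=$(printf '%s\n' \
    '"START_DATE","END_DATE"' \
    '"20170101","20170131"' \
    '"","9.76"' |
awk -F, '
    # keep the header, or a first field made of digits in quotes: 8 digits + 2 quotes = 10 chars
    NR == 1 || ($1 ~ /^"[0-9]+"$/ && length($1) == 10)
')

printf '%s\n' "$result"
```

This keeps the header and the dated row and drops the total line.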

How to print two lines of several files to a new file with specific order?

I have a task to do with awk. I am doing sequence analysis for some genes.
I have several files with sequences in order. I would like to extract the first sequence of each file into a new file, then the second sequence of each file, and so on until the last sequence. I only know how to do this for the first line or any specific line with awk.
awk 'FNR == 2 {print; nextfile}' *.txt > newfile
Here I have input like this
File 1
Saureus081.1
ATCGGCCCTTAA
Saureus081.2
ATGCCTTAAGCTATA
Saureus081.3
ATCCTAAAGGTAAGG
File 2
SaureusRF1.1
ATCGGCCCTTAC
SauruesRF1.2
ATGCCTTAAGCTAGG
SaureusRF1.3
ATCCTAAAGGTAAGC
File 3
SaureusN305.1
ATCGGCCCTTACT
SauruesN305.2
ATGCCTTAAGCTAGA
SaureusN305.3
ATCCTAAAGGTAATG
There are 12 similar files in total:
File 4
.
.
.
.File 12
Required Output
Newfile
Saureus081.1
ATCGGCCCTTAA
SaureusRF1.1
ATCGGCCCTTAC
SaureusN305.1
ATCGGCCCTTACT
Saureus081.2
ATGCCTTAAGCTATA
SaureusRF1.2
ATGCCTTAAGCTAGG
SauruesN305.2
ATGCCTTAAGCTAGA
Saureus081.3
ATCCTAAAGGTAAGG
SaureusRF1.3
ATCCTAAAGGTAAGC
SaureusN305.3
ATCCTAAAGGTAATG
I guess this task can be done easily with awk, but I have no idea how to do it for multiple lines.
Based on the modified question, the answer needs some changes.
$ awk -F'.' 'NR%2{k=$2;v=$0;getline;a[k]=a[k]?a[k] RS v RS $0:v RS $0} END{for(i in a)print a[i]}' file1 file2 file3
Saureus081.1
ATCGGCCCTTAA
SaureusRF1.1
ATCGGCCCTTAC
SaureusN305.1
ATCGGCCCTTACT
Saureus081.2
ATGCCTTAAGCTATA
SauruesRF1.2
ATGCCTTAAGCTAGG
SauruesN305.2
ATGCCTTAAGCTAGA
Saureus081.3
ATCCTAAAGGTAAGG
SaureusRF1.3
ATCCTAAAGGTAAGC
SaureusN305.3
ATCCTAAAGGTAATG
Brief explanation:
Set '.' as the delimiter
For every odd record, take k=$2 as the key of array a
Invoke getline to read the next record into $0, which becomes the value corresponding to key k
Print the whole array in the last step
If your data is very large, I would suggest creating temporary files:
awk 'FNR%2==1 { filename = $1 }
{ print $0 >> filename }' file1 ... filen
Afterwards, you can cat them together:
cat Seq1 ... Seqn > result
This has the additional advantage that it will work if not all sequences are present in all files.
paste + awk solution:
paste File1 File2 | awk '{ p=$2;$2="" }NR%2{ k=p; print }!(NR%2){ v=p; print $1 RS k RS v }'
paste File1 File2 - merge corresponding lines of files
p=$2;$2="" - capture the value of the 2nd field which is the respective key/value from File2
The output:
Seq1
ATCGGCCCTTAA
Seq1
ATCGGCCCTTAC
Seq2
ATGCCTTAAGCTATA
Seq2
ATGCCTTAAGCTAGG
Seq3
ATCCTAAAGGTAAGG
Seq3
ATCCTAAAGGTAAGC
Additional approach for multiple files:
paste Files[0-9]* | awk 'NR%2{ k=$1; n=NF; print k }
!(NR%2){ print $1; for(i=2;i<=n;i++) print k RS $i }'