searching multiple patterns in awk

I have a text file with thousands of lines like these:
:ABC:xyz:1234:200:some text:xxx:yyyy:11818:AAA:BBB
:ABC:xyz:6789:200:some text:xxx:yyyy:203450:AAA:BBB
:EFG:xyz:11818:200:some text:xxx:yyyy:154678:AAA:BBB
:HIJ:xyz:203450:200:some text:xxx:yyyy:154678:AAA:BBB
:KLM:xyz:7777:200:some text:xxx:yyyy:11818:AAA:BBB
.....
....
:DEL:xyz:1234:200:some text:xxx:yyyy:203450:AAA:BBB
I need to find every row whose 9th-column value occurs more than once, i.e. the output should be:
:ABC:xyz:1234:200:some text:xxx:yyyy:11818:AAA:BBB
:KLM:xyz:7777:200:some text:xxx:yyyy:11818:AAA:BBB
:ABC:xyz:6789:200:some text:xxx:yyyy:203450:AAA:BBB
:DEL:xyz:1234:200:some text:xxx:yyyy:203450:AAA:BBB
I tried:
awk -F ":" '$9 > 2 {split($0,a,":"); print $0}'
but this prints all the records: $9 > 2 just compares the 9th field numerically with 2, which is true for nearly every line. It never counts how many times a value occurs, and the split does nothing useful here.

awk -F':' 'NR==FNR{cnt[$9]++;next} cnt[$9]>1' file file
or if you don't want to parse the file twice:
awk -F':' 'cnt[$9]++{printf "%s", prev[$9]; delete prev[$9]; print; next} {prev[$9]=$0 ORS}' file
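As a quick check, here is the two-pass one-liner run against a three-line sample (the file name `dup9.txt` is just for this sketch):

```shell
# Create a small sample; because of the leading ':' (empty $1), the ID is $9
cat > dup9.txt <<'EOF'
:ABC:xyz:1234:200:some text:xxx:yyyy:11818:AAA:BBB
:EFG:xyz:11818:200:some text:xxx:yyyy:154678:AAA:BBB
:KLM:xyz:7777:200:some text:xxx:yyyy:11818:AAA:BBB
EOF

# Pass 1 counts each $9 value; pass 2 prints lines whose count exceeds 1
awk -F':' 'NR==FNR{cnt[$9]++;next} cnt[$9]>1' dup9.txt dup9.txt
```

This prints the :ABC: and :KLM: lines, which share $9 = 11818.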

This should do it in pure awk:
awk -F":" '{if( s[$9] ){ print } else if( f[$9] ){ print f[$9]; s[$9]=1; print }; f[$9]=$0 }'
Explanation:
The "f" array holds, for each 9th-column value, the first line on which it occurred.
The "s" array marks 9th-column values that have already been seen twice or more.
On the second occurrence of a value, print the saved first occurrence, mark the value in "s", and print the current line.
On later occurrences, s[$9] is already set, so just print the current line.

Here is another awk, which buffers the whole file in memory and filters in the END block:
awk -F: '{++a[$9];b[NR]=$0} END {for (i=1;i<=NR;i++) {split(b[i],c,":");if (a[c[9]]>1) print b[i]}}' file

Related

Use awk to remove lines based on a column from another file

I have the following code that works to extract lines from the multiple-column file_1 that have a value in the first column that appears in the single-column file_2:
awk 'NR==FNR{a[$1][$0];next} $0 in a {for (i in a[$0]) print i}' file_1 file_2
I got this code from the answer to this question: AWK to filter a file based upon columns of another file
I want to change to code to do the opposite, namely to remove every line from file_1 where the first column matches any value that appears in the single-column file_2. How do I do this?
awk 'NR==FNR { arr[$0]="1";next } arr[$1]!="1" { print $0 }' file2 file1
Process the second file first (NR==FNR) and create an array called arr, with the whole line ($0) as the key and "1" as the value. Then, when processing the next file (file1), check whether the first space-delimited field ($1) exists as a key in arr and, if it doesn't, print the line.
Direct the output to a file if you want to store the results:
awk 'NR==FNR { arr[$0]="1";next } arr[$1]!="1" { print $0 }' file2 file1 > file3
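A common idiomatic variant of the same idea uses awk's `in` operator instead of storing and testing a dummy "1" value. Sketched below on made-up demo files:

```shell
# file2.demo lists the keys to remove; file1.demo is the data (names are illustrative)
printf 'b\n' > file2.demo
printf 'a 1\nb 2\nc 3\n' > file1.demo

# Load the keys on the first pass, then keep only lines whose $1 is not a key
awk 'NR==FNR { arr[$0]; next } !($1 in arr)' file2.demo file1.demo
```

`arr[$0]` creates the key with an empty value, and `!($1 in arr)` tests membership without accidentally creating new entries.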

printing multiple NR from one file based on the value from other file using awk

I want to print out multiple rows from one file based on the input values from the other.
Following is the representation of file 1:
2
4
1
Following is the representation of file 2:
MANCHKLGO
kflgklfdg
fhgjpiqog
fkfjdkfdg
fghjshdjs
jgfkgjfdk
ghftrysba
gfkgfdkgj
jfkjfdkgj
Based on the first column of the first file, the code should first print the second row of the second file followed by fourth row and then the first row of the second file. Hence, the output should be following:
kflgklfdg
fkfjdkfdg
MANCHKLGO
Following are the codes that I tried:
awk 'NR==FNR{a[$1];next}FNR in a{print $0}' file1.txt file2.txt
However, the output comes out in file order (first row, then second, then fourth) rather than in the order given in file 1. How can I print the rows of the second file in exactly the order listed in the first file?
Try:
$ awk 'NR==FNR{a[NR]=$0;next} {print a[$1]}' file2 file1
kflgklfdg
fkfjdkfdg
MANCHKLGO
How it works
NR==FNR{a[NR]=$0;next}
This saves the contents of file2 in array a.
print a[$1]
For each number in file1, we print the desired line of file2.
Solution to an earlier version of the question:
$ awk 'NR==FNR{a[NR]=$0;next} {print a[2*$1];print a[2*$1+1]}' file2 file1
fkfjdkfdg
fghjshdjs
gfkgfdkgj
jfkjfdkgj
kflgklfdg
fhgjpiqog
Another take:
awk '
NR==FNR {a[$1]; order[n++] = $1; next}
FNR in a {lines[FNR] = $0}
END {for (i=0; i<n; i++) print lines[order[i]]}
' file1.txt file2.txt
This version stores only the requested lines of file2 in memory, which helps if the files are huge.
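To see the order-preserving behaviour, here is that version run on small stand-ins for the two files (the demo file names are invented):

```shell
printf '2\n4\n1\n' > nums.demo                              # line numbers wanted, in order
printf 'MANCHKLGO\nkflgklfdg\nfhgjpiqog\nfkfjdkfdg\n' > lines.demo

# Remember the requested order, store only the matching lines, replay in order
awk '
NR==FNR {a[$1]; order[n++] = $1; next}
FNR in a {lines[FNR] = $0}
END {for (i=0; i<n; i++) print lines[order[i]]}
' nums.demo lines.demo
```

This prints lines 2, 4, and 1 of lines.demo, in that order.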

(g)awk next file on partially blank line

The Problem
I just need to combine a whole bunch of files and strip out the header (line 1) from the 1st file.
The Data
Here are the last three lines (with line 1: header) from three of these files:
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"","","","","9.76",""
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"","","","","11.59",""
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
"","","","","9.15",""
Problem (Continued)
As you can see, the last line has a number (it's a column total) in column 5. Of course, I don't want that last line. But it's (obviously) on a different line number in each file.
(G)awk is clearly the solution, but I don't know (g)awk.
What I've Tried
I've tried a number of combinations of things, but I guess the one that I'm most surprised does not work is:
gawk '
{ if (!$1 ) nextfile }
NR == 1 {$0 = "Filename" "StartDate" OFS $0; print}
FNR > 1 {$0 = FILENAME StartDate OFS $0; print}
' OFS=',' */*.csv > ../path/file.csv
Expected Output (by request)
"START_DATE","END_DATE","UNITS","COST","COST_CURRENCY","AMOUNT"
"20170101","20170131","1","5.49","EUR","5.49"
"20170101","20170131","1","4.27","EUR","4.27"
"20170201","20170228","1","5.49","EUR","5.49"
"20170201","20170228","1","4.88","EUR","4.88"
"20170201","20170228","1","0.61","EUR","0.61"
"20170201","20170228","1","0.61","EUR","0.61"
"20170301","20170331","1","4.88","EUR","4.88"
"20170301","20170331","1","4.27","EUR","4.27"
And, of course, I've tried searching both Google and SO. Most of the answers I see require much more awk knowledge than I have, just to understand them. (I'm not a data wrangler, but I have a data wrangling task.)
Thanks for any help!
this should do...
awk 'NR==1; FNR==1{next} FNR>2{print p} {p=$0}' file{1..3}
Print the first header, skip the other headers, and drop each file's trailing total line: every data line is printed one record late via p, so the last line of each file is left sitting in p and is never printed (it is overwritten when the next file starts, or discarded at end of input).
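A sketch of that delayed-print trick on two tiny files (names invented), each ending in a total row that should be dropped:

```shell
cat > jan.demo.csv <<'EOF'
"START","AMOUNT"
"20170101","5.49"
"20170102","4.27"
"","9.76"
EOF
cat > feb.demo.csv <<'EOF'
"START","AMOUNT"
"20170201","4.88"
"","4.88"
EOF

# Keep the very first header, skip later headers, and print each line one record
# late, so the final (total) line of every file is never flushed
awk 'NR==1; FNR==1{next} FNR>2{print p} {p=$0}' jan.demo.csv feb.demo.csv
```

The result is one header followed by the three data rows; both "" total rows disappear.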
Another awk approach:
awk -F, '
NR == 1 {
header = $0
print
next
}
FNR > 1 && $1 != "\"\""
' *.csv
Something like the following should do the trick:
awk -F"," 'NR==1{header=$0; print $0} $0!=header && $1!="\"\""{print $0}' */*.csv > ../path/file.csv
Here awk will:
Split the records by comma -F","
If this is the first record awk encounters, it sets variable header to the entire contents of the line and then prints the header NR==1{header=$0; print $0}
If the contents of the current line are not a header and the first field isn't an empty quoted string "" (which indicates a "total" line), then print the line: $0!=header && $1!="\"\""{print $0}
As mentioned in my comment below, if the first field of your records always begin with an 8 digit date, then you could simplify (this is less generic than the code above):
awk -F"," 'NR == 1 || $1 ~ /"[0-9]{8}"/ {print $0}' *.csv > outfile.csv
Essentially that says if this is the first record to process then print it (it's a header) OR || if the first field is an 8 digit number surrounded by double quotes then print it.

awk ternary operator, count fields with ,

How to make this command line:
awk -F "," '{NF>0?$NF:$0}'
to print the last field of a line if NF>0, otherwise print the whole line?
Working data
bogota
dept math, bogota
awk -F, '{ print ( NF ? $NF : $0 ) }' file
Your version evaluates the ternary but discards the result; the expression has to be passed to print.
Actually, you don't need the ternary operator for this at all; just use:
awk -F, '{print $NF}' file
This prints the last field in every case: if the line has more than one field, that is the field after the last comma; if the line has only one field, the last field is the whole line.
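Run on the working data (demo file name assumed), both forms print just the last comma-separated field. One caveat worth noting: with `-F,` a space after the comma would become part of that field, so the sample here omits it:

```shell
printf 'bogota\ndept math,bogota\n' > fields.demo

# NF is at least 1 for any non-empty line, so the ternary and the plain
# print $NF behave identically on this data
awk -F, '{ print ( NF ? $NF : $0 ) }' fields.demo
awk -F, '{ print $NF }' fields.demo
```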

Adding columns with awk. What is wrong with this awk command?

I want to add two columns to a file with ~10,000 columns. I want to insert as the first column the nr 22 on each row. Then I want the original first column as the second column, then as the third column I want to insert the line nr (NR), and after that I want the rest of the original columns to be printed. I thought I could do that with the following awk line:
awk '{print 22, $1, NR; for(i=2;i<=NF;++i) print $i}' file
It prints the first three columns (22, $1, NR) well, but after that, there is a new line started for each value, so the file is printed like this:
22 $1 NR
$2
$3
$4
etc...
instead of:
22 $1 NR $2 $3 $4 etc...
What did I do wrong?
How about using printf instead, since print adds a newline after each call. Note two fixes to the loop: it must print the field $i (not the loop index i), and %s is safer than %d when the fields aren't guaranteed to be numeric:
awk '{printf "%s %s %s", 22, $1, NR; for(i=2;i<=NF;++i) printf " %s", $i; printf "\n"}' file
Or you can play with the ORS and OFS, the Output Record Separator and the Output Field Separator. Normally you add those in a BEGIN statement like this:
awk 'BEGIN { ORS = " " } {print 22, $1, NR; for(i=2;i<=NF;++i) print $i; printf "\n"}' file
Note that the extra printf "\n" is needed at the end of each record (printf, so the space ORS isn't appended after it), else everything ends up on one line...
Read more in gawk manual output separators
For more precise control over the output format than what is provided by print (which appends a newline by default), use printf.
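An alternative sketch (not from the answers above) that avoids the loop entirely: assigning to $1 makes awk rebuild $0, so you can embed the extra columns in the new first field and print the whole record:

```shell
printf 'x1 x2 x3\ny1 y2 y3\n' > cols.demo

# Replace $1 with "22 <old $1> <line number>"; the bare 1 pattern prints $0,
# which awk has rebuilt with the new first field and default OFS
awk '{$1 = 22 OFS $1 OFS NR} 1' cols.demo
```

This yields "22 x1 1 x2 x3" and "22 y1 2 y2 y3" for the demo input.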