awk to ignore double quotes and compare two files

I have two input files.
FILE 1
123
125
123
129
and file 2
"a"|"123"|"anc"
"b"|"124"|"ind"
"c"|"123"|"su"
"d"|"122"|"aus"
OUTPUT:
"b"|"124"|"ind"
"d"|"122"|"aus"
Now, how can I compare $1 from file1 against $2 from file2 and print the lines of file2 whose value does not appear in file1? I'm having trouble because of the double quotes (").
So how can I do the comparison while ignoring the double quotes?

$ awk 'FNR==NR{a[$1]=1;next} a[$3]==0' file1 FS='["|]+' file2
"b"|"124"|"ind"
"d"|"122"|"aus"
How it works:
file1 FS='["|]+' file2
This list of files tells awk to read file1 first, then change the field separator to any combination of double-quotes and vertical bars and then read file2.
FNR==NR{a[$1]=1;next}
FNR is the number of lines that awk has read from the current file and NR is the total number of lines read. Consequently, FNR==NR is true only while reading the first file. The commands which follow in braces are only executed for the first file.
This creates an associative array a whose keys are the first fields of file1 and whose values are 1. The next command tells awk to skip the rest of the commands and start over on the next line.
a[$3]==0
This is true only if the number in field 3 did not occur in file1. If it is true, then the default action is taken which is to print the line. (With the field separator that we have chosen, the number you are interested in is in field 3.)
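To see why FNR==NR singles out the first file, it can help to print both counters for every line. This is only a sketch: the files f1 and f2 are made up for the illustration and live in a temporary directory.

```shell
# Illustration only: f1 and f2 are hypothetical two-line files.
tmp=$(mktemp -d)
printf '123\n125\n' > "$tmp/f1"
printf 'x\ny\n'     > "$tmp/f2"

# Run inside the temp directory so FILENAME is just f1 or f2,
# then print the file name, FNR and NR for every line read.
counters=$(cd "$tmp" && awk '{print FILENAME, FNR, NR}' f1 f2)
printf '%s\n' "$counters"
rm -rf "$tmp"
```

FNR and NR agree only on the lines from f1. Once f2 starts, FNR restarts at 1 while NR keeps counting, so FNR==NR is false from then on.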
Alternative
$ awk 'FNR==NR{a[$1]=1;next} a[substr($2,2,length($2)-2)]==0' file1 FS='|' file2
"b"|"124"|"ind"
"d"|"122"|"aus"
This is similar to the above except that the field separator is just a vertical bar. In this case, the number that you are interested in is in field 2. We use substr to remove one character from either end of field 2 which has the effect of removing the double-quotes.
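A further variant, offered as a sketch rather than as part of the answer above, strips the quotes with gsub and tests membership with the in operator, which avoids creating array entries during the lookup. The array name seen and the variable key are illustrative choices.

```shell
tmp=$(mktemp -d)
printf '123\n125\n123\n129\n' > "$tmp/file1"
printf '"a"|"123"|"anc"\n"b"|"124"|"ind"\n"c"|"123"|"su"\n"d"|"122"|"aus"\n' > "$tmp/file2"

# Pass 1 (file1): remember every key. Pass 2 (file2): copy field 2,
# strip its double quotes, and print the line if the key was never seen.
diff_lines=$(awk 'FNR==NR {seen[$1]; next}
                  {key=$2; gsub(/"/, "", key)}
                  !(key in seen)' "$tmp/file1" FS='|' "$tmp/file2")
printf '%s\n' "$diff_lines"
rm -rf "$tmp"
```

Because gsub works on a copy of the field, the original line is printed unchanged, quotes included.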

Related

Print filenames & line number with number of fields greater than 'x'

I am running Ubuntu Linux. I need to print the file names and line numbers of lines containing more than 7 columns. There are several hundred thousand files.
I am able to print the number of columns per line using awk. However, the output I am after is something like
file1.csv-463, meaning that file1.csv has more than 7 fields on line 463. I am using the command awk -F"," '{print NF}' * to print the number of fields across all files.
Could I please request help?
If you have GNU awk, try the following. It checks whether NF is greater than 7 and, if so, prints the file name along with the line number. nextfile then makes awk skip straight to the next input file, which saves time because the rest of the current file does not have to be read.
awk -F',' 'NF>7{print FILENAME,FNR;nextfile}' *.csv
The command above prints only the first matching line of each file. To print all matching lines, try:
awk -F',' 'NF>7{print FILENAME,FNR}' *.csv
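To get the file1.csv-463 style asked for in the question, the file name and line number can be joined with a dash instead of the default output separator. A minimal sketch, using two made-up files (wide.csv and narrow.csv) in a temporary directory:

```shell
tmp=$(mktemp -d)
printf 'a,b,c\n1,2,3,4,5,6,7,8\n' > "$tmp/wide.csv"
printf 'a,b\nc,d\n'               > "$tmp/narrow.csv"

# Print "<file>-<line>" for every line with more than 7 comma-separated fields.
wide_lines=$(cd "$tmp" && awk -F',' 'NF>7 {print FILENAME "-" FNR}' *.csv)
printf '%s\n' "$wide_lines"
rm -rf "$tmp"
```

With several hundred thousand files the shell may refuse an argument list as long as *.csv expands to. In that case, something like find . -name '*.csv' -exec awk -F',' 'NF>7 {print FILENAME "-" FNR}' {} + runs awk in batches.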
This might work for you (GNU sed):
sed -Ens 's/\S+/&/8;T;F;=;p' *.csv | paste - - -
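The substitution replaces the eighth run of non-whitespace characters with itself. If there is no eighth column, the substitution fails and T branches to the end of the script. Note that \S+ treats whitespace, not commas, as the column delimiter.
Otherwise, output the file name (F), the line number (=) and the current line (p).
Feed the output into a paste command which prints three lines as one.
N.B. The -s option resets the line numbers for each file; without it, lines are numbered consecutively across the entire input.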

awk: Adding a new column based on concatenated value of two columns

I am trying to add a new column to a text file based on the concatenated values of two columns. The value is being inserted in the middle of the string instead of at the end.
I am using awk. Here is a sample line:
$ head -1 file.txt
8502CC169154|02|GA|TN|89840|9|2008-11-15 00:00:00.000|2009-11-15 00:00:00.000|1|TEAM1|1639009|1000000|0|2008-11-15 00:00:00.000|2009-11-15 00:00:00.000|85|00|37421||241|20|331|1052A|5000|0|.1500|Chattanooga|47065|.000|025|35|25000|0|0|0|0|0|718||E|-17.00|-17.00|-17.00|-17.00|-17.00|-2.55|-2.55|-2.55|-2.55|D|C9N7I4115531902|-2.19|-2.19|-2.19|-2.19|-14.81|051|2008-12-31 00:00:00.000|151|2008-12-17 00:00:00.000|||AC|CC|Y||2008-12-31 00:00:00.000|.000000|A|.000000|.000000|.000000|Y|8502CC169154-8|8502CC169154|8|||122130|122130M|7764298|RA
I tried the following.
$ head -1 file.txt | awk -F'|' '{$(NF+1)=$1"-"$6;}1' OFS='|'
I am expecting a new column at the end of the string. But you can see that the concatenated field is being inserted in the middle of the string instead of the end of the string.
8502CC169154|02|GA|TN|89840|9|2008-11-15 00:00:00.000|2009-11-15 00:00:00.000|1|TEAM1|1639009|1000000|0|2008-11-15 00:00:00.000|2009-11-15 00:00:00.000|85|00|37421||241|20|331|1052A|5000|0|.1500|Chattanooga|47065|.000|025|35|25000|0|0|0|0|0|718||E|-17.00|-17.00|-17.00|-17.00|-17.00|-2.55|-2.55|-2.55|-2.55|D|C9N7I4115531902|-2.19|-2.19|-2.19|-2.19|-14.81|051|2008-12-31 00:00:00.000|151|2008|8502CC169154-9.000|||AC|CC|Y||2008-12-31 00:00:00.000|.000000|A|.000000|.000000|.000000|Y|8502CC169154-8|8502CC169154|8|||122130|122130M|7764298|RA
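Your original code works for me using GNU awk. Assigning to $(NF+1) is standard (POSIX) awk, so the displayed output may instead be caused by a carriage return at the end of the line (Windows line endings), which makes the terminal draw the new field over earlier text. If you suspect your awk rather than the data, you can avoid $(NF+1) by appending to $0 directly: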
head -1 file.txt | awk -F'|' '{$0=$0 FS $1"-"$6;}1' OFS='|'
Awk is a surprisingly powerful language and it has all the capabilities that head has, making the pipeline unnecessary. So, for greater efficiency, try the simple command:
awk -F'|' '{print $0 FS $1"-"$6; exit}' file.txt
How it works:
-F'|'
This sets the field separator to a vertical bar.
print $0 FS $1"-"$6
This prints the output line that you want, which consists of the original line, $0, followed by a field separator, FS, followed by the combination of the first field, a dash, and the sixth field.
exit
After the first line is printed, this tells awk to exit. This eliminates the need for head -1.
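The overwritten text in the displayed output is what a trailing carriage return (Windows line ending) would produce on a terminal. A minimal sketch, assuming that is the cause (the sample line below is made up), strips it before appending the new field:

```shell
# Hypothetical one-line sample with a Windows line ending (\r\n).
line=$(printf 'ID1|02|GA|TN|89840|9|last\r\n' |
       awk -F'|' '{sub(/\r$/, ""); print $0 FS $1 "-" $6; exit}')
printf '%s\n' "$line"
```

Calling sub() without a target works on $0, so the carriage return is removed before the line is printed with the extra field. Running head -1 file.txt | od -c is one way to check whether the file really contains \r characters.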

awk/gawk - remove line if line 2 doesn't exist

I have a .txt file with 2 columns and a separator. Some lines only contain 1 column though, so I want to remove those lines.
example of lines are
Line to keep,
Iamnotyours:email#email.com
Line to remove,
Iamnotyours:
Given your posted sample input all you need is:
grep -v ':$' file
or if you insist on awk for some reason:
awk '!/:$/' file
If that's not all you need then edit your question to clarify your requirements.
awk to the rescue!
$ awk -F: 'NF==2' file
prints only the lines with two fields
$ awk -F: 'NF>1' file
prints the lines with more than one field. In your case, though, the separator is always present, so the field count is two even when nothing follows it. You need to check whether the second field is empty:
$ awk -F: '$2!=""' file
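A small sketch on the two sample lines from the question shows why counting fields is not enough here. The variable names field_counts and kept are illustrative.

```shell
tmp=$(mktemp -d)
printf 'Iamnotyours:email#email.com\nIamnotyours:\n' > "$tmp/file"

# NF is 2 for both lines: the trailing colon still creates an (empty) second field.
field_counts=$(awk -F: '{print NF}' "$tmp/file")
# Only the line whose second field is non-empty survives.
kept=$(awk -F: '$2 != ""' "$tmp/file")
printf '%s\n' "$field_counts" "$kept"
rm -rf "$tmp"
```

Both lines report two fields, so NF==2 and NF>1 keep them both, while the $2 != "" test drops the line that ends in a bare colon.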

Getting numerical sub-string of fields using awk

I was wondering how I can get the numerical sub-string of fields using awk in a text file like the one shown below. I am already familiar with the substr() function. However, since the length of the fields is not fixed, I have no idea how to separate the text from the numerical part.
A.txt
"Asd.1"
"bcdujcd.2"
"mshde.3333"
"deuhdue.777"
P.S. All the numbers are separated from text part with a single dot (.).
You may try something like this:
$ echo "bcdujcd.2" | awk -F'[^0-9]+' '{print $2}'
If you don't care about any non-digit parts of the line and only want to see the digit parts as output you could use:
awk '{gsub(/[^[:digit:]]+/, " ")}7' A.txt
which will generate:
1
2
3333
777
as output (for the record, there is a leading space on each line, plus a trailing one if the line ends with a closing quote).
If there can only be one number field per line, then the replacement above can be "" instead of " " in the gsub and the leading space will go away. The replacement with the space keeps multiple numerical fields separated by a space if they occur on a single line (i.e. "foo.88.bar.11" becomes 88 11 instead of 8811).
If you just need the second (period-delimited) field of each line, then awk -F. '{print $2}' will do that, although a closing double quote, if present in the data, stays attached to the number.
$ awk -F'[".]' '{print $3}' file
1
2
3333
777
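Since the question mentions substr(), another option is to let match() find the digits and then cut them out with substr(). This is a sketch that assumes the first run of digits on each line is the number wanted, as in the sample A.txt:

```shell
tmp=$(mktemp -d)
printf '"Asd.1"\n"bcdujcd.2"\n"mshde.3333"\n"deuhdue.777"\n' > "$tmp/A.txt"

# match() sets RSTART and RLENGTH to the position and length of the
# first run of digits; substr() then extracts exactly that run.
numbers=$(awk 'match($0, /[0-9]+/) {print substr($0, RSTART, RLENGTH)}' "$tmp/A.txt")
printf '%s\n' "$numbers"
rm -rf "$tmp"
```

Lines without any digit are skipped because match() returns 0 for them. If the text part could itself contain digits, splitting on the dot as in the commands above is the safer choice.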

Command to replace specific column of csv file for first 100 rows

The following command replaces the second column with the value e throughout a CSV file,
but what if I want to replace it only in the first 100 rows?
awk -F, '{$2="e";}1' OFS=, file
The rest of the rows of the CSV file should stay intact.
awk -F, 'NR<101{$2="e";}1' OFS=, file
NR is a built-in variable holding the number of records (here, lines) that awk has read so far. With the pattern NR<101, the action runs only for the first 100 lines and sets their second field to e. The trailing 1 is a pattern that is always true and has no action, so every line, modified or not, is printed.
try this:
awk -F, 'NR<=100{$2="e"}1' OFS=, file
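The same idea can be tried on a few lines by passing the cut-off as a variable. In this sketch the five-row input is made up and limit is an illustrative variable name; setting it to 100 gives the behaviour asked for.

```shell
# Hypothetical 5-row CSV; only the first "limit" rows get their second field replaced.
rows=$(printf '1,p,x\n2,q,x\n3,r,x\n4,s,x\n5,t,x\n' |
       awk -F, -v OFS=, -v limit=3 'NR<=limit {$2="e"} 1')
printf '%s\n' "$rows"
```

Rows 1 to 3 come out with e in the second column, while rows 4 and 5 pass through untouched.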