awk: for every record extract specific information

A simplified example of my file looks like this:
# FamilyName_A
Information 1 2 3
Information 4 5 6
# FamilyName_B
Information 7 8 9
# FamilyName_C
Information 10 11 12
Information 13 14 15
Information 16 17 18
The record separator is #. For every record I want to print the record ID (the family name, i.e. the first word after the record separator) and the first two columns of the following lines, for output like this:
FamilyName_A Information 1
FamilyName_A Information 4
FamilyName_B Information 7
FamilyName_C Information 10
FamilyName_C Information 13
FamilyName_C Information 16
I tried doing this by myself:
awk 'BEGIN {RS="#"} {print $1}'
This prints the record ID, but I don't know how to do the rest (loop over every record and print the specific fields).

Use the following script:
$1 == "#" { current=$2; next; }
{ print current, $1, $2; }
Depending on your input data, the expression that catches the record header may change slightly. For the data you provided, $1 == "#", /^#/ and /^# FamilyName/ are all perfectly suitable, but if your input differs a bit you may need to adjust the condition.
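For example, a minimal sketch of the same script with the /^#/ form as the header test:
awk '/^#/ { current = $2; next }    # header line: remember the family name
     { print current, $1, $2 }' input.txt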

On one line:
awk 'BEGIN { family = ""} { if ($1 == "#") family = $2; else print family, $1, $2 }' input.txt
Explanation
BEGIN {
    family = ""
}
{
    if ($1 == "#")
        family = $2
    else
        print family, $1, $2
}
Set family to an empty string.
Check each line: if it starts with #, remember the family name.
Otherwise, print the last remembered family name and the first two fields.
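For completeness, the asker's RS="#" attempt can also be carried through. A minimal sketch, assuming the input looks exactly like the sample (each record introduced by a # at the start of a line):
awk 'BEGIN { RS = "#" }
     NR > 1 {                          # skip the empty record before the first #
         n = split($0, line, "\n")     # break the record back into its lines
         for (i = 2; i <= n; i++) {    # line 1 of the record holds the family name
             split(line[i], f, " ")
             if (f[1] != "") print $1, f[1], f[2]
         }
     }' input.txt
Here $1 is the family name, because with the default field separator the first field of each #-separated record is the word after the #.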


How to extract lines which have no duplicated values in the first column?

For some statistics research, I want to separate out the data that has duplicated values in the first column. I work with vim.
suppose that a part of my data is like this:
Item_ID Customer_ID
123 200
104 134
734 500
123 345
764 347
1000 235
734 546
As you can see, some lines have equal values in the first column. I want to generate two separate files: one containing just the non-repeated values, and the other containing the lines with equal first-column values.
For the above example I want to have these two files:
first one contains:
Item_ID Customer_ID
123 200
734 500
123 345
734 546
and second one contains:
Item_ID Customer_ID
104 134
764 347
1000 235
Can anybody help me?
I think awk would be a better option here.
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] == 1' input.txt input.txt > uniq.txt
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] > 1' input.txt input.txt > dup.txt
Prettier version of awk code:
FNR == NR {
    seen[$1]++
    next
}
seen[$1] == 1
Overview
We loop over the text twice: by supplying the same file to our awk script twice, we are effectively making two passes. The first time through, we count the number of times we see each first-field value. The second time through, we output only the records whose first-field value count is 1. For the duplicate case we instead output only the lines whose count is greater than 1.
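A quick way to see those two passes in action (a small illustration, not part of the original answer): FNR resets at the start of the second copy of the file while NR keeps counting, so FNR == NR holds only during the first pass.
awk '{ print NR, FNR, (FNR == NR ? "first pass" : "second pass") }' input.txt input.txt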
Awk primer
awk loops over lines (or records) in a text file/input and splits each line into fields: $1 is the first field, $2 the second, and so on. By default fields are separated by whitespace (this can be configured).
awk runs each line through a series of rules in the form condition { action }. Any time a condition matches, the action is taken.
Example of printing the first field of each line that matches foo:
awk '/foo/ { print $1 }' input.txt
Gory Details
Let's take a look at finding only the lines whose first field appears exactly once.
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] == 1' input.txt input.txt > uniq.txt
Prettier version for readability:
FNR == NR {
    seen[$1]++
    next
}
seen[$1] == 1
awk 'code' input > output runs code over the input file input and redirects the output to the file output
awk can take more than one input, e.g. awk 'code' input1.txt input2.txt
Use the same input file, input.txt, twice to loop over the input twice
awk 'FNR == NR { code1; next } code2' file1 file2 is a common awk idiom which will run code1 for file1 and run code2 for file2
NR is the current record (line) number. This increments after each record
FNR is the current file's record number. e.g. FNR will be 1 for the first line in each file
next will stop executing any more actions and go to the next record/line
FNR == NR will only be true for the first file
$1 is the first field's data
seen[$1]++ - seen is an array/dictionary where we use the first field, $1, as our key and increment the value so we can get a count
$0 is the entire line
print ... prints out the given fields
print $0 will print out the entire line
just print is short for print $0
condition { print $0 } can be shortened to condition { print }, which can be shortened further to just condition (see the toy example after this list)
seen[$1] == 1 checks whether the first field's value count equals 1 and, if so, prints the line
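As a quick check of those shorthand rules, the following three commands produce identical output (a toy example, not from the original answer):
awk 'NR <= 2 { print $0 }' input.txt    # explicit form
awk 'NR <= 2 { print }' input.txt       # print defaults to $0
awk 'NR <= 2' input.txt                 # a bare condition implies { print }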
Here is an awk solution:
awk 'NR>1{a[$1]++;b[NR]=$1;c[NR]=$2} END {for (i=2;i<=NR;i++) print b[i],c[i] > (a[b[i]]==1?"single":"multiple")}' file
cat single
104 134
764 347
1000 235
cat multiple
123 200
734 500
123 345
734 546
P.S. I skipped the header line, but handling it could be added.
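For instance, a sketch of one way to carry the header line into both output files (an addition to the answer above, reusing its variables):
awk 'NR == 1 { print > "single"; print > "multiple"; next }
     { a[$1]++; b[NR] = $1; c[NR] = $2 }
     END { for (i = 2; i <= NR; i++)
               print b[i], c[i] > (a[b[i]] == 1 ? "single" : "multiple") }' file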
And with the following variant you get one file for single hits, one for double, one for triple, and so on:
awk 'NR>1{a[$1]++;b[NR]=$1;c[NR]=$2} END {for (i=2;i<=NR;i++) print b[i],c[i] > "file"a[b[i]]}'
That would require some filtering of the list of lines in the buffer. If you're really into statistics research, I'd go search for a tool that is better suited than a general-purpose text editor, though.
That said, my PatternsOnText plugin has some commands that can do the job:
:2,$DeleteUniqueLinesIgnoring /\s\+\d\+$/
:w first
:undo
:2,$DeleteAllDuplicateLinesIgnoring /\s\+\d\+$/
:w second
As you want to filter on the first column, the commands' /{pattern}/ has to filter out the second column; \s\+\d\+$ matches the final number and its preceding whitespace.
:DeleteUniqueLinesIgnoring (from the plugin) gives you just the duplicates, :DeleteAllDuplicateLinesIgnoring just the unique lines. I simply :write them to separate files and :undo in between.

awk command to conditionally compare 2 consecutive lines with different columns

This is my sample input file:
xxxxx,12345,yy,ABN,ABE,47,20171018130030,122021010147421,2,IN,3,13,9741588177,32
xxxxxx,9741588177,yy,ABN,ABE,54,20171018130030,122025010227014,2,IN,3,15,12345,32
I want to compare 2 consecutive lines in this file with this condition:
The 12th field of the 1st line and 12th field of the 2nd line must be 13 and 15, respectively.
If the conditions in point 1 are met, then the 2nd field of line 1 (which has the 12th field value as 13) must match the 13th field of line 2 (which has the 12th field as 15).
The file contains many such lines where the above condition is not met; I would like to print only those lines which meet conditions 1 and 2.
Any help in this regard is greatly appreciated!
It's not clear if you want to compare the lines in groups of 2 (i.e., compare lines 1 and 2, then lines 3 and 4) or serially (i.e., compare lines 1 and 2, then 2 and 3). For the latter:
awk 'NR > 1 && prev_12 == 13 && $12 == 15 &&
prev_2 == $13 {print prev; print $0}
{prev=$0; prev_12=$12; prev_2=$2}' FS=, input-file
For the former, add the condition NR % 2 == 0. (I'm assuming you intended to mention that the fields are comma-separated, which appears to be the case judging by the input.)
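A sketch of that pairwise variant (assuming the lines really do pair up as 1-2, 3-4, and so on):
awk -F, 'NR % 2 == 1 { prev = $0; prev_12 = $12; prev_2 = $2; next }
         prev_12 == 13 && $12 == 15 && prev_2 == $13 { print prev; print }' input-file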
Wish you'd used a few more lines of sample input and provided expected output so we're not all just guessing, but MAYBE this is what you want to do:
$ cat tst.awk
BEGIN { FS="," }
(p[12] == 13) && ($12 == 15) && (p[2] == $13) { print p[0] ORS $0 }
{ split($0,p); p[0]=$0 }
$ awk -f tst.awk file
xxxxx,12345,yy,ABN,ABE,47,20171018130030,122021010147421,2,IN,3,13,9741588177,32
xxxxxx,9741588177,yy,ABN,ABE,54,20171018130030,122025010227014,2,IN,3,15,12345,32
another awk
$ awk -F, '$12==13 {p0=$0; p2=$2; c=1; next}
c&&c-- && $12==15 && p2==$13 {print p0; print}' file
Capturing starts only on an initial match of $12 == 13 on the first of the two lines.
c&&c-- is a smart counter (counting down here) which stops at 0 (due to the first c before the ampersands). Ed Morton has a post with many more examples of smart counters.
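A small illustration of that counter (a toy example, not from the original answer): print the two lines that follow each line containing START.
awk '/START/ { c = 2; next }    # arm the counter on a match
     c && c-- { print }' file   # true while c is non-zero, then stops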

How to find which part of an OR condition is met when you have 40 conditions in Unix

I have a file with 40 fields, and each field should have a particular maximum length. I put an OR condition as below to check whether the requirement is met, and print something if any field's length exceeds what is required. But I want to know and print which field exactly exceeds its limit.
command:
awk -F "|" 'length ($1) > 10 || length ($2) > 30 || length ($3) > 50 || length ($4) > 15 ||...|| length ($40) > 55' /path/filename
Your existing code will not test the remaining conditions after the first one evaluates to true, due to short-circuiting. If you want to check them all, it is better to keep the size requirements in a variable and loop through all the fields. One example:
$ awk -F'|' -v size="10|30|50..." '
    BEGIN { split(size,s) }
    {
        c = sep = ""
        for (i=1; i<=NF; i++)
            if (length($i) > s[i]) { c = c sep i; sep = FS }
        if (c) print $0, c
    }' file
No need to write so many field conditions manually. Since you haven't shown us the expected output, the following code is based on your statements.
awk -F"|" '{for(i=1;i<=NF;i++){if(length($i)>40){print i,$i" having more than 40 length"}}}' Input_file
The above will print the field number and the field's value for each field whose length is more than 40.
EDIT: Adding an example; let's say the following is the Input_file.
cat Input_file
vbrwvkjrwvbrwvbrwv123|vwkjrbvrwnbvrwvkbvkjrwbvbwvwbvrwbvrwbvvbjbvhjrwv|rwvirwvhbrwvbrwvbrwvbhrwbvhjrwbvhjrwbvjrwbvhjwbvhjvbrwvbrwhjvb
123|wwd|wfwcwc
awk -F"|" '{for(i=1;i<=NF;i++){if(length($i)>40){print i,$i" having more than 40 length"}}}' file3499
2 vwkjrbvrwnbvrwvkbvkjrwbvbwvwbvrwbvrwbvvbjbvhjrwv having more than 40 length
3 rwvirwvhbrwvbrwvbrwvbhrwbvhjrwbvhjrwbvjrwbvhjwbvhjvbrwvbrwhjvb having more than 40 length
This is basically the same as karakfa's answer, just ... more whitespacey
awk -F "|" '
BEGIN {
    max[1] = 10
    max[2] = 30
    max[3] = 50
    max[4] = 15
    # ...
    max[40] = 55
}
{
    for (i=1; i<=NF; i++) {
        if (length($i) > max[i]) {
            printf "Error: line %d, column %d is wider than %d\n", NR, i, max[i]
        }
    }
}
' file

AWK script for two columns

I have two columns like this:
(A) (B)
Adam 30
Jon 55
Robert 35
Jokim 99
Adam 32
Adam 31
Jokim 88
I want an AWK script to check whether Adam (or any name) in column A ever has 30 in column B; if so, delete all Adam entries from column A (it does not matter that Adam has 31 or 32 later), and then print the results.
In reality I have a log list, and I do not want the code to depend on "Adam". So what I want exactly is: wherever 30 exists in $2, delete the respective value in $1, and also search $1 for all values that match the deleted value.
You can read the columns into variables and check the value of the second column for the value you are looking for, then sed the file to delete all the corresponding column 1 entries:
cp test.txt out.txt && CHK=30 && while read a b; do
[ "${b}" = "${CHK}" ] && sed -i "/^${a}/d" out.txt
done < test.txt
Note: if the names can contain regex metacharacters you may need to escape them, and since the sed pattern is a prefix match you may want to anchor it with a trailing space; also, if you might have blank lines, check for null before testing column 2.
And since you specified AWK, here is a somewhat elegant awk way to do this, using a check variable set when the flagged line is seen:
awk -vCHK=30 '{if($2~CHK)block=$1; if($1!=block)print}' test.txt
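One caveat (an observation, not part of the original answer): $2~CHK is a regex match, so CHK=30 would also match 130 or 301, and block only remembers the most recently flagged name. A stricter sketch that uses an exact comparison and remembers every flagged name; like the one-liner above, it removes entries only from the first flagged occurrence onward:
awk -v CHK=30 '$2 == CHK { bad[$1] = 1 }    # exact match; remember every flagged name
               !($1 in bad)' test.txt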
To remove the entries from the first occurrence of Adam, 30:
$1 == "Adam" && $2 == 30 { found = 1 }
!(found && $1 == "Adam")
To remove all Adam entries if any Adam, 30 exists:
$1 == "Adam" && $2 == 30 { found = 1 }
{ lines[nlines] = $0; names[nlines++] = $1 }
END { for (i = 0; i < nlines; i++) if (!(found && names[i] == "Adam")) print lines[i] }
To remove all names which have a 30 in the second column:
NR == FNR && $2 == 30 { foundnames[$1] = 1 }
NR != FNR && !($1 in foundnames)
You must call this last version with the input filename twice, i.e. awk -f process.awk file.txt file.txt

Selecting a field after a string using awk

I'm very new to awk, having just been introduced to it over the weekend.
I have a question that I'm hoping someone may be able to help me with.
How would one select a field that follows a specific string?
How would I expand this code to select more than one field following a specific string?
As an example, for any given line in my text file I have something like
2 of 10 19/4/2014 school name random text distance 800m more random text time 2:20:22 winner someonefast.
Some attributes are very consistent, so I can easily extract those fields; for example 2, 10 and the date. However, there is often a lot of variable text before the next field that I wish to extract, hence the question. Using awk, can I extract the next field following a string? For example, I'm interested in the fields following the /distance/ or /time/ string, in combination with $1, $3, $4, $5.
Your help will be greatly appreciated.
Andy
Using awk you can select the field following a string. Here is an example:
echo '2 of 10 19/4/2014 school name random text distance 800m more random text time 2:20:22 winner someonefast.' |
awk '{
    for (i=1; i<=NF; i++) {
        if ( i ~ /^[1345]$/ ) {
            extract = (extract ? extract FS $i : $i)
        }
        if ( $i ~ /distance|time/ ) {
            extract = (extract ? extract FS $(i+1) : $(i+1))
        }
    }
    print extract
}'
2 10 19/4/2014 school 800m 2:20:22
What we are doing here is basically letting awk split on the default delimiter. We create a for loop to iterate over all the fields; NF stores the number of fields for a given line, so we start at 1 and go all the way to the end.
In the first conditional block we just inspect the field number: if it is 1, 3, 4 or 5, we append the field's value to a variable called extract, separated by the field separator.
In the second conditional block we check whether the value of the field is distance or time. If it is, we again append to our variable, but this time instead of the current value we take $(i+1), which is the value of the next field, that is, the value of the field that follows the specific string.
When you have name = value situations like you do here, it's best to create an array that maps the names to the values and then just print the values for the names you're interested in, e.g.:
$ awk '{for (i=1;i<=NF;i++) v[$i]=$(i+1); print $1, $3, $4, $5, v["distance"], v["time"]}' file
2 10 19/4/2014 school 800m 2:20:22
Basic:
awk '{
    for (i = 6; i <= NF; ++i) {
        if ($i == "distance") distance = $(i + 1)
        if ($i == "time") time = $(i + 1)
    }
    print $1, $3, $4, $5, distance, time
}' file
Output:
2 10 19/4/2014 school 800m 2:20:22
But this is not enough to capture the rest of the school name, which continues past $5; you would need another condition for that.
The better solution is to have another delimiter besides spaces, such as tabs, and use \t as FS.
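As a minimal sketch of that suggestion, assuming a hypothetical tab-separated layout (the layout below is an assumption, not from the question):
# Assumed input, one record per line:
# 2 of 10<TAB>19/4/2014<TAB>school name<TAB>800m<TAB>2:20:22
awk -F'\t' '{ print $1, $2, $3, $4, $5 }' file    # multi-word fields survive intact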