AWK script for two columns - awk

I have two columns like this:
(A) (B)
Adam 30
Jon 55
Robert 35
Jokim 99
Adam 32
Adam 31
Jokim 88
I want an AWK script that checks whether Adam (or any name) in column A ever has the value 30 in column B; if so, delete all "Adam" lines, regardless of whether Adam later has 31 or 32, and then print the result.
In reality I have a log file, and I do not want the code to depend on "Adam". What I want is: wherever 30 exists in $2, take the corresponding value in $1 and remove every line whose $1 matches that value.

You can read the columns into variables and check the value of the second column for the value you are looking for, then use sed on the file to delete all the matching column 1 entries:
cp test.txt out.txt && CHK=30 && while read -r a b; do
    # if column 2 matches, delete every line whose first column is that name
    [ "${b}" = "${CHK}" ] && sed -i "/^${a} /d" out.txt
done < test.txt
Note: if the columns may contain regex metacharacters you will need to escape them, and if blank lines are possible you may want to check for an empty value before testing column 2.
And since you specified AWK, here is a somewhat elegant awk way to do this, using a check flag that is tested prior to printing:
awk -vCHK=30 '{if($2~CHK)block=$1; if($1!=block)print}' test.txt
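For example, run against the sample data (assuming test.txt holds just the data rows, with no header line), this would print:
Jon 55
Robert 35
Jokim 99
Jokim 88
Note that $2~CHK is a regular-expression match, so values such as 130 or 302 would also trigger it; use $2==CHK if you need an exact comparison.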

To remove the Adam entries from the first occurrence of Adam, 30 onwards:
$1 == "Adam" && $2 == 30 { found = 1 }
!(found && $1 == "Adam")
To remove all Adam entries if any Adam, 30 exists:
$1 == "Adam" && $2 == 30 { found = 1 }
{ lines[nlines] = $0; names[nlines] = $1; nlines++ }
END { for (i = 0; i < nlines; i++) if (!(found && names[i] == "Adam")) print lines[i] }
To remove all names which have a 30 in the second column:
NR == FNR && $2 == 30 { foundnames[$1] = 1 }
NR != FNR && !($1 in foundnames)
You must call this last version with the input filename twice, i.e. awk -f process.awk file.txt file.txt
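For example, putting those two rules in process.awk and running it against the sample data (assuming file.txt has no header line):
awk -f process.awk file.txt file.txt
Jon 55
Robert 35
Jokim 99
Jokim 88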

Related

how to extract lines which have no duplicated values in first column?

For some statistics research, I want to separate out the rows of my data that have a duplicated value in the first column. I work with vim.
Suppose that part of my data is like this:
Item_ID Customer_ID
123 200
104 134
734 500
123 345
764 347
1000 235
734 546
As you can see, some lines have equal values in the first column.
I want to generate two separate files: one containing just the non-repeated values and the other containing the lines with equal first-column values.
For the above example I want to have these two files:
first one contains:
Item_ID Customer_ID
123 200
734 500
123 345
734 546
and second one contains:
Item_ID Customer_ID
104 134
764 347
1000 235
can anybody help me?
I think awk would be a better option here.
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] == 1' input.txt input.txt > uniq.txt
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] > 1' input.txt input.txt > dup.txt
Prettier version of awk code:
FNR == NR {
seen[$1]++;
next
}
seen[$1] == 1
Overview
We loop over the text twice. By supplying the same file to our awk script twice we are effectively looping over the text twice. The first time through, we count the number of times we see each first-field value. The second time through, we output only the records whose field value count is 1. For the duplicate-line case we output only the lines whose field value count is greater than 1.
Awk primer
awk loops over lines (or records) in a text file/input and splits each line into fields. $1 is the first field, $2 the second field, etc. By default fields are separated by whitespace (this can be configured).
awk runs each line through a series of rules in the form of condition { action }. Any time a condition matches then action is taken.
Example of printing the first field of each line that matches foo:
awk '/foo/ { print $1 }' input.txt
Glory of Details
Let's take a look at finding only the unique lines, i.e. those whose first field appears only once.
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] == 1' input.txt input.txt > uniq.txt
Prettier version for readability:
FNR == NR {
seen[$1]++;
next
}
seen[$1] == 1
awk 'code' input > output - run code over the input file, input, and then redirect the output to file, output
awk can take more than one input. e.g. awk 'code' input1.txt input2.txt.
Use the same input file, input.txt, twice to loop over the input twice
awk 'FNR == NR { code1; next } code2' file1 file2 is a common awk idiom which will run code1 for file1 and run code2 for file2
NR is the current record (line) number. This increments after each record
FNR is the current file's record number. e.g. FNR will be 1 for the first line in each file
next will stop executing any more actions and go to the next record/line
FNR == NR will only be true for the first file
$1 is the first field's data
seen[$1]++ - seen is an array/dictionary where we use the first field, $1, as our key and increment the value so we can get a count
$0 is the entire line
print ... prints out the given fields
print $0 will print out the entire line
just print is short for print $0
condition { print $0 } can be shortened to condition { print } which can be shortened further to just condition
seen[$1] == 1 checks whether the first field's value count is equal to 1 and, if so, prints the line
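Given the sample data above, the two commands would produce:
$ cat uniq.txt
Item_ID Customer_ID
104 134
764 347
1000 235
$ cat dup.txt
123 200
734 500
123 345
734 546
Note that the header row counts as a value seen only once, so it lands in uniq.txt but not in dup.txt; you would need a little extra handling (e.g. a separate FNR == 1 rule) if you want it in both files.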
Here is an awk solution:
awk 'NR>1{a[$1]++;b[NR]=$1;c[NR]=$2} END {for (i=2;i<=NR;i++) print b[i],c[i] > (a[b[i]]==1?"single":"multiple")}' file
cat single
104 134
764 347
1000 235
cat multiple
123 200
734 500
123 345
734 546
PS: I skipped the first (header) line, but handling it could be added.
This way you get one file for single hits, one for double, one for triple etc.
awk 'NR>1{a[$1]++;b[NR]=$1;c[NR]=$2} END {for (i=2;i<=NR;i++) print b[i],c[i] > "file"a[b[i]]}' file
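With the same sample input, that produces file1 holding the single hits and file2 holding the double hits:
cat file1
104 134
764 347
1000 235
cat file2
123 200
734 500
123 345
734 546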
That would require some filtering of the list of lines in the buffer. If you're really into statistics research, I'd go search for a tool that is better suited than a general-purpose text editor, though.
That said, my PatternsOnText plugin has some commands that can do the job:
:2,$DeleteUniqueLinesIgnoring /\s\+\d\+$/
:w first
:undo
:2,$DeleteAllDuplicateLinesIgnoring /\s\+\d\+$/
:w second
As you want to filter on the first column, the commands' /{pattern}/ has to filter out the second column; \s\+\d\+$ matches the final number and its preceding whitespace.
:DeleteUniqueLinesIgnoring (from the plugin) gives you just the duplicates, :DeleteAllDuplicateLinesIgnoring just the unique lines. I simply :write them to separate files and :undo in between.

awk command to conditionally compare 2 consecutive lines with different columns

This is my sample input file:
xxxxx,12345,yy,ABN,ABE,47,20171018130030,122021010147421,2,IN,3,13,9741588177,32
xxxxxx,9741588177,yy,ABN,ABE,54,20171018130030,122025010227014,2,IN,3,15,12345,32
I want to compare 2 consecutive lines in this file with these conditions:
1. The 12th field of the 1st line and the 12th field of the 2nd line must be 13 and 15, respectively.
2. If the condition in point 1 is met, then the 2nd field of line 1 (which has the 12th field value as 13) must match the 13th field of line 2 (which has the 12th field as 15).
The file contains many lines where the above conditions are not met; I would like to print only those lines which meet conditions 1 and 2.
Any help in this regard is greatly appreciated!
It's not clear if you want to compare the lines in groups of 2 (ie, compare lines 1 and 2, and then lines 3 and 4) or serially (ie, compare lines 1 and 2, and then 2 and 3). For the latter:
awk 'NR > 1 && prev_12 == 13 && $12 == 15 &&
prev_2 == $13 {print prev; print $0}
{prev=$0; prev_12=$12; prev_2=$2}' FS=, input-file
For the former, add the condition NR % 2 == 0, as sketched below. (I'm assuming you intended to mention that the fields are comma-separated, which appears to be the case judging by the input.)
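A sketch of that pairwise variant, i.e. the same program with the extra condition added (still assuming comma-separated fields in input-file):
awk 'NR % 2 == 0 && prev_12 == 13 && $12 == 15 &&
     prev_2 == $13 {print prev; print $0}
     {prev=$0; prev_12=$12; prev_2=$2}' FS=, input-file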
Wish you'd used a few more lines of sample input and provided expected output so we're not all just guessing but MAYBE this is what you want to do:
$ cat tst.awk
BEGIN { FS="," }
(p[12] == 13) && ($12 == 15) && (p[2] == $13) { print p[0] ORS $0 }
{ split($0,p); p[0]=$0 }
$ awk -f tst.awk file
xxxxx,12345,yy,ABN,ABE,47,20171018130030,122021010147421,2,IN,3,13,9741588177,32
xxxxxx,9741588177,yy,ABN,ABE,54,20171018130030,122025010227014,2,IN,3,15,12345,32
another awk
$ awk -F, '$12==13 {p0=$0; p2=$2; c=1; next}
c&&c-- && $12==15 && p2==$13 {print p0; print}' file
We start capturing only on the initial match of $12 == 13 on the first line.
c&&c-- is a smart counter (a count-down here), which will stop at 0 (due to the first c before the ampersands). Ed Morton has a post with a lot more examples of these smart counters; a small illustration follows.
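As a standalone illustration of the idiom (the /ERROR/ pattern, the count of 2 and the logfile name are just placeholders), the following prints the two lines after each line matching ERROR, but not the matching line itself:
awk '
  c && c--           # while the countdown is armed, print the line (default action)
  /ERROR/ { c = 2 }  # arm a 2-line countdown when a marker line is seen
' logfile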

How to find which part of an OR condition is met when you have 40 conditions in Unix

I have a file with 40 fields, each of which should have a particular length. I put an OR condition as below and checked whether the requirement is met, printing something if any field's length is more than allowed. But I want to know and print exactly which field is longer than allowed.
command:
awk -F "|" 'length ($1) > 10 || length ($2) > 30 || length ($3) > 50 || length ($4) > 15 ||...|| length ($40) > 55' /path/filename
Your existing code will not test any of the conditions after the first one that evaluates to true, due to short-circuiting. If you want to check them all, it is better to keep the size requirements in a variable and loop through all the fields; one example can be
$ awk -F'|' -v size="10|30|50..." '
    BEGIN { split(size, s) }
    {
        c = sep = ""
        for (i = 1; i <= NF; i++)
            if (length($i) > s[i]) { c = c sep i; sep = FS }
        if (c) print $0, c
    }' file
No need to write so many field conditions manually. Since you haven't shown us the expected output, the following code is written based on your statements.
awk -F"|" '{for(i=1;i<=NF;i++){if(length($i)>40){print i,$i" having more than 40 length"}}}' Input_file
The above will print the field number and the field's value for every field whose length is more than 40.
EDIT: Adding an example; let's say the following is the Input_file.
cat Input_file
vbrwvkjrwvbrwvbrwv123|vwkjrbvrwnbvrwvkbvkjrwbvbwvwbvrwbvrwbvvbjbvhjrwv|rwvirwvhbrwvbrwvbrwvbhrwbvhjrwbvhjrwbvjrwbvhjwbvhjvbrwvbrwhjvb
123|wwd|wfwcwc
awk -F"|" '{for(i=1;i<=NF;i++){if(length($i)>40){print i,$i" having more than 40 length"}}}' file3499
2 vwkjrbvrwnbvrwvkbvkjrwbvbwvwbvrwbvrwbvvbjbvhjrwv having more than 40 length
3 rwvirwvhbrwvbrwvbrwvbhrwbvhjrwbvhjrwbvjrwbvhjwbvhjvbrwvbrwhjvb having more than 40 length
This is basically the same as karakfa's answer, just ... more whitespacey
awk -F "|" '
BEGIN {
max[1] = 10
max[2] = 30
max[3] = 50
max[4] = 15
# ...
max[40] = 55
}
{
for (i=1; i<=NF; i++) {
if (length($i) > max[i]) {
printf "Error: line %d, column %d is wider than %d\n", NR, i, max[i]
}
}
}
' file
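A hypothetical run, with a made-up sample.txt and only three limits, just to show the output format:
$ cat sample.txt
short|this-field-is-way-longer-than-ten-characters|ok
ok|ok|ok
$ awk -F'|' 'BEGIN { max[1]=10; max[2]=30; max[3]=10 }
  { for (i=1; i<=NF; i++)
      if (length($i) > max[i])
        printf "Error: line %d, column %d is wider than %d\n", NR, i, max[i] }' sample.txt
Error: line 1, column 2 is wider than 30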

Is there a way to use awk to REMOVE lines based on threshold value?

I have a bunch of identifiers in the first column and scores for individual samples (for those identifiers) in the next columns, like this:
ID 1 2 3
21 20 70 80
13 44 50 10
I know the awk syntax to count how many instances there are where every value in a row is less than 20 (($2 < 20) && ($3 < 20) && ($4 < 20)), but I don't know how to filter them out.
If I instead do (($2 > 20) && ($3 > 20) && ($4 > 20)), print those rows and save them, it is not the same, because there will be instances in the first example where one value is less than 20 but the row should still be kept, since not ALL of its values are less than 20 (e.g. 10 40 45). With the > version, all values must be greater than 20, so that row would have been deleted.
Can you please help me? Maybe I need sed?
Thanks!
You can check whether any of the values fails your condition by iterating over the fields up to NF, and print the whole line accordingly:
awk '{
    if (NR != 1) {                      # skip the header line
        remove = 0
        for (i = 2; i <= NF; i++) {     # start at 2 so the ID column is not tested
            if ($i < 20) {
                remove = 1
                break
            }
        }
        if (remove == 0) {
            print $0
        }
    }
}' test.txt
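For example, with the sample rows (header included) in test.txt, the script above prints only the row whose scores are all at least 20:
21 20 70 80
As written, the NR != 1 guard drops the header line and the loop starts at field 2 so the ID column itself is never tested; add a separate NR == 1 rule if you want to keep the header.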
It's not very clear what you're asking without the desired output being provided. Also, your input file seems to have a header, which increases the confusion.
These are the alternatives you can use; the comment on each indicates which records will be printed. You can extend them to additional columns.
awk -v t=20 '$2<t && $3<t' file # all strictly less
awk -v t=20 '!($2<t && $3<t)' file # any greater or equal
awk -v t=20 '$2<t || $3<t' file # any strictly less
awk -v t=20 '!($2<t || $3<t)' file # all greater or equal
Perhaps these basic equivalences will help you understand:
!(p && q) == !p || !q # for logical p,q
!(p || q) == !p && !q
!(x<y) == x>=y # for numerical x,y
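Given the question (drop a row only when every score is below the threshold), the variant you most likely want is the second one, extended to all three score columns. The extra 5 1 2 3 row below is made up just to show a row that actually gets dropped:
$ printf 'ID 1 2 3\n21 20 70 80\n13 44 50 10\n5 1 2 3\n' |
  awk -v t=20 'NR == 1 || !($2<t && $3<t && $4<t)'
ID 1 2 3
21 20 70 80
13 44 50 10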
You are most probably doing something wrong. The statement "you will have instances in the first example where one value is less than 20 and the row is still kept because not ALL values are less than 20 (e.g. 10 40 45)" is not valid. Using && you ask for a logical AND, and a chained AND will only produce output if all of the ANDed conditions return true; meaning that the row is not kept:
$ echo "10 40 45" |awk '(($1<20) && ($2<20) && ($3<20))'
#Output : no output
If you want to keep above row then you need OR:
$ echo "10 40 45" |awk '(($1<20) || ($2<20) || ($3<20))'
#Output:
10 40 45
Similarly :
$ echo "10 40 45" |awk '(($1>20) && ($2>20) && ($3>20))'
# Output: No Output
$ echo "10 40 45" |awk '(($1>20) || ($2>20) || ($3>20))'
#Output:
10 40 45

How to get random lines with a given distribution with awk?

I have two tabular files that I want to merge, but prior to that I want to reduce the second one.
The first file, let's say File1, is tabular and looks like this:
A 67 98 56
A 22 29 62
A 11 99 28
B 37 88 14
B 33 99 65
We have 3 lines with A and two with B.
File2 contains 3000 lines of A and B entries. I want to randomly select lines from File2, but with exactly the same number of A and B lines as in File1, which means just 3 random lines with A and two with B.
Anyone have an idea of how to do this with awk?
Thanks
#!/bin/bash
# Split file2 into xx00 (the A lines) and xx01 (the B lines).
csplit -s file2 '/^B /'
# Line counts of each piece, i.e. the pool sizes to draw from.
acount=$(wc -l < xx00)
bcount=$(wc -l < xx01)
awk -v "acount=$acount" -v "bcount=$bcount" '
NR == FNR {
    arr[$1]++;           # count how many A and B lines file1 has
    next
}
! setup {
    setup = 1
    srand()
    # pick arr["A"] distinct random line numbers in 1..acount
    while (arandcount < arr["A"]) {
        line = int(rand() * acount) + 1
        if (! alines[line]) {
            alines[line] = 1
            arandcount++
        }
    }
    # pick arr["B"] distinct random line numbers in 1..bcount
    while (brandcount < arr["B"]) {
        line = int(rand() * bcount) + 1
        if (! blines[line]) {
            blines[line] = 1
            brandcount++
        }
    }
}
FILENAME == "xx00" && FNR in alines {
    print
}
FILENAME == "xx01" && FNR in blines {
    print
}' file1 xx00 xx01
Where "xx00" and "xx01" are the names of the files created by csplit.
The csplit command splits the input file on the regex into the A piece (xx00) and the B piece (xx01); wc -l then gives the line count of each piece, and those counts are passed into the AWK program.
The first block reads all the lines in file1 and counts each "type" (A or B).
The second block picks, for each "type", as many line numbers as file1 contains of that type, choosing random numbers between 1 and the count of lines of that type in file2. This block only gets executed once, because of the setup flag.
The last two blocks check each line's record number to see if it's in the array of picked line numbers and print it if it is.
This might work for you:
grep '^A' file2 | sort -R | head -$(grep -c '^A' file1) >file3
grep '^B' file2 | sort -R | head -$(grep -c '^B' file1) >>file3
N.B. This assumes file1 is sorted.