Extract lines in File A based on two columns in File B - awk

I have two files with the same number of columns (tab-delimited) that look like this:
File A:
12345 Fish Apple 7123
321 Chicken Apple 9912
661 Ant Apple 316
File B:
321 Duck Orange 9912
12345 Bird Orange 7123
661 Eagle Orange 34
Expected Output:
File A_edited:
661 Ant Apple 316
Based on the IDs in column 1 and column 4 of File B, if both values appear in column 1 and column 4 of a line in File A, I want to remove that line from File A. I tried doing this with grep, but the two files are very large, around 66 GB each, so it is still running after a day. Is there a faster way than grep?
P.S.: the number of columns is actually more than 4; only four are shown here for simplicity.
# Reduce File B to its two key columns
awk '{print $1 "\t" $4}' B.txt >> B_edited.txt
# Extract the line numbers of A.txt where both IDs from B_edited.txt are present
while read -r ID1 ID2
do
    grep -nE "$ID1.*$ID2" A.txt | cut -d: -f1 >> LineNumber.txt   # keep the line number before the ":"
done < B_edited.txt
# Remove duplicate line numbers
sort -u LineNumber.txt >> LineNumberUnique.txt
# Output only the lines of A.txt whose line numbers are not in the list
awk 'FNR == NR { h[$1]; next } !(FNR in h)' LineNumberUnique.txt A.txt >> A_edited.txt
I would greatly appreciate any help!
Thanks,
Jen

$ awk '{k=$1FS$4} NR==FNR{keys[k];next} !(k in keys)' fileB fileA
661 Ant Apple 316
To overwrite fileA with the output, just add > tmp && mv tmp fileA or use -i inplace if you have GNU awk 4.*.
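For reference, here is the same logic written out with comments; it is functionally identical to the one-liner:
awk '
    { k = $1 FS $4 }              # key: column 1, separator, column 4
    NR == FNR { keys[k]; next }   # first file (fileB): remember the key, move on
    !(k in keys)                  # second file (fileA): print lines whose key never appeared in fileB
' fileB fileA
Each file is read only once, with one hash lookup per line, which is why this is so much faster than running a grep pass over File A for every line of File B; the trade-off is that it holds one key per fileB line in memory.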

Related

How to find the position of word in a list of string to use in awk?

Morning guys,
I often have files that I want to grep+awk but that have a lot of fields.
I'm interested in one in particular (so I'd like to awk '{print $i}'), but how can I know at what position (i.e. "i" here) my column is, other than counting it manually?
With files of around 70 fields, I'd be saving a lot of time! :)
Thanks a lot,
[Edit]
Following Ian McGowan's suggestion, I'll look for the column number in the file's header:
head -1 myFile | awk '{for (i=1; i<=NF; i++) printf("%3d=%s\n", i, $i); }' | grep -i <the_column_Im_looking_for>
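For example, with a hypothetical header line id name age, searching for "age" prints:
$ head -1 myFile | awk '{for (i=1; i<=NF; i++) printf("%3d=%s\n", i, $i); }' | grep -i age
  3=age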
Thanks everyone,
Will1v
I was looking for a sample in the question, but since there isn't one:
$ cat > file
this is
the sample
$ awk '{
    for (i=1; i<=NF; i++)
        if ($i == "sample")
            print NR, i
}' file
2 2
I do this all the time when profiling some large text-delimited file.
$ head -4 myfile
4A 1 321 537 513.30
4B 0.00
8 592 846 905.66
9B2 39 887 658.77
Transpose or pivot by looping over the columns/fields:
$ awk '{ for (i=1; i<=NF; i++) printf("%4d %3d=%s\n", NR, i, $i); }' < myfile
1 1=4A
1 2=1
1 3=321
1 4=537
1 5=513.30
2 1=4B
2 2=0.00
3 1=8
3 2=592
3 3=846
3 4=905.66
4 1=9B2
4 2=39
4 3=887
4 4=658.77
You can put whatever you like in printf's format mask, e.g. printf("row=%-4d col=%-3d:%s\n", NR, i, $i);, and then grep for just the data you care about to find out its column, or, if you already know the column, grep for col=44 to get the 44th column.
xargs -n1 will print the columns one per line, so you can do this:
head -1 file | xargs -n1 | grep -n "column_name"
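For example, with a hypothetical header line name count price:
$ head -1 file | xargs -n1 | grep -n count
2:count
The line number that grep -n prefixes is exactly the column position.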

Delete "0" or "1" from the end of each line, except the first line

The input file looks like this:
Kick-off team 68 0
Ball safe 69 1
Attack 77 8
Attack 81 4
Throw-in 83 0
Ball possession 86 3
Goal kick 100 10
Ball possession 101 1
Ball safe 114 13
Throw-in 123 9
Ball safe 134 11
Ball safe 135 1
Ball safe 137 2
and at the end it should look like this:
Kick-off team 68 0
Attack 77 8
Attack 81 4
Ball possession 86 3
Goal kick 100 10
Ball safe 114 13
Throw-in 123 9
Ball safe 134 11
Ball safe 137 2
My solution is:
sed '2,${/ [01]$/d}' test.txt
How can I change the file directly, e.g. with sed -i?
sed -i '2,$ {/[^0-9][01]$/d}' test.txt
2,$ lines to act upon, this one says 2nd line to end of file
{/[^0-9][01]$/d} from filtered lines, delete those ending with 0 or 1
'2,$ {/ [01]$/d}' can be also used if character before last column is always a space
With awk which is better suited for column manipulations:
awk 'NR==1 || ($NF!=1 && $NF!=0)' test.txt > tmp && mv tmp test.txt
NR==1 first line
($NF!=1 && $NF!=0) last column shouldn't be 0 or 1
can also use $NF>1 if the last column only has non-negative numbers
> tmp && mv tmp test.txt save output to temporary file and then move it back as original file
With GNU awk, there is inplace option awk -i inplace 'NR==1 || ($NF!=1 && $NF!=0)' test.txt
Here's my take on this.
sed -i.bak -e '1p;/[^0-9][01]$/d' file.txt
The sed script prints the first line, then deletes all subsequent lines that match the pattern you described. This assumes that your first line would be a candidate for deletion; if it contains something other than 0 or 1 in the last field, this script will print it twice. And the -i option is what tells sed to edit in-place (with a backup file).
Awk (outside of GNU awk's -i inplace) doesn't have an equivalent option for editing files in-place -- if you want that kind of functionality, you need to implement it in a shell wrapper around your awk script, as @sundeep suggested.
Note that I'm not using GNU sed, but this command should work equally well with it.
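If your first line might not end in 0 or 1, a variant that sidesteps the double-print caveat is to exempt the first line from the delete instead (a sketch using standard sed address negation):
sed -i.bak '1!{/[^0-9][01]$/d}' file.txt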
awk to the rescue!
$ awk 'NR==1 || $NF && $NF!=1' file
or, more cryptically (the product n*(n-1) is zero exactly when n is 0 or 1):
$ awk 'NR==1 || $NF*($NF-1)' file
This might work for you (GNU sed):
sed -i '1b;/\s[01]$/d' file
Other than the first line, delete any line ending in 0 or 1.

Find duplicates and give sum of values in column next to it (UNIX) (with solution -> need faster way)

I am writing a script for bioinformatics use. I have a file with 2 columns, where column A holds a number and column B a specific string. I need a script that searches the file for duplicates of each string in column B; if any are found, the numbers in column A should be added up, the duplicates removed, and only one line kept, with column A holding the sum and column B the string.
I have written something that does exactly that, but because I am not really a programmer I am sure there is a much faster way. My files sometimes contain 500k lines and my code takes too long for such files. Please have a look at it and see what I could change to speed things up. Also, I can't use uniq, because for that I'd have to sort first, and the order of the lines has to stay the way it is!
13 ABCD
15 BGDA
12 ABCD
10 BGDA
10 KLMN
17 BGDA
should become
25 ABCD
42 BGDA
10 KLMN
This does it but for a file with 500k lines it takes too long:
for AASEQUENCE in file.txt; do
    # see how many lines the file has and save that number in $LN
    LN="$(wc -l "$AASEQUENCE" | cut -d " " -f 1)"
    for ((i=1; i<=LN; i++)); do
        # $STRING holds just the string from column B of line $i
        STRING="$(cut -f2 "$AASEQUENCE" | head -n "$i" | tail -n 1)"
        # $UNIQ holds the full line (number + string); it is used in the
        # else-branch: if the string has no duplicates, the line is added
        # to the output file without further processing
        UNIQ="$(head -n "$i" "$AASEQUENCE" | tail -n 1)"
        for DUPLICATE in $AASEQUENCE; do
            # $VAR is the number of lines containing the string; if it is 1,
            # there are no duplicates and we fall through to the else-branch
            VAR="$(grep -w "$STRING" "$DUPLICATE" | wc -l)"
            # add up all numbers from column A that have $STRING in column B
            TOTALCOUNT="$(grep -w "$STRING" "$DUPLICATE" | cut -f1 | awk '{SUM += $1} END {print SUM}')"
            # create the file that the merged lines are put into
            touch "MERGED_$(basename "$AASEQUENCE")"
            # check whether this string has already been added to the output
            ALREADYMATCHED="$(grep -w "$STRING" "MERGED_$(basename "$AASEQUENCE")" | wc -l)"
            if [[ "$VAR" -gt 1 ]]; then
                # duplicates exist: add the summed line, but only once
                if [[ "$ALREADYMATCHED" -eq 0 ]]; then
                    paste <(echo "$TOTALCOUNT") <(echo "$STRING") --delimiters ' ' >> "MERGED_$(basename "$AASEQUENCE")"
                fi
            else
                echo "$UNIQ" >> "MERGED_$(basename "$AASEQUENCE")"
            fi
        done
    done
done
P.S.: When I have fileA.txt fileB.txt ... and use file*, the loop still always stops after the first file. Any suggestions why?
Maybe a pure awk solution?
$ cat > in
13 ABCD
15 BGDA
12 ABCD
10 BGDA
10 KLMN
17 BGDA
$ awk '{dc[$2] += $1} END{for (seq in dc) {print dc[seq], seq}}' in
25 ABCD
42 BGDA
10 KLMN
$
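Note that for (seq in dc) visits the keys in an unspecified order, and the question asks for the original line order to be kept. A variant that remembers the order of first appearance (a sketch over the same input):
$ awk '!($2 in dc){order[++n]=$2} {dc[$2] += $1}
       END{for (i=1; i<=n; i++) print dc[order[i]], order[i]}' in
25 ABCD
42 BGDA
10 KLMN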

Partial id match and merge multiple to one

I have two files, File1 and File2. File1 has 6000 rows and File2 has 3000 rows. I want to match the ids and merge the files based on those matches, which is simple, but the ids in File1 and File2 only match partially. Have a look at the files. For every id (row) in File2 there should be two matching ids (rows) in File1. Also, not all the ids in File2 are present in File1. I tried awk but didn't get the desired output.
File1
1_A01_A
1_A01_B
2_B03_A
2_B03_B
1_A02_A
1_A02_B
2_B04_A
2_B04_B
1_A03_A
1_A03_B
2_B05_A
2_B05_B
1_A04_A
1_A04_B
2_B06_A
2_B06_B
1_A06_A
1_A06_B
2_B07_A
2_B07_B
1_A07_A
1_A07_B
2_B08_A
2_B08_B
9_F10_A
9_F10_B
12_D08_A
12_D08_B
5505744243493_F09.CEL_A_A
5505744243493_F09.CEL_B_B
File2
1_A01 14
2_B03 13
1_A02 4
2_B04 14
1_A03 11
2_B05 8
1_A04 18
2_B06 15
1_A06 10
2_B07 4
1_A07 8
2_B08 22
1_A08 5
2_B09 15
1_A09 20
2_B10 17
awk -F" " 'FNR==NR{a[$1]=$2;next}{for(i in a){if($1~i){print $1" "a[i];next}}}' file1.txt file2.txt
FNR==NR will be true while awk reads file 1 and false when it reads file 2. The part of code starting from {for(i in a} .. will be executed for file 2. $1~i looks for Like condition and then for relevant matches the output is printed.
by mistake I have used different file notations. My file1.txt contains the content of file2.txt in the problem statement and vise versa
Output
1_A01_A|14
1_A01_B|14
2_B03_A|13
2_B03_B|13
1_A02_A|4
1_A02_B|4
2_B04_A|14
2_B04_B|14
1_A03_A|11
1_A03_B|11
2_B05_A|8
2_B05_B|8
1_A04_A|18
1_A04_B|18
2_B06_A|15
2_B06_B|15
1_A06_A|10
1_A06_B|10
2_B07_A|4
2_B07_B|4
1_A07_A|8
1_A07_B|8
2_B08_A|22
2_B08_B|22
This might work for you (GNU sed):
sed -r 's|^(\S+)(\s+\S+)$|s/^\1.*/\&\2/p|' file2 | sed -nf - file1
This creates a sed script from file2 and then runs it against the data in file1.
N.B. The order of either file is unimportant and file1 is processed only once.
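For example, the first line of file2 (1_A01 14) becomes the generated sed command below; run against file1, it appends the value to every matching line, so 1_A01_A prints as 1_A01_A 14:
s/^1_A01.*/& 14/p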

grep: Keeping lines that have a specific string in a certain column

I am trying to pick out the lines that have a certain value in a certain column and save them to an output file. I am trying to do this with grep. Is that possible?
My data looks like this:
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
melon 1 ewtedf wersdf
orange 3 qqqwetr hredfg
I want to pick out the lines that have the value 5 in the 2nd column and save them to a new output file.
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
I would appreciate any help!
It is probably possible with grep, but the right tool for this operation is definitely awk. You can filter every line having 5 in the second column with
awk '$2 == 5'
Explanation
awk splits its input into records (usually lines) and fields (usually columns) and performs actions on records matching certain conditions. Here
awk '$2 == 5'
is a short form for
awk '$2 == 5 {print($0)}'
which translates to
For each record, if the second field ($2) is 5, print the full record ($0).
Variations
If you need to choose dynamically the key value used to filter your values, use the -v option of awk:
awk -v "key=5" '$2 == key {print($0)}'
If you need to keep the first line of the file because it contains a header to the table, use the NR variable that keeps track of the ordinal number of the current record:
awk 'NR == 1 || $2 == 5'
The field separator is a regular expression defining which text separates columns; it can be modified with the -F option. For instance, if your data were in a basic CSV file, the filter would be
awk -F", *" '$2 == 5'
Visit the awk tag wiki to find useful information to get started learning awk.
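To save the matching lines to a new file, as the question asks, just redirect the output (the file names here are only placeholders):
awk '$2 == 5' data.txt > output.txt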
To print when the second field is 5 use: awk '$2==5' file
Give this a try:
grep -E '^[^[:space:]]+[[:space:]]5([[:space:]]|$)' file.txt
The pattern looks for start of line, followed by one or more non-space characters (the first column), a space, a 5, and then either a space or end of line (so that, e.g., a 55 in the second column does not match).
You can use the following command.
$ cat data.txt
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
melon 1 ewtedf wersdf
orange 3 qqqwetr hredfg
grape 55 kkkkkkk aaaaaa
$ grep -E '[^ ]+ +5 .*' data.txt > output.txt
$ cat output.txt
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
You can get the answer with grep alone, but I strongly recommend using awk.
The simple way to do it is:
grep '5' MyDataFile
The result:
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf
To capture that in a new file:
grep '5' MyDataFile > newfile
Note: that will find a 5 anywhere in MyDataFile. If you want to limit the match to the second column only, a quick script like the following will do. Usage: script number datafile:
#!/bin/bash
while read -r fruit num stuff || [ -n "$stuff" ]; do
    [ "$num" -eq "$1" ] && printf "%s %s %s\n" "$fruit" "$num" "$stuff"
done < "$2"
output:
$ ./fruit.sh 5 dat/mydata.dat
apple 5 abcdefd ewdsf
peach 5 ewtdsfe wtesdf