Suppose I have a file like this (disclaimer: the shape is not fixed; there can be more than 7 rows and more than 4 columns):
R H A 23
S E A 45
T E A 34
U   A 35
Y T A 35
O E A 353
J G B 23
I want to select the second column whenever the third column is A, but keep the whitespace character (so a missing value stays a space).
output should be:
HEE TE
I tried this:
awk '{if ($3=="A") print $2}' file | awk 'BEGIN{ORS = ""}{print $1}'
But this gives:
HEETE%
Which has a weird % at the end (the shell marking a missing trailing newline) and is missing the space.
You can use this GNU awk solution based on FIELDWIDTHS:
awk 'BEGIN{ FIELDWIDTHS = "1 1 1 1 1 1 *" } $5 == "A" {s = s $3}
END {print s}' file
HEE TE
awk splits each record using the width values provided in the FIELDWIDTHS variable.
"1 1 1 1 1 1 *" means each of the first 6 columns is a single character wide and the remaining text is filled into the 7th column. Since there is a space after each value, $2, $4 and $6 hold a single space while $1, $3 and $5 hold the actual values from the input.
$5 == "A" {s = s $3}: here we check whether $5 is A and, if so, keep appending the value of $3 to the variable s. In the END block we just print the variable s.
Without fixed-width parsing, awk would treat the A in the 4th row as $2.
Or, if we make the spaces part of the column values, use:
awk '
BEGIN{ FIELDWIDTHS = "2 2 2 *" }
$3 == "A " {s = s substr($2,1,1)}
END {print s}
' file
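If gawk is not available, a portable sketch with POSIX substr works for this exact single-character layout (an assumption on my part: values are one character wide at positions 1, 3 and 5):

```shell
# Character 5 holds the third column, character 3 the second.
# A row with a missing second value keeps a space at position 3,
# so the space survives in the output.
awk 'substr($0, 5, 1) == "A" { s = s substr($0, 3, 1) }
     END { print s }' file
```

For the sample input this prints HEE TE.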
Related
Currently I am using an awk script which compares numbers in non-sequential order and prints the difference. It works pretty well for numbers, but it doesn't seem to work well with alphanumeric values.
In its current state, apart from simply comparing the numbers, it does 2 things additionally:
It accounts for the zeros before a number or character, ignoring them and comparing the absolute values only
If the same number or character occurs multiple times in both files, it outputs the additional occurrences
I just want the script to work well for alphanumeric values too, as currently it only seems to handle plain numbers. Can someone please edit the script to produce the desired output while also honouring the above 2 conditions?
Current script
awk '{k=$0+0}
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}' file1 file2
Example below
cat file1
1
01
001
8
2B
12
13C
027B
0027B
cat file2
1
2
08
12
13C
02B
9
27B
Expected output/result
1
1
2
9
27B
Explanation of expected output
In file1: "1", "01", "001" all evaluate to 1 (3 times)
In file2: "1" is present only once
Hence "1" is present twice in the result (3 - 1 times)
"2" and "9" are present only in file2, so both simply form part of the output
In file1: "027B", "0027B" evaluate to 27B (2 times)
In file2: "27B" is present only once
Hence "27B" is present once in the result (2 - 1 times)
Explanation of matched items (the ones not forming part of the expected output)
"8" from file1 (line 4) is matched with "08" from file2 (line 3)
"12" from file1 (line 6) is matched with "12" from file2 (line 4)
"13C" from file1 (line 7) is matched with "13C" from file2 (line 5)
"2B" from file1 (line 5) is matched with "02B" from file2 (line 6)
Lastly, the items in the expected output should be in ascending order as shown above; if, say, 3 were also in the expected output, it should read vertically as 1 1 2 3 9 27B
It should be enough to remove leading zeros when forming the key (with a special case for zero values like 0000):
/^0+$/ { k = 0 }
/[^0]/ { k = $0; sub(/^0*/, "", k) }
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}
$ awk -f a.awk file1 file2
2
9
27B
1
1
RE-EDIT
If you just want the values sorted numerically, pipe into sort:
$ awk -f a.awk file1 file2 | sort -n
1
1
2
9
27B
To output in the order found in file2, you can remember the order in another array and then do all the printing in the END block. This version outputs the values in the order of file2, with any values present only in file1 printed last.
/^0+$/ { k = 0 }
/[^0]/ { k = $0; sub(/^0*/, "", k) }
NR==FNR {a[k]++; next}
{ b[FNR] = k }
!(k in a && a[k]--) { a[k] = 1 }
END {
for (i=1; i<=FNR; ++i) {
k = b[i]
while(a[k]-->0) print k
}
for (k in a) {
while(a[k]-->0) print k
}
}
$ awk -f a.awk file1 file2
1
1
2
9
27B
I want to use awk to extract lines whose second-column value is greater than that of the previous line.
The sample data looks like this:
a 3
a 5
a 4
b 1
c 2
c 3
c 6
I tried the command below, but it did not work:
awk '{if(($1!=a) || ($1==a && $2>b)){getline; print}};{a=$1;b=$2}'
the expected output:
a 3
a 5
b 1
c 2
c 3
c 6
Only the "a 4" line should be removed, because 4 is smaller than the 2nd column of the previous line (5).
But the actual result from my code:
a 5
c 2
c 6
How can I resolve it? Thanks
Here is one:
$ awk '$1!=p1 || $2>p2; {p1=$1;p2=$2}' file
a 3
a 5
b 1
c 2
c 3
c 6
If $1 changes or $2 is greater than in the previous round, print.
This can be generalised to a key spanning more fields.
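As a sketch of that generalisation (assuming the last field is the value and everything before it forms the key):

```shell
# Keep the original line, strip the last field to build the key,
# then print when the key changes or the value grows.
awk '{ line = $0; val = $NF; $NF = ""; key = $0 }
     key != pkey || val > pval { print line }
     { pkey = key; pval = val }' file
```

On the sample data this produces the same six lines as the one-liner above.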
Try the following when values of the same 1st field should NOT be compared separately (only consecutive 2nd-field values matter):
awk '$2>prev; {prev=$2}' Input_file
In case you want to compare the 2nd-field values within the same 1st field, then try the following:
awk 'prev_1st!=$1 || $2>prev; {prev=$2;prev_1st=$1}' Input_file
Explanation: adding an explanation for the above code.
awk '                    ##Starting awk program here.
prev_1st!=$1 || $2>prev  ##Checking condition: if prev_1st is NOT equal to $1 (a new group starts, which also covers the very first line) OR $2 is greater than prev, then print the line.
{
  prev=$2                ##Creating variable prev and setting its value to $2 for the next iteration.
  prev_1st=$1            ##Creating variable prev_1st and setting its value to $1 for the next iteration.
}
' Input_file             ##Mentioning Input_file name here.
I have a text file with multiple rows and either two or four columns. In two-column rows the 1st column is an id and the 2nd a number; in four-column rows the 1st and 2nd columns are ids and the 3rd and 4th are numbers. In four-column rows the cells of the 2nd and 4th columns can hold multiple comma-separated entries. Rows with only two columns should be printed as they are; for four-column rows I want to print only the 1st-column id and, in the second output column, the sum of all the numbers present in the 3rd and 4th columns of that row.
Input
 CG AT,AA,CA 17 1,1,1
 GT 14
 TB AC,TC,TA,GG,TT,AR,NN,NM,AB,AT,TT,TC,CA,BB,GT,AT,XT,MT,NA,TT 552 6,1,1,2,2,1,2,1,5,3,4,1,2,1,1,1,3,4,5,4
 TT CG,GT,TA,GB 105 3,4,1,3
Expected Output
CG 20
GT 14
TB 602
TT 116
If there are no leading spaces in the actual file, use $1 instead of $2.
$ awk -F '[ ,]+' '{for(i=1; i<=NF; i++) s+=$i; print $2, s; s=0}' <<EOF
 CG AT,AA,CA 17 1,1,1
 GT 14
 TB AC,TC,TA,GG,TT,AR,NN,NM,AB,AT,TT,TC,CA,BB,GT,AT,XT,MT,NA,TT 552 6,1,1,2,2,1,2,1,5,3,4,1,2,1,1,1,3,4,5,4
 TT CG,GT,TA,GB 105 3,4,1,3
EOF
CG 20
GT 14
TB 602
TT 116
-F '[ ,]+' means "fields are delimited by one or more spaces or commas".
There is no condition associated with the {action}, so it will be performed on every line.
NF is the Number of Fields, and $X refers to the Xth field.
Non-numeric strings evaluate to 0 in arithmetic, so we can simply add every field together to get a sum.
After we print the first non-blank field and our sum, we reset the sum for the next line.
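That coercion is easy to check: in a numeric context awk uses the leading numeric prefix of a string, or 0 if there is none:

```shell
awk 'BEGIN { print "AT" + 0, "1,1,1" + 0, 17 + "AA" }'
# prints: 0 1 17
```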
Here is a solution coded to follow your instruction as closely as possible (with no field-splitting tricks so that it's easy to reason about):
awk '
NF == 2 {
print $1, $2
next
}
NF == 4 {
N = split($4, f, /,/)
for (i = 1; i <= N; ++i)
$3 += f[i]
print $1, $3
}'
I noticed, though, that your input section contains leading spaces. If leading spaces are actually present (and are irrelevant), we can add a leading { sub(/^ +/, "") } to the script.
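Putting both pieces together, a sketch that strips any leading spaces first (input.txt is a hypothetical file name):

```shell
awk '
{ sub(/^ +/, "") }            # drop irrelevant leading spaces, if any
NF == 2 { print $1, $2; next }
NF == 4 {
    N = split($4, f, /,/)
    for (i = 1; i <= N; ++i)
        $3 += f[i]
    print $1, $3
}' input.txt
```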
How can I join all fields of each row from both files after matching on a key field? And how can this one-liner be generalized when the number of fields in f2 is unknown?
f2:
a 1 2
b 3 4
c 5 6
f3:
10 a x y z
11 g x y z
12 j x y z
observed:
a 10 x y z
a1 10 x y z
Desired:
a 1 2 10 x y z
These are my best attempts but are incorrect:
awk 'FNR==NR{a[$1]=$2;next} ($2 in a) {print a[$2],$0}' f2.txt f3.txt > f4.txt
awk 'FNR==NR{a[$1]=$2$3;next} ($2 in a) {print a[$2],$0}' f2.txt f3.txt > f4.txt
awk 'NR==FNR{a[$1]=$0;next} ($2 in a){print a[$2],$1,$3,$4,$5}' f2.txt f3.txt > f4.txt
Save the whole line as the value and column 1 as the key; when reading the 2nd file, check whether column 2 is in array a and, if it is, print a[$2] followed by the rest of the columns.
A shorter way (the disadvantage of this command is that there is one extra space between 10 and x):
awk 'NR==FNR{a[$1]=$0;next} ($2 in a){second=$2; $2="";print a[second],$0}' f2.txt f3.txt > f4.txt
Replace $2 of the 2nd file with the empty string, then print the whole line $0.
If your files are sorted on the keys, as in your example, join is the tool for this task:
join -1 1 -2 2 f2.txt f3.txt
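join does require both inputs to be sorted on their join fields; when they are not, a sketch using bash process substitution takes care of it:

```shell
# Sort f2.txt on field 1 and f3.txt on field 2, then join.
# join prints the key first, then the remaining fields of each file:
#   a 1 2 10 x y z
join -1 1 -2 2 <(sort -k1,1 f2.txt) <(sort -k2,2 f3.txt)
```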
@mxttgen31: try:
awk 'FNR==NR{Q=$2;$2="";A[Q]=$0;next} ($1 in A){print $0,A[$1]}' f3 f2
Explanation of the above command:
awk 'FNR==NR{     ##### FNR==NR is TRUE only while the first Input_file (f3 here) is being read.
                  ##### FNR and NR both count lines; the only difference is that, since awk can read multiple files,
                  ##### FNR is RESET at the start of each new file while NR keeps increasing until all files are read.
Q=$2;             ##### Assigning the second field's value to variable Q.
$2="";            ##### Making the second field's value NULL now.
A[Q]=$0;          ##### Creating an array named A whose index is Q (the saved key) and whose value is the current line.
next}             ##### next (awk's built-in keyword) skips all further statements and moves on to the next record.
($1 in A)         ##### The cursor comes here only while the second Input_file is being read; check whether $1 (first field) is present in array A.
{print $0,A[$1]}  ##### Print the current line ($0) of file f2 followed by array A's value whose index is $1 of f2.
' f3 f2           ##### Mentioning the Input_files here.
I have a txt file containing a total of 10177 columns and approximately 450,000 rows, with the information separated by tabs. I am trying to trim the file down using awk so that it only prints columns 1-3, column 5, and every 14th column after the fifth.
My file has a format that looks like:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ... 10177
A B C D E F G H I J K L M N O P Q R S T ...
X Y X Y X Y X Y X Y X Y X Y X Y X Y X Y ...
I am hoping to generate an output txt file (also separated with tab) that contains:
1 2 3 5 18 ...
A B C E R ...
X Y X X Y ...
The current awk code I have looks like this (I am using Cygwin to run it):
$ awk -F"\t" '{OFS="\t"} { for(i=5;i<10177;i+=14) printf ($i) }' test2.txt > test3.txt
But the result I am getting shows something like:
123518...ABCER...XYXXY...
When opened in Excel, the results are all mashed into a single cell.
In addition, when I try to include the code
for (i=0;i<=3;i++) printf "%s ",$i
in the awk script to get the first 3 columns, it just prints out the original input line together with the mashed result. I am not familiar with awk, so I am not sure what causes this issue.
Awk field numbers, string indices, and array indices all start at 1, not 0, so when you do:
for (i=0;i<=3;i++) printf "%s ",$i
the first iteration prints $0 which is the whole record.
You're on the right track with:
$ awk -F"\t" '{OFS="\t"} { for(i=5;i<10177;i+=14) printf ($i) }' test2.txt > test3.txt
but never pass input data as the only argument to printf: printf then treats it as a format string with no data (rather than what you want, a plain string format plus your data), and that will fail cryptically if/when your input data contains formatting characters like %s or %d. So always use printf "%s", $i, never printf $i.
The problem you're having with Excel, I would guess, is that you're double-clicking the file and hoping Excel knows what to do with it (it won't, unlike if this were a CSV). You can import tab-separated files into Excel after it's opened though - google that.
You want something like:
awk '
BEGIN { FS=OFS="\t" }
{
for (i=1; i<=3; i++) {
printf "%s%s", (i>1?OFS:""), $i
}
for (i=5; i<=NF; i+=14) {
printf "%s%s", OFS, $i
}
print ""
}
' file
I highly recommend the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
In awk, using the conditional operator in the for increment:
$ awk 'BEGIN { FS=OFS="\t" }
{
for(i=1; i<=NF; i+=( i<3 ? 1 : ( i==3 ? 2 : 14 )))
printf "%s%s", $i, ( (i+14)>NF ? ORS : OFS)
}' file
1 2 3 5 19
A B C E S
X Y X X X
In the for loop: if i<3, increment by one; if i==3, increment by two to get to 5; and after that increment by 14.
I would be tempted to solve the problem along the following lines. I think you'll find you save time by not iterating in awk.
$ cols="$( { seq 1 3; seq 5 14 10177; } | sed 's/^/$/' | paste -sd, - )"
$ awk -F'\t' -v OFS='\t' "{print $cols}" test2.txt
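If plain column extraction is all that's needed, cut with a generated field list does the same job without any awk at all (assuming GNU seq for the -s option):

```shell
# Build the field list 1-3,5,19,33,... and hand it to cut;
# cut keeps the input tab delimiter by default.
cut -f"1-3,$(seq -s, 5 14 10177)" test2.txt > test3.txt
```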