Remove empty columns with awk

I have an input file which is tab-delimited, and I want to remove all empty columns. The empty columns are: $13, $14, $15, $84, $85, $86, $87, $88, $89, $91, $94.
INPUT: tsv file with more than 90 columns
a b d e g...
a b d e g...
OUTPUT: tsv file without empty columns
a b d e g....
a b d e g...
Thank you

This might be what you want:
$ printf 'a\tb\tc\td\te\n'
a b c d e
$ printf 'a\tb\tc\td\te\n' | awk 'BEGIN{FS=OFS="\t"} {$2=$4=""} 1'
a c e
$ printf 'a\tb\tc\td\te\n' | awk 'BEGIN{FS=OFS="\t"} {$2=$4=RS; gsub("(^|"FS")"RS,"")} 1'
a c e
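The difference between the two is easy to miss when tabs are displayed: the first command empties the fields but leaves their separators behind, while the gsub() version deletes them entirely. Piping through cat -A (GNU coreutils), which shows tabs as ^I, makes that visible:
$ printf 'a\tb\tc\td\te\n' | awk 'BEGIN{FS=OFS="\t"} {$2=$4=""} 1' | cat -A
a^I^Ic^I^Ie$
$ printf 'a\tb\tc\td\te\n' | awk 'BEGIN{FS=OFS="\t"} {$2=$4=RS; gsub("(^|"FS")"RS,"")} 1' | cat -A
a^Ic^Ie$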
Note that the above doesn't remove all empty columns as some potential solutions might do; it only removes exactly the column numbers you asked to have removed:
$ printf 'a\tb\t\td\te\n'
a b d e
$ printf 'a\tb\t\td\te\n' | awk 'BEGIN{FS=OFS="\t"} {$2=$4=RS; gsub("(^|"FS")"RS,"")} 1'
a e

remove ALL empty columns:
If you have a tab-delimited file with empty columns and you want to remove all of them, that implies you have runs of consecutive tabs. You can replace each run with a single tab, and then also delete a leading tab in case the first column was empty:
sed 's/\t\+/\t/g;s/^\t//' <file>
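For example (GNU sed, since \t and \+ are GNU extensions), on a line with a leading empty column and a run of empty columns:
$ printf '\ta\t\t\tb\tc\n' | sed 's/\t\+/\t/g;s/^\t//'
a b c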
remove SOME columns: see Ed Morton's answer above, or just use cut (the --complement option is a GNU extension):
cut --complement -f 13,14,15,84,85,86,87,88,89,91,94 <file>
remove selected columns if and only if they are empty:
Basically a simple adaptation of Ed Morton's answer (col has to be passed with -v so that it is already set when the BEGIN block runs):
awk -v col=13,14,15,84,85,86,87,88,89,91,94 'BEGIN{FS=OFS="\t"; n=split(col,a,",")}
{ for(i=1;i<=n;++i) if ($a[i]=="") $a[i]=RS; gsub("(^|"FS")"RS,"") }
1' <file>
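A scaled-down demonstration with col=2,4: column 2 is empty so it is removed, while column 4 (non-empty) is kept:
$ printf 'a\t\tc\td\te\n' | awk -v col=2,4 'BEGIN{FS=OFS="\t"; n=split(col,a,",")}
{ for(i=1;i<=n;++i) if ($a[i]=="") $a[i]=RS; gsub("(^|"FS")"RS,"") }
1'
a c d e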

Related

Put a comma in a specific column

I would like to know how to put a comma at one specific column (space). For example:
a b c d e
And I would like this.
a b c d, e
A comma in the 4th space.
I tried with this command.
awk -F '{print $4}' < file.txt | cut -d"," -f4-
$ awk '{$4=$4","}1' file
a b c d, e
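One thing to be aware of: assigning to a field makes awk rebuild the record, so any runs of whitespace in the input collapse to a single OFS:
$ printf 'a  b\tc d e\n' | awk '{$4=$4","}1'
a b c d, e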
If you have exactly 5 fields in your Input_file (or, more generally, if you want to do this for the second-to-last field no matter how many fields there are), the following may also help:
awk '{$(NF-1)=$(NF-1)","} 1' Input_file
Or with sed, simply replace the 4th space with a comma and a space:
sed 's/ /, /4' Input_file
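For example:
$ echo 'a b c d e' | sed 's/ /, /4'
a b c d, e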
With GNU awk's gensub() you can replace just the 4th space without touching the rest of the line:
$ echo a b c d e | awk '{$0=gensub(/ /,", ",4)}1'
a b c d, e

awk remove mirrored duplicates from 2 columns

Big question:
I want a list of the unique combinations between two fields in a data frame.
Example data:
A B
C D
E F
B A
C F
E F
I would like to be able to get the result of 4 unique combinations: AB, CD, EF, and CF. Since AB and BA contain the same components but in a different order, I only want one copy (it is a mutual relationship, so BA is the same thing as AB).
Attempt:
So far I have tried sorting and keeping unique lines:
sort file | uniq
but of course that produces 5 combinations:
A B
C D
E F
B A
C F
I do not know how to approach AB/BA being considered the same. Any suggestions on how to do this?
The idiomatic awk approach is to order the index parts:
$ awk '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file
A B
C D
E F
C F
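Using $1 FS $2 (rather than plain $1 $2) keeps the key unambiguous, so the same one-liner works for fields longer than one character:
$ printf 'foo bar\nbar foo\nfoo bar\n' | awk '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++'
foo bar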
Another bit of awk magic:
awk '!a[$1,$2] && !a[$2,$1]++' file
In awk:
$ awk '($1$2 in a){next}{a[$1$2];a[$2$1]}1' file
A B
C D
E F
C F
Explained:
($1$2 in a) { next } # if duplicate in hash, next record
{ a[$1$2]; a[$2$1] } 1 # hash reverse also and output
It works for single-char fields. If you want to use it for longer strings, add FS between the fields, like a[$1 FS $2] etc. (thanks @EdMorton).

paste command leaves empty rows

Hi, I was trying to paste multiple files together (each with a single column but a different number of rows).
paste file1.txt file2.txt file3.txt ... file100.txt > out.txt
input file 1:
A
B
C
input file 2:
D
E
input file 3:
F
G
H
I
J
output:
A D F
B E G
C H
I
J
When I cut column 2 (cut -f2) from out.txt, it gives column 2 with 3 empty rows (probably because column 3 has 5 rows, so extra empty rows were created to match it). The same goes for column 1 (less out.txt | cut -f1), which gives 2 empty rows. Any ideas why it shows the empty rows?
less out.txt | cut -f1
A
B
C
empty cell
empty cell
less out.txt | cut -f2
D
E
empty cell
empty cell
empty cell
I was expecting to see-
column 1
A
B
C
column 2
D
E
None of the rows are empty, some of them just don't have all of the fields populated but they do still have the field separators (tabs) that paste output. cut has no way of knowing that you don't want the empty fields printed.
Try:
awk -v f=1 -F'\t' '$f!=""{print $f}' file
awk -v f=2 -F'\t' '$f!=""{print $f}' file
instead.
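For example, for column 2 of the out.txt from above:
$ awk -v f=2 -F'\t' '$f!=""{print $f}' out.txt
D
E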
We can use awk to replace the blank fields with an "empty cell" placeholder. Here I have used '|' as the delimiter to make the replacement operation unambiguous. The delimiter can then be converted back to tabs with tr '|' '\t' if wanted.
$ paste -d '|' {a,b,c}.txt|awk 'BEGIN{FS=OFS="|"} {print ($1=="" ? "empty cell" : $1), ($2=="" ? "empty cell" : $2), ($3=="" ? "empty cell" : $3)}'
Output, pipe-delimited:
A|D|F
B|E|G
C|empty cell|H
empty cell|empty cell|I
empty cell|empty cell|J
Output after converting the delimiter to tabs:
A D F
B E G
C empty cell H
empty cell empty cell I
empty cell empty cell J

Print Specific Columns using AWK

I'm trying to fetch the data from columns B to D of a tab-delimited file "FILE". The simple AWK code I use fetches the data, but unfortunately keeps the output in a single column and removes the identifiers (shown below).
Any suggestions please.
CODE
awk '{for(i=2;i<=4;++i)print $i}' FILE
FILE
A B C D E F G
1_at 10.8435630935 10.8559287854 8.6666141543 8.820310681 9.9024050571 8.613199083 11.9807771094
2_at 4.7615531106 4.5209119307 11.2467919586 8.8105151099 7.1831990104 11.0645055836 4.3726598561
3_at 6.0025262754 5.4058080843 3.2475272982 3.1869728585 3.5654989547
OUTPUT OBTAINED
B
C
D
10.8435630935
10.8559287854
8.6666141543
4.7615531106
4.5209119307
11.2467919586
6.0025262754
5.4058080843
3.2475272982
Why don't you directly use cut?
$ cut -d$'\t' -f2-4 < file
B C D
10.8435630935 10.8559287854 8.6666141543
4.7615531106 4.5209119307 11.2467919586
6.0025262754 5.4058080843 3.2475272982
With awk you would need printf to avoid the newline that print adds after each call:
awk -F"\t" '{for(i=2;i<=4;++i) printf "%s%s", $i, (i==4?RS:FS)}'

In AWK, is it possible to specify "ranges" of fields?

In AWK, is it possible to specify "ranges" of fields?
Example. Given a tab-separated file "foo" with 100 fields per line, I want to print only the fields 32 to 57 for each line, and save the result in a file "bar". What I do now:
awk 'BEGIN{OFS="\t"}{print $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57}' foo > bar
The problem with this is that it is tedious to type and prone to errors.
Is there some syntactic form which allows me to say the same in a more concise and less error prone fashion (like "$32..$57") ?
Besides the awk answer by @Jerry, there are other alternatives:
Using cut (assumes tab delimiter by default):
cut -f32-57 foo >bar
Using perl:
perl -nle '@a=split;print join "\t", @a[31..56]' foo >bar
Mildly revised version:
BEGIN { s = 32; e = 57; }
{ for (i=s; i<=e; i++) printf("%s%s", $(i), i<e ? OFS : "\n"); }
You can do it in awk by using RE intervals. For example, to print fields 3-6 of the records in this file:
$ cat file
1 2 3 4 5 6 7 8 9
a b c d e f g h i
would be:
$ gawk 'BEGIN{f="([^ ]+ )"} {print gensub("("f"{2})("f"{4}).*","\\3","")}' file
3 4 5 6
c d e f
I'm creating an RE segment f to represent every field plus its succeeding field separator (for convenience), then I'm using that in the gensub to delete 2 of those (i.e. the first 2 fields), remember the next 4 for later reference using \3, and then delete what comes after them. For your tab-separated file where you want to print fields 32-57 (i.e. the 26 fields after the first 31) you'd use:
gawk 'BEGIN{f="([^\t]+\t)"} {print gensub("("f"{31})("f"{26}).*","\\3","")}' file
The above uses GNU awk for its gensub() function. With other awks you'd use sub() or match() and substr().
EDIT: Here's how to write a function to do the job:
gawk '
function subflds(s,e, f) {
f="([^" FS "]+" FS ")"
return gensub( "(" f "{" s-1 "})(" f "{" e-s+1 "}).*","\\3","")
}
{ print subflds(3,6) }
' file
3 4 5 6
c d e f
Just set FS as appropriate. Note that this will need a tweak for the default FS if your input file can start with spaces and/or have multiple spaces between fields, and it will only work if your FS is a single character.
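Applied to the original question (tab-separated input, fields 32 to 57, using the question's file names), the call might look like:
gawk 'BEGIN { FS = "\t" }
function subflds(s,e, f) {
f = "([^" FS "]+" FS ")"
return gensub( "(" f "{" s-1 "})(" f "{" e-s+1 "}).*","\\3","")
}
{ print subflds(32,57) }' foo > bar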
I'm late, but this is quick and to the point, so I'll leave it here. In cases like this I normally just remove the fields I don't need with gsub and print. Quick and dirty example: since you know your file is delimited by tabs, you can remove the first 31 fields:
awk '{gsub(/^([^\t]+\t){31}/,"");print}'
An example of removing 4 fields, because lazy:
printf "a\tb\tc\td\te\tf\n" | awk '{gsub(/^([^\t]+\t){4}/,"");print}'
Output:
e f
This is shorter to write, easier to remember, and uses fewer CPU cycles than horrendous loops.
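To get exactly fields 32 to 57 with this approach you also need to drop the trailing fields, e.g. by truncating NF after the gsub. A sketch (assigning a smaller value to NF to truncate the record works in gawk and mawk):
awk 'BEGIN{FS=OFS="\t"} {gsub(/^([^\t]+\t){31}/,""); NF=26; print}' foo > bar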
You can use a combination of loops and printf for that in awk:
#!/bin/bash
start_field=32
end_field=57
awk -v start=$start_field -v end=$end_field 'BEGIN{FS=OFS="\t"}
{for (i=start; i<=end; i++) {
printf "%s", $i;
if (i < end) {
printf "%s", OFS;
} else {
printf "\n";
}
}}' foo > bar
This looks a bit hacky. However:
it properly delimits your output based on the specified OFS, and
it makes sure to print a new line at the end for each input line in the file.
I do not know a way to do field range selection in awk. I know how to drop fields at the end of the input (see below), but not easily at the beginning. Below is the hard way to drop fields at the beginning.
If you know a character c that is not included in your input, you could use the following awk script:
BEGIN { s = 32; e = 57; c = "#"; }
{ NF = e # Drop the fields after e.
$s = c $s # Put a c in front of the s field.
sub(".*"c, "") # Drop the chars before c.
print # Print the edited line.
}
EDIT:
And I just realized that you can always find a character that is not in the input: use \n, since it is the record separator and therefore can never appear inside a record.
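A quick demonstration with the newline marker, printing fields 3 to 6 (this relies on gawk, where '.' in a regexp also matches a newline):
$ printf '1 2 3 4 5 6 7 8 9\n' | gawk 'BEGIN{s=3; e=6; c="\n"} {NF=e; $s=c $s; sub(".*"c,""); print}'
3 4 5 6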
Unfortunately I don't seem to have access to my account anymore, but I also don't have 50 rep to add a comment anyway.
Bob's answer can be simplified a lot using 'seq':
echo $(seq -s ,\$ 5 9| cut -d, -f2-)
$6,$7,$8,$9
The minor disadvantage is you have to specify your first field number as one lower.
So to get fields 3 through 7, I specify 2 as the first argument.
seq -s ,\$ 2 7 sets the separator for seq to ',$' and yields 2,$3,$4,$5,$6,$7
cut -d, -f2- sets the field delimiter to ',' and basically cuts off everything before the first comma by showing everything from the second field on, thus resulting in $3,$4,$5,$6,$7
When combined with Bob's answer, we get:
$ cat awk.txt
1 2 3 4 5 6 7 8 9
a b c d e f g h i
$ awk "{print $(seq -s ,\$ 2 7| cut -d, -f2-)}" awk.txt
3 4 5 6 7
c d e f g
$
I use this simple function, which does not check that the field range exists in the line.
function subby(f,l, s,i) {
s = $f
for(i=f+1;i<=l;i++)
s = sprintf("%s %s",s,$i)
return s
}
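For example, with the function pasted into the script, printing fields 3 to 6 of the awk.txt sample from earlier:
$ awk '{ print subby(3,6) }
function subby(f,l, s,i) { s = $f; for(i=f+1;i<=l;i++) s = sprintf("%s %s",s,$i); return s }' awk.txt
3 4 5 6
c d e f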
(I know OP requested "in AWK" but ... )
Using bash expansion on the command line to generate arguments list;
$ cat awk.txt
1 2 3 4 5 6 7 8 9
a b c d e f g h i
$ awk "{print $(c="" ;for i in {3..7}; do c=$c\$$i, ; done ; c=${c%%,} ; echo $c ;)}" awk.txt
3 4 5 6 7
c d e f g
Explanation:
c="" # var to hold args list
for i in {3..7} # the required variable range 3 - 7
do
# replace c's value with concatenation of existing value, literal $, i value and a comma
c=$c\$$i,
done
c=${c%%,} # remove trailing/final comma
echo $c #return the list string
Placed on a single line using semicolons, inside $() to evaluate/expand in place.