awk to read specific column from a file

I have a small problem and I would appreciate help with it.
In summary, I have a file:
1,5,6,7,8,9
2,3,8,5,35,3
2,46,76,98,9
I need to read specific columns from it and print them into another text document. I know I can use awk '{print $2, $3}' to print the second and third columns beside each other. However, I need to use two separate statements, such as awk '{print $2}' >> file.text followed by awk '{print $3}' >> file.text, but then the two columns appear under each other and not beside each other.
How can I make them appear beside each other?

If you must extract the columns in separate processes, use paste to stitch them together. I assume your shell is bash/zsh/ksh.
paste -d, <(awk -F, '{print $2}' file) <(awk -F, '{print $3}' file)
produces
5,6
3,8
46,76
Without the process substitutions:
awk -F, '{print $2}' file > tmp1
awk -F, '{print $3}' file > tmp2
paste -d, tmp1 tmp2 > output
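If a single pass turns out to be acceptable after all, one awk invocation avoids the temporary files entirely (a minimal sketch, using the comma separator of the sample data):
awk -F, '{print $2 "," $3}' file > file.text
This writes the same 5,6 / 3,8 / 46,76 output as the paste pipelines above.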
Update based on your answer:
On first appearance, that's a confusing setup. Does this work?
for (( x=1; x<=$number_of_features; x++ )); do
    feature_number=$(sed -n "$x {p;q}" feature.txt)
    if [[ ! -f out.txt ]]; then
        cut -d, -f$feature_number file.txt > out.txt
    else
        paste -d, out.txt <(cut -d, -f$feature_number file.txt) > tmp &&
        mv tmp out.txt
    fi
done
That has to read the file.txt file a number of times. It would clearly be more efficient to only have to read it once:
awk -F, -v numfeat=$number_of_features '
# read the feature file into an array
NR==FNR {
    colno[++i] = $0
    next
}
# now, process file.txt and emit the desired columns
{
    sep = ""
    for (i=1; i<=numfeat; i++) {
        printf "%s%s", sep, $(colno[i])
        sep = FS
    }
    print ""
}
' feature.txt file.txt > out.txt
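As an illustration (an assumption about the inputs): if feature.txt contains the two lines 2 and 5, and file.txt is the comma-separated sample from the question, then with number_of_features set to 2 the script writes
5,8
3,35
46,9
to out.txt.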

Thanks all for contributing answers. I believe I should have been clearer in my question; sorry for that.
My code is as follow:
for (( x = 1; x <= $number_of_features ; x++ )) # the number extracted from a text file
do
feature_number=$(awk 'FNR == "'$x'" {print}' feature.txt)
awk -F, '{print $"'$feature_number'"}' file.txt >> out.txt
done
Basically, I extract the feature number (which is the same as the column number) from a text document and then print that column. The text document may contain many feature numbers.
The thing is, each time I have different feature numbers (which reflect the column numbers), so applying the above solutions is not sufficient for this problem.
I hope it is clearer now.
Waiting for your comments, please.
Thanks
Ahmad

Instead of using awk's file redirection, use shell redirection, e.g.
awk '{print $2,$3}' >> file
The comma is replaced with the value of the output field separator (a space by default).
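Since the sample rows are comma separated, you would also set the field separators, e.g. (a sketch along the same lines):
awk -F, -v OFS=, '{print $2,$3}' >> file
This appends 5,6 / 3,8 / 46,76, with the two columns beside each other.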

Related

Bash command to print even number columns from multiple files

I have files t_1.24/data.dat, t_2.48/data.dat, t_3.72/data.dat ... and each file has two columns. I want to grab the 2nd column of each file and put them together column by column. I know I can paste them together and do awk '{print $2, $4, ..., $(2*n)}', but since I have a large number of files, that is obviously not a good way to do it and I believe there are much better solutions. Could anyone give some suggestions to solve this?
Edited: In my case, the files have the same number of lines and each column is separated by space without header. For example, if t_10.48/data.dat is:
9.10000000e+00 -1.14092155e-03
9.10023800e+00 -1.14131197e-03
9.10047601e+00 -1.14171327e-03
9.10071401e+00 -1.14212571e-03
t_2.14/data.dat is:
9.10000000e+00 -1.09822747e-03
9.10023800e+00 -1.09833529e-03
9.10047601e+00 -1.09844835e-03
9.10071401e+00 -1.09856643e-03
what I want is :
-1.09822747e-03 -1.14092155e-03
-1.09833529e-03 -1.14131197e-03
-1.09844835e-03 -1.14171327e-03
-1.09856643e-03 -1.14212571e-03
And I do need to paste them in the order of original file name (eg. t_2.48 has to be before t_10.48).
$ paste $(printf '%s\n' t_*/data.dat | sort -t'_' -k2,2n) |
awk '{for (i=2; i<=NF; i+=2) printf "%s%s", $i, (i<NF ? OFS : ORS)}'
-1.09822747e-03 -1.14092155e-03
-1.09833529e-03 -1.14131197e-03
-1.09844835e-03 -1.14171327e-03
-1.09856643e-03 -1.14212571e-03
Use cut and paste:
paste <(cut -d' ' -f2 file1) <(cut -d' ' -f2 file2) ...
You can also generate and run the command in bash using a Perl one-liner like so:
perl -e '$cmd = join q{ }, q{paste}, map { "<(cut -d\x27 \x27 -f2 $_)" } @ARGV; system qq{bash -c "$cmd"};' file1 file2 ...
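A plain-bash sketch of the same command-generation idea, assuming bash and whitespace-free file names (combined.dat is just an example output name):
cmd="paste"
for f in $(printf '%s\n' t_*/data.dat | sort -t'_' -k2,2n); do
    cmd+=" <(cut -d' ' -f2 $f)"
done
eval "$cmd" > combined.dat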

Removing content of a column based on number of occurrences

I have a file (;-separated) with data like this:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3;000-200.2
211111121;000-000.1;000-000.2
I would like to remove any $3 that has fewer than 3 occurrences, so the outcome would be like:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
That is, only that $3 got deleted, as it had only a single occurrence.
Sadly I am not really sure whether (and thus how) this could be done relatively easily (doing the =COUNT.IF matching and manual deletion in Excel feels quite embarrassing).
$ awk -F';' 'NR==FNR{cnt[$3]++;next} cnt[$3]<3{sub(/;[^;]+$/,"")} 1' file file
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
or if you prefer:
$ awk -F';' 'NR==FNR{cnt[$3]++;next} {print (cnt[$3]<3 ? $1 FS $2 : $0)}' file file
This awk one-liner can help; it processes the file twice:
awk -F';' -v OFS=';' 'NR==FNR{a[$3]++;next}a[$3]<3{NF--}7' file file
Decrementing NF makes awk rebuild the record using OFS (a space by default), hence the explicit OFS=';'; the trailing 7 is simply a true pattern that prints every line.
Though the awk solutions are the best in terms of performance, your goal could also be achieved with something like this:
while IFS=" " read a b; do
    if [[ "$a" -lt "3" ]]; then
        sed -i "s/;$b\$//" b.txt
    fi
done <<<"$(cut -d";" -f3 b.txt |sort |uniq -c)"
The operation is based on the output of cut | sort | uniq -c, which counts the occurrences:
$ cut -d";" -f3 b.txt |sort |uniq -c
7 000-000.2
1 000-200.2
3 020-000.8
The above edits the source file in place, so keep a backup for testing.
You can feed the file twice to awk. On the first run you gather a statistic that you use in the second run:
script.awk
FNR == NR { stats[ $3 ]++
            next
          }
{ if( stats[$3] < 3 ) print $1 FS $2
  else                print
}
Run it like this: awk -F\; -f script.awk yourfile yourfile .
The condition FNR == NR is true during processing of the first filename given to awk. The next statement skips the second block.
Thus the second block is only used for processing the second filename given to awk (which is here the same as the first filename).
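A quick way to see why the FNR == NR trick works (a sketch; sample.txt stands for any small test file, given twice):
awk '{ print FILENAME, NR, FNR }' sample.txt sample.txt
NR keeps counting across both passes while FNR restarts at 1 for the second copy, so FNR == NR only holds during the first pass.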

awk printing the second to last record of a file

I have a file set up like
Words on
many line
%
More Words
on many lines
%
Even More Words
on many lines
%
and I would like to output the second to last record of this file where the record is delimited by % after each block of text.
I have used:
awk -v RS=\% ' END{ print NR }' $f
to find the number of records (1136). Then I did
awk -v RS=\% ' { print $(NR-1) }' $f
and
awk -v RS=\% ' { print $(NR=1135) }' $f
Neither of these worked; instead, they displayed a record towards the beginning of the file and many blank lines.
OUTPUT:
"You know, of course, that the Tasmanians, who never committed adultery, are
now extinct."
-- M. Somerset Maugham
"The
is
what
that
This output had many, many more blank lines and contained a record near the middle of the file.
awk -v RS=\% 'END{ print $(NR-1) }' $f
returns a blank line. The same command with different $(NR-x) values also returns a blank line.
Can someone help me to print the second to last record in this case?
Thanks
You can do:
awk '{this=last;last=$0} END{print this}' file
Or, if you don't mind having the entire file in memory:
awk '{a[NR]=$0} END{print a[NR-1]}' file
Or, if it is just line count (or record count) based, you can keep a rolling deletion going so you are not too piggish on memory:
$ seq 999999 | tail -2
999998
999999
$ seq 999999 | awk '{a[NR]=$0; delete a[NR-3]} END{print a[NR-1]}'
999998
If they are blocks of text the same method works if you can separate the blocks into delimited records.
Given:
$ echo "$txt"
Words on
many line
%
More Words
on many lines
%
Even More Words
on many lines
%
You can do:
$ echo "$txt" | awk -v RS=\% '{a[NR]=$0} END{print a[NR-1]}'
Even More Words
on many lines
$ echo "$txt" | awk -v RS=\% '{a[NR]=$0} END{print a[NR-2]}'
More Words
on many lines
If you want to not print the leading and trailing \n you can do:
$ echo "$txt" | awk 'BEGIN{RS="%\n"} {a[NR]=$0} END{printf a[NR-2]}'
Words on
many line
Finally, if you know the specific record you want to print, do it this way in awk:
$ seq 999999 | awk -v mrk=1135 'NR==mrk{print; exit}'
1135
If you want a random record, you can do:
$ awk -v min=1 -v max=1135 'BEGIN{srand()
RS="%\n"
tgt=int(min+rand()*(max-min+1))
}
NR==tgt{print; exit}' file
Does the solution have to be with awk? Just using head and tail would be simpler.
tail -2 file.txt | head -1 > justthatline.txt
The best way for this would be to use the BEGIN construct.
awk 'BEGIN{RS="%\n"; ORS="%\n"}(NR>=2){print}' file
RS and ORS set the input and output record separators respectively.
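If the goal is strictly the second-to-last record, a buffering variant of the same RS idea may be closer to what was asked (a sketch; the multi-character RS assumes GNU awk, as above):
awk -v RS='%\n' '{prev = rec; rec = $0} END{printf "%s", prev}' file
For the sample input this prints the "More Words" block.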

awk first line not working removing columns

I'm trying to remove columns beyond number 26 from all lines of a file, using this code:
awk '{ FS = ";" ; for(i=1;i<NF;i++) if (i<26) printf $i FS}{print $26}'
It works well on all lines except the first one, where it shows 2 more fields (and cuts the last field in two).
Is there anything wrong in my code?
Thanks a lot
This is because you set FS on every line, while it should be in a BEGIN{} block (or outside as a parameter, as other answers correctly suggest):
awk 'BEGIN{FS=";"} {for(i=1;i<NF;i++) if (i<26) printf $i FS}{print $26}' file
In fact, to accomplish your goal it is easier to use cut:
cut -d';' -f-26 file
Here -d';' specifies the delimiter and -f-26 selects all fields up to the 26th.
Example with 4 cols
sample file:
$ cat a
1col1;col2;col3;col4;col5;col6
2col1;col2;col3;col4;col5;col6
3col1;col2;col3;col4;col5;col6
previous code:
$ awk '{FS=";"; for(i=1;i<NF;i++) if (i<4) printf $i FS}{print $4}' a
2col1;col2;col3;col4
3col1;col2;col3;col4
new code:
$ awk 'BEGIN{FS=";"} {for(i=1;i<NF;i++) if (i<4) printf $i FS}{print $4}' a
1col1;col2;col3;col4
2col1;col2;col3;col4
3col1;col2;col3;col4
with cut:
$ cut -d';' -f-4 a
1col1;col2;col3;col4
2col1;col2;col3;col4
3col1;col2;col3;col4
You can try this awk:
awk -F';' 'NF>26{NF=26}1' OFS=';' yourfile
@fedorqui is right.
But you can also use -F to set the field separator:
awk -F";" '{for(i=1;i<NF;i++) if (i<26) printf $i FS}{print $26}' file

Using each line of awk output as grep pattern

I want to find every line of a file that contains any of the strings held in a column of a different file.
I have tried
grep "$(awk '{ print $1 }' file1.txt)" file2.txt
but that just outputs file2.txt in its entirety.
I know I've done this before with a pattern I found on this site, but I can't find that question anymore.
I see in the OP's comment that maybe the question is no longer a question. However, the following slight modification will handle the blank line situation. Just add a check to make sure the line has at least one field:
grep "$(awk '{if (NF > 0) print $1}' file1)" file2
And if the file with the patterns is simply a set of patterns per line, then a much simpler version of it is:
grep -f file1 file2
That causes grep to use the lines in file1 as the patterns.
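If the strings in file1 are literal values rather than regular expressions (an assumption about the data), adding -F makes grep treat each pattern as a fixed string:
grep -Ff file1 file2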
There is no need to use grep when you have awk:
awk 'FNR==NR&&NF{a[$0];next}($1 in a)' file2 file1
Or pipe the awk output to grep as a pattern list read from stdin:
awk '{ print $1 }' file1.txt | grep -f - file2.txt > file.txt