Word Count using AWK - awk

I have file like below :
this is a sample file
this file will be used for testing
this is a sample file
this file will be used for testing
I want to count the words using AWK.
the expected output is
this 2
is 1
a 1
sample 1
file 2
will 1
be 1
used 1
for 1
the below AWK I have written but getting some errors
cat anyfile.txt|awk -F" "'{for(i=1;i<=NF;i++) a[$i]++} END {for(k in a) print k,a[k]}'

It works fine for me:
awk '{for(i=1;i<=NF;i++) a[$i]++} END {for(k in a) print k,a[k]}' testfile
used 1
this 2
be 1
a 1
for 1
testing 1
file 2
will 1
sample 1
is 1
PS you do not need to set -F" ", since its default any blank.
PS2, do not use cat with programs that can read data itself, like awk
You can add sort behind code to sort it.
awk '{for(i=1;i<=NF;i++) a[$i]++} END {for(k in a) print k,a[k]}' testfile | sort -k 2 -n
a 1
be 1
for 1
is 1
sample 1
testing 1
used 1
will 1
file 2
this 2

Instead of looping each line and saving the word in array ({for(i=1;i<=NF;i++) a[$i]++}) use gawk with multi-char RS (Record Separator) definition support option and save each field in array as following(It's a little bit fast):
gawk '{a[$0]++} END{for (k in a) print k,a[k]}' RS='[[:space:]]+' file
Output:
used 1
this 2
be 1
a 1
for 1
testing 1
file 2
will 1
sample 1
is 1
In above gawk command I defines space-character-class [[:space:]]+ (including one or more spaces or \new line character) as record separator.

Here is Perl code which provides similar sorted output to Jotne's awk solution:
perl -ne 'for (split /\s+/, $_){ $w{$_}++ }; END{ for $key (sort keys %w) { print "$key $w{$key}\n"}}' testfile
$_ is the current line, which is split based on whitespace /\s+/
Each word is then put into $_
The %w hash stores the number of occurrences of each word
After the entire file is processed, the END{} block is run
The keys of the %w hash are sorted alphabetically
Each word $key and number of occurrences $w{$key} is printed

Related

Filter a file removing lines just with all 0

I need to remove rows from a file with all "0" in the differents columns
Example
seq_1
seq_2
seq_3
data_0
0
0
1
data_1
0
1
4
data_2
0
0
0
data_3
6
0
2
From the example, I need a new file just with the row of data_2. Because it has just all "0" numbers.
I was try using grep and awk but I dont know how to filter just between column $2:4
$ awk 'FNR>1{for(i=2;i<=NF;i++)if($i!=0)next}1' file
Explained:
$ awk 'FNR>1 { # process all data records
for(i=2;i<=NF;i++) # loop all data fields
if($i!=0) # once non-0 field is found
next # on to the next record
}1' file # output the header and all-0 records
Very poorly formated output as the sample data is in some kind of table format which it probably is not IRL:
seq_1 seq_2 seq_3
data_2 0 0 0
With awk you can rely on field string representation:
$ awk 'NR>1 && $2$3$4=="000"' test.txt > result.txt
Using sed, find lines matching a pattern of one or more spaces followed by a 0 (3 times) and if found print the line.
sed -nr '/\s+0\s+0\s+0/'p file.txt > new_file.txt
Or with awk, if columns 2, 3 and 4 are equal to a 0, print the line.
awk '{if ($2=="0" && $3=="0" && $4=="0"){print $0}}' file.txt > new_file.txt
EDIT: I ran the time command on these a bunch of times and the awk version is generally faster. Could add up if you are searching a large file. Of course your mileage may vary!

Sort a file preserving the header as first position with bash

When sorting a file, I am not preserving the header in its position:
file_1.tsv
Gene Number
a 3
u 7
b 9
sort -k1,1 file_1.tsv
Result:
a 3
b 9
Gene Number
u 7
So I am tryig this code:
sed '1d' file_1.tsv | sort -k1,1 > file_1_sorted.tsv
first='head -1 file_1.tsv'
sed '1 "$first"' file_1_sorted.tsv
What I did is to remove the header and sort the rest of the file, and then trying to add again the header. But I am not able to perform this last part, so I would like to know how can I copy the header of the original file and insert it as the first row of the new file without substituting its actuall first row.
You can do this as well :
{ head -1; sort; } < file_1.tsv
** Update **
For macos :
{ IFS= read -r header; printf '%s\n' "$header" ; sort; } < file_1.tsv
a simpler awk
$ awk 'NR==1{print; next} {print | "sort"}' file
$ head -1 file; tail -n +2 file | sort
Output:
Gene Number
a 3
b 9
u 7
Could you please try following.
awk '
FNR==1{
first=$0
next
}
{
val=(val?val ORS:"")$0
}
END{
print first
print val | "sort"
}
' Input_file
Logical explanation:
Check condition FNR==1 to see if its first line; then save its values to variable and move on to next line by next.
Then keep appending all lines values to another variable with new line till last line.
Now come to END block of this code which executes when Input_file is done being read, there print first line value and put sort command on rest of the lines value there.
This will work using any awk, sort, and cut in any shell on every UNIX box and will work whether the input is coming from a pipe (when you can't read it twice) or from a file (when you can) and doesn't involve awk spawning a subshell:
awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
The above uses awk to stick a 0 at the front of the header line and a 1 in front of the rest so you can sort by that number then whatever other field(s) you want to sort on and then remove the added field again with a cut. Here it is in stages:
$ awk -v OFS='\t' '{print (NR>1), $0}' file
0 Gene Number
1 a 3
1 u 7
1 b 9
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2
0 Gene Number
1 a 3
1 b 9
1 u 7
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
Gene Number
a 3
b 9
u 7

print whole variable contents if the number of lines are greater than N

How to print all lines if certain condition matches.
Example:
echo "$ip"
this is a sample line
another line
one more
last one
If this file has more than 3 lines then print the whole variable.
I am tried:
echo $ip| awk 'NR==4'
last one
echo $ip|awk 'NR>3{print}'
last one
echo $ip|awk 'NR==12{} {print}'
this is a sample line
another line
one more
last one
echo $ip| awk 'END{x=NR} x>4{print}'
Need to achieve this:
If this file has more than 3 lines then print the whole file. I can do this using wc and bash but need a one liner.
The right way to do this (no echo, no pipe, no loops, etc.):
$ awk -v ip="$ip" 'BEGIN{if (gsub(RS,"&",ip)>2) print ip}'
this is a sample line
another line
one more
last one
You can use Awk as follows,
echo "$ip" | awk '{a[$0]; next}END{ if (NR>3) { for(i in a) print i }}'
one more
another line
this is a sample line
last one
you can also make the value 3 configurable from an awk variable,
echo "$ip" | awk -v count=3 '{a[$0]; next}END{ if (NR>count) { for(i in a) print i }}'
The idea is to store the contents of the each line in {a[$0]; next} as each line is processed, by the time the END clause is reached, the NR variable will have the line count of the string/file you have. Print the lines if the condition matches i.e. number of lines greater than 3 or whatever configurable value using.
And always remember to double-quote the variables in bash to avoid undergoing word-splitting done by the shell.
Using James Brown's useful comment below to preserve the order of lines, do
echo "$ip" | awk -v count=3 '{a[NR]=$0; next}END{if(NR>3)for(i=1;i<=NR;i++)print a[i]}'
this is a sample line
another line
one more
last one
Another in awk. First test files:
$ cat 3
1
2
3
$ cat 4
1
2
3
4
Code:
$ awk 'NR<4{b=b (NR==1?"":ORS)$0;next} b{print b;b=""}1' 3 # look ma, no lines
[this line left intentionally blank. no wait!]
$ awk 'NR<4{b=b (NR==1?"":ORS)$0;next} b{print b;b=""}1' 4
1
2
3
4
Explained:
NR<4 { # for tghe first 3 records
b=b (NR==1?"":ORS) $0 # buffer them to b with ORS delimiter
next # proceed to next record
}
b { # if buffer has records, ie. NR>=4
print b # output buffer
b="" # and reset it
}1 # print all records after that

Print every second consequtive field in two columns - awk

Assume the following file
#zvview.exe
#begin Present/3
77191.0000 189.320100 0 0 3 0111110 16 1
-8.072430+6-8.072430+6 77190 0 1 37111110 16 2
37 2 111110 16 3
8.115068+6 0.000000+0 8.500000+6 6.390560-2 9.000000+6 6.803440-1111110 16 4
9.500000+6 1.685009+0 1.000000+7 2.582780+0 1.050000+7 3.260540+0111110 16 5
37 2 111110 16 18
What I would like to do, is print in two columns, the fields after line 6. This can be done using NR. The tricky part is the following : Every second field, should go in one column as well as adding an E before the sign, so that the output file will look like this
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
From the output file you see that I want to keep in $6 only length($6)=10 characters.
How is it possible to do it in awk?
can do all in awk but perhaps easier with the unix toolset
$ sed -n '6,7p' file | cut -c2-66 | tr ' ' '\n' | pr -2ats' '
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Here is a awk only solution or comparison
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) {f[++c]=$i;s[c]=$(i+1)}}
END{for(i=1;i<=c;i++) print f[i],s[i]}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Perhaps shorter version,
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) print $i FS $(i+1)}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
to convert format to standard scientific notation, you can pipe the result to
sed or embed something similar in awk script (using gsub).
... | sed 's/[+-]/E&/g'
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
With GNU awk for FIELDWIDTHS:
$ cat tst.awk
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
for (i=1;i<NF;i+=4) {
print $i "E" $(i+1), $(i+2) "E" $(i+3)
}
}
$ awk -f tst.awk file
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
If you really want to get rid of the leading blanks then there's various ways to do it (simplest being gsub(/ /,"",$<field number>) on the relevant fields) but I left them in because the above allows your output to line up properly if/when your numbers start with a -, like they do on line 4 of your sample input.
If you don't have GNU awk, get it as you're missing a LOT of extremely useful functionality.
I tried to combine #karafka 's answer using substr, so the following does the trick!
awk 'NR>=6 && NR<=7{$6=substr($6,1,10);for(i=1;i<=6;i+=2) print substr($i,1,8) "E" substr($i,9) FS substr($(i+1),1,8) "E" substr($(i+1),9)}' file
and the output is
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0

Extracting block of data from a file

I have a problem, which surely can be solved with an awk one-liner.
I want to split an existing data file, which consists of blocks of data into separate files.
The datafile has the following form:
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
And i want to store every single block of data in a separate file, named - for example - "1.dat", ".dat", "3.dat",...
The problem is, that each block doesn't have a specific line number, they are just delimited by two "new lines".
Thanks in advance,
Jürgen
This should get you started:
awk '{ print > ++i ".dat" }' RS= file.txt
If by two "new lines" you mean, two newline characters:
awk '{ print > ++i ".dat" }' RS="\n\n" file.txt
See how the results differ? Setting a null RS (i.e. the first example) is probably what you're looking for.
Another approach:
awk 'NF != 0 {print > $1 ".dat"}' file.txt