I need to filter a file down to the rows where all of the columns are "0"
Example
        seq_1  seq_2  seq_3
data_0      0      0      1
data_1      0      1      4
data_2      0      0      0
data_3      6      0      2
From the example, I need a new file containing just the data_2 row, because it is the only one with all "0" values.
I tried using grep and awk, but I don't know how to filter on just columns $2 to $4.
$ awk 'FNR>1{for(i=2;i<=NF;i++)if($i!=0)next}1' file
Explained:
$ awk 'FNR>1 { # process all data records
for(i=2;i<=NF;i++) # loop all data fields
if($i!=0) # once non-0 field is found
next # on to the next record
}1' file # output the header and all-0 records
The output formatting is poor because the sample data is in some kind of table format, which it probably is not in real life:
seq_1 seq_2 seq_3
data_2 0 0 0
With awk you can rely on the string representation of the fields:
$ awk 'NR>1 && $2$3$4=="000"' test.txt > result.txt
Using sed, find lines matching a pattern of one or more spaces followed by a 0 (three times), and if found, print the line.
sed -nr '/\s+0\s+0\s+0/p' file.txt > new_file.txt
Or with awk: if columns 2, 3 and 4 are each equal to 0, print the line.
awk '{if ($2=="0" && $3=="0" && $4=="0"){print $0}}' file.txt > new_file.txt
EDIT: I ran the time command on these a bunch of times and the awk version is generally faster. Could add up if you are searching a large file. Of course your mileage may vary!
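If you want to compare them yourself, something along these lines works (a rough sketch; file.txt stands in for your real data file, and redirecting to /dev/null times just the scan rather than the writing):
time awk '{if ($2=="0" && $3=="0" && $4=="0"){print $0}}' file.txt > /dev/null
time sed -nr '/\s+0\s+0\s+0/p' file.txt > /dev/null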
I have the following piece of code:
awk '{h[$1]++}; END { for(k in h) print k, h[k]}' ${infile} >> ${outfile2}
Which does part of what I want: printing out the unique values and then also counting how many times these unique values have occurred. Now, I want to print out the 2nd and 3rd column as well from each unique value. For some reason the following does not seem to work:
awk '{h[$1]++}; END { for(k in h) print k, $2, $3, h[k]}' ${infile} >> ${outfile2}
awk '{h[$1]++}; END { for(k in h) print k, h[$2], h[$3], h[k]}' ${infile} >> ${outfile2}
The first prints the 2nd and 3rd columns of the last record read next to every key, whereas the second prints nothing except k and h[k].
${infile} would look like:
20600 33.8318 -111.9286 -1 0.00 0
20600 33.8318 -111.9286 -1 0.00 0
30900 33.3979 -111.8140 -1 0.00 0
29400 33.9455 -113.5430 -1 0.00 0
30600 33.4461 -111.7876 -1 0.00 0
20600 33.8318 -111.9286 -1 0.00 0
30900 33.3979 -111.8140 -1 0.00 0
30600 33.4461 -111.7876 -1 0.00 0
The desired output would be:
20600, 33.8318, -111.9286, 3
30900, 33.3979, -111.8140, 2
29400, 33.9455, -113.5430, 1
30600, 33.4461, -111.7876, 2
You were close, and you can do it all in awk. If you are going to store the count keyed on field 1 and also want fields 2 and 3 available in END for output, you need to store fields 2 and 3 in arrays indexed by field 1 as well (or by whatever field you are keeping count of). For example you could do:
awk -v OFS=', ' '
{ h[$1]++; i[$1]=$2; j[$1]=$3 }
END {
for (a in h)
print a, i[a], j[a], h[a]
}
' infile
Here h[$1] holds the count of the number of times field 1 is seen, using field 1 as the array index. i[$1]=$2 captures field 2 indexed by field 1, and j[$1]=$3 captures field 3 indexed by field 1.
Then within END all that is needed is to output field 1 (a, the index of h), i[a] (field 2), j[a] (field 3), and finally h[a], the count of the number of times field 1 was seen.
Example Use/Output
Using your example data, you can just copy/middle-mouse-paste the code at the terminal with the correct filename, e.g.
$ awk -v OFS=', ' '
> { h[$1]++; i[$1]=$2; j[$1]=$3 }
> END {
> for (a in h)
> print a, i[a], j[a], h[a]
> }
> ' infile
20600, 33.8318, -111.9286, 3
29400, 33.9455, -113.5430, 1
30600, 33.4461, -111.7876, 2
30900, 33.3979, -111.8140, 2
Which provides the desired output. Another option is to use string concatenation to group fields 1, 2 & 3 into a single array index and then output the index together with the count, e.g.
$ awk '{a[$1", "$2", "$3]++}END{for(i in a) print i ", " a[i]}' infile
20600, 33.8318, -111.9286, 3
30600, 33.4461, -111.7876, 2
29400, 33.9455, -113.5430, 1
30900, 33.3979, -111.8140, 2
Look things over and let me know if you have further questions.
GNU datamash is a very handy tool for working on groups of columnar data in files, and it makes this trivial to do.
Assuming your file uses tabs to separate columns like it appears to:
$ datamash -s --output-delimiter=, -g 1,2,3 count 3 < input.tsv
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2
Though it's not much more complicated in awk, using a multi-dimensional array:
$ awk 'BEGIN { OFS=SUBSEP="," }
{ group[$1,$2,$3]++ }
END { for (g in group) print g, group[g] }' input.tsv
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
20600,33.8318,-111.9286,3
30900,33.3979,-111.8140,2
If you want sorted output instead of the arbitrary order here, then with GNU awk you can add PROCINFO["sorted_in"] = "@ind_str_asc" to the BEGIN block, or otherwise pipe the output through sort.
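For example (a sketch assuming your awk is GNU awk), the same command with that PROCINFO line added:
$ awk 'BEGIN { OFS=SUBSEP=","; PROCINFO["sorted_in"] = "@ind_str_asc" }
{ group[$1,$2,$3]++ }
END { for (g in group) print g, group[g] }' input.tsv
This traverses the array in string order of its indices, giving the same sorted output as the datamash version above.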
You can also get the same effect by pipelining a bunch of utilities (including awk and uniq):
$ sort -k1,3n input.tsv | cut -f1-3 | uniq -c | awk -v OFS=, '{ print $2, $3, $4, $1 }'
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2
I have a file like the one below:
this is a sample file
this file will be used for testing
this is a sample file
this file will be used for testing
I want to count the words using AWK.
The expected output is:
this 2
is 1
a 1
sample 1
file 2
will 1
be 1
used 1
for 1
Below is the AWK command I have written, but I am getting some errors:
cat anyfile.txt|awk -F" "'{for(i=1;i<=NF;i++) a[$i]++} END {for(k in a) print k,a[k]}'
It works fine for me:
awk '{for(i=1;i<=NF;i++) a[$i]++} END {for(k in a) print k,a[k]}' testfile
used 1
this 2
be 1
a 1
for 1
testing 1
file 2
will 1
sample 1
is 1
PS: you do not need to set -F" ", since the default field separator is already any run of blanks. The error in your command comes from the missing space between -F" " and the program: the shell joins them into a single argument, so awk treats everything after -F as the field separator and never sees a program.
PS2: do not use cat with programs that can read files themselves, like awk.
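For reference, your original one-liner with just that space added (and the cat dropped):
awk -F" " '{for(i=1;i<=NF;i++) a[$i]++} END {for(k in a) print k,a[k]}' anyfile.txt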
You can pipe the output through sort to sort it:
awk '{for(i=1;i<=NF;i++) a[$i]++} END {for(k in a) print k,a[k]}' testfile | sort -k 2 -n
a 1
be 1
for 1
is 1
sample 1
testing 1
used 1
will 1
file 2
this 2
Instead of looping over each line and saving the words in an array ({for(i=1;i<=NF;i++) a[$i]++}), use gawk, which supports a multi-character RS (Record Separator) definition, and save each record (word) in the array as follows (it's a little bit faster):
gawk '{a[$0]++} END{for (k in a) print k,a[k]}' RS='[[:space:]]+' file
Output:
used 1
this 2
be 1
a 1
for 1
testing 1
file 2
will 1
sample 1
is 1
In the gawk command above, the character class [[:space:]]+ (one or more whitespace characters, including spaces and newlines) is used as the record separator, so every word becomes its own record.
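Equivalently (a sketch of the same idea), RS can be set in a BEGIN block instead of as a command-line variable assignment:
gawk 'BEGIN{RS="[[:space:]]+"} {a[$0]++} END{for (k in a) print k,a[k]}' file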
Here is Perl code which provides similar sorted output to Jotne's awk solution:
perl -ne 'for (split /\s+/, $_){ $w{$_}++ }; END{ for $key (sort keys %w) { print "$key $w{$key}\n"}}' testfile
$_ is the current line, which is split based on whitespace /\s+/
Inside the for loop, each word in turn becomes $_
The %w hash stores the number of occurrences of each word
After the entire file is processed, the END{} block is run
The keys of the %w hash are sorted alphabetically
Each word $key and number of occurrences $w{$key} is printed
Can't find a solution, even though thousands of variations of this question have been asked.
I want to add a column of 1's to a tab-delimited file using awk or sed.
The file will have about 20 million lines, so something efficient would be nice.
turn this:
a b c
r j k
i t w
into this:
a b c 1
r j k 1
i t w 1
One simple way: set the input and output field separators to a tab. NF holds the number of the last column, so assign 1 to the field after it, $(NF+1), to create a new column, and print:
awk 'BEGIN { FS = OFS = "\t" } { $(NF+1) = 1; print $0 }' infile
It yields:
a b c 1
r j k 1
i t w 1
Code for sed:
sed 's/$/&\t1/' file
Assuming you used awk -F'\t' instead of just awk:
{
    print $0 FS 1;
}
If you didn't use the -F option, replace FS 1 with "\t1".
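Put together as a complete command (a sketch; infile stands in for your tab-delimited file):
awk -F'\t' '{ print $0 FS 1 }' infile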
I have two similar files (both with 3 columns). I'd like to check whether these two files contain the same elements, but listed in a different order. First of all, I'd like to compare only the 1st column.
file1.txt
"aba" 0 0
"abc" 0 1
"abd" 1 1
"xxx" 0 0
file2.txt
"xyz" 0 0
"aba" 0 0
"xxx" 0 0
"abc" 1 1
How can I do it using awk? I tried to have a look around, but I've only found complicated examples. What if I also want to include the other two columns in the comparison? The output should give me the number of matching elements.
To print the common elements in both files:
$ awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
"aba"
"abc"
"xxx"
Explanation:
NR and FNR are awk variables that store the total number of records read so far and the number of records read from the current file, respectively (the default record is a line).
NR==FNR {       # Only true while reading the first file
    a[$1]       # Build an associative array on the first column of that file
    next        # Skip the remaining blocks and process the next line
}
($1 in a) {     # Check if the value in column one of the second file is in the array
    print $1    # If so, print it
}
If you want to match whole lines instead, use $0:
$ awk 'NR==FNR{a[$0];next}$0 in a{print $0}' file1 file2
"aba" 0 0
"xxx" 0 0
Or a specific set of columns:
$ awk 'NR==FNR{a[$1,$2,$3];next}($1,$2,$3) in a{print $1,$2,$3}' file1 file2
"aba" 0 0
"xxx" 0 0
To print the number of matching elements, here's one way using awk:
awk 'FNR==NR { a[$1]; next } $1 in a { c++ } END { print c }' file1.txt file2.txt
Results using your input:
3
If you'd like to add extra columns (for example, columns one, two and three), use a pseudo-multidimensional array:
awk 'FNR==NR { a[$1,$2,$3]; next } ($1,$2,$3) in a { c++ } END { print c }' file1.txt file2.txt
Results using your input:
2