AWK: select rows where all columns are equal

I have a file with tab-separated values where the number of columns is not known a priori. In other words, the number of columns is consistent within a file, but different files have different numbers of columns. The first column is a key; the other columns are arbitrary values.
I need to filter out the rows where the values are not all the same. For example, assuming the number of columns is 4, I need to keep the first two rows and filter out the third:
1 A A A
2 B B B
3 C D C
I'm planning to use AWK for this purpose, but I don't know how to deal with the number of columns being unknown. The case of a known number of columns is simple; this is a solution for 4 columns:
$2 == $3 && $3 == $4 {print}
How can I generalize the solution to an arbitrary number of columns?

If you can guarantee that no field contains regex-active characters, that the first field never matches the second, and that there are no blank lines in the input:
awk '{tmp=$0;gsub($2,"")} NF==1{print tmp}' file
Note that this solution is designed for this specific case and is less extensible than the others.
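For example, if one value merely contains another, the gsub also eats the substring and the line is wrongly kept. A hypothetical input illustrating why the caveats above matter:
$ printf '2 B BB\n' | awk '{tmp=$0;gsub($2,"")} NF==1{print tmp}'
2 B BB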

Another slight twist on the approach. In your case you know you want to compare fields 2-4, so you can simply loop from i=3 to i<=NF checking whether $i != $(i-1); if any comparison fails, don't print and get the next record, e.g.
awk '{for(i=3;i<=NF;i++)if($i!=$(i-1))next}1'
Example Use/Output
With your data in file.txt:
$ awk '{for(i=3;i<=NF;i++)if($i!=$(i-1))next}1' file.txt
1 A A A
2 B B B

Could you please try the following. It compares all columns from the 2nd to the last and checks whether every element is equal; if they are all the same, it prints the line.
awk '{for(i=3;i<=NF;i++){if($(i-1)==$i){count++}};if((NF-2)==count){print};count=""}' Input_file
OR (hard-coding $2 in the comparison: since $2==$3 AND $3==$4 implies $2==$3==$4, each field is intentionally compared against $2 rather than having i-1 fetch the previous field):
awk '{for(i=3;i<=NF;i++){if($2==$i){count++}};if((NF-2)==count){print};count=""}' Input_file
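Run against the sample data above, both variants keep only the first two rows:
$ awk '{for(i=3;i<=NF;i++){if($(i-1)==$i){count++}};if((NF-2)==count){print};count=""}' Input_file
1 A A A
2 B B B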

I'd use a counter t with an initial value of 2, adding 1 each time $i == $(i+1) as i iterates from 2 to NF-1, and print the line only if t==NF is true:
awk -F'\t' '{t=2;for(i=2;i<NF;i++){t+=$i==$(i+1)}}t==NF' file.txt
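For the first sample row the counter evolves like this (a worked trace, not command output):
# 1 A A A      NF=4, t starts at 2
# i=2: $2==$3 is true, so t=3
# i=3: $3==$4 is true, so t=4
# t==NF, so the line is printed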

Here is a generalisation of the problem:
Select all lines where a set of columns have the same value: c1 c2 c3 c4 ..., where ci can be any column number.
Assume we want to select the columns: 2 3 4 11 15
awk 'BEGIN{n=split("2 3 4 11 15",a)}
{for(i=2;i<=n;++i) if ($(a[i])!=$(a[1])) next}1' file
A bit more robust, in case a line might not contain all fields:
awk 'BEGIN{n=split("2 3 4 11 15",a)}
{for(i=2;i<=n;++i) if (a[i] <= NF) if ($(a[i])!=$(a[1])) next}1' file
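A minimal variation (my sketch, not part of the original answer): pass the column list in with -v so the script itself never needs editing:
awk -v cols="2 3 4 11 15" 'BEGIN{n=split(cols,a)}
{for(i=2;i<=n;++i) if (a[i] <= NF) if ($(a[i])!=$(a[1])) next}1' file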

bash remove duplicates from four columns where order does not matter [duplicate]

This question was closed as a duplicate of: How can I use ":" as an AWK field separator?
I have a file where I'd like to check the content of 4 columns. The order can be reversed between pairs of these columns: if the columns are a,b,c,d they may also appear as c,d,a,b. So the pairs a,b and c,d are locked, but the two pairs can be swapped with each other.
I found a similar post here:
remove redundancy in a file based on two fields, using awk
However, the solution does not work at all, even with just two columns:
a;b
d;a
b;a
r;f
r;y
a;b
a;d
If I apply the solutions provided there (and accepted as correct), I end up with duplicates:
$ awk '!seen[$1,$2]++ && !seen[$2,$1]++' file
a;b
d;a
b;a
r;f
r;y
a;d
As you can see there's still a;b and b;a
Any suggestion to make this work, considering there would also be four columns? For example:
Dallas;Texas;Berlin;Germany
Paris;France;Tokyo;Japan
Berlin;Germany;Dallas;Texas
Florence;Italy;Dublin;Ireland
Berlin;Germany;Texas;Dallas
Should give
Dallas;Texas;Berlin;Germany
Paris;France;Tokyo;Japan
Florence;Italy;Dublin;Ireland
Berlin;Germany;Texas;Dallas
Note that the last line should not be deleted because it is a different record: a,b and c,d should be treated as locked pairs, so a,b,c,d and c,d,a,b count as duplicates of each other, but no other arrangement does.
Well, for the original example with two fields, you missed defining ; as the input field separator. The same solution would have worked had you run it as
awk -F';' '!seen[$1,$2]++ && !seen[$2,$1]++' file
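With the two-column sample this now prints:
a;b
d;a
r;f
r;y
i.e. b;a, the repeated a;b, and a;d (the reverse of d;a) are all removed.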
For multiple fields in a row joined by a delimiter, it is better to sort the fields alphabetically and build the key from the sorted result; the logic below works irrespective of the order of the elements in a line.
Needs GNU awk because of the asort() function.
Explicit input and output delimiters are not needed here, because on every line we split on ; ourselves to construct the unique key, and print the whole line when the key is new.
awk '{
    split($0, arr, ";"); key="";
    asort(arr);
    for (i=1; i<=length(arr); i++) {
        key = ( key FS arr[i] )
    }
}!unique[key]++' file
As a so-called one-liner (a.k.a. unreadable):
awk '{ split($0, arr, ";"); asort(arr); key=""; for (i=1; i<=length(arr); i++) { key = ( key FS arr[i]) }; }!unique[key]++' file
As noted in the comments, if the only possible alternate for a,b,c,d is c,d,a,b, then the following would suffice:
awk -F';' '!seen[$1,$2,$3,$4]++ && !seen[$3,$4,$1,$2]++' file
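Checked against the four-column sample, this keeps exactly the expected lines:
$ awk -F';' '!seen[$1,$2,$3,$4]++ && !seen[$3,$4,$1,$2]++' file
Dallas;Texas;Berlin;Germany
Paris;France;Tokyo;Japan
Florence;Italy;Dublin;Ireland
Berlin;Germany;Texas;Dallas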

In a CSV file, subtotal 2 columns based on a third one, using AWK in KSH

Disclaimers:
1) English is my second language, so please forgive any grammatical horrors you may find. I am pretty confident you will be able to understand what I need despite these.
2) I have found several examples on this site that address questions/problems similar to mine, though I was unfortunately not able to figure out the modifications needed to fit my needs.
The "Problem":
I have a CSV file that looks like this:
c1,c2,c3,c4,c5,134.6,,c8,c9,SERVER1,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER1,c11
c1,c2,c3,c4,c5,0.18,,c8,c9,SERVER2,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER2,c11
c1,c2,c3,c4,c5,416.09,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,12.1,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,480.64,,c8,c9,SERVER4,c11
c1,c2,c3,c4,c5,,83.65,c8,c9,SERVER5,c11
c1,c2,c3,c4,c5,,253.15,c8,c9,SERVER6,c11
c1,c2,c3,c4,c5,,18.84,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,8.12,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,22.45,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,117.81,c8,c9,SERVER8,c11
c1,c2,c3,c4,c5,,96.34,c8,c9,SERVER9,c11
Complementary facts:
1) File has 11 columns.
2) The data in columns 1, 2, 3, 4, 5, 8, 9 and 11 is irrelevant in this case. In other words, I will only work with columns 6, 7 and 10.
3) Column 10 will typically contain alphanumeric strings (server names), though they may also contain "-" and/or "_".
4) Columns 6 and 7 will contain exclusively numbers, with up to two decimal places (a possible value is 0). Only one of the two will have data per line, never both.
What I need as an output:
- A single occurrence of every string in column 10 (as column 1), then the sum (subtotal) of its values in column 6 (as column 2) and, last, the sum (subtotal) of its values in column 7 (as column 3).
- If the total for a field is "0" the field must be left empty, but it still must exist (its respective comma has to be printed).
- Note that the strings in column 10 will already be alphabetically sorted, so there is no need to do that part of the processing with AWK.
Output sample, using the sample above as an input:
SERVER1,134.6,,
SERVER2,0.18,,
SERVER3,428.19,,
SERVER4,480.64,,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,26.96
I've already found within these pages not one but two AWK one-liners that PARTIALLY accomplish what I need:
awk -F "," 'NR==1{last=$10; sum=0;}{if (last != $10) {print last "," sum; last=$10; sum=0;} sum += $6;}END{print last "," sum;}' inputfile
awk -F, '{a[$10]+=$6;}END{for(i in a)print i","a[i];}' inputfile
My "problems" in both cases are the same:
- Subtotals of 0 are printed.
- I can only handle the sum of one column at a time. Whenever I try to add the second one, I get either a syntax error or it simply does not print the third column at all.
Thanks in advance for your support people!
Regards,
Martín
something like this?
$ awk 'BEGIN{FS=OFS=","}
{s6[$10]+=$6; s7[$10]+=$7}
END{for(k in s6) print k,(s6[k]?s6[k]:""),(s7[k]?s7[k]:"")}' file | sort
SERVER1,134.6,
SERVER2,0.18,
SERVER3,428.19,
SERVER4,480.64,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,49.41
SERVER8,,117.81
SERVER9,,96.34
Note that your treatment of commas is not consistent: you're adding an extra one when the last field is zero (count the commas).
Your posted expected output doesn't seem to match your posted sample input, so we're guessing, but this might be what you're looking for:
$ cat tst.awk
BEGIN { FS=OFS="," }
$10 != prev {
    if (NR > 1) {
        print prev, sum6, sum7
    }
    sum6 = sum7 = ""
    prev = $10
}
$6 { sum6 += $6 }
$7 { sum7 += $7 }
END { print prev, sum6, sum7 }
$ awk -f tst.awk file
SERVER1,134.6,
SERVER2,0.18,
SERVER3,428.19,
SERVER4,480.64,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,49.41
SERVER8,,117.81
SERVER9,,96.34

Awk: Append output to new field in existing file

Is there a way to print the output of an awk script to an existing file as a new field every time?
Hi!
I'm very new at awk (so my terminology might not be correct, sorry about that!) and I'm trying to print the output of a script that will operate on several hundred files to the same file, in different fields.
For example, my data files have this structure:
#File1
1
Values, 2, Hanna
20
15
Values, 2, Josh
30
56
Values, 2, Anna
50
70
#File2
2
Values, 2, Hanna
45
60
Values, 2, Josh
98
63
Values, 2, Anna
10
56
I have several of these files, one per numbered month, with the same names but different values. I want files named after each person, with the values in fields by month, like so:
#Hanna
20 45
15 60
#Josh
30 98
56 63
#Anna
50 10
70 56
In my script, I search for the word "Values" and determine which records to print (based on the number after "Values"). This works fine. Then I want to print these values. It works fine for one file, with the command:
print $0 > name # the variable name holds $3 of the matching "Values" row
This creates three files correctly named "Hanna", "Josh" and "Anna", with their values. However, I would like to run the script over all my data files and append to a single "Hanna" file etc., in a new field each time.
So what I'm looking for is something like print $0 > $month name, reading as "print the record to the field corresponding to the month".
I have tried to find a solution, but most solutions either paste temporary files together or append the values after the existing ones (so that they all end up in field 1). I want to avoid the temporary files and have the values in different fields (so that I get a kind of matrix structure).
Thank you in advance!
Try the following, though I have not checked all permutations and combinations and have only considered the data in your post. Also, your output for the Josh column does not look consistent (please let us know if there are more conditions to it). Let me know how it goes.
awk 'FNR==NR{if($0 ~ /^Values/){Q=$NF;B[$NF]=$NF;i="";next};A[Q,++i]=$0;next} /^Values/{V=$NF;print "#"B[V];i="";next} B[V]{print A[V,++i],$0}' file1 file2
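Run against the two sample files above, the output should be:
#Hanna
20 45
15 60
#Josh
30 98
56 63
#Anna
50 10
70 56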
EDIT: Adding a non-one-liner form of the solution too.
awk 'FNR==NR{
    if($0 ~ /^Values/){
        Q=$NF;
        B[$NF]=$NF;
        i="";
        next
    };
    A[Q,++i]=$0;
    next
}
/^Values/{
    V=$NF;
    print "#"B[V];
    i="";
    next
}
B[V]{
    print A[V,++i],$0
}
' file1 file2
EDIT 2: Adding an explanation as well.
awk 'FNR==NR{ ###The condition FNR==NR is TRUE only while the first file, file1, is being read. FNR and NR both count input lines; the only difference is that FNR is RESET whenever the next Input_file starts, while NR keeps increasing until all Input_files are read.
if($0 ~ /^Values/){ ###Check whether the line starts with the string Values; if yes, perform the following operations.
Q=$NF; ###Create a variable named Q whose value is the last field of the line.
B[$NF]=$NF; ###Create an array named B whose index is $NF (the last field of the line) and whose value is the same.
i=""; ###Set the variable i to NULL now.
next ###next is a built-in awk keyword that skips all further statements for this line.
};
A[Q,++i]=$0; ###Create an array named A whose index is Q together with the variable i, which increases by 1 each time this statement is reached.
next ###next skips all further statements for this line.
}
/^Values/{ ###All statements from here on are executed while the second file, file2, is being read. Check whether a line starts with the string Values; if so, do the following.
V=$NF; ###Create a variable V whose value is the $NF of the current line.
print "#"B[V]; ###Print the string # followed by the value of array B whose index is the variable V.
i=""; ###Nullify the variable i here.
next ###next skips all further statements for this line.
}
B[V]{ ###Check whether array B has a value at index V; if so, perform the following.
print A[V,++i],$0 ###Print the value of array A at index (V, ++i), with i incremented by 1, followed by the current line.
}
' file1 file2 ###Mention the Input_files here, named file1 and file2.

How to get a field by counting columns (number of characters)

I have a logfile.txt and I want to select field $4, but based on character position rather than field number, because the fields are separated by spaces and field 2 ($2) may contain values that themselves include spaces. I want to count lines, but I don't know how to specify $4 without causing a problem when field 2 ($2) contains a space character.
here is my file:
KJKJJ1KLJKJKJ928482711 PIEJHHKIA 87166188177633 AJHHHH77760 00666667 876876800874 2014100898798789979879877770
KJKJJ1KLJKJKJ928482711 HKHG 81882776553868 HGHALJLKA700 00876763 216897879879 2014100898798789979879877770
KJKJJ1KLJKJKJ928482711 UUT UGGT 81762665356426 HGJHGHJG661557008 00778787 268767860704 2014100898798789979879877770
KJKJJ1KLJKJKJ9284827kj ARTH HGG 08276255534867 HGJHGHJG661557008 00876767 212668767684 2014100898798789979879877770
Here is the code:
awk 'END { OFS="\t"; for (k in c) print c[k],"\t"k,"\t"f[k] } { k=$4; c[k]++; f[k]=substr($0,137,8) }' logfile.txt
I want to count based on field $4, but to specify this field in the code we have to go by character position (substr($0,...,...)).
The output should be:
1 20141008 AJHHHH77760
1 20141008 HGHALJLKA700
2 20141008 HGJHGHJG661557008
If your records are composed of fixed-width fields you can use cut(1):
% cut -c1-22,23-42,43-62,... --output-delimiter=, file | sed 's/, */,/g' > file.csv
% awk -F, '{your_code}' file.csv
Please write a range for each of your fixed-width fields in place of the ... ellipsis.
I have written ranges only for the first three, lazy me.
If you don't want to bother with an intermediate file, just use a | pipe.
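If you'd rather stay in a single awk process, GNU awk can also split on fixed widths directly via FIELDWIDTHS. A minimal sketch with illustrative widths (measure your real layout; the 137/8 offset is taken from the question):
gawk 'BEGIN { FIELDWIDTHS = "23 20 15 18 9 13 29" }   # illustrative widths, not measured
{
    key = $4
    sub(/ +$/, "", key)           # strip fixed-width padding from the key
    c[key]++
    f[key] = substr($0, 137, 8)   # date offset as given in the question
}
END { for (k in c) print c[k], f[k], k }' logfile.txt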

Awk: printing undetermined number of columns

I have a file that contains a number of fields separated by tabs. I am trying to print all columns except the first one, but I want to print them all in a single column with AWK. The format of the file is
col 1 col 2 ... col n
There are at least 2 columns in each row.
Sample
2012029754 901749095
2012028240 901744459 258789
2012024782 901735922
2012026032 901738573 257784
2012027260 901742004
2003062290 901738925 257813 257822
2012026806 901741040
2012024252 901733947 257493
2012024365 901733700
2012030848 901751693 260720 260956 264843 264844
So I want to tell awk to print columns 2 through n, where n varies by row, without printing blank lines when a row has no value in its later columns, all in one column like the following.
901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844
This is the first time I am using awk, so bear with me. I wrote this on the command line, and it works:
awk '{i=2;
    while ($i ~ /[0-9]+/)
    {
        printf "%s\n", $i
        i++
    }
}' bth.data
This is more about seeking approval than asking a question: is this the right way to do something like this in AWK, or is there a better/shorter way?
Note that the actual input file could be millions of lines.
Thanks
Is this what you want as output?
awk '{for(i=2; i<=NF; i++) print $i}' bth.data
gives
901749095
901744459
258789
901735922
901738573
257784
901742004
901738925
257813
257822
901741040
901733947
257493
901733700
901751693
260720
260956
264843
264844
NF is one of several predefined awk variables. It indicates the number of fields on a given input line. For instance, it is useful if you want to always print out the last field of a line (print $NF). And of course it helps when you want to iterate through all or part of the fields on a given line to the end of the line.
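A quick illustration with a throwaway input line:
$ echo "a b c" | awk '{print NF, $NF}'
3 c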
Seems like awk is the wrong tool. I would do:
cut -f 2- < bth.data | tr -s '\t' '\n'
Note that with -s, this avoids printing blank lines as stated in the original problem.
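A quick demonstration of the squeeze, with a row whose middle field is empty (consecutive tabs):
$ printf 'a\tb\t\tc\n' | cut -f 2- | tr -s '\t' '\n'
b
c
Without -s, the empty field would come through as a blank line.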