shell awk script to remove duplicate lines

I am trying to remove every line that has a duplicate in a file, including the original occurrence, so that only lines appearing exactly once remain. The command below does that, but it prints the lines in a different order, and I want them in the same order as in the input file.
awk '{++a[$0]}END{for(i in a) if (a[i]==1) print i}' test.txt
Input:
123
aaa
456
123
aaa
888
bbb
Output I want:
456
888
bbb

Simpler code if you are okay with reading input file twice:
$ awk 'NR==FNR{a[$0]++; next} a[$0]==1' ip.txt ip.txt
456
888
bbb
With single pass:
$ awk '{a[NR]=$0; b[$0]++} END{for(i=1;i<=NR;i++) if(b[a[i]]==1) print a[i]}' ip.txt
456
888
bbb
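A small sketch of a practical difference between the two commands above: the two-pass form has to open ip.txt twice, so it cannot read from a pipe, while the single-pass form can. The function name uniq_only is hypothetical and is used only for this illustration; the input is the sample from the question.

```shell
# uniq_only is a hypothetical helper name used only for this sketch.
# It prints the lines that occur exactly once, in input order.
uniq_only() {
  awk '{ line[NR] = $0; count[$0]++ }
       END { for (i = 1; i <= NR; i++) if (count[line[i]] == 1) print line[i] }' "$@"
}

# Sample input from the question, fed through a pipe instead of a file.
result=$(printf '%s\n' 123 aaa 456 123 aaa 888 bbb | uniq_only)
printf '%s\n' "$result"
```

The trade-off is memory: the single-pass form keeps every input line in an array until the END block runs.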

If you want to do this in awk alone and do not care about the output order, try the following:
awk '{a[$0]++};END{for(i in a){if(a[i]==1){print i}}}' Input_file
To print the unique lines in the same order in which they occur in Input_file, try the following:
awk '
!a[$0]++{
b[++count]=$0
}
{
c[$0]++
}
END{
for(i=1;i<=count;i++){
if(c[b[i]]==1){
print b[i]
}
}
}
' Input_file
Output will be as follows.
456
888
bbb
Explanation: a detailed walkthrough of the code above.
awk ' ##Start of the awk program.
!a[$0]++{ ##True only the first time a given line is seen (its count in a is still 0).
b[++count]=$0 ##Store the line in array b under the next index, so b keeps the first-seen order.
}
{
c[$0]++ ##Count every occurrence of the current line in array c.
}
END{ ##END block, runs after all input has been read.
for(i=1;i<=count;i++){ ##Loop over the distinct lines in first-seen order.
if(c[b[i]]==1){ ##If this line occurred exactly once in the input, then do the following.
print b[i] ##Print the line.
}
}
}
' Input_file ##Input file name.

awk '{ b[$0]++; a[++n]=$0 } END{ for (i=1; i<=n; i++) if (b[a[i]]==1) print a[i] }' input
Each line's count is kept in array b, and the order of the lines is kept in array a. The END loop walks the indexes numerically, because for (i in a) does not guarantee any order.
If, in the end, the count is 1, the line is printed.
Sorry, I misread the question at first; the corrected answer is almost the same as Sundeep's.

Related

multiple condition store in variable and use as if condition in awk

table1.csv:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
33629|EEE
33630|FFF
Aims:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
Using this command:
awk 'BEGIN{FS="|";OFS="|"} {if($2=="AAA" || $2=="BBB" || $2=="CCC" || $2=="DDD"){print $1,$2}}' table1.csv
However, I would like this to be more automatic, since the number of categories may increase.
list1.csv:
AAA BBB CCC DDD
list=`cat list1.csv`
awk -v list=$list 'BEGIN{FS="|";OFS="|"} {if($2==list){print $1,$2}}' table1.csv
In other words, can I store $2=="AAA" || $2=="BBB" ....... in a variable by using list1.csv?
Expected output:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
So, any suggestions on storing multiple conditions in one variable?
Thanks all!

$ awk 'NR==FNR{for(i=1;i<=NF;i++)a[$i];next}FNR==1{FS="|";$0=$0}($2 in a)' list table
Output:
33622|AAA
33623|AAA
33624|BBB
33625|CCC
33626|DDD
33627|AAA
33628|BBB
Explained:
$ awk '
NR==FNR { # process list
for(i=1;i<=NF;i++) # hash all items in file
a[$i]
next # possibility for multiple lines
}
FNR==1 { # changing FS in the beginning of table file
FS="|"
$0=$0
}
($2 in a)' list table

Almost the same logic as James Brown's nice answer; this small variant sets the field separator on the command line, between the two file names, so it takes effect only for the table file.
awk 'FNR==NR{for(i=1;i<=NF;i++){arr[$i]};next} ($2 in arr)' list FS="|" table
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when list is being read.
for(i=1;i<=NF;i++){ ##Going through all fields here.
arr[$i] ##Creating arr with index of current column value here.
}
next ##next will skip all further statements from here.
}
($2 in arr) ##Checking condition if 2nd field is present in arr then print that line from table file.
' list FS="|" table ##mentioning Input_file(s) here and setting FS as | before table file.
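A self-contained sketch of the command-line FS assignment, under the assumption that list1.csv is whitespace-separated and table1.csv is pipe-separated as in the question. The temporary directory, the shortened table, and the array name wanted are made up for this illustration.

```shell
# Build small stand-ins for list1.csv and table1.csv in a scratch directory.
tmp=$(mktemp -d)
printf 'AAA BBB CCC DDD\n' > "$tmp/list1.csv"
printf '%s\n' '33622|AAA' '33625|CCC' '33629|EEE' '33630|FFF' > "$tmp/table1.csv"

# First file: remember every category. FS='|' is applied only when awk
# reaches that argument, i.e. just before it opens the table file.
result=$(awk 'NR==FNR { for (i = 1; i <= NF; i++) wanted[$i]; next }
              ($2 in wanted)' "$tmp/list1.csv" FS='|' "$tmp/table1.csv")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Because the assignment is processed before the table file's first record is read, no $0=$0 re-split is needed in this variant.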

How to count values between empty cells

I'm facing a problem that is beyond me. I have 18 relatively large text files (ca. 30k lines each) and I need to sum the values in the second column between the empty cells. Here is a simple example of my file:
Metabolism
line_1 10.2
line_2 10.1
line_3 10.3
TCA_cycle
line_4 10.7
line_5 10.8
Pyruvate_metab
line_6 100.8
In reality, I have circa 500 description lines (Metabolism, TCA_cycle, etc.), and each is followed by anywhere from zero to a few hundred value lines.
I would like to sum the values for each block (a block starts with a description, and its lines always follow directly below), e.g.
Metabolism 30.6
line_1 10.2
line_2 10.1
line_3 10.3
TCA_cycle 21.5
line_4 10.7
line_5 10.8
Pyruvate_metab 100.8
line_6 100.8
Or just
30.6
21.5
100.8
It would not be a problem if the results were written line by line to an additional file, or produced in some other way.
There is one tricky thing: descriptions that have no lines with numbers.
Transport
line_1000 100.1
line_1001 100.2
Cell_signal
Motility
Processing
Translation
line_1002 500.1
line_1003 200.2
For those descriptions I would like to get a value of 0.
Transport 200.3
line_1000 100.1
line_1001 100.2
Cell_signal 0
Motility 0
Processing 0
Translation 700.3
line_1002 500.1
line_1003 200.2
The rest of the file looks the same and is consistent: 2 columns, tab separators, descriptions in the first column, values in the second, no spaces (only underscores).
I have no experience with more sophisticated coding, so I really don't know how to solve this on the command line. I've already tried some approaches in Excel, but it was painful and unsuccessful.

With tac and any awk:
tac file | awk 'NF==2{sum+=$2; print; next} {print $1 "\t" sum; sum=0}' | tac
With two improvements proposed by kvantour and Ed Morton. See the comments.
tac file | awk '($NF+0==$NF){sum+=$2; print; next} {print $1 "\t" sum+0; sum=0}' | tac
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
FNR==NR{
if($0!~/line/){ a[$0]; prev=$0 }
else { a[prev]+=$NF }
next
}
!/line/{
$0=$0 OFS (a[$0]?a[$0]:0)
}
1' Input_file Input_file
Or, if you want nicely aligned output, pipe the command above through column -t:
awk '
FNR==NR{
if($0!~/line/){ a[$0]; prev=$0 }
else { a[prev]+=$NF }
next
}
!/line/{
$0=$0 OFS (a[$0]?a[$0]:0)
}
1' Input_file Input_file | column -t
Explanation: a detailed walkthrough of the code above.
awk ' ##Start of the awk program.
FNR==NR{ ##TRUE only while Input_file is being read the first time.
if($0!~/line/){ a[$0]; prev=$0 } ##If the line does NOT contain the string line, it is a description: create its entry in array a and remember it in prev.
else { a[prev]+=$NF } ##Otherwise it is a value line: add its last field to the total of the most recent description.
next ##Skip the remaining rules during the first pass.
}
!/line/{ ##Second pass: if the current line does not contain the string line, then do the following.
$0=$0 OFS (a[$0]?a[$0]:0) ##Append OFS (a space by default) and the block total to the line, or 0 if the block has no values.
}
1 ##Print the current line.
' Input_file Input_file ##The same Input_file is named twice so that it is read twice.

In plain awk:
awk '{
if (NF == 1) {
if (blockname)
printf("%s\t%.2f\n%s", blockname, sum, lines)
blockname = $0
sum = 0
lines=""
} else if (NF == 2) {
sum += $2
lines = lines $0 "\n"
}
next
}
END { printf("%s\t%.2f\n%s", blockname, sum, lines) }
' input.txt
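If only one total per description is wanted (close to the shorter "Or just" variant in the question), a single pass without tac might look like the sketch below. It assumes tab-separated input in which a description line has a missing or empty second cell; the function name block_sums is hypothetical, and the sample rows are taken from the question's tricky case.

```shell
# block_sums is a hypothetical helper name used only for this sketch.
block_sums() {
  awk -F'\t' '
    NF < 2 || $2 == "" {                 # description line: no value in column 2
      if (name != "") print name "\t" (sum + 0)
      name = $1; sum = 0; next
    }
    { sum += $2 }                        # value line: add to the running block total
    END { if (name != "") print name "\t" (sum + 0) }
  ' "$@"
}

result=$(printf 'Transport\nline_1000\t100.1\nline_1001\t100.2\nCell_signal\nMotility\nTranslation\nline_1002\t500.1\nline_1003\t200.2\n' | block_sums)
printf '%s\n' "$result"
```

Descriptions without value lines come out with 0 because sum is reset at every description and sum + 0 forces a numeric result.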

bash compare two columns with exact match

I am comparing columns between two files for exact match but I am ending up with inaccurate result. Example as follows.
File1 File2
adam sunny
jhon adam
kelly adam
matt kevin
stuart adam
Gary Gary
Comparing the files row by row, there is only one match, i.e. Gary. My output should be the following.
Emptyline
Emptyline
Emptyline
Emptyline
Emptyline
Gary
To achieve this, I am running the following command
awk 'NR==FNR { n[$1]=$0;next } ($1 in n) { print n[$1],$2 }' file1 file2
and I am getting the following output
adam
adam
adam
Gary

You should be tracking line numbers, not just line contents:
$ awk 'NR==FNR { lines[NR]=$0; next }
{ if ($0 == lines[FNR]) print; else print "" }' file1.txt file2.txt
Gary
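To see why line numbers matter, the sketch below rebuilds the two sample columns as files in a temporary directory (the paths are made up for this illustration) and compares them row by row. Only the sixth row matches, so the output should be five empty lines followed by Gary.

```shell
# Recreate the two columns from the question as separate files.
tmp=$(mktemp -d)
printf '%s\n' adam jhon kelly matt stuart Gary > "$tmp/file1"
printf '%s\n' sunny adam adam kevin adam Gary > "$tmp/file2"

# Index file1 by line number, then compare each file2 line with the
# file1 line that has the same number.
result=$(awk 'NR==FNR { a[FNR] = $0; next }
              { print (a[FNR] == $0 ? $0 : "") }' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Indexing by content, as in the question's attempt, reports adam three times because adam exists somewhere in file1, regardless of the row it is on.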

1st solution: With simple awk.
awk 'FNR==NR{a[FNR]=$0;next} a[FNR]==$0{print;next} {print ""}' file1 file2
OR as per anubhava sir's comment:
awk 'FNR==NR{a[FNR]=$0;next} a[FNR]!=$0{$0=""} 1' file1 file2
Explanation: a detailed walkthrough of the first command.
awk ' ##Start of the awk program.
FNR==NR{ ##TRUE only while the first file, file1, is being read.
a[FNR]=$0 ##Store each line of file1 in array a under its line number.
next ##Skip the remaining rules for file1.
}
a[FNR]==$0{ ##If the line of file2 equals the line of file1 with the same line number, then do the following.
print ##Print the matching line.
next ##Skip the remaining rule.
}
{
print "" ##Otherwise print an empty line.
}
' file1 file2 ##Input file names.
2nd solution: Assuming each input file holds a single whitespace-free column, as in the shown samples, so that paste yields exactly 2 columns, you could try the following.
paste Input_file1 Input_file2 | awk '$1==$2{print $1};$1!=$2{print ""}'
This prints a line only where the values in Input_file1 and Input_file2 are equal, and an empty line otherwise.

Add new column with times same value was found in 2 columns

Add a new first column holding the number of lines that share exactly the same values in columns 1 and 2.
input file
46849,39785,2,012,023,351912.29,2527104.70,174.31
46849,39785,2,012,028,351912.45,2527118.70,174.30
46849,39785,3,06,018,351912.12,2527119.51,174.33
46849,39785,3,06,020,351911.80,2527105.83,174.40
46849,39797,2,012,023,352062.45,2527118.50,173.99
46849,39797,2,012,028,352062.51,2527105.51,174.04
46849,39797,3,06,020,352063.29,2527116.71,174.13,
46849,39809,2,012,023,352211.63,2527104.81,173.74
46849,39809,2,012,028,352211.21,2527117.94,173.69
46849,39803,2,012,023,352211.63,2527104.81,173.74
46849,39803,2,012,028,352211.21,2527117.94,173.69
46849,39801,2,012,023,352211.63,2527104.81,173.74
Expected output file:
4,46849,39785,2,012,023,351912.29,2527104.70,174.31
4,46849,39785,2,012,028,351912.45,2527118.70,174.30
4,46849,39785,3,06,018,351912.12,2527119.51,174.33
4,46849,39785,3,06,020,351911.80,2527105.83,174.40
3,46849,39797,2,012,023,352062.45,2527118.50,173.99
3,46849,39797,2,012,028,352062.51,2527105.51,174.04
3,46849,39797,3,06,020,352063.29,2527116.71,174.13,
2,46849,39809,2,012,023,352211.63,2527104.81,173.74
2,46849,39809,2,012,028,352211.21,2527117.94,173.69
2,46849,39803,2,012,023,352211.63,2527104.81,173.74
2,46849,39803,2,012,028,352211.21,2527117.94,173.69
1,46849,39801,2,012,023,352211.63,2527104.81,173.74
attempt:
awk -F, '{x[$1 $2]++}END{ for(i in x) {print i,x[i]}}' file
4684939785 4
4684939797 3
4684939801 1
4684939803 2
4684939809 2

Could you please try the following.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
a[$1,$2]++
next
}
{
print a[$1,$2],$0
}
' Input_file Input_file
Explanation: Input_file is read twice. On the first pass, array a is indexed by the first and second fields and counts each occurrence of that pair. On the second pass, the count for the line's first two fields is printed, followed by the whole line.
One liner code:
awk 'BEGIN{FS=OFS=","} FNR==NR{a[$1,$2]++;next} {print a[$1,$2],$0}' Input_file Input_file
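One detail worth spelling out: the attempt in the question builds its key as $1 $2, which simply concatenates the fields, so different pairs can produce the same key. Writing a[$1,$2] joins the fields with SUBSEP instead. A minimal sketch with two made-up rows (not taken from the question's data), where the array names joined and paired are also just for illustration:

```shell
# Rows "1,23" and "12,3" both concatenate to "123", but stay distinct
# when the key is built with a comma subscript (fields joined by SUBSEP).
result=$(printf '%s\n' '1,23,x' '12,3,y' |
  awk -F, '{ joined[$1 $2]++; paired[$1, $2]++ }
           END {
             for (k in joined) nj++      # number of distinct concatenated keys
             for (k in paired) np++      # number of distinct SUBSEP keys
             print nj, np
           }')
printf '%s\n' "$result"
```

The first number is the count of distinct concatenated keys and the second the count of distinct SUBSEP keys, so a smaller first number means two different pairs were merged.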

ignore spaces after particular column no

Hi everyone, I have the data below.
61684 376 23 106 38695633 1 0 0 -1 /C/Program Files (x86)/ 16704 root;TrustedInstaller#NT:SERVICE root;TrustedInstaller#NT:SERVICE 0 1407331175 1407331175 1247541608
8634 416 13 86 574126 1 0 0 -1 /E/KYCImages/ 16832 root;kycfinal#CGKYCAPP03 root;None#CGKYCAPP03 0 1406018846 1406018846 1352415392
60971 472 22 86 38613076 1 0 0 -1 /E/KYCwebsvc binaries/ 16832 root;kycfinal#CGKYCAPP03 root;None#CGKYCAPP03 0 1390829495 1390829495 1353370744
1 416 10 86 1 1 0 0 -1 /E/KycApp/ 16832 root;kycfinal#CGKYCAPP03 root;None#CGKYCAPP03 0 1411465772 1411465772 1351291187
Now I am using below code:
awk 'BEGIN{FPAT = "([^ ]+)|(\"[^\"]+\")"}{print $10}' | awk '$1!~/^\/\./' | sort -u | sed -e 's/\,//g' | perl -p00e 's/\n(?!\Z)/;/g' filename
I am getting this output
/C/Program;/E/KycApp/;/E/KYCImages/;/E/KycServices/;/E/KYCwebsvc
However, I need the output to run from $10 until the next "/" is encountered; basically, I want to ignore any spaces between the start of column 10 and that closing "/".
Is it possible?
Desired output is
/C/Program Files (x86)/;/E/KycApp/;/E/KYCImages/;/E/KycServices/;/E/KYCwebsvc binaries/

With single gawk:
awk 'BEGIN{ FPAT="/[^/]+/[^/]+/"; PROCINFO["sorted_in"]="#ind_str_asc"; IGNORECASE = 1 }
{ a[$1] }END{ for(i in a) r=(r!="")? r";"i : i; print r }' filename
The output (without /E/KycServices/, because it is not in your input):
/C/Program Files (x86)/;/E/KycApp/;/E/KYCImages/;/E/KYCwebsvc binaries/

Try the following too, in a single awk.
awk '{match($0,/\/.*\//);VAL=VAL?VAL ORS substr($0,RSTART,RLENGTH):substr($0,RSTART,RLENGTH)} END{num=split(VAL, array,"\n");for(i=1;i<=num;i++){printf("%s%s",array[i],i==num?"":";")};print""}' Input_file
The same solution in non-one-liner form:
awk '{
match($0,/\/.*\//);
VAL=VAL?VAL ORS substr($0,RSTART,RLENGTH):substr($0,RSTART,RLENGTH)
}
END{
num=split(VAL, array,"\n");
for(i=1;i<=num;i++){
printf("%s%s",array[i],i==num?"":";")
};
print""
}
' Input_file
Explanation of the code in non-one-liner form:
awk '{
match($0,/\/.*\//); ##match() finds the text from the first / to the last / in the line; the slashes are escaped inside the regex.
VAL=VAL?VAL ORS substr($0,RSTART,RLENGTH):substr($0,RSTART,RLENGTH) ##Append the matched text to VAL, separated by ORS (a newline) when VAL already holds a value. RSTART and RLENGTH are built-in awk variables that match() sets to the start position and length of the match.
}
END{ ##END block, runs after all input has been read.
num=split(VAL, array,"\n");##split() breaks VAL on newlines into array and returns the number of elements, stored in num.
for(i=1;i<=num;i++){ ##Loop over the elements from 1 to num.
printf("%s%s",array[i],i==num?"":";") ##Print each element followed by a semicolon, except the last one, which is followed by nothing.
};
print"" ##Print an empty string to end the output with a newline.
}
' Input_file ##Input file name.
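A compact sketch of the same match()/RSTART/RLENGTH idea that also drops repeated paths, since the command in the question used sort -u. The function name extract_paths and the shortened sample rows are assumptions made for this illustration; paths are kept in first-seen order rather than sorted.

```shell
# extract_paths is a hypothetical helper name used only for this sketch.
extract_paths() {
  awk 'match($0, /\/.*\//) {              # first "/" through the last "/" on the line
         p = substr($0, RSTART, RLENGTH)
         if (!(p in seen)) {              # keep each path once, in first-seen order
           seen[p]
           out = (out == "" ? p : out ";" p)
         }
       }
       END { print out }' "$@"
}

result=$(printf '%s\n' \
  '61684 376 -1 /C/Program Files (x86)/ 16704 root' \
  '8634 416 -1 /E/KYCImages/ 16832 root' \
  '8634 416 -1 /E/KYCImages/ 16832 root' | extract_paths)
printf '%s\n' "$result"
```

Because the regex is greedy, this relies on the path being the only part of each line that contains slashes, which holds for the sample data shown.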
