Only output line if value in specific column is unique - awk

Input:
line1 a gh
line2 a dd
line3 c dd
line4 a gg
line5 b ef
Desired output:
line3 c dd
line5 b ef
That is, I want to output a line only if no other line contains the same value in column 2. I thought I could do this with a combination of sort (e.g. sort -k2,2 input) and uniq, but it appears that uniq can only skip fields from the left (-f avoids comparing the first N fields). Surely there's some straightforward way to do this with awk or something.

You can do this as a two-pass awk script:
awk 'NR==FNR{a[$2]++;next} a[$2]<2' file file
This runs through the file once incrementing a counter in an array whose key is the second field of each line, then runs through a second time printing only those lines whose counter is less than 2.
You'd need multiple reads of the file because at any point during the first read, you can't possibly know whether there will be another instance of the second field of that line later in the file.
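Spelled out with comments, a sketch equivalent to the one-liner above:
awk '
NR==FNR {        # first pass: FNR==NR only while the first file argument is read
    a[$2]++      # count occurrences of each column-2 value
    next         # skip the printing rule during the first pass
}
a[$2] < 2        # second pass: print only lines whose column-2 value occurred once
' file file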

Here is a one pass awk solution:
awk '{a1[$2]++;a2[$2]=$0} END{for (a in a1) if (a1[a]==1) print a2[a]}' file
The original order of the file will be lost however.

You can combine awk, grep, sort and uniq for a quick one-liner:
grep -v "^[^ ]* $(awk '{print $2}' input.txt | sort | uniq -d) " input.txt
Edit, to avoid problems with regexes, \+ and \backreferences:
grep -v "^[^ ]* $(awk '{print $2}' input.txt | sort | uniq -d | sed 's/[^+0-9]/\\&/g') " input.txt

An alternative to awk, to demonstrate that this can still be done with sort and uniq (uniq even has the -u option for exactly this); however, setting up the right format requires some juggling (the decorate / do stuff / undecorate pattern).
$ paste file <(cut -d' ' -f2 file) | sort -k2 | uniq -uf3 | cut -f1
line5 b ef
line3 c dd
As a side effect you lose the original order, which can be recovered as well if you add line numbers...
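A sketch of that idea (assuming GNU nl, paste, sort, uniq and cut): decorate with line numbers as well, restore the order with a final numeric sort, then strip the decoration:
$ paste <(nl -ba -w1 file) <(cut -d' ' -f2 file) | sort -k3 | uniq -uf4 | sort -k1,1n | cut -f2
line3 c dd
line5 b ef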

Related

Sort a file preserving the header as first position with bash

When sorting a file, I am not preserving the header in its position:
file_1.tsv
Gene Number
a 3
u 7
b 9
sort -k1,1 file_1.tsv
Result:
a 3
b 9
Gene Number
u 7
So I am trying this code:
sed '1d' file_1.tsv | sort -k1,1 > file_1_sorted.tsv
first='head -1 file_1.tsv'
sed '1 "$first"' file_1_sorted.tsv
What I did is remove the header, sort the rest of the file, and then try to add the header back. But I am not able to perform this last part, so I would like to know how I can copy the header of the original file and insert it as the first row of the new file without overwriting its actual first row.
You can do this as well:
{ head -1; sort; } < file_1.tsv
** Update **
For macOS:
{ IFS= read -r header; printf '%s\n' "$header" ; sort; } < file_1.tsv
A simpler awk:
$ awk 'NR==1{print; next} {print | "sort"}' file
$ head -1 file; tail -n +2 file | sort
Output:
Gene Number
a 3
b 9
u 7
Could you please try the following.
awk '
FNR==1{
first=$0
next
}
{
val=(val?val ORS:"")$0
}
END{
print first
print val | "sort"
}
' Input_file
Logical explanation:
Check the condition FNR==1 to see if it is the first line; if so, save its value to a variable and move on to the next line with next.
Then keep appending each subsequent line's value to another variable, separated by newlines, until the last line.
Finally, the END block, which executes once Input_file has been read completely, prints the first line and pipes the remaining lines' value to the sort command.
This will work using any awk, sort, and cut in any shell on every UNIX box. It works whether the input is coming from a pipe (when you can't read it twice) or from a file (when you can), and it doesn't involve awk spawning a subshell:
awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
The above uses awk to stick a 0 at the front of the header line and a 1 in front of every other line, so you can sort by that number first, then by whatever other field(s) you want to sort on, and then remove the added field again with cut. Here it is in stages:
$ awk -v OFS='\t' '{print (NR>1), $0}' file
0 Gene Number
1 a 3
1 u 7
1 b 9
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2
0 Gene Number
1 a 3
1 b 9
1 u 7
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
Gene Number
a 3
b 9
u 7

How to use grep and sed simultaneously using pipe

I have 2 files
File 1
TRINITY_DN10039_c1_g1_i1 216 Brassica rapa
TRINITY_DN10270_c0_g1_i1 233 Pan paniscus
TRINITY_DN10323_c0_g1_i2 209 Corynebacterium aurimucosum ATCC 700975
.
.
TRINITY_DN10462_c0_g1_i1 257 Helwingia himalaica
TRINITY_DN10596_c0_g1_i1 205 Homo sapiens
TRINITY_DN10673_c0_g2_i2 323 Anaerococcus prevotii DSM 20548
File 2
TRINITY_DN9856_c0_g1_i1 len=467 path=[0:0-466]
GATGCGGGCCAATATGAATGTGAGATTACTAATGAATTGGGGACTAAAAA
TRINITY_DN9842_c0_g1_i1 len=208 path=[0:0-207]
AAGTAATTTTATATCACTTGTTACATCGCAATTCGTGAGTTAAACTTAAT
.
.
TRINITY_DN9897_c0_g1_i1 len=407 path=[0:0-406]
AACTTTATTAACTTGTTGTACATATTTATTAATGCAAATACATATAGAG
TRINITY_DN9803_c0_g1_i1 len=795 path=[0:0-794]
AACTAAGACAAACTTCGCGGAGCAGTTAGAAAATATTACAAGAGATTTG
I want to delete 2 lines (the matching line and the line after it) in file2 whenever a line matches one of the first-column words of file1
awk '{print $1}' file1 | sed '/here_i_want_to_insert_output_of_pipe/{N;d;}' file2
If the first field contains no special characters, like . or / or [ or ( or \ or any other regex-special characters, your idea is actually not that bad:
sed "$(cut -d' ' -f1 file1 | sed 's#.*#/&/{N;d}#')" file2
cut -d' ' -f1 file1 - extract the first field from file1
| sed
.* - match the whole line, i.e. the first field extracted from file1
/&/{N;d} - the & is replaced by the whole match, i.e. that first field, so each line becomes /<first field>/{N;d}
then wrap it all in sed "<here>" file2 (an expanded example follows)
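For example, with the first two IDs from file1, the command substitution should expand into a sed script along these lines (one block per ID, the rest following in the same way):
sed "/TRINITY_DN10039_c1_g1_i1/{N;d}
/TRINITY_DN10270_c0_g1_i1/{N;d}
..." file2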
A not-so-well-known feature: you can use another delimiter character for /regex/ with the syntax \<char>regex<char>, e.g. \!regex!. Below I use ~:
sed "$(cut -d' ' -f1 file1 | sed 's#.*#\\~&~{N;d}#')" file2
If, however, you do have special characters in the first field, and you don't care about the order of the output: you can join each pair of lines in file2 into a single line with some magic separator (I chose ! below), sort it, sort file1, and then just join the two. The -v2 option makes join output the unpairable lines from the second file, i.e. the lines that did not match. After that, restore the newline by replacing the magic separator ! with a newline:
join -v2 <(cut -d' ' -f1 file1 | sort) <(sed 'N;s/\n/!/' file2 | sort -k1) |
tr '!' '\n'
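To make the intermediate step concrete, sed 'N;s/\n/!/' file2 joins each header line with its sequence line using the ! separator, so the decorated records look like:
TRINITY_DN9856_c0_g1_i1 len=467 path=[0:0-466]!GATGCGGGCCAATATGAATGTGAGATTACTAATGAATTGGGGACTAAAAA
TRINITY_DN9842_c0_g1_i1 len=208 path=[0:0-207]!AAGTAATTTTATATCACTTGTTACATCGCAATTCGTGAGTTAAACTTAAT
sort -k1 then orders these records by their ID so join can pair them with the sorted IDs from file1.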
If the output needs to be sorted as in file2, you can number lines in file2 and re-sort the output on line numbers:
join -11 -22 -v2 <(cut -d' ' -f1 file1 | sort) <(sed 'N;s/\n/!/' file2 | nl -w1 | sort -k2) |
sort -k2 | cut -d' ' -f1,3- | tr '!' '\n'
Tested on repl
I would do something like this with one awk, unless file1 is really really really huge:
awk 'NR==FNR{a[$1]++; next}counter{counter--}$1 in a{counter=2}!counter' <file1> <file2>
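The same logic spelled out with comments (a sketch equivalent to the one-liner):
awk '
NR==FNR  { a[$1]++; next }   # remember every first-column word of file1
counter  { counter-- }       # still inside a 2-line window to delete: count down
$1 in a  { counter=2 }       # a matching header starts a window of 2 lines to skip
!counter                     # print only lines outside such a window
' file1 file2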
Input:
file1
TRINITY_DN10039_c1_g1_i1 216 Brassica rapa
TRINITY_DN10270_c0_g1_i1 233 Pan paniscus
TRINITY_DN10323_c0_g1_i2 209 Corynebacterium aurimucosum ATCC 700975
hello
TRINITY_DN10462_c0_g1_i1 257 Helwingia himalaica
TRINITY_DN10596_c0_g1_i1 205 Homo sapiens
TRINITY_DN10673_c0_g2_i2 323 Anaerococcus prevotii DSM 20548
file2:
TRINITY_DN9856_c0_g1_i1 len=467 path=[0:0-466]
GATGCGGGCCAATATGAATGTGAGATTACTAATGAATTGGGGACTAAAAA
TRINITY_DN9842_c0_g1_i1 len=208 path=[0:0-207]
AAGTAATTTTATATCACTTGTTACATCGCAATTCGTGAGTTAAACTTAAT
TRINITY_DN9897_c0_g1_i1 len=407 path=[0:0-406]
AACTTTATTAACTTGTTGTACATATTTATTAATGCAAATACATATAGAG
hello
world
TRINITY_DN9803_c0_g1_i1 len=795 path=[0:0-794]
AACTAAGACAAACTTCGCGGAGCAGTTAGAAAATATTACAAGAGATTTG
Output:
TRINITY_DN9856_c0_g1_i1 len=467 path=[0:0-466]
GATGCGGGCCAATATGAATGTGAGATTACTAATGAATTGGGGACTAAAAA
TRINITY_DN9842_c0_g1_i1 len=208 path=[0:0-207]
AAGTAATTTTATATCACTTGTTACATCGCAATTCGTGAGTTAAACTTAAT
TRINITY_DN9897_c0_g1_i1 len=407 path=[0:0-406]
AACTTTATTAACTTGTTGTACATATTTATTAATGCAAATACATATAGAG
TRINITY_DN9803_c0_g1_i1 len=795 path=[0:0-794]
AACTAAGACAAACTTCGCGGAGCAGTTAGAAAATATTACAAGAGATTTG
I would do this with process substitution like so:
while read -r -d '' line; do
sed -i "/^${line}/{N;d;}" file2
done < <(awk '{printf "%s\0", $1}' file1 | sed 's|[][\\/.*^$]|\\&|g')
The reason for delimiting with null bytes rather than newlines is that a null byte can never appear in the data, so it is the safest record separator.
Edit:
Updated to quote special characters with \ so sed won't malfunction.
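For illustration, with the sample file1 the first pass through the loop effectively runs (TRINITY_DN10039_c1_g1_i1 contains no regex metacharacters, so the escaping leaves it unchanged):
sed -i "/^TRINITY_DN10039_c1_g1_i1/{N;d;}" file2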

Grep specific part of string from another file

I want to match the first three digits of the numbers in 1.txt against the three digits that follow the leading zeros in 2.txt.
cat 1.txt
23456
12345
6789
cat 2.txt
20000023485 xxx888
20000012356 xxx888
20000067234 xxx234
Expected output
20000023485 xxx888
20000012356 xxx888
awk 'FNR==NR {a[substr($1,1,3)];next}
{match($1, /0+/);
if(substr($1, RSTART+RLENGTH,3) in a)print}' 1.txt 2.txt
{a[substr($1,1,3)];next} - stores the first 3 characters of each line of 1.txt as keys in an associative array.
match($1, /0+/);if(substr($1, RSTART+RLENGTH,3) in a)
match finds the run of zeros, substr takes the 3 characters that follow it, and the in test checks whether they are present in the associative array built earlier; the whole line is printed if a match is found.
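You can see the mechanics on the first line of 2.txt, for example (any POSIX awk):
$ echo "20000023485 xxx888" | awk '{match($1, /0+/); print RSTART, RLENGTH, substr($1, RSTART+RLENGTH, 3)}'
2 5 234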
Try this with grep:
grep -f <(sed 's/^\(...\).*/00\1/' 1.txt) 2.txt
Output:
20000023485 xxx888
20000012356 xxx888
grep -f will match a series of patterns from the given file, one per line. But first you need to turn 1.txt into the patterns you want. In your case, you want the first three characters of each line of 1.txt, after zeros: 00*234, 00*123, etc. (I'm assuming you want at least one zero.)
sed -e 's/^\(...\).*$/00*\1/' 1.txt > 1f.txt
grep -f 1f.txt 2.txt
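With the sample 1.txt, the generated 1f.txt should contain:
00*234
00*123
00*678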

Delete multiple strings/characters in a file

I have curl output similar to the example below, and I'm working on a sed/awk script to eliminate the unwanted strings.
File
{id":"54bef907-d17e-4633-88be-49fa738b092d","name":"AA","description","name":"AAxxxxxx","enabled":true}
{id":"20000000000000000000000000000000","name":"BB","description","name":"BBxxxxxx","enabled":true}
{id":"542ndf07-d19e-2233-87gf-49fa738b092d","name":"AA","description","name":"CCxxxxxx","enabled":true}
{id":"20000000000000000000000000000000","name":"BB","description","name":"DDxxxxxx","enabled":true}
......
I would like to modify this file and retain only something like the output below,
AA AAxxxxxx
BB BBxxxxxx
AA CCxxxxxx
BB DDxxxxxx
AA n.....
BB n.....
Is there a way I could remove the words/commas/semicolons in between so I retain only these values?
Try this awk
curl your_command | awk -F\" '{print $(NF-9),$(NF-3)}'
Or:
curl your_command | awk -F\" '{print $7,$13}'
A semantic approach using perl:
curl your_command | perl -lane '/"name":"(\w+)".*"name":"(\w+)"/;print $1." ".$2'
For any number of name occurrences:
curl your_command | perl -lane 'printf $_." " for ( $_ =~ /"name":"(\w+)"/g);print ""'
This might work for you (GNU sed):
sed -r 's/.*("name":")([^"]*)".*\1([^"]*)".*/\2 \3/p;d' file
This extracts the fields following the two name keys and prints them if successful.
Alternatively, relying simply on pattern matching:
sed -r 's/.*:.*:"([^"]*)".*:"([^"]*)".*:.*/\1 \2/p;d' file
In this particular case, you could do
awk -F ":|," '{print $4,$7}' file | tr -d '"'
and get
AA AAxxxxxx
BB BBxxxxxx
AA CCxxxxxx
BB DDxxxxxx
Here, the field separator is either : or ,; we print the fourth and seventh fields (because all lines have the wanted entries in those two fields) and finally use tr to delete the " characters because you don't want them.
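To check the field numbering yourself, you can enumerate the fields of the first line:
$ head -1 file | awk -F ":|," '{for (i=1; i<=NF; i++) print i, $i}'
1 {id"
2 "54bef907-d17e-4633-88be-49fa738b092d"
3 "name"
4 "AA"
5 "description"
6 "name"
7 "AAxxxxxx"
8 "enabled"
9 true}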

Count files that have unique prefixes

I have a set of files that looks like the following. I'm looking for a good way to count all files that have unique prefixes, where a "prefix" is defined as all characters before the second hyphen.
0406-0357-9.jpg 0591-0349-9.jpg 0603-3887-27.jpg 59762-1540-40.jpg 68180-517-6.jpg
0406-0357-90.jpg 0591-0349-90.jpg 0603-3887-28.jpg 59762-1540-41.jpg 68180-517-7.jpg
0406-0357-91.jpg 0591-0349-91.jpg 0603-3887-29.jpg 59762-1540-42.jpg 68180-517-8.jpg
0406-0357-92.jpg 0591-0349-92.jpg 0603-3887-3.jpg 59762-1540-5.jpg 68180-517-9.jpg
0406-0357-93.jpg 0591-0349-93.jpg 0603-3887-30.jpg 59762-1540-6.jpg
Depending on what output you actually want, either of these might be what you need:
ls | awk -F'-' '{c[$1"-"$2]++} END{for (p in c) print p, c[p]}'
or
ls | awk -F'-' '!seen[$1,$2]++{count++} END{print count+0}'
If it's something else, update your question to show the output you're looking for.
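With the sample filenames above, the two commands should produce something like this (the order of the for (p in c) loop is not guaranteed):
$ ls | awk -F'-' '{c[$1"-"$2]++} END{for (p in c) print p, c[p]}'
0406-0357 5
0591-0349 5
0603-3887 5
59762-1540 5
68180-517 4
$ ls | awk -F'-' '!seen[$1,$2]++{count++} END{print count+0}'
5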
This should do it:
ls *.jpg | cut -d- -s -f1,2 | uniq | wc -l
Or if your prefixes are always 4 digits, one dash, 4 digits, you don't need cut:
ls *.jpg | uniq -w9 | wc -l
This parses ls (bad practice in general, but it doesn't look like it will cause a problem with these filenames)
and uses awk with - as the field separator.
!seen[$1,$2]++ uses an associative array with $1,$2 as the key; the post-increment returns the previous count, so the condition is true only the first time a given prefix is seen.
print prints on screen :)
ls | awk 'BEGIN{FS="-" ; printf("%-20s%-10s\n","Prefix","Count")} {seen[$1"-"$2]++} END{ for (k in seen){printf("%-20s%-10i\n",k,seen[k])}}'
This will now count based on prefix, with headers :)