Remove duplicate line that contain an unknown string - awk

file.txt
test (CODE:700|SIZE:2356)
asdasdad (CODE:700|SIZE:124)
xcvxcva (CODE:700|SIZE:8974)
asdavasdasdasd (CODE:700|SIZE:124)
link-categories (CODE:700|SIZE:8974)
edit (CODE:700|SIZE:124)
I need a command that finds every duplicated SIZE: value and then removes all lines that have that value except one. The output should look like this:
test (CODE:700|SIZE:2356)
xcvxcva (CODE:700|SIZE:8974)
asdavasdasdasd (CODE:700|SIZE:124)
I found this command, sed '/SIZE:124/,+1 d' file.txt, in "Remove duplicate line only contain specific string",
but it removes all of the matching lines, whereas I need to keep one of them. It also does not search for duplicated SIZE: values by itself, so it does not work for me.
What I need is:
Search for duplicated SIZE: values, like 124 above.
Remove all lines that have such a value, except one line (or two lines, if possible).

It can be done using this simple awk also:
awk -F '[ |]+' '!seen[$NF]++{print}' file
test (CODE:700|SIZE:2356)
asdasdad (CODE:700|SIZE:124)
xcvxcva (CODE:700|SIZE:8974)

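Could you please try the following. Note that it keeps the last line seen for each SIZE value, and that for(key in array) does not guarantee any particular output order.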
awk 'match($0,/SIZE:[0-9]+/){val=substr($0,RSTART,RLENGTH);array[val]=$0;val=""} END{for(key in array){print array[key]}}' Input_file
Or, as a multi-line form of the same solution:
awk '
match($0,/SIZE:[0-9]+/){
val=substr($0,RSTART,RLENGTH)
array[val]=$0
val=""
}
END{
for(key in array){
print array[key]
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
match($0,/SIZE:[0-9]+/){ ##Using match function to match regex of SIZE: then digits in each line here.
val=substr($0,RSTART,RLENGTH) ##Creating variable val whose value is sub string of current line which has matched value from current line.
array[val]=$0 ##Creating an array named array with index of variable val and value is current line.
val="" ##Nullify variable val here.
}
END{ ##Starting END block of this awk program here.
for(key in array){ ##Traversing through array here.
print array[key] ##Printing array value here.
}
}
' Input_file ##Mentioning Input_file name here.
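The END-block version above stores one line per SIZE value and prints them in whatever order awk traverses the array. As a minimal sketch, assuming it is acceptable to keep the first line seen for each value and to print in input order, the same match()/substr() idea can be combined with a seen[] counter. The temporary file and the variable names input and result are only illustrative.

```shell
# Sample data from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
test (CODE:700|SIZE:2356)
asdasdad (CODE:700|SIZE:124)
xcvxcva (CODE:700|SIZE:8974)
asdavasdasdasd (CODE:700|SIZE:124)
link-categories (CODE:700|SIZE:8974)
edit (CODE:700|SIZE:124)
EOF

# Print a line only the first time its SIZE:<digits> token is seen.
# Lines without a SIZE token are dropped, as in the answer above.
result=$(awk 'match($0, /SIZE:[0-9]+/) && !seen[substr($0, RSTART, RLENGTH)]++' "$input")
printf '%s\n' "$result"
rm -f "$input"
```

Because the pattern is evaluated left to right, match() sets RSTART and RLENGTH before substr() uses them.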

Related

Remove fields in file A that don't contain matches in file B, line by line

I have a series of paired files, tab separated.
I want to compare line by line each pair and keep in file B only the fields that contain a match with the paired file A.
Example file A:
a b
d c
Example file B:
f>543 h<456 b>536 d>834 v<75345 a>12343
t>4562 c>623 f>3246 h>1345 d<52312
Desired output:
b>536 a>12343
c>623 d<52312
So far I have tried:
Convert file B into a file with one field per line:
cat file B | sed 's/\t/\n/g' > file B.mod
Grep one string in file A from file B, print the matching line and the next line, convert the output from 2 line back to single tab separated line:
cat file B.mod | grep -A1 (string) | awk '{printf "%s%s",$0,NR%2?"\t":"\n" ; }'
...but this failed since I realized that the matches can be in different order in A and B, as in the example above.
I'd appreciate some help as this goes far beyond my bash skills.
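With your shown samples, please try the following awk code. It assumes every key in file A is a single character, because only the first character of each field in file B is compared.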
awk '
FNR==NR{
for(i=1;i<=NF;i++){
arr[FNR,$i]
}
next
}
{
val=""
for(i=1;i<=NF;i++){
if((FNR,substr($i,1,1)) in arr){
val=(val?val OFS:"")$i
}
}
print val
}
' filea fileb
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk Program from here.
FNR==NR{ ##Checking condition FNR==NR which will be true when filea is being read.
for(i=1;i<=NF;i++){ ##Traversing through all fields here.
arr[FNR,$i] ##Creating array with index of FNR,current field value here.
}
next ##next will skip all further statements from here.
}
{
val="" ##Nullify val here.
for(i=1;i<=NF;i++){ ##Traversing through all fields here.
if((FNR,substr($i,1,1)) in arr){ ##checking condition if 1st letter of each field with FNR is present in arr then do following.
val=(val?val OFS:"")$i ##Creating val which has current $i value in it and keep adding values per line here.
}
}
print val ##Printing val here.
}
' filea fileb ##Mentioning Input_file names here.
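The answer above compares only the first character of each field. As a sketch under the assumption that keys may be longer than one character and are always followed by < or >, the key can instead be taken as everything before that sign. The array names keys and part, and the temporary directory, are illustrative; FS and OFS are set to a tab because the question says the files are tab separated.

```shell
# Recreate the sample files from the question (tab separated).
dir=$(mktemp -d)
printf 'a\tb\nd\tc\n' > "$dir/filea"
printf 'f>543\th<456\tb>536\td>834\tv<75345\ta>12343\n' > "$dir/fileb"
printf 't>4562\tc>623\tf>3246\th>1345\td<52312\n' >> "$dir/fileb"

result=$(awk '
BEGIN { FS = OFS = "\t" }
FNR == NR {                        # first file: remember each key per line number
  for (i = 1; i <= NF; i++) keys[FNR, $i]
  next
}
{
  out = ""
  for (i = 1; i <= NF; i++) {
    split($i, part, /[<>]/)        # part[1] is the text before < or >
    if ((FNR, part[1]) in keys) out = (out == "" ? "" : out OFS) $i
  }
  print out
}
' "$dir/filea" "$dir/fileb")
printf '%s\n' "$result"
rm -rf "$dir"
```

Matching fields keep the order they have in file B, which is what the desired output in the question shows.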

Remove words that are less than two characters long AND don't contain a vowel

You can remove words shorter than two characters with
sed -e 's/ [a-zA-Z0-9]\{1\} / /g'
but I'm not sure how to remove, in one command, only the words that are shorter than two characters AND don't contain a vowel.
Thus a sentence
this is my w example of a sentence p
would end like
this is my example of a sentence
Could you please try following.
awk '
{
val=""
for(i=1;i<=NF;i++){
if($i!~/[aieou]/ && length($i)<2){ a="" }
else{ val=(val?val OFS:"")$i }
}
print val
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting an awk program from here.
{
val="" ##Nullifying val value here.
for(i=1;i<=NF;i++){ ##Starting a for loop from here.
if($i!~/[aieou]/ && length($i)<2){ a="" } ##Checking condition if field is NOT containing any vowels and length is lesser than 2 then do nothing.
else{ val=(val?val OFS:"")$i } ##Else(in case above condition is FALSE) create val which contains current field value.
}
print val ##Printing val here.
}
' Input_file ##Mentioning Input_file name here.
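A slightly tighter variant of the same loop is sketched below. It assumes uppercase vowels should count as vowels too (hence tolower()) and uses continue instead of the placeholder assignment; the variable names sentence, out and result are illustrative.

```shell
sentence='this is my w example of a sentence p'

result=$(printf '%s\n' "$sentence" | awk '
{
  out = ""
  for (i = 1; i <= NF; i++) {
    # Skip a word only if it is a single character and not a vowel (any case).
    if (length($i) < 2 && tolower($i) !~ /[aeiou]/) continue
    out = (out == "" ? "" : out OFS) $i
  }
  print out
}')
printf '%s\n' "$result"
```

The single-letter word a survives because it is a vowel, while w and p are dropped.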
With a GNU sed, you can use
LC_ALL=C sed 's/[ \t]*\b[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]\b//g' file
# Or
LC_ALL=C sed 's/[ \t]*\b[b-df-hj-np-tv-z]\b//gI' file
It matches zero or more spaces or tabs, then a word consisting of a single consonant letter, where \b marks a word boundary. LC_ALL=C is used to make sure the bracket expression ranges follow ASCII code order.

How to use Awk to output multiple consecutive lines

Input/File
A:1111
B:21222
C:33rf33
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
J:efee
Basically, I need to select the line that contains 2122 (i.e. line B, the 2nd line)
and the lines from the one whose value starts with 444dct4 (i.e. line D) through efe989ef (i.e. line I, the 9th line).
To summarize
Select Line B (contains 2122)
Select Line D (444dct4) till Line I
Desired Output
B:21222
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
Could you please try the following, written and tested with the shown samples in GNU awk. It also handles the case where a line whose 2nd column is 21222 falls inside the 444dct4 to efe989ef range: such a line is NOT printed a second time.
awk -F':' '
$2=="21222" && !found{
print
next
}
$2=="444dct4"{
found=1
}
found
$2=="efe989ef"{
found=""
}
' Input_file
Explanation: Adding detailed explanation for above.
awk -F':' ' ##Starting awk program from here and setting field separator as colon here.
$2=="21222" && !found{ ##Checking if 2nd field is 21222 and found is NOT set then try following.
print ##Printing the current line here.
next ##next will skip all further statements from here.
}
$2=="444dct4"{ ##Checking condition if 2nd field is 444dct4 then do following.
found=1 ##Setting found to 1 here.
}
found ##Checking condition if found is SET then print that line.
$2=="efe989ef"{ ##Checking condition if 2nd field is efe989ef then do following.
found="" ##Nullifying found here.
}
' Input_file ##Mentioning Input_file name here.
$ awk -F: '
/2122/ { # line that contains 2122
print
next # to avoid duplicate printing if 2122 also in D-I
}
$2~/^444dct4/,$2~/efe989ef/ # starts with 444dct4 till efe989ef
' file
Output:
B:21222
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
Edit:
One-liner:
$ awk -F: '/2122/{print; next} $2~/^444dct4/,$2~/efe989ef/' file.txt
awk -v str1="2122" -v str2="444dct4" -v str3="efe989ef" 'BEGIN { flag=0 } $0 ~ str1 { print } $0 ~ str2 { flag=1 } $0 ~ str3 { flag=0;print;next } flag' file
For flexibility, pass the string of the single line to find as str1, the start of the range as str2 and the end of the range as str3. The print flag (flag) is initialised to 0. When a line contains str1 (2122), print it. When a line contains str2 (444dct4), set the print flag to 1. When a line contains str3 (efe989ef), set the print flag to 0, print the line and skip to the next record. The variable flag then determines what does and doesn't get printed. Note that a line containing str1 that falls inside the str2 to str3 range is printed twice.
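A parameterised variant that prints each line at most once is sketched below. It assumes the strings should be treated literally and tested against the second field, so it uses index() rather than a regex match; the variable names single, from, to and inrange are illustrative.

```shell
# Sample input from the question.
input=$(mktemp)
printf '%s\n' A:1111 B:21222 C:33rf33 D:444dct4 E:5tdffe \
              F:4444we G:j5555 H:46666 I:efe989ef J:efee > "$input"

result=$(awk -F: -v single="2122" -v from="444dct4" -v to="efe989ef" '
index($2, from) == 1 { inrange = 1 }       # field 2 starts with the "from" string
inrange || index($2, single) { print }     # one print per line, so no duplicates
index($2, to) == 1   { inrange = 0 }       # range closes after the "to" line is printed
' "$input")
printf '%s\n' "$result"
rm -f "$input"
```

Because there is a single print rule, a 2122 line that happens to sit inside the D to I range is still printed only once.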

Awk adding a pipe at the end of the first line

I have a little problem with my awk command.
The objective is to add a new column to my CSV :
The header must be "customer_id"
The next rows must be a customer_id from an array
Here is my csv :
email|event_date|id|type|cha|external_id|name|date
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13
I would like to have this output :
email|event_date|id|type|cha|external_id|name|date|customer_id
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13|20200
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13|20201
But when I'm doing the awk I have this result :
awk -v a="$(echo "${customerIdList[@]}")" 'BEGIN{FS=OFS="|"} FNR==1{$(NF+1)="customer_id"} FNR>1{split(a,b," ")} {print $0,b[NR-1]}' test.csv
email|event_date|id|type|cha|external_id|name|date|customer_id|
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13|20200
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13|20201
Where customerIdList = (20200 20201)
There is a pipe just after the "customer_id" header and I don't know why :(
Can someone help me ?
Could you please try following, written and tested with shown samples.
awk -v var="${customerIdList[*]}" '
BEGIN{
num=split(var,arr," ")
}
FNR==1{
print $0"|customer_id"
next
}
{
$0=$0 (arr[FNR-1]?"|" arr[FNR-1]:"")
}
1
' Input_file
Explanation: Adding detailed explanation for above.
awk -v var="${customerIdList[*]}" ' ##Starting awk program from here, creating var variable and passing array values to it.
BEGIN{ ##Starting BEGIN section of this program from here.
num=split(var,arr," ") ##Splitting var into arr with space delimiter.
}
FNR==1{ ##Checking condition if this is first line.
print $0"|customer_id" ##Then printing current line with string here.
next ##next will skip all further statements from here.
}
{
$0=$0 (arr[FNR-1]?"|" arr[FNR-1]:"") ##Checking condition if value of arr with current line number -1 is NOT NULL then add its value to current line with pipe else do nothing.
}
1 ##1 will print current line.
' Input_file ##Mentioning Input_file name here.
awk -v IdList="${customerIdList[*]}" 'BEGIN { split(IdList,ListId," ") } NR == 1 { $0=$0"|customer_id" } NR > 1 { $0=$0"|"ListId[NR-1] } 1' file
awk cannot read a shell array directly, so pass the array as a space-separated string (IdList) and use awk's split function to build the awk array ListId. The header line (NR == 1) gets |customer_id appended. Every later line (NR > 1) is set to the line plus a pipe and the ListId element at index NR-1.
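In the command from the question, the final print $0,b[NR-1] also runs for the header line, where b[0] is empty, so OFS is emitted followed by nothing; that is where the trailing pipe comes from. A minimal sketch that handles the header in its own rule is shown below. It uses a plain space-separated string ids as a stand-in for "${customerIdList[*]}" so that it runs in any POSIX shell; the names ids, list and id are illustrative.

```shell
# Sample CSV from the question.
csv=$(mktemp)
cat > "$csv" <<'EOF'
email|event_date|id|type|cha|external_id|name|date
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13
EOF
ids="20200 20201"   # stand-in for "${customerIdList[*]}" in bash

result=$(awk -v list="$ids" '
BEGIN { FS = OFS = "|"; split(list, id, " ") }
FNR == 1 { print $0, "customer_id"; next }   # header gets exactly one new field
{ print $0, id[FNR - 1] }                    # data row N gets the (N-1)th id
' "$csv")
printf '%s\n' "$result"
rm -f "$csv"
```

The next in the header rule is what keeps the second print from adding another separator to the first line.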

How to use sed/awk to extract text between two patterns when a specific string must exist in the text block

I have found several answers on how to use sed/awk between two patterns, but I also need to find only the text blocks that contain a specific string.
Text example:
<requirement id = "blabla.1"
slogan = "Handling of blabla"
work-package = "bla444.2"
logical-node = "BLA-C"
level = "System"
>
Bla bla.
</requirement>
<requirement id = "bla.2"
slogan = "Reporting of blabla"
work-package = "bla444.1"
logical-node = "BLA-C"
level = "System"
>
Bla bla bla.
</requirement>
So the goal is to get only the text block between <requirement and </requirement> that has bla444.1 in the work-package. In the example this should give me only the last text block. Of course, the real file has more requirements, and several of them have the needed work-package, so sed should find more than just the last text block.
sed -e 's/<requirement\(.*\)<\/requirement/\1/' file
The above sed line will give all the text blocks (requirements).
Note that the text blocks have no fixed line count, but all of them have a work-package.
Could you please try following.
awk '
/^<requirement/{
if(found && value){
print value
}
found=value=""
}
{
value=(value?value ORS:"")$0
}
/work-package.*bla444.1\"$/{
found=1
}
END{
if(found && value){
print value
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
/^<requirement/{ ##Checking condition if line starts from string <requirement then do following.
if(found && value){ ##Checking condition if found and value is NOT NULL then do following.
print value ##Printing value(which contains all blocks value, explained further code) here.
}
found=value="" ##Nullifying variables found and value variables here.
}
{
value=(value?value ORS:"")$0 ##Creating variable value whose value is keep concatenating its own value each time cursor comes here.
}
/work-package.*bla444.1\"$/{ ##Checking condition if a line has string work-package till bla444.1 then do following.
found=1 ##Making variable found and setting value to 1, kind of FLAG enabling stuff.
}
END{ ##Starting END block of this awk code here.
if(found && value){ ##Checking condition if found and value is NOT NULL then do following.
print value ##Printing value variable here.
}
}
' Input_file ##Mentioning Input_file name here.
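The awk answer above treats everything up to the next <requirement line as part of the block. As a sketch under the assumption that each block is closed by a </requirement> line, the closing tag can end the block instead, and index() can test for the wanted string literally so the dot in bla444.1 is not treated as a regex wildcard. The variable names want, inblock, block and hit are illustrative.

```shell
# Sample input from the question.
input=$(mktemp)
cat > "$input" <<'EOF'
<requirement id = "blabla.1"
slogan = "Handling of blabla"
work-package = "bla444.2"
logical-node = "BLA-C"
level = "System"
>
Bla bla.
</requirement>
<requirement id = "bla.2"
slogan = "Reporting of blabla"
work-package = "bla444.1"
logical-node = "BLA-C"
level = "System"
>
Bla bla bla.
</requirement>
EOF

result=$(awk -v want='work-package = "bla444.1"' '
/<requirement/    { inblock = 1; block = ""; hit = 0 }   # a new block starts
inblock           { block = block $0 ORS
                    if (index($0, want)) hit = 1 }       # literal, not regex, match
/<\/requirement>/ { if (inblock && hit) printf "%s", block
                    inblock = 0 }                        # block ends at the closing tag
' "$input")
printf '%s\n' "$result"
rm -f "$input"
```

With the sample input only the second requirement block is printed.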
This might work for you (GNU sed):
sed -n '/<requirement/{:a;N;/<\/requirement/!ba;/work-package = "bla444\.1"/p}' file
Filter lines between <requirement and </requirement> and if those lines contain the string work-package = "bla444.1" print the collection.
Or perhaps:
sed -ne '/<requirement/{' -e ':a' -e 'N' -e '/<\/requirement/!ba' -e '/work-package = "bla444\.1"/p' -e '}' file
Or:
cat <<\! | sed -nf - file
/<requirement/{
:a
N
/<\/requirement/!ba
/work-package = "bla444\.1"/p
}
!