How to find a match to a partial string and then delete the string from the reference file using awk? - awk

I have a problem that I have been trying to solve, but have not been able to figure out how to do it. I have a reference file that has all of the devices in my inventory by bar code.
Reference file:
PTR10001,PRINTER,SN A
PTR10002,PRINTER,SN B
PTR10003,PRINTER,SN C
MON10001,MONITOR,SN A
MON10002,MONITOR,SN B
MON10003,MONITOR,SN C
CPU10001,COMPUTER,SN A
CPU10002,COMPUTER,SN B
CPU10003,COMPUTER,SN C
What I would like to do is make a file where I only have to put the abbreviation of what I need on it.
File 2 would look like this:
PTR
CPU
MON
MON
The desired output of this would be a file that tells me which items, by barcode, I need to pull off the shelf.
Desired output file:
PTR10001
CPU10001
MON10001
MON10002
As seen in the output, since I cannot have 2 of the same barcode, I need it to look through the reference file and find the first match. After the number is copied to the output file, I would like to remove the number from the reference file so that it doesn't repeat the number.
I have tried several iterations of awk, but have not been able to get the desired output.
The closest that I have gotten is the following code:
awk -F'/' '{ key = substr($1,1,3) } NR==FNR {id[key]=$1; next} key in id { $1=id[key] } { print }' $file1 $file2 > $file3
I am writing this in ksh, and would like to use awk as I think this would be the best answer to the problem.
Thanks for helping me with this.

First solution:
From your detailed description, I assume order doesn't matter, as you want to know what to pull off the shelf. So you could do the opposite: first read file2, count the items, and then go to the shelf and get them.
awk -F, 'FNR==NR{c[$0]++; next} c[substr($1,1,3)]-->0{print $1}' file2 file1
output:
PTR10001
MON10001
MON10002
CPU10001
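The c[substr($1,1,3)]-->0 part packs the whole bookkeeping into the condition: it is (c[key]--) > 0, so awk tests whether any requests for that prefix are still outstanding and decrements the counter in the same step. A minimal standalone sketch of the idiom (not from the answer, just an illustration):
awk 'BEGIN {
    n = 2                                  # pretend file2 asked for 2 items
    for (i = 1; i <= 4; i++)               # 4 matching lines in file1
        print "item" i, (n-- > 0 ? "pull" : "leave")
}'
It prints "pull" for the first two items and "leave" for the rest, which is exactly how the one-liner stops after the requested count.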
Second solution:
Your awk is very close to what you want, but you need a second dimension in your array so you don't overwrite the existing ids. We will do it with a pseudo-2-d array (BTW, GNU awk has real multidimensional arrays) where we store the ids as a comma-joined string like PTR10001,PTR10002,PTR10003; we retrieve them with split, and we remove them from the shelf as well.
> cat tst.awk
BEGIN { FS="," }
NR==FNR {
    key = substr($1,1,3)
    ids[key] = (ids[key] ? ids[key] "," $1 : $1)   #append new id.
    next
}
$0 in ids {
    split(ids[$0], tmp, ",")
    print tmp[1]
    ids[$0] = substr(ids[$0], length(tmp[1])+2)    #remove from shelf
}
Output
awk -f tst.awk file1 file2
PTR10001
CPU10001
MON10001
MON10002
Here we keep the order of file2 as this is based on the idea you have tried.
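Since GNU awk's real multidimensional arrays were mentioned above, here is a gawk-only sketch of the same shelf idea using arrays of arrays instead of the comma-joined string (my variant, assuming gawk 4+; only checked against the shown samples):
# gawk only: ids[prefix][n] holds the n-th barcode for that prefix
BEGIN { FS = "," }
NR == FNR {
    key = substr($1,1,3)
    ids[key][++cnt[key]] = $1           # store barcodes in arrival order
    next
}
($0 in cnt) && used[$0] < cnt[$0] {     # anything left on the shelf?
    print ids[$0][++used[$0]]           # take the next one
}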

Could you please try the following; written and tested with the shown samples in GNU awk.
awk '
FNR==NR{
    iniVal[$0]++
    next
}
{
    counter=substr($0,1,3)
}
counter in iniVal{
    if(++currVal[counter]<=iniVal[counter]){
        print $1
        if(currVal[counter]==iniVal[counter]){ delete iniVal[counter] }
    }
}
' Input_file2 FS="," Input_file1
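Note the FS="," sitting between the two file names: awk treats var=value arguments as assignments that take effect at that point in the argument list, so Input_file2 is read with the default whitespace FS and Input_file1 with a comma. A tiny demo of the mechanism (fileA and fileB are hypothetical names):
awk '{ print FILENAME, NF }' fileA FS="," fileB
fileA is split on whitespace and fileB on commas, all in a single awk run.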
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which is true when Input_file2 is being read.
    iniVal[$0]++ ##Creating array iniVal with index of current line, incrementing it by 1 each time it comes here.
    next ##next will skip all further statements from here.
}
{
    counter=substr($0,1,3) ##Creating counter variable which has the 1st 3 characters of the current line of Input_file1.
}
counter in iniVal{ ##Checking if counter is present in iniVal; if so, do the following.
    if(++currVal[counter]<=iniVal[counter]){ ##Checking if the currVal array with index counter is less than or equal to iniVal; if so, do the following.
        print $1 ##Printing 1st field of current line here.
        if(currVal[counter]==iniVal[counter]){ ##Checking if currVal value is equal to iniVal with index of counter here.
            delete iniVal[counter] ##If the above condition is TRUE then delete that prefix from iniVal.
        }
    }
}
' Input_file2 FS="," Input_file1 ##Mentioning Input_file names here.

Related

AWK calculate percentages for each input file and output summary to single file

I have hundreds of csv files with the same format. I want to 'summarise' each file by counting occurrences of the word "Correct" in column 3 and calculating the percentage of "Correct"s per file (i.e. "Correct"s / total number of rows in that file). I am currently doing this for each file with a shell 'for-loop', but this isn't ideal across hundreds of files.
Minimal reproducible example:
cat file1.csv
id,prediction,evaluation
1,high,Correct
2,low,Correct
3,high,Incorrect
4,low,Incorrect
cat file2.csv
id,prediction,evaluation
1,high,Correct
2,low,Correct
3,high,Correct
4,low,Incorrect
Correct answer for each individual file:
awk 'BEGIN{FS=OFS=","; print "model,total_correct,accuracy"} NR>1{n++; if($3 == "Correct"){correct++}} END{print FILENAME, correct, correct / n}' file1.csv
model,total_correct,accuracy
file1.csv,2,0.5
awk 'BEGIN{FS=OFS=","; print "model,total_correct,accuracy"} NR>1{n++; if($3 == "Correct"){correct++}} END{print FILENAME, correct, correct / n}' file2.csv
model,total_correct,accuracy
file2.csv,3,0.75
My desired outcome:
model,total_correct,accuracy
file1.csv,2,0.5
file2.csv,3,0.75
Thanks for any advice.
With GNU awk you can try the following code; written and tested with the shown samples. Using ENDFILE here to make life easy. Also added 2 more conditions to the code. 1st: increase the count n only when the line is non-empty. 2nd: when computing the average, guard against zero records so it prints N/A rather than a division-by-zero error. I have also changed NR>1 to FNR>1, since NR is a cumulative count and we need FNR, which resets the line number at the beginning of each Input_file.
awk '
BEGIN{
    FS=OFS=","
    print "model,total_correct,accuracy"
}
FNR>1{
    if(NF) { n++ }
    if($3 == "Correct"){ correct++ }
}
ENDFILE{
    printf("%s,%d,%s\n", FILENAME, correct, (n>0 ? sprintf("%.2f", correct/n) : "N/A"))
    correct=n=0
}
' *.csv
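ENDFILE is a gawk extension; with a POSIX awk you can emulate it by flushing the previous file's totals whenever a new file starts, and once more in END. A sketch along those lines (same logic, only checked against the shown samples):
awk '
BEGIN { FS=","; print "model,total_correct,accuracy" }
FNR==1 && NR>1 {                       # a new file just started: report the previous one
    printf("%s,%d,%s\n", prev, correct, (n>0 ? sprintf("%.2f", correct/n) : "N/A"))
    correct = n = 0
}
{ prev = FILENAME }
FNR>1 {
    if (NF) { n++ }
    if ($3 == "Correct") { correct++ }
}
END {                                  # report the last file
    if (prev) printf("%s,%d,%s\n", prev, correct, (n>0 ? sprintf("%.2f", correct/n) : "N/A"))
}' *.csv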
With the standard awk, you can increment counts in an array indexed by filename and by whether the third column is "Correct", then iterate through the filenames at the end to output the statistics:
awk '
BEGIN{FS=OFS=","; print "model,total_correct,accuracy"}
FNR>1{++r[FILENAME,$3=="Correct"]}
END{
    for(i=1;i<ARGC;++i){
        f=ARGV[i]
        print f,c=r[f,1],c/(r[f,0]+c)
    }
}' *.csv
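The r[FILENAME,$3=="Correct"] subscript is awk's pseudo-multidimensional indexing: the comma joins the two parts with SUBSEP into a single string key, which is why r[f,1] and r[f,0] can be read back the same way. A quick illustration:
awk 'BEGIN {
    r["file1.csv",1] = 2                        # 2 "Correct" rows
    for (k in r) { split(k, p, SUBSEP); print p[1], p[2], r[k] }
}'
This prints "file1.csv 1 2", showing the two components recovered with split on SUBSEP.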

How to find and match an exact string in a column using AWK?

I'm having trouble matching an exact string that I want to find in a file using awk.
I have the file called "sup_groups.txt" that contains:
(the structure is: "group_name:pw:group_id:user1<,user2>...")
adm:x:4:syslog,adm1
admins:x:1006:adm2,adm12,manuel
ssl-cert:x:122:postgres
ala2:x:1009:aceto,salvemini
conda:x:1011:giovannelli,galise,aceto,caputo,haymele,salvemini,scala,adm2,adm12
adm1Group:x:1022:adm2,adm1,adm3
docker:x:998:manuel
Now, I want to extract the records that have the user "adm1" in the user list and print the first column (the group name). But you can see that there is a user called "adm12", so when I do this:
awk -F: '$4 ~ "adm1" {print $1}' sup_groups.txt
the output is:
adm
admins
conda
adm1Group
The command of course also prints those records that contain the string "adm12", but I don't want these lines because I'm interested only in the user "adm1".
So, how can I change this command so that it prints only lines 1 and 6 (excluding 2 and 5)?
Thank you so much, and sorry for my bad English.
EDIT: Thank you for the answers; you gave me inspiration for the solution. I think this might work as well as your solutions, but simplified:
awk -F: '$4 ~ "adm,|adm1$|:adm1," {print $1}' sup_groups.txt
Basically I'm using ORs to cover all the cases while excluding "adm12".
Let me know if you think this is correct.
1st solution: Using the split function of awk. With your shown samples, please try the following awk code.
awk -F':' '
{
    num=split($4,arr,",")
    for(i=1;i<=num;i++){
        if(arr[i]=="adm1"){
            print
        }
    }
}
' Input_file
Explanation: Adding detailed explanation for above.
awk -F':' ' ##Starting awk program from here, setting field separator as : here.
{
    num=split($4,arr,",") ##Using split to split 4th field into array arr with delimiter of ,
    for(i=1;i<=num;i++){ ##Running for loop till value of num (total elements of array arr).
        if(arr[i]=="adm1"){ ##Checking condition if arr[i] value is equal to adm1 then do following.
            print ##Printing current line here.
        }
    }
}
' Input_file ##Mentioning Input_file name here.
2nd solution: Using regex and conditions in awk.
awk -F':' '$4~/^adm1,/ || $4~/,adm1,/ || $4~/,adm1$/' Input_file
OR if the 4th field might not contain a comma at all (a single user in the list), then try the following:
awk -F':' '$4~/^adm1,/ || $4~/,adm1,/ || $4~/,adm1$/ || $4=="adm1"' Input_file
Explanation: Set the field separator to : and check whether the 4th field matches ^adm1, (starts with adm1,), OR contains ,adm1,, OR matches ,adm1$ (ends with ,adm1); if so, print that line.
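The three alternations can also be collapsed into one anchored regex; a sketch that behaves the same on the shown samples (printing the group name, as the question asked):
awk -F':' '$4 ~ /(^|,)adm1(,|$)/ {print $1}' sup_groups.txt
(^|,) insists adm1 is at the start of the list or right after a comma, and (,|$) insists it is followed by a comma or the end of the field, so adm12 can never match.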
This should do the trick:
$ awk -F: '"," $4 "," ~ ",adm1," { print $1 }' file
The idea behind this is to wrap the user-list field in commas so that every entry in the list is surrounded by commas on both sides. So instead of searching for adm1 you search for ,adm1,.
So if your list looks like:
adm2,adm12,manuel
and, by adding commas, you convert it to:
,adm2,adm12,manuel,
you can always search for ,adm1, and find an exact match.
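A quick way to convince yourself of the padding trick, with a hypothetical list baked in:
awk 'BEGIN {
    list = "adm2,adm12,manuel"
    s = "," list ","                       # pad both ends
    print (s ~ /,adm1,/ ? "match" : "no match")
}'
This prints "no match", because adm12 never sits between two commas as adm1; change adm12 to adm1 in the list and it prints "match".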
Once you set up FS per the task requirements, the main body becomes barely just:
NF = !_ < NF
or, even more straightforward:
{m,n,g}awk '--NF'
which is equivalent to:
{m,g}awk 'NF=!_<NF' OFS= FS=':[^:]*:[^:]*:[^:]*[^[:alpha:]]?adm[0-9]+.*$'
adm
admins
conda
adm1Group
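My reading of the NF = !_ < NF trick, for anyone puzzled by it: _ is an unset variable, so !_ is 1, and 1 < NF is true exactly when the big FS regex matched and split the line into two fields. Assigning that boolean back to NF truncates the record to one field (or zero), and with OFS set to empty the rebuilt $0 is just the group name; lines where NF became 0 are falsy and never print. A tiny check of the pieces:
awk 'BEGIN { print !_, (!_ < 2), (!_ < 1) }'    # prints: 1 1 0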

Grepping all strings on the same line from multiple files

Trying to find a way to grep all names onto one line, for 100 files. All the names found in a given file must appear on the same line.
FILE1
"company":"COMPANY1","companyDisplayName":"CM1","company":"COMPANY2","companyDisplayName":"CM2","company":"COMPANY3","companyDisplayName":"CM3",
FILE2
"company":"COMPANY99","companyDisplayName":"CM99"
The output I actually want is the following (include the file name as a prefix):
FILE1:COMPANY1,COMPANY2,COMPANY3
FILE2:COMPANY99
I tried grep -oP '(?<="company":")[^"]*' * but I get results like this:
FILE1:COMPANY1
FILE1:COMPANY2
FILE1:COMPANY3
FILE2:COMPANY99
Could you please try the following.
awk -F'[,:]' '
BEGIN{
    OFS=","
}
{
    for(i=1;i<=NF;i++){
        if($i=="\"company\""){
            val=(val?val OFS:"")$(i+1)
        }
    }
    gsub(/\"/,"",val)
    print FILENAME":"val
    val=""
}
' Input_file1 Input_file2
Explanation: Adding explanation for above code.
awk -F'[,:]' ' ##Starting awk program here and setting field separator as colon OR comma here for all lines of Input_file(s).
BEGIN{ ##Starting BEGIN section of awk here.
    OFS="," ##Setting OFS as comma here.
} ##Closing BEGIN BLOCK here.
{ ##Starting main BLOCK here.
    for(i=1;i<=NF;i++){ ##Starting a for loop which runs from i=1 till value of NF.
        if($i=="\"company\""){ ##Checking condition if field value is equal to "company" then do following.
            val=(val?val OFS:"")$(i+1) ##Creating a variable named val and appending the next field to it each time the cursor comes here.
        } ##Closing BLOCK for if condition here.
    } ##Closing BLOCK for, for loop here.
    gsub(/\"/,"",val) ##Using gsub to globally substitute all " in variable val here.
    print FILENAME":"val ##Printing filename, colon and variable val here.
    val="" ##Nullifying variable val here.
} ##Closing main BLOCK here.
' Input_file1 Input_file2 ##Mentioning Input_file names here.
Output will be as follows.
Input_file1:COMPANY1,COMPANY2,COMPANY3
Input_file2:COMPANY99
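The val=(val?val OFS:"")$(i+1) line is the usual build-a-list idiom: prepend the separator only when val already holds something, so the list never starts with a comma. In isolation:
awk 'BEGIN {
    OFS = ","
    split("COMPANY1 COMPANY2 COMPANY3", w, " ")
    for (i = 1; i <= 3; i++) val = (val ? val OFS : "") w[i]
    print val                              # -> COMPANY1,COMPANY2,COMPANY3
}'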
EDIT: Adding a solution in case OP needs to use grep and wants to build the final output from its output (though I would recommend the awk solution itself, since it does NOT use multiple commands or sub-shells).
grep -oP '(?<="company":")[^"]*' * | awk 'BEGIN{FS=":";OFS=","} prev!=$1 && val{print prev":"val;val=""} {val=(val?val OFS:"")$2;prev=$1} END{if(val){print prev":"val}}'
There are two tools that can take the output of your grep command and reformat it the way you want. First tool is GNU datamash. Second is tsv-summarize from eBay's tsv-utils package (disclaimer: I'm the author). Both tools solve this in similar ways:
$ # The grep output
$ echo $'FILE1:COMPANY1\nFILE1:COMPANY2\nFILE1:COMPANY3\nFILE2:COMPANY99' > grep-output.txt
$ cat grep-output.txt
FILE1:COMPANY1
FILE1:COMPANY2
FILE1:COMPANY3
FILE2:COMPANY99
$ # Using GNU datamash
$ cat grep-output.txt | datamash --field-separator : --group 1 unique 2
FILE1:COMPANY1,COMPANY2,COMPANY3
FILE2:COMPANY99
$ # Using tsv-summarize
$ cat grep-output.txt | tsv-summarize --delimiter : --group-by 1 --unique-values 2 --values-delimiter ,
FILE1:COMPANY1,COMPANY2,COMPANY3
FILE2:COMPANY99

awk to parse field if specific value is in another

In the awk below I am trying to parse $2 using the _ separator, but only if $3 is a specific value (ID). I am reading that parsed value into an array and am going to use it as a key in a lookup. The awk does execute, but the entire line 2 (the line with ID in $3) prints, not just the desired substring. The print statement is only to see the results (for testing only) and will not be part of the script. Thank you :).
awk
awk -F'\t' '$3=="ID"
f="$(echo $2|cut -d_ -f1,1)"
{
print $f
}' file
file (tab-delimited):
R_Index locus type
17 chr20:31022959 NON
18 chr11:118353210-chr9:20354877_KMT2A-MLLT3.K8M9 ID
desired
$f = chr11:118353210-chr9:20354877
Not completely clear, but could you please try the following.
awk '{split($2,array,"_");if(array[2]=="KMT2A-MLLT3.K8M9"){print array[1]}}' Input_file
Or if you want to change the 2nd field's value while printing all lines, then try the following:
awk '{split($2,array,"_");if(array[2]=="KMT2A-MLLT3.K8M9"){$2=array[1]}} 1' Input_file
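If you would rather key off the type column, as the question's wording suggests ("only if $3 is ID"), a minimal sketch on the shown tab-delimited sample:
awk -F'\t' '$3=="ID" { split($2, arr, "_"); print arr[1] }' file
For the sample file this prints chr11:118353210-chr9:20354877, the part of $2 before the underscore.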

remove all lines in a file containing a string from another file

I'd like to remove all the lines of a file based on matching a string from another file. This is what I have used but it only deletes some:
grep -vFf to_delete.csv inputfile.csv > output.csv
Here are sample lines from my input file (inputfile.csv):
Ata,Aqu,Ama3,Abe,0.053475,0.025,0.1,0.11275,0.1,0.15,0.83377
Ata135,Aru2,Aba301,A29,0.055525,0.025,0.1,0.082825,0.075,0.125
Ata135,Atb,Aca,Am54,0.14695,0.1,0.2,0.05255,0.025,0.075,0.8005,
Adc,Aru7,Ama301,Agr84,0.002075,0,0.025,0.240075,0.2,0.
My file "to_delete.csv" looks like this for example:
Aqu
Aca
So any line with those strings should get deleted, in this case, lines 1 and 3 should get deleted. Sample desired output:
Ata135,Aru2,Aba301,A29,0.055525,0.025,0.1,0.082825,0.075,0.125
Adc,Aru7,Ama301,Agr84,0.002075,0,0.025,0.240075,0.2,0.
EDIT: Since OP had carriage return characters in the files, adding a solution for that too now.
cat -v Input_file ##To check if carriage returns are there or not.
tr -d '\r' < Input_file > temp_file && mv temp_file Input_file
Since your samples of Input_file and expected output are not fully clear, I couldn't completely test it, but could you please try the following (if you are ok with awk). Append > temp_file && mv temp_file Input_file to the code to save the output into Input_file itself.
awk -F, 'FNR==NR{a[$0];next} {for(i=1;i<=NF;i++){if($i in a){next}}} 1' to_delete.csv Input_file > temp_file && mv temp_file Input_file
Explanation: Adding explanation for above code too now.
awk -F, ' ##Setting field separator as comma here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when the first Input_file is being read.
    a[$0] ##Creating an array named a whose index is $0.
    next ##next will skip all further statements from here.
}
{
    for(i=1;i<=NF;i++){ ##Starting a for loop from value i=1 till value of NF.
        if($i in a){ ##Checking if $i is present in array a; if yes then go into this condition block.
            next ##next will skip all further statements (since we DO NOT want to print any matching contents).
        } ##Closing if block now.
    } ##Closing for block here.
} ##Closing block which should be executed for 2nd Input_file here.
1 ##awk works on a pattern{action} model, so making the condition TRUE here with no action means the default action, printing the current line, happens.
' to_delete.csv Input_file ##Mentioning Input_file names here now.
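If the carriage returns mentioned in the EDIT are the culprit, you can also strip them inside awk itself rather than running tr first; a sketch combining the two steps (same field-matching logic as above):
awk -F, '
FNR==NR { sub(/\r$/,""); a[$0]; next }       # patterns, minus any trailing \r
{
    sub(/\r$/,"")                            # clean the data line too
    for (i=1; i<=NF; i++) if ($i in a) next  # drop lines with a matching field
}
1' to_delete.csv Input_file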