I have written a bash script that tries to build a new file from two files.
File1:
1000846364118,9,369,9901,0,2020.05.20 13:20:52,2020.07.14 16:38:11,2021.03.14 00:00:00,U,2020.07.14 16:38:11
1000683648398,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,U,2020.01.01 01:25:05
1000534726081,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,X,2020.01.01 01:25:05
File2:
1000846364118;0;;2021.04.04;9914;100084636;ISATD;U;TEST;1234567890;2;;0;0;0;0;2020.10.12.00:00:00;0;0
1000830686890;0;;2021.03.02;9807;100083068;ISATD;U;TEST;1234567891;2;;0;0;0;0;2020.10.12.00:00:01;0;0
1000835819335;0;;2021.03.21;9990;100083581;ISATD;U;TEST;1234567892;2;;0;0;0;0;2020.10.12.00:00:03;0;0
1000683648398;0;;2020.10.31;9829;100068364;ISATD;U;TEST;1234567893;2;;0;0;0;0;2020.10.12.00:00:06;0;0
The new file should contain only the rows from file1 that have the pattern 'U' in them, with an extra column holding the 10th field (123456789X) of the matching file2 row. So my final output will look like this:
1000846364118,9,369,9901,0,2020.05.20 13:20:52,2020.07.14 16:38:11,2021.03.14 00:00:00,U,2020.07.14 16:38:11,1234567890
1000683648398,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,U,2020.01.01 01:25:05,1234567893
My script is below and works fine, but the data I am working with is huge and generating the output file takes too much time. I put a timestamp after every step and found that the for loop takes hours to generate a few KB of data, while I am working with a few hundred MB of data. I need help optimising it.
cat /dev/null > new_file
used_Serial_Number=`grep U file1 | awk -F "," '{print $1}'`
echo "Serial no extracted at `date`" # Till this portion is getting completed in 2-3mins
for i in $used_Serial_Number; do
msisdn=`grep $i file2 | awk -F ";" '{print $10}'`
grep $i file1 | awk -v msisdn=$msisdn -F "," 'BEGIN { OFS = "," } { print $0 , msisdn }' >> new_file
done
Please try the following, written and tested with the shown samples in GNU awk. If the 9th field of Input_file1 can be either u or U, change $9=="U" to tolower($9)=="u" to match both cases.
awk '
BEGIN{
FS=";"
OFS=","
}
FNR==NR{
a[$1]=$10
next
}
($1 in a) && $9=="U"{
print $0,a[$1]
}
' Input_file2 FS="," Input_file1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=";" ##Setting FS as ; here.
OFS="," ##Setting OFS as , here.
}
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when Input_file2 is being read.
a[$1]=$10 ##Creating array a with index $1 and value is $10 here.
next ##next will skip all further statements from here.
}
($1 in a) && $9=="U"{ ##Checking if $1 is in a and 9th field is U then do following.
print $0,a[$1] ##Printing current line along with value of a with index of $1 here.
}
' file2 FS="," file1 ##Mentioning Input_file2 then setting FS as , and mentioning Input_file1 here.
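To check the join against the samples from the question, a small self-contained sketch could look like the following. The temporary directory from mktemp and the variable names tmp and result are assumptions made for illustration; only three of the sample rows per file are reproduced.

```shell
# Work in a throwaway directory (hypothetical location created by mktemp).
tmp=$(mktemp -d)

# file1 is comma-separated; the 9th field carries the U/X flag.
cat > "$tmp/file1" <<'EOF'
1000846364118,9,369,9901,0,2020.05.20 13:20:52,2020.07.14 16:38:11,2021.03.14 00:00:00,U,2020.07.14 16:38:11
1000683648398,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,U,2020.01.01 01:25:05
1000534726081,9,369,9901,0,2019.05.04 19:50:39,2019.06.23 14:27:17,2019.12.31 23:59:59,X,2020.01.01 01:25:05
EOF

# file2 is semicolon-separated; the 10th field is the value to append.
cat > "$tmp/file2" <<'EOF'
1000846364118;0;;2021.04.04;9914;100084636;ISATD;U;TEST;1234567890;2;;0;0;0;0;2020.10.12.00:00:00;0;0
1000830686890;0;;2021.03.02;9807;100083068;ISATD;U;TEST;1234567891;2;;0;0;0;0;2020.10.12.00:00:01;0;0
1000683648398;0;;2020.10.31;9829;100068364;ISATD;U;TEST;1234567893;2;;0;0;0;0;2020.10.12.00:00:06;0;0
EOF

# One pass over file2 builds the lookup table; one pass over file1 prints matches.
result=$(awk '
BEGIN { FS = ";"; OFS = "," }
FNR == NR { a[$1] = $10; next }               # first file: serial number -> 10th field
($1 in a) && $9 == "U" { print $0, a[$1] }    # second file: keep U rows with a match
' "$tmp/file2" FS="," "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Each file is read exactly once here, whereas the loop in the question rescans both files for every serial number, which is the likely reason it takes hours.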
Example file:
Pattern 1
AAAAAAAAAA
BBBBBBBBBB
Pattern 2
I want to print the lines between two patterns in a file in one line.
From a previous question, How to print lines between two patterns, inclusive or exclusive (in sed, AWK or Perl)?, I found this very nice command:
awk '/Pattern 1/{flag=1; next} /Pattern 2/{flag=0} flag' file
With output:
AAAAAAAAAA
BBBBBBBBBB
My desired output:
AAAAAAAAAABBBBBBBBBB
Write your awk program this way; written and tested in GNU awk.
awk '
/Pattern 2/{
if(found){
print val
}
found=val=""
next
}
/Pattern 1/{
found=1
next
}
found{
val=val $0
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/Pattern 2/{ ##Checking if Pattern 2 is found here then do following.
if(found){ ##Checking if found is set then do following.
print val ##Printing val here.
}
found=val="" ##Nullifying found and val here, so the next block starts with an empty buffer.
next ##next will skip all statements from here.
}
/Pattern 1/{ ##Checking if Pattern 1 is found in current line.
found=1 ##Setting found to 1 here.
next ##next will skip all statements from here.
}
found{ ##Checking condition if found is SET then do following.
val=val $0 ##Appending the current line to variable val here.
}
' Input_file ##Mentioning Input_file name here.
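When the input holds more than one Pattern 1 / Pattern 2 block, the buffer has to be emptied at the start of each block, otherwise earlier lines leak into later output. A minimal sketch of that, with made-up sample input and the hypothetical variable names input and joined:

```shell
# Sample input with two blocks and one line outside any block (illustrative only).
input='Pattern 1
AAAAAAAAAA
BBBBBBBBBB
Pattern 2
ignored line
Pattern 1
CCCC
DDDD
Pattern 2'

joined=$(printf '%s\n' "$input" | awk '
/Pattern 1/ { found = 1; val = ""; next }               # start a block with an empty buffer
/Pattern 2/ { if (found) print val; found = 0; next }   # end of block: emit the joined lines
found       { val = val $0 }                            # inside a block: append, no separator
')

printf '%s\n' "$joined"
```

This prints one joined line per block and skips lines that fall between blocks.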
You may use this awk:
awk '/Pattern 2/ {if (s!="") print s; s=f=""} f {s = s $0} /Pattern 1/ {f=1}' file
AAAAAAAAAABBBBBBBBBB
And also with awk:
awk -v RS= '!/Pattern/ {sub(/\n/,"");print}' file
AAAAAAAAAABBBBBBBBBB
With GNU awk for multi-char RS and assuming your "Pattern"s really do take up whole lines and can't occur elsewhere in your input (easy fix if that's wrong):
$ awk -v RS='Pattern 2' 'sub(/.*Pattern 1/,""){gsub(/\n/,""); print}' file
AAAAAAAAAABBBBBBBBBB
or with any awk:
awk 'f{ if (/Pattern 2/){print buf; f=0} else buf=buf $0 } /Pattern 1/{f=1; buf=""}' file
AAAAAAAAAABBBBBBBBBB
You can set the output record separator to an empty string by using -v ORS=:
awk -v ORS= '/Pattern 1/{flag=1; next} /Pattern 2/{flag=0} flag' file
To print a newline at the end, add END{print "\n"}:
awk -v ORS= '/Pattern 1/{flag=1; next} /Pattern 2/{flag=0} flag; END{print "\n"}' file > newfile
INPUT :
TT,SS,ECID,CDID,ODID,Symbol,Side,LastQty,LastPx,CumQty,AvgPx,
"20191008-13:32:52","RO","0284","378MT","r7ot","SPD","1","100","290.67","400","290.67",
"20191008-13:33:13","RO","02DJ","378MT","r7o","SPD","1","100","290.68","2248","290.655",
"20191008-13:33:26","RO","FATS","378MTA","r7ot","PDF","1","100","290.92","2751","290.608",
Output should be :
SPD 200
PDF 100
I tried doing it with the following, but it doesn't work:
$ awk '{a[$3]+=$4}END{for(i in a) print i,a[i]}' file
EDIT2: Since the OP has an old awk without FPAT, I added the following code, which works for the shown samples.
awk -F, '{gsub(/\r/,"")} FNR>1{gsub(/"/,"",$8);gsub(/"/,"",$6);a[$6]+=$8} END{for(i in a){print i,a[i] | "sort -k1"}}' Input_file
EDIT: Since the OP changed the Input_file completely, adding this solution now. Written and tested with GNU awk.
awk -v FPAT='[^,]*|"[^"]+"' '
{gsub(/\r/,"")}
FNR>1{
gsub(/"/,"",$8)
gsub(/"/,"",$6)
a[$6]+=$8
}
END{
for(i in a){
print i,a[i]
}
}
' Input_file
OR, to sort the output in alphabetical order, try the following.
awk -v FPAT='[^,]*|"[^"]+"' '{gsub(/\r/,"")} FNR>1{gsub(/"/,"",$8);gsub(/"/,"",$6);a[$6]+=$8} END{for(i in a){print i,a[i] | "sort -k1"}}' Input_file
You were close. The problem with your approach is that you haven't set the field separator to , in your code, although your Input_file is comma-separated, so $3 and $4 are never populated and the sum doesn't work. Please try the following.
awk -F"[[:space:]]*,[[:space:]]*" 'FNR>1{a[$3]+=$4} END{for(i in a){print i,a[i]}}' Input_file
PS: Thanks to oguz ismail for pointing out the field separator setting.
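To see the non-FPAT variant run against the quoted sample rows, here is a hedged sketch. It assumes no field contains an embedded comma (otherwise splitting on , is not enough), and the variable names csv, totals, sym and qty are made up for illustration.

```shell
# Sample data from the question, including the trailing comma on each row.
csv='TT,SS,ECID,CDID,ODID,Symbol,Side,LastQty,LastPx,CumQty,AvgPx,
"20191008-13:32:52","RO","0284","378MT","r7ot","SPD","1","100","290.67","400","290.67",
"20191008-13:33:13","RO","02DJ","378MT","r7o","SPD","1","100","290.68","2248","290.655",
"20191008-13:33:26","RO","FATS","378MTA","r7ot","PDF","1","100","290.92","2751","290.608",'

totals=$(printf '%s\n' "$csv" | awk -F, '
{ gsub(/\r/, "") }                              # drop carriage returns, if any
FNR > 1 {                                       # skip the header row
    sym = $6; qty = $8
    gsub(/"/, "", sym); gsub(/"/, "", qty)      # strip the surrounding quotes
    sum[sym] += qty                             # accumulate LastQty per Symbol
}
END { for (s in sum) print s, sum[s] }
' | sort)

printf '%s\n' "$totals"
```

The final sort is only there to make the order deterministic, since for (s in sum) visits keys in an unspecified order.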
I am trying to find a way to grep all names from 100 files so that all names found in each file appear on a single line.
FILE1
"company":"COMPANY1","companyDisplayName":"CM1","company":"COMPANY2","companyDisplayName":"CM2","company":"COMPANY3","companyDisplayName":"CM3",
FILE2
"company":"COMPANY99","companyDisplayName":"CM99"
The output I actually want is (including the file name as a prefix):
FILE1:COMPANY1,COMPANY2,COMPANY3
FILE2:COMPANY99
I tried grep -oP '(?<="company":")[^"]*' * but I get results like this:
FILE1:COMPANY1
FILE1:COMPANY2
FILE1:COMPANY3
FILE2:COMPANY99
Could you please try following.
awk -F'[,:]' '
BEGIN{
OFS=","
}
{
for(i=1;i<=NF;i++){
if($i=="\"company\""){
val=(val?val OFS:"")$(i+1)
}
}
gsub(/\"/,"",val)
print FILENAME":"val
val=""
}
' Input_file1 Input_file2
Explanation: Adding explanation for above code.
awk -F'[,:]' ' ##Starting awk program here and setting field separator as colon OR comma here for all lines of Input_file(s).
BEGIN{ ##Starting BEGIN section of awk here.
OFS="," ##Setting OFS as comma here.
} ##Closing BEGIN BLOCK here.
{ ##Starting main BLOCK here.
for(i=1;i<=NF;i++){ ##Starting a for loop which starts from i=1 to till value of NF.
if($i=="\"company\""){ ##Checking condition if field value is equal to "company" then do following.
val=(val?val OFS:"")$(i+1) ##Appending the next field to variable val, preceded by OFS when val already has a value.
} ##Closing BLOCK for if condition here.
} ##Closing BLOCK for, for loop here.
gsub(/\"/,"",val) ##Using gsub to globally substitute all " in variable val here.
print FILENAME":"val ##Printing filename colon and variable val here.
val="" ##Nullifying variable val here.
} ##Closing main BLOCK here.
' Input_file1 Input_file2 ##Mentioning Input_file names here.
Output will be as follows.
Input_file1:COMPANY1,COMPANY2,COMPANY3
Input_file2:COMPANY99
EDIT: Adding a solution in case the OP needs to use grep and wants to build the final output from grep's output (though I recommend the awk solution itself, since it does not need multiple commands or sub-shells).
grep -oP '(?<="company":")[^"]*' * | awk 'BEGIN{FS=":";OFS=","} prev!=$1 && val{print prev":"val;val=""} {val=(val?val OFS:"")$2;prev=$1} END{if(val){print prev":"val}}'
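Another way to sketch the extraction in plain awk is to walk each line with match() instead of splitting on commas and colons. This is only an illustration: it assumes the values contain no embedded double quotes, the sample line is shortened from FILE1, and the variable names line, names, rest, value and out are made up. With real files, FILENAME ":" could be prepended to the printed line.

```shell
# Shortened sample line in the same shape as FILE1.
line='"company":"COMPANY1","companyDisplayName":"CM1","company":"COMPANY2","companyDisplayName":"CM2",'

names=$(printf '%s\n' "$line" | awk '
{
    rest = $0; out = ""
    # Repeatedly find the next "company":"..." pair and keep only the value.
    while (match(rest, /"company":"[^"]*"/)) {
        # The prefix "company":" is 11 characters long; the closing quote adds 1.
        value = substr(rest, RSTART + 11, RLENGTH - 12)
        out = (out == "" ? value : out "," value)
        rest = substr(rest, RSTART + RLENGTH)     # continue after this match
    }
    print out
}')

printf '%s\n' "$names"
```

Because the pattern requires a colon right after "company", the "companyDisplayName" keys are not picked up.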
There are two tools that can take the output of your grep command and reformat it the way you want. First tool is GNU datamash. Second is tsv-summarize from eBay's tsv-utils package (disclaimer: I'm the author). Both tools solve this in similar ways:
$ # The grep output
$ echo $'FILE1:COMPANY1\nFILE1:COMPANY2\nFILE1:COMPANY3\nFILE2:COMPANY99' > grep-output.txt
$ cat grep-output.txt
FILE1:COMPANY1
FILE1:COMPANY2
FILE1:COMPANY3
FILE2:COMPANY99
$ # Using GNU datamash
$ cat grep-output.txt | datamash --field-separator : --group 1 unique 2
FILE1:COMPANY1,COMPANY2,COMPANY3
FILE2:COMPANY99
$ # Using tsv-summarize
$ cat grep-output.txt | tsv-summarize --delimiter : --group-by 1 --unique-values 2 --values-delimiter ,
FILE1:COMPANY1,COMPANY2,COMPANY3
FILE2:COMPANY99
I'd like to remove all the lines of a file based on matching a string from another file. This is what I have used but it only deletes some:
grep -vFf to_delete.csv inputfile.csv > output.csv
Here are sample lines from my input file (inputfile.csv):
Ata,Aqu,Ama3,Abe,0.053475,0.025,0.1,0.11275,0.1,0.15,0.83377
Ata135,Aru2,Aba301,A29,0.055525,0.025,0.1,0.082825,0.075,0.125
Ata135,Atb,Aca,Am54,0.14695,0.1,0.2,0.05255,0.025,0.075,0.8005,
Adc,Aru7,Ama301,Agr84,0.002075,0,0.025,0.240075,0.2,0.
My file "to_delete.csv" looks like this for example:
Aqu
Aca
So any line with those strings should get deleted, in this case, lines 1 and 3 should get deleted. Sample desired output:
Ata135,Aru2,Aba301,A29,0.055525,0.025,0.1,0.082825,0.075,0.125
Adc,Aru7,Ama301,Agr84,0.002075,0,0.025,0.240075,0.2,0.
EDIT: Since the OP had carriage returns in the files, adding a solution for that too.
cat -v Input_file ##To check if carriage returns are there or not.
tr -d '\r' < Input_file > temp_file && mv temp_file Input_file
Since your samples of Input_file and the expected output are not clear, I couldn't fully test it; please try the following (if you are OK with awk). The > temp_file && mv temp_file Input_file at the end saves the output back into Input_file itself.
awk -F, 'FNR==NR{a[$0];next} {for(i=1;i<=NF;i++){if($i in a){next}}} 1' to_delete.csv Input_file > temp_file && mv temp_file Input_file
Explanation: Adding explanation for above code too now.
awk -F, ' ##Setting field separator as comma here.
FNR==NR{ ##checking condition FNR==NR which will be TRUE when first Input_file is being read.
a[$0] ##Creating an array named a whose index is $0.
next ##next will skip all further statements from here.
}
{
for(i=1;i<=NF;i++){ ##Starting a for loop from value i=1 to till value of NF.
if($i in a){ ##checking if $i is present in array a if yes then go into this condition block.
next ##next will skip all further statements (since we do NOT want to print any matching contents).
} ##Closing if block now.
} ##Closing for block here.
} ##Closing block which should be executed for 2nd Input_file here.
1 ##awk works on pattern and action method so making condition TRUE here and not mentioning any action so by default print of current line will happen.
' to_delete.csv Input_file ##Mentioning Input_file names here now.
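The carriage-return problem and the field-by-field filter can be tried together in one small sketch. It assumes, as one possible cause, that to_delete.csv was saved with Windows (CRLF) line endings; the shortened rows, the mktemp directory and the names tmp, del and kept are made up for illustration.

```shell
tmp=$(mktemp -d)

# Three shortened input rows; rows 1 and 3 contain a key to delete.
printf 'Ata,Aqu,Ama3,Abe,0.05\nAta135,Aru2,Aba301,A29,0.05\nAta135,Atb,Aca,Am54,0.14\n' > "$tmp/inputfile.csv"

# Key file with CRLF line endings (assumed cause of the partial deletes).
printf 'Aqu\r\nAca\r\n' > "$tmp/to_delete.csv"

kept=$(awk -F, '
FNR == NR { sub(/\r$/, ""); del[$0]; next }     # load keys, minus any trailing CR
{
    sub(/\r$/, "")                              # normalise the data line as well
    for (i = 1; i <= NF; i++) if ($i in del) next   # drop the line on any exact field match
}
1                                               # print every line that survived
' "$tmp/to_delete.csv" "$tmp/inputfile.csv")

printf '%s\n' "$kept"
rm -rf "$tmp"
```

Unlike grep -vFf, which matches substrings anywhere in the line, this compares whole fields, so a key such as Aca would not remove a line that merely contains Acab.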
I would like to modify a file by including the size of following line using awk.
My file is like this:
>AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)
ATGTCGATGCTCGATC
>AAAS::1215902:1215986:-:1::NW_015494524.1:1215902-1215986(-)
ATGCGATGCTAGCTAGCTCGAT
>AAAS:1215614:1215701:-:1::NW_015494524.1:1215614-1215701(-)
ATGCCGCGACGCAGCACCCGACGCGCAG
I am using awk to modify it to have the following format:
>Assembly_AAAS_1_16
ATGTCGATGCTCGATC
>Assembly_AAAS_2_22
ATGCGATGCTAGCTAGCTCGAT
>Assembly_AAAS_3_28
ATGCCGCGACGCAGCACCCGACGCGCAG
I have used awk to modify the first part.
awk -F":" -v i=1 '/>/{print ">Assembly_" $1 "_" val i "_";i++;next} {print length($0)} 1' infile | sed -e "s/_>/_/g" > outfile
I can use print length($0) but how to print it in the same line?
Thanks
EDIT2: Since the OP has changed the sample data again, adding this code now.
awk -v val="Assembly_AAAS_" '/>/{++i;val=">"val i "_";next} {sub(/ +$/,"");print val length($0) ORS $0}' Input_file
OR
awk -v val="Assembly_AAAS_" '/>/{++i;val=">"val i "_";next} {print val length($1) ORS $0;}' Input_file
The first command above removes trailing spaces from the lines of Input_file; if you don't need that, remove the sub(/ +$/,""); part.
EDIT: Changed the solution as per the OP's update.
awk -v i=1 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '/>/{value="\047" val i val1;i++;next} {print value length($0) ORS $0}' Input_file
OR
awk -v i=1 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '
/>/{ value="\047" val i val1;
i++;
next}
{
print value length($0) ORS $0
}
' Input_file
The following awk may help you with this.
awk -v i="" -v j=2 '/>/{print "\047>Assembly_GeneName1_"++i"_sizeline"j;j+=2;next} 1' Input_file
2nd solution:
awk -v i=1 -v j=2 -v val=">Assembly_GeneName1_" -v val1="_sizeline" '/>/{print "\047" val i val1 j;j+=2;i++;next} 1' Input_file
What you are dealing with is a beautiful example of records which are not lines. awk is a record parser and by default, a record is defined to be a line. With awk you can define a record to be a block of text using the record separator RS.
RS : The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more
than one character, the results are unspecified. If RS is null, then
records are separated by sequences consisting of a <newline> plus one
or more blank lines, leading or trailing blank lines shall not result
in empty records at the beginning or end of the input, and a <newline>
shall always be a field separator, no matter what the value of FS is.
So the goal is to define the record to be
AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)
ATGTCGATGCTCGATC
And this can be done by defining RS="\n>" (a multi-character RS needs an awk that supports it, such as GNU awk). Furthermore, we will use \n as the field separator FS. This way you can get the requested length as length($2) and the count by using the record count NR.
A simple awk script is then:
awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
{$1=">Assembly_AAAS_"NR"_"length($2)}
{print $1,$2}' file
This will do exactly what you want.
note: we use print $1,$2 and not print $0 as the last record might have 3 fields (if the last char of the file is a newline). This would imply that you would have an extra empty line at the end of your file.
If you want to pick the AAAS string out of $1, you can use substr($1,1,match($1,":")-1). The first record still starts with >, because no separator precedes it, so strip that first. This results in:
awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
{sub(/^>/,"",$1); $1=">Assembly_"substr($1,1,match($1,":")-1)"_"NR"_"length($2)}
{print $1,$2}' file
Finally, be aware that the above solution only works if there are no spaces in $2. If the sequence lines may contain blanks, you can remove them first:
awk 'BEGIN{RS="\n>"; FS=OFS="\n"}
{ sub(/^>/,"",$1); gsub(/[[:blank:]]/,"",$2);
$1=">Assembly_"substr($1,1,match($1,":")-1)"_"NR"_"length($2)
}
{ print $1,$2 }' file
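Since POSIX leaves a multi-character RS unspecified, a line-based alternative may be useful where GNU awk is not available. The following is only a sketch under the assumption that every sequence sits on a single line directly after its header; the variable names fasta, renamed, name and n are made up for illustration.

```shell
# Two records from the question's sample.
fasta='>AAAS:1220136:1220159:-:0::NW_015494524.1:1220136-1220159(-)
ATGTCGATGCTCGATC
>AAAS::1215902:1215986:-:1::NW_015494524.1:1215902-1215986(-)
ATGCGATGCTAGCTAGCTCGAT'

renamed=$(printf '%s\n' "$fasta" | awk -F: '
/^>/ { name = substr($1, 2); n++; next }    # header: keep the gene name, count records
{
    gsub(/[ \t]/, "")                       # ignore stray spaces or tabs in the sequence
    print ">Assembly_" name "_" n "_" length($0)
    print                                   # the sequence line itself
}')

printf '%s\n' "$renamed"
```

Here the header is held back until the sequence line arrives, so its length can be written into the new header without changing RS at all.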