How can I exclude data that appears in one file from another file - awk

I have two files, 1.txt and 2.txt, and I am trying to remove the entries that are listed in 1.txt from 2.txt.
1.txt is
P13674
Q7Z4N8
O14683
Q9UHX1
and 2.txt is the following
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK
The desired output will look like this:
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
I was trying to do it in bash like this, but it doesn't work:
cat 1.txt | grep -A 2.txt | sed -n -e '1,/>/ {/>/ !' > myout.txt

$ awk -F'|' 'NR==FNR{a[$0];next} /^>/{f=($2 in a ? 0 : 1)} f' 1.txt 2.txt
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
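The idiom behind this answer is worth spelling out: NR==FNR is true only while awk reads the first file (1.txt), so its lines become keys of the array a; while reading 2.txt, every > header resets the flag f according to whether field 2 (the accession between the | delimiters) is in a, and the bare pattern f prints every line while the flag is up. A minimal self-contained sketch with tiny stand-in files (note the idiom misfires if the first file is empty):

```shell
#!/bin/sh
tmp=$(mktemp -d)

# ids to exclude (stand-in for 1.txt)
printf 'P13674\n' > "$tmp/ids.txt"

# a tiny 2-record FASTA-style file (stand-in for 2.txt)
printf '>sp|P13674|X desc\nSEQ1\n>sp|Q96A73|Y desc\nSEQ2\n' > "$tmp/db.txt"

# NR==FNR: only true for the first file -> remember its ids in a[].
# /^>/:    on each header, f=1 iff the id in field 2 is NOT excluded.
# f:       print the current line (header or sequence) while the flag is set.
awk -F'|' 'NR==FNR{a[$0];next} /^>/{f=!($2 in a)} f' "$tmp/ids.txt" "$tmp/db.txt"

rm -rf "$tmp"
```

Only the Q96A73 record survives, because P13674 is listed in the exclusion file.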

In case you want the opposite, i.e. to print only the records whose 2nd column IS present in 1.txt, then try the following.
awk 'FNR==NR{a[$0];next} /^>/ {flag=$2 in a?1:0}flag' 1.txt FS="|" 2.txt
The above prints each matching > line and every line after it up to the next header. In case you want to print only the 2 lines after each matching > line, then try the following.
awk 'FNR==NR{a[$0];next} /^>/ {flag=$2 in a?1:0;count=2} flag && count-->0' 1.txt FS="|" 2.txt

Something like this should work (not tested); it needs gawk for the regexp RS and the RT variable:
$ awk 'NR==FNR {a[$1]; next}
!($2 in a) {print rt $0} {rt=RT}' file1 RS='(^|\n)>' FS='|' file2

If you want to try Perl, check below
perl -ne ' BEGIN {@x=map{chomp;$_} qx(cat 1.txt);$pat=join("|",@x) } print if not /$pat/ '
with inputs
$ perl -ne ' BEGIN {@x=map{chomp;$_} qx(cat 1.txt);$pat=join("|",@x) } print if not /$pat/ ' 2.txt
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK
$

That was a bit tricky. Here you go.
grep -v -f <(sed -e 's/^\(.*\)$/\^>sp|\1|/' 1.txt) 2.txt | sed -ne '/^>/{p;n;p}'
The regular expressions (regex) are in the old basic (rather than revised/extended) POSIX style. The first regex, /^\(.*\)$/, matches each line of 1.txt so that, for example, P13674 becomes ^>sp|P13674|. The last, as you see, matches the beginning of a line in 2.txt. You could omit the first pattern, just typing grep -v -f 1.txt 2.txt | sed ..., and in practice that would probably still work. I just thought it neater to match the beginning of a line more precisely.
The second regex, /^>/, distinguishes lines that begin with > from those that do not, because these two kinds of line want different handling.
By the way, I agree with you. The Awk solution is probably preferable. However, since you had started with Grep, my solution has followed in that vein.
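The whole pipeline can be exercised on small stand-in files. The sed step turns each accession into an anchored pattern, grep -v -f drops the matching headers, and the final sed keeps each surviving header plus exactly one following line, so this approach assumes every record is exactly two lines:

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf 'P13674\n' > "$tmp/1.txt"
printf '>sp|P13674|X\nSEQ1\n>sp|Q96A73|Y\nSEQ2\n' > "$tmp/2.txt"

# build anchored patterns like ^>sp|P13674| (| is a literal in basic regex)
sed -e 's/^\(.*\)$/^>sp|\1|/' "$tmp/1.txt" > "$tmp/pat.txt"

# drop excluded headers, then keep each remaining header and the line after it
grep -v -f "$tmp/pat.txt" "$tmp/2.txt" | sed -n '/^>/{p;n;p;}'

rm -rf "$tmp"
```

If sequences may wrap over several lines, the awk approach is the safer choice.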

Related

How do I get awk to print fields from the second row of a file?

I have a file that looks like this:
measured 10.8 0.0000 0.0000 0.0236 0.0304 0.0383 0.0433 0.0437 0.0442 0.0452
0.0455 0.0448 0.0440 0.0423 0.0386 0.0344 0.0274 0.0000 0.0000
I want gawk to print all the numbers in one long single column like this:
0.0000
0.0000
0.0236
0.0304
0.0383
0.0433
0.0437
0.0442
0.0452
0.0455
0.0448
0.0440
0.0423
0.0386
0.0344
0.0274
0.0000
0.0000
I run the command gawk '/measured/ { printf $3"\n" $4"\n" $5"\n" $6"\n" $7"\n" $8"\n" $9"\n" $10"\n" $11"\n" $12"\n" $13"\n" $14"\n" $15"\n" $16"\n" $17"\n" $18"\n" }' filename.txt
But I just get the first row of numbers:
0.0000
0.0000
0.0236
0.0304
0.0383
0.0433
0.0437
0.0442
0.0452
How do I get gawk to print the second row?
$ cat tst.awk
BEGIN { OFS = "\n" }
/measured/ { c=2; $1=$2=""; $0=$0 }   # open a 2-line window; blank out "measured" and "10.8"
c && c--   { $1=$1; print }           # rebuild $0 with OFS="\n" and print
$ awk -f tst.awk file
0.0000
0.0000
0.0236
0.0304
0.0383
0.0433
0.0437
0.0442
0.0452
0.0455
0.0448
0.0440
0.0423
0.0386
0.0344
0.0274
0.0000
0.0000
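Two idioms in tst.awk deserve a note: assigning to a field (here $1=$2="") makes awk rebuild $0 joined with OFS, and the follow-up $0=$0 re-splits that rebuilt record, while c && c-- is a compact counter meaning "print this record and keep going until c reaches zero". A reduced, self-contained sketch (two numbers per line instead of nine):

```shell
#!/bin/sh
printf 'measured 10.8 0.1 0.2\n0.3 0.4\nunrelated\n' |
awk '
BEGIN { OFS = "\n" }                 # rebuilt records are joined with newlines
/measured/ { c=2; $1=$2=""; $0=$0 }  # open a 2-line window; drop the label and 10.8
c && c--   { $1=$1; print }          # rebuild with OFS and print while the window is open
'
```

This prints 0.1, 0.2, 0.3, 0.4 on separate lines and suppresses the unrelated line.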
$ grep -A1 measured file | tr -s ' ' \\n | tail -n+4
0.0000
0.0000
0.0236
0.0304
0.0383
0.0433
0.0437
0.0442
0.0452
0.0455
0.0448
0.0440
0.0423
0.0386
0.0344
0.0274
0.0000
0.0000
with awk
$ awk -v OFS='\n' '/measured/ {p=1; for(i=3;i<=NF;i++) print $i; next}
p {$1=$1; print; exit}' file
If the number of fields is guaranteed to be as in the example (nine numbers ending each line), you can use the following command:
awk '{for(i=NF-8;i<=NF;i++){print $i}}' input.file
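The NF-8 trick works because both lines end with the numbers of interest: the first line has nine numbers after measured 10.8, and the continuation line holds nine numbers and nothing else, so "print the last nine fields" is right in both cases. A scaled-down check using the last two fields:

```shell
#!/bin/sh
# first line: label + extra field + 2 numbers; second line: 2 numbers only
printf 'measured 10.8 1 2\n3 4\n' |
awk '{ for (i = NF-1; i <= NF; i++) print $i }'   # last 2 fields of every line
```

This prints 1, 2, 3, 4, each on its own line.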
The GNU implementation of Awk allows an arbitrary regular expression as the RS record separator. If the keyword measured occurs before each batch of numbers, we can use that keyword as the separator:
$ gawk 'BEGIN { RS = "measured" } { for (i = 1; i <= NF ; i++) print "field " i " = " $i }'
measured 10.8 0.0000 0.0000 0.0236 0.0304 0.0383 0.0433 0.0437 0.0442 0.0452
0.0455 0.0448 0.0440 0.0423 0.0386 0.0344 0.0274 0.0000 0.000
field 1 = 10.8
field 2 = 0.0000
field 3 = 0.0000
field 4 = 0.0236
field 5 = 0.0304
field 6 = 0.0383
field 7 = 0.0433
field 8 = 0.0437
field 9 = 0.0442
field 10 = 0.0452
field 11 = 0.0455
field 12 = 0.0448
field 13 = 0.0440
field 14 = 0.0423
field 15 = 0.0386
field 16 = 0.0344
field 17 = 0.0274
field 18 = 0.0000
field 19 = 0.000
As you can see, all the fields between the measured record separators are parsed out regardless of line breaks. Fields are separated on any mixture of spaces, tabs and newlines.
Note that because measured appears first, we get an empty record. The output you see above is, effectively, from the second record. The first record is the whitespace before measured, which contains no fields.
In other words, the record separator is really treated as a terminator, except that it may be missing after the last record.

How do I insert the filename and header to the beginning of a csv using sed/awk/bash/unix tools?

My file names are like this.
.
├── 10000000.csv
└── 10000001.csv
They contain the following.
100..00.csv
first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
Chae,Jesusa,Cummings,Female,deifier2040#live.com,775-861-8750,911 Hauser Pike,Moline,Georgia,Cameroon,2016-06-29,2016-07-16,36298,2016-07-17,Acer,493.86,14,354.77,Broken,123.68,898.13
100..01.csv
first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
Fleta,Rosette,Hurley,Other,tobacconist1857#outlook.com,1-952-701-1210,35 Freelon Arcade,Beaverton,Rhode Island,Cayman Islands,2009-06-08,2009-06-29,39684,2009-07-01,NVIDIA GeForce GTX 980,474.31,16,395.79,Broken,157.53,1088.04
Bennett,Dennis,George,Male,dona1910#live.com,(980) 033-4131,505 Robert C Levy Arcade,Wellington,Louisiana,Mexico,2019-05-09,2019-05-19,37938,2019-05-21,8GB,187.67,16,205.77,Service,170.21,1007.85
Tommye,Pamula,Diaz,Other,dovelet1967#live.com,204.950.4445,1001 Canby Boulevard,Edinburg,Massachusetts,Gambia,2004-05-02,2004-05-24,31364,2004-05-26,Lenovo,137.21,13,193.63,Replacement,246.43,934.31
Albert,Jerrold,Cohen,Other,bolio2036#live.com,+1-(122)-459-8491,1181 Baden Avenue,Menomonee Falls,Texas,Tajikistan,2019-08-03,2019-08-12,37768,2019-08-15,Intel® Iris™ Graphics 6100,396.46,17,223.02,Service,118.53,960.27
Louetta,Collene,Best,Fluid,dinner1922#live.com,1-506-051-7050,923 Barry Viaduct,Laurel,Illinois,St. Barthélemy,2009-03-02,2009-03-06,39557,2009-03-07,AMD Radeon R9 M395X,133.9,11,198.49,Fix,178.54,1055.32
Kandace,Wesley,Diaz,Female,closterium1820#yahoo.com,+1-(777)-098-5414,341 Garlington Run,Santa Maria,New Jersey,Mexico,2005-10-09,2005-10-10,30543,2005-10-14,Samsung,590.29,5,354.85,Service,292.56,1032.22
I would like to get the following.
customer,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000000,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000000,Chae,Jesusa,Cummings,Female,deifier2040#live.com,775-861-8750,911 Hauser Pike,Moline,Georgia,Cameroon,2016-06-29,2016-07-16,36298,2016-07-17,Acer,493.86,14,354.77,Broken,123.68,898.13
customer,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000001,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000001,Fleta,Rosette,Hurley,Other,tobacconist1857#outlook.com,1-952-701-1210,35 Freelon Arcade,Beaverton,Rhode Island,Cayman Islands,2009-06-08,2009-06-29,39684,2009-07-01,NVIDIA GeForce GTX 980,474.31,16,395.79,Broken,157.53,1088.04
10000001,Bennett,Dennis,George,Male,dona1910#live.com,(980) 033-4131,505 Robert C Levy Arcade,Wellington,Louisiana,Mexico,2019-05-09,2019-05-19,37938,2019-05-21,8GB,187.67,16,205.77,Service,170.21,1007.85
10000001,Tommye,Pamula,Diaz,Other,dovelet1967#live.com,204.950.4445,1001 Canby Boulevard,Edinburg,Massachusetts,Gambia,2004-05-02,2004-05-24,31364,2004-05-26,Lenovo,137.21,13,193.63,Replacement,246.43,934.31
10000001,Albert,Jerrold,Cohen,Other,bolio2036#live.com,+1-(122)-459-8491,1181 Baden Avenue,Menomonee Falls,Texas,Tajikistan,2019-08-03,2019-08-12,37768,2019-08-15,Intel® Iris™ Graphics 6100,396.46,17,223.02,Service,118.53,960.27
10000001,Louetta,Collene,Best,Fluid,dinner1922#live.com,1-506-051-7050,923 Barry Viaduct,Laurel,Illinois,St. Barthélemy,2009-03-02,2009-03-06,39557,2009-03-07,AMD Radeon R9 M395X,133.9,11,198.49,Fix,178.54,1055.32
10000001,Kandace,Wesley,Diaz,Female,closterium1820#yahoo.com,+1-(777)-098-5414,341 Garlington Run,Santa Maria,New Jersey,Mexico,2005-10-09,2005-10-10,30543,2005-10-14,Samsung,590.29,5,354.85,Service,292.56,1032.22
I have done this to get it right for the data. But I can't figure out how to do it for the header.
awk ' gsub(".csv","",FILENAME); {print FILENAME (NF?",":"") $0}' *.csv
first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000000,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000000,Chae,Jesusa,Cummings,Female,deifier2040#live.com,775-861-8750,911 Hauser Pike,Moline,Georgia,Cameroon,2016-06-29,2016-07-16,36298,2016-07-17,Acer,493.86,14,354.77,Broken,123.68,898.13
first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000001,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000001,Fleta,Rosette,Hurley,Other,tobacconist1857#outlook.com,1-952-701-1210,35 Freelon Arcade,Beaverton,Rhode Island,Cayman Islands,2009-06-08,2009-06-29,39684,2009-07-01,NVIDIA GeForce GTX 980,474.31,16,395.79,Broken,157.53,1088.04
10000001,Bennett,Dennis,George,Male,dona1910#live.com,(980) 033-4131,505 Robert C Levy Arcade,Wellington,Louisiana,Mexico,2019-05-09,2019-05-19,37938,2019-05-21,8GB,187.67,16,205.77,Service,170.21,1007.85
10000001,Tommye,Pamula,Diaz,Other,dovelet1967#live.com,204.950.4445,1001 Canby Boulevard,Edinburg,Massachusetts,Gambia,2004-05-02,2004-05-24,31364,2004-05-26,Lenovo,137.21,13,193.63,Replacement,246.43,934.31
10000001,Albert,Jerrold,Cohen,Other,bolio2036#live.com,+1-(122)-459-8491,1181 Baden Avenue,Menomonee Falls,Texas,Tajikistan,2019-08-03,2019-08-12,37768,2019-08-15,Intel® Iris™ Graphics 6100,396.46,17,223.02,Service,118.53,960.27
10000001,Louetta,Collene,Best,Fluid,dinner1922#live.com,1-506-051-7050,923 Barry Viaduct,Laurel,Illinois,St. Barthélemy,2009-03-02,2009-03-06,39557,2009-03-07,AMD Radeon R9 M395X,133.9,11,198.49,Fix,178.54,1055.32
10000001,Kandace,Wesley,Diaz,Female,closterium1820#yahoo.com,+1-(777)-098-5414,341 Garlington Run,Santa Maria,New Jersey,Mexico,2005-10-09,2005-10-10,30543,2005-10-14,Samsung,590.29,5,354.85,Service,292.56,1032.22
But if I do this with sed, I can do the header for the first file alone, but not the data OR the remaining files.
sed '1s/.*/customer,&/' *.csv
customer,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
Chae,Jesusa,Cummings,Female,deifier2040#live.com,775-861-8750,911 Hauser Pike,Moline,Georgia,Cameroon,2016-06-29,2016-07-16,36298,2016-07-17,Acer,493.86,14,354.77,Broken,123.68,898.13
first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
Fleta,Rosette,Hurley,Other,tobacconist1857#outlook.com,1-952-701-1210,35 Freelon Arcade,Beaverton,Rhode Island,Cayman Islands,2009-06-08,2009-06-29,39684,2009-07-01,NVIDIA GeForce GTX 980,474.31,16,395.79,Broken,157.53,1088.04
Bennett,Dennis,George,Male,dona1910#live.com,(980) 033-4131,505 Robert C Levy Arcade,Wellington,Louisiana,Mexico,2019-05-09,2019-05-19,37938,2019-05-21,8GB,187.67,16,205.77,Service,170.21,1007.85
Tommye,Pamula,Diaz,Other,dovelet1967#live.com,204.950.4445,1001 Canby Boulevard,Edinburg,Massachusetts,Gambia,2004-05-02,2004-05-24,31364,2004-05-26,Lenovo,137.21,13,193.63,Replacement,246.43,934.31
Albert,Jerrold,Cohen,Other,bolio2036#live.com,+1-(122)-459-8491,1181 Baden Avenue,Menomonee Falls,Texas,Tajikistan,2019-08-03,2019-08-12,37768,2019-08-15,Intel® Iris™ Graphics 6100,396.46,17,223.02,Service,118.53,960.27
Louetta,Collene,Best,Fluid,dinner1922#live.com,1-506-051-7050,923 Barry Viaduct,Laurel,Illinois,St. Barthélemy,2009-03-02,2009-03-06,39557,2009-03-07,AMD Radeon R9 M395X,133.9,11,198.49,Fix,178.54,1055.32
Kandace,Wesley,Diaz,Female,closterium1820#yahoo.com,+1-(777)-098-5414,341 Garlington Run,Santa Maria,New Jersey,Mexico,2005-10-09,2005-10-10,30543,2005-10-14,Samsung,590.29,5,354.85,Service,292.56,1032.22
I am open to any of the linux tools available. I would like to keep python as a secondary tool. This is just an illustration. I have a billion files.
I would also like to replace the files in place. I know I can do this with -i for sed, but I don't know how for awk.
$ awk -v OFS=',' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}' 10000000.csv
customer,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000000,Chae,Jesusa,Cummings,Female,deifier2040#live.com,775-861-8750,911 Hauser Pike,Moline,Georgia,Cameroon,2016-06-29,2016-07-16,36298,2016-07-17,Acer,493.86,14,354.77,Broken,123.68,898.13
so with any awk you could do:
for file in *.csv; do
awk 'script' "$file" > tmp && mv tmp "$file"
done
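Putting the per-file loop together with the awk script from above (a sketch run in a scratch directory; tmp.$$ is just a throwaway temporary name):

```shell
#!/bin/sh
tmp=$(mktemp -d); cd "$tmp" || exit 1
printf 'h1,h2\na,b\n' > 10000000.csv   # tiny stand-in csv

for file in *.csv; do
  awk -v OFS=',' '
    FNR==1 { cust = FILENAME; sub(/\.csv$/, "", cust) }  # customer id from the file name
    { print (FNR>1 ? cust : "customer"), $0 }            # prefix header row vs data rows
  ' "$file" > tmp.$$ && mv tmp.$$ "$file"
done

cat 10000000.csv
cd / && rm -rf "$tmp"
```

The rewritten file reads customer,h1,h2 on the first line and 10000000,a,b on the second.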
or with GNU awk for "inplace" editing:
$ tail -n +1 10000000.csv 10000001.csv
==> 10000000.csv <==
first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
Chae,Jesusa,Cummings,Female,deifier2040#live.com,775-861-8750,911 Hauser Pike,Moline,Georgia,Cameroon,2016-06-29,2016-07-16,36298,2016-07-17,Acer,493.86,14,354.77,Broken,123.68,898.13
==> 10000001.csv <==
first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
Fleta,Rosette,Hurley,Other,tobacconist1857#outlook.com,1-952-701-1210,35 Freelon Arcade,Beaverton,Rhode Island,Cayman Islands,2009-06-08,2009-06-29,39684,2009-07-01,NVIDIA GeForce GTX 980,474.31,16,395.79,Broken,157.53,1088.04
Bennett,Dennis,George,Male,dona1910#live.com,(980) 033-4131,505 Robert C Levy Arcade,Wellington,Louisiana,Mexico,2019-05-09,2019-05-19,37938,2019-05-21,8GB,187.67,16,205.77,Service,170.21,1007.85
Tommye,Pamula,Diaz,Other,dovelet1967#live.com,204.950.4445,1001 Canby Boulevard,Edinburg,Massachusetts,Gambia,2004-05-02,2004-05-24,31364,2004-05-26,Lenovo,137.21,13,193.63,Replacement,246.43,934.31
Albert,Jerrold,Cohen,Other,bolio2036#live.com,+1-(122)-459-8491,1181 Baden Avenue,Menomonee Falls,Texas,Tajikistan,2019-08-03,2019-08-12,37768,2019-08-15,Intel® Iris™ Graphics 6100,396.46,17,223.02,Service,118.53,960.27
Louetta,Collene,Best,Fluid,dinner1922#live.com,1-506-051-7050,923 Barry Viaduct,Laurel,Illinois,St. Barthélemy,2009-03-02,2009-03-06,39557,2009-03-07,AMD Radeon R9 M395X,133.9,11,198.49,Fix,178.54,1055.32
Kandace,Wesley,Diaz,Female,closterium1820#yahoo.com,+1-(777)-098-5414,341 Garlington Run,Santa Maria,New Jersey,Mexico,2005-10-09,2005-10-10,30543,2005-10-14,Samsung,590.29,5,354.85,Service,292.56,1032.22
$ awk -i inplace -v OFS=',' 'FNR==1{cust=FILENAME; sub(/\.csv$/,"",cust)} {print (FNR>1 ? cust : "customer"), $0}' 10000000.csv 10000001.csv
$ tail -n +1 10000000.csv 10000001.csv
==> 10000000.csv <==
customer,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000000,Chae,Jesusa,Cummings,Female,deifier2040#live.com,775-861-8750,911 Hauser Pike,Moline,Georgia,Cameroon,2016-06-29,2016-07-16,36298,2016-07-17,Acer,493.86,14,354.77,Broken,123.68,898.13
==> 10000001.csv <==
customer,first_name,middle_name,last_name,gender,email,phone_number,address,city,state,country,date_order_start,date_order_complete,invoice_number,invoice_date,item,item_price,quantity,cost,job_name,job_price,total_cost
10000001,Fleta,Rosette,Hurley,Other,tobacconist1857#outlook.com,1-952-701-1210,35 Freelon Arcade,Beaverton,Rhode Island,Cayman Islands,2009-06-08,2009-06-29,39684,2009-07-01,NVIDIA GeForce GTX 980,474.31,16,395.79,Broken,157.53,1088.04
10000001,Bennett,Dennis,George,Male,dona1910#live.com,(980) 033-4131,505 Robert C Levy Arcade,Wellington,Louisiana,Mexico,2019-05-09,2019-05-19,37938,2019-05-21,8GB,187.67,16,205.77,Service,170.21,1007.85
10000001,Tommye,Pamula,Diaz,Other,dovelet1967#live.com,204.950.4445,1001 Canby Boulevard,Edinburg,Massachusetts,Gambia,2004-05-02,2004-05-24,31364,2004-05-26,Lenovo,137.21,13,193.63,Replacement,246.43,934.31
10000001,Albert,Jerrold,Cohen,Other,bolio2036#live.com,+1-(122)-459-8491,1181 Baden Avenue,Menomonee Falls,Texas,Tajikistan,2019-08-03,2019-08-12,37768,2019-08-15,Intel® Iris™ Graphics 6100,396.46,17,223.02,Service,118.53,960.27
10000001,Louetta,Collene,Best,Fluid,dinner1922#live.com,1-506-051-7050,923 Barry Viaduct,Laurel,Illinois,St. Barthélemy,2009-03-02,2009-03-06,39557,2009-03-07,AMD Radeon R9 M395X,133.9,11,198.49,Fix,178.54,1055.32
10000001,Kandace,Wesley,Diaz,Female,closterium1820#yahoo.com,+1-(777)-098-5414,341 Garlington Run,Santa Maria,New Jersey,Mexico,2005-10-09,2005-10-10,30543,2005-10-14,Samsung,590.29,5,354.85,Service,292.56,1032.22
Assuming there are no nasty filenames, something like this works:
find . -name '*.csv' -printf "%f\n" |
sed 's/\.csv$//' |
xargs -I{} sed -i '1s/^/customer,/; 1!s/^/{},/' {}.csv
First I find all the csv files and print only the filename.
Then remove the .csv suffix from files.
Then I run xargs, so for each file...
...add to the beginning of the line customer, if it's the first line, otherwise I add the filename without .csv to the beginning of the line.
Another solution with awk:
find . -name \*.csv -exec awk '{ gsub(".csv","",FILENAME); if (NR == 1) {print "customer," $0; next; } else print FILENAME (NF?",":"") $0}' {} \;
if you want a blank line before next file:
find . -name \*.csv -exec awk '{ gsub(".csv","",FILENAME); if (NR == 1) {print "customer," $0; next; } else print FILENAME (NF?",":"") $0} END { print ""; }' {} \;
This finds files with the csv extension and executes the awk script on each.
For a single file you can do
awk '(FNR==1){f=FILENAME+0; print "customer," $0; next}
{print f","$0 }' input.csv > output.csv
For all CSV files in one go
awk '(FNR==1){f=FILENAME+0; print "customer," $0; next}
{print f","$0 > (FILENAME".new") }' *.csv
Try sed -i '1 i\anything' file.
sed is the stream editor
-i edits the file in place
1 addresses line 1, i.e. the header
i\ means insert the given text before that line
file is the filename, or a matching pattern for your files.
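As a quick check of the i\ command: the portable spelling puts the inserted text on its own line after the backslash (GNU sed also accepts the one-line 1i text form). Note this inserts a whole new line before line 1; it does not prefix the existing header:

```shell
#!/bin/sh
# insert the line "customer," before line 1 of the stream
printf 'h1,h2\na,b\n' | sed '1i\
customer,'
```

This prints customer, followed by the two original lines.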

How to take the 5 letters on each side of one specific letter, for each section

I have a data like this
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK
I am trying to find the 5 letters to the left and the 5 letters to the right of each F in each section, and then count the number of E and D characters in each of those windows.
A representative output looks like the below:
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RQCSWFAGCTN 0 0
LLYQLFRNLFC 0 0
LFRNLFCSYGL 0 0
NNSGLFFLCGN 0 0
NSGLFFLCGNG 0 0
GVYKGFPPKWS 0 0
TNLRSFIHKVT 0 0
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
GQKITFHGEGD 1 1
KDHAVFTRRGE 1 1
RGEDLFMCMDI 1 2
EALCGFQKPIS 1 0
RLIIEFKVNFP 1 0
EFKVNFPENGF 2 0
FPENGFLSPDK 1 0
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
ATTPQFPDMIL 0 1
RGHSHFVSDVV 0 1
SSDGQFALSGS 0 1
TTTRRFVGHTK 0 0
VLSVAFSSDNR 0 1
VSCVRFSPNSS 0 0
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
VLIKLFCVHTK 0 0
DVQIRFQPQL 0 1
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
LLLLAFLLLPR 0 0
KRCGGFLIRDD 0 2
LIRDDFVLTAA 0 2
EPTQQFIPVKR 1 0
YNPKNFSNDIM 0 1
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
GAEEKFKEIAE 4 0
RKREIFDRYGE 2 1
ANGTSFSYTFH 0 0
SFSYTFHGDPH 0 1
DPHAMFAEFFG 0 1
AMFAEFFGGRN 1 0
MFAEFFGGRNP 1 0
GGRNPFDTFFG 0 1
NPFDTFFGQRN 0 1
PFDTFFGQRNG 0 1
DIDDPFSGFPM 0 3
DPFSGFPMGMG 0 1
MGMGGFTNVNF 0 0
FTNVNFGRSRS 0 0
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
DGYVKFWQIYI 0 1
LSCLLFCDNHK 0 1
DPDVPFWRFLI 0 2
VPFWRFLITGA 0 0
LQTIRFSPDIF 0 1
FSPDIFSSVSV 0 1
EGHACFSSISE 0 0
SSISEFLLTHP 1 0
HPVLSFGIQVV 0 0
VLIKLFCVHTK 0 0
DVQIRFQPQLN 0 1
TAHEDFTFGES 2 1
HEDFTFGESRP 2 1
PAPADFLSLSS 0 1
MTPDAFMTPSA 0 1
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
ISLGIFPLPAG 0 0
SEQWKFQELSQ 2 0
EENEGFVKVTD 3 1
IENKAFDRNTE 2 1
NTESLFEELSS 3 0
AQRKRFTRVEM 1 0
SSIWQFFSRLF 0 0
SIWQFFSRLFS 0 0
FFSRLFSSSSN 0 0
At the beginning I thought of finding the 5 letters to the left and right of each F,
but I could not figure out how to do that.
In awk:
$ awk '
NR%2 {print; next } # print every odd record
{ # the even records are processed
while(match($0,/.{5}F.{0,5}/)) { # get 5 before and upto 5 after F
# 5 before F ^^^ ^^^ 0-5 chars after F
# change to /.{0,5}F.{0,5}/ if needed
print s=substr($0,RSTART,RLENGTH), # print match
gsub(/E/,"E",s), # count of Es
gsub(/D/,"D",s) # count of Ds
$0=substr($0,RSTART+1) # shorten the search string
}
}' file
Some output:
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RQCSWFAGCTN 0 0
LLYQLFRNLFC 0 0 # notice another F in the 5+F+5 window
LFRNLFCSYGL 0 0 # .. getting handled
NNSGLFFLCGN 0 0
NSGLFFLCGNG 0 0
GVYKGFPPKWS 0 0
TNLRSFIHKVT 0 0
...
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
VLIKLFCVHTK 0 0
DVQIRFQPQL 0 1 # ...F{0,5}
Check out this Perl solution
perl -lne ' if (/^[^>]/) { while(/(?<=(\w{5}))F(?=(\w{5}))/g) { $x="$1F$2";$e=()=$x=~/E/g; $d=()=$x=~/D/g; print "$x $e $d" } } else { print }' 5letter.txt
EDIT1:
DVQIRFQPQL 0 1 # Edge case
To accommodate the edge case the OP mentioned in the comments (the string to the right of F can be shorter than 5 letters when F is near the end of the line):
perl -lne ' if (/^[^>]/) { while(/(?<=(.{5}))F(?=(.{0,5}))/g) { $x="$1F$2";$e=()=$x=~/E/g; $d=()=$x=~/D/g; print "$x $e $d" } } else { print }' 5letter.txt
For each line that doesn't start with >, use a positive lookbehind to capture the 5 characters to the left of F and a positive lookahead to capture up to 5 characters to the right of F. A while loop with /g on the match operator scans the line; for each match, "$1F$2" is copied into the variable $x, the occurrences of E and D are counted in list context, and the result is printed.
With the given input:
$ perl -lne ' if (/^[^>]/) { while(/(?<=(.{5}))F(?=(.{0,5}))/g) { $x="$1F$2";$e=()=$x=~/E/g; $d=()=$x=~/D/g; print "$x $e $d" } } else { print }' 5letter.txt
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RQCSWFAGCTN 0 0
LLYQLFRNLFC 0 0
LFRNLFCSYGL 0 0
NNSGLFFLCGN 0 0
NSGLFFLCGNG 0 0
GVYKGFPPKWS 0 0
TNLRSFIHKVT 0 0
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
GQKITFHGEGD 1 1
KDHAVFTRRGE 1 1
RGEDLFMCMDI 1 2
EALCGFQKPIS 1 0
RLIIEFKVNFP 1 0
EFKVNFPENGF 2 0
FPENGFLSPDK 1 1
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
ATTPQFPDMIL 0 1
RGHSHFVSDVV 0 1
SSDGQFALSGS 0 1
TTTRRFVGHTK 0 0
VLSVAFSSDNR 0 1
VSCVRFSPNSS 0 0
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
VLIKLFCVHTK 0 0
DVQIRFQPQL 0 1
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
LLLLAFLLLPR 0 0
KRCGGFLIRDD 0 2
LIRDDFVLTAA 0 2
EPTQQFIPVKR 1 0
YNPKNFSNDIM 0 1
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
GAEEKFKEIAE 4 0
RKREIFDRYGE 2 1
ANGTSFSYTFH 0 0
SFSYTFHGDPH 0 1
DPHAMFAEFFG 1 1
AMFAEFFGGRN 1 0
MFAEFFGGRNP 1 0
GGRNPFDTFFG 0 1
NPFDTFFGQRN 0 1
PFDTFFGQRNG 0 1
DIDDPFSGFPM 0 3
DPFSGFPMGMG 0 1
MGMGGFTNVNF 0 0
FTNVNFGRSRS 0 0
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
DGYVKFWQIYI 0 1
LSCLLFCDNHK 0 1
DPDVPFWRFLI 0 2
VPFWRFLITGA 0 0
LQTIRFSPDIF 0 1
FSPDIFSSVSV 0 1
EGHACFSSISE 2 0
SSISEFLLTHP 1 0
HPVLSFGIQVV 0 0
VLIKLFCVHTK 0 0
DVQIRFQPQLN 0 1
TAHEDFTFGES 2 1
HEDFTFGESRP 2 1
PAPADFLSLSS 0 1
MTPDAFMTPSA 0 1
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
ISLGIFPLPAG 0 0
SEQWKFQELSQ 2 0
EENEGFVKVTD 3 1
IENKAFDRNTE 2 1
NTESLFEELSS 3 0
AQRKRFTRVEM 1 0
SSIWQFFSRLF 0 0
SIWQFFSRLFS 0 0
FFSRLFSSSSN 0 0
$
PS:
$ echo "RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLT
RYLTLNASQITNLRSFIHKVTPHR" | perl -ne ' if (/^[^>]/) { $y=$_;while(/(.{10})/g) { $x=$1; $c++ for($x=~/F/g) ; print "$x $c\n"; $c=0 } } '
RNDDDDTSVC
LGTRQCSWFA 1
GCTNRTWNSS 0
AVPLIGLPNT 0
QDYKWVDRNS 0
GLTWSGNDTC 0
LYSCQNQTKG 0
LLYQLFRNLF 2
CSYGLTEAHG 0
KWRCADASIT 0
NDKGHDGHRT 0
PTWWLTGSNL 0
TLSVNNSGLF 1
FLCGNGVYKG 1
FPPKWSGRCG 1
LGYLVPSLT 0
RYLTLNASQI 0
TNLRSFIHKV 1
$
With any awk in any shell on any UNIX box:
$ cat tst.awk
/^>/ { print; next }
{
fpos = 0
while ( match(substr($0,fpos+1),/F/) ) {
fpos += RSTART
str = substr($0,fpos-5,11)
print str, gsub(/E/,"&",str), gsub(/D/,"&",str)
}
}
$ awk -f tst.awk file
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RQCSWFAGCTN 0 0
LLYQLFRNLFC 0 0
LFRNLFCSYGL 0 0
NNSGLFFLCGN 0 0
NSGLFFLCGNG 0 0
GVYKGFPPKWS 0 0
TNLRSFIHKVT 0 0
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
GQKITFHGEGD 1 1
KDHAVFTRRGE 1 1
RGEDLFMCMDI 1 2
EALCGFQKPIS 1 0
RLIIEFKVNFP 1 0
EFKVNFPENGF 2 0
FPENGFLSPDK 1 1
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
ATTPQFPDMIL 0 1
RGHSHFVSDVV 0 1
SSDGQFALSGS 0 1
TTTRRFVGHTK 0 0
VLSVAFSSDNR 0 1
VSCVRFSPNSS 0 0
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
VLIKLFCVHTK 0 0
DVQIRFQPQL 0 1
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
LLLLAFLLLPR 0 0
KRCGGFLIRDD 0 2
LIRDDFVLTAA 0 2
EPTQQFIPVKR 1 0
YNPKNFSNDIM 0 1
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
GAEEKFKEIAE 4 0
RKREIFDRYGE 2 1
ANGTSFSYTFH 0 0
SFSYTFHGDPH 0 1
DPHAMFAEFFG 1 1
AMFAEFFGGRN 1 0
MFAEFFGGRNP 1 0
GGRNPFDTFFG 0 1
NPFDTFFGQRN 0 1
PFDTFFGQRNG 0 1
DIDDPFSGFPM 0 3
DPFSGFPMGMG 0 1
MGMGGFTNVNF 0 0
FTNVNFGRSRS 0 0
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
DGYVKFWQIYI 0 1
LSCLLFCDNHK 0 1
DPDVPFWRFLI 0 2
VPFWRFLITGA 0 0
LQTIRFSPDIF 0 1
FSPDIFSSVSV 0 1
EGHACFSSISE 2 0
SSISEFLLTHP 1 0
HPVLSFGIQVV 0 0
VLIKLFCVHTK 0 0
DVQIRFQPQLN 0 1
TAHEDFTFGES 2 1
HEDFTFGESRP 2 1
PAPADFLSLSS 0 1
MTPDAFMTPSA 0 1
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
ISLGIFPLPAG 0 0
SEQWKFQELSQ 2 0
EENEGFVKVTD 3 1
IENKAFDRNTE 2 1
NTESLFEELSS 3 0
AQRKRFTRVEM 1 0
SSIWQFFSRLF 0 0
SIWQFFSRLFS 0 0
FFSRLFSSSSN 0 0
This is an adaptation of James Brown's excellent solution. The adaptation makes it POSIX compliant, and also corrects the case where two F's are fewer than 5 characters apart. Example:
...RQCSWFAGFCTNRQS...
^ ^
The first window should detect RQCSWFAGFCT while the second window should detect SWFAGFCTNRQ. The proposed solution will only detect AGFCTNRQ, or it might not detect the second F at all (depending on whether .{0,5}F.{0,5} or .{5}F.{0,5} is used as the regex).
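To check that windowing in isolation, here is a self-contained sketch of the same position-tracking loop on the example string. The XX padding is made up, purely to give the first F five characters of left context, and the start index is clamped explicitly rather than relying on substr() behavior for start positions below 1:

```shell
echo "XXRQCSWFAGFCTNRQSXX" | awk '{
  seq = $0; l = lseq = length($0)
  while (match(seq, /F/)) {
    n = l - lseq + RSTART                      # position of this F in $0
    print substr($0, (n < 6 ? 1 : n - 5), (n < 6 ? n + 5 : 11))
    seq  = substr(seq, RSTART + 1)             # keep scanning after this F
    lseq = lseq - RSTART
  }
}'
# RQCSWFAGFCT
# SWFAGFCTNRQ
```

Both overlapping windows are reported, one per F.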
awk '
NR%2 {print; next } # print every odd record
{ # the even records are processed
seq=$0; l=lseq=length($0)
while(match(seq,/F/)) { # find `F`
n = l - lseq + RSTART # get position in $0
print s=substr($0,n-5,(n<6?n+5:11)), # print match and
# correct if F is in the first 5
gsub(/E/,"E",s), # count of Es
gsub(/D/,"D",s) # count of Ds
seq=substr(seq,RSTART+1) # shorten the search string
lseq=lseq-RSTART
}
}' file.fasta
You might also be interested in BioAwk, an adapted version of awk tuned to process FASTA files.
The BioAwk equivalent is:
bioawk -c fastx '{ print ">" $name }
{ tmp=$seq; l=ltmp=length($seq)
while(match(tmp,/F/)) {
n=l-ltmp+RSTART
print s=substr($seq,n-5,(n<6?n+5:11)),
gsub(/E/,"E",s),
gsub(/D/,"D",s)
tmp=substr(tmp,RSTART+1)
ltmp=ltmp-RSTART
}}' file.fasta
Here $name is the sequence name (everything after >), and $seq is the complete sequence, even if you have a sequence spanning multiple lines.
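If bioawk is not installed, a multi-line FASTA can first be flattened to one sequence line per record with plain awk, after which any of the two-line solutions above apply. A sketch; it assumes every header line starts with >:

```shell
awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
     { seq = seq $0 }                      # concatenate wrapped sequence lines
     END { if (seq != "") print seq }' file.fasta
```

This keeps headers untouched and joins the wrapped sequence lines of each record into a single line.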
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Alfred Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is POSIX compatible.

Use awk with two different delimiters to split and select columns

How can I tell gawk to use two different delimiters so that I can separate some columns, but select others using the tab-delimited format of my file?
> cat broad_snps.tab
chrsnpID rsID freq_bin snp_maf gene_count dist_nearest_gene_snpsnap dist_nearest_gene_snpsnap_protein_coding dist_nearest_gene dist_nearest_gene_located_within loci_upstream loci_downstream ID_nearest_gene_snpsnap ID_nearest_gene_snpsnap_protein_coding ID_nearest_gene ID_nearest_gene_located_within HGNC_nearest_gene_snpsnap HGNC_nearest_gene_snpsnap_protein_coding flag_snp_within_gene flag_snp_within_gene_protein_coding ID_genes_in_matched_locus friends_ld01 friends_ld02 friends_ld03 friends_ld04 friends_ld05 friends_ld06 friends_ld07 friends_ld08 friends_ld09 -1
10:10001753 10:10001753 7 0.07455 0 98932.0 1045506.0 98932.0 inf 9986766 10039928 ENSG00000224788 ENSG00000048740 ENSG00000224788 CELF2 False False 253.0 103.0 55.0 40.0 35.0 33.031.0 20.0 0.0 -1
10:10001794 10:10001794 41 0.4105 0 98891.0 1045465.0 98891.0 inf 9964948 10071879 ENSG00000224788 ENSG00000048740 ENSG00000224788 CELF2 False False 365.0 299.0 294.0 266.0 168.0 138.58.0 45.0 0.0 -1
10:100023489 10:100023489 10 0.1054 1 4518.0 4518.0 4518.0 4518.0 100023489 100023489 ENSG00000138131 ENSG00000138131 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 True True ENSG00000138131 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1
10:100025128 10:100025128 45 0.4543 1 2879.0 2879.0 2879.0 2879.0 100025128 100025128 ENSG00000138131 ENSG00000138131 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 True True ENSG00000138131 112.0 70.0 3.0 0.0 0.0
The output I want:
chr10 10001752 10001753 CELF2
chr10 10001793 10001794 CELF2
chr10 100023488 100023489 LOXL4
chr10 100025127 100025128 LOXL4
chr10 10002974 10002975 LOXL4
The command I am currently using:
cat broad_snps.tab | tail -n+2 | gawk -vOFS="\t" -vFS=":" '{ print "chr"$1, ($2 - 1), $2}' | gawk -vOFS="\t" '{print $1, $2, $3}' > broad_SNPs.bed
Returns this:
chr10 10001752 10001753 10
chr10 10001793 10001794 10
chr10 100023488 100023489 10
chr10 100025127 100025128 10
chr10 10002974 10002975 10
chr10 10003391 10003392 10
chr10 100038815 100038816 10
chr10 10008001 10008002 10
chr10 100093012 100093013 10
I'd like to be able to use the ":" delimiter to split up the first column, but I need to use "\t" to pick out the gene ID.
Thanks!
awk -F'[\t:]' '{print $1, $2, $4, $17}'
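Building on that split-on-either-delimiter idea, the whole transformation (chr prefix, start = position - 1, gene symbol) can be done in one pass. A sketch; the field number $17 for the gene symbol is read off the sample rows shown above, so double-check it against the real file:

```shell
tail -n +2 broad_snps.tab |
awk -F'[\t:]' -v OFS='\t' '{ print "chr" $1, $2 - 1, $2, $17 }' > broad_SNPs.bed
```

With -F'[\t:]' the first two colon-separated columns each split into two fields, which is why the gene symbol lands two positions later than in a purely tab-split view.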

AWK, getline - remove the last field

I have a file:
Srednia ocen Wojciech Zieba: 3.83
Srednia ocen Jakub Pelc: 3.76
Srednia ocen Jan Resiak: 3.90
I want to improve the command:
1.
$ awk 'BEGIN{ while( getline $0 < "file") sub($NF,"",$0); print $0}'
Srednia ocen Jan Resiak:
userpc@userpc-desktop:~$
2.
$ awk 'BEGIN{ while( getline a < "file") sub($NF,"", a); print a}'
Srednia ocen Jan Resiak: 3.90
userpc@userpc-desktop:~/Pulpit$
I have to do it with getline; my example is only part of a larger script. I want to remove the last field and print the result:
Srednia ocen Wojciech Zieba:
Srednia ocen Jakub Pelc:
Srednia ocen Jan Resiak:
Thank you for your help.
EDIT:
But this way does not print the remaining fields:
$ awk 'BEGIN{ while( getline a < "file") {sub(/[^:]*$/,"",a); print a, $1, $3}}'
Srednia ocen Wojciech Zieba:
Srednia ocen Jakub Pelc:
Srednia ocen Jan Resiak:
bolek@bolek-desktop:~/Pulpit$
This command prints all the fields:
$ awk 'BEGIN{ while( getline $0 < "file") {sub($NF,"",$0); print $0, $1, $3}}'
Srednia ocen Wojciech Zieba: Srednia Wojciech
Srednia ocen Jakub Pelc: Srednia Jakub
Srednia ocen Jan Resiak: Srednia Jan
bolek@bolek-desktop:~/Pulpit$
How can I improve the first command? I want the first command to also split each line into fields.
EDIT-1:
I want to improve the command:
awk 'BEGIN{ while( getline a < "file") {sub(/[^:]*$/,"",a); print a, $1, $3}}'
I want to print the result:
Srednia ocen Wojciech Zieba: Srednia Wojciech
Srednia ocen Jakub Pelc: Srednia Jakub
Srednia ocen Jan Resiak: Srednia Jan
an awk way:
awk -F':' '$0=$1 ":"' file
test
kent$ echo "Srednia ocen Wojciech Zieba: 3.83
Srednia ocen Jakub Pelc: 3.76
Srednia ocen Jan Resiak: 3.90"|awk -F':' '$0=$1 ":"'
Srednia ocen Wojciech Zieba:
Srednia ocen Jakub Pelc:
Srednia ocen Jan Resiak:
update
I don't know exactly what you want to do, but your while {} block has a problem. Maybe you want this:
kent$ awk 'BEGIN{ while( getline $0 < "file") {sub($NF,"",$0); print $0}}'
Srednia ocen Wojciech Zieba:
Srednia ocen Jakub Pelc:
Srednia ocen Jan Resiak:
If you want to introduce your own variable, say a, sub() on it knows nothing about $NF, since the line was never split into fields. Based on your example, you could do:
kent$ awk 'BEGIN{ while( getline a < "file") {sub(/[^:]*$/,"",a); print a}}'
Srednia ocen Wojciech Zieba:
Srednia ocen Jakub Pelc:
Srednia ocen Jan Resiak:
No need for awk, use cut and sed:
$ cat in.txt
Srednia ocen Wojciech Zieba: 3.83
Srednia ocen Jakub Pelc: 3.76
Srednia ocen Jan Resiak: 3.90
$ cut -d: -f1 in.txt | sed 's/$/:/'
Srednia ocen Wojciech Zieba:
Srednia ocen Jakub Pelc:
Srednia ocen Jan Resiak:
As it is, you read each line into the variable a, but if you omit a, $0 will be set and field splitting will be performed automatically, which allows you to do this:
awk 'BEGIN{ while( getline < "file") { $NF = ""; print $0, $1, $3 } }'
Output:
Srednia ocen Wojciech Zieba: Srednia Wojciech
Srednia ocen Jakub Pelc: Srednia Jakub
Srednia ocen Jan Resiak: Srednia Jan
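One general getline caution worth adding (not specific to this question): getline returns -1 on error and 0 at end of file, so testing the return value against > 0 avoids an infinite loop when the file is missing, and close() releases the descriptor. A variant of the loop above, using NF-- (widely supported, though rebuilding $0 on NF assignment is not guaranteed by every ancient awk) so no trailing separator is left behind:

```shell
awk 'BEGIN {
  while ((getline < "file") > 0) {   # > 0: stop on EOF (0) and on error (-1)
    NF--                             # drop the last field and rebuild $0
    print $0, $1, $3
  }
  close("file")
}'
```

On the question's input this prints the same three lines, e.g. "Srednia ocen Wojciech Zieba: Srednia Wojciech".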