gsub for substituting translations not working - awk

I have a dictionary dict with records separated by ":" and data fields by new lines, for example:
:one
1
:two
2
:three
3
:four
4
Now I want awk to substitute all occurrences of each record in the input
file, e.g.:
onetwotwotwoone
two
threetwoone
four
My first awk script looked like this and works just fine:
BEGIN { RS = ":" ; FS = "\n"}
NR == FNR {
rep[$1] = $2
next
}
{
for (key in rep)
gsub(key,rep[key])
print
}
giving me:
12221
2
321
4
Unfortunately another dict file contains some characters that are special in regular expressions, so I have to escape them in my script. After moving key and rep[key] into variables (which can then be processed to insert the escapes), the script only substitutes the second record in the dict. Why? And how can I solve this?
Here's the current second part of the script:
{
for (key in rep)
orig=key
trans=rep[key]
gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)
gsub(orig,trans)
print
}
All scripts are run by awk -f translate.awk dict input
Thanks in advance!

Your fundamental problem is using strings in regexp and backreference contexts when you don't want them and then trying to escape the metacharacters in your strings to disable the characters that you're enabling by using them in those contexts. If you want strings, use them in string contexts, that's all.
You don't want this:
gsub(regexp,backreference-enabled-string)
You want something more like this:
index(...,string) substr(string)
I think this is what you're trying to do:
$ cat tst.awk
BEGIN { FS = ":" }
NR == FNR {
if ( NR%2 ) {
key = $2
}
else {
rep[key] = $0
}
next
}
{
for ( key in rep ) {
head = ""
tail = $0
while ( start = index(tail,key) ) {
head = head substr(tail,1,start-1) rep[key]
tail = substr(tail,start+length(key))
}
$0 = head tail
}
print
}
$ awk -f tst.awk dict file
12221
2
321
4

Never mind, sorry for asking....
It was just some missing braces...?!
{
for (key in rep)
{
orig=key
trans=rep[key]
gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)
gsub(orig,trans)
}
print
}
works like a charm.
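For anyone wondering why the braces matter: as in C, an awk for without braces controls only the single statement that follows it, so in the original version only orig=key was repeated per key; trans=rep[key] and the two gsub() calls ran just once, after the loop, using whatever value the loop happened to leave in key. A minimal sketch of that scoping rule (made-up data, not from the question):
$ awk 'BEGIN {
    rep["a"] = 1; rep["b"] = 2
    for (key in rep)
        n++                       # only this statement is inside the loop
    print "looped " n " times, but this line runs once"
}'
looped 2 times, but this line runs once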

Related

Issue converting github.com/*/raw/* URLs to raw.githubusercontent.com URLS using AWK

Given the following example URLs:
urls.txt
https://github.com/2RDLive/Pi-Hole/raw/master/Blacklist.txt
https://github.com/34730/asd/raw/master/adaway-export
https://github.com/568475513/secret_domain/raw/master/filter.txt
https://github.com/BlackJack8/iOSAdblockList/raw/master/Regular%20Hosts.txt
https://github.com/CipherOps/MiscHostsFiles/raw/master/MiscAdTrackingHostBlock.txt
https://github.com/DK-255/Pi-hole-list-1/raw/main/Ads-Blocklist
https://github.com/DRSDavidSoft/additional-hosts/raw/master/domains/blacklist/adservers-and-trackers.txt
https://github.com/DRSDavidSoft/additional-hosts/raw/master/domains/blacklist/unwanted-iranian.txt
https://github.com/DandelionSprout/adfilt/raw/master/Alternate%20versions%20Anti-Malware%20List/AntiMalwareHosts.txt
https://github.com/DavidTai780/AdGuard-Home-Private-Rules/raw/master/hosts.txt
https://github.com/DivineEngine/Profiles/raw/master/Quantumult/Filter/Guard/Advertising.list
https://github.com/Hariharann8175/Indicators-of-Compromise-IOC-/raw/master/Ransomware%20URL's
https://github.com/JumbomanXDA/host/raw/main/hosts
https://github.com/Kees1958/W3C_annual_most_used_survey_blocklist/raw/master/EU_US%2Bmost_used_ad_and_tracking_networks
https://github.com/KurzGedanke/kurzBlock/raw/master/kurzBlock.txt
https://github.com/MajkiIT/polish-ads-filter/raw/master/polish-adblock-filters/adblock.txt
https://github.com/MitaZ/Better_Filter/raw/master/Quantumult_X/Filter.list
https://github.com/MrWaste/Ad-BlockList-2019-08-31/raw/master/Pi-Hole%20BackUps/Black%20List/All%20Server%20Black%20List
https://github.com/Neo23x0/signature-base/raw/master/iocs/c2-iocs.txt
https://github.com/Pentanium/ABClientFilters/raw/master/ko/korean.txt
https://github.com/Phentora/AdguardPersonalList/raw/master/blocklist.txt
https://github.com/ShadowWhisperer/BlockLists/raw/master/Lists/Malware
https://github.com/SlashArash/adblockfa/raw/master/adblockfa.txt
https://github.com/SukkaW/Surge/raw/master/List/domainset/reject_sukka.conf
https://github.com/Th3M3/blocklists/raw/master/tracking%26ads.list
https://github.com/TonyRL/blocklist/raw/master/hosts
https://github.com/UnbendableStraw/samsungnosnooping/raw/master/README.md
https://github.com/UnluckyLuke/BlockUnderRadarJunk/raw/master/blockunderradarjunk-list.txt
https://github.com/VernonStow/Filterlist/raw/master/Filterlist.txt
https://github.com/What-Zit-Tooya/Ad-Block/raw/main/Main-Blocklist/Ad-Block-HOSTS.txt
https://github.com/XionKzn/PiHole-Lists/raw/master/PiHole/Blocklist_HOSTS.txt
https://github.com/YanFung/Ads/raw/master/Mobile
https://github.com/Yuki2718/adblock/raw/master/adguard/tracking-plus.txt
https://github.com/Yuki2718/adblock/raw/master/japanese/jp-filters.txt
https://github.com/ZYX2019/host-block-list/raw/master/Custom.txt
https://github.com/abc45628/hosts/raw/master/hosts
https://github.com/aleclee/DNS-Blacklists/raw/master/AdHosts.txt
https://github.com/angelics/pfbng/raw/master/ads/ads-domain-list.txt
https://github.com/blocklistproject/Lists/raw/master/ransomware.txt
https://github.com/cchevy/macedonian-pi-hole-blocklist/raw/master/hosts.txt
https://github.com/craiu/mobiletrackers/raw/master/list.txt
https://github.com/curutpilek12/adguard-custom-list/raw/main/custom
https://github.com/damengzhu/banad/raw/main/jiekouAD.txt
https://github.com/deletescape/noads/raw/master/lists/add-switzerland.txt
https://github.com/doadin/Pi-Hole-Blocklist/raw/main/block.list
https://github.com/dreammjow/MyFilters/raw/main/src/filters.txt
https://github.com/durablenapkin/block/raw/master/streaming.txt
https://github.com/eEIi0A5L/adblock_filter/archive/master.zip
https://github.com/easylist-thailand/easylist-thailand/raw/master/subscription/easylist-thailand.txt
https://github.com/fandagroupofficial/hosts/raw/main/pihole/ads
https://github.com/fandagroupofficial/hosts/raw/main/pihole/log
https://github.com/fandagroupofficial/hosts/raw/main/pihole/trackers
https://github.com/faralai/Pihole-Rules/raw/master/Fara-Popups_Head
https://github.com/faralai/Pihole-Rules/raw/master/Fara-Xiaomi-info
https://github.com/farrokhi/adblock-iran/raw/master/filter.txt
https://github.com/fskreuz/blocklists/raw/dev/domains.txt
https://github.com/ftpmorph/ftprivacy/raw/master/regex-blocklists/smartphone-and-general-ads-analytics-regex-blocklist-ftprivacy.txt
https://github.com/hell-sh/Evil-Domains/raw/master/evil-domains.txt
https://github.com/hosts-file/BulgarianHostsFile/raw/master/bhf.txt
https://github.com/igorskyflyer/ad-void/raw/main/AdVoid.Core.txt
https://github.com/jackrabbit335/UsefulLinuxShellScripts/raw/master/Hosts%20%26%20sourcelist/blacklist.txt
https://github.com/jakdev121/AMS2/raw/master/pi_indo_ads.txt
https://github.com/jakejarvis/ios-trackers/raw/master/blocklist.txt
https://github.com/jasirfayas/jBlocklist/raw/master/domains.lst
https://github.com/javabean/dnsmasq-antispy/raw/master/dnsmasq.ghostery_bugs.conf
https://github.com/javabean/dnsmasq-antispy/raw/master/dnsmasq.zz-extra-servers-manual.conf
https://github.com/jdlingyu/ad-wars/raw/master/hosts
https://github.com/jlonborg/piblacklist/raw/main/blacklist.txt
https://github.com/joaopinto14/PiHole/raw/main/adverts
https://github.com/kang49/kang49regexblacklistproject/raw/main/blacklist
https://github.com/lesong/Surge/raw/main/rule/BanProgramAD.list
https://github.com/lhie1/Rules/raw/master/Auto/REJECT.conf
https://github.com/mayesidevel/PiHoleLists/raw/master/MiscBlocklist
https://github.com/meinhimmel/hosts/raw/master/hosts
https://github.com/mhhakim/pihole-blocklist/raw/master/custom-blocklist.txt
https://github.com/migueldemoura/ublock-umatrix-rulesets/raw/master/Hosts/ads-tracking
https://github.com/minoplhy/filters/raw/main/Resources/blocked.txt
https://github.com/monojp/hosts_merge/raw/master/hosts_blacklist.txt
https://github.com/mtbnunu/ad-blocklist/raw/master/kr-list.txt
https://github.com/mtxadmin/ublock/raw/master/hosts/_telemetry
https://github.com/mullvad/dns-adblock/raw/main/lists/doh/adblock/custom
https://github.com/muxcc/AdsBlockLists/raw/master/aumm.hosts
https://github.com/nimasaj/uBOPa/raw/master/uBOPa.txt
https://github.com/notracking/hosts-blocklists/raw/master/dnscrypt-proxy/dnscrypt-proxy.blacklist.txt
https://github.com/npljy/npljy.github.io/raw/main/blocks/dns.txt
https://github.com/npljy/npljy.github.io/raw/main/blocks/filter.txt
https://github.com/olegwukr/polish-privacy-filters/raw/master/adblock.txt
https://github.com/parseword/nolovia/raw/master/skel/hosts-government-malware.txt
https://github.com/parseword/nolovia/raw/master/skel/hosts-nolovia.txt
https://github.com/pathforwardit/BlockList/raw/main/DomainList
https://github.com/pirat28/IHateTracker/raw/master/iHateTracker.txt
https://github.com/sa-ki13/jmsf/raw/master/japanese_mobile_site_dns_filter.txt
https://github.com/saurane/Turkish-Blocklist/raw/master/Blocklist/domains.txt
https://github.com/scomper/surge-list/raw/master/reject.list
https://github.com/sirsunknight/QuantumultX/raw/master/Filter/Radical-Advertising
https://github.com/smed79/blacklist/raw/master/hosts.txt
https://github.com/soteria-nou/domain-list/archive/master.zip
https://github.com/stamparm/maltrail/raw/master/trails/static/suspicious/pua.txt
https://github.com/sutchan/dnsmasq_ads_filter/raw/main/dnsmasq-ads-filter-list.txt
https://github.com/svetlyobg/svet-custom-domains/raw/master/ads-domains
https://github.com/tomzuu/blacklist-named/raw/master/ad.sites.conf
https://github.com/tomzuu/blacklist-named/raw/master/phishing.sites.conf
https://github.com/tomzuu/blacklist-named/raw/master/pushing.sites.conf
https://github.com/uBlockOrigin/uAssets/raw/master/filters/badware.txt
https://github.com/uBlockOrigin/uAssets/raw/master/filters/filters.txt
https://github.com/uBlockOrigin/uAssets/raw/master/filters/privacy.txt
https://github.com/unchartedsky/adguard-kr/raw/master/adguard-kr.txt
https://github.com/unflac/adFILTER/raw/master/filter.txt
https://github.com/vokins/ad/raw/main/ad.list
https://github.com/willianreis89/ADsBlock/raw/master/list.txt
https://github.com/wrysunny/ad_list/raw/master/adlist.txt
https://github.com/xOS/Config/raw/Her/Surge/RuleSet/Advertising.list
https://github.com/xinggsf/Adblock-Plus-Rule/raw/master/rule.txt
https://github.com/xlimit91/xlimit91-block-list/raw/master/blacklist.txt
https://github.com/xylagbx/ADBLOCK/raw/master/BLOCK/customadblockdomain.txt
https://github.com/ziozzang/adguard/raw/master/filter.txt
https://github.com/zznidar/BAR/raw/master/BAR-list
I'm using this command:
awk 'BEGIN{FS=OFS="/"}{if ($6~/^raw$/){$3="raw.githubusercontent.com"; for(i=0;i<=NF;++i) if (i!=6) {printf("%s%s",$i,(i==NF)?"\n":OFS)}}}' urls.txt
To produce this desired output:
https://raw.githubusercontent.com/2RDLive/Pi-Hole/master/Blacklist.txt
https://raw.githubusercontent.com/34730/asd/master/adaway-export
https://raw.githubusercontent.com/568475513/secret_domain/master/filter.txt
https://raw.githubusercontent.com/BlackJack8/iOSAdblockList/master/Regular%20Hosts.txt
...
But it yields this output:
https://raw.githubusercontent.com/2RDLive/Pi-Hole/raw/master/Blacklist.txt/https://raw.githubusercontent.com/2RDLive/Pi-Hole/master/Blacklist.txt
https://raw.githubusercontent.com/34730/asd/raw/master/adaway-export/https://raw.githubusercontent.com/34730/asd/master/adaway-export
https://raw.githubusercontent.com/568475513/secret_domain/raw/master/filter.txt/https://raw.githubusercontent.com/568475513/secret_domain/master/filter.txt
https://raw.githubusercontent.com/BlackJack8/iOSAdblockList/raw/master/Regular%20Hosts.txt/https://raw.githubusercontent.com/BlackJack8/iOSAdblockList/master/Regular%20Hosts.txt
...
Why is it printing a semblance of the original URL before the correct output?
Here is the above code formatted legibly with gawk -o-:
BEGIN {
    FS = OFS = "/"
}

{
    if ($6 ~ /^raw$/) {
        $3 = "raw.githubusercontent.com"
        for (i = 0; i <= NF; ++i) {
            if (i != 6) {
                printf "%s%s", $i, (i == NF) ? "\n" : OFS
            }
        }
    }
}
Your only real problem is that awk fields, arrays, and strings all start at 1, not 0, so your loop should have started at 1, not 0. As written, the first time through your loop printing $i is printing $0, the whole record.
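A quick way to see that, using one of the URLs from the question:
$ echo 'https://github.com/TonyRL/blocklist/raw/master/hosts' | awk -F'/' '{ print "$0 = " $0; print "$1 = " $1 }'
$0 = https://github.com/TonyRL/blocklist/raw/master/hosts
$1 = https:
With i starting at 0, the first pass of the printf emits the whole (already modified) record followed by a /, which is the "semblance of the original URL" you saw glued in front of each converted line.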
Having said that, I think what you want is the following with a couple of other things tidied up:
$ cat tst.awk
BEGIN { FS=OFS="/" }
sub(/^raw$/,RS,$6) && sub(OFS RS,"") {
$3 = "raw.githubusercontent.com"
print
}
$ awk -f tst.awk urls.txt
https://raw.githubusercontent.com/2RDLive/Pi-Hole/master/Blacklist.txt
https://raw.githubusercontent.com/34730/asd/master/adaway-export
https://raw.githubusercontent.com/568475513/secret_domain/master/filter.txt
https://raw.githubusercontent.com/BlackJack8/iOSAdblockList/master/Regular%20Hosts.txt
https://raw.githubusercontent.com/CipherOps/MiscHostsFiles/master/MiscAdTrackingHostBlock.txt
https://raw.githubusercontent.com/DK-255/Pi-hole-list-1/main/Ads-Blocklist
https://raw.githubusercontent.com/DRSDavidSoft/additional-hosts/master/domains/blacklist/adservers-and-trackers.txt
https://raw.githubusercontent.com/DRSDavidSoft/additional-hosts/master/domains/blacklist/unwanted-iranian.txt
https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate%20versions%20Anti-Malware%20List/AntiMalwareHosts.txt
https://raw.githubusercontent.com/DavidTai780/AdGuard-Home-Private-Rules/master/hosts.txt
https://raw.githubusercontent.com/DivineEngine/Profiles/master/Quantumult/Filter/Guard/Advertising.list
https://raw.githubusercontent.com/Hariharann8175/Indicators-of-Compromise-IOC-/master/Ransomware%20URL's
https://raw.githubusercontent.com/JumbomanXDA/host/main/hosts
https://raw.githubusercontent.com/Kees1958/W3C_annual_most_used_survey_blocklist/master/EU_US%2Bmost_used_ad_and_tracking_networks
https://raw.githubusercontent.com/KurzGedanke/kurzBlock/master/kurzBlock.txt
https://raw.githubusercontent.com/MajkiIT/polish-ads-filter/master/polish-adblock-filters/adblock.txt
https://raw.githubusercontent.com/MitaZ/Better_Filter/master/Quantumult_X/Filter.list
https://raw.githubusercontent.com/MrWaste/Ad-BlockList-2019-08-31/master/Pi-Hole%20BackUps/Black%20List/All%20Server%20Black%20List
https://raw.githubusercontent.com/Neo23x0/signature-base/master/iocs/c2-iocs.txt
https://raw.githubusercontent.com/Pentanium/ABClientFilters/master/ko/korean.txt
https://raw.githubusercontent.com/Phentora/AdguardPersonalList/master/blocklist.txt
https://raw.githubusercontent.com/ShadowWhisperer/BlockLists/master/Lists/Malware
https://raw.githubusercontent.com/SlashArash/adblockfa/master/adblockfa.txt
https://raw.githubusercontent.com/SukkaW/Surge/master/List/domainset/reject_sukka.conf
https://raw.githubusercontent.com/Th3M3/blocklists/master/tracking%26ads.list
https://raw.githubusercontent.com/TonyRL/blocklist/master/hosts
https://raw.githubusercontent.com/UnbendableStraw/samsungnosnooping/master/README.md
https://raw.githubusercontent.com/UnluckyLuke/BlockUnderRadarJunk/master/blockunderradarjunk-list.txt
https://raw.githubusercontent.com/VernonStow/Filterlist/master/Filterlist.txt
https://raw.githubusercontent.com/What-Zit-Tooya/Ad-Block/main/Main-Blocklist/Ad-Block-HOSTS.txt
https://raw.githubusercontent.com/XionKzn/PiHole-Lists/master/PiHole/Blocklist_HOSTS.txt
https://raw.githubusercontent.com/YanFung/Ads/master/Mobile
https://raw.githubusercontent.com/Yuki2718/adblock/master/adguard/tracking-plus.txt
https://raw.githubusercontent.com/Yuki2718/adblock/master/japanese/jp-filters.txt
https://raw.githubusercontent.com/ZYX2019/host-block-list/master/Custom.txt
https://raw.githubusercontent.com/abc45628/hosts/master/hosts
https://raw.githubusercontent.com/aleclee/DNS-Blacklists/master/AdHosts.txt
https://raw.githubusercontent.com/angelics/pfbng/master/ads/ads-domain-list.txt
https://raw.githubusercontent.com/blocklistproject/Lists/master/ransomware.txt
https://raw.githubusercontent.com/cchevy/macedonian-pi-hole-blocklist/master/hosts.txt
https://raw.githubusercontent.com/craiu/mobiletrackers/master/list.txt
https://raw.githubusercontent.com/curutpilek12/adguard-custom-list/main/custom
https://raw.githubusercontent.com/damengzhu/banad/main/jiekouAD.txt
https://raw.githubusercontent.com/deletescape/noads/master/lists/add-switzerland.txt
https://raw.githubusercontent.com/doadin/Pi-Hole-Blocklist/main/block.list
https://raw.githubusercontent.com/dreammjow/MyFilters/main/src/filters.txt
https://raw.githubusercontent.com/durablenapkin/block/master/streaming.txt
https://raw.githubusercontent.com/easylist-thailand/easylist-thailand/master/subscription/easylist-thailand.txt
https://raw.githubusercontent.com/fandagroupofficial/hosts/main/pihole/ads
https://raw.githubusercontent.com/fandagroupofficial/hosts/main/pihole/log
https://raw.githubusercontent.com/fandagroupofficial/hosts/main/pihole/trackers
https://raw.githubusercontent.com/faralai/Pihole-Rules/master/Fara-Popups_Head
https://raw.githubusercontent.com/faralai/Pihole-Rules/master/Fara-Xiaomi-info
https://raw.githubusercontent.com/farrokhi/adblock-iran/master/filter.txt
https://raw.githubusercontent.com/fskreuz/blocklists/dev/domains.txt
https://raw.githubusercontent.com/ftpmorph/ftprivacy/master/regex-blocklists/smartphone-and-general-ads-analytics-regex-blocklist-ftprivacy.txt
https://raw.githubusercontent.com/hell-sh/Evil-Domains/master/evil-domains.txt
https://raw.githubusercontent.com/hosts-file/BulgarianHostsFile/master/bhf.txt
https://raw.githubusercontent.com/igorskyflyer/ad-void/main/AdVoid.Core.txt
https://raw.githubusercontent.com/jackrabbit335/UsefulLinuxShellScripts/master/Hosts%20%26%20sourcelist/blacklist.txt
https://raw.githubusercontent.com/jakdev121/AMS2/master/pi_indo_ads.txt
https://raw.githubusercontent.com/jakejarvis/ios-trackers/master/blocklist.txt
https://raw.githubusercontent.com/jasirfayas/jBlocklist/master/domains.lst
https://raw.githubusercontent.com/javabean/dnsmasq-antispy/master/dnsmasq.ghostery_bugs.conf
https://raw.githubusercontent.com/javabean/dnsmasq-antispy/master/dnsmasq.zz-extra-servers-manual.conf
https://raw.githubusercontent.com/jdlingyu/ad-wars/master/hosts
https://raw.githubusercontent.com/jlonborg/piblacklist/main/blacklist.txt
https://raw.githubusercontent.com/joaopinto14/PiHole/main/adverts
https://raw.githubusercontent.com/kang49/kang49regexblacklistproject/main/blacklist
https://raw.githubusercontent.com/lesong/Surge/main/rule/BanProgramAD.list
https://raw.githubusercontent.com/lhie1/Rules/master/Auto/REJECT.conf
https://raw.githubusercontent.com/mayesidevel/PiHoleLists/master/MiscBlocklist
https://raw.githubusercontent.com/meinhimmel/hosts/master/hosts
https://raw.githubusercontent.com/mhhakim/pihole-blocklist/master/custom-blocklist.txt
https://raw.githubusercontent.com/migueldemoura/ublock-umatrix-rulesets/master/Hosts/ads-tracking
https://raw.githubusercontent.com/minoplhy/filters/main/Resources/blocked.txt
https://raw.githubusercontent.com/monojp/hosts_merge/master/hosts_blacklist.txt
https://raw.githubusercontent.com/mtbnunu/ad-blocklist/master/kr-list.txt
https://raw.githubusercontent.com/mtxadmin/ublock/master/hosts/_telemetry
https://raw.githubusercontent.com/mullvad/dns-adblock/main/lists/doh/adblock/custom
https://raw.githubusercontent.com/muxcc/AdsBlockLists/master/aumm.hosts
https://raw.githubusercontent.com/nimasaj/uBOPa/master/uBOPa.txt
https://raw.githubusercontent.com/notracking/hosts-blocklists/master/dnscrypt-proxy/dnscrypt-proxy.blacklist.txt
https://raw.githubusercontent.com/npljy/npljy.github.io/main/blocks/dns.txt
https://raw.githubusercontent.com/npljy/npljy.github.io/main/blocks/filter.txt
https://raw.githubusercontent.com/olegwukr/polish-privacy-filters/master/adblock.txt
https://raw.githubusercontent.com/parseword/nolovia/master/skel/hosts-government-malware.txt
https://raw.githubusercontent.com/parseword/nolovia/master/skel/hosts-nolovia.txt
https://raw.githubusercontent.com/pathforwardit/BlockList/main/DomainList
https://raw.githubusercontent.com/pirat28/IHateTracker/master/iHateTracker.txt
https://raw.githubusercontent.com/sa-ki13/jmsf/master/japanese_mobile_site_dns_filter.txt
https://raw.githubusercontent.com/saurane/Turkish-Blocklist/master/Blocklist/domains.txt
https://raw.githubusercontent.com/scomper/surge-list/master/reject.list
https://raw.githubusercontent.com/sirsunknight/QuantumultX/master/Filter/Radical-Advertising
https://raw.githubusercontent.com/smed79/blacklist/master/hosts.txt
https://raw.githubusercontent.com/stamparm/maltrail/master/trails/static/suspicious/pua.txt
https://raw.githubusercontent.com/sutchan/dnsmasq_ads_filter/main/dnsmasq-ads-filter-list.txt
https://raw.githubusercontent.com/svetlyobg/svet-custom-domains/master/ads-domains
https://raw.githubusercontent.com/tomzuu/blacklist-named/master/ad.sites.conf
https://raw.githubusercontent.com/tomzuu/blacklist-named/master/phishing.sites.conf
https://raw.githubusercontent.com/tomzuu/blacklist-named/master/pushing.sites.conf
https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/badware.txt
https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/filters.txt
https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/privacy.txt
https://raw.githubusercontent.com/unchartedsky/adguard-kr/master/adguard-kr.txt
https://raw.githubusercontent.com/unflac/adFILTER/master/filter.txt
https://raw.githubusercontent.com/vokins/ad/main/ad.list
https://raw.githubusercontent.com/willianreis89/ADsBlock/master/list.txt
https://raw.githubusercontent.com/wrysunny/ad_list/master/adlist.txt
https://raw.githubusercontent.com/xOS/Config/Her/Surge/RuleSet/Advertising.list
https://raw.githubusercontent.com/xinggsf/Adblock-Plus-Rule/master/rule.txt
https://raw.githubusercontent.com/xlimit91/xlimit91-block-list/master/blacklist.txt
https://raw.githubusercontent.com/xylagbx/ADBLOCK/master/BLOCK/customadblockdomain.txt
https://raw.githubusercontent.com/ziozzang/adguard/master/filter.txt
https://raw.githubusercontent.com/zznidar/BAR/master/BAR-list
The only slightly tricky part is sub(/^raw$/,RS,$6) && sub(OFS RS,""), which is how you remove a mid-record field in awk: first convert the field to a string that matches RS, since that cannot be present in the input (we can use RS directly when it's a plain string like \n rather than a regexp). So we change raw to \n in the 6th field, which means the record now contains /\n, and then we remove /\n, thereby deleting the 6th field and the preceding /.
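The same idiom in isolation, on a made-up record (hypothetical data, just to show the field removal):
$ echo 'a/b/raw/c' | awk 'BEGIN{FS=OFS="/"} sub(/^raw$/,RS,$3) && sub(OFS RS,"") { print }'
a/b/c
Assigning to $3 rebuilds the record with OFS, so after the first sub() the record is a/b/<newline>/c, and the second sub() deletes the /<newline> pair.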

Swapping / rearranging of columns and its values based on inputs using Unix scripts

Team,
I have a requirement to change/reorder the columns of CSV files based on inputs.
Example:
The data file (source file) will always have standard columns and their values, for example:
PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION
MK3,Biberach,15200100_3,Biologics Downstream
MK3,Biberach,15200100_4,Sciona Upstream
MK3,Biberach,15200100_5,Drag envois
MK3,Biberach,15200100_8,flatsylio
MK3,Biberach,15200100_1,bioCovis
These columns (PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION) will be standard for source files, and I am looking for a solution to reformat this and generate a new file with only the columns we prefer.
Note: the source/data file will always be comma delimited.
Example: if I pass PRODUCTCODE,BATCHID as input, then I would like only those columns and their data extracted from the source file into a new file.
Something like script_name <output_column> <Source_File_name> <target_file_name>
target file example :
PRODUCTCODE,BATCHID
MK3,15200100_3
MK3,15200100_4
MK3,15200100_5
MK3,15200100_8
MK3,15200100_1
If I pass output_column as "LV1P_DESCRIPTION,PRODUCTCODE" then the output file should be like below:
LV1P_DESCRIPTION,PRODUCTCODE
Biologics Downstream,MK3
Sciona Upstream,MK3
Drag envios,MK3
flatsylio,MK3
bioCovis,MK3
It would be great if anyone could help with this.
I have tried an awk script (found on some site), but it was not working as expected, and since I don't have Unix knowledge I am finding it difficult to modify it.
awk code:
BEGIN {
FS = ","
}
NR==1 {
split(c, ca, ",")
for (i = 1 ; i <= length(ca) ; i++) {
gsub(/ /, "", ca[i])
cm[ca[i]] = 1
}
for (i = 1 ; i <= NF ; i++) {
if (cm[$i] == 1) {
cc[i] = 1
}
}
if (length(cc) == 0) {
exit 1
}
}
{
ci = ""
for (i = 1 ; i <= NF ; i++) {
if (cc[i] == 1) {
if (ci == "") {
ci = $i
} else {
ci = ci "," $i
}
}
}
print ci
}
The above code is saved as Remove.awk and is called by another script as below:
var1="BATCHID,LV2P_DESCRIPTION"
## this is input fields values used for testing
awk -f Remove.awk -v c="${var1}" RESULT.csv > test.csv
The following GNU awk solution should meet your objectives:
awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" 'BEGIN { split(flds,map,",") } NR==1 { for (i=1;i<=NF;i++) { map1[$i]=i } } { printf "%s",$map1[map[1]];for(i=2;i<=length(map);i++) { printf ",%s",$map1[map[i]] } printf "\n" }' file
Explanation:
awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" ' # Pass the fields to print as the variable flds
BEGIN {
split(flds,map,",") # Split flds into an array map using , as the delimiter
}
NR==1 { for (i=1;i<=NF;i++) {
map1[$i]=i # Loop through the header and create an array map1 with the column header as the index and the column number as the value
}
}
{ printf "%s",$map1[map[1]]; # Print the first field specified (index of map)
for(i=2;i<=length(map);i++) {
printf ",%s",$map1[map[i]] # Loop through the other field numbers specified, printing the contents
}
printf "\n"
}' file
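If you want the script_name <output_column> <Source_File_name> <target_file_name> interface from the question, a minimal wrapper sketch around the same GNU awk code might look like this (the script name reorder.sh is just an example):
#!/bin/sh
# Usage: ./reorder.sh "LV1P_DESCRIPTION,PRODUCTCODE" RESULT.csv test.csv
flds=$1
src=$2
tgt=$3
awk -F, -v flds="$flds" '
BEGIN { split(flds,map,",") }
NR==1 { for (i=1;i<=NF;i++) map1[$i]=i }
{
    printf "%s",$map1[map[1]]
    for (i=2;i<=length(map);i++) printf ",%s",$map1[map[i]]
    printf "\n"
}' "$src" > "$tgt"
The header line is reordered along with the data, since the NR==1 rule has no next and falls through to the printing block.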

Issue excluding records from output using awk [duplicate]

I have a file called domain which contains some domains. For example:
google.com
facebook.com
...
yahoo.com
And I have another file called site which contains some sites URLs and numbers. For example:
image.google.com 10
map.google.com 8
...
photo.facebook.com 22
game.facebook.com 15
..
Now I want to add up the numbers for each domain. For example: google.com has 10+8. So I wrote an awk script like this:
BEGIN{
while(getline dom < "./domain" > 0) {
domain[dom]=0;
}
for(dom in domain) {
while(getline < "./site" > 0) {
if($1 ~ /$dom$/) { # if $1 ends with $dom
domain[dom]+=$2;
}
}
}
}
But the code if($1 ~ /$dom$/) doesn't run like I want, because the variable $dom inside the regular expression is taken literally. So, the first question is:
Is there any way to use variable $dom in a regular expression?
Then, as I'm new to writing scripts:
Is there any better way to solve the problem I have?
awk can match against a variable if you don't use the // regex markers.
if ( $0 ~ regex ){ print $0; }
In this case, build up the required regex as a string
regex = dom"$"
Then match against the regex variable
if ( $1 ~ regex ) {
domain[dom]+=$2;
}
First of all, the variable is dom not $dom -- consider $ as an operator to extract the value of the column number stored in the variable dom
Secondly, awk will not interpolate variables between // -- whatever is in there is just literal regexp text.
You want the match() function where the 2nd argument can be a string that is treated as the regular expression:
if (match($1, dom "$")) {...}
I would code a solution like:
awk '
FNR == NR {domain[$1] = 0; next}
{
for (dom in domain) {
if (match($1, dom "$")) {
domain[dom] += $2
break
}
}
}
END {for (dom in domain) {print dom, domain[dom]}}
' domain site
One way using an awk script:
BEGIN {
FS = "[. ]"
OFS = "."
}
FNR == NR {
domain[$1] = $0
next
}
FNR < NR {
if ($2 in domain) {
for ( i = 2; i < NF; i++ ) {
if ($i != "") {
line = (line ? line OFS : "") $i
}
}
total[line] += $NF
line = ""
}
}
END {
for (i in total) {
printf "%s\t%s\n", i, total[i]
}
}
Run like:
awk -f script.awk domain.txt site.txt
Results:
facebook.com 37
google.com 18
You clearly want to read the site file once, not once per entry in domain. Fixing that, though, is trivial.
Equally, variables in awk (other than the fields $0, $1, $2, ...) are not prefixed with $. In particular, $dom is the field whose number is the value of the variable dom (typically that's going to be field 0, i.e. the whole record, since domain strings don't convert to any other number).
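To see what $dom actually does when dom holds a non-numeric string (a quick check, using one line from the site file):
$ echo 'image.google.com 10' | awk '{ dom = "google.com"; print $dom }'
image.google.com 10
Since "google.com" converts to the number 0, $dom is $0, the whole record.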
I think you need to find a way to get the domain from the data read from the site file. I'm not sure if you need to deal with sites with country domains such as bbc.co.uk as well as sites in the GTLDs (google.com etc). Assuming you are not dealing with country domains, you can use this:
BEGIN {
while (getline dom < "./domain" > 0) domain[dom] = 0
FS = "[ .]+"
while (getline < "./site" > 0)
{
topdom = $(NF-2) "." $(NF-1)
domain[topdom] += $NF
}
for (dom in domain) print dom " " domain[dom]
}
In the second while loop, there are NF fields; $NF contains the count, and $1 .. $(NF-1) contain components of the domain. So, topdom ends up containing the top domain name, which is then used to index into the array initialized in the first loop.
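For instance, on the first site line the splitting works out like this (a quick check, not part of the answer's script):
$ echo 'image.google.com 10' | awk 'BEGIN{FS="[ .]+"} {print NF, $(NF-2) "." $(NF-1), $NF}'
4 google.com 10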
Given the data in the question (minus the lines of dots), the output is:
yahoo.com 0
facebook.com 37
google.com 18
The problem with the answers above is that you cannot use the "metacharacters" (e.g. \< for a word boundary at the beginning of a word) if you use a string instead of a regular expression /.../.
If you had a domain xyz.com and two sites ab.xyz.com and cd.prefix_xyz.com, the numbers of both site entries would be added to xyz.com.
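A quick illustration of that false positive, with hypothetical names (xyz.com is not in the question's data):
$ echo 'cd.prefix_xyz.com 5' | awk '{ dom = "xyz.com"; if (match($1, dom "$")) print "matches", dom }'
matches xyz.com
One way to avoid it while still using a string regexp is to anchor it on a dot or the start of the field, e.g. match($1, "(^|\\.)" dom "$").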
Here's a solution using awk's pipe and the sed command:
...
for(dom in domain) {
while(getline < "./site" > 0) {
# let sed replace the occurrence of the domain at the end of the site
cmd = "echo '" $1 "' | sed 's/\\<'" dom "'$/NO_VALID_DOM/'"
cmd | getline x
close(cmd)
if (match(x, "NO_VALID_DOM")) {
domain[dom]+=$2;
}
}
close("./site") # this misses in original code
}
...

awk: preserve row order and remove duplicate strings (mirrors) when generating data

I have two text files
g1.txt
alfa beta;www.google.com
Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;
g2.txt
Jack to ride.zip;http://alfa.org;
JKr.rui.rar;http://gamma.org;
Nofj ogk.png;http://gamma.org;
I use this command to run my awk script
awk -f ./join2.sh g1.txt g2.txt > "g3.txt"
and I obtain this output
Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;;Jack to ride.zip;http://alfa.org;JKr.rui.rar;http://gamma.org;Nofj ogk.png;http://gamma.org;
alfa beta;www.google.com;
What are the problems?
1. Row order is not preserved: in the output file g3.txt, the line alfa beta;www.google.com; comes after the Light... line, when it should be first, as you can see in g1.txt.
2. I have many mirror (duplicate) strings in the Light... line; you can see in g3.txt that
http://alfa.org
http://gamma.org
http://gamma.org
are repeated in the same row.
What output do I want for the rows instead? Like this:
alfa beta;www.google.com
Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;Jack to ride.zip;JKr.rui.rar;Nofj ogk.png;
First: I want to check whether there are equal strings inside a row. For example, do you see in my output row Light Dweller - CR, Technical Metal... that there are identical strings inside that row, such as http://alfa.org and http://gamma.org? I don't want this. I want each string, enclosed within the ; delimiters, to appear once and only once in each row. This rule should only apply to the output file, g3.txt.
Second: I want the original order of rows in g1.txt to be maintained in the g3.txt output file. For example, in g1.txt I have
alfa beta ...
Light Dweller ...
but my script returns a different ordering:
Light Dweller ...
alfa beta ...
I want to prevent this reordering of rows.
My join2.sh script is this
#! /usr/bin/awk -f
BEGIN {
OFS=FS=";"
C=0;
}
{
if (ARGIND == 1) {
X = $NF
T0[$NF] = C++
$NF = ""
if (T1[X]) {
T1[X] = T1[X] $0
} else {
T1[X] = $0
}
} else {
X = $NF
T0[$NF] = C++
$NF = ""
if (T2[X]) {
T2[X] = T2[X] $0
} else {
T2[X] = $0
}
}
}
END {
for (X in T0) {
# concatenate T1[X] and X, since T1[X] ends with ";"
print T1[X] X, T2[X]
}
}
SOLUTION:
You should process g2.txt first like this:
cat join2.awk
BEGIN {
OFS=FS=";"
}
ARGIND == 1 {
map[$2] = ($2 in map ? map[$2] OFS : "") $1
next
}
{
r = $0;
for (i=1; i<=NF; ++i)
if ($i in map)
r = r OFS map[$i]
$0 = r
}
1
Then use it as:
awk -f join2.awk g2.txt g1.txt
alfa beta;www.google.com
Light Dweller - CR, Technical Metal;http://alfa.org;http://beta.org;http://gamma.org;;Jack to ride.zip;JKr.rui.rar;Nofj ogk.png

awk | Rearrange fields of CSV file on the basis of column value

I need your help in writing awk for the problem below. I have one source file and the required output for it.
Source File
a:5,b:1,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
Output File
session:4,a=5,b=1,c=2
session:3,a=11,b=3,c=5|3
Notes:
Fields are not ordered in the source file.
In the output file, fields are organised in a specific order, for example: all a values are in the 2nd column, then b, then c.
For value c in the second line, it appears multiple times, so in the output its values are merged with a PIPE (|) symbol.
Please help.
Will work in any modern awk:
$ cat file
a:5,b:1,c:2,session:4,e:8
a:5,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
$ cat tst.awk
BEGIN{ FS="[,:]"; split("session,a,b,c",order) }
{
split("",val) # or delete(val) in gawk
for (i=1;i<NF;i+=2) {
val[$i] = (val[$i]=="" ? "" : val[$i] "|") $(i+1)
}
for (i=1;i in order;i++) {
name = order[i]
printf "%s%s", (i==1 ? name ":" : "," name "="), val[name]
}
print ""
}
$ awk -f tst.awk file
session:4,a=5,b=1,c=2
session:4,a=5,b=,c=2
session:3,a=11,b=3,c=5|3
If you actually want the e values printed, unlike your posted desired output, just add ,e to the string in the split() in the BEGIN section wherever you'd like those values to appear in the ordered output.
Note that when b was missing from the input on line 2 above, it output a null value as you said you wanted.
Try with:
awk '
BEGIN {
FS = "[,:]"
OFS = ","
}
{
for ( i = 1; i <= NF; i+= 2 ) {
if ( $i == "session" ) { printf "%s:%s", $i, $(i+1); continue }
hash[$i] = hash[$i] (hash[$i] ? "|" : "") $(i+1)
}
asorti( hash, hash_orig )
for ( i = 1; i <= length(hash); i++ ) {
printf ",%s:%s", hash_orig[i], hash[ hash_orig[i] ]
}
printf "\n"
delete hash
delete hash_orig
}
' infile
That splits each line on any comma or colon and traverses the odd-numbered fields, saving them and their values in a hash which is printed at the end of each line. It yields:
session:4,a:5,b:1,c:2,e:8
session:3,a:11,b:3,c:5|3,e:9