I am using Org mode in Emacs and want to automatically update parts of an Org file using cron scheduling.
I know how to get the cron job to run at the times I choose, but now I am faced with the issue of selecting certain parts of the file to change.
I would like to increment numbers at certain locations in the file every day (say, every day at 3am).
So say I have the file fruit.org:
* Apple
age: 2
* Bananas
age: 1
A really bad fruit
* Cranberry
* Death
* Easter
A cool day
I want to select all the numerical values after "age:" and increment them every day. How would I do this selection and replacement? I believe it would involve a regexp and some tool (maybe awk), but I am relatively clueless from there on.
In awk, you could say:
awk '/age:/ { $2++ } { print }' fruit.org
If you have a recent version of GNU awk, you can edit the file in place using the option -i inplace. Otherwise, do the usual thing: redirect to a temporary file and then replace the original:
awk '/age:/ { $2++ } { print }' fruit.org > fruit.org.tmp && mv fruit.org{.tmp,}
That's basically what the in-place option of gawk or sed does behind the scenes anyway.
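To cover the cron side of the question: a crontab entry along these lines would run the increment every day at 3am. This is only a sketch; the path to the Org file is an assumption you will need to adjust, and the -i inplace form needs gawk 4.1 or newer.
0 3 * * * gawk -i inplace '/age:/ { $2++ } { print }' /path/to/fruit.org
Add it with crontab -e. If your gawk is older, put the temp-file version shown above into a small script and have cron call that instead.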
If you look through my questions from the past weeks, you will find I have asked questions similar to this one. I had trouble asking them in the expected format since I did not really know where my problems came from. E. Morton told me not to use range expressions; well, I do not know exactly what they are. I found many questions in this forum like mine, with working answers.
For example: "How to print the following line from a match".
But all the solutions I found stop working when I process more than one input file, and I need to process many.
I use this command:
gawk -f 1.awk print*.csv > new.txt
where 1.awk contains:
BEGIN { OFS=FS=";"
pattern="row4"
}
go {print} $0 ~ pattern {go = 1}
Input file 1, print1.csv, contains:
row1;something;in;this;row;;;;;;;
row2;something;in;this;row;;;;;;;
row3;something;in;this;row;;;;;;;
row4;don't;need;to;match;the;whole;line,;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
Input file 2, print2.csv, contains the same content, just for illustration purposes.
The 1.awk script (and several other ways I found in this forum to print from a match) works for one file. Output:
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
BUT not when I process more input files.
Each time I process more than one input file this way, the awk commands 'to print from a match' seem to be ignored.
As I said, I was told not to use range expressions. I do not know how to avoid them, and maybe the problem is linked to the way I pass in several files?
Just reset your match indicator at the beginning of each file:
$ awk 'FNR==1{p=0} p; /row4/{p=1} ' file1 file2
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
UPDATE
From the comments
is it possible to combine your awk with: "If $1=="row5" then write $5 into $6 and delete the value in $5"? In other words, to move the content, when "row5" is found in column 1, to a new column 6? I could do this with another awk, but a combination into one would be nicer
... $1=="row5"{$6=$5; $5=""} ...
or, if you want to use another field instead of $5, replace $5 with the corresponding field number.
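For reference, a sketch of everything combined into one command (assuming the same semicolon-separated print*.csv files as above; swap the field numbers if needed):
awk 'BEGIN{FS=OFS=";"} FNR==1{p=0} $1=="row5"{$6=$5; $5=""} p; /row4/{p=1}' print*.csv > new.txt
Setting OFS=";" matters here: when $6 is assigned, awk rebuilds the record using OFS, so without it the modified rows would come out space-separated.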
I want to create a Postfix aliases file from the LDIF output of ldapsearch.
The LDIF file contains records for approximately 10,000 users. Each user has at least one entry for the proxyAddresses attribute. I need to create an alias corresponding with each proxyAddress that meets the conditions below. The created aliases must point to sAMAccountName@other.domain.
Type is SMTP or smtp (case-insensitive)
Domain is exactly contoso.com
I'm not sure if the attribute ordering in the LDIF file is consistent. I don't think I can assume that sAMAccountName will always appear last.
Example input file
dn: CN=John Smith,OU=Users,DC=contoso,DC=com
proxyAddresses: SMTP:smith@contoso.com
proxyAddresses: smtp:John.Smith@contoso.com
proxyAddresses: smtp:jsmith@elsewhere.com
proxyAddresses: MS:ORG/ORGEXCH/JOHNSMITH
sAMAccountName: smith
dn: CN=Tom Frank,OU=Users,DC=contoso,DC=com
sAMAccountName: frank
proxyAddresses: SMTP:frank@contoso.com
proxyAddresses: smtp:Tom.Frank@contoso.com
proxyAddresses: smtp:frank@elsewhere.com
proxyAddresses: MS:ORG/ORGEXCH/TOMFRANK
Example output file
smith: smith@other.domain
John.Smith: smith@other.domain
frank: frank@other.domain
Tom.Frank: frank@other.domain
Ideal solution
I'd like to see a solution using awk, but other methods are acceptable too. Here are the qualities that are most important to me, in order:
Simple and readable. Self-documenting is better than one-liners.
Efficient. This will be used thousands of times.
Idiomatic. Doing it "the awk way" would be nice if it doesn't compromise the first two goals.
What I've tried
I've managed to make a start on this, but I'm struggling to understand the finer points of awk.
I tried using csplit to create separate files for each record in the LDIF output, but that seems wasteful since I only want a single file in the end.
I tried setting RS="" in awk to get complete records instead of individual lines, but then I wasn't sure where to go from there.
I tried using awk to split the big LDIF file into separate files for each record and then processing those with another shell script, but that seemed wasteful.
Here is a gawk script, which you can run like this: gawk -f ldif.awk yourfile.ldif
Please note: the multi-character value of RS is a gawk extension.
$ cat ldif.awk
BEGIN {
    RS = "\n\n"    # Record separator: empty line
    FS = "\n"      # Field separator: newline
}

# For each record: loop twice through fields
{
    # Loop #1 identifies the sAMAccountName
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^sAMAccountName: /) {
            sAN = substr($i, 17)
            break
        }
    }
    # Loop #2 prints output lines
    for (i = 1; i <= NF; i++) {
        if (tolower($i) ~ /smtp:.*@contoso.com$/) {
            split($i, n, ":|@")
            print n[3] ": " sAN "@other.domain"
        }
    }
}
Here is a way to do it using standard awk.
# Display the postfix alias(es) for the previous user (if any)
function dump() {
    for (i in id) printf("%s: %s@other.domain\n", id[i], an);
    delete id;
}
# store all email names for that user in the id array
/^proxyAddresses:.[Ss][Mm][Tt][Pp]:.*@contoso.com/ {gsub(/^.*:/,"");gsub(/@.*$/,"");id[i++]=$0}
# store the account name
/^sAMAccountName:/ {an=$2};
# When a new record is found, process the previous one
/^dn:/ {dump()}
# Process the last record
END {dump()}
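Save it as, say, aliases.awk (the file name is just an assumption) and run it the same way as the gawk version above:
awk -f aliases.awk yourfile.ldif > aliases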
I have a text file in the format below. The first column is a timestamp with a very high resolution, and the second number is a sequence number. I want to plot a graph of these two values, i.e. sequence number over time. For this purpose I want to scale both the sequence number and the timestamp: the timestamps can be scaled by subtracting the first timestamp from the remaining timestamps, and the sequence numbers should be scaled the same way. However, when scaled, the sequence numbers can have negative values. How do I write a bash script using awk to achieve this? The file is named print_1010171.txt. Please note that I have a number of files of the same format, so I want the script to be generic.
5698771509078629376 1133254688
5698771509371165696 1150031904
5698771510035551232 1150031904
5698771510036082688 4170258464
5698771510036715520 2895583264
5698771510037202176 1620908064
5698771510037665280 346232864
5698771510038193664 3366459424
5698771510332259072 2091784224
5698771510332816128 817109024
5698771510333344512 3837335584
5698771510339882240 2562660384
5698771510340411392 1287985184
5698771510340939776 13309984
5698771510348048896 3033536544
5698771510348577280 1758861344
5698771510349228800 484186144
5698771510632804864 3504412704
5698771510633441792 2229737504
5698771510634390272 955062304
5698771510638858496 3975288864
5698771510639347712 2700613664
5698771510642663168 1425938464
5698771510643387136 134486304
5698771510643808768 3154712864
5698771510648858368 1880037664
5698771510649410560 605362464
5698771510655600384 3625589024
5698771510656128768 2350913824
5698771510656657408 1076238624
Very similar to Dennis Williamson's solution -- This should be more efficient (but probably not something you'd ever notice) and it will also silently ignore blank lines (the other solution will give very large negative numbers for blank lines).
#script coolscript.gp
if(!exists("DATAFILE")) DATAFILE='test.txt'
EXT_INDEX=strstr(DATAFILE,'.txt') #assume data has a .txt extension.
set term post enh color
set output DATAFILE[:EXT_INDEX] . '.ps' #gnuplot string slicing and concatenation
plot "< awk 'BEGIN{getline; header_col1=$1; header_col2=$2 }{if(NF){print $1-header_col1,$2-header_col2}}' ".DATAFILE using 1:2
You can definitely do this using an all-gnuplot solution. (See @andyras's nice solution and my answer that he linked to.) This (alternate) solution works by reading the first line in awk and assigning the variables header_col1 and header_col2 the data in column 1 and column 2. It then subtracts those values from the subsequent rows (as expected) as long as the line isn't empty.
Note that this solution can be called from the commandline using:
gnuplot -e "DATAFILE='mydatafile.txt'" coolscript.gp
Unfortunately, the quotes are necessary since gnuplot needs them, meaning that if you're using this in a shell loop, you should definitely use the double quotes on the outside as I show.
for FILE in *.txt; do
gnuplot -e "DATAFILE='${FILE}'" coolscript.gp
done
awk 'NR == 1 {basets = $1; baseseq = $2} {print $1 - basets, $2 - baseseq}' inputfile
or, if you don't want to output the initial pair of zeros:
awk 'NR == 1 {basets = $1; baseseq = $2; next} {print $1 - basets, $2 - baseseq}' inputfile
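Since you have a number of files in the same format, a small wrapper loop along these lines keeps the scaling per file (the scaled_ output naming is just an assumption):
for f in print_*.txt; do
    awk 'NR == 1 {basets = $1; baseseq = $2} {print $1 - basets, $2 - baseseq}' "$f" > "scaled_$f"
done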
Here is a bash wrapper script which should do what you want:
#!/bin/bash
gnuplot << EOF
set terminal png truecolor size 800,600
set output 'plot_$1.png'
firstx=0
offsetx=0
funcx(x)=(offsetx=(firstx==0)?x:offsetx,firstx=1,x-offsetx)
firsty=0
offsety=0
funcy(x)=(offsety=(firsty==0)?x:offsety,firsty=1,x-offsety)
plot '$1' u (funcx(\$1)):(funcy(\$2))
EOF
To use the script, give it the name of the file you want to plot as an argument:
$ myscript.sh print_1010171.txt
I modified the answer given here to accommodate two variables. See that answer also if you want to subtract the lowest value from all data rather than the first.
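If you would rather do that lowest-value subtraction in awk instead of gnuplot, a two-pass sketch along these lines should work (it reads the file twice; the file name is just the example from the question):
awk 'NR==FNR {if (FNR==1 || $1<mint) mint=$1; if (FNR==1 || $2<mins) mins=$2; next}
     {print $1-mint, $2-mins}' print_1010171.txt print_1010171.txt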
I'm currently running an awk script to process a large (8.1GB) access-log file, and it's taking forever to finish. In 20 minutes, it wrote 14MB of the (1000 +- 500)MB I expect it to write, and I wonder if I can process it much faster somehow.
Here is the awk script:
#!/bin/bash
awk '{t=$4" "$5; gsub("[\[\]\/]"," ",t); sub(":"," ",t);printf("%s,",$1);system("date -d \""t"\" +%s");}' $1
EDIT:
For non-awkers, the script reads each line, extracts the date information, reformats it into something the date utility recognizes, and calls date to represent it as the number of seconds since 1970, finally outputting it as a line of a .csv file along with the IP.
Example input: 189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"
Returned output: 189.5.56.113,124237889
@OP, your script is slow mainly due to the excessive calls to the external date command, once for every line of the file, and it's a big file as well (in the GB range). If you have gawk, use its internal mktime() function to do the date-to-epoch-seconds conversion:
awk 'BEGIN{
m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",d,"|")
for(o=1;o<=m;o++){
date[d[o]]=sprintf("%02d",o)
}
}
{
gsub(/\[/,"",$4); gsub(":","/",$4); gsub(/\]/,"",$5)
n=split($4, DATE,"/")
day=DATE[1]
mth=DATE[2]
year=DATE[3]
hr=DATE[4]
min=DATE[5]
sec=DATE[6]
MKTIME= mktime(year" "date[mth]" "day" "hr" "min" "sec)
print $1,MKTIME
}' file
output
$ more file
189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"
$ ./shell.sh
189.5.56.113 1264110895
If you really really need it to be faster, you can do what I did. I rewrote an Apache log file analyzer using Ragel. Ragel allows you to mix regular expressions with C code. The regular expressions get transformed into very efficient C code and then compiled. Unfortunately, this requires that you are very comfortable writing code in C. I no longer have this analyzer. It processed 1 GB of Apache access logs in 1 or 2 seconds.
You may have limited success removing unnecessary printfs from your awk statement and replacing them with something simpler.
If you are using gawk, you can massage your date and time into a format that mktime (a gawk function) understands. It will give you the same timestamp you're using now and save you the overhead of repeated system() calls.
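For example, a minimal sketch of that massaging, assuming the same log layout as above (field 4 holds [dd/Mon/yyyy:hh:mm:ss and, like the answer above, this ignores the timezone offset in field 5; access.log is a placeholder name):
gawk 'BEGIN { n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
              for (i = 1; i <= n; i++) mon[m[i]] = sprintf("%02d", i) }
      { split(substr($4, 2), a, "[/:]")    # day, month name, year, hour, minute, second
        print $1 "," mktime(a[3] " " mon[a[2]] " " a[1] " " a[4] " " a[5] " " a[6]) }' access.log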
This little Python script handles ~400MB worth of copies of your example line in about 3 minutes on my machine, producing ~200MB of output (keep in mind your sample line was quite short, so that's a handicap):
import time

src = open('x.log', 'r')
dest = open('x.csv', 'w')
for line in src:
    ip = line[:line.index(' ')]
    date = line[line.index('[') + 1:line.index(']') - 6]
    t = time.mktime(time.strptime(date, '%d/%b/%Y:%X'))
    dest.write(ip)
    dest.write(',')
    dest.write(str(int(t)))
    dest.write('\n')
src.close()
dest.close()
A minor problem is that it doesn't handle timezones (strptime() problem), but you could either hardcode that or add a little extra to take care of it.
But to be honest, something as simple as that should be just as easy to rewrite in C.
gawk '{
    dt = substr($4,2,11) " " substr($4,14,8)   # "22/Jan/2010" plus "05:54:55"
    gsub(/\//, " ", dt)                        # date -d understands "22 Jan 2010 05:54:55"
    cmd = "date -d \"" dt "\" +%s"
    cmd | getline ts
    close(cmd)                                 # close the pipe so a big file does not exhaust file descriptors
    print $1, ts
}' yourfile
It's the first great virtue of programmers. All of us have, at one time or another, automated a task with a bit of throw-away code. Sometimes it takes a couple of seconds tapping out a one-liner; sometimes we spend an exorbitant amount of time automating away a two-second task and then never use it again.
What tiny hack have you found useful enough to reuse? To go so far as to make an alias for?
Note: before answering, please check to make sure it's not already covered in the "favourite command-line tricks using BASH" or "perl/ruby one-liner" questions.
I found this on dotfiles.org just today. It's very simple, but clever. I felt stupid for not having thought of it myself.
###
### Handy Extract Program
###
extract () {
    if [ -f "$1" ] ; then
        case "$1" in
            *.tar.bz2) tar xvjf "$1" ;;
            *.tar.gz) tar xvzf "$1" ;;
            *.bz2) bunzip2 "$1" ;;
            *.rar) unrar x "$1" ;;
            *.gz) gunzip "$1" ;;
            *.tar) tar xvf "$1" ;;
            *.tbz2) tar xvjf "$1" ;;
            *.tgz) tar xvzf "$1" ;;
            *.zip) unzip "$1" ;;
            *.Z) uncompress "$1" ;;
            *.7z) 7z x "$1" ;;
            *) echo "'$1' cannot be extracted via >extract<" ;;
        esac
    else
        echo "'$1' is not a valid file"
    fi
}
Here's a filter that puts commas in the middle of any large numbers in standard input.
$ cat ~/bin/comma
#!/usr/bin/perl -p
s/(\d{4,})/commify($1)/ge;
sub commify {
local $_ = shift;
1 while s/^([ -+]?\d+)(\d{3})/$1,$2/;
return $_;
}
I usually wind up using it for long output lists of big numbers, and I tire of counting decimal places. Now instead of seeing
-rw-r--r-- 1 alester alester 2244487404 Oct 6 15:38 listdetail.sql
I can run that as ls -l | comma and see
-rw-r--r-- 1 alester alester 2,244,487,404 Oct 6 15:38 listdetail.sql
This script saved my career!
Quite a few years ago, I was working remotely on a client database. I updated a shipment to change its status, but I forgot the WHERE clause.
I'll never forget the feeling in the pit of my stomach when I saw (6834 rows affected). I basically spent the entire night going through event logs and figuring out the proper status on all those shipments. Crap!
So I wrote a script (originally in awk) that would start a transaction for any updates, and check the rows affected before committing. This prevented any surprises.
So now I never do updates from command line without going through a script like this. Here it is (now in Python):
import sys
import subprocess as sp

pgm = "isql"
if len(sys.argv) == 1:
    print "Usage: \nsql sql-string [rows-affected]"
    sys.exit()
sql_str = sys.argv[1].upper()
max_rows_affected = 3
if len(sys.argv) > 2:
    max_rows_affected = int(sys.argv[2])

if sql_str.startswith("UPDATE"):
    sql_str = "BEGIN TRANSACTION\\n" + sql_str
    p1 = sp.Popen([pgm, sql_str], stdout=sp.PIPE, shell=True)
    (stdout, stderr) = p1.communicate()
    print stdout
    # example -> (33 rows affected)
    affected = stdout.splitlines()[-1]
    affected = affected.split()[0].lstrip('(')
    num_affected = int(affected)
    if num_affected > max_rows_affected:
        print "WARNING! ", num_affected, "rows were affected, rolling back..."
        sql_str = "ROLLBACK TRANSACTION"
        ret_code = sp.call([pgm, sql_str], shell=True)
    else:
        sql_str = "COMMIT TRANSACTION"
        ret_code = sp.call([pgm, sql_str], shell=True)
else:
    ret_code = sp.call([pgm, sql_str], shell=True)
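Assuming it is saved somewhere on your PATH as, say, sql (a made-up name), a guarded update looks like this; the second argument is the maximum number of rows you expect to touch, and anything above that gets rolled back:
sql "update shipments set status = 'delivered' where shipment_id = 6834" 1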
I use this script under assorted Linuxes to check whether a directory copy between machines (or to CD/DVD) worked, or whether copying (e.g. ext3 utf8 filenames -> fuseblk) has mangled special characters in the filenames.
#!/bin/bash
## dsum Do checksums recursively over a directory.
## Typical usage: dsum <directory> > outfile
export LC_ALL=C # Optional - use sort order across different locales
if [ $# != 1 ]; then echo "Usage: ${0/*\//} <directory>" 1>&2; exit; fi
cd "$1" 1>&2 || exit
#findargs=-follow # Uncomment to follow symbolic links
find . $findargs -type f | sort | xargs -d'\n' cksum
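To actually verify a copy, run it on both trees and diff the results (the paths and output names here are just examples):
dsum /data/photos        > original.sums
dsum /mnt/backup/photos  > copy.sums
diff original.sums copy.sums
Because the script cds into the directory and checksums relative paths, the two outputs line up and any mismatch or missing file shows up in the diff.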
Sorry, I don't have the exact code handy, but I coded a regular expression for searching source code in VS.NET that allowed me to search anything not in comments. It came in very useful on a particular project I was working on, where people insisted that commenting out code was good practice, in case you wanted to go back and see what the code used to do.
I have two Ruby scripts that I modify regularly to download the full archives of various webcomics. Extremely handy! Note: they require wget, so probably Linux. Note 2: read these before you try them; they need a little bit of modification for each site.
Date based downloader:
#!/usr/bin/ruby -w
Day = 60 * 60 * 24
Fromat = "hjlsdahjsd/comics/st%Y%m%d.gif"
t = Time.local(2005, 2, 5)
MWF = [1,3,5]
until t == Time.local(2007, 7, 9)
if MWF.include? t.wday
`wget #{t.strftime(Fromat)}`
sleep 3
end
t += Day
end
Or you can use the number based one:
#!/usr/bin/ruby -w
Fromat = "http://fdsafdsa/comics/%08d.gif"
1.upto(986) do |i|
`wget #{sprintf(Fromat, i)}`
sleep 1
end
Instead of having to repeatedly open files in SQL Query Analyser and run them, I found the syntax needed to make a batch file, and could then run 100 at once. Oh the sweet sweet joy! I've used this ever since.
isqlw -S servername -d dbname -E -i F:\blah\whatever.sql -o F:\results.txt
This goes back to my COBOL days but I had two generic COBOL programs, one batch and one online (mainframe folks will know what these are). They were shells of a program that could take any set of parameters and/or files and be run, batch or executed in an IMS test region. I had them set up so that depending on the parameters I could access files, databases(DB2 or IMS DB) and or just manipulate working storage or whatever.
It was great because I could test that date function without guessing, or test why there was truncation, or why there was a database ABEND. The programs grew in size as time went on to include all sorts of tests and became a staple of the development group. Everyone knew where the code resided and included them in their unit testing as well. Those programs got so large (most of the code was commented-out tests), and it was all contributed by people through the years. They saved so much time and settled so many disagreements!
I coded a Perl script to map dependencies, without going into an endless loop, for a legacy C program I inherited ... that also had a diamond dependency problem.
I wrote a small program that e-mailed me when I received e-mails from friends on a rarely used e-mail account.
I wrote another small program that sent me text messages if my home IP changes.
To name a few.
Years ago I built a suite of applications on a custom web application platform in Perl.
One cool feature was to convert SQL query strings into human readable sentences that described what the results were.
The code was relatively short but the end effect was nice.
I've got a little app that you run and it dumps a GUID into the clipboard. You can run it with /noui or not. With UI, it's a single button that drops a new GUID every time you click it. Without UI, it drops a new one and then exits.
I mostly use it from within VS. I have it as an external app and mapped to a shortcut. I'm writing an app that relies heavily on xaml and guids, so I always find I need to paste a new guid into xaml...
Any time I write a clever list comprehension or use of map/reduce in Python. There was one like this:
if reduce(lambda x, c: locks[x] and c, locknames, True):
    print "Sub-threads terminated!"
The reason I remember that is that I came up with it myself, then saw the exact same code on somebody else's website. Nowadays it'd probably be done like:
if all(map(lambda z: locks[z], locknames)):
    print "ya trik"
I've got 20 or 30 of these things lying around, because once I coded up the framework for my standard console app in Windows, I could pretty much drop in any logic I wanted, so I ended up with a lot of these little things that solve specific problems.
I guess the ones I'm using a lot right now are a console app that takes stdin and colorizes the output based on XML profiles that match regular expressions to colors (I use it for watching my log files from builds), and a command-line launcher so I don't pollute my PATH env var; it would exceed the limit on some systems anyway, namely Win2k.
I'm constantly connecting to various Linux servers from my own desktop throughout my workday, so I created a few aliases that launch an xterm on those machines and set the title, background color, and other tweaks:
alias x="xterm" # local
alias xd="ssh -Xf me#development_host xterm -bg aliceblue -ls -sb -bc -geometry 100x30 -title Development"
alias xp="ssh -Xf me#production_host xterm -bg thistle1 ..."
I have a bunch of servers I frequently connect to, as well, but they're all on my local network. This Ruby script prints out the command to create aliases for any machine with ssh open:
#!/usr/bin/env ruby
require 'rubygems'
require 'dnssd'
handle = DNSSD.browse('_ssh._tcp') do |reply|
print "alias #{reply.name}='ssh #{reply.name}.#{reply.domain}';"
end
sleep 1
handle.stop
Use it like this in your .bash_profile:
eval `ruby ~/.alias_shares`