Parallelize an awk script that splits a stream into files

I have a small awk script which takes input from a stream and writes to the appropriate file based on the second column value. Here is how it goes:
cat mydir/*.csv | awk -F, '{if(NF==29)print $0 >> "output/"$2".csv"}'
How do I parallelize it, so that it can use multiple cores available in the machine? Right now, this is running on a single core.

You can try this approach: run one awk per source file. Each subprocess writes to its own series of temporary files, so that two processes never append to the same final file at once, and the final files are not opened and closed excessively. At the end of each awk, the temporary files are appended to the final files and removed.
You may need a batch limiter (a sleep, or smarter grouping) if there are a lot of files to process, to avoid overwhelming the machine with concurrent subprocesses; a sketch of one way to do that follows the script.
rm -f output/*.csv

for File in mydir/*.csv
do
   # shell subprocess
   {
      # suffix for this process's series of temporary files
      FileRef="${File##*/}"

      awk -F ',' -v FR="${FileRef}" '
         NF == 29 {
            # write the line to a per-process temporary file
            ListFiles[OutTemp = "output/" $2 ".csv_" FR] = "output/" $2 ".csv"
            print > OutTemp
            }
         END {
            # append each temporary file to its final file, then remove it
            for ( TempFile in ListFiles ) {
               # flush and close the temporary file before cat reads it
               close( TempFile )
               Command = sprintf( "cat \042%s\042 >> \042%s\042; rm \042%s\042" \
                        , TempFile, ListFiles[TempFile], TempFile )
               printf "" | Command
               close( Command )
               }
            }
         ' "${File}"
   } &
done

wait
ls -l output/*.csv
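One possible implementation of the batch limiter mentioned above (a sketch, assuming bash 4.3+ for wait -n; max_jobs is an arbitrary example value):
max_jobs=8   # tune to your core count

for File in mydir/*.csv
do
   # block until a slot frees up before forking another subprocess
   while (( $(jobs -rp | wc -l) >= max_jobs ))
   do
      wait -n   # bash 4.3+: waits for any single background job to finish
   done

   {
      : # ... same per-file awk subprocess as in the script above ...
   } &
done

wait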

Untested:
do_one() {
# Make a workdir only used by this process to ensure no files are added to in parallel
mkdir -p "$1"
cd "$1" || return
cat ../"$2" | awk -F, '{if(NF==29)print $0 >> $2".csv"}'
}
export -f do_one
parallel do_one workdir-{%} {} ::: mydir/*.csv
ls workdir-*/ | sort -u |
parallel 'cat workdir*/{} > output/{}'
rm -rf workdir-*
If you want to avoid the extra cat you can use this instead, though I find the cat version easier to read (performance is normally the same on modern systems http://oletange.blogspot.com/2013/10/useless-use-of-cat.html):
do_one() {
# Make a workdir only used by this process to ensure no files are added to in parallel
mkdir -p "$1"
cd "$1" || return
awk -F, <../"$2" '{if(NF==29)print $0 >> $2".csv"}'
}
export -f do_one
parallel do_one workdir-{%} {} ::: mydir/*.csv
ls workdir-*/ | sort -u |
parallel 'cat workdir*/{} > output/{}'
rm -rf workdir-*
But as @Thor writes, you are most likely I/O starved.

Related

Extracting data after a tag and CR with Busybox sed

I have a script that extracts a file from a bash script combined with a binary file. It does so using the following GNU sed syntax
sed -n '/__DATA__/{n;:1;n;p;b1}' /tmp/combined.file > /tmp/binary.file
The files are assembled by cat'ing an ISO file onto the end of a bash script, which is then sent over the network to an embedded device and extracted there, piping the ISO file to a temporary dir and executing the bash script to install it.
However, on executing this I get a
sed: unterminated {
Am I missing something here? Is this task possible with BusyBox sed?
I tried the "Second attempt" below with OSX/BSD awk and it failed, just printing up to the first NUL character. So you can't do this job portably with awk or sed.
Here's what should work everywhere given that the POSIX standard says
the input file to tail can be any type
so the input to tail doesn't have to be a POSIX text file (no NULs) and we're exiting from awk before the first NUL is encountered in the input so they should both be happy:
$ tail -n +"$(awk '/^__DATA__$/{print NR+2; exit}' binary.bin)" binary.bin | cat -ev
ER^H^#^#^#M-^PM-^P^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#3M-mM-zM-^NM-UM-<^#|M-{M-|f1M-[f1M-IfSfQ^FWM-^NM-]M-^NM-ERM->^#|M-?^#^FM-9^#^AM-sM-%M-jK^F^#^#RM-4AM-;M-*U1M-I0M-vM-yM-M^Sr^VM-^AM-{UM-*u^PM-^CM-a^At^KfM-G^FM-s^FM-4BM-k^UM-k^B1M-IZQM-4^HM-M^S[^OM-6M-F#PM-^CM-a?QM-wM-aSRPM-;^#|M-9^D^#fM-!M-0^GM-hD^#^OM-^BM-^#^#f#M-^#M-G^BM-bM-rfM-^A>#|M-{M-#xpu M-zM-<M-l{M-jD|^#^#M-hM-^C^#isolinux.bin missing or corrupt.^M$
f`f1M-Rf^C^FM-x{f^S^VM-|{fRfP^FSj^Aj^PM-^IM-ffM-w6M-h{M-#M-d^FM-^HM-aM-^HM-EM-^RM-v6M-n{M-^HM-F^HM-aAM-8^A^BM-^J^VM-r{M-M^SM-^Md^PfaM-CM-h^^^#Operating system load error.^M$
^M-,M-4^NM-^J>b^DM-3^GM-M^P<$
uM-qM-M^XM-tM-kM-}^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#L^D^#^#^#^#^#^#M-/KM-66^#^#M-^#^#^A^#^#?M-`M-^K^#^#^#^#^#`^\^#^#M-~M-^?M-^?M-oM-~M-^?M-^?<R^#^#^#^_^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#UM-*EFI PART^#^#^A^#\^#^#^#]3M-%.^#^#^#^#^A^#^#^#^#^#^#^#M-^?_^\^#^#^#^#^##^#^#^#^#^#^#^#M-J_^\^#^#^#^#^#UcM-)r^Oqc#M-^Rc^FM-2$LZM-p^L^#^#^#^#^#^#^#M-P^#^#^#M-^#^#^#^#M-{t^]F^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#$
Second attempt:
Now that I have a better idea what you're trying to do (process a file consisting of POSIX text lines up to a point, which may be followed by NUL characters), try this:
$ cat -ev file
echo "I: Installation finished!"$
exit 0$
$
__DATA__$
$
foo^#bar^#etc
$ cat tst.awk
/^__DATA__$/ { n=NR + 1 }
n && (NR == n) { RS="\0"; ORS="" }
n && (NR > n) { print (c++ ? RS : "") $0 }
$ awk -f tst.awk file | cat -ev
foo^#bar^#etc
The above doesn't try to store any input lines containing NULs in memory. Instead, it reads \n-terminated text lines until it reaches the line after the one containing __DATA__, then switches to reading NUL-terminated records into memory and printing NULs between them on output.
It's still undefined behavior per POSIX, but in theory it should work, since it just relies on being able to set one variable (RS) to NUL rather than trying to store input strings that contain NULs. Also, setting RS to NUL has been a (flawed) workaround for years for awk scripts that want to read a whole file into memory at once, so being able to set RS to NUL should work in any modern awk.
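For reference, that whole-file-slurp idiom is just this (a sketch; it relies on the same technically undefined assignment of NUL to RS, so behavior can vary between awk implementations):
awk 'BEGIN{RS="\0"} END{print NR}' file
For a file containing no NULs the record separator never occurs, so the whole file is read as a single record and NR is 1 at the END.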
Using the new sample you provided with the missing blank line after the __DATA__ line added:
$ cat -ev file
#!/bin/bash$
$
echo "I: Awesome Things happened here"$
exit 0$
$
__DATA__$
$
ER^H^#^#^#M-^PM-^P^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#3M-mM-zM-^NM-UM-<^#|M-{M-|f1M-[f1M-IfSfQ^FWM-^NM-]M-^NM-ERM->^#|M-?^#^FM-9^#^AM-sM-%M-jK^F^#^#RM-4AM-;M-*U1M-I0M-vM-yM-M^Sr^VM-^AM-{UM-*u^PM-^CM-a^At^KfM-G^FM-s^FM-4BM-k^UM-k^B1M-IZQM-4^HM-M^S[^OM-6M-F#PM-^CM-a?QM-wM-aSRPM-;^#|M-9^D^#fM-!M-0^GM-hD^#^OM-^BM-^#^#f#M-^#M-G^BM-bM-rfM-^A>#|M-{M-#xpu M-zM-<M-l{M-jD|^#^#M-hM-^C^#isolinux.bin missing or corrupt.^M$
f`f1M-Rf^C^FM-x{f^S^VM-|{fRfP^FSj^Aj^PM-^IM-ffM-w6M-h{M-#M-d^FM-^HM-aM-^HM-EM-^RM-v6M-n{M-^HM-F^HM-aAM-8^A^BM-^J^VM-r{M-M^SM-^Md^PfaM-CM-h^^^#Operating system load error.^M$
^M-,M-4^NM-^J>b^DM-3^GM-M^P<$
uM-qM-M^XM-tM-kM-}^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#L^D^#^#^#^#^#^#M-/KM-66^#^#M-^#^#^A^#^#?M-`M-^K^#^#^#^#^#`^\^#^#M-~M-^?M-^?M-oM-~M-^?M-^?<R^#^#^#^_^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#UM-*EFI PART^#^#^A^#\^#^#^#]3M-%.^#^#^#^#^A^#^#^#^#^#^#^#M-^?_^\^#^#^#^#^##^#^#^#^#^#^#^#M-J_^\^#^#^#^#^#UcM-)r^Oqc#M-^Rc^FM-2$LZM-p^L^#^#^#^#^#^#^#M-P^#^#^#M-^#^#^#^#M-{t^]F^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#$
.
$ awk -f tst.awk file | cat -ev
ER^H^#^#^#M-^PM-^P^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#3M-mM-zM-^NM-UM-<^#|M-{M-|f1M-[f1M-IfSfQ^FWM-^NM-]M-^NM-ERM->^#|M-?^#^FM-9^#^AM-sM-%M-jK^F^#^#RM-4AM-;M-*U1M-I0M-vM-yM-M^Sr^VM-^AM-{UM-*u^PM-^CM-a^At^KfM-G^FM-s^FM-4BM-k^UM-k^B1M-IZQM-4^HM-M^S[^OM-6M-F#PM-^CM-a?QM-wM-aSRPM-;^#|M-9^D^#fM-!M-0^GM-hD^#^OM-^BM-^#^#f#M-^#M-G^BM-bM-rfM-^A>#|M-{M-#xpu M-zM-<M-l{M-jD|^#^#M-hM-^C^#isolinux.bin missing or corrupt.^M$
f`f1M-Rf^C^FM-x{f^S^VM-|{fRfP^FSj^Aj^PM-^IM-ffM-w6M-h{M-#M-d^FM-^HM-aM-^HM-EM-^RM-v6M-n{M-^HM-F^HM-aAM-8^A^BM-^J^VM-r{M-M^SM-^Md^PfaM-CM-h^^^#Operating system load error.^M$
^M-,M-4^NM-^J>b^DM-3^GM-M^P<$
uM-qM-M^XM-tM-kM-}^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#L^D^#^#^#^#^#^#M-/KM-66^#^#M-^#^#^A^#^#?M-`M-^K^#^#^#^#^#`^\^#^#M-~M-^?M-^?M-oM-~M-^?M-^?<R^#^#^#^_^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#UM-*EFI PART^#^#^A^#\^#^#^#]3M-%.^#^#^#^#^A^#^#^#^#^#^#^#M-^?_^\^#^#^#^#^##^#^#^#^#^#^#^#M-J_^\^#^#^#^#^#UcM-)r^Oqc#M-^Rc^FM-2$LZM-p^L^#^#^#^#^#^#^#M-P^#^#^#M-^#^#^#^#M-{t^]F^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#$
Original answer:
Assuming this question is related to your previous question, this will work using any awk in any shell on every UNIX box:
$ awk '/^__DATA__$/{n=NR+1} n && NR>n' file
3<ED>M-^PM-^PM-^PM-^PM-^
When it finds __DATA__ it sets a variable n to the line number to start printing after and then when n is set prints every line for which the line number is greater than n.
The above was run against this input file from your previous question:
$ cat -ev file
echo "I: Installation finished!"$
exit 0$
$
__DATA__$
$
3<ED>M-^PM-^PM-^PM-^PM-^$

Change a string using sed or awk

I have some files with the wrong time and date, but the filenames contain the correct time and date, so I am trying to write a script to fix this with the touch command.
Example of filename:
071212_090537.jpg
I would like this to be converted to the following format:
1712120905.37
Note, the year is listed as 07 in the filename, even if it is 17 so I would like the first 0 to be changed to 1.
How can I do this using awk or sed?
I'm quite new to awk and sed, and programming in general. I have tried to search for a solution and instructions, but haven't managed to figure out how to solve this.
Can anyone help me?
Thanks. :)
Take your example:
awk -F'[_.]' '{$0=$1$2;sub(/^./,"1");sub(/..$/,".&")}1'<<<"071212_090537.jpg"
will output:
1712120905.37
If you want the files to be renamed, you can let awk generate the mv origin new commands and pipe the output to sh, like this (comments inline; listYourFiles stands for whatever lists your file names, e.g. printf '%s\n' *.jpg):
listYourFiles |                  # list your files as input to awk
awk -F'[_.]' '{o=$0; $0=$1$2; sub(/^./,"1"); sub(/..$/,".&")
printf "mv %s %s\n", o, $0}' |   # this prints "mv origin new"
sh                               # this executes the mv commands
It's completely unnecessary to call awk or sed for this; you can do it in your shell, e.g. with bash:
$ f='071212_090537.jpg'
$ [[ $f =~ ^.(.*)_(.*)(..)\.[^.]+$ ]]
$ echo "1${BASH_REMATCH[1]}${BASH_REMATCH[2]}.${BASH_REMATCH[3]}"
1712120905.37
This is probably what you're trying to do:
for old in *.jpg; do
[[ $old =~ ^.(.*)_(.*)(..)\.[^.]+$ ]] || { printf 'Warning, unexpected old file name format "%s"\n' "$old" >&2; continue; }
new="1${BASH_REMATCH[1]}${BASH_REMATCH[2]}.${BASH_REMATCH[3]}"
[[ -f "$new" ]] && { printf 'Warning, new file name "%s" generated from "%s" already exists, skipping.\n' "$new" "$old" >&2; continue; }
mv -- "$old" "$new"
done
You need that test for new already existing, since an old of 071212_090537.jpg or 171212_090537.jpg (or various other values) would create the same new of 1712120905.37.
I think sed really is the easiest solution. You could do this:
▶ for f in *.jpg ; do
new_f=$(sed -E 's/([0-9]{6})_([0-9]{4})([0-9]{2})\.jpg/\1\2.\3.jpg/' <<< "$f")
mv "$f" "$new_f"
done
For more info:
You probably need to read an introductory tutorial on regular expressions.
Note that the -E option to sed allows use of extended regular expressions, allowing a more readable and convenient expression here.
Use of <<< is a Bashism known as a "here-string". If you are using a shell that doesn't support that, A <<< $b can be rewritten as echo $b | A.
Testing:
▶ touch 071212_090538.jpg 071212_090539.jpg
▶ ls -1 *.jpg
071212_090538.jpg
071212_090539.jpg
▶ for f in *.jpg ; do
new_f=$(sed -E 's/([0-9]{6})_([0-9]{4})([0-9]{2})\.jpg/\1\2.\3.jpg/' <<< "$f")
mv "$f" "$new_f"
done
▶ ls -1
0712120905.38.jpg
0712120905.39.jpg

How can I repeat a script?

I've searched for a command or solution to repeat a script every n minutes, but I can't find one.
This is my rusty script:
#!/bin/csh -f
rm -rf result120
rm -rf result127
rm -rf result126
rm -rf result125
rm -rf result128
rm -rf result129
rm -rf result122
rm -rf output
rm -rf aaa
### Get job id from user name
foreach file ( `cat name` )
echo `bjobs -u $file | awk '$1 ~ /^[0-9]+/ {print $1}' >> aaa`
echo "loading"
end
### Read in job id
foreach file ( `cat aaa` )
echo `bjobs -l $file >> result120`
echo "loading"
end
### Get pattern in < >
awk '{\
gsub(/ /,"",$0)}\
BEGIN {\
RS =""\
FS=","\
}\
{\
s=1\
e=150\
if ($1 ~/Job/){\
for(i=s;i<=e;i++){\
printf("%s", $(i))}\
}\
}' result120 > result126
grep -oE '<[^>]+>' result126 > result125
### Get Current Work Location
awk '$1 ~ /<lsf_login..>/ {getline; print $1}' result125 >result122 #result127
### Get another information and paste it with CWD
foreach file1 ( `cat aaa` )
echo `bjobs $file1 >> result128`
echo "getting data"
end
awk '$1 ~ /JOBID/ {getline; printf "%-15s %-15s %-15s %-15s %-20s\n", $1, $2, $3, $4, $5}' result128 >> result129
paste result129 result122 >> output
### Summary
awk '{count1[$2]++}{count2[$4]++}{count3[$3]++}\
END{\
print "\n"\
print "##########################################################################"\
print "There are: ", NR " Jobs"\
for(name in count1){ print name, count1[name]}\
print "\n"\
for(queqe in count2){ print queqe, count2[queqe]}\
print "\n"\
for(stt in count3){ print stt, count3[stt]}\
}' output >> output
And my desire is to run it again every 15 minutes to get a report. Someone told me to use wait, but I've searched man wait and can't find any
useful example. That's why I need your help to solve this problem.
Thanks a lot.
Run the script every 15 minutes:
while true; do ./script.sh; sleep 900; done
or set up a cron job, or use watch.
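For the cron route, a sketch of a crontab entry (run crontab -e and add a line like the following; the path is a placeholder for wherever your script actually lives):
*/15 * * * * /path/to/script.sh
watch -n 900 /path/to/script.sh does the same from an open terminal, at the cost of keeping a foreground process running.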
For C shell you have to write:
while (1)
./script.sh
sleep 900
end
but why use csh since you have bash? Double check the syntax, since I don't remember it much anymore...
Following @karakfa's answer, you basically have two options.
1) The first option, even though it uses a sleep, implements a kind of busy-waiting strategy (https://en.wikipedia.org/wiki/Busy_waiting). This strategy uses more CPU/memory than the second option (the cron approach), because your process footprint stays in memory even while it is actually doing nothing.
2) With the cron approach, on the other hand, your process only exists while it is doing useful work.
Imagine implementing this kind of approach for many programs running on your machine: a lot of memory would be consumed by processes in waiting states, and it would also put extra memory/CPU load on your OS's scheduling algorithm, since it would have more processes in the queue to manage.
Therefore, I would absolutely recommend the cron/scheduling approach.
In any case, the cron daemon will be running in the background whether you add the entry to the crontab or not, so why not add it?
Last but not least, imagine your busy-waiting process is killed for any reason: with the first option you would need to restart it manually, and you might lose a couple of monitoring entries.
Hope it helps.

Prepend a # to the first line not already having a #

I have a file with options for a command I run. Whenever I run the command I want it to run with the options defined in the first line which is not commented out. I do this using this bash script:
while read run opt c; do
[[ $run == \#* ]] && continue
./submit.py $opt $run -c "$c"
break
done < to_submit.txt
The file to_submit.txt has entries like this:
#167993 options/optionfile.py long description
167995 options/other_optionfile.py other long description
...
After the submit script has run with the options from the first non-commented line, I want to comment out that line once the command has completed successfully.
I can find the line number of the options I used by adding this to the while loop:
line=$(grep -n "$run" to_submit.txt | grep "$opt" | grep "$c" | cut -f 1 -d ":")
But I'm not sure how to actually prepend a # to that line now. I could probably use head and tail to save the other lines, process that line separately, and combine it all back into the file. But this sounds too complicated; there must be an easier sed or awk solution to this.
$ awk '!f && sub(/^[^#]/,"#&"){f=1} 1' file
#167993 options/optionfile.py long description
#167995 options/other_optionfile.py other long description
...
To overwrite the contents of the original file:
awk '!f && sub(/^[^#]/,"#&"){f=1} 1' file > tmp && mv tmp file
just like with any other UNIX command.
Using GNU sed is probably simplest here:
sed '0,/^[^#]/ s//#&/' file
Add option -i if you want to update file in place.
0,/^[^#]/ matches all lines up to and including the first one that doesn't start with #.
s//#&/ then prepends # to that line.
Note that s//.../ (i.e., an empty regex) reuses the last matching regex in the range, which is /^[^#]/ in this case.
Note that the command doesn't work with BSD/OSX sed, unfortunately, because starting a range with 0 (so that the range endpoint can match the very first line too) is not supported there. It is possible to make the command work with BSD/OSX sed, but it's more cumbersome; one way is sketched below.
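For completeness, one possible BSD/OSX-compatible workaround (a sketch: instead of a 0-based range, it comments the first non-comment line and then loops n over the rest, so the range is never needed):
sed -e '/^[^#]/{s//#&/' -e ':a' -e 'n;ba' -e '}' file
Lines before the first match are auto-printed unchanged; once the s command has prepended the #, the n;ba loop just passes every remaining line through. BSD sed requires the label and the closing brace to be in separate -e expressions, hence the awkward form.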
If the input/output file is not very large, you can do it all in Bash:
optsfile=to_submit.txt
has_run_cmd=0
outputlines=()
while IFS= read -r inputline || [[ -n $inputline ]] ; do
read run opt c <<<"$inputline"
if (( has_run_cmd )) || [[ $run == \#* ]] ; then
outputlines+=( "$inputline" )
elif ./submit.py "$opt" "$run" -c "$c" ; then
has_run_cmd=1
outputlines+=( "#$inputline" )
else
exit $?
fi
done < "$optsfile"
(( has_run_cmd )) && printf '%s\n' "${outputlines[@]}" > "$optsfile"
The lines of the file are put in the outputlines array, with a hash prepended to the line that was used in the ./submit.py command. If the command runs successfully, the file is overwritten with the lines in outputlines.
After some searching around I found that
awk -v run="$run" -v opt="$opt" '{if($1 == run && $2 == opt) {print "#" $0} else print}' to_submit.txt > temp
mv -b -f temp to_submit.txt
seems to solve this (without needing to find the line number first; it just compares $run and $opt). This assumes that the combination of run and opt is enough to identify a line and that the comment is not needed (which happens to be true in my case). I'm not sure how the comment, which spans multiple fields in awk, could also be taken into account; one possible sketch follows.
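For what it's worth, one way to also compare the comment (a sketch: it recovers the comment by stripping the first two blank-separated fields from the raw line, so it assumes run and opt contain no blanks):
awk -v run="$run" -v opt="$opt" -v c="$c" '
{rest = $0; sub(/^[[:blank:]]*[^[:blank:]]+[[:blank:]]+[^[:blank:]]+[[:blank:]]+/, "", rest)}  # rest = everything after field 2
$1 == run && $2 == opt && rest == c {print "#" $0; next}
{print}
' to_submit.txt > temp && mv -b -f temp to_submit.txt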

grep early stop with one match per pattern

Say I have a file where the patterns reside, e.g. patterns.txt, and I know that each pattern will only be matched once in another file, patterns_copy.txt, which in this case (to keep matters simple) is just a copy of patterns.txt.
If I run
grep -m 1 --file=patterns.txt patterns_copy.txt > output.txt
I get only one line. I guess that's because -m 1 stops the whole matching process after the first match of any pattern, rather than after one match per pattern.
What I would like to achieve is to have each pattern in patterns.txt matched only once, and then let grep move to the next pattern.
How do I achieve this?
Thanks.
Updated Answer
I have now had a chance to integrate what I was thinking about awk into the GNU Parallel concept.
I used /usr/share/dict/words as my patterns file and it has 235,000 lines in it. Using the code from BenjaminW's answer below, it took 141 minutes, whereas this code gets that down to 11 minutes.
The difference here is that there are no temporary files and awk can stop once it has found all 8 of the things it was looking for...
#!/bin/bash
# Create a bash function that GNU Parallel can call to search for 8 things at once
doit() {
# echo Job: $9
# In following awk script, read "p1s" as a flag meaning "p1 has been seen"
awk -v p1="$1" -v p2="$2" -v p3="$3" -v p4="$4" -v p5="$5" -v p6="$6" -v p7="$7" -v p8="$8" '
$0 ~ p1 && !p1s {print; p1s++;}
$0 ~ p2 && !p2s {print; p2s++;}
$0 ~ p3 && !p3s {print; p3s++;}
$0 ~ p4 && !p4s {print; p4s++;}
$0 ~ p5 && !p5s {print; p5s++;}
$0 ~ p6 && !p6s {print; p6s++;}
$0 ~ p7 && !p7s {print; p7s++;}
$0 ~ p8 && !p8s {print; p8s++;}
{if(p1s+p2s+p3s+p4s+p5s+p6s+p7s+p8s==8)exit}
' patterns.txt
}
export -f doit
# Next line effectively uses 8 cores at a time to each search for 8 items
parallel -N8 doit {1} {2} {3} {4} {5} {6} {7} {8} {#} < patterns.txt
Other Thoughts
The above benefits from the fact that the input files are relatively well sorted, so it is worth looking for 8 things at a time because they are likely close to each other in the input file, and I can therefore avoid the overhead associated with creating one process per sought term. However, if your data are not well sorted, that may mean that you waste a lot of time looking further through the file than necessary to find the next 7, or 6 other items. In that case, you may be better off with this:
parallel grep -m1 "{}" patterns.txt < patterns.txt
Original Answer
Having looked at the size of your files, I now think awk is probably not the way to go, but GNU Parallel maybe is. I tried parallelising the problem two ways.
Firstly, I search for 8 items at a time in a single pass through the input file so that I have less to search through with the second set of greps that use the -m 1 parameter.
Secondly, I do as many of these "8-at-a-time" greps in parallel as I have CPU cores.
I use the GNU Parallel job number {#} as a unique temporary filename, and only create 16 (or however many CPU cores you have) temporary files at a time. The temporary files are prefixed ss (for sub-search) so they can all be deleted easily enough when testing.
The speedup seems to be a factor of about 4 times on my machine. I used /usr/share/dict/words as my test files.
#!/bin/bash
# Create a bash function that GNU Parallel can call to search for 8 things at once
doit() {
# echo Job: $9
# Make a temp filename using GNU Parallel's job number which is $9 here
TEMP=ss-${9}.txt
grep -E "$1|$2|$3|$4|$5|$6|$7|$8" patterns.txt > $TEMP
for i in $1 $2 $3 $4 $5 $6 $7 $8; do
grep -m1 "$i" $TEMP
done
rm $TEMP
}
export -f doit
# Next line effectively uses 8 cores at a time to each search for 8 items
parallel -N8 doit {1} {2} {3} {4} {5} {6} {7} {8} {#} < patterns.txt
You can loop over your patterns like this (assuming you're using Bash):
while read -r line; do
grep -m 1 "$line" patterns_copy.txt
done < patterns.txt > output.txt
Or, in one line:
while read -r line; do grep -m 1 "$line" patterns_copy.txt; done < patterns.txt > output.txt
For parallel processing, you can start the processes as background jobs:
while read -r line; do
grep -m 1 "$line" patterns_copy.txt &
read -r line && grep -m 1 "$line" patterns_copy.txt &
# Repeat the previous line as desired
wait # Wait for greps of this loop to finish
done < patterns.txt > output.txt
This is not really elegant, as each pass through the loop waits for the slowest grep of its batch to finish, but it should still be faster than running just one grep at a time.
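If your xargs has the (non-POSIX but widely available) -P option, a sketch of an alternative that keeps a fixed number of greps running without waiting on whole batches (the 8 is an arbitrary example value):
xargs -P 8 -I{} grep -m 1 -e {} patterns_copy.txt < patterns.txt > output.txt
One pattern per input line is substituted for {}; note that concurrent greps writing to the same output file can interleave their result lines in arbitrary order.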