awk doesn't work in hadoop's mapper

This is my hadoop job:
hadoop streaming \
-D mapred.map.tasks=1 \
-D mapred.reduce.tasks=1 \
-mapper "awk '{if(\$0<3)print}'" \ # doesn't work
-reducer "cat" \
-input "/user/***/input/" \
-output "/user/***/out/"
This job always fails, with an error saying:
sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `export TMPDIR='..../work/tmp'; /bin/awk { if ($0 < 3) print } '
But if I change the -mapper into this:
-mapper "awk '{print}'"
it works without any error. What's the problem with the if(..)?
UPDATE:
Thanks @paxdiablo for your detailed answer.
What I really want to do is filter out rows whose first column is greater than x before piping the input data to my custom binary. So the -mapper actually looks like this:
-mapper "awk -v x=$x{if($0<x)print} | ./bin"
Is there any other way to achieve that?

The problem's not with the if per se; it's that the quotes have been stripped from your awk command.
You'll realise this when you look at the error output:
sh: -c: line 0: `export TMPDIR='..../work/tmp'; /bin/awk { if ($0 < 3) print } '
and when you try to execute that quote-stripped command directly:
pax> echo hello | awk {if($0<3)print}
bash: syntax error near unexpected token `('
pax> echo hello | awk {print}
hello
The reason the {print} one works is that it doesn't contain the shell-special ( character.
One thing you might want to try is to escape the special characters to ensure the shell doesn't try to interpret them:
{if\(\$0\<3\)print}
It may take some effort to get the correctly escaped string but you can look at the error output to see what is generated. I've had to escape the () since they're shell sub-shell creation commands, the $ to prevent variable expansion, and the < to prevent input redirection.
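For example, once the escaping is right, the quote-stripped command parses cleanly (an illustrative run):
pax> echo 2 | awk {if\(\$0\<3\)print}
2
pax> echo 5 | awk {if\(\$0\<3\)print}
pax>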
Also keep in mind that there may be other ways to filter depending on your needs, ways that can avoid shell-special characters. If you specify what your needs are, we can possibly help further.
For example, you could create a shell script (e.g., pax.sh) to do the actual awk work for you:
#!/bin/bash
awk -v x="$1" '{ if ($1 < x) print }'
then use that shell script in the mapper without any special shell characters:
hadoop streaming \
-D mapred.map.tasks=1 -D mapred.reduce.tasks=1 \
-mapper "pax.sh 3" -reducer "cat" \
-input "/user/***/input/" -output "/user/***/out/"

Related

Sed / Awk : replace the first occurrence of a pattern with the content of another file

With sed, I'm trying to replace the first occurrence of a comment in a script, like:
#ENTRYPOINT_CONTENT
with the content of a second file ($file_content), in a third file (/base.sh).
So, according to many docs, the command should be quite simple, something like:
sed "s|\#ENTRYPOINT_CONTENT|$file_content|" /base.sh
But I always end up with errors like:
sed: -e expression #1, char 23: unterminated `s' command
or similar messages. I also tried different delimiters, and even awk instead, without success. It seems to be fine after escaping the # in the search pattern, but I still can't get the file content into sed as a variable.
Any ideas, either with sed or awk?
EDIT:
@Sundeep @James Brown:
I don't want to mix the subjects or run long :) The response to your clarification request is at the end of this edit, but just to elaborate the context: my case is a Docker entrypoint script in bash (for a base Docker image) called /root/test/base:
#!/usr/bin/env bash
#ENTRYPOINT_CONTENT
if [[ -e "/root/test/custom" ]]; then
    printf "\n\n#ENTRYPOINT_CONTENT\n" >> /root/test/custom
    # Code from whjm's reply below (actually works but appends shebangs from custom files)
    sed -e '0,/^#ENTRYPOINT_CONTENT/!b; /^#ENTRYPOINT_CONTENT/{ r /root/test/custom' -e 'd; }' /root/test/base.sh >> /root/test/base2.sh
    mv /root/test/base2.sh /root/test/base.sh
    rm -f /root/test/custom
fi
I just want to let users drop another bash script of their own on a specific path (say /root/test/custom), for example:
#!/usr/bin/env bash
echo 'My 2nd bash code'
The first script above (base) should insert the content of the custom file at the #ENTRYPOINT_CONTENT position (in the base script itself, without removing this search string), like this:
#!/usr/bin/env bash
echo 'My 2nd bash code'
#ENTRYPOINT_CONTENT
if [[ -e "/root/test/custom" ]]; then
    printf "\n\n#ENTRYPOINT_CONTENT\n" >> /root/test/custom
    # Code from whjm's reply below
    sed -e '0,/^#ENTRYPOINT_CONTENT/!b; /^#ENTRYPOINT_CONTENT/{ r /root/test/custom' -e 'd; }' /root/test/base.sh >> /root/test/base2.sh
    mv /root/test/base2.sh /root/test/base.sh
    rm -f /root/test/custom
fi
If another user later drops another custom script at the same path, we should have the code of this third custom script appended, like this:
#!/usr/bin/env bash
echo 'My 2nd bash code'
echo 'My 3rd bash code'
#ENTRYPOINT_CONTENT
if [[ -e "/root/test/custom" ]]; then
    # ... and so on
Regarding the shebangs from custom files: it's not a real issue if they are appended to the base file. The sed code from @whjm works as expected but appends them, while (surprisingly) both awk codes from @James Brown already (and gracefully :) ignore all additional shebangs from custom files (probably because they also start with #, like the #ENTRYPOINT_CONTENT search string), but they currently prepend the new code, like:
#!/usr/bin/env bash
echo 'My 3rd bash code'
echo 'My 2nd bash code'
#ENTRYPOINT_CONTENT
while I'm trying to get it appended, like:
#!/usr/bin/env bash
echo 'My 2nd bash code'
echo 'My 3rd bash code'
#ENTRYPOINT_CONTENT
So, in short, @Sundeep, if you could just give me an updated version of your awk code for this, it would be perfect! :D (I couldn't find a way to invert this...) Thanks a lot.
Code from your previous post:
NR==FNR { b=b (FNR==1?"":ORS) $0; next }
{ r=r (FNR==1?"":ORS) $0 } #ENTRYPOINT_CONTENT
END { sub(/\#[^\n]+/,r,b); print b}
If you're using GNU sed:
[STEP 101] # cat file1
11
xx // replace me
44
xx // don't replace me
55
[STEP 102] # cat file2
22
33
[STEP 103] # sed -e '0,/^xx/!b; /^xx/{ r file2' -e 'd }' file1
11
22
33
44
xx // don't replace me
55
[STEP 104] #
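If you're stuck without GNU sed, an awk sketch of the same behaviour (same file names as above; the first matching line is replaced by the contents of file2):
awk 'NR==FNR { insert = insert $0 ORS; next }   # slurp file2
     !done && /^xx/ { printf "%s", insert; done=1; next }
     { print }' file2 file1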

Prepend a # to the first line not already having a #

I have a file with options for a command I run. Whenever I run the command, I want it to use the options defined in the first line that is not commented out. I do this using this bash script:
while read run opt c; do
    [[ $run == \#* ]] && continue
    ./submit.py $opt $run -c "$c"
    break
done < to_submit.txt
The file to_submit.txt has entries like this:
#167993 options/optionfile.py long description
167995 options/other_optionfile.py other long description
...
After the submit script has run successfully with the options from the first non-commented line, I want to comment out that line.
I can find the line number of the options I used by adding this to the while loop:
line=$(grep -n "$run" to_submit.txt | grep "$opt" | grep "$c" | cut -f 1 -d ":")
But I'm not sure how to actually prepend a # to that line now. I could probably use head and tail to save the other lines, process that line separately, and combine it all back into the file, but that sounds too complicated; there must be an easier sed or awk solution to this.
$ awk '!f && sub(/^[^#]/,"#&"){f=1} 1' file
#167993 options/optionfile.py long description
#167995 options/other_optionfile.py other long description
...
To overwrite the contents of the original file:
awk '!f && sub(/^[^#]/,"#&"){f=1} 1' file > tmp && mv tmp file
just like with any other UNIX command.
Using GNU sed is probably simplest here:
sed '0,/^[^#]/ s//#&/' file
Add option -i if you want to update file in place.
0,/^[^#]/ matches all lines up to and including the first one that doesn't start with #
s//#&/ then prepends # to that line.
Note that s//.../ (i.e., an empty regex) reuses the last matching regex in the range, which is /^[^#]/ in this case.
Note that the command doesn't work with BSD/OSX sed, unfortunately, because starting a range with 0 (so that the range endpoint may also match the very first line) is not supported there. It is possible to make the command work with BSD/OSX sed, but it's more cumbersome.
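A quick illustration of the range plus empty-regex behaviour with GNU sed:
$ printf '#a\nb\nc\n' | sed '0,/^[^#]/ s//#&/'
#a
#b
c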
If the input/output file is not very large, you can do it all in Bash:
optsfile=to_submit.txt
has_run_cmd=0
outputlines=()
while IFS= read -r inputline || [[ -n $inputline ]] ; do
    read run opt c <<<"$inputline"
    if (( has_run_cmd )) || [[ $run == \#* ]] ; then
        outputlines+=( "$inputline" )
    elif ./submit.py "$opt" "$run" -c "$c" ; then
        has_run_cmd=1
        outputlines+=( "#$inputline" )
    else
        exit $?
    fi
done < "$optsfile"
(( has_run_cmd )) && printf '%s\n' "${outputlines[@]}" > "$optsfile"
The lines of the file are put in the outputlines array, with a hash prepended to the line that was used in the ./submit.py command. If the command runs successfully, the file is overwritten with the lines in outputlines.
After some searching around I found that
awk -v run="$run" -v opt="$opt" '{if($1 == run && $2 == opt) {print "#" $0} else print}' to_submit.txt > temp
mv -b -f temp to_submit.txt
seems to solve this (without needing to find the line number first, just comparing $run and $opt). This assumes that the combination of run and opt is enough to identify a line and that the comment is not needed (which happens to be true in my case). Not sure how the comment, which spans multiple fields in awk, would also be taken into account.
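If the comment should be matched as well, one hedged sketch is to check that it appears verbatim somewhere in the line with index() (this assumes $c is non-empty; note that awk -v does process backslash escapes in the value):
awk -v run="$run" -v opt="$opt" -v c="$c" '
    !done && $1 == run && $2 == opt && index($0, c) { print "#" $0; done=1; next }
    { print }
' to_submit.txt > temp && mv -b -f temp to_submit.txt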

Find a word in a text file and replace it with the filename

I have a lot of text files in which I would like to find the word 'CASE' and replace it with the related filename.
I tried
find . -type f | while read file
do
    awk '{gsub(/CASE/,print "FILENAME",$0)}' $file >$file.$$
    mv $file.$$ >$file
done
but I got the following error
awk: syntax error at source line 1 context is >>> {gsub(/CASE/,print <<< "CASE",$0)}
awk: illegal statement at source line 1
I also tried
for i in $(ls *);
do
    awk '{gsub(/CASE/,${i},$0)}' ${i} > file.txt;
done
getting an empty output and
awk: syntax error at source line 1 context is >>> {gsub(/CASE/,${ <<<
awk: illegal statement at source line 1
Why awk? sed is what you want:
while read -r file; do
    sed -i "s/CASE/${file##*/}/g" "$file"
done < <( find . -type f )
or
while read -r file; do
    sed -i.bak "s/CASE/${file##*/}/g" "$file"
done < <( find . -type f )
To create a backup of the original.
You didn't post any sample input and expected output so this is a guess but maybe this is what you want:
find . -type f |
while IFS= read -r file
do
    awk '{gsub(/CASE/,FILENAME)} 1' "$file" > "${file}.$$" &&
    mv "${file}.$$" "$file"
done
Every change I made to the shell code is important so if you don't understand why I changed any part of it, ask the question.
btw if after making the changes you are still getting the error message:
awk: syntax error at source line 1
awk: illegal statement at source line 1
then you are using old, broken awk (/usr/bin/awk on Solaris). Never use that awk. On Solaris use /usr/xpg4/bin/awk (or nawk if you must).
Caveats: the above will fail if your file name contains newlines or ampersands (&) or escaped digits (e.g. \1). See Is it possible to escape regex metacharacters reliably with sed for details. If any of that is a problem, post some representative sample input and expected output.
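If the & or escaped-digit caveat is a problem and you can't sanitize the names, a sketch that sidesteps regex replacement entirely by splicing the filename in with index()/substr() (same loop as above, different awk body):
find . -type f |
while IFS= read -r file
do
    awk '{
        out = ""
        # replace each literal CASE without gsub, so "&" or "\1" in the
        # filename is never treated specially
        while ((i = index($0, "CASE")) > 0) {
            out = out substr($0, 1, i - 1) FILENAME
            $0 = substr($0, i + 4)          # 4 == length("CASE")
        }
        print out $0
    }' "$file" > "${file}.$$" &&
    mv "${file}.$$" "$file"
done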
print in that first script is the error.
The second argument to gsub is the replacement string not a command.
You want just FILENAME. (Note: not "FILENAME", which is a literal string, but FILENAME, the variable.)
find . -type f -print0 | while IFS= read -d '' file
do
    awk '{gsub(/CASE/,FILENAME,$0)} 7' "$file" >"$file.$$"
    mv "$file.$$" "$file"
done
Note that I quoted all your variables and fixed your find | read pipeline to work correctly for files with odd characters in the names (see Bash FAQ 001 for more about that). I also fixed the erroneous > in the mv command.
See the answers on this question for how to properly escape the original filename to make it safe to use in the replacement portion of gsub.
Also note that recent (4.1+ I believe) versions of awk have the -i inplace argument.
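With such a version the temp file can go away entirely; a sketch (GNU awk 4.1+ only):
find . -type f -print0 | while IFS= read -d '' file
do
    gawk -i inplace '{gsub(/CASE/,FILENAME,$0)} 7' "$file"
done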
To fix the second script you need to add the quotes you removed from the first script.
for i in *; do awk '{gsub(/CASE/,"'"${i}"'",$0)}' "${i}" > file.txt; done
Note that I got rid of the worse-than-useless use of ls (worse than useless because it actively breaks files with spaces or shell metacharacters in their names; see Parsing ls for more on that).
That command, though, is somewhat ugly and unsafe for filenames with various characters in them, and would be better written as:
for i in *; do awk -v fname="$i" '{gsub(/CASE/,fname,$0)}' "${i}" > file.txt; done
since that will correctly handle filenames with double quotes etc. in their names, whereas the direct variable expansion version will not.
That being said, the corrected first script is the right answer.

Why doesn't this ssh command work in ksh?

I'm tweaking a KSH script, trying to ssh into various hosts and execute a grep command on vfstab that will return a certain line. The problem is, I can't get the line below to work. I'm trying to take the line it returns and append it to a destination file. Is there a better way to do this, e.g., assigning the grep statement to a command variable? The command works fine within the script, but the nested quoting seems to bugger it. Anyway, here's the line:
ssh $user@$host "grep '/var/corefiles' $VFSTAB_LOC | awk '{print $3, $7}' " >> $DEST
This results in:
awk: syntax error near line 1
awk: illegal statement near line one
If there is a better/more correct way to do this please let me know!
You're putting the remote command in double quotes, so the $3 and $7 in the awk body will be substituted by the local shell. awk probably sees '{print ,}'. Escape the dollar signs in the awk body.
ssh $user@$host "grep '/var/corefiles' $VFSTAB_LOC | awk '{print \$3, \$7}' " >> $DEST
                                                                 ^    ^
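If the escaping ever gets unwieldy, another sketch is to feed the command to a remote shell on stdin via a quoted heredoc, so nothing is expanded locally (note that $VFSTAB_LOC then won't expand either, so the path is written out; /etc/vfstab is assumed here):
ssh "$user@$host" 'sh -s' <<'EOF' >> "$DEST"
grep '/var/corefiles' /etc/vfstab | awk '{print $3, $7}'
EOF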
I tried the below and it worked for me (in ksh); not sure why it would error out in your case:
user="username";
host="somehost";
VFSTAB_LOC="result.out";
DEST="/home/username/aaa.out";
echo $DEST;
ssh $user@$host "grep '/abc/dyf' $VFSTAB_LOC | awk '{print $3, $1}'" >> $DEST;

Solaris awk Troubles

I'm writing a shell script and I need to strip FIND ME out of something like this:
* *[**FIND ME**](find me)*
and assign it to an array. I had the code working flawlessly .. until I moved the script in Solaris to a non-global zone. Here is the code I used before:
objectArray[$i]=`echo $line | awk -F '* *[**|**]' '{print $2}'`
Now it prints:
awk: syntax error near line 1
awk: bailing out near line 1
It was suggested that I try the same command with nawk, but I receive this error now instead:
nawk: illegal primary in regular expression `* *[**|**]` at `*[**|**]`
input record number 1
source line number 1
Also, /usr/xpg4/bin/awk does not exist.
I think you need to be clearer on what you want to get. For me, your awk line doesn't 'strip FIND ME out':
echo "* *[**FIND ME**](find me)*" | nawk -F '* *[**|**]' '{print $2}'
[
So it would help if you gave some examples of the input/output you are expecting. Maybe there's a way to do what you want with sed?
EDIT:
From the comments, you actually want to select "FIND ME" from the line, not strip it out.
I guess the dialect of regular expressions accepted by this nawk is different from gawk's. So maybe a tool that's better suited to the job is in order.
echo "* *[**FIND ME**](find me)*" | sed -e "s/.*\* \*\[\*\*\(.[^*]*\)\*\*\].*/\1/"
FIND ME
Quote your $line variable like this: "$line". If it still doesn't work, you can do it another way with nawk, since you only want to find one instance of FIND ME:
$ echo "$line" | nawk '{gsub(/.*\*\[\*\*|\*\*\].*/,"");print}'
FIND ME
or if you are using bash/ksh on Solaris,
$ line="${line#*\[\*\*}"
$ echo "${line%%\*\*\]*}"
FIND ME