I am trying to run the following to extract the text from all the PDFs:
find *.pdf | awk '{system("pdftotext "$0)}'
but dang it, some crazy person put spaces in the file names. How can I deal with this smoothly?
What is awk's role in this? Perhaps you should let find execute things itself.
find . -name \*.pdf -exec /path/to/pdftotext {} \;
Or if you're really, really stuck with assuming that filenames will be safe on find's stdout (which you've proven they are not, simply by asking this question), then put the filenames in quotes. This will work as long as no filename contains a double quote:
find . -name \*.pdf -print | awk '{cmd=sprintf("pdftotext \"%s\"", $0);system(cmd);}'
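If your find and xargs support null-delimited output (GNU findutils and the BSDs do), a sketch that sidesteps the quoting problem entirely:
find . -name '*.pdf' -print0 | xargs -0 -n 1 pdftotext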
Related
I have a bunch of output files labelled file1.out, file2.out, file3.out, ..., fileN.out.
All of these files contain multiple instances of the string "keystring". However, only the first instance of "keystring" is meaningful to me; the other lines are not required.
When I do
grep 'keystring' *.out
all the files are searched, and every instance of keystring is printed.
When I do grep -m1 'keystring' *.out I only get the match from file1.out.
I want to extract the line where keystring appears FIRST in each of these output files. How can I pull this off?
You can use awk:
awk '/keystring/ {print FILENAME ":", $0; nextfile}' *.out
nextfile will move to the next file as soon as it has printed the first match from the current file.
Use find -exec like so:
find . -name '*.out' -exec grep -m1 'keystring' {} \;
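Note that when find runs grep on one file at a time like this, grep omits the filename prefix from its output; if you want it back, the -H option (present in GNU and BSD grep) restores it:
find . -name '*.out' -exec grep -m1 -H 'keystring' {} \;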
SEE ALSO:
GNU find manual
I have seen questions that are close to this, but I have not seen the exact answer I need, and I can't seem to get my head wrapped around the regex, awk, sed, grep, or rename invocation that I would need to make it happen.
I have files in one directory, copied from multiple subdirectories of a different directory using find piped to xargs, and duplicates were given sequential backup names.
Command I used:
find <dir1> -name "*.png" | xargs cp -t <dir2>
This resulted in the second directory containing duplicate filenames sequentially named as follows:
<name>.png
<name>.png.~1~
<name>.png.~2~
...
<name>.png.~n~
What I would like to do is take all the files ending in ~*~ and rename them as follows:
<name>.#.png, where '#' is the number between the "~"s at the end of the file name.
Any help would be appreciated.
With Perl's rename (stand alone command):
rename -nv 's/^([^.]+)\.(.+)\.~([0-9]+)~/$1.$3.$2/' *
If everything looks fine remove option -n.
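For example, assuming <name> itself contains no dots (the first capture group stops at the first dot), <name>.png.~2~ would become <name>.2.png.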
There might be an easier way to do this, but here is a small shell script using a glob and awk to achieve what you want:
for i in *.png.~*~; do
  name=$(echo "$i" | awk -F'.png' '{print $1}')  # part before ".png"
  n=$(echo "$i" | awk -F'~' '{print $2}')        # number between the "~"s
  mv "$i" "$name.$n.png"
done
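If you'd rather avoid awk, a sketch using only shell parameter expansion (assuming the files really are named <name>.png.~N~ and <name> contains no ".png"):
for i in *.png.~*~; do
  n=${i%~}; n=${n##*~}              # the number between the "~"s
  mv "$i" "${i%%.png*}.$n.png"
done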
I am interested in efficiently searching files for content using bash and related tools (e.g. sed, grep), in the specific case where I have additional information about where in the file the intended content is. For example, I want to replace a particular string on line #3 of each file that contains a specific string on line 3 of the file.

Therefore, I don't want to do a recursive grep -r on the whole directory, as that would search the entirety of each file, wasting time since I know that the string of interest is on line #3, if it is there at all. This full-grep approach could be done with grep -rl 'string_to_find_in_files' base_directory_to_search_recursively.

Instead I am thinking about using sed -i ".bak" '3s/string_to_replace/string_to_replace_with/' files to search only line #3 of all files recursively in a directory; however, sed seems to only be able to take one file as an input argument. How can I apply sed to multiple files recursively? find -exec {} \; and find -print0 | xargs -0 seem to be very slow. Is there a faster method than using find?

I can achieve the desired effect very quickly with awk, but only on a single directory; it does not seem to be recursive, such as using awk 'FNR==3{print $0}' directory/*. Any way to make this recursive? Thanks.
You can use find to get the list of files and feed them to sed or awk one at a time with xargs.
For example, this will print the first line of each file listed by find:
$ find . -name "*.csv" | xargs -L 1 sed -n '1p'
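The same pattern applies to your line-3 replacement. A sketch, assuming GNU sed (which takes -i.bak with no space) and placeholder strings old and new:
$ find . -type f | xargs -L 1 sed -i.bak '3s/old/new/'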
I have the following code:
#!/bin/sh
while read line; do
printf "%s\n" $line
done < input.txt
Input.txt has the following lines:
one\two
eight\nine
The output is as follows:
onetwo
eightnine
The "standard" solutions to retain the slashes would be to use read -r.
However, I have the following limitations:
must run under #!/bin/sh for reasons of portability/POSIX compliance.
not all systems will support the -r switch to read under sh
The input file format cannot be changed
Therefore, I am looking for another way to retain the backslash after reading in the line. I have come up with one working solution, which is to use sed to replace the \ with some other value (e.g. ||) in a temporary file (thus bypassing my last requirement above), then, after reading the lines in, use sed again to transform it back. Like so:
#!/bin/sh
sed -e 's/\\/||/g' input.txt > tempfile.txt
while read line; do
printf "%s\n" $line | sed -e 's/||/\\/g'
done < tempfile.txt
I'm thinking there has to be a more "graceful" way of doing this.
Some ideas:
1) Use command substitution to store this in a variable instead of a file. Problem: I'm not sure command substitution will be portable here either, and my attempts at using a variable instead of a file were unsuccessful. Regardless, file or variable, the base solution is really the same (two substitutions).
2) Use IFS somehow? I've investigated a little, but not sure that can help in this issue.
3) ???
What are some better ways to handle this given my constraints?
Thanks
Your constraints seem a little strict. Here's a piece of code I jotted down (I'm not sure how valuable your while loop is for the other things you want to do, so I removed it just for ease). I don't guarantee this code to be robust, but the logic should give you hints about the direction you may wish to take. (temp.dat is the input file.)
#!/bin/sh
# Split each line at the backslash with cut, then stitch the two halves back together.
var1="$(cut -d\\ -f1 temp.dat)"
var2="$(cut -d\\ -f2 temp.dat)"
set -- $var2                      # the second halves become $1, $2, ...
for x in $var1; do
    printf '%s\\%s\n' "$x" "$1"   # re-insert the literal backslash
    shift
done
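With the two sample lines from input.txt in temp.dat, this should print one\two and eight\nine with the backslashes intact.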
As Larry Wall once said, writing a portable shell is easier than writing a portable shell script.
perl -lne 'print $_' input.txt
The simplest possible Perl script is simpler still, but I imagine you'll want to do something with $_ before printing it.
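For example, a substitution before printing (the pattern and replacement here are just placeholders):
perl -lne 's/foo/bar/; print' input.txt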
I want to write a shell script to search for and delete all non-text files in a directory.
I basically cd into the directory that I want to iterate through in the script and search through all files.
-- Here is the part I can't do --
I want to check, using an if statement, whether the file is a text file.
If not, I want to delete it;
else continue.
Thanks
PS: By the way, this is on Linux.
EDIT
I assume a file is a "text file" if and only if its name matches the shell pattern *.txt.
The file program always outputs the word "text" when passed the name of a file that it determines contains text. You can test its output for that word using grep. For example:
find -type f -exec file '{}' \; | grep -v '.*:[[:space:]].*text.*' | cut -d ':' -f 1
I strongly recommend printing out the files to delete before deleting them, to the point of redirecting the output to a file and then doing:
rm $(<filename)
after reviewing the contents of "filename". And beware of filenames with spaces; if you have those, things can get more involved.
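Putting those steps together (the list file name "filename" is just a placeholder): first run
find -type f -exec file '{}' \; | grep -v '.*:[[:space:]].*text.*' | cut -d ':' -f 1 > filename
then review filename, and only then run
rm $(<filename)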
Use the opposite of this, unless an if statement is mandatory:
find <dir-path> -type f -name "*.txt" -exec rm {} \;
Working out the opposite is left as an exercise for you. Hint: it goes just before -name.
Your question was ambiguous about how you define a "text file"; I assume here that it's just a file with the extension ".txt".
find . -type f ! -name "*.txt" -exec rm {} +
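As recommended above, it's worth doing a dry run first; replacing -exec rm {} + with -print lists what would be removed:
find . -type f ! -name "*.txt" -print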