How does piping handle multiple files in Linux? - awk

So a naive me wanted to parse 50 files using awk, so I did the following
zcat dir_with_50files/* > huge_file
cat huge_file | awk '{parsing}'
Of course, this was terrible because it would spend time creating a file, then consume a whole bunch of memory to pass along to awk.
Then a coworker showed me that I could do this.
zcat dir_with_50files/filename{0..50} | awk '{parsing}'
I was amazed that I would get the same results without the memory consumption.
ps aux also showed that the two commands ran in parallel. I was confused about what was happening and this SO answer partially answered my question.
https://stackoverflow.com/a/1072251/6719378
But if piping knows to initiate the second command after a certain amount of buffered data, why does my naive approach consume so much more memory compared to the second approach?
Is it because I am using cat on a single file compared to loading multiple files?

You can reduce maximum memory usage by running zcat file by file, for example:
for f in dir_with_50files/*
do
    zcat "$f" | awk '{parsing}' >> Result.File
done
# or
find dir_with_50files/ -type f -exec sh -c 'zcat "$1" | awk "{parsing}" >> Result.File' sh {} \;
But it depends on your parsing:
OK for modifying, deleting, or copying lines when there is no relation to previous lines (e.g. sub(/foo/, "bar"))
Bad for counters (e.g. List[$2]++) or logic that relates lines across files (e.g. NR != FNR {...}; !List[$2]++ {...}), as sketched below.
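For example, a hypothetical count of distinct values in field $2 (the directory and field are placeholders only) gives different results in the two styles:
# per-file: List[] starts empty for every file, so duplicates that span files are counted again
for f in dir_with_50files/*
do
    zcat "$f" | awk '!List[$2]++ { n++ } END { print n+0 }'
done
# single stream: one awk sees all lines, so each value is counted exactly once
zcat dir_with_50files/* | awk '!List[$2]++ { n++ } END { print n+0 }'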

Related

Process part of a line through the shell pipe

I would like to process part of each line of command output, leaving the rest untouched.
Problem
Let's say I have some du output:
❯ du -xhd 0 /usr/lib/gr*
3.2M /usr/lib/GraphicsMagick-1.3.40
584K /usr/lib/grantlee
12K /usr/lib/graphene-1.0
4.2M /usr/lib/graphviz
4.0K /usr/lib/grcrt1.o
224K /usr/lib/groff
Now I want to process each path with another command, for example running pacman -Qo on it, leaving the remainder of the line untouched.
Approach
I know I can use awk '{print $2}' to get only the path, and could probably use it in a convoluted for loop to weld it back together, but maybe there is an elegant way, ideally easy to type on the fly, producing this in the end:
3.2M /usr/lib/GraphicsMagick-1.3.40/ is owned by graphicsmagick 1.3.40-2
584K /usr/lib/grantlee/ is owned by grantlee 5.3.1-1
12K /usr/lib/graphene-1.0/ is owned by graphene 1.10.8-1
4.2M /usr/lib/graphviz/ is owned by graphviz 7.1.0-1
4.0K /usr/lib/grcrt1.o is owned by glibc 2.36-7
224K /usr/lib/groff/ is owned by groff 1.22.4-7
Workaround
This is the convoluted contraption I am living with for now:
❯ du -xhd 0 /usr/lib/gr* | while read line; do echo "$line $(pacman -Qqo $(echo $line | awk '{print $2}') | paste -s -d',')"; done | column -t
3.2M /usr/lib/GraphicsMagick-1.3.40 graphicsmagick
584K /usr/lib/grantlee grantlee,grantleetheme
12K /usr/lib/graphene-1.0 graphene
4.2M /usr/lib/graphviz graphviz
4.0K /usr/lib/grcrt1.o glibc
224K /usr/lib/groff groff
But multiple parts of it are pacman-specific.
du -xhd 0 /usr/lib/gr* | while read line; do echo "$line" | awk -n '{ORS=" "; print $1}'; pacman --color=always -Qo $(echo $line | awk '{print $2}') | head -1; done | column -t
3.2M /usr/lib/GraphicsMagick-1.3.40/ is owned by graphicsmagick 1.3.40-2
584K /usr/lib/grantlee/ is owned by grantlee 5.3.1-1
12K /usr/lib/graphene-1.0/ is owned by graphene 1.10.8-1
4.2M /usr/lib/graphviz/ is owned by graphviz 7.1.0-1
4.0K /usr/lib/grcrt1.o is owned by glibc 2.36-7
224K /usr/lib/groff/ is owned by groff 1.22.4-7
This is a more generic solution, but what if there are three columns of output and I want to process only the middle one?
It grows in complexity, and I thought there must be a simpler way avoiding duplication.
Use a bash loop
(
    IFS=$'\t'    # du separates the size and the path with a tab
    while read -r -a fields; do
        # replace the path (second field) with the pacman -Qo output for it
        fields[1]=$(pacman -Qo "${fields[1]}")
        printf '%s\n' "${fields[*]}"
    done
)
Use a simple shell loop.
du -xhd 0 /usr/lib/gr* |
while read -r size package; do
pacman --color=always -Qo "$package" |
awk -v sz="$size" '{
printf "%s is owned by %s\n", sz, $0 }'
done
If you want to split out parts of the output from pacman, Awk makes that easy to do; for example, the package name is probably in Awk's $1 and the version in $2.
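A minimal sketch of that idea, assuming pacman -Qo prints a line like "/usr/lib/groff/ is owned by groff 1.22.4-7" as in the desired output above, so the name and version end up in the last two fields:
pacman -Qo "$package" | awk '{ printf "package: %s  version: %s\n", $(NF-1), $NF }'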
(Sorry, don't have pacman here; perhaps edit your question to show its output if you need more details. Going forward, please take care to ask the actual question you need help with, so you don't have to move the goalposts by editing after you have received replies - this is problematic for many reasons, not least of which because the answers you already received will seem wrong or unintelligible if they no longer answer the question as it stands after your edit.)
These days, many tools have options to let you specify which fields exactly you want to output, and a formatting option to produce them in machine-readable format. The pacman man page mentions a --machinereadable option, though it does not seem to be of particular use here. Many modern tools will produce JSON, which can be unwieldy to handle in shell scripts, but easy if you have a tool like jq which understands JSON format (less convenient if the only available output format is XML; some tools will let you get the result as CSV, which is mildly clumsy but relatively easy to parse). Maybe also look for an option like --format for specifying how exactly to arrange the output. (In curl it's called -w/--write-out.)
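As an illustration only (the tool, flag, and field names here are made up), JSON output plus jq keeps the shell side trivial:
sometool --json | jq -r '.[].name'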

Run awk on a file being decoded continuously, then print patterns from the decoded file using awk

I have a command which decodes binary logs to ASCII format.
From the ASCII format file, I need to grep some patterns using awk and print them.
How can this be done?
What I have tried is below, in a shell script, and it does not work.
command > file.txt | awk /pattern/ | sed/pattern/
Also, I need the command to continuously decode the file and keep printing patterns as the file is updated.
Thanks in advance
command to continuously decode file and keep printing patterns
The first question is exactly how continuously manifests itself. Most log files grow by being appended to -- for our purpose here, by some unknown external process -- and are periodically rotated. If you're going to continuously decode them, you're going to have to keep track of log rotation.
Can command continuously decode, or do you intend to re-run command periodically, picking up where you left off? If the latter, you might instead try some variation of:
cat log | command | awk
If that can't be done, you'll have to record where each iteration terminates, something like:
touch pos
while [ -f pos ]
do
    command | awk -v status=pos -f script.awk >> output || rm pos
done
where script.awk skips input until NR reaches the number stored in the pos file. It then processes lines until EOF, and overwrites pos with its final NR. On error, it calls exit 1, the pos file is removed, and the loop terminates.
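A minimal sketch of what such a script.awk could look like (illustrative only; the plain print is a placeholder for the real pattern matching):
BEGIN {
    start = 0
    if ((getline start < status) <= 0) start = 0   # empty or missing pos file: start from the top
    close(status)
}
NR <= start { next }        # skip lines already handled in a previous iteration
{ print }                   # placeholder for the real pattern-matching logic
END { print NR > status }   # record how far we got for the next run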
I recommend you ignore sed, and put all the pattern matching logic in one awk script. It will be easier to understand and cheaper to execute.
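For example (the patterns here are placeholders), a grep-plus-sed pipeline like the one in the question usually collapses into a single awk call:
command | awk '/pattern/ { sub(/old/, "new"); print }'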

Retain backslashes with while read loop in multiple shells

I have the following code:
#!/bin/sh
while read line; do
printf "%s\n" $line
done < input.txt
Input.txt has the following lines:
one\two
eight\nine
The output is as follows
onetwo
eightnine
The "standard" solutions to retain the slashes would be to use read -r.
However, I have the following limitations:
must run under #!/bin/sh for reasons of portability/POSIX compliance
not all systems will support the -r switch to read under sh
The input file format cannot be changed
Therefore, I am looking for another way to retain the backslash after reading in the line. I have come up with one working solution, which is to use sed to replace the \ with some other value (e.g. ||) in a temporary file (thus bypassing my last requirement above), then, after reading the lines in, use sed again to transform it back. Like so:
#!/bin/sh
sed -e 's/[\/&]/||/g' input.txt > tempfile.txt
while read line; do
printf "%s\n" $line | sed -e 's/||/\\/g'
done < tempfile.txt
I'm thinking there has to be a more "graceful" way of doing this.
Some ideas:
1) Use command substitution to store this into a variable instead of a file. Problem - I'm not sure command substitution will be portable here either and my attempts at using a variable instead of a file were unsuccessful. Regardless, file or variable the base solution is really the same (two substitutions).
2) Use IFS somehow? I've investigated a little, but not sure that can help in this issue.
3) ???
What are some better ways to handle this given my constraints?
Thanks
Your constraints seem a little strict. Here's a piece of code I jotted down (I'm not too sure how valuable your while loop is for the other stuff you would like to do, so I removed it just for ease). I don't guarantee this code to be robust, but the logic should give you hints in the direction you may wish to proceed. (temp.dat is the input file)
#!/bin/sh
var1="$(cut -d\\ -f1 temp.dat)"
var2="$(cut -d\\ -f2 temp.dat)"
iter=1
set -- $var2
for x in $var1; do
    if [ "$iter" -eq 1 ]; then
        echo $x "\\" $1
    else
        echo $x "\\" $2
    fi
    iter=$((iter+1))
done
As Larry Wall once said, writing a portable shell is easier than writing a portable shell script.
perl -lne 'print $_' input.txt
The simplest possible Perl script is simpler still, but I imagine you'll want to do something with $_ before printing it.
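For instance (a purely hypothetical substitution), any processing can happen before the print and the backslashes still come through untouched:
perl -lne 's/eight/8/; print $_' input.txt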

Parallel processing in awk?

Awk processes the files line by line. Assuming each line operation has no dependency on other lines, is there any way to make awk process multiple lines at a time in parallel?
Is there any other text-processing tool which automatically exploits parallelism and processes the data more quickly?
The only awk implementation that attempted to provide parallelism was parallel-awk, but the project looks dead now.
Otherwise, one way to parallelize awk is to split your input into chunks and process them in parallel. However, splitting the input data is still single-threaded, so it might defeat the performance goal; the main issue is that the standard split command is unable to split at line boundaries without reading each and every line.
If you have GNU split available, or a version that supports the -n l/* option, here is one optimized way to process your file in parallel, assuming here you have 8 vCPUs:
inputfile=input.txt
outputfile=output.txt
script=script.awk
count=8
split -n l/$count "$inputfile" /tmp/_pawk$$
for file in /tmp/_pawk$$*; do
    awk -f "$script" "$file" > "${file}.out" &
done
wait
cat /tmp/_pawk$$*.out > "$outputfile"
rm /tmp/_pawk$$*
You can use GNU Parallel for this purpose
Consider you are counting the sum of numbers in a big file:
cat rands20M.txt | awk '{s+=$1} END {print s}'
With GNU Parallel you can do it in multiple threads: each --pipe chunk is summed by its own awk, and a final awk adds the partial sums together:
cat rands20M.txt | parallel --pipe awk \'{s+=\$1} END {print s}\' | awk '{s+=$1} END {print s}'

Need help in executing SQL via a shell script and using the result set

I currently have a request to build a shell script to get some data from a table using SQL (Oracle). The query which I'm running returns a number of rows. Is there a way to use something like a result set?
Currently, I'm redirecting it to a file, but I'm not able to reuse the data for further processing.
Edit: Thanks for the reply Gene. The result file looks like:
UNIX_PID 37165
----------
PARTNER_ID prad
--------------------------------------------------------------------------------
XML_FILE
--------------------------------------------------------------------------------
/mnt/publish/gbl/backup/pradeep1/27241-20090722/kumarelec2.xml
pradeep1
/mnt/soar_publish/gbl/backup/pradeep1/11089-20090723/dataonly.xml
UNIX_PID 27654
----------
PARTNER_ID swam
--------------------------------------------------------------------------------
XML_FILE
--------------------------------------------------------------------------------
smariswam2
/mnt/publish/gbl/backup/smariswam2/10235-20090929/swam2.xml
There are multiple rows like this. My requirement is to write this program using only shell script.
I need to take each of the PIDs and check if the process is running, which I can take care of.
My question is: how do I loop over each PID and get the corresponding partner_id and xml_file name? Since it is a file, how can I get the exact corresponding values?
Your question is pretty short on specifics (a sample of the file to which you've redirected your query output would be helpful, as well as some idea of what you actually want to do with the data), but as a general approach, once you have your query results in a file, why not use the power of your scripting language of choice (ruby and perl are both good choices) to parse the file and act on each row?
Here is one suggested approach. It wasn't clear from the sample you posted, so I am assuming that this is actually what your sample file looks like:
UNIX_PID 37165 PARTNER_ID prad XML_FILE /mnt/publish/gbl/backup/pradeep1/27241-20090722/kumarelec2.xml pradeep1 /mnt/soar_publish/gbl/backup/pradeep1/11089-20090723/dataonly.xml
UNIX_PID 27654 PARTNER_ID swam XML_FILE smariswam2 /mnt/publish/gbl/backup/smariswam2/10235-20090929/swam2.xml
I am also assuming that:
There is a line-feed at the end of
the last line of your file.
The columns are separated by a single
space.
Here is a suggested bash script (not optimal, I'm sure, but functional):
#! /bin/bash
cat myOutputData.txt |
while read line;
do
    myPID=`echo $line | awk '{print $2}'`
    isRunning=`ps -p $myPID | grep $myPID`
    if [ -n "$isRunning" ]
    then
        echo "PARTNER_ID `echo $line | awk '{print $4}'`"
        echo "XML_FILE `echo $line | awk '{print $6}'`"
    fi
done
The script iterates through every line (row) of the input file. It uses awk to extract column 2 (the PID), and then does a check (using ps -p) to see if the process is running. If it is, it uses awk again to pull out and echo two fields from the file (PARTNER ID and XML FILE). You should be able to adapt the script further to suit your needs. Read up on awk if you want to use different column delimiters or do additional text processing.
Things get a little more tricky if the output file contains one row for each data element (as you indicated). A good approach here is to use a simple state mechanism within the script and "remember" whether or not the most recently seen PID is running. If it is, then any data elements that appear before the next PID should be printed out. Here is a commented script to do just that with a file of the format you provided. Note that you must have a line-feed at the end of the last line of input data or the last line will be dropped.
#! /bin/bash
cat myOutputData.txt |
while read line;
do
    # Extract the first (myKey) and second (myValue) words from the input line
    myKey=`echo $line | awk '{print $1}'`
    myValue=`echo $line | awk '{print $2}'`
    # Take action based on the type of line this is
    case "$myKey" in
        "UNIX_PID")
            # Determine whether the specified PID is running
            isRunning=`ps -p $myValue | grep $myValue`
            ;;
        "PARTNER_ID")
            # Print the specified partner ID if the PID is running
            if [ -n "$isRunning" ]
            then
                echo "PARTNER_ID $myValue"
            fi
            ;;
        *)
            # Check to see if this line represents a file name, and print it
            # if the PID is running
            inputLineLength=${#line}
            if (( $inputLineLength > 0 )) && [ "$line" != "XML_FILE" ] && [ -n "$isRunning" ]
            then
                isHyphens=`expr "$line" : -`
                if [ "$isHyphens" -ne "1" ]
                then
                    echo "XML_FILE $line"
                fi
            fi
            ;;
    esac
done
I think that we are well into custom software development territory now so I will leave it at that. You should have enough here to customize the script to your liking. Good luck!