I'm reading huge JSON files with jq, something like:
jq -r '[.a, .b, .time] | @tsv' file.txt
Those files arrive as .gz files, and I spend 20 minutes every day just gunzipping them.
Is it possible to have jq read the files directly in their gz form? And if so, will that be faster overall, or will it slow my process down?
If it takes 20 minutes to decompress, it's going to take 20 minutes to decompress whether the decompression library is driven by gunzip or by jq.
But you can avoid writing the decompressed file to disk, and the time spent doing that, by using gunzip -c and piping straight into jq:
gunzip -c file.gz | jq -r '[ .a, .b, .time ] | @tsv' >file.tsv
To be clear, the above uses minimal memory: none of the three data sets (compressed, decompressed, or TSV) is ever held in memory in its entirety.
The following demonstrates the streaming nature of jq:
$ (
    j='{"a": "a", "b": "a", "time": "20210210T10:10:00"}'
    printf '%s\n' "$j"
    printf '%s\n' "$j"
    sleep 4
    printf '%s\n' "$j"
  ) | jq -r '[ .a, .b, .time, now ] | @tsv'
a a 20210210T10:10:00 1620305187.460741
a a 20210210T10:10:00 1620305187.460791
[4 second pause]
a a 20210210T10:10:00 1620305191.459734
The first two records are emitted without delay, and the third is emitted after 4 seconds. This is reflected by the timestamps.
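To see whether the single pipeline actually helps your total runtime, you could time it against your current two-step workflow. A minimal sketch with illustrative file names (the timing uses bash's time keyword, which covers the whole pipeline):

# one step: decompress and extract in a single pipeline, no uncompressed file on disk
time gunzip -c file.gz | jq -r '[ .a, .b, .time ] | @tsv' > file.tsv

# if several daily files arrive, run the same pipeline per file
for f in *.gz
do
    gunzip -c "$f" | jq -r '[ .a, .b, .time ] | @tsv' > "${f%.gz}.tsv"
done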
I would like to process part of each line of command output, leaving the rest untouched.
Problem
Let's say I have some du output:
❯ du -xhd 0 /usr/lib/gr*
3.2M /usr/lib/GraphicsMagick-1.3.40
584K /usr/lib/grantlee
12K /usr/lib/graphene-1.0
4.2M /usr/lib/graphviz
4.0K /usr/lib/grcrt1.o
224K /usr/lib/groff
Now I want to process each path with another command, for example running pacman -Qo on it, leaving the remainder of the line untouched.
Approach
I know I can use awk '{print $2}' to get only the path, and could probably wrap it in a convoluted for loop to weld things back together, but maybe there is a more elegant way, ideally easy to type on the fly, that produces this in the end:
3.2M /usr/lib/GraphicsMagick-1.3.40/ is owned by graphicsmagick 1.3.40-2
584K /usr/lib/grantlee/ is owned by grantlee 5.3.1-1
12K /usr/lib/graphene-1.0/ is owned by graphene 1.10.8-1
4.2M /usr/lib/graphviz/ is owned by graphviz 7.1.0-1
4.0K /usr/lib/grcrt1.o is owned by glibc 2.36-7
224K /usr/lib/groff/ is owned by groff 1.22.4-7
Workaround
This is the convoluted contraption I am living with for now:
❯ du -xhd 0 /usr/lib/gr* | while read line; do echo "$line $(pacman -Qqo $(echo $line | awk '{print $2}') | paste -s -d',')"; done | column -t
3.2M /usr/lib/GraphicsMagick-1.3.40 graphicsmagick
584K /usr/lib/grantlee grantlee,grantleetheme
12K /usr/lib/graphene-1.0 graphene
4.2M /usr/lib/graphviz graphviz
4.0K /usr/lib/grcrt1.o glibc
224K /usr/lib/groff groff
But multiple parts of it are pacman-specific.
du -xhd 0 /usr/lib/gr* | while read line; do echo "$line" | awk -n '{ORS=" "; print $1}'; pacman --color=always -Qo $(echo $line | awk '{print $2}') | head -1; done | column -t
3.2M /usr/lib/GraphicsMagick-1.3.40/ is owned by graphicsmagick 1.3.40-2
584K /usr/lib/grantlee/ is owned by grantlee 5.3.1-1
12K /usr/lib/graphene-1.0/ is owned by graphene 1.10.8-1
4.2M /usr/lib/graphviz/ is owned by graphviz 7.1.0-1
4.0K /usr/lib/grcrt1.o is owned by glibc 2.36-7
224K /usr/lib/groff/ is owned by groff 1.22.4-7
This is a more generic solution, but what if there are three columns of output and I want to process only the middle one?
It grows in complexity, and I thought there must be a simpler way that avoids the duplication.
Use a bash loop
du -xhd 0 /usr/lib/gr* | (
    IFS=$'\t'                        # GNU du separates the size from the path with a tab
    while read -r -a fields; do
        # replace the path field with pacman's "... is owned by ..." line
        fields[1]=$(pacman -Qo "${fields[1]}")
        printf '%s\n' "${fields[*]}"
    done
)
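For the three-column case raised in the question, only the array index changes. A minimal sketch with a hypothetical tab-separated producer and a placeholder command my_cmd standing in for whatever should process the middle column:

my_cmd() { printf 'processed(%s)' "$1"; }    # placeholder for the real per-column command
printf 'left\tmiddle\tright\n' | (
    IFS=$'\t'
    while read -r -a fields; do
        fields[1]=$(my_cmd "${fields[1]}")   # index 1 = the middle of three columns
        printf '%s\n' "${fields[*]}"
    done
)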
Use a simple shell loop.
du -xhd 0 /usr/lib/gr* |
while read -r size path; do
    pacman --color=always -Qo "$path" |
        awk -v sz="$size" '{ printf "%s %s\n", sz, $0 }'
done
If you want to split out parts of pacman's output, Awk makes that easy to do; judging by the desired output in the question ("... is owned by groff 1.22.4-7"), the package name would land in Awk's $5 and the version in $6 when the whole line is split on whitespace.
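A hedged sketch of that, reusing the loop from above but printing only size, package name, and version as tab-separated columns (the field positions assume the sample output shown in the question):

du -xhd 0 /usr/lib/gr* |
while read -r size path; do
    pacman -Qo "$path" |
        awk -v sz="$size" '{ printf "%s\t%s\t%s\n", sz, $5, $6 }'   # $5 = package, $6 = version
done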
(Sorry, I don't have pacman here; perhaps edit your question to show its raw output if you need more details. Going forward, please take care to ask the actual question you need help with rather than moving the goalposts by editing after you have received replies; among other problems, the answers you already received will look wrong or unintelligible once they no longer match the edited question.)
These days, many tools have options to let you specify exactly which fields you want in the output, and a formatting option to produce them in a machine-readable form. The pacman man page mentions a --machinereadable option, though it does not seem to be of particular use here. Many modern tools will produce JSON, which is unwieldy to handle in plain shell scripts but easy if you have a tool like jq which understands JSON (it's less convenient if the only available output format is XML; some tools will give you the result as CSV, which is mildly clumsy but relatively easy to parse). Also look for an option like --format for specifying exactly how to arrange the output (in curl it's called -w/--write-out).
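As an illustration of the JSON route, with a purely hypothetical command and field names (substitute whatever your tool actually emits), jq lets you pick exactly the fields you want:

some_tool --json list |
    jq -r '.[] | [.name, .version] | @tsv'   # emit just the chosen fields, tab-separated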
I have a small awk script which takes input from a stream and writes to the appropriate file based on the second column value. Here is how it goes:
cat mydir/*.csv | awk -F, '{if(NF==29)print $0 >> "output/"$2".csv"}'
How do I parallelize it so that it can use the multiple cores available on the machine? Right now it runs on a single core.
You can try this approach: run one awk per source file, each in a background subshell. Each process writes to its own set of temporary files (one per output key, suffixed with the source file name) to avoid conflicts on the shared final files and excessive open/close churn on them. At the end of each awk run, the content of every temporary file is appended to the corresponding final file and the temporary file is removed.
If there are lots of files to process, you may need a batch limiter (a sleep, or smarter grouping) so the machine is not overwhelmed by too many concurrent subprocesses; see the sketch after the code below.
rm -f output/*.csv

for File in mydir/*.csv
do
   # shell sub-process, one per source file
   {
      # basename of the source file, used to suffix this process's temporary files
      FileRef="${File##*/}"

      awk -F ',' -v FR="${FileRef}" '
         NF == 29 {
            # write the record to a temporary file that only this process touches
            ListFiles[ OutTemp = "output/" $2 ".csv_" FR ] = "output/" $2 ".csv"
            print > OutTemp
         }
         END {
            # append each temporary file to its final file, then remove it
            for ( TempFile in ListFiles ) {
               Command = sprintf( "cat \042%s\042 >> \042%s\042; rm \042%s\042" \
                                  , TempFile, ListFiles[TempFile], TempFile )
               printf "" | Command
               close( Command )
            }
         }
      ' "$File"
   } &
done

wait

ls -l output/*.csv
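One simple way to implement the batch limiter mentioned above is to throttle on the number of running background jobs. A minimal sketch, with a hypothetical cap of 8 and the per-file body reduced to a placeholder so only the throttling logic is shown:

MaxJobs=8                                        # hypothetical concurrency cap
for File in mydir/*.csv
do
   # block until fewer than MaxJobs background jobs are running
   while [ "$(jobs -rp | wc -l)" -ge "$MaxJobs" ]
   do
      sleep 1
   done
   { sleep 2; echo "would process $File"; } &    # stand-in for the per-file awk subshell above
done
wait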
Untested:
do_one() {
    # Make a workdir used only by this process to ensure no files are appended to in parallel
    mkdir -p "$1"
    cd "$1" || return
    cat ../"$2" | awk -F, '{if(NF==29)print $0 >> $2".csv"}'
}
export -f do_one
# {%} is the job slot number and {} the input file, so each concurrent job gets its own workdir
parallel do_one workdir-{%} {} ::: mydir/*.csv
# merge the per-workdir outputs into output/
ls workdir-*/ | sort -u |
    parallel 'cat workdir*/{} > output/{}'
rm -rf workdir-*
If you want to avoid the extra cat you can use this instead, though I find the cat version easier to read (performance is normally the same on modern systems; see http://oletange.blogspot.com/2013/10/useless-use-of-cat.html):
do_one() {
    # Make a workdir used only by this process to ensure no files are appended to in parallel
    mkdir -p "$1"
    cd "$1" || return
    awk -F, <../"$2" '{if(NF==29)print $0 >> $2".csv"}'
}
export -f do_one
parallel do_one workdir-{%} {} ::: mydir/*.csv
ls workdir-*/ | sort -u |
    parallel 'cat workdir*/{} > output/{}'
rm -rf workdir-*
But as @Thor writes, you are most likely I/O bound.
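A quick way to check that on Linux (a hedged sketch; iostat comes from the sysstat package, and the column names vary a little between versions):

iostat -x 1
# if %util on the disk holding mydir/ sits near 100% while the CPUs stay mostly idle,
# launching more awk processes in parallel will not speed things up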
Naive as I was, I wanted to parse 50 files using awk, so I did the following:
zcat dir_with_50files/* > huge_file
cat huge_file | awk '{parsing}'
Of course, this was terrible because it would spend time creating a file, then consume a whole bunch of memory to pass along to awk.
Then a coworker showed me that I could do this.
zcat dir_with_50files/filename{0..50} | awk '{parsing}'
I was amazed that I would get the same results without the memory consumption.
ps aux also showed that the two commands ran in parallel. I was confused about what was happening and this SO answer partially answered my question.
https://stackoverflow.com/a/1072251/6719378
But if piping knows to initiate the second command after a certain amount of buffered data, why does my naive approach consume so much more memory than the second approach?
Is it because I am using cat on a single file compared to loading multiple files?
You can reduce the maximum memory usage by running zcat file by file. For example:
for f in dir_with_50files/*
do
    zcat "$f" | awk '{parsing}' >> Result.File
done

# or, with find, still one zcat | awk per file:
find dir_with_50files/ -type f |
while read -r f
do
    zcat "$f" | awk '{parsing}' >> Result.File
done
But it depends on your parsing:
it is fine for modifying, deleting, or copying lines when there is no relation to previous records (e.g. sub(/foo/, "bar"));
it is bad for counters (e.g. List[$2]++) or for logic that relates records across files (e.g. NR != FNR {...}; ! List[$2]++ {...}), because each awk invocation starts with fresh state.
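For example, a cross-file de-duplication filter only works when a single awk process sees all the files, because its seen[] array must span every input (a minimal sketch, with the key assumed to be in $2 as in the List[$2]++ example above):

# works: one awk sees the whole stream, so duplicates across files are caught
zcat dir_with_50files/* | awk -F, '!seen[$2]++' > Result.File

# does not work: each awk has its own seen[] array, so duplicates across files survive
for f in dir_with_50files/*
do
    zcat "$f" | awk -F, '!seen[$2]++' >> Result.File
done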
I am interested in efficiently searching files for content using bash and related tools (e.g. sed, grep), in the specific case where I have additional information about where in the file the intended content is. For example, I want to replace a particular string on line 3 of each file that contains a specific string on that line.

Therefore, I don't want to do a recursive grep -r on the whole directory, as that would search the entirety of each file, wasting time since I know the string of interest is on line 3, if it is there at all. That full-grep approach could be done with grep -rl 'string_to_find_in_files' base_directory_to_search_recursively.

Instead I am thinking of using sed -i ".bak" '3s/string_to_replace/string_to_replace_with/' files to search only line 3 of all files recursively in a directory; however, sed seems to only take one file as an input argument. How can I apply sed to multiple files recursively? find -exec {} \; and find -print0 | xargs -0 seem to be very slow. Is there a faster method than using find?

I can achieve the desired effect very quickly with awk, but only on a single directory; it does not seem to be recursive, e.g. awk 'FNR==3{print $0}' directory/*. Any way to make this recursive? Thanks.
You can use find to get the list of files and feed them to sed or awk one at a time with xargs.
For example, this prints the first line of each file found:
$ find . -name "*.csv" | xargs -L 1 sed -n '1p'
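Applied to the in-place replacement on line 3 described in the question, the same pattern might look like this (a sketch using the question's placeholder names; the -i ".bak" form matches BSD/macOS sed as written in the question, while GNU sed wants -i.bak with no space):

find base_directory_to_search_recursively -type f -print0 |
    xargs -0 sed -i ".bak" '3s/string_to_replace/string_to_replace_with/'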
How do I get the first few lines from a gzipped file?
I tried zcat, but it's throwing an error:
zcat CONN.20111109.0057.gz|head
CONN.20111109.0057.gz.Z: A file or directory in the path name does not exist.
zcat(1) can be supplied by either compress(1) or by gzip(1). On your system, it appears to be compress(1) -- it is looking for a file with a .Z extension.
Switch to gzip -cd in place of zcat and your command should work fine:
gzip -cd CONN.20111109.0057.gz | head
Explanation
-c --stdout --to-stdout
    Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.
-d --decompress --uncompress
    Decompress.
On some systems (e.g., Mac), you need to use gzcat.
On a Mac, you need to redirect the file into zcat with <, because the Mac's zcat appends a .Z extension to file name arguments but will happily decompress gzip data from standard input:
zcat < CONN.20111109.0057.gz | head
If a contiguous range of lines is needed, one option might be:
gunzip -c file.gz | sed -n '5,10p;11q' > subFile
where lines 5 through 10 (both inclusive) of file.gz are extracted into a new subFile; the 11q stops reading as soon as the range is done. For the sed options, refer to the manual.
If, say, every 5th line is required:
gunzip -c file.gz | sed -n '1~5p' > subFile
which extracts the 1st line, skips the next four, picks the 6th, and so on (the first~step address is a GNU sed extension).
If you want to use zcat, this shows the first 10 lines:
zcat your_filename.gz | head
Say you want the first 16 lines:
zcat your_filename.gz | head -n 16
This awk snippet lets you show not only the first few lines but any range you specify. It also adds line numbers, which I needed when debugging an error message that pointed to a particular line far down in a gzipped file.
gunzip -c file.gz | awk -v from=10 -v to=20 'NR>=from { print NR,$0; if (NR>=to) exit 1}'
Here is the awk program used in the one-liner above. In awk, NR is a built-in variable (the number of records read so far), which usually equals the line number. The from and to variables are passed in from the command line via the -v options.
NR>=from {
    print NR, $0
    if (NR>=to)
        exit 1
}
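If you use it often, you can keep the program in a file, say range.awk (a name chosen here just for illustration), and call it like this:

gunzip -c file.gz | awk -v from=10 -v to=20 -f range.awk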