How to get a few lines from a .gz compressed file without uncompressing it - gzip

How to get the first few lines from a gzipped file?
I tried zcat, but it's throwing an error:
zcat CONN.20111109.0057.gz|head
CONN.20111109.0057.gz.Z: A file or directory in the path name does not exist.

zcat(1) can be supplied by either compress(1) or by gzip(1). On your system, it appears to be compress(1) -- it is looking for a file with a .Z extension.
Switch to gzip -cd in place of zcat and your command should work fine:
gzip -cd CONN.20111109.0057.gz | head
Explanation
-c --stdout --to-stdout
Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.
-d --decompress --uncompress
Decompress.
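For reference, on a system where zcat is provided by gzip, the following invocations are all equivalent (a sketch using the filename from the question):
gzip -cd CONN.20111109.0057.gz | head
gunzip -c CONN.20111109.0057.gz | head
zcat CONN.20111109.0057.gz | head
gunzip is equivalent to gzip -d, and GNU zcat is identical to gunzip -c.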

On some systems (e.g., Mac), you need to use gzcat.

On a Mac you need to use < with zcat, so that it reads from standard input instead of looking for a .Z file:
zcat < CONN.20111109.0057.gz|head

If a contiguous range of lines is needed, one option might be:
gunzip -c file.gz | sed -n '5,10p;11q' > subFile
where lines 5 through 10 (both inclusive) of file.gz are extracted into a new subFile; the 11q quits sed once the range has been printed, so the rest of the file is never read. For sed options, refer to the manual.
If every, say, 5th line is required:
gunzip -c file.gz | sed -n '1~5p'
which prints the 1st line, jumps over 4 lines, prints the 6th, and so on. Note that the first~step address is a GNU sed extension.
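On systems without GNU sed (e.g., macOS or AIX), a portable awk equivalent of the every-5th-line sampling might be:
gunzip -c file.gz | awk 'NR % 5 == 1' > subFile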

If you want to use zcat, this will show the first 10 lines:
zcat your_filename.gz | head
Let's say you want the first 16 lines:
zcat your_filename.gz | head -n 16
Since head exits as soon as it has printed its lines, zcat receives SIGPIPE and stops early, so only the beginning of the archive is actually decompressed.

This awk snippet will let you show not just the first few lines but any range you specify. It also adds line numbers, which I needed for debugging an error message pointing to a certain line way down in a gzipped file.
gunzip -c file.gz | awk -v from=10 -v to=20 'NR>=from { print NR,$0; if (NR>=to) exit 1}'
Here is the awk snippet used in the one-liner above. In awk, NR is a built-in variable (the number of records read so far), which is usually equivalent to the line number. The from and to variables are picked up from the command line via the -v options.
NR>=from {
    print NR, $0;
    if (NR>=to)
        exit 1
}
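If you do this often, the one-liner can be wrapped in a small shell function (a sketch; gzrange is a made-up name, not a standard tool):
# usage: gzrange FILE FROM TO
gzrange() {
    gunzip -c "$1" | awk -v from="$2" -v to="$3" 'NR>=from { print NR, $0; if (NR>=to) exit 1 }'
}
gzrange file.gz 10 20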

Related

Extract Regex Pattern For Each Line - Leave Blank Line If No Pattern Exists

I am working with the following input:
"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"
I need to be able to extract both the phone number and email of each line into separate files. However, both values don't always appear in the same field - they will always be prefaced with "phone": or "email":, but they may be in the first, second, third or even twentieth field.
I have tried chopping together solutions in sed and awk to remove everything up until "phone" and then everything after the next comma, but this does not work as desired. It also means that, if "phone" and/or "email" do not exist, the line is not changed at all.
I need a solution that will give me an output with the phone value of each line in one file, and the email value in another. HOWEVER, if no phone or email value exists, a blank line in the output needs to be in place.
Any ideas?
This might work for you (GNU sed):
sed -Ene 'h;/.*"phone":([^,]*).*/!z;s//\1/;w phoneFile' -e 'g;/.*"email":([^,]*).*/!z;s//\1/;w emailFile' file
Make a copy of line.
If the line does not contain a phone number empty the line, otherwise remove everything but the phone number.
Write the result to the phone number file.
Replace the current pattern space by the copy of the original line.
Repeat as above for an email address.
N.B. My first attempt used s/.*// instead of z to empty the line, which worked but should not have. If the line contained no phone/email, that substitution should have reset the default regexp, and the second substitution should then have objected that its back reference had no corresponding group. However, the second substitution worked in either case.
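For readability, the same logic can be laid out in a sed script file with comments (still GNU sed, since z is a GNU extension); run it as sed -Enf extract.sed file:
# save a copy of the line in the hold space
h
# if there is no phone number, zap (empty) the pattern space
/.*"phone":([^,]*).*/!z
# otherwise keep only the captured phone value (the empty regexp reuses the last one)
s//\1/
w phoneFile
# restore the original line from the hold space
g
# repeat for the email address
/.*"email":([^,]*).*/!z
s//\1/
w emailFile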
After fixing your file to be valid json and adding an extra line missing the phone attribute so we can test more of your requirements:
$ cat file
{"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"}
{"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"city":"Marshall","gender":"female","email":"foo.bar#gmail.com"}
you can do whatever you like with the data:
$ jq -r '.email // ""' file
mortina.curabia#gmail.com
foo.bar#gmail.com
$
$ jq -r '.phone // ""' file
549-287-5287
$
As long as it doesn't contain embedded newlines, you can use sed 's/.*/{&}/' file to convert the input in your question to valid json as in my answer:
$ cat file
"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"
"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"city":"Marshall","gender":"female","email":"foo.bar#gmail.com"
$ sed 's/.*/{&}/' file
{"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"}
{"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"city":"Marshall","gender":"female","email":"foo.bar#gmail.com"}
$ sed 's/.*/{&}/' file | jq -r '.email // ""'
mortina.curabia#gmail.com
foo.bar#gmail.com
but I'm betting you started out with valid json and removed the {} by mistake along the way so you probably just need to not do that.
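Putting the pieces together, a simple sketch is to run that pipeline once per output file, reusing the jq filters above:
sed 's/.*/{&}/' file | jq -r '.phone // ""' > phoneFile
sed 's/.*/{&}/' file | jq -r '.email // ""' > emailFile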
Using grep
Try:
grep -o '"phone":"[0-9-]*"' < Input > phone.txt
grep -o '"email":"[^"]*"' <Input > email.txt
Demo:
$echo '"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"' | grep -o '"phone":"[0-9-]*"'
"phone":"549-287-5287"
$echo '"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"' | grep -o '"email":"[^"]*"'
"email":"mortina.curabia#gmail.com"
$
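Note that grep -o prints nothing at all for a line with no match, so the requirement of a blank output line when a value is missing is not met, and the matches still carry the "phone":"..." wrapper. A small awk sketch that prints just the value, or an empty line when the key is absent (assuming values never contain embedded double quotes):
awk -F'"' '{ v = ""; for (i = 1; i < NF; i++) if ($i == "phone") v = $(i+2); print v }' Input > phone.txt
awk -F'"' '{ v = ""; for (i = 1; i < NF; i++) if ($i == "email") v = $(i+2); print v }' Input > email.txt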

Yet another attempt to do the "vlookup" thing using AWK

Just a clarification before starting: the server where the script needs to run is an AIX box. The shell is ksh, so I do not have the advanced features Bash provides.
OK, this is what I need to accomplish:
I have two files. Both of them have two columns, separated by a comma.
File "A" has ~170K lines, and it looks like this:
0000A7AED4F0C9FB1ADC14134700CadsevDDD4A000CEDCF.ext,\\server000005\F$\DICE\0035\
0000C3793C4CD6095947E44618D4Eadsev397460011D036.ext,\\server000005\F$\DICE\0020\
0001168DDDA4DF294E37753FE891BadsevB33900011EEA3.ext,\\server000005\F$\DICE\0088\
00014E6A3AFF0911D95A933778895adsev6C81E00088E97.ext,\\server000005\F$\DICE\0009\
0001A65FA90EC0E6640E1589C4B66adsev6FE1F00088EB9.ext,\\server000005\F$\DICE\0009\
0001C5AA0A9AC8E4EDFC69C483160adsev881CC001157ED.ext,\\server000005\F$\DICE\0034\
0003270ED2D2AB11739029711A233adsev55605000CFC63.ext,\\server000005\F$\DICE\0028\
000327C08A0ECD8F23EE6AE42B3C3adsevE35F00011481D.ext,\\server000005\F$\DICE\0061\
0003423C2592EF9D0AD9A7E2B595Cadsev6ABD9000D3501.ext,\\server000005\F$\DICE\0022\
00035862746EFB2098EC965F31328adsev66800000DA8CF.ext,\\server000005\F$\DICE\0021\
File "B" has ~2Million lines, and it looks like this:
0000294A3F3997slredA9D7ADBEE0C0CDE67C100001245C.nlo,\\server000002\F$\RESTORE_DICE\DICE\0083\
00003FFF21F5DAslred8F20FCF0A5CEE9920A4A00016835.nlo,\\server000002\F$\RESTORE_DICE\DICE\0029\
00005B1FFB996Fslred065F708695ADDD987AF9002139AD.nlo,\\server000002\F$\RESTORE_DICE\DICE\0157\
00005CF3C87456slred41FDB077914EB04FFA2B001F9D74.nlo,\\server000002\F$\RESTORE_DICE\DICE\0103\
00006BD33D737FslredD717F08A20F44F2B878500011050.nlo,\\server000002\F$\RESTORE_DICE\DICE\0094\
00008254F4D661slred6C05CFC91D9BCB82EDD800077FFA.nlo,\\server000002\F$\RESTORE_DICE\DICE\0082\
000092391392E3slredB744E98697FA39CEEDCD0004FB66.nlo,\\server000002\F$\RESTORE_DICE\DICE\0032\
0000945EDBB916slredAB08CD7AA8B825E1F55C0000FDC0.nlo,\\server000002\F$\RESTORE_DICE\DICE\0093\
0000C3793C4CD6slredE44618D4E0A2C50397460011D036.nlo,\\server000002\F$\RESTORE_DICE\DICE\0146\
0000D0DA56260DslredF30BCC9CDFF2A4556A7500039400.nlo,\\server000002\F$\RESTORE_DICE\DICE\0054\
In both cases, column 1 is a filename while column 2 is its path.
What I need:
for each line in file A, grab the filename, look it up in file B, and compose a new line, to be echoed/printed into a new file "C", consisting of the following three fields:
Column 1: Path for the file as shown in file B.
Column 2: Path for the file as shown in file A.
Column 3: File name.
I have tried to keep it short, as I have been "accused" of diarrhea of the words in previous questions, but feel free to let me know in case I am missing important details here.
Just so you know, I have working batch and ksh code for this that works fine with smaller files (fewer lines), though neither does the job in this case. (The CMD batch one only processes a couple thousand lines an hour, while the ksh one refuses to even read such files because of memory limitations.)
Thank you guys for being always there!
Martín.
It's like 3 lines and a couple of temporary files if you don't have access to ksh93 (I think AIX uses ksh88, not ksh93?), bash, zsh, or another shell that understands <(command) redirection...
$ sort -t, -k1 filea > sorted_filea
$ sort -t, -k1 fileb > sorted_fileb
$ join -t, -j1 -o '2.2 1.2 0' sorted_filea sorted_fileb > filec
(fewer if one or both of the files are already sorted on filename.)
If you do have one of those shells:
$ join -t, -j1 -o '2.2 1.2 0' <(sort -t, -k1 filea) <(sort -t, -k1 fileb) > filec
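One caveat: sort and join must agree on collation order, or join may complain that the input is not sorted. Pinning the locale with LC_ALL=C, and restricting the sort key to exactly the first field with -k1,1, makes the pipeline more defensive:
$ LC_ALL=C sort -t, -k1,1 filea > sorted_filea
$ LC_ALL=C sort -t, -k1,1 fileb > sorted_fileb
$ LC_ALL=C join -t, -j1 -o '2.2 1.2 0' sorted_filea sorted_fileb > filec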

Batch renaming files with text from file as a variable

I am attempting to rename the files titled {out1.hmm, out2.hmm, ..., outn.hmm} to unique identifiers taken from the third line of each file {PF12574.hmm, PF09847.hmm, PF0024.hmm}. The script works on a single file, however the variable does not get overwritten, and only one file remains after running the command below:
for f in *.hmm;
do output="$(sed -n '3p' < $f |
awk -F ' ' '{print $2}' |
cut -f1 -d '.' | cat)" |
mv $f "${output}".hmm; done;
The first line takes all the outn.hmm files as input. The second line sets a variable to return the desired unique identifier; sed, awk, and cut are used to get it. The variable is supposed to rename the current file to the unique identifier, however the variable remains locked and overwrites the previous file.
out1.hmm out2.hmm out3.hmm becomes PF12574.hmm
How can I overwrite the variable to get the following file structure:
out1.hmm out2.hmm out3.hmm becomes PF12574.hmm PF09847.hmm PF0024.hmm
You're piping the empty output of the assignment statement (to the variable named "output") into the mv command. That variable is not set yet, so what I think will happen is that you will, one after the other, rename all the files that match *.hmm to the single file named ".hmm".
Try ls -a to see if that's what actually happened.
The sed, awk, cut, and (unneeded) cat are a bit much. awk can do all you need. Then do the mv as a separate command:
for f in *.hmm
do
    output=$(awk 'NR == 3 {print $2}' "$f")
    mv "$f" "${output%.*}.hmm"
done
Note that the above does not do any checking to verify that output is assigned to a reasonable value: one that is non-empty, that is a proper "identifier", etc.
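A cautious first pass is to echo the mv commands instead of running them, and to skip files where the identifier comes back empty (the -n guard is an addition to the loop above, not part of it):
for f in *.hmm
do
    output=$(awk 'NR == 3 {print $2}' "$f")
    [ -n "$output" ] && echo mv "$f" "${output%.*}.hmm"
done
Once the printed commands look right, drop the echo.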

How to print the 'nth + x' lines after a match is found?

I have a file which contains the output below. I want only the lines which contain the actual vm_id number.
I want to match the pattern 'vm_id' and print everything from the 2nd line after the match until 'rows' is reached.
FILE BEGIN:
vm_id
--------------------------------------
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
(6 rows)
datacenter=
FILE END:
So the resulting output would be:
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
Also, the number of VM ids will vary; this example has 6, while others could have 3 or 300.
I have tried the following, but they only output a single specified line:
awk 'c&&!--c;/vm_id/{c=2}'
and
awk 'c&&!--c;/vm_id/{c=2+1}'
$ awk '/rows/{f=0} f&&(++c>2); /vm_id/{f=1}' file
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
If you wanted that first line of hex(?) printed too, then just change the number c is compared against from 2 to 1 (or 3 or 127 or however many lines you want to skip after hitting the vm_id line):
$ awk '/rows/{f=0} f&&(++c>1); /vm_id/{f=1}' file
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
bf6c4f90-2e71-4253-a7f6-dbe5d666d3a4
6ffac9a9-1b6b-4600-8114-1ca0666951be
47b5e6d1-6ddd-424a-ab08-18ee35b54ebf
cc0e8b36-eba3-4846-af08-67ab72d911fc
1b8c2766-92b7-477a-bc92-797a8cb74271
c37bf1d8-a6b2-4099-9d98-179b4e573c64
What about this:
awk '/vm_id/{p=1;getline;next}/\([0-9]+ rows/{p=0}p'
I'm setting the p flag on vm_id (the getline;next consumes the line of dashes that follows it) and resetting it on the ([0-9]+ rows) line.
Also, sed comes to mind; the command follows basically the same logic as the awk command above:
sed -n '/vm_id/{n;:a;n;/([0-9]* rows)/!{p;ba}}'
Another thing: if it is safe to assume that the only GUIDs in your input file are the vm ids, grep might be the tool of choice:
grep -Eo '([0-9a-f]+-){4}([0-9a-f]+)'
It's not 100% bulletproof in this form, but it should be good enough for most use cases.
Bulletproof would be:
grep -Eoi '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}'

grep a number from the line and append it to a file

I went through several grep examples, but don't see how to do the following.
Say I have a file with a line:
! some test here and number -123.2345 text
I can get this line using:
grep ! input.txt
but how do I get the number (possibly positive or negative) from this line and append it to the end of another file? Is it possible to apply grep to grep results?
If yes, then I could get the number via something like:
grep -Eo "[0-9]{1,}|\-[0-9]{1,}"
P.S. I am using OS X.
P.P.S. I'm trying to fetch data from several files and put it into a single file for later plotting.
The format with your commands would be:
grep ! input.txt | grep -Eo "[0-9]{1,}|\-[0-9]{1,}" >> output
To grep from grep, we use the pipe operator |, which lets us chain commands together. To append the output to a file, we use the redirection operator >>.
However, there are a couple of problems. Your regexp is better written as grep -Eoe '-?[0-9.]+': this allows for the decimal point and returns a single number instead of two. And if you want lines that start with !, then grep '^!' is better, to avoid matching lines that contain ! but don't start with it. Better to do:
grep '^!' input | grep -Eoe '-?[0-9.]+' >> output
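Since you mention collecting numbers from several files into one output for plotting, a loop sketch (the *.txt glob is an assumption; adjust it to match your input files):
for f in *.txt; do
    grep '^!' "$f" | grep -Eoe '-?[0-9.]+' >> output
done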
perl -lne 'm/.*?([\d\.\-]+).*/g;print $1' your_file >>anotherfile_to_append
$foo="! some test here and number -123.2345 text"
$echo $foo | sed -e 's/[^0-9\.-]//g'
$-123.2345
Edit:
for a file,
[ ]$ cat log
! some test here and number -123.2345 text
some blankline
some line without "the character" and with number 345.566
! again a number 34
[ ]$ sed -e '/^[^!]/d' -e 's/[^0-9.-]//g' log > op
[ ]$ cat op
-123.2345
34
Now let's look at the toothpicks :) In '/^[^!]/d', the ^ anchors the match at the start of the line, [^!] matches any character that is not !, and d deletes, so every line that does not start with ! is deleted. In the second expression, [^0-9.-] matches anything that is not a digit, a dot, or a hyphen, and replacing every such character with nothing (i.e. deleting it) leaves just the number, and done :)
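For comparison, a sketch of the same filter in awk (same caveat as the sed version: any stray digits, dots, or hyphens elsewhere on a ! line survive the substitution):
awk '/^!/ { gsub(/[^0-9.-]/, ""); print }' log > op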