Extract Regex Pattern For Each Line - Leave Blank Line If No Pattern Exists - awk

I am working with the following input:
"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"
I need to be able to extract both the phone number and the email of each line into separate files. However, the values don't always appear in the same field: they will always be prefaced with "phone": or "email":, but they may be in the first, second, third or even twentieth field.
I have tried cobbling together solutions in sed and awk to remove everything up until "phone" and then everything after the next ,, but this does not work as desired. It also means that, if "phone" and/or "email" do not exist, the line is not changed at all.
I need a solution that will give me an output with the phone value of each line in one file, and the email value in another. However, if no phone or email value exists, a blank line needs to appear in its place in the output.
Any ideas?

This might work for you (GNU sed):
sed -Ene 'h;/.*"phone":([^,]*).*/!z;s//\1/;w phoneFile' -e 'g;/.*"email":([^,]*).*/!z;s//\1/;w emailFile' file
Make a copy of line.
If the line does not contain a phone number, empty the line; otherwise remove everything but the phone number.
Write the result to the phone number file.
Replace the current pattern space by the copy of the original line.
Repeat as above for an email address.
N.B. My first attempt used s/.*// instead of z to empty the line, which worked but should not have. If the line contained no phone/email, that substitution should have reset the default regexp to .*, and the following s//\1/ should then have objected that its regexp did not contain a back reference. However, the second substitution worked in either case.
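Since the question asks for awk, here is a minimal awk sketch of the same idea (output file names phoneFile/emailFile as above; match() locates the quoted value and a blank line is printed when the attribute is missing):
awk '{
  # phone: print the value between the quotes, or a blank line if absent
  if (match($0, /"phone":"[^"]*"/)) print substr($0, RSTART+9, RLENGTH-10) > "phoneFile"
  else print "" > "phoneFile"
  # email: same idea
  if (match($0, /"email":"[^"]*"/)) print substr($0, RSTART+9, RLENGTH-10) > "emailFile"
  else print "" > "emailFile"
}' file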

After fixing your file to be valid json and adding an extra line missing the phone attribute so we can test more of your requirements:
$ cat file
{"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"}
{"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"city":"Marshall","gender":"female","email":"foo.bar#gmail.com"}
you can do whatever you like with the data:
$ jq -r '.email // ""' file
mortina.curabia#gmail.com
foo.bar#gmail.com
$
$ jq -r '.phone // ""' file
549-287-5287

$
As long as it doesn't contain embedded newlines, you can use sed 's/.*/{&}/' file to convert the input in your question to valid json as in my answer:
$ cat file
"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"
"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"city":"Marshall","gender":"female","email":"foo.bar#gmail.com"
$ sed 's/.*/{&}/' file
{"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"}
{"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"city":"Marshall","gender":"female","email":"foo.bar#gmail.com"}
$ sed 's/.*/{&}/' file | jq -r '.email // ""'
mortina.curabia#gmail.com
foo.bar#gmail.com
but I'm betting you started out with valid json and removed the {} by mistake along the way, so you probably just need to not do that.
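Putting it together for the two output files from the question (a minimal sketch; phoneFile and emailFile are just example names, and it assumes one record per line with no embedded newlines):
sed 's/.*/{&}/' file | jq -r '.phone // ""' > phoneFile
sed 's/.*/{&}/' file | jq -r '.email // ""' > emailFile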

Using grep
Try:
grep -o '"phone":"[0-9-]*"' < Input > phone.txt
grep -o '"email":"[^"]*"' <Input > email.txt
Demo:
$echo '"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"' | grep -o '"phone":"[0-9-]*"'
"phone":"549-287-5287"
$echo '"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"' | grep -o '"email":"[^"]*"'
"email":"mortina.curabia#gmail.com"
$
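Note that grep -o prints nothing at all for a line without a match, so the blank-line requirement from the question is not met. A small awk sketch that prints the match (same regex and output format as above) or an empty placeholder line instead:
awk '{ print (match($0, /"phone":"[0-9-]*"/) ? substr($0, RSTART, RLENGTH) : "") }' Input > phone.txt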

Related

Delete everything before first pattern match with sed/awk

Let's say I have a line looking like this:
/Users/random/354765478/Tests/StoreTests/Base64Tests.swift
In this example, I would like to get the result:
Tests/StoreTests/Base64Tests.swift
How can I remove everything before the first pattern match (either Sources or Tests) using sed or awk?
I am using sed 's/^.*\(Tests.*\).*$/\1/' right now but it's failing:
echo '/Users/random/354765478/Tests/StoreTests/Base64Tests.swift' | sed 's/^.*\(Tests\)/\1/'
Tests.swift
Here's another example using Sources (which seems to work):
echo '/Users/random/741672469/Sources/Store/StoreDataSource.swift' | sed 's/^.*\(Sources\)/\1/'
Sources/Store/StoreDataSource.swift
I would like to remove everything before the first, and not the last, Sources or Tests pattern match.
Any help would be appreciated!
How can I remove everything before the first pattern match (either Sources or Tests)?
It's easier to use grep -o here:
grep -Eo '(Sources|Tests)/.*' file
Tests/StoreTests/Base64Tests.swift
Sources/Store/StoreDataSource.swift
# where input file is
cat file
/Users/random/354765478/Tests/StoreTests/Base64Tests.swift
/Users/random/741672469/Sources/Store/StoreDataSource.swift
Breakdown:
Regex pattern (Sources|Tests)/.* matches any text that starts with Sources/ or Tests/ through the end of the line.
-E: enables extended regex mode
-o: prints only matched text instead of full line
Alternatively you may use this awk as well:
awk 'match($0, /(Sources|Tests)\/.*/) {
  print substr($0, RSTART)
}' file
Tests/StoreTests/Base64Tests.swift
Sources/Store/StoreDataSource.swift
Or this sed:
sed -E 's~.*/((Sources|Tests)/.*)~\1~' file
Tests/StoreTests/Base64Tests.swift
Sources/Store/StoreDataSource.swift
With your shown samples, please try the following GNU grep. It looks for the very first match of /Sources or /Tests and then prints from there to the end of the line.
grep -oP '^.*?\/\K(Sources|Tests)\/.*' Input_file
Using sed
$ sed -E 's~([^/]*/)+((Tests|Sources).*)~\2~' input_file
Tests/StoreTests/Base64Tests.swift
would like to remove everything before the first, and not the last, Sources or Tests pattern match.
The first thing is to understand the reason for that. You are using
sed 's/^.*\(Tests.*\).*$/\1/'
Observe that * is greedy, i.e. it will match as much as possible, therefore it will always pick the last Tests. If it were non-greedy it would find the first Tests, but sed does not support non-greedy matching. If you are using Linux there is a good chance that you have the perl command, which does support it. Let file.txt content be
/Users/random/354765478/Tests/StoreTests/Base64Tests.swift
then
perl -p -e 's/^.*?(Tests.*)$/\1/' file.txt
gives output
Tests/StoreTests/Base64Tests.swift
Explanation: -p -e engages a sed-like mode. The alterations made to the regular expression: brackets no longer require escapes, the first .* (greedy) is changed to .*? (non-greedy), and the last .* is deleted as superfluous (observe that the capturing group will always extend to the end of the line).
(tested in perl 5, version 30, subversion 0)

How to delete the "0"-row for multiple files in a folder?

Each file's name starts with "input". One example of the files looks like:
0.0005
lii_bk_new
traj_new.xyz
0
73001
146300
I want to delete the lines which only include '0', and the expected output is:
0.0005
lii_bk_new
traj_new.xyz
73001
146300
I have tried with
sed -i 's/^0\n//g' input_*
and
grep -RiIl '^0\n' input_* | xargs sed -i 's/^0\n//g'
but neither works.
Please give some suggestions.
Could you please try changing your attempted code to the following; run it on a single Input_file first.
sed 's/^0$//' Input_file
OR as per OP's comment to delete null lines:
sed 's/^0$//;/^$/d' Input_file
I have intentionally not put the -i option here; first test this on a single file, and only if the output looks good run it with the -i option on multiple files.
Also, the problem in your attempt was that you were putting \n in the sed regex, which is the default line separator; we need to put $ instead to tell sed to match those lines which start and end with 0.
In case you want to take a backup of the files (considering that you have enough space available in your file system), you could use sed's -i.bak option too, which will take a backup of each file before editing (this isn't necessary, but you have this option to be on the safer side), as sketched below.
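For example, a sketch of the final run across all files with backups (input_* as in the question):
sed -i.bak 's/^0$//;/^$/d' input_*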
$ sed '/^0$/d' file
0.0005
lii_bk_new
traj_new.xyz
73001
146300
In your regexp you were confusing \n (the literal LineFeed character, which will not be present in the string sed is analyzing since sed reads one \n-separated line at a time) with $ (the end-of-string regexp metacharacter, which represents end-of-line when the string being parsed is a line, as it is with sed by default).
The other mistake in your script was replacing 0 with null in the matching line instead of just deleting the matching line.
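To see the line-at-a-time behaviour for yourself, sed's l command prints each line it reads with a $ marking the end of the string, showing there is no \n inside the text sed matches against (a quick illustrative sketch):
$ printf '0\n73001\n' | sed -n 'l'
0$
73001$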
Please give some suggestions.
I would use GNU awk's -i inplace for that, in the following way:
awk -i inplace '!/^0$/' input_*
This will simply preserve all lines which do not match ^0$, i.e. (start of line)0(end of line). If you want to know more about -i inplace, I suggest reading this tutorial.

Replace character except between pattern using grep -o or sed (or others)

In the following file I want to replace all the ; by , with the exception that, when there is a string (delimited by two "), the ; inside it should not be replaced.
Example:
Input
A;B;C;D
5cc0714b9b69581f14f6427f;5cc0714b9b69581f14f6428e;1;"5cc0714b9b69581f14f6427f;16a4fba8d13";xpto;
5cc0723b9b69581f14f64285;5cc0723b9b69581f14f64294;2;"5cc0723b9b69581f14f64285;16a4fbe3855";xpto;
5cc072579b69581f14f6428a;5cc072579b69581f14f64299;3;"5cc072579b69581f14f6428a;16a4fbea632";xpto;
output
A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,
For sed I have: sed 's/;/,/g' input.txt > output.txt but this would replace everything.
The regex for the " delimited string: \".*;.*\" .
(A regex for hexadecimal would be better -- something like: [0-9a-fA-F]+)
My problem is combining it all to make a grep -o / sed that replaces everything except for that pattern.
The file size is in the order of two digit Gb (max 99Gb), so performance is important. Relevant.
Any ideas are appreciated.
sed is for doing simple s/old/new on individual strings. grep is for doing g/re/p. You're not trying to do either of those tasks so you shouldn't be considering either of those tools. That leaves the other standard UNIX tool for manipulating text - awk.
You have a ;-separated CSV that you want to make ,-separated. That's simply:
$ awk -v FPAT='[^;]*|"[^"]+"' -v OFS=',' '{$1=$1}1' file
A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,
The above uses GNU awk for FPAT. See What's the most robust way to efficiently parse CSV using awk? for more details on parsing CSVs with awk.
If I get your requirements correctly, one option would be to make a three-pass thing.
From your comment about hex, I'll assume nothing like # will come in the input, so you can do (using GNU sed):
sed -E 's/("[^"]+);([^"]+")/\1#\2/g' original > transformed
sed -i 's/;/,/g' transformed
sed -i 's/#/;/g' transformed
The idea being to replace the ; when within quotes by something else and write it to a new file, then replace all ; by , and then set back the ; in place within the same file (-i flag of sed).
The three pass can be combined in a single command with
sed -E 's/("[^"]+);([^"]+")/\1#\2/g;s/;/,/g;s/#/;/g' original > transformed
That said, there's probably a bunch of csv parsers which already handle quoted fields that you could use in the final use case, as I bet this is just an intermediary step for something else later in the chain.
From Ed Morton's comment: if you do it in one pass, you can use \n as the replacement separator, as there can't be a newline in the text when it is considered line by line.
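A sketch of that single-pass variant (GNU sed; like the three-pass version it handles at most one ; per quoted field, since s///g does not rescan its own replacements):
sed -E 's/("[^"]+);([^"]+")/\1\n\2/g; s/;/,/g; s/\n/;/g' original > transformed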
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^"]*"[^"]*)*"[^";]*);/\1\n/;ta;y/;/,/;y/\n/;/' file
Replace ;'s inside double quotes with newlines, transpose ;'s to ,'s and then transpose newlines to ;'s.

How to extract the first column from a tsv file?

I have a file containing some data and I want to use only the first column as stdin for my script, but I'm having trouble extracting it.
I tried using this
awk -F"\t" '{print $1}' inputs.tsv
but it only shows the first letter of the first column. I tried some other things but it either shows the entire file or just the first letter of the first column.
My file looks something like this:
Harry_Potter 1
Lord_of_the_rings 10
Shameless 23
....
You can use cut which is available on all Unix and Linux systems:
cut -f1 inputs.tsv
You don't need to specify the -d option because tab is the default delimiter. From man cut:
-d delim
Use delim as the field delimiter character instead of the tab character.
As Benjamin has rightly stated, your awk command is indeed correct. The shell passes the literal \t as the argument and awk does interpret it as a tab, while other commands like cut may not.
Not sure why you are getting just the first character as the output.
You may want to take a look at this post:
Difference between single and double quotes in Bash
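One quick check worth doing: make the separators visible to confirm the file really contains tabs; with GNU coreutils, cat -A shows tabs as ^I and line ends as $ (an illustrative sketch with made-up data):
$ printf 'Harry_Potter\t1\n' | cat -A
Harry_Potter^I1$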
Try this (better to rely on a real csv parser...):
csvcut -c 1 -d $'\t' file
Check csvkit
Output:
Harry_Potter
Lord_of_the_rings
Shameless
Note:
As #RomanPerekhrest said, you should fix your broken sample input (we saw spaces where tabs are expected...)

How to get few lines from a .gz compressed file without uncompressing

How to get the first few lines from a gziped file ?
I tried zcat, but it's throwing an error
zcat CONN.20111109.0057.gz|head
CONN.20111109.0057.gz.Z: A file or directory in the path name does not exist.
zcat(1) can be supplied by either compress(1) or by gzip(1). On your system, it appears to be compress(1) -- it is looking for a file with a .Z extension.
Switch to gzip -cd in place of zcat and your command should work fine:
gzip -cd CONN.20111109.0057.gz | head
Explanation
-c --stdout --to-stdout
Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.
-d --decompress --uncompress
Decompress.
On some systems (e.g., Mac), you need to use gzcat.
On a mac you need to use the < with zcat:
zcat < CONN.20111109.0057.gz|head
If a continuous range of lines is needed, one option might be:
gunzip -c file.gz | sed -n '5,10p;11q' > subFile
where lines 5 through 10 (both inclusive) of file.gz are extracted into a new subFile. For sed options, refer to the manual.
If every, say, 5th line is required:
gunzip -c file.gz | sed -n '1~5p' > subFile
which extracts the 1st line, then every 5th line after it (the 6th, 11th, and so on; first~step addressing is a GNU sed extension).
If you want to use zcat, this will show the first 10 rows
zcat your_filename.gz | head
Let's say you want the first 16 rows
zcat your_filename.gz | head -n 16
This awk snippet will let you show not only the first few lines but any range you specify. It will also add line numbers, which I needed for debugging an error message pointing to a certain line way down in a gzipped file.
gunzip -c file.gz | awk -v from=10 -v to=20 'NR>=from { print NR,$0; if (NR>=to) exit 1}'
Here is the awk snippet used in the one-liner above. In awk, NR is a built-in variable (number of records read so far) which is usually equivalent to a line number. The from and to variables are picked up from the command line via the -v options.
NR>=from {
  print NR, $0;
  if (NR>=to)
    exit 1
}