how to extract two strings from each line - awk

I have a file of the following content:
product a1 version a2 owner a3
owner b1 version b2 product b3 size b4
....
I am interested in extracting product and version from each line using a shell script, and writing them in 2 columns, product first and version second. So the output should be:
a1 a2
b3 b2
...
I used "while read line", but it is extremely slow. I tried to use awk, but couldn't figure out how to do it. Any help is appreciated.

The following will do what you want:
$ nl dat
1 product a1 version a2 owner a3
2 owner b1 version b2 product b3 size b4
$ awk 'NF { delete row;
            for( i=1; i <= NF; i += 2 ) {
                row[$i] = $(i+1)
            }
            print( row["product"], row["version"])
          }' dat
a1 a2
b3 b2
This builds an associative array from the name-value pairs in your data file by position, and then retrieves the values by name. The NF in the pattern ensures blank lines are ignored. If product or version are otherwise missing, they'll likewise be missing in the output.
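If you would rather print an explicit placeholder than an empty column when a key is absent, here is a minimal variation on the same idea (the NA marker is just an arbitrary choice):
$ awk 'NF { delete row;
            for( i=1; i <= NF; i += 2 ) {
                row[$i] = $(i+1)
            }
            p = ("product" in row) ? row["product"] : "NA"
            v = ("version" in row) ? row["version"] : "NA"
            print p, v
          }' dat
a1 a2
b3 b2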

A different perl approach:
perl -lane 'my %h = @F; print "$h{product} $h{version}"' input.txt
Uses auto-split mode to put each word of each line in an array, turns that into a hash/associative array, and prints out the keys you're interested in.

Here is a perl to do that:
perl -lnE '
$x=$1 if /(?=product\h+(\H+))/;
$y=$1 if /(?=version\h+(\H+))/;
say "$x $y" if $x && $y;
$x=$y="";' file
Or, same method with GNU awk:
gawk '/product/ && /version/{
    match($0,/product[ \t]+([^ \t]+)/,f1)
    match($0,/version[ \t]+([^ \t]+)/,f2)
    print f1[1],f2[1]
}' file
With the example, either prints:
a1 a2
b3 b2
The advantage here is that a line is printed only when both targets are found on it.

With awk:
$ awk '{for(i=1;i<NF;i++){
          if($i=="version")v=$(i+1)
          if($i=="product")p=$(i+1)}}
        {print p,v}' data.txt
a1 a2
b3 b2
If you have lines without a version or product number, and you want to skip them:
awk '{ok=0}
     {for(i=1;i<NF;i++){
        if($i=="version"){v=$(i+1);ok++}
        if($i=="product"){p=$(i+1);ok++}}}
     ok==2{print p,v}' data.txt
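For instance, a line that has neither key is silently dropped (a quick illustration with inline data):
$ printf 'product a1 version a2 owner a3\nowner c1 size c2\n' |
  awk '{ok=0}
       {for(i=1;i<NF;i++){
          if($i=="version"){v=$(i+1);ok++}
          if($i=="product"){p=$(i+1);ok++}}}
       ok==2{print p,v}'
a1 a2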

Thank you guys for the quick and excellent replies. I ended up using the awk version as it is most convenient for me to insert into an existing shell script. But I learned a lot from other scripts too.

Related

How can I adjust a text file in VIM to have two columns instead of three while not splitting paired data?

Hello there and thank you for reading this!
I have a very large text file that looks something like this:
a1 b1 a2
b2 a3 b3
a4 b4 a5
b5 a6 b6
I would like my text file to look like this:
a1 b1
a2 b2
a3 b3
a4 b4
a5 b5
a6 b6
In reality, these values are paired lon/lat coordinates. If it is useful the values look like:
1.591336e+02 4.978998e+01 1.591162e+02
4.977995e+01 1.590988e+02 4.976991e+01
1.590815e+02 4.975988e+01 1.590641e+02
4.974984e+01 1.590468e+02 4.973980e+01
I have been learning in vim, but I do have access to other tools if this is easier done elsewhere. Should I be looking for a sed or awk command that will assess the number of spaces in a given row? I appreciate any and all advice, and if I can offer any more information, I would be glad to!
I have searched for other folks who have had this question, but I don't know how to apply some of the solutions I've seen for similar problems to this - and I'm afraid of messing up this very large file. I am expecting the answer to be something using sed or awk, but I don't know how to be successful with these commands with what I've found so far. I'm rather new to coding and this site, so if I missed this question already being asked, I apologize!
All the best!
EDIT: I used
sed 's/\s/\n/2;P;D' file.txt > newfile.txt
to turn my file into:
a1 b1
a2^M
b2 a3
b3^M
a4 b4
a5^M
b5 a6
b6^M
I then used:
dos2unix newfile.txt
to get rid of the ^M within the data. I haven't reached the final structure yet, but I am one step closer.
$ tr ' ' '\n' <input_file|paste -d" " - -
a1 b1
a2 b2
a3 b3
$ sed 's/ /\n/2; P; D' <(tr '\n' ' ' <input_file)
a1 b1
a2 b2
a3 b3
$ tr '\n' ' ' <input_file|xargs -d" " printf '%s %s\n'
a1 b1
a2 b2
a3 b3
An approach using awk
% awk '{for(i=1;i<=NF;i++){x++; printf x%2==0 ? $i"\n" : $i" "}}' file
a1 b1
a2 b2
a3 b3
a4 b4
a5 b5
a6 b6
Data
% cat file
a1 b1 a2
b2 a3 b3
a4 b4 a5
b5 a6 b6
Using GNU sed
$ sed -Ez 's/([^ \n]*)[ \n]([^ \n]* ?)\n?/\1 \2\n/g' input_file
a1 b1
a2 b2
a3 b3
a4 b4
a5 b5
a6 b6
Using cut:
$ cut -d' ' -f1,2 test
a1 b1
b2 a3
a4 b4
b5 a6
In vim you can
:%j
to join all the lines, then
:s/\([^ ]\+ [^ ]\+\) /\1\r/g
to turn every 2nd space into a newline.
With perl
perl -0777 -lpe 's/(\S+)\s+(\S+)\s+/$1 $2\n/g' file
That reads the whole file into memory, so it depends on what "very large" is. Is it smaller than the amount of memory you have?
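If it is not, here is a rough streaming sketch of the same idea that only carries leftover words from one line to the next, so memory use stays bounded by one line (not heavily tested):
perl -lane 'push @w, @F;
            while (@w >= 2) { my ($x, $y) = splice @w, 0, 2; print "$x $y" }
            END { print "@w" if @w }' file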
FWIW, here is a bit of a meta-answer.
Vim lets you filter all or some of the lines in your buffer via an external program with :help :!. It is very handy because, while Vim can do a lot, there are plenty of use cases for which external tools would do a better job.
So… if you already have the file opened in Vim, you should be able to apply the provided answers with little effort:
:%!tr ' ' '\n' | paste -d" " - -
:%!tr '\n' ' ' | sed 's/ /\n/2; P; D'
:%!perl -0777 -lpe 's/(\S+)\s+(\S+)\s+/$1 $2\n/g'
etc.
Of note:
The command after the [range]! takes the lines covered by [range] as standard input which makes constructs like <filename unnecessary in this context.
Vim expands % to the current filename and # to the alternate filename if they exist, so they usually need to be escaped, as in:
:%!tr '\n' ' ' | xargs printf '\%s \%s\n'
There's a lot to learn from this thread. Good luck.
This might work for you (GNU sed):
sed '$!N;y/\n/ /;s/ /\n/2;P;D' file
Append the following line if not the last.
Translate all newlines to spaces.
Replace the second space by a newline.
Print the first line, delete the first line and repeat.
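A quick check with inline data:
$ printf 'a1 b1 a2\nb2 a3 b3\n' | sed '$!N;y/\n/ /;s/ /\n/2;P;D'
a1 b1
a2 b2
a3 b3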

Remove data till some character in the field

I have file with 6 columns seperated by space with data
cell in out f ty le
A A1 Z A1 com 2
B A1,B Z AB com 0,2
I want to remove the 0, from the 6th column, getting output as
cell in out f ty le
A A1 Z A1 com 2
B A1,B Z AB com 2
I tried code
awk '{sub(/\0,.*$/,"",$6);print $1,$2,$3,$4,$5,$6}' file
But this did not work.
A simple solution, as per your shown samples, would be:
awk '{sub(/^0,/,"",$6)} 1' Input_file
In case the 0 could also appear at the start or end of the 6th field, or somewhere in between, then you could try the following, written and tested with the shown samples in GNU awk:
awk '{gsub(/^,+0,+|,+0,+$/,"",$6);gsub(/,0,/,",",$6)} 1' Input_file
Problem in OP's attempt: OP is using .* after 0, which is greedy and matches everything to the end of the 6th field, hence it substitutes all of that with the empty string.
Fixes to OP's attempt: I have added logic to substitute a starting 0, and an ending ,0 in the 6th field with the empty string, and then, to handle an in-between zero, to substitute ,0, with a single , to keep the solution generic.
NOTE: In case your Input_file is TAB separated then add BEGIN{FS=OFS="\t"}.
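For example, the first command would then become:
awk 'BEGIN{FS=OFS="\t"} {sub(/^0,/,"",$6)} 1' Input_file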
Since this is a simple substitution on an individual string, I'd just use sed:
$ sed 's/0,\([^[:space:]]*\)$/\1/' file
cell in out f ty le
A A1 Z A1 com 2
B A1,B Z AB com 2
otherwise with an awk that has gensub() (e.g. GNU awk):
$ awk '{print gensub(/0,([^[:space:]]*)$/,"\\1",1)}' file
cell in out f ty le
A A1 Z A1 com 2
B A1,B Z AB com 2
or with any awk:
$ awk 'match($0,/0,[^[:space:]]*$/){$0=substr($0,1,RSTART-1) substr($0,RSTART+2)} 1' file
cell in out f ty le
A A1 Z A1 com 2
B A1,B Z AB com 2
Note that all of the above work with, and simply retain, whatever spacing you already have in your input.

Replace in every nth line starting from a certain line

I want to replace on every third line starting from second line using sed.
Input file
A1
A2
A3
A4
A5
A6
A7
.
.
.
Expected output
A1
A2
A3
A4_edit
A5
A6
A7_edit
.
.
.
I know there are many solutions related to this available on Stack Overflow, but I was unable to find one for this specific problem.
My try:
sed '1n;s/$/_edit/;n'
This only replaces every second line, starting from the beginning.
Something like this?
$ seq 10 | sed '1b ; n ; n ; s/$/_edit/'
1
2
3
4_edit
5
6
7_edit
8
9
10_edit
This breaks down to a cycle of
1b if this is the first line in the input, start the next cycle, using sed default behaviour to print the line and read the next one - which skips the first line in the input
n print the current line and read the next line - which skips the first line in a group of three
n print the current line and read the next line - which skips the second line in a group of three
s/$/_edit/ substitute the end of line for _edit on the third line of each group of three
then use the default sed behaviour to print, read next line and start the cycle again
If you want to skip more than one line at the start, change 1b to, say, 1,5b.
As Wiktor Stribiżew has pointed out in the comments, as an alternative, there is a GNU range extension first~step which allows us to write
sed '4~3s/$/_edit/'
which means substitute on every third line starting from line 4.
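For example:
$ seq 7 | sed '4~3s/$/_edit/'
1
2
3
4_edit
5
6
7_edit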
In case you are OK with awk, try the following.
awk -v count="-1" '++count==3{$0=$0"_edit";count=0} 1' Input_file
Append > temp_file && mv temp_file Input_file in case you want to save output into Input_file itself.
Explanation:
awk -v count="-1" ' ##Starting awk code here and mentioning variable count whose value is -1 here.
++count==3{ ##Checking condition if increment value of count is equal to 3 then do following.
$0=$0"_edit" ##Appending _edit to current line value.
count=0 ##Making value of count as ZERO now.
} ##Closing block of condition ++count==3 here.
1 ##Mentioning 1 will print edited/non-edited lines.
' Input_file ##Mentioning Input_file name here.
Another awk
awk 'NR>3&&NR%3==1{$0=$0"_edit"}1' file
A1
A2
A3
A4_edit
A5
A6
A7_edit
A8
A9
A10_edit
A11
A12
A13_edit
NR>3 Test if the line number is greater than 3
NR%3==1 and it is every third line starting from line 4
{$0=$0"_edit"} edit the line
1 print everything
You can use seds ~ step operator.
sed '4~3s|$|_edit|'
~ is a feature of GNU sed, so it will be available in most (all?) distros of Linux. But to use it on macOS (which comes with BSD sed), you would have to install GNU sed to get this feature: brew install gnu-sed.

Replace '\n' with white space only when the occurrence of '\t' is more than a number

I have tens of thousands of tab-delimited data files, each like:
a0\ta1\ta2\ta3\ta4\ta5\ta6\ta7\ta8\ta9\n
b0\tb1\tb2\tb3\tb4\tb5\tb6\tb7\tb8\tb9\n
...
However, occasionally there are files containing (randomly) malformed lines like:
a0\ta1\ta2\ta3_0\n
a3_1\ta4\ta5\ta6\ta7\ta8\ta9\n
b0\tb1\tb2_0\n
b2_1\tb3\tb4\tb5\tb6\tb7\tb8\tb9\n
...
where a3_0, a3_1 (b2_0, b2_1 resp.) are parts of a3 (b2 resp.) originally separated by a white space. I want to replace each \n at the end of a line with a white space only when that line is too short, i.e., has too few \t characters. Currently 5 seems to be a safe threshold.
I often use sed to do some modifications, which are much simpler than the above. I am wondering if sed or some other commands (like awk? which I still need to learn) can be used for fast processing (since I have many files). Thanks.
With GNU awk for multi-char RS and RT (and later -i infile and ENDFILE) and using commas instead of tabs for visibility:
$ cat file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0
a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0
b2_1,b3,b4,b5,b6,b7,b8,b9
$ awk -v RS='([^,]*,){9}[^\n]*\n' '{$0=RT; sub(/\n$/,""); gsub(/\n/," ")} 1' file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0 a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0 b2_1,b3,b4,b5,b6,b7,b8,b9
The above [ab-]uses RS to describe each record (instead of the record separator) as a series of 10 comma-separated fields ending with a newline and then replaces the newlines as appropriate within each record before printing.
Just change RS='([^,]*,){9}[^\n]*\n' to RS='([^\t]*\t){9}[^\n]*\n' for it to work with tab-separated instead of comma-separated fields.
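That is, for tab-separated files:
awk -v RS='([^\t]*\t){9}[^\n]*\n' '{$0=RT; sub(/\n$/,""); gsub(/\n/," ")} 1' file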
To make the changes to all files add -i inplace:
awk -i inplace -v RS='...' '...' *
or:
find ... -exec awk -i inplace -v RS='...' '...' {} +
You actually don't even have to hard-code the RS; the tool can figure it out, assuming there's at least 1 complete line in each input file:
$ awk -F',' '
    BEGIN   { ARGV[ARGC] = ARGV[ARGC-1]; ARGC++ }
    NR==FNR { n=(NF>n?NF:n); next }
    ENDFILE { RS="([^"FS"]*"FS"){"n-1"}[^\n]*\n" }
            { $0=RT; sub(/\n$/,""); gsub(/\n/," "); print }
' file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0 a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0 b2_1,b3,b4,b5,b6,b7,b8,b9
Just change -F',' to -F'\t' for tab-separated.
FYI with POSIX awks, the closest equivalents of the above two gawk scripts would be:
$ awk '
    { rec=rec $0 RS }
    END {
        while ( match(rec,/([^,]*,){9}[^\n]*\n/) ) {
            tgt = substr(rec,RSTART,RLENGTH)
            sub(/\n$/,"",tgt)
            gsub(/\n/," ",tgt)
            print tgt
            rec = substr(rec,RSTART+RLENGTH)
        }
    }
' file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0 a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0 b2_1,b3,b4,b5,b6,b7,b8,b9
and:
awk -F',' '
    { rec=rec $0 RS; n=(NF>n?NF:n) }
    END {
        while ( match(rec,"([^"FS"]*"FS"){"n-1"}[^\n]*\n") ) {
            tgt = substr(rec,RSTART,RLENGTH)
            sub(/\n$/,"",tgt)
            gsub(/\n/," ",tgt)
            print tgt
            rec = substr(rec,RSTART+RLENGTH)
        }
    }
' file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0 a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0 b2_1,b3,b4,b5,b6,b7,b8,b9
Just be aware that those read the whole file into a single string before the main processing begins, so they'd fail if your file was too huge to fit in memory, but you already told us each file is "very small", so that shouldn't be an issue.
To overwrite the input file the simplest approach is always:
awk '{...}' file > tmp && mv tmp file
but in this case you can alternatively do:
awk '{...} END{... print tgt > ARGV[1] ...}' file
That works in this case because awk has already completed reading the input file before starting the END section. Do not attempt it elsewhere in the script.
Assuming you name the following script repiece:
#!/usr/bin/env bash
IFS=$'\t'        # use tab separators throughout this script
rIFS=,           # except to avoid field coalescing, use commas
pieces_needed=5  # adjust this to taste
for arg; do
  tempfile="${arg}.tmp-$$"  # vulnerable to symlink attacks; use mktemp instead if untrusted
                            # users have write access to current directory.
  deferred=( )
  {
    while IFS="$rIFS" read -r -a pieces; do
      if (( ( ${#deferred[@]} + ${#pieces[@]} ) < pieces_needed )); then
        deferred+=( "${pieces[@]}" )
      elif (( ${#deferred[@]} )); then
        # separate last piece of deferred and first of pieces with a space
        all_pieces=( "${deferred[@]} ${pieces[@]}" )
        printf '%s\n' "${all_pieces[*]}"
        deferred=( )
      else
        printf '%s\n' "${pieces[*]}"
      fi
    done
    # if we have anything deferred for the last line, print it now
    (( ${#deferred[@]} )) && printf '%s\n' "${deferred[*]}"
  } < <(tr -- "$IFS" "$rIFS" <"$arg") >"$tempfile"
  mv -- "$tempfile" "$arg"
done
...you can process all your files with the smallest possible number of invocations as follows:
# if your files end in .tsv
find . -type f -name '*.tsv' -exec ./repiece {} +
In awk, changing the ORS between a space and \n:
$ awk '
BEGIN {
    FS=OFS="\t"   # set field separators
    RS=ORS="\n"   # set record separators
}
NF<=5 {           # if below or at threshold
    ORS=" "       # redefine output record separator
}
{
    print         # print record with ORS
    ORS="\n"      # reset ORS back to newline
}' file
a0 a1 a2 a3 a4 a5 a6 a7 a8 a9
b0 b1 b2 b3 b4 b5 b6 b7 b8 b9
a0 a1 a2 a3_0 a3_1 a4 a5 a6 a7 a8 a9
b0 b1 b2_0 b2_1 b3 b4 b5 b6 b7 b8 b9
Processing multiple files with shell scripting:
$ for f in file1 file2 ; do awk ... $f > new-$f ; done
Quote $f if needed.
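That is, if the filenames may contain spaces:
$ for f in file1 file2 ; do awk ... "$f" > "new-$f" ; done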
This might work for you (GNU sed):
sed ':a;s/\t/&/9;t;N;s/\n/ /;ta' file
If there are fewer than 9 tabs in the current line, append the next line and replace the newline by a space. Repeat until there are 9 or more tabs.
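A quick check on a malformed pair of lines (cat -A makes the tabs visible as ^I):
$ printf 'a0\ta1\ta2\ta3_0\na3_1\ta4\ta5\ta6\ta7\ta8\ta9\n' |
  sed ':a;s/\t/&/9;t;N;s/\n/ /;ta' | cat -A
a0^Ia1^Ia2^Ia3_0 a3_1^Ia4^Ia5^Ia6^Ia7^Ia8^Ia9$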

Convert a tree to a list of paths using awk [duplicate]

I have input files with the structure like the next:
a1
  b1
    c1
    c2
    c3
  b2
    c1
      d1
      d2
  b3
  b4
a2
a3
  b1
  b2
    c1
    c2
Each level is indented by 2 spaces. The needed output is:
a1/b1/c1
a1/b1/c2
a1/b1/c3
a1/b2/c1/d1
a1/b2/c1/d2
a1/b3
a1/b4
a2
a3/b1
a3/b2/c1
a3/b2/c2
It is like a filesystem: if the next line has bigger indentation, the current one is like a "directory", and when it has the same indentation it is like a "file". I need to print the full paths of the "files".
I am trying to solve this without any high-level language like python or perl - with only basic bash commands.
My current code/idea is based on a recursive function call and working with a stack, but I have a problem with the "logic". The code currently outputs the following:
a1 b1 c1
a1 b1
a1
DD: line 8: [0-1]: bad array subscript
Only the 1st line is OK - so my handling of the recursion is wrong...
input="ifile.tree"
#stack array
declare -a stack
#stack manipulation
pushstack() { stack+=("$1"); }
popstack() { unset stack[${#stack[#]}-1]; }
printstack() { echo "${stack[*]}"; }
#recursive function
checkline() {
local uplev=$1
#read line - if no more lines - print the stack and return
read -r level text || (printstack; exit 1) || return
#if the current line level is largest than previous level
if [[ $uplev < $level ]]
then
pushstack "$text"
checkline $level #recurse
fi
printstack
popstack
}
# MAIN PROGRAM
# change the input from indented spaces to
# level_number<space>text
(
#subshell - change IFS
IFS=,
while read -r spaces content
do
echo $(( (${#spaces} / 2) + 1 )) "$content"
done < <(sed 's/[^ ]/,&/' < "$input")
) | ( #pipe to another subshell
checkline 0 #recurse by levels
)
Sry for the long code - can anybody help?
Interesting question.
This awk command (it could be a one-liner) does the job:
awk -F'  ' 'NF<=p { for(i=1;i<=p;i++) printf "%s%s", a[i], (i==p?RS:"/")
                    if(NF<p) for(i=NF;i<=p;i++) delete a[i] }
            { a[NF]=$NF; p=NF }
            END { for(i=1;i<=NF;i++) printf "%s%s", a[i], (i==NF?RS:"/") }' file
As you can see above, there is duplicated code; you can extract it into a function if you like, as shown below.
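For example, a lightly refactored version with the shared loop pulled into a function (same logic; a sketch, not tested beyond the sample data):
awk -F'  ' '
function prt(n,   i) {
    for (i=1; i<=n; i++) printf "%s%s", a[i], (i==n ? RS : "/")
}
NF<=p {
    prt(p)
    if (NF<p) for (i=NF; i<=p; i++) delete a[i]
}
{ a[NF]=$NF; p=NF }
END { prt(NF) }' file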
test with your data:
kent$ cat f
a1
  b1
    c1
    c2
    c3
  b2
    c1
      d1
      d2
  b3
  b4
a2
a3
  b1
  b2
    c1
    c2
kent$ awk -F'  ' 'NF<=p { for(i=1;i<=p;i++) printf "%s%s", a[i], (i==p?RS:"/")
                          if(NF<p) for(i=NF;i<=p;i++) delete a[i] }
                  { a[NF]=$NF; p=NF }
                  END { for(i=1;i<=NF;i++) printf "%s%s", a[i], (i==NF?RS:"/") }' f
a1/b1/c1
a1/b1/c2
a1/b1/c3
a1/b2/c1/d1
a1/b2/c1/d2
a1/b3
a1/b4
a2
a3/b1
a3/b2/c1
a3/b2/c2
I recently had to do something similar; with a few tweaks I can post my script here:
#!/bin/bash
prev_level=-1
# Index into node array
i=0
# Regex to screen-scrape all nodes
tc_re="^(( )*)(.*)$"
while IFS= read -r ln; do
    if [[ $ln =~ $tc_re ]]; then
        # folder level indicated by the spaces preceding the node name
        spaces=${#BASH_REMATCH[1]}
        # 2 space characters per level
        level=$(($spaces / 2))
        # Name of the folder or node
        node=${BASH_REMATCH[3]}
        # get the rest of the node path from the previous entry
        curpath=( ${curpath[@]:0:$level} $node )
        # increment i only if the current level is <= the level of the previous
        # entry
        if [ $level -le $prev_level ]; then
            ((i++))
        fi
        # add this entry (overwrite previous if $i was not incremented)
        tc[$i]="${curpath[@]}"
        # save level for next iteration
        prev_level=$level
    fi
done
for p in "${tc[@]}"; do
    echo "${p// //}"
done
Input is taken from STDIN, so you'd have to do something like this:
$ ./tree2path.sh < ifile.tree
a1/b1/c1
a1/b1/c2
a1/b1/c3
a1/b2/c1/d1
a1/b2/c1/d2
a1/b3
a1/b4
a2
a3/b1
a3/b2/c1
a3/b2/c2
$