Deleting columns from a file with awk or from the command line on Linux

How can I delete some columns from a tab separated fields file with awk?
c1 c2 c3 ..... c60
For example, delete columns between 3 and 29.

This is what the cut command is for:
cut -f1,2,30- inputfile
The default delimiter is tab; you can change it with the -d switch.
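For example, if the file were comma-separated instead (a hypothetical variant):
cut -d',' -f1,2,30- inputfile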

You can loop over all columns and filter out the ones you don't want:
awk '{for (i=1; i<=NF; i++) if (i<3 || i>29) printf $i " "; print ""}' input.txt
where NF gives you the total number of fields in a record.
For each column that meets the condition, we print the column followed by a space (" ").
EDIT: updated after remark from johnny:
awk 'BEGIN{FS="\t"}{for (i=1; i<NF; i++) if (i<3 || i>5) printf $i FS; print $NF}' input.txt
This is improved in two ways:
keeps the original separators
does not append a separator at the end
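For instance, on a short tab-separated line (the i<3 || i>5 condition here drops columns 3-5; a quick sanity check, not from the original answer):
$ printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\n' | awk 'BEGIN{FS="\t"}{for (i=1; i<NF; i++) if (i<3 || i>5) printf $i FS; print $NF}'
c1	c2	c6	c7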

awk '{for(z=3;z<=15;z++)$z="";$0=$0;$1=$1}1'
Input
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 c21
Output
c1 c2 c16 c17 c18 c19 c20 c21
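In case the one-liner looks cryptic, here is the same command spread out with comments (my annotation; note it relies on the default whitespace FS, so it would not collapse the empty fields in a tab-separated file):
awk '{
  for (z=3; z<=15; z++) $z=""   # blank out columns 3-15, leaving runs of spaces in $0
  $0=$0                         # re-split $0: the default FS treats blank runs as one separator, so the empty fields vanish
  $1=$1                         # rebuild $0 from the remaining fields, joined by single OFS
}1' file                        # the bare 1 prints the rebuilt record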

A Perl 'splice' solution which does not add leading or trailing whitespace:
perl -lane 'splice @F,2,27; print join " ",@F' file
(splice @F,2,27 removes 27 elements starting at 0-based index 2, i.e. columns 3 through 29.)
Produces output:
c1 c2 c30 c31

Related

How can I adjust a text file in VIM to have two columns instead of three while not splitting paired data?

Hello there and thank you for reading this!
I have a very large text file that looks something like this:
a1 b1 a2
b2 a3 b3
a4 b4 a5
b5 a6 b6
I would like my text file to look like this:
a1 b1
a2 b2
a3 b3
a4 b4
a5 b5
a6 b6
In reality, these values are paired lon/lat coordinates. If it is useful, the values look like:
1.591336e+02 4.978998e+01 1.591162e+02
4.977995e+01 1.590988e+02 4.976991e+01
1.590815e+02 4.975988e+01 1.590641e+02
4.974984e+01 1.590468e+02 4.973980e+01
I have been learning in vim, but I do have access to other tools if this is easier done elsewhere. Should I be looking for a sed or awk command that will assess the number of spaces in a given row? I appreciate any and all advice, and if I can offer any more information, I would be glad to!
I have searched for other folks who have had this question, but I don't know how to apply the solutions I've seen for similar problems to this one, and I'm afraid of messing up this very large file. I am expecting the answer to be something using sed or awk, but I don't know how to be successful with these commands with what I've found so far. I'm rather new to coding and this site, so if I missed this question already being asked, I apologize!
All the best!
EDIT: I used
sed 's/\s/\n/2;P;D' file.txt > newfile.txt
to turn my file into:
a1 b1
a2^M
b2 a3
b3^M
a4 b4
a5^M
b5 a6
b6^M
I then used:
dos2unix newfile.txt
to get rid of the ^M within the data. I haven't reached the final structure yet, but I am one step closer.
$ tr ' ' '\n' <input_file|paste -d" " - -
a1 b1
a2 b2
a3 b3
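This works because each - stands for standard input and paste reads one line from each per output row, so - - consumes two lines at a time. A quick illustration:
$ seq 6 | paste -d" " - -
1 2
3 4
5 6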
$ sed 's/ /\n/2; P; D' <(tr '\n' ' ' <input_file)
a1 b1
a2 b2
a3 b3
$ tr '\n' ' ' <input_file|xargs -d" " printf '%s %s\n'
a1 b1
a2 b2
a3 b3
An approach using awk
% awk '{for(i=1;i<=NF;i++){x++; printf x%2==0 ? $i"\n" : $i" "}}' file
a1 b1
a2 b2
a3 b3
a4 b4
a5 b5
a6 b6
Data
% cat file
a1 b1 a2
b2 a3 b3
a4 b4 a5
b5 a6 b6
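A slightly more defensive variant of the awk above (my tweak, not from the thread): passing the field through a %s format means a stray % in the data cannot be misread as a printf format, and a variable makes the pair width adjustable:
% awk -v cols=2 '{for(i=1;i<=NF;i++){x++; printf "%s%s", $i, (x%cols==0 ? ORS : OFS)}}' file
a1 b1
a2 b2
a3 b3
a4 b4
a5 b5
a6 b6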
Using GNU sed
$ sed -Ez 's/([^ \n]*)[ \n]([^ \n]* ?)\n?/\1 \2\n/g' input_file
a1 b1
a2 b2
a3 b3
a4 b4
a5 b5
a6 b6
Using cut:
$ cut -d' ' -f1,2 test
a1 b1
b2 a3
a4 b4
b5 a6
In vim you can
:%j
to join all the lines, then
:s/\([^ ]\+ [^ ]\+\) /\1\r/g
to turn every 2nd space into a newline.
With perl
perl -0777 -lpe 's/(\S+)\s+(\S+)\s+/$1 $2\n/g' file
That reads the whole file into memory, so it depends on what "very large" is. Is it smaller than the amount of memory you have?
FWIW, here is a bit of a meta-answer.
Vim lets you filter all or some of the lines in your buffer via an external program with :help :!. It is very handy because, while Vim can do a lot, there are plenty of use cases for which external tools do a better job.
So… if you already have the file opened in Vim, you should be able to apply the provided answers with little effort:
:%!tr ' ' '\n' | paste -d" " - -
:%!tr '\n' ' ' | sed 's/ /\n/2; P; D'
:%!perl -0777 -lpe 's/(\S+)\s+(\S+)\s+/$1 $2\n/g'
etc.
Of note:
The command after the [range]! takes the lines covered by [range] as standard input which makes constructs like <filename unnecessary in this context.
Vim expands % to the current filename and # to the alternate filename if they exist, so they usually need to be escaped, as in:
:%!tr '\n' ' ' | xargs printf '\%s \%s\n'
There's a lot to learn from this thread. Good luck.
This might work for you (GNU sed):
sed '$!N;y/\n/ /;s/ /\n/2;P;D' file
Append the following line if not the last.
Translate all newlines to spaces.
Replace the second space by a newline.
Print the first line, delete the first line and repeat.
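A quick trace on a two-line sample (GNU sed assumed, as in the answer):
$ printf 'a1 b1 a2\nb2 a3 b3\n' | sed '$!N;y/\n/ /;s/ /\n/2;P;D'
a1 b1
a2 b2
a3 b3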

how to extract two strings from each line

I have a file of the following content:
product a1 version a2 owner a3
owner b1 version b2 product b3 size b4
....
I am interested in extracting product and version from each line using a shell script, and write them in 2 columns with product first and version second. So the output should be:
a1 a2
b3 b2
...
I used "while read line", but it is extremely slow. I tried to use awk, but couldn't figure out how to do it. Any help is appreciated.
The following will do what you want:
$ nl dat
1 product a1 version a2 owner a3
2 owner b1 version b2 product b3 size b4
$ awk 'NF { delete row;
            for ( i=1; i <= NF; i += 2 ) {
                row[$i] = $(i+1)
            }
            print( row["product"], row["version"] )
      }' dat
a1 a2
b3 b2
This builds an associative array from the name-value pairs in your data file by position, and then retrieves the values by name. The NF in the pattern ensures blank lines are ignored. If product or version are otherwise missing, they'll likewise be missing in the output.
A different perl approach:
perl -lane 'my %h = @F; print "$h{product} $h{version}"' input.txt
Uses auto-split mode to put each word of each line in an array, turns that into a hash/associative array, and prints out the keys you're interested in.
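If some lines might lack one of the two keys (an assumption about the data, not something the question states), a guard skips them instead of printing undefined values:
perl -lane 'my %h = @F; print "$h{product} $h{version}" if exists $h{product} && exists $h{version}' input.txt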
Here is a perl to do that:
perl -lnE '
    $x=$1 if /(?=product\h+(\H+))/;
    $y=$1 if /(?=version\h+(\H+))/;
    say "$x $y" if $x && $y;
    $x=$y="";' file
Or, same method with GNU awk:
gawk '/product/ && /version/{
    match($0,/product[ \t]+([^ \t]+)/,f1)
    match($0,/version[ \t]+([^ \t]+)/,f2)
    print f1[1],f2[1]
}' file
With the example, either prints:
a1 a2
b3 b2
The advantage here is that lines are only printed when both targets are found on them.
With awk:
$ awk '{for(i=1;i<NF;i++){
            if($i=="version")v=$(i+1)
            if($i=="product")p=$(i+1)}}
        {print p,v}' data.txt
a1 a2
b3 b2
If you have lines without a version or product number, and you want to skip them:
awk '{ok=0}
     {for(i=1;i<NF;i++){
         if($i=="version"){v=$(i+1);ok++}
         if($i=="product"){p=$(i+1);ok++}}}
     ok==2{print p,v}' data.txt
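For example, with a hypothetical extra line that has neither key:
$ printf 'product a1 version a2\nowner c1 size c2\n' | awk '{ok=0; for(i=1;i<NF;i++){if($i=="version"){v=$(i+1);ok++} if($i=="product"){p=$(i+1);ok++}}} ok==2{print p,v}'
a1 a2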
Thank you guys for the quick and excellent replies. I ended up using the awk version as it is most convenient for me to insert into an existing shell script. But I learned a lot from other scripts too.

awk match multiple pattern in column

What is the proper awk syntax to match multiple patterns in one column? Having a columnar file like this:
c11 c21 c31
c12 c22 c32
c13 c23 c33
how can I exclude lines whose second column matches c21 or c22?
With grep, one can do something like this (but it doesn't specify to match in the second column only):
> egrep -w -v "c21|c22" bar.txt
c13 c23 c33
I tried playing with awk but to no avail:
> awk '$2 != /c21|c22/' bar.txt
c11 c21 c31
c12 c22 c32
c13 c23 c33
> awk '$2 != "c21" || $2 != "c22"' bar.txt
c11 c21 c31
c12 c22 c32
c13 c23 c33
So, what is the proper awk syntax to get this right?
$2 != /c21|c22/
is shorthand for
$2 != ($0 ~ /c21|c22/)
which compares $2 against the result of matching $0 against c21 or c22. That result is either 1 or 0, and since $2 is never the string 1 or 0, the test is true for every line.
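You can see this with a quick test (my illustration):
$ echo 'c11 c21 c31' | awk '{print ($0 ~ /c21|c22/), ($2 != ($0 ~ /c21|c22/))}'
1 1
The match yields 1, and the string c21 is not equal to 1, so the condition holds even on a line you wanted to exclude.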
$2 != "c21" || $2 != "c22"
is testing whether $2 is not equal to c21 OR $2 is not equal to c22, which is a condition that is always true. Think about it: if $2 is c21, then the first condition ($2 != "c21") is false, but the second condition ($2 != "c22") is true, and vice versa, so the "or" is true for any value of $2.
What you're trying to write is:
awk '$2 !~ /c21|c22/'
or more robustly:
awk '$2 !~ /^(c21|c22)$/'
and more briefly (plus just as robustly) the way to REALLY write that condition is:
awk '$2 !~ /^c2[12]$/'
and if you wanted to do a string rather than regexp comparison then you'd do either of these if it's a throwaway script (I favor the first for fewer negation signs which IMHO makes it clearer):
awk '!($2 == "c21" || $2 == "c22")'
awk '$2 != "c21" && $2 != "c22"'
and this otherwise:
awk 'BEGIN{split("c21 c22",t); for (i in t) vals[t[i]]} !($2 in vals)'
That last one is best, since you only specify $2 once, and if you need to test more values you can just add them to the string being split, which means you can't accidentally break the comparison logic later in the script.
Use and (&&) instead of or (||):
awk '$2 != "c21" && $2 != "c22"' bar.txt
Prints:
c13 c23 c33
Since c21 doesn't equal c22, lines with c21 in column 2 are printed by the || version because $2 doesn't equal c22, and vice versa for lines with c22. In fact, it is impossible for any line not to be printed, because in no line can column 2 equal both c21 and c22.

Awk: concatenating split field element from one file to another based on common field

I have two tab-delimited files, f1 and f2, that look like this:
f1:
id1 r1
id2 r2
id3 r3
...
idN rN
f2:
f1 g1 x1;id1
f2 g2 x2;id2
f4 g4 x2;id4
...
fM gM xm;idM
where N and M may be different. I'm looking to create an associative array of f1 and concatenate the second column of f1 to the end of f2 such that the output is:
f1 g1 x1;id1=r1
f2 g2 x2;id2=r2
...
As a test, I've run this:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{id[$1]=$1; r[$1]=$2; next} {split($3,a,";"); if (a[2] in id) {print "found"} else {print "not found"}}' f1 f2
which gives output:
found
found
not found
...
However, running the following command:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{id[$1]=$1; r[$1]=$2; next} {split($3,a,";"); if (a[2] in id) {$3=$3"="r[$1]; print} else {print "not found"}}' f1 f2
gives the output:
f1 g1 x1;id1=
f2 g2 x2;id2=
not found
...
My question is: how do I access the value associated with the key such that I can append it to the 3rd column of f2?
join is the tool for joining files, especially if they are already sorted by the key.
$ join -14 <(sed 's/;/; /' file2) file1 |
awk '{print $2,$3,$4$1 "=" $5}'
f1 g1 x1;id1=r1
f2 g2 x2;id2=r2
However, your output format is not standard, so awk is needed to restore it; in that case the whole job could probably be done in awk as well.
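Incidentally, the empty values in your second command come from indexing r with $1 of f2 (f1, f2, ...) rather than with the id; a minimal awk fix (a sketch, only checked against the sample) keys the lookup with a[2] from split:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{r[$1]=$2; next} {split($3,a,";"); if (a[2] in r) $3=$3"="r[a[2]]; print}' f1 f2
Unmatched lines (like x2;id4) pass through unchanged here.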

Convert a tree to a list of paths using awk [duplicate]

I have input files with the structure like the next:
a1
  b1
    c1
    c2
    c3
  b2
    c1
      d1
      d2
  b3
  b4
a2
a3
  b1
  b2
    c1
    c2
Each level is indented by 2 spaces. The needed output is:
a1/b1/c1
a1/b1/c2
a1/b1/c3
a1/b2/c1/d1
a1/b2/c1/d2
a1/b3
a1/b4
a2
a3/b1
a3/b2/c1
a3/b2/c2
It is like a filesystem: if the next line has deeper indentation, the current line is like a "directory", and if it has the same indentation, the current line is like a "file". I need to print the full paths of the "files".
I'm trying to solve this without any high-level language like Python or Perl, using only basic bash commands.
My current code/idea is based on a recursive function call and a stack, but I have a problem with the logic. The code currently outputs:
a1 b1 c1
a1 b1
a1
DD: line 8: [0-1]: bad array subscript
Only the 1st line is OK, so my handling of the recursion is wrong...
input="ifile.tree"
#stack array
declare -a stack
#stack manipulation
pushstack() { stack+=("$1"); }
popstack() { unset stack[${#stack[#]}-1]; }
printstack() { echo "${stack[*]}"; }
#recursive function
checkline() {
local uplev=$1
#read line - if no more lines - print the stack and return
read -r level text || (printstack; exit 1) || return
#if the current line level is largest than previous level
if [[ $uplev < $level ]]
then
pushstack "$text"
checkline $level #recurse
fi
printstack
popstack
}
# MAIN PROGRAM
# change the input from indented spaces to
# level_number<space>text
(
#subshell - change IFS
IFS=,
while read -r spaces content
do
echo $(( (${#spaces} / 2) + 1 )) "$content"
done < <(sed 's/[^ ]/,&/' < "$input")
) | ( #pipe to another subshell
checkline 0 #recurse by levels
)
Sorry for the long code - can anybody help?
Interesting question.
This awk command (it could be a one-liner) does the job:
awk -F'  ' 'NF<=p{for(i=1;i<=p;i++)printf "%s%s", a[i],(i==p?RS:"/")
    if(NF<p)for(i=NF;i<=p;i++) delete a[i]}
    {a[NF]=$NF; p=NF}
    END{for(i=1;i<=NF;i++)printf "%s%s", a[i],(i==NF?RS:"/")}' file
(Note that the field separator is two spaces, matching the two-space indentation, so NF equals the depth plus one.)
As you can see, there is some duplicated code above; you could extract it into a function if you like.
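For example, the refactor could look like this (an untested sketch; prt is my name for the extracted function):
awk -F'  ' '
function prt(n,  i) {for(i=1;i<=n;i++) printf "%s%s", a[i], (i==n?RS:"/")}
NF<=p {prt(p); if(NF<p) for(i=NF;i<=p;i++) delete a[i]}
{a[NF]=$NF; p=NF}
END{prt(NF)}' file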
test with your data:
kent$ cat f
a1
  b1
    c1
    c2
    c3
  b2
    c1
      d1
      d2
  b3
  b4
a2
a3
  b1
  b2
    c1
    c2
kent$ awk -F'  ' 'NF<=p{for(i=1;i<=p;i++)printf "%s%s", a[i],(i==p?RS:"/")
      if(NF<p)for(i=NF;i<=p;i++) delete a[i]}
      {a[NF]=$NF; p=NF}END{for(i=1;i<=NF;i++)printf "%s%s", a[i],(i==NF?RS:"/")}' f
a1/b1/c1
a1/b1/c2
a1/b1/c3
a1/b2/c1/d1
a1/b2/c1/d2
a1/b3
a1/b4
a2
a3/b1
a3/b2/c1
a3/b2/c2
I recently had to do something similar enough that, with a few tweaks, I can post my script here:
#!/bin/bash
prev_level=-1
# Index into node array
i=0
# Regex to screen-scrape all nodes
tc_re="^((  )*)(.*)$"
while IFS= read -r ln; do
    if [[ $ln =~ $tc_re ]]; then
        # folder level indicated by the spaces preceding the node name
        spaces=${#BASH_REMATCH[1]}
        # 2 space characters per level
        level=$(($spaces / 2))
        # Name of the folder or node
        node=${BASH_REMATCH[3]}
        # get the rest of the node path from the previous entry
        curpath=( ${curpath[@]:0:$level} $node )
        # increment i only if the current level is <= the level of the previous
        # entry
        if [ $level -le $prev_level ]; then
            ((i++))
        fi
        # add this entry (overwrite previous if $i was not incremented)
        tc[$i]="${curpath[@]}"
        # save level for next iteration
        prev_level=$level
    fi
done
for p in "${tc[@]}"; do
    echo "${p// //}"
done
Input is taken from STDIN, so you'd have to do something like this:
$ ./tree2path.sh < ifile.tree
a1/b1/c1
a1/b1/c2
a1/b1/c3
a1/b2/c1/d1
a1/b2/c1/d2
a1/b3
a1/b4
a2
a3/b1
a3/b2/c1
a3/b2/c2
$