Convert a tree to a list of paths using awk [duplicate] - awk

I have input files with a structure like the following:
a1
  b1
    c1
    c2
    c3
  b2
    c1
      d1
      d2
  b3
  b4
a2
a3
  b1
  b2
    c1
    c2
Each level is indented by 2 spaces. The needed output is:
a1/b1/c1
a1/b1/c2
a1/b1/c3
a1/b2/c1/d1
a1/b2/c1/d2
a1/b3
a1/b4
a2
a3/b1
a3/b2/c1
a3/b2/c2
It is like a filesystem: if the next line has deeper indentation, the current one is like a "directory", and if it has the same indentation it is like a "file". I need to print the full paths of the "files".
I'm trying to solve this without any high-level language like Python or Perl - with only basic bash commands.
My current code/idea is based on a recursive function call and working with a stack, but I have a problem with the "logic". The code currently outputs the following:
a1 b1 c1
a1 b1
a1
DD: line 8: [0-1]: bad array subscript
Only the 1st line is OK, so my handling of the recursion is wrong...
input="ifile.tree"
#stack array
declare -a stack
#stack manipulation
pushstack() { stack+=("$1"); }
popstack() { unset stack[${#stack[@]}-1]; }
printstack() { echo "${stack[*]}"; }
#recursive function
checkline() {
    local uplev=$1
    #read line - if no more lines - print the stack and return
    read -r level text || (printstack; exit 1) || return
    #if the current line level is larger than the previous level
    if [[ $uplev < $level ]]
    then
        pushstack "$text"
        checkline $level #recurse
    fi
    printstack
    popstack
}
# MAIN PROGRAM
# change the input from indented spaces to
# level_number<space>text
(
    #subshell - change IFS
    IFS=,
    while read -r spaces content
    do
        echo $(( (${#spaces} / 2) + 1 )) "$content"
    done < <(sed 's/[^ ]/,&/' < "$input")
) | ( #pipe to another subshell
    checkline 0 #recurse by levels
)
Sorry for the long code - can anybody help?

Interesting question.
This awk command (it could be a one-liner) does the job:
awk -F'  ' 'NF<=p{for(i=1;i<=p;i++)printf "%s%s", a[i],(i==p?RS:"/")
            if(NF<p)for(i=NF;i<=p;i++) delete a[i]}
            {a[NF]=$NF;p=NF}
            END{for(i=1;i<=NF;i++)printf "%s%s", a[i],(i==NF?RS:"/")}' file
As you can see above, there is some duplicated code; you can extract it into a function if you like - see the sketch below.
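For example, one way to factor out the duplicated print loop might be the following (an untested sketch along the same lines, not the original answer's code):
awk -F'  ' '
    function prt(n,  i) { for (i=1; i<=n; i++) printf "%s%s", a[i], (i==n ? RS : "/") }
    NF<=p { prt(p); if (NF<p) for (i=NF; i<=p; i++) delete a[i] }
    { a[NF]=$NF; p=NF }
    END { prt(NF) }' file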
test with your data:
kent$ cat f
a1
  b1
    c1
    c2
    c3
  b2
    c1
      d1
      d2
  b3
  b4
a2
a3
  b1
  b2
    c1
    c2
kent$ awk -F'  ' 'NF<=p{for(i=1;i<=p;i++)printf "%s%s", a[i],(i==p?RS:"/")
      if(NF<p)for(i=NF;i<=p;i++) delete a[i]}
      {a[NF]=$NF;p=NF}END{for(i=1;i<=NF;i++)printf "%s%s", a[i],(i==NF?RS:"/")} ' f
a1/b1/c1
a1/b1/c2
a1/b1/c3
a1/b2/c1/d1
a1/b2/c1/d2
a1/b3
a1/b4
a2
a3/b1
a3/b2/c1
a3/b2/c2

I recently had to do something similar; with a few tweaks I can post my script here:
#!/bin/bash
prev_level=-1
# Index into node array
i=0
# Regex to screen-scrape all nodes
tc_re="^(( )*)(.*)$"
while IFS= read -r ln; do
    if [[ $ln =~ $tc_re ]]; then
        # folder level indicated by spaces in preceding node name
        spaces=${#BASH_REMATCH[1]}
        # 2 space characters per level
        level=$(($spaces / 2))
        # Name of the folder or node
        node=${BASH_REMATCH[3]}
        # get the rest of the node path from the previous entry
        curpath=( ${curpath[@]:0:$level} $node )
        # increment i only if the current level is <= the level of the previous
        # entry
        if [ $level -le $prev_level ]; then
            ((i++))
        fi
        # add this entry (overwrite previous if $i was not incremented)
        tc[$i]="${curpath[@]}"
        # save level for next iteration
        prev_level=$level
    fi
done
for p in "${tc[@]}"; do
    echo "${p// //}"
done
Input is taken from STDIN, so you'd have to do something like this:
$ ./tree2path.sh < ifile.tree
a1/b1/c1
a1/b1/c2
a1/b1/c3
a1/b2/c1/d1
a1/b2/c1/d2
a1/b3
a1/b4
a2
a3/b1
a3/b2/c1
a3/b2/c2
$

Related

how to extract two strings from each line

I have a file of the following content:
product a1 version a2 owner a3
owner b1 version b2 product b3 size b4
....
I am interested in extracting product and version from each line using a shell script, and writing them in 2 columns with product first and version second. So the output should be:
a1 a2
b3 b2
...
I used "while read line", but it is extremely slow. I tried to use awk, but couldn't figure out how to do it. Any help is appreciated.
The following will do what you want:
$ nl dat
1 product a1 version a2 owner a3
2 owner b1 version b2 product b3 size b4
$ awk 'NF { delete row;
            for( i=1; i <= NF; i += 2 ) {
                row[$i] = $(i+1)
            }
            print( row["product"], row["version"])
          }' dat
a1 a2
b3 b2
This builds an associative array from the name-value pairs in your data file by position, and then retrieves the values by name. The NF in the pattern ensures blank lines are ignored. If product or version are otherwise missing, they'll likewise be missing in the output.
A different perl approach:
perl -lane 'my %h = @F; print "$h{product} $h{version}"' input.txt
Uses auto-split mode to put each word of each line in an array, turns that into a hash/associative array, and prints out the keys you're interested in.
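With the sample data, this is expected to print:
a1 a2
b3 b2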
Here is a perl to do that:
perl -lnE '
$x=$1 if /(?=product\h+(\H+))/;
$y=$1 if /(?=version\h+(\H+))/;
say "$x $y" if $x && $y;
$x=$y="";' file
Or, same method with GNU awk:
gawk '/product/ && /version/{
match($0,/product[ \t]+([^ \t]+)/,f1)
match($0,/version[ \t]+([^ \t]+)/,f2)
print f1[1],f2[1]
}' file
With the example, either prints:
a1 a2
b3 b2
The advantage here is that only complete lines, where both targets are found, are printed.
With awk:
$ awk '{for(i=1;i<NF;i++){
          if($i=="version")v=$(i+1)
          if($i=="product")p=$(i+1)}}
        {print p,v}' data.txt
a1 a2
b3 b2
If you have lines without a version or product number, and you want to skip them:
awk '{ok=0}
     {for(i=1;i<NF;i++){
        if($i=="version"){v=$(i+1);ok++}
        if($i=="product"){p=$(i+1);ok++}}}
     ok==2{print p,v}' data.txt
Thank you guys for the quick and excellent replies. I ended up using the awk version as it is most convenient for me to insert into an existing shell script. But I learned a lot from other scripts too.

Replace in every nth line starting from a certain line

I want to replace on every third line starting from second line using sed.
Input file
A1
A2
A3
A4
A5
A6
A7
.
.
.
Expected output
A1
A2
A3
A4_edit
A5
A6
A7_edit
.
.
.
I know there are many solutions related to this available on Stack Overflow, but I was unable to find one for this specific problem.
My try:
sed '1n;s/$/_edit/;n'
This only replaces every second line, starting from the beginning.
Something like this?
$ seq 10 | sed '1b ; n ; n ; s/$/_edit/'
1
2
3
4_edit
5
6
7_edit
8
9
10_edit
This breaks down to a cycle of
1b if this is the first line in the input, start the next cycle, using sed default behaviour to print the line and read the next one - which skips the first line in the input
n print the current line and read the next line - which skips the first line in a group of three
n print the current line and read the next line - which skips the second line in a group of three
s/$/_edit/ substitute the end of line for _edit on the third line of each group of three
then use the default sed behaviour to print, read next line and start the cycle again
If you want to skip more than one line at the start, change 1b to, say, 1,5b.
As Wiktor Stribiżew has pointed out in the comments, as an alternative, there is a GNU range extension first~step which allows us to write
sed '4~3s/$/_edit/'
which means substitute on every third line starting from line 4.
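With the same seq 10 input as above, this is expected to produce the same result:
$ seq 10 | sed '4~3s/$/_edit/'
1
2
3
4_edit
5
6
7_edit
8
9
10_edit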
In case you are ok with awk, try following.
awk -v count="-1" '++count==3{$0=$0"_edit";count=0} 1' Input_file
Append > temp_file && mv temp_file Input_file in case you want to save output into Input_file itself.
Explanation:
awk -v count="-1" ' ##Starting awk code here and mentioning variable count whose value is -1 here.
++count==3{ ##Checking condition if increment value of count is equal to 3 then do following.
$0=$0"_edit" ##Appending _edit to current line value.
count=0 ##Making value of count as ZERO now.
} ##Closing block of condition ++count==3 here.
1 ##Mentioning 1 will print edited/non-edited lines.
' Input_file ##Mentioning Input_file name here.
Another awk
awk 'NR>3&&NR%3==1{$0=$0"_edit"}1' file
A1
A2
A3
A4_edit
A5
A6
A7_edit
A8
A9
A10_edit
A11
A12
A13_edit
NR>3 Test if the line number is larger than 3
NR%3==1 and every third line
{$0=$0"_edit"} edit the line
1 print everything
You can use sed's ~ step operator.
sed '4~3s|$|_edit|'
~ is a feature of GNU sed, so it will be available in most (all?) distros of Linux. But to use it on macOS (which comes with BSD sed), you would have to install GNU sed to get this feature: brew install gnu-sed.

Transpose columns of data table using awk [duplicate]

I have a huge tab-separated file formatted like this
X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11
I would like to transpose it in an efficient way using only bash commands (I could write a ten-or-so-line Perl script to do that, but it should be slower to execute than the native bash functions). So the output should look like
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
I thought of a solution like this
cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
done
But it's slow and doesn't seem like the most efficient solution. I've seen a solution for vi in this post, but it's still too slow. Any thoughts/suggestions/brilliant ideas? :-)
awk '
{
    for (i=1; i<=NF; i++) {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    }
}' file
output
$ more file
0 1 2
3 4 5
6 7 8
9 10 11
$ ./shell.sh
0 3 6 9
1 4 7 10
2 5 8 11
Performance against Perl solution by Jonathan on a 10000 lines file
$ head -5 file
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
1 0 1 2
$ wc -l < file
10000
$ time perl test.pl file >/dev/null
real 0m0.480s
user 0m0.442s
sys 0m0.026s
$ time awk -f test.awk file >/dev/null
real 0m0.382s
user 0m0.367s
sys 0m0.011s
$ time perl test.pl file >/dev/null
real 0m0.481s
user 0m0.431s
sys 0m0.022s
$ time awk -f test.awk file >/dev/null
real 0m0.390s
user 0m0.370s
sys 0m0.010s
EDIT by Ed Morton (@ghostdog74 feel free to delete if you disapprove).
Maybe this version with some more explicit variable names will help answer some of the questions below and generally clarify what the script is doing. It also uses tabs as the separator, which the OP had originally asked for, so it handles empty fields, and it coincidentally pretties up the output a bit for this particular case.
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    for (rowNr=1;rowNr<=NF;rowNr++) {
        cell[rowNr,NR] = $rowNr
    }
    maxRows = (NF > maxRows ? NF : maxRows)
    maxCols = NR
}
END {
    for (rowNr=1;rowNr<=maxRows;rowNr++) {
        for (colNr=1;colNr<=maxCols;colNr++) {
            printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
The above solutions will work in any awk (except old, broken awk of course - there YMMV).
The above solutions do read the whole file into memory though - if the input files are too large for that then you can do this:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }
ENDFILE {
    print ""
    if (ARGIND < NF) {
        ARGV[ARGC] = FILENAME
        ARGC++
    }
}
$ awk -f tst.awk file
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
which uses almost no memory but reads the input file once for every field on a line, so it will be much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on every line, and it uses GNU awk for ENDFILE and ARGIND, but any awk can do the same with tests on FNR==1 and END.
awk
Gawk version which uses arrays of arrays:
tp(){ awk '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?RS:FS),a[i][j]}' "${1+FS=$1}";}
Plain awk version which uses multidimensional arrays (this was about twice as slow in my benchmark):
tp(){ awk '{for(i=1;i<=NF;i++)a[i,NR]=$i}END{for(i=1;i<=NF;i++)for(j=1;j<=NR;j++)printf"%s"(j==NR?RS:FS),a[i,j]}' "${1+FS=$1}";}
macOS comes with a version of Brian Kernighan's nawk from 2007 which doesn't support arrays of arrays.
To use space as a separator without collapsing sequences of multiple spaces, use FS='[ ]'.
rs
rs is a BSD utility which also comes with macOS, but it should be available from package managers on other platforms. It is named after the reshape function in APL.
Use sequences of spaces and tabs as column separator:
rs -T
Use tab as column separator:
rs -c -C -T
Use comma as column separator:
rs -c, -C, -T
-c changes the input column separator and -C changes the output column separator. A lone -c or -C sets the separator to tab. -T transposes rows and columns.
Do not use -t instead of -T, because it automatically selects the number of output columns so that the output lines fill the width of the display (which is 80 characters by default but which can be changed with -w).
When an output column separator is specified using -C, an extra column separator character is added to the end of each row, but you can remove it with sed:
$ seq 4|paste -d, - -|rs -c, -C, -T
1,3,
2,4,
$ seq 4|paste -d, - -|rs -c, -C, -T|sed s/.\$//
1,3
2,4
rs -T determines the number of columns based on the number of columns on the first row, so it produces the wrong result when the first line ends with one or more empty columns:
$ rs -c, -C, -T<<<$'1,\n3,4'
1,3,4,
R
The t function transposes a matrix or dataframe:
Rscript -e 'write.table(t(read.table("stdin",sep=",",quote="",comment.char="")),sep=",",quote=F,col.names=F,row.names=F)'
If you replace Rscript -e with R -e, then it echoes the code that is being run to STDOUT, and it also results in the error ignoring SIGPIPE signal if the R command is followed by a command like head -n1 which exits before it has read the whole STDIN.
quote="" can be removed if the input doesn't contain double quotes or single quotes, and comment.char="" can be removed if the input doesn't contain lines that start with a hash character.
For a big input file, fread and fwrite from data.table are faster than read.table and write.table:
$ seq 1e6|awk 'ORS=NR%1e3?FS:RS'>a
$ time Rscript --no-init-file -e 'write.table(t(read.table("a")),quote=F,col.names=F,row.names=F)'>/dev/null
real 0m1.061s
user 0m0.983s
sys 0m0.074s
$ time Rscript --no-init-file -e 'write.table(t(data.table::fread("a")),quote=F,col.names=F,row.names=F)'>/dev/null
real 0m0.599s
user 0m0.535s
sys 0m0.048s
$ time Rscript --no-init-file -e 'data.table::fwrite(t(data.table::fread("a")),sep=" ",col.names=F)'>t/b
x being coerced from class: matrix to data.table
real 0m0.375s
user 0m0.296s
sys 0m0.073s
jq
tp(){ jq -R .|jq --arg x "${1-$'\t'}" -sr 'map(./$x)|transpose|map(join($x))[]';}
jq -R . prints each input line as a JSON string literal, -s (--slurp) creates an array for the input lines after parsing each line as JSON, and -r (--raw-output) outputs the contents of strings instead of JSON string literals. The / operator is overloaded to split strings.
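Assuming the tp function above is defined, a quick check with the same small sample would be expected to give:
$ seq 4|paste -d, - -|tp ,
1,3
2,4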
Ruby
ruby -e'STDIN.map{|x|x.chomp.split(",",-1)}.transpose.each{|x|puts x*","}'
The -1 argument to split disables discarding empty fields at the end:
$ ruby -e'p"a,,".split(",")'
["a"]
$ ruby -e'p"a,,".split(",",-1)'
["a", "", ""]
Function form:
$ tp(){ ruby -e's=ARGV[0];STDIN.map{|x|x.chomp.split(s==" "?/ /:s,-1)}.transpose.each{|x|puts x*s}' -- "${1-$'\t'}";}
$ seq 4|paste -d, - -|tp ,
1,3
2,4
The function above uses s==" "?/ /:s because when the argument to the split function is a single space, it enables awk-like special behavior where strings are split based on contiguous runs of spaces and tabs:
$ ruby -e'p" a \tb ".split(" ",-1)'
["a", "b", ""]
$ ruby -e'p" a \tb ".split(/ /,-1)'
["", "a", "", "\tb", ""]
A Python solution:
python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output
The above is based on the following:
import sys
for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
    print(' '.join(c))
This code does assume that every line has the same number of columns (no padding is performed).
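If padding is needed, a variant of the same idea built on itertools.zip_longest could fill short rows with empty strings (a sketch, not from the original answer):
import sys
from itertools import zip_longest

# pad ragged rows with '' so every output line has the same number of fields
rows = (l.split() for l in sys.stdin.readlines() if l.strip())
for c in zip_longest(*rows, fillvalue=''):
    print(' '.join(c))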
Have a look at GNU datamash which can be used like datamash transpose.
A future version will also support cross tabulation (pivot tables)
Here is how you would do it with space separated columns:
datamash transpose -t ' ' < file > transposed_file
The transpose project on SourceForge is a coreutils-like C program for exactly that.
gcc transpose.c -o transpose
./transpose -t input > output #works with stdin, too.
Pure BASH, no additional process. A nice exercise:
declare -a array=( )          # we build a 1-D-array
read -a line < "$1"           # read the headline
COLS=${#line[@]}              # save number of columns
index=0
while read -a line ; do
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"
for (( ROW = 0; ROW < COLS; ROW++ )); do
    for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
        printf "%s\t" ${array[$COUNTER]}
    done
    printf "\n"
done
GNU datamash is perfectly suited for this problem with only one line of code and potentially arbitrarily large filesize!
datamash -W transpose infile > outfile
There is a purpose built utility for this,
GNU datamash utility
apt install datamash
datamash transpose < yourfile
Taken from this site, https://www.gnu.org/software/datamash/ and http://www.thelinuxrain.com/articles/transposing-rows-and-columns-3-methods
Here is a moderately solid Perl script to do the job. There are many structural analogies with @ghostdog74's awk solution.
#!/bin/perl -w
#
# SO 1729824
use strict;
my(%data);          # main storage
my($maxcol) = 0;
my($rownum) = 0;
while (<>)
{
    my(@row) = split /\s+/;
    my($colnum) = 0;
    foreach my $val (@row)
    {
        $data{$rownum}{$colnum++} = $val;
    }
    $rownum++;
    $maxcol = $colnum if $colnum > $maxcol;
}
my $maxrow = $rownum;
for (my $col = 0; $col < $maxcol; $col++)
{
    for (my $row = 0; $row < $maxrow; $row++)
    {
        printf "%s%s", ($row == 0) ? "" : "\t",
               defined $data{$row}{$col} ? $data{$row}{$col} : "";
    }
    print "\n";
}
With the sample data size, the performance difference between perl and awk was negligible (1 millisecond out of 7 total). With a larger data set (100x100 matrix, entries 6-8 characters each), perl slightly outperformed awk - 0.026s vs 0.042s. Neither is likely to be a problem.
Representative timings for Perl 5.10.1 (32-bit) vs awk (version 20040207 when given '-V') vs gawk 3.1.7 (32-bit) on MacOS X 10.5.8 on a file containing 10,000 lines with 5 columns per line:
Osiris JL: time gawk -f tr.awk xxx > /dev/null
real 0m0.367s
user 0m0.279s
sys 0m0.085s
Osiris JL: time perl -f transpose.pl xxx > /dev/null
real 0m0.138s
user 0m0.128s
sys 0m0.008s
Osiris JL: time awk -f tr.awk xxx > /dev/null
real 0m1.891s
user 0m0.924s
sys 0m0.961s
Osiris-2 JL:
Note that gawk is vastly faster than awk on this machine, but still slower than perl. Clearly, your mileage will vary.
Assuming all your rows have the same number of fields, this awk program solves the problem:
{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}
In words, as you loop over the rows, for every field f grow a ':'-separated string col[f] containing the elements of that field. After you are done with all the rows, print each one of those strings in a separate line. You can then substitute ':' for the separator you want (say, a space) by piping the output through tr ':' ' '.
Example:
$ echo "1 2 3\n4 5 6"
1 2 3
4 5 6
$ echo "1 2 3\n4 5 6" | awk '{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}' | tr ':' ' '
1 4
2 5
3 6
If you have sc installed, you can do:
psc -r < inputfile | sc -W% - > outputfile
I normally use this little awk snippet for this requirement:
awk '{for (i=1; i<=NF; i++) a[i,NR]=$i
max=(max<NF?NF:max)}
END {for (i=1; i<=max; i++)
{for (j=1; j<=NR; j++)
printf "%s%s", a[i,j], (j==NR?RS:FS)
}
}' file
This just loads all the data into a bidimensional array a[line,column] and then prints it back as a[column,line], so that it transposes the given input.
This needs to keep track of the maximum amount of columns the initial file has, so that it is used as the number of rows to print back.
A hackish perl solution can be like this. It's nice because it doesn't load the whole file in memory; it prints intermediate temp files and then uses the all-wonderful paste.
#!/usr/bin/perl
use warnings;
use strict;
my $counter;
open INPUT, "<$ARGV[0]" or die ("Unable to open input file!");
while (my $line = <INPUT>) {
    chomp $line;
    my @array = split ("\t",$line);
    open OUTPUT, ">temp$." or die ("unable to open output file!");
    print OUTPUT join ("\n",@array);
    close OUTPUT;
    $counter=$.;
}
close INPUT;
# paste files together
my $execute = "paste ";
foreach (1..$counter) {
    $execute .= "temp$_ ";
}
$execute .= "> $ARGV[1]";
system $execute;
The only improvement I can see to your own example is using awk which will reduce the number of processes that are run and the amount of data that is piped between them:
/bin/rm output 2> /dev/null
cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do
awk '{printf ("%s%s", tab, $'$i'); tab="\t"} END {print ""}' input
done >> output
Some *nix standard util one-liners, no temp files needed. NB: the OP wanted an efficient fix, (i.e. faster), and the top answers are usually faster than this answer. These one-liners are for those who like *nix software tools, for whatever reasons. In rare cases, (e.g. scarce IO & memory), these snippets can actually be faster than some of the top answers.
Call the input file foo.
If we know foo has four columns:
for f in 1 2 3 4 ; do cut -d ' ' -f $f foo | xargs echo ; done
If we don't know how many columns foo has:
n=$(head -n 1 foo | wc -w)
for f in $(seq 1 $n) ; do cut -d ' ' -f $f foo | xargs echo ; done
xargs has a size limit and therefore would do incomplete work with a long file. The size limit is system dependent, e.g.:
{ timeout '.01' xargs --show-limits ; } 2>&1 | grep Max
Maximum length of command we could actually use: 2088944
tr & echo:
for f in 1 2 3 4; do cut -d ' ' -f $f foo | tr '\n' ' ' ; echo; done
...or if the # of columns are unknown:
n=$(head -n 1 foo | wc -w)
for f in $(seq 1 $n); do
cut -d ' ' -f $f foo | tr '\n' ' ' ; echo
done
Using set, which like xargs, has similar command line size based limitations:
for f in 1 2 3 4 ; do set - $(cut -d ' ' -f $f foo) ; echo $# ; done
I used fgm's solution (thanks fgm!), but needed to eliminate the tab characters at the end of each row, so modified the script thus:
#!/bin/bash
declare -a array=( )          # we build a 1-D-array
read -a line < "$1"           # read the headline
COLS=${#line[@]}              # save number of columns
index=0
while read -a line; do
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"
for (( ROW = 0; ROW < COLS; ROW++ )); do
    for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
        printf "%s" ${array[$COUNTER]}
        if [ $COUNTER -lt $(( ${#array[@]} - $COLS )) ]
        then
            printf "\t"
        fi
    done
    printf "\n"
done
I was just looking for a similar bash transpose, but with support for padding. Here is the script I wrote based on fgm's solution, which seems to work. If it can be of help...
#!/bin/bash
declare -a array=( )          # we build a 1-D-array
declare -a ncols=( )          # we build a 1-D-array containing number of elements of each row
SEPARATOR="\t";
PADDING="";
MAXROWS=0;
index=0
indexCol=0
while read -a line; do
    ncols[$indexCol]=${#line[@]};
    ((indexCol++))
    if [ ${#line[@]} -gt ${MAXROWS} ]
    then
        MAXROWS=${#line[@]}
    fi
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"
for (( ROW = 0; ROW < MAXROWS; ROW++ )); do
    COUNTER=$ROW;
    for (( indexCol=0; indexCol < ${#ncols[@]}; indexCol++ )); do
        if [ $ROW -ge ${ncols[indexCol]} ]
        then
            printf $PADDING
        else
            printf "%s" ${array[$COUNTER]}
        fi
        if [ $((indexCol+1)) -lt ${#ncols[@]} ]
        then
            printf $SEPARATOR
        fi
        COUNTER=$(( COUNTER + ncols[indexCol] ))
    done
    printf "\n"
done
I was looking for a solution to transpose any kind of matrix (nxn or mxn) with any kind of data (numbers or text) and got the following solution:
Line2Trans=number1
Col2Trans=number2
for ((i=1; $i <= Line2Trans; i++)); do
    for ((j=1; $j <= Col2Trans; j++)); do
        awk -v var1="$i" -v var2="$j" 'BEGIN { FS = "," } ; NR==var1 {print $((var2)) }' $ARCHIVO >> Column_$i
    done
done
paste -d',' `ls -mv Column_* | sed 's/,//g'` >> $ARCHIVO
If you only want to grab a single (comma delimited) line $N out of a file and turn it into a column:
head -$N file | tail -1 | tr ',' '\n'
Not very elegant, but this "single-line" command solves the problem quickly:
cols=4; for((i=1;i<=$cols;i++)); do \
awk '{print $'$i'}' input | tr '\n' ' '; echo; \
done
Here cols is the number of columns, where you can replace 4 by head -n 1 input | wc -w.
Another awk solution, limited only by the amount of memory you have.
awk '{ for (i=1; i<=NF; i++) RtoC[i]= (RtoC[i]? RtoC[i] FS $i: $i) }
END{ for (i in RtoC) print RtoC[i] }' infile
This joins all the fields with the same field-number position together, and in END prints the result: the first row becomes the first column, the second row becomes the second column, etc.
Will output:
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
#!/bin/bash
aline="$(head -n 1 file.txt)"
set -- $aline
colNum=$#
#set -x
while read line; do
    set -- $line
    for i in $(seq $colNum); do
        eval col$i="\"\$col$i \$$i\""
    done
done < file.txt
for i in $(seq $colNum); do
    eval echo \${col$i}
done
Another version, using set and eval.
Here is a Bash one-liner that is based on simply converting each line to a column and paste-ing them together:
echo '' > tmp1; \
cat m.txt | while read l ; \
do paste tmp1 <(echo $l | tr -s ' ' \\n) > tmp2; \
cp tmp2 tmp1; \
done; \
cat tmp1
m.txt:
0 1 2
4 5 6
7 8 9
10 11 12
creates tmp1 file so it's not empty.
reads each line and transforms it into a column using tr
pastes the new column to the tmp1 file
copies result back into tmp1.
PS: I really wanted to use io-descriptors but couldn't get them to work.
Another bash variant
$ cat file
XXXX col1 col2 col3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11
Script
#!/bin/bash
I=0
while read line; do
i=0
for item in $line; { printf -v A$I[$i] $item; ((i++)); }
((I++))
done < file
indexes=$(seq 0 $i)
for i in $indexes; {
J=0
while ((J<I)); do
arr="A$J[$i]"
printf "${!arr}\t"
((J++))
done
echo
}
Output
$ ./test
XXXX row1 row2 row3 row4
col1 0 3 6 9
col2 1 4 7 10
col3 2 5 8 11
I'm a little late to the game but how about this:
cat table.tsv | python -c "import pandas as pd, sys; pd.read_csv(sys.stdin, sep='\t').T.to_csv(sys.stdout, sep='\t')"
or zcat if it's gzipped.
This is assuming you have pandas installed in your version of python
Here's a Haskell solution. When compiled with -O2, it runs slightly faster than ghostdog's awk and slightly slower than Stephan's thinly wrapped c python on my machine for repeated "Hello world" input lines. Unfortunately GHC's support for passing command line code is non-existent as far as I can tell, so you will have to write it to a file yourself. It will truncate the rows to the length of the shortest row.
transpose :: [[a]] -> [[a]]
transpose = foldr (zipWith (:)) (repeat [])
main :: IO ()
main = interact $ unlines . map unwords . transpose . map words . lines
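Assuming the code above is saved as transpose.hs, compiling and running it might look like this:
$ ghc -O2 -o transpose transpose.hs
$ ./transpose < file > transposed_file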
An awk solution that stores the whole array in memory:
awk '$0!~/^$/{ i++;
               split($0,arr,FS);
               for (j in arr) {
                   out[i,j]=arr[j];
                   if (maxr<j){ maxr=j }    # max number of output rows.
               }
             }
     END {
         maxc=i                             # max number of output columns.
         for (j=1; j<=maxr; j++) {
             for (i=1; i<=maxc; i++) {
                 printf( "%s:", out[i,j])
             }
             printf( "%s\n","" )
         }
     }' infile
But we may "walk" the file as many times as there are output rows:
#!/bin/bash
maxf="$(awk '{if (mf<NF) mf=NF} END{print mf}' infile)"
rowcount=$maxf
for (( i=1; i<=rowcount; i++ )); do
    awk -v i="$i" -F " " '{printf("%s\t ", $i)}' infile
    echo
done
Which (for a low count of output rows) is faster than the previous code.
A one-liner using R...
cat file | Rscript -e "d <- read.table(file('stdin'), sep=' ', row.names=1, header=T); write.table(t(d), file=stdout(), quote=F, col.names=NA) "
I've used the below two scripts to do similar operations before. The first is in awk, which is a lot faster than the second, which is in "pure" bash. You might be able to adapt it to your own application.
awk '
{
    for (i = 1; i <= NF; i++) {
        s[i] = s[i]?s[i] FS $i:$i
    }
}
END {
    for (i in s) {
        print s[i]
    }
}' file.txt
declare -a arr
while IFS= read -r line
do
    i=0
    for word in $line
    do
        [[ ${arr[$i]} ]] && arr[$i]="${arr[$i]} $word" || arr[$i]=$word
        ((i++))
    done
done < file.txt
for ((i=0; i < ${#arr[@]}; i++))
do
    echo ${arr[i]}
done
Simple 4-line answer, keeping it readable:
col="$(head -1 file.txt | wc -w)"
for i in $(seq 1 $col); do
awk '{ print $'$i' }' file.txt | paste -s -d "\t"
done

Replace '\n' with white space only when the occurrence of '\t' is more than a number

I have tens of thousands of tab-delimited data files, each like:
a0\ta1\ta2\ta3\ta4\ta5\ta6\ta7\ta8\ta9\n
b0\tb1\tb2\tb3\tb4\tb5\tb6\tb7\tb8\tb9\n
...
However, occasionally there are files containing (randomly) malformed lines like:
a0\ta1\ta2\ta3_0\n
a3_1\ta4\ta5\ta6\ta7\ta8\ta9\n
b0\tb1\tb2_0\n
b2_1\tb3\tb4\tb5\tb6\tb7\tb8\tb9\n
...
where a3_0 and a3_1 (b2_0 and b2_1, resp.) are parts of a3 (b2, resp.) originally separated by a white space. I want to replace each \n at the end of a line with a white space only when that line is too short, i.e. has too few \t. Currently 5 seems to be a safe threshold.
I often use sed to do some modifications, which are much simpler than the above. I am wondering if sed or some other commands (like awk? which I still need to learn) can be used for fast processing (since I have many files). Thanks.
With GNU awk for multi-char RS and RT (and later -i infile and ENDFILE) and using commas instead of tabs for visibility:
$ cat file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0
a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0
b2_1,b3,b4,b5,b6,b7,b8,b9
$ awk -v RS='([^,]*,){9}[^\n]*\n' '{$0=RT; sub(/\n$/,"") gsub(/\n/," ")} 1' file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0 a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0 b2_1,b3,b4,b5,b6,b7,b8,b9
The above [ab-]uses RS to describe each record (instead of the record separator) as a series of 10 comma-separated fields ending with a newline and then replaces the newlines as appropriate within each record before printing.
Just change RS='([^,]*,){9}[^\n]*\n' to RS='([^\t]*\t){9}[^\n]*\n' for it to work with tab-separated instead of comma-separated fields.
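For instance, after that substitution the tab-separated form of the first command would be expected to look like:
awk -v RS='([^\t]*\t){9}[^\n]*\n' '{$0=RT; sub(/\n$/,""); gsub(/\n/," ")} 1' file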
To make the changes to all files add -i inplace:
awk -i inplace -v RS='...' '...' *
or:
find ... -exec awk -i inplace -v RS='...' '...' {} +
You actually don't even have to hard-code the RS, the tool could figure it out, assuming there's at least 1 complete line in each input file:
$ awk -F',' '
BEGIN { ARGV[ARGC] = ARGV[ARGC-1]; ARGC++ }
NR==FNR { n=(NF>n?NF:n); next }
ENDFILE { RS="([^"FS"]*"FS"){"n-1"}[^\n]*\n" }
{ $0=RT; sub(/\n$/,"") gsub(/\n/," "); print }
' file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0 a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0 b2_1,b3,b4,b5,b6,b7,b8,b9
Just change -F',' to -F'\t' for tab-separated.
FYI with POSIX awks, the closest equivalents of the above two gawk scripts would be:
$ awk '
{ rec=rec $0 RS }
END{
    while ( match(rec,/([^,]*,){9}[^\n]*\n/) ) {
        tgt = substr(rec,RSTART,RLENGTH)
        sub(/\n$/,"",tgt)
        gsub(/\n/," ",tgt)
        print tgt
        rec = substr(rec,RSTART+RLENGTH)
    }
}
' file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0 a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0 b2_1,b3,b4,b5,b6,b7,b8,b9
and:
awk -F',' '
{ rec=rec $0 RS; n=(NF>n?NF:n) }
END{
    while ( match(rec,"([^"FS"]*"FS"){"n-1"}[^\n]*\n") ) {
        tgt = substr(rec,RSTART,RLENGTH)
        sub(/\n$/,"",tgt)
        gsub(/\n/," ",tgt)
        print tgt
        rec = substr(rec,RSTART+RLENGTH)
    }
}
' file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0 a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0 b2_1,b3,b4,b5,b6,b7,b8,b9
Just be aware that those read the whole file into a single string before the main processing begins so they'd fail if your file was too huge to fit in memory but you already told us each file is "very small" so that shouldn't be an issue.
To overwrite the input file the simplest approach is always:
awk '{...}' file > tmp && mv tmp file
but in this case you can alternatively do:
awk '{...} END{... print tgt > ARGV[1] ...}' file
That works in this case because awk has already completed reading the input file before starting the END section. Do not attempt it elsewhere in the script.
Assuming you name the following script repiece:
#!/usr/bin/env bash
IFS=$'\t'         # use tab separators throughout this script
rIFS=,            # except to avoid field coalescing, use commas
pieces_needed=5   # adjust this to taste
for arg; do
  tempfile="${arg}.tmp-$$"  # vulnerable to symlink attacks; use mktemp instead if untrusted
                            # users have write access to current directory.
  deferred=( )
  {
    while IFS="$rIFS" read -r -a pieces; do
      if (( ( ${#deferred[@]} + ${#pieces[@]} ) < pieces_needed )); then
        deferred+=( "${pieces[@]}" )
      elif (( ${#deferred[@]} )); then
        # separate last piece of deferred and first of pieces with a space
        all_pieces=( "${deferred[@]} ${pieces[@]}" )
        printf '%s\n' "${all_pieces[*]}"
        deferred=( )
      else
        printf '%s\n' "${pieces[*]}"
      fi
    done
    # if we have anything deferred for the last line, print it now
    (( ${#deferred[@]} )) && printf '%s\n' "${deferred[*]}"
  } < <(tr -- "$IFS" "$rIFS" <"$arg") >"$tempfile"
  mv -- "$tempfile" "$arg"
done
...you can process all your files with the smallest possible number of invocations as follows:
# if your files end in .tsv
find . -type f -name '*.tsv' -exec ./repiece {} +
In awk, changing the ORS between a space and \n:
$ awk '
BEGIN {
FS=OFS="\t" # set field separators
RS=ORS="\n" # set record separators
}
NF<=5 { # if below or at threshold
ORS=" " # redefine output record separator
}
{
print # print record with ORS
ORS="\n" # reset ORS back to newline
}' file
a0 a1 a2 a3 a4 a5 a6 a7 a8 a9
b0 b1 b2 b3 b4 b5 b6 b7 b8 b9
a0 a1 a2 a3_0 a3_1 a4 a5 a6 a7 a8 a9
b0 b1 b2_0 b2_1 b3 b4 b5 b6 b7 b8 b9
Processing multiple files with shell scripting:
$ for f in file1 file2 ; do awk ... $f > new-$f ; done
Quote $f if needed.
This might work for you (GNU sed):
sed ':a;s/\t/&/9;t;N;s/\n/ /;ta' file
If there are fewer than 9 tabs in the current line, append the next line and replace the newline with a space. Repeat until there are 9 or more tabs.
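Applied to the comma-separated sample shown earlier in the thread, the same idea (with , in place of \t) should give:
$ sed ':a;s/,/&/9;t;N;s/\n/ /;ta' file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0 a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0 b2_1,b3,b4,b5,b6,b7,b8,b9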

How to get random lines with a given distribution with awk?

I have two tabular files that I want to merge, but prior to that I want to reduce the second one.
The first file, let's say File1, is tabular and looks like this:
A 67 98 56
A 22 29 62
A 11 99 28
B 37 88 14
B 33 99 65
We have 3 lines with A and two with B.
File2 contains 3000 lines of A and B; I want to randomly select lines from File2, but with exactly the same number of A and B lines as in File1, which means just 3 random lines with A and two with B.
Does anyone have an idea of how to do this with awk?
Thanks
#!/bin/bash
read -r acount bcount <<< $(csplit file2 '/^B /')
awk -v "acount=$acount" -v "bcount=$bcount" '
NR == FNR {
    arr[$1]++;
    next
}
! setup {
    setup = 1
    while (arandcount < acount) {
        line = int(rand() * acount) + 1
        if (! alines[line]) {
            alines[line] = 1
            arandcount++
        }
    }
    while (brandcount < bcount) {
        line = int(rand() * bcount) + 1
        if (! blines[line]) {
            blines[line] = 1
            brandcount++
        }
    }
}
FILENAME == "xx00" && FNR in alines {
    print
}
FILENAME == "xx01" && FNR in blines {
    print
}' file1 xx00 xx01
Where "xx00" and "xx01" are the names of the files created by csplit.
The csplit command splits the input file on the regex and outputs the line counts of each output file. The read command puts those counts into the variables. The variables are passed into the AWK program.
The first block reads all the lines in file1 and counts each "type" (A or B).
The second block picks the line numbers to select by choosing a random number between 1 and the count of lines of the "type". This block only gets executed once because of the flag.
The last two blocks check each line's record number to see if it's in the array of picked line numbers and prints it if it is.
This might work for you:
grep '^A' file2 | sort -R | head -$(grep -c '^A' file1) >file3
grep '^B' file2 | sort -R | head -$(grep -c '^B' file1) >>file3
N.B. This assumes file1 is sorted.
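If GNU shuf is available, it could replace the sort -R | head combination (a sketch of the same approach, not from the original answer):
grep '^A' file2 | shuf -n "$(grep -c '^A' file1)"  > file3
grep '^B' file2 | shuf -n "$(grep -c '^B' file1)" >> file3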