How to field separate and append text using awk

Experts,
I have the following text in an xml file (there will be 20,000 rows in the file).
<record record_no = "1" error_code="101">&quot;21006041&quot;;&quot;28006041&quot;;&quot;34006211&quot;;&quot;43&quot;;&quot;101210-0001&quot;
Here is how I need the result for each row to look, appended to a new file.
"21006041";"28006041";"34006211";"43";"101210-0001";101
Here is what I need to do to get the above result.
Replace &quot; with "
Remove <record record_no = "1" error_code="
Get the text 101 (it can have any value in this position)
Append it to the end.
Here is what I have been trying.
BEGIN { FS=OFS=";" }
/<record/ {
gsub(/"/,"\"")
gsub(/&apos;/,"")
gsub(/.*="|">.*/,"",$1)
$(NF+1)=$1;
$1="";
print $0;
}

This should do the trick.
awk -F'">' -v OFS=';' '{gsub(/<record record_no = \"[0-9]+\" error_code="/,""); gsub(/"/,"\""); print $2,$1}'
The strategy is to:
split the string at closing chars of the xml element ">
remove the first bit of the xml element including the attribute names leaving only the error code.
replace all " xml entities with ".
print the two FS sections in reverse order.
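As a quick check, feeding the sample record from the question through the one-liner produces the expected line:
$ echo '<record record_no = "1" error_code="101">&quot;21006041&quot;;&quot;28006041&quot;;&quot;34006211&quot;;&quot;43&quot;;&quot;101210-0001&quot;' | awk -F'">' -v OFS=';' '{gsub(/<record record_no = \"[0-9]+\" error_code="/,""); gsub(/&quot;/,"\""); print $2,$1}'
"21006041";"28006041";"34006211";"43";"101210-0001";101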
Test it out with the following data generation script. The script will generate 500 files of 20,000 lines each, with records of random length, some with dashes in the values.
#!/bin/bash
recCount=0
for h in {1..500};
do
for i in {1..20000};
do
((recCount++))
error=$(( RANDOM % 998 + 1 ))
record="<record record_no = "'"'"${recCount}"'"'" error_code="'"'"${error}"'"'">"
upperBound=$(( RANDOM % 4 + 5 ))
for (( k=0; k<${upperBound}; k++ ));
do
randomVal=$(( RANDOM % 99999999 + 1))
record+=""${randomVal}"
if [[ $((RANDOM % 4)) == 0 ]];
then
randomVal=$(( RANDOM % 99999999 + 1))
record+="-${randomVal}"
fi
record+="""
if [[ $k != $(( ${upperBound} - 1 )) ]];
then
record+=";"
fi
done;
echo "${record}" >> "file-${h}.txt"
done;
done;
On my laptop I get the following performance.
$ time cat file-*.txt | awk -F'">' -v OFS=';' '{gsub(/<record record_no = \"[0-9]+\" error_code="/,""); gsub(/&quot;/,"\""); print $2,$1}' > result
real 0m18.985s
user 0m17.673s
sys 0m2.697s
As an added bonus, here is the "equivalent" command in sed:
sed -e 's|\("\)|"|g' -e 's|^.*error_code="\([^>]\+\)">\(.\+\).*$|\2;\1|g'
Much slower, although the strategy is the same. Two expressions are used. First replace all &quot; xml entities with ". Then group all characters (.+) after ">. Display the remembered patterns in reverse order, \2;\1.
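Applied to the same sample record (GNU sed, since the expression relies on \+), it produces the same result:
$ echo '<record record_no = "1" error_code="101">&quot;21006041&quot;;&quot;28006041&quot;;&quot;34006211&quot;;&quot;43&quot;;&quot;101210-0001&quot;' | sed -e 's|\(&quot;\)|"|g' -e 's|^.*error_code="\([^>]\+\)">\(.\+\).*$|\2;\1|g'
"21006041";"28006041";"34006211";"43";"101210-0001";101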
Timing statistics:
$ time cat file-* | sed -e 's|\(&quot;\)|"|g' -e 's|^.*error_code="\([^>]\+\)">\(.\+\).*$|\2;\1|g' > result.sed
real 5m59.576s
user 5m56.136s
sys 0m9.850s

Is this too thick:
$ awk -F""+" -v OFS='";"' -v dq='"' '{gsub(/^.*="|">$/,"",$1);print dq""$2,$4,$6,$8,$10dq";"$1}' test.in
"21006041";"28006041";"34006211";"43";"101210-0001";101

Related

To read and print the 1st 1000 rows from a csv using an awk command, then the next 1000, and so on

I have a csv that has around 25k rows. I have to pick 1000 rows from column#1 and column#2 at a time, then the next 1000 rows, and so on.
I am using the command below, and it works fine for picking up all the values from column#1 and column#2 (i.e. 25k fields from both columns). Instead, I want to pick values 1-1000, put them in the WHERE ... IN clause of my sql export query, then 1001-2000, 2001-3000 and so on, appending the result of each query to the dbData.csv file.
My code is below:
awk -F ',' 'NR > 2 {print $1}' $INPUT > column1.txt
i=$(cat column1.txt | sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}')
awk -F ',' 'NR > 2 {print $2}' $INPUT > column2.txt
j=$(cat column2.txt | sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}')
echo "Please wait - connecting to database..."
db2 connect to $sourceDBStr user user123 using pas123
db2 "export to dbData.csv of del select partnumber,language_id as LanguageId from CATENTRY c , CATENTDESC cd where c.CATENTRY_ID=cd.CATENTRY_ID and c.PARTNUMBER in ($i) and cd.language_id in ($j)"
Let's assume the first two fields of your input CSV are "simple" (no spaces, no commas...) and do not need any kind of quoting. You could generate the tricky part of your query string with an awk script:
# foo.awk
NR >= first && NR <= last {
c1[n+0] = $1
c2[n++] = $2
}
END {
for(i = 0; i < n-1; i++) printf("%s,", c1[i])
printf("%s) %s (%s", c1[n-1], midstr, c2[0])
for(i = 1; i < n; i++) printf(",%s", c2[i])
}
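To see what foo.awk produces on its own, here is a quick check against a hypothetical 4-line sample.csv (the file name and values are made up for illustration); the header row is skipped because first starts at 2 in the loop below:
$ cat sample.csv
partnumber,language_id
A1,L1
A2,L2
A3,L3
$ awk -F ',' -f foo.awk -v midstr="and cd.language_id in" -v first=2 -v last=4 sample.csv
A1,A2,A3) and cd.language_id in (L1,L2,L3
Wrapped as "$prefix ($query)" by the loop, this becomes ... c.PARTNUMBER in (A1,A2,A3) and cd.language_id in (L1,L2,L3).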
And then use it in a bash loop to process 1000 records per iteration, store the result of the query in a temporary file (e.g., tmp.csv in the following bash script) that you concatenate to your dbData.csv file. The following example bash script uses the same parameters as you do (INPUT, sourceDBStr) and the same constants (dbData.csv, 1000, user123, pas123). Adapt if you need more flexibility. Error management (input file not found, DB connection error, DB query error...) is left as a bash exercise (but should be done).
prefix="export to tmp.csv of del select partnumber,language_id as LanguageId from CATENTRY c , CATENTDESC cd where c.CATENTRY_ID=cd.CATENTRY_ID and c.PARTNUMBER in"
midstr="and cd.language_id in"
rm -f dbData.csv
len=$(cat "$INPUT" | wc -l)
for (( first = 2; first <= len - 999; first += 1000 )); do
(( last = len < first + 999 ? len : first + 999 ))
query=$(awk -F ',' -f foo.awk -v midstr="$midstr" -v first="$first" \
-v last="$last" "$INPUT")
echo "Please wait - connecting to database..."
db2 connect to $sourceDBStr user user123 using pas123
db2 "$prefix ($query)"
cat tmp.csv >> dbData.csv
done
rm -f tmp.csv
But there are other ways using split, bash arrays and simpler awk or sed scripts. Example:
declare -a arr=()
prefix="export to tmp.csv of del select partnumber,language_id as LanguageId from CATENTRY c , CATENTDESC cd where c.CATENTRY_ID=cd.CATENTRY_ID and c.PARTNUMBER in"
midstr="and cd.language_id in"
awk -F, 'NR>1 {print $1, $2}' "$INPUT" | split -l 1000 - foobar
rm -f dbData.csv
for f in foobar*; do
arr=($(awk '{print $1 ","}' "$f"))
i="${arr[*]}"
arr=($(awk '{print $2 ","}' "$f"))
j="${arr[*]}"
echo "Please wait - connecting to database..."
db2 connect to $sourceDBStr user user123 using pas123
db2 "$prefix (${i%,}) $midstr (${j%,})"
cat tmp.csv >> dbData.csv
rm -f "$f"
done
rm -f tmp.csv
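Another variant of the loop body builds each comma-separated IN-list with paste -s instead of collecting a bash array and stripping the trailing comma (a sketch only, reusing the foobar* chunk files produced by split above):
i=$(awk '{print $1}' "$f" | paste -sd, -)
j=$(awk '{print $2}' "$f" | paste -sd, -)
db2 "$prefix ($i) $midstr ($j)"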

Transpose columns of data table using awk [duplicate]

I have a huge tab-separated file formatted like this
X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11
I would like to transpose it in an efficient way using only bash commands (I could write a ten or so lines Perl script to do that, but it should be slower to execute than the native bash functions). So the output should look like
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
I thought of a solution like this
cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
done
But it's slow and doesn't seem the most efficient solution. I've seen a solution for vi in this post, but it's still over-slow. Any thoughts/suggestions/brilliant ideas? :-)
awk '
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for(j=1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str" "a[i,j];
}
print str
}
}' file
output
$ more file
0 1 2
3 4 5
6 7 8
9 10 11
$ ./shell.sh
0 3 6 9
1 4 7 10
2 5 8 11
Performance against Perl solution by Jonathan on a 10000 lines file
$ head -5 file
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
1 0 1 2
$ wc -l < file
10000
$ time perl test.pl file >/dev/null
real 0m0.480s
user 0m0.442s
sys 0m0.026s
$ time awk -f test.awk file >/dev/null
real 0m0.382s
user 0m0.367s
sys 0m0.011s
$ time perl test.pl file >/dev/null
real 0m0.481s
user 0m0.431s
sys 0m0.022s
$ time awk -f test.awk file >/dev/null
real 0m0.390s
user 0m0.370s
sys 0m0.010s
EDIT by Ed Morton (@ghostdog74 feel free to delete if you disapprove).
Maybe this version with some more explicit variable names will help answer some of the questions below and generally clarify what the script is doing. It also uses tabs as the separator, which the OP had originally asked for, so it handles empty fields, and it coincidentally pretties up the output a bit for this particular case.
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
for (rowNr=1;rowNr<=NF;rowNr++) {
cell[rowNr,NR] = $rowNr
}
maxRows = (NF > maxRows ? NF : maxRows)
maxCols = NR
}
END {
for (rowNr=1;rowNr<=maxRows;rowNr++) {
for (colNr=1;colNr<=maxCols;colNr++) {
printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
}
}
}
$ awk -f tst.awk file
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
The above solutions will work in any awk (except old, broken awk of course - there YMMV).
The above solutions do read the whole file into memory though - if the input files are too large for that then you can do this:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }
ENDFILE {
print ""
if (ARGIND < NF) {
ARGV[ARGC] = FILENAME
ARGC++
}
}
$ awk -f tst.awk file
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
which uses almost no memory but reads the input file once per field on a line, so it will be much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on every line, and it uses GNU awk for ENDFILE and ARGIND, but any awk can do the same with tests on FNR==1 and END.
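For reference, here is a rough sketch of that portable variant (untested; it keeps the same equal-fields-per-line assumption and relies on an awk that, like gawk, processes files appended to ARGV/ARGC while it runs):
BEGIN { FS=OFS="\t" }
FNR==1 {                      # a new pass over the file starts here
    if (NR>1) print ""        # terminate the row built by the previous pass
    pass++                    # field number to print on this pass
    if (pass < NF) {          # queue one more pass while fields remain
        ARGV[ARGC] = FILENAME
        ARGC++
    }
}
{ printf "%s%s", (FNR>1 ? OFS : ""), $pass }
END { print "" }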
awk
Gawk version which uses arrays of arrays:
tp(){ awk '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?RS:FS),a[i][j]}' "${1+FS=$1}";}
Plain awk version which uses multidimensional arrays (this was about twice as slow in my benchmark):
tp(){ awk '{for(i=1;i<=NF;i++)a[i,NR]=$i}END{for(i=1;i<=NF;i++)for(j=1;j<=NR;j++)printf"%s"(j==NR?RS:FS),a[i,j]}' "${1+FS=$1}";}
macOS comes with a version of Brian Kernighan's nawk from 2007 which doesn't support arrays of arrays.
To use space as a separator without collapsing sequences of multiple spaces, use FS='[ ]'.
rs
rs is a BSD utility which also comes with macOS, but it should be available from package managers on other platforms. It is named after the reshape function in APL.
Use sequences of spaces and tabs as column separator:
rs -T
Use tab as column separator:
rs -c -C -T
Use comma as column separator:
rs -c, -C, -T
-c changes the input column separator and -C changes the output column separator. A lone -c or -C sets the separator to tab. -T transposes rows and columns.
Do not use -t instead of -T, because it automatically selects the number of output columns so that the output lines fill the width of the display (which is 80 characters by default but which can be changed with -w).
When an output column separator is specified using -C, an extra column separator character is added to the end of each row, but you can remove it with sed:
$ seq 4|paste -d, - -|rs -c, -C, -T
1,3,
2,4,
$ seq 4|paste -d, - -|rs -c, -C, -T|sed s/.\$//
1,3
2,4
rs -T determines the number of columns based on the number of columns on the first row, so it produces the wrong result when the first line ends with one or more empty columns:
$ rs -c, -C, -T<<<$'1,\n3,4'
1,3,4,
R
The t function transposes a matrix or dataframe:
Rscript -e 'write.table(t(read.table("stdin",sep=",",quote="",comment.char="")),sep=",",quote=F,col.names=F,row.names=F)'
If you replace Rscript -e with R -e, then it echoes the code that is being run to STDOUT, and it also results in the error ignoring SIGPIPE signal if the R command is followed by a command like head -n1 which exits before it has read the whole STDIN.
quote="" can be removed if the input doesn't contain double quotes or single quotes, and comment.char="" can be removed if the input doesn't contain lines that start with a hash character.
For a big input file, fread and fwrite from data.table are faster than read.table and write.table:
$ seq 1e6|awk 'ORS=NR%1e3?FS:RS'>a
$ time Rscript --no-init-file -e 'write.table(t(read.table("a")),quote=F,col.names=F,row.names=F)'>/dev/null
real 0m1.061s
user 0m0.983s
sys 0m0.074s
$ time Rscript --no-init-file -e 'write.table(t(data.table::fread("a")),quote=F,col.names=F,row.names=F)'>/dev/null
real 0m0.599s
user 0m0.535s
sys 0m0.048s
$ time Rscript --no-init-file -e 'data.table::fwrite(t(data.table::fread("a")),sep=" ",col.names=F)'>t/b
x being coerced from class: matrix to data.table
real 0m0.375s
user 0m0.296s
sys 0m0.073s
jq
tp(){ jq -R .|jq --arg x "${1-$'\t'}" -sr 'map(./$x)|transpose|map(join($x))[]';}
jq -R . prints each input line as a JSON string literal, -s (--slurp) creates an array for the input lines after parsing each line as JSON, and -r (--raw-output) outputs the contents of strings instead of JSON string literals. The / operator is overloaded to split strings.
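For example, with a comma separator (same demo as in the Ruby section below):
$ seq 4|paste -d, - -|tp ,
1,3
2,4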
Ruby
ruby -e'STDIN.map{|x|x.chomp.split(",",-1)}.transpose.each{|x|puts x*","}'
The -1 argument to split disables discarding empty fields at the end:
$ ruby -e'p"a,,".split(",")'
["a"]
$ ruby -e'p"a,,".split(",",-1)'
["a", "", ""]
Function form:
$ tp(){ ruby -e's=ARGV[0];STDIN.map{|x|x.chomp.split(s==" "?/ /:s,-1)}.transpose.each{|x|puts x*s}' -- "${1-$'\t'}";}
$ seq 4|paste -d, - -|tp ,
1,3
2,4
The function above uses s==" "?/ /:s because when the argument to the split function is a single space, it enables awk-like special behavior where strings are split based on contiguous runs of spaces and tabs:
$ ruby -e'p" a \tb ".split(" ",-1)'
["a", "b", ""]
$ ruby -e'p" a \tb ".split(/ /,-1)'
["", "a", "", "\tb", ""]
A Python solution:
python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output
The above is based on the following:
import sys
for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
print(' '.join(c))
This code does assume that every line has the same number of columns (no padding is performed).
Have a look at GNU datamash which can be used like datamash transpose.
A future version will also support cross tabulation (pivot tables)
Here is how you would do it with space separated columns:
datamash transpose -t ' ' < file > transposed_file
the transpose project on sourceforge is a coreutil-like C program for exactly that.
gcc transpose.c -o transpose
./transpose -t input > output #works with stdin, too.
Pure BASH, no additional process. A nice exercise:
declare -a array=( ) # we build a 1-D-array
read -a line < "$1" # read the headline
COLS=${#line[@]} # save number of columns
index=0
while read -a line ; do
for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
array[$index]=${line[$COUNTER]}
((index++))
done
done < "$1"
for (( ROW = 0; ROW < COLS; ROW++ )); do
for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
printf "%s\t" ${array[$COUNTER]}
done
printf "\n"
done
GNU datamash is perfectly suited for this problem with only one line of code and potentially arbitrarily large filesize!
datamash -W transpose infile > outfile
There is a purpose built utility for this,
GNU datamash utility
apt install datamash
datamash transpose < yourfile
Taken from this site, https://www.gnu.org/software/datamash/ and http://www.thelinuxrain.com/articles/transposing-rows-and-columns-3-methods
Here is a moderately solid Perl script to do the job. There are many structural analogies with @ghostdog74's awk solution.
#!/bin/perl -w
#
# SO 1729824
use strict;
my(%data); # main storage
my($maxcol) = 0;
my($rownum) = 0;
while (<>)
{
my(@row) = split /\s+/;
my($colnum) = 0;
foreach my $val (@row)
{
$data{$rownum}{$colnum++} = $val;
}
$rownum++;
$maxcol = $colnum if $colnum > $maxcol;
}
my $maxrow = $rownum;
for (my $col = 0; $col < $maxcol; $col++)
{
for (my $row = 0; $row < $maxrow; $row++)
{
printf "%s%s", ($row == 0) ? "" : "\t",
defined $data{$row}{$col} ? $data{$row}{$col} : "";
}
print "\n";
}
With the sample data size, the performance difference between perl and awk was negligible (1 millisecond out of 7 total). With a larger data set (100x100 matrix, entries 6-8 characters each), perl slightly outperformed awk - 0.026s vs 0.042s. Neither is likely to be a problem.
Representative timings for Perl 5.10.1 (32-bit) vs awk (version 20040207 when given '-V') vs gawk 3.1.7 (32-bit) on MacOS X 10.5.8 on a file containing 10,000 lines with 5 columns per line:
Osiris JL: time gawk -f tr.awk xxx > /dev/null
real 0m0.367s
user 0m0.279s
sys 0m0.085s
Osiris JL: time perl -f transpose.pl xxx > /dev/null
real 0m0.138s
user 0m0.128s
sys 0m0.008s
Osiris JL: time awk -f tr.awk xxx > /dev/null
real 0m1.891s
user 0m0.924s
sys 0m0.961s
Osiris-2 JL:
Note that gawk is vastly faster than awk on this machine, but still slower than perl. Clearly, your mileage will vary.
Assuming all your rows have the same number of fields, this awk program solves the problem:
{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}
In words, as you loop over the rows, for every field f grow a ':'-separated string col[f] containing the elements of that field. After you are done with all the rows, print each one of those strings in a separate line. You can then substitute ':' for the separator you want (say, a space) by piping the output through tr ':' ' '.
Example:
$ echo "1 2 3\n4 5 6"
1 2 3
4 5 6
$ echo "1 2 3\n4 5 6" | awk '{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}' | tr ':' ' '
1 4
2 5
3 6
If you have sc installed, you can do:
psc -r < inputfile | sc -W% - > outputfile
I normally use this little awk snippet for this requirement:
awk '{for (i=1; i<=NF; i++) a[i,NR]=$i
max=(max<NF?NF:max)}
END {for (i=1; i<=max; i++)
{for (j=1; j<=NR; j++)
printf "%s%s", a[i,j], (j==NR?RS:FS)
}
}' file
This just loads all the data into a bidimensional array a[line,column] and then prints it back as a[column,line], so that it transposes the given input.
It needs to keep track of the maximum number of columns the initial file has, which is then used as the number of rows to print back.
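Run against the question's sample file, it prints:
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11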
A hackish perl solution can look like this. It's nice because it doesn't load the whole file in memory; it prints intermediate temp files and then uses the all-wonderful paste
#!/usr/bin/perl
use warnings;
use strict;
my $counter;
open INPUT, "<$ARGV[0]" or die ("Unable to open input file!");
while (my $line = <INPUT>) {
chomp $line;
my @array = split ("\t",$line);
open OUTPUT, ">temp$." or die ("unable to open output file!");
print OUTPUT join ("\n",@array);
close OUTPUT;
$counter=$.;
}
close INPUT;
# paste files together
my $execute = "paste ";
foreach (1..$counter) {
$execute.="temp$counter ";
}
$execute.="> $ARGV[1]";
system $execute;
The only improvement I can see to your own example is using awk which will reduce the number of processes that are run and the amount of data that is piped between them:
/bin/rm output 2> /dev/null
cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do
awk '{printf ("%s%s", tab, $'$i'); tab="\t"} END {print ""}' input
done >> output
Some *nix standard util one-liners, no temp files needed. NB: the OP wanted an efficient fix (i.e. faster), and the top answers are usually faster than this answer. These one-liners are for those who like *nix software tools, for whatever reasons. In rare cases (e.g. scarce IO & memory), these snippets can actually be faster than some of the top answers.
Call the input file foo.
If we know foo has four columns:
for f in 1 2 3 4 ; do cut -d ' ' -f $f foo | xargs echo ; done
If we don't know how many columns foo has:
n=$(head -n 1 foo | wc -w)
for f in $(seq 1 $n) ; do cut -d ' ' -f $f foo | xargs echo ; done
xargs has a size limit and therefore would do incomplete work with a long file. The size limit is system dependent, e.g.:
{ timeout '.01' xargs --show-limits ; } 2>&1 | grep Max
Maximum length of command we could actually use: 2088944
tr & echo:
for f in 1 2 3 4; do cut -d ' ' -f $f foo | tr '\n' ' ' ; echo; done
...or if the # of columns are unknown:
n=$(head -n 1 foo | wc -w)
for f in $(seq 1 $n); do
cut -d ' ' -f $f foo | tr '\n' ' ' ; echo
done
Using set, which like xargs, has similar command line size based limitations:
for f in 1 2 3 4 ; do set - $(cut -d ' ' -f $f foo) ; echo $@ ; done
I used fgm's solution (thanks fgm!), but needed to eliminate the tab characters at the end of each row, so modified the script thus:
#!/bin/bash
declare -a array=( ) # we build a 1-D-array
read -a line < "$1" # read the headline
COLS=${#line[@]} # save number of columns
index=0
while read -a line; do
for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
array[$index]=${line[$COUNTER]}
((index++))
done
done < "$1"
for (( ROW = 0; ROW < COLS; ROW++ )); do
for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
printf "%s" ${array[$COUNTER]}
if [ $COUNTER -lt $(( ${#array[@]} - $COLS )) ]
then
printf "\t"
fi
done
printf "\n"
done
I was just looking for a similar bash transpose, but with support for padding. Here is the script I wrote based on fgm's solution, which seems to work. If it can be of help...
#!/bin/bash
declare -a array=( ) # we build a 1-D-array
declare -a ncols=( ) # we build a 1-D-array containing number of elements of each row
SEPARATOR="\t";
PADDING="";
MAXROWS=0;
index=0
indexCol=0
while read -a line; do
ncols[$indexCol]=${#line[@]};
((indexCol++))
if [ ${#line[@]} -gt ${MAXROWS} ]
then
MAXROWS=${#line[@]}
fi
for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
array[$index]=${line[$COUNTER]}
((index++))
done
done < "$1"
for (( ROW = 0; ROW < MAXROWS; ROW++ )); do
COUNTER=$ROW;
for (( indexCol=0; indexCol < ${#ncols[@]}; indexCol++ )); do
if [ $ROW -ge ${ncols[indexCol]} ]
then
printf $PADDING
else
printf "%s" ${array[$COUNTER]}
fi
if [ $((indexCol+1)) -lt ${#ncols[@]} ]
then
printf $SEPARATOR
fi
COUNTER=$(( COUNTER + ncols[indexCol] ))
done
printf "\n"
done
I was looking for a solution to transpose any kind of matrix (nxn or mxn) with any kind of data (numbers or data) and got the following solution:
Row2Trans=number1
Col2Trans=number2
for ((i=1; $i <= Row2Trans; i++)); do
for ((j=1; $j <=Col2Trans ; j++));do
awk -v var1="$i" -v var2="$j" 'BEGIN { FS = "," } ; NR==var1 {print $((var2)) }' $ARCHIVO >> Column_$i
done
done
paste -d',' `ls -mv Column_* | sed 's/,//g'` >> $ARCHIVO
If you only want to grab a single (comma delimited) line $N out of a file and turn it into a column:
head -$N file | tail -1 | tr ',' '\n'
Not very elegant, but this "single-line" command solves the problem quickly:
cols=4; for((i=1;i<=$cols;i++)); do \
awk '{print $'$i'}' input | tr '\n' ' '; echo; \
done
Here cols is the number of columns, where you can replace 4 by head -n 1 input | wc -w.
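Put together, with the column count computed instead of hard-coded:
cols=$(head -n 1 input | wc -w)
for (( i=1; i<=cols; i++ )); do
    awk '{print $'$i'}' input | tr '\n' ' '; echo
done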
Another awk solution; the input size is limited only by the memory you have.
awk '{ for (i=1; i<=NF; i++) RtoC[i]= (RtoC[i]? RtoC[i] FS $i: $i) }
END{ for (i in RtoC) print RtoC[i] }' infile
This joins the fields at each position number together, and in END prints the result: the first row is built from the first column, the second row from the second column, etc.
Will output:
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
#!/bin/bash
aline="$(head -n 1 file.txt)"
set -- $aline
colNum=$#
#set -x
while read line; do
set -- $line
for i in $(seq $colNum); do
eval col$i="\"\$col$i \$$i\""
done
done < file.txt
for i in $(seq $colNum); do
eval echo \${col$i}
done
another version with set eval
Here is a Bash one-liner that is based on simply converting each line to a column and paste-ing them together:
echo '' > tmp1; \
cat m.txt | while read l ; \
do paste tmp1 <(echo $l | tr -s ' ' \\n) > tmp2; \
cp tmp2 tmp1; \
done; \
cat tmp1
m.txt:
0 1 2
4 5 6
7 8 9
10 11 12
creates tmp1 file so it's not empty.
reads each line and transforms it into a column using tr
pastes the new column to the tmp1 file
copies result back into tmp1.
PS: I really wanted to use io-descriptors but couldn't get them to work.
Another bash variant
$ cat file
XXXX col1 col2 col3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11
Script
#!/bin/bash
I=0
while read line; do
i=0
for item in $line; { printf -v A$I[$i] $item; ((i++)); }
((I++))
done < file
indexes=$(seq 0 $i)
for i in $indexes; {
J=0
while ((J<I)); do
arr="A$J[$i]"
printf "${!arr}\t"
((J++))
done
echo
}
Output
$ ./test
XXXX row1 row2 row3 row4
col1 0 3 6 9
col2 1 4 7 10
col3 2 5 8 11
I'm a little late to the game but how about this:
cat table.tsv | python -c "import pandas as pd, sys; pd.read_csv(sys.stdin, sep='\t').T.to_csv(sys.stdout, sep='\t')"
or zcat if it's gzipped.
This is assuming you have pandas installed in your version of python
Here's a Haskell solution. When compiled with -O2, it runs slightly faster than ghostdog's awk and slightly slower than Stephan's thinly wrapped c python on my machine for repeated "Hello world" input lines. Unfortunately GHC's support for passing command line code is non-existent as far as I can tell, so you will have to write it to a file yourself. It will truncate the rows to the length of the shortest row.
transpose :: [[a]] -> [[a]]
transpose = foldr (zipWith (:)) (repeat [])
main :: IO ()
main = interact $ unlines . map unwords . transpose . map words . lines
An awk solution that stores the whole array in memory
awk '$0!~/^$/{ i++;
split($0,arr,FS);
for (j in arr) {
out[i,j]=arr[j];
if (maxr<j){ maxr=j} # max number of output rows.
}
}
END {
maxc=i # max number of output columns.
for (j=1; j<=maxr; j++) {
for (i=1; i<=maxc; i++) {
printf( "%s:", out[i,j])
}
printf( "%s\n","" )
}
}' infile
But we may "walk" the file as many times as output rows are needed:
#!/bin/bash
maxf="$(awk '{if (mf<NF); mf=NF}; END{print mf}' infile)"
rowcount=maxf
for (( i=1; i<=rowcount; i++ )); do
awk -v i="$i" -F " " '{printf("%s\t ", $i)}' infile
echo
done
Which (for a low count of output rows) is faster than the previous code.
A oneliner using R...
cat file | Rscript -e "d <- read.table(file('stdin'), sep=' ', row.names=1, header=T); write.table(t(d), file=stdout(), quote=F, col.names=NA) "
I've used the two scripts below to do similar operations before. The first is in awk, which is a lot faster than the second, which is in "pure" bash. You might be able to adapt them to your own application.
awk '
{
for (i = 1; i <= NF; i++) {
s[i] = s[i]?s[i] FS $i:$i
}
}
END {
for (i in s) {
print s[i]
}
}' file.txt
declare -a arr
while IFS= read -r line
do
i=0
for word in $line
do
[[ ${arr[$i]} ]] && arr[$i]="${arr[$i]} $word" || arr[$i]=$word
((i++))
done
done < file.txt
for ((i=0; i < ${#arr[@]}; i++))
do
echo ${arr[i]}
done
Simple 4 line answer, keep it readable.
col="$(head -1 file.txt | wc -w)"
for i in $(seq 1 $col); do
awk '{ print $'$i' }' file.txt | paste -s -d "\t"
done

While using awk, showing fatal: cannot open pipe (Too many open files) error

I was trying to mask a file with the commands 'tr' and 'awk' but am failing with the error "fatal: cannot open pipe (Too many open pipes)". FILE has approx 1000000 records, quite a huge number.
Below is the code I am trying :-
awk - F "|" - v OFS="|" '{ "echo \""$1"\" | tr \" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\" \" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq\"" | get line $1}1' FILE.CSV > test.CSV
It is showing error :-
awk: (FILENAME=- FNR=1019) fatal: cannot open pipe `echo ""TTP_123"" | tr "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" "QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq"' (Too many open pipes)
Please let me know what I am doing wrong here
Also a note: any number of columns could be used for masking, and they can be at any positions. In this example I have taken column positions 1 and 2, but it could be 3 and 10, or columns 5, 7, 25.
Thanks
AJ
First things first, you can't have a space between - and F or v.
I was going to suggest sed, but as you only want to translate the first column, that's not as easy.
Unfortunately, awk doesn't have built-in tr functionality, so you'd have to use the shell like you are and just close the pipe:
awk -F "|" -v OFS="|" '{
command="echo \"\\"$1"\\\" | tr \" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\" \" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq\""
command | getline $1
close(command)
}1' FILE.CSV > test.CSV
However, I suggest using perl, which can do field splitting and character translation:
perl -F'\|' -lane '$F[0] =~ tr/0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq/; print join("|", @F)' FILE.CSV > test.CSV
Or, for a shorter command line, just put the program into a file, drop the e in -lane and use the file name instead of the '...' command.
you can do the mapping in awk instead of making a system call for each line, or perhaps simply
paste -d'|' <(cut -d'|' -f1 file | tr '0-9' 'a-z') <(cut -d'|' -f2- file)
replace the tr arguments with yours.
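With the character sets from the question, that looks like the following (a sketch; it assumes only the first column needs masking, as in the question's example):
paste -d'|' \
    <(cut -d'|' -f1 FILE.CSV | tr '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ' 'QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq') \
    <(cut -d'|' -f2- FILE.CSV) > test.CSV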
This does not answer your question, but you can implement tr as an awk function that would save having to spawn lots of external processes
$ cat tr.awk
function tr(str, from, to, s,i,c,idx) {
s = ""
for (i=1; i<=length(str); i++) {
c = substr(str, i, 1)
idx = index(from, c)
s = s (idx == 0 ? c : substr(to, idx, 1))
}
return s
}
{
print $1, tr($1,
" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",
" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq")
}
Example:
$ printf "%s\n" hello wor-ld | awk -f tr.awk
hello KGCCN
wor-ld 3N8-CF

awk field separator within the xml

I have an xml file with the following data.
<record record_no = "2" error_code="100">"18383531";"22677833";"21459732";"41001";"394034";"0208";"Prime Lending - ;Corporate - 2201";"";"Prime Lending - Lacey - 2508";"Prime Lending - Lacey - 2508";"1";"rrvc";"Tiffany Poe";"HEIDI";"BUNDY";"000002274";"2.0";"18.0";"2";"362661";"Rejected by IRS";"A1AAA";"20160720";"1021";"HEDI & Bundy";"4985045838";"PPASSESS";"Web";"3683000826";"823";"IC W2";"";"";"";"";"Rapid_20160801_Monthly.txt";"20160720102100";"";"20160803095309";"286023";"RGT";"1";"14702324400223";"14702324400223";"0";"OMCProcessed"
I'm using the following code:
cat RR_00404.fin.bc_lerr.xml.bc | awk 'BEGIN { FS=OFS=";" } /<record/ { gsub(/&quot;/,"\""); gsub(/.*="|">.*/,"",$1); print $1,$40,$43,$46,"'base_err_xml'","0",$7; }'
The idea is to do the following:
Replace &quot; with "
Extract the error_code
Print the " and ; separated values.
Use sqlldr to load (not to worry about this).
Problem to solve:
There is a ; within the text, e.g. Prime Lending - ;Corporate - 2201
There is also an & within the text
Output:
100;"20160803095309";"1";"1";"base_err_xml";"0";"Prime Lending
100;"286023";"14702324400223";"OMCProcessed";"base_err_xml";"0";"Prime Lending - Corporate - 2201"
100;"286024-1";"";"OMCProcessed";"base_err_xml";"0";"Prime Lending - Corporate - 2201"
awk is the wrong tool for this job, without some preprocessing. Here, we use XMLStarlet for the first pass (decoding all XML entities and splitting attributes off into separate fields), and GNU awk for the second (reading those fields and performing whatever transforms or logic you actually need):
#!/bin/sh
# reads XML on stdin; puts record_no in first field, error code in second,
# ...record content for remainder of output line.
xmlstarlet sel -t -m '//record' \
-v ./@record_no -o ';' \
-v ./@error_code -o ';' \
-v . -n
...and, cribbed from the GNU awk documentation...
#!/bin/env gawk -f
# must be GNU awk for the FPAT feature
BEGIN {
FPAT = "([^;]*)|(\"[^\"]*\")"
}
{
print "NF = ", NF
for (i = 1; i <= NF; i++) {
printf("$%d = <%s>\n", i, $i)
}
}
Here, what we're doing with gawk is just showing how the fields get split, but obviously, you can modify the script for whatever needs you have.
A subset of output from the above for your given input file (when extended to actually be valid XML) is quoted below:
$1 = <2>
$2 = <100>
$9 = <"Prime Lending - ;Corporate - 2201">
Note, then, that $1 is the record_no, $2 is the error_code, and $9 correctly contains the semicolon as literal content.
Obviously, you can encapsulate both these components in shell functions to avoid the need for separate files.
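For instance (a sketch only; the function names are made up, and the input file name is taken from the question):
split_records() {
    xmlstarlet sel -t -m '//record' \
        -v ./@record_no -o ';' \
        -v ./@error_code -o ';' \
        -v . -n "$1"
}
show_fields() {
    gawk 'BEGIN { FPAT = "([^;]*)|(\"[^\"]*\")" }
          { for (i = 1; i <= NF; i++) printf("$%d = <%s>\n", i, $i) }'
}
split_records RR_00404.fin.bc_lerr.xml.bc | show_fields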

Using awk to pull specific lines from a file

I have two files, one file is my data, and the other file is a list of line numbers that I want to extract from my data file. Can I use awk to read in my lines file, and then extract the lines that match the line numbers?
Example:
Data file:
This is the first line of my data
This is the second line of my data
This is the third line of my data
This is the fourth line of my data
This is the fifth line of my data
Line numbers file
1
4
5
Output:
This is the first line of my data
This is the fourth line of my data
This is the fifth line of my data
I've only ever used command line awk and sed for really simple stuff. This is way beyond me and I have been googling for an hour without an answer.
awk 'NR == FNR {nums[$1]; next} FNR in nums' numberfile datafile
Simply referring to an array subscript creates the entry. While looping over the first file, NR (the overall record number) is equal to FNR (the per-file record number), so the first block stores all the line numbers in the array and next skips the rest of the program. After that, when FNR of the second file is present in the array (true), the line is printed (printing being the default action for a true pattern).
One way with sed:
sed 's/$/p/' linesfile | sed -n -f - datafile
You can use the same trick with awk:
sed 's/^/NR==/' linesfile | awk -f - datafile
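To see why this works, look at the awk program the sed step generates from the question's line-numbers file; each NR==N line is a pattern whose default action is to print the matching record:
$ sed 's/^/NR==/' linesfile
NR==1
NR==4
NR==5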
Edit - Huge files alternative
With a huge number of lines, it is not prudent to keep whole files in memory. The solution in that case can be to sort the numbers file and read one line at a time. The following has been tested with GNU awk:
extract.awk
BEGIN {
getline n < linesfile
if(length(ERRNO)) {
print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
exit
}
}
NR == n {
print
if(!(getline n < linesfile)) {
if(length(ERRNO))
print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
exit
}
}
Run it like this:
awk -v linesfile=$linesfile -f extract.awk infile
Testing:
echo "2
4
7
8
10
13" | awk -v linesfile=/dev/stdin -f extract.awk <(paste <(seq 50e3) <(seq 50e3 | tac))
Output:
2 49999
4 49997
7 49994
8 49993
10 49991
13 49988
Here is an awk example. inputfile is loaded up front, then matching records of datafile are output.
awk \
-v RS="[\r]*[\n]" \
-v FILE="inputfile" \
'BEGIN \
{
LINES = ","
while ((getline Line < FILE))
{
LINES = LINES Line ","
}
}
LINES ~ "," NR "," \
{
print
}
' datafile
I had the same problem. This is the solution already posted by Thor:
cat datafile \
| awk 'BEGIN{getline n<"numbers"} n==NR{print; getline n<"numbers"}'
If like me you don't have a numbers file, but it is instead passed on from stdin and you don't want to generate a temporary numbers file, then this is an alternative solution:
cat numbers \
| awk '{while((getline line<"datafile")>0) {n++; if(n==$0) {print line;next}}}'
This solution...
awk 'NR == FNR {nums[$1]; next} FNR in nums' numberfile datafile
...only prints unique numbers in the numberfile. What if the numberfile contains repeated entries? Then sed is a better (but much slower) alternative:
sed -nf <(sed 's/.*/&p/' numberfile) datafile
while read line; do echo $(sed -n "${line}p" Datafile.txt); done < numbersfile.txt