add column of 1's to tab-delimited file - awk

Can't find a solution, even though thousands of variations of this question have been asked.
I want to add a column of 1's to a tab-delimited file using awk or sed.
The file will have about 20 million lines, so something efficient would be nice.
turn this:
a b c
r j k
i t w
into this:
a b c 1
r j k 1
i t w 1

One simple way: set the input and output field separators to a tab. The NF variable holds the number of the last column, so assigning to $(NF+1) creates a new one; give it the value 1 and print:
awk 'BEGIN { FS = OFS = "\t" } { $(NF+1) = 1; print $0 }' infile
It yields:
a b c 1
r j k 1
i t w 1

Code for sed:
sed 's/$/&\t1/' file
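Note that \t in the replacement is a GNU sed extension; BSD sed (e.g. on macOS) does not understand \t there. In bash or zsh you can let the shell supply a real tab instead:
sed $'s/$/\t1/' file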

Assuming you used awk -F'\t' instead of just awk:
{
print $0 FS 1;
}
If you didn't use the -F option, replace FS 1 with "\t1".
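Putting that together, either of these complete one-liners should do the job (same logic as above, just spelled out):
awk -F'\t' '{ print $0 FS 1 }' infile
awk '{ print $0 "\t" 1 }' infile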

Related

how to keep newline(s) when selecting a given column with awk

Suppose I have a file like this (disclaimer: this is not fixed; I can have more than 7 rows and more than 4 columns)
R H A 23
S E A 45
T E A 34
U A 35
Y T A 35
O E A 353
J G B 23
I want to select the second column whenever the third column is A, but keep the newline or whitespace character.
output should be:
HEE TE
I tried this:
awk '{if ($3=="A") print $2}' file | awk 'BEGIN{ORS = ""}{print $1}'
But this gives:
HEETE%
which has a weird % at the end and is missing the space.
You may use this GNU awk solution using FIELDWIDTHS:
awk 'BEGIN{ FIELDWIDTHS = "1 1 1 1 1 1 *" } $5 == "A" {s = s $3}
END {print s}' file
HEE TE
awk splits each record using the width values provided in the FIELDWIDTHS variable.
"1 1 1 1 1 1 *" means each of the first 6 columns is a single character wide and the remaining text is filled into the 7th column. Since there is a space after each value, $2, $4 and $6 each hold a single space, while $1, $3 and $5 hold the values from the input.
$5 == "A" {s = s $3}: here we check whether $5 is A, and if so we keep appending the value of $3 to the variable s. In the END block we just print s.
Without fixed-width parsing, awk treats the A in the 4th row as $2, which is why plain whitespace splitting loses track of the columns. (The weird % in your attempt's output is just the shell, likely zsh, marking a missing trailing newline, a side effect of ORS = "".)
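A quick way to inspect how FIELDWIDTHS splits a record is to bracket each field (GNU awk only, and 4.2 or later for the * width; the brackets are just for inspection):
awk 'BEGIN{ FIELDWIDTHS = "1 1 1 1 1 1 *" }
     { for (i=1; i<=NF; i++) printf "[%s]", $i; print "" }' file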
Alternatively, if we make the trailing spaces part of the column values, use:
awk '
BEGIN{ FIELDWIDTHS = "2 2 2 *" }
$3 == "A " {s = s substr($2,1,1)}
END {print s}
' file

linux csv file concatenate columns into one column

I've been looking to do this with sed, awk, or cut. I am willing to use any other command-line program that I can pipe data through.
I have a large set of data that is comma delimited. The rows have between 14 and 20 columns. I need to repeatedly concatenate column 10 with column 11 per row until every row has exactly 14 columns. In other words, this:
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
will become:
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
I can get the first 10 columns. I can get the last N columns. I can concatenate columns. I cannot think of how to do it in one line so I can pass a stream of endless data through it and end up with exactly 14 columns per row.
Examples (by request):
How many columns are in the row?
sed 's/[^,]//g' | wc -c
Get the first 10 columns:
cut -d, -f1-10
Get the last 4 columns:
rev | cut -d, -f1-4 | rev
Concatenate columns 10 and 11, showing columns 1-10 after that:
awk -F',' ' NF { print $1","$2","$3","$4","$5","$6","$7","$8","$9","$10$11}'
Awk solution:
awk 'BEGIN{ FS=OFS="," }
{
    diff = NF - 14
    for (i=1; i <= NF; i++)
        printf "%s%s", $i, (diff >= 1 && i >= 10 && i < (10+diff) ? "" : (i == NF ? ORS : ","))
}' file
(Note the test is diff >= 1 rather than diff > 1, so a 15-column row, which needs exactly one merge, is handled too.)
The output:
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
With GNU awk for the 3rd arg to match() and gensub():
$ cat tst.awk
BEGIN{ FS="," }
match($0,"(([^,]+,){9})(([^,]+,){"NF-14"})(.*)",a) {
$0 = a[1] gensub(/,/,"","g",a[3]) a[5]
}
{ print }
$ awk -f tst.awk file
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
If perl is okay, it can be used just like awk for stream processing.
$ cat ip.txt
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
1,2,3,4,5,6,3,4,2,4,3,4,3,2,5,2,3,4
1,2,3,4,5,6,3,4,2,4,a,s,f,e,3,4,3,2,5,2,3,4
$ awk -F, '{print NF}' ip.txt
16
18
22
$ perl -F, -lane '$n = $#F - 4;
print join ",", (@F[0..8], join("", @F[9..$n]), @F[$n+1..$#F])
' ip.txt
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
1,2,3,4,5,6,3,4,2,43432,5,2,3,4
1,2,3,4,5,6,3,4,2,4asfe3432,5,2,3,4
-F, -lane: split on ,, with the results saved in the @F array
$n = $#F - 4: magic number to ensure the output ends with 14 columns. $#F gives the index of the last element of the array (this won't work if an input line has fewer than 14 columns)
join stitches array elements together with the specified string
@F[0..8]: array slice with the first 9 elements
@F[9..$n] and @F[$n+1..$#F]: the other slices as needed
Borrowing from Ed Morton's regex based solution
$ perl -F, -lape '$n=$#F-13; s/^([^,]*,){9}\K([^,]*,){$n}/$&=~tr|,||dr/e' ip.txt
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
1,2,3,4,5,6,3,4,2,43432,5,2,3,4
1,2,3,4,5,6,3,4,2,4asfe3432,5,2,3,4
$n=$#F-13 magic number
^([^,]*,){9}\K first 9 fields
([^,]*,){$n} fields to change
$&=~tr|,||dr use tr to delete the commas
e this modifier allows use of Perl code in replacement section
this solution also has the added advantage of working even if an input line has fewer than 14 fields
You can try this GNU sed solution:
sed -E '
  # replace the 9th and later commas with newlines (the 9g flag is a GNU extension)
  s/,/\n/9g
  :A
  # while more than 14 fields remain, delete one newline,
  # merging two of the middle fields
  s/([^\n]*\n)(.*)(\n)(([^\n]*\n){4})/\1\2\4/
  tA
  # turn the remaining newlines back into commas
  s/\n/,/g
' infile
First variant - with awk
awk -F, '
{
    for (i = 1; i <= NF; i++) {
        OFS = (i > 9 && i < NF - 4) ? "" : ","
        if (i == NF) OFS = "\n"
        printf "%s%s", $i, OFS
    }
}' input.txt
Second variant - with sed
sed -r 's/,/#/10g; :l; s/#(.*)((#[^#]){4})/\1\2/; tl; s/#/,/g' input.txt
or, more straightforwardly (without a loop) and probably faster, though note that both sed variants assume the last four fields are single characters, as in the test input below:
sed -r 's/,(.),(.),(.),(.)$/#\1#\2#\3#\4/; s/,//10g; s/#/,/g' input.txt
Testing
Input
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u
Output
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
a,b,c,d,e,f,g,h,i,jklmn,o,p,q,r
a,b,c,d,e,f,g,h,i,jklmnopq,r,s,t,u
Solved a similar problem using csvtool. Source file, copied from one of the other answers:
$ cat input.txt
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
1,2,3,4,5,6,3,4,2,4,3,4,3,2,5,2,3,4
1,2,3,4,5,6,3,4,2,4,a,s,f,e,3,4,3,2,5,2,3,4
Concatenating columns:
$ cat input.txt | csvtool format '%1,%2,%3,%4,%5,%6,%7,%8,%9,%10%11%12,%13,%14,%15,%16,%17,%18,%19,%20,%21,%22\n' -
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p,,,,,,
1,2,3,4,5,6,3,4,2,434,3,2,5,2,3,4,,,,
1,2,3,4,5,6,3,4,2,4as,f,e,3,4,3,2,5,2,3,4

Print the 1st and every nth column of a text file using awk

I have a txt file that contains a total of 10177 columns and approximately 450,000 rows. The information is separated by tabs. I am trying to trim the file down using awk so that it only prints columns 1-3, the 5th column, and every 14th column after the fifth.
My file has a format that looks like:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ... 10177
A B C D E F G H I J K L M N O P Q R S T ...
X Y X Y X Y X Y X Y X Y X Y X Y X Y X Y ...
I am hoping to generate an output txt file (also separated with tab) that contains:
1 2 3 5 18 ...
A B C E R ...
X Y X X Y ...
The current awk code I have looks like (I am using cygwin to use the code):
$ awk -F"\t" '{OFS="\t"} { for(i=5;i<10177;i+=14) printf ($i) }' test2.txt > test3.txt
But the result I am getting shows something like:
123518...ABCER...XYXXY...
When opened with excel program, the results are all mashed into 1 single cell.
In addition, when I try to include the code
for (i=0;i<=3;i++) printf "%s ",$i
in the awk script to get the first 3 columns, it just prints the original input document together with the mashed result. I am not familiar with awk, so I am not sure what causes this issue.
Awk field numbers, string indices, and array indices all start at 1, not 0, so when you do:
for (i=0;i<=3;i++) printf "%s ",$i
the first iteration prints $0, which is the whole record.
You're on the right track with:
$ awk -F"\t" '{OFS="\t"} { for(i=5;i<10177;i+=14) printf ($i) }' test2.txt > test3.txt
but never pass input data as the only argument to printf: printf then treats it as a format string with no data arguments (rather than what you want, a plain string format plus your data), and that will fail cryptically if/when your input contains formatting characters like %s or %d. So always use printf "%s", $i, never printf $i.
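A minimal illustration of the pitfall, using a made-up input line (the %s in the data gets taken as a format specifier):
$ echo 'rate is 100%s' | awk '{ printf $0 }'         # error or garbage, depending on the awk
$ echo 'rate is 100%s' | awk '{ printf "%s\n", $0 }' # prints the line intact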
The problem you're having with Excel, I would guess, is that you're double-clicking on the file and hoping Excel knows what to do with it (it won't, unlike if this were a CSV). You can import tab-separated files into Excel after it's opened, though - google that.
You want something like:
awk '
BEGIN { FS=OFS="\t" }
{
    for (i=1; i<=3; i++) {
        printf "%s%s", (i>1 ? OFS : ""), $i
    }
    for (i=5; i<=NF; i+=14) {
        printf "%s%s", OFS, $i
    }
    print ""
}
' file
I highly recommend the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
In awk, using the conditional operator in the for increment:
$ awk 'BEGIN { FS=OFS="\t" }
{
for(i=1; i<=NF; i+=( i<3 ? 1 : ( i==3 ? 2 : 14 )))
printf "%s%s", $i, ( (i+14)>NF ? ORS : OFS)
}' file
1 2 3 5 19
A B C E S
X Y X X X
In the for loop: if i<3, increment by one; if i==3, increment by two to get to 5; after that, increment by 14.
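If the nested ternary is hard to read, the same traversal can be written with explicit if/else increments (equivalent logic, just more verbose):
awk 'BEGIN { FS=OFS="\t" }
{
    for (i=1; i<=NF; ) {
        printf "%s%s", $i, ((i+14)>NF ? ORS : OFS)
        if (i < 3)       i++        # columns 1, 2, 3 one at a time
        else if (i == 3) i += 2     # jump from 3 to 5
        else             i += 14    # then every 14th column
    }
}' file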
I would be tempted to solve the problem along the following lines; I think you'll find you save time by not iterating in awk. Build the column list in the shell (the awk program is double-quoted so the shell can expand $cols, and paste joins the generated list onto one line, since awk would not accept a line break before a comma):
$ cols="$( { echo 1 2 3; seq 5 14 10177; } | sed 's/^/$/' | paste -sd, - )"
$ awk -F'\t' -v OFS='\t' "{print $cols}" test.txt

awk '{print $<number>}' but without knowing <number> beforehand

Printing a specific column per line by piping to awk is fine. But how do I do it if I do not know which column it is, other than that its first row matches something?
Example.
Title1 Title2 TargetTitle Title3
x y z a
b c d e
From the above table, I want to extract only:
z
d
BUT, two problems:
1) I don't know exactly the column number
2) I don't want the first row (not a big deal, I can just sed lines 2 to $).
Thanks.
You can build your output using awk like this:
awk -v OFS='\t' 'NR>1{for (i=1; i<=NF; i++) {
if ($i=="b"||$i=="d") $i=""; printf "%s%s", $i, (i==NF)?ORS:OFS}}' file
x y z a
c e
To filter out one column, you could use something like this:
awk -v title="TargetTitle" 'NR==1 { for (i=1;i<=NF;++i) if ($i==title) col=i }
NR>1 { for (i=1;i<=NF;++i) if (i!=col) printf "%s%s", $i, (i<NF?OFS:ORS)}' file
Output:
x y a
b c e
If you want to add more space between each column in the output, you can change the value of the OFS variable or change the first format specifier from %s to %4s, for example.
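For example, the same program with tab separators supplied on the command line:
awk -v title="TargetTitle" -v OFS='\t' '
NR==1 { for (i=1;i<=NF;++i) if ($i==title) col=i }
NR>1  { for (i=1;i<=NF;++i) if (i!=col) printf "%s%s", $i, (i<NF?OFS:ORS) }' file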
If you want to only print one column, you can do something like this:
awk -v title="TargetTitle" 'NR==1 { for (i=1;i<=NF;++i) if ($i==title) col=i }
NR>1 { print $col }' file
Output:
z
d

Print Specific Columns using AWK

I'm trying to fetch the data from columns B to D of a tab-delimited file "FILE". The simple AWK code I use fetches the data, but unfortunately puts the output in a single column and removes the identifiers (shown below).
Any suggestions please.
CODE
awk '{for(i=2;i<=4;++i)print $i}' FILE
FILE
A B C D E F G
1_at 10.8435630935 10.8559287854 8.6666141543 8.820310681 9.9024050571 8.613199083 11.9807771094
2_at 4.7615531106 4.5209119307 11.2467919586 8.8105151099 7.1831990104 11.0645055836 4.3726598561
3_at 6.0025262754 5.4058080843 3.2475272982 3.1869728585 3.5654989547
OUTPUT OBTAINED
B
C
D
10.8435630935
10.8559287854
8.6666141543
4.7615531106
4.5209119307
11.2467919586
6.0025262754
5.4058080843
3.2475272982
Why don't you directly use cut?
$ cut -d$'\t' -f2-4 < file
B C D
10.8435630935 10.8559287854 8.6666141543
4.7615531106 4.5209119307 11.2467919586
6.0025262754 5.4058080843 3.2475272982
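Since tab is cut's default delimiter, the -d option can even be omitted:
cut -f2-4 FILE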
With awk you would need printf to avoid the newline that print adds after every field:
awk -F"\t" '{for(i=2;i<=4;++i) printf "%s%s", $i, (i==4?RS:FS)}' FILE