Data partitioning by columns

Data partitioning by columns - awk

I have a this big matrix of 50 rows and 1.5M columns. From these 1.5M columns, the first two are my headers.
I am trying to divide my data by columns into small pieces. So for example each small set will be 50 lines and 100 columns. But each small data must have the first two columns mentioned above as the headers.
I tried
awk '{print $1"\t"$2"\t"}' test | cut -f 3-10
awk '{print $1"\t"$2"\t"}' test | cut -f 11-20
...
or
cut -f 1-2 | cut -f 3-10 test
cut -f 1-2 | cut -f 11-20 test
...
but none of the above is working.
Is there an efficient way of doing this?

One way with awk. I don't know if it (awk) can handle such a big number of columns, but give it a try. It uses modulus operator to cut line each a specific number of columns.
awk '{
## Print header of first line.
printf "%s%s%s%s", $1, FS, $2, FS
## Count number of columns printed, from 0 to 100.
count = 0
## Traverse every columns but the first two keys.
for ( i = 3; i <= NF; i++ ) {
## Print header again when counted 100 columns.
if ( count != 0 && count % 100 == 0 ) {
printf "%s%s%s%s%s", ORS, $1, FS, $2, FS
}
## Print current column and count it.
printf "%s%s", $i, FS
++count
}
## Separator between splits.
print ORS
}
' infile
I've tested it with two lines and 4 columns instead of 100. Here is the test file:
key1 key2 one two three four five six seven eight nine ten
key1 key2 one2 two2 three2 four2 five2 six2 seven2 eight2 nine2 ten2
And results in:
key1 key2 one two three four
key1 key2 five six seven eight
key1 key2 nine ten
key1 key2 one2 two2 three2 four2
key1 key2 five2 six2 seven2 eight2
key1 key2 nine2 ten2

Related

AWK select rows where all columns are equal

I have a file with tab-separated values where the number of columns is not known a priori. In other words the number of columns is consistent within a file but different files have different number of columns. The first column is a key, the other columns are some arbitrary values.
I need to filter out the rows where the values are not the same. For example, assuming that the number of columns is 4, I need to keep the first 2 rows and filter out the 3-rd:
1 A A A
2 B B B
3 C D C
I'm planning to use AWK for this purpose, but I don't know how to deal with the fact that the number of columns is unknown. The case of the known number of columns is simple, this is a solution for 4 columns:
$2 == $3 && $3 == $4 {print}
How can I generalize the solution for arbitrary number of columns?

If you guarantee no field contains regex-active chars and the first field never match the second, and there is no blank line in the input:
awk '{tmp=$0;gsub($2,"")} NF==1{print tmp}' file
Note that this solution is designed for this specific case and less extendable than others.

Another slight twist on the approach. In your case you know you want to compare fields 2-4 so you can simply loop from i=3;i<=NF checking $i!=$(i-1) for equality, and if it fails, don't print, get the next record, e.g.
awk '{for(i=3;i<=NF;i++)if($i!=$(i-1))next}1'
Example Use/Output
With your data in file.txt:
$ awk '{for(i=3;i<=NF;i++)if($i!=$(i-1))next}1' file.txt
1 A A A
2 B B B

Could you please try following. This will compare all columns from 2nd column to till last column and check if every element is equal or not. If they are all same it will print line.
awk '{for(i=3;i<=NF;i++){if($(i-1)==$i){count++}};if((NF-2)==count){print};count=""}' Input_file
OR(by hard coding $2 in code, since if $2=$3 AND $3=$4 it means $2=$3=$4 so intentionally taking $2 in comparison rather than having i-1 fetching its previous value.)
awk '{for(i=3;i<=NF;i++){if($2==$i){count++}};if((NF-2)==count){print};count=""}' Input_file

I'd use a counter t with initial value of 2 to add the number of times $i == $(i+1) where i iterates from 2 to NF-1. print the line only if t==NF is true:
awk -F'\t' '{t=2;for(i=2;i<NF;i++){t+=$i==$(i+1)}}t==NF' file.txt

Here is a generalisation of the problem:
Select all lines where a set of columns have the same value: c1 c2 c3 c4 ..., where ci can be any number:
Assume we want to select the columns: 2 3 4 11 15
awk 'BEGIN{n=split("2 3 4 11 15",a)}
{for(i=2;i<=n;++i) if ($(a[i])!=$(a[1])) next}1' file
A bit more robust, in case a line might not contain all fields:
awk 'BEGIN{n=split("2 3 4 11 15",a)}
{for(i=2;i<=n;++i) if (a[i] <= NF) if ($(a[i])!=$(a[1])) next}1' file

awk count unique occurrences and print other columns

I have the following piece of code:
awk '{h[$1]++}; END { for(k in h) print k, h[k]}' ${infile} >> ${outfile2}
Which does part of what I want: printing out the unique values and then also counting how many times these unique values have occurred. Now, I want to print out the 2nd and 3rd column as well from each unique value. For some reason the following does not seem to work:
awk '{h[$1]++}; END { for(k in h) print k, $2, $3, h[k]}' ${infile} >> ${outfile2}
awk '{h[$1]++}; END { for(k in h) print k, h[$2], h[$3], h[k]}' ${infile} >> ${outfile2}
The first prints out the last index's 2nd and 3rd column, whereas the second code prints out nothing except k and h[k].
${infile} would look like:
20600 33.8318 -111.9286 -1 0.00 0
20600 33.8318 -111.9286 -1 0.00 0
30900 33.3979 -111.8140 -1 0.00 0
29400 33.9455 -113.5430 -1 0.00 0
30600 33.4461 -111.7876 -1 0.00 0
20600 33.8318 -111.9286 -1 0.00 0
30900 33.3979 -111.8140 -1 0.00 0
30600 33.4461 -111.7876 -1 0.00 0
The desired output would be:
20600, 33.8318, -111.9286, 3
30900, 33.3979, -111.8140, 2
29400, 33.9455, -113.5430, 1
30600, 33.4461, -111.7876, 2

You were close and you can do it all in awk, but if you are going to store the count based on field 1 and also have field 2 and field 3 available in END to output, you also need to store field 2 & 3 in arrays indexed by field 1 (or whatever field you are keeping count of). For example you could do:
awk -v OFS=', ' '
{ h[$1]++; i[$1]=$2; j[$1]=$3 }
END {
for (a in h)
print a, i[a], j[a], h[a]
}
' infile
Where h[$1] holds the count of the number of times field 1 is seen indexing the array with field 1. i[$1]=$2 captures field 2 indexed by field 1, and then j[$1]=$3 captures field 3 indexed by field 1.
Then within END all that is needed is to output field 1 (a the index of h), i[a] (field 2), j[a] (field 3), and finally h[a] the count of the number of times field 1 was seen.
Example Use/Output
Using your example data, you can just copy/middle-mouse-paste the code at the terminal with the correct filename, e.g.
$ awk -v OFS=', ' '
> { h[$1]++; i[$1]=$2; j[$1]=$3 }
> END {
> for (a in h)
> print a, i[a], j[a], h[a]
> }
> ' infile
20600, 33.8318, -111.9286, 3
29400, 33.9455, -113.5430, 1
30600, 33.4461, -111.7876, 2
30900, 33.3979, -111.8140, 2
Which provides the output desired. If you need to preserve the order of records in the order of the output you show, you can use string-concatenation to group fields 1, 2 & 3 as the index of the array and then output the array and index, e.g.
$ awk '{a[$1", "$2", "$3]++}END{for(i in a) print i ", " a[i]}' infile
20600, 33.8318, -111.9286, 3
30600, 33.4461, -111.7876, 2
29400, 33.9455, -113.5430, 1
30900, 33.3979, -111.8140, 2
Look things over and let me know if you have further questions.

GNU datamash is a very handy tool for working on groups of columnar data in files that makes this trivial to do.
Assuming your file uses tabs to separate columns like it appears to:
$ datamash -s --output-delimiter=, -g 1,2,3 count 3 < input.tsv
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2
Though it's not much more complicated in awk, using a multi dimensional array:
$ awk 'BEGIN { OFS=SUBSEP="," }
{ group[$1,$2,$3]++ }
END { for (g in group) print g, group[g] }' input.tsv
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
20600,33.8318,-111.9286,3
30900,33.3979,-111.8140,2
If you want sorted output instead of random order for this one, if using GNU awk, add a PROCINFO["sorted_in"] = "#ind_str_asc" in the BEGIN block, or otherwise pipe the output through sort.
You can also get the same effect by pipelining a bunch of utilities (including awk and uniq):
$ sort -k1,3n input.tsv | cut -f1-3 | uniq -c | awk -v OFS=, '{ print $2, $3, $4, $1 }'
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2

awk/sed remove duplicates and merge permuted columns

I have the following file:
ABC MNH 1
UHR LOI 2
QWE LOI 3
MNH ABC 4
PUQ LOI 5
MNH ABC 6
QWE LOI 7
LOI UHR 8
I want to remove all duplicates (based on the the first two columns - e.g., row 6 is a duplicate of row 4). Also I want to merge entries where column 1 and 2 are permuted (e.g., row 1 and 4). This means that this list should result in:
ABC MNH 1 4
UHR LOI 2 8
QWE LOI 3
PUQ LOI 5
However, this file is huge. About 2-3 TB. Can this be done with awk/sed?

I don't understand why what you posted is your expected output so you may have to massage it but IMHO this is right the way to approach the problem so that only "sort" is handling storing the multi-TB input internally (and sort is designed to do that with paging etc.) while the awk scripts are just processing one line at a time and keeping very little in memory:
$ cat tst.sh
#!/bin/env bash
awk '{print ($1>$2 ? $1 OFS $2 : $2 OFS $1), $0}' "$1" |
sort -k1,2 |
awk '
{ curr = $1 OFS $2 }
prev != curr {
if ( NR>1 ) {
print rec
}
rec = $0
sub(/^([^[:space:]]+[[:space:]]+){2}/,"",rec)
prev = curr
next
}
{ rec = rec OFS $NF }
END { print rec }
'
$ ./tst.sh file
ABC MNH 1 4 6
PUQ LOI 5
QWE LOI 3 7
LOI UHR 8 2
An alternative implementation after discussing with #kvantour in the comments below (requires GNU sort for -s stable sort):
$ cat tst.sh
#!/bin/env bash
awk '{print ($1>$2 ? $1 OFS $2 : $2 OFS $1), $0}' "$1" |
sort -s -k1,2 |
awk '
{ curr = $1 OFS $2 }
prev != curr {
if ( NR>1 ) {
print rec
}
rec = $0
sub(/^([^[:space:]]+[[:space:]]+){2}/,"",rec)
sub(/[[:space:]]+[^[:space:]]+$/,"",rec)
delete seen
prev = curr
}
!seen[$3,$4]++ { rec = rec OFS $NF }
END { print rec }
'
$ ./tst.sh file
ABC MNH 1 4
PUQ LOI 5
QWE LOI 3
UHR LOI 2 8

The always helpful GNU datmash to the rescue!
$ sort -k1,2 -u input.txt |
awk -v OFS="\t" '$2 < $1 { tmp = $1; $1 = $2; $2 = tmp } { print $1, $2, $3 }' |
sort -k1,2 |
datamash groupby 1,2 collapse 3 |
tr ',' ' '
ABC MNH 1 4
LOI PUQ 5
LOI QWE 3
LOI UHR 2 8
Broken down, this:
Sorts the input file based on the first two columns and removes duplicates.
If the second column is less than the first column, swaps the two (So MNH ABC 6 becomes ABC MNH 6), and outputs tab-separated columns (Which is what datamash works with by default).
Sorts that so all the transformed rows are in order (But this time keeping duplicates).
Uses datamash to produce a single line for all the duplicate first two columns, with a comma-separated list of the values of the third columns as the third column of the output (Like ABC MNH 1,4)
Turns those commas into spaces.
Most memory-efficient solutions will require the data to be sorted, and while the sort program is quite good at doing that, it'll still use a bunch of temporary files so you'll need 2-3 or so terabytes of free disk space.
If you're going to be doing a lot of stuff with the same data, it's probably worth sorting it once and reusing that file instead of sorting it every time as the first step of a pipeline:
$ sort -k1,2 -u input.txt > unique_sorted.txt
$ awk ... unique_sorted.txt | ...
If there's enough duplicates and enough RAM that it's feasible to hold the results in memory, it can be done in one pass through the input file removing duplicates as it goes and then iterating through all the remaining pairs of values:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
my %keys;
while (<>) {
chomp;
my ($col1, $col2, $col3) = split ' ';
$keys{$col1}{$col2} = $col3 unless exists $keys{$col1}{$col2};
}
$, = " ";
while (my ($col1, $sub) = each %keys) {
while (my ($col2, $col3) = each %$sub) {
next unless defined $col3;
if ($col1 lt $col2 && exists $keys{$col2}{$col1}) {
$col3 .= " $keys{$col2}{$col1}";
$keys{$col2}{$col1} = undef;
} elsif ($col2 lt $col1 && exists $keys{$col2}{$col1}) {
next;
}
say $col1, $col2, $col3;
}
}
This produces output in arbitrary unsorted order for efficiency's sake.
And an approach using sqlite (Also requires lots of extra free disk space, and that the columns are separated by tabs, not arbitrary whitespace):
#!/bin/sh
input="$1"
sqlite3 -batch -noheader -list temp.db 2>/dev/null <<EOF
.separator \t
PRAGMA page_size = 8096; -- Make sure the database can grow big enough
CREATE TABLE data(col1, col2, col3, PRIMARY KEY(col1, col2)) WITHOUT ROWID;
.import "$input" data
SELECT col1, col2, group_concat(col3, ' ')
FROM (
SELECT col1, col2, col3 FROM data WHERE col1 < col2
UNION ALL
SELECT col2, col1, col3 FROM data WHERE col2 < col1
)
GROUP BY col1, col2
ORDER BY col1, col2;
EOF
rm -f temp.db

If your first two columns will only have 3 characters maximum you will have 26^6 possible combinations for the first two columns. This is very easy to handle with awk.
{ key1=$1$2; key2=$2$1 }
(key1 in a) { next } # duplicate :> skip
(key2 in a) { print $2,$1,a[key2],$3 } # permutation :> print
{ a[key1]=$3 } # store value
This however will only print the permutations, and as requested, maximum 2 elements. As a consequence, the array a will have both key1 and the permuted key key2 in the array in case a permutation is found, otherwise it will only have key1.
This can be cleaned up with a second array keeping track if a permutation is already printed. Call it b. This way you can eliminate 2 elements from a while keeping track of one element in b:
{ key1=$1$2; key2=$2$1 }
(key1 in b) || (key2 in b) { next } # permutation printed, is duplicate
(key1 in a) { next } # only duplicate, no permutation found
(key2 in a) { # permutation found
print $2,$1,a[key2],$3 # - print
delete a[key1] # - delete keys from a
delete a[key2]
b[key1] # - store key in b
next # - skip the rest
}
{ a[key1]=$3 }
END { for (k in a) { print substr(1,3,k),substr(4,3,k),a[k] } }

Adding numbers of a field

I am having a text file with multiple rows and two or four column. If two column then 1st column is id and 2nd is number and if four column 1st and 2nd is id and 3rd and 4th is number. For the four column rows 2nd and 4th column cells can have multiple entry separated by comma. If there is two column only I want to print them as it is; but if there is four column I want to print only the 1st column id and in the second column I want the sum of all the number present in 3rd and 4th column for that row.
Input
CG AT,AA,CA 17 1,1,1
GT 14
TB AC,TC,TA,GG,TT,AR,NN,NM,AB,AT,TT,TC,CA,BB,GT,AT,XT,MT,NA,TT 552 6,1,1,2,2,1,2,1,5,3,4,1,2,1,1,1,3,4,5,4
TT CG,GT,TA,GB 105 3,4,1,3
Expected Output
CG 20
GT 14
TB 602
TT 116

If there are no leading spaces in the actual file, use $1 instead of $2.
$ awk -F '[ ,]+' '{for(i=1; i<=NF; i++) s+=$i; print $2, s; s=0}' <<EOF
CG AT,AA,CA 17 1,1,1
GT 14
TB AC,TC,TA,GG,TT,AR,NN,NM,AB,AT,TT,TC,CA,BB,GT,AT,XT,MT,NA,TT 552 6,1,1,2,2,1,2,1,5,3,4,1,2,1,1,1,3,4,5,4
TT CG,GT,TA,GB 105 3,4,1,3
EOF
CG 20
GT 14
TB 602
TT 116
-F '[ ,]+' means "fields are delimited by one or more spaces or commas".
There is no condition associated with the {action}, so it will be performed on every line.
NF is the Number of Fields, and $X refers to the Xth field.
Strings are equal to 0, so we can simply add every field together to get a sum.
After we print the first non-blank field and our sum, we reset the sum for the next line.

Here is a solution coded to follow your instruction as closely as possible (with no field-splitting tricks so that it's easy to reason about):
awk '
NF == 2 {
print $1, $2
next
}
NF == 4 {
N = split($4, f, /,/)
for (i = 1; i <= N; ++i)
$3 += f[i]
print $1, $3
}'
I noticed though that your input section contains leading spaces. If leading spaces are actually present (and are irrelevant), we can add a leading { sub(/^ +/, "") } to the script.

comparing two columns in two files

I would like to compare two columns in two files.
Here's an example:
1 722603 0.08 0.0013 0.0035 0.02
1 793227 0.17 0 0 0.01
2 931508 0.52 0.95 0.93 0.92
1 722603 0.0348543
1 793227 0.130642
2 931508 0.275751
2 1025859 0.0739543
2 1237036 0.476705
This code compares the second columns of the two files:
awk 'FNR==NR {a[$2]++; next} a[$2]' file 1 file 2
However, I want to print the common second column if the first column is also the same. More specifically, if it finds 722603 in both files, it must check that the first column is also equal to 1 and then prints it. If the number in second column is repeated, it is important that it gets printed more than once with different values of column 1.
I'd be very thankful if you could guide me through this, thank you.

like this? extended your codes a bit:
awk 'FNR==NR {a[$1 FS $2]++; next} a[$1 FS $2]' file1 file2

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Data partitioning by columns - awk

Related

AWK select rows where all columns are equal

awk count unique occurrences and print other columns

awk/sed remove duplicates and merge permuted columns

Adding numbers of a field

comparing two columns in two files

Categories

Resources