awk or sed change field of file based on another input file - awk

I have a sparse matrix ("matrix.csv") with 10k rows and 22 columns (the 1st column is "user"; the remaining columns are called "slots" and contain 0s or 1s), like this:
user1,0,0,0,1,0,0,0,1,0,1,0,0,...,0
user2,0,1,0,0,0,0,0,0,0,0,0,1,...,0
user3,0,0,0,0,0,0,1,0,0,0,1,0,...,0
...
Some of the slots that contain a "0" should be changed to contain a "1".
I have another file ("slots2change.csv") that tells me which slots should be changed, like this:
user1,3
user3,21
...
So for user1, I need to change slot3 to contain a "1" instead of a "0", and for user3 I should change slot21 to contain a "1" instead of a "0", and so on.
Expected result:
user1,0,0,1,1,0,0,0,1,0,1,0,0,...,0
user2,0,1,0,0,0,0,0,0,0,0,0,1,...,0
user3,0,0,0,0,0,0,1,0,0,0,1,0,...,1
...
How can I achieve this using awk or sed?
I tried
awk -F"," 'NR==FNR{user=$1;slot2change=$2} NR!=FNR; /user/ slot2change==0{print $0} ' slots2change.csv matrix.csv
but I feel I am still far from a correct command...

Like this:
awk 'BEGIN{FS=OFS=","}
NR==FNR{arr[$1]=$2;next}
NR!=FNR {for (i in arr)
if ($1 == i) {
F=arr[i] + 1
$F=1
}
print
}
' slots2change.csv matrix.csv
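Note that arr[$1]=$2 keeps only the last entry if a user appears more than once in slots2change.csv. If that can happen (an assumption; the sample shows one slot per user), a minimal sketch keyed on (user,slot) pairs:
awk 'BEGIN{FS=OFS=","}
NR==FNR{ set[$1,$2]; next }     # remember every (user,slot) pair
{
    for (s=1; s<NF; s++)        # slot s is field s+1
        if (($1,s) in set) $(s+1)=1
    print
}' slots2change.csv matrix.csv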

Related

awk sum values of a column, if other columns remain constant

I have a file (example.csv) with around 5M rows, each row containing 4 columns (user, day, type, value), like this:
user1,2022-01-01,type1,0.1
user1,2022-01-01,type1,0.9
user1,2022-01-02,type1,1.0
user1,2022-01-02,type2,1.0
user2,2022-01-01,type1,1.0
user2,2022-01-01,type2,1.0
user3,2022-01-01,type1,0.3
user3,2022-01-01,type1,0.2
user3,2022-01-01,type1,0.5
I would like to sum the values (4th column in this example) that correspond to the same user, day and type, so the expected output should look like this:
user1,2022-01-01,type1,1.0
user1,2022-01-02,type1,1.0
user1,2022-01-02,type2,1.0
user2,2022-01-01,type1,1.0
user2,2022-01-01,type2,1.0
user3,2022-01-01,type1,1.0
I tried something like this to see if it works:
awk -F"," '!seen[$1]++;&&!seen[$2]++;&&!seen[$3]++;sum+=$4{print sum}' example.csv
but I am still far from the correct solution.
Any suggestions?
$ awk '
BEGIN {
    FS=OFS=","
}
{
    a[$1 OFS $2 OFS $3]+=$4          # key on user,day,type; sum the 4th column
}
END {
    for(i in a)
        print i, sprintf("%.1f", a[i])
}' file
Output:
user2,2022-01-01,type1,1.0
user2,2022-01-01,type2,1.0
user1,2022-01-01,type1,1.0
user3,2022-01-01,type1,1.0
user1,2022-01-02,type1,1.0
user1,2022-01-02,type2,1.0
The output order is awk-implementation dependent. If needed, pipe the output through sort or use GNU AWK's PROCINFO["sorted_in"].
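For example, a minimal sketch with GNU AWK (@ind_str_asc traverses the array in string order of its keys):
$ awk '
BEGIN { FS=OFS="," }
{ a[$1 OFS $2 OFS $3]+=$4 }
END {
    PROCINFO["sorted_in"]="@ind_str_asc"   # GNU awk only: sorted array traversal
    for (i in a) print i, sprintf("%.1f", a[i])
}' example.csv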

Awk Remove lines if one column matches another column, and keep line if max value from another column

I have a file of ~8,000 lines. I am trying to remove the lines whose 5th column matches (in this case ga2016mldlzd), keeping only the line with the max value in the 6th column. For example, given this:
-25.559,129.8529,6674.560547,2.0,ga2016mldlzd,6
-25.5596,129.8565,6902.750651,2.0,ga2016mldlzd,7
-25.5450,129.830,969.8079427,2.0,ga2016mldlzd,8
-25.5450,129.834,57.04752604,2.0,ga2016mldlzd,9
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
I want to remove all lines except the final one, which has the max value 10 in the 6th column, to get this:
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
I'm stumped as to how this could be done in either awk or sed.
I tried this:
awk -F, '!a[$5]++'
but I want to keep the line with '10' in the last column, rather than the one with '6'. Thanks
Keep track of the max and the line associated with that max, and print at the end:
awk -F, '
{
    if ($6 > max[$5]) {       # new max for this 5th-column key
        max[$5] = $6
        tl[$5] = $0           # remember the whole line that carried it
    }
}
END{
    for (l in tl) print tl[l]
}' file
Prints:
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
The order of the file will be lost; ie, the groups may be reordered compared to the original file.
If you are dealing with a file with so many different $5 keys that they cannot all fit in memory, you could sort into blocks grouped by the fifth field and then by the numeric value of the sixth, and have awk print the last line every time the fifth field changes. Since the input is sorted, that last line holds the max:
sort -t, -k5,5 -k6,6n file |
awk -F, '
    FNR==1 { lf=$5; ll=$0 }
    lf!=$5 { print ll }       # key changed: the previous line was the max
    { ll=$0; lf=$5 }
    END    { print ll }'      # flush the final group
# same output as above
This second approach will be way slower but uses far less memory when there are many unique $5 values.
If you want to maintain the original order of lines, use this two-pass awk (the same file is read twice):
awk -F, 'NR==FNR {if ($6 > max[$5]) max[$5] = $6; next} $5 in max && max[$5] == $6' file file
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10
If you want to filter for ga2016mldlzd while maintaining original order of lines then use this awk:
awk -F, '
NR==FNR {
    if ($5 == "ga2016mldlzd" && $6 > max[$5]) {
        max[$5] = $6
        n = FNR               # line number of the current max
    }
    next
}
FNR == n' file file
-25.57067,129.856,7929.60612,2.0,ga2016mldlzd,10

Substitute patterns using a correspondence file

I am trying to replace some words in a file with others, using sed or awk.
My initial fileA has this format:
>Genus_species|NP_001006347.1|transcript-60_2900.p1:1-843
I have a second fileB with the correspondences like this:
NP_001006347.1 GeneA
XP_003643123.1 GeneB
I am trying to substitute the name in fileA to get this output:
>Genus_species|GeneA|transcript-60_2900.p1:1-843
I was thinking of using awk or sed to do something like
's/$patternA/$patternB/' inside a while read loop, but how do I indicate where patterns 1 and 2 are in fileB? I also tried this, but it is not working:
sed "$(sed 's/^\([^ ]*\) \(.*\)$/s#\1#\2#g/' fileB)" fileA
Perhaps awk can do the job more easily?
Thanks
It is easier to do this in awk:
awk -v OFS='|' 'NR == FNR {
    map[$1] = $2                 # fileB: map["NP_001006347.1"] = "GeneA"
    next
}
{
    for (i=1; i<=NF; ++i)        # replace any field that has a mapping
        if ($i in map) $i = map[$i]
} 1' fileB FS='|' fileA          # FS="|" takes effect for fileA only
>Genus_species|GeneA|transcript-60_2900.p1:1-843
Written and tested with your samples. Assuming you have only one NP_digits.digits entry per line in your Input_fileA, you could also try the following.
awk '
FNR==NR{
    arr[$1]=$2
    next
}
match($0,/NP_[0-9]+\.[0-9]+/) && ((val=substr($0,RSTART,RLENGTH)) in arr){
    # splice the mapped name over the matched accession
    $0=substr($0,1,RSTART-1) arr[val] substr($0,RSTART+RLENGTH)
}
1
' Input_fileB Input_fileA
Using awk
awk -F'[| ]' 'NR==FNR { arr[$1]=$2; next } NR!=FNR { OFS="|"; if ($2 in arr) $2=arr[$2] } 1' fileB fileA
Set the field delimiter to a space or |. Process fileB first (NR==FNR), creating an array arr indexed by the first space-delimited field, with the second field as the value. Then, for the second file (NR!=FNR), check whether there is an entry for the second field in arr; if there is, replace the second field with the value from the array. The shorthand 1 prints every line.
You are looking for the join command which can be used like this:
join -11 -22 -t'|' <(tr ' ' '|' < fileB | sort -t'|' -k1) <(sort -t'|' -k2 fileA)
This performs a join on column 1 of fileB with column 2 of fileA. tr is used so that fileB also uses | as the delimiter, since join requires the same delimiter in both files.
Note that the output columns are not in the order you specified. You can swap them by piping the output into awk, as shown below.
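For example, a minimal sketch of that final swap (assuming join emits the join field first, then the remaining fileB field, then the remaining fileA fields):
join -11 -22 -t'|' <(tr ' ' '|' < fileB | sort -t'|' -k1) <(sort -t'|' -k2 fileA) |
awk -F'|' 'BEGIN{OFS="|"} {print $3, $2, $4}'   # restore the fileA column order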

awk: first, split a line into separate lines; second, use those new lines as a new input

Let's say I have this line:
foo|bar|foobar
I want to split it at | and then use those 3 new lines as the input for further processing (let's say replacing bar with xxx).
Sure, I can pipe two awk instances, like this:
echo "foo|bar|foobar" | awk '{gsub(/\|/, "\n"); print}' | awk '/bar/ {gsub(/bar/, "xxx"); print}'
But how can I achieve this in one script? First, do one operation on some input, and then treat the result as the new input for the second operation?
I tried something like this:
echo "foo|bar|foobar" | awk -v c=0 '{
{
gsub(/\|/, "\n");
sprintf("%s", $0);
}
{
if ($0 ~ /bar/) {
c+=1;
gsub(/bar/, "xxx");
print c;
print
}
}
}'
Which results in this:
1
foo
xxx
fooxxx
And thanks to the counter c, it's absolutely obvious that the subsequent if doesn't treat the multi-line input it receives as several new records, but instead as just one multi-line record.
Thus, my question is: how to tell awk to treat this new multi-line record it receives as many single-line records?
The desired output in this very example should be something like this if I'm correct:
1
xxx
2
fooxxx
But this is just an example; the question is more about the mechanics of such a transition.
I would suggest an alternative approach using split(), where you split the record on the delimiter into an array and iterate over its elements, instead of working on a single multi-line string.
echo "foo|bar|foobar" |\
awk '{
    count = 0
    n = split($0, arr, "|")            # split the record on the delimiter
    for ( i = 1; i <= n; i++ )
    {
        if ( arr[i] ~ /bar/ )
        {
            count += sub(/bar/, "xxx", arr[i])
            print count
            print arr[i]
        }
    }
}'
Also, you don't need an explicit increment of the count variable: sub() returns the number of substitutions made on the source string, so you can simply add that to the existing value of count.
As one more level of optimization, you can get rid of the ~ match in the if condition and use the sub() function directly there:
if ( sub(/bar/, "xxx", arr[i]) )
{
    count++
    print count
    print arr[i]
}
If you set the record separator (RS) to the pipe character, you almost get the desired effect, e.g.:
echo 'foo|bar|foobar' | awk -v RS='|' 1
Output:
foo
bar
foobar
(empty line)
Except that a new-line character becomes part of the last field, so there is an extra, empty line at the end of the output. You can work around this either by including the new-line in the RS variable, making it less portable, or by avoiding sending new-lines to awk at all.
For example using the less portable way:
echo 'foo|bar|foobar' | awk -v RS='\\||\n' '{ sub(/bar/, "baz") } 1'
Output:
foo
baz
foobaz
Note that the empty record at the end is ignored.
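The other workaround, not sending a trailing new-line to awk at all, could look like this (a minimal sketch using printf without a trailing \n):
printf 'foo|bar|foobar' | awk -v RS='|' '{ sub(/bar/, "xxx") } 1'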
With GNU awk:
$ awk -v RS='[|\n]' 'gsub(/bar/,"xxx"){print ++c ORS $0}' file
1
xxx
2
fooxxx
With any awk:
$ awk -F'|' '{c=0; for (i=1;i<=NF;i++) if ( gsub(/bar/,"xxx",$i) ) print ++c ORS $i }' file
1
xxx
2
fooxxx

Using an awk pattern to filter file data

I have the following file (named /tmp/test99), which contains these rows:
"0","15","wall15"
123132,09808098,"0","15"
I am trying to filter the rows that contain "0" in the 3rd field and "15" in the 4th field (like the second row).
I tried running:
cat /tmp/test99 | awk '/"0","15"/{print>"/tmp/0_15_file.out"} '
but instead of getting only the second row, I also get the first row, which starts with "0","15".
Could you please help with the pattern?
Thanks :)
You may check whether Fields 3 and 4 are equal to the hardcoded values using:
awk -F, '$3=="\"0\"" && $4=="\"15\""'
Set the field separator to a comma and then, if Field 3 is "0" and Field 4 is "15", print the line; otherwise discard it.
See the online demo:
s='"0","15","wall15"
123132,09808098,"0","15"'
awk -F, '$3=="\"0\"" && $4=="\"15\""' <<< "$s"
# => 123132,09808098,"0","15"
Could you please try the following. (A comment on your attempt: you need not use cat with awk; awk can read the Input_file by itself.)
awk -F, '$3~/^"0"$/ && $4~/^"15"$/' Input_file
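If the fields could themselves contain commas inside the quotes, GNU awk's FPAT lets you define what a field looks like instead of what separates fields (a minimal sketch):
awk 'BEGIN{ FPAT="([^,]+)|(\"[^\"]*\")" } $3=="\"0\"" && $4=="\"15\""' /tmp/test99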