How to replace other column entries when searching for a specific value in a file? - awk

Suppose my file looks like this:
A 1 0
B 1 0
C 1 0
How can I find the line that has B in the first column and, for that line, swap the entries in the second and third columns? The final result would look like:
A 1 0
B 0 1
C 1 0

Try this:
vipin@kali:~$ awk '{if($1 == "B") {print $1,$3,$2} else print $1,$2,$3}' kk
A 1 0
B 0 1
C 1 0
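
An equivalent approach, sketched here under the same assumptions (whitespace-separated file named kk), swaps the two fields in place with a temporary variable instead of re-printing them:
awk '$1 == "B" {t = $2; $2 = $3; $3 = t} 1' kk
The trailing 1 is a condition that is always true with no action attached, so awk applies its default action and prints every line, modified or not.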

Related

Change the 1st instance of every unique value to 1 in pandas

Hi, let us assume I have a data frame:
Name quantity
0 a 0
1 a 0
2 b 0
3 b 0
4 c 0
And I want something like:
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
Essentially, I want to set the quantity of the first row of every unique element to one.
Currently I am using code like:
def store_counter(df):
    unique_names = list(df.Name.unique())   # names not yet marked
    df['quantity'] = 0
    for i, row in df.iterrows():
        if row['Name'] in unique_names:     # first occurrence of this name
            df.loc[i, 'quantity'] = 1
            unique_names.remove(row['Name'])
    return df
which is highly inefficient. Is there a better approach for this?
Thank you in advance.
Use Series.duplicated with DataFrame.loc:
df.loc[~df.Name.duplicated(), 'quantity'] = 1
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
If you need to set both values, use numpy.where:
import numpy as np

df['quantity'] = np.where(df.Name.duplicated(), 0, 1)
print (df)
Name quantity
0 a 1
1 a 0
2 b 1
3 b 0
4 c 1
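
As an aside for readers coming from the awk questions on this page: assuming the frame were dumped to a hypothetical whitespace-separated file df.txt (one header line, then index, Name, quantity per row), a rough awk equivalent of the same first-occurrence marking would be:
awk 'NR==1 {print; next} {$3 = (seen[$2]++ ? 0 : 1)} 1' df.txt
seen[$2]++ returns 0 the first time a Name appears, so only that row gets quantity 1.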

Replace values in a column by matching on 2 columns

I have a file f1 which looks like this (it has 1651 lines):
fam0110 G110 0 0 0 1 T G
fam6106 G6106 0 0 0 2 T T
fam1000 G1000 0 0 0 2 T T
...
and a file f2 which looks like this (also 1651 lines):
fam1000 G1000 1 1
fam6106 G6106 1 1
fam0110 G110 2 2
...
I would like to replace the 6th column in f1 with the 3rd column of f2, where lines are matched by the values of their 1st and 2nd columns.
the output would look like this:
fam0110 G110 0 0 0 2 T G
fam6106 G6106 0 0 0 1 T T
fam1000 G1000 0 0 0 1 T T
I tried to do it with:
awk 'FNR==NR{a[NR]=$3;next}{$6=a[FNR]}1' pheno_laser2 chr9.plink.ped > chr9.new.ped
but this doesn't work because the lines are not in the same order, so I need to match by the values of the 1st and 2nd columns in both files.
Please advise
You have to use the first two fields as the key of the hash, since you want to match on them, not on the line number or anything else.
awk 'FNR==NR{a[$1, $2]=$3;next} {$6=a[$1, $2]}1' file2 file1
Testing with your examples:
fam0110 G110 0 0 0 2 T G
fam6106 G6106 0 0 0 1 T T
fam1000 G1000 0 0 0 1 T T
Note that it would print an empty 6th field for any non-matching lines; I assume this is OK.
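
If you would rather leave non-matching lines untouched than blank their 6th field, a guarded variant (a sketch, using the question's file names f2 and f1) only assigns when the key exists:
awk 'FNR==NR {a[$1,$2]=$3; next} ($1,$2) in a {$6=a[$1,$2]} 1' f2 f1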

Count the number of occurrences of values larger than x in every row

I have a file with multiple rows and 26 columns. I want to count, in each row, the number of values that are higher than 0 (counting values different from 0 would also work), excluding the first two columns. The file looks like this:
X Y Sample1 Sample2 Sample3 .... Sample24
a a1 0 7 0 0
b a2 2 8 0 0
c a3 0 3 15 3
d d3 0 0 0 0
I would like to have an output file like this:
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0
awk or sed would be good.
I saw a similar question but in that case the columns were summed and the desired output was different.
awk 'NR==1{printf "X\tY\tResult%s",ORS}          # print the header
     NR>1{
         count=0                                 # reset the count for each row
         for(i=3;i<=NF;i++){                     # iterate from field 3 to the last field (NF)
             if($i>0){                           # $i expands to $3, $4 and so on
                 count++                         # increment when the value is positive
             }
         }
         printf "%s\t%s\t%s%s",$1,$2,count,ORS   # print X, Y and the count for each row
     }' file
should do that
another awk
$ awk '{if(NR==1) c="Result";
else for(i=3;i<=NF;i++) c+=($i>0);
print $1,$2,c; c=0}' file | column -t
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0
Or, since gsub() returns the number of substitutions it performed, removing every occurrence of a space followed by a nonzero digit counts the positive values as a side effect (this assumes the values are nonnegative integers):
$ awk '{print $1, $2, (NR>1 ? gsub(/ [1-9]/,"") : "Result")}' file
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0
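
Since the title asks about values larger than an arbitrary x, the same idea can be parameterized with -v (a sketch, assuming the same file layout):
awk -v x=0 'NR==1 {print $1, $2, "Result"; next}
            {c=0; for (i=3; i<=NF; i++) c += ($i > x); print $1, $2, c}' file
Setting -v x=5, for example, would count only the values greater than 5 in each row.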

awk split and concatenate has extra space in one line of output

The awk below produces the current output with all the lines split. However, the first line seems to have a space in it and I cannot figure out why. Not sure if this is the best way, but it produces output that is close to correct. Thank you :).
awk '{split($6,a,":");  print $1":",$2,$3,a[1],a[2],a[6],a[7]}
     {split($7,a,":");  print $1":"$2,$3,a[1],a[2],a[6],a[7]}
     {split($8,a,":");  print $1":"$2,$3,a[1],a[2],a[6],a[7]}
     {split($9,a,":");  print $1":"$2,$3,a[1],a[2],a[6],a[7]}
     {split($10,a,":"); print $1":"$2,$3,a[1],a[2],a[6],a[7]}' input > parse
input file (tab delimited)
chr1 13408 C 1 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 C:1:2.00:28.00:2.00:0:1:0.00:0.02:0.00:0:0.00:0.00:0.00 G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00
current output (parse)
chr1: 13408 C A 0 0 0 (has a space between `chr1:` and `13408`)
chr1:13408 C C 1 0 1
chr1:13408 C G 0 0 0
chr1:13408 C T 0 0 0
chr1:13408 C N 0 0 0
desired output
chr1:13408 C A 0 0 0 (has no space between `chr1:` and `13408`)
chr1:13408 C C 1 0 1
chr1:13408 C G 0 0 0
chr1:13408 C T 0 0 0
chr1:13408 C N 0 0 0
You're telling awk to print a space with the first , in your first print statement. Change
print $1":",$2
to
print $1":"$2
like you already have in your second print statement so you don't get the OFS value (a space by default) printed between : and $2.
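
Fixing the comma solves the immediate problem, but the five nearly identical blocks can also be collapsed into a single loop over fields 6 through 10 (a sketch against the same input file):
awk '{for (i=6; i<=10; i++) {split($i, a, ":"); print $1":"$2, $3, a[1], a[2], a[6], a[7]}}' input > parse
This produces the same five output lines per input record without repeating the print statement.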

Selecting columns using specific patterns, then finding their sum and ratio

I want to calculate the sum and ratio values from the data below. (The actual data contains more than 200,000 columns and 45,000 rows.)
For clarity I have given only a simple sample of the data format.
#Frame BMR_42#O22 BMR_49#O13 BMR_59#O13 BMR_23#O26 BMR_10#O13 BMR_61#O26 BMR_23#O25
1 1 1 0 1 1 1 1
2 0 1 0 0 1 1 0
3 1 1 1 0 0 1 1
4 1 1 0 0 1 0 1
5 0 0 0 0 0 0 0
6 1 0 1 1 0 1 0
7 1 1 1 1 0 0 0
8 1 1 1 0 0 0 0
9 1 1 1 1 1 1 1
10 0 0 0 0 0 0 0
The columns need to be selected by a certain criterion: I only consider columns whose header contains "#O13". Below I have given the selected columns from the example above.
BMR_49#O13 BMR_59#O13 BMR_10#O13
1 0 1
1 0 1
1 1 0
1 0 1
0 0 0
0 1 0
1 1 0
1 1 0
1 1 1
0 0 0
From the selected columns, I want to calculate:
1) the sum of all the "1"s. In this example we get the value 16.
2) the number of rows containing at least one occurrence of "1". In the example above there are 8 such rows.
and lastly,
3) the ratio of the total of all "1"s to the total number of lines with an occurrence of "1",
that is: (total of all "1"s)/(total rows with an occurrence of "1"), here 16/8.
As a start, I tried this command to select only the columns with "#O13":
awk '{for (i=1;i<=NF;i++) if (i~/#O13/); print ""}' $file2
This runs but doesn't print any values: the condition tests the loop index i instead of the field $i, so the regex never matches, and the stray ; after the if leaves its body empty.
This should do:
awk 'NR==1{for (i=1;i<=NF;i++) if ($i~/#O13/) a[i];next} {f=0;for (i in a) if ($i) {s++;f++};if (f) r++} END {print "number of 1="s"\nrows with 1="r"\nratio="s/r}' file
number of 1=16
rows with 1=8
ratio=2
A more readable version:
awk '
NR==1{
    for (i=1;i<=NF;i++)
        if ($i~/#O13/)
            a[i]                # remember matching column numbers as array keys
    next
}
{
    f=0                         # "1"s seen in the current row
    for (i in a)
        if ($i=="1") {
            s++                 # grand total of "1"s
            f++
        }
    if (f) r++                  # rows with at least one "1"
}
END {
    print "number of 1="s \
          "\nrows with 1="r \
          "\nratio="s/r
}
' file
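
If the goal is only to extract the matching columns themselves, as in the first attempt above, a sketch along the same lines prints just the "#O13" columns, header row included:
awk 'NR==1 {for (i=1; i<=NF; i++) if ($i ~ /#O13/) keep[++n] = i}
     {for (j=1; j<=n; j++) printf "%s%s", $(keep[j]), (j<n ? OFS : ORS)}' file
The first block records the matching column numbers from the header; the second reprints only those fields for every line, including the header itself.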