How to find matched rows in 2 files based on column3 and create extra file with rank value - awk

I have 2 files that I need to merge based on column 3 (Pos), then find the matched positions and create the desired output below using awk. I would like the output to carry an extra column (Chip2) that marks each position common to both files with a rank number.
File1.txt:
SNP-ID Chr Pos
rs62637813 1 52058
rs150021059 1 52238
rs4477212 1 52356
kgp15717912 1 53424
rs140052487 1 54353
rs9701779 1 56537
kgp7727307 1 56962
kgp15297216 1 72391
rs3094315 1 75256
rs3131972 1 75272
kgp6703048 1 75406
kgp22792200 1 75665
kgp15557302 1 75769
File2.txt:
SNP-ID Chr Pos Chip1
rs58108140 1 10583 1
rs189107123 1 10611 2
rs180734498 1 13302 3
rs144762171 1 13327 4
rs201747181 1 13957 5
rs151276478 1 13980 6
rs140337953 1 30923 7
rs199681827 1 46402 8
rs200430748 1 47190 9
rs187298206 1 51476 10
rs116400033 1 51479 11
rs190452223 1 51914 12
rs181754315 1 51935 13
rs185832753 1 51954 14
rs62637813 1 52058 15
rs190291950 1 52144 16
rs201374420 1 52185 17
rs150021059 1 52238 18
rs199502715 1 53234 19
rs140052487 1 54353 20
Desired output:
SNP-ID Chr Pos Chip1 Chip2
rs58108140 1 10583 1 0
rs189107123 1 10611 2 0
rs180734498 1 13302 3 0
rs144762171 1 13327 4 0
rs201747181 1 13957 5 0
rs151276478 1 13980 6 0
rs140337953 1 30923 7 0
rs199681827 1 46402 8 0
rs200430748 1 47190 9 0
rs187298206 1 51476 10 0
rs116400033 1 51479 11 0
rs190452223 1 51914 12 0
rs181754315 1 51935 13 0
rs185832753 1 51954 14 0
rs62637813 1 52058 15 1
rs190291950 1 52144 16 0
rs201374420 1 52185 17 0
rs150021059 1 52238 18 2
rs199502715 1 53234 19 0
rs140052487 1 54353 20 3

I don't quite understand what you mean by "rank", but assuming it is a running counter over the matched positions:
awk '
NR==FNR {pos[$3]=1; next}             # 1st file: remember every Pos value
FNR==1  {print $0, "Chip2"; next}     # 2nd file header: append the new column name
{print $0, ($3 in pos ? ++rank : 0)}  # matched Pos gets the next rank, else 0
' File1.txt File2.txt | column -t
SNP-ID Chr Pos Chip1 Chip2
rs58108140 1 10583 1 0
rs189107123 1 10611 2 0
rs180734498 1 13302 3 0
rs144762171 1 13327 4 0
rs201747181 1 13957 5 0
rs151276478 1 13980 6 0
rs140337953 1 30923 7 0
rs199681827 1 46402 8 0
rs200430748 1 47190 9 0
rs187298206 1 51476 10 0
rs116400033 1 51479 11 0
rs190452223 1 51914 12 0
rs181754315 1 51935 13 0
rs185832753 1 51954 14 0
rs62637813 1 52058 15 1
rs190291950 1 52144 16 0
rs201374420 1 52185 17 0
rs150021059 1 52238 18 2
rs199502715 1 53234 19 0
rs140052487 1 54353 20 3
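If you want to sanity-check the result outside awk, here is a minimal pandas sketch of the same logic (assuming, as the samples suggest, that both files are whitespace-delimited with the headers shown):
import pandas as pd

# Both files are whitespace-delimited with a header row, as shown above.
f1 = pd.read_csv("File1.txt", sep=r"\s+")
f2 = pd.read_csv("File2.txt", sep=r"\s+")

# Flag File2 rows whose Pos also occurs in File1.
matched = f2["Pos"].isin(f1["Pos"])

# Chip2 counts the matched rows in order of appearance; unmatched rows get 0.
f2["Chip2"] = matched.cumsum().where(matched, 0)

print(f2.to_string(index=False))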

How to replace variables across multiple columns using awk?

I have a file of 2060 lines that looks like this, with a header (column names) at the top:
FID IID late_telangiectasia_G1 late_atrophy_G1 late_atrophy_G2 late_nipple_retraction_G1 late_nipple_retraction_G2 late_oedema_G1 late_oedema_G2 late_induration_tumour_G1 late_induration_outside_G1 late_induration_G2 late_arm_lympho_G1 late_hyper_G1
1 470502 1 0 0 0 0 0 0 0 0 0 0 0
2 470514 0 0 0 0 0 0 0 0 0 0 0 0
3 470422 0 0 0 0 0 0 0 0 0 0 0 1
4 470510 0 0 0 0 0 1 0 1 1 1 0 1
5 470506 0 0 0 0 0 0 0 0 0 0 0 0
6 471948 0 0 0 0 0 0 0 1 0 0 0 0
7 469922 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9
8 471220 0 1 1 -9 -9 0 0 1 1 1 0 0
9 470498 0 1 0 0 0 0 0 0 0 0 0 0
10 471993 0 1 1 0 0 0 0 0 0 0 0 0
11 470414 0 1 0 0 0 0 0 0 1 0 0 0
12 470522 0 0 0 0 0 0 0 0 0 0 0 0
13 470345 0 0 0 0 0 0 0 0 0 0 0 0
14 471275 0 1 0 -9 0 0 0 1 0 0 0 0
15 471283 0 1 0 0 0 0 0 1 1 0 0 0
16 472577 0 1 0 0 0 0 0 1 0 0 0 0
17 470492 0 1 0 0 0 0 0 0 0 0 0 0
18 472889 0 0 0 -9 0 0 0 0 0 0 0 0
19 470500 0 1 0 1 0 0 0 0 1 0 0 0
20 470493 0 0 0 0 0 0 0 1 1 0 0 0
I want to replace all the 0 -> 1 and the 1 -> 2 in columns 3 to 12. I don't want to replace the -9 values.
I know for a single column the command will be:
awk '
{
if($3==1)$3=2
if($3==0)$3=1
}
1' file
Therefore, for multiple columns is there an easier way to specify a range rather than manually type every column number?
awk '
{
if($3,$4,$5,$6,$7,$8,$9,$10,$11,$12==1)$3,$4,$5,$6,$7,$8,$9,$10,$11,$12=2
if($3,$4,$5,$6,$7,$8,$9,$10,$11,$12==0)$3,$4,$5,$6,$7,$8,$9,$10,$11,$12=1
}
1' file
Thanks in advance
You could use a loop over the field numbers, accessing and changing each field's value via $i:
awk '
{
for(i=3; i<=12; i++) {         # only fields 3 through 12
if ($i==1 || $i==0) $i++       # 0 -> 1 and 1 -> 2; -9 is left alone
}
}1
' file | column -t
One possibility, if you want to change almost all of your fields (as in your case), is to save the ones you don't want to change and then change everything else:
$ awk 'NR>1{hd=$1 FS $2; tl=$13 FS $14; $1=$2=$13=$14=""; gsub(1,2); gsub(0,1); $0=hd $0 tl} 1' file
FID IID late_telangiectasia_G1 late_atrophy_G1 late_atrophy_G2 late_nipple_retraction_G1 late_nipple_retraction_G2 late_oedema_G1 late_oedema_G2 late_induration_tumour_G1 late_induration_outside_G1 late_induration_G2 late_arm_lympho_G1 late_hyper_G1
1 470502 2 1 1 1 1 1 1 1 1 1 0 0
2 470514 1 1 1 1 1 1 1 1 1 1 0 0
3 470422 1 1 1 1 1 1 1 1 1 1 0 1
4 470510 1 1 1 1 1 2 1 2 2 2 0 1
5 470506 1 1 1 1 1 1 1 1 1 1 0 0
6 471948 1 1 1 1 1 1 1 2 1 1 0 0
7 469922 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9
8 471220 1 2 2 -9 -9 1 1 2 2 2 0 0
9 470498 1 2 1 1 1 1 1 1 1 1 0 0
10 471993 1 2 2 1 1 1 1 1 1 1 0 0
11 470414 1 2 1 1 1 1 1 1 2 1 0 0
12 470522 1 1 1 1 1 1 1 1 1 1 0 0
13 470345 1 1 1 1 1 1 1 1 1 1 0 0
14 471275 1 2 1 -9 1 1 1 2 1 1 0 0
15 471283 1 2 1 1 1 1 1 2 2 1 0 0
16 472577 1 2 1 1 1 1 1 2 1 1 0 0
17 470492 1 2 1 1 1 1 1 1 1 1 0 0
18 472889 1 1 1 -9 1 1 1 1 1 1 0 0
19 470500 1 2 1 2 1 1 1 1 2 1 0 0
20 470493 1 1 1 1 1 1 1 2 2 1 0 0
pipe it to column -t for alignment if you like.
Or, using GNU awk for the 3rd arg to match() and retaining white space:
$ awk 'NR>1{ match($0,/((\S+\s+){2})((\S+\s+){9}\S+)(.*)/,a); gsub(1,2,a[3]); gsub(0,1,a[3]); $0=a[1] a[3] a[5] } 1' file
FID IID late_telangiectasia_G1 late_atrophy_G1 late_atrophy_G2 late_nipple_retraction_G1 late_nipple_retraction_G2 late_oedema_G1 late_oedema_G2 late_induration_tumour_G1 late_induration_outside_G1 late_induration_G2 late_arm_lympho_G1 late_hyper_G1
1 470502 2 1 1 1 1 1 1 1 1 1 0 0
2 470514 1 1 1 1 1 1 1 1 1 1 0 0
3 470422 1 1 1 1 1 1 1 1 1 1 0 1
4 470510 1 1 1 1 1 2 1 2 2 2 0 1
5 470506 1 1 1 1 1 1 1 1 1 1 0 0
6 471948 1 1 1 1 1 1 1 2 1 1 0 0
7 469922 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9
8 471220 1 2 2 -9 -9 1 1 2 2 2 0 0
9 470498 1 2 1 1 1 1 1 1 1 1 0 0
10 471993 1 2 2 1 1 1 1 1 1 1 0 0
11 470414 1 2 1 1 1 1 1 1 2 1 0 0
12 470522 1 1 1 1 1 1 1 1 1 1 0 0
13 470345 1 1 1 1 1 1 1 1 1 1 0 0
14 471275 1 2 1 -9 1 1 1 2 1 1 0 0
15 471283 1 2 1 1 1 1 1 2 2 1 0 0
16 472577 1 2 1 1 1 1 1 2 1 1 0 0
17 470492 1 2 1 1 1 1 1 1 1 1 0 0
18 472889 1 1 1 -9 1 1 1 1 1 1 0 0
19 470500 1 2 1 2 1 1 1 1 2 1 0 0
20 470493 1 1 1 1 1 1 1 2 2 1 0 0
It is hard to tell whether that is space delimited or tab delimited.
Here is a Ruby solution that will deal with either space or tab delimited fields and will convert the result to tab delimited.
Note: Ruby arrays are zero based, so fields 1,2 are [0..1] and fields 3-12 are [2..11]
ruby -r csv -e 'options={:col_sep=>"\t", :converters=>:all, :headers=>true}
data=CSV.parse($<.read.gsub(/[[:blank:]]+/,"\t"), **options)
data.each_with_index{
|r,i| data[i]=r[0..1]+r[2..11].map{|e| (e==1 || e==0) ? e+1 : e}+r[12..]}
puts data.to_csv(**options)
' file
Prints:
FID IID late_telangiectasia_G1 late_atrophy_G1 late_atrophy_G2 late_nipple_retraction_G1 late_nipple_retraction_G2 late_oedema_G1 late_oedema_G2 late_induration_tumour_G1 late_induration_outside_G1 late_induration_G2 late_arm_lympho_G1 late_hyper_G1
1 470502 2 1 1 1 1 1 1 1 1 1 0 0
2 470514 1 1 1 1 1 1 1 1 1 1 0 0
3 470422 1 1 1 1 1 1 1 1 1 1 0 1
4 470510 1 1 1 1 1 2 1 2 2 2 0 1
5 470506 1 1 1 1 1 1 1 1 1 1 0 0
6 471948 1 1 1 1 1 1 1 2 1 1 0 0
7 469922 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 -9
8 471220 1 2 2 -9 -9 1 1 2 2 2 0 0
9 470498 1 2 1 1 1 1 1 1 1 1 0 0
10 471993 1 2 2 1 1 1 1 1 1 1 0 0
11 470414 1 2 1 1 1 1 1 1 2 1 0 0
12 470522 1 1 1 1 1 1 1 1 1 1 0 0
13 470345 1 1 1 1 1 1 1 1 1 1 0 0
14 471275 1 2 1 -9 1 1 1 2 1 1 0 0
15 471283 1 2 1 1 1 1 1 2 2 1 0 0
16 472577 1 2 1 1 1 1 1 2 1 1 0 0
17 470492 1 2 1 1 1 1 1 1 1 1 0 0
18 472889 1 1 1 -9 1 1 1 1 1 1 0 0
19 470500 1 2 1 2 1 1 1 1 2 1 0 0
20 470493 1 1 1 1 1 1 1 2 2 1 0 0
With awk you can do:
awk -v OFS="\t" 'FNR>1{for(i=3;i<=12;i++)if ($i~"^[10]$")$i=$i+1} $1=$1' file
# same output
gawk -v RS='[[:space:]]+' '++c > 2 && c <= 12 && /^(0|1)$/ { ++$0 }
{ printf "%s", $0 RT } RT ~ /\n/ { c = 0 }' file
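For comparison, a pandas sketch of the same transformation, offered as a cross-check rather than a definitive alternative (assuming the file is whitespace-delimited as shown):
import pandas as pd

df = pd.read_csv("file", sep=r"\s+")

# Fields 3..12 are column positions 2..11 (zero-based).
cols = df.columns[2:12]

# Add 1 where the value is 0 or 1; leave everything else (the -9s) untouched.
df[cols] = df[cols].where(~df[cols].isin([0, 1]), df[cols] + 1)

print(df.to_string(index=False))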

Reset 'Id' value of appended Dataframe

I have appended multiple dataframes to form a single dataframe. Each dataframe had multiple rows assigned a specific ID, so after appending, the big dataframe has multiple rows with the same ID. I would like to assign new IDs.
Current Dataframe:
Index name groupid
0 Abc 0
1 cvb 0
2 sdf 0
3 ksh 1
4 kjl 1
5 lmj 2
6 hyb 2
0 khf 0
1 uyt 0
2 tre 1
3 awe 1
4 uys 2
5 asq 2
6 lsx 2
Desired Output:
Index name groupid new_id
0 Abc 0 0
1 cvb 0 0
2 sdf 0 0
3 ksh 1 1
4 kjl 1 1
5 lmj 2 2
6 hyb 2 2
7 khf 0 3
8 uyt 0 3
9 tre 1 4
10 awe 1 4
11 uys 2 5
12 asq 2 5
13 lsx 2 5
You can use groupby with a slight twist, grouping on consecutive runs of groupid:
df['new_id'] = (df.groupby(df['groupid'].ne(df['groupid'].shift()).cumsum(), sort=False)
                  .ngroup())
Output is (the first unnamed column is the new positional index):
Index name groupid new_id
0 0 Abc 0 0
1 1 cvb 0 0
2 2 sdf 0 0
3 3 ksh 1 1
4 4 kjl 1 1
5 5 lmj 2 2
6 6 hyb 2 2
7 0 khf 0 3
8 1 uyt 0 3
9 2 tre 1 4
10 3 awe 1 4
11 4 uys 2 5
12 5 asq 2 5
13 6 lsx 2 5
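A self-contained sketch of the same idea, with the question's data re-typed so it runs on its own:
import pandas as pd

df = pd.DataFrame({
    "name": ["Abc", "cvb", "sdf", "ksh", "kjl", "lmj", "hyb",
             "khf", "uyt", "tre", "awe", "uys", "asq", "lsx"],
    "groupid": [0, 0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2, 2],
})

# Every change in groupid starts a new run; cumsum labels the runs,
# and ngroup() numbers them 0, 1, 2, ... in order of appearance.
runs = df["groupid"].ne(df["groupid"].shift()).cumsum()
df["new_id"] = df.groupby(runs, sort=False).ngroup()
print(df)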

Dataframe within a Dataframe - to create new column

For the following dataframe:
import pandas as pd
df = pd.DataFrame({'list_A': [3,3,3,3,3,
                              2,2,2,2,2,2,2,
                              4,4,4,4,4,4,4,4,4,4,4,4]})
How can 'list_A' be manipulated to give 'list_B'?
Desired output:
list_A list_B
0 3 1
1 3 1
2 3 1
3 3 0
4 2 1
5 2 1
6 2 0
7 2 0
8 4 1
9 4 1
10 4 1
11 4 1
12 4 0
13 4 0
14 4 0
15 4 0
16 4 0
As you can see, if list_A has the number 3, then the first 3 values of list_B are 1, and then list_B changes to 0 until list_A changes value again.
Use GroupBy.cumcount:
df['list_B'] = df['list_A'].gt(df.groupby('list_A').cumcount()).astype(int)
print(df)
Output
list_A list_B
0 3 1
1 3 1
2 3 1
3 3 0
4 3 0
5 2 1
6 2 1
7 2 0
8 2 0
9 2 0
10 2 0
11 2 0
12 4 1
13 4 1
14 4 1
15 4 1
16 4 0
17 4 0
18 4 0
19 4 0
20 4 0
21 4 0
22 4 0
23 4 0
EDIT: if the same list_A value can occur in more than one separate run, group by consecutive blocks instead:
blocks = df['list_A'].ne(df['list_A'].shift()).cumsum()
df['list_B'] = df['list_A'].gt(df.groupby(blocks).cumcount()).astype(int)
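A quick sketch of why the block-wise version matters, using hypothetical data in which the value 3 occurs in two separate runs:
import pandas as pd

df = pd.DataFrame({"list_A": [3, 3, 3, 3, 2, 2, 3, 3, 3]})

# cumsum over value changes labels each consecutive run separately,
# so the second run of 3s restarts its cumcount at 0.
blocks = df["list_A"].ne(df["list_A"].shift()).cumsum()
df["list_B"] = df["list_A"].gt(df.groupby(blocks).cumcount()).astype(int)
print(df)
# Grouping by 'list_A' directly would keep counting across both runs
# and mark the second run of 3s as 0, 0, 0 instead of 1, 1, 1.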

Using If-else to change values in Pandas

I have a pandas DataFrame consisting of three columns: ID, t, and ind1.
import pandas as pd
dat = {'ID': [1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,6,6,6],
't': [0,1,2,3,0,1,2,0,1,2,3,0,1,2,0,1,0,1,2],
'ind1' : [1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0]
}
df = pd.DataFrame(dat, columns = ['ID', 't', 'ind1'])
print (df)
What I need to do is create a new column (res) such that:
for all IDs with ind1==0, res is zero.
for all IDs with ind1==1, res = 1 where t==max(t) (grouped by ID), otherwise zero.
Here's the anticipated output:
ID t ind1 res
0 1 0 1 0
1 1 1 1 0
2 1 2 1 0
3 1 3 1 1
4 2 0 0 0
5 2 1 0 0
6 2 2 0 0
7 3 0 0 0
8 3 1 0 0
9 3 2 0 0
10 3 3 0 0
11 4 0 1 0
12 4 1 1 0
13 4 2 1 1
14 5 0 1 0
15 5 1 1 1
16 6 0 0 0
17 6 1 0 0
18 6 2 0 0
Check with groupby and idxmax, then where with transform('all'):
df['res'] = (df.groupby('ID').t.transform('idxmax')
               .where(df.groupby('ID').ind1.transform('all'))
               .eq(df.index).astype(int))
df
Out[160]:
ID t ind1 res
0 1 0 1 0
1 1 1 1 0
2 1 2 1 0
3 1 3 1 1
4 2 0 0 0
5 2 1 0 0
6 2 2 0 0
7 3 0 0 0
8 3 1 0 0
9 3 2 0 0
10 3 3 0 0
11 4 0 1 0
12 4 1 1 0
13 4 2 1 1
14 5 0 1 0
15 5 1 1 1
16 6 0 0 0
17 6 1 0 0
18 6 2 0 0
This relies on the ID column being sorted:
import numpy as np

cond1 = df.ind1.eq(0)
cond2 = df.ind1.eq(1) & (df.t.eq(df.groupby("ID").t.transform("max")))
df["res"] = np.select([cond1, cond2], [0, 1], 0)
df
ID t ind1 res
0 1 0 1 0
1 1 1 1 0
2 1 2 1 0
3 1 3 1 1
4 2 0 0 0
5 2 1 0 0
6 2 2 0 0
7 3 0 0 0
8 3 1 0 0
9 3 2 0 0
10 3 3 0 0
11 4 0 1 0
12 4 1 1 0
13 4 2 1 1
14 5 0 1 0
15 5 1 1 1
16 6 0 0 0
17 6 1 0 0
18 6 2 0 0
Use groupby.apply:
df['res'] = (df.groupby('ID').apply(lambda x: x['ind1'].eq(1) & x['t'].eq(x['t'].max()))
               .astype(int).reset_index(drop=True))
print(df)
ID t ind1 res
0 1 0 1 0
1 1 1 1 0
2 1 2 1 0
3 1 3 1 1
4 2 0 0 0
5 2 1 0 0
6 2 2 0 0
7 3 0 0 0
8 3 1 0 0
9 3 2 0 0
10 3 3 0 0
11 4 0 1 0
12 4 1 1 0
13 4 2 1 1
14 5 0 1 0
15 5 1 1 1
16 6 0 0 0
17 6 1 0 0
18 6 2 0 0
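If the one-liners feel dense, here is a spelled-out sketch of the same rule, reusing the df defined in the question:
import numpy as np

# ind1 is constant within each ID, so 'all' flags the ind1==1 groups.
all_ones = df.groupby("ID")["ind1"].transform("all")

# True on the row holding each ID's maximum t.
is_max_t = df["t"].eq(df.groupby("ID")["t"].transform("max"))

# res = 1 only where both conditions hold.
df["res"] = np.where(all_ones & is_max_t, 1, 0)
print(df)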

How to populate columns depending on a found value?

I have a pandas DataFrame with customer IDs and columns related to months (1, 2, 3, ...).
I have a column with the number of months since the last purchase.
I am using the following to populate the relevant month columns:
dt.loc[dt.month == 1, '1'] = 1
dt.loc[dt.month == 2, '2'] = 1
dt.loc[dt.month == 3, '3'] = 1
etc.
How can I populate the columns in a better way to avoid creating 12 statements?
Use pd.get_dummies:
pd.get_dummies(dt.month)
Consider the dataframe dt:
import numpy as np
import pandas as pd

dt = pd.DataFrame(dict(
    month=np.random.randint(1, 13, 10),
    a=range(10)
))
a month
0 0 8
1 1 3
2 2 8
3 3 11
4 4 3
5 5 4
6 6 1
7 7 5
8 8 3
9 9 11
Add columns like this:
dt.join(pd.get_dummies(dt.month))
a month 1 3 4 5 8 11
0 0 8 0 0 0 0 1 0
1 1 3 0 1 0 0 0 0
2 2 8 0 0 0 0 1 0
3 3 11 0 0 0 0 0 1
4 4 3 0 1 0 0 0 0
5 5 4 0 0 1 0 0 0
6 6 1 1 0 0 0 0 0
7 7 5 0 0 0 1 0 0
8 8 3 0 1 0 0 0 0
9 9 11 0 0 0 0 0 1
If you wanted the column names to be strings:
dt.join(pd.get_dummies(dt.month).rename(columns='month {}'.format))
a month month 1 month 3 month 4 month 5 month 8 month 11
0 0 8 0 0 0 0 1 0
1 1 3 0 1 0 0 0 0
2 2 8 0 0 0 0 1 0
3 3 11 0 0 0 0 0 1
4 4 3 0 1 0 0 0 0
5 5 4 0 0 1 0 0 0
6 6 1 1 0 0 0 0 0
7 7 5 0 0 0 1 0 0
8 8 3 0 1 0 0 0 0
9 9 11 0 0 0 0 0 1
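Note that pd.get_dummies only creates columns for the months actually present in the data. If you need a column for every month 1..12 regardless, a reindex sketch:
import numpy as np
import pandas as pd

dt = pd.DataFrame(dict(month=np.random.randint(1, 13, 10), a=range(10)))

# Guarantee all 12 month columns, filling the absent ones with 0.
dummies = pd.get_dummies(dt.month).reindex(columns=range(1, 13), fill_value=0)
result = dt.join(dummies)
print(result)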