Comparing two columns: if they match, print the value in a new column and if they do not match print the value of the second column to the new column - awk

I have a file with multiple columns. I want to compare A1 ($4) and A2 ($14), and if the values do not match, print the value of A2 ($14). If the values match, I want to print the value of A1 ($15).
File:
chr SNP BP A1 TEST N OR Z P chr SNP cm BP A2 A1
20 rs6078030 61098 T ADD 421838 0.9945 -0.209 0.8344 20 rs6078030 0 61098 C T
20 rs143291093 61270 G ADD 422879 1.046 0.5966 0.5508 20 rs143291093 0 61270 G A
20 rs4814683 61795 T ADD 417687 1.015 0.6357 0.525 20 rs4814683 0 61795 G T
Desired output:
chr SNP BP A1 TEST N OR Z P chr SNP cm BP A2 A1 noneff
20 rs6078030 61098 T ADD 421838 0.9945 -0.209 0.8344 20 rs6078030 0 61098 C T C
20 rs143291093 61270 G ADD 422879 1.046 0.5966 0.5508 20 rs143291093 0 61270 G A A
20 rs4814683 61795 T ADD 417687 1.015 0.6357 0.525 20 rs4814683 0 61795 G T G
I checked the difference between column 4 and 15 first.
awk '$4!=$15{print $4,$15}' file > diff
Then I tried to write the if-else statement:
awk '{if($4=$14) print $16=$14 ; else print $16=$15}' file > new_file

Try this:
awk 'NR==1{$(++NF)="noneff"}NR>1{$(++NF)=($4==$14)?$15:$14}1' so1186.txt
Output:
awk 'NR==1{$(++NF)="noneff"}NR>1{$(++NF)=($4==$14)?$15:$14}1' so1186.txt | column -t
chr SNP BP A1 TEST N OR Z P chr SNP cm BP A2 A1 noneff
20 rs6078030 61098 T ADD 421838 0.9945 -0.209 0.8344 20 rs6078030 0 61098 C T C
20 rs143291093 61270 G ADD 422879 1.046 0.5966 0.5508 20 rs143291093 0 61270 G A A
20 rs4814683 61795 T ADD 417687 1.015 0.6357 0.525 20 rs4814683 0 61795 G T G
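To check the logic outside awk, here is a rough plain-Python sketch of the same rule. It assumes whitespace-separated fields and a single header line; the function name `add_noneff` is made up for illustration and is not part of the answer above.

```python
def add_noneff(lines):
    """Append a 'noneff' column: field 15 if field 4 == field 14, else field 14.

    Field numbers are 1-based as in awk, so they map to indexes 14, 3 and 13.
    """
    out = []
    for line_no, line in enumerate(lines):
        fields = line.split()
        if line_no == 0:
            # Header line: only add the new column name.
            fields.append("noneff")
        else:
            a1_left, a2, a1_right = fields[3], fields[13], fields[14]
            fields.append(a1_right if a1_left == a2 else a2)
        out.append(" ".join(fields))
    return out


sample = [
    "chr SNP BP A1 TEST N OR Z P chr SNP cm BP A2 A1",
    "20 rs6078030 61098 T ADD 421838 0.9945 -0.209 0.8344 20 rs6078030 0 61098 C T",
    "20 rs143291093 61270 G ADD 422879 1.046 0.5966 0.5508 20 rs143291093 0 61270 G A",
    "20 rs4814683 61795 T ADD 417687 1.015 0.6357 0.525 20 rs4814683 0 61795 G T",
]

for row in add_noneff(sample):
    print(row)
```

The last field of the three data rows comes out as C, A and G, matching the desired output in the question.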

awk '{$(++NF)=($4==$14)?$15:$14}1' file

Related

Create a new column for table B based on information from table A

I have this problem. I want to create a report that keeps everything in table B, but adds another column from table A (QtyRecv).
Condition: If RunningTotalQtyUsed (from table B) < QtyRecv, take that QtyRecv for the new column.
For example, for item A1, (RunningTotalQtyUsed) 55 < 100 (QtyRecv), -> ExpectedQtyRecv = 100.
But if RunningTotalQtyUsed exceeds QtyRecv, we take the next QtyRecv to cover that used quantity.
For example, 101 > 100, -> ExpectedQtyRecv = 138.
149 (RunningTotalQtyUsed) < (100 + 138) (QtyRecv) -> get 138.
250 < (100 + 138 + 121) -> get 121.
The same logic applies to item A2.
If total QtyRecv = 6 + 4 + 10 = 20, but RunningTotalQtyUsed = 31 -> result should be 99999 to notify an error that QtyRecv can't cover QtyUsed.
Table A:
Item QtyRecv
A1 100
A1 138
A1 121
A2 6
A2 4
A2 10
Table B:
Item RunningTotalQtyUsed
A1 55
A1 101
A1 149
A1 250
A2 1
A2 5
A2 9
A2 19
A2 31
Expected result:
Item RunningTotalQtyUsed ExpectedQtyRecv
A1 55 100
A1 101 138
A1 149 138
A1 250 121
A2 1 6
A2 5 6
A2 9 4
A2 19 10
A2 31 99999
What I tried:
SELECT b.*
FROM tableB b LEFT JOIN tableA a
ON b.item = a.item
item RunningTotalQtyUsed
A1 55
A1 55
A1 55
A1 101
A1 101
A1 101
A1 149
A1 149
A1 149
A1 250
A1 250
A1 250
A2 1
A2 1
A2 1
A2 5
A2 5
A2 5
A2 9
A2 9
A2 9
A2 19
A2 19
A2 19
A2 31
A2 31
A2 31
This doesn't keep the same number of rows as table B. How can I keep table B as it is but add ExpectedQtyRecv from table A? Thank you so much for all the help!
SELECT B_TOTAL.ITEM, B_TOTAL.SUM_RunningTotalQtyUsed, A_TOTAL.SUM_QtyRecv
FROM
(
SELECT B.ITEM, SUM(B.RunningTotalQtyUsed) AS SUM_RunningTotalQtyUsed
FROM TABLE_B AS B
GROUP BY B.ITEM
) B_TOTAL
LEFT JOIN
(
SELECT A.ITEM, SUM(A.QtyRecv) AS SUM_QtyRecv
FROM TABLE_A AS A
GROUP BY A.ITEM
) A_TOTAL ON B_TOTAL.ITEM = A_TOTAL.ITEM
I can't be sure, but maybe you need something like the above?
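The rule described in the question (take the first QtyRecv whose cumulative total covers RunningTotalQtyUsed, otherwise 99999) can be sketched in plain Python to make the expected result concrete. This is only an illustration of the logic, not a SQL solution. It assumes table A rows are in receipt order and that a running total exactly equal to the cumulative quantity counts as covered, which the question does not specify. The function name `expected_qty_recv` is hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def expected_qty_recv(table_a, table_b, sentinel=99999):
    """Return (item, running_total, expected_qty) for every row of table_b."""
    # Collect the received quantities per item, keeping their original order.
    received = {}
    for item, qty in table_a:
        received.setdefault(item, []).append(qty)
    # Cumulative received quantity per item, e.g. [100, 238, 359] for A1.
    cumulative = {item: list(accumulate(qtys)) for item, qtys in received.items()}

    rows = []
    for item, running_total in table_b:
        qtys = received.get(item, [])
        # Index of the first cumulative total that is >= the running total.
        idx = bisect_left(cumulative.get(item, []), running_total)
        rows.append((item, running_total, qtys[idx] if idx < len(qtys) else sentinel))
    return rows


table_a = [("A1", 100), ("A1", 138), ("A1", 121), ("A2", 6), ("A2", 4), ("A2", 10)]
table_b = [("A1", 55), ("A1", 101), ("A1", 149), ("A1", 250),
           ("A2", 1), ("A2", 5), ("A2", 9), ("A2", 19), ("A2", 31)]

for row in expected_qty_recv(table_a, table_b):
    print(*row)
```

With the sample tables this reproduces the expected result from the question, one output row per row of table B.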

Pandas - Groupby by three columns with cumsum or cumcount [duplicate]

I need to create a new "identifier column" with a unique value for each combination of values in two columns. For example, the same identifier should be used whenever ID and phase are the same (e.g. r1 and ph1), but a new, unique value should be assigned for a different combination (e.g. r1 and ph2).
df
ID phase side values
r1 ph1 l 12
r1 ph1 r 34
r1 ph2 l 93
s4 ph3 l 21
s3 ph2 l 88
s3 ph2 r 54
...
I would need a new column (idx) like so:
new_df
ID phase side values idx
r1 ph1 l 12 1
r1 ph1 r 34 1
r1 ph2 l 93 2
s4 ph3 l 21 3
s3 ph2 l 88 4
s3 ph2 r 54 4
...
I've tried applying code from this question but could not find a way to increment the values in idx.
Try groupby with ngroup, adding 1 because ngroup numbers the groups from 0. Use sort=False to ensure groups are enumerated in the order they appear in the DataFrame:
df['idx'] = df.groupby(['ID', 'phase'], sort=False).ngroup() + 1
df:
ID phase side values idx
0 r1 ph1 l 12 1
1 r1 ph1 r 34 1
2 r1 ph2 l 93 2
3 s4 ph3 l 21 3
4 s3 ph2 l 88 4
5 s3 ph2 r 54 4
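For intuition, numbering groups in order of first appearance can be sketched without pandas. This is a minimal illustration of the idea, not how pandas implements ngroup; the function name `first_appearance_ids` is made up.

```python
def first_appearance_ids(rows, key_positions):
    """Give each distinct key combination a 1-based id in order of first appearance."""
    seen = {}
    ids = []
    for row in rows:
        key = tuple(row[pos] for pos in key_positions)
        if key not in seen:
            # A new combination gets the next unused number.
            seen[key] = len(seen) + 1
        ids.append(seen[key])
    return ids


rows = [
    ("r1", "ph1", "l", 12),
    ("r1", "ph1", "r", 34),
    ("r1", "ph2", "l", 93),
    ("s4", "ph3", "l", 21),
    ("s3", "ph2", "l", 88),
    ("s3", "ph2", "r", 54),
]

# Key on ID (position 0) and phase (position 1).
print(first_appearance_ids(rows, (0, 1)))
```

For the sample rows this yields the same idx values as the requested new_df: 1, 1, 2, 3, 4, 4.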

calculation in new column with if and else in pandas [duplicate]

I have the table below:
Particulars DC Amt
AA D 50
BB D 20
CC C 30
DD D 20
EE C 10
I need the output below: if the DC column is "D", TTL should
have the same amount as the "Amt" column, and if the DC column is "C",
TTL should be the Amt amount multiplied by (-1).
Particulars DC Amt TTL
AA D 50 50
BB D 20 20
CC C 30 (30)
DD D 20 20
EE C 10 (10)
You can use np.where:
import numpy as np
df['TTL'] = np.where(df.DC == 'D', df.Amt, -1*df.Amt)
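The expected output writes credits as (30), the accounting notation for a negative amount, while np.where stores the number -30. A small plain-Python sketch of both steps follows. It assumes, like the np.where call above, that any DC value other than "D" is treated as a credit. The function names `signed_amount` and `accounting_format` are hypothetical.

```python
def signed_amount(dc, amt):
    """Keep the amount for 'D' rows and negate it for everything else."""
    return amt if dc == "D" else -amt


def accounting_format(value):
    """Render negative numbers in parentheses, e.g. -30 becomes '(30)'."""
    return f"({abs(value)})" if value < 0 else str(value)


records = [("AA", "D", 50), ("BB", "D", 20), ("CC", "C", 30), ("DD", "D", 20), ("EE", "C", 10)]

for particulars, dc, amt in records:
    ttl = signed_amount(dc, amt)
    print(particulars, dc, amt, accounting_format(ttl))
```

Keeping TTL numeric and formatting only for display means the column can still be summed.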

Python Pandas: Append column names in each row

Is there a way to append column names in dataframe rows?
input:
cv cv mg mg
5g 5g 0% zinsenzin
output:
cv cv col_name mg mg col_name
5g 5g cv 0% zinsenzin mg
I tried this, but it's not working:
list_col = list(df)
for i in list_col:
if i != i.shift(1)
df['new_col'] = i
I got stuck here and can't find any solution.
Working with duplicated column names in pandas is not easy, but it is possible:
c = 'cv cv mg mg sa sa ta ta at at ad ad an av av ar ar ai ai ca ca ch ch ks ks ct ct ce ce cw cw dt dt fr fr fs fs fm fm it it lg lg mk mk md md mt mt ob ob ph ph pb pb rt rt sz sz tg tg tt tt vv vv yq yq fr fr ms ms lp lp ts ts mv mv'.split()
df = pd.DataFrame([range(77)], columns=c)
print (df)
cv cv mg mg sa sa ta ta at at ... fr fr ms ms lp lp ts \
0 0 1 2 3 4 5 6 7 8 9 ... 67 68 69 70 71 72 73
ts mv mv
0 74 75 76
[1 rows x 77 columns]
df = pd.concat([v.assign(new_col=k) for k, v in df.groupby(axis=1,level=0,sort=False)],axis=1)
print (df)
cv cv new_col mg mg new_col sa sa new_col ta ... new_col lp lp \
0 0 1 cv 2 3 mg 4 5 sa 6 ... ms 71 72
new_col ts ts new_col mv mv new_col
0 lp 73 74 ts 75 76 mv
[1 rows x 115 columns]
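The same idea can be sketched without pandas on a plain header and row, which may help to see what is being inserted where. Note one difference: this sketch groups only consecutive runs of equal names with itertools.groupby, whereas the groupby in the answer above collects all columns sharing a name, even when they are not adjacent. The function name `append_group_names` is made up.

```python
from itertools import groupby


def append_group_names(header, row, label="col_name"):
    """After each run of equal column names, insert a column holding that name."""
    new_header, new_row = [], []
    pos = 0
    for name, run in groupby(header):
        width = len(list(run))
        # Copy the run's columns, then add the extra label/name column.
        new_header.extend([name] * width + [label])
        new_row.extend(list(row[pos:pos + width]) + [name])
        pos += width
    return new_header, new_row


header = ["cv", "cv", "mg", "mg"]
row = ["5g", "5g", "0%", "zinsenzin"]

new_header, new_row = append_group_names(header, row)
print(" ".join(new_header))
print(" ".join(new_row))
```

For the input in the question this gives the header and row shown under "output" there.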

Concatenate/merge columns

File (~50,000 columns):
A1 2 123 f f j j k k
A2 10 789 f o p f m n
Output
A1 2 123 ff jj kk
A2 10 789 fo pf mn
I basically want to concatenate every two columns into one, starting from column 4. How can we do it in awk or sed?
This is possible in awk. See below:
:~/t> more test.txt
A1 2 123 f f j j k k
:~/t> awk '{for(i=j=4; i < NF; i+=2) {$j = $i$(i+1); j++} NF=j-1}1' test.txt
A1 2 123 ff jj kk
Sorry, I just noticed you gave two lines as an example...
:~/t> more test.txt
A1 2 123 f f j j k k
A2 10 789 f o p f m n
:~/t> awk '{for(i=j=4; i < NF; i+=2) {$j = $i$(i+1); j++} NF=j-1}1' test.txt
A1 2 123 ff jj kk
A2 10 789 fo pf mn
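The pairing rule can also be sketched in plain Python for comparison. It assumes whitespace-separated fields and, like the awk loop above, drops a trailing field that has no partner. The function name `merge_pairs` is hypothetical.

```python
def merge_pairs(line, start=4):
    """Join every two fields into one, beginning at the 1-based field `start`."""
    fields = line.split()
    head = fields[:start - 1]
    tail = fields[start - 1:]
    # zip stops at the shorter slice, so an unpaired last field is dropped.
    merged = [left + right for left, right in zip(tail[0::2], tail[1::2])]
    return " ".join(head + merged)


for line in ["A1 2 123 f f j j k k", "A2 10 789 f o p f m n"]:
    print(merge_pairs(line))
```

For the two sample lines this prints the same rows as the awk command.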