Use awk with two different delimiters to split and select columns - awk

How can I tell gawk to use two different delimiters so that I can separate some columns, but select others using the tab-delimited format of my file?
> cat broad_snps.tab
chrsnpID rsID freq_bin snp_maf gene_count dist_nearest_gene_snpsnap dist_nearest_gene_snpsnap_protein_coding dist_nearest_gene dist_nearest_gene_located_within loci_upstream loci_downstream ID_nearest_gene_snpsnap ID_nearest_gene_snpsnap_protein_coding ID_nearest_gene ID_nearest_gene_located_within HGNC_nearest_gene_snpsnap HGNC_nearest_gene_snpsnap_protein_coding flag_snp_within_gene flag_snp_within_gene_protein_coding ID_genes_in_matched_locus friends_ld01 friends_ld02 friends_ld03 friends_ld04 friends_ld05 friends_ld06 friends_ld07 friends_ld08 friends_ld09 -1
10:10001753 10:10001753 7 0.07455 0 98932.0 1045506.0 98932.0 inf 9986766 10039928 ENSG00000224788 ENSG00000048740 ENSG00000224788 CELF2 False False 253.0 103.0 55.0 40.0 35.0 33.0 31.0 20.0 0.0 -1
10:10001794 10:10001794 41 0.4105 0 98891.0 1045465.0 98891.0 inf 9964948 10071879 ENSG00000224788 ENSG00000048740 ENSG00000224788 CELF2 False False 365.0 299.0 294.0 266.0 168.0 138.0 58.0 45.0 0.0 -1
10:100023489 10:100023489 10 0.1054 1 4518.0 4518.0 4518.0 4518.0 100023489 100023489 ENSG00000138131 ENSG00000138131 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 True True ENSG00000138131 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1
10:100025128 10:100025128 45 0.4543 1 2879.0 2879.0 2879.0 2879.0 100025128 100025128 ENSG00000138131 ENSG00000138131 ENSG00000138131 ENSG00000138131 LOXL4 LOXL4 True True ENSG00000138131 112.0 70.0 3.0 0.0 0.0
The output I want:
chr10 10001752 10001753 CELF2
chr10 10001793 10001794 CELF2
chr10 100023488 100023489 LOXL4
chr10 100025127 100025128 LOXL4
chr10 10002974 10002975 LOXL4
The command I am currently using:
cat broad_snps.tab | tail -n+2 | gawk -vOFS="\t" -vFS=":" '{ print "chr"$1, ($2 - 1), $2}' | gawk -vOFS="\t" '{print $1, $2, $3}' > broad_SNPs.bed
Returns this:
chr10 10001752 10001753 10
chr10 10001793 10001794 10
chr10 100023488 100023489 10
chr10 100025127 100025128 10
chr10 10002974 10002975 10
chr10 10003391 10003392 10
chr10 100038815 100038816 10
chr10 10008001 10008002 10
chr10 100093012 100093013 10
I'd like to be able to use the ":" delimiter to split up the first column, but I need to use "\t" to pick out the gene ID.
Thanks!

You can give awk a bracket expression as the field separator, so it splits on either a tab or a colon:
awk -F'[\t:]' '{print $1, $2, $4, $17}'
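Applied to your pipeline, a single-pass version might look like the sketch below. The exact field numbers after the split are an assumption: every colon inside a tab-delimited column shifts the later field numbers up by one, so with colons in both chrsnpID and rsID the HGNC gene name lands around $18 in this file; check against your real data.

tail -n +2 broad_snps.tab |
gawk -F'[\t:]' -v OFS='\t' '{ print "chr" $1, $2 - 1, $2, $18 }' > broad_SNPs.bed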


How do I get awk to print fields from the second row of a file?

I have a file that looks like this:
measured 10.8 0.0000 0.0000 0.0236 0.0304 0.0383 0.0433 0.0437 0.0442 0.0452
0.0455 0.0448 0.0440 0.0423 0.0386 0.0344 0.0274 0.0000 0.0000
I want gawk to print all the numbers in one long single column like this:
0.0000
0.0000
0.0236
0.0304
0.0383
0.0433
0.0437
0.0442
0.0452
0.0455
0.0448
0.0440
0.0423
0.0386
0.0344
0.0274
0.0000
0.0000
I run the command gawk '/measured/ { printf $3"\n" $4"\n" $5"\n" $6"\n" $7"\n" $8"\n" $9"\n" $10"\n" $11"\n" $12"\n" $13"\n" $14"\n" $15"\n" $16"\n" $17"\n" $18"\n" }' filename.txt
But I just get the first row of numbers:
0.0000
0.0000
0.0236
0.0304
0.0383
0.0433
0.0437
0.0442
0.0452
How do I get gawk to print the second row?
$ cat tst.awk
BEGIN { OFS = "\n" }                 # rejoin fields with newlines when a record is rebuilt
/measured/ { c=2; $1=$2=""; $0=$0 }  # open a 2-record window; blank "measured" and "10.8", resplit
c && c-- { $1=$1; print }            # this record and the next: rebuild with OFS and print
$ awk -f tst.awk file
0.0000
0.0000
0.0236
0.0304
0.0383
0.0433
0.0437
0.0442
0.0452
0.0455
0.0448
0.0440
0.0423
0.0386
0.0344
0.0274
0.0000
0.0000
$ grep -A1 measured file | tr -s ' ' \\n | tail -n+4
0.0000
0.0000
0.0236
0.0304
0.0383
0.0433
0.0437
0.0442
0.0452
0.0455
0.0448
0.0440
0.0423
0.0386
0.0344
0.0274
0.0000
0.0000
With awk:
$ awk -v OFS='\n' '/measured/ {p=1; for(i=3;i<=NF;i++) print $i; next}
p {$1=$1; print; exit}' file
If the number of fields is guaranteed to be as in the example, you can print the last nine fields of every line:
awk '{for(i=NF-8;i<=NF;i++){print $i}}' input.file
The GNU implementation of Awk allows an arbitrary regular expression as the RS record separator. If the keyword measured occurs before each batch of numbers, we can use that keyword as the separator:
$ gawk 'BEGIN { RS = "measured" } { for (i = 1; i <= NF ; i++) print "field " i " = " $i }'
measured 10.8 0.0000 0.0000 0.0236 0.0304 0.0383 0.0433 0.0437 0.0442 0.0452
0.0455 0.0448 0.0440 0.0423 0.0386 0.0344 0.0274 0.0000 0.000
field 1 = 10.8
field 2 = 0.0000
field 3 = 0.0000
field 4 = 0.0236
field 5 = 0.0304
field 6 = 0.0383
field 7 = 0.0433
field 8 = 0.0437
field 9 = 0.0442
field 10 = 0.0452
field 11 = 0.0455
field 12 = 0.0448
field 13 = 0.0440
field 14 = 0.0423
field 15 = 0.0386
field 16 = 0.0344
field 17 = 0.0274
field 18 = 0.0000
field 19 = 0.000
As you can see, all the fields between the measured record separators are parsed out regardless of line breaks. Fields are separated on any mixture of spaces, tabs and newlines.
Note that because measured appears first, we get an empty record: the output you see above is, effectively, from the second record. The first record is the whitespace before measured, which contains no fields.
In other words, the record separator is really treated as a record terminator, except that it may be missing after the last record.
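If the empty first record is a nuisance, a small variation on the same RS trick skips any record with no fields by using NF as the pattern:

gawk 'BEGIN { RS = "measured" } NF { for (i = 1; i <= NF; i++) print "field " i " = " $i }' file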

For each unique occurrence in field, transform each unique occurrence in another field in a different column

I have a file
splice_region_variant,intron_variant A1CF 1
3_prime_UTR_variant A1CF 18
intron_variant A1CF 204
downstream_gene_variant A1CF 22
synonymous_variant A1CF 6
missense_variant A1CF 8
5_prime_UTR_variant A2M 1
stop_gained A2M 1
missense_variant A2M 15
splice_region_variant,intron_variant A2M 2
synonymous_variant A2M 2
upstream_gene_variant A2M 22
intron_variant A2M 308
missense_variant A4GNT 1
intron_variant A4GNT 21
5_prime_UTR_variant A4GNT 3
3_prime_UTR_variant A4GNT 7
This file is sorted by $2.
For each unique element in $2, I want to turn each unique element of $1 into its own column, holding the corresponding value from $3, or 0 if that record is not there, so that I get:
splice_region_variant,intron_variant 3_prime_UTR_variant intron_variant downstream_gene_variant synonymous_variant missense_variant 5_prime_UTR_variant stop_gained upstream_gene_variant
A1CF 1 18 204 22 6 8 0 0 0
A2M 2 0 308 0 2 15 1 1 22
A4GNT 0 7 21 0 0 22 3 0 0
test file:
a x 2
b,c x 4
dd x 3
e,e,t x 5
a b 1
cc b 2
e,e,t b 1
This is what I'm getting:
a b,c dd e,e,t cc
x 5 2 4 3
b 1 2 1
EDIT: This might be doing it, but it doesn't output 0s in blank fields:
gawk 'BEGIN {FS = OFS = "\t"}
NR > 1 {data[$2][$1] = $3; blocks[$1]}
END {
    PROCINFO["sorted_in"] = "#ind_str_asc"
    # header
    printf "gene"
    for (block in blocks) {
        printf "%s%s", OFS, block
    }
    print ""
    # data
    for (ts in data) {
        printf "%s", ts
        for (block in blocks) {
            printf "%s%s", OFS, data[ts][block]
        }
        print ""
    }
}' file
modified from https://unix.stackexchange.com/questions/424642/dynamic-transposing-rows-to-columns-using-awk-based-on-row-value
If you want to print 0 if a certain value is absent, you could do something like this:
val = data[ts][block] ? data[ts][block] : 0;
printf "%s%s", OFS, val

missing observation panel data, bring forward value 20 periods

Here's code to read in a DataFrame like the one I'm looking at:
df = pd.DataFrame({
    'period': [1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15, 16, 19, 20, 21, 22,
               23, 25, 26],
    'id': [1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285,
           1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285],
    'pred': [-1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
             -1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
             -1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
             -1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775],
    'ret': [None, -0.02222222, -0.01363636, 0., -0.02764977,
            None, -0.00909091, -0.01376147, 0.00465116, None,
            0.01869159, 0., 0., None, -0.00460829,
            0.00462963, 0.02304147, 0., None, -0.00050756]})
Which will look like this when read in.
period id pred ret
0 1 1285 -1.653477 NaN
1 2 1285 -1.653477 -0.022222
2 3 1285 -1.653477 -0.013636
3 4 1285 -1.653477 0.000000
4 5 1285 -1.653477 -0.027650
5 8 1285 -1.653477 NaN
6 9 1285 -1.653477 -0.009091
7 10 1285 -1.653477 -0.013761
8 11 1285 -1.653477 0.004651
9 13 1285 -1.653477 NaN
10 14 1285 -1.653477 0.018692
11 15 1285 -1.653477 0.000000
12 16 1285 -1.653477 0.000000
13 19 1285 -1.653477 NaN
14 20 1285 -1.653477 -0.004608
15 21 1285 -1.653477 0.004630
16 22 1285 -1.653477 0.023041
17 23 1285 -1.653477 0.000000
18 25 1285 -1.653477 NaN
19 26 1285 -1.653477 -0.000508
pred is a 20-period-ahead prediction, so what I want to do is bring the returns back 20 periods (but do it in a flexible way).
Here's the lag function I have presently:
def lag(df, col, lag_dist=1, ref='period', group='id'):
    df = df.copy()
    new_col = 'lag' + str(lag_dist) + '_' + col
    df[new_col] = df.groupby(group)[col].shift(lag_dist)
    # set NaN values that differ from specified
    df[new_col] = (df.groupby(group)[ref]
                   .shift(lag_dist)
                   .sub(df[ref])
                   .eq(-lag_dist)
                   .mul(1)
                   .replace(0, np.nan) * df[new_col])
    return df[new_col]
but when I run
df['fut20_ret'] = lag(df, 'ret', -20, 'period')
df.head(20)
I get
period id pred gain fee prc ret fut20_ret
0 1 1285 -1.653478 0.000000 0.87 1.000000 NaN NaN
1 2 1285 -1.653478 -0.022222 0.87 0.977778 -0.022222 NaN
2 3 1285 -1.653478 -0.035556 0.87 0.964444 -0.013636 NaN
3 4 1285 -1.653478 -0.035556 0.87 0.964444 0.000000 NaN
4 5 1285 -1.653478 -0.062222 0.87 0.937778 -0.027650 NaN
6 8 1285 -1.653478 -0.022222 0.87 0.977778 NaN NaN
7 9 1285 -1.653478 -0.031111 0.87 0.968889 -0.009091 NaN
8 10 1285 -1.653478 -0.044444 0.87 0.955556 -0.013761 NaN
9 11 1285 -1.653478 -0.040000 0.87 0.960000 0.004651 NaN
10 13 1285 -1.653478 -0.048889 0.87 0.951111 NaN NaN
11 14 1285 -1.653478 -0.031111 0.87 0.968889 0.018692 NaN
12 15 1285 -1.653478 -0.031111 0.87 0.968889 0.000000 NaN
13 16 1285 -1.653478 -0.031111 0.87 0.968889 0.000000 NaN
15 19 1285 -1.653478 -0.035556 0.87 0.964444 NaN NaN
16 20 1285 -1.653478 -0.040000 0.87 0.960000 -0.004608 NaN
17 21 1285 -1.653478 -0.035556 0.87 0.964444 0.004630 NaN
18 22 1285 -1.653478 -0.013333 0.87 0.986667 0.023041 NaN
19 23 1285 -1.653478 -0.013333 0.87 0.986667 0.000000 NaN
How can I modify my lag function so that it works properly? It's close but I'm struggling on the last little bit.
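One robust way around the gaps is to align on the period values themselves rather than on row position: relabel a copy of the future rows dist periods earlier, then merge on (id, period). This is a sketch, not a fix to your lag function as written, and the helper name lead_by_period is made up here:

import numpy as np
import pandas as pd

def lead_by_period(df, col, dist=20, ref='period', group='id'):
    # relabel the future rows `dist` periods earlier, then look them
    # up by (group, ref); periods missing from the data give NaN
    future = df[[group, ref, col]].copy()
    future[ref] = future[ref] - dist
    merged = df[[group, ref]].merge(future, on=[group, ref], how='left')
    return merged[col].to_numpy()

df['fut20_ret'] = lead_by_period(df, 'ret', dist=20)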

How to add missing data to Pandas in Monthly Data

I have this following dataframe:
Date
2002-01-01 10.0 NaN NaN
2002-05-01 NaN 30.0 40.0
2002-07-01 NaN NaN 50.0
I would like to complete the missing months with zeros. I am actually able to do that, but only by adding the entire range of days that are missing, as you can see in the following code. The relevant part of the code is marked with
#############################
def createSeriesOfCompanies(df):
    listOfCompanies = list(set(df['Company']))
    dfSeries = df.pivot(index='Date', columns='Company', values='var1')
    # Here I include the missing dates
    #######################################################
    initialDate = dfSeries.index[0]
    endDate = dfSeries.index[-1]
    idx = pd.date_range(initialDate, endDate)
    dfSeries.index = pd.DatetimeIndex(dfSeries.index)
    dfSeries = dfSeries.reindex(idx, fill_value=0)
    ########################################################
    # Here it finishes the procedure
def creatingDataFrame():
    dateList = []
    dateList.append(datetime.date(2002, 1, 1))
    dateList.append(datetime.date(2002, 7, 1))
    dateList.append(datetime.date(2002, 5, 1))
    dateList.append(datetime.date(2002, 5, 1))
    dateList.append(datetime.date(2002, 7, 1))
    raw_data = {'Date': dateList,
                'Company': ['A', 'B', 'B', 'C', 'C'],
                'var1': [10, 20, 30, 40, 50]}
    df = pd.DataFrame(raw_data, columns=['Date', 'Company', 'var1'])
    df.loc[1, 'var1'] = np.nan
    return df
if __name__ == "__main__":
    df = creatingDataFrame()
    print(df)
    dfSeries, listOfCompanies = createSeriesOfCompanies(df)
I would like to get
Date
2002-01-01 10.0 NaN NaN
2002-02-01 0 0 0
2002-03-01 0 0 0
2002-04-01 0 0 0
2002-05-01 NaN 30.0 40.0
2002-06-01 0 0 0
2002-07-01 NaN NaN 50.0
But I am getting this
Company A B C
2002-01-01 10.0 NaN NaN
2002-01-02 0.0 0.0 0.0
2002-01-03 0.0 0.0 0.0
2002-01-04 0.0 0.0 0.0
2002-01-05 0.0 0.0 0.0
2002-01-06 0.0 0.0 0.0
2002-01-07 0.0 0.0 0.0
2002-01-08 0.0 0.0 0.0
2002-01-09 0.0 0.0 0.0
2002-01-10 0.0 0.0 0.0
2002-01-11 0.0 0.0 0.0
2002-01-12 0.0 0.0 0.0
2002-01-13 0.0 0.0 0.0
2002-01-14 0.0 0.0 0.0
2002-01-15 0.0 0.0 0.0
2002-01-16 0.0 0.0 0.0
2002-01-17 0.0 0.0 0.0
2002-01-18 0.0 0.0 0.0
2002-01-19 0.0 0.0 0.0
2002-01-20 0.0 0.0 0.0
2002-01-21 0.0 0.0 0.0
2002-01-22 0.0 0.0 0.0
2002-01-23 0.0 0.0 0.0
2002-01-24 0.0 0.0 0.0
2002-01-25 0.0 0.0 0.0
2002-01-26 0.0 0.0 0.0
2002-01-27 0.0 0.0 0.0
2002-01-28 0.0 0.0 0.0
2002-01-29 0.0 0.0 0.0
2002-01-30 0.0 0.0 0.0
...
How can I deal with this problem?
You can use reindex. Given that the date is the index:
df.index = pd.to_datetime(df.index)
df.reindex(pd.date_range(df.index.min(), df.index.max(), freq = 'MS'))
A B C
2002-01-01 10.0 NaN NaN
2002-02-01 NaN NaN NaN
2002-03-01 NaN NaN NaN
2002-04-01 NaN NaN NaN
2002-05-01 NaN 30.0 40.0
2002-06-01 NaN NaN NaN
2002-07-01 NaN NaN 50.0
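If you want the inserted months to be 0 rather than NaN, reindex takes a fill_value that applies only to the newly added labels, so the original NaNs stay NaN, matching the desired output:

df.reindex(pd.date_range(df.index.min(), df.index.max(), freq='MS'), fill_value=0)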
Use asfreq by MS (start of months):
df=creatingDataFrame()
df = df.pivot(index='Date', columns='Company', values='var1').asfreq('MS', fill_value=0)
print (df)
Company A B C
Date
2002-01-01 10.0 NaN NaN
2002-02-01 0.0 0.0 0.0
2002-03-01 0.0 0.0 0.0
2002-04-01 0.0 0.0 0.0
2002-05-01 NaN 30.0 40.0
2002-06-01 0.0 0.0 0.0
2002-07-01 NaN NaN 50.0

Extract date from datetime - change . to , and print sums of different fields

aNumber bNumber startDate cost balanceAfter trafficCase Operator unknown3 MainAmount BALANCEBEFORE
22676239633 433 2014-07-02 10:16:48.000 0,00 0.20 0 Short Code 397224944 0.0000 0.2000
22677277255 76919167 2014-07-02 10:16:51.000 1,00 92.60 0 Airtel 126268625 0.0000 92.6000
22676777508 76701575 2014-07-02 10:16:55.000 1,00 217.00 0 Airtel 4132186103 0.0000 217.0000
22665706841 433 2014-07-02 10:16:57.000 0,00 69.50 0 Short Code 4133821554 0.0000 69.5000
22665799922 70110055 2014-07-03 10:16:45.000 20,00 0.50 0 Telmob 126260244 20.0000 0.5000
22676239633 433 2014-07-03 10:16:48.000 0,00 0.20 0 Short Code 397224944 0.0000 0.2000
22677277255 76919167 2014-07-04 10:16:51.000 1,00 92.60 0 Airtel 126268625 0.0000 92.6000
22676777508 76701575 2014-07-04 10:16:55.000 1,00 217.00 0 Airtel 4132186103 0.0000 217.0000
22665706841 433 2014-07-05 10:16:57.000 0,00 69.50 0 Short Code 4133821554 0.0000 69.5000
Here is a sample of the data I have. I want to sum up cost, balanceAfter, MainAmount and BALANCEBEFORE each time the date changes, but the date is combined with the time, and the cost column uses a comma as its decimal separator while the other columns use a dot, so my awk script can't perform the operation.
Can I have an AWK script which will first extract only the date so in the end I will have an output looking like:
Date Cost balanceAfter MainAmount BALANCEBEFORE
02/07/2014 2,00 379,3 0 379,3
03/07/2014 20,00 0,7 20 0,7
04/07/2014 2,00 309,6 0 309,6
05/07/2014 0,00 69,5 0 69,5
HERE IS MY AWK SCRIPT
awk -F 'NR==1 {header=$0; next} {a[$3]+=$4 a[$3]+=$5 a[$3]+=$9 a[$3]+=$10} END {for (i in a) {printf "%d\t%d\n", i, a[i]}; tot+=a[i]};' out.txt>output.doc
EDIT: the pre-processing step is avoided, per Etan Reisner's suggestion, by using $NF to work around the differing numbers of tokens in the Operator column.
$ cat data.txt
aNumber bNumber startDate cost balanceAfter trafficCase Operator unknown3 MainAmount BALANCEBEFORE
22676239633 433 2014-07-02 10:16:48.000 0,00 0.20 0 Short Code 397224944 0.0000 0.2000
22677277255 76919167 2014-07-02 10:16:51.000 1,00 92.60 0 Airtel 126268625 0.0000 92.6000
22676777508 76701575 2014-07-02 10:16:55.000 1,00 217.00 0 Airtel 4132186103 0.0000 217.0000
22665706841 433 2014-07-02 10:16:57.000 0,00 69.50 0 Short Code 4133821554 0.0000 69.5000
22665799922 70110055 2014-07-03 10:16:45.000 20,00 0.50 0 Telmob 126260244 20.0000 0.5000
22676239633 433 2014-07-03 10:16:48.000 0,00 0.20 0 Short Code 397224944 0.0000 0.2000
22677277255 76919167 2014-07-04 10:16:51.000 1,00 92.60 0 Airtel 126268625 0.0000 92.6000
22676777508 76701575 2014-07-04 10:16:55.000 1,00 217.00 0 Airtel 4132186103 0.0000 217.0000
22665706841 433 2014-07-05 10:16:57.000 0,00 69.50 0 Short Code 4133821554 0.0000 69.5000
$ cat so2.awk
NR > 1 {
    cost = $5;
    balanceAfter = $6;
    mainAmount = $(NF - 1);
    balanceBefore = $NF;
    sub(",", ".", cost);
    sub(",", ".", balanceAfter);
    sub(",", ".", mainAmount);
    sub(",", ".", balanceBefore);
    dateCost[$3] += cost;
    dateBalanceAfter[$3] += balanceAfter;
    dateMainAmount[$3] += mainAmount;
    dateBalanceBefore[$3] += balanceBefore;
}
END {
    printf("%s\t%s\t%s\t%s\t%s\n", "Date", "Cost", "BalanceAfter", "MainAmount", "BalanceBefore");
    for (i in dateCost) {
        printf("%s\t%f\t%f\t%f\t%f\n", i, dateCost[i], dateBalanceAfter[i], dateMainAmount[i], dateBalanceBefore[i]);
    }
}
$ awk -f so2.awk data.txt
Date Cost BalanceAfter MainAmount BalanceBefore
2014-07-02 2.000000 379.300000 0.000000 379.300000
2014-07-03 20.000000 0.700000 20.000000 0.700000
2014-07-04 2.000000 309.600000 0.000000 309.600000
2014-07-05 0.000000 69.500000 0.000000 69.500000
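Neither this output nor the one below matches the exact requested layout (dd/mm/yyyy dates with comma decimals). If that matters, one possible post-processing step over the tab-separated output above (a sketch; it does not trim trailing zeros):

awk -f so2.awk data.txt | awk -v OFS='\t' '
NR == 1 { print; next }
{
    split($1, d, "-")                              # 2014-07-02 -> 02/07/2014
    $1 = d[3] "/" d[2] "/" d[1]
    for (i = 2; i <= NF; i++) sub(/\./, ",", $i)   # first dot -> comma in each number
    print
}'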
This requires no pre-processing of the file:
awk '
BEGIN {print "Date Cost BalanceAfter MainAmount BalanceBefore"}
NR == 1 {next}
function showday() {
printf "%s\t%.2f\t%.1f\t%d\t%.1f\n", date, cost, bAfter, main, bBefore
}
date != $3 {
if (date) showday()
date = $3
cost = bAfter = main = bBefore = 0
}
{
sub(/,/, ".", $5)
cost += $5
bAfter += $6
main += $(NF-1)
bBefore += $NF
}
END {showday()}
' file | column -t
Date Cost BalanceAfter MainAmount BalanceBefore
2014-07-02 2.00 379.3 0 379.3
2014-07-03 20.00 0.7 20 0.7
2014-07-04 2.00 309.6 0 309.6
2014-07-05 0.00 69.5 0 69.5