how to move up previous values in sequences?

I have a df like the following ...
import numpy as np
import pandas as pd

df = pd.DataFrame({"A":[8001,7999,7998,np.nan,9900, 9342,9324,8534,8358,9457, np.nan ,8999,8492,np.nan,np.nan],
"B":[201,209,298,300,np.nan, 342,324,854,858,457, 145,189,192,134,135],
"C":[11991,15631,47998,38030,19900, 29342,np.nan,28534,28358,29457, 27245,28999,28492,29334,28234]},
index=pd.Index(['2019-06-17 00:00:00','2019-06-17 00:01:01', '2019-06-17 00:02:00', '2019-06-17 00:03:04', '2020-06-17 00:04:00', '2020-06-17 00:05:00', '2020-06-17 00:06:00', '2020-06-17 00:07:00','2020-06-17 00:08:00','2020-06-17 00:09:00','2020-06-17 00:10:00','2020-06-17 00:11:00','2020-06-17 00:12:00','2020-06-17 00:13:00', '2020-06-17 00:14:00']))
Time A B C
'2019-06-17 00:00:00' 8001 201 11991
'2019-06-17 00:01:01' 7999 209 15631
'2019-06-17 00:02:00' 7998 298 47998
'2019-06-17 00:03:04' NaN 300 38030
'2020-06-17 00:04:00' 9900 NaN 19900
'2020-06-17 00:05:00' 9342 342 29342
'2020-06-17 00:06:00' 9324 324 NaN
'2020-06-17 00:07:00' 8534 854 28534
'2020-06-17 00:08:00' 8358 858 28358
'2020-06-17 00:09:00' 9457 457 29457
'2020-06-17 00:10:00' NaN 145 27245
'2020-06-17 00:11:00' 8999 189 28999
'2020-06-17 00:12:00' 8492 192 28492
'2020-06-17 00:13:00' NaN 134 29334
'2020-06-17 00:14:00' NaN 135 28234
... and I can split it into overlapping sequences. So far, no problem.
(I already have code for this; for sequence_length == 5 the result is:)
array([
[['2019-06-17 00:00:00' 8001 201 11991 ]
['2019-06-17 00:01:01' 7999 209 15631 ]
['2019-06-17 00:02:00' 7998 298 47998 ]
['2019-06-17 00:03:04' NaN 300 38030 ]
['2020-06-17 00:04:00' 9900 NaN 19900 ]]
[['2019-06-17 00:01:01' 7999 209 15631 ]
['2019-06-17 00:02:00' 7998 298 47998 ]
['2019-06-17 00:03:04' NaN 300 38030 ]
['2020-06-17 00:04:00' 9900 NaN 19900 ]
['2020-06-17 00:05:00' 9342 342 29342 ]]
[['2019-06-17 00:02:00' 7998 298 47998 ]
['2019-06-17 00:03:04' NaN 300 38030 ]
['2020-06-17 00:04:00' 9900 NaN 19900 ]
['2020-06-17 00:05:00' 9342 342 29342 ]
['2020-06-17 00:06:00' 9324 324 NaN ]]
[['2019-06-17 00:03:04' NaN 300 38030 ]
['2020-06-17 00:04:00' 9900 NaN 19900 ]
['2020-06-17 00:05:00' 9342 342 29342 ]
['2020-06-17 00:06:00' 9324 324 NaN ]
['2020-06-17 00:07:00' 8534 854 28534 ]]
[['2020-06-17 00:04:00' 9900 NaN 19900 ]
['2020-06-17 00:05:00' 9342 342 29342 ]
['2020-06-17 00:06:00' 9324 324 NaN ]
['2020-06-17 00:07:00' 8534 854 28534 ]
['2020-06-17 00:08:00' 8358 858 28358 ]]
[['2020-06-17 00:05:00' 9342 342 29342 ]
['2020-06-17 00:06:00' 9324 324 NaN ]
['2020-06-17 00:07:00' 8534 854 28534 ]
['2020-06-17 00:08:00' 8358 858 28358 ]
['2020-06-17 00:09:00' 9457 457 29457 ]]
[['2020-06-17 00:06:00' 9324 324 NaN ]
['2020-06-17 00:07:00' 8534 854 28534 ]
['2020-06-17 00:08:00' 8358 858 28358 ]
['2020-06-17 00:09:00' 9457 457 29457 ]
['2020-06-17 00:10:00' NaN 145 27245 ]]
[['2020-06-17 00:07:00' 8534 854 28534 ]
['2020-06-17 00:08:00' 8358 858 28358 ]
['2020-06-17 00:09:00' 9457 457 29457 ]
['2020-06-17 00:10:00' NaN 145 27245 ]
['2020-06-17 00:11:00' 8999 189 28999 ]]
[['2020-06-17 00:08:00' 8358 858 28358 ]
['2020-06-17 00:09:00' 9457 457 29457 ]
['2020-06-17 00:10:00' NaN 145 27245 ]
['2020-06-17 00:11:00' 8999 189 28999 ]
['2020-06-17 00:12:00' 8492 192 28492 ]]
[['2020-06-17 00:09:00' 9457 457 29457 ]
['2020-06-17 00:10:00' NaN 145 27245 ]
['2020-06-17 00:11:00' 8999 189 28999 ]
['2020-06-17 00:12:00' 8492 192 28492 ]
['2020-06-17 00:13:00' NaN 134 29334 ]]
[['2020-06-17 00:10:00' NaN 145 27245 ]
['2020-06-17 00:11:00' 8999 189 28999 ]
['2020-06-17 00:12:00' 8492 192 28492 ]
['2020-06-17 00:13:00' NaN 134 29334 ]
['2020-06-17 00:14:00' NaN 135 28234 ]]
])
# (I kept the time indices in there to make things clearer)
I can't work with NaNs, but I am also not allowed to interpolate the data. So:
Question:
How do I move up all the previous valid values to where the NaNs are?
(as in this example:)
# Before:
[['2020-06-17 00:10:00' NaN 145 27245 ]
['2020-06-17 00:11:00' 8999 189 28999 ]
['2020-06-17 00:12:00' 8492 192 28492 ]
['2020-06-17 00:13:00' NaN 134 29334 ]
['2020-06-17 00:14:00' NaN 135 28234 ]]
# Expected Result: # v-- the last 5 valid values in "A" till the current index (=='2020-06-17 00:14:00')
[['2020-06-17 00:10:00' 8534 145 27245 ]
['2020-06-17 00:11:00' 8358 189 28999 ]
['2020-06-17 00:12:00' 9457 192 28492 ]
['2020-06-17 00:13:00' 8999 134 29334 ]
['2020-06-17 00:14:00' 8492 135 28234 ]]
A few early samples will probably be lost when the sequence length gets too large, but that's okay (there could be a very large number of NaNs in one column).
EDIT: (how to do this?)
In steps, it would look like this:
Step 1: In the df, get the last X valid elements up to the current time (e.g. '00:14:00') as a list.
Step 2: Replace the "column" in the nested list with the list from Step 1 (how do I select it?).
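The two steps can be sketched in pandas roughly as follows. This is only a minimal sketch, not a confirmed solution: the helper name last_valid_window is hypothetical, the timestamps are left out for brevity, and it assumes that every column should be compacted independently (so after compaction the rows of a window no longer line up in time across columns).

```python
import numpy as np
import pandas as pd


def last_valid_window(df, end_pos, seq_len):
    """For each column, take the last `seq_len` non-NaN values found at or
    before row position `end_pos`; return None if any column has too few."""
    columns = []
    for name in df.columns:
        # Step 1: valid values of this column up to the current row.
        valid = df[name].iloc[: end_pos + 1].dropna().to_numpy()
        if len(valid) < seq_len:
            return None  # not enough valid history yet, so the sample is lost
        # Step 2: this slice replaces the column of the window.
        columns.append(valid[-seq_len:])
    return np.column_stack(columns)  # shape: (seq_len, number of columns)


# Same values as the table in the question (timestamps omitted for brevity).
df = pd.DataFrame({
    "A": [8001, 7999, 7998, np.nan, 9900, 9342, 9324, 8534, 8358, 9457,
          np.nan, 8999, 8492, np.nan, np.nan],
    "B": [201, 209, 298, 300, np.nan, 342, 324, 854, 858, 457,
          145, 189, 192, 134, 135],
    "C": [11991, 15631, 47998, 38030, 19900, 29342, np.nan, 28534, 28358,
          29457, 27245, 28999, 28492, 29334, 28234],
})

seq_len = 5
windows = []
for end_pos in range(seq_len - 1, len(df)):
    window = last_valid_window(df, end_pos, seq_len)
    if window is not None:
        windows.append(window)

print(windows[-1][:, 0])  # column "A" of the window ending at the last row
```

For the window ending at the last row, column "A" comes out as 8534, 8358, 9457, 8999, 8492, which is the expected result shown above. The earliest window is dropped because "A" and "B" each have only four valid values at that point.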

Related

missing observation panel data, bring forward value 20 periods

Here's code to read in a DataFrame like the one I'm looking at:
df = pd.DataFrame({
'period' : [1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15, 16, 19, 20, 21, 22,
23, 25, 26],
'id' : [1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285,
1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285, 1285],
'pred': [-1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
-1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
-1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775,
-1.6534775, -1.6534775, -1.6534775, -1.6534775, -1.6534775],
'ret' : [ None, -0.02222222, -0.01363636, 0. , -0.02764977,
None, -0.00909091, -0.01376147, 0.00465116, None,
0.01869159, 0. , 0. , None , -0.00460829,
0.00462963, 0.02304147, 0. , None, -0.00050756]})
Which will look like this when read in.
period id pred ret
0 1 1285 -1.653477 NaN
1 2 1285 -1.653477 -0.022222
2 3 1285 -1.653477 -0.013636
3 4 1285 -1.653477 0.000000
4 5 1285 -1.653477 -0.027650
5 8 1285 -1.653477 NaN
6 9 1285 -1.653477 -0.009091
7 10 1285 -1.653477 -0.013761
8 11 1285 -1.653477 0.004651
9 13 1285 -1.653477 NaN
10 14 1285 -1.653477 0.018692
11 15 1285 -1.653477 0.000000
12 16 1285 -1.653477 0.000000
13 19 1285 -1.653477 NaN
14 20 1285 -1.653477 -0.004608
15 21 1285 -1.653477 0.004630
16 22 1285 -1.653477 0.023041
17 23 1285 -1.653477 0.000000
18 25 1285 -1.653477 NaN
19 26 1285 -1.653477 -0.000508
pred is a 20-period prediction, so what I want to do is bring the returns back 20 days (but do it in a flexible way).
Here's the lag function I have at present:
def lag(df, col, lag_dist=1, ref='period', group='id'):
    df = df.copy()
    new_col = 'lag'+str(lag_dist)+'_'+col
    df[new_col] = df.groupby(group)[col].shift(lag_dist)
    # set NaN values that differ from specified
    df[new_col] = (df.groupby(group)[ref]
                   .shift(lag_dist)
                   .sub(df[ref])
                   .eq(-lag_dist)
                   .mul(1)
                   .replace(0,np.nan)*df[new_col])
    return df[new_col]
but when I run
df['fut20_ret'] = lag(df, 'ret', -20, 'period')
df.head(20)
I get
period id pred gain fee prc ret fut20_ret
0 1 1285 -1.653478 0.000000 0.87 1.000000 NaN NaN
1 2 1285 -1.653478 -0.022222 0.87 0.977778 -0.022222 NaN
2 3 1285 -1.653478 -0.035556 0.87 0.964444 -0.013636 NaN
3 4 1285 -1.653478 -0.035556 0.87 0.964444 0.000000 NaN
4 5 1285 -1.653478 -0.062222 0.87 0.937778 -0.027650 NaN
6 8 1285 -1.653478 -0.022222 0.87 0.977778 NaN NaN
7 9 1285 -1.653478 -0.031111 0.87 0.968889 -0.009091 NaN
8 10 1285 -1.653478 -0.044444 0.87 0.955556 -0.013761 NaN
9 11 1285 -1.653478 -0.040000 0.87 0.960000 0.004651 NaN
10 13 1285 -1.653478 -0.048889 0.87 0.951111 NaN NaN
11 14 1285 -1.653478 -0.031111 0.87 0.968889 0.018692 NaN
12 15 1285 -1.653478 -0.031111 0.87 0.968889 0.000000 NaN
13 16 1285 -1.653478 -0.031111 0.87 0.968889 0.000000 NaN
15 19 1285 -1.653478 -0.035556 0.87 0.964444 NaN NaN
16 20 1285 -1.653478 -0.040000 0.87 0.960000 -0.004608 NaN
17 21 1285 -1.653478 -0.035556 0.87 0.964444 0.004630 NaN
18 22 1285 -1.653478 -0.013333 0.87 0.986667 0.023041 NaN
19 23 1285 -1.653478 -0.013333 0.87 0.986667 0.000000 NaN
How can I modify my lag function so that it works properly? It's close but I'm struggling on the last little bit.
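One possible reason for the all-NaN column is that shift moves values by a number of rows, not by a number of periods: with gaps in period, the row 20 positions away is rarely exactly 20 periods away, and a short group has no such row at all. A minimal sketch of an alternative is to match on the period value itself with a merge. The function name lag_by_period and the helper column name lagged are hypothetical, and the sketch assumes each (id, period) pair occurs only once.

```python
import numpy as np
import pandas as pd


def lag_by_period(df, col, lag_dist=1, ref="period", group="id"):
    """Value of `col` observed `lag_dist` periods earlier (later if negative),
    matched on the value of `ref` rather than on row position."""
    lookup = df[[group, ref, col]].copy()
    # A value observed at period p should appear on the row for p + lag_dist.
    lookup[ref] = lookup[ref] + lag_dist
    lookup = lookup.rename(columns={col: "lagged"})
    # Left join keeps one output row per input row (assumes unique id/period).
    merged = df[[group, ref]].merge(lookup, on=[group, ref], how="left")
    merged.index = df.index  # restore the caller's index after the merge
    return merged["lagged"]


df = pd.DataFrame({
    "period": [1, 2, 3, 4, 5, 8, 9, 10, 11, 13, 14, 15, 16, 19, 20, 21, 22,
               23, 25, 26],
    "id": [1285] * 20,
    "ret": [np.nan, -0.02222222, -0.01363636, 0.0, -0.02764977,
            np.nan, -0.00909091, -0.01376147, 0.00465116, np.nan,
            0.01869159, 0.0, 0.0, np.nan, -0.00460829,
            0.00462963, 0.02304147, 0.0, np.nan, -0.00050756],
})

# Negative lag: bring the return from 20 periods ahead back to the current row.
df["fut20_ret"] = lag_by_period(df, "ret", -20)
print(df[["period", "ret", "fut20_ret"]].head())
```

With this sample, only periods 1, 2 and 3 have an observed return exactly 20 periods later (periods 21, 22 and 23), so only those rows are filled; every other row stays NaN because the target period is missing or its return is itself NaN.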

SQL query is not working (Error in rsqlite_send_query)

This is what the head of my data frame looks like
> head(d19_1)
SMZ SIZ1_diff SIZ1_base SIZ2_diff SIZ2_base SIZ3_diff SIZ3_base SIZ4_diff SIZ4_base SIZ5_diff SIZ5_base
1 1 -620 4170 -189 1347 -35 2040 82 1437 244 1533
2 2 -219 831 -57 255 -4 392 8 282 14 297
3 3 -426 834 -162 294 -134 379 -81 241 -22 221
4 4 -481 676 -142 216 -114 267 -50 158 -43 166
5 5 -233 1711 -109 584 54 913 71 624 74 707
6 6 -322 1539 -79 512 -50 799 23 532 63 576
Total_og Total_base %_SIZ1 %_SIZ2 %_SIZ3 %_SIZ4 %_SIZ5 Total_og Total_base
1 11980 12648 14.86811 14.03118 1.715686 5.706333 15.916504 11980 12648
2 2156 2415 26.35379 22.35294 1.020408 2.836879 4.713805 2156 2415
3 1367 2314 51.07914 55.10204 35.356201 33.609959 9.954751 1367 2314
4 790 1736 71.15385 65.74074 42.696629 31.645570 25.903614 790 1736
5 5339 5496 13.61777 18.66438 5.914567 11.378205 10.466761 5339 5496
6 4362 4747 20.92268 15.42969 6.257822 4.323308 10.937500 4362 4747
The structure of the data frame, as reported by str(d19_1), is shown below:
> str(d19_1)
'data.frame': 1588 obs. of 20 variables:
$ SMZ : int 1 2 3 4 5 6 7 8 9 10 ...
$ SIZ1_diff : int -620 -219 -426 -481 -233 -322 -176 -112 -34 -103 ...
$ SIZ1_base : int 4170 831 834 676 1711 1539 720 1396 998 1392 ...
$ SIZ2_diff : int -189 -57 -162 -142 -109 -79 -12 72 -36 -33 ...
$ SIZ2_base : int 1347 255 294 216 584 512 196 437 343 479 ...
$ SIZ3_diff : int -35 -4 -134 -114 54 -50 16 4 26 83 ...
$ SIZ3_base : int 2040 392 379 267 913 799 361 804 566 725 ...
$ SIZ4_diff : int 82 8 -81 -50 71 23 36 127 46 75 ...
$ SIZ4_base : int 1437 282 241 158 624 532 242 471 363 509 ...
$ SIZ5_diff : int 244 14 -22 -43 74 63 11 143 79 125 ...
$ SIZ5_base : int 1533 297 221 166 707 576 263 582 429 536 ...
$ Total_og : int 11980 2156 1367 790 5339 4362 2027 4715 3465 4561 ...
$ Total_base: int 12648 2415 2314 1736 5496 4747 2168 4464 3278 4375 ...
$ %_SIZ1 : num 14.9 26.4 51.1 71.2 13.6 ...
$ %_SIZ2 : num 14 22.4 55.1 65.7 18.7 ...
$ %_SIZ3 : num 1.72 1.02 35.36 42.7 5.91 ...
$ %_SIZ4 : num 5.71 2.84 33.61 31.65 11.38 ...
$ %_SIZ5 : num 15.92 4.71 9.95 25.9 10.47 ...
$ Total_og : int 11980 2156 1367 790 5339 4362 2027 4715 3465 4561 ...
$ Total_base: int 12648 2415 2314 1736 5496 4747 2168 4464 3278 4375 ...
When I run the query below, it returns the error shown after it, and I don't know why. I don't have any column named <NA> in the table.
Query
d20_1 <- sqldf('SELECT *, CASE
WHEN SMZ BETWEEN 1 AND 110 THEN "Baltimore City"
WHEN SMZ BETWEEN 111 AND 217 THEN "Anne Arundel County"
WHEN SMZ BETWEEN 218 AND 405 THEN "Baltimore County"
WHEN SMZ BETWEEN 406 AND 453 THEN "Carroll County"
WHEN SMZ BETWEEN 454 AND 524 THEN "Harford County"
WHEN SMZ BETWEEN 1667 AND 1674 THEN "York County"
ELSE 0
END Jurisdiction
FROM d19_1')
Error:
Error in rsqlite_send_query(conn@ptr, statement) :
table d19_1 has no column named <NA>

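Your code works correctly for me with the data below, in which every column name is unique (note Total_og.1, Total_base.1 and the X._SIZ columns). Your str() output lists Total_og and Total_base twice, which may be what triggers the error: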
d19_1 <- structure(list(SMZ = 1:6, SIZ1_diff = c(-620L, -219L, -426L,
-481L, -233L, -322L), SIZ1_base = c(4170L, 831L, 834L, 676L,
1711L, 1539L), SIZ2_diff = c(-189L, -57L, -162L, -142L, -109L,
-79L), SIZ2_base = c(1347L, 255L, 294L, 216L, 584L, 512L), SIZ3_diff = c(-35L,
-4L, -134L, -114L, 54L, -50L), SIZ3_base = c(2040L, 392L, 379L,
267L, 913L, 799L), SIZ4_diff = c(82L, 8L, -81L, -50L, 71L, 23L
), SIZ4_base = c(1437L, 282L, 241L, 158L, 624L, 532L), SIZ5_diff = c(244L,
14L, -22L, -43L, 74L, 63L), SIZ5_base = c(1533L, 297L, 221L,
166L, 707L, 576L), Total_og = c(11980L, 2156L, 1367L, 790L, 5339L,
4362L), Total_base = c(12648L, 2415L, 2314L, 1736L, 5496L, 4747L
), X._SIZ1 = c(14.86811, 26.35379, 51.07914, 71.15385, 13.61777,
20.92268), X._SIZ2 = c(14.03118, 22.35294, 55.10204, 65.74074,
18.66438, 15.42969), X._SIZ3 = c(1.715686, 1.020408, 35.356201,
42.696629, 5.914567, 6.257822), X._SIZ4 = c(5.706333, 2.836879,
33.609959, 31.64557, 11.378205, 4.323308), X._SIZ5 = c(15.916504,
4.713805, 9.954751, 25.903614, 10.466761, 10.9375), Total_og.1 = c(11980L,
2156L, 1367L, 790L, 5339L, 4362L), Total_base.1 = c(12648L, 2415L,
2314L, 1736L, 5496L, 4747L)), .Names = c("SMZ", "SIZ1_diff",
"SIZ1_base", "SIZ2_diff", "SIZ2_base", "SIZ3_diff", "SIZ3_base",
"SIZ4_diff", "SIZ4_base", "SIZ5_diff", "SIZ5_base", "Total_og",
"Total_base", "X._SIZ1", "X._SIZ2", "X._SIZ3", "X._SIZ4", "X._SIZ5",
"Total_og.1", "Total_base.1"), row.names = c(NA, -6L), class = "data.frame")
library(sqldf)
sqldf('SELECT *, CASE
WHEN SMZ BETWEEN 1 AND 110 THEN "Baltimore City"
WHEN SMZ BETWEEN 111 AND 217 THEN "Anne Arundel County"
WHEN SMZ BETWEEN 218 AND 405 THEN "Baltimore County"
WHEN SMZ BETWEEN 406 AND 453 THEN "Carroll County"
WHEN SMZ BETWEEN 454 AND 524 THEN "Harford County"
WHEN SMZ BETWEEN 1667 AND 1674 THEN "York County"
ELSE 0
END Jurisdiction
FROM d19_1')

Protractor - find element in iframe

At the very top of my page I have an iframe:
<iframe ng-if="webpadUrl" id="webPadIframe" ng-src="http://Path/To/iFrame?sessionId=9bc9989441c8c9cfb9ff5bdc381a72ea" seamless="seamless" class="fullscreen ng-scope" src="http://Path/To/iFrame?sessionId=9bc9989441c8c9cfb9ff5bdc381a72ea">
</iframe>
Inside the iframe I have something like:
<svg xmlns="http://www.w3.org/2000/svg" version="1.1" viewBox="0 0 5845 3897">
and below that, multiple <g> elements with different IDs, and so on:
<g id="30_0_80" transform="translate(420,754)" class="">
<path class="fp x80 xab active" d="M307 0 L293 15 L155 120 L87 181 L47 220 L0 277 L0 282 L14 341 L27 379 L32 386 L32 386 L74 425 L123 461 L153 480 L188 500 L238 525 L303 551 L337 563 L340 563 L381 561 L490 560 L492 557 L522 526 L591 473 L662 430 L745 383 L770 368 L785 358 L796 350 L802 343 L806 335 L809 321 L809 318 L810 295 L808 293 L806 293 L763 292 L680 277 L643 269 L590 253 L555 239 L555 239 L508 214 L452 179 L397 138 L369 115 L339 79 L325 56 L310 28 L308 23 L308 19 L310 1 L307 0 Z"></path>
<path class="p x88 xc7" d="M796 296 L792 300 L736 324 L595 391 L486 455 L413 505 L349 559"></path>
<path class="p x88 xc7" d="M33 372 L57 324 L82 284 L128 228 L133 222 L134 221 L164 188 L222 131 L252 102 L281 69"></path><path class="p x88 xc7" d="M9 283 L24 261 L52 221 L79 190 L88 182"></path><path class="p x88 xc7" d="M169 175 L251 97 L284 60 L295 40 L303 25"></path><path class="p x88 xc7" d="M132 214 L119 229 L88 266"></path>
<path class="p x88 xc7" d="M72 287 L54 315"></path><path class="p x88 xc7" d="M47 326 L44 331 L29 360"></path>
</g>
What I am trying to do is click the <g> element.
Since it is inside an iframe, I tried something like:
browser.driver.switchTo().frame(element(by.id('30_0_80')));
Unfortunately this doesn't work. Is there another way to work with these iframes?
Error: NoSuchElementError: No element found using locator: By.id("30_0_80")

After some time I found a way to switch to the right frame; I hope it helps someone.
browser.switchTo().frame('webPadIframe');
browser.findElement(by.id('30_0_80')).click();

How to deal with non-angular iframes using control-flow.
let flow = protractor.promise.controlFlow();
flow.execute(function(){
    browser.ignoreSynchronization = true; //allows you to handle non angular page
    browser.switchTo().frame('webPadIframe'); //switches to iframe
    browser.findElement(by.id('30_0_80')).click(); //action within iframe
    browser.switchTo().defaultContent(); //switches back to main page
    browser.ignoreSynchronization = false; //lets you resume handling angular
});

identifying strings within intervals, pt 2

For each row, I would like to know if the numerical string in the 6th column resides within the start and end intervals of the 3rd and 4th column. The issue for me is that identical strings in the 1st and 5th column are not always in the same row (e.g., uce-6459 is on the same line as uce-432).
Input
locus match start end subset pos region
uce-3280 uce-3280_p1 269 388 uce-3280 222
uce-6620 uce-6620_p1 297 416 uce-6620 198
uce-6620 uce-6620_p1 297 416 uce-6620 300
uce-432 uce-432_p2 328 447 uce-432 205
uce-432 uce-432_p1 268 387 uce-6459 207
uce-6459 uce-6459_p1 210 329 uce-6459 275
uce-6459 uce-6459_p1 210 329 uce-6459 288
uce-6459 uce-6459_p1 210 329 uce-374 373
uce-374 uce-374_p2 509 628 uce-3393 327
uce-374 uce-374_p1 449 568 uce-3393 416
uce-3393 uce-3393_p1 439 558 uce-3393 712
uce-3393 uce-3393_p1 439 558 uce-1200 416
uce-3393 uce-3393_p1 439 558 uce-805 397
uce-1200 uce-1200_p3 341 460 uce-627 326
uce-805 uce-805_p1 333 452 uce-2299 340
uce-627 uce-627_p1 396 515 uce-2126 481
uce-2299 uce-2299_p1 388 507 uce-5427 562
uce-2126 uce-2126_p1 323 437 uce-5427 711
uce-5427 uce-5427_p1 509 628 uce-5893 242
uce-5427 uce-5427_p1 509 628 uce-5893 330
uce-5893 uce-5893_p1 477 582 uce-5893 398
Desired output
locus match start end subset pos region
uce-3280 uce-3280_p1 269 388 uce-3280 222 no
uce-6620 uce-6620_p1 297 416 uce-6620 198 no
uce-6620 uce-6620_p1 297 416 uce-6620 300 yes
uce-432 uce-432_p2 328 447 uce-432 205 no
uce-432 uce-432_p1 268 387 uce-6459 207 no
uce-6459 uce-6459_p1 210 329 uce-6459 275 yes
uce-6459 uce-6459_p1 210 329 uce-6459 288 yes
uce-6459 uce-6459_p1 210 329 uce-374 373 no
uce-374 uce-374_p2 509 628 uce-3393 327 no
uce-374 uce-374_p1 449 568 uce-3393 416 no
uce-3393 uce-3393_p1 439 558 uce-3393 712 no
uce-3393 uce-3393_p1 439 558 uce-1200 416 yes
uce-3393 uce-3393_p1 439 558 uce-805 397 yes
uce-1200 uce-1200_p3 341 460 uce-627 326 no
uce-805 uce-805_p1 333 452 uce-2299 340 no
uce-627 uce-627_p1 396 515 uce-2126 481 no
uce-2299 uce-2299_p1 388 507 uce-5427 562 yes
uce-2126 uce-2126_p1 323 437 uce-5427 711 no
uce-5427 uce-5427_p1 509 628 uce-5893 242 no
uce-5427 uce-5427_p1 509 628 uce-5893 330 no
uce-5893 uce-5893_p1 477 582 uce-5893 398 no
Any help would be appreciated.

Here is a full stand-alone awk-script:
#!/usr/bin/awk -f
# goes through the whole file, saves boundaries of the intervals
NR > 1 && NR == FNR {
starts[$1] = $3
ends[$1] = $4
#print "Scanned interval: "$1" = ["starts[$1]","ends[$1]"]"
}
# prints the header of the table
NR != FNR && FNR == 1 {
print $0
}
# annotates each line with "yes"/"no"
FNR > 1 && NR != FNR {
print $0" "(starts[$5] <= $6 && $6 <= ends[$5] ? "yes" : "no")
}
Depending on what OS you have and what awk you are using, you might need to adjust the first line (use which awk to find the right path).
In order to make it run, you have to save it in a separate file (e.g. analyzeSnp.awk), then make it executable (e.g. chmod u+x analyzeSnp.awk), and then run it like this:
./analyzeSnp.awk inputData inputData
Alternatively, you can enclose it in single quotes and paste it directly into the console like this:
$ awk ' #!/usr/bin/awk -f
# goes through the whole file, saves boundaries of the intervals
NR > 1 && NR == FNR {
starts[$1] = $3
ends[$1] = $4
#print "Scanned interval: "$1" = ["starts[$1]","ends[$1]"]"
}
# prints the header of the table
NR != FNR && FNR == 1 {
print $0
}
# annotates each line with "yes"/"no"
FNR > 1 && NR != FNR {
print $0" "(starts[$5] <= $6 && $6 <= ends[$5] ? "yes" : "no")
}' loci.txt loci.txt
Important: you have to specify your input file twice, because we need a first scan to build a table with intervals, and then a second scan for actual annotation.
Here is the produced output:
locus match start end subset pos region
uce-3280 uce-3280_p1 269 388 uce-3280 222 no
uce-6620 uce-6620_p1 297 416 uce-6620 198 no
uce-6620 uce-6620_p1 297 416 uce-6620 300 yes
uce-432 uce-432_p2 328 447 uce-432 205 no
uce-432 uce-432_p1 268 387 uce-6459 207 no
uce-6459 uce-6459_p1 210 329 uce-6459 275 yes
uce-6459 uce-6459_p1 210 329 uce-6459 288 yes
uce-6459 uce-6459_p1 210 329 uce-374 373 no
uce-374 uce-374_p2 509 628 uce-3393 327 no
uce-374 uce-374_p1 449 568 uce-3393 416 no
uce-3393 uce-3393_p1 439 558 uce-3393 712 no
uce-3393 uce-3393_p1 439 558 uce-1200 416 yes
uce-3393 uce-3393_p1 439 558 uce-805 397 yes
uce-1200 uce-1200_p3 341 460 uce-627 326 no
uce-805 uce-805_p1 333 452 uce-2299 340 no
uce-627 uce-627_p1 396 515 uce-2126 481 no
uce-2299 uce-2299_p1 388 507 uce-5427 562 yes
uce-2126 uce-2126_p1 323 437 uce-5427 711 no
uce-5427 uce-5427_p1 509 628 uce-5893 242 no
uce-5427 uce-5427_p1 509 628 uce-5893 330 no
uce-5893 uce-5893_p1 477 582 uce-5893 398 no
This looks pretty close to your example snippet.
A general remark: AWK is a surprisingly modern and powerful scripting language (at least for its age: this thing is almost 40 years old!). If you want to use AWK, you are not constrained to cryptic one-liners.
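For readers who prefer Python, the same two-pass idea (first collect the interval of every locus, then annotate each row) can be sketched as below. This is an illustrative equivalent rather than part of the awk answer: the function name annotate is hypothetical, the sample holds only four rows from the input, and a subset with no matching locus is reported as "no" by assumption.

```python
def annotate(lines):
    """Append "yes"/"no" to each data row, depending on whether column 6
    lies inside the [start, end] interval of the locus named in column 5."""
    header = lines[0]
    rows = [line.split() for line in lines[1:] if line.strip()]

    # First pass: remember the interval of each locus. A later row for the
    # same locus overwrites an earlier one, as in the awk script.
    intervals = {}
    for row in rows:
        intervals[row[0]] = (int(row[2]), int(row[3]))

    # Second pass: look up the interval of the subset and test the position.
    annotated = [header]
    for row in rows:
        bounds = intervals.get(row[4])
        inside = bounds is not None and bounds[0] <= int(row[5]) <= bounds[1]
        annotated.append(" ".join(row) + " " + ("yes" if inside else "no"))
    return annotated


sample = [
    "locus match start end subset pos region",
    "uce-6620 uce-6620_p1 297 416 uce-6620 198",
    "uce-6620 uce-6620_p1 297 416 uce-6620 300",
    "uce-432 uce-432_p1 268 387 uce-6459 207",
    "uce-6459 uce-6459_p1 210 329 uce-6459 275",
]
result = annotate(sample)
print("\n".join(result))
```

Because the whole input is held in a list, it only needs to be read once here, whereas the awk version is given the file name twice.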

Use AWK on multi FASTA file to add new column based on contig header

I have a multi-FASTA file that needs to be parsed so that the Glimmer multi-extract script can process it. It is composed of many contigs, each with its own header that starts with ">". What I need is to add each header as a new column; the problem is that I don't know very much about Linux bash, or awk for that matter.
>contig-7
orf00002 1741 461
orf00003 3381 1747
>Wcontig-7000023
>Wcontig-11112
orf00001 426 2648
orf00002 2710 4581
orf00003 4569 5480
orf00004 6990 6133
orf00006 9180 7108
orf00007 10201 9209
orf00008 11663 10203
orf00009 12489 11680
orf00010 13153 12473
orf00011 14382 13225
orf00013 14715 15968
orf00014 19868 16410
>Wcontig-1674000002
orf00001 2995 637
orf00002 2497 1166
orf00003 2984 2529
I need to have each contig header added as a first column along with a tab delimiter.
>contig-7
>contig-7 orf00002 1741 461
>contig-7 orf00003 3381 1747
>Wcontig-7000023
>Wcontig-11112
>Wcontig-11112 orf00001 426 2648
>Wcontig-11112 orf00002 2710 4581
>Wcontig-11112 orf00003 4569 5480
>Wcontig-11112 orf00004 6990 6133
>Wcontig-11112 orf00006 9180 7108
>Wcontig-11112 orf00007 10201 9209
>Wcontig-11112 orf00008 11663 10203
>Wcontig-11112 orf00009 12489 11680
>Wcontig-11112 orf00010 13153 12473
>Wcontig-11112 orf00011 14382 13225
>Wcontig-11112 orf00013 14715 15968
>Wcontig-11112 orf00014 19868 16410
>Wcontig-1674000002
>Wcontig-1674000002 orf00001 2995 637
>Wcontig-1674000002 orf00002 2497 1166
>Wcontig-1674000002 orf00003 2984 2529
Also, after adding the new column I have to erase all the headers, so it would end up looking like this
>contig-7 orf00002 1741 461
>contig-7 orf00003 3381 1747
>Wcontig-11112 orf00001 426 2648
>Wcontig-11112 orf00002 2710 4581
>Wcontig-11112 orf00003 4569 5480
>Wcontig-11112 orf00004 6990 6133
>Wcontig-11112 orf00006 9180 7108
>Wcontig-11112 orf00007 10201 9209
>Wcontig-11112 orf00008 11663 10203
>Wcontig-11112 orf00009 12489 11680
>Wcontig-11112 orf00010 13153 12473
>Wcontig-11112 orf00011 14382 13225
>Wcontig-11112 orf00013 14715 15968
>Wcontig-11112 orf00014 19868 16410
>Wcontig-1674000002 orf00001 2995 637
>Wcontig-1674000002 orf00002 2497 1166
>Wcontig-1674000002 orf00003 2984 2529

Awk can be really handy to solve this problem:
awk '{if($1 ~ /contig/){c=$1}else{print c"\t"$0}}' <yourfile>
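The same logic can also be sketched in Python, in case awk is not an option. This is a hedged equivalent, not the answer's own method: the function name prefix_with_header is hypothetical, and it treats any line starting with ">" as a header, whereas the awk one-liner matches the word "contig" in the first field.

```python
def prefix_with_header(lines):
    """Drop the header lines and prepend the most recent header, followed by
    a tab, to every data line beneath it."""
    current = None
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(">"):
            current = line.split()[0]  # remember the header, do not print it
        elif current is None:
            raise ValueError("data line found before any header: " + line)
        else:
            out.append(current + "\t" + line)
    return out


sample = [
    ">contig-7",
    "orf00002 1741 461",
    "orf00003 3381 1747",
    ">Wcontig-7000023",
    ">Wcontig-11112",
    "orf00001 426 2648",
]
result = prefix_with_header(sample)
print("\n".join(result))
```

Headers with no data lines under them (such as >Wcontig-7000023 in the sample) simply vanish from the output, which matches the final layout requested in the question.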
