Filtering file according to the highest value in a column of each line - awk

I have the following file:
gene.100079.0.5.p3 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 84.9
gene.100079.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 84.9
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 86.7
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
chr11_pilon3.g3568.t1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 74.9
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 76.7
The above file has some IDs which are similar:
gene.100079.0.5.p3
gene.100079.0.3.p1
gene.100079.0.0.p1
When trimmed to gene.100079, the IDs become identical. I would like to filter the above file in the following way:
chr11_pilon3.g3568.t1 = 74.9. IDs starting with chr are excluded from the comparison and go straight to the output.
gene.100079.0.0.p1 = 86.7, while gene.100079.0.5.p3 = 84.9 and gene.100079.0.3.p1 = 84.9. gene.100079.0.0.p1 has the highest value, so only it should be in the output.
gene.100080.0.3.p1 = 99.9 and gene.100080.0.0.p1 = 99.9. Both IDs have the same value, so both should be in the output.
However, this awk script from @RavinderSingh13 and @anubhava returns the wrong results.
awk '{
if (/^gene\./) {
split($1, a, /\./)
k = a[1] "." a[2]
}
else
k = $1
}
!(k in max) || $13 >= max[k] {
if(!(k in max))
ord[++n] = k
else if (max[k] == $13) {
print
next
}
max[k] = $13
rec[k] = $0
}
END {
for (i=1; i<=n; ++i)
print rec[ord[i]]
}' file
Wrong output with the above script:
chr11_pilon3.g3568.t1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 74.9
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 76.7
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 86.7
gene.100079.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 84.9
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
As output I would like to get:
chr11_pilon3.g3568.t1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 74.9
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 76.7
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 86.7
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
I also tried to fix it as shown below, but it didn't work:
awk '{
if (/^gene\./) {
split($1, a, /\./)
k = a[1] "." a[2]
}
else
k = $1
}
!(k in max) || $13 > max[k] {
max[k]=$13;
line[k]=$0
}
END {
for(i in line)
print line[i]
}' file
Thank you in advance,

This seems to work correctly, assuming that the data is ordered so that all the lines with the same first two name components are grouped together in the data file. The order of those lines within the group doesn't matter.
As revised, the question now wants lines starting chr transferred to the output without any filtering. That is easily achieved — the rule matching /^chr/ provides that functionality.
#!/bin/sh
awk '
function dump_memo()
{
if (memo_num > 0)
{
for (i = 0; i < memo_num; i++)
print memo_line[i]
}
}
/^chr/ { print; next } # Pass lines starting with chr straight through, without filtering
{
split($1, a, ".")
key = a[1] "." a[2]
val = $NF
# print "# " key " = " val " (memo_key = " memo_key ", memo_val = " memo_val ")"
if (memo_key == key)
{
if (memo_val == val)
{
memo_line[memo_num++] = $0
}
else if (memo_val < val)
{
memo_val = val
memo_num = 0
memo_line[memo_num++] = $0
}
}
else
{
dump_memo()
memo_num = 0
memo_line[memo_num++] = $0
memo_key = key
memo_val = val
}
}
END { dump_memo() }' "$@"
When run on the data file shown in the question, the original output from the unrevised script was:
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 86.7
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 76.7
The main difference between this and what was requested is the sort order. If you need the data in sorted order, pipe the output of the script through sort.
With the revised script (with the /^chr/ rule), the output is:
gene.100079.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 86.7
chr11_pilon3.g3568.t1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 74.9
chr11_pilon3.g3568.t2 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 76.7
gene.100080.0.3.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
gene.100080.0.0.p1 transcript:OIS96097 82.2 169 30 0 1 169 4 172 1.3e-75 283.1 99.9
Again, if you want the data in some specific order, apply a sort to the output.
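For comparison, here is a minimal Python sketch of the same filtering rule. It is not the awk script above but a hypothetical alternative: the function name filter_best and the shortened three-column sample lines are made up for illustration. It assumes the score is the last whitespace-separated field and that it is acceptable to hold all groups in memory, which means the input does not have to be grouped by key.

```python
# Sketch: for each "gene.<number>" group keep every line whose last field
# equals the group's maximum; lines starting with "chr" pass through as-is.
# Assumption: all groups fit in memory, so the input need not be grouped.

def filter_best(lines):
    passthrough = []   # chr lines, in input order
    best = {}          # key -> (max value, [lines that have that value])
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("chr"):
            passthrough.append(line)
            continue
        key = ".".join(fields[0].split(".")[:2])   # e.g. "gene.100079"
        val = float(fields[-1])
        if key not in best or val > best[key][0]:
            best[key] = (val, [line])              # new maximum: restart the list
        elif val == best[key][0]:
            best[key][1].append(line)              # tie: keep this line too
    out = list(passthrough)
    for _val, kept in best.values():               # dicts keep insertion order
        out.extend(kept)
    return out

# Sample lines shortened to three columns (ID, target, score) for brevity.
SAMPLE = [
    "gene.100079.0.5.p3 transcript:OIS96097 84.9",
    "gene.100079.0.3.p1 transcript:OIS96097 84.9",
    "gene.100079.0.0.p1 transcript:OIS96097 86.7",
    "gene.100080.0.3.p1 transcript:OIS96097 99.9",
    "gene.100080.0.0.p1 transcript:OIS96097 99.9",
    "chr11_pilon3.g3568.t1 transcript:OIS96097 74.9",
    "chr11_pilon3.g3568.t2 transcript:OIS96097 76.7",
]

for row in filter_best(SAMPLE):
    print(row)
```

Because the chr lines are collected first and the groups are emitted in order of first appearance, this sketch happens to print the rows in the order requested in the question, without a separate sort.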

Pyspark : how to compute the percentage with condition in dataframe

How can I compute the percentage of performance values (P) that fall in each of these ranges: P<=5; P>5 & P<=15; P>15?
address   performance (P)
NACELLES  589
NACELLES  0
NACELLES  48
NACELLES  318
NACELLES  378
NACELLES  52
NACELLES  45
NACELLES  201
NACELLES  416
NACELLES  29
NACELLES  183
NACELLES  53
NACELLES  7
NACELLES  127
NACELLES  157
NACELLES  248
NACELLES  10
NACELLES  317
NACELLES  2
NACELLES  4
We obtain this dataset:
address   P<=5  P>5 & P<=15  P>15
NACELLES  15 %  10 %         75 %
Using your dataframe as an example:
+--------+-----------+
| address|performance|
+--------+-----------+
|NACELLES|        589|
|NACELLES|          0|
|NACELLES|         48|
|NACELLES|        318|
You simply have to aggregate and sum using the when function:
from pyspark.sql import functions as F
# For each range, count the matching rows (when(...) yields 1 or null) and divide by the group size
df.groupBy("address").agg(
    (F.sum(F.when(F.col("performance") <= 5, 1)) / F.count("*")).alias("P<=5"),
    (
        F.sum(F.when((F.col("performance") > 5) & (F.col("performance") <= 15), 1))
        / F.count("*")
    ).alias("P>5 & P<=15"),
    (F.sum(F.when(F.col("performance") > 15, 1)) / F.count("*")).alias("P>15"),
).show()
+--------+----+-----------+----+
| address|P<=5|P>5 & P<=15|P>15|
+--------+----+-----------+----+
|NACELLES|0.15|        0.1|0.75|
+--------+----+-----------+----+
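As a sanity check on the arithmetic, the same three shares can be reproduced without Spark. This is only a plain-Python sketch on the 20 values from the question; the function name bucket_shares is made up for illustration and is not part of any library.

```python
# The 20 performance values listed in the question (all for address NACELLES).
performance = [589, 0, 48, 318, 378, 52, 45, 201, 416, 29,
               183, 53, 7, 127, 157, 248, 10, 317, 2, 4]

def bucket_shares(values):
    """Return the fraction of values falling in each of the three ranges."""
    n = len(values)
    return {
        "P<=5": sum(1 for p in values if p <= 5) / n,
        "P>5 & P<=15": sum(1 for p in values if 5 < p <= 15) / n,
        "P>15": sum(1 for p in values if p > 15) / n,
    }

shares = bucket_shares(performance)
print(shares)  # {'P<=5': 0.15, 'P>5 & P<=15': 0.1, 'P>15': 0.75}
```

The results are fractions, as in the Spark output above; multiply by 100 to get the percentages shown in the question.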

Paraview crash with Nvidia omniverse connector, possible vtk, qt5, libc error?

I'm trying to install the ParaView connector for Nvidia Omniverse. Once I load the plugin, ParaView crashes with the following error. It looks like an issue with libc.so or Qt5, but I do not know what exactly the problem is or how to deal with it.
The paraview install works fine without omniverse connector.
AutoMPI: SUCCESS: command is:
"/usr/local/bin/mpiexec" "-n" "6" "/usr/local/bin/pvserver" "--server-port=36395"
AutoMPI: starting process server
-------------- server output --------------
Waiting for client...
AutoMPI: server successfully started.
free(): invalid pointer
Loguru caught a signal: SIGABRT
Stack trace:
72 0x561c3c15e1fe paraview(+0x91fe) [0x561c3c15e1fe]
71 0x7f4381b3d0b3 __libc_start_main + 243
70 0x561c3c15e007 paraview(+0x9007) [0x561c3c15e007]
69 0x7f43801d8116 QCoreApplication::exec() + 150
68 0x7f43801d03ab QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) + 299
67 0x7f4380229435 QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) + 101
66 0x7f4379fb44a3 g_main_context_iteration + 51
65 0x7f4379fb4400 /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0(+0x52400) [0x7f4379fb4400]
64 0x7f4379fb417d g_main_context_dispatch + 637
63 0x7f436e91a32e /usr/lib/x86_64-linux-gnu/libQt5XcbQpa.so.5(+0x7932e) [0x7f436e91a32e]
62 0x7f438059635b QWindowSystemInterface::sendWindowSystemEvents(QFlags<QEventLoop::ProcessEventsFlag>) + 187
61 0x7f43805bc10b QGuiApplicationPrivate::processWindowSystemEvent(QWindowSystemInterfacePrivate::WindowSystemEvent*) + 603
60 0x7f43805ba7d3 QGuiApplicationPrivate::processMouseEvent(QWindowSystemInterfacePrivate::MouseEvent*) + 1763
59 0x7f43801d180a QCoreApplication::notifyInternal2(QObject*, QEvent*) + 394
58 0x7f438131e0f0 QApplication::notify(QObject*, QEvent*) + 816
57 0x7f4381314a66 QApplicationPrivate::notify_helper(QObject*, QEvent*) + 134
56 0x7f43813761ec /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5(+0x1cc1ec) [0x7f43813761ec]
55 0x7f4381373ce4 /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5(+0x1c9ce4) [0x7f4381373ce4]
54 0x7f438131d457 QApplicationPrivate::sendMouseEvent(QWidget*, QMouseEvent*, QWidget*, QWidget*, QWidget**, QPointer<QWidget>&, bool, bool) + 439
53 0x7f43801d180a QCoreApplication::notifyInternal2(QObject*, QEvent*) + 394
52 0x7f438131e343 QApplication::notify(QObject*, QEvent*) + 1411
51 0x7f4381314a66 QApplicationPrivate::notify_helper(QObject*, QEvent*) + 134
50 0x7f43814a1adb QMenu::event(QEvent*) + 347
49 0x7f43813572b6 QWidget::event(QEvent*) + 646
48 0x7f438149f4d2 QMenu::mouseReleaseEvent(QMouseEvent*) + 626
47 0x7f438149e4ae /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5(+0x2f44ae) [0x7f438149e4ae]
46 0x7f4381496d12 /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5(+0x2ecd12) [0x7f4381496d12]
45 0x7f4381310aa2 QAction::activate(QAction::ActionEvent) + 242
44 0x7f438130e3e6 QAction::triggered(bool) + 70
43 0x7f43801fd1d0 QMetaObject::activate(QObject*, int, int, void**) + 2000
42 0x7f43819a9f0d pqManagePluginsReaction::managePlugins() + 221
41 0x7f4381518c6d QDialog::exec() + 461
40 0x7f43801d03ab QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) + 299
39 0x7f4380229435 QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) + 101
38 0x7f4379fb44a3 g_main_context_iteration + 51
37 0x7f4379fb4400 /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0(+0x52400) [0x7f4379fb4400]
36 0x7f4379fb417d g_main_context_dispatch + 637
35 0x7f436e91a32e /usr/lib/x86_64-linux-gnu/libQt5XcbQpa.so.5(+0x7932e) [0x7f436e91a32e]
34 0x7f438059635b QWindowSystemInterface::sendWindowSystemEvents(QFlags<QEventLoop::ProcessEventsFlag>) + 187
33 0x7f43805bc10b QGuiApplicationPrivate::processWindowSystemEvent(QWindowSystemInterfacePrivate::WindowSystemEvent*) + 603
32 0x7f43805ba7d3 QGuiApplicationPrivate::processMouseEvent(QWindowSystemInterfacePrivate::MouseEvent*) + 1763
31 0x7f43801d180a QCoreApplication::notifyInternal2(QObject*, QEvent*) + 394
30 0x7f438131e0f0 QApplication::notify(QObject*, QEvent*) + 816
29 0x7f4381314a66 QApplicationPrivate::notify_helper(QObject*, QEvent*) + 134
28 0x7f43813761ec /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5(+0x1cc1ec) [0x7f43813761ec]
27 0x7f438137335d /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5(+0x1c935d) [0x7f438137335d]
26 0x7f438131d457 QApplicationPrivate::sendMouseEvent(QWidget*, QMouseEvent*, QWidget*, QWidget*, QWidget**, QPointer<QWidget>&, bool, bool) + 439
25 0x7f43801d180a QCoreApplication::notifyInternal2(QObject*, QEvent*) + 394
24 0x7f438131e343 QApplication::notify(QObject*, QEvent*) + 1411
23 0x7f4381314a66 QApplicationPrivate::notify_helper(QObject*, QEvent*) + 134
22 0x7f43813572b6 QWidget::event(QEvent*) + 646
21 0x7f438140b035 QAbstractButton::mouseReleaseEvent(QMouseEvent*) + 229
20 0x7f438140ae73 /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5(+0x260e73) [0x7f438140ae73]
19 0x7f4381409a2e /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5(+0x25fa2e) [0x7f4381409a2e]
18 0x7f4381409806 QAbstractButton::clicked(bool) + 70
17 0x7f43801fd1d0 QMetaObject::activate(QObject*, int, int, void**) + 2000
16 0x7f4380f8ff37 pqPluginDialog::loadPlugin(pqServer*, bool) + 2919
15 0x7f4380f8cc62 pqPluginDialog::loadPlugin(pqServer*, QString const&, bool) + 82
14 0x7f4380c63229 pqPluginManager::loadExtension(pqServer*, QString const&, QString*, bool) + 169
13 0x7f437fd4b5a6 vtkSMPluginManager::LoadLocalPlugin(char const*) + 150
12 0x7f437f09e083 vtkPVPluginLoader::LoadPluginInternal(char const*, bool) + 4227
11 0x7f437f09c474 vtkPVPluginLoader::LoadPluginInternal(vtkPVPlugin*) + 948
10 0x7f437f09b16c vtkPVPlugin::ImportPlugin(vtkPVPlugin*) + 1436
9 0x7f437f0a7b1f vtkPVPluginTracker::RegisterPlugin(vtkPVPlugin*) + 1551
8 0x7f437d2be644 /usr/local/lib/libvtkCommonCore-pv5.9.so.1(+0x440644) [0x7f437d2be644]
7 0x7f437fccc7a1 vtkObject::vtkClassMemberCallback<vtkSIProxyDefinitionManager>::operator()(vtkObject*, unsigned long, void*) + 113
6 0x7f437fcc5365 vtkSIProxyDefinitionManager::HandlePlugin(vtkPVPlugin*) + 469
5 0x7f4381bafcac /lib/x86_64-linux-gnu/libc.so.6(+0x99cac) [0x7f4381bafcac]
4 0x7f4381bae47c /lib/x86_64-linux-gnu/libc.so.6(+0x9847c) [0x7f4381bae47c]
3 0x7f4381ba63ee /lib/x86_64-linux-gnu/libc.so.6(+0x903ee) [0x7f4381ba63ee]
2 0x7f4381b3b859 abort + 299
1 0x7f4381b5c18b gsignal + 203
0 0x7f4381b5c210 /lib/x86_64-linux-gnu/libc.so.6(+0x46210) [0x7f4381b5c210]
( 36.129s) [paraview ] :0 FATL| Signal: SIGABRT
Aborted (core dumped)

H264 encoding and decoding using Videotoolbox

I was testing encoding and decoding with VideoToolbox: converting the captured frames to H264 and using that data to display them in an AVSampleBufferDisplayLayer.
During decompression, CMVideoFormatDescriptionCreateFromH264ParameterSets fails with error code -12712.
I followed this code from mobisoftinfotech.com:
status = CMVideoFormatDescriptionCreateFromH264ParameterSets(
    kCFAllocatorDefault, 2,
    (const uint8_t *const *)parameterSetPointers,
    parameterSetSizes, 4, &_formatDesc);
The test project is videoCompressionTest. Can anyone figure out the problem?
I am not sure whether you have figured out the problem yet. However, I found two places in your code that lead to the error. After fixing them and running your test app locally, it seems to work fine. (Tested with Xcode 9.4.1, macOS 10.13.)
The first one is in the -(void)CompressAndConvertToData:(CMSampleBufferRef)sampleBuffer method, where the while loop should look like this:
while (bufferOffset < blockBufferLength - AVCCHeaderLength) {
// Read the NAL unit length
uint32_t NALUnitLength = 0;
memcpy(&NALUnitLength, bufferDataPointer + bufferOffset, AVCCHeaderLength);
// Convert the length value from Big-endian to Little-endian
NALUnitLength = CFSwapInt32BigToHost(NALUnitLength);
// Write start code to the elementary stream
[elementaryStream appendBytes:startCode length:startCodeLength];
// Write the NAL unit without the AVCC length header to the elementary stream
[elementaryStream appendBytes:bufferDataPointer + bufferOffset + AVCCHeaderLength
length:NALUnitLength];
// Move to the next NAL unit in the block buffer
bufferOffset += AVCCHeaderLength + NALUnitLength;
}
uint8_t *bytes = (uint8_t*)[elementaryStream bytes];
int size = (int)[elementaryStream length];
[self receivedRawVideoFrame:bytes withSize:size];
The second place is in the decompression code where you process NALU type 8, that is, the block of code inside the if(nalu_type == 8) statement. This is a tricky one.
To fix it, change
for (int i = _spsSize + 12; i < _spsSize + 50; i++)
to
for (int i = _spsSize + 12; i < _spsSize + 12 + 50; i++)
You are then free to remove this hack:
//was crashing here
if(_ppsSize == 0)
_ppsSize = 4;
Why? Let's print out the frame packet format.
po frame
▿ 4282 elements
- 0 : 0
- 1 : 0
- 2 : 0
- 3 : 1
- 4 : 39
- 5 : 100
- 6 : 0
- 7 : 30
- 8 : 172
- 9 : 86
- 10 : 193
- 11 : 112
- 12 : 247
- 13 : 151
- 14 : 64
- 15 : 0
- 16 : 0
- 17 : 0
- 18 : 1
- 19 : 40
- 20 : 238
- 21 : 60
- 22 : 176
- 23 : 0
- 24 : 0
- 25 : 0
- 26 : 1
- 27 : 6
- 28 : 5
- 29 : 35
- 30 : 71
- 31 : 86
- 32 : 74
- 33 : 220
- 34 : 92
- 35 : 76
- 36 : 67
- 37 : 63
- 38 : 148
- 39 : 239
- 40 : 197
- 41 : 17
- 42 : 60
- 43 : 209
- 44 : 67
- 45 : 168
- 46 : 0
- 47 : 0
- 48 : 3
- 49 : 0
- 50 : 0
- 51 : 3
- 52 : 0
- 53 : 2
- 54 : 143
- 55 : 92
- 56 : 40
- 57 : 1
- 58 : 221
- 59 : 204
- 60 : 204
- 61 : 221
- 62 : 2
- 63 : 0
- 64 : 76
- 65 : 75
- 66 : 64
- 67 : 128
- 68 : 0
- 69 : 0
- 70 : 0
- 71 : 1
- 72 : 37
- 73 : 184
- 74 : 32
- 75 : 1
- 76 : 223
- 77 : 205
- 78 : 248
- 79 : 30
- 80 : 231
… more
The start code that follows the first NALU (nalu_type == 7) is 0, 0, 0, 1 at indices 15 to 18. The next 0, 0, 0, 1 (indices 23 to 26) is type 6, and the type 8 NALU start code is at indices 68 to 71. That is why I modified the for loop to scan from the start index (_spsSize + 12) over a range of 50.
I haven't fully tested your code to make sure that encoding and decoding work as expected. However, I hope this finding helps.
By the way, if I have misunderstood anything, I would love to learn from your comments.
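To make the index arithmetic easier to check, here is a small Python sketch that scans the first 81 bytes of the dump above for 4-byte start codes. It is an illustration only, not the Objective-C code under discussion: the helper name find_start_codes is made up, and the value sps_size = 15 is an assumption taken from the position of the second start code described above.

```python
# First 81 bytes of the frame, copied from the `po frame` dump above.
FRAME = bytes([
    0, 0, 0, 1, 39, 100, 0, 30, 172, 86, 193, 112, 247, 151, 64,
    0, 0, 0, 1, 40, 238, 60, 176,
    0, 0, 0, 1, 6, 5, 35, 71, 86, 74, 220, 92, 76, 67, 63, 148, 239,
    197, 17, 60, 209, 67, 168,
    0, 0, 3, 0, 0, 3, 0, 2, 143, 92, 40, 1, 221, 204, 204, 221, 2, 0,
    76, 75, 64, 128,
    0, 0, 0, 1, 37, 184, 32, 1, 223, 205, 248, 30, 231,
])

START_CODE = b"\x00\x00\x00\x01"

def find_start_codes(data, lo=0, hi=None):
    """Return every index i in [lo, hi) where a 4-byte start code begins."""
    hi = len(data) if hi is None else hi
    return [i for i in range(lo, min(hi, len(data) - 3))
            if data[i:i + 4] == START_CODE]

# All start codes in the dumped bytes.
print(find_start_codes(FRAME))  # [0, 15, 23, 68]

# Assumption: _spsSize is 15 here (the index of the second start code).
sps_size = 15
# Original loop bounds: i from _spsSize + 12 up to _spsSize + 50.
print(find_start_codes(FRAME, sps_size + 12, sps_size + 50))       # []
# Widened loop bounds: i from _spsSize + 12 up to _spsSize + 12 + 50.
print(find_start_codes(FRAME, sps_size + 12, sps_size + 12 + 50))  # [68]
```

Under that assumption, the original window stops at index 64 and never reaches the start code at 68, while the widened window does.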

SQL query is not working (Error in rsqlite_send_query)

This is what the head of my data frame looks like
> head(d19_1)
SMZ SIZ1_diff SIZ1_base SIZ2_diff SIZ2_base SIZ3_diff SIZ3_base SIZ4_diff SIZ4_base SIZ5_diff SIZ5_base
1 1 -620 4170 -189 1347 -35 2040 82 1437 244 1533
2 2 -219 831 -57 255 -4 392 8 282 14 297
3 3 -426 834 -162 294 -134 379 -81 241 -22 221
4 4 -481 676 -142 216 -114 267 -50 158 -43 166
5 5 -233 1711 -109 584 54 913 71 624 74 707
6 6 -322 1539 -79 512 -50 799 23 532 63 576
Total_og Total_base %_SIZ1 %_SIZ2 %_SIZ3 %_SIZ4 %_SIZ5 Total_og Total_base
1 11980 12648 14.86811 14.03118 1.715686 5.706333 15.916504 11980 12648
2 2156 2415 26.35379 22.35294 1.020408 2.836879 4.713805 2156 2415
3 1367 2314 51.07914 55.10204 35.356201 33.609959 9.954751 1367 2314
4 790 1736 71.15385 65.74074 42.696629 31.645570 25.903614 790 1736
5 5339 5496 13.61777 18.66438 5.914567 11.378205 10.466761 5339 5496
6 4362 4747 20.92268 15.42969 6.257822 4.323308 10.937500 4362 4747
The structure of the data frame, as reported by str(d19_1), is shown below:
> str(d19_1)
'data.frame': 1588 obs. of 20 variables:
$ SMZ : int 1 2 3 4 5 6 7 8 9 10 ...
$ SIZ1_diff : int -620 -219 -426 -481 -233 -322 -176 -112 -34 -103 ...
$ SIZ1_base : int 4170 831 834 676 1711 1539 720 1396 998 1392 ...
$ SIZ2_diff : int -189 -57 -162 -142 -109 -79 -12 72 -36 -33 ...
$ SIZ2_base : int 1347 255 294 216 584 512 196 437 343 479 ...
$ SIZ3_diff : int -35 -4 -134 -114 54 -50 16 4 26 83 ...
$ SIZ3_base : int 2040 392 379 267 913 799 361 804 566 725 ...
$ SIZ4_diff : int 82 8 -81 -50 71 23 36 127 46 75 ...
$ SIZ4_base : int 1437 282 241 158 624 532 242 471 363 509 ...
$ SIZ5_diff : int 244 14 -22 -43 74 63 11 143 79 125 ...
$ SIZ5_base : int 1533 297 221 166 707 576 263 582 429 536 ...
$ Total_og : int 11980 2156 1367 790 5339 4362 2027 4715 3465 4561 ...
$ Total_base: int 12648 2415 2314 1736 5496 4747 2168 4464 3278 4375 ...
$ %_SIZ1 : num 14.9 26.4 51.1 71.2 13.6 ...
$ %_SIZ2 : num 14 22.4 55.1 65.7 18.7 ...
$ %_SIZ3 : num 1.72 1.02 35.36 42.7 5.91 ...
$ %_SIZ4 : num 5.71 2.84 33.61 31.65 11.38 ...
$ %_SIZ5 : num 15.92 4.71 9.95 25.9 10.47 ...
$ Total_og : int 11980 2156 1367 790 5339 4362 2027 4715 3465 4561 ...
$ Total_base: int 12648 2415 2314 1736 5496 4747 2168 4464 3278 4375 ...
When I run the query below, it returns the error shown after it, and I don't know why. I don't have any column named <NA> in the table.
Query
d20_1 <- sqldf('SELECT *, CASE
WHEN SMZ BETWEEN 1 AND 110 THEN "Baltimore City"
WHEN SMZ BETWEEN 111 AND 217 THEN "Anne Arundel County"
WHEN SMZ BETWEEN 218 AND 405 THEN "Baltimore County"
WHEN SMZ BETWEEN 406 AND 453 THEN "Carroll County"
WHEN SMZ BETWEEN 454 AND 524 THEN "Harford County"
WHEN SMZ BETWEEN 1667 AND 1674 THEN "York County"
ELSE 0
END Jurisdiction
FROM d19_1')
Error:
Error in rsqlite_send_query(conn@ptr, statement) :
table d19_1 has no column named <NA>
Your code works correctly for me:
d19_1 <- structure(list(SMZ = 1:6, SIZ1_diff = c(-620L, -219L, -426L,
-481L, -233L, -322L), SIZ1_base = c(4170L, 831L, 834L, 676L,
1711L, 1539L), SIZ2_diff = c(-189L, -57L, -162L, -142L, -109L,
-79L), SIZ2_base = c(1347L, 255L, 294L, 216L, 584L, 512L), SIZ3_diff = c(-35L,
-4L, -134L, -114L, 54L, -50L), SIZ3_base = c(2040L, 392L, 379L,
267L, 913L, 799L), SIZ4_diff = c(82L, 8L, -81L, -50L, 71L, 23L
), SIZ4_base = c(1437L, 282L, 241L, 158L, 624L, 532L), SIZ5_diff = c(244L,
14L, -22L, -43L, 74L, 63L), SIZ5_base = c(1533L, 297L, 221L,
166L, 707L, 576L), Total_og = c(11980L, 2156L, 1367L, 790L, 5339L,
4362L), Total_base = c(12648L, 2415L, 2314L, 1736L, 5496L, 4747L
), X._SIZ1 = c(14.86811, 26.35379, 51.07914, 71.15385, 13.61777,
20.92268), X._SIZ2 = c(14.03118, 22.35294, 55.10204, 65.74074,
18.66438, 15.42969), X._SIZ3 = c(1.715686, 1.020408, 35.356201,
42.696629, 5.914567, 6.257822), X._SIZ4 = c(5.706333, 2.836879,
33.609959, 31.64557, 11.378205, 4.323308), X._SIZ5 = c(15.916504,
4.713805, 9.954751, 25.903614, 10.466761, 10.9375), Total_og.1 = c(11980L,
2156L, 1367L, 790L, 5339L, 4362L), Total_base.1 = c(12648L, 2415L,
2314L, 1736L, 5496L, 4747L)), .Names = c("SMZ", "SIZ1_diff",
"SIZ1_base", "SIZ2_diff", "SIZ2_base", "SIZ3_diff", "SIZ3_base",
"SIZ4_diff", "SIZ4_base", "SIZ5_diff", "SIZ5_base", "Total_og",
"Total_base", "X._SIZ1", "X._SIZ2", "X._SIZ3", "X._SIZ4", "X._SIZ5",
"Total_og.1", "Total_base.1"), row.names = c(NA, -6L), class = "data.frame")
library(sqldf)
sqldf('SELECT *, CASE
WHEN SMZ BETWEEN 1 AND 110 THEN "Baltimore City"
WHEN SMZ BETWEEN 111 AND 217 THEN "Anne Arundel County"
WHEN SMZ BETWEEN 218 AND 405 THEN "Baltimore County"
WHEN SMZ BETWEEN 406 AND 453 THEN "Carroll County"
WHEN SMZ BETWEEN 454 AND 524 THEN "Harford County"
WHEN SMZ BETWEEN 1667 AND 1674 THEN "York County"
ELSE 0
END Jurisdiction
FROM d19_1')

identifying strings within intervals, pt 2

For each row, I would like to know if the numerical string in the 6th column resides within the start and end intervals of the 3rd and 4th column. The issue for me is that identical strings in the 1st and 5th column are not always in the same row (e.g., uce-6459 is on the same line as uce-432).
Input
locus match start end subset pos region
uce-3280 uce-3280_p1 269 388 uce-3280 222
uce-6620 uce-6620_p1 297 416 uce-6620 198
uce-6620 uce-6620_p1 297 416 uce-6620 300
uce-432 uce-432_p2 328 447 uce-432 205
uce-432 uce-432_p1 268 387 uce-6459 207
uce-6459 uce-6459_p1 210 329 uce-6459 275
uce-6459 uce-6459_p1 210 329 uce-6459 288
uce-6459 uce-6459_p1 210 329 uce-374 373
uce-374 uce-374_p2 509 628 uce-3393 327
uce-374 uce-374_p1 449 568 uce-3393 416
uce-3393 uce-3393_p1 439 558 uce-3393 712
uce-3393 uce-3393_p1 439 558 uce-1200 416
uce-3393 uce-3393_p1 439 558 uce-805 397
uce-1200 uce-1200_p3 341 460 uce-627 326
uce-805 uce-805_p1 333 452 uce-2299 340
uce-627 uce-627_p1 396 515 uce-2126 481
uce-2299 uce-2299_p1 388 507 uce-5427 562
uce-2126 uce-2126_p1 323 437 uce-5427 711
uce-5427 uce-5427_p1 509 628 uce-5893 242
uce-5427 uce-5427_p1 509 628 uce-5893 330
uce-5893 uce-5893_p1 477 582 uce-5893 398
Desired output
locus match start end subset pos region
uce-3280 uce-3280_p1 269 388 uce-3280 222 no
uce-6620 uce-6620_p1 297 416 uce-6620 198 no
uce-6620 uce-6620_p1 297 416 uce-6620 300 yes
uce-432 uce-432_p2 328 447 uce-432 205 no
uce-432 uce-432_p1 268 387 uce-6459 207 no
uce-6459 uce-6459_p1 210 329 uce-6459 275 yes
uce-6459 uce-6459_p1 210 329 uce-6459 288 yes
uce-6459 uce-6459_p1 210 329 uce-374 373 no
uce-374 uce-374_p2 509 628 uce-3393 327 no
uce-374 uce-374_p1 449 568 uce-3393 416 no
uce-3393 uce-3393_p1 439 558 uce-3393 712 no
uce-3393 uce-3393_p1 439 558 uce-1200 416 yes
uce-3393 uce-3393_p1 439 558 uce-805 397 yes
uce-1200 uce-1200_p3 341 460 uce-627 326 no
uce-805 uce-805_p1 333 452 uce-2299 340 no
uce-627 uce-627_p1 396 515 uce-2126 481 no
uce-2299 uce-2299_p1 388 507 uce-5427 562 yes
uce-2126 uce-2126_p1 323 437 uce-5427 711 no
uce-5427 uce-5427_p1 509 628 uce-5893 242 no
uce-5427 uce-5427_p1 509 628 uce-5893 330 no
uce-5893 uce-5893_p1 477 582 uce-5893 398 no
Any help would be appreciated.
Here is a full stand-alone awk-script:
#!/usr/bin/awk -f
# goes through the whole file, saves boundaries of the intervals
NR > 1 && NR == FNR {
starts[$1] = $3
ends[$1] = $4
#print "Scanned interval: "$1" = ["starts[$1]","ends[$1]"]"
}
# prints the header of the table
NR != FNR && FNR == 1 {
print $0
}
# annotates each line with "yes"/"no"
FNR > 1 && NR != FNR {
print $0" "(starts[$5] <= $6 && $6 <= ends[$5] ? "yes" : "no")
}
Depending on what OS you have and what awk you are using, you might need to adjust the first line (use which awk to find the right path).
In order to make it run, you have to save it in a separate file (e.g. analyzeSnp.awk), then make it executable (e.g. chmod u+x analyzeSnp.awk), and then run it like this:
./analyzeSnp.awk inputData inputData
Alternatively, you can enclose it in single quotes and paste it directly into the console like this:
$ awk '
# goes through the whole file, saves boundaries of the intervals
NR > 1 && NR == FNR {
starts[$1] = $3
ends[$1] = $4
#print "Scanned interval: "$1" = ["starts[$1]","ends[$1]"]"
}
# prints the header of the table
NR != FNR && FNR == 1 {
print $0
}
# annotates each line with "yes"/"no"
FNR > 1 && NR != FNR {
print $0" "(starts[$5] <= $6 && $6 <= ends[$5] ? "yes" : "no")
}' loci.txt loci.txt
Important: you have to specify your input file twice, because we need a first scan to build a table with intervals, and then a second scan for actual annotation.
Here is the produced output:
locus match start end subset pos region
uce-3280 uce-3280_p1 269 388 uce-3280 222 no
uce-6620 uce-6620_p1 297 416 uce-6620 198 no
uce-6620 uce-6620_p1 297 416 uce-6620 300 yes
uce-432 uce-432_p2 328 447 uce-432 205 no
uce-432 uce-432_p1 268 387 uce-6459 207 no
uce-6459 uce-6459_p1 210 329 uce-6459 275 yes
uce-6459 uce-6459_p1 210 329 uce-6459 288 yes
uce-6459 uce-6459_p1 210 329 uce-374 373 no
uce-374 uce-374_p2 509 628 uce-3393 327 no
uce-374 uce-374_p1 449 568 uce-3393 416 no
uce-3393 uce-3393_p1 439 558 uce-3393 712 no
uce-3393 uce-3393_p1 439 558 uce-1200 416 yes
uce-3393 uce-3393_p1 439 558 uce-805 397 yes
uce-1200 uce-1200_p3 341 460 uce-627 326 no
uce-805 uce-805_p1 333 452 uce-2299 340 no
uce-627 uce-627_p1 396 515 uce-2126 481 no
uce-2299 uce-2299_p1 388 507 uce-5427 562 yes
uce-2126 uce-2126_p1 323 437 uce-5427 711 no
uce-5427 uce-5427_p1 509 628 uce-5893 242 no
uce-5427 uce-5427_p1 509 628 uce-5893 330 no
uce-5893 uce-5893_p1 477 582 uce-5893 398 no
This looks pretty close to your example snippet.
A general remark: AWK is a surprisingly modern and powerful scripting language (at least for its age: this thing is almost 40 years old!). If you want to use AWK, you are not constrained to cryptic one-liners.
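For readers who prefer Python, the same two-pass idea can be sketched as follows. This is a hypothetical equivalent, not a translation that has been run against the full file: the function name annotate is made up, only the first eight data rows from the question are used, and a subset that never appears as a locus is treated as "no".

```python
# Header plus the first eight data rows from the question's input.
LINES = """\
locus match start end subset pos region
uce-3280 uce-3280_p1 269 388 uce-3280 222
uce-6620 uce-6620_p1 297 416 uce-6620 198
uce-6620 uce-6620_p1 297 416 uce-6620 300
uce-432 uce-432_p2 328 447 uce-432 205
uce-432 uce-432_p1 268 387 uce-6459 207
uce-6459 uce-6459_p1 210 329 uce-6459 275
uce-6459 uce-6459_p1 210 329 uce-6459 288
uce-6459 uce-6459_p1 210 329 uce-374 373
""".splitlines()

def annotate(lines):
    """Append yes/no to each row: is pos inside the interval of its subset locus?"""
    header = lines[0]
    rows = [ln.split() for ln in lines[1:] if ln.strip()]
    # Pass 1: remember the interval of each locus. A later row for the same
    # locus overwrites an earlier one, as in the awk script above.
    intervals = {r[0]: (int(r[2]), int(r[3])) for r in rows}
    # Pass 2: look up the interval by the subset column and test pos against it.
    out = [header]
    for r in rows:
        bounds = intervals.get(r[4])   # None if the subset is not a known locus
        inside = bounds is not None and bounds[0] <= int(r[5]) <= bounds[1]
        out.append(" ".join(r) + " " + ("yes" if inside else "no"))
    return out

for line in annotate(LINES):
    print(line)
```

Holding the rows in a list replaces the trick of passing the file twice on the command line; for very large inputs, reading the file twice as the awk version does would use less memory.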