SQL query performance with DISTINCT and IN

I have a requirement in my project as follows. There is a table like this:
UID VER STATUS
--------------------------
DOC001 VER.1 N/A
DOC001 VER.2 N/A
DOC001 VER.3 N/A
DOC001 VER.4 N/A
DOC002 VER.1 N/A
DOC002 VER.2 N/A
DOC002 VER.3 N/A
DOC003 VER.1 N/A
DOC003 VER.2 N/A
DOC003 VER.3 N/A
DOC003 VER.4 N/A
DOC003 VER.5 N/A
DOC003 VER.6 N/A
I need to change the status of each version after doing some validation. In this process, if one version of a document fails, say VER.2 of DOC001, I need to update the status of the remaining versions, i.e. VER.3 and VER.4, to FAIL regardless of their own validation results.
For this, I wrote SQL queries using DISTINCT, IN, and ORDER BY clauses, which are very slow.
The production DB is quite big, with millions of rows, which increases the pressure on me to improve the performance.
Your suggestions and help are highly appreciated.
Thanks in advance.
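A set-based UPDATE usually beats per-row DISTINCT/IN round trips for this pattern. Below is a minimal SQLite sketch in Python; the table and column names mirror the question, but the real schema is an assumption, as is the sortability of the "VER.n" strings. Once one version is marked FAIL, a single self-referencing UPDATE cascades FAIL to every later version of the same UID.

```python
import sqlite3

# Hypothetical schema mirroring the question. NOTE: lexicographic
# comparison of "VER.n" only works for single-digit version numbers;
# a real table should carry a numeric version column instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (uid TEXT, ver TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [("DOC001", f"VER.{i}", "N/A") for i in range(1, 5)],
)

# Validation marks one version as failed ...
conn.execute("UPDATE docs SET status = 'FAIL' WHERE uid = 'DOC001' AND ver = 'VER.2'")

# ... and one set-based UPDATE cascades FAIL to every later version of
# the same UID, replacing the slow per-document DISTINCT/IN queries.
conn.execute("""
    UPDATE docs
       SET status = 'FAIL'
     WHERE EXISTS (SELECT 1 FROM docs AS f
                    WHERE f.uid = docs.uid
                      AND f.status = 'FAIL'
                      AND f.ver < docs.ver)
""")
```

With an index on (UID, VER) the inner EXISTS probe stays cheap even at millions of rows.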

Related

How to detect an event and ensure it's the first (earliest) event that occurred?

I am attempting to extract data from a large patient database using SQL. I need to detect when an event occurred; in this case, a decrease in SpO2 of 10% from when treatment started. It needs to be the earliest time point at which it decreased. For example, in 100-95-92-97-88-92-89, only the 88 is what I need. The tricky part: patients might have two treatments throughout care, and I need to compare these treatments to observe which one is best. An example of the data is below.
The patient will have treatment A until treatment B starts or until the final reading, and vice versa. The rows I need to extract are the ones shown in the desired output further below.
Date       Time   SpO2  Treatment
--------------------------------
05/19/22   18:23  100   N/A
05/19/22   18:24  95    A
05/19/22   18:25  90    N/A
05/19/22   18:26  85    N/A
05/19/22   18:27  N/A   B
05/19/22   18:27  90    N/A
05/19/22   18:28  85    N/A
05/19/22   18:29  80    N/A
05/19/22   18:30  78    N/A
05/19/22   18:31  76    N/A
What I'm hoping the final table looks like is
Date       Time   SpO2  Treatment
--------------------------------
05/19/22   18:23  100   N/A
05/19/22   18:24  95    A
05/19/22   18:26  85    N/A
05/19/22   18:27  N/A   B
05/19/22   18:27  90    N/A
05/19/22   18:29  80    N/A
If you can help, that would be amazing. Thank you very much.
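Before writing the SQL, the event rule itself can be sketched in Python as a hypothetical helper: within a single treatment period, return the earliest reading that is at least 10% below the period's starting value. The sequence is the 100-95-92-97-88-92-89 example from the question; the time labels are placeholders for illustration.

```python
def first_drop(readings, threshold=0.10):
    """readings: (time, spo2) pairs within one treatment period, in
    chronological order. Returns the earliest reading at least
    `threshold` below the period's first reading, or None."""
    baseline = readings[0][1]
    cutoff = baseline * (1 - threshold)
    for time, spo2 in readings[1:]:
        if spo2 <= cutoff:
            return (time, spo2)
    return None

# The question's example: 100-95-92-97-88-92-89 -> only the 88 qualifies,
# because 92 is still above the 90 cutoff and 88 precedes 89.
series = [("t0", 100), ("t1", 95), ("t2", 92), ("t3", 97),
          ("t4", 88), ("t5", 92), ("t6", 89)]
```

In SQL the same rule maps to a window query: take the first reading per treatment partition with FIRST_VALUE, then keep the earliest row at or below 90% of it.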

Tensorflow 2.0 utilize all CPU cores 100%

My Tensorflow model makes heavy use of data preprocessing that should be done on the CPU to leave the GPU open for training.
top - 09:57:54 up 16:23, 1 user, load average: 3,67, 1,57, 0,67
Tasks: 400 total, 1 running, 399 sleeping, 0 stopped, 0 zombie
%Cpu(s): 19,1 us, 2,8 sy, 0,0 ni, 78,1 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
MiB Mem : 32049,7 total, 314,6 free, 5162,9 used, 26572,2 buff/cache
MiB Swap: 6779,0 total, 6556,0 free, 223,0 used. 25716,1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17604 joro 20 0 22,1g 2,3g 704896 S 331,2 7,2 4:39.33 python
This is what top shows me. I would like to make this Python process use at least 90% of the available CPU across all cores. How can this be achieved?
GPU utilization is better, around 90%, though I don't know why it is not at 100%.
Mon Aug 10 10:00:13 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 On | N/A |
| 35% 41C P2 90W / 260W | 10515MiB / 11016MiB | 11% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1128 G /usr/lib/xorg/Xorg 102MiB |
| 0 1648 G /usr/lib/xorg/Xorg 380MiB |
| 0 1848 G /usr/bin/gnome-shell 279MiB |
| 0 10633 G ...uest-channel-token=1206236727 266MiB |
| 0 13794 G /usr/lib/firefox/firefox 6MiB |
| 0 17604 C python 9457MiB |
+-----------------------------------------------------------------------------+
All I found was a solution for TensorFlow 1.x:
sess = tf.Session(config=tf.ConfigProto(
    intra_op_parallelism_threads=NUM_THREADS))
I have an Intel 9900K and an RTX 2080 Ti, and use Ubuntu 20.04.
Edit: When I add the following code at the top, it uses one core at 100%:
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(1)
But increasing this number to 16 again only utilizes all cores at ~30%.
Just setting set_intra_op_parallelism_threads and set_inter_op_parallelism_threads wasn't working for me. In case someone else is in the same place: after a lot of struggle with the same issue, the code below worked for me to limit TensorFlow's CPU usage to below 500%:
import os
import tensorflow as tf

num_threads = 5
os.environ["OMP_NUM_THREADS"] = str(num_threads)
os.environ["TF_NUM_INTRAOP_THREADS"] = str(num_threads)
os.environ["TF_NUM_INTEROP_THREADS"] = str(num_threads)

tf.config.threading.set_inter_op_parallelism_threads(num_threads)
tf.config.threading.set_intra_op_parallelism_threads(num_threads)
tf.config.set_soft_device_placement(True)
There can be many causes for this; I solved it for me the following way. Set
tf.config.threading.set_intra_op_parallelism_threads(<physical core count>)
tf.config.threading.set_inter_op_parallelism_threads(<physical core count>)
both to your physical core count. You do not want hyperthreading for highly vectorized operations, as you cannot benefit from parallelized operations when there aren't any execution gaps.
"With a high level of vectorization, the number of execution gaps is very small and there is possibly insufficient opportunity to make up any penalty due to increased contention in HT."
From: Saini et al., NASA Advanced Supercomputing Division, 2011: The Impact of Hyper-Threading on Processor Resource Utilization in Production Applications.
EDIT: I am no longer sure whether one of the two has to be 1, but one definitely needs to be set to the physical core count.
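Pulling the suggestions above together, a hedged starting point looks like the sketch below. The // 2 assumes a hyperthreaded CPU such as the i9-9900K here (8 physical / 16 logical cores); the environment variables must be set before TensorFlow is imported.

```python
import os

# Derive the physical core count from the logical count, assuming
# 2-way hyperthreading (an assumption; verify with lscpu on your box).
physical_cores = max(1, (os.cpu_count() or 2) // 2)

# These are read by TensorFlow (and OpenMP-based kernels) at import time,
# so they have to come before "import tensorflow".
os.environ["OMP_NUM_THREADS"] = str(physical_cores)
os.environ["TF_NUM_INTRAOP_THREADS"] = str(physical_cores)
os.environ["TF_NUM_INTEROP_THREADS"] = str(physical_cores)

# Only after that:
# import tensorflow as tf
# tf.config.threading.set_intra_op_parallelism_threads(physical_cores)
# tf.config.threading.set_inter_op_parallelism_threads(physical_cores)
```

If CPU-side preprocessing is still the bottleneck, parallelizing the input pipeline itself (rather than only the op thread pools) is the other lever to check.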

How to get rid of N/A from logs forwarded by nxlog?

I have enabled debug mode on a Windows DNS server, and the log file is located at c:\logs\dns.log. Sample lines from the forwarded output:
<181>Jan 2 11:41:40 DC-SE-01 MSWinEventLog 1 N/A 1011398786 Tue Jan 2 11:41:40 2018 N/A N/A N/A N/A N/A N/A N/A N/A
<181>Jan 2 11:41:40 DC-IN-01 MSWinEventLog 1 N/A 1011398810 Tue Jan 2 11:41:40 2018 N/A N/A N/A N/A N/A N/A N/A N/A
<181>Jan 2 11:41:40 DC-IN-01 MSWinEventLog 1 N/A 1011398825 Tue Jan 2 11:41:40 2018 N/A N/A N/A N/A N/A N/A N/A 1/2/2018 11:41:38 AM 0A48 PACKET 00BACCA157DAE01 UDP Snd 11.11.201.81 3b20 R Q [8281 DR SERVFAIL] A (8)tnmaster(0) N/A
I think these messages are of little importance, so how do I get rid of them in nxlog, and why are the "N/A"s in there?
Below are the relevant parts of the nxlog.conf file:
<Input dnsdebug>
    Module im_file
    File "C:\logs\dns.log"
    InputType LineBased
    Exec $Message = $raw_event; $SyslogFacilityValue = 22;
</Input>
<Output logger>
    Module om_udp
    Host 11.11.11.10
    Port 514
    Exec to_syslog_snare();
</Output>
<Route 3>
    Path dnsdebug => logger
</Route>
The Snare syslog format is basically a tab-delimited string that assumes certain fields, such as the EventID, since it was primarily designed to transfer the Windows Eventlog over syslog.
In order to generate the output, these fields need to be populated. When you read the DNS log from a file, these fields are obviously not automatically parsed; thus the output has N/A in those places.
For more information, see the Snare topic in the NXLog User Guide.
Since you are trying to collect dns.log, the Collecting DNS logs topic might also be relevant.
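As for discarding the low-value lines: nxlog can drop events with drop() inside an Exec directive, keyed on a regex. The matching rule can be sketched in Python first; the pattern and the "PACKET" keyword are assumptions based on the sample lines above, where the uninformative records end in a long run of N/A fields and carry no DNS packet payload.

```python
import re

# Hypothetical rule: a line is noise when it has no PACKET payload and
# ends in a run of six or more "N/A" fields. In nxlog the equivalent
# would be roughly: Exec if $raw_event =~ /(N\/A\s+){5,}N\/A\s*$/ drop();
NOISE = re.compile(r"(N/A\s+){5,}N/A\s*$")

def is_noise(line: str) -> bool:
    return "PACKET" not in line and NOISE.search(line) is not None

# Shortened samples modeled on the log excerpt above.
noise_line = ("<181>Jan 2 11:41:40 DC-SE-01 MSWinEventLog 1 N/A 1011398786 "
              "Tue Jan 2 11:41:40 2018 N/A N/A N/A N/A N/A N/A N/A N/A")
packet_line = ("<181>Jan 2 11:41:40 DC-IN-01 MSWinEventLog 1 N/A 1011398825 "
               "Tue Jan 2 11:41:40 2018 ... PACKET ... A (8)tnmaster(0) N/A")
```

Test the regex against your real log before deploying; the exact field layout depends on the DNS debug settings.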

Counting values from table results

I have a table of data that looks like the following:
ArtistName TrackName TrackID
1 Pendulum Slam 6
2 N/A N/A 26
3 N/A N/A 26
4 N/A N/A 26
5 Snow Patrol Chasing Cars 17
6 Snow Patrol Chasing Cars 17
7 Rihanna Love The Way You Lie 4
8 N/A N/A 26
9 N/A N/A 26
10 Kanye West Stronger 10
11 Rihanna Love The Way You Lie 4
12 N/A N/A 26
13 N/A N/A 26
14 Tinie Tempah Written In The Stars 8
15 N/A N/A 26
16 N/A N/A 26
17 Nero Crush On You 18
etc...
Basically, what I'd like to do is count the number of occurrences of each TrackID and display that in a column. The table above is produced by this query, which combines a few other tables:
SELECT Artist_Details.ArtistName, Track_Details.TrackName, Sales_Records.TrackID
FROM Track_Details
INNER JOIN Sales_Records ON Track_Details.TrackID = Sales_Records.TrackID
JOIN Artist_Details ON Track_Details.ArtistID = Artist_Details.ArtistID;
The output format I'd like is:
ArtistName TrackName Track ID TotalSales
1 Pendulum Slam 6 8
2 Tinie Tempah Written In The Stars 8 5
3 Rihanna Love The Way You Lie 4 2
Finally, I'd like the value 26 to be ignored entirely: not counted and not displayed in the results. The output should be sorted by TotalSales (descending, as in the example above) and, if possible, limited to 10 rows.
Thanks in advance, Mark
That looks like a slam dunk for group by:
SELECT top 10 Artist_Details.ArtistName, Track_Details.TrackName,
Sales_Records.TrackID, count(Sales_Records.TrackID) as TotalSales
FROM Track_Details
INNER JOIN Sales_Records ON Track_Details.TrackID = Sales_Records.TrackID
JOIN Artist_Details ON Track_Details.ArtistID = Artist_Details.ArtistID
WHERE Sales_Records.TrackID <> 26
GROUP BY Artist_Details.ArtistName, Track_Details.TrackName, Sales_Records.TrackID
ORDER BY count(Sales_Records.TrackID) desc
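The query above uses SQL Server's TOP 10; the same logic can be sanity-checked with a quick SQLite sketch in Python, where the row cap is written as LIMIT instead. A single denormalized table stands in for the three joined ones, and the row counts here are made up for illustration.

```python
import sqlite3

# One flat table standing in for the Track_Details / Sales_Records /
# Artist_Details join; column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ArtistName TEXT, TrackName TEXT, TrackID INT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Pendulum", "Slam", 6)] * 3
    + [("N/A", "N/A", 26)] * 4
    + [("Rihanna", "Love The Way You Lie", 4)] * 2,
)

# GROUP BY with the 26 filter, sorted by sales, capped at 10 rows.
rows = conn.execute("""
    SELECT ArtistName, TrackName, TrackID, COUNT(*) AS TotalSales
      FROM sales
     WHERE TrackID <> 26
     GROUP BY ArtistName, TrackName, TrackID
     ORDER BY TotalSales DESC
     LIMIT 10
""").fetchall()
```

The WHERE filter runs before grouping, so the 26 rows never enter the counts at all.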

Match similar zip codes

Background
The goal is to replace invalid zip codes.
Sample Data
Consider the following data set:
Typo | City | ST | Zip5
-------+------------+----+------
33967 | Fort Myers | FL | 33902
33967 | Fort Myers | FL | 33965
33967 | Fort Myers | FL | 33911
33967 | Fort Myers | FL | 33901
33967 | Fort Myers | FL | 33907
33967 | Fort Myers | FL | 33994
34115 |Marco Island| FL | 34145
34115 |Marco Island| FL | 34146
86405 | Kingman | FL | 86404
86405 | Kingman | FL | 86406
33967 closely matches 33965, although 33907 could also be correct. (In this case, 33967 is a valid zip code, but not one in our zip code database.)
34115 closely matches 34145 (off by one digit, with a difference of 3 for that digit).
86405 closely matches both 86404 and 86406.
Sometimes digits are simply reversed (e.g., 89 instead of 98).
Question
How would you write a SQL statement that finds the "minimum distance" between multiple numbers that have the same number of digits, returning at most one result no matter what?
Ideas
Subtract the digits.
Use LIMIT 1.
Conditions
PostgreSQL 8.3
This sounds like a case for Levenshtein distance.
"The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character."
It looks like PostgreSQL has it built in:
test=# SELECT levenshtein('GUMBO', 'GAMBOL');
levenshtein
-------------
2
(1 row)
http://www.postgresql.org/docs/8.3/static/fuzzystrmatch.html
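The same distance is easy to sanity-check outside the database. Below is a minimal dynamic-programming sketch in Python (not PostgreSQL's implementation), applied to the zip codes from the sample data:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic DP edit distance: insertions, deletions, substitutions,
    # computed row by row over the (len(a)+1) x (len(b)+1) table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Against the sample data, 33965 and 33907 both come out at distance 1
# from 33967, matching the tie noted in the question.
```

Note that plain Levenshtein scores a transposed pair (89 vs 98) as two edits; Damerau-Levenshtein, which counts a transposition as one edit, fits that case better.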
Redfilter answered the question that was asked, but I just want to clarify that the requested solution will not resolve what appears to be the real problem.
The real problem here seems to be that you have a database that was hand-keyed, and some numbers were transcribed incorrectly, giving garbage data.
The ONLY way to solve this problem is to validate the full address against a database like the USPS, MapQuest, or another provider. I know the first two have APIs available for doing this.
The example I gave in a comment above: consider a zip of 75084 with a city value of Richardson. Richardson has zip codes 75080, 75081, 75082, 75083, and 75085. The minimum number of edits will be 1; however, to which one?
An equally hard problem: what if the entered zip code was 75083 for Richardson? That is a valid zip code for that city; however, what if the address actually resided in 75082?
The only way to resolve that is to have the full address validated.