SQL: How to remove & update records in a table, syncing with (Q)List data, in order to decrease the burden on the database/table?

I am using Qt and MS SQL Server on Windows 7.
What I have is an MS SQL database that I use to store data/info coming from equipment mounted in some vehicles.
There is a table in the database named TransactionFilesInfo, used to store information about transaction files from the equipment when they connect to the TCP server.
We use this table because we are required to avoid duplicate files. Duplicates happen (sometimes) when the remote equipment does NOT delete the transaction files after they are sent to the server. Hence, I use the info from the table to check size and CRC, to avoid downloading duplicates.
Some sample data for TransactionFilesInfo table looks like this:
[TransactionFilesInfo]:
DeviceID FileNo FileSequence FileSize FileCRC RecordTimeStamp
10203 2 33 230 55384 2015-11-26 14:54:15
10203 7 33 624 55391 2015-11-26 14:54:15
10203 2 34 146 21505 2015-11-26 14:54:16
10203 7 34 312 35269 2015-11-26 14:54:16
10203 2 35 206 23022 2015-11-26 15:33:22
10203 7 35 208 11091 2015-11-26 15:33:22
10203 2 36 134 34918 2015-11-26 15:55:44
10203 7 36 104 63865 2015-11-26 15:55:44
10203 2 37 140 35466 2015-11-26 16:20:38
10203 7 37 208 62907 2015-11-26 16:20:38
10203 2 38 134 17706 2015-11-26 16:38:33
10203 7 38 104 42358 2015-11-26 16:38:33
11511 2 21 194 29913 2015-12-02 16:22:59
11511 7 21 114 30038 2015-12-02 16:22:59
On the other hand, every time a device connects to the server, it first sends a list of file information. The Qt application takes care of that.
The list contains elements like this:
struct FileInfo
{
    unsigned short FileNumber;
    unsigned short FileSequence;
    unsigned short FileCRC;
    unsigned long  FileSize;
};
So, as an example (inspired by the table above) the connected device (DeviceID=10203) may say that it has the following files:
QList<FileInfo> filesList;
// here is the log4qt output...
filesList[0] --> FileNo=2 FileSeq=33 FileSize=230 and FileCRC=55384
filesList[1] --> FileNo=2 FileSeq=34 FileSize=146 and FileCRC=21505
filesList[2] --> FileNo=7 FileSeq=33 FileSize=624 and FileCRC=55391
filesList[3] --> FileNo=7 FileSeq=34 FileSize=312 and FileCRC=35269 ...
Well, what I need is a method to remove/delete, for a given DeviceID, all the records in the TransactionFilesInfo table that are NOT in the list sent by the remote device. That way I can decrease the burden (size) on the database table.
Remark: For the moment I just delete (at midnight) all the records that are older than, let's say, 10 days, based on the RecordTimeStamp field. So the size of the table doesn't increase over an alarming level :)
Finally, to clarify it a little bit: I would mainly need help with SQL. Yet, I would not refuse any idea on how to do some related things/tricks on the Qt side ;)

The SQL to delete those records might look something like this:
DELETE FROM TransactionFilesInfo
WHERE DeviceID = 10203
  AND 'File' + CONVERT(varchar(11), FileNo) + '_' +
      RIGHT('000' + CONVERT(varchar(11), FileSequence), 3)
      NOT IN ('File2_033', 'File2_034', 'File7_033', 'File7_034', ...)
If you wanted to delete all of them for a device, you could drop the code that looks at the FileNo and FileSequence so it is simply:
DELETE FROM TransactionFilesInfo
WHERE DeviceID = 10203
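An alternative that avoids the string concatenation (just a sketch; it assumes your Qt code can create and fill a temporary table, here called #DeviceFiles, from the QList before issuing the delete) is to stage the device's list and use NOT EXISTS:
-- Hypothetical staging table, filled from the QList<FileInfo> sent by the device
CREATE TABLE #DeviceFiles (
    FileNo       int NOT NULL,
    FileSequence int NOT NULL
);
-- INSERT INTO #DeviceFiles (FileNo, FileSequence) VALUES (2, 33), (2, 34), (7, 33), (7, 34), ...;

-- Delete every record for this device that the device did not report
DELETE t
FROM TransactionFilesInfo t
WHERE t.DeviceID = 10203
  AND NOT EXISTS (
        SELECT 1
        FROM #DeviceFiles f
        WHERE f.FileNo = t.FileNo
          AND f.FileSequence = t.FileSequence
      );

DROP TABLE #DeviceFiles;
On the Qt side, the inserts into the staging table can be built in a loop over the QList with QSqlQuery bind values, so the final DELETE stays a single fixed statement.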

Related

Dynamically calculate difference columns based off a slicer - Power BI

I have a table with quarterly volume data, and a slicer that lets you choose which quarter/year you want to see volume per code for. The slicer has 2019Q1 through 2021Q4 selections. I need to create a dynamic difference column that adjusts depending on which quarter/year is selected in the slicer. I know I need to create a new measure using CALCULATE/FILTER, but I am a beginner in Power BI and am unsure how to write that formula.
Example of raw table data:
Code   2019Q1  2019Q2  2019Q3  2019Q4  2020Q1  2020Q2  2020Q3  2020Q4
11111  232     283     289     19      222     283     289     19
22222  117     481     231     31      232     286     2       19
11111  232     397     94      444     232     553     0       188
22222  117     411     15      14      232     283     25      189
Example if 2019Q1 and 2020Q1 are selected:
Code   2019Q1  2020Q1  Difference
11111  232     222     10
22222  117     481     -364
11111  232     397     -165
22222  117     411     -294
Power BI doesn't work that way. This is an Excel pivot table setup. You don't have any parameter to distinguish the first and third, or the second and fourth, rows: they have the same Code, so Power BI will aggregate their volumes. You could introduce a hidden index column, but then why not simply stick to Excel? The Power BI approach to the problem would be to unpivot (stack) your table into Code, Quarter and Volume columns, create two independent slicer tables for Minuend and Subtrahend, and then CALCULATE your aggregated differences based on the SELECTEDVALUE of the two slicers.

Plotting Webscraped data onto matplotlib

I recently managed to collect tabular data from a PDF file using camelot in Python. By collect I mean print it out on the terminal. Now I would like to find a way to automatically turn the results into a bar graph in matplotlib. How would I do that? Here's my code for extracting the tabular data from the PDF:
import camelot
tables = camelot.read_pdf("data_table.pdf", pages='2')
print(tables[0].df)
Here's an image of the table
Which then prints out a large table in my terminal:
0 1 2 3 4
0 Country \nCase definition \nCumulative cases \...
1 Guinea Confirmed 2727 156 1683
2 Probable 374 * 374
3 Suspected 7 * ‡
4 Total 3108 156 2057
5 Liberia** Confirmed 3149 11 ‡
6 Probable 1876 * ‡
7 Suspected 3982 * ‡
8 Total 9007 11 3900
9 Sierra Leone Confirmed 8212 230 3042
10 Probable 287 * 208
11 Suspected 2604 * 158
12 Total 11103 230 3408
13 Total 23 218 397 9365
I do have a bit of experience with matplotlib and I know how to plot data manually, but not automatically from the PDF. This would save me some time since I'm trying to automate the whole process.

MS SQL Stored Procedure Optimizing

I have attached my query results. How can I optimize this stored procedure? Also, do I need to optimize it? I can get the result in 0.2, or in some cases more.
Client Execution Time 18:18:18 18:18:08 18:17:49 18:17:24 18:13:18
Query Profile Statistics
Number of INSERT, DELETE and UPDATE statements 281 281 281 50 0 178.6000
Rows affected by INSERT, DELETE, or UPDATE statements 235 235 235 44 0 149.8000
Number of SELECT statements 4870 4870 4870 741 13 3072.8000
Rows returned by SELECT statements 3653 3653 3653 598 37 2318.8000
Number of transactions 281 281 281 50 0 178.6000
Network Statistics
Number of server roundtrips 1 1 1 3 3 1.8000
TDS packets sent from client 1 1 1 3 3 1.8000
TDS packets received from server 119 110 90 898 78 259.0000
Bytes sent from client 138 138 138 284 288 197.2000
Bytes received from server 327491 327491 327491 2861601 197860 808386.8000
Time Statistics
Client processing time 2755 3793 2364 908 332 2030.4000
Total execution time 3225 4294 2825 2095 1375 2762.8000
Wait time on server replies 470 501 461 1187 1043 732.4000
There are a number of options you can look at:
1. SQL MERGE
MERGE can be used to perform inserts, updates and deletes in a single statement.
http://technet.microsoft.com/en-us/library/bb510625.aspx
http://blog.sqlauthority.com/2008/08/28/sql-server-2008-introduction-to-merge-statement-one-statement-for-insert-update-delete/
2. OUTPUT clause
The OUTPUT clause can be used to return any value from the 'inserted' and 'deleted' (new value and old value) tables when doing an insert or update.
http://msdn.microsoft.com/en-us/library/ms177564(v=sql.90).aspx
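As a rough sketch of how the two combine (the table and column names here are made up, not taken from your procedure), a single MERGE with an OUTPUT clause can replace separate INSERT/UPDATE statements and reduce the number of statements and round trips:
-- Hypothetical target and source tables; adjust names to your schema
MERGE dbo.TargetTable AS t
USING dbo.SourceTable AS s
    ON t.Id = s.Id
WHEN MATCHED THEN
    UPDATE SET t.Value = s.Value
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Value) VALUES (s.Id, s.Value)
OUTPUT $action, inserted.Id, deleted.Value AS OldValue, inserted.Value AS NewValue;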

How do I loop through a table until condition reached

I have a table with the columns product, pick_qty, shortfall, location, loc_qty.
Product Picked Qty Shortfall Location Location Qty
1742 4 58 1 15
1742 4 58 2 20
1742 4 58 3 15
1742 4 58 4 20
1742 4 58 5 20
1742 4 58 6 20
1742 4 58 7 15
1742 4 58 8 15
1742 4 58 9 15
1742 4 58 10 20
I want a report to loop around and show the number of locations and the quantity I need to drop to fulfil the shortfall for replenishment. So the report would look like this.
Product Picked Qty Shortfall Location Location Qty
1742 4 58 1 15
1742 4 58 2 20
1742 4 58 3 15
1742 4 58 4 20
Note that it is best not to think about SQL "looping through a table" and instead to think about it as operating on some subset of the rows in a table.
What it sounds like you need to do is create a running total that tells how many of the item you would have if you were to take all of them from a location and all of the locations that came before the current location and then check to see if that would give you enough of the item to fulfill the shortfall.
Based on your example data, the following query would work, though if Locations aren't actually numeric then you would need to add a row number column and tweak the query a bit to use the row number instead of the Location number; it would still be very similar to the query below.
SELECT
    Totals.Product, Totals.PickedQty, Totals.ShortFall, Totals.Location, Totals.LocationQty
FROM (
    SELECT
        TheTable.Product, TheTable.PickedQty, TheTable.ShortFall,
        TheTable.Location, TheTable.LocationQty, SUM(ForRunningTotal.LocationQty) AS RunningTotal
    FROM TheTable
    JOIN TheTable ForRunningTotal ON TheTable.Product = ForRunningTotal.Product
        AND TheTable.Location >= ForRunningTotal.Location
    GROUP BY TheTable.Product, TheTable.PickedQty, TheTable.ShortFall, TheTable.Location, TheTable.LocationQty
) Totals
-- Note you could also change the join above so the running total is actually the count of only the rows above,
-- not including the current row; Then the WHERE clause below could be "Totals.RunningTotal < Totals.ShortFall".
-- I liked RunningTotal as the sum of this row and all prior, it seems more appropriate to me.
WHERE Totals.RunningTotal - Totals.LocationQty <= Totals.ShortFall
    AND Totals.LocationQty > 0
Also, as long as you are reading my answer, an unrelated side note: based on the data you showed above, your database schema isn't normalized as far as it could be. It seems like the Picked Quantity and the ShortFall actually depend only on the Product, so that would be a table of its own, and the Location Quantity depends on the Product and Location, so that would be a table of its own. I point it out because if your data contained different Picked Quantities/ShortFalls for a single product, the above query would break; that situation would be impossible with the normalized tables I mentioned.
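If your database supports window functions (e.g. SQL Server 2012+), the running total can also be computed without the self-join. This is just a sketch against the same hypothetical TheTable:
SELECT Totals.Product, Totals.PickedQty, Totals.ShortFall, Totals.Location, Totals.LocationQty
FROM (
    SELECT Product, PickedQty, ShortFall, Location, LocationQty,
           -- running total of LocationQty in Location order, per Product
           SUM(LocationQty) OVER (PARTITION BY Product
                                  ORDER BY Location
                                  ROWS UNBOUNDED PRECEDING) AS RunningTotal
    FROM TheTable
) AS Totals
WHERE Totals.RunningTotal - Totals.LocationQty <= Totals.ShortFall
  AND Totals.LocationQty > 0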

Naive bayes calculation in sql

I want to use naive Bayes to classify documents into a relatively large number of classes. I'm looking to confirm whether a mention of an entity name in an article really is that entity, on the basis of whether that article is similar to articles in which that entity has been correctly verified.
Say we find the text "General Motors" in an article. We have a set of data that contains articles and the correct entities mentioned within them. So, if we find "General Motors" mentioned in a new article, should it fall into the class of articles in the prior data that contained a known genuine mention of "General Motors", vs. the class of articles which did not mention that entity?
(I'm not creating a class for every entity and trying to classify every new article into every possible class. I already have a heuristic method for finding plausible mentions of entity names, and I just want to verify the plausibility of the limited number of entity name mentions per article that the method already detects.)
Given that the number of potential classes and articles is quite large and naive Bayes is relatively simple, I wanted to do the whole thing in SQL, but I'm having trouble with the scoring query...
Here's what I have so far:
CREATE TABLE `each_entity_word` (
  `word` varchar(20) NOT NULL,
  `entity_id` int(10) unsigned NOT NULL,
  `word_count` mediumint(8) unsigned NOT NULL,
  PRIMARY KEY (`word`, `entity_id`)
);
CREATE TABLE `each_entity_sum` (
  `entity_id` int(10) unsigned NOT NULL DEFAULT '0',
  `word_count_sum` int(10) unsigned DEFAULT NULL,
  `doc_count` mediumint(8) unsigned NOT NULL,
  PRIMARY KEY (`entity_id`)
);
CREATE TABLE `total_entity_word` (
  `word` varchar(20) NOT NULL,
  `word_count` int(10) unsigned NOT NULL,
  PRIMARY KEY (`word`)
);
CREATE TABLE `total_entity_sum` (
  `word_count_sum` bigint(20) unsigned NOT NULL,
  `doc_count` int(10) unsigned NOT NULL,
  `pkey` enum('singleton') NOT NULL DEFAULT 'singleton',
  PRIMARY KEY (`pkey`)
);
Each article in the marked data is split into distinct words, and for each article, for each entity, every word is added to each_entity_word (or its word_count is incremented), and doc_count is incremented in each_entity_sum, both with respect to that entity_id. This is repeated for each entity known to be mentioned in that article.
For each article, regardless of the entities contained within, for each word, total_entity_word and total_entity_sum are similarly incremented.
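For instance, the per-word increment described above can be done in one MySQL statement per word (just an illustration with made-up values):
-- add the word for this entity, or bump its count if it is already there
INSERT INTO each_entity_word (word, entity_id, word_count)
VALUES ('motors', 123, 1)
ON DUPLICATE KEY UPDATE word_count = word_count + 1;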
P(word | any document) should equal the word_count in total_entity_word for that word, over the doc_count in total_entity_sum.
P(word | document mentions entity x) should equal the word_count in each_entity_word for that word for entity_id x, over the doc_count in each_entity_sum for entity_id x.
P(word | document does not mention entity x) should equal (the word_count in total_entity_word minus its word_count in each_entity_word for that word for that entity), over (the doc_count in total_entity_sum minus the doc_count for that entity in each_entity_sum).
P(document mentions entity x) should equal the doc_count in each_entity_sum for that entity id, over the doc_count in total_entity_sum.
P(document does not mention entity x) should equal 1 minus (the doc_count in each_entity_sum for x's entity id over the doc_count in total_entity_sum).
For a new article that comes in, split it into distinct words and just select where word in ('I', 'want', 'to', 'use'...) against either each_entity_word or total_entity_word. In the db platform I'm working with (MySQL), IN clauses are relatively well optimized.
Also, there is no product() aggregate function in SQL, so you can just do sum(log(x)), or exp(sum(log(x))), to get the equivalent of product(x).
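For example (with a hypothetical table probabilities(entity_id, p) holding per-word probabilities), the product over a group can be written as:
SELECT entity_id, EXP(SUM(LOG(p))) AS product_of_p
FROM probabilities
GROUP BY entity_id;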
So, if I get a new article in, split it up into distinct words and put those words into a big IN() clause and a potential entity id to test, how can I get the naive bayesian probability that the article falls into that entity id's class in sql?
EDIT:
Try #1:
set @entity_id = 1;
select @entity_doc_count := doc_count from each_entity_sum where entity_id = @entity_id;
select @total_doc_count := doc_count from total_entity_sum;
select
  exp(
    log(@entity_doc_count / @total_doc_count) +
    (
      sum(log((ifnull(ew.word_count, 0) + 1) / @entity_doc_count)) /
      sum(log(((aew.word_count + 1) - ifnull(ew.word_count, 0)) / (@total_doc_count - @entity_doc_count)))
    )
  ) as likelihood
from total_entity_word aew
left outer join each_entity_word ew on ew.word = aew.word and ew.entity_id = @entity_id
where aew.word in ('I', 'want', 'to', 'use'...);
Use an R to Postgres (or MySQL, etc.) interface
Alternatively, I'd recommend using an established stats package with a connector to the db. This will make your app a lot more flexible if you want to switch from Naive Bayes to something more sophisticated:
http://rpgsql.sourceforge.net/
bnd.pr> data(airquality)
bnd.pr> db.write.table(airquality, no.clobber = F)
bnd.pr> bind.proxy("airquality")
bnd.pr> summary(airquality)
Table name: airquality
Database: test
Host: localhost
Dimensions: 6 (columns) 153 (rows)
bnd.pr> print(airquality)
Day Month Ozone Solar.R Temp
1 1 5 41 190 67
2 2 5 36 118 72
3 3 5 12 149 74
4 4 5 18 313 62
5 5 5 NA NA 56
6 6 5 28 NA 66
7 7 5 23 299 65
8 8 5 19 99 59
9 9 5 8 19 61
10 10 5 NA 194 69
Continues for 143 more rows and 1 more cols...
bnd.pr> airquality[50:55, ]
Ozone Solar.R Wind Temp Month Day
50 12 120 11.5 73 6 19
51 13 137 10.3 76 6 20
52 NA 150 6.3 77 6 21
53 NA 59 1.7 76 6 22
54 NA 91 4.6 76 6 23
55 NA 250 6.3 76 6 24
bnd.pr> airquality[["Ozone"]]
[1] 41 36 12 18 NA 28 23 19 8 NA 7 16 11 14 18 14 34 6
[19] 30 11 1 11 4 32 NA NA NA 23 45 115 37 NA NA NA NA NA
[37] NA 29 NA 71 39 NA NA 23 NA NA 21 37 20 12 13 NA NA NA
[55] NA NA NA NA NA NA NA 135 49 32 NA 64 40 77 97 97 85 NA
[73] 10 27 NA 7 48 35 61 79 63 16 NA NA 80 108 20 52 82 50
[91] 64 59 39 9 16 78 35 66 122 89 110 NA NA 44 28 65 NA 22
[109] 59 23 31 44 21 9 NA 45 168 73 NA 76 118 84 85 96 78 73
[127] 91 47 32 20 23 21 24 44 21 28 9 13 46 18 13 24 16 13
[145] 23 36 7 14 30 NA 14 18 20
You'll then want to install the e1071 package to do Naive Bayes. At the R prompt:
[ramanujan:~/base]$R
R version 2.7.2 (2008-08-25)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
~/.Rprofile loaded.
Welcome at Sun Apr 19 00:45:30 2009
> install.packages("e1071")
> install.packages("mlbench")
> library(e1071)
> ?naiveBayes
> example(naiveBayes)
More info:
http://cran.r-project.org/web/packages/e1071/index.html
Here's a simple version for SQL Server. I run it on a free SQL Express implementation and it is pretty fast.
http://sqldatamine.blogspot.com/2013/07/classification-using-naive-bayes.html
I don't have time to calculate all the expressions for the NB formula, but here's the main idea:
SET @entity = 123;
SELECT EXP(SUM(LOG(probability))) / (EXP(SUM(LOG(probability))) + EXP(SUM(LOG(1 - probability))))
FROM (
SELECT @entity AS _entity,
/* Above is required for efficiency, subqueries using _entity will be DEPENDENT and use the indexes */
(
SELECT SUM(word_count)
FROM total_entity_word
WHERE word = d.word
)
/
(
SELECT doc_count
FROM each_entity_sum
WHERE entity_id = _entity
) AS pwordentity,
/* I've just referenced a previously selected field */
(
SELECT 1 - pwordentity
) AS pwordnotentity,
/* Again referenced a previously selected field */
... etc AS probability
FROM total_entity_word
) q
Note that you can easily refer to previously selected fields in a SELECT by using them in correlated subqueries (as in the example).
If using Oracle, it has data mining built in
I'm not sure what db you're running, but if you're using Oracle, data mining capabilities are baked into the db:
http://www.oracle.com/technology/products/bi/odm/index.html
...including Naive Bayes:
http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28129/algo_nb.htm
and a ton of others:
http://www.oracle.com/technology/products/bi/odm/odm_techniques_algorithms.html
That was surprising to me. Definitely one of the competitive advantages that Oracle has over the open source alternatives in this area.
Here is a blog post detailing what you are looking for: http://nuncupatively.blogspot.com/2011/07/naive-bayes-in-sql.html
I have coded up many versions of NB classifiers in SQL. The answers above advocating changing analysis packages were not scalable to my large data and processing time requirements. I had a table with a row for each word/class combination (nrows = words * classes) and a coefficient column. I had another table with a column for document_id and word. I just joined these tables together on word, grouped by document, and summed the coefficients and then adjusted the sums for the class probability. This left me with a table of document_id, class, score. I then just picked the min score (since I was doing a complement naive bayes approach, which I found worked better in a multi-class situation).
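That scoring join might look roughly like this (a sketch only; the table names word_class_coefficient(word, class, coef), document_word(document_id, word) and class_prior(class, log_prior) are hypothetical stand-ins for the tables described above):
-- score each document against each class by summing the word coefficients
-- and adjusting by the class prior
SELECT dw.document_id,
       wc.class,
       cp.log_prior + SUM(wc.coef) AS score
FROM document_word dw
JOIN word_class_coefficient wc ON wc.word = dw.word
JOIN class_prior cp ON cp.class = wc.class
GROUP BY dw.document_id, wc.class, cp.log_prior
ORDER BY dw.document_id, score;
From there you pick the minimum (or maximum) score per document, depending on whether you use the complement variant described above.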
As a side note, I found many transformations/algorithm modifications improved my holdout predictions a great deal. They are described in the work of Jason Rennie on "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" and summarized here: http://www.ist.temple.edu/~vucetic/cis526fall2007/liang.ppt