I want to use naive Bayes to classify documents into a relatively large number of classes. I'm looking to confirm whether a mention of an entity name in an article really refers to that entity, on the basis of whether that article is similar to articles in which that entity has been correctly verified.
Say we find the text "General Motors" in an article. We have a set of data that contains articles and the correct entities mentioned within them. So, if "General Motors" is found in a new article, should that article fall into the class of prior articles that contained a known genuine mention of "General Motors", or into the class of articles that did not mention that entity?
(I'm not creating a class for every entity and trying to classify every new article into every possible class. I already have a heuristic method for finding plausible mentions of entity names, and I just want to verify the plausibility of the limited number of entity name mentions per article that the method already detects.)
Given that the number of potential classes and articles is quite large and naive Bayes is relatively simple, I wanted to do the whole thing in SQL, but I'm having trouble with the scoring query...
Here's what I have so far:
CREATE TABLE `each_entity_word` (
`word` varchar(20) NOT NULL,
`entity_id` int(10) unsigned NOT NULL,
`word_count` mediumint(8) unsigned NOT NULL,
PRIMARY KEY (`word`, `entity_id`)
);
CREATE TABLE `each_entity_sum` (
`entity_id` int(10) unsigned NOT NULL DEFAULT '0',
`word_count_sum` int(10) unsigned DEFAULT NULL,
`doc_count` mediumint(8) unsigned NOT NULL,
PRIMARY KEY (`entity_id`)
);
CREATE TABLE `total_entity_word` (
`word` varchar(20) NOT NULL,
`word_count` int(10) unsigned NOT NULL,
PRIMARY KEY (`word`)
);
CREATE TABLE `total_entity_sum` (
`word_count_sum` bigint(20) unsigned NOT NULL,
`doc_count` int(10) unsigned NOT NULL,
`pkey` enum('singleton') NOT NULL DEFAULT 'singleton',
PRIMARY KEY (`pkey`)
);
Each article in the labeled data is split into distinct words. For each article, for each entity known to be mentioned in it, every word is added to each_entity_word (or its word_count is incremented there), and doc_count is incremented in each_entity_sum, both with respect to that entity_id.
For each article, regardless of the entities it contains, total_entity_word and total_entity_sum are similarly incremented for each word.
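For what it's worth, here's a minimal sketch of how those counters could be maintained in MySQL using INSERT ... ON DUPLICATE KEY UPDATE (the literal values and the choice of upsert statements are illustrative assumptions, not part of my pipeline):
-- record that one more document containing 'motors' mentions entity 1
INSERT INTO each_entity_word (word, entity_id, word_count)
VALUES ('motors', 1, 1)
ON DUPLICATE KEY UPDATE word_count = word_count + 1;

-- record one more document for entity 1; 250 stands in for the article's distinct-word count
INSERT INTO each_entity_sum (entity_id, word_count_sum, doc_count)
VALUES (1, 250, 1)
ON DUPLICATE KEY UPDATE
  word_count_sum = word_count_sum + VALUES(word_count_sum),
  doc_count = doc_count + 1;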
P(word | any document) should equal the word_count in total_entity_word for that word, over the doc_count in total_entity_sum.
P(word | document mentions entity x) should equal the word_count in each_entity_word for that word for entity_id x, over the doc_count in each_entity_sum for entity_id x.
P(word | document does not mention entity x) should equal (the word_count in total_entity_word minus its word_count in each_entity_word for that word for that entity) over (the doc_count in total_entity_sum minus the doc_count for that entity in each_entity_sum).
P(document mentions entity x) should equal the doc_count in each_entity_sum for that entity_id over the doc_count in total_entity_sum.
P(document does not mention entity x) should equal 1 minus (the doc_count in each_entity_sum for x's entity_id over the doc_count in total_entity_sum).
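To spell out how those pieces combine (this is just the standard two-class naive Bayes decision rule, stated here for clarity): for an article containing words $w_1, \dots, w_n$ and a candidate entity $x$,

\log \frac{P(x \mid \text{doc})}{P(\neg x \mid \text{doc})} = \log \frac{P(x)}{P(\neg x)} + \sum_{i=1}^{n} \Big( \log P(w_i \mid x) - \log P(w_i \mid \neg x) \Big)

Exponentiating this log-odds value gives the odds, and odds / (1 + odds) recovers the probability, which is why the exp(sum(log(x))) trick mentioned below is all that's needed in SQL.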
For a new article that comes in, split it into words and just select WHERE word IN ('I', 'want', 'to', 'use'...) against either each_entity_word or total_entity_word. In the DB platform I'm working with (MySQL), IN clauses are relatively well optimized.
Also, there is no product() aggregate function in SQL, so you can just do sum(log(x)), or exp(sum(log(x))), to get the equivalent of product(x).
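As a tiny illustration of that trick (the table and column here are made up purely for demonstration):
-- multiplies all p together, provided every p > 0
SELECT EXP(SUM(LOG(p))) AS product_of_p
FROM probabilities;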
So, if I get a new article in, split it up into distinct words, and put those words into a big IN() clause along with a potential entity_id to test, how can I get the naive Bayesian probability that the article falls into that entity_id's class in SQL?
EDIT:
Try #1:
set @entity_id = 1;
select @entity_doc_count := doc_count from each_entity_sum where entity_id = @entity_id;
select @total_doc_count := doc_count from total_entity_sum;

select
  exp(
    log(@entity_doc_count / @total_doc_count) +
    (
      sum(log((ifnull(ew.word_count, 0) + 1) / @entity_doc_count)) -
      sum(log(((aew.word_count + 1) - ifnull(ew.word_count, 0)) / (@total_doc_count - @entity_doc_count)))
    )
  ) as likelihood
from total_entity_word aew
left outer join each_entity_word ew on ew.word = aew.word and ew.entity_id = @entity_id
where aew.word in ('I', 'want', 'to', 'use'...);
Use an R to Postgres (or MySQL, etc.) interface
Alternatively, I'd recommend using an established stats package with a connector to the db. This will make your app a lot more flexible if you want to switch from Naive Bayes to something more sophisticated:
http://rpgsql.sourceforge.net/
bnd.pr> data(airquality)
bnd.pr> db.write.table(airquality, no.clobber = F)
bnd.pr> bind.proxy("airquality")
bnd.pr> summary(airquality)
Table name: airquality
Database: test
Host: localhost
Dimensions: 6 (columns) 153 (rows)
bnd.pr> print(airquality)
Day Month Ozone Solar.R Temp
1 1 5 41 190 67
2 2 5 36 118 72
3 3 5 12 149 74
4 4 5 18 313 62
5 5 5 NA NA 56
6 6 5 28 NA 66
7 7 5 23 299 65
8 8 5 19 99 59
9 9 5 8 19 61
10 10 5 NA 194 69
Continues for 143 more rows and 1 more cols...
bnd.pr> airquality[50:55, ]
Ozone Solar.R Wind Temp Month Day
50 12 120 11.5 73 6 19
51 13 137 10.3 76 6 20
52 NA 150 6.3 77 6 21
53 NA 59 1.7 76 6 22
54 NA 91 4.6 76 6 23
55 NA 250 6.3 76 6 24
bnd.pr> airquality[["Ozone"]]
[1] 41 36 12 18 NA 28 23 19 8 NA 7 16 11 14 18 14 34 6
[19] 30 11 1 11 4 32 NA NA NA 23 45 115 37 NA NA NA NA NA
[37] NA 29 NA 71 39 NA NA 23 NA NA 21 37 20 12 13 NA NA NA
[55] NA NA NA NA NA NA NA 135 49 32 NA 64 40 77 97 97 85 NA
[73] 10 27 NA 7 48 35 61 79 63 16 NA NA 80 108 20 52 82 50
[91] 64 59 39 9 16 78 35 66 122 89 110 NA NA 44 28 65 NA 22
[109] 59 23 31 44 21 9 NA 45 168 73 NA 76 118 84 85 96 78 73
[127] 91 47 32 20 23 21 24 44 21 28 9 13 46 18 13 24 16 13
[145] 23 36 7 14 30 NA 14 18 20
You'll then want to install the e1071 package to do Naive Bayes. At the R prompt:
[ramanujan:~/base]$R
R version 2.7.2 (2008-08-25)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
~/.Rprofile loaded.
Welcome at Sun Apr 19 00:45:30 2009
> install.packages("e1071")
> install.packages("mlbench")
> library(e1071)
> ?naiveBayes
> example(naiveBayes)
More info:
http://cran.r-project.org/web/packages/e1071/index.html
Here's a simple version for SQL Server. I run it on the free SQL Server Express edition and it is pretty fast.
http://sqldatamine.blogspot.com/2013/07/classification-using-naive-bayes.html
I don't have time to calculate all the expressions for the NB formula, but here's the main idea:
SET @entity = 123;
SELECT EXP(SUM(LOG(probability))) / (EXP(SUM(LOG(probability))) + EXP(SUM(LOG(1 - probability))))
FROM (
SELECT @entity AS _entity,
/* Above is required for efficiency, subqueries using _entity will be DEPENDENT and use the indexes */
(
SELECT SUM(word_count)
FROM total_entity_word
WHERE word = d.word
)
/
(
SELECT doc_count
FROM each_entity_sum
WHERE entity_id = _entity
) AS pwordentity,
/* I've just referenced a previously selected field */
(
SELECT 1 - pwordentity
) AS pwordnotentity,
/* Again referenced a previously selected field */
... etc AS probability
FROM total_entity_word d
) q
Note that you can easily refer to previously selected fields in the SELECT list by using them in correlated subqueries (as in the example above).
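A minimal, standalone illustration of that alias trick (made-up values, relying on the same MySQL behaviour the query above depends on):
SELECT 2 AS a,
       (SELECT a * 3) AS b;  -- b comes out as 6: the scalar subquery sees the alias a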
If using Oracle, it has data mining built in
I'm not sure what db you're running, but if you're using Oracle, data mining capabilities are baked into the db:
http://www.oracle.com/technology/products/bi/odm/index.html
...including Naive Bayes:
http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28129/algo_nb.htm
and a ton of others:
http://www.oracle.com/technology/products/bi/odm/odm_techniques_algorithms.html
That was surprising to me. Definitely one of the competitive advantages that Oracle has over the open source alternatives in this area.
Here is a blog post detailing what you are looking for: http://nuncupatively.blogspot.com/2011/07/naive-bayes-in-sql.html
I have coded up many versions of NB classifiers in SQL. The answers above advocating a change of analysis package did not scale to my data size and processing-time requirements. I had a table with a row for each word/class combination (nrows = words * classes) and a coefficient column. I had another table with a column for document_id and word. I just joined these tables together on word, grouped by document, summed the coefficients, and then adjusted the sums for the class probability. This left me with a table of document_id, class, score. I then just picked the min score (since I was doing a complement naive Bayes approach, which I found worked better in a multi-class situation).
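A rough sketch of that scoring join, with hypothetical table and column names (word_class_coefficients, document_words and class_priors are placeholders, not my actual schema):
-- score every (document, class) pair by summing per-word coefficients
SELECT dw.document_id,
       wc.class_id,
       cp.class_log_prior + SUM(wc.coefficient) AS score
FROM document_words dw
JOIN word_class_coefficients wc ON wc.word = dw.word
JOIN class_priors cp ON cp.class_id = wc.class_id
GROUP BY dw.document_id, wc.class_id, cp.class_log_prior;
-- with complement naive Bayes, the predicted class is the one with the minimum score per document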
As a side note, I found many transformations/algorithm modifications improved my holdout predictions a great deal. They are described in the work of Jason Rennie on "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" and summarized here: http://www.ist.temple.edu/~vucetic/cis526fall2007/liang.ppt
Related
I have a dataframe of User IDs and Tags as shown below under 'Current Data'.
The Goal:
I want to be able to duplicate records, one per value in the tags column. As you can see in the target output, user ID 21 is repeated 3x, once for each of the three tags in the source TAGS value - everything is duplicated except the tag column - one record per item in the comma-separated list.
Issue:
I looked at using the SPLIT_TO_TABLE function in Snowflake, but I couldn't get it to work for my use case, since the tags aren't in any consistent order and in some cases the cell is blank.
Current Data:
USER_ID CITY STATUS PPL TAGS
21 LA checked 6 bad ui/ux,dashboards/reporting,pricing
32 SD checked 9 buggy,laggy
21 ATL checked 9
234 MIA checked 5 glitchy, bad ui/ux, horrible
The target:
USER_ID CITY STATUS PPL TAGS
21 LA checked 6 bad ui/ux
21 LA checked 6 dashboards/reporting
21 LA checked 6 Pricing
32 SD checked 9 buggy
32 SD checked 9 laggy
21 ATL checked 9
234 MIA checked 5 glitchy
234 MIA checked 5 bad ui/ux
234 MIA checked 5 horrible
Sql:
select table1.value
from table(split_to_table('a.b', '.')) as table1
SPLIT_TO_TABLE works. Below is the query using your sample data:
select USER_ID, CITY, STATUS, PPL, VALUE
from (values
(21,'LA','checked',6,'bad ui/ux,dashboards/reporting,pricing')
,(32,'SD','checked',9,'buggy,laggy')
,(21,'ATL','checked',9,'')
,(234,'MIA','checked',5,'glitchy, bad ui/ux, horrible')
) as tbl (USER_ID,CITY,STATUS,PPL,TAGS)
, lateral split_to_table(tbl.tags,',');
Result:
USER_ID CITY STATUS PPL VALUE
21 LA checked 6 bad ui/ux
21 LA checked 6 dashboards/reporting
21 LA checked 6 pricing
32 SD checked 9 buggy
32 SD checked 9 laggy
21 ATL checked 9
234 MIA checked 5 glitchy
234 MIA checked 5 bad ui/ux
234 MIA checked 5 horrible
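One small addition you may want, since some of the sample tags have a space after the comma (e.g. ' bad ui/ux'): wrap the split value in TRIM. Using the same sample data:
select USER_ID, CITY, STATUS, PPL, TRIM(VALUE) as TAG
from (values
    (21,'LA','checked',6,'bad ui/ux,dashboards/reporting,pricing')
   ,(32,'SD','checked',9,'buggy,laggy')
   ,(21,'ATL','checked',9,'')
   ,(234,'MIA','checked',5,'glitchy, bad ui/ux, horrible')
) as tbl (USER_ID,CITY,STATUS,PPL,TAGS)
, lateral split_to_table(tbl.tags,',');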
I have a query that returns values in this form (the query returns more than 50 columns):
1-99transval 100-200transval 200-300transval ... 1-99nontransval 100...
50 90 80 67 58
For a single row of values, I want these details transposed so that they take the following shape:
Range Transval NonTransval
1-99 50 67
100-200 90 58
In pure SQL this will need a lot of manual coding, because you have to hard-code each range: there is no relationship between the values and the ranges at all. Had there been a relationship, you could have used a CASE expression and built the ranges dynamically.
SQL> WITH DATA AS
2 (SELECT 50 "1-99transval",
3 90 "100-200transval",
4 80 "200-300transval",
5 67 "1-99nontransval",
6 58 "100-200nontransval",
7 88 "200-300nontransval"
8 FROM dual
9 )
10 SELECT '1-99' range,
11 "1-99transval" transval,
12 "1-99nontransval" nontransval
13 FROM DATA
14 UNION
15 SELECT '100-200' range,
16 "100-200transval",
17 "100-200nontransval" nontransval
18 FROM DATA
19 UNION
20 SELECT '200-300' range,
21 "200-300transval",
22 "200-300nontransval" nontransval
23 FROM DATA;
RANGE TRANSVAL NONTRANSVAL
------- ---------- -----------
1-99 50 67
100-200 90 58
200-300 80 88
From Oracle Database 11g Release 1 onward, you can use UNPIVOT:
SQL> WITH DATA AS
2 (SELECT 50 "1-99transval",
3 90 "100-200transval",
4 80 "200-300transval",
5 67 "1-99nontransval",
6 58 "100-200nontransval",
7 88 "200-300nontransval"
8 FROM dual
9 )
10 SELECT *
11 FROM DATA
12 UNPIVOT( (transval,nontransval)
13 FOR RANGE IN ( ("1-99transval","1-99nontransval") AS '1-99'
14 ,("100-200transval","100-200nontransval") AS '100-200'
15 ,("200-300transval","200-300nontransval") AS '200-300'));
RANGE TRANSVAL NONTRANSVAL
------- ---------- -----------
1-99 50 67
100-200 90 58
200-300 80 88
In your case, replace the WITH clause above with your existing query as a subquery, and include your other columns in the UNION.
In PL/SQL, you could (ab)use EXECUTE IMMEDIATE and derive the "range" labels by extracting the column names with dynamic SQL.
That said, it would be much better to modify or rewrite your existing query, which you have not shown yet.
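For example, the UNPIVOT form would be wired up roughly like this (your_existing_query is a placeholder for the query you haven't shown):
WITH DATA AS
 ( -- your existing 50-plus-column query goes here
   SELECT * FROM your_existing_query
 )
SELECT *
FROM DATA
UNPIVOT( (transval, nontransval)
   FOR range IN ( ("1-99transval","1-99nontransval") AS '1-99'
                 ,("100-200transval","100-200nontransval") AS '100-200'
                 ,("200-300transval","200-300nontransval") AS '200-300'));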
If you are using Oracle 11g, then you can use the UNPIVOT feature.
CREATE TABLE DATA AS
SELECT 50 "1-99transval",
90 "100-200transval",
80 "200-300transval",
67 "1-99nontransval",
58 "100-200nontransval",
88 "200-300nontransval"
FROM dual;
SELECT *
FROM DATA
UNPIVOT( (Transval,NonTransval) FOR Range IN ( ("1-99transval","1-99nontransval") as '1-99'
,("100-200transval","100-200nontransval") as '100-200'
,("200-300transval","200-300nontransval") as '200-300'));
http://sqlfiddle.com/#!4/c9747/3/0
I have a query in Access 2010 (I have also tried it in 2013, with the same result) that works, but not correctly for all records. I'm wondering if anyone knows what is causing the error.
Here is the query (adapted from http://allenbrowne.com/subquery-01.html#AnotherRecord):
SELECT t_test_table.individ, t_test_table.test_date, t_test_table.score1, (SELECT top 1 Dupe.score1
FROM t_test_table AS Dupe
WHERE Dupe.individ = t_test_table.individ
AND Dupe.test_date < t_test_table.test_date
ORDER BY Dupe.primary DESC, Dupe.individ
) AS PriorValue, [score1]-[priorvalue] AS scorechange
FROM t_test_table;
The way the data is set up, an individual has multiple records in the file (designated by individ), representing different dates a test was taken. A date and individ combination is unique - you can only take a test once on a given date. [primary] refers to the primary key column; I added it because individ cannot be the primary key, since multiple records per individual are possible (I'm not including it here due to space).
The goal of the above code was to create the following:
individ test_date score1 PriorValue scorechange
1 3/1/2013 40
1 6/4/2013 51 40 11
1 7/25/2013 55 51 4
1 12/13/2013 59 55 4
5 8/29/2009 39
5 12/9/2009 47 39 8
5 6/1/2010 58 47 11
5 8/28/2010 42 58 -16
5 12/15/2010 51 42 9
Here is what I actually got. You can see that for individ 1, it winds up taking the first score rather than the previous score for each subsequent record. For individ 5, it kind of works, but the final priorvalue should be 42 and not 58.
individ test_date score1 PriorValue scorechange
1 3/1/2012 40
1 6/4/2012 51 40 11
1 7/25/2012 55 40 15
1 12/13/2012 59 40 19
5 8/29/2005 39
5 12/9/2005 47 39 8
5 6/1/2006 58 47 11
5 8/28/2006 42 58 -16
5 12/15/2006 51 58 -7
Does anyone have any ideas about what went wrong here? In other records it works perfectly, but I can't determine what is causing some records to fail to take the previous value. Any help is appreciated, and let me know if you require additional information.
To get the most recent test for a given individ, you'll need to include a sort by date. In your inner query, replace
ORDER BY Dupe.primary DESC, Dupe.individ
with
ORDER BY Dupe.test_date DESC
It's hard to say exactly what effect sorting by primary has, since you haven't told us how you're generating the values of primary. If the combination of individ and test_date is guaranteed to be unique, you might want to consider making the two of them into your primary key instead of creating a separate surrogate key. The Dupe.individ in the ORDER BY line has no effect, since your WHERE clause already limits the results of the inner query to one individ.
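For reference, the query from the question with that one change applied (only the ORDER BY line differs):
SELECT t_test_table.individ, t_test_table.test_date, t_test_table.score1,
       (SELECT TOP 1 Dupe.score1
        FROM t_test_table AS Dupe
        WHERE Dupe.individ = t_test_table.individ
          AND Dupe.test_date < t_test_table.test_date
        ORDER BY Dupe.test_date DESC
       ) AS PriorValue,
       [score1]-[priorvalue] AS scorechange
FROM t_test_table;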
I have a table with columns product, pick_qty, shortfall, location, loc_qty:
Product Picked Qty Shortfall Location Location Qty
1742 4 58 1 15
1742 4 58 2 20
1742 4 58 3 15
1742 4 58 4 20
1742 4 58 5 20
1742 4 58 6 20
1742 4 58 7 15
1742 4 58 8 15
1742 4 58 9 15
1742 4 58 10 20
I want a report to loop around and show the number of locations and the quantity I need to drop to fulfil the shortfall for replenishment. So the report would look like this.
Product Picked Qty Shortfall Location Location Qty
1742 4 58 1 15
1742 4 58 2 20
1742 4 58 3 15
1742 4 58 4 20
Note that it is best not to think of SQL as "looping through a table", and instead to think of it as operating on some subset of the rows in a table.
What it sounds like you need to do is create a running total that tells how many of the item you would have if you were to take all of them from a location and from all of the locations that came before the current one, and then check whether that would give you enough of the item to fulfill the shortfall.
Based on your example data, the following query would work, though if locations aren't actually numeric then you would need to add a row-number column and tweak the query a bit to use the row number instead of the location number; it would still be very similar to the query below.
SELECT
Totals.Product, Totals.PickedQty, Totals.ShortFall, Totals.Location, Totals.LocationQty
FROM (
SELECT
TheTable.Product, TheTable.PickedQty, TheTable.ShortFall,
TheTable.Location, TheTable.LocationQty, SUM(ForRunningTotal.LocationQty) AS RunningTotal
FROM TheTable
JOIN TheTable ForRunningTotal ON TheTable.Product = ForRunningTotal.Product
AND TheTable.Location >= ForRunningTotal.Location
GROUP BY TheTable.Product, TheTable.PickedQty, TheTable.ShortFall, TheTable.Location, TheTable.LocationQty
) Totals
-- Note you could also change the join above so the running total is actually the count of only the rows above,
-- not including the current row; Then the WHERE clause below could be "Totals.RunningTotal < Totals.ShortFall".
-- I liked RunningTotal as the sum of this row and all prior, it seems more appropriate to me.
WHERE Totals.RunningTotal - Totals.LocationQty <= Totals.ShortFall
AND Totals.LocationQty > 0
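If your database supports window functions (SQL Server 2012 and later, PostgreSQL, Oracle, etc. - an assumption, since you haven't named your platform), the same running total can be computed without the self-join. A sketch using the same hypothetical table and column names as above:
SELECT Product, PickedQty, ShortFall, Location, LocationQty
FROM (
    SELECT Product, PickedQty, ShortFall, Location, LocationQty,
           -- quantity at this location plus all earlier locations
           SUM(LocationQty) OVER (PARTITION BY Product ORDER BY Location) AS RunningTotal
    FROM TheTable
) Totals
WHERE RunningTotal - LocationQty <= ShortFall
  AND LocationQty > 0;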
Also - as long as you are reading my answer, an unrelated side note: based on the data you showed above, your database schema isn't normalized as far as it could be. It seems like the picked quantity and the shortfall actually depend only on the product, so they belong in a table of their own, and the location quantity depends on the product and the location, so that would be a table of its own. I'm pointing this out because if your data contained different picked quantities/shortfalls for a single product, the above query would break; that situation would be impossible with the normalized tables I mentioned.
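A sketch of that normalization, with hypothetical table names:
-- picked quantity and shortfall depend only on the product
CREATE TABLE ProductShortfall (
    Product     int NOT NULL PRIMARY KEY,
    PickedQty   int NOT NULL,
    ShortFall   int NOT NULL
);

-- location quantity depends on the product and the location
CREATE TABLE ProductLocationQty (
    Product     int NOT NULL,
    Location    int NOT NULL,
    LocationQty int NOT NULL,
    PRIMARY KEY (Product, Location)
);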
I have a daily sales report query and it has 2 columns, like:
days sales
1 12
2 65
3 25
...
30 24
but when I want to print it there is a lot of free space on the paper, so I want to split the query output into roughly equal parts (about 33% each),
so the result will be 3 x 2 columns on one page, which will be more comfortable to read:
days sales days sales days sales
1 12 11 21 21 5
2 65 12 53 22 18
3 25 13 0
...
10 45 20 12 30 55
Is there any way to do this with the DevExpress grid?
This is the view I get,
and I don't want this much empty space on the page for a couple of records.
You will not find an easy way to achieve the desired result using XtraGrid, because it is not intended for reporting. I suggest that you create reports using another product: XtraReports.