Here is a portion of a data set I have, named "antibody":
Row  Subject  Type      Procedure  Measurement  Output
1    500      Initial   Invasive   20           20
2    500      Initial   Surface    35           35
3    500      Followup  Invasive   54           54-20
4    428      Followup  Outer      29           29-10
5    765      Seventh   Other      13           13-19
6    500      Followup  Surface    98           98-35
7    428      Initial   Outer      10           10
8    765      Initial   Other      19           19
9    610      Third     Invasive   66           66-17
10   610      Initial   Invasive   17           17
I was trying to use PROC SQL to perform this. The goal is to subtract values in the "Measurement" column based on the "Subject", "Type", and "Procedure" columns: whenever two rows share the same Subject and Procedure, the Initial measurement should be subtracted from the other measurement. For example, the Initial measurement in row 1 (20) should be subtracted from the Followup measurement in row 3 (54) because the subject (500) and procedure (Invasive) match. Likewise, the Initial measurement in row 8 (19) should be subtracted from the Seventh measurement in row 5 (13) because the subject (765) and procedure (Other) match. The result should form the "Output" column.
Thank you in advance!
Here is a hash object approach
data have;
input Subject Type :$8. Procedure :$8. Measurement;
datalines;
500 Initial Invasive 20
500 Initial Surface 35
500 Followup Invasive 54
428 Followup Outer 29
765 Seventh Other 13
500 Followup Surface 98
428 Initial Outer 10
765 Initial Other 19
610 Third Invasive 66
610 Initial Invasive 17
;
data want (drop=rc _Measurement);
if _N_ = 1 then do;
/* Load the Initial measurements into a hash table, keyed by Subject and Procedure */
declare hash h (dataset : "have (rename=(Measurement=_Measurement) where=(Type='Initial'))");
h.definekey ('Subject', 'Procedure');
h.definedata ('_Measurement');
h.definedone();
end;
set have;
_Measurement=.;
/* Look up the matching Initial measurement for every non-Initial row */
if Type ne 'Initial' then rc = h.find();
/* sum() ignores the missing _Measurement on Initial rows, leaving them unchanged */
Measurement = sum (Measurement, -_Measurement);
run;
EDIT: here is a variant that keeps Measurement intact and handles a missing Measurement:
data want (drop=rc _Measurement);
if _N_ = 1 then do;
declare hash h (dataset : "have (rename=(Measurement=_Measurement) where=(Type='Initial'))");
h.definekey ('Subject', 'Procedure');
h.definedata ('_Measurement');
h.definedone();
end;
set have;
_Measurement=.;
if Type ne 'Initial' then rc = h.find();
/* ifn() propagates a missing Measurement instead of treating it as zero */
NewMeasurement = ifn(Measurement=., ., sum (Measurement, -_Measurement));
run;
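Since the question asked about PROC SQL, here is a sketch of an equivalent self-join (my own addition, not part of the hash answer): left-join each row to the Initial row with the same Subject and Procedure, then subtract.
proc sql;
create table want as
select a.Subject, a.Type, a.Procedure, a.Measurement,
       /* coalesce() leaves Initial rows, which find no lookup match, unchanged */
       a.Measurement - coalesce(b.Measurement, 0) as Output
from have a
left join have (where=(Type='Initial')) b
on a.Subject = b.Subject and a.Procedure = b.Procedure and a.Type ne 'Initial';
quit;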
Related
I have two datasets, one a base dataset and the other a subset of it. I want to create a dataset containing the records that are present in the base dataset but not in the subset. So if a combination of acct_num, test_id, tran_date, actual_amt is not present in the subset, it should appear in the resulting dataset.
DATA base;
INPUT acct_num test_id tran_date:anydtdte. actual_amt final_amt final_amt_added ;
format tran_date date9.;
DATALINES;
55203610 2542 12-jan-20 30 45 45
16124130 8062 . 56 78 78
16124130 8062 14-dec-19 8 78 78
80479512 2062 19-mar-19 32 32 32
70321918 2062 20-dec-19 1 93 54
17312410 6712 . 45 90 90
17312410 6712 15-jun-18 0 90 90
74623123 2092 17-aug-18 34 87 87
24245321 2082 22-jan-17 22 56 67
;
run;
data subset;
input acct_num test_id tran_date:anydtdte. actual_amt final_amt final_amt_added ;
format tran_date date9.;
DATALINES;
55203610 2542 12-jan-20 30 45 45
16124130 8062 . 56 78 78
16124130 8062 14-dec-19 8 78 78
17312410 6712 . 45 90 90
74623123 2092 17-aug-18 34 87 87
24245321 2082 22-jan-17 22 56 67
;
run;
The data I want:
80479512 2062 19-mar-19 32 32 32
70321918 2062 20-dec-19 1 93 54
17312410 6712 15-jun-18 0 90 90
I have tried using NOT IN in SQL, but it does not match on multiple variables in that statement.
Any help will be appreciated.
This is a set-difference (minus) problem; see the EXCEPT operator:
proc sql noprint;
create table want as
select * from base
except
select * from subset
;
quit;
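Note that EXCEPT compares entire rows. If the difference should be keyed on just the four columns named in the question (my assumption), a correlated NOT EXISTS sketch works too:
proc sql;
create table want as
select * from base b
where not exists (
/* SAS missing values compare equal, so a missing tran_date matches a missing tran_date */
select 1 from subset s
where s.acct_num = b.acct_num and s.test_id = b.test_id
and s.tran_date = b.tran_date and s.actual_amt = b.actual_amt
);
quit;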
Make a list of all the key combinations observed in subset, then simply merge the base file with those combinations and output the records that are in base only.
Note that it is important to restrict subset_combinations to non-duplicate keys and to keep only the sorting variables; otherwise you may overwrite base values with the values from subset.
proc sort data=base;
by acct_num test_id tran_date actual_amt;
run;
proc sort data=subset out=subset_combinations (keep=acct_num test_id tran_date actual_amt) nodupkey;
by acct_num test_id tran_date actual_amt;
run;
data want;
merge base (in=in1) subset_combinations (in=in2);
by acct_num test_id tran_date actual_amt;
if in1 & ^in2;
run;
I have a VARCHAR2(3) column in a table. This column has some NULLs, and when I run the following query it works for some records, but when I scroll to a certain record I get ORA-01722: invalid number.
Query used:
Select TRUNC(NVL(COLUMN, '2'))
from TABLE;
Also, I ran a DISTINCT on the column to see what values it contains.
Select distinct COLUMN
from TABLE;
I got following results:
1
2 62
3 90
4 70
5 82
6 71
7 05
8 21
9 81
10 66
11 12
12 95
13 02
14 91
15 92
16 94
17 01
18 65
19 30
20 20
21
22 50
23 63
24 51
25 64
26 09
Why am I getting this error, and how can I do this without getting the error?
https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions200.htm
Because you are trying to TRUNC a string, but the TRUNC function is for numbers or dates. From the documentation:
This function takes as an argument any numeric datatype or any nonnumeric datatype that can be implicitly converted to a numeric datatype
So one of your string values cannot be converted to a number.
Edit: your 21st value is not a number and is not NULL; it is blank padding, which NVL does not replace. You should trim your column first:
SELECT trunc(nvl(trim(column),'2'))
FROM table;
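If the column can contain other non-numeric junk besides blanks, a more defensive sketch (my own suggestion; your_column and your_table are placeholders):
SELECT TRUNC(CASE
               WHEN REGEXP_LIKE(TRIM(your_column), '^[0-9]+$')
               THEN TO_NUMBER(TRIM(your_column))  -- convert only values that are all digits
               ELSE 2
             END) AS result
FROM your_table;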
I have a table:
col1 col2
2 20
2.5 25
2.67 30
2.99 40
I'm looking to get
varone = 2 x col2, vartwo = 2.5 x col2, varthree = 2.67 x col2, varfour = 2.99 x col2
i.e. extracting a specific value from a table
and then multiplying an entire column by that value (scalar x vector).
I tried transposing col1
col1a col1b col1c col1d col2
2 2.5 2.67 2.99 20
25
30
40
and then tried multiplying col1a x col2, but it didn't seem to work.
In SAS, you can just use PROC SQL:
proc sql;
select 2*col2 as varone, 2.5*col2 as vartwo, 2.67*col2 as varthree, 2.99*col2 as varfour
from atable;
quit;
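If you would rather not hard-code the multipliers, a hedged alternative (my own sketch, producing a long format rather than four named variables) is a Cartesian self-join, so that every col1 value multiplies every col2 value:
proc sql;
create table products as
select a.col1 as multiplier, b.col2, a.col1 * b.col2 as product
from atable a, atable b /* Cartesian join: every col1 against every col2 */
order by multiplier, b.col2;
quit;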
Assuming you're using SAS and either PROC FACTOR or PROC PRINCOMP, you can use PROC SCORE.
Example straight from the documentation:
http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_score_sect017.htm
/* This data set contains only the first 12 observations */
/* from the full data set used in the chapter on PROC REG. */
data Fitness;
input Age Weight Oxygen RunTime RestPulse RunPulse @@;
datalines;
44 89.47 44.609 11.37 62 178 40 75.07 45.313 10.07 62 185
44 85.84 54.297 8.65 45 156 42 68.15 59.571 8.17 40 166
38 89.02 49.874 9.22 55 178 47 77.45 44.811 11.63 58 176
40 75.98 45.681 11.95 70 176 43 81.19 49.091 10.85 64 162
44 81.42 39.442 13.08 63 174 38 81.87 60.055 8.63 48 170
44 73.03 50.541 10.13 45 168 45 87.66 37.388 14.03 56 186
;
proc factor data=Fitness outstat=FactOut
method=prin rotate=varimax score;
var Age Weight RunTime RunPulse RestPulse;
title 'Factor Scoring Example';
run;
proc print data=FactOut;
title2 'Data Set from PROC FACTOR';
run;
proc score data=Fitness score=FactOut out=FScore;
var Age Weight RunTime RunPulse RestPulse;
run;
proc print data=FScore;
title2 'Data Set from PROC SCORE';
run;
You can make use of an array to achieve this.
The program below is dynamic: it will work for any number of observations.
****data we have****;
data have;
input col1 col2;
datalines;
2 20
2.5 25
2.67 30
2.99 40
;
run;
****Taking the record count****;
****Creating macro variables value1-valueN to store col1 data****;
proc sql noprint;
select count(*) into :cnt_rec trimmed from have;
select col1 into :value1 - :value&SysMaxLong from have;
quit;
data want(drop=i);
set have;
array NewColumn(&cnt_rec);
****Processing the array: multiply col2 by each stored col1 value****;
do i = 1 to &cnt_rec;
NewColumn[i] = input(symget(cats('value', i)), best32.) * col2;
end;
run;
I have a table with columns product, pick_qty, shortfall, location, loc_qty:
Product  Picked Qty  Shortfall  Location  Location Qty
1742     4           58         1         15
1742     4           58         2         20
1742     4           58         3         15
1742     4           58         4         20
1742     4           58         5         20
1742     4           58         6         20
1742     4           58         7         15
1742     4           58         8         15
1742     4           58         9         15
1742     4           58         10        20
I want a report to loop around and show the number of locations and the quantity I need to drop to fulfil the shortfall for replenishment. So the report would look like this:
Product  Picked Qty  Shortfall  Location  Location Qty
1742     4           58         1         15
1742     4           58         2         20
1742     4           58         3         15
1742     4           58         4         20
Note that it is best not to think about SQL "looping through a table" and instead to think about it as operating on some subset of the rows in a table.
What it sounds like you need to do is create a running total that tells how many of the item you would have if you were to take all of them from a location and all of the locations that came before the current location and then check to see if that would give you enough of the item to fulfill the shortfall.
Based on your example data, the following query would work, though if Locations aren't actually numeric you would need to add a row number column and tweak the query a bit to use the row number instead of the Location number; it would still be very similar to the query below.
SELECT
Totals.Product, Totals.PickedQty, Totals.ShortFall, Totals.Location, Totals.LocationQty
FROM (
SELECT
TheTable.Product, TheTable.PickedQty, TheTable.ShortFall,
TheTable.Location, TheTable.LocationQty, SUM(ForRunningTotal.LocationQty) AS RunningTotal
FROM TheTable
JOIN TheTable ForRunningTotal ON TheTable.Product = ForRunningTotal.Product
AND TheTable.Location >= ForRunningTotal.Location
GROUP BY TheTable.Product, TheTable.PickedQty, TheTable.ShortFall, TheTable.Location, TheTable.LocationQty
) Totals
-- Note you could also change the join above so the running total is actually the count of only the rows above,
-- not including the current row; Then the WHERE clause below could be "Totals.RunningTotal < Totals.ShortFall".
-- I liked RunningTotal as the sum of this row and all prior, it seems more appropriate to me.
WHERE Totals.RunningTotal - Totals.LocationQty <= Totals.ShortFall
AND Totals.LocationQty > 0
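On databases that support window functions, the same running total can be computed without the self-join (a sketch against the same table):
SELECT Product, PickedQty, ShortFall, Location, LocationQty
FROM (
    SELECT Product, PickedQty, ShortFall, Location, LocationQty,
           -- Running total of quantity up to and including the current location
           SUM(LocationQty) OVER (PARTITION BY Product ORDER BY Location) AS RunningTotal
    FROM TheTable
) Totals
WHERE Totals.RunningTotal - Totals.LocationQty <= Totals.ShortFall
  AND Totals.LocationQty > 0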
Also, as long as you are reading my answer, an unrelated side note: based on the data you showed, your database schema isn't normalized as far as it could be. The Picked Quantity and the ShortFall appear to depend only on the Product, so they belong in a table of their own, and the Location Quantity depends on the Product and Location, so that would be another table. I point this out because if your data contained different Picked Quantities/ShortFalls for a single product, the above query would break; that situation would be impossible with the normalized tables I mentioned.
I want to use naive Bayes to classify documents into a relatively large number of classes. I'm looking to confirm whether a mention of an entity name in an article really is that entity, on the basis of whether that article is similar to articles where that entity has been correctly verified.
Say we find the text "General Motors" in an article. We have a set of data that contains articles and the correct entities mentioned in them. So, if "General Motors" is found in a new article, should it fall into the class of articles in the prior data that contained a known genuine mention of "General Motors", or the class of articles which did not mention that entity?
(I'm not creating a class for every entity and trying to classify every new article into every possible class. I already have a heuristic method for finding plausible mentions of entity names, and I just want to verify the plausibility of the limited number of entity name mentions per article that the method already detects.)
Given that the number of potential classes and articles is quite large and naive Bayes is relatively simple, I wanted to do the whole thing in SQL, but I'm having trouble with the scoring query...
Here's what I have so far:
CREATE TABLE `each_entity_word` (
`word` varchar(20) NOT NULL,
`entity_id` int(10) unsigned NOT NULL,
`word_count` mediumint(8) unsigned NOT NULL,
PRIMARY KEY (`word`, `entity_id`)
);
CREATE TABLE `each_entity_sum` (
`entity_id` int(10) unsigned NOT NULL DEFAULT '0',
`word_count_sum` int(10) unsigned DEFAULT NULL,
`doc_count` mediumint(8) unsigned NOT NULL,
PRIMARY KEY (`entity_id`)
);
CREATE TABLE `total_entity_word` (
`word` varchar(20) NOT NULL,
`word_count` int(10) unsigned NOT NULL,
PRIMARY KEY (`word`)
);
CREATE TABLE `total_entity_sum` (
`word_count_sum` bigint(20) unsigned NOT NULL,
`doc_count` int(10) unsigned NOT NULL,
`pkey` enum('singleton') NOT NULL DEFAULT 'singleton',
PRIMARY KEY (`pkey`)
);
Each article in the marked data is split into distinct words, and for each article, for each entity known to be mentioned in that article, every word is added to each_entity_word (or its word_count is incremented) and doc_count is incremented in each_entity_sum, both with respect to that entity_id.
For each article, regardless of the entities mentioned, total_entity_word and total_entity_sum are incremented similarly for every word.
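A minimal sketch of that per-word upsert in MySQL (the literal word and entity_id values here are hypothetical):
-- Insert a new (word, entity) row, or bump its count if the primary key already exists
INSERT INTO each_entity_word (word, entity_id, word_count)
VALUES ('motors', 42, 1)
ON DUPLICATE KEY UPDATE word_count = word_count + 1;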
P(word | any document) = word_count in total_entity_word for that word / doc_count in total_entity_sum
P(word | document mentions entity x) = word_count in each_entity_word for that word and entity_id x / doc_count in each_entity_sum for entity_id x
P(word | document does not mention entity x) = (word_count in total_entity_word minus word_count in each_entity_word, for that word and entity) / (doc_count in total_entity_sum minus doc_count in each_entity_sum for that entity)
P(document mentions entity x) = doc_count in each_entity_sum for that entity_id / doc_count in total_entity_sum
P(document does not mention entity x) = 1 minus the above
For a new article that comes in, split it into words and just select where word in ('I', 'want', 'to', 'use', ...) against either each_entity_word or total_entity_word. In the DB platform I'm working with (MySQL), IN clauses are relatively well optimized.
Also, there is no product() aggregate function in SQL, so you can just do sum(log(x)), or exp(sum(log(x))), to get the equivalent of product(x).
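For instance, a toy illustration (my own; t and x are hypothetical names, and this assumes all x are positive, since log() is undefined otherwise):
-- exp(ln(a) + ln(b) + ...) = a * b * ...
SELECT EXP(SUM(LOG(x))) AS product_of_x FROM t;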
So, if I get a new article in, split it up into distinct words, and put those words into a big IN() clause along with a potential entity_id to test, how can I get the naive Bayes probability that the article falls into that entity_id's class in SQL?
EDIT:
Try #1:
set @entity_id = 1;
select @entity_doc_count := doc_count from each_entity_sum where entity_id = @entity_id;
select @total_doc_count := doc_count from total_entity_sum;
select
exp(
log(@entity_doc_count / @total_doc_count) +
(
sum(log((ifnull(ew.word_count, 0) + 1) / @entity_doc_count)) /
sum(log(((aew.word_count + 1) - ifnull(ew.word_count, 0)) / (@total_doc_count - @entity_doc_count)))
)
) as likelihood
from total_entity_word aew
left outer join each_entity_word ew on ew.word = aew.word and ew.entity_id = @entity_id
where aew.word in ('I', 'want', 'to', 'use', ...);
Use an R to Postgres (or MySQL, etc.) interface
Alternatively, I'd recommend using an established stats package with a connector to the db. This will make your app a lot more flexible if you want to switch from Naive Bayes to something more sophisticated:
http://rpgsql.sourceforge.net/
bnd.pr> data(airquality)
bnd.pr> db.write.table(airquality, no.clobber = F)
bnd.pr> bind.proxy("airquality")
bnd.pr> summary(airquality)
Table name: airquality
Database: test
Host: localhost
Dimensions: 6 (columns) 153 (rows)
bnd.pr> print(airquality)
Day Month Ozone Solar.R Temp
1 1 5 41 190 67
2 2 5 36 118 72
3 3 5 12 149 74
4 4 5 18 313 62
5 5 5 NA NA 56
6 6 5 28 NA 66
7 7 5 23 299 65
8 8 5 19 99 59
9 9 5 8 19 61
10 10 5 NA 194 69
Continues for 143 more rows and 1 more cols...
bnd.pr> airquality[50:55, ]
Ozone Solar.R Wind Temp Month Day
50 12 120 11.5 73 6 19
51 13 137 10.3 76 6 20
52 NA 150 6.3 77 6 21
53 NA 59 1.7 76 6 22
54 NA 91 4.6 76 6 23
55 NA 250 6.3 76 6 24
bnd.pr> airquality[["Ozone"]]
[1] 41 36 12 18 NA 28 23 19 8 NA 7 16 11 14 18 14 34 6
[19] 30 11 1 11 4 32 NA NA NA 23 45 115 37 NA NA NA NA NA
[37] NA 29 NA 71 39 NA NA 23 NA NA 21 37 20 12 13 NA NA NA
[55] NA NA NA NA NA NA NA 135 49 32 NA 64 40 77 97 97 85 NA
[73] 10 27 NA 7 48 35 61 79 63 16 NA NA 80 108 20 52 82 50
[91] 64 59 39 9 16 78 35 66 122 89 110 NA NA 44 28 65 NA 22
[109] 59 23 31 44 21 9 NA 45 168 73 NA 76 118 84 85 96 78 73
[127] 91 47 32 20 23 21 24 44 21 28 9 13 46 18 13 24 16 13
[145] 23 36 7 14 30 NA 14 18 20
You'll then want to install the e1071 package to do Naive Bayes. At the R prompt:
[ramanujan:~/base]$R
R version 2.7.2 (2008-08-25)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
~/.Rprofile loaded.
Welcome at Sun Apr 19 00:45:30 2009
> install.packages("e1071")
> install.packages("mlbench")
> library(e1071)
> ?naiveBayes
> example(naiveBayes)
More info:
http://cran.r-project.org/web/packages/e1071/index.html
Here's a simple version for SQL Server. I run it on a free SQL Express implementation and it is pretty fast.
http://sqldatamine.blogspot.com/2013/07/classification-using-naive-bayes.html
I don't have time to calculate all the expressions for the NB formula, but here's the main idea:
SET @entity = 123;
SELECT EXP(SUM(LOG(probability))) / (EXP(SUM(LOG(probability))) + EXP(SUM(LOG(1 - probability))))
FROM (
SELECT @entity AS _entity,
/* Above is required for efficiency, subqueries using _entity will be DEPENDENT and use the indexes */
(
SELECT SUM(word_count)
FROM total_entity_word
WHERE word = d.word
)
/
(
SELECT doc_count
FROM each_entity_sum
WHERE entity_id = _entity
) AS pwordentity,
/* I've just referenced a previously selected field */
(
SELECT 1 - pwordentity
) AS pwordnotentity,
/* Again referenced a previously selected field */
... etc AS probability
FROM total_entity_word
) q
Note that you can easily refer to the previous field in SELECT by using them in correlated subqueries (as in example).
If using Oracle, it has data mining built in
I'm not sure what db you're running, but if you're using Oracle, data mining capabilities are baked into the db:
http://www.oracle.com/technology/products/bi/odm/index.html
...including Naive Bayes:
http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28129/algo_nb.htm
and a ton of others:
http://www.oracle.com/technology/products/bi/odm/odm_techniques_algorithms.html
That was surprising to me. Definitely one of the competitive advantages that Oracle has over the open source alternatives in this area.
Here is a blog post detailing what you are looking for: http://nuncupatively.blogspot.com/2011/07/naive-bayes-in-sql.html
I have coded up many versions of NB classifiers in SQL. The answers above advocating changing analysis packages were not scalable to my large data and processing time requirements. I had a table with a row for each word/class combination (nrows = words * classes) and a coefficient column. I had another table with a column for document_id and word. I just joined these tables together on word, grouped by document, and summed the coefficients and then adjusted the sums for the class probability. This left me with a table of document_id, class, score. I then just picked the min score (since I was doing a complement naive bayes approach, which I found worked better in a multi-class situation).
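A minimal sketch of the scoring join described above (table and column names are my own, hypothetical stand-ins for the two tables mentioned):
-- coef: one row per word/class combination, holding that word's coefficient for the class
-- doc_words: one row per (document_id, word) occurrence
SELECT dw.document_id,
       c.class,
       SUM(c.coefficient) AS score  -- adjust for the class prior afterwards
FROM doc_words dw
JOIN coef c ON c.word = dw.word
GROUP BY dw.document_id, c.class;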
As a side note, I found that many transformations/algorithm modifications improved my holdout predictions a great deal. They are described in Jason Rennie's work on "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" and summarized here: http://www.ist.temple.edu/~vucetic/cis526fall2007/liang.ppt