Statistical calculations in an Access 2010 query - sql

Currently we're building a database to track different factories' pollutant emissions, and we now need a query that gives us information about relative quantities. Somehow I feel this should be straightforward, but I have had no success implementing it in SQL.
I'm starting from a working query that returns the following fields:
PRODUCTION_YEAR, COMPANY, PRODUCT_CATEGORY, POLLUTANT, TOTAL_EMISSIONS, SHARE
TOTAL_EMISSIONS contains the total emissions for each company in a particular year and product category. SHARE is a computed field and contains the contribution (as a fraction) of each company to that year's overall emissions of that particular pollutant in that particular product category.
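(For context, here is one way a SHARE like that could be computed in Access SQL; this is just a sketch, and the underlying table name baseTable is a made-up placeholder, not part of the original question:
SELECT t.PRODUCTION_YEAR, t.COMPANY, t.PRODUCT_CATEGORY, t.POLLUTANT,
       t.TOTAL_EMISSIONS,
       t.TOTAL_EMISSIONS /
       (SELECT Sum(t2.TOTAL_EMISSIONS)
        FROM baseTable AS t2
        WHERE t2.PRODUCTION_YEAR  = t.PRODUCTION_YEAR
          AND t2.POLLUTANT        = t.POLLUTANT
          AND t2.PRODUCT_CATEGORY = t.PRODUCT_CATEGORY) AS SHARE
FROM baseTable AS t;
The correlated subquery totals the group's emissions, so the division yields each company's fraction of them.)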
Now the task is to count the factories contributing to each pollutant. I arrived at this:
SELECT PRODUCTION_YEAR, POLLUTANT, PRODUCT_CATEGORY, Count(COMPANY) AS COMPANY_COUNT
FROM theQuery
GROUP BY PRODUCTION_YEAR, POLLUTANT, PRODUCT_CATEGORY;
However, now our client wants something more sophisticated: count only the biggest polluters, i.e. those who together contribute 95% of emissions. In a script, I'd probably sort the pollution percentages in each category in ascending order, then walk the dataset, sum up the shares, and only start counting after reaching 5%. How to do it in SQL, I have no idea.
My first step (adding a SUM(SHARE) field to the new query) already resulted in errors ("expression not included in an aggregate function", roughly translated; I'm not sure what to make of it, because all the expressions were indeed included). Is there even a way to do this in an SQL query, or am I wasting my time and would be better off just writing some VBA?
Thanks for any input!
Best,
Ben

Gord's method (see link in comment) works well for this task.
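For reference, a correlated-subquery sketch of that running-share idea (not necessarily Gord's exact query; it assumes SHARE values don't tie within a group, otherwise a secondary ordering column is needed):
SELECT q.PRODUCTION_YEAR, q.POLLUTANT, q.PRODUCT_CATEGORY,
       Count(q.COMPANY) AS BIG_POLLUTER_COUNT
FROM theQuery AS q
WHERE (SELECT Sum(q2.SHARE)
       FROM theQuery AS q2
       WHERE q2.PRODUCTION_YEAR  = q.PRODUCTION_YEAR
         AND q2.POLLUTANT        = q.POLLUTANT
         AND q2.PRODUCT_CATEGORY = q.PRODUCT_CATEGORY
         AND q2.SHARE           <= q.SHARE) > 0.05
GROUP BY q.PRODUCTION_YEAR, q.POLLUTANT, q.PRODUCT_CATEGORY;
The subquery computes, for each company, the cumulative share of all companies in the same group that pollute no more than it does; only companies whose cumulative share exceeds 5% (i.e. the set covering the top 95%) survive the WHERE and get counted.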

MS SSAS - Need to return a measure in a calculated member based on a tuple set and a max of underlying ID

This requires some more advanced MDX knowledge than I have.
I need to get the RepoRate_MAX for repo products at book and instrument level; looking at the Java code I'm replacing, that code always uses the max MurexId.
How can I perform the below? (I've placed MAX here on the dimension, but this is wrong.) I need the combination of the dimensions and also the MAX MurexId:
[Measures].[RepoRate_VAL] = (([Deal].[ProductType].&[REPO],[Deal].[Book],[Deal].[Instrument],MAX([Deal].[MurexId])),[Measures].[RepoRate_MAX])
I'm sure it's a simple one but my mind is part way between the Java OO and MDX worlds currently haha :D
Thanks
Leigh
So after some experimenting I found out about the TAIL and Item MDX functions.
I think at one point I did get it working, but didn't make a note of what worked. I was playing around with this and variants of it, but most versions ended up with unusable query times:
[Measures].[RepoRate_VAL] = (([Deal].[ProductType].&[REPO],[Deal].[Book],[Deal].[Instrument],TAIL(EXISTING([Deal].[MurexId].[MurexId])).Item(0)),[Measures].[RepoRate_MAX])
So I then decided to push the RepoRate calculation back to the SQL data-preparation script. Cleaner, smoother data is always better, and it keeps the calculated members simple.
I used SQL to determine the RepoRate from trade level with MAX(MurexId) and a GROUP BY on Book and Instrument, then updated my main fact table to ensure that the correct RepoRate was set at Book/Instrument level.
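In T-SQL that data-prep step might look roughly like this (a sketch only; the names TradeLevel, FactRepo and RepoRate are assumptions, not the real schema, and it uses ROW_NUMBER instead of a MAX-plus-self-join):
-- Take each (Book, Instrument)'s RepoRate from the trade with the
-- highest MurexId, then push it onto the fact table.
WITH LatestTrade AS (
    SELECT Book, Instrument, RepoRate,
           ROW_NUMBER() OVER (PARTITION BY Book, Instrument
                              ORDER BY MurexId DESC) AS rn
    FROM TradeLevel
)
UPDATE f
SET f.RepoRate = lt.RepoRate
FROM FactRepo AS f
JOIN LatestTrade AS lt
  ON lt.Book = f.Book
 AND lt.Instrument = f.Instrument
WHERE lt.rn = 1;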
Thus the calculated member is then:
[Measures].[RepoRate_VAL] = (([Deal].[Book],[Deal].[Instrument]),[Measures].[RepoRate_MAX])
Fast data prep and a fast calculated member on the Excel/Pivot/UI layer.

Internal db logic/operation to group/compress result

I have a CrateDB table storing various information for zipcodes. It contains around 30k zipcodes, and I need my query to return certain profiling information for all zipcodes at once. I understand that this typically wouldn't be feasible, but since I only need ballpark information and many zipcodes are consecutive, I think an optimization is possible.
For example, if I wanted to profile population, a grouped result such as this would work for me:
group 1 (0-1000): 00000-02000,02004-02010,02012
group 2 (1001-3000): ...
...
The populations and groups above are fake, but the idea should hold: group the profiled category into buckets, assign zipcodes to the correct bucket, and further reduce the size by using a range representation. I could settle for a predefined number of groups, or have the group buckets defined by the request/query itself. This would hopefully reduce the response from something too large for a single query to something manageable.
Is it possible to write a CrateDB function to do something similar, to avoid the bandwidth issues of having this grouping done on a different service/container/VM?
You could probably create the groups on the fly (or as columns, if you wish) with a regex and group by that; I have done this on a 23M-row table.
In my example the regex grouping and AVG took around 30 s, but this depends heavily on hardware.
Something like this would probably work as a general pointer (note the placeholders: a literal -- would start a SQL comment, so substitute your own column, regex and table names):
SELECT AVG(yourColumn),
       regexp_matches(postcode, 'your regex', 'i')[1]
FROM "doc"."yourTable"
GROUP BY regexp_matches(postcode, 'your regex', 'i')[1]
ORDER BY regexp_matches(postcode, 'your regex', 'i')[1];
You could use an OVER windowed function, but CrateDB doesn't yet have full SQL support for partitioning etc.
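For the range-compression half of the question, here is a standard-SQL sketch of the bucket-then-compress idea (a classic gaps-and-islands query; given the partitioning limitation just mentioned, it may need fuller window-function support than CrateDB offers, and it assumes numeric zipcodes in a hypothetical table zipcodes(zip, population)):
SELECT bucket,
       MIN(zip) AS range_start,
       MAX(zip) AS range_end
FROM (
    -- consecutive zips in the same bucket share the same grp value
    SELECT zip,
           FLOOR(population / 1000.0) AS bucket,   -- fixed-width buckets
           zip - ROW_NUMBER() OVER (PARTITION BY FLOOR(population / 1000.0)
                                    ORDER BY zip) AS grp
    FROM zipcodes
) AS t
GROUP BY bucket, grp
ORDER BY bucket, range_start;
Each output row is one zip range within one population bucket, which is exactly the compressed form sketched in the question.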

SSAS: Get a distinct Count based on two different elements

I need to create a distinct count of people who fall into two different dimensions.
One is called [Student Research Degree].[Is Research Degree Current].&[Yes]
The other is called [Student Research Degree].[Is Research Degree Complete].&[Yes]
If one or the other is Yes, or both, then I need to count the record.
If both are no, I can exclude it. I have a row counter measure called [Measures].[Student ID Distinct Count Hidden] already in place.
If I use just one element with the measure, I get the right answer, but if I try to cross join the other elements, I get a result of NULL.
e.g.:
AGGREGATE(
    CROSSJOIN(
        [Student Research Degree].[Is Research Degree Current].&[Yes],
        [Student Research Degree].[Is Research Degree Complete].&[Yes]
    ),
    [Measures].[Student ID Distinct Count Hidden]
)
I am aware that I can just land an extra value in the ETL, and have SQL do the work, and in the end this might be the solution. Is there a way of doing an OR statement on this sort of thing?
No, the tuple of &[Yes], &[Yes] doesn't create an OR situation; I also want the records where one attribute is [No] while the other is [Yes].
I started looking at a subtractive approach: start with the ALL set, take the distinct count of the one invalid combination as a tuple, and subtract that from the grand total. This approach did work, but ONLY because the data allowed for it; if a person could have been in multiple combinations, it wouldn't have worked.
I'm currently testing that approach against the rest of the cube. By all appearances it works perfectly, but I will go with the ETL route if any bugs or mismatches can be proven.
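If the ETL route does win out, the extra value is a one-liner. A minimal T-SQL sketch, with hypothetical table and column names (FactStudentResearchDegree, IsResearchDegreeCurrent, IsResearchDegreeComplete are assumptions, not the real schema):
-- Computed flag implementing the OR; a distinct count of StudentID
-- filtered on this flag then gives the wanted measure in the cube.
ALTER TABLE FactStudentResearchDegree
ADD IsCurrentOrComplete AS
    CASE WHEN IsResearchDegreeCurrent  = 'Yes'
           OR IsResearchDegreeComplete = 'Yes'
         THEN 1 ELSE 0
    END;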

Can PROC SQL embedded in SAS macros dynamically merge two datasets, simulating residential treatment placement decisions for troubled youth?

Good afternoon and happy Friday, folks
I’m trying to automate a placement simulation of youth into residential treatment where they will have the highest likelihood of success. Success is operationalized as “not recidivating” within 3 years of entering treatment. Equations predicting recidivism have been generated for each location, and the equations have been applied to each individual in the scenario (based on youth characteristics like risk, age, LOS, etc.). Each youth has a predicted success rate for every location, which throws in a wrench: youth are not qualified for all of the treatment facilities for which they have predicted success rates. Indeed, treatment locations have differing, yet overlapping, qualifications.
Let’s take a made-up example. Johnny (ID #5, below) is a 15-year-old boy with drug charges. He might have predicted success rates of 91% for location A, 88% for location B, 50% for location C, and 75% for location D. Johnny is most likely to be successful (i.e., not to recidivate within three years of entering treatment) if he is treated at location A; unfortunately, location A only accepts youth who are 17 years old or older, so Johnny would not qualify for treatment there. For Johnny, location B is the next-best option. Let us assume that Johnny is qualified for location B, but that all of location B's beds are filled; so we must now look to location D, as it is now Johnny's "best available" option at 75%.
The score so far: we are matching youth to available beds in locations for which they qualify and where they might enjoy the greatest likelihood of success. Unfortunately, each location only has a certain number of available beds, and the number of available beds differs across locations. The qualifications for entry into treatment facilities differ, yet overlap (e.g., 12-17-year-olds vs. 14-20-year-olds).
In order to simulate what placement decisions might look like based on success rates, I went through the scenario described above for over 400 youth, by hand, in Excel. It took me about a week. I'd like to use PROC SQL embedded in a SAS macro to automate these placement scenarios, with the ultimate goals of a) gaining the ability to bootstrap iterations in order to examine effect sizes across distributions, b) saving time, and c) preventing further brain damage from banging my head against desk and wall in frustration whilst doing this by hand. Whilst never having had the necessity, nay, the privilege of using SQL in my typical role as a researcher, I believe that this time has now come to pass, and I'm excited about it! Honestly. I believe it has the capacity I'm looking for. Unfortunately, it is beating the devil out of me!
Here’s what I’ve got cookin’ so far: I want to create and automate the placement simulation with clever use of merging/joining/switching or something like that.
I have two datasets (tables). The first dataset contains all of the youth information (one row per youth; several columns with demographics and location ranks, which correspond to the predicted success rates). The order of rows in the youth dataset was/will be randomly generated, to simulate the randomness with which youth enter the system and are subsequently placed into treatment. Note that I will be “cleaning” the youth dataset prior to merging, such that rank-column cells will only be populated for programs for which the respective youth qualifies. This should take the “does the youth even qualify for the program” problem out of the equation.
However, that still leaves the issue of availability to be contended with in the scenario.
The second dataset contains the treatment facility beds, with each row corresponding to an available bed in one of the treatment locations; two columns contain bed numbers and location names. Each bed (row) has only one location cell populated, but each location populates several rows.
Thus, in descending order of predicted success, I want to merge each youth row with the available bed that represents his/her best chance of success, so the merge/join/switch/thing should take place
on youth.Rank1 = distinct TF.Location;
if youth.Rank1 <> TF.Location, then
    merge on youth.Rank2 = TF.Location;
if youth.Rank2 <> TF.Location, then
    merge on youth.Rank3 = TF.Location; etc.
Put plainly: merge on Rank1, unless the Rank1 location is no longer available; then merge on Rank2, unless the Rank2 location is no longer available; and on down the line until all options are exhausted and foster care (i.e., alternative services) is the only option.
I’ve had no success getting this to work. I haven’t even been successful getting a UNION to work. About the only successful thing I’ve done in SQL so far is create a view of a single dataset. It’s pretty sad. I’ve been following this guidance, but I get hung up around the WHERE clause:
proc sql;            /* Calls the SQL procedure                         */
create table x as    /* Tells SAS to create a table called x            */
select ...           /* Specifies the column(s) to be selected          */
from ...             /* Specifies the table(s) (datasets) to be queried */
where ...            /* Subsets the data based on a condition           */
group by ...         /* Classifies the data into groups based on the
                        specified column(s)                             */
order by ...         /* Sorts the resulting rows (observations) by the
                        specified column(s)                             */
;
quit;                /* Ends the PROC SQL procedure                     */
Frankly, I’m stuck, and I could use some advice. The greenhorn in me is in way over his head.
I appreciate any help or guidance anyone might lend.
Cheers!
P
The process you describe (and to be honest I skipped to the end, so I might have missed something) does not lend itself to SQL, because each step could affect the results of the next one. However, you want to get the best results for the most kids. (I think a lot of that text was to convince us how important it is to help out.) You don't actually give us anything we can really use to help, since you don't give any details of your data model, your data, or expected results; there really is no way to answer this question. But I don't care -- I'm going to go forward with some suggestions, because it is a Friday and I've never done a stream-of-consciousness answer to a stream-of-consciousness question before. I suggest you don't formulate your solution just in SQL, but instead use a higher-level program and engage in a process like the one described below -- because this is a DB question, I've noted the places where the DB might be involved.
1. Generate a list of kids (this can be in a table called NEEDY_KID).
2. Have a list of locations to assign (this can also be a table, LOCATION).
3. Run your matching for the best fit from kid to location -- at this point, don't worry about assigning more than one kid to a location; there can be duplicates (put this in a table called KID2LOC using a query).
4. Check KID2LOC for locations assigned twice -- use some method to remove the duplicates so each location is only assigned once (remove them from KID2LOC using a query).
5. Prune the LOCATION list to remove assigned locations (once again, a query).
6. If kids exist without a location, go to step 3 with the new pruned location list.
7. Done. (A PROC SQL sketch of one pass of steps 3-5 follows below.)
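Since the question asked for PROC SQL, here is a rough sketch of one pass of steps 3-5; every dataset and column name (youth_pref, beds, assigned, kid_id, pref_order, location, bed_id) is an assumption, and youth_pref is assumed to be in long form, one row per kid/rank:
proc sql;
   /* step 3: match each kid to their best still-open location */
   create table pass as
   select p.kid_id, p.location
   from youth_pref as p
   where p.location in (select location from beds)
     and p.pref_order =
         (select min(p2.pref_order)
          from youth_pref as p2
          where p2.kid_id = p.kid_id
            and p2.location in (select location from beds));

   /* step 4: one winner per location; the lowest kid_id wins here,
      since the row order was already randomized upstream */
   create table winners as
   select location, min(kid_id) as kid_id
   from pass
   group by location;

   /* record this pass's placements; 'assigned' is assumed to have
      been created empty before the loop */
   insert into assigned
   select kid_id, location from winners;

   /* step 5: consume one bed per matched location ... */
   create table used_beds as
   select min(bed_id) as bed_id
   from beds
   where location in (select location from winners)
   group by location;

   delete from beds
   where bed_id in (select bed_id from used_beds);

   /* ... and drop the placed kids' remaining preferences */
   delete from youth_pref
   where kid_id in (select kid_id from winners);
quit;
For step 6, wrap the block above in a %macro with a %do %until loop that repeats until the pass dataset comes back empty.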

SQL Server Payment estimate probability - best maths equations and recursive query?

I have a table which holds a list of transactions.
Task: To estimate the next transaction amount.
Problem:
The actual payment period for each row is variable: it can be weekly, monthly, or anything chosen by the end user.
Can anyone suggest a good method to estimate the next payment based on previous data?
At the moment I basically take the figure back to a daily amount, then multiply by the period, i.e. week/month/quarter/year. Then, given the history, I choose the result that has the highest incidence (count).
This does not generate accurate estimates, due to payments within payments that I don't need to care about, e.g. a £100 real payment plus £20 of additional charges that are irrelevant.
Another way would be to calculate the average, standard deviation and variance between payments, then choose the highest-probability result.
The problem is, I've been unable to code this in SQL.
SELECT [Identifier]
      ,[DateTranEntered]
      ,[Type]
      ,[TranDateFrom]
      ,[TranDateTo]
      ,[Amount]
      ,[ReferenceForTran]
      ,[CreatedDate]
FROM [TranTable]
Perhaps something with recursion through the table: calculate every transaction's daily amount, then, using the variance and incidence, choose from the last 'x' transactions what the best estimate is?
The problem is I have gotten stuck on the recursive query for this.
Any thoughts about this?
SQL Server Analysis Services has a suite of data mining tools that provide algorithms such as linear regression, decision trees and neural networks. You can learn more about them here: http://msdn.microsoft.com/en-us/library/ms175595.aspx. It sounds like linear regression might be the best place to start for this problem.
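Before reaching for data mining, the averaging approach from the question can also be prototyped directly in T-SQL. A sketch using the columns from the SELECT above (NULLIF guards against zero-length periods; no recursion needed):
-- Normalize each transaction to a daily rate, then summarize per
-- Identifier; AvgDaily times the days in the chosen period gives a
-- crude next-payment estimate, and StDevDaily flags noisy histories.
SELECT  [Identifier],
        AVG(DailyAmount)   AS AvgDaily,
        STDEV(DailyAmount) AS StDevDaily,
        COUNT(*)           AS Incidence
FROM (
    SELECT [Identifier],
           [Amount] * 1.0
             / NULLIF(DATEDIFF(day, [TranDateFrom], [TranDateTo]), 0)
               AS DailyAmount
    FROM [TranTable]
) AS t
GROUP BY [Identifier];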