Defining an RDLC chart axis with an aggregate function

The automatic axis for one of my embedded charts isn't behaving well, sometimes showing only one other major value besides the top and bottom. So I thought I'd set my own boundaries, which seemed pretty easy given that one of the columns on the chart is always going to be larger than any of the others.
<Maximum>=(((Max(Fields!Entered.Value, "Chart1") + 10) \ 50) + 1) * 50</Maximum>
(the other columns detail what happened to the things that entered this process)
Round up to the nearest 50 with a little overage to put the label on top. Then I can put the intervals at this divided by 5 and I'm gold.
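(For the record, here's how that expression pans out; a quick sketch in Python, since \ is just integer division, with a made-up raw maximum:)
raw_max = 163                                 # hypothetical Max(Fields!Entered.Value, "Chart1")
axis_max = ((raw_max + 10) // 50 + 1) * 50    # -> 200: next multiple of 50, with headroom for the label
interval = axis_max / 5                       # -> 40.0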
Except I'm not gold. The chart groups records by date and the individual bars are Sum(Fields!Entered.Value) et cetera, so it's drastically underscaling when multiple batches get processed on a single date. But hey, it groups records by date, I can use that:
<ChartCategoryHierarchy>
  <ChartMembers>
    <ChartMember>
      <Group Name="Chart1_CategoryGroup">
        <GroupExpressions>
          <GroupExpression>=Fields!Date.Value</GroupExpression>
        </GroupExpressions>
      </Group>
    </ChartMember>
  </ChartMembers>
</ChartCategoryHierarchy>
as:
<Maximum>=(((Max(Fields!Entered.Value, "Chart1_CategoryGroup") + 10) \ 50) + 1) * 50</Maximum>
and it'll aggregate over the group just fine. Right?
The ValueAxis_Primary.Maximum expression for the chart 'Chart1' has a scope parameter that is not valid for an aggregate function. The scope parameter must be set to a string constant that is equal to either the name of a containing group, the name of a containing data region, or the name of a dataset.
Nope! It works just fine for "Chart1" but not for "Chart1_CategoryGroup"!
So, uh:
what scope are the axis calculations operating in, 'cause it ain't the category scope?
is there some way to provide them an aggregate scope that groups the data by date so they can do their calculations proper?

You Have To Nest The Scope
A little extra work gave me this insight:
Max(Fields!Entered.Value, "Chart1_CategoryGroup") returns the maximum of the Entered fields within a single category group, which is not the level the Y axis is concerned with. What you're interested in is the maximum value of the summed calculation (per group) across the whole chart, so nest the scopes to do that:
<Maximum>
=(((Max(
Sum(Fields!Entered.Value, "Chart1_CategoryGroup")
, "Chart1") + 10) \ 50) + 1) * 50
</Maximum>
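If it helps to see what that nesting computes, here's a rough Python analogue (the rows and dates are made up, not from the report):
# Each row is (date, entered); the chart's bars are the per-date sums.
rows = [("2023-05-01", 40), ("2023-05-01", 70), ("2023-05-02", 55)]

per_date_sums = {}
for date, entered in rows:
    per_date_sums[date] = per_date_sums.get(date, 0) + entered   # Sum(..., "Chart1_CategoryGroup")

chart_max = max(per_date_sums.values())        # Max(..., "Chart1") over the per-group sums -> 110
axis_max = ((chart_max + 10) // 50 + 1) * 50   # -> 150; basing it on the raw max (70) would give 100 and clip the 110 bar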

Related

Proportions for multiple subcategories

I am trying to calculate proportions with multiple subcategories. The series is grouped by ['budget_levels', 'revenue_levels'].
I would like to calculate the proportion for each.
For example,
budget_levels=='low' & revenue_levels=='low' / budget_levels=='low'
budget_levels=='low' & revenue_levels=='medium' / budget_levels=='low'
However, I'm not getting the desired output.
Is there any way I could do this calculation for each combination with a simple one-liner, such as an .apply(lambda) call?
Use value_counts to get the number of occurrences of each combination. Then group by the column budget_levels and divide the observations in each group by their sum. sort_index makes it easier to compare the groups.
df.value_counts().groupby(level=0).transform(lambda x: x / x.sum()).sort_index()
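For instance, on a tiny made-up frame using the column names from the question:
import pandas as pd

# Made-up data, just to show the shape of the output
df = pd.DataFrame({
    "budget_levels":  ["low", "low", "low", "high"],
    "revenue_levels": ["low", "medium", "medium", "low"],
})

out = df.value_counts().groupby(level=0).transform(lambda x: x / x.sum()).sort_index()
print(out)
# budget_levels  revenue_levels
# high           low               1.000000
# low            low               0.333333
#                medium            0.666667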

In Python, is there a way to divide an element in a list by the next element in a list using a for loop or list comprehension?

I have a list of metrics that each have values for multiple time periods. I would like to write a script that takes the value of a metric for a particular time period and divides it by the value for the previous year.
Currently my code looks like this:
for metric in metrics:
    iya_df[metric+' '+period[0][-4:]+' IYA'] = pivot[metric][period[0]]/pivot[metric][period[1]]*100
    iya_df[metric+' '+period[1][-4:]+' IYA'] = pivot[metric][period[1]]/pivot[metric][period[2]]*100
    iya_df[metric+' '+period[2][-4:]+' IYA'] = pivot[metric][period[2]]/pivot[metric][period[3]]*100
    iya_df[metric+' '+period[3][-4:]+' IYA'] = pivot[metric][period[3]]/pivot[metric][period[4]]*100
I have a list of metrics and a list of periods. (The slice after period is just to grab the 4-digit year.)
The source table is a pivot table with multiple indices.
I would like to change the code so that I don't have to change it if my list of time periods changes in length.
There's probably a more efficient way to do this with list comprehension than loops but I'm still getting stronger in Python.
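A hedged sketch of one way to make this length-agnostic, reusing the iya_df, pivot, metrics and period names from the snippet above (so all of these are assumptions about the real data), is to pair each period with the one before it:
for metric in metrics:
    for current, previous in zip(period, period[1:]):
        iya_df[metric+' '+current[-4:]+' IYA'] = pivot[metric][current]/pivot[metric][previous]*100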

Grouping a Group with Accumulator

I want to segment time-series data using separate interval sets. For example, one set has smaller periodic intervals, and one set has larger aperiodic intervals. The interval sets are likely unaligned, such that one from one set may bisect one from the other. In this case the aggregate over the bisected, second interval should resume with the previous value as initial, i.e. an accumulator. The sets are nested in that the carry only occurs between the "inner" set.
For example, sum:
[.,.,.,.,.,.,.,.,., .,.,.,.,.,.][.,.,.,.,.,.,....
[1,2,3,4,5,6,7,8,9][1,2,3,4,5,6 ,7,8,9]
[[45], [21] ],[[45], ....
|_value carries_|
I want to say I should iterate over the dataframe with a for-loop, but...
Specifics:
Both interval sets are series of datetime64 dtype representing boundary or cut points, which can be closed either to the left or right. The aperiodic interval set is generated manually, but the periodic set is driven by the time-series data:
d = pd.read_csv...
o = pd.offsets.Week(weekday=1)
p = pd.date_range(o.rollback(d.index[0]), o.rollforward(d.index[-1]), freq=o)
But it is easy enough to convert both to interval indices, if that helps:
p = pd.interval_range(o.rollback(d.index[0]), o.rollforward(d.index[-1]), freq=o)
a = pd.IntervalIndex(pd.arrays.IntervalArray.from_breaks(a))
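A hedged sketch of the carrying aggregate, using plain integer positions to mirror the example above (s, inner and outer are made-up names, not the real datetime boundaries), would be to sum per intersection piece and then take a cumulative sum within the bisected set:
import numpy as np
import pandas as pd

s = pd.Series(np.tile(np.arange(1, 10), 3))            # 1..9 repeated, as in the example
inner = pd.cut(s.index, [0, 9, 18, 27], right=False)   # the smaller, periodic set
outer = pd.cut(s.index, [0, 15, 27], right=False)      # the larger, aperiodic set

pieces = s.groupby([inner, outer], observed=True).sum()   # sum per intersection piece
carried = pieces.groupby(level=0).cumsum()                # 45; 21 then 45 (the carry); 45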

Quick Delta Between Two Rows/Columns in GoodData

Right now, I see there are quick ways to get things like Sum/Avg/Max/Etc. for two or more rows or columns when building a table in GoodData.
(screenshot: the quick total options)
I am building a little table that shows last week and the week prior, and I'm trying to show the delta between them.
So if the first column is 100 and the second is 50, I want '-50'
If the first column is 25 and the second is 100, I want '75'
Is there an easy way to do this?
Let's say the first column contains the result of metric #1 and the second column contains the result of metric #2; you can simply create a metric #3 defined as (metric #1 - metric #2), or vice versa.

Power-law distribution in T-SQL

I basically need the answer to this SO question that provides a power-law distribution, translated to T-SQL for me.
I want to pull a last name, one at a time, from a census provided table of names. I want to get roughly the same distribution as occurs in the population. The table has 88,799 names ranked by frequency. "Smith" is rank 1 with 1.006% frequency, "Alderink" is rank 88,799 with frequency of 1.7 x 10^-6. "Sanders" is rank 75 with a frequency of 0.100%.
The curve doesn't have to fit precisely at all. Just give me about 1% "Smith" and about 1 in a million "Alderink".
Here's what I have so far.
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank] = ROUND(88799 * RAND(), 0)
But this of course yields a uniform distribution.
I promise I'll still be trying to figure this out myself by the time a smarter person responds.
Why settle for the power-law distribution when you can draw from the actual distribution?
I suggest you alter the LastNames table to include a numeric column containing the actual number of individuals with a name that is more common. You'll probably want a number on a smaller but proportional scale, say, maybe 10,000 for each percent of representation.
The list would then look something like:
(other than the 3 names mentioned in the question, I'm guessing at the numbers for White, Johnson et al.)
Smith 0
White 10,060
Johnson 19,123
Williams 28,456
...
Sanders 200,987
..
Alderink 999,997
And the name selection would be
SELECT TOP 1 [LastName]
FROM [LastNames] as LN
WHERE LN.[number_described_above] < ROUND(1000000 * RAND(), 0)
ORDER BY [number_described_above] DESC
That's picking the first name whose number does not exceed the [uniform distribution] random number. Note how the query uses less-than and orders in descending order; this guarantees that the very first entry (Smith) can get picked. The alternative would be to start the series with Smith at 10,060 rather than zero and to discard random draws smaller than this value.
Aside from the matter of boundary management (starting at zero rather than 10,060) mentioned above, this solution, like the two other responses so far, is the same as the one suggested in dmckee's answer to the question referenced in this question. Essentially the idea is to use the CDF (Cumulative Distribution Function).
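For illustration, the same CDF lookup sketched outside SQL (the names and cumulative counts are the guessed values from the list above, on the 10,000-per-percent scale; with the full table the gaps between them would be filled by the other 88,793 names):
import bisect
import random

names      = ["Smith", "White", "Johnson", "Williams", "Sanders", "Alderink"]
cumulative = [0, 10060, 19123, 28456, 200987, 999997]

def draw_name():
    u = random.uniform(0, 1_000_000)               # uniform draw over the full scale
    i = bisect.bisect_left(cumulative, u) - 1      # largest cumulative value below the draw
    return names[max(i, 0)]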
Edit:
If you insist on using a mathematical function rather than the actual distribution, the following should provide a power-law function which would somehow convey the "long tail" shape of the real distribution. You may want to tweak the @PwrCoef value (which BTW needn't be an integer); essentially, the bigger the coefficient, the more skewed toward the beginning of the list the function is.
DECLARE @PwrCoef INT
SET @PwrCoef = 2
SELECT 88799 - ROUND(POWER(POWER(88799.0, @PwrCoef) * RAND(), 1.0/@PwrCoef), 0)
Notes:
- the extra ".0"s in the function above are important to force SQL to perform float operations rather than integer operations.
- the reason we subtract the power calculation from 88799 is that the calculation's distribution is such that the closer a number is to the end of our scale, the more likely it is to be drawn. The list of family names being sorted in reverse order (most likely names first), we need this subtraction.
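To get a feel for the skew, the same draw can be sketched outside SQL (a rough eyeball check, with the power coefficient set to 2):
import random
from collections import Counter

N, pwr = 88799, 2

def draw_rank():
    # Mirrors the T-SQL above; boundary draws can land on 0 or N and may need clamping to 1..N.
    return N - round((N ** pwr * random.random()) ** (1.0 / pwr))

sample = [draw_rank() for _ in range(100_000)]
skew = Counter(r // 10_000 for r in sample)   # counts per block of 10,000 ranks; the lowest blocks dominate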
Assuming a power of, say, 3 the query would then look something like
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
= 88799 - ROUND(POWER(POWER(88799.0, 3) * RAND(), 1.0/3), 0)
Which is the query from the question except for the last line.
Re-Edit:
In looking at the actual distribution, as apparent in the Census data, the curve is extremely steep and would require a very big power coefficient, which in turn would cause overflows and/or extreme rounding errors in the naive formula shown above.
A more sensible approach may be to operate in several tiers, i.e. to perform an equal number of draws in each of the, say, three thirds (or four quarters or...) of the cumulative distribution; within each of these parts of the list, we would draw using a power-law function, possibly with the same coefficient, but with different ranges.
For example
Assuming thirds, the list divides as follows:
First third = 425 names, from Smith to Alvarado
Second third = 6,277 names, ending at Gainer
Last third = 82,097 names, from Frisby to the end
If we were to need, say, 1,000 names, we'd draw 334 from the top third of the list, 333 from the second third and 333 from the last third.
For each of the thirds we'd use a similar formula, maybe with a bigger power coefficient for the first third (where we are really interested in favoring the earlier names in the list, and also where the relative frequencies are more statistically relevant). The three selection queries could look like the following:
-- Random Drawing of a single Name in top third
-- Power Coef = 12
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
= 425 - ROUND(POWER(POWER(425.0, 12) * RAND(), 1.0/12), 0)
-- Second third; Power Coef = 7
...
WHERE LN.[Rank]
= (425 + 6277) - ROUND(POWER(POWER(6277.0, 7) * RAND(), 1.0/7), 0)
-- Bottom third; Power Coef = 4
...
WHERE LN.[Rank]
= (425 + 6277 + 82097) - ROUND(POWER(POWER(82097.0, 4) * RAND(), 1.0/4), 0)
Instead of storing the PDF as rank, store the CDF (the sum of all frequencies up to that name, starting from Alderink).
Then modify your SELECT to retrieve the first LN whose CDF value is greater than your formula's result.
I read the question as "I need to get a stream of names which will mirror the frequency of last names from the 1990 US Census"
I might have read the question a bit differently than the other suggestions, and although an answer has been accepted, and a very thorough answer it is, I will contribute my experience with the Census last names.
I had downloaded the same data from the 1990 census. My goal was to produce a large number of names to be submitted for search testing during performance testing of a medical record app. I inserted the last names and the percentage of frequency into a table. I added a column and filled it with an integer which was the product of "total names required * frequency". The frequency data from the census did not add up to exactly 100%, so my total number of names was also a bit short of the requirement. I was able to correct the number by selecting random names from the list and increasing their count until I had exactly the required number; the randomly added count never amounted to more than .05% of the total of 10 million.
I generated 10 million random numbers in the range of 1 to 88799. With each random number I would pick that name from the list and decrement the counter for that name. My approach was to simulate dealing a deck of cards, except my deck had many more distinct cards and a varying number of each card.
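A rough sketch of that deck-dealing idea (the counts and names here are invented, and the handling of exhausted names is an assumption, since the answer doesn't spell it out):
import random

# Per-name counts would really come from total_required * census frequency.
counts = {"Smith": 100_600, "Johnson": 81_000, "Sanders": 10_000, "Alderink": 17}
names = list(counts)

dealt = []
while counts:
    name = random.choice(names)       # pick a remaining name at random
    dealt.append(name)
    counts[name] -= 1
    if counts[name] == 0:             # that card is used up; remove it from the deck
        del counts[name]
        names.remove(name)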
Do you store the actual frequencies with the ranks?
Converting the algebra from that accepted answer to T-SQL is no bother, if you know what values to use for n. y would be what you currently have, ROUND(88799 * RAND(), 0), and x0,x1 = 1,88799 I think, though I might misunderstand it. The only non-standard maths operator involved from a T-SQL perspective is ^, which is just POWER(x,y) == x^y.