How to find categorical outlier/noise rows in a BigQuery window - SQL

How would I detect outlier rows in BigQuery and mark them as outliers? For each row, I want to look at the 5 rows before and the 5 rows after and see if the value changed.
Here is an example of a table.
In this table I would want the highlighted rows to have some boolean flag marking them as outliers, because the value only changes for a brief moment. However, when it changes from id 19 to 5, that is okay (assuming the change is sustained). Basically I am trying to remove glitches where the value switches from one user id to another for just 2 rows, so I want to mark those rows with some flag. I was thinking of doing this in BigQuery by looking at lag rows and lead rows in a given window (over a partition), but I am not sure if I am on the right path.
This isn't a typical "outlier" case, since all rows should have the same value except for the glitches, and it is more of a categorical variable, albeit a number.
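As a minimal sketch of the lag/lead idea described above (the table name events and the columns ts and user_id are placeholder assumptions), a row could be flagged when its value differs from both the value 5 rows back and the value 5 rows ahead, since a sustained change would still match on at least one side:
-- Sketch only; events, ts, and user_id are assumed names.
-- Near the start and end of the data, LAG/LEAD return NULL, so the flag is NULL there.
SELECT
  ts,
  user_id,
  (user_id != LAG(user_id, 5) OVER (ORDER BY ts)
   AND user_id != LEAD(user_id, 5) OVER (ORDER BY ts)) AS is_glitch
FROM events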

Related

How to combine a row of cells in VBA if certain column values are the same

I have a database where all of the input from the user (through a userform) gets stored. In the database, each column is a different category for the type of data (e.g. date, shift, quantity, etc.) and the data from the userform input gets put into its corresponding category. For some entries, all of the data is the same except for the quantity. I was wondering how I could combine these rows into one and add the quantities together across the whole database (e.g. combining the first and third data entries below). I have tried playing around with a couple of different loops but can't seem to figure anything out.
Period Date Line Shift Type Quantity
4 x 2 4/3/18 A 3 14 18
4 x 2 4/3/18 A 3 13 12
4 x 2 4/3/18 A 3 14 15
Thank you!
If you're looking to modify the underlying database, you might be able to query the data into the format you want by including all of the other columns in a GROUP BY statement, saving the result to another table, and then replacing the original table with the properly formatted one.
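For example, a hedged sketch of that grouping query (the table name Entries is an assumption, and reserved-looking column names are bracketed):
-- Every column except Quantity is grouped on, so rows that match on all of them collapse into one summed row.
SELECT Period, [Date], Line, [Shift], [Type], SUM(Quantity) AS Quantity
FROM Entries
GROUP BY Period, [Date], Line, [Shift], [Type];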
If you have the data in Excel and you just want to view it with the duplicate rows summed, a Pivot Table would be a good choice. You can select all the other columns as rows for the Pivot Table and sum of Quantity as the values.

No sum() with null values

I'm looking for a solution to create sums of around 10 scores and targets of a product over 6 different dimensions. There are some more I won't bother you with. For every dimension I need a total. For example:
SalesPeriod. Product: Bikes. Dimensions: bmx, size, colours, with bars etc. Targets: 1,2,3,4,5. Scores:1,2,3,4,5.
So 10 totals for bmx bikes with size x, colour red and bars, and 10 totals for bmx bikes, size x, colour red etc etc.
However, every score total needs to be calculated only when none of the underlying values is a null. For example, if score 1 contains a null then there is no calculation, but if score 2 does not contain a null it should be calculated.
At this point the calculation is done via a case statement which basically checks the values within each column/score and only calculates the total when the count of scores is equal to the expected number of rows.
The calculation requires a lot of CPU, and with a larger dataset it is very inefficient and simply takes too long.
I'm looking for a solution that will be much more efficient. What could be my best option to try?
You can first filter (or group) down to the products whose values are all non-null by using the same count method. I don't think there is any other method.
SELECT columnid, SUM(column1)
FROM table
GROUP BY columnid
-- COUNT(column1) ignores NULLs, so only groups where column1 has no NULL values are kept
HAVING COUNT(column1) = COUNT(*);
Then you can join it on columnid with another similar query on another columnN as well.
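For instance, a hedged sketch of joining two such per-column queries (keeping the placeholder names table, columnid, column1, and column2 from above):
SELECT a.columnid, a.sum1, b.sum2
FROM (
    SELECT columnid, SUM(column1) AS sum1
    FROM table
    GROUP BY columnid
    HAVING COUNT(column1) = COUNT(*)
) a
JOIN (
    SELECT columnid, SUM(column2) AS sum2
    FROM table
    GROUP BY columnid
    HAVING COUNT(column2) = COUNT(*)
) b ON b.columnid = a.columnid;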
(I'm not sure I understood your problem completely, but you basically want an efficient query with sum(scores) and sum(targets) only when they are not null? Or only when they are both not null? Or only scores? Or only targets?)

What is the best way to reassign ordinal numbers for a move operation

I have a column in SQL Server called "Ordinal" that is used to indicate the display order of the rows. It starts at 0 and skips 10 for the next row, so we have something like this:
Id Ordinal
1 0
2 20
3 10
It skips 10 because we wanted to be able to move an item in between other items (based on ordinal) without having to reassign ordinal numbers for the entire table.
As you can imagine, eventually the ordinal numbers will need to be reassigned somehow for a move-in-between operation, either on the surrounding rows or for the entire table, once the unused ordinal numbers between the target items are all used up.
Is there an algorithm that I can use to effectively reorder the ordinal numbers for the move operation, taking into consideration things like the long-term maintainability of the table and minimizing update operations on the table?
You can re-number the sequences using a somewhat complicated UPDATE statement:
UPDATE u
SET u.sequence = 10 * (c.num_below - 1)
FROM test u
JOIN (
    -- For each row, count how many rows have a sequence value
    -- less than or equal to its own (including the row itself)
    SELECT t.id, COUNT(*) AS num_below
    FROM test t
    JOIN test tr ON tr.sequence <= t.sequence
    GROUP BY t.id
) c ON c.id = u.id
The idea is to obtain, for each row, a count of the items with a sequence less than or equal to that of the current row, subtract one, multiply the result by ten, and assign it as the new sequence value.
The content of test before the UPDATE:
ID Sequence
__ ________
1 0
2 20
3 10
4 12
The content of test after the UPDATE:
ID Sequence
__ ________
1 0
2 30
3 10
4 20
Now the sequence numbers are evenly spread again, so you can continue inserting in the middle until you run out of new sequence numbers; then you can re-number again.
Demo.
These won't answer your question directly--I just thought I might suggest some other approaches:
One possibility--don't try to do it by hand. Have your software manage the numbers. If they need re-writing, just save them with new numbers.
A second--use a "Linked List" instead. In each record, store the index of the next record you want displayed, then have your code load that directly into a linked list.
Yet another simple approach. Let's say you're inserting a new record with an ordinal equal to x.
First, check whether there's already a row with ordinal value x. If there is, just update all the records having an ordinal value equal to or greater than x, increasing them by y. Then you are safe to insert the new record.
This way you're sure you won't run an update every time, and of course you'll keep the order.
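A hedged sketch of that shift-then-insert idea in T-SQL (the table name Items, the columns Id and Ordinal, and the gap size y = 10 are all assumptions):
-- Shift only when the target ordinal is already taken, then insert.
IF EXISTS (SELECT 1 FROM Items WHERE Ordinal = @x)
    UPDATE Items
    SET Ordinal = Ordinal + 10   -- y = 10, matching the gap already used in the table
    WHERE Ordinal >= @x;

INSERT INTO Items (Id, Ordinal) VALUES (@newId, @x);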

SSRS - Is there a way to have a table split by a page break to appear on the same page?

I am creating a report using MS Report Builder 3.0. For this report, I have a stored procedure built that filters down to the specific rows needed, and then I use a row group to group on a particular field (pass_no). The table that is displayed is 2 columns and 3 rows within the row group. The basic description of what I want to accomplish is instead of the rows running onto the next page, I want the rows to continue on the same page in a new set of 2 columns. Think of it like a newspaper where the text continues in a new column rather than running down onto the next page.
For the example I'm going to use here, there are 12 rows of data returned by the SP, and 8 unique values in the pass_no column which is what my row group is grouped on. So in the report I end up with 8 groups of 3 rows. I'm aiming to have the table display 6 pass_no values (so 6 groups of 3 rows) before, for lack of better terminology, starting a new table.
My first approach at this has been to create a column group and set the grouping expression to the following:
=Floor((RowNumber(Nothing) - 1) / 6)
While this works in creating a new set of 2 columns, the split for the new columns is based on the row number from the raw data returned by the SP rather than the number of row sets created by the row group. So because there are 12 rows returned, and the 6th and 7th rows have the same pass_no value, the second set of columns duplicates that one set of data. Also, the top 6 rows of the second column set are blank, with the second set of values appearing below the first set.
If I add an additional column group that also groups by pass_no, then I don't get the duplicate values, but I do get a pair of columns for each pass_no as well (as would be expected). I've tried modifying the expression above a bit, changing Nothing to the row group name and to the table name, but neither has yielded the desired result.
I can't alter the SP to do the grouping there because there are other column values that are not identical and I pull that data into a cell value expression within the table using Join(LookupSet()).
I have also considered creating 2 tables and applying a filter to the table so the first table only displays the first 6 results and the second table displays the remaining results, but that also looks at the raw data rather than the groupings and TOP N can't be used on pass_no as it's a text value, not an integer. This would also cause problems if I need to go to 3 tables.
So long story short, is there a way to do a table break rather than a page break or to overflow columns onto the same page rather than onto a new page?
Here's the pertinent portions of the Dataset:
http://sqlfiddle.com/#!2/5082b/1
PASS_NO MASTERTRAN TRANS_NO DESCRIPTION IS_MOD
7913019000 4931019000 4931019000 General Admission Adult 0
7914019000 4932019000 4932019000 Sea Turtle Hosp Adult 0
7914019000 4932019000 4933019000 2:00 PM SEA TURTLE HOSP 1
7916019000 4934019000 4934019000 Sea Turtle Hosp Child 0
7916019000 4934019000 4935019000 2:00 PM SEA TURTLE HOSP 1
7917019000 4934019000 4934019000 Sea Turtle Hosp Child 0
7917019000 4934019000 4935019000 2:00 PM SEA TURTLE HOSP 1
7918019000 4934019000 4934019000 Sea Turtle Hosp Child 0
7918019000 4934019000 4935019000 2:00 PM SEA TURTLE HOSP 1
7922019000 4936019000 4936019000 General Admission Child 0
7923019000 4936019000 4936019000 General Admission Child 0
7924019000 4936019000 4936019000 General Admission Child 0
I think your data presents a bit of a problem.
As you've already figured out, typically for this sort of setup you'd set up a row group with an expression like:
=(RowNumber(Nothing) - 1) Mod 6
And a column group expression like:
=Ceiling(RowNumber(Nothing) / 6)
This would create a six row tablix that would grow horizontally as required.
See this SO question for a similar example.
However, you currently have the requirement of also grouping by another column - pass_no in your case. Normally you can approximate a group-level row number with an expression like:
=RunningValue(Fields!pass_no.Value, CountDistinct, "DataSet1")
Unfortunately, when you try to add this into one of the grouping expressions like:
=Ceiling(RunningValue(Fields!pass_no.Value, CountDistinct, "DataSet1") / 6)
You get the following error:
A group expression for the tablix 'Tablix1' includes the aggregate function RunningValue. RunningValue cannot be used in group expressions.
Based on all this, my recommendation is to try and get a Dataset that has one row per pass_no value and base the tablix on this, with the above row/column grouping expressions, i.e. no need to group on multiple pass_no rows. So in your example it would have eight rows. You could then have a separate Dataset with all the individual rows and use a lookupset function to concatenate the description, etc.
Your other option is to try and get everything on one Dataset only, including the aggregates as required. This might not be possible, but for description at least you can leverage any of the various techniques here to get a delimited list. Once you have this list you can replace the delimiter with vbCrLf to split it back over multiple rows.
All this is a very long-winded way of saying that I don't know if your requirement is possible with your data, but if you look at having at least one Dataset with one row per pass_no you should be able to make it work.
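As a hedged illustration of that first suggestion, a second Dataset with one row per pass_no could be produced by grouping the rows shown above (the table name dbo.Passes is an assumption); the per-row descriptions would then still come from the detailed Dataset via LookupSet:
-- One row per pass_no; dbo.Passes is a placeholder for the source of the rows above.
SELECT PASS_NO,
       MIN(MASTERTRAN) AS MASTERTRAN,
       COUNT(*)        AS ROW_COUNT
FROM dbo.Passes
GROUP BY PASS_NO;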

Dynamic use of MDX AVG function

Anyone have advice on how to build an average measure that is dynamic -- it doesn't specify a particular slice but instead uses your current view? I'm working within a front-end OLAP viewer (Strategy Companion) and I need a "dynamic" implementation based on the dimensions that are currently filtered in the data view.
My fact table looks something like this:
Key AmountA IndicatorA AmountB Other Data
1 5 1 null 25
2 6 1 null 52
3 7 1 2 106
4 null 0 4 108
Now I can specify a simple average for "[Measures].[AmountA]" with "[Measures].[AmountA] / [Measures].[IndicatorA]" which works great - "[IndicatorA]" sums up to the number of non-null values of "[AmountA]". And this also works great no matter what dimensions are selected in the view - it always divides by the count of rows that have been filtered in.
But what about [AmountB]? I don't have a null indicator column. I want to get an average value of [AmountB] for whatever rows have been filtered in for my current view. If I try to use the count of rows as a simple formula (pseudo-code "[Measures].[AmountB] / Count([Measures].[Key])") I get the wrong result, because it is counting all the null rows in the average.
So, I need a way to use the AVG function to specify the average of [AmountB] over the set of "whatever rows I'm currently filtering in, based on whatever dimensions I'm currently using". How do I specify this dynamic set?
I've tried several different uses of the AVG function and they have either returned null or summed up to huge numbers, clearly not the average I'm looking for.
Thanks-
Matt
Sorry, my first suggestion was wrong. If you don't have access to the OLAP cube you can't write an MDX query for this purpose (IMHO), because at that access level you don't have any detailed data from your fact table and can only use the aggregated data and dimensions from your cube.
Otherwise (if you have access to the OLAP database), you can create this metric (a count of non-NULL rows) in your measure group and then use it for the AVG calculation (as a calculated member in your cube, or in the WITH section of your MDX query).
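A hedged sketch of how that indicator could be added at the fact-table level, mirroring the existing IndicatorA column (the table name FactTable is an assumption), so that summing IndicatorB as a measure gives the non-NULL row count needed for the [AmountB] average:
-- Named calculation / view column feeding the cube; FactTable is a placeholder name.
SELECT
    [Key],
    AmountA,
    IndicatorA,
    AmountB,
    CASE WHEN AmountB IS NULL THEN 0 ELSE 1 END AS IndicatorB
FROM FactTable;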