tidyr/purrr nest() and map_dbl() for stats (e.g., max, mean) returning incorrect values - tidyr

I'm trying to mutate a variety of summary statistics based on various groupings in my nested data. I'd like to use this strategy instead of summarize() as I want to store the summary statistics in a tibble with the original data, including other identifying variables.
Group Name  Page Name  User     Date      num  min
Area A      Page 1     user265  22-04-13   14   10
Area A      Page 1     user265  22-04-14    5    3
Area B      Page 2     user275  22-04-01   12    6
There are 8 'groups' and hundreds of 'pages' nested across those groups. Before nesting, each row represented observations by Page Name, User, and Date.
When grouping/nesting by Group Name, the stats I generate for either 'num' or 'min' match what would be expected based on the nested values.
However, when I group by Page Name, the results make no sense based on the data in the nested table. For example, the minimum value for 'num' and 'min' is 1, yet I'll get a mean of 0.10 and a min of 0 for one page. Due to the long format of the data, there are no missing values. I'm not sure why the results aren't consistent with the actual data in the nested table when grouping by Page Name.
library(tidyverse) # dplyr, tidyr and purrr are all used below
adopt_30_nest <- adopt_30 %>%
#Select variables of interest
select(page_name, group_name, user, date, num, min) %>%
#Nest/group by grouping factor, then nest
group_by(page_name) %>%
nest() %>%
#Create a new dbl column with the max num value for each page.
mutate(max_num = map_dbl(data, ~max(.x$num)))
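As a sanity check, something like this (a minimal sketch reusing the pipeline and column names above) shows what each nested tibble actually contains:
adopt_30_nest %>%
  mutate(
    n_rows   = map_int(data, nrow),                # observations per page
    num_type = map_chr(data, ~ class(.x$num)[1]),  # confirm num is numeric, not character
    max_num2 = map_dbl(data, ~ max(.x$num))        # recompute the per-page max
  )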
Any ideas for how to fix this? Thanks!

Related

Function to sum all first values of Result - SQL

I have a table with "Number", "Name" and "Result" columns. Result is a 2D text array, and I need to create a column with the name "Average" that sums all first values of the Result array and divides by 2. Can somebody help me, please? I must use CREATE FUNCTION for this. It looks like this:
Table1

Number  Name   Result                        Average
01      Kevin  {{2.0,10},{3.0,50}}           2.5
02      Max    {{1.0,10},{4.0,30},{5.0,20}}  5.0
Average = ((2.0+3.0)/2) = 2.5
= ((1.0+4.0+5.0)/2) = 5.0
First of all: you should always avoid storing arrays in a table (or only generate them in a subquery if absolutely necessary). Normalize the data; it makes life much easier in nearly every use case.
Second: you should avoid multidimensional arrays. They are very hard to handle. See Unnest array by one level
However, in your special case you could do something like this:
demo:db<>fiddle
SELECT
number,
name,
SUM(value) FILTER (WHERE idx % 2 = 1) / 2 -- 2
FROM mytable,
unnest(avg_result) WITH ORDINALITY as elements(value, idx) -- 1
GROUP BY number, name
unnest() expands the array elements into one element per record. But this is not a one-level expansion: it expands ALL elements in depth. To keep track of your elements, you can add an index using WITH ORDINALITY.
Because you have nested two-element arrays, the unnested data can be used as follows: you want to sum the first of every two elements, which is every second (odd-indexed) element. Using the FILTER clause in the aggregation lets you aggregate exactly these elements.
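To see what the unnest produces, here is a small illustration (values taken from Kevin's row above; PostgreSQL assumed):

SELECT value, idx
FROM unnest('{{2.0,10},{3.0,50}}'::numeric[][]) WITH ORDINALITY AS elements(value, idx);

-- value | idx
--   2.0 |   1   <- first element, kept by FILTER (idx % 2 = 1)
--    10 |   2
--   3.0 |   3   <- kept
--    50 |   4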
However: if that array is itself the result of a subquery, you should think about doing this operation BEFORE the array aggregation (if the aggregation is really necessary at all). That makes things easier.
Assumptions:
The number column is the primary key.
The result column is of text or varchar type.
Here are the steps for your requirement:
Add the column to your table using the following query (you can skip this step if the column has already been added):
alter table table1 add column average decimal;
Update the calculated values using the query below:
update table1 t1
set average = t2.value_
from
(
select
number,
sum(t::decimal)/2 as value_
from table1
cross join lateral unnest((result::text[][])[1:999][1]) as t
group by 1
) t2
where t1.number=t2.number
Explanation: here unnest((result::text[][])[1:999][1]) returns the first value of each child array (this assumes you can have up to 999 child arrays in your 2D array; you can increase or decrease that bound as per your requirement).
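For instance, the slice alone behaves like this (a small illustration using Kevin's row; PostgreSQL assumed):

SELECT unnest(('{{2.0,10},{3.0,50}}'::text[][])[1:999][1]);

-- returns two rows: 2.0 and 3.0 (the first value of each child array)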
DEMO
Now you can create your function as per your requirement with the above query.
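One possible shape for that function (a minimal sketch, PostgreSQL assumed; the name first_values_avg is illustrative, not from the question):

CREATE OR REPLACE FUNCTION first_values_avg(result text[])
RETURNS decimal AS $$
    -- Sum the first element of each child array, then divide by 2,
    -- matching the "Average" definition in the question.
    SELECT sum(t::decimal) / 2
    FROM unnest(result[1:999][1]) AS t;
$$ LANGUAGE sql;

-- Usage:
-- SELECT number, name, first_values_avg(result::text[][]) AS average FROM table1;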

Transform a column of type string to an array/record i.e. nesting a column

I am trying to calculate and retrieve some indicators from multiple tables I have in my dataset on BigQuery. I want to invoke nesting on sfam, which is a column of strings that could have values or be null, and I can't do that for now. So the goal is to transform that column into an array/record; that's the idea that came to mind, and I have no idea how to go about doing it.
The product and cart tables are grouped by key_web, dat_log, univ, suniv, fam and sfam.
The data is broken down into universes referred to as univ, which are composed of sub-universes referred to as suniv. Sub-universes contain families referred to as fam, which may or may not have sub-families referred to as sfam. I want to invoke nesting on prd.sfam to reduce the resulting columns.
The data is collected from Google Analytics for insight into website traffic and user activity.
I am trying to get information and indicators about each visitor: the amount of time he/she spent on particular pages, actions taken, and so on. The resulting table gives me the sum of time spent on those pages, the sum of the total number of visits for a single day, and a breakdown of which category each row belongs to, hence the univ, suniv, fam and sfam columns, which are of type string (sfam could be null, since some sub-universes suniv only have families fam and don't go down to a sub-family level sfam).
dat_log: refers to the date
nrb_fp: number of views for a product page
tps_fp: total time spent on said page
I tried different methods that I found online, but none worked, so I am posting my code and problem in the hope of finding guidance and a solution!
A simpler query would be:
select
prd.key_web
, dat_log
, prd.nrb_fp
, prd.tps_fp
, prd.univ
, prd.suniv
, prd.fam
, prd.sfam
from product as prd
left join cart as cart
on prd.key_web = cart.key_web
and prd.dat_log = cart.dat_log
and prd.univ = cart.univ
and prd.suniv = cart.suniv
and prd.fam = cart.fam
and prd.sfam = cart.sfam
And this is a sample result of the query for the last 6 columns (shown as text and images in the original post):
Again, I want to get sfam as a column of arrays holding all the string values of sfam, even nulls.
I limited the output to only the last 6 columns; the first 3 are the row number, key_web and dat_log. Each fam is composed of several sfam or none (null); I want to be able to nest on either fam or sfam.
I want to get a column of array as sfam where I have all the string values of sfam even nulls.
This is not possible in BigQuery. As the documentation explains:
Currently, BigQuery has two following limitations with respect to NULLs and ARRAYs:
BigQuery raises an error if query result has ARRAYs which contain NULL elements, although such ARRAYs can be used inside the query.
That is, your result set cannot contain an array with NULL elements.
Obviously, in BigQuery you cannot output an array which holds NULLs, but if for some reason you need to preserve them somehow, the workaround is to create an array of structs as opposed to an array of single elements.
For example (BigQuery Standard SQL), if you try to execute the query below
SELECT ['a', 'b', NULL] arr1, ['x', NULL, NULL] arr2
you will get error: Array cannot have a null element; error in writing field arr1
While if you try the query below
SELECT ARRAY_AGG(STRUCT(val1, val2)) arr
FROM UNNEST(['a', 'b', NULL]) val1 WITH OFFSET
JOIN UNNEST(['x', NULL, NULL]) val2 WITH OFFSET
USING(OFFSET)
you get this result:
Row  arr.val1  arr.val2
1    a         x
     b         null
     null      null
As you can see, approaching it this way, you can even have both elements as NULL.
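Applied back to the original query, the same workaround might look like this (a sketch only; the grouping keys and the sfam_arr name are assumptions):

SELECT
  prd.key_web,
  prd.dat_log,
  prd.univ,
  prd.suniv,
  prd.fam,
  -- NULL sfam values survive inside the struct elements
  ARRAY_AGG(STRUCT(prd.sfam AS sfam)) AS sfam_arr
FROM product AS prd
GROUP BY prd.key_web, prd.dat_log, prd.univ, prd.suniv, prd.fam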

Wrapping a range of data

How would I select a rolling/wrapping* set of rows from a table?
I am trying to select a number of records (per type, 2 or 3) for each day, wrapping when I 'run out'.
Eg.
2018-03-15: YyBiz, ZzCo, AaPlace
2018-03-16: BbLocation, CcStreet, DdInc
These are rendered within an SSRS report for Dynamics CRM, so I can do light post-query operations.
Currently I get to:
2018-03-15: YyBiz, ZzCo
2018-03-16: AaPlace, BbLocation, CcStreet
First, getting a number for each record with:
SELECT name, ROW_NUMBER() OVER (PARTITION BY type ORDER BY name) as RN
FROM table
Within SSRS, I then adjust RN to reflect the number of each type I need:
OnPageNum = FLOOR((RN+num_of_type-1)/num_of_type)-1
--Shift RN to be 0-indexed.
Resulting in AaPlace, BbLocation and CcStreet having a PageNum of 0, DdInc of 1, ... YyBiz and ZzCo of 8.
Then using an SSRS Table/Matrix linked to the dataset, I set the row filter to something like:
RowFilter = MOD(DateNum, NumPages(type)) == OnPageNum
Where DateNum is essentially days since epoch, and each page has a separate table and day passed in.
At this point, it shows only N records of each type per page, but if the total number of records of a type isn't a multiple of the number of records per page for that type, there will be pages with fewer records than required.
Is there an easier way to approach this/what's the next step?
*Wrapping such as Wraparound found in videogames, seamless resetting to 0.
To achieve this effect, I found that offsetting the RowNumber by -DateNum*num_of_type (negative for positive ordering), then taking it modulo COUNT(type), provides the correct "wrap around" effect.
To achieve the desired pagination, the result then just has to be divided by num_of_type and floored, as below:
RowFilter: FLOOR(((RN-DateNum*num_of_type) % count(type))/num_of_type) == 0
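The same wrap-around selection can also be done entirely on the SQL side; here is a minimal sketch (SQL Server assumed, since this backs a Dynamics CRM report; mytable, @DateNum and @n are placeholder names):

DECLARE @DateNum int = 8;  -- days since epoch for the page being rendered
DECLARE @n int = 3;        -- records per type per day

SELECT name, type
FROM (
    SELECT name, type,
           ROW_NUMBER() OVER (PARTITION BY type ORDER BY name) - 1 AS rn0,  -- 0-indexed
           COUNT(*) OVER (PARTITION BY type) AS cnt
    FROM mytable
) t
-- Keep rows whose 0-based index falls in [DateNum*n, DateNum*n + n),
-- wrapped modulo the per-type count; the double modulo keeps it non-negative.
WHERE ((rn0 - @DateNum * @n) % cnt + cnt) % cnt < @n
ORDER BY type, rn0;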

How to display n number of rows between two groups in SSRS 2010

Using SSRS 2010
I have two groups, YearMonth and Insured. I need to display only 50 records per page based on the group Insured, so I created a parent group GroupPageBreakOnly and used the expression =CEILING(RowNumber(Nothing)/50).
I ensured that "Page Break at end" is checked so that individual groups appear on individual pages.
As a result, the first page displays 31 rows, the second 50 rows, and the third 9 rows.
I tried specifying the data region "Insured":
=CEILING(RowNumber("Insured")/50),
but it gives me an error:
...the value of the scope parameter of RowNumber must equal the name of the group directly containing the current group.
What am I missing here?
Unless you need this report to do other things, I would apply the grouping and aggregation in the dataset itself, which is generally a lot more efficient anyway.
Have you tried using ROW_NUMBER() OVER (PARTITION BY YearMonth, Insured ORDER BY YearMonth, Insured) to number the rows, perhaps even throwing in a % 50 at the end to see which group of 50 each row falls into?
This can then be grouped on in your report.
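A minimal sketch of that dataset-side numbering (the table name InsuredData is an assumption, and integer division is used instead of modulo so each 50-row bucket gets a label):

SELECT
    YearMonth,
    Insured,
    -- Integer division buckets every 50 rows within a YearMonth/Insured group;
    -- group on PageGroup in the report and page-break at the end of that group.
    (ROW_NUMBER() OVER (PARTITION BY YearMonth, Insured
                        ORDER BY YearMonth, Insured) - 1) / 50 AS PageGroup
FROM InsuredData;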

Qlikview calculation of range for frequencies

I have been given a task to calculate the frequency of calls across a territory. If a rep called a physician regarding the sale of the product 5 times, then the frequency is 5 and the HCP count is 1. I generated frequencies from 1 to 124 in my pivot table using a calculated dimension, which is working fine. But my concern is:
My manager wants frequencies up to 19 listed in order (1, 2, 3, 4, 5, 6, ... 19),
and the frequencies from 21-124 grouped as 20+.
I would be grateful if someone could help me with this.
Use the Class function in the dimension to split into buckets:
=class(CallId,5)
And the expression:
=count(Distinct CallId)
You can then customize the output by adding parameters:
class( var,10 ) with var = 23 returns '20<=x<30'
class( var,5,'value' ) with var = 23 returns '20<= value <25'
class( var,10,'x',5 ) with var = 23 returns '15<=x<25'
I think you can do this with a calculated dimension.
If your data has one row per physician coming from the load, the statement below will likely work.
Dimension
- =IF(CallCount<=19,CallCount,'+20')
Expression
- =COUNT(DISTINCT Physician_ID)
Sort
- Numeric Value Ascending
If your data has to be aggregated (more than one call row per provider coming from the load), try the above, substituting the expression below for the Dimension.
Dimension
- =IF(AGGR(SUM(CallCount), Physician_ID) <=19,AGGR(SUM(CallCount), Physician_ID),'+20')