How to replicate GA segments in BigQuery

There is a segment (sequence start) in GA for the first user interaction.
It looks like this:
Step 1
Default Channel Grouping
(does not match regex)
Organic Search|Direct
OR
pageScreenType
(does not match regex)
channel|home|topic|tag
There is a GA export table in BigQuery, and I tried a WHERE clause like
channelGrouping NOT IN ('Organic Search', 'Direct') OR pageScreenType NOT IN ('home', 'tag', 'topic', 'channel')
but the numbers do not match Google Analytics at all.
How can I replicate the GA segment in BigQuery?
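One likely cause of the mismatch is scope: the segment is sequence-scoped, so GA evaluates the condition on the first interaction and then keeps the whole session (or user), while a plain WHERE clause filters every individual row. Below is a minimal Standard SQL sketch of the session-scoped approach; the table name, sessionId, and hitTimestamp columns are assumptions, since the exact schema isn't shown here.

-- Sketch only: evaluate the segment condition on the FIRST interaction
-- of each session, then keep every row of the matching sessions.
-- `project.dataset.my_table`, sessionId and hitTimestamp are assumed names.
WITH first_interaction AS (
  SELECT
    sessionId,
    channelGrouping,
    pageScreenType,
    ROW_NUMBER() OVER (PARTITION BY sessionId ORDER BY hitTimestamp) AS rn
  FROM `project.dataset.my_table`
),
matching_sessions AS (
  SELECT sessionId
  FROM first_interaction
  WHERE rn = 1
    AND (NOT REGEXP_CONTAINS(channelGrouping, r'Organic Search|Direct')
         OR NOT REGEXP_CONTAINS(pageScreenType, r'channel|home|topic|tag'))
)
SELECT t.*
FROM `project.dataset.my_table` AS t
JOIN matching_sessions USING (sessionId)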

Getting search suggestions to work on 2 (or more) non-consecutive words (to improve search on a medical conditions list - ICD10 codes)

Context:
We are using Azure Cognitive Services in a mobile app to search patient diagnostic codes (ICD10 codes).
The ICD10 code list is approximately 94,000 items. For anyone interested here is a list.
We currently have set up a standard Lucene analyser on the diagnostic description field.
Requirement:
We want to provide a really good search as you type experience, which provides the most relevant suggestions
Using the Suggest method with the fuzzy parameter set to true works reasonably well for a single search term:
As you can see it does well in finding partial matches and is resilient to typos.
The issue comes in when I add a second search term. E.g. I want to search for asthma that is moderate:
In both these examples, there is no match.
So when searching for more than one term, requiring the user to enter the terms in the exact order in which they appear in the data is not a good user experience.
Using the Search method instead, we can overcome the problem of finding matches where 2 search terms are supplied that do not appear consecutively in the data:
And this is resilient to typos
However, this is not good at finding partial matches (like the Suggest does).
E.g. in this search, we would still want the term moderate to be picked up:
It seems that if we could combine a wildcard search with a fuzzy search we could solve this problem, e.g. by supplying the following search phrase: ashtma~* AND moder~*.
But from what we have seen, this syntax is not supported.
Any suggestions on how to overcome this limitation so we can get the best of both worlds, i.e.:
For 2 or more search terms, it will work on partial matches
And the search terms are treated independently and do not need to appear consecutively in the data
Many thanks in advance,
Andreas.
I recommend using (or at least experimenting with) Lucene ngrams.
An example custom analyzer can use the NGramTokenFilter.
This filter splits each source token into one or more indexed tokens by chopping up the source into substrings of different lengths.
An example from the above link:
"abc" will give "a", "ab", "abc", "b", "bc", "c"
You can, as an example, set each token to be from 3 to 5 characters long (but this is one of the areas where you can experiment with different settings).
When you use this analyzer for indexing, it's going to create many more tokens (larger index) but that gives you more searching flexibility.
Use the same analyzer for searching.
If the user enters the following two words as their search values:
ashtma moder
You would convert that into the following Lucene search phrase:
ashtma~ AND moder~
This will find the following hits:
doc id = 12877
field = Moderate persistent asthma with status asthmaticus
doc id = 12874
field = Moderate persistent asthma
doc id = 12875
field = Moderate persistent asthma, uncomplicated
doc id = 12876
field = Moderate persistent asthma with (acute) exacerbation
doc id = 94210
field = Family history of asthma and oth chronic lower resp diseases
doc id = 6970
field = Xanthelasma of right lower eyelid
doc id = 6973
field = Xanthelasma of left lower eyelid
doc id = 6979
field = Chloasma of right lower eyelid and periocular area
doc id = 6982
field = Chloasma of left lower eyelid and periocular area
As you can see it does find some false positives, but the first four hits (the highest scored) are the ones you want.
You will need to evaluate how this approach performs in terms of index size and search speed.
One reason for suggesting ngrams is your point about wanting to handle misspellings: ngrams may help to isolate spelling mistakes into smaller tokens, since the ~ fuzzy search operator is fairly limited in what it can handle. But definitely experiment with different ngram lengths - and maybe also without using ngrams at all.

How to get count of active users grouped by version? (from Firebase using BigQuery)

Problem description
I'm trying to find out how many active users I have in my app, broken down by the 2 or 3 latest versions of the app.
I've read some documentation and other Stack Overflow questions, but none of them solved my problem (and some had outdated solutions).
Examples of solutions I tried:
https://support.google.com/firebase/answer/9037342?hl=en#zippy=%2Cin-this-article (N-day active users - this solution is probably the best, but even after changing the dataset name correctly and removing the _TABLE_SUFFIX conditions, it kept returning a single column n_day_active_users_count = 0)
https://gist.github.com/sbrissenden/cab9bd3a043f1879ded605cba5005457
(this one returns no values for me, and I don't understand why)
How can I get count of active Users from google analytics (this is not a good fit because the other part of my job is already done and generating charts in Data Studio, so using the REST API would make it harder to join my two solutions - one from BigQuery and the other from the REST API)
Discrepancies on "active users metric" between Firebase Analytics dashboard and BigQuery export (this one uses outdated variables)
So, I started to write the solution out of my head, and this is what I get so far:
SELECT
  user_pseudo_id,
  app_info.version,
  ROUND(COUNT(DISTINCT user_pseudo_id) OVER (PARTITION BY app_info.version)
        / SUM(COUNT(DISTINCT user_pseudo_id)) OVER (), 3) AS adoption
FROM `projet-table.events_*`
WHERE platform = 'ANDROID'
GROUP BY app_info.version, user_pseudo_id
ORDER BY app_info.version
Conclusions
I'm not sure if my logic is correct, but I think I can use user_pseudo_id to calculate it, right? The general idea is: users_of_X_version / users_of_all_versions.
(And the results are kinda close to the ones showing at Google Analytics web platform - I believe the difference is due to the date that I turned on the BigQuery integration. But.... I'd like some confirmation on that: if my logic is correct).
The biggest problem in my code right now is that I cannot write it without grouping by user_pseudo_id (when I don't, BigQuery says: "SELECT list expression references column user_pseudo_id which is neither grouped nor aggregated at [2:3]"), and that's why I have duplicated rows in the query result.
Also, about the first linked example: is there any possibility of a record having an engagement_time_msec param with a value < 0? If not, why is that condition in the WHERE clause?
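For reference, a minimal sketch of how the count per version could be written without grouping by user_pseudo_id. The date range below is an assumed placeholder, and this counts any user_pseudo_id with at least one event as "active"; the Firebase/GA4 dashboards also factor in engagement, which is another source of discrepancies.

-- Sketch only: distinct users per app version, plus share of total.
-- The _TABLE_SUFFIX range is an assumed placeholder.
SELECT
  app_info.version AS app_version,
  COUNT(DISTINCT user_pseudo_id) AS active_users,
  ROUND(COUNT(DISTINCT user_pseudo_id)
        / SUM(COUNT(DISTINCT user_pseudo_id)) OVER (), 3) AS adoption
FROM `projet-table.events_*`
WHERE platform = 'ANDROID'
  AND _TABLE_SUFFIX BETWEEN '20230101' AND '20230131'
GROUP BY app_version
ORDER BY app_version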

How to unnest Google Analytics custom dimension in Google Data Prep

Background story:
We use Google Analytics to track user behaviour on our website. The data is exported daily into Big Query. Our implementation is quite complex and we use a lot of custom dimensions.
Requirements:
1. The data needs to be imported into our internal databases to enable better and more strategic insights.
2. The process needs to run without requiring human interaction
The problem:
Google Analytics data needs to be in a flat format so that we can import it into our database.
Question: How can I unnest custom dimensions data using Google Data Prep?
What does it look like?
----------------
customDimensions
----------------
[{"index":10,"value":"56483799"},{"index":16,"value":"·|·"},{"index":17,"value":"N/A"}]
What do I need it to look like?
----------------------------------------------------------
customDimension10 | customDimension16 | customDimension17
----------------------------------------------------------
56483799 | ·|· | N/A
I know how to achieve this using a standard SQL query in the BigQuery interface, but I really want to have a Google Data Prep flow that does it automatically.
Define the flat format and create it in BigQuery first.
You could
create one big table and repeat several values using CROSS JOINs on all the arrays in the table
create multiple tables (per array) and use ids to connect them, e.g.
for session custom dimensions concatenate fullvisitorid / visitstarttime
for hits concatenate fullvisitorid / visitstarttime / hitnumber
for products concatenate fullvisitorid / visitstarttime / hitnumber / productSku
The second option is a bit more effort, but you save storage because you're not repeating all the information everywhere.
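As an illustration of "create the flat format in BigQuery first", here is a minimal Standard SQL sketch that pivots hit-scoped custom dimensions 10, 16 and 17 into columns; the project/dataset/table names are placeholders and the standard GA 360 export schema (ga_sessions_YYYYMMDD) is assumed.

-- Sketch only: one row per hit, custom dimensions pivoted to columns.
SELECT
  fullVisitorId,
  visitStartTime,
  h.hitNumber AS hitNumber,
  MAX(IF(cd.index = 10, cd.value, NULL)) AS customDimension10,
  MAX(IF(cd.index = 16, cd.value, NULL)) AS customDimension16,
  MAX(IF(cd.index = 17, cd.value, NULL)) AS customDimension17
FROM `project.dataset.ga_sessions_20200101` AS s,
  UNNEST(s.hits) AS h
LEFT JOIN UNNEST(h.customDimensions) AS cd
GROUP BY fullVisitorId, visitStartTime, hitNumber

The resulting flat table can then be picked up by Data Prep (or imported directly), which avoids having to unnest inside the flow itself.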

How much storage capacity does my dataset or table consume?

I have multiple datasets, each with hundreds of tables, in Google BigQuery. I'd like to remove some old, legacy data, and I am looking for the most convenient way to find out how much storage space each of my datasets and tables is occupying, so I can make an educated decision about which datasets/tables I may remove.
I tried to use the bq command-line tool but couldn't find a way to display storage information for a table or for an entire dataset.
You can access metadata about the tables in a dataset by using the __TABLES__ meta-table. For example:
select * from [publicdata:samples.__TABLES__]
returns
project_id dataset_id table_id creation_time last_modified_time row_count size_bytes type
publicdata samples github_nested 1348782587310 1348782587310 2541639 1694950811 1
publicdata samples github_timeline 1335915950690 1335915950690 6219749 3801936185 1
publicdata samples gsod 1335916040125 1440625349328 14420316 17290009238 1
publicdata samples natality 1335916045005 1440625330604 37826763 23562717384 1
publicdata samples shakespeare 1335916045099 1440625429551 164656 6432064 1
publicdata samples trigrams 1335916127449 1445684180324 68051509 277168458677 1
publicdata samples wikipedia 1335916132870 1445689914564 13797035 38324173849 1
More documentation here: https://cloud.google.com/bigquery/querying-data
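If the goal is to size up whole datasets at once, the same meta-table can be rolled up. Below is a small legacy-SQL sketch; note that each dataset has its own __TABLES__, so you would run it once per dataset or list several meta-tables in the FROM clause (which legacy SQL treats as a UNION ALL).

-- Sketch only: total size per dataset from the __TABLES__ meta-table.
SELECT
  dataset_id,
  COUNT(*) AS table_count,
  SUM(row_count) AS total_rows,
  ROUND(SUM(size_bytes) / POW(1024, 3), 2) AS total_gb
FROM [publicdata:samples.__TABLES__]
GROUP BY dataset_id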
Below is an example of how to combine the use of metadata (as in the answer by Moshapasumansky) with visualization (as recommended by DoITInternational), all without leaving the BigQuery Web UI, but you will need the BigQuery Mate Chrome extension.
Assuming you have the extension, follow the steps below:
Step 1 - Run Query against tables metadata in publicdata:samples dataset
SELECT
  table_id,
  DATE(TIMESTAMP(creation_time/1000)) AS Created,
  DATE(TIMESTAMP(last_modified_time/1000)) AS Modified,
  row_count AS Rows,
  ROUND(size_bytes/POW(1024, 3)) AS GB
FROM [publicdata:samples.__TABLES__]
Step 2 - Move to JSON View
Step 3 - Expand Result Panel by Clicking on + Button
This is for two reasons:
To bring up to 500 records at a time into the result panel (which should cover your case, since you mentioned you have hundreds of tables), versus the relatively limited number of rows at a time currently supported by the native UI
To free up more real estate for the chart
Step 4 - Close Query Editor (optional) - more real estate for the chart
Step 5 - Click Show Pivot to bring up the Pivot/Chart Tool with the data from the result, and then design your pivot chart the way you like (as in the example screenshot below)
It might not be the best way, but at least it allows you to do what you want here without leaving the web UI. In some cases I think it can be the preferred option.
Rather than using the BigQuery API (the Tables: get method specifically) and looking at numBytes in the response, I suggest using BQdu, the BigQuery Disk Usage web application. It will scan your project for datasets and tables and display a nice visualization showing how much storage each table (or entire dataset) is consuming.

Ignore similar values and not treat as duplicate records

I'm writing a SELECT query in SQL Server and I have a question.
I have two rows like this:
ID Address City Zip
1 123 Wash Ave. New York 10035
1 123 Wash Ave New York 10035
I have many addresses that are the same, but some of them differ only by a dot or some other small difference.
They are almost identical, so how can I find all such cases?
Using the UPS Online APIs, our solution was not to correct the error but to help sort the results that best represent the correct answer.
With the results returned by UPS, we run various filters against the original source address and each of the returned responses, then use a weighting system to sort the results and present them to our CSR, who selects the most logical "correctly formatted" answer from UPS.
We build a scorecard from the result set, for example the number of incorrect digits in the ZIP Code (which catches fat-fingering).
Another measure removes all punctuation marks and ranks how close the address is after that.
Lastly, we pass the results through a matrix of standard substitutions [ST for STREET] and do a final ranking.
From all these scores we sort the results from most likely to least likely and present them to an individual, who then selects the correct answer to save in our database.
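If you want to experiment with the normalization and substitution idea directly in SQL Server, here is a minimal T-SQL sketch; the table name dbo.Addresses is an assumption, and the substitution list would need to be extended for real data.

-- Sketch only: strip punctuation, apply a sample substitution,
-- then group to surface near-duplicate addresses.
WITH normalized AS (
  SELECT
    ID,
    City,
    Zip,
    REPLACE(REPLACE(REPLACE(UPPER(Address), '.', ''), ',', ''),
            ' AVENUE', ' AVE') AS NormAddress
  FROM dbo.Addresses
)
SELECT NormAddress, City, Zip, COUNT(*) AS Occurrences
FROM normalized
GROUP BY NormAddress, City, Zip
HAVING COUNT(*) > 1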
Correcting these errors now serves two purposes:
1) We look good to our customers by having the correct address information on the billing (not just close enough).
2) We save on secondary charges from UPS by not being billed for incorrect addresses.