How to get count of active users grouped by version? (from Firebase using BigQuery)

Problem description
I'm trying to find out how many active users my app has, broken down by the 2 or 3 latest versions of the app.
I've read some documentation and other Stack Overflow questions, but none of them solved my problem (and some had outdated solutions).
Examples of solutions I tried:
https://support.google.com/firebase/answer/9037342?hl=en#zippy=%2Cin-this-article (N-day active users. This solution is probably the best, but even after correcting the dataset name and removing the _TABLE_SUFFIX conditions, it kept returning a single column with n_day_active_users_count = 0.)
https://gist.github.com/sbrissenden/cab9bd3a043f1879ded605cba5005457
(this does not return any values for me, and I don't understand why)
How can I get count of active Users from google analytics (this is not a good fit because the other part of my job is already done and generating charts in Data Studio, so using the REST API would make it harder to join my two solutions: one from BigQuery and the other from the REST API)
Discrepancies on "active users metric" between Firebase Analytics dashboard and BigQuery export (this one uses outdated variables)
So I started writing a solution from scratch, and this is what I have so far:
SELECT
user_pseudo_id,
app_info.version,
ROUND(COUNT(DISTINCT user_pseudo_id) OVER (PARTITION BY app_info.version) / SUM(COUNT(DISTINCT user_pseudo_id)) OVER (), 3) AS adoption
FROM `projet-table.events_*`
WHERE platform = 'ANDROID'
GROUP BY app_info.version, user_pseudo_id
ORDER BY app_info.version
Conclusions
I'm not sure if my logic is correct, but I think I can use user_pseudo_id to calculate it, right? The general idea is: user_of_X_version/users_of_all_versions.
(The results are fairly close to the ones shown on the Google Analytics web platform. I believe the difference is due to the date on which I turned on the BigQuery integration, but I'd like some confirmation that my logic is correct.)
The biggest problem in my code now is that I cannot write it without grouping by user_pseudo_id (when I don't, BigQuery says: "SELECT list expression references column user_pseudo_id which is neither grouped nor aggregated at [2:3]"), and that's why I have duplicated rows in the query result.
Also, about the first link in the examples: is it possible for a record to have an engagement_time_msec param with a value < 0? If not, why is that condition in the WHERE clause?
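The "users of version X divided by users of all versions" idea can be sketched outside BigQuery to check the arithmetic. The following is a minimal Python sketch, not a tested BigQuery solution: the function name adoption_by_version and the sample (user_pseudo_id, version) pairs are made up for illustration. It counts distinct users per version (so repeated events from one user do not produce duplicate rows) and divides by the number of distinct users overall. Note the assumption that a user who appears under two versions is counted in both, so the shares can add up to more than 1.

```python
from collections import defaultdict

def adoption_by_version(events):
    """events: iterable of (user_pseudo_id, app_version) pairs, one per logged event."""
    users_per_version = defaultdict(set)
    all_users = set()
    for user_id, version in events:
        users_per_version[version].add(user_id)  # a set keeps each user once per version
        all_users.add(user_id)
    total = len(all_users)
    if total == 0:
        return {}
    # One entry per version: distinct users on that version / distinct users overall
    return {
        version: round(len(users) / total, 3)
        for version, users in sorted(users_per_version.items())
    }

# Hypothetical sample data: u1 and u3 each logged two events
sample_events = [
    ("u1", "1.0"), ("u1", "1.0"),
    ("u2", "1.0"),
    ("u3", "1.1"), ("u3", "1.1"),
    ("u4", "1.1"),
    ("u5", "1.1"),
]
adoption = adoption_by_version(sample_events)
print(adoption)
```

In SQL terms this corresponds to grouping by version only and aggregating user_pseudo_id with a distinct count, instead of listing user_pseudo_id in the SELECT list.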

Related

Incremental load of a full api call

I have an API from which I need to load signup data into my database and aggregate it daily. Every time I call the API I get a full copy of the data. Sometimes old accounts get deleted, so the historical data will change.
This is what the data from the API looks like:
I want to aggregate it like so, to see the daily account creations and activations:
Now, what I could do is a daily import of the full data and then aggregate like this:
SELECT
Current_date() as snapshot_date,
SUM(CASE WHEN accountCreateOn = current_date() THEN 1 ELSE 0 END) as accountCreateOn,
SUM(CASE WHEN accountActivateOn = current_date() THEN 1 ELSE 0 END) as accountActivateOn
FROM full_data
But this doesn't seem very fault tolerant. What happens if the pipeline fails for a couple of days? What would be the right way to solve such a problem?
The easiest and most fault-tolerant way is to store the data you receive completely and in full detail. You can't get any better information, and leaving out information (which includes aggregating it) always carries the danger that you will one day want to answer a question that could have been answered on the complete dataset but can't be answered on the reduced one.
The only reason to leave this path would be datasets so huge that storing and processing them isn't feasible. For a modern DBMS running on modern hardware, it's rather unlikely that you will run into that problem. So I would create synthetic test data of the maximum size that I expect for my business, say 10 times the account activations per year that I dream of. If the database can handle this, you have one less problem to worry about.
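As a rough illustration of why keeping the detail rows is more fault tolerant: if the full data is stored, the daily counts can be derived by grouping on the dates themselves rather than comparing against current_date(), so a pipeline outage of a few days loses nothing. The sketch below uses Python's built-in sqlite3 module with made-up rows; the account_id column and the sample dates are assumptions, while full_data, accountCreateOn and accountActivateOn come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE full_data (account_id TEXT, accountCreateOn TEXT, accountActivateOn TEXT)"
)
# Hypothetical copy of the latest full API response
conn.executemany(
    "INSERT INTO full_data VALUES (?, ?, ?)",
    [
        ("a1", "2021-03-01", "2021-03-01"),
        ("a2", "2021-03-01", "2021-03-02"),
        ("a3", "2021-03-02", None),  # created but not yet activated
    ],
)

# Group on the event dates stored in the rows, not on the date the job happens to run
daily_query = """
SELECT day,
       SUM(created)   AS accounts_created,
       SUM(activated) AS accounts_activated
FROM (
    SELECT accountCreateOn AS day, 1 AS created, 0 AS activated FROM full_data
    UNION ALL
    SELECT accountActivateOn, 0, 1 FROM full_data WHERE accountActivateOn IS NOT NULL
)
GROUP BY day
ORDER BY day
"""
daily = conn.execute(daily_query).fetchall()
for row in daily:
    print(row)
```

Because every run recomputes all days from the stored detail, rerunning after a failure fills in the missed days. Keeping each day's full copy with a snapshot date would additionally preserve accounts that are later deleted.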

How to group similar GTFS trips

I need to group GTFS trips into human-understandable "route variants", since one route can run different trips depending on day, time, etc.
Is there any preferred way to group similar trips? Trip shape_id looks promising, but is there any guarantee that all similar trips have the same shape_id?
My GTFS data is imported into my SQL database, and the database structure is the same as the GTFS txt files.
UPDATE
I'm not looking for an SQL query example; I'm looking for a high-level explanation of how to group similar trips into user-friendly "route variants".
Many route planning apps (like Moovit) use GTFS data as source and they display different route variants to users.
There is no official way to do this. The best way is probably to group by the ordered list of stops on each trip, sometimes known as the "stopping pattern" of the trip. The idea has been discussed at a conceptual level by Mapzen.
In practice, I have created concatenated strings of all stops on a given trip (from stop_times), and grouped by that to define similar trips. E.g., if the stops on a given trip are A, B, C, D, and E, create a string A-B-C-D-E or A_B_C_D_E and group trips on that string. This functionality is not part of the SQL spec, although MySQL implements it as GROUP_CONCAT and PostgreSQL uses arrays and array_to_string. You may also want to add route_id and shape_id into the grouping as well, to handle some corner cases.
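The stopping-pattern grouping described above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: stop_times is taken to be (trip_id, stop_sequence, stop_id) rows as in stop_times.txt, trip_routes is a hypothetical trip_id to route_id mapping built from trips.txt, and the function name is made up.

```python
from collections import defaultdict

def group_by_stopping_pattern(stop_times, trip_routes):
    """Group trips that serve the same ordered list of stops on the same route.

    stop_times: iterable of (trip_id, stop_sequence, stop_id) rows
    trip_routes: dict mapping trip_id -> route_id
    """
    stops_by_trip = defaultdict(list)
    for trip_id, stop_sequence, stop_id in stop_times:
        stops_by_trip[trip_id].append((stop_sequence, stop_id))

    variants = defaultdict(list)
    for trip_id, stops in stops_by_trip.items():
        # Order stops by stop_sequence, then join the stop IDs into one pattern string
        pattern = "-".join(stop_id for _, stop_id in sorted(stops))
        variants[(trip_routes[trip_id], pattern)].append(trip_id)
    return {key: sorted(trips) for key, trips in variants.items()}

# Hypothetical sample: t1 and t2 stop at A, B, C; t3 skips B
sample_stop_times = [
    ("t1", 1, "A"), ("t1", 2, "B"), ("t1", 3, "C"),
    ("t2", 1, "A"), ("t2", 2, "B"), ("t2", 3, "C"),
    ("t3", 1, "A"), ("t3", 2, "C"),
]
sample_routes = {"t1": "R1", "t2": "R1", "t3": "R1"}
variants = group_by_stopping_pattern(sample_stop_times, sample_routes)
print(variants)
```

If stop IDs can themselves contain the separator character, using a tuple of stop IDs as the key instead of a joined string avoids accidental collisions.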

Is Bigtable (or BigQuery) the right platform for correlation analysis of logs?

I'm faced with the challenge of analysing different system logfiles based on the following requirements:
several hundred systems
millions of logs every day in different formats
Besides many other objectives, my biggest challenge is a realtime correlation analysis of all incoming logs against all current system logs and also, partially, against historical log events.
Currently we're focusing on MongoDB, ElasticSearch, Hadoop, ... to meet this challenge.
On the other hand I've read some interesting things about Google Bigtable and Bigquery.
So my question is: are Bigtable and/or BigQuery solutions worth looking at in order to do this realtime analysis?
I've no experience with these two products, so I'm hoping for some tips whether these Google solutions could be an alternative for my requirements.
EDIT:
Too broad. You need to show the actual analysis you need to make. BigQuery will be much, much cheaper than a homemade solution with NoSQL.
Our goal is to develop a system that is able to generate warnings based on current log events (or a combination of different log events) and their past interactions with other systems' behavior.
Therefore we have to be able to do fast correlation analysis for current events against huge amounts of unstructured historical data.
I know that this requirement description is probably not the most specific one, but we're right at the beginning of this project.
So my goal with this question is to get some arguments for our next team meeting on whether or not we should take a closer look at Bigtable / BigQuery.
One of my favorite features of BigQuery is its ability to run correlations.
Here's a correlations with BigQuery tutorial I wrote a couple years ago: http://nbviewer.ipython.org/gist/fhoffa/6459195
For example, to rank the pairs of departure states whose daily average flight delays are most correlated:
SELECT a.departure_state, b.departure_state, corr(a.avg, b.avg) corr, COUNT(*) c
FROM
  (SELECT date, departure_state, AVG(departure_delay) avg, COUNT(*) c
   FROM [bigquery-samples:airline_ontime_data.flights]
   GROUP BY 1,2 HAVING c > 5
  ) a
JOIN
  (SELECT date, departure_state, AVG(departure_delay) avg, COUNT(*) c
   FROM [bigquery-samples:airline_ontime_data.flights]
   GROUP BY 1,2 HAVING c > 5
  ) b
ON a.date=b.date
WHERE a.departure_state < b.departure_state
GROUP EACH BY 1, 2
HAVING c > 5
ORDER BY corr DESC;
Try it yourself in the next 5 minutes! A quick getting started tutorial: https://www.reddit.com/r/bigquery/comments/3dg9le/analyzing_50_billion_wikipedia_pageviews_in_5/
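To make clear what the query above computes for each pair, here is a small Python sketch of the same idea on made-up numbers: two series of daily averages are matched on date (the JOIN ... ON a.date=b.date step) and a Pearson correlation coefficient is computed over the matched days. The dictionaries and their values are hypothetical, and this is an illustration of the calculation, not of BigQuery's internals.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical daily average delays (minutes) for two departure states
delays_a = {"2015-01-01": 10.0, "2015-01-02": 20.0, "2015-01-03": 30.0}
delays_b = {"2015-01-01": 5.0, "2015-01-02": 9.0, "2015-01-03": 13.0, "2015-01-04": 2.0}

# Keep only the dates present in both series, like the join on date
shared_days = sorted(delays_a.keys() & delays_b.keys())
r = pearson([delays_a[d] for d in shared_days], [delays_b[d] for d in shared_days])
print(shared_days, r)
```

For log analysis, the same pattern would apply to per-interval event counts from two systems instead of flight delays.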

Trouble Looking For Events WITHIN a Session In BigQuery or WITHIN Multiple Sessions

I wanted to get a second pair of eyes and some help confirming the best way to look within a session at the hit level in BigQuery. I have read the BigQuery developer documentation thoroughly, which provides insight on working WITHIN a session. My challenge is this. Let us assume I write the high-level query to count the number of sessions that exist and group the sessions by device.deviceCategory, as below:
SELECT device.deviceCategory,
COUNT(DISTINCT CONCAT (fullVisitorId, STRING (visitId)), 10000000) AS SESSIONS
FROM (TABLE_DATE_RANGE([XXXXXX.ga_sessions_], TIMESTAMP('2015-01-01'), TIMESTAMP('2015-06-30')))
GROUP EACH BY device.deviceCategory
ORDER BY sessions DESC
I then run a follow up query like the following to find the number of distinct users (Client ID's):
SELECT device.deviceCategory,
COUNT(DISTINCT fullVisitorID) AS USERS
FROM (TABLE_DATE_RANGE([XXXXXX.ga_sessions_], TIMESTAMP('2015-01-01'), TIMESTAMP('2015-06-30')))
GROUP EACH BY device.deviceCategory
ORDER BY users DESC
(Note that I broke those up because of the sheer size of the data I am working with which produces runs greater than 5TB in some cases).
My challenge is the following. I feel like I have the wrong approach and have not had success with the WITHIN function. For every user ID (or full visitor ID), I want to look within all their various sessions to find out how many sessions from the many they had were desktop and how many were mobile. Basically, these are the cross device users. I want to collect a table with these users. I started here:
SELECT COUNT(DISTINCT CONCAT (fullVisitorId, STRING (visitId)), 10000000) AS SESSIONS
FROM (TABLE_DATE_RANGE([XXXXXX.ga_sessions_], TIMESTAMP('2015-01-01'), TIMESTAMP('2015-06-30')))
WHERE device.deviceCategory = 'desktop' AND device.deviceCategory = 'mobile'
This is not correct, though. Moreover, any version I write of a WITHIN query gives me nonsense results or results that have 0 as the number. Does anyone have any strategies or tips to recommend a way forward here? What is the best way to use the WITHIN function to look for sessions that may have multiple events happening WITHIN the session (with my goal being to collect the user IDs that meet certain requirements within a session or over various sessions)? Two days ago I did this by manually working through the steps and saving intermediate data frames to generate counts. That said, I wanted to see if there is any guidance on doing this quickly with a single query.
I'm not sure if this question is still open on your end, but I believe I see your problem, and it is not with the misuse of the WITHIN function. It is a data understanding problem.
When dealing with GA and cross-device identification, you cannot reliably use any combination of fullVisitorId and visitId to identify users, as these are derived from the cookie that GA places on the user's browser. Thus, fullVisitorId identifies a specific browser on a specific device rather than a specific user.
In order to truly track users across devices, you must use the userId functionality. This requires you to have the user sign in in some way, thus giving them an identifier that you can use across all of their devices to tie their behavior together.
After you implement some type of user identification that you can control, rather than GA's cookie assignment, you can use that to look for details across sessions and within those individual sessions.
Hope that helps!
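Once a sign-in-based identifier is available, finding cross-device users is a per-user check rather than a per-row one: a single session row can never be both 'desktop' and 'mobile', so the condition has to be evaluated after grouping sessions by user. A minimal Python sketch of that grouping follows; the user_id values, session rows, and function name are hypothetical, and user_id is assumed to be an identifier you control as described above, not fullVisitorId.

```python
from collections import defaultdict

def cross_device_users(sessions):
    """Return per-device session counts for users seen on both desktop and mobile.

    sessions: iterable of (user_id, session_id, device_category) rows
    """
    sessions_by_user = defaultdict(lambda: defaultdict(set))
    for user_id, session_id, device in sessions:
        sessions_by_user[user_id][device].add(session_id)  # distinct sessions per device

    return {
        user_id: {device: len(ids) for device, ids in per_device.items()}
        for user_id, per_device in sessions_by_user.items()
        if "desktop" in per_device and "mobile" in per_device
    }

# Hypothetical sample: only u1 has both desktop and mobile sessions
sample_sessions = [
    ("u1", "s1", "desktop"),
    ("u1", "s2", "mobile"),
    ("u1", "s3", "mobile"),
    ("u2", "s4", "desktop"),
    ("u3", "s5", "mobile"),
    ("u3", "s6", "tablet"),
]
cross_device = cross_device_users(sample_sessions)
print(cross_device)
```

In a query, the equivalent shape is to aggregate sessions per user and device category first, and then keep only the users that have a nonzero count for both categories.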

Ruby Rails Complex SQL with aggregate function and DayOfWeek

Rails 2.3.4
I have searched Google and have not found an answer to my dilemma.
For this discussion, I have two models. Users and Entries. Users can have many Entries (one for each day).
Entries have values and sent_at dates.
I want to query and display the average value of entries for a user BY DAY OF WEEK. So if a user has entered values for, say, the past 3 weeks, I want to show the average value for Sundays, Mondays, etc. In MySQL, it is simple:
SELECT DAYOFWEEK(sent_at) as day, AVG(value) as average FROM entries WHERE user_id = ? GROUP BY 1
That query will return between 0 and 7 records, depending upon how many days a user has had at least one entry.
I've looked at find_by_sql, but while I am searching Entry, I don't want to return an Entry object; instead, I need an array of up to 7 days and averages...
Also, I am concerned a bit about the performance of this, as we would like to load this to the user model when a user logs in, so that it can be displayed on their dashboard. Any advice/pointers are welcome. I am relatively new to Rails.
You can query the database directly, no need to use an actual ActiveRecord object. For example:
ActiveRecord::Base.connection.execute "SELECT DAYOFWEEK(sent_at) as day, AVG(value) as average FROM entries WHERE user_id = #{user.id} GROUP BY DAYOFWEEK(sent_at);"
This will give you a Mysql::Result or Mysql2::Result, an enumerable on which you can then use each or all to view your results.
As for caching, I would recommend using memcached, but any other rails caching strategy will work as well. The nice benefit of memcached is that you can have your cache expire after a certain amount of time. For example:
result = Rails.cache.fetch("user/#{user.id}/averages", :expires_in => 1.day) do
# Your sql query and results go here
end
This would put your results into memcached for one day under the key 'user/<user id>/averages'. For example, if you were the user with id 10, your averages would be stored in memcached under 'user/10/averages', and the next time you performed this query (within the same day) the cached version would be used instead of actually hitting the database.
Untested, but something like this should work:
@user.entries.select('DAYOFWEEK(sent_at) as day, AVG(value) as average').group('1').all
NOTE: When you use select to specify columns explicitly, the returned objects are read-only. Rails can't reliably determine which columns can and can't be modified. In this case, you probably wouldn't try to modify the selected columns, but you can't modify your sent_at or value columns through the resulting objects either.
Check out the ActiveRecord Querying Guide for a breakdown of what's going on here in a fairly newb-friendly format. Oh, and if that query doesn't work, please post back so others that may stumble across this can see that (and I can possibly update).
Since that won't work due to entries returning an array, we can try using join instead:
User.where(:user_id => params[:id]).joins(:entries).select('...').group('1').all
Again, I don't know if this will work. Usually you can specify where after joins, but I haven't seen select combined in there. A tricky bit here is that the select is probably going to eliminate returning any data about the user at all. It might make more sense just to eschew find_by_* methods in favor of writing a method in the Entry model that just calls your query with select_all and skips the association mapping.