How to tell there is new data available in ga_sessions_intraday_ efficiently - google-bigquery

Google Analytics data should be exported to Big Query 3 times a day, according to the docs. I trying to determine an efficient way to detect new data is available in the ga_sessions_intraday_ table and run a query in BQ to extract on the new data.
My best idea is to poll ga_sessions_intraday_ by running a SQL query every hour. I would track the max visitStartTime (storing the state somewhere) and if a new max visitStartTime shows up in the ga_sessions_intraday_ then I would run my full queries.
Problems with this approach is I need to store state about the max visitStartTime. I would prefer something simpler.
Does GA Big Query have a better way of telling that new data is available in ga_sessions_intraday_? Some kind of event that fires? Do I use the last modified date of the table (but I need to keep track of the time window to run against)?
Thanks in advance for your help,
Kevin

Last modified time on the table is probably the best approach here (and cheaper than issuing a probe query). I don't believe there is any other signalling mechanism for delivery of the data.
If your full queries run more quickly than your polling interval, you could probably just use the modified time of your derived tables to hold the data (and update when your output tables are older than your input tables).
Metadata queries are free, so you can even embed most of the logic in a query:
SELECT
(
SELECT
MAX(last_modified_time)
FROM
`YOUR_INPUT_DATASET.__TABLES__`) >
(
SELECT
MAX(last_modified_time)
FROM
`YOUR_OUTPUT_DATASET.__TABLES__`) need_update
If you have a mix of tables in your output dataset, you can be more selective (with a WHERE clause) to filter down the tables you examine.
If you need a convenient place to run this scheduling logic (that isn't a developer's workstation), you might consider one of my previous answers. (Short version: Apps Script is pretty neat)
You might also consider filing a feature request for "materialized views" or "scheduled queries" on BigQuery's public issue tracker. I didn't see a existing entry for this with a quick skim, but I've certainly heard similar requests in the past.
I'm not sure how the Google Analytics team handles feature requests, but having a pubsub notification upon delivery of a new batch of Analytics data seems like it could be useful as well.

Related

How to store millions of statistics records efficiently?

We have about 1.7 million products in our eshop, we want to keep record of how many views this products had for 1 year long period, we want to record the views every atleast 2 hours, the question is what structure to use for this task?
Right now we tried keeping stats for 30 days back in records that have 2 columns classified_id,stats where stats is like a stripped json with format date:views,date:views... for example a record would look like
345422,{051216:23212,051217:64233} where 051216,051217=mm/dd/yy and 23212,64233=number of views
This of course is kinda stupid if you want to go 1 year back since if you want to get the sum of views of say 1000 products you need to fetch like 30mb from the database and calculate it your self.
The other way we think of going right now is just to have a massive table with 3 columns classified_id,date,view and store its recording on its own row, this of course will result in a huge table with hundred of millions of rows , for example if we have 1.8 millions of classifieds and keep records 24/7 for one year every 2 hours we need
1800000*365*12=7.884.000.000(billions with a B) rows which while it is way inside the theoritical limit of postgres I imagine the queries on it(say for updating the views), even with the correct indices, will be taking some time.
Any suggestions? I can't even imagine how google analytics stores the stats...
This number is not as high as you think. In current work we store metrics data for websites and total amount of rows we have is much higher. And in previous job I worked with pg database which collected metrics from mobile network and it collected ~2 billions of records per day. So do not be afraid of billions in number of records.
You will definitely need to partition data - most probably by day. With this amount of data you can find indexes quite useless. Depends on planes you will see in EXPLAIN command output. For example that telco app did not use any indexes at all because they would just slow down whole engine.
Another question is how quick responses for queries you will need. And which steps in granularity (sums over hours/days/weeks etc) for queries you will allow for users. You may even need to make some aggregations for granularities like week or month or quarter.
Addition:
Those ~2billions of records per day in that telco app took ~290GB per day. And it meant inserts of ~23000 records per second using bulk inserts with COPY command. Every bulk was several thousands of records. Raw data were partitioned by minutes. To avoid disk waits db had 4 tablespaces on 4 different disks/ arrays and partitions were distributed over them. PostreSQL was able to handle it all without any problems. So you should think about proper HW configuration too.
Good idea also is to move pg_xlog directory to separate disk or array. No just different filesystem. It all must be separate HW. SSDs I can recommend only in arrays with proper error check. Lately we had problems with corrupted database on single SSD.
First, do not use the database for recording statistics. Or, at the very least, use a different database. The write overhead of the logs will degrade the responsiveness of your webapp. And your daily backups will take much longer because of big tables that do not need to be backed up so frequently.
The "do it yourself" solution of my choice would be to write asynchronously to log files and then process these files afterwards to construct the statistics in your analytics database. There is good code snippet of async write in this response. Or you can benchmark any of the many loggers available for Java.
Also note that there are products like Apache Kafka specifically designed to collect this kind of information.
Another possibility is to create a time series in column oriented database like HBase or Cassandra. In this case you'd have one row per product and as many columns as hits.
Last, if you are going to do it with the database, as #JosMac pointed, create partitions, avoid indexes as much as you can. Set fillfactor storage parameter to 100. You can also consider UNLOGGED tables. But read thoroughly PostgreSQL documentation before turning off the write-ahead log.
Just to raise another non-RDBMS option for you (so a little off topic), you could send text files (CSV, TSV, JSON, Parquet, ORC) to Amazon S3 and use AWS Athena to query it directly using SQL.
Since it will query free text files, you may be able to just send it unfiltered weblogs, and query them through JDBC.

Rails 4: dashboard/analytics and querying ALL records in DB

Working on a dashboard page which does a lot of analytics to display BOTH graphical and tabular data to users.
When the dashboard is filtered by a given year, I have to display analytics for the selected year, another year chosen for comparison, and historical averages from all time.
For the selected and comparison years, I create start/end DateTime objects that are set to the beginning_of_year and end_of_year.
year = Model.where("closed_at >= ?", start).where("closed_at <= ?", end).all
comp = Model.where("closed_at >= ?", comp_start).where("closed_at <= ?", comp_end).all
These queries are essentially the same, just different date filters. I don't really see any way to optimize this besides trying to only "select(...)" the fields I need, which will probably be all of them.
Since there will be an average of 250-1000 records in a given year, they aren't "horrible" (in my not-very-skilled opinion).
However, the historical averages are causing me a lot of pain. In order to adequately show the averages, I have to query ALL the records for all time and perform calculations on them. This is a bad idea, but I don't know how to get around it.
all_for_average = Model.all
Surely people have run into these kinds of problems before and have some means of optimizing them? Returning somewhere in the ballpark of 2,000 - 50,000 records for historical average analysis can't be very efficient. However, I don't see another way to perform the analysis unless I first retrieve the records.
Option 1: Grab everything and filter using Ruby
Since I'm already grabbing everything via Model.all, I "could" remove the 2 year queries by simply grabbing the desired records from the historical average instead. But this seems wrong...I'm literally "downloading" my DB (so to speak) and then querying it with Ruby code instead of SQL. Seems very inefficient. Has anyone tried this before and seen any performance gains?
Option 2: Using multiple SQL DB calls to get select information
This would mean instead of grabbing all records for a given time period, I would make several DB queries to get the "answers" from the DB instead of analyzing the data in Ruby.
Instead of running something like this,
year = Model.where("closed_at >= ?", start).where("closed_at <= ?", end).all
I would perform multiple queries:
year_total_count = Model.where(DATE RANGE).size
year_amount_sum = Model.where(DATE RANGE).sum("amount")
year_count_per_month = Model.where(DATE RANGE).group("MONTH(closed_at)")
...other queries to extract selected info...
Again, this seems very inefficient, but I'm not knowledgeable enough about SQL and Ruby code efficiencies to know which would lead to obvious downsides.
I "can" code both routes and then compare them with each other, but it will take a few days to code/run them since there's a lot of information on the dashboard page I'm leaving out. Certainly these situations have been run into multiple times for dashboard/analytics pages; is there a general principle for these types of situations?
I'm using PostgreSQL on Rails 4. I've been looking into DB-specific solutions as well, as being "database agnostic" really is irrelevant for most applications.
Dan, I would look into using a materialized view (MV) for the all-time historical average. This would definitely fall under the "DB-specific" solutions category, as MVs are implemented differently in different databases (or sometimes not at all). Here is the basic PG documentation.
A materialized view is essentially a physical table, except its data is based on a query of other tables. In this case, you could create an MV that is based on a query that averages the historical data. This query only gets run once if the underlying data does not change. Then the dashboard could just do a simple read query on this MV instead of running the costly query on the underlying table.
After discussing the issue with other more experienced DBAs and developers, I decided I was trying to optimize a problem that didn't need any optimization yet.
For my particular use case, I would have a few hundred users a day running these queries anywhere from 5-20 times each, so I wasn't really having major performance issues (ie, I'm not a Google or Amazon servicing billions of requests a day).
I am actually just having the PostgreSQL DB execute the queries each time and I haven't noticed any major performance issues for my users; the page loads very quickly and the queries/graphs have no noticeable delay.
For others trying to solve similar issues, I recommend trying to run it for a while a staging environment to see if you really have a problem that needs solving in the first place.
If I hit performance hiccups, my first step will be specifically indexing data that I query on, and my 2nd step will be creating DB views that "pre-load" the queries more efficiently than querying them over live data each time.
Thanks to the incredible advances in DB speed and technology, however, I don't have to worry about this problem.
I'm answering my own question so others can spend time resolving more profitable questions.

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a a timestamp. However our source data files are too big to edit before we add them to cloud storage (4+ GB of textual data/file). As far as I know there is no way to append a timestamp column to each row before bringing them in BigQuery, right?
We are thus toying with the idea of creating daily tables for each source. But don't know how this will work when we have real time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting in cloud storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the dataset.DATASET pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the DATASET pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
Another alternative to consider is the BigQuery streaming API (More info here). This lets you insert single rows or groups of rows into a table just by posting them directly to bigquery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).

How to store daily/monthly snapshots on Google BigQuery?

we need to store daily and monthly snapshots of some of ours database.
It's not backup, we need to store the data so to analyze them later and to see how they evolve during the time.
We still don't know exactly what sort of queries we will need in two months, for starting we need to track some evolutions of our user base, so we will save daily snapshots of users and other related collections.
We are thinking to put all the stuff on Google BigQuery, it's easy to put data on it and easier to make queries on that data.
We will create some tables, one for each set of data we need, with all the needed columns, plus an extra one that will contain the date on which the extraction process was done.
We will use this column to group the data by day, month, and so on.
An alternative approach could be to create a dataset for each .. well set of data, and one table every time we need a snapshot.
I honestly don't know what is the better between these two, or if there are better options.
It's difficult to say which is best for you since I don't know your needs or cost requirements.
However, with the "create some tables, one for each set of data we need, with all the needed columns, plus an extra one that will contain the date on which the extraction process was done" method, you could run queries that will allow you to see what has changed for your users over time. For example, you could say, for a particular time slice, the average activity of a particular user over time.
Probably a bit late, but for future readers: you are probably looking for date-partitioned tables. It corresponds exactly to this use case, and there's a straightforward example in the documentation page.
You can now create table snapshots in BigQuery.
You can only use the bq command line tool for now.
See here -> https://cloud.google.com/bigquery/docs/table-snapshots-create#creating_table_snapshots

Best approach to cache Counts from SQL tables?

I would like to develop a Forum from scratch, with special needs and customization.
I would like to prepare my forum for intensive usage and wondering how to cache things like User posts count and User replies count.
Having only three tables, tblForum, tblForumTopics, tblForumReplies, what is the best approach of cache the User topics and replies counts ?
Think at a simple scenario: user press a link and open the Replies.aspx?id=x&page=y page, and start reading replies. On the HTTP Request, the server will run an SQL command wich will fetch all replies for that page, also "inner joining with tblForumReplies to find out the number of User replies for each user that replied."
select
tblForumReplies.*,
tblFR.TotalReplies
from
tblForumReplies
inner join
(
select IdRepliedBy, count(*) as TotalReplies
from tblForumReplies
group by IdRepliedBy
) as tblFR
on tblFR.IdRepliedBy = tblForumReplies.IdRepliedBy
Unfortunately this approach is very cpu intensive, and I would like to see your ideas of how to cache things like table Counts.
If counting replies for each user on insert/delete, and store it in a separate field, how to syncronize with manual data changing. Suppose I will manually delete Replies from SQL.
These are the three approaches I'd be thinking of:
1) Maybe SQL Server performance will be good enough that you don't need to cache. You might be underestimating how well SQL Server can do its job. If you do your joins right, it's just one query to get all the counts of all the users that are in that thread. If you are thinking of this as one query per user, that's wrong.
2) Don't cache. Redundantly store the user counts on the user table. Update the user row whenever a post is inserted or deleted.
3) If you have thousands of users, even many thousand, but not millions, you might find that it's practical to cache user and their counts in the web layer's memory - for ASP.NET, the "Application" cache.
I would not bother with caching untill I will need this for sure. From my expirience this is no way to predict places that will require caching. Try iterative approach, try to implement witout cashe, then gether statistics and then implement right caching (there are many kinds like content, data, aggregates, distributed and so on).
BTW, I do not think that your query is CPU consuming. SQL server will optimaze that stuff and COUNT(*) will run in ticks...
tbl prefixes suck -- as much as Replies.aspx?id=x&page=y URIs do. Consider ASP.NET MVC or just routing part.
Second, do not optimize prematurely. However, if you really need so, denormalize your data: add TotalReplies column to your ForumTopics table and either rely on your DAL/BL to keep this field up to date (possibly with a scheduled task to resync those), or use triggers.
For each reply you need to keep TotalReplies and TotalDirectReplies. That way, you can support tree-like structure of replies, and keep counts update throughout the entire hierarchy without a need to count each time.