How to store daily/monthly snapshots on Google BigQuery? - google-bigquery

We need to store daily and monthly snapshots of some of our databases.
This is not a backup: we need to keep the data so we can analyze it later and see how it evolves over time.
We don't yet know exactly what queries we will need in two months. To start, we need to track some changes in our user base, so we will save daily snapshots of users and other related collections.
We are thinking of putting everything on Google BigQuery; it's easy to load data into it and even easier to query that data.
We will create some tables, one for each set of data we need, with all the needed columns plus an extra one containing the date on which the extraction process ran.
We will use this column to group the data by day, month, and so on.
An alternative approach would be to create a dataset for each set of data, and a new table every time we need a snapshot.
I honestly don't know which of the two is better, or whether there are better options.

It's difficult to say which is best for you since I don't know your needs or cost requirements.
However, with the "create some tables, one for each set of data we need, with all the needed columns, plus an extra one that will contain the date on which the extraction process was done" method, you could run queries that show what has changed for your users over time. For example, you could compute the average activity of a particular user within a given time slice.
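As a rough illustration of the single-table layout with an extraction-date column, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for BigQuery, and the table name users_snapshot, its columns, and the sample rows are all made up for the example.

```python
import sqlite3

# In-memory SQLite stands in for BigQuery here; only the table layout matters.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users_snapshot ("
    " snapshot_date TEXT,"  # date the extraction ran (the extra column)
    " user_id INTEGER,"
    " plan TEXT)"           # hypothetical attribute we want to track
)


def take_snapshot(conn, snapshot_date, users):
    """Append the current state of `users`, tagged with the extraction date."""
    conn.executemany(
        "INSERT INTO users_snapshot VALUES (?, ?, ?)",
        [(snapshot_date, user_id, plan) for user_id, plan in users],
    )


def users_per_day(conn):
    """Number of users in each daily snapshot, oldest first."""
    return conn.execute(
        "SELECT snapshot_date, COUNT(*) FROM users_snapshot"
        " GROUP BY snapshot_date ORDER BY snapshot_date"
    ).fetchall()


def plan_history(conn, user_id):
    """How one user's plan evolved across snapshots."""
    return conn.execute(
        "SELECT snapshot_date, plan FROM users_snapshot"
        " WHERE user_id = ? ORDER BY snapshot_date",
        (user_id,),
    ).fetchall()


# Two made-up daily extractions.
take_snapshot(conn, "2016-05-01", [(1, "free"), (2, "free")])
take_snapshot(conn, "2016-05-02", [(1, "paid"), (2, "free"), (3, "free")])

print(users_per_day(conn))
print(plan_history(conn, 1))
```

Grouping by month works the same way by grouping on a prefix of snapshot_date instead of the full date.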

Probably a bit late, but for future readers: you are probably looking for date-partitioned tables. It corresponds exactly to this use case, and there's a straightforward example in the documentation page.

You can now create table snapshots in BigQuery.
You can only use the bq command line tool for now.
See here -> https://cloud.google.com/bigquery/docs/table-snapshots-create#creating_table_snapshots

Related

How to store millions of statistics records efficiently?

We have about 1.7 million products in our e-shop, and we want to keep a record of how many views these products had over a one-year period. We want to record the views at least every 2 hours. The question is what structure to use for this task.
Right now we have tried keeping stats for the last 30 days in records with 2 columns, classified_id and stats, where stats is a stripped-down JSON with the format date:views,date:views... For example, a record would look like
345422,{051216:23212,051217:64233} where 051216,051217 = mm/dd/yy and 23212,64233 = number of views
This is of course kind of stupid if you want to go back 1 year, since to get the sum of views of, say, 1000 products you need to fetch something like 30 MB from the database and calculate it yourself.
The other way we are considering right now is a massive table with 3 columns, classified_id, date, view, storing each recording in its own row. This will of course result in a huge table with hundreds of millions of rows. For example, if we have 1.8 million classifieds and keep records 24/7 for one year, every 2 hours, we need
1800000*365*12 = 7,884,000,000 rows (billions with a B). While that is well within the theoretical limit of Postgres, I imagine queries on it (say, for updating the views), even with the correct indices, will take some time.
Any suggestions? I can't even imagine how Google Analytics stores its stats...
This number is not as high as you think. In my current job we store metrics data for websites, and the total number of rows we have is much higher. In my previous job I worked with a PostgreSQL database that collected metrics from a mobile network, and it collected ~2 billion records per day. So do not be afraid of billions of records.
You will definitely need to partition the data, most probably by day. With this amount of data you may find indexes quite useless; it depends on the plans you see in the EXPLAIN command output. For example, that telco app did not use any indexes at all because they would just slow down the whole engine.
Another question is how quickly you need queries to respond, and which granularity steps (sums over hours, days, weeks, etc.) you will allow users to query. You may even need to precompute aggregations for granularities like week, month, or quarter.
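A minimal sketch of the pre-aggregation idea, assuming each raw row stores the views recorded during one 2-hour interval (not a running total). It uses sqlite3 from the Python standard library purely for illustration; the table names views_raw and views_daily and the sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Raw table: one row per product per recording interval.
conn.execute(
    "CREATE TABLE views_raw (classified_id INTEGER, recorded_at TEXT, views INTEGER)"
)
# Rollup table: one row per product per day.
conn.execute(
    "CREATE TABLE views_daily (classified_id INTEGER, day TEXT, views INTEGER,"
    " PRIMARY KEY (classified_id, day))"
)

conn.executemany(
    "INSERT INTO views_raw VALUES (?, ?, ?)",
    [
        (345422, "2016-05-12 00:00:00", 10),
        (345422, "2016-05-12 02:00:00", 15),
        (345422, "2016-05-13 00:00:00", 7),
        (900001, "2016-05-12 00:00:00", 3),
    ],
)


def rebuild_daily(conn):
    """Recompute the daily rollup from the raw rows."""
    conn.execute("DELETE FROM views_daily")
    conn.execute(
        "INSERT INTO views_daily"
        " SELECT classified_id, date(recorded_at), SUM(views)"
        " FROM views_raw GROUP BY classified_id, date(recorded_at)"
    )


def total_views(conn, ids, first_day, last_day):
    """Sum views for several products over a day range, using only the rollup."""
    placeholders = ",".join("?" for _ in ids)
    row = conn.execute(
        "SELECT COALESCE(SUM(views), 0) FROM views_daily"
        f" WHERE classified_id IN ({placeholders}) AND day BETWEEN ? AND ?",
        [*ids, first_day, last_day],
    ).fetchone()
    return row[0]


rebuild_daily(conn)
print(total_views(conn, [345422, 900001], "2016-05-12", "2016-05-13"))
```

In a real system the rollup would be refreshed incrementally (for example, only for the day that just closed) rather than rebuilt from scratch.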
Addition:
Those ~2 billion records per day in that telco app took ~290 GB per day. That meant inserting ~23,000 records per second using bulk inserts with the COPY command. Every batch was several thousand records. Raw data were partitioned by minute. To avoid disk waits, the database had 4 tablespaces on 4 different disks/arrays, and partitions were distributed over them. PostgreSQL was able to handle it all without any problems. So you should think about proper hardware configuration too.
It is also a good idea to move the pg_xlog directory to a separate disk or array. Not just a different filesystem: it must be separate hardware. I can recommend SSDs only in arrays with proper error checking. Lately we had problems with a corrupted database on a single SSD.
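For a sense of scale, the figures quoted in the question and in this answer can be checked with a few lines of arithmetic. This sketch assumes decimal gigabytes (10^9 bytes) and an even insert rate over the day; the variable names are only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Question: 1.8 million products, one row every 2 hours (12 per day), kept for a year.
rows_per_year = 1_800_000 * 365 * 12

# Answer: ~2 billion records and ~290 GB per day in the telco application.
records_per_day = 2_000_000_000
bytes_per_day = 290 * 10**9  # assumes decimal GB

records_per_second = records_per_day / SECONDS_PER_DAY
bytes_per_record = bytes_per_day / records_per_day
days_to_match_year = rows_per_year / records_per_day

print(f"rows per year in the question: {rows_per_year:,}")
print(f"telco insert rate: {records_per_second:,.0f} records/s")
print(f"average record size: {bytes_per_record:.0f} bytes")
print(f"telco days needed to reach that row count: {days_to_match_year:.2f}")
```

In other words, a full year of the e-shop's statistics is roughly what the telco system ingested in about four days.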
First, do not use the database for recording statistics. Or, at the very least, use a different database. The write overhead of the logs will degrade the responsiveness of your webapp. And your daily backups will take much longer because of big tables that do not need to be backed up so frequently.
The "do it yourself" solution of my choice would be to write asynchronously to log files and then process these files afterwards to build the statistics in your analytics database. There is a good code snippet of an async write in this response. Or you can benchmark any of the many loggers available for Java.
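To make the "write asynchronously, aggregate later" idea concrete, here is a small sketch. It is written in Python rather than Java and is not the snippet the answer links to; the class AsyncViewLogger, the function aggregate, and the "product_id,day" line format are all invented for this example.

```python
import os
import queue
import tempfile
import threading
from collections import Counter


class AsyncViewLogger:
    """Request threads enqueue view events; one background thread writes them."""

    def __init__(self, path):
        self._queue = queue.Queue()
        self._file = open(path, "a", encoding="utf-8")
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def record(self, product_id, day):
        # Returns immediately; the caller never waits on disk I/O.
        self._queue.put(f"{product_id},{day}\n")

    def _drain(self):
        while True:
            line = self._queue.get()
            if line is None:  # sentinel pushed by close()
                break
            self._file.write(line)

    def close(self):
        self._queue.put(None)
        self._thread.join()
        self._file.close()


def aggregate(path):
    """Offline step: count views per (product_id, day) from the log file."""
    counts = Counter()
    with open(path, encoding="utf-8") as log_file:
        for line in log_file:
            product_id, day = line.rstrip("\n").split(",")
            counts[(int(product_id), day)] += 1
    return counts


log_path = os.path.join(tempfile.mkdtemp(), "views.log")
logger = AsyncViewLogger(log_path)
for product_id in (345422, 345422, 900001):
    logger.record(product_id, "2016-05-12")
logger.close()

daily_counts = aggregate(log_path)
print(daily_counts)
```

The aggregated counts would then be bulk-loaded into the analytics database, keeping the write load off the database that serves the web application.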
Also note that there are products like Apache Kafka specifically designed to collect this kind of information.
Another possibility is to create a time series in a column-oriented database like HBase or Cassandra. In this case you'd have one row per product and as many columns as hits.
Last, if you are going to do it with the database, as @JosMac pointed out, create partitions and avoid indexes as much as you can. Set the fillfactor storage parameter to 100. You can also consider UNLOGGED tables, but read the PostgreSQL documentation thoroughly before turning off the write-ahead log.
Just to raise another non-RDBMS option for you (so a little off topic), you could send text files (CSV, TSV, JSON, Parquet, ORC) to Amazon S3 and use AWS Athena to query it directly using SQL.
Since it will query free text files, you may be able to just send it unfiltered weblogs, and query them through JDBC.

Taking snapshot of SQL tables

I have a set of reference tables with different schemas, which we use as reference data during file integration. The reference data can be modified from the GUI.
The requirement is that I need to create a snapshot of the data whenever it changes. For example, users should be able to see which reference data was used on a particular date.
Option 1: Historize all the tables overnight every day, tagged with the date. This way, when users want to see the data used on a particular date, we can easily query the corresponding history table. Since users don't change the data every day, this approach makes the database grow day by day even when nothing has changed.
Option 2: Historize only the rows that have been modified, along with the modification date, and use a view to fetch the data for a particular day. But this way I need to write many views, as each table has a different schema.
If you know the best approach, I would appreciate it if you shared your knowledge.
Thanks,
Not sure if possible but:
Option 3: Create or edit triggers on insert/update/delete to write the new values to a "historical table" and include a timestamp.
To get the admin data used on day "X", just filter on the timestamp.
Another option (again, not sure if it is possible) is to add "start_dt/end_dt" columns to the admin tables and have the processes look up only the active data.
Sérgio
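A minimal sketch of the trigger-based option, using SQLite through Python's sqlite3 module only because it runs anywhere; trigger syntax differs between database engines. The table currency_ref, its history table, and the helper rows_as_of are hypothetical names for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE currency_ref (code TEXT PRIMARY KEY, rate REAL);

-- Every change to currency_ref lands here with a timestamp.
CREATE TABLE currency_ref_history (
    code TEXT,
    rate REAL,
    action TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER currency_ref_ai AFTER INSERT ON currency_ref BEGIN
    INSERT INTO currency_ref_history (code, rate, action)
    VALUES (NEW.code, NEW.rate, 'insert');
END;

CREATE TRIGGER currency_ref_au AFTER UPDATE ON currency_ref BEGIN
    INSERT INTO currency_ref_history (code, rate, action)
    VALUES (NEW.code, NEW.rate, 'update');
END;

CREATE TRIGGER currency_ref_ad AFTER DELETE ON currency_ref BEGIN
    INSERT INTO currency_ref_history (code, rate, action)
    VALUES (OLD.code, OLD.rate, 'delete');
END;
""")


def rows_as_of(conn, timestamp):
    """Replay the history up to `timestamp` ('YYYY-MM-DD HH:MM:SS', UTC)."""
    state = {}
    for code, rate, action in conn.execute(
        "SELECT code, rate, action FROM currency_ref_history"
        " WHERE changed_at <= ? ORDER BY changed_at, rowid",
        (timestamp,),
    ):
        if action == "delete":
            state.pop(code, None)
        else:
            state[code] = rate
    return state


# Changes made "from the GUI"; the triggers record each one.
conn.execute("INSERT INTO currency_ref VALUES ('EUR', 1.10)")
conn.execute("INSERT INTO currency_ref VALUES ('GBP', 1.30)")
conn.execute("UPDATE currency_ref SET rate = 1.12 WHERE code = 'EUR'")
conn.execute("DELETE FROM currency_ref WHERE code = 'GBP'")

history = conn.execute(
    "SELECT code, rate, action FROM currency_ref_history ORDER BY rowid"
).fetchall()
print(history)
print(rows_as_of(conn, "9999-12-31 23:59:59"))
```

Only changed rows are stored, so the history grows with the number of edits rather than with the number of days, which addresses the size concern raised for Option 1.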

How to tell there is new data available in ga_sessions_intraday_ efficiently

Google Analytics data should be exported to BigQuery 3 times a day, according to the docs. I am trying to determine an efficient way to detect that new data is available in the ga_sessions_intraday_ table and then run a query in BQ to extract only the new data.
My best idea is to poll ga_sessions_intraday_ by running a SQL query every hour. I would track the max visitStartTime (storing the state somewhere), and if a new max visitStartTime shows up in ga_sessions_intraday_, I would run my full queries.
The problem with this approach is that I need to store state about the max visitStartTime. I would prefer something simpler.
Does GA BigQuery have a better way of telling that new data is available in ga_sessions_intraday_? Some kind of event that fires? Do I use the last modified date of the table (but then I need to keep track of the time window to run against)?
Thanks in advance for your help,
Kevin
Last modified time on the table is probably the best approach here (and cheaper than issuing a probe query). I don't believe there is any other signalling mechanism for delivery of the data.
If your full queries run more quickly than your polling interval, you could probably just use the modified time of your derived tables to hold the data (and update when your output tables are older than your input tables).
Metadata queries are free, so you can even embed most of the logic in a query:
SELECT
  (SELECT MAX(last_modified_time)
   FROM `YOUR_INPUT_DATASET.__TABLES__`) >
  (SELECT MAX(last_modified_time)
   FROM `YOUR_OUTPUT_DATASET.__TABLES__`) AS need_update
If you have a mix of tables in your output dataset, you can be more selective (with a WHERE clause) to filter down the tables you examine.
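If the scheduling logic lives in a script rather than in the query, the same comparison can be sketched as a small function. This is only an illustration: needs_update is a hypothetical helper, the modification times are assumed to be comparable integers (such as epoch milliseconds) already fetched from the table metadata, and the handling of empty datasets is my own choice rather than something the query above defines.

```python
def needs_update(input_modified_ms, output_modified_ms):
    """Return True when the newest input table is newer than the newest output table.

    Both arguments are iterables of last-modified times as integers.
    """
    input_times = list(input_modified_ms)
    output_times = list(output_modified_ms)
    if not input_times:
        return False  # nothing has been delivered yet
    if not output_times:
        return True   # derived tables have never been built
    return max(input_times) > max(output_times)


# Made-up timestamps: the intraday table changed after the last derived table was written.
print(needs_update([1_500_000_000_000, 1_500_000_360_000], [1_500_000_100_000]))
```

Because the derived tables themselves carry the state, nothing extra (such as the last seen visitStartTime) has to be stored.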
If you need a convenient place to run this scheduling logic (that isn't a developer's workstation), you might consider one of my previous answers. (Short version: Apps Script is pretty neat)
You might also consider filing a feature request for "materialized views" or "scheduled queries" on BigQuery's public issue tracker. I didn't see an existing entry for this with a quick skim, but I've certainly heard similar requests in the past.
I'm not sure how the Google Analytics team handles feature requests, but having a pubsub notification upon delivery of a new batch of Analytics data seems like it could be useful as well.

Need help designing a DB - for a non DBA

I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a timestamp. However, our source data files are too big to edit before we add them to Cloud Storage (4+ GB of textual data per file). As far as I know there is no way to append a timestamp column to each row before bringing them into BigQuery, right?
We are thus toying with the idea of creating daily tables for each source, but we don't know how this will work once we have real-time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting them in Cloud Storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all, fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the dataset.__DATASET__ pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the __DATASET__ pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
Another alternative to consider is the BigQuery streaming API (More info here). This lets you insert single rows or groups of rows into a table just by posting them directly to BigQuery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).
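The import-then-copy approach from this answer can be sketched as follows. SQLite (via Python's sqlite3) stands in for BigQuery, so this shows only the shape of the two steps, not BigQuery's load or query API; the tables temp_import and events, the column loaded_at, and the function load_batch are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Staging table: same columns as the source file, no timestamp.
conn.execute("CREATE TABLE temp_import (source_id INTEGER, payload TEXT)")
# Final table: the same columns plus the timestamp the analysis needs.
conn.execute("CREATE TABLE events (source_id INTEGER, payload TEXT, loaded_at TEXT)")


def load_batch(conn, rows, loaded_at):
    """Load one file's rows into staging, then copy them over with a timestamp."""
    conn.execute("DELETE FROM temp_import")
    conn.executemany("INSERT INTO temp_import VALUES (?, ?)", rows)
    # The whole file gets the same timestamp, added during the copy.
    conn.execute(
        "INSERT INTO events SELECT source_id, payload, ? FROM temp_import",
        (loaded_at,),
    )


load_batch(conn, [(1, "a"), (2, "b")], "2012-06-01 00:00:00")
load_batch(conn, [(3, "c")], "2012-06-02 00:00:00")

print(conn.execute("SELECT * FROM events ORDER BY source_id").fetchall())
```

Since every batch is appended to one table with its own timestamp, queries can filter on loaded_at instead of spanning many daily tables.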

archiving the table : searching for the best way

There is a table which has 80,000 rows.
Every day I will clone this table to another log table with a name like 20101129_TABLE, and every day the prefix will change according to the date.
As you can calculate, the data will grow by 2,400,000 rows every month.
Please advise on saving space and keeping access fast, along with any other advantages and disadvantages. How should I approach creating the best archive or log?
The table holds account info: branch code, balance, etc.
It is quite tricky to answer your question since you are a bit vague on some important facts:
How often do you need the archived tables?
How free are you in your design-choices?
If you don't need the archived data often and you are free in your design, I'd copy the data into an archive database. That will give you the option of storing the database on a separate disk (cost-efficiency), and you can have a separate backup schedule for that database as well.
You could also store all the data in one table with just an additional column like ArchiveDate datetime. But I think this depends really on how you plan on accessing the data later.
Consider TABLE PARTITIONING (MSDN); it is designed for exactly this kind of scenario. Not only can you spread data across partitions (and map partitions to different disks), you can also keep all the data in the same table and let MSSQL do all the hard work in the background (choosing which partition to use based on the select criteria, etc.).