BigQuery - how is data distributed by partition key?

I come from a Teradata and Netezza background in MPP data warehousing.
I would like to ask how Google BigQuery distributes data by partition key on a simple table. I am really trying to understand the logic of how the BigQuery engine works, if that makes sense.
From recollection, Teradata and Netezza had well-documented technical pages describing the processes used (like a step-by-step walkthrough).
Thanks,
Simon

BigQuery's partitioned tables are also very well documented here:
https://cloud.google.com/bigquery/docs/partitioned-tables
I'm not sure I understand what you want to know. Please rephrase your question after reading the above.
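If it helps, here is a minimal sketch of what a partitioned table looks like in practice; the project, dataset, table, and column names below are made up for illustration:

    -- A date-partitioned table: rows are grouped by event_date, and queries
    -- that filter on it only scan the matching partitions.
    CREATE TABLE `my_project.my_dataset.events`
    (
      event_id   STRING,
      event_date DATE,
      payload    STRING
    )
    PARTITION BY event_date;

    -- Only the partitions for the requested week are scanned.
    SELECT event_id, payload
    FROM `my_project.my_dataset.events`
    WHERE event_date BETWEEN '2020-01-01' AND '2020-01-07';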

Related

Newbie looking for BigQuery Standard SQL tutorials, examples, books

I have used SQL a fair amount for several years. I just started a project that uses Google Firebase and BigQuery to explore what users are doing on our website.
The raw data in BigQuery (the Firebase events) is very complicated.
It appears BigQuery is using SQL 2011. I am not sure how that is different from SQL-99 or SQL-2008. I have not found a good overview or tutorial.
Some of the challenges I am struggling with include grouping events into sessions and identifying groups with certain characteristics.
I wonder if, instead of using GROUP BY, I need to learn how windowing works.
Any suggestions for getting up the learning curve faster would be greatly appreciated.
Andy
The main difference is that the most efficient schema is no longer multiple flat tables with relations. Instead, it is nested data in one big table.
I call them subtables, but they're really just arrays containing structs, which may contain arrays which contain structs, which may ... etc.
The most important thing to learn is how to work with these arrays. There are basically two use cases (both sketched below):
you need a field from a subtable to be a dimension in your result: you have to flatten the table using a cross join. Cross joining a subtable with its parent is a weird concept, but it works quite well.
you want some aggregated information from a subtable: use a subquery on the array to get it
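For example, assuming a Firebase-style table with an event_params array (the project, dataset, and table names are made up), the two patterns look roughly like this:

    -- Use case 1: flatten with a cross join so an array field can be a dimension.
    SELECT
      e.event_name,
      param.key AS param_key
    FROM `my_project.analytics.events` AS e
    CROSS JOIN UNNEST(e.event_params) AS param;

    -- Use case 2: aggregate over the array with a subquery, keeping one row per event.
    SELECT
      e.event_name,
      (SELECT COUNT(1) FROM UNNEST(e.event_params)) AS param_count
    FROM `my_project.analytics.events` AS e;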
Both concepts can be learned by working on all the exercises here: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays
GCP also has some courses on Coursera covering BigQuery. I'm not sure how deep they go, though.
As you mentioned in the question, BigQuery is compliant with SQL:2011 [1].
In BigQuery, analytic functions and aggregate analytic functions are used for windowing.
For reference you can have a look at the official BigQuery Standard SQL documentation, and for a deeper understanding of BigQuery you can have a look at the Google BigQuery Analytics book.
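As a rough illustration of windowing applied to the sessionization problem you mention (the table name and the 30-minute inactivity threshold below are assumptions, not part of your data):

    -- Start a new session when a user has been inactive for more than 30 minutes
    -- (event_timestamp in microseconds), then assign a running session number
    -- per user with a window SUM.
    SELECT
      user_pseudo_id,
      event_timestamp,
      SUM(is_new_session) OVER (
        PARTITION BY user_pseudo_id
        ORDER BY event_timestamp
      ) AS session_number
    FROM (
      SELECT
        user_pseudo_id,
        event_timestamp,
        IF(
          event_timestamp
            - IFNULL(LAG(event_timestamp) OVER (
                PARTITION BY user_pseudo_id ORDER BY event_timestamp), 0)
            > 30 * 60 * 1000000,
          1, 0
        ) AS is_new_session
      FROM `my_project.analytics.events`
    );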

Standard or Legacy SQL for Google Analytics Data in BigQuery?

We are just starting to use Google Analytics data in BigQuery and previously used only MS SQL Server in our work environment. We would like to move some of the analysis to GCP and BigQuery, but could not decide which is the better option to use - standard or legacy SQL?
In both cases we would have to adjust to the new language version, but the real question is: what is the best choice when it comes to Google Analytics data analysis? Is there something that, from a technical point of view, should make us choose legacy over standard, or the other way around?
It is very confusing for us that there are two versions, because legacy seems to be more developed now, but perhaps standard will be the main SQL version in BigQuery in the future?
BigQuery Standard SQL is the way to go. It has many more features than Legacy SQL.
Note: it is not a binary choice. You can always use Legacy SQL if there is something you find easier to express with it. In my experience it is mostly the opposite, with very few exceptions - the most prominent (for me, for example) being table decorators: support for table decorators in standard SQL is planned but not yet implemented.
I would recommend looking into Migrating from legacy SQL - not from a migration point of view, as you are new to BigQuery, but because it is a good place to see and compare the features of both dialects in one place.
I also recommend checking the BigQuery issue tracker so you can get some extra insight.
Standard SQL is the preferred SQL dialect for use in BigQuery, as stated in the migration guide. While legacy SQL has been around for quite some time--and is still the default at the time of this writing--there is no active development work on it. If you are evaluating which to use, you should pick standard SQL, since in addition to being more similar to T-SQL (SQL Server's dialect) it is more expressive, has fewer surprising edge cases, and generally has more features.
Go with Standard SQL, as that's on the long-term roadmap.
From experience, some queries are faster under Legacy SQL, but this is changing, as Standard SQL is the one being actively developed.
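For a quick feel of the syntactic difference, here is the same query written in each dialect; the project, dataset, and table names are made up for illustration:

    -- Legacy SQL: square-bracket table references, colon between project and dataset.
    SELECT event_name, COUNT(*) AS events
    FROM [my_project:my_dataset.events]
    GROUP BY event_name;

    -- Standard SQL: backtick-quoted references, dots throughout.
    SELECT event_name, COUNT(*) AS events
    FROM `my_project.my_dataset.events`
    GROUP BY event_name;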

SQL and NoSQL which one is more suitable for this case and why?

In my project:
Data is not going to be modified (only queried).
There are going to be more than 1,000,000 rows of data.
Query performance is critical.
In the case of using SQL, it is going to be a single table with 7 columns (no joins).
There are also different classification approaches used in NoSQL, which are given below with some examples:
Column: Accumulo, Cassandra, HBase
Document: Clusterpoint, Couchdb, Couchbase, MarkLogic, MongoDB
Key-value: Dynamo, FoundationDB, MemcacheDB, Redis, Riak, FairCom c-treeACE
Graph: Allegro, Neo4J, OrientDB, Virtuoso, Stardog
Source: http://en.wikipedia.org/wiki/NoSQL#cite_note-7
First of all, does the database system really make an observable performance difference in this case?
If it does, can you please explain which one is more suitable for my project, SQL or NoSQL? If NoSQL, which classification approach?
Thank you in advance.
I am currently working on a project to set up a "standard" database with a huge amount of data. We start by implementing it in SQL to see how the queries perform; once this is done we address any performance problems.
There are multiple reasons for this, but to name a few:
Standard SQL is easily implemented and consistent across multiple instances (as of present day).
If you know SQL, you can make a fast implementation, saving time and getting the project going.
There is plenty of information available about SQL implementations.
I cannot answer for NoSQL, but hopefully someone can fill me in.
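As a purely hypothetical starting point for the single 7-column table described in the question (all names and types below are made up), the plain-SQL version could be as simple as:

    -- One table, seven columns, read-only workload: index the columns used
    -- for filtering, since query performance is the stated priority.
    CREATE TABLE measurements (
      id          BIGINT PRIMARY KEY,
      device_id   INTEGER NOT NULL,
      recorded_at TIMESTAMP NOT NULL,
      metric_a    DOUBLE PRECISION,
      metric_b    DOUBLE PRECISION,
      metric_c    DOUBLE PRECISION,
      label       VARCHAR(50)
    );

    CREATE INDEX idx_measurements_device_time
      ON measurements (device_id, recorded_at);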
The important question you need to ask is what kind of queries you will be performing. For example, Clusterpoint offers real-time aggregation, so if you need result grouping and summary extraction, it gives you great performance.
For regular key/value lookups, they should all perform pretty well, so pick the one you are most comfortable with.

mondrian adapter for bigquery

It would be mighty useful to have a way to query Google's BigQuery with MDX. I believe the natural solution would be a Mondrian adapter.
Is something like this in the works?
The reason I'm asking is that there is a lot of know-how in MDX, and an MDX connector would allow us to reuse what we already know.
Furthermore, MDX is ideally suited for OLAP queries. Things like hierarchies and calculating a ratio to one's parent (e.g. % contribution to the total) are standardized in MDX but can be solved in 100 different ways in SQL.
Calculating a moving average of the last 3 non-empty weeks is still complicated in SQL and easy in MDX. There are many examples.
And lastly, it would allow us to analyze data from Google BigQuery with an Excel pivot table or any of the 100+ other existing tools that emit MDX queries.
Cheers,
Micha
There is a demo here that is using Mondrian/BigQuery with the Saiku user interface:
http://dev.analytical-labs.com/
This archive contains dependencies that can be used to set up a BigQuery data source in Saiku's embedded Mondrian server (got this from the Saiku twitter feed):
http://t.co/EbtaP95G
Their instructions for setting up BigQuery are here:
https://gist.github.com/4073088
You can download Saiku (with embedded Tomcat and Mondrian) here to run locally for testing:
http://analytical-labs.com/downloads.php
One issue I notice is that the drill-down functionality doesn't work because of the limitations of BigQuery SQL. My guess is that the Mondrian devs will have to add some special SQL support for BigQuery to get around this. For example, any fields used in an ORDER BY clause must also be in the SELECT field list.
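To illustrate that particular restriction (the dataset and field names here are made up):

    -- Fails under the restriction described above: region is used in ORDER BY
    -- but does not appear in the SELECT field list.
    SELECT product, amount
    FROM [mydataset.sales]
    ORDER BY region;

    -- Works: add the ordering field to the SELECT list as well.
    SELECT product, amount, region
    FROM [mydataset.sales]
    ORDER BY region;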
There is no existing BigQuery integration with Pentaho's Mondrian. One thing I would point out is that BigQuery is already very fast over massive datasets, so some of Mondrian's advantages may be moot with a BigQuery back end. However, I could imagine that one could use an existing Pentaho analysis tool to explore data. I'd like to know more about the use case.

Can you recommend a good source for Teradata Best Practices?

Looks like my data warehouse project is moving to Teradata next year (from SQL Server 2005).
I'm looking for resources about best practices on Teradata - from limitations of its SQL dialect to idioms and conventions for getting queries to perform well - particularly if they highlight things which are significantly different from SQL Server 2005. Specifically, tips similar to those found in The Art of SQL (which is more Oracle-focused).
My business processes are currently in T-SQL stored procedures and rely fairly heavily on SQL Server 2005 features like PIVOT, UNPIVOT, and Common Table Expressions to produce about 27m rows of output a month from a 4TB data warehouse.
One place to start is here: http://www.teradataforum.com/
This might be a little late, but there are a few things I have learned about Teradata which I can warn you about.
Use the most recent version whenever possible.
For V12 the optimizer was re-written and the database performs much better now.
Try to realize that SQL Server and Teradata are very different beasts; most of the concepts will not transfer well.
Do not underestimate the importance of a primary index.
The locks that Teradata uses are very primitive compared to other databases.
Do NOT use TERA mode. You do not have any legacy code; ANSI mode is far superior and is widely encouraged.
Join indexes are very helpful tools, but they do not provide all the answers.
Parallelism: take the time to understand how FASTLOAD, MULTILOAD, and TPUMP work, and find out how you can leverage them in your ETL strategy.
If you are attempting to run a query which needs to be performant, do not use any casts; the optimizer will not use statistics to generate the best execution plan.
Working with dates is going to be a pain, just a warning.
Teradata is very DDL oriented; try to understand all the syntax involved in creating a table (see the sketch after this list).
Compression is a wonderful tool; if you have any values which are repeated in a table, make use of it.
There are not many tools available with Teradata, so be prepared to build a lot. The tools that exist are very expensive.
Unfortunately, I do not know much about SQL Server, so I cannot say which SQL Server tools have equivalents in Teradata.
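To tie the primary index, DDL, and compression points together, here is a purely hypothetical Teradata-style table definition (all names and value lists are made up):

    -- The explicit PRIMARY INDEX controls how rows are distributed across AMPs;
    -- COMPRESS lists frequently repeated values so they are stored cheaply.
    CREATE TABLE sales_fact (
      sale_id     INTEGER NOT NULL,
      store_id    INTEGER NOT NULL,
      sale_date   DATE NOT NULL,
      region_code CHAR(2) COMPRESS ('US', 'CA', 'MX'),
      amount      DECIMAL(12,2)
    )
    PRIMARY INDEX (sale_id);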
Hope this helps
I would also look into the recently launched Teradata Developer Exchange as well as the TeradataForum and forums on Teradata's main website.
I don't know of any good references available online. Teradata has some design manuals that are available for download, but they're more instruction manuals and not "best practices" as such. Check them out here: http://www.info.teradata.com/DataWarehouse/eTeradata-BrowseBy.cfm?page=Teradata%20Database
Alternatively, you need to find a friendly Teradata expert to bounce ideas off. Try Teradata themselves, or find a local consultant with Teradata experience.
Best practices on Teradata isn't a topic that gets much discussion, and most of the best tricks tend to be proprietary knowledge of the person or people who discovered them.
Sorry,
David Stewardson
Satyam Computer Services
The top of the list on a Google search for "Teradata Best Practices" gave me TERADATA ADVISORY GROUP SETS BEST PRACTICES FOR BUSINESS OBJECTS AND TERADATA CUSTOMERS.
EDIT: Seeing as that's just advertising, as you've pointed out, see how you go with these. Please bear in mind that I don't have a clue what Teradata is and can't see myself using it any time this side of the 22nd century AD.
Teradata Discussion Forums
Best Practices for Teradata Deployments
Best Study Guides For NCR Teradata Certifications
The middle one looks promising, with its nice long link tree at the top:
Oracle® Business Intelligence Applications Installation and Configuration Guide > Preinstallation and Predeployment Considerations for Oracle BI Applications > Teradata-Specific Database Guidelines for Oracle Business Analytics Warehouse >
and the first link, to the forums, should put you in touch with the right people.