I have used SQL a fair amount for several years. I just started a project that uses Google Firebase and BigQuery to explore what users are doing on our website.
The raw data in BigQuery (the Firebase events) are very complicated.
It appears BigQuery uses SQL:2011. I am not sure how that differs from SQL-99 or SQL:2008. I have not found a good overview or tutorial.
Some of the challenges I am struggling with include grouping events into sessions and identifying groups with certain characteristics.
I wonder if instead of using GROUP BY I need to learn how windowing works.
Any suggestions for getting up the learning curve faster would be greatly appreciated.
Andy
The main difference is that the most efficient schema is no longer multiple flat tables with relations; instead, it is nested data in one big table.
I call them subtables, but they're really just arrays containing structs, which may themselves contain arrays containing structs, which may ... etc.
The most important thing to learn is how to work with these arrays. There are basically two use cases:
you need a field from a subtable to be a dimension in your result: you have to flatten the table using a cross join. Cross joining a subtable with its parent is a weird concept, but it works quite well.
you want some aggregated information from a subtable: use a subquery on the array and get it (both patterns are sketched below)
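A minimal sketch of both patterns, assuming a hypothetical events table with a repeated hits field (an array of structs with page and duration fields):

    -- 1) Flatten: make a subtable field a dimension by cross joining the parent row with its array
    SELECT
      e.user_id,
      h.page,                        -- field from the "subtable"
      COUNT(*) AS hit_count
    FROM `myproject.mydataset.events` AS e
    CROSS JOIN UNNEST(e.hits) AS h
    GROUP BY e.user_id, h.page;

    -- 2) Aggregate: pull summarized info out of the array with a subquery, no flattening needed
    SELECT
      e.user_id,
      (SELECT COUNT(*) FROM UNNEST(e.hits)) AS hit_count,
      (SELECT SUM(h.duration) FROM UNNEST(e.hits) AS h) AS total_duration
    FROM `myproject.mydataset.events` AS e;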
Both concepts can be learned by working on all the exercises here: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays
But GCP also has some courses on Coursera covering BigQuery. I'm not sure how deep they go, though.
As you mentioned in the question, BigQuery is compliant with the SQL:2011 standard [1].
In BigQuery, analytic functions and aggregate analytic functions are used for windowing.
For reference, you can have a look at the official BigQuery standard SQL documentation, and for a deeper understanding of BigQuery, at the Google BigQuery Analytics book.
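To tie that back to your sessionization question, here is a rough sketch (not your actual schema; it assumes a hypothetical events table with user_id and event_timestamp columns, and starts a new session after 30 minutes of inactivity):

    WITH flagged AS (
      SELECT
        user_id,
        event_timestamp,
        -- mark an event as a session start if it is the user's first event
        -- or follows the previous event by 30 minutes or more
        CASE
          WHEN LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp) IS NULL
            OR TIMESTAMP_DIFF(
                 event_timestamp,
                 LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp),
                 MINUTE) >= 30
          THEN 1 ELSE 0
        END AS is_new_session
      FROM `myproject.mydataset.events`
    )
    SELECT
      user_id,
      event_timestamp,
      -- a running sum of the start flags numbers the sessions per user
      SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_timestamp) AS session_id
    FROM flagged;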
Related
I come from a Teradata and Netezza background, doing data warehousing on MPP technologies.
How does Google BigQuery distribute data by partition key on a simple table? I am really trying to understand the logic of how the BigQuery engine works, if that makes sense.
From recollection, Teradata and Netezza had well-documented technical pages describing the processes used (like a step-by-step walkthrough).
Thanks,
Simon
BigQuery's partitioned tables are also very well documented here:
https://cloud.google.com/bigquery/docs/partitioned-tables
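As a rough, hypothetical illustration (table and column names made up), you declare the partitioning column when you create the table, and filters on that column let BigQuery prune partitions at query time:

    -- table partitioned by day and clustered by user_id
    CREATE TABLE mydataset.events (
      user_id STRING,
      event_name STRING,
      event_timestamp TIMESTAMP
    )
    PARTITION BY DATE(event_timestamp)
    CLUSTER BY user_id;

    -- the filter on the partitioning column means only matching partitions are scanned
    SELECT event_name, COUNT(*) AS n
    FROM mydataset.events
    WHERE DATE(event_timestamp) = '2019-01-01'
    GROUP BY event_name;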
I don't think I understand what you want to know. Please rephrase your question after reading all of the above.
I have a table with Id and Text fields. The Text field holds sentences, averaging 50 words. There are >1,000,000 rows.
This is part of a web app where users need to be able to search through these sentences. Here's the twist, though: I need to be able to run a custom search function written in C# that uses machine learning instead.
From what I understand, this means I'll have to download the entire database of >1,000,000 rows every time a user makes a search! This seems really inefficient to me.
How would you implement this in the most efficient/fast way possible?
If this is relevant, I'm using EF Core with LINQ .Where(my_custom_search_function), with a PostgreSQL database
I think I've found the solution. PostgreSQL full-text search currently provides two ranking functions (ts_rank and ts_rank_cd). In this case, "sorting" in the question and "ranking" here refer to the same thing.
The PostgreSQL docs state:
However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.
These functions can be any of the four kinds of supported PostgreSQL functions.
Then they answer this exact question:
Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.
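As a rough sketch of what the query ends up looking like (the sentences table name and the search terms are placeholders; a real setup would typically keep a precomputed tsvector column with a GIN index):

    -- rank matching rows with ts_rank and return the best matches first
    SELECT id,
           text,
           ts_rank(to_tsvector('english', text), query) AS rank
    FROM sentences,
         to_tsquery('english', 'machine & learning') AS query
    WHERE to_tsvector('english', text) @@ query
    ORDER BY rank DESC
    LIMIT 20;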
Credits to @Used_By_Already for pointing me to PostgreSQL full-text search.
In my project:
Data is not going to be modified (query only).
There will be more than 1,000,000 instances of data.
Query performance is critical.
In case of using SQL, it is going to be a single table with 7 columns (no joins).
There are also different classification approaches used in NoSQL, which are given below with some examples:
Column: Accumulo, Cassandra, HBase
Document: Clusterpoint, Couchdb, Couchbase, MarkLogic, MongoDB
Key-value: Dynamo, FoundationDB, MemcacheDB, Redis, Riak, FairCom c-treeACE
Graph: Allegro, Neo4J, OrientDB, Virtuoso, Stardog
Source: http://en.wikipedia.org/wiki/NoSQL#cite_note-7
First of all, does the database system really make an observable difference in performance for this case?
If it does, can you please explain which one is more suitable for my project, SQL or NoSQL? And if NoSQL, which classification approach?
Thank you in advance
I am currently working on a project to set up a "standard" database with a huge amount of data. We start by implementing it in SQL to see the performance of the queries. Once this is done, we address any performance problems.
There are multiple reasons for this, but to name a few:
Standard SQL is easy to implement and consistent across multiple implementations (as of the present day)
If you know SQL, you can make a fast implementation to save time and get the project going
There is plenty of information available about SQL implementations
I cannot answer about NoSQL but hopefully someone can fill me in.
The important question you need to ask is what kind of queries you will be performing. For example ClusterPoint offers real-time aggregation, so if you need result grouping and extracting summaries, it gives you great performance.
For a regular key/value they should all perform pretty well, so pick the one you are most comfortable with.
It would be great to have a way to query Google's BigQuery with MDX. I believe the natural solution would be a Mondrian adapter.
Is something like this in the works?
The reason I'm asking is because there is a lot of know-how in MDX and an MDX connector would allow us to reuse what we already know.
Furthermore, MDX is ideally suited for OLAP queries. Things like hierarchies and calculating a ratio of one's parent (e.g. % contribution to total) are standardized in MDX but can be solved in 100 different ways in SQL.
Calculating a moving average of the last 3 non-empty weeks is still complicated in SQL and easy in MDX. There are many examples.
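For example, a 3-week moving average with window functions looks roughly like this in SQL (weekly_sales is a made-up table; weeks with no data simply have no row, so the window only sees non-empty weeks):

    SELECT
      week,
      sales,
      AVG(sales) OVER (
        ORDER BY week
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW  -- current week plus the two previous non-empty weeks
      ) AS moving_avg_3wk
    FROM weekly_sales
    ORDER BY week;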
And lastly, it would allow us to analyze data from Google BigQuery with an Excel pivot table or any of the 100+ other existing tools that emit MDX queries.
Cheers,
Micha
There is a demo here that is using Mondrian/BigQuery with the Saiku user interface:
http://dev.analytical-labs.com/
This archive contains dependencies that can be used to set up a BigQuery data source in Saiku's embedded Mondrian server (got this from the Saiku twitter feed):
http://t.co/EbtaP95G
Their instructions for setting up BigQuery are here:
https://gist.github.com/4073088
You can download Saiku (with embedded Tomcat and Mondrian) here to run locally for testing:
http://analytical-labs.com/downloads.php
One issue I notice is that the drill-down functionality doesn't work because of the limitations of BigQuery SQL. My guess is that Mondrian devs will have to add some special SQL support for BigQuery to get around this. For example, any fields used in an ORDER BY clause must also be in the SELECT field list.
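To illustrate that restriction with a made-up example, the first query below would be rejected because the ORDER BY expression is not in the SELECT list, while the rewritten version is accepted:

    -- rejected under that restriction: ordering by an expression that is not selected
    SELECT country
    FROM sales
    GROUP BY country
    ORDER BY MAX(order_date);

    -- accepted: the ordered expression also appears in the SELECT list
    SELECT country, MAX(order_date) AS last_order
    FROM sales
    GROUP BY country
    ORDER BY last_order;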
There is no existing BigQuery integration with Pentaho's Mondrian. One thing I would point out is that BigQuery is already very fast over massive datasets, so some of Mondrian's advantages may be moot with a BigQuery back end. However, I could imagine that one could use an existing Pentaho analysis tool to explore data. I'd like to know more about the use case.
I would like to know more about "MDX" (Multidimensional Expressions).
What is it?
What is it used for?
Why would you use it?
Is it better than SQL?
What is its use in SAP BPC (I haven't seen BPC, just heard that MDX is in it and want to know more)?
MDX is the query language developed by Microsoft for use with their OLAP tools. Since its creation, others (the open source project Mondrian, and Hyperion) have tried to create versions of it for use in their products.
OLAP data tends to look like a star-schema with a central fact table surrounded by multiple dimensions. MDX is designed to allow you to query these structures and create cross-tab type results.
While the language looks like SQL, it doesn't behave like it, and if you are an SQL programmer, the mental leap can be tough.
As to whether it is better than SQL, it serves a highly specialized purpose, i.e. analyzing data in a specific format. So if you want to query a star schema, it is better; otherwise, SQL will probably do the job.
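For comparison, here is roughly what the equivalent query looks like in plain SQL against a made-up star schema; MDX expresses the same dimensional slicing without spelling out the joins:

    -- sum a fact-table measure, sliced by two dimensions
    SELECT d.year,
           p.category,
           SUM(f.sales_amount) AS sales
    FROM fact_sales AS f
    JOIN dim_date AS d    ON f.date_key = d.date_key
    JOIN dim_product AS p ON f.product_key = p.product_key
    GROUP BY d.year, p.category
    ORDER BY d.year, p.category;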
MDX stands for MultiDimensional eXpressions, or some such. It is relevant to OLAP cubes and not to regular relational databases such as Oracle or SQL Server (although some SQL Server editions come with Analysis Services, which is OLAP). The multidimensional world is about data warehousing and efficient reporting, not about normal transactional processing, so you wouldn't use it for an order-entry system, but you might move that data into a data mart to run reports against to see sales trends. That should be enough to get you started, I hope.
SQL is for 'traditional' databases (OLTP). Most people learn the basics fairly easily.
MDX is only for multi-dimensional databases (OLAP), and is harder to learn than SQL in my opinion. The trouble is they look very similar.
Many programmers never need MDX, even if they have to query multi-dimensional databases, because most analysis software forces them to build reports with drag-and-drop interfaces.
If you don't have a requirement to work with a multi-dimensional database, then don't create one just for the fun of it... it won't be.
There are two versions of SAP-BPC (BusinessObjects Planning and Consolidation):
SAP-BPC Netweaver
SAP-BPC Microsoft Analysis Services
The Microsoft Analysis Services version of the product allows you to use MDX, or multidimensional expressions, both to query the multidimensional database (OLAP) and to write calculation logic.
However, SAP-BPC does not require knowledge of MDX to be used or administered.
You can see product documentation and a demonstration.
Best of luck on your research,
Focused on SAP BPC:
What is it used for?
It's used when you want to apply some custom calculation/business logic over many records/intersections after submitting raw data. For example: first send prices in one input schedule, then quantities in another one, and as a third step run a calculation for sales amount based on prices and quantities for all products.
It's also used to execute Business Rules; for that you run a predefined program (like CALC_ACCOUNT, CONSOLIDATION, etc.).
Is it better than SQL?
In BPC, "SQL" logic scripts have better performance than MDX. However, SQL for BPC purposes doesn't have much to do with the SQL used in other systems; it's just what they call it.
You will get a good start by just searching for MDX in the search box up top.