I want to join two large tables with many columns using Presto SQL syntax in AWS Athena. My code is pretty simple:
select
*
from TableA as A
left join TableB as B
on A.key_id = B.key_id
;
After joining, the primary key column (key_id) appears twice. Both tables have more than 100 columns, and the join takes very long. How can I fix this so that the key_id column does not appear twice in the final result?
P.S. AWS Athena does not support the SELECT * EXCEPT(...) syntax, unlike Google BigQuery.
This would be a nice feature, but it is not part of standard SQL. In standard SQL, the EXCEPT keyword is a set operation (i.e. it filters rows, not columns).
In Athena, as with standard SQL, you will have to specify the columns you want to include. The argument for this is that it's lower maintenance, and in fact best practice is to always explicitly state the columns you want - never leaving this to "whatever columns exist". This will help ensure your queries don't change behaviour if/when your table structure changes.
Some SQL dialects have features like this; I understand Oracle does too. But to my knowledge Athena (/ PrestoSQL / Trino) does not.
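If the goal is just to keep key_id from appearing twice, here is a minimal sketch of two workarounds (assuming both tables name the key column identically; col1 and col2 stand in for your real column names):
-- Option 1: JOIN ... USING merges the join key into a single output column
-- in Presto/Trino (behaviour may vary with the Athena engine version)
select *
from TableA as A
left join TableB as B
using (key_id);

-- Option 2: list B's columns explicitly and leave out B.key_id
select A.*, B.col1, B.col2
from TableA as A
left join TableB as B
on A.key_id = B.key_id;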
Related
We have a dataset in BigQuery with more than 500,000 tables; when we run queries against this dataset using legacy SQL, it throws an error.
As per Jordan Tigani, it executes SELECT table_id FROM <dataset>.__TABLES_SUMMARY__ to get the relevant tables to query.
How do I use the TABLE_QUERY() function in BigQuery?
Do queries using _TABLE_SUFFIX (standard SQL) execute __TABLES_SUMMARY__ to get the relevant tables to query?
According to the documentation, _TABLE_SUFFIX is a pseudo column that contains the values matched by the table wildcard, and it is only available in standard SQL. Meanwhile, __TABLES_SUMMARY__ is a meta-table that contains information about the tables within a dataset, and it is available in both standard and legacy SQL. Therefore, they are two different concepts.
However, in standard SQL, you can use INFORMATION_SCHEMA.TABLES to retrieve information about the tables within the chosen dataset, similarly to __TABLES_SUMMARY__. Here you can find examples of usage and also its limitations.
Below, I queried against a public dataset using both methods:
First, using INFORMATION_SCHEMA.TABLES.
SELECT * FROM `bigquery-public-data.noaa_gsod.INFORMATION_SCHEMA.TABLES`
And part of the output:
Secondly, using __TABLES_SUMMARY__.
SELECT * FROM `bigquery-public-data.noaa_gsod.__TABLES_SUMMARY__`
And part of the output table,
As you can see, each method's output has its own particular schema, even though both retrieve metadata about the tables within a given dataset.
NOTE: BigQuery queries have quotas. These quotas apply in several situations, including the number of tables a single query can reference, which is 1,000 per query (see here).
No, querying using a wildcard table does not execute __TABLES_SUMMARY__. You can have more than 500k tables in the dataset, but it does require the number of tables matching the prefix pattern to be less than 500k. For other limitations on wildcard tables you can refer to the documentation.
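For illustration, here is a minimal wildcard-table query against the same public dataset (a sketch; it assumes the noaa_gsod tables keep their year-suffixed gsod* naming), where _TABLE_SUFFIX restricts which of the matched tables are actually scanned:
-- standard SQL: scan only the gsod2015..gsod2017 tables
SELECT stn, year, mo, da, temp
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX BETWEEN '2015' AND '2017'
LIMIT 10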
We are using Looker (a dashboard/reporting solution) to create persistent derived tables in BigQuery. These are normal tables as far as BigQuery is concerned, but the naming follows Looker's convention (it creates a hash based on DB + SQL etc.) and names the table accordingly. These tables are generated from views on a daily schedule. The table names in BigQuery look like below.
table_id
LR_Z504ZN0UK2AQH8N2DOJDC_AGG__table1
LR_Z5321I8L284XXY1KII4TH_MART__table2
LR_Z53WLHYCZO32VK3FWRS2D_JND__table3
If I query the resulting table in BQ by explicit name then the result is returned as expected.
select * from `looker_scratch.LR_Z53WLHYCZO32VK3FWRS2D_JND__table3`
Looker changes the hash value in the table name when the table is regenerated after a query/job change. Hence I wanted to create a view with a wildcard table query to make the changes in the table name transparent to the outside world.
But the below query always fails.
SELECT *
FROM `looker_scratch.LR_*`
where _table_suffix like '%JND__table3'
I either get a completely random schema with null values or errors such as:
Error: Cannot read field 'reportDate' of type DATE as TIMESTAMP_MICROS
There are no clashing table suffixes, and I have used all sorts of pattern checks (lower, contains, etc.).
Is this happening because the table names have hash values in them? I have run multiple tests on other datasets and there is absolutely no problem; we have been running wildcard table queries for a long time and have faced no issues whatsoever.
Please let me know your thoughts.
When you are using a wildcard like the one below
`looker_scratch.LR_*`
you are actually looking for ALL tables with this prefix, and then, when you apply the clause below
LIKE '%JND__table3'
you further filter to tables with that suffix.
So the trick here is that the very first (chronologically) table defines the schema of your output.
To address your issue, verify whether there are more tables that match your query, and then look into the very first one (the one that was created first).
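One way to check, as a sketch (it assumes the commonly used __TABLES__ meta-table is readable in the looker_scratch dataset): list every table matching the prefix together with its creation time, so you can see which table was created first and therefore dictates the output schema:
SELECT table_id, TIMESTAMP_MILLIS(creation_time) AS created_at
FROM `looker_scratch.__TABLES__`
WHERE STARTS_WITH(table_id, 'LR_')
ORDER BY creation_time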
If I have 2,000 tables that I'd like to union together, can I do that using a wildcard query, like this?
Or does the 1,000-tables referenced per query limit still apply?
does the 1,000-tables referenced per query limit still apply?
Yes. It still applies!
BigQuery looks at how many tables are involved in the query (no matter what syntax/functionality is used). Whether you explicitly list all the needed tables or use a wildcard, in the end the same number of tables is involved, so the same limitation applies.
Note: partitions in a partitioned table are not counted as separate tables.
My boss wants me to do a join on three tables, let's call them tableA, tableB, tableC, which have respectively 74M, 3M and 75M rows.
In case it's useful, the query looks like this:
SELECT A.*,
C."needed_field"
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
ON B."field_join_BC" = C."field_join_BC"
When trying the query on Dataiku Data Science Studio + Vertica, it seems to create temporary data to produce the output, which fills up the 1 TB of space on the server, bloating it.
My boss doesn't know much about SQL, so he doesn't understand that in the worst-case scenario, it can produce a table with 74M*3M*75M ≈ 1.7*10^22 rows, which could be the problem here (and I'm brand new and I don't know the data yet, so I don't know whether the query is likely to produce that many rows or not).
Therefore I would like to know if there is a way of knowing beforehand how many rows will be produced: if I did a COUNT(), such as this, for instance:
SELECT COUNT(*)
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
ON B."field_join_BC" = C."field_join_BC"
Does the underlying engine produce the whole dataset and then count it? (Which would mean I can't count it beforehand, at least not that way.)
Or is it possible that a COUNT() gives me a result? (Because it's not building the dataset but working it out some other way.)
(NB: I am currently testing it, but the count has been running for 35 min now.)
Vertica is a columnar database. Any query you do only needs to look at the columns required to resolve output, joins, predicates, etc.
Vertica also is able to query against encoded data in many cases, avoiding full materialization until it is actually needed.
Counts like that can be very fast in Vertica. You don't really need to jump through hoops; Vertica will only include the columns that are actually used. The optimizer won't try to reconstitute the entire row, only the columns it needs.
What's probably happening here is that you have hash joins with rebroadcasting. If your underlying projections do not line up and your sorts are different and you are joining multiple large tables together, just the join itself can be expensive because it has to load it all into hash and do a lot of network rebroadcasting of the data to get the joins to happen on the initiator node.
I would consider running the Database Designer (DBD) using these queries as input, especially if these are common query patterns. If you haven't run DBD at all yet and are not using custom projections, then your default projections will likely not perform well and will cause the situation I mention above.
You can run EXPLAIN to see what's going on.
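For example, a quick sketch using a simplified version of the count query from the question, to look for hash joins and broadcast/resegment steps in the plan:
EXPLAIN
SELECT COUNT(*)
FROM "tableA" A
INNER JOIN "tableB" B ON A."field_join_AB" = B."field_join_AB"
INNER JOIN "tableC" C ON B."field_join_BC" = C."field_join_BC";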
I am new to columnar DB concepts and BigQuery in particular. I noticed that for the sake of performance and cost efficiency it is recommended to split data across tables not only logically - but also by time.
For example - while I need a table to store my logs (1 logical table that is called "logs"), it is actually considered a good practice to have a separate table for different periods, like "logs_2012", "logs_2013", etc... or even "logs_2013_01", "logs_2013_02", etc...
My questions:
1) Is it actually the best practice?
2) Where would it be best to draw the line - an annual table? A monthly table? A daily table? You get the point...
3) In terms of retrieving the data via queries - what is the best approach? Should I construct my queries dynamically using the UNION option? If I had all my logs in one table - I would naturally use the where clause to get data for the desired time range, but having data distributed over multiple tables makes it weird. I come from the world of relational DB (if it wasn't obvious so far) and I'm trying to make the leap as smoothly as possible...
4) Using the distributed method (different tables for different periods) still raises the following question: before querying the data itself, I want to be able to determine, for a specific log type, what range is available for querying. For example, for a specific machine I would like to first present to my users the relevant scope of their available logs, and let them choose the specific period within that scope to get insights for. The question is: how do I construct such a query when my data is distributed over a number of tables (each for a period) and I don't know which tables are available? How can I construct a query when I don't know which tables exist? I might try to access the table "logs_2012_12" when this table doesn't actually exist, or even worse, I wouldn't know which tables are relevant and available for my query.
Hope my questions make sense...
Amit
Table naming
For daily tables, the suggested table name pattern is the specific name of your table + the date like in '20131225'. For example, "logs20131225" or "logs_20131225".
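With that naming in place, legacy SQL's TABLE_DATE_RANGE function can pick out a span of daily tables by date (a sketch; mydataset.logs_ is a hypothetical table prefix):
-- legacy SQL: unions logs_20131201 through logs_20131231
SELECT *
FROM (TABLE_DATE_RANGE([mydataset.logs_],
                       TIMESTAMP('2013-12-01'),
                       TIMESTAMP('2013-12-31')))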
Ideal aggregation: Day, month, year?
The answer to this question will depend on your data and your queries.
Will you usually query one or two days of data? Then have daily tables, and your costs will be much lower, as you query only the data you need.
Will you usually query all your data? Then have all the data in one table. Having many tables in one query can get slower as the number of tables to query grows.
If in doubt, do both! You could have daily, monthly, yearly tables. For a small storage cost, you could save a lot when doing queries that target only the intended data.
Unions
Feel free to do unions.
Keep in mind that there is a limit of 1,000 tables per query. This means if you have daily tables, you won't be able to query 3 years of data (3*365 > 1000).
Remember that unions in BigQuery's legacy SQL don't use the UNION keyword, but the comma that other databases use for joins. Joins in BigQuery can be done with the explicit SQL keyword JOIN (or JOIN EACH for very big joins).
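A minimal sketch of that comma "union" in legacy SQL (logs_20131224 and logs_20131225 are hypothetical daily tables in a dataset called mydataset):
-- legacy SQL: the comma between tables behaves like a UNION ALL
SELECT *
FROM [mydataset.logs_20131224], [mydataset.logs_20131225]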
Table discovery
API: tables.list will list all tables in a dataset, through the API.
SQL: To query the list of tables from within SQL... stay tuned.
New 2016 answer: Partitions
Now you can have everything in one table, and BigQuery will analyze only the data contained in the desired dates - if you set up the new partitioned tables:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables
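A minimal sketch against a hypothetical ingestion-time partitioned table mydataset.logs, where the _PARTITIONTIME pseudo column limits the scan to the desired dates:
-- only the January 2016 partitions are scanned
SELECT *
FROM `mydataset.logs`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-31')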