I've looked at previous questions, but the links given to GCP were outdated so I would like to learn what is the best way to do the conversion while inserting the correct partition (meaning not the day i inserted the records, but according to the "date" column.
Could someone point me in the right direction, specifically for Legacy SQL.
From the docs: "Currently, legacy SQL is not supported for querying partitioned tables or for writing query results to partitioned tables".
So, in this case, because Legacy can't write to partitioned tables, which seems to be a major blocking with no workarounds, you would have to use Standard SQL or Dataflow, as detailed in the answers of the question provided by Graham.
Related
As far as I understand Hive keeps track of schema for all partitions, handling schema evolution.
Is there any way to get schema for particular partition? For example, if I want to compare schema for some old partition with the latest one.
Show extended command does give you a bunch of information around the partition columns and its types, probably you could use those.
SHOW TABLE EXTENDED [IN|FROM database_name] LIKE 'identifier_with_wildcards' [PARTITION(partition_spec)];
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTable/PartitionExtended
I'm trying to understand the rationale behind the partition specific schema managed for Hive/Glue tables. Albeit, I couldn't find any documentation, specifically talking about this but during my search, I found a couple of Hive JIRAs (as attached in references) which hint at its purpose. From what I gathered, partition schema is a snapshot of table schema when it is registered, and it allows Hive to support schema evolution without invalidating existing table partitions and the underlying data. Also, it enables Hive to support different partitions and table level file formats, giving clients more flexibility.
The exact purpose is still not clear to me, so requesting the experts to comment on the following set of questions:
What is the rationale behind maintaining partition specific schema?
How does Hive/Glue behave in case there is a discrepancy in the partition and table schema? Does the resolution criteria consider or is dependent on the underlying data file format?
What are the repercussions of not maintaining partition specific schema in table metadata?
Experimentation and observations:
I ran an experiment on my end, in which I tested a few count, count with partition filters and schema description queries against Glue table without explicit schema definition in partition properties (underlying data files are written in parquet) using Spark-Shell, Hive CLI and Athena. The results retrieved were consistent with the ones computed from the original table.
References:
https://issues.apache.org/jira/browse/HIVE-6131
https://issues.apache.org/jira/browse/HIVE-6835
https://issues.apache.org/jira/browse/HIVE-8839
Thanks!
(I am guessing based on How do I query the streaming buffer in BigQuery if the _PARTITIONTIME field isn't available with Standard SQL that my question has no simple solution, so I will "enhance" it)
I stream my data into Bigquery's partitioned and clustered table using a timestamp field (not an ingestion time partition).
I want to have a view that always look into the last hour data, what already in the table, plus what still in the buffer.
Since this table is not an ingestion time partitioned table, there is no pseudo column _PARTITIONTIME/DATE, so I can't use it in order to get the buffer data.
The only way I've found is by using legacy SQL: SELECT * FROM [dataset.streaming_data$__UNPARTITIONED__]
This is not good enough for me, since even if I save this as a view, I can't refer to a legacy SQL view from a standard SQL query.
Any idea how I can achieve this ?
Another idea I am thinking of - bigquery can have an external data source (using EXTERNAL_QUERY), which I can query using standard SQL.
A solution might be some "temporary" table on a separate database (such as PostgreSQL Cloud SQL) which will only have 1 hour of data, and won't have bigquery's buffer mechanism.
I think this is a bad solution, but I guess it might work...
What do you think ?
Thanks to #Felipe Hoffae I just found out I need to do nothing :-)
Buffered data is already available in any SQL query if the WHERE clause includes the data in it...
When I try to use dynamic table partitions in a query in the web UI in BigQuery (like documented e.g. here), i.e.
SELECT * FROM [dataset.table$0-of-3]
I get the following error:
Error: Cannot read partition information from a table that is not partitioned: project:dataset.table$0-of-3
When I try a table that was partitioned with the new date partitioning (bq mk --time_partitioning_type=DAY ...), I do not get an error but instead:
Query returned zero records.
Also, I can't find the documentation on this feature anymore. Has it been deprecated?
I don't have enough reputation to comment on Mikhail's answer -- so adding an answer here.
At least for now, the dynamic table partitions described in the book were deprecated in favor of table partitioning as described in the latest BigQuery documentation.
We hope to provide richer flavors of partitioning in the future, but they may not be necessarily be available as table decorators.
This ($0-of-3) feature was never implemented - hopefuly it will at some point.
The ONLY partitioning decorator that was recently implemented was for date partitioned tables. see more at Partitioned Tables and timePartitioning.type
DataColumn, DataColumn, DateColumn
Every so often we put data into the table via date.
So everything seems great at first, but then I thought: What happens when there are a million or billion rows in the table? Should I be breaking up the tables by date? This way the query performance will never degrade? How do people deal with this sort of thing?
You can use partitioned tables starting with SQL 2K5: Partitioned Tables
This way you gain the benefits of keeping the logical design pure while being able to move old data into a different file group.
You should not break your tables because of data. Instead, you should worry about your indexes, normalization and so on.
Update
A little deeper explanation. Let's suppose you have a table with a million records. If you have different dates on [DateColumn], your greatest ally will be the indexes that work with the [DateColumn]. Then you make sure your queries always filter by at least [DateColumn].
This way, you will be fine.
This easily qualifies as premature optimization, which is tough to achieve in db design IMHO, because optimization is/should be closer to the surface in data modeling.
But all you need to do is create an index on the DateColumn field. An index is actually a much better performance solution than any kind of table splitting/breaking up and keeps your design and therefore all of you programming much simpler. (And you can decide to use partitioning w/o affecting your design in the future if it helps.)
Sounds like you could use a history table. If you are mostly going to query the current date's data, then migrate the old data to the history table and your main table will not grow so much.
If I understand you question correctly, you have a table with some data and a date. Your question is -- will I see improved performance if I make a new table say, every year. This way the queries will never have to look at more than one years worth of data.
This is wrong. Instead what you should do is set the date field as an index. The server will be able to give you the performance gain you need if it is an index.
If you don't do this your program's logic will get crazy and ultimately slow down your system.
Keep it simple.
(NB - There are some advanced partitioning features you can make use of, but these can be layered in later if needed -- it is unlikely you will need these features but the simple design should be able to migrate to them if needed.)
When tables and indexes become very
large, partitioning can help by
partitioning the data into smaller,
more manageable sections.
Microsoft SQL Server 2005 allows you
to partition your tables based on
specific data usage patterns using
defined ranges or lists. SQL Server
2005 also offers numerous options for
the long-term management of
partitioned tables and indexes by the
addition of features designed around
the new table and index structure.
Furthermore, if a large table exists
on a system with multiple CPUs,
partitioning the table can lead to
better performance through parallel
operations.
You might need considering the
following too: In SQL Server 2005,
related tables (such as Order and
OrderDetails tables) that are
partitioned to the same partitioning
key and the same partitioning function
are said to be aligned. When the
optimizer detects that two partitioned
and aligned tables are joined, SQL
Server 2005 can join the data that
resides on the same partitions first
and then combine the results. This
allows SQL Server 2005 to more
effectively use multiple-CPU
computers.
Read about Partitioned Tables and Indexes in SQL Server 2005