(Based on "How do I query the streaming buffer in BigQuery if the _PARTITIONTIME field isn't available with Standard SQL", I am guessing my question has no simple solution, so I will "enhance" it.)
I stream my data into a BigQuery table that is partitioned and clustered on a timestamp field (not an ingestion-time partition).
I want a view that always looks at the last hour of data: what is already in the table, plus what is still in the streaming buffer.
Since this is not an ingestion-time partitioned table, there is no _PARTITIONTIME/_PARTITIONDATE pseudo column, so I can't use it to get the buffered data.
The only way I've found is by using legacy SQL: SELECT * FROM [dataset.streaming_data$__UNPARTITIONED__]
This is not good enough for me, since even if I save this as a view, I can't refer to a legacy SQL view from a standard SQL query.
Any idea how I can achieve this?
Another idea I'm considering: BigQuery can query an external data source (using EXTERNAL_QUERY) with Standard SQL.
A solution might be some "temporary" table in a separate database (such as PostgreSQL on Cloud SQL) which would only hold one hour of data and wouldn't have BigQuery's buffer mechanism.
I think this is a bad solution, but I guess it might work...
What do you think?
Thanks to @Felipe Hoffa I just found out I need to do nothing :-)
Buffered data is already returned by any SQL query, as long as the WHERE clause covers it...
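For example, a minimal Standard SQL sketch of such a view, assuming the table is mydataset.streaming_data and the partitioning column is a TIMESTAMP named event_ts (both are placeholder names):

-- View over the last hour of data; buffered rows are included because the
-- filter is on the data itself, not on a partition pseudo column.
CREATE OR REPLACE VIEW `mydataset.streaming_last_hour` AS
SELECT *
FROM `mydataset.streaming_data`
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);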
As far as I understand, Hive keeps track of the schema for all partitions, handling schema evolution.
Is there any way to get the schema for a particular partition? For example, if I want to compare the schema of some old partition with the latest one.
The SHOW TABLE EXTENDED command gives you a fair amount of information about the partition's columns and their types, which you could probably use.
SHOW TABLE EXTENDED [IN|FROM database_name] LIKE 'identifier_with_wildcards' [PARTITION(partition_spec)];
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTable/PartitionExtended
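A hypothetical example of that command, assuming a table named events in database mydb, partitioned by a string column dt:

-- Inspect a single partition; the output includes column and storage
-- information for that partition.
SHOW TABLE EXTENDED IN mydb LIKE 'events' PARTITION(dt='2021-01-01');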
Let's say I have a table with multiple partitions and I need to query something from the entire table. Is there a difference, from a performance point of view, between running a single SQL query on the entire table and running one query per partition?
Edit: I'm using Postgres.
In Microsoft SQL Server, when you create a partition function for partitioning a table, that function partitions the data and routes queries to the relevant data file.
For example, if your partition function is defined on a datetime field and partitions the data yearly, a query runs against only the single data file that contains the data matching your WHERE clause.
Therefore you don't need to split your query; the SQL Server engine does that automatically.
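A hedged illustration, assuming a table Orders partitioned yearly on OrderDate (Orders, OrderDate and Amount are placeholder names):

-- The WHERE clause falls entirely inside one yearly range, so the engine
-- only reads the partition holding 2013 data (partition elimination).
SELECT SUM(Amount)
FROM Orders
WHERE OrderDate >= '20130101' AND OrderDate < '20140101';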
It depends on what your intention is.
If you already have a partitioned table and are deciding what the best strategy to retrieve all rows is, then running a query against the partitioned table is almost certainly the faster solution.
Retrieval of all partitions will most likely be parallelized (depending on your configuration of parallel query). If you query each partition manually, you would need to implement that yourself e.g. creating multiple connections with each one running a query against one partition.
However, if your intention is to decide whether it makes sense to partition a table, then the answer isn't so straightforward. If you have to query all rows of the table very often, then this is usually (slightly) slower than querying a single non-partitioned table. If that is the exception and you almost always run queries that target a single partition, then partitioning does make sense.
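As a rough sketch of the two approaches in Postgres, assuming a declaratively partitioned table named measurements with yearly partitions measurements_2022 and measurements_2023 (all names are placeholders):

-- One query against the parent table: the planner scans every partition
-- and can parallelize the work for you.
SELECT count(*) FROM measurements;

-- One query per partition: you would have to run these separately
-- (e.g. over several connections) and combine the results yourself.
SELECT count(*) FROM measurements_2022;
SELECT count(*) FROM measurements_2023;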
I've looked at previous questions, but the GCP links given were outdated, so I would like to learn the best way to do the conversion while inserting into the correct partition (meaning not the day I inserted the records, but according to the "date" column).
Could someone point me in the right direction, specifically for Legacy SQL.
From the docs: "Currently, legacy SQL is not supported for querying partitioned tables or for writing query results to partitioned tables".
So, in this case, because Legacy SQL can't write to partitioned tables, which seems to be a major blocker with no workaround, you would have to use Standard SQL or Dataflow, as detailed in the answers to the question provided by Graham.
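For reference, a hedged Standard SQL sketch of the conversion, assuming a non-partitioned source table mydataset.events whose DATE column is named event_date (all names are placeholders standing in for your "date" column):

-- Create a partitioned copy; each row lands in the partition that matches
-- its event_date value, regardless of when the statement is run.
CREATE TABLE `mydataset.events_partitioned`
PARTITION BY event_date
AS
SELECT * FROM `mydataset.events`;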
Our use case for BigQuery is a little unique. I want to start using Date-Partitioned Tables but our data is very much eventual. It doesn't get inserted when it occurs, but eventually when it's provided to the server. At times this can be days or even months before any data is inserted. Thus, the _PARTITION_LOAD_TIME attribute is useless to us.
My question is there a way I can specify the column that would act like the _PARTITION_LOAD_TIME argument and still have the benefits of a Date-Partitioned table? If I could emulate this manually and have BigQuery update accordingly, then I can start using Date-Partitioned tables.
Anyone have a good solution here?
You don't need to create your own column.
The _PARTITIONTIME pseudo column will still work for you!
The only thing you need to do is insert/load each data batch into its respective partition by referencing not just the table name, but the table with a partition decorator, like yourtable$20160718.
This way you can load data into the partition it belongs to.
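For instance, a hedged Standard SQL sketch (table name is a placeholder): after a batch has been loaded through the decorator yourtable$20160718, those rows can be selected via the pseudo column for that same day.

-- Rows loaded via the $20160718 decorator carry that day's _PARTITIONTIME.
SELECT *
FROM `mydataset.yourtable`
WHERE _PARTITIONTIME = TIMESTAMP('2016-07-18');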
I'm working on a program that works with SQL Server.
For storing data in a database table, which of the approaches below is correct?
Store many rows in just one table (10 million records)
Store fewer rows in several tables (500,000 records each), e.g. one table per year
It depends on how often you access the data. If you are not using the old records, you can archive them. Splitting data across several tables is not desirable, as it can complicate fetching the data.
I would say to store all the data in a single table, but implement table partitioning on the older data. Partitioning the data will increase query performance; a sketch of such a scheme follows the references below.
Here are some references:
http://www.mssqltips.com/sqlservertip/1914/sql-server-database-partitioning-myths-and-truths/
http://msdn.microsoft.com/en-us/library/ms188730.aspx
http://blog.sqlauthority.com/2008/01/25/sql-server-2005-database-table-partitioning-tutorial-how-to-horizontal-partition-database-table/
Please note that this table partitioning functionality is only available in Enterprise Edition.
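As a minimal sketch of such a scheme, partitioning a table yearly on a datetime column (pfYearly, psYearly, Orders, OrderDate and Amount are placeholder names):

-- Partition function: one boundary value per year.
CREATE PARTITION FUNCTION pfYearly (datetime)
    AS RANGE RIGHT FOR VALUES ('20120101', '20130101', '20140101');

-- Partition scheme: map every partition to a filegroup (PRIMARY here for simplicity).
CREATE PARTITION SCHEME psYearly
    AS PARTITION pfYearly ALL TO ([PRIMARY]);

-- Create the table on the partition scheme, keyed by the date column.
CREATE TABLE Orders
(
    OrderId   int      NOT NULL,
    OrderDate datetime NOT NULL,
    Amount    money    NOT NULL
) ON psYearly (OrderDate);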
Well, it depends!
What are you going to do with the data? If you query this data frequently, it could be better to split the data into (for example) per-year tables. That way you would get better performance, since you would be querying smaller tables.
On the other hand, with a bigger table and well-written queries you might not see a performance issue at all. If you only need to store this data, it would be better to just use one table.
BTW, for loading this data into the database you could use BCP (bulk copy), which is a fast way of inserting a lot of rows.
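bcp itself is a command-line utility; the closest equivalent from inside T-SQL is BULK INSERT, sketched here with placeholder table and file names:

-- Bulk-load a CSV file into the table, skipping the header row.
BULK INSERT dbo.Orders
FROM 'C:\data\orders.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);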