What are the different ways to distribute data in Hive, e.g. PARTITIONED BY, bucketing, DISTRIBUTE BY, etc.?
Apache Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of particular columns like date, city, and department.
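As a rough illustration of the three mechanisms named in the question, the HiveQL below partitions a table by column values, buckets it within each partition, and uses DISTRIBUTE BY in a query. The table and column names (sales, sale_date, city, customer_id) and the bucket count are hypothetical, chosen only for this sketch.

```sql
-- Hypothetical table: one directory per (sale_date, city) partition,
-- with rows hashed on customer_id into 32 bucket files inside each partition.
CREATE TABLE sales (
  customer_id BIGINT,
  amount      DOUBLE
)
PARTITIONED BY (sale_date STRING, city STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;

-- A filter on the partition columns lets Hive read only the matching partitions.
SELECT customer_id, SUM(amount) AS total
FROM sales
WHERE sale_date = '2020-01-01' AND city = 'London'
GROUP BY customer_id;

-- DISTRIBUTE BY sends all rows with the same customer_id to the same reducer;
-- SORT BY then orders rows within each reducer (not globally).
SELECT customer_id, amount
FROM sales
DISTRIBUTE BY customer_id
SORT BY customer_id, amount;
```

Partitioning suits low-cardinality columns that appear in filters, while bucketing spreads a high-cardinality column over a fixed number of files.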
My BigQuery table is commonly queried with different combinations of "where" conditions across 1 or more common columns, say across columns A, B, C (not in order). Hence, I would like to add individual clusters for columns A, B, and C respectively.
How can I create multiple clusters for BigQuery tables? (Similar to how multiple indexes can be created on a traditional RDBMS table.)
Clustering on multiple columns is allowed, but it is hierarchical: you cluster by a specific field, the data is then subclustered on the following field, and so on.
At the time this was written, clustering was only allowed for partitioned tables; BigQuery has since added support for clustering non-partitioned tables as well.
You can find the corresponding documentation here
Judging from some comments and documentation pages, there appears to be no way to have multiple independent clusters on a single BigQuery table (unlike the multiple indexes that can be created on a traditional RDBMS table).
This is because clusters pretty much just sort the data blocks of that table as per docs:
When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query that contains a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Hence, it appears that there is no way of applying multiple sorting logic for each independent cluster on the same set of data, and so what I require appears to be impossible as of now.
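To make the hierarchical behaviour concrete, here is a minimal BigQuery sketch with a single clustering specification over the three columns from the question. The dataset name mydataset, the table name, and the created_at partitioning column are assumptions added for illustration.

```sql
-- Hypothetical table with one clustering specification: A, then B, then C.
CREATE TABLE mydataset.events_clustered (
  created_at TIMESTAMP,
  A STRING,
  B STRING,
  C STRING
)
PARTITION BY DATE(created_at)
CLUSTER BY A, B, C;

-- Filters on A (or on A and B) follow the sort order, so blocks can be skipped.
SELECT COUNT(*)
FROM mydataset.events_clustered
WHERE A = 'x' AND B = 'y';

-- A filter on C alone does not follow the sort order, so it is likely to
-- benefit much less from clustering.
SELECT COUNT(*)
FROM mydataset.events_clustered
WHERE C = 'z';
```

Putting the most frequently filtered column first in CLUSTER BY is usually the best compromise when the filter combinations vary.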
I have a Dataset with an uber schema, and the requirement is to write to different Hive tables depending on some column values. Basically, the combined column values determine the target Hive table. I thought about using groupBy, but its result is meant for aggregation, and repartition doesn't guarantee that one partition maps to one Hive table. Are there any other options?
For streaming inserts, I want to use a template table (with a user id suffix) which is itself a partitioned table. This way I can make my tables smaller than with partitioned tables alone, and hence make my queries more cost-effective. My query cost per user would also stay constant irrespective of the number of users in my system. As per the documentation at https://cloud.google.com/bigquery/streaming-data-into-bigquery:
To create smaller sets of data by date, use time-partitioned tables. To create smaller tables that are not date-based, use template tables and BigQuery creates the tables for you.
It sounds as if it can either be a time-partitioned table OR a template table. Can it not be both? If not, is there another architecture that I should look into?
One more concern regarding my above proposed architecture is the 4000 limit that I saw on https://cloud.google.com/bigquery/docs/partitioned-tables . Does it mean that my partitioned table can't cover more than 4000 days? Will I have to delete old partitions in this case or will the last partition keep storing any subsequent streamed data?
You should look into Clustered Tables on partitioned tables.
With that you can have ONE table with all users in it, partitioned by time, and clustered by user_id as you would use in a template table.
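A minimal sketch of that single-table layout follows. The dataset, table, and column names (mydataset.user_events, event_ts, event_name) are hypothetical; only user_id comes from the question.

```sql
-- One table for all users: partitioned by day, clustered by user_id.
CREATE TABLE mydataset.user_events (
  event_ts   TIMESTAMP,
  user_id    STRING,
  event_name STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id;

-- The date filter limits the scan to a week of partitions, and the
-- user_id filter lets BigQuery skip blocks that belong to other users.
SELECT event_name, COUNT(*) AS events
FROM mydataset.user_events
WHERE DATE(event_ts) BETWEEN '2019-01-01' AND '2019-01-07'
  AND user_id = 'user_123'
GROUP BY event_name;
```

Compared with one template table per user, this keeps a single schema to maintain while still limiting how much data a per-user query has to read.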
Introduction to Clustered Tables
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.
Clustered table pricing
When you create and use clustered tables in BigQuery, your charges are based on how much data is stored in the tables and on the queries you run against the data. Clustered tables help you to reduce query costs by pruning data so it is not processed by the query.
I have a table with about 20-25 million records. I have to insert them into another table based on some condition, and the result must also be sorted. Example:
Create table X AS
select * from Y
where item <> 'ABC'
Order By id;
I know that ORDER BY uses a single reducer to guarantee total order in the output.
I need an optimized way to do the sorting for the above query.
SQL tables represent unordered sets. This is especially true in parallel databases where the data is spread among multiple processors.
That said, Hive does support clustered indexes (which essentially define partitions) and sorting within the partitions. The documentation is quite specific, though, that this is not supported with CREATE TABLE AS:
CTAS has these restrictions:
The target table cannot be a partitioned table.
You could do what you want by exporting the data and re-importing it.
However, I would suggest that you figure out what you really need without requiring the data to be ordered within the database.
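If ordering within each output file is enough, one alternative to the single-reducer ORDER BY is DISTRIBUTE BY combined with SORT BY. This is a sketch of a different technique from the export and re-import suggested above, reusing the X, Y, item, and id names from the question; it does not produce a total order across files.

```sql
-- Rows with the same id go to the same reducer (DISTRIBUTE BY),
-- and each reducer sorts only its own rows (SORT BY),
-- so the sort runs in parallel instead of on a single reducer.
CREATE TABLE X AS
SELECT *
FROM Y
WHERE item <> 'ABC'
DISTRIBUTE BY id
SORT BY id;
```

Any query that needs a guaranteed global order should still apply its own ORDER BY when reading from X.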
Would someone be able to tell me the difference between a wildcard table and a sharded table in BigQuery?
Is the table physically broken into different tables with sharded tables? How do I query a sharded table? Is there any documentation for this?
Yes, a sharded table is a table that is broken into many shards, typically with the same schema. You can use a wildcard to query across these shards in BigQuery.
For example, if you have many tables with the prefix ga_sessions_, you can select them all with SELECT * FROM `ga_sessions_*`.
If you are creating these tables for the first time, you may want to consider partitioning or clustering, both of which are often more performant than sharding. Here's a link to the official documentation, which also describes some of these benefits:
https://cloud.google.com/bigquery/docs/partitioned-tables
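The wildcard query mentioned above can also be restricted to a subset of shards with the _TABLE_SUFFIX pseudo-column. In this sketch the project and dataset names (myproject, mydataset) are hypothetical, and the shards are assumed to carry a YYYYMMDD date suffix.

```sql
-- Query every shard whose name starts with ga_sessions_.
-- _TABLE_SUFFIX holds the part of the table name matched by the wildcard,
-- so this filter keeps only the shards for January 2017.
SELECT *
FROM `myproject.mydataset.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170131';
```

The backticks around the table name are required because of the wildcard character.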