With dynamic partitioning in Hive, suppose we want to partition on a column that sits in the middle of the table. We then have to create a new table and reorder the columns so that the partition column comes last.
Is this really practical on a cluster that holds a huge amount of data?
My question is about list partitioning. I have 600 tables, and each of them is list-partitioned. Can I drop a partition without losing data?
If the partition contains data, you will lose that data.
If the partition is empty, you will not have any problem.
Regards.
Let's say I have a table with multiple partitions and I need to query something from the entire table. Is there a difference, from a performance point of view, between running a single sql query on the entire table and running one sql for each partition?
Later edit: I'm using Postgres.
In Microsoft SQL Server, when you create a partition function to partition a table, the function partitions the data and the engine routes each query to the relevant data file.
For example, if your partition function is defined on a datetime column and partitions the data yearly, your query runs only against the data file that contains the rows matching your WHERE clause.
Therefore you don't need to split your query yourself; the SQL Server engine does that automatically.
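To make the routing idea concrete, here is a minimal sketch in Python of how a yearly range partition function maps a datetime value to a partition number. It is a toy model, not SQL Server's implementation: the boundary dates and the function names are made up for illustration, and it assumes RANGE RIGHT semantics, where a boundary value belongs to the partition on its right.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical yearly boundaries, as a partition function might declare them.
BOUNDARIES = [datetime(2021, 1, 1), datetime(2022, 1, 1), datetime(2023, 1, 1)]


def partition_number(value, boundaries=BOUNDARIES):
    """Return the 1-based partition that holds `value`.

    Assumes RANGE RIGHT semantics: a value equal to a boundary falls into
    the partition to the right of that boundary.
    """
    return bisect_right(boundaries, value) + 1


def partitions_for_range(start, end, boundaries=BOUNDARIES):
    """Return the partitions a filter `start <= col <= end` has to touch."""
    first = partition_number(start, boundaries)
    last = partition_number(end, boundaries)
    return list(range(first, last + 1))


if __name__ == "__main__":
    # A filter on a few months of 2021 only needs one partition.
    print(partitions_for_range(datetime(2021, 3, 1), datetime(2021, 9, 1)))
```

With three boundaries there are four partitions, and a WHERE clause that stays inside one year resolves to a single partition, which is why the query does not need to be split by hand.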
It depends on what your intention is.
If you already have a partitioned table and are deciding what the best strategy to retrieve all rows is, then running a query against the partitioned table is almost certainly the faster solution.
Retrieval of all partitions will most likely be parallelized (depending on your parallel query configuration). If you query each partition manually, you would need to implement that parallelism yourself, e.g. by creating multiple connections, each running a query against one partition.
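For a sense of what "implementing it yourself" involves, here is a rough sketch of the fan-out pattern: one connection per partition, each running its own query, with the partial results combined at the end. To keep it runnable without a server, SQLite stands in for Postgres and plain tables stand in for partitions; the table names and helper functions are hypothetical.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-partition tables; SQLite is only a stand-in for Postgres.
PARTITIONS = ["events_2021", "events_2022", "events_2023"]


def build_demo_db(path):
    """Create one small table per 'partition' so the sketch is runnable."""
    conn = sqlite3.connect(path)
    for i, name in enumerate(PARTITIONS):
        # Table names come from the fixed list above, never from user input.
        conn.execute(f"CREATE TABLE {name} (id INTEGER, amount INTEGER)")
        rows = [(n, n * (i + 1)) for n in range(1, 4)]
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?)", rows)
    conn.commit()
    conn.close()


def sum_partition(path, name):
    """Open a dedicated connection and query exactly one partition."""
    conn = sqlite3.connect(path)
    try:
        (total,) = conn.execute(f"SELECT SUM(amount) FROM {name}").fetchone()
        return total
    finally:
        conn.close()


def sum_all_partitions(path):
    """Fan out one query per partition, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        partials = list(pool.map(lambda name: sum_partition(path, name), PARTITIONS))
    return sum(partials)


def demo():
    """Build a throwaway database and return the combined total."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        build_demo_db(path)
        return sum_all_partitions(path)


if __name__ == "__main__":
    print(demo())
```

Even in this toy form you have to manage the connections, the worker pool and the final merge step yourself, which is the work the database does for you when you query the partitioned table directly.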
However, if your intention is to decide whether it makes sense to partition a table, then the answer isn't so straightforward. If you have to query all rows of the table very often, then this is usually (slightly) slower than querying a single non-partitioned table. If that is the exception and you almost always run queries that target a single partition, then partitioning does make sense.
I have a Dataset with an uber schema, and the requirement is to write to different Hive tables depending on some column values. Basically, the combined column values determine the target Hive table. I thought about using groupBy, but that is meant for aggregation, and repartition doesn't always guarantee that one partition maps to one Hive table. Are there any other options?
For streaming inserts, I want to use a template table (with a user id suffix) that is itself a partitioned table. This way I can make my tables smaller than with partitioned tables alone and hence make my queries more cost-effective. Also, my query cost per user stays constant irrespective of the number of users in my system. The documentation at https://cloud.google.com/bigquery/streaming-data-into-bigquery says:
To create smaller sets of data by date, use time-partitioned tables. To create smaller tables that are not date-based, use template tables and BigQuery creates the tables for you.
It sounds as if it can either be a time-partitioned table OR a template table. Can it not be both? If not, is there another architecture that I should look into?
One more concern regarding the architecture proposed above is the 4000 limit that I saw on https://cloud.google.com/bigquery/docs/partitioned-tables. Does it mean that my partitioned table can't cover more than 4000 days? Will I have to delete old partitions in this case, or will the last partition keep storing any subsequently streamed data?
You should look into Clustered Tables on partitioned tables.
With that you can have ONE table with all users in it, partitioned by time, and clustered by user_id as you would use in a template table.
Introduction to Clustered Tables
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.
Clustered table pricing
When you create and use clustered tables in BigQuery, your charges are based on how much data is stored in the tables and on the queries you run against the data. Clustered tables help you to reduce query costs by pruning data so it is not processed by the query.
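The block pruning described in the quoted documentation can be pictured with a small toy model: sort the rows by the clustering column, cut them into blocks, remember each block's minimum and maximum key, and skip every block whose range cannot contain the filter value. This is only an illustration of the idea, not BigQuery's actual storage format; the `Block` class, the function names and the block size are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A run of sorted rows plus the key range it covers."""
    min_key: str
    max_key: str
    rows: list


def build_blocks(rows, key, block_size):
    """Sort rows by the clustering key and cut them into fixed-size blocks."""
    ordered = sorted(rows, key=lambda r: r[key])
    blocks = []
    for start in range(0, len(ordered), block_size):
        chunk = ordered[start:start + block_size]
        blocks.append(Block(chunk[0][key], chunk[-1][key], chunk))
    return blocks


def query_equals(blocks, key, value):
    """Return the matching rows and the number of blocks actually scanned."""
    matches, scanned = [], 0
    for block in blocks:
        if value < block.min_key or value > block.max_key:
            continue  # Pruned: this block cannot contain the value.
        scanned += 1
        matches.extend(r for r in block.rows if r[key] == value)
    return matches, scanned


if __name__ == "__main__":
    users = ["d", "a", "c", "b", "a", "d", "b", "c"]
    rows = [{"user_id": u, "n": i} for i, u in enumerate(users)]
    blocks = build_blocks(rows, "user_id", block_size=2)
    matches, scanned = query_equals(blocks, "user_id", "c")
    print(len(matches), "rows found,", scanned, "of", len(blocks), "blocks scanned")
```

In the one-table design suggested above, a filter on user_id would play the role of `value` here: the fewer blocks that overlap it, the less data the query has to process.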
I have a table with about 20-25 million records that I have to copy into another table, filtered on some condition and sorted. Example:
Create table X AS
select * from Y
where item <> 'ABC'
Order By id;
I know that ORDER BY uses a single reducer to guarantee total order in the output.
I need an optimized way to do the sorting for the query above.
SQL tables represent unordered sets. This is especially true in parallel databases where the data is spread among multiple processors.
That said, Hive does support clustered indexes (which essentially define partitions) and sorting within the partitions. The documentation is quite specific, though, that this is not supported with CREATE TABLE AS:
CTAS has these restrictions:
The target table cannot be a partitioned table.
You could do what you want by exporting the data and re-importing it.
However, I would suggest that you figure out what you really need without requiring the data to be ordered within the database.
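To illustrate the difference between a total order and sorting within partitions, here is a toy model in Python. It is not how Hive executes anything; the row layout, the modulo-based distribution and the function names are assumptions chosen to keep the sketch deterministic. It applies the same filter as the query in the question and then sorts each partition independently.

```python
def filter_rows(rows):
    """Keep the rows the example query keeps: item <> 'ABC'."""
    return [r for r in rows if r["item"] != "ABC"]


def distribute(rows, num_partitions):
    """Spread rows over partitions by id modulo (a stand-in for hashing)."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[row["id"] % num_partitions].append(row)
    return parts


def sort_within_partitions(parts):
    """Sort each partition on its own; these sorts could run in parallel."""
    return [sorted(p, key=lambda r: r["id"]) for p in parts]


def is_sorted(rows):
    """True if the ids appear in non-decreasing order."""
    ids = [r["id"] for r in rows]
    return all(a <= b for a, b in zip(ids, ids[1:]))


if __name__ == "__main__":
    data = [
        {"id": 5, "item": "X"}, {"id": 2, "item": "Y"}, {"id": 9, "item": "X"},
        {"id": 3, "item": "ABC"}, {"id": 4, "item": "Z"}, {"id": 7, "item": "Y"},
        {"id": 1, "item": "X"},
    ]
    parts = sort_within_partitions(distribute(filter_rows(data), 2))
    combined = [row for part in parts for row in part]
    print([is_sorted(p) for p in parts], is_sorted(combined))
```

Each partition comes out sorted, but the partitions concatenated are not in total order. That is the trade-off: per-partition sorting spreads the work, while a guaranteed total order needs a final step that sees all the rows.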