Sort field in Hive - SQL

I have a table with about 20-25 million records that I need to copy into another table, filtered on a condition and also sorted. Example:
CREATE TABLE X AS
SELECT * FROM Y
WHERE item <> 'ABC'
ORDER BY id;
I know that ORDER BY uses a single reducer to guarantee total order in the output.
I need an optimized way to do the sorting for the above query.

SQL tables represent unordered sets. This is especially true in parallel databases where the data is spread among multiple processors.
That said, Hive does support bucketed tables (CLUSTERED BY, which essentially partitions the data) and sorting within the buckets (SORTED BY). The documentation is quite specific, though, that this is not supported with CREATE TABLE AS:
CTAS has these restrictions:
The target table cannot be a partitioned table.
You could do what you want by exporting the data and re-importing it.
However, I would suggest that you figure out what you really need without requiring the data to be ordered within the database.
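If you do need the stored data bucketed and sorted, the export/re-import mentioned above usually means creating the target table first and loading it with a separate INSERT, so the CTAS restriction does not apply. A minimal sketch, assuming illustrative column names and an arbitrary bucket count of 32:
CREATE TABLE X (id BIGINT, item STRING)  -- remaining columns omitted
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;

-- Older Hive releases need these flags so the insert honours the bucket/sort spec:
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

INSERT OVERWRITE TABLE X
SELECT id, item
FROM Y
WHERE item <> 'ABC';
Note that this sorts the data within each bucket, not globally; the single ORDER BY reducer remains the only way to guarantee a total order.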

Related

SQL for ordering the actual data in the database table

Fairly new to SQL. You can order records returned with something like:
SELECT * FROM MyTable ORDER BY SalesPrice;
Without physically extracting all records, clearing the table, and then re-writing them back in some customised order, is there a way to ask the database driver to re-order the physical data in the table by some important column or key (e.g. "SalesPrice")?
This would not be desirable for tables with many records or where the data changes regularly, but it may make more sense as a maintenance action where the table is relatively static.
Using DAO and ODBC, though I'm hoping for a general answer that might apply to a variety of databases.
SQL tables represent unordered sets. Period. Well, there is one nuance to that: clustered indexes. With one, you can arrange to have the data stored in a particular order. However, that does not guarantee that SELECT * FROM table will return the results in that order.
If you are concerned about the performance of such a query, you want an index on (SalesPrice) and then ORDER BY SalesPrice in the query.
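A minimal sketch of that combination (the index name is illustrative, and exact DDL varies slightly by product):
CREATE INDEX idx_mytable_salesprice ON MyTable (SalesPrice);

-- With the index in place, the optimiser may satisfy the ORDER BY
-- by reading the index instead of performing a separate sort:
SELECT * FROM MyTable ORDER BY SalesPrice;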

Querying the entire table VS Querying each partition of the table

Let's say I have a table with multiple partitions and I need to query something from the entire table. Is there a difference, from a performance point of view, between running a single SQL query on the entire table and running one query per partition?
Later edit: I'm using Postgres.
In Microsoft SQL Server, when you create a partition function for partitioning a table, that function partitions the data and routes queries to the relevant data file.
For example, if your partition function is defined on a datetime field and partitions the data yearly, a query runs against only the single data file that holds the data matching your WHERE clause.
Therefore you don't need to split your query; the SQL Server engine does that automatically.
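A minimal sketch of that setup in T-SQL, with illustrative names and boundary dates:
CREATE PARTITION FUNCTION pf_yearly (datetime)
AS RANGE RIGHT FOR VALUES ('20210101', '20220101', '20230101');

CREATE PARTITION SCHEME ps_yearly
AS PARTITION pf_yearly ALL TO ([PRIMARY]);

CREATE TABLE Orders (
    OrderId   int,
    OrderDate datetime
) ON ps_yearly (OrderDate);

-- Partition elimination: only the 2022 partition is read.
SELECT * FROM Orders
WHERE OrderDate >= '20220101' AND OrderDate < '20230101';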
It depends on what your intention is.
If you already have a partitioned table and are deciding what the best strategy to retrieve all rows is, then running a query against the partitioned table is almost certainly the faster solution.
Retrieval of all partitions will most likely be parallelized (depending on your configuration of parallel query). If you query each partition manually, you would need to implement that yourself e.g. creating multiple connections with each one running a query against one partition.
However, if your intention is to decide whether it makes sense to partition a table, then the answer isn't so straightforward. If you have to query all rows of the table very often, then this is usually (slightly) slower than querying a single non-partitioned table. If that is the exception and you almost always run queries that target a single partition, then partitioning does make sense.
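Since the question mentions Postgres, here is a minimal sketch of declarative partitioning there (table and column names are illustrative):
CREATE TABLE measurements (
    id      bigint,
    logdate date NOT NULL
) PARTITION BY RANGE (logdate);

CREATE TABLE measurements_2023 PARTITION OF measurements
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

-- One statement against the parent covers every partition; Postgres can
-- scan them in parallel and prunes partitions when the WHERE clause allows:
SELECT count(*) FROM measurements WHERE logdate >= '2023-01-01';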

How to use time partitioned tables with template tables and beyond 4000 limit for BigQuery?

For streaming inserts, I want to use a template table (with a user id suffix) which is itself a partitioned table. This way I can make my tables smaller than with partitioned tables alone, and hence make my queries more cost-effective. Also, my query cost per user stays constant irrespective of the number of users in my system. As per the documentation at https://cloud.google.com/bigquery/streaming-data-into-bigquery:
To create smaller sets of data by date, use time-partitioned tables. To create smaller tables that are not date-based, use template tables and BigQuery creates the tables for you.
It sounds as if it can either be a time-partitioned table OR a template table. Can it not be both? If not, is there another architecture that I should look into?
One more concern regarding my above proposed architecture is the 4000 limit that I saw on https://cloud.google.com/bigquery/docs/partitioned-tables . Does it mean that my partitioned table can't cover more than 4000 days? Will I have to delete old partitions in this case or will the last partition keep storing any subsequent streamed data?
You should look into Clustered Tables on partitioned tables.
With that you can have ONE table with all users in it, partitioned by time, and clustered by user_id as you would use in a template table.
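A minimal sketch of such a table in BigQuery DDL (dataset, table, and column names are illustrative):
CREATE TABLE mydataset.events (
    user_id    STRING,
    event_time TIMESTAMP,
    payload    STRING
)
PARTITION BY DATE(event_time)
CLUSTER BY user_id;

-- Filtering on the partition and clustering columns lets BigQuery prune
-- both partitions and clustered blocks, so per-user query cost stays low:
SELECT *
FROM mydataset.events
WHERE DATE(event_time) = '2023-06-01'
  AND user_id = 'user_123';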
Introduction to Clustered Tables
When you create a clustered table in BigQuery, the table data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of columns you specify is important. The order of the specified columns determines the sort order of the data.
Clustering can improve the performance of certain types of queries such as queries that use filter clauses and queries that aggregate data. When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query containing a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Similarly, when you submit a query that aggregates data based on the values in the clustering columns, performance is improved because the sorted blocks colocate rows with similar values.
Clustered table pricing
When you create and use clustered tables in BigQuery, your charges are based on how much data is stored in the tables and on the queries you run against the data. Clustered tables help you to reduce query costs by pruning data so it is not processed by the query.

Slow query performance with Partitioned tables?

I was reading articles about partitioned tables and got confused as to whether partitioning is a boon or a bane. I do understand that partitioning is meant for large datasets, but here is my confusion:
Let's assume that there is a table:
Orders(OrderId, CustId, OrderDate, ShipperId)
and it has a huge amount of data, easily enough to justify partitioning. SELECT queries are run against every column of this table, and many queries join it to other tables.
If I partition the table on the basis of OrderId, will queries based on other columns become slow?
Will join queries involving columns other than OrderId become slow?
Will appreciate any guidance!! Thanks
Imagine you have two tables with the same schema and the same data. Both are clustered on OrderID. One of these tables is also partitioned by OrderID. Sometimes access is keyed by OrderID and sometimes not.
Lookups for a single OrderID may be faster against the partitioned table if you have sufficient data to force multiple levels in your index BTree. This is because there is one BTree per partition. Lookups for a range of OrderIDs will, in general, be faster because of partition elimination - SQL Server will only access those partitions needed to satisfy the query.
Lookups or scans on other keys will be no different.
Partitioning also allows swap in and swap out of a whole partition which can save hours in a daily load / delete cycle.
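The swap in / swap out referred to here is SQL Server partition switching, which is a metadata-only operation. A minimal sketch with illustrative table names and partition numbers:
-- Swap a pre-loaded staging table in as partition 5 of Orders
-- (columns, indexes and filegroup must match):
ALTER TABLE OrdersStaging SWITCH TO Orders PARTITION 5;

-- Swap the oldest partition out to an archive table for later cleanup:
ALTER TABLE Orders SWITCH PARTITION 1 TO OrdersArchive;
Because only metadata changes hands, both statements complete almost instantly regardless of row counts.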

How do I manage large data set spanning multiple tables? UNIONs vs. Big Tables?

I have an aggregate data set that spans multiple years, with the data for each year stored in a separate table. The data is currently sitting in MS Access tables, and I will be migrating it to SQL Server.
I would prefer that data for each year is kept in separate tables, to be merged and queried at runtime. I do not want to do this at the expense of efficiency, however, as each year is approx. 1.5M records of 40ish fields.
I am trying to avoid having to do an excessive number of UNIONS in the query. I would also like to avoid having to edit the query as each new year is added, leading to an ever-expanding number of UNIONs.
Is there an easy way to do these UNIONs at runtime without an extensive SQL query and high system utility? Or, if all the data should be managed in one large table, is there a quick and easy way to append all the tables together in a single query?
If you really want to store them in separate tables, then I would create a view that does that unioning for you.
create view AllData
as
(
select * from Data2001
union all
select * from Data2002
union all
select * from Data2003
)
But to be honest, if you use this, why not put all the data into one table? Then, if you wanted, you could create the views the other way around:
create view Data2001
as
(
select * from AllData
where CreateDate >= '1/1/2001'
and CreateDate < '1/1/2002'
)
A single table is likely the best choice for this type of query. However, you have to balance that against the other work the database is doing.
One choice you did not mention is creating a view that contains the unions and then querying the view. That way you only have to add a union statement to the view each year, and all queries using the view will stay correct. Personally, if I did that, I would write a creation query that creates the new table and then adjusts the view to add the union for that table, as sketched below. Once it was tested and I knew it would run, I would schedule it as a job to run on the last day of the year.
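A minimal sketch of such a yearly job, reusing the AllData view above and an illustrative new table Data2004:
-- Create next year's table with the same columns as the existing ones
-- (indexes and constraints must be added separately):
SELECT * INTO Data2004 FROM Data2003 WHERE 1 = 0;
GO

-- Extend the view so every query using it picks up the new table:
ALTER VIEW AllData
AS
select * from Data2001
union all
select * from Data2002
union all
select * from Data2003
union all
select * from Data2004;
GO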
One way to do this is by using horizontal partitioning.
You basically create a partition function that tells the DBMS to keep separate storage for each period, with a constraint on each part telling the DBMS that it will only hold data for a specific year.
At query execution time, the optimiser can decide whether it is possible to completely ignore one or more partitions to speed up execution time.
The setup overhead of such a scheme is non-trivial, and it only really makes sense if you have a lot of data. Although 1.5 million rows per year might seem a lot, depending on your query plans it shouldn't be any big deal (for a decently specced SQL Server). Refer to the documentation.
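The per-period constraint described above also works without the partitioning feature, as a classic partitioned view: give each yearly table a CHECK constraint, and the optimiser can skip tables whose constraint contradicts the WHERE clause. A minimal sketch, reusing the illustrative Data2003 name:
CREATE TABLE Data2003 (
    CreateDate datetime NOT NULL
        CHECK (CreateDate >= '20030101' AND CreateDate < '20040101')
    -- ... remaining columns ...
);

-- Against the unioning AllData view, only Data2003 is actually read:
SELECT * FROM AllData
WHERE CreateDate >= '20030101' AND CreateDate < '20040101';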
I can't add comments due to low rep, but I definitely agree with one table; partitioning is helpful for large data sets and is supported in SQL Server, where the data is being migrated to.
If the data is heavily used and frequently updated, then monthly partitioning might be useful; but if not, given the size, partitioning probably isn't going to be very helpful.