Benefit of $PARTITION? - sql

I don't clearly understand the purpose of using $PARTITION in SQL Server. I have read the content on MSDN but still don't understand it.
What is the benefit of using it?

Partitioning a table in a database means splitting it into multiple pieces while still treating it like a single table. We are talking here about horizontal partitioning, which means that each partition contains all of the columns but only some of the rows. To do this, you define a partition function, which determines which partition a row belongs to based on that row's value in the partitioning column. $PARTITION tells you which partition a row belongs to.
Let's say we had an integer column that represented the rating of our rows and we expect that users will often ask only for items rated above 4. We could partition based on rating. Our partition function could be:
CREATE PARTITION FUNCTION partitionByRating (int) AS RANGE LEFT FOR VALUES (4);
Then, to find out which partition the rows of our table belong to, we could query:
SELECT $partition.partitionByRating(i.Rating) AS [Partition Number],
rating AS [Rating],
name AS [Item Name]
FROM dbo.Items AS i
And we might get a result like:
Partition Number  Rating  Item Name
1                 1       Ball
1                 4       Scooter
2                 5       Blaster
Which means that Ball and Scooter will be stored together, but depending on how you assign partitions to storage, Blaster might be stored somewhere else.
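To make the mapping concrete, here is a minimal Python sketch of what a range partition function computes. It is not SQL Server code: the function name partition_number is made up for illustration, and it assumes the boundary value belongs to the lower partition (SQL Server's RANGE LEFT behavior, which matches the result above where rating 4 lands in partition 1).

```python
from bisect import bisect_left, bisect_right


def partition_number(value, boundaries, range_left=True):
    """Return the 1-based partition number for value.

    boundaries must be sorted ascending. With range_left=True a boundary
    value belongs to the partition on its left (values <= boundary);
    otherwise it belongs to the partition on its right.
    """
    find = bisect_left if range_left else bisect_right
    return find(boundaries, value) + 1


# Rows mirroring the result table above: (rating, item name).
items = [(1, "Ball"), (4, "Scooter"), (5, "Blaster")]
for rating, name in items:
    print(partition_number(rating, [4]), rating, name)
```

With a single boundary there are two partitions; n boundaries always give n + 1 partitions. Passing range_left=False would move rating 4 into partition 2.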

You can use this function to determine the partition number that will be used for a certain value. For example, with a table partitioned by date, this will tell you which partition will be used for July 1, 2001, even before you insert the value into the table:
SELECT $PARTITION.PF_RangeByYear('20010701')
You can also use it to find the row count in each partition number:
SELECT $partition.PF_RangeByYear(orderdate) AS partition#,
COUNT(*) AS cnt
FROM Orders
GROUP BY $partition.PF_RangeByYear(orderdate);
Arguably you could get that much more simply from sys.partitions or sys.dm_db_partition_stats.
(Both of these examples are taken from Itzik's SQL Mag article.)
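The two uses above (looking up the partition for a value before inserting it, and counting rows per partition) can be imitated in a few lines of Python. This is only a sketch: the yearly boundaries, the convention that each boundary date starts a new partition, and the order dates are all assumptions, since the definition of PF_RangeByYear is not shown here.

```python
from bisect import bisect_right
from collections import Counter
from datetime import date

# Hypothetical yearly boundaries; each boundary date starts a new partition,
# so partition 1 holds everything before 2001-01-01.
BOUNDARIES = [date(2001, 1, 1), date(2002, 1, 1), date(2003, 1, 1)]


def partition_by_year(d):
    """Return the 1-based partition number for a date."""
    return bisect_right(BOUNDARIES, d) + 1


# Which partition would July 1, 2001 use? No table access is needed.
print(partition_by_year(date(2001, 7, 1)))

# Made-up order dates standing in for the Orders table.
order_dates = [
    date(2000, 5, 3),
    date(2001, 7, 1),
    date(2001, 12, 31),
    date(2003, 2, 14),
]

# Equivalent of GROUP BY $partition.PF_RangeByYear(orderdate) with COUNT(*).
counts = Counter(partition_by_year(d) for d in order_dates)
print(sorted(counts.items()))
```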

Related

PostgreSQL Sequence Ascending Out of Order

I'm having an issue with sequences when inserting data into a Postgres table through SQLAlchemy.
All of the data is inserted fine, and the id BIGSERIAL PRIMARY KEY column has all unique values, which is great.
However, when I query the first 10/20 rows etc. of the table, the id values are not in ascending numeric order. There are gaps in the sequence, which is fine and to be expected; what I mean is that the rows come back with values in seemingly random order rather than ascending, like:
id
15
22
16
833
30
etc...
I've gone through plenty of SO and Postgres forum posts around this and have only found people talking about having huge serial gaps in their sequences, not about incorrect ascending order when being created.
The table itself was created through a standard DDL statement like so:
CREATE TABLE IF NOT EXISTS schema.table_name (
    id BIGSERIAL NOT NULL,
    col1 text NOT NULL,
    col2 JSONB[] NOT NULL,
    etc....
    PRIMARY KEY (id)
);
"However when I query the first 10/20 rows etc. of the table"
Your query has no ORDER BY clause, so you are not selecting the first rows of the table, just an undefined set of rows.
Use ORDER BY and you will find that sequence numbers are indeed assigned in ascending order (potentially with gaps):
select id from ht_data order by id limit 30
To actually check the order in which the sequence values were assigned, you would need another column that stores the timestamp when each row was created. You could then do:
select id from ht_data order by ts limit 30
In general, there is no defined "order" within a SQL table. If you want to view your data in a certain order, you need an ORDER BY clause:
SELECT *
FROM table_name
ORDER BY id;
As for gaps in the sequence, the contract of an auto-increment column generally only guarantees that each newly generated id value will be unique and, most of the time (but not necessarily always), will be increasing.
How could you possibly know if the values are "out of order"? SQL tables represent unordered sets. The only indication of ordering in your table is the serial value.
The query that you are running has no ORDER BY. The results are not guaranteed to be in any particular order. Period. That is a very simple fact about SQL. That you want the results of a SELECT to be ordered by the primary key or by insertion order is nice, but it is not how databases work.
The only way you could determine whether something were out of order would be if you had a column that separately specified the insert order. You could have a creation timestamp, for instance.
All you have discovered is that SQL lives up to its promise of not guaranteeing ordering unless the query specifically asks for it.
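The check suggested in the answers (order by id, order by a creation timestamp, and compare) can be sketched with Python's built-in sqlite3 module. SQLite stands in for Postgres here purely so the example is self-contained; the ts column and all row values are made up, and the rows are deliberately inserted in a shuffled order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ht_data (id INTEGER PRIMARY KEY, ts TEXT NOT NULL)")

# Made-up rows: ids with gaps, each with the timestamp at which it was created.
rows = [
    (22, "2020-06-01T10:00:02"),
    (15, "2020-06-01T10:00:00"),
    (833, "2020-06-01T10:00:04"),
    (16, "2020-06-01T10:00:01"),
    (30, "2020-06-01T10:00:03"),
]
conn.executemany("INSERT INTO ht_data (id, ts) VALUES (?, ?)", rows)

# Only an explicit ORDER BY defines which rows count as "first".
by_id = [r[0] for r in conn.execute("SELECT id FROM ht_data ORDER BY id LIMIT 30")]
by_ts = [r[0] for r in conn.execute("SELECT id FROM ht_data ORDER BY ts LIMIT 30")]

print(by_id)
print(by_ts)
# If ids were handed out in creation order, both orderings agree.
print(by_id == by_ts)
```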

Understanding Hive table creation notation

I have come across Hive tables which I need to convert to their Redshift/MySQL equivalents.
I am having trouble understanding Hive query structure and would appreciate some help:
CREATE TABLE IF NOT EXISTS table_1 (
id BIGINT,
price DOUBLE,
asset string
)
PARTITIONED BY (
pt STRING
);
ALTER TABLE table_1 DROP IF EXISTS PARTITION (pt== '${yyyymmdd}');
INSERT OVERWRITE TABLE table_1 PARTITION (pt= '${yyyymmdd}')
select aa.id,aa.price,aa.symbol from
...
...
from
table_2 table
I am having trouble understanding the PARTITIONED BY clause. This, if I am understanding correctly, is different from MySQL table partitions, and is a Hive specific dynamic partition.
The partition does not define a column or a key, and partitions by the current date.
Does this mean that table_1 is partitioned by the date? Each day has a separate partition?
Then later on in the code there are notations similar to
inner join table_new table on table.pt = '${yyyymmdd}' and ...
In this context, does it mean only rows inserted on yyyymmdd are selected for the join?
Thank you.
By default, a partition in Hive is a folder in HDFS named key=value, plus metadata in the Hive metastore. You can alter a partition's location and create a partition on top of any folder.
PARTITIONED BY (pt STRING) defines a partition column pt of type string, not date. The pt column is not present in the table's data files; it is defined only in PARTITIONED BY, and all partition values are stored in the metadata. If you load a partition dynamically, the partition folder is created with the name pt=value.
This statement creates partitions dynamically:
INSERT OVERWRITE TABLE table_1 PARTITION (pt)
select id, price, symbol,
coln as pt --partition column should be the last one
from ...
And this statement loads a single STATIC partition:
INSERT OVERWRITE TABLE table_1 PARTITION (pt= '${yyyymmdd}')
select aa.id,aa.price,aa.symbol
from
No partition column is selected; the partition value is specified in the PARTITION (pt= '${yyyymmdd}') clause.
'${yyyymmdd}' here is a parameter with name yyyymmdd which is passed to the script using --hivevar like this:
hive --hivevar yyyymmdd=20200604 -f myscript.sql
You can pass ANY string as the partition value in this case, though the parameter name yyyymmdd suggests its format.
BTW, the date format in Hive is 'yyyy-MM-dd'. Strings in 'yyyy-MM-dd' format can be implicitly converted to DATE.
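The '${yyyymmdd}' placeholder is plain text substitution: the value is pasted into the script before the statement runs. A rough Python imitation using string.Template (which happens to share the ${name} syntax) shows the effect. This only mimics the substitution step and is not how Hive implements it; the second value is an arbitrary made-up string.

```python
from string import Template

# One statement from the script, with the ${yyyymmdd} placeholder left in.
statement = Template(
    "ALTER TABLE table_1 DROP IF EXISTS PARTITION (pt='${yyyymmdd}');"
)

# Roughly what --hivevar yyyymmdd=20200604 does to the script text.
print(statement.substitute(yyyymmdd="20200604"))

# Nothing enforces a date: any string becomes the partition value.
print(statement.substitute(yyyymmdd="not-a-date"))
```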
I will try to explain in one shot what partitioning in Hive is. First of all:
WHEN TO USE TABLE PARTITIONING
Table partitioning is good when:
Reading the entire dataset takes too long
Queries almost always filter on the partition columns
There are a reasonable number of different values for partition columns
Data generation or ETL process splits data by file or directory names
Partition column values are not in the data itself
Don't partition on columns with many unique values
Example: Partitioning customers by first name
CREATING PARTITIONED TABLES
To create a partitioned table, use the PARTITIONED BY clause in the CREATE TABLE statement.
The names and types of the partition columns must be specified
in the PARTITIONED BY clause, and only in the PARTITIONED BY clause.
They must not also appear in the list of all the other columns.
CREATE TABLE customers_by_country
(cust_id STRING, name STRING)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
The example CREATE TABLE statement shown above creates the table customers_by_country,
which is partitioned by the STRING column named country.
Notice that the country column appears only in the PARTITIONED BY clause,
and not in the column list above it.
This example specifies only one partition column, but you can specify more than one by using
a comma-separated column list in the PARTITIONED BY clause.
Aside from these specific differences, this CREATE TABLE statement is the same
as the statement used to create an equivalent non-partitioned table.
Table partitioning is implemented in a way that is mostly transparent
to a user issuing queries with Hive.
A partition column is what’s known as a virtual column, because its values are not stored within the data files.
Following is the result of the DESCRIBE command on customers_by_country;
it displays the partition column country just as if it were a normal column within the table.
name     type    comment
cust_id  string
name     string
country  string
You can refer to partition columns in any of the usual clauses of a SELECT statement.
You can load data into partitioned tables dynamically or statically.
LOADING DATA WITH DYNAMIC PARTITION
One way to load data into a partitioned table is to use dynamic partitioning,
which automatically defines partitions when you load the data, using the values in the partition column.
(The other way is to manually define the partitions with static partitioning.)
To use dynamic partitioning, you must load data using an INSERT statement.
In the INSERT statement, you must use the PARTITION clause to list the partition columns.
The data you are inserting must include values for the partition columns.
The partition columns must be the rightmost columns in the data you are inserting,
and they must be in the same order as they appear in the PARTITION clause.
INSERT OVERWRITE TABLE customers_by_country
PARTITION(country)
SELECT cust_id, name, country FROM customers;
The example shown above uses an INSERT … SELECT statement
to load data into the customers_by_country table with dynamic partitioning.
Notice that the partition column, country, is included
in the PARTITION clause and is specified last in the SELECT list.
When Hive executes this statement, it automatically creates partitions
for the country column and loads the data into these partitions based on the values in the country column.
The resulting data files in the partition subdirectories do not include values for the country column.
Since the country is known based on which subdirectory a data file is in,
it would be redundant to include country values in the data files as well.
Look at the contents of the customers_by_country directory.
It should now have one subdirectory for each value in the country column.
Look at the file in one of those directories.
Notice that the file contains the row for the customer from that country,
and no others; notice also that the country value is not included.
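The directory layout described above can be sketched in plain Python. This is an illustration of the idea only, not Hive code: the function name dynamic_partition and the customer rows are made up, and real Hive writes data files under these directories rather than returning a dictionary.

```python
from collections import defaultdict


def dynamic_partition(rows, columns, partition_col, table_dir):
    """Group rows by partition value.

    Returns {partition directory: rows with the partition column removed},
    mirroring how the country value lives in the directory name only.
    """
    idx = columns.index(partition_col)
    layout = defaultdict(list)
    for row in rows:
        directory = f"{table_dir}/{partition_col}={row[idx]}"
        layout[directory].append(row[:idx] + row[idx + 1:])
    return dict(layout)


# Made-up customers: (cust_id, name, country).
customers = [("a1", "Alice", "us"), ("b2", "Bilal", "pk"), ("c3", "Chen", "us")]
layout = dynamic_partition(
    customers, ("cust_id", "name", "country"), "country", "customers_by_country"
)
for directory, stored_rows in sorted(layout.items()):
    print(directory, stored_rows)
```

Each distinct country produces one subdirectory, which is why a column with many distinct values produces many small partitions.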
Note: Hive includes a safety feature that prevents users
from accidentally creating or overwriting a large number of partitions.
(See “Risks of Using Partitioning” for more about this.)
By default, Hive sets the property hive.exec.dynamic.partition.mode to strict.
In strict mode at least one partition column must be given a static value, so a fully dynamic insert like the one above is rejected, though you can still use static partitions.
You can disable this safety feature in Hive by setting
the property hive.exec.dynamic.partition.mode to nonstrict:
SET hive.exec.dynamic.partition.mode=nonstrict;
Then you can use the INSERT statement to load the data dynamically.
Hive properties set in Beeline are for the current session only,
so the next time you start a Hive session this property will be set back to strict.
But you or your system administrator can configure properties permanently, if necessary.
When you run some SELECT queries on the partitioned table, if the table is big enough you may notice a significant difference in the time they take to run.
Notice that you will not query the table any differently than you would query the customers table.
LOADING DATA WITH STATIC PARTITIONING
The other way to load data into a partitioned table is to use static partitioning,
in which you manually define the different partitions.
With static partitioning, you create a partition manually, using an ALTER TABLE … ADD PARTITION statement,
and then load the data into the partition.
For example, this ALTER TABLE statement creates the partition for Pakistan (pk):
ALTER TABLE customers_by_country
ADD PARTITION (country='pk');
Notice how the partition column name, which is country, and the specific value that defines this partition,
which is pk, are both specified in the ADD PARTITION clause.
This creates a partition directory named country=pk inside the customers_by_country table directory.
After the partition for Pakistan is created, you can add data into the partition using an INSERT … SELECT statement:
INSERT OVERWRITE TABLE customers_by_country
PARTITION(country='pk')
SELECT cust_id, name FROM customers WHERE country='pk';
Notice how in the PARTITION clause, the partition column name, which is country,
and the specific value, which is pk, are both specified, just like in the ADD PARTITION command used to create the partition.
Also notice that in the SELECT statement, the partition column is not included in the SELECT list.
Finally, notice that the WHERE clause in the SELECT statement selects only customers from Pakistan.
With static partitioning, you need to repeat these two steps for each partition:
first create the partition, then add data.
You can actually use any method to load the data; you need not use an INSERT statement.
You could instead use hdfs dfs commands or a LOAD DATA INPATH command.
But however you load the data, it’s your responsibility to ensure that data is stored in the correct partition subdirectories.
For example, data for customers in Pakistan must be stored in the Pakistan partition subdirectory,
and data for customers in other countries must be stored in those countries’ partition subdirectories.
Static partitioning is most useful when the data being loaded
into the table is already divided into files based on the partition column,
or when the data grows in a manner that coincides with the partition column:
For example, suppose your company opens a new store in a different country,
like New Zealand ('nz'), and you're given a file of data for new customers, all from that country.
You could easily add a new partition and load that file into it.
RISKS OF USING PARTITIONING
A major risk when using partitioning is creating partitions that lead you into the small files problem.
When this happens, partitioning a table will actually worsen query performance
(the opposite of the goal when using partitioning) because it causes too many small files to be created.
This is more likely when using dynamic partitioning, but it could still
happen with static partitioning—for example if you added a new partition to a sales table
on a daily basis containing the sales from the previous day,
and each day’s data is not particularly big.
When choosing your partitions, you want to strike a happy balance between too many partitions
(causing the small files problem) and too few partitions (providing little performance benefit).
The partition column or columns should have a reasonable number of values
for the partitions—but what you should consider reasonable is difficult to quantify.
Using dynamic partitioning is particularly dangerous because if you're not careful,
it's easy to partition on a column with too many distinct values.
Imagine a use case where you are often looking for data that falls within
a time frame that you would specify in your query.
You might think that it's a good idea to partition on a column that pertains to time.
But a TIMESTAMP column could have the time to the nanosecond, so every row could have a unique value;
that would be a terrible choice for a partition column! Even to the minute or hour could create
far too many partitions, depending on the nature of your data;
partitioning by larger time units like day, month, or even year might be a better choice.
As another example, consider an employees table.
This has five columns: empl_id, first_name, last_name, salary, and office_id.
Before reading on, think for a moment: which of these might be reasonable for partitioning?
The column empl_id is a unique identifier.
If that were your partition column, you would have a separate partition for each employee,
and each would have exactly one row.
In addition, it's not likely you'll be doing a lot of queries looking for a particular value,
or even a particular range of values. This is a poor choice.
The column first_name will not have one partition per employee, but there will likely be many partitions that have only one row.
This is also true for last_name.
Also, as with empl_id, it's not likely you'll need to filter queries based on these columns. These are also poor choices.
The column salary also will have many divisions
(and even more so if your salaries go to the cent rather than to the dollar as our sample table does).
While it may be that you'll sometimes want to query on salary ranges,
it's not likely you'll want to use individual salaries.
So salary is a poor choice.
A more limited set of values, like the grades in the salary_grades table,
might be reasonable if your use case involves looking at the data by salary grade frequently.
The office_id column identifies the office where an employee works.
This will have a much smaller number of unique values, even if you have a large company with offices in many cities.
It's imaginable that your use case might be to frequently filter
your employee data based on office location, too. So this would be a good choice.
You also can use multiple columns and create nested partitions.
For example, a dataset of customers might include country and state_or_province columns.
You can partition by country and then partition those further by state_or_province, so customers from Ontario,
Canada would be in the country=ca/state_or_province=on/ partition directory.
This can be extremely helpful for large amounts of data that you want to access either by country or by state or province.
However, using multiple columns increases the danger of creating too many partitions, so you must take extra care when doing so.
The risk of creating too many partitions is why Hive includes the property
hive.exec.dynamic.partition.mode, set to strict by default, which must be reset to nonstrict before you can load data with fully dynamic partitioning.
Rather than automatically and mechanically resetting that property when you're about to load data dynamically,
take it as an opportunity to think about the partitioning columns
and maybe check the number of unique values you would get when you load the data.
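One cheap way to do that check is to count distinct values per candidate column on a sample before loading. The sketch below does this in Python for an employees table with the five columns discussed above; the rows are invented, and on real data you would more likely run a COUNT(DISTINCT ...) query instead.

```python
def distinct_counts(rows, columns):
    """Return {column name: number of distinct values in that column}."""
    return {col: len({row[i] for row in rows}) for i, col in enumerate(columns)}


columns = ("empl_id", "first_name", "last_name", "salary", "office_id")

# Made-up sample of employees.
employees = [
    (1, "Ana", "Diaz", 52000, "a"),
    (2, "Ben", "Evans", 61000, "a"),
    (3, "Cho", "Diaz", 58000, "b"),
    (4, "Dev", "Ford", 52000, "b"),
    (5, "Ana", "Gray", 70000, "a"),
]

counts = distinct_counts(employees, columns)
for col in columns:
    # Each distinct value would become one partition directory.
    print(col, counts[col])
```

Even in this tiny sample, empl_id yields one partition per row while office_id yields only two.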
And that's all.

Row Stores vs Column Stores

Assuming that the database is already populated with data, and that each of the following SQL statements is the one and only query that an application will perform, why is it better to use row-wise or column-wise record storage for the following queries?...
1) SELECT * FROM Person
2) SELECT * FROM Person WHERE id=5
3) SELECT AVG(YEAR(DateOfBirth)) FROM Person
4) INSERT INTO Person (ID,DateOfBirth,Name,Surname) VALUES(2e25,'1990-05-01','Ute','Muller')
In those examples Person.id is the primary key.
The article Row Store and Column Store Databases gives a general discussion on this, but I am specifically concerned about the four queries above.
SELECT * FROM ... queries are better suited to row stores, since a column store has to access numerous files (one per column) to reassemble each row.
A column store is good for aggregation over a large volume of data, or when you have queries that only need a few fields from a wide table.
Therefore:
1st query: row-wise
2nd query: row-wise
3rd query: column-wise
4th query: row-wise
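The reasoning behind that list can be seen in a toy model of the two layouts. This Python sketch uses made-up people and in-memory lists in place of disk files, so it only illustrates which data each query has to touch, not real performance.

```python
# Row store: each record keeps all of its fields together.
rows = [
    {"id": 4, "date_of_birth": "1985-03-12", "name": "Jan", "surname": "Kowalski"},
    {"id": 5, "date_of_birth": "1990-05-01", "name": "Ute", "surname": "Muller"},
    {"id": 6, "date_of_birth": "1995-11-30", "name": "Li", "surname": "Wei"},
]

# Column store: one list per field; position i across the lists is one person.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# Query 3, AVG(YEAR(DateOfBirth)): the column store reads a single list,
# while the row store would have to walk every full record.
years = [int(d[:4]) for d in columns["date_of_birth"]]
avg_year = sum(years) / len(years)
print(avg_year)

# Query 2, SELECT * WHERE id=5: the row store returns the record in one piece...
person_from_rows = next(r for r in rows if r["id"] == 5)

# ...while the column store must visit every column to reassemble it.
i = columns["id"].index(5)
person_from_columns = {key: values[i] for key, values in columns.items()}
print(person_from_columns == person_from_rows)
```

An INSERT behaves like query 2 in reverse: one append in the row layout, one append per column in the column layout.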
I have no idea what you are asking. You have this statement:
INSERT INTO Person (ID, DateOfBirth, Name, Surname)
VALUES('2e25', '1990-05-01', 'Ute', 'Muller');
This suggests that you have a table with four columns, one of which is an id. Each person is stored in their own row.
You then have three queries. The first cannot be optimized. The second is optimized, assuming that id is a primary key (a reasonable assumption). The third requires a full table scan -- although that could be ameliorated with an index only on DateOfBirth.
If the data is already in this format, why would you want to change it?
This is a very simple data structure. Three of your four query examples access all columns. I see no reason why you would not use a regular row-store table structure.

How to force ID column to remain sequential even if a record has been deleted, in SQL Server?

I don't know the best wording for the question, but I have a table that has 2 columns: ID and NAME.
When I delete a record from the table, its ID value is deleted with it and the sequence is broken.
Take this example:
If I delete row number 2, the sequence of the ID column will be: 1,3,4
How can I make it: 1,2,3?
ID's are meant to be unique for a reason. Consider this scenario:
Customers
id  value
1   John
2   Jackie
Accounts
id  customer_id  balance
1   1            $500
2   2            $1000
In the case of a relational database, say you were to delete "John" from the database and renumber the remaining ids. Now Jackie would take on the customer_id of 1. When Jackie goes in to check her balance, she will now come up $500 short, because customer_id 1 still points at John's $500 account.
Granted, you could go through and update all of her other records, but A) this would be a massive pain in the ass. B) It would be very easy to make mistakes, especially in a large database.
Ids (primary keys in this case) are meant to be the rock that holds your relational database together, and you should always be able to rely on that value regardless of the table.
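The scenario above is easy to reproduce with two small Python structures standing in for the Customers and Accounts tables. This is a sketch of the failure mode only; the helper balance_of is made up, and the accounts are deliberately left untouched when the customers are renumbered.

```python
customers = {1: "John", 2: "Jackie"}
accounts = [
    {"id": 1, "customer_id": 1, "balance": 500},
    {"id": 2, "customer_id": 2, "balance": 1000},
]


def balance_of(name, customers, accounts):
    """Sum the balances of all accounts that reference the named customer."""
    ids = {cid for cid, n in customers.items() if n == name}
    return sum(a["balance"] for a in accounts if a["customer_id"] in ids)


before = balance_of("Jackie", customers, accounts)

# Delete John, then renumber the remaining customers 1..n
# without updating the customer_id values stored in accounts.
remaining = {cid: n for cid, n in customers.items() if n != "John"}
renumbered = {
    new_id: name
    for new_id, (_, name) in enumerate(sorted(remaining.items()), start=1)
}

after = balance_of("Jackie", renumbered, accounts)
print(before, after)
```

Jackie's lookup now resolves to the account that used to belong to John, which is exactly the kind of silent error stable ids prevent.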
As JohnFx pointed out, should you want a value that shows the order of the user, consider using a built in function when querying.
In SQL Server identity columns are not guaranteed to be sequential. You can use the ROW_NUMBER function to generate a sequential list of ids when you query the data from the database:
SELECT
ROW_NUMBER() OVER (ORDER BY Id) AS SequentialId,
Id As UniqueId,
Name
FROM dbo.Details
If you want sequential numbers don't store them in the database. That is just a maintenance nightmare, and I really can't think of a very good reason you'd even want to bother.
Just generate them dynamically using T-SQL's ROW_NUMBER function when you query the data.
The whole point of an Identity column is creating a reliable identifier that you can count on pointing to that row in the DB. If you shift them around you undermine the main reason you WANT an ID.
In a real world example, how would you feel if the IRS wanted to change your SSN every week so they could keep the Social Security Numbers sequential after people died off?

How to renumber a table column

I have a SQLite table sorted by column ID. But I need to sort it by another numerical field called RunTime.
CREATE TABLE Pass_2 AS
SELECT RunTime, PosLevel, PosX, PosY, Speed, ID
FROM Pass_1
The table Pass_2 looks good, but I need to renumber the ID column from 1 .. n without resorting the records.
It is a principle of SQL databases that the underlying tables have no natural or guaranteed order to their records. You must specify the order in which you want to see the records when SELECTing from a table using an ORDER BY clause.
You can obtain the records you want using SELECT * FROM your_table ORDER BY RunTime, and that is the correct and reliable way to do this in any SQL database.
If you want to attempt to get the records in Pass_2 to "be" in RunTime order, you can add the ORDER BY clause to the SELECT you use to create the table but remember: you are not guaranteed to get the records back in the order in which they were added to the table.
When might you get the records back in a different order? This is most likely to happen when your query can be answered using columns in a covering index. In that case the records are more likely to be returned in index order than in any "natural" order (but again, there are no guarantees without an ORDER BY clause).
If you want a new ID column starting at 1, then use the ROW_NUMBER() function. Instead of ID in your query, use ROW_NUMBER() OVER (ORDER BY RunTime) AS ID. This will replace the old ID column with a freshly calculated one.
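The ROW_NUMBER() approach can be tried end to end with Python's built-in sqlite3 module. This sketch assumes the bundled SQLite is version 3.25 or later (window functions were added in that release), trims Pass_1 down to three columns, and uses made-up values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Pass_1 (ID INTEGER, RunTime REAL, Speed REAL)")

# Made-up rows: the IDs follow insertion order, RunTime does not.
conn.executemany(
    "INSERT INTO Pass_1 (ID, RunTime, Speed) VALUES (?, ?, ?)",
    [(1, 9.5, 3.0), (2, 2.0, 4.5), (3, 5.25, 1.0)],
)

# Build Pass_2 with a fresh ID numbered 1..n in RunTime order.
conn.execute(
    """
    CREATE TABLE Pass_2 AS
    SELECT ROW_NUMBER() OVER (ORDER BY RunTime) AS ID, RunTime, Speed
    FROM Pass_1
    """
)

# Reading the rows back still needs its own ORDER BY.
result = conn.execute("SELECT ID, RunTime, Speed FROM Pass_2 ORDER BY ID").fetchall()
print(result)
```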