Impala | Kudu: SHOW PARTITIONS with PARTITION BY HASH. Where are my rows?

I want to test CREATE TABLE with PARTITION BY HASH in Kudu.
This is my CREATE statement:
CREATE TABLE customers (
  state STRING,
  name STRING,
  purchase_count INT,
  PRIMARY KEY (state, name)
)
PARTITION BY HASH (state) PARTITIONS 2
STORED AS KUDU
TBLPROPERTIES (
  'kudu.master_addresses' = '127.0.0.1',
  'kudu.num_tablet_replicas' = '1'
);
Some inserts...
insert into customers values ('madrid', 'pili', 8);
insert into customers values ('barcelona', 'silvia', 8);
insert into customers values ('galicia', 'susi', 8);
Avoiding issues...
COMPUTE STATS customers;
Query: COMPUTE STATS customers
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 3 column(s). |
+-----------------------------------------+
And then...
show partitions customers;
Query: show partitions customers
+--------+-----------+----------+----------------+------------+
| # Rows | Start Key | Stop Key | Leader Replica | # Replicas |
+--------+-----------+----------+----------------+------------+
| -1 | | 00000001 | hidra:7050 | 1 |
| -1 | 00000001 | | hidra:7050 | 1 |
+--------+-----------+----------+----------------+------------+
Fetched 2 row(s) in 2.31s
Where are my rows? What does the "-1" mean?
Is there any way to see whether the row distribution is working properly?

Based on further research, presented in this white paper: https://kudu.apache.org/kudu.pdf
The COMPUTE STATS statement works with partitioned tables that use HDFS, not with Kudu tables, which would explain why SHOW PARTITIONS reports -1 under "# Rows" here: the per-tablet row count is simply unknown. Although Kudu does not use HDFS files internally, Impala's modular architecture allows a single query to transparently join data from multiple different storage components. For example, a text log file on HDFS can be joined against a large dimension table stored in Kudu.
For queries involving Kudu tables, Impala can delegate much of the work of filtering the result set to Kudu, avoiding some of the I/O involved in full table scans of tables containing HDFS data files. This type of optimization is especially effective for partitioned Kudu tables, where the Impala query WHERE clause refers to one or more primary key columns that are also used as partition key columns.
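As for seeing where the rows go: hash partitioning applies a hash function to the partition key and takes it modulo the number of partitions. A minimal sketch of that bucketing idea, using the three inserted rows (note: Python's hashlib stands in for Kudu's internal hash function, so the concrete bucket assignments will differ from Kudu's real mapping):

```python
# Illustration of how hash partitioning distributes rows into tablets.
# NOTE: Kudu uses its own internal hash function; hashlib.md5 here is only
# a stand-in to show the bucketing idea, not Kudu's actual row placement.
import hashlib

PARTITIONS = 2

def bucket(state: str) -> int:
    # Hash the partition key and take it modulo the number of partitions.
    digest = hashlib.md5(state.encode("utf-8")).hexdigest()
    return int(digest, 16) % PARTITIONS

rows = [("madrid", "pili", 8), ("barcelona", "silvia", 8), ("galicia", "susi", 8)]
distribution = {}
for state, name, purchases in rows:
    distribution.setdefault(bucket(state), []).append(name)

for tablet, names in sorted(distribution.items()):
    print(f"tablet {tablet}: {names}")
```

Within Impala itself, a plain `SELECT state, COUNT(*) FROM customers GROUP BY state` at least confirms the three rows are present, even though SHOW PARTITIONS reports -1.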

Related

Understanding the precise difference in how SQL treats temp tables vs inline views

I know similar questions have been asked, but I will try to explain why they haven't answered my exact confusion.
To clarify, I am a complete beginner to SQL so bear with me if this is an obvious question.
Despite being a beginner I have been fortunate enough to be given a role doing some data science, and I was recently doing some work where I wrote a query that self-joined a table, then used an inline view on the result, which I then selected from. I can include the code if necessary, but I feel it is not needed for the question.
After running this, the admin emailed me and asked to please stop since it was creating very large temp tables. That was all sorted and he helped me write it more efficiently, but it made me very confused.
My understanding was that temp tables are specifically created by a statement like
SELECT INTO #temp1
I was simply using a nested select statement. Other questions on here seem to confirm that temp tables are different. For example the question here along with many others.
In fact I don't even have privileges to create new tables, so what am I misunderstanding? Was he using "temp tables" differently from the standard use, or do inline views create the same temp tables?
From what I can gather, the only explanation I can think of is that genuine temp tables are physical tables in the database, while inline views just store an array in RAM rather than in the actual database. Is my understanding correct?
There are two kinds of temporary tables in MariaDB/MySQL:
Temporary tables created via SQL
CREATE TEMPORARY TABLE t1 (a int)
Creates a temporary table t1 that is only available for the current session and is automatically removed when the current session ends. A typical use case is tests in which you don't want to clean everything up at the end.
Temporary tables/files created by server
If memory is too low (or the data is too large), the correct indexes are not used, etc., the database server needs to create temporary files or tables for sorting, for collecting results from subqueries, and so on. Temporary files are an indicator that your database design and/or statements should be optimized: disk access is much slower than memory access and unnecessarily wastes resources.
A typical example that creates a temporary table is a simple GROUP BY on a column which is not indexed (the information is displayed in the "Extra" column):
MariaDB [test]> explain select first_name from test group by first_name;
+------+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
| 1 | SIMPLE | test | ALL | NULL | NULL | NULL | NULL | 4785970 | Using temporary; Using filesort |
+------+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
1 row in set (0.000 sec)
The same statement with an index doesn't need to create a temporary table:
MariaDB [test]> alter table test add index(first_name);
Query OK, 0 rows affected (7.571 sec)
Records: 0 Duplicates: 0 Warnings: 0
MariaDB [test]> explain select first_name from test group by first_name;
+------+-------------+-------+-------+---------------+------------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+------------+---------+------+------+--------------------------+
| 1 | SIMPLE | test | range | NULL | first_name | 58 | NULL | 2553 | Using index for group-by |
+------+-------------+-------+-------+---------------+------------+---------+------+------+--------------------------+
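The same effect can be reproduced with SQLite's EXPLAIN QUERY PLAN, if you want a self-contained way to watch the temporary structure appear and disappear (table and column names follow the MariaDB example above):

```python
# Without an index, GROUP BY needs a transient sorting structure
# ("USE TEMP B-TREE FOR GROUP BY"); with a covering index it does not.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (first_name TEXT)")
conn.executemany("INSERT INTO test VALUES (?)", [("ann",), ("bob",), ("ann",)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable detail in column 3.
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

p_before = plan("SELECT first_name FROM test GROUP BY first_name")
print(p_before)   # mentions "USE TEMP B-TREE FOR GROUP BY"

conn.execute("CREATE INDEX idx_fn ON test(first_name)")
p_after = plan("SELECT first_name FROM test GROUP BY first_name")
print(p_after)    # scans the covering index instead
```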

How does BigQuery search through a cluster / partition?

My colleague asked if it was possible to reverse the order of the data in a cluster. So it would look something like the following.
| Normal cluster | Reversed cluster |
|---|---|
| 1 | 2 |
| 1 | 1 |
| 2 | 1 |
I said that I remembered reading that the data is searched like a binary tree, so it doesn't really matter whether it's reversed or not. But now I can't find anything that explains how it actually searches through a cluster.
How does BigQuery actually search for a specific value in clusters / partitions?
When you create a clustered table in BigQuery, the data is automatically organized based on the contents of one or more columns in the table’s schema. The columns you specify are used to colocate related data. When you cluster a table using multiple columns, the order of the columns you specify is important, as it determines the sort order of the data.
When you create a partitioned table, data is stored in physical blocks, each of which holds one partition of data. A partitioned table maintains these properties across all operations that modify the data. You can typically split large tables into many smaller partitions by ingestion time, by a TIMESTAMP/DATE column, or by an INTEGER column.
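The search itself then comes down to block pruning: each block records the min/max values of the clustering column, so the engine can skip every block whose range cannot contain the searched value. A sketch of that principle (an illustration only, not BigQuery's actual implementation; block size and data are made up):

```python
# Block-pruning sketch: sorted data is cut into blocks carrying min/max
# metadata, and a lookup scans only the blocks whose range matches.
BLOCK_SIZE = 3

def build_blocks(values):
    """Sort values (the clustering step) and cut them into fixed-size
    blocks, each annotated with the min/max of its contents."""
    ordered = sorted(values)
    blocks = []
    for i in range(0, len(ordered), BLOCK_SIZE):
        chunk = ordered[i:i + BLOCK_SIZE]
        blocks.append({"min": chunk[0], "max": chunk[-1], "rows": chunk})
    return blocks

def lookup(blocks, value):
    """Scan only blocks whose [min, max] range can contain the value."""
    scanned = 0
    hits = []
    for block in blocks:
        if block["min"] <= value <= block["max"]:
            scanned += 1
            hits.extend(v for v in block["rows"] if v == value)
    return hits, scanned

blocks = build_blocks([7, 1, 4, 9, 2, 2, 8, 5, 6])
hits, scanned = lookup(blocks, 2)
print(hits, "found after scanning", scanned, "of", len(blocks), "blocks")
# [2, 2] found after scanning 1 of 3 blocks
```

Note that min/max pruning is symmetric, which also answers the colleague's question: reversing the sort direction would not change which blocks can be skipped.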

Hive view - partitions are not listed

I have an internal Hive table that is partitioned. I am creating a view on it like this:
create view feat_view PARTITIONED ON(partition_dt) AS SELECT col1, partition_dt from features_v2;
This works fine. But when I try listing the partitions on the view, I get an empty result:
show partitions feat_view;
+------------+--+
| partition |
+------------+--+
+------------+--+
The base table is partitioned:
show partitions features_v2;
+--------------------------+--+
| partition |
+--------------------------+--+
| partition_dt=2018-11-17 |
+--------------------------+--+
Is this intended to work? Can I list the partitions on a view just the way I would on a base table?
From the Apache docs, showing view partitions doesn't seem to be supported. You can, however, show partitions of materialized views (Hive 3). See the example at the end of "Create and use a partitioned materialized view":
CREATE MATERIALIZED VIEW partition_mv_3 PARTITIONED ON (deptno) AS
SELECT emps.hire_date, emps.deptno FROM emps, emps2
WHERE emps.deptno = emps2.deptno
AND emps.deptno > 100 AND emps.deptno < 200;
SHOW PARTITIONS partition_mv_3;
+-------------+
| partition |
+-------------+
| deptno=101 |
+-------------+

How to delete duplicate rows from a table without unique key with only "plain" SQL and no temporary tables?

Similar questions have been asked and answered here multiple times. From what I could find they were either specific to particular SQL implementation (Oracle, SQL Server, etc) or relied on a temporary table (where result would be initially copied).
I wonder if there is a platform-independent, pure-DML solution (just a single DELETE statement).
Sample data: Table A with a single field.
---------
|account|
|-------|
| A22 |
| A33 |
| A44 |
| A22 |
| A55 |
| A44 |
---------
The following SQL Fiddle shows an Oracle-specific solution based on the ROWID pseudo-column. It wouldn't work for any other database and is shown here just as an example.
The only platform-independent way I can think of is to store the data in a secondary table, truncate the first, and load it back in:
create table _tableA (
AccountId varchar(255)
);
insert into _TableA
select distinct AccountId from TableA;
truncate table TableA;
insert into TableA
select AccountId from _TableA;
drop table _TableA;
If you have a column that is unique for each account, or you relax the requirement to avoid dialect-specific SQL, then you can possibly find a single-query solution.
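The secondary-table approach above can be verified end to end. Sketched here in SQLite, which has no TRUNCATE, so a plain DELETE stands in for it (table and column names follow the answer's example):

```python
# Deduplicate via a secondary table: copy DISTINCT rows out, empty the
# original, copy them back. DELETE stands in for TRUNCATE in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (AccountId varchar(255))")
conn.executemany("INSERT INTO TableA VALUES (?)",
                 [("A22",), ("A33",), ("A44",), ("A22",), ("A55",), ("A44",)])

conn.executescript("""
    CREATE TABLE _TableA (AccountId varchar(255));
    INSERT INTO _TableA SELECT DISTINCT AccountId FROM TableA;
    DELETE FROM TableA;                      -- stands in for TRUNCATE
    INSERT INTO TableA SELECT AccountId FROM _TableA;
    DROP TABLE _TableA;
""")

print(sorted(r[0] for r in conn.execute("SELECT AccountId FROM TableA")))
# ['A22', 'A33', 'A44', 'A55']
```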

Most efficient way of getting the next unused id

(related to Finding the lowest unused unique id in a list and Getting unused unique values on a SQL table)
Suppose I have a table containing an id column and some others (they don't make any difference here):
+-----+-----+
| id |other|
+-----+-----+
The id has a numerically increasing value. My goal is to get the lowest unused id and create that row. So of course the first time I run it, it will return 0 and that row will be created. After a few executions it will look like this:
+-----+-----+
| id |other|
+-----+-----+
| 0 | ... |
| 1 | ... |
| 2 | ... |
| 3 | ... |
| 4 | ... |
+-----+-----+
Fairly often some of these rows might get deleted. Let's assume the rows with the ids 1 and 3 are removed. Now the table will look like this:
+-----+-----+
| id |other|
+-----+-----+
| 0 | ... |
| 2 | ... |
| 4 | ... |
+-----+-----+
If I now run the query again, it should return the id 1, and this row should be created:
| id |other|
+-----+-----+
| 0 | ... |
| 1 | ... |
| 2 | ... |
| 4 | ... |
+-----+-----+
The next times the query runs, it should return the ids 3, 5, 6, etc.
What's the most efficient way to run these kinds of queries, given that I need to execute them several times per second (it is fair to assume that the ids are the only purpose of the table)? Is it possible to get the next unused row with one query? Or is it easier and faster to introduce another table which keeps track of the unused ids?
If it is significantly faster, a solution that reuses any hole in the table is also acceptable, provided that all numbers get reused at some point.
Bonus question: I plan to use SQLite for this kind of storing information as I don't need a database except for storing these id's. Is there any other free (as in speech) server which can do this job significantly faster?
I think I'd create a trigger on delete, and insert the old.id in a separate table.
Then you can select min(id) from that table to get the lowest id.
Disclaimer: I don't know what database engine you use, so I don't know if triggers are available to you.
As Dennis Haarbrink said: a trigger on delete and another on insert.
The trigger on delete would take the deleted id and insert it into an id pool table (only one column, id).
The trigger before insert would check whether an id value is provided; otherwise it just queries the id pool table (e.g. SELECT MIN(id) FROM id_pool_table), assigns that value, and deletes it from the id_pool_table.
Normally you'd let the database handle assigning the ids. Is there a particular reason you need to have the id's sequential rather than unique? Can you, instead, timestamp them, and just number them when you display them? Or make a separate column for the sequential id, and renumber them?
Alternatively, you could not delete the rows themselves, but rather, mark them as deleted with a flag in a column, and then re-use the id's of the marked rows by finding the lowest numbered 'deleted' row, and reusing that id.
The database doesn't care whether the values are sequential, only that they are unique. The desire to have your id values sequential is purely cosmetic, and if you are exposing this value to users it should not be your primary key, nor should there be any referential integrity based on the value, because a client could change the format if desired.
The fastest and safest way to handle id value generation is to rely on native functionality that gives you a unique integer value (i.e. SQLite's autoincrement). Using triggers only adds overhead, and using MAX(id) + 1 is extremely risky under concurrent inserts...
Summary
Ideally, use the native unique integer generator (SQLite/MySQL auto_increment, Oracle/PostgreSQL sequences, SQL Server IDENTITY) for the primary key. If you want a value that is always sequential, add an additional column to store that sequential value and maintain it as necessary. MySQL/SQLite/SQL Server unique integer generation only allows one per table; sequences are more flexible.
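For completeness, the "one query" the original question asks about does exist: the lowest unused id can be found with a single self-join, shown here in SQLite but expressible in any SQL dialect. Note it is not concurrency-safe on its own (two sessions running it at once can get the same id), which is exactly why the native generators above are preferred:

```python
# Find the lowest unused id with one standard-SQL query: an id value a.id
# is followed by a gap when no row b has b.id = a.id + 1.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t VALUES (?)", [(0,), (2,), (4,)])

LOWEST_UNUSED = """
    SELECT CASE
        WHEN NOT EXISTS (SELECT 1 FROM t WHERE id = 0) THEN 0
        ELSE (SELECT MIN(a.id) + 1
              FROM t a LEFT JOIN t b ON b.id = a.id + 1
              WHERE b.id IS NULL)
    END
"""

before = conn.execute(LOWEST_UNUSED).fetchone()[0]
conn.execute("INSERT INTO t VALUES (1)")
after = conn.execute(LOWEST_UNUSED).fetchone()[0]
print(before, after)  # 1 3
```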