Hive view - partitions are not listed

I have an internal Hive table that is partitioned. I am creating a view on that table like this:
create view feat_view PARTITIONED ON(partition_dt) AS SELECT col1, partition_dt from features_v2;
This works fine. But when I try listing the partitions on the view, I get an empty result:
show partitions feat_view;
+------------+--+
| partition |
+------------+--+
+------------+--+
The base table is partitioned:
show partitions features_v2;
+--------------------------+--+
| partition |
+--------------------------+--+
| partition_dt=2018-11-17 |
+--------------------------+--+
Is this intended to work? Can I list the partitions on a view just the way I would on a base table?

From the Apache docs, showing view partitions doesn't seem to be supported. You can show partitions of materialized views (Hive 3). See the example at the end of Create and use a partitioned materialized view:
CREATE MATERIALIZED VIEW partition_mv_3 PARTITIONED ON (deptno) AS
SELECT emps.hire_date, emps.deptno FROM emps, emps2
WHERE emps.deptno = emps2.deptno
AND emps.deptno > 100 AND emps.deptno < 200;
SHOW PARTITIONS partition_mv_3;
+-------------+
| partition |
+-------------+
| deptno=101 |
+-------------+

Understanding the precise difference in how SQL treats temp tables vs inline views

I know similar questions have been asked, but I will try to explain why they haven't answered my exact confusion.
To clarify, I am a complete beginner to SQL so bear with me if this is an obvious question.
Despite being a beginner, I have been fortunate enough to be given a role doing some data science. I was recently doing some work where I wrote a query that self-joined a table, then used an inline view on the result, which I then selected from. I can include the code if necessary, but I don't think it is needed for the question.
After I ran this, the admin emailed me and asked me to stop, since it was creating very large temp tables. That was all sorted out and he helped me write it more efficiently, but it left me very confused.
My understanding was that temp tables are specifically created by a statement like
SELECT INTO #temp1
I was simply using a nested select statement. Other questions on here seem to confirm that temp tables are different. For example the question here along with many others.
In fact I don't even have privileges to create new tables, so what am I misunderstanding? Was he using "temp tables" differently from the standard use, or do inline views create the same temp tables?
The only explanation I can think of is that genuine temp tables are physical tables in the database, while inline views just store an array in RAM rather than in the actual database. Is my understanding correct?
There are two kinds of temporary tables in MariaDB/MySQL:
Temporary tables created via SQL
CREATE TEMPORARY TABLE t1 (a int)
This creates a temporary table t1 that is only available to the current session and is automatically removed when that session ends. A typical use case is tests in which you don't want to clean everything up at the end.
Temporary tables/files created by server
If memory is too low (or the data is too large), if suitable indexes are missing or not used, and so on, the database server needs to create temporary tables or files for sorting, collecting results from subqueries, etc. Temporary files are an indicator that your database design and/or your statements should be optimized. Disk access is much slower than memory access and wastes resources unnecessarily.
A typical example of temporary files is a simple GROUP BY on a column that is not indexed (the information is displayed in the "Extra" column):
MariaDB [test]> explain select first_name from test group by first_name;
+------+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
| 1 | SIMPLE | test | ALL | NULL | NULL | NULL | NULL | 4785970 | Using temporary; Using filesort |
+------+-------------+-------+------+---------------+------+---------+------+---------+---------------------------------+
1 row in set (0.000 sec)
The same statement with an index doesn't need to create a temporary table:
MariaDB [test]> alter table test add index(first_name);
Query OK, 0 rows affected (7.571 sec)
Records: 0 Duplicates: 0 Warnings: 0
MariaDB [test]> explain select first_name from test group by first_name;
+------+-------------+-------+-------+---------------+------------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+------------+---------+------+------+--------------------------+
| 1 | SIMPLE | test | range | NULL | first_name | 58 | NULL | 2553 | Using index for group-by |
+------+-------------+-------+-------+---------------+------------+---------+------+------+--------------------------+
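The same before/after comparison can be reproduced without a MariaDB server. The following is only a sketch that swaps in SQLite (through Python's built-in sqlite3 module) as a stand-in: SQLite reports the implicit temporary structure as "USE TEMP B-TREE FOR GROUP BY" rather than "Using temporary", and the table contents and the index name idx_first_name are made up for illustration.

```python
import sqlite3

def plan_details(conn, sql):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the detail text is the last column

def uses_temp_btree(details):
    """True if the plan mentions an implicit temporary B-tree."""
    return any("TEMP B-TREE" in d for d in details)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, first_name TEXT)")
conn.executemany(
    "INSERT INTO test (first_name) VALUES (?)",
    [("ann",), ("bob",), ("ann",)],  # illustrative rows only
)

query = "SELECT first_name FROM test GROUP BY first_name"

# Without an index, the rows must be sorted/grouped in a temporary structure.
before = plan_details(conn, query)

# With an index on the grouped column, the index order can be used instead.
conn.execute("CREATE INDEX idx_first_name ON test (first_name)")
after = plan_details(conn, query)

print("without index:", before)
print("with index:   ", after)
```

Note that no CREATE TEMPORARY TABLE statement appears anywhere: the temporary structure is created by the engine itself, which is the kind of "temp table" the admin was most likely referring to.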

How does BigQuery search through a cluster / partition?

My colleague asked if it was possible to reverse the order of the data in a cluster. So it would look something like the following.
| Normal cluster | Reversed cluster |
|---|---|
| 1 | 2 |
| 1 | 1 |
| 2 | 1 |
I said that I remember reading that the data is searched like a binary tree, so it doesn't really matter whether it's reversed or not. But now I can't find anything that describes how it actually searches through the cluster.
How does BigQuery actually search for a specific value in clusters / partitions?
When you create a clustered table in BigQuery, the data is automatically organized based on the contents of one or more columns in the table’s schema. The columns that you specify are used to colocate related data. When you cluster a table using multiple columns, the order of the columns you specify is important, because it determines the sort order of the data.
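To make the point about column order concrete, here is a small sketch in plain Python. It only illustrates how a multi-column sort key changes the resulting order; it says nothing about how BigQuery stores or scans blocks internally, and the column names country and year and the helper cluster_order are made up for the example.

```python
rows = [
    {"country": "US", "year": 2020},
    {"country": "DE", "year": 2021},
    {"country": "US", "year": 2019},
    {"country": "DE", "year": 2020},
]

def cluster_order(rows, columns):
    """Sort rows by the given columns, first column most significant."""
    return sorted(rows, key=lambda r: tuple(r[c] for c in columns))

# Same data, two different column orders for the sort key.
by_country_year = cluster_order(rows, ["country", "year"])
by_year_country = cluster_order(rows, ["year", "country"])

print([(r["country"], r["year"]) for r in by_country_year])
print([(r["year"], r["country"]) for r in by_year_country])
```

With (country, year), all rows for one country end up next to each other; with (year, country), rows for one country are scattered across the years.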
When you create a partitioned table, data is stored in physical blocks, each of which holds one partition of data. A partitioned table maintains these properties across all operations that modify the data. You can typically split large tables into many smaller partitions by ingestion time, by a TIMESTAMP/DATE column, or by an INTEGER column.

Impala | Kudu SHOW PARTITIONS with PARTITION BY HASH. Where are my rows?

I want to test CREATE TABLE with PARTITION BY HASH in Kudu.
This is my CREATE statement:
CREATE TABLE customers (
  state STRING,
  name STRING,
  purchase_count int,
  PRIMARY KEY (state, name)
)
PARTITION BY HASH (state) PARTITIONS 2
STORED AS KUDU
TBLPROPERTIES (
  'kudu.master_addresses' = '127.0.0.1',
  'kudu.num_tablet_replicas' = '1'
);
Some inserts...
insert into customers values ('madrid', 'pili', 8);
insert into customers values ('barcelona', 'silvia', 8);
insert into customers values ('galicia', 'susi', 8);
Avoiding issues...
COMPUTE STATS customers;
Query: COMPUTE STATS customers
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 3 column(s). |
+-----------------------------------------+
And then...
show partitions customers;
Query: show partitions customers
+--------+-----------+----------+----------------+------------+
| # Rows | Start Key | Stop Key | Leader Replica | # Replicas |
+--------+-----------+----------+----------------+------------+
| -1 | | 00000001 | hidra:7050 | 1 |
| -1 | 00000001 | | hidra:7050 | 1 |
+--------+-----------+----------+----------------+------------+
Fetched 2 row(s) in 2.31s
Where are my rows? What does the "-1" mean?
Is there any way to see whether row distribution is working properly?
Based on further research, as presented in this white paper: https://kudu.apache.org/kudu.pdf
The COMPUTE STATS statement works with partitioned tables that use HDFS, not with Kudu tables. Although Kudu does not use HDFS files internally, Impala’s modular architecture allows a single query to transparently join data from multiple different storage components. For example, a text log file on HDFS can be joined against a large dimension table stored in Kudu.
For queries involving Kudu tables, Impala can delegate much of the work of filtering the result set to Kudu, avoiding some of the I/O involved in full table scans of tables containing HDFS data files. This type of optimization is especially effective for partitioned Kudu tables, where the Impala query WHERE clause refers to one or more primary key columns that are also used as partition key columns.
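As for what PARTITION BY HASH (state) PARTITIONS 2 does with each row, the idea can be sketched in a few lines of Python. This is only an illustration of hash bucketing in general: the helper hash_bucket is hypothetical and uses CRC32 purely as a stand-in, which is an assumption and not the hash function Kudu uses, so the bucket numbers it prints will not match the real tablet placement.

```python
import zlib
from collections import Counter

def hash_bucket(key: str, num_buckets: int) -> int:
    """Map a partition key to a bucket index (stand-in hash, not Kudu's)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

# The 'state' values from the INSERT statements in the question.
states = ["madrid", "barcelona", "galicia"]

# Every row lands in exactly one of the 2 buckets, decided only by its key.
assignment = {s: hash_bucket(s, 2) for s in states}
rows_per_bucket = Counter(assignment.values())

print(assignment)
print(dict(rows_per_bucket))
```

The same key always maps to the same bucket, so with only three distinct keys and two buckets an uneven split is entirely possible and does not mean the partitioning is broken.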

Materialized View vs Trigger for aggregating data?

I have a TASK table:
ID | NAME | STATUS |
----------------------
1 | Task 1 | Open |
2 | Task 2 | Closed |
3 | Task 3 | Closed |
In my application I constantly query for a count of tasks grouped by status, so I'm looking for a caching solution.
Naturally, I thought of a trigger that automatically updates an aggregation table on any change to the TASK table.
TASK_COUNT table:
OPEN | CLOSED |
----------------
1 | 2 |
But I've read that there are also materialized views.
Which is more recommended for aggregating data: materialized views or triggers?
Important to note that in my actual scenario I have more aggregations than just STATUS, and more tables than just TASK.
Also this is a rapidly evolving table, and I need the aggregated data to be always up to date.
The downside to materialized views is that the data may not be totally current. As explained in the documentation:
While access to the data stored in a materialized view is often much faster than accessing the underlying tables directly or through a view, the data is not always current; yet sometimes current data is not needed.
The advantage of materialized views is that they are much simpler to maintain -- basically define and go. But there can be a lag for updates.
If you need totally current information, then triggers are probably the better solution.
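For a sense of what the trigger route involves, here is a minimal sketch. It runs on SQLite through Python's sqlite3 module only so that it is self-contained; trigger syntax differs between databases, so treat it as an outline rather than something to paste into your system. The trigger names are made up, and task_count is assumed to hold one row per status instead of one column per status as in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task (id INTEGER PRIMARY KEY, name TEXT, status TEXT NOT NULL);
CREATE TABLE task_count (status TEXT PRIMARY KEY, n INTEGER NOT NULL);

-- Keep the counter in step with inserts.
CREATE TRIGGER task_ai AFTER INSERT ON task BEGIN
    INSERT OR IGNORE INTO task_count (status, n) VALUES (NEW.status, 0);
    UPDATE task_count SET n = n + 1 WHERE status = NEW.status;
END;

-- Keep the counter in step with deletes.
CREATE TRIGGER task_ad AFTER DELETE ON task BEGIN
    UPDATE task_count SET n = n - 1 WHERE status = OLD.status;
END;

-- A status change moves one unit from the old status to the new one.
CREATE TRIGGER task_au AFTER UPDATE OF status ON task BEGIN
    UPDATE task_count SET n = n - 1 WHERE status = OLD.status;
    INSERT OR IGNORE INTO task_count (status, n) VALUES (NEW.status, 0);
    UPDATE task_count SET n = n + 1 WHERE status = NEW.status;
END;
""")

def counts():
    """Read the cached counts as a {status: n} dict."""
    return dict(conn.execute("SELECT status, n FROM task_count"))

conn.executemany(
    "INSERT INTO task (id, name, status) VALUES (?, ?, ?)",
    [(1, "Task 1", "Open"), (2, "Task 2", "Closed"), (3, "Task 3", "Closed")],
)
initial = counts()

conn.execute("UPDATE task SET status = 'Closed' WHERE id = 1")
after_update = counts()

print(initial)
print(after_update)
```

The counts are updated in the same transaction as the change to task, which is what keeps them current. The cost is that every aggregation you add needs its own insert, update, and delete handling.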

SQL Server Replication questions

I'm Brazilian and my English is not very good, so I apologize.
I have a problem: when replicating tables, I want to set rules so that some columns are not replicated, or are replicated with a default value.
id | descrisaoProduto | estoque
1 | abcd | 10
on replication
id | descrisaoProduto | estoque
1 | (null or default value) | 10
I would also like to know whether there is a way to transform a table into a different layout when it is replicated.
id | estoqueLocal | estoqueMatriz
1 | 10 | 0
on replication
(replication)
id | estoqueLocal | estoqueMatriz
1 | 0 | 10
Probably the simplest way to accomplish this would be to create a view representing the data you wish the subscriber to see, and then replicate that view instead of the underlying source table. Views can be replicated as easily as tables.
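As a rough sketch of what such views could look like for the two examples in the question, the snippet below builds them on SQLite through Python's sqlite3 module, purely so it can be run anywhere. It shows only the shaping of the data, not the replication setup and not the extra requirements SQL Server places on indexed views. The table names produto and estoque and the view names are made up, and swapping estoqueLocal and estoqueMatriz is my reading of the second example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE produto (id INTEGER PRIMARY KEY, descrisaoProduto TEXT, estoque INTEGER);
INSERT INTO produto VALUES (1, 'abcd', 10);

-- Case 1: hide a column from the subscriber by replacing it with NULL
-- (a constant default value would work the same way).
CREATE VIEW produto_subscriber AS
SELECT id, NULL AS descrisaoProduto, estoque FROM produto;

CREATE TABLE estoque (id INTEGER PRIMARY KEY, estoqueLocal INTEGER, estoqueMatriz INTEGER);
INSERT INTO estoque VALUES (1, 10, 0);

-- Case 2: present the two stock columns swapped on the subscriber side.
CREATE VIEW estoque_subscriber AS
SELECT id, estoqueMatriz AS estoqueLocal, estoqueLocal AS estoqueMatriz FROM estoque;
""")

produto_rows = conn.execute("SELECT * FROM produto_subscriber").fetchall()
estoque_rows = conn.execute("SELECT * FROM estoque_subscriber").fetchall()

print(produto_rows)
print(estoque_rows)
```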
In your scenario, you would want to replicate an indexed view as a table on the subscriber side. In this way, you would not need to replicate the underlying table. From the article above:
For indexed views, transactional replication also allows you to replicate the indexed view as a table rather than a view, eliminating the need to also replicate the base table. To do this, specify one of the "indexed view logbased" options for the @type parameter of sp_addarticle (Transact-SQL).
Here's an article demonstrating how to set up replication of an indexed view with transactional replication.