I have a Hive table partitioned by dt, where dt is a date string. The table also has a field col whose value is always equal to dt. Is there any difference in performance between these two SQLs?
SQL1:
select max(dt) from test_table
SQL2:
select max(col) from test_table
There won't be any difference if the datatype and records are the same in both cases.
But if the query contains a WHERE clause on the partition column, it will be faster, because Hive scans only the matching partitions instead of the entire table.
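For illustration (the date literal here is a made-up placeholder):
-- filter on the partition column: Hive prunes to a single partition
select count(*) from test_table where dt = '2024-01-01'
-- filter on the ordinary column: every partition must be scanned,
-- even though col holds the same values as dt
select count(*) from test_table where col = '2024-01-01'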
Related
I have two tables (pageviews and base_events) that are both partitioned on a date field derived_tstamp. Every night I'm doing an incremental update to the base_events table, querying the new data from pageviews like so:
select
*
from
`project.sp.pageviews`
where derived_tstamp > (select max(derived_tstamp) from `project.sp_modeled.base_events`)
Looking at the query costs, this query scans the full table instead of only the new data. Usually it should only fetch yesterday's data.
Do you have any idea, what's wrong with the query?
A subquery in the WHERE clause prevents partition pruning, so it triggers a full table scan. The solution is to use scripting. I have solved my problem with the following query:
declare event_date_checkpoint DATE default (
select max(date(page_view_start)) from `project.sp_modeled.base_events`
);
select
*
from
`project.sp.pageviews`
where derived_tstamp > event_date_checkpoint
More on scripting:
https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting#declare
I have a large table (millions of rows).
I often have to get the DISTINCT values of some columns. In my case, those columns actually have very few distinct values (a few to a few dozen).
What is the most efficient way of doing this?
Add an index on the column and then run:
select distinct column
from t;
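For instance, creating the index is a one-liner (the index name is arbitrary; t and col are hypothetical names standing in for your table and column):
create index t_col_idx on t (col);
With only a handful of distinct values, the engine can often answer the DISTINCT query above from the index alone, without reading the table.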
To add to Gordon's answer: in large databases you can partition your data in addition to adding the index. Conceptually, partitioning splits the table on a column, like so:
Table_1 (id)
Select distinct records from table
Where id < 1000
Table_2 (id)
Select distinct records from table
Where id >= 1000
Actual table = Table_1 + Table_2 (id)
This is just a sample to illustrate the idea. A partition is not an extra copy; it is actually the same table or database, just split up on the basis of the partitioning column, so each partition can be scanned independently.
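As a concrete sketch, here is the same Table_1 / Table_2 split in PostgreSQL's declarative partitioning syntax (all names and the 1000 cutoff are made up for the example):
create table t (
    id  int not null,
    col text
) partition by range (id);
-- the Table_1 / Table_2 split from above
create table t_low  partition of t for values from (minvalue) to (1000);
create table t_high partition of t for values from (1000) to (maxvalue);
-- an index created on the parent cascades to every partition (PostgreSQL 11+)
create index t_col_idx on t (col);
select distinct col from t;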
I am curious if there is a way to query and write to all the partitions of a partitioned table in BigQuery. I wanted to cast a single column to a different datatype and apply it to all the values across the partitions of the table.
i.e.
select cast(nums as STRING) from `project_id.dataset.table`
And have it written back out to all the values in the column across the table. Is there a straightforward way to do this in BigQuery?
Let's create a table:
CREATE TABLE `deleting.part`
PARTITION BY day
AS
SELECT DATE('2018-01-01') day, 2 i
UNION ALL SELECT DATE('2018-01-02'), 3
Now, let's change i from INT64 to FLOAT64:
CREATE OR REPLACE TABLE `deleting.part`
PARTITION BY day
AS
SELECT * REPLACE(CAST(i AS FLOAT64) AS i)
FROM `deleting.part`
Cost: Full table scan.
I have a partitioned table with a field MY_DATE which is always and only the first day of EVERY month from year 1999 to year 2017.
For example, it contains records with 01/01/2015, 01/02/2015, ..., 01/12/2015, as well as 01/01/1999, 01/02/1999, and so on.
The field MY_DATE is the partitioning field.
I would like to copy, IN THE MOST EFFICIENT WAY, the distinct values of field2 and field3 from two adjacent partitions (month M and month M-1) to another table, in order to find the distinct pairs of (field2, field3) across the two dates.
EXCHANGE PARTITION works only if the destination table is not partitioned, but when copying the data of the second, adjacent partition, I receive the error:
"ORA-14099: all rows in table do not qualify for specified partition".
I am using the statement:
ALTER TABLE MY_USER.MY_PARTITIONED_TABLE EXCHANGE PARTITION PART_P201502 WITH TABLE MY_USER.MY_TABLE
Of course MY_PARTITIONED_TABLE and MY_TABLE have the same fields, but the first is partitioned as described above.
Please suppose that MY_PARTITIONED_TABLE is a huge table with about 500 million records.
The goal is to find the distinct pairs of (field2, field3) values across the two adjacent partitions.
My approach was: copy the data of partition M, copy the data of partition M-1, and then SELECT DISTINCT FIELD2, FIELD3 FROM DESTINATION_TABLE.
Thank you very much for considering my request.
I would like to copy, ...
Please note that EXCHANGE PARTITION performs no copy, but an exchange, i.e. the contents of the partition of the big table and of the temporary table are switched.
If you perform this twice for two different partitions and the same temp table, you get exactly the error you received.
To copy (i.e. extract the data without changing the big table) you may use:
create table tab1 as
select * from bigtable partition (partition_name1)
create table tab2 as
select * from bigtable partition (partition_name2)
Your source table is unchanged; when you are done, simply drop the two temp tables. You only need additional space for the two partitions.
Maybe you can even perform your query without copying the data:
with tmp as (
select * from bigtable partition (partition_name1)
union all
select * from bigtable partition (partition_name2)
)
select ....
from tmp;
Good luck!
Not sure if this is possible in PostgreSQL 9.3+, but I'd like to create a unique index on a non-unique column. For a table like:
CREATE TABLE data (
id SERIAL
, day DATE
, val NUMERIC
);
CREATE INDEX data_day_val_idx ON data (day, val);
I'd like to be able to [quickly] query only the distinct days. I know I can use data_day_val_idx to help perform the distinct search, but it seems this adds extra overhead if the number of distinct values is substantially smaller than the number of rows the index covers. In my case, only about 1 in 30 rows has a distinct day.
Is my only option to create a relational table to only track the unique entries? Thinking:
CREATE TABLE days (
day DATE PRIMARY KEY
);
And update this with a trigger every time we insert into data.
An index can only index actual rows, not aggregated rows. So, yes, as far as the desired index goes, creating a table with unique values like you mentioned is your only option. Enforce referential integrity with a foreign key constraint from data.day to days.day. This might also be best for performance, depending on the complete situation.
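A minimal sketch of that approach, reusing the days table from the question and assuming only INSERTs on data need handling (updates, deletes, back-filling of existing rows, and heavy concurrency would need extra care):
-- enforce referential integrity as suggested above
-- (assumes days has already been back-filled from existing data)
ALTER TABLE data ADD CONSTRAINT data_day_fkey
   FOREIGN KEY (day) REFERENCES days (day);

CREATE OR REPLACE FUNCTION data_track_day()
  RETURNS trigger AS
$$
BEGIN
   -- register unseen days; not safe under concurrent inserts of the
   -- same new day (on Postgres 9.5+ use INSERT ... ON CONFLICT DO NOTHING)
   IF NOT EXISTS (SELECT 1 FROM days WHERE day = NEW.day) THEN
      INSERT INTO days (day) VALUES (NEW.day);
   END IF;
   RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_day_trg
BEFORE INSERT ON data
FOR EACH ROW EXECUTE PROCEDURE data_track_day();
Because the trigger fires BEFORE INSERT, the matching days row exists by the time the foreign key on data is checked.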
However, since this is about performance, there is an alternative solution: you can use a recursive CTE to emulate a loose index scan:
WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT day FROM data ORDER BY 1 LIMIT 1
   )
   UNION ALL
   SELECT (SELECT day FROM data WHERE day > c.day ORDER BY 1 LIMIT 1)
   FROM   cte c
   WHERE  c.day IS NOT NULL  -- exit condition
   )
SELECT day FROM cte;
Parentheses around the first SELECT are required because of the attached ORDER BY and LIMIT clauses. See:
Combining 3 SELECT statements to output 1 table
This only needs a plain index on day.
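For example (the index name is arbitrary):
CREATE INDEX data_day_idx ON data (day);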
There are various variants, depending on your actual queries:
Optimize GROUP BY query to retrieve latest row per user
Unused index in range of dates query
Select first row in each GROUP BY group?
More in my answer to your follow-up question:
Counting distinct rows using recursive cte over non-distinct index