Bigquery - Select all table and partition in dataset - google-bigquery

I want to select all data in the dataset. I know I can use a wildcard, but the problem is that there's a partitioned table within the dataset.
example data:
data_2021_05_04 <- partitioned table
data_2021_05_05 <- partitioned table
data_2021_05_06 <- normal table
data_2021_05_07 <- normal table
if I use
select * from dataset.data_*
it will return
Wildcard table over non partitioning tables and field based partitioning tables is not yet supported
or
Wildcard matched incompatible partitioning tables, first table1, first incompatible table table2
Is there any way to solve this?
Thank you.

This worked fine for me:
SELECT * FROM `ProjectID.DatasetID.TableID_*` LIMIT 1000
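If the wildcard keeps failing because the dataset mixes partitioned and normal tables, one workaround, sketched here with the table names from the question and assuming all four tables share the same schema, is to list the tables explicitly with UNION ALL:

```sql
-- An explicit UNION ALL avoids the wildcard's same-partitioning requirement;
-- the table names below are the ones from the question.
SELECT * FROM `dataset.data_2021_05_04`
UNION ALL
SELECT * FROM `dataset.data_2021_05_05`
UNION ALL
SELECT * FROM `dataset.data_2021_05_06`
UNION ALL
SELECT * FROM `dataset.data_2021_05_07`
```

This scans every table listed, so it costs roughly the same as the wildcard would, but it places no restriction on how each individual table is partitioned.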

Related

BigQuery, Wildcard table over non partitioning tables and field based partitioning tables is not yet supported

I'm trying to run a simple query with a wildcard table using standard SQL on BigQuery. Here's the code:
#standardSQL
SELECT dataset_id, SUM(totals.visits) AS sessions
FROM `dataset_*`
WHERE _TABLE_SUFFIX BETWEEN '20150518' AND '20210406'
GROUP BY 1
My sharded dataset contains one table for each day since 18/05/2015, so the table for that first day is 'dataset_20150518'.
The error is: 'Wildcard table over non partitioning tables and field based partitioning tables is not yet supported, first normal table dataset_test, first column table dataset_20150518.'
I've tried different kinds of select and aggregations but the error won't fix. I just want to query on all tables in that timeframe.
This is because all the tables matched by a wildcard must have the same schema. In your case the wildcard also matches dataset_test, which does not have the same schema as the others (is dataset_test a partitioned table?).
You should be able to get around this limitation by deleting dataset_test and any other tables with a different schema, or by narrowing the wildcard so it only matches the date-sharded tables:
#standardSQL
SELECT dataset_id, SUM(totals.visits) AS sessions
FROM `dataset_20*`
WHERE _TABLE_SUFFIX BETWEEN '150518' AND '210406'
GROUP BY 1
Official documentation
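To see exactly which tables the wildcard is picking up, and which one has the odd schema, you can inspect the dataset's metadata. A sketch, assuming the tables live in a dataset literally named `dataset`:

```sql
-- Lists every table the wildcard would match, with its type
-- (BASE TABLE, VIEW, ...) and its CREATE statement for schema comparison.
SELECT table_name, table_type, ddl
FROM `dataset.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE 'dataset_%';
```

Comparing the `ddl` column across rows makes the incompatible table easy to spot.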

Full table scan on partitioned table

I have two tables (pageviews and base_events) that are both partitioned on a date field derived_tstamp. Every night I'm doing an incremental update to the base_events table, querying the new data from pageviews like so:
select
*
from
`project.sp.pageviews`
where derived_tstamp > (select max(derived_tstamp) from `project.sp_modeled.base_events`)
Looking at the query costs, this query scans the full table instead of only the new data, although it should normally only pick up yesterday's data.
Do you have any idea, what's wrong with the query?
A subquery in the WHERE clause prevents partition pruning, so BigQuery falls back to a full table scan. The solution is to use scripting. I solved my problem with the following query:
declare event_date_checkpoint DATE default (
select max(date(page_view_start)) from `project.sp_modeled.base_events`
);
select
*
from
`project.sp.pageviews`
where derived_tstamp > event_date_checkpoint
More on scripting:
https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting#declare

How do I make a query to cast the value in a column for all partitioned tables in BigQuery

I am curious if there is a way to query and write back to all the partitions of a table in BigQuery. I want to cast a single column to a different data type and apply it to all the values across the partitions of the table.
i.e.
select cast(nums as STRING) from `project_id.dataset.table`
And have it written back out to all the values in the column across the table. Is there a straightforward way to do this in BigQuery?
Let's create a table:
CREATE TABLE `deleting.part`
PARTITION BY day
AS
SELECT DATE('2018-01-01') day, 2 i
UNION ALL SELECT DATE('2018-01-02'), 3
Now, let's change i from INT64 to FLOAT64:
CREATE OR REPLACE TABLE `deleting.part`
PARTITION BY day
AS
SELECT * REPLACE(CAST(i AS FLOAT64) AS i)
FROM `deleting.part`
Cost: Full table scan.
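If the change is a widening cast like INT64 to FLOAT64, newer BigQuery DDL can also alter the column in place, which avoids rewriting the table. A sketch (INT64 to FLOAT64 is among the documented coercions, but check the current list of supported conversions for your case):

```sql
-- Widens column i without re-creating or scanning the table.
ALTER TABLE `deleting.part`
ALTER COLUMN i SET DATA TYPE FLOAT64;
```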

Oracle PL/SQL: SELECT DISTINCT FIELD1, FIELD2 FROM TWO ADJACENT PARTITIONS OF A PARTITIONED TABLE

I have a partitioned table with a field MY_DATE which is always and only the first day of EVERY month from year 1999 to year 2017.
For example, it contains records with 01/01/2015, 01/02/2015, ..., 01/12/2015, as well as 01/01/1999, 01/02/1999, and so on.
The field MY_DATE is the partitioning field.
I would like to copy, IN THE MOST EFFICIENT WAY, the distinct values of field2 and field3 of two adjacent partitions (month M and month M-1) to another table, in order to find the distinct pairs of (field2, field3) across the two months.
Exchange Partition works only if the destination table is not partitioned, but when copying the data of the second, adjacent partition I receive the error:
"ORA-14099: all rows in table do not qualify for specified partition".
I am using the statement:
ALTER TABLE MY_USER.MY_PARTITIONED_TABLE EXCHANGE PARTITION PART_P201502 WITH TABLE MY_USER.MY_TABLE
Of course MY_PARTITIONED_TABLE and MY_TABLE have the same fields, but the first is partitioned as described above.
Please suppose that MY_PARTITIONED_TABLE is a huge table with about 500 million records.
The goal is to find the distinct pairs of (field2, field3) values across the two adjacent partitions.
My approach was: copy the data of partition M, copy the data of partition M-1, and then SELECT DISTINCT FIELD2, FIELD3 FROM DESTINATION_TABLE.
Thank you very much for considering my request.
I would like to copy, ...
Please note that EXCHANGE PARTITION performs no copy but an exchange, i.e. the contents of the big table's partition and of the temporary table are swapped.
If you perform this twice for two different partitions and the same temp table, you get exactly the error you received.
To copy (i.e. extract the data without changing the big table) you may use:
create table tab1 as
select * from bigtable partition (partition_name1);
create table tab2 as
select * from bigtable partition (partition_name2);
Your source table is unchanged; once you are done, simply drop the two temp tables. You only need additional space for the two partitions.
Maybe you can even perform your query without copying the data:
with tmp as (
select * from bigtable partition (partition_name1)
union all
select * from bigtable partition (partition_name2)
)
select ....
from tmp;
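Since UNION (without ALL) already removes duplicates, the distinct (field2, field3) pairs of the two adjacent partitions can also be read in a single statement, with no temp tables at all. A sketch, using the partition name from the question plus a hypothetical name for the adjacent month:

```sql
-- UNION deduplicates across both partitions in one pass.
SELECT field2, field3
FROM MY_USER.MY_PARTITIONED_TABLE PARTITION (PART_P201502)
UNION
SELECT field2, field3
FROM MY_USER.MY_PARTITIONED_TABLE PARTITION (PART_P201501);  -- hypothetical adjacent partition name
```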
Good luck!

Delete duplicate records in oracle table : size is 389 GB

I need to delete duplicate records from a table. The table contains 33 columns, and of these only PK_NUM is the primary key. Since PK_NUM is unique, we can keep either the min or the max PK_NUM per duplicate group.
Total records in the table : 1766799022
Distinct records in the table : 69237983
Duplicate records in the table : 1697561039
Column details :
4 : Date data type
4 : Number data type
1 : Char data type
24 : Varchar2 data type
Size of table : 386 GB
DB details : Oracle Database 11g EE::11.2.0.2.0 ::64bit Production
Sample data :
col1 ,col2,col3
1,ABC,123
2,PQR,456
3,ABC,123
Expected data should contain only 2 records:
col1,col2,col3
1,ABC,123
2,PQR,456
*1 can be replaced by 3, and vice versa.
My plan here is to:
Pull the distinct records and store them in a backup table (i.e. by using INSERT INTO ... SELECT).
Truncate the existing table and move the records from the backup table back into it.
As the data size is huge:
What is the most efficient SQL for retrieving the distinct records?
Any estimate on how long the INSERT INTO ... SELECT and the TRUNCATE of the existing table will take?
Please let me know if there is any other, better way to achieve this. My ultimate goal is to remove the duplicates.
One option for making this memory efficient is to insert (nologging, append) all of the rows into a table that is hash partitioned on the list of columns on which duplicates are to be detected, or, if there is a limitation on the number of columns, on as many as you can use (aiming for those with maximum selectivity). Use something like 1024 partitions, and each one will ideally be around 386/1024 ≈ 0.4 GB.
You have then isolated all of the potential duplicates for each row into the same partition, and standard methods for deduplication will run on each partition without as much memory consumption.
So for each partition you can do something like ...
insert /*+ append */ into new_table
select *
from temp_table partition (p1) t1
where not exists (
select null
from temp_table partition (p1) t2
where t1.col1 = t2.col1 and
t1.col2 = t2.col2 and
t1.col3 = t2.col3 and
... etc ...
t1.rowid < t2.rowid);
The key to good performance here is that the hash table created to perform the anti-join in that query, which is going to be nearly as big as the partition itself, must fit in memory. So if you can manage a 2 GB sort area you need at least 389/2 ≈ 200 table partitions. Round up to the nearest power of two, so make it 256 table partitions in that case.
try this:
rename table_name to table_name_dup;
and then:
create table table_name
as
select
min(col1) as col1
, col2
, col3
from table_name_dup
group by
col2
, col3;
As far as I know, not much temp tablespace is used, since the whole GROUP BY takes place in the target tablespace where the new table is created. Once finished, you can just drop the table with the duplicates:
drop table table_name_dup;
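Before dropping the renamed table, it is worth sanity-checking the row counts against the numbers from the question (1,766,799,022 total rows, 69,237,983 distinct). A sketch:

```sql
-- The new table should hold only the distinct rows,
-- the renamed table everything including duplicates.
SELECT COUNT(*) FROM table_name;      -- expected: 69237983
SELECT COUNT(*) FROM table_name_dup;  -- expected: 1766799022
```

Only drop table_name_dup once the first count matches the known distinct count.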