Querying a BigQuery table with multiple nested records - google-bigquery

I’m trying to copy a table’s schema to an empty table. It works for schemas with no nested records, but when I try to copy a schema with multiple nested records via this query:
SELECT * FROM [table] LIMIT 0
I get the following error:
Cannot output multiple independently repeated fields at the same time.

By default, BigQuery automatically flattens all query results (see docs), which fails when the table has more than one independently repeated nested record. In the BigQuery UI, click on Show Options:
Then select your destination table and make sure Allow Large Results is checked and Flatten Results is unchecked:

SELECT * FROM [table] LIMIT 0 with Allow Large Results and Unflatten Results
The drawback of the above approach is that the user can end up with quite a bill, because copying the schema this way is billed as a scan of the whole original table.
Instead, I would get the table schema programmatically and then create the new table with that schema.
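One way to do that is through the bq command-line tool, which can dump a table's schema as JSON and create an empty table from a schema file. The sketch below only assembles and runs those two commands; the table names, the schema.json path, and the function names are made up for illustration, and it assumes bq is installed and authenticated.

```python
import subprocess


def schema_copy_commands(source, dest, schema_file="schema.json"):
    """Build the two bq invocations: dump the schema, then create an empty table from it."""
    # `bq show --schema` reads table metadata only; it does not run a query.
    dump = ["bq", "show", "--schema", "--format=prettyjson", source]
    # `bq mk --table <table> <schema file>` creates an empty table with that schema.
    create = ["bq", "mk", "--table", dest, schema_file]
    return dump, create


def copy_schema(source, dest, schema_file="schema.json"):
    """Run both commands; requires the bq CLI on PATH (an assumption of this sketch)."""
    dump, create = schema_copy_commands(source, dest, schema_file)
    with open(schema_file, "w") as out:
        subprocess.run(dump, stdout=out, check=True)
    subprocess.run(create, check=True)


# Hypothetical table names, shown without executing anything:
for cmd in schema_copy_commands("mydataset.source", "mydataset.empty_copy"):
    print(" ".join(cmd))
```

Because the schema file keeps RECORD and REPEATED fields as nested JSON, nothing is flattened along the way.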

Related

Simple way to select a few rows of data from table in BigQuery?

I am transitioning from SQL Server to BigQuery and noticed that the TOP function in BigQuery is only allowed as an aggregate function in queries. Therefore the code below does not work:
SELECT TOP 5 * FROM TABLE
This is a habit I've had when trying to learn new tables and get more information on the data. Is there another alternative to selecting a few rows from the table? The following select all query works, but is incredibly inefficient and takes a long time to run for large tables:
SELECT * FROM TABLE
In BigQuery, you can use LIMIT as in:
SELECT t.*
FROM TABLE t
LIMIT 5;
But I caution you to be very careful with this. BigQuery charges for the columns a query reads in a table, not for the number of rows it returns. So, on a large table, such a query can be quite expensive.
You can also go into the BigQuery GUI, navigate to the table, and click on "Preview". The preview functionality is free.
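To put rough numbers on the pricing point in this answer, here is a simplified model of on-demand billing: every selected column is read for every row, and LIMIT does not enter the calculation. The row count and per-column byte widths are invented for illustration, and the model ignores real-world details such as minimum charges.

```python
def bytes_scanned(row_count, column_widths, selected_columns):
    """Simplified model: each selected column is read in full for all rows."""
    return row_count * sum(column_widths[c] for c in selected_columns)


# Hypothetical table: average bytes per value in each column.
widths = {"id": 8, "ts": 8, "payload": 200}
rows = 1_000_000_000

all_cols = bytes_scanned(rows, widths, widths.keys())  # SELECT * ... LIMIT 5
one_col = bytes_scanned(rows, widths, ["id"])          # SELECT id ... LIMIT 5

print(all_cols / 1e9, "GB scanned for SELECT *")
print(one_col / 1e9, "GB scanned for SELECT id")
```

In this model the only way to shrink the bill is to name fewer columns; asking for fewer rows changes nothing.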
As Gordon Linoff mentioned, using a LIMIT clause in BigQuery can be very expensive on big tables. To make exploratory queries more cost-effective, BigQuery now supports the TABLESAMPLE operator; see also Using table sampling.
Sampling returns a variety of records while avoiding the costs associated with scanning and processing an entire table.
Query example:
SELECT * FROM dataset.my_table TABLESAMPLE SYSTEM (2 PERCENT)
If you are querying, for example, a view, or TABLESAMPLE SYSTEM does not work for some other reason, you can use [...] WHERE RAND() < 0.05 to get a randomly selected 5% of the results. Make sure to put it at the end of your query, in the WHERE clause.
This works also with table views and if you are not the owner of a table. :)
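What the RAND() filter does can be imitated locally. This is a plain-Python sketch, not BigQuery code: each row is kept independently with probability 0.05, so the sample is close to 5% of the rows but not exactly 5%, and every row still has to be read to make the decision. The function name and the seed are illustrative.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`, like WHERE RAND() < fraction."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(100_000))
sample = bernoulli_sample(rows, 0.05, seed=42)
print(len(sample), "of", len(rows), "rows kept")
```

Since the filter is evaluated per row, it reduces the size of the result rather than the amount of data read, unlike TABLESAMPLE.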

Extract data from view ORACLE performance

Hello, I created a view to wrap a query that selects from two tables.
This is the SQL statement:
CREATE OR REPLACE VIEW EMPLOYEER_VIEW
AS
SELECT A.ID,A.FIRST_NAME||' '||A.LAST_NAME AS NAME,B.COMPANY_NAME
FROM EMPLOY A, COMPANY B
WHERE A.COMPANY_ID=B.COMPANY_ID
AND A.DEPARTEMENT !='DEP_004'
ORDER BY A.ID;
If I select data from EMPLOYEER_VIEW, the average execution time is 135,953 s.
Table EMPLOY contains 124600329 rows.
Table COMPANY contains 609 rows.
My question is:
How can I make the execution faster?
I created two indexes:
emply_index (ID,COMPANY_ID,DEPARTEMENT)
and company_index(COMPANY_ID)
Can you help me make selections run faster (by creating another index or changing the join)?
PS: I can't create a materialized view in this database.
Thanks in advance for your help.
You have a lot of things to do.
If you must work with a view and cannot create a scheduled job to insert the data into a table, I will remove my answer.
Views are not meant to handle hundreds of millions of rows; they are suited to a few million.
Indexes need attention when data is being inserted. Inserting into an indexed table can be 100 times slower. (You can drop and recreate the indexes, or update them.)
On table COMPANY, create partitions.
If you have a lot of IDs, use RANGE partitioning.
If you have around 100 IDs, use LIST partitioning.
You do not need an index, because the JOIN clause does not benefit from it; indexes are meant for strict WHERE clauses.
We had a project with 433,000,000 rows, and the only way to make it work was to use partitions.

How to disallow loading duplicate rows to BigQuery?

I was wondering if there is a way to disallow duplicates in BigQuery?
Based on this article I can deduplicate a whole or a partition of a table.
To deduplicate a whole table:
CREATE OR REPLACE TABLE `transactions.testdata`
PARTITION BY date
AS SELECT DISTINCT * FROM `transactions.testdata`;
To deduplicate a table based on partitions defined in a WHERE clause:
MERGE `transactions.testdata` t
USING (
SELECT DISTINCT *
FROM `transactions.testdata`
WHERE date=CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND date=CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
If there is no way to disallow duplicates then is this a reasonable approach to deduplicate a table?
BigQuery doesn't have a mechanism like the constraints found in a traditional DBMS. In other words, you can't set a primary key or anything like that, because BigQuery is focused not on transactions but on fast analysis and scalability. You should think of it as a data lake, not as a database that enforces uniqueness.
If you have an existing table and need to de-duplicate it, the approaches you mentioned will work. If you need your table to have unique rows by default and want to insert unique rows programmatically without resorting to external resources, I can suggest a workaround:
First, insert your data into a temporary table.
Then, run a query on your temporary table and save the results into your actual table. This step can be done programmatically in a few different ways:
Using the approach you mentioned as a scheduled query
Using a bq command such as bq query --use_legacy_sql=false --destination_table=<dataset.actual_table> 'select distinct * from <dataset.temporary_table>', which queries the distinct rows in your temporary table and loads the results into the target table named in the --destination_table flag. It's important to mention that this approach also works for partitioned tables.
Finally, drop the temporary table. Like the previous step, this can be done with either a scheduled query or a bq command.
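The three steps above can be tried locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for BigQuery, so the table names, columns, and the load_unique helper are illustrative only; the SQL shape (stage, SELECT DISTINCT into the target, drop) is the part that carries over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actual_table (id INTEGER, amount REAL)")


def load_unique(conn, rows):
    """Stage a batch, copy its distinct rows into the target, then drop the staging table."""
    # Step 1: land the incoming batch in a temporary table.
    conn.execute("CREATE TEMP TABLE staging (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    # Step 2: save only the distinct rows into the actual table.
    conn.execute("INSERT INTO actual_table SELECT DISTINCT * FROM staging")
    # Step 3: drop the temporary table.
    conn.execute("DROP TABLE staging")
    conn.commit()


load_unique(conn, [(1, 9.5), (1, 9.5), (2, 3.0)])
print(conn.execute("SELECT * FROM actual_table ORDER BY id").fetchall())
```

Note that this removes duplicates within one batch only. Rows already in the target table are not compared against, so a second batch can still reintroduce a row; that case needs the MERGE or CREATE OR REPLACE approach from the question.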
I hope it helps

BigQuery flattens result when selecting into table with GROUP BY even with "noflatten_results" flag on

I have a table with duplicate records. I want to remove them. I've created a column called "hash_code" which is just a sha1 hash of all the columns. Duplicate rows will have the same hash code. Everything is fine except when I try to create a new table with a query containing GROUP BY. My table has a RECORD data type, but the new table is flattened even though I specified not to flatten it. It seems that GROUP BY and the "--noflatten_results" flag don't play nice.
Here's an example command line I ran:
bq query --allow_large_results --destination_table mydataset.my_events --noflatten_results --replace \
"select hash_code, min(event) as event, min(properties.adgroup_name) as properties.adgroup_name,
min(properties.adid) as properties.adid, min(properties.app_id) as properties.app_id,
min(properties.campaign_name) as properties.campaign_name from mydataset.my_orig_events
group each by hash_code"
In the above example, properties is a RECORD data type with nested fields. The resulting table doesn't have properties as RECORD data type. Instead it translated properties.adgroup_name to properties_adgroup_name, etc.
Any way to force BigQuery to treat the result set as RECORD and not flatten in GROUP BY?
Thanks!
There are a few known cases where query results can be flattened despite requesting unflattened results.
Queries containing a GROUP BY clause
Queries containing an ORDER BY clause
Selecting a nested field with a flat alias (e.g. SELECT record.record.field AS flat_field). Note that this only flattens the specific field with the alias applied, and only flattens the field if it and its parent records are non-repeated.
The BigQuery query engine always flattens query results in these cases. As far as I know, there is no workaround for this behavior, other than removing these clauses or aliases from the query.

What's the best way to select fields from multiple tables with a common prefix?

I have sensor data from a client which is in ongoing acquisition. Every week we get a table of new data (about one million rows each) and each table has the same prefix. I'd like to run a query and select some columns across all of these tables.
What would be the best way to go about this?
I have seen some solutions that use dynamic SQL, and I was considering writing a stored procedure that would build a dynamic SQL statement and execute it for me. But I'm not sure this is the best way.
I see you are using PostgreSQL. This is an ideal case for partitioning with constraint exclusion based on dates. You create one master table without data, and the tables added each week inherit from it. In your case, you don't even have to worry about the nuisance of triggers on INSERT; it sounds like there is never any insertion other than the weekly bulk creation of a new table. See the link above for full documentation.
Queries can be run against the parent table, and Postgres takes care of looking in all the child tables, plus it is smart enough to skip child tables ruled out by WHERE criteria.
You could query the metadata for tables with the same prefix.
select table_name from information_schema.tables where table_name like 'week%'
Then you could use union all to combine queries like
select * from week001
union all
select * from week002
[...]
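Those two steps (look up the table names, then stitch them together with UNION ALL) can be sketched as follows. The example uses Python's built-in sqlite3 so that it runs anywhere, which means it reads table names from sqlite_master; on PostgreSQL the lookup would be the information_schema.tables query shown above. The table names, the reading column, and the union_all_query helper are made up for illustration.

```python
import sqlite3


def union_all_query(conn, prefix, columns="*"):
    """Build one UNION ALL query over every table whose name starts with `prefix`."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name LIKE ? ORDER BY name",
        (prefix + "%",),
    )]
    # Quote identifiers, since table names cannot be passed as bound parameters.
    return "\nUNION ALL\n".join(f'SELECT {columns} FROM "{name}"' for name in names)


conn = sqlite3.connect(":memory:")
for name, value in [("week001", 1.0), ("week002", 2.0), ("other", 99.0)]:
    conn.execute(f'CREATE TABLE "{name}" (reading REAL)')
    conn.execute(f'INSERT INTO "{name}" VALUES (?)', (value,))

query = union_all_query(conn, "week", "reading")
print(query)
print(conn.execute(query).fetchall())
```

The generated statement has to be rebuilt whenever a new weekly table arrives, which is one reason a single table is easier to live with.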
However, I suggest appending new records to one single table and using an index on the timestamp column. This would especially speed up queries that span multiple weeks. It will also simplify your queries a lot if you only have to deal with one table. If the table gets too large, you could partition it by date, so there should be no need to partition manually by keeping multiple tables.
You are correct, sometimes you have to write dynamic SQL to handle cases such as this.
If all of your tables are loaded you can query for table names within your stored procedure. Something like this:
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
Play with that to get the specific table names you need.
How are the table names differentiated? By date? Some incrementing ID?