I am looking for a way to find the total number of partitions (a per-table count of partitions, so I can tell ahead of time if any table is approaching the 4,000-partition limit) across BigQuery tables in all datasets of a project. Could someone please help me with the query?
Thanks
You can use the INFORMATION_SCHEMA.PARTITIONS view to extract partition information for a whole schema/dataset.
It works as follows:
SELECT
*
FROM
`project.schema.INFORMATION_SCHEMA.PARTITIONS`
In case you want to look at a specific table, you just need to include it in the WHERE clause:
SELECT
*
FROM
`project.schema.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name = 'partitioned_table'
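To address the original question about the 4,000-partition limit, you can aggregate the same view per table. A sketch for a single dataset (the 3,500 cutoff is just an illustrative early-warning threshold; you would repeat this per dataset, or script it over the dataset list):
SELECT
  table_name,
  COUNT(DISTINCT partition_id) AS partition_count  -- partitions currently present
FROM
  `project.schema.INFORMATION_SCHEMA.PARTITIONS`
WHERE
  partition_id IS NOT NULL
GROUP BY
  table_name
HAVING
  COUNT(DISTINCT partition_id) > 3500  -- flag tables approaching the 4,000 limit
ORDER BY
  partition_count DESC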
I need your expert advice on the query below; your help is greatly appreciated.
Query:
I have an external table (backed by files in GCS). Every 5 minutes I need to read this table (about 300 GB), run multiple filters/transformations/aggregations on that data, and insert the results into 8 different tables.
Example:
INSERT INTO ds.Native_Cube1 (col1, col2, sumagg)
SELECT col1, col2, SUM(col25) AS sumagg
FROM ds.External_Table
WHERE CAST(SUBSTR(_FILE_NAME, 43, 12) AS INT64) > 123456
GROUP BY col1, col2
INSERT INTO ds.Native_Cube2 (col1, col2, col3, meancol5)
SELECT col1, col2, col3, AVG(col5) AS meancol5
FROM ds.External_Table
WHERE CAST(SUBSTR(_FILE_NAME, 43, 12) AS INT64) > 123456 AND col3 = 'http'
GROUP BY col1, col2, col3
...8 such queries.
With this approach I end up reading the input data multiple times and paying for each read. I want to read the input data only once and populate these Native_CubeN tables appropriately.
So the question is: is it possible to avoid these extra reads and the associated cost? If yes, please suggest how I can achieve this.
Thank you for listening to me.
A query can only write to one destination table, so maybe you can try to reduce the read cost of each query instead. It seems all your queries filter on CAST(SUBSTR(_FILE_NAME,43,12) AS INT64). One way to reduce cost is to import the external table into a native table that is partitioned on a column holding the value CAST(SUBSTR(_FILE_NAME,43,12) AS INT64). The queries would then read from this native table and filter on the partitioning column, so each query reads only the matching partitions instead of the whole table.
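A sketch of that import using integer range partitioning (the table name Native_Source and the range bounds are placeholders; the substring offsets are copied from the question):
-- Materialize the external table once, partitioned on the value derived from _FILE_NAME.
CREATE TABLE ds.Native_Source
PARTITION BY RANGE_BUCKET(file_key, GENERATE_ARRAY(0, 1000000, 1000))  -- placeholder bounds
AS
SELECT
  CAST(SUBSTR(_FILE_NAME, 43, 12) AS INT64) AS file_key,
  *
FROM ds.External_Table;
The eight INSERT ... SELECT statements would then read from ds.Native_Source and filter on file_key (for example WHERE file_key > 123456), so each one scans only the matching partitions.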
In Google BigQuery, it's straightforward to select across all partitioned tables using wildcard operators. For example, I could select all rows from date-partitioned tables with something like this:
SELECT * FROM `project.dataset.table_name__*`;
That would give me all results from project.dataset.table_name__20161127,
project.dataset.table_name__20161128, project.dataset.table_name__20161129, etc.
What I don't understand is how to specify partitioned destination tables. How do I ensure that the result set is written to, for example, project.dataset.dest_table__20161127,
project.dataset.dest_table__20161128, project.dataset.dest_table__20161129?
Thanks in advance!
I've been trying to download the m-lab dataset from BigQuery recently. There seems to be a limit where a single query can only return around 1 million rows, and the m-lab dataset contains multiple billions of records across many tables. I'd love to use queries like
bq query --destination_table=mydataset.table1 "select * from (select ROW_NUMBER() OVER() row_number, * from (select * from [measurement-lab:m_lab.2013_03] limit 10000000)) where row_number between 2000001 and 3000000;"
but it didn't work. Is there a workaround to make it work? Thanks a lot!
If you're trying to download a large table (like the m-lab table), your best option is to use an extract job. For example, run
bq extract 'mlab-project:dataset.table' 'gs://bucket/foo*'
This will extract the table to the Google Cloud Storage objects gs://bucket/foo000000000000.csv, gs://bucket/foo000000000001.csv, etc. The default format is CSV, but you can pass --destination_format=NEWLINE_DELIMITED_JSON to extract the table as JSON instead.
The other thing to mention is that you can read from the 1 millionth row in BigQuery using the tabledata.list API, which lets you read from a particular offset (no query required!).
bq head -n 1000 -s 1000000 'm-lab-project:dataset.table'
will read 1000 rows starting at the 1000000th row.
The typical way of selecting data is:
select * from my_table
But what if the table contains 10 million records and you only want records 300,010 to 300,020?
Is there a way to create a SQL statement on Microsoft SQL that only gets 10 records at once?
E.g.
select * from my_table from records 300,010 to 300,020
This would be way more efficient than retrieving 10 million records across the network, storing them in the IIS server and then counting to the records you want.
SELECT * FROM my_table is just the tip of the iceberg. Assuming you're talking a table with an identity field for the primary key, you can just say:
SELECT * FROM my_table WHERE ID >= 300010 AND ID <= 300020
You should also know that selecting * is considered poor practice in many circles. They want you to specify the exact column list.
Try looking at info about pagination. Here's a short summary of it for SQL Server.
Absolutely. On MySQL and PostgreSQL (the two databases I've used), the syntax would be
SELECT [columns] FROM table LIMIT 10 OFFSET 300010;
On MS SQL, it's something like SELECT TOP 10 ...; I don't know the syntax for offsetting the record list.
Note that you never want to use SELECT *; it's a maintenance nightmare if anything ever changes. This query, though, is going to be incredibly slow since your database will have to scan through and throw away the first 300,010 records to get to the 10 you want. It'll also be unpredictable, since you haven't told the database which order you want the records in.
This is the core of SQL: tell it which 10 records you want, identified by a key in a specific range, and the database will do its best to grab and return those records with minimal work. Look up any tutorial on SQL for more information on how it works.
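As a side note on the "I don't know the syntax for offsetting" point above: SQL Server 2012 and later support an offset directly via OFFSET/FETCH on an ordered query. A sketch with placeholder column names (an index on ID keeps this reasonable, though the keyed WHERE range shown earlier is still the cheapest option):
SELECT ID, col1, col2
FROM my_table
ORDER BY ID
OFFSET 300009 ROWS        -- skip the first 300,009 rows
FETCH NEXT 11 ROWS ONLY;  -- return rows 300,010 through 300,020 of this ordering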
When working with large tables, it is often a good idea to make use of Partitioning techniques available in SQL Server.
The rules of your partition function typically dictate that only a certain range of data can reside within a given partition. You could split your partitions by date range or by ID, for example.
In order to select from a particular partition you would use a query similar to the following.
SELECT <Column Name 1>, <Column Name 2>, ...
FROM <Table Name>
WHERE $PARTITION.<Partition Function Name>(<Column Name>) = <Partition Number>
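For example, a hypothetical partition function splitting rows into one-million-ID ranges might look like this (all names are illustrative, and the table itself must be created on the matching partition scheme for the filter to prune anything):
-- Boundary values are illustrative.
CREATE PARTITION FUNCTION pf_ByIdRange (INT)
AS RANGE LEFT FOR VALUES (1000000, 2000000, 3000000);

-- Map every partition to the PRIMARY filegroup for simplicity.
CREATE PARTITION SCHEME ps_ByIdRange
AS PARTITION pf_ByIdRange ALL TO ([PRIMARY]);

-- Read only the partition that contains ID 300015 (partition 1 under this function).
SELECT ID, col1, col2
FROM my_table
WHERE $PARTITION.pf_ByIdRange(ID) = $PARTITION.pf_ByIdRange(300015);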
Take a look at the following white paper for more detailed information on partitioning in SQL Server 2005.
http://msdn.microsoft.com/en-us/library/ms345146.aspx
I hope this helps; however, please feel free to pose further questions.
Cheers, John
I use wrapper queries around the core query and then isolate just the row numbers I want. This lets SQL Server do all the heavy lifting inside the CORE query and return only the small slice of the table I requested. All you need to do is pass the [start_row_variable] and the [end_row_variable] into the SQL query.
NOTE: The ORDER BY clause is specified OUTSIDE the core query ([sql_order_clause]).
w1 and w2 are the wrapper tables (derived tables) that SQL Server creates around the core query.
SELECT
w1.*
FROM(
SELECT w2.*,
ROW_NUMBER() OVER ([sql_order_clause]) AS ROW
FROM (
/* CORE QUERY START */
SELECT [columns]
FROM [table_name]
WHERE [sql_string]
/* CORE QUERY END */
) AS w2
) AS w1
WHERE ROW BETWEEN [start_row_variable] AND [end_row_variable]
This method has hugely optimized my database systems. It works very well.
IMPORTANT: Be sure to always explicitly specify only the exact columns you wish to retrieve in the core query, as fetching unnecessary data in these CORE queries can cost you serious overhead.
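A filled-in instance of the same pattern, with hypothetical column names and filter, paging out rows 300,010 to 300,020:
SELECT
    w1.*
FROM (
    SELECT w2.*,
           ROW_NUMBER() OVER (ORDER BY id) AS ROW
    FROM (
        /* CORE QUERY START */
        SELECT id, col1, col2
        FROM my_table
        WHERE col2 = 'some_filter'
        /* CORE QUERY END */
    ) AS w2
) AS w1
WHERE ROW BETWEEN 300010 AND 300020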
Use TOP to select only a limited amount of rows, like:
SELECT TOP 10 * FROM my_table WHERE ID >= 300010
Add an ORDER BY if you want the results in a particular order.
To be efficient there has to be an index on the ID column.
OK, here is my dilemma: I have a database set up with about 5 tables, all with the exact same data structure. The data is separated in this manner for localization purposes and to split up a total of about 4.5 million records.
A majority of the time only one table is needed and all is well. However, sometimes data is needed from 2 or more of the tables, and it needs to be sorted by a user-defined column. This is where I am having problems.
data columns:
id, band_name, song_name, album_name, genre
MySQL statement:
SELECT * from us_music, de_music where `genre` = 'punk'
MySQL spits out this error:
#1052 - Column 'genre' in where clause is ambiguous
Obviously, I am doing this wrong. Anyone care to shed some light on this for me?
I think you're looking for the UNION clause, a la
(SELECT * from us_music where `genre` = 'punk')
UNION
(SELECT * from de_music where `genre` = 'punk')
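If duplicate rows across the two tables are impossible (or acceptable), UNION ALL skips the duplicate-elimination step and is usually faster; an ORDER BY after the union then handles the user-defined sort mentioned in the question:
(SELECT * from us_music where `genre` = 'punk')
UNION ALL
(SELECT * from de_music where `genre` = 'punk')
ORDER BY band_name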
It sounds like you'd be happier with a single table. The fact that the five tables share the same schema, and sometimes need to be presented as if they came from one table, points toward putting it all in one table.
Add a new column that can be used to distinguish among the five languages (I'm assuming it's the language that differs among the tables, since you said it was for localization). Don't worry about having 4.5 million records; any real database can handle that size without a problem. Add the correct indexes, and you'll have no trouble dealing with them as a single table.
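A minimal sketch of that consolidation, using the column list from the question plus an illustrative country_code column (column types are guesses):
CREATE TABLE music (
  id INT NOT NULL,
  band_name VARCHAR(255),
  song_name VARCHAR(255),
  album_name VARCHAR(255),
  genre VARCHAR(64),
  country_code CHAR(2) NOT NULL,
  PRIMARY KEY (country_code, id),
  KEY idx_genre (genre)
);

-- Copy each localized table in, tagging rows with their origin.
INSERT INTO music SELECT us.*, 'us' FROM us_music AS us;
INSERT INTO music SELECT de.*, 'de' FROM de_music AS de;
-- ...and likewise for the remaining three tables.

SELECT * FROM music WHERE genre = 'punk' ORDER BY band_name;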
Any of the above answers are valid, or an alternative is to qualify the column with its table name, e.g.:
SELECT * from us_music, de_music where `us_music`.`genre` = 'punk' AND `de_music`.`genre` = 'punk'
The column is ambiguous because it appears in both tables; you would need to qualify the WHERE (or sort) field fully, such as us_music.genre or de_music.genre. But you'd usually only specify two tables if you were then going to join them together in some fashion. The structure you're dealing with is occasionally referred to as a partitioned table, although it's usually done to separate the dataset into distinct files as well, rather than just to split the dataset arbitrarily. If you're in charge of the database structure and there's no good reason to partition the data, then I'd build one big table with an extra "origin" field that contains a country code, but you're probably doing it for a legitimate performance reason.
Either use a UNION to combine the tables you're interested in (http://dev.mysql.com/doc/refman/5.0/en/union.html) or use the MERGE storage engine (http://dev.mysql.com/doc/refman/5.1/en/merge-storage-engine.html).
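A sketch of the MERGE-engine approach (the underlying tables must be identical MyISAM tables; the column list is copied from the question and the types are guesses):
CREATE TABLE all_music (
  id INT NOT NULL,
  band_name VARCHAR(255),
  song_name VARCHAR(255),
  album_name VARCHAR(255),
  genre VARCHAR(64)
) ENGINE=MERGE UNION=(us_music, de_music) INSERT_METHOD=LAST;

-- The merge table then behaves like one combined table.
SELECT * FROM all_music WHERE genre = 'punk' ORDER BY band_name;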
Your original attempt to span both tables creates an implicit JOIN. This is frowned upon by most experienced SQL programmers because it separates the tables being combined from the condition that says how to combine them.
The UNION is a good solution for the tables as they are, but there should be no reason they can't be put into the one table with decent indexing. I've seen adding the correct index to a large table increase query speed by three orders of magnitude.
A UNION can take a long time on huge data sets. It can be better to perform the select in two steps (see the sketch below):
first select just the IDs,
then select from the main tables using those IDs.
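A rough sketch of that two-step idea, using the tables from the question (the application feeds the IDs returned by step 1 into step 2):
-- Step 1: collect only the matching ids (small result, can use an index on genre).
SELECT id FROM us_music WHERE genre = 'punk'
UNION ALL
SELECT id FROM de_music WHERE genre = 'punk';

-- Step 2: fetch the full rows for those ids from each table.
SELECT * FROM us_music WHERE id IN (/* ids from step 1 */);
SELECT * FROM de_music WHERE id IN (/* ids from step 1 */);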