BigQuery query to find the column names of a table - sql

I need a query to find the column names of a table (table metadata) in BigQuery, like the following Oracle query:
SELECT column_name, data_type, data_length, data_precision, nullable FROM all_tab_cols WHERE table_name = 'EMP';

BigQuery now supports INFORMATION_SCHEMA.
Suppose you have a dataset named MY_DATASET in project MY_PROJECT and a table named MY_TABLE; then you can run the following query:
SELECT column_name
FROM MY_PROJECT.MY_DATASET.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'MY_TABLE'
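If you also want the type and nullability, as in the Oracle query above, the same view exposes those as columns. A minimal sketch, using the same placeholder names:
SELECT column_name, data_type, is_nullable
FROM MY_PROJECT.MY_DATASET.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'MY_TABLE'
ORDER BY ordinal_position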

Yes, you can get table metadata using INFORMATION_SCHEMA.
One of the examples in the INFORMATION_SCHEMA documentation retrieves metadata from the INFORMATION_SCHEMA.COLUMN_FIELD_PATHS view for the commits table in the github_repos dataset. You just have to:
Open the BigQuery web UI in the GCP Console.
Enter the following standard SQL query in the Query editor box. INFORMATION_SCHEMA requires standard SQL syntax, which is the default syntax in the GCP Console.
SELECT
  *
FROM
  `bigquery-public-data`.github_repos.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE
  table_name = "commits"
  AND (column_name = "author" OR column_name = "difference")
Note: INFORMATION_SCHEMA view names are case-sensitive.
Click Run.
The results should look like the following
+------------+-------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| table_name | column_name | field_path | data_type | description |
+------------+-------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
| commits | author | author | STRUCT<name STRING, email STRING, time_sec INT64, tz_offset INT64, date TIMESTAMP> | NULL |
| commits | author | author.name | STRING | NULL |
| commits | author | author.email | STRING | NULL |
| commits | author | author.time_sec | INT64 | NULL |
| commits | author | author.tz_offset | INT64 | NULL |
| commits | author | author.date | TIMESTAMP | NULL |
| commits | difference | difference | ARRAY<STRUCT<old_mode INT64, new_mode INT64, old_path STRING, new_path STRING, old_sha1 STRING, new_sha1 STRING, old_repo STRING, new_repo STRING>> | NULL |
| commits | difference | difference.old_mode | INT64 | NULL |
| commits | difference | difference.new_mode | INT64 | NULL |
| commits | difference | difference.old_path | STRING | NULL |
| commits | difference | difference.new_path | STRING | NULL |
| commits | difference | difference.old_sha1 | STRING | NULL |
| commits | difference | difference.new_sha1 | STRING | NULL |
| commits | difference | difference.old_repo | STRING | NULL |
| commits | difference | difference.new_repo | STRING | NULL |
+------------+-------------+---------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------+

For newbies like me, queries like the above follow this general syntax:
SELECT *
FROM `project_name.dataset_name.INFORMATION_SCHEMA.COLUMNS`
WHERE table_catalog = 'project_name'
  AND table_schema = 'dataset_name'
  AND table_name = 'table_name'
where the quoted values on the right-hand side are your own project, dataset, and table names.

Update: This is now possible! See the INFORMATION_SCHEMA docs and the answers above.
Answer, circa 2012:
It's not currently possible to retrieve table metadata (i.e. column names and types) via a query, though this isn't the first time it's been requested.
Is there a reason you need to do this as a query? Table metadata is available via the tables API.
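For example, the bq command-line tool can print a table's schema (it fetches it through the tables API rather than running a query); the project, dataset, and table names below are placeholders:
bq show --schema --format=prettyjson my-project:my_dataset.my_table
This outputs each column's name, type, and mode as JSON.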

Actually, it is possible to do so using SQL. You need to query the logging table for the most recent log entry of this particular table being created.
For example, assuming the table is loaded/created daily:
CREATE TEMP FUNCTION jsonSchemaStringToArray(jsonSchema STRING)
RETURNS ARRAY<STRING> AS ((
  -- Strip the JSON wrapper, reduce each {"name": "..."} field object to
  -- just its name, then split the result on commas
  SELECT SPLIT(
    REGEXP_REPLACE(
      REPLACE(LTRIM(jsonSchema, '{ '), '"fields": [', ''),
      r'{[^{]+"name": "([^\"]+)"[^}]+}[, ]*', '\\1,'),
    ',')
));
WITH valid_schema_columns AS (
  WITH array_output AS (
    SELECT jsonSchemaStringToArray(jsonSchema) AS column_names
    FROM (
      SELECT
        protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.schemaJson AS jsonSchema,
        ROW_NUMBER() OVER (ORDER BY metadata.timestamp DESC) AS record_count
      FROM `realself-main.bigquery_logging.cloudaudit_googleapis_com_data_access_20170101`
      WHERE protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.tableId = '<table_name>'
        AND protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.destinationTable.datasetId = '<schema_name>'
        AND protoPayload.serviceData.jobInsertRequest.resource.jobConfiguration.load.createDisposition = 'CREATE_IF_NEEDED'
    ) AS t
    WHERE t.record_count = 1 -- grab the latest entry
  )
  -- this is what actually UNNESTs the array into standard rows
  SELECT valid_column_name
  FROM array_output
  LEFT JOIN UNNEST(column_names) AS valid_column_name ON TRUE
)
SELECT * FROM valid_schema_columns

To check a column, you can also query your table through the CLI; it's easy and simple:
bq query --use_legacy_sql=false 'SELECT Hour, SUM(column1) AS total FROM `project_id.dataset.table_name` WHERE DATE(Hour) = "2020-06-10"'
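If the goal is to list the column names themselves from the CLI, the same approach works with the INFORMATION_SCHEMA query from the answers above (project, dataset, and table names are placeholders):
bq query --use_legacy_sql=false 'SELECT column_name FROM `my_project.my_dataset.INFORMATION_SCHEMA.COLUMNS` WHERE table_name = "my_table"'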

Related

Sql Server how to find values in different tables that have different suffix

I'm struggling to find a value that might be in different tables, but using UNION is a pain as there are a lot of tables.
A separate table contains the suffixes of the TestTable_ tables:
| ID | Name|
| -------- | -----------|
| 1 | TestTable1 |
| 2 | TestTable2 |
| 3 | TestTable3 |
| 4 | TestTable4 |
TestTable1 content:
| id | Name | q1 | a1 |
| -- | ---- | -- | -- |
| 1 | goose | withFeather? | featherID |
| 2 | rooster | withoutFeather? | shinyfeatherID |
| 3 | rooster | age | 20 |
TestTable2 content:
| id | Name | q1 | a1 |
| -- | ---- | -- | -- |
| 1 | brazilian_goose | withFeather? | featherID |
| 2 | annoying_rooster | withoutFeather? | shinyfeatherID |
| 3 | annoying_rooster | no_legs? | dead |
TestTable3 content:
| id | Name | q1 | a1 |
| -- | ---- | -- | -- |
| 1 | goose | withFeather? | featherID |
| 2 | rooster | withoutFeather? | shinyfeatherID |
| 3 | rooster | age | 15 |
Common columns: q1 and a1.
Is there a way to search through all of them for a specific value without using UNION, since some of them might have different columns?
Something like: check whether q1 = 'age' exists in all those tables (from 1 to 50):
Select q1,*
from (something)
where q1 exists in (TestTable_*)... or something like that.
If not possible, not a problem.
You could use dynamic SQL, but something I do in situations like this, where I have a list of tables that I want to quickly perform the same action on, is to paste the list of tables into a spreadsheet, type the query into a cell with a placeholder such as #table, and use the SUBSTITUTE function to swap in each table name.
Alternatively, I just paste the list into SSMS and use SHIFT+ALT+ArrowKey to select the column in front of the table names; every selected row then receives whatever I type, and I repeat the action on the other side of the table names.
It's not a perfect solution, but it's a quick and dirty way of doing something repetitive.
If you want to find all the tables with that column name you can use information schema.
Select table_name from INFORMATION_SCHEMA.COLUMNS where COLUMN_NAME = 'q1'
Given the type of solution you are after, I can offer a method that I've had to use on legacy systems.
You can query sys.columns for the name of the column(s) you need to find in N tables, joining on object_id to sys.tables where type = 'U'. This will give you a list of table names.
From this list you can then build a working query for each table and, depending on your requirements (is this ad-hoc?), either just manually execute it yourself or build a procedure that will do it for you using sp_executesql.
E.g.
select t.name, c.name
into #workingtable
from sys.columns c
join sys.tables t on t.object_id = c.object_id
where c.name in .....
Pseudocode:
begin loop while rows exist in #workingtable
    select top 1 row from #workingtable
    set @sql = your query specific to that table and column(s)
    exec(@sql) / sp_executesql / try/catch as necessary
    delete row from #workingtable
end loop
Hopefully that gives you ideas, at least, for how you might implement your requirements.
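A minimal runnable sketch of that pseudocode, assuming we are searching for q1 = 'age' and that every matching table also has an a1 column (the temp-table name, variable names, and search value are all illustrative):

-- Collect every user table that has a q1 column
SELECT t.name AS table_name
INTO #workingtable
FROM sys.columns c
JOIN sys.tables t ON t.object_id = c.object_id
WHERE c.name = 'q1' AND t.type = 'U';

DECLARE @tbl sysname, @sql nvarchar(max);

WHILE EXISTS (SELECT 1 FROM #workingtable)
BEGIN
    SELECT TOP 1 @tbl = table_name FROM #workingtable;

    -- Build and run the same lookup against the current table
    SET @sql = N'SELECT ''' + @tbl + N''' AS source_table, q1, a1 '
             + N'FROM ' + QUOTENAME(@tbl) + N' WHERE q1 = @q1;';
    EXEC sp_executesql @sql, N'@q1 nvarchar(100)', @q1 = N'age';

    DELETE FROM #workingtable WHERE table_name = @tbl;
END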

Snowflake Create View with JSON (VARIANT) field as columns with dynamic keys

I am having a problem creating views in Snowflake over a table with a VARIANT field that stores JSON data whose keys are dynamic; the key definitions are stored in another table. I want to create a view whose columns are dynamic, based on the foreign key.
Here is what my tables look like:
companies:
| id | name |
| -- | ---- |
| 1 | Company 1 |
| 2 | Company 2 |
invoices:
| id | invoice_number | custom_fields | company_id |
| -- | -------------- | ------------- | ---------- |
| 1 | INV-01 | {"1": "Joe", "3": true, "5": "2020-12-12"} | 1 |
| 2 | INV-01 | {"2":"Hello", "4": 1000} | 2 |
customization_fields:
| id | label | data_type | company_id |
| -- | ----- | --------- | ---------- |
| 1 | manager | text | 1 |
| 2 | reference | text | 2 |
| 3 | emailed | boolean | 1 |
| 4 | account | integer | 2 |
| 5 | due_date | date | 1 |
So I want to create a view that returns each company's invoices, something like:
CREATE OR REPLACE VIEW companies_invoices AS SELECT * FROM invoices WHERE company_id = 1
which should produce a result like the one below:
| id | invoice_number | company_id | manager | emailed | due_date |
| -- | -------------- | ---------- | ------- | ------- | -------- |
| 1 | INV-01 | 1 | Joe | true | 2020-12-12 |
My challenge here is that I cannot know the keys when I write the query. If I knew them, I could write:
SELECT
id,
invoice_number,
company_id,
custom_fields:"1" AS manager,
custom_fields:"3" AS emailed,
custom_fields:"5" AS due_date
FROM invoices
WHERE company_id = 1
These keys and labels are defined in the customization_fields table. I have tried different approaches, but I have not been able to make it work.
So could anyone tell me whether this can be done? If it can, an example would really help.
You cannot do what you want to do with a view. A view has a fixed set of columns and they have specific types. Retrieving a dynamic set of columns requires some other mechanism.
If you're trying to change the number of columns or the names of the columns based on the rows in the customization_fields table, you can't do it in a view.
If you have a defined schema and just need to grab dynamic JSON properties, you may want to consider looking into Snowflake's GET function. It allows you to get any part of a JSON using a string for the path rather than using a literal path in the SQL statement. For example:
create temp table foo(v variant);
insert into foo select parse_json('{ "name":"John", "age":30, "car":null }');
-- This uses a literal path in the SQL to get to a JSON property
select v:name::string as first_name from foo;
-- This uses the GET function to get the value from a path in a string
select get(v, 'name')::string as first_name from foo;
You can replace the 'name' in the second parameter of the GET function with the value stored in the customization_fields table.
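For example, here is a sketch that reads the key from customization_fields at query time, using the tables from the question. Note it returns one row per invoice per custom field (long format) rather than one column per field:
select i.id,
       i.invoice_number,
       cf.label,
       get(i.custom_fields, to_varchar(cf.id))::string as value
from invoices i
join customization_fields cf
  on cf.company_id = i.company_id
where i.company_id = 1;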
In Snowflake, you will have to use a stored procedure to retrieve a dynamic set of columns.
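A rough sketch of that stored-procedure approach, assuming the table and column names from the question (the procedure name and per-company view naming scheme are made up for illustration; JavaScript procedure syntax):

CREATE OR REPLACE PROCEDURE create_company_invoice_view(company_id FLOAT)
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
  // One projection per custom field defined for this company,
  // e.g. custom_fields:"1" AS manager
  var cols = [];
  var rs = snowflake.execute({
    sqlText: "SELECT id, label FROM customization_fields WHERE company_id = ?",
    binds: [COMPANY_ID]
  });
  while (rs.next()) {
    cols.push('custom_fields:"' + rs.getColumnValue(1) + '" AS ' + rs.getColumnValue(2));
  }
  // Build and run a CREATE VIEW with the fixed columns plus the dynamic ones
  var ddl = "CREATE OR REPLACE VIEW companies_invoices_" + COMPANY_ID +
            " AS SELECT id, invoice_number, company_id" +
            (cols.length ? ", " + cols.join(", ") : "") +
            " FROM invoices WHERE company_id = " + COMPANY_ID;
  snowflake.execute({ sqlText: ddl });
  return ddl;
$$;

CALL create_company_invoice_view(1);

The view definition is still fixed once created; the procedure has to be re-run whenever the rows in customization_fields change.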

SQL - Given sequence of data, how do I query the origin?

Let's assume we have the following data.
| UUID  | SEENTIME            | LAST_SEENTIME       |
| ----- | ------------------- | ------------------- |
| UUID1 | 2020-11-10T05:00:00 |                     |
| UUID2 | 2020-11-10T05:01:00 | 2020-11-10T05:00:00 |
| UUID3 | 2020-11-10T05:03:00 | 2020-11-10T05:01:00 |
| UUID4 | 2020-11-10T05:04:00 | 2020-11-10T05:03:00 |
| UUID5 | 2020-11-10T05:07:00 | 2020-11-10T05:04:00 |
| UUID6 | 2020-11-10T05:08:00 | 2020-11-10T05:07:00 |
Each row is connected to the previous one via LAST_SEENTIME.
In such a case, is there a way to use SQL to identify these connected events as one? I want to be able to compute the start and end so I can calculate the duration of the event.
You can use a recursive CTE. The exact syntax varies by database, but something like this:
with recursive cte as (
      select uuid as orig_uuid, uuid, seentime
      from t
      where last_seentime is null
      union all
      select cte.orig_uuid, t.uuid, t.seentime
      from cte join
           t
           on cte.seentime = t.last_seentime
)
select orig_uuid,
       max(seentime) - min(seentime) -- or whatever your database uses
from cte
group by orig_uuid;
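If you need the explicit start and end as well as the duration, the same aggregation gives you both endpoints; a sketch of an alternative final select over the same cte:
select orig_uuid,
       min(seentime) as start_time,
       max(seentime) as end_time
from cte
group by orig_uuid;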

Last accessed timestamp of a Netezza table?

Does anyone know of a query that gives me details on the last time a Netezza table was accessed for any of the operations (select, insert or update) ?
Depending on your setup you may want to try the following query:
select *
from _v_qryhist
where lower(qh_sql) like '%tablename %'
There is a collection of history views in Netezza that should provide the information you require.
Netezza does not track this information in the catalog, so you will typically have to mine that from the query history database, if one is configured.
Modern Netezza query history information is typically stored in a dedicated database. Depending on permissions, you may be able to see if history collection is enabled, and which database it is using with the following command. Apologies in advance for the screen-breaking wrap to come.
SYSTEM.ADMIN(ADMIN)=> show history configuration;
CONFIG_NAME | CONFIG_DBNAME | CONFIG_DBTYPE | CONFIG_TARGETTYPE | CONFIG_LEVEL | CONFIG_HOSTNAME | CONFIG_USER | CONFIG_PASSWORD | CONFIG_LOADINTERVAL | CONFIG_LOADMINTHRESHOLD | CONFIG_LOADMAXTHRESHOLD | CONFIG_DISKFULLTHRESHOLD | CONFIG_STORAGELIMIT | CONFIG_LOADRETRY | CONFIG_ENABLEHIST | CONFIG_ENABLESYSTEM | CONFIG_NEXT | CONFIG_CURRENT | CONFIG_VERSION | CONFIG_COLLECTFILTER | CONFIG_KEYSTORE_ID | CONFIG_KEY_ID | KEYSTORE_NAME | KEY_ALIAS | CONFIG_SCHEMANAME | CONFIG_NAME_DELIMITED | CONFIG_DBNAME_DELIMITED | CONFIG_USER_DELIMITED | CONFIG_SCHEMANAME_DELIMITED
-------------+---------------+---------------+-------------------+--------------+-----------------+-------------+---------------------------------------+---------------------+-------------------------+-------------------------+--------------------------+---------------------+------------------+-------------------+---------------------+-------------+----------------+----------------+----------------------+--------------------+---------------+---------------+-----------+-------------------+-----------------------+-------------------------+-----------------------+-----------------------------
ALL_HIST_V3 | NEWHISTDB | 1 | 1 | 20 | localhost | HISTUSER | aFkqABhjApzE$flT/vZ7hU0vAflmU2MmPNQ== | 5 | 4 | 20 | 0 | 250 | 1 | f | f | f | t | 3 | 1 | 0 | 0 | | | HISTUSER | f | f | f | f
(1 row)
Also make note of the CONFIG_VERSION, as it will come into play when crafting the following query example. In my case, I happen to be using the version 3 format of the query history database.
Assuming history collection is configured, and that you have access to the history database, you can get the information you're looking for from the tables and views in that database, which are documented in the IBM Netezza query history documentation. The following is an example, which reports when the given table was the target of a successful insert, update, or delete by referencing the "usage" column. Here I use the FORMAT_TABLE_ACCESS history-table helper function to unpack that column.
SELECT FORMAT_TABLE_ACCESS(usage),
hq.submittime
FROM "$v_hist_queries" hq
INNER JOIN "$hist_table_access_3" hta
USING (NPSID, NPSINSTANCEID, OPID, SESSIONID)
WHERE hq.dbname = 'PROD'
AND hta.schemaname = 'ADMIN'
AND hta.tablename = 'TEST_1'
AND hq.SUBMITTIME > '01-01-2015'
AND hq.SUBMITTIME <= '08-06-2015'
AND
(
instr(FORMAT_TABLE_ACCESS(usage),'ins') > 0
OR instr(FORMAT_TABLE_ACCESS(usage),'upd') > 0
OR instr(FORMAT_TABLE_ACCESS(usage),'del') > 0
)
AND status=0;
FORMAT_TABLE_ACCESS | SUBMITTIME
---------------------+----------------------------
ins | 2015-06-16 18:32:25.728042
ins | 2015-06-16 17:46:14.337105
ins | 2015-06-16 17:47:14.430995
(3 rows)
You will need to change the digit at the end of the $hist_table_access_3 view to match your query history version.

obtaining unique/distinct values from multiple unassociated columns

I have a table in a postgresql-9.1.x database which is defined as follows:
# \d cms
Table "public.cms"
Column | Type | Modifiers
-------------+-----------------------------+--------------------------------------------------
id | integer | not null default nextval('cms_id_seq'::regclass)
last_update | timestamp without time zone | not null default now()
system | text | not null
owner | text | not null
action | text | not null
notes | text
Here's a sample of the data in the table:
id  | last_update                | system          | owner     | action                              | notes
----+----------------------------+-----------------+-----------+-------------------------------------+-----------------
584 | 2012-05-04 14:20:53.487282 | linux32-test5   | rfell     | replaced MoBo/CPU                   |
 34 | 2011-03-21 17:37:44.301984 | linux-gputest13 | apeyrovan | System deployed with production GPU |
636 | 2012-05-23 12:51:39.313209 | mac64-cvs11     | kbhatt    | replaced HD                         |
211 | 2011-09-12 16:58:16.166901 | linux64-test12  | rfell     | HD swap                             | drive too small
What I'd like to do is craft a SQL query that returns only the unique/distinct values from the system and owner columns, ignoring the association between them, and filling in NULLs when one column has fewer distinct values than the other. So something like this:
system | owner
-----------------+------------------
linux32-test5 | apeyrovan
linux-gputest13 | kbhatt
linux64-test12 | rfell
mac64-cvs11 |
The only way that I can figure out to get this data is with two separate SQL queries:
SELECT system FROM cms GROUP BY system;
SELECT owner FROM cms GROUP BY owner;
Far be it from me to inquire why you would want to do such a thing. The following query does it with a full outer join on a calculated column generated by the row_number() function:
select ts.system, town.owner
from (select system, row_number() over (order by system) as seqnum
      from (select distinct system
            from cms
           ) ts
     ) ts full outer join
     (select owner, row_number() over (order by owner) as seqnum
      from (select distinct owner
            from cms
           ) town
     ) town
     on ts.seqnum = town.seqnum;
The full outer join makes sure that the longer of the two lists is returned in full.