IN with a list of numbers (BigQuery + Google Sheets) - sql

I have access to BigQuery through work, but no write access. Just read.
So I have a bunch of integers in a Google Sheet, in one column (~400 rows):
User.
332321
031230
938101
These numbers all correspond to a specific value in a table in BQ, but unfortunately, they aren't easily queried, as they are the result of multiple queries, etc.
So, my dilemma: how can I take the column of integers from Google Sheets and use it in a query (say, in a WHERE clause)? The only suggestion I've received so far has been to get write access: https://supermetrics.com/blog/bigquery-query-google-sheets

You will need to create an external table in BigQuery using your Google Sheet data in order to query it from BigQuery. However, as you already mentioned, this requires write/create permissions in BigQuery, which is also covered in the reference you provided --> https://supermetrics.com/blog/bigquery-query-google-sheets.
In addition, you may refer to the Query Google Drive Data documentation, as it contains more details about the permissions needed in BigQuery and Google Drive before you can create and query an external table, as well as the actual table creation and query execution.
BigQuery permissions
At a minimum, the following permissions are required to create and query an external table in BigQuery.
bigquery.tables.create
bigquery.tables.getData
bigquery.jobs.create
Drive permissions
At a minimum, to query external data in Drive you must be granted View access to the Drive file linked to the external table.
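Once those permissions are in place, the IN-list use case from the question becomes a simple subquery against the sheet-backed table. Below is a minimal Python sketch using the google-cloud-bigquery client; the project, dataset, table, and column names (my_project, my_dataset, sheet_user_ids, big_table, user_id) and the sheet URL are placeholders, and the client's credentials also need Google Drive scope plus Viewer access to the sheet.

from google.cloud import bigquery

client = bigquery.Client()  # credentials also need Drive scope to read the sheet

# Hypothetical names -- replace with your own project/dataset/sheet URL.
table_id = "my_project.my_dataset.sheet_user_ids"

external_config = bigquery.ExternalConfig("GOOGLE_SHEETS")
external_config.source_uris = ["https://docs.google.com/spreadsheets/d/<SHEET_ID>"]
external_config.options.skip_leading_rows = 1  # skip the "User" header row
external_config.autodetect = True

table = bigquery.Table(table_id)
table.external_data_configuration = external_config
client.create_table(table)  # requires bigquery.tables.create

# Use the sheet-backed table as the IN list in a WHERE clause.
query = """
    SELECT t.*
    FROM `my_project.my_dataset.big_table` AS t
    WHERE t.user_id IN (SELECT User FROM `my_project.my_dataset.sheet_user_ids`)
"""
for row in client.query(query):
    print(row)

The subquery form avoids pasting ~400 literal values into the query and stays in sync with the sheet whenever the column changes.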

Automatic ETL of data before loading to BigQuery

I have CSV files added to a GCS bucket daily or weekly, and each file name contains a date and a specific parameter.
The files contain two columns (id, name), and we need to automatically load/ingest these files into a BigQuery table so that the final table has 4 columns (id, name, date, specific parameter).
We have tried Dataflow templates, but we couldn't get the date and specific parameter from the file name into the Dataflow job.
We also tried a Cloud Function (there we can get the date and specific parameter values from the file name), but we couldn't add them as columns during ingestion.
Any suggestions?
Disclaimer: I have authored an article on this kind of problem using Cloud Workflows, for when you want to extract parts of a filename to use in the table definition later.
We will create a Cloud Workflow to load data from Google Storage into BigQuery. This linked article is a complete guide on how to work with workflows, connecting any Google Cloud APIs, working with subworkflows, arrays, extracting segments, and calling BigQuery load jobs.
Let’s assume we have all our source files in Google Storage. Files are organized in buckets and folders, and they could be versioned.
Our workflow definition will have multiple steps.
(1) We will start by using the GCS API to list files in a bucket, by using a folder as a filter.
(2) Then, for each file, we will use parts of the filename in the generated BigQuery table name.
(3) The workflow’s last step will be to load the GCS file into the indicated BigQuery table.
We are going to use BigQuery query syntax to parse and extract the segments from the URL and return them as a single row result. This way we will have an intermediate lesson on how to query from BigQuery and process the results.
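As a rough illustration of that parsing step (outside the workflow itself), here is a small Python sketch that asks BigQuery to pull the date and the specific parameter out of a file name. The file-name pattern, field names, and object name below are assumptions for the example, not something prescribed by the article.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical object name returned by the GCS listing step.
object_name = "exports/sales_2023-01-15_eu.csv"

# Let BigQuery parse the name and return the segments as a single row.
sql = r"""
    SELECT
      REGEXP_EXTRACT(@name, r'_(\d{4}-\d{2}-\d{2})_') AS file_date,
      REGEXP_EXTRACT(@name, r'_([^_]+)\.csv$')        AS specific_parameter
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("name", "STRING", object_name)]
)
row = next(iter(client.query(sql, job_config=job_config)))
print(row.file_date, row.specific_parameter)  # 2023-01-15 eu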
The full article, with lots of code samples, is here: Using Cloud Workflows to load Cloud Storage files into BigQuery

Getting clustering/bucketing columns programmatically

For reference, I am connecting to amazon-athena via sqlalchemy using essentially:
create_engine(
    f'awsathena+rest://:@athena.{myRegion}.amazonaws.com:443/{athena_schema}?s3_staging_dir={myS3_staging_path}',
    echo=True)
In most relational databases that adhere to the ANSI-SQL standard, I can programmatically get the partition columns of a table by running something like the following:
select *
from information_schema.columns
where table_name='myTable' and table_schema='mySchema'
and extra_info = 'partition key'
However the bucketing or clustering columns seem to not be similarly flagged. I know I can access this information via:
show create table mySchema.myTable
but I am interested in a clean programmatic solution, if one exists; I am trying not to reinvent the wheel. Please show me how to do this or point me to the relevant documentation.
Thank you in advance.
PS: It would also be great if other information about the table, like the location of the files and the storage format, were also accessible programmatically.
Athena uses the Glue Data Catalog to store metadata about databases and tables. I don't know how much of this is exposed in information_schema, and there is very little documentation about it.
However, you can get everything Athena knows by querying the Glue Data Catalog directly. In this case if you call GetTable (e.g. aws glue get-table …) you will find the bucketing information in Table.StorageDescriptor.BucketColumns.
The GetTable call will also give you the storage format and the location of the files (but for a partitioned table you need to make additional calls with GetPartitions to retrieve the location of each partition's data).
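Since you are already in Python via sqlalchemy, the same GetTable call is available through boto3. A minimal sketch, assuming the schema and table names from the question and a placeholder region:

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # match your Athena region

table = glue.get_table(DatabaseName="mySchema", Name="myTable")["Table"]
sd = table["StorageDescriptor"]

print("bucket columns:", sd.get("BucketColumns", []))
print("num buckets:   ", sd.get("NumberOfBuckets"))
print("location:      ", sd.get("Location"))     # where the files live
print("input format:  ", sd.get("InputFormat"))  # storage format class
print("partition keys:", [c["Name"] for c in table.get("PartitionKeys", [])])

For a partitioned table, glue.get_partitions(DatabaseName=..., TableName=...) returns the per-partition storage descriptors mentioned above.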

Imported data into BigQuery but can only access the 'table' via job history and can't see it in datasets

I've imported some data into BigQuery; however, I can only query the table from Job History and can't seem to find it under a dataset.
What do I need to do in order to convert this into a dataset?
How I imported the data: it was done via a third-party app that had access to my Google Analytics (StitchData).
Here are some additional import details.
From your screenshot, the "Destination table" should be in the format [DATASET].[TABLE].
The "Table ID" under "Table Info" should show the same information.
I guess you already have a dataset and just need a way to see it.
If so, this video may help you locate the dataset in the BigQuery classic UI.

BigQuery - Transfers automation from Google Cloud Storage - Overwrite table

Here's the case:
Our client daily uploads CSVs (overwritten) to a bucket in Google Cloud Storage (each table in a different file).
We use BigQuery as the data source in Data Studio.
We want to automatically transfer the CSVs to BigQuery.
The thing is, even though we have:
Declared the tables in BigQuery with the "Overwrite table" write preference option, and
Configured the daily transfers via the UI (BigQuery > Transfers) to automatically upload the CSVs from Google Cloud Storage one hour after the files land there, as stated by the limitations,
the automated transfer/load still runs with "WRITE_APPEND", so the tables are appended to instead of overwritten in BigQuery.
Hence the question: How/where can we change the
configuration.load.writeDisposition = WRITE_TRUNCATE
as stated here in order to overwrite the tables when the CSVs are automatically loaded?
I think that's what we're missing.
Cheers.
None of the above worked for us, so I'm posting this in case anyone has the same issue.
We scheduled a query to erase the table content just before the automatic importation process starts:
DELETE FROM project.tableName WHERE true
New data is then imported into an empty table, so the default "WRITE_APPEND" doesn't affect us.
1) One way to do this is to use DDL to CREATE OR REPLACE your table before running the query that imports the data.
This is an example of how to create a table:
#standardSQL
CREATE TABLE mydataset.top_words
OPTIONS(
  description="Top ten words per Shakespeare corpus"
) AS
SELECT
  corpus,
  ARRAY_AGG(STRUCT(word, word_count) ORDER BY word_count DESC LIMIT 10) AS top_words
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
Now that the table is created, you can import your data.
2) Another way is to use BigQuery scheduled queries.
3) If you write Python, you can find an even better solution here.
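For completeness, here is roughly what that Python route can look like with the google-cloud-bigquery client: a load job from the GCS URI with write_disposition set to WRITE_TRUNCATE, which is the configuration.load.writeDisposition setting the question refers to. The project, dataset, table, and bucket names below are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names -- point these at your own dataset/table and bucket.
table_id = "my_project.my_dataset.daily_table"
uri = "gs://my_bucket/daily_export.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # overwrite instead of append
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish
print(client.get_table(table_id).num_rows, "rows now in", table_id)

Run on a schedule (for example from Cloud Scheduler plus a Cloud Function), this replaces the table content on every load instead of appending to it.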

Azure Machine Learning Write output to Azure SQL Database

I am using Azure Machine Learning to cluster data.
The input data is from an Azure SQL Database, and it works fine.
At the end of everything I want to write the output to a table in the same Azure SQL Database, but I get this error:
Error: Error 1000: AFx Library library exception:
Sql encountered an error: Login failed for user
Anyone any idea?
Thank you very much!
Please follow the instructions and examine the examples provided here to properly use the Export Data module to save ML output to an Azure SQL Database.
How to Export Data to an Azure SQL Database
Add the Export Data module to your experiment. You can find this module in the Data Input and Output group in the experiment items list in Azure Machine Learning Studio.
Connect it to the module that produces the data that you want to export to Azure SQL DB.
For Data destination, select Azure SQL Database. This option supports Azure SQL Data Warehouse as well.
Set the following options specific to Azure SQL Database or Azure SQL Data Warehouse.
Database server name
Type the server name that is generated by Azure. Typically it has the form <generated_identifier>.database.windows.net.
Database name
Type the name of a database on the server you just specified. The database must already exist; Export Data cannot create it.
Server user account name
Type the user name of an account that has access permissions for the database.
Server user account password
Provide the password for the specified user account.
Comma-separated list of columns to be saved
Type the names of the columns in the experiment that you want to write to the database.
Data table name
Type the name of the table where data will be stored.
For Azure SQL Database, if the table does not exist, it will be created. For Azure SQL Data Warehouse, the table must already exist and have the correct schema, so be sure to create it in advance.
Comma-separated list of datatable columns
Type the names of the columns as you wish them to appear in the destination table. The columns should correspond in order with the column names that you list in Comma-separated list of columns to be saved.
If you are writing to Azure SQL Data Warehouse, the column names must match those already in the destination table schema.
Number of rows written per SQL Azure operation
Indicate how many rows should be written to the destination table in each batch. By default, the value is set to 50, which is the default batch size for Azure SQL Database. However, you should increase this value if you have a large number of rows to write.
TIP:
For Azure SQL Data Warehouse, we recommend that you set this value to 1. If you use a larger batch size, the size of the command string that is sent to Azure SQL Data Warehouse can exceed the allowed string length, causing an error.
If you don't want to write new results each time you run the experiment, select the Use cached results option. If there are no other changes to module parameters, the experiment will write the data the first time the module is run, and thereafter not perform writes.
However, a write will always be performed if any parameters have been changed in Export Data that would change the results.
Run the experiment.
Found the issue!
I needed to create a specific user with this SQL code:
CREATE USER AMLApplicationUser WITH PASSWORD = '************';
and then add the user to these roles on the database I want to write to:
ALTER ROLE db_datareader ADD MEMBER AMLApplicationUser;
ALTER ROLE db_datawriter ADD MEMBER AMLApplicationUser;
I guess the db_datawriter role alone would be enough, but I needed db_datareader too.
So in conclusion, it seems that the database admin role can be used to read data, but not to write data, from AML.
Thank you for your help!