Google BigQuery: Dataset not found when saving query results to table - google-bigquery

I just ran a query in Google BigQuery. It produced just over 10 million results.
BigQuery tells me that I cannot save query results that big directly as CSV, JSON, etc. I can only save it to a table.
So I try to save it to a table, using the new UI, Google automatically fills in the "Project name" and "Dataset name". I have tried random table names as well as created a new table and tried to save there.
Every time I get an error:
Not found: Dataset ethereum-account-balances:Eth_Account_Balances_Dataset
Where Ethereum-Account-Balances is the project name and Eth_Account_Balances_Dataset is the Dataset name
I have already created this dataset and confirmed it works. Even re-ran the query after creating the new data set.
Anyone has seen this before?

The new dataset must be in the same location as the dataset whose tables are referenced in the query.
Reference - https://cloud.google.com/bigquery/docs/managing-tables#limitations_on_copying_tables

I had the same problem when trying to save the new BigQuery table in a different dataset of the original table.
What I did to get around this was:
1 - Saved a new BigQuery table in the same dataset or the original table
2 - Copied this new table to the other dataset, using copy table functionality

Related

Creating Table in dataset in BigQuery

I want to create a Table for my dataset in BigQuery. I want to upload CSV file. When I upload it and clicked on "create table" it is saying:
unexpected error. Tracking number c986854671035387
What is this error and How can I solve this? (I also upgraded my BigQuery to 90 days free trial).
You need to check the data inside CSV. If it has column names and no faulty records.
you can download any sample CSV file from here and try
http://www.mytrapture.com/sampledata/

How to stop Big Query using old schema when creating new table with the same name as a deleted one from Google Sheets

I am using a Google Sheet as the source of a table in Big Query. Since I am unable to rename fieldnames in the schema of an existing table I deleted the table and attempted to re-create it after amending the column names in the source Google Sheet. I need to keep the table name the same as I already have analysis files connecting to the table, however when I create the new table as ask Big Query to auto-detect the schema it uses the schema of the previous table. Even if I enter the new schema as text when creating the table it ignores what I enter and use the schema from the old table.
Any ideas how I get Big Query to detect the new schema from the Google Sheet whilst using the same table name as the deleted table?
Thanks in advance!
After trying this multiple times and it not working - with several tables - randomly it worked and let me create a table with the new scheme (manually). Not sure why this didn't work before as I'm pretty sure I didn't do anything differently. If anyone has any insight on what might have caused the initial errors I'd love to hear it for future reference but my current problem is solved.

How to save a view using federated queries across two projects?

I'm looking to save a view which uses federated queries (from a MySQL Cloud SQL connection) between two projects. I'm receiving two different errors (depending on which project I try to save in).
If I try to save in the project containing the dataset I get error:
Not found: Connection my-connection-name
If I try to save in the project that contains the connection I get error:
Not found: Dataset my-project:my_dataset
My example query that crosses projects looks like:
SELECT
bq.uuid,
sql.item_id,
sql.title
FROM
`project_1.my_dataset.psa_v2_202005` AS bq
LEFT OUTER JOIN
EXTERNAL_QUERY( 'project_2.us-east1.my-connection-name',
'''SELECT item_id, title
FROM items''') AS sql
ON
bq.looks_info.query_item.item_id = sql.item_id
The documentation at https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries#known_issues_and_limitations doesn't mention any limitations here.
Is there a way around this so I can save a view using an external connection from one project and dataset from another?
Your BigQuery table is located in US and your MySQL data source is located in us-east1. BigQuery automatically chooses to run the query in the location of your BigQuery table (i.e. in US), however, your Cloud MySQL is in us-east1 and that's why your query fails. Therefore the BigQuery table and Cloud SQL instance, must be in the same location in order for this query to succeed.
The solution for this kind of cases is moving your BigQuery dataset to the same location as your Cloud SQL instance manually by following the steps explained in detail in this documentation. However, the us-east1 is not currently supported for copying datasets. Thus, I will recommend you to create a new connection in one of the locations mentioned in the documentation.
I hope you find the above pieces of information useful.

Using SSIS Package, How to validate the source records for duplicate before inserting?

SQL Server 2012: using a SSIS package, how to validate the source records for duplicate before inserting?
Our source file is a .csv. We are facing duplicate records loaded in the staging table.
At present , we are following manual process of loading data.
How to validate the source file data against the destination table before loading and load only the valid records? Possibility of loading duplicate records not only because of the source file having duplicate records in it but also reloading the same file to the staging table.
We are not Truncate the staging table. We are keeping records as is.
Second question : How to pick the name of the source file and pass it in the loading ? Possibly having a derived column as "FileName" which will get loaded along with raw data to the staging table.
The typical load pattern I use in this case is:
Prepare a staging table that matches the source file
In SSIS run a SQL Task with TRUNCATE StagingTable; (which clears it out)
Then, run a data flow task that loads the entire data file into the staging table
Lastly, merge the staging table into the final table.
I prefer to do this last step in a SQL Task also:
INSERT INTO FinalTable
(PrimaryKey,Column1,Column2,Column3)
SELECT
PrimaryKey,Column1,Column2,Column3
FROM StagingTable SRC
WHERE NOT EXISTS (
SELECT * FROM FinalTable TGT WHERE TGT.PrimaryKey=SRC.PrimaryKey
);
If you prefer a graphical UI, and you don't mind the extra network traffic, and slower processing time, you can do the same type of merge operation using lookups. You can even use the SCD component but I strongly discourage it's use.
Whether you do it in T-SQL or the UI, you need a key that can be used to uniquely identify the records (referred to as PrimaryKey in my example). If you don't have this key, there is no way to 'deduplicate'
Note in this example you have a 'real' staging table whose only purpose is to get the data file into the database. Then you have a final table that contains the final consistent result
Also note that this pattern only adds new rows - it will not update existing rows if they change in the data file.
Given your exact scenario (of loading the same file again), I would first check if the data is even loaded to the staging table. If you do that, you don't have to worry about checking the duplicates at record level.
How are you setting the connection to the file? Most of the data loads I have dealt with, I designed for-each-loop-container where the file name/path would be populated in a user variable. As you said, you could just use a derived column transform to add a new column which gets the value from a variable. If you don't have the file name in a user variable, you could use expression task in the control flow to populate it.
To cover your exact requirement, I would use the above step to populate the file name in the table. You could even normalize to a different table instead of storing long file name for every data record. Once you have all the file names in the database, you could just have an "Execute SQL" at the beginning to see if that file name is already in the database.
Two years back I have faced the same problem with importing TSV files.
I tried many other solutions but best I could design is C# code script for such validation at its best.
What I did as a solution
Create one C# DataTable object in memory with Primary Key constraints,
like:-
DataColumn[] keyColumn = new DataColumn[30];
keyColumn[intJ] = dtFilterdPK.Columns["Column name"];
Then try to add one by one row from your CSV to this DataTables.
Whenever your data will get Duplication based on Primary Key will have an error
Handle this error code in (TRY)..CATCH block and make this duplication error as per your logging requirement.
Avoid those error records importing in DataTable object.
Atlast import your CSV file into your table as BulkImport
Like:
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(myConnection))
{
bulkCopy.DestinationTableName = "Your DB Table Name"; //Assign table name
bulkCopy.WriteToServer(dtToBeImport); //Write into Actual table.
}
Hope this will help you.

Google BigQuery - Error while downloading data to a table

I am trying to work with the github data which has been uploaded to Google's big data. I ran a few queries (which generated a lot of rows -
eg: a query SELECT actor_attributes_login, repository_watchers , repository_forks FROM [githubarchive:github.timeline]
where repository_watchers > 2 and REGEXP_MATCH(repository_created_at, '2012-')
ORDER BY actor_attributes_login;
The answer had more than 2,20,000 rows. When I attempted to download to CSV , it said
Download Unavailable
This result set contains too many rows for direct download. Please use "Save as Table" and then export the resulting table.
When I tried to do it as Save as Table I got the following error:
Access Denied: Job publicdata:job_c2338ba91e494b21970854e13cdc4b2a: RUN_JOB
Also, I ran queries where I limited the number of rows to 200 or so, even in such cases I got the error as mentioned above. However I was able to download it as CSV.
Any solution to this problem?
#Anerudh You don't have access to modify the publicdata samples dataset. Create a brand new dataset, and try to save your query results to a new table in that dataset.