When I run a Snowflake stage query I get an AWS error - amazon-s3

I've created an s3 linked stage on snowflake called csv_stage with my aws credentials, and the creation was successful.
Now I'm trying to query the stage like below
select t.$1, t.$2 from @sandbox_ra.public.csv_stage/my_file.csv t
However the error I'm getting is
Failure using stage area. Cause: [The AWS Access Key Id you provided is not valid.]
Any idea why? Do I have to pass something in the query itself?
Thanks for your help!
Ultimately, let's say my S3 location has 3 different CSV files. I would like to load each of them individually into a different Snowflake table. What's the best way to go about doing this?

Regarding the last part of your question: you can load multiple files with one COPY INTO command by listing the file names or by using a regex pattern. But since you have 3 different files for 3 different tables, you also have to use three different COPY INTO commands.
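A minimal sketch of the two variants, assuming hypothetical target tables my_table_1 and my_table_2 and CSV files on the csv_stage from your question:
-- load one specific file into its own table
COPY INTO my_table_1
  FROM @sandbox_ra.public.csv_stage
  FILES = ('my_file.csv')
  FILE_FORMAT = (TYPE = CSV);
-- or load all files matching a pattern into another table
COPY INTO my_table_2
  FROM @sandbox_ra.public.csv_stage
  PATTERN = '.*file_2.*[.]csv'
  FILE_FORMAT = (TYPE = CSV);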
Regarding querying your stage, you can find some more hints in these questions:
Missing List-permissions on AWS - Snowflake - Failure using stage area. Cause: [The AWS Access Key Id you provided is not valid.] and
https://community.snowflake.com/s/question/0D50Z00008EKjkpSAD/failure-using-stage-area-cause-access-denied-status-code-403-error-code-accessdeniedhow-to-resolve-this-error
https://aws.amazon.com/de/premiumsupport/knowledge-center/access-key-does-not-exist/

I found out the AWS credentials I provided were not right. After fixing that, the query worked.
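For reference, an existing stage's credentials can be corrected by recreating it; a minimal sketch, where the bucket URL and the key values are placeholders:
-- recreate the stage with the corrected credentials
CREATE OR REPLACE STAGE sandbox_ra.public.csv_stage
  URL = 's3://my-bucket/my-path/'
  CREDENTIALS = (AWS_KEY_ID = '<corrected key id>' AWS_SECRET_KEY = '<corrected secret key>');
-- verify what the stage points to
DESC STAGE sandbox_ra.public.csv_stage;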

This approach works to import data from S3 into a Snowflake table from a public S3 bucket:
COPY INTO SNOW_SCHEMA.table_name FROM 's3://test-public/new/solution/file.csv'
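If the CSV has a header row or quoted fields, a FILE_FORMAT clause is usually needed as well; a minimal sketch assuming a one-line header and double-quoted fields:
COPY INTO SNOW_SCHEMA.table_name
  FROM 's3://test-public/new/solution/file.csv'
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');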

Related

HIVE_METASTORE_ERROR persists after removing problematic column from schema

I am trying to query my CloudTrail logging bucket through the use of Athena. I already deployed a crawler into the bucket and managed to populate a few tables. When I tried running a simple "preview table" query, I get the following error:
HIVE_METASTORE_ERROR:
com.amazonaws.services.datacatalog.model.InvalidInputException: Error: : expected at the position 121 of 'struct<roleArn:string,roleSessionName:string,durationSeconds:int,keySpec:string,keyId:string,encryptionContext:struct<aws\:cloudtrail\:arn:string,aws\:s3\......
I narrowed down the column name in question and removed it completely from my schema.
After removing it from the schema in AWS Glue and rerunning the preview table query, I still get the same error at the same position. I tried again in a different browser, but I get the same error. How can this be? Am I missing something?
Please provide any advice.
Thanks in advance!

Trouble loading data into Snowflake using Azure Data Factory

I am trying to import a small table of data from Azure SQL into Snowflake using Azure Data Factory.
Normally I do not have any issues using this approach:
https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake?tabs=data-factory#staged-copy-to-snowflake
But now I have an issue with a source table that looks like this:
There are two columns, SLA_Processing_start_time and SLA_Processing_end_time, that have the datatype TIME.
Somehow, while writing the data to the staging area, the data is changed to something like 0:08:00:00.0000000,0:17:00:00.0000000, which causes an error like:
Time '0:08:00:00.0000000' is not recognized File
The mapping looks like this:
I have tried adding a TIME_FORMAT property like 'HH24:MI:SS.FF' but that did not help.
Any ideas as to why 08:00:00 becomes 0:08:00:00.0000000, and how to avoid it?
Finally, I was able to recreate your case in my environment.
I have the same error; a leading zero appears ahead of the time (0:08:00:00.0000000).
I even grabbed the files it creates on BlobStorage and the zeros are already there.
This activity creates CSV text files without any error handling (double quotes, escape characters etc.).
And on the Snowflake side, it creates a temporary Stage and loads these files.
Unfortunately, it does not clean up after itself and leaves empty directories on BlobStorage. Additionally, you can't use ADLS Gen2. :(
This connector in ADF is not very good; I even had problems using it for an AWS environment and had to set up a Snowflake account in Azure.
I've tried a few workarounds, and it seems you have two options:
Simple solution:
Change the data type on both sides to DateTime and then transform this attribute on the Snowflake side. If you cannot change the type on the source side, you can just use the "query" option and write a SELECT using the CAST / CONVERT function.
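A minimal sketch of what that source-side query might look like; dbo.MySourceTable is a placeholder, the column names come from the question:
SELECT
    CAST(SLA_Processing_start_time AS datetime) AS SLA_Processing_start_time,
    CAST(SLA_Processing_end_time AS datetime) AS SLA_Processing_end_time
    -- , plus the remaining columns
FROM dbo.MySourceTable;
On the Snowflake side the attribute can then be cast back, e.g. SLA_Processing_start_time::TIME.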
Recommended solution:
Use the Copy data activity to write your data to BlobStorage / ADLS (this activity did it anyway), preferably in the Parquet file format and with a self-designed structure (Best practices for using Azure Data Lake Storage).
Create a permanent Snowflake Stage for your BlobStorage / ADLS.
Add a Lookup activity and load the data from those files into a table there; you can use a regular query, or write a stored procedure and call it (see the sketch below).
Thanks to this, you will have more control over what is happening and you will build a DataLake solution for your organization.
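A rough sketch of that permanent stage and load, assuming Parquet files; the storage account, container, SAS token, stage and table names are placeholders:
-- permanent stage pointing at the BlobStorage / ADLS location
CREATE OR REPLACE STAGE my_adls_stage
  URL = 'azure://myaccount.blob.core.windows.net/mycontainer/datalake/'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
  FILE_FORMAT = (TYPE = PARQUET);
-- load run from the Lookup activity, casting the TIME attributes explicitly
COPY INTO my_schema.my_table
  FROM (
    SELECT
      $1:SLA_Processing_start_time::TIME,
      $1:SLA_Processing_end_time::TIME
    FROM @my_adls_stage/my_table/
  );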
My own solution is pretty close to the accepted answer, but I still believe that there is a bug in the built-in direct-to-Snowflake copy feature.
Since I could not figure out how to control the intermediate blob file that is created on a direct-to-Snowflake copy, I ended up writing a plain file into the blob storage and reading it again to load into Snowflake.
So instead of having it all in one step, I manually split it up into two actions:
One action that takes the data from Azure SQL and saves it as a plain text file on the blob storage.
And then a second action that reads the file and loads it into Snowflake.
This works, and it is supposed to be basically the same thing the direct copy to Snowflake does, hence the bug assumption.

How to save a view using federated queries across two projects?

I'm looking to save a view which uses federated queries (from a MySQL Cloud SQL connection) between two projects. I'm receiving two different errors (depending on which project I try to save in).
If I try to save in the project containing the dataset I get error:
Not found: Connection my-connection-name
If I try to save in the project that contains the connection I get error:
Not found: Dataset my-project:my_dataset
My example query that crosses projects looks like:
SELECT
bq.uuid,
sql.item_id,
sql.title
FROM
`project_1.my_dataset.psa_v2_202005` AS bq
LEFT OUTER JOIN
EXTERNAL_QUERY( 'project_2.us-east1.my-connection-name',
'''SELECT item_id, title
FROM items''') AS sql
ON
bq.looks_info.query_item.item_id = sql.item_id
The documentation at https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries#known_issues_and_limitations doesn't mention any limitations here.
Is there a way around this so I can save a view using an external connection from one project and dataset from another?
Your BigQuery table is located in US and your MySQL data source is located in us-east1. BigQuery automatically chooses to run the query in the location of your BigQuery table (i.e. in US); however, your Cloud SQL for MySQL instance is in us-east1, and that's why your query fails. Therefore, the BigQuery table and the Cloud SQL instance must be in the same location for this query to succeed.
The solution for this kind of case is to move your BigQuery dataset to the same location as your Cloud SQL instance manually, by following the steps explained in detail in this documentation. However, us-east1 is not currently supported for copying datasets. Thus, I would recommend creating a new connection in one of the locations mentioned in the documentation.
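To double-check the two locations, something along these lines should work (the project, dataset and instance names are placeholders):
bq show --format=prettyjson my-project:my_dataset | grep location
gcloud sql instances describe my-cloudsql-instance --format='value(region)'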
I hope you find the above pieces of information useful.

How to view what is being copied in SQL

I have JSON data in an Amazon Web Services S3 bucket. I am trying to copy it into a database (AWS Redshift).
I am using the following command:
COPY mytable FROM 's3://bucket/somedata'
iam_role 'arn:aws:iam::12345678:role/MyRole';
I am thinking the bucket's data is being copied with some additional metadata, and that the metadata is causing my COPY command to fail.
Can you tell me, is it possible to print the copied data somehow?
Thanks in advance!
If your COPY command fails, you should check the stl_load_errors system table. It has a raw_line column which shows the raw data that caused the failure. There are also other columns which will provide you with more details about the error.
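For example, a query along these lines lists the most recent load errors:
SELECT starttime, filename, line_number, colname, err_reason, raw_line
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;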

Exporting query results as JSON via Google BigQuery API

I've got jobs/queries that return a few hundred thousand rows. I'd like to get the results of the query and write them as JSON in a storage bucket.
Is there any straightforward way of doing this? Right now the only method I can think of is:
set allowLargeResults to true
set a randomly named destination table to hold the query output
create a 2nd job to extract the data in the "temporary" destination table to a file in a storage bucket
delete the random "temporary" table.
This just seems a bit messy and roundabout. I'm going to be wrapping all this in a service hooked up to a UI that would have lots of users hitting it and would rather not be in the business of managing all these temporary tables.
1) As you mention, the steps are good. You need to use Google Cloud Storage for your export job. Exporting data from BigQuery is explained here; also check the variants for different path syntaxes.
Then you can download the files from GCS to your local storage.
The gsutil tool can help you further to download the files from GCS to your local machine.
With this approach you first need to export to GCS, then transfer to your local machine. If you have a message queue system (like Beanstalkd) in place to drive all this, it's easy to do a chain of operations: submit the job, monitor its state, initiate the export to GCS when it's done, then delete the temp table.
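The export of the temporary table to GCS as newline-delimited JSON can be run with the bq tool, roughly like this (the dataset, table and bucket names are placeholders):
bq extract --destination_format=NEWLINE_DELIMITED_JSON 'my_dataset.my_temp_table' gs://my-bucket/exports/results-*.json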
Please also know that you can update the table via the API and set the expirationTime property; with this approach you don't need to delete it.
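For example, with the bq tool the same thing can be done roughly like this (3600 seconds is an arbitrary value; the dataset and table names are placeholders):
bq update --expiration 3600 my_dataset.my_temp_table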
2) If you use the bq CLI tool, you can set the output format to JSON and redirect it to a file. This way you can achieve some export locally, but it has certain other limits.
This exports the first 1000 rows as JSON:
bq --format=prettyjson query --n=1000 "SELECT * from publicdata:samples.shakespeare" > export.json