SQL compilation error while copying data from Snowflake to S3 using s3a:// and s3n://

I am trying to copy results from Snowflake to Amazon S3 using s3n:// and s3a:// URLs, but I am getting a SQL compilation error.
The SQL query is in the following format:
COPY INTO '&s3_path/&curr_dt/pvc'
FROM (
SELECT OBJECT_CONSTRUCT('id',id,'keyword',keyword)
FROM brands_delta)
CREDENTIALS = (AWS_KEY_ID='&aws_key_id' AWS_SECRET_KEY='&aws_secret_key')
FILE_FORMAT = (TYPE=JSON)
SINGLE = false
OVERWRITE = true
MAX_FILE_SIZE = 1073741824;
The error in the log files is as follows:
001011 (42601): SQL compilation error:
invalid URL prefix found in: 's3a://abc/prod-runs/input/2021-01-19/pvc'

The URI scheme determines the code/software the client uses to access the resource given in the URI.
In this case, Snowflake is the client software, and it does not use the s3a/s3n schemes; those are the names of Hadoop's S3 filesystem connectors. For an external location, Snowflake's COPY INTO expects a plain s3:// URL (or gcs:// / azure:// for the other clouds), so change the prefix to s3:// rather than s3a:// or s3n://.
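For reference, here is a minimal sketch of issuing the corrected statement through the Snowflake Python connector (snowflake-connector-python); the account name and credentials are placeholders, and the only real change from the question is the s3:// prefix.

import snowflake.connector

# Placeholders: substitute your own account, user, password and AWS keys.
conn = snowflake.connector.connect(
    user="MY_USER", password="MY_PASSWORD", account="my_account")

copy_sql = """
COPY INTO 's3://abc/prod-runs/input/2021-01-19/pvc'
FROM (
    SELECT OBJECT_CONSTRUCT('id', id, 'keyword', keyword)
    FROM brands_delta)
CREDENTIALS = (AWS_KEY_ID='...' AWS_SECRET_KEY='...')
FILE_FORMAT = (TYPE = JSON)
SINGLE = FALSE
OVERWRITE = TRUE
MAX_FILE_SIZE = 1073741824
"""
conn.cursor().execute(copy_sql)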

Related

Clickhouse protobuf output format

I run clickhouse-server in Docker with just one table and a few rows in it. I can request all the data in the default format with clickhouse-client (over TCP) or with a GUI tool like DBeaver (over HTTP).
SELECT * FROM some_table;
I can also change the format to something special:
SELECT * FROM some_table FORMAT Pretty;
I want to request data from ClickHouse in Protobuf format. The query looks like this:
SELECT * FROM some_table FORMAT Protobuf SETTINGS format_schema = 'proto_file:ProtoStructure';
I have proto_file.proto in the same directory as clickhouse-client, so I can make the request over TCP through it successfully.
But I don't know the structure of the native TCP request well enough to reproduce it in my own program, so I tried to execute the same request over HTTP (through DBeaver) in order to intercept and reproduce it. Unfortunately, I can't execute the script in DBeaver properly, because it complains about proto_file.proto (File not found; I don't know where to place the file to make it work). The only thing I know is that the format is specified by the X-ClickHouse-Format HTTP header, but I can't find any information about where in the HTTP request the contents of the proto file should go.
So, the main question: are there any examples of a pure HTTP request to ClickHouse for the Protobuf output format?
SETTINGS format_schema = 'proto_file:ProtoStructure' with a local .proto file is a feature of the clickhouse-client application; used that way, it only works with clickhouse-client.
clickhouse-client is the rich client. It queries data from clickhouse-server over the TCP/native protocol and forms the Protobuf output itself using the schema file.
clickhouse-server is also able to form Protobuf from .proto files (over the HTTP and GRPC protocols). But in that case the .proto files must be placed on the clickhouse-server node in the /var/lib/clickhouse/format_schemas/ folder.
https://clickhouse.com/docs/en/operations/server-configuration-parameters/settings/#server_configuration_parameters-format_schema_path
For example, I created a .proto file:
cat /var/lib/clickhouse/format_schemas/test.proto
syntax = "proto3";
message TestMessage {
int64 id = 1;
uint32 blockNo = 2;
string val1 = 3;
float val2 = 4;
uint32 val3 = 5;
};
I made it readable by the server with chown clickhouse.clickhouse test.proto.
Now I can do this:
curl -o out.protobuf 'localhost:8123/?format_schema=test:TestMessage&query=select+1+id+from+numbers(10)+format+Protobuf'
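The same HTTP call can be issued from any HTTP client. As an illustration, here is a small sketch with Python's requests library (an assumption; any HTTP library works), pointing at the same local server and schema as the curl example:

import requests

# Mirrors the curl call above: the query and schema are passed as URL parameters.
resp = requests.get(
    "http://localhost:8123/",
    params={
        "query": "select 1 id from numbers(10) format Protobuf",
        "format_schema": "test:TestMessage",
    },
)
resp.raise_for_status()
with open("out.protobuf", "wb") as fh:
    fh.write(resp.content)  # raw Protobuf payload from the server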

Using Jinja template variables with BigQueryOperator in Airflow

I'm attempting to use the BigQueryOperator in Airflow, using a variable to populate the sql attribute. The problem I'm running into is that the file extension is dropped when using Jinja variables. I've set up my code as follows:
dag = DAG(
    dag_id='data_ingest_dag',
    template_searchpath=['/home/airflow/gcs/dags/sql/'],
    default_args=DEFAULT_DAG_ARGS
)

bigquery_transform = BigQueryOperator(
    task_id='bq-transform',
    write_disposition='WRITE_TRUNCATE',
    sql="{{dag_run.conf['sql_script']}}",
    destination_dataset_table='{{dag_run.conf["destination_dataset_table"]}}',
    dag=dag
)
The passed variable contains the name of a SQL file stored in the separate SQL directory. If I pass the value as a static string, sql="example_file.sql", everything works fine. However, when I pass example_file.sql through the Jinja template variable, the file extension is effectively dropped and I receive this error:
BigQuery job failed.
Final error was: {u'reason': u'invalidQuery', u'message': u'Syntax error: Unexpected identifier "example_file" at [1:1]', u'location': u'query'}
Additionally, I've tried hardcoding ".sql" onto the end of the variable, anticipating that the extension would be dropped. However, this causes the entire variable reference to be interpreted as a string.
How do you use variables to populate BigQueryOperator attributes?
Reading the BigQueryOperator docstring, it seems that you can provide the SQL statement in two ways:
1. As a string that can contain templating macros.
2. As a reference to a file; the file's contents can contain templating macros, but the file name itself is not templated.
In other words, you can template the SQL statement, not the file name. Your error message confirms this: if you inspect the BigQuery job history for the project that ran the query, you will see the query string was literally "example_file.sql", which is not a valid SQL statement, hence the error.
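One workaround that is sometimes used (a sketch, not verified on every Airflow version) is to let Jinja itself resolve the dynamic file name with {% include %}, so that the rendered value is the file's contents rather than its name; the import path below assumes Airflow 1.x:

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator  # Airflow 1.x path

dag = DAG(
    dag_id='data_ingest_dag',
    template_searchpath=['/home/airflow/gcs/dags/sql/'],
    default_args=DEFAULT_DAG_ARGS  # as defined in the question
)

bigquery_transform = BigQueryOperator(
    task_id='bq-transform',
    write_disposition='WRITE_TRUNCATE',
    # Jinja looks the name up on template_searchpath, inlines the file,
    # and then renders any macros inside it.
    sql="{% include dag_run.conf['sql_script'] %}",
    destination_dataset_table='{{ dag_run.conf["destination_dataset_table"] }}',
    dag=dag
)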

Processing Event Hub Capture AVRO files with Azure Data Lake Analytics

I'm attempting to extract data from AVRO files produced by Event Hub Capture. In most cases this works flawlessly, but certain files are causing me problems. When I run the following U-SQL job:
USE DATABASE Metrics;
USE SCHEMA dbo;
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
REFERENCE ASSEMBLY [Avro];
REFERENCE ASSEMBLY [log4net];
USING Microsoft.Analytics.Samples.Formats.ApacheAvro;
USING Microsoft.Analytics.Samples.Formats.Json;
USING System.Text;
//DECLARE @input string = "adl://mydatalakestore.azuredatalakestore.net/event-hub-capture/v3/{date:yyyy}/{date:MM}/{date:dd}/{date:HH}/{filename}";
DECLARE @input string = "adl://mydatalakestore.azuredatalakestore.net/event-hub-capture/v3/2018/01/16/19/rcpt-metrics-us-es-eh-metrics-v3-us-0-35-36.avro";

@eventHubArchiveRecords =
    EXTRACT Body byte[],
            date DateTime,
            filename System.String
    FROM @input
    USING new AvroExtractor(@"
    {
        ""type"":""record"",
        ""name"":""EventData"",
        ""namespace"":""Microsoft.ServiceBus.Messaging"",
        ""fields"":[
            {""name"":""SequenceNumber"",""type"":""long""},
            {""name"":""Offset"",""type"":""string""},
            {""name"":""EnqueuedTimeUtc"",""type"":""string""},
            {""name"":""SystemProperties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},
            {""name"":""Properties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},
            {""name"":""Body"",""type"":[""null"",""bytes""]}
        ]
    }
    ");

@json =
    SELECT Encoding.UTF8.GetString(Body) AS json
    FROM @eventHubArchiveRecords;

OUTPUT @json
TO "/outputs/Avro/testjson.csv"
USING Outputters.Csv(outputHeader : true, quoting : true);
I get the following error:
Unhandled exception from user code: "The given key was not present in the dictionary."
An unhandled exception from user code has been reported when invoking the method 'Extract' on the user type 'Microsoft.Analytics.Samples.Formats.ApacheAvro.AvroExtractor'
Am I correct in assuming the problem is within the AVRO file produced by Event Hub Capture, or is there something wrong with my code?
The "key not present" error refers to the fields in your EXTRACT statement: it is not finding the date and filename fields. I removed those fields and your script runs correctly in my ADLA instance.
The current implementation only supports primitive types, not the complex types of the Avro specification.
You have to build and use an extractor based on Apache Avro rather than the sample extractor provided by Microsoft.
We went down the same path.
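If you just want to check what a problematic capture file actually contains outside of ADLA, a quick local sketch with the fastavro package (an assumption; any Avro reader will do, using the file name from the question downloaded locally) looks like this:

from fastavro import reader

# Read an Event Hub Capture file locally and print the JSON bodies.
with open("rcpt-metrics-us-es-eh-metrics-v3-us-0-35-36.avro", "rb") as fh:
    for record in reader(fh):
        body = record.get("Body")        # bytes payload, may be None
        if body:
            print(body.decode("utf-8", errors="replace"))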

How to use insert_job

I want to run a BigQuery SQL query using the insert_job method.
I ran the following code:
JobConfigurationQuery = Google::Apis::BigqueryV2::JobConfigurationQuery
bq = Google::Apis::BigqueryV2::BigqueryService.new
scopes = [Google::Apis::BigqueryV2::AUTH_BIGQUERY]
bq.authorization = Google::Auth.get_application_default(scopes)
bq.authorization.fetch_access_token!
query_config = {query: "select colA from [dataset.table]"}
qr = JobConfigurationQuery.new(configuration:{query: query_config})
bq.insert_job(projectId, qr)
and I got the following error:
Caught error invalid: Job configuration must contain exactly one job-specific configuration object (e.g., query, load, extract, spreadsheetExtract), but there were 0:
Please let me know how to use the insert_job method.
I'm not sure which client library you're using, but insert_job most likely takes a JobConfiguration. You should create one of those and set its query field to the JobConfigurationQuery you've created.
This is necessary because this single API method can insert several kinds of jobs (query, load, copy, extract), each with its own configuration type, so the job resource carries a single configuration object whose subfield specifies which kind of job to insert and its details.
More info from BigQuery's documentation:
jobs.insert documentation
job resource: note the "configuration" field and its "query" subfield
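For illustration, here is a sketch of the nested Job resource the API expects, using the google-api-python-client rather than the Ruby gem (the project comes from application default credentials; the query is the one from the question). The same nesting applies in Ruby: wrap the JobConfigurationQuery in a JobConfiguration inside a Job before calling insert_job.

import google.auth
from googleapiclient.discovery import build

# Sketch only: shows the one job-specific object ("query") nested under
# "configuration", which is exactly what the error message is asking for.
credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/bigquery"])
bq = build("bigquery", "v2", credentials=credentials)

job_body = {
    "configuration": {
        "query": {
            "query": "select colA from [dataset.table]"
        }
    }
}
bq.jobs().insert(projectId=project_id, body=job_body).execute()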

How to give input to MapReduce jobs which use S3 data from Java code

I know that we can normally give the parameters while running the jar file on an EC2 instance.
But how do we give the inputs through code?
I am trying this because I want to call my Java code from a JSP, so in the Java code I want to pick up data directly from S3 and proceed. I tried it like this, but in vain:
DataExtractor.getRelevantData("s3n://syamk/revanthinput/", "999999", "94645", "20120606",
"s3n://revanthufl/gen/testoutput" + "interm");
Here s3n://syamk/revanthinput/ is the input and s3n://revanthufl/gen/testoutput is the output, and I pass the same strings (s3n://syamk/revanthinput/ and s3n://revanthufl/gen/testoutput) as parameters when running the jar. But doing it like this from code throws an exception:
[java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).] with root cause
Based on my usage of Flume, it would appear that you need to format the URL like s3n://AWS_ACCESS_KEY:AWS_SECRET_KEY@syamk/revanthinput/ when calling S3 from within code (alternatively, set the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties named in the exception on the job configuration).
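As a small illustrative sketch (Python, with hypothetical key values): secret keys often contain characters such as '/' or '+', so it is usually safer to percent-encode both parts before embedding them in the URL; the resulting strings would then be passed to the Java code in place of the plain s3n paths.

from urllib.parse import quote

# Hypothetical credentials; percent-encode them so characters like '/' in the
# secret key do not break the s3n URL.
aws_key_id = "AKIAEXAMPLEKEY"
aws_secret_key = "abc/def+ghiEXAMPLE"

input_url = "s3n://{}:{}@syamk/revanthinput/".format(
    quote(aws_key_id, safe=""), quote(aws_secret_key, safe=""))
output_url = "s3n://{}:{}@revanthufl/gen/testoutput".format(
    quote(aws_key_id, safe=""), quote(aws_secret_key, safe=""))
print(input_url, output_url)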