I'm trying to execute a simple pipeline in Azure Data Lake Analytics, but I'm having some trouble with U-SQL. I was wondering if someone could lend a helping hand.
My Query:
DECLARE @log_file string = "/datalake/valores.tsv";
DECLARE @summary_file string = "/datalake/output.tsv";
@log = EXTRACT valor string from @log_file USING Extractors.Tsv();
@summary = select sum(int.valor) as somavalor from @log;
OUTPUT @summary
TO @summary_file USING Outputters.Tsv();
Error: (error message attached as a screenshot, not reproduced here)
Other general questions:
1. When I deploy a new pipeline to ADF, sometimes it appears in the activity window and sometimes it doesn't. I haven't figured out the logic. (I'm using the OneTime pipeline mode.)
2. Is there a better way to create a new pipeline (other than manipulating raw JSON files)?
3. Is there any U-SQL parser? What is the easiest way to test my query?
Thanks a lot.
U-SQL is case-sensitive so your U-SQL should look more like this:
DECLARE @log_file string = "/datalake/valores.tsv";
DECLARE @summary_file string = "/datalake/output.tsv";
@log =
    EXTRACT valor int
    FROM @log_file
    USING Extractors.Tsv();
@summary =
    SELECT SUM(valor) AS somavalor
    FROM @log;
OUTPUT @summary
TO @summary_file USING Outputters.Tsv();
I have assumed your input file has only a single column of type int.
Use Visual Studio U-SQL projects or the VS Code U-SQL add-in to ensure you write valid U-SQL. You can also submit U-SQL jobs via the portal.
Related
I am trying to retrieve the result of a custom SQL query from an Oracle database source, using the existing connector in the Data Catalog, in an AWS Glue job script. I found this in the AWS docs:
DataSource = glueContext.create_dynamic_frame.from_options(
    connection_type="custom.jdbc",
    connection_options={
        "query": "SELECT id, name, department FROM department WHERE id < 200",
        "connectionName": "test-connection-jdbc"
    },
    transformation_ctx="DataSource0")
But it's not working and I don't know why.
PS: the connector is properly configured and tested.
The error raised is: getDynamicFrame. empty.reduceLeft
What I expect is that the query is executed and its result loaded to the target.
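For context, here is roughly how that call would sit inside a complete Glue job script (a sketch only: the GlueContext/Job boilerplate and the final print are my assumptions; the from_options arguments are the ones from the docs excerpt above):
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate (assumed, not from the question).
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# The call from the docs excerpt: push a custom query through the custom JDBC connector.
DataSource = glueContext.create_dynamic_frame.from_options(
    connection_type="custom.jdbc",
    connection_options={
        "query": "SELECT id, name, department FROM department WHERE id < 200",
        "connectionName": "test-connection-jdbc",
    },
    transformation_ctx="DataSource0",
)

print(DataSource.count())  # quick sanity check (assumed)
job.commit()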
I'm trying out Hudi, Delta Lake, and Iceberg on the AWS Glue v3 engine (Spark 3.1) and have both Delta Lake and Iceberg running just fine end to end using a test pipeline I built with test data. Note I am not using any of the Glue Custom Connectors. I'm using PySpark and standard Spark code (not the Glue classes that wrap the standard Spark classes).
For Hudi, the install of the Hudi jar is working fine: I'm able to write the table in the Hudi format, create the table DDL in the Glue Catalog just fine, and read it via Athena. However, when I try to run a CRUD statement on the newly created table, I get errors. For example, trying to run a simple DELETE SparkSQL statement, I get the error: 'DELETE is only supported with v2 tables.'
I've added the following jars when building the SparkSession:
org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0
com.amazonaws:aws-java-sdk:1.10.34
org.apache.hadoop:hadoop-aws:2.7.3
And I set the following config for the SparkSession:
self.config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
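Putting those pieces together, the session setup presumably looks something like the sketch below (the builder wrapper and app name are my assumptions; the package coordinates and the Kryo serializer setting are the ones listed above):
from pyspark.sql import SparkSession

# Sketch of the SparkSession described above. The appName and the use of
# spark.jars.packages are assumptions; the package list and the serializer
# setting are taken from the question.
spark = (
    SparkSession.builder
    .appName('hudi-glue-test')  # hypothetical name
    .config('spark.jars.packages', ','.join([
        'org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0',
        'com.amazonaws:aws-java-sdk:1.10.34',
        'org.apache.hadoop:hadoop-aws:2.7.3',
    ]))
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
    .getOrCreate()
)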
I've tried many different versions of writing the data/creating the table including:
hudi_options = {
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.table.version': 2,
    'hoodie.table.name': 'db.table_name',
    'hoodie.datasource.write.recordkey.field': 'id',  # key is required in table
    'hoodie.datasource.write.partitionpath.field': '',
    'hoodie.datasource.write.table.name': 'db.table_name',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'date_modified',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}
df.write \
.format('hudi') \
.options(**hudi_options) \
.mode('overwrite') \
.save('s3://...')
sql = f"""CREATE TABLE {FULL_TABLE_NAME}
USING {DATA_FORMAT}
options (
type = 'cow',
primaryKey = 'id',
preCombineField = 'date_modified',
partitionPathField = '',
hoodie.table.name = 'db.table_name',
hoodie.datasource.write.recordkey.field = 'id',
hoodie.datasource.write.precombine.field = 'date_modified',
hoodie.datasource.write.partitionpath.field = '',
hoodie.table.version = 2
)
LOCATION '{WRITE_LOC}'
AS SELECT * FROM {SOURCE_VIEW};"""
spark.sql(sql)
The above works fine. It's when I try to run a CRUD operation on the table created above that I get errors. For instance, I try deleting records via the SparkSQL DELETE statement and get the error 'DELETE is only supported with v2 tables.' I can't figure out why it's complaining about not being a v2 table. Any clues would be hugely appreciated.
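For reference, the failing statement is along these lines (the WHERE predicate is made up; FULL_TABLE_NAME is the placeholder used in the CREATE TABLE above):
# Hypothetical example of the failing SparkSQL DELETE described above.
spark.sql(f"DELETE FROM {FULL_TABLE_NAME} WHERE id = 1")
# -> Error: DELETE is only supported with v2 tables.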
I've seen some queries for loading images from a file, but I get this error message:
Cannot bulk load because the file could not be opened
I went to the Properties > Security option of the file to give access to SQL, but I couldn't find the option to grant the permission. Considering this is Azure from Microsoft, how do I give access to my files so I can execute the query? I'm using OPENROWSET and this is my code.
INSERT INTO FOTOS_EMPLEADOS
VALUES (1, 'HOLA', (SELECT * FROM OPENROWSET(BULK 'C:\Users.jpg', SINGLE_BLOB) AS T1))
If there is a mistake in the code, or another way to do it, please let me know.
TIA
Azure SQL Database doesn't support loading files from an on-premises computer.
Please see OPENROWSET (Transact-SQL).
If you want to do this, you need to upload the images to Blob Storage first.
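One way to get an image into Blob Storage is the azure-storage-blob Python SDK; a minimal sketch, where the connection string and file name are placeholders and the container matches the one used in the T-SQL example below:
from azure.storage.blob import BlobServiceClient

# Upload a local image into the 'curriculum' container used by the example below.
# The connection string and file name are placeholders, not values from the question.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="curriculum", blob="photo1.jpg")
with open("photo1.jpg", "rb") as data:
    blob.upload_blob(data, overwrite=True)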
Please see Importing into a table from a file stored on Azure Blob storage:
--> Optional - a MASTER KEY is not required if a DATABASE SCOPED CREDENTIAL is not required because the blob is configured for public (anonymous) access!
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'YourStrongPassword1';
GO
--> Optional - a DATABASE SCOPED CREDENTIAL is not required because the blob is configured for public (anonymous) access!
CREATE DATABASE SCOPED CREDENTIAL MyAzureBlobStorageCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = '******srt=sco&sp=rwac&se=2017-02-01T00:55:34Z&st=2016-12-29T16:55:34Z***************';
-- NOTE: Make sure that you don't have a leading ? in SAS token, and
-- that you have at least read permission on the object that should be loaded srt=o&sp=r, and
-- that expiration period is valid (all dates are in UTC time)
CREATE EXTERNAL DATA SOURCE MyAzureBlobStorage
WITH ( TYPE = BLOB_STORAGE,
LOCATION = 'https://****************.blob.core.windows.net/curriculum'
, CREDENTIAL= MyAzureBlobStorageCredential --> CREDENTIAL is not required if a blob is configured for public (anonymous) access!
);
INSERT INTO achievements WITH (TABLOCK) (id, description)
SELECT * FROM OPENROWSET(
BULK 'csv/achievements.csv',
DATA_SOURCE = 'MyAzureBlobStorage',
FORMAT ='CSV',
FORMATFILE='csv/achievements-c.xml',
FORMATFILE_DATA_SOURCE = 'MyAzureBlobStorage'
) AS DataFile;
Hope this helps.
I'm attempting to extract data from AVRO files produced by Event Hub Capture. In most cases this works flawlessly, but certain files are causing me problems. When I run the following U-SQL job:
USE DATABASE Metrics;
USE SCHEMA dbo;
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
REFERENCE ASSEMBLY [Avro];
REFERENCE ASSEMBLY [log4net];
USING Microsoft.Analytics.Samples.Formats.ApacheAvro;
USING Microsoft.Analytics.Samples.Formats.Json;
USING System.Text;
//DECLARE @input string = "adl://mydatalakestore.azuredatalakestore.net/event-hub-capture/v3/{date:yyyy}/{date:MM}/{date:dd}/{date:HH}/{filename}";
DECLARE @input string = "adl://mydatalakestore.azuredatalakestore.net/event-hub-capture/v3/2018/01/16/19/rcpt-metrics-us-es-eh-metrics-v3-us-0-35-36.avro";
@eventHubArchiveRecords =
    EXTRACT Body byte[],
            date DateTime,
            filename System.String
    FROM @input
    USING new AvroExtractor(@"
{
""type"":""record"",
""name"":""EventData"",
""namespace"":""Microsoft.ServiceBus.Messaging"",
""fields"":[
{""name"":""SequenceNumber"",""type"":""long""},
{""name"":""Offset"",""type"":""string""},
{""name"":""EnqueuedTimeUtc"",""type"":""string""},
{""name"":""SystemProperties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},
{""name"":""Properties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},
{""name"":""Body"",""type"":[""null"",""bytes""]}
]
}
");
@json =
    SELECT Encoding.UTF8.GetString(Body) AS json
    FROM @eventHubArchiveRecords;
OUTPUT @json
TO "/outputs/Avro/testjson.csv"
USING Outputters.Csv(outputHeader : true, quoting : true);
I get the following error:
Unhandled exception from user code: "The given key was not present in the dictionary."
An unhandled exception from user code has been reported when invoking the method 'Extract' on the user type 'Microsoft.Analytics.Samples.Formats.ApacheAvro.AvroExtractor'
Am I correct in assuming the problem is within the AVRO file produced by Event Hub Capture, or is there something wrong with my code?
The Key Not Present error is referring to the fields in your EXTRACT statement: it's not finding the date and filename fields. I removed those fields and your script runs correctly in my ADLA instance.
The current implementation only supports primitive types; the complex types of the Avro specification are not supported at the moment.
You have to build and use an extractor based on Apache Avro rather than the sample extractor provided by Microsoft.
We went down the same path.
According to the Dataproc docs, it has "native and automatic integrations with BigQuery".
I have a table in BigQuery. I want to read that table and perform some analysis on it using the Dataproc cluster that I've created (using a PySpark job). Then write the results of this analysis back to BigQuery. You may be asking "why not just do the analysis in BigQuery directly!?" - the reason is because we are creating complex statistical models, and SQL is too high level for developing them. We need something like Python or R, ergo Dataproc.
Are there any Dataproc + BigQuery examples available? I can't find any.
To begin, as noted in this question the BigQuery connector is preinstalled on Cloud Dataproc clusters.
Here is an example of how to read data from BigQuery into Spark. In this example, we will read data from BigQuery to perform a word count.
You read data from BigQuery in Spark using SparkContext.newAPIHadoopRDD. The Spark documentation has more information about using SparkContext.newAPIHadoopRDD.
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
val projectId = "<your-project-id>"
val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema =
"[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"
val conf = sc.hadoopConfiguration
// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)
// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf,
fullyQualifiedOutputTableId, outputTableSchema)
val fieldName = "word"
val tableData = sc.newAPIHadoopRDD(conf,
classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
tableData.cache()
tableData.count()
tableData.map(entry => (entry._1.toString(),entry._2.toString())).take(10)
You will need to customize this example with your settings, including your Cloud Platform project ID in <your-project-id> and your output table ID in <your-fully-qualified-table-id>.
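Since the question mentions PySpark, the equivalent read from Python also goes through sc.newAPIHadoopRDD; a rough sketch follows (the mapred.bq.* property names and the JsonTextBigQueryInputFormat class are based on the connector's documented Hadoop configuration, so treat the exact keys as assumptions and verify them against the connector docs for your version):
from pyspark import SparkContext

sc = SparkContext()

# Connector configuration (property names assumed from the connector docs;
# placeholders mirror the Scala example above).
conf = {
    "mapred.bq.project.id": "<your-project-id>",
    "mapred.bq.gcs.bucket": "<your-temp-gcs-bucket>",
    "mapred.bq.input.project.id": "publicdata",
    "mapred.bq.input.dataset.id": "samples",
    "mapred.bq.input.table.id": "shakespeare",
}

# Each record comes back as (key, JSON text of the BigQuery row).
table_data = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf)

print(table_data.take(5))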
Finally, if you end up using the BigQuery connector with MapReduce, this page has examples for how to write MapReduce jobs with the BigQuery connector.
The above example doesn't show how to write data to an output table. You need to do this:
.saveAsNewAPIHadoopFile(
    hadoopConf.get(BigQueryConfiguration.TEMP_GCS_PATH_KEY),
    classOf[String],
    classOf[JsonObject],
    classOf[BigQueryOutputFormat[String, JsonObject]],
    hadoopConf)
where the String key is actually ignored.