Processing Event Hub Capture AVRO files with Azure Data Lake Analytics - azure-data-lake

I'm attempting to extract data from AVRO files produced by Event Hub Capture. In most cases this works flawlessly, but certain files are causing me problems. When I run the following U-SQL job:
USE DATABASE Metrics;
USE SCHEMA dbo;

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
REFERENCE ASSEMBLY [Avro];
REFERENCE ASSEMBLY [log4net];

USING Microsoft.Analytics.Samples.Formats.ApacheAvro;
USING Microsoft.Analytics.Samples.Formats.Json;
USING System.Text;

//DECLARE @input string = "adl://mydatalakestore.azuredatalakestore.net/event-hub-capture/v3/{date:yyyy}/{date:MM}/{date:dd}/{date:HH}/{filename}";
DECLARE @input string = "adl://mydatalakestore.azuredatalakestore.net/event-hub-capture/v3/2018/01/16/19/rcpt-metrics-us-es-eh-metrics-v3-us-0-35-36.avro";

@eventHubArchiveRecords =
    EXTRACT Body byte[],
            date DateTime,
            filename System.String
    FROM @input
    USING new AvroExtractor(@"
    {
        ""type"":""record"",
        ""name"":""EventData"",
        ""namespace"":""Microsoft.ServiceBus.Messaging"",
        ""fields"":[
            {""name"":""SequenceNumber"",""type"":""long""},
            {""name"":""Offset"",""type"":""string""},
            {""name"":""EnqueuedTimeUtc"",""type"":""string""},
            {""name"":""SystemProperties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},
            {""name"":""Properties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},
            {""name"":""Body"",""type"":[""null"",""bytes""]}
        ]
    }
    ");

@json =
    SELECT Encoding.UTF8.GetString(Body) AS json
    FROM @eventHubArchiveRecords;

OUTPUT @json
TO "/outputs/Avro/testjson.csv"
USING Outputters.Csv(outputHeader : true, quoting : true);
I get the following error:
Unhandled exception from user code: "The given key was not present in the dictionary."
An unhandled exception from user code has been reported when invoking the method 'Extract' on the user type 'Microsoft.Analytics.Samples.Formats.ApacheAvro.AvroExtractor'
Am I correct in assuming the problem is within the AVRO file produced by Event Hub Capture, or is there something wrong with my code?

The "key was not present" error refers to the columns in your EXTRACT statement: it's not finding the date and filename fields. Those are virtual columns that come from the {date}/{filename} file-set pattern in your commented-out input path; with the literal file path, the extractor has to look for them inside the Avro record itself, and they aren't there. I removed those two fields and your script runs correctly in my ADLA instance.
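For illustration, a minimal sketch of the adjusted rowset under the literal file path; the schema string is hoisted into a DECLARE purely for readability (assuming AvroExtractor accepts it via a scalar variable the same way it accepts the literal), and only Body is extracted. If you want date and filename back, switch @input to the commented-out {date}/{filename} file-set pattern and re-add those two columns as virtual columns.

// Same schema string as in the question, hoisted into a variable.
DECLARE @avroSchema string = @"
{
    ""type"":""record"",
    ""name"":""EventData"",
    ""namespace"":""Microsoft.ServiceBus.Messaging"",
    ""fields"":[
        {""name"":""SequenceNumber"",""type"":""long""},
        {""name"":""Offset"",""type"":""string""},
        {""name"":""EnqueuedTimeUtc"",""type"":""string""},
        {""name"":""SystemProperties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},
        {""name"":""Properties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},
        {""name"":""Body"",""type"":[""null"",""bytes""]}
    ]
}";

// With a literal .avro path there are no virtual columns, so extract only Body.
@eventHubArchiveRecords =
    EXTRACT Body byte[]
    FROM @input
    USING new AvroExtractor(@avroSchema);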

The current implementation of the sample extractor only supports the primitive types of the Avro specification, not complex types, while the schema above uses maps (SystemProperties, Properties) and a union (the nullable Body).

You have to build and use your own extractor based on Apache Avro rather than the sample extractor provided by Microsoft.
We went down the same path.
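As a rough sketch only (not the exact extractor we wrote), a custom U-SQL extractor built on the Apache Avro C# library could look something like the following; the class name is a placeholder, the TryGetValue call is an assumption about the Apache.Avro GenericRecord API, and only the Body column is emitted:

using System.Collections.Generic;
using Avro.File;                      // Apache Avro C# library (Apache.Avro NuGet package)
using Avro.Generic;
using Microsoft.Analytics.Interfaces; // U-SQL extensibility interfaces

// An Avro object container file cannot be split arbitrarily, so process each file atomically.
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class CaptureAvroExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        // The writer schema is embedded in the capture file, so no schema string is passed in.
        using (var reader = DataFileReader<GenericRecord>.OpenReader(input.BaseStream))
        {
            while (reader.HasNext())
            {
                GenericRecord record = reader.Next();

                // EventData.Body is a nullable bytes field in the capture schema;
                // body stays null when the union resolved to null.
                object body;
                record.TryGetValue("Body", out body);

                output.Set<byte[]>("Body", body as byte[]);
                yield return output.AsReadOnly();
            }
        }
    }
}

After registering the assembly, the script calls it with USING new CaptureAvroExtractor() just like the sample extractor.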

Related

How can I use Format Arguments in BusinessException?

I have noticed that ABP localization provides a format-arguments mechanism to generate localized strings at runtime, and I would like to know how to do the same thing when throwing a BusinessException, since none of its overloads seem suitable for this purpose.
Please see the documentation: https://docs.abp.io/en/abp/latest/Exception-Handling#exception-localization
It is possible to set an exception code and data related to the exception. Then ABP automatically localizes the exception message by also using the data arguments you've provided.
Example exception:
throw new BusinessException("App:010046")
.WithData("UserName", "john");
And the related localization entry in the json file:
"App:010046": "Username should be unique. '{UserName}' is already taken!"
It does not use positional placeholders like {0}, {1}..., but parameter names instead.
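As a further, purely hypothetical example following the same pattern, several placeholders can be filled by chaining WithData calls:
throw new BusinessException("App:010047")
    .WithData("UserName", "john")
    .WithData("Email", "john@example.com");
with a matching localization entry:
"App:010047": "The username '{UserName}' or e-mail '{Email}' is already taken!"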

Type name 'serializeObject' does not exist in the type 'JsonConvert'

I am trying to convert some data to JSON to send to the CloudFlare API. I am attempting to use Newtonsoft.Json.JsonConvert.SerializeObject() to accomplish this, but I am getting an IntelliSense error saying that the type name 'serializeObject' does not exist in the type 'JsonConvert'. I have the NuGet package Newtonsoft.Json installed, but it does not recognize the SerializeObject() method. I am not sure what I am missing.
It's because you're calling the method incorrectly: with the new operator, the compiler treats JsonConvert.SerializeObject as a type name rather than a static method call, which is exactly what the error message is complaining about. Remove the new operator, changing the line
var json = new JsonConvert.SerializeObject(updateDnsRecord);
to
var json = JsonConvert.SerializeObject(updateDnsRecord);
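A minimal, self-contained sketch of the corrected call (the UpdateDnsRecord type and its properties are hypothetical stand-ins for whatever payload you are sending to the CloudFlare API):

using Newtonsoft.Json;

// Hypothetical payload type standing in for the real DNS record update.
public class UpdateDnsRecord
{
    public string Type { get; set; }
    public string Name { get; set; }
    public string Content { get; set; }
}

public static class Example
{
    public static void Main()
    {
        var updateDnsRecord = new UpdateDnsRecord
        {
            Type = "A",
            Name = "example.com",
            Content = "203.0.113.10"
        };

        // SerializeObject is a static method, so it is called without 'new'.
        var json = JsonConvert.SerializeObject(updateDnsRecord);
        System.Console.WriteLine(json);
    }
}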

How can I get more info than the generic "Failed to parse JSON: No active field found.; ParsedString returned false; Could not parse value" error on a BigQuery load?

We're trying BigQuery for the first time, with data extracted from Mongo in JSON format. I kept getting this generic parse error upon loading the file. But when I tried a smaller subset of the file, 20 records, it loaded fine. This tells me it's not the general structure of the file, which I had originally thought was the problem. Is there any way to get more information on the parse error, such as the text of the record it is trying to parse when it hits this error?
I also tried using the max errors field, but that didn't work either.
This was via the website. I also tried it via the Google Cloud SDK command line 'bq load...' and got the same error.
This error is most likely caused by some of the JSON records not complying with the table schema. It is not clear whether you used the schema auto-detect feature or supplied a schema for the load, but here is one example where such an error could happen:
{ "a" : "1" }
{ "a" : { "b" : "2" } }
If you only have a few of these and they really are invalid records, you can ignore them automatically by using the max_bad_records option for the load job. More details at: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json
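For example (the dataset, table, bucket, and schema paths below are placeholders, and the data is assumed to be newline-delimited JSON), the command-line equivalent would look something like:
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --max_bad_records=10 \
  mydataset.mytable \
  gs://mybucket/mongo_export.json \
  ./schema.json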

Not able to query a Hive table stored as sequencefile format, where the data comes from the HBase export utility

I exported datasets from HBase to Hive via the export utility. The data comes out in sequence file format.
Now I am creating a Hive table with "stored as sequencefile".
When I run a simple SELECT query with a LIMIT on the Hive table, I get the exception below:
java.io.IOException: Could not find a deserializer for the Value class: 'org.apache.hadoop.hbase.client.Result'. Please ensure that the configuration 'io.serializations' is properly configured,
Detailed error trace below:
Bad status for request TFetchResultsReq(fetchType=0,
operationHandle=TOperationHandle(hasResultSet=True,
modifiedRowCount=None, operationType=0,
operationId=THandleIdentifier(secret='\xb2\x9a^\xc5\xc5\xf3BV\xb1\xb8\xe7\xb6\xab\xe7f\xc0',
guid='\xca\x0c\xc9\\xc0%#\xb4\xa1{\xe4\xc7\xaa(;\xcb')),
orientation=4, maxRows=100):
TFetchResultsResp(status=TStatus(errorCode=0,
errorMessage="java.io.IOException: java.io.IOException: Could not find
a deserializer for the Value class:
'org.apache.hadoop.hbase.client.Result'. Please ensure that the
configuration 'io.serializations' is properly configured, if you're
using custom serialization.", sqlState=None,
infoMessages=["*org.apache.hive.service.cli.HiveSQLException:java.io.IOException:
java.io.IOException: Could not find a deserializer for the Value
class: 'org.apache.hadoop.hbase.client.Result'. Please ensure that the
configuration 'io.serializations' is properly configured, if you're
using custom serialization.:14:13",
'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:366',
'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:277',
'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:753',
'org.apache.hive.service.cli.CLIService:fetchResults:CLIService.java:438',
'org.apache.hive.service.cli.thrift.ThriftCLIService:FetchResults:ThriftCLIService.java:686',
'org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1553',
'org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1538',
'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39',
'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39',
'org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor:process:HadoopThriftAuthBridge.java:746',
'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286',
'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1142',
'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:617',
'java.lang.Thread:run:Thread.java:745',
"*java.io.IOException:java.io.IOException: Could not find a
deserializer for the Value class:
'org.apache.hadoop.hbase.client.Result'. Please ensure that the
configuration 'io.serializations' is properly configured, if you're
using custom serialization.:18:4",
'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:508',
'org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:415',
'org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:138',
'org.apache.hadoop.hive.ql.Driver:getResults:Driver.java:1798',
'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:361',
"*java.io.IOException:Could not find a deserializer for the Value
class: 'org.apache.hadoop.hbase.client.Result'. Please ensure that the
configuration 'io.serializations' is properly configured, if you're
using custom serialization.:26:8",
'org.apache.hadoop.io.SequenceFile$Reader:init:SequenceFile.java:2040',
'org.apache.hadoop.io.SequenceFile$Reader:initialize:SequenceFile.java:1878',
'org.apache.hadoop.io.SequenceFile$Reader::SequenceFile.java:1827',
'org.apache.hadoop.io.SequenceFile$Reader::SequenceFile.java:1841',
'org.apache.hadoop.mapred.SequenceFileRecordReader::SequenceFileRecordReader.java:49',
'org.apache.hadoop.mapred.SequenceFileInputFormat:getRecordReader:SequenceFileInputFormat.java:64',
'org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit:getRecordReader:FetchOperator.java:674',
'org.apache.hadoop.hive.ql.exec.FetchOperator:getRecordReader:FetchOperator.java:324',
'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:446'],
statusCode=3), results=None, hasMoreRows=None)

bq CLI says my JSON schema is invalid while the browser GUI says it's fine. Where am I going wrong?

I have a JSON schema:
[{"name":"timestamp","type":"integer"},{"name":"xml_id","type":"string"},{"name":"prod","type":"string"},{"name":"version","type":"string"},{"name":"distcode","type":"string"},{"name":"origcode","type":"string"},{"name":"overcode","type":"string"},{"name":"prevcode","type":"string"},{"name":"ie","type":"string"},{"name":"os","type":"string"},{"name":"payload","type":"string"},{"name":"language","type":"string"},{"name":"userid","type":"string"},{"name":"sysid","type":"string"},{"name":"loc","type":"string"},{"name":"impetus","type":"string"},{"name":"numprompts","type":"record","mode":"repeated","fields":[{"name":"type","type":"string"},{"name":"count","type":"integer"}]},{"name":"rcode","type":"record","mode":"repeated","fields":[{"name":"offer","type":"string"},{"name":"code","type":"integer"}]},{"name":"bw","type":"string"},{"name":"pkg_id","type":"string"},{"name":"cpath","type":"string"},{"name":"rsrc","type":"string"},{"name":"pcode","type":"string"},{"name":"opage","type":"string"},{"name":"action","type":"string"},{"name":"value","type":"string"},{"name":"other","type":"record","mode":"repeated","fields":[{"name":"param","type":"string"},{"name":"value","type":"string"}]}]
(http://jsoneditoronline.org/ for pretty print)
When loading through the browser GUI, the schema is accepted as valid. The CLI throws the following error:
BigQuery error in load operation: Invalid schema entry: "fields":[{"name":"type"
Is there something wrong with my schema as specified?
If you are passing the schema as JSON, you should write it to a file and pass the file name as the schema parameter. Passing the schema inline on the command line is only allowed for simple flat schemas, and yours is not flat: the numprompts, rcode, and other fields are repeated records with nested fields.
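For example (the dataset, table, and source paths are placeholders, and the source is assumed to be newline-delimited JSON), save the schema above to a local file such as schema.json and reference it by name:
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  mydataset.mytable \
  ./data.json \
  ./schema.json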