MultiStorage with Avro? - apache-pig

I have a single file containing multiple avro records. Each record contains a unique "name". How do I load and store files such that each file represents a record that corresponds with a given name?
Here is my Avro schema:
{
  "type": "record",
  "name": "XXItem",
  "namespace": "com.xxx.xxx",
  "fields": [
    {
      "name": "data",
      "type": {"type": "map", "values": ["string", "long", "int"]}
    }
  ]
}

A quick check seems to indicate that Avro is simply using JSON for data storage.
By looking for solutions for handling JSON in general, you should be able to come up with something that works for you.
This could be a starting point: Hadoop for JSON files
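If you specifically need this in Pig, one possible approach is piggybank's MultiStorage, which writes one output directory per distinct value of a chosen field. Below is a minimal sketch, assuming the unique name lives under the 'name' key of the data map; note that MultiStorage emits delimited text rather than Avro, so it only solves the "split by name" part.
REGISTER piggybank.jar;

-- Load the Avro records with piggybank's AvroStorage.
records = LOAD 'input.avro'
          USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- Pull the name out of the map so it can be used as the split field (index 0 below).
keyed = FOREACH records GENERATE (chararray)(data#'name') AS name, data;

-- One output directory per distinct name under /output/by-name.
STORE keyed INTO '/output/by-name'
      USING org.apache.pig.piggybank.storage.MultiStorage('/output/by-name', '0');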

Related

Error loading multiple files to BigQuery: too many positional args

Edited: I'm running the bq command line from my VM instance in Google Compute Engine.
I've been trying to load multiple CSV files to BigQuery using the bq command line, and I keep getting this error:
Too many positional args, still have ['/home/username/csvschema.json']
All my files have the same schema, since I only copied, pasted, and renamed them for testing purposes (testFiles_1.csv, testFiles_2.csv, testFiles_3.csv), so I'm not sure why I keep getting this error.
These are the steps I took:
1. Created my BigQuery table and manually loaded 1 file there so I don't need to add the schema manually, but rather auto-detect it.
2. Then, I type this command:
bq load --skip_leading_rows=1 gcstransfer.testFile /home/username/testfile_*.csv /home/username/csvschema.json
My schema, obtained by running bq show --format=prettyjson dataset.table, contains:
[
{
"mode": "NULLABLE",
"name": "Channel",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "Date",
"type": "INTEGER"
},
{
"mode": "NULLABLE",
"name": "ID",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "Referral",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "Browser",
"type": "STRING"
}
]
I tried omitting the JSON part, but I get this error instead:
BigQuery error in load operation: Error decoding JSON schema from file /home/username/testfile_2.csv: No JSON object could be decoded
To specify a one-column schema, use "name:string".
Looks like you cannot use wildcards when loading from a local data source. For this you can upload the files to a GCS bucket and load them from there. See the Limitations paragraph in the docs: https://cloud.google.com/bigquery/docs/loading-data-local
Wildcards and comma separated lists are not supported when you load
files from a local data source. Files must be loaded individually.
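A sketch of both workarounds, where the staging bucket name is a placeholder:
# Stage the files in GCS, then load with a wildcard:
gsutil cp /home/username/testfile_*.csv gs://my-staging-bucket/
bq load --skip_leading_rows=1 gcstransfer.testFile \
    "gs://my-staging-bucket/testfile_*.csv" /home/username/csvschema.json

# Or load the local files one at a time:
for f in /home/username/testfile_*.csv; do
  bq load --skip_leading_rows=1 gcstransfer.testFile "$f" /home/username/csvschema.json
done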

GCP BigQuery: Can't query Stackdriver access logs exported to Cloud Storage because of invalid JSON field "#type"

I store the access logs of a pixel image in a Cloud Storage bucket dev-access-log-bucket using the standard "sink",
so the files look like this: requests/2019/05/08/15:00:00_15:59:59_S1.json
and one line looks like this (I formatted the JSON, but it's normally on one line):
{
"httpRequest": {
"cacheLookup": true,
"remoteIp": "93.24.25.190",
"requestMethod": "GET",
"requestSize": "224",
"requestUrl": "https://dev-snowplow.legalstart.fr/one_pixel_image.png?user_id=0&action=purchase&product_id=0&money=10",
"responseSize": "779",
"status": 200,
"userAgent": "python-requests/2.21.0"
},
"insertId": "w6wyz1g2jckjn6",
"jsonPayload": {
"#type": "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry",
"statusDetails": "response_sent_by_backend"
},
"logName": "projects/tracking-pixel-239909/logs/requests",
"receiveTimestamp": "2019-05-08T15:34:24.126095758Z",
"resource": {
"labels": {
"backend_service_name": "",
"forwarding_rule_name": "dev-yolaw-pixel-forwarding-rule",
"project_id": "tracking-pixel-239909",
"target_proxy_name": "dev-yolaw-pixel-proxy",
"url_map_name": "dev-urlmap",
"zone": "global"
},
"type": "http_load_balancer"
},
"severity": "INFO",
"spanId": "7d8823509c2dc94f",
"timestamp": "2019-05-08T15:34:23.140747307Z",
"trace": "projects/tracking-pixel-239909/traces/bb55577eedd5797db2867931f8de9162"
}
All of this, once again, is standard GCP output; I did not customize anything here.
Now I want to run some queries on it from BigQuery, so I created a dataset and an external table configured like this:
External Data Configuration
Source URI(s) gs://dev-access-log-bucket/requests/*
Auto-detect schema true (note: I don't know why it says true even though I've manually defined the schema)
Ignore unknown values true
Source format NEWLINE_DELIMITED_JSON
Max bad records 0
and the following manual schema:
timestamp DATETIME REQUIRED
httpRequest RECORD REQUIRED
httpRequest.requestUrl STRING REQUIRED
and when I run a query
SELECT
timestamp
FROM
`path.to.my.table`
LIMIT
1000
I got
Invalid field name "#type". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
How can I work around this without pre-processing the logs to remove the "#type" field?

JSON Schema Array Validation Woes Using oneOf

Hope I might find some help with this validation issue: I have a JSON array that can have multiple object types (video, image). Within that array, the objects have a rel field value. The case I'm working on is that there can only be one object with "rel": "primaryMedia" allowed — either a video or an image.
Here are the object representations for the image and video case, both showing the "rel": "primaryMedia".
{
"media": [
{
"caption": "Caption goes here",
"id": "ncim87659842",
"rel": "primaryMedia",
"type": "image"
},
{
"description": "Shaima Swileh arrived in San Francisco after fighting for 17 months to get a waiver from the U.S. government to be allowed into the country to visit her son.",
"headline": "Yemeni mother arrives in U.S. to be with dying 2-year-old son",
"id": "mmvo1402810947621",
"rel": "primaryMedia",
"type": "video"
}
]
}
Here's a stripped-down version of the schema I created to validate this using oneOf (assuming this would handle my case). It doesn't, however, work as intended.
{
"$id": "http://example.com/schema/rockcms/article.json",
"$schema": "http://json-schema.org/draft-07/schema#",
"definitions": {},
"properties": {
"media": {
"items": {
"oneOf": [
{
"additionalProperties": false,
"properties": {
"caption": {
"type": "string"
},
"id": {
"type": "string"
},
"rel": {
"enum": [
"primaryMedia"
],
"type": "string"
},
"type": {
"enum": [
"image"
],
"type": "string"
}
},
"required": [
"caption",
"id",
"rel",
"type"
],
"type": "object"
},
{
"additionalProperties": false,
"properties": {
"description": {
"type": "string"
},
"headline": {
"type": "string"
},
"id": {
"type": "string"
},
"rel": {
"enum": [
"primaryMedia"
],
"type": "string"
},
"type": {
"enum": [
"video"
],
"type": "string"
}
},
"required": [
"description",
"headline",
"id",
"rel",
"type"
],
"type": "object"
}
]
},
"type": "array"
}
}
}
Using the JSON Schema validator at https://www.jsonschemavalidator.net, the schema validates when data is correctly presented, but it doesn't work when trying to catch errors.
In the case below, headline is added for a video, and it's missing id. This should fail because headline isn't allowed on video, and id is required.
{
"media": [
{
"headline": "Yemeni mother arrives in U.S. to be with dying 2-year-old son",
"rel": "primaryMedia",
"type": "video"
}
]
}
The results I get from the validator, however, aren't entirely expected. Seems it's conflating the two object schemas in its response.
In addition to this, I've separately found that the schema will allow the population of BOTH a video and image object in media, which isn't expected.
Been trying to figure out what I've done wrong, but am stumped. Would very much appreciate some feedback, if anyone has to offer. Thanks in advance!
Validator Response:
Message: JSON is valid against no schemas from 'oneOf'.
Schema path: #/properties/media/items/oneOf
Message: Property 'headline' has not been defined and the schema does not allow additional properties.
Schema path: #/properties/media/items/oneOf/0/additionalProperties
Message: Value "video" is not defined in enum.
Schema path: #/properties/media/items/oneOf/0/properties/type/enum
Message: Required properties are missing from object: description, id.
Schema path: #/properties/media/items/oneOf/1/required
Message: Required properties are missing from object: caption, id.
Schema path: #/properties/media/items/oneOf/0/required
Your question has three parts, but the first two are linked, so I'll address those together, although they don't have a "solution" as such.
How can I validate that only one object in an array has a specific key, and the others do not?
You cannot do this with JSON Schema.
The set of JSON Schema keywords which are applicable to arrays has no means of expressing "one of the values must match a schema"; the keywords apply either to ALL of the items in the array, or to A SPECIFIC item in the array (if items is an array as opposed to an object).
https://datatracker.ietf.org/doc/html/draft-handrews-json-schema-validation-01#section-6.4
The validation output is not what I expect. I expect to only see the failing branch of oneOf which relates to my object type, which
is defined by the type key.
The JSON Schema specification (as of draft-7) does not specify any format for returning errors, however the error structure you get is pretty "complete" in terms of what it's telling you (and is similar to how we are specifying errors should be returned for draft-8).
Consider that the validator knows nothing about your schema or your JSON instance in terms of your business logic.
When validating a JSON instance, a validator may step through all values, and test validation rules against all applicable subschemas. Looking at your errors, both of the schemas in oneOf are applicable to all items in your array, and so all are tested for validation. If one does not satisfy the condition, the others will be tested also. The validator cannot know, when using a oneOf, what your intent was.
You MAY be able to get around this issue by using an if / then combination. If your if schema is simply a const of the type, and your then schema is the full object type, you may get a cleaner error response, but I haven't tested this theory.
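An untested sketch of that idea for the items schema, where imageMedia and videoMedia are hypothetical definitions holding the two object schemas currently inside your oneOf:
{
  "items": {
    "allOf": [
      {
        "if": { "properties": { "type": { "const": "image" } }, "required": ["type"] },
        "then": { "$ref": "#/definitions/imageMedia" }
      },
      {
        "if": { "properties": { "type": { "const": "video" } }, "required": ["type"] },
        "then": { "$ref": "#/definitions/videoMedia" }
      }
    ],
    "$comment": "imageMedia and videoMedia are placeholder names for the image and video object schemas"
  }
}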
I want ALL items in the array to be one of the types. What's going on here?
From the spec...
items:
If "items" is a schema, validation succeeds if all elements in the
array successfully validate against that schema.
oneOf:
An instance validates successfully against this keyword if it
validates successfully against exactly one schema defined by this
keyword's value.
What you have done is say: Each item in the array should be valid according to [schemaA]. SchemaA: The object should be valid according to one of these schemas: [the schemas inside the oneOf].
I can see how this is confusing when you think "items must be one of the following", but items applies the value schema to each of the items in the array independently.
To make it do what you mean, move the oneOf above items, so that each media type gets its own items schema.
Here's a pseudo-schema:
oneOf: [items: properties: type: const: image], [items: properties: type: const: video]
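Expanded into concrete draft-07 JSON for the media property, with the full image/video object schemas factored out into hypothetical definitions:
{
  "properties": {
    "media": {
      "type": "array",
      "oneOf": [
        { "items": { "$ref": "#/definitions/imageMedia" } },
        { "items": { "$ref": "#/definitions/videoMedia" } }
      ],
      "$comment": "all items are images, or all items are videos; imageMedia/videoMedia are placeholders"
    }
  }
}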
Let me know if any of this isn't clear or you have any follow up questions.

Apache Druid segment merge task submission failure

I am using Druid 0.9.1.1 and trying to merge all the segments of a datasource per day into a single segment, but the merge task initiation fails with this error:
{"error":"Instantiation of [simple type, class io.druid.timeline.DataSegment] value failed: null (through reference chain: java.util.ArrayList[0])"}
I got the segment details from a segment metadata query. The Druid documentation is of no help, as it only specifies the raw structure of the overall query, not the required segment detail structure (below is what the Druid documentation suggests):
{
"type": "merge",
"id": <task_id>,
"dataSource": <task_datasource>,
"aggregations": <list of aggregators>,
"segments": <JSON list of DataSegment objects to merge>
}
Example query:
{
"type": "merge",
"id": "envoy_merge_task",
"dataSource": "dcap.envoy.diskmounts.kafka",
"segments": [{"id":"dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z","intervals":["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"],"columns":{},"size":5460959,"numRows":41577,"aggregators":null,"queryGranularity":null},{"id":"dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z_1","intervals":["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"],"columns":{},"size":5448881,"numRows":41577,"aggregators":null,"queryGranularity":null},{"id":"dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z_2","intervals":["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"],"columns":{},"size":5454452,"numRows":41571,"aggregators":null,"queryGranularity":null},{"id":"dcap.sermon.threshold.kafka_2017-05-22T00:00:00.000Z_2017-05-23T00:00:00.000Z_2017-05-22T07:00:02.951Z_3","intervals":["2017-05-22T00:00:00.000Z/2017-05-23T00:00:00.000Z"],"columns":{},"size":5456267,"numRows":41569,"aggregators":null,"queryGranularity":null}] }
I have tried different structures for the "segments" key, with the same error as a result.
Example:
"segments": [{"id":"dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z"},{"id":"dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z_1"},{"id":"dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z_2"},{"id":"dcap.envoy.diskmounts.kafka_2017-05-21T06:00:00.000Z_2017-05-21T07:00:00.000Z_2017-05-21T06:02:43.482Z_3"}]
What is the right structure for segment-merge tasks?
The format I used for segments is:
"segments":[
{
"dataSource": "wikiticker88",
"interval": "2015-09-12T02:00:00.000Z/2015-09-12T03:00:00.000Z",
"version": "2018-01-16T07:23:16.425Z",
"loadSpec": {
"type": "local",
"path": "/home/linux/druid-0.11.0/var/druid/segments/wikiticker88/2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z/2018-01-16T07:23:16.425Z/0/index.zip"
},
"dimensions": "channel,cityName,comment,countryIsoCode,countryName,isAnonymous,isMinor,isNew,isRobot,isUnpatrolled,metroCode,namespace,page,regionIsoCode,regionName,user",
"metrics": "count,added,deleted,delta,user_unique",
"shardSpec": {
"type": "none"
},
"binaryVersion": 9,
"size": 198267,
"identifier": "wikiticker88_2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z_2018-01-16T07:23:16.425Z"
}
]
Use this to get the metadata of your segments:
/druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments?full
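For reference, a sketch of the two HTTP calls involved; the host names are placeholders, and 8081 and 8090 are the default coordinator and overlord ports:
# Fetch full segment metadata for the datasource from the coordinator:
curl "http://COORDINATOR_HOST:8081/druid/coordinator/v1/metadata/datasources/wikiticker88/segments?full"

# Submit the merge task (saved as merge_task.json) to the overlord:
curl -X POST -H "Content-Type: application/json" \
     -d @merge_task.json \
     "http://OVERLORD_HOST:8090/druid/indexer/v1/task"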

AWS Data Pipeline: CSV data from S3 to DynamoDB

I am trying to transfer CSV data from an S3 bucket to DynamoDB using AWS Data Pipeline. Following is my pipeline script; it is not working properly.
CSV file structure:
Name, Designation,Company
A,TL,C1
B,Prog, C2
DynamoDB: N_Table, with Name as the hash key
{
"objects": [
{
"id": "Default",
"scheduleType": "cron",
"name": "Default",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
},
{
"id": "DynamoDBDataNodeId635",
"schedule": {
"ref": "ScheduleId639"
},
"tableName": "N_Table",
"name": "MyDynamoDBData",
"type": "DynamoDBDataNode"
},
{
"emrLogUri": "s3://onlycsv/error",
"id": "EmrClusterId636",
"schedule": {
"ref": "ScheduleId639"
},
"masterInstanceType": "m1.small",
"coreInstanceType": "m1.xlarge",
"enableDebugging": "true",
"installHive": "latest",
"name": "ImportCluster",
"coreInstanceCount": "1",
"logUri": "s3://onlycsv/error1",
"type": "EmrCluster"
},
{
"id": "S3DataNodeId643",
"schedule": {
"ref": "ScheduleId639"
},
"directoryPath": "s3://onlycsv/data.csv",
"name": "MyS3Data",
"dataFormat": {
"ref": "DataFormatId1"
},
"type": "S3DataNode"
},
{
"id": "ScheduleId639",
"startDateTime": "2013-08-03T00:00:00",
"name": "ImportSchedule",
"period": "1 Hours",
"type": "Schedule",
"endDateTime": "2013-08-04T00:00:00"
},
{
"id": "EmrActivityId637",
"input": {
"ref": "S3DataNodeId643"
},
"schedule": {
"ref": "ScheduleId639"
},
"name": "MyImportJob",
"runsOn": {
"ref": "EmrClusterId636"
},
"maximumRetries": "0",
"myDynamoDBWriteThroughputRatio": "0.25",
"attemptTimeout": "24 hours",
"type": "EmrActivity",
"output": {
"ref": "DynamoDBDataNodeId635"
},
"step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/importDynamoDBTableFromS3,-d,DYNAMODB_OUTPUT_TABLE=#{output.tableName},-d,S3_INPUT_BUCKET=#{input.directoryPath},-d,DYNAMODB_WRITE_PERCENT=#{myDynamoDBWriteThroughputRatio},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"
},
{
"id": "DataFormatId1",
"name": "DefaultDataFormat1",
"column": [
"Name",
"Designation",
"Company"
],
"columnSeparator": ",",
"recordSeparator": "\n",
"type": "Custom"
}
]
}
Out of the four steps in the pipeline, two finish while executing, but the pipeline does not complete.
Currently (2015-04) the default import pipeline template does not support importing CSV files.
If your CSV file is not too big (under 1 GB or so), you can create a ShellCommandActivity to convert the CSV to DynamoDB JSON format first and then feed that to the EmrActivity that imports the resulting JSON file into your table.
As a first step you can create a sample DynamoDB table including all the field types you need, populate it with dummy values, and then export the records using a pipeline (Export/Import button in the DynamoDB console). This will give you an idea of the format that is expected by the import pipeline. The type names are not obvious, and the Import activity is very sensitive about the correct case (e.g. you should have bOOL for a boolean field).
Afterwards it should be easy to create an awk script (or any other text converter; at least with awk you can use the default AMI image for your shell activity), which you can feed to your ShellCommandActivity. Don't forget to enable the "staging" flag, so your output is uploaded back to S3 for the Import activity to pick up.
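A rough sketch of such a conversion, using only the three columns from the question; the exact attribute layout expected by the import activity should be copied from a sample export of your own table, as described above:
# Rough sketch: one JSON object per CSV row, with DynamoDB-style string type markers ("s").
# Verify the layout against a sample export before feeding this to the import activity.
awk -F',' 'NR > 1 {
  printf "{\"Name\":{\"s\":\"%s\"},\"Designation\":{\"s\":\"%s\"},\"Company\":{\"s\":\"%s\"}}\n", $1, $2, $3
}' data.csv > dynamodb-items.json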
If you are using the template data pipeline for importing data from S3 to DynamoDB, these data formats won't work. Instead, use the format in the link below to store the input S3 data file: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-pipelinejson-verifydata2.html
This is the format of the output file generated by the template data pipeline that exports data from DynamoDB to S3.
Hope that helps.
I would recommend using the CSV data format provided by Data Pipeline instead of a custom one.
For debugging the errors on the cluster, you can look up the job flow in the EMR console and check the log files for the tasks that failed.
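A sketch of what swapping the Custom data format object for the built-in CSV one might look like; whether the column entries need the data type appended (e.g. "Name STRING") is an assumption to verify against the Data Pipeline documentation:
{
  "id": "DataFormatId1",
  "name": "CsvDataFormat",
  "type": "CSV",
  "column": [
    "Name STRING",
    "Designation STRING",
    "Company STRING"
  ]
}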
See the link below for a solution that works (in the question section), albeit for EMR 3.x. Just change the delimiter to "columnSeparator": ",". Personally, I wouldn't use CSV unless you are certain the data is sanitized correctly.
How to upgrade Data Pipeline definition from EMR 3.x to 4.x/5.x?