I'm a newbye with Apache Drill.
The scenario is this:
I've an S3 bucket, where I place my csv file called test.csv.
I've install Apache Drill with instructions from official website.
I followed this tutorial: https://drill.apache.org/blog/2014/12/09/running-sql-queries-on-amazon-s3/ for create an S3 plugin.
I start Drill, use the correct "workspace" (with: use my-s3;), but when I try to select records from test.cav file an error occured:
Table 's3./test.csv' not found.
Can anyone help me?
Thanks!
Use the name of your workspace (if you use one) and back ticks in the USE command as follows:
USE `my-s3`.`<workspace-name>`;
SHOW files; //should list test.csv file
SELECT * FROM `test.csv`;
Query the CSV in the local file system using the dfs storage plugin configuration to rule out things like a header causing a problem. This page might help if you haven't seen it.
Storage plugin mentioned in comment above:
{
"type": "file",
"enabled": true,
"connection": "s3n://<accesskey>:<secret>#catpaws",
"workspaces": {},
"formats": {
"psv": {
"type": "text",
"extensions": [
"tbl"
],
"delimiter": "|"
},
"csv": {
"type": "text",
"extensions": [
"csv"
],
"delimiter": ","
},
"tsv": {
"type": "text",
"extensions": [
"tsv"
],
"delimiter": "\t"
},
"parquet": {
"type": "parquet"
},
"json": {
"type": "json"
}
}
}
Probably, this is not relevant. It's an excerpt from the Amazon S3 help, which contains lots more info:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>ID</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>SECRET</value>
</property>
Related
I'm using the command standard-version each time I want to publish new version, but the yielded changes in the CHANGELOG.md look like this:
### [10.1.9](https://github.com/my-project-name/compare/v10.1.8...v10.1.9) (2021-03-29)
### [10.1.8](https://github.com/my-project-name/compare/v10.1.7...v10.1.8) (2021-03-29)
### [10.1.7](https://github.com/my-project-name/compare/v10.1.6...v10.1.7) (2021-03-29)
first the links do not work - the github url is not correct and i want to configure it to the right url, and second, I'd like to configure the link that's shown in the changeslog file (there are some types)
I tried to use this documentation but didn't find anything that can help me
https://github.com/conventional-changelog/conventional-changelog
so how do I configure the way standard-version works on the CHANGELOG.md ? can someone provide example?
yes.
according to doc:
You can configure standard-version either by:
Placing a standard-version stanza in your package.json (assuming your project is JavaScript).
Creating a .versionrc, .versionrc.json or .versionrc.js.
If you are using a .versionrc.js your default export must be a configuration object, or a function returning a configuration object.
Any of the command line parameters accepted by standard-version can instead be provided via configuration.
Please refer to the conventional-changelog-config-spec for details on available configuration options.
example:
.versionrc
{
"types": [
{
"type": "feat",
"section": "Features"
},
{
"type": "fix",
"section": "Bug Fixes"
},
{
"type": "chore",
"hidden": true
},
{
"type": "docs",
"hidden": true
},
{
"type": "style",
"hidden": true
},
{
"type": "refactor",
"section": "Refactor"
},
{
"type": "perf",
"section": "Performance"
},
{
"type": "test",
"hidden": true
}
]
}
I'm trying to make storage plugin for Hadoop (hdfs) and Apache Drill.
Actually I'm confused and I don't know what to set as port for hdfs:// connection, and what to set for location.
This is my plugin:
{
"type": "file",
"enabled": true,
"connection": "hdfs://localhost:54310",
"workspaces": {
"root": {
"location": "/",
"writable": false,
"defaultInputFormat": null
},
"tmp": {
"location": "/tmp",
"writable": true,
"defaultInputFormat": null
}
},
"formats": {
"psv": {
"type": "text",
"extensions": [
"tbl"
],
"delimiter": "|"
},
"csv": {
"type": "text",
"extensions": [
"csv"
],
"delimiter": ","
},
"tsv": {
"type": "text",
"extensions": [
"tsv"
],
"delimiter": "\t"
},
"parquet": {
"type": "parquet"
},
"json": {
"type": "json"
},
"avro": {
"type": "avro"
}
}
}
So, is ti correct to set localhost:54310 because I got that with command:
hdfs -getconf -nnRpcAddresses
or it is :8020 ?
Second question, what do I need to set for location? My hadoop folder is in:
/usr/local/hadoop
, and there you can find /etc /bin /lib /log ... So, do I need to set location on my datanode, or?
Third question. When I'm connecting to Drill, I'm going through sqlline and than connecting on my zookeeper like:
!connect jdbc:drill:zk=localhost:2181
My question here is, after I make storage plugin and when I connect to Drill with zk, can I query hdfs file?
I'm very sorry if this is a noob question but I haven't find anything useful on internet or at least it haven't helped me.
If you are able to explain me some stuff, I'll be very grateful.
As per Drill docs,
{
"type" : "file",
"enabled" : true,
"connection" : "hdfs://10.10.30.156:8020/",
"workspaces" : {
"root" : {
"location" : "/user/root/drill",
"writable" : true,
"defaultInputFormat" : null
}
},
"formats" : {
"json" : {
"type" : "json"
}
}
}
In "connection",
put namenode server address.
If you are not sure about this address.
Check fs.default.name or fs.defaultFS properties in core-site.xml.
Coming to "workspaces",
you can save workspaces in this. In the above example, there is a workspace with name root and location /user/root/drill.
This is your HDFS location.
If you have files under /user/root/drill hdfs directory, you can query them using this workspace name.
Example: abc is under this directory.
select * from dfs.root.`abc.csv`
After successfully creating the plugin, you can start drill and start querying .
You can query any directory irrespective to workspaces.
Say you want to query employee.json in /tmp/data hdfs directory.
Query is :
select * from dfs.`/tmp/data/employee.json`
I have similar problem, Drill cannot read dfs server. Finally, the problem is cause by namenode port.
The default address of namenode web UI is http://localhost:50070/.
The default address of namenode server is hdfs://localhost:8020/.
I'm creating an Azure Resource Manager template that instantiates multiple resources, including an Azure storage account and an Azure App Service with a Web App.
I'd like to be able to capture the primary access key (or the full connection string, either way is fine) from the newly-created storage account, and use that as a value for one of the AppSettings for the Web App.
Is that possible?
Use the listkeys helper function.
"appSettings": [
{
"name": "STORAGE_KEY",
"value": "[listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName')), providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]).keys[0].value]"
}
]
This quickstart does something similar:
https://azure.microsoft.com/en-us/documentation/articles/cache-web-app-arm-with-redis-cache-provision/
The syntax has changed since the other answer was accepted. The error you will now hit is 'Template language expression property 'key1' doesn't exist, available properties are 'keys'
Keys are now represented as an array of keys, and the syntax is now:
"StorageAccount": "[Concat('DefaultEndpointsProtocol=https;AccountName=',variables('StorageAccountName'),';AccountKey=',listKeys(resourceId('Microsoft.Storage/storageAccounts', variables('StorageAccountName')), providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]).keys[0].value)]",
See: http://samcogan.com/retrieve-azure-storage-key-in-arm-script/
I faced with this issue two times. First in the 2015 and last today in May of 2017.
I need to add connection strings to the WebApp - I want to add strings automatically from generated resources during deployment from the ARM template. It can help later to not add manually this values.
First time I used old version of the function listKeys (it looks like old version returns result not as object but as value):
"AzureWebJobsStorage": {
"type": "Custom",
"value": "[concat(variables('storageConnectionString'), listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName')), '2015-05-01-preview').key1)]"
},
Today last version of the working template is:
"resources": [
{
"apiVersion": "2015-08-01",
"type": "config",
"name": "connectionstrings",
"dependsOn": [
"[resourceId('Microsoft.Web/Sites/', parameters('webSiteName'))]"
],
"properties": {
"DefaultConnection": {
"value": "[concat('Data Source=tcp:', reference(resourceId('Microsoft.Sql/servers/', parameters('sqlserverName'))).fullyQualifiedDomainName, ',1433;Initial Catalog=', parameters('databaseName'), ';User Id=', parameters('administratorLogin'), '#', parameters('sqlserverName'), ';Password=', parameters('administratorLoginPassword'), ';')]",
"type": "SQLServer"
},
"AzureWebJobsStorage": {
"type": "Custom",
"value": "[concat(variables('storageConnectionString'), listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageName')), '2016-01-01').keys[0].value)]"
},
"AzureWebJobsDashboard": {
"type": "Custom",
"value": "[concat(variables('storageConnectionString'), listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageName')), '2016-01-01').keys[0].value)]"
}
}
},
Thanks.
below is example for adding storage account to ADLA
"storageAccounts": [
{
"name": "[parameters('DataLakeAnalyticsStorageAccountname')]",
"properties": {
"accessKey": "[listKeys(variables('storageAccountid'),'2015-05-01-preview').key1]"
}
}
],
in variable you can keep
"variables": {
"apiVersion": "[providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]]",
"storageAccountid": "[concat(resourceGroup().id,'/providers/','Microsoft.Storage/storageAccounts/', parameters('DataLakeAnalyticsStorageAccountname'))]"
},
I have a single file containing multiple avro records. Each record contains a unique "name". How do I load and store files such that each file represents a record that corresponds with a given name?
Here is my avro schema:
{
"type": "records",
"name": "XXItem",
"namespace": "com.xxx.xxx",
"fields": [
{
"name": "data",
"type": {"type": "map", "values" : ["string", "long", "int"]}
}
]
}
A quick check seems to indicate that avro, is simply using JSON for data storage.
By looking for solutions for handling JSON in general, you should be able to come up with something that works for you.
This could be a starting point: Hadoop for JSON files
I am trying to transfer CSV data from S3 bucket to DynamoDB using AWS pipeline, following is my pipe line script, it is not working properly,
CSV file structure
Name, Designation,Company
A,TL,C1
B,Prog, C2
DynamoDb : N_Table, with Name as hash value
{
"objects": [
{
"id": "Default",
"scheduleType": "cron",
"name": "Default",
"role": "DataPipelineDefaultRole",
"resourceRole": "DataPipelineDefaultResourceRole"
},
{
"id": "DynamoDBDataNodeId635",
"schedule": {
"ref": "ScheduleId639"
},
"tableName": "N_Table",
"name": "MyDynamoDBData",
"type": "DynamoDBDataNode"
},
{
"emrLogUri": "s3://onlycsv/error",
"id": "EmrClusterId636",
"schedule": {
"ref": "ScheduleId639"
},
"masterInstanceType": "m1.small",
"coreInstanceType": "m1.xlarge",
"enableDebugging": "true",
"installHive": "latest",
"name": "ImportCluster",
"coreInstanceCount": "1",
"logUri": "s3://onlycsv/error1",
"type": "EmrCluster"
},
{
"id": "S3DataNodeId643",
"schedule": {
"ref": "ScheduleId639"
},
"directoryPath": "s3://onlycsv/data.csv",
"name": "MyS3Data",
"dataFormat": {
"ref": "DataFormatId1"
},
"type": "S3DataNode"
},
{
"id": "ScheduleId639",
"startDateTime": "2013-08-03T00:00:00",
"name": "ImportSchedule",
"period": "1 Hours",
"type": "Schedule",
"endDateTime": "2013-08-04T00:00:00"
},
{
"id": "EmrActivityId637",
"input": {
"ref": "S3DataNodeId643"
},
"schedule": {
"ref": "ScheduleId639"
},
"name": "MyImportJob",
"runsOn": {
"ref": "EmrClusterId636"
},
"maximumRetries": "0",
"myDynamoDBWriteThroughputRatio": "0.25",
"attemptTimeout": "24 hours",
"type": "EmrActivity",
"output": {
"ref": "DynamoDBDataNodeId635"
},
"step": "s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://elasticmapreduce/libs/hive/dynamodb/importDynamoDBTableFromS3,-d,DYNAMODB_OUTPUT_TABLE=#{output.tableName},-d,S3_INPUT_BUCKET=#{input.directoryPath},-d,DYNAMODB_WRITE_PERCENT=#{myDynamoDBWriteThroughputRatio},-d,DYNAMODB_ENDPOINT=dynamodb.us-east-1.amazonaws.com"
},
{
"id": "DataFormatId1",
"name": "DefaultDataFormat1",
"column": [
"Name",
"Designation",
"Company"
],
"columnSeparator": ",",
"recordSeparator": "\n",
"type": "Custom"
}
]
}
Out of four steps while executing the pipeline, two are getting finished, but it is not executing completely
Currently (2015-04) default import pipeline template does not support importing CSV files.
If your CSV file is not too big (under 1GB or so) you can create a ShellCommandActivity to convert CSV to DynamoDB JSON format first and the feed that to EmrActivity that imports the resulting JSON file into your table.
As a first step you can create sample DynamoDB table including all the field types you need, populate with dummy values and then export the records using pipeline (Export/Import button in DynamoDB console). This will give you the idea about the format that is expected by Import pipeline. The type names are not obvious, and the Import activity is very sensitive about the correct case (e.g. you should have bOOL for boolean field).
Afterwards it should be easy to create an awk script (or any other text converter, at least with awk you can use the default AMI image for your shell activity), which you can feed to your shellCommandActivity. Don't forget to enable "staging" flag, so your output is uploaded back to S3 for the Import activity to pick it up.
If you are using the template data pipeline for Importing data from S3 to DynamoDB, these dataformats won't work. Instead, use the format in the link below to store the input S3 data file http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-pipelinejson-verifydata2.html
This format of the output file generated by the template data pipeline that exports data from DynamoDB to S3.
Hope that helps.
I would recommend using the CSV data format provided by datapipeline instead of custom.
For debugging the errors on cluster, you can lookup the jobflow in EMR console and look at the log files for the tasks that failed.
See below link for a solution that works (in the question section), albeit EMR 3.x. Just change the delimiter to "columnSeparator": ",". Personally, I wouldn't do CSV unless you are certain the data is sanitized correctly.
How to upgrade Data Pipeline definition from EMR 3.x to 4.x/5.x?