Making a storage plugin for HDFS in Apache Drill

I'm trying to make a storage plugin for Hadoop (HDFS) in Apache Drill.
Honestly I'm confused: I don't know what to set as the port for the hdfs:// connection, or what to set as the location.
This is my plugin:
{
"type": "file",
"enabled": true,
"connection": "hdfs://localhost:54310",
"workspaces": {
"root": {
"location": "/",
"writable": false,
"defaultInputFormat": null
},
"tmp": {
"location": "/tmp",
"writable": true,
"defaultInputFormat": null
}
},
"formats": {
"psv": {
"type": "text",
"extensions": [
"tbl"
],
"delimiter": "|"
},
"csv": {
"type": "text",
"extensions": [
"csv"
],
"delimiter": ","
},
"tsv": {
"type": "text",
"extensions": [
"tsv"
],
"delimiter": "\t"
},
"parquet": {
"type": "parquet"
},
"json": {
"type": "json"
},
"avro": {
"type": "avro"
}
}
}
So, is it correct to set localhost:54310? I got that with the command:
hdfs getconf -nnRpcAddresses
Or should it be :8020?
Second question: what do I need to set as the location? My Hadoop folder is in:
/usr/local/hadoop
and there you can find /etc /bin /lib /log ... So, do I need to set the location to my datanode, or?
Third question. When I'm connecting to Drill, I go through sqlline and then connect to my ZooKeeper like:
!connect jdbc:drill:zk=localhost:2181
My question here is: after I make the storage plugin and connect to Drill with zk, can I query an HDFS file?
I'm very sorry if this is a noob question, but I haven't found anything useful on the internet, or at least it hasn't helped me.
If you are able to explain some of this stuff to me, I'll be very grateful.

As per Drill docs,
{
"type" : "file",
"enabled" : true,
"connection" : "hdfs://10.10.30.156:8020/",
"workspaces" : {
"root" : {
"location" : "/user/root/drill",
"writable" : true,
"defaultInputFormat" : null
}
},
"formats" : {
"json" : {
"type" : "json"
}
}
}
In "connection",
put namenode server address.
If you are not sure about this address.
Check fs.default.name or fs.defaultFS properties in core-site.xml.
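For reference, the entry in core-site.xml typically looks like the following; the host and port here are only placeholders, use whatever your cluster actually reports:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:8020</value>
</property>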
Coming to "workspaces": you can define workspaces here. In the above example, there is a workspace named root with location /user/root/drill.
This is your HDFS location.
If you have files under the /user/root/drill HDFS directory, you can query them using this workspace name.
Example: abc.csv is under this directory.
select * from dfs.root.`abc.csv`
After successfully creating the plugin, you can start Drill and start querying.
You can query any directory irrespective of workspaces.
Say you want to query employee.json in the /tmp/data HDFS directory.
The query is:
select * from dfs.`/tmp/data/employee.json`

I had a similar problem: Drill could not read the dfs server. In the end, the problem was caused by the namenode port.
The default address of the namenode web UI is http://localhost:50070/.
The default address of the namenode server is hdfs://localhost:8020/.
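If you are unsure which address your cluster uses, you can print the configured filesystem address directly (this reads fs.defaultFS from your Hadoop configuration):
hdfs getconf -confKey fs.defaultFS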

Related

How can I customize the CHANGELOG.md generated by the standard-version npm package?

I'm running the standard-version command each time I want to publish a new version, but the resulting entries in CHANGELOG.md look like this:
### [10.1.9](https://github.com/my-project-name/compare/v10.1.8...v10.1.9) (2021-03-29)
### [10.1.8](https://github.com/my-project-name/compare/v10.1.7...v10.1.8) (2021-03-29)
### [10.1.7](https://github.com/my-project-name/compare/v10.1.6...v10.1.7) (2021-03-29)
First, the links do not work - the GitHub URL is not correct and I want to configure it to the right URL. Second, I'd like to configure the link that's shown in the changelog file (there are some types).
I tried to use this documentation but didn't find anything that could help me:
https://github.com/conventional-changelog/conventional-changelog
So how do I configure the way standard-version writes the CHANGELOG.md? Can someone provide an example?
Yes.
According to the docs:
You can configure standard-version either by:
Placing a standard-version stanza in your package.json (assuming your project is JavaScript).
Creating a .versionrc, .versionrc.json or .versionrc.js.
If you are using a .versionrc.js your default export must be a configuration object, or a function returning a configuration object.
Any of the command line parameters accepted by standard-version can instead be provided via configuration.
Please refer to the conventional-changelog-config-spec for details on available configuration options.
Example:
.versionrc
{
"types": [
{
"type": "feat",
"section": "Features"
},
{
"type": "fix",
"section": "Bug Fixes"
},
{
"type": "chore",
"hidden": true
},
{
"type": "docs",
"hidden": true
},
{
"type": "style",
"hidden": true
},
{
"type": "refactor",
"section": "Refactor"
},
{
"type": "perf",
"section": "Performance"
},
{
"type": "test",
"hidden": true
}
]
}
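To fix the broken compare links themselves, the same config spec also lets you override the URL templates. A minimal sketch, assuming your project lives at github.com/<owner>/<repo> (replace those placeholders with your real organization and repository):
{
  "compareUrlFormat": "https://github.com/<owner>/<repo>/compare/{{previousTag}}...{{currentTag}}",
  "commitUrlFormat": "https://github.com/<owner>/<repo>/commit/{{hash}}"
}
That said, standard-version normally derives these URLs from the repository field in package.json, so checking that that field points at the correct GitHub repository may already be enough.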

Unable to create the storage plugin for Hive in Apache Drill

I am new to Apache Drill. While creating the storage plugin for Apache Hive, I am getting an error. I have tried two ways; below are the configurations.
1. First approach:
{
"type": "hive",
"enabled": false,
"configProps": {
"hive.metastore.uris": "thrift2:localhost:10000",
"fs.default.name": "hdfs://localhost:9000/",
"hive.metastore.sasl.enabled": "false"
}
}
2. Second approach:
{
"type": "hive",
"enabled": false,
"configProps": {
"hive.metastore.uris": "",
"javax.jdo.option.ConnectionURL": "jdbc:derby://localhost:1527/metastore_db;create=true",
"hive.metastore.warehouse.dir": "/user/tmp/warehouse/hive",
"fs.default.name": "hdfs://localhost:9000",
"hive.metastore.sasl.enabled": "false"
}
}
I am using plain Apache components, and both Drill and Hive 2 are installed on the same machine.
For both cases I get the following error in the GUI:
Please retry: error (unable to create/ update storage)
Kindly help me in resolving this. Thanks in advance!
I was able to connect through the first approach, i.e. the Hive remote metastore connection.
Here is the configuration:
{
"type": "hive",
"enabled": false,
"configProps": {
"hive.metastore.uris": "thrift:localhost:9083",
"fs.default.name": "hdfs://localhost:9000/",
"hive.metastore.sasl.enabled": "false"
}
}
Also make sure that the Hive metastore is up and running. It can be started using the command below:
hive --service metastore &
Also, the parameter hive.metastore.uris in hive-site.xml should be updated to thrift://localhost:9083.
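For reference, the corresponding hive-site.xml entry would look something like this (localhost:9083 is just the default assumed above; adjust it to your metastore host and port):
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
</property>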
Thanks

ARM - How can I get the access key from a storage account to use in AppSettings later in the template?

I'm creating an Azure Resource Manager template that instantiates multiple resources, including an Azure storage account and an Azure App Service with a Web App.
I'd like to be able to capture the primary access key (or the full connection string, either way is fine) from the newly-created storage account, and use that as a value for one of the AppSettings for the Web App.
Is that possible?
Use the listKeys helper function.
"appSettings": [
{
"name": "STORAGE_KEY",
"value": "[listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName')), providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]).keys[0].value]"
}
]
This quickstart does something similar:
https://azure.microsoft.com/en-us/documentation/articles/cache-web-app-arm-with-redis-cache-provision/
The syntax has changed since the other answer was accepted. The error you will now hit is: Template language expression property 'key1' doesn't exist, available properties are 'keys'.
Keys are now represented as an array of keys, and the syntax is now:
"StorageAccount": "[Concat('DefaultEndpointsProtocol=https;AccountName=',variables('StorageAccountName'),';AccountKey=',listKeys(resourceId('Microsoft.Storage/storageAccounts', variables('StorageAccountName')), providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]).keys[0].value)]",
See: http://samcogan.com/retrieve-azure-storage-key-in-arm-script/
I have faced this issue twice: first in 2015, and again today, in May 2017.
I need to add connection strings to the Web App - I want to add them automatically from the resources generated during deployment of the ARM template, so these values don't have to be added manually later.
The first time, I used the old version of the listKeys function (it looks like the old version returns the result not as an object but as a value):
"AzureWebJobsStorage": {
"type": "Custom",
"value": "[concat(variables('storageConnectionString'), listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName')), '2015-05-01-preview').key1)]"
},
Today, the latest working version of the template is:
"resources": [
{
"apiVersion": "2015-08-01",
"type": "config",
"name": "connectionstrings",
"dependsOn": [
"[resourceId('Microsoft.Web/Sites/', parameters('webSiteName'))]"
],
"properties": {
"DefaultConnection": {
"value": "[concat('Data Source=tcp:', reference(resourceId('Microsoft.Sql/servers/', parameters('sqlserverName'))).fullyQualifiedDomainName, ',1433;Initial Catalog=', parameters('databaseName'), ';User Id=', parameters('administratorLogin'), '#', parameters('sqlserverName'), ';Password=', parameters('administratorLoginPassword'), ';')]",
"type": "SQLServer"
},
"AzureWebJobsStorage": {
"type": "Custom",
"value": "[concat(variables('storageConnectionString'), listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageName')), '2016-01-01').keys[0].value)]"
},
"AzureWebJobsDashboard": {
"type": "Custom",
"value": "[concat(variables('storageConnectionString'), listKeys(resourceId('Microsoft.Storage/storageAccounts', parameters('storageName')), '2016-01-01').keys[0].value)]"
}
}
},
Thanks.
Below is an example of adding a storage account to ADLA (Azure Data Lake Analytics):
"storageAccounts": [
{
"name": "[parameters('DataLakeAnalyticsStorageAccountname')]",
"properties": {
"accessKey": "[listKeys(variables('storageAccountid'),'2015-05-01-preview').key1]"
}
}
],
In the variables section you can keep:
"variables": {
"apiVersion": "[providers('Microsoft.Storage', 'storageAccounts').apiVersions[0]]",
"storageAccountid": "[concat(resourceGroup().id,'/providers/','Microsoft.Storage/storageAccounts/', parameters('DataLakeAnalyticsStorageAccountname'))]"
},

Apache Drill: table not found on s3 bucket

I'm a newbie with Apache Drill.
The scenario is this:
I have an S3 bucket, where I placed my csv file called test.csv.
I've installed Apache Drill following the instructions from the official website.
I followed this tutorial to create an S3 plugin: https://drill.apache.org/blog/2014/12/09/running-sql-queries-on-amazon-s3/
I start Drill and use the correct "workspace" (with: use my-s3;), but when I try to select records from the test.csv file, an error occurs:
Table 's3./test.csv' not found.
Can anyone help me?
Thanks!
Use the name of your workspace (if you use one) and back ticks in the USE command as follows:
USE `my-s3`.`<workspace-name>`;
SHOW FILES; -- should list the test.csv file
SELECT * FROM `test.csv`;
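If you want a named workspace (the plugin config quoted below has an empty "workspaces" section), you could add one along these lines; the workspace name "data" and the /drill location are only illustrative:
"workspaces": {
  "data": {
    "location": "/drill",
    "writable": false,
    "defaultInputFormat": null
  }
}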
Query the CSV in the local file system using the dfs storage plugin configuration to rule out things like a header causing a problem. This page might help if you haven't seen it.
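As a sanity check with the default dfs plugin, something like the following should work against a local copy of the file (the path is a placeholder):
SELECT * FROM dfs.`/path/to/test.csv` LIMIT 10;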
Storage plugin mentioned in the comment above:
{
"type": "file",
"enabled": true,
"connection": "s3n://<accesskey>:<secret>#catpaws",
"workspaces": {},
"formats": {
"psv": {
"type": "text",
"extensions": [
"tbl"
],
"delimiter": "|"
},
"csv": {
"type": "text",
"extensions": [
"csv"
],
"delimiter": ","
},
"tsv": {
"type": "text",
"extensions": [
"tsv"
],
"delimiter": "\t"
},
"parquet": {
"type": "parquet"
},
"json": {
"type": "json"
}
}
}
Probably, this is not relevant. It's an excerpt from the Amazon S3 help, which contains lots more info:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>ID</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>SECRET</value>
</property>

Query Extensionless File using Apache Drill

I imported data into Hadoop using Sqoop 1.4.6. Sqoop imports and saves the data in HDFS as an extensionless file, but in csv format. I used Apache Drill to query the data from this file but got a Table not found error. In the storage plugin configuration I even put null, blank (""), and space (" ") in extensions, but was still not able to query the file. However, I was able to query the file once I renamed it with an extension: any extension in the configuration works other than a null one, and I could query the csv-format file with the extension 'mat' or anything else.
Is there any way to query extensionless files?
You can use a default input format in the storage plugin configuration to solve this problem. For example:
select * from dfs.`/Users/khahn/Downloads/csv_line_delimit.csv`;
+-------------------------+
| columns |
+-------------------------+
| ["hello","1","2","3!"] |
. . .
Change the file name to remove the extension and modify the plugin config "location" and "defaultInputFormat":
{
"type": "file",
"enabled": true,
"connection": "file:///",
"workspaces": {
"root": {
"location": "/Users/khahn/Downloads",
"writable": false,
"defaultInputFormat": "csv"
},
Query the file that has no extension.
0: jdbc:drill:zk=local> select * from dfs.root.`csv_line_delimit`;
+-------------------------+
| columns |
+-------------------------+
| ["hello","1","2","3!"] |
. . .
I have the same experience. First, I imported one table from Oracle into Hadoop 2.7.1 and then queried it via Drill. This is my plugin config, set through the web UI:
{
"type": "file",
"enabled": true,
"connection": "hdfs://192.168.19.128:8020",
"workspaces": {
"hdf": {
"location": "/user/hdf/my_data/",
"writable": false,
"defaultInputFormat": "csv"
},
"tmp": {
"location": "/tmp",
"writable": true,
"defaultInputFormat": null
}
},
"formats": {
"csv": {
"type": "text",
"extensions": [
"csv"
],
"delimiter": ","
}
}
}
Then, in the Drill CLI, query like this:
USE hdfs.hdf;
SELECT * FROM `part-m-00000`;
Also, in the Hadoop file system, when I cat the contents of 'part-m-00000', the format below is printed to the console:
2015-11-07 17:45:40.0,6,8
2014-10-02 12:25:20.0,10,1