I have configured my local S3 server with Minio.
I can access the files stored in it from Spark by following these steps.
But if I try to configure Hive to access an external Parquet file stored on this server, I get the following error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified by setting the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties (respectively).)
My hive version is: 1.1.
I'm using cdh5.16.1 with Hadoop 2.6.
My spark version is 1.6.
I have tried modifying the configuration files (hive-site.xml and core-site.xml) with the properties specified here, but I get the same error.
I have also tried setting these properties at run time by typing the following commands in a Hive shell:
SET fs.s3a.endpoint=http://127.0.0.1:9003;
SET fs.s3a.access.key=ACCESSKEY;
SET fs.s3a.awsAccessKeyId=ACCESSKEY;
SET fs.s3a.secret.key=SECRETKEY;
SET fs.s3a.awsSecretAccessKey=SECRETKEY;
SET fs.s3a.path.style.access=true;
SET fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem;
Note that I should only need fs.s3a.access.key and fs.s3a.secret.key, because I'm not using AWS S3 (I'm using a local S3 server), but I added the AWS key properties to my config files because of the exception message I'm getting. I have also tried using s3n instead of s3a (to check whether s3a is incompatible with my Hive version), but I get the same exception message.
The Create Table command that throws the exception:
CREATE EXTERNAL TABLE aml.bgp_pers_juridi3(
internal_id string,
society_type string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3n://oclrh65c034.isbcloud.isban.corp:9003/minio/entities/bgp_pers_juridi2'
Thanks in advance.
I finally managed to get access to Cloudera Manager (the server was down and I didn't have permissions), and I restarted all the services from it. You can also modify the files using Cloudera Manager; if you edit them by hand instead (as in my case), it warns you that your configuration isn't up to date in all the files where it should be, and it offers to update all of those files automatically.
I strongly recommend using Cloudera Manager to modify configuration properties for the different services, because it changes these properties in all the related files and then helps you restart the affected services.
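For reference, here is a minimal sketch of turning the fs.s3a.* settings from the question into the <property> XML that core-site.xml uses, for example to paste into a Cloudera Manager configuration snippet. The helper name render_hadoop_properties is hypothetical and the values are the placeholders from the question, so treat this as an illustration rather than the exact configuration I applied. The fs.s3a.awsAccessKeyId and fs.s3a.awsSecretAccessKey entries are left out, since the question only added them because of the exception message.

```python
import xml.etree.ElementTree as ET

# Property names and placeholder values copied from the question's SET commands.
S3A_PROPERTIES = {
    "fs.s3a.endpoint": "http://127.0.0.1:9003",
    "fs.s3a.access.key": "ACCESSKEY",
    "fs.s3a.secret.key": "SECRETKEY",
    "fs.s3a.path.style.access": "true",
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
}


def render_hadoop_properties(properties):
    """Render a dict as Hadoop-style <property> elements, one per line."""
    chunks = []
    for name, value in properties.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        # ElementTree escapes characters such as & and < in the values.
        chunks.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(chunks)


if __name__ == "__main__":
    print(render_hadoop_properties(S3A_PROPERTIES))
```

The printed elements go inside the <configuration> element of core-site.xml; the services still need a restart afterwards, as described above.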
When I try to query our Serverless SQL pool in Azure Synapse Analytics I get the following error:
"Content of directory on path 'https://xxxxxx.dfs.core.windows.net/dataverse-xxxxxx-org5a2bcccf/account/Snapshot/2018-08_1656570292/*.csv' cannot be listed.".
I have checked the following link for clues as to what could be the cause:
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/resources-self-help-sql-on-demand?tabs=x80070002
It suggests that the error is due to permissions.
However, I believe I have the correct permissions.
I get this error whether I try to execute the query in SSMS or Synapse Workspace.
The error in SSMS is as follows:
Warning: Unable to resolve path https://xxxxx.dfs.core.windows.net/dataverse-xxxxx-org5a2bcccf/account/Snapshot/2018-10_1657304551/*.csv. Error number 13807, Level 16, State 1, Message "Content of directory on path 'https://xxxxxx.dfs.core.windows.net/dataverse-xxxxx-org5a2bcccf/account/Snapshot/2018-10_1657304551/*.csv' cannot be listed.".
Can someone let me know how to resolve this?
The query that I'm attempting to execute can be found here:
https://github.com/slavatrofimov/Synapse-Link-for-Dataverse-data-enrichment-in-Serverless-SQL-Pools/blob/main/SQL/Enrich%20Synapse%20Link%20for%20Dataverse%20Entities%20with%20Human-Readable%20Labels.sql
Is there a definitive way to determine if the problem is due to lack of permissions?
Updated question:
I have just realised that the issue is with accessing the lake at https://xxxxxx.dfs.core.windows.net/dataverse-xxxxxx-org5a2bcccf/
Therefore, please take a look at my permissions on the lake and let me know whether they are sufficient.
This issue occurs when the user trying to query the external table does not have the relevant permissions, or when a firewall is enabled on the storage account's network.
Looking at the permissions you have provided, I see that Storage Blob Data Reader and Storage Blob Data Contributor have been granted.
Ref doc: Control storage account access for serverless SQL pool in Azure Synapse Analytics
If your storage account is protected by a firewall, you will have to follow the steps described in this document to overcome the issue: Access storage that is protected with the firewall
Here are a couple of relevant articles that might help you configure your storage firewall to overcome this issue:
Storage configuration for external table is not accessible while query on Serverless
Synapse Studio error while trying to read data from Storage Account using SQL On Demand
I am using PDI 8.3 with the repository database on another server.
I expected that if I did not define any log connections in the job properties, the job would not send any logs to the repository database.
However, when I run a job with kitchen.sh, it defines a new database connection, "live_logging_info", that points to "localhost:5432". Because the PDI repository database is on another server, the job fails.
How can I define the default DB log connection? Thank you.
Under PDI 8.3 there should be a folder called simple-jndi. Within that folder there should be a file called jdbc.properties. Near the bottom of that file there are settings for live_logging_info. By default it points to localhost:5432, but you can set it to any location, or to another type of database (MySQL, MSSQL, etc.).
The settings that are available by default are:
live_logging_info/type=javax.sql.DataSource
live_logging_info/driver=org.postgresql.Driver
live_logging_info/url=jdbc:postgresql://localhost:5432/hibernate?searchpath=pentaho_dilogs
live_logging_info/user=hibuser
live_logging_info/password=password
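As a rough sketch of the edit described above, the following Python function repoints the live_logging_info/url line in the text of jdbc.properties. The function name repoint_jndi_url and the host repo-db.example.com are hypothetical placeholders, and you can just as well change the line by hand in a text editor.

```python
def repoint_jndi_url(properties_text, jndi_name, new_url):
    """Return properties_text with the '<jndi_name>/url=' line set to new_url."""
    prefix = jndi_name + "/url="
    lines = []
    for line in properties_text.splitlines():
        # Only the url entry of the named connection is rewritten;
        # every other line is passed through untouched.
        if line.strip().startswith(prefix):
            line = prefix + new_url
        lines.append(line)
    return "\n".join(lines) + "\n"


# Default entries as listed above (type, driver and url only, for brevity).
SAMPLE = (
    "live_logging_info/type=javax.sql.DataSource\n"
    "live_logging_info/driver=org.postgresql.Driver\n"
    "live_logging_info/url=jdbc:postgresql://localhost:5432/hibernate?searchpath=pentaho_dilogs\n"
)

# repo-db.example.com is an assumed host name standing in for the real repo server.
updated = repoint_jndi_url(
    SAMPLE,
    "live_logging_info",
    "jdbc:postgresql://repo-db.example.com:5432/hibernate?searchpath=pentaho_dilogs",
)
print(updated)
```

To apply it for real, read simple-jndi/jdbc.properties into a string, pass it through the function, and write the result back; the user and password entries may need the same treatment.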
Hello, I am trying to create a new plan on SQL Server to back up all my databases.
My goal is to back them up to a network drive, so that if I have trouble with my server I can restore the databases to another server from the backups on the network drive.
When my plan is executed I get errors, so I tried to execute the corresponding query manually.
After some investigation, it seems that even the net use command doesn't work (whereas it works when I run it from cmd):
EXEC XP_CMDSHELL 'net use Z: \\ServerName\loggin/user:loggin password'
The error is:
System error 1450 has occurred. Insufficient system resources exist to complete the requested service.
Besides, I have another server where it works, so I suppose some configuration is missing, but I can't find what.
As my network drive is also accessible via FTP, I chose that route to get the job done: I created a batch file that runs WinSCP and use this batch file in a SQL Server Agent job. I needed to grant the SQL Server Agent account rights on the batch file. I also needed to define a credential and a proxy to be used in the job.
I have the following task in airflow, which works like a charm:
t = SparkSubmitOperator(
task_id = 'some_id',
application = '/path/to/app.py',
name = 'airflow-spark',
conf = {
'spark.hadoop.fs.s3a.endpoint': 'https://some.url.com/',
'spark.hadoop.fs.s3a.access.key': 'myuser',
'spark.hadoop.fs.s3a.secret.key': 'my_super_secret_password',
},
dag = dag,
)
As you can guess, my spark job needs to authenticate on an S3 server instance to retrieve data. While this works, I don't want to put my password as cleartext in the dag. How can I authenticate with the S3 server, without using my password in cleartext? I tried setting up connections in airflow, which seems to be exactly for this use case, but when I use conn_id = 'my_connection' inside the task, it tries to run the spark job on the server instead.
If you are running Airflow on AWS infrastructure, you can use the IAM-granted permissions of the container/VM to access S3.
If you can update the config regularly, you could issue session tokens on your desktop and update the spec. You'll need a Hadoop version which supports session credentials (2.8+).
You can also use JCEKS files to store the credentials: you'd then get that file onto all VMs doing the work and set the Hadoop/Spark config to load it. See https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Storing_secrets_with_Hadoop_Credential_Providers. The file does need to be on a shared store (e.g. HDFS, mounted EBS), as each Spark worker resolves the JCEKS path locally to load the secrets.
The simplest option is to use IAM permissions. If you are running on a shared cluster, JCEKS files are better.
Finally, Hadoop 3.3+ allows the S3A connector to dynamically generate session/role tokens from a user with full credentials and pass these with the Spark job. If you experiment with that, you can keep full credentials on your desktop or in a JCEKS file only Airflow can read, and have session/role credentials generated from those. Useful, but trickier to set up.
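To make the JCEKS option concrete, here is a minimal sketch of building the conf dict for the question's SparkSubmitOperator so that it points at a credential store instead of carrying the keys inline. The helper name s3a_conf_from_jceks, the namenode host and the file path are all hypothetical, and it assumes the store was created beforehand with the hadoop credential command and is readable by the Spark workers.

```python
def s3a_conf_from_jceks(endpoint, jceks_uri):
    """Build a Spark conf dict that references a credential store, with no secrets inline."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        # Hadoop resolves fs.s3a.access.key and fs.s3a.secret.key from this store.
        "spark.hadoop.hadoop.security.credential.provider.path": jceks_uri,
    }


# Assumed setup, run once outside the DAG (host and path are placeholders):
#   hadoop credential create fs.s3a.access.key -provider jceks://hdfs@namenode.example.com/user/airflow/s3.jceks
#   hadoop credential create fs.s3a.secret.key -provider jceks://hdfs@namenode.example.com/user/airflow/s3.jceks
conf = s3a_conf_from_jceks(
    "https://some.url.com/",
    "jceks://hdfs@namenode.example.com/user/airflow/s3.jceks",
)
print(conf)
```

The resulting dict can be passed as conf=conf to the SparkSubmitOperator from the question, in place of the dict containing the cleartext access key and secret key.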
Looking for some help to resolve the errors I'm facing. Let me explain the scenario. I'm trying to sync an ADLS Gen2 container to Azure Blob Storage. I have AzCopy 10.4.3 and I'm using azcopy sync to do this, with the command below:
azcopy sync 'https://ADLSGen2.blob.core.windows.net/testsamplefiles/SAMPLE' 'https://AzureBlobStorage.blob.core.windows.net/testsamplefiles/SAMPLE' --recursive
When I run this command I get the error below:
REQUEST/RESPONSE (Try=1/71.0063ms, OpTime=110.9373ms) -- RESPONSE SUCCESSFULLY RECEIVED
PUT https://AzureBlobStorage.blob.core.windows.net/testsamplefiles/SAMPLE/SampleFile.parquet?blockid=ZDQ0ODlkYzItN2N2QzOWJm&comp=block&timeout=901
X-Ms-Request-Id: [378ca837-d01e-0031-4f48-34cfc2000000]
ERR: [P#0-T#0] COPYFAILED: https://ADLSGen2.blob.core.windows.net/testsamplefiles/SAMPLE/SampleFile.parquet: 404 : 404 The specified resource does not exist.. When Staging block from URL. X-Ms-Request-Id: [378ca837-d01e-0031-4f48-34cfc2000000]
Dst: https://AzureBlobStorage.blob.core.windows.net/testsamplefiles/SAMPLE/SampleFile.parquet
REQUEST/RESPONSE (Try=1/22.9854ms, OpTime=22.9854ms) -- RESPONSE SUCCESSFULLY RECEIVED
GET https://AzureBlobStorage.blob.core.windows.net/testsamplefiles/SAMPLE/SampleFile.parquet?blocklisttype=all&comp=blocklist&timeout=31
X-Ms-Request-Id: [378ca84e-d01e-0031-6148-34cfc2000000]
So far I have checked and ensured the following:
I logged into correct tenant while logging into AzCopy
Storage Blob Data Contributor role was granted to my AD credentials
I'm not sure what else I'm missing, as the file exists in the source and I keep getting the same error. I tried with SAS, but I received a different error. I cannot proceed with SAS due to vendor policy, so I need to get this working with OAuth. Any input is really appreciated.
For the 404 error, check whether there is a typo in the command and whether the path /testsamplefiles/SAMPLE exists on both the source and destination accounts. Also, please note the following from the tips:
Use single quotes in all command shells except for the Windows Command
Shell (cmd.exe). If you're using a Windows Command Shell (cmd.exe),
enclose path arguments with double quotes ("") instead of single
quotes ('').
From the azcopy sync supported scenarios:
Azure Blob <-> Azure Blob (Source must include a SAS or is publicly
accessible; either SAS or OAuth authentication can be used for
destination)
That is, we must include a SAS token in the source. I tried the command below with AD authentication,
azcopy sync "https://[account].blob.core.windows.net/[container]/[path/to/blob]?[SAS]" "https://[account].blob.core.windows.net/[container]/[path/to/blob]"
but got the same 400 error as in the GitHub issue.
Thus, after validating this, you can use the following command to sync an ADLS Gen2 container to Azure Blob Storage without running azcopy login. If you are already logged in, you can run azcopy logout first.
azcopy sync "https://nancydl.blob.core.windows.net/container1/sample?sv=xxx" "https://nancytestdiag244.blob.core.windows.net/container1/sample?sv=xxx" --recursive --s2s-preserve-access-tier=false
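If you script this, a small guard can catch a missing SAS on the source before azcopy is ever called. The sketch below is only an illustration: the function name build_azcopy_sync is hypothetical, it treats the presence of a sig query parameter as "has a SAS", and it ignores the publicly accessible source case mentioned in the supported scenarios.

```python
from urllib.parse import parse_qs, urlsplit


def build_azcopy_sync(source_url, destination_url):
    """Return the argv list for azcopy sync, refusing a source URL without a SAS signature."""
    query = parse_qs(urlsplit(source_url).query)
    # A SAS token always carries its signature in the 'sig' parameter.
    if "sig" not in query:
        raise ValueError("source URL must include a SAS token (no 'sig' parameter found)")
    return [
        "azcopy",
        "sync",
        source_url,
        destination_url,
        "--recursive",
        "--s2s-preserve-access-tier=false",
    ]


# Placeholder accounts and tokens; substitute your own URLs and SAS values.
cmd = build_azcopy_sync(
    "https://sourceaccount.blob.core.windows.net/container1/sample?sv=xxx&sig=xxx",
    "https://destaccount.blob.core.windows.net/container1/sample?sv=xxx&sig=xxx",
)
print(cmd)
```

The list can then be handed to subprocess.run(cmd, check=True); passing the arguments as a list also sidesteps the single versus double quote issue from the tips above.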