pyspark read from s3 and write to elasticsearch - amazon-s3

I'm trying to read from S3 and write to Elasticsearch, using Jupyter installed on the Spark master machine. I have this configuration:
import pyspark
import os
#os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"
import findspark
findspark.init()
from pyspark.sql import SparkSession
import configparser
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
aws_profile='DEFAULT'
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")
from pyspark import SparkContext, SparkConf
sc_conf = SparkConf()
sc_conf.setAppName("app-3-logstash")
sc_conf.setMaster('spark://172.31.25.152:7077')
sc_conf.set('spark.executor.memory', '24g')
sc_conf.set('spark.executor.cores', '8')
sc_conf.set('spark.cores.max', '32')
sc_conf.set('spark.logConf', True)
sc_conf.set('spark.packages', 'org.apache.hadoop:hadoop-aws:2.7.3')
sc_conf.set('spark.jars', '/usr/local/spark/jars/elasticsearch-hadoop-7.6.0/dist/elasticsearch-spark-20_2.11-7.6.0.jar')
sc = SparkContext(conf=sc_conf)
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoop_conf.set("fs.s3n.awsAccessKeyId", access_id)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", access_key)
Using this configuration I get access to ES but not to S3.
When I try to read from S3 with this configuration, I get this error:
Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe. :
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
When I disable sc_conf.set('spark.packages', ..) and sc_conf.set('spark.jars', ..) and uncomment the os.environ['PYSPARK_SUBMIT_ARGS'] line, I do get access to S3 but not to ES.
What do I miss?
Thanks
Yaniv
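Not a confirmed fix, but one thing stands out: spark.packages is not a Spark property; the key for pulling Maven artifacts is spark.jars.packages, so with the configuration above the hadoop-aws classes never reach the classpath, which matches the ClassNotFoundException for NativeS3FileSystem. Below is a minimal sketch that declares both dependencies on the same SparkConf (master URL, jar path and the access_id/access_key variables are assumed to be the same as above):
from pyspark import SparkConf, SparkContext
sc_conf = SparkConf()
sc_conf.setAppName("app-3-logstash")
sc_conf.setMaster('spark://172.31.25.152:7077')
# Pull hadoop-aws (and its transitive AWS SDK) from Maven at startup.
sc_conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:2.7.3')
# Keep the Elasticsearch connector jar that is already on disk.
sc_conf.set('spark.jars', '/usr/local/spark/jars/elasticsearch-hadoop-7.6.0/dist/elasticsearch-spark-20_2.11-7.6.0.jar')
sc = SparkContext(conf=sc_conf)
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoop_conf.set("fs.s3n.awsAccessKeyId", access_id)        # access_id / access_key read from ~/.aws/credentials as above
hadoop_conf.set("fs.s3n.awsSecretAccessKey", access_key)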

Related

driver.find_elements_by_class_name() and driver.find_elements() don't work

I have tried to retrieve some data from a URL via Selenium, but after running driver.find_element_by_class_name() I got this error message:
'WebDriver' object has no attribute 'find_element_by_class_name'
Maybe this function has been deprecated in newer module versions; please give me a hint on where to find documentation for the function that replaces the deprecated one.
This is my code:
import pandas as pd
import time
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import seaborn as sns
import os
import logging
######
PATH = 'C:/Program Files/chromedriver/chromedriver.exe'
options = Options()
driver = webdriver.Chrome(service=Service(PATH))
page_url = "https://witcher.fandom.com/wiki/Category:Characters_in_the_stories"
driver.get(page_url)
book_categories = driver.find_elements_by_class_name('category-page__member-link')
AttributeError: 'WebDriver' object has no attribute 'find_elements_by_class_name'
------------
book_categories = driver.find_elements(By = 'class_name','category-page__member-link')
SyntaxError: positional argument follows keyword argument
I use chromedriver version 109.0.5414.74 and Chrome version 109.0.5414.75.
Also I tried to use this code:
driver.find_elements(By.NAME, 'category-page__member-link')
but it also led to error:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[name="category-page__member-link"]"}.
Please help me find out the reason for this error and how to resolve it.
Your locator is correct, but your locator type is wrong: 'category-page__member-link' is the value of the class attribute, not the name attribute, so you have to use something like:
driver.find_element(By.CLASS_NAME, "category-page__member-link")
or
driver.find_element(By.CSS_SELECTOR, ".category-page__member-link")
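As a fuller sketch of the Selenium 4 call pattern (the class name and page URL are taken from the question, everything else is assumed), the plural find_elements works the same way and is what the book_categories loop needs:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
PATH = 'C:/Program Files/chromedriver/chromedriver.exe'
driver = webdriver.Chrome(service=Service(PATH))
driver.get("https://witcher.fandom.com/wiki/Category:Characters_in_the_stories")
# find_elements(By.CLASS_NAME, ...) replaces the removed find_elements_by_class_name(...)
book_categories = driver.find_elements(By.CLASS_NAME, "category-page__member-link")
print([link.text for link in book_categories])
driver.quit()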

Why can't I connect to my aws s3 bucket from pyspark running locally?

I have a script that tries to load a pyspark data frame into AWS S3. I have verified that public access is enabled in S3, and a colleague has managed to upload a file to the bucket from a script in Databricks. However, when I try to save a data frame to the bucket from a process running on my local machine, I get the following error:
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden;
I'm using pyspark version 3.2.1, and python 3.10. Here is my configuration:
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.sql.types import StructField, StructType, IntegerType, StringType
import os
import sys
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
conf = SparkConf().set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.0')
conf.set('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider')
sc = SparkContext(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')
# Setting up S3 bucket credentials
ACCESS_KEY_ID = "xxxxxxx"
SECRET_ACCESS_KEY = "xxxxxxx"
AWS_BUCKET_NAME = "bucket_name"
DIRECTORY = "/directory /"
# hadoopConf = sc._jsc.hadoopConfiguration()
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', ACCESS_KEY_ID)
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', ENCODED_SECRET_KEY)
sc._jsc.hadoopConfiguration().set('fs.s3a.endpoint', 's3-eu-central-1.amazonaws.com')
sc._jsc.hadoopConfiguration().set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
spark = SparkSession(sc)
Here is the line of code that attempts to save the data frame:
result_dataframe.write.format("csv").option("header", "true").option("delimiter", "\t").mode("append").save(f"s3a://bucket_name/directory/{filename}.csv")
I can't figure out why my access is denied. Any help would be much appreciated.
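No accepted answer is shown here, but two details in the configuration are worth checking (a sketch, not a verified fix). First, fs.s3a.aws.credentials.provider is pinned to AnonymousAWSCredentialsProvider, which tells S3A to send unsigned requests and ignore the access/secret keys, so any non-public bucket will answer 403. Second, the secret is set from ENCODED_SECRET_KEY while the variable defined above is SECRET_ACCESS_KEY. A minimal variant, assuming the keys themselves are valid for the bucket:
from pyspark import SparkConf, SparkContext
conf = SparkConf().set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.0')
# Sign requests with the supplied keys instead of using anonymous access.
conf.set('spark.hadoop.fs.s3a.aws.credentials.provider',
         'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
sc = SparkContext(conf=conf)
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.access.key', ACCESS_KEY_ID)        # as defined in the question
hadoop_conf.set('fs.s3a.secret.key', SECRET_ACCESS_KEY)    # not ENCODED_SECRET_KEY
hadoop_conf.set('fs.s3a.endpoint', 's3-eu-central-1.amazonaws.com')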

Access Denied issue in AWS Glue while performing a simple ETL task

I am facing an error while trying to run AWS Glue. I am trying to copy data from my table, which was populated with the help of a crawler.
The error is given below:
An error occurred while calling o91.pyWriteDynamicFrame. Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 6YRCFCNCKW9ZK2PF; S3 Extended Request ID: 7v/5/dEhaxjIMMxfpCEu5vT6fwzmyV0kIphicPvUDYKY23rFYN1ALn2qo/N3CcIUEhSrOGKklW4=; Proxy: null)
My script is given below:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node1647583966899 = glueContext.create_dynamic_frame.from_catalog(
    database="s3-databse-mockdata",
    table_name="mock_data_csv",
    transformation_ctx="AWSGlueDataCatalog_node1647583966899",
)
# Script generated for node Amazon S3
AmazonS3_node1647583976365 = glueContext.getSink(
    path="s3://destination-001/",
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    enableUpdateCatalog=True,
    transformation_ctx="AmazonS3_node1647583976365",
)
AmazonS3_node1647583976365.setCatalogInfo(
    catalogDatabase="s3-databse-mockdata", catalogTableName="dest-table"
)
AmazonS3_node1647583976365.setFormat("csv")
AmazonS3_node1647583976365.writeFrame(AWSGlueDataCatalog_node1647583966899)
job.commit()
I am unable to find what the problem is.
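There is no answer recorded here, but a 403 AccessDenied from pyWriteDynamicFrame usually means the IAM role attached to the Glue job can read the source and catalog but lacks s3:PutObject (and s3:ListBucket) on the destination bucket, or a bucket policy/KMS key blocks it. A hedged way to reproduce the failure outside Glue, from any environment that can assume the same role (bucket name taken from the script, object key hypothetical):
import boto3
# Confirm which principal the code is actually running as.
print(boto3.client("sts").get_caller_identity()["Arn"])
# Attempt the same kind of write the Glue sink performs; a 403 here reproduces the problem.
s3 = boto3.client("s3")
s3.put_object(Bucket="destination-001", Key="permission-check/test.txt", Body=b"ok")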

How to connect Pymongo over SSH like Robo3T does?

I am trying to connect to a remote Mongo server with SSH and Pymongo. I use the Python package sshtunnel. It works with Robo3T but fails with Python.
This is my code:
from sshtunnel import SSHTunnelForwarder
from pymongo import MongoClient
from pprint import pprint
MONGO_HOST = "localhost:27017"
MONGO_DB = "dbasename"
MONGO_USER = "username"
MONGO_PASS = "password"
server = SSHTunnelForwarder(
MONGO_HOST,
ssh_username=MONGO_USER,
ssh_password=MONGO_PASS,
remote_bind_address=('10.0.0.244', 22)
)
server.start()
client = MongoClient('127.0.0.1', server.local_bind_port)
db = client[MONGO_DB]
The code stops at server.start(). This is the error:
'Could not establish session to SSH gateway'
This is the code that works:
from sshtunnel import SSHTunnelForwarder
from pymongo import MongoClient
from pprint import pprint
MONGO_HOST = "localhost:27017"
MONGO_DB = "dbasename"
MONGO_USER = "username"
MONGO_PASS = "password"
server = SSHTunnelForwarder(
MONGO_HOST,
ssh_username=MONGO_USER,
ssh_password=MONGO_PASS,
remote_bind_address=('localhost', 27017)
)
server.start()
client = MongoClient(host= 'localhost', port=server.local_bind_port)
db = client[MONGO_DB]
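For context, the difference between the two snippets is what the tunnel is pointed at (a sketch with hypothetical hostnames, not the original setup): SSHTunnelForwarder's first argument must be the SSH gateway itself, ssh_username/ssh_password are the SSH login, and remote_bind_address is where mongod listens as seen from that gateway. Pointing the tunnel at localhost:27017 (a MongoDB port, not an SSH server), as in the first snippet, is the likely reason the SSH session could not be established.
from sshtunnel import SSHTunnelForwarder
from pymongo import MongoClient
# Hypothetical values -- replace with the real SSH host and credentials.
SSH_HOST = "ssh.example.com"   # the machine you would ssh into (not localhost)
SSH_USER = "ssh_username"
SSH_PASS = "ssh_password"
server = SSHTunnelForwarder(
    (SSH_HOST, 22),                           # SSH gateway address and port
    ssh_username=SSH_USER,
    ssh_password=SSH_PASS,
    remote_bind_address=("127.0.0.1", 27017)  # where mongod listens on that machine
)
server.start()
client = MongoClient("127.0.0.1", server.local_bind_port)  # local end of the tunnel
db = client["dbasename"]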

cannot find symbol sshCloneUrl for cloning repository in bamboo-specs

I am working on writing Bamboo Specs for a build plan where I am trying to clone the repository from Bitbucket. Here is my Java Spec code:
package tutorial;
import com.atlassian.bamboo.specs.api.BambooSpec;
import com.atlassian.bamboo.specs.api.builders.AtlassianModule;
import com.atlassian.bamboo.specs.api.builders.BambooKey;
import com.atlassian.bamboo.specs.api.builders.BambooOid;
import com.atlassian.bamboo.specs.api.builders.Variable;
import com.atlassian.bamboo.specs.api.builders.applink.ApplicationLink;
import com.atlassian.bamboo.specs.api.builders.permission.PermissionType;
import com.atlassian.bamboo.specs.api.builders.permission.Permissions;
import com.atlassian.bamboo.specs.api.builders.permission.PlanPermissions;
import com.atlassian.bamboo.specs.api.builders.plan.Job;
import com.atlassian.bamboo.specs.api.builders.plan.Plan;
import com.atlassian.bamboo.specs.api.builders.plan.PlanIdentifier;
import com.atlassian.bamboo.specs.api.builders.plan.Stage;
import com.atlassian.bamboo.specs.api.builders.plan.artifact.Artifact;
import com.atlassian.bamboo.specs.api.builders.plan.artifact.ArtifactSubscription;
import com.atlassian.bamboo.specs.api.builders.plan.branches.BranchCleanup;
import com.atlassian.bamboo.specs.api.builders.plan.branches.PlanBranchManagement;
import com.atlassian.bamboo.specs.api.builders.plan.configuration.AllOtherPluginsConfiguration;
import com.atlassian.bamboo.specs.api.builders.plan.configuration.ConcurrentBuilds;
import com.atlassian.bamboo.specs.api.builders.project.Project;
import com.atlassian.bamboo.specs.api.builders.repository.VcsChangeDetection;
import com.atlassian.bamboo.specs.api.builders.task.AnyTask;
import com.atlassian.bamboo.specs.builders.repository.bitbucket.server.BitbucketServerRepository;
import com.atlassian.bamboo.specs.builders.repository.viewer.BitbucketServerRepositoryViewer;
import com.atlassian.bamboo.specs.builders.task.CheckoutItem;
import com.atlassian.bamboo.specs.builders.task.CommandTask;
import com.atlassian.bamboo.specs.builders.task.MsBuildTask;
import com.atlassian.bamboo.specs.builders.task.ScriptTask;
import com.atlassian.bamboo.specs.builders.task.VcsCheckoutTask;
import com.atlassian.bamboo.specs.builders.trigger.BitbucketServerTrigger;
import com.atlassian.bamboo.specs.model.task.ScriptTaskProperties;
import com.atlassian.bamboo.specs.util.BambooServer;
import com.atlassian.bamboo.specs.util.MapBuilder;
import com.atlassian.bamboo.specs.api.builders.deployment.*;
/**
* Plan configuration for Bamboo.
* Learn more on: https://confluence.atlassian.com/display/BAMBOO/Bamboo+Specs
*/
@BambooSpec
public class PlanSpec {
/**
* Run main to publish plan on Bamboo
*/
public static void main(final String[] args) throws Exception {
//By default credentials are read from the '.credentials' file.
BambooServer bambooServer = new BambooServer("http://localhost:8085");
Plan plan = new PlanSpec().createPlan();
Deployment deploy = new PlanSpec().createDeployment();
bambooServer.publish(plan);
bambooServer.publish(deploy);
PlanPermissions planPermission = new PlanSpec().createPlanPermission(plan.getIdentifier());
bambooServer.publish(planPermission);
}
PlanPermissions createPlanPermission(PlanIdentifier planIdentifier) {
Permissions permission = new Permissions()
.groupPermissions("bamboo-admin", PermissionType.ADMIN)
.anonymousUserPermissionView();
return new PlanPermissions(planIdentifier.getProjectKey(), planIdentifier.getPlanKey()).permissions(permission);
}
Project project = new Project().name("Bamboo Specs").key("DRAGON");
Plan createPlan() {
return new Plan(
project,
"Java Specs Plan 2", "JSPTT2")
.description("Plan created from (enter repository url of your plan)")
.planRepositories(new BitbucketServerRepository()
.name("New Pattern Playbook")
.repositoryViewer(new BitbucketServerRepositoryViewer())
.server(new ApplicationLink()
.name("Bitbucket")
.id("bca01bef-a3d8-3da3-9187-91b73d0f1f77"))
.projectKey("ALM")
.repositorySlug("pattern.dotnet")
.sshPublicKey("ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC2ETaXXCvEeZqcJWxybcMA91jiskcjrh5PB2UqLNenhiGRdh2kZBUgBMur/JsBYzFF2tNTlWdOXW1oRhSQ38PeUGXVRg3pMR5mxvOO/K/wO1DB6ZqjzRgwLFBJJMqCk58I213Y2pvd7Q9ot/xLludzh3rKmJFHwqOOBJYO/BLIqwL/hfM2Kvr4Op/284s5vBhJ+4l8sCrorSGsDE/r7mpAWjvrMGZGosLqgQtvrnzrL9XchxMT8UstzVeIAdHtWcwGOtv+1pjAqW6+035A/5W3tsWJ+EyBWFQ4rkZP/HFEdAAUgpM//oNVfB03+zZVn0BIKWX6evcwQPEjVzqn3+Ir http://bamboo-lab.cdk.com")
.sshPrivateKey("-----BEGIN RSA PRIVATE KEY-----\nMIIEowIBAAKCAQEAthE2l1wrxHmanCVscm3DAPdY4rJHI64eTwdlKizXp4YhkXYd\npGQVIATLq/ybAWMxRdrTU5VnTl1taEYUkN/D3lBl1UYN6TEeZsbzjvyv8DtQwema\no80YMCxQSSTKgpOfCNtd2Nqb3e0PaLf8S5bnc4d6ypiRR8KjjgSWDvwSyKsC/4Xz\nNir6+Dqf9vOLObwYSfuJfLAq6K0hrAxP6+5qQFo76zBmRqLC6oELb6586y/V3IcT\nE/FLLc1XiAHR7VnMBjrb/taYwKluvtN+QP+Vt7bFifhMgVhUOK5GT/xxRHQAFIKT\nP/6DVXwdN/s2VZ9ASCll+nr3MEDxI1c6p9/iKwIDAQABAoIBAB1sNLVLOOt8d2bq\niVcIs+3RCzU/eE2k0tMUr92b95HkFEKsouexINTW0Y9OuEIGJK1USriEOXipkoe6\nY5JyBvZDaeGIe7EGthIH7s5ZuZkKDOf5d3snJtSKJMNdRbjKYHYO9WCZG31G1Smo\nKgaRMYAzEb3x3/CH3OSTiyiKxgJVktPWHgLxPkQF3ZAyxnt4S0Z5a3Q/WF6zdVcI\nUh+ygcmixHXiQBaMitMSCZuXx6ayCBVmeIkZnmyfSfDU5yjBS8bmYPELVe6X/GfL\nvWsUcbCv2qtxjNxefuOQguBq8svq2ykNAbhzNY5GSVC+1uF++6+EsmqSNoPHIZrh\nYQMaESECgYEA2RkMCM+wOCf87qBHlCVtI9ZukBYOBXde3w2VQNKSK8/3bFilo0NP\nlMguaS5DWvWmPsidESWRWR9eHiZcR8/KA6RhcHjwjydKe+cD0M/asoA/l1AcaquO\ngpllhs00+YqAmIUZT17xTlP77DCiMfFP71mwOAUUb68zih8bSrf/dBsCgYEA1rE0\nIuM46bLq9ru5deatl6N3RR0uX8qZaIg8S1ur7O3rTgWUvsmvVkSgHHOEPawcuioQ\n4HbLeMIVcA3roxGV9TD+uG0qGtv3G6ZUJ/izdNi5czp1N/XCMtagjWe2G33FscTT\neugQIbSRGOehimPtJsOuPMscAbDroChYwJLQizECgYBjDaiOBKT0mlovTnYaRBFN\n/rKnj0iKefKRdxMYZntG/jZ3+uJoYXfX/JYga3lT8S0PDF2Ny0RME6HPw9Tq9wXH\nL6M9vBCWYGj9q2P0TEIOm7FoCqdMjEYTlIXcQZjgGq+d52yq6DjVckBJfc8jVmUQ\nYi2jAb5XTusHJDZBmz407QKBgQCoCyHU4OeuPKYfJAbhWwKrO37isRmYTvtOz7vp\n/EIQ/JT+h+3KfBDqxGJSgrSSlUITEVQObc2Lotam066J//zRY10tO/0F8wBzOviK\nJOdKYUye/bW8bHdp1Ybrx67Jy+NO5tHlVPkzeKNNzBgsO1Tnz6h020H7rOBxhsMZ\nUJE9MQKBgDqRwfKmnBUcOElwvhQ1iH7aJL8zD5Ugbu0Xd72XUXcli1BcixvGdRyI\nWjnACbAVfPA/mcT8Ztto9uM/ZvH1rsAaqVnjKdxtlPYefuTff5QYUNWeR5C7FlJh\ntEJaTQdg/yXvnKpKCAtp+KJVfLyuRtuPwppR3yIGfpJNMYxUxHRb\n-----END RSA PRIVATE KEY-----\n")
.sshCloneUrl("ssh://git@stash-lab.cdk.com:7999/alm/pattern.dotnet.git")
.remoteAgentCacheEnabled(false)
.changeDetection(new VcsChangeDetection()))
.stages(new Stage("Stage 1")
.jobs(new Job("First Job","JOB1")
.tasks(new ScriptTask()
.inlineBody("echo Hello World")
)));
}
Deployment createDeployment() {
return new Deployment(new PlanIdentifier("DRAGON", "JSPTT"),
"Java Specs Plan Deployment")
.releaseNaming(new ReleaseNaming("release-1")
.autoIncrement(true))
.environments(new Environment("Java Specs environment")
.tasks(new ScriptTask().inlineBody("echo Hello world!")));
}
}
Error:
[ERROR] /Users/kamblea/Documents/REPOS/bamboo-spec/java/bamboospecjava/bamboo-specs/src/main/java/tutorial/PlanSpec.java:[90,25] cannot find symbol
[ERROR] symbol: method sshCloneUrl(java.lang.String)
[ERROR] location: class com.atlassian.bamboo.specs.builders.repository.bitbucket.server.BitbucketServerRepository
[ERROR]
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
I resolved it on my own.
Solution: I am using Bamboo 6.3.1 but was trying to use Bamboo Specs 6.4, which is only compatible with Bamboo 6.4 and higher.
So I referred to the older API reference to solve this issue, and it worked.