Path Error in Hortonworks Cluster While Reading CSV File - apache-spark-sql

I am reading a CSV file from a Spark session. The file path is obtained via getClass.getResource("/config/abc.csv").getPath. When I use the SparkSession to read the file and deploy it on the cluster, it fails with the error: Path does not exist. The complete code is below:
var filePath: String = ""
filePath = getClass.getResource("/config/abc.csv").getPath
var dataFrame = spark.read.option("header", "true").csv(filePath)
Can anyone help me resolve this error?

Related

I am trying to perform a simple linear regression on a csv data set, but R won't read the dataset

I am running the code below to use a CSV file so that I can perform a linear regression. A few fixes I found here and on other sites included the "setwd" command and closing the CSV file before running the command. I am still generating the error.
setwd("C:/Users/Tommy/Desktop/")
dataset = file.choose("Project_subset.csv")
dataset = read.csv("dataset")
> dataset = read.csv("dataset")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : cannot open file 'dataset': No such file or directory
I appreciate the help on a simple problem.
I have tried several different commands to read the csv file and none have been successful. I keep getting the error above that the file does not exist. I also used file.exists() and it returned FALSE. I am very confused, as this seems to be a simple command to use.

How to read data from multiple folders from ADLS into a Databricks dataframe

The file path format is data/year/weeknumber/day/data_hour.parquet, for example:
data/2022/05/01/00/data_00.parquet
data/2022/05/01/01/data_01.parquet
data/2022/05/01/02/data_02.parquet
data/2022/05/01/03/data_03.parquet
data/2022/05/01/04/data_04.parquet
data/2022/05/01/05/data_05.parquet
data/2022/05/01/06/data_06.parquet
data/2022/05/01/07/data_07.parquet
How do I read all these files one by one in a Databricks notebook and store them into a dataframe?
import pandas as pd

# Get all the files under the folder ('file' is the folder path to list)
data = dbutils.fs.ls(file)
df = pd.DataFrame(data)

# Create the list of file paths
paths = df.path.tolist()

for i in paths:
    df = spark.read.load(path=f'{i}*', format='parquet')
I am able to read only the last file; the other files are skipped.
The last line of your code does not load data incrementally. Instead, it overwrites the df variable with the data from each path on every iteration.
Removing the for loop and trying the code below should give you an idea of how file masking with asterisks works. Note that the path should be a full path. (I'm not sure whether the data folder is your root folder or not.)
df = spark.read.load(path='/data/2022/05/*/*/*.parquet',format='parquet')
This is what I have applied from the same answer I shared with you in the comment.
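As a side note, a minimal sketch of reading all the listed paths in one call instead of a loop, assuming paths is the list built from dbutils.fs.ls above and that each entry points at parquet data (spark.read.load accepts a list of paths):
# Read every listed path in a single call so nothing gets overwritten in a loop
df = spark.read.load(path=paths, format='parquet')
df.count()  # quick sanity check that all files were picked up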

How to read a mounted dbc file in Databricks?

I am trying to read a dbc file in Databricks (mounted from an S3 bucket).
The file path is:
file_location="dbfs:/mnt/airbnb-dataset-ml/dataset/airbnb.dbc"
How can I read this file using Spark?
I tried the code below:
df=spark.read.parquet(file_location)
But it generates an error:
AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
Thanks for the help!
I tried the code below: df=spark.read.parquet(file_location) But it generates an error:
You are using spark.read.parquet but want to read a dbc file. It won't work this way.
Don't use parquet; use load instead. Pass the file path with the file name (without the .dbc extension) in the path parameter and dbc in the format parameter.
Try the code below:
df=spark.read.load(path='<file_path_with_filename>', format='dbc')
E.g., df=spark.read.load(path='/mnt/airbnb-dataset-ml/dataset/airbnb', format='dbc')

How to read all the files under a directory in S3 using a Spark dataframe?

I have 490 JSON files following the pattern 657438009821376.json, all with different numbers for different files.
Can I use the following?
val input = spark.read.option("header", true).json("/path/to/data/[0-9]*.json")
I need to read all 490 files into a single DataFrame.
You provide either a file path or a directory path as the source. https://spark.apache.org/docs/latest/sql-data-sources-json.html
There are no options to filter out specific files before actually loading them. After loading them you can map each row to its source filename and filter out unnecessary files, but I would not suggest doing that.
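For illustration only, a minimal PySpark sketch of that filename-based filtering; the directory path and regular expression below are placeholders:
from pyspark.sql.functions import input_file_name, col

# Load the whole directory, tag each row with its source file, then filter by filename
df = (spark.read.json("/path/to/data/")
      .withColumn("source_file", input_file_name())
      .filter(col("source_file").rlike(r"[0-9]+\.json$")))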
I suggest doing it in a few steps:
List all matching files from the S3 bucket with the boto S3 client, something like this:
import boto
import boto.s3
import re

pattern = re.compile("your_regexp_pattern")
bucket = boto.s3.connect_to_region('eu-central-1').get_bucket("BUCKET_NAME")
# List the keys under the prefix and keep only those whose name matches the pattern
keys = bucket.list(prefix="path/to-dir/")
files = ["s3://BUCKET_NAME/" + key.name for key in keys if pattern.match(key.name)]
Provide the list of files to spark.read.json to read them as JSON from S3:
from pyspark.sql.types import StructType, StringType

your_schema_class = StructType() \
    .add("first_name", StringType()) \
    .add("last_name", StringType())

# From step 1:
# files = ["s3://1.json", "s3://2.json"]
df = spark.read.json(files, your_schema_class)
df.show()

Write in-memory object to S3 via boto3

I am attempting to write files directly to S3 without creating a local file which is then uploaded.
I am using cStringIO to generate a file in memory, but I am having trouble figuring out the proper way to upload it in boto3.
def writetos3(sourcedata, filename, folderpath):
    s3 = boto3.resource('s3')
    data = open(sourcedata, 'rb')
    s3.Bucket('bucketname').put_object(Key=folderpath + "/" + filename, Body=data)
Above is the standard boto3 method that I was using previously with a local file. It does not work without a local file; I get the following error: coercing to Unicode: need string or buffer, cStringIO.StringO found.
Because the in-memory file (I believe) is already considered open, I tried changing it to the code below, but it still does not work. No error is given; the script simply hangs on the last line of the method.
def writetos3(sourcedata, filename, folderpath):
    s3 = boto3.resource('s3')
    s3.Bucket('bucketname').put_object(Key=folderpath + "/" + filename, Body=sourcedata)
Just for more info, the value I am attempting to write looks like this:
<cStringIO.StringO object at 0x045DC540>
Does anyone have an idea of what I am doing wrong here?
It looks like you want this:
data = sourcedata.getvalue().decode()
It defaults to utf8. Also, I encourage you to run your code under python3, and to use appropriate language tags for your question.
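For reference, a minimal sketch of uploading an in-memory buffer directly with boto3, assuming Python 3 and io.BytesIO; the bucket name and key below are placeholders:
import io
import boto3

buffer = io.BytesIO()
buffer.write(b"some,csv,content\n")

s3 = boto3.resource('s3')
s3.Bucket('bucketname').put_object(
    Key='folderpath/filename.csv',
    Body=buffer.getvalue(),  # pass raw bytes rather than the buffer object itself
)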