Databricks - FileNotFoundException - dataframe

I'm sorry if this is basic and I missed something simple. I'm trying to run the code below to iterate through the files in a folder and merge all files that start with a specific string into a dataframe. All the files sit in a data lake.
file_list = []
path = "/dbfs/rawdata/2019/01/01/parent/"
files = dbutils.fs.ls(path)
for file in files:
    if file.name.startswith("CW"):
        file_list.append(file.name)
df = spark.read.load(path=file_list)
# check point
print("Shape: ", df.count(), ",", len(df.columns))
df.printSchema()
This looks fine to me, but apparently something is wrong here. I'm getting an error on this line:
files = dbutils.fs.ls(path)
Error message reads:
java.io.FileNotFoundException: File /6199764716474501/dbfs/rawdata/2019/01/01/parent does not exist.
The path, the files, and everything else definitely exist. I tried with and without the 'dbfs' part. Could it be a permission issue? Something else? I googled for a solution but still can't get traction on this.

Make sure a folder named "dbfs" actually exists. If your parent folder starts at "rawdata", the path should be "/rawdata/2019/01/01/parent" or "rawdata/2019/01/01/parent".
The error is thrown when the path is incorrect.

This is an old thread, but if someone is still looking for a solution:
It does require the path to be listed as:
"dbfs:/rawdata/2019/01/01/parent/"

Azure Synapse Lookup UserErrorFileNotFound with wildcard path

I am facing an odd issue where my Lookup returns a FileNotFound error when I use a wildcard path. If I specify an exact file path, the Lookup runs without error. However, if I replace the filename with a *, I get a FileNotFound error.
The file is Data_643.json, located in my Azure Data Lake Storage Gen2, under the labournavigatorfile system. The exact file path is:
labournavigatorfile/raw_data/Scraped/HeadHunter/Saudi_Arabia/Data_643.json.
If I put this exact path into the Integration dataset configuration, the pipeline runs without issue. However, as soon as I replace 'Data_643.json' with a '*', the pipeline crashes with a FileNotFound error.
What am I doing wrong? Many thanks for any support. This must be something very simple that I am missing.
Exact path works (screenshot omitted). Wildcard path throws the error (screenshot omitted).
I have 3 files in my container: file1.json, file2.json, and file3.json.
I configured my dataset to read using a wildcard, with the same configuration as in the image provided in the question.
When I used this dataset in a Lookup activity, I got the same error:
To overcome this, go to your Lookup activity. When you want to use wildcards to read one or more files, check the wildcard file path option, then specify the folder structure and use a wildcard where required (reference screenshot omitted).
The debug output from running the pipeline (each of my files had 10 rows) is omitted here.
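For reference, the same wildcard setting expressed in the pipeline's underlying JSON. This is a sketch following the Data Factory/Synapse copy-source schema; the JsonSource/AzureBlobFSReadSettings types and the folder path are assumptions based on the question's ADLS Gen2 setup:
"source": {
    "type": "JsonSource",
    "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "recursive": true,
        "wildcardFolderPath": "raw_data/Scraped/HeadHunter/Saudi_Arabia",
        "wildcardFileName": "*.json"
    }
}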

Input data from a file in a directory where the filename updates each month

I'm having trouble finding the answer to this, but if it already exists, please share it with me. I'm trying to input a file whose name ending is ambiguous within a directory. For example, the directory would be 'ThisPC\Desktop\Folder' and I'm looking for the file 'SomeData_April2021'. I need it to search for the term 'SomeData' within the folder, as the ending is updated each month. There are also other files in the folder, so I can't just specify the directory and have it upload everything.
Hopefully this is clear! If not, I'm happy to clarify.
Figured it out. I had to type (?i).+SomeData.. into the wildcard expression field.
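If you are doing this in plain Python rather than through a tool's wildcard field, a small sketch using the standard-library glob module; the folder path below is a hypothetical stand-in for ThisPC\Desktop\Folder:
import glob
import os

folder = r"C:\Users\me\Desktop\Folder"  # hypothetical path

# Match any file whose name starts with "SomeData", whatever the month suffix.
for path in glob.glob(os.path.join(folder, "SomeData*")):
    print(path)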

How to extract .sql file that seems to be a .zip

I have received a file from a customer. The file is said to be
SQL code (application/sql)
However, this has turned out to be wrong: nothing could open it. It turns out it was secretly a .zip file. By renaming it to '.zip' and manually extracting it, I was able to get the files contained in it. I would like to do a similar process in Python.
So far I've renamed the file:
file_name_zip = file_name.replace('.sql', '.zip')
os.rename(file_name, file_name_zip)
And I've tried extracting it:
import zipfile

zip_ref = zipfile.ZipFile(file_name_zip, 'r')
zip_ref.extractall(extracted_file)
However, this failed because
zipfile.BadZipFile: File is not a zip file
I've googled, and apparently this can sometimes be fixed using:
zip_file_name_2 = zip_file_name.replace('.zip', '2.zip')
os.system(f'zip -FF {zip_file_name} --out {zip_file_name_2}')
This required me to put in a bunch of settings, which I wasn't able to figure out. There must be a better way to go about this.
Does anybody know how to parse such an .sql file?
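One way to avoid guessing from the extension: zipfile works on file content, not file names, so the rename isn't needed at all, and you can check the ZIP signature first. A minimal sketch (file_name is the question's variable; the destination folder is made up):
import zipfile

if zipfile.is_zipfile(file_name):  # checks the ZIP signature, not the extension
    with zipfile.ZipFile(file_name) as zf:
        zf.extractall("extracted")  # hypothetical destination directory
else:
    # Not a ZIP at all; peek at the first bytes to see what it really is.
    with open(file_name, 'rb') as f:
        print(f.read(8))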

How to fix 'File name too long' errors when using Snakemake

When using Snakemake, I store the values for my variables as part of the filenames (ex. "processed/count_{project}.tsv"). Recently, I've started using R formulas with many covariates as a variable. Now I get an error because the filename is too long for the operating system. Has anyone else run into this issue, and do you have any suggestions? Is there a canonical Snakemake approach to this problem?
Personally, I don't think it is a good idea to store information in the filename.
Rather, I would create a temp file in tabular or YAML format linking the file in question to the covariates or other data, then read this file in R (or whatever you use) to extract the relevant information.
Another idea is to encode the information in the directory path instead, since a full path is allowed to be much longer than a single filename component (on Linux, typically 4096 bytes versus 255).
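A minimal sketch of the lookup-table idea in a Snakefile; the models.tsv file, the model_id wildcard, and the R script name are all hypothetical:
import csv

# models.tsv (hypothetical, tab-separated): model_id <TAB> formula
with open("models.tsv") as fh:
    FORMULAS = {row["model_id"]: row["formula"]
                for row in csv.DictReader(fh, delimiter="\t")}

rule count:
    output:
        "processed/count_{model_id}.tsv"
    params:
        # Look up the long formula from the short ID kept in the filename.
        formula=lambda wildcards: FORMULAS[wildcards.model_id]
    shell:
        "Rscript count.R --formula '{params.formula}' --out {output}"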

How to read a Bunch of files in a directory in lua

I have a path (as a string) to a directory. In that directory are a bunch of text files. I want to go to that directory, open each text file, and read its data.
I've tried
f = io.open(path)
f:read("*a")
I get the error "nil Is a directory"
I've tried:
f = io.popen(path)
I get the error: "Permission denied"
Is it just me, or is it a lot harder than it should be to do basic file I/O in Lua?
A directory isn't a file. You can't just open it.
And yes, Lua itself has (intentionally) limited functionality.
You can use luafilesystem or luaposix and similar modules to get more features in this area.
You can also use the following script to list the names of the files in a given directory (assuming Unix/Posix):
-- Run the shell's ls and iterate over the file names it prints
dirname = '.'
f = io.popen('ls ' .. dirname)
for name in f:lines() do print(name) end
f:close()
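To go one step further and actually read each file's contents, a small extension of the same idea, still assuming a Unix shell; the directory name is hypothetical:
dirname = '/path/to/dir'            -- hypothetical directory
local p = io.popen('ls ' .. dirname)
for name in p:lines() do
  local fh = io.open(dirname .. '/' .. name, 'r')
  if fh then                        -- skip entries that cannot be opened
    local data = fh:read('*a')      -- read the whole file
    fh:close()
    print(name, #data)
  end
end
p:close()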