I am facing an odd issue where my Lookup activity returns a FileNotFound error when I use a wildcard path. If I specify an exact file path, the lookup runs without error. However, if I replace the file name with a *, I get a FileNotFound error.
The file is Data_643.json, located in my Azure Data Lake Storage Gen2, under the labournavigatorfile system. The exact file path is:
labournavigatorfile/raw_data/Scraped/HeadHunter/Saudi_Arabia/Data_643.json.
If I put this exact path into the Integration dataset configuration, the pipeline runs without issue. However, as soon as I replace 'Data_643.json' with '*', the pipeline fails with a FileNotFound error.
What am I doing wrong? Many thanks for any support. This must be something very simple that I am missing.
Exact path works:
Wildcard path throws error:
I have 3 files in my container as file1.json, file2.json, file3.json as shown below:
The following is how I configured my dataset to read using a wildcard, with the same configuration as in the image provided in the question.
When I used this in lookup I got the same error:
To overcome this, go to your lookup activity. When you want to use wildcards to read a file/files, check the wildcard file path option. Then specify the folder structure and use wildcard where required. The following is an image for reference.
The following is the debug output when I run the pipeline (Each of my files had 10 rows):
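For anyone who prefers the pipeline's JSON view over the UI, here is a rough sketch of what the wildcard settings on the Lookup source might look like, written as a Python dict so it can be printed as JSON. The property names (`AzureBlobFSReadSettings`, `wildcardFolderPath`, `wildcardFileName`) are my assumption of how ADLS Gen2 read settings are serialized, and `*.json` is used here instead of the bare `*` from the question, so please compare against the JSON of your own pipeline before relying on it.

```python
import json

# Assumed shape of the Lookup activity's "source" block for a JSON dataset on
# ADLS Gen2. Verify the property names against your own pipeline's JSON view.
lookup_source = {
    "type": "JsonSource",
    "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "recursive": True,
        # Folder part of the path, relative to the file system (container).
        "wildcardFolderPath": "raw_data/Scraped/HeadHunter/Saudi_Arabia",
        # Wildcard applied to the file names inside that folder.
        "wildcardFileName": "*.json",
    },
}

print(json.dumps(lookup_source, indent=2))
```

The point is that the wildcard lives in the activity's source settings, not in the dataset's file name field.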
I'm sorry if this is basic and I missed something simple. I'm trying to run the code below to iterate through files in a folder and merge all files that start with a specific string, into a dataframe. All files sit in a lake.
file_list=[]
path = "/dbfs/rawdata/2019/01/01/parent/"
files = dbutils.fs.ls(path)
for file in files:
    if file.name.startswith("CW"):
        file_list.append(file.name)
df = spark.read.load(path=file_list)
# check point
print("Shape: ", df.count(), ",", len(df.columns))
df.printSchema()
This looks fine to me, but apparently something is wrong here. I'm getting an error on this line:
files = dbutils.fs.ls(path)
Error message reads:
java.io.FileNotFoundException: File/6199764716474501/dbfs/rawdata/2019/01/01/parent does not exist.
The path, the files, and everything else definitely exist. I tried with and without the 'dbfs' part. Could it be a permission issue? Something else? I Googled for a solution. Still can't get traction with this.
Make sure you actually have a folder named "dbfs". If your parent folder starts from "rawdata", the path should be "/rawdata/2019/01/01/parent" or "rawdata/2019/01/01/parent".
This error is thrown when the path is incorrect.
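A minimal sketch of that path fix, assuming the leading "/dbfs" in the question is the local mount prefix rather than a real folder. The helper name `to_dbfs_uri` is made up for illustration; it only rewrites the string and does not touch the file system.

```python
def to_dbfs_uri(path: str) -> str:
    """Rewrite a '/dbfs/...' style path into a 'dbfs:/...' URI (string only)."""
    if path.startswith("dbfs:/"):
        return path  # already in scheme form
    if path.startswith("/dbfs/"):
        return "dbfs:/" + path[len("/dbfs/"):]
    return "dbfs:/" + path.lstrip("/")


path = to_dbfs_uri("/dbfs/rawdata/2019/01/01/parent/")
print(path)

# In a Databricks notebook (where dbutils and spark are predefined), the
# listing from the question could then be written along these lines:
#   files = dbutils.fs.ls(path)
#   cw_paths = [f.path for f in files if f.name.startswith("CW")]
#   df = spark.read.load(path=cw_paths)
```

Note that the commented lines collect `f.path` rather than `f.name`, since the reader needs full paths and not bare file names.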
This is an old thread, but if someone is still looking for a solution:
The path needs to be given as:
"dbfs:/rawdata/2019/01/01/parent/"
I am running my jobs locally using the Local SDK. However, I get the following error message:
Error : 'System.IO.PathTooLongException: The specified path, file name, or both are too long. The fully qualified file name must be less than 260 characters, and the directory name must be less than 248 characters.
One of my colleagues was able to track down the error to the .ss file in the catalog folder inside DataRoot by running the project in a new directory in C:\. The path for the .ss file is
C:\HelloWorld\Main\Source\Data\Insights\NewProject\NewProject\USQLJobsForTesting.Tests\bin\Debug\DataRoot\_catalog_\database\d92bfaa5-dc7f-4131-abdc-22c50eb0d8c0\schema\f6cf4417-e2d8-4769-b633-4fb5dddcb066\table\aa136daf-9e86-4650-9cc3-119d607fb3b0\31a18033-099e-4c2a-aae3-75cf099b0fb1.ss
which exceeds the allowed limit of 260 characters. I cannot reduce the length of my project path because my organization follows a certain working directory format.
Is there any possible solution for this problem?
Try using subst in CMD to work around this problem by mapping a drive letter to the data root you want to use.
subst X: C:\PathToYourDataRoot
And then in ADL Tools for Visual Studio set the DataRoot to X:
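To see why the drive mapping helps, here is a small sketch that compares the length of a path before and after it is viewed through the mapped drive. The folder names and the helper `path_via_drive` are hypothetical; only the 260-character limit comes from the error message above.

```python
MAX_PATH = 260  # limit quoted in the error message above


def path_via_drive(path, data_root, drive="X:"):
    """Return `path` as it would look through a drive letter mapped to data_root."""
    root = data_root.rstrip("\\")
    if not path.lower().startswith(root.lower() + "\\"):
        raise ValueError("path is not under the data root")
    return drive + path[len(root):]


# Hypothetical data root and catalog file, long enough to break the limit.
data_root = "C:\\" + "\\".join(["SomeLongFolderName"] * 12)
ss_file = data_root + "\\_catalog_\\database\\" + "0" * 36 + ".ss"

mapped = path_via_drive(ss_file, data_root)
print(len(ss_file), len(ss_file) < MAX_PATH)  # original path length and whether it fits
print(len(mapped), len(mapped) < MAX_PATH)    # same file seen through X:
```

Everything above the data root drops out of the path, so the organization's directory layout can stay as it is.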
I'm trying to load the files from a group of folders in one go. When I set
sourceURI = 'gs://ybbi/bi_landing_zone/files_to_load/app/reports/app_network_analytics_report/201409011*'
all the folders that I want to load start with 20140911,
but I get the error:
ERROR: Invalid path: gs://ybbi/bi_landing_zone/files_to_load/apn/reports/appnexus_network_analytics_report/20140901191111_3bab8ec0_092a_43de_a157_db35d1555ea0/
20140901191111_3bab8ec0_092a_43de_a157_db35d1555ea0 is one of these folders (I don't know why it prints the full name of this specific folder).
For some other folder trees it works, but for this specific folder tree it returns the same error.
I know that Cloud Storage doesn't have real folders and that the folder is just part of the object name, but you understand what I mean.
Is it a bug?
Without more information, what it looks like is that you have an object called gs://ybbi/bi_landing_zone/files_to_load/apn/reports/appnexus_network_analytics_report/20140901191111_3bab8ec0_092a_43de_a157_db35d1555ea0/ that is not a CSV/JSON file. Some tools create these dummy objects in order to simulate directories. BigQuery requires all objects that match the input glob path to be importable files.
One solution would be to change the glob path to include a narrower set of files. You can pass multiple paths if that makes things easier. For example, you could pass
gs://ybbi/bi_landing_zone/files_to_load/apn/reports/appnexus_network_analytics_report/20140901191111_3bab8ec0_092a_43de_a157_db35d1555ea0/*
and
gs://ybbi/bi_landing_zone/files_to_load/apn/reports/appnexus_network_analytics_report/20140901191111_some_other_path/*
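If you already have a listing of the object names, another option is to build the list of source URIs yourself and leave out the directory placeholders. A minimal sketch, assuming the placeholders are exactly the objects whose names end in "/" (as in the error above); the bucket and object names below are made up, and the helper `importable_uris` is hypothetical.

```python
def importable_uris(bucket, object_names):
    """Build gs:// URIs, skipping names that end in '/' (folder placeholders)."""
    return [
        f"gs://{bucket}/{name}"
        for name in object_names
        if not name.endswith("/")
    ]


# Hypothetical listing: one placeholder object plus two real data files.
names = [
    "reports/20140901191111_a/",
    "reports/20140901191111_a/part-0.csv",
    "reports/20140901191111_b/part-0.csv",
]
uris = importable_uris("example-bucket", names)
print(uris)
```

The resulting list can then be passed as multiple source paths instead of a single glob.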
I’m having issues using wildcard input paths in Pig.
If I run the following commands:
A = load '/something/*.csv' using PigStorage(',');
dump A;
I see the output from all csv files in the something folder printed to my console after the job is run.
If, however, I run a store instead:
A = load '/something/*.csv' using PigStorage(',');
store A into 'somedestination';
The job fails with the following error message:
Input(s):
Failed to read data from “/something/*.csv”
It looks like the store is attempting to load from the literal path instead of globbing using the wildcard, but if that’s the case then why does it work during the dump? Is there another way to accomplish this?
You may not have the permission to write to that folder.
The dump essentially writes to the tmp folder (or another folder if the configuration is different) and then prints that to the screen.
Do a dump. Look at the log. It should say something like:
Input(s):
Successfully read 0 records from: "‘/something/*.csv’"
Output(s):
Successfully stored 0 records in: "file:/tmp/temp1865628879/tmp-1573237939"
Then next time try and store to the folder that you saw when you did the dump. If that works fine, then you have a permissions problem.
I'm trying to load simple file:
log = load 'file_1.gz' using TextLoader AS (line:chararray);
dump log;
And I get an error:
2014-04-08 11:46:19,471 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input Pattern hdfs://hadoop1:8020/pko/file*gz matches 0 files
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:288)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1054)
Is it possible to handle such a situation before the error appears?
Input Pattern hdfs://hadoop1:8020/pko/file*gz matches 0 files
The error means that the input file doesn't exist at the given HDFS path.
log = load 'file_1.gz' using TextLoader AS (line:chararray);
Since you haven't given the absolute path of file_1.gz, Pig resolves it against the HDFS home directory of the user running your Pig script.
Unfortunately, in the current version of Pig (0.15.0) it is impossible to manage these errors without using UDFs.
I suggest creating a Java or Python script using try and catch to take care of this.
Here's a good website that might be of some use to you: https://wiki.apache.org/pig/PigErrorHandlingInScripts
Good luck learning Pig!
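As a rough idea of what such a wrapper script could look like in Python: check that the input pattern matches something before launching the job, and catch any failure of the job itself. This sketch uses the local file system via `glob` as a stand-in; for HDFS you would need to swap in a check from whatever HDFS client you use. `run_if_input_exists` and the `run_job` callback are hypothetical names.

```python
import glob
import os
import tempfile


def run_if_input_exists(pattern, run_job):
    """Call run_job() only when the glob pattern matches at least one path."""
    if not glob.glob(pattern):
        print(f"Skipping job: pattern {pattern!r} matches 0 files")
        return None
    try:
        return run_job()
    except Exception as exc:  # report a job failure instead of crashing
        print(f"Job failed: {exc}")
        return None


# Demo against a temporary local directory.
with tempfile.TemporaryDirectory() as tmp:
    pattern = os.path.join(tmp, "file*gz")
    skipped = run_if_input_exists(pattern, lambda: "ran")   # nothing there yet
    open(os.path.join(tmp, "file_1.gz"), "w").close()
    executed = run_if_input_exists(pattern, lambda: "ran")  # now one match
```

In a real setup `run_job` would launch the Pig script, for example through a subprocess call.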
I'm facing this issue as well. My load command is:
DATA = LOAD '${qurwf_folder_input}/data/*/' AS (...);
I want to load all files from the data subfolders, but the data folder is empty and I got the same error as you. What I did, in my particular case, was to create an empty folder in the data directory. So the LOAD returns an empty dataset and the script did not fail.
By the way, I'm using Oozie workflow to run the scripts, and in the prepare, I create the empty folders.
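The same idea as a local-file-system sketch in Python, in case it helps to see it spelled out: create one empty subfolder so that a pattern like data/*/ always matches something. The folder name "empty" and the helper name are my own choices for illustration; I avoided a name starting with "_" or "." because Hadoop input formats typically treat such paths as hidden.

```python
import glob
import os
import tempfile


def ensure_placeholder_subfolder(data_dir, name="empty"):
    """Create an empty subfolder in data_dir so 'data_dir/*/' matches something."""
    path = os.path.join(data_dir, name)
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, "data")
    os.makedirs(data_dir)
    subfolder_pattern = os.path.join(data_dir, "*", "")  # i.e. data/*/
    before = glob.glob(subfolder_pattern)  # empty data folder: no matches
    ensure_placeholder_subfolder(data_dir)
    after = glob.glob(subfolder_pattern)   # now the placeholder matches
```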