How to read a mounted .dbc file in Databricks?

I am trying to read a .dbc file in Databricks (mounted from an S3 bucket).
The file path is:
file_location="dbfs:/mnt/airbnb-dataset-ml/dataset/airbnb.dbc"
How can I read this file using Spark?
I tried the code below:
df=spark.read.parquet(file_location)
But it generates an error:
AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
Thanks for any help!

I tried the code below: df=spark.read.parquet(file_location) But it
generates an error:
You are using spark.read.parquet, but the file is not a Parquet file, which is why Spark cannot infer a schema from it.
Note that spark.read.load(path=..., format=...) only works for formats Spark has a data source for, and there is no 'dbc' data source. A .dbc file is a Databricks archive, i.e. a bundle of exported notebooks, not a dataset, so spark.read cannot load it as a DataFrame at all.
To use its contents, import the archive into your workspace (Workspace > Import in the Databricks UI, or the Databricks CLI), or extract the underlying data files (for example CSV or Parquet) and read those with the matching spark.read method.
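If you want to confirm what is actually inside the archive, a .dbc export is essentially a ZIP container, so it can usually be opened with Python's standard zipfile module after copying it out of DBFS. This is only a minimal sketch, assuming the mount path from the question and a Databricks notebook where dbutils is available; the local /tmp path is just a placeholder.
import zipfile

# Copy the archive from the DBFS mount to the driver's local disk so zipfile can open it.
dbutils.fs.cp("dbfs:/mnt/airbnb-dataset-ml/dataset/airbnb.dbc", "file:/tmp/airbnb.dbc")

# A .dbc archive is (assumed here to be) a ZIP file of notebook sources, not a dataset.
with zipfile.ZipFile("/tmp/airbnb.dbc") as archive:
    for name in archive.namelist():
        print(name)  # list the notebooks bundled in the archive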

Related

Azure Synapse Lookup UserErrorFileNotFound with wildcard path

I am facing an odd issue where my lookup returns a FileNotFound error when I use a wildcard path. If I specify an exact file path, the lookup runs without error. However, if I replace the filename with a *, I get a FileNotFound error.
The file is Data_643.json, located in my Azure Data Lake Storage Gen2 account, in the labournavigatorfile file system. The exact file path is:
labournavigatorfile/raw_data/Scraped/HeadHunter/Saudi_Arabia/Data_643.json
If I put this exact path into the Integration dataset configuration, the pipeline runs without issue. However, as soon as I replace 'Data_643.json' with a '*', the pipeline fails with a FileNotFound error.
What am I doing wrong? This must be something very simple that I am missing. Many thanks for any support.
The exact path works, but the wildcard path throws the error.
I have 3 files in my container: file1.json, file2.json, and file3.json.
I configured my dataset to read using a wildcard, with the same configuration as in the question.
When I used this in a Lookup activity, I got the same error.
To overcome this, go to your Lookup activity. When you want to use wildcards to read one or more files, check the Wildcard file path option, then specify the folder structure and use the wildcard where required.
When I run the pipeline, the debug output shows the lookup reading the files successfully (each of my files had 10 rows).

Is it possible to write directly to the final file with distcp?

I'm trying to upload a file to s3a using distcp.
distcp writes to a temporary file first and then renames it to the proper filename.
But my user does not have permission to update/delete objects, so I end up with a file of the proper size but the wrong name:
-rw-rw-rw- 1 3738 2021-05-24 12:04 s3a://testbucket/.distcp.tmp.attempt_1621587961870_0001_m_000000_0
on S3, and I receive an error:
Error: java.io.IOException: File copy failed: file:///testfile.json --> s3a://testbucket/testfile.json
Is it possible to skip the rename and write directly to the final filename?
I've found it here:
https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
There is a parameter:
-direct
Write directly to destination paths. Useful for avoiding potentially very expensive temporary file rename operations when the destination is an object store.
Example:
hadoop distcp -direct hdfs://nn1:8020/datasets/set1 s3a://bucket/datasets/set1
Unfortunately my distcp version is too old and doesn't have this feature.

How do I save a CSV file to AWS S3 with a specified name from an AWS Glue DataFrame?

I am trying to generate a file from a DataFrame that I have created in AWS Glue, and I want to give it a specific name. Most answers on Stack Overflow use filesystem modules, but this particular CSV file is generated in S3. I also want to name the file while generating it, not rename it after it is generated. Is there any way to do that?
I have tried using df.save('s3://PATH/filename.csv'), which actually generates a new directory in S3 named filename.csv and then creates part-*.csv files inside that directory.
df.repartition(1).write.mode('append').option('header', 'true').format('csv').save('s3://PATH')
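For what it's worth, Spark (and therefore Glue's DataFrame writer) always produces part-* files under the directory passed to save(); there is no writer option that sets the final object name. A common workaround is to write a single part file to a temporary prefix and then copy it to the desired key with boto3. The following is only a sketch of that pattern, assuming it runs in a Glue job where boto3 is available; the bucket, prefixes, and file name are placeholders, not values from the question.
import boto3

bucket = "my-bucket"                # placeholder bucket name
tmp_prefix = "tmp/output/"          # temporary prefix Spark writes to
final_key = "reports/filename.csv"  # desired final object key

# First write a single part file to the temporary prefix, for example:
# df.repartition(1).write.mode('overwrite').option('header', 'true').csv(f"s3://{bucket}/{tmp_prefix}")

s3 = boto3.client("s3")

# Find the part file Spark produced under the temporary prefix.
objects = s3.list_objects_v2(Bucket=bucket, Prefix=tmp_prefix).get("Contents", [])
part_key = next(o["Key"] for o in objects if o["Key"].endswith(".csv"))

# Copy it to the desired key, then remove the temporary object.
s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": part_key}, Key=final_key)
s3.delete_object(Bucket=bucket, Key=part_key)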

What is the wildcard for the File connector file path field in Anypoint Studio and Mule?

I am using Anypoint Studio 7 and Mule 4.1.
A product file in CSV format, with a filename that includes the current timestamp, will be added to a directory daily and needs to be processed. To do this we are creating a Mule flow using the File connector, and we want to configure the file path field to read only CSV files regardless of their names.
At the moment, the only way I can get it to work is by specifying the filename in the file path field, which looks like this:
C:/Workspace/product-files-v1/src/main/resources/input/products-2018112011001111.csv
when I would like to specify some kind of wildcard in the file path, similar to this:
C:/Workspace/product-files-v1/src/main/resources/input/products-*.csv
but the above does not work.
What is the correct wildcard syntax? Also, is there a way to specify a relative file path instead of an absolute one? When I try to specify a relative file path I get an error too.
Error message in logs:
********************************************************************************
Message : Illegal char <*> at index 108: C:/Workspace/product-files-v1/src/main/resources/input/products-*.csv.
Element : product-files-v1/processors/1 # product-files-v1:product-files-v1.xml:16 (Read File)
Element XML : <file:read doc:name="Read File" doc:id="fdbbf477-e831-4e7c-827c-71efd1d2e538" config-ref="File_Config" path="C:/Workspace/product-files-v1/src/main/resources/input/products-*.csv" outputMimeType="application/csv" outputEncoding="UTF-8"></file:read>
Error type : MULE:UNKNOWN
--------------------------------------------------------------------------------
Root Exception stack trace:
java.nio.file.InvalidPathException: Illegal char <*> at index 108: C:/Workspace/product-files-v1/src/main/resources/input/products-*.csv
Thanks for any help
I am assuming you need to use a <file:matcher> when you want to filter or read only certain types of files from a directory.
An example would be:
<file:matcher
    filename-pattern="a?*.{htm,html,pdf}"
    path-pattern="a?*.{htm,html,pdf}" />
For your case, a filename pattern such as *.csv would match only CSV files.

In Kettle, using Text File Input to read a CSV file from a tar.gz file didn't work. Where might it be wrong?

I have a CSV file that is tarred and gzipped, so I have test.tar.gz.
I would like to read the CSV file through the Text File Input step.
I tried tar:gz:file://C:/test/test.tar.gz!/test.tar! with a wildcard like ".*\.csv".
But it does not read successfully.
It throws an exception:
org.apache.commons.vfs.FileNotFolderException:
Could not list the contents of
"tar:gz:file:///C:/test/test.tar.gz!/test.tar!/"
because it is not a folder.
I use Windows 8.1 and PDI 5.2.
Where might it be wrong?
For reading a CSV from a compressed file, the "Text File Input" step in Pentaho Kettle only supports the first file inside the compressed archive (either a Zip or GZip file). Check the compression section of the Pentaho Wiki.
Now for your issue, try removing the wildcard entry, since only the first file inside the zip/gzip file will be read (as explained above).
I have placed sample code covering both reading zip and gzip files. Check it here.
Hope it helps :)
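If the CSV you need is not the first entry in the archive, one possible workaround outside of Kettle is to extract it first and point the Text File Input step at the plain file. This is only a sketch using Python's standard tarfile module, assuming the path from the question; the extraction directory is a placeholder.
import tarfile

# Extract the CSV members of test.tar.gz so Kettle can read a plain file instead of the archive.
with tarfile.open("C:/test/test.tar.gz", "r:gz") as archive:
    for member in archive.getmembers():
        if member.name.endswith(".csv"):
            archive.extract(member, path="C:/test/extracted")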