Spark SQL: load Parquet with special characters in path

I am using PySpark SQL to load files into a table, following this syntax:
LOAD DATA LOCAL INPATH '/user/hive/warehouse/students' OVERWRITE INTO TABLE test_load;
https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html
It complains:
pyspark.sql.utils.AnalysisException: load data input path does not exist
when the path string has a timestamp in the directory structure, like
XX/XX/2021-03-02T20:04:27+00:00/file.parquet
It works with a path that has no timestamp. How can I work around this?

I haven't seen any file system that supports '2021-03-02T20:04:27+00:00' as a folder or file name. The ":" and "+" signs are usually treated as reserved characters, so you can't use them in file or folder names.
Check the manual of the file system you are using for its reserved characters.
Change your datetime format to something the operating system's file system supports, such as 'yyyy-mm-ddThhMMSS', e.g. '2021-03-02T200427'.
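
A minimal sketch of that renaming step in plain Python, assuming the directory name is an ISO 8601 timestamp with a UTC offset. The helper name safe_timestamp is hypothetical, and the output format follows the '2021-03-02T200427' example; note that the offset is dropped, so this only makes sense if all timestamps share one time zone.

```python
from datetime import datetime

def safe_timestamp(iso_string):
    """Turn '2021-03-02T20:04:27+00:00' into a name without ':' or '+'.

    Hypothetical helper: the UTC offset is parsed but not written out.
    """
    parsed = datetime.fromisoformat(iso_string)
    return parsed.strftime("%Y-%m-%dT%H%M%S")

print(safe_timestamp("2021-03-02T20:04:27+00:00"))
```

The files would then be written under (or moved to) the sanitized directory name before running LOAD DATA.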

Related

Archiving files using Pentaho PDI

I need to archive a txt file using Pentaho PDI, giving it a dynamic timestamp and appending a variable to the output filename. I used Get System Info, which automatically assigns the variable as well as its value. So my job was Start__ get system info___zip file. In the Zip file component, I tried calling the variable as ${Variable} when giving the output filename, but the output filename does not come out properly. It should be of the form filename__timestamp__variable. Can someone please help me with this?

how to read multiple text files into a dataframe in pyspark

I have a few txt files in a directory (I have only the path, not the names of the files) that contain JSON data, and I need to read all of them into a dataframe.
I tried this:
df=sc.wholeTextFiles("path/*")
but I can't even display the data, and my main goal is to perform queries on the data in different ways.

Instead of wholeTextFiles (which gives key-value pairs with the file name as key and the file content as value),
try read.json with your directory name; Spark will read all the files in the directory into a dataframe.
df=spark.read.json("<directory_path>/*")
df.show()
From docs:
wholeTextFiles(path, minPartitions=None, use_unicode=True)
Read a directory of text files from HDFS, a local file system
(available on all nodes), or any Hadoop-supported file system URI.
Each file is read as a single record and returned in a key-value pair,
where the key is the path of each file, the value is the content of
each file.
Note: Small files are preferred, as each file will be loaded fully in
memory.
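
To make the difference concrete without a Spark cluster, here is a plain-Python analogy of the two shapes of data. It is only a sketch: whole_text_files and read_json_dir are hypothetical helpers, not Spark APIs, the file contents are made up, and it assumes each file holds one JSON object per line.

```python
import json
import tempfile
from pathlib import Path

def whole_text_files(directory):
    """Mimic the (path, content) pairs described above: one pair per file."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    return [(str(p), p.read_text()) for p in files]

def read_json_dir(directory):
    """Parse every non-empty line of every file as one JSON record."""
    records = []
    for _, content in whole_text_files(directory):
        for line in content.splitlines():
            if line.strip():
                records.append(json.loads(line))
    return records

# Small demo with two temporary files (contents invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text('{"id": 1}\n{"id": 2}\n')
    Path(tmp, "b.txt").write_text('{"id": 3}\n')
    pairs = whole_text_files(tmp)
    rows = read_json_dir(tmp)

print(len(pairs), "files ->", len(rows), "records")
```

The pairs are one entry per file and still need parsing, while the rows are already one record per JSON object, which is the shape a dataframe query needs.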

How to ignore errors but not skip rows in redshift copy command

I have a nested JSON file in S3 as my source, and I am trying to copy it into Redshift.
My setup is as follows:
I use MAXERROR because I need to skip certain errors: the source file is missing certain fields in some records and has them in others.
I use a JSONPaths file to pick the fields that I need to copy to Redshift.
All the columns in the table are varchar.
Since I am using MAXERROR, the COPY command executes successfully, but the table has 0 records. Here is my COPY command:
COPY public.table(col1,col2,col3,col4,col5,col6)
from 's3://bucket/filename'
credentials 'redshift'
format as JSON 'jsonpathfile.json'
timeformat 'YYYY-MM-DDTHH:MI:SS'
EMPTYASNULL ACCEPTANYDATE ACCEPTINVCHARS TRUNCATECOLUMNS maxerror 100 ;
If I check into stl_load_errors it keeps saying
Invalid JSONPath format: Member is not an object.
Does this mean the COPY command cannot find even one object that fits the JSONPaths file?
That is definitely not the case: I inferred the schema of the input file to design the JSONPaths file.

Here is an example from COPY Examples - Amazon Redshift:
copy category
from 's3://mybucket/category_object_paths.json'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
json 's3://mybucket/category_jsonpath.json';
The path to the jsonpath file is specified fully, whereas your example just refers to the filename.
Try specifying the full path starting with s3:// and see whether that helps.
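
As a rough sketch of that suggestion, the snippet below builds a JSONPaths document and a COPY statement that refers to it by its full S3 URI. The field paths, the column list, and the jsonpaths_uri location are assumptions made for illustration; the bucket, file name, and credentials placeholder are copied from the question as-is.

```python
import json

# Hypothetical field paths; replace with the fields in your own source file.
paths = ["$['id']", "$['user']['name']", "$['user']['email']"]

# A JSONPaths file is a JSON object with a single "jsonpaths" array.
jsonpaths_document = json.dumps({"jsonpaths": paths}, indent=2)

# Assumed location of the uploaded JSONPaths file, written as a full S3 URI.
jsonpaths_uri = "s3://bucket/jsonpathfile.json"

copy_sql = (
    "COPY public.table(col1, col2, col3) "
    "FROM 's3://bucket/filename' "
    "CREDENTIALS 'redshift' "
    f"FORMAT AS JSON '{jsonpaths_uri}';"
)

print(jsonpaths_document)
print(copy_sql)
```

The number of entries in the "jsonpaths" array should match the number of columns listed in the COPY statement.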

Get file name from SAP Data service

I'm unable to read a file name in Data Services when it contains a date_time suffix. I can match the date, but the time varies. I've tried *.csv in the File name(s) property of the flat file, but that only works for a static file name.
Example: File_20180520_200003.csv, File_20180519_192503.csv, etc.
My script:
$Filename= 'File_'|| to_char(sysdate()-1, 'YYYYMMDD')|| '_'|| '*.csv';
I want a way to match the six time digits (any digits) where the * is.

Finally, I found a solution using
$Csv = word(exec('cmd','dir /b [$Filename]*.csv',8),2) ;
Then, in the flat file's File name(s) property, I set $Csv.
It works fine.
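
For comparison, the same matching idea can be sketched outside Data Services in plain Python. This is not Data Services script: pattern_for and pick_file are hypothetical helpers, and the file names are the examples from the question.

```python
from datetime import date, timedelta
from fnmatch import fnmatch

def pattern_for(day):
    """Build 'File_<YYYYMMDD>_' followed by six digit slots and '.csv'."""
    return f"File_{day:%Y%m%d}_" + "[0-9]" * 6 + ".csv"

def pick_file(names, today):
    """Return the first name matching yesterday's pattern, or None."""
    pattern = pattern_for(today - timedelta(days=1))
    matches = sorted(name for name in names if fnmatch(name, pattern))
    return matches[0] if matches else None

names = ["File_20180520_200003.csv", "File_20180519_192503.csv"]
print(pick_file(names, date(2018, 5, 21)))
```

Passing the current date explicitly keeps the lookup repeatable; in a real job it would come from the system date, as sysdate() does in the script above.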

CFSCRIPT - How to check the length of a filename before uploading

I ran into this problem when uploading a file with a super long name - my database field was only set to 50 characters. Since then, I have increased my database field length, but I'd like to have a way to check the length of the filename before uploading. Below is my code. The validation returns '85' as the character length. And it returns the same count for every different file I upload (none of which have a file name length of 85).
<cfscript>
missing_info = "<p>There was a slight problem with your submission. The following are required or invalid:</p><ul>";
// Check the length of the file name for our database field
if ( len(Form["ResumeFile1"]) gt 100 )
{
missing_info = missing_info & "<li>'Resume File 1' is invalid. Character length must be less than 100. Current count is " & len(Form["ResumeFile1"]) & ".</li>";
validation_error = true;
ResumeFileInvalidMarker = true;
}
</cfscript>
Anyone see anything wrong with this?
Thanks!

http://www.cfquickdocs.com/cf9/#cffile.upload
After you upload the file, the variable "clientFileName" will give you the name of the uploaded file, without a file extension.
The only way to read the filename before you upload it would be to use JavaScript to read and parse the value (file path) in the file field.

A quick clarification on the wording of your question: by the time your code executes, the file upload has already happened. The file resides in a temporary directory on the ColdFusion server, and the form field related to the file upload contains the temporary filename for that file, which is why len() returns the same count for every upload. Aside from checking whether a file has been specified, do not do anything directly with that file or you'll be circumventing some built-in security.
You want to use the cffile tag with the upload action (or equivalent udf) to move the temp file into a folder of your choosing. At that point you get access to a structure containing lots of information. Usually I "upload" into a temporary directory for the application, which should be outside of the webroot for security.
At this point you'll want to run any validation against the file, such as filename length, file type, file size, etc., and delete the file if it fails any check. If it passes all checks, you then move it to its final destination, which may be inside the webroot.
In your case you'll want to check the cffile structure element clientFile which is the original filename including extension (which you'll need to check, since an extension doesn't need to be present and can be any length).
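
The validation step described above can be sketched language-neutrally in Python; this is not CFScript, and validate_client_filename is a hypothetical helper. It assumes the original client file name is already available (the value clientFile holds after the upload) and uses the limit of 100 from the check in the question.

```python
import os

MAX_NAME_LENGTH = 100  # assumed limit, taken from the check in the question

def validate_client_filename(client_file, max_length=MAX_NAME_LENGTH):
    """Return a list of problems found in the original file name."""
    errors = []
    name = os.path.basename(client_file)
    _, extension = os.path.splitext(name)
    if len(name) > max_length:
        errors.append(
            f"File name is {len(name)} characters; the maximum is {max_length}."
        )
    if not extension:
        errors.append("File has no extension.")
    return errors

print(validate_client_filename("resume.pdf"))
print(validate_client_filename("a" * 120 + ".pdf"))
```

An empty list means the name passed both checks; a non-empty list is the cue to delete the uploaded file and report the messages to the user.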
