Hive Reading external table from compressed bz2 file - amazon-s3

This is my scenario.
I have a bz2 file in Amazon S3. Within the bz2 file, there are files with .dat, .met and .sta extensions. I am only interested in the files with *.dat extensions. You can download this sample file to take a look at the bz2 file.
create external table cdr (
anum string,
bnum string,
numOfTimes int
)
row format delimited
fields terminated by ','
lines terminated by '\n'
location 's3://mybucket/dir'; -- the bz2 file is in here
The problem is that when I execute the above command, some of the records/rows have issues:
1) all the data from files such as *.sta and *.met is also included;
2) the metadata of the filenames is also included.
The only idea I had was to show INPUT_FILE_NAME, but then all the records/rows had the same INPUT_FILE_NAME, which was filename.tar.bz2.
Any suggestions are welcome. I am currently completely lost.
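For illustration only (my addition, not part of the original post), here is a short Python sketch that lists the members of such an archive after downloading it locally, using the standard tarfile module. It shows the mix of .dat, .met and .sta members (plus the tar metadata) that Hive ends up reading as one undifferentiated text stream:

# Illustration only: inspect the archive locally to see which members it holds.
# "filename.tar.bz2" is a locally downloaded copy of the archive stored in S3.
import tarfile

with tarfile.open("filename.tar.bz2", "r:bz2") as archive:
    for member in archive.getmembers():
        if member.isfile():
            print(member.name)  # e.g. *.dat, *.met, *.sta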

Related

Aws s3 batch operation error: Task target couldn't be URL decoded

I need to restore a lot of objects from AWS S3 Glacier Deep Archive, so I am trying to use an S3 batch job. For that I use Python code to create a manifest as a CSV with two columns: Bucket,Key.
But my first issue: some keys contain a comma, so the job failed.
To (partially) solve this issue I just cut the CSV file down to the first two columns, hoping that not many files are involved.
But now I have another issue:
ErrorMessage: Task target couldn't be URL decoded
Any idea?
As mentioned at https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-ops-create-job.html#specify-batchjob-manifest, the manifest CSV file must be URL-encoded. The , character in a key name gets converted to %2C with URL encoding, so the resulting file will be valid CSV even with commas in the key names.
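A minimal sketch of building such a manifest in Python (my illustration, not from the original answer; the bucket name and output file are placeholders, and boto3 credentials are assumed to be configured):

import csv
from urllib.parse import quote

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"  # placeholder bucket name

with open("manifest.csv", "w", newline="") as f:
    writer = csv.writer(f)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            # URL-encode the key so a comma becomes %2C and the CSV stays valid
            writer.writerow([bucket, quote(obj["Key"])])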

Create an ADF Dataset to load multiple csv files (same format) from the Blob

I am trying to create a dataset containing multiple csv files from the Blob. In the file path of the dataset settings, I create a parameter - @dataset().FolderName - and add FolderName in the Parameters.
I leave the file name (in the File Path) empty as I want to grab all files in the folder. However, there is no data when I preview data. Is there anything missing? Thank you.
I have tested it on my side and it works fine: add the FolderName parameter, then preview the data.
If you want to merge all the csv files in a Data Flow, you can do this:
1. Output to a single file.
2. Set the partitioning to Single partition.

Mass extract part of a text file using Windows batch

I have thousands of txt files that are actually in JSON format.
Each file has the same string, with different values, namely:
"Name":"xxx","Email":"yyy#zzz.com"
I want to extract the values of these two strings from all the txt files that I put in the same folder.
I've found these lines of code:
Extract part of a text file using Windows batch
but it only applies to one txt file, whereas what I need is something that runs over all the files in one folder.
You can use the FORFILES command to loop through each file.
Syntax
FORFILES [/p Path] [/m SrchMask] [/s] [/c Command] [/d [+ | -] {date | dd}]
From the following webpage,
https://ss64.com/nt/forfiles.html
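If a scripting alternative is acceptable (my suggestion, not from the original answers), the same per-file loop can also be written in a few lines of Python, assuming each txt file is a single JSON object and the folder path is a placeholder:

import json
from pathlib import Path

for path in Path(r"C:\data\txtfiles").glob("*.txt"):  # placeholder folder
    try:
        record = json.loads(path.read_text(encoding="utf-8"))
        print(path.name, record.get("Name"), record.get("Email"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        # skip files that are not valid JSON despite the .txt extension
        print(f"skipped {path.name}")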

how to read multiple text files into a dataframe in pyspark

I have a few txt files in a directory (I have only the path and not the names of the files) that contain JSON data, and I need to read all of them into a dataframe.
I tried this:
df=sc.wholeTextFiles("path/*")
but I can't even display the data, and my main goal is to perform queries on the data in different ways.
Instead of wholeTextFiles (which gives key-value pairs, with the filename as the key and the file contents as the value), try read.json and give your directory name; Spark will read all the files in the directory into a dataframe.
df=spark.read.json("<directory_path>/*")
df.show()
From docs:
wholeTextFiles(path, minPartitions=None, use_unicode=True)
Read a directory of text files from HDFS, a local file system
(available on all nodes), or any Hadoop-supported file system URI.
Each file is read as a single record and returned in a key-value pair,
where the key is the path of each file, the value is the content of
each file.
Note: Small files are preferred, as each file will be loaded fully in
memory.
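As a short usage sketch (my addition, assuming an existing SparkSession named spark and line-delimited JSON files), the loaded dataframe can then be queried either through SQL or the DataFrame API, which is what the asker is after:

df = spark.read.json("<directory_path>/*")
df.printSchema()  # inspect the schema Spark inferred from the JSON files

# SQL-style queries via a temporary view
df.createOrReplaceTempView("records")
spark.sql("SELECT COUNT(*) FROM records").show()

# equivalent through the DataFrame API
print(df.count())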

SQLLDR control file: Loading multiple files

I am trying to load several data files into a single table. The files themselves have the following format:
file_uniqueidentifier.dat_date
My control file looks like this
LOAD DATA
INFILE '/home/user/file*.dat_*'
into TABLE NEWFILES
FIELDS TERMINATED BY ','
TRAILING NULLCOLS
(
FIRSTNAME CHAR NULLIF (FIRSTNAME=BLANKS)
,LASTNAME CHAR NULLIF (LASTNAME=BLANKS)
)
My SQLLDR command, on the other hand, looks like this:
sqlldr control=loader.ctl, userid=user/pass@oracle, errors=99999, direct=true
The error produced is: SQL*Loader-500: unable to open file (/home/user/file*.dat_*), followed by SQL*Loader-553: file not found.
Does anyone have an idea as to how I can deal with this issue?
SQLLDR does not recognize the wildcard. The only way to have it use multiple files is to list them explicitly. You could probably do this using a shell script.
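A sketch of that "list them explicitly" approach, written in Python rather than a shell script (my substitution, not from the original answer): it expands the wildcard and writes a control file with one INFILE line per data file, mirroring the poster's column list.

import glob

files = sorted(glob.glob("/home/user/file*.dat_*"))

ctl_lines = ["LOAD DATA"]
ctl_lines += [f"INFILE '{f}'" for f in files]
ctl_lines += [
    "INTO TABLE NEWFILES",
    "FIELDS TERMINATED BY ','",
    "TRAILING NULLCOLS",
    "(",
    " FIRSTNAME CHAR NULLIF (FIRSTNAME=BLANKS)",
    ",LASTNAME CHAR NULLIF (LASTNAME=BLANKS)",
    ")",
]

with open("loader.ctl", "w") as ctl:
    ctl.write("\n".join(ctl_lines) + "\n")
# then run: sqlldr control=loader.ctl userid=user/pass@oracle errors=99999 direct=true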
Your file naming convention looks like you could combine those files into one and have that single file be used by the SQL*Loader control file. I don't know how you would combine the files into one on Unix, but on Windows I can issue this command:
copy file*.dat* file.dat
This command reads the contents of all the files whose names start with file and have a dat extension and puts them into the file.dat file.
I have used this option and it works fine for loading multiple files into a single table:
-- SQL-Loader Basic Control File
options ( skip=1 )
load data
infile 'F:\oracle\dbHome\BIN\sqlloader\multi_file_insert\dept1.csv'
infile 'F:\oracle\dbHome\BIN\sqlloader\multi_file_insert\dept2.csv'
truncate into table scott.dept2
fields terminated by ","
optionally enclosed by '"'
( DEPTNO
, DNAME
, LOC
, entdate
)