Is there a way to move data from a Delta Table in S3 to Redshift using the COPY command?

I have Delta tables that were created in an S3 bucket and need to load this data as-is into Redshift tables.
The Delta tables have Symlink Format Manifests generated, and some of them may be partitioned.
Is there a way to move this data into Redshift?
I've tried running the COPY command on the Delta table, giving the path up to the table name:
COPY <schema>.<table>
FROM 's3://<bucket-name>/<delta_table_name>/'
IAM_ROLE '<iam_role>'
FORMAT AS PARQUET
I also tried running the COPY command using the path to the Symlink Format Manifest in S3. Neither approach worked:
COPY <schema>.<table>
FROM 's3://<bucket-name>/<delta_table_name>/_symlink_format_manifest/PARTITION1=VALUE1/PARTITION2=VALUE2/manifest'
IAM_ROLE '<iam_role>'
FORMAT AS PARQUET
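Note that COPY's MANIFEST option expects Redshift's JSON manifest format, not the plain-text file list that Delta's Symlink Format Manifest contains, so pointing COPY at the symlink manifest directly will not work. One possible workaround is sketched below in Python, with the bucket, table and role names as placeholders and assuming the symlink manifest lists full s3:// URLs: convert the symlink manifest into a Redshift JSON manifest and COPY with the MANIFEST keyword. Another commonly documented route is a Redshift Spectrum external table defined over the symlink manifest.
import json
import boto3

s3 = boto3.client("s3")

bucket = "<bucket-name>"
# Unpartitioned case; partitioned tables have one manifest per partition directory.
symlink_key = "<delta_table_name>/_symlink_format_manifest/manifest"
redshift_manifest_key = "<delta_table_name>/_redshift_manifest/manifest.json"

# The symlink manifest contains one s3:// Parquet file URL per line.
body = s3.get_object(Bucket=bucket, Key=symlink_key)["Body"].read().decode("utf-8")

entries = []
for url in filter(None, (line.strip() for line in body.splitlines())):
    key = url.replace(f"s3://{bucket}/", "", 1)
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    # For columnar formats (Parquet/ORC), Redshift requires meta.content_length per manifest entry.
    entries.append({"url": url, "mandatory": True, "meta": {"content_length": size}})

s3.put_object(
    Bucket=bucket,
    Key=redshift_manifest_key,
    Body=json.dumps({"entries": entries}).encode("utf-8"),
)

copy_sql = f"""
COPY <schema>.<table>
FROM 's3://{bucket}/{redshift_manifest_key}'
IAM_ROLE '<iam_role>'
FORMAT AS PARQUET
MANIFEST;
"""
# Run copy_sql with your SQL client of choice (psycopg2, the Redshift Data API, etc.).
print(copy_sql)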

Related

pyspark script to copy any parquet file data to any oracle table

We have an S3 bucket called Customers/
Inside it we have multiple folders, and sub-folders inside those.
And finally we have Parquet files of data.
Now I want to read any Parquet file (not a specific one) and load its data into Oracle.
For now my script works for one S3 path, where it reads one Parquet file, e.g. customer_info.parquet, and loads the data into an Oracle database table called customer.customer_info.
I need help generating a generic script that can read any Parquet file and load the data into the corresponding database table.
For example:
S3 location: s3/Customers/new_customrers/new_customer_info.parquet
Oracle database: Customer
Oracle table: new_customers
S3 location: s3/Customers/old_customrers/old_customer_info.parquet
Oracle database: Customer
Oracle table: old_customers
S3 location: s3/Customers/current_customrers/current_customer_info.parquet
Oracle database: Customer
Oracle table: current_customers
Is there any way to make this copy process generic? The database will be the same; only the Oracle table changes according to the Parquet file.
My current script is a PySpark script that reads one S3 file into a Spark DataFrame and writes that DataFrame to one Oracle table.
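A possible generic version of such a PySpark script is sketched below; it is only an illustration, with the folder-to-table mapping, bucket path, JDBC URL and credentials as placeholders, and it assumes the Oracle JDBC driver jar is on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-parquet-to-oracle").getOrCreate()

# Map each S3 folder to its Oracle table; extend this as new folders appear.
FOLDER_TO_TABLE = {
    "new_customrers": "CUSTOMER.NEW_CUSTOMERS",
    "old_customrers": "CUSTOMER.OLD_CUSTOMERS",
    "current_customrers": "CUSTOMER.CURRENT_CUSTOMERS",
}

def load_folder_to_oracle(folder: str) -> None:
    # Reads every Parquet file under the folder, not one specific file.
    df = spark.read.parquet(f"s3a://<bucket>/Customers/{folder}/")
    (df.write
       .format("jdbc")
       .option("url", "jdbc:oracle:thin:@//<host>:1521/<service>")  # placeholder
       .option("dbtable", FOLDER_TO_TABLE[folder])
       .option("user", "<user>")          # placeholder
       .option("password", "<password>")  # placeholder
       .option("driver", "oracle.jdbc.OracleDriver")
       .mode("append")
       .save())

for folder in FOLDER_TO_TABLE:
    load_folder_to_oracle(folder)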

Which file format should I use that supports appending?

Currently we use the ORC file format to store incoming traffic in S3 for fraud detection analysis.
We chose the ORC file format for the following reasons:
compression
and the ability to query the data using Athena
Problem:
ORC files are read-only, and we want to update the file contents constantly, every 20 minutes, which means we
need to download the ORC files from S3,
read the file,
write to the end of the file,
and finally upload it back to S3.
This wasn't a problem at first, but the data grows significantly, by about 2 GB every day, so it is a very costly process to download ~10 GB of files, read them, append to them, and upload them again.
Question:
Is there another file format that supports appends/inserts and can also be queried by Athena?
From this article it seems that Avro is such a format, but I'm not sure:
Can Athena be used to query it?
Are there any other issues?
Note: my skill level with big data technologies is beginner.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the table's target S3 path and they will be available instantly for querying via Athena.
If your table is partitioned, you can copy the new files to the paths corresponding to your specific partitions. After copying new files into a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'

How to read data partitions in S3 from Trino

I'm trying to read data partitions in S3 from Trino.
What I did exactly:
I uploaded my data with all partitions into S3. I have a specified Avro schema, which I put in the local file system.
Then I created an external Hive table pointing to the data location in S3 and to the Avro schema in the local file system.
The table is created.
Then, normally, I should be able to query my data and partitions in S3 from Trino.
trino> select * from hive.default.my_table;
It returns only the column names.
trino> select * from hive.default."my_table$partitions";
It returns only the names of the partitions.
Could you please suggest a solution for how I can read data partitions in S3 from Trino?
Note that I'm using Apache Hive 2; even when I query the table in Hive to return the table partitions, it returns OK but doesn't display anything. I think that's because with Hive 2 we should use the MSCK command.
In Hive, uploading partition folders and files into S3 and creating the table is not enough; partition metadata must also be created. Normally you can have folders that are not mounted as partitions. To mount all existing sub-folders in the table location as partitions:
Use msck repair table command:
MSCK [REPAIR] TABLE tablename;
or Amazon EMR version:
ALTER TABLE tablename RECOVER PARTITIONS;
This will create partition metadata in the Hive metastore and the partitions will become available.
Read more details about both commands here: RECOVER PARTITIONS
I faced the same issue. Once the table is created, we need to manually sync the partition metadata to the metastore using the Trino command below:
CALL system.sync_partition_metadata('<schema>', '<table>', 'ADD');
Ref.: https://trino.io/episodes/5.html

Importing data into a Hive table from an external S3 bucket URL

I need to import data from a public S3 bucket whose URL was shared with me. How do I load the data into a Hive table?
I have tried the command below but it's not working:
create external table airlines_info (.... ) row format
delimited fields terminated by '|' lines terminated by '\n'
stored as textfile location 'https://ml-cloud-dataset.....*.txt';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:ml-cloud-dataset.s3.amazonaws.com/Airlines_data.txt is not a directory or unable to create one)
I am very new to Hive and I am not sure about the code. I also tried the code below after creating the table to load the data into the Hive table, but that's not working either:
load data inpath 'https://ml-cloud-dataset.....*.txt' into table airlines_info;
The table location should be a directory in HDFS or S3, not a file and not an HTTPS link.
Download the file manually, put it into the local filesystem, and if you already have the table created, then use:
load data local inpath 'local_path_to_file' into table airlines_info;
If you do not have the table yet, create it and specify some location inside your S3 bucket. Alternatively, create a MANAGED table (remove EXTERNAL from your DDL) without a location specified; it will create the location for you. Check the location using the DESCRIBE FORMATTED command. Later you can convert the table to EXTERNAL if necessary using ALTER TABLE airlines_info SET TBLPROPERTIES('EXTERNAL'='TRUE');
Instead of the LOAD DATA command, you can simply copy the file into the table location using the AWS CLI (provide the correct local path and table directory S3 URL):
aws s3 cp C:\Users\My_user\Downloads\Airlines_data.txt s3://mybucket/path/airlines_info/

Files lost by overwriting into hive managed tables

I am using Hadoop 2.7.3 and Hive 2.1.1.
I had some 8-9 files in HDFS. I created an internal Hive table and loaded the first of those files into it. I did some operations on that data.
After that I loaded the second of those files, overwriting into that table:
load data inpath '/path/path1/first.csv' into table ABC;
load data inpath '/path/path1/second.csv' overwrite into table ABC;
I did some operations on the second file's data.
I then loaded the third file, and so on until the last file, using "overwrite into".
Now I see that those files are no longer in their original locations. Also, at /user/hive/warehouse/ABC only the last of the files is there.
Where did the previous files go? Are they lost because of overwriting into the Hive table? I ran hdfs dfs -ls -R / | grep "filename" but could not find my files.
LOAD DATA INPATH will move (not copy) the file from the source HDFS path to the table warehouse path.
OVERWRITE will delete the files that already exist in the table (or, if HDFS Trash is enabled, move them to Trash) and replace them with the files given in the path.
LOAD DATA LOCAL INPATH copies the files.
LOAD DATA INPATH moves the files.
OVERWRITE deletes existing files before moving in the new files.
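If keeping the source files in place matters, one alternative (a sketch, not part of the answer above) is to read the CSV with Spark and insert into the Hive table instead of using LOAD DATA; reading copies the data, so the originals stay where they are. The path and table name below are the ones from the question.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("append-into-abc")
         .enableHiveSupport()     # needed to write into a Hive table
         .getOrCreate())

# Unlike LOAD DATA INPATH, reading does not move the source file.
df = spark.read.csv("/path/path1/second.csv")

# Appends by default; pass overwrite=True to mimic OVERWRITE INTO.
df.write.insertInto("ABC")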