Is there any way to copy files from S3 to Redshift through a Data Pipeline in a predefined order?

I am wondering if there is a way to set the order in which files are loaded into Redshift from S3 through Data Pipeline. I know we can use a manifest to specify the files, but I haven't found anything about the order in which the files are loaded.
For instance, my S3 folder1 has 10 files. In the data pipeline I point to this folder, but how can I set the order in which these files are loaded, if that is possible at all?
In short, as far as I understand, there is no way to load files in a predefined order while they are being consumed by a data pipeline. Please correct me if I am wrong.
I am thinking of a case where there are multiple source files and they have duplicate rows but with different values. In such a case the order in which the files are consumed is important.
For example, File1 and File2 are part of a data pipeline schedule and both files have a common customer entry named xyz: File1 has xyz Cost_owed 1000 and File2 has xyz Cost_owed 500. In reality the customer xyz owes just 500, but since I use delete-and-insert mode, the order of the files matters here. My Redshift table might end up with an entry for xyz of either 1000 or 500, so in this specific case, and others like it, the order of the files matters. Or should this be handled in some other way? If so,
can you give me some ideas?
Thanks

The order of files doesn't (and can't) matter for the COPY command in Redshift, since it is an MPP system.
Redshift relies on the SORTKEY of the target table to enforce ordering.
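To make that concrete, here is a rough sketch (the table, bucket and IAM role names are made up): a target table whose sort key governs ordering inside Redshift, and a manifest-driven COPY whose files are read in parallel with no per-file ordering guarantee.
-- hypothetical target table: ordering inside Redshift comes from the sort key,
-- not from the order in which files were listed in the manifest
create table customer_costs (
    customer_id varchar(64),
    cost_owed   decimal(12,2),
    loaded_at   timestamp
)
compound sortkey (customer_id, loaded_at);
-- manifest-driven load; the files named in the manifest are read in parallel
copy customer_costs
from 's3://my-bucket/folder1/files.manifest'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
csv
manifest;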

Related

Architectural design clarification

I built an API in Node.js + Express that allows React clients to upload CSV files (maximum size is at most 1 GB) to the server.
I also wrote another API which, given a filename and an array of row numbers as input, selects the rows corresponding to those row numbers from the previously stored file and writes them to a result file (writeStream).
The resultant file is then piped back to the client (all via streaming).
Currently, as you can see, I am using files (basically Node.js read and write streams) to manage this asynchronously.
But I have faced serious latency (only 2 cores are used) and some memory leakage (900 MB consumption) when I have 15 requests, each asking for about 600 rows to be retrieved from files of approximately 150 MB.
I also have planned an alternate design.
Basically, I will store the entire file as a SQL table with the row number as the primary, indexed key.
I will convert the user-supplied array of row numbers into another table using SQL unnest, and then join these two tables to get the rows needed.
Then I will supply the resultant table back to the client as a CSV file.
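Roughly, the lookup I have in mind would be something like this (Postgres-flavoured SQL; the table and column names are just placeholders):
-- one row per CSV line, keyed by its row number, e.g.
-- create table csv_rows (file_id text, row_number int, line text, primary key (file_id, row_number));
select r.row_number, r.line
from csv_rows r
join unnest(array[3, 17, 42]) as wanted(row_number)
  on r.row_number = wanted.row_number
where r.file_id = 'upload_123'
order by r.row_number;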
Would this architecture be better than the previous architecture?
Any suggestions from devs are highly appreciated.
Thanks.
Use the client to do all the heavy lifting by using the XLSX package for any manipulation of content. Then have an API to save information about the transaction. This removes the upload to and download from the server and helps you provide a better experience.

Apache Iceberg to index AWS S3

I have a use case where about 100M files are stored on S3. I keep a separate manifest file with the locations of these files, based on my data model.
I want to understand if Apache Iceberg is a good fit to provide indexing of my S3 files.
Reading through the Iceberg documentation, it seems to talk about creating a table with one column being the target S3 location. If that is the case, how is it different from a normal relational table where I store the S3 path in one column and keep the other columns relevant to my model?
Any pointers or examples where people have used S3 indexing with Iceberg would be helpful.
This is a bit different from what's actually going on. What Iceberg does is create a secondary level of metadata separate from the actual table data. This metadata is what actually contains the "path" field for a particular row.
If you look at the layout diagram in the Iceberg documentation, the path information is stored in the "manifest file" along with any metrics for that specific file. This means that when you perform a read, the query can determine which files to actually read without touching the underlying data files.
For example, if you have 3 files with a column A, the manifest file will look approximately like:
path/to/file1 | Max Value of ColA in File1 | Min Value of ColA in File1
path/to/file2 | Max Value of ColA in File2 | Min Value of ColA in File2
path/to/file3 | Max Value of ColA in File3 | Min Value of ColA in File3
So when I read that manifest, I can evaluate any predicates on ColA without touching file1, file2 or file3.
This "metadata" can be read from a "metadata table" in Iceberg, but that is just a SQL view over the metadata files for a particular snapshot.
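If you want to look at that metadata yourself, engines such as Spark expose it as queryable metadata tables; a hedged example, assuming a table db.my_table registered in an Iceberg catalog:
-- per-data-file metadata tracked by Iceberg: path, row count and column min/max bounds
select file_path, record_count, lower_bounds, upper_bounds
from db.my_table.files;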

How to get one file in Hive

I tried a Hive process which generates a word-frequency ranking from sentences, and I would like to output one file rather than multiple files.
I searched for similar questions on this site and found mapred.reduce.tasks=1, but it didn't generate one file; it generated 50 files.
The process I tried has 50 input files and they are all gzip files.
How do I get one merged file?
The total size of the 50 input files is so large that I suppose the reason may be some kind of limit.
In your job, use an ORDER BY clause on some field.
Hive will then enforce a single reducer, and as a result you will end up with one file created in HDFS.
hive> insert into table default.target
      select * from default.source
      order by id;
For more details on the ORDER BY clause, refer to this and this link.
Thank you for your kind answers, you are really saving me.
I am trying ORDER BY, but it is taking a long time, so I am waiting for it.
All I have to do is get one file so that the output can become the input of the next step.
I am also going to try simply catting all the files from the reducer outputs, as advised. If I do that, I am worried about whether the word lists in the files are disjoint (no word appearing in more than one file), and whether the result of catting multiple gzip files is a normal gzip file.

How to preserve Google Cloud Storage rows order in compressed files

We've created a query in BigQuery that returns SKUs and correlations between them. Something like:
sku_0,sku_1,0.023
sku_0,sku_2,0.482
sku_0,sku_3,0.328
sku_1,sku_0,0.023
sku_1,sku_2,0.848
sku_1,sku_3,0.736
The result has millions of rows and we export it to Google Cloud Storage which results in several compressed files.
These files are downloaded and we have a Python application that loops through them to make some calculations using the correlations.
We then tried to take advantage of the fact that our first column of SKUs is already ordered, so that we would not have to apply this ordering inside our application.
But we found that the files we get from GCS change the order in which the SKUs appear.
It looks like the files are created by several processes reading the results and saving them in different files, which breaks the ordering we wanted to maintain.
As an example, if 2 files are created, the first file would look something like this:
sku_0,sku_1,0.023
sku_0,sku_3,0.328
sku_1,sku_2,0.848
And the second file:
sku_0,sku_2,0.482
sku_1,sku_0,0.023
sku_1,sku_3,0.736
This is an example of what it looks like when two processes read the results and each saves its current row to its own file, which breaks the ordering of the first column.
So we looked for a flag we could use to force the ordering to be preserved, but haven't found any so far.
Is there some way we could use to force the order in these GCS files to be preserved? Or is there some workaround?
Thanks in advance,
As far as I know, there is no flag to maintain order.
As a workaround you can rethink your data output to make use of the NESTED type: make sure that what you want to group together is converted into nested rows, and then you can export to JSON.
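A rough sketch of what that could look like (standard SQL in BigQuery; the project, dataset and column names are placeholders), grouping each sku_0's correlations into a single nested, ordered row before exporting to JSON:
-- one output row per sku_0, with its correlated SKUs nested and ordered inside the row
select
  sku_0,
  array_agg(struct(sku_1, correlation) order by sku_1) as correlations
from `my-project.my_dataset.sku_correlations`
group by sku_0;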
is there some workaround?
As an option, you can move your processing logic from Python to BigQuery, thus eliminating the need to move data out of BigQuery to GCS.

How should I partition data in s3 for use with hadoop hive?

I have an S3 bucket containing about 300 GB of log files in no particular order.
I want to partition this data for use in hadoop-hive using a date-time stamp, so that log lines related to a particular day are clumped together in the same S3 'folder'. For example, log entries for January 1st would be in files matching the following naming:
s3://bucket1/partitions/created_date=2010-01-01/file1
s3://bucket1/partitions/created_date=2010-01-01/file2
s3://bucket1/partitions/created_date=2010-01-01/file3
etc
What would be the best way for me to transform the data? Am I best just running a single script that reads each file in turn and outputs the data to the right S3 location?
I'm sure there's a good way to do this using hadoop, could someone tell me what that is?
What I've tried:
I tried using hadoop-streaming by passing in a mapper that collected all log entries for each date and then wrote them directly to S3, returning nothing to the reducer, but that seemed to create duplicates (using the above example, I ended up with 2.5 million entries for Jan 1st instead of 1.4 million).
Does anyone have any ideas how best to approach this?
If Hadoop has free slots in the task tracker, it will run multiple copies of the same task. If your output format doesn't properly ignore the resulting duplicate output keys and values (which is possibly the case for S3; I've never used it), you should turn off speculative execution. If your job is map-only, set mapred.map.tasks.speculative.execution to false. If you have a reducer, set mapred.reduce.tasks.speculative.execution to false. Check out Hadoop: The Definitive Guide for more information.
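For example, if you drive the job through Hive you can disable it for the session like this (for plain hadoop-streaming, pass the same properties with -D on the command line; exact property names vary a bit between Hadoop versions):
-- turn off speculative task execution so each output is written exactly once
set mapred.map.tasks.speculative.execution=false;
set mapred.reduce.tasks.speculative.execution=false;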
Why not create an external table over this data, then use Hive to create the new, partitioned table?
set hive.exec.dynamic.partition.mode=nonstrict;  -- allow a fully dynamic-partition insert
create table partitioned (some_field string, `timestamp` string) partitioned by (created_date string);
insert overwrite table partitioned partition (created_date)
select some_field, `timestamp`, to_date(`timestamp`) from orig_external_table;
In fact, I haven't looked up the syntax, so you may need to correct it with reference to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries.