How to merge changes from one stream to another in RTC source control

I have been working on a stream (s1) and I want to merge these changes with another stream (s2).
I do not want to deliver my changes to s2, but rather create or use a stream that contains the merge of s1 and s2.
I think I have two possible solutions:
1.
Create a new stream based on s2, let's call it s3, and change my flow target to s3.
Deliver all changes to s3.
I don't think I will lose change-set history with this approach?
2.
Change my flow target to s2
Accept all changes from s2
Change my flow target to s1
Deliver my changes to s1
Which option should I choose, and are there alternatives?

"1." is the safest, isolating the result of the merge in S3
"2." would publish the result of the merge directly in S1
So it depends on who needs the result of this merge, and for what.
If you need to test the result of that merge a bit while you go on developing S1, then having S3 is handy.
But if you need the S2 developers' changes merged in while developing S1, then scenario "2." is the more direct approach.

Related

How do I use awswrangler to read only the first few N rows of a parquet file stored in S3?

I am trying to use awswrangler to read into a pandas dataframe an arbitrarily-large parquet file stored in S3, but limiting my query to the first N rows due to the file's size (and my poor bandwidth).
I cannot see how to do it, or whether it is even possible without relocating.
Could I use chunked=INTEGER and abort after reading the first chunk, say, and if so how?
I have come across this incomplete solution (last N rows ;) ) using pyarrow - Read last N rows of S3 parquet table - but a time-based filter would not be ideal for me and the accepted solution doesn't even get to the end of the story (helpful as it is).
Or is there another way without first downloading the file (which I could probably have done by now)?
Thanks!
You can do that with awswrangler using S3 Select. For example:
import awswrangler as wr

df = wr.s3.select_query(
    sql="SELECT * FROM s3object s limit 5",
    path="s3://amazon-reviews-pds/parquet/product_category=Gift_Card/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
    input_serialization="Parquet",
    input_serialization_params={},
    use_threads=True,
)
This would return only 5 rows from the S3 object.
This is not possible with the other read methods, because the entire object must be pulled locally before reading it. With S3 Select, the filtering is done on the server side instead.
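If you would rather stick with the chunked approach mentioned in the question, a minimal sketch (assuming an integer chunked argument, which makes wr.s3.read_parquet yield DataFrames of roughly that many rows) could look like this; note that how much data is actually transferred still depends on the file's row-group layout:

import awswrangler as wr

# chunked=N returns an iterator of DataFrames instead of one big DataFrame,
# so we can stop after the first batch.
chunks = wr.s3.read_parquet(
    path="s3://your-bucket/your-file.snappy.parquet",  # hypothetical path
    chunked=5,
)
df = next(iter(chunks))  # first ~5 rows; remaining batches are never materialized

Unlike S3 Select, this still reads at least one row group from the file, so for very large row groups it may not save as much bandwidth.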

Is there any way to copy files from S3 to redshift through a datapipeline in a predefined order

I am wondering if there is a way to set the order in which files are loaded into Redshift from S3 through Data Pipeline. I know we can use a manifest to specify the files, but I haven't found anything about the order in which the files are loaded.
For instance, my S3 folder1 has 10 files. In the data pipeline I point it to this folder, but how can I set the order in which these files are loaded, if that is possible at all?
In short, as far as I understand, there is no way to load files in a predefined order while they are being consumed by a data pipeline. Correct me if I am wrong.
I am thinking of a case where there are multiple source files that can contain duplicate rows with different values. In such a case the order in which the files are consumed is important.
For example, File1 and File2 are part of a data pipeline schedule and both files have a common customer entry named xyz: File1 has xyz Cost_owed 1000, File2 has xyz Cost_owed 500. In reality the customer xyz owes just 500, but since I use delete-and-insert mode the order of the files is important here. My Redshift table might end up with an entry for xyz of either 1000 or 500, so in this specific case (and others like it) the order of the files matters. Or should this be handled in some other way? If so, can you give me some ideas?
Thanks
The order of the files doesn't (and can't) matter for the COPY command in Redshift, since it is an MPP system that loads files in parallel.
Redshift relies on the SORTKEY of the target table to enforce ordering.

Stream BigQuery table into Google Pub/Sub

I have a Google BigQuery table and I want to stream the entire table into a Pub/Sub topic.
What would be the easiest/fastest way to do it?
Thank you in advance,
2019 update:
Now it's really easy with a click-to-bigquery option in Pub/Sub:
Find it on: https://console.cloud.google.com/cloudpubsub/topicList
The easiest way I know of is going through Google Cloud Dataflow, which natively knows how to access BigQuery and Pub/Sub.
In theory it should be as easy as the following Python lines:
p = beam.Pipeline(options=pipeline_options)

tablerows = p | 'read' >> beam.io.Read(
    beam.io.BigQuerySource('clouddataflow-readonly:samples.weather_stations'))

tablerows | 'write' >> beam.io.Write(
    beam.io.PubSubSink('projects/fh-dataflow/topics/bq2pubsub-topic'))
This combination of Python/Dataflow/BigQuery/PubSub doesn't work today (Python Dataflow is in beta, but keep an eye on the changelog).
We can do the same with Java, and it works well - I just tested it. It runs both locally and in the hosted Dataflow runner:
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

PCollection<TableRow> weatherData = p.apply(
    BigQueryIO.Read.named("ReadWeatherStations").from("clouddataflow-readonly:samples.weather_stations"));

weatherData.apply(ParDo.named("tableRow2string").of(new DoFn<TableRow, String>() {
    @Override
    public void processElement(DoFn<TableRow, String>.ProcessContext c) throws Exception {
        c.output(c.element().toString());
    }
})).apply(PubsubIO.Write.named("WriteToPubsub").topic("projects/myproject/topics/bq2pubsub-topic"));

p.run();
Test if the messages are there with:
gcloud --project myproject beta pubsub subscriptions pull --auto-ack sub1
Hosted Dataflow screenshot:
That really depends on the size of the table.
If it's a small table (a few thousand records, a couple dozen columns) then you could set up a process to query the entire table, convert the response into a JSON array, and push it to Pub/Sub.
If it's a big table (millions/billions of records, hundreds of columns) you'd have to export it to a file, and then prepare/ship it to Pub/Sub.
It also depends on your partitioning policy - if your tables are set up to partition by date you might be able to, again, query instead of export.
Last but not least, it also depends on the frequency - is this a one time deal (then export) or a continuous process (then use table decorators to query only the latest data)?
Need some more information if you want a truly helpful answer.
Edit
Based on your comments for the size of the table, I think the best way would be to have a script that would:
Export the table to GCS as newline delimited JSON
Process the file (read line by line) and send to pub-sub
There are client libraries for most programming languages. I've done similar things with Python, and it's fairly straightforward.
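As a rough illustration of step 2, here is a minimal Python sketch using the google-cloud-storage and google-cloud-pubsub client libraries; the bucket, file, project, and topic names are hypothetical, and step 1 (the newline-delimited JSON export to GCS) is assumed to have already run:

from google.cloud import pubsub_v1, storage

storage_client = storage.Client()
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("myproject", "bq2pubsub-topic")

# Stream the exported newline-delimited JSON file and publish one message per row.
blob = storage_client.bucket("my-export-bucket").blob("weather_stations.json")
with blob.open("r") as f:
    for line in f:
        line = line.strip()
        if line:
            publisher.publish(topic_path, line.encode("utf-8"))

publisher.publish returns a future, so for a one-off script you may want to collect the futures and wait on them before exiting to make sure nothing is dropped.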

boto3's atomic test and create?

In ordinary file systems it is common to rely on the pattern of trying to create a file and failing if it already exists, to guarantee that you are creating a unique filename.
How can the same be achieved with S3: if I have many parallel tasks creating keys with random names on S3, how can I "test and write" atomically, to guarantee that there is no race and I don't end up with corrupted data?
Thanks
After a few days of thinking, I believe I have found a very decent solution to my own problem: activate versioning on the bucket and freely save the key name you want. From the response, take the versionId and encode the object URL in an agreed format (e.g. s3://your-bucket/your-key?versionId=XXXXX). This URL always refers to the object you wanted to save in the first place, with no possibility of clashes/races.
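A minimal boto3 sketch of that idea (the bucket and key names are placeholders, and versioning is assumed to be already enabled on the bucket):

import boto3

s3 = boto3.client("s3")

# With versioning enabled, every put_object succeeds and gets its own VersionId,
# so concurrent writers to the same key can never clobber each other's object.
response = s3.put_object(Bucket="your-bucket", Key="your-key", Body=b"payload")
version_id = response["VersionId"]

# An unambiguous reference to exactly the object we just wrote:
url = f"s3://your-bucket/your-key?versionId={version_id}"

# Later, fetch that specific version:
obj = s3.get_object(Bucket="your-bucket", Key="your-key", VersionId=version_id)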

Mapreduce Table Diff

I have two versions (old/new) of a database table with about 100,000,000 records. They are in files:
trx-old
trx-new
The structure is:
id date amount memo
1 5/1 100 slacks
2 5/1 50 wine
id is the simple primary key, other fields are non-key. I want to generate three files:
trx-removed (ids of records present in trx-old but not in trx-new)
trx-added (records from trx-new whose ids are not present in trx-old)
trx-changed (records from trx-new whose non-key values have changed since trx-old)
I need to do this operation every day in a short batch window. And actually, I need to do this for multiple tables and across multiple schemas (generating the three files for each) so the actual app is a bit more involved. But I think the example captures the crux of the problem.
This feels like an obvious application for mapreduce. Having never written a mapreduce application my questions are:
is there some EMR application that already does this?
is there an obvious Pig or maybe Cascading solution lying about?
is there some other open source example that is very close to this?
PS I saw the diff between tables question but the solutions over there didn't look scalable.
PPS Here is a little Ruby toy that demonstrates the algorithm: Ruby dbdiff
I think it would be easiest just to write your own job, mostly because you'll want to use MultipleOutputs to write to the three separate files from a single reduce step, whereas the typical reducer only writes to one file. You'd need to use MultipleInputs to specify a mapper for each table.
This seems like the perfect problem to solve in Cascading. You have mentioned that you have never written an MR application, and if the intent is to get started quickly (assuming you are familiar with Java) then Cascading is the way to go IMHO. I'll touch more on this in a second.
It is possible to use Pig or Hive, but these aren't as flexible if you want to perform additional analysis on these columns or change schemas, since in Cascading you can build your schema on the fly by reading the column headers or a mapping file you create to denote the schema.
In Cascading you would:
Set up your incoming Taps: Tap trxOld and Tap trxNew (these point to your source files)
Connect your Taps to Pipes: Pipe oldPipe and Pipe newPipe
Set up your outgoing Taps: Tap trxRemoved, Tap trxAdded and Tap trxChanged
Build your Pipe analysis (this is where the fun (hurt) happens)
trx-removed and trx-added:
Pipe trxOld = new Pipe("old-stuff");
Pipe trxNew = new Pipe("new-stuff");

// smallest size Pipe on the right in CoGroup
Pipe oldNnew = new CoGroup("old-N-new", trxOld, new Fields("id1"),
                           trxNew, new Fields("id2"),
                           new OuterJoin());
The outer join gives us NULLS where ids are missing in the other Pipe (your source data), so we can use FilterNotNull or FilterNull in the logic that follows to get us final pipes that we then connect to Tap trxRemoved and Tap trxAdded accordingly.
trx-changed:
Here I would first concatenate the fields that you are looking for changes in using FieldJoiner, then use an ExpressionFilter to give us the zombies (because they changed), something like:
Pipe valueChange = new Pipe("changed", oldNnew);
valueChange = new Each(valueChange, new Fields("oldValues", "newValues"),
        new ExpressionFilter("oldValues.equals(newValues)", String.class));
What this does is filter out fields with the same value and keep the differences: if the expression above is true, the filter removes that record. Finally, connect your valueChange pipe to your Tap trxChanged and you will have three outputs with all the data you are looking for, with code that allows for some added analysis to creep in.
As @ChrisGerken suggested, you would have to use MultipleOutputs and MultipleInputs in order to generate multiple output files and associate a custom mapper with each input file type (old/new).
The mapper would output:
key: primary key (id)
value: record from input file with additional flag (new/old depending on the input)
The reducer would iterate over all records R for each key and output:
to removed file: if only a record with flag old exists.
to added file: if only a record with flag new exists.
to changed file: if records in R differ.
As this algorithm scales with the number of reducers, you'd most likely need a second job, which would merge the results to a single file for a final output.
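A rough Hadoop Streaming sketch of that mapper/reducer pair in Python (the tab-separated layout, the old/new tagging convention, and the way the flag is chosen per input file are all illustrative assumptions, not part of the answer above; in a real job you would route the three prefixes to separate files, e.g. with MultipleOutputs, rather than prefixing each line):

import itertools
import sys

def mapper(flag):
    # flag is "old" or "new", chosen per input file (for example by running one
    # streaming map job per input, or by inspecting the input file name).
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        record_id, rest = fields[0], "\t".join(fields[1:])
        print(f"{record_id}\t{flag}\t{rest}")

def reducer():
    # Relies on the framework sorting mapper output by id, so both versions
    # of a record arrive next to each other.
    rows = (line.rstrip("\n").split("\t", 2) for line in sys.stdin)
    for record_id, group in itertools.groupby(rows, key=lambda r: r[0]):
        versions = {flag: rest for _, flag, rest in group}
        if "new" not in versions:
            print(f"removed\t{record_id}")
        elif "old" not in versions:
            print(f"added\t{record_id}\t{versions['new']}")
        elif versions["old"] != versions["new"]:
            print(f"changed\t{record_id}\t{versions['new']}")

if __name__ == "__main__":
    if sys.argv[1] == "map":
        mapper(sys.argv[2])  # "old" or "new"
    else:
        reducer()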
What comes to my mind is this:
Consider that your tables look like this:
Table_old
1 other_columns1
2 other_columns2
3 other_columns3
Table_new
2 other_columns2
3 other_columns3
4 other_columns4
Append "a" to table_old's elements and "b" to table_new's elements.
When you merge both files, if an element exists in the first file but not in the second, it has been removed:
table_merged
1a other_columns1
2a other_columns2
2b other_columns2
3a other_columns3
3b other_columns3
4a other_columns4
From that file you can do your operations easily.
Also, let's say your ids are n digits long and you have 10 clusters + 1 master. Your key would be the first digit of the id; therefore, you divide the data evenly across the clusters. You would do grouping + partitioning so that your data would be sorted.
Example,
table_old
1...0 data
1...1 data
2...2 data
table_new
1...0 data
2...2 data
3...2 data
Your key is first digit and you do grouping according to that digit, and your partition is according to rest of id. Then your data is going to come to your clusters as
worker1
1...0b data
1...0a data
1...1a data
worker2
2...2a data
2...2b data
and so on.
Note that a and b don't have to be sorted.
EDIT
The merge is going to be like this:
FileInputFormat.addInputPath(job, new Path("trx-old"));
FileInputFormat.addInputPath(job, new Path("trx-new"));
MR will get two inputs and the two files will be merged.
For the appending part, you should create two more jobs before the main MR job, each of which will have only a Map. The first Map will append "a" to every element in the first list and the second will append "b" to the elements of the second list. The third job (the one we are using now, the main one) will only have a Reduce job to collect them. So you will have Map-Map-Reduce.
Appending can be done like this:
//you have key:Text
new Text(String.valueOf(key.toString()+"a"))
but I think there may be different ways of appending, some of which may be more efficient (in Hadoop Text handling).
Hope this is helpful.