Mapreduce Table Diff - sql

I have two versions (old/new) of a database table with about 100,000,000 records. They are in files:
trx-old
trx-new
The structure is:
id date amount memo
1 5/1 100 slacks
2 5/1 50 wine
id is the simple primary key, other fields are non-key. I want to generate three files:
trx-removed (ids of records present in trx-old but not in trx-new)
trx-added (records from trx-new whose ids are not present in trx-old)
trx-changed (records from trx-new whose non-key values have changed since trx-old)
I need to do this operation every day in a short batch window. And actually, I need to do this for multiple tables and across multiple schemas (generating the three files for each) so the actual app is a bit more involved. But I think the example captures the crux of the problem.
This feels like an obvious application for mapreduce. Having never written a mapreduce application my questions are:
is there some EMR application that already does this?
is there an obvious Pig or maybe Cascading solution lying about?
is there some other open source example that is very close to this?
PS I saw the diff between tables question but the solutions over there didn't look scalable.
PPS Here is a little Ruby toy that demonstrates the algorithm: Ruby dbdiff

I think it would be easiest just to write your own job, mostly because you'll want to use MultipleOutputs to write to the three separate files from a single reduce step, whereas a typical reducer writes to only one file. You'd also need to use MultipleInputs to specify a mapper for each table.
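For reference, a minimal driver sketch of that setup (a sketch only, using the org.apache.hadoop.mapreduce API; OldTrxMapper, NewTrxMapper and DiffReducer are hypothetical class names, and a reducer along these lines is sketched further down the thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TableDiffDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "table-diff");
    job.setJarByClass(TableDiffDriver.class);

    // One mapper per input file; both emit (id, flagged record).
    MultipleInputs.addInputPath(job, new Path("trx-old"), TextInputFormat.class, OldTrxMapper.class);
    MultipleInputs.addInputPath(job, new Path("trx-new"), TextInputFormat.class, NewTrxMapper.class);

    job.setReducerClass(DiffReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // Named outputs the reducer writes to via MultipleOutputs.
    MultipleOutputs.addNamedOutput(job, "removed", TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(job, "added", TextOutputFormat.class, Text.class, Text.class);
    MultipleOutputs.addNamedOutput(job, "changed", TextOutputFormat.class, Text.class, Text.class);

    FileOutputFormat.setOutputPath(job, new Path("trx-diff"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The three named outputs end up as removed-r-*, added-r-* and changed-r-* part files under the job's output directory.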

This seems like the perfect problem to solve in Cascading. You have mentioned that you have never written an MR application, and if the intent is to get started quickly (assuming you are familiar with Java) then Cascading is the way to go, IMHO. I'll touch more on this in a second.
It is possible to use Pig or Hive, but these aren't as flexible if you want to perform additional analysis on these columns or change schemas, since in Cascading you can build your schema on the fly by reading the column headers or a mapping file you create to denote the schema.
In Cascading you would:
Set up your incoming Taps : Tap trxOld and Tap trxNew (These point to your source files)
Connect your taps to Pipes: Pipe oldPipe and Pipe newPipe
Set up your outgoing Taps : Tap trxRemoved, Tap trxAdded and Tap trxChanged
Build your Pipe analysis (this is where the fun (hurt) happens)
trx-removed and trx-added:
Pipe trxOld = new Pipe ("old-stuff");
Pipe trxNew = new Pipe ("new-stuff");
//smallest size Pipe on the right in CoGroup
Pipe oldNnew = new CoGroup("old-N-new", trxOld, new Fields("id1"),
trxNew, new Fields("id2"),
new OuterJoin() );
The outer join gives us NULLS where ids are missing in the other Pipe (your source data), so we can use FilterNotNull or FilterNull in the logic that follows to get us final pipes that we then connect to Tap trxRemoved and Tap trxAdded accordingly.
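For illustration only, those two branches might look like the following sketch (it assumes the joined stream oldNnew from above keeps the old-side id in "id1" and the new-side id in "id2"; cascading.operation.filter.FilterNotNull discards any tuple whose selected field is non-null, so each branch keeps exactly the rows where the other side is missing):

// ids present in old but missing in new -> later bound to Tap trxRemoved
Pipe removed = new Pipe("removed", oldNnew);
removed = new Each(removed, new Fields("id2"), new FilterNotNull());

// ids present in new but missing in old -> later bound to Tap trxAdded
Pipe added = new Pipe("added", oldNnew);
added = new Each(added, new Fields("id1"), new FilterNotNull());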
trx-changed
Here I would first concatenate the fields you are looking for changes in using FieldJoiner, then use an ExpressionFilter to keep only the records whose values changed, something like:
Pipe valueChange = new Pipe("changed");
valueChange = new Pipe(oldNnew, new Fields("oldValues", "newValues"),
new ExpressionFilter("oldValues.equals(newValues)", String.class),
Fields.All);
The ExpressionFilter removes every record for which the expression is true, i.e. records whose old and new values are equal, so only the differences survive. Finally, connect your valueChange pipe to Tap trxChanged and you will have three outputs with all the data you are looking for, with code that allows some added analysis to creep in.

As @ChrisGerken suggested, you would have to use MultipleOutputs and MultipleInputs in order to generate multiple output files and associate a custom mapper with each input file type (old/new).
The mapper would output:
key: primary key (id)
value: record from input file with additional flag (new/old depending on the input)
The reducer would iterate over all records R for each key and output:
to removed file: if only a record with flag old exists.
to added file: if only a record with flag new exists.
to changed file: if records in R differ.
As the output of this algorithm is spread across the reducers, you'd most likely need a second job to merge the results into a single file for the final output.
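A sketch of such a reducer (hypothetical class and tag names; it assumes each mapper emits the id as the key and the record prefixed with an "old" or "new" tag as the value, and that "removed", "added" and "changed" were registered as named outputs on the job):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class DiffReducer extends Reducer<Text, Text, Text, Text> {
  private MultipleOutputs<Text, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text id, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String oldRecord = null;
    String newRecord = null;
    for (Text v : values) {
      String s = v.toString();
      if (s.startsWith("old\t")) oldRecord = s.substring(4);
      else if (s.startsWith("new\t")) newRecord = s.substring(4);
    }
    if (oldRecord != null && newRecord == null) {
      mos.write("removed", id, new Text(""));          // only in trx-old: emit the id
    } else if (oldRecord == null && newRecord != null) {
      mos.write("added", id, new Text(newRecord));     // only in trx-new: emit the record
    } else if (oldRecord != null && !oldRecord.equals(newRecord)) {
      mos.write("changed", id, new Text(newRecord));   // non-key values changed
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
  }
}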

What comes to my mind is this:
Suppose your tables look like this:
Table_old
1 other_columns1
2 other_columns2
3 other_columns3
Table_new
2 other_columns2
3 other_columns3
4 other_columns4
Tag table_old's records with "a" and table_new's records with "b".
When you merge both files, an id that exists in the first file but not in the second is a removed record:
table_merged
1a other_columns1
2a other_columns2
2b other_columns2
3a other_columns3
3b other_columns3
4a other_columns4
From that file you can do your operations easily.
Also, let's say your ids are n digits long and you have 10 workers plus 1 master. Your key would be the first digit of the id, so you divide the data evenly across the workers (see the partitioner sketch after the example below). You would use grouping + partitioning so that your data arrives sorted.
Example,
table_old
1...0 data
1...1 data
2...2 data
table_new
1...0 data
2...2 data
3...2 data
Your key is the first digit, you group according to that digit, and you partition according to the rest of the id. Your data will then arrive at the workers as:
worker1
1...0b data
1...0a data
1...1a data
worker2
2...2a data
2...2b data and so on.
Note that a and b don't have to be sorted.
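For illustration, a minimal sketch of a custom Partitioner that routes each tagged id to a worker by its first digit, as in the example above (the class name is made up, and it assumes the id starts with a digit; grouping by that digit would additionally need a grouping comparator, which is not shown):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each record by the first digit of its id, so ids sharing a
// leading digit end up on the same worker.
public class FirstDigitPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    int firstDigit = key.toString().charAt(0) - '0';
    return firstDigit % numPartitions;
  }
}

It would be registered on the job with job.setPartitionerClass(FirstDigitPartitioner.class).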
EDIT
The merge would look like this:
FileInputFormat.addInputPath(job, new Path("trx-old"));
FileInputFormat.addInputPath(job, new Path("trx-new"));
MapReduce will take the two inputs, and the two files will be merged.
For the tagging part, you should create two more map-only jobs before the main MR job. The first map appends "a" to every element of the first list and the second appends "b" to the elements of the second list. The third job (the main one we are using now) then only needs the reduce phase to collect them. So you will have Map-Map-Reduce.
Appending can be done like this:
// you have key: Text
new Text(key.toString() + "a")
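Fleshed out, one of the tagging map-only jobs could look like this sketch (it assumes whitespace-delimited records with the id in the first column; a twin mapper for trx-new would append "b" instead):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job for the tagging step: tags every id from trx-old with "a".
// Run with job.setNumReduceTasks(0) so the mapper output is written directly.
public class TagOldMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split("\\s+", 2);   // id, rest of record
    context.write(new Text(parts[0] + "a"),
        new Text(parts.length > 1 ? parts[1] : ""));
  }
}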
but there may be other, more efficient ways of appending text in Hadoop.
Hope it helps.

How to map column-wise data in a flowfile in NiFi?

I have a CSV file with the following structure:
Alfreds,Centro,Ernst,Island,Bacchus
Germany,Mexico,Austria,UK,Canada
01,02,03,04,05
Now I have to move that data into the database like below:
Name,City,ID
Alfreds,Germany,01
Centro,Mexico,02
Ernst,Austria,03
Island,UK,04
Bacchus,Canada,05
I tried to map those columns but I wasn't able to extract the data column-wise.
My input data is column-wise, but I need to insert it row-wise into SQL Server.
Can anyone suggest a way to transform column-wise data into row-wise data for SQL Server?
Thanks
There is no existing Apache NiFi processor to perform column transposition. One of the problems is that this is difficult to do in a streaming manner, as most NiFi components are designed, because in a naïve implementation you need to hold the entire contents of the flowfile in active memory at the same time.
I would recommend using an ExecuteScript processor to do this (here's a 6 line Python example). Be careful doing this because you can easily end up overflowing your heap if it is not sized properly or you read unexpectedly large files into memory.
You could write a custom processor which performs a streaming transpose operation by iterating over each of n rows and reading up to your delimiter, storing a byte counter per row, combining the n elements as a single output row, and repeating the process starting from the respective byte counter of each row. (Given m columns, this is O(m * n)).
Another solution would be splitting the CSV input into individual rows using the SplitText processor, using an ExecuteScript or custom processor to transpose a single row into a single column, and then using a custom merge operation (either extend the existing MergeContent processor or write a script to do this) which laterally concatenates the incoming columns into a reconstructed matrix. (O(n) + O(n) + O(m) => O(2n + m) but the individual transposition operations can be performed in parallel so with x threads it's O(n + n/x + m)).
Any of these approaches will require some level of custom development. If you are really hesitant to pursue that, you could try using ExecuteStreamCommand and one of the many bash solutions to do the transposition on the command-line.
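For reference, here is a minimal in-memory sketch of the transposition itself in plain Java (not a NiFi processor; it holds the whole CSV in memory, which is exactly the limitation discussed above, and assumes every row has the same number of columns):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvTranspose {
  // Turns rows of "Alfreds,Centro,..." / "Germany,Mexico,..." / "01,02,..."
  // into one output row per column: "Alfreds,Germany,01", ...
  public static List<String> transpose(List<String> csvRows) {
    List<String[]> rows = new ArrayList<>();
    for (String row : csvRows) {
      rows.add(row.split(","));
    }
    int columns = rows.get(0).length;
    List<String> out = new ArrayList<>();
    for (int c = 0; c < columns; c++) {
      StringBuilder sb = new StringBuilder();
      for (int r = 0; r < rows.size(); r++) {
        if (r > 0) sb.append(',');
        sb.append(rows.get(r)[c]);
      }
      out.add(sb.toString());
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList(
        "Alfreds,Centro,Ernst,Island,Bacchus",
        "Germany,Mexico,Austria,UK,Canada",
        "01,02,03,04,05");
    transpose(input).forEach(System.out::println);
  }
}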
@Andy,
It is also possible in NiFi without using ExecuteScript.
I extracted the 3 input rows as input.1, input.2, input.3 with ExtractText, then counted the number of columns in "input.1" using anyDelineatedValue in the Expression Language and stored that in a "TotalCount" attribute.
Initially I set "Count" to 1.
Using a loop, I fetch the current column via "Count", then increment "Count" and check it in RouteOnAttribute with an expression along the lines of "${Count:le(${TotalCount})}".
Then I form the insert query with the "Count" attribute.
It worked well for me. It could be useful for someone.

Find out the amount of space each field takes in Google BigQuery

I want to optimize the space of my BigQuery and Google Storage tables. Is there a way to easily find out the cumulative space that each field in a table takes up? This is not straightforward in my case, since I have a complicated hierarchy with many repeated records.
You can do this in the Web UI by simply typing (and not running) the query below, changing <column_name> to the field of your interest,
SELECT <column_name>
FROM YourTable
and looking at the validation message, which shows the respective size.
Important: you do not need to run it. Just check the validation message for bytesProcessed; that is the size of the respective column.
Validation is free and invokes a so-called dry run.
If you need to do such "column profiling" for many tables or for a table with many columns, you can code this in your preferred language: use the Tables.get API to get the table schema, then loop through all the fields, build the respective SELECT statement for each, dry-run it, and read totalBytesProcessed, which, as you already know, is the size of the respective column.
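A sketch of that loop using the google-cloud-bigquery Java client (the dataset and table names are placeholders; for nested repeated records you would have to expand the field paths yourself):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.JobStatistics;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

public class ColumnSizes {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    Table table = bigquery.getTable(TableId.of("my_dataset", "my_table"));   // placeholder names
    Schema schema = table.getDefinition().getSchema();

    for (Field field : schema.getFields()) {
      // Dry-run a SELECT of just this column; nothing is billed.
      QueryJobConfiguration config = QueryJobConfiguration.newBuilder(
              "SELECT " + field.getName() + " FROM my_dataset.my_table")
          .setDryRun(true)
          .setUseQueryCache(false)
          .build();
      Job job = bigquery.create(JobInfo.of(config));
      JobStatistics.QueryStatistics stats = job.getStatistics();
      System.out.printf("%s: %d bytes%n", field.getName(), stats.getTotalBytesProcessed());
    }
  }
}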
I don't think this is exposed in any of the meta data.
However, you may be able to easily get good approximations based on your needs. The number of rows is provided, so for some of the data types, you can directly calculate the size:
https://cloud.google.com/bigquery/pricing
For types such as string, you could get the average length by querying e.g. the first 1000 fields, and use this for your storage calculations.

looping in a Kettle transformation

I want to repetitively execute an SQL query looking like this:
SELECT '${date.i}' AS d,
COUNT(DISTINCT xid) AS n
FROM table
WHERE date
BETWEEN DATE_SUB('${date.i}', INTERVAL 6 DAY)
AND '${date.i}'
;
It is basically a grouping by time spans, except that the spans overlap, which prevents the use of GROUP BY.
That is why I want to execute the query repetitively for every day in a certain time span. But I am not sure how I should implement the loop. What solution would you suggest?
The Kettle variable date.i is initialized from a global variable. The transformation is just one of several in the same transformation bundle. The "stop transformation" would perhaps be implemented implicitly by just not re-entering the loop.
Here's the flow chart (flow of the transformation):
In step "INPUT" I create a result set with three identical fields keeping the dates from ${date.from} until ${date.until} (Kettle variables). (for details on this technique check out my article on it - Generating virtual tables for JOIN operations in MySQL).
In step "SELECT" I set the data source to be used ("INPUT") and that I want "SELECT" to be executed for every row in the served result set. Because Kettle maps parameters 1 on 1 by a faceless question-mark I have to serve three times the same paramter - for each usage.
The "text file output" finally outputs the result in a generical fashion. Just a filename has to be set.
Content of the resulting text output for 2013-01-01 until 2013-01-05:
d;n
2013/01/01 00:00:00.000;3038
2013/01/02 00:00:00.000;2405
2013/01/03 00:00:00.000;2055
2013/01/04 00:00:00.000;2796
2013/01/05 00:00:00.000;2687
I am not sure if this is the slickest solution but it does the trick.
In Kettle you want to avoid loops, as they can cause real trouble in transformations. Instead you should do this by adding a step that puts a row in the stream for each date you want (with the value stored in a field) and then using that field value in the query.
ETA: The stream is the thing that moves rows (records) between steps. It may help to think of it as consisting of a table at each hop that temporarily holds rows between steps.
You want to avoid loops because a Kettle transformation is only sequential at the row level: rows may be processed in parallel and out of order, and the only guarantee is that each row will pass through the steps in order. Because of this, a loop in a transformation does not function as you would intuitively expect.
FYI, it also sounds like you might need to go through some of the Kettle tutorials if you are still unclear about what the stream is.

Suggestions/Opinions for implementing a fast and efficient way to search a list of items in a very large dataset

Please comment and critique the approach.
Scenario: I have a large dataset (200 million entries) in a flat file. The data is of the form: a 10-digit phone number followed by 5-6 binary fields.
Every week I will be getting a delta file which will only contain changes to the data.
Problem: Given a list of items, I need to figure out whether each item (which will be the 10-digit number) is present in the dataset.
The approach I have planned:
Parse the dataset and put it into a DB like MySQL or Postgres (to be done at the start of the week). The reason I want an RDBMS in the first step is that I want the full time-series data.
Then generate some kind of key-value store out of this database, holding only the latest valid data, that supports checking whether each item is present in the dataset (thinking of some kind of NoSQL DB optimized for search here, like Redis; it should have persistence and be distributed). This data structure will be read-only.
Query this key-value store to find out whether each item is present (if possible, match a list of values all at once instead of matching one item at a time). I want this to be blazing fast, and will be using this functionality as the back-end of a REST API.
Side note: my language of preference is Python.
A few considerations for the fast lookup:
If you want to check a set of numbers at a time, you could use Redis's SINTER command, which performs set intersection (see the sketch after this list).
You might benefit from using a grid structure by distributing number ranges over some hash function such as the first digit of the phone number (there are probably better ones, you have to experiment), this would e.g. reduce the size per node, when using an optimal hash, to near 20 million entries when using 10 nodes.
If you expect duplicate requests, which is quite likely, you could cache the last n requested phone numbers in a smaller set and query that one first.
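A sketch of the SINTER idea using the Jedis Java client (the key names are made up; the same commands are available in redis-py if you stay with Python): load the query numbers into a throwaway set and intersect it with the master set.

import java.util.List;
import java.util.Set;
import redis.clients.jedis.Jedis;

public class PhoneLookup {
  // Returns the subset of `queries` present in the master set "phones:current".
  public static Set<String> present(Jedis jedis, List<String> queries) {
    String tmpKey = "phones:query:" + Thread.currentThread().getId();   // throwaway key
    jedis.del(tmpKey);
    jedis.sadd(tmpKey, queries.toArray(new String[0]));
    Set<String> found = jedis.sinter("phones:current", tmpKey);         // set intersection in Redis
    jedis.del(tmpKey);
    return found;
  }

  public static void main(String[] args) {
    try (Jedis jedis = new Jedis("localhost", 6379)) {
      jedis.sadd("phones:current", "9876543210", "9123456780");
      System.out.println(present(jedis, List.of("9876543210", "0000000000")));
    }
  }
}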

Represent Ordering in a Relational Database

I have a collection of objects in a database. Images in a photo gallery, products in a catalog, chapters in a book, etc. Each object is represented as a row. I want to be able to arbitrarily order these images, storing that ordering in the database so when I display the objects, they will be in the right order.
For example, let's say I'm writing a book, and each chapter is an object. I write my book, and put the chapters in the following order:
Introduction, Accessibility, Form vs. Function, Errors, Consistency, Conclusion, Index
It goes to the editor, and comes back with the following suggested order:
Introduction, Form, Function, Accessibility, Consistency, Errors, Conclusion, Index
How can I store this ordering in the database in a robust, efficient way?
I've had the following ideas, but I'm not thrilled with any of them:
Array. Each row has an ordering ID; when the order is changed (via a removal followed by an insertion), the ordering IDs are updated. This makes retrieval easy, since it's just an ORDER BY, but it seems easy to break.
// REMOVAL
UPDATE ... SET orderingID=NULL WHERE orderingID=removedID
UPDATE ... SET orderingID=orderingID-1 WHERE orderingID > removedID
// INSERTION
UPDATE ... SET orderingID=orderingID+1 WHERE orderingID > insertionID
UPDATE ... SET orderingID=insertionID WHERE ID=addedID
Linked list. Each row has a column for the id of the next row in the ordering. Traversal seems costly here, though there may by some way to use ORDER BY that I'm not thinking of.
Spaced array. Set the orderingID (as used in #1) to be large, so the first object is 100, the second is 200, etc. Then when an insertion happens, you just place it at (objectBefore + objectAfter)/2. Of course, this would need to be rebalanced occasionally, so you don't have things too close together (even with floats, you'd eventually run into rounding errors).
None of these seem particularly elegant to me. Does anyone have a better way to do it?
Another alternative would be (if your RDBMS supports it) to use columns of type array. While this breaks the normalization rules, it can be useful in situations like this. One database that I know has arrays is PostgreSQL.
The acts_as_list mixin in Rails handles this basically the way you outlined in #1. It looks for an INTEGER column called position (whose name you can override, of course) and uses it for an ORDER BY. When you want to re-order things you update the positions. It has served me just fine every time I've used it.
As a side note, you can remove the need to always do re-positioning on INSERTS/DELETES by using sparse numbering -- kind of like basic back in the day... you can number your positions 10, 20, 30, etc. and if you need to insert something in between 10 and 20 you just insert it with a position of 15. Likewise when deleting you can just delete the row and leave the gap. You only need to do re-numbering when you actually change the order or if you try to do an insert and there is no appropriate gap to insert into.
Of course depending on your particular situation (e.g. whether you have the other rows already loaded into memory or not) it may or may not make sense to use the gap approach.
If the objects aren't heavily keyed by other tables, and the lists are short, deleting everything in the domain and just re-inserting the correct list is the easiest. But that's not practical if the lists are large and you have lots of constraints to slow down the delete. I think your first method is really the cleanest. If you run it in a transaction you can be sure nothing odd happens while you're in the middle of the update to screw up the order.
Just a thought considering option #1 vs #3: doesn't the spaced array option (#3) only postpone the problem of the normal array (#1)? Whatever algorithm you choose, either it's broken, and you'll run into problems with #3 later, or it works, and then #1 should work just as well.
I did this in my last project, but it was for a table that only occasionally needed to be specifically ordered and wasn't accessed too often. I think the spaced array would be the best option, because reordering would be cheapest in the average case, involving just a change to one value and a query on two.
Also, I would imagine ORDER BY would be pretty heavily optimized by database vendors, so leveraging that function would be advantageous for performance as opposed to the linked list implementation.
Use a floating point number to represent the position of each item:
Item 1 -> 0.0
Item 2 -> 1.0
Item 3 -> 2.0
Item 4 -> 3.0
You can place any item between any other two items by simple bisection:
Item 1 -> 0.0
Item 4 -> 0.5
Item 2 -> 1.0
Item 3 -> 2.0
(Moved item 4 between items 1 and 2).
The bisection process can continue for a long time due to the way floating point numbers are encoded in a computer system, though repeatedly inserting into the same gap will eventually exhaust a double's precision and force a renumbering (see the sketch below).
Item 4 -> 0.5
Item 1 -> 0.75
Item 2 -> 1.0
Item 3 -> 2.0
(Move item 1 to the position just after Item 4)
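A small plain-Java sketch of the bisection bookkeeping, including a check for when the gap between two neighbours is exhausted and a renumbering pass becomes necessary (the structure is illustrative only):

public class FractionalOrdering {

  // New position for an item dropped between two neighbours.
  static double between(double before, double after) {
    return (before + after) / 2.0;
  }

  public static void main(String[] args) {
    double item1 = 0.0, item2 = 1.0, item3 = 2.0;

    double item4 = between(item1, item2);   // 0.5  -> item 4 now sits between items 1 and 2
    item1 = between(item4, item2);          // 0.75 -> item 1 moved just after item 4
    System.out.println(item4 + " " + item1 + " " + item2 + " " + item3);

    // Repeated bisection of the same gap eventually runs out of precision:
    // when the midpoint collapses onto one of its neighbours, renumber the rows.
    double lo = 0.0, hi = 1.0;
    int steps = 0;
    while (true) {
      double mid = between(lo, hi);
      if (mid == lo || mid == hi) break;    // gap exhausted -> time to renumber
      hi = mid;
      steps++;
    }
    System.out.println("bisections possible before renumbering: " + steps);
  }
}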
Since I've mostly run into this with Django, I've found this solution to be the most workable. It seems that there isn't any "right way" to do this in a relational database.
I'd do a consecutive number, with a trigger on the table that "makes room" for a priority if it already exists.
I had this problem as well. I was under heavy time pressure (aren't we all) and I went with option #1, and only updated rows that changed.
If you swap item 1 with item 10, just do two updates to update the order numbers of item 1 and item 10. I know it is algorithmically simple, and it is O(n) worst case, but that worst case is when you have a total permutation of the list. How often is that going to happen? That's for you to answer.
I had the same issue and have probably spent at least a week concerning myself about the proper data modeling, but I think I've finally got it. Using the array datatype in PostgreSQL, you can store the primary key of each ordered item and update that array accordingly using insertions or deletions when your order changes. Referencing a single row will allow you to map all your objects based on the ordering in the array column.
It's still a bit choppy of a solution but it will likely work better than option #1, since option 1 requires updating the order number of all the other rows when ordering changes.
Scheme #1 and Scheme #3 have the same complexity in every operation except INSERT writes. Scheme #1 has O(n) writes on INSERT and Scheme #3 has O(1) writes on INSERT.
For every other database operation, the complexity is the same.
Scheme #2 should not even be considered because its DELETE requires O(n) reads and writes. Scheme #1 and Scheme #3 have O(1) DELETE for both read and write.
New method
If your elements have a distinct parent element (i.e. they share a foreign key row), then you can try the following ...
Django offers a database-agnostic solution to storing lists of integers within CharField(). One drawback is that the max length of the stored string can't be greater than max_length, which is DB-dependent.
In terms of complexity, this would give Scheme #1 O(1) writes for INSERT, because the ordering information would be stored as a single field in the parent element's row.
Another drawback is that a JOIN to the parent row is now required to update ordering.
https://docs.djangoproject.com/en/dev/ref/validators/#django.core.validators.validate_comma_separated_integer_list