Retrieving HBase versioned data - Hive

I am trying to retrieve different versions of HBase data.
Step 1 - Table abc has 4 columns, all with version 1, in a single column family:
a b c d
1 1 1 1
Step 2 - The values of columns b and c change, and we load the updated values as version 2 (columns b and c now hold both version 1 and version 2 data):
a b c d
1 1/2 1/2 1
I want to retrieve the following combination of versions from HBase:
a b c d
1 1 2 1
Is there any way to achieve this?
Thanks in advance.

HBase has decent documentation on this concept:
The maximum number of versions to store for a given column is part of the column schema and is specified at table creation, or via an alter command, via HColumnDescriptor.DEFAULT_VERSIONS. Prior to HBase 0.96, the default number of versions kept was 3, but in 0.96 and newer has been changed to 1.
So if you are designing a schema now, you can set it up to keep a particular number of prior versions. If the HBase table already exists, you can alter it, but you won't be able to get prior versions for data that has already been stored.
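For example, raising the version count on an existing table can be done from the HBase shell; the table name below follows the question, while the column-family name cf1 is an assumption:
alter 'abc', NAME => 'cf1', VERSIONS => 2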
Here is an example of getting prior versions for a column (it comes from that documentation, with imports added for completeness):
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
Get get = new Get(Bytes.toBytes("row1"));
get.setMaxVersions(3);  // will return last 3 versions of row
Result r = table.get(get);
byte[] b = r.getValue(CF, ATTR);            // returns current version of value
List<KeyValue> kv = r.getColumn(CF, ATTR);  // returns all versions of this column
It is very important to keep in mind that, from HBase's point of view, versions are directly tied to the timestamp used when writing. A default put uses the time of its execution as its timestamp, so versioning normally follows the sequence of our changes. But given two put operations with timestamps T1 and T2 where T1 is less than T2, if the T1 write actually happens after the T2 write, it will still appear as the earlier version. HBase cares about the timestamp, not when in absolute time the write actually occurred. This makes it possible, for instance, to overwrite an earlier version by writing with the same timestamp.
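To illustrate, in the HBase shell a put accepts an explicit timestamp as its final argument, which decides which version the value becomes (the row key, column, value, and timestamp below are illustrative, loosely following the question's scenario):
put 'abc', 'row1', 'cf1:b', '2', 1000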

Related

subtract every next column value from previous?

I have a dataset where the data for each row has been accumulated on top of the previous row's data, for every column. That means the row with ID 1 is the original, pure data, but the row with e.g. ID 10 has the data of the previous 9 rows added onto it.
What I want now is to recover the original, pure data for every distinct item, i.e. for every ID. How can I do that, say for ID 10? I would have to subtract the values of ID 9, and so on down the line.
I want to do this either in SQL Server or in RapidMiner, as I am working with those tools. Any ideas?
Here is a sample:
ID col1 col2 col3
1  12   2    3
2  15   5    5
3  20   8    8
So the real, correct data for the item with ID 3 is not 20, 8, 8; it is (20-15), (8-5), (8-5), i.e. 5, 3, 3.
Subtract each row from its predecessor, for every item except the first:
1  12   2    3
Try it out with the Lag Series operator, it will work for sure! To get this operator you should install the Series extension from the RapidMiner Marketplace.
What this operator does: it copies the selected attributes and shifts every row of the example set by one position, so the row with ID 1 gets a copy placed at ID 2, etc. (you can also specify the value for the lag). Afterwards you can subtract one value from the other with Generate Attributes.
I think lag() is the answer to your question:
select (case when id = 1 then col
             else col - lag(col) over (order by id)
        end)
from table1   -- placeholder table and column names
order by id
However, sample data would clarify the question.
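Applied to the sample above, with one expression per column (the table name table1 is a placeholder; this needs a database with window functions, e.g. SQL Server 2012+), coalesce can even replace the case expression, since lag() returns NULL for the first row:
select id,
       col1 - coalesce(lag(col1) over (order by id), 0) as col1,
       col2 - coalesce(lag(col2) over (order by id), 0) as col2,
       col3 - coalesce(lag(col3) over (order by id), 0) as col3
from table1
order by id;
-- id | col1 | col2 | col3
--  1 |  12  |  2   |  3
--  2 |   3  |  3   |  2
--  3 |   5  |  3   |  3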
Within RapidMiner there is the Differentiate operator contained in the Series extension (which is not installed by default and needs to be downloaded from the RapidMiner Marketplace). This can be used to calculate differences between attributes in adjacent examples.

INFORMATICA Using transformation to get desired target from a single flat file (see pictures)

I just started out using Informatica and am currently figuring out how to get to this target output (flat file to Microsoft SSIS):
ID Letter Parent_ID
---- ------ ---------
1 A NULL
2 B 1
3 C 1
4 D 2
5 E 2
6 F 3
7 G 3
8 H 4
9 I 4
From (assuming that this is a comma-delimited flat file):
c1,c2,c3,c4
A,B,D,H
A,B,D,I
A,B,E
A,C,F
A,C,G
EDIT: c1, c2, c3 and c4 are the header row.
EDIT: A more descriptive representation of what I want to achieve:
EDIT: Here is what I have so far (a Normalizer for producing the Letter column and a Sequence Generator for the ID)
Thanks in advance.
I'd go with a two-phase approach. Here's the general idea (not a full, step-by-step solution); an illustrative SQL sketch of the parent lookup follows the list.
1. Perform a pivot to get all values in separate rows (e.g. from "A,B,D,H" do substrings and union the data to get four rows).
2. Perform a sort with distinct and insert into the target to get IDs assigned. End of mapping one.
3. In mapping two, add a Sequence to assign row numbers.
4. Do the pivot again.
5. Use an expression variable to refer to the previous row and the previous RowID (How do I get previous row?).
6. If the current RowID doesn't match the previous RowID, this is a top node and has no parent.
7. If a previous row exists and its RowID matches, the previous row is the parent. Perform a lookup to get its ID from the DB and use it as Parent_ID. Send an update to the DB.
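Although the steps above are Informatica transformations, the parent lookup in step 7 can be sketched in SQL; the table pivoted and its columns RowID, pos and Letter are hypothetical, standing in for the output of the second pivot:
-- Each letter's parent is the letter one position earlier in the same
-- source row; letters at position 1 get NULL, i.e. they are top nodes.
SELECT child.Letter,
       parent.Letter AS Parent_Letter
FROM pivoted AS child
LEFT JOIN pivoted AS parent
  ON parent.RowID = child.RowID
 AND parent.pos = child.pos - 1;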

What is the best way to reassign ordinal numbers for a move operation

I have a column in SQL Server called "Ordinal" that is used to indicate the display order of the rows. It starts from 0 and skips 10 for the next row, so we have something like this:
Id Ordinal
1   0
2  20
3  10
It skips 10 because we wanted to be able to move an item in between other items (based on ordinal) without having to reassign ordinal numbers for the entire table.
As you can imagine, the ordinal numbers will eventually need to be reassigned for an in-between move, either on the surrounding rows or for the entire table, once the unused ordinal numbers between the target items are all used up.
Is there an algorithm that I can use to reorder the ordinal numbers for the move operation effectively, taking into consideration the long-term maintainability of the table and minimizing update operations?
You can re-number the sequences using a somewhat complicated UPDATE statement:
UPDATE u
SET u.sequence = 10 * (c.num_below - 1)
FROM test u
JOIN (
    SELECT t.id, count(*) AS num_below
    FROM test t
    JOIN test tr ON tr.sequence <= t.sequence
    GROUP BY t.id
) c ON c.id = u.id
The idea is to obtain, for each row, the count of rows whose sequence is at or below the current row's, subtract one, multiply by ten, and assign the result as the new sequence value.
The content of test before the UPDATE:
ID Sequence
__ ________
1   0
2  20
3  10
4  12
The content of test after the UPDATE:
ID Sequence
__ ________
1   0
2  30
3  10
4  20
Now the sequence numbers are evenly spread again, so you can continue inserting in the middle until you run out of new sequence numbers; then you can re-number again.
These won't answer your question directly; I just thought I might suggest some other approaches:
One possibility: don't try to do it by hand. Have your software manage the numbers. If they need re-writing, just save them with new numbers.
A second: use a "linked list" instead. In each record, store the index of the next record you want displayed, then have your code load that directly into a linked list. A minimal sketch of such a schema follows.
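The table and column names here are hypothetical:
-- Each row points at the row displayed after it; NULL marks the last item.
CREATE TABLE display_items (
    Id     INT PRIMARY KEY,
    NextId INT NULL REFERENCES display_items(Id)
);
-- Moving an item only rewires the NextId pointers of its old and new
-- neighbours, so no mass renumbering is ever needed.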
Yet another simple approach. Let's say you're inserting a new record with an ordinal equal to x.
First, check whether there is already a row with ordinal x. If there is, update all the records whose ordinal is greater than or equal to x, increasing them by y. Then you are safe to insert the new record.
This way you are sure not to run the update every time and, of course, you keep the order. A sketch of the shift follows.
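Assuming SQL Server syntax and a hypothetical table name:
-- Make room at ordinal @x by shifting every row at or above it up by y
-- (here y = 10, matching the spacing scheme in the question).
UPDATE items
SET Ordinal = Ordinal + 10
WHERE Ordinal >= @x;
-- The new record can now be inserted with Ordinal = @x.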

Limit results to x groups

I'm developing a system using Trac, and I want to limit the number of "changelog" entries returned. The issue is that Trac collates these entries from multiple tables using a union, and then later combines them into single 'changesets' based on their timestamp. I wish to limit the results to the latest e.g. 3 changesets, but this requires retrieving as many rows as necessary until I've got 3 unique timestamps. Solution needs to work for SQLite/Postgres.
Trac's current SQL
Current SQL Result
Time User Field oldvalue newvalue permanent
=======================================================================
1371806593507544 a owner b c 1
1371806593507544 a comment 2 lipsum 1
1371806593507544 a description foo bar 1
1371806593324529 b comment hello world 1
1371806593125677 c priority minor major 1
1371806592492812 d comment x y 1
Intended SQL Result (limited to 1 timestamp, for example)
Time User Field oldvalue newvalue permanent
=======================================================================
1371806593507544 a owner b c 1
1371806593507544 a comment 2 lipsum 1
1371806593507544 a description foo bar 1
As you already pointed out yourself, this cannot be resolved in pure SQL due to the undetermined number of rows required. And I think that is not even necessary.
You can use a slightly modified trac/ticket/templates/ticket.html Genshi template to get what you want. Change
<div id="changelog">
<py:for each="change in changes">
into
<div id="changelog">
<py:for each="change in changes[-3:]">
and place the file into <env>/templates/, then restart your web server. But watch out for changes to ticket.html whenever you upgrade your Trac installation: every time you do that, you might need to re-apply this change to the then-current version of the template. But IMHO it's still a lot faster and cleaner than patching Trac core code.
If you want just three records (as in the "Data Limit 1" result set), you can use limit:
select *
from t
order by time desc
limit 3
If you want all records for the three most recent time stamps, you can use a join:
select t.*
from t join
(select distinct time
from t
order by time desc
limit 3
) tt
on tt.time = t.time

select records that do not exist in a select

I need to track changes made in a directory and also save the history. I have a function that scans the files in that directory and then inserts them into a table. Let's say that the first time this program ran there were files A and B. As a result the table should look like:
FileID File DateModified
1 A 101010
2 B 020202
Let's say the user modifies file B; the next time the program runs, the table should look like:
FileID File DateModified
1 A 101010
2 B 020202
3 A 101010
4 B 030303
From looking at the table above we know that file B has been changed, because it has a different modified date, and that file A was not modified. Moreover, my program knows that the newly inserted records are all the records with a FileID greater than 2. How could I perform a select that returns the last file B, since that file was modified? I want to be able to know which files have been modified; how can I build that query?
Please read the above first in order to understand this part. Here is another example.
First time the program runs:
FileID File DateModified
1 X 101010
2 Y 020202
Next time the program runs:
FileID File DateModified
1 X 101010
2 Y 020202
3 Y 020202
4 A 010101
So far we know that file X has been deleted, because it is not included in the new scan. Moreover, we know that file A has been created. And lastly, file Y was not modified; it is the same. I would like to perform a select where I just get the files that were created or modified, such as file A in this case.
I am looking for something like:
select * from table1 where fileID > 2 AND File NOT IN (SELECT File FROM table1 WHERE File <=2) AND DateModified NOT IN (SELECT DateModified FROM table1 WHERE File <=2)
I don't know why it is that when I perform such a query I get different results. Maybe I will have to group File and DateModified into one column to make it work.
I would add a column called scan_number so that you can compare the latest scan with the previous scan.
SELECT curr.file, prev.file, curr.DateModified, prev.DateModified
FROM table1 curr
LEFT JOIN table1 prev
  ON curr.file = prev.file
 AND curr.scan_number = 100
 AND prev.scan_number = 99
WHERE curr.DateModified != prev.DateModified
   OR curr.file IS NULL
   OR prev.file IS NULL
If you want to catch inserts and deletes you need a full outer join, but it seems SQLite doesn't support that. You might have to run the query twice: once to find inserts and updates, and once to find deletes, as sketched below.
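A sketch of that two-query workaround, reusing the table and scan numbers from the query above:
-- Inserts and updates: current-scan rows with no identical previous-scan row.
SELECT curr.file, curr.DateModified
FROM table1 curr
LEFT JOIN table1 prev
  ON prev.file = curr.file
 AND prev.scan_number = 99
WHERE curr.scan_number = 100
  AND (prev.file IS NULL OR prev.DateModified != curr.DateModified);

-- Deletes: previous-scan rows that no longer appear in the current scan.
SELECT prev.file
FROM table1 prev
LEFT JOIN table1 curr
  ON curr.file = prev.file
 AND curr.scan_number = 100
WHERE prev.scan_number = 99
  AND curr.file IS NULL;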
DateModified is being asked to perform too many jobs: it is used to track both the file's modification date and proof of existence for a given filename on a given date.
You could add another column, ScanId, as a foreign key to a new table ScanDates that records a scan id and the date the scan was run.
That way you could inspect all the results with a given ScanId and compare them against any selected previous scan, while also keeping track of the real modification dates of your files. A minimal sketch of that schema follows.
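The names here are illustrative (SQLite syntax):
CREATE TABLE ScanDates (
    ScanId  INTEGER PRIMARY KEY,
    RunDate TEXT NOT NULL      -- when this scan was run
);

CREATE TABLE table1 (
    FileID       INTEGER PRIMARY KEY,
    File         TEXT NOT NULL,
    DateModified TEXT NOT NULL,
    ScanId       INTEGER NOT NULL REFERENCES ScanDates(ScanId)
);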