Find first occurrence of record in list of modifications/deletions with connecting keys - pandas

I have an algorithm that converts the writeback of a frontend app into a cleaned dataset.
In the frontend the user can either add a new record or modify/delete an existing one. The modification and deletion are performed by tracking the key of the original row and creating a new one with the new status.
Here is an example of the writeback of the frontend app:

key                       date           status    source_key
10277_left_1605483676378  1605483676378  created   null
10277_left_1605559717253  1605559717253  modified  10277_left_1605483676378
10277_left_1627550679123  1627550679123  deleted   10277_left_1605559717253
10277_left_1605560105840  1605560105840  modified  10277_left_1605483676378
10277_left_1605560105900  1605560105900  modified  10277_left_1605560105840
and here is the result after applying the algorithm that creates the cleaned dataset:

key                       date           status
10277_left_1605560105900  1605560105900  modified
As you can see, we branched from the first version of the data (1605483676378), created two modified versions, deleted one of those, and then made a final modification on the remaining one, so the resulting data only contains one row.
                 ┌──────► 1605559717253 ──────► 1627550679123 ─────► no output row
    created      │          modified             deleted
 1605483676378 ──┤
                 │                         ┌──────────────────┐
                 └──────► 1605560105840 ───┼─► 1605560105900  ├─────► row visible in
                            modified       │    modified      │       cleaned dataset
                                           └──────────────────┘
This works because every update is treated individually. However, I would like to be able to inspect the origin of a given record. That is, I want to know the date when the record was originally created, something like this:
key                       date           status    date_added
10277_left_1605560105900  1605560105900  modified  1605483676378
I'm thinking about how to do this. I would like to avoid looping through the entire history of each record, as that would not be efficient.
As the algorithm currently runs in PySpark I would like to find a solution that works there, but hints in Pandas are also accepted.

IIUC you want to find the root node of a child node. I assume all your keys are unique in the below:
import numpy as np
import pandas as pd

# df is your original df, df2 the one after you apply your algo
d = df.set_index("key")["source_key"].to_dict()

def find_root(node):
    # follow the source_key chain upwards until there is no parent left
    cur = d.get(node, np.nan)
    return node if pd.isna(cur) else find_root(cur)

df2["root"] = df2["key"].map(find_root)
print(df2)
key date status root
0 10277_left_1605560105900 1605560105900 modified 10277_left_1605483676378
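
If you also want the original creation date rather than the root key, you can map the root back onto the dates of the original frame, producing the date_added column the question asks for:

df2["date_added"] = df2["root"].map(df.set_index("key")["date"])

And since the question asks for PySpark, here is a minimal sketch of the same idea there, assuming the writeback is in df, the cleaned output in df2, and the column names from the question; it follows one source_key link per pass until the key-to-root mapping stops changing:

from pyspark.sql import functions as F

# parent links: each key points at the key it was derived from
links = df.select(F.col("key").alias("child"),
                  F.col("source_key").alias("parent"))

# start with every key as its own root, then repeatedly follow the
# parent link; the chains here are short, so a plain loop is fine
# (cache/checkpoint the intermediate frames for long chains)
roots = df.select("key", F.col("key").alias("root"))
while True:
    stepped = (roots.join(links, roots["root"] == links["child"], "left")
                    .select(roots["key"],
                            F.coalesce(links["parent"], roots["root"]).alias("root")))
    if stepped.exceptAll(roots).count() == 0:  # fixed point reached
        break
    roots = stepped

# attach the root's date as date_added to the cleaned dataset
result = (df2.join(roots, "key")
             .join(df.select(F.col("key").alias("root"),
                             F.col("date").alias("date_added")),
                   "root")
             .drop("root"))

One join per level of history is only sensible when chains are short; for deep hierarchies a connected-components routine (e.g. GraphFrames) is the usual tool.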

Related

How to code a simple algorithm to fetch list of data through pagination in a fresh new application?

I'm making a clone of a social app, using GraphQL as my backend. My problem is that every time I query a list of data it returns the same results. When I release the app the user base will be very small, so the amount of data is small too. The issue I'm facing is described below:
1. My data in the database is like:
Id=1 title=hello1
Id=2 title=hello2
Id=3 title=hello3
2. When I query data through pagination with limit=3, I get a list of items like:
Query 1
Id=1 title=hello1
Id=2 title=hello2
Id=3 title=hello3
3. When I add new items to the database, they get inserted in between the existing items, like below:
Id=1 title=hello1
Id=4 title=hello4
Id=2 title=hello2
Id=3 title=hello3
Id=5 title=hello5
4. So the next fresh query result (limit=3) will be:
Query 2
Id=1 title=hello1
Id=4 title=hello4
Id=2 title=hello2
Look at the data set: previously our query result was Id=1, 2 & 3; now it is Id=1, 4 & 2, so the user gets mostly the same result, since Id=1 and Id=2 are in the new list as well.
If I save the pagination nextToken/cursor (Id=3) of the first query (query 1), then after new data is added to the database the next query will start from Id=5, because that is what comes after Id=3. Looking at the new dataset, it will miss Id=4, because the nextToken was saved at Id=3, so the query resumes from Id=5. Hope you can understand.
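A toy, plain-Python restatement of that cursor problem (the ids and page size are illustrative only):

# feed order at the time of the first query
feed = [1, 2, 3]
page1 = feed[:3]               # -> [1, 2, 3]; cursor saved at Id=3

# Id=4 is inserted mid-feed, Id=5 is appended
feed = [1, 4, 2, 3, 5]
start = feed.index(3) + 1      # resume after the saved cursor (Id=3)
page2 = feed[start:start + 3]  # -> [5]; Id=4 is never delivered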
If your suggestion is to add a created-at sort key: if I add such a filter, the data set becomes so selective that it might limit the number of items in the feed, and we know a feed should be able to serve unlimited data.

fnupdate updates wrong row datatables

I am trying to update a row by passing its index.
http://live.datatables.net/raculubo/1/
But most of the time it updates the wrong row.
The code is:
$(document).ready(function() {
  var table = $('#example').DataTable();
  var index = table.column(0).data().indexOf("Cedric Kelly");
  console.log("index2", index);
  table.row().data(["ax","by","dd"], index);
});
This is happening because of how you are sorting your data, leading to a difference between the "sort order" index and the "internal DataTables" index.
The table.column(0).data() function will return an array of names, as currently displayed in the table, taking into account sorting. In this scenario, the index of "Cedric Kelly" is therefore 1.
However, the internal unique index value stored by DataTables is actually 3 because that is the order provided to DataTables from your HTML code when the data was loaded for the very first time (where Cedric Kelly is the 4th record listed - so the index is 3).
This initial loading happens before data is sorted, and it is during this step that data indexes are assigned. Once assigned, they never change (unless you delete data).
Your data update function uses the value of 1 - thus updating the wrong row.
The fix for this is to tell DataTables to use the original loading order in the table.column(0).data() function:
var index = table.column(0, {order:'index'} ).data().indexOf("Cedric Kelly");
That directive {order:'index'} causes DataTables to use the original loading order. Now, the correct record will be updated because this index will now return 3 instead of 1.
You can see more details about this "selector modifier" syntax here.
Bear in mind that the correct syntax for updating a row is actually this:
table.row( index ).data(["ax","by","dd"]);
Finally, bear in mind that if you filter your data, then you are OK, since the default value used is search: 'none' - which means "do not take searching/filtering into account" when selecting the column data.
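
Putting the two corrections together, a fixed version of the original handler would look like this (keeping the placeholder row values from the question, with a draw() added so the table redraws):

$(document).ready(function() {
  var table = $('#example').DataTable();
  // select the column in original loading order, so the index matches
  // the internal DataTables row index
  var index = table.column(0, {order: 'index'}).data().indexOf("Cedric Kelly");
  // pass the index to row(), then redraw to display the change
  table.row(index).data(["ax", "by", "dd"]).draw();
});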

How to use Bioproject ID, for example, PRJNA12997, in biopython?

I have an Excel file listing more than 2,000 organisms, each with an associated BioProject ID (like PRJNA12997). The idea is to use these IDs to get the sequences for a later multiple alignment with five other sequences that I have in a text file.
Can anyone help me understand how I can do this using Biopython? At least the part that uses the BioProject ID.
You can first get the info using Bio.Entrez:

from Bio import Entrez

Entrez.email = "Your.Name.Here@example.org"
# This call to efetch fails sometimes with a 400 error.
handle = Entrez.efetch(db="bioproject", id="PRJNA12997")
I've been trying, and Entrez.read(handle) doesn't seem to work. But if you do record_xml = handle.read() you'll get the XML entry for this record. From this XML you can get the ID for the organism, in this case 12997.
handle = Entrez.esearch(db="nuccore", term="12997[BioProject]")
search_results = Entrez.read(handle)
Now you can efetch from your search results. At this point you should use Biopython to parse whatever you get in the efetch step, playing with the rettype argument: http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/
for result in search_results["IdList"]:
    entry = Entrez.efetch(db="nuccore", id=result, rettype="fasta")
    this_seq_in_fasta = entry.read()
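
For the multiple alignment mentioned in the question, a simple follow-up is to write everything to one FASTA file and then add the five existing sequences to it. A minimal sketch, assuming the search above succeeded (the file name and retmode="text" are my choices, not part of the original answer):

# collect all fetched sequences into a single FASTA file for the aligner
with open("bioproject_12997.fasta", "w") as out:
    for result in search_results["IdList"]:
        entry = Entrez.efetch(db="nuccore", id=result,
                              rettype="fasta", retmode="text")
        out.write(entry.read())
        entry.close()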

Peoplesoft CreateRowset with related display record

According to the Peoplebook here, the CreateRowset function has the parameters {FIELD.fieldname, RECORD.recname}, which are used to specify the related display record.
I had tried to use it like the following (just for example):
&rs1 = CreateRowset(Record.User, Field.UserId, Record.UserName);
&rs1.Fill();
For &k = 1 To &rs1.ActiveRowCount
   MessageBox(0, "", 999999, 99999, &rs1(&k).UserName.Name.Value);
End-For;
(Record.User contains only UserId (key) and Password.
Record.UserName contains UserId (key) and Name.)
I cannot get the value of UserName.Name; do I misunderstand the usage of this parameter?
Fill is the problem. From the doco:
Note: Fill reads only the primary database record. It does not read
any related records, nor any subordinate rowset records.
Having said that, Fill is the only way I know to bulk-populate a standalone rowset from the database, so I can't easily see a use for the related display record parameter in a standalone rowset.
The simplest solution is just to create a view, but that gets old very soon if you have to do it a lot. The alternative is to loop through the rowset yourself, loading the related fields. Something like:
For &k = 1 To &rs1.ActiveRowCount
   &rs1(&k).UserName.UserId.Value = &rs1(&k).User.UserId.Value;
   &rs1(&k).UserName.SelectByKey();
End-For;

Generate dynamic hstore key calls in Prawn

I have an hstore column that I'm using to build a table in Prawn (a PDF builder). The data will consist of records for a given month. Since it is hstore, the keys used will likely change from day to day, so this needs to be dynamic.
I need to determine:
1. What unique keys are used that month
I created a helper to find the unique keys that were used in the month. These will be used as column headers.
keys(@users_logs)
# this returns an array like - ["XC", "PIC", "Mountain"]
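
The helper itself isn't shown in the post; a minimal sketch of what it could look like, based on its described behaviour (hypothetical implementation):

# hypothetical helper: every distinct hstore key used across the month's logs
def keys(logs)
  logs.flat_map { |log| log.properties.keys }.uniq
end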
The table will display a user's duty log data for the month. For testing, if I explicitly call known hstore keys, the data displays correctly. But since it's hstore, I won't know what the table columns will be in production.
For testing, calling known hstore keys creates the Prawn table row data per duty log:
@users_logs.map do |dutylog|
  [ dutylog.properties["XC"],
    dutylog.properties["PIC"],
    dutylog.properties["Mountain"] ]
end
But since this is hstore, I won't know what keys to call in production, so I need to make the above iteration dynamic.
I tried, without success, to iterate over each dutylog entry, then iterate over each unique key and output one dutylog.properties[x] call per key, but this just outputs the array of key values. I tried using send() in the block, but that didn't help:
@users_logs.map do |dutylog|
  [ keys(@users_logs).each { |k| dutylog.properties[k] }.join(",") ]
end
Any ideas on how I could make the "dutylog.properties[k]" dynamic?
Took some head scratching, but it turned out to be quite easy.
This will build the rows for the Prawn table:
def hstore_duty_log_rows
  [keys(@users_logs)] +
    @users_logs.map do |dutylog|
      keys(@users_logs).map do |key|
        dutylog.properties.keys.include?(key) ? "#{dutylog.properties[key]}" : "0"
      end
    end
end
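
With the prawn-table gem available, those rows (the first row being the key header) can then be handed straight to Prawn's table helper. A small usage sketch; the file name is illustrative:

# render the dynamic duty-log table into a PDF
Prawn::Document.generate("duty_logs.pdf") do |pdf|
  pdf.table(hstore_duty_log_rows, header: true)
end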