I'm working on an integration with Workday and am tracking people by their WID (Workday ID), such as 85bb0669d8ac412582c0a473f7074d79. That WID may be interleaved with WIDs from unrelated employers using completely distinct Workday accounts, as well as with IDs (known to be UUIDs) from other (non-Workday) sources.
The only ID source that is not known to be a UUID is the WID.
My standard approach to ensuring uniqueness of IDs from various external sources would be to save two fields: external_source (e.g. "workday") and external_id (e.g. "85bb0669d8ac412582c0a473f7074d79"). Combined, these two fields ensure uniqueness of person IDs across all sources and employers. But if I can confirm that the WID is in fact a UUID, I can make some desirable optimizations.
I've found no explicit definition of the WID in Workday documentation other than: "The unique identifier type. Each 'ID' for an instance of an object contains a type and a value. A single instance of an object can have multiple 'ID' but only a single 'ID' per 'type'." from https://community.workday.com/sites/default/files/file-hosting/productionapi/Human_Resources/v20/Get_Workers.html
All the samples of WIDs I've seen are 32-character hexadecimal strings, matching some non-authoritative articles I've found, but I've not found any Workday documentation specifying that they will always be in that format. They are not formatted with hyphens like a UUID, but could be arranged that way.
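Not an authoritative answer, but as a quick format check: Python's standard `uuid` module will parse a 32-character hex string without hyphens, so a WID can at least be canonicalized into UUID form. Whether every WID is guaranteed to be a valid UUID is exactly the open question; this only checks the shape.

```python
import uuid

def wid_as_uuid(wid):
    """Return the canonical hyphenated UUID form, or None if not 32 hex chars."""
    try:
        return str(uuid.UUID(hex=wid))
    except ValueError:
        return None

# A sample WID (32 hex chars, no hyphens)
print(wid_as_uuid("85bb0669d8ac412582c0a473f7074d79"))
# 85bb0669-d8ac-4125-82c0-a473f7074d79
```

Note that this accepts any 32 hex characters; it does not verify UUID version or variant bits.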
So... does anyone have a reference to Workday documentation that specifies the contents of a WID? Lacking official docs, does anyone have practical knowledge about it?
How does everyone deal with missing values in a dataframe? I created a dataframe using the Census Web API to get the data. The 'GTCBSA' variable provides the city information, which I need for my visualization (plotly and dash), and I found that there are a lot of missing values in the data. Do I just leave them blank and continue with my data visualization? The following is my variable:
Example data for 2004 = https://api.census.gov/data/2004/cps/basic/jun?get=GTCBSA,PEFNTVTY&for=state:*
Variable description = https://api.census.gov/data/2022/cps/basic/jan/variables/GTCBSA.json
There are different ways of dealing with missing data depending on the use case and the type of data that is missing. For example, for a near-continuous stream of timeseries signals data with some missing values, you can attempt to fill the missing values based on nearby values by performing some type of interpolation (linear interpolation, for example).
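For illustration, here is a minimal hand-rolled linear interpolation over gaps (marked as None). In practice you would reach for a library call such as pandas' `Series.interpolate()` rather than writing this yourself.

```python
def interpolate_linear(values):
    """Fill None gaps by linear interpolation between the nearest known neighbours."""
    result = list(values)
    n = len(result)
    for i, v in enumerate(values):
        if v is None:
            # nearest known values to the left and right of the gap
            lo = next((j for j in range(i - 1, -1, -1) if values[j] is not None), None)
            hi = next((j for j in range(i + 1, n) if values[j] is not None), None)
            if lo is not None and hi is not None:
                frac = (i - lo) / (hi - lo)
                result[i] = values[lo] + frac * (values[hi] - values[lo])
    return result

print(interpolate_linear([1.0, None, None, 4.0]))  # [1.0, 2.0, 3.0, 4.0]
```

Gaps at the edges (no known neighbour on one side) are left as None.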
However, in your case, the missing values are cities and the rows are all independent (each row is a different respondent). As far as I can tell, you don't have any way to reasonably infer the city for the rows where the city is missing so you'll have to drop these rows from consideration.
I am not an expert in the data collection method(s) used by the US Census, but from this source, it seems like there are multiple methods used, so I can see how the city of a respondent might not be known (the online tool might not be able to obtain it, or perhaps the respondent declined to state it). Missing data is a very common issue.
However, before dropping all of the rows with missing cities, you might do a brief check for patterns (e.g. are the rows with missing cities predominantly from one state?). If you are doing any state-level analysis, you could keep the rows with missing cities.
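A quick way to run that check, using hypothetical rows shaped like the API response (`GTCBSA`, `PEFNTVTY`, state), where `"0"` is assumed to mark an unidentified metro area:

```python
from collections import Counter

# Invented example rows in the [GTCBSA, PEFNTVTY, state] shape of the
# Census API response; "0" is assumed to mean the metro area was not identified.
rows = [
    ["12060", "57", "13"],
    ["0",     "57", "13"],
    ["0",     "57", "02"],
    ["0",     "57", "02"],
]

# Count missing-city rows per state to spot any concentration
missing_by_state = Counter(state for gtcbsa, _, state in rows if gtcbsa == "0")
print(missing_by_state.most_common())  # [('02', 2), ('13', 1)]
```

If the counts are roughly proportional to each state's total rows, the missingness is at least not obviously state-dependent.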
It seems that the SNOMED codeset is rapidly evolving. New codes are being added, old codes are being deactivated. Do the releases provide upgrade maps from old code to new codes? (In particular, I'm interested in the UK extension).
Yes.
The information you're after will be in two files.
Concept inactivation indicator reference set
This reference set will tell you why a concept was inactivated, e.g. Duplicate, Erroneous, Outdated, etc.
The referencedComponentId is the concept that was inactivated.
The files are all IDs, so you'll need to look up the human-readable terms.
The specification is here: 5.2.3 Attribute Value Reference Set
Historical association reference set
Not this refset itself, but its subtypes, will be the mapping you are looking for. Each reference set provides a specific association as required.
For example, "SAME AS association reference set" provides the mappings for duplicate concepts. The referencedComponentId is the inactive concept (X) and the targetComponentId is the association (Y). Read as "X SAME AS Y". Specification: 5.2.5 Association Reference Set
As a holistic example:
247702001|Bad trips (finding)| is a retired concept.
It has the following entries in the above reference sets:
|8c7497f7-8230-4d01-9967-c85dd6f57e46|...|247702001|900000000000482003(Duplicate)|
|d09da0d7-b8f9-4d07-9b28-5c74ab247cb1|...|247702001|28368009(Psychoactive substance-induced organic hallucinosis)|
Presumably you've got the RF2 snapshot in a database? My advice is to create a table or view, of the three tables joined.
select * from concepts
left join conceptInactivationRefset
    on concepts.id = conceptInactivationRefset.referencedComponentId
    and conceptInactivationRefset.active = 1
left join SAMEASrefset
    on concepts.id = SAMEASrefset.referencedComponentId
    and SAMEASrefset.active = 1;
(Note: the active = 1 conditions belong in the join clauses; putting them in a where clause would turn the left joins into inner joins and drop concepts that have no refset entries.)
That should give you a start.
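To make the "X SAME AS Y" reading concrete, here is a toy sketch with the two refsets as in-memory maps. The rows are invented stand-ins for the RF2 snapshot files (only the 247702001 example values are from the example above); a real implementation would query the joined tables instead.

```python
# Inactivation refset: referencedComponentId -> valueId (reason for inactivation)
inactivation = {
    "247702001": "900000000000482003",  # Duplicate
}

# SAME AS association refset: referencedComponentId -> targetComponentId
same_as = {
    "247702001": "28368009",  # Psychoactive substance-induced organic hallucinosis
}

def replacement_for(concept_id):
    """Return (inactivation reason, replacement concept) for an inactive concept."""
    return inactivation.get(concept_id), same_as.get(concept_id)

print(replacement_for("247702001"))  # ('900000000000482003', '28368009')
```

The same lookup is what the joined view gives you row-by-row.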
I am looking for the specific meaning of the following fields:
valueIdentifier
valueMask
fieldType
FieldInfoMultiFixed
AutoRegFieldInfoSingle
FieldInfoMultiVariable
and in most cases we are getting a numerical value for helpText. How do we identify whether helpText is present or not?
A lot of the stuff like FieldInfoMultiFixed/Variable is discussed in the Yodlee SDK Developer Guide; search for either one. They're basically just silly combos where a single value is broken up into multiple fields (like a phone number or SSN split across 3 textboxes).
As for the helpText, every time I've seen a Yodlee tech respond, they say no. The number corresponds to an internal resource identifier that is apparently not exposed through the API. I want to say I saw somebody mention it might be available for things like forum signup/registration (where it would be more useful). The SDK mentions it as if it works as you would expect, but that is an error.
Currently Yodlee does not have helpText populated for any field, hence a numerical value is associated with it. In the future, if any helpText gets added, then instead of a numerical value you will have text in that field.
Hence, if you are receiving numerical values, you should take it as helpText not being present.
Shreyans
I am working on a project which shows RSS feeds from different sites.
I keep them in a database; every 3 hours my program fetches the feeds and inserts them into a SQL database.
I want unique records per provider so as not to show duplicate content.
But the problem is that some providers do not give a GUID field, and some others give a GUID field but no pubDate. And some others give neither GUID nor pubDate, just title and link.
So what would be the best way to keep RSS feeds unique in SQL Server?
Should I check first for guid, then pubDate, then link, then title? Is it good practice to compare link fields in SQL to check uniqueness?
Thanks.
I would develop a routine that takes certain key parameters like the title, source and body and then combines them to create a CRC hash. Then store the hash as an attribute with the feed and check for a matching hash before adding a new feed.
I'm not sure what your environment constraints are, but here is an example of calculating CRC-32 in C#: http://damieng.com/blog/2006/08/08/calculating_crc32_in_c_and_net
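The linked article is C#; here is a rough Python equivalent of the same idea. The field choice and the separator byte are assumptions for the sketch (the separator prevents "ab"+"c" and "a"+"bc" from producing the same key).

```python
import zlib

def feed_hash(title, source, body):
    """Combine key fields with a separator and return their CRC-32."""
    key = "\x1f".join([title, source, body])
    return zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF

# Dedupe incoming items against previously seen hashes
seen = set()
for item in [("Title", "site-a", "Body"), ("Title", "site-a", "Body")]:
    h = feed_hash(*item)
    if h in seen:
        print("duplicate:", item)
    else:
        seen.add(h)
```

CRC-32 is fast but only 32 bits, so at large scale a longer hash (e.g. MD5/SHA-1) reduces accidental collisions.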
Currently, this is what I am doing
# If we have a GUID in the feed item, use it as the feed_item_id, else use the link
# http://www.詹姆斯.com/blog/2006/08/rss-dup-detection
import hashlib

def build_feed_item_id(entry):
    guid = entry.get('id', '').strip()
    if guid:
        feed_item_id = guid
    else:
        feed_item_id = entry.get('link', '').strip()
    return hashlib.md5(feed_item_id.encode('utf-8')).hexdigest()
It is based on the reasoning mentioned in the blog post linked in the snippet, which I'll quote here in case the post gets taken down:
RSS 2.0 has a guid element that fits the bill perfectly, but it’s not
a required element and many feeds don’t use it.
I can’t say for sure what algorithms applications are using, but after
running 150 tests on more than 20 different aggregators, I think I have
a fair idea how many of them work.
As you would expect, for most the guid is considered the key element
for determining duplicates. This is pretty straightforward. If two
items have the same guid they are considered duplicates; if their
guids differ then they are considered different.
If a feed doesn’t contain guids, though, aggregators will most likely
resort to one of three general strategies – all of which involve the
link element in some way.
Technique 1
The guid must be unique. If a post doesn't have a guid, use the link, title, description, or any combination of them to derive a unique hash.
Technique 2
The link must be unique. If both link and guid are missing, check other elements such as title or description.
Technique 3
The combination of link + title, or link + description, must be unique.
The most obvious recommendation is that you should always include
guids in your feeds.
In addition, I would recommend you also include a unique link element
for each item in your feed, to allow for aggregators that don’t handle
guids very well. No two items should ever have the same link element,
and ideally a link should never change (if you do update a link, be
aware that it could show up as a new item for some aggregators).
Finally, although this is not essential, it is advisable that you
refrain from updating your article titles if at all possible. There
are at least two aggregators that will consider an entry with an
altered title to be a completely new post – somewhat annoying to
readers when all you’ve done is make a spelling correction in your
title.
In our DB, every Person has an ID, which is a DB-generated, auto-incremented integer. Now, we want to generate a more user-friendly alphanumeric ID which can be publicly exposed, something like a passport number. We obviously don't want to expose the DB ID to the users. For the purpose of this question, I will call what we need to generate the UID.
Note: The UID is not meant to replace the DB ID. You can think of the UID as a prettier version of the DB ID, which we can give out to the users.
I was wondering if this UID can be a function of the DB ID. That is, we should be able to re-generate the same UID for a given DB ID.
Obviously, the function will take a 'salt' or key, in addition to the DB ID.
The UID should not be sequential. That is, two neighboring DB IDs should generate visually different-looking UIDs.
It is not strictly required for the UID to be irreversible. That is, it is okay if somebody studies the UID for a few days and is able to reverse-engineer and find the DB ID. I don't think it will do us any harm.
The UID should contain only A-Z (uppercase only) and 0-9, nothing else. And it should not contain characters which can be confused with other letters or digits, like 0 and O, or l and 1. I guess Crockford's Base32 encoding takes care of this.
The UID should be of a fixed length (10 characters), regardless of the size of the DB ID. We could pad the UID with some constant string, to bring it to the required fixed length. The DB ID could grow to any size. So, the algorithm should not have any such input limitations.
I think the way to go about this is:
Step 1: Hashing.
I have read about the following hash functions:
SHA-1
MD5
Jenkins's
The hash returns a long string. I read here about something called XOR folding to bring the string down to a shorter length. But I couldn't find much info about that.
Step 2: Encoding.
I read about the following encoding methods:
Crockford Base 32 Encoding
Z-Base32
Base36
I am guessing that the output of the encoding will be the UID string that I am looking for.
Step 3: Working around collisions.
To work around collisions, I was wondering if I could generate a random key at the time of UID generation and use this random key in the function.
I can store this random key in a column, so that we know what key was used to generate that particular UID.
Before inserting a newly generated UID into the table, I would check for uniqueness and if the check fails, I can generate a new random key and use it to generate a new UID. This step can be repeated till a unique UID is found for a particular DB ID.
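The three steps above can be sketched as follows. Everything here is an assumption for illustration: the choice of SHA-256, the 8-byte fold, the 10-character truncation, and the in-memory `existing` set (which stands in for a uniqueness check against the table).

```python
import hashlib

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # no I, L, O, U

def xor_fold(digest, size):
    """Step 1b: shorten a digest by XORing it into `size` bytes."""
    out = bytearray(size)
    for i, b in enumerate(digest):
        out[i % size] ^= b
    return bytes(out)

def encode_crockford(data, length):
    """Step 2: encode bytes as `length` Base32 chars (bits beyond 5*length are dropped)."""
    n = int.from_bytes(data, "big")
    chars = []
    for _ in range(length):
        n, r = divmod(n, 32)
        chars.append(CROCKFORD[r])
    return "".join(reversed(chars))

def make_uid(db_id, salt):
    """Step 1a + 1b + 2: hash the DB ID with a salt, fold, and encode."""
    digest = hashlib.sha256(f"{db_id}:{salt}".encode()).digest()
    return encode_crockford(xor_fold(digest, 8), 10)

existing = set()  # stands in for the uniqueness check against the table

def assign_uid(db_id):
    """Step 3: on a (rare) collision, retry with a fresh salt."""
    salt = 0
    uid = make_uid(db_id, salt)
    while uid in existing:
        salt += 1
        uid = make_uid(db_id, salt)
    existing.add(uid)
    return uid, salt  # store the salt so the UID can be regenerated later
```

Storing the salt alongside the row is what keeps the UID reproducible from the DB ID, per the requirement above.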
I would love to get some expert advice on whether I am going along the correct lines and how I go about actually implementing this.
I am going to be implementing this in a Ruby On Rails app. So, please take that into consideration in your suggestions.
Thanks.
Update
The comments and answer made me re-think and question one of the requirements I had: the need to be able to regenerate the UID for a user after assigning it once. I guess I was just trying to be safe in the case where we lose a user's UID: if it is a function of an existing property of the user, we would be able to get it back. But we can get around that problem just by using backups, I guess.
So, if I remove that requirement, the UID then essentially becomes a totally random 10 character alphanumeric string. I am adding an answer containing my proposed plan of implementation. If somebody else comes with a better plan, I'll mark that as the answer.
As I mentioned in the update to the question, I think what we are going to do is:
Pre-generate a sufficiently large number of random and unique ten character alphanumeric strings. No hashing or encoding.
Store them in a table in a random order.
When creating a user, pick the first of these strings and assign it to the user.
Delete this picked ID from the pool of IDs after assigning it to a user.
When the pool reduces to a low number, replenish the pool with new strings, with uniqueness checks, obviously. This can be done in a Delayed Job, initiated by an observer.
The reason for pre-generating is that we are offloading all the expensive uniqueness checking to a one-time pre-generation operation.
When picking an ID from this pool for a new user, uniqueness is guaranteed. So, the operation of creating user (which is very frequent) becomes fast.
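A minimal sketch of the pre-generation step, assuming a Crockford-style alphabet (digits kept; O, I, L, and U dropped) and an in-memory structure where the real app would use its own table plus a Delayed Job:

```python
import secrets

ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # no O/I/L/U, so no 0-vs-O or 1-vs-l confusion

def generate_pool(size, length=10):
    """Pre-generate `size` unique random IDs; the set gives the uniqueness check for free."""
    pool = set()
    while len(pool) < size:
        pool.add("".join(secrets.choice(ALPHABET) for _ in range(length)))
    return list(pool)

pool = generate_pool(1000)

def assign_uid():
    """Pop one pre-checked ID; replenish the pool when it runs low."""
    return pool.pop()
```

With 32^10 possible strings, even a large pre-generated pool collides with existing assignments only rarely, which is why the one-time uniqueness check stays cheap.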
Would mapping the DB ID's digits to characters work for you? The snippet below takes the integer's digits, shifts each into a letter, and appends the user's last name. Example:
user = {:id => 123456, :f_name => "Scott", :l_name => "Shea"}
(user.id.to_s.split(//).map {|x| (x.to_i + 64).chr}).join.downcase + user.l_name.downcase
#result = "abcdefshea"