Is there another method of handling Time Travel in Data Science in cases where the underlying system does not have that feature?

I'm looking for a term or technique that might store schema changes over time. What did we do before Big Data? Is there something like a "Schema Ledger" that can capture things like "this field was renamed to that field" or "this SKU was moved to a different department on July 4th, 1976"?

Where in the PDF text did the modification likely take place?

I have a PDF file that was created on a certain date, and from the metadata it was last modified on a date after its creation.
The PDF is nearly all text, and there is a sentence in it that has likely been extended and had a word deleted. Can I find out whether this particular sentence was in fact (likely) modified between the creation date and the last modification date, or rule it out?
I didn't know whether I could convert the PDF to a more elementary format (similar to .tex), or view it in another, more elementary application (like CosEdit), to identify whether this sentence was extended and words were deleted between the creation date and the last modification date.
Don't worry about anyone attempting to conceal the modifications in any way. That's not applicable in this instance.
Link to document: https://drive.google.com/file/d/1OFXRCw2U1mo7BjHUSGs_1fVjDsQLRo0V/view?usp=drivesdk
The relevant line is on page 5. It's the first bullet point under the title "Criteria for Addressing a Property".
There is not much value or certainty in analysing a reasonably well-constructed PDF, and the sample provided is of unknown pedigree. I personally would not trust a PDF history comparison over a conventional paper trail. You are querying the changes made to a newer copy of a public document.
We can see the original was reported as produced by the technician using Word 2013 on 6/12/2017, potentially after drafts had been corrected by management. The source document reports that there were 2 prior changes, which are not of concern here, since the document as it stood at that time would then (as if printed) have gone forward for final approval, master sign-off, and publication.
You provided a secondary, amended copy of the same policy document. An initial query shows it appears to have been changed at some point, but there are no incremental editions to be pared back, so using a comparison tool we can check for the differences.
A first look suggests 5 of 8 pages were changed (updated per annual review):
The first change is on page 3: the admin charge for 2021 is now £86 (it was £75 in 2017).
The second change is on page 5; more on that later.
The third change is on page 6, where "premise" has been changed to "primary".
The fourth is on page 7, where the example numbered "... 1" is changed to lettered "... A".
Finally, on page 8, the technician has been promoted over the years and the department has been renamed.
All these changes would have been made in the source Word document, which in turn may have changed many more times than we shall know without the paper trail showing which day the technician was formally appointed, the department changed name, or the annual charges were increased. A PDF is dumbly generated, so it only shows a difference from the original.
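As an aside, a comparison like the one above can be reproduced with nothing more than plain text extraction and a diff. A minimal sketch, assuming poppler's pdftotext is installed and that the two copies are saved under the hypothetical names policy_2017.pdf and policy_2021.pdf:

```python
# Extract plain text from two PDF revisions and show a unified diff of the wording.
# Assumes poppler-utils (pdftotext) is installed; the file names are hypothetical.
import subprocess
import difflib

def extract_text(pdf_path: str) -> list[str]:
    # "-layout" keeps the rough page layout; "-" writes the extracted text to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

old_lines = extract_text("policy_2017.pdf")
new_lines = extract_text("policy_2021.pdf")

# Print only the changed lines; small wording differences (e.g. "new" -> "a")
# show up clearly, but nothing about who changed them or when.
for line in difflib.unified_diff(old_lines, new_lines,
                                 fromfile="policy_2017.pdf",
                                 tofile="policy_2021.pdf",
                                 lineterm=""):
    print(line)
```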
Your query is whether we can tell how many times, when, or by whom page 5 was changed. As you may have gathered from the above, the short answer is usually no (not from a PDF).
The changes to a policy document over time are driven by many factors, such as inflation, spell-checking and proofreading corrections, or changes in managerial policy.
Page 5 was changed in two places:
semantically, the unnecessary word "new" was replaced with "a"
and a concession was added to the end of the paragraph:
"unless justification can be supplied"
There is no way of knowing who penned those changes; we can only guess with some certainty that the technician was directed to make those corrections between 2017 and 2021. Whether that direction was verbal, by email, or on paper we do not know; those are other documents. What we do know is that the final document must have been approved for PDF printing, unless your copy is unofficial.
If you wish to know more, see https://www.whatdotheyknow.com/request/street_naming_information

What are the steps to convert a Scenario to BPMN?

I have an exam tomorrow and, to be honest, I still don't know what steps I should go through to model a given scenario.
For example, when you see a scenario like this:
Every weekday morning, the database is backed up and then it is checked to see whether the “Account Defaulter” table has new records. If no new records are found, then the process should check the CRM system to see whether new returns have been filed. If new returns exist, then register all defaulting accounts and customers. If the defaulting client codes have not been previously advised, produce another table of defaulting accounts and send to account management. All of this must be completed by 2:30 pm, if it is not, then an alert should be sent to the supervisor. Once the new defaulting account report has been completed, check the CRM system to see whether new returns have been filed. If new returns have been filed, reconcile with the existing account defaulters table. This must be completed by 4:00 pm otherwise a supervisor should be sent a message.
What is your approach to modelling this? I am not asking for the answer to this particular scenario; I am asking for the method. Do you model it sentence by sentence, or do you try to figure out the big picture first and then find the sub-processes?
There are no exact steps. Use imagination, Luke!
You can take these funny instructions as a starting point, but they were made by dummies for dummies.
Commonly, you should sketch the process steps and process participants schematically on a sheet of paper and try to build your model from that. There is no other way: only brainstorming.
When BPMN comes to mind, one thinks of people together in a conference room discussing how the business does things (creating what you call scenarios and translating them into business processes) and drawing boxes and lines on a whiteboard.
Since 2011, when BPMN 2.0 appeared as an Object Management Group (OMG) specification, we have had a very comprehensive 532-page PDF with pretty much all the information needed to create the process diagrams one needs.
Still, in addition to reading that specification, one can also find many BPMN examples of common modelling problems, patterns, books, and research papers that help one understand how certain scenarios come to life.
Generally speaking, we first identify who takes part in the process to understand who the actors are. Next, we work out where they get their input (if they get any), what they do with it (if they do anything), and where they forward it after they have completed their work (if they forward it at all). This lets us see that each actor has specific tasks that follow a specific flow of work, and we can draw it more easily.
Then, once a clean and simple diagram is built, one can validate it by visualizing (in real life or otherwise) the users/actors executing the activities.
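To make that first decomposition step concrete, here is a minimal sketch that outlines a made-up process as actors, their tasks, and their inputs and outputs before any diagram is drawn. The actor and task names are hypothetical and deliberately not taken from the exam scenario:

```python
# Outline a process as actors (lanes), tasks, and hand-offs before drawing BPMN.
# All names below are hypothetical; the point is the decomposition, not the content.
process = {
    "actors": ["Clerk", "Supervisor"],
    "tasks": [
        {"actor": "Clerk",      "task": "Receive request",   "input": None,           "output": "request form"},
        {"actor": "Clerk",      "task": "Check request",     "input": "request form", "output": "checked form"},
        {"actor": "Supervisor", "task": "Approve or reject", "input": "checked form", "output": "decision"},
    ],
}

# Print one "lane" per actor; lanes map naturally onto BPMN pools and lanes,
# and the input/output columns hint at the sequence and message flows.
for actor in process["actors"]:
    print(f"Lane: {actor}")
    for step in process["tasks"]:
        if step["actor"] == actor:
            print(f"  [{step['task']}]  in: {step['input']}  out: {step['output']}")
```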

How should I deal with copies of data in a database?

What should I do if a user has a few hundred records in the database and would like to make a draft, where they take all the current data, make some changes, and save this as a draft, potentially for good, keeping the two copies?
Should I duplicate all the data in the same table and mark it as a draft?
Or only duplicate the changes, and then fall back to the "non-draft" data where no changes exist?
The user should be able to make their changes and then still go back to the live data and make changes there without affecting the draft.
Simply introduce a version field in the tables that would be affected.
Content management systems (CMS) do this already. You can create a blog post, for example, and it has version 1. Then a change is made, and that gets version 2, and so on.
You will obviously end up storing quite a bit more data. A nice benefit, though, is that you can easily write queries to load a version (or a snapshot) of the data.
As a convention you could always make the highest version number the "active" version.
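A minimal sketch of that idea, using SQLite from Python purely for illustration (table and column names are hypothetical): every save inserts a new row with an incremented version, and the highest version per record is treated as the active one.

```python
# Version-field approach: every save is a new row; the highest version is "active".
# SQLite is used only for illustration; table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (
        post_id  INTEGER NOT NULL,   -- logical record id, shared by all versions
        version  INTEGER NOT NULL,   -- 1, 2, 3, ...
        title    TEXT NOT NULL,
        body     TEXT NOT NULL,
        PRIMARY KEY (post_id, version)
    );
""")

def save(conn, post_id, title, body):
    # Next version = current highest version for this record + 1.
    cur = conn.execute(
        "SELECT COALESCE(MAX(version), 0) + 1 FROM posts WHERE post_id = ?",
        (post_id,))
    next_version = cur.fetchone()[0]
    conn.execute(
        "INSERT INTO posts (post_id, version, title, body) VALUES (?, ?, ?, ?)",
        (post_id, next_version, title, body))

save(conn, 1, "Hello", "first draft")
save(conn, 1, "Hello", "second draft")

# Load the active (highest) version of each record.
for row in conn.execute("""
    SELECT p.post_id, p.version, p.title, p.body
    FROM posts p
    WHERE p.version = (SELECT MAX(version) FROM posts WHERE post_id = p.post_id)
"""):
    print(row)   # (1, 2, 'Hello', 'second draft')
```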
You can either use BEGIN TRANSACTION, COMMIT and ROLLBACK statements, or you can create a stored procedure or piece of code so that any amendments the user makes are put into temporary tables until they are ready to be put into production.
If you are making a raft of changes, it is best to use temporary tables, as holding a transaction open can result in locks on the live data for other users.
This article might help if the above means nothing to you: http://www.sqlteam.com/article/temporary-tables
EDIT - You could create new tables (i.e. not temporary, but full-fledged SQL tables) "on the fly" and name them something meaningful. For instance, the user's initials, followed by the original table name, followed by a timestamp.
You can then programmatically create, amend and delete these tables over long periods of time, as well as compare them against the live tables. You would need to keep track of how many tables are being created in case your database grows to a vast size.
The only major headache then is merging the changes back into the live data. For instance, someone might take a cut of data into a new table and then, 3 weeks later, decide to push it into live after making changes. In that case there is a likelihood that the live data has changed anyway, possibly superseding the changes the user will submit.
You can get around this with some creative coding, though. There are many ways to tackle it, so if you get stuck at the next step you might want to start a new question. Hopefully this at least gives you some inspiration.
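A minimal sketch of the "new table on the fly" idea, again using SQLite from Python for illustration (the naming convention, tables and columns are hypothetical): copy the live rows into a draft table named with the user's initials and a timestamp, and later compare the draft against live before merging.

```python
# "Table on the fly" draft approach: copy live rows into a user-specific draft
# table, edit the draft, then compare it against live before merging back.
# SQLite is used only for illustration; names and columns are hypothetical.
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE properties (id INTEGER PRIMARY KEY, address TEXT, price INTEGER);
    INSERT INTO properties VALUES (1, '1 High St', 100), (2, '2 Low Rd', 200);
""")

def create_draft(conn, user_initials: str) -> str:
    # e.g. "JB_properties_20240101120000"
    draft_name = f"{user_initials}_properties_{datetime.now():%Y%m%d%H%M%S}"
    conn.execute(f"CREATE TABLE {draft_name} AS SELECT * FROM properties")
    return draft_name

draft = create_draft(conn, "JB")

# The user edits the draft without touching live data.
conn.execute(f"UPDATE {draft} SET price = 150 WHERE id = 1")

# Before merging back, list rows where the draft now differs from live.
diff_sql = f"""
    SELECT d.id, p.price AS live_price, d.price AS draft_price
    FROM {draft} d JOIN properties p ON p.id = d.id
    WHERE d.price <> p.price OR d.address <> p.address
"""
for row in conn.execute(diff_sql):
    print(row)   # here: (1, 100, 150)
```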

Full text address matching

I'm looking for duplicate records. I have a Property table with the fields street, number, city, state, county, and zip. Records get geocoded based on location, but there are some holes in the data. The problem is that if someone makes a simple typing error or omits certain fields, the records won't come up as matches.
As of now, a straight = comparison and LIKE aren't doing a very good job, but Jaro-Winkler and similar edit-distance algorithms run with extremely poor performance.
The CASS-Certified Scrubbing Service from SmartyStreets offers deduplication as part of their address verification process. Just upload the data in a delimited text file and the duplicates will be marked on the output file you download. There's always a free preview for each file you process so you don't have to purchase anything before you're satisfied with the results. I'm a software developer for SmartyStreets and helped write the application. I'm pretty pleased with both its functionality and ease of use. We also have an API you could use but the deduplication would be your responsibility (just compare the full, 12-digit Delivery Point Barcode, which serves as a unique identifier for addresses).
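If a third-party service is not an option, the performance problem with pairwise edit-distance checks can often be tamed by blocking: only compare records that already share a cheap key (the zip code here), and then run a similarity measure within each block. A minimal sketch using only the standard library, with difflib's SequenceMatcher as a stand-in similarity (a Jaro-Winkler implementation from a package such as jellyfish could be swapped in); the field names mirror the question, and the threshold is a hypothetical tuning knob:

```python
# Deduplicate addresses by blocking on zip, then fuzzy-matching within each block.
# SequenceMatcher is a stand-in similarity; swap in Jaro-Winkler if available.
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def normalize(record: dict) -> str:
    # Crude normalization: lower-case, strip, and join the address parts.
    parts = (record.get(f, "") for f in ("number", "street", "city", "state", "county"))
    return " ".join(p.strip().lower() for p in parts if p)

def find_duplicates(records: list[dict], threshold: float = 0.75):
    # Block on zip so records from different ZIP codes are never compared.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec.get("zip", "").strip()].append(rec)

    for zip_code, block in blocks.items():
        for a, b in combinations(block, 2):
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score >= threshold:   # hypothetical threshold; tune on real data
                yield a, b, score

records = [
    {"number": "123", "street": "Main St",  "city": "Springfield", "state": "IL", "county": "Sangamon", "zip": "62701"},
    {"number": "123", "street": "Main Str", "city": "Springfield", "state": "IL", "county": "",         "zip": "62701"},
]
for a, b, score in find_duplicates(records):
    print(round(score, 2), a["street"], "<->", b["street"])
```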

Designing a database with a flexible user profile

I am working on a design where I can have flexible attributes for users, and I am confused about how to continue the design of the schema.
I made a table where I keep the information the system needs:
Table name: users
id
username
password
Now, I wish to create a profile table with a one-to-one relation that holds all the other attributes, such as email, first name, last name, etc. My question is: is there a way to add a third table in which profiles will be flexible? In other words, if my clients need to create a new attribute, they won't need any customization of the code.
You're looking for a normalized table: a table with user_id, key, and value columns, which produces a 1:N relationship between User and this new table. Look at http://en.wikipedia.org/wiki/Database_normalization for a little more information. Performance isn't amazing with tables like this, and it can take some interesting planning to optimize your code, but it's a very standard practice.
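A minimal sketch of that key/value table, using SQLite from Python for illustration (table and column names are hypothetical apart from those mentioned above):

```python
# Key/value profile table: one row per (user, attribute), 1:N from users.
# SQLite is used only for illustration; names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id       INTEGER PRIMARY KEY,
        username TEXT NOT NULL,
        password TEXT NOT NULL
    );
    CREATE TABLE user_profile (
        user_id INTEGER NOT NULL REFERENCES users(id),
        "key"   TEXT    NOT NULL,
        "value" TEXT    NOT NULL,          -- everything is stored as text
        PRIMARY KEY (user_id, "key")
    );
    INSERT INTO users VALUES (1, 'alice', 'pw-hash');
    INSERT INTO user_profile VALUES (1, 'email', 'alice@example.com'),
                                    (1, 'first_name', 'Alice');
""")

# Read back one user's profile as a dict, whatever attributes happen to exist.
profile = dict(conn.execute(
    'SELECT "key", "value" FROM user_profile WHERE user_id = ?', (1,)))
print(profile)   # {'email': 'alice@example.com', 'first_name': 'Alice'}
```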
Keep the fixed parts of the profile in a standard table to make it easy to query, add constraints, etc.
For the configurable parts, it sounds like you are looking for an entity-attribute-value (EAV) model. The extra configurability comes at a high cost, though: everything will have to be stored as strings, and you will have to do any data validation in the application, not in the database.
How will these attributes be used? Are they simply a bag of data, or would the user expect the system to do something with these values? Are there ever going to be any reports against them?
If the system must do something with these attributes, then you should make them columns, since code will have to be written anyway that does something special with the values. However, if the customers just want them to store data, then an EAV might be the ticket.
If you are going to implement an EAV, I would suggest adding a DataType column to your attributes table. This enables you to do some rudimentary validation on the entered data and dynamically change the control used for entry.
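A minimal sketch of what that DataType column buys you, with the validation done in the application layer as noted above (the attribute names and type labels are hypothetical):

```python
# EAV attribute definitions carry a data type so the application can validate
# values (and pick an input control) before everything is stored as text.
# Attribute names and type labels are hypothetical.
from datetime import date

# attribute name -> declared data type (in practice this mapping would be
# loaded from the attributes table alongside its DataType column)
ATTRIBUTE_TYPES = {
    "email":       "string",
    "birth_date":  "date",
    "loyalty_pts": "int",
}

def validate(attribute: str, raw_value: str) -> str:
    """Check raw_value against the attribute's declared type; return the text to store."""
    data_type = ATTRIBUTE_TYPES[attribute]
    if data_type == "int":
        int(raw_value)                      # raises ValueError if not an integer
    elif data_type == "date":
        date.fromisoformat(raw_value)       # raises ValueError if not YYYY-MM-DD
    # "string" needs no check; everything is ultimately stored as text anyway
    return raw_value

print(validate("loyalty_pts", "42"))         # ok
print(validate("birth_date", "1976-07-04"))  # ok
# validate("loyalty_pts", "lots")            # would raise ValueError
```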
If you are going to use an EAV, then the one rule you must follow is never to write any code where you specify a particular attribute. If these custom attributes are nothing more than a wad of data, then an EAV for this one portion of your system will work. You could even consider creating an XML column to store these attributes. SQL Server actually has an XML data type, but all databases have some form of large text data type that will also work. On reports, the data would only ever be spat out. You would never place specific values in specific places on reports, nor would you ever do any kind of numerical operation against the data.
The price of an EAV is vigilance and discipline. You have to have discipline among yourself, the other developers, and especially the report writers to never filter on a specific attribute, no matter how much pressure you get from management. The moment a client wants to filter or do operations on a specific attribute, it must become a first-class attribute as a column. If you feel that this kind of discipline cannot be maintained, then I would simply create columns for each attribute, which would mean an adjustment to the code but will create less of a mess down the road.