Difference between Data Verification/Validation, Data Exploring/Exploitation, Data Acquisition/Collecting/Gathering? - data-science

What is the difference between these terms in data science?
Is there a glossary that defines them?
- Data Verification / Validation
- Data Exploring / Exploitation
- Data Acquisition / Collecting / Gathering
Thanks,

I found an answer for just the first part of your question:
- Data Verification / Validation:
Data verification is a way of ensuring that the user types in what he or she intends; in other words, it makes sure the user does not make a mistake when inputting data. An example is double entry of data (such as when creating a password or entering an email address) to catch mistyped input. Data validation, by contrast, has nothing to do with what the user intends to input. Validation is about checking the input data to ensure it conforms with the data requirements of the system, to avoid data errors. An example is a range check, which rejects an input number greater or smaller than the specified range allows.
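To make the distinction concrete, here is a minimal sketch (TypeScript, with made-up field names and ranges):

// Verification: confirm the user typed what they intended (double entry).
function verifyDoubleEntry(email: string, emailRepeated: string): boolean {
  return email === emailRepeated;
}

// Validation: confirm the value conforms to the system's data requirements (range check).
function validateQuantity(quantity: number): boolean {
  return Number.isInteger(quantity) && quantity >= 1 && quantity <= 999;
}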

Related

What is considered enough validation within an Excel dataset

When you are providing a report to an end user and you want to validate the report against the system and database (checking that your SQL code is pulling the details accurately), what is considered enough validation of the output Excel file?
For example:
10% of 100 would be 10;
10% of 1,000 would be 100, which seems reasonable;
10% of 1,000,000 would be 100,000, which seems completely unreasonable.
Is there a template or scale for human validation of large datasets? Has anyone done or seen something like this before that I could use as a guide?
When you say "validate against the database", I am assuming the data you are giving has already been pulled from the database. So is the question how to validate data that was input, or how to validate that the queries are accurate?
If it's the first, the only real way to do this is to have controls on the type of entry (i.e. string, date, and integer validation).
If it's the second, then you should evaluate it against some quantitative criteria. For example, if I am trying to validate that I sold 100 computers last month and I have hard receipts for those 100, then I can query to ensure the report reflects the actual truth. Beyond that, you could have controls to make sure reports don't contain duplicates, and so on, but that has more to do with general administration of input.
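As an illustration of such quantitative criteria, here is a minimal sketch (TypeScript, with hypothetical names) that reconciles a report extract against independently queried control totals, rather than hand-checking a percentage of rows:

interface ReportRow { id: string; amount: number }

// Compare the exported report against control totals pulled straight from the database.
function reconcile(reportRows: ReportRow[], dbRowCount: number, dbAmountSum: number): string[] {
  const problems: string[] = [];
  if (reportRows.length !== dbRowCount) {
    problems.push(`row count mismatch: report=${reportRows.length} db=${dbRowCount}`);
  }
  const reportSum = reportRows.reduce((sum, row) => sum + row.amount, 0);
  if (Math.abs(reportSum - dbAmountSum) > 0.005) {
    problems.push(`amount sum mismatch: report=${reportSum} db=${dbAmountSum}`);
  }
  const uniqueIds = new Set(reportRows.map(row => row.id));
  if (uniqueIds.size !== reportRows.length) {
    problems.push('duplicate ids in report');
  }
  return problems; // an empty array means the control totals agree
}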

How to manage additional processed data in MarkLogic

MarkLogic 9.0.8.2
We have around 20M records in MarkLogic.
For one of the business requirements, we need to generate additional data for each XML document, and users will then search this data.
As we can't change the original documents, we need input on the best way to manage the additional data. These are the options we have thought of:
1. Create a separate collection and store the additional data in a separate XML document with the same unique number as the original. When the user searches, we search this collection, then retrieve the original documents and send the response back.
2. Store the additional data in the original document's properties.
We also need to create an element range index to make sure searches work when the end user uses range operators.
<abc>
  <xyz>
    <quan>qty1</quan>
    <value1>1.01325E+05</value1>
    <unit>Pa</unit>
  </xyz>
  <xyz>
    <quan>qty2</quan>
    <value1>9.73E+02</value1>
    <value2>1.373E+03</value2>
    <unit>K</unit>
  </xyz>
  <xyz>
    <quan>qty3</quan>
    <value1>1.8E+03</value1>
    <unit>s</unit>
  </xyz>
  <xyz>
    <quan>qty4</quan>
    <value1>3.6E+03</value1>
    <unit>s</unit>
  </xyz>
</abc>
We need to process data from the value1 element. Users will then search for something like:
qty1 >= minvalue AND qty1 <= maxvalue
qty2 >= minvalue AND qty2 <= maxvalue
qty3 >= minvalue AND qty3 <= maxvalue
So when a user searches on qty1, it should only match data from elements where quan is qty1, and so on.
So we would like to know:
What is the best approach to store data like this?
What kind of index should I create to implement this?
I would recommend wrapping the original data in an envelope, which allows adding extra data in a header. It also allows creating a canonical view of the relevant pieces of the data; you can either store that view as the instance and the original as an 'attachment' (a sub-property, not an attached binary), or keep the instance as-is and put canonical values for indexing in the header.
There is a lengthy blog article on the topic that discusses the pros and cons in detail: https://www.marklogic.com/blog/envelope-design-pattern/
HTH!
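To make that concrete, here is a rough Server-Side JavaScript sketch. The element names (env, header, qty1), URIs, and values are made up, and it assumes element range indexes of type double on the header elements:

'use strict';
declareUpdate();

// Wrap the original document in an envelope; canonical numeric values go in the header.
const original = cts.doc('/records/123.xml');
const envelope = fn.head(xdmp.unquote(
  '<env>' +
    '<header>' +
      '<qty1>101325</qty1>' +  // canonical value parsed from value1 where quan = qty1
      '<qty2>973</qty2>' +
    '</header>' +
    '<content>' + xdmp.quote(original) + '</content>' +
  '</env>'
));
xdmp.documentInsert('/enveloped/123.xml', envelope);

A later request can then search with a plain range query against the header element:

const results = cts.search(cts.elementRangeQuery(xs.QName('qty1'), '>=', 100000));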
Grtjn's answer would be the recommended solution, as it is more performant to keep all the information inside the document itself rather than having to query across both the document and its properties; however, it does require changes to the document.
Options 1 and 2 could both work.
Properties documents already exist, so option 2 doesn't add fragments, but the properties must conform to the properties schema.
Creating a sidecar document provides more flexibility, but because you are creating new documents, it will increase the number of fragments.
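For completeness, a rough Server-Side JavaScript sketch of option 1 (the collection names and the recordId linking element are hypothetical):

// Search the sidecar collection, then fetch the originals via the shared unique number.
const sidecars = cts.search(cts.andQuery([
  cts.collectionQuery('additional-data'),
  cts.elementRangeQuery(xs.QName('qty1'), '>=', 100000)
]));
const originals = [];
for (const sidecar of sidecars) {
  const id = fn.string(fn.head(sidecar.root.xpath('//recordId')));
  originals.push(fn.head(cts.search(cts.andQuery([
    cts.collectionQuery('originals'),
    cts.elementValueQuery(xs.QName('recordId'), id)
  ]))));
}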

How to create a unique number in Sanity.io?

I have a field of type number defined on a document in my schema. When the user inputs a number, I want a validation that verifies that no other document of the same type has the same number assigned to this field. How can I do this?
There's no out-of-the-box solution to check for uniqueness. Currently the only input that does this is the slug field. However, you can create your own custom validation that uses the client to check for other documents with the same number in the specific field.
You can read more about custom validation in the docs. To import the client, you can add import client from 'part:@sanity/base/client' to the top of your schema. Then write a GROQ query to look for the number and validate accordingly.
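A rough sketch of such a rule (Sanity v2 schema syntax; the document type myDocument and field myNumber are made up):

import client from 'part:@sanity/base/client'

export default {
  name: 'myDocument',
  type: 'document',
  fields: [
    {
      name: 'myNumber',
      type: 'number',
      validation: Rule =>
        Rule.custom(async (value, context) => {
          if (value === undefined) return true // let required() handle empty values
          // Exclude both the draft and published versions of the current document.
          const id = context.document._id.replace(/^drafts\./, '')
          const params = {draft: `drafts.${id}`, published: id, value}
          const query =
            'count(*[_type == "myDocument" && !(_id in [$draft, $published]) && myNumber == $value])'
          const count = await client.fetch(query, params)
          return count === 0 ? true : 'This number is already in use'
        }),
    },
  ],
}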
Hope that helps!

ABAP program to notify users X days before their user account is disabled

I'm currently learning ABAP and trying to make an enhancement, but I have broken down in confusion over how to build on top of existing code. I have a program, run periodically via a background job, that disables user accounts after X days of inactivity (in this case 90 days, based on USR02~TRDAT).
I want to add an enhancement that notifies users via their email address (from usr02~bname, matching usr21~bname to get usr21~persnumber and usr21~addrnumber, which point to adr6~smtp_addr, giving the usr02~bname -> adr6~smtp_addr relationship) when their last logon date is 30, 15, 7, 5, 3, and 1 day away from the 90-day inactivity threshold, with a link to the SAP system to help them reactivate the account with ease.
I'm beginning to think that an enhancement might not be a good idea, and that I should instead create a new program and schedule it as a daily background job. Any guidance or information would be greatly appreciated...
Extract
CLASS cl_inactive_users_reader DEFINITION.
  PUBLIC SECTION.
    TYPES:
      BEGIN OF ts_inactive_user,
        user_name          TYPE syst_uname,
        days_of_inactivity TYPE int1,
      END OF ts_inactive_user.
    TYPES tt_inactive_users TYPE STANDARD TABLE OF ts_inactive_user WITH EMPTY KEY.

    CLASS-METHODS read_inactive_users
      IMPORTING
        min_days_of_inactivity TYPE int1
      RETURNING
        VALUE(result)          TYPE tt_inactive_users.
ENDCLASS.
Then refactor
REPORT block_inactive_users.

DATA(inactive_users) = cl_inactive_users_reader=>read_inactive_users( 90 ).

LOOP AT inactive_users INTO DATA(inactive_user).
  " block user
ENDLOOP.
And add
REPORT warn_inactive_users.

DATA(inactive_users) = cl_inactive_users_reader=>read_inactive_users( 60 ).

LOOP AT inactive_users INTO DATA(inactive_user).
  CASE inactive_user-days_of_inactivity.
    " choose urgency
  ENDCASE.
  " send e-mail
ENDLOOP.
and run both reports daily.
Don't create a big ball of mud by squeezing new features into existing code.
From SAP wiki:
The enhancement concept allows you to add your own functionality to SAP's standard business applications without having to modify the original applications. To modify the standard SAP behavior as per customer requirements, we can use enhancement framework.
As per your description, it doesn't sound like a use case for an enhancement. It isn't an intervention in an existing process. The original process and your new requirement are two different processes that share some logic (the selection of users by days of inactivity); the two shouldn't rely on each other.
Structurally I think it is best to have a separate program for computing which e-mails need to be sent and when, and a separate program for actually sending them.
I would copy your original program to a new one and modify it a little so that instead of disabling a user, it records a row in some table for each user: 1) the e-mail address, 2) the date on which to send, 3) how many days are left (30, 15, 7, etc.), and 4) a status indicating whether the e-mail was sent. Initially you could even have one such job per period (30, 15, 7, etc.) and pass the period as a parameter (used inside instead of 90).
You run this program daily as a job, and it populates the table with e-mail "tasks" of what needs to be sent today. It only adds new lines, so lines from yesterday stay in there.
The second program should just read that table, send the actual e-mails, and update the statuses. You run that program daily as well (see the sketch after the list below).
This way you have:
overview: just check the table to see what's going on
control: if the e-mailer dies or hangs, you can restart it and it will continue where it left off; the statuses prevent sending duplicate mails
safety: you won't send outdated e-mails, as long as the mailer ignores all tasks older than, say, 2 days
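A language-agnostic sketch of that two-job design (TypeScript here purely for illustration; all names are made up):

interface MailTask {
  email: string;
  sendOn: string;     // ISO date on which to send
  daysLeft: number;   // 30, 15, 7, 5, 3 or 1
  status: 'NEW' | 'SENT';
}

// Job 1 (daily): compute today's tasks and append them; old rows are kept.
function planTasks(users: {email: string; daysLeft: number}[], today: string): MailTask[] {
  const thresholds = new Set([30, 15, 7, 5, 3, 1]);
  return users
    .filter(u => thresholds.has(u.daysLeft))
    .map(u => ({email: u.email, sendOn: today, daysLeft: u.daysLeft, status: 'NEW'}));
}

// Job 2 (daily): send pending, non-stale tasks and mark them as sent.
function runMailer(tasks: MailTask[], today: string, send: (t: MailTask) => void): void {
  const MAX_AGE_DAYS = 2; // ignore outdated tasks, as suggested above
  for (const task of tasks) {
    if (task.status !== 'NEW') continue; // statuses prevent duplicate mails
    const ageInDays = (Date.parse(today) - Date.parse(task.sendOn)) / 86_400_000;
    if (ageInDays > MAX_AGE_DAYS) continue; // skip stale tasks
    send(task);
    task.status = 'SENT';
  }
}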
I want to clarify your confusion about the use of enhancements:
You would want to use an enhancement when 'something' happens, or is going to happen, in the system and you want to change that standard behavior.
That something, let's call it an event or process, could be, for example, an order being placed, a certain user logging on to the system, or a material being created or changed.
The change could be notifying another system of the order, or running additional checks on the logged-on user, for example checking his or her GUI version and issuing a warning if it is not up to date.
Ask yourself what process in the system the execution of your program depends on. Does anything need to happen before the program is executed? No, only time needs to elapse.
Even if you had found an enhancement you wanted to use: if the process carrying that enhancement were not run for 90 days, your mails would not be sent, because the enhancement would never be called.
Edit: That being said, if by 'enhancement' you mean 'building on your existing program' as opposed to 'creating a new one', that would be the wrong terminology in the SAP universe.
I would extend the functionality of your existing program, since you already compute how many days are left and you would have only one job to maintain.

What are the security risks if I disclose database field names to the web user interface?

I want to make the program simpler, so I use the table's field names as the name attributes in the input HTML.
That way I can save some time mapping input names to database field names.
But are there security risks if users know my field names?
(Suppose SQL injection is already handled in the server program.)
Update 1:
I am not going to skip field-name validation.
I just don't want to do something like this:
$uid=$_POST['user_id'];
$ufname=$_POST['user_first_name'];
$ulname=$_POST['user_last_name'];
If I do this:
$user_id=$_POST['user_id'];
$user_first_name=$_POST['user_first_name'];
$user_last_name=$_POST['user_last_name'];
I can save coding time, don't need to think of two names for one piece of data, and reduce bugs.
I can also do something like this to save even more time, as I only type each name once:
$validField=array("user_id","user_first_name","user_last_name");
foreach ($validField as $field) {
    $orm[$field]=$_POST[$field]; // only whitelisted field names are ever used
}
This also validates the field names,
so I think there is no way for an attacker to get at my unpublished fields.
I can save some time for mapping input name to database field name.
If you save time by not mapping input names to database field names, you will need to spend roughly equivalent time validating that the submitted field names are, in fact, among the fields that users may access in your database. There is no way around this validation: otherwise your DB is exposed to attacks that try to reach your unpublished fields, such as IDs and hashes. That would be pretty bad, so you need to build that validation layer.
On the other hand, if you map from meaningless IDs to meaningful names, then you do not need that validation, because it is your program that produces the meaningful names. Essentially, the validation step is built into the process.
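A tiny sketch of the two approaches (TypeScript, with made-up names):

// Approach 1: real field names are exposed to the client, so every submitted
// name must be checked against a whitelist before it touches the database.
const ALLOWED_FIELDS = new Set(['user_id', 'user_first_name', 'user_last_name']);

function pickAllowed(post: Record<string, string>): Record<string, string> {
  const row: Record<string, string> = {};
  for (const [name, value] of Object.entries(post)) {
    if (ALLOWED_FIELDS.has(name)) row[name] = value; // unlisted names are dropped
  }
  return row;
}

// Approach 2: meaningless form IDs are mapped server-side; no validation is
// needed, because only the mapping can produce real column names.
const FIELD_MAP: Record<string, string> = {
  f1: 'user_id',
  f2: 'user_first_name',
  f3: 'user_last_name',
};

function mapFields(post: Record<string, string>): Record<string, string> {
  const row: Record<string, string> = {};
  for (const [formId, column] of Object.entries(FIELD_MAP)) {
    if (formId in post) row[column] = post[formId];
  }
  return row;
}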