How do I get started in producing mathematical models from data? - data-science

Let's say I have a system where user-defined questionnaires are displayed and a number of respondents submit their answers. Below is a sample JSON object of an answer:
{
"fullname": "Some guy",
"gender": "male",
"q1": true,
"q2": false,
"q3": true,
"q4": false,
"q5": true,
"qualifications": ["Diploma","Degree"]
}
Now most developers could query results like this and answer questions like:
What percentage of respondents answered TRUE to question 1?
How many diploma holders are female?
I want to produce these answers without a developer being involved. Sure, I could just supply the raw data and let the users start making their own pivot tables in Excel, but even an Excel pivot table describes a relationship in the data. For example: for respondents where "qualifications" includes "Diploma", give a breakdown by gender.
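To make the kind of relationship meant here concrete: the pivot described above is a filter followed by a group-by. A minimal sketch in Python, assuming the answers are available as dicts shaped like the sample object (the sample rows and the helper names `breakdown` and `percentage` are made up for illustration):

```python
from collections import Counter

# Hypothetical sample answers, shaped like the JSON object above.
answers = [
    {"fullname": "A", "gender": "male", "q1": True, "qualifications": ["Diploma", "Degree"]},
    {"fullname": "B", "gender": "female", "q1": False, "qualifications": ["Diploma"]},
    {"fullname": "C", "gender": "female", "q1": True, "qualifications": ["Degree"]},
    {"fullname": "D", "gender": "female", "q1": True, "qualifications": ["Diploma"]},
]


def breakdown(rows, where, group_by):
    """Count the rows that satisfy `where`, grouped by the `group_by` field."""
    return Counter(row[group_by] for row in rows if where(row))


def percentage(rows, predicate):
    """Percentage of rows for which `predicate` is true."""
    return 100.0 * sum(1 for row in rows if predicate(row)) / len(rows)


# Respondents whose qualifications include "Diploma", broken down by gender.
diploma_by_gender = breakdown(answers, lambda r: "Diploma" in r["qualifications"], "gender")

# Percentage of respondents who answered TRUE to question 1.
q1_true_pct = percentage(answers, lambda r: r["q1"])
```

The point of the sketch is that each question reduces to a small description (a filter, a grouping field, an aggregate), which is the part a non-developer would need to be able to express.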
I know I'm dipping into data science and mathematical models, but I'm not sure where the best place to start is.
How do I describe these relationships in software? What standards and tools exist? If I have the schema of data available, can I have the machine figure out the relationships (like making suggestions)?

Related

Methodology to check if a SQL query is correct for a context

I write dozens of queries per day, and I admit that I sometimes miss the context of the user's request.
I would like to know if you have any tips for double-checking that a SQL query really does what the context asks.
For example, I have this context:
Retrieve the first names of 10 male students from the school "great_school" who are 21 years old.
For this context I would write a pseudo-query like this:
SELECT st.firstname
FROM studient st
JOIN school_studient sc_st ON sc_st.studient_id = st.id
JOIN school sc ON sc.id = sc_st.school_id
AND sc.name = "great_school"
WHERE st.age = 21
AND st.sexe = "male"
LIMIT 10
How can I be sure that this query really does what the context asked?
This isn't about using EXPLAIN to check that the query is valid; it is about checking that the query contains all the necessary conditions.
Is there any tool that can read a pseudo-query and describe what it does in plain language?
I was thinking of a paper checklist with two columns, "fields to select" and "criteria", ticking off each item as it appears in the query.
But aren't there more advanced tools than a piece of paper?
Your question is asking for a recommendation for a software product, which is off topic on SO, but you might have more luck here.
However, I would focus more on a process than on a tool. I find it really helpful to work with sample data sets, and ask the end user to mark up what should and should not be included. The challenges in interpretation are usually much more around "what's in/out", "how do we aggregate (what's the group-by)" and "how do we sort" (your example has no ORDER BY, so it returns 10 arbitrary students).
If you can build a sample data set, and ask your users to say "I want these records to be included, those excluded", and "I want you to aggregate this column for every change in that column", you get a much higher quality specification. Once you find problems in the specification, you can adjust the sample data set to avoid that problem in future...
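A rough sketch of that process in Python with an in-memory SQLite database. The sample rows, the first names and the `expected` markup column are invented for illustration; the table and column names follow the question's pseudo-query (with the join column spelled `school_id`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE school (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE studient (id INTEGER PRIMARY KEY, firstname TEXT, age INTEGER, sexe TEXT);
CREATE TABLE school_studient (studient_id INTEGER, school_id INTEGER);
""")

# (id, firstname, age, sexe, school, expected): the last column is the
# end user's markup saying whether the record should be in the result.
sample = [
    (1, "Adam", 21, "male", "great_school", True),
    (2, "Beth", 21, "female", "great_school", False),  # wrong sex
    (3, "Carl", 22, "male", "great_school", False),    # wrong age
    (4, "Dave", 21, "male", "other_school", False),    # wrong school
    (5, "Evan", 21, "male", "great_school", True),
]

schools = {"great_school": 1, "other_school": 2}
conn.executemany("INSERT INTO school (id, name) VALUES (?, ?)",
                 [(sid, name) for name, sid in schools.items()])
for sid, firstname, age, sexe, school, _expected in sample:
    conn.execute("INSERT INTO studient VALUES (?, ?, ?, ?)", (sid, firstname, age, sexe))
    conn.execute("INSERT INTO school_studient VALUES (?, ?)", (sid, schools[school]))

QUERY = """
SELECT st.firstname
FROM studient st
JOIN school_studient sc_st ON sc_st.studient_id = st.id
JOIN school sc ON sc.id = sc_st.school_id AND sc.name = 'great_school'
WHERE st.age = 21 AND st.sexe = 'male'
LIMIT 10
"""

actual = {row[0] for row in conn.execute(QUERY)}
expected = {firstname for _, firstname, _, _, _, keep in sample if keep}

missing = expected - actual      # the user wanted these, the query dropped them
unexpected = actual - expected   # the query returned these, the user did not want them
```

If `missing` or `unexpected` is non-empty, either the query or the specification is wrong, and the offending row shows which condition to discuss with the user.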

SQL Best way to store "Check All That Apply" in survey

I have a survey that asks about availability via checkboxes, like so:
I am available (please check all that apply):
[] Early Mornings
[] Mid Mornings
[] Early Afternoons
[] Mid Afternoons
[] Evenings
[] Late Evenings
[] Overnight
I need to translate that into a SQL database. My question is: what is the best way to store this data in one column? I was thinking of a 7-digit bit string like 0010001 (indicating that the candidate is only available during Early Afternoons and Overnight). Is there a better way? Thanks for any opinions!
Use a separate table for the options and a "join table" linking options to the candidate. The other solutions/suggestions will impede data integrity and performance in a relational database. If you've got another DB it might be different, but don't do anything other than the relational table if you're using SQL.
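As a minimal sketch of that layout, using Python's built-in sqlite3 module. The table names, column names and candidates are assumptions made up for the example, not anything from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE candidate (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE availability_option (id INTEGER PRIMARY KEY, label TEXT NOT NULL UNIQUE);
-- Join table: one row per ticked checkbox.
CREATE TABLE candidate_availability (
    candidate_id INTEGER NOT NULL REFERENCES candidate(id),
    option_id INTEGER NOT NULL REFERENCES availability_option(id),
    PRIMARY KEY (candidate_id, option_id)
);
""")

labels = ["Early Mornings", "Mid Mornings", "Early Afternoons", "Mid Afternoons",
          "Evenings", "Late Evenings", "Overnight"]
conn.executemany("INSERT INTO availability_option (label) VALUES (?)",
                 [(label,) for label in labels])
conn.executemany("INSERT INTO candidate (id, name) VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob")])


def set_availability(candidate_id, chosen_labels):
    """Store one join-table row for each option the candidate ticked."""
    conn.executemany(
        "INSERT INTO candidate_availability (candidate_id, option_id) "
        "SELECT ?, id FROM availability_option WHERE label = ?",
        [(candidate_id, label) for label in chosen_labels],
    )


def available_for(label):
    """Names of the candidates who ticked the given option."""
    rows = conn.execute(
        "SELECT c.name FROM candidate c "
        "JOIN candidate_availability ca ON ca.candidate_id = c.id "
        "JOIN availability_option o ON o.id = ca.option_id "
        "WHERE o.label = ? ORDER BY c.name",
        (label,),
    )
    return [row[0] for row in rows]


set_availability(1, ["Early Afternoons", "Overnight"])
set_availability(2, ["Evenings"])
```

With this shape, "who is available overnight?" is a plain join, and adding an eighth option is an INSERT rather than a schema change.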
Pipe delimited flags.
Make the column a fairly wide text column, then store:
'Early Mornings|Evenings|Overnight'
if those 3 choices were selected.
(Note: I do agree with the other answer that it is likely better to use separate columns, but this is how I'd do it if there were a good reason to want just 1 column)
Is there any particular reason the results need to be stored inside one column? If so, your solution is probably the best way.
EDIT: If you are going to be querying this data, your solution is the best way; otherwise follow the other answer and use "|" to separate the strings in one long varchar field. Note, though, that anyone looking at the bit-string data is going to have no clue what it means unless they've taken the time to memorize each option in order.
If it doesn't need to be all in one column I'd recommend just creating a column for each question with a bit value similar to what you already want to do.

A way to store array type data into single database table?

I have an input JSON document that I need to load into a database. We are exploring whether or not to normalize our database tables.
The following is the structure of the input data (JSON):
"attachments": [
{
"filename": "abc.pdf",
"url": "https://www.web.com/abc.pdf",
"type": "done"
},
{
"filename": "pqr.pdf",
"url": "https://www.web.com/pqr.pdf",
"type": "done"
}
],
In the above example, attachments could have multiple values (more than 2, up to 8).
We were thinking of creating a separate table called DB_ATTACHMENT and keeping all the attachments for a worker there. But the issue is that we have some 30+ different attachment-type arrays (phone, address, previous_emp, visas, etc.)
Is there a way to store everything in ONE table (employee)? One I can think of is using a single column (ATTACHMENT) and add all the data in 'delimited-format' and have the logic at target system to parse and extract everything..
Any other better solution?
Thanks..
Is there a way to store everything in ONE table (employee)? One I can
think of is using a single column (ATTACHMENT) and add all the data in
'delimited-format' and have the logic at target system to parse and
extract everything.. Any other better solution?
You can store the data in a single VARCHAR column as JSON, then recover the information in the client decoding this JSON data.
Also, there are already some SQL implementations offering native JSON datatypes. For example:
MariaDB: https://mariadb.com/kb/en/mariadb/column_json/
MySQL: https://dev.mysql.com/doc/refman/5.7/en/json.html
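A minimal sketch of the "JSON in a single text column, decoded in the client" variant, using Python's built-in json and sqlite3 modules. The employee table layout, the employee name and the `load_attachments` helper are assumptions for illustration:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, attachments TEXT)")

attachments = [
    {"filename": "abc.pdf", "url": "https://www.web.com/abc.pdf", "type": "done"},
    {"filename": "pqr.pdf", "url": "https://www.web.com/pqr.pdf", "type": "done"},
]

# Serialize the whole array into one text value when writing.
conn.execute(
    "INSERT INTO employee (id, name, attachments) VALUES (?, ?, ?)",
    (1, "Some worker", json.dumps(attachments)),
)


def load_attachments(employee_id):
    """Read the text column back and decode it on the client side."""
    row = conn.execute(
        "SELECT attachments FROM employee WHERE id = ?", (employee_id,)
    ).fetchone()
    return json.loads(row[0]) if row and row[0] else []
```

Note that in this form the database sees only an opaque string, so any filtering on attachment fields has to happen after decoding unless the engine's native JSON functions are used.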
Database systems store your data and offer you SQL to simplify your search requests, provided your data is structured.
It is up to you to decide whether to store the data in structured form and benefit from SQL, or to leave whoever searches it with the burden of parsing it.
It very much depends on how you intend to use the data. I'm not totally sure I understand your question, so I am going to rephrase the business domain I think you're working with - please comment if this is not correct.
The system manages 0..n employees.
One employee may have 0..8 attachments.
An attachment belongs to exactly 1 employee.
An attachment may be one of 30 different types.
Each attachment type may have its own schema.
If attachments aren't important in the business domain - they're basically notes, and you don't need to query or reason about them - you could store them as a column on the "employee" table, and parse them when you show them to the end user.
This solution may seem easier - but don't underestimate the conversion logic - you have to support Create, Read, Update and Delete for each attachment.
If attachments are meaningful in the business domain, this very quickly breaks down. If you need to answer questions like "find all employees who have attached abc.pdf", "find employees who do not have a telephone_number attachment", unpacking each employee_attachment makes your query very difficult.
In this case, you almost certainly need to store attachments in one or more separate tables. If the schema for each attachment is, indeed, different, you need to work out how to deal with inheritance in relational database models.
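To illustrate why a separate table helps with the two example questions above, here is a small sketch using Python's sqlite3 module. It assumes the simplest possible layout (one `employee_attachment` table with a type column) and ignores the per-type schema problem; all names and rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE employee_attachment (
    id INTEGER PRIMARY KEY,
    employee_id INTEGER NOT NULL REFERENCES employee(id),
    attachment_type TEXT NOT NULL,
    filename TEXT
);
INSERT INTO employee (id, name) VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO employee_attachment (employee_id, attachment_type, filename) VALUES
    (1, 'document', 'abc.pdf'),
    (1, 'telephone_number', NULL),
    (2, 'document', 'pqr.pdf');
""")

# "Find all employees who have attached abc.pdf".
with_abc = [row[0] for row in conn.execute(
    "SELECT DISTINCT e.name FROM employee e "
    "JOIN employee_attachment a ON a.employee_id = e.id "
    "WHERE a.filename = 'abc.pdf' ORDER BY e.name"
)]

# "Find employees who do not have a telephone_number attachment".
without_phone = [row[0] for row in conn.execute(
    "SELECT e.name FROM employee e "
    "WHERE NOT EXISTS ("
    "  SELECT 1 FROM employee_attachment a "
    "  WHERE a.employee_id = e.id AND a.attachment_type = 'telephone_number'"
    ") ORDER BY e.name"
)]
```

Both questions become ordinary SQL, whereas with a delimited or serialized column each row would have to be unpacked before it could be tested.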
Finally - some database engines support formats like JSON and XML natively. Yours may offer this as a compromise solution.

Category Implementation in a database

I'm building a system that involves users and teachers. In this particular system I would like to categorize the teachers, but the tricky part is that the categories are dynamic, so they can change at any time.
Since I'm developing the backend, I need a few functions:
The first one is showAllCategories(), which shows all the main categories.
The second is showSubcategories(), which shows the subcategories of a category.
The third is showContent(), which in this case shows the teacher's information.
Before asking the mighty Stack Overflowers how this could be implemented efficiently, I thought I could use a doubly linked list approach, where the categories table has the columns CategoryName, Before, After and Content, and if a category has no After, its Content points to the teachers table. This is my classic SQL approach; however, I'm using MongoDB, and since I'm a beginner I wonder whether I could take advantage of NoSQL in this particular situation?
MongoDB natively supports an array type, which actually behaves more like a list. With $push and $pull you can add and remove elements from such an array field, and $addToSet makes sure there are no duplicates.
Next is the question of how the categories are stored. You can make a categories collection holding the main categories, each with a field that contains the array of its subcategories:
{"_id": "science", "sub": ["chemist", "physicist", "biology"]}
{"_id": "languages", "sub": ["english", "german", "spanish"]}
Your teacher collection on the other hand would then have an array of embedded documents, the categories of the teacher. They are duplicates of those found in the categories collection, minus the fields that you won't need in the teacher view. This way you avoid joins, since they don't exist in MongoDB.
{
"_id": ObjectId(...),
"name": {"first": "Foo", "last": "Bar"},
"categories": ["chemist", "biology"]
}
The rest I am sure you can think up.
Addition: In short, use the flexible types that MongoDB offers, and don't worry about data redundancy. Embed documents often and don't forget the indexes.
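As a sketch of how the three functions from the question could sit on top of documents shaped like the ones above: the Python below uses plain lists of dicts as stand-ins for the two collections, so it runs without a MongoDB server. The second teacher and the snake_case function names are invented for the example.

```python
# In-memory stand-ins for the "categories" and "teachers" collections.
categories = [
    {"_id": "science", "sub": ["chemist", "physicist", "biology"]},
    {"_id": "languages", "sub": ["english", "german", "spanish"]},
]
teachers = [
    {"_id": 1, "name": {"first": "Foo", "last": "Bar"}, "categories": ["chemist", "biology"]},
    {"_id": 2, "name": {"first": "Baz", "last": "Qux"}, "categories": ["german"]},
]


def show_all_categories():
    """Return the ids of all main categories."""
    return [category["_id"] for category in categories]


def show_subcategories(category_id):
    """Return the subcategories of one main category (empty list if unknown)."""
    for category in categories:
        if category["_id"] == category_id:
            return category["sub"]
    return []


def show_content(subcategory):
    """Return the names of the teachers filed under the given subcategory."""
    return [teacher["name"] for teacher in teachers if subcategory in teacher["categories"]]
```

Against a real database, each function would become a single query on one collection; matching a value against an array field, as `show_content` does here, is something MongoDB's query filters support directly.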

Retrieving freebase quad dump type names from id

I'm currently working on a project using the Freebase dumps, whose assertions I insert into a LevelDB ordered by mid. My goal is, for a given name such as Bob Dylan, to retrieve every type linked to that name.
For example, "Bob Dylan" would correspond to "Musician", "Film Producer" and so on, which in turn correspond to the types "/music/artist", "/film/producer", etc.
It's rather easy to find the Bob Dylan mid in the quad dump:
/m/bobdylanmid /common/topic/notable_types /music/artist
/m/bobdylanmid /common/topic/notable_types /film/producer
I'd now like to find the names of those types in various languages, but I can't find a logical way to retrieve them from the dump.
Any clues, please?
I'm not 100% certain, but I don't think the schema is actually in the quad dump. I know it never used to be.
You'll need to look up the names using a query like this. Unfortunately, the human readable names exist only in English, so you'd need to jump through some more hoops to get other languages. For that you could try something along the lines of this slightly more complicated query
[{
"id": "/music/artist",
"/freebase/type_profile/equivalent_topic": {
"name": {
"lang": null,
"value": null
}
},
"name": null
}]
It depends on the "equivalent topic" property being filled in, which may not be the case for all types. If you only want a few languages, you could modify the query to return those explicitly ("Musician" has 45 different language variants).
If you are mainly interested in cases like your example (a person is/was a ...), using properties rather than types may do the job. In your case that would be the following (the latter via a CVT):
/people/person/profession
/people/person/employment_history /business/employment_tenure/title
This might be more what you want to have anyways, unless you also want to display that e.g. Alan Turing is a "Literature Subject".
For the corresponding instances (with types /business/job_title, /people/profession) you can get the names in different languages (if existing).