What is the best method to extract a recurring blob data and put in another table ? - SQL - sql

I'm developing a new webpage in (.NET framework, if that helps) for the below scenario. Every single day, we get a cab drivers report.
Date | Blob
-------------------------------------------------------------
15/07 | {"DriverName1":"100kms", "DriverName2":"10kms", "Hash":"Value"...}
16/07 | {"DriverName1":"50kms", "DriverName3":"100kms", "Hash":"Value"}
Notice that the 'Blob' is the actual data received in json format - contains information about the distance covered by a driver at that particular day.
I have written a service which reads the above table & further breaks down this and puts it into a new table like below:
Date | DriverName | KmsDriven
15/07 DriverName1 100
15/07 DriverName2 10
16/07 DriverName3 100
16/07 DriverName1 50
By populating this, I can easily do the following queries:
How many drivers drove on that particular day.
How is 'DriverName1' did for that particular week, etc.,
My questions here are:
Are there anything in .NET / SQL world to specifically address this or let me know if I am reinventing the wheel here.
Is this the right way to use the Blob data ?
Are there any design patterns to adhere here to ?

Are there anything in .NET / SQL world to specifically address this or
let me know if I am reinventing the wheel here.
Well, there are JSON parsers available, for example Newtonsoft's Json.NET. Or you can use SQL Server's own functions. Once you have extracted individual values from JSON, you can write them into corresponding columns (in your new table).
Is this the right way to use the Blob data?
No. It violates the principle of atomicity, and therefore the first normal form.
Are there any design patterns to adhere here to?
I'm not sure about "patterns", but I don't see why would you need a BLOB in this case.
Assuming the data is uniform (i.e. it always has the same fields), you can just declare the columns you need and write directly to them (as you already proposed).
Otherwise, you may consider using SQL Server's XML data type, which will enable you to extract some of the sections within an XML document, or insert a new section without replacing your whole document.

Related

Create master table for status column

I have a table that represent a request sent through frontend
coupon_fetching_request
---------------------------------------------------------------
request_id | request_time | requested_by | request_status
Above I tried to create a table to address the issue.
Here request_status is an integer. It could have some values as follows.
1 : request successful
2 : request failed due to incorrect input data
3 : request failed in otp verification
4 : request failed due to internal server error
That table is very simple and status is used to let frontend know what happened to sent request. I had discussion with my team and other developers were proposing that we should have a status representation table. At database side we are not gonna need this status. But team was saying that in future we may need to show simple output from database to show what is the status of all request. According to YAGNI principle I don't think it is a good idea.
Currently I have coded to convert returned request_status value to descriptive value at frontend. I tried to convince team that I can creat an enumuration at business layer to represent meaning of the status OR I could add documentation at frontend and in java but failed to convince them.
The table proposed is as follows
coupon_fetching_request_status
---------------------------------------------------
status_id | status_code | status_description
My question is, Is it necessary to create table for such a simple status in similar cases.
I tried to create simple example to address the problem. In real time the table is to represent a Discount Coupon Code Request and status representing if the code is successfully fetched
It really depends on your use case.
To start with: in you main table, you are already storing request_status as an integer, which is a good thing (if you were storing the whole description, like 'request successful', that would not be optimized).
The main question is: will you eventually need to display that data in a human-readable format?
If no, then it is probably useless to create a representation table.
If yes, then having a representation table would be a good thing, instead of adding some code in the presentation layer to do the transcodification; let the data live in the database, and the frontend take care of presentation only.
Since this table can be easily created when needed, a pragmatic approach would be to hold on until you have a real need for the representation table.
You should create the reference table in the database. You currently have business logic on the application side, interpreting data stored in the database. This seems dangerous.
What does "dangerous" mean? It means that ad-hoc queries on the database might need to re-implement the logic. That is prone to error.
It means that if you add a reporting front end, then the reports have to re-implement the logic. That is prone to error and a maintenance nightmare.
It means that if you have another developer come along, or another module implemented, then the logic might need to be re-implemented. Red flag.
The simplest solution is to have a reference table to define the official meanings of the codes. The application should use this table (via join) to return the strings. The application should not be defining the meaning of codes stored in the database. YAGNI doesn't apply, because the application is so in need of this information that it implements the logic itself.

How to properly store a JSON object into a Table?

I am working on a scenario where I have invoices available in my Data Lake Store.
Invoice example (extremely simplified):
{
"business_guid":"b4f16300-8e78-4358-b3d2-b29436eaeba8",
"ingress_timestamp": 1523053808,
"client":{
"name":"Jake",
"age":55
},
"transactions":[
{
"name":"peanut",
"amount":100
},
{
"name":"avocado",
"amount":2
}
]
}
All invoices are stored in ADLS, and can be queried. But, It is my desire to provide access to the same data inside an ALD DB.
I am not an expert on unstructed data: I have RDBMS background. Taking that into consideration, I can only think of 2 possible scenarios:
2/3 tables - invoice, client (could be removed) and transaction. In this scenario, I would have to create an invoice ID to be able to build relationships between those tables
1 table - client info could be normalized into invoice data. But, transactions could (maybe) be defined as an SQL.ARRAY<SQL.MAP<string, object>>
I have mainly 3 questions:
What is the correct way of doing so? Solution 1 seems much better structured.
If I go with solution 1, how do I properly create an ID (probably GUID)? Is it acceptable to require ID creation when working with ADL?
Is there another solution I am missing here?
Thanks in advance!
This type of question is a bit like do you prefer your sauce on the pasta or next to the pasta :). The answer is: it depends.
To answer your 3 questions more seriously:
#1 has the benefit of being normalized that works well if you want to operate on the data separately (e.g., just clients, just invoices, just transactions) and want to the benefits of normalization, get the right indexing, and are not limited by the rowsize limits (e.g., your array of map needs to fit into a row). So I would recommend that approach unless your transaction data is always small and you always access the data together and mainly search on the column data.
U-SQL per se has no understanding of the hierarchy of the JSON document. Thus, you would have to write an extractor that turns your JSON into rows in a way that it either gives you the correlation of the parent to the child (normally done by stepwise downwards navigation with cross apply) and use the key value of the parent data item as the foreign key, or have the extractor generate the key (as int or guid).
There are some sample JSON extractors on the U-SQL GitHub site (start at http://usql.io) that can get you started with the JSON to rowset conversion. Note that you will probably want to optimize the extraction at some point to be JSON Reader based so you process larger docs without loading it into memory.

Adding custom data to GapMinder

Does anyone have any experience adding their own data to GapMinder, the really cool software that Hans Rosling uses in his TED talks? I have an array od objects in JSON that would be easy to show in moving bubbles. This would be really cool.
I can see that my Ubuntu box has what looks like data in /opt/Gapminder Desktop/share/assets/graphs/world, but I would need to figure out:
How to add a measure to a graph
How to add a data series
How to set the time range of the data
Identify the measures to follow at each time step
and so on.
Just for the record: if you want to use Gapminder with your own dataset, you have to convert your data in a format suitable to Gapminder. More specifically, looking in the assets/graphs/world, you will have to:
Edit the file overview.xml, which contains the tree structure of all the indicators (just copy/paste an entry and specify your own data);
Convert your data copying the structure of the xml files in that directory (this is the tricky part): you can specify some metadata in the preamble, and then specify your own data series, with something like:
<t1 m="i20,50.0,99.0,1992" d="90.0, ... ,50.0, ..."/> where i20 is the country id, which is followed by the minima and maxima of the series, and the year it refers to.
In my humble opinion, Gapminder is a great app but it definitely needs more work on integration with other datasets. Way better to use Google Motion Chart as you did, or MooGraph (site and doc), which is unfortunately not as great as Gapminder.
#Stefano
the information you provided is very valuable. Is somewhere available a detailed specification of the XML files containing the data?
Anyway, just to enrich your response, I also found that:
overview.xml file
The link between Nations and their IDs is in this file
The structure of the menus for the selection of the indicators is also in the same file (at the bottom) under the section <indicatorCategorization>
The structure of the datafile XML
For each line the year represents the first year of the serie, and then the values follow one per year, comma separated.
Grazie,
Max
I ended up using the google motion chart API. I ended up with this.

Database vs. Front-End for Output Formatting

I've read that (all things equal) PHP is typically faster than MySQL at arithmetic and string manipulation operations. This being the case, where does one draw the line between what one asks the database to do versus what is done by the web server(s)? We use stored procedures exclusively as our data-access layer. My unwritten rule has always been to leave output formatting (including string manipulation and arithmetic) to the web server. So our queries return:
unformatted dates
null values
no calculated values (i.e. return values for columns "foo" and "bar" and let the web server calculate foo*bar if it needs to display value foobar)
no substring-reduced fields (except when shortened field is so significantly shorter that we want to do it at database level to reduce result set size)
two separate columns to let front-end case the output as required
What I'm interested in is feedback about whether this is generally an appropriate approach or whether others know of compelling performance/maintainability considerations that justify pushing these activities to the database.
Note: I'm intentionally tagging this question to be dbms-agnostic, as I believe this is an architectural consideration that comes into play regardless of one's specific dbms.
I would draw the line on how certain layers could rotate out in place for other implementations. It's very likely that you will never use a different RDBMS or have a mobile version of your site, but you never know.
The more orthogonal a data point is, the closer it should be to being released from the database in that form. If on every theoretical version of your site your values A and B are rendered A * B, that should be returned by your database as A * B and never calculated client side.
Let's say you have something that's format heavy like a date. Sometimes you have short dates, long dates, English dates... One pure form should be returned from the database and then that should be formatted in PHP.
So the orthogonality point works in reverse as well. The more dynamic a data point is in its representation/display, the more it should be handled client side. If a string A is always taken as a substring of the first six characters, then have that be returned from the database as pre-substring'ed. If the length of the substring depends on some factor, like six for mobile and ten for your web app, then return the larger string from the database and format it at run time using PHP.
Usually, data formatting is better done on client side, especially culture-specific formatting.
Dynamic pivoting (i. e. variable columns) is also an example of what is better done on client side
When it comes to string manipulation and dynamic arrays, PHP is far more powerful than any RDBMS I'm aware of.
However, data formatting can use additional data which is also kept in the database. Like, the coloring info for each row can be stored in additional table.
You should then correspond the color to each row on database side, but wrap it into the tags on PHP side.
The rule of thumb is: retrieve everything you need for formatting in as few database round-trips as possible, then do the formatting itself on the client side.
I believe in returning the data pretty much as-is from the database and letting it be formatted on the front-end instead. I don't stick to it religously, but in general I think it's better as it provides greater flexibility - e.g. 1 sproc can service n different requirements for data, each of which can format the data as each individually needs. Otherwise, you end up either with multiple queries returning the same data with slightly different formatting from the DB (from a SQL Server point of view, thus reducing execution plan caching benefits - therefore negative impact on performance).
Leave output formatting to the web server

SQL server string manipulation in a view... Or in XSLT

I have been passed a piece of work that I can either do in my application or perhaps in SQL:
I have to get a date out of a string that may look like this:
1234567-DSP-01/01-VER-01/01
or like this:
1234567-VER-01/01-DSP-01/01
but may look like this:
00 12345 DISCH 01/01-VER-01/01 XXX X XXXXX
Yay. if it is a "DSP" then I want that date, if a "DISCH" then that date.
I am pulling the data out in a SQL Server view and would be happy to have the view transform the data for me. My application could do it but would add processor time. I could also see if the data could be manipulated before it is entered into the DB, I suppose.
Thank you for your time.
An option would be to check for the presence of DSP or DISCH then substring out the date as necessary.
For example (I don't have sqlserver today so I can verify syntax, sorry)
select
date = case date_attribute
when charindex('DSP',date_attribute) > 0 then substring(date_attribute,beg,end)
when charindex('DISCH',date_attribute) > 0 then substring(date_attribute,beg,end)
else 'unknown'
end
from myTable
don't store multiple items in the same column!
store the date in its own column when inserting the row!
add a new nullable column for the date
write an update that pulls the date out and sets the new column
alter the column to be not nullable
fix your save routine to pull the date out and insert it in for you
If you do it in the view your adding processing time on SQL which in general a more expensive resource then an app, web or some other type of client.
I'd recommend you try and format the data out when you insert the data, or you handle in the application tier. Scaling horizontally an app tier is so much easier then scalling your SQL.
Edit
I am talking the database server's physical resources are usually more expensive then a properly designed applications server's physical resources. This is because it is very easy to scale an application horizontally, it is in my opinion an order of magnitude more expensive to scale a DB server horizontally. Especially if your dealing with a transactional database and need to manage merging
I am not saying it is not possible just that scaling a database server horizontally is a much more difficult task, hence it's more expensive. The only reason I pointed this out is the OP raised a concern about using CPU cycles on the app server vs the database server. Most applications I have worked with have been data centric applications which processed through GB's of data to get a user an answer. We initially put everything on the database server because it was easier then doing it in classic asp and vb6 at the time. Over time the DB server was more and more loaded until scaling veritcally was no longer an option.
Database Servers are also designed at retrieving and joining data together. You should leave the formating of the data to the application and business rules (in general of course)