Simple database needed to store JSON using C# - sql

I'm new to databases. I've been saving a financials table from a website in JSON format on a daily basis, accumulating new files in my directory every day. I simply parse the contents into a C# collection for use in my program and compare data via LINQ.
Obviously I'm looking for a more efficient solution, especially as my file collection will grow over time.
An example of a row of the table is:
{"strike":"5500","type":"Call","open":"-","high":"9.19B","low":"8.17A","last":"9.03B","change":"+.33","settle":"8.93","volume":"0","openInterest":"1,231"}
I'd prefer to keep a 'compact file' per stock that I can access individually as opposed to a large database with many stocks.
What would be an 'advisable' solution to use? I know that's a bit of an open-ended question, but some suggestions would be great.
I don't mind slower writing into the DB but a fast read would be beneficial.
What would be the best way to store the data? Strings or numerical values?
I found this link to help with the conversion: How to Save JSON data to SQL server database in C#?
Thank you.

For faster reads in a DB, I would suggest denormalizing the data.
Read up on "Normalization vs Denormalization".
Judging from your JSON file, it doesn't seem like you have any table joins, so keeping that flat structure should be fine.
For the comparison between varchar (string) and int (numeric): int is faster than varchar, for the simple reason that it takes up much less space. The integer types are a fixed 1-8 bytes (tinyint through bigint), whereas a varchar stores the actual characters plus a couple of bytes of length overhead.
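To make the string-vs-numeric point concrete, here is a minimal sketch (the class, its property names, and the cleaning rules for values like "-", "9.19B" and "1,231" are my own assumptions based on the sample row, not the asker's schema) that parses one JSON row into typed fields so they could be stored in DECIMAL/INT columns rather than varchar:

```csharp
// Sketch only: parse a daily JSON row into typed values instead of raw strings.
using System;
using System.Globalization;
using System.Text.Json;

public class OptionRow
{
    public decimal Strike { get; set; }
    public string Type { get; set; }
    public decimal? Last { get; set; }
    public decimal? Settle { get; set; }
    public int Volume { get; set; }
    public int OpenInterest { get; set; }

    public static OptionRow Parse(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var r = doc.RootElement;
        return new OptionRow
        {
            Strike       = ParseNumber(r.GetProperty("strike").GetString()) ?? 0m,
            Type         = r.GetProperty("type").GetString(),
            Last         = ParseNumber(r.GetProperty("last").GetString()),
            Settle       = ParseNumber(r.GetProperty("settle").GetString()),
            Volume       = (int)(ParseNumber(r.GetProperty("volume").GetString()) ?? 0m),
            OpenInterest = (int)(ParseNumber(r.GetProperty("openInterest").GetString()) ?? 0m)
        };
    }

    // "-" is treated as missing; thousands separators and trailing bid/ask markers
    // such as the "B" in "9.19B" are stripped (an assumption about the feed's format).
    static decimal? ParseNumber(string s)
    {
        if (string.IsNullOrEmpty(s) || s == "-") return null;
        s = s.Replace(",", "").TrimEnd('A', 'B');
        return decimal.TryParse(s, NumberStyles.Any, CultureInfo.InvariantCulture, out var d)
            ? d
            : (decimal?)null;
    }
}
```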

Related

Best way to structure streamed data with missing table fields to benefit filesize

I currently have a service which provides live data every second in JSON format and I save it to a SQLserver table.
Typically the table is approx 20 fields of varchar, int and decimal and each row/record is a single timestamp for each second. Both the JSON and INSERT query contain data for all fields on every timestamp.
In order to speed up response times and reduce transmitted bytes, the JSON in future will only contain changes to the data (ie the value is different from the previous value), so many fields will not be contained in the JSON.
My question is: what is the best way to store this in SQL so that I also benefit from the reduction in data - is there a better way to do this? If I used the same table structure with NULL entries, then surely each row would be the same byte size anyway, based on the field types?
Edit: The new streaming format would mean the following
Each timestamp will still have data values, but a field will not appear in the JSON array if its value has not changed from the previous one.
I'm looking at saving disk space. I'm happy to rebuild the data when required with post processing outside of SQL to get 'full' data for any particular timestamp.
Possibly it might be better to just store the full JSON response string with timestamp?
I'm not familiar with JSON and the whole idea, but the best way to store possible NULLs is to put fixed-size fields (INT, BIT, SMALLINT, DECIMAL, FLOAT, etc.) at the beginning of the table and variable-sized fields (VARCHAR, NVARCHAR, XML, JSON, etc.) at the end of the table.
A second suggestion would be to use temporal tables, introduced in SQL Server 2016. Whether that stores the data in the most compact way needs some research, but it will make extracting and handling the change history significantly easier.
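As a rough illustration of the temporal-table suggestion (table and column names here are invented, not from the question), only the current values would sit in the main table while SQL Server 2016's system versioning records every change in a history table automatically:

```csharp
// Sketch only: create a system-versioned (temporal) table from C#.
using System.Data.SqlClient;

public static class TemporalTableSetup
{
    const string Ddl = @"
        CREATE TABLE dbo.LiveQuote
        (
            SymbolId     INT            NOT NULL PRIMARY KEY CLUSTERED,
            Price        DECIMAL(18, 6) NULL,
            Volume       INT            NULL,
            SysStartTime DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
            SysEndTime   DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
            PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime)
        )
        WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.LiveQuoteHistory));";

    public static void Create(string connectionString)
    {
        using var conn = new SqlConnection(connectionString);
        conn.Open();
        using var cmd = new SqlCommand(Ddl, conn);
        cmd.ExecuteNonQuery();
    }
}
```

Note that the history table stores full copies of the previous row versions, so whether this actually saves space is the part that "needs research"; what it clearly gives you is easy point-in-time reconstruction with FOR SYSTEM_TIME AS OF instead of post-processing outside SQL.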

Is there any datatype that can store more than 2 GB of data in SQL Server?

I have a requirement to store more than 2 gigabytes of data in a column. Is there any way I can do it? The data I store needs to live in the database, not on the computer's file system as it would when using the FILESTREAM approach.
No, there isn't. NVARCHAR(MAX) is the data type that can be used to store up to 2 GB of data in a column, but you cannot store more than 2 GB in it; that is the upper limit of the type.
On a side note, what makes you want to store such big data in a single column? It may cause you a lot of performance overhead, and it might not be a worthwhile thing to proceed with. I am sure you can find alternatives.
One possible alternative is to split the data and store it across multiple rows.
Otherwise, as commented by Mladen Prajdic, you can use FILESTREAM to store more than 2 GB of data.
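Here is a minimal sketch of the row-splitting alternative mentioned above (the dbo.BigBlobChunks table and its columns are assumptions for the example): a large stream is written as numbered VARBINARY(MAX) chunks that can later be reassembled in ChunkNo order.

```csharp
// Sketch only: store an arbitrarily large stream as multiple VARBINARY(MAX) rows.
// Assumed schema:
//   CREATE TABLE dbo.BigBlobChunks (BlobId INT, ChunkNo INT, Data VARBINARY(MAX),
//                                   PRIMARY KEY (BlobId, ChunkNo));
using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

public static class BlobChunkWriter
{
    public static void Write(string connectionString, int blobId, Stream source,
                             int chunkSize = 8 * 1024 * 1024) // chunk size is arbitrary
    {
        using var conn = new SqlConnection(connectionString);
        conn.Open();

        var buffer = new byte[chunkSize];
        int chunkNo = 0, read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            using var cmd = new SqlCommand(
                "INSERT INTO dbo.BigBlobChunks (BlobId, ChunkNo, Data) VALUES (@id, @no, @data)",
                conn);
            cmd.Parameters.Add("@id", SqlDbType.Int).Value = blobId;
            cmd.Parameters.Add("@no", SqlDbType.Int).Value = chunkNo++;

            var chunk = new byte[read];
            Array.Copy(buffer, chunk, read);
            cmd.Parameters.Add("@data", SqlDbType.VarBinary, -1).Value = chunk; // -1 = MAX
            cmd.ExecuteNonQuery();
        }
    }
}
```

Reading it back is the reverse: SELECT the chunks ORDER BY ChunkNo and concatenate them on the client.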

Which is better: Parsing big data each time from the DB or caching the result?

I have a problem regarding system performance. In the DB table there will be a big XML document in every record. My concern is whether I should parse the XML from the DB on every query to get the attributes and information in it, or parse the XML once and cache the results. The XML averages 100 KB in size and there will be 10^10 records. How do I solve this space vs. compute trade-off? My guess is to cache the results (the important attributes in the XML), because parsing 10^10 records per query is not an easy task. Plus, the parsed attributes can be used as an index.
If you are going to parse it all on every query, you undoubtedly should cache the results, perhaps putting the full generated product into a single database field or a file for future use, or at least until something changes, just like a forum system does.
Repeating an expensive process on a massive amount of data, knowing you will always get the same result, is a real waste of resources.
If you intend to use some of the attributes from the XML for indexing, better to add them as columns to the table.
Regarding parsing XML, 100kb is hardly a size that would affect performance. Moreover, you can read and store the XML (as is) in a string while fetching the records and parse it only when you want to show / use those additional attributes.
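A minimal sketch of that "store the XML as-is, parse on demand" idea (the class shape and the attribute access are assumptions, not the asker's schema):

```csharp
// Sketch only: keep the raw XML with the record and parse it lazily, at most once.
using System;
using System.Xml.Linq;

public class CachedRecord
{
    private readonly Lazy<XDocument> _parsed;

    public int Id { get; }
    public string RawXml { get; }

    public CachedRecord(int id, string rawXml)
    {
        Id = id;
        RawXml = rawXml;
        _parsed = new Lazy<XDocument>(() => XDocument.Parse(rawXml)); // parsed only on first use
    }

    // The XML is only touched when a caller actually asks for an attribute.
    public string GetAttribute(string name) =>
        _parsed.Value.Root?.Attribute(name)?.Value;
}
```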
You didn't mention the best choice: parse the XML and store its data in tables where it belongs. If you really, really need the original XML verbatim, keep it as a blob and otherwise ignore it.
(I think you meant cache, by the way, not catch.)

How to loop through all rows in an Oracle table?

I have a table with ~30,000,000 rows that I need to iterate through, manipulate the data for each row individually, then save the data from the row to file on a local drive.
What is the most efficient way to loop through all the rows in the table using SQL for Oracle? I've been googling but can see no straightforward way of doing this. Please help. Keep in mind I do not know the exact number of rows, only an estimate.
EDIT FOR CLARIFICATION:
We are using Oracle 10g I believe. The row data contains blob data (zipped text files and xml files) that will be read into memory and loaded into a custom object, where it will then be updated/converted using .Net DOM access classes, rezipped, and stored onto a local drive.
I do not have much database experience whatsoever - I planned to use straight SQL statements with ADO.Net + OracleCommands. No performance restrictions really. This is for internal use. I just want to do it the best way possible.
You need to read 30M rows from an Oracle DB and write out 30M files from the BLOBs (one zipped XML/text file in one BLOB column per row?) to the file system on the local computer?
The obvious solution is to open an ADO.NET DataReader on SELECT * FROM tbl WHERE <range> so you can do batches. Read the BLOB from the reader into your API, do your stuff, and write out the file. I would probably try to write the program so that it can run from many computers, each doing its own range - your bottleneck is most likely going to be the unzipping, manipulation and rezipping, since many consumers can probably stream data from that table without a noticeable effect on server performance.
I doubt you'll be able to do this with set-based operations internal to the Oracle database, and I would also be thinking about the file system and how you are going to organize so many files (and whether you have space - remember that the size a file takes up on the file system is always rounded up to a multiple of the file system block size).
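A minimal sketch of that batched reader approach (table, column, and output naming are assumptions, and Oracle's managed ADO.NET provider is used here purely for illustration); CommandBehavior.SequentialAccess lets each BLOB be streamed to disk instead of buffered whole in memory:

```csharp
// Sketch only: read one id range, stream each BLOB out to a file.
using System.Data;
using System.IO;
using Oracle.ManagedDataAccess.Client;

public static class BlobExporter
{
    public static void ExportRange(string connectionString, long idFrom, long idTo, string outputDir)
    {
        using var conn = new OracleConnection(connectionString);
        conn.Open();

        using var cmd = new OracleCommand(
            "SELECT id, payload FROM my_table WHERE id >= :idFrom AND id < :idTo", conn);
        cmd.Parameters.Add("idFrom", OracleDbType.Int64).Value = idFrom;
        cmd.Parameters.Add("idTo", OracleDbType.Int64).Value = idTo;

        using var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess);
        var buffer = new byte[81920];
        while (reader.Read())
        {
            long id = reader.GetInt64(0);
            using var file = File.Create(Path.Combine(outputDir, id + ".zip"));

            long offset = 0, read;
            while ((read = reader.GetBytes(1, offset, buffer, 0, buffer.Length)) > 0)
            {
                file.Write(buffer, 0, (int)read);
                offset += read;
            }
        }
    }
}
```

Running several copies of this, each given a different id range, is the "run from many computers" idea above.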
My initial solution was to do something like this, as I have access to an id number (pseudocode):
int num_rows = 100;
int base = 0;
int ceiling = num_rows;

while (rows remain)
{
    select * from MY_TABLE where id >= base and id < ceiling;
    iterate through retrieved rows, do work;
    base = ceiling;
    ceiling += num_rows;
}
But I feel that this might not be the most efficient or best way to do it...
You could try using ROWNUM queries to grab chunks until you hit a chunk that comes back empty.
This is a good article on rownum queries:
http://www.oracle.com/technetwork/issue-archive/2006/06-sep/o56asktom-086197.html
If you don't feel like reading, jump directly to the "Pagination with ROWNUM" section at the end for an example query.
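For reference, a sketch of that "Pagination with ROWNUM" pattern wrapped in the ADO.NET code the asker already plans to use (the inner query, table name, and bind variable names are placeholders):

```csharp
// Sketch only: build a command that fetches rows (minRow, maxRow] using nested ROWNUM filters.
using Oracle.ManagedDataAccess.Client;

public static class RownumPager
{
    const string PageSql = @"
        SELECT *
          FROM (SELECT a.*, ROWNUM rnum
                  FROM (SELECT * FROM my_table ORDER BY id) a
                 WHERE ROWNUM <= :maxRow)
         WHERE rnum > :minRow";

    public static OracleCommand CreatePageCommand(OracleConnection conn, int minRow, int maxRow)
    {
        var cmd = new OracleCommand(PageSql, conn);
        // Parameters are added in the order they appear in the SQL
        // (the Oracle provider binds by position by default).
        cmd.Parameters.Add("maxRow", OracleDbType.Int32).Value = maxRow;
        cmd.Parameters.Add("minRow", OracleDbType.Int32).Value = minRow;
        return cmd;
    }
}
```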
It's always preferable to use set-based operations when working with a large number of rows.
You would then enjoy a performance benefit. After processing the data, you should be able to dump the data from the table into a file in one go.
The viability of this depends on the processing you need to perform on the rows, although it is possible in most cases to avoid using a loop. Is there some specific requirement which prevents you from processing all rows at once?
If iterating through the rows is unavoidable, using bulk binding can be beneficial: FORALL bulk operations or BULK COLLECT for "select into" queries.
It sounds like you need the entire dataset before you can do any data manipulation, since it is a BLOB. I would just use DataAdapter.Fill, then hand the DataSet over to the custom object to iterate through and do its manipulation, write the resulting object to disk, and then zip it.
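A minimal sketch of that DataAdapter.Fill approach (query, column names, and output naming are assumptions):

```csharp
// Sketch only: fill one batch into a DataSet, then process each row's BLOB in memory.
using System.Data;
using System.IO;
using Oracle.ManagedDataAccess.Client;

public static class BatchProcessor
{
    public static void ProcessBatch(OracleConnection conn, long idFrom, long idTo, string outputDir)
    {
        var adapter = new OracleDataAdapter(
            "SELECT id, payload FROM my_table WHERE id >= :idFrom AND id < :idTo", conn);
        adapter.SelectCommand.Parameters.Add("idFrom", OracleDbType.Int64).Value = idFrom;
        adapter.SelectCommand.Parameters.Add("idTo", OracleDbType.Int64).Value = idTo;

        var ds = new DataSet();
        adapter.Fill(ds);

        foreach (DataRow row in ds.Tables[0].Rows)
        {
            var blob = (byte[])row["payload"];
            // ... hand 'blob' to the custom object, manipulate, rezip ...
            File.WriteAllBytes(Path.Combine(outputDir, row["id"] + ".zip"), blob);
        }
    }
}
```

Note this holds the whole batch in memory, so it only makes sense with reasonably small id ranges.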

SQL performance & MD5 strings

I've got a DB table where we store a lot of MD5 hashes (and yes, I know they aren't 100% unique...) and we run a lot of comparison queries against those strings.
This table can become quite large with over 5M rows.
My question is this: Is it wise to keep the data as hexadecimal strings or should I convert the hex to binary or decimals for better querying?
Binary is likely to be faster, since with text you're using 8 bits (a full character) to encode 4 bits of data. But I doubt you'll really notice much if any difference.
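A minimal sketch of the binary option (the BINARY(16) column and the @hash parameter name are assumptions): the hash is kept as its raw 16 bytes instead of a 32-character hex string, halving what has to be stored and compared.

```csharp
// Sketch only: compute an MD5 hash as raw bytes and bind it to a BINARY(16) parameter.
using System.Data;
using System.Data.SqlClient;
using System.Security.Cryptography;
using System.Text;

public static class HashStorage
{
    public static byte[] Md5Bytes(string content)
    {
        using var md5 = MD5.Create();
        return md5.ComputeHash(Encoding.UTF8.GetBytes(content)); // always 16 bytes
    }

    public static void AddHashParameter(SqlCommand cmd, string content)
    {
        // Equality comparisons and the index then work on 16 bytes rather than 32 characters.
        cmd.Parameters.Add("@hash", SqlDbType.Binary, 16).Value = Md5Bytes(content);
    }
}
```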
Where I work we have a very similar table. It holds dictation texts from doctors for billing purposes in a text column (still on SQL Server 2000). We're approaching four million records, and we need to be able to check for duplicates, where the doctor dictated the exact same thing twice, for validation and compliance purposes. A dictation can run several pages, so we also have a hash column that's populated on insert via a trigger. The column is a char(32) type.
Binary data is a bummer to work with manually or if you have to dump your data to a text file or whatnot.
Just put an index on the hash column and you should be fine.