Text file vs SQL?

I am using a simple text file to store file names and their hash values, which is later read to search for a particular file. Should I go for SQL for such a simple task?

It depends on your needs and operations.
If you only need simple operations like reads and writes (updates and deletions are harder in a flat file than in a DB) and your data volume is very low, it's OK to go that way (not that I recommend it).
Relational databases are generally better than plain files because their row/column structure suits data manipulation operations.
If your needs are simple, use a JSON or XML structure; either is way better than a raw text file.
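For example, if you do stay file-based, a JSON index of file name to hash keeps reads and lookups trivial. A minimal C# sketch, assuming .NET with System.Text.Json and SHA-256 as the hash (the index path and file name are placeholders):

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text.Json;

class HashIndex
{
    const string IndexPath = "hashes.json"; // hypothetical index file

    static void Main()
    {
        // Load the existing index (file name -> hash), or start empty.
        var index = File.Exists(IndexPath)
            ? JsonSerializer.Deserialize<Dictionary<string, string>>(File.ReadAllText(IndexPath))
            : new Dictionary<string, string>();

        // Add or update one entry.
        string file = "example.bin"; // placeholder file name
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(file))
            index[file] = BitConverter.ToString(sha.ComputeHash(stream)).Replace("-", "");

        // Persist the index and look a file up.
        File.WriteAllText(IndexPath, JsonSerializer.Serialize(index));
        Console.WriteLine(index.TryGetValue(file, out var hash) ? hash : "not found");
    }
}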

Vertica Large Objects

I am migrating a table from Oracle to Vertica that contains an LOB column. The maximum actual size of the LOB column amounts to 800MB. How can this data be accommodated in Vertica? Is it appropriate to use the Flex Table?
In Vertica's documentation, it says that data loaded into a Flex table is stored in a raw column of type LONG VARBINARY. By default it has a maximum size of 32MB, which, according to the documentation, can be changed (i.e. increased) using the parameter FlexTablesRawSize.
I'm thinking this is the approach for storing large objects in Vertica: we just need to raise the FlexTablesRawSize parameter to handle 800MB of data. I'd like to ask whether this is the optimal way or if there's a better one. Or will this conflict with Vertica's row constraint that only allows up to 32MB of data per row?
Thank you in advance.
If you use Vertica for what it's built for (running a Big Data analytical database), you would, like in any analytical database, try to avoid large objects in your tables. BLOBs and CLOBs are usually used to store unstructured data: large documents, image files, audio files, video files. You can't filter by such a column, you can't run functions on it, sum it, or group by it.
A safe and performant design would store the file name in a Vertica table column, store the file itself elsewhere (maybe even in Hadoop), and let the front end (usually a BI tool, and all BI tools support that) retrieve the file and bring it to a report screen ...
Good luck ...
Marco

Simple database needed to store JSON using C#

I'm new to databases. I've been saving a financials table from a website in JSON format on a daily basis, accumulating new files in my directory every day. I simply parse the contents into a C# collection for use in my program and compare data via Linq.
Obviously I'm looking for a more efficient solution especially as my file collection will grow over time.
An example of a row of the table is:
{"strike":"5500","type":"Call","open":"-","high":"9.19B","low":"8.17A","last":"9.03B","change":"+.33","settle":"8.93","volume":"0","openInterest":"1,231"}
I'd prefer to keep a 'compact file' per stock that I can access individually as opposed to a large database with many stocks.
What would be an 'advisable' solution to use? I know that's a bit of an open ended question but some suggestions would be great.
I don't mind slower writing into the DB but a fast read would be beneficial.
What would be the best way to store the data? Strings or numerical values?
I found this link to help with the conversion: How to Save JSON data to SQL server database in C#?
Thank you.
For faster reads in a DB, I would suggest denormalizing the data.
Read "Normalization vs Denormalization".
Judging from your JSON file, it doesn't seem like you have any table joins, so keeping that flat structure should be fine.
As for the comparison between varchar (string) and int (numeric) columns: ints are faster than varchars, for the simple fact that ints take up much less space. An int is a fixed 2-8 bytes depending on the integer type, whereas a varchar needs a couple of bytes of length overhead plus the actual characters.
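Whichever storage you pick, it usually pays to deserialize each row into a typed object once and convert the numeric strings at load time. A rough C# sketch with System.Text.Json (the class and property names are invented to match the sample row; quotes like "9.19B" carry a bid/ask suffix, so those fields are left as strings here):

using System;
using System.Globalization;
using System.Text.Json;
using System.Text.Json.Serialization;

// Property names mirror the sample row; the class name is made up.
public class OptionRow
{
    [JsonPropertyName("strike")] public string Strike { get; set; }
    [JsonPropertyName("type")] public string Type { get; set; }
    [JsonPropertyName("last")] public string Last { get; set; }
    [JsonPropertyName("settle")] public string Settle { get; set; }
    [JsonPropertyName("openInterest")] public string OpenInterest { get; set; }
}

class Demo
{
    static void Main()
    {
        string json = "{\"strike\":\"5500\",\"type\":\"Call\",\"last\":\"9.03B\",\"settle\":\"8.93\",\"openInterest\":\"1,231\"}";
        OptionRow row = JsonSerializer.Deserialize<OptionRow>(json);  // extra JSON fields are simply ignored

        // Convert once at load time and store numbers in numeric columns.
        decimal strike = decimal.Parse(row.Strike, CultureInfo.InvariantCulture);
        int openInterest = int.Parse(row.OpenInterest, NumberStyles.AllowThousands, CultureInfo.InvariantCulture);
        Console.WriteLine($"{strike} {row.Type} settle={row.Settle} OI={openInterest}");
    }
}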

Avro schema storage

We are evaluating Avro vs. Thrift for storage. At this point Avro seems to be our choice; however, the documentation states that the schema is stored alongside the data when serialized. Is there a way to avoid this? Since we are in charge of both producing and consuming the data, we want to see if we can avoid serializing the schema. Also, is the serialized data with the schema much larger than just the data without it?
A little late to the party, but you don't actually need to store the actual schema with each and every record. You do, however, need a way to get back to the original schema from each record's serialized format.
Thus, you could use a schema store + custom serializer that writes the avro record content and the schema id. On read, you can read back in that schema ID, retrieve it from the schema store and then use that schema to rehydrate the record content. Bonus points for using a local cache if your schema store is remote.
This is exactly the approach that Oracle's NoSQL DB takes to managing schema in a storage-efficient manner (it's also available under the AGPL license).
Full disclosure: I'm not currently, and have never been, employed by Oracle or Sun, nor did I work on the above store. I just came across it recently :)
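To make that framing concrete, here is a minimal C# sketch of the record layout this answer describes: a small schema id followed by the Avro-encoded payload. The registry that maps ids back to schema text, and the payload bytes themselves, are assumed to come from elsewhere; this only shows the wrapping and unwrapping.

using System;
using System.IO;

static class SchemaIdFraming
{
    // Prepend a 4-byte schema id to the Avro binary payload before storing it.
    public static byte[] Wrap(int schemaId, byte[] avroPayload)
    {
        using var ms = new MemoryStream();
        using (var w = new BinaryWriter(ms))
        {
            w.Write(schemaId);     // the "small internal schema identifier"
            w.Write(avroPayload);  // raw Avro-encoded record, no schema attached
        }
        return ms.ToArray();
    }

    // Split a stored record back into (schema id, payload); the caller looks the id
    // up in its schema store and deserializes the payload with that schema.
    public static (int SchemaId, byte[] Payload) Unwrap(byte[] stored)
    {
        using var ms = new MemoryStream(stored);
        using var r = new BinaryReader(ms);
        int schemaId = r.ReadInt32();
        byte[] payload = r.ReadBytes(stored.Length - sizeof(int));
        return (schemaId, payload);
    }
}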
I'm pretty sure you will always need the schema to be stored with the data. This is because Avro will use it when reading and writing to the .avro file.
According to http://docs.oracle.com/cd/NOSQL/html/GettingStartedGuide/avroschemas.html:
You apply a schema to the value portion of an Oracle NoSQL Database record using Avro bindings. These bindings are used to serialize values before writing them, and to deserialize values after reading them. The usage of these bindings requires your applications to use the Avro data format, which means that each stored value is associated with a schema.
As far as the size difference goes, you only have to store the schema once, so in the grand scheme of things it doesn't make much of a difference. My schema takes up 105.5KB (and that is a really large schema; yours shouldn't be that big) and each serialized value takes up 3.3KB. I'm not sure what the difference would be versus just the raw JSON of the data, but according to that link I posted:
Each value is stored without any metadata other than a small internal schema identifier, between 1 and 4 bytes in size.
But I believe that may just be for single, simple values.
This is on HDFS for me btw.
Thanks JGibel. Our data would eventually end up in HDFS, and the object container file format does ensure that the schema is only written as a header on the file.
For uses other than HDFS, I was under the wrong assumption that the schema would be attached to every encoded record, but that's not the case: you need the schema to deserialize, but the serialized data does not have to have the schema string attached to it.
For example:
import java.io.ByteArrayOutputStream;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;

// Binary-encode one record: only the field values go into the stream, not the schema.
DatumWriter<TransactionInfo> eventDatumWriter = new SpecificDatumWriter<TransactionInfo>(TransactionInfo.class);
TransactionInfo t1 = getTransaction();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
BinaryEncoder becoder = EncoderFactory.get().binaryEncoder(baos, null);
eventDatumWriter.setSchema(t1.getSchema());
eventDatumWriter.write(t1, becoder);
becoder.flush();
// baos.toByteArray() now holds the schema-less binary encoding of t1.

Which is better: Parsing big data each time from the DB or caching the result?

I have a problem regarding system performance. In the DB table there will be a big XML document in every record. My concern is whether I should parse the XML data from the DB each time to get the attributes and information in the XML. The other choice would be parsing the XML once and caching the results. The XML size averages 100KB and there will be 10^10 records. How do I solve this space vs. computing performance trade-off? My guess is to cache the result (the important attributes in the XML), because parsing 10^10 records per query is not an easy task. Plus, the parsed attributes can be used as an index.
If you are going to parse it all on every query, you undoubtedly should cache the results, perhaps put the full generated product into a single database field or a file for future use, or at least until something changes, just like a forum system does.
Repeating an expensive process on a massive amount of data knowing you will always get the same result is a real waste of resources.
If you intend to use some of the attributes from the XML for indexing, better to add them as columns to the table.
Regarding parsing XML, 100kb is hardly a size that would affect performance. Moreover, you can read and store the XML (as is) in a string while fetching the records and parse it only when you want to show / use those additional attributes.
You didn't mention the best choice: parse the XML and store its data in tables where it belongs. If you really, really need the original XML verbatim, keep it as a blob and otherwise ignore it.
(I think you meant cache, by the way, not catch.)
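To illustrate the "parse once, store the attributes as real columns" suggestion, a hedged C# sketch (the table, columns, attribute names, and the SQL Server provider are all assumptions; any ADO.NET provider works the same way):

using System;
using System.Data.SqlClient;   // assumption: SQL Server provider; swap in your DB's provider
using System.IO;
using System.Xml.Linq;

class XmlToColumns
{
    static void Main()
    {
        string xml = File.ReadAllText("record.xml");   // the ~100KB XML payload
        XDocument doc = XDocument.Parse(xml);

        // Extract only the attributes you search or index on (names are invented).
        string customerId = (string)doc.Root.Attribute("customerId");
        DateTime created = (DateTime)doc.Root.Attribute("created");

        using var conn = new SqlConnection("...connection string...");
        conn.Open();

        // Store the searchable attributes as real columns and, only if you truly
        // need the original verbatim, keep the raw XML in a blob/clob column.
        using var cmd = new SqlCommand(
            "INSERT INTO records (customer_id, created_at, raw_xml) VALUES (@id, @created, @xml)", conn);
        cmd.Parameters.AddWithValue("@id", customerId);
        cmd.Parameters.AddWithValue("@created", created);
        cmd.Parameters.AddWithValue("@xml", xml);
        cmd.ExecuteNonQuery();
    }
}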

How to loop through all rows in an Oracle table?

I have a table with ~30,000,000 rows that I need to iterate through, manipulate the data for each row individually, then save the data from the row to file on a local drive.
What is the most efficient way to loop through all the rows in the table using SQL for Oracle? I've been googling but can see no straightforward way of doing this. Please help. Keep in mind I do not know the exact number of rows, only an estimate.
EDIT FOR CLARIFICATION:
We are using Oracle 10g I believe. The row data contains blob data (zipped text files and xml files) that will be read into memory and loaded into a custom object, where it will then be updated/converted using .Net DOM access classes, rezipped, and stored onto a local drive.
I do not have much database experience whatsoever - I planned to use straight SQL statements with ADO.Net + OracleCommands. No performance restrictions really. This is for internal use. I just want to do it the best way possible.
You need to read 30m rows from an Oracle DB and write out 30m files from the BLOBs (one zipped XML/text file in one BLOB column per row?) in each row to the file system on the local computer?
The obvious solution is to open an ADO.NET DataReader on SELECT * FROM tbl WHERE <range> so you can work in batches. Read the BLOB from the reader into your API, do your stuff, and write out the file. I would probably try to write the program so that it can run from many computers, each doing its own range: your bottleneck is most likely going to be the unzipping, manipulation and rezipping, since many consumers can probably stream data from that table without a noticeable effect on server performance.
I doubt you'll be able to do this with set-based operations internal to the Oracle database, and I would also be thinking about the file system and how you are going to organize so many files (and whether you have space; remember that the size taken up by a file on the file system is always a multiple of the file system block size).
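A hedged sketch of that approach in C# with ADO.NET, assuming the ODP.NET managed provider, a numeric ID column suitable for range batching, and one zipped payload per row (table, column, and path names are placeholders):

using System;
using System.IO;
using Oracle.ManagedDataAccess.Client;   // assumption: ODP.NET managed provider

class BlobExporter
{
    static void Main()
    {
        const int batchSize = 1000;   // rows per round trip; tune as needed
        const string outDir = @"C:\export";

        using var conn = new OracleConnection("User Id=...;Password=...;Data Source=...");
        conn.Open();

        // Find the upper bound of the id range once, then walk it in fixed-size chunks.
        long maxId;
        using (var maxCmd = new OracleCommand("SELECT MAX(id) FROM my_table", conn))
            maxId = Convert.ToInt64(maxCmd.ExecuteScalar());

        for (long lo = 0; lo <= maxId; lo += batchSize)
        {
            using var cmd = new OracleCommand(
                "SELECT id, data FROM my_table WHERE id >= :lo AND id < :hi", conn);
            cmd.Parameters.Add("lo", lo);
            cmd.Parameters.Add("hi", lo + batchSize);

            using var reader = cmd.ExecuteReader();
            while (reader.Read())
            {
                long id = Convert.ToInt64(reader["id"]);
                byte[] blob = (byte[])reader["data"];   // the zipped XML/text payload

                // ... unzip, manipulate with the .NET DOM classes, rezip ...
                File.WriteAllBytes(Path.Combine(outDir, id + ".zip"), blob);
            }
        }
    }
}

Splitting the 0..maxId range across several machines, as suggested above, is then just a matter of giving each one a different slice of the loop.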
My initial solution was to do something like this, as I have access to an id number (pseudocode):
int num_rows = 100;
int base = 0;
int ceiling = num_rows;
select * from MY_TABLE where id >= base and id < ceiling;
iterate through retrieved rows, do work,
base = ceiling;
ceiling += num_rows;
select * from MY_TABLE where id >= base and id < ceiling;
iterate through retrieved rows, do work,
...etc.
But I feel that this might not be the most efficient or best way to do it...
You could try using ROWNUM queries to grab chunks until you grab a chunk that doesn't exist.
This is a good article on rownum queries:
http://www.oracle.com/technetwork/issue-archive/2006/06-sep/o56asktom-086197.html
If you don't feel like reading, jump directly to the "Pagination with ROWNUM" section at the end for an example query.
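If you go the ROWNUM route, here is a sketch of how the article's pagination pattern could drive that kind of loop, stopping when a window comes back empty (query, table, and column names are illustrative; same hypothetical provider as the sketch above):

using System;
using Oracle.ManagedDataAccess.Client;   // assumption: ODP.NET managed provider

class RownumPager
{
    // Pagination query in the shape the linked article describes: the innermost
    // query orders the rows, ROWNUM caps the upper bound of the window, and the
    // outer filter trims off everything below the window.
    const string PageSql = @"
        SELECT id, data
          FROM (SELECT a.*, ROWNUM rnum
                  FROM (SELECT id, data FROM my_table ORDER BY id) a
                 WHERE ROWNUM <= :lastRow)
         WHERE rnum > :firstRow";

    static void Main()
    {
        const int pageSize = 1000;
        using var conn = new OracleConnection("User Id=...;Password=...;Data Source=...");
        conn.Open();

        for (int firstRow = 0; ; firstRow += pageSize)
        {
            using var cmd = new OracleCommand(PageSql, conn);
            cmd.Parameters.Add("lastRow", firstRow + pageSize);   // parameters bound in query order
            cmd.Parameters.Add("firstRow", firstRow);

            int rows = 0;
            using (var reader = cmd.ExecuteReader())
                while (reader.Read()) { rows++; /* process the row here */ }

            if (rows == 0) break;   // grabbed a chunk that doesn't exist: done
        }
    }
}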
It's always preferable to use set-based operations when working with a large number of rows.
You would then enjoy a performance benefit. After processing the data, you should be able to dump the data from the table into a file in one go.
The viability of this depends on the processing you need to perform on the rows, although it is possible in most cases to avoid using a loop. Is there some specific requirement which prevents you from processing all rows at once?
If iterating through the rows is unavoidable, using bulk binding can be beneficial: FORALL bulk operations or BULK COLLECT for "select into" queries.
It sounds like you need the entire dataset before you can do any data manipulation, since it is a BLOB. I would just use a DataAdapter.Fill and then hand the DataSet over to the custom object to iterate through, do its manipulation, write the end object to disk, and then zip it.