Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 7 years ago.
It just occurred to me that serialization seems a strange name for the process of writing a data structure out to disk. Wikipedia and Google are not much help for the origin of the name. Why is serialization called serialization? Is it related to writing data out to tapes?
EDIT: To be clear, I'm interested in the origins of the word 'Serialization' not the fact that data needs to be 'serial' in order to be transmitted.
Is it related to writing data out to tapes?
That's pretty close to the truth. Data stored on a tape is obviously stored in a serial way, i.e. it's basically a series of bits from one end of the tape to the other. However, it's not only data on tapes that is serial in this way. Even if not as obvious, data saved or transmitted in any way is serial in much the same way. A file on the disk or a stream sent over the network is a series of bytes with a start and an end.
By contrast, when an object resides in memory it can be scattered across several parts of memory. A list of strings, for example, is not a single block of memory where one string follows another, but a block of pointers, each of which points to a different section of memory where one of the strings is stored.
When you want to store or transmit an object, you can't just take that scattered data and send it as it is. You have to take the different parts of the data and arrange them as a single series of bytes. That is where serialization comes in.
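As a rough illustration of the paragraph above, here is a minimal Python sketch that flattens a list of strings into one series of bytes and rebuilds the list from it. The length-prefixed layout and the function names `serialize_strings` and `deserialize_strings` are assumptions made for this example, not a standard format.

```python
import struct


def serialize_strings(strings):
    """Flatten a list of strings into one contiguous series of bytes.

    Layout (an assumption for this sketch): a 4-byte big-endian count,
    then for each string a 4-byte big-endian length and its UTF-8 bytes.
    """
    parts = [struct.pack(">I", len(strings))]
    for s in strings:
        encoded = s.encode("utf-8")
        parts.append(struct.pack(">I", len(encoded)))
        parts.append(encoded)
    return b"".join(parts)


def deserialize_strings(data):
    """Rebuild the list by reading the byte series from start to end."""
    (count,) = struct.unpack_from(">I", data, 0)
    offset = 4
    result = []
    for _ in range(count):
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        result.append(data[offset:offset + length].decode("utf-8"))
        offset += length
    return result
```

The reader walks the bytes strictly front to back, which is the "serial" property the answer describes.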
You need to understand a couple of concepts from way back.
Serial access: the data is written on a medium that can only be read from beginning to end, e.g., primarily, a tape, but text files with line separators are another example, as indeed is a TCP socket.
Sequential access: the data is written on a medium that may support random or indexed access as well, but what we are doing at the moment is sequential only, i.e. from beginning to end. Many operating systems used to support 'relative' and 'indexed' files, which can both be accessed sequentially, but which can also be accessed by position or by key value respectively.
When you consider the nature of a serialized object stream, 'serial' is clearly the term that applies here. You can't access it by position or key, only serially.
Stolen from another question:
"... in order to transmit any information you need to put all parts of that information into a series of bytes.
In order to transmit a record full of information you would have to "serialize" all the bytes that comprise the record, send them over the wire and at the other end would have to deserialize them back into a record.
With the advent of client / server applications, the concept was generalized to serializing objects into some kind of (textual) form that could be transmitted across a network and deserialized back into objects at the other end.
Client / server communication started with several proprietary protocols that handled the deconstruction and reconstruction of objects before and after transmission between client and server. With SOAP for client / server communication, XML became a de facto standard for the textual representation of objects. JavaScript and the abundance of web clients using it brought the need for a more concise representation and led to JSON."
I'm working on a project in which two processes communicate via a TCP-based message bus. For efficiency, I'm considering prepending each message with the byte length of the message.
Some messages, however, convey information about COM objects; namely, process A calls CoMarshalInterface() and submits the resulting bytes to process B for deserialization.
In order to determine the byte length of my messages without actually serializing them yet, I'm trying to figure out whether there is any way of knowing the exact, or at least the maximum, number of bytes that CoMarshalInterface() would yield, without actually having to call that method yet (at least not at this point in the code).
Would anybody know if there's any way?
I haven't noticed any big variations in data length for the objects I have tested this with, but I'm not quite sure how CoMarshalInterface works internally. Does it depend on some mechanism implemented by each COM object individually, hence completely unknown size, or is it safe to assume it would never generate more than XYZ bytes of serialized information?
Thanks!
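For reference, the length-prefix framing described in this question can be sketched in Python as follows. The 4-byte big-endian header and the helper names are assumptions for illustration; the payload is treated as opaque bytes, standing in for whatever a marshaling call produced. Note that this sketch serializes first and writes the prefix afterwards, so it does not answer the size-prediction question itself.

```python
import io
import struct

# 4-byte big-endian payload length (an assumed header format).
HEADER = struct.Struct(">I")


def write_frame(stream, payload):
    """Write one message: the length prefix, then the serialized bytes."""
    stream.write(HEADER.pack(len(payload)))
    stream.write(payload)


def read_exact(stream, n):
    """Read exactly n bytes, since one read on a stream may return fewer."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream ended in the middle of a frame")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_frame(stream):
    """Read one length-prefixed message from the stream."""
    (length,) = HEADER.unpack(read_exact(stream, HEADER.size))
    return read_exact(stream, length)


# Usage with an in-memory stream standing in for the TCP connection.
buffer = io.BytesIO()
write_frame(buffer, b"marshaled-bytes")
buffer.seek(0)
message = read_frame(buffer)
```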
My question concerns a LabVIEW (2013) VI that I am trying to modify. (I am only just learning to use this language. I have searched the NI site and Stack Overflow for help without success; I suspect I am using the wrong keywords.)
My VI consists of a flat sequence, one pane of which contains a while loop where integer data is collected from a device and displayed on a graph.
I would like to be able to buffer this data and then send it to disk when a preset number of samples has been collected. My attempts so far result in only the last record being saved.
Specifically, I need to know how to save the data in a buffer (array) and then, when the correct number of samples has been captured, save it all to disk (saving as it is captured slows the process down too much).
Hope the question is clear and thanks very much in advance for any suggestions.
Tom
Below is a simple circular-buffer that holds the most recent 100 readings. Each time the buffer is refilled, its contents are written to a text file. Drag the image onto a VI's block diagram to try it out.
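In text form, the same buffering idea can be sketched in Python. This is not LabVIEW code; the class name `SampleBuffer`, the default capacity of 100, and the one-value-per-line file format are assumptions made for illustration.

```python
class SampleBuffer:
    """Collect samples in memory and write them to disk in batches."""

    def __init__(self, path, capacity=100):
        self.path = path
        self.capacity = capacity
        self.samples = []

    def add(self, value):
        """Store one sample; flush once the preset number is reached."""
        self.samples.append(value)
        if len(self.samples) >= self.capacity:
            self.flush()

    def flush(self):
        """Append every buffered sample to the file, then empty the buffer."""
        if not self.samples:
            return
        with open(self.path, "a") as f:
            f.writelines(f"{v}\n" for v in self.samples)
        self.samples.clear()
```

Calling `flush()` once more after the acquisition loop ends writes any remaining partial batch.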
As you learn more about LabVIEW and as your performance and multi-threaded needs increase, consider reading about some of the LabVIEW design patterns mentioned in the other answers:
State machine: http://www.ni.com/tutorial/7595/en/
Producer-consumer: http://www.ni.com/white-paper/3023/en/
I'd suggest splitting the data acquisition and the data saving into two different loops using a producer/consumer design pattern.
Moreover, if you need very high throughput, consider using the TDMS file format.
Have a look here for an overview: http://www.ni.com/white-paper/3727/en/
A screenshot would definitely help. However, some things are clear:
Unless you are dealing with a very high volume of data, very slow hard drives, or other unusual requirements, open the file before your while loop, write to it every time you acquire a sample (leaving buffering to the OS), and close it afterwards.
If you decide you need to manage buffering on your own, you can use queues. See this example: https://decibel.ni.com/content/docs/DOC-14804 for reference (they stream data from disk, buffering it in the queue, but it is the same idea)
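The queue-based approach can be sketched outside LabVIEW as well. The following Python version is only an analogy for the two-loop producer/consumer structure; `read_sample` is a hypothetical callable standing in for the device read, and the `None` sentinel is an assumption of this sketch.

```python
import queue
import threading


def acquire(read_sample, n_samples, q):
    """Producer loop: push each sample onto the queue, then a sentinel."""
    for _ in range(n_samples):
        q.put(read_sample())
    q.put(None)  # tells the consumer that acquisition has finished


def save(q, path):
    """Consumer loop: write queued samples until the sentinel arrives."""
    with open(path, "w") as f:
        while True:
            sample = q.get()
            if sample is None:
                break
            f.write(f"{sample}\n")


def run(read_sample, n_samples, path):
    """Run acquisition and saving in separate threads joined by a queue."""
    q = queue.Queue()
    consumer = threading.Thread(target=save, args=(q, path))
    consumer.start()
    acquire(read_sample, n_samples, q)
    consumer.join()
```

Because the acquisition loop only enqueues, slow disk writes delay the consumer rather than the sampling.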
My VI consists of a flat sequence one pane of which
Replace the flat sequence with a finite state machine (e.g. http://forums.ni.com/t5/LabVIEW/Ending-a-Flat-Sequence-Inside-a-case-structure/td-p/3170025)
I am new to the NoSQL world and am thinking of replacing my MS SQL Server database with MongoDB. My application (written in .NET C#) interacts with IP cameras and records metadata for each image coming from a camera into the MS SQL database. On average, I am inserting about 86400 records per day for each camera, and in the current database schema I have created a separate table for each camera's images, e.g. Camera_1_Images, Camera_2_Images ... Camera_N_Images. A single image record consists of simple metadata such as AutoId, FilePath, CreationDate. To add more detail, my application starts a separate process (.exe) for each camera, and each process inserts 1 record per second into its respective table in the database.
I need suggestions from (MongoDB) experts on following concerns:
to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)? Any suggestions about Document Based schema design for my case?
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?
Are there any benefits of using multiple databases on same machine, so that one database will hold images of current day for all cameras, and the second one will be used to archive previous day images? I am thinking on this with respect to splitting reads and writes on separate databases. Because all read requests might be served by second database and writes to first one. Will it benefit or not? If yes then any idea to ensure that both databases are synced always.
Any other suggestions are welcomed please.
I am myself a starter on NoSQL databases. So I am answering this at the expense of potential down votes but it will be a great learning experience for me.
Before trying my best to answer your questions, I should say that if MS SQL Server is working well for you then stick with it. You have not mentioned any valid reason WHY you want to use MongoDB except the fact that you learnt about it as a document oriented db. Moreover, I see that you are capturing almost the same set of metadata for each camera, i.e. your schema is not dynamic.
to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)? Any suggestions about Document Based schema design for my case?
MongoDB being a document oriented db, is good at querying within an aggregate (you call it document). Since you already are storing each camera's data in its own table, in MongoDB you will have a separate collection created for each camera. Here is how you perform date range queries.
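A minimal sketch of such a date range query, written as a Python helper that builds the filter document. The field name `CreationDate` comes from the question; the helper name is hypothetical, and the half-open interval (start inclusive, end exclusive) is just one reasonable choice.

```python
from datetime import datetime, timedelta


def hour_range_filter(hour_start):
    """Build a MongoDB filter matching one hour of records.

    Uses the standard $gte / $lt comparison operators on the
    CreationDate field (start inclusive, end exclusive).
    """
    hour_end = hour_start + timedelta(hours=1)
    return {"CreationDate": {"$gte": hour_start, "$lt": hour_end}}


# Example: all records created between 14:00 and 15:00 on 1 January 2013.
query = hour_range_filter(datetime(2013, 1, 1, 14))
```

With a driver such as PyMongo, this dictionary would be passed to `find()` on the camera's collection, for example `db["Camera_1_Images"].find(query)`, assuming `db` is an already-connected database handle.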
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
All NoSQL databases are built to scale out on commodity hardware. But from the way you have asked the question, you might be thinking of improving performance by scaling up. You can start with a reasonable machine and, as the load increases, keep adding more servers (scaling out). You do not need to plan for and buy a high-end server.
Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?
MongoDB locks the entire db for a single write (but yields for other operations) and is meant for systems which have more reads than writes. So this depends upon how your system behaves. There are multiple ways of sharding, and the choice should be domain specific. A generic answer is not possible, but examples include sharding by geography, by branch, etc.
Also read A plain english introduction to CAP Theorem
Updated with answer to the comment on sharding
According to their documentation, you should consider deploying a sharded cluster if:
your data set approaches or exceeds the storage capacity of a single node in your system.
the size of your system’s active working set will soon exceed the capacity of the maximum amount of RAM for your system.
your system has a large amount of write activity, a single MongoDB instance cannot write data fast enough to meet demand, and all other approaches have not reduced contention.
So based upon the last point, yes. The auto-sharding feature is built to scale writes. In that case, you have a write lock per shard, not per database. But mine is a theoretical answer. I suggest you consult the 10gen.com group.
to tell if MongoDB is good for holding such data, which eventually
will be queried against time ranges (e.g. retrieve all images of a
particular camera between a specified hour)?
This question is too subjective for me to answer. From personal experience with numerous SQL solutions (ironically not MS SQL) I would say they are both equally good, if done right.
Also:
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
Depends on too many variables that only you know, however a small cluster of commodity hardware works quite well. I cannot really give a factual response to this question and it will come down to your testing.
As for a schema I would go for a document of the structure:
    {
        _id: {},
        camera_name: "my awesome camera",
        images: [
            {
                url: "http://I_like_S3_here.amazons3.com/my_image.png",
                // All your other fields per image
            }
        ]
    }
This should be quite easy to maintain and update so long as you are not embedding much deeper, since then it could become a bit of a pain; however, that depends upon your queries.
Not only that but this should be good for sharding since you have all the data you need in one document, if you were to shard on _id you could probably get the perfect setup here.
Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?
Possibly. Many people assume they need to shard when in reality they just need to be more intelligent in how they design the database. MongoDB is very free form, so there are a lot of ways to do it wrong, but that being said, there are also a lot of ways of doing it right. I personally would keep sharding in mind. Replication can be very useful too.
Are there any benefits of using multiple databases on same machine, so that one database will hold images of current day for all cameras, and the second one will be used to archive previous day images?
Even though MongoDB's write lock is on DB level (currently) I would say: No. The right document structure and the right sharding/replication (if needed) should be able to handle this in one or more document-based collections under a single DB. Not only that, but you can direct writes and reads within a cluster to certain servers so as to spread concurrent load across particular machines in your cluster. I would promote the correct usage of MongoDB's concurrency features over DB separation.
Edit
After reading the question again, I realized my solution overlooked the fact that you are inserting 80k+ image records per camera per day. As such, instead of the embedded option I would actually store one document per image in a collection called images, keep a separate camera collection, and query the two like you would in SQL.
Sharding the images collection on camera_id should be just as easy.
Also make sure you take your working set into consideration when sizing your server.
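The one-document-per-image layout from the edit above might look like the following Python sketch. The field names `FilePath` and `CreationDate` come from the question and `camera_id` from this answer; the helper names and the choice of a compound index on those two fields are assumptions for illustration.

```python
from datetime import datetime

# Compound index specification in the (field, direction) form that
# PyMongo accepts; picking these two fields is an assumption.
IMAGES_INDEX = [("camera_id", 1), ("CreationDate", 1)]


def image_document(camera_id, file_path, created):
    """Build one document for the images collection (one per image)."""
    return {
        "camera_id": camera_id,
        "FilePath": file_path,
        "CreationDate": created,
    }


def camera_range_filter(camera_id, start, end):
    """Filter for one camera's images with start <= CreationDate < end."""
    return {
        "camera_id": camera_id,
        "CreationDate": {"$gte": start, "$lt": end},
    }


# Example document and query for camera 1.
doc = image_document(1, "/cam1/0001.jpg", datetime(2013, 1, 1, 14, 0, 1))
flt = camera_range_filter(1, datetime(2013, 1, 1, 14), datetime(2013, 1, 1, 15))
```

With PyMongo and a connected `db` handle, these would be used as `db.images.create_index(IMAGES_INDEX)`, `db.images.insert_one(doc)` and `db.images.find(flt)`.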
to tell if MongoDB is good for holding such data, which eventually
will be queried against time ranges (e.g. retrieve all images of a
particular camera between a specified hour)? Any suggestions about
Document Based schema design for my case?
MongoDB can do this. For better performance, you can set an index on your time field.
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
I think RAM and Disk would be important.
If you don't want to do sharding to scale out, you should consider a larger size of disk so you can store all your data in it.
Your hot data should fit into RAM. If it does not, consider adding more RAM, because the performance of MongoDB depends mainly on RAM.
Should i consider Sharding/Replication for this scenario (while
considering the performance in writing to synch replica sets)?
I don't know how many cameras you have, but even 1000 inserts/second across a total of 1000 cameras should still be easy for MongoDB. If you are concerned about insert performance, I don't think you need sharding (unless the data grows so large that you have to split it across several machines).
Another consideration is the read frequency of your application. If it is very high, then you can consider sharding or replication here.
And you can use (timestamp + camera_id) as your shard key if you only query one camera within a time range.
Are there any benefits of using multiple databases on same machine, so
that one database will hold images of current day for all cameras, and
the second one will be used to archive previous day images?
You can separate the table into two collections (archive and current), and set an index only on archive if you only query by date on archive. Without the overhead of index maintenance, inserts into the current collection should be faster.
And you can write a daily job to move the current data into archive.
What are good compression-oriented application programming interfaces (APIs)?
Do people still use the 1991 "data compression interface" draft standard and the 1991 "Stream transformation algorithm interface" draft standard (both by Ross Williams)?
Are there any alternatives to those draft standards?
(I'm particularly looking for C APIs, but links to compression-oriented APIs in C++ and other languages would also be appreciated).
I'm experimenting with some data compression algorithms.
Typically the compressed file I'm producing is composed of a series of blocks,
with a block header indicating which compression algorithm needs to be used to decompress the remaining data in that block -- Huffman, LZW, LZP, "stored uncompressed", etc.
The block header also indicates which filter(s) need to be used to convert the intermediate stream or buffer of data from the decompressor into a lossless copy of the original plaintext -- Burrows–Wheeler transform, delta encoding, XML end-tag restoration, "copy unchanged", etc.
Rather than use a huge switch statement that selects based on the "compression type", which calls the selected decompression algorithm or filter algorithm, each procedure with its own special number and order of parameters,
it simplifies my code if every algorithm has exactly the same API -- the same number and order of parameters, etc.
Rather than waiting for the decompressor to run through the entire input stream before handing its output to the first filter,
it would be nice if the API supported decompressed output data coming out of the final filter "relatively quickly" (low latency) after relatively little compressed data has been fed into the initial decompressor.
It would be nice if the API could be used in systems that have only one thread or process.
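To make the low-latency, single-threaded requirement concrete, here is a rough Python sketch that chains a decompressor and a filter as generators, so output appears as soon as each input chunk permits. zlib's incremental decompressor stands in for any decompressor and byte-wise delta decoding stands in for any filter; the function names are made up for this example.

```python
import zlib


def decompress_stream(chunks):
    """Yield decompressed data as soon as each compressed chunk allows."""
    d = zlib.decompressobj()
    for chunk in chunks:
        out = d.decompress(chunk)
        if out:
            yield out
    tail = d.flush()
    if tail:
        yield tail


def undo_delta(chunks):
    """Filter stage: reverse byte-wise delta encoding, chunk by chunk."""
    prev = 0
    for chunk in chunks:
        decoded = bytearray()
        for b in chunk:
            prev = (prev + b) % 256
            decoded.append(prev)
        yield bytes(decoded)


def delta_encode(data):
    """Demo helper: byte-wise delta encoding (inverse of undo_delta)."""
    prev = 0
    out = bytearray()
    for b in data:
        out.append((b - prev) % 256)
        prev = b
    return bytes(out)
```

A consumer iterating over `undo_delta(decompress_stream(pieces))` pulls data through both stages in one thread, without either stage buffering the whole stream.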
Currently I'm kludging together my own internal API,
re-using existing compression algorithm implementations by
writing short wrapper functions to convert between my internal API and the special number and order of parameters used by each implementation.
Is there an already-existing API that I could use rather than designing my own from scratch?
Where can I find such an API?
I fear such an "API" does not exist.
In particular, a requirement such as "starting stage-2 while stage-1 is ongoing and unfinished" is completely implementation dependent and cannot be added later by an API layer.
Btw, Maciej Adamczyk just tried the same as you.
He made an open source benchmark comparing multiple compression algorithms in a block-compression scenario. The code can be consulted here:
http://encode.ru/threads/1371-Filesystem-benchmark?p=26630&viewfull=1#post26630
He had to "encapsulate" all these different compressor interfaces in order to cope with the differences.
Now for the good news: most compressors tend to have a relatively similar C interface when it comes to compressing a block of data.
As an example, they can be as simple as this one:
http://code.google.com/p/lz4/source/browse/trunk/lz4.h
So, in the end, the adaptation layer is not so heavy.
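As a sketch of how light such an adaptation layer can be, the following Python example puts several block codecs behind one bytes-in, bytes-out signature and dispatches on a header byte. The standard-library codecs stand in for the C libraries discussed here, and the block type codes and function names are hypothetical.

```python
import bz2
import lzma
import zlib

# Block type codes are hypothetical; any stable numbering would do.
STORED, DEFLATE, BZIP2, LZMA = 0, 1, 2, 3

# Every entry has the same signature: bytes in, bytes out.
DECOMPRESSORS = {
    STORED: lambda data: data,
    DEFLATE: zlib.decompress,
    BZIP2: bz2.decompress,
    LZMA: lzma.decompress,
}

COMPRESSORS = {
    STORED: lambda data: data,
    DEFLATE: zlib.compress,
    BZIP2: bz2.compress,
    LZMA: lzma.compress,
}


def encode_block(block_type, payload):
    """Prefix the compressed payload with a one-byte block type header."""
    return bytes([block_type]) + COMPRESSORS[block_type](payload)


def decode_block(block):
    """Dispatch on the header byte instead of a large switch statement."""
    return DECOMPRESSORS[block[0]](block[1:])
```

Adding another algorithm then means adding one wrapper per table rather than another branch with its own parameter list.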
Let's imagine a really simple game... We have a labyrinth and two players trying to find the exit in real time over the internet.
On every move the game client should send the player's coordinates to the server and receive the current coordinates of the other client. How is it possible to make this exchange so fast (as all modern games do)?
Ok, we can use memcache or a similar technology to reduce data mining operations on the server side. We can also use the fastest webserver etc., but we will still have problems with timing.
So, the questions are...
What protocol do game clients usually use for exchanging information with the server?
What server technologies are available to solve this problem?
What algorithms are used to combat delays during a game, etc.?
Usually with Network Interpolation and prediction. Gamedev is a good resource: http://www.gamedev.net/reference/list.asp?categoryid=30
Also check out this one: http://developer.valvesoftware.com/wiki/Source_Multiplayer_Networking
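In their simplest form, interpolation and prediction are both linear estimates of a position. The Python sketch below assumes 2D positions as `(x, y)` tuples and constant velocity between updates; the function names are illustrative and are not taken from the linked articles.

```python
def interpolate(p0, t0, p1, t1, t):
    """Linearly interpolate a position between two received snapshots.

    p0 and p1 are (x, y) positions received at times t0 < t1;
    t is the render time, expected to lie between t0 and t1.
    """
    alpha = (t - t0) / (t1 - t0)
    return (p0[0] + (p1[0] - p0[0]) * alpha,
            p0[1] + (p1[1] - p0[1]) * alpha)


def predict(p, velocity, dt):
    """Extrapolate a position dt seconds ahead from its last velocity."""
    return (p[0] + velocity[0] * dt, p[1] + velocity[1] * dt)
```

Interpolation renders other players slightly in the past between two known snapshots, while prediction guesses ahead of the last snapshot and is corrected when the next update arrives.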
use UDP, not TCP
use a custom protocol, usually a single byte defining a "command", and as few subsequent bytes as possible containing the command arguments
prediction is used to make the other players' movements appear smooth without having to get an update for every single frame
hint: prediction is used anyway to smooth the fast screen update (~60fps) since the actual game speed is usually slower (~25fps).
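A compact command packet of the kind described in the list above could be packed like this in Python. The command code, the field layout (one command byte plus two signed 16-bit coordinates) and the function names are assumptions for this sketch, not any particular game's protocol.

```python
import struct

CMD_MOVE = 1  # hypothetical command code

# One command byte followed by two signed 16-bit coordinates,
# in network byte order: 5 bytes in total.
MOVE_PACKET = struct.Struct("!Bhh")


def encode_move(x, y):
    """Pack a player's coordinates into a 5-byte datagram payload."""
    return MOVE_PACKET.pack(CMD_MOVE, x, y)


def decode(packet):
    """Return (command, args) for a received packet."""
    command = packet[0]
    if command == CMD_MOVE:
        _, x, y = MOVE_PACKET.unpack(packet)
        return command, (x, y)
    raise ValueError(f"unknown command {command}")
```

Such a payload could be sent over UDP with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(payload, (host, port))`, where `host` and `port` are placeholders for the game server's address.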
The other answers haven't spelled out a couple of important misconceptions in the original post: these games aren't websites, and they operate quite differently. In particular:
There is no or little "data-mining" that needs to be speeded up. The fastest online games (e.g. first person shooters) typically are not saving anything to disk during a match. Slower online games, such as MMOs, may use a database, primarily for storing player information, but for the most part they hold their player and world data in memory, not on disk.
They don't use webservers. HTTP is a relatively slow protocol, and even TCP alone can be too slow for some games. Instead they have bespoke servers that are written just for that particular game. Often these servers are tuned for low latency rather than throughput, because they typically don't serve up big documents like a web server would, but many tiny messages (e.g. measured in bytes rather than kilobytes).
With those two issues covered, your speed problem largely goes away. You can send a message to a server and get a reply in under 100ms and can do that several times per second.