When handling virtual memory you often use a TLB (I'm asking about software-managed TLBs) to make things faster. Instead of plugging your virtual address into a page table to get the physical address it maps to, you can use a TLB in between to cache recently used translations.
So what you do is take your virtual address and check whether the TLB contains it. If it does, the TLB returns the physical address the virtual address maps to; if it doesn't get a hit, it goes to the page table, loads that translation into the TLB, and returns the physical address.
My question is: How can the TLB be so fast if it has to search for your virtual addresses all the time? The entries seemingly come in no certain order, so going through them would be very slow, I'd assume.
For context, this is the image of a TLB-lookup I have in my head (picture is taken from Wikipedia).
The entries seemingly come in no certain order, so going through them would be very slow, I'd assume.
That's not how things usually work in hardware: "going through" entries isn't really a thing. You can do it that way, but then it's not fast anymore, so it misses the point.
All entries can be compared simultaneously, and the matching entry (if it exists) can be extracted with a log-depth circuit (constant-depth if you allow arbitrary fan-in, but that has its own problems).
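As a rough mental model (not how the hardware is implemented), a software-managed TLB lookup can be sketched like this; the page size, page-table contents, and refill policy are made-up assumptions, and the dictionary stands in for the associative structure that hardware searches in parallel:

```python
# Minimal software model of a TLB lookup (illustrative only; sizes and the
# page-table contents are assumptions).
PAGE_SIZE = 4096

page_table = {0: 7, 1: 3, 2: 9}   # virtual page number -> physical frame number
tlb = {}                          # small cache of recently used translations

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                          # TLB hit: fast path
        return tlb[vpn] * PAGE_SIZE + offset
    pfn = page_table[vpn]                   # TLB miss: walk the page table (slow)
    tlb[vpn] = pfn                          # refill the TLB with this translation
    return pfn * PAGE_SIZE + offset

print(hex(translate(0x1234)))   # miss, then the TLB is refilled
print(hex(translate(0x1238)))   # hit on the second access to the same page
```

The point of the answer above is that real hardware does not loop over the entries the way software might: the virtual page number is effectively compared against every TLB entry at once, so there is no sequential "going through".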
The TLB is just a cache. The idea behind a cache is that it's faster to search a small number of recent references than a larger, more "distant" set of references.
If I understand correctly, your question is really: "why is caching fast?"
Perhaps someone can explain it from an architecture standpoint better than I can.
I just saw "process image" in the description for exec- functions in GNU Libc's manual. Is it the same concept as the address space of a process? Thanks.
I'd say they are related, but not quite the same.
The process image is everything you get from the program executable and load into memory, plus everything that gets added or modified at load time.
The address space is all its virtual addresses, and whatever is in them.
Same stuff - almost - but different viewpoint.
Much of the time, when I'm concerned about address space, I just want to know whether some particular address is valid for that process, and/or whatever addresses are currently available. Maybe I care about details like "this one's currently resident; that one's paged out; this other one will be initialized with zeroes when it's first accessed". But I really don't care about the contents, and I don't care about details like text/data/bss/heap/stack/mmap etc.
When I'm concerned about a process image, I'm pretty much concerned with the stuff that comes from the executable file, and getting it set up right. And if the program's already started running, I care about things like its registers, not just its memory.
A process consists of:
An address space
One or more threads of execution.
That which is sometimes called the process image consists of both.
An exec- function will:
Create a new address space
Wipe out all the threads and create a main thread.
A process image is really the same as a process.
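To make the "wipe out and replace" behaviour concrete, here is a small sketch (in Python rather than C, purely for illustration; the program being exec'd is just an example) of fork followed by exec; after execv succeeds, the child's old process image is gone and only the process itself survives:

```python
import os

# Sketch: fork a child, then replace its process image with another program.
# "/bin/ls" and its arguments are only an example.
pid = os.fork()
if pid == 0:
    # In the child: on success, execv never returns; the old code, data and
    # stack are gone, replaced by the new program's image (same PID, though).
    os.execv("/bin/ls", ["ls", "-l"])
else:
    os.waitpid(pid, 0)  # parent: wait for the child to finish
```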
I have a 64KB EEPROM, organized as 128-byte pages, on my board which talks to an AT Mega 1281. The board also has a SD Card slot and is capable of copying over some configuration files onto the EEPROM (which acts as the internal memory). Due to the nature of the board, only two types of files are needed - one is known as the Circuit Data and the other is Location Data - both are binary files.
Up until now, I had just split the EEPROM into two 32K halves and wrote the Circuit Data in the top half and the Location Data in the bottom half. Both files also have a 25-byte header. I copy the header into the last page of the file's respective half, i.e. the page starting at address 0x7F80 has the Circuit Data file's header and the page starting at 0xFF80 has the other header. The data is always going to be of fixed width, so that makes random access quite easy.
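For illustration, the layout described above could be captured in a few constants and an address helper; the addresses and page size come from the description, while the record size and helper name are hypothetical:

```python
# Sketch of the EEPROM layout described above (record size and helper name
# are assumptions, addresses come from the description).
PAGE_SIZE = 128
HALF_SIZE = 0x8000             # 32 KB per file
CIRCUIT_BASE = 0x0000          # Circuit Data half (header at 0x7F80)
LOCATION_BASE = 0x8000         # Location Data half (header at 0xFF80)
HEADER_SIZE = 25
CIRCUIT_HEADER_ADDR = 0x7F80   # last page of the first half
LOCATION_HEADER_ADDR = 0xFF80  # last page of the second half

RECORD_SIZE = 16               # assumption: fixed-width records

def record_addr(base, index):
    """EEPROM address of fixed-width record `index` within one half."""
    addr = base + index * RECORD_SIZE
    # keep clear of the header page at the end of the half
    assert addr + RECORD_SIZE <= base + HALF_SIZE - PAGE_SIZE
    return addr
```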
My question is: is there a better, simpler way to organize data in an EEPROM? At the moment, I don't even store the length of the data, as it's not really needed. But I'm thinking it might add another layer of safety if I do include that in the header.
Better? It depends. Simpler? Really not. It depends on how strong your "always" is. How much do you believe yourself that the files will always be of fixed length? The fact that you are asking this question probably means some doubts. Keep the KISS principle in mind. Microcontroller development is still an area where unnecessary features are a direct threat to the solution's stability. Having a data length in the header would be useful if you want to make your EEPROM access more generic. But then again, generalization for two files is overkill.
Second thought: rather than introducing file lengths which you actually don't need, I would like to know why you store the file headers at the opposite side of the respective memory chunk. A "header" is, to me, something that needs to be read before the file itself. You could save one transfer of the read address to the EEPROM.
I believe that in any embedded project, the simplest solution is the best. Your way of organizing storage is simple, and it looks like it meets all your requirements.
Any attempt to "improve" or "optimize" this solution will lead to more complicated code and will increase the probability of introducing bugs. So keep all your engineering solutions as simple as possible. If new requirements pop up, you can always find a new, simple solution for them. Don't do any premature optimization.
I'm working on an ASP.net web application that uses SQL as a database back-end. One issue that I have is that it sometimes takes a while to get my DBA to create or modify tables in the database, which under no circumstances am I allowed to modify on my own.
Here is something that I do when I expect users to upload files with their data.
Suppose the user uploads a new record for a table called Student_Records. The user uploads a record with fname Bob and lname Smith. The record is assigned primary key 123. The user also uploads two files: attendance_record.pdf and homework_record.pdf. Let's suppose that I have a network share, \\foo\bar, where the files are saved.
One way of handling this situation would be to have a table Student_Records_Files that associates the uploaded files with key 123. However, since I have trouble getting tables created, I've gone and done something different: when I save the files on the server, I call them 123_attendance_record.pdf and 123_homework_record.pdf. That way, I can easily identify which table record each file is associated with, without having to create a new SQL table. I am, in essence, using the file system itself as a join table (obviously, the file system is a type of database).
In my code for retrieving the files, I scan the directory \\foo\bar and look for files that begin with each primary key number from Student_Records.
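As a rough sketch of that lookup (in Python rather than ASP.NET, just to show the idea; the share path and file names come from the question, the function name is hypothetical):

```python
from pathlib import Path

FILE_SHARE = Path(r"\\foo\bar")   # network share from the question

def files_for_record(record_id):
    """Return the uploaded files whose names start with this primary key."""
    prefix = f"{record_id}_"
    return [p for p in FILE_SHARE.iterdir() if p.name.startswith(prefix)]

# e.g. files_for_record(123) -> [123_attendance_record.pdf, 123_homework_record.pdf]
```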
It seems to work very well, but is it good practice?
There is nothing wrong with using the file system to store files. It's what it is used for.
There are a few things to keep in mind though.
I would consider a better method of storing the files - perhaps a directory for each user, rather than simply appending the user id to the filename.
Ensure that the file store is resilient and backed up with the same regularity as your database. If your database is configured to give you a backup every 10 minutes, but your file store only gets backed up every day (or worse, every week), then you might be in for a world of pain.
Also consider what would happen if the user uploads two documents with the same name.
First of all, I think it's a bad practice, in general, to design your architecture based on how responsive your DBA is. Any given compromise based on this approach may or may not be a big deal, but over time it will result in a poorly designed system.
Second, making the file name this critical seems dangerous to me; there's no protection against a person or application modifying the filename without realizing its importance.
Third, one of the advantages of having a table to maintain the join between the person and the file is that you can add additional data, such as: when was the file uploaded, what is the MIME type, has the file been read by anyone through the system, is this file a newer version of a previous file, etc. etc. Metadata can be very powerful, and the filesystem offers only limited ways to store it.
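For example, such a join table could carry that metadata alongside the link itself. This is only a hypothetical schema sketched with SQLite (the real database, table and column names would be up to you and the DBA):

```python
import sqlite3

# Hypothetical schema for the join/metadata table described above.
conn = sqlite3.connect("example.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS Student_Records_Files (
        id            INTEGER PRIMARY KEY,
        student_id    INTEGER NOT NULL,   -- e.g. 123, FK to Student_Records
        file_path     TEXT    NOT NULL,   -- full path of the stored file on the share
        uploaded_at   TEXT    NOT NULL,   -- when was the file uploaded
        mime_type     TEXT,               -- what kind of file it is
        read_by_user  INTEGER DEFAULT 0,  -- has it been read through the system
        supersedes_id INTEGER             -- newer version of a previous file
    )
""")
conn.commit()
```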
There are really two questions here. One is, given that for administrative reasons you cannot get changes made to the database schema, is it acceptable to devise some workaround. To that I'd have to say yes. What else can you do? In theory, if it takes two weeks to get the DBA to make a schema change for you, then this two weeks should be added to any deadline that you are given. In practice, this almost never happens. I've often worked places where some paperwork or whatever required two weeks before I could even begin work, and then I'd be given two weeks and one day to do the project. Sometimes you just have to put it together with rubber bands and bandaids.
Two is, is it a good idea to build a naming convention into file names and use this to identify files and their relationship to other data. I've done this at times and it's generally worked for me, though I have a perhaps irrational emotional feeling that it's not a good idea.
On the plus side, (a) By building information into a file name, you make it easy for both the computer and a human being to identify file associations. (Human readable as long as the naming convention is straightforward enough, anyway.) (b) By eliminating the separate storage of a link, you eliminate the possibility of a bad link. A file with the appropriate name may not exist, of course, but a database record with appropriate keys may not exist, or the file reference in such a record may be null or invalid. So it seems to solve one problem there without creating any new problems.
Potential minuses are: (a) You may have characters in the key that are not legal in file names. You may be able to just strip such characters out, but that may cause duplicates. The only safe thing to do is to escape them in some way, which is a pain. (b) You may exceed the legal length of a file name. Not as much of an issue as it was in the bad old 8.3 days. (c) You can't share files. If a database record points to a file, then two db records could point to the same file. If you must make two copies of a file, not only does this waste disk space, but it also means that if the file is updated, you must be sure to update all copies. If in your application it would make no sense to share files, then this isn't an issue.
You have to manage the files in some way, but you had to do that anyway.
I really can't think of any overriding minuses. As I say, I've done this on occasion and didn't run into any particular problems. I'm interested in seeing others' responses.
I think it is not good practice, because you are making your working application very dependent on specific implementation details, and it would make the application pretty hard to maintain in the future, or hard for other people who later need to work with your code/API.
Now whether you should do this or not is a whole different question. If you are really taking that much of a performance hit and it is significantly easier to work with how you have it, then I would say go ahead and break the rules. Ideally it's good to follow best practices, but sometimes you have to bend the rules a little to make things work.
First, why is this a table change as opposed to a data change? Once you have the tables set up you should only need to update rows in that table every time that a user adds new files. If you have to put up with this one-time, two-week delay then bite the bullet and just get it done right.
Second, instead of trying to work around the problem why don't you try to fix the problem? Why is the process of implementing table changes so slow? Are you at least able to work on a development database (in which you have control to test and try out these changes)? Even if it's your own laptop you can at least continue on with development. Work with your manager, the DBA, and whoever else you need to, in order to improve the process. Would it help to speed things up if your scripts went through a formal testing process before you handed them off to the DBA so that he doesn't need to test the scripts, etc. himself?
Third, if this is a production database then you should probably build this two-week delay into your development cycle. You know that it takes two weeks for the DBA to review and implement changes in production, so make sure that if you have a deadline for releasing functionality, you have enough lead time for it.
Building this kind of "data" into a filename has inherent problems as others have pointed out. You have no relational integrity guarantees and the "data" can be changed without knowledge of the rest of the application/database.
It's best to keep everything in the database.
Network file I/O is spotty at best. In addition, it's slower than the DB I/O.
If the DBA makes it difficult to get small changes into the database, you may be dealing with:

A political control issue. Maybe he just knows DB stuff and is threatened when he perceives others moving in on his turf. Whatever his reasons, you need to GET WORK DONE. Period. Document all the extra time / communication / work you need to do for each small change and take that up with management. If the first level of management is unwilling to see things your way (it does not matter what their reasons are), escalate the issue to the next level of management. In the past, I've gotten results this way. It was more of a political territory problem than a technical problem. The DBA eventually gave up and gave me full access to the TEST system, BUT he also stipulated that I would need to learn his testing process, naming conventions, DB standards and practices, way of testing, etc. I was game. I would also need to fix any database problems arising from changes I introduced. This was fair, and I got to wear the DBA hat in addition to the developer hat. I got the freedom I needed and he got one less thing to worry about.

A process issue. Maybe the DBA needs to put every small DB change you submit through a gauntlet of testing and performance analysis. Maybe he has a highly normalized DB schema and, because he has the big picture, he needs to normalize or denormalize your requested DB changes to fit into the existing schema. Ask to work with him. Ask him for a full DB design diagram. Get a good sense of his DB design philosophy, and implement your DB changes with that philosophy in mind. Show that you understand that he's trying to keep the DB in good order (understand normalization, relational constraints, check constraints). Give him less to worry about. He needs to trust that you will not muck up his database.

Accumulate all the small changes into a lengthy script and submit that to the DBA. This way, you won't have to wait for each small change to go through all of his process / testing. In addition, you're giving him a bigger-picture view of your development planning (one that is in step with his DB design philosophy) instead of just the play-by-play.
Is there any unique computer ID that distinguishes one computer from another, like fingerprints for humans?
If so, please advise how to get it in VB.NET.
It is possible to put together info to uniquely identify a machine. This has already been done by many software vendors, most notably the Microsoft Activation service does it by sampling various bits of hardware on your system. The problem with this approach is that the identifier is not guaranteed to be persistent.
What I mean by this is:
the chances of another computer coincidentally having the same identifiers are nil
if it becomes public knowledge which identifiers you are using it would be reasonably easy to spoof the identity of a machine
the identifiers can change over time as users change hardware, so the "fingerprint" will also change
For further reference, try these links:
Previous SO question: What's a good way to uniquely identify a computer?
MSDN forums: Uniquely identify computer
Just remember: the more points of reference you use to assemble your identifier, the greater the chance it may change at some point in time.
Try getting the MAC address + CPU ID + motherboard serial. If you concatenate these, then it's your unique fingerprint for that machine until hardware changes occur.
Bear in mind, many people these days are working with dual operating systems (e.g. VirtualBox) and the MAC will be different. Even swapping network connectivity (ethernet hardwire vs. wifi) will change the MAC. I would say the MAC is not a good reference for identification.
Well it can be done in many ways.
You can try to get all kinds of data about the computer, and then hash it into a string that will identify it. For example: the number of drives, the number of processors, the user's name, some keys in the registry.
But you have to make sure all the data you take is data that doesn't usually change.
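A minimal sketch of that idea (in Python rather than VB.NET, and with only a couple of easily obtained identifiers; which identifiers you combine, and how stable they are, is up to you):

```python
import hashlib
import platform
import uuid

def machine_fingerprint():
    """Hash a few machine identifiers into one string (illustrative only)."""
    parts = [
        format(uuid.getnode(), "x"),  # MAC address (can change with NICs/VMs)
        platform.machine(),           # CPU architecture
        platform.node(),              # host name (users can rename the machine)
    ]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

print(machine_fingerprint())
```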
I'm currently writing a web crawler (using the python framework scrapy).
Recently I had to implement a pause/resume system.
The solution I implemented is of the simplest kind and, basically, stores links when they get scheduled, and marks them as 'processed' once they actually are.
Thus, I'm able to fetch those links (obviously there is a little bit more stored than just a URL: the depth value, the domain the link belongs to, etc.) when resuming the spider, and so far everything works well.
Right now, I've just been using a MySQL table to handle those storage actions, mostly for fast prototyping.
Now I'd like to know how I could optimize this, since I believe a database shouldn't be the only option available here. By optimize, I mean using a very simple and lightweight system that can still handle a great amount of data written in a short time.
For now, it should be able to handle the crawling of a few dozen domains, which means storing a few thousand links a second...
Thanks in advance for suggestions
The fastest way of persisting things is typically to just append them to a log -- such a totally sequential access pattern minimizes disk seeks, which are typically the largest part of the time costs for storage. Upon restarting, you re-read the log and rebuild the memory structures that you were also building on the fly as you were appending to the log in the first place.
Your specific application could be further optimized since it doesn't necessarily require 100% reliability -- if you miss writing a few entries due to a sudden crash, ah well, you'll just crawl them again. So, your log file can be buffered and doesn't need to be obsessively fsync'ed.
I imagine the search structure would also fit comfortably in memory (if it's only for a few dozen sites you could probably just keep a set with all their URLs, no need for bloom filters or anything fancy) -- if it didn't, you might have to keep in memory only a set of recent entries, and periodically dump that set to disk (e.g., merging all entries into a Berkeley DB file); but I'm not going into excruciating details about these options since it does not appear you will require them.
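A minimal sketch of that approach (the log file name and record fields are made up; the idea is only append-on-write plus replay-on-restart, with the log buffered rather than fsync'ed on every write):

```python
import json

LOG_PATH = "crawl_state.log"       # hypothetical log file
_log = open(LOG_PATH, "a")         # buffered; not fsync'ed on every write

def record(event, url, **extra):
    """Append one state change (e.g. 'scheduled' or 'processed') as a JSON line."""
    _log.write(json.dumps({"event": event, "url": url, **extra}) + "\n")

def rebuild_state():
    """Replay the log on restart to rebuild the in-memory sets."""
    scheduled, processed = set(), set()
    try:
        with open(LOG_PATH) as log:
            for line in log:
                entry = json.loads(line)
                if entry["event"] == "scheduled":
                    scheduled.add(entry["url"])
                elif entry["event"] == "processed":
                    processed.add(entry["url"])
    except FileNotFoundError:
        pass                       # first run, nothing to replay
    return scheduled - processed, processed
```

Losing the tail of a buffered log after a crash just means a few URLs get crawled again, which matches the "doesn't require 100% reliability" point above.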
There was a talk at PyCon 2009 that you may find interesting, Precise state recovery and restart for data-analysis applications by Bill Gribble.
Another quick way to save your application state may be to use pickle to serialize your application state to disk.
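For example (the state dictionary and file name here are just placeholders):

```python
import pickle

state = {"pending": ["http://example.com/a"], "done": set()}  # placeholder state

# Save on pause ...
with open("crawler_state.pkl", "wb") as f:
    pickle.dump(state, f)

# ... and load on resume.
with open("crawler_state.pkl", "rb") as f:
    state = pickle.load(f)
```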