Halfbakery: Safe Thread Suspension

One of the things that modern programming languages allow is "multi-threaded" applications. I probably need to explain this for any non-computer-nerd who is curious about the title of this Idea.

First, a modern computer Operating System (OS) is usually "multi-tasking", meaning that it can run multiple tasks or applications simultaneously, or at least SEEM to run them simultaneously. The trick is, the OS rapidly switches between the applications, running a little bit of one before switching to the next. Computers are so fast, compared to human perception, that this trick worked nicely even on 1980s-era PCs that had clock speeds of only a few Megahertz (clock speeds are roughly 1000 times faster than that, today). The result is that every task running on the computer appears to be smoothly and seamlessly chugging along, at normal speed. (Too many of them, of course, do lead to noticeable slowdowns.)
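For the curious, here is a toy sketch of that switching trick. It is not how a real OS does it (a real OS interrupts tasks on a timer, while these toy tasks politely hand control back on their own), and the names `task` and `run_round_robin` are made up for this illustration.

```python
from collections import deque

def task(name, steps, log):
    """A toy 'application': does one small unit of work per turn, then yields."""
    for i in range(steps):
        log.append(f"{name}{i}")
        yield  # hand control back to the scheduler

def run_round_robin(tasks):
    """Give each task one turn in rotation until all of them have finished."""
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            next(current)          # run one small slice of this task
            queue.append(current)  # not finished yet: back of the line
        except StopIteration:
            pass                   # finished: drop it from the rotation

log = []
run_round_robin([task("A", 3, log), task("B", 2, log)])
print(log)  # ['A0', 'B0', 'A1', 'B1', 'A2']
```

The log shows the two tasks interleaved, which is what "seeming to run simultaneously" amounts to on a single processor.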

Well, suppose a single application was divided into sections that the Operating System could be told to run "simultaneously"? In this case each section is, in computer lingo, called a "thread"; the OS switches between threads of a single application in exactly the same way that it switches between different tasks.

It can be tricky to write a multi-threaded application. The different threads often need to communicate with each other, and one way is for one thread to save some data and for another thread to read it. Well, the data can get scrambled when both threads try to do that at the same time, and a clumsy attempt to make them take turns can leave each thread waiting forever for the other, which is called a "deadlock". Various ways of dealing with these issues have been devised, having such names as "locks", "semaphores", "critical sections", "signals", and "mutexes".
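As a minimal sketch of one of those tools, here is a "lock" protecting a shared counter, using Python's standard `threading` module. The names `counter`, `counter_lock` and `add_many` are invented for the example.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def add_many(n):
    """Add 1 to the shared counter n times, taking the lock for each update."""
    global counter
    for _ in range(n):
        with counter_lock:  # only one thread at a time may be in here
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

Because every update happens while holding the lock, no two threads can be in the middle of changing the counter at once, so none of the 40000 additions gets lost.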

One other thing that a Master Thread might want to do, occasionally, is to temporarily "suspend" one or more of the other threads of an application. It might want to do this if it is about to do some serious data-crunching that cannot be spread across multiple threads. The Master Thread would get more "processor time" (processing time devoted to each task by the Operating System), if the other threads weren't running. Well, technically, the Master Thread cannot suspend the other threads by itself; it has to send a request to the Operating System, which is responsible for switching between the applications and threads that it is running.

Now we come to the Problem that this Idea exists to solve. Just the other day I encountered a note, in a programming-language function-library, that reads: "About Suspend and Resume. POSIX does not support suspending/resuming a thread. Suspending a thread is considered dangerous since it is not guaranteed where the thread would be suspend. It might be holding a lock, mutex or it might be inside a critical section."
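To make that warning concrete, here is a small simulation. Python offers no way to freeze a thread from the outside, so the "victim" below merely pretends to be frozen by waiting on an event while it holds a lock. All the names (`victim`, `holding`, `thaw`) are made up for the illustration.

```python
import threading

lock = threading.Lock()
holding = threading.Event()  # set once the victim has the lock
thaw = threading.Event()     # set when we "resume" the victim

def victim():
    with lock:
        holding.set()
        thaw.wait()  # simulated suspension at an unlucky moment: lock still held

t = threading.Thread(target=victim)
t.start()
holding.wait()

# Anyone else who needs the lock is now stuck for as long as the victim is frozen.
got_it_while_frozen = lock.acquire(timeout=0.2)
if got_it_while_frozen:
    lock.release()

thaw.set()  # "resume" the victim so that it can finish and release the lock
t.join()

got_it_after_thaw = lock.acquire(timeout=2.0)
if got_it_after_thaw:
    lock.release()

print(got_it_while_frozen, got_it_after_thaw)  # False True
```

If the thread doing the suspending were itself the one that needed the lock, nobody would be left to thaw the victim, and the whole application would hang.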

POSIX is an international standard, a set of behaviors that an Operating System is expected to follow, for interfacing with other software. While created for Unix, parts of the POSIX standard are supported by many other Operating Systems, including, believe it or not, Microsoft Windows (through add-on compatibility layers). (And just because Windows allows thread-suspending/resuming in spite of POSIX, that doesn't mean Windows ignores POSIX everywhere else.)

One peculiarity about multi-threaded programming I need to explain clearly. A good way to think of each thread of an application is to pretend it has a complete copy of all the code, to itself. This is not the truth (only one copy of the program-code exists in the computer's memory, usually), but it is handy to avoid certain mistakes. For example, suppose Thread A wants to Suspend Thread B. Obviously it shouldn't call the Suspend function in its own copy of the application! It should call the Suspend function in the copy that Thread B is using. (Technically, each Thread has a numerical identifier, so the solely-existing Suspend function merely operates on the identifier that it is told to Suspend, no matter what Thread is doing the telling.) Well, let's pretend we have copies of the code, and that Thread A does call B's copy of the Suspend function. Carefully note that it is actually Thread A that is running that code, and not Thread B. This is why, when B is suspended, POSIX says it is dangerous. We don't know what piece of code B was running, when it got Suspended.
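A tiny sketch of that last point: there is one function, and whoever calls it is the one running it. The function name `who_is_running` is made up; `threading.get_ident()` is the standard Python call that returns the identifier of the calling thread.

```python
import threading

def who_is_running():
    """One shared piece of code; it reports whichever thread is executing it."""
    return threading.get_ident()

results = {}

def worker():
    results["worker"] = who_is_running()

t = threading.Thread(target=worker)
t.start()
t.join()

results["main"] = who_is_running()

# Same function, two different answers, depending on who made the call.
print(results["worker"] != results["main"])  # True
```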

The purpose of this Idea is to solve the Problem. We want to guarantee that a thread, when suspended, is always in a safe place. So:

1. Let each Thread have access to a Suspend function in the already-existing way, as explained above (there is actually only one copy of the code being accessed by multiple Threads).

2. Regarding the Resume function, it is accessed like Suspend, except of course a Suspended Thread cannot Resume itself. Some other Thread has to ask the OS to Resume it.

3. Let the Suspend function have two parts. The first part is simply some message-passing code, and nothing else. Any Thread that calls this first part will actually be sending a message *to*the*Thread* that we want to Suspend.

4. The Thread that called Suspend now goes on to do other things. Meanwhile, the Thread that is to be Suspended eventually receives the message, and as a result it calls the second part of the Suspend function. This is the part that asks the Operating System to Suspend the Thread. But this soon-to-be-Suspended Thread does not make that request just yet.

5. A "Resume location" is now required. This is the address of the first program-code-instruction in the computer's memory, that a specified Thread will start to do upon being Resumed, some time after being Suspended. I'll be a little more specific about this in a moment.

6. Inside the second part of the Suspend function, the Thread that is running this code obtains that Resume-location, and now it passes this on to the Operating System, along with its own Thread-identifying code-number, as it makes the request for Suspension. The OS will eventually save the Resume-location and use the identifier to stop switching to that Thread, as requested.

7. Here is the sneaky trick. Ordinarily, after a Thread passes information on to another process (like the OS), it continues on to do many more instructions. POSIX recognizes that even a Thread that requests its own Suspension might be in the middle of doing something else when that Suspension finally occurs. But all we really need is an endless loop! That is, inside the second part of the Suspend function, immediately after the System-call to request Suspension, the ONLY code that the Thread will now access is a simple loop, essentially "JUMP TO THIS JUMP INSTRUCTION". So the Thread will do that and nothing else, until Suspension finally occurs. NOTE: This loop needs to be a particular kind of code, called "re-entrant code". We need it to work, and to NOT deadlock, no matter how many Threads might have been told to Suspend themselves, and are all executing this endless loop until it happens.

8. The Resume-location specified earlier, logically, is the next instruction in the overall computer program, that follows the JUMP loop. This will essentially cause a newly-Resumed Thread to exit the second part of the Suspend function, and allow it to go on to do other things.
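The steps above can be roughly sketched in Python, with some loud caveats. Python has no call for asking the OS to suspend a thread, so in this sketch a blocking `Event.wait()` stands in for both the Suspension request and the JUMP loop, and "the next line after the wait" plays the role of the Resume-location. The class `SuspendableWorker` and all of its method names are hypothetical, invented for this illustration.

```python
import threading
import time

class SuspendableWorker(threading.Thread):
    def __init__(self):
        super().__init__(daemon=True)
        self._suspend_requested = threading.Event()  # the message sent by part one
        self._resume = threading.Event()             # set by whoever resumes us
        self._stop_requested = threading.Event()
        self.parked = threading.Event()              # lets others see we are parked
        self.data_lock = threading.Lock()
        self.work_done = 0

    def request_suspend(self):
        """Part one: callable from any thread; it only delivers a message."""
        self._resume.clear()
        self._suspend_requested.set()

    def resume(self):
        """Called by some other thread, since a parked thread cannot wake itself."""
        self._resume.set()

    def stop(self):
        self._stop_requested.set()
        self.resume()  # make sure a parked worker wakes up and notices the stop

    def _safe_point(self):
        """Part two: only ever run by the worker itself, while holding no lock."""
        if self._suspend_requested.is_set():
            self._suspend_requested.clear()
            self.parked.set()
            self._resume.wait()  # stand-in for the Suspension request + JUMP loop
            self.parked.clear()  # the "Resume-location": execution continues here

    def run(self):
        while not self._stop_requested.is_set():
            with self.data_lock:
                self.work_done += 1
            self._safe_point()   # the lock has been released: a safe place to park
            time.sleep(0.001)

worker = SuspendableWorker()
worker.start()

worker.request_suspend()
parked = worker.parked.wait(timeout=2.0)

# While parked, the worker holds no lock and makes no progress.
lock_free_while_parked = worker.data_lock.acquire(blocking=False)
if lock_free_while_parked:
    worker.data_lock.release()
frozen_count = worker.work_done
time.sleep(0.05)
still_frozen = worker.work_done == frozen_count

worker.resume()
deadline = time.monotonic() + 2.0
while worker.work_done == frozen_count and time.monotonic() < deadline:
    time.sleep(0.01)
progressed = worker.work_done > frozen_count

worker.stop()
worker.join(timeout=2.0)
print(parked, lock_free_while_parked, still_frozen, progressed)
```

The point of the design is that the thread being Suspended chooses the moment itself, so it is never caught holding `data_lock`. The price is that the request is not honored instantly; it waits until the target reaches its next safe point.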