aboutsummaryrefslogtreecommitdiff
path: root/RATIONALE.adoc
diff options
context:
space:
mode:
Diffstat (limited to 'RATIONALE.adoc')
-rw-r--r--RATIONALE.adoc260
1 files changed, 260 insertions, 0 deletions
diff --git a/RATIONALE.adoc b/RATIONALE.adoc
new file mode 100644
index 00000000..9c21ce3f
--- /dev/null
+++ b/RATIONALE.adoc
@@ -0,0 +1,260 @@
+Rational: Or why am I bothering to rewrite nanomsg?
+===================================================
+Garrett D'Amore <garrett@damore.org>
+v0.1, October 18, 2017
+
+
+NOTE: You might want to review
+ http://nanomsg.org/documentation-zeromq.html[Martin Sustrik's rationale]
+ for nanomsg vs. ZeroMQ.
+
+
+Background
+----------
+
+I became involved in the nanomsg community back in 2014, when
+I wrote https://github.com/go-mangos/mangos[mangos] as a pure Golang
+implementation of the wire protocols behind nanomsg. I did that work
+because I was dissatisfied with ZeroMQ's licensing and C++ baggage, and
+I needed something that would work with Golang on Solaris, which at the time
+lacked support for cgo (so I could not just use an FFI binding.) This gave
+me a lot of detail about the internals of nanomsg and the SP protocols.
+At the time, it was the only alternate implementation those protocols.
+
+It would not be wrong to say that one of the goals of mangos was to teach
+me about golang. It was my first non-trivial golang project.
+
+While working with mangos, I wound up implementing a number of additional
+features, such as a TLS transport, the ability to bind to wild card ports,
+and the ability to determine more information about the sender of a message.
+This was incredibly useful in a number of projects.
+
+I initially looked at nanomsg itself, as I wanted to add a TLS transport
+to it, and I needed to make some bug fixes (for protocol bugs for example),
+and so forth.
+
+
+State Machine Madness
+---------------------
+
+What I ran into was a challenging mess of state machines.
+nanomsg has dozens of state machines, many of which feed into others,
+such that tracking flow through the state machines is incredibly painful.
+
+Worse, these state machines are designed to be run from a single worker
+thread. This means that a given socket is entirely single theaded; you
+could in theory have dozens, hundreds, or even thousands of connections
+open, but they would be serviced only by a single thread. (Admittedly
+non-blocking I/O is used to let the OS kernel calls run asynchronously
+perhaps on multiple cores, but nanomsg itself runs all socket code on
+a single worker thread.)
+
+There is another problem too -- the inproc code that moves messages
+between one socket and another was incredibly racy. This is because the
+two sockets have different locks, and so dealing with the different
+contexts was tricky (and consequently buggy). (I've since, I think, fixed
+the worst of the bugs here, but not after many hours of pulling out hair.)
+
+The state machines also make fairly linear flow really difficult to follow.
+For example, there is a state machine to read the header information. This
+may come a byte a time, and the state machine has to add the bytes, check
+for completion, and possibly change state, even if it is just reading a
+single 32-bit word. This is a lot more complex than most programmers are
+used to, such as `read(fd, &val, 4)`.
+
+Now to be fair, Martin Sustrik had the best intentions when he created the
+state machine model around which nanomsg is built. I do think that from
+experience this is one of the most dense and unapproachable parts of nanomsg,
+in spite of the fact that Martin's goal was precisely the opposite. I
+consider this a "failed experiment" -- but hey failed experiments are the
+basis of all great science.
+
+
+Thread Challenges
+-----------------
+
+While nanomsg is mostly internally single threaded, I decided to try to
+emulate the simple architecture of mangos using system threads. (mangos
+has the benefit of goroutines.) Having been well and truly spoiled by
+Solaris/illumos threading (and especially Solaris/illumos kernel threads),
+I thought this would be a reasonable architecture.
+
+Sadly, this initial effort, while it worked, scaled incredibly poorly --
+even so-called "modern" operating systems like macOS 10.12 and Windows 8.1
+simply melted or failed entirely when creating any non-trivial number of
+threads. (To me, creating 100 threads should be a no-brainer, especially if
+one limits the stack size appropriately. I'm used to be able to create
+thousands of threads without concern. As I said, I've been spoiled.
+If your system falls over at a mere 200 threads I consider it a toy
+implementation of threading. Unfortunately most of the mainstream operating
+systems are therefore toy implementations.)
+
+Chalk up another failed experiment.
+
+I did find another approach which is discussed further.
+
+
+File Descriptor Driven
+----------------------
+
+Most of the underlying I/O in nanomsg is built around file descriptors,
+and it's internal usock structure, which is also state machine driven.
+This means that implementing new transports which might need something
+other than a file descriptor, is really non-trivial. This stymied my
+first attempt to add OpenSSL support to get TLS added -- OpenSSL has
+it's own struct BIO for this stuff, and I could not see an easy way to
+convert nanomsg's usock stuff to accomodate the struct BIO.
+
+
+Poll
+----
+
+In order to support use in event driven programming, asynchronous
+situations, etc. nanomsg offers non-blocking I/O. In order to make
+this work for end-users, a notification mechanism is required, and
+nanomsg, in the spirit of following POSIX, offers a notification method
+based on poll(2) or select(2).
+
+In order for this to work, it offers up a selectable file descriptor
+for send and another one for receive. When events occur, these are
+written to, and the user application "clears" these by reading from
+them. (This is done on behalf of the application by nanomsg's API calls.)
+
+This means that in addition to the context switch code, there are not
+fewer than 2 extra system calls executed per message sent or received, and
+on a mostly idle system as many as 3. This means that to send a message
+from one process to another you may have to execute up to 6 extra system
+calls, beyond the 2 required to actually send and receive the message.
+
+There are cases where this file descriptor logic is easier for existing
+applications to integrate into event loops (e.g. they already have a thread
+blocked in poll().)
+
+But for many cases this is not necessary. A simple callback mechanism
+would be far better, with the FDs available only as an option for code
+that needs them. This is the approach that we have taken with nng.
+
+As another consequence of our approach, we do not require file descriptors
+for sockets at all, so it is possible to create applications containing
+many thousands of inproc sockets with no files open at all. (Obviously
+if you're going to perform real I/O to other processes or other systems,
+you're going to need to have the underlying transport file descriptors
+open.)
+
+
+POSIX APIs
+----------
+
+Another of Martin's goals, which seems worthwhile at first, was the
+attempt to provide a familiar BSD POSIX API. As a C programmer, this
+really attracted me.
+
+The problem is that the POSIX APIs are actually really horrible. In
+particular the semantics around cmsg are about as arcane and painful as
+one can imagine. Largely, this has meant that extensions to the cmsg
+API simply have not occurred in nanomsg.
+
+The cmsg API specified by POSIX is as bad as it is because POSIX had
+requirements not to break APIs that already existed, and they needed to
+shim something that would work with existing implementations, including
+getting across a system call boundary. nanomsg has never had such
+constraints.
+
+Attempting to retain low numbered "socket descriptors" had its own
+problems -- a huge source of use-after-close bugs, which made the
+use of nn_close() incredibly dangerous for multithreaded sockets.
+(If one thread closes and opens a new socket, other threads still using
+the old socket might wind up accessing the "new" socket without realizing
+it.)
+
+The other thing is that BSD socket APIs are super familiar to UNIX C
+programmers -- but experience with nanomsg has taught us already that these
+are actually in the minority of nanomsg's users. Most of our users are
+coming to us from C++ (object oriented), Java, and Python backgrounds.
+For them the BSD sockets API is frankly somewhat bizarre and alien.
+
+With nng, we realized that constraining ourselves to the mistakes of the
+POSIX API was hurting rather than helping. So nng provides a much friendlier
+interface for getting properties associated with messages.
+
+In nng we also generally try hard to avoid reusing
+an identifier until no other option exists. This generally means most
+applications won't see socket reuse until billions of other sockets
+have been opened.
+
+
+Compatibility
+-------------
+
+Of course, there are a number of existing nanomsg consumers "in the wild"
+already. It is important to continue to support them. So I decided from
+the get go to implement a "compatibility" layer, that provides the same
+API, and as much as possible the same ABI, as legacy nanomsg. However,
+new features and capabilities would not necessarily be exposed to the
+the legacy API.
+
+Today nng offers this. You can relink an existing nanomsg binary against
+nng instead of nanomsg, and it usually Just Works. Source compatibility
+is almost as easy, although the application code needs to be modified
+to use different header files. (Note: I am considering changing this
+in the future.)
+
+
+Asynchronous IO
+---------------
+
+As a consequence of our experience with threads being so unscalable,
+we decided to create a new underlying abstraction modeled largely on
+Windows IO completion ports. (As bad as so many of the Windows APIs
+are, the IO completion port stuff is actually pretty nice.) Under the
+hood in nng all I/O is asynchronous, and we have `struct nni_aio` objects
+for each pending I/O. These have an associated completion routine.
+
+The completion routines are _usually_ run on a separate worker thread
+(we have many such workers; in theory the number should be tuned to the
+available number of CPU cores to ensure that we never wait while a CPU
+core is available for work), but they can be run "synchronously" if
+the I/O provider knows it is safe to do so (for example the completion
+is occuring in a context where no locks are held.)
+
+There is still a lot of performance tuning work to do, and there is a
+real need to expose the AIO structures to user code, which is something
+that I will be doing in the coming days. This should enable a much
+faster and lighter weight I/O model, especially for event driven programs.
+
+
+New Transports (ZeroTier)
+-------------------------
+
+The other, most critical, motivation behind nng was to enable an easier
+creation of new transports. In particular, one client (Capitar IT Group BV)
+contracted the creation of a ZeroTier transport for nanomsg.
+
+After beating my head against the state machines some more, I finally asked
+myself if it would not be easier just to rewrite nanomsg using the model
+I had created for mangos.
+
+In retrospect, I'm not sure that the answer was a clear and definite yes
+in favor of nng, but for the other things I want to do, it has enabled a
+lot of new work. The ZeroTier transport was created with a relatively
+modest amount of effort, in spite of being based upon a connectionless
+transport. I do not believe I could have done this easily in the existing
+nanomsg.
+
+In the coming weeks I expect to create transports for TLS, websocket,
+and websocket over TLS. I expect the websocket transport to support
+multiple sockets with different path components bound to the same TCP port.
+This might be possible with nanomsg, but it wouldn't be easy. With nng,
+it looks fairly straight-forward.
+
+
+Towards nanomsg 2.0
+-------------------
+
+It is my intention that nng ultimately replace nanomsg. I do think of it
+as "nanomsg 2.0". In fact "nng" stands for "nanomsg next generation" in
+my mind. Some day before too long I'm hoping that the various website
+references to nanomsg my simply be updated to point at nng. It is not
+clear to me whether at that time I will simply rename the existing
+code to nanomsg, nanomsg2, or leave it as nng. (One possible issue is
+that Martin Sustrik retains the trademark for "nanomsg".)