RATIONALE.adoc


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260

Rational: Or why am I bothering to rewrite nanomsg?
===================================================
Garrett D'Amore <garrett@damore.org>
v0.1, October 18, 2017


NOTE: You might want to review
      http://nanomsg.org/documentation-zeromq.html[Martin Sustrik's rationale]
      for nanomsg vs. ZeroMQ.


Background
----------

I became involved in the nanomsg community back in 2014, when
I wrote https://github.com/go-mangos/mangos[mangos] as a pure Golang
implementation of the wire protocols behind nanomsg.  I did that work
because I was dissatisfied with ZeroMQ's licensing and C++ baggage, and
I needed something that would work with Golang on Solaris, which at the time
lacked support for cgo (so I could not just use an FFI binding.)  This gave
me a lot of detail about the internals of nanomsg and the SP protocols.
At the time, it was the only alternate implementation those protocols.

It would not be wrong to say that one of the goals of mangos was to teach
me about golang.  It was my first non-trivial golang project.

While working with mangos, I wound up implementing a number of additional
features, such as a TLS transport, the ability to bind to wild card ports,
and the ability to determine more information about the sender of a message.
This was incredibly useful in a number of projects.

I initially looked at nanomsg itself, as I wanted to add a TLS transport
to it, and I needed to make some bug fixes (for protocol bugs for example),
and so forth.


State Machine Madness
---------------------

What I ran into was a challenging mess of state machines.
nanomsg has dozens of state machines, many of which feed into others,
such that tracking flow through the state machines is incredibly painful.

Worse, these state machines are designed to be run from a single worker
thread.  This means that a given socket is entirely single theaded; you
could in theory have dozens, hundreds, or even thousands of connections
open, but they would be serviced only by a single thread.  (Admittedly
non-blocking I/O is used to let the OS kernel calls run asynchronously
perhaps on multiple cores, but nanomsg itself runs all socket code on
a single worker thread.)

There is another problem too -- the inproc code that moves messages
between one socket and another was incredibly racy.  This is because the
two sockets have different locks, and so dealing with the different
contexts was tricky (and consequently buggy).  (I've since, I think, fixed
the worst of the bugs here, but not after many hours of pulling out hair.)

The state machines also make fairly linear flow really difficult to follow.
For example, there is a state machine to read the header information.  This
may come a byte a time, and the state machine has to add the bytes, check
for completion, and possibly change state, even if it is just reading a
single 32-bit word.  This is a lot more complex than most programmers are
used to, such as `read(fd, &val, 4)`.

Now to be fair, Martin Sustrik had the best intentions when he created the
state machine model around which nanomsg is built.  I do think that from
experience this is one of the most dense and unapproachable parts of nanomsg,
in spite of the fact that Martin's goal was precisely the opposite.  I
consider this a "failed experiment" -- but hey failed experiments are the
basis of all great science.


Thread Challenges
-----------------

While nanomsg is mostly internally single threaded, I decided to try to
emulate the simple architecture of mangos using system threads.  (mangos
has the benefit of goroutines.)  Having been well and truly spoiled by
Solaris/illumos threading (and especially Solaris/illumos kernel threads),
I thought this would be a reasonable architecture.

Sadly, this initial effort, while it worked, scaled incredibly poorly --
even so-called "modern" operating systems like macOS 10.12 and Windows 8.1
simply melted or failed entirely when creating any non-trivial number of
threads.  (To me, creating 100 threads should be a no-brainer, especially if
one limits the stack size appropriately.  I'm used to be able to create
thousands of threads without concern.  As I said, I've been spoiled.
If your system falls over at a mere 200 threads I consider it a toy
implementation of threading. Unfortunately most of the mainstream operating
systems are therefore toy implementations.)

Chalk up another failed experiment.

I did find another approach which is discussed further.


File Descriptor Driven
----------------------

Most of the underlying I/O in nanomsg is built around file descriptors,
and it's internal usock structure, which is also state machine driven.
This means that implementing new transports which might need something
other than a file descriptor, is really non-trivial.  This stymied my
first attempt to add OpenSSL support to get TLS added -- OpenSSL has
it's own struct BIO for this stuff, and I could not see an easy way to
convert nanomsg's usock stuff to accomodate the struct BIO.


Poll
----

In order to support use in event driven programming, asynchronous
situations, etc. nanomsg offers non-blocking I/O.  In order to make
this work for end-users, a notification mechanism is required, and
nanomsg, in the spirit of following POSIX, offers a notification method
based on poll(2) or select(2).

In order for this to work, it offers up a selectable file descriptor
for send and another one for receive.  When events occur, these are
written to, and the user application "clears" these by reading from
them.  (This is done on behalf of the application by nanomsg's API calls.)

This means that in addition to the context switch code, there are not
fewer than 2 extra system calls executed per message sent or received, and
on a mostly idle system as many as 3.  This means that to send a message
from one process to another you may have to execute up to 6 extra system
calls, beyond the 2 required to actually send and receive the message.

There are cases where this file descriptor logic is easier for existing
applications to integrate into event loops (e.g. they already have a thread
blocked in poll().)

But for many cases this is not necessary.  A simple callback mechanism
would be far better, with the FDs available only as an option for code
that needs them.  This is the approach that we have taken with nng.

As another consequence of our approach, we do not require file descriptors
for sockets at all, so it is possible to create applications containing
many thousands of inproc sockets with no files open at all.  (Obviously
if you're going to perform real I/O to other processes or other systems,
you're going to need to have the underlying transport file descriptors
open.)


POSIX APIs
----------

Another of Martin's goals, which seems worthwhile at first, was the
attempt to provide a familiar BSD POSIX API.  As a C programmer, this
really attracted me.

The problem is that the POSIX APIs are actually really horrible.  In
particular the semantics around cmsg are about as arcane and painful as
one can imagine.  Largely, this has meant that extensions to the cmsg
API simply have not occurred in nanomsg.

The cmsg API specified by POSIX is as bad as it is because POSIX had
requirements not to break APIs that already existed, and they needed to
shim something that would work with existing implementations, including
getting across a system call boundary.  nanomsg  has never had such
constraints.

Attempting to retain low numbered "socket descriptors" had its own
problems -- a huge source of use-after-close bugs, which made the
use of nn_close() incredibly dangerous for multithreaded sockets.
(If one thread closes and opens a new socket, other threads still using
the old socket might wind up accessing the "new" socket without realizing
it.)

The other thing is that BSD socket APIs are super familiar to UNIX C
programmers -- but experience with nanomsg has taught us already that these
are actually in the minority of nanomsg's users.  Most of our users are
coming to us from C++ (object oriented), Java, and Python backgrounds.
For them the BSD sockets API is frankly somewhat bizarre and alien.

With nng, we realized that constraining ourselves to the mistakes of the
POSIX API was hurting rather than helping. So nng provides a much friendlier
interface for getting properties associated with messages.

In nng we also generally try hard to avoid reusing
an identifier until no other option exists.  This generally means most
applications won't see socket reuse until billions of other sockets
have been opened.


Compatibility
-------------

Of course, there are a number of existing nanomsg consumers "in the wild"
already.  It is important to continue to support them.  So I decided from
the get go to implement a "compatibility" layer, that provides the same
API, and as much as possible the same ABI, as legacy nanomsg.  However,
new features and capabilities would not necessarily be exposed to the
the legacy API.

Today nng offers this.  You can relink an existing nanomsg binary against
nng instead of nanomsg, and it usually Just Works.  Source compatibility
is almost as easy, although the application code needs to be modified
to use different header files.  (Note: I am considering changing this
in the future.)


Asynchronous IO
---------------

As a consequence of our experience with threads being so unscalable,
we decided to create a new underlying abstraction modeled largely on
Windows IO completion ports.  (As bad as so many of the Windows APIs
are, the IO completion port stuff is actually pretty nice.)  Under the
hood in nng all I/O is asynchronous, and we have `struct nni_aio` objects
for each pending I/O.  These have an associated completion routine.

The completion routines are _usually_ run on a separate worker thread
(we have many such workers; in theory the number should be tuned to the
available number of CPU cores to ensure that we never wait while a CPU
core is available for work), but they can be run "synchronously" if
the I/O provider knows it is safe to do so (for example the completion
is occuring in a context where no locks are held.)

There is still a lot of performance tuning work to do, and there is a
real need to expose the AIO structures to user code, which is something
that I will be doing in the coming days.  This should enable a much 
faster and lighter weight I/O model, especially for event driven programs.


New Transports (ZeroTier)
-------------------------

The other, most critical, motivation behind nng was to enable an easier
creation of new transports.  In particular, one client (Capitar IT Group BV)
contracted the creation of a ZeroTier transport for nanomsg.

After beating my head against the state machines some more, I finally asked
myself if it would not be easier just to rewrite nanomsg using the model
I had created for mangos.

In retrospect, I'm not sure that the answer was a clear and definite yes
in favor of nng, but for the other things I want to do, it has enabled a
lot of new work.  The ZeroTier transport was created with a relatively
modest amount of effort, in spite of being based upon a connectionless
transport.  I do not believe I could have done this easily in the existing
nanomsg.

In the coming weeks I expect to create transports for TLS, websocket,
and websocket over TLS.  I expect the websocket transport to support
multiple sockets with different path components bound to the same TCP port.
This might be possible with nanomsg, but it wouldn't be easy.  With nng,
it looks fairly straight-forward.


Towards nanomsg 2.0
-------------------

It is my intention that nng ultimately replace nanomsg.  I do think of it
as "nanomsg 2.0".  In fact "nng" stands for "nanomsg next generation" in
my mind.  Some day before too long I'm hoping that the various website
references to nanomsg my simply be updated to point at nng.  It is not
clear to me whether at that time I will simply rename the existing
code to nanomsg, nanomsg2, or leave it as nng.  (One possible issue is
that Martin Sustrik retains the trademark for "nanomsg".)