TCP, the workhorse of the internet

2025-11-15 · cefboud.com

An exploration of TCP, the workhorse of the internet. This deep dive includes detailed examples and a step-by-step walkthrough.

The internet is incredible. It’s nearly impossible to keep people away from it. But it can also be unreliable: packets drop, links congest, bits mangle, and data corrupts. Oh, it’s dangerous out there! (I’m writing this in Kramer’s tone)

So how is it possible that our apps just work? If you’ve networked your app before, you know the drill: socket()/bind() here, accept() there, maybe a connect() over there, and it just works. Reliable, orderly, uncorrupted data flows to and fro.

Websites (HTTP), email (SMTP) or remote access (SSH) are all built on top of TCP and just work.

Why TCP

Why do we need TCP? Why can’t we just use the layer below, IP?

Remember, the network stack goes: Physical –> Data Link (Ethernet/Wi-Fi, etc) –> Network (IP) –> Transport (TCP/UDP).

IP (Layer 3) operates at the host level, while the transport layer (TCP/UDP) works at the application level using ports. IP can deliver packets to the correct host via its IP address, but once the data reaches the machine, it still needs to be handed off to the correct process. Each process “binds” to a port: its address within the machine. A common analogy is: the IP address is the building, and the port is the apartment. Processes or apps live in those apartments.

Another reason we need TCP is that if a router (a piece of infra your average user does not control) drops packets or becomes overloaded, TCP at the edges (on the users’ machines) can recover without requiring routers to participate. The routers stay simple, the reliability happens at the endpoints.

Packets get lost, corrupted, duplicated, and reordered. That’s just how the internet works. TCP shields developers from these issues. It handles retransmission, checksums, and a gazillion other reliability mechanisms. If every developer had to implement those themselves, they’d never have time to properly align their flexboxes, a truly horrendous alternate universe.

Jokes aside, the guarantee that data sent and received over a socket isn’t corrupted, duplicated, or out of order, despite the underlying network being unreliable, is exactly why TCP is awesome.

Flow and Congestion Control

When you step back and think about network communication, here’s what we’re really trying to do: machine A sends data to machine B. Machine B has a finite amount of space and must store the incoming data somewhere before passing it to the application, which might be asleep or busy. This temporary storage is called the receive buffer and is managed by the kernel:

sysctl net.ipv4.tcp_rmem => net.ipv4.tcp_rmem = 4096 131072 6291456: a min of 4 KB, a default of 128 KB, and a max of 6 MB.

The problem is that space is finite. If you’re transferring a large file (hundreds of MBs or even GBs), you could easily overwhelm the destination. The receiver therefore needs a way to tell the sender how much more data it can handle. This mechanism is called flow control, and TCP segments include a field called the window, which specifies how much data the receiver is currently willing to accept.

Another issue is overwhelming the network itself, even if the receiving machine has plenty of buffer space. You’re only as strong as your weakest link: some links carry gigabits, others only megabits. If you don’t tune for the slowest link, congestion is inevitable.

Fun fact: in 1986, the Internet’s bandwidth dropped from a few dozen KB/s to as low as 40 bps (yes, bits per second! yes, those numbers are wild!), in what became known as congestion collapse. When packets were lost and systems retried sending them, they made congestion even worse: a doom loop. To fix this, TCP incorporated ‘play nice’ and ‘back off’ behaviors known as congestion control, which help prevent the Internet from clogging itself to death.

Some Code: A Plain TCP Server

With all low-level things like TCP, C examples are the way to go. Just show it like it is.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <signal.h>

int sockfd = -1, clientfd = -1;
void handle_sigint(int sig) {
    printf("\nCtrl+C caught, shutting down...\n");
    if (clientfd != -1) close(clientfd);
    if (sockfd != -1) close(sockfd);
    exit(0);
}

int main() {
    signal(SIGINT, handle_sigint);
    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    int opt = 1;
    // SO_REUSEADDR to force bind to the port even if an older socket
    // is still terminating (TIME_WAIT)
    setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(8080),
        .sin_addr.s_addr = INADDR_ANY
    };
    bind(sockfd, (struct sockaddr*)&addr, sizeof(addr));
    listen(sockfd, 5);
    printf("Listening on 8080...\n");
    clientfd = accept(sockfd, NULL, NULL);
    char buf[1024], out[2048];
    int n;
    while ((n = recv(clientfd, buf, sizeof(buf) - 1, 0)) > 0) {
        buf[n] = '\0';
        int m = snprintf(out, sizeof(out), "you sent: %s", buf);
        printf("response %s %d\n", out, m);
        send(clientfd, out, m, 0);
    }
    close(clientfd);
    close(sockfd);
}

This creates a TCP server that echoes back whatever the client sends, prefixed with ‘you sent:’.

# compile and run server
gcc -o server server.c && ./server
# connect client
telnet 127.0.0.1 8080
# hi
# you sent: hi

127.0.0.1 (localhost) could be replaced with a remote IP and it should work all the same.

The primitives/functions we used follow the Berkeley sockets way of doing things (released with BSD 4.2):

  • SOCKET: create an endpoint (structure in the kernel).
  • BIND: associate to a port.
  • LISTEN: get ready to accept connections and specify a queue size for pending connections (beyond that size, drop!)
  • ACCEPT: accept an incoming connection (TCP Server)
  • CONNECT: attempt connection (TCP client)
  • SEND: send data
  • RECEIVE: receive data
  • CLOSE: release the connection

In the example above, we’re using client/server dynamics in a request/response pattern. But I can add the following after send:

send(clientfd, out, m, 0);
sleep(5);
const char *msg = "not a response, just doing my thing\n";
send(clientfd, msg, strlen(msg), 0);

Compile, run, and telnet:

client here
you sent: client here
client again
not a response, just doing my thing
you sent: client again

I typed in the telnet terminal: client here, then client again. I only got you sent: client here, then the server was sleeping. My second line, client again, was patiently waiting in the receive buffer. The server sent not a response, just doing my thing, then picked up my second TCP packet and replied with you sent: client again.

This is very much a duplex bidirectional link. Each side sends what it wishes, it just happens that at the beginning, one listens and the other connects. The dynamics afterwards don’t have to follow a request/response pattern.

Catfishing Curl: A Dead Simple HTTP Server

Let’s create a very simple HTTP/1.1 server (later versions are trickier).

// same as before
printf("Listening on 8080...\n");
int i = 1;
while (1) {
    clientfd = accept(sockfd, NULL, NULL);
    char buf[1024], out[2048];
    int n;
    while ((n = recv(clientfd, buf, sizeof(buf) - 1, 0)) > 0) {
        buf[n] = '\0';
        int body_len = snprintf(out, sizeof(out),
                                "[%d] Yo, I am a legit web server\n", i++);
        char header[256];
        int header_len = snprintf(
            header, sizeof(header),
            "HTTP/1.1 200 OK\r\n"
            "Content-Type: text/plain\r\n"
            "Content-Length: %d\r\n"
            "Connection: close\r\n"
            "\r\n",
            body_len
        );
        printf("header: %s\n", header);
        printf("out: %s\n", out);
        send(clientfd, header, header_len, 0);
        send(clientfd, out, body_len, 0);
        break; // one request per connection
    }
    close(clientfd);
}
~ curl localhost:8080
[1] Yo, I am a legit web server
~ curl localhost:8080
[2] Yo, I am a legit web server

We’re using i to keep count of requests. We’re establishing a TCP connection and returning the HTTP headers expected by the HTTP client (the TCP peer, really). A real HTTP server would return proper HTML, CSS, and JS, and handle a whole lot of other options and headers. But underneath, it’s simply a process making use of our reliable, dependable TCP.

The Actual Bytes

  0                   <----- 32 bits ------>                     
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |        Source Port              |     Destination Port        |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                        Sequence Number                        |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                    Acknowledgment Number                      |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | Header|Rese-|   Flags   |       Window Size                   |
 | Len   |rved |           |                                     |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |       Checksum                  |     Urgent Pointer          |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                    Options (if any)                           |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                    Data (Payload)                             |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Each TCP segment has the header above, and each TCP segment is carried inside an IP packet. We have source and destination ports, each 16 bits, and that’s where the 64K port limit comes from!

Each transport-layer connection is identified by a 5-tuple: (protocol (TCP/UDP), src IP, src port, dst IP, dst port).

Sequence and Acknowledgment Numbers

TCP reliability depends on two key fields: the Sequence number, indicating which bytes a segment carries, and the Acknowledgment number, indicating which bytes have been received. Sequence numbers let the receiver interpret data order, detect and reorder out-of-order segments, and identify losses. TCP uses cumulative acknowledgments—an ACK of 100 means bytes 0-99 were received. If bytes 100-120 are lost but later bytes arrive, the ACK remains 100 until the missing data is received.

1. A --> B: Send [Seq=0-99]
2. B --> A: Send [Seq=0-49]

3. B --> A: Receives A's [0-99] --> sends ACK=100
4. A --> B: Receives B's [0-49] --> sends ACK=50

5. A --> B: Send [Seq=100-199]   --- lost ---
6. B --> A: Send [Seq=50-99]     --- lost ---

7. A --> B: Send [Seq=200-299]
   B receives --> notices gap (100-199 missing) --> sends ACK=100

8. B --> A: Send [Seq=100-149]
   A receives --> notices gap (50-99 missing) --> sends ACK=50

9. A --> B: Send [Seq=300-399]
   B still missing 100-199 --> sends ACK=100

10. B --> A: Send [Seq=150-199]
    A still missing 50-99 --> sends ACK=50

11. A --> B: Retransmit [Seq=100-199]
    B receives --> now has 0-399 --> sends ACK=400

12. B --> A: Retransmit [Seq=50-99]
    A receives --> now has 0-199 --> sends ACK=200

Header Length shows how many 4-byte words are in the header, needed because the Options field is variable length, and thus so is the header.

TCP Flags

Next are 8 flags (1 bit each). A few important ones:

SYN: used to establish a connection. ACK: indicates the Acknowledgment number is valid.

These two flags are central to connection setup. Why establish a connection? To detect out-of-order or duplicate segments, you must track what has been sent and received, i.e., maintain state: a connection.

SYN and ACK participate in the famous 3-way handshake:

  • A –> B: SYN (I want to connect)
  • B –> A: SYN + ACK (I got your SYN, I want to connect too!)
  • A –> B: ACK (got it, connection established!)

The FIN flag signals teardown and also uses a handshake:

  • X –> Y: FIN (I want to disconnect)
  • Y –> X: ACK (got your FIN, whatever!)
  • Y –> X: FIN (I want to disconnect too - sometimes sent with the previous ACK)
  • X –> Y: ACK (got it!)

This is normally a 4-way (sometimes 3-way) goodbye handshake.

RST is the reset flag. It indicates an error or forced shutdown — drop the connection immediately. An OS sends RST if no process is listening or if the listening process crashed. There’s also a known TCP reset attack where intermediaries inject RST to terminate connections (used by some firewalls).

Window

We talked about this field in flow control. As mentioned above, it indicates how many bytes the receiver is willing to accept beyond the acknowledged number.

With the example above, running ss (Socket Statistics) provides info about the TCP connection.

ss -tlpmi
// State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process
// LISTEN 0      5      0.0.0.0:http-alt   0.0.0.0:*         users:(("server",pid=1113,fd=3))
// skmem:(r0,rb131072,t0,tb16384,f0,w0,o0,bl0,d0) cubic cwnd:10

rb131072 (128KB) is the receive buffer size, while tb16384 (16KB) is the transmit buffer size, where data waits before being sent over the network. Send-Q indicates bytes not yet acknowledged by the remote host, and Recv-Q shows bytes received but not yet read by the application (e.g., the second line typed in the telnet session above, waiting while the server was sleeping).

Checksum

The checksum is used for reliability. The 16-bit words of the TCP segment (together with a pseudo-header containing the source and destination IP addresses) are summed in one’s-complement arithmetic, and the result is compared against the checksum field. If they don’t match, some bits were likely corrupted, and retransmission is needed.

Conclusion

It always amazes me how all this works. The network, the internet. Reliably and continuously. Just a few decades ago, sending a few KB was quite the feat. And today, streaming 4k is banal. God bless all those hardworking people that made and make it all possible!



Comments

  • By gsliepen 2025-11-15 8:50

    If you start with the problem of how to create a reliable stream of data on top of an unreliable datagram layer, then the solution that comes out will look virtually identical to TCP. It just is the right solution for the job.

    The three drawbacks of the original TCP algorithm were the window size (the maximum value is just too small for today's speeds), poor handling of missing packets (addressed by extensions such as selective-ACK), and the fact that it only manages one stream at a time, and some applications want multiple streams that don't block each other. You could use multiple TCP connections, but that adds its own overhead, so SCTP and QUIC were designed to address those issues.

    The congestion control algorithm is not part of the on-the-wire protocol, it's just some code on each side of the connection that decides when to (re)send packets to make the best use of the available bandwidth. Anything that implements a reliable stream on top of datagrams needs to implement such an algorithm. The original ones (Reno, Vegas, etc) were very simple but already did a good job, although back then network equipment didn't have large buffers. A lot of research is going into making better algorithms that handle large buffers, large roundtrip times, varying bandwidth needs and also being fair when multiple connections share the same bandwidth.

    • By rkagerer 2025-11-15 13:00

      it only manages one stream at a time

      I'll take flak for saying it, but I feel web developers are partially at fault for laziness on this one. I've often seen them trigger a swath of connections (e.g. for uncoordinated async events), when carefully managed multiplexing over one or a handful will do just fine.

      Eg. In prehistoric times I wrote a JavaScript library that let you queue up several downloads over one stream, with control over prioritization and cancelability.

      It was used in a GreaseMonkey script on a popular dating website, to fetch thumbnails and other details of all your matches in the background. Hovering over a match would bring up all their photos, and if some hadn't been retrieved yet they'd immediately move to the top of the queue. I intentionally wanted to limit the number of connections, to avoid oversaturating the server or the user's bandwidth. Idle time was used to prefetch all matches on the page (IIRC in a sensible order responsive to your scroll location). If you picked a large enough pagination, then stepped away to top up your coffee, by the time you got back you could browse through all of your recent matches instantly, without waiting for any server roundtrip lag.

      It was pretty slick. I realize these days modern stacks give you multiplexing for free, but to put in context this was created in the era before even JQuery was well-known.

      Funny story, I shared it with one of my matches and she found it super useful but was a bit surprised that, in a way, I was helping my competition. Turned out OK... we're still together nearly two decades later and now she generously jokes I invented Tinder before it was a thing.

    • By bobmcnamara 2025-11-15 9:04

      > If you start with the problem of how to create a reliable stream of data on top of an unreliable datagram layer, then the solution that comes out will look virtually identical to TCP.

      I'll add that at the time of TCP's writing, the telephone people far outnumbered everyone else in the packet switching vs circuit switching debate. TCP gives you a virtual circuit over a packet switched network as a pair of reliable-enough independent byte streams over IP. This idea, that the endpoints could implement reliability through retransmission came from an earlier French network, Cylades, and ends up being a core principle of IP networks.

    • By o11c 2025-11-15 20:58

      TCP has another unfixable flaw - it cannot be properly secured. Writing a security layer on top of TCP can at most detect, not avoid, attacks.

      It is very easy for a malicious actor anywhere in the network to inject data into a connection. By contrast, it is much harder for a malicious actor to break the legitimate traffic flow ... except for the fact that TCP RST grants any rando the power to upgrade "inject" to "break". This is quite common in the wild for any traffic that does not look like HTTP, even when both endpoints are perfectly healthy.

      Blocking TCP RST packets using your firewall will significantly improve reliability, but this still does not protect you from more advanced attackers which cause a desynchronization due to forged sequence numbers with nonempty payload.

      As a result, it is mandatory for every application to support a full-blown "resume on a separate connection" operation, which is complicated and hairy and also immediately runs into the additional flaw that TCP is very slow to start.

      ---

      While not an outright flaw, I also think it has become clear by now that it is highly suboptimal for "address" and "port" to be separate notions.

    • By 1vuio0pswjnm7 2025-11-15 16:53

      "... some applications want multiple streams that don't block each other. You could use multiple TCP connections, but that adds its own overhead, so SCTP and QUIC were designed to address those issues."

      Other applications work just fine with a single TCP connection

      If I am using TCP for DNS, for example, and I am retrieving data from a single host such as a DNS cache, I can send multiple queries over a single TCP connection and receive multiple responses over the same single TCP single connection, out of order. No blocking.^1 If the cache (application) supports it, this is much faster than receiving answers sequentially and it's more efficient and polite than opening multiple TCP connections

      1. I do this every day outside the browser with DNS over TLS (DoT) using something like streamtcp from NLNet Labs. I'm not sure that QUIC is faster, server support for QUIC is much more limited, but QUIC may have other advantages

      I also do it with DNS over HTTPS (DoH), outside the browser, using HTTP/1.1 pipelining, but there I receive answers sequentially. I'm still not convinced that HTTP/2 is faster for this particular use case, i.e., downloading data from a single host using multiple HTTP requests (compared to something like integrating online advertising into websites, for example)

    • By kccqzy 2025-11-15 18:22

      Yeah the fact that the congestion control algorithm isn’t part of the wire protocol is very ahead of its time and gave the protocol flexibility that’s much needed in retrospective. OTOH a lot of college courses about TCP don’t really emphasize this fact and still many people I interacted with thought that TCP had a single defined congestion control algorithm.

    • By musicale 2025-11-15 18:21

      > how to create a reliable stream of data on top of an unreliable datagram layer, then the solution that comes out will look virtually identical to TCP. It just is the right solution for the job

      A stream of bytes made sense in the 1970s for remote terminal emulation. It still sort of makes sense for email, where a partial message is useful (though downloading headers in bulk followed by full message on demand probably makes more sense.)

      But in 2025 much of communication involves messages that aren't useful if you only get part of them. It's also a pain to have to serialize messages into a byte stream and then deserialize the byte stream into messages (see: gRPC etc.) and the byte stream ordering is costly, doesn't work well with multipathing, and doesn't provide much benefit if you are only delivering complete messages.

      TCP without congestion control isn't particularly useful. As you note traditional TCP congestion control doesn't respond well to reordering. Also TCP's congestion control traditionally doesn't distinguish between intentional packet drops (e.g. due to buffer overflow) and packet loss (e.g. due to corruption.) This means, for example that it can't be used directly over networks with wireless links (which is why wi-fi has its own link layer retransmission).

      TCP's traditional congestion control is designed to fill buffers up until packets are dropped, leading to undesirable buffer bloat issues.

      TCP's traditional congestion control algorithms (additive increase/multiplicative decrease on drop) also have the poor property that your data rate tends to drop as RTT increases.

      TCP wasn't designed for hardware offload, which can lead to software bottlenecks and/or increased complexity when you do try to offload it to hardware.

      TCP's three-way handshake is costly for one-shot RPCs, and slow start means that short flows may never make it out of slow start, neutralizing benefits from high-speed networks.

      TCP is also poor for mobility. A connection breaks when your IP address changes, and there is no easy way to migrate it. Most TCP APIs expose IP addresses at the application layer, which causes additional brittleness.

      Additionally, TCP is poorly suited for optical/WDM networks, which support dedicated bandwidth (signal/channel bandwidth as well as data rate), and are becoming more important in datacenters and as interconnects for GPU clusters.

      etc.

    • By NooneAtAll3 2025-11-15 9:43

      > If you start with the problem of how to create a reliable stream of data on top of an unreliable datagram layer

      > poor handling of missing packets

      so it was poor at exact thing it was designed for?

    • By rini17 2025-11-15 11:10

      Might be obvious in hindsight, but it was not clear at all back then, that the congestion is manageable this way. There were legitimate concerns that it will all just melt down.

    • By 29athrowaway 2025-11-15 15:57

      I was excited about SCTP over 10 years ago but getting it to work was hard.

      The Linux kernel supports it but at least when I had tried this those modules were disabled on most distros.

    • By kragen 2025-11-15 16:54

      There are a lot of design alternatives possible to TCP within the "create a reliable stream of data on top of an unreliable datagram layer" space:

      • Full-duplex connections are probably a good idea, but certainly are not the only way, or the most obvious way, to create a reliable stream of data on top of an unreliable datagram layer. TCP's predecessor NCP was half-duplex.

      • TCP itself also supports a half-duplex mode—even if one end sends FIN, the other end can keep transmitting as long as it wants. This was probably also a good idea, but it's certainly not the only obvious choice.

      • Sequence numbers on messages or on bytes?

      • Wouldn't it be useful to expose message boundaries to applications, the way 9P, SCTP, and some SNA protocols do?

      • If you expose message boundaries to applications, maybe you'd also want to include a message type field? Protocol-level message-type fields have been found to be very useful in Ethernet and IP, and in a sense the port-number field in UDP is also a message-type field.

      • Do you really need urgent data?

      • Do servers need different port numbers? TCPMUX is a straightforward way of giving your servers port names, like in CHAOSNET, instead of port numbers. It only creates extra overhead at connection-opening time, assuming you have the moral equivalent of file descriptor passing on your OS. The only limitation is that you have to use different client ports for multiple simultaneous connections to the same server host. But in TCP everyone uses different client ports for different connections anyway. TCPMUX itself incurs an extra round-trip time delay for connection establishment, because the requested server name can't be transmitted until the client's ACK packet, but if you incorporated it into TCP, you'd put the server name in the SYN packet. If you eliminate the server port number in every TCP header, you can expand the client port number to 24 or even 32 bits.

      • Alternatively, maybe network addresses should be assigned to server processes, as in Appletalk (or IP-based virtual hosting before HTTP/1.1's Host: header, or, for TLS, before SNI became widespread), rather than assigning network addresses to hosts and requiring port numbers or TCPMUX to distinguish multiple servers on the same host?

      • Probably SACK was actually a good idea and should have always been the default? SACK gets a lot easier if you ack message numbers instead of byte numbers.

      • Why is acknowledgement reneging allowed in TCP? That was a terrible idea.

      • It turns out that measuring round-trip time is really important for retransmission, and TCP has no way of measuring RTT on retransmitted packets, which can pose real problems for correcting a ridiculously low RTT estimate, which results in excessive retransmission.

      • Do you really need a PUSH bit? C'mon.

      • A modest amount of overhead in the form of erasure-coding bits would permit recovery from modest amounts of packet loss without incurring retransmission timeouts, which is especially useful if your TCP-layer protocol requires a modest amount of packet loss for congestion control, as TCP does.

      • Also you could use a "congestion experienced" bit instead of packet loss to detect congestion in the usual case. (TCP did eventually acquire CWR and ECE, but not for many years.)

      • The fact that you can't resume a TCP connection from a different IP address, the way you can with a Mosh connection, is a serious flaw that seriously impedes nodes from moving around the network.

      • TCP's hardcoded timeout of 5 minutes is also a major flaw. Wouldn't it be better if the application could set that to 1 hour, 90 minutes, 12 hours, or a week, to handle intermittent connectivity, such as with communication satellites? Similarly for very-long-latency datagrams, such as those relayed by single LEO satellites. Together this and the previous flaw have resulted in TCP largely being replaced for its original session-management purpose with new ad-hoc protocols such as HTTP magic cookies, protocols which use TCP, if at all, merely as a reliable datagram protocol.

      • Initial sequence numbers turn out not to be a very good defense against IP spoofing, because that wasn't their original purpose. Their original purpose was preventing the erroneous reception of leftover TCP segments from a previous incarnation of the connection that have been bouncing around routers ever since; this purpose would be better served by using a different client port number for each new connection. The ISN namespace is far too small for current LFNs anyway, so we had to patch over the hole in TCP with timestamps and PAWS.

  • By throw0101a 2025-11-15 13:23

    Any love for SCTP?

    > The Stream Control Transmission Protocol (SCTP) is a computer networking communications protocol in the transport layer of the Internet protocol suite. Originally intended for Signaling System 7 (SS7) message transport in telecommunication, the protocol provides the message-oriented feature of the User Datagram Protocol (UDP) while ensuring reliable, in-sequence transport of messages with congestion control like the Transmission Control Protocol (TCP). Unlike UDP and TCP, the protocol supports multihoming and redundant paths to increase resilience and reliability.

    […]

    > SCTP may be characterized as message-oriented, meaning it transports a sequence of messages (each being a group of bytes), rather than transporting an unbroken stream of bytes as in TCP. As in UDP, in SCTP a sender sends a message in one operation, and that exact message is passed to the receiving application process in one operation. In contrast, TCP is a stream-oriented protocol, transporting streams of bytes reliably and in order. However TCP does not allow the receiver to know how many times the sender application called on the TCP transport passing it groups of bytes to be sent out. At the sender, TCP simply appends more bytes to a queue of bytes waiting to go out over the network, rather than having to keep a queue of individual separate outbound messages which must be preserved as such.

    > The term multi-streaming refers to the capability of SCTP to transmit several independent streams of chunks in parallel, for example transmitting web page images simultaneously with the web page text. In essence, it involves bundling several connections into a single SCTP association, operating on messages (or chunks) rather than bytes.

    * https://en.wikipedia.org/wiki/Stream_Control_Transmission_Pr...

    • By o11c 2025-11-15 21:01

      No, SCTP only fixes half of a problem, but also gratuitously introduces several additional flaws, even ignoring the "router support" problem.

      The only good answer is "a reliability layer on top of UDP"; fortunately everybody is now rallying around QUIC as the choice for that.

    • By nesarkvechnep 2025-11-15 14:26

      As a BSD enjoyer and paid to write Erlang, I have nothing but love for SCTP.

  • By stavros 2025-11-15 8:24

    Wait, can you actually just use IP? Can I just make up a packet and send it to a host across the Internet? I'd think that all the intermediate routers would want to have an opinion about my packet, caring, at the very least, that it's either TCP or UDP.

    • By ilkkao 2025-11-15 8:48

      You can definitely craft an IP packet by hand and send it. If it's IPv4, you need to put a number between 0 and 255 to the protocol field from this list: https://www.iana.org/assignments/protocol-numbers/protocol-n...

      Core routers don't inspect that field, NAT/ISP boxes can. I believe that with two suitable dedicated linux servers it is very possible to send and receive a single custom IP packet between them, even using 253 or 254 (= Use for experimentation and testing [RFC3692]) as the protocol number

    • By xorcist 2025-11-15 11:05 · 3 replies

      > caring, at the very least, that it's either TCP or UDP.

      You left out ICMP, my favourite! (And a lot more important in IPv6 than in v4.)

      Another pretty well-known protocol that is neither TCP nor UDP is IPsec. (Which is really two new IP protocols.) People really did still design proper IP protocols in the '90s.

      > Can I just make up a packet and send it to a host across the Internet?

      You should be able to. But if you are on a corporate network with a really strict firewalling router that only forwards traffic it likes, then likely not. There are also really crappy home routers that give similar problems from the other end of enterpriseness.

      NAT also destroyed much of the end-to-end principle. If you don't have a real IP address and rely on a NAT router to forward your data, it needs to be in a protocol the router recognizes.

      Anyway, for the past two decades people have grown tired of that and just pile hacks on top of TCP or UDP instead. That's sad. Or who am I kidding? Really it's on top of HTTP. HTTP will likely live on long past anything IP.

    • By eqvinox 2025-11-15 11:45 · 1 reply

      > I'd think that all the intermediate routers would want to have an opinion about my packet, caring, at the very least, that it's either TCP or UDP.

      They absolutely don't. Routers are layer 3 devices; TCP & UDP are layer 4. The only impact is that the ECMP flow hashes will have less entropy, but that's purely an optimization thing.

      Note TCP, UDP and ICMP are nowhere near all the protocols you'll commonly see on the internet — at minimum, SCTP, GRE, L2TP and ESP are reasonably widespread (even a tiny fraction of traffic is still a giant number considering internet scales).

      You can send whatever protocol number with whatever contents your heart desires. Whether the other end will do anything useful with it is another question.

    • By Karrot_Kream 2025-11-15 9:35

      If there's no form of NAT or transport-layer processing along the path between endpoints, you shouldn't have an issue. But NAT and transport- and application-layer load balancing are very common on the net these days, so YMMV.

      You might have more luck with an IPv6 packet.

    • By Twisol 2025-11-15 8:43 · 2 replies

      As far as I'm aware, sure you can. TCP packets and UDP datagrams are wrapped in IP datagrams, and it's the job of an IP network to ship your data from point A (sender) to point B (receiver). Nodes along the way might do so-called "deep packet inspection" to snoop on the payload of your IP datagrams (for various reasons, not all nefarious), but they don't need to do that to do the basic job of routing. From a semantic standpoint, the information in the TCP and UDP headers (as part of the IP payload) is only there to govern interactions between the two endpoint parties. (For instance, the "port" of a TCP or UDP packet is a node-local identifier for one of many services that might exist at the IP address the packet was routed to, allowing many services to coexist at the same node.)
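      The port demultiplexing described above is easy to see with two UDP sockets sharing one address (a sketch using loopback and OS-assigned ephemeral ports):

```python
import socket

# Two "services" coexisting at the same IP address, distinguished only by port.
a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
a.bind(("127.0.0.1", 0))    # port 0 = let the OS pick a free port
b.bind(("127.0.0.1", 0))
a.settimeout(2)
b.settimeout(2)

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"to a", a.getsockname())
sender.sendto(b"to b", b.getsockname())

# The OS demultiplexes on destination port: each socket sees only its own datagram.
msg_a, _ = a.recvfrom(1024)
msg_b, _ = b.recvfrom(1024)
```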

    • By gruturo 2025-11-15 11:44 · 1 reply

      Yep it's full of IP protocols other than the well-known TCP, UDP and ICMP (and, if you ever had the displeasure of learning IPSEC, its AH and ESP).

      A bunch of multicast stuff (IGMP, PIM)

      A few routing protocols (OSPF, but notably not BGP which just uses TCP, and (usually) not MPLS which just goes over the wire - it sits at the same layer as IP and not above it)

      A few VPN/encapsulation solutions like GRE, IP-in-IP, L2TP and probably others I can't remember

      As usual, Wikipedia has got you covered, much better than my own recollection: https://en.wikipedia.org/wiki/List_of_IP_protocol_numbers

    • By nly 2025-11-15 10:55

      The reason you wouldn't do that is IP doesn't give you a mechanism to share an IP address with multiple processes on a host, it just gets your packets to a particular host.

      As soon as you start thinking about having multiple services on a host, you end up with the idea of a service ID, or "port".

      UDP or UDP Lite gives you exactly that at the cost of 8 bytes, so there's no real value in not just putting everything on top of UDP.
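      Those 8 bytes are the entire UDP header: four 16-bit fields. A sketch (checksum left at zero, which IPv4 permits):

```python
import struct

def udp_header(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """The whole UDP header: source port, destination port, length, checksum."""
    length = 8 + len(payload)   # header + payload, in bytes
    checksum = 0                # optional in IPv4; a real stack computes it over a pseudo-header
    return struct.pack("!HHHH", src_port, dst_port, length, checksum)

hdr = udp_header(12345, 53, b"query")
```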

    • By LeoPanthera 2025-11-15 8:50

      You know, I've always wondered if you could run Kermit*-over-IP, without having TCP in between.

      *The protocol.

    • By gsliepen 2025-11-15 8:39 · 2 replies

      They shouldn't; the whole point is that the IP header is enough to route packets between endpoints, and only the endpoints should care about any higher layer protocols. But unfortunately some routers do, and if you have NAT then the NAT device needs to examine the TCP or UDP header to know how to forward those packets.

    • By immibis 2025-11-15 10:32

      Yes but not if you or they are behind NAT. It's a shame port numbers aren't in IP.

    • By GardenLetter27 2025-11-15 11:42 · 1 reply

      Probably not; loads of routers even block parts of ICMP.

HackerNews