Skip to content

pyln-testing: replace ephemeral-port-reserve with a filesystem lock approach#9211

Open
daywalker90 wants to merge 4 commits into
ElementsProject:masterfrom
daywalker90:address-in-use-fix-67
Open

pyln-testing: replace ephemeral-port-reserve with a filesystem lock approach#9211
daywalker90 wants to merge 4 commits into
ElementsProject:masterfrom
daywalker90:address-in-use-fix-67

Conversation

@daywalker90

Copy link
Copy Markdown
Collaborator

using pytest-xdist the ephemeral-port-reserve approach would only ensure no port conflicts per pytest worker. There could still be race conditions where we get "address already in use".

This new filesystem approach should work across pytest workers

Changelog-None

@daywalker90 daywalker90 requested a review from cdecker as a code owner June 12, 2026 11:27
@daywalker90 daywalker90 marked this pull request as draft June 12, 2026 12:30
@cdecker

cdecker commented Jun 12, 2026

Copy link
Copy Markdown
Member

I don't quite get why it is necessary to keep a list of reserved ports at all. The idea is to have the OS itself track the reservations, and have them time out after the usual 30s - 60s cleanup timeout, after which the TIME_WAIT state on the socket expires, and the port goes back into the pool of available ports (i.e., bind with that specific port no longer needs SO_REUSEADDR set, and the random port selection may return it).

So unless we have a test that somehow assumes we have control over a port, while not actively bound to it, for prolonged periods (30s - 60s+) we should never need to materialize the list of reservations, even worse, the list of reservations may think it has control over a port, when the OS has already given that port to someone else. So the materialized port list may actually be causing collisions that are harder to debug. To me, a test that runs for longer than 30s and still expects to have exclusive control over a port is the bigger problem. Due to the port returning to the free-pool, and the OS being allowed to reassign it, the list reservation approach will never work I think.

e7fe59b changed the way lightningd refuses
rpc calls during shutdown and the now resulting ConnectionRefusedError was never caught
so the test plugin would not exit and the teardown of those tests would hang for the full
timeout

Changelog-None
@daywalker90 daywalker90 force-pushed the address-in-use-fix-67 branch from afaa3ca to bcedc58 Compare June 15, 2026 21:56
@daywalker90 daywalker90 linked an issue Jun 16, 2026 that may be closed by this pull request
@daywalker90

Copy link
Copy Markdown
Collaborator Author

I don't quite get why it is necessary to keep a list of reserved ports at all. The idea is to have the OS itself track the reservations, and have them time out after the usual 30s - 60s cleanup timeout, after which the TIME_WAIT state on the socket expires, and the port goes back into the pool of available ports (i.e., bind with that specific port no longer needs SO_REUSEADDR set, and the random port selection may return it).

So unless we have a test that somehow assumes we have control over a port, while not actively bound to it, for prolonged periods (30s - 60s+) we should never need to materialize the list of reservations, even worse, the list of reservations may think it has control over a port, when the OS has already given that port to someone else. So the materialized port list may actually be causing collisions that are harder to debug. To me, a test that runs for longer than 30s and still expects to have exclusive control over a port is the bigger problem. Due to the port returning to the free-pool, and the OS being allowed to reassign it, the list reservation approach will never work I think.

For some reason ports go back into the available pool sooner than what we would expect. I'm not sure why, but it clearly happens.
ephemeral_port_reserve puts them in TIME_WAIT, we can bind to them immediately because of SO_REUSEADDR set, but then, probably because of node restarts, during a test the port is somehow returned to a state where ephemeral_port_reserve will return it again immediately for another test.
So we need an additional mechanism for the ports to stay consistent during a test with node restarts. Our current mechanism was flawed in combination with pytest-xdist since it was tracking per worker, not per machine. I did five runs with this on my fork repo and one run here with no "address already in use" errors, where before it would happen in atleast 50% of runs.

@daywalker90 daywalker90 marked this pull request as ready for review June 16, 2026 07:35
@daywalker90 daywalker90 added the Status::Ready for Review The work has been completed and is now awaiting evaluation or approval. label Jun 16, 2026
@daywalker90 daywalker90 force-pushed the address-in-use-fix-67 branch from bcedc58 to 2492c7a Compare June 16, 2026 07:39
@daywalker90 daywalker90 force-pushed the address-in-use-fix-67 branch from 2492c7a to 6858b33 Compare June 16, 2026 09:20
…ady first

`test_grpc_custommsg_notification` was often failing with a timeout here:

```
def test_grpc_custommsg_notification(node_factory):
        l1, l2 = node_factory.get_nodes(2)

        # Test the connect notification
        custommsg_stream = l1.grpc.SubscribeCustomMsg(clnpb.StreamCustomMsgRequest())
        l2.connect(l1)

        # Send the custom-msg to node l1
        # The payload doesn't matter.
        # It just needs to be valid hex which encodes to an odd BOLT-8 msg id
        l2.rpc.sendcustommsg(l1.info["id"], "3131313174657374")

>       for custommsg in custommsg_stream:
```

Changelog-None
… exit

this would cause pytest teardowns to take the full timeout of killing a plugin
that should exit themselves

Changelog-None
…pproach

For some reason ports go back into the available pool sooner than what we would expect. I'm not sure why, but it clearly happens.
ephemeral_port_reserve puts them in TIME_WAIT, we can bind to them immediately because of SO_REUSEADDR set,
but then, probably because of node restarts, during a test the port is somehow returned
to a state where ephemeral_port_reserve will return it again immediately for another test.
So we need an additional mechanism for the ports to stay consistent during a test with node restarts.
Our current mechanism was flawed in combination with pytest-xdist since it was tracking per worker, not per machine.

This new filesystem approach should work across pytest workers

Changelog-None
@daywalker90 daywalker90 force-pushed the address-in-use-fix-67 branch from 6858b33 to 09d9fb0 Compare June 16, 2026 10:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Status::Ready for Review The work has been completed and is now awaiting evaluation or approval.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI flake test_websocket

2 participants