Skip to content

Add reply-tree backfill strategy#807

Open
sij411 wants to merge 16 commits into
fedify-dev:feat/backfillfrom
sij411:feat/backfill
Open

Add reply-tree backfill strategy#807
sij411 wants to merge 16 commits into
fedify-dev:feat/backfillfrom
sij411:feat/backfill

Conversation

@sij411

@sij411 sij411 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR adds the reply-tree strategy to @fedify/backfill, last feature of the #275

The strategy walks reply relationships from a seed object by following inReplyTo ancestors and replies descendants. It is opt-in, so the default remains the cheaper FEP-f228 context collection path via context-auto.

This also wires ordered multi-strategy execution for hybrid usage such as ["context-auto", "reply-tree"]. Strategies share item/request budgets, abort state, deduplication, and a traversal-local document cache.

Related #275

Details

  • Add reply-tree strategy
  • Walk inReplyTo ancestors
  • Walk replies descendants
  • Respect maxDepth for reply-tree traversal
  • Preserve ordered strategy semantics
  • Share maxItems, maxRequests, interval, and signal across strategies
  • Deduplicate yielded objects globally by object ID
  • Avoid repeated dereferences with traversal-local caching
  • Skip already-seen context collection URL items before dereferencing
  • Document reply-tree usage and ordered strategy behavior

Notes

reply-tree remains opt-in because it can be more network-heavy than context collection traversal.

This PR does not impose default maxDepth or maxRequests; callers are expected to choose limits appropriate for their use case.

The cache is traversal-local only. Persistent/shared caching can be layered outside @fedify/backfill through the supplied documentLoader.

Verification

  • deno task -f @fedify/backfill check
  • deno task -f @fedify/backfill test
  • mise exec -- pnpm --filter @fedify/backfill test
  • mise exec -- pnpm --filter @fedify/backfill test:bun

AI usage

Assisted-by: Codex:gpt-5.5

sij411 added 15 commits June 15, 2026 14:28
Extend the backfill public strategy and origin types for the upcoming
reply-tree traversal.  Clarify that context-auto absorbs only other context
collection strategies so it can later compose with reply-tree traversal.

Assisted-by: Codex:gpt-5.5
Run normalized strategies through a shared orchestration path so context
collection traversal can compose with the upcoming reply-tree strategy.
Keep context-auto absorbing only overlapping context collection strategies,
while preserving global deduplication, budgets, and yield metadata handling.

Assisted-by: Codex:gpt-5.5
Implement the reply-tree ancestor path through inReplyTo targets.  Ancestors
share the existing document loader, request budget, abort signal, and global
output deduplication while keeping a separate visited set to prevent traversal
cycles.

Add coverage for embedded and dereferenced ancestors, maxDepth, maxRequests,
cycle prevention, and deduplication against context collection results.

Assisted-by: Codex:gpt-5.5
Extend reply-tree traversal to follow replies collections after ancestor
lookup.  Descendants reuse the existing collection loader, request budget,
abort signal, and visited-state handling while reporting replies origin and
reply-tree depth metadata.

Add coverage for embedded and dereferenced replies collections, maxDepth,
maxRequests, and descendant cycle prevention.

Assisted-by: Codex:gpt-5.5
Update the backfill public API comments now that reply-tree traversal is
implemented.  Document how documentLoader, maxRequests, maxDepth, and item
depth apply to reply targets and replies collections.

Assisted-by: Codex:gpt-5.5
Preserve the PR 2 context collection behavior by loading the context
collection once for adjacent context strategies.  This keeps ordered strategy
execution while avoiding duplicate context dereferences and request-budget
regressions for explicit context strategy combinations.

Add a regression test that combined context object and activity strategies
share the same context collection load.

Assisted-by: Codex:gpt-5.5
Describe how reply-tree composes with context collection backfill and clarify
that reply-tree yields discovered post-like objects without unwrapping
Activity objects.

Assisted-by: Codex:gpt-5.5
Separate reply-tree object and collection visited state so traversal can avoid
reloading or revisiting replies collections.  Mark collection IRIs before
loading them and keep embedded collection references in visited state to avoid
reply-tree collection cycles.

Add coverage for repeated replies collection IRIs sharing a single loader
request.

Assisted-by: Codex:gpt-5.5
Add tests that exercise shared maxItems, maxRequests, and abort behavior across context collection and reply-tree strategies.  This protects the ordered hybrid execution path where context-auto runs before reply-tree.

Assisted-by: Codex:gpt-5.5
Add a regression test for hybrid backfill ordering where reply-tree runs before context-auto.  The test protects the contract that earlier strategies keep their BackfillItem metadata when later strategies discover the same object.

Assisted-by: Codex:gpt-5.5
Clarify that backfill strategies run in order while sharing item, request, abort, and deduplication state.  Also update the README to describe the opt-in reply-tree strategy alongside the FEP-f228 context collection path.

Assisted-by: Codex:gpt-5.5
Limit context-auto absorption to the current adjacent context strategy group so reply-tree keeps acting as an ordering boundary.  Add a regression test for context-objects before reply-tree followed by context-auto.

Assisted-by: Codex:gpt-5.5
Avoid dereferencing context collection URL items whose object IDs are already known.  This prevents fetching the seed object again when a context collection includes it and preserves request budget for unseen items.

Assisted-by: Codex:gpt-5.5
Keep a traversal-local document cache in the shared request budget so overlapping strategies do not dereference the same IRI more than once.  Cache hits bypass request budgeting and interval delays while thrown loader failures remain retryable.

Assisted-by: Codex:gpt-5.5
Add a regression test proving failed documentLoader calls are removed from the traversal-local cache.  A later strategy can retry the same IRI and yield the recovered object.

Assisted-by: Codex:gpt-5.5
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e0b40416-e45d-459e-9202-7ea74257fbec

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new reply-tree backfill strategy to walk inReplyTo ancestors and replies descendants when context collections are unavailable or incomplete. It also implements document caching to avoid duplicate dereferences across strategies and updates options to share request, item, abort, and deduplication state. Feedback on the changes suggests enforcing a default maximum depth limit of 100 on recursive traversal functions to prevent potential stack overflow or DoS attacks, and simplifying the tracking logic for visited replies collections to improve code maintainability.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

readonly origin: "in-reply-to";
readonly depth: number;
}> {
if (options.maxDepth != null && traversal.depth > options.maxDepth) return;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent stack overflow or denial-of-service (DoS) attacks from maliciously deep reply trees, it is highly recommended to enforce a default maximum depth limit (e.g., 100) when options.maxDepth is not explicitly provided.

  const maxDepth = options.maxDepth ?? 100;
  if (traversal.depth > maxDepth) return;
References
  1. To prevent stack overflow from maliciously crafted deep JSON, implement a depth limit in recursive traversal functions. The limit should be high enough for legitimate data but low enough to prevent DoS attacks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dahlia I think we decided not to set max values intentionally, but it is valid concern. What do you think?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. We should have a default maxDepth value, and 100 is even too great–how about 10?

readonly origin: "replies";
readonly depth: number;
}> {
if (options.maxDepth != null && traversal.depth > options.maxDepth) return;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent stack overflow or denial-of-service (DoS) attacks from maliciously deep reply trees, it is highly recommended to enforce a default maximum depth limit (e.g., 100) when options.maxDepth is not explicitly provided.

Suggested change
if (options.maxDepth != null && traversal.depth > options.maxDepth) return;
const maxDepth = options.maxDepth ?? 100;
if (traversal.depth > maxDepth) return;
References
  1. To prevent stack overflow from maliciously crafted deep JSON, implement a depth limit in recursive traversal functions. The limit should be high enough for legitimate data but low enough to prevent DoS attacks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This too

Comment on lines +340 to +352
const repliesId = object.repliesId;
let repliesIdVisited = false;
if (repliesId != null && !visitReplyTreeCollectionId(repliesId, traversal)) {
return;
}
repliesIdVisited = repliesId != null;
const replies = await getRepliesCollection(context, object, options, budget);
if (replies == null) return;
if (repliesIdVisited) {
traversal.visitedCollections.add(replies);
} else if (!visitReplyTreeCollection(replies, traversal)) {
return;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current logic for tracking visited replies collections is quite complex and bypasses visitReplyTreeCollection when repliesIdVisited is true. We can simplify this significantly by checking the ID first, loading the collection, checking the loaded collection via visitReplyTreeCollection, and then marking the ID as visited. This avoids the need for the repliesIdVisited state variable and makes the code much cleaner and easier to maintain.

  const repliesId = object.repliesId;
  if (repliesId != null && traversal.visitedCollectionIds.has(repliesId.href)) {
    return;
  }
  const replies = await getRepliesCollection(context, object, options, budget);
  if (replies == null) return;
  if (!visitReplyTreeCollection(replies, traversal)) {
    return;
  }
  if (repliesId != null) {
    traversal.visitedCollectionIds.add(repliesId.href);
  }

@sij411 sij411 Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current logic marks repliesId as visited before dereferencing, which avoids repeated loads of the same replies collection URL and protects request budget.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new reply-tree backfill strategy to the @fedify/backfill package, enabling traversal of inReplyTo ancestors and replies descendants when context collections are unavailable or incomplete. The implementation shares request budgets, item limits, abort signals, and deduplication state across strategies, and introduces a document cache to prevent duplicate dereferences. Additionally, comprehensive unit tests have been added to verify the new strategy's behavior, including depth limits, request budgets, and cycle avoidance. All review comments were filtered out as they suggested implementing depth limit checks that are already present in the code changes, so there is no additional feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 91.59420% with 29 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
packages/backfill/src/backfill.ts 91.59% 22 Missing and 7 partials ⚠️
Files with missing lines Coverage Δ
packages/backfill/src/backfill.ts 88.92% <91.59%> (+4.03%) ⬆️
🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@sij411

sij411 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7d117ee9d4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/backfill/src/backfill.ts
Comment thread packages/backfill/src/backfill.ts
Start descendant traversal from discovered ancestors as well as the
seed so reply-tree backfill does not miss sibling branches when the
seed is itself a reply.

Also skip already visited reply IRIs before dereferencing collection
items so bounded crawls do not waste request budget on known objects.

Add regressions covering both behaviors.

Assisted-by: Codex:gpt-5.5
@sij411

sij411 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Swish!

Reviewed commit: 8f29cbbe4b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@sij411 sij411 marked this pull request as ready for review June 16, 2026 13:56

@dahlia dahlia left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found two reply-tree edge cases worth fixing before merging.

ancestors.push(item.object);
yield item;
}
for (const object of ancestors.toReversed()) {

@dahlia dahlia Jun 18, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The descendant crawl starts every discovered ancestor at depth: 1, so depth is no longer measured from the original seed. For example, if the seed is a reply and its parent has both the seed and a sibling in replies, the sibling is discovered through seed → parent → sibling but is reported as depth 1 and still passes maxDepth: 1. That makes the public depth metadata and maxDepth budget surprising. Please keep the ancestor depth alongside each ancestor and start descendant traversal from ancestor.depth + 1, or otherwise make the API/docs explicit that reply-tree depth is local to each traversal root.

if (options.maxDepth != null && traversal.depth > options.maxDepth) return;
const repliesId = object.repliesId;
let repliesIdVisited = false;
if (repliesId != null && !visitReplyTreeCollectionId(repliesId, traversal)) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This marks the replies collection IRI as visited before the collection is successfully loaded. If getRepliesCollection() later returns null because the loader hit a transient non-budget error, the traversal treats that collection as visited and will not retry it if the same IRI is reachable through another branch. That conflicts with the document-cache behavior below, where failed loads are deliberately removed so later strategies can retry. Please either mark repliesId as visited only after a successful load, or remove it from visitedCollectionIds when the load fails.

@dahlia dahlia self-assigned this Jun 18, 2026
@dahlia dahlia added component/federation Federation object related component/collections Collections related activitypub/interop Interoperability issues labels Jun 18, 2026
@dahlia dahlia added activitypub/compliance Specification compliance component/backfill Backfiller-related (@fedify/backfill) labels Jun 18, 2026
@dahlia

dahlia commented Jun 18, 2026

Copy link
Copy Markdown
Member

Ah, also we need to add FEP-f228 to our FEDERATION.md document.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

activitypub/compliance Specification compliance activitypub/interop Interoperability issues component/backfill Backfiller-related (@fedify/backfill) component/collections Collections related component/federation Federation object related

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants