supchain-bench-benchmarking-real-world-supply
This skill teaches Claude to build reliable, long-horizon supply chain agent systems using the SupChain-ReAct framework from the SupChain-Bench paper. The core technique replaces brittle, hand-authored Standard Operating Procedures (SOPs) with autonomous multi-path ReAct reasoning and majority-vote aggregation. Agents synthesize their own executable procedures for tool orchestration across complex supply chain workflows: order management, fulfillment tracking, warehouse operations, cancellation analysis, and error diagnosis.
When to Use
- When the user asks to build an agent that orchestrates 10-30+ sequential tool calls to resolve supply chain or e-commerce order issues
- When designing a diagnostic pipeline that traces orders through trade, fulfillment, and warehouse layers
- When the user needs an agent framework that works without hand-authored SOPs or rigid procedural scripts
- When implementing multi-step tool-calling workflows where early termination and execution drift are failure risks
- When building order investigation systems that must handle branching logic (cancelled vs. error vs. in-transit statuses)
- When the user wants to improve tool-calling reliability through parallel reasoning paths and consensus voting
- When creating agents for any domain requiring long-horizon, multi-entity traversal across linked database records
Key Technique
The Problem: LLMs performing multi-step tool orchestration in supply chain settings suffer from three failure modes: (1) premature termination, where the model stops calling tools before exhausting all entities; (2) schema mismatches, where field names drift between tool calls; and (3) faithfulness errors, where the model's final response contradicts what tools actually returned. Providing hand-written SOPs helps but requires expensive domain expertise and still fails for models that prioritize conversational brevity over exhaustive coverage.
SupChain-ReAct: Instead of authoring SOPs, run N independent ReAct trajectories (the paper uses N=5) in parallel against the same task prompt and tool schema. Each trajectory alternates between a reasoning step ("I need to check the fulfillment status for each ID") and a tool invocation, continuing until it produces a final answer or hits a step limit. The final output is selected by majority vote over the textual answers from successful trajectories. This approach works because: (a) different trajectories explore different orderings and branching paths, reducing the chance that all paths prematurely terminate at the same point; (b) majority voting filters out hallucinated or unfaithful answers since they are unlikely to appear in a majority of independent runs; and (c) the model leverages its existing domain knowledge and tool-schema understanding to self-organize procedural steps without external instruction.
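The run-N-then-vote loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `run_trajectory`, `fake_trajectory`, `normalize`, and the choice to vote on whitespace- and case-normalized text are all assumptions made for the example. A real `run_trajectory` would drive one ReAct loop (reason, call a tool, observe) against the model and tool schema, returning `None` when it hits the step limit without an answer.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional

# Hypothetical signature: (task, seed, max_steps) -> final answer text, or None on failure.
Trajectory = Callable[[str, int, int], Optional[str]]


def normalize(answer: str) -> str:
    """Assumed normalization so trivially different strings vote together."""
    return " ".join(answer.lower().split())


def majority_vote(answers: List[str]) -> Optional[str]:
    """Return the original text of the most frequent normalized answer."""
    if not answers:
        return None
    counts = Counter(normalize(a) for a in answers)
    winner, _ = counts.most_common(1)[0]
    # Hand back the first answer as originally written, not the normalized form.
    return next(a for a in answers if normalize(a) == winner)


def supchain_react(run_trajectory: Trajectory, task: str,
                   n: int = 5, max_steps: int = 30) -> Optional[str]:
    """Run n independent trajectories in parallel and vote over the successful ones."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(run_trajectory, task, seed, max_steps)
                   for seed in range(n)]
        results = []
        for future in futures:
            try:
                results.append(future.result())
            except Exception:
                # A crashed trajectory counts as unsuccessful and casts no vote.
                results.append(None)
    successful = [r for r in results if r is not None]
    return majority_vote(successful)


def fake_trajectory(task: str, seed: int, max_steps: int) -> Optional[str]:
    """Stand-in for a real ReAct loop, used only to exercise the voting logic."""
    if seed == 2:
        return None  # simulates hitting the step limit
    if seed == 4:
        return "Order in transit"  # simulates an unfaithful outlier
    if seed == 1:
        return "order cancelled:  OUT OF STOCK"  # same answer, different surface form
    return "Order cancelled: out of stock"


if __name__ == "__main__":
    print(supchain_react(fake_trajectory, "Why was order 123 cancelled?"))
```

Exact-match voting only works when answers are short and comparable. For free-form diagnostic reports, one option is to vote on a structured field extracted from each answer (for example a status code), which is again an assumption rather than something the source specifies.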
Results: SupChain-ReAct consistently outperformed both SOP-free and SOP-guided baselines across models. For example, Gemini-2.5-Pro jumped from 11.22% (no SOP) to 72.44% (SupChain-ReAct), and Claude-4-Sonnet went from 31.63% to 75.51%. The technique is model-agnostic and requires no training or fine-tuning.