MLSanity

Do AI Agents Know When a Task Is Simple? Toward Complexity-Aware Reasoning and Execution

Junjie Yin, Xinyu Feng (cs.AI, cs.CL, cs.SE, eess.SY)

Large language model (LLM) agents increasingly automate multi-step engineering and informatics workflows, yet they rarely ask how much effort a task actually requires. They often follow a maximum-context-first strategy--re-reading files and dependencies they have already seen--turning a one-line edit into a small code-base audit. We argue the missing capability is task-aware execution-scope estimation: judging a task's difficulty, the information it truly needs, and the shortest reliable path before committing budget. We formalize minimum-sufficient execution and the Agent Cognitive Redundancy Ratio (ACRR), and propose E3 (Estimate, Execute, Expand): the agent estimates an initial operating point, executes a minimum viable path, and expands scope only when verification fails. On MSE-Bench--a deterministic benchmark of 121 edits in a capability-controlled simulator--E3 matches the strongest baseline's 100% success while cutting cost by 85%, tokens by 91%, and inspected files by 92%, and further beats a strong adaptive retrieval baseline by 16%; the gains survive held-out instruction wording and essentially every cost weighting. A companion real-model harness (LLM-Case) corroborates the effect on a live gpt-4o agent editing a real open-source library, with every candidate patch graded by actually running the project's real pytest suite against a measured oracle: the over-reading is milder but real, and E3 is the leanest and fastest policy at comparable task success--its one shortfall a provider rate-limit, not a wrong edit. We frame this as a controlled probe of execution redundancy, not a measurement of any deployed agent, and position task-aware execution as a step toward engineering-grounded AI (EGAI)--agents whose effort is anchored in the engineering reality of the task. We release the framework and benchmark.
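The Estimate, Execute, Expand loop described in this abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function name `e3`, the ladder of scopes, and the `estimate`/`execute`/`verify` callbacks are all hypothetical stand-ins for whatever the real framework uses.

```python
from typing import Callable, Optional, Sequence, Tuple


def e3(
    scopes: Sequence[int],
    estimate: Callable[[], int],
    execute: Callable[[int], str],
    verify: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[int], int]:
    """Return (result, scope used, number of attempts).

    scopes   -- hypothetical ladder of context budgets, smallest first
    estimate -- returns the index of the initial operating point
    execute  -- attempts the task under the given budget
    verify   -- checks an attempt (for example by running tests)
    """
    attempts = 0
    used = None
    # Estimate: start from the predicted scope rather than the largest one.
    for scope in scopes[estimate():]:
        attempts += 1
        used = scope
        result = execute(scope)   # Execute: try the minimum viable path
        if verify(result):
            return result, used, attempts
        # Expand: verification failed, so move to the next larger scope.
    return None, used, attempts


# Toy task (hypothetical): the edit only succeeds once 3 files are in scope.
outcome = e3(
    scopes=[1, 3, 10],
    estimate=lambda: 0,
    execute=lambda files: "ok" if files >= 3 else "fail",
    verify=lambda result: result == "ok",
)
print(outcome)
```

In this toy run the agent stops at the second rung of the ladder and never pays for the largest scope, which is the behaviour the abstract contrasts with a maximum-context-first strategy.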

Published: July 14, 2026

Last updated: July 14, 2026

DenseReward: Dense Reward Learning via Failure Synthesis for Robotic Manipulation

Yu Fang, Wanxi Dong, Jiaqi Liu, Yue Yang, Mingxiao Huo, Yao Mu, Huaxiu Yao, Li Erran Li, Daniel Szafir, Mingyu Ding (cs.RO)

Reinforcement learning holds great promise for improving robot policies beyond the limits of imitation learning. However, its practical adoption remains bottlenecked by the lack of reliable vision-language reward models that provide dense and informative feedback. Two key challenges remain: acquiring diverse failure data at scale and obtaining fine-grained reward signals beyond sparse trajectory-level success labels. Collecting failure trajectories typically requires laborious human effort, while pseudo-failures constructed by relabeling successful demonstrations fail to capture the diverse physical failure modes that arise during robot execution. Meanwhile, existing reward models often predict sparse binary or trajectory-level rewards, which provide limited guidance for efficient policy optimization. We introduce DenseReward, a dense robotic reward model that addresses both challenges. To train DenseReward, we develop an automated failure data generation pipeline that synthesizes physically realistic failure trajectories in simulation without human labeling, covering diverse failure modes such as collisions, missed grasps, object drops, and recovery behaviors. DenseReward predicts dense frame-level reward scores from visual observations and language instructions, enabling fine-grained estimation of task progress throughout an episode. Experiments show that DenseReward outperforms general-purpose VLMs and existing robotic reward models in dense reward prediction across both simulated and real-world manipulation. We further demonstrate that DenseReward provides effective reward guidance for downstream model predictive control and reinforcement learning. We release the dataset, trained reward models, and evaluation suite to support the development of failure-aware dense reward modeling for robot learning.

Published: July 14, 2026

Last updated: July 14, 2026

The Seriality Gap in Video Diffusion Models

Jorge Diaz Chao, Konpat Preechakul, Yuxi Liu, Yutong Bai (cs.LG, cs.CV)

When one ball strikes another, then another, video models should predict the consequences of each bounce. In controlled experiments on multi-ball hard-sphere dynamics, we find that the performance of standard bidirectional video diffusion degrades as the causal chain lengthens, even when provided more denoising steps. In a length-matched single-ball control, where ball-ball interactions are absent, the degradation largely disappears, isolating dependent-event structure rather than video length as the cause. Across intervention studies, methods that increase effective serial computation improve performance disproportionately, including autoregressive/blockwise generation and architectural depth. We identify this pattern as the seriality gap: a mismatch between tasks requiring growing serial computation and video diffusion models whose denoising loop does not provide scalable serial compute. We then prove that, for deterministic video prediction, denoising steps do not add serial computation beyond the backbone, indicating a structural obstacle for video diffusion on serial reasoning and simulation tasks.

Published: July 14, 2026

Last updated: July 14, 2026

TerraZero: Procedural Driving Simulation for Zero-Demonstration Self-Play at Scale

Zhouchonghao Wu, Akshay Rangesh, Weixin Li, Wei-Jer Chang, Zachary Lee, Tim Wang, Wei Zhan (cs.LG, cs.AI, cs.RO)

Training robust autonomous driving agents requires a simulator that is fast enough for reinforcement learning at scale, realistic enough to ground behavior in real-world map structure, and diverse enough to cover the safety-critical long tail that logged data rarely contains. We present TerraZero, a procedural driving simulator and self-play training stack. A configurable C engine runs simulation on the CPU and policy inference on the GPU over a zero-copy path, sustaining 1.3M agent-steps per second on a single server-grade GPU, far faster than existing object-level simulators, while keeping fidelity lighter single-agent systems omit: heterogeneous agents, multiple dynamics models, and full traffic-rule enforcement. TerraZero treats logged data only as a source of real-world map geometry, populating each map with randomized rule-based road users and signal controllers and randomizing agent dynamics, rewards, and sizes per episode, so a map yields an unbounded set of scenarios. Every reported policy trains from scratch by reinforcement learning alone on a compute-efficient self-play recipe across GPUs, with zero human demonstrations and no fallback planner at inference. Policies generalize zero-shot across cities and datasets, including emergent left-hand-traffic driving without explicit supervision. As an ego policy, TerraZero is the first fully learned policy to top the InterPlan long-tail benchmark, ahead of larger learned planners; on routine-driving val14 it ranks among the best approaches and is the safest, posting the best collision and time-to-collision scores. On Waymo Open Sim Agents realism the same recipe outperforms other demonstration-free methods and is competitive with the strongest reference-anchored self-play method. One stack serves both roles: driving policies across dynamics for cars and trucks, and sim agents that jointly control vehicles, pedestrians, and cyclists.

Published: July 14, 2026

Last updated: July 14, 2026

PalmClaw: A Native On-Device Agent Framework for Mobile Phones

Hongru Cai, Yongqi Li, Ran Wei, Wenjie Li (cs.CL, cs.AI)

Large Language Model (LLM) agents have moved beyond generating responses to executing multi-step tasks by calling tools, observing the results, and iteratively deciding the next action. Most agent systems run on desktops or servers, which support tool use and task automation. Mobile devices are also important agent environments because they are widely accessible and contain users' data, sensors, and daily-use applications. Existing mobile agents mainly operate smartphones through graphical user interface (GUI) actions such as tapping, swiping, and typing, which often form long, interface-dependent sequences, cannot directly access device capabilities, and make execution boundaries difficult to define. We present PalmClaw, an open-source agent framework that runs natively on mobile phones and manages the sessions, memory, skills, tools, and agent loop directly on the device. PalmClaw exposes device capabilities as device tools with explicit arguments, structured results, and clearly defined execution boundaries. This design enables agents to use mobile capabilities directly while keeping each action explicit and controlled. Experiments show an 11.5% relative improvement in task success and a 94.9% reduction in completion time over the strongest baseline, with lower setup burden and traces illustrating how execution boundaries are applied. Code is available at https://github.com/ModalityDance/PalmClaw.

Published: July 14, 2026

Last updated: July 14, 2026

The Balanced Four-Color Theorem

Ken-ichi Kawarabayashi, Hirotaka Yoneda, Masataka Yoneda (cs.DS, math.CO)

We show that every planar graph with n ≥ 3 vertices admits a 4-coloring in which each color is used on fewer than n/2 vertices. This bound is the best possible. Moreover, such a coloring can be found in O(n log n) time. We also extend these results to five or more colors and to graphs on general surfaces.
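To make the statement concrete, here is a small checker for the property the theorem guarantees: a proper coloring with at most four colors in which every color class has fewer than n/2 vertices. It only verifies a given coloring; it is not the O(n log n) algorithm from the paper, and the function name and input format are assumptions made for illustration.

```python
def is_balanced_proper_coloring(n, edges, color):
    """Check a coloring of vertices 0..n-1.

    edges -- iterable of (u, v) pairs
    color -- list where color[v] is the color of vertex v
    Returns True if adjacent vertices differ in color, at most four
    colors are used, and every color is used on fewer than n/2 vertices.
    """
    if any(color[u] == color[v] for u, v in edges):
        return False  # not a proper coloring
    counts = {}
    for v in range(n):
        counts[color[v]] = counts.get(color[v], 0) + 1
    # 2 * c < n is the integer form of c < n / 2.
    return len(counts) <= 4 and all(2 * c < n for c in counts.values())


# A 4-cycle: the usual 2-coloring is proper but not balanced in this sense,
# because each color is used on exactly n/2 vertices.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(is_balanced_proper_coloring(4, cycle, [0, 1, 0, 1]))
print(is_balanced_proper_coloring(4, cycle, [0, 1, 2, 3]))
```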

Published: July 14, 2026

Last updated: July 14, 2026

A Shortcut to Statistically Steady-State Turbulence with Flow Matching

Gianluca Galletti, Gerald Gutenbrunner, William Hornsby, Lorenzo Zanisi, Naomi Carey, Stanislas Pamela, Johannes Brandstetter, Fabian Paischer (physics.plasm-ph, cs.LG)

Many nonlinear physical systems exhibit an initial transient phase in which perturbations grow before nonlinear interactions lead to a statistically steady state. While this saturated regime is of primary interest, direct numerical simulations must resolve the full transient dynamics before reaching it, incurring significant computational cost. In Computational Fluid Dynamics, reduced-order approaches such as Large Eddy Simulation mitigate computational cost by modeling small-scale dynamics, enabling tractable approximations of turbulent flows. In contrast, for systems such as gyrokinetics, comparably effective closures for the full dynamics are not generally available, and high-fidelity simulations remain necessary. Existing surrogate modeling approaches for these systems are autoregressive, hence they suffer from accumulating error. We instead propose to bypass explicit time evolution by directly modeling the distribution of saturated states under an ergodicity assumption, stating that ensemble averages over samples are equivalent to time averages of a single long simulation. We introduce GyroFlow, a latent generative model that directly estimates steady-state statistics of gyrokinetic turbulence in 5D phase space, without resolving the transient phase. GyroFlow generates saturated snapshots from noise, conditioned on dimensionless operating parameters and outperforms autoregressive, reduced-order, and other generative approaches, while providing substantial speedup. To evaluate generation quality we propose FGyD, a distributional metric computed in the latent space of a pretrained gyrokinetic model, and show that it correlates with downstream flux accuracy and solver convergence. Finally, GyroFlow can be used to warm-start the numerical code used to produce the data.

Published: July 14, 2026

Last updated: July 14, 2026

FlowWAM: Optical Flow as a Unified Action Representation for World Action Models

Yixiang Chen, Peiyan Li, Yuan Xu, Qisen Ma, Jiabing Yang, Kai Wang, Jianhua Yang, Dong An, He Guan, Gaoteng Liu, Jianlou Si, Jun Huang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang (cs.RO, cs.CV)

World Action Models (WAMs) are able to leverage pretrained video generators for both world modeling and action prediction. However, directly leveraging such video generators for control raises a new challenge: how to represent actions in a suitable form that aligns with pretrained video generators while carrying enough motion cues for accurate control. Existing numerical actions fail to satisfy the former, and prior visual action representations overlook the temporal motion structure across frames. We address this issue with FlowWAM, a dual-stream diffusion framework that adopts optical flow as a unified, video-native action representation. Flow videos share the same format as RGB videos and encode rich per-pixel displacement. By jointly modeling them within a shared pretrained video generator, FlowWAM can naturally implement two modes of WAMs. In policy mode, FlowWAM generates flow for action prediction, while in world-model mode, it uses target flow sequences to guide future video generation. Moreover, since flow can be easily extracted from raw videos without action labels, FlowWAM can leverage large-scale action-unlabeled video datasets for pretraining. We empirically find that our flow-based action representation delivers gains across both modes. On RoboTwin manipulation, FlowWAM raises the success rate to 92.94% on the Clean setting and 92.14% on Random, outperforming both VLA and WAM baselines. On WorldArena world modeling, it achieves the best overall EWMScore (63.71) with an 18.4% relative improvement in trajectory accuracy. More results can be found on our project website: https://flow-wam.github.io.

Published: July 14, 2026

Last updated: July 14, 2026

Privacy Attacks on Stable Marriage

Stephan A. Fahrenkrog-Petersen, Aleksander Figiel, Darya Melnyk, Tijana Milentijević, Stefan Schmid (cs.DS, cs.DC, cs.MA)

The stable marriage problem appears in many privacy-sensitive domains, for example in the National Resident Matching Program in the US. In such applications, preserving the privacy of users' preference lists is essential to prevent strategic manipulation, discourage misreporting, and comply with data protection regulations. In this work, we investigate privacy attacks on stable marriage algorithms. Assuming that the attacker (e.g., the hospitals) can repeatedly interact with the stable marriage algorithm, we demonstrate how such interactions can reveal private preferences of the non-malicious side (e.g., the residents). We show that the widely applied Gale-Shapley Matching Algorithm, where the proposers' side is malicious, is vulnerable to privacy attacks and all honest agents' preferences can be revealed. We further investigate which preference distributions of the honest, non-malicious side are susceptible to privacy attacks and show that the Gale-Shapley Matching Algorithm where the honest side proposes can preserve privacy in non-susceptible preference distributions. We extend our results to the decentralized setting and show that the attacker's side can infer all preference orderings. In an experimental evaluation, we test privacy attacks on synthetic and real-world data and show that real-world data is indeed susceptible to privacy attacks. This work underlines a need for new privacy-preserving stable marriage algorithms.
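For readers unfamiliar with the algorithm under attack, the following is a minimal sketch of textbook proposer-side Gale-Shapley (deferred acceptance). It assumes complete preference lists and equally sized sides; the names and the tiny hospital/resident instance are hypothetical. The privacy attacks themselves are not reproduced here.

```python
from collections import deque


def gale_shapley(proposer_prefs, receiver_prefs):
    """Return a dict mapping each proposer to its matched receiver.

    Both arguments map a name to that agent's preference list,
    most preferred first.
    """
    # rank[r][p] is the position of proposer p in receiver r's list.
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next index to propose to
    engaged_to = {}                               # receiver -> proposer
    free = deque(proposer_prefs)
    while free:
        p = free.popleft()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged_to.get(r)
        if current is None:
            engaged_to[r] = p          # r was unmatched and accepts
        elif rank[r][p] < rank[r][current]:
            engaged_to[r] = p          # r prefers p and drops current
            free.append(current)
        else:
            free.append(p)             # r rejects p, who proposes again later
    return {p: r for r, p in engaged_to.items()}


hospitals = {"h1": ["r1", "r2"], "h2": ["r1", "r2"]}
residents = {"r1": ["h2", "h1"], "r2": ["h1", "h2"]}
print(gale_shapley(hospitals, residents))
```

The output depends on the receivers' lists only through accept and reject decisions, which gives some intuition for why a proposing side that can rerun the algorithm with different submitted lists might learn about those private lists.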

Published: July 14, 2026

Last updated: July 14, 2026

Audio-Native Speech Recognition with a Frozen Discrete-Diffusion Language Model

Harsha Vardhan Khurdula, Abhinav Kumar Singh, Yoeven D Khemlani, Vineet Agarwal (cs.AI, cs.SD)

Automatic speech recognition is dominated by autoregressive decoders that emit one token at a time. We ask whether a discrete diffusion language model can transcribe speech instead, refining a whole transcript in parallel over a small number of denoising steps. We train an audio-native interface for DiffusionGemma, a 26B mixture-of-experts model that generates text by uniform, random-token discrete diffusion rather than the absorbing-mask scheme common to recent diffusion language models. A frozen Whisper encoder supplies acoustic features, a lightweight projector maps them into the model embedding space, and low-rank adapters let the frozen backbone attend to the new modality. About 42M parameters are trained, which is 0.16 percent of the backbone. We find that the natural training objectives fail to ground the audio because their gradient reaches the projector only through attention that has already dismissed it. A connectionist temporal classification loss applied through the frozen output head breaks this deadlock. The resulting model reaches 6.6 percent word error rate on LibriSpeech test-clean, transcribes in roughly eight parallel steps regardless of utterance length, and uses a single adapter trained on six languages, which we evaluate here on English, Hindi, and Mandarin.

Published: July 14, 2026

Last updated: July 14, 2026

Testing the Independent Set Property in Hypergraphs

Elena Grigorescu, Shreya Nasa, Cameron Seth (cs.DS)

The optimal sample complexity of testing if an n-vertex graph has an independent set of size ρn, or is ε-far from having an independent set of size ρn, was established to be O(ρ^3/ε^2), in a notable result by Blais and Seth (SICOMP 2025). In contrast, for q-uniform hypergraphs, there is a significant gap between the best known upper and lower bounds, and there has been no progress on the problem for the last two decades. In this work, we prove a new upper bound of O(qρ^2q-3/ε^2 (q-2)!^2) on the sample complexity of testing the ρ-independent set property. The previous best known upper bound was O(2^q q! ρ^2q/ε^3), due to Langberg (RANDOM 2004). This establishes the optimal dependence on ε and gives an exponential improvement in the dependence on q. We prove our result via a new application of the hypergraph container method.

Published: July 14, 2026

Last updated: July 14, 2026

Resilient Decentralized Ergodic Coverage for Scalable Multi-Robot Systems in Unknown Time-Varying Environments

Maria G. Mendoza, Victoria Marie Tuck, Chinmay Maheshwari, Shankar Sastry (cs.MA, eess.SY)

Maintaining situational awareness in high-stakes multi-robot applications requires balancing exploration of unobserved regions with sustained monitoring of changing Regions of Interest (ROIs), often under unknown and time-varying distributions, partial observability, and limited communication. We propose a decentralized multi-agent coverage framework that serves as a high-level planning strategy, in which each agent computes an adaptive ergodic policy, implemented via a Markov chain, that tracks an updated belief over the underlying importance map. Beliefs are maintained online via Gaussian Process (GP) regression from local noisy observations exchanged with neighbors. The resulting policy drives agents to spend time in ROIs in proportion to their estimated importance, while preserving sufficient exploration to detect and adapt to time-varying environmental changes. Unlike existing approaches that assume known importance maps, centralized coordination, or a static environment, our framework addresses the combined challenges of unknown, time-varying distributions under a decentralized, partially observable setting. We further show that our framework is robust to communication and memory degradation and to robot loss, and that it scales to hundreds of robots.
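One standard way to obtain a Markov chain whose long-run visit frequencies are proportional to an importance map is a Metropolis-style random walk. The sketch below illustrates that idea on a ring of cells. It is an assumption made for illustration, not necessarily the construction used in the paper, and the function name, the ring topology, and the fixed weights are hypothetical.

```python
def metropolis_ring(weights):
    """Transition matrix of a Metropolis walk on a ring of cells.

    weights -- positive importance estimates, one per cell (at least 3 cells)
    From cell i the walk proposes the left or right neighbour with
    probability 1/2 each and accepts with probability min(1, w_j / w_i).
    """
    n = len(weights)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in ((i - 1) % n, (i + 1) % n):
            P[i][j] += 0.5 * min(1.0, weights[j] / weights[i])
        P[i][i] = 1.0 - sum(P[i])  # rejected proposals stay in place
    return P


weights = [1.0, 2.0, 3.0, 4.0]
P = metropolis_ring(weights)
total = sum(weights)
pi = [w / total for w in weights]  # target: time proportional to importance
# One step of the chain started from pi should return pi (stationarity).
pi_next = [sum(pi[i] * P[i][j] for i in range(len(pi))) for j in range(len(pi))]
print(pi, pi_next)
```

In an adaptive setting the weights would be replaced by the current belief over the importance map and the matrix recomputed as that belief changes.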

Published: April 05, 2026

Last updated: July 14, 2026

Inclusive Federated Learning Through Compliance-Weighted Noise Allocation in Healthcare AI

Santhosh Parampottupadam, Melih Coşğun, Sarthak Pati, Maximilian Zenk, Saikat Roy, Dimitrios Bounias, Benjamin Hamm, Sinem Sav, Ralf Floca, Klaus Maier-Hein (cs.LG, cs.AI, cs.CR, cs.DC)

Background: Federated learning (FL) enables collaborative training of clinical AI models without centralizing patient data, but adoption is limited by privacy concerns, heterogeneous institutional compliance, and resource disparities; standard differential privacy (DP) applies uniform noise to all clients, penalizing well-compliant or under-resourced institutions. Objective: We introduce a compliance-aware FL framework that adapts DP to institutional compliance, letting lower-compliance sites participate without uniformly penalizing others. Methods: A compliance scoring tool aligned with HIPAA, GDPR, NIST, ISO, and HL7/FHIR maps each client score to a per-step Gaussian noise scale for server-side DP-SGD on a small aggregator dataset. The formal (ε,δ) bound applies to the aggregator dataset under a semi-honest aggregator; client-level DP needs secure aggregation (future work). We evaluate five FL strategies on PneumoniaMNIST and BreastMNIST (16 clients, 50 rounds, five seeds); the cumulative aggregator-dataset ε is about 1434 (Breast) and 513 (Pneumonia) at δ=10^-5. Results: Including 12 lower-compliance clients (Experiment 1) versus a compliant-only baseline (Experiment 4) changed BreastMNIST accuracy by +4.5 (FedAvg), +6.8 (FedMedian), +5.2 (FedProx), +1.6 (FedYogi), and -4.1 (FedAdam) percentage points (pooled +2.8 pp; not significant at n=5; up to +17 pp per configuration); compliance-weighted allocation matched uniform server-side DP at equal mean noise (+0.1 pp), carrying no utility penalty, and first-round noise cost 1.3 pp (Breast) and 2.5 pp (Pneumonia, FedAvg). Conclusions: Compliance-weighted server-side DP lets lower-compliance institutions join FL without degrading performance, giving auditable per-site noise control at no utility cost; formal guarantees apply to the aggregator dataset, with client-level DP requiring secure aggregation.

Published: May 28, 2025

Last updated: July 14, 2026

DermDepth: Toward Monocular Metric Scale 3D Reconstruction Models for Dermatology

Héctor Carrión, Narges Norouzi (cs.CV)

Dermatological practice routinely involves measuring and tracking lesion size, morphology and texture, as critical components of wound or skin cancer screening, monitoring and diagnosis. To accomplish this task, practitioners often image the skin surface with commonly available off-the-shelf camera sensors. This has led to an overwhelming research focus on 2D methods while these objectives naturally benefit from 3D information. In this paper, we demonstrate that dense monocular 3D reconstructions, metric scale measurements and rich surface normal texture estimates are achievable for both dermoscopic and macroscopic cases without the need for additional hardware or multiple captures. We present DermDepth, the first single-view metric scale 3D model for the dermatological domain and D-Synth, the first synthetic dermoscopic dataset with pixel-perfect 3D information. Our experiments show training DermDepth on D-Synth corrects metric scale error from over 16x to under 1.1x for real dermoscopic data, while preserving geometric quality and increasing texture richness. Fine-tuning on a small amount of real clinical samples generalizes our method across three real-world benchmarks spanning the few mm to hundred cm range, diverse skin-tones, chronic wound cases and produces measurements broadly consistent with disease size reported in medical literature. All code, data and models are available at https://github.com/hectorcarrion/dermdepth.

Published: July 14, 2026

Last updated: July 14, 2026

Dynamic Resource Allocation for Ensemble Determinization MCTS

Jakub Kowalski, Adam Ciężkowski, Artur Krzyżyński, Mark H. M. Winands (cs.AI)

Simulation-based algorithms are especially suited for high-uncertainty environments such as adversarial board games with significant elements of randomness and hidden information. In particular, several Monte Carlo Tree Search (MCTS) variants are commonly used in such domains. In this paper, we propose a series of enhancements for Ensemble Determinization MCTS, introducing two axes for dynamic resource allocation. First, Dynamic Number of Determinizations increases or decreases the number of determinization trees currently in use, depending on how the search has behaved so far. Second, Dynamic Simulation Allocation splits the simulation budget nonuniformly across the determinization trees, using simulation-to-simulation decisions to choose the tree with potentially the best knowledge gain. As benchmark domains, we used three popular tabletop games: Jaipur, Lost Cities, and Splendor. Testing our proposed enhancements in iteration- and time-based settings showed that particular configurations yield a statistically significant increase in the algorithm's strength.

Published: July 14, 2026

Last updated: July 14, 2026

The Spectrum Is Not Enough: When Context Helps Time-Series Forecasting

Mert Onur Cakiroglu, Mehmet Dalkilic, Hasan Kurban (cs.LG)

A growing family of indices scores how predictable a series is from its spectrum. Practitioners increasingly read these scores as answering a different question: whether adding context, a longer lookback, a retrieval plug-in, or a pretrained model, will help. These are not the same question. The value of context is a property of the operating point, not of the series. Any index built from the power spectrum is invariant under phase randomization, whereas the beyond-second-order value that retrieval and foundation models supply is not, because a phase-randomized series is asymptotically Gaussian. We state this as an impossibility result and isolate it with surrogate pairs that fix the spectrum and the marginal by construction. We then give a label-free, configuration-level diagnostic, the coverage deficit, whose principal term measures beyond-spectrum structure as the gain of analog over linear prediction. On seven benchmarks the prediction holds: window-keyed retrieval's value collapses across surrogate pairs (ECL median +33%→-35%, p<10^-40) while every spectral index stays frozen; a foundation model's value splits into a surviving second-order part and a small beyond-linear margin that collapses; a longer linear window's value survives. Leave-one-dataset-out, the structure term predicts the sign of beyond-spectrum value where the spectral indices trail it, and the reverse holds for the second-order mechanism. We introduce no new forecaster; the contribution is the distinction, a controlled comparison, and a diagnostic for the deployment decision. Code: https://anonymous.4open.science/r/SINE.
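The invariance claim can be checked numerically. The sketch below builds a plain phase-randomized surrogate with a small pure-Python DFT and confirms that its power spectrum matches the original. It illustrates only the spectral invariance; unlike the surrogate pairs described in the abstract, it does not also fix the marginal distribution, and all function names are hypothetical.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of dft."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def phase_randomize(x, rng):
    """Surrogate with the same power spectrum as x but random phases."""
    n = len(x)
    X = dft(x)
    Y = list(X)  # the mean (k = 0) and, for even n, the Nyquist bin are kept
    for k in range(1, (n + 1) // 2):
        Y[k] = abs(X[k]) * cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))
        Y[n - k] = Y[k].conjugate()  # conjugate symmetry keeps the series real
    return [v.real for v in idft(Y)]


series = [0.0, 1.0, 3.0, 2.0, -1.0, -2.0, 0.5, 1.5]
surrogate = phase_randomize(series, random.Random(0))
power = [abs(c) ** 2 for c in dft(series)]
power_surrogate = [abs(c) ** 2 for c in dft(surrogate)]
print(max(abs(a - b) for a, b in zip(power, power_surrogate)))
```

Any index computed from `power` alone therefore returns the same score for `series` and `surrogate`, even though the two series differ sample by sample.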

Published: July 14, 2026

Last updated: July 14, 2026

Propheticus: Machine Learning Framework for the Development of Predictive Models for Reliable and Secure Software

João R. Campos, Marco Vieira, Ernesto Costa (cs.LG, cs.AI, stat.ML)

The growing complexity of software calls for innovative solutions that support the deployment of reliable and secure software. Machine Learning (ML) has shown its applicability to various complex problems and is frequently used in the dependability domain, both for supporting systems design and verification activities. However, using ML is complex and highly dependent on the problem at hand, increasing the probability of mistakes that compromise the results. In this paper, we introduce Propheticus, an ML framework that can be used to create predictive models for reliable and secure software systems. Propheticus attempts to abstract the complexity of ML whilst being easy to use and accommodating the needs of the users. To demonstrate its use, we present two case studies (vulnerability prediction and online failure prediction) that show how it can considerably ease and expedite a thorough ML workflow.

Published: September 06, 2018

Last updated: July 14, 2026

Watermark Forensics for Generative Models: An Information-Theoretic Perspective

Xiaoyu Li, Zheng Gao, Xiaoyan Feng, Jiaojiao Jiang, Yulei Sui, Jiankun Hu (cs.CR, cs.IT, cs.LG)

A watermark in a generative model's output is usually asked only whether a text is machine-made. The same mark can do more: attribute it to the user who produced it, extract a hidden payload, or localize the part that survives editing. These form a forensic ladder, and we ask what each rung costs in the sample length n. One object organizes the answers. Let S be the secret the mark carries (a user's identity or payload), and let the information profile ν(t)=I(S;X_t| X_<t) record how much the t-th token reveals about S given the earlier ones. Its total mass pays for attribution and extraction; how that mass is spread pays for localization; and detection alone is paid for not by information but by presence, the distance from the marked to the unmarked distribution. The literature's two quality models, a mark subtle on every token and one that stamps a few tokens loudly, are two incomparable ways of capping this profile. Our main theorem settles the ladder's entropy column. For statistically distortion-free schemes, attributing a text to one of N users costs Θ(log N/h) tokens over every stationary-ergodic source of entropy rate h, sharp to a (1+o(1)) factor: to our knowledge the first tight entropy-rate law for multi-user attribution (via exact alignment). The natural collision-counting analysis overcharges without bound; only a decoder thresholding each candidate by its own realized surprisal attains the rate while almost never implicating an innocent user. A matching converse makes the law two-sided, and extraction of an ℓ-bit payload costs Θ(ℓ/h). Two gaps are real, not modeling artifacts: a Θ(log N)-token window in which a text is provably machine-made yet unattributable, and a footprint-resolution uncertainty principle. Experiments on GPT-2, Pythia-410M, and Qwen2.5 recover the predicted constants.
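To get a feel for the scale of the attribution law, the leading-order term log N / h can be evaluated directly. The sketch below assumes the logarithm and the entropy rate are both measured in bits and drops the (1+o(1)) factor; the function name and the example numbers are hypothetical and are not results from the paper.

```python
import math


def leading_order_tokens(num_users, entropy_rate_bits):
    """Leading-order token count log2(N) / h, with h in bits per token.

    Ignores the (1 + o(1)) factor, so it is a scale estimate only.
    """
    return math.log2(num_users) / entropy_rate_bits


# Hypothetical numbers: about a million users, 2 bits of entropy per token.
print(leading_order_tokens(2 ** 20, 2.0))
```

The same arithmetic shows why the count grows slowly with the user base: squaring the number of users only doubles the leading-order token count.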

Published: July 14, 2026

Last updated: July 14, 2026

The TIME Machine: On The Power of Motion for Efficient Perception

Mantas Skackauskas, Xinyue Hao, Laura Sevilla-Lara (cs.CV, cs.AI, cs.LG)

Video representation learning has seen tremendous progress in recent years. This has been driven by many factors, including the scale of training and the success of visual models trained contrastively with language. While these factors have pushed the boundaries of what video models can do, they also introduce their own set of limitations: first, scaling video models can reach prohibitive costs and second, learning from language restricts the range of concepts that can be learned to those in captions. As a result, video models still struggle with temporal understanding. In this paper we propose a novel approach that uses motion as the central modality for video representation. In particular, given the motion in a video in the form of point-tracks, we use a masked-autoencoder to mask some of the tracks and train the autoencoder to reconstruct the missing tracks. This allows us to learn a representation in a self-supervised manner. We show that using motion to represent videos actually addresses both of the core limitations of video technology. First, it allows us to massively reduce the scale of training data, as motion is inherently appearance-independent and hence needs fewer examples to generalize well. Second, motion allows us to bypass the language-dependent training paradigm, learning better fine-grained concepts. The result is an embedding that we call TIME (Temporally Informed Motion Embedding), a representation trained exclusively on synthetic motion data. We test this embedding on a wide set of tasks in a zero-shot manner. We observe that without bells and whistles, performance is on par with state-of-the-art models using up to 4 orders of magnitude less training data. This is a stepping stone towards a new paradigm of video models that are both more temporally aware and more scalable.

Published: May 21, 2026

Last updated: July 14, 2026

LapSurgie: Humanoid Robots Performing Surgery via Teleoperated Handheld Laparoscopy

Zekai Liang, Xiao Liang, Soofiyan Atar, Sreyan Das, Zoe Chiu, Peihan Zhang, Calvin Joyce, Florian Richter, Shanglei Liu, Michael C. Yip (cs.RO)

Robotic laparoscopic surgery has gained increasing attention in recent years for its potential to deliver more efficient and precise minimally invasive procedures. However, adoption of surgical robotic platforms remains largely confined to high-resource medical centers, exacerbating healthcare disparities in rural and low-resource regions. To close this gap, a range of solutions has been explored, from remote mentorship to fully remote telesurgery. Yet, the practical deployment of surgical robotic systems to underserved communities remains an unsolved challenge. Humanoid systems offer a promising path toward deployability, as they can directly operate in environments designed for humans without extensive infrastructure modifications -- including operating rooms. In this work, we introduce LapSurgie, the first humanoid-robot-based laparoscopic teleoperation framework. The system leverages an inverse-mapping strategy for manual-wristed laparoscopic instruments that abides by remote center-of-motion constraints, enabling precise hand-to-tool control of off-the-shelf surgical laparoscopic tools without additional setup requirements. A control console equipped with a stereo vision system provides real-time visual feedback. Finally, a comprehensive user study across platforms demonstrates the effectiveness of the proposed framework and provides initial evidence for the feasibility of deploying humanoid robots in laparoscopic procedures.

Published: October 03, 2025

Last updated: July 14, 2026

Bringing Back Rule Induction to Fluid Intelligence Research? An Initial Validation of the ARC-AGI Benchmark in Humans

Jasmin Thelen, Oliver Wilhelm (cs.AI, cs.LG)

Two competing perspectives on fluid intelligence (gf) measures propose that performance is primarily constrained either by working memory capacity or by the ability to induce novel relations. The first perspective is currently dominant in measurement, as evident from the use of a limited set of recurring rules, whereas the second perspective is reflected in many definitions but rarely present in measurement. The ARC-AGI benchmark predominantly requires rule induction and was proposed as a measure of gf for both humans and artificial systems. However, its psychometric properties have not yet been examined in human samples. We therefore investigated the psychometric characteristics and nomological network of ARC-AGI in a first study with 100 participants. A compilation of ARC-AGI items showed good psychometric properties and correlated substantially with figural fluid intelligence as measured by a figural reasoning test (ρ = .63). Associations with figural originality were weak. These findings provide initial support for the validity of ARC-AGI as a measure of human fluid intelligence. Future research should include more rule induction tasks as well as additional multivariate covariates. This study is unusual in studying, in humans, a task that was initially designed for machines. We suggest systematically embedding AI benchmarks into the nomological network of human cognitive abilities to enable more systematic evaluation and interdisciplinary cooperation.

Published: July 13, 2026

Last updated: July 14, 2026

Polycepta: Object-Centric Appearance Estimation for Multi-Object Tracking

Mohamed Nagy, Naoufel Werghi, Jorge Dias, Majid Khonji (cs.CV, cs.AI)

The tracking-by-detection paradigm in multi-object tracking (MOT) typically relies on static appearance descriptors to complement motion estimation. However, these descriptors are frame-independent, limiting their robustness as visual cues. Since such descriptors are often obtained from computationally intensive pretrained backbones, real-time MOT systems frequently abandon appearance cues altogether and rely solely on motion prediction and geometric association. In this work, we introduce Polycepta, an object-centric appearance state estimation framework that reformulates appearance modeling as a recursive estimation problem rather than a frame-wise matching task. Polycepta constructs and continuously updates an independent appearance state for each tracked object, enabling future appearance representations to be estimated from accumulated observations. A proposed learning strategy encourages Polycepta to learn how to construct the appearance state of object-specific representations rather than memorize those representations, enabling appearance estimation for unseen classes. A key property of Polycepta is that the quality of appearance estimation improves as object states evolve during inference. While conventional appearance descriptors remain static or degrade over time, Polycepta progressively refines appearance estimates as additional observations are accumulated. Extensive experiments on KITTI, the Waymo Open Dataset, and MOT17 demonstrate consistent reductions in identity switches and improvements in tracking performance when integrated into tracking-by-detection pipelines. Polycepta operates at 90.57 Hz and delivers state-of-the-art performance on the KITTI benchmark, achieving a MOTA of 92.27%.

Published: June 22, 2026

Last updated: July 14, 2026

VL-Nav: Neuro-Symbolic Reasoning-based Vision-Language Navigation

Yi Du, Taimeng Fu, Zhipeng Zhao, Shaoshu Su, Zitong Zhan, Qiwei Du, Zhuoqun Chen, Bowen Li, Chen Wang (cs.RO, cs.CV)

Navigating unseen, large-scale environments based on complex and abstract human instructions remains a formidable challenge for autonomous mobile robots. Addressing this requires robots to infer implicit semantics and efficiently explore large-scale task spaces. However, existing methods, ranging from end-to-end learning to foundation model-based modular architectures, often lack the capability to decompose complex tasks or employ efficient exploration strategies, leading to aimless wandering or target recognition failures. To address these limitations, we propose VL-Nav, a neuro-symbolic (NeSy) vision-language navigation system. The proposed system intertwines neural reasoning with symbolic guidance through two core components: (1) a NeSy task planner that leverages a symbolic 3D scene graph and image memory system to enhance the neural reasoning capabilities of vision-language models (VLMs) for task decomposition and replanning; and (2) a NeSy exploration system that couples neural semantic cues with a symbolic heuristic function to efficiently gather task-related information while minimizing unnecessary repeat travel during exploration. Validated on the DARPA TIAMAT Challenge navigation tasks, our system achieved an 83.4% success rate (SR) in indoor environments and 75% in outdoor scenarios. VL-Nav achieved an 86.3% SR in real-world experiments, including a challenging 483-meter run. Finally, we validate the system with complex instructions in a 3D multi-floor scenario.

Published: February 02, 2025

Last updated: July 14, 2026

X-Lens: Real-Time Metric Depth Estimation with Heterogeneous Cameras

Heng Zhou, Shuhong Liu, Yonghao He, Bohao Zhang, Fa Fu, Chenhui Hou, Xianbao Hou, Lijun Han, Wei Sui (cs.CV)

We present X-lens, a compact feed-forward model for metric depth estimation from a variable number of calibrated fisheye and pinhole views. To support real-time downstream perception, X-lens is built around a geometry-aware heterogeneous camera formulation with two key components. Learnable calibration tokens provide a coarse alignment between fisheye and pinhole projective spaces, while a Jacobian-parameterized distortion bias injected into cross-attention models local projection changes and promotes cross-camera consistency, enabling robust generalization with only 0.04B parameters and up to 41 FPS. The model predicts dense depth together with a global metric scale, avoiding auxiliary reconstruction targets that increase computation and optimization complexity. To learn such cross-camera generalization at scale and depth, X-lens is trained on multiple public datasets and OmniScene, our newly released large-scale synthetic dataset containing approximately 266K synchronized six-view frames, 1.7M individual images, and 103 indoor and outdoor scenes. Extensive experiments on both real-world and synthetic indoor and outdoor datasets demonstrate superior heterogeneous-camera metric depth accuracy, reducing AbsRel by 25.4% on OmniScene-Full over the strongest baseline while using 88.9% fewer parameters, with competitive performance on conventional fisheye-only and pinhole-only settings.

Published: July 14, 2026

Last updated: July 14, 2026

SPECTRA: Context-Conditioned Spectral Movement Primitives for Robot Skill Generalization

Boxuan Zhang, Sheng Liu, Chenlin Ming, Ahmed Abdelrahman (cs.RO)

Robot imitation learning for manipulation should preserve demonstrated task geometry while producing dynamically admissible robot motions. Existing pipelines often learn task-dependent trajectories and impose execution limits afterward through filtering, smoothing, clipping, or time scaling, which may distort task-critical end-effector paths. We propose the Spectral Movement Primitive (SMP), a frequency-domain imitation learning framework that couples task-space skill generation with joint-space execution regulation. Demonstrations are represented by truncated finite-horizon Fourier coefficients. An empirically selected low-frequency task band captures the dominant motion geometry, while higher harmonics contribute disproportionately to derivative growth. A frame-aware context-conditioned GMM/GMR prior predicts the task-band coefficients in a canonical task frame, and the resulting Cartesian trajectory is mapped to joint space through sequential inverse kinematics. A phase-coupled regulator then limits the requested phase progression without modifying the spectral coefficients, thereby enforcing joint velocity and acceleration limits while preserving the represented path. Experiments evaluate task-band reconstruction, robustness to composite demonstration corruption, out-of-distribution cross-board generalization, joint-space dynamic admissibility, end-effector path preservation, and deployment on a Franka Panda robot. Results show compact geometric reconstruction, consistent transfer across unseen task frames, substantial reductions in dynamic violations and jerk, and preservation of the intended end-effector path during phase regulation.

Published: July 08, 2026

Last updated: July 14, 2026

ChunkFlow: Towards Continuity-Consistent Chunked Policy Learning

Zhao Yang, Yinan Shi, Mingyuan Yao, Wenyao Xue, Yawei Jueluo, Longjun Liu (cs.RO)

Vision-language action (VLA) models increasingly adopt chunked action heads to satisfy real-time constraints; however, this introduces boundary jitter: overlapping regions between consecutive chunks often yield inconsistent predictions, degrading temporal coherence and the task success rate. Existing methods, such as inference-time blending, merely reweight mismatched proposals without correcting underlying errors, leading to residual accumulation under biased or noisy histories. We propose ChunkFlow, a seam-aware training-and-execution framework for chunked policies that aligns chunk structure with boundary execution. It partitions each chunk into frozen, editable, and future zones, applies deterministic overlap blending at execution, and trains raw predictions with seam and first- and second-order continuity losses. History corruption and scheduled sampling improve robustness to executed-history errors, while an AWAC fine-tuning stage adapts the policy without removing these structural regularizers. Under mild smoothness assumptions, pre-blending seam discrepancies provably decay with increasing overlap. Experiments on CALVIN, LIBERO, and real robots show an improved success-stability trade-off with low-latency inference. Project page: https://cytoderm-ai.github.io/chunkflow.

Published: July 14, 2026

Last updated: July 14, 2026

Growing a Tail: Increasing Output Diversity in Large Language Models

Michal Shur-Ofry, Bar Horowitz-Amsalem, Adir Rahamim, Yonatan Belinkov (cs.CL, cs.CY)

How diverse are the outputs of large language models when diversity is desired? We examine the diversity of responses of several language models to questions with multiple possible answers, comparing them with human responses. Our findings suggest that models' responses are highly concentrated, reflecting narrow, mainstream outputs, in comparison to humans, whose responses exhibit a much longer tail. We examine three simple and practical ways to increase output diversity: 1) increasing generation randomness via temperature sampling; 2) prompting models to answer from diverse perspectives using a single prompt; 3) aggregating outputs from several models. We find that these interventions, especially when combined, can substantially increase output diversity, although single-model outputs generally remain less diverse than the human baseline. We discuss potential implications of these findings for future work in AI policy and governance that wishes to preserve cultural diversity, an essential building block of a democratic social fabric.

Published: November 05, 2024

Last updated: July 14, 2026

Beyond Points: Spherical Distributional Part Prototypes for Interpretable Classification

Duarte Leão, Diogo Pereira Araújo, Catarina Barata, Carlos Santiago (cs.CV)

Prototype-based neural networks aim to provide intrinsic interpretability by grounding predictions in a small set of part prototypes. However, modern vision backbones typically operate in normalized, directional embedding spaces where each semantic part exhibits substantial intra-class variability. As a result, point prototypes often become redundant or unstable, hurting both explanation quality and robustness. We propose vMFProto, a distributional part-prototype framework that models each class as a mixture of von Mises-Fisher components on the hypersphere. Each prototype learns its own concentration, capturing part-specific variability, and we use entropic optimal transport (OT) to obtain structured patch-to-prototype assignments. A two-stage training schedule performs OT-driven prototype discovery followed by end-to-end refinement with patch-level distillation and distribution-aware diversity regularization. Experiments on CUB-200-2011, Stanford Dogs, and Stanford Cars with frozen DINO backbones show that vMFProto achieves state-of-the-art explanation quality (consistency, stability, and distinctiveness) with competitive accuracy. Qualitative results confirm that vMFProto yields localized, non-redundant part evidence.

Published: June 25, 2026

Last updated: July 14, 2026

Controllable Generation of Diverse Dermatological Imagery for Fair and Efficient Malignancy Classification

Héctor Carrión, Narges Norouzi (cs.CV)

Accurate dermatological diagnosis naturally necessitates equitable performance across diverse populations, yet a systematic lack of expertly annotated images, especially for underrepresented skin tones and rare diseases, impedes progress toward measurably fair methods. We introduce cgDDI (Controllable Generation of Diverse Dermatological Imagery), a hybrid framework that (1) synthesizes realistic healthy skin samples without disturbing other input properties, (2) maps single-sample rare lesions onto novel skin-tones and locations non-parametrically, and (3) allows for efficient parametric generation with as few as 10 training samples. The framework supports both human and automated segmentation masking, enabling scalability to datasets without pre-made lesion masks. We grow a 656-image dataset by more than 400x and validate across two datasets: biopsy-confirmed Diverse Dermatology Images (DDI) and expert-verified Fitzpatrick17k (F17k). On the DDI benchmark, we achieve malignancy classification accuracy of 86.4% under synthetic-only training and 90.9% state-of-the-art performance with real data fine-tuning, alongside leading fairness metrics. Cross-dataset experiments show +13.9% accuracy improvements on unseen F17k data despite minimal disease overlap. We openly release 266k+ synthetic images, code, and generative models to further support fairness research at https://github.com/hectorcarrion/ControllableGenDDI.

Published: July 14, 2026

Last updated: July 14, 2026

Win by Silence: Deletion Non-Monotonicity, Autonomous Exploitation, and Typed-State Gating in LLM Plan Evaluation

Aleh Manchuliantsau (cs.AI, cs.SE)

Plan evaluators can reward a strategic plan for becoming less explicit. This paper studies that failure in a staged expected-value scorer for LLM-generated venture routes. Proposition 1 gives the score change from deleting an interior transition while retargeting its predecessor and retaining downstream value: Delta_k = (prod_{i<k} p_i)[c_k + (1 - p_k)R_{k+1}]. On a frozen 26-route cohort, all 57 admissible deletions matched the analytic identity and threshold sign, and every route had at least one score-improving deletion. A score-seeking optimizer, allowed to restructure routes but not told the exploit mechanism, found baseline-beating uncovered structures in 21/26 routes. GATE refused score release for 26/26 silenced routes with 0/26 honest suspensions; after refusal, 47/54 next revisions repaired to a covered structure, and strict covered improvement rose from 1/26 to 13/26. An adaptive compiler-aware co-author exposed the registry-provenance boundary: obligation-channel evasions remained 6/6 across all four v1/v1.5 conditions, while delta-indexed cost floors reduced beat-honest routes from 6/6 to 3/6 and fundability-by-silence from 5/6 to 0/6 without establishing semantic completeness. If a plan scores better only because it omits necessary work, the plan did not improve; the evaluation created an omission incentive. PCSC detects and neutralizes post-hoc omission splices over model-mediated typed-state records. In the cooperative setting tested, GATE acts as a deterministic search-shaping constraint, not merely a post-hoc filter. It does not verify the semantic completeness or real-world quality of arbitrary LLM-generated strategies.
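
The identity in Proposition 1 can be checked numerically. The sketch below assumes a staged scorer of the form R_k = -c_k + p_k * R_{k+1} (pay cost c_k, succeed with probability p_k, then collect the downstream value R_{k+1}). The abstract does not spell out the scorer, so this recursion, the function names, and the example numbers are illustrative assumptions, not the paper's implementation.

```python
from math import prod

def route_score(p, c, terminal):
    """Score of a staged route under the ASSUMED recursion R_k = -c_k + p_k * R_{k+1}."""
    value = terminal
    # Fold from the last stage back to the first.
    for p_k, c_k in zip(reversed(p), reversed(c)):
        value = -c_k + p_k * value
    return value

def deletion_delta(p, c, terminal, k):
    """Closed-form score change from deleting stage k (0-indexed here)."""
    downstream = route_score(p[k + 1:], c[k + 1:], terminal)  # R_{k+1}
    return prod(p[:k]) * (c[k] + (1 - p[k]) * downstream)

# Hypothetical 4-stage route: success probabilities, stage costs, terminal value.
p = [0.9, 0.6, 0.8, 0.7]
c = [1.0, 2.0, 0.5, 1.5]
terminal = 20.0

k = 1  # delete the second stage
before = route_score(p, c, terminal)
after = route_score(p[:k] + p[k + 1:], c[:k] + c[k + 1:], terminal)
delta = deletion_delta(p, c, terminal, k)

# The directly recomputed change should match the closed form.
print(abs((after - before) - delta) < 1e-9)
```

Under this assumed scorer, deleting any stage with a non-negative cost and a non-negative downstream value cannot lower the score, which is the omission incentive the abstract describes.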

Published: July 14, 2026

Last updated: July 14, 2026

Resist and Update: Counterfactual Report Coordinates for Incentive-Compatible LLMs

Sen Yang, Yuen-Hei Yeung (cs.AI)

Aligned language models routinely misreport under non-evidential incentive pressure: they agree with a confident user or overstate certainty even when their internal belief is unchanged. We cast this as a failure of internal incentive-compatibility (IC) and present a method for learning and certifying counterfactual report mediators that hold a model's reports to a causal contract: invariant to forbidden influences (pressure, prestige, restyling) and responsive to licensed ones (genuine evidence). These two demands, resist and update, pull in opposite directions. We study them on a Bayesian-witness benchmark with known posteriors, in which the same user disagreement is licensed evidence or forbidden pressure purely by stated source reliability. We (i) causally identify, by interchange interventions rather than probe accuracy, low-rank report coordinates for answer, confidence, and caveat that are near-orthogonal and independently controllable, and (ii) introduce a training-free counterfactual report-coordinate (CRC) clamp that references the model's own report under a counterfactually incentive-neutralized context. On the witness benchmark the two-pass clamp attains resist and update of 1.00 jointly (Wilson 95% CI [0.99,1.00]), a causal certificate under a constructible reference, not a deployed solution. Global decoding and steering show a single-parameter tradeoff; output-level fine-tuning matches both objectives only when both are enumerated; resist-only training loses evidence-responsiveness. The deployable single-pass compilation is lossy (0.73/0.97). The mechanism and clamp reproduce across three model families and transfer to a natural sycophancy benchmark (SycophancyEval). Our contribution is the interface and certification method: activation-level counterfactual incentive-invariance as a structural primitive for internal IC.
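
For readers unfamiliar with the Wilson interval quoted above, a minimal sketch of the standard formula follows. The abstract does not report the number of trials, so the count of 400 below is purely hypothetical and is not meant to reproduce the paper's interval.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    phat = successes / n
    denom = 1 + z * z / n
    centre = (phat + z * z / (2 * n)) / denom
    half = z * sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical counts: every one of 400 trials succeeds.
lo, hi = wilson_interval(400, 400)
print(lo, hi)
```

Unlike the simple normal-approximation interval, the Wilson interval does not collapse to zero width when the observed rate is exactly 1.00, which is why a lower bound below 1 can be reported for a perfect score.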

Published: July 14, 2026

Last updated: July 14, 2026

The log log jam in Gaussian state tomography

Sitan Chen, Weiyuan Gong, Qi Ye, Zhihan Zhang (quant-ph, cs.DS, math-ph)

Unlike in finite dimensions, quantum information in continuous-variable systems has the peculiar feature that without imposing physical constraints, the sample complexity of state tomography can be unbounded. Remarkably, this is even the case for state-of-the-art protocols for learning Gaussian states, which have finite-dimensional descriptions: the best known rates scale with loglog E, where E is the energy of the system. We prove this is not an artifact of existing analyses, but a fundamental limitation of the measurements used. We show: (1) Any protocol that uses Gaussian measurements, even entangled or adaptively chosen ones, must incur a loglog E dependence. This answers an open question posed by a number of previous works. (2) There is a smooth tradeoff between the number of rounds of adaptivity and the energy dependence, and we give a matching protocol achieving this interpolated rate. (3) With highly entangled, non-Gaussian measurements, one can learn n-mode pure Gaussian states with O(n^2 / ε^2) samples, independent of E. This answers an open question posed by Chen et al. (4) A simple protocol based on the single-copy canonical phase POVM of Holevo and Helstrom learns single-mode pure Gaussian states with O(1/ε^2) samples, again independent of E. Our results clarify the role of energy in bosonic state tomography and shed new light on the intriguing interplay between adaptivity, entanglement, and magic in quantum learning.

Published: July 14, 2026

Last updated: July 14, 2026

FormalAnalyticGeo: A Neural-Symbolic Based Framework for Multimodal Analytic Geometry Problem Generation

Ruoran Xu, Wending Gao, Qiufeng Wang (cs.AI, cs.MA, cs.SC)

Math reasoning has achieved significant progress with the rapid advancement of Multimodal Large Language Models (MLLMs); however, analytic geometry remains largely underexplored, primarily due to the scarcity of annotated samples. Existing diagram generation approaches struggle with analytic geometry: template methods cannot handle constraint-driven layouts, and generative models lack the geometric precision to render annotated conic curves correctly. We present FormalAnalyticGeo, a scalable framework for fully automatic generation of multimodal analytic geometry problems. Leveraging the rigor of formal languages, we design the framework around CDL (Condition Description Language), a formal intermediate representation that bridges free-form problem text with precise diagram rendering via a Signed Distance Field (SDF) engine. The framework employs four specialized LLM components in sequence: a Generator that produces diverse analytic geometry problems, a Formalizer that converts each problem into CDL for SDF-based rendering, a Measurer that extracts ground-truth answers through vision-based measurement on the rendered diagrams, and a Quality Verifier that checks outputs at three stages. Structured feedback from the Quality Verifier drives automatic retry, forming a closed loop that eliminates any need for human annotation. Applying FormalAnalyticGeo at scale yields AnalyticGeo7K, a dataset of over 7K verified multimodal problems, each with aligned text, diagram, formal annotation, and ground truth. Experiments show that the generated problems achieve a median ground-truth relative error of 0.70%, with 82.3% of answers falling within 5% of the exact symbolic solution. Our framework and dataset will be publicly released.

Published: July 14, 2026

Last updated: July 14, 2026

Ensemble Controlled-Flow Filtering for Implicit Data Assimilation

Zhuoyuan Li, Yue Zhao, Ming Li (stat.ML, cs.LG, math.NA, math.OC)

Data assimilation estimates the state of a dynamical system from model forecasts and incoming observations. Many observation mechanisms, however, are many-to-one, implicit, non-smooth, or accessible only through simulation, and need not provide the residual structures or likelihood guidance required by existing ensemble filters. We introduce implicit data assimilation, in which the analysis law is defined as an energy tilt of the forecast distribution. We then propose the Ensemble Controlled-flow Filter (EnCF), which realizes this update through a stochastic controlled flow and learns the observation-dependent control by adjoint matching from terminal energy gradients. For simulator-defined observations, EnCF-LF learns a surrogate conditional energy from samples and applies the same controlled-flow solver. We prove ideal exactness, derive a one-step error decomposition, and establish non-accumulation of local errors under filter stability. Numerical results show that Kalman-type filters remain preferable for smooth additive-Gaussian observations, while the proposed methods are better suited to non-Gaussian, many-to-one, multimodal, and implicit observation models.

Published: July 14, 2026

Last updated: July 14, 2026

When and Why Does Multi-Agent Debate Fail and Does It Really Underperform?

Yongqiang Chen, Gang Niu, James Cheng, Bo Han, Masashi Sugiyama (cs.LG)

Multi-agent debate (MAD) was proposed as a promising approach for ensembling the wisdom of multiple large language models (LLMs) to improve reasoning and provide effective supervision to superhuman LLMs. However, increasing empirical evidence suggests that MAD may not outperform, and may even significantly underperform, single-agent approaches (SA), raising doubts about the benefits of MAD. In this work, we investigate this issue by analyzing the incentive structures of popular MAD paradigms: (i) competitive MAD (CopMAD), where agents compete by holding opposing positions; (ii) consensus-seeking MAD (CosMAD), where agents are driven to seek consensus. We show that both paradigms suffer from debate hacking: CopMAD reduces to a cheap-talk game, where agents produce misleading messages to win the game, while CosMAD filters out informative disagreements for premature consensus. Consequently, agents in both CopMAD and CosMAD fail to jointly resolve the ambiguity and seek the truth. To this end, we introduce ColMAD, a collaborative protocol that reframes MAD as a non-zero-sum game to encourage agents to provide informative yet truthful messages. Through extensive benchmarking on challenging tasks such as error detection, we show that ColMAD significantly outperforms previous MAD protocols by up to 10 percentage points. Under the same budgets, ColMAD brings non-trivial improvements over SA methods, implying that the protocol design is critical to realizing the potential of MAD.

Published: October 23, 2025

Last updated: July 14, 2026

Invariant Learning Dynamics of Transformers in Inductive Reasoning Tasks

Tiberiu Musat, Tiago Pimentel, Nicolas Zucchet, Thomas Hofmann (cs.LG, cs.AI)

We present a theoretical framework to explain the emergence of inductive reasoning abilities in Transformer language models. While previous works on Transformer learning dynamics have so far been mostly tied to specific tasks, we study a generalized class of inductive tasks that unifies several synthetic tasks known in the literature, including in-context n-grams and multi-hop reasoning. In this class, we theoretically prove that the training dynamics of attention models can be confined to a highly interpretable, low-dimensional invariant manifold. On this manifold, the learning dynamics are captured by a handful of interpretable coordinates rather than millions of parameters, making both theoretical and empirical analysis more tractable. Using this framework, we characterize how data statistics govern the competition between in-context and in-weights learning, we study how random initializations determine the 'winning' circuit when multiple solutions are possible, and we demonstrate that the coordinate frame associated with the manifold can be used to automatically detect which circuits have been learned in trained models. By casting circuit formation as a low-dimensional dynamical phenomenon, we take a step toward a predictive theory of how Transformers learn.

Published: July 13, 2026

Last updated: July 14, 2026

MAMMOTH: A Multi-Modal End-to-End Policy for Off-Road Mobility Robust to Missing Modality

Ahaan Kotian, Shivani Subramanyan, Suresh Sundaram (cs.RO)

Reliable autonomous navigation in unstructured off-road environments remains a critical unsolved challenge due to extreme terrain diversity, drastic illumination variations and acute sensor degradation. Recent developments have approached the problem as a traversability costmap estimation or visual navigation task. However, many exhibit heavy reliance on the RGB modality, leading to poor performance under varied illumination such as glare, shadows or low ambient light. Achieving robust generalization in such conditions requires integrating modalities that provide supplementary scene information. Such multi-modal methods suffer from a rigid dependency on the presence of near-perfect sensor inputs, leaving them unable to robustly handle sensor degradation or individual modality failure. To address these limitations, we introduce MAMMOTH (MAsking Multi-Modal inputs for Off-road Traversability Heuristic-informed navigation), a unified end-to-end navigation policy for robust off-road visual-goal-conditioned navigation and undirected exploration. Specifically, MAMMOTH efficiently fuses multi-modal observations (RGB, Thermal, 3D Pointcloud and Ego Velocity) and is trained with a modality dropout scheme, enabling it to generalize to missing modalities at inference time. Furthermore, we employ a diffusion policy to learn the joint conditional probability distribution of physically-grounded trajectories and an intrinsic traversability heuristic. MAMMOTH utilizes this heuristic to prefer safer, smoother trajectories. We validate MAMMOTH through extensive real-world robot experiments in distinct off-road environments, including night-time operation. Our results demonstrate superior performance, with significant improvements in collision avoidance, terrain-aware planning and generalization to missing modalities. The code and dataset used for this work will be made publicly available.

Published: July 14, 2026

Last updated: July 14, 2026

The Illusion of Robustness: Aggregate Accuracy Hides Prediction Flips under Task-Irrelevant Context

Yanzhe Zhang, Sanmi Koyejo, Diyi Yang (cs.CL)

As large language models (LLMs) grow more capable, they are increasingly deployed in context-rich settings where task inputs are often accompanied by long, partially irrelevant context. In a controlled setting, we find that state-of-the-art models often appear robust to task-irrelevant context at the aggregate level: prepending it to benchmark questions causes little change in overall accuracy. This aggregate stability, however, masks significant per-example instability. Even semantically meaningless pseudo-words, formed by randomly combining characters, can markedly shift model predictions on a small fraction of examples, degrading performance on some while improving it on others. This two-sided effect holds consistently across a wide range of models and datasets, yet the affected examples are largely model-specific. We further show that this instability is modulated by context type, context length, test-time compute, and model development stage. Together, our findings reveal context-induced tail risks concealed by aggregate accuracy, motivating per-example reliability evaluation of language models.
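
The gap between aggregate and per-example stability can be illustrated with a small sketch. The function name, the labels, and the ten-item toy data below are hypothetical and are not taken from the paper; they only show how accuracy can stay flat while individual predictions flip in both directions.

```python
def flip_report(gold, base_pred, ctx_pred):
    """Compare predictions without and with task-irrelevant context prepended."""
    n = len(gold)
    base_ok = [b == g for b, g in zip(base_pred, gold)]
    ctx_ok = [c == g for c, g in zip(ctx_pred, gold)]
    return {
        "acc_base": sum(base_ok) / n,
        "acc_ctx": sum(ctx_ok) / n,
        # Fraction of examples whose predicted label changed at all.
        "flip_rate": sum(b != c for b, c in zip(base_pred, ctx_pred)) / n,
        # Correct before, wrong after the context was added.
        "degraded": sum(b and not c for b, c in zip(base_ok, ctx_ok)),
        # Wrong before, correct after the context was added.
        "improved": sum(c and not b for b, c in zip(base_ok, ctx_ok)),
    }

# Hypothetical multiple-choice labels for ten questions.
gold = list("ABCDABCDAB")
base = list("ABCDABCAAA")  # predictions on the plain questions
ctx = list("ABCAABCDAA")   # predictions with irrelevant context prepended

report = flip_report(gold, base, ctx)
print(report)
```

In this toy case both accuracies are equal, yet two of the ten predictions change, one for the worse and one for the better, so reporting only the accuracy difference would hide the instability.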

Published: July 14, 2026

Last updated: July 14, 2026

Form, Not Content? A Preregistered, Placebo-Controlled Evaluation of Learned Error-Conditioned Self-Repair Through Prompts and Weights in Frozen Small Code Models

Mehmet Iscan (cs.SE, cs.AI, cs.LG)

Frozen small code LLMs are deployed locally, yet the information guiding a retry after a failed attempt is still measured without placebo controls in the self-repair literature. We treat a failed program as a conjecture and an execution counterexample as an oracle-relative refutation, and introduce PoPE (Popperian Placebo-controlled Evaluation): a methodology for measuring whether evidence that falsifies LLM-generated code can be used operationally by that same model. In PoPE, error content is paired with channel-specific placebos that keep the predeclared scaffold while ablating task-relevant content or deranging the task-error assignment. Frozen small code models (0.5-1.5B) are evaluated under preregistered rules through a prompt channel and a weight channel (small-data adapter training), with four generations per arm-unit pair. In the prompt channel, public-tier screening unlocked 12 units under the content-ablated form placebo versus 10 under the live error-pattern arm on a 40-unit resistant band; the result was recorded as mechanism-null. In the weight channel, an 8-8 tie was observed between the error-content adapter and the intervention-free baseline (p=1.0), while the SHA-deranged placebo adapter stayed ahead with 10 unlocks; content-attributable superiority was not confirmed. These results do not constitute evidence of equivalence or non-inferiority. Equivalence was not tested separately. Findings are restricted to the public-tier screening endpoint; hidden-tier confirmation was deferred by design. We read this not as compiled criticism disappearing as information, but as the loss of its external role in testing a new conjecture: when a representation learned from the oracle is written back into the generation state, testing is replaced by conditioning. No working JEPA-RL controller is claimed. PoPE is presented as a placebo-controlled, retestable measurement standard.

Published: July 14, 2026

Last updated: July 14, 2026

Teaching signal synchronization in deep neural networks with prospective neurons

Nicolas Zucchet, Qianqian Feng, Axel Laborieux, Friedemann Zenke, Walter Senn, João Sacramento (q-bio.NC, cs.NE)

Working memory requires the brain to maintain information from the recent past to guide ongoing behavior. Neurons can contribute to this capacity by slowly integrating their inputs over time, creating persistent activity that outlasts the original stimulus. However, when these slowly integrating neurons are organized hierarchically, they introduce cumulative delays that create a fundamental challenge for learning: teaching signals that indicate whether behavior was correct or incorrect arrive out-of-sync with the neural activity they are meant to instruct. Here, we demonstrate that neurons enhanced with an adaptive current can compensate for these delays by responding to external stimuli prospectively -- effectively predicting future inputs to synchronize with them. First, we show that such prospective neurons enable teaching signal synchronization across a range of learning algorithms that propagate error signals through hierarchical networks. Second, we demonstrate that this successfully guides learning in slowly integrating neurons, enabling the formation and retrieval of memories over extended timescales. We support our findings with a mathematical analysis of the prospective coding mechanism and learning experiments on motor control tasks. Together, our results reveal how neural adaptation could solve a critical timing problem and enable efficient learning in dynamic environments.

Published: November 18, 2025

Last updated: July 14, 2026

ViCo3D: Empowering LiDAR-based Collaborative 3D Object Detection with Vision Foundation Models

Haojie Ren, Songrui Luo, Lingfeng Wang, Yan Xia, Yao Li, Jing Li, Lu Zhang, Jiajun Deng, Yanyong Zhang (cs.CV)

LiDAR-based collaborative 3D perception in Vehicle-to-Everything (V2X) systems typically relies on fusing bird's-eye-view (BEV) features across agents. However, current BEV representations, typically extracted by LiDAR backbones trained from scratch, are geometry-dominated and lack general semantic priors, inherently limiting the efficacy of feature-level collaboration. Meanwhile, vision foundation models (VFMs) pretrained on large-scale image data have demonstrated strong capability in learning general-purpose and informative visual representations for 2D tasks, and have the potential to enhance agent-wise LiDAR BEV representations for collaboration. Despite this potential, adapting VFMs to LiDAR-based 3D detection remains challenging due to the substantial image-point cloud modality gap. To bridge this gap, we propose ViCo3D, a collaborative 3D object detection framework powered by VFMs. Specifically, ViCo3D adapts VFMs to LiDAR-based collaborative perception in three ways. First, ViCo3D projects point clouds onto the BEV plane as three-channel images, enabling DINOv2 to extract BEV-space visual features from LiDAR inputs. Second, to effectively integrate these DINOv2-derived features with LiDAR geometric features, ViCo3D introduces a multi-scale BEV fusion module within the single-agent encoder. Third, ViCo3D adopts an ego-centric cross-agent fusion strategy to aggregate complementary information from multiple agents. Experiments on DAIR-V2X and V2XSet demonstrate that ViCo3D achieves state-of-the-art 3D detection performance. Remarkably, it delivers up to 1.8x greater collaborative gains than prior methods on DAIR-V2X. The code will be made publicly available for future investigation.

Published: July 14, 2026

Last updated: July 14, 2026

FairCoder: Probing LLM Bias in High-Stakes Decision Making via Coding Tasks

Yongkang Du, Jen-tse Huang, Jieyu Zhao, Lu Lin (cs.CL, cs.SE)

Large language models (LLMs) are increasingly used in high-stakes decisions such as hiring and college admissions, making their social bias a critical concern. While LLMs are trained to refuse explicitly biased requests, bias can leak implicitly during the LLM planning and reasoning process. As code becomes the primary medium for LLM internal logic-writing, we introduce FairCoder, a benchmark that frames decision-making as coding tasks to systematically probe LLM bias across employment, education, and healthcare domains, covering multiple fairness definitions. Considering that existing metrics may fail when LLMs frequently refuse the request, we propose FairScore, a metric that jointly captures refusal behavior and group-level outcome diversity. Experiments with a 1k-sample dataset on powerful LLMs reveal consistent and previously underexplored bias patterns, such as prioritizing applicants from high-income families in college admissions. Our findings highlight the risks of deploying LLMs as decision-making agents and provide a comprehensive evaluation framework for future research.

Published: January 09, 2025

Last updated: July 14, 2026

Robustness of Deep Learning Models for PV Power Forecasting under NWP Forecast Errors: A Spatiotemporal and Physically Interpretable Analysis

Dandan Chen, Yan Zhao, Xuepeng Chen (physics.ao-ph, cs.LG)

Engineering use of AI forecasting models requires not only high nominal accuracy but also predictable behavior under uncertain inputs. In photovoltaic (PV) forecasting, this requirement is especially challenging because numerical weather prediction (NWP) errors are temporally correlated, state dependent, and physically coupled across variables. Existing evaluations, however, often rely on perfect forecast assumptions or simplistic perturbations that do not reflect these characteristics. This study presents a physically constrained robustness evaluation framework based on simulation, using virtual PV power as a controlled response variable to isolate the propagation of input uncertainty from confounders at the plant level. Six representative machine learning and deep sequence models, including PatchTST, GRU, N-HITS, and LightGBM, are evaluated under dynamic NWP perturbations with heteroscedasticity modulated by clear-sky conditions and Erbs reconstruction that preserves radiation consistency. The results show that sequence models provide stronger noise filtering and temporal resilience than a strong tabular baseline under medium to high disturbance regimes. SHapley Additive exPlanations (SHAP) and Integrated Gradients (IG) further support a feature reallocation tendency at the case level, in which predictive reliance shifts from corrupted future forecasts toward more stable historical observations and deterministic physical priors. A Pareto analysis of accuracy under clean conditions, robustness, and computational latency then translates these findings into engineering implications for robustness assessment and model selection under forecast uncertainty.

Published: July 14, 2026

Last updated: July 14, 2026

RecRec: Recursive Refinement for Sequential Recommendation

Pervez Shaik, Prosenjit Biswas, Abhinav Thorat, Ravi Kolla, Niranjan Pedanekar (cs.IR, cs.LG)

Sequential recommender systems typically infer user preferences through single-pass encoding of interaction histories without iterative refinement, relying on increasingly deep architectures to capture complex patterns. In this work, we revisit sequential recommendation from a recursive inference perspective: can user preferences be modeled as a persistent latent state that is recursively refined? We propose RecRec (Recursive Recommendation), a lightweight model that maintains a compact latent state and updates it through a shared recursive module conditioned on interaction evidence. Unlike prior recursive models, RecRec introduces an evidence-anchored correction mechanism that stabilizes refinement by grounding each update in the original interaction context, preventing semantic drift during deep recursive reasoning. Experiments on three benchmark datasets under standard evaluation protocols show that RecRec matches or outperforms state-of-the-art sequential, graph-based, and reasoning-enhanced recommenders while using only 3.9M to 14M parameters. Ablation studies demonstrate that both recursive refinement and the evidence-anchored correction gate contribute significantly to performance, highlighting the effectiveness of recursive latent inference as a scalable alternative to deeper or language-based architectures. Code is available at https://anonymous.4open.science/r/RecRec-6B67/README.md.

Published: July 12, 2026

Last updated: July 14, 2026

ViHoRec: A Quality-Controlled Vietnamese Hotel Recommendation Dataset and Cold-Start Benchmark

Minh Hoang Nguyen (cs.IR, cs.AI)

Recommender-system research for Vietnamese remains limited by the absence of a public, well-documented hotel interaction resource. Building such a resource is challenging for three reasons: cross-platform hotel names must be reconciled before interactions are comparable; quality must be audited with reproducible metrics rather than ad hoc cleaning; and public release must preserve privacy while remaining benchmarkable under realistic cold-start conditions. We introduce ViHoRec, a quality-controlled Vietnamese hotel recommendation dataset of 18,267 interactions between 6,832 users and 560 hotels, crawled from Booking.com, Traveloka, and Ivivu. Our contributions are: (i) a reproducible construction pipeline with cross-platform entity resolution and quantitative quality control; (ii) a privacy-preserving release with HMAC pseudonyms; and (iii) a public cold-start benchmark with temporal leave-last-one-out split, data-centric ablations, and dependency-free baselines. On the public split, learned models degrade sharply for users with short histories (BPR-MF Recall@10: 0.065 vs. 0.120), while UserKNN remains strongest overall, establishing ViHoRec as a sparse, cold-start-dominated testbed for low-resource recommendation. All data are publicly available at https://github.com/MinhNguyenDS/ViHoRec.

Published: July 14, 2026

Last updated: July 14, 2026

Hardness of Obligatory-Test Scheduling on Multiple Machines

Kao-Chuan Liang, Ya-Chun Liang (cs.DS)

We study online scheduling with obligatory testing on m identical parallel machines, with the objective of minimizing the sum of completion times. Each job comprises a test of known length and a processing operation of initially unknown length. The processing time is revealed only when the test completes. Unlike in optional testing models, the scheduler does not choose whether to acquire information. Instead, it must decide how to allocate machine capacity between testing unrevealed jobs and processing jobs whose sizes are already known. Previous single-machine lower-bound constructions suggest a natural √(2) benchmark [ESA 2024: 48:1-14]. However, these constructions cannot be directly transferred to identical parallel machines by a simple replication argument. An online algorithm may interleave jobs from different copies, and the test and processing operation of a job need not be scheduled on the same machine. We address this difficulty by introducing a completion-threshold framework that reasons directly about global progress under total machine capacity. For each X, let T_X be the earliest time at which the algorithm has completed at least X jobs. The identity ∑_{X=1}^{N} T_X then converts pointwise progress bounds into lower bounds on the total completion time. Using this framework, we prove a three-type lower bound of 1.4811 and a dyadic multi-type lower bound tending to 3/2. The latter also improves the deterministic single-machine lower bound from √(2) to 3/2. On the algorithmic side, we give a parallel version of single-machine 1-SORT and prove that, if single-machine 1-SORT is ρ-competitive, then its parallel version is 2(m+ρ-1)/m+1-competitive on m identical machines.
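
The completion-threshold idea can be checked numerically. The sketch below reads the identity as "the total completion time equals the sum of T_X over X = 1..N", which is an interpretation of the abstract rather than a quotation from the paper. The completion times and the pointwise bounds are invented toy values, and the function names are hypothetical.

```python
def threshold(completion_times, x):
    """T_X: earliest time at which at least x jobs have completed."""
    return min(
        t for t in completion_times
        if sum(c <= t for c in completion_times) >= x
    )


def total_via_thresholds(completion_times):
    """Sum T_X for X = 1..N, where N is the number of jobs."""
    n = len(completion_times)
    return sum(threshold(completion_times, x) for x in range(1, n + 1))


# Toy schedule (assumption): completion times of four jobs.
completion_times = [7, 3, 12, 5]

# Hypothetical pointwise progress bounds: T_X >= lower_bounds[X - 1].
lower_bounds = [2, 4, 6, 8]

if __name__ == "__main__":
    total = sum(completion_times)
    assert total == total_via_thresholds(completion_times)
    # Summing the pointwise bounds gives a lower bound on the total.
    print(total, ">=", sum(lower_bounds))
```

Because T_X is simply the X-th smallest completion time, any bound that holds for every X individually sums to a bound on the objective, which is how the framework avoids reasoning about which machine ran which operation.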

Published: June 01, 2026

Last updated: July 14, 2026

Point Tracking in Surgery--The 2025 Surgical Tattoos in Infrared Challenge (STIRC2025)

Adam Schmidt, Mert Asim Karaoglu, Zijian Wu, Jiaming Zhang, Yuxin Chen, Tim Salcudean, Ho-Gun Ha, Minkang Jang, Kyungmin Jung, Ihsan Ullah, Hyunki Lee, Suresh Guttikonda, Sarah Latus, Alexander Schlaefer, Xinkai Zhao, Yuichiro Hayashi, Masahiro Oda, Takayuki Kitasaka, Kensaku Mori, Peng Liu, Chenyang Li, Stefanie Speidel, Aoife Gardiner, Agostino Stilli, Danail Stoyanov, Francisco Vasconcelos, Anwesa Choudhuri, Meng Zheng, Zhongpai Gao, Benjamin Planche, Van Nguyen Nguyen, Terrence Chen, Ziyan Wu, Alexander Ladikos, Omid Mohareri (cs.CV)

Point tracking in surgery is crucial to enable applications in downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy. This paper introduces the 2025 iteration of a point tracking challenge to address this, wherein participants submit their algorithms for quantification. Their algorithms are evaluated using a dataset named surgical tattoos in infrared (STIR), with the challenge named the STIR Challenge 2025 (STIRC2025). The STIR Challenge 2025 comprises two quantitative components: accuracy and efficiency. The accuracy component tests the accuracy of algorithms on in vivo and ex vivo sequences. The efficiency component tests algorithm inference latency. The challenge was conducted as a part of MICCAI EndoVis 2025, and seven teams participated in this challenge. In this paper we summarize the challenge results and participant methods. The challenge dataset is available at: https://zenodo.org/records/20191078, and the code for baseline models and metrics calculation is available here: https://github.com/athaddius/STIRMetrics

Published: July 14, 2026

Last updated: July 14, 2026

Exact and Calibrated Diffusion Reconstruction for Digital Breast Tomosynthesis

Imade Bouftini (eess.IV, cs.CV)

Limited-angle digital breast tomosynthesis (DBT) reconstructs a volume from a few low-dose projections over a narrow arc. At a representative nine-view, 25° protocol more than 98

Published: July 14, 2026

Last updated: July 14, 2026

Achieving Almost Exact Recovery in Almost Quadratic Time: Rank-Based Graph Matching via Local Tree Correlation Tests

Jiale Cheng, Ziao Wang, Lei Ying (cs.DS, math.ST, stat.ML)

This paper studies graph matching under the correlated Erdős-Rényi (ER) graph pair model. This model first samples an ER(n,ns) base graph, whose edges are then independently subsampled twice with probability s to produce two correlated ER(n,n) graphs. We propose a graph matching algorithm that has n^{2+o(1)} time complexity and achieves almost exact recovery with high probability under the assumptions λ=(log n)^α+o(1) for some α∈(0,1) and s∈(√(C_Otter),1], where C_Otter≈ 0.338 is Otter's tree-counting constant. This is the first algorithm with almost quadratic time complexity in this regime of λ, while the best known result in this regime is the chandelier-counting algorithm with time complexity O(n^{c(s)}), where c(s)→∞ as s approaches √(C_Otter) from above. The proposed algorithm is based on local tree correlation tests. It uses a rank-based algorithm to match the vertex pairs instead of the threshold-based rules in the literature. This avoids the need to compute an explicit threshold, which is computationally difficult to obtain. To prove the almost exact recovery result, we establish a new analysis of tree correlation tests in the diverging-degree regime, where both the mean degree and the tree depth grow with n. Based on this new result, we establish the existence of a threshold for a threshold-based graph matching algorithm via local tree correlation tests. Finally, we couple the performance of the rank-based algorithm with the threshold-based algorithm to show almost exact recovery.

Published: July 10, 2026

Last updated: July 14, 2026

Domain-Incremental Remote Sensing Change Detection via Difference-Guided Adaptation and Frequency-Decoupled Distillation

Daifeng Peng, Yaning Li, Haiyan Guan (cs.CV)

Remote sensing change detection (RSCD) models are prone to catastrophic forgetting when incrementally adapted to new domains. Existing domain-incremental learning (DIL) methods mainly preserve image-level representations but often overlook bitemporal discrepancy cues, which are critical for robust change detection under domain shifts. To address this limitation, we propose DG-FDD, a domain-incremental change detection framework that integrates Difference-Guided Adaptation and Frequency-Decoupled Distillation. Specifically, the Difference-Guided Dynamic Adapter (DGDA) models bitemporal feature discrepancies to promote change-aware feature adaptation and reduce domain-specific interference. Meanwhile, the Frequency-Decoupled Knowledge Distillation strategy with Cross-domain Synthesis (FDKD-CS) separates structural information from domain style in the frequency domain, enabling stable knowledge transfer without historical data. Extensive experiments on three public high-resolution RSCD datasets under two- and three-domain incremental protocols demonstrate that DG-FDD effectively mitigates catastrophic forgetting. Compared with independently trained single-task models, DG-FDD records mean relative changes in F1 and IoU of only -0.23% and -0.45%, respectively, across six two-domain sequences, and -0.69% and -1.31%, respectively, across the three evaluated three-domain sequences. These results indicate a favorable stability-plasticity balance between historical knowledge retention and new-domain adaptation in continual cross-domain change detection.

Published: July 14, 2026

Last updated: July 14, 2026

ExToken: Structured Exploration for Efficient Vision-Language-Action Reinforcement Fine-tuning

Yilun Kong, Yunpeng Qing, Guozheng Ma, Haoyu Wang, Li Shen, Zhi Hou, Dacheng Tao (cs.RO)

Reinforcement Learning (RL) has demonstrated significant potential for improving Vision-Language-Action (VLA) models on complex manipulation tasks. However, its practical scalability remains severely limited by the substantial cost of environmental interactions. In this work, we first investigate the exploration stagnation bottleneck in current VLA-RL frameworks and reveal that trajectory diversity is fundamentally more important to sample efficiency than the sheer quantity of collected rollouts. Motivated by these insights, we introduce RL Exploration Token (ExToken), a simple yet general framework that conditions VLA policies on discrete behavioral priors derived from offline demonstrations for structured exploration. By conditioning the policy on different tokens during rollout collection, ExToken encourages the agent to explore diverse behavioral modes, substantially improving state-action coverage and exploration efficiency. To bridge exploration during training with deterministic inference at deployment, ExToken further incorporates a state-conditioned token selector that adaptively predicts effective behavioral modes for unseen scenarios. Extensive experiments across simulated and real-world robotic manipulation tasks demonstrate that ExToken consistently accelerates convergence, improves task performance, and exhibits strong robustness under highly constrained interaction budgets.

Published: July 14, 2026

Last updated: July 14, 2026

Efficient Sequential Calibration with O(T^{2/3-ε}) Error Bound

Zihan Zhang (cs.LG)

We study the online binary sequential calibration problem. A recent breakthrough by <cit.> overcomes the classical T^{2/3} barrier for calibration error. Building on this result, we present an efficient randomized forecaster that achieves an expected calibration error O(T^{2/3-ε}) for some constant ε>0. Our forecaster combines the SPR-Calibration procedure <cit.> with an outer Blackwell-style correction layer. The SPR-Calibration procedure controls calibration with respect to a surrogate sequence of conditional-mean estimates, while the correction layer controls the additional error incurred when these surrogates are used to approximate the true outcomes. The analysis decomposes the total calibration error into the surrogate calibration error and the residual discrepancy between the surrogate sequence and the true outcomes. The former is bounded by the SPR-Calibration guarantee in <cit.>, and the latter is controlled using a quadratic potential argument together with the sparsity of the SPR-Calibration forecaster.
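
For readers unfamiliar with the quantity being bounded, the sketch below computes the standard ℓ1 calibration error of a forecast sequence: for each distinct forecast value p, take the absolute sum of (outcome minus p) over the rounds where p was forecast, then add these up. The abstract does not spell out its definition, so treating it as this standard one is an assumption, and the function name and toy sequences are hypothetical.

```python
from collections import defaultdict


def l1_calibration_error(forecasts, outcomes):
    """Sum over distinct forecast values p of |sum of (y_t - p) over rounds with p_t = p|."""
    bias = defaultdict(float)
    for p, y in zip(forecasts, outcomes):
        bias[p] += y - p
    return sum(abs(b) for b in bias.values())


if __name__ == "__main__":
    # Perfectly calibrated toy sequence: forecast 0.5 is right half the time,
    # forecast 0.25 is right a quarter of the time.
    print(l1_calibration_error([0.5, 0.5, 0.25, 0.25, 0.25, 0.25],
                               [1, 0, 0, 0, 1, 0]))
    # Miscalibrated toy sequence: forecast 0.5 but the outcome is always 1.
    print(l1_calibration_error([0.5, 0.5], [1, 1]))
```

Under this definition the error grows with the horizon T, which is why results in this area are stated as rates in T.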

Published: July 14, 2026

Last updated: July 14, 2026

Knowledge- and Gradient-Guided Reinforcement Learning for Parametrized Action Markov Decision Processes

Jonas Ehrhardt, René Heesch, Oliver Niggemann (cs.AI)

In this paper, we study Reinforcement Learning in Parametrized Action Markov Decision Processes (PAMDP), where each decision consists of a symbolic action and numerical parameters. In such settings, Reinforcement Learning algorithms typically determine parameters with one-shot estimators, which makes their training sample-inefficient. Although explicit but incomplete knowledge (e.g., rules, safety constraints, or expert heuristics) is available in most PAMDP environments, it is rarely used directly to increase the sample efficiency of training Reinforcement Learning agents. We step into this gap and propose our novel Neuro-Symbolic Knowledge- and Gradient-Guided Reinforcement Learning (KGRL) algorithm. KGRL uses domain knowledge in a Datalog knowledge base to derive the set of applicable actions and feasible parameters for a given state. This allows it to prune non-applicable actions from the decision space and constrain the parameter spaces of the remaining actions. We then use a gradient-based parameter refinement loop to estimate the optimal parameters during training and deployment of the agent. By recording activated rules along the trajectory, KGRL additionally provides local procedural explanations of the pruning of actions and constraining of parameters. Overall, KGRL guides the agent's exploration and deployment toward feasible and constraint-aware decisions, while increasing sample efficiency during training. KGRL outperforms state-of-the-art RL baselines for PAMDPs in both sample efficiency and episodic return.

Published: July 14, 2026

Last updated: July 14, 2026

LatentFlow: A General Framework for Conditioning Stochastic Processes

Louis Sharrock, Lachlan Astfalck, Henry Moss (stat.ML, cs.LG, stat.ME)

Stochastic-process models are, as a rule, far easier to simulate than to condition. Non-linear observations, non-Gaussian likelihoods, black-box information, and global constraints all induce intractable conditional laws, requiring bespoke, model-specific constructions. We introduce LatentFlow, a single framework for conditioning stochastic processes, with no learned neural approximations and no training. Our starting point is to write the stochastic process as the deterministic image of a tractable latent innovation, f_0 = T_ϑ(ξ_0), with ξ_0 sampled from a simple reference distribution. This reduces process-level conditioning to latent-space inference: pull the likelihood back through T_ϑ, sample the resulting latent law with a tractable guided probability flow, and push the samples forward. This construction is provably exact at the level of the target law; in practice, approximation enters only through finite terminal noising, Monte Carlo guidance, and time discretisation of the continuous-time dynamics, each of which is explicit and systematically reducible. As LatentFlow is training-free, conditioning reduces to solving a single reverse-time SDE. This enables conditional sampling in seconds on a single desktop CPU across model classes that have never shared a scalable method: classical spatial priors, nonlinear stochastic dynamics, mechanistic models from the physical and life sciences, stochastic PDEs, heavy-tails and extremes, point and discrete-state processes, and neural or simulator-defined processes.

Published: July 14, 2026

Last updated: July 14, 2026

The TopCoW Challenge -- Topology-Aware Circle of Willis Segmentation for CT and MR Angiography

Kaiyuan Yang, Fabio Musio, Yihui Ma, Norman Juchler, Johannes C. Paetzold, Rami Al-Maskari, Luciano Höher, Hongwei Bran Li, Ibrahim Ethem Hamamci, Anjany Sekuboyina, Suprosanna Shit, Houjing Huang, Chinmay Prabhakar, Ezequiel de la Rosa, Bastian Wittmann, Diana Waldmannstetter, Florian Kofler, Fernando Navarro, Martin J. Menten, Ivan Ezhov, Daniel Rueckert, Iris N. Vos, Ynte M. Ruigrok, Birgitta K. Velthuis, Hugo J. Kuijf, Pengcheng Shi, Wei Liu, Ting Ma, Maximilian R. Rokuss, Yannick Kirchhoff, Fabian Isensee, Klaus Maier-Hein, Chengcheng Zhu, Huilin Zhao, Philippe Bijlenga, Julien Hämmerli, Catherine Wurster, Laura Westphal, Jeroen Bisschop, Elisa Colombo, Hakim Baazaoui, Hannah-Lea Handelsmann, Andrew Makmur, James Hallinan, Amrish Soundararajan, Benedikt Wiestler, Jan S. Kirschke, Evamaria O. Riedel, Roland Wiest, Emmanuel Montagnon, Laurent Letourneau-Guillon, Kwanseok Oh, Dahye Lee, Orhun Utku Aydin, Adam Hilbert, Jana Rieger, Dimitrios Rallios, Satoru Tanioka, Alexander Koch, Dietmar Frey, Abdul Qayyum, Moona Mazher, Steven Niederer, Nico Disch, Julius C. Holzschuh, Dominic LaBella, Francesco Galati, Daniele Falcetta, Maria A. Zuluaga, Chaolong Lin, Haoran Zhao, Zehan Zhang, Minghui Zhang, Xin You, Hanxiao Zhang, Guang-Zhong Yang, Yun Gu, Sinyoung Ra, Jongyun Hwang, Hyunjin Park, Junqiang Chen, Marek Wodzinski, Henning Müller, Nesrin Mansouri, Florent Autrusseau, Cansu Yalcin, Rachika E. Hamadache, Clara Lisazo, Joaquim Salvi, Adrià Casamitjana, Xavier Lladó, Uma Maria Lal-Trehan Estrada, Valeriia Abramova, Luca Giancardo, Arnau Oliver, Paula Casademunt, Adrian Galdran, Matteo Delucchi, Oscar Camara, Jialu Liu, Haibin Huang, Yue Cui, Zehang Lin, Yusheng Liu, Shunzhi Zhu, Tatsat R. Patel, Adnan H. Siddiqui, Vincent M. Tutino, Maysam Orouskhani, Huayu Wang, Mahmud Mossa-Basha, Yuki Sato, Sven Hirsch, Susanne Wegener, Bjoern Menze (cs.CV, cs.LG, q-bio.QM, q-bio.TO)

The Circle of Willis (CoW) is an important network of arteries connecting major circulations of the brain. Its vascular architecture is believed to influence the risk, severity, and outcome of serious neurovascular diseases. However, characterizing the highly variable CoW anatomy remains a manual and time-consuming expert task. The CoW is commonly imaged by two non-invasive angiographic imaging modalities, magnetic resonance angiography (MRA) and computed tomography angiography (CTA), yet few datasets with annotated CoW anatomy exist, and there have been no established benchmarks for comparing CoW segmentation algorithms. We organized the TopCoW benchmark challenge alongside the release of an annotated CoW dataset with 125 paired MRA and CTA scans from the same patients. Voxel-level annotations for 13 vessel components were created using virtual reality technology and verified by clinical experts. Participants submitted algorithms for CoW segmentation and variant classification, which we evaluated on internal and external test sets comprising 226 scans from over five centers. The benchmark includes voxel-level segmentation, CoW component detection, CoW variant classification, and two clinical application tasks. We received submissions from over 250 participants across six continents. Top-performing teams achieved over 90% Dice scores for CoW segmentation, over 80% F1 scores for detecting key vessel components, and over 70% balanced accuracy in CoW variant classification across nearly all test sets. The best algorithms also supported clinically relevant downstream tasks by accurately classifying fetal-type posterior cerebral arteries and localizing aneurysms in relation to CoW anatomy. This benchmark demonstrated the utility of CoW segmentation algorithms for some downstream clinical applications with explainability.

Published: December 29, 2023

Last updated: July 14, 2026

Significance-First Splitting: Aligning Treatment Heterogeneity Detection with Honest Estimation

Pantelis Z. Hadjipantelis, Weng Man Chiang, Karthik Nagesh (stat.ME, cs.LG, stat.ML)

Estimating heterogeneous treatment effects (CATE) requires simultaneously detecting effect modification and quantifying estimation uncertainty. Existing tree-based methods make an uneasy trade-off: significance-based approaches (Radcliffe and Surry 2011) identify subgroup interactions directly but lack valid inference; honest causal trees (Athey and Imbens 2016) deliver nominal confidence interval coverage but use outcome-agnostic splitting criteria that sacrifice interaction sensitivity. We introduce a hybrid algorithm that fuses significance-based splitting with honest sample-splitting and cross-validation. Our splitting criterion uses the squared t-statistic for the treatment × side interaction (t^2), which is shown to be directly aligned with the honest EMSE_τ criterion when the interaction is strong. Post-hoc honest cross-validation selects the cost-complexity penalty, giving a single principled estimator with nominal CI coverage at the leaf level. For forests, we retain bootstrap count vectors to enable an infinitesimal jackknife (IJ) variance estimate of Monte-Carlo convergence rather than formal pointwise inference. On the three synthetic designs from (Athey and Imbens 2016) the single tree achieves approximately 90

Published: July 04, 2026

Last updated: July 14, 2026

Contrastive-Collapsed Loss for Flexible and Geometrically Optimal Embeddings and Faster Convergence

Blanca Cano-Camarero, Ángela Fernández-Pascual, José R. Dorronsoro (cs.LG)

In this work, we introduce CoCo, a loss function aimed at learning normalized and well-structured representations. The proposed loss encourages intra-class collapse and inter-class contrast while preserving sufficient flexibility for neural networks to approximate geometrically optimal embeddings with large angular separation between classes. We provide a theoretical analysis positioning CoCo with respect to related objectives such as dot regression and cross-entropy, showing that the newly proposed loss benefits from closer initialization to the optimal configuration, more informative gradients, and stronger incentives for class-wise representation collapse. Extensive experiments on diverse tabular datasets from the OpenML-CC18 benchmark show that CoCo is competitive with state-of-the-art methods, including kernel SVM, Random Forest, dot regression, and cross-entropy-based neural networks. In addition, both theoretical arguments and empirical analyses demonstrate that the proposal promotes tighter class clustering and faster convergence. These results highlight CoCo loss as an effective objective for learning discriminative representations while maintaining competitive predictive performance.

Published: July 14, 2026

Last updated: July 14, 2026

Open-KNEAD: Knowledge-grounded Nutrition Estimation via Agentic Decomposition

Bruce Coburn, Jingbo Yue, Jinge Ma, Siddeshwar Raghavan, Gautham Vinod, Fengqing Zhu (cs.CV)

Multimodal Large Language Models (MLLMs) are increasingly used for dietary assessment from meal images, where retrieval-augmented grounding was shown to sharpen nutrition estimates. However, we find this premise no longer holds for current MLLMs. A modern MLLM's direct estimate now matches or surpasses the full retrieval pipeline. This raises a question: if retrieval no longer improves the overall estimate, can it still deliver the two things clinicians value, accurate portions and a traceable, item-by-item record? We pursue this while preserving what matters for clinical adoption: minimal user burden (a single, unannotated meal image), explainability (an auditable record), and privacy (locally hosted inference). We introduce Open-KNEAD, a knowledge-grounded agentic framework for meal nutrition estimation that is training-free and locally deployable. Each decomposed food item is grounded to a Food and Nutrient Database for Dietary Studies (FNDDS) code via selective, nutrient-aware retrieval, composing an auditable per-item record. Across two open MLLM families and three cuisines, Open-KNEAD improves portion estimates over both prior grounding methods and direct estimation in most backbone-dataset settings. An agent-internal recipe-prior step further recovers the invisible cooking-added energy that biases estimates on non-US cuisine. The advantage is largest on the dietitian-verified ACETADA dataset, where the local open agent surpasses the direct portion estimates of two frontier closed models by roughly 30% and 53%, all while keeping every meal image on local hardware. We release the Open-KNEAD framework and its agent-ready FNDDS knowledge base.

Published: July 14, 2026

Last updated: July 14, 2026

Real-time fall detection based on vision for low-power edge platforms

Wenjun Xia, Zhicheng Peng, Haopeng Li, Zhengdi Zhang (q-bio.NC, cs.AI, cs.CV)

Fall detection is vital for elderly care and intelligent surveillance; however, prevailing vision-based approaches predominantly frame it as static pose classification or discrete temporal pattern matching, fundamentally overlooking the instability dynamics of the human support system. This paper proposes a physics-informed fall detection framework that recasts falling as a stability-loss event in a coupled dynamical system. We introduce a novel dual-LTC architecture comprising a Center-of-Mass (CoM) subsystem and a Base-of-Support (BoS) subsystem, both instantiated as Liquid Time-Constant (LTC) neural networks to continuously model inertial trajectory evolution and ground-contact adjustment through adaptive time constants, yielding physical interpretability of the falling motion. A learnable coupling module emulates physical interaction between the two subsystems, while a Stability Manifold classifier operates in the joint latent space to detect boundary crossing via Lyapunov-inspired stability metrics. Complementary counterfactual trajectory projection and Time-to-Collision (TTC) estimation further enable irreversibility assessment and early warning. The architecture is designed to support a three-state prediction paradigm (Normal, Falling, Fallen); in this preliminary study, we validate the core stability discrimination capability on a two-class dataset (Normal vs. Falling), leaving the full three-state temporal transition to future work. Unlike conventional CNN-RNN pipelines, the proposed formulation encodes continuous-time mechanical inertia, yielding a sub-50K-parameter network capable of real-time inference on resource-constrained edge devices. Extensive experiments demonstrate competitive accuracy with superior physical interpretability, validating its efficacy for low-compute visual fall detection.

Published: July 14, 2026

Last updated: July 14, 2026

Harness VLA: Steering Frozen VLAs into Reliable Manipulation Primitives via Memory-Guided Agents

Yixian Zhang, Huanming Zhang, Feng Gao, Xiao Li, Zhihao Liu, Chunyang Zhu, Jiaxing Qiu, Yuchen Yan, Jiyuan Liu, Wenhao Tang, Zhengru Fang, Yi Nie, Changxu Wei, Yu Wang, Wenbo Ding, Chao Yu (cs.RO)

Language-conditioned manipulation requires both precise contact-rich control and robust reasoning over language, scenes, and long horizons. End-to-end Vision-Language-Action (VLA) models provide strong local visuomotor skills, but they are trained on in-distribution task trajectories and often fail under deployment perturbations such as semantic retargeting, goal re-binding, spatial-layout shifts, and unstable local contacts. LLM coding agents provide complementary semantic and compositional reasoning, but purely analytic primitives struggle with irregular grasping, constrained placement, and articulated-object interaction. We present Harness VLA, a memory-augmented agentic framework that exposes a frozen VLA as a retryable contact-rich primitive and composes it with a small fixed library of analytic primitives for grounding, staging, transport, navigation, and release. Rather than expanding the skill library, the harness learns the operating range of these fixed primitives from task-specific execution traces, global success rules, and failure models. By lifting semantic re-grounding, non-contact execution, and VLA re-staging to the planner while reserving the frozen VLA for local contact-rich phases, Harness VLA extends pretrained VLAs beyond their original trajectory distribution without finetuning. Across perturbed tabletop, household kitchen, and clean-to-randomized bimanual manipulation, Harness VLA improves over the strongest relevant baselines by 38.6 and 25.4 percentage points on LIBERO-Pro and RoboCasa365, respectively, and reaches 58.4% on RoboTwin C2R.

Published: July 09, 2026

Last updated: July 14, 2026

Do AI Agents Know When a Task Is Simple? Toward Complexity-Aware Reasoning and Execution

DenseReward: Dense Reward Learning via Failure Synthesis for Robotic Manipulation

The Seriality Gap in Video Diffusion Models

TerraZero: Procedural Driving Simulation for Zero-Demonstration Self-Play at Scale

PalmClaw: A Native On-Device Agent Framework for Mobile Phones

The Balanced Four-Color Theorem

A Shortcut to Statistically Steady-State Turbulence with Flow Matching

FlowWAM: Optical Flow as a Unified Action Representation for World Action Models

Privacy Attacks on Stable Marriage

Audio-Native Speech Recognition with a Frozen Discrete-Diffusion Language Model

Testing the Independent Set Property in Hypergraphs

Resilient Decentralized Ergodic Coverage for Scalable Multi-Robot Systems in Unknown Time-Varying Environments

Inclusive Federated Learning Through Compliance-Weighted Noise Allocation in Healthcare AI

DermDepth: Toward Monocular Metric Scale 3D Reconstruction Models for Dermatology

Dynamic Resource Allocation for Ensemble Determinization MCTS

The Spectrum Is Not Enough: When Context Helps Time-Series Forecasting

Propheticus: Machine Learning Framework for the Development of Predictive Models for Reliable and Secure Software

Watermark Forensics for Generative Models: An Information-Theoretic Perspective

The TIME Machine: On The Power of Motion for Efficient Perception

LapSurgie: Humanoid Robots Performing Surgery via Teleoperated Handheld Laparoscopy

Bringing Back Rule Induction to Fluid Intelligence Research? An Initial Validation of the ARC-AGI Benchmark in Humans

Polycepta: Object-Centric Appearance Estimation for Multi-Object Tracking

VL-Nav: Neuro-Symbolic Reasoning-based Vision-Language Navigation

X-Lens: Real-Time Metric Depth Estimation with Heterogeneous Cameras

SPECTRA: Context-Conditioned Spectral Movement Primitives for Robot Skill Generalization

ChunkFlow: Towards Continuity-Consistent Chunked Policy Learning

Growing a Tail: Increasing Output Diversity in Large Language Models

Beyond Points: Spherical Distributional Part Prototypes for Interpretable Classification

Controllable Generation of Diverse Dermatological Imagery for Fair and Efficient Malignancy Classification

Win by Silence: Deletion Non-Monotonicity, Autonomous Exploitation, and Typed-State Gating in LLM Plan Evaluation

Resist and Update: Counterfactual Report Coordinates for Incentive-Compatible LLMs

The log log jam in Gaussian state tomography

FormalAnalyticGeo: A Neural-Symbolic Based Framework for Multimodal Analytic Geometry Problem Generation

Ensemble Controlled-Flow Filtering for Implicit Data Assimilation

When and Why Does Multi-Agent Debate Fail and Does It Really Underperform?

Invariant Learning Dynamics of Transformers in Inductive Reasoning Tasks

MAMMOTH: A Multi-Modal End-to-End Policy for Off-Road Mobility Robust to Missing Modality

The Illusion of Robustness: Aggregate Accuracy Hides Prediction Flips under Task-Irrelevant Context

Form, Not Content? A Preregistered, Placebo-Controlled Evaluation of Learned Error-Conditioned Self-Repair Through Prompts and Weights in Frozen Small Code Models

Teaching signal synchronization in deep neural networks with prospective neurons

ViCo3D: Empowering LiDAR-based Collaborative 3D Object Detection with Vision Foundation Models

FairCoder: Probing LLM Bias in High-Stakes Decision Making via Coding Tasks

Robustness of Deep Learning Models for PV Power Forecasting under NWP Forecast Errors: A Spatiotemporal and Physically Interpretable Analysis

RecRec: Recursive Refinement for Sequential Recommendation

ViHoRec: A Quality-Controlled Vietnamese Hotel Recommendation Dataset and Cold-Start Benchmark

Hardness of Obligatory-Test Scheduling on Multiple Machines

Point Tracking in Surgery--The 2025 Surgical Tattoos in Infrared Challenge (STIRC2025)

Exact and Calibrated Diffusion Reconstruction for Digital Breast Tomosynthesis

Achieving Almost Exact Recovery in Almost Quadratic Time: Rank-Based Graph Matching via Local Tree Correlation Tests

Domain-Incremental Remote Sensing Change Detection via Difference-Guided Adaptation and Frequency-Decoupled Distillation

ExToken: Structured Exploration for Efficient Vision-Language-Action Reinforcement Fine-tuning

Efficient Sequential Calibration with O(T^{2/3-ε}) Error Bound

Knowledge- and Gradient-Guided Reinforcement Learning for Parametrized Action Markov Decision Processes

LatentFlow: A General Framework for Conditioning Stochastic Processes

The TopCoW Challenge -- Topology-Aware Circle of Willis Segmentation for CT and MR Angiography

Significance-First Splitting: Aligning Treatment Heterogeneity Detection with Honest Estimation

Contrastive-Collapsed Loss for Flexible and Geometrically Optimal Embeddings and Faster Convergence

Open-KNEAD: Knowledge-grounded Nutrition Estimation via Agentic Decomposition

Real-time fall detection based on vision for low-power edge platforms

Harness VLA: Steering Frozen VLAs into Reliable Manipulation Primitives via Memory-Guided Agents