Swift Sampling: Selecting Temporal Surprises via Taylor Series

**Figure 1.** **Swift Sampling efficiently identifies temporal surprises in videos** by measuring how much a frame deviates from the trajectory predicted by its preceding context. Using a Taylor expansion of visual features, we select frames with the largest residuals within their temporal neighborhood as keyframes. *Top:* Temporal surprise captured by the Taylor residual over time. *Bottom:* Input frames and the frames selected by Uniform sampling (orange), Cosine Uniqueness (yellow), and our method (green). Swift Sampling captures the video’s most informative frames with 30× less overhead than Cosine Uniqueness, while delivering an improvement of up to +12.5 points on VQA tasks on long videos with tight frame budgets.

Abstract

While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising frames and selected for sampling.

Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is lightweight: it adds only 0.02× computational overhead relative to uniform sampling (1.02× total inference cost), which is 30× less overhead than Cosine Uniqueness (1.60×). Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially powerful for long videos with limited frame budgets, improving accuracy by up to +12.5 points.

Method

Our method is based on a simple observation: long-form video consists of vast, highly predictable intervals punctuated by sparse temporal surprises. Swift Sampling treats the visual latent features of adjacent video frames as points lying on a locally smooth trajectory (Figure 2). This makes the trajectory amenable to polynomial approximation via a Taylor series with higher-order derivatives. Given the feature vectors of the N frames preceding the current frame t, we construct a Taylor predictor that captures the velocity (first order), acceleration (second order), and jerk (third order) of the feature trajectory. The Taylor residual is the ℓ₂ distance between the predicted feature f̂_t^(N) and the observed feature f_t:

r_t = ‖ f_t − f̂_t^(N) ‖₂

It serves as a principled, per-frame informativeness score. A small residual indicates a predictable, redundant frame (e.g., a bear's rhythmic walk), while a large residual signals a temporal surprise, i.e., a moment of genuinely new information (e.g., the sudden emergence of a seal out of the ice). For a given frame budget K, we select the K local maxima of the residual sequence, prioritizing the most surprising frame within each local temporal context. The sampling rate scales naturally with the video complexity, making our approach hyperparameter-light. Crucially, we compute these residuals directly from the intermediate representations of the VLM's vision encoder that must be computed anyway during the forward pass.
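The scoring and selection steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the derivatives are approximated with backward finite differences over unit frame spacing, an order-`order` predictor uses `order + 1` context frames (what a finite-difference estimate of the highest derivative needs), the first frames without enough context receive a residual of 0, and when fewer than K local maxima exist the budget is filled with the highest-residual remaining frames. The function names `taylor_residuals` and `select_keyframes` are hypothetical.

```python
import numpy as np
from math import comb


def taylor_residuals(feats, order=3):
    """Per-frame Taylor residual r_t = ||f_t - f_hat_t||_2.

    feats: array of shape (T, D), one pooled feature vector per frame.
    Assumption: f_hat_t is the backward-difference extrapolation
        f_hat_t = sum_{k=1}^{m} (-1)^(k+1) * C(m, k) * f_{t-k},  m = order + 1,
    i.e. the degree-`order` polynomial through the last m frames, evaluated at t.
    """
    feats = np.asarray(feats, dtype=float)
    num_frames = feats.shape[0]
    m = order + 1
    coeffs = np.array(
        [(-1) ** (k + 1) * comb(m, k) for k in range(1, m + 1)], dtype=float
    )
    residuals = np.zeros(num_frames)  # frames without full context keep residual 0
    for t in range(m, num_frames):
        context = feats[t - m:t][::-1]  # rows are f_{t-1}, f_{t-2}, ..., f_{t-m}
        prediction = coeffs @ context
        residuals[t] = np.linalg.norm(feats[t] - prediction)
    return residuals


def select_keyframes(residuals, k):
    """Pick up to k frame indices at local maxima of the residual sequence."""
    residuals = np.asarray(residuals, dtype=float)
    padded = np.concatenate(([-np.inf], residuals, [-np.inf]))
    # A peak is strictly above its left neighbor and not below its right one,
    # so a plateau contributes only its first frame.
    is_peak = (padded[1:-1] > padded[:-2]) & (padded[1:-1] >= padded[2:])
    peaks = np.flatnonzero(is_peak)
    chosen = peaks[np.argsort(residuals[peaks])[::-1][:k]]
    if len(chosen) < k:  # assumption: fill the budget with the next-best frames
        rest = np.setdiff1d(np.arange(len(residuals)), chosen)
        extra = rest[np.argsort(residuals[rest])[::-1][: k - len(chosen)]]
        chosen = np.concatenate([chosen, extra])
    return np.sort(chosen)
```

In this sketch an abrupt change raises the residual for several consecutive frames, because the changed frame also enters the context of the following predictions. Keeping only local maxima retains one frame per such burst, which matches the idea of choosing the most surprising frame within each local temporal context.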

**Figure 2.** Each frame is represented on the latent feature trajectory, where we apply Taylor expansion over preceding frames to predict the next frame feature. The residual between the prediction and the actual feature measures how much the trajectory deviates from a smooth continuation. Frames with large residuals correspond to temporal surprises, e.g., a seal suddenly emerging from the ice, which Swift Sampling effectively captures.

Main results

VQA accuracy across different video durations on Video-MME, LongVideoBench (LVB), and MLVU. All methods select 32 frames from a pool of 128. FLOPs report inference cost relative to uniform sampling.

Method	FLOPs	Video-MME				LongVideoBench				MLVU
Method	FLOPs	Short	Med.	Long	Overall	≥10m	≥20m	≥30m	Overall	≥10m	≥15m	≥30m	Overall
LLaVA-OneVision
Uniform	1.00×	69.9	56.4	48.8	58.3	45.2	47.5	48.1	55.3	61.4	54.4	50.0	64.7
Cosine Uniqueness	1.60×	65.3	54.7	47.0	55.7	47.0	47.1	46.3	52.5	63.6	61.1	47.9	65.4
Swift Sampling (Ours)	1.02×	71.0	56.9	49.2	59.0	51.6	54.3	50.9	57.9	62.2	58.2	54.2	65.6
LLaVA-Video
Uniform	1.00×	74.0	59.0	51.3	61.4	50.7	53.3	54.6	56.8	56.7	54.4	50.0	64.2
Cosine Uniqueness	1.60×	69.2	54.4	50.1	57.9	50.7	52.5	54.6	56.5	61.2	57.0	50.0	66.5
Swift Sampling (Ours)	1.02×	74.9	59.1	51.6	61.9	52.1	56.2	57.4	58.6	60.0	55.0	52.1	67.2

Swift Sampling brings consistent gains over uniform sampling on both backbones, with particularly strong improvements on long-duration videos: +6.8 points on LongVideoBench (≥20 min) and +4.2 points on MLVU (≥30 min) for LLaVA-OneVision, at only 1.02× inference cost.

Comparison with query-agnostic baselines

All methods select 32 frames from a pool of 128 on LLaVA-OneVision. FLOPs report inference cost relative to uniform sampling.

Method	FLOPs	Video-MME				LongVideoBench				MLVU
Method	FLOPs	Short	Med.	Long	Overall	≥10m	≥20m	≥30m	Overall	≥10m	≥15m	≥30m	Overall
Uniform	1.00×	69.9	56.4	48.8	58.3	45.2	47.5	48.1	55.3	61.4	54.4	50.0	64.7
Cosine Uniqueness	1.60×	65.3	54.7	47.0	55.7	47.0	47.1	46.3	52.5	63.6	61.1	47.9	65.4
Frame difference	1.00×	67.4	53.3	48.3	56.4	46.3	49.3	51.9	53.5	58.5	53.7	45.8	64.6
Iframe	1.00×	67.4	54.9	48.7	57.0	52.0	49.8	49.8	57.1	60.6	54.4	52.1	63.8
Pframe	1.00×	66.9	55.1	48.2	56.7	51.9	49.1	49.1	56.5	60.8	54.4	50.0	64.1
Optical Flow	1.07×	68.6	53.0	48.0	56.5	52.4	50.7	50.7	56.9	60.0	56.4	52.1	62.9
DySeg (adapted)	1.79×	69.6	53.7	48.4	57.2	46.0	48.2	51.9	52.9	49.6	47.7	48.3	63.1
MaxInfo	1.79×	71.1	57.2	48.8	58.9	51.4	50.8	50.0	57.8	63.0	59.1	51.1	66.5
Swift Sampling (Ours)	1.02×	71.0	56.9	49.2	59.0	51.6	54.3	50.9	57.9	62.2	58.2	54.2	65.6

Swift Sampling remains highly competitive with all query-agnostic baselines while operating at a negligible 1.02× inference cost, with the strongest gains on the long-video subsets.
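The overhead comparison can be reproduced from the FLOPs column of the table above. The short sketch below treats overhead as the relative cost in excess of uniform sampling; the helper names are illustrative, and the numbers are copied from the table.

```python
# Relative inference cost (FLOPs vs. uniform sampling), copied from the table.
relative_flops = {
    "Uniform": 1.00,
    "Cosine Uniqueness": 1.60,
    "Optical Flow": 1.07,
    "DySeg (adapted)": 1.79,
    "MaxInfo": 1.79,
    "Swift Sampling": 1.02,
}


def overhead(method):
    """Extra cost beyond uniform sampling, as a fraction of the uniform cost."""
    return relative_flops[method] - relative_flops["Uniform"]


def overhead_ratio(method, reference="Swift Sampling"):
    """How many times larger the overhead of `method` is than that of `reference`."""
    return overhead(method) / overhead(reference)


for name in ("Cosine Uniqueness", "Optical Flow", "MaxInfo"):
    print(f"{name}: {overhead_ratio(name):.1f}x the overhead of Swift Sampling")
```

For Cosine Uniqueness this gives 0.60 / 0.02, i.e. the 30× figure quoted for the method.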

Effect of frame budget

Swift Sampling consistently outperforms uniform sampling across all frame budgets K, with significant gains on longer videos and under highly constrained budgets. As the frame budget tightens, identifying temporal surprises becomes critical for model reasoning.

Budget K	MLVU (all)		MLVU (≥30 min)
Budget K	Uniform	Ours	Uniform	Ours
32	64.7	65.6	50.0	54.2
16	61.6	63.9	47.9	50.0
8	58.6	60.3	50.0	54.2
4	54.4	56.7	45.8	58.3
2	51.8	54.0	43.8	54.2

On videos longer than 30 minutes, Swift Sampling improves over uniform sampling by 12.5 points at K=4 and 10.4 points at K=2.
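As a quick arithmetic check, the per-budget gains can be recomputed from the table above. The dictionary layout and the `gain` helper are only illustrative; the accuracies are copied from the table.

```python
# (uniform, ours) accuracy per frame budget K, copied from the table above.
mlvu_all = {32: (64.7, 65.6), 16: (61.6, 63.9), 8: (58.6, 60.3),
            4: (54.4, 56.7), 2: (51.8, 54.0)}
mlvu_long = {32: (50.0, 54.2), 16: (47.9, 50.0), 8: (50.0, 54.2),
             4: (45.8, 58.3), 2: (43.8, 54.2)}


def gain(table, k):
    """Accuracy gain (in points) of Swift Sampling over uniform sampling at budget k."""
    uniform, ours = table[k]
    return round(ours - uniform, 1)


for k in sorted(mlvu_long, reverse=True):
    print(f"K={k:>2}: all {gain(mlvu_all, k):+.1f}, >=30 min {gain(mlvu_long, k):+.1f}")
```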

Applications

Token compression. We integrate Swift Sampling into UniComp, the current state-of-the-art token-compression method, by replacing its default uniform frame selection. Swift Sampling consistently boosts UniComp's accuracy across all retention ratios r, achieving a peak gain of +1.6 points on MLVU (r = 0.15). It offers a drop-in improvement for token-compression pipelines.

Video captioning. We evaluate Swift Sampling with LLaVA-OneVision on the TempCompass benchmark, where improved frame selection is expected to yield more informative captions. Swift Sampling improves captioning performance across nearly all categories, but struggles on attribute change.

Other downstream tasks in Video-MME. Analyzing performance by task category, Swift Sampling excels in reasoning-intensive tasks: Spatial Reasoning (+5.4%), Action Reasoning (+3.9%), Temporal Reasoning (+2.8%), and Action Recognition (+2.2%).

Qualitative comparison

Frames selected by Uniform sampling, Cosine Uniqueness, and Swift Sampling from a 128-frame candidate pool.

**Figure 4.** **Qualitative comparison of frame selection** on a sample video from the Video-MME dataset, given a budget to select 8 frames out of 128. Answering the question requires identifying the temporal order of several visually similar but semantically distinct painting events: establishing the background, drawing the water-lily pads, adding flowers, and increasing texture. Uniform sampling captures the background and water-lily pads but misses the frames showing the addition of flowers and texture, leading to an incorrect answer. Cosine Uniqueness over-selects visually salient but task-irrelevant frames, such as title cards and end screens, which is especially harmful under a limited frame budget. Swift Sampling focuses on temporally informative changes in the painting progression, capturing key intermediate stages and enabling correct temporal reasoning.

Analysis

**Figure 5.** Taylor residual across different layers. Key features (green) yield lower residuals than hidden output features (orange) at every layer, and the residual is smallest at the earliest layers, indicating that they are the most predictable from their temporal context. We use layer ℓ = 0 key features throughout our experiments.

Choice of feature layer and type. We study which layers and feature types yield the most predictable temporal dynamics. Mean-pooled key features in the earliest layers produce the lowest residuals: early-layer features provide stable, low-level scene representations that evolve smoothly, making them highly predictable yet sensitive to sudden surprises. We use ℓ = 0 as our default choice.

Choice of feature pooling. We aggregate each frame's token grid into an S×S patch grid before computing the residual, sweeping S ∈ {1, 2, 4, 7, 14}. Global mean pooling (S = 1) achieves the best overall performance; finer grids dilute the temporal signal, presumably because local residuals are dominated by texture noise and camera jitter.
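The S×S aggregation can be illustrated with a small NumPy sketch. Several details here are assumptions rather than facts from the text: the token features arrive as an (H·W, D) array in row-major order, cell boundaries follow an adaptive-average-pooling style split so that H and W need not be divisible by S, and the S² cell vectors are concatenated into one per-frame vector before the residual is computed. The name `pool_tokens` is hypothetical.

```python
import numpy as np


def pool_tokens(tokens, grid_hw, s):
    """Average a frame's token features into an s x s grid of cells.

    tokens:  array of shape (H * W, D), row-major over the token grid (assumed).
    grid_hw: (H, W) shape of the token grid.
    s:       number of cells per side; s = 1 is global mean pooling.
    Returns a vector of length s * s * D (cells concatenated in row-major order).
    """
    h, w = grid_hw
    if not 1 <= s <= min(h, w):
        raise ValueError("s must be between 1 and min(H, W)")
    grid = np.asarray(tokens, dtype=float).reshape(h, w, -1)
    # Integer cell boundaries; strictly increasing because s <= min(H, W).
    row_edges = (np.arange(s + 1) * h) // s
    col_edges = (np.arange(s + 1) * w) // s
    cells = [
        grid[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]].mean(axis=(0, 1))
        for i in range(s)
        for j in range(s)
    ]
    return np.concatenate(cells)
```

With `s = 1` the function returns the global mean of all tokens, which corresponds to the default setting reported above.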

Effect of Taylor expansion order. VQA accuracy improves sharply from N = 1 to N = 3 across all three benchmarks, after which performance saturates. Low-order terms effectively capture the majority of predictable local dynamics, so we adopt N = 3 as the default.
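To make the role of the order concrete, the predictors for N = 1, 2, 3 can be written out explicitly. This is a worked sketch under one plausible discretization, backward finite differences with unit frame spacing, where ∇ denotes the backward difference ∇f_{t−1} = f_{t−1} − f_{t−2}. The text does not spell out the discretization, and a variant that keeps explicit 1/k! factors on the higher-order terms is equally conceivable.

```latex
\begin{aligned}
N = 1:\quad \hat{f}_t^{(1)} &= f_{t-1} + \nabla f_{t-1}
  = 2 f_{t-1} - f_{t-2}, \\
N = 2:\quad \hat{f}_t^{(2)} &= \hat{f}_t^{(1)} + \nabla^2 f_{t-1}
  = 3 f_{t-1} - 3 f_{t-2} + f_{t-3}, \\
N = 3:\quad \hat{f}_t^{(3)} &= \hat{f}_t^{(2)} + \nabla^3 f_{t-1}
  = 4 f_{t-1} - 6 f_{t-2} + 4 f_{t-3} - f_{t-4}.
\end{aligned}
```

Each additional order adds one correction term (velocity, then acceleration, then jerk) and, in this form, one more frame of context.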

More qualitative results

Additional examples across Video-MME, LongVideoBench, and MLVU, including Taylor-residual visualizations.

**Figure 6.** **Additional qualitative example** on an MLVU surveillance video, selecting 8 frames out of 128. The correct answer requires identifying a fire breakout late in the video. Uniform sampling distributes frames evenly and captures only a brief glimpse of the fire. Cosine Uniqueness concentrates on visually distinctive but content-redundant frames in the dark surveillance footage and entirely misses the fire. Our method captures the transition from normal surveillance to fire breakout, providing the video LLM with the temporal evidence needed to correctly identify the irregularity as arson. Crucially, Swift Sampling successfully captures **local changes, e.g., a person entering between frame 111 and frame 118** as well as **global changes such as fire breakout (frame 123)**.

**Figure 7.** **Qualitative example from the LVB dataset.** Each row shows the 16 frames selected by Uniform sampling (red), Cosine Uniqueness (yellow), and our method (green) from the full 128-frame candidate pool. The question asks what the man in a yellow short-sleeve shirt is doing in a bustling street scene, with the correct answer being *riding a bicycle*. Answering this question requires retaining the specific street scene described in the question and the local action within it. Uniform sampling and Cosine Uniqueness both select visually diverse city scenes, but they miss the target moment around frame #42 and therefore predict *walking*. In contrast, our method selects a more diverse set of frames across the video while still retaining the question-relevant street scene around frame #42. As a result, it preserves the frame needed to correctly identify that the man is riding a bicycle.

**Figure 8.** **Qualitative example from the MLVU dataset.** Each row shows the 8 frames selected by Uniform sampling (red), Cosine Uniqueness (yellow), and our method (green) from the full 128-frame candidate pool. The question asks where the basketball court in the video is located, with the correct answer being *on a cruise ship*. The video contains highly diverse scenes, so answering the question requires retaining broad scene coverage together with the brief contextual cues that indicate the cruise-ship setting. Uniform sampling captures mainly natural scenes and also includes visually similar selections, leading the model to predict *in a park*. Cosine Uniqueness concentrates on a narrow subset of visually distinctive frames, such as flames and bright sky shots, leaving the cruise-ship context under-covered. In contrast, our method selects a more diverse set of frames across the video and captures the basketball-court scene in frame #38. In this frame, the court appears together with the surrounding waterway, allowing the model to infer that it is located on a cruise ship. As a result, it correctly answers the question.

**Figure 9.** **Qualitative example from the MLVU dataset.** Each row shows the 8 frames selected by Uniform sampling (red), Cosine Uniqueness (yellow), and our method (green) from the full 128-frame candidate pool. The question asks for the color of the trolley picked up in the video, with the correct answer being *blue*. This is an egocentric video of a construction-related task, and answering the question requires capturing the brief moment when the trolley is picked up. Uniform sampling misses this moment entirely. Cosine Uniqueness captures the trolley, but only partially, making its color difficult to identify. In contrast, our method captures the pickup moment with the trolley clearly visible, preserving the information needed to correctly identify its color.

**Figure 10.** **Qualitative example from the Video-MME dataset under a tight frame budget.** Each row shows the 8 frames selected by Uniform sampling (red), Cosine Uniqueness (yellow), and our method (green) from the full 128-frame candidate pool. The question asks which option best summarizes the main content of the video, with the correct answer being *Product advertisement*. This question requires integrating evidence across multiple scenes rather than recognizing a single salient moment. Uniform sampling captures a range of scenes, but because the selected frames are spaced evenly, they resemble a sequence of unrelated movie shots and lead the model to predict *Movie commentary*. Cosine Uniqueness concentrates on highly distinctive black transition frames under this limited frame budget, leaving too few informative frames to recover the video's overall theme. In contrast, our method selects multiple device-centric frames along with diverse contextual scenes, better preserving the recurring visual cues that characterize the video as a product advertisement. As a result, it preserves the information needed to correctly identify the video as a product advertisement.

**Figure 11.** **Qualitative example from the Video-MME dataset.** Each row shows the 32 frames selected by Uniform sampling (red), Cosine Uniqueness (yellow), and our method (green) from the full 128-frame candidate pool. The question asks how many of Mbappé's spectacular goals are shown in the video, with the correct answer being 4. The video shows four separate goal segments, each marked by a numbered overlay. Uniform sampling distributes frames evenly but misses one of these goal segments. Cosine Uniqueness selects visually salient frames, such as close-up shots and repeated graphic frames, but does not cover all of the numbered goal segments needed for counting. In contrast, our method selects a more diverse set of frames across the video with good temporal coverage, capturing frames corresponding to goals #1, #2, #3, and #4. As a result, it retains the frames needed to correctly answer the counting question.

**Figure 12.** **Qualitative example on an informational video** from the Video-MME dataset. Each row shows the 32 frames selected by Uniform sampling (red), Cosine Uniqueness (yellow), and our method (green) from the full 128-frame candidate pool. The question asks which item is listed in the top 2, with the correct answer being *London Aquatics Centre*. Since the video presents several places one after another, the model needs frames that cover the relevant places and their order in the video. Uniform sampling provides broad temporal coverage but still contains redundant selections. Cosine Uniqueness captures visually distinctive frames, but several of its selected frames come from similar moments of the same entries, so it misses some of the places needed to answer the question. In contrast, our method selects a more diverse set of frames across the video, providing broader coverage of the entries and retaining the question-relevant moment (frame #71). As a result, it preserves the evidence needed to correctly identify the top-2 item.

**Figure 13.** **Qualitative example on a visually static yoga video from the Video-MME dataset.** Each block shows the Taylor residual over time (*top*) and the corresponding input frames together with the 32 frames selected by Uniform sampling (red), Cosine Uniqueness (yellow), and our method (green) from the full 128-frame candidate pool. The question asks for the order in which four poses are introduced, with the correct answer being (b), (d), (c), (a). Unlike videos with frequent scene cuts, this clip remains visually static, so the relevant evidence lies in subtle pose transitions. Uniform sampling and Cosine Uniqueness both miss the seated chair twist around frame #42 and therefore confuse (b) downward facing dog with (d) seated chair twist. Our method captures this key transition and correctly answers the question, illustrating that temporal surprise can identify informative moments even when motion is small and the overall scene changes very little.

**Figure 14.** **Qualitative example from the MLVU dataset.** Each block shows the Taylor residual over time (*top*) and the corresponding input frames together with the 8 frames selected by Uniform sampling (red), Cosine Uniqueness (yellow), and our method (green) from the full 128-frame candidate pool. The question asks whether there is any irregularity in the surveillance video and, if so, what type it is, with the correct answer being *arson*. The video is largely composed of repetitive grayscale surveillance footage, and the key evidence appears only when the scene changes into a fire breakout near the end. Uniform sampling distributes frames evenly across the clip and captures only a weak glimpse of this anomaly, leading the model to predict *stealing*. Cosine Uniqueness concentrates on visually distinctive but content-redundant frames in the surveillance footage and fails to preserve the fire breakout. In contrast, our method captures the transition from normal surveillance to the arson scene, preserving the anomaly-related frames needed to correctly identify the irregularity as *arson*.

**Figure 17.** **Detailed view of Fig. 1.** Full frame selection results on the same video. Each block shows the Taylor residual over time (*top*) and the corresponding input frames together with the 32 frames selected by Uniform sampling (red), Cosine Uniqueness (yellow), and our method (green) from the full 128-frame candidate pool. Our method selects temporally surprising moments. In contrast, uniform sampling distributes frames evenly, sometimes resulting in redundant selections. Cosine Uniqueness sometimes concentrates on a narrow set of visually distinctive but content-redundant frames.

**Figure 18.** **Qualitative example: Action Recognition** from the Video-MME dataset. The question asks *"What is the last magic the magician played?"* The 8 key frames corresponding to the last magic trick are shown for Uniform sampling and our method. Correctly identifying the action requires capturing three key moments: the introduction of the small feather, the small feather levitating, and the appearance of the large feather. Our method selects all three frames (frames #59, #78, and #86), while Uniform sampling misses the levitation, making the transition from the small feather to the large feather ambiguous.

BibTeX

@article{kim2026swift,
  title         = {Swift Sampling: Selecting Temporal Surprises via Taylor Series},
  author        = {Kim, Dahye and Sachdeva, Bhuvan and Uppal, Karan and Gupta, Naman
                   and Balasubramanian, Vineeth N. and Ghadiyaram, Deepti},
  journal       = {arXiv preprint arXiv:2605.22678},
  year          = {2026}
}