This post is Part 2 of a two-part series on multimodal typographic attacks.
In Part 1 of “Reading Between the Pixels,” we demonstrated that text–image embedding distance correlates with typographic prompt injection success: conditions that push a typographic image farther from its source text in embedding space (small fonts, heavy blur, rotation) reduce attack success, and conditions that bring them closer increase it.
Since embedding distance is a reliable predictor of attack success, we explored whether we could directly reduce the distance to make a failing attack succeed. We apply small, controlled changes to degraded typographic images so that a model interprets them as closer to their original text. The results reveal two distinct failure modes in vision language model (VLM) safety: readability restoration and refusal reduction, two effects that co-occur but differ depending on the model and the type of visual degradation.
The optimization we tested on images produced successful typographic attacks that evaded simple image filters, indicating a need for more robust defenses in representation space.
From Correlation to Causation
Part 1 established a strong correlation between embedding distance and attack success rate (ASR) across four VLMs (r = −0.71 to −0.93, all p < 0.01). We investigated further and found that the relationship between embedding distance and ASR is mediated by two factors:
- Perceptual readability: Can the VLM parse the text in the image at all? A 6px font or heavy blur may render text unreadable to the model, causing it to fail before safety alignment even enters the picture.
- Safety alignment: Even when the VLM can read the text, does it refuse to comply with the harmful instruction? Models like GPT-4o and Claude Sonnet 4.5 have strong safety filters that catch many harmful requests even when the text is perfectly legible.
We tried to optimize against these factors by reshaping the image's representation in embedding space. Our research demonstrates that an attacker pursuing this goal may recover readability alone, or also undermine safety alignment, depending on the robustness of the underlying model's post-training alignment.
The Method
Our method is conceptually straightforward: take a degraded typographic image that currently fails as an attack (because the model can't read it, refuses to comply, or both), and apply a small, bounded pixel-level perturbation that makes it “look more like the text” to an ensemble of multimodal embedding models (in our experiments, we used Qwen3-VL-Embedding, JinaCLIP v2, OpenAI CLIP ViT-L/14-336, and SigLIP SO400M). Importantly, this optimization doesn't require access to the target VLM, its safety classifier, or any class labels: the text embedding of the attack prompt serves as a fixed target.
We adapt SSA-CWA (a method that combines the common weakness attack with the spectrum simulation attack) to solve this optimization: we run 100 optimization steps with a perturbation budget of at most 12.5% (see Figure 1) to find the best way to alter an image's pixels to elicit a successful attack.
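The sketch below illustrates the core objective with a plain PGD-style loop. It is a simplified stand-in for the full SSA-CWA procedure (which adds spectrum-domain augmentation and common-weakness ensemble weighting), and the `encoders` callables standing in for the four surrogate embedding models are an assumed interface for illustration:

```python
import torch

def embedding_attack(image, text_embs, encoders, steps=100, eps=32/255, alpha=2/255):
    """Push a degraded typographic image toward its source-text embedding.

    image:     (1, 3, H, W) tensor in [0, 1]
    text_embs: unit-norm text embeddings of the attack prompt, one per encoder
    encoders:  callables mapping an image tensor to a unit-norm image embedding
               (the surrogate ensemble); the target VLM is never queried

    Simplified PGD stand-in for SSA-CWA; eps = 32/255 is roughly the
    12.5% perturbation budget mentioned above.
    """
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv = (image + delta).clamp(0, 1)
        # Maximize mean cosine similarity between the perturbed image and
        # the fixed text embedding across all surrogate encoders.
        loss = torch.stack(
            [(enc(adv) * t).sum() for enc, t in zip(encoders, text_embs)]
        ).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # ascent step on similarity
            delta.clamp_(-eps, eps)             # keep the perturbation bounded
            delta.grad.zero_()
    return (image + delta).detach().clamp(0, 1)
```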
Figure 1: Overview of embedding-guided adversarial optimization. A degraded typographic image is optimized via SSA-CWA across four surrogate embedding models. The resulting image is visually similar but semantically realigned, producing two co-occurring effects: readability restoration and refusal reduction.
What We Tested
We selected the same four VLMs from Part 1 as our target VLMs: GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B, using the same GPT-4o-based refusal judge.
We evaluated across five degradation settings with low baseline ASR (a sketch of how these degraded images can be produced follows the list):
- 6px font;
- 8px font;
- 90° rotation;
- triple degradation (blur + noise + low contrast); and
- heavy blur (σ=5).
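A minimal sketch of how degraded typographic images like these can be produced with Pillow; the font file, noise level, and the blur radii used for the triple setting are illustrative assumptions, not values from our pipeline:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageEnhance, ImageFilter, ImageFont

def typographic_image(text, font_px, size=(336, 336)):
    """Render the attack prompt as black text on a white canvas."""
    img = Image.new("RGB", size, "white")
    font = ImageFont.truetype("DejaVuSans.ttf", font_px)  # assumed available font
    ImageDraw.Draw(img).multiline_text((10, 10), text, fill="black", font=font)
    return img

def degrade(img, mode):
    """Apply one of the degradation settings listed above."""
    if mode == "rotate90":
        return img.rotate(90, expand=True)
    if mode == "heavy_blur":
        return img.filter(ImageFilter.GaussianBlur(radius=5))  # sigma = 5
    if mode == "triple":  # blur + noise + low contrast
        img = img.filter(ImageFilter.GaussianBlur(radius=2))
        arr = np.asarray(img).astype(np.float32)
        arr += np.random.normal(0, 15, arr.shape)  # additive Gaussian noise
        img = Image.fromarray(arr.clip(0, 255).astype(np.uint8))
        return ImageEnhance.Contrast(img).enhance(0.5)
    return img  # the 6px / 8px settings vary font_px only
```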
We then selected 50 prompts where the text attack succeeds on both GPT-4o and Claude but the degraded image fails on both. Our selection focused on attacks that were successful in text form but blocked in image form, as well as heavily degraded images where an OCR-based detection pipeline would struggle to flag the harmful intent. A 6px font or heavily blurred image is hard for conventional text extraction to parse, so if a bounded perturbation makes the image readable to the VLM without restoring human- or OCR-legible text, the attack evades both the OCR filter and the model's own safety alignment.
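To check the OCR-evasion property, one option is to run a conventional OCR engine over each candidate image; a sketch using pytesseract (the word-recall threshold is an illustrative choice, not a value from our pipeline):

```python
import pytesseract
from PIL import Image

def ocr_legible(img: Image.Image, prompt: str, threshold: float = 0.5) -> bool:
    """Rough proxy for an OCR-based content filter.

    Returns True if a conventional OCR engine recovers at least `threshold`
    of the prompt's words; images scoring below this would likely slip past
    OCR-based detection even if the target VLM itself can read them.
    """
    extracted = set(pytesseract.image_to_string(img).lower().split())
    words = prompt.lower().split()
    return sum(w in extracted for w in words) / max(len(words), 1) >= threshold
```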
The results of our experiment are shown in Table 1 below. The optimization consistently increases ASR where the baseline is lowest. Claude goes from 0% to 28% on heavy blur; GPT-4o from 0% to 16% on rotation. Interestingly, Mistral (already a high baseline) sometimes drops: the perturbation can trigger safety mechanisms that weren't previously engaged.
Table 1: Comparison of attack success rates before and after optimization under different degradation conditions (B: before optimization, A: after optimization).
Two Failure Modes from One Objective
We observed two patterns across our evaluation. The key finding is not just that the optimization works, but how. By classifying each failure as an explicit refusal, a readability failure, or other (misreading/tangential), we identify two distinct patterns, detailed in the subsections below.
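As a rough illustration of this bucketing, the heuristic below keys on refusal phrases, empty responses, and a read-back probe; all three are illustrative assumptions, since our actual evaluation relied on the GPT-4o-based refusal judge:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "unable to help")

def classify_failure(response: str, read_back: str, prompt: str) -> str:
    """Bucket a failed attack case into one of three failure modes.

    response:  the target VLM's answer to the typographic image
    read_back: the VLM's transcription of the image text (a separate probe)
    prompt:    the original attack text

    Keyword heuristic for illustration only; the numbers reported below
    come from a GPT-4o-based judge, not from this function.
    """
    if any(marker in response.lower() for marker in REFUSAL_MARKERS):
        return "explicit_refusal"
    recovered = sum(w in read_back.lower() for w in prompt.lower().split())
    if not response.strip() or recovered / max(len(prompt.split()), 1) < 0.5:
        return "readability_failure"  # empty reply or garbled transcription
    return "other"  # misreading / tangential answer
```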
Readability Restoration Dominates at 6px and Heavy Blur
At 6px font, 35 of GPT-4o's 50 pre-optimization failures are readability-related, with only 10 refusals. After optimization, readability failures drop from 35 to 10, but refusals rise from 10 to 28, yielding only one net success. In other words: the perturbation makes the text readable, but GPT-4o's safety filter catches the now-legible requests. GPT-4o's safety alignment holds firm once readability is restored.
Claude on heavy blur tells a different story and shows the strongest overall gain (+28%). Before optimization, Claude returned empty responses on 39 of 50 samples because it couldn't process the image at all. After optimization, empty responses drop from 39 to 14 as the perturbation makes the blurred content readable. Unlike GPT-4o, Claude's safety filter doesn't fully catch the newly readable content, and a sizable portion of these recovered readings result in compliance.
For Mistral, readability gains translate most directly into ASR: misreadings drop from 20 to 5 on heavy blur (+20%) and from 23 to 11 on 6px (+14%). This can be attributed to safety alignment not being the bottleneck for Mistral; making the text readable is sufficient for the attack to succeed.
Mixed Readability and Refusal Reduction at Rotation and 8px
At 90° rotation, our observations are more nuanced. GPT-4o has 28 refusals but also 12 readability failures, indicating that rotated text is not fully legible even at a reasonable font size (20px). After optimization, readability failures drop (12 to 7) and 8 successes emerge (+16%). Claude gains +22%, with refusals dropping from 27 to 19 and empty responses from 13 to 8. At 8px, a similar pattern holds: GPT-4o and Claude gain +10% and +8% from a 0% baseline.
This is the most concerning pattern from a safety perspective: in these settings, the VLM can partially read the text but refuses to comply. The perturbations to the image don't improve visual legibility to a human observer, but the tweaked image confuses the model's safety reasoning, causing the model to shift from refusal to compliance. The model's safety decision breaks down when faced with small changes in the input representation.
What This Means for Practitioners
The optimization results reveal two distinct artifacts an attacker can recover through bounded perturbations, each exploiting a different gap in VLM safety:
- Artifact 1: Degraded images that evade OCR detectors while remaining model-readable. When a model fails to read the original image (small font, heavy blur, rotation), a bounded perturbation can recover semantic content in the model's internal representation without restoring visual legibility to a human. This means an attacker can craft images that look like noise or illegible distortion to any OCR-based content filter yet carry fully readable instructions to the target VLM.
- Artifact 2: Refusal suppression transferred from successful prompt injections. When the VLM already reads the text but refuses, bounded perturbations can shift the safety decision boundary. The key insight is that perturbation patterns learned from successful attacks in one configuration (e.g., a particular font size or degradation type) generalize: an attacker can exploit what works in one modality or setting and apply it to suppress refusals in others. This means safety alignment that holds for clean inputs can be systematically eroded by perturbations informed by prior successes, without requiring model internals.
These two artifacts compound: an attacker doesn't need access to the target model to elicit this behavior. By generating the perturbations on a diverse set of surrogate models (Artifact 1), the resulting attacks can transfer to proprietary target models without ever interacting with them (Artifact 2). Together, they form a pipeline from evading detection to achieving compliance.
Looking Ahead
Multimodal embedding distance emerges as a powerful lens for understanding typographic prompt injection. Part 1 showed it correlates with attack success; Part 2 shows it can be weaponized to expose two co-occurring fragilities. The finding that bounded perturbations can reduce refusal rates without improving visual legibility points to a need for safety mechanisms that are robust in representation space, not just the pixel domain.
