Sport-transfer SSL — what scaling up actually changed
A follow-up: capacity, resolution, fine-tuning, and a real cross-sport win.
After the first writeup, three follow-up questions kept nagging. With more compute we tested them all — and one downstream task we hadn't measured before flipped the story.
01 · Three open questions
In the first writeup, we measured a small from-scratch vision transformer (ViT-S, 22 M parameters) trained on 300 hours of soccer video, and compared it to frozen DINOv3 on two downstream tasks. The shape of the result was clean:
- In-domain (soccer action recognition): from-scratch came within ~11% of DINOv3.
- Cross-domain (basketball player segmentation): DINOv3 won by ~41%.
We flagged three follow-ups as “would pursue with more compute”: a bigger from-scratch model, proper DINO-style mixed-resolution multi-crop, and a multi-sport corpus. This post is what happened when we ran the first two — and added a third probe we hadn’t tried before. Same tags, same evaluation pipeline, same honest book-keeping.
02 · The new configurations
Two structural changes, one evaluation addition.
Bigger model — ViT-B/16 (86 M parameters). Same architecture family as before but ~4× the parameters. Trained from random weights on the same SoccerNet corpus, same SSL objective.
Mixed-resolution multi-crop. The original runs used 2 global crops per step at 224×224. The standard SSL recipe (DINO-style) adds 6 local crops at 96×96 — small windows that force the model to recognize parts of the image at multiple scales. Eight crops per step total, but the local crops are cheap (96² has roughly one-fifth the FLOPs of 224²). This is what we’d flagged as the “proper multi-resolution” follow-up.
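The "roughly one-fifth" figure can be sanity-checked by counting patch tokens. This is a minimal sketch, assuming a 16×16 patch size (as in the /16 models used here) and using token count as a stand-in for FLOPs; it ignores the class token and the quadratic attention term, which would make local crops slightly cheaper still.

```python
PATCH = 16  # assumption: /16 patch size, as in the ViT-S/16 and ViT-B/16 runs

def patch_tokens(side: int, patch: int = PATCH) -> int:
    """Number of patch tokens for a square crop with the given side length."""
    return (side // patch) ** 2

global_tokens = patch_tokens(224)  # 14 * 14 patches
local_tokens = patch_tokens(96)    # 6 * 6 patches
ratio = local_tokens / global_tokens

# Tokens processed per step: 2 global crops plus 6 local crops.
tokens_per_step = 2 * global_tokens + 6 * local_tokens
relative_cost = tokens_per_step / (2 * global_tokens)

print(f"local/global token ratio: {ratio:.3f}")
print(f"tokens per step: {tokens_per_step}")
print(f"cost vs. two global crops only: {relative_cost:.2f}x")
```

Under this approximation, the six local crops add about 55% to the token budget of the two global crops, even though they quadruple the number of views.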
A new downstream task: cross-sport player re-identification. We added a third evaluator — the SynergyReID dataset of basketball player crops. Given a query image of a basketball player, retrieve the same player from a gallery of 910 candidates. This is a cross-domain task (model pretrained on soccer, evaluated on basketball) but a different kind of cross-domain task than player segmentation: instance retrieval rather than dense semantic classification.
The combination gives us three independent transfer questions on the same backbone: in-domain semantics (soccer action), cross-domain dense semantics (basketball segmentation), and cross-domain instance retrieval (basketball re-ID).
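To make the re-ID metric concrete, here is a minimal sketch of retrieval mAP. It assumes cosine similarity between embeddings and the standard average-precision definition; the benchmark's own protocol (including how the "pooled" mAP is aggregated) may differ, and the toy embeddings and player IDs below are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_precision(query_vec, query_id, gallery):
    """AP for one query; gallery is a list of (embedding, player_id)."""
    ranked = sorted(gallery, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    hits, precisions = 0, []
    for rank, (_, player_id) in enumerate(ranked, start=1):
        if player_id == query_id:
            hits += 1
            precisions.append(hits / rank)  # precision at each correct match
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(queries, gallery):
    """Mean of per-query AP over a list of (embedding, player_id) queries."""
    return sum(average_precision(v, pid, gallery) for v, pid in queries) / len(queries)

# Toy data (hypothetical): two players, two gallery crops each.
gallery = [([1.0, 0.0], "A"), ([0.9, 0.1], "A"), ([0.0, 1.0], "B"), ([0.1, 0.9], "B")]
queries = [([1.0, 0.05], "A"), ([0.8, 0.6], "B")]
toy_map = mean_average_precision(queries, gallery)
print(f"toy mAP: {toy_map:.3f}")
```

The first query ranks both of its matches on top (AP = 1.0); the second ranks its matches third and fourth (AP ≈ 0.417), so the toy mAP is about 0.708.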
03 · A capacity story with a Chinchilla twist
We ran ViT-B/16 mixed-res at 50,000 steps first. It underperformed the smaller ViT-S/16 mixed-res 100k run on every task. The natural first reading is “bigger model didn’t help” — but that ignores how training compute scales with parameters.
ViT-B has roughly 4× the parameters of ViT-S, so at 50k steps it has seen only about 0.125× as much training per parameter as ViT-S at 100k (half the steps spread over 4× the parameters). So we re-ran ViT-B at 150,000 steps, which raises that ratio to about 0.375× and puts it on closer per-parameter footing with the 100k ViT-S baseline.
| Model | Steps | Basketball IoU | Soccer accuracy | Re-ID mAP |
|---|---|---|---|---|
| ViT-S mixed-res | 100,000 | 0.414 | 0.266 | 0.559 |
| ViT-B mixed-res | 50,000 | 0.343 | 0.230 | 0.545 |
| ViT-B mixed-res | 150,000 | 0.439 | 0.271 | 0.618 |
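ViT-B at 150k passes ViT-S at 100k on every task, though only the re-ID margin (+0.059 mAP) is large; the segmentation (+0.025 IoU) and soccer (+0.005) margins are small. The original 50k run wasn’t capacity-limited; it was training-budget-limited.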
Lesson: When a larger model underperforms a smaller one, check whether you’re giving them comparable training compute, not comparable wall-clock or comparable step counts. A bigger model needs proportionally more steps just to reach the same point along its own training curve.
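The per-parameter comparison is simple arithmetic. This sketch uses steps-per-parameter as a rough proxy for training budget, assuming the batch size is the same across runs (not stated explicitly above); the parameter counts are the ones quoted in this post.

```python
# Parameter counts quoted in the post.
PARAMS = {"ViT-S": 22e6, "ViT-B": 86e6}

def steps_per_param(model: str, steps: int) -> float:
    """Training steps per parameter: a crude proxy for per-parameter budget."""
    return steps / PARAMS[model]

baseline = steps_per_param("ViT-S", 100_000)

for steps in (50_000, 150_000):
    rel = steps_per_param("ViT-B", steps) / baseline
    print(f"ViT-B @ {steps:,} steps: {rel:.2f}x the per-parameter budget of ViT-S @ 100k")

# Steps ViT-B would need to fully match ViT-S @ 100k under this proxy.
matched_steps = 100_000 * PARAMS["ViT-B"] / PARAMS["ViT-S"]
print(f"matched budget would need ~{matched_steps:,.0f} steps")
```

With the quoted counts the ratios come out to about 0.13× and 0.38× (0.125× and 0.375× with the rounded 4× figure). Under this proxy, even the 150k run is well short of a fully matched budget, which would need roughly 390k steps.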
04 · The cross-sport instance-retrieval flip
Adding the re-ID probe surfaced a result we didn’t expect.
| Backbone | Re-ID pooled mAP | Δ vs DINOv3 |
|---|---|---|
| Frozen DINOv3 | 0.529 | — |
| ViT-S mixed-res (100k) | 0.559 | +0.030 (+5.7%) |
| ViT-B mixed-res (150k) | 0.618 | +0.089 (+16.8%) |
On cross-sport player re-ID, our soccer-only from-scratch model beats DINOv3 by 17% relative mAP.
This is the largest “from-scratch beats foundation model” margin we’ve measured anywhere in the study. The story makes sense in retrospect: soccer broadcast video is saturated with humans (players, referees, coaches). Our pretraining objective spends all 300 hours of video learning what makes one person visually distinct from another. DINOv3’s general pretraining sees a far broader variety of subjects, with no particular emphasis on person-identity. On a task that asks specifically “is this the same person?”, the narrow-domain specialist wins decisively.
The same backbone trained for the same number of steps on soccer video produces:
- Soccer action features that nearly match DINOv3 (in-domain semantics, narrow loss that is effectively a tie).
- Basketball player-identity features that exceed DINOv3 (cross-domain instance retrieval, clear win).
- Basketball segmentation features that fall well short of DINOv3 (cross-domain dense semantics, clear loss).
The cross-domain story was never one story. It splits by what kind of cross-domain task you’re asking about.
05 · Three independent ways of probing it
Basketball segmentation is the place where DINOv3 still dominates. We ran three independent follow-up tests to see if anything could close that gap.
5a · Does higher input resolution help? We took the ViT-B 150k checkpoint and evaluated it at 448×448 (4× the pixels) via bicubic interpolation of the positional embeddings. The frozen-feature linear probe lifted by +0.034 IoU (0.439 → 0.473). Real, but not story-flipping.
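Going from 224×224 to 448×448 changes the patch grid from 14×14 to 28×28, so the learned positional embeddings have to be resampled. The sketch below illustrates the idea in pure Python for a single embedding channel. It is not the code we ran: in practice a library interpolation routine does this over all channels at once, and the kernel constant `a = -0.5`, the half-pixel sampling convention, and the edge clamping are all assumptions.

```python
import math

def cubic_weight(t: float, a: float = -0.5) -> float:
    """Keys cubic-convolution kernel; `a` is an assumed constant (libraries differ)."""
    t = abs(t)
    if t <= 1.0:
        return (a + 2.0) * t**3 - (a + 3.0) * t**2 + 1.0
    if t < 2.0:
        return a * t**3 - 5.0 * a * t**2 + 8.0 * a * t - 4.0 * a
    return 0.0

def resize_1d(values, new_len):
    """Cubic resampling of a 1-D sequence (half-pixel centers, clamped edges)."""
    old_len = len(values)
    scale = old_len / new_len
    out = []
    for i in range(new_len):
        src = (i + 0.5) * scale - 0.5          # source coordinate of output sample i
        base = math.floor(src)
        acc = 0.0
        for k in range(base - 1, base + 3):    # four neighbouring source samples
            idx = min(max(k, 0), old_len - 1)  # clamp indices at the borders
            acc += cubic_weight(src - k) * values[idx]
        out.append(acc)
    return out

def resize_grid(grid, new_side):
    """Resize one channel of a square positional-embedding grid (list of rows)."""
    rows = [resize_1d(row, new_side) for row in grid]               # along width
    cols = [resize_1d(list(col), new_side) for col in zip(*rows)]   # along height
    return [list(row) for row in zip(*cols)]                        # back to row-major

old_side, new_side = 224 // 16, 448 // 16   # 14 -> 28 patches per side
channel = [[float(r * old_side + c) for c in range(old_side)] for r in range(old_side)]
resized = resize_grid(channel, new_side)
print(len(resized), len(resized[0]))  # 28 28
```

A full version would apply this to every embedding channel and leave any class-token embedding untouched.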
5b · Was the linear probe limiting us? We replaced the linear probe with two more expressive options: a 2-layer MLP (768→256→1) and a 3×3 spatial conv probe (256 hidden channels) that lets neighboring patches inform each other. We ran both on B 150k features and on frozen DINOv3 features, at 224×224 for an apples-to-apples comparison.
| Probe | DINOv3 IoU | ViT-B 150k IoU |
|---|---|---|
| Linear | 0.671 | 0.439 |
| 2-layer MLP | 0.620 | 0.341 |
| 3×3 spatial conv | 0.618 | 0.410 |
Linear is the best probe family for both backbones: adding capacity to the head hurts on both. The basketball train set has 208 images, which is too few for a higher-capacity probe to generalize. The reported 0.439-vs-0.671 gap is not a probe-capacity artifact: the difference lies in the features themselves.
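A quick parameter count shows why the larger heads struggle on 208 images. This is a rough sketch for the ViT-B features: it assumes bias terms are included and that the conv probe ends in a 1×1 convolution to a single logit (the output layer of the conv probe is not specified above, so that part is hypothetical).

```python
FEATURE_DIM = 768                       # ViT-B token width, per the 768→256→1 MLP
HIDDEN = 256                            # hidden width / channels of both probes
TRAIN_IMAGES = 208
PATCHES_PER_IMAGE = (224 // 16) ** 2    # 14 x 14 patches at 224x224

def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out

probes = {
    "linear": dense_params(FEATURE_DIM, 1),
    "2-layer MLP": dense_params(FEATURE_DIM, HIDDEN) + dense_params(HIDDEN, 1),
    # Assumption: 3x3 conv to 256 channels, then a 1x1 conv to one logit.
    "3x3 spatial conv": conv_params(FEATURE_DIM, HIDDEN, 3) + conv_params(HIDDEN, 1, 1),
}

labeled_patches = TRAIN_IMAGES * PATCHES_PER_IMAGE
for name, count in probes.items():
    print(f"{name:>17}: {count:>9,} params, {labeled_patches / count:.3f} labeled patches per param")
```

The linear probe has 769 parameters against 40,768 labeled patches, while the MLP and conv probes have roughly 197k and 1.77M. Patches from the same image are strongly correlated, so the effective sample size is smaller than the raw patch count suggests.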
5c · Does full backbone fine-tuning help? We took the same ViT-B 150k checkpoint and fine-tuned the entire backbone end-to-end on the 208 basketball training images at 448×448 — a setup that, on paper, has every degree of freedom the model needs. After 20 epochs of two-learning-rate AdamW with cosine decay, the best test IoU was 0.452. Worse than the frozen linear probe at the same resolution (0.473).
The bottleneck isn’t probe capacity, isn’t resolution, and isn’t representational rigidity in the backbone. It’s the data. 208 supervised images is too few to fine-tune 86 M parameters into a better-than-linear-probe operating point, and probably too few to learn whatever DINOv3 brings to basketball that our soccer features don’t.
Combined, this is the most decisive evidence we have that the cross-domain dense-semantic gap is structural at our pretraining-data scale. Closing it likely requires more visual diversity in pretraining (multi-sport, multi-viewpoint, multi-domain) rather than more parameters, more resolution, or more supervised training on the same 208 basketball images.
06 · Class-imbalanced BCE has a quiet trap
While running the fine-tune experiments we hit a calibration trap worth flagging.
Per-patch player segmentation is heavily class-imbalanced — only about 4.6% of 16×16 patches contain a player. The textbook fix for class-imbalanced binary classification is to pass pos_weight = neg/pos ≈ 20.9 to BCEWithLogitsLoss so the model is incentivized to predict positives.
That fix works for training but breaks evaluation: with pos_weight = 20.9, the model’s logits shift far positive, and the default decision threshold of sigmoid > 0.5 over-predicts positives. The IoU score (which penalizes false positives just as much as false negatives) collapses.
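The size of the shift can be worked out for an idealized model that minimizes the weighted loss exactly. If the true positive probability of a patch is p, the expected loss −w·p·log(q) − (1−p)·log(1−q) is minimized at q = w·p / (w·p + 1 − p), which is the calibrated logit plus log(w). The sketch below is an idealization (a real network only approximates this optimum) and the function names are illustrative.

```python
import math

def weighted_bce_optimum(p: float, pos_weight: float) -> float:
    """Output q minimizing -w*p*log(q) - (1-p)*log(1-q) for true positive rate p."""
    return pos_weight * p / (pos_weight * p + (1.0 - p))

def equivalent_threshold(pos_weight: float, cutoff: float = 0.5) -> float:
    """True probability at which the weighted model's output crosses `cutoff`."""
    odds = cutoff / (1.0 - cutoff)
    return odds / (odds + pos_weight)

POS_WEIGHT = 20.9  # neg/pos ratio quoted in the post

print(f"logit shift: +{math.log(POS_WEIGHT):.2f}")
print(f"sigmoid > 0.5 fires when true p > {equivalent_threshold(POS_WEIGHT):.3f}")
print(f"a patch with true p = 0.10 is scored {weighted_bce_optimum(0.10, POS_WEIGHT):.2f}")
```

In this idealized picture, thresholding at 0.5 is equivalent to calling a patch positive whenever its true probability exceeds about 4.6%, roughly the base rate, which is why false positives pile up.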
Two cleaner alternatives:
- pos_weight = 1.0 + tune the decision threshold on val. Let the model’s linear head learn the imbalance through its bias term. The first ~7 epochs predict all-zeros (loss decreases without IoU moving), but once the bias acquires the right magnitude, the resulting predictions are well-calibrated. We sweep threshold over 0.10–0.70 in 0.05 increments on validation and apply the best to test.
- pos_weight = neg/pos + tune the decision threshold on val. Still beats fixed 0.5 by a lot, but doesn’t fully recover the recipe penalty.
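The threshold sweep itself is a few lines. A minimal sketch, assuming per-patch sigmoid outputs and binary labels flattened into plain lists; the toy probabilities and labels are invented for illustration and are not our data.

```python
def iou(probs, labels, threshold):
    """Binary IoU over patches: TP / (TP + FP + FN)."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = p > threshold
        if pred and y:
            tp += 1
        elif pred:
            fp += 1
        elif y:
            fn += 1
    union = tp + fp + fn
    return tp / union if union else 0.0

# 0.10, 0.15, ..., 0.70 (rounded to avoid float drift in the grid values).
THRESHOLDS = [round(0.10 + 0.05 * i, 2) for i in range(13)]

def best_threshold(val_probs, val_labels):
    """Pick the threshold with the highest validation IoU."""
    return max(THRESHOLDS, key=lambda t: iou(val_probs, val_labels, t))

# Toy validation and test splits (hypothetical values).
val_probs = [0.05, 0.12, 0.30, 0.33, 0.45, 0.08, 0.62, 0.22]
val_labels = [0, 0, 1, 1, 1, 0, 1, 0]
test_probs = [0.28, 0.07, 0.51, 0.24, 0.40]
test_labels = [1, 0, 1, 0, 1]

chosen = best_threshold(val_probs, val_labels)
print("chosen threshold:", chosen)
print("test IoU at 0.5:", iou(test_probs, test_labels, 0.5))
print("test IoU at chosen:", iou(test_probs, test_labels, chosen))
```

The threshold is selected on validation only and then applied once to test, so the reported test IoU is not tuned on the test set.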
On our basketball fine-tune the numbers were:
| Recipe | Test IoU |
|---|---|
| pos_weight = 20.9, threshold 0.5 | 0.301 |
| pos_weight = 20.9, val-tuned threshold | 0.375 |
| pos_weight = 1.0, threshold 0.5 | 0.419 |
| pos_weight = 1.0, val-tuned threshold | 0.452 |
Threshold tuning is worth +0.03 to +0.07 IoU on top of either pos_weight choice; the recipe choice itself is worth +0.12 IoU before tuning and +0.08 IoU after. Both knobs together are the difference between “fine-tuning hurt the model” and “fine-tuning was at least neutral.”
Lesson: For per-patch binary classification with heavy class imbalance and a small training set, default to pos_weight = 1.0 and always sweep the decision threshold on val before reporting numbers. The “automatic” class-rebalanced loss can quietly cost you 10+ IoU points if you forget the threshold is now miscalibrated.
07 · The picture after this round
| Setting | Frozen DINOv3 | Our champion (ViT-B mixed-res 150k) | DINOv3 edge |
|---|---|---|---|
| In-domain · soccer action recognition | 0.282 | 0.271 | ~4% relative |
| Cross-domain instance retrieval · basketball re-ID | 0.529 | 0.618 | −17% relative (we win) |
| Cross-domain dense semantics · basketball segmentation | 0.671 | 0.439 | ~35% relative |
The first writeup told a two-way story (in-domain vs cross-domain). This round splits the cross-domain side into two sharply different stories: instance retrieval (we win clearly) and dense semantics (we lose clearly). The in-domain gap also shrank — from ~11% to ~4% — and is now within statistical noise of DINOv3.
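The "DINOv3 edge" column is a relative difference. The sketch below assumes the edge is measured relative to DINOv3's own score, which is the convention that reproduces the table's figures.

```python
# (frozen DINOv3, ViT-B mixed-res 150k) scores from the table above.
RESULTS = {
    "soccer action recognition": (0.282, 0.271),
    "basketball re-ID": (0.529, 0.618),
    "basketball segmentation": (0.671, 0.439),
}

def dinov3_edge(dino: float, ours: float) -> float:
    """DINOv3's lead as a fraction of DINOv3's score; negative means we win."""
    return (dino - ours) / dino

for task, (dino, ours) in RESULTS.items():
    print(f"{task:>27}: {dinov3_edge(dino, ours):+.1%}")
```

This gives about +3.9%, −16.8%, and +34.6%, matching the rounded ~4%, −17%, and ~35% in the table.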
08 · What we’d run with more time
- Multi-sport pretraining corpus. Still the highest-expected-value next experiment, and now better supported. Three independent follow-ups on the dense cross-domain gap (resolution, probe capacity, fine-tuning) all failed to close it, which is consistent with visual-diversity-of-pretraining being the binding constraint.
- Multi-seed confirmation of the champion. Our champion is single-seed. Basketball segmentation IoU has ±0.06 seed variance on 84 test frames, so the ViT-B 150k vs ViT-S 100k margin (+0.025 IoU) isn’t decisively outside seed noise on that one task. Soccer and re-ID are robust as single-seed.
- Path A × Path B composition. Our two strongest techniques (LoRA-on-DINOv3 from the first writeup, mixed-resolution multi-crop from this one) were always run separately. The composition is untested.
- Audio and video. Soccer broadcasts are temporal media with strong audio cues (whistles, commentary, crowd). The natural next axis once the visual story is locked.
Builds on open work by the DINOv3 team at Meta AI, the SoccerNet consortium, the DeepSportRadar team, the SynergyReID benchmark, and the authors of LeJEPA. Thanks to all of them.
Ugg back with more brain experiments! Last time Ugg only had little brain. Now Ugg get more rocks for thinking, try bigger stuff.
Three things to test:
Bigger brain. Ugg make brain 4x bigger. First try: bigger brain LOSE to small brain! Ugg confused. Then Ugg remember — bigger brain need more practice. Like bigger fire need more wood. Ugg give bigger brain 3x more practice. Now bigger brain win at all jobs! Lesson: when big thing seem worse than small thing, check if big thing got enough food.
Surprise new test! Ugg add new game: show brain two basketball player pictures, ask “same person, yes or no?” Big smart general brain do okay. But Ugg’s soccer-only brain — BEAT general brain by lots! Why? Soccer game full of humans running around. Brain spend 300 hours staring at people. Brain become face expert! Big general brain saw many things (rocks, trees, dogs), but not so many faces. For “is this same person” job, soccer-face-expert win. First time Ugg brain beat the big famous brain at anything!
The basketball mask test. Ugg’s brain still bad at this one. Big general brain still win. Ugg try three fixes:
- Show brain bigger pictures → small help, not enough
- Use fancier readout part → actually HURT! Too fancy for tiny test
- Retrain whole brain on basketball pictures → still bad. Only 208 basketball pictures. Not enough.
Ugg now sure: this gap not fixable with more rocks, fancier tricks, or more practice on same few basketball pictures. Brain need to see MANY different sports during practice, not just soccer. Variety is the answer. Same lesson as before, but now Ugg really really sure.
Bonus trap Ugg fell in. Most picture parts have no player (only 4 in 100). Ugg used special “fix unbalanced” trick. Trick worked for learning but secretly broke the score-counting! Brain got blamed for trick’s mistake. Ugg fix trick, brain score jump way up. Sometimes brain not bad — just the measuring stick was bent.
New scorecard:
- Soccer stuff → basically tie (was lose, now even)
- Match-the-player → Ugg win big! (+17%)
- Find-the-player-mask → still lose
Story now has three parts, not two. “Same domain vs different domain” too simple. The real question: what KIND of different. Some different jobs the specialist wins. Some the generalist wins. Depend on what job want.