WEBVTT

00:00:00.160 --> 00:00:04.080
We present AdjustAR, a system for 
runtime AI-driven adjustment of

00:00:04.080 --> 00:00:06.800
site-specific augmented reality content.

00:00:06.800 --> 00:00:10.640
Model-based authoring is the most widely 
used approach for creating site-specific

00:00:10.640 --> 00:00:16.800
AR experiences. For example, Unity with the 
Niantic SDK and systems like DistanciAR let

00:00:16.800 --> 00:00:22.000
authors anchor virtual content to pre-captured 
3D scans, which can later be used to localize

00:00:22.000 --> 00:00:26.560
at the target site. These models support 
remote authoring by providing environmental

00:00:26.560 --> 00:00:33.120
context without requiring physical presence, but 
this assumes the models remain accurate at deployment.

00:00:33.120 --> 00:00:36.000
Prior work has identified that 
this is often not the case,

00:00:36.000 --> 00:00:40.000
and that these models are frequently 
outdated or incomplete. This leads

00:00:40.000 --> 00:00:43.600
to misalignments between virtual 
content and physical referents,

00:00:43.600 --> 00:00:49.040
obscuring author intent and degrading user 
experience. CoCreatAR addressed this by enabling

00:00:49.040 --> 00:00:55.200
live collaboration between a remote author and an 
on-site user to refine and extend AR experiences.

00:00:55.200 --> 00:00:59.200
A different body of work, including 
systems like SemanticAdapt, AUIT,

00:00:59.200 --> 00:01:03.440
and ScalAR, aims to reduce dependence on 
these global models. These systems use semantic

00:01:03.440 --> 00:01:07.520
associations and layout constraints 
to improve anchoring flexibility and

00:01:07.520 --> 00:01:11.680
highlight the potential of strategies 
that go beyond surface geometry.

00:01:11.680 --> 00:01:16.480
More recent work has explored multimodal large 
language models to enable anyone to create AR

00:01:16.480 --> 00:01:21.840
content. For instance, ImaginateAR demonstrated 
outdoor in-situ authoring from natural-language

00:01:21.840 --> 00:01:26.000
instructions, combining large language models 
with advanced scene understanding. However,

00:01:26.000 --> 00:01:32.240
these systems focus on creation rather than 
runtime adaptation of existing experiences.

00:01:32.240 --> 00:01:37.840
Building on these ideas, we ask: how can we 
create a site-specific AR system that 1) takes

00:01:37.840 --> 00:01:42.960
advantage of existing model-based authoring 
approaches, and 2) helps ensure AR content

00:01:42.960 --> 00:01:49.280
appears as intended despite changes in the target 
environment, without requiring re-authoring?

00:01:49.280 --> 00:01:54.000
The typical workflow begins as follows. An 
author, working remotely with a computer,

00:01:54.000 --> 00:01:59.520
uses 3D editing tools to place content into 
a pre-captured 3D model of the target site,

00:01:59.520 --> 00:02:02.880
either downloaded from the 
internet or created themselves.

00:02:02.880 --> 00:02:07.840
In this example, the author plans to create 
an AR story experience in a park and places

00:02:07.840 --> 00:02:13.440
a virtual sign along a path and positions 
a character, "Mr. Raccoon," on a trash can.

00:02:13.440 --> 00:02:17.520
The author assumes, or at least hopes, 
that these referents will be present at

00:02:17.520 --> 00:02:22.480
deployment and uses the 3D model as 
a stand-in for the real environment.

00:02:22.480 --> 00:02:27.760
Later, when a user tries the experience 
on-site, inconsistencies may appear. For

00:02:27.760 --> 00:02:32.320
example, the trash can has been moved, 
and the raccoon now floats in mid-air.

00:02:32.320 --> 00:02:37.840
To address this, AdjustAR enables runtime 
correction, triggered manually or automatically.

00:02:37.840 --> 00:02:42.240
When activated, it captures the current 
camera view and outlines each AR element

00:02:42.240 --> 00:02:48.560
with a unique color. It also caches the depth 
map along with camera intrinsics and extrinsics.
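
As a sketch of what this capture step might cache, here is a minimal illustration in Python; the names (`CorrectionSnapshot`, `assign_outline_colors`) and the fixed color palette are assumptions for illustration, not AdjustAR's actual implementation:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class CorrectionSnapshot:
    """Everything cached when a runtime correction is triggered."""
    rgb: np.ndarray          # H x W x 3 camera frame with element outlines drawn
    depth: np.ndarray        # H x W depth map in meters
    intrinsics: np.ndarray   # 3 x 3 camera matrix K
    extrinsics: np.ndarray   # 4 x 4 camera-to-world pose
    outline_colors: dict = field(default_factory=dict)  # element id -> RGB

def assign_outline_colors(element_ids):
    """Give each AR element a unique, visually distinct outline color."""
    palette = [(255, 0, 0), (0, 255, 0), (0, 0, 255),
               (255, 255, 0), (255, 0, 255), (0, 255, 255)]
    return {eid: palette[i % len(palette)] for i, eid in enumerate(element_ids)}
```

Reusing the same color key when rendering the reference image lets the model match elements across the two views.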

00:02:48.560 --> 00:02:52.640
In parallel, the system renders a reference 
image from the same perspective using the

00:02:52.640 --> 00:02:57.440
original authored model, with matching 
object outlines. This pair of images

00:02:57.440 --> 00:03:02.320
represents a comparison between authored 
intent and the current scene as seen by the

00:03:02.320 --> 00:03:07.040
user. Both are sent to a multimodal 
large language model. The model is

00:03:07.040 --> 00:03:12.720
prompted to assess whether each AR element 
appears aligned with its intended referent.

00:03:12.720 --> 00:03:15.280
If misalignment is detected, the model is asked

00:03:15.280 --> 00:03:21.040
to provide a corrected 2D anchor 
point based on the provided images.
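
One way to assemble that two-image query is sketched below, using the OpenAI-style chat message layout as an example target format; the function name, instruction wording, and color key are illustrative assumptions rather than AdjustAR's published prompt:

```python
import base64

def build_alignment_prompt(current_png, reference_png, element_colors):
    """Build a two-image multimodal prompt: the user's outlined camera
    view plus the matching render from the authored model."""
    def as_data_url(png_bytes):
        return "data:image/png;base64," + base64.b64encode(png_bytes).decode()

    color_key = ", ".join(f"{eid}: {rgb}" for eid, rgb in element_colors.items())
    instructions = (
        "Image 1 is the user's current camera view; image 2 is rendered from "
        "the authored 3D model at the same pose. AR elements are outlined in "
        f"unique colors ({color_key}). For each element, say whether it is "
        "aligned with its intended physical referent. If it is misaligned, "
        "return a corrected 2D anchor point in image-1 pixel coordinates."
    )
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": instructions},
            {"type": "image_url", "image_url": {"url": as_data_url(current_png)}},
            {"type": "image_url", "image_url": {"url": as_data_url(reference_png)}},
        ],
    }]
```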

00:03:21.040 --> 00:03:25.360
These corrections are projected 
into 3D using the cached depth map,

00:03:25.360 --> 00:03:29.120
and the AR elements are updated accordingly.
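
That projection step can be sketched as a standard pinhole back-projection; the function name and the camera-to-world pose convention are assumptions for illustration:

```python
import numpy as np

def unproject(u, v, depth_map, K, cam_to_world):
    """Lift a corrected 2D anchor point (u, v) into a 3D world point
    using the cached depth map, intrinsics K (3x3), and camera-to-world
    extrinsics (4x4)."""
    z = depth_map[v, u]                     # metric depth at that pixel
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pinhole model: pixel -> camera-space point at depth z.
    p_cam = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z, 1.0])
    # Homogeneous transform into world coordinates.
    return (cam_to_world @ p_cam)[:3]
```

The AR element's anchor is then moved to the returned world point.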

00:03:29.120 --> 00:03:31.360
The adjusted content is then shown in the

00:03:31.360 --> 00:03:33.040
user’s AR view. In this way, AdjustAR restores 
semantic and spatial alignment with the author’s 
intent at runtime, without manual re-authoring.

00:03:33.040 --> 00:03:38.000
Here we show several other examples, 
from top-left to bottom-right:

00:03:38.800 --> 00:03:44.400
Objects already aligned with the environment
A referent that has moved

00:03:44.400 --> 00:03:57.760
A missing referent
A mix of correctly and incorrectly placed elements

00:03:57.760 --> 00:04:02.480
Looking ahead, we will conduct formal 
evaluations to assess robustness and usability,

00:04:02.480 --> 00:04:06.400
while also focusing on reducing 
latency and improving accuracy.

00:04:06.400 --> 00:04:11.280
We also aim to extend spatial reasoning to 
address occlusion and off-frame referents.

00:04:11.280 --> 00:04:15.520
Finally, we plan to support more flexible 
anchoring strategies beyond bottom-center

00:04:15.520 --> 00:04:19.440
placement, and to incorporate 
author-defined semantic constraints.