That’s kinda the point of image tracking, though - you want the virtual stuff to line up with the physical stuff. We’ve made a poster that comes alive in AR - the printed design becomes 3D and animates - and image tracking is how the virtual content gets lined up with the physical printed image. We need all our virtual content to be shown relative to the physical print.
Apart from more recent phones, which may have LiDAR and other gadgetry, most AR tracking at the moment relies on the camera image - a moving picture - to connect to the real world. Although we humans can make educated guesses at the physical size of things in a video, the phone has no idea. It can spot an image it recognises, and it can track how that image moves in the frame and “stick” stuff to it, but it can’t tell from the camera alone how near or far away the image is, so it has no way to judge how large it is.
If you don’t want your AR to be tied to an image, you can use instant tracking instead, where virtual objects are superimposed so they look like part of the real world, standing on the floor or attached to a wall. But again, the phone can’t infer scale from the video feed, so there’s no way to tell it “this must look like it’s 1 metre tall”. The phone can do a good job (thanks Zap) of “locking” your 3D scene to a surface in the incoming camera feed, but it has no idea whether the user is pointing at a floor a metre away, a desk a foot away, or something football-pitch sized from high up in stadium seating.
There isn’t a global scale control, but you can do it yourself with a Group node, which ends up being more flexible.
When you import your tracking image, it’ll appear 1 unit wide or 1 unit tall (whichever is the longer side) - you’ll see it in the viewport. As you add 3D things (even flat planes), you can position / rotate / scale them freely in the viewport, or you can type specific values into the properties panel.
If you have a whole bunch of stuff that you want to scale in one go, create a Group node first and put all the objects in that: then you can scale the Group node itself to make everything it contains bigger or smaller together. Within that group you can treat all the contents’ positions as inches, metres, nautical miles - whatever you like, as long as you’re consistent. To scale the whole scene down to the sort of size you want it to appear on people’s phones (relative to the target image), you only have to scale that one “parent” group. You can carry on adding things to the group and moving them around, and their coordinates will still be in whatever units you decided to call them.
Hope this helps. It can seem a bit abstract at first, but it’ll start to make sense once you’ve played with it.