Halfbakery: Better AI 3-D Scene Rendering

You've probably seen the AI software which tries to generate images from text-based user descriptions. It doesn't do very well, but it will manage to produce blobs of tabby fur if you type "cat" and various Caucasian Cronenbergesque abominations if you type "woman", "person", "man", "child" and so forth. It also does birthday cakes, benches, fridges and trees just about recognisably. It uses an AI algorithm called a Generative Adversarial Network (GAN), which sets two neural nets to compete with each other: a generator, which tries to fool the other into accepting its images as real, and a discriminator, which tries to detect the fakes.
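To make the contest between the two nets a bit more concrete, here is a minimal sketch of the losses each side tries to minimise in the standard GAN setup. The function names are my own, the scores are assumed to be the detector's probability (strictly between 0 and 1) that an image is real, and the generator loss shown is the common "non-saturating" variant. I don't know which variant the software in question actually uses.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Loss for the detector: small when it scores a real image near 1
    and a generated image near 0. Both scores must lie strictly in (0, 1)."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating loss for the faker: small when the detector is
    fooled into scoring a generated image near 1."""
    return -math.log(d_fake)
```

When the detector can do no better than guess (both scores 0.5), its loss is 2·ln 2, which is the stalemate the two nets are pushed towards in theory.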

There are clearly a number of problems with this program which need to be addressed, but one apparent issue is that it doesn't seem to understand that the images it's been trained on represent three-dimensional scenes rather than two-dimensional pixel patterns. I want to suggest a way of resolving this.

There are many photographs of well-known objects, scenes and people whose image files are labelled appropriately as "St Peter's Church in Rome", "Desmond Tutu", "double bed", and so forth. Some of these were taken from viewpoints which can be identified from GPS coordinates, or which are recognisably similar to one another. With some resizing, the ones from which parallax could be reconstructed would be theoretically convertible to three-dimensional scenes of the items concerned, particularly where images have been taken from all round an object, such as a statue or objet d'art in a gallery. If a generative adversarial network is trained on a large number of these images, it should be able to extract approximate three-dimensional renderings of the real equivalents, which could then be labelled appropriately. That's stage one.
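The parallax step rests on ordinary triangulation. As a rough sketch, assume the idealised case of two identical pinhole cameras side by side and pointing the same way, which real tourist photos certainly are not. A point's depth then follows from how far it shifts between the two images. The function and parameter names here are made up for illustration.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres of a point seen by two parallel pinhole cameras.

    focal_px     -- focal length expressed in pixels (assumed known)
    baseline_m   -- distance between the two camera positions in metres
    disparity_px -- horizontal shift of the point between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive to triangulate a depth")
    return focal_px * baseline_m / disparity_px
```

For example, with a focal length of 1000 pixels and camera positions half a metre apart, a feature which shifts by 25 pixels works out at 20 metres away. Nearer objects shift more and distant ones less, which is why photos taken from only slightly different spots say little about far-off buildings.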

Stage two is to take this three-dimensional data, rather than the two-dimensional image data, and train another GAN to recognise other scenes which are only available viewed from a single angle, thereby constructing scenes of people standing around at birthday parties or moodily-lit toothpaste tubes or whatever.

Stage three is to pool all of these data and train yet another GAN to recognise the objects according to their descriptions, as labelled in the three-dimensional reconstructions.
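The bookkeeping behind the three stages could be sketched roughly as follows: sort the labelled photos into subjects seen from enough distinct viewpoints to attempt a reconstruction (stage one) and subjects seen from too few (stage two), then pool the labels (stage three). The Photo record, the threshold of three viewpoints and the use of rounded GPS coordinates as a stand-in for "viewpoint" are all assumptions of mine, not anything an existing system does.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Photo:
    label: str   # e.g. "double bed"
    lat: float   # hypothetical GPS latitude of the camera
    lon: float   # hypothetical GPS longitude of the camera

def partition_by_viewpoints(photos, min_views=3):
    """Split labels into (multi_view, single_view) lists.

    A label counts as multi-view when its photos come from at least
    min_views distinct camera positions (coordinates rounded to 5 places).
    """
    viewpoints = defaultdict(set)
    for photo in photos:
        viewpoints[photo.label].add((round(photo.lat, 5), round(photo.lon, 5)))
    multi = sorted(l for l, v in viewpoints.items() if len(v) >= min_views)
    single = sorted(l for l, v in viewpoints.items() if len(v) < min_views)
    return multi, single

photos = [
    Photo("statue", 41.9022, 12.4539),
    Photo("statue", 41.9023, 12.4541),
    Photo("statue", 41.9021, 12.4543),
    Photo("toothpaste tube", 51.5007, -0.1246),
]
stage_one, stage_two = partition_by_viewpoints(photos)
stage_three = stage_one + stage_two  # pooled labels for the final GAN
```

Here the statue lands in the stage one pile and the toothpaste tube in stage two. In practice, deciding which photos really show the same object would be much harder than matching label strings.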

The end result could be a better, more accurate AI application which produces more realistic scenes from the user's descriptions, and whose scenes can also be explored in VR.