Better AI 3-D Scene Rendering   (+4)
Make voxels from real scenes and run them through a neural net.

You've probably seen the AI software which tries to generate images from text-based user descriptions. It doesn't do very well, but it will manage to produce blobs of tabby fur if you type "cat" and various Caucasian Cronenbergesque abominations if you type "woman", "person", "man", "child" and so forth. It also does birthday cakes, benches, fridges and trees just about recognisably. It uses an AI algorithm called a Generative Adversarial Network (GAN), which sets two neural nets to compete with each other: a generator trying to fool the other into accepting its images as real, and a discriminator trying to detect the fakes.
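
The two-net competition can be written down as a pair of losses. This is only a sketch of the usual binary cross-entropy formulation (with the common "non-saturating" generator loss), not the code of any particular image generator; the function names and the use of single probabilities instead of batches are assumptions made for illustration.

```python
import math


def bce(prediction: float, target: float) -> float:
    """Binary cross-entropy for a single probability."""
    eps = 1e-12  # guard against log(0)
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Loss for the net that detects fakes.

    d_real: its estimated probability that a real image is real.
    d_fake: its estimated probability that a generated image is real.
    It wants d_real near 1 and d_fake near 0.
    """
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def generator_loss(d_fake: float) -> float:
    """Loss for the net that makes fakes: it wants d_fake near 1."""
    return bce(d_fake, 1.0)


if __name__ == "__main__":
    # A confident, correct detector has a low loss; the forger's is high.
    print(discriminator_loss(0.9, 0.1), generator_loss(0.1))
```

Training alternates between lowering each loss in turn, so an improvement in one net raises the bar for the other.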

There are clearly a number of problems which need to be addressed with this program, but one apparent issue is that it doesn't seem to understand that the images it's been trained on represent three-dimensional scenes rather than two-dimensional pixel patterns. I want to suggest a way of resolving this.

There are many examples of photographs of well-known objects, scenes and people whose image files are labelled appropriately as "St Peter's Church in Rome", "Desmond Tutu", "double bed", and so forth. Some of these have been taken from angles identifiable via GPS coordinates or as recognisably similar. With some resizing, the ones from which parallax could be reconstructed would be theoretically convertible to three-dimensional scenes of the items concerned, particularly if images have been taken from all round an object such as statues or objets d'art in galleries. If a generative adversarial network is trained on a large number of these images, it will be able to extract approximate renderings of the real equivalents which could be appropriately labelled. That's stage one.
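
As a rough illustration of how parallax gives depth, and depth gives voxels, here is a sketch for the simplest possible case: two photos taken by identical, parallel cameras a known distance apart. Real photo collections have unknown camera positions and would need a full structure-from-motion step first, which is not shown. All function and parameter names are hypothetical.

```python
import math


def depth_from_disparity(focal_px: float, baseline_m: float,
                         disparity_px: float) -> float:
    """Depth of a point seen in two side-by-side photos.

    Uses the standard rectified-stereo relation Z = f * B / d, where d is
    how far the point shifts between the two images (the parallax).
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def back_project(u: float, v: float, depth_m: float, focal_px: float,
                 cx: float, cy: float) -> tuple:
    """Turn pixel (u, v) plus depth into a 3-D point (pinhole camera).

    (cx, cy) is the assumed image centre in pixels.
    """
    x = (u - cx) * depth_m / focal_px
    y = (v - cy) * depth_m / focal_px
    return (x, y, depth_m)


def voxel_index(point: tuple, voxel_size_m: float) -> tuple:
    """Which cube of side voxel_size_m the point falls into."""
    return tuple(math.floor(c / voxel_size_m) for c in point)


if __name__ == "__main__":
    z = depth_from_disparity(focal_px=1000, baseline_m=0.5, disparity_px=25)
    p = back_project(700, 500, z, focal_px=1000, cx=500, cy=500)
    print(z, p, voxel_index(p, 0.5))
```

Repeating this for every matched pixel, and marking each voxel that receives a point as occupied, would give the kind of labelled voxel scene that stage one is after.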

Stage two is to take this data rather than the two-dimensional image data and train another GAN to recognise other scenes which are only available viewed from a single angle, thereby constructing scenes of people standing around at birthday parties or moodily-lit toothpaste tubes or whatever.

Stage three is to pool all of these data and train yet another GAN to recognise the objects according to their descriptions as labelled in the three-dimensional reconstruction.

The end result could be a better, more accurate AI application which produces more realistic scenes from the user's descriptions, and whose scenes can also be explored using VR.
-- nineteenthly, Apr 28 2020

Twitter: #3DPhotoInpainting https://twitter.com...g?src=hashtag_click
[zen_tom, Apr 28 2020]

GitHub: 3d-photo-inpainting https://github.com/...3d-photo-inpainting
Here's the code! [zen_tom, Apr 28 2020]

Paper: 3D Photography using Context-aware Layered Depth Inpainting https://drive.googl...BFWvw-ztF4CCPP/view
Uses an algorithm referenced from: "EdgeConnect: Generative Image Inpainting with Adversarial Edge Learning", Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z. Qureshi, Mehran Ebrahimi [zen_tom, Apr 28 2020]

Paper: EdgeConnect: Generative Image Inpainting with Adversarial Edge Learning https://arxiv.org/abs/1901.00212
this paper is cited from the 3d-photo-inpainting paper as containing the algorithm used - and in the abstract of this one, there's "....a two-stage adversarial model EdgeConnect that comprises of an edge generator followed by an image completion network" ... so this is a GAN of some kind or another. [zen_tom, Apr 28 2020]

ClipText ClipText
"A text that visualizes itself." -- are you trying to realize this idea? :) May also add the time, as extra dimension. 4D Voxels, so that it can convert stories into movies. [Mindey, May 02 2020]

Great, so now we get not only deepfake people videos, but entire deepfake scenes of what happened in the news at x location... I see trouble ahead...
-- RayfordSteele, Apr 28 2020


Yes, but at least it's a nice, clear 3-D VR view of the trouble ... so you can appreciate it properly.

Once you move from 2-D to true 3-D, the data volumes involved increase enormously (n² pixels become n³ voxels at the same resolution).

However, stereoscopic/binocular "views" from selected angles would keep the data set to a manageable size while substantially improving the information available to the GAN.

Hmmm.
-- 8th of 7, Apr 28 2020
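
A back-of-envelope count of the data volumes in the comment above, assuming a square image n pixels on a side, a dense cubic voxel grid at the same resolution, and stereo pairs of two such images. The function names and the choice of a dense (uncompressed) grid are assumptions for illustration.

```python
def image_samples(n: int) -> int:
    """Pixels in one n x n image."""
    return n * n


def voxel_samples(n: int) -> int:
    """Cells in a dense n x n x n voxel grid."""
    return n ** 3


def stereo_samples(n: int, viewpoints: int = 1) -> int:
    """Pixels in a left/right image pair at each chosen viewpoint."""
    return 2 * image_samples(n) * viewpoints


if __name__ == "__main__":
    n = 512
    print(image_samples(n))      # 262144
    print(voxel_samples(n))      # 134217728
    print(stereo_samples(n, 8))  # 4194304
```

On these assumptions the dense grid holds n times as many samples as a single image, while even eight stereo viewpoints cost only sixteen images' worth.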


Okay, so some compression needed. To some extent it amounts to a series of linked surfaces with contours.
-- nineteenthly, Apr 28 2020


Averaging will certainly help; the image can be reduced to a set of planes and regions by reducing the colour depth and contrast, for example, yet would still be perfectly recognizable to a human observer. Tom and Jerry is clearly a cat chasing a mouse to the audience...

So a "cartoonized" view will greatly simplify the task.
-- 8th of 7, Apr 28 2020
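
One simple way to reduce colour depth, as suggested in the comment above, is to keep only the top few bits of each 8-bit colour channel (posterizing). This sketch assumes an image stored as rows of (r, g, b) tuples with values from 0 to 255; it does not attempt the averaging into planes and regions, and the function names are hypothetical.

```python
def posterize_channel(value: int, bits: int) -> int:
    """Keep only the top `bits` bits of an 8-bit channel value."""
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    shift = 8 - bits
    return (value >> shift) << shift


def posterize(image: list, bits: int) -> list:
    """Posterize every channel of every (r, g, b) pixel in the image."""
    return [
        [tuple(posterize_channel(c, bits) for c in pixel) for pixel in row]
        for row in image
    ]


if __name__ == "__main__":
    # Two slightly different oranges collapse into one flat colour.
    row = [(200, 70, 10), (210, 80, 20)]
    print(posterize([row], bits=2))
```

With 2 bits per channel there are at most 64 distinct colours, so nearby shades merge into the flat regions a cartoon would use.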


Hi [nineteenthly] - hope you're keeping well - have you seen these GAN-based image transformations? [link] I think they may be generated in a similar manner to the one you describe as part of your process - whatever the case, they produce some quite stunning effects. I think out there somewhere is a download of the code that lets you run these kinds of things yourself. I've not got as far as doing that, but it might be fun to try.

I think these take an existing picture as input and create a 3d representation of it (which is then pan/zoomed as an animation in the final effect) but there'd be nothing stopping one from feeding in an image that itself had first been generated from a text-based input.

This should provide the first couple of steps mentioned in your idea.

[edit] - Aha, found the code! [link]
-- zen_tom, Apr 28 2020


Thanks, that's interesting but it might be a while before I'm in a position to poke around in it myself.

We're alive and well thanks. Hope you are too.
-- nineteenthly, Apr 29 2020


