h a l f b a k e r y
Is it soup yet?
You've probably seen the AI software which tries to
generate images from text-based user descriptions. It
doesn't do very well but it will manage to produce blobs
of tabby fur if you type "cat" and various Caucasian
Cronenbergesque abominations if you type "woman",
"person", "man", "child"
and so forth. It also does
birthday cakes, benches, fridges and trees just about
recognisably. It uses an AI algorithm called a Generative
Adversarial Network (GAN), which sets two neural nets
to compete with each other, one trying to fool the other
into accepting a generated image as real and the other
trying to detect the fakes.
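A rough sketch of that contest in Python, assuming the usual binary cross-entropy form of the GAN objective rather than anything known about the particular software described here. The function names are made up for illustration; d_real and d_fake stand for the detector's estimated probability that a real image, or a generated one, is real.

```python
import math


def bce(prediction, target):
    """Binary cross-entropy for a single probability, clamped away from 0 and 1."""
    eps = 1e-12
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(d_real, d_fake):
    """The detecting net wants real images scored near 1 and generated ones near 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def generator_loss(d_fake):
    """The forging net wants its images scored near 1 (the 'non-saturating' form)."""
    return bce(d_fake, 1.0)


if __name__ == "__main__":
    # A detector that is being fooled (fake scored 0.9) has a high loss,
    # while the forger that fooled it has a low one.
    print(discriminator_loss(d_real=0.9, d_fake=0.9))
    print(generator_loss(d_fake=0.9))
```

Training alternates between nudging each net to lower its own loss, so an improvement on one side makes the task harder for the other.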
There are clearly a number of problems which need to
be addressed with this program, but one apparent issue
with it is that it doesn't seem to understand that the
images it's been trained on represent three-dimensional
scenes rather than two-dimensional pixel patterns. I
want to suggest a way of resolving this.
There are many examples of photographs of well-known
objects, scenes and people whose image files are
labelled appropriately as "St Peter's Church in Rome",
"Desmond Tutu", "double bed", and so forth. Some of
these have been taken from viewpoints which can be
identified from GPS coordinates, or which are
recognisably close to one another. With some resizing,
the ones from which parallax could be reconstructed
would in theory be convertible to three-dimensional
scenes of the items concerned, particularly where images
have been taken from all round an object, such as
statues or objets d'art in galleries. If a generative
adversarial network is trained on a large number of
these images, it should be able to extract approximate
three-dimensional renderings of the real objects, which
could be appropriately labelled. That's stage one.
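To make the parallax step concrete, here is a minimal sketch of the textbook relation between parallax and distance. It assumes an idealised, rectified pair of photographs taken by identical cameras pointing in the same direction, a known distance apart, which tourist snaps located by GPS would only roughly approximate. The function name and the numbers are hypothetical.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Distance to a point from its shift between two side-by-side photos.

    focal_px     -- focal length expressed in pixels
    baseline_m   -- distance between the two camera positions, in metres
    disparity_px -- how far the point moves between the two images, in pixels
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive; zero means no measurable parallax")
    return focal_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Two photos taken half a metre apart; a statue's nose shifts by 25 pixels.
    print(depth_from_disparity(focal_px=1000, baseline_m=0.5, disparity_px=25))  # 20.0
```

Nearby objects shift a lot and distant ones hardly at all, which is why images taken from all round an object carry far more depth information than several shots from nearly the same spot.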
Stage two is to take this data rather than the two-
dimensional image data and train another GAN to
recognise other scenes which are only available viewed
from a single angle, thereby constructing scenes of
people standing around at birthday parties or moodily-lit
toothpaste tubes or whatever.
Stage three is to pool all of these data and train yet
another GAN to recognise the objects according to their
descriptions as labelled in the three dimensional
The end result could be a better, more accurate AI
application which produces more realistic scenes from
the user's descriptions, scenes which can also be
explored using VR.
[zen_tom, Apr 28 2020]
Here's the code! [zen_tom, Apr 28 2020]
Paper: 3D Photography using Context-aware Layered Depth Inpainting
Uses an algorithm from EdgeConnect: Generative Image Inpainting with Adversarial Edge Learning, by Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z. Qureshi, Mehran Ebrahimi [zen_tom, Apr 28 2020]
Paper: EdgeConnect: Generative Image Inpainting with Adversarial Edge Learning
This paper is cited by the 3d-photo-inpainting paper as containing the algorithm used, and its abstract mentions "....a two-stage adversarial model EdgeConnect that comprises of an edge generator followed by an image completion network", so this is a GAN of some kind or another. [zen_tom, Apr 28 2020]
"A text that visualizes itself." -- are you trying to realize this idea? :) You could also add time as an extra dimension: 4D voxels, so that it can convert stories into movies. [Mindey, May 02 2020]
||Great, so now we get not only deepfake people videos, but
entire deepfake scenes of what happened in the news at x
location... I see trouble ahead...
||Yes, but at least it's a nice, clear 3-D VR view of the trouble ... so you can appreciate it properly.
||Once you move from 2-D to true 3-D, the data volumes involved increase dramatically: an N-by-N image becomes an N-by-N-by-N volume.
||However, stereoscopic/binocular "views" from selected angles would limit the data set to a manageable size, but substantially improve information available to the GAN.
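Some back-of-envelope arithmetic for the comparison above, as a sketch only. It assumes uncompressed data, three bytes of colour per pixel or voxel, and a cube-shaped voxel grid with the same side length as the image; the function names are made up for the purpose.

```python
def image_bytes(side, channels=3):
    """Uncompressed size of one square image."""
    return side * side * channels


def stereo_pair_bytes(side, channels=3):
    """Two views of the same scene: exactly double a single image."""
    return 2 * image_bytes(side, channels)


def voxel_grid_bytes(side, channels=3):
    """A dense cube of coloured voxels with the same side length."""
    return side ** 3 * channels


if __name__ == "__main__":
    side = 1024
    print(image_bytes(side))        # 3145728     (about 3 MB)
    print(stereo_pair_bytes(side))  # 6291456     (about 6 MB)
    print(voxel_grid_bytes(side))   # 3221225472  (about 3 GB)
```

On these assumptions the stereo pair costs twice a single image, while the dense voxel grid costs about a thousand times as much, the factor being the resolution along the added axis.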
||Okay, so some compression needed. To some extent
it amounts to a series of linked surfaces with
||Averaging will certainly help; the image can be reduced to a set of planes and regions by reducing the colour depth and contrast, for example, yet would still be perfectly recognizable to a human observer. Tom and Jerry is clearly a cat chasing a mouse to the audience...
||So a "cartoonized" view will greatly simplify the task.
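Reducing the colour depth can be sketched in a few lines. This is a plain posterising step on 8-bit colour values, offered as one possible reading of "cartoonized", not as what any of the linked software does. The function names are hypothetical and an image is assumed to be a list of rows of (r, g, b) tuples.

```python
def posterize(value, levels):
    """Snap one 8-bit value (0-255) to the nearest of `levels` evenly spaced values."""
    if not 2 <= levels <= 256:
        raise ValueError("levels must be between 2 and 256")
    if not 0 <= value <= 255:
        raise ValueError("value must be between 0 and 255")
    step = 255 / (levels - 1)
    return round(round(value / step) * step)


def posterize_image(pixels, levels):
    """Apply posterize() to every channel of every pixel."""
    return [[tuple(posterize(c, levels) for c in px) for px in row] for row in pixels]


if __name__ == "__main__":
    # With 4 levels per channel the permitted values are 0, 85, 170 and 255.
    print(posterize_image([[(10, 100, 250)]], levels=4))  # [[(0, 85, 255)]]
```

With 4 levels per channel there are only 4 x 4 x 4 = 64 possible colours instead of about 16.7 million, so large areas of the picture collapse into flat regions.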
||Hi [nineteenthly] - hope you're keeping well - have you
seen these GAN-based image transformations? [link] I think
they may be generated in a similar manner to the one you
describe as part of your process - whatever the case, they produce some quite
stunning effects. I think out there somewhere is a link
to the code that lets you run these kinds of things.
I've not got as far as doing that myself, but it might be
||I think these take an existing picture as input and create
a 3d representation of it (which is then pan/zoomed as an
animation in the final effect) but there'd be nothing
stopping one from feeding in an image that itself had first
been generated from a text-based input.
||This should provide the first couple of steps mentioned in your idea.
|| - Aha, found the code! [link]
||Thanks, that's interesting but it might be a while
before I'm in a position to poke around in it myself.
||We're alive and well thanks. Hope you are too.