Listen to Google DeepMind generating human speech, composing music, plotting SkyNet takeover
[url]http://www.bloomberg.com/news/articles/2016-09-09/google-s-ai-brainiacs-achieve-speech-generation-breakthrough[/url]
[url]https://www.engadget.com/2016/09/10/google-deepmind-ai-wavenet-text-to-speech/[/url]
More audio in source:
[url]https://deepmind.com/blog/wavenet-generative-model-raw-audio/[/url]
[quote]Currently, developers use one of two methods to create speech programs. One involves using a large collection of words and speech fragments spoken by a single person, which makes sounds and intonations hard to manipulate. The other forms words electronically, depending on how they're supposed to sound. That makes things easier to tweak, but the results sound much more robotic.
In order to build a speech program that actually sounds human, the team fed the neural network raw audio waveforms recorded from real human speakers. Waveforms are the visual representations of the shapes sounds take -- those squiggly waves that squirm and dance to the beat in some media player displays. As such, WaveNet speaks by forming individual sound waves.[/quote]
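To make the "forming individual sound waves" part concrete: the network predicts audio one sample at a time, with each new sample conditioned on everything it generated before. Here's a rough sketch of that loop (my own illustration, not DeepMind's code; "model" and "generate" are hypothetical stand-ins, and the 256-level output reflects WaveNet quantising audio to 8 bits):
[code]
import numpy as np

# Rough sketch of autoregressive, sample-by-sample audio generation.
# "model" is a hypothetical stand-in: anything that maps the previous
# samples to a probability distribution over the next sample value.
# WaveNet quantises audio to 256 levels (8-bit mu-law), hence the
# 256-way distribution.
def generate(model, num_samples, receptive_field=1024):
    audio = [128]                          # start near silence (mid-scale)
    for _ in range(num_samples):
        context = audio[-receptive_field:]
        probs = model(context)             # length-256 array summing to 1
        audio.append(int(np.random.choice(256, p=probs)))
    return np.array(audio, dtype=np.uint8)

# At 16,000 samples per second, one second of speech means 16,000 full
# passes through the network, which is why generation is currently slow.
[/code]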
[img]https://storage.googleapis.com/deepmind-live-cms.google.com.a.appspot.com/documents/BlogPost-Fig2-Anim-160908-r01.gif[/img]
[quote]The above animation shows how a WaveNet is structured. It is a fully convolutional neural network, where the convolutional layers have various dilation factors that allow its receptive field to grow exponentially with depth and cover thousands of timesteps.[/quote]
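The "receptive field grows exponentially with depth" part is easier to see with numbers. With kernel size 2 and the dilation doubling each layer (1, 2, 4, 8, ... as in the animation), every extra layer roughly doubles how far back the network can see. A quick back-of-the-envelope sketch (my own, not from the paper):
[code]
# Receptive field of stacked dilated causal convolutions with kernel
# size 2 and dilations doubling each layer (1, 2, 4, 8, ...), as in
# the animation above.
def receptive_field(num_layers, kernel_size=2):
    dilations = [2 ** i for i in range(num_layers)]
    # each layer adds (kernel_size - 1) * dilation samples of lookback
    return 1 + sum((kernel_size - 1) * d for d in dilations)

for layers in (4, 8, 12):
    print(layers, "layers ->", receptive_field(layers), "samples")
# 4 layers -> 16, 8 layers -> 256, 12 layers -> 4096 samples,
# i.e. a dozen layers already cover thousands of timesteps
# (~0.25 s of 16 kHz audio).
[/code]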
Parametric TTS: [vid]https://storage.googleapis.com/deepmind-media/pixie/us-english/parametric-1.wav[/vid]
vs. Google DeepMind: [vid]https://storage.googleapis.com/deepmind-media/pixie/us-english/wavenet-1.wav[/vid]
That's what happens when it reads a given sentence. When it generates speech with no guidance at all, it makes up what to say (à la DeepDream), and it's fucking horrifying:
[vid]https://storage.googleapis.com/deepmind-media/pixie/knowing-what-to-say/first-list/speaker-1.wav[/vid]
[vid]https://storage.googleapis.com/deepmind-media/pixie/knowing-what-to-say/first-list/speaker-5.wav[/vid]
[quote]...we thought it would also be fun to try to generate music. Unlike the TTS experiments, we didn’t condition the networks on an input sequence telling it what to play (such as a musical score); instead, we simply let it generate whatever it wanted to.[/quote]
[vid]https://storage.googleapis.com/deepmind-media/pixie/making-music/sample_1.wav[/vid]
[vid]https://storage.googleapis.com/deepmind-media/pixie/making-music/sample_3.wav[/vid]
[vid]https://storage.googleapis.com/deepmind-media/pixie/making-music/sample_5.wav[/vid]
More audio in source:
[url]https://deepmind.com/blog/wavenet-generative-model-raw-audio/[/url]
Google DeepMind sounds really angry, judging from that music generation.
That's fucking amazing, it even has little breaths and lip sounds when it continues speaking after a pause.
Incredible. Instead of trying to generate the correct waveforms to synthesize human-sounding language, just let an artificial neural network approximate it effortlessly.
Sounds like the background music to silent films.
The one with the "Blue Lagoon" dialogue sounds like a college student giving an oral report who rehearsed her lines days in advance. It seems so human that it's actually cool, and it never really dives into the uncanny valley (or the audio equivalent), imo.
[QUOTE=Luni;51032219]That's fucking amazing, it even has little breaths and lip sounds when it continues speaking after a pause.[/QUOTE]
Because it samples from real human speech, it doesn't actually generate the sound from scratch :smile: We're many, many years from that. People really like to assume that something is more advanced than it actually is, like that spider robot whose article title claimed it "Learns to walk with any combination of its legs broken/bent/disabled!" That sounds cool until you read the article and learn that it actually just chooses the pre-animated walking sequence from its pre-programmed database that works best for the situation.
[QUOTE=Luni;51032219]That's fucking amazing, it even has little breaths and lip sounds when it continues speaking after a pause.[/QUOTE]
ASMR addicts will love this.
[QUOTE=Karmah;51032300]Incredible. Instead of trying to generate the correct waveforms to synthesize human-sounding language, just let an artificial neural network approximate it effortlessly.[/QUOTE]
I've always wondered if future videogames would have voice acting generated artificially, like how Fallout 4/The Sims can generate random characters instead of needing custom models for every face. With this, it'd work the same way - humans are only needed as a reference, like a scan/recording, and then an effectively infinite number of characters/lines becomes possible. Not only that, but can you imagine an RPG where you type out your dialogue and NPCs are next-level chatbots?
[QUOTE=TurtleeyFP;51032404]I've always wondered if future videogames would have voice acting generated artificially, like how Fallout 4/The Sims can generate random characters instead of needing custom models for every face. Not only that, but can you imagine an RPG where you type out your dialogue and NPCs are next-level chatbots?[/QUOTE]
That's one of those things that would be awesome, but unless we had true AI it would never actually be done by any company.
Why use our massively complex computer hardware to simulate advanced intelligence in terms of strategy, planning, conversations, and relationships when instead it's much easier to improve our lighting tech and textures?
[QUOTE=CakeMaster7;51032416]That's one of those things that would be awesome, but unless we had true AI it would never actually be done by any company.
Why use our massively complex computer hardware to simulate advanced intelligence in terms of strategy, planning, conversations, and relationships when instead it's much easier to improve our lighting tech and textures?[/QUOTE]
I'm assuming that in the future, text-based AI will be developed independently and will be much smarter than it is now. All this is doing is running that text through a neural network trained on voices. I assume it'd save a lot of money compared to hiring VAs for hundreds of hours :v:
I wonder how long until MMORPGs are populated with fake people indistinguishable from other players.
soon ill be able to talk to a woman who sounds real and maybe the warmth of her voice will be as equal to that of my mothers
[QUOTE=fruxodaily;51032462]soon ill be able to talk to a woman who sounds real and maybe the warmth of her voice will be as equal to that of my mothers[/QUOTE]
:speechless:
I thought I was impressed by Wolfram|Alpha's music generator, but this is definitely a whole new ball game.
It's like a schizoid Gershwin.
[QUOTE=TurtleeyFP;51032404]I've always wondered if future videogames would have voice acting generated artificially, like how Fallout 4/The Sims can generate random characters instead of needing custom models for every face. With this, it'd work the same way - humans are only needed as a reference, like a scan/recording, and then an effectively infinite number of characters/lines becomes possible. Not only that, but can you imagine an RPG where you type out your dialogue and NPCs are next-level chatbots?[/QUOTE]
This is something I have always wanted to do. Unfortunately, all I have been able to synthesize so far is creepy "shush" sounds and robotic laughter in varying timbres.
[QUOTE=Spor;51032364]Because it samples from real human speech, it doesn't actually generate the sound from scratch :smile: We're many, many years from that. People really like to assume that something is more advanced than it actually is, like that spider robot whose article title claimed it "Learns to walk with any combination of its legs broken/bent/disabled!" That sounds cool until you read the article and learn that it actually just chooses the pre-animated walking sequence from its pre-programmed database that works best for the situation.[/QUOTE]
That's what I figured it was. I'm sure it doesn't /understand/ what it's doing; it's just mimicking the input it was trained with, but it's still cool.
[QUOTE=Matthew0505;51032532]It definitely improves the quality, but at the cost of the computational simplicity that the other models offer. Then again, Moore's law and ASICs.[/QUOTE]
I can see some sort of cloud service being launched when the tech is ready
Now hook it up to a synthesizer, and make the true era of Synthwave real!
[QUOTE=Matthew0505;51032532]It definitely improves the quality, but at the cost of the computational simplicity that the other models offer. Then again, Moore's law and ASICs.[/QUOTE]
These sorts of models are also incredibly slow to run right now due to a lack of research; it's almost guaranteed there will be MUCH faster solutions within a decade.
Additionally, they're only expensive the first time the speech is generated - pre-generated speech could be served just as easily as any other game file.
[QUOTE=TurtleeyFP;51032404]I've always wondered if future videogames would have voice acting generated artificially, like how Fallout 4/The Sims can generate random characters instead of needing custom models for every face. With this, it'd work the same way - humans are only needed as a reference, like a scan/recording, and then an effectively infinite number of characters/lines becomes possible. Not only that, but can you imagine an RPG where you type out your dialogue and NPCs are next-level chatbots?[/QUOTE]
It'd be a goldmine for saving on voice actors, but it'd also probably cause a bit of a scare in the world of professional VAs. I imagine there'd be games that end up having "voiced by actual human voice actors" as a bullet point on the back of the box, much like how some companies use "made in the USA" as a selling point for their product.
Either way, this new tech is going to be very, VERY interesting...
[QUOTE=Ott;51032630]I can see some sort of cloud service being launched when the tech is ready[/QUOTE]
Very likely; Google already provides machine learning as a service. This tech is probably going to see use in voicemail systems first, since those prompts can be pre-generated and only need to be rendered at 8 kHz.
[editline]11th September 2016[/editline]
E.g.:
"You have one missed call from: John Doe"
"Unheard message from: 555-123-4567"
"We're sorry for local service interruptions, the E.T.A for a repair is five hours"
I want like a full 10 minutes of that Google piano business.
If a machine can synthesize a voice and CGI gets advanced enough, Hollywood won't need actors anymore.
[QUOTE=jimbobjoe1234;51034446]If a machine can synthesize a voice and CGI gets advanced enough, Hollywood won't need actors anymore.[/QUOTE]
that'll never happen; CGI is still expensive as fuck compared to real actors
[QUOTE=TheDrunkenOne;51034570]that'll never happen; CGI is still expensive as fuck compared to real actors[/QUOTE]
More of a "when" than a "that'll never happen"
[QUOTE=TheDrunkenOne;51034570]that'll never happen; CGI is still expensive as fuck compared to real actors[/QUOTE]
It will get much cheaper when we have systems designed to generate models and movements for us.
It's theoretically possible to train an NN to combine 3D objects - e.g. "MOLLE vest and (power crystal prefab)" - and it'll generate a vest with some sort of stylistically added glowing purple crystals. Nets can already be trained to mimic the "styles" of other artists; it's just a matter of time before that gets translated to other mediums.
So that's what English sounds like to people who can't speak it.