Maybe Arma can finally get some REALLY good radio speech.
guys, when something like this comes out to the public, the military has had access to it for 20 years
I don't think this is actually "replicating" speech, as it requires gathering a soundbank much longer than the short sample shown in the video.
Extremely convenient for editing-via-text, yes, but not really what I'd consider revolutionary
As cool as this tech is, I'm not sure it's a good thing to release. It could essentially make any and all social engineering and scams 100% effective, you could no longer trust anything or anyone who isn't in front of you.
[QUOTE=nagachief;51380598]As cool as this tech is, I'm not sure it's a good thing to release. It could essentially make any and all social engineering and scams 100% effective, you could no longer trust anything or anyone who isn't in front of you.[/QUOTE]
and that's a good thing. the more society leans towards not trusting the internet and making it a side-thing to our lives, the better
This is like that Simpsons episode with the Road Runner voice actor. They can just make them say a few things, then use it for the rest. This could hurt them in the long run, since they get paid by the sentence and stuff.
This is awesome! I hope it comes out soon as a beta.
Isn't this tool-assisted cut and paste rather than speech synthesis? Similar to the "sentence splicing" in YTP videos.
I think all the words he used are in the audio file, and the way he says "my wife" and "my dogs" sounds unchanged. So I think it's figuring out what's being said in the audio clip (speech recognition) and then marking the beginning and end of words.
Although I wonder if you can use a technique like [url]https://deepmind.com/blog/wavenet-generative-model-raw-audio/[/url]
to fix up the sound so that, for instance, if you swapped "my wife" and "my dogs", it would retain the pitch and other emotional cues.
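Nobody outside Adobe knows how VoCo is actually implemented, but the cut-and-paste reading above can be sketched in a few lines: assume a recognizer/forced aligner has already produced start and end times for each word, and "editing via text" just concatenates those spans. The timings, sample rate, and audio buffer below are all made up for illustration.

```python
# Toy sketch of "tool-assisted cut and paste": a speech recognizer /
# forced aligner gives start/end times for each word, and editing by
# text just rearranges those audio spans.

def splice_words(audio, word_times, new_order, rate=16000):
    """Rebuild audio by concatenating word spans in new_order.

    audio: sequence of samples
    word_times: dict mapping word -> (start_sec, end_sec)
    new_order: list of words to emit, each must be in word_times
    """
    out = []
    for word in new_order:
        start, end = word_times[word]
        out.extend(audio[round(start * rate):round(end * rate)])
    return out

# tiny demo: one "second" of audio at an absurdly low sample rate
rate = 10
audio = list(range(10))
times = {"my": (0.0, 0.3), "wife": (0.3, 0.6), "dogs": (0.6, 1.0)}
swapped = splice_words(audio, times, ["my", "dogs"], rate)
```

Splicing like this copies the original pitch contours verbatim, which is exactly why something WaveNet-like might be needed afterwards to smooth the seams.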
[QUOTE=CapsAdmin;51381226]Isn't this tool-assisted cut and paste rather than speech synthesis? Similar to the "sentence splicing" in YTP videos.
I think all the words he used are in the audio file, and the way he says "my wife" and "my dogs" sounds unchanged. So I think it's figuring out what's being said in the audio clip (speech recognition) and then marking the beginning and end of words.
Although I wonder if you can use a technique like [url]https://deepmind.com/blog/wavenet-generative-model-raw-audio/[/url]
to fix up the sound so that, for instance, if you swapped "my wife" and "my dogs", it would retain the pitch and other emotional cues.[/QUOTE]
I know what you mean, but what about the words not in the source material?
[QUOTE=geogzm;51379302]that speech synthesis sounds very crude. what does this do that a DAW can't?[/QUOTE]
Although it's not speech synthesis, you can additionally use Adobe Audition's healing tools to fix some of the crude transitions.
[img]http://wikisound.org/images/1/15/Audition_healing.gif[/img]
(sorry I couldn't find a better gif. I don't have access to adobe audition at the moment)
[QUOTE=IQ-Guldfisk;51381233]I know what you mean, but what about the words not in the source material?[/QUOTE]
you need like 20 minutes of sample
[QUOTE=IQ-Guldfisk;51381233]I know what you mean, but what about the words not in the source material?[/QUOTE]
I'm assuming "jordan" was something said in the rest of the source.
[editline]16th November 2016[/editline]
[QUOTE=Scratch.;51381267]you need like 20 minutes of sample[/QUOTE]
But is that as a training set so you can make it say any word, or are you limited to the words said in the 20-minute sample? I believe it's the latter.
[QUOTE=Snickerdoodle;51379947]What program can I use to replicate my own voice by just typing words?[/QUOTE]
i'm not talking about the process used, i'm talking about the end result. you can use audacity on your grandmother's laptop and chop samples of your own voice, for a more realistic result you can clean up that audio with something like rx-5 by izotope
it's a great convenience don't get me wrong, but people freaking out over ethical concerns can rest easy, this is a shortcut more than a new technology
[QUOTE=CapsAdmin;51381279]I'm assuming "jordan" was something said in the rest of the source.
[/QUOTE]
Except at 3:23 he says "we can actually type something that's not here", indicating that it's not in the source material.
For the jordan part I was like "eh, it's like a YTP except it does the hard work for you", but the three times bit, wtf.
Wasn't super impressed by this either, especially since DeepMind "published" their work on the generative WaveNet model beforehand (which to me sounds better straight off the bat). On the other hand, realtime application of WaveNet is not possible at the moment (which speaks for VoCo).
[QUOTE=nagachief;51380598]As cool as this tech is, I'm not sure it's a good thing to release. It could essentially make any and all social engineering and scams 100% effective, you could no longer trust anything or anyone who isn't in front of you.[/QUOTE]
How will you impersonate someone? You need 20 minutes of recorded voice to synthesize from, presumably from a script.
"Ma'am, the IRS reports that you have unpaid taxes. To avoid prosecution, please read the entirety of Hamlet right now."
Otherwise it would be the same as any other scam with a live operator.
[QUOTE=SGTNAPALM;51382276]How will you impersonate someone? You need 20 minutes of recorded voice to synthesize from, presumably from a script.
"Ma'am, the IRS reports that you have unpaid taxes. To avoid prosecution, please read the entirety of Hamlet right now."
Otherwise it would be the same as any other scam with a live operator.[/QUOTE]
You can get that much and a ton more from any public speaker for example. This includes a ton of youtubers and all sorts of people.
[QUOTE=SGTNAPALM;51382276]How will you impersonate someone? You need 20 minutes of recorded voice to synthesize from, presumably from a script. [/QUOTE]
Time to buy a lot of movies with Morgan Freeman in them
[QUOTE=rndgenerator;51382338]You can get that much and a ton more from any public speaker for example. This includes a ton of youtubers and all sorts of people.[/QUOTE]
You mean the exact script needed to get all the phonemes necessary, with the exact cadence and stress required to work?
[QUOTE=SGTNAPALM;51382372]You mean the exact script needed to get all the phonemes necessary, with the exact cadence and stress required to work?[/QUOTE]
Any source on that requirement?
When this comes out I'm expecting at least 20-25 more seasons of The Simpsons thanks to Adobe.
[QUOTE=rndgenerator;51382381]Any source on that requirement?[/QUOTE]
No, but it makes sense logically. There are a ton of different sounds a language makes, some used frequently and some used rarely. The software would need a sample of each different possibility in order to properly synthesize the speech. The only way the software could be sure it had all the necessary parts of speech would be to require users to read a script with a specific tone and cadence.
That or the software is amazing beyond my comprehension and can synthesize sounds that aren't provided.
Cartoons could be dubbed in other countries much faster too, this is next-level shit right here.
We should use this to bring actors back from the dead. Just plug in a bunch of speech examples from old movies and boom, we've got Robin Williams back.
[QUOTE=CapsAdmin;51381255]Although it's not speech synthesis you can additionally use adobe audition's healing tools to fix some of the crude transitions.
[img]http://wikisound.org/images/1/15/Audition_healing.gif[/img]
(sorry I couldn't find a better gif. I don't have access to adobe audition at the moment)[/QUOTE]
The audio heal function in audition really isn't... very useful in practice though. It leaves lots of artifacts even on what should be simple to fix.
Can't wait to be dallas from pd2
This is neat and all, but I hope they expose an advanced interface that lets you specify stuff like inflection. TTS generally gets a lot better when it doesn't have to guess everything from just a string of English text.
This reminds me of voice transformation (learn more about it here: [url]http://festvox.org/transform/transform.html[/url]), which basically maps vocal features from one person's voice to another's by supplying a trained model and your recorded voice as a WAV file. It can be quite convincing if you have enough training data, but it requires a good amount of high-quality isolated samples from both parties saying exactly the same thing. One excellent source of a victim's audio could be video game dialogue, which is generally high quality and almost always isolated (i.e. separated onto audio tracks in BINK video, WAV files of dialogue, etc).
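A minimal sketch of that parallel-data idea: given paired feature frames from two speakers saying the same thing, fit a mapping from one voice's feature space to the other's. Real systems like the Festvox transform use GMM-based mappings over spectral features; here the "features" are made-up 2-D vectors and the map is a plain least-squares fit, purely for illustration.

```python
# Toy illustration of parallel voice transformation: fit a linear map
# from source-speaker frames to target-speaker frames, given frames
# that are time-aligned because both said exactly the same thing.
import numpy as np

# fake "spectral feature" frames for the source speaker
src = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0],
                [2.0, 1.0]])

# pretend the target speaker's frames are a fixed linear warp of src
W_true = np.array([[2.0, 0.5],
                   [0.0, 1.5]])
tgt = src @ W_true

# least-squares fit of the mapping from the parallel (src, tgt) pairs
W, *_ = np.linalg.lstsq(src, tgt, rcond=None)

# "convert" the source voice's frames into the target's feature space
converted = src @ W
```

Because the toy target really is a linear warp of the source, the fit recovers it exactly; with real voices the mapping is nonlinear and noisy, which is why you need a lot of clean, isolated, identical-script audio from both parties.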