Text-to-Speech Voice Synthesis using tortoise-tts



Hi, thank you for presenting the software and for your sample; I will try it when I can, because you're right that it can be very useful for adding speech to a video.


This program seems really interesting, but a non-tech-skilled 50+ man has issues understanding how to put this installation together. Any chance that someone who has set up the fast TTS version with the available GUI would be nice enough to share their program folder with me?


Hi deluga503, I installed the base version with torch on CPU because I don't have an Nvidia graphics card. It works; I ran some tests with the voice sets provided by the developer and it is really impressive, but as you say it's slow. Since you use tortoise-tts-fast, I have some questions:

- Does tortoise-tts-fast work with torch in CPU mode, or just the CUDA version?

- The basic tortoise in CPU mode was already an adventure to install; is the fast version really harder to install?

- You say that generating a sound takes between 10 seconds and 1 minute for you with the fast version? How long does it take with the normal version? I ask to decide whether to spend time on the fast version or not.

Again, thanks for your info, your sample, and the site for gaming sounds.
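For reference, a CPU-only install of the base tortoise-tts typically looks something like the sketch below; these commands are reconstructed from the PyTorch and tortoise-tts READMEs, not taken from this thread, so treat the exact steps as an assumption:

  1.     git clone https://github.com/neonbjb/tortoise-tts
  2.     cd tortoise-tts
  3.     conda create -n ttts python=3.9
  4.     conda activate ttts
  5.     conda install pytorch torchaudio cpuonly -c pytorch          (CPU-only PyTorch build, no CUDA required)
  6.     pip install -e .                                             (install tortoise-tts and its dependencies)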

 



It's possibly the best free TTS available at the moment.

The ElevenLabs voice modulation and TTS seems even more powerful, but of course it's paid, and ultimately all paid TTS companies put restrictions on their content because of potential legal issues.


22 hours ago, Exiled_Vizir said:

- Does tortoise-tts-fast work with torch in CPU mode, or just the CUDA version?

- You say that generating a sound takes between 10 seconds and 1 minute for you with the fast version? How long does it take with the normal version? I ask to decide whether to spend time on the fast version or not.

I use the CUDA version on both. The instructions for tortoise-tts-fast specifically call for CUDA 11.7; I don't know if it can be adapted to the CPU version.

On the original version, the shorter clips in the OP would each take me around 2-4 minutes to generate, compared to 10 seconds on the fast version.

 

Quote

- The basic tortoise in CPU mode was already an adventure to install; is the fast version really harder to install?

The fast fork isn't maintained anymore and its instructions are a bit of a mess and outdated, so I had to piece together how to get it working from discussions on the GitHub. I made some notes if it helps.

Do note that there is a lot of extra crap (10+ GB) to install just to get the fast version working compared to the original.

https://github.com/152334H/tortoise-tts-fast
Installation on Windows 10 using an Nvidia GPU

Prerequisites:

  • Install Anaconda for Python environment management (includes Python itself)
  • Install git for repo operations
  • Install CUDA Toolkit 11.7
  • Install Visual Studio 14+ C++ build tools; remember to select a recent Windows 10 SDK (or the Win11 SDK if you're on Win11) during the install (context: https://www.scivision.dev/python-windows-visual-c-14-required)

Steps inside Anaconda Prompt:

  1.     Navigate to desired directory to hold the application code
  2.     git clone https://github.com/152334H/tortoise-tts-fast
  3.     cd tortoise-tts-fast
  4.     conda create -n ttts-fast python=3.8
  5.     conda activate ttts-fast
  6.     (ensure environment (ttts-fast) is indicated on left side of CLI)
  7.     conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
  8.     pip install -e .
  9.     pip install git+https://github.com/152334H/BigVGAN.git
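A quick sanity check before moving on (not in the original notes, but a standard PyTorch one-liner) to confirm the environment actually sees the GPU:

  python -c "import torch; print(torch.cuda.is_available())"            (should print True if the CUDA 11.7 setup worked)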

Running the WebUI application from Anaconda Prompt:

  1.     cd tortoise-tts-fast                          (navigate to code directory)
  2.     conda activate ttts-fast                   (activate the python environment)
  3.     streamlit run scripts/app.py            (launch the local WebUI app)
  4.     Choose text and settings in WebUI then execute
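Streamlit prints a local URL when it starts (http://localhost:8501 by default); open it in a browser to reach the WebUI.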

On 5/11/2023 at 7:17 PM, deluga503 said:

I use the CUDA version on both. The instructions for tortoise-tts-fast specifically call for CUDA 11.7; I don't know if it can be adapted to the CPU version.

On the original version, the shorter clips in the OP would each take me around 2-4 minutes to generate, compared to 10 seconds on the fast version.

 


Hi, thanks for this information. So on GPU it takes you 12 to 24 times less time to generate speech; if I can get that speed-up (or even half of it) on CPU, it would be very interesting, because right now it takes me around an hour to generate a small sentence... But if a newer CUDA is the main source of the speed-up, it won't carry over to CPU.


Hi, I don't know if anyone else has been playing with tortoise-tts lately? My last extensive use was for my video Grand Theft Arcanum, where I used it in CPU mode with PyTorch. My conclusion afterwards was that the tool is powerful (maybe the most powerful open-source TTS) but slow on CPU.

I had some ideas for things to do with tortoise-tts, but they would involve extended testing and many iterations, and that would take too long on CPU...

But I saw that AMD has recently made some progress on its GPGPU SDK, so I tried to install it and run tortoise with PyTorch accelerated by my Radeon. There are mainly two ways to get PyTorch accelerated by ROCm (the AMD GPGPU SDK): native Windows (kind of beta and apparently very complex to install) or native Linux (should work, but I'm lazy and didn't want to set up a Linux box right now). So I cut corners and tried WSL2 (the Windows Subsystem for Linux), and after setting up a small Linux and a few annoying trial-and-error loops I managed to get PyTorch running, nominally accelerated by torch-directml on top of rocm-hip-sdk5.6.1. Apparently it is a wrapper that lets CUDA-style code run on any GPU via... DirectX 12 (yes, on a quasi-Linux VM; we live in interesting times...).
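For anyone retracing this, the torch-directml part usually boils down to something like the following two commands; the exact invocations are an assumption, since the post doesn't list them:

  pip install torch-directml                                            (DirectML backend package for PyTorch)
  python -c "import torch, torch_directml; print(torch.ones(2, device=torch_directml.device()))"      (prints a tensor instead of raising if the DirectML device works)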

Since I had already sunk time into it, I thought, let's finish this thing, so I installed tortoise-tts on my embedded Linux, and after some DIY it worked! But when I launched it I got this warning:

UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling

I ran the test all the way through, and I thought: OK, there is no CUDA, but it seems faster than when I used it for my video. So I did some runs on both my old tortoise-tts install and the new one, and good news: the runs done so far are between 3 and 4 times faster on the new install.

To be sure, I started a bigger sentence on one of the voices I made for my previous video. The sentence is 21 words long; it should take more than 4 hours to generate on the pure CPU version. Then I will test it on my new, maybe-GPU-accelerated version, where I hope it will take between 25 minutes (optimistic forecast) and 2 hours (pessimistic forecast).

I will post the results here; stay tuned if you're interested.


So I finished my tests, and the results are very interesting:

- Sentence of 21 words, voice with 28 seconds of total sample audio, pure CPU PyTorch: 270 minutes

- Sentence of 21 words, voice with 28 seconds of total sample audio, torch-directml/rocm-hip-sdk5.6.1 "I don't know what I'm doing" PyTorch: 35 minutes

- Sentence of 21 words, voice with 43 seconds of total sample audio, torch-directml/rocm-hip-sdk5.6.1 "I don't know what I'm doing" PyTorch: 32 minutes

In this test the new install is between 7 and 8 times faster. Even better: in pure CPU mode, going from 5 to 21 words multiplies the computation time by ~7 (36 minutes to 270), but for the torch-directml version only by ~3 (12 to 35 minutes). I also did a test with the same sentence and 43 seconds of R. De Niro samples, and the duration is about the same (YouTube lighter than Netflix^^?).

Now the strange things I noticed:

- CPU usage was in the range of 50-60% of total CPU time for every test

- I tried to monitor my GPU during the torch-directml tests: either it did nothing, or the monitoring software I used was unable to see any activity (by the way, if you know good software for monitoring a GPU doing compute work, tell me and I will test it).

To sum up, here are my possible explanations for these gains:

- My previous installation is totally broken, so at the same CPU power it is way slower (the most likely)

- My GPU was doing some work and I didn't see it (very possible as well)

- Recent versions of tortoise have been heavily optimized

- The Linux kernel is way faster than the NT kernel for this kind of computation (very unlikely). At work I've seen Linux come out 10-15% ahead on heavy compute workloads, with some pathological cases at a factor of 2, but never a factor of 7; and in my case Linux was virtualized.

 


1 hour ago, Exiled_Vizir said:

So I finished my tests, and the results are very interesting:

- Sentence of 21 words, voice with 28 seconds of total sample audio, pure CPU PyTorch: 270 minutes

- Sentence of 21 words, voice with 28 seconds of total sample audio, torch-directml/rocm-hip-sdk5.6.1 "I don't know what I'm doing" PyTorch: 35 minutes

- Sentence of 21 words, voice with 43 seconds of total sample audio, torch-directml/rocm-hip-sdk5.6.1 "I don't know what I'm doing" PyTorch: 32 minutes

Do you use the standard TTS Tortoise version or the fast version?

GitHub - 152334H/tortoise-tts-fast: Fast TorToiSe inference (5x or your money back!)


Hi @pes1972, I use the standard one, because if I understood correctly the fast version relies on optimizations to the CUDA part of the code (CUDA being the Nvidia GPGPU SDK), and those optimizations need an old version of CUDA to work. So I didn't bother testing it, since I don't have an Nvidia GPU, and I don't think I will test it soon, given the warning my new install gives me:

UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling


If you want to test tortoise yourself and you have an Nvidia graphics card, you can:

- Install the standard version with CUDA support: to me it's the obvious choice: easy to install, and you can generate big sentences (30+ words) in less than 10 minutes, based on the info @deluga503 gave me. Just be sure to have more than 8 GB of free space on your C: drive on Windows or in your home directory under Linux. I didn't install this one myself, but my full-CPU version is based on it, so I can help if you have problems.

- Install the fast version: based on what @deluga503 said it's really fast, but it seems to be a pain in the ass to install.

If you don't have an Nvidia graphics card, you can:

- Install the torch-directml/rocm-hip-sdk5.6.1 version: it's fast, and I can provide all the commands, so it should be easy to install. Just be sure to have more than 6 GB of free space on your C: drive on Windows or in your home directory under Linux; on Windows you need at least Windows 10 22H2, I think (the real requirement is being able to run WSL), and probably a recent AMD or Intel APU/GPU. To me this is the obvious choice when you don't have an Nvidia card.

- Install the full CPU version: it's slow and it's a pain in the ass to install. Use this only if you have no choice.


Hi, I've continued my journey into the fabulous world of artificial voices recently. Last time, I described my test of a torch-directml/rocm-hip-sdk5.6.1 version under Linux/WSL, which was approximately 7 times faster than the pure CPU version under Windows, and I wasn't sure whether that acceleration involved my GPU or not. Now I have the answer: the GPU was not used. But recently I found a new version, and:

[Attached screenshot: Capture_gpu.JPG, showing GPU utilization during generation]

As you can see, my non-Nvidia GPU is used by tortoise-tts. But apart from drawing more electricity, is it useful? Yes, it seems so: I generated a 7-word sentence that I used in my Grand Theft Arcanum video in 3 minutes, versus 15 minutes with torch-directml/rocm-hip-sdk5.6.1 under WSL. So for now my Radeon 6700 has approximately 80-90% of the compute power of a GeForce 1080, based on the information given by deluga503 in this thread.


For those interested in trying tortoise-tts on a Radeon or an Intel GPU with PyTorch on DirectML, use this version of the project by Chapoly1305 (many thanks to him for adapting the code to PyTorch DirectML). Also, if your card has less than 16 GB of VRAM, you will need to add a --batch_size argument to reduce the portion of the model loaded into VRAM. By default --batch_size=14 seems to be used; when I ran on CPU, this configuration used ~16 GB of RAM on my system. To run the program on my graphics card with 10 GB of VRAM I used --batch_size=8, and with that, between 9 and 10 GB are used during the second phase of sample generation. In the issues of the original project, some GeForce 1060 owners with 6 GB of VRAM report using --batch_size=4. So the formula for batch_size seems to be your memory budget minus 2, as the sketch below illustrates.
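A tiny sketch of that rule of thumb in Python; the helper name is purely illustrative and not part of the project:

  # Illustrative only: the "VRAM minus 2" heuristic observed in this thread.
  # Data points: 16 GB -> batch_size 14 (the default), 10 GB -> 8, 6 GB -> 4.
  def suggest_batch_size(vram_gb: int) -> int:
      return max(1, vram_gb - 2)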

