Text-to-Speech Voice Synthesis using tortoise-tts



Hi, thank you for presenting the software and for your sample; I will try it when I can, because you're right that it can be very useful for adding speech to a video.


This program seems really interesting, but a non-tech-skilled 50+ man has issues understanding how to put this installation together. Any chance that someone who has set up the fast TTS version with the available GUI would be nice enough to share their program folder with me?


Hi deluga503, I installed the base version with torch on CPU because I don't have an Nvidia graphics card. It works; I ran some tests with the voice sets provided by the developer and it is really impressive, but as you say it's slow. Since you use tortoise-tts-fast, I have some questions:

- Does tortoise-tts-fast work with torch in CPU mode, or just the CUDA version?

- The basic tortoise in CPU mode was already an adventure to install; is the fast version really harder to install?

- You say that generating a sound takes between 10 seconds and 1 minute for you with the fast version? How long does it take with the normal version? I ask to decide whether to spend time on the fast version or not.

Again, thanks for your info, your sample, and the site for gaming sounds.
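For reference, a CPU-only install of the base tortoise-tts typically looks something like the sketch below; these commands are reconstructed from the PyTorch and tortoise-tts READMEs, not taken from this thread, so treat the exact steps as an assumption:

  1.     git clone https://github.com/neonbjb/tortoise-tts
  2.     cd tortoise-tts
  3.     conda create -n ttts python=3.9
  4.     conda activate ttts
  5.     conda install pytorch torchaudio cpuonly -c pytorch          (CPU-only PyTorch build, no CUDA required)
  6.     pip install -e .                                             (install tortoise-tts and its dependencies)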

 



It's possibly the best free TTS available at the moment.

The ElevenLabs voice modulation and TTS seems even more powerful, but of course it's paid, and ultimately all paid TTS companies put restrictions on their content because of potential legal issues.


22 hours ago, Exiled_Vizir said:

- Does tortoise-tts-fast work with torch in CPU mode, or just the CUDA version?

- You say that generating a sound takes between 10 seconds and 1 minute for you with the fast version? How long does it take with the normal version? I ask to decide whether to spend time on the fast version or not.

I use the CUDA version on both. The instructions for tortoise-tts-fast specifically call for CUDA 11.7; I don't know if it can be adapted to the CPU version.

On the original version, the shorter clips in the OP would each take me around 2-4 minutes to generate, compared to 10 seconds on the fast version.

 

Quote

- The basic tortoise in CPU mode was already an adventure to install; is the fast version really harder to install?

The fast fork isn't maintained anymore and its instructions are a bit of a mess and outdated, so I had to piece together how to get it working from discussions on the GitHub. I made some notes if it helps.

Do note that there is a lot of extra crap (10+ GB) to install just to get the fast version working compared to the original.

https://github.com/152334H/tortoise-tts-fast
Installation on Windows 10 using an Nvidia GPU

Prerequisites:

  • Install Anaconda for Python environment management (includes Python itself)
  • Install git for repo operations
  • Install CUDA Toolkit 11.7
  • Install Visual Studio 14+ C++ build tools; remember to select a recent Windows 10 SDK (or the Win11 SDK if you're on Win11) during the install (context: https://www.scivision.dev/python-windows-visual-c-14-required)

Steps inside Anaconda Prompt:

  1.     Navigate to desired directory to hold the application code
  2.     git clone https://github.com/152334H/tortoise-tts-fast
  3.     cd tortoise-tts-fast
  4.     conda create -n ttts-fast python=3.8
  5.     conda activate ttts-fast
  6.     (ensure environment (ttts-fast) is indicated on left side of CLI)
  7.     conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
  8.     pip install -e .
  9.     pip install git+https://github.com/152334H/BigVGAN.git
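A quick sanity check before moving on (not in the original notes, but a standard PyTorch one-liner) to confirm the environment actually sees the GPU:

  python -c "import torch; print(torch.cuda.is_available())"            (should print True if the CUDA 11.7 setup worked)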

Running the WebUI application from Anaconda Prompt:

  1.     cd tortoise-tts-fast                          (navigate to code directory)
  2.     conda activate ttts-fast                   (activate the python environment)
  3.     streamlit run scripts/app.py            (launch the local WebUI app)
  4.     Choose text and settings in WebUI then execute
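Streamlit prints a local URL when it starts (http://localhost:8501 by default); open it in a browser to reach the WebUI.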

On 5/11/2023 at 7:17 PM, deluga503 said:

I use the CUDA version on both. The instructions for tortoise-tts-fast specifically call for CUDA 11.7; I don't know if it can be adapted to the CPU version.

On the original version, the shorter clips in the OP would each take me around 2-4 minutes to generate, compared to 10 seconds on the fast version.

 


Hi, thanks for this information. So on GPU it takes you 12 to 24 times less time to generate speech; if I can get that speed-up (or even half of it) on CPU, it would be very interesting, because right now it takes me around an hour to generate a small sentence... But if a newer CUDA is the main source of the speed-up, it won't carry over to CPU.


Hi, I don't know if anyone else has been playing with tortoise-tts lately? My last extensive use was for my video Grand Theft Arcanum, where I used it in CPU mode with PyTorch. My conclusion afterwards was that the tool is powerful (maybe the most powerful open-source TTS) but slow on CPU.

I had some ideas for things to do with tortoise-tts, but they would involve extended testing and many iterations, and that would take too long on CPU...

But I saw that AMD has recently made some progress on its GPGPU SDK, so I tried to install it and run tortoise with PyTorch accelerated by my Radeon. There are mainly two ways to get PyTorch accelerated by ROCm (the AMD GPGPU SDK): native Windows (kind of beta and apparently very complex to install) or native Linux (should work, but I'm lazy and didn't want to set up a Linux box right now). So I cut corners and tried WSL2 (the Windows Subsystem for Linux), and after setting up a small Linux and a few annoying trial-and-error loops I managed to get PyTorch running, nominally accelerated by torch-directml on top of rocm-hip-sdk5.6.1. Apparently it is a wrapper that lets CUDA-style code run on any GPU via... DirectX 12 (yes, on a quasi-Linux VM; we live in interesting times...).
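For anyone retracing this, the torch-directml part usually boils down to something like the following two commands; the exact invocations are an assumption, since the post doesn't list them:

  pip install torch-directml                                            (DirectML backend package for PyTorch)
  python -c "import torch, torch_directml; print(torch.ones(2, device=torch_directml.device()))"      (prints a tensor instead of raising if the DirectML device works)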

Since I had already sunk time into it, I thought, let's finish this thing, so I installed tortoise-tts on my embedded Linux, and after some DIY it worked! But when I launched it I got this warning:

UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling

I ran the test all the way through, and I thought: OK, there is no CUDA, but it seems faster than when I used it for my video. So I did some runs on both my old tortoise-tts install and the new one, and good news: the runs done so far are between 3 and 4 times faster on the new install.

To be sure, I started a bigger sentence on one of the voices I made for my previous video. The sentence is 21 words long; it should take more than 4 hours to generate on the pure CPU version. Then I will test it on my new, maybe-GPU-accelerated version, where I hope it will take between 25 minutes (optimistic forecast) and 2 hours (pessimistic forecast).

I will post the results here; stay tuned if you're interested.


So I finished my tests, and the results are very interesting:

- Sentence of 21 words, voice with 28 seconds of total sample audio, pure CPU PyTorch: 270 minutes

- Sentence of 21 words, voice with 28 seconds of total sample audio, torch-directml/rocm-hip-sdk5.6.1 "I don't know what I'm doing" PyTorch: 35 minutes

- Sentence of 21 words, voice with 43 seconds of total sample audio, torch-directml/rocm-hip-sdk5.6.1 "I don't know what I'm doing" PyTorch: 32 minutes

In this test the new install is between 7 and 8 times faster. Even better: in pure CPU mode, going from 5 to 21 words multiplies the computation time by ~7 (36 minutes to 270), but for the torch-directml version only by ~3 (12 to 35 minutes). I also did a test with the same sentence and 43 seconds of R. De Niro samples, and the duration is about the same (YouTube lighter than Netflix^^?).

Now the strange things I noticed:

- CPU usage was in the range of 50-60% of total CPU time for every test

- I tried to monitor my GPU during the torch-directml tests: either it did nothing, or the monitoring software I used was unable to see any activity (by the way, if you know good software for monitoring a GPU doing compute work, tell me and I will test it).

To sum up, here are my possible explanations for these gains:

- My previous installation is totally broken, so at the same CPU power it is way slower (the most likely)

- My GPU was doing some work and I didn't see it (very possible as well)

- Recent versions of tortoise have been heavily optimized

- The Linux kernel is way faster than the NT kernel for this kind of computation (very unlikely). At work I've seen Linux come out 10-15% ahead on heavy compute workloads, with some pathological cases at a factor of 2, but never a factor of 7; and in my case Linux was virtualized.

 


1 hour ago, Exiled_Vizir said:

So I finished my tests, and the results are very interesting:

- Sentence of 21 words, voice with 28 seconds of total sample audio, pure CPU PyTorch: 270 minutes

- Sentence of 21 words, voice with 28 seconds of total sample audio, torch-directml/rocm-hip-sdk5.6.1 "I don't know what I'm doing" PyTorch: 35 minutes

- Sentence of 21 words, voice with 43 seconds of total sample audio, torch-directml/rocm-hip-sdk5.6.1 "I don't know what I'm doing" PyTorch: 32 minutes

Do you use the standard TTS Tortoise version or the fast version?

GitHub - 152334H/tortoise-tts-fast: Fast TorToiSe inference (5x or your money back!)


Hi @pes1972, I use the standard one, because if I understood correctly the fast version relies on optimizations to the CUDA part of the code (CUDA being the Nvidia GPGPU SDK), and those optimizations need an old version of CUDA to work. So I didn't bother testing it, since I don't have an Nvidia GPU, and I don't think I will test it soon, given the warning my new install gives me:

UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling


If you want to test tortoise yourself and you have an Nvidia graphics card, you can:

- Install the standard version with CUDA support: to me it's the obvious choice: easy to install, and you can generate big sentences (30+ words) in less than 10 minutes, based on the info @deluga503 gave me. Just be sure to have more than 8 GB of free space on your C: drive on Windows or in your home directory under Linux. I didn't install this one myself, but my full-CPU version is based on it, so I can help if you have problems.

- Install the fast version: based on what @deluga503 said it's really fast, but it seems to be a pain in the ass to install.

If you don't have an Nvidia graphics card, you can:

- Install the torch-directml/rocm-hip-sdk5.6.1 version: it's fast, and I can provide all the commands, so it should be easy to install. Just be sure to have more than 6 GB of free space on your C: drive on Windows or in your home directory under Linux; on Windows you need at least Windows 10 22H2, I think (the real requirement is being able to run WSL), and probably a recent AMD or Intel APU/GPU. To me this is the obvious choice when you don't have an Nvidia card.

- Install the full CPU version: it's slow and it's a pain in the ass to install. Use this only if you have no choice.


Hi, I've continued my journey into the fabulous world of artificial voices recently. Last time, I described my test of a torch-directml/rocm-hip-sdk5.6.1 version under Linux/WSL, which was approximately 7 times faster than the pure CPU version under Windows, and I wasn't sure whether that acceleration involved my GPU or not. Now I have the answer: the GPU was not used. But recently I found a new version, and:

[Attached screenshot: Capture_gpu.JPG, showing GPU utilization during generation]

As you can see, my non-Nvidia GPU is used by tortoise-tts. But apart from drawing more electricity, is it useful? Yes, it seems so: I generated a 7-word sentence that I used in my Grand Theft Arcanum video in 3 minutes, versus 15 minutes with torch-directml/rocm-hip-sdk5.6.1 under WSL. So for now my Radeon 6700 has approximately 80-90% of the compute power of a GeForce 1080, based on the information given by deluga503 in this thread.


For those interested in trying tortoise-tts on a Radeon or an Intel GPU with PyTorch on DirectML, use this version of the project by Chapoly1305 (many thanks to him for adapting the code to PyTorch DirectML). Also, if your card has less than 16 GB of VRAM, you will need to add a --batch_size argument to reduce the portion of the model loaded into VRAM. By default --batch_size=14 seems to be used; when I ran on CPU, this configuration used ~16 GB of RAM on my system. To run the program on my graphics card with 10 GB of VRAM I used --batch_size=8, and with that, between 9 and 10 GB are used during the second phase of sample generation. In the issues of the original project, some GeForce 1060 owners with 6 GB of VRAM report using --batch_size=4. So the formula for batch_size seems to be your memory budget minus 2, as the sketch below illustrates.
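A tiny sketch of that rule of thumb in Python; the helper name is purely illustrative and not part of the project:

  # Illustrative only: the "VRAM minus 2" heuristic observed in this thread.
  # Data points: 16 GB -> batch_size 14 (the default), 10 GB -> 8, 6 GB -> 4.
  def suggest_batch_size(vram_gb: int) -> int:
      return max(1, vram_gb - 2)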

