FunCodec Homepage

Features

Reproducible

FunCodec reproduces state-of-the-art neural codec models such as SoundStream and Encodec.

Integrable

FunCodec can be easily extended to downstream tasks, such as ASR and TTS.

Efficient

FunCodec can train models on distributed GPUs and run inference in batch mode.

FreqCodec

FunCodec natively supports frequency-domain modeling, which is more suitable for speech signals.

Semantic Augment

FunCodec models can be augmented with semantic tokens, such as phonemes and HuBERT embeddings.

Demos

1. Quantized speech samples at a high token rate (200 tokens per 16k samples)

Original speech	Encodec	Despt-Codec	AudioDec	FunCodec

2. Quantized speech samples at a low token rate (100 tokens per 16k samples)

Original speech	Encodec	Despt-Codec	AudioDec	FunCodec

3. Text-to-speech synthesis samples

Prompt text	Text	Prompt Speech + Synthesis Speech
later in accordance with what i believe to be the modern practice	i always accept the gift, after deprecating the honour with words and gestures, and a little later, in accordance with what i believe to be the modern practice, return the compliment.
chris he asked	"why don't they look like us, chris? " he asked.
农地征补标准	最终也还是由政府来确定农地征补标准
同比增长三	同比增长三十点二百分之二

Open-source Codebase

License

The FunCodec codebase is available for download under the MIT license.

Links

Our codebase is available at FunCodec on GitHub.

Released Models

Model	Corpus	Stride	Model Hub
FunCodec	LibriTTS	320	🤗⭐
FunCodec-2x	LibriTTS	640	🤗⭐
FunCodec	Multi-Domain	320	🤗⭐
FunCodec-2x	Multi-Domain	640	🤗⭐
FreqCodec-MagPhase-GR1	LibriTTS	320	🤗⭐
FreqCodec-MagPhase-GR8	LibriTTS	320	🤗⭐

Results

1. Comparison of academic models in terms of ViSQOL scores on the LibriTTS dataset. † means the model is causal.

Models	Stride	400 tk/s	200 tk/s	100 tk/s	50 tk/s
SoundStream †	320	4.23	4.00	3.60	3.12
SoundStream	320	4.28	4.06	3.76	3.31
Encodec	320	4.24	4.05	3.73	3.30
FunCodec	320	4.29	4.12	3.86	3.43
FunCodec-2x	640	4.29	4.16	3.94	3.64
FunCodec-4x	1280	4.31	3.94	3.43	2.91
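
The Stride and tk/s columns above can be related with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that this page does not state outright: the audio is sampled at 16 kHz (suggested by the "per 16k samples" wording in the demos), and each encoder frame emits one token per quantizer codebook, as in residual vector quantization. The function names are hypothetical and are not part of the FunCodec API.

```python
# Illustrative arithmetic only; names are hypothetical, not FunCodec API.
SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz audio


def frames_per_second(sample_rate_hz: int, stride: int) -> float:
    """Encoder frames per second when one frame covers `stride` samples."""
    return sample_rate_hz / stride


def tokens_per_second(sample_rate_hz: int, stride: int, num_codebooks: int) -> float:
    """Token rate, assuming one token per codebook per frame."""
    return frames_per_second(sample_rate_hz, stride) * num_codebooks


def codebooks_for_rate(sample_rate_hz: int, stride: int, target_rate: float) -> float:
    """Number of codebooks needed to reach `target_rate` tokens per second."""
    return target_rate / frames_per_second(sample_rate_hz, stride)


if __name__ == "__main__":
    for stride in (320, 640, 1280):
        for rate in (400, 200, 100, 50):
            n = codebooks_for_rate(SAMPLE_RATE_HZ, stride, rate)
            print(f"stride={stride:4d}  {rate:3d} tk/s  ->  {n:g} codebooks")
```

Under these assumptions, a larger stride produces fewer frames per second, so reaching the same token rate requires more codebooks per frame.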

2. Comparison between FunCodec and other toolkits under (a) lower and (b) higher token rates. LS denotes the LibriSpeech test sets. LibriSpeech and GigaSpeech are English corpora, while AISHELL and Wenet are Mandarin corpora.

3. Comparison of FreqCodec and other time-domain models in terms of ViSQOL scores on LibriTTS. Mag denotes the magnitude spectrogram. C_in denotes the number of input channels.

MID	Domain	Params	FLOPs	Groups (Enc, Dec)	400 tk/s	100 tk/s
M1	Time	14.85M	3.72G	1,1	4.29	3.86
M2	Mag,Angle	16.21M	6.39G	1,1	4.32	3.84
M3	Mag,Phase	16.21M	6.47G	1,1	4.36	3.85
M4	Mag,Phase	4.38M	1.73G	1,C_in	4.28	3.79
M5	Mag,Phase	4.50M	1.73G	1,C_in/8	4.31	3.81
M6	Mag,Phase	0.52M	0.34G	C_in,C_in	4.21	3.65
M7	Mag,Phase	0.83M	1.03G	C_in/4,C_in/4	4.25	3.80
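
To see why the Params column shrinks as the group count grows, the sketch below counts the weights of a single grouped convolution layer. It is a general property of grouped convolutions, not a reconstruction of the FreqCodec layers: the channel count (64) and kernel size (3x3, so 9 elements) are hypothetical values chosen for illustration. A group count equal to C_in corresponds to what is usually called a depthwise convolution.

```python
# Parameter count of one grouped convolution layer.
# Channel and kernel sizes below are hypothetical, not FreqCodec's.
def conv_params(c_in: int, c_out: int, kernel_elems: int,
                groups: int = 1, bias: bool = True) -> int:
    """Weights (+ optional bias) of a convolution split into `groups` groups.

    Each output channel only sees c_in / groups input channels,
    so the weight count is divided by `groups`.
    """
    if c_in % groups != 0 or c_out % groups != 0:
        raise ValueError("c_in and c_out must both be divisible by groups")
    weights = (c_in // groups) * c_out * kernel_elems
    return weights + (c_out if bias else 0)


if __name__ == "__main__":
    c_in = c_out = 64   # hypothetical channel count
    kernel_elems = 9    # hypothetical 3x3 kernel
    for groups in (1, c_in // 8, c_in):
        n = conv_params(c_in, c_out, kernel_elems, groups)
        print(f"groups={groups:2d}: {n} parameters")
```

The weight count falls in proportion to the number of groups, which is consistent with the trend in the table, where the configurations with more groups have fewer parameters and lower FLOPs.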

Publication

Please cite the following if you make use of the toolkit.

Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng. FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec, arXiv:2309.07405, 2023

@misc{du2023funcodec,
      title={FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec},
      author={Zhihao Du and Shiliang Zhang and Kai Hu and Siqi Zheng},
      year={2023},
      eprint={2309.07405},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}