Using PaddlePaddle Parakeet to Synthesize a Pleasant Voice That "Reads" Papers for Me

Date: 2022-07-24

[PaddlePaddle Developer Says] Gu Xi, PPDE (PaddlePaddle Developer Expert), a development engineer in the tobacco industry, graduated from the School of Mathematical Sciences at Xiamen University; research focus: applications of artificial intelligence in the tobacco industry.

Deep learning papers are always a bit of a slog. What do you do when you just can't read on?

Let PaddlePaddle read them to me ︿( ̄︶ ̄)︿

Project Overview

How can PaddlePaddle "read" a paper on its own, i.e., perform text-to-speech? Breaking the problem down, it suffices to implement the text-to-speech (TTS) task for these three scenarios:

  • a paper introduction on an HTML page
  • a paper abstract in a PDF
  • English sentences in images, via OCR

Implementing these three scenarios relies on two PaddlePaddle development kits:

1. The Parakeet toolkit performs the text-to-speech conversion, with WaveFlow and Griffin-Lim used as two alternative vocoders for waveform synthesis. WaveFlow is a vocoder based on a deep neural network, while Griffin-Lim is a classic vocoder: a simple, efficient algorithm that reconstructs speech from the magnitude spectrogram alone, without any phase information. You can compare how the two sound in the final TTS audio; a minimal Griffin-Lim sketch also follows right after this overview.

Parakeet (project repo:

https://github.com/PaddlePaddle/Parakeet)

is PaddlePaddle's speech synthesis toolkit. It provides flexible, efficient, state-of-the-art text-to-speech tools that help developers build and apply speech synthesis models more conveniently and efficiently.

Prerequisite project — Parakeet: a hands-on guide to training a speech synthesis model (script task, Notebook).

2. The PaddleOCR toolkit converts text in images into readable text. Papers contain figures, and the text inside them must first be extracted before it can be "read" aloud, which is exactly what an OCR model does. Since TTS pronounces each word, the OCR model needs to recognize not only characters but also words, so this project uses PaddleOCR's space-aware pretrained model to turn image text into readable, word-separated text (a quick usage sketch also follows right after this overview).

PaddleOCR (project repo:

https://github.com/PaddlePaddle/PaddleOCR)

is PaddlePaddle's text recognition toolkit. It aims to provide a rich, cutting-edge, practical library of text detection and recognition models and tools; it open-sources an ultra-lightweight Chinese OCR model and a general-purpose Chinese OCR model, and offers dozens of training recipes for text detection and recognition to help users train better models and put them into production.
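
To make the vocoder contrast concrete, here is a minimal Griffin-Lim sketch using librosa (librosa is an assumption for illustration only; Parakeet ships its own Griffin-Lim implementation, invoked later through synthesis.py):

import numpy as np
import librosa

# Build a one-second test tone, then keep only its magnitude spectrogram,
# discarding the phase.
sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440 * t)
S = np.abs(librosa.stft(y))

# Griffin-Lim iteratively estimates a phase consistent with |S|, then inverts
# the STFT to recover a time-domain waveform.
y_rec = librosa.griffinlim(S, n_iter=32)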
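
Likewise, a hedged sketch of space-aware English OCR using the paddleocr pip package (a convenience wrapper around PaddleOCR; this project instead calls tools/infer/predict_system.py as shown in the walkthrough below, and 'quote.jpg' is a placeholder path):

from paddleocr import PaddleOCR  # pip install paddleocr

ocr = PaddleOCR(lang='en', use_space_char=True)  # keep spaces between words
result = ocr.ocr('quote.jpg')
# The exact result layout varies across paddleocr versions; each item carries
# a detection box plus a (text, confidence) pair.
for item in result:
    print(item)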

Final TTS Results

Reading an HTML article paragraph:

----------------------------

Audio synthesis has a variety of applications, including text-to-speech (TTS), music generation, virtual assistant, and digital content creation. In recent years, deep neural network has obtained noticeable successes for synthesizing raw audio in high-fidelity speech and music generation. One of the most successful examples are autoregressive models (e.g., WaveNet). However, they sequentially generate high temporal resolution of raw waveform (e.g., 24 kHz) at synthesis, which are prohibitively slow for real-time applications.

Many researchers from various organizations have spent considerable effort to develop parallel generative models for raw audio. Parallel WaveNet and ClariNet could generate high-fidelity audio in parallel, but they require distillation from a pretrained autoregressive model and a set of auxiliary losses for training, which complicates the training pipeline and increases the cost of development. GAN-based model can be trained from scratch, but it provides inferior audio fidelity than WaveNet. WaveGlow can be trained directly with maximum likelihood, but the model has huge number of parameters (e.g., 88M parameters) to reach the comparable fidelity of audio as WaveNet.

Today, we’re excited to announce WaveFlow (paper, audio samples), the latest milestone of audio synthesis research at Baidu. It features: 1) high-fidelity & ultra-fast audio synthesis, 2) simple likelihood-based training, and 3) small memory footprint, which could not be achieved simultaneously in previous work. Our small-footprint model (5.91M parameters) can synthesize high-fidelity speech (MOS: 4.32) more than 40x faster than real-time on a Nvidia V100 GPU. WaveFlow also provides a unified view of likelihood-models for raw audio, which includes both WaveNet and WaveGlow as special cases and allow us to explicitly trade inference parallelism for model capacity.

Our paper will be presented at ICML 2020.

For more details of WaveFlow, please check out our paper: https://arxiv.org/abs/1912.01219

Audio samples are in: https://waveflow-demo.github.io/

The implementation can be accessed in Parakeet, which is a text-to-speech toolkit building on PaddlePaddle: https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/waveflow

----------------------------

Reading a PDF abstract; the passage read aloud:

----------------------------

Abstract

In this work, we propose WaveFlow, a small-footprint generative flow for raw audio, which is directly trained with maximum likelihood. It handles the long-range structure of 1-D waveform with a dilated 2-D convolutional architecture, while modeling the local variations using expressive autoregressive functions. WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases. It generates high-fidelity speech as WaveNet, while synthesizing several orders of magnitude faster as it only requires a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps. Furthermore, it can significantly reduce the likelihood gap that has existed between autoregressive models and flow-based models for efficient synthesis. Finally, our small-footprint WaveFlow has only 5.91M parameters, which is 15× smaller than WaveGlow. It can generate 22.05 kHz high-fidelity audio 42.6× faster than real-time (at a rate of 939.3 kHz) on a V100 GPU without engineered inference kernels.

----------------------------

Reading text recognized from an image by OCR:

Full Project Walkthrough

The steps below are available on AI Studio, where you can run them online; you can also try them on your own machine:

https://aistudio.baidu.com/aistudio/projectdetail/676162

Step 1: Download and install the libraries

Install the Parakeet model library

Note: if importing the Parakeet library fails after installation, restart the project (kernel) and import again.

!git clone https://github.com/PaddlePaddle/Parakeet
# Use %cd rather than !cd: "!cd" runs in a throwaway subshell, so the
# directory change would not persist to the next command.
%cd Parakeet
!pip install -e .
%cd ..

# The English text frontend needs NLTK's punkt tokenizer and the CMU pronouncing dictionary.
import nltk
nltk.download("punkt")
nltk.download("cmudict")

Prepare the Parakeet pretrained models

The pretrained models required are:

  • the WaveFlow pretrained model (res128, i.e. 128 residual channels)
  • the FastSpeech text-to-speech pretrained model

!wget https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_ckpt_1.0.zip
!unzip waveflow_res128_ljspeech_ckpt_1.0.zip -d  Parakeet/examples/fastspeech/
!wget https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech_ljspeech_ckpt_1.0.zip
!unzip fastspeech_ljspeech_ckpt_1.0.zip -d  Parakeet/examples/fastspeech/fastspeech_ljspeech_ckpt_1.0/

Install PaddleOCR

!git clone https://gitee.com/paddlepaddle/PaddleOCR.git
%cd PaddleOCR
!pip install -r requirments.txt

Prepare the space-aware recognition pretrained models

!mkdir inference
%cd inference

!wget https://paddleocr.bj.bcebos.com/ch_models/ch_rec_r34_vd_crnn_enhance_infer.tar && tar xf ch_rec_r34_vd_crnn_enhance_infer.tar

!wget https://paddleocr.bj.bcebos.com/ch_models/ch_det_r50_vd_db_infer.tar && tar xf ch_det_r50_vd_db_infer.tar
%cd ../..

Install Beautiful Soup and other utility libraries

!pip install bs4
!pip install xlwt
!pip install xlrd
!pip install lxml
!pip install w3lib
!pip install pdfminer3k

Step 2: Parse the article content

The parsing methods for the three typical scenarios, HTML web articles, ordinary PDFs, and text in images, are as follows.

Parsing an HTML article:

Here the requests module and Beautiful Soup are used to crawl and clean the Baidu Research page introducing WaveFlow, "WaveFlow: A Compact Flow-Based Model for Raw Audio".

Beautiful Soup is a Python library for extracting data from HTML or XML files. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.

It is a toolbox: it parses a document and hands you the data you want to scrape, and because it is so simple, a complete application takes very little code.

References:

  • Beautiful Soup 4.4.0 documentation
  • Parsing HTML and extracting data with Python and Beautiful Soup
  • Using find and find_all in BeautifulSoup
  • Removing specified HTML tags and comments with BeautifulSoup
  • AI Studio project: crawling contestant info for "Youth With You 2"

import json
import re
import requests
import datetime
from bs4 import BeautifulSoup
import os

def print_crawl_data(url, save_path):
    """
    Crawl the HTML page at the given url, print its text, and save it.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    try:
        response = requests.get(url, headers=headers)
        # print(response.status_code)
        # Passing the document (here a string) to the BeautifulSoup
        # constructor yields a parsed document object.
        soup = BeautifulSoup(response.text, 'lxml')
        # [s.extract() for s in soup('a')]
        # Search by CSS: return every <span> tag whose style is
        # 'color: rgb(0, 0, 0); font-family: Arial, sans-serif;'
        texts = soup.find_all('span', {'style': 'color: rgb(0, 0, 0); font-family: Arial, sans-serif;'})
        for text in texts:
            # Keep only the text content of each matched node.
            with open('%s' % (save_path), 'a') as f:
                result = text.text
                print(result)
                f.write(result + "\n")
    except Exception as e:
        print(e)

print_crawl_data('http://research.baidu.com/Blog/index-view?id=139', 'article.txt')

Audio synthesis has a variety of applications, including text-to-speech (TTS), music generation, virtual assistant, and digital content creation. In recent years, deep neural network has obtained noticeable successes for synthesizing raw audio in high-fidelity speech and music generation. One of the most successful examples are autoregressive models (e.g., WaveNet). However, they sequentially generate high temporal resolution of raw waveform (e.g., 24 kHz) at synthesis, which are prohibitively slow for real-time applications. 

Many researchers from various organizations have spent considerable effort to develop parallel generative models for raw audio. Parallel WaveNet and ClariNet could generate high-fidelity audio in parallel, but they require distillation from a pretrained autoregressive model and a set of auxiliary losses for training, which complicates the training pipeline and increases the cost of development. GAN-based model can be trained from scratch, but it provides inferior audio fidelity than WaveNet. WaveGlow can be trained directly with maximum likelihood, but the model has huge number of parameters (e.g., 88M parameters) to reach the comparable fidelity of audio as WaveNet.

Today, we’re excited to announce WaveFlow (paper, audio samples), the latest milestone of audio synthesis research at Baidu. It features: 1) high-fidelity & ultra-fast audio synthesis, 2) simple likelihood-based training, and 3) small memory footprint, which could not be achieved simultaneously in previous work. Our small-footprint model (5.91M parameters) can synthesize high-fidelity speech (MOS: 4.32) more than 40x faster than real-time on a Nvidia V100 GPU. WaveFlow also provides a unified view of likelihood-models for raw audio, which includes both WaveNet and WaveGlow as special cases and allow us to explicitly trade inference parallelism for model capacity.

Our paper will be presented at ICML 2020.
For more details of WaveFlow, please check out our paper: https://arxiv.org/abs/1912.01219
Audio samples are in:  https://waveflow-demo.github.io/
The implementation can be accessed in Parakeet, which is a text-to-speech toolkit building on PaddlePaddle:  https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/waveflow

# Remove blank lines from the crawled text.
with open('article.txt', 'r', encoding='utf-8') as fr, open('article2.txt', 'w', encoding='utf-8') as fd:
    for text in fr.readlines():
        if text.split():
            fd.write(text)
print('Finished removing blank lines...')
Finished removing blank lines...

# Insert a line break after each period so that each sentence sits on its own line.
with open('article2.txt', 'r', encoding='utf-8') as fr, open('article3.txt', 'w', encoding='utf-8') as fd:
    for text in fr.readlines():
        text = text.replace('.', '.\n')
        fd.write(text)
print('Finished splitting sentences onto lines...')

Note: because Parakeet's pretrained models were trained on short sentences, the txt file still needs some manual tidying to guarantee good synthesis quality; the final edited result is in article3.txt.
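
As a sketch of automating part of that cleanup, NLTK's punkt tokenizer (downloaded in Step 1) can put one short sentence per line; article3_auto.txt is a hypothetical output name here, and the published article3.txt was still finished by hand:

from nltk.tokenize import sent_tokenize

with open('article2.txt', encoding='utf-8') as fr, \
        open('article3_auto.txt', 'w', encoding='utf-8') as fd:
    text = fr.read().replace('\n', ' ')   # rejoin the wrapped lines
    for sent in sent_tokenize(text):
        fd.write(sent.strip() + '\n')     # one short sentence per line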

Parsing a PDF article

Here pdfminer is used to parse the PDF (note: this works for ordinary PDFs; PDFs that cannot be parsed must be converted to images and recognized with OCR instead). Also note that under Python 3 the package to install is pdfminer3k.

In this example, the PDF of the paper "WaveFlow: A Compact Flow-based Model for Raw Audio" (renamed waveflow.pdf after download) is parsed and its abstract extracted, ready for the text-to-speech (TTS) step that follows.

References:

  • Parsing PDFs in Python with pdfminer
  • Removing blank lines from a text file in Python

import urllib
import importlib, sys
importlib.reload(sys)
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfdevice import PDFDevice
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed


def parse(DataIO, save_path):

    # Create a PDF parser from the file object.
    parser = PDFParser(DataIO)
    # Create a PDF document.
    doc = PDFDocument()
    # Connect the parser and the document to each other.
    parser.set_document(doc)
    doc.set_parser(parser)
    # Supply the password for initialization; defaults to empty if absent.
    doc.initialize()
    # Check whether the document allows text extraction; abort if not.
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource manager for shared resources.
        rsrcmagr = PDFResourceManager()
        # Layout analysis parameters.
        laparams = LAParams()
        # Aggregate the resource manager and the device object.
        device = PDFPageAggregator(rsrcmagr, laparams=laparams)
        # Create a PDF interpreter.
        interpreter = PDFPageInterpreter(rsrcmagr, device)

        # Iterate over the page list returned by doc.get_pages(),
        # processing one page at a time.
        for page in doc.get_pages():
            interpreter.process_page(page)
            # Receive the LTPage object for this page.
            layout = device.get_result()
            # The layout is an LTPage holding the objects parsed from the page,
            # typically LTTextBox, LTFigure, LTImage, LTTextBoxHorizontal, etc.
            # To get the text, read each object's text attribute.
            for x in layout:
                try:
                    if isinstance(x, LTTextBoxHorizontal):
                        with open('%s' % (save_path), 'a') as f:
                            result = x.get_text()
                            print(result)
                            f.write(result + "\n")
                except:
                    print("Failed")

# Parse the local PDF and save the text to a local TXT file.
with open('waveflow.pdf', 'rb') as pdf_html:
    parse(pdf_html, 'pdf2text_output.txt')
# Keep lines 60-86 of the extracted text (the abstract) and drop blank lines.
with open('pdf2text_output.txt', 'r', encoding='utf-8') as fr, open('abstract.txt', 'w', encoding='utf-8') as fd:
    for text in fr.readlines()[60:86]:
        if text.split():
            fd.write(text)
            print(text)
    print('Abstract printed')
Abstract
In this work, we propose WaveFlow, a small-
footprint generative flow for raw audio, which
is directly trained with maximum likelihood. It
handles the long-range structure of 1-D wave-
form with a dilated 2-D convolutional architec-
ture, while modeling the local variations using
expressive autoregressive functions. WaveFlow
provides a unified view of likelihood-based mod-
els for 1-D data, including WaveNet and Wave-
Glow as special cases. It generates high-fidelity
speech as WaveNet, while synthesizing several
orders of magnitude faster as it only requires a
few sequential steps to generate very long wave-
forms with hundreds of thousands of time-steps.
Furthermore, it can significantly reduce the likeli-
hood gap that has existed between autoregressive
models and flow-based models for efficient syn-
thesis. Finally, our small-footprint WaveFlow has
only 5.91M parameters, which is 15× smaller
than WaveGlow. It can generate 22.05 kHz high-
fidelity audio 42.6× faster than real-time (at a rate
of 939.3 kHz) on a V100 GPU without engineered
inference kernels.
Abstract printed

Note: for good synthesis quality, the hyphens at line breaks in the paper must be handled manually; the final edited result is in abstract.txt.
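
One way to automate that hyphen fix is a regex pass over the freshly extracted abstract.txt; abstract_auto.txt is a hypothetical output name, and the substitution is naive (true compounds such as "small-footprint" lose their hyphen too), which is why the published abstract.txt was still corrected by hand:

import re

with open('abstract.txt', encoding='utf-8') as f:
    text = f.read()

text = re.sub(r'-\n', '', text)               # rejoin words split across lines
text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)  # turn remaining breaks into spaces

with open('abstract_auto.txt', 'w', encoding='utf-8') as f:
    f.write(text)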

OCR: recognizing English sentences in images

Make a small change to the following part of the main() function in PaddleOCR/tools/infer/predict_system.py so that it outputs only the recognized text, which is easier to inspect:

        drop_score = 0.5
        dt_num = len(dt_boxes)
        for dno in range(dt_num):
            text, score = rec_res[dno]
            if score >= drop_score:
                # Print only the text and append it to a txt file.
                # text_str = "%s, %.3f" % (text, score)
                with open('../ocr_text.txt', 'a') as f:
                    text_str = "%s" % (text)
                    f.write(text_str + "\n")
                print(text_str)
%cd /home/aistudio/PaddleOCR
/home/aistudio/PaddleOCR

# Grab a few images of English quotes.
!wget https://quotefancy.com/media/wallpaper/3840x2160/50594-Francis-Bacon-Quote-Knowledge-is-power.jpg --no-check-certificate
!wget https://www.quotemaster.org/images/24/2423b4151b7283c4570e2967fbf022cf.jpg
!wget https://www.promptaconsultinggroup.com/wp-content/uploads/2018/10/Focus-on-Results.jpg
!wget https://quotefancy.com/media/wallpaper/1600x900/50583-Francis-Bacon-Quote-Knowledge-is-power.jpg --no-check-certificate
!wget https://quotefancy.com/media/wallpaper/3840x2160/2347129-William-Shakespeare-Quote-To-be-or-not-to-be-that-is-the-question.jpg --no-check-certificate
--2020-08-02 19:40:58--  https://www.promptaconsultinggroup.com/wp-content/uploads/2018/10/Focus-on-Results.jpg
Resolving www.promptaconsultinggroup.com (www.promptaconsultinggroup.com)... 67.43.226.3
Connecting to www.promptaconsultinggroup.com (www.promptaconsultinggroup.com)|67.43.226.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 883254 (863K) [image/jpeg]
Saving to: ‘Focus-on-Results.jpg’

Focus-on-Results.jp 100%[===================>] 862.55K  11.6KB/s    in 72s     

2020-08-02 19:42:14 (12.0 KB/s) - ‘Focus-on-Results.jpg’ saved [883254/883254]

!python tools/infer/predict_system.py \
    --image_dir="50594-Francis-Bacon-Quote-Knowledge-is-power.jpg" \
    --det_model_dir="./inference/ch_det_r50_vd_db/" \
    --rec_model_dir="./inference/ch_rec_r34_vd_crnn_enhance/" \
    --use_space_char=True
dt_boxes num : 6, elapse : 0.02082991600036621
rec_res num  : 6, elapse : 0.019023895263671875
Predict time of 50594-Francis-Bacon-Quote-Knowledge-is-power.jpg: 0.097s
Knowledge
is
power
Francis
Bacon
quotefancy
The visualized image saved in ./inference_results/50594-Francis-Bacon-Quote-Knowledge-is-power.jpg

OCR recognition result:

Step 3: Text-to-speech

In this step, the sample script Parakeet/examples/fastspeech/synthesis.py needs to be modified. The key change is to replace the fixed test-sentence input with reading a txt file line by line and synthesizing speech for each line. The changes to the synthesis() function are shown below; see synthesis.py for the complete modification.

def synthesis(args):
    local_rank = dg.parallel.Env().local_rank
    place = (fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace())
    fluid.enable_dygraph(place)

    with open(args.config) as f:
        cfg = yaml.load(f, Loader=yaml.Loader)

    if not os.path.exists(args.output):
        os.mkdir(args.output)

    writer = SummaryWriter(os.path.join(args.output, 'log'))

    model = FastSpeech(cfg['network'], num_mels=cfg['audio']['num_mels'])
    # Load parameters.
    global_step = io.load_parameters(
        model=model, checkpoint_path=args.checkpoint)
    model.eval()
    # Read the txt file line by line and synthesize speech for each line.
    for i, line in enumerate(open(args.text_input)):
        text_input = line
        text = np.asarray(text_to_sequence(text_input))
        text = np.expand_dims(text, axis=0)
        pos_text = np.arange(1, text.shape[1] + 1)
        pos_text = np.expand_dims(pos_text, axis=0)

        text = dg.to_variable(text).astype(np.int64)
        pos_text = dg.to_variable(pos_text).astype(np.int64)

        _, mel_output_postnet = model(text, pos_text, alpha=args.alpha)

        if args.vocoder == 'griffin-lim':
            # Synthesize with Griffin-Lim.
            wav = synthesis_with_griffinlim(mel_output_postnet, cfg['audio'])
        elif args.vocoder == 'waveflow':
            wav = synthesis_with_waveflow(mel_output_postnet, args,
                                          args.checkpoint_vocoder, place)
        else:
            print(
                'vocoder error, we only support griffinlim and waveflow, but received %s.'
                % args.vocoder)

        writer.add_audio(text_input + '(' + args.vocoder + ')', wav, 0,
                         cfg['audio']['sr'])
        if not os.path.exists(os.path.join(args.output, 'samples')):
            os.mkdir(os.path.join(args.output, 'samples'))
        write(
            os.path.join(
                os.path.join(args.output, 'samples'), args.vocoder + str(i) + '.wav'),
            cfg['audio']['sr'], wav)
    print("Synthesis completed !!!")
    # Close the writer only after all lines have been synthesized.
    writer.close()
%env CUDA_VISIBLE_DEVICES=0
env: CUDA_VISIBLE_DEVICES=0

%cd /home/aistudio/Parakeet/examples/fastspeech
/home/aistudio/Parakeet/examples/fastspeech

Reading the HTML article with WaveFlow as the vocoder

!python synthesis.py \
    --use_gpu=1 \
    --alpha=1.0 \
    --checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
    --config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
    --output='./synthesis' \
    --vocoder='waveflow' \
    --config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
    --checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000' \
    --text_input='/home/aistudio/article3.txt'
{'alpha': 1.0,
 'checkpoint': './fastspeech_ljspeech_ckpt_1.0/step-162000',
 'checkpoint_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/step-2000000',
 'config': './fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml',
 'config_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml',
 'output': './synthesis',
 'text_input': '/home/aistudio/article3.txt',
 'use_gpu': 1,
 'vocoder': 'waveflow'}

Check the text-to-speech result

The generated TTS audio is saved in the Parakeet/examples/fastspeech/synthesis/samples folder; pick a few clips to verify the result.

import IPython
IPython.display.Audio('synthesis/samples/waveflow3.wav')

Merge the generated audio files with ffmpeg

Because the audio files were generated by scanning the text line by line, they must be concatenated in order to hear the complete article.

Before concatenating with ffmpeg, prepare a list.txt file in this format:

file 'path/to/file1'
file 'path/to/file2'
file 'path/to/file3'

Then run ffmpeg -f concat -i list.txt -c copy "outputfile" to perform the concatenation.

# Generate the list file for ffmpeg's concat demuxer.
for i, line in enumerate(open('/home/aistudio/article3.txt')):
    with open('waveflow_article3.txt', 'a') as f:
        result = 'file synthesis/samples/waveflow' + str(i) + '.wav'
        f.write(result + "\n")
# Concatenate the audio clips in order.
!ffmpeg -f concat -i waveflow_article3.txt -c copy 'waveflow_article3.wav'
ffmpeg version 2.8.15-0ubuntu0.16.04.1 Copyright (c) 2000-2018 the FFmpeg developers
  built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.10) 20160609
  configuration: --prefix=/usr --extra-version=0ubuntu0.16.04.1 --build-suffix=-ffmpeg --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --cc=cc --cxx=g++ --enable-gpl --enable-shared --disable-stripping --disable-decoder=libopenjpeg --disable-decoder=libschroedinger --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmodplug --enable-libmp3lame --enable-libopenjpeg --enable-libopus --enable-libpulse --enable-librtmp --enable-libschroedinger --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxvid --enable-libzvbi --enable-openal --enable-opengl --enable-x11grab --enable-libdc1394 --enable-libiec61883 --enable-libzmq --enable-frei0r --enable-libx264 --enable-libopencv
  libavutil      54. 31.100 / 54. 31.100
  libavcodec     56. 60.100 / 56. 60.100
  libavformat    56. 40.101 / 56. 40.101
  libavdevice    56.  4.100 / 56.  4.100
  libavfilter     5. 40.101 /  5. 40.101
  libavresample   2.  1.  0 /  2.  1.  0
  libswscale      3.  1.101 /  3.  1.101
  libswresample   1.  2.101 /  1.  2.101
  libpostproc    53.  3.100 / 53.  3.100
Guessed Channel Layout for  Input Stream #0.0 : mono
Input #0, concat, from 'waveflow_article3.txt':
  Duration: N/A, start: 0.000000, bitrate: 705 kb/s
    Stream #0:0: Audio: pcm_f32le ([3][0][0][0] / 0x0003), 22050 Hz, 1 channels, flt, 705 kb/s
Output #0, wav, to 'waveflow_article3.wav':
  Metadata:
    ISFT            : Lavf56.40.101
    Stream #0:0: Audio: pcm_f32le ([3][0][0][0] / 0x0003), 22050 Hz, mono, 705 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (copy)
Press [q] to stop, [?] for help
size=   16235kB time=00:03:08.49 bitrate= 705.6kbits/s    
video:0kB audio:16235kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000686%

Reading the HTML article with the Griffin-Lim algorithm as the vocoder

# --vocoder is not passed here; as the printed args below show, it defaults to griffin-lim.
!python synthesis.py \
    --use_gpu=1 \
    --alpha=1.0 \
    --checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
    --config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
    --output='./synthesis' \
    --text_input='/home/aistudio/article3.txt'
{'alpha': 1.0,
 'checkpoint': './fastspeech_ljspeech_ckpt_1.0/step-162000',
 'checkpoint_vocoder': None,
 'config': './fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml',
 'config_vocoder': None,
 'output': './synthesis',
 'text_input': '/home/aistudio/article3.txt',
 'use_gpu': 1,
 'vocoder': 'griffin-lim'}

Check the text-to-speech result

import IPython
IPython.display.Audio('synthesis/samples/griffin-lim3.wav')

Merge the generated audio files with ffmpeg

# Generate the list file.
for i, line in enumerate(open('/home/aistudio/article3.txt')):
    with open('griffin-lim_article3.txt', 'a') as f:
        result = 'file synthesis/samples/griffin-lim' + str(i) + '.wav'
        f.write(result + "\n")
# Concatenate the audio clips.
!ffmpeg -f concat -i griffin-lim_article3.txt -c copy 'griffin-lim_article3.wav'

TTS for the paper abstract and the OCR text

The TTS procedure for abstract.txt and ocr_text.txt is exactly the same as for article3.txt above. The only difference is that the audio finally synthesized from the OCR text is small, so it can be played directly in the Notebook.

1. Paper abstract TTS:

!python synthesis.py \
    --use_gpu=1 \
    --alpha=1.0 \
    --checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
    --config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
    --output='./synthesis' \
    --vocoder='waveflow' \
    --config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
    --checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000' \
    --text_input='/home/aistudio/abstract.txt'
# Generate the list file.
for i, line in enumerate(open('/home/aistudio/abstract.txt')):
    with open('waveflow_abstract.txt', 'a') as f:
        result = 'file synthesis/samples/waveflow' + str(i) + '.wav'
        f.write(result + "\n")
# Concatenate the audio clips.
!ffmpeg -f concat -i waveflow_abstract.txt -c copy 'waveflow_abstract.wav'

2. OCR text TTS (Knowledge is Power)

Note: ocr_text.txt contains very little text and has been manually merged into a single line.

!python synthesis.py \
    --use_gpu=1 \
    --alpha=1.0 \
    --checkpoint='./fastspeech_ljspeech_ckpt_1.0/step-162000' \
    --config='./fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
    --output='./synthesis' \
    --vocoder='waveflow' \
    --config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
    --checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000' \
    --text_input='/home/aistudio/ocr_text.txt'
{'alpha': 1.0,
 'checkpoint': './fastspeech_ljspeech_ckpt_1.0/step-162000',
 'checkpoint_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/step-2000000',
 'config': './fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml',
 'config_vocoder': './waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml',
 'output': './synthesis',
 'text_input': '/home/aistudio/ocr_text.txt',
 'use_gpu': 1,
 'vocoder': 'waveflow'}
[checkpoint] Rank 0: loaded model from ./fastspeech_ljspeech_ckpt_1.0/step-162000.pdparams
[checkpoint] Rank 0: loaded model from ./waveflow_res128_ljspeech_ckpt_1.0/step-2000000.pdparams
Synthesis completed !!!

!mv synthesis/samples/waveflow0.wav ./ocr.wav
import IPython
IPython.display.Audio('ocr.wav')

Summary: how can the TTS results be improved further?

1. Find a better automatic layout method. Although this project uses Python to partially clean up the text parsed from HTML and PDF, the final layout pass still had to be done by hand to get good TTS results. The next step is to combine regular expressions and other NLP techniques to automate the reformatting (this appears to be a hard problem across the industry; even the latest Edge browser has layout issues when reading aloud).

2. Parakeet's pretrained models were trained only on the LJSpeech dataset. Training on additional speech datasets could yield richer voice styles and more accurate pronunciation; for the training workflow, see "Parakeet: a hands-on guide to training a speech synthesis model (script task, Notebook)".

3. The English recognition accuracy of PaddleOCR's pretrained models still has room to improve; one option is to train PaddleOCR on more English OCR datasets. (Updates to follow.)

More Resources

The complete project, including all code and text files, is public on AI Studio; feel free to fork it.

https://aistudio.baidu.com/aistudio/projectdetail/676162