Stylized Image Caption Paper Notes

Date: 2019-06-26

Neural Storyteller (Kiros et al. 2015)

NST breaks the task down into two steps: it first generates unstylized captions, then applies style-shift techniques to generate stylized descriptions.

SentiCap: Generating Image Descriptions with Sentiments  (AAAI 2016)

Both the code and the data have been released. (The code uses a rather old framework, so I did not read it.)

Supervised Image Caption

Style: Positive, Negative

Datasets:

MSCOCO

SentiCap Dataset: a dataset collected by the authors (fairly small; Positive: 998 images / 2,873 captions for training, 673 images / 2,019 captions for testing; Negative: 997 images / 2,468 captions for training, 503 images / 1,509 captions for testing), with 3 positive and 3 negative captions per image.

The dataset was built as a caption-rewriting task on objective MSCOCO captions: AMT workers were asked to choose among adjective-noun pairs (ANPs) of the desired sentiment and incorporate one or more of them into any one of the five existing captions.

Evaluation Metrics:

Automatic metrics: BLEU, ROUGE-L, METEOR, CIDEr

Human evaluation

Model

Shortcomings (as pointed out in the StyleNet paper): the approach requires not only paired image-sentiment-caption data but also word-level supervision to emphasize the sentiment words (e.g., the sentiment strength of each word in the sentiment caption), which makes it very expensive and difficult to scale up.

StyleNet: Generating Attractive Visual Captions with Styles (CVPR2017)

The code has not been released, though a third-party PyTorch implementation exists. The FlickrStyle9K dataset has been released (the 1k test split is not public).

Unsupervised (no supervised style-specific image-caption paired data is used): factual image-caption pairs + a stylized language corpus (text only)

Produces attractive, styled visual captions using only a monolingual stylized language corpus (with no paired images) plus standard factual image/video-caption pairs.

Style: Romantic, Humorous

Datasets:

FlickrStyle10K (built on the Flickr30K image-caption dataset: annotators were shown a standard factual caption for an image and asked to revise it to make it romantic or humorous). (Although image / stylized-caption pairs exist here, the authors did not use these paired data during training; training uses image / factual-caption pairs + stylized text corpora. The image / stylized-caption pairs are only used at evaluation time, as ground truth.)

Evaluation Metrics:

Automatic metrics: BLEU, METEOR, ROUGE, CIDEr

Human evaluation

Model

Key points:

1. The LSTM input weight matrix Wx is factored into three terms, Ux, Sx, Vx. All LSTM parameters in the model are shared except S; the style-specific matrix S is used to memorize a particular style.

2. Similar to multi-task sequence-to-sequence training. First task: train to generate factual captions given the paired images, updating all parameters. Second task: the factored LSTM is trained as a language model on the stylized corpus, updating only SR or SH (see the sketch below).
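A minimal PyTorch sketch of the factored LSTM idea (my reconstruction under assumed shapes and names, not the released code): the input-to-hidden weights are factored as Wx = Ux·Sx·Vx, where Ux and Vx are shared across styles and each style keeps its own factor Sx.

```python
import torch
import torch.nn as nn

class FactoredLSTMCell(nn.Module):
    """Factored LSTM sketch: Wx = U @ S[style] @ V, with U and V shared."""
    def __init__(self, embed_dim, hidden_dim, factor_dim,
                 styles=("factual", "romantic", "humorous")):
        super().__init__()
        # Shared factors, stacked for all four gates (i, f, o, g).
        self.U = nn.Parameter(torch.randn(4 * hidden_dim, factor_dim) * 0.01)
        self.V = nn.Parameter(torch.randn(factor_dim, embed_dim) * 0.01)
        # One style-specific factor per style (SR, SH, ... in the paper).
        self.S = nn.ParameterDict({
            s: nn.Parameter(torch.randn(factor_dim, factor_dim) * 0.01)
            for s in styles
        })
        # Hidden-to-hidden weights and bias are shared across styles.
        self.W_h = nn.Parameter(torch.randn(4 * hidden_dim, hidden_dim) * 0.01)
        self.bias = nn.Parameter(torch.zeros(4 * hidden_dim))

    def forward(self, x, state, style):
        h, c = state
        W_x = self.U @ self.S[style] @ self.V   # style-dependent input weights
        gates = x @ W_x.t() + h @ self.W_h.t() + self.bias
        i, f, o, g = gates.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```

In the second (language-model) stage, one would freeze everything except the chosen style factor, e.g. set p.requires_grad = False for every parameter whose name does not start with "S.romantic".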

“Factual” and “Emotional”: Stylized Image Captioning with Adaptive Learning and Attention  (ECCV 2018)

Style-factual LSTM block: style matrices Sx, Sh plus adaptive gates gxt, ght (see the sketch after this list)

Two-stage learning strategy

MLE loss + KL divergence
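Since the notes above are terse, here is a hedged sketch of how such a style-factual block could combine factual weights (Wx, Wh) and style weights (Sx, Sh) through the adaptive gates gxt, ght. This is my reading of the idea, not the authors' code; all shapes and the gate parameterization are assumptions. The training objective then combines an MLE term with the KL-divergence term mentioned above.

```python
import torch
import torch.nn as nn

class StyleFactualPreActivations(nn.Module):
    """Blend factual (W) and style (S) transforms with adaptive gates."""
    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        self.Wx = nn.Linear(embed_dim, 4 * hidden_dim)    # factual input weights
        self.Sx = nn.Linear(embed_dim, 4 * hidden_dim)    # style input weights
        self.Wh = nn.Linear(hidden_dim, 4 * hidden_dim)   # factual hidden weights
        self.Sh = nn.Linear(hidden_dim, 4 * hidden_dim)   # style hidden weights
        self.gx = nn.Linear(embed_dim + hidden_dim, 1)    # adaptive gate g_xt
        self.gh = nn.Linear(embed_dim + hidden_dim, 1)    # adaptive gate g_ht

    def forward(self, x, h):
        ctx = torch.cat([x, h], dim=-1)
        g_xt = torch.sigmoid(self.gx(ctx))                # in (0, 1)
        g_ht = torch.sigmoid(self.gh(ctx))
        # Each LSTM gate pre-activation is a gated mix of factual and style parts.
        return (g_xt * self.Sx(x) + (1 - g_xt) * self.Wx(x)
                + g_ht * self.Sh(h) + (1 - g_ht) * self.Wh(h))
```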

Image Captioning at Will: A Versatile Scheme for Effectively Injecting Sentiments into Image Descriptions   (Preprint 30 Jan 2018)

SENTI-ATTEND: Image Captioning using Sentiment and Attention  (Preprint 24 Nov 2018)

This paper can be viewed as follow-up work to SentiCap, and it likewise takes a supervised approach.

Datasets

MS COCO: used to generate generic image captions

SentiCap dataset:

Evaluation Metrics

standard image caption evaluation metrics: BLEU, ROUGE-L, METEOR, CIDEr, SPICE

Entropy

Model

Loss function:

The paper does not release code; the experiments compare against SentiCap and Image Captioning at Will.

Question: the SentiCap dataset is very small; is training with a cross-entropy loss on its image-caption pairs really effective?

The LSTM takes two additional inputs, E1 and E2, and at each step ht is used to predict the sentiment s; this operation also appears in SentiCap. The paper has remained in preprint status.
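For concreteness, a small sketch of the decoder step described above (my reconstruction; the names, dimensions, and heads are assumptions, not the paper's code): the LSTM consumes two extra embeddings E1 and E2 alongside the word embedding, and ht feeds an extra head that predicts the sentiment s at every step.

```python
import torch
import torch.nn as nn

class SentimentDecoderCell(nn.Module):
    """LSTM step with two extra inputs (E1, E2) and a per-step sentiment head."""
    def __init__(self, embed_dim, senti_dim, hidden_dim, vocab_size, n_sentiments=2):
        super().__init__()
        self.cell = nn.LSTMCell(embed_dim + 2 * senti_dim, hidden_dim)
        self.word_head = nn.Linear(hidden_dim, vocab_size)      # next-word logits
        self.senti_head = nn.Linear(hidden_dim, n_sentiments)   # predict s from h_t

    def forward(self, w_emb, e1, e2, state):
        h, c = self.cell(torch.cat([w_emb, e1, e2], dim=-1), state)
        return self.word_head(h), self.senti_head(h), (h, c)
```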

SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text  (CVPR 2018)

Partial code and data have been released.

Style: Story

Learns from existing image-caption datasets containing only factual descriptions, plus a large set of styled texts without aligned images

Two-stage training strategy for the term generator and the language generator (sketched below)
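A high-level sketch of how the two components compose at inference time (placeholder function names, not the released code): the term generator maps an image to an ordered sequence of semantic terms, and the language generator, trained on the unaligned styled text, realizes those terms as a styled sentence.

```python
def stylized_caption(image, term_generator, language_generator):
    terms = term_generator(image)        # e.g. ["dog_NOUN", "beach_NOUN", ...]
    return language_generator(terms)     # styled sentence realizing the terms
```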

Dataset:

Descriptive Image Captions: MSCOCO

Styled text: BookCorpus

Evaluation:

Automatic relevance metrics: Widely-used captioning metrics (BLEU, METEOR, CIDEr, SPICE)

Automatic style metrics (proposed by the authors): LM (4-gram language model), GRULM (GRU language model), CLF (binary classifier)

Human evaluations of relevance and style

 

Unsupervised Stylish Image Description Generation via Domain Layer Norm (AAAI 2019)

Unsupervised Image Caption

Four different styles: fairy tale, romance, humor, and country song lyrics ("lyrics" for short)

The model is jointly trained on a paired unstylized image-description corpus (the source domain) and a monolingual corpus of the specific style (the target domain).

Neither the code nor the datasets have been released.

Datasets:

Source domain: VG-Para (Krause et al. 2017)

Target domain: BookCorpus (humor and romance), plus country song lyrics and fairy tales collected by the authors

Evaluation Metrics:

Metrics of semantic relevance: the authors' own p and r metrics, plus SPICE

Metrics of Stylishness: transfer accuracy

Human evaluation

Key Points of the Approach

EI and ET map images and target-style descriptions, respectively, into a shared latent space. GS generates unstylized descriptions, i.e., sentences from the source domain, so EI combined with GS is the classic encoder-decoder image captioning model, trained on supervised image-caption pairs. GT generates stylized descriptions: ET encodes a stylized sentence into the latent space Z, and GT regenerates the stylized sentence from the latent code zT (reconstruction), trained only on stylized sentences. Once training is complete, combining EI with GT yields stylized image descriptions.

Key point 1: the authors assume there exists a latent space Z into which images, unstylized source descriptions, and stylized target descriptions can all be mapped.

Key point 2: GS and GT differ only in their layer-normalization parameters; everything else is shared. That is, GS and GT share the same LN-LSTM, within which only the parameters {gS, bS} and {gT, bT} differ; the authors call this mechanism Domain Layer Norm (DLN). The layer norm operation is applied to each gate of the LSTM (input gate, forget gate, output gate).
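A minimal sketch of the DLN idea (my illustration, assuming a simplified LN-LSTM, not the authors' unreleased code; for brevity it normalizes all gate pre-activations in one shot, whereas the notes above describe LN applied per gate): only the gain/bias pairs {gS, bS} and {gT, bT} are domain-specific, while every other weight is shared.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainLayerNorm(nn.Module):
    """Layer norm with shared statistics but domain-specific gain/bias."""
    def __init__(self, dim, domains=("source", "target")):
        super().__init__()
        self.gain = nn.ParameterDict({d: nn.Parameter(torch.ones(dim)) for d in domains})
        self.bias = nn.ParameterDict({d: nn.Parameter(torch.zeros(dim)) for d in domains})

    def forward(self, x, domain):
        x = F.layer_norm(x, x.shape[-1:])   # normalize to zero mean, unit variance
        return self.gain[domain] * x + self.bias[domain]

class DLNLSTMCell(nn.Module):
    """LSTM cell whose gate pre-activations pass through DomainLayerNorm."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.W = nn.Linear(input_dim + hidden_dim, 4 * hidden_dim)  # shared weights
        self.dln = DomainLayerNorm(4 * hidden_dim)

    def forward(self, x, state, domain):
        h, c = state
        gates = self.dln(self.W(torch.cat([x, h], dim=-1)), domain)
        i, f, o, g = gates.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```

With this sharing scheme, switching the `domain` argument from "source" to "target" is all it takes to move the shared decoder between GS and GT.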

Original post: https://www.cnblogs.com/czhwust/p/stylizedimagecaption.html