Table 1 Multi-modal datasets created and adopted in existing studies

From: YuYin: a multi-task learning model of multi-modal e-commerce background music recommendation

| Dataset | Scale | Modal: V | Modal: M | Modal: T | Content |
| --- | --- | --- | --- | --- | --- |
| CFM400 [29] | 401 | \(\checkmark\) | \(\checkmark\) | | Game videos (CrossFire) |
| HoK400 [29] | 427 | \(\checkmark\) | \(\checkmark\) | | Game videos (Honor of Kings) |
| UGV [30] | 1,265 | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | User-generated videos |
| YouCook2 [31] | 2,000 | \(\checkmark\) | | \(\checkmark\) | Cooking videos on YouTube |
| EmoMV [32] | 5,986 | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | Music videos with emotion labels |
| MSR-VTT [33] | 10,000 | \(\checkmark\) | | \(\checkmark\) | Online videos with captions |
| TT-150K [34] | 150,000 | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | Microvideos on TikTok |
| HIMV-200K [21] | 205,000 | \(\checkmark\) | \(\checkmark\) | | Music videos on YouTube |
| YouTube-8M [35] | 8,000,000 | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | Videos on YouTube |
| Commercial-98K | 98,071 | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | E-commerce ads |

  1. The modality abbreviations used in the table are: V, video; M, music; T, text.