YuYin: a multi-task learning model of multi-modal e-commerce background music recommendation

EURASIP Journal on Audio, Speech, and Music Processing

Table 1 Multi-modal datasets created and adopted in existing studies

Dataset	Scale	V	M	T	Content
CFM400 [29]	401	\(\checkmark\)	\(\checkmark\)		Game videos (CrossFire)
HoK400 [29]	427	\(\checkmark\)	\(\checkmark\)		Game videos (Honor of Kings)
UGV [30]	1265	\(\checkmark\)	\(\checkmark\)	\(\checkmark\)	User generated videos
YouCook2 [31]	2000	\(\checkmark\)		\(\checkmark\)	Cooking videos on YouTube
EmoMV [32]	5986	\(\checkmark\)	\(\checkmark\)	\(\checkmark\)	Music videos with emotion label
MSR-VTT [33]	10,000	\(\checkmark\)		\(\checkmark\)	Online videos with caption
TT-150K [34]	150,000	\(\checkmark\)	\(\checkmark\)	\(\checkmark\)	Microvideos on TikTok
HIMV-200K [21]	205,000	\(\checkmark\)	\(\checkmark\)		Music videos on YouTube
YouTube-8M [35]	8,000,000	\(\checkmark\)	\(\checkmark\)	\(\checkmark\)	Videos on YouTube
Commercial-98K	98,071	\(\checkmark\)	\(\checkmark\)	\(\checkmark\)	E-commerce ads

The modalities corresponding to abbreviations in the table are as follows: V, video; M, music; T, text