Application of Vision Transformers in Online Advertisement Identification

dc.contributor.author Liyanage, C.R.
dc.contributor.author Madushika, M.K.S.
dc.contributor.author Nawarathna, R.D.
dc.date.accessioned 2022-04-22T04:00:35Z
dc.date.available 2022-04-22T04:00:35Z
dc.date.issued 2022-03-02
dc.identifier.citation Liyanage, C. R., Madushika, M. K. S., & Nawarathna, R. D. (2022). Application of Vision Transformers in Online Advertisement Identification. 19th Academic Sessions, University of Ruhuna, Matara, Sri Lanka. 13.
dc.identifier.issn 2362-0412
dc.identifier.uri http://ir.lib.ruh.ac.lk/xmlui/handle/iruor/5709
dc.description.abstract Advertisements (ads) play an important role in many sectors, such as business, education, and government, as they can influence the cultural and religious aspects of a society by disseminating important messages to people. Image-based advertisements are generally more creative than, and distinct from, other images, as they contain slogans explaining the message of the ad, symbolic and atypical objects, and unusual placements of objects within an image. Distinguishing advertisements from other images on digital media is important both for capturing customer attention and for blocking ads from websites. This study proposes a supervised learning approach to classify images as ads or non-ads. A further objective of this study is to verify the applicability of Vision Transformers (ViT) in the domain of image-based ad analysis. ViT is a novel image classification architecture that, as an alternative to the Convolutional Neural Network (CNN), divides images into patches and processes them with a technique called multi-head self-attention. The experiment was conducted on 19,700 images labelled as ad or non-ad. Two ViT models with different patch sizes, both pre-trained on the ImageNet-21k dataset, were used for classification. The models were trained with a batch size of 10 for a maximum of 20 epochs. The dataset was split into training and testing sets, and a validation split of 0.2 was applied to the training set. The highest validation accuracy of 82% was achieved by the model with 32×32 patches; during its testing phase, the same model achieved an accuracy of 84%, precision of 85%, and recall of 84%. The results of this study were compared with state-of-the-art CNN-based research. The study demonstrates that the ViT architecture can achieve comparable results even with limited computational resources. en_US
dc.language.iso en en_US
dc.publisher University of Ruhuna, Matara, Sri Lanka en_US
dc.subject Advertisements en_US
dc.subject Classification en_US
dc.subject Vision Transformers en_US
dc.title Application of Vision Transformers in Online Advertisement Identification en_US
dc.type Article en_US
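
The following is a minimal sketch, not the authors' published code, of the fine-tuning setup described in the abstract: a ViT pre-trained on ImageNet-21k adapted for binary ad / non-ad classification. The Hugging Face transformers and torchvision libraries, the checkpoint name, the dataset path, and the folder layout are illustrative assumptions; the batch size of 10, the 20-epoch cap, the 0.2 validation split, and the 32x32 patch size follow the abstract.

    # Minimal fine-tuning sketch; framework, paths, and layout are assumptions.
    import torch
    from torch.utils.data import DataLoader, random_split
    from torchvision import datasets, transforms
    from transformers import ViTForImageClassification

    # ViT-Base with 32x32 patches, pre-trained on ImageNet-21k; a fresh
    # two-class head is attached for the ad / non-ad task.
    model = ViTForImageClassification.from_pretrained(
        "google/vit-base-patch32-224-in21k", num_labels=2
    )

    # Resize and normalise images to the 224x224 input the checkpoint expects.
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
    ])

    # Hypothetical layout: data/train/ad/*.jpg and data/train/not_ad/*.jpg
    full_train = datasets.ImageFolder("data/train", transform=preprocess)
    n_val = int(0.2 * len(full_train))              # 0.2 validation split
    train_set, val_set = random_split(
        full_train, [len(full_train) - n_val, n_val]
    )
    train_loader = DataLoader(train_set, batch_size=10, shuffle=True)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    for epoch in range(20):                         # maximum of 20 epochs
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            out = model(pixel_values=images, labels=labels)  # built-in CE loss
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()

Swapping the checkpoint for google/vit-base-patch16-224-in21k would give the second model with 16x16 patches; accuracy on the held-out val_set would decide which epoch's weights to keep before the 20-epoch cap.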

