Show simple item record

dc.contributor.author: Savran Kiziltepe, Rukiye
dc.contributor.author: Gan, John Q.
dc.contributor.author: Escobar Pérez, Juan José
dc.date.accessioned: 2024-02-12T13:36:07Z
dc.date.available: 2024-02-12T13:36:07Z
dc.date.issued: 2024-02-01
dc.identifier.citation: R. S. Kiziltepe, J. Q. Gan and J. J. Escobar, "Integration of Feature and Decision Fusion With Deep Learning Architectures for Video Classification," in IEEE Access, vol. 12, pp. 19432-19446, 2024, doi: 10.1109/ACCESS.2024.3360929
dc.identifier.uri: https://hdl.handle.net/10481/89102
dc.description.abstract: Information fusion is frequently employed to integrate diverse inputs, including sensory data, features, or decisions, in order to exploit the advantageous relationships among various features and classifiers. This paper presents a novel approach to video classification using deep learning architectures, including ConvLSTM and vision transformer based fusion architectures, that combines spatial and temporal features and applies decision fusion at multiple levels. The proposed vision transformer based method uses a 3D CNN to extract spatio-temporal information and different attention mechanisms to focus on the features essential for action recognition, and thus learns spatio-temporal dependencies effectively. The effectiveness of the proposed methods is validated through empirical evaluations on two well-known video classification datasets, UCF-101 and KTH. The experimental findings indicate that using both spatial and temporal features is essential, with the best performance obtained when temporal features serve as the primary feature source in conjunction with two distinct types of spatial features. The multi-level decision fusion approach proposed in this study produces results comparable to those of feature fusion methods while reducing memory requirements and computational cost. The fusion of RGB, HOG, and optical flow representations achieved the best performance among the fusion methods examined in this study. The vision transformer based approaches also significantly outperformed the ConvLSTM based approaches. Furthermore, an ablation study was conducted to compare vision transformer based feature fusion approaches for enhancing video classification performance. [A minimal code sketch contrasting feature-level and decision-level fusion follows the record below.]
dc.description.sponsorship: Ministry of National Education, Turkey
dc.language.iso: eng
dc.publisher: IEEE
dc.rights: Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/
dc.subject: Computer vision
dc.subject: data fusion
dc.subject: deep neural networks
dc.subject: human action recognition
dc.subject: spatio-temporal features
dc.title: Integration of Feature and Decision Fusion with Deep Learning Architectures for Video Classification
dc.type: journal article
dc.rights.accessRights: open access
dc.type.hasVersion: VoR
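
To make the abstract's contrast between feature-level and decision-level fusion concrete, below is a minimal PyTorch sketch. Everything in it is a hypothetical placeholder, not the architecture evaluated in the paper: TinyStream stands in for the paper's much larger backbones, and the feature dimension, channel counts, and clip sizes are illustrative. It shows only the structural difference the abstract draws on: feature fusion concatenates per-stream representations before a single classifier, while decision fusion classifies each stream separately and averages the class scores.

import torch
import torch.nn as nn

# Hypothetical per-stream backbone: a tiny 3D CNN mapping a video clip
# of shape (batch, channels, frames, height, width) to a feature vector.
# Placeholder only; not the authors' architecture.
class TinyStream(nn.Module):
    def __init__(self, in_channels, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # global spatio-temporal pooling
        )
        self.fc = nn.Linear(16, feat_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class FeatureFusion(nn.Module):
    # Feature-level fusion: concatenate per-stream features, classify once.
    def __init__(self, streams, feat_dim, num_classes):
        super().__init__()
        self.streams = nn.ModuleList(streams)
        self.head = nn.Linear(feat_dim * len(streams), num_classes)

    def forward(self, inputs):
        feats = [s(x) for s, x in zip(self.streams, inputs)]
        return self.head(torch.cat(feats, dim=1))

class DecisionFusion(nn.Module):
    # Decision-level fusion: classify each stream, average softmax scores.
    def __init__(self, streams, feat_dim, num_classes):
        super().__init__()
        self.streams = nn.ModuleList(streams)
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in streams]
        )

    def forward(self, inputs):
        scores = [h(s(x)).softmax(dim=1)
                  for s, h, x in zip(self.streams, self.heads, inputs)]
        return torch.stack(scores).mean(dim=0)

if __name__ == "__main__":
    # Three toy modalities: RGB (3 channels), HOG (1), optical flow (2);
    # 101 classes mirrors UCF-101. All sizes are illustrative only.
    rgb, hog, flow = (torch.randn(4, c, 8, 32, 32) for c in (3, 1, 2))
    make = lambda: [TinyStream(3), TinyStream(1), TinyStream(2)]
    print(FeatureFusion(make(), 128, 101)([rgb, hog, flow]).shape)   # (4, 101)
    print(DecisionFusion(make(), 128, 101)([rgb, hog, flow]).shape)  # (4, 101)

Note how the decision-fusion variant carries only small per-stream heads and no joint classifier over the concatenated features, which is consistent with the abstract's observation that multi-level decision fusion trades little accuracy for lower memory and computational cost.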


Files in this item

[PDF]

This item appears in the following collection(s)


Except where otherwise noted, this item's license is described as Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License