A Comparison of Embedding Models in Semantic Similarity Tasks for Software Engineering

Vinícius Henrique Ferraz Lima

Supervisor: Arthur Pilone Maia da Silva

Capstone Project Report - University of São Paulo, 2025

Co-supervisor

Prof. Paulo Roberto Miranda Meirelles

Institution

University of São Paulo
Institute of Mathematics and Statistics
Bachelor of Computer Science

Abstract

Embedding models have been widely used in Natural Language Processing (NLP) tasks to represent texts in vector spaces that capture their semantic relationships. Despite the emergence of different architectures and training techniques, it is often presumed that these models perform similarly on certain tasks. This work investigates that hypothesis through a systematic comparison of different embedding models applied to semantic textual similarity tasks. From the experiments conducted, we seek to understand whether the differences between widely used commercial models matter for practical applications. The results contribute to a better understanding of the real impact of embedding model choice in NLP applications.

Keywords: Embeddings. Semantic Similarity. Natural Language Processing. Vector Representation of Text. Model Evaluation.

Introduction

In recent years, advances in Natural Language Processing (NLP) and Artificial Intelligence (AI) have transformed the way researchers and industry professionals analyze, retrieve, and relate textual information. Central to these advances are embedding models, which are mathematical representations that encode the semantic meaning of text into numerical vectors. These representations serve as the backbone of modern retrieval, classification, and generation systems, enabling more semantically driven interpretations of human language than traditional keyword-based or statistical methods.


As embedding models have matured, a growing body of research in software engineering has adopted them to address complex tasks such as duplicate bug report detection, vulnerability discovery, source code analysis, and requirements traceability. These studies consistently demonstrate that embeddings capture semantic and contextual relationships that would be difficult or impossible to model through handcrafted rules or surface-level text matching.


This work investigates whether the choice of embedding model significantly influences the performance of semantic similarity tasks in software engineering, specifically in the context of matching user reviews with development issues. Through a systematic comparison of five embedding models (Jina, OpenAI Large, OpenAI Small, Gemini, and Cohere) across four software projects, we address the fundamental question: to what extent does the choice of embedding model actually influence the final outcomes of a given task?

Objectives

This research project aimed to investigate the use of embedding models in semantic similarity tasks for software engineering, specifically focusing on matching user reviews with development issues. The main objectives were:

  • Compare the performance of five different embedding models (Jina, OpenAI Large, OpenAI Small, Gemini, and Cohere) in semantic textual similarity tasks
  • Evaluate how these models perform across different software projects and relevance levels
  • Understand whether differences between widely-used embedding models are relevant for practical applications
  • Analyze the semantic space organization through visualization techniques (t-SNE)
  • Provide empirical evidence on model selection for semantic similarity tasks in software engineering

Concepts

Embedding Models

Embedding models are mathematical representations that encode the semantic meaning of text into numerical vectors. These models transform text into dense vector representations in a high-dimensional space where semantically similar texts are located close to one another. This enables machine learning algorithms to understand semantic relationships and perform tasks such as similarity computation, retrieval, and classification.
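As a minimal sketch of this idea, using hand-picked toy vectors rather than the output of any real model (a production embedding has hundreds or thousands of dimensions), nearness in the vector space can be checked with plain NumPy:

```python
import numpy as np

# Hypothetical 4-dimensional "embeddings" of three short texts.
# Real models produce much higher-dimensional vectors.
embeddings = {
    "app crashes on startup":    np.array([0.90, 0.10, 0.00, 0.20]),
    "application fails to open": np.array([0.85, 0.15, 0.05, 0.25]),
    "great color theme":         np.array([0.10, 0.90, 0.80, 0.00]),
}

def nearest(query: str) -> str:
    """Return the closest other text by Euclidean distance."""
    q = embeddings[query]
    others = {t: v for t, v in embeddings.items() if t != query}
    return min(others, key=lambda t: np.linalg.norm(others[t] - q))

print(nearest("app crashes on startup"))  # the paraphrase, not the unrelated review
```

Because the two crash-related texts were assigned nearly identical vectors, the nearest neighbor is the paraphrase rather than the unrelated review, mirroring how a trained model places semantically similar texts close together.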


Semantic Textual Similarity (STS)

Semantic Textual Similarity is a task that measures how semantically similar two pieces of text are, typically on a scale from 0 (completely unrelated) to 1 (semantically equivalent). In software engineering, STS can be used to match user reviews with development issues, identify duplicate bug reports, or find related documentation.
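Cosine similarity itself ranges from -1 to 1, so scores are often rescaled onto the 0-1 STS scale described above. A minimal sketch of that computation (the vectors are illustrative, not taken from any of the models studied):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sts_score(a, b) -> float:
    """Map cosine similarity linearly onto the [0, 1] STS scale."""
    return (cosine_similarity(np.asarray(a, float), np.asarray(b, float)) + 1.0) / 2.0

print(sts_score([1, 0], [1, 0]))   # identical direction -> 1.0
print(sts_score([1, 0], [-1, 0]))  # opposite direction -> 0.0
print(sts_score([1, 0], [0, 1]))   # orthogonal -> 0.5
```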


t-SNE Visualization

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique used to visualize high-dimensional data in two or three dimensions. It preserves local neighborhood structures, making it useful for understanding how embedding models organize semantic space and for identifying clusters of similar texts.
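A minimal sketch of how such a projection is produced with scikit-learn's TSNE; random vectors stand in here for real issue and review embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-ins for embeddings: two loosely separated groups in 64 dimensions,
# playing the role of "issues" and "reviews".
issues = rng.normal(loc=0.0, scale=1.0, size=(20, 64))
reviews = rng.normal(loc=3.0, scale=1.0, size=(20, 64))
X = np.vstack([issues, reviews])

# Perplexity must be smaller than the number of samples.
proj = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(proj.shape)  # one 2-D point per original 64-D vector
```

The resulting 2-D points can then be scattered and colored by group to inspect whether the model keeps the two text types in distinct clusters.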

Methods

Dataset

The research utilized real-world data from four software projects: WordPress Android, Mindustry, K9 Mail (fsck_k9), and PPSSPP. The dataset consisted of user reviews and development issues that were matched and evaluated by human annotators.


Embedding Models

Five embedding models were systematically compared:

  • Jina: A general-purpose embedding model
  • OpenAI Large: OpenAI's larger embedding model
  • OpenAI Small: OpenAI's smaller, more efficient embedding model
  • Gemini: Google's Gemini embedding model
  • Cohere: Cohere's embedding model

Similarity Computation

Similarity scores were computed using cosine similarity between embedding vectors of issue-review pairs. The results were compared against human-assigned relevance levels (0-5) to evaluate model performance.
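A sketch of that comparison, with invented similarity scores and labels standing in for the real data: averaging the cosine similarity within each human relevance level shows whether a model's scores rise monotonically with relevance.

```python
import numpy as np

# Hypothetical cosine similarities and human relevance labels (0-5)
# for a handful of issue-review pairs.
scores = np.array([0.12, 0.15, 0.40, 0.45, 0.70, 0.88])
levels = np.array([0,    0,    2,    3,    5,    5])

def mean_score_per_level(scores: np.ndarray, levels: np.ndarray) -> dict:
    """Mean similarity for each relevance level present in the data."""
    return {int(l): float(scores[levels == l].mean()) for l in np.unique(levels)}

per_level = mean_score_per_level(scores, levels)
print(per_level)
```

In the actual study this per-level aggregation is what makes model behavior comparable: two models can agree at Levels 0 and 5 while diverging at Levels 2-3.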


Visualization

t-SNE was used to visualize the semantic space organization, revealing how different models structure the relationships between issues and reviews in the embedding space.

Main Findings

Key Results

Our analysis reveals a nuanced answer: while embedding models show remarkable convergence at the extremes of the relevance spectrum, they exhibit meaningful differences at intermediate levels and in overall similarity score distributions.


Convergence at Extremes

At the extremes of the relevance scale, that is, clearly irrelevant pairs (Level 0) and highly relevant pairs (Level 5), all five models produce comparable results. This convergence suggests that when semantic relationships are unambiguous, different embedding models capture similar patterns.


Differences at Intermediate Levels

At intermediate relevance levels (Levels 2-3), significant differences emerge. Jina and OpenAI Large consistently assign higher similarity scores to moderately relevant pairs, while Gemini and Cohere tend to be more conservative, requiring stronger semantic signals before assigning positive similarity scores.


Model Performance Patterns

Across all projects, Gemini and Jina consistently outperform other models in terms of both median similarity scores and consistency. Cohere shows a more conservative approach that may reduce false positives but potentially increase false negatives.


Semantic Space Organization

t-SNE visualizations confirm that all models successfully separate issues and reviews into distinct clusters. OpenAI models produce more compact, well-defined clusters, while Jina and Cohere show more diffuse distributions.


Practical Implications

The findings suggest that model selection matters most when dealing with ambiguous or borderline cases. For strongly relevant or clearly irrelevant pairs, the choice of embedding model appears to have less impact. Practitioners should carefully consider their tolerance for false positives and false negatives when selecting an embedding model.

References

[1] Jurafsky, D.; Martin, J. H. (2023). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (3rd ed.). Pearson.
[2] Pilone, A.; Raglianti, M.; Lanza, M.; Kon, F.; Meirelles, P. (2025). "Automatically augmenting GitHub issues with informative user reviews". In: 2025 IEEE International Conference on Software Maintenance and Evolution (ICSME).

[3] Li, J.; Li, M.; Liu, Y.; Zhang, L.; Wang, Y. (2024). "HYDBre: a hybrid retrieval method for detecting duplicate software bug reports". In: Proceedings of the 2024 IEEE International Conference on Software Maintenance and Evolution, pp. 1-12.
[4] Vaswani, A. et al. (2017). "Attention is all you need". Advances in Neural Information Processing Systems 30.

[5] Tripathy, B. K.; Anveshrithaa, S.; Ghela, S. (2021). "T-distributed stochastic neighbor embedding (t-SNE)". In: Data Science and Innovations for Intelligent Systems. CRC Press, p. 13.