SentenceTransformers
https://www.sbert.net/
Sentence Transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art embedding and reranker models. It can compute embeddings with Sentence Transformer models or calculate similarity scores with Cross-Encoder (a.k.a. reranker) models, and each model type has its own quickstart. This unlocks a wide range of applications, including semantic search, semantic textual similarity, and paraphrase mining.
A wide selection of over 10,000 pre-trained Sentence Transformers models is available for immediate use on 🤗 Hugging Face, including many of the state-of-the-art models from the Massive Text Embeddings Benchmark (MTEB) leaderboard. It is also easy to train or finetune your own embedding or reranker models with Sentence Transformers, so you can create custom models for your specific use cases.
Sentence Transformers was created by UKPLab and is being maintained by 🤗 Hugging Face. Don’t hesitate to open an issue on the Sentence Transformers repository if something is broken or if you have further questions.
Usage
See also the Quickstart for more information on how to use Sentence Transformers.
Working with Sentence Transformer models is straightforward:
    from sentence_transformers import SentenceTransformer

    # 1. Load a pretrained Sentence Transformer model
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # The sentences to encode
    sentences = [
        "The weather is lovely today.",
        "It's so sunny outside!",
        "He drove to the stadium.",
    ]

    # 2. Calculate embeddings by calling model.encode()
    embeddings = model.encode(sentences)
    print(embeddings.shape)
    # (3, 384)

    # 3. Calculate the embedding similarities
    similarities = model.similarity(embeddings, embeddings)
    print(similarities)
    # tensor([[1.0000, 0.6660, 0.1046],
    #         [0.6660, 1.0000, 0.1411],
    #         [0.1046, 0.1411, 1.0000]])
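To make the similarity matrix above less of a black box, here is a minimal pure-Python sketch of pairwise cosine similarity, which is the default metric used by model.similarity() unless a model is configured otherwise. The helper name cosine_similarity_matrix and the 3-dimensional toy vectors are assumptions made for illustration; this is not the library's implementation, which operates on tensors.

```python
import math


def cosine_similarity_matrix(embeddings):
    """Return pairwise cosine similarities for a list of equal-length vectors."""
    # Euclidean norm of each vector, computed once and reused for every pair.
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in embeddings]
    return [
        [
            sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
            for v, norm_v in zip(embeddings, norms)
        ]
        for u, norm_u in zip(embeddings, norms)
    ]


# Hypothetical toy vectors standing in for real 384-dimensional embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
]

similarity = cosine_similarity_matrix(toy_embeddings)
for row in similarity:
    print([round(value, 4) for value in row])
```

As in the real output, the diagonal is 1 (each vector compared with itself) and the matrix is symmetric. Vector length does not matter, only direction, which is why the third toy vector still scores 1 against itself despite its larger norm.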
Cross-Encoder (reranker) models work in a similar way:

    from sentence_transformers import CrossEncoder

    # 1. Load a pretrained CrossEncoder model
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

    # The texts for which to predict similarity scores
    query = "How many people live in Berlin?"
    passages = [
        "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
        "Berlin has a yearly total of about 135 million day visitors, making it one of the most-visited cities in the European Union.",
        "In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
    ]

    # 2a. Either predict scores for pairs of texts
    scores = model.predict([(query, passage) for passage in passages])
    print(scores)
    # => [8.607139 5.506266 6.352977]

    # 2b. Or rank a list of passages for a query
    ranks = model.rank(query, passages, return_documents=True)

    print("Query:", query)
    for rank in ranks:
        print(f"- #{rank['corpus_id']} ({rank['score']:.2f}): {rank['text']}")
    """
    Query: How many people live in Berlin?
    - #0 (8.61): Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
    - #2 (6.35): In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
    - #1 (5.51): Berlin has a yearly total of about 135 million day visitors, making it one of the most-visited cities in the European Union.
    """
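The relationship between the predict() scores and the rank() output above can be sketched in plain Python: ranking amounts to sorting the passages by score, highest first, while remembering each passage's original index. The helper rank_by_score is a hypothetical name used only for this illustration, not part of the library, and the scores are copied from the example output.

```python
# Scores copied from the model.predict() output shown above, one per passage.
scores = [8.607139, 5.506266, 6.352977]


def rank_by_score(scores):
    """Return one dict per passage, sorted from highest to lowest score."""
    ranked = [
        {"corpus_id": index, "score": score}
        for index, score in enumerate(scores)
    ]
    ranked.sort(key=lambda item: item["score"], reverse=True)
    return ranked


ranks = rank_by_score(scores)
print([item["corpus_id"] for item in ranks])  # [0, 2, 1]
```

The resulting order (passage 0, then 2, then 1) matches the order of the ranked passages printed in the example.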