How to use kavish218/nomic_embeddings-htc-2 with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("kavish218/nomic_embeddings-htc-2", trust_remote_code=True)
sentences = [
"Viola is a genus of flowering plants in the violet family Violaceae. It is the largest genus in the family, containing between 525 and 600 species. Most species are found in the temperate Northern Hemisphere; however, some are also found in widely divergent areas such as Hawaii, Australasia, and the Andes. Some Viola species are perennial plants, some are annual plants, and a few are small shrubs. Many species, varieties and cultivars are grown in gardens for their ornamental flowers. In horticulture the term pansy is normally used for those multi-colored, large-flowered cultivars which are raised annually or biennially from seed and used extensively in bedding. The terms viola and violet are normally reserved for small-flowered annuals or perennials, including the wild species.",
"In biology, phylogenetics (from Greek φυλή/φῦλον (phylé/phylon) \"tribe, clan, race\", and γενετικός (genetikós) \"origin, source, birth\") is a part of systematics that addresses the inference of the evolutionary history and relationships among or within groups of organisms (e.g. species, or more inclusive taxa). These relationships are hypothesized by phylogenetic inference methods that evaluate observed heritable traits, such as DNA sequences or morphology, often under a specified model of evolution of these traits. The result of such an analysis is a phylogeny (also known as a phylogenetic tree)—a diagrammatic hypothesis of relationships that reflects the evolutionary history of a group of organisms. The tips of a phylogenetic tree can be living taxa or fossils, and represent the 'end', or the present, in an evolutionary lineage. A phylogenetic diagram can be rooted or unrooted. A rooted tree diagram indicates the hypothetical common ancestor, or ancestral lineage, of the tree. An unrooted tree diagram (a network) makes no assumption about the ancestral line, and does not show the origin or \"root\" of the taxa in question or the direction of inferred evolutionary transformations. In addition to their proper use for inferring phylogenetic patterns among taxa, phylogenetic analyses are often employed to represent relationships among gene copies or individual organisms. Such uses have become central to understanding biodiversity, evolution, ecology, and genomes. In February 2021, scientists reported, for the first time, the sequencing of DNA from animal remains, a mammoth in this instance, over a million years old, the oldest DNA sequenced to date.Taxonomy is the identification, naming and classification of organisms. Classifications are now usually based on phylogenetic data, and many systematists contend that only monophyletic taxa should be recognized as named groups. 
The degree to which classification depends on inferred evolutionary history differs depending on the school of taxonomy: phenetics ignores phylogenetic speculation altogether, trying to represent the similarity between organisms instead; cladistics (phylogenetic systematics) tries to reflect phylogeny in its classifications by only recognizing groups based on shared, derived characters (synapomorphies); evolutionary taxonomy tries to take into account both the branching pattern and \"degree of difference\" to find a compromise between them.",
"A nut is a fruit composed of an inedible hard shell and a seed, which is generally edible. In general usage and in a culinary sense, a wide variety of dried seeds are called nuts, but in a botanical context \"nut\" implies that the shell does not open to release the seed (indehiscent). The translation of \"nut\" in certain languages frequently requires paraphrases, as the word is ambiguous. Most seeds come from fruits that naturally free themselves from the shell, unlike nuts such as hazelnuts, chestnuts, and acorns, which have hard shell walls and originate from a compound ovary. The general and original usage of the term is less restrictive, and many nuts (in the culinary sense), such as almonds, pecans, pistachios, walnuts, and Brazil nuts, are not nuts in a botanical sense. Common usage of the term often refers to any hard-walled, edible kernel as a nut. Nuts are an energy-dense and nutrient-rich food source.",
"Bellis perennis, the daisy, is a common European species of the family Asteraceae, often considered the archetypal species of that name. To distinguish this species from other \"daisies\" it is sometimes qualified as common daisy, lawn daisy or English daisy. Historically, it has also been widely known as bruisewort, and occasionally woundwort (although the common name \"woundwort\" is now more closely associated with the genus Stachys). B. perennis is native to western, central and northern Europe, including remote islands such as the Faroe Islands but has become widely naturalised in most temperate regions, including the Americas and Australasia."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model fine-tuned from nomic-ai/nomic-embed-text-v1. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'NomicBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
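The Pooling and Normalize() modules listed above can be illustrated with a small sketch. It is written in plain Python on toy 2-dimensional vectors for readability; the helper names `mean_pool` and `l2_normalize` are hypothetical and are not part of the sentence-transformers API, and the library itself works on batched tensors.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / max(len(kept), 1) for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length, mirroring what a normalization step does."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Toy input: 3 token slots with 2 dimensions each (the real model uses 768).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]  # the last slot is padding and must not affect the result

sentence_embedding = l2_normalize(mean_pool(token_vectors, attention_mask))
print(sentence_embedding)
```

Because the output has unit length, a dot product between two such embeddings is already their cosine similarity.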
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("kavish218/nomic_embeddings-htc-2", trust_remote_code=True)  # NomicBertModel is loaded as custom code
# Run inference
sentences = [
'Modernism is both a philosophical movement and an art movement that arose from broad transformations in Western society during the late 19th and early 20th centuries. The movement reflected a desire for the creation of new forms of art, philosophy, and social organization which reflected the newly emerging industrial world, including features such as urbanization, new technologies, and war. Artists attempted to depart from traditional forms of art, which they considered outdated or obsolete. The poet Ezra Pound\'s 1934 injunction to "Make it New" was the touchstone of the movement\'s approach. Modernist innovations included abstract art, the stream-of-consciousness novel, montage cinema, atonal and twelve-tone music, and divisionist painting. Modernism explicitly rejected the ideology of realism and made use of the works of the past by the employment of reprise, incorporation, rewriting, recapitulation, revision and parody. Modernism also rejected the certainty of Enlightenment thinking, and many modernists also rejected religious belief. A notable characteristic of modernism is self-consciousness concerning artistic and social traditions, which often led to experimentation with form, along with the use of techniques that drew attention to the processes and materials used in creating works of art.While some scholars see modernism continuing into the 21st century, others see it evolving into late modernism or high modernism. Postmodernism is a departure from modernism and rejects its basic assumptions.',
'Postmodernism is a broad movement that developed in the mid-to-late 20th century across philosophy, the arts, architecture, and criticism, marking a departure from modernism. The term has been more generally applied to describe a historical era said to follow after modernity and the tendencies of this era. Postmodernism is generally defined by an attitude of skepticism, irony, or rejection toward what it describes as the grand narratives and ideologies associated with modernism, often criticizing Enlightenment rationality and focusing on the role of ideology in maintaining political or economic power. Postmodern thinkers frequently describe knowledge claims and value systems as contingent or socially-conditioned, framing them as products of political, historical, or cultural discourses and hierarchies. Common targets of postmodern criticism include universalist ideas of objective reality, morality, truth, human nature, reason, science, language, and social progress. Accordingly, postmodern thought is broadly characterized by tendencies to self-consciousness, self-referentiality, epistemological and moral relativism, pluralism, and irreverence. Postmodern critical approaches gained popularity in the 1980s and 1990s, and have been adopted in a variety of academic and theoretical disciplines, including cultural studies, philosophy of science, economics, linguistics, architecture, feminist theory, and literary criticism, as well as art movements in fields such as literature, contemporary art, and music. Postmodernism is often associated with schools of thought such as deconstruction, post-structuralism, and institutional critique, as well as philosophers such as Jean-François Lyotard, Jacques Derrida, and Fredric Jameson. Criticisms of postmodernism are intellectually diverse and include arguments that postmodernism promotes obscurantism, is meaningless, and that it adds nothing to analytical or empirical knowledge.',
'Yucca is a genus of perennial shrubs and trees in the family Asparagaceae, subfamily Agavoideae. Its 40–50 species are notable for their rosettes of evergreen, tough, sword-shaped leaves and large terminal panicles of white or whitish flowers. They are native to the hot and dry (arid) parts of the Americas and the Caribbean. Early reports of the species were confused with the cassava (Manihot esculenta). Consequently, Linnaeus mistakenly derived the generic name from the Taíno word for the latter, yuca.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.8460, 0.0451],
# [0.8460, 1.0000, 0.0430],
# [0.0451, 0.0430, 1.0000]])
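To make the similarity matrix above more concrete, here is a minimal sketch of how such a matrix can be computed and used for ranking, assuming the similarity function is cosine similarity (consistent with the 1.0000 diagonal). The 3-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings, and `cosine_similarity` and `similarity_matrix` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(embeddings):
    """Pairwise cosine similarities, one row per embedding."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]

# Hypothetical stand-ins: the first two vectors point in similar directions.
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]
matrix = similarity_matrix(toy_embeddings)

# Rank the other texts against the first one, most similar first.
query = 0
ranking = sorted(
    (i for i in range(len(toy_embeddings)) if i != query),
    key=lambda i: matrix[query][i],
    reverse=True,
)
print(ranking)
```

The same ranking step is the core of a simple semantic search: embed the query, score it against every document embedding, and sort.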
Columns: content_1 and content_2

| | content_1 | content_2 |
|---|---|---|
| type | string | string |
| details | | |

| content_1 | content_2 |
|---|---|
| Sacral architecture (also known as sacred architecture or religious architecture) is a religious architectural practice concerned with the design and construction of places of worship or sacred or intentional space, such as churches, mosques, stupas, synagogues, and temples. Many cultures devoted considerable resources to their sacred architecture and places of worship. Religious and sacred spaces are amongst the most impressive and permanent monolithic buildings created by humanity. Conversely, sacred architecture as a locale for meta-intimacy may also be non-monolithic, ephemeral and intensely private, personal and non-public. Sacred, religious and holy structures often evolved over centuries and were the largest buildings in the world, prior to the modern skyscraper. While the various styles employed in sacred architecture sometimes reflected trends in other structures, these styles also remained unique from the contemporary architecture used in other structures. With the rise of C... | Architecture (Latin architectura, from the Greek ἀρχιτέκτων arkhitekton "architect", from ἀρχι- "chief" and τέκτων "creator") is both the process and the product of planning, designing, and constructing buildings or other structures. Architectural works, in the material form of buildings, are often perceived as cultural symbols and as works of art. Historical civilizations are often identified with their surviving architectural achievements.The practice, which began in the prehistoric era, has been used as a way of expressing culture for civilizations on all seven continents. For this reason, architecture is considered to be a form of art. Texts on architecture have been written since ancient time. The earliest surviving text on architectural theory is the 1st century AD treatise De architectura by the Roman architect Vitruvius, according to whom a good building embodies firmitas, utilitas, and venustas (durability, utility, and beauty). Centuries later, Leon Battista Alberti developed... |
| Proportion is a central principle of architectural theory and an important connection between mathematics and art. It is the visual effect of the relationships of the various objects and spaces that make up a structure to one another and to the whole. These relationships are often governed by multiples of a standard unit of length known as a "module".Proportion in architecture was discussed by Vitruvius, Leon Battista Alberti, Andrea Palladio, and Le Corbusier among others. | Landscape architecture is the design of outdoor areas, landmarks, and structures to achieve environmental, social-behavioural, or aesthetic outcomes. It involves the systematic design and general engineering of various structures for construction and human use, investigation of existing social, ecological, and soil conditions and processes in the landscape, and the design of other interventions that will produce desired outcomes. The scope of the profession is broad and can be subdivided into several sub-categories including professional or licensed landscape architects who are regulated by governmental agencies and possess the expertise to design a wide range of structures and landforms for human use; landscape design which is not a licensed profession; site planning; stormwater management; erosion control; environmental restoration; parks, recreation and urban planning; visual resource management; green infrastructure planning and provision; and private estate and residence landscape... |
| The Basílica de la Sagrada Família (Catalan: [bəˈzilikə ðə lə səˈɣɾaðə fəˈmiljə]; Spanish: Basílica de la Sagrada Familia; 'Basilica of the Holy Family'), also known as the Sagrada Família, is a large unfinished Roman Catholic minor basilica in the Eixample district of Barcelona, Catalonia, Spain. Designed by the Spanish architect Antoni Gaudí (1852–1926), his work on the building is part of a UNESCO World Heritage Site. On 7 November 2010, Pope Benedict XVI consecrated the church and proclaimed it a minor basilica.On 19 March 1882, construction of the Sagrada Família began under architect Francisco de Paula del Villar. In 1883, when Villar resigned, Gaudí took over as chief architect, transforming the project with his architectural and engineering style, combining Gothic and curvilinear Art Nouveau forms. Gaudí devoted the remainder of his life to the project, and he is buried in the crypt. At the time of his death in 1926, less than a quarter of the project was complete.Relying sole... | The Colosseum ( KOL-ə-SEE-əm; Italian: Colosseo [kolosˈsɛːo]) is an oval amphitheatre in the centre of the city of Rome, Italy, just east of the Roman Forum. It is the largest ancient amphitheatre ever built, and is still the largest standing amphitheatre in the world today, despite its age. Construction began under the emperor Vespasian (r. 69–79 AD) in 72 and was completed in 80 AD under his successor and heir, Titus (r. 79–81). Further modifications were made during the reign of Domitian (r. 81–96). The three emperors that were patrons of the work are known as the Flavian dynasty, and the amphitheatre was named the Flavian Amphitheatre (Latin: Amphitheatrum Flavium; Italian: Anfiteatro Flavio [aɱfiteˈaːtro ˈflaːvjo]) by later classicists and archaeologists for its association with their family name (Flavius).The Colosseum is built of travertine limestone, tuff (volcanic rock), and brick-faced concrete. The Colosseum could hold an estimated 50,000 to 80,000 spectators at various poin... |

MultipleNegativesRankingLoss with these parameters:
{
"scale": 20.0,
"similarity_fct": "cos_sim",
"gather_across_devices": false
}
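The idea behind this loss can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not the library's implementation: each content_1 text is treated as an anchor, its paired content_2 text as the positive, and every other content_2 text in the batch as a negative; cosine similarities are multiplied by the scale (20.0 here) and fed into a cross-entropy over the batch. The function names and the 2-dimensional vectors are hypothetical.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, positive) for positive in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two pairs (made-up vectors).
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = multiple_negatives_ranking_loss(anchors, positives)
shuffled_loss = multiple_negatives_ranking_loss(anchors, positives[::-1])
print(aligned_loss, shuffled_loss)
```

The loss is small when each anchor is closest to its own positive and grows when the pairs are mismatched, which is why duplicate texts inside a batch are worth avoiding: a duplicate would be scored as a negative for its own pair.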
Non-default hyperparameters:

- per_device_eval_batch_size: 16
- num_train_epochs: 1
- warmup_ratio: 0.1
- fp16: True
- batch_sampler: no_duplicates

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: no
- prediction_loss_only: True
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 1
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- hub_revision: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- liger_kernel_config: None
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
- router_mapping: {}
- learning_rate_mapping: {}

| Epoch | Step | Training Loss |
|---|---|---|
| 0.0270 | 10 | 0.4417 |
| 0.0541 | 20 | 0.2135 |
| 0.0811 | 30 | 0.0849 |
| 0.1081 | 40 | 0.2744 |
| 0.1351 | 50 | 0.2297 |
| 0.1622 | 60 | 0.2694 |
| 0.1892 | 70 | 0.1039 |
| 0.2162 | 80 | 0.144 |
| 0.2432 | 90 | 0.0802 |
| 0.2703 | 100 | 0.0886 |
| 0.2973 | 110 | 0.1841 |
| 0.3243 | 120 | 0.0515 |
| 0.3514 | 130 | 0.373 |
| 0.3784 | 140 | 0.0519 |
| 0.4054 | 150 | 0.0942 |
| 0.4324 | 160 | 0.1645 |
| 0.4595 | 170 | 0.1254 |
| 0.4865 | 180 | 0.1549 |
| 0.5135 | 190 | 0.1378 |
| 0.5405 | 200 | 0.1643 |
| 0.5676 | 210 | 0.116 |
| 0.5946 | 220 | 0.0724 |
| 0.6216 | 230 | 0.1589 |
| 0.6486 | 240 | 0.2252 |
| 0.6757 | 250 | 0.1201 |
| 0.7027 | 260 | 0.2506 |
| 0.7297 | 270 | 0.0639 |
| 0.7568 | 280 | 0.2527 |
| 0.7838 | 290 | 0.267 |
| 0.8108 | 300 | 0.0509 |
| 0.8378 | 310 | 0.2324 |
| 0.8649 | 320 | 0.2107 |
| 0.8919 | 330 | 0.1843 |
| 0.9189 | 340 | 0.0659 |
| 0.9459 | 350 | 0.1914 |
| 0.9730 | 360 | 0.0676 |
| 1.0 | 370 | 0.1129 |
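The training log and hyperparameters can be tied together with a short sketch of the learning-rate schedule. It assumes the common reading of lr_scheduler_type: linear with warmup_ratio: 0.1, namely a linear ramp from 0 to the base rate over the first 10% of steps followed by a linear decay to 0; the training library's exact implementation may differ in details. The function name is hypothetical; the constants (370 steps, 5e-05, 0.1) are taken from the card.

```python
import math

def linear_schedule_with_warmup(step, total_steps, base_lr, warmup_ratio):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

TOTAL_STEPS = 370    # last step in the training log (one epoch)
BASE_LR = 5e-05      # learning_rate
WARMUP_RATIO = 0.1   # warmup_ratio

for step in (10, 37, 100, 370):
    lr = linear_schedule_with_warmup(step, TOTAL_STEPS, BASE_LR, WARMUP_RATIO)
    # The epoch column of the log is simply step / total steps.
    print(f"step {step:3d} | epoch {step / TOTAL_STEPS:.4f} | lr {lr:.2e}")
```

Under these assumptions the peak learning rate is reached at step 37, so the first three rows of the log fall inside the warmup phase.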
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}