  • textual data
  • 4.0
Sense Embeddings Models
This dataset contains the sense embeddings models (sense vectors) produced for the article "Sense Embeddings in Knowledge-Based Word Sense Disambiguation" by Loïc Vial, Benjamin Lecouteux and Didier Schwab, published in the proceedings of the 12th International Conference on Computational Semantics (IWCS 2017).
Read me file
readme.txt

There are 3 directories, and each directory contains 5 files.

The 3 directories are:
- The directory "words", which contains the original word embeddings models that were used for creating the sense embeddings models.
- The directory "senses", which contains the produced sense embeddings models.
- The directory "combined", which contains embeddings models that contain both the words and the senses.

Each file name indicates the origin of the model:
- The files prefixed with "baroni_c" are the context-counting vectors from the work of Baroni et al., "Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors" (ACL 2014) (http://clic.cimec.unitn.it/composes/semantic-vectors.html).
- The files prefixed with "baroni_p" are the context-predicting vectors from the same work.
- The files prefixed with "deps" come from the work of Levy and Goldberg, "Dependency-Based Word Embeddings" (ACL 2014) (https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/).
- The files prefixed with "glove" come from the work of Pennington et al., "GloVe: Global Vectors for Word Representation" (EMNLP 2014) (https://nlp.stanford.edu/projects/glove/).
- The files prefixed with "word2vec" come from the work of Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality" (NIPS 2013) (https://code.google.com/archive/p/word2vec/).
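
The naming scheme above can be sketched in a few lines of Python. This is only an illustration: the helper names `expected_paths` and `split_name` are hypothetical, and the assumption that each file sits in the directory named after its kind (for example "senses/glove_senses.bin") is inferred from the file names, not stated in this readme.

```python
from itertools import product

# Source prefixes and model kinds, as listed in this readme.
PREFIXES = ("baroni_c", "baroni_p", "deps", "glove", "word2vec")
KINDS = ("words", "senses", "combined")


def expected_paths():
    """List the 15 expected model paths (3 directories x 5 sources).

    Assumption: a file named "<prefix>_<kind>.bin" is stored in the
    directory "<kind>".
    """
    return [f"{kind}/{prefix}_{kind}.bin"
            for kind, prefix in product(KINDS, PREFIXES)]


def split_name(filename):
    """Split a name such as "baroni_c_words.bin" into (prefix, kind)."""
    stem = filename.rsplit(".", 1)[0]
    # The kind is the part after the last underscore; the prefix itself
    # may contain underscores (e.g. "baroni_c").
    prefix, _, kind = stem.rpartition("_")
    return prefix, kind


paths = expected_paths()
```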

The sense embeddings models contain all 206,941 senses from WordNet 3.0, accessible through their sense key (e.g. "sense%1:10:00::").
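
As an illustration, a sense key can be split into its fields with a short Python sketch. The field layout used here (lemma%ss_type:lex_filenum:lex_id:head_word:head_id) is the one described in the WordNet sense index documentation, not something defined by this dataset, and the names `SENSE_KEY`, `SenseKey` and `parse_sense_key` are hypothetical. Returning None for other strings is a convenience for telling senses apart from plain words, assuming the "combined" models store words under their surface form.

```python
import re
from typing import NamedTuple, Optional

# WordNet ss_type codes (from the WordNet sense index documentation).
SS_TYPES = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb",
            5: "adjective satellite"}

# lemma%ss_type:lex_filenum:lex_id:head_word:head_id
SENSE_KEY = re.compile(
    r"^(?P<lemma>.+)%(?P<ss_type>[1-5]):(?P<lex_filenum>\d+):"
    r"(?P<lex_id>\d+):(?P<head_word>[^:]*):(?P<head_id>\d*)$"
)


class SenseKey(NamedTuple):
    lemma: str
    pos: str
    lex_filenum: int
    lex_id: int
    head_word: str  # empty unless the sense is an adjective satellite
    head_id: str


def parse_sense_key(key: str) -> Optional[SenseKey]:
    """Return the parsed fields, or None if `key` is not a sense key."""
    match = SENSE_KEY.match(key)
    if match is None:
        return None
    return SenseKey(
        lemma=match["lemma"],
        pos=SS_TYPES[int(match["ss_type"])],
        lex_filenum=int(match["lex_filenum"]),
        lex_id=int(match["lex_id"]),
        head_word=match["head_word"],
        head_id=match["head_id"],
    )


parsed = parse_sense_key("sense%1:10:00::")
```

A plain word that happens to contain a percent sign, such as "50%", does not match the pattern and yields None.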

The models use the same binary format as typical word embeddings models, so the tools and scripts from the word2vec toolkit (https://github.com/dav/word2vec), for instance, can parse them.
In more detail, the format consists of:
1) A string representation of an integer giving the total number of vectors in the model, followed by a space character (hexadecimal value 0x20).
2) A string representation of an integer giving the number of dimensions of each vector, followed by a newline character (hexadecimal value 0x0A).
3) For each vector:
3a) A string giving the word or sense, followed by a space character (hexadecimal value 0x20).
3b) The vector itself, stored as a sequence of 32-bit floats, one per dimension.
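
The layout described above can be read with a minimal Python sketch like the following. It is not the authors' loader: the function name `read_embeddings` is hypothetical, and two points are assumptions not stated in this readme, namely that the floats are little-endian and that keys are UTF-8 encoded. The sketch also tolerates a newline after each vector, which some word2vec writers emit.

```python
import io
import struct


def read_embeddings(stream):
    """Yield (key, vector) pairs from a binary stream in the format above."""
    # Steps 1 and 2: "<number of vectors> <number of dimensions>\n".
    header = stream.readline().decode("ascii")
    count, dims = (int(field) for field in header.split())
    # Assumption: 32-bit floats are stored in little-endian byte order.
    vector_format = struct.Struct("<%df" % dims)
    for _ in range(count):
        # Step 3a: the key runs up to the next space character (0x20).
        key = bytearray()
        while True:
            byte = stream.read(1)
            if not byte:
                raise EOFError("unexpected end of data while reading a key")
            if byte == b" ":
                break
            if byte == b"\n" and not key:
                # Tolerate a newline left after the previous vector.
                continue
            key += byte
        # Step 3b: `dims` consecutive 32-bit floats.
        data = stream.read(vector_format.size)
        if len(data) != vector_format.size:
            raise EOFError("unexpected end of data while reading a vector")
        # Assumption: keys are UTF-8 encoded.
        yield key.decode("utf-8"), vector_format.unpack(data)


# Tiny in-memory model with two 2-dimensional vectors (made-up values).
sample = (
    b"2 2\n"
    + b"cat " + struct.pack("<2f", 0.5, -1.0)
    + b"sense%1:10:00:: " + struct.pack("<2f", 2.0, 0.25)
)
model = dict(read_embeddings(io.BytesIO(sample)))
```

For a real model, the same function can be given a file opened in binary mode, for example `open("word2vec_senses.bin", "rb")`. Loading a whole model into a dictionary of tuples uses far more memory than the file size, so iterating over the generator and keeping only the needed keys is usually preferable.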
2017 09 27
The size of this dataset is more than 4000 Mb
Archive files
sense_embeddings_models.zip
2017 09 27
13.01 GB
  • baroni_c_combined.bin 997 124 596 bytes
  • baroni_p_combined.bin 799 099 604 bytes
  • deps_combined.bin 452 379 830 bytes
  • word2vec_combined.bin 3 805 853 896 bytes
  • glove_combined.bin 2 511 255 694 bytes
  • glove_senses.bin 247 007 678 bytes
  • baroni_c_senses.bin 408 686 584 bytes
  • baroni_p_senses.bin 327 847 131 bytes
  • deps_senses.bin 247 007 678 bytes
  • word2vec_senses.bin 247 007 678 bytes
  • deps_words.bin 205 372 163 bytes
  • word2vec_words.bin 3 558 846 213 bytes
  • baroni_c_words.bin 588 438 022 bytes
  • baroni_p_words.bin 471 252 484 bytes
  • glove_words.bin 2 264 248 027 bytes
Other metadata
  • Subjects: Computer Science, Linguistics, Mathematics
  • Keywords: sense embeddings, sense vectors, word sense disambiguation
  • Corresponding tasks: word sense disambiguation
  • Encoding data format: word2vec binary format

Vial L., Lecouteux B., Schwab D. (2017). Sense Embedding Models, companion datasets for the IWCS 2017 publication "Sense Embeddings in Knowledge-Based Word Sense Disambiguation". [dataset], doi:10.18709/PERSCIDO.2017.10.DS117. Published 2017 via Perscido-Grenoble-Alpes