NLP - Word Embedding

Word Embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension. [1]

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Its target audience is the natural language processing (NLP) and information retrieval (IR) community. [2]

Example with a document in English

Step 1 - Read natural text from a book
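A minimal sketch for this step, assuming the book is available locally as a plain-text file; the file name book.txt is a placeholder.

```python
# Read the raw text of the book from a local plain-text file.
# The file name "book.txt" is a placeholder for illustration.
with open("book.txt", encoding="utf-8") as f:
    raw_text = f.read()

print(raw_text[:200])  # preview the first characters of the document
```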

Step 2 - Tokenize and remove stopwords

Data Quality process

Refers to the process of cleaning the input data so that it has meaning and value.

Stopwords

Refers to the most common words in a language, which do not significantly affect the meaning of the text.
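A minimal sketch for this step, assuming gensim's simple_preprocess for tokenization and NLTK's English stopword list; the original text does not prescribe specific libraries for this step.

```python
import nltk
from nltk.corpus import stopwords
from gensim.utils import simple_preprocess

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

# Tokenize line by line so the Word2Vec model later receives a list of token lists,
# and drop stopwords as part of the data quality process.
sentences = []
for line in raw_text.splitlines():
    tokens = [t for t in simple_preprocess(line) if t not in stop_words]
    if tokens:
        sentences.append(tokens)
```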

Step 3 - Create a Word2Vec model

Word2Vec consists of models for generating word embeddings. These models are shallow, two-layer neural networks with one input layer, one hidden layer and one output layer. Word2Vec utilizes two architectures: CBOW (Continuous Bag of Words) and Skip-gram. [3]
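A minimal training sketch using the gensim 4.x API on the token lists produced in Step 2; the hyperparameter values here are assumptions, not values from the original walkthrough.

```python
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=sentences,   # list of token lists from Step 2
    vector_size=100,       # dimensionality of the word vectors
    window=5,              # context window size
    min_count=2,           # ignore words appearing fewer than 2 times
    sg=0,                  # 0 = CBOW, 1 = Skip-gram
    workers=4,
)
```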

Vocabulary

The set of unique words in the document.
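The vocabulary kept by the trained model can be inspected like this (gensim 4.x attribute names assumed):

```python
vocabulary = list(model.wv.key_to_index)  # unique words kept by the model
print(len(vocabulary), "unique words")
print(vocabulary[:20])                    # a small sample of the vocabulary
```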

Word Embedding
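Each word in the vocabulary maps to a dense vector of real numbers; a quick way to look one up (the query word "ship" is a hypothetical example):

```python
vector = model.wv["ship"]  # hypothetical query word; must be in the vocabulary
print(vector.shape)        # (100,) with the vector_size used above
print(vector[:10])         # first few components of the embedding
```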

Similar Words

The words most similar to a given word in terms of meaning and context.
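A sketch of querying the most similar words with gensim's most_similar; the query word is again a hypothetical example.

```python
for word, score in model.wv.most_similar("ship", topn=10):
    print(f"{word}: {score:.3f}")  # cosine similarity to the query word
```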

Step 4 - Plot similar words
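A minimal plotting sketch, assuming scikit-learn for the 2-D projection (PCA) and matplotlib for the scatter plot; the original walkthrough does not specify which tools it used.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

query = "ship"  # hypothetical word present in the vocabulary
words = [query] + [w for w, _ in model.wv.most_similar(query, topn=10)]
vectors = [model.wv[w] for w in words]

# Project the high-dimensional embeddings down to 2 dimensions for plotting.
coords = PCA(n_components=2).fit_transform(vectors)

plt.figure(figsize=(8, 6))
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.title(f"Words similar to '{query}'")
plt.show()
```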

Step 5 - Export the similarity between words

Dense matrix: stores a similarity value for every pair of words in the vocabulary.

Sparse matrix: stores only the significant (non-zero) similarities, which saves memory for large vocabularies.
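A sketch of exporting the pairwise cosine similarities both as a dense matrix and as a thresholded sparse matrix; the 0.5 cut-off and the output file names are assumptions.

```python
import numpy as np
from scipy import sparse

normed = model.wv.get_normed_vectors()  # unit-length word vectors (gensim 4.x)
dense = normed @ normed.T               # dense matrix: cosine similarity for every word pair

np.savetxt("similarity_dense.csv", dense, delimiter=",")

# Sparse matrix: keep only similarities above a threshold to save memory.
sparse_sim = sparse.csr_matrix(np.where(dense >= 0.5, dense, 0.0))
sparse.save_npz("similarity_sparse.npz", sparse_sim)
```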

References

[1] Wikipedia - Word embedding.
[2] PyPI - Gensim project site.
[3] Wikipedia - Word2vec.

