sentiment_classifier.nlp package

Submodules

sentiment_classifier.nlp.preprocessing module

Module for text pre-processing

We provide a basic text preprocessing function that performs the following tasks:

  • Removes HTML tags
  • Surrounds punctuation and special characters with spaces

This function can be passed to a Reader instance when loading the dataset.

Note: we deliberately did not lowercase the sentences or remove the special characters. We think this information can make a difference when classifying sentiment. We are also using word embeddings, and the embeddings differ for lowercase vs. uppercase words.

sentiment_classifier.nlp.preprocessing.clean_text(text)[source]

Function to clean a string. This function does the following:

  • Remove HTML tags
  • Surround punctuation and special characters with spaces
  • Remove extra spaces
Parameters:text (str) – the text to clean
Returns:the cleaned text
Return type:str
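The steps above can be sketched with a minimal regex-based re-implementation. This is an illustrative approximation of the documented behaviour, not the package's actual code; the exact regular expressions used by sentiment_classifier.nlp.preprocessing.clean_text may differ:

```python
import re

def clean_text(text):
    """Illustrative sketch of the documented cleaning steps."""
    # 1. Remove HTML tags (e.g. the "<br />" tags common in IMDB reviews)
    text = re.sub(r"<[^>]+>", " ", text)
    # 2. Surround punctuation and special characters with spaces,
    #    so they become separate tokens instead of sticking to words
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    # 3. Collapse runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text
```

For example, `clean_text("<br />Great movie!")` yields `"Great movie !"`, ready to be split on whitespace by a tokenizer.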

sentiment_classifier.nlp.reader module

We are using the IMDB Large Movie Reviews dataset from Stanford AI.

It provides 50,000 movie reviews, split evenly into train and test sets and labelled as positive or negative.

We provide an abstract class Reader that we can subclass for each dataset.

We do this to standardise dataset loading and make it easy to use multiple datasets in the rest of the code through a common interface.
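The abstract-base-class pattern described above can be sketched as follows. Both the minimal Reader shown here and the CSVReader subclass are hypothetical stand-ins for illustration; the real classes live in sentiment_classifier.nlp.reader:

```python
from abc import ABC, abstractmethod

class Reader(ABC):
    """Sketch of an abstract dataset reader with a common interface."""

    def __init__(self, path):
        self.path = path

    @abstractmethod
    def load_dataset(self, limit=None, preprocessing_function=None):
        """Subclasses implement the dataset-specific loading logic."""

class CSVReader(Reader):
    """Hypothetical subclass showing how the interface is honoured."""

    def load_dataset(self, limit=None, preprocessing_function=None):
        # Stand-in for real file loading
        texts = ["a good film", "a bad film"]
        if preprocessing_function is not None:
            texts = [preprocessing_function(t) for t in texts]
        return texts[:limit]
```

Because every subclass exposes the same `load_dataset(limit, preprocessing_function)` signature, the rest of the code can swap datasets without changes.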

The IMDBReader class implements all the code needed to load the IMDB dataset.

class sentiment_classifier.nlp.reader.IMDBReader(path)[source]

Bases: sentiment_classifier.nlp.reader.Reader

load_dataset(limit=None, preprocessing_function=None)[source]

Load the IMDB dataset.

This function can also:
  • preprocess using a custom function
  • set a maximum number of files to load
Parameters:
  • limit (int, optional) – Defaults to None. Max number of files to load.
  • preprocessing_function (optional) – Defaults to None. Function for preprocessing the texts. No preprocessing by default.
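A hypothetical sketch of how such a loader might walk the IMDB directory layout (one `pos` and one `neg` folder per split, one review per file). The function name load_imdb_split and the tiny demo dataset below are illustrative assumptions, not the package's actual implementation:

```python
import os
import tempfile

def load_imdb_split(path, limit=None, preprocessing_function=None):
    """Illustrative loader: walk pos/neg folders, honouring limit and
    an optional preprocessing function, as load_dataset is documented to."""
    texts, labels = [], []
    for label in ("pos", "neg"):
        folder = os.path.join(path, label)
        for i, name in enumerate(sorted(os.listdir(folder))):
            if limit is not None and i >= limit:
                break  # cap the number of files loaded per class
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                text = f.read()
            if preprocessing_function is not None:
                text = preprocessing_function(text)
            texts.append(text)
            labels.append(1 if label == "pos" else 0)
    return texts, labels

# Build a tiny fake dataset to demonstrate the directory layout
root = tempfile.mkdtemp()
for label in ("pos", "neg"):
    os.makedirs(os.path.join(root, label))
    with open(os.path.join(root, label, "0_1.txt"), "w") as f:
        f.write("great" if label == "pos" else "awful")

texts, labels = load_imdb_split(root, limit=1)
```

Here `texts` is `["great", "awful"]` and `labels` is `[1, 0]`: one review per class, because of `limit=1`.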
class sentiment_classifier.nlp.reader.Reader(path)[source]

Bases: abc.ABC

load_dataset(path, limit=None, preprocessing_function=None)[source]

sentiment_classifier.nlp.tokenizer module

This module abstracts the tokenizer object, so that we can use tokenizers from different libraries and provide the same interface. Hence, we won’t need to change the rest of the code when changing tokenizers.

So far we only have one tokenizer, based on keras.preprocessing.text.Tokenizer.

class sentiment_classifier.nlp.tokenizer.BaseTokenizer[source]

Bases: abc.ABC

fit(train_data)[source]

Fit the tokenizer on the training data.

Parameters:train_data (list) – List of texts to fit the tokenizer on.
load(filepath)[source]

Load the tokenizer from disk.

Parameters:filepath (str) – Path to load the tokenizer from
Returns:the tokenizer itself, with loaded data
Return type:self (BaseTokenizer)
save(filename)[source]

Persist the tokenizer to disk

Parameters:filename (str) – Path to save to.
transform(data)[source]

Transform the texts using the fitted tokenizer.

Parameters:data (list) – List of texts to transform
class sentiment_classifier.nlp.tokenizer.KerasTokenizer(pad_max_len, lower=False, filters='\t\n')[source]

Bases: sentiment_classifier.nlp.tokenizer.BaseTokenizer

fit(train_data)[source]

Fit the tokenizer on the training data.

Parameters:train_data (list) – List of texts to fit the tokenizer on.
transform(data)[source]

Transform the texts using the fitted tokenizer.

Parameters:data (list) – List of texts to transform
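The fit/transform interface above can be sketched without any Keras dependency. SimpleTokenizer below is a hypothetical minimal stand-in for keras.preprocessing.text.Tokenizer that shows the same contract: build a word index on the training data, then turn texts into fixed-length sequences of indices:

```python
class SimpleTokenizer:
    """Illustrative tokenizer mirroring the BaseTokenizer interface."""

    def __init__(self, pad_max_len):
        self.pad_max_len = pad_max_len
        self.word_index = {}

    def fit(self, train_data):
        # Assign each new word the next free index; 0 is reserved for padding
        for text in train_data:
            for word in text.split():
                if word not in self.word_index:
                    self.word_index[word] = len(self.word_index) + 1
        return self

    def transform(self, data):
        sequences = []
        for text in data:
            # Unknown words map to the padding index 0 in this sketch
            seq = [self.word_index.get(w, 0) for w in text.split()]
            # Pad or truncate to pad_max_len, like keras pad_sequences
            seq = (seq + [0] * self.pad_max_len)[: self.pad_max_len]
            sequences.append(seq)
        return sequences
```

Swapping in a tokenizer from another library only requires implementing the same `fit`/`transform`/`save`/`load` methods, which is the point of the BaseTokenizer abstraction.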

sentiment_classifier.nlp.utils module

sentiment_classifier.nlp.utils.load_word_vectors(filepath, word_index, vector_size)[source]

Load word embeddings from a file.

Parameters:
  • filepath (str) – path to the embedding file
  • word_index (dict) – word indices from the keras Tokenizer
  • vector_size (int) – embedding dimension, must match the trained word vectors
Returns:a matrix of size (len(word_index) × vector_size) that assigns each word to its learned embedding
Return type:embedding_matrix (np.ndarray)
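A hypothetical sketch of how such a loader typically works, assuming a GloVe-style text file (one word per line followed by its vector components). This is an illustrative approximation, not the package's actual code; note the sketch allocates one extra row because Keras word indices start at 1, with 0 reserved for padding:

```python
import os
import tempfile

import numpy as np

def load_word_vectors(filepath, word_index, vector_size):
    """Illustrative loader: fill an embedding matrix row per known word."""
    # Row 0 stays all-zero for the padding index
    embedding_matrix = np.zeros((len(word_index) + 1, vector_size))
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vector = parts[0], parts[1:]
            if word in word_index:
                embedding_matrix[word_index[word]] = np.asarray(vector, dtype="float32")
    return embedding_matrix

# Tiny demo file with 2-dimensional vectors
path = os.path.join(tempfile.mkdtemp(), "vectors.txt")
with open(path, "w") as f:
    f.write("good 0.1 0.2\nbad 0.3 0.4\n")

matrix = load_word_vectors(path, {"good": 1, "bad": 2}, vector_size=2)
```

The resulting matrix can be passed as the initial weights of an embedding layer, with each row `matrix[i]` holding the vector for the word whose index is `i`.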