sentiment_classifier.nlp package¶
Subpackages¶
Submodules¶
sentiment_classifier.nlp.preprocessing module¶
Module for text pre-processing
We provide a basic text preprocessing function that does the following tasks:
- Removes HTML
- Surrounds punctuation and special characters with spaces
This function can be passed to a Reader instance when loading the dataset.
Note: we deliberately do not lowercase the text or remove special characters, because this information can make a difference when classifying sentiment. We also use word embeddings, and the embeddings differ between lowercase and uppercase words.
sentiment_classifier.nlp.preprocessing.clean_text(text)[source]¶
Function to clean a string. This function does the following:
- Remove HTML tags
- Surround punctuation and special characters with spaces
- Remove extra spaces
Parameters: text (str) – text to clean
Returns: the cleaned text
Return type: str
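The behaviour described above could be implemented roughly as follows; this is a sketch using `re`, and the package's actual implementation may differ:

```python
import re

def clean_text(text):
    """Clean a string: remove HTML tags, surround punctuation and
    special characters with spaces, and remove extra spaces (sketch)."""
    # Remove HTML tags such as the <br /> markers found in IMDB reviews
    text = re.sub(r"<[^>]+>", " ", text)
    # Surround punctuation and special characters with spaces
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Great movie!<br />A must-see."))
# → Great movie ! A must - see .
```

Note how the case of the words is preserved, matching the design decision above.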
sentiment_classifier.nlp.reader module¶
We are using the IMDB Large Movie Reviews dataset from Stanford AI.
It provides 50,000 movie reviews, split evenly between train and test sets and labelled as positive or negative.
We provide an abstract class Reader that we can subclass for each dataset.
We do this to standardise the dataset loading, and make it easy to use multiple datasets in the rest of the code with a common interface.
The IMDBReader class implements all the code needed to load the IMDB dataset.
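The base-class pattern might look roughly like the following sketch, built on `abc.ABC`; the stubbed `IMDBReader` body and the `(texts, labels)` return shape are illustrative assumptions:

```python
from abc import ABC, abstractmethod

class Reader(ABC):
    """Common interface for dataset loaders (sketch; the real base
    class may define additional helpers)."""

    def __init__(self, path):
        self.path = path  # root directory of the dataset

    @abstractmethod
    def load_dataset(self, limit=None, preprocessing_function=None):
        """Return the dataset, e.g. as (texts, labels)."""

class IMDBReader(Reader):
    def load_dataset(self, limit=None, preprocessing_function=None):
        # The real implementation walks the IMDB directory tree;
        # stubbed here for illustration.
        return [], []
```

Because `load_dataset` is abstract, `Reader` itself cannot be instantiated, which forces every dataset to implement the common interface.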
class sentiment_classifier.nlp.reader.IMDBReader(path)[source]¶
Bases: sentiment_classifier.nlp.reader.Reader
load_dataset(limit=None, preprocessing_function=None)[source]¶
Load the IMDB dataset.
This function can also:
- preprocess the texts using a custom function
- set a maximum number of files to load
Parameters:
- limit (int, optional) – Defaults to None. Maximum number of files to load.
- preprocessing_function (optional) – Defaults to None. Function for preprocessing the texts. No preprocessing by default.
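A possible shape for this loading logic, assuming the standard IMDB layout with `pos` and `neg` subfolders; the helper name and details are illustrative, not the package's actual code:

```python
import os

def load_imdb_split(split_dir, limit=None, preprocessing_function=None):
    """Read (text, label) pairs from an IMDB-style directory with
    'pos' and 'neg' subfolders (sketch)."""
    texts, labels = [], []
    for subdir, label in (("pos", 1), ("neg", 0)):
        folder = os.path.join(split_dir, subdir)
        for i, fname in enumerate(sorted(os.listdir(folder))):
            if limit is not None and i >= limit:
                break  # cap the number of files per class
            with open(os.path.join(folder, fname), encoding="utf-8") as f:
                text = f.read()
            if preprocessing_function is not None:
                text = preprocessing_function(text)
            texts.append(text)
            labels.append(label)
    return texts, labels
```

Passing `clean_text` from the preprocessing module as `preprocessing_function` would apply the cleaning step to every review as it is loaded.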
sentiment_classifier.nlp.tokenizer module¶
This module abstracts the tokenizer object, so that we can use tokenizers from different libraries and provide the same interface. Hence, we won’t need to change the rest of the code when changing tokenizers.
So far we only have one tokenizer, based on keras.preprocessing.text.Tokenizer.
class sentiment_classifier.nlp.tokenizer.BaseTokenizer[source]¶
Bases: abc.ABC
fit(train_data)[source]¶
Fit the tokenizer on the training data.
Parameters: train_data (list) – List of texts to fit the tokenizer on.
load(filepath)[source]¶
Load the tokenizer from disk.
Parameters: filepath (str) – Path to load the tokenizer from
Returns: the tokenizer itself, with loaded data
Return type: BaseTokenizer
class sentiment_classifier.nlp.tokenizer.KerasTokenizer(pad_max_len, lower=False, filters='tn')[source]¶
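To illustrate the common interface without depending on Keras, here is a minimal whitespace-based stand-in; the class name, the `transform` method, and the zero-padding scheme are illustrative assumptions, not the real `KerasTokenizer`:

```python
class WhitespaceTokenizer:
    """Illustrative stand-in showing the fit/transform shape a
    KerasTokenizer-style wrapper might expose (not the real class)."""

    def __init__(self, pad_max_len, lower=False):
        self.pad_max_len = pad_max_len
        self.lower = lower
        self.word_index = {}  # word -> integer id, ids start at 1

    def fit(self, train_data):
        """Build the vocabulary from a list of training texts."""
        for text in train_data:
            for word in (text.lower() if self.lower else text).split():
                self.word_index.setdefault(word, len(self.word_index) + 1)
        return self

    def transform(self, texts):
        """Map texts to fixed-length id sequences (0 = padding/unknown)."""
        sequences = []
        for text in texts:
            ids = [self.word_index.get(w, 0)
                   for w in (text.lower() if self.lower else text).split()]
            # Pad with zeros, or truncate, to exactly pad_max_len ids
            sequences.append((ids + [0] * self.pad_max_len)[:self.pad_max_len])
        return sequences
```

Swapping in a tokenizer from another library only requires implementing the same methods, which is the point of the abstraction.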
sentiment_classifier.nlp.utils module¶
sentiment_classifier.nlp.utils.load_word_vectors(filepath, word_index, vector_size)[source]¶
Load word embeddings from a file.
Parameters: - filepath (str) – path to the embedding file
- word_index (dict) – word indices from the keras Tokenizer
- vector_size (int) – embedding dimension, must match the trained word vectors
Returns: a matrix of shape (len(word_index) × vector_size) that maps each word to its learned embedding.
Return type: embedding_matrix (np.ndarray)
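A sketch of how such a loader might work, assuming a GloVe-style text file (one word followed by its vector per line); the extra padding row at index 0 is an assumption based on the Keras Tokenizer's 1-based word indices:

```python
import numpy as np

def load_word_vectors(filepath, word_index, vector_size):
    """Build an embedding matrix from a GloVe-style text file (sketch).
    Row 0 is reserved for padding, since Keras word indices start at 1."""
    embedding_matrix = np.zeros((len(word_index) + 1, vector_size))
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            # Keep only words present in the vocabulary, with a
            # vector of the expected dimension
            if word in word_index and len(values) == vector_size:
                embedding_matrix[word_index[word]] = np.asarray(
                    values, dtype="float32")
    return embedding_matrix
```

Words in `word_index` that do not appear in the embedding file keep all-zero rows, which a downstream embedding layer can treat as untrained.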