Source code for sentiment_classifier.nlp.preprocessing

""" Module for text pre-processing

We provide a basic test preprocessing function, that does the following tasks:

 - Removes HTML
 - Surround punctuation and special characters by spaces

This function can be passed to a Reader instance when loading the dataset.

Note: we did not lowercase the sentence, or removed the special characters \
    on purpose. We think this information can make a difference in \
    classifying sentiments. We are also using Word Embeddings, \
    and the embeddings are different on lowercase vs uppercase words.
"""

import re


[docs]def clean_text(text):
    """ Function to clean a string.
    This function does the following:

    - Remove HTML tags
    - Surround punctuation and special characters by spaces
    - Remove extra spaces

    Args:
        text (str): text to clean

    Returns:
        text (str): the cleaned text

    """
    # remove html
    text = re.sub(string=text, pattern=r"<[^>]*>", repl="")
    # add spaces between special characters
    text = re.sub(string=text,
                  pattern=r"([$&+,:;=?@#|\"<>.^*()%!-])",
                  repl=r" \1 ")
    text = text.strip()

    return text