Classification binaire de Tweets avec Azure Machine Learning

Tout d'abord un petit rappel:

"L'objectif de la classification supervisée est principalement de définir des règles permettant de classer des objets dans des classes à partir de variables qualitatives ou quantitatives caractérisant ces objets. Les méthodes s'étendent souvent à des variables Y quantitatives (régression). On dispose au départ d'un échantillon dit d'apprentissage dont le classement est connu. Cet échantillon est utilisé pour l'apprentissage des règles de classement. Il est nécessaire d'étudier la fiabilité de ces règles pour les comparer et les appliquer, évaluer les cas de sous apprentissage ou de sur apprentissage (complexité du modèle). On utilise souvent un deuxième échantillon indépendant, dit de validation ou de test."

Binary Classification: Twitter sentiment analysis

In this article, we'll explain how to to build an experiment for sentiment analysis using Microsoft Azure Machine Learning Studio. Sentiment analysis is a special case of text mining that is increasingly important in business intelligence and and social media analysis. For example, sentiment analysis of user reviews and tweets can help companies monitor public sentiment about their brands, or help consumers who want to identify opinion polarity before purchasing a product.

This experiment demonstrates the use of the Feature Hashing, Execute R Script and Filter-Based Feature Selection modules to train a sentiment analysis engine.
We use a data-driven machine learning approach instead of a lexicon-based approach, as the latter is known to have high precision but low coverage compared to an approach that learns from a corpus of annotated tweets. The hashing features are used to train a model using the Two-Class Support Vector Machine (SVM), and the trained model is used to predict the opinion polarity of unseen tweets. The output predictions can be aggregated over all the tweets containing a certain keyword, such as brand, celebrity, product, book names, etc in order to find out the overall sentiment around that keyword. The experiment is generic enough that you could use this framework to solve any text classification task given a reasonable amount of labeled training data.

Experiment Creation

The main steps of the experiment are:
  • Step 1: Get data
  • Step 2: Text preprocessing using R
  • Step 3: Feature engineering
  • Step 4: Split the data into train and test
  • Step 5: Train prediction model
  • Step 6: Evaluate model performance
  • Step 7: Publish prediction web service