Understanding NLP and Topic Modeling

A beginner's guide for business students

What is Data Analysis?

Data analysis means examining large amounts of information in order to discover patterns, trends, and useful insights.

In business and finance, organizations collect huge amounts of data every day. Data analysis helps decision makers understand this information and make better strategic choices.

Example: A financial analyst may analyse thousands of news articles to understand which topics financial media are discussing.

In this project we analyse more than 100,000 financial news articles. Because it would be impossible for a human to read them all manually, we use computer programs to automatically analyse the text and identify the main topics discussed in financial media.

What is NLP?

Natural Language Processing (NLP) is a field of Artificial Intelligence that allows computers to understand human language.

Example: A computer reads thousands of financial news articles and identifies their main themes.

This allows financial analysts to analyse large volumes of news automatically.

What is Python?

Python is a programming language commonly used for data analysis and machine learning.

Example Python Code:
print("Hello World")

In this assignment Python is used to analyse large datasets of financial news articles.

What is Anaconda?

Anaconda is a software platform used to run Python for data science.

It comes with many data science tools already installed.

Think of Anaconda as a toolbox that already contains everything needed for data analysis.

What is Kaggle?

Kaggle is an online platform where students, researchers, and other users share datasets and run data science projects.

Example in this assignment: You download the dataset "US Financial News Articles" from Kaggle. The dataset contains more than 100,000 financial news articles stored in JSON format.

Financial News Dataset

The assignment uses a dataset of more than 100,000 financial news articles.

These articles are stored in JSON files.

Example JSON structure:
{
  "title": "Stock market rises",
  "date": "2022-05-10",
  "text": "Investors reacted positively..."
}
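As a rough illustration, one article in the shape shown above could be read into Python with the standard json module. The helper name parse_article is hypothetical, and the real dataset files may contain more fields than these three.

```python
import json

# One article written as JSON text, in the same shape as the example above.
raw_article = (
    '{"title": "Stock market rises", '
    '"date": "2022-05-10", '
    '"text": "Investors reacted positively..."}'
)


def parse_article(raw):
    """Convert a JSON string into a dictionary with title, date, and text.

    Hypothetical helper: missing fields are replaced by empty strings so
    that later steps do not fail on incomplete records.
    """
    record = json.loads(raw)
    return {
        "title": record.get("title", ""),
        "date": record.get("date", ""),
        "text": record.get("text", ""),
    }


article = parse_article(raw_article)
print(article["title"])  # Stock market rises
```

For a file on disk, json.load(open_file) does the same job as json.loads does for a string.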

Text Processing Pipeline

Before machine learning can be applied, the text must be cleaned.

This is done using spaCy.

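Steps include lowercasing the text, removing very common words such as "were", and reducing each word to its base form.

Example: "Stocks were rising quickly" becomes "stock rise quick".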
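The cleaning example above can be imitated in a few lines of plain Python. This is only a stand-in sketch: in the assignment spaCy does this work with trained language models, whereas the word lists STOP_WORDS and BASE_FORMS below are tiny hand-written assumptions made up for this one sentence.

```python
# Hand-written stand-ins (assumptions) for what spaCy would normally provide.
STOP_WORDS = {"were", "was", "is", "are", "the", "a", "an"}
BASE_FORMS = {"stocks": "stock", "rising": "rise", "quickly": "quick"}


def clean_text(text):
    """Lowercase the text, drop common words, and map words to a base form."""
    tokens = text.lower().split()
    # Remove punctuation attached to the start or end of each word.
    tokens = [token.strip('.,!?;:"') for token in tokens]
    # Drop empty strings and very common words that carry little meaning.
    tokens = [token for token in tokens if token and token not in STOP_WORDS]
    # Replace each word by its base form when the lookup table knows one.
    return " ".join(BASE_FORMS.get(token, token) for token in tokens)


print(clean_text("Stocks were rising quickly"))  # stock rise quick
```

A real pipeline would not rely on a lookup table, because no hand-written list can cover the vocabulary of 100,000 articles.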

Topic Modeling with LDA

LDA (Latent Dirichlet Allocation) is a machine learning algorithm that finds hidden topics inside documents.

Example topic (Banking): bank, interest, loan, credit, deposit

Each article can contain multiple topics with different probabilities.
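To make the idea of "multiple topics with different probabilities" concrete, here is a toy version of LDA written in plain Python using a method called collapsed Gibbs sampling. It is a teaching sketch, not the assignment's method: the five mini-documents, the function name toy_lda, and the settings alpha, beta, and iterations are all assumptions, and a real project would more likely use a library implementation such as the one in gensim.

```python
import random
from collections import defaultdict


def toy_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a tiny LDA model and return (document-topic probabilities, word counts)."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]       # words per topic in each document
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                     # total words per topic
    assignments = []

    # Start by giving every word a random topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            topic = rng.randrange(num_topics)
            doc_assignments.append(topic)
            doc_topic[d][topic] += 1
            topic_word[topic][word] += 1
            topic_total[topic] += 1
        assignments.append(doc_assignments)

    # Repeatedly re-assign each word, favouring topics that are already
    # common in its document and that already contain the same word.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    # Convert the counts into probabilities that sum to 1 for each document.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word


docs = [
    ["bank", "loan", "credit", "interest", "bank"],
    ["loan", "deposit", "bank", "credit"],
    ["stock", "market", "investor", "price"],
    ["market", "stock", "trading", "investor", "price"],
    ["bank", "interest", "stock", "market"],
]
theta, topic_word = toy_lda(docs, num_topics=2)
for d, probabilities in enumerate(theta):
    print(d, [round(p, 2) for p in probabilities])
```

Each printed row is one document, and each number is the share of that document attributed to a topic. The last document mixes banking and market words, so it is the one most likely to be split between topics.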

Model Evaluation

Different topic models must be compared.

Two common evaluation metrics are:
Coherence measures how meaningful the topics are.

Example good topic: market, stock, investor, trading, price
Example bad topic: market, banana, politics, car, cloud
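One simple way to put a number on coherence is to check how often the words of a topic appear together in the same documents. The sketch below follows the style of the UMass coherence measure; the seven mini-documents are invented for illustration, and the assignment may use a different coherence variant from a library.

```python
import math


def umass_coherence(topic_words, documents):
    """Score a topic by how often its words co-occur (higher is better)."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for words_in_doc in doc_sets
                   if all(word in words_in_doc for word in words))

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            earlier, later = topic_words[l], topic_words[m]
            earlier_count = doc_count(earlier)
            if earlier_count == 0:
                continue  # Skip pairs whose earlier word never appears.
            together = doc_count(earlier, later)
            score += math.log((together + 1) / earlier_count)
    return score


# Invented mini-corpus (assumption): three finance documents, four unrelated.
documents = [
    ["market", "stock", "investor", "price"],
    ["stock", "investor", "trading", "price"],
    ["market", "stock", "trading"],
    ["banana", "fruit"],
    ["politics", "election"],
    ["car", "engine"],
    ["cloud", "rain"],
]

good_topic = ["market", "stock", "investor", "trading", "price"]
bad_topic = ["market", "banana", "politics", "car", "cloud"]

print(umass_coherence(good_topic, documents) > umass_coherence(bad_topic, documents))  # True
```

The good topic scores higher because its words keep turning up in the same articles, while the words of the bad topic never appear together.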
