Tracking a Historic Market Crash through Articles
Introduction and Motivation
Methodology
Data Collection
News Dataset
Bloomberg Businessweek is a business-oriented weekly magazine published by Bloomberg, a global multimedia corporation. Known for its in-depth coverage, analysis, and commentary on global business, finance, technology, markets, and economics, the magazine offers a broad view of industry trends, corporate strategies, and market dynamics. The dataset was collected from the Bloomberg website through its public APIs by Philippe Remy and Xiao Ding, and contains 450,341 news articles from 2006 to 2013.
Dictionaries
Data Preprocessing
Text Data Cleaning
As an initial step, the textual data is cleaned: numerical digits, special characters, and punctuation marks are removed from the text. This cleaning improves the consistency of the text and its suitability for the subsequent analyses.
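A minimal sketch of this cleaning step, assuming the raw article bodies are plain strings (the project's exact cleaning rules are not spelled out here):

```python
# Minimal cleaning sketch: lowercase, strip digits and punctuation, collapse whitespace.
import re

def clean_text(text: str) -> str:
    """Remove digits, punctuation, and special characters, then normalize whitespace."""
    text = text.lower()                       # normalize case
    text = re.sub(r"\d+", " ", text)          # drop numerical digits
    text = re.sub(r"[^a-z\s]", " ", text)     # drop punctuation and special characters
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

print(clean_text("Stocks fell 3.2% on Oct. 6, 2008 -- the Dow's worst week!"))
# -> "stocks fell on oct the dow s worst week"
```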
Text Tokenization and Lemmatization
After cleaning, the text is tokenized and lemmatized. Tokenization splits the text into individual words (tokens), while lemmatization reduces each token to its base or root form (e.g., "banks" becomes "bank"). This standardization limits the impact of inflected word forms on the analytical outcomes.
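The report does not name a specific NLP library; the sketch below uses NLTK as one common choice for this tokenize-then-lemmatize step:

```python
# Tokenize-then-lemmatize sketch using NLTK (one possible library choice).
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "punkt_tab", "wordnet"):  # tokenizer and lemmatizer data
    nltk.download(pkg, quiet=True)

lemmatizer = WordNetLemmatizer()

def tokenize_and_lemmatize(text: str) -> list[str]:
    """Split cleaned text into tokens and reduce each token to its base form."""
    return [lemmatizer.lemmatize(tok) for tok in word_tokenize(text)]

print(tokenize_and_lemmatize("banks were tightening lending standards"))
# e.g. ['bank', 'were', 'tightening', 'lending', 'standard']
```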
Stopwords Removal
Stopword removal excludes frequently occurring words that contribute little semantic value to the analysis, using a carefully curated stopword list supplemented with additional stopwords from [2]. Eliminating these extraneous words reduces noise in the text and yields a more accurate analysis.
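As a sketch of this step, the NLTK English stopword list can stand in for the curated base set; the finance-flavored additions below are hypothetical placeholders for the extra stopwords the report takes from its cited source:

```python
# Stopword-removal sketch; the added entries are illustrative placeholders.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

STOPWORDS = set(stopwords.words("english"))
STOPWORDS |= {"said", "also", "would", "year"}  # hypothetical domain additions

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop high-frequency words that add little semantic value."""
    return [tok for tok in tokens if tok not in STOPWORDS]

print(remove_stopwords(["the", "bank", "said", "it", "would", "cut", "lending"]))
# -> ['bank', 'cut', 'lending']
```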
Feature Extraction
TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a cornerstone technique in text mining for measuring the importance of terms within a collection of documents. It rests on two quantities: Term Frequency (TF) and Inverse Document Frequency (IDF).
- Term Frequency (TF) quantifies how often a term appears within an individual document: the more often a term occurs in a document, the higher its TF.
- Inverse Document Frequency (IDF) captures how distinctive a term is across the entire document corpus. It is the logarithm of the inverse fraction of documents containing the term, which separates common but uninformative terms from those of higher corpus-specific importance.
TF-IDF is the product of TF and IDF, combining a term's local importance within a single document with its global importance across the corpus.
In our research, TF-IDF is applied to evaluate the importance of selected financial terms within individual articles, prioritizing a term's contextual relevance in a specific article over its raw frequency in the entire corpus. To avoid information leakage, document frequencies are computed per day rather than over the full corpus. For a term w in an article a published on day t, let tf(a)_w be the frequency of the term in the article, N_t the number of articles published that day, and n_t ≤ N_t the number of that day's articles in which the term appears. The TF-IDF score is then defined as:
\[ \text{TF-IDF}(a)_w = \text{tf}(a)_w \times \log{\frac{N_t}{1 + n_t}} \]
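A short worked example of this daily scheme, with illustrative names: for each day t, both N_t and n_t are computed from that day's articles alone, which is what prevents leakage from future days.

```python
# Daily TF-IDF per the formula above: TF-IDF(a)_w = tf(a)_w * log(N_t / (1 + n_t)).
import math
from collections import Counter

def daily_tfidf(day_articles: list[list[str]], term: str) -> list[float]:
    """Score `term` in each of one day's tokenized articles."""
    N_t = len(day_articles)                                  # articles published on day t
    n_t = sum(term in article for article in day_articles)   # day's articles containing the term
    idf = math.log(N_t / (1 + n_t))
    return [Counter(article)[term] * idf for article in day_articles]

day = [["bank", "crisis", "bank"], ["market", "rally"], ["loan", "default"]]
print(daily_tfidf(day, "bank"))  # [2 * log(3/2), 0.0, 0.0] ≈ [0.811, 0.0, 0.0]
```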
Pretrained Models
Machine Learning Models for Prediction
Model Introduction
Training Settings
Model Integration
Results and Quality Assessments
Assessment & Model Selection
Results
Model Prediction
Limitations
Future Work
Project Plan and Milestones
Weekly Project Plan
Week | Tasks | Completion |
---|---|---|
Week 4 | | ✓ |
Week 5 | | ✓ |
Week 6 | | ✓ |
Week 7 | | ✓ |
Weeks 8–9 | | ✓ |
Week 10 | | ✓ |
Week 11 | | |
Week 12 | | |
Week 13 | | |
Week 14 | | |
Milestones
Milestone 1
- Draft a comprehensive project proposal outlining aims and objectives.
- Identify datasets with appropriate time granularity and relevant economic labels.
- Prepare and clean selected datasets for analysis.
Milestone 2
- Master the NLP processing workflow and techniques.
- Construct TF-IDF representations and sentiment indicators from the news data.
- Conduct preliminary model adjustments to enhance accuracy based on initial data.
Milestone 3
- Implement pre-trained models for sentiment analysis and integrate them into the project.
- Apply decision fusion techniques to optimize model performance.
- Refine and fine-tune the models based on the results and feedback.
Milestone 4
- Prepare the final presentation summarizing and visualizing the project findings and outcomes.
- Create and finalize content for the Wikipedia page, documenting the project.
- Conduct a thorough project review and ensure all documentation is complete and accurate.