Tracking a Historic Market Crash through Articles
Introduction and Motivation
Methodology
Data Collection
Bloomberg Businessweek is a weekly business and economics magazine published by Bloomberg, a global media company. It is known for in-depth coverage and analysis of global business, finance, technology, markets, and economics. The dataset was collected from the Bloomberg website through its public APIs by Philippe Remy and Xiao Ding, and contains 450,341 news articles published between 2006 and 2013.
Feature Extraction
TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) is a foundational text-mining technique. It measures how important a term is within a collection of documents by combining two quantities: Term Frequency (TF) and Inverse Document Frequency (IDF).
- Term Frequency (TF) measures how often a term occurs within an individual document; the more often the term appears, the higher its TF.
- Inverse Document Frequency (IDF) measures how informative a term is across the whole corpus. It is the logarithm of the inverse of the fraction of documents that contain the term, so terms that appear in most documents receive a low weight and rarer, more distinctive terms receive a high weight.
TF-IDF is the product of TF and IDF, combining a term's local importance within a single document with its global importance across the corpus.
In our project, TF-IDF is used to score the importance of selected financial terms within individual articles. This lets us identify pivotal terms (a) within specific articles (w), prioritizing their contextual relevance over their frequency within the entire corpus. To avoid information leakage, the document-frequency counts are computed per day: the weight of a term in an article depends only on the articles published on the same day, not on the whole 2006 to 2013 corpus. Let tf(a)_w be the frequency of term a in article w, N_t the number of articles published on day t, and n_t < N_t the number of articles published on day t in which the term appears. The TF-IDF score is then defined as:
\[ \text{TF-IDF}(a)_w = \text{tf}(a)_w \cdot \log\frac{N_t}{1 + n_t} \]
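The daily TF-IDF defined above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `daily_tfidf`, the `(date, tokens)` input layout, the toy articles, and the use of a raw count for tf(a)_w are all assumptions made for the example.

```python
import math
from collections import defaultdict


def daily_tfidf(articles, terms):
    """Score each term in each article with a per-day TF-IDF.

    articles: list of (date, tokens) pairs (hypothetical layout).
    terms:    the selected financial terms to score.
    Returns one dict per article mapping term -> score.
    """
    terms = list(terms)
    n_articles = defaultdict(int)    # N_t: articles published on day t
    n_containing = defaultdict(int)  # n_t: articles on day t containing the term

    for date, tokens in articles:
        n_articles[date] += 1
        present = set(tokens)
        for term in terms:
            if term in present:
                n_containing[(date, term)] += 1

    scores = []
    for date, tokens in articles:
        row = {}
        for term in terms:
            tf = tokens.count(term)  # assumption: tf(a)_w is a raw count
            idf = math.log(n_articles[date] / (1 + n_containing[(date, term)]))
            row[term] = tf * idf
        scores.append(row)
    return scores


# Toy data, invented purely for illustration.
articles = [
    ("2008-09-15", ["lehman", "bankruptcy", "bankruptcy", "market"]),
    ("2008-09-15", ["market", "rally"]),
    ("2008-09-15", ["fed", "rate"]),
    ("2008-09-16", ["bankruptcy", "aig"]),
]
scores = daily_tfidf(articles, ["bankruptcy", "market"])
```

Because the counts are grouped by date, an article's scores never depend on articles from other days. Note that with the `1 + n_t` term in the denominator, the logarithm becomes negative when a term appears in every article of a day (as on the single-article day in the toy data), so days with very few articles may need special handling.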
Pretrained Models
Machine Learning Models for Prediction
Model Introduction
Training Settings
Model Integration
Results and Quality Assessments
Assessment & Model Selection
Results
Model Prediction
Limitations
Future Work
Project Plan and Milestones
Weekly Project Plan
Week | Tasks | Completion |
---|---|---|
Week 4 | | ✓ |
Week 5 | | ✓ |
Week 6 | | ✓ |
Week 7 | | ✓ |
Weeks 8–9 | | ✓ |
Week 10 | | ✓ |
Week 11 | | |
Week 12 | | |
Week 13 | | |
Week 14 | | |
Milestones
Milestone 1
- Draft a comprehensive project proposal outlining aims and objectives.
- Identify datasets with appropriate time granularity and relevant economic labels.
- Prepare and clean selected datasets for analysis.
Milestone 2
- Master the NLP processing workflow and techniques.
- Construct TF-IDF representations and sentiment indicators from the news data.
- Conduct preliminary model adjustments to enhance accuracy based on initial data.
Milestone 3
- Implement pre-trained models for sentiment analysis and integrate them into the project.
- Apply decision fusion techniques to optimize model performance.
- Refine and fine-tune the models based on the results and feedback.
Milestone 4
- Prepare the final presentation summarizing and visualizing the project findings and outcomes.
- Create and finalize content for the Wikipedia page, documenting the project.
- Conduct a thorough project review and ensure all documentation is complete and accurate.