Project Timeline & Milestones
Week 4
- Explore and define possible approaches to the problem
- Explore the existing Guida Commerciale documents
Week 5
- Establish a pipeline for the project
- Explore pricing plans for OpenAI GPT models and Geocoding platforms
- Set up the GitHub repository
Week 6
- Begin experimenting with different OCR models and potential alternatives, such as PDF text extraction libraries
- Attempt to cluster pages based on different page types using an unsupervised learning approach
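As a rough illustration of the page-clustering idea in Week 6, the sketch below groups pages by two simple layout features using a small k-means routine written with the standard library only. The features (digit ratio, share of short lines), the 30-character threshold, and the helper names `page_features` and `kmeans` are assumptions made for this example; the timeline does not state which features or clustering algorithm the project actually used.

```python
import random


def page_features(text: str) -> tuple[float, float]:
    """Describe a page by (digit ratio, share of short lines). Hypothetical features."""
    chars = [c for c in text if not c.isspace()]
    lines = [ln for ln in text.splitlines() if ln.strip()]
    digit_ratio = sum(c.isdigit() for c in chars) / len(chars) if chars else 0.0
    # Lines under 30 characters are treated as "short" (an arbitrary threshold).
    short_ratio = sum(len(ln.strip()) < 30 for ln in lines) / len(lines) if lines else 0.0
    return (digit_ratio, short_ratio)


def kmeans(points: list[tuple[float, ...]], k: int,
           iterations: int = 20, seed: int = 0) -> list[int]:
    """Plain k-means: return one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(pt, centroids[j])))
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels
```

Given a list of page texts `pages`, `kmeans([page_features(p) for p in pages], k=3)` would return one label per page; the number of clusters is a free choice here and would need to be checked against the actual page types.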
Week 7
Week 8
- Create a new approach/pipeline that relies on basic domain knowledge to obtain better results
- Set up the semantic annotation platform, INCEpTION, for the new approach
Week 9
- Complete midterm presentation
- Run OCR with pytesseract
- Extract text using pdfplumber
- Compare the two text extraction approaches, choose the most suitable approach, and pre-process the text output
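The Week 9 comparison could be sketched as follows. This is a minimal example under stated assumptions, not the project's actual code: the helper names, the `"ita"` Tesseract language code, the 300 dpi render resolution, and the "word-like token ratio" quality score are all hypothetical choices made for illustration. The two third-party libraries are imported lazily so the scoring and pre-processing helpers can be used without them.

```python
import re


def extract_with_pdfplumber(pdf_path: str) -> list[str]:
    """Return the embedded text layer of each page ("" if a page has none)."""
    import pdfplumber  # third-party, imported lazily
    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]


def extract_with_tesseract(pdf_path: str, lang: str = "ita") -> list[str]:
    """Render each page to an image and run Tesseract OCR on it."""
    import pdfplumber
    import pytesseract  # requires the Tesseract binary and language data
    with pdfplumber.open(pdf_path) as pdf:
        return [
            pytesseract.image_to_string(
                page.to_image(resolution=300).original, lang=lang
            )
            for page in pdf.pages
        ]


def word_like_ratio(text: str) -> float:
    """Share of tokens that look like words: 2+ letters, optional trailing punctuation."""
    tokens = text.split()
    if not tokens:
        return 0.0
    word_like = [t for t in tokens if re.fullmatch(r"[^\W\d_]{2,}[.,;:]?", t)]
    return len(word_like) / len(tokens)


def pick_better(ocr_text: str, pdf_text: str) -> str:
    """Name the extraction whose output scores higher (ties go to pdfplumber)."""
    if word_like_ratio(ocr_text) > word_like_ratio(pdf_text):
        return "pytesseract"
    return "pdfplumber"


def preprocess(text: str) -> str:
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    text = re.sub(r"-\n(?=\w)", "", text)
    return re.sub(r"\s+", " ", text).strip()
```

A crude score like this favours pages of running text and penalises pages that are mostly numbers, so it is best treated as a first filter to complement manual inspection of sample pages.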
Week 10
Week 11
Week 12
Week 13
Week 14
- Complete the written deliverables and the GitHub repository
- Prepare the final presentation discussing the results
Limitations & Future Work
GitHub Repository
