Data Ingestion of Guida Commerciale: Difference between revisions

From FDHwiki
(Created page with "== Introduction == == Project Timeline & Milestones == == Methodology == == Results == == Conclusion == == Appendix == == References ==")
 


== Project Timeline & Milestones ==
{|class="wikitable"
! style="text-align:center;"|Timeframe
! Task
! Completion
|-
| align="center" |Week 4
|
* Explore and define possible approaches to the problem
* Explore the existing Guida Commerciale documents
| align="center" |✓
|-
| align="center" |Week 5
|
* Establish a pipeline for the project
* Explore pricing plans for OpenAI GPT models and Geocoding platforms
* Set up the GitHub repository
| align="center" |✓
|-
| align="center" |Week 6
|
* Start to experiment with different OCR models and potential alternatives, such as PDF text extraction libraries
* Attempt to cluster pages based on different page types using an unsupervised learning approach
| align="center" |✓
|-
| align="center" |Week 7
|
* Autumn Break
| align="center" |✓
|-
| align="center" |Week 8
|
* Create a new approach/pipeline that relies on basic domain knowledge to obtain better results
* Set up the semantic annotation platform, INCEpTION, for the new approach
| align="center" |✓
|-
| align="center" |Week 9
|
* Complete midterm presentation
* Run OCR with Pytesseract
* Extract text using PDFPlumber
* Compare the two approaches, choose the most suitable approach, and pre-process the text output
| align="center" |✓
|-
| align="center" |Week 10
|
*
*
| align="center" |✓
|-
| align="center" |Week 11
|
*
*
| align="center" |✓
|-
| align="center" |Week 12
|
*
*
*
| align="center" |
|-
| align="center" |Week 13
|
*
*
*
| align="center" |
|-
| align="center" |Week 14
|
* Complete the written deliverables and the GitHub repository
* Prepare the final presentation discussing the results
| align="center" |
|-
|}


== Methodology ==

Revision as of 11:50, 28 November 2024

Introduction

Project Timeline & Milestones

{|class="wikitable"
! style="text-align:center;"|Timeframe
! Task
! Completion
|-
| align="center" |Week 4
|
* Explore and define possible approaches to the problem
* Explore the existing Guida Commerciale documents
| align="center" |✓
|-
| align="center" |Week 5
|
* Establish a pipeline for the project
* Explore pricing plans for OpenAI GPT models and Geocoding platforms
* Set up the GitHub repository
| align="center" |✓
|-
| align="center" |Week 6
|
* Start to experiment with different OCR models and potential alternatives, such as PDF text extraction libraries
* Attempt to cluster pages based on different page types using an unsupervised learning approach
| align="center" |✓
|-
| align="center" |Week 7
|
* Autumn Break
| align="center" |✓
|-
| align="center" |Week 8
|
* Create a new approach/pipeline that relies on basic domain knowledge to obtain better results
* Set up the semantic annotation platform, INCEpTION, for the new approach
| align="center" |✓
|-
| align="center" |Week 9
|
* Complete midterm presentation
* Run OCR with Pytesseract
* Extract text using PDFPlumber
* Compare the two approaches, choose the most suitable approach, and pre-process the text output
| align="center" |✓
|-
| align="center" |Week 10
|
| align="center" |✓
|-
| align="center" |Week 11
|
| align="center" |✓
|-
| align="center" |Week 12
|
| align="center" |
|-
| align="center" |Week 13
|
| align="center" |
|-
| align="center" |Week 14
|
* Complete the written deliverables and the GitHub repository
* Prepare the final presentation discussing the results
| align="center" |
|}
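
The Week 9 tasks (run OCR with Pytesseract, extract text with PDFPlumber, compare the two outputs) could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the helper names, the `ita` Tesseract language code, the use of a pre-rendered page image, and the choice of a `difflib` similarity ratio as the comparison metric are all assumptions made for the example. The two extraction helpers import their third-party libraries lazily, so the comparison part runs with the standard library alone.

```python
from difflib import SequenceMatcher


def ocr_image_text(image_path: str, lang: str = "ita") -> str:
    """OCR a page image with Tesseract (assumes the 'ita' language data is installed)."""
    import pytesseract      # third-party; also needs the Tesseract binary
    from PIL import Image   # third-party (Pillow)

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


def embedded_page_text(pdf_path: str, page_index: int) -> str:
    """Read the text layer of one PDF page with pdfplumber."""
    import pdfplumber  # third-party

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may yield nothing for pages without a text layer
        return pdf.pages[page_index].extract_text() or ""


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so layout differences do not dominate."""
    return " ".join(text.split())


def similarity(text_a: str, text_b: str) -> float:
    """Return a 0..1 character-level similarity between two extractions."""
    return SequenceMatcher(None, normalize(text_a), normalize(text_b)).ratio()
```

Under these assumptions, calling `similarity(ocr_image_text("page_12.png"), embedded_page_text("guida.pdf", 11))` (both file names hypothetical) gives one score per page. Pages with a low score are the ones where the two approaches disagree most and are worth inspecting by hand before choosing between them.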

Methodology

Results

Conclusion

Appendix

References