Data Ingestion of Guide Commerciale: Difference between revisions
== Project Timeline & Milestones ==
{|class="wikitable"
! style="text-align:center;"|Timeframe
! Task
! Completion
|-
| align="center" |Week 4
|
* Explore and define possible approaches to the problem
* Explore the existing Guida Commerciale documents
| align="center" |✓
|-
| align="center" |Week 5
|
* Establish a pipeline for the project
* Explore pricing plans for OpenAI GPT models and Geocoding platforms
* Set up the GitHub repository
| align="center" |✓
|-
| align="center" |Week 6
|
* Start to experiment with different OCR models and potential alternatives, such as PDF text extraction libraries
* Attempt to cluster pages based on different page types using an unsupervised learning approach
| align="center" |✓
|-
| align="center" |Week 7
|
* Autumn Break
| align="center" |✓
|-
| align="center" |Week 8
|
* Create a new approach/pipeline that relies on basic domain knowledge to obtain better results
* Set up the semantic annotation platform, INCEpTION, for the new approach
| align="center" |✓
|-
| align="center" |Week 9
|
* Complete midterm presentation
* Run OCR with Pytesseract
* Extract text using PDFPlumber
* Compare the two approaches, choose the most suitable approach, and pre-process the text output
| align="center" |✓
|-
| align="center" |Week 10
|
*
*
| align="center" |✓
|-
| align="center" |Week 11
|
*
*
| align="center" |✓
|-
| align="center" |Week 12
|
*
*
*
| align="center" |
|-
| align="center" |Week 13
|
*
*
*
| align="center" |
|-
| align="center" |Week 14
|
* Complete the written deliverables and the GitHub repository
* Prepare the final presentation discussing the results
| align="center" |
|-
|}
== Methodology ==
Revision as of 11:50, 28 November 2024
Introduction