# PDF2Text
I converted 8 GB of PDFs to text in one afternoon on a decade-old x270 using this script. Performant enough, imho. Getting 8 GB into your LLM and getting it to actually use it: that's the challenge.
## Convert all PDFs to text
This is a [script](/pdf2text) for converting a batch of PDFs to text for machine learning.
It only has two dependencies:
- `python3`
- `pdf.miner` (Python requirement, specified in the [requirements.txt](/requirements.txt) file)
## Installation
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
## Usage
Activate your virtual environment.
```bash
source .venv/bin/activate
./pdf2text [source/destination dir]
```
You read that correctly: the source directory is also the destination directory.
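For the curious, here is a rough sketch of what "source is also destination" can look like in code. It is not the actual `pdf2text` script, just an illustration under a few assumptions: that the extraction uses `extract_text` from `pdfminer.high_level` (as shipped by `pdfminer.six`), that subdirectories are walked recursively, and that PDFs which already have a `.txt` next to them are skipped. The names `convert_directory` and `pdfminer_extract` are made up for this example.

```python
from pathlib import Path
from typing import Callable


def pdfminer_extract(pdf_path: Path) -> str:
    """Extract text from one PDF (assumes pdfminer.six is installed)."""
    # Imported lazily so the rest of the sketch loads without pdfminer.
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def convert_directory(
    directory: Path,
    extract: Callable[[Path], str] = pdfminer_extract,
) -> list[Path]:
    """Write a .txt file next to every PDF found under `directory`.

    Returns the list of text files written during this run.
    """
    written = []
    for pdf_path in sorted(directory.rglob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        if txt_path.exists():
            continue  # assumption: an earlier run already converted this one
        try:
            text = extract(pdf_path)
        except Exception as exc:
            # One broken PDF should not stop the whole batch.
            print(f"skipped {pdf_path}: {exc}")
            continue
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written
```

Calling `convert_directory(Path("papers"))` (with `papers` standing in for your own directory) would leave a `.txt` next to each `.pdf`, which is handy when you want to feed the text files to something else without keeping track of a second directory tree.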
## Todo
Make a decent Python package so it's installable system-wide without having to load the environment first. Not sure if it's worth it; it's not something you use daily.