# PDF2Text

I converted 8 GB of PDFs to text in one afternoon on a decade-old x270 using this script. Performant enough, imho. Getting 8 GB into your LLM and getting it to actually use it: that's the challenge.

## Convert all PDFs to text

This is a script for converting a batch of PDFs to text for machine learning.

It has only two dependencies:

- python3
- pdf.miner (Python package, listed in `requirements.txt`)

## Installation

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Usage

Activate your virtual environment, then run the script on a directory:

```bash
source .venv/bin/activate
./pdf2text [source/destination dir]
```

You read that correctly: the source directory is also the destination directory, so each text file ends up next to the PDF it came from.
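
For the curious, here is a rough sketch of what "source is also destination" can look like in Python. This is not the script itself. It assumes the pdf.miner dependency refers to the `pdfminer.six` package and its `pdfminer.high_level.extract_text` function. The names `convert_directory` and `pdfminer_extract`, the recursive search, and the skip-if-already-converted behaviour are all made up for illustration.

```python
from pathlib import Path
from typing import Callable


def pdfminer_extract(pdf_path: Path) -> str:
    """Extract text with pdfminer.six (assumed to be the pdf.miner dependency)."""
    # Imported lazily so the rest of the sketch works without the package.
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def convert_directory(
    directory: Path,
    extract: Callable[[Path], str] = pdfminer_extract,
) -> list[Path]:
    """Write a .txt file next to every .pdf under `directory`.

    Returns the list of text files written in this run.
    """
    written = []
    # Assumption: search subdirectories too; the real script may not.
    for pdf_path in sorted(directory.rglob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")  # same folder, same base name
        if txt_path.exists():
            continue  # assumption: do not redo files converted earlier
        try:
            text = extract(pdf_path)
        except Exception as exc:
            # One broken PDF should not stop a multi-gigabyte batch.
            print(f"skipping {pdf_path}: {exc}")
            continue
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written
```

Calling `convert_directory(Path("some/dir"))` would leave `report.txt` beside `report.pdf`, which is the behaviour described above. Passing the extractor as a parameter is only there so the loop can be tried out without any real PDFs.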