Compare commits

2 Commits

Author SHA1 Message Date
746f6da5d5 Added workflow 2024-11-22 20:45:58 +01:00
994d5495b2 Provided links 2024-11-22 20:41:31 +01:00
2 changed files with 26 additions and 3 deletions
.gitea/workflows
README.md

@@ -0,0 +1,20 @@
name: pdf2text test
run-name: syntax check
on: [push]
jobs:
  Compile:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository code
        uses: actions/checkout@v4
      - name: List files in the repository
        run: |
          ls ${{ gitea.workspace }}
      - run: echo "Install dependencies."
      - run: apt update
      # -y answers the confirmation prompt, which would otherwise abort a non-interactive job.
      - run: apt install -y python3
      - run: python3 -m pip install -r requirements.txt
      # A bare quoted string is executed as a command and fails; echo prints it instead.
      - run: echo "Check if it starts correctly. Syntax check."
      - run: ./pdf2text .
      - run: echo "This job's status is ${{ job.status }}."
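The workflow above uses a full run of `./pdf2text .` as its syntax check. A lighter, parse-only check could look like the sketch below. It assumes `pdf2text` is a Python script (the README lists `python3` as a dependency, but the file itself is not part of this diff), and `syntax_errors` is a hypothetical helper name, not something from the repository.

```python
import ast
from pathlib import Path


def syntax_errors(path):
    """Return a one-line error description, or None if the file parses.

    The file is only parsed, never executed, so no PDFs are touched.
    """
    source = Path(path).read_text(encoding="utf-8")
    try:
        # A leading shebang line is an ordinary comment to the parser.
        ast.parse(source, filename=str(path))
    except SyntaxError as exc:
        return f"{path}:{exc.lineno}: {exc.msg}"
    return None
```

A workflow step could call this on the script and fail the job when the result is not `None`. It would catch syntax errors only, not a missing dependency, which the full run does catch.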

@@ -3,10 +3,10 @@
I've converted 8 GB of PDFs to text in one afternoon on a decade-old x270 using this script. Performant enough, imho. Try getting 8 GB into your LLM and getting it to actually use it. That's the challenge.
## Convert all PDFs to text
This is a script for converting a batch of PDFs to text for machine learning.
This is a [script](/pdf2text) for converting a batch of PDFs to text for machine learning.
It only has two dependencies:
- python3
- pdf.miner (Python requirement, specified in the requirements.txt file)
- `python3`
- `pdf.miner` (Python requirement, specified in the [requirements.txt](/requirements.txt) file)
## Installation
```bash
@@ -22,3 +22,6 @@ source .venv/bin/activate
./pdf2text [source/destination dir]
```
You read that correctly, the source directory is also the destination directory.
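To make the in-place behaviour concrete, here is a minimal sketch of how such a batch conversion could work. It is not the actual `pdf2text` implementation, which this diff does not show. The names `convert_in_place` and `main` are hypothetical, the recursive search and the skipping of already converted files are assumptions, and the extractor is assumed to be `extract_text` from `pdfminer.high_level` as shipped by pdfminer.six.

```python
from pathlib import Path
from typing import Callable


def convert_in_place(root: Path, extract: Callable[[str], str]) -> list[Path]:
    """Write a .txt file next to every .pdf found under root."""
    written = []
    # Note: the pattern is case-sensitive on most systems, so "X.PDF" is skipped.
    for pdf in sorted(root.rglob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        if txt.exists():
            continue  # assumption: an existing .txt means "already converted"
        txt.write_text(extract(str(pdf)), encoding="utf-8")
        written.append(txt)
    return written


def main(argv: list[str]) -> None:
    # Imported lazily so the function above stays usable without the dependency.
    from pdfminer.high_level import extract_text  # assumption: pdfminer.six

    for path in convert_in_place(Path(argv[1]), extract_text):
        print(path)
```

Calling `main(sys.argv)` from a script would mirror the `./pdf2text [source/destination dir]` usage. Passing the extractor in as a parameter keeps the directory walk testable without any real PDFs.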
## Todo:
Make a decent Python package so it's installable system-wide without having to activate the virtual environment first. Not sure it's worth it; it's not something you use daily.