Plain Text to Vision Models: Evolving My Expense Workflow
A few years ago, I went looking for a good app for budgeting and tracking expenses. I wanted something simple enough that I would actually use it.
Then I discovered the Plain Text Accounting approach and eventually hledger, and it immediately clicked for me. If you are a terminal lover or you like your data in plain text, hledger is a great fit. Your records live in ordinary text files: no proprietary format, no locked‑in database, and best of all, everything is Git‑friendly.
Using hledger quickly becomes natural once you grasp the basics of double‑entry accounting. You run `hledger add`, type in accounts and amounts with tab completion, and hledger appends the transaction to your journal file:
```
2026-03-02 Supermarket
    expenses:food          4.50 ; milk
    expenses:food          2.30 ; bread
    assets:checking
```
I type each item as a comment next to the price, so if I ever need to find something, a quick text search pulls it right up. In practice, this feels smoother than any graphical budgeting app I have tried, and it costs nothing except a bit of time. This simple workflow has worked reliably for years.
During a workflow review in December 2025, I decided to try a new approach that could save me time on the manual work. hledger is great, but it is not without limitations:
- Flexibility: I had to be at my desk to actually use hledger. It probably works fine with Termux, but I was never tempted to set it up.
- Chores: I would buy something during the day, stuff the receipt in my bag, and by evening I had zero interest in sitting down to type entries. The receipts piled up. Saturday mornings became catch‑up sessions — me and a stack of crumpled paper, working through the backlog.
The second issue had an obvious fix: use a VLM to digitize and structure the data from receipts directly, the same way I was already digitizing my notes with Google Gemini. I wanted to build a personal app around my own database, one flexible enough to swap models in case I ever had a problem with whatever I was using.
The idea was not novel. In fact, I had played with OCR before (ocr.space generously gives some free OCR API access, by the way). But it was always finicky: the photo had to be straight, the lighting had to be good, the text had to be clear, and even then it often stumbled. Using a VLM is different. It does more than read characters; it understands context. It knows what a receipt looks like, which number is the total, which one is the tax. With a good prompt, you get cleaned‑up, consistently formatted data.
So I wrote a small Python script at first: send the image to the model and get back structured data.
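A first version of such a script can be sketched roughly like this. The model name, prompt wording, and JSON schema below are my assumptions for illustration, not the author's actual code; the commented-out network call assumes a Gemini-style API.

```python
# Sketch: send a receipt photo to a VLM, get back structured JSON.
# The schema and model name here are illustrative assumptions.
import json
import re

PROMPT = (
    "Extract the data from this receipt photo and reply with JSON only, "
    'shaped like {"date": "YYYY-MM-DD", "merchant": str, '
    '"items": [{"name": str, "price": float}], "total": float}.'
)

def parse_receipt_reply(raw: str) -> dict:
    """Clean up a model reply (often wrapped in ```json fences)
    and check that the expected fields are present."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    data = json.loads(cleaned)
    for key in ("date", "merchant", "items", "total"):
        if key not in data:
            raise ValueError(f"missing field: {key}")
    return data

# The actual call would look something like this (hypothetical;
# needs the google-generativeai package and an API key):
#
#   import google.generativeai as genai
#   model = genai.GenerativeModel("gemini-1.5-flash")
#   reply = model.generate_content([PROMPT, receipt_image])
#   receipt = parse_receipt_reply(reply.text)
```

Keeping the parsing step separate from the API call is what makes the model swappable later: any VLM that can follow the prompt produces input the same parser understands.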
It worked, so I turned it into a proper backend with an API, then built a Vue frontend to use my phone camera. It was pretty cool: point the camera, send, and the extracted data lands directly in the app's database. The rest of the features came up organically. I added:
- categories, tags, and search features
- a dashboard
- budgeting features with spending progress and alerts
- a calendar with an expense heatmap, scheduling, and recurring expenses
- a natural language extraction feature
- …
By the end of January 2026, after some polishing and a security checkup, I had a full‑featured, production‑ready, AI‑powered expense management web app deployed on my VPS.
Now, when I have a receipt, I take a photo from the app… and it is done. I can do that whenever and wherever I am. No need to wait until I am in front of my computer anymore, and no need to manually put the items in hledger. The app handles everything automatically for me.
The receipt on my desk does not wait for Saturday anymore.
I built a hledger export feature, just in case. If for some reason I no longer have access to a VLM, I can still export my data and fall back to hledger. But so far, I have not needed it.
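An export like that is mostly string formatting. A minimal sketch, assuming records shaped like the extraction output above (the account names and record fields are my assumptions, not the app's real schema):

```python
# Sketch: render stored expense records back into hledger journal
# entries, one posting per item with the item name as a comment.
# Account names ("expenses:...", "assets:checking") are placeholders.
def to_hledger(records: list[dict]) -> str:
    entries = []
    for rec in records:
        lines = [f"{rec['date']} {rec['merchant']}"]
        for item in rec["items"]:
            lines.append(
                f"    expenses:{rec['category']}    "
                f"{item['price']:.2f} ; {item['name']}"
            )
        # Balancing posting; hledger infers the amount.
        lines.append("    assets:checking")
        entries.append("\n".join(lines))
    return "\n\n".join(entries) + "\n"
```

Because hledger journals are plain text, the output of this function can be appended straight onto an existing journal file and queried with the usual hledger commands.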