Gregg Shorthand Practice App: Investigation & Pipeline Plan

1. reMarkable Stroke Data Format

What You’re Working With

The reMarkable stores drawings in .rm binary files (currently v6 format since firmware 3.0, late 2022). Each notebook page gets its own .rm file, stored at /home/root/.local/share/remarkable/xochitl/<UUID>/<page-UUID>.rm.

Per-point data available:

Field	Type	Range	Notes
x	float	0–1404	Horizontal coordinate
y	float	0–1872	Vertical coordinate
speed	float	varies	Stylus velocity across surface
direction	float	radians	Tangent angle between consecutive points
width	float	varies	Effective brush width (accounts for tilt/pressure)
pressure	float	0.0–1.0	Pen-to-surface pressure
tilt	float	radians	Stylus angle to surface (0–π/2 and 3π/2–2π)

Points are grouped into strokes (continuous pen-down sequences), which belong to layers (up to 5 per page). Each stroke has metadata: pen type, color (black/grey/white), and base brush size.

This is vastly richer than raster image data. You get the temporal sequence of strokes, pressure dynamics, and pen angle — essentially the same signal space as “online handwriting recognition,” which is a much more tractable problem than offline OCR.

Key Python Libraries for Parsing

Library	Format	Status	Notes
rmscene	v6 (.rm)	Active	Best option for current firmware. Reads stroke data, text, layers. By Rick Lupton.
rmc	v6 (.rm)	Active	CLI converter built on rmscene. Exports to SVG, PDF, Markdown.
remarkable-layers	v3/v5	Unmaintained	Python API for older format.
rmtool	v3/v5	Active	Go library, also has Cloud API client.

rmscene (pip install rmscene) is the clear winner for our use case. It gives direct access to the stroke/point data structures in Python, which can then be fed into the ML pipeline.

Extraction workflow:

reMarkable tablet → SSH/Cloud API → .rm files → rmscene → [(x, y, pressure, tilt, speed, direction), ...] per stroke

Getting Files Off the Tablet

Three paths, all well-documented by the community:

SSH (direct USB or WiFi): scp root@10.11.99.1:/home/root/.local/share/remarkable/xochitl/<UUID>/*.rm ./
Cloud API: Use rmapi or the REST API to sync files programmatically
Export as PDF/SVG: Loses stroke-level data — avoid for training

Kaitai Struct Spec

Barry Van Tassell reverse-engineered the v6 format and published a full Kaitai Struct spec at github.com/YakBarber/remarkable_file_format. This is useful if you need to build tooling in languages other than Python, since Kaitai generates parsers for dozens of languages.

2. Prior Art: Gregg Shorthand Recognition

There is more prior work here than you might expect. The field is small but active.

Datasets

Dataset	Size	Type	Source	Notes
Gregg-1916	~15,700 word images	Offline (raster)	Extracted from 1916 Gregg Shorthand Dictionary	Publicly available at `github.com/anonimously/Gregg1916-Recognition`. Printed shorthand, not handwritten.
LION	Line-level stenography	Offline (raster)	Astrid Lindgren manuscripts (Melin shorthand, not Gregg)	Published 2024, available on Zenodo. First handwritten stenography HTR dataset.
StenogrApp dataset	48 images	Offline (raster)	48 basic Gregg brief forms	Very small, proof-of-concept only.

Critical gap: No existing dataset has online (stroke-level) Gregg shorthand data. Every dataset is raster images from scanned printed or handwritten sources. Our reMarkable pipeline would produce the first online Gregg shorthand dataset, which is a genuinely novel contribution.

Published Research

Zhai et al. (2018) — “A Dataset and a Novel Neural Approach for Optical Gregg Shorthand Recognition” (TSD 2018, Springer)

Created Gregg-1916 dataset from the 1916 dictionary
CNN feature extractor → bidirectional RNN decoder → word retrieval module
Key insight: Gregg is pronunciation-based, not spelling-based, so the mapping from shorthand to English is many-to-many
Code available at the GitHub repo above

Padilla et al. (2020) — “Deep Learning Approach in Gregg Shorthand Word to English-Word Conversion”

Used Inception-v3 (transfer learning) on 135 legal terms in Gregg shorthand
TensorFlow-based, word-level classification approach

StenogrApp (2024) — “E-Learning Android Application in Recognition of Basic Gregg Shorthand using Machine Learning” (ICETT 2024)

Most relevant to our concept: an e-learning app for Gregg shorthand using ML
Used k-Nearest Neighbors on 48 brief forms, achieved 86% precision
Android app with API for cross-platform use
Limited scope (only brief forms), but validates the educational app concept

Heil & Breznik (2024) — “Handwritten stenography recognition and the LION dataset” (IJDAR)

First baseline for handwritten stenography recognition
CER of ~25%, WER of ~45-48% with stenography-specific encodings + pretraining
Key finding: integrating domain knowledge (stenographic theory) into target sequence encoding significantly improves results
Not Gregg (it’s Melin/Swedish), but the methodology transfers

Rajasekaran & Ramar (2012) — “Handwritten Gregg Shorthand Recognition”

Earlier work using PCA + Logistic Regression, also explored CANN and backpropagation
Focused on both character-level and word-level recognition

Practice Apps (Non-Shorthand)

No dedicated shorthand practice app with AI feedback exists. General handwriting practice apps:

Writey (iOS): Real-time feedback on handwriting with Apple Pencil. Closest analog to what we’re building, but for standard alphabets.
Kaligo (tablets): AI-powered handwriting for kids, uses on-device stroke analysis
MyScript Notes / Nebo: Best-in-class handwriting recognition (66 languages), but recognition not training
Handwriting Success: Getty-Dubay curriculum on iPad with stylus practice

None of these handle shorthand, and none integrate with the reMarkable specifically.

3. Feasibility Assessment

Why This Is More Tractable Than It Looks

Online vs. Offline: We have stroke data, not raster images. Online handwriting recognition is substantially easier — you get temporal ordering, stroke segmentation for free, and rich per-point features.
Curriculum-constrained recognition: We don’t need a general Gregg recognizer. Unit 1 has ~10-15 distinct strokes. Unit 2 adds a handful more. At any given lesson, the classification space is tiny.
Known vocabulary per lesson: Each unit introduces specific words. We can constrain the decoder to only output words from the current unit’s vocabulary — a massive reduction in search space.
Synthetic data generation: Gregg strokes are well-defined geometric primitives (arcs, lines, circles, hooks). We can parameterize them and generate thousands of synthetic training examples with controlled variation.

Where It Gets Hard

Proportional strokes: The same curve shape at different sizes means different letters. This requires the model to learn relative sizing, not just shape.
Contextual joining: Later units introduce blends where strokes flow into each other. Segmenting these requires understanding stroke boundaries in continuous writing.
Writer variation: Even with stroke data, everyone’s “a” circle will be slightly different. Need sufficient examples per writer to generalize.
Feedback quality: Recognizing what someone wrote is easier than telling them how to improve. The feedback mechanism needs careful design.

4. Training Pipeline Design

Architecture Overview

┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────────────┐
│ Data Ingest  │───▶│ Preprocessing │───▶│  Training     │───▶│  Serving    │
│              │    │              │    │              │    │             │
│ • rmscene    │    │ • Normalize  │    │ • Per-unit   │    │ • FastAPI   │
│ • Synthetic  │    │ • Segment    │    │   models     │    │ • ONNX      │
│   generator  │    │ • Augment    │    │ • Transfer   │    │ • Feedback  │
│ • Gregg-1916 │    │ • Feature    │    │   learning   │    │   engine    │
│   (raster)   │    │   extract    │    │              │    │             │
└─────────────┘    └──────────────┘    └──────────────┘    └─────────────┘

Cycle 0: Data Foundation

Deliverable: A Python package that extracts stroke data from reMarkable .rm files and converts it to a normalized, ML-ready format.

Tasks:

Build extraction pipeline using rmscene: .rm → list of strokes → list of (x, y, pressure, tilt, speed, time) tuples
Normalize coordinates to [0, 1] range (dividing by 1404 and 1872)
Implement stroke segmentation (split page-level data into individual glyph attempts)
Define data schema (probably a simple JSON or Parquet format per practice session)
Build a synthetic stroke generator that creates parameterized Gregg primitives:
- Circles (vowels: a, e, o) at varying sizes
- Straight lines (consonants: t, d, n, m) at varying lengths/angles
- Curves (r, l, k, g) with controlled curvature
- Add realistic noise: jitter, pressure variation, speed variation
Validate by rendering synthetic strokes back to SVG and visually confirming they look like Gregg

Open Source Tools:

rmscene — .rm file parsing
numpy — numerical operations
svgwrite or matplotlib — rendering/validation
pydantic — data schema validation

Cycle 1: Stroke-Level Classifier (Unit 1)

Deliverable: A model that classifies individual Gregg strokes from Unit 1 with >90% accuracy on held-out test data.

Tasks:

Define Unit 1 stroke vocabulary from the greggshorthand.github.io curriculum (~10-15 classes)
Create training data: mix of synthetic strokes + handwritten samples on reMarkable
Implement feature extraction from stroke sequences:
- Geometric features: total arc length, bounding box aspect ratio, start/end angles, curvature statistics, stroke height/width ratio
- Raw sequence features: padded/interpolated (x, y, pressure) sequences for neural network input
Train two parallel approaches and compare:
1. Classical ML baseline: Extract geometric features → Random Forest or SVM (scikit-learn). Fast to iterate, good baseline.
2. Sequence model: 1D CNN or small LSTM on interpolated point sequences (PyTorch). Better ceiling.
Evaluate with k-fold cross-validation (small dataset, so need to be careful about splits)
Implement stroke-to-label prediction API

Open Source Tools:

scikit-learn — classical ML baseline
PyTorch — neural network models
wandb or mlflow — experiment tracking

Cycle 2: Word-Level Recognition

Deliverable: A model that maps a sequence of strokes to an English word from the current unit’s vocabulary, with top-3 accuracy >85%.

Tasks:

Implement word-level segmentation: given a page of practice, identify individual word attempts (likely gap-based heuristic on x-coordinates between strokes)
Build word recognition as sequence classification:
- Input: ordered sequence of stroke classifications from Cycle 1
- Output: English word from unit vocabulary
- Approach: CTC loss over stroke sequence → word, constrained to unit vocabulary
Alternatively (simpler): treat each word attempt as a single stroke sequence and classify holistically
Integrate the Gregg-1916 dataset as supplementary training data:
- These are raster images, so use a CNN to extract features
- Use as a transfer learning source: pretrain on Gregg-1916, fine-tune on online stroke data
Build vocabulary constraint: given a unit number, restrict output space to only valid words for that unit

Open Source Tools:

PyTorch — CTC loss, sequence models
torchvision — for Gregg-1916 raster image preprocessing
Pillow / OpenCV — image processing for raster data

Cycle 3: Feedback Engine

Deliverable: A system that compares a user’s stroke to a reference and produces actionable feedback (e.g., “your ‘a’ circle is too large relative to the ‘n’ curve”).

Tasks:

Define reference strokes: canonical representations of each Gregg primitive with acceptable tolerance bands
Implement comparison metrics:
- Dynamic Time Warping (DTW): Align user stroke to reference, identify where deviation is largest
- Fréchet distance: Overall shape similarity
- Proportional analysis: Compare stroke sizes relative to each other (critical for Gregg)
Generate natural language feedback from metric deviations:
- Size too large/small → “Make your [stroke] about 2/3 the height of your [other stroke]”
- Curvature wrong → “This curve should be more/less pronounced”
- Angle off → “The starting angle should be steeper”
Build a scoring rubric per unit that weights different aspects of stroke quality
Could also use an LLM to generate more nuanced feedback from structured metric data — this is where a Claude API call per evaluation could add real value

Open Source Tools:

dtaidistance or tslearn — DTW implementation
scipy — Fréchet distance, curve analysis
jinja2 — feedback template rendering

Cycle 4: App Integration

Deliverable: End-to-end workflow: practice on reMarkable → export → get feedback in web UI.

Tasks:

Build PDF template generator for practice sheets (guided lines, reference strokes, practice areas)
Implement file upload/processing pipeline:
- User exports .rm file (or syncs via cloud)
- Server parses with rmscene
- Runs through recognition + feedback pipeline
- Returns results via web UI
Build a simple web frontend (could be a React app) showing:
- Rendered version of what user drew
- Side-by-side with reference
- Feedback annotations
- Progress tracking per unit
Package model for serving: export to ONNX for fast inference

Open Source Tools:

FastAPI — API server
reportlab or fpdf2 — PDF template generation
onnxruntime — model serving
React / Next.js — web frontend

5. Data Strategy: Bootstrapping from Nothing

The cold start problem is real but solvable. Here’s the progression:

Phase 1 — Synthetic only (0 real samples needed): Parameterize each Gregg primitive mathematically. Generate 1,000+ examples per stroke class with controlled variation. Train initial model. This gets you a working prototype.

Phase 2 — Self-play (~50-100 real samples per class): As you practice on the reMarkable, feed your own writing through the pipeline. Label it (you know what you were trying to write). Fine-tune from synthetic baseline.

Phase 3 — Additional writers (~50-100 real samples per class): A second writer’s data dramatically improves generalization. If another practitioner is willing to write a few pages of practice forms, that’s invaluable.

Phase 4 — Community (if the tool gains traction): If open-sourced, the r/shorthand community + Gregg enthusiasts could contribute labeled samples. The curriculum structure means you can crowdsource labels cheaply (“I was practicing Unit 3, here are my .rm files”).

6. Key Risks & Mitigations

Risk	Likelihood	Impact	Mitigation
reMarkable firmware update breaks .rm format	Medium	High	Pin to rmscene library, which tracks format changes. The community has adapted to every format change so far (v3→v5→v6).
Insufficient training data for generalization	High	Medium	Synthetic data + curriculum constraints drastically reduce data requirements. Start with Unit 1 (tiny classification space).
Proportional stroke discrimination too hard	Medium	High	Use relative features (ratios) rather than absolute sizes. The reMarkable’s consistent DPI helps here.
Feedback quality feels unhelpful	Medium	Medium	Start with simple pass/fail + DTW visualization. Iterate based on your own experience using it.
reMarkable restricts cloud API access	Low	Medium	SSH extraction always works (they’ve committed to keeping SSH access). Community has survived multiple API changes.

7. Recommended Tech Stack Summary

Component	Tool	Why
.rm parsing	rmscene (Python)	Best maintained v6 parser
ML framework	PyTorch	Best ecosystem for sequence models, good ONNX export
Classical ML	scikit-learn	Fast iteration for baselines
Experiment tracking	Weights & Biases or MLflow	Track training runs
Sequence alignment	dtaidistance / tslearn	DTW for feedback
API server	FastAPI	Fast, typed, async
Model serving	ONNX Runtime	Cross-platform, fast inference
PDF generation	fpdf2 or reportlab	Practice template generation
Data storage	Parquet (training data), SQLite (app state)	Efficient columnar storage for stroke data
Frontend	React + Tailwind	Standard, or even a simple Gradio app for prototyping

8. Summary

This project is feasible and genuinely novel — nobody has built a curriculum-aware shorthand practice app with AI feedback using online stroke data. The reMarkable’s rich stroke format is a massive advantage over raster-based approaches. The Gregg-1916 dataset and StenogrApp paper validate that ML-based shorthand recognition works, and the curriculum structure of the greggshorthand.github.io course gives a natural way to constrain the problem to tractable sub-problems.

The biggest risk is scope creep. Start with Cycle 0 (data extraction) and Cycle 1 (stroke classification for Unit 1 only). If those work, everything else follows incrementally.