Adds a fire-and-forget endpoint that schedules poll_once on the running event loop and returns 202 immediately, so clients don't need `docker exec ... python poller.py --once` to fetch new messages now. A process-wide asyncio.Lock (_poll_lock in archive.py, exposed via poll_once_locked) serializes the manual trigger with the scheduled poll, so a manual POST and a cron tick can never overlap (concurrent polls would contend for the single Telegram session and SQLite writes). Both the scheduler (poller._scheduled_poll) and the endpoint go through poll_once_locked; --once is unchanged since it runs in its own process. Also restores `import hmac` in api.py, which had been lost in a prior edit and was breaking the auth middleware (NameError on every /api/* route); the end-to-end test for this endpoint caught it.
tg-archive
A personal Telegram archive that runs as a Docker container and archives one
specific direct-message conversation — the one identified by TARGET_CHAT
(@username, +phone, or numeric user id). Every incoming and outgoing message
in that single DM is stored to a SQLite database, with media downloaded to a
volume. It does not touch your other chats, groups, or channels.
The Telegram client connects only during scheduled polls (hourly by default) and stays disconnected between them, so you don't leak continuous "online"/last-seen presence. Both message directions are captured, with up to one schedule-interval of latency.
Two deliverables:
- local/: a session extractor you run once on your personal machine (Windows or macOS, where Telegram Desktop lives) to mint a Pyrogram StringSession from its tdata. No phone/SMS is ever needed on the server.
- server/: the deployed archive container that uses that session string.
How it works
┌─────────────────┐ extract once    ┌─────────────────────────────────────┐
│ Telegram        │ ──────────────► │ Dokploy / Docker host               │
│ Desktop (Win)   │ session string  │                                     │
│ + tdata         │                 │ tg-archive container                │
└─────────────────┘                 │  ├─ APScheduler (cron, in-process)  │
                                    │  │   └─ every tick: connect → poll  │
                                    │  │      TARGET_CHAT only → download │
                                    │  │      media → disconnect          │
                                    │  ├─ SQLite → /data/archive.db       │
                                    │  └─ media → /media/<chat>/<ym>/     │
                                    └─────────────────────────────────────┘
Each poll:
- Connects the user session.
- Resolves TARGET_CHAT with get_chat().
- First poll → fetches the newest BASELINE_LIMIT (default 200) messages. Later polls → fetches newest-first and keeps only id > last_seen_message_id until it reaches the watermark (auto-paginated; POLL_LIMIT is a safety cap).
- Upserts each message by composite key (chat_id, message_id), so re-runs are idempotent.
- Downloads media for messages that have it, storing the path relative to /media.
- Advances the watermark, records last_polled_at, disconnects.
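The incremental step above can be sketched as a pure function over a newest-first message stream. This is a minimal illustration, not the code in archive.py: the Msg type and the function names new_messages and advance_watermark are hypothetical, and the real routine reads from Pyrogram rather than a list.

```python
from typing import Iterable, List, NamedTuple


class Msg(NamedTuple):
    """Hypothetical stand-in for a fetched message; only the id matters here."""
    id: int
    text: str


def new_messages(history: Iterable[Msg], last_seen_id: int, poll_limit: int) -> List[Msg]:
    """Collect messages newer than the watermark from a newest-first stream."""
    fresh: List[Msg] = []
    for msg in history:
        if msg.id <= last_seen_id:
            break  # reached the watermark: everything older is already archived
        fresh.append(msg)
        if len(fresh) >= poll_limit:
            break  # safety cap, playing the role POLL_LIMIT has in the text
    fresh.reverse()  # oldest first, so rows are stored in chronological order
    return fresh


def advance_watermark(last_seen_id: int, fresh: List[Msg]) -> int:
    """Return the new high-water mark; unchanged when nothing new arrived."""
    return max([last_seen_id] + [m.id for m in fresh])


history = [Msg(105, "e"), Msg(104, "d"), Msg(103, "c"), Msg(102, "b")]
fresh = new_messages(history, last_seen_id=103, poll_limit=10000)
print([m.id for m in fresh], advance_watermark(103, fresh))
```

Because the loop stops at the first id at or below the watermark, a long outage only means more iterations, never a gap.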
Repository layout
tg-archive/
├── docker-compose.yml     Dokploy / docker-compose entry point
├── local/                 run on your machine (Win/macOS), NOT deployed
│   ├── requirements.txt
│   ├── extract.py         tdata → StringSession (auto-detects tdata path)
│   └── verify.py          proves the session is live (get_me)
└── server/                the deployed container
    ├── Dockerfile
    ├── requirements.txt
    ├── config.py          env → config, fast-fail on missing vars
    ├── models.py          chats + messages schema, init_db()
    ├── archive.py         poll_once(), the core archive routine
    ├── api.py             read-only REST API (aiohttp, opt-in via API key)
    └── poller.py          entrypoint: --once | scheduled loop (+ API host)
Part 1 — Extract the session (Windows or macOS)
The session is extracted once from Telegram Desktop's tdata. Get your API
credentials at https://my.telegram.org (API ID + API hash).
macOS users: this requires the cross-platform Telegram Desktop app (from https://desktop.telegram.org), not the native macOS "Telegram" app from the App Store. The native app stores its data in a different, incompatible format. Install Telegram Desktop and sign in there once before extracting.
- Fully close Telegram Desktop; it locks the live tdata folder.
  - Windows: quit from the tray and confirm no Telegram.exe in Task Manager.
  - macOS: quit (⌘Q) and confirm it isn't running (Activity Monitor, or pgrep -fl "Telegram Desktop").
- Install dependencies and run the extractor:
  cd local
  pip install -r requirements.txt
  python extract.py --api-id 12345 --api-hash abcdef
  - The default --tdata path is auto-detected per OS:
    - Windows: %APPDATA%\Telegram Desktop\tdata
    - macOS: ~/Library/Application Support/Telegram Desktop/tdata
    - Linux: ~/.local/share/Telegram Desktop/tdata
  - For a portable or custom install, pass the folder explicitly with --tdata PATH.
  - If you set a Telegram Desktop local passcode, add --passcode <PIN>.
  - The extractor copies tdata to ./tdata_copy and parses the copy; the live folder is never parsed in place.
  - It reads the copy with tdata-reader (extracting the MTProto auth_key, dc_id, and user_id) and packs them into a Pyrogram v2 StringSession, the exact format the server's pyrogram 2.0.106 expects. This works on Python 3.8–3.14 (pyrogram and TGConvertor don't import on 3.14, so the local tools avoid them).
- You'll see (copy these):
  SESSION=<long string>
  USER_ID=<your telegram id>
  DC_ID=<datacenter id>
  It also writes local/session.txt (the Pyrogram string; paste its first line into TG_SESSION_STRING), local/telethon-session.txt (used by verify.py), and local/api.env. All are gitignored because they contain secrets.
- Verify the session is live before shipping it:
  python verify.py
  verify.py logs in via Telethon (the same MTProto auth_key works for any client library) and prints your @username, id, name, and phone. A successful Telethon login confirms the Pyrogram string is live for the server too. If it fails, the server will fail, so fix it here first.
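For orientation, the packing step can be sketched with the standard library alone. The struct layout below is an assumption recalled from pyrogram 2.0.x's storage code, not something taken from extract.py; check it against the pyrogram source before relying on it. The names pack_session and unpack_session are hypothetical.

```python
import base64
import struct

# Assumed big-endian layout: dc_id (uint8), api_id (uint32), test_mode (bool),
# auth_key (256 bytes), user_id (uint64), is_bot (bool).
SESSION_FORMAT = ">BI?256sQ?"


def pack_session(dc_id: int, api_id: int, auth_key: bytes, user_id: int) -> str:
    """Pack the extracted fields into a URL-safe base64 string without padding."""
    if len(auth_key) != 256:
        raise ValueError("auth_key must be exactly 256 bytes")
    packed = struct.pack(SESSION_FORMAT, dc_id, api_id, False, auth_key, user_id, False)
    return base64.urlsafe_b64encode(packed).decode().rstrip("=")


def unpack_session(session: str) -> tuple:
    """Inverse of pack_session: restore the padding, decode, and unpack."""
    padded = session + "=" * (-len(session) % 4)
    return struct.unpack(SESSION_FORMAT, base64.urlsafe_b64decode(padded))


# Dummy values only; a real auth_key is a secret and must never be printed.
demo = pack_session(dc_id=2, api_id=12345, auth_key=bytes(256), user_id=403303563)
print(len(demo), unpack_session(demo)[0])
```

A round trip through unpack_session is a cheap sanity check, but only a real login (as verify.py does) proves the key is live.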
Part 2 — Run the archive container
Environment variables
| Variable | Required | Default | Description |
|---|---|---|---|
| TG_API_ID | yes | - | Telegram API id |
| TG_API_HASH | yes | - | Telegram API hash |
| TG_SESSION_STRING | yes | - | SESSION= value from Part 1 (mark as secret) |
| TARGET_CHAT | yes | - | @username, +phone, or numeric id of the one DM |
| SCHEDULE | no | 0 * * * * | cron for polls (hourly default) |
| BASELINE_LIMIT | no | 200 | messages fetched on the first poll |
| POLL_LIMIT | no | 10000 | incremental safety cap per poll |
| DB_PATH | no | /data/archive.db | SQLite path |
| MEDIA_DIR | no | /media | media download root |
| ARCHIVE_API_KEY | no | - | if set, serves the read-only REST API (mark as secret) |
| ARCHIVE_API_HOST | no | 0.0.0.0 | API bind address |
| ARCHIVE_API_PORT | no | 8080 | API port inside the container |
SCHEDULE is a 5-field crontab evaluated in UTC. 0 * * * * = top of every
hour. Missing any required variable fails the container immediately on startup
(visible in logs).
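The fast-fail behaviour described above could look roughly like the following. This is a sketch under assumptions, not the contents of config.py: the Config dataclass, its field names, and load_config are hypothetical, and only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED = ("TG_API_ID", "TG_API_HASH", "TG_SESSION_STRING", "TARGET_CHAT")


@dataclass(frozen=True)
class Config:
    api_id: int
    api_hash: str
    session_string: str
    target_chat: str
    schedule: str
    baseline_limit: int
    poll_limit: int
    db_path: str
    media_dir: str
    api_key: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Build the config, raising at once if any required variable is absent or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))
    return Config(
        api_id=int(env["TG_API_ID"]),
        api_hash=env["TG_API_HASH"],
        session_string=env["TG_SESSION_STRING"],
        target_chat=env["TARGET_CHAT"],
        schedule=env.get("SCHEDULE", "0 * * * *"),
        baseline_limit=int(env.get("BASELINE_LIMIT", "200")),
        poll_limit=int(env.get("POLL_LIMIT", "10000")),
        db_path=env.get("DB_PATH", "/data/archive.db"),
        media_dir=env.get("MEDIA_DIR", "/media"),
        api_key=env.get("ARCHIVE_API_KEY") or None,  # unset or empty disables the API
    )
```

Collecting every missing name before raising means a single log line tells you everything that still needs to be set.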
Quick local run (Docker)
Docker Compose reads a .env file in the project root automatically — the
easiest way to test locally. Copy the template, fill it in, and bring it up:
cp .env.example .env
# edit .env: set TG_API_ID, TG_API_HASH, TG_SESSION_STRING (first line of
# local/session.txt), and TARGET_CHAT
docker compose up --build -d
docker compose logs -f tg-archive # "scheduler started" + cron polls
docker compose exec tg-archive python poller.py --once # manual fire
.env is gitignored (it holds the TG_SESSION_STRING secret). You can also
export the variables inline in your shell instead:
TG_API_ID=12345 TG_API_HASH=abcdef \
TG_SESSION_STRING='<paste>' TARGET_CHAT=@someuser \
docker compose up --build -d
Deploy on Dokploy
- Create an application from this Git repository, compose file at the repo root.
- In the Dokploy UI, set the env vars above (mark TG_SESSION_STRING as a secret).
- Set SCHEDULE if you don't want hourly.
- Deploy. The ./data and ./media bind mounts persist on the Dokploy host across redeployments.
Single poll without Docker (host cron)
You can also run the server code directly and drive it with your own cron:
python server/poller.py --once # one poll, then exit 0
Database schema
SQLite at /data/archive.db.
chats (one row — the single archived DM):
| column | type | notes |
|---|---|---|
| id | BigInteger | PK, the Telegram chat/user id |
| type | String | private (for a DM) |
| title | String | nullable |
| username | String | nullable |
| last_seen_message_id | BigInteger | high-water mark; 0 = nothing archived |
| first_polled_at | DateTime | nullable |
| last_polled_at | DateTime | nullable; poll heartbeat |
messages — composite primary key (chat_id, message_id):
| column | type | notes |
|---|---|---|
| chat_id | BigInteger | PK part 1 |
| message_id | Integer | PK part 2 |
| date | DateTime | message send time (UTC) |
| direction | String | in or out |
| sender_id | BigInteger | nullable |
| sender_name | String | nullable |
| text | Text | msg.text or msg.caption |
| has_media | Boolean | |
| media_type | String | photo, video, voice, video_note, audio, document, sticker, or animation |
| file_id | String | nullable |
| file_unique_id | String | nullable |
| file_name | String | nullable |
| mime_type | String | nullable |
| size | Integer | nullable |
| local_path | String | relative to MEDIA_DIR; NULL until download succeeds |
| downloaded_at | DateTime | nullable |
| raw_json | Text | full Pyrogram message dump (escape hatch) |
| archived_at | DateTime | row insert time |
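To show why the composite key makes re-runs idempotent, here is a small sketch using the standard-library sqlite3 module. It is an assumption-laden illustration: the project itself uses SQLAlchemy, the table is cut down to four columns, and upsert_message is a hypothetical helper. The ON CONFLICT clause needs SQLite 3.24 or newer.

```python
import sqlite3

SCHEMA = """
CREATE TABLE messages (
    chat_id    INTEGER NOT NULL,
    message_id INTEGER NOT NULL,
    direction  TEXT,
    text       TEXT,
    PRIMARY KEY (chat_id, message_id)
)
"""

UPSERT = """
INSERT INTO messages (chat_id, message_id, direction, text)
VALUES (?, ?, ?, ?)
ON CONFLICT (chat_id, message_id) DO UPDATE SET
    direction = excluded.direction,
    text      = excluded.text
"""


def upsert_message(conn: sqlite3.Connection, chat_id: int, message_id: int,
                   direction: str, text: str) -> None:
    """Insert the row, or overwrite it if the (chat_id, message_id) pair exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, (chat_id, message_id, direction, text))


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
upsert_message(conn, 403303563, 700, "out", "hello")
upsert_message(conn, 403303563, 700, "out", "hello")  # same poll replayed
print(conn.execute("SELECT count(*) FROM messages").fetchone()[0])
```

Replaying the same message leaves a single row, which is the property the poll loop depends on when it re-fetches a page.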
Inspect:
sqlite3 ./data/archive.db \
"SELECT direction, date, substr(text,1,40) FROM messages ORDER BY date DESC LIMIT 20;"
sqlite3 ./data/archive.db "SELECT count(*) FROM messages WHERE local_path IS NOT NULL;"
Media files land at ./media/<chat_id>/<YYYY-MM>/<filename>.
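The path layout above can be expressed as a small helper. This is a sketch, not the project's code: media_relpath is a hypothetical name, and it assumes the month folder is derived from the message's UTC send time and that sent_at is timezone-aware.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def media_relpath(chat_id: int, sent_at: datetime, filename: str) -> str:
    """Return '<chat_id>/<YYYY-MM>/<filename>', relative to the media root."""
    # Keep only the final path component so a hostile file name cannot add directories.
    safe_name = PurePosixPath(filename.replace("\\", "/")).name
    if safe_name in ("", ".", ".."):
        safe_name = "unnamed"
    month = sent_at.astimezone(timezone.utc).strftime("%Y-%m")
    return str(PurePosixPath(str(chat_id)) / month / safe_name)


print(media_relpath(403303563, datetime(2026, 6, 3, 12, 0, tzinfo=timezone.utc), "photo.jpg"))
```

Storing the relative form keeps database rows valid if the media volume is later mounted somewhere else.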
REST API
A JSON API runs in the same container as the poller, on the same event loop.
It is read-only apart from POST /api/poll, which only triggers a poll. It's
opt-in: set ARCHIVE_API_KEY and the API starts on 0.0.0.0:8080; leave it unset
and no API is served.
# .env
ARCHIVE_API_KEY=<a long random string>
All /api/* routes require the key, via either header:
curl -H "Authorization: Bearer <KEY>" http://localhost:8080/api/messages
curl -H "X-API-Key: <KEY>" http://localhost:8080/api/messages
(/health is public and unauthenticated.)
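A key check that accepts either header might be sketched as follows. The helper names extract_key and is_authorized are hypothetical, and a plain dict stands in for the request headers (real HTTP header lookups are case-insensitive, which a plain dict is not). The comparison uses hmac.compare_digest so timing does not reveal how much of the key matched.

```python
import hmac
from typing import Mapping, Optional


def extract_key(headers: Mapping[str, str]) -> Optional[str]:
    """Pull the API key from 'Authorization: Bearer <KEY>' or 'X-API-Key: <KEY>'."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        return auth[len("Bearer "):].strip()
    return headers.get("X-API-Key")


def is_authorized(headers: Mapping[str, str], expected_key: str) -> bool:
    """True only when a key was supplied and matches the configured one."""
    supplied = extract_key(headers)
    if not supplied:
        return False
    # compare_digest runs in time independent of where the two values differ.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())


print(is_authorized({"X-API-Key": "s3cret"}, "s3cret"))
```

A middleware would call is_authorized for every /api/* request and answer 401 when it returns False, leaving /health outside the check.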
Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /health | Liveness check: {"status":"ok"} (no auth) |
| GET | /api/chat | The single archived DM (id, type, title, username, watermark, poll timestamps) |
| GET | /api/messages | Paginated message list (newest first by default) |
| GET | /api/messages/{message_id} | One message, full (includes raw_json) |
| GET | /api/media/{path} | Serve a downloaded media file (auth-protected, path-traversal guarded) |
| POST | /api/poll | Trigger a poll now; returns 202. Watch /api/chat.last_polled_at to confirm. |
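The path-traversal guard mentioned for /api/media/{path} can be illustrated with a short standard-library check. This is one common way to do it, not necessarily how api.py does; resolve_media is a hypothetical name, and the sketch assumes a POSIX host such as the container.

```python
import os
from typing import Optional


def resolve_media(media_root: str, requested: str) -> Optional[str]:
    """Map a requested relative path to a real file under media_root, or None."""
    root = os.path.realpath(media_root)
    # realpath collapses '..' segments and follows symlinks before the check.
    candidate = os.path.realpath(os.path.join(root, requested))
    # commonpath equals root only when candidate is root itself or lies beneath it.
    if os.path.commonpath([root, candidate]) != root:
        return None
    return candidate if os.path.isfile(candidate) else None
```

A handler would return 404 for None rather than echoing the rejected path back to the caller.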
/api/messages query params:
| param | default | notes |
|---|---|---|
| limit | 50 | clamped to 1–200 |
| offset | 0 | pagination offset |
| order | desc | desc (newest) or asc (oldest), by message_id |
| direction | - | filter: in or out |
| q | - | case-insensitive substring search on text |
| full | false | true to include the large raw_json field |
Each message includes a media_url (e.g. /api/media/403303563/2026-06/photo.jpg)
when a file was downloaded — fetch it with the same API key.
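The defaults and clamping in the table can be sketched as a small parser. This is illustrative only: parse_list_params is a hypothetical name, and falling back to defaults on malformed input (rather than returning a 400) is an assumption, since the README does not say how bad values are handled.

```python
from typing import Mapping, Optional, TypedDict


class ListParams(TypedDict):
    limit: int
    offset: int
    order: str
    direction: Optional[str]
    q: Optional[str]
    full: bool


def parse_list_params(query: Mapping[str, str]) -> ListParams:
    """Normalise /api/messages query parameters to the documented defaults and ranges."""

    def to_int(name: str, default: int) -> int:
        try:
            return int(query.get(name, default))
        except ValueError:
            return default  # assumption: malformed numbers fall back to the default

    direction = query.get("direction")
    return ListParams(
        limit=max(1, min(200, to_int("limit", 50))),  # clamp to 1..200
        offset=max(0, to_int("offset", 0)),
        order="asc" if query.get("order", "desc").lower() == "asc" else "desc",
        direction=direction if direction in ("in", "out") else None,
        q=query.get("q") or None,
        full=query.get("full", "false").lower() == "true",
    )


print(parse_list_params({"limit": "999", "direction": "out"}))
```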
Examples
# newest 10 outgoing messages
curl -H "X-API-Key: $KEY" "http://localhost:8080/api/messages?limit=10&direction=out"
# search for a phrase, oldest first
curl -H "X-API-Key: $KEY" "http://localhost:8080/api/messages?q=flight&order=asc"
# one message with its full raw dump
curl -H "X-API-Key: $KEY" "http://localhost:8080/api/messages/700"
# trigger a poll (returns 202 immediately)
curl -s -X POST -H "X-API-Key: $KEY" http://localhost:8080/api/poll
# → {"status":"queued"}
The API reads the same SQLite file the poller writes; there's no second process
or port for writes. Host port mapping is ${API_PORT:-8080}:8080 in
docker-compose.yml — change API_PORT to expose it elsewhere.
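Since the manual trigger and the scheduled poll share one event loop, they need to be serialized so two polls never run at once. The sketch below shows that pattern with a single asyncio.Lock. The PollRunner class and its trigger and drain methods are hypothetical illustration names; only the idea of a locked wrapper around poll_once comes from this project.

```python
import asyncio
from typing import Awaitable, Callable, Set


class PollRunner:
    """Serializes every poll, whether scheduled or manually triggered.

    Construct it inside a running event loop so the lock belongs to that loop.
    """

    def __init__(self, poll_once: Callable[[], Awaitable[dict]]) -> None:
        self._poll_once = poll_once
        self._lock = asyncio.Lock()
        self._tasks: Set[asyncio.Task] = set()

    async def poll_once_locked(self) -> dict:
        async with self._lock:  # a second caller waits here until the first finishes
            return await self._poll_once()

    def trigger(self) -> dict:
        """Fire-and-forget: schedule a poll on the running loop and return at once."""
        task = asyncio.get_running_loop().create_task(self.poll_once_locked())
        self._tasks.add(task)  # hold a reference so the task is not garbage-collected
        task.add_done_callback(self._tasks.discard)
        return {"status": "queued"}

    async def drain(self) -> None:
        """Wait for all queued polls; handy in tests and on shutdown."""
        await asyncio.gather(*list(self._tasks))


async def demo() -> int:
    active = 0
    peak = 0

    async def fake_poll() -> dict:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for network and disk work
        active -= 1
        return {"messages": 0, "media": 0, "errors": 0}

    runner = PollRunner(fake_poll)
    runner.trigger()                 # manual POST /api/poll
    await runner.poll_once_locked()  # cron tick arriving at the same moment
    await runner.drain()
    return peak                      # highest number of polls running at once


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

An HTTP handler built on this would call trigger() and respond with status 202, which is why the caller has to watch last_polled_at to learn when the poll actually finished.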
Verification checklist
After deploying, confirm end-to-end capture in the target DM:
- Send archive-test-out to the peer from your phone/desktop → python poller.py --once → SELECT direction, text FROM messages WHERE text LIKE '%archive-test-out%'; → one row, direction = out.
- Have the peer send archive-test-in in that same DM → --once → same query for archive-test-in → direction = in.
Both rows present proves the archive captures both directions, scoped to the one
target DM. The counts dict {"messages": M, "media": K, "errors": E} is logged
after every poll.
Important caveats
- Shared auth key (load-bearing). The session is extracted from Telegram Desktop and shares its auth key. Logging out of Telegram Desktop, or terminating that session in Settings → Devices, also kills the server session. Keep Telegram Desktop logged in. If the session dies, re-run extract.py (with TD still logged in), update TG_SESSION_STRING, then redeploy. If TD itself was logged out, do a one-time fresh phone+code login to mint an independent session string.
- Residual presence. The client is connected only during each poll, a window of seconds (one history fetch + media downloads). For the leakage-avoidance guarantee, set Telegram → Settings → Privacy → Last Seen & Online → Nobody. A brief online flicker during each poll is irreducible for a user session.
- Long gaps are safe. Incremental fetch reads newest-first until the watermark, paging through everything missed. It does not rely on MTProto update sequencing, so there's no differenceTooLong risk after downtime; a missed poll just means the next one fetches more pages.
- Edits and deletions are not captured. This is a new-message-only fetch. Edits to already-archived messages, and deletions, are out of scope for v1.
- One DM only. Expanding to more than one DM is a one-line change (iterate a list of targets in archive.py), but is out of scope here.
- TARGET_CHAT resolution. @username or a numeric user id is most reliable. +phone works only if the peer is resolvable as a contact. If the peer has no username and you don't know the numeric id, message them once, then read the id from the chats.id column after a first poll.
Tech
- Pyrogram v2 (async): user session, get_chat, get_chat_history, download_media.
- tdata-reader: reads the MTProto auth_key out of Telegram Desktop's tdata; the extractor packs it into a Pyrogram v2 StringSession (verified against pyrogram 2.0.106's source).
- Telethon: used by verify.py to prove the auth_key is live (pyrogram 2.0.106 doesn't import on Python 3.14; telethon does).
- APScheduler AsyncIOScheduler + CronTrigger: the schedule lives in-process; no dependency on host cron.
- SQLAlchemy 2.0 (sync) against SQLite on a mounted volume.
- aiohttp: the read-only REST API, served on the poller's event loop (opt-in via ARCHIVE_API_KEY).