yuris 12caf4eaf3 feat(api): add POST /api/poll to trigger a poll on demand
Adds a fire-and-forget endpoint that schedules poll_once on the running
event loop and returns 202 immediately, so clients don't need
`docker exec ... python poller.py --once` to fetch new messages now.

A process-wide asyncio.Lock (_poll_lock in archive.py, exposed via
poll_once_locked) serializes the manual trigger with the scheduled poll,
so a manual POST and a cron tick can never overlap (concurrent polls
would contend for the single Telegram session and SQLite writes).
Both the scheduler (poller._scheduled_poll) and the endpoint go through
poll_once_locked; --once is unchanged since it runs in its own process.

Also restores `import hmac` in api.py, which had been lost in a prior
edit and was breaking the auth middleware (NameError on every /api/*
route); the end-to-end test for this endpoint caught it.
2026-06-21 15:32:12 +07:00

tg-archive

A personal Telegram archive that runs as a Docker container and archives one specific direct-message conversation — the one identified by TARGET_CHAT (@username, +phone, or numeric user id). Every incoming and outgoing message in that single DM is stored to a SQLite database, with media downloaded to a volume. It does not touch your other chats, groups, or channels.

The Telegram client connects only during scheduled polls (hourly by default) and stays disconnected between them, so you don't leak continuous "online"/last-seen presence. Both message directions are captured, with up to one schedule-interval of latency.

Two deliverables:

  1. local/ — a session extractor you run once on your personal machine (Windows or macOS, where Telegram Desktop lives) to mint a Pyrogram StringSession from its tdata. No phone/SMS is ever needed on the server.
  2. server/ — the deployed archive container that uses that session string.

How it works

┌─────────────────┐   extract once   ┌─────────────────────────────────────┐
│  Telegram       │  ──────────────► │  Dokploy / Docker host              │
│  Desktop (Win)  │   session string │                                     │
│  + tdata        │                  │  tg-archive container               │
└─────────────────┘                  │   ├─ APScheduler (cron, in-process) │
                                     │   │    └─ every tick: connect → poll │
                                     │   │       TARGET_CHAT only → download│
                                     │   │       media → disconnect         │
                                     │   ├─ SQLite  → /data/archive.db      │
                                     │   └─ media   → /media/<chat>/<ym>/   │
                                     └─────────────────────────────────────┘

Each poll:

  1. Connects the user session.
  2. Resolves TARGET_CHAT with get_chat().
  3. First poll → fetches the newest BASELINE_LIMIT (default 200) messages. Later polls → fetches newest-first and keeps only id > last_seen_message_id until it reaches the watermark (auto-paginated; POLL_LIMIT is a safety cap).
  4. Upserts each message by composite key (chat_id, message_id) — re-runs are idempotent.
  5. Downloads media for messages that have it, storing the path relative to /media.
  6. Advances the watermark, records last_polled_at, disconnects.
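The incremental step (3) and the watermark advance (6) can be sketched in plain Python. This is a minimal illustration, not the code in archive.py: `Msg` and `new_messages` are hypothetical names, and the history is an in-memory list standing in for a paginated, newest-first history call.

```python
from typing import Iterable, Iterator, NamedTuple


class Msg(NamedTuple):
    """Hypothetical stand-in for a Telegram message; only the id matters here."""
    id: int
    text: str


def new_messages(history: Iterable[Msg], last_seen_id: int, poll_limit: int) -> Iterator[Msg]:
    """Yield messages newer than the watermark from a newest-first stream."""
    for count, msg in enumerate(history):
        if msg.id <= last_seen_id or count >= poll_limit:
            break  # reached the watermark (or the safety cap): stop paging
        yield msg


# Newest-first history, as a paginated history call would deliver it.
history = [Msg(105, "e"), Msg(104, "d"), Msg(103, "c"), Msg(102, "b"), Msg(101, "a")]

fresh = list(new_messages(history, last_seen_id=102, poll_limit=10_000))
# The watermark only moves forward, and stays put when nothing new arrived.
new_watermark = max((m.id for m in fresh), default=102)
```

Because the loop stops at the first id at or below the watermark, a long outage only means more pages are read before the break; nothing depends on having seen every intermediate update.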

Repository layout

tg-archive/
├── docker-compose.yml      Dokploy / docker-compose entry point
├── local/                  run on your machine (Win/macOS), NOT deployed
│   ├── requirements.txt
│   ├── extract.py          tdata → StringSession (auto-detects tdata path)
│   └── verify.py           proves the session is live (get_me)
└── server/                 the deployed container
    ├── Dockerfile
    ├── requirements.txt
    ├── config.py           env → config, fast-fail on missing vars
    ├── models.py           chats + messages schema, init_db()
    ├── archive.py          poll_once() — the core archive routine
    ├── api.py              read-only REST API (aiohttp, opt-in via API key)
    └── poller.py           entrypoint: --once | scheduled loop (+ API host)

Part 1 — Extract the session (Windows or macOS)

The session is extracted once from Telegram Desktop's tdata. Get your API credentials at https://my.telegram.org (API ID + API hash).

macOS users: this requires the cross-platform Telegram Desktop app (from https://desktop.telegram.org), not the native macOS "Telegram" app from the App Store. The native app stores its data in a different, incompatible format. Install Telegram Desktop and sign in there once before extracting.

  1. Fully close Telegram Desktop — it locks the live tdata folder.

    • Windows: quit from the tray and confirm no Telegram.exe in Task Manager.
    • macOS: quit (⌘Q) and confirm it isn't running — Activity Monitor, or pgrep -fl "Telegram Desktop".
  2. Install dependencies and run the extractor:

    cd local
    pip install -r requirements.txt
    python extract.py --api-id 12345 --api-hash abcdef
    
    • The default --tdata path is auto-detected per OS:
      • Windows: %APPDATA%\Telegram Desktop\tdata
      • macOS: ~/Library/Application Support/Telegram Desktop/tdata
      • Linux: ~/.local/share/Telegram Desktop/tdata
    • For a portable or custom install, pass the folder explicitly with --tdata PATH.
    • If you set a Telegram Desktop local passcode, add --passcode <PIN>.
    • The extractor copies tdata to ./tdata_copy and parses the copy — the live folder is never read directly.
    • The extractor reads tdata with tdata-reader (extracting the MTProto auth_key, dc_id, and user_id) and packs them into a Pyrogram v2 StringSession, the exact format the server's pyrogram 2.0.106 expects. This works on Python 3.8 through 3.14 (pyrogram and TGConvertor don't import on 3.14, so the local tools avoid them).
  3. You'll see (copy these):

    SESSION=<long string>
    USER_ID=<your telegram id>
    DC_ID=<datacenter id>
    

    It also writes local/session.txt (the Pyrogram string — paste its first line into TG_SESSION_STRING), local/telethon-session.txt (used by verify.py), and local/api.env. All are gitignored — they contain secrets.

  4. Verify the session is live before shipping it:

    python verify.py
    

    verify.py logs in via Telethon (the same MTProto auth_key works for any client library) and prints your @username, id, name, and phone. A successful Telethon login confirms the Pyrogram string is live for the server too. If it fails, the server will fail — fix it here first.
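For readers curious about the packing step in the extractor, here is a minimal sketch of how auth_key, dc_id, and user_id could be packed into a Pyrogram v2 session string. The struct layout `">BI?256sQ?"` and the unpadded URL-safe base64 encoding are my understanding of pyrogram 2.0.x and should be checked against the version you run; `pack_session` and `unpack_session` are hypothetical helper names, not functions from extract.py.

```python
import base64
import struct

# Assumed Pyrogram v2 layout (big-endian): dc_id (u8), api_id (u32),
# test_mode (bool), auth_key (256 bytes), user_id (u64), is_bot (bool).
SESSION_FORMAT = ">BI?256sQ?"


def pack_session(dc_id: int, api_id: int, auth_key: bytes, user_id: int) -> str:
    """Pack the fields into an unpadded URL-safe base64 string."""
    if len(auth_key) != 256:
        raise ValueError("auth_key must be exactly 256 bytes")
    packed = struct.pack(SESSION_FORMAT, dc_id, api_id, False, auth_key, user_id, False)
    return base64.urlsafe_b64encode(packed).decode().rstrip("=")


def unpack_session(session: str) -> tuple:
    """Reverse of pack_session: restore the padding, decode, unpack."""
    padded = session + "=" * (-len(session) % 4)
    return struct.unpack(SESSION_FORMAT, base64.urlsafe_b64decode(padded))


# Dummy all-zero key for illustration only; a real auth_key comes from tdata.
session = pack_session(dc_id=2, api_id=12345, auth_key=bytes(256), user_id=403303563)
```

The round trip through `unpack_session` is a cheap sanity check before pasting a string into TG_SESSION_STRING, though only a live login (verify.py) proves the key is still valid.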


Part 2 — Run the archive container

Environment variables

Variable           Required  Default           Description
TG_API_ID          yes       (none)            Telegram API id
TG_API_HASH        yes       (none)            Telegram API hash
TG_SESSION_STRING  yes       (none)            SESSION= value from Part 1 (mark as secret)
TARGET_CHAT        yes       (none)            @username | +phone | numeric id of the one DM
SCHEDULE           no        0 * * * *         cron for polls (hourly default)
BASELINE_LIMIT     no        200               messages fetched on the first poll
POLL_LIMIT         no        10000             incremental safety cap per poll
DB_PATH            no        /data/archive.db  SQLite path
MEDIA_DIR          no        /media            media download root
ARCHIVE_API_KEY    no        (none)            if set, serves the read-only REST API (mark as secret)
ARCHIVE_API_HOST   no        0.0.0.0           API bind address
ARCHIVE_API_PORT   no        8080              API port inside the container

SCHEDULE is a 5-field crontab evaluated in UTC. 0 * * * * = top of every hour. Missing any required variable fails the container immediately on startup (visible in logs).
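The fast-fail behaviour can be pictured with a small loader like the one below. It is a sketch under assumptions, not the contents of config.py: the `Config` dataclass, `load_config`, and the five-field check on SCHEDULE are illustrative, and only a subset of the variables is shown.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("TG_API_ID", "TG_API_HASH", "TG_SESSION_STRING", "TARGET_CHAT")


@dataclass(frozen=True)
class Config:
    api_id: int
    api_hash: str
    session_string: str
    target_chat: str
    schedule: str
    baseline_limit: int
    poll_limit: int


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Build a Config from env, exiting immediately if anything required is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise SystemExit("missing required env vars: " + ", ".join(missing))
    schedule = env.get("SCHEDULE", "0 * * * *")
    if len(schedule.split()) != 5:
        raise SystemExit(f"SCHEDULE must be a 5-field crontab, got {schedule!r}")
    return Config(
        api_id=int(env["TG_API_ID"]),
        api_hash=env["TG_API_HASH"],
        session_string=env["TG_SESSION_STRING"],
        target_chat=env["TARGET_CHAT"],
        schedule=schedule,
        baseline_limit=int(env.get("BASELINE_LIMIT", "200")),
        poll_limit=int(env.get("POLL_LIMIT", "10000")),
    )


example = load_config({
    "TG_API_ID": "12345",
    "TG_API_HASH": "abcdef",
    "TG_SESSION_STRING": "<paste>",
    "TARGET_CHAT": "@someuser",
})
```

Raising `SystemExit` with a message makes the process exit non-zero and prints the reason, which is what surfaces in `docker compose logs`.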

Quick local run (Docker)

Docker Compose reads a .env file in the project root automatically — the easiest way to test locally. Copy the template, fill it in, and bring it up:

cp .env.example .env
# edit .env: set TG_API_ID, TG_API_HASH, TG_SESSION_STRING (first line of
# local/session.txt), and TARGET_CHAT
docker compose up --build -d

docker compose logs -f tg-archive                       # "scheduler started" + cron polls
docker compose exec tg-archive python poller.py --once  # manual fire

.env is gitignored (it holds the TG_SESSION_STRING secret). You can also export the variables inline in your shell instead:

TG_API_ID=12345 TG_API_HASH=abcdef \
TG_SESSION_STRING='<paste>' TARGET_CHAT=@someuser \
docker compose up --build -d

Deploy on Dokploy

  1. Create an application from this Git repository, compose file at the repo root.
  2. In the Dokploy UI, set the env vars above (mark TG_SESSION_STRING as a secret).
  3. Set SCHEDULE if you don't want hourly.
  4. Deploy. The ./data and ./media bind mounts persist on the Dokploy host across redeployments.

Single poll without Docker (host cron)

You can also run the server code directly and drive it with your own cron:

python server/poller.py --once     # one poll, then exit 0

Database schema

SQLite at /data/archive.db.

chats (one row — the single archived DM):

column                type        notes
id                    BigInteger  PK; Telegram chat/user id
type                  String      private (for a DM)
title                 String      nullable
username              String      nullable
last_seen_message_id  BigInteger  high-water mark; 0 = nothing archived
first_polled_at       DateTime    nullable
last_polled_at        DateTime    nullable; poll heartbeat

messages — composite primary key (chat_id, message_id):

column          type        notes
chat_id         BigInteger  PK part 1
message_id      Integer     PK part 2
date            DateTime    message send time (UTC)
direction       String      in | out
sender_id       BigInteger  nullable
sender_name     String      nullable
text            Text        msg.text or msg.caption
has_media       Boolean
media_type      String      photo|video|voice|video_note|audio|document|sticker|animation
file_id         String      nullable
file_unique_id  String      nullable
file_name       String      nullable
mime_type       String      nullable
size            Integer     nullable
local_path      String      relative to MEDIA_DIR; NULL until download succeeds
downloaded_at   DateTime    nullable
raw_json        Text        full Pyrogram message dump (escape hatch)
archived_at     DateTime    row insert time
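The composite key is what makes re-runs idempotent. The sketch below shows the idea with the standard-library sqlite3 module and a trimmed-down table; the project itself uses SQLAlchemy, so treat the SQL, the reduced column set, and the `upsert_message` helper as illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE messages (
        chat_id    INTEGER NOT NULL,
        message_id INTEGER NOT NULL,
        direction  TEXT,
        text       TEXT,
        PRIMARY KEY (chat_id, message_id)
    )
    """
)

# Insert, or overwrite the existing row when (chat_id, message_id) already exists.
UPSERT = """
INSERT INTO messages (chat_id, message_id, direction, text)
VALUES (:chat_id, :message_id, :direction, :text)
ON CONFLICT (chat_id, message_id) DO UPDATE SET
    direction = excluded.direction,
    text      = excluded.text
"""


def upsert_message(conn: sqlite3.Connection, row: dict) -> None:
    conn.execute(UPSERT, row)


row = {"chat_id": 403303563, "message_id": 700, "direction": "out", "text": "hello"}
upsert_message(conn, row)
upsert_message(conn, row)  # same poll replayed: still one row
conn.commit()

count = conn.execute("SELECT count(*) FROM messages").fetchone()[0]
```

`ON CONFLICT ... DO UPDATE` needs SQLite 3.24 or newer, which current Python builds ship with.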

Inspect:

sqlite3 ./data/archive.db \
  "SELECT direction, date, substr(text,1,40) FROM messages ORDER BY date DESC LIMIT 20;"
sqlite3 ./data/archive.db "SELECT count(*) FROM messages WHERE local_path IS NOT NULL;"

Media files land at ./media/<chat_id>/<YYYY-MM>/<filename>.
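As an illustration of that layout, a relative path of the form `<chat_id>/<YYYY-MM>/<filename>` could be built as below. `media_relpath` is a hypothetical helper, and deriving the month from the message's send date is an assumption; the README only shows the resulting shape.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def media_relpath(chat_id: int, sent_at: datetime, file_name: str) -> str:
    """Return '<chat_id>/<YYYY-MM>/<filename>', relative to the media root."""
    # Keep only the final path component so a crafted name cannot add directories.
    safe_name = PurePosixPath(file_name.replace("\\", "/")).name
    if safe_name in ("", ".", ".."):
        safe_name = "file"
    return str(PurePosixPath(str(chat_id)) / sent_at.strftime("%Y-%m") / safe_name)


sent = datetime(2026, 6, 21, 8, 32, tzinfo=timezone.utc)
path = media_relpath(403303563, sent, "photo.jpg")
```

Storing the path relative to MEDIA_DIR (as the local_path column does) keeps the database valid if the media volume is mounted somewhere else later.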


REST API

A read-only JSON API runs in the same container as the poller, on the same event loop. It's opt-in: set ARCHIVE_API_KEY and the API starts on 0.0.0.0:8080; leave it unset and no API is served.

# .env
ARCHIVE_API_KEY=<a long random string>

All /api/* routes require the key, via either header:

curl -H "Authorization: Bearer <KEY>" http://localhost:8080/api/messages
curl -H "X-API-Key: <KEY>"        http://localhost:8080/api/messages

(/health is public and unauthenticated.)
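A key check that accepts either header might look like the sketch below. It is not the middleware from api.py: `extract_key` and `is_authorized` are hypothetical names, and the headers are modelled as a plain dict (aiohttp's real header mapping is case-insensitive). The one detail taken from the project is the use of hmac, whose `compare_digest` compares in constant time.

```python
import hmac
from typing import Mapping, Optional


def extract_key(headers: Mapping[str, str]) -> Optional[str]:
    """Return the key from 'Authorization: Bearer ...' or 'X-API-Key', if any."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        return auth[len("Bearer "):].strip()
    return headers.get("X-API-Key")


def is_authorized(headers: Mapping[str, str], expected_key: str) -> bool:
    supplied = extract_key(headers)
    if not supplied:
        return False
    # Constant-time comparison avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())


KEY = "a-long-random-string"
bearer_ok = is_authorized({"Authorization": f"Bearer {KEY}"}, KEY)
header_ok = is_authorized({"X-API-Key": KEY}, KEY)
rejected = is_authorized({"X-API-Key": "wrong"}, KEY)
```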

Endpoints

Method  Path                        Description
GET     /health                     Liveness check; returns {"status":"ok"} (no auth)
GET     /api/chat                   The single archived DM (id, type, title, username, watermark, poll timestamps)
GET     /api/messages               Paginated message list (newest first by default)
GET     /api/messages/{message_id}  One message, full (includes raw_json)
GET     /api/media/{path}           Serve a downloaded media file (auth-protected, path-traversal guarded)
POST    /api/poll                   Trigger a poll now; returns 202. Watch /api/chat.last_polled_at to confirm.

/api/messages query params:

param      default  notes
limit      50       clamped to 1-200
offset     0        pagination offset
order      desc     desc (newest) or asc (oldest), by message_id
direction  (none)   filter in / out
q          (none)   case-insensitive substring search on text
full       false    true to include the large raw_json field
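To make the defaults and clamping concrete, here is one way the query string could be normalised. Everything in it is a sketch: `MessageQuery` and `parse_query` are hypothetical, the clamp range of 1 to 200 is my reading of the table, and falling back to defaults on invalid values (rather than returning an error) is an assumption about behaviour the README does not specify.

```python
from typing import Mapping, NamedTuple, Optional


class MessageQuery(NamedTuple):
    limit: int
    offset: int
    order: str
    direction: Optional[str]
    q: Optional[str]
    full: bool


def parse_query(params: Mapping[str, str]) -> MessageQuery:
    def as_int(name: str, default: int) -> int:
        try:
            return int(params.get(name, default))
        except (TypeError, ValueError):
            return default  # non-numeric input falls back to the default

    order = params.get("order", "desc")
    direction = params.get("direction")
    return MessageQuery(
        limit=min(max(as_int("limit", 50), 1), 200),  # clamp into 1..200
        offset=max(as_int("offset", 0), 0),
        order=order if order in ("asc", "desc") else "desc",
        direction=direction if direction in ("in", "out") else None,
        q=params.get("q") or None,
        full=params.get("full", "false").lower() == "true",
    )


query = parse_query({"limit": "5000", "direction": "out", "q": "flight"})
```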

Each message includes a media_url (e.g. /api/media/403303563/2026-06/photo.jpg) when a file was downloaded — fetch it with the same API key.
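The path-traversal guard on the media route can be illustrated with pathlib. This is a sketch of the general technique, not the handler in api.py; `resolve_media` is a hypothetical name and the temporary directory stands in for MEDIA_DIR.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_media(media_root: Path, requested: str) -> Optional[Path]:
    """Map a requested relative path to a file under media_root, or None."""
    root = media_root.resolve()
    candidate = (root / requested).resolve()  # collapses '..' and follows symlinks
    if root not in candidate.parents:
        return None  # the request escaped the media root
    return candidate if candidate.is_file() else None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "media"
    target = root / "403303563" / "2026-06" / "photo.jpg"
    target.parent.mkdir(parents=True)
    target.write_bytes(b"jpeg-bytes")
    (Path(tmp) / "secret.txt").write_text("outside the media root")

    served = resolve_media(root, "403303563/2026-06/photo.jpg")
    found = served is not None and served.name == "photo.jpg"
    escaped = resolve_media(root, "../secret.txt")
    absolute = resolve_media(root, str(Path(tmp) / "secret.txt"))
```

Resolving first and checking containment afterwards is the important ordering: checking the raw string for `..` misses symlinks and absolute paths.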

Examples

# newest 10 outgoing messages
curl -H "X-API-Key: $KEY" "http://localhost:8080/api/messages?limit=10&direction=out"

# search for a phrase, oldest first
curl -H "X-API-Key: $KEY" "http://localhost:8080/api/messages?q=flight&order=asc"

# one message with its full raw dump
curl -H "X-API-Key: $KEY" "http://localhost:8080/api/messages/700"

# trigger a poll (returns 202 immediately)
curl -s -X POST -H "X-API-Key: $KEY" http://localhost:8080/api/poll
# → {"status":"queued"}

The API reads the same SQLite file the poller writes; there's no second process or port for writes. Host port mapping is ${API_PORT:-8080}:8080 in docker-compose.yml — change API_PORT to expose it elsewhere.
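The fire-and-forget trigger and the lock that keeps a manual poll from overlapping a scheduled one can be sketched with asyncio alone. The names `poll_once` and `poll_once_locked` mirror those mentioned in the commit message, but the bodies here are stand-ins: the real poll talks to Telegram, while this one only records events, and `trigger_poll` is a hypothetical substitute for the aiohttp handler.

```python
import asyncio

events: list = []
_background: set = set()  # keep references so pending tasks are not garbage collected


async def poll_once(label: str) -> None:
    """Stand-in for the real poll: records when it starts and ends."""
    events.append(f"{label}:start")
    await asyncio.sleep(0.01)
    events.append(f"{label}:end")


async def poll_once_locked(lock: asyncio.Lock, label: str) -> None:
    async with lock:  # only one poll may run at a time
        await poll_once(label)


def trigger_poll(lock: asyncio.Lock, label: str) -> dict:
    """Schedule a poll on the running loop and return at once (the 202 case)."""
    task = asyncio.get_running_loop().create_task(poll_once_locked(lock, label))
    _background.add(task)
    task.add_done_callback(_background.discard)
    return {"status": "queued"}


async def main() -> dict:
    lock = asyncio.Lock()
    response = trigger_poll(lock, "manual")      # returns before the poll runs
    await poll_once_locked(lock, "scheduled")    # a cron tick arriving at the same time
    await asyncio.gather(*_background)           # wait for the queued poll to finish
    return response


response = asyncio.run(main())
```

Whichever poll takes the lock first runs to completion before the other starts, so the recorded events never interleave.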


Verification checklist

After deploying, confirm end-to-end capture in the target DM:

  1. Send archive-test-out to the peer from your phone/desktop → python poller.py --once → SELECT direction, text FROM messages WHERE text LIKE '%archive-test-out%'; → one row, direction = out.
  2. Have the peer send archive-test-in in that same DM → --once → same query for archive-test-in → direction = in.

If both rows are present, the archive captures both directions, scoped to the one target DM. The counts dict {"messages": M, "media": K, "errors": E} is logged after every poll.
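The two checks can also be scripted. The sketch below runs the same LIKE query for each marker against an in-memory database seeded with two sample rows; `directions_for` and `check_capture` are hypothetical helpers, and against a real deployment you would connect to ./data/archive.db instead.

```python
import sqlite3


def directions_for(conn: sqlite3.Connection, marker: str) -> list:
    """Return the direction of every message whose text contains the marker."""
    rows = conn.execute(
        "SELECT direction FROM messages WHERE text LIKE ? ORDER BY message_id",
        (f"%{marker}%",),
    ).fetchall()
    return [row[0] for row in rows]


def check_capture(conn: sqlite3.Connection) -> bool:
    """True when exactly one outgoing and one incoming test message were archived."""
    return (
        directions_for(conn, "archive-test-out") == ["out"]
        and directions_for(conn, "archive-test-in") == ["in"]
    )


# Seeded in-memory table standing in for the real archive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (chat_id, message_id, direction, text)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?, ?)",
    [
        (403303563, 700, "out", "archive-test-out"),
        (403303563, 701, "in", "archive-test-in"),
    ],
)
both_directions = check_capture(conn)
```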


Important caveats

  • Shared auth key (load-bearing). The session is extracted from Telegram Desktop and shares its auth key. Logging out of Telegram Desktop, or terminating that session in Settings → Devices, also kills the server session. Keep Telegram Desktop logged in. If the session dies, re-run extract.py (with TD still logged in) and update TG_SESSION_STRING, then redeploy. If TD itself was logged out, do a one-time fresh phone+code login to mint an independent session string.
  • Residual presence. The client is connected only during each poll — a window of seconds (one history fetch + media downloads). For the leakage-avoidance guarantee, set Telegram → Settings → Privacy → Last Seen & Online → Nobody. A brief online flicker during each poll is irreducible for a user session.
  • Long gaps are safe. Incremental fetch reads newest-first until the watermark, paging through everything missed. It does not rely on MTProto update sequencing, so there's no differenceTooLong risk after downtime — a missed poll just means the next one fetches more pages.
  • Edits and deletions are not captured. This is a new-message-only fetch. Edits to already-archived messages, and deletions, are out of scope for v1.
  • One DM only. Expanding to more than one DM is a one-line change (iterate a list of targets in archive.py), but is out of scope here.
  • TARGET_CHAT resolution. @username or a numeric user id is most reliable. +phone works only if the peer is resolvable as a contact. If the peer has no username and you don't know the numeric id, message them once, then read the id from the chats.id column after a first poll.

Tech

  • Pyrogram v2 (async) — user session, get_chat, get_chat_history, download_media.
  • tdata-reader — reads the MTProto auth_key out of Telegram Desktop's tdata; the extractor packs it into a Pyrogram v2 StringSession (verified against pyrogram 2.0.106's source).
  • Telethon — used by verify.py to prove the auth_key is live (pyrogram 2.0.106 doesn't import on Python 3.14; telethon does).
  • APScheduler AsyncIOScheduler + CronTrigger — schedule lives in-process; no dependency on host cron.
  • SQLAlchemy 2.0 (sync) against SQLite on a mounted volume.
  • aiohttp — the read-only REST API, served on the poller's event loop (opt-in via ARCHIVE_API_KEY).