A Python tool to fetch, archive, and analyze messages from Telegram channels and groups using Telethon. Includes trust scoring with OpenRank and database integration.
- Simple - Login with phone number, no session strings needed
- Flexible - Configure via a `config.toml` file
- Async - Built with async/await for efficient message fetching
- Rate limiting - Respects Telegram API limits
- Parallel processing - Crawls multiple channels concurrently
- Channel exclusion - Skip unwanted channels (logs, bots, etc.)
- Checkpoints - Automatically saves progress and allows resuming if interrupted
- JSON export - Saves messages with full metadata
- Trust scoring - Calculate trust scores using OpenRank algorithm
- Database integration - Import data to PostgreSQL
- AI summarization - Generate summaries using OpenAI
- Photo management - Download and upload user profile photos to S3
- Visit https://my.telegram.org
- Login with your phone number
- Go to "API Development Tools"
- Create a new application (any name/description)
- Copy your `api_id` and `api_hash`
```
pip install -r requirements.txt
```

Create a `.env` file with your credentials:

```
# Required for Telegram
TELEGRAM_APP_ID=12345678
TELEGRAM_APP_HASH=abcdef1234567890abcdef1234567890
TELEGRAM_PHONE=+1234567890

# Optional: For database imports
DATABASE_URL=postgresql://user:pass@localhost:5432/dbname

# Optional: For S3 photo uploads
S3USERNAME=your_aws_access_key_id
S3CREDENTIAL=your_aws_secret_access_key

# Optional: For AI summarization
OPENAI_API_KEY=sk-...
```

Note: `TELEGRAM_PHONE` is optional - you'll be prompted for it if not set.
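The required/optional split above can be enforced at startup. A minimal sketch (the function name `load_telegram_credentials` is illustrative, not from this repo):

```python
import os

def load_telegram_credentials(env=None):
    """Read Telegram API credentials, failing fast when required ones are missing."""
    env = os.environ if env is None else env
    app_id = env.get("TELEGRAM_APP_ID")
    app_hash = env.get("TELEGRAM_APP_HASH")
    if not app_id or not app_hash:
        raise RuntimeError(
            "Missing Telegram credentials: set TELEGRAM_APP_ID and TELEGRAM_APP_HASH in .env"
        )
    # TELEGRAM_PHONE is optional; the crawler prompts for it when absent.
    return int(app_id), app_hash, env.get("TELEGRAM_PHONE")
```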
First, list all your accessible channels:
```
python list_channels.py
```

This will show all channels with their IDs. Then edit `config.toml` to set which channels to crawl:
```toml
[group_chats]
include = [
    1234567890,  # Group chat ID (from list_channels.py)
]

[channels]
include = [
    -1001234567890,  # Channel ID (from list_channels.py)
]
```

Then run the crawler:

```
python read_messages.py
```

On first run, you'll be prompted to:
- Enter your phone number (if not in .env)
- Enter the verification code Telegram sends you
- Enter your 2FA password (if enabled)
A session file will be created so you don't need to login again on subsequent runs.
Edit `config.toml` to customize the crawler:

```toml
[crawler]
time_window_days = 365            # How many days back to fetch
max_messages_per_channel = 40000  # Message limit per channel
parallel_requests = 1             # Concurrent channels to process
batch_size = 500                  # Number of messages to fetch per batch
rate_limiting_delay = 0.5         # Delay between requests (seconds)
checkpoint_interval = 2000        # Save checkpoint every N messages (0 to disable)
fetch_replies = true              # Fetch replies/comments to channel posts
max_reply_depth = 4               # Maximum depth for nested replies (0-5 recommended)

[group_chats]
include = [1234567890]            # Group chat IDs to crawl

[channels]
include = [-1001234567890]        # Channel IDs to crawl

[output]
pretty_print = true               # Format JSON nicely
indent_spaces = 2                 # JSON indentation

[trust]
mention_points = 50               # Points for direct mentions
reply_points = 40                 # Points for replies
reaction_points = 30              # Points for reactions
```

| Script | Description |
|---|---|
| `read_messages.py` | Main crawler script - fetches messages from Telegram |
| `list_channels.py` | List all accessible channels/groups with their IDs |
| `list_admins.py` | List admins/moderators for configured channels (saves to CSV) |
| `generate_trust.py` | Calculate trust scores from messages |
| `process_scores.py` | Aggregate and normalize trust scores |
| `process_seed.py` | Process seed CSV files with tier-based weighting |
| `generate_json.py` | Generate JSON files for UI from seed/output data |
| Script | Description |
|---|---|
| `import_metadata_to_db.py` | Import messages, reactions, users, channels to PostgreSQL |
| `import_scores_to_db.py` | Import seeds and scores to PostgreSQL |
| Script | Description |
|---|---|
| `download_photos.py` | Download user profile photos to `raw/photos/` |
| `upload_photos.py` | Upload photos to S3 (`s3://openrank-files/telegram`) |
| Script | Description |
|---|---|
| `summarize_posts.py` | Generate AI summaries of posts using OpenAI |
Located in the `channel/` directory for channel-specific processing:

| Script | Description |
|---|---|
| `channel/read_channel_messages.py` | Channel-specific message fetching |
| `channel/generate_channel_trust.py` | Channel-specific trust generation |
| `channel/generate_channel_json.py` | Channel-specific JSON generation |
| Script | Description |
|---|---|
| `run_pipeline.sh` | Run complete pipeline: trust → OpenRank → scores → JSON |
| File | Description |
|---|---|
| `config.toml` | Main configuration file |
| `.env` | Environment variables (credentials) |
| `requirements.txt` | Python dependencies |
```
trank/
├── channel/          # Channel-specific scripts
├── output/           # Processed output files
├── raw/              # Raw data from Telegram
│   ├── checkpoints/  # Checkpoint files for resuming
│   └── photos/       # Downloaded user profile photos
├── schemas/          # PostgreSQL schema files
├── scores/           # Computed OpenRank scores
├── seed/             # Seed values for trust computation
├── trust/            # Trust edge files
├── ui/               # JSON files for UI consumption
└── tmp/              # Temporary files
```
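If you're setting the project up from scratch, the layout above can be created with `pathlib`. A small convenience sketch (not one of the repo's scripts):

```python
from pathlib import Path

# Directories used by the pipeline, mirroring the layout above.
LAYOUT = [
    "channel", "output", "raw/checkpoints", "raw/photos",
    "schemas", "scores", "seed", "trust", "ui", "tmp",
]

def ensure_layout(base="."):
    """Create any missing pipeline directories; existing ones are left alone."""
    for d in LAYOUT:
        Path(base, d).mkdir(parents=True, exist_ok=True)
```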
```
# List all your channels
python list_channels.py

# Run the crawler
python read_messages.py

# List admins/moderators (saves to CSV)
python list_admins.py

# Calculate trust scores
python generate_trust.py

# Process and normalize scores
python process_scores.py

# Generate UI JSON files
python generate_json.py

# Run complete pipeline
./run_pipeline.sh

# Run pipeline for channels (not group chats)
./run_pipeline.sh --channel

# Download user profile photos
python download_photos.py
python download_photos.py --skip-existing
python download_photos.py --verbose

# Upload photos to S3
python upload_photos.py
python upload_photos.py --dry-run
python upload_photos.py --force

# Import to database
python import_metadata_to_db.py
python import_metadata_to_db.py --channel 123456
python import_metadata_to_db.py --dry-run
python import_scores_to_db.py
python import_scores_to_db.py --channel 123456

# Generate AI summaries
python summarize_posts.py
```

The crawler automatically saves checkpoints during message fetching to prevent data loss if interrupted. Checkpoints are saved to the `raw/checkpoints/` directory.
How it works:

- Checkpoint files are created every N messages (configurable via `checkpoint_interval` in `config.toml`); the default is every 2000 messages
- If the script is interrupted, it will detect the checkpoint on the next run and ask if you want to resume
- Checkpoints are automatically deleted after successful completion
- Set `checkpoint_interval = 0` to disable checkpoints
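The checkpoint mechanics described above can be sketched as an atomic JSON write plus a resume check (illustrative only; the repo's actual checkpoint format may differ):

```python
import json
import os

def save_checkpoint(path, channel_id, messages):
    """Write a checkpoint atomically so an interrupt never leaves a half-written file."""
    state = {"channel_id": channel_id, "message_count": len(messages), "messages": messages}
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename over any previous checkpoint

def load_checkpoint(path):
    """Return saved state if a checkpoint exists, else None."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```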
Resuming from checkpoint:

```
python read_messages.py

# If a checkpoint is found, you'll see:
# 📂 Found checkpoint for channel -1001234567890 with 500 messages
#    Last saved: 2025-01-15T10:30:45+00:00
#    Resume from checkpoint? (y/n):
```

Messages are saved in the `raw/` directory:
- Format: `raw/[channel_id]_messages.json`
- One file per channel
- Contains simplified message data (ID, date, user ID, text, reactions, replies)
Example message:

```json
{
  "id": 9099,
  "date": "2025-11-13T01:49:52+00:00",
  "from_id": 526750941,
  "message": "@lazovicff @dharmikumbhani",
  "reply_to_msg_id": 9098,
  "reactions": [
    {"user_id": 526750941, "emoji": "👍"},
    {"user_id": 123456789, "emoji": "👍"}
  ],
  "replies_count": 3,
  "replies_data": [...]
}
```

- Format: `raw/[channel_id]_user_ids.csv`
- Columns: `user_id,username,first_name,last_name`
- Some users may not have usernames (this is normal on Telegram)

- Format: `raw/[channel_id]_admins.csv`
- Columns: `user_id,username,first_name,last_name`
- Generated by running `python list_admins.py`
- `trust/[channel_id].csv` - Raw trust edges with user IDs (`i,j,v` format)
- `scores/[channel_id].csv` - OpenRank computed scores
- `output/[channel_id].csv` - Processed scores with display names or user IDs
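A simplified sketch of how `i,j,v` edges could be derived from messages, using the `[trust]` weights from `config.toml`. This illustrates the edge format only, not the repo's exact scoring logic; `build_trust_edges` and `username_to_id` are hypothetical names:

```python
POINTS = {"mention": 50, "reply": 40, "reaction": 30}  # mirrors the [trust] defaults

def build_trust_edges(messages, username_to_id):
    """Accumulate directed trust values: edge (i, j, v) means user i trusts user j with weight v."""
    edges = {}

    def add(i, j, v):
        if i is not None and j is not None and i != j:  # ignore self-trust
            edges[(i, j)] = edges.get((i, j), 0) + v

    by_id = {m["id"]: m for m in messages}
    for m in messages:
        author = m.get("from_id")
        # Replies: the replier trusts the author of the original message.
        parent = by_id.get(m.get("reply_to_msg_id"))
        if parent:
            add(author, parent.get("from_id"), POINTS["reply"])
        # Reactions: each reactor trusts the message author.
        for r in m.get("reactions", []):
            add(r["user_id"], author, POINTS["reaction"])
        # Mentions: the author trusts each @mentioned user.
        for word in (m.get("message") or "").split():
            if word.startswith("@"):
                add(author, username_to_id.get(word[1:]), POINTS["mention"])
    return [(i, j, v) for (i, j), v in edges.items()]
```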
The complete trust scoring workflow:

1. Fetch messages: `python read_messages.py`
   - Saves messages to `raw/[channel_id]_messages.json`
   - Saves user info to `raw/[channel_id]_user_ids.csv`
2. Generate trust edges: `python generate_trust.py`
   - Reads messages and calculates trust based on reactions, replies, and mentions
   - Saves trust edges to `trust/[channel_id].csv` (format: `i,j,v`)
3. Compute OpenRank scores: uses the external `openrank` CLI tool

   ```
   openrank compute-local-et trust/[channel_id].csv seed/[channel_id].csv \
     --out-path=scores/[channel_id].csv --alpha=0.25 --delta=0.000001
   ```

4. Process scores: `python process_scores.py`
   - Aggregates incoming trust for each user
   - Converts user IDs to display names
   - Normalizes scores
   - Saves to `output/[channel_id].csv`
5. Generate JSON: `python generate_json.py`
   - Creates UI-ready JSON files in the `ui/` directory
Or run everything at once:
```
./run_pipeline.sh           # For group chats
./run_pipeline.sh --channel # For channels
```

Located in the `schemas/` directory:

- `messages.sql` - Message storage
- `reactions.sql` - Reaction data
- `users.sql` - User information
- `channels.sql` - Channel metadata
- `runs.sql` - Processing run tracking
- `seeds.sql` - Seed values
- `scores.sql` - Computed scores
- `summaries.sql` - AI-generated summaries
```
# Set DATABASE_URL in .env first
export DATABASE_URL=postgresql://user:pass@localhost:5432/dbname

# Import all metadata (messages, reactions, users, channels)
python import_metadata_to_db.py

# Import specific channel
python import_metadata_to_db.py --channel 123456

# Preview without inserting
python import_metadata_to_db.py --dry-run

# Import seeds and scores
python import_scores_to_db.py
```

"Missing Telegram credentials"
→ Make sure `.env` has `TELEGRAM_APP_ID` and `TELEGRAM_APP_HASH`

"Channel is not a valid ID"
→ Only numeric IDs are accepted; run `python list_channels.py` to get IDs

"Could not find the input entity"
→ Make sure the channel ID is correct (from `list_channels.py`)

"A wait of X seconds is required"
→ You're rate limited. Increase `rate_limiting_delay` in `config.toml`

Script keeps getting interrupted
→ Enable checkpoints in `config.toml` with `checkpoint_interval = 2000` to save progress periodically

Want to restart from scratch (ignore checkpoint)
→ When prompted to resume, type `n`, or manually delete checkpoint files in `raw/checkpoints/`

Import errors
→ Install dependencies: `pip install -r requirements.txt`

Authorization failed
→ Make sure you enter the correct phone number and verification code

"Collected info for 0 unique users" for channel posts
→ This is normal for channels (not groups). Set `fetch_replies = true` in `config.toml` to fetch comments/replies, where user interactions happen.

Database connection failed
→ Check that `DATABASE_URL` is set correctly in `.env`

S3 upload failed
→ Check that `S3USERNAME` and `S3CREDENTIAL` are set in `.env`
The crawler creates a `telegram_session.session` file to remember your login.
- This file is automatically created on first login
- Don't commit this file to git (it's in .gitignore)
- Delete it if you want to login with a different account
ISC