Project Overview
This project investigates message deletion and potential censorship behavior in Telegram channels.
The core idea is to model Telegram discussions as graphs, where users and messages interact through
replies, temporal activity, and channel-specific moderation patterns.
Unlike a standard text-only classifier, the project focuses on graph-based modelling. The pipeline combines
reply relationships, sender behavior, temporal features, topic information, and propaganda indicators, with
deletion labels as the supervision signal. It serves two purposes: predicting whether a message is likely to
be deleted, and imputing deletion labels in channels where real-time deletion labels are missing.
Research Motivation
Telegram has become a major platform for political communication, propaganda, and large-scale group
discussions. However, moderation and deletion behavior can be difficult to observe if data are collected
only from historical exports, because messages deleted before export disappear from the historical record.
Main research goal: model the structural, temporal, behavioral, and topical patterns
associated with deleted Telegram messages in order to better understand moderation and censorship dynamics.
Dataset
The project is based on a large Telegram dataset collected from multiple Russian-language channels.
The dataset combines historical channel exports with real-time message collection. This dual collection
strategy makes it possible to label deleted messages: a message observed in real time but missing from
the later historical export can be marked as deleted.
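The labeling rule above can be sketched as a set difference. This is a minimal illustration, not the project's collection code: the function name `label_deletions` and the message IDs are hypothetical, and it assumes IDs are unique within one channel and that the historical export was taken after the real-time observation window.

```python
def label_deletions(realtime_ids, historical_ids):
    """Label each message seen in real time: 1 if it is missing from the
    later historical export (treated as deleted), otherwise 0."""
    historical = set(historical_ids)
    return {mid: int(mid not in historical) for mid in realtime_ids}


# Hypothetical message IDs for a single channel.
realtime_ids = [101, 102, 103, 104]
# 105 appears only in the export (posted outside the real-time window),
# so it receives no label at all.
historical_ids = [101, 103, 104, 105]

labels = label_deletions(realtime_ids, historical_ids)
```

Messages that appear only in the historical export stay unlabeled, which is why channels without real-time collection need imputation.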
Dataset Characteristics
- Large-scale Telegram message dataset with more than 17 million messages.
- 13 Telegram channels, including Readovka, Nexta, Topor, Ru2ch, Rtrus, Shtefanov, Samaranovosti, Murz, and Agitprop.
- Historical and real-time data collection sources.
- Message deletion labels for channels with real-time collection.
- Propaganda labels for pro-Russian and pro-Ukrainian propaganda networks.
- Metadata including channel, origin, topic, sender ID, timestamp, reply relationships, and text.
Problem Definition
The main supervised learning task is binary classification: predict whether a Telegram message is deleted
or not deleted. The target variable is the deletion label, while the predictors combine message-level,
sender-level, thread-level, temporal, and graph-level information.
A second important task is imputation: for channels such as Murz and Agitprop, where only historical data
are available and no real-time deletion labels exist, the trained graph-based model can estimate deletion
probabilities and support censorship analysis even without direct deletion observations.
Graph Construction
Telegram conversations are represented as graphs. Reply relationships are especially important because
moderation decisions may depend not only on a single message, but also on the context around it: the parent
message, the reply chain, the active users, and the surrounding discussion burst.
Graph Design
- Nodes: messages or users, depending on the modelling stage.
- Edges: reply relations between messages or interactions between users.
- Edge attributes: response delay, same-author reply flag, reply depth, message-length ratios, and temporal information.
- Graph context: in-degree, out-degree, reply-chain depth, thread-level deletion history, and topic/channel context.
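A message-level reply graph of this kind could be built as in the sketch below. The record fields (`id`, `sender`, `ts`, `reply_to`) and the function name are assumptions made for illustration. Only two of the listed edge attributes (response delay and the same-author flag) are shown, and a reply whose parent is not in the data is treated as a thread root.

```python
from collections import defaultdict

# Hypothetical message records; ts is seconds since the start of the thread.
messages = [
    {"id": 1, "sender": "a", "ts": 0, "reply_to": None},
    {"id": 2, "sender": "b", "ts": 30, "reply_to": 1},
    {"id": 3, "sender": "a", "ts": 45, "reply_to": 2},
    {"id": 4, "sender": "a", "ts": 50, "reply_to": 1},
]


def build_reply_graph(messages):
    """Return reply edges with attributes, in-degree, and reply depth."""
    by_id = {m["id"]: m for m in messages}
    edges = []                    # (child_id, parent_id, attributes)
    in_degree = defaultdict(int)  # number of replies a message received
    depth = {}                    # 0 for thread roots
    # Process in time order so a parent's depth is known before its replies.
    for m in sorted(messages, key=lambda m: m["ts"]):
        parent = by_id.get(m["reply_to"])
        if parent is None:
            depth[m["id"]] = 0
            continue
        attrs = {
            "delay": m["ts"] - parent["ts"],
            "same_author": m["sender"] == parent["sender"],
        }
        edges.append((m["id"], parent["id"], attrs))
        in_degree[parent["id"]] += 1
        depth[m["id"]] = depth.get(parent["id"], 0) + 1
    return edges, dict(in_degree), depth


edges, in_degree, depth = build_reply_graph(messages)
```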
Feature Engineering
The project uses a rich feature set designed to capture more than the raw text of a message. The goal is
to detect patterns that indicate moderation risk, such as sender history, reply-chain position, temporal bursts,
and topic-specific deletion behavior.
| Feature Group | Examples | Purpose |
|---|---|---|
| Message Features | message length, word count, links, hashtags, question marks | Capture direct content structure and surface-level text signals |
| Temporal Features | hour, weekday, month, time bin, daily message count | Model timing and moderation bursts |
| Sender Features | sender message count, past deletion rate, active hours, topic diversity | Model behavioral risk of the sender |
| Reply-Chain Features | reply depth, parent relation, response delay, same-author reply | Capture conversation structure and context |
| Graph Features | in-degree, out-degree, centrality-style interaction signals | Represent structural importance within the discussion network |
| Propaganda / Topic Features | ru_pa, ua_pa, topic, channel, origin | Capture propaganda and channel-level context |
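A sender's past deletion rate is one feature where temporal leakage is easy to introduce. The sketch below computes it from strictly earlier messages only. The field names and the function `past_deletion_rate` are hypothetical, and the sketch makes a simplifying assumption: the deletion status of an earlier message is already known when a later one is posted.

```python
from collections import defaultdict

# Hypothetical labeled messages; deleted is 0 or 1.
history = [
    {"id": 1, "sender": "a", "ts": 0, "deleted": 1},
    {"id": 2, "sender": "b", "ts": 5, "deleted": 0},
    {"id": 3, "sender": "a", "ts": 10, "deleted": 0},
    {"id": 4, "sender": "a", "ts": 20, "deleted": 1},
]


def past_deletion_rate(messages):
    """For each message, the share of the sender's EARLIER messages that
    were deleted. The current message never contributes to its own feature."""
    seen = defaultdict(int)
    deleted = defaultdict(int)
    rates = {}
    for m in sorted(messages, key=lambda m: m["ts"]):
        s = m["sender"]
        rates[m["id"]] = deleted[s] / seen[s] if seen[s] else 0.0
        # Update the counters only after the feature has been computed.
        seen[s] += 1
        deleted[s] += m["deleted"]
    return rates


rates = past_deletion_rate(history)
```

Computing the rate over a sender's full history instead would let each message's own label, and the labels of later messages, leak into its features.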
Model Architecture
The final modelling direction uses graph neural networks, especially GAT/GATv2-style architectures.
The goal is to let the model learn from a message together with its neighborhood, reply structure, and
edge attributes, rather than treating each message independently.
Model Components
- GATv2 encoder: learns node embeddings from graph neighborhoods and edge attributes.
- Edge-aware classifier: combines graph embeddings with behavioral, temporal, and reply-chain features.
- MLP classification head: predicts deletion probability.
- Focal Loss / class weighting: addresses imbalance between deleted and non-deleted messages.
- NeighborLoader: enables scalable mini-batch training without full-graph forward passes.
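The text names focal loss but does not give its form or hyperparameters. As a reference point, the sketch below implements the standard binary focal loss for a single prediction in plain Python. The `gamma` and `alpha` values are illustrative defaults, not the project's settings, and an actual training loop would use a tensor version of the same formula.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one example.

    p: predicted probability of the positive class (deleted)
    y: true label, 1 for deleted and 0 for not deleted
    gamma: focusing parameter; larger values down-weight easy examples more
    alpha: weight of the positive class (the negative class gets 1 - alpha)
    """
    p = min(max(p, 1e-7), 1 - 1e-7)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1 - p     # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.9, 1)  # confident and correct: small loss
hard = binary_focal_loss(0.1, 1)  # confident and wrong: large loss
```

With `gamma=0` the expression reduces to class-weighted cross-entropy, so the two imbalance strategies mentioned above are closely related.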
Training Strategy
Because the dataset is very large, training is designed to be memory-aware. The project uses PyTorch
Geometric-style mini-batching over sampled subgraphs, so no single forward pass has to cover the full graph.
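The idea behind neighbor-sampled mini-batches can be illustrated without any graph library. The sketch below is plain Python and is not PyTorch Geometric's implementation: starting from a batch of seed messages, it samples a capped number of neighbors per hop, so memory use depends on the batch size and the caps rather than on the size of the whole graph. The function name, the adjacency format, and the fan-out values are all hypothetical.

```python
import random


def sample_subgraph(adj, seeds, fanouts, rng):
    """Sample a small subgraph around the seed nodes.

    adj: dict mapping a node to the list of its neighbors
    fanouts: maximum number of neighbors sampled per node, one entry per hop
    """
    nodes = set(seeds)
    frontier = list(seeds)
    edges = []
    for fanout in fanouts:
        next_frontier = []
        for node in frontier:
            neighbors = adj.get(node, [])
            chosen = rng.sample(neighbors, min(fanout, len(neighbors)))
            for nb in chosen:
                edges.append((nb, node))  # information flows neighbor -> node
                if nb not in nodes:
                    nodes.add(nb)
                    next_frontier.append(nb)
        frontier = next_frontier
    return nodes, edges


# Hypothetical adjacency lists for a tiny reply graph.
adj = {0: [1, 2, 3], 1: [4, 5], 2: [6], 3: []}
nodes, edges = sample_subgraph(adj, seeds=[0], fanouts=[2, 1], rng=random.Random(0))
```

Only the sampled nodes and edges are passed to the model for that batch, and the loss is computed on the seed nodes.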
Model Development Path
- Start with channel-specific graph models on channels such as Samaranovosti, Nexta, and Readovka.
- Engineer reply-chain, sender-behavior, temporal, and graph features.
- Train GAT/GATv2-based deletion classifiers with class imbalance handling.
- Evaluate with threshold tuning and classification metrics.
- Use pretraining or cross-channel learning to improve robustness.
- Apply the trained model for deletion imputation on channels without real-time labels.
Evaluation Metrics
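Because deleted messages are relatively rare, overall accuracy is not sufficient: a model that labels every
message as not deleted would still score high accuracy while detecting nothing. The evaluation therefore focuses
on ranking quality, positive-class detection, and the balance between false positives and false negatives.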
- ROC-AUC: measures how well the model ranks deleted messages above non-deleted messages.
- PR-AUC: important for rare deletion labels because it focuses on positive-class retrieval.
- F1-score: balances precision and recall after choosing a decision threshold.
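To make the two ranking metrics concrete, the sketch below computes ROC-AUC as the fraction of (deleted, non-deleted) pairs ranked correctly, and average precision as a common estimate of PR-AUC. The labels and scores are made up. The pairwise loop is quadratic and only suitable for illustration, so a dataset of this size would need a library implementation.

```python
def roc_auc(y_true, scores):
    """Probability that a random deleted message scores higher than a
    random non-deleted one (ties count as half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each deleted message.
    Assumes at least one positive and ignores tied scores."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical labels (1 = deleted) and model scores.
y_true = [0, 0, 1, 0, 1, 0, 0, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.3, 0.05]

auc = roc_auc(y_true, scores)           # 10 of 12 pairs ranked correctly
ap = average_precision(y_true, scores)  # (1/1 + 2/4) / 2
```

On this toy data the ROC-AUC is about 0.83 while the average precision is 0.75: the single high-scoring false positive costs more under the precision-based metric, which is why it is the stricter view when positives are rare.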
Additional Metrics
- Precision: among predicted deleted messages, how many were actually deleted.
- Recall: among actually deleted messages, how many the model detected.
- Threshold tuning: used to select the decision boundary that best balances precision and recall.
- Confusion matrix: used to inspect false positives and false negatives.
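Threshold tuning and the confusion matrix fit together as in the sketch below. It scans every distinct score as a candidate threshold and keeps the one with the highest F1. Choosing F1 as the selection criterion is an assumption here (the text only says the threshold should balance precision and recall), the data are made up, and in practice the threshold would be tuned on validation data rather than on the test set.

```python
def confusion(y_true, scores, threshold):
    """Return (tp, fp, fn, tn) when predicting 'deleted' for score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted = s >= threshold
        if predicted and y == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, scores, threshold):
    tp, fp, fn, _ = confusion(y_true, scores, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_threshold(y_true, scores):
    """Pick the candidate threshold with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: precision_recall_f1(y_true, scores, t)[2])


# Hypothetical validation labels (1 = deleted) and model scores.
val_labels = [1, 0, 1, 1, 0, 0]
val_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

threshold = best_threshold(val_labels, val_scores)
precision, recall, f1 = precision_recall_f1(val_labels, val_scores, threshold)
```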
Imputation and Smoothing Strategy
A key part of the project is estimating deletion probabilities for messages in channels where real deletion
labels are unavailable. Instead of training a second model on model-generated labels, smoothing is used only
as a post-processing step to make predictions more consistent with graph and temporal structure.
Post-processing Ideas
- Thread smoothing: boost borderline messages when surrounding reply-chain neighbors have high deletion probability.
- Sender burst smoothing: adjust borderline messages during short sender-level deletion bursts.
- Time-bin smoothing: use rolling averages to reduce unrealistic sudden drops in estimated deletion rates.
- User/topic analysis: aggregate deletion probabilities by sender, topic, and channel.
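Two of these post-processing ideas, thread smoothing and time-bin smoothing, can be sketched in a few lines. Every number below (the borderline band, the neighbor cut-off, the boost, the window size) is a hypothetical placeholder, not a value from the project, and the function names are illustrative.

```python
def thread_smooth(probs, neighbors, low=0.4, high=0.6, neighbor_cut=0.8, boost=0.1):
    """Raise borderline predictions whose reply-chain neighbors look deleted.

    probs: dict message_id -> predicted deletion probability
    neighbors: dict message_id -> list of reply-chain neighbor ids
    """
    smoothed = dict(probs)
    for mid, p in probs.items():
        nbrs = neighbors.get(mid, [])
        if not nbrs or not (low <= p <= high):
            continue  # only borderline messages with neighbors are adjusted
        # Use the ORIGINAL probabilities so the result does not depend on order.
        mean_nbr = sum(probs[n] for n in nbrs) / len(nbrs)
        if mean_nbr >= neighbor_cut:
            smoothed[mid] = min(1.0, p + boost)
    return smoothed


def rolling_mean(values, window=3):
    """Trailing rolling average over per-time-bin deletion rates."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Hypothetical predictions: message 2 sits between two high-risk neighbors.
probs = {1: 0.9, 2: 0.5, 3: 0.85, 4: 0.5, 5: 0.1}
neighbors = {2: [1, 3], 4: [5]}
smoothed = thread_smooth(probs, neighbors)

# Hypothetical per-bin deletion rates with one sudden drop to zero.
bin_rates = rolling_mean([0.2, 0.2, 0.0, 0.2], window=2)
```

Because both steps only adjust the model's output, no second model is trained on model-generated labels.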
Results and Interpretation
The project showed that deletion prediction benefits from combining graph structure, sender behavior,
temporal context, and metadata. Text alone is not enough to understand moderation patterns because deletion
can depend on who sends a message, where it appears in a thread, when it is posted, and how the surrounding
conversation develops.
The strongest interpretation of this project is not simply that a classifier can predict deletion. The more
important contribution is the research pipeline: constructing graph-based representations of Telegram
conversations, designing leakage-aware features, evaluating imbalanced deletion prediction, and using the
trained model for censorship-oriented analysis and imputation.
Limitations
The project has several important limitations. First, deleted messages can be observed only when real-time
collection exists. Second, it is not always possible to know whether a message was deleted by the user or by
a moderator. Third, imputation for historical-only channels should be interpreted as estimated deletion risk,
not as ground-truth censorship.
- Deletion labels are available only for channels with real-time data collection.
- Moderator deletion and user self-deletion cannot always be separated.
- Channel-specific moderation policies may reduce cross-channel generalization.
- Graph models require careful memory management on large-scale data.
- Features using future information must be avoided to prevent temporal leakage.
Outcome
This project strengthened my ability to design a research-grade machine learning pipeline for large-scale
social media data. It combines graph machine learning, temporal feature engineering, imbalanced classification,
censorship analysis, and scalable PyTorch Geometric training.
It is one of my most important portfolio projects because it demonstrates both technical depth and research
maturity: data understanding, graph modelling, feature design, model evaluation, imputation strategy, and careful
interpretation of limitations.
Graph Neural Networks
GATv2
PyTorch Geometric
Telegram
Censorship Detection
Deletion Prediction
Imbalanced Classification
Temporal Features
Graph Features