Machine learning models are hungry for labeled data. In computer vision, labeling means drawing bounding boxes around objects. In NLP, it means tagging entities in text. But in the world of sensor data — accelerometers, temperature sensors, vibration monitors, current measurements — annotation is a different beast entirely. The data is continuous, high-dimensional, and often requires domain expertise to interpret correctly.
I have spent significant time building and refining annotation pipelines for sensor data in product development projects. The lesson that keeps coming back: fully manual annotation does not scale, but fully automated annotation is not reliable enough for most use cases. The sweet spot lies in hybrid strategies that combine human expertise with algorithmic assistance. This article is a deep dive into those strategies.
The annotation bottleneck
Consider a typical scenario: you are building a machine learning model to detect equipment faults from vibration data. You have a sensor recording at 1 kHz on a motor that runs 24/7. After one week you have over 600 million data points. Somewhere in that data are a handful of fault events — and your model needs to know exactly where they start and end.
Manual annotation of this data means a domain expert scrolling through hours of waveforms, identifying events, marking start and end timestamps, and classifying each one. At best, an experienced annotator can label about 2-4 hours of sensor data per hour of work. For a dataset covering weeks or months, this becomes impossibly expensive. And unlike image labeling, you cannot easily crowdsource it — interpreting vibration spectra requires specialised knowledge.
The cost equation is stark. A supervised learning model might need thousands of labeled examples to generalise well. If each example takes 5 minutes to label, you are looking at hundreds of hours of expert time before you can even train a first model. This is where automation becomes not just helpful, but essential.
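These back-of-the-envelope numbers are easy to verify. A minimal sketch using the figures from the text (the 2,000-example seed dataset size is an assumption for illustration):

```python
# Rough annotation-cost arithmetic for the scenario above.
SAMPLE_RATE_HZ = 1_000            # vibration sensor sampled at 1 kHz
SECONDS_PER_WEEK = 7 * 24 * 3600

points_per_week = SAMPLE_RATE_HZ * SECONDS_PER_WEEK
print(f"data points per week: {points_per_week:,}")              # 604,800,000

# An expert labels 2-4 hours of data per hour of work; assume the midpoint (3x).
hours_of_data = SECONDS_PER_WEEK / 3600
annotation_hours = hours_of_data / 3
print(f"expert hours to label one week: {annotation_hours:.0f}")  # 56

# Hypothetical seed dataset: 2,000 examples at 5 minutes each.
examples, minutes_each = 2_000, 5
print(f"hours for a seed dataset: {examples * minutes_each / 60:.0f}")  # 167
```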
Types of sensor data and their annotation challenges
Not all sensor data is created equal. The annotation approach that works for one type may be completely wrong for another. Understanding the characteristics of your data is the first step to choosing the right strategy.
Event-based data
Accelerometer impacts, machine start/stop events, door openings. Discrete events with clear boundaries. Relatively easy to annotate because events are distinct — the challenge is finding them in long recordings.
State-based data
Temperature regimes, operational modes, degradation phases. The data shifts between states that can last minutes to hours. Boundaries are often gradual rather than sharp, making precise labeling subjective.
Anomaly data
Bearing wear, sensor drift, unusual vibration patterns. By definition rare and diverse. You may have thousands of hours of normal operation but only a handful of anomalies — and each anomaly may look different.
Multi-channel data
Multiple sensors measuring simultaneously (3-axis accelerometer, multi-point temperature, combined vibration + current). The label may depend on patterns across channels, not within a single signal.
Why sensor data is harder to annotate than images
Image annotation has mature tools (LabelImg, CVAT, Label Studio) and straightforward workflows: look at the image, draw a box, assign a class. Sensor data is fundamentally different in ways that make standard annotation tools inadequate:
Temporal context matters. A single data point means nothing on its own. To determine whether a vibration reading is "normal" or "faulty", you need to see minutes or hours of surrounding data, know the operating conditions, and understand the machine's history. This makes labeling slow because the annotator must maintain context while scrolling through long time series.
Boundaries are fuzzy. When does a fault "start"? Is it when the vibration amplitude first exceeds a threshold? When the frequency spectrum shifts? When a human operator would notice something wrong? Different annotators will place boundaries at different points, and there is often no objectively correct answer.
Class imbalance is extreme. In predictive maintenance, 99.9% of the data is "normal". Finding and labeling the 0.1% that matters requires scanning through vast amounts of uninteresting data. This is exhausting for human annotators and leads to missed events.
Domain expertise is essential. Unlike labeling cats vs dogs in images, labeling "inner race bearing fault" in a vibration spectrum requires years of experience. The pool of qualified annotators is small, expensive, and has better things to do.
Strategy 1: Rule-based pre-labeling
The simplest form of automation is applying deterministic rules to generate candidate labels. This works best for event-based data where events have clear signal characteristics.
For example: "If the acceleration exceeds 5g for more than 10 milliseconds, label it as an impact event." Or: "If the temperature rises above 80°C and stays there for more than 5 minutes, label it as an overheating episode." These rules encode basic domain knowledge into simple threshold logic.
The rules do not need to be perfect. Their job is to reduce the annotator's workload from "find and label everything" to "review and correct what the rules found." In practice, a set of well-tuned rules can correctly label 60-80% of events, leaving the annotator to fix the remaining 20-40%. That is a 3-5x speedup in annotation time.
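A minimal sketch of such a rule, using the 5 g / 10 ms impact heuristic from above (the function name, thresholds, and tuple output format are illustrative, not from any particular library):

```python
def prelabel_impacts(signal, sample_rate_hz, threshold_g=5.0, min_duration_s=0.010):
    """Rule-based pre-labeling: flag candidate impact events where the
    absolute acceleration stays above threshold_g for at least min_duration_s.
    Returns (start_index, end_index, "impact") candidates for a human
    annotator to review and correct, not final labels."""
    min_samples = int(min_duration_s * sample_rate_hz)
    events, start = [], None
    for i, value in enumerate(signal):
        if abs(value) >= threshold_g:
            if start is None:
                start = i                      # event may be beginning here
        elif start is not None:
            if i - start >= min_samples:       # long enough to count
                events.append((start, i, "impact"))
            start = None
    if start is not None and len(signal) - start >= min_samples:
        events.append((start, len(signal), "impact"))  # event runs to the end
    return events

# 1 kHz signal: quiet, then a 15 ms burst above 5 g, then quiet again.
sig = [0.1] * 100 + [6.0] * 15 + [0.1] * 100
print(prelabel_impacts(sig, 1000))  # [(100, 115, 'impact')]
```

A 5 ms burst would be ignored by the same rule, which is exactly the kind of boundary case the annotator reviews.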
When it works: Clear signal thresholds, well-understood physics, event-based data. When it fails: Gradual degradation, context-dependent labels, patterns that require frequency-domain analysis.
Strategy 2: Model-assisted labeling
Once you have a small labeled dataset (even a few hundred examples), you can train a preliminary model and use its predictions as pre-annotations for new, unlabeled data. The annotator then reviews and corrects the model's suggestions rather than labeling from scratch.
This creates a virtuous cycle: labeled data trains the model, the model pre-labels new data, the annotator corrects the pre-labels (which is faster than labeling from scratch), the corrected labels go back into training, and the model improves. Each round, the model's predictions get better, and the annotator's corrections get fewer.
The key to making this work is showing the annotator the model's confidence score alongside each prediction. High-confidence predictions can be accepted with a quick glance. Low-confidence predictions get careful attention. This focuses human effort exactly where it is most needed.
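A confidence-based triage step can be as simple as the following sketch (the 0.95 threshold and the tuple format are assumptions to be tuned per project):

```python
def triage(pre_labels, accept_threshold=0.95):
    """Split model pre-labels into a quick-accept queue and a careful-review
    queue. pre_labels is a list of (segment_id, predicted_label, confidence).
    The review queue is sorted least-confident-first, so the annotator's
    attention goes where the model needs it most."""
    accepted = [p for p in pre_labels if p[2] >= accept_threshold]
    review = sorted((p for p in pre_labels if p[2] < accept_threshold),
                    key=lambda p: p[2])
    return accepted, review

preds = [("seg1", "normal", 0.99), ("seg2", "fault", 0.55),
         ("seg3", "normal", 0.97), ("seg4", "fault", 0.72)]
accepted, review = triage(preds)
print([p[0] for p in accepted])  # ['seg1', 'seg3']
print([p[0] for p in review])    # ['seg2', 'seg4']
```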
Strategy 3: Active learning
Active learning takes model-assisted labeling one step further. Instead of presenting all unlabeled data for review, an active learning system selects the most informative samples to show the annotator. "Most informative" typically means the samples where the model is most uncertain — the data points on the decision boundary where a human label would teach the model the most.
The mathematics behind sample selection varies. Uncertainty sampling picks the samples where the model's predicted probability is closest to 50% (for binary classification). Query-by-committee trains multiple models and picks the samples where they disagree most. Expected model change selects samples that would cause the largest update to the model parameters.
In practice, uncertainty sampling is the easiest to implement and works well for most sensor data problems. The improvement can be dramatic: research consistently shows that active learning can achieve the same model performance with 30-70% fewer labeled samples compared to random selection.
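Binary uncertainty sampling is short enough to sketch in full (the dict-of-probabilities input is an assumption; in a real pipeline these come from your current model's predictions on the unlabeled pool):

```python
def uncertainty_sample(probabilities, batch_size):
    """Binary uncertainty sampling: pick the batch_size unlabeled samples
    whose predicted positive-class probability is closest to 0.5.
    probabilities maps sample_id -> the current model's P(positive)."""
    ranked = sorted(probabilities, key=lambda sid: abs(probabilities[sid] - 0.5))
    return ranked[:batch_size]

# The model is sure about "a" and "c", torn about "d" and "b".
probs = {"a": 0.97, "b": 0.55, "c": 0.10, "d": 0.48, "e": 0.70}
print(uncertainty_sample(probs, 2))  # ['d', 'b']
```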
Practical consideration: Active learning assumes you can label data in small batches and retrain the model between batches. For sensor data, this means your annotation tool, model training pipeline, and sample selection logic need to be integrated into a single workflow. Building this infrastructure takes effort up front, but pays off quickly when you have large datasets to annotate.
Strategy 4: Weak supervision
Weak supervision, popularised by Stanford's Snorkel project, flips the annotation paradigm. Instead of labeling individual data points, you write labeling functions — small programs that each express one heuristic about how to label the data. Each function may be noisy (sometimes wrong), but by combining many noisy functions, you can generate surprisingly accurate labels.
Threshold functions
"If peak vibration exceeds X, label as fault." Simple, fast, but coarse. Works well as one signal among many.
Pattern functions
"If the frequency spectrum has a peak at the bearing fault frequency, label as bearing issue." Encodes domain knowledge about specific failure modes.
Context functions
"If the machine was scheduled for maintenance within 24 hours and vibration is elevated, label as degraded." Combines sensor data with external information.
Model-based functions
"If a pre-trained anomaly detection model flags this segment, label as anomalous." Leverages existing models as noisy labelers alongside rule-based functions.
A label model (the core of the Snorkel framework) then learns the accuracy and correlations of the labeling functions and combines their votes into probabilistic labels. The beauty of this approach is leverage: writing a labeling function takes minutes, yet it labels the entire dataset at once, whereas encoding the same knowledge by hand-labeling individual data points takes hours. You can rapidly iterate on your labeling strategy without touching individual samples.
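A toy version of the idea, with three hypothetical labeling functions over pre-computed segment features and a plain majority vote standing in for Snorkel's learned label model (the feature names and thresholds are made up for illustration):

```python
ABSTAIN, NORMAL, FAULT = None, "normal", "fault"

def lf_peak_threshold(seg):
    # Threshold function: a very high vibration peak suggests a fault.
    return FAULT if seg["peak_g"] > 8.0 else ABSTAIN

def lf_bearing_frequency(seg):
    # Pattern function: energy at the bearing fault frequency.
    return FAULT if seg.get("bearing_freq_peak", False) else ABSTAIN

def lf_calm_signal(seg):
    # Low, steady vibration is almost certainly normal operation.
    return NORMAL if seg["peak_g"] < 1.0 else ABSTAIN

def combine(segment, labeling_functions):
    """Majority vote over the non-abstaining functions (ties broken
    arbitrarily). Snorkel's label model goes further: it learns each
    function's accuracy and their correlations from the votes alone."""
    votes = [v for v in (lf(segment) for lf in labeling_functions)
             if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_peak_threshold, lf_bearing_frequency, lf_calm_signal]
print(combine({"peak_g": 9.5, "bearing_freq_peak": True}, lfs))  # fault
print(combine({"peak_g": 0.5}, lfs))                             # normal
```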
When weak supervision shines: When you have strong domain knowledge that can be expressed as rules but the rules alone are too noisy. When you have multiple signals or data sources that each provide partial evidence. When the volume of data is so large that even active learning would take too long.
Strategy 5: Semi-supervised and self-supervised approaches
These approaches reduce annotation needs by leveraging the structure of unlabeled data itself. Semi-supervised learning uses a small labeled set alongside a large unlabeled set, assuming that data points close together in feature space are likely to have the same label. Self-supervised learning creates pretext tasks (like predicting the next sensor reading, or detecting which segment has been artificially corrupted) that teach the model useful representations without any labels at all.
For sensor data, contrastive learning has shown particular promise. By training a model to recognise that two windows of data from the same operational state are "similar" while windows from different states are "different", you can learn powerful feature representations. These features then make the downstream classification task much easier, often requiring only a fraction of the labeled data that a model trained from scratch would need.
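The data-side half of contrastive pre-training, sampling positive and negative window pairs, can be sketched without any deep learning framework (treating each operational state's recording as one unlabeled segment is an assumption, and the encoder training itself is omitted):

```python
import random

def cut(signal, window, rng):
    # Take one random window-length slice from a recording.
    start = rng.randrange(len(signal) - window + 1)
    return signal[start:start + window]

def contrastive_pairs(recordings, window, n_pairs, seed=0):
    """Sample training pairs for contrastive pre-training.
    recordings maps state_id -> one long signal per operational state.
    A positive pair (label 1) is two windows from the same state; a
    negative pair (label 0) is windows from two different states. No
    human labels are needed, only the recording boundaries."""
    rng = random.Random(seed)
    states = list(recordings)
    pairs = []
    for _ in range(n_pairs):
        if rng.random() < 0.5:
            s = rng.choice(states)
            pairs.append((cut(recordings[s], window, rng),
                          cut(recordings[s], window, rng), 1))
        else:
            s1, s2 = rng.sample(states, 2)
            pairs.append((cut(recordings[s1], window, rng),
                          cut(recordings[s2], window, rng), 0))
    return pairs
```

A contrastive loss then pulls the encoder's representations of each positive pair together and pushes negative pairs apart.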
In practice, I use self-supervised pre-training as a complement to the other strategies. Train a feature extractor on all your unlabeled data (you have plenty), then fine-tune a classifier on the labeled subset (which you have carefully curated using active learning or weak supervision). This combination often delivers the best results with the least annotation effort.
Building a practical annotation pipeline
Theory is useful, but what does an actual annotation pipeline look like in practice? Here is the workflow I have converged on across multiple projects:
Data exploration and rule definition
Spend time with the domain expert looking at raw data. Identify obvious patterns and encode them as rule-based pre-labels. This generates a noisy but large initial dataset and builds shared understanding of what the labels mean.
Manual correction of a seed set
Have the expert manually verify and correct 200-500 pre-labeled samples. This creates a clean seed set for model training and refines the annotation guidelines. Document ambiguous cases and edge cases explicitly.
Train initial model and enter active learning loop
Train a model on the seed set. Use uncertainty sampling to select the next batch of samples for human review. After each batch, retrain and repeat. Continue until model performance plateaus or the annotation budget is exhausted.
Weak supervision for scaling
If the labeled set is still too small, write labeling functions that encode the knowledge accumulated during the previous steps. Use a label model to generate probabilistic labels for the remaining unlabeled data. Train the final model on the combination of human labels and weak labels.
Quality assurance and iteration
Evaluate the model on a held-out test set with verified labels. Analyse failure modes — where does the model fail? Are there patterns? Use these insights to refine labeling functions, collect more targeted annotations, and improve the model iteratively.
Quality assurance: when is a label good enough?
Automated and semi-automated annotation introduces a fundamental question: how do you know the labels are correct? Unlike a carefully hand-labeled gold standard, automatically generated labels will always have some noise. The question is whether that noise is acceptable for your use case.
Measuring label quality
- Inter-annotator agreement: Have two experts label the same data independently and measure agreement (Cohen's kappa). This sets the ceiling for model performance.
- Spot-check audits: Randomly sample 5-10% of auto-generated labels and have an expert verify them. Track accuracy over time.
- Model performance on clean test set: Maintain a small, carefully verified test set. If model performance on this set is acceptable, label noise in the training set is likely manageable.
- Confusion analysis: Look at what the model confuses. If label noise causes systematic confusion between two classes, targeted re-annotation of those classes is more efficient than re-labeling everything.
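Cohen's kappa, the agreement measure mentioned above, is simple enough to compute without a stats library:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items:
    observed agreement, corrected for the agreement expected by chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c]
                   for c in set(labels_a) | set(labels_b)) / (n * n)
    if expected == 1:       # degenerate case: only one class ever used
        return 1.0
    return (observed - expected) / (1 - expected)

a = ["fault", "normal", "normal", "fault", "normal", "normal"]
b = ["fault", "normal", "fault", "fault", "normal", "normal"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```

Here the annotators agree on 5 of 6 items (83%), but after subtracting the 50% agreement expected by chance, kappa is only 0.667, which is why raw percent agreement overstates consistency.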
Acceptable noise levels
- Safety-critical applications: Label noise below 2%. Human verification of all edge cases. Consider this mandatory for medical, automotive, or industrial safety.
- Quality monitoring: Label noise below 5%. Automated labels with periodic expert review. Suitable for most industrial analytics applications.
- Exploratory analysis: Label noise below 15%. Weak supervision labels are often sufficient. Good enough for trend detection and initial model development.
- Pre-training: Higher noise is tolerable. Self-supervised or weakly supervised labels work well for feature learning, even with significant noise.
When to automate and when to stay manual
Automation is not always the answer. Here is a simple decision framework:
Stay manual when your dataset is small (under 500 samples), when the task requires deep contextual understanding that is hard to encode in rules, when you are still defining what the labels should be (the annotation guidelines are not stable), or when the cost of a wrong label is very high (safety-critical applications).
Automate when your dataset is large (thousands to millions of samples), when clear patterns or thresholds exist in the data, when you have a working model that can be improved with more data, when annotation cost or time is the primary bottleneck, or when the same annotation task will recur as new data arrives.
Most real projects start manual and transition to automation as the understanding of the data matures. The initial manual effort is never wasted — it builds the domain knowledge and seed datasets that automation strategies depend on.
Lessons from real projects
Across several sensor data ML projects, here are the lessons that keep coming back:
Define the label schema carefully
Spend more time on what the labels mean than on how to apply them. Ambiguous definitions create inconsistent labels that no automation can fix. Write down definitions, examples, and counter-examples for each class.
Invest in tooling early
A good annotation interface for time-series data saves hours per week. Build or configure a tool that lets annotators zoom, pan, overlay multiple channels, and edit labels with keyboard shortcuts. The ROI is immediate.
Version your labels
Labels evolve as understanding deepens. Track every change: who labeled what, when, and according to which version of the guidelines. This is essential for reproducibility and debugging model regressions.
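One lightweight way to get this tracking is an append-only log of label records, resolved to the latest label per segment on read (the field names here are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class LabelRecord:
    """One immutable labeling event. Append a new record instead of
    overwriting, so every label's history stays reconstructible."""
    segment_id: str
    label: str
    annotator: str          # human name, or a model/pipeline identifier
    guideline_version: str  # which revision of the guidelines applied
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def current_labels(history):
    """Resolve an append-only history to the latest label per segment."""
    latest = {}
    for rec in sorted(history, key=lambda r: r.created_at):
        latest[rec.segment_id] = rec
    return latest
```

A re-labeling under new guidelines is then just another appended record, and debugging a model regression starts with diffing current_labels at two points in time.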
Start with the model, not the data
Train a simple model on a tiny labeled set first. See where it fails. Then annotate specifically to address those failures. This targeted approach is 5-10x more efficient than labeling data randomly.
Conclusion: annotation as an engineering discipline
Data annotation is not a mundane task to be outsourced and forgotten — it is an engineering discipline that directly determines the quality of your machine learning model. For sensor data, the challenges are unique: temporal dependencies, fuzzy boundaries, extreme class imbalance, and the need for domain expertise. But the strategies are well-established: rule-based pre-labeling, model-assisted annotation, active learning, weak supervision, and self-supervised pre-training.
The most successful approach combines these strategies into a pipeline that evolves with your project. Start manual, encode knowledge into rules, let models assist the annotators, and scale with weak supervision. At each stage, measure label quality and invest human effort where it matters most.
The payoff is significant. A well-designed annotation pipeline can reduce labeling time by 5-10x, improve label consistency, and make it practical to build ML models on sensor data that would otherwise be too expensive to annotate. In a world where the model architecture is increasingly commoditised, the quality of your data is your real competitive advantage.
Working on a machine learning project with sensor data? I build end-to-end ML pipelines for product development, from data collection and annotation to model training and deployment on embedded hardware. Get in touch to discuss your data challenge.