To ensure successful fine-tuning, consider the best practices in the sections below.
The training algorithm uses deep learning to create a data-driven decision boundary. To establish an effective boundary in a data-driven manner, provide diverse examples:
By providing both types of examples, you help the base model learn to distinguish between the desired taxonomy and visually similar instances. This approach improves the precision and generalization capabilities of the model in real-world scenarios.
For example, if you want to fine-tune a base model to recognize the “timeout” gesture in American football footage, you should provide the following:
When fine-tuning a base model, the training data should be representative of the real-world scenarios in which the model will be used, including factors such as lighting conditions, camera angles, and object variations.
Align the data distribution with your practical use case to improve the accuracy and reliability of the model in the target environment.
For example, if you’re fine-tuning a base model to detect product defects, consider two approaches to creating a training dataset:
If your training data doesn’t account for real-world variations, the performance of the model may decrease when deployed in practical settings.
Visual similarity occurs when the embedding vectors of a taxonomy cluster well in the embedding space. By ensuring visual similarity in your taxonomy examples, you can create more stable and effective fine-tuned models. This approach helps minimize distortions to the base model and improves the overall performance of your fine-tuned model in real-world applications.
The impact of visual dissimilarity:
To assess the visual similarity of taxonomy examples, you can use the following method:
For example, when fine-tuning a model to detect “hurdles,” you might perform the following searches:
If both searches yield similar results, the examples of “hurdles” are likely visually similar and well-suited for fine-tuning.
The quality and specificity of your training dataset significantly impact the performance of the fine-tuned model. When preparing your dataset, focus on two key aspects of taxonomy tightness: spatial and temporal.
Spatial tightness: Refers to the precision and specificity of visual content within the training data. To ensure spatial tightness, follow these best practices:
Temporal tightness: Refers to the accuracy and precision of time-based annotations within the training dataset. To ensure temporal tightness, follow these best practices:
Maintaining both spatial and temporal tightness in your dataset helps create a more accurate and reliable fine-tuned model. Focusing on these aspects ensures that your model learns to recognize and classify actions with precision and accuracy.
This section illustrates critical concepts in fine-tuning, such as taxonomy tightness, data diversity, and temporal tightness. Each concept is essential for effective model training and performance.
The videos below illustrate the importance of temporal tightness in training data. They demonstrate how the precise timing of action labeling affects model training. Accurately isolating the specific moment of an action is crucial for the model to learn the correct association between the action and its visual cues.
This video shows an example of tight temporal bounds for a “spike” action in American football. The ground truth segment accurately isolates the specific moment of the spike:

This video illustrates loose temporal bounds for a “spike” action. It includes the spike motion followed by unrelated hand-waving, which can confuse the model if incorrectly labeled:

This video illustrates the importance of spatial tightness in training data. It represents a suboptimal training sample for identifying paramedic activities. While paramedics are present in the scene, they appear as small figures in the background next to an ambulance, making it difficult for the model to associate this video segment with paramedic-related categories. For effective training, the target action should be the primary focus of the sample.

The videos below illustrate the importance of diverse positive samples in training data. Insufficient diversity in training data can lead to poor performance. When training samples lack the complexity of real-world scenarios, the model may fail to generalize effectively.
The video below shows an example of insufficient diversity in training data. In it, a cyclist falls from their bike on an empty road. This isolated scenario, while clear, doesn’t represent the full range of real-world conditions.

The video below demonstrates a real-world scenario. It shows a cyclist falling during a professional cycling competition with other racers nearby, better representing the environmental complexity the model needs to handle.
