Best practices

To ensure successful fine-tuning, consider the best practices in the sections below.

Provide diverse positive and negative examples

The training algorithm uses deep learning to create a data-driven decision boundary. To establish an effective boundary, provide diverse examples:

  • Positive examples: Instances that belong to the target taxonomy.
  • Negative examples: Instances that don't belong to the target taxonomy but share similar visual characteristics.

By providing both types of examples, you help the base model learn to distinguish between the desired taxonomy and visually similar instances. This approach improves the precision and generalization capabilities of the model in real-world scenarios.

For example, if you want to fine-tune a base model to recognize the "timeout" gesture in American football footage, you should provide the following (a minimal dataset sketch follows the list):

  • Positive examples:
    • Players or coaches calling a timeout in various American football games.
    • Timeout gestures performed by different individuals, to capture variations in motion.
    • Timeout gestures from different camera angles and distances to improve the robustness of the model.
  • Negative examples:
    • Unrelated actions that resemble the timeout gesture, such as clapping or waving.
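
To make this concrete, the sketch below shows one way such a dataset could be laid out as a Python manifest. The file names, field names (video, start_sec, end_sec, label, positive), and timestamps are illustrative assumptions, not a required schema; adapt them to whatever format your fine-tuning pipeline expects.

```python
# Hypothetical dataset manifest pairing positive and negative examples for
# the "timeout" taxonomy. Field names, file names, and timestamps are
# illustrative only.
timeout_examples = [
    # Positive examples: the timeout gesture across different games,
    # individuals, camera angles, and distances.
    {"video": "game_01_sideline.mp4", "start_sec": 132.0, "end_sec": 135.5,
     "label": "timeout", "positive": True},
    {"video": "game_07_endzone.mp4", "start_sec": 4.2, "end_sec": 7.0,
     "label": "timeout", "positive": True},
    # Negative examples: visually similar but unrelated actions,
    # such as clapping or waving.
    {"video": "game_03_bench.mp4", "start_sec": 88.0, "end_sec": 90.5,
     "label": "waving", "positive": False},
    {"video": "game_05_crowd.mp4", "start_sec": 12.0, "end_sec": 14.0,
     "label": "clapping", "positive": False},
]
```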

Match data distribution to your use case

When fine-tuning a base model, the training data should be representative of the real-world scenarios in which the model will be used, including factors such as lighting conditions, camera angles, and object variations.

Align the data distribution with your practical use case to improve the accuracy and reliability of the model in the target environment.

For example, if you're fine-tuning a base model to detect product defects, consider two approaches to creating a training dataset:

  • Limited dataset: This dataset contains only close-up images of defective products under ideal lighting conditions. As a result:
    • The model learns to identify defects based on specific, controlled conditions.
    • The model may struggle with real-world applications that involve varying distances and lighting.
  • Comprehensive dataset: This dataset includes images of products at various distances from the camera and under different lighting conditions. As a result:
    • The model performs better in real-world environments.
    • The model can detect defects across a range of practical scenarios.

If your training data doesn't account for real-world variations, the performance of the model may decrease when deployed in practical settings.
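
One practical way to catch this early is to audit how your samples are distributed across the conditions that matter for deployment. The sketch below is a minimal illustration: the lighting and distance metadata fields are assumptions, and you would record whatever conditions are relevant to your target environment.

```python
from collections import Counter

# Hypothetical per-sample metadata for a defect-detection dataset.
# The "lighting" and "distance" fields are assumptions for illustration.
samples = [
    {"clip": "line_a_001.mp4", "lighting": "bright", "distance": "close"},
    {"clip": "line_a_002.mp4", "lighting": "dim", "distance": "far"},
    {"clip": "line_b_003.mp4", "lighting": "bright", "distance": "far"},
    {"clip": "line_b_004.mp4", "lighting": "dim", "distance": "close"},
]

# Count how training samples are distributed across each condition so you
# can compare the counts against what you expect to see in production.
for field in ("lighting", "distance"):
    counts = Counter(sample[field] for sample in samples)
    print(field, dict(counts))
```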

Provide visually similar examples

Visual similarity occurs when the embedding vectors of a taxonomy cluster well in the embedding space. By ensuring visual similarity in your taxonomy examples, you can create more stable and effective fine-tuned models. This approach helps minimize distortions to the base model and improves the overall performance of your fine-tuned model in real-world applications.
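
If you can obtain embedding vectors for your taxonomy examples, one rough way to gauge how tightly they cluster is to compute their average pairwise cosine similarity. The sketch below assumes the embeddings are already available as a NumPy array; the function and the random placeholder vectors are illustrative only.

```python
import numpy as np

def mean_pairwise_cosine_similarity(embeddings: np.ndarray) -> float:
    """Average cosine similarity between all pairs of embedding vectors.

    Higher values suggest the taxonomy's examples cluster more tightly
    in the embedding space.
    """
    # Normalize each vector to unit length, then take pairwise dot products.
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = normalized @ normalized.T
    # Average over distinct pairs only (exclude the diagonal of self-similarities).
    n = len(embeddings)
    return float((similarity.sum() - n) / (n * (n - 1)))

# Random placeholder vectors standing in for real clip embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(8, 512))
print(mean_pairwise_cosine_similarity(fake_embeddings))
```

If you don't have direct access to embeddings, the search-based comparison described later in this section serves a similar purpose.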

The impact of visual dissimilarity:

  • When examples of a taxonomy are visually dissimilar, bringing the embedding vectors together requires more distortion of the base model during fine-tuning.
  • This approach might work for in-domain samples, but larger distortions can degrade the general capabilities of the base model.
  • These distortions may lead to instabilities in the model's performance and cause the model to "forget" previously learned information.

To assess the visual similarity of taxonomy examples, you can use the following method:

  1. Search for the same moments using the base model with your original query.
  2. Perform another search using paraphrased queries that describe the visual content.
  3. Compare the results of both searches.

For example, when fine-tuning a model to detect "hurdles," you might perform the following searches:

  • Original search: "hurdles"
  • Paraphrased search: "man jumps over another man"

If both searches yield similar results, the examples of "hurdles" are likely visually similar and well-suited for fine-tuning.
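
One way to make this comparison concrete is to measure how much the two result sets overlap. The sketch below is illustrative: the result format (a video identifier with start and end seconds) and the moment_overlap helper are assumptions rather than part of any particular search API.

```python
def moment_overlap(results_a, results_b, tolerance_sec=2.0):
    """Fraction of moments in results_a that also appear in results_b.

    Each result is assumed to be a (video_id, start_sec, end_sec) tuple;
    two moments match when they come from the same video and their start
    times fall within tolerance_sec of each other.
    """
    matched = 0
    for video_a, start_a, _ in results_a:
        for video_b, start_b, _ in results_b:
            if video_a == video_b and abs(start_a - start_b) <= tolerance_sec:
                matched += 1
                break
    return matched / len(results_a) if results_a else 0.0

# Hypothetical results returned by your search tool for the two queries.
original = [("clip_01.mp4", 12.0, 15.0), ("clip_02.mp4", 40.5, 43.0)]
paraphrased = [("clip_01.mp4", 12.5, 15.5), ("clip_09.mp4", 7.0, 9.0)]

# A high overlap suggests the "hurdles" examples are visually similar.
print(moment_overlap(original, paraphrased))
```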

Tightness of the taxonomy

The quality and specificity of your training dataset significantly impact the performance of the fine-tuned model. When preparing your dataset, focus on two key aspects of taxonomy tightness: spatial and temporal.

  • Spatial tightness: Refers to the precision and specificity of visual content within the training data. To ensure spatial tightness, follow these best practices:

    • Annotate raw videos with precise start and end timestamps for each target taxonomy occurrence.
    • Avoid including extraneous actions or objects in annotated segments.
    • Minimize noise and irrelevant information within the training data.
      For example, a tight annotation of the "sawing" action encompasses the sawing motion itself. Do not include broader scenes where sawing happens in the background or alongside other actions. By focusing on spatial tightness, the model learns to recognize and classify the target action accurately without influence from irrelevant background elements.
  • Temporal tightness: Refers to the accuracy and precision of time-based annotations within the training dataset. To ensure temporal tightness, follow these best practices:

    • Provide tight temporal bounds for labeled actions.
    • Ensure the model associates the correct temporal context with each action.
      For example, a tight annotation of the "spike" action in American football encompasses the spiking motion. Avoid including extended scenes of post-spike celebrations. By maintaining temporal tightness, the model accurately recognizes the target action without erroneously associating it with related but distinct events.

Maintaining both spatial and temporal tightness in your dataset helps create a more accurate and reliable fine-tuned model that recognizes and classifies actions precisely.
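
To illustrate the difference, the sketch below contrasts a tightly bounded annotation with a loose one for the same "spike" moment. The field names and timestamps are hypothetical; use whatever annotation format your pipeline expects.

```python
# Hypothetical annotations for a "spike" action; field names are illustrative.

# Tight temporal bounds: only the spiking motion itself.
tight_annotation = {
    "video": "game_12_redzone.mp4",
    "label": "spike",
    "start_sec": 54.2,   # spiking motion begins
    "end_sec": 55.6,     # spiking motion ends
}

# Loose temporal bounds: the same spike plus several seconds of celebration,
# which can cause the model to associate celebrations with the "spike" label.
loose_annotation = {
    "video": "game_12_redzone.mp4",
    "label": "spike",
    "start_sec": 54.2,
    "end_sec": 63.0,     # includes post-spike celebration
}
```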

Practical examples

This section illustrates critical concepts in fine-tuning: temporal tightness, spatial tightness, and diverse examples. Each concept is essential for effective model training and performance.

Temporal tightness

The videos below illustrate the importance of temporal tightness in training data. They demonstrate how the precise timing of action labeling affects model training. Accurately isolating the specific moment of an action is crucial for the model to learn the correct association between the action and its visual cues.

This video shows an example of tight temporal bounds for a "spike" action in American football. The ground truth segment accurately isolates the specific moment of the spike:

This video illustrates loose temporal bounds for a "spike" action. It includes the spike motion followed by unrelated hand-waving, which can confuse the model if incorrectly labeled:


Spatial tightness

The image below illustrates the importance of spatial tightness in training data. It represents a suboptimal training sample for identifying a sawing motion. The sawing action occurs in the background, making it difficult for the model to associate this video segment with the "sawing" category. For effective training, the target action should be the primary focus of the sample.


Diverse examples

The videos below illustrate the importance of diverse positive samples in training data. Insufficient diversity in training data can lead to poor performance. When training samples lack the complexity of real-world scenarios, the model may fail to generalize effectively.

The video below shows an example of insufficient diversity in training data. In it, a person falls without any bystanders present. Note that this video contains a simulated fall. No actual injuries occurred during filming, and the scene was staged.

The video below demonstrates a real-world application scenario. It shows a person falling with several bystanders nearby: