What is the best way to do data version control for Machine Learning models?

Hi,
I am currently using MATLAB's Git integration to store my source code and my collected data samples. However, my data files (.csv and .h5) keep growing as my experiments increase in complexity, which results in a very large .git folder. I therefore want to move my experiments to proper version control software designed for data. My question is: which data version control software can easily be integrated into a MATLAB workflow?

Accepted Answer

Hassaan on 9 Jan 2024
  1. DVC (Data Version Control):
  • DVC is an open-source tool designed to handle large files, datasets, and machine learning experiments.
  • It works alongside Git, but it manages data separately.
  • DVC can store data in multiple storage backends such as Amazon S3, Google Cloud Storage, Azure Blob Storage, and others.
  • It is command-line based, so it can be called from within MATLAB using the system function.
  2. Git Large File Storage (LFS):
  • Git LFS is an extension of Git specifically designed to handle large files and binaries.
  • It stores large files on a separate server and keeps only small pointer files in the repository, so it integrates smoothly with Git.
  • Using it alongside MATLAB's Git integration can be more straightforward, since it is an extension of Git.
  3. Pachyderm:
  • Pachyderm offers data versioning with a focus on data pipelines for machine learning and data science workflows.
  • It can handle large data sets efficiently and is suitable for complex experiments.
  • The integration with MATLAB might require a bit of setup, as Pachyderm operates with containers and Kubernetes.
  4. Quilt:
  • Quilt is a version control system for data. It manages data like code in packages.
  • It offers versioning and packaging for large datasets.
  • It might require some scripting to integrate effectively with MATLAB workflows.
  5. Apache NiFi:
  • While primarily a data processing tool, Apache NiFi can be used to manage and automate the flow of data between systems.
  • It's more complex and suited for larger, more intricate data workflows.
  • Integration with MATLAB could be complex and might require significant setup.
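
As a rough illustration of the DVC option, here is a minimal sketch of driving DVC from a MATLAB script through system commands. It assumes that git and dvc are installed and on the system PATH and that the current folder is already a Git repository. The data file path, the remote name "storage", the bucket URL, and the helper runCmd are hypothetical placeholders, not part of any MATLAB or DVC API.

```matlab
% Sketch: track a large data file with DVC from inside MATLAB.
% Assumptions: git and dvc are on the PATH, and pwd is a Git repository.
% The file path and the remote name/URL are hypothetical placeholders.

dataFile  = fullfile('data', 'samples.h5');   % hypothetical data file
remoteUrl = 's3://my-bucket/dvc-store';       % hypothetical storage location

% One-time setup per repository: initialise DVC and set a default remote.
runCmd('dvc init');
runCmd(sprintf('dvc remote add -d storage %s', remoteUrl));
runCmd('git add .dvc/config');

% Hand the data file over to DVC. This writes a small pointer file
% (data/samples.h5.dvc) and adds the data file itself to data/.gitignore.
runCmd(sprintf('dvc add "%s"', dataFile));

% Commit only the pointer file and the .gitignore entry to Git.
runCmd(sprintf('git add "%s.dvc" "%s"', dataFile, fullfile('data', '.gitignore')));
runCmd('git commit -m "Track samples.h5 with DVC"');

% Upload the actual data to the remote storage.
runCmd('dvc push');

function runCmd(cmd)
    % Run a shell command and stop with an error if it exits non-zero.
    [status, out] = system(cmd);
    if status ~= 0
        error('runCmd:failed', 'Command failed: %s\n%s', cmd, out);
    end
    fprintf('%s', out);
end
```

On another machine, checking out the Git repository and then calling runCmd('dvc pull') should fetch the matching data version. If you go with Git LFS instead, the equivalent one-time setup would be git lfs install followed by git lfs track "*.h5", after which the existing Git workflow stays the same.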
