Meta’s AI engineers were increasingly frustrated by slow build times and inefficient distribution processes that hindered their productivity. The company has now outlined the solutions its engineers devised to maximise efficiency.
The workflows of Meta’s machine learning engineers consist of iteratively checking out code, writing new algorithms, building models, packaging the output, and testing in Meta’s remote execution environment. As ML models and the codebases behind Meta’s apps grew in complexity, engineers ran into two primary pain points: slow builds and inefficient distribution.
Older codebase revisions are cached less efficiently in Meta’s build infrastructure, frequently forcing extensive rebuilds. The company says the problem is exacerbated by build non-determinism: when the same source code produces different outputs from one build to the next, previously cached build artifacts cannot be reused.
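The caching problem can be illustrated with a minimal sketch (not Meta’s implementation; the function names are illustrative). If a build embeds run-specific data such as a timestamp in its output, the artifact hashes differently on every run, so any cache keyed on artifact content never gets a hit:

```python
import hashlib
import itertools

_run = itertools.count()  # stands in for a timestamp or absolute build path

def build_nondeterministic(source: str) -> bytes:
    """Toy 'build' that embeds run-specific data, as many toolchains do."""
    return f"{source}|built_at={next(_run)}".encode()

def build_deterministic(source: str) -> bytes:
    """The same 'build' with the run-specific data stripped out."""
    return source.encode()

def artifact_key(output: bytes) -> str:
    """Content hash used as the cache key for a build artifact."""
    return hashlib.sha256(output).hexdigest()

source = "def predict(x): return x * 2"

# Non-deterministic builds hash differently every run: cache always misses.
assert artifact_key(build_nondeterministic(source)) != \
       artifact_key(build_nondeterministic(source))

# Deterministic builds hash identically: cached artifacts can be reused.
assert artifact_key(build_deterministic(source)) == \
       artifact_key(build_deterministic(source))
```

This is the intuition behind enforcing consistent outputs: once identical inputs reliably yield identical artifacts, unchanged work can be served from cache instead of rebuilt.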
Distribution was also an issue because Python executables are typically packaged as self-contained XAR files. Even minor code changes require a full rebuild and redistribution of these large executables, an arduous process that leaves engineers waiting before they can test their changes.
Meta’s engineers devised solutions focused on maximising build caching and introducing incrementality into the distribution process.
To address build speeds, the team worked to minimise unnecessary rebuilds in two ways:
- First, by using Meta’s Buck2 build system in tandem with its Remote Execution (RE) environment to eliminate non-determinism through consistent outputs.
- Second, by reducing dependencies and removing unnecessary code to streamline build graphs.
For distribution, engineers created a Content Addressable Filesystem (CAF) that skips redundant uploads and deduplicates files shared across executables. The system also maintains local caches so clients download only content that has changed. Meta says this “incremental” approach drastically reduces distribution times.
The company quantified the impact, writing: “Faster build times and more efficient packaging and distribution of executables have reduced overhead by double-digit percentages.”
But Meta believes there’s room for improvement. Its current focus is developing “LazyCAF”—a system to only fetch the executable content needed for specific scenarios, rather than whole models. Meta also aims to enforce consistent code revisions to further improve caching.
Together, the solutions devised by Meta’s engineers overcome the challenges of AI development at scale.