Tesla Dojo tiles. Credit: Tesla
While Tesla’s full self-driving features have yet to overcome some technical hurdles and regulatory restrictions, its AI team demonstrated impressive work at Tesla’s AI Day 2022.
The annual event on October 1 felt more like a computer science lecture. In addition to founder Elon Musk and the humanoid robot, Optimus, more than 20 engineering team leads at Tesla took the stage to share their progress over the past year. The whole event lasted about three hours.
According to Tesla, more than 160,000 customers are using its FSD Beta software. The number was 2,000 last year. During the past year, the FSD team has trained more than 75,000 AI models in total and shipped 281 models that actually improved self-driving performance.
Methods for training the neural network system include auto-labeling, simulation, and a data engine, and the process is apparently one of trial and error, according to Ashok Elluswamy, the director of autopilot software at Tesla.
The training process required Tesla to expand its training infrastructure by 40-50% within the year to about 14,000 GPUs across multiple training clusters in the US. The neural network now executes across two independent systems-on-chip (SoCs) within the same self-driving computer with tightly controlled end-to-end latency.
The system runs not only in the Tesla cars but also in the Tesla robot Optimus.
Notably, the team has turned to language modeling to improve computer vision. Elluswamy noted that language modeling will be foundational to computer vision in the future. Computer vision is what drives the cars autonomously and predicts optimized paths for robots to reach their destinations, for example, at home or at a factory.
The system’s neural network constructs a 3D vector space of physical objects (occupancy, in Tesla’s terms) and detects lanes and road structures by encoding them as words and tokens.
John Emmons, who leads Tesla’s autopilot vision team, said that in its early days autopilot detected lanes by modeling the task as image-space instance segmentation. That approach handled highly structured roads such as highways efficiently but broke down completely on complex maneuvers, such as turning at an intersection, and in other places with more complex road topology.
To detect lane connectivity, the team developed a lane language that lets the system predict routes for the vehicle itself and for other vehicles. “By modeling (lane detecting) as a language with words and tokens, we can capitalize on recent autoregressive architectures,” said Emmons.
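As a rough illustration of what "autoregressive" means here, the toy sketch below emits a lane description one token at a time, with each token chosen based on the output so far. The vocabulary, the score table, and the function name are all hypothetical and invented for this example. Tesla did not publish its token set, and a real network would condition on camera features and the full token history rather than on a small lookup table.

```python
# Hypothetical lane-language tokens; not Tesla's actual vocabulary.
START, END = "<start>", "<end>"

# Toy stand-in for a trained network: scores for the next token given the
# previous one. A real model would condition on image features and on the
# whole sequence generated so far, not just the last token.
NEXT_TOKEN_SCORES = {
    START: {"lane_start": 0.9, END: 0.1},
    "lane_start": {"continue": 0.7, "fork": 0.3},
    "continue": {"fork": 0.6, END: 0.4},
    "fork": {"merge": 0.8, END: 0.2},
    "merge": {END: 1.0},
}


def decode_lane_sentence(max_len=10):
    """Greedy autoregressive decoding: pick the best next token, append, repeat."""
    tokens = [START]
    while len(tokens) < max_len:
        scores = NEXT_TOKEN_SCORES[tokens[-1]]
        next_token = max(scores, key=scores.get)
        if next_token == END:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


lane_sentence = decode_lane_sentence()
print(lane_sentence)
```

The point of the sketch is only the loop structure: the output is a sequence rather than a per-pixel mask, so connectivity such as forks and merges can be expressed directly in the tokens.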
The problem with segmenting the lanes was that, for one, sometimes the input image of the road was not clear enough under various weather conditions.
By predicting a set of short-time-horizon future trajectories for all objects, the system can anticipate and avoid dangerous situations. That is how the semantics really worked for lane detection.
Building a supercomputer for AI training
The engineering leads could not stress enough how important training these models was to improving performance, and such a huge amount of training requires more computational power and higher efficiency. The engineering team moved away from DRAM in favor of SRAM, which offers high bandwidth and low latency despite its modest capacity and helps achieve high utilization of the arithmetic units.
The team noted another move that is unusual among most machines today: the decision to use model parallelism as the training methodology.
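For readers unfamiliar with the term, model parallelism means splitting one model's parameters across several devices, instead of giving every device a full copy and splitting the data. The sketch below is a minimal, assumption-laden illustration of the layer-wise variant in plain Python. The `Device` class, the two tiny weight matrices, and the function names are hypothetical and say nothing about how Dojo actually partitions its workloads.

```python
def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


class Device:
    """Hypothetical stand-in for one accelerator holding one slice of the model."""

    def __init__(self, name, weights):
        self.name = name
        self.weights = weights  # only this device's layer lives here

    def forward(self, activations):
        return matvec(self.weights, activations)


# A two-layer linear model, with each layer placed on a different device.
layer1 = [[1, 0], [0, 2]]
layer2 = [[1, 1]]
devices = [Device("dev0", layer1), Device("dev1", layer2)]


def model_parallel_forward(devices, x):
    """Run the input through each device in turn, passing activations along."""
    for device in devices:
        x = device.forward(x)
    return x


output = model_parallel_forward(devices, [3, 4])
print(output)
```

No single device ever holds the whole model, which is why the approach depends on fast links between devices: activations must move from one to the next at every step.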
“All the decisions were made centering around the ‘No Limits’ philosophy,” said Ganesh Venkataramanan, the senior director of autopilot hardware at Tesla. “And that is why we integrated vertically our data center to extract new level of efficiency, optimize power delivery, cooling, as well as system management.”
So the Dojo environment was integrated into the autopilot software very early on to figure out the limits of the scale for the software workloads.
To support the unprecedented power and cooling density, the team built a fully custom-designed coolant distribution unit (CDU) at only a fraction of the cost of buying one off the shelf and modifying it. Early this year, the team started load testing power and cooling. According to Bill Chang, principal system engineer for autopilot at Tesla, they were able to push over two MW before tripping the city’s local substation.
With current compiler performance, a single Dojo tile can replace the machine learning compute of six GPU boxes.
The first large deployment of the supercomputer will target auto-labeling, which requires high arithmetic intensity and currently occupies 4,000 GPUs across 72 GPU racks. Four Dojo cabinets are expected to provide the same throughput.
The first Exapod, comprising 10 Dojo cabinets and delivering 2.5 times the current auto-labeling capacity, will be deployed by the first quarter of 2023 in Palo Alto. Furthermore, Tesla plans to deploy seven Exapods in Palo Alto in the future.
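The capacity figures above follow from simple proportions, which the short calculation below spells out. It uses only the numbers quoted in this article and assumes throughput scales linearly with the number of cabinets. That linearity, and the variable names, are assumptions of this sketch rather than something Tesla stated.

```python
# Figures quoted in the article.
GPUS_FOR_AUTO_LABELING = 4000
GPU_RACKS = 72
CABINETS_MATCHING_GPUS = 4   # Dojo cabinets expected to match the GPU throughput
CABINETS_PER_EXAPOD = 10
PLANNED_EXAPODS = 7

# Assumption: throughput scales linearly with cabinet count.
gpus_per_cabinet = GPUS_FOR_AUTO_LABELING / CABINETS_MATCHING_GPUS
racks_per_cabinet = GPU_RACKS / CABINETS_MATCHING_GPUS
exapod_capacity_multiple = CABINETS_PER_EXAPOD / CABINETS_MATCHING_GPUS
seven_exapod_multiple = PLANNED_EXAPODS * exapod_capacity_multiple

print(f"GPU-equivalents per cabinet: {gpus_per_cabinet:.0f}")
print(f"GPU racks replaced per cabinet: {racks_per_cabinet:.0f}")
print(f"One Exapod vs. current auto-labeling capacity: {exapod_capacity_multiple}x")
print(f"Seven Exapods vs. current auto-labeling capacity: {seven_exapod_multiple}x")
```

Under that assumption, one cabinet stands in for roughly 1,000 GPUs or 18 racks, which is where the 2.5 times figure for a 10-cabinet Exapod comes from.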
Tesla Dojo supercomputer ExaPOD
Tesla builds Dojo supercomputer with in-house chips – DIGITIMES