DLRover
2024-08-09 16:06:53
code chain of dlrover
Outline/Content
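In diagnosis.py, an extra thread continuously analyzes the logs and records the root cause.
In servicer.py, MasterServicer receives the updated resources over gRPC and then dynamically adjusts the resource allocation.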
save_step_checkpoint calls _save_shard; a separate thread is started to store the checkpoint.
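The hand-off to a background thread described above can be sketched as follows. This is a minimal illustration, not DLRover's implementation: the method names save_step_checkpoint and _save_shard mirror the note, while the AsyncShardSaver class, the pickle format, and the step-N.pkl file naming are assumptions made for the example.

```python
import os
import pickle
import tempfile
import threading
from typing import Any, Dict, Optional


class AsyncShardSaver:
    """Hypothetical saver: snapshots state in the caller, writes it on a thread."""

    def __init__(self, ckpt_dir: str) -> None:
        self._ckpt_dir = ckpt_dir
        self._thread: Optional[threading.Thread] = None

    def save_step_checkpoint(self, step: int, state_dict: Dict[str, Any]) -> None:
        # Allow at most one in-flight save so writes never overlap.
        self.wait()
        # Serialize now, so later mutations by the training loop are not saved.
        payload = pickle.dumps(state_dict)
        self._thread = threading.Thread(
            target=self._save_shard, args=(step, payload), daemon=True
        )
        self._thread.start()

    def _save_shard(self, step: int, payload: bytes) -> None:
        os.makedirs(self._ckpt_dir, exist_ok=True)
        final_path = os.path.join(self._ckpt_dir, f"step-{step}.pkl")
        # Write to a temp file first, then rename, so readers never see
        # a partially written checkpoint.
        fd, tmp_path = tempfile.mkstemp(dir=self._ckpt_dir)
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp_path, final_path)

    def wait(self) -> None:
        """Block until the pending save, if any, has finished."""
        if self._thread is not None:
            self._thread.join()
            self._thread = None
```

The training loop only pays for the in-memory snapshot; the slow disk write happens off the critical path, and wait() gives a point at which the file is known to be complete.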
ResourceMonitor monitors and reports resource usage, including CPU, memory, and GPU.
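A sample-then-report loop of this kind might look like the sketch below. ResourceUsage, ResourceReporter, and the injected sampler and report callables are hypothetical names for illustration; a real sampler would query the OS and the GPU driver, and a real report function would send the values to the master.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ResourceUsage:
    cpu_percent: float
    memory_mb: float
    gpu_percent: Optional[float] = None  # None when no GPU is visible


class ResourceReporter:
    """Hypothetical reporter: periodically samples usage and forwards it."""

    def __init__(
        self,
        sampler: Callable[[], ResourceUsage],
        report: Callable[[ResourceUsage], None],
        interval_s: float = 15.0,
    ) -> None:
        self._sampler = sampler
        self._report = report
        self._interval_s = interval_s

    def report_once(self) -> ResourceUsage:
        usage = self._sampler()
        self._report(usage)
        return usage

    def run(self, stop: threading.Event) -> None:
        # Event.wait acts as a sleep that returns early when stop is set.
        while not stop.is_set():
            self.report_once()
            stop.wait(self._interval_s)
```

Injecting the sampler keeps the loop independent of any particular metrics library, so it can be exercised with fixed values.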
def _diagnose_failures(self):
    logger.info("Start to diagnose failures")
    while True:
        observed_problems = self.diagnostician.observe_training()
        for problem in observed_problems:
            logger.info(f"observed problems: {problem}")
            root_causes = self.diagnostician.diagnose_failure(problem)
            for root_cause in root_causes:
                logger.info(f"identify root cause: {root_cause}")
        time.sleep(180)
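For illustration, the same observe-then-diagnose loop can be run on a background thread that is also stoppable. This is a sketch under assumptions: DiagnosisLoop and its observe and diagnose callables are hypothetical stand-ins for the diagnostician, and replacing time.sleep with an Event wait is a design choice of this example, not something taken from DLRover.

```python
import threading
from typing import Callable, Iterable, List


class DiagnosisLoop:
    """Hypothetical wrapper that runs observe -> diagnose on a daemon thread."""

    def __init__(
        self,
        observe: Callable[[], Iterable[str]],
        diagnose: Callable[[str], Iterable[str]],
        interval_s: float = 180.0,
    ) -> None:
        self._observe = observe
        self._diagnose = diagnose
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._lock = threading.Lock()
        self._root_causes: List[str] = []
        self._thread = threading.Thread(
            target=self._run, name="diagnosis", daemon=True
        )

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def snapshot(self) -> List[str]:
        """Return a copy of the root causes recorded so far."""
        with self._lock:
            return list(self._root_causes)

    def _run(self) -> None:
        while not self._stop.is_set():
            for problem in self._observe():
                causes = list(self._diagnose(problem))
                with self._lock:
                    self._root_causes.extend(causes)
            # Interruptible sleep: returns early once stop() is called.
            self._stop.wait(self._interval_s)
```

Waiting on an Event instead of sleeping lets the thread exit promptly at shutdown rather than after up to one full interval.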
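Resource monitoring
In worker.py, WorkerManager adjusts the resources used by the workers.
DistributedJobManager monitors the training speed of each training node and dynamically adjusts resources such as the PS nodes, the workers, CPU, and memory.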
class FsdpDcpSaver(CommonDirCheckpointSaver):
    """The saver saves the distributed checkpoint of FSDP into the storage."""
Failure diagnosis
Fast checkpoint saving and recovery
gRPC
class DistributedJobMaster(JobMaster):
    """..."""
Resource monitoring and dynamic adjustment
In ckpt_saver.py, the state dict can be persisted to disk.
In servicer.py, MasterServicer receives the messages over gRPC, and DiagnosisManager then stores them in self.diagnosis_data.
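One way to picture the storage step is a per-type buffer that the servicer appends to as messages arrive. The sketch below is an assumption for illustration only: DiagnosisDataStore, its collect and get methods, and the bounded history are hypothetical, and the gRPC layer is left out entirely.

```python
from collections import defaultdict, deque
from typing import Any, Deque, Dict, List


class DiagnosisDataStore:
    """Hypothetical in-memory store keyed by the type of diagnosis data."""

    def __init__(self, max_items_per_type: int = 100) -> None:
        # A bounded deque drops the oldest entry once the limit is reached,
        # so a long-running job cannot grow the store without limit.
        self.diagnosis_data: Dict[str, Deque[Any]] = defaultdict(
            lambda: deque(maxlen=max_items_per_type)
        )

    def collect(self, data_type: str, payload: Any) -> None:
        """Called by the servicer for each received message."""
        self.diagnosis_data[data_type].append(payload)

    def get(self, data_type: str) -> List[Any]:
        """Return the stored items for one type, oldest first."""
        return list(self.diagnosis_data.get(data_type, ()))
```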
_save_shard calls persist_to_storage.
In ps.py, ParameterServerManager adjusts the resources used by the parameter servers.
The update_node_group_resource method of the JobResource class dynamically updates node_group_resource; job_auto_scaler.py calls this method.
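The update path described above can be sketched with two small data classes. The method name update_node_group_resource comes from the note; the field names, the parameter list, and the rule that non-positive values leave a field unchanged are assumptions of this example rather than DLRover's actual signature.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NodeGroupResource:
    count: int = 0
    cpu: float = 0.0
    memory_mb: int = 0


@dataclass
class JobResource:
    node_group_resources: Dict[str, NodeGroupResource] = field(default_factory=dict)

    def update_node_group_resource(
        self, node_type: str, count: int, cpu: float, memory_mb: int
    ) -> None:
        """Update one node group; non-positive values mean 'keep as is' (assumed)."""
        group = self.node_group_resources.setdefault(node_type, NodeGroupResource())
        if count > 0:
            group.count = count
        if cpu > 0:
            group.cpu = cpu
        if memory_mb > 0:
            group.memory_mb = memory_mb
```

An auto-scaler could then change only the worker count, for example job.update_node_group_resource("worker", count=6, cpu=0, memory_mb=0), without restating the CPU and memory settings.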
