+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  XPU   XI   CI        PID   Type   Process name                            XPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2118      C                                             21528MiB |
|    0   N/A  N/A      3161      C   /usr/bin/python                           25726MiB |
+---------------------------------------------------------------------------------------+
COMMAND  PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
python  2801  bml  mem    CHR 195,255           943 /dev/xpuctrl
python  2801  bml  mem    CHR   195,6          1724 /dev/xpu6
python  2801  bml   3u    CHR 195,255      0t0  943 /dev/xpuctrl
python  2801  bml   4u    CHR   195,6      0t0 1724 /dev/xpu6
......
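The COMMAND column shows the program name. Killing every python process listed there with kill usually releases the XPU memory.
If you are sure that nobody else is using the XPU on this machine, you can also run
lsof -t /dev/xpu* | xargs -r kill -9
to terminate every application that is using an XPU.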
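Because kill -9 gives a process no chance to clean up, a gentler variant of the one-liner above can be sketched in Python: collect the PIDs that lsof -t reports, send SIGTERM first, and fall back to SIGKILL only after a grace period. This is a minimal sketch, not a drop-in tool. The function names, the 5-second grace period, and the injectable kill parameter are assumptions made for illustration; only the /dev/xpu* pattern and the lsof -t call come from the text above.

```python
import glob
import os
import signal
import subprocess
import time


def parse_pids(lsof_output: str) -> list[int]:
    """Turn `lsof -t` output (one PID per line) into a sorted, de-duplicated list."""
    return sorted({int(tok) for tok in lsof_output.split() if tok.isdigit()})


def pids_using(pattern: str = "/dev/xpu*") -> list[int]:
    """Ask lsof which processes hold device files matching the glob pattern."""
    paths = glob.glob(pattern)
    if not paths:
        return []
    result = subprocess.run(
        ["lsof", "-t", *paths], capture_output=True, text=True, check=False
    )
    return parse_pids(result.stdout)


def terminate(pids, grace_seconds: float = 5.0, kill=os.kill) -> None:
    """Send SIGTERM to every PID, wait, then SIGKILL whatever is left.

    `kill` is injectable (hypothetical design choice) so the logic can be
    exercised without touching real processes.
    """
    for pid in pids:
        try:
            kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # process already exited
    time.sleep(grace_seconds)
    for pid in pids:
        try:
            kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # exited during the grace period


# Usage, only when nobody else is using the XPU on this machine:
# terminate(pids_using())
```

As with the shell one-liner, this affects every process that holds an XPU device file, so it is only appropriate on a machine you are not sharing. Processes owned by another user will raise PermissionError unless the script runs with sufficient privileges.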
Dataloader errors
During training you may run into an error at the Dataloader stage, for example
Traceback (most recent call last):
  File "xxx/PaddleYOLO-example/tools/train.py", line 202, in <module>
    main()
  File "xxx/PaddleYOLO-example/tools/train.py", line 198, in main
    run(FLAGS, cfg)
  File "xxx/PaddleYOLO-example/tools/train.py", line 151, in run
    trainer.train(FLAGS.eval)
  File "xxx/PaddleYOLO-example/ppdet/engine/trainer.py", line 496, in train
    for step_id, data in enumerate(self.loader):
  File "/usr/local/lib/python3.10/dist-packages/paddle/io/dataloader/dataloader_iter.py", line 850, in __next__
    self._reader.read_next_list()[0]
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
  [Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /host/Paddle/paddle/phi/core/operators/reader/blocking_queue.h:175)
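The message in this traceback only says that the data reader raised an exception; the underlying error is not part of it. One generic way to surface that error is to read the samples one by one in the main process and record which indices fail. The sketch below assumes a map-style dataset that supports len() and indexing. find_bad_samples and ToyDataset are hypothetical names introduced here, and how to obtain the real dataset object from the training config is not shown.

```python
def find_bad_samples(dataset, limit=None):
    """Read samples one by one in the main process and collect the failures.

    Returns a list of (index, repr(exception)) pairs, so the real error
    behind the "Blocking queue is killed" message becomes visible.
    """
    failures = []
    total = len(dataset) if limit is None else min(limit, len(dataset))
    for index in range(total):
        try:
            dataset[index]
        except Exception as exc:  # keep scanning to find every bad sample
            failures.append((index, repr(exc)))
    return failures


class ToyDataset:
    """Hypothetical stand-in dataset; item 2 simulates a broken annotation."""

    def __init__(self):
        self.items = ["a.jpg", "b.jpg", None, "d.jpg"]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        # Fails with AttributeError on the None entry.
        return self.items[index].upper()


for index, error in find_bad_samples(ToyDataset()):
    print(f"sample {index} failed: {error}")
```

With a real dataset, the reported index and exception usually point at a specific image or annotation file, which is easier to act on than the queue error raised in the training loop.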