```bash
python -m torch.distributed.launch --nproc_per_node=3 --master_port=12233 --use_env run_train.py \
    --diff_steps 1000 \
    --lr 0.0001 \
    --learning_steps 3000 \
    --save_interval 600 \
    --seed 102 \
    --noise_schedule sqrt \
    --hidden_dim 128 \
    --bsz 2048 \
    --dataset qqp \
    --data_dir datasets/ForTest \
    --vocab bert \
    --seq_len 128 \
    --schedule_sampler lossaware \
    --notes test-qqp
```
I changed `--learning_steps` and `--diff_steps` from the values in the README. The run then failed with the following error:
File "/home/documents/DiffuSeq-main/diffuseq/gaussian_diffusion.py", line 805, in ddim_sample_loop_progressive indices = list(range(self.num_timesteps))[::-1][::gap] ValueError: slice step cannot be zero ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2903262) of binary: /home/yjhongkr/miniconda3/envs/dec/bin/python ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
So the failure is the 'slice step cannot be zero' error in the slicing shown above. Is there a relation between this error and the parameters I changed?
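For reference, the `ValueError` itself is just Python's slicing rule, independent of DiffuSeq: any slice with a step of zero raises it. A minimal reproduction:

```python
# Python forbids a slice step of zero, regardless of the list contents.
indices = list(range(1000))[::-1]  # reversed timestep indices, as in the traceback
print(indices[::1][:3])            # step 1 works: [999, 998, 997]
print(indices[::0])                # step 0 raises: ValueError: slice step cannot be zero
```

So the slice only fails if `gap` is `0` at the moment of the call.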
In the function signature, `gap` defaults to `1`, like this:
```python
def ddim_sample_loop_progressive(
    self,
    model,
    shape,
    noise=None,
    clip_denoised=True,
    denoised_fn=None,
    model_kwargs=None,
    device=None,
    progress=False,
    eta=0.0,
    langevin_fn=None,
    mask=None,
    x_start=None,
    gap=1,
):
```
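Since the default is `1` but the traceback shows a zero step, the caller must be passing `gap=0` explicitly. My guess, purely an assumption since I haven't traced the call site, is that `gap` is computed from the diffusion step count by integer division, which truncates to zero when `--diff_steps` is smaller than the number of sampling steps. A hypothetical sketch (the names `diff_steps` and `sampling_steps` are mine, not DiffuSeq's actual variables):

```python
# Hypothetical: how integer division could produce gap == 0.
# diff_steps and sampling_steps are assumed names for illustration only.
diff_steps = 1000        # my new --diff_steps value
sampling_steps = 2000    # if the sampler still assumes the original step count
gap = diff_steps // sampling_steps   # 1000 // 2000 == 0
indices = list(range(diff_steps))[::-1]
timesteps = indices[::gap]           # ValueError: slice step cannot be zero
```

If that is what happens, keeping the sampling step count consistent with `--diff_steps` should keep `gap` nonzero, but I haven't verified this against the actual caller.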
I want to know whether the two are really related, because if they are, I'll have to train the model all over again...