
torch-optimizer: a collection of optimizers for PyTorch, compatible with the optim module.
import torch_optimizer as optim
# model = ...
optimizer = optim.DiffGrad(model.parameters(), lr=0.001)
optimizer.step()

Installation is straightforward:

$ pip install torch_optimizer
Documentation: https://pytorch-optimizer.rtfd.io
Please cite the original authors of the optimization algorithms. If you like this package:
@software{Novik_torchoptimizers,
    title   = {{torch-optimizer -- collection of optimization algorithms for PyTorch.}},
    author  = {Novik, Mykola},
    year    = 2020,
    month   = 1,
    version = {1.0.1}
}
Or use the GitHub feature: the "cite this repository" button.
Supported optimizers:

| Optimizer | Paper |
| --------- | ----- |
| A2GradExp | https://arxiv.org/abs/1810.00553 |
| A2GradInc | https://arxiv.org/abs/1810.00553 |
| A2GradUni | https://arxiv.org/abs/1810.00553 |
| AccSGD | https://arxiv.org/abs/1803.05591 |
| AdaBelief | https://arxiv.org/abs/2010.07468 |
| AdaBound | https://arxiv.org/abs/1902.09843 |
| AdaMod | https://arxiv.org/abs/1910.12249 |
| Adafactor | https://arxiv.org/abs/1804.04235 |
| Adahessian | https://arxiv.org/abs/2006.00719 |
| AdamP | https://arxiv.org/abs/2006.08217 |
| AggMo | https://arxiv.org/abs/1804.00325 |
| Apollo | https://arxiv.org/abs/2009.13586 |
| DiffGrad | https://arxiv.org/abs/1909.11015 |
| Lamb | https://arxiv.org/abs/1904.00962 |
| Lookahead | https://arxiv.org/abs/1907.08610 |
| MADGRAD | https://arxiv.org/abs/2101.11075 |
| NovoGrad | https://arxiv.org/abs/1905.11286 |
| PID | https://www4.comp.polyu.edu.hk/~cslzhang/paper/cvpr18_pid.pdf |
| QHAdam | https://arxiv.org/abs/1810.06801 |
| QHM | https://arxiv.org/abs/1810.06801 |
| RAdam | https://arxiv.org/abs/1908.03265 |
| Ranger | https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d |
| RangerQH | https://arxiv.org/abs/1810.06801 |
| RangerVA | https://arxiv.org/abs/1908.00700v2 |
| SGDP | https://arxiv.org/abs/2006.08217 |
| SGDW | https://arxiv.org/abs/1608.03983 |
| SWATS | https://arxiv.org/abs/1712.07628 |
| Shampoo | https://arxiv.org/abs/1802.09568 |
| Yogi | https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization |
Visualizations help us see how different algorithms deal with simple situations such as saddle points, local minima, valleys, and so on, and may provide interesting insights into the inner workings of an algorithm. The Rosenbrock and Rastrigin benchmark functions were chosen because:
Rastrigin is a non-convex function with one global minimum at (0.0, 0.0). Finding its minimum is a fairly difficult problem due to the large search space and the large number of local minima. Rosenbrock (the banana function) is also non-convex, with a single global minimum at (1.0, 1.0) that lies inside a long, narrow, parabolic valley: reaching the valley is easy, but converging to the minimum is not.
Each optimizer performs 501 optimization steps. The learning rate is the best one found by a hyperparameter search algorithm; all other tuning parameters are left at their defaults. It is very easy to extend the script and tune other optimizer parameters.
python examples/viz_optimizers.py
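For orientation, here is a minimal sketch of how such a trajectory plot can be produced. It is not the repository's viz_optimizers.py script, and the starting point and learning rate below are arbitrary rather than values found by a hyperparameter search:

import numpy as np
import torch
import torch_optimizer as optim
import matplotlib.pyplot as plt

def rastrigin(xy):
    # 2-D Rastrigin function; global minimum at (0.0, 0.0)
    x, y = xy
    a = 10.0
    return 2 * a + (x ** 2 - a * torch.cos(2 * np.pi * x)) + (y ** 2 - a * torch.cos(2 * np.pi * y))

xy = torch.tensor([-2.0, 3.5], requires_grad=True)   # arbitrary starting point
optimizer = optim.DiffGrad([xy], lr=0.05)            # ad-hoc learning rate, not tuned
path = []
for _ in range(500):
    optimizer.zero_grad()
    loss = rastrigin(xy)
    loss.backward()
    optimizer.step()
    path.append(xy.detach().clone().numpy())

path = np.array(path)
plt.plot(path[:, 0], path[:, 1], marker='.', markersize=2)
plt.title('DiffGrad on Rastrigin')
plt.show()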
Do not pick an optimizer based on these visualizations. Optimization approaches have unique properties and may be tailored to different purposes, or may require explicit learning-rate schedules, and so on. The best way to find out is to try one on your particular problem and see whether it improves the score.
If you do not know which optimizer to use, start with the built-in SGD/Adam. Once the training logic is ready and a baseline score is established, swap the optimizer and see if there is any improvement, as in the sketch below.
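A minimal sketch of that swap, with a placeholder model and random data standing in for a real training setup:

import torch
import torch_optimizer as optim

model = torch.nn.Linear(10, 1)                        # placeholder model
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)        # placeholder batch

# baseline: built-in Adam
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# once the baseline score is in, swap in an optimizer from this package:
optimizer = optim.Yogi(model.parameters(), lr=1e-2)

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()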
import torch_optimizer as optim
# model = ...
optimizer = optim.A2GradExp(
    model.parameters(),
    kappa=1000.0,
    beta=10.0,
    lips=10.0,
    rho=0.5,
)
optimizer.step()

Paper: Optimal Adaptive and Accelerated Stochastic Gradient Descent (2018) [https://arxiv.org/abs/1810.00553]
Reference Code: https://github.com/severilov/a2grad_optimizer
import torch_optimizer as optim
# model = ...
optimizer = optim.A2GradInc(
    model.parameters(),
    kappa=1000.0,
    beta=10.0,
    lips=10.0,
)
optimizer.step()

Paper: Optimal Adaptive and Accelerated Stochastic Gradient Descent (2018) [https://arxiv.org/abs/1810.00553]
Reference Code: https://github.com/severilov/a2grad_optimizer
import torch_optimizer as optim
# model = ...
optimizer = optim.A2GradUni(
    model.parameters(),
    kappa=1000.0,
    beta=10.0,
    lips=10.0,
)
optimizer.step()

Paper: Optimal Adaptive and Accelerated Stochastic Gradient Descent (2018) [https://arxiv.org/abs/1810.00553]
Reference Code: https://github.com/severilov/a2grad_optimizer
import torch_optimizer as optim
# model = ...
optimizer = optim.AccSGD(
    model.parameters(),
    lr=1e-3,
    kappa=1000.0,
    xi=10.0,
    small_const=0.7,
    weight_decay=0
)
optimizer.step()

Paper: On the insufficiency of existing momentum schemes for Stochastic Optimization (2019) [https://arxiv.org/abs/1803.05591]
Reference Code: https://github.com/rahulkidambi/accsgd
import torch_optimizer as optim
# model = ...
optimizer = optim.AdaBelief(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-3,
    weight_decay=0,
    amsgrad=False,
    weight_decouple=False,
    fixed_decay=False,
    rectify=False,
)
optimizer.step()

Paper: AdaBelief Optimizer, adapting stepsizes by the belief in observed gradients (2020) [https://arxiv.org/abs/2010.07468]
Reference Code: https://github.com/juntang-zhuang/adabelief-optimizer
import torch_optimizer as optim
# model = ...
optimizer = optim.AdaBound(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    final_lr=0.1,
    gamma=1e-3,
    eps=1e-8,
    weight_decay=0,
    amsbound=False,
)
optimizer.step()

Paper: Adaptive Gradient Methods with Dynamic Bound of Learning Rate (2019) [https://arxiv.org/abs/1902.09843]
Reference Code: https://github.com/luolc/adabound
The AdaMod method restricts the adaptive learning rates with adaptive and momental upper bounds. The dynamic learning-rate bounds are based on exponential moving averages of the adaptive learning rates themselves, which smooths out unexpectedly large learning rates and stabilizes the training of deep neural networks.
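A rough sketch of that bounding step, for illustration only (this is not the library's internal code, and the function name is made up):

import torch

def adamod_bound(step_size: torch.Tensor, ema_step_size: torch.Tensor, beta3: float = 0.999) -> torch.Tensor:
    # Smooth the current per-parameter step sizes with an exponential moving
    # average (updated in place, controlled by beta3) and clip them from above,
    # so an unexpectedly large adaptive learning rate cannot destabilize training.
    ema_step_size.mul_(beta3).add_(step_size, alpha=1 - beta3)
    return torch.min(step_size, ema_step_size)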
import torch_optimizer as optim
# model = ...
optimizer = optim.AdaMod(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    beta3=0.999,
    eps=1e-8,
    weight_decay=0,
)
optimizer.step()

Paper: An Adaptive and Momental Bound Method for Stochastic Learning (2019) [https://arxiv.org/abs/1910.12249]
Reference Code: https://github.com/lancopku/adamod
import torch_optimizer as optim
# model = ...
optimizer = optim.Adafactor(
    model.parameters(),
    lr=1e-3,
    eps2=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    scale_parameter=True,
    relative_step=True,
    warmup_init=False,
)
optimizer.step()

Paper: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost (2018) [https://arxiv.org/abs/1804.04235]
Reference Code: https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py
import torch_optimizer as optim
# model = ...
optimizer = optim.Adahessian(
    model.parameters(),
    lr=1.0,
    betas=(0.9, 0.999),
    eps=1e-4,
    weight_decay=0.0,
    hessian_power=1.0,
)
loss_fn(model(input), target).backward(create_graph=True)  # create_graph=True is necessary for Hessian calculation
optimizer.step()

Paper: ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning (2020) [https://arxiv.org/abs/2006.00719]
Reference Code: https://github.com/amirgholami/adahessian
AdamP proposes a simple and effective solution: at each iteration of the Adam optimizer applied to scale-invariant weights (e.g., Conv weights preceding a BN layer), AdamP removes the radial component (i.e., the component parallel to the weight vector) from the update vector. Intuitively, this prevents unnecessary updates along the radial direction that only increase the weight norm without contributing to loss minimization.
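An illustrative sketch of that projection (not the library's implementation; the function name and the small epsilon are made up for the example):

import torch

def remove_radial_component(weight: torch.Tensor, update: torch.Tensor) -> torch.Tensor:
    # Project the update onto the plane orthogonal to the weight vector,
    # discarding the radial (norm-increasing) part described above.
    w = weight.reshape(-1)
    u = update.reshape(-1)
    radial = torch.dot(w, u) / (torch.dot(w, w) + 1e-12) * w
    return (u - radial).reshape(update.shape)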
import torch_optimizer as optim
# model = ...
optimizer = optim.AdamP(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
    delta=0.1,
    wd_ratio=0.1
)
optimizer.step()

Paper: Slowing Down the Weight Norm Increase in Momentum-based Optimizers (2020) [https://arxiv.org/abs/2006.08217]
Reference Code: https://github.com/clovaai/adamp
import torch_optimizer as optim
# model = ...
optimizer = optim.AggMo(
    model.parameters(),
    lr=1e-3,
    betas=(0.0, 0.9, 0.99),
    weight_decay=0,
)
optimizer.step()

Paper: Aggregated Momentum: Stability Through Passive Damping (2019) [https://arxiv.org/abs/1804.00325]
Reference Code: https://github.com/athemathmo/aggmo
import torch_optimizer as optim
# model = ...
optimizer = optim.Apollo(
    model.parameters(),
    lr=1e-2,
    beta=0.9,
    eps=1e-4,
    warmup=0,
    init_lr=0.01,
    weight_decay=0,
)
optimizer.step()

Paper: Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization (2020) [https://arxiv.org/abs/2009.13586]
Reference Code: https://github.com/xuezhemax/apollo
DiffGrad adjusts the step size for each parameter based on the difference between the current and the immediately preceding gradient: parameters whose gradients change quickly get a larger step size, while parameters whose gradients change slowly get a smaller one.
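A sketch of the friction coefficient this describes, for illustration only (the function name is made up; in the optimizer this coefficient scales the Adam-style first-moment step):

import torch

def diff_grad_friction(grad: torch.Tensor, prev_grad: torch.Tensor) -> torch.Tensor:
    # Sigmoid of the absolute gradient change: close to 1.0 when the gradient
    # changes quickly (take a near-full step), close to 0.5 when it barely
    # changes (dampen the step).
    return torch.sigmoid(torch.abs(prev_grad - grad))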
import torch_optimizer as optim
# model = ...
optimizer = optim.DiffGrad(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
)
optimizer.step()

Paper: diffGrad: An Optimization Method for Convolutional Neural Networks (2019) [https://arxiv.org/abs/1909.11015]
Reference Code: https://github.com/shivram1987/diffgrad
import torch_optimizer as optim
# model = ...
optimizer = optim.Lamb(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
)
optimizer.step()

Paper: Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (2019) [https://arxiv.org/abs/1904.00962]
Reference Code: https://github.com/cybertronai/pytorch-lamb
import torch_optimizer as optim
# model = ...
# base optimizer, any other optimizer can be used like Adam or DiffGrad
yogi = optim.Yogi(
    model.parameters(),
    lr=1e-2,
    betas=(0.9, 0.999),
    eps=1e-3,
    initial_accumulator=1e-6,
    weight_decay=0,
)
optimizer = optim.Lookahead(yogi, k=5, alpha=0.5)
optimizer.step()

Paper: Lookahead Optimizer: k steps forward, 1 step back (2019) [https://arxiv.org/abs/1907.08610]
Reference Code: https://github.com/alphadl/lookahead.pytorch
import torch_optimizer as optim
# model = ...
optimizer = optim.MADGRAD(
    model.parameters(),
    lr=1e-2,
    momentum=0.9,
    weight_decay=0,
    eps=1e-6,
)
optimizer.step()

Paper: Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization (2021) [https://arxiv.org/abs/2101.11075]
Reference Code: https://github.com/facebookresearch/madgrad
import torch_optimizer as optim
# model = ...
optimizer = optim.NovoGrad(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
    grad_averaging=False,
    amsgrad=False,
)
optimizer.step()

Paper: Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks (2019) [https://arxiv.org/abs/1905.11286]
Reference Code: https://github.com/NVIDIA/DeepLearningExamples
import torch_optimizer as optim
# model = ...
optimizer = optim.PID(
    model.parameters(),
    lr=1e-3,
    momentum=0,
    dampening=0,
    weight_decay=1e-2,
    integral=5.0,
    derivative=10.0,
)
optimizer.step()

Paper: A PID Controller Approach for Stochastic Optimization of Deep Networks (2018) [https://www4.comp.polyu.edu.hk/~cslzhang/paper/cvpr18_pid.pdf]
Reference Code: https://github.com/tensorboy/pidoptimizer
import torch_optimizer as optim
# model = ...
optimizer = optim.QHAdam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    nus=(1.0, 1.0),
    weight_decay=0,
    decouple_weight_decay=False,
    eps=1e-8,
)
optimizer.step()

Paper: Quasi-hyperbolic momentum and Adam for deep learning (2019) [https://arxiv.org/abs/1810.06801]
Reference Code: https://github.com/facebookresearch/qhoptim
import torch_optimizer as optim
# model = ...
optimizer = optim.QHM(
    model.parameters(),
    lr=1e-3,
    momentum=0,
    nu=0.7,
    weight_decay=1e-2,
    weight_decay_type='grad',
)
optimizer.step()

Paper: Quasi-hyperbolic momentum and Adam for deep learning (2019) [https://arxiv.org/abs/1810.06801]
Reference Code: https://github.com/facebookresearch/qhoptim
Deprecated, please use the version provided by PyTorch.
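For reference, a minimal sketch of switching to the built-in implementation (assumes PyTorch 1.10 or newer, where torch.optim.RAdam is available):

import torch

# model = ...
optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0)
optimizer.step()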
import torch_optimizer as optim
# model = ...
optimizer = optim.RAdam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
)
optimizer.step()

Paper: On the Variance of the Adaptive Learning Rate and Beyond (2019) [https://arxiv.org/abs/1908.03265]
Reference Code: https://github.com/liyuanlucasliu/radam
import torch_optimizer as optim
# model = ...
optimizer = optim.Ranger(
    model.parameters(),
    lr=1e-3,
    alpha=0.5,
    k=6,
    N_sma_threshhold=5,
    betas=(0.95, 0.999),
    eps=1e-5,
    weight_decay=0
)
optimizer.step()

Paper: New Deep Learning Optimizer, Ranger: Synergistic combination of RAdam + LookAhead (2019)
Reference Code: https://github.com/lessw2020/ranger-deep-learning-optimizer
import torch_optimizer as optim
# model = ...
optimizer = optim.RangerQH(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    nus=(0.7, 1.0),
    weight_decay=0.0,
    k=6,
    alpha=0.5,
    decouple_weight_decay=False,
    eps=1e-8,
)
optimizer.step()

Paper: Quasi-hyperbolic momentum and Adam for deep learning (2018) [https://arxiv.org/abs/1810.06801]
Reference Code: https://github.com/lessw2020/ranger-deep-learning-optimizer
import torch_optimizer as optim
# model = ...
optimizer = optim.RangerVA(
    model.parameters(),
    lr=1e-3,
    alpha=0.5,
    k=6,
    n_sma_threshhold=5,
    betas=(0.95, 0.999),
    eps=1e-5,
    weight_decay=0,
    amsgrad=True,
    transformer='softplus',
    smooth=50,
    grad_transformer='square'
)
optimizer.step()

Paper: Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM (2019) [https://arxiv.org/abs/1908.00700v2]
Reference Code: https://github.com/lessw2020/ranger-deep-learning-optimizer
import torch_optimizer as optim
# model = ...
optimizer = optim.SGDP(
    model.parameters(),
    lr=1e-3,
    momentum=0,
    dampening=0,
    weight_decay=1e-2,
    nesterov=False,
    delta=0.1,
    wd_ratio=0.1
)
optimizer.step()

Paper: Slowing Down the Weight Norm Increase in Momentum-based Optimizers (2020) [https://arxiv.org/abs/2006.08217]
Reference Code: https://github.com/clovaai/adamp
import torch_optimizer as optim
# model = ...
optimizer = optim.SGDW(
    model.parameters(),
    lr=1e-3,
    momentum=0,
    dampening=0,
    weight_decay=1e-2,
    nesterov=False,
)
optimizer.step()

Paper: SGDR: Stochastic Gradient Descent with Warm Restarts (2017) [https://arxiv.org/abs/1608.03983]
Reference Code: pytorch/pytorch#22466
import torch_optimizer as optim
# model = ...
optimizer = optim.SWATS(
    model.parameters(),
    lr=1e-1,
    betas=(0.9, 0.999),
    eps=1e-3,
    weight_decay=0.0,
    amsgrad=False,
    nesterov=False,
)
optimizer.step()

Paper: Improving Generalization Performance by Switching from Adam to SGD (2017) [https://arxiv.org/abs/1712.07628]
Reference Code: https://github.com/mrpatekful/swats
import torch_optimizer as optim
# model = ...
optimizer = optim.Shampoo(
    model.parameters(),
    lr=1e-1,
    momentum=0.0,
    weight_decay=0.0,
    epsilon=1e-4,
    update_freq=1,
)
optimizer.step()

Paper: Shampoo: Preconditioned Stochastic Tensor Optimization (2018) [https://arxiv.org/abs/1802.09568]
Reference Code: https://github.com/moskomule/shampoo.pytorch
Yogi is an optimization algorithm based on Adam with more fine-grained control of the effective learning rate, and it has similar theoretical guarantees on convergence as Adam.
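The key difference from Adam is the second-moment update; a sketch for illustration only (the function name is made up, and the full optimizer also keeps Adam-style first moments and bias correction):

import torch

def yogi_second_moment(v: torch.Tensor, grad: torch.Tensor, beta2: float = 0.999) -> torch.Tensor:
    # Unlike Adam's multiplicative EMA, the magnitude of the change in v depends
    # only on (1 - beta2) * grad^2 and its direction on sign(v - grad^2), which
    # gives the finer-grained control over the effective learning rate.
    g2 = grad * grad
    return v - (1 - beta2) * torch.sign(v - g2) * g2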
import torch_optimizer as optim
# model = ...
optimizer = optim.Yogi(
    model.parameters(),
    lr=1e-2,
    betas=(0.9, 0.999),
    eps=1e-3,
    initial_accumulator=1e-6,
    weight_decay=0,
)
optimizer.step()

Paper: Adaptive Methods for Nonconvex Optimization (2018) [https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization]
Reference Code: https://github.com/4rtemi5/yogi-optimizer_keras