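Introduction

When I deployed an application to an environment built with AWS Elastic Beanstalk, the environment update timed out.

In short, replacing the EC2 instance that hosts the application server resolved the issue.

This post records what I investigated and what I did to resolve it.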

Symptoms

Deploying an application to the Elastic Beanstalk environment caused the environment update to time out. The following events were logged:

INFO    Environment update is starting.
INFO    Deploying new version to instance(s).
INFO    Environment health has transitioned from Ok to Info. Application update in progress on 1 instance. 0 out of 1 instance completed (running for 30 seconds).
ERROR   During an aborted deployment, some instances may have deployed the new application version. To ensure all instances are running the same version, re-deploy the appropriate application version.
ERROR   Failed to deploy application.
ERROR   Unsuccessful command execution on instance id(s) '<instance ID>'. Aborting the operation.
INFO    Command execution completed on all instances. Summary: [Successful: 0, TimedOut: 1].
WARN    The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own): [<instance ID>].
WARN    Environment health has transitioned from Info to Warning. Application update is aborting (running for 14 minutes).

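To see how operations other than a deployment behaved, I restarted the application server. That environment update timed out as well. Based on the following observations, I assumed that something was most likely wrong on the EC2 instance hosting the application server.

- Restarting the application server in the affected Beanstalk environment normally completes within a few tens of seconds.
- The environment update log looks much the same as the log from the failed application deployment.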

INFO    restartAppServer is starting.
INFO    Environment health has transitioned from Ok to Info. Application restart in progress (running for 4 seconds).
ERROR   Unsuccessful command execution on instance id(s) '<instance ID>'. Aborting the operation.
INFO    Command execution completed on all instances. Summary: [Successful: 0, TimedOut: 1].
WARN    The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own): [<instance ID>].
INFO    Environment health has transitioned from Info to Ok. Application restart completed 57 seconds ago and took 14 minutes.
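
Both event logs above share the same telltale line: a command summary with `TimedOut: 1`. As a minimal sketch, assuming the events have been saved as plain text lines in the format shown above, the summary counters can be pulled out like this. The function names `parse_summary` and `timed_out_operations` are hypothetical helpers for illustration, not part of any AWS tool, and only the `Successful` and `TimedOut` counters seen in these logs are assumed to exist.

```python
import re

# Matches the summary that appears in the event logs above, for example:
# "Command execution completed on all instances. Summary: [Successful: 0, TimedOut: 1]."
SUMMARY_RE = re.compile(r"Summary: \[(?P<body>[^\]]*)\]")


def parse_summary(line):
    """Return the counters of a summary line as a dict, or None if the line has none."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    counters = {}
    for part in match.group("body").split(","):
        name, _, value = part.partition(":")
        counters[name.strip()] = int(value.strip())
    return counters


def timed_out_operations(lines):
    """Collect the summaries in which at least one instance timed out."""
    summaries = (parse_summary(line) for line in lines)
    return [s for s in summaries if s and s.get("TimedOut", 0) > 0]
```

Passing the lines of either log above to `timed_out_operations` returns one entry per operation that timed out, which makes it easy to see that the deployment and the restart failed in the same way.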

Next, I checked the logs on the EC2 instance hosting the application server.

I logged in to the instance over SSH and went through the various logs. /var/log/messages contained entries suggesting that cfn-hup.service was repeatedly stopping and starting. cfn-hup.service appears to be the daemon that runs user-specified actions when a Beanstalk environment is updated. The deployment therefore most likely times out because cfn-hup.service keeps stopping and starting, so the requested actions are never executed.

The cfn-hup helper is a daemon that detects changes in resource metadata and runs user-specified actions when a change is detected. This allows you to make configuration updates on your running Amazon EC2 instances through the UpdateStack API action.

Entries in /var/log/messages showing cfn-hup.service repeatedly stopping and starting:

systemd: cfn-hup.service: main process exited, code=exited, status=1/FAILURE
systemd: Unit cfn-hup.service entered failed state.
systemd: cfn-hup.service failed.
systemd: cfn-hup.service holdoff time over, scheduling restart.
systemd: Stopped This is cfn-hup daemon.
systemd: Starting This is cfn-hup daemon...
systemd: Started This is cfn-hup daemon.
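
The pattern in the log above (the main process exits, then systemd schedules a restart) can be counted mechanically. The following is a rough sketch that assumes the systemd messages look exactly like the excerpt above. The helper names and the threshold of 3 are my own assumptions for illustration, not a rule from AWS or systemd.

```python
def count_cfn_hup_failures(lines, unit="cfn-hup.service"):
    """Count process exits and scheduled restarts for the unit in syslog lines."""
    failures = 0
    restarts = 0
    for line in lines:
        if unit not in line:
            continue  # ignore lines that do not name the unit
        if "main process exited" in line:
            failures += 1
        elif "scheduling restart" in line:
            restarts += 1
    return failures, restarts


def looks_like_restart_loop(lines, threshold=3):
    """Guess whether the unit is stuck in a stop/start loop.

    The threshold is an arbitrary assumption: a single failure followed by
    one restart can be normal recovery, while repeated cycles are suspicious.
    """
    failures, restarts = count_cfn_hup_failures(lines)
    return failures >= threshold and restarts >= threshold
```

To use it on an instance, open the syslog file and pass the file object (an iterable of lines) to `looks_like_restart_loop`.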

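What I did to resolve the issue

The user-specified operations most likely time out because cfn-hup.service is malfunctioning. Restarting cfn-hup.service therefore looks like a plausible fix, but since the service is already stuck in a stop/start loop, a restart may not resolve the issue. *1

This time, I resolved the issue by replacing the EC2 instance hosting the application server. *2

The affected application server is used only within my department, so some downtime would have been perfectly acceptable. Still, to keep downtime to a minimum, I replaced the EC2 instance as follows.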

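1. Increase the number of EC2 instances hosting the application server from one to two.
  - A new EC2 instance launches.
  - Confirm that cfn-hup.service is running normally on the newly launched instance.
2. Stop the original EC2 instance.
  - A new EC2 instance launches.
  - Confirm that cfn-hup.service is running normally on the newly launched instance.
  - At this point, cfn-hup.service is running normally on both running EC2 instances.
3. Reduce the number of EC2 instances hosting the application server from two to one.
  - cfn-hup.service is running normally on the one remaining EC2 instance.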
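
The post does not say how the instance count was changed, so the following is only one possible way to script the scaling steps, not the method actually used. It assumes a load-balanced environment whose capacity is controlled through the `aws:autoscaling:asg` option namespace, and it assumes boto3 is installed with valid AWS credentials. `capacity_option_settings` and `set_instance_count` are hypothetical helper names.

```python
def capacity_option_settings(instance_count):
    """Build OptionSettings that pin the Auto Scaling group to a fixed size."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    value = str(instance_count)  # option values are passed as strings
    return [
        {"Namespace": "aws:autoscaling:asg", "OptionName": "MinSize", "Value": value},
        {"Namespace": "aws:autoscaling:asg", "OptionName": "MaxSize", "Value": value},
    ]


def set_instance_count(environment_name, instance_count):
    """Apply the capacity change to the environment (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("elasticbeanstalk")
    return client.update_environment(
        EnvironmentName=environment_name,
        OptionSettings=capacity_option_settings(instance_count),
    )
```

Under these assumptions, the first step corresponds to `set_instance_count(name, 2)` and the last step to `set_instance_count(name, 1)`, where `name` is the environment name. Stopping the original instance and checking cfn-hup.service on each new instance still happen in between, and which instance is removed when scaling back in depends on the Auto Scaling group's termination policy.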

That's all.

References

*1: I actually tried restarting cfn-hup.service, but it did not resolve the issue.

*2: In hindsight, I still think I should have tried restarting cfn-hup.service first.
