LGRW PostWait v/s Polling Algorithm in action.

Well, this post is about “Adaptive log file sync”, this feature was introduced in 11.2.0.3 which primarily dictates how the foreground process will be posted whether or not LGWR has written the committed change victor to the persistent storage, With the adaptive method LGWR has got the choice to choose between Post-Wait v/s Polling method.

Traditionally LGWR used to post the FG process after it has written to disk, however, with the polling method, foreground process will sleep for a while, come back, and check whether commit completed.  

Recently we started facing very high latency on log file sync wait, so the first thing was to verify the “log file parallel write” and found that more than 90% of IO was completed in less than 1 ms, it was worth verifying with storage and storage team did confirm the same. Further LGWR trace file did reveal that it was stuck on “ polling” and did not switch back to Post wait, and because of which extra time was wasted on polling caused the high latency.

Further LGWR trace file revels following.

[oracle@hostname trace]$ grep “Log file sync switching” -B 1 *
FAKE_DB_lgwr_25389.trc-*** 2020-02-06 15:56:49.376
FAKE_DB_lgwr_25389.trc:Log file sync switching to polling
FAKE_DB_lgwr_25389.trc-*** 2020-02-06 21:12:57.705
FAKE_DB_lgwr_25389.trc:Log file sync switching to post/wait
FAKE_DB_lgwr_27858.trc-*** 2020-02-23 06:06:14.698
FAKE_DB_lgwr_27858.trc:Log file sync switching to polling

It was not difficult to understand the problem as I had read the amazing post by “Frits Hoogland” long back, solution was to turn off Adaptive  “alter system set “_use_adaptive_log_file_sync” = false” and immediately after that log file sync wait disappeared.

After I published this post, “Frits Hoogland” was gracious enough to read and comment on twitter, He mentioned that post does not have enough information to conclude, therefore, I decided to add some more information . I was troubleshooting on 11.2.0.3, running on RHEL 5.8. After it was confirmed from the trace file that it was switching back and forth, I did compare the graph before and after the switch and it was clear that, because of polling latency was way too high. By this time it was clear to me that Oracle was at least not able to calibrate well enough to switch back. With couple of search I did find few related bugs,

Bug 27143321 – LGWR May Not Switch Back to post/wait Mode (Doc ID 27143321.8)

Bug 13707904 – LGWR sometimes uses polling, sometimes post/wait (Doc ID 13707904.8)

Patching was not in the choice at all, due to change freeze and Lock down, hence I decided to turn off the Adaptive as the default is the Post/Wait and it was working better for my system.