最近工作上遇到高流量系統在進行壓力測試時資料庫會卡住,pg_stat_activity出現很多transaction卡在commit操作,顯示的wait_event_type是Lock,而wait_event是object
據手冊的說法,這樣的情況是:「Waiting to acquire a lock on a non-relation database object.」
app又沒有去操作資料庫本體,上網查也查不到相關的資訊,手冊只有很籠統地說跟資料庫內部有關,到底是怎麼回事?
查了pg_locks沒有看到任何和object lock有關的東西,只看得出是transaction正在commit時卡住,沒辦法我只好把postgres的原始碼載下來用人工的方式反查
先確定定義object lock的地方,找到他的enum名稱,接著依照caller stack整理出所有會產生object lock的code
最後篩選出一個和前面提到的「transaction正在commit」有關,就是PreCommit_Notify()
有在用PostgreSQL的朋友可能知道他有Notify/Listen的功能,可以很方便地利用資料庫滿足pub/sub的需求,還可以直接設定trigger在資料庫有特定操作時自動發送訊息到正在listen的client去
而PreCommit_Notify()的功能就是在commit前將要傳送的訊息放進notification queue裡
剛好我們的資料庫有使用到notify,而且更剛好卡住的transaction會發送notify,那問題已經很明顯了,就是發送了notify導致卡住
至於詳細的原因在程式碼裡面有註解(把商業決策寫下是很棒的好習慣)
/* * Serialize writers by acquiring a special lock that we hold till * after commit. This ensures that queue entries appear in commit * order, and in particular that there are never uncommitted queue * entries ahead of committed ones, so an uncommitted transaction * can't block delivery of deliverable notifications. * * We use a heavyweight lock so that it'll automatically be released * after either commit or abort. This also allows deadlocks to be * detected, though really a deadlock shouldn't be possible here. * * The lock is on "database 0", which is pretty ugly but it doesn't * seem worth inventing a special locktag category just for this. * (Historical note: before PG 9.0, a similar lock on "database 0" was * used by the flatfiles mechanism.) */ LockSharedObject(DatabaseRelationId, InvalidOid, 0, AccessExclusiveLock);
大意是為了確保進到queue的notification有成功commit且遵循commit完成的順序,如果transaction有發送notify就會對一個不存在的database上AccessExclusiveLock直到該commit完成
因為這個「不存在的database」是寫死id的,所以不論你正在對哪一個資料庫操作,只要transaction有發送notify通通都會受影響
如果你有大量transaction同時操作的使用場景,最好避免在裡面送notify,避免整個資料庫卡住,嚴重影響效能
ref: https://github.com/postgres/postgres/blob/REL_15_7/src/backend/commands/async.c#L956
留言