The following is what I understand from reading the source code; it could be wrong.
ZHEEVD is being lazy and only asks for the optimal ZHETRD workspace size, without considering the size desired by ZUNMTR. If you use an unchanged ILAENV in Netlib LAPACK, these two happen to want the same block size, so if it were as simple as that, there should be no performance difference. The problem, however, is that by the time ZUNMTR gets called, there is only a size-N workspace remaining, so it essentially uses the completely unblocked code. The computation of the optimal workspace is wrong: around line 290 of ZHEEVD, it compares the minimum size LWMIN with N + ZHETRD_OPT, when it really should report something more like LWMIN + ZHETRD_OPT - N.
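To make the difference between the two formulas concrete, here is a small sketch of the arithmetic only; it does not call LAPACK. The function names are made up for illustration. The example numbers assume JOBZ = 'V' (documented minimum LWORK of 2*N + N**2) and a ZHETRD optimum of N*NB with NB = 32, which is an assumption about the default ILAENV block size, not something checked here.

```python
def lopt_as_described(lwmin, zhetrd_opt, n):
    """Optimal size as the post describes ZHEEVD computing it."""
    return max(lwmin, n + zhetrd_opt)


def lopt_suggested(lwmin, zhetrd_opt, n):
    """Optimal size using the formula suggested in the post."""
    return lwmin + zhetrd_opt - n


# Illustrative values only (assumptions, see the text above).
n = 1000
nb = 32                   # assumed block size
lwmin = 2 * n + n * n     # documented minimum LWORK for JOBZ = 'V'
zhetrd_opt = n * nb       # assumed ZHETRD optimal workspace

print(lopt_as_described(lwmin, zhetrd_opt, n))  # 1002000, i.e. just LWMIN
print(lopt_suggested(lwmin, zhetrd_opt, n))     # 1033000
```

With these numbers the reported "optimal" size is just LWMIN, so the larger of the two terms hides the blocked workspace entirely.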
As a temporary measure, it seems you can take the optimal size returned by ZHEEVD's workspace query, add the optimal size returned by ZUNMTR's workspace query, and subtract N; that should be "optimal".
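The workaround is just arithmetic on the two workspace-query results, roughly as sketched here. The helper name and its arguments are hypothetical; the two query values are assumed to come from calling ZHEEVD and ZUNMTR with LWORK = -1 through whatever LAPACK binding you use, which is not shown.

```python
def padded_lwork(zheevd_opt, zunmtr_opt, n):
    """Combine the two workspace-query results as suggested in the post.

    zheevd_opt: optimal LWORK reported by a ZHEEVD workspace query
    zunmtr_opt: optimal LWORK reported by a ZUNMTR workspace query
    n:          order of the matrix
    """
    if n < 0 or zheevd_opt < 0 or zunmtr_opt < 0:
        raise ValueError("sizes must be non-negative")
    # Never go below what ZHEEVD itself asked for.
    return max(zheevd_opt, zheevd_opt + zunmtr_opt - n)


# Example with made-up query results for n = 1000.
lwork = padded_lwork(zheevd_opt=1002000, zunmtr_opt=32000, n=1000)
print(lwork)  # 1033000
```

You would then allocate a complex work array of that length and pass it as LWORK to the real ZHEEVD call. The `max` guard is an extra precaution of mine, not part of the original suggestion.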