I think there are two factors:
Mainly - Your program is dominated by thread creation overhead. You are creating and destroying 2000 threads, and each thread accesses the mutex/CS only once, so the time spent creating threads swamps any difference in lock/unlock times.
Also - You may not be testing the use case that these locks were optimized for. Try spawning just two threads that each acquire and release the mutex/CS thousands of times, so that locking cost rather than thread startup is what you measure.
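
A rough sketch of that second kind of test, in case it helps. This is not your original program: it uses portable C++ (`std::thread`, `std::mutex`) as a stand-in for the Windows primitives, and the names `run_contended` and `BenchResult` are made up for illustration. The idea is only that a small number of threads each take the lock many times, so the fixed cost of creating the threads is amortized over a large number of lock/unlock operations.

```cpp
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Result of one benchmark run (hypothetical helper type).
struct BenchResult {
    std::uint64_t counter;  // total increments performed while holding the lock
    double seconds;         // wall-clock time from first thread start to last join
};

// Spawns `num_threads` threads; each one acquires `lock` `iterations` times
// and increments a shared counter. `Lock` is any type with lock()/unlock(),
// e.g. std::mutex. Thread creation is still inside the timed region, but with
// only a couple of threads it is small next to the locking work.
template <typename Lock>
BenchResult run_contended(Lock& lock, int num_threads, int iterations) {
    std::uint64_t counter = 0;
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(num_threads));

    const auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<Lock> guard(lock);  // lock, then unlock at scope exit
                ++counter;                          // protected by `lock`
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    const auto stop = std::chrono::steady_clock::now();

    return {counter, std::chrono::duration<double>(stop - start).count()};
}
```

You would call it as something like `std::mutex m; auto r = run_contended(m, 2, 1000000);` and compare `r.seconds` between lock types. To compare the actual Windows primitives, one option is to wrap each in a small class exposing `lock()`/`unlock()` (for example around `EnterCriticalSection`/`LeaveCriticalSection`, or `WaitForSingleObject`/`ReleaseMutex`) and pass that in instead; I have not done that here.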