The one thing in our favor is that not all the nodes are communicating with each other simultaneously, so the dreaded 0.5*n*(n-1) link count of a full mesh can be avoided. (Actually, does mimicking a human brain even merit a true mesh?) As such, this effectively becomes a networking problem, and I classify that as a 'solved problem'. Bandwidth isn't free, but it can be managed efficiently.
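To put rough numbers on that 0.5*n*(n-1) figure, here is a quick back-of-the-envelope sketch. The 2D grid with neighbour-only links is purely my own assumption for comparison, not a claim about how the chip is actually wired, and the function names are made up for illustration.

```python
def full_mesh_links(n: int) -> int:
    """Point-to-point links needed if every node connects to every other node."""
    return n * (n - 1) // 2


def grid_links(side: int) -> int:
    """Links in a side x side 2D grid where each node only talks to its
    direct neighbours (hypothetical comparison topology)."""
    # side rows with (side - 1) horizontal links each, plus the same vertically
    return 2 * side * (side - 1)


if __name__ == "__main__":
    for side in (4, 16, 64):
        n = side * side
        print(f"{n} nodes: full mesh {full_mesh_links(n)} links, "
              f"2D grid {grid_links(side)} links")
```

The full mesh grows quadratically with node count while the grid grows roughly linearly, which is why avoiding all-to-all links turns this into an ordinary routing problem.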
Chip stacking is an interesting idea, especially if TSVs (through-silicon vias) are used to enable vertical movement in the middle of a die. The catch would be keeping the middle of the stack cool, but that's no different from any other massively stacked device. IBM's claims of ultra-low power may make this feasible.
Another thought occurred to me: SMT. Essentially, double the clock rate and evenly divide up the resources. Bandwidth requirements for going between sockets scale up, but then you won't necessarily have to move off-die as often. It'd require a massive change to each 'core', though, and it is already a massive die. Even a small per-core change like SMT would be massively amplified by the sheer number of cores on the silicon. I'd imagine this would require a die shrink or two.