That's because the number of possible word sequences grows, while the patterns that signal meaning become weaker. By weighting words in a nonlinear, distributed way, this model can "learn" to approximate words rather than be thrown off by unfamiliar values. Its "understanding" of a given word isn't as tightly tethered to the immediate