based on the 'big corpus' with external data