Using Natural Language Processing and Machine Learning in the Fight Against Invasive Species

Natural language processing and machine learning may seem like unlikely weapons in the fight against invasive species, but for the Great Lakes Commission (GLC), they are key to understanding the scope of the problem.

The Internet has enabled all kinds of criminal activity, from the distribution of destructive malware to the buying and selling of stolen goods. Many of these crimes go unnoticed by the public, yet they can have a devastating and long-lasting impact on our world. One such crime is the trafficking of invasive species. So when the Great Lakes Commission (GLC) approached RightBrain Networks with a proposal for a web-crawling application that could help curb the sale of invasive species by raising awareness, we were happy to help.

An invasive species is any living organism that is not native to an environment and can cause harm by spreading aggressively. Unfortunately, with little surveillance in place, the Internet has become an efficient pathway for trafficking thousands of invasive species, sometimes illegally. These transactions can result in destructive organisms being released into sewers, rivers, and lakes, where they can destroy native plants and animals.

The GLC has been at the forefront of the fight against invasive species. It sought our help in building an intelligent web-crawling application that could capture information about sales of aquatic invasive species (AIS) from across the Internet – a formidable challenge, but our software engineers were up for the task.

The result was the Great Lakes Detector of Invasive Aquatics in Trade (GLDIATR). GLDIATR scrapes the largest online marketplaces, which are largely unregulated and carry a wide variety of items for sale. The application collects data through marketplace APIs where they are available and falls back to crawling and scraping where they are not, extracting the text of each listing. GLDIATR also scrapes select search engines to capture data on sales by smaller businesses and individuals.
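
To make the collection step concrete, here is a minimal sketch of the crawl-and-scrape path. It assumes Python with the requests and BeautifulSoup libraries, and the URL and function name are placeholders for illustration, not GLDIATR’s actual code:

```python
# A minimal sketch of the crawl-and-scrape path, assuming the Python
# requests and BeautifulSoup libraries; the URL below is a placeholder,
# not a marketplace GLDIATR actually targets.
import requests
from bs4 import BeautifulSoup

def fetch_listing_text(url: str) -> str:
    """Download a page and return its visible text for later NLP processing."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.get_text(separator=" ", strip=True)

if __name__ == "__main__":
    print(fetch_listing_text("https://example.com/aquatic-plants-for-sale"))
```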

Key to the application’s success is its use of Natural Language Processing (NLP) and machine learning. NLP algorithms and processes break sentence structure down into forms a computer can understand. To streamline the application’s search-and-match capabilities, we programmed the NLP processes to strip irrelevant text, non-printable characters, and specific special characters – including all HTML markup and entities, CSS, and JavaScript content – while still allowing for misspelled words. The NLP algorithms drive the matching process and prepare GLDIATR for machine learning.
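
A simplified sketch of that cleaning step might look like the following. The BeautifulSoup dependency, the regular expressions, and the function name are assumptions made for illustration:

```python
# Sketch of the text-cleaning step described above; BeautifulSoup handles
# markup and entities, and the regexes are illustrative assumptions.
import re
from bs4 import BeautifulSoup

def clean_page(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # drop JavaScript and CSS content
        tag.decompose()
    text = soup.get_text(separator=" ")  # strips HTML markup, resolves entities
    text = re.sub(r"[^\x20-\x7E]+", " ", text)  # remove non-printable characters
    return re.sub(r"\s+", " ", text).strip()  # words, even misspelled ones, survive intact
```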

The machine learning and NLP algorithms capture parts of speech (POS) of interest and normalize words, allowing GLDIATR to process web pages intelligently with minimal human interaction or decision making. The machine learning capabilities also let GLDIATR combine new data and events with its knowledge of past ones to make predictions and decisions. In short, GLDIATR will become increasingly accurate with further use.
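
As a rough illustration of POS capture and word normalization, the sketch below uses NLTK as an assumed stand-in for GLDIATR’s actual NLP stack; the function name and the focus on nouns are illustrative choices:

```python
# Sketch of POS capture and word normalization, using NLTK as an assumed
# stand-in for GLDIATR's actual NLP stack.
import nltk
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(pkg, quiet=True)  # one-time model/corpus downloads

lemmatizer = WordNetLemmatizer()

def nouns_of_interest(text: str) -> list[str]:
    """Return normalized nouns, the part of speech most likely to name a species."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [lemmatizer.lemmatize(word.lower())
            for word, tag in tagged if tag.startswith("NN")]

# e.g. nouns_of_interest("Water hyacinths shipped overnight") would yield
# normalized forms such as 'water' and 'hyacinth'.
```

Normalizing to a base form is what lets a single species entry match the many ways sellers phrase a listing (plurals, capitalization, and so on).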

Initial results look promising. In its first month of operation, GLDIATR identified 200 unique websites and sellers offering 58 different invasive species for sale, demonstrating that it can provide a valuable service to the Great Lakes region. As a result, the GLC plans to implement a coordinated outreach and enforcement effort targeted at a subset of high-priority species, with the goal of significantly reducing the availability of those species in the marketplace.