This pipeline performs entity resolution (ER) on data collected from the North Dakota Business Search. It consists of a web crawler that pulls and parses the data and an entity resolution service that visualizes the relationships between entities.
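The ER step links business records that refer to the same real-world entity. As a rough illustration only (not the project's actual implementation), name-based matching could look like the sketch below; the `normalize` rules, suffix list, and similarity threshold are all assumptions:

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    # Hypothetical normalization: lowercase, trim punctuation,
    # and drop common business suffixes.
    name = name.lower().strip(" .,")
    for suffix in (" llc", " inc", " co", " corp"):
        if name.endswith(suffix):
            name = name[: -len(suffix)].strip(" .,")
    return name

def match(a: str, b: str, threshold: float = 0.85) -> bool:
    # Link two records when their normalized names are similar enough.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

# "Acme Trading LLC" and "ACME Trading Inc." resolve to the same entity.
print(match("Acme Trading LLC", "ACME Trading Inc."))
```

Real ER systems typically also compare addresses, registered agents, and filing dates; this sketch shows only the core idea of fuzzy matching on normalized names.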
- Explore an interactive plot showcasing entity relationships.
- Access the original data source from which the data was crawled for this entity resolution pipeline.
- Find the crawled data used for the entity resolution process.
Make sure Docker Desktop is installed and running.
Configure the search parameters and the output file path in docker-compose.yml, then run (default values for the search parameters and output file path are set in services.py):

```shell
docker-compose run web_crawler
```

Configure the path for the input dataset and the output file path for the plot, then run (default values are set in services.py):

```shell
docker-compose run er
```

Build and start the visualization service:

```shell
docker-compose build view_er
docker-compose up view_er
```

Access the ER visualization in your browser at http://localhost:8000.
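For orientation, a docker-compose.yml for these services might be laid out roughly as follows. Only the service names (`web_crawler`, `er`, `view_er`, `format`) and the port 8000 come from the commands above; the build contexts, commands, environment variable names, and paths are purely illustrative assumptions:

```yaml
# Hypothetical layout sketch, not the project's actual file.
services:
  web_crawler:
    build: .
    command: poetry run er_pipeline run_crawler
    environment:
      SEARCH_TERM: "bakery"           # assumed search parameter
      OUTPUT_PATH: /data/crawled.csv  # assumed output file path
  er:
    build: .
    command: poetry run er_pipeline run_er
  view_er:
    build: .
    command: poetry run er_pipeline view_er_in_browser
    ports:
      - "8000:8000"  # matches http://localhost:8000 above
  format:
    build: .
    command: poetry run black .
```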
Format the code with Docker:

```shell
docker-compose run format
```

To run the pipeline with Poetry instead of Docker:

```shell
pip install poetry
cd entity_resolution_pipeline
poetry install
poetry run er_pipeline run_crawler
poetry run er_pipeline run_er
poetry run er_pipeline view_er_in_browser
```

If you want to configure custom parameters for any of the above services with Poetry, use the command below to view the configuration options for each service. Default custom parameters for these services are configured in services.py.

```shell
poetry run er_pipeline {service_name} --help
```

For example:

```shell
poetry run er_pipeline run_crawler --help
```

Format the code:

```shell
poetry run black .
```
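The `er_pipeline` command with per-service `--help` output suggests a CLI with subcommands. As a sketch of how such a CLI can be structured with the standard-library `argparse` (the project's actual CLI framework and option names are unknown; the options and defaults below are assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch: one subcommand per service, each with its own
    # options, so `er_pipeline <service> --help` documents that service.
    parser = argparse.ArgumentParser(prog="er_pipeline")
    sub = parser.add_subparsers(dest="service", required=True)

    crawler = sub.add_parser("run_crawler", help="Crawl the business search site")
    crawler.add_argument("--search-term", default="bakery")          # assumed default
    crawler.add_argument("--output-path", default="data/crawled.csv")  # assumed default

    er = sub.add_parser("run_er", help="Run entity resolution on crawled data")
    er.add_argument("--input-path", default="data/crawled.csv")      # assumed default

    return parser

# Parsing a sample invocation:
args = build_parser().parse_args(["run_crawler", "--search-term", "plumbing"])
print(args.service, args.search_term)
```

With this structure, unspecified options fall back to defaults, mirroring how the real services fall back to values in services.py.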