
Extracting content from Documents & Web

Proven tech and innovation

Back in 2016, when we were designing our Social Media Monitoring platform, we had to index unstructured data from many sources, including blogs, news agencies and online file repositories. In doing so, we often struggled with:

A team of Data Engineers and Web Specialists participated in the following phases to solve the problems above:

Phase 1: Automated Data Indexing and consolidation techniques

We developed dozens of bespoke Robotic Process Automation (RPA) pipelines to automatically crawl and index the content of over 10,000 data sources. On top of these pipelines, a central platform was created to enable efficient data-source consolidation and management.
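A central source registry of this kind can be sketched as follows. This is an illustrative example only: the field names, URLs and scheduling logic are assumptions, not the platform's actual schema.

```python
# Illustrative sketch of a central data-source registry: each entry
# records where to crawl, what kind of source it is, and how often to
# revisit it. All values here are hypothetical examples.
SOURCES = [
    {"url": "https://example.com/blog/feed", "kind": "blog", "interval_s": 3600},
    {"url": "https://example.org/news", "kind": "news", "interval_s": 600},
]

def due_sources(last_run: dict, now: float) -> list:
    """Return the registered sources whose crawl interval has elapsed.

    last_run maps a source URL to the timestamp of its last crawl;
    sources never crawled before are always due.
    """
    return [
        s for s in SOURCES
        if now - last_run.get(s["url"], 0) >= s["interval_s"]
    ]
```

A scheduler running on each cloud server could periodically call `due_sources` and dispatch the returned entries to the appropriate scraping engine.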

Phase 2: Web & File Scraping Engine

Dozens of scraping engines were developed to automatically extract the content of each file or website based on its purpose. Each engine could cleanse the data and discard unnecessary information on the fly.
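A minimal sketch of this kind of extract-and-cleanse step, using only Python's standard-library HTML parser, might look like the following. The set of ignored tags and the class and function names are assumptions for illustration, not the engines' actual implementation.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate and dropped on the fly.
# This list is illustrative; a real engine would tune it per source.
IGNORED_TAGS = {"script", "style", "nav", "footer", "aside"}

class ContentExtractor(HTMLParser):
    """Collect visible text while skipping anything inside ignored tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside an ignored tag
        self.chunks = []      # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)

def extract_text(html: str) -> str:
    """Return the cleansed visible text of an HTML document."""
    parser = ContentExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

For example, `extract_text("<script>x()</script><p>Hello</p>")` keeps only `"Hello"`, since the script body is skipped during parsing rather than in a second pass.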

Phase 3: Data Transformation & Storage

The cleansed content extracted from each file or website was transformed into a semi-structured format containing the images, author, source address, title and text. The transformed data could then be automatically inserted into remote databases or pushed to APIs. This integration allowed multiple cloud servers to focus on different sources while saving their output into a single Master Database.
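The semi-structured record described above can be sketched as a simple dataclass serialised to JSON for insertion. The field names and the JSON wire format are assumptions chosen to match the fields listed in the text, not the actual Master Database schema.

```python
from dataclasses import dataclass, field, asdict
import json

# Hypothetical semi-structured record matching the fields described:
# images, author, source address, title and text.
@dataclass
class Document:
    title: str
    author: str
    source_url: str
    text: str
    images: list = field(default_factory=list)

def to_payload(doc: Document) -> str:
    """Serialise a record as JSON for a remote database insert or API call."""
    return json.dumps(asdict(doc))
```

Because every server emits the same record shape, outputs from servers handling different sources can be merged into one Master Database without per-source mapping logic.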

As a result of this development, our Web and Document Scraping Engine was able to:

With this new technology, we will be able to support countless additional file types and formats in the future, including but not limited to sensor data, external APIs, internal data sources (SharePoint, local drives, network drives) and live-streaming data.

Our professional consultants are here to implement and deliver similar projects for your organisation. Contact us today to get started.
