360° Data Collection
Different data sets for different insights
You never know where the next great idea, company, or technology may come from. This is why Mergeflow collects and analyzes data from many different data sets and sources. As a result, you stay aware of what's going on around you and get a 360° perspective.
Scientific publications help you identify experts and leading organizations in a given technology field.
Patenting activity in a certain technology field can be an early indicator of future commercial activities in this area.
News & Blogs
News and blogs provide new ideas, forecasts, and scenarios, often from an unconventional angle.
Market research provides size and growth estimates, connects technologies to applications, and identifies major players.
Technology transfer offerings from universities and R&D organizations worldwide help you strengthen and speed up your own R&D projects.
Funded research project reports help you discover innovative, early-stage companies that would otherwise remain under the radar.
A crawler is software that collects data from the web. It starts with a web page, which it stores in a database, then follows the links from that page to other pages and collects those too. Then it moves on to the next set of pages from there, and so on.
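The crawl loop described above can be sketched in a few lines of Python. This is a minimal illustration of a generic breadth-first crawler, not Mergeflow's implementation. The names `crawl`, `LinkExtractor`, and `PAGES`, the `max_pages` limit, and the `example.test` URLs are all assumptions made for the example, and a plain dictionary stands in for the database so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: store a page, then queue the pages it links to."""
    database = {}                 # url -> html, stands in for a real database
    queue = deque([start_url])
    seen = {start_url}            # avoids collecting the same page twice
    while queue and len(database) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # page could not be retrieved
            continue
        database[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return database


# Tiny in-memory "web" (hypothetical pages) so the sketch needs no network.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}
collected = crawl("http://example.test/", PAGES.get)
```

In a real crawler, `fetch` would download the page over HTTP, and the loop would also need politeness rules such as rate limits and robots.txt handling.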
Mergeflow's crawlers pay attention to content. Rather than simply following and collecting every available page, they use self-learning web scraping technologies that recognize and collect content relevant to technology and business.
Selective content collection
To decide whether to collect a page, Mergeflow's crawlers use methods such as language detection (we try to collect English content), semantic topic modeling to distinguish relevant content (technology descriptions, etc.) from irrelevant content (directions, etc.), and text layout modeling. For example, if a page looks like a list of items or like driving directions, Mergeflow ranks it lower than a page that looks like body text with tech and business content.
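To make the idea of ranking pages more concrete, here is a rough Python sketch that combines three simple signals into one score. It uses deliberately naive stand-ins, not Mergeflow's actual models: a stopword ratio in place of language detection, keyword overlap in place of semantic topic modeling, and the share of long lines in place of text layout modeling. The word lists, thresholds, weights, and function names are all assumptions chosen for illustration.

```python
import re

# Hypothetical word lists; a real system would use trained models instead.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is", "a", "for", "that", "with"}
TOPIC_TERMS = {"technology", "patent", "startup", "research",
               "market", "sensor", "battery", "software"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def english_score(tokens):
    """Share of tokens that are common English function words."""
    if not tokens:
        return 0.0
    return sum(t in ENGLISH_STOPWORDS for t in tokens) / len(tokens)


def topic_score(tokens):
    """Share of the topic vocabulary that appears in the text."""
    return len(TOPIC_TERMS & set(tokens)) / len(TOPIC_TERMS)


def layout_score(text):
    """Share of non-empty lines long enough to look like body text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(len(ln.split()) >= 12 for ln in lines) / len(lines)


def relevance(text):
    """Combine the three signals into one score between 0 and 1."""
    tokens = tokenize(text)
    if english_score(tokens) < 0.1:   # probably not English: rank lowest
        return 0.0
    return 0.5 * topic_score(tokens) + 0.5 * layout_score(text)


article = ("The startup develops a battery technology for the market, and the "
           "research is backed by a patent that covers the sensor design.")
directions = "Turn left onto the main road\nGo 2 miles\nTurn right\nArrive at the office"

article_rank = relevance(article)
directions_rank = relevance(directions)
```

With these toy inputs, the article-like text scores higher than the driving directions, which consist only of short lines and contain none of the topic terms.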
Document structure detection
Once our data collection systems have collected a web page, a scientific article, or some other content, they feed only the relevant parts of that document into our analytics pipeline. For example, they extract data from the title or body but not from disclaimers or other irrelevant parts. To do this, we have built an unsupervised algorithm for document content extraction that learns from just a few examples how a data source typically structures its contents.
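One simple way to picture this kind of extraction is a frequency heuristic: lines that recur verbatim across a few example documents from the same source are treated as boilerplate and dropped. The Python sketch below illustrates only that idea under stated assumptions and is not Mergeflow's algorithm. The function names, the `min_share` threshold, the rule that the first remaining line is the title, and the sample documents are all hypothetical.

```python
from collections import Counter


def learn_boilerplate(example_docs, min_share=0.8):
    """Return lines that recur in at least `min_share` of the example documents."""
    counts = Counter()
    for doc in example_docs:
        # Count each distinct line once per document.
        counts.update({ln.strip() for ln in doc.splitlines() if ln.strip()})
    threshold = min_share * len(example_docs)
    return {line for line, n in counts.items() if n >= threshold}


def extract_content(doc, boilerplate):
    """Drop boilerplate lines; treat the first remaining line as the title."""
    lines = [ln.strip() for ln in doc.splitlines() if ln.strip()]
    kept = [ln for ln in lines if ln not in boilerplate]
    title = kept[0] if kept else ""
    body = " ".join(kept[1:])
    return {"title": title, "body": body}


# A few hypothetical documents from one source, sharing a header and a disclaimer.
examples = [
    "Example Source\nSolid-state batteries\nA lab reports a new electrolyte.\n"
    "Disclaimer: all rights reserved.",
    "Example Source\nQuantum sensors\nA startup ships a magnetometer.\n"
    "Disclaimer: all rights reserved.",
    "Example Source\nEdge AI chips\nA new design cuts power use.\n"
    "Disclaimer: all rights reserved.",
]
boilerplate = learn_boilerplate(examples)

new_doc = ("Example Source\nPerovskite solar cells\n"
           "Researchers improve stability.\nDisclaimer: all rights reserved.")
result = extract_content(new_doc, boilerplate)
```

No labels are needed here: the recurring header and disclaimer are identified purely from how often they repeat across the examples.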