The dataset of the online demo

The documents in the online demo dataset were crawled from publicly available sources. They cover a broad mix of content, so you should find documents on almost any topic. Critical content was filtered out in advance as thoroughly as possible.

File types

Since PDF documents are the easiest to access, they make up the largest part of the dataset. However, you can also use the filter functions to narrow the dataset down to specific file types.

To give you a sense of how many files of each type are in the demo, here is a short overview:

File type	Quantity
PDF	>40,000 files
Scanned PDFs	~100 files
PowerPoint	~150 files
Word	>11,500 files
Excel	>1,000 files
Email	>2,000 files
Images	>100,000 files
3D models	>30,000 files
Tickets	>6,000 files

Data sources

In the demo we have limited ourselves to a selection of our connectors. These include:

Network drives
SharePoint
OneDrive
Teams
Outlook
OneNote
Jira
Confluence
d.velop
GitLab

In practice, however, we can support many more systems.

You can find an overview of possible queries here.