Automatic merging and validation of data

The aim of this subproject is to develop methods for converting and combining terminology data from various existing sources. This process involves two complex types of problems. The first pertains to form: the data are likely to have different structures and to be stored in different formats. The second pertains to content: the data may be of varying quality. Entries from different resources may describe the same concept but be associated with different sets of synonyms and slightly varying definitions or, conversely, may share the same term form but be associated with different concepts.
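The two content cases described above can be illustrated with a minimal Python sketch. It is not the method developed in the project: the `Entry` structure, the word-overlap (Jaccard) similarity, the 0.5 threshold, and the three labels are all assumptions chosen for illustration, and the sketch further assumes that entries for the same concept share at least one term.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical, simplified termbase entry (not the project's data model)."""
    source: str
    terms: set[str]
    definition: str


def normalize(term: str) -> str:
    # Lowercase and collapse whitespace so trivially different forms match.
    return " ".join(term.lower().split())


def shared_terms(a: Entry, b: Entry) -> set[str]:
    # Terms that occur in both entries after normalization.
    return {normalize(t) for t in a.terms} & {normalize(t) for t in b.terms}


def definition_similarity(a: Entry, b: Entry) -> float:
    # Jaccard overlap of the word sets of the two definitions (a crude proxy).
    words_a = set(a.definition.lower().split())
    words_b = set(b.definition.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def classify_pair(a: Entry, b: Entry, threshold: float = 0.5) -> str:
    # No common term form: this sketch has no evidence linking the entries.
    if not shared_terms(a, b):
        return "no-form-overlap"
    # Common form and similar definitions: probably the same concept.
    if definition_similarity(a, b) >= threshold:
        return "merge-candidate"
    # Common form but dissimilar definitions: possibly different concepts.
    return "possible-homonym"


device_a = Entry("A", {"mouse", "computer mouse"},
                 "hand-held pointing device for a computer")
device_b = Entry("B", {"Mouse", "pointing device"},
                 "hand-held pointing device used with a computer")
rodent = Entry("C", {"mouse"}, "small rodent with a long tail")

print(classify_pair(device_a, device_b))  # merge-candidate
print(classify_pair(device_a, rodent))    # possible-homonym
```

In practice, a pair flagged as a merge candidate would still need validation, since word overlap in definitions says little about the quality of either entry.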

We have developed a taxonomy of datatypes for termbases; see Madsen et al. (2013) in eDITion, and visit the database at vip.iterm.dk (select the database "DanTermBank Data Categories" from the drop-down list; login and password: PUBLIC).

Furthermore, we have started developing methods for merging entries that contain equivalent concepts; see Madsen et al. (2012) from TKE. Further work on merging entries has been postponed to a later phase of the overall DanTermBank project.