Feature Vectors as Criteria Sets in a Q-Language Model

Following the previous post on cover sets in q-analysis, it is worth considering another way of constructing simple cover sets, in which key terms serve as criteria for determining ranked “meaning” in a text stream. This is particularly relevant to the automated formation of ontologies from a given set of text documents.

The recent proliferation of formalized linked vocabularies for domain-specific knowledge representation provides a valuable input source for generating new cover sets in the q-language system. The elements (vocabulary words) are the most important features because they correspond directly to the terms we hope to cluster documents around.

In a sense, we can ensure some level of relatedness between terms and our document vectors through simple co-occurrence calculations. In this way the features operate as "attractors": points around which other terms congregate under the rules of a given relation.

In this case, it might be as simple as computing term co-occurrence as context vectors whenever a vocabulary term or sequence of terms is found in a given document vector. The new feature sets can then be used to compute the relational mapping of terms to document vectors, resulting in an index of term-related documents and a relatedness ranking (q-connected). This is potentially significant in that q-analysis and the methods of topology provide a relatively simple way to link vocabularies, data and human-written knowledge (e.g. research, news). Because simplicial complexes provide a means to compress information without significant data loss (Gould 1980), it is possible that q-analysis can provide a relatively lossless indexing methodology and similarity ranking system.
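The co-occurrence step described above can be sketched roughly as follows. This is a minimal illustration, not the method from the earlier posts: the function name `context_vectors`, the `window` parameter, the whitespace tokenizer and the two sample documents are all assumptions, and only single-word vocabulary terms are handled.

```python
from collections import Counter, defaultdict

def context_vectors(documents, vocabulary, window=2):
    """Build co-occurrence context vectors and a term-to-document index.

    documents:  dict of doc_id -> raw text (hypothetical sample data)
    vocabulary: set of single-word vocabulary terms acting as "attractors"
    window:     number of tokens on each side counted as context (assumed)
    """
    vectors = defaultdict(Counter)  # term -> counts of neighbouring tokens
    index = defaultdict(set)        # term -> ids of documents containing it
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            if token not in vocabulary:
                continue
            index[token].add(doc_id)
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors, index

# Hypothetical documents, for illustration only.
documents = {
    "d1": "soybean seed prices rose as canola seed supply fell",
    "d2": "mustard seed and canola oil exports grew",
}
vocabulary = {"seed", "canola"}
vectors, index = context_vectors(documents, vocabulary)
```

Here `index` plays the role of the index of term-related documents, and each entry of `vectors` is the context vector gathered around one vocabulary term. Multi-word terms would need a phrase matcher in place of the single-token lookup.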

Take Wikipedia, for example. Wikipedia is one of the largest repositories of crowdsourced knowledge available. As a dictionary of terms and definitions, it can serve as a useful starting point for constructing cover sets for measuring similarity (q-connected vectors) between wiki terms and streaming text.
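One way to picture the similarity measure is to treat a term's cover set and each incoming document as simplices whose vertices are words, and to rank documents by the dimension of the face they share with the cover set. The sketch below makes several assumptions: the cover set and documents are invented sample data rather than anything pulled from Wikipedia, the helper names are hypothetical, and it measures only direct q-nearness (two simplices sharing q + 1 vertices), not full q-connectivity through chains of intermediate simplices.

```python
def shared_face_dimension(simplex_a, simplex_b):
    """Dimension q of the shared face: (number of common vertices) - 1.

    Returns -1 when the two simplices have no vertex in common.
    """
    return len(set(simplex_a) & set(simplex_b)) - 1

def rank_by_q(cover_set, documents):
    """Rank documents by the q value they share with a cover set.

    Documents with no shared vertex (q = -1) are left out of the ranking.
    """
    ranking = []
    for doc_id, tokens in documents.items():
        q = shared_face_dimension(cover_set, tokens)
        if q >= 0:
            ranking.append((doc_id, q))
    # Highest q first; ties broken by document id for a stable order.
    return sorted(ranking, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical cover set for one term, and hypothetical streamed documents.
cover_set = {"crop", "seed", "soybean", "canola"}
documents = {
    "d1": {"soybean", "seed", "prices", "canola"},
    "d2": {"mustard", "seed", "exports"},
    "d3": {"steel", "tariffs"},
}
ranking = rank_by_q(cover_set, documents)
```

With this sample data, `d1` shares three vertices with the cover set (q = 2), `d2` shares one (q = 0), and `d3` shares none and is dropped.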

Using the same method for cover set construction, alternative vocabularies can be used to expand the number of filters used to detect related discursive events in a text stream. This provides a means of clustering document sets around terms with high q-connectivity. As a use case, let's say we take the North American Industry Classification System (NAICS) and use its list of codes to generate a feature set.

The NAICS codes provide a useful set of semantically related entries that carry unique identifiers but are also closely related through the category id (e.g. 111000 = Agriculture, crop related; 112000 = Agriculture, animal related).

This results in a mapping that produces multiple levels of N. Here "seed" is q-connected with the id fields for agriculture and crops, which include soybean, mustard and canola. Again, this presents the formation of a multi-dimensional cover set spanning several hierarchical layers (note: many more simplices could be derived from these relations).
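The layered cover set can be sketched by grouping code entries under their shared prefixes. Everything here is illustrative: the three entries and their term sets are made up for the example and are not an authoritative copy of the NAICS table, and the function name `hierarchical_cover_sets` and the choice of prefix lengths are assumptions.

```python
from collections import defaultdict

# Illustrative entries only; the term sets are invented for this sketch.
codes = {
    "111110": {"soybean", "seed", "crop"},
    "111120": {"canola", "mustard", "seed", "crop"},
    "112111": {"cattle", "animal"},
}

def hierarchical_cover_sets(codes, prefix_lengths=(2, 3, 6)):
    """Merge each entry's terms into every ancestor level of its code.

    A shorter prefix stands for a broader category, so its cover set is
    the union of the term sets of all codes beneath it.
    """
    cover = defaultdict(set)
    for code, terms in codes.items():
        for n in prefix_lengths:
            cover[code[:n]] |= terms
    return dict(cover)

cover = hierarchical_cover_sets(codes)

# Every level of the hierarchy whose cover set contains "seed".
levels_with_seed = sorted(key for key, terms in cover.items() if "seed" in terms)
```

In this toy table, "seed" appears in the cover sets for the crop branch (`111` and its two children) and for the top-level sector `11`, but not for the animal branch `112`, which mirrors the idea of one term being connected across several hierarchical layers.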

Notes on Human-Machine Learning
