Feature Vectors as Criteria Sets in a Q-Language Model

Following the previous post on cover sets in q-analysis, it is worth considering another way to construct simple cover sets, one in which key terms act as criteria for determining ranked “meaning” in a text stream. This is particularly relevant to the automated formation of ontologies from a given set of text documents.

The recent proliferation of formalized, linked vocabularies for domain-specific knowledge representation provides a valuable input source for generating new cover sets in the q-language system. The elements (vocabulary words) are the most important features because they correspond directly to the terms we hope to cluster documents around.


In a sense, we can ensure some level of relatedness between terms and our document vectors through simple co-occurrence calculations. In this way the features operate as "attractors": points around which other terms congregate under the rules of a given relation.

In this case, it might be as simple as computing term co-occurrence as context vectors whenever a vocabulary term or sequence of terms is found in a given document vector. The new feature sets can then be used to compute the relational mapping of terms to document vectors, resulting in an index of term-related documents and a relatedness ranking (q-connectivity). This is potentially extraordinary in that q-analysis and the methods of topology provide a relatively simple way to link vocabularies, data, and human-written knowledge (e.g. research, news). And because simplicial complexes provide a means to compress information without significant data loss (Gould 1980), it is possible that q-analysis can provide a relatively lossless indexing methodology and similarity ranking system.
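The co-occurrence and relational-mapping steps described above can be sketched roughly as follows. This is a minimal illustration, not the method used in this series of posts: the function names, the window size, the naive tokenizer, and the three toy documents are all assumptions made for the example. Each document is treated as a simplex whose vertices are the vocabulary terms it contains, and two documents are taken to share a face of dimension q when they have q + 1 vocabulary terms in common.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Very naive tokenizer (an assumption): lowercase, strip edge punctuation."""
    words = (w.strip('.,;:!?"()').lower() for w in text.split())
    return [w for w in words if w]

def context_vectors(docs, vocabulary, window=2):
    """For each vocabulary term, count the words found within `window` tokens of it."""
    contexts = defaultdict(Counter)
    for tokens in docs.values():
        for i, tok in enumerate(tokens):
            if tok in vocabulary:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        contexts[tok][tokens[j]] += 1
    return contexts

def incidence(docs, vocabulary):
    """Relation: a document is related to a term if the term occurs in it."""
    return {doc_id: frozenset(t for t in tokens if t in vocabulary)
            for doc_id, tokens in docs.items()}

def shared_face_dimension(simplex_a, simplex_b):
    """Dimension q of the face two simplices share (-1 means nothing shared)."""
    return len(simplex_a & simplex_b) - 1

# Toy data, invented for illustration only.
vocabulary = {"seed", "soybean", "canola", "mustard"}
docs = {
    "d1": tokenize("Soybean seed prices rose as canola seed supply fell."),
    "d2": tokenize("Canola and mustard seed exports grew."),
    "d3": tokenize("Cattle feed costs rose."),
}

contexts = context_vectors(docs, vocabulary)
simplices = incidence(docs, vocabulary)
q_d1_d2 = shared_face_dimension(simplices["d1"], simplices["d2"])
q_d1_d3 = shared_face_dimension(simplices["d1"], simplices["d3"])
```

Here `d1` and `d2` share the terms "seed" and "canola", so they share a 1-dimensional face, while `d3` contains no vocabulary terms and is disconnected from both. Ranking document pairs by this value is one simple way to read the "relatedness ranking" mentioned above.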

Take, for example, Wikipedia. Wikipedia is arguably the largest single repository of crowdsourced knowledge on the planet. As a dictionary of terms and definitions, it can serve as a useful starting point for constructing cover sets for measuring similarity (q-connected vectors) between wiki terms and streaming text.

Using the same method for cover set construction, alternative vocabularies can be used to expand the number of filters used to detect related discursive events in a text stream. This provides a means for clustering document sets around terms with high q-connectivity. As a use case, let's say we take the North American Industry Classification System (NAICS) and use its list of codes to generate a feature set.

The NAICS codes provide a useful set of semantically related information: each code is a unique identifier, but codes are also closely related through their category id (e.g. 111000 = agriculture, crop related; 112000 = agriculture, animal related).
This results in a mapping that produces multiple levels of N. Here, seed is q-connected with the id fields for agriculture and crops. This includes soybean, mustard, and canola. Again, this forms a multi-dimensional cover set spanning several hierarchical layers (note: many more simplices could be derived from these relations).
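As a rough sketch of how such a layered cover set might be built, the snippet below groups codes by shared prefix so that each shorter prefix forms a higher hierarchical level. The keyword sets attached to each code are hand-written assumptions for illustration, not official NAICS descriptions, and the function names and prefix lengths are likewise hypothetical; the official NAICS listing should be consulted for real titles.

```python
from collections import defaultdict

# Illustrative entries only: the keyword sets are assumptions, not official NAICS text.
CODE_KEYWORDS = {
    "111110": {"soybean", "seed"},
    "111120": {"canola", "mustard", "seed"},
    "112111": {"cattle", "ranching"},
}

def cover_sets(code_keywords, prefix_lengths=(6, 4, 3, 2)):
    """Build one cover per hierarchical level by merging keywords of codes sharing a prefix."""
    levels = {}
    for n in prefix_lengths:
        groups = defaultdict(set)
        for code, words in code_keywords.items():
            groups[code[:n]] |= words
        levels[n] = dict(groups)
    return levels

def codes_for_term(levels, term):
    """List, per level, the code prefixes whose keyword set contains `term`."""
    return {n: sorted(prefix for prefix, words in groups.items() if term in words)
            for n, groups in levels.items()}

levels = cover_sets(CODE_KEYWORDS)
seed_hits = codes_for_term(levels, "seed")
```

In this toy table, "seed" relates to two six-digit codes, which merge into a single group at the four-digit level and remain connected up through the three- and two-digit levels, mirroring the way seed is linked to the crop categories in the example above.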
