Q-Analysis of Natural Language

Q-analysis is a methodological perspective and a language for studying the structure of a system and its dynamics. Indeed, q-analysis has been dubbed the “language of structure” (Legrand 2002), because it provides both a mathematical framework and a particular vocabulary for defining system features and relationships (Atkin & Casti, 1977; Gould 1980). The mathematical framework of q-analysis is built on algebraic topology, a branch of abstract mathematics concerned with space and shape under continuous deformation (e.g. the bending, compressing or stretching of shapes). In topology, and specifically in q-analysis, shape is defined by the relationships between elements in open sets. The relationships between these sets produce new sets representing the edges, faces and simplicial complexes that form as a result of a relational mapping λ from some set A and some set B to a new set C.
The relation λ represents a rule that defines the conditions for producing a binary mapping: each pair (a_i, b_j) is assigned either 0 or 1. Relational rules can be defined in many ways, some of which are particularly important to natural language processing. For instance, a TF-IDF algorithm can be used to compute values for a given set of words across a set of documents. The theoretical perspective behind TF-IDF is that important words occur often within related documents, but not as often in documents on unrelated topics. At the same time, some words occur frequently across documents regardless of topic (e.g. the, a, at, is, but, of, to). These broadly distributed, frequently occurring words tell us little about the important terms in a document. TF-IDF provides a mechanism for determining a weight for each term in a given document vector.


It is important to note that the construction of our relation is grounded in some way. Ideally, every relation is defined by theory in some fashion. Continuing our example, we might define a relation λ on A x B (say, with A a set of words and B a set of documents), where Cλ = { c_i,j = 1 if the TF-IDF weight of word a_i in document b_j is > 0; otherwise c_i,j = 0 }. Given this relational rule, the result is an incidence matrix of word vectors and document vectors.
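As a rough illustration of the rule above, here is a minimal Python sketch that builds such an incidence matrix. The toy corpus, the function name `tfidf_incidence` and the choice of an unsmoothed idf (log of N over document frequency) are all assumptions made for the example, not something prescribed by q-analysis or by any particular library.

```python
import math

# Hypothetical toy corpus; the documents are made up for illustration.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "the stock market fell",
}

def tfidf_incidence(docs):
    """Return a dict mapping each word to a 0/1 row over the sorted document ids.

    An entry is 1 when the word's tf-idf weight in that document is > 0.
    """
    tokens = {d: text.split() for d, text in docs.items()}
    doc_ids = sorted(tokens)
    words = sorted({w for toks in tokens.values() for w in toks})
    n_docs = len(doc_ids)
    # Document frequency: number of documents that contain each word.
    df = {w: sum(w in tokens[d] for d in doc_ids) for w in words}
    incidence = {}
    for w in words:
        # Assumed unsmoothed idf: it is exactly 0 for a word found in every document.
        idf = math.log(n_docs / df[w])
        row = []
        for d in doc_ids:
            tf = tokens[d].count(w) / len(tokens[d])
            row.append(1 if tf * idf > 0 else 0)
        incidence[w] = row
    return incidence

incidence = tfidf_incidence(docs)
for word, row in incidence.items():
    print(word, row)
```

With this particular idf, a word such as "the" that appears in every document gets a weight of zero and so drops out of the relation entirely, which mirrors the point about broadly distributed words. A smoothed idf would behave differently and would need a threshold other than zero.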

The interesting thing here is that constructing weighted distributions of terms is just one possible method for constructing a cover of the master set M. In the case of q-analysis, the relations represent a type of lens for exploring the topology of the data. Other rules could be defined to represent other lenses, such as n-grams, context vectors, frames and more. Additionally, multiple rules can be defined to represent multiple layers of a topological space. By stacking new sets on top of the original master (or universal) set M, we can begin to define the structure of subsets and the relationships between subsets at different levels of abstraction (i.e. levels of Q).
[Figure: the hierarchical layering of elements in Central Place Theory. Accessed at: http://urbagram.stdio-london.com/v1/show/A+short+History+of+Intersections]
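To make the idea of interchangeable lenses a little more concrete, the following sketch defines two relations over the same toy documents: one based on single words and one based on bigrams. Everything here is hypothetical (the corpus, the helper names `incidence`, `unigrams` and `bigrams`), and the rule is simple presence or absence rather than a TF-IDF weight.

```python
# Hypothetical toy corpus; the documents are made up for illustration.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "the stock market fell",
}

def unigrams(tokens):
    """Lens 1: individual words."""
    return tokens

def bigrams(tokens):
    """Lens 2: adjacent word pairs."""
    return list(zip(tokens, tokens[1:]))

def incidence(docs, lens):
    """Binary relation between the items a lens extracts and the documents.

    Returns a dict mapping each item to a 0/1 row over the sorted document ids.
    """
    doc_ids = sorted(docs)
    per_doc = {d: set(lens(docs[d].split())) for d in doc_ids}
    items = sorted(set().union(*per_doc.values()))
    return {item: [1 if item in per_doc[d] else 0 for d in doc_ids]
            for item in items}

word_lens = incidence(docs, unigrams)
bigram_lens = incidence(docs, bigrams)

print(word_lens["cat"])
print(bigram_lens[("the", "cat")])
```

Both relations share the same document set, so their incidence matrices can be stacked or compared column by column. This is only one way to read "layers" and is an assumption of the sketch.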
With each addition of a subsequent cover, the system begins to reveal the multi-dimensional structures that link words, topics, word sequences, documents and nearly any other data we can define a relation on. For instance, we could potentially link entirely different data spaces or structures by defining relations on a given n-simplicial level. In particular, time would serve as a useful lens for defining relations between topics and streaming data. In fact, linking time and the topology of topics could be used to detect relational patterns and structures across the system in real time. Moreover, connecting time to semantic representations as well as physical spatial representations offers a method for stacking and linking disparate data.

At this point, it is probably worth addressing the "why" of using this method or perspective. Why q-analysis? The approach provided by q-analysis could be seen in relation to neural nets, Bayesian networks and the sparse coding models of HTM (Hierarchical Temporal Memory). The lenses resemble layers, and the elements are networked, hierarchical and represented as sparsely coded binary arrays. So why not just use these other methods if they are already proven to work?


My answer to this question is both practical and personal. From the practical standpoint, q-analysis, and in particular its use of a modified form of algebraic topology, provides the language of sets and set theory as a formalism for describing data and the relationships between different data. In other words, q-analysis provides a framework for working across data types, knowledge domains and systems. The open set and the rules of topology themselves offer a space for exploring and comparing our theories of natural language and human ideology (more to come). The concepts of q-transmission, q-connectedness, eccentricity, etc. [see: Beaumont & Gatrell] offer ways to inspect the relationships between simplices at multiple partitions. These measures allow for a way to model not just language but also the conceptual space revealed in text and human behavior (including other systems).
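The measures mentioned above can be sketched on a small example. The code below assumes that each document is treated as a simplex whose vertices are its words, that two simplices are q-near when they share at least q + 1 vertices, and that eccentricity follows the commonly cited form (q_top - q_bottom) / (q_bottom + 1). The data and function names are hypothetical, and the definitions should be checked against Atkin or Beaumont & Gatrell before being relied on.

```python
from itertools import combinations

# Hypothetical simplices: each document is a simplex whose vertices are words.
simplices = {
    "d1": {"cat", "mat", "sat"},
    "d2": {"cat", "dog", "sat"},
    "d3": {"market", "stock"},
    "d4": {"dog", "market"},
}

def dim(name):
    """Dimension of a simplex: number of vertices minus one."""
    return len(simplices[name]) - 1

def shared_q(a, b):
    """Dimension of the face shared by two simplices (-1 if none)."""
    return len(simplices[a] & simplices[b]) - 1

def q_components(q):
    """Group simplices of dimension >= q into q-connected components."""
    names = [s for s in simplices if dim(s) >= q]
    parent = {s: s for s in names}

    def find(s):
        while parent[s] != s:
            s = parent[s]
        return s

    # Merge every pair that shares a face of dimension >= q; chains of such
    # pairs end up in the same component.
    for a, b in combinations(names, 2):
        if shared_q(a, b) >= q:
            parent[find(a)] = find(b)
    groups = {}
    for s in names:
        groups.setdefault(find(s), set()).add(s)
    return sorted(sorted(g) for g in groups.values())

def eccentricity(name):
    """Assumed form: (q_top - q_bottom) / (q_bottom + 1)."""
    q_top = dim(name)
    q_bottom = max(shared_q(name, t) for t in simplices if t != name)
    if q_bottom < 0:
        return float("inf")  # shares nothing with any other simplex
    return (q_top - q_bottom) / (q_bottom + 1)

for q in (2, 1, 0):
    print(q, q_components(q))
print({s: eccentricity(s) for s in simplices})
```

In this toy data, d1 and d3 share no words, yet they fall in the same component at q = 0 because a chain of shared words runs through d2 and d4. That chaining is what q-connectivity is meant to capture.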

Personally, I find that neural nets, LSTMs and other methods are currently insufficient to link extreme, multi-dimensional data systems. Rather, these methods provide a lens for learning about structure and, ultimately, for potentially improving the ways we use clustering and other predictive techniques. However, the language of q-analysis also resembles my own thinking about language, belief and behavior. I believe q-analysis offers an approach for connecting linguistic and conceptual spaces to produce insights into the structures of social thought across heterogeneous groups.
