Defining "lenses" for a Q-Language Topology

In my previous post, Q-Analysis of Natural Language, I started to describe a path for applying q-analysis to the study of natural language. One of the particularly interesting aspects of q-analysis is the ability to connect hierarchical data in a rather straightforward (although non-trivial) manner. The process of connecting data is described through the definition of a relational mapping and the rules defined for that mapping.

The relational mappings result in a new subset of the combinations of the two input sets. This combinatorial set serves as a cover for constructing q-connected simplices, which allows for inspection of the q-connectivity of sets across hierarchical scales. The example below, described in Beaumont and Gatrell, shows the mappings between elements at different hierarchical levels of N.

The structure is the resulting mapping between three interrelated sets defined by the relation. In the language of q-analysis, this is called the 'backcloth.' It is on this backcloth where q-connected levels enable or direct 'traffic.' In the case of traffic, we might think of meaning, the flow of resources, beliefs, as well as traffic in the traditional sense as the flow of transport modes.
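To make the idea of q-connected levels concrete, here is a minimal sketch, assuming the usual q-analysis convention that two simplices are q-near when they share at least q + 1 vertices. The document names, the words, and the helper functions are hypothetical and only serve as an illustration.

```python
from itertools import combinations

# Hypothetical relation: each simplex (here a document) is named by the
# set of vertices (here words) it is related to.
simplices = {
    "d0": {"climate", "change", "warming"},
    "d1": {"climate", "change", "policy"},
    "d2": {"policy", "market"},
}


def shared_face_dimension(a, b):
    """Dimension of the face two simplices share (-1 if they share nothing)."""
    return len(a & b) - 1


def q_near_pairs(simplices, q):
    """Return the pairs of simplices that share at least q + 1 vertices."""
    return [
        (name_a, name_b)
        for (name_a, a), (name_b, b) in combinations(sorted(simplices.items()), 2)
        if shared_face_dimension(a, b) >= q
    ]


print(q_near_pairs(simplices, 1))  # pairs sharing two or more words
print(q_near_pairs(simplices, 0))  # pairs sharing one or more words
```

In this toy relation d0 and d1 share two words, so they are 1-near, while d1 and d2 share only one word and are connected at q = 0 only. Traffic that needs two shared words to pass cannot reach d2.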

In this post I will describe three separate lenses or covers that are often used to construct a language model. These will serve to form the backcloth for the model. The first lens is simply the partition of word vectors into noun-phrase sequences contained within a given document. The second lens uses the TF-IDF algorithm to generate a distribution of related and non-related terms for a given document. The third lens we will explore is the output of a sentiment analysis. Using a Naive Bayes classifier, we can produce a probability score of sentiment for a given document that will also map to the TF-IDF weights and the noun-phrase partitions of the same document, and thus to the entire corpus of documents. This stacked approach provides new windows for investigating language, but it also provides the matrix for mapping to other semantically related data (e.g. location data, stock prices, etc.).

Now, before going further, it is necessary to consider the conceptual coherence between hierarchical covers. The flexibility of using topological methods (in particular, q-analysis) can also encourage sloppy use of the methods -- "garbage in, garbage out." Since one can define almost any type of collection in set-theoretical terms, it is critical that one begins with a theory (or notion) about the hierarchical covers, and the conditions under which covers become q-connected.

In the case of language, we might conceptualize the original input vector as N-0. N-0 is connected to N-1 as partitions or sub-sets of noun-phrases, while N-2 can represent the raw count distribution of both the sub-sets and the original input vector to define the topic space (in reality we would probably create two or three TF-IDF lenses using the raw vector stream, the partitioned stream, and the combination of both). Finally, we can map to our sentiment values for each of the document-word vectors. The diagram below shows one way each of these layers connects in a stream of text.

Now, aside from the document stream and the documents themselves, the relational mapping levels are well defined by the vectorization of both words and documents in the data stream. In building our geometric perspective, we must first define our sets and the relational rules for set construction. The first, and perhaps most obvious, set is our universal set M, which is essentially all the words and their indexed locations within a set. For example, let W represent the set of all words that have been encountered in a text stream.

W = {w₀,w₁,w₂,...wᵢ}

A new element w is added to W if w is an element of a document d, which is a member of the set D. The set D represents the documents and the word vectors within each document.

D = {dᵢ, {w₂, w₈, w₀}}

Now, let M represent the locations of w in W. The set M thus becomes a set filled with the values in ascending order from 0 to n.

M = {0, 1, 2, 3,...n}
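The construction of W, D and M can be sketched in a few lines of Python. This is a minimal sketch, assuming whitespace tokenization and lower-casing; the function name build_index and the two sample documents are hypothetical and are not part of Spire.

```python
def build_index(documents):
    """Build the word list W, the index set M and the document vectors D."""
    word_to_index = {}   # maps each word w to its location in W
    doc_vectors = []     # D: one list of word indices per document
    for text in documents:
        vector = []
        for word in text.lower().split():
            # A word is added to W the first time it is seen in any document.
            if word not in word_to_index:
                word_to_index[word] = len(word_to_index)
            vector.append(word_to_index[word])
        doc_vectors.append(vector)
    W = list(word_to_index)      # insertion order equals index order
    M = list(range(len(W)))      # the locations 0..n of the words in W
    return W, M, doc_vectors


docs = ["the climate is changing", "the policy is changing"]
W, M, D = build_index(docs)
print(W)
print(M)
print(D)
```

Each document becomes a vector of indices into W, so the second document reuses the indices already assigned to "the", "is" and "changing" and only adds one new word.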

The resulting mapping produces a dictionary of words and their indices, and a vector of documents with their related word vectors. This provides our first two covers: the word vectors and the documents in which those vectors occur. We can extend this further by applying a parts-of-speech (POS) tagging method that parses out certain sequences or patterns of words based on their POS-tags. For this example, we can use the spaCy NLP framework along with a tool I developed called Spire (note: the version of Spire hosted on Github is not the same as the one used in this experiment). The new version of Spire (which will likely be renamed) takes an input stream of text and breaks the text up into its atomic parts (i.e. characters and punctuation). Special characters are then removed, and the remaining characters are joined back into discrete words or tokens.

spaCy is then used to identify the POS for each token in the document stream. The below code snippet shows the methods used in Spire to split, tokenize and tag document streams.

By updating the pos_tag() method to include a non-empty stop_tag = [] (a Python list), we can specify which word sequences to retain and which to exclude based on the POS-tag value. In this case we would likely remove those words tagged as "DET", "CCONJ", "SPACE" and so on. This produces another set P similar to D, but now with a new subset of subsets as word sequences, which in this case are phrases.
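The filtering step can be illustrated without Spire. The following is only a sketch: it assumes the tokens have already been tagged (the tags here are written by hand rather than produced by spaCy) and that a stop-tagged token both gets dropped and closes the current phrase. The function split_on_stop_tags is hypothetical and is not Spire's pos_tag() method.

```python
def split_on_stop_tags(tagged_tokens, stop_tags):
    """Split a tagged token stream into phrases at stop-tagged tokens."""
    phrases, current = [], []
    for token, tag in tagged_tokens:
        if tag in stop_tags:
            # A stop-tagged token is dropped and ends the current phrase.
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)
    return phrases


# Hand-tagged example stream of (token, POS-tag) pairs.
tagged = [
    ("the", "DET"), ("global", "ADJ"), ("climate", "NOUN"),
    ("and", "CCONJ"),
    ("the", "DET"), ("rising", "VERB"), ("sea", "NOUN"), ("level", "NOUN"),
]
phrases = split_on_stop_tags(tagged, stop_tags={"DET", "CCONJ", "SPACE"})
print(phrases)
```

Each inner list is one element of the phrase set P for this document. Whether stop tags should split phrases or simply be deleted is a design choice that depends on the kind of sequences one wants to keep.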

Our final step is to construct another subset based upon the TF-IDF model, which produces a distribution of word vectors based upon word counts in a document while reducing the value of those words that occur across all documents. In terms of text classification, this model assumes that there are some words that occur frequently in a given text but that do not occur frequently in all texts. These are the words we want to isolate in order to construct a model of the topics in the text.
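As a rough illustration of the weighting, here is a minimal sketch in plain Python, assuming the common form weight = term count × log2(N / document frequency) with no normalization. It is not the Gensim implementation and not the extended method used for this post; the function name tfidf and the sample vectors are hypothetical.

```python
import math
from collections import Counter


def tfidf(doc_vectors):
    """Return, per document, a sorted list of (word index, weight) pairs."""
    n_docs = len(doc_vectors)
    # Document frequency: in how many documents does each word index occur?
    doc_freq = Counter()
    for vector in doc_vectors:
        doc_freq.update(set(vector))
    weighted = []
    for vector in doc_vectors:
        counts = Counter(vector)
        weighted.append([
            (idx, count * math.log2(n_docs / doc_freq[idx]))
            for idx, count in sorted(counts.items())
        ])
    return weighted


# Three documents expressed as vectors of word indices.
doc_vectors = [[0, 1, 2, 3], [0, 4, 2, 3], [0, 1, 1, 5]]
weights = tfidf(doc_vectors)
for row in weights:
    print(row)
```

Word 0 occurs in every document, so its weight is zero everywhere, while words 4 and 5 occur in a single document each and receive the largest weights. That is the isolating behaviour described above.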

TF-IDF is commonly used in conjunction with Latent Semantic Indexing (LSI). LSI or other distance-similarity measures could be employed to create new covers. However, in this case we will stick with TF-IDF for now. Gensim, a Python toolkit for topic analysis, was used to compute the TF-IDF weights for this document stream.

This method extends the Gensim method to produce the vector set as a weighted distribution. An excerpt of the resulting vector weights for the Wikipedia article summary for 'climate change' follows.

.....(195, 0.008339643109856907),

(5, -0.12934598825026603),

(183, -0.014635135465991572),

(184, -0.03335857243942763),

(53, -0.06756211046693786),

(196, 0.008339643109856907).....

At this point we can set a threshold value for mapping relations into our final cover set. For simplicity, we can set xᵢ = 1 if the weighted score for wᵢ > 0, and xᵢ = 0 otherwise. This will produce the incidence matrix between the lenses and the set M: an m × n matrix of 0/1 integers.
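The thresholding rule can be sketched as follows. This is a minimal sketch, assuming each document's weights are given as (word index, score) pairs; the function name incidence_matrix and the scores in the example are made up for illustration and are not taken from the results above.

```python
def incidence_matrix(weighted_docs, vocab_size):
    """Build a 0/1 matrix with one row per document and one column per word."""
    matrix = []
    for weights in weighted_docs:
        row = [0] * vocab_size
        for idx, score in weights:
            # x_i = 1 if the weighted score for w_i is above the threshold 0.
            if score > 0:
                row[idx] = 1
        matrix.append(row)
    return matrix


# Illustrative (word index, score) pairs for two documents.
weighted_docs = [
    [(0, 0.12), (1, -0.03), (3, 0.008)],
    [(1, 0.07), (2, -0.13)],
]
matrix = incidence_matrix(weighted_docs, vocab_size=4)
for row in matrix:
    print(row)
```

Words that never appear in a document, or whose score is zero or negative, get a 0 in that document's row, so each row can be read directly as the vertex set of one simplex.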

I will go into computing on these sets in the next few posts. The goal is to see if we can come up with a topological description of meaning in text.
