
Notes on defining a language model

Wikipedia defines a "language model" as a "probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability P(w_1, ..., w_m) to the whole sequence." The Stanford NLP Group implies the same definition in its description of language modeling in the context of information retrieval:


\begin{displaymath}
P(t_1t_2t_3t_4) = P(t_1)P(t_2\vert t_1)P(t_3\vert t_1t_2)P(t_4\vert t_1t_2t_3)
\end{displaymath}

The equation above is an instance of the chain rule, which extends the product rule for two events:

\begin{displaymath}
P(A, B) = P(A \cap B) = P(A\vert B)P(B) = P(B\vert A)P(A)
\end{displaymath}

See the chain-rule definition in the NLP Review of Basic Probability Theory.
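To make the factorization concrete, here is a minimal sketch that estimates each conditional term by counting sentence prefixes in a tiny corpus and multiplies the terms together. The corpus, the function name `chain_rule_probability`, and the counting estimator are my own illustrative assumptions, not something taken from the sources cited above.

```python
from collections import Counter


def chain_rule_probability(sequence, corpus):
    """Estimate P(t_1 ... t_n) as the product of P(t_k | t_1 ... t_{k-1}).

    Each conditional is estimated as
    count(prefix t_1..t_k) / count(prefix t_1..t_{k-1}),
    where counts are taken over the beginnings of the corpus sentences.
    """
    prefix_counts = Counter()
    for sentence in corpus:
        # Count every prefix of the sentence, including the empty one.
        for k in range(len(sentence) + 1):
            prefix_counts[tuple(sentence[:k])] += 1

    probability = 1.0
    for k in range(1, len(sequence) + 1):
        history = tuple(sequence[:k - 1])
        extended = tuple(sequence[:k])
        if prefix_counts[history] == 0:
            # The history was never observed, so the estimate is zero.
            return 0.0
        probability *= prefix_counts[extended] / prefix_counts[history]
    return probability


# Hypothetical toy corpus, invented purely for illustration.
toy_corpus = [
    ["stocks", "fell", "on", "news"],
    ["stocks", "fell", "sharply"],
    ["stocks", "rose", "on", "news"],
    ["bonds", "rose", "on", "news"],
]

p = chain_rule_probability(["stocks", "fell", "on", "news"], toy_corpus)
print(p)
```

Here the four factors are 3/4, 2/3, 1/2 and 1/1. With unsmoothed counts like these, any sequence containing an unseen continuation gets probability zero, which is why practical models shorten the history (n-grams) or smooth the estimates.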

Generating a probability distribution is only one part of building a usable language-processing infrastructure. What makes a statistical language model useful depends on the specific problem you want to solve and, of course, on the domain of that problem. The ability to cluster and partition sequences of words by how likely they are to occur given a query can therefore serve as a starting point for connecting probability distributions over linguistic sequences to other semantic information (e.g. equity prices, weather data, abstract concepts). Indeed, I am especially keen to connect linguistic data to other semantically relevant information to aid decision support.
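One simple way to score text by its likely occurrence given a query is the query-likelihood idea from information retrieval: build a language model per document and ask how probable the query is under each. The sketch below assumes a unigram model with add-one smoothing; the function names, the smoothing choice, and the two toy documents are hypothetical and only meant to illustrate the idea.

```python
import math
from collections import Counter


def query_log_likelihood(query, document, vocabulary_size, alpha=1.0):
    """Log P(query | document) under a unigram model with additive smoothing."""
    counts = Counter(document)
    total = len(document)
    score = 0.0
    for term in query:
        # Smoothed estimate of P(term | document); alpha=1.0 is add-one smoothing.
        score += math.log((counts[term] + alpha) / (total + alpha * vocabulary_size))
    return score


def rank_documents(query, documents):
    """Return document names ordered from most to least likely to generate the query."""
    vocabulary = {term for tokens in documents.values() for term in tokens} | set(query)
    scores = {
        name: query_log_likelihood(query, tokens, len(vocabulary))
        for name, tokens in documents.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy documents, invented for illustration.
documents = {
    "earnings": "company reports strong earnings and stock price rises".split(),
    "weather": "heavy rain and storms expected across the coast".split(),
}

ranking = rank_documents(["stock", "price"], documents)
print(ranking)
```

The ranking only partitions documents by query likelihood. Linking the top-ranked documents to external data such as prices would be a separate step.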

One example is linking linguistic data to price trends in the equity or commodities markets. Since we know that news events can influence the price of a given stock (see: News and Stock Prices), we can use a probabilistic language model both to remove unnecessary information from a text corpus and to connect the retained features into a semantic representation of the domain.

There are a number of ways one might go about developing a language model and a semantic representation of the relationships between language and market behavior. First, we might apply a topic-modeling approach, using Latent Dirichlet Allocation (LDA) to classify or rank a given document by a set of topics (see the Gensim module for applying LDA). If, for example, I were trying to define a set of semantically similar events (communicated in a given text) and connect them with market-sector price trends, I could deploy a neural network to learn the relationships and generate new probability distributions based on some user-defined query.

Personally, I think there is more to learn about applying language models and other semantically connected structures to create meaning from text. In particular, I am interested in exploring the potential of topological data analysis for automating the clustering of connections between linguistic information and semantically related stock-price information.

And I will just blurt it out now: I think there is potential value in creating "topological language models" that can be used to automate the formation of linguistic and semantic structures whose meaning derives from interaction with user queries or with targets that require monitoring. Specifically, I think there is an opportunity to explore the use of q-analysis and other methods inspired by algebraic topology. But I will discuss this in more detail later.




