On the Probabilistic Latent Semantic Analysis Generalization as the Singular Value Decomposition Probabilistic Image
- DOI
- 10.2991/jsta.d.200605.001
- Keywords
- Singular value decomposition; Probabilistic latent semantic analysis; Nonnegative matrix factorization; Kullback–Leibler divergence
- Abstract
The Probabilistic Latent Semantic Analysis has been related to the Singular Value Decomposition. Several problems arise when this comparison is made. Data class restrictions and the existence of several local optima mask the relation, reducing it to a formal analogy without real significance. Moreover, the computational difficulty in terms of time and memory limits the applicability of the technique. In this work, we use the Nonnegative Matrix Factorization with the Kullback–Leibler divergence to prove that, when the number of model components is large enough and a limit condition is reached, the Singular Value Decomposition and the Probabilistic Latent Semantic Analysis empirical distributions are arbitrarily close. Under such conditions, equality between the Nonnegative Matrix Factorization and the Probabilistic Latent Semantic Analysis is obtained. With this result, the Singular Value Decomposition of every matrix with nonnegative entries converges to the general case Probabilistic Latent Semantic Analysis results and constitutes its unique probabilistic image. Moreover, a faster algorithm for the Probabilistic Latent Semantic Analysis is provided.
- Copyright
- © 2020 The Authors. Published by Atlantis Press SARL.
- Open Access
- This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).
1. INTRODUCTION
The Probabilistic Latent Semantic Analysis (PLSA) has formal similarities with the Singular Value Decomposition (SVD). The obtained probabilistic factorization admits a matrix representation which can be assimilated to that of the SVD. In fact, the original formulation of Hofmann [1] is a probabilistic remake of Latent Semantic Analysis (LSA), which is a direct SVD application introduced by Deerwester et al. [2]. Despite the similarities, fundamental differences exist, since every real-valued matrix admits an SVD, whereas the PLSA only decomposes count data structures (and consequently contingency tables). In fact, the PLSA is derived from probability theory, with hard restrictions. The idea of the PLSA is to factorize the co-occurrence table as a mixture over latent classes.
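For reference, the factorization in Hofmann's standard symmetric parameterization can be written as below; the notation is illustrative and is not copied from the article's numbered equations.

```latex
% Standard symmetric PLSA mixture (Hofmann); notation is illustrative.
P(d, w) \;=\; \sum_{z \in \mathcal{Z}} P(z)\, P(d \mid z)\, P(w \mid z)
```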
The work of Hofmann received strong criticism after the main ideas were published in [1]. One of the most important objections is the paper of Blei et al. [3], who pointed out several issues. The most relevant for the purposes of this manuscript concerns the number of parameters of the PLSA model, which grows linearly as new data (documents) are added, and the inherited EM algorithm, which has serious computational problems in terms of memory and execution time. Despite these problems, the PLSA is a very effective tool for information retrieval, text and image classification, bioinformatics, and unsupervised machine learning, and it constitutes the starting point of ideas such as probabilistic clustering. The number of works using this technique has grown as computing power has increased.
A technique closely related to the PLSA is the Nonnegative Matrix Factorization (NMF). The NMF, historically attributed to Chen [4] and extended by Paatero and Tapper [5], can be used to solve problems similar to those of the PLSA in a less restrictive framework. The equivalence between the NMF and the PLSA has been discussed by Gaussier and Goutte [6] and Ding et al. [7], among others, since every matrix with nonnegative entries has a probabilistic interpretation under suitable normalization conditions, as is shown in [7]. An advantage of the NMF is its ability to handle more general data structures than the discrete ones to which the PLSA is restricted. However, the PLSA solves the problem of maximizing the log-likelihood, while the NMF tries to find two nonnegative matrices whose product approximates the data.
While PLSA practitioners compete over algorithms to achieve good results reasonably fast, not many theoretical works have been published. Some exceptions are the contributions of Brants [8], who uses the model for inferential purposes; Chaudhuri (2012), who relates the diagonal matrix to the Mahalanobis distance [9]; and Devarajan et al. [10], who discuss several divergences, relating them to probability distributions.
The NMF creates a scenario in which it is possible to derive equations similar to those obtained by Hofmann, for a wider data class, within a less restrictive framework. Deriving the equations from the NMF with the Kullback–Leibler (KL) divergence, ensuring the minimal number of components which guarantees monotonicity as shown by Latecki et al. [11], and extracting the diagonal matrix as the vector of inertia values, we call the formulas obtained in this way the general case PLSA equations.
To avoid an ill-posed problem or the undesirable effect of local maxima, the iterative procedure in the generalized PLSA factorization is carried out until the empirical distribution of the obtained matrix product is the same as that of the original data matrix, both under suitable normalization conditions. Under this assumption it is easy to prove that, for any matrix with nonnegative entries, the singular value decomposition of the Gram–Schmidt orthonormalization (referred to in the form given in [12, p. 24]) is the same as that of the Gram–Schmidt orthonormalization of the generalized PLSA matrix product. At this point it is possible to establish the conditions which provide equal results, once the empirical distributions coincide. Then, the general case PLSA formulas also solve the PLSA (and the classical PLSA) problem.
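As a rough numerical illustration of this property (a sketch, not code from the article), the following compares the column-wise Gram–Schmidt orthonormalizations, computed here through a thin QR factorization, of the normalized data matrix and of the converged factor product; at the limit described above the two coincide, and so do their SVDs. The function and variable names are assumptions made for the example.

```python
# Illustrative check only: assumes X is a nonnegative data matrix and W, H are the
# converged general case PLSA (KL-NMF) factors whose product matches the empirical
# distribution of X. The Gram-Schmidt orthonormalization is taken as the Q factor
# of a thin QR decomposition.
import numpy as np

def orthonormalization_gap(X, W, H):
    P_data = X / X.sum()                   # empirical distribution of the data
    P_model = (W @ H) / (W @ H).sum()      # empirical distribution of the product
    Q_data, _ = np.linalg.qr(P_data)
    Q_model, _ = np.linalg.qr(P_model)
    return np.linalg.norm(Q_data - Q_model)   # ~0 at the convergence limit
```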
The next section is an overview of the main ideas, such as the SVD theorem and the LSA, which constitute the basis for introducing the PLSA as a probabilistic remake. The most fundamental ideas of the NMF and the PLSA formulas are introduced in Section 2, and Section 3 is devoted to deriving the general case PLSA equations. Section 4 is devoted to the PLSA and SVD relation, presented in the simplest way we can. Section 5 offers an example of the properties, with a reduced data set. Although the example is driven with a matrix of small dimensions, up to
2. BACKGROUND
2.1. The Singular Value Decomposition
The well-known SVD is a generalization of the Spectral Decomposition Theorem of elementary linear algebra. The SVD states that every real matrix can be factorized as the product of a matrix with orthonormal columns, a diagonal matrix of nonnegative singular values, and the transpose of another matrix with orthonormal columns.
If the rank of
The theoretical and practical consequences of the SVD are vast. It is the main idea behind many of the multivariate methods and provides a geometrical interpretative basis for them. Moreover, the SVD does not involve many calculations and is implemented in several of the existing programming languages [13, p. 309]. The flexibility and aptitude of the SVD for data analysis is highlighted by the LSA, which can be considered a special interpretation case of the SVD.
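For instance, a minimal NumPy sketch of the full and truncated decompositions (an illustration, not the article's code; the matrix and the rank are arbitrary choices) is:

```python
# Minimal SVD sketch with NumPy: exact reconstruction and a rank-k approximation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 5))                     # any real-valued matrix admits an SVD

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_full = U @ np.diag(s) @ Vt               # exact reconstruction (up to rounding)

k = 2                                      # truncated (low-rank) reconstruction, as in LSA
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.allclose(X, X_full))              # True
print(np.linalg.norm(X - X_k))             # residual of the rank-k approximation
```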
In the LSA problem, the matrix collects the term–document frequencies, and its truncated SVD provides a reduced latent semantic space in which documents and terms are compared.
2.2. The PLSA Model
The PLSA model is a probabilistic remake of the LSA, obtained from the table of observed co-occurrence frequencies.
The expression Eq. (2) can be written as
The PLSA estimates the probabilities of Eq. (3) by maximizing the log-likelihood with the EM algorithm. For the symmetric formulation, it is
Hofmann maximizes the expression in Eq. (4) by computing the expectation, which is the posterior (E-step)
By alternating between Eq. (5) and the group of Eqs. (6–8) until a stopping condition is achieved, the model is adjusted.
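A minimal sketch of this alternation for the symmetric formulation, assuming the standard Hofmann-style E and M updates (array names, initialization, and the fixed iteration count are illustrative choices, not taken literally from Eqs. (4)–(8)), is the following; the explicit posterior tensor makes visible the memory cost discussed in the Introduction.

```python
# Hedged sketch of the symmetric PLSA fitted by EM on a document-word count matrix N (D x W).
import numpy as np

def plsa_em(N, K, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    D, W = N.shape
    Pz = np.full(K, 1.0 / K)                          # P(z)
    Pd_z = rng.random((D, K))
    Pd_z /= Pd_z.sum(axis=0)                          # P(d|z), columns sum to 1
    Pw_z = rng.random((W, K))
    Pw_z /= Pw_z.sum(axis=0)                          # P(w|z), columns sum to 1
    for _ in range(n_iter):
        # E-step: posterior P(z|d,w) for every (d, w) pair, shape (D, W, K)
        joint = Pz[None, None, :] * Pd_z[:, None, :] * Pw_z[None, :, :]
        post = joint / joint.sum(axis=2, keepdims=True)
        # M-step: reweight the posterior by the observed counts and renormalize
        weighted = N[:, :, None] * post
        Pd_z = weighted.sum(axis=1)
        Pd_z /= Pd_z.sum(axis=0, keepdims=True)
        Pw_z = weighted.sum(axis=0)
        Pw_z /= Pw_z.sum(axis=0, keepdims=True)
        Pz = weighted.sum(axis=(0, 1))
        Pz /= Pz.sum()
    return Pz, Pd_z, Pw_z
```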
The analogy between Eqs. (6–8) and the decomposition given by Eq. (1) becomes apparent when it is written as [1]
2.3. The NMF Point of View
A problem closely related to the PLSA is the NMF. It is usually stated as the decomposition of a nonnegative matrix into the product of two nonnegative matrices with a smaller inner dimension.
The problem of the NMF for a matrix
The equivalence between the PLSA and NMF problems needs a diagonal matrix. Gaussier and Goutte [6] show that introducing diagonal matrices
3. GENERAL CASE EQUATIONS
Formulas equivalent to Eqs. (14) and (15) can be derived from the KL divergence. For
Alternating between Eqs. (18) and (19) after initialization, until a stopping condition is achieved, and assuming convergence of the iterative process, the obtained matrix product of
The pairs of Eqs. (18, 19) and (14, 15) are similar, but not equivalent, since the derivation context is different [16, Chap. 3]. On the other hand, the classical definition of the KL divergence makes evident the similarity of the log-likelihood with the second term of Eq. (17), while the first term is a constant. This shows the equivalence between KL divergence minimization and the EM algorithm. Moreover, the use of the KL divergence has good theoretical properties within a simple framework.
Formal similarity between Eqs. (18) and (19) and the classical PLSA given by Eqs. (6–8) requires a diagonal matrix, which plays the role of Eq. (7).
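A hedged sketch of this construction, assuming the standard multiplicative updates for the generalized KL objective together with a normalization that exposes a diagonal factor (in the spirit of Eqs. (18)–(21), but not claimed to reproduce them literally), is:

```python
# Illustrative KL-divergence NMF with multiplicative updates, then a rescaling of the
# factors into column-stochastic * diagonal * row-stochastic form.
import numpy as np

def kl_nmf(V, k, n_iter=500, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1, keepdims=True).T + eps)
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0, keepdims=True).T + eps)
    return W, H

def probabilistic_form(W, H, total):
    """Rescale W @ H into column-stochastic, diagonal, and row-stochastic factors."""
    a = W.sum(axis=0)             # column masses of W
    b = H.sum(axis=1)             # row masses of H
    Wn = W / a                    # columns sum to 1
    Hn = H / b[:, None]           # rows sum to 1
    D = np.diag(a * b / total)    # diagonal factor carrying the mixing weights
    return Wn, D, Hn
```

For a data matrix `V`, calling `Wn, D, Hn = probabilistic_form(*kl_nmf(V, k), V.sum())` yields a column-stochastic `Wn`, a row-stochastic `Hn`, and a diagonal `D` whose entries sum approximately to one once the product reproduces the empirical distribution of `V`.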
Since the centered matrix of
Simple algebra manipulations
While the SVD dimension reduction problem is bounded by the number of nonzero eigenvalues, the number of model components is a controversial point within the PLSA, which inherits the problems of the EM algorithm. There are no restrictions on the number of latent variables (the number of nonzero entries in
Each one of the
Only the largest values of
Taking into account that the empirical density of
Proposition 3.1.
The number of model components which ensures that the empirical distribution of
To refer to the obtained formulas in a more concise way, we introduce the following definition:
Definition 3.1
For
4. PROPERTIES
Without establishing the connection between them, the general case PLSA Eqs. (21–23) bear only a formal similarity to the SVD. The underlying relation can be seen by taking into account the convergence of the iterative process. Then, if
Proposition 4.1.
The SVD of the Gram–Schmidt orthonormalization (orthogonalization and column normalization) of
Proof.
After orthogonalizing and normalizing
A direct consequence, even though the data are rarely orthogonal, is
Proposition 4.2.
If the columns of
A practical way to ensure the empirical distribution equality
Definition 4.1
The
The degree of adjustment between the characters matrices can be measured with the
Proposition 4.3.
The condition
In practical conditions the zero bound is difficult to achieve, since the matrix
Proposition 4.4.
The matrix
Proof.
If
In this case the maximum is achieved, since from a numerical point of view, when the condition
A construction similar to that given by Propositions 4.1–4.4 can be built for the classical PLSA Eqs. (6–8) when they are written in matrix form, also reaching the maximum. In this case the relation between the classical PLSA model and the general case PLSA equations relies on the meaning of the diagonal matrices given by Eqs. (6) and (21), respectively. In this case the relation between
Proposition 4.5.
The expectation of
Proof.
Since
Introducing as a definition the steps that ensure the equality between decompositions:
Definition 4.2
The probabilistic SVD image is obtained when the general case PLSA equations reach the limit at which the empirical distributions are the same, except for
Compiling the previous properties into a compact result, it can be stated:
Theorem 4.1.
The probabilistic SVD image and the classical and general case PLSA matrix factorizations are equal when
If statistical inference is to be done, it is only necessary to normalize suitable columns or rows. We do not go further on this point in this work.
5. EXAMPLE
This section offers a very simple example. It analyzes the effect of the number of components in the general case PLSA equations on the convergence limit reached, and compares it with the classical PLSA. For this purpose, the
Since the number of iterations necessary to reach the maximum grows linearly with the data matrix dimension and the number of components, the small decathlon data set is used to drive the example. It is included in several R packages, with some differences among them; the version selected here is the one included in the FactoMineR package [17]. The data are the ranks of elite athletes who participated in the Athens 2004 Olympic Games men's decathlon competition. Additional reductions are made in the data by selecting
The data can be written as a nonnegative real valued
One must be careful when ordering the results if significance is to be attached to them. There are two categories: more is better, which corresponds to the distance or length achieved in events like jumping or throwing, and less is better, which is the case for the time needed to cover a distance by running. This correspondence does not always hold, and the ordering should be done cautiously in all cases if significance is to be provided, but it is not important for computation purposes. The qualitative matrix
To see how the
100 | Long | … | 1500 |
---|---|---|---
LB | MB | … | LB | |
Karpov | Clay | … | Lorenzo | |
Averyanov | Sebrle | … | Smirnov | |
Warners | Karpov | … | Hernu | |
⋮ | ⋮ | ⋮ | ⋮ | |
Schoenbeck | Parkhomenko | … | Korkizoglou | |
Karpov | Clay | … | Lorenzo | |
Clay | Sebrle | … | Gomez | |
Averyanov | Karpov | … | Smirnov | |
⋮ | ⋮ | ⋮ | ⋮ | |
Casarsa | Casarsa | … | Korkizoglou |
Note: Characters matrix
Obtained qualitative matrices of characters.
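As an illustration of how such a matrix of characters can be built, the following is a sketch under assumptions: the decathlon results are available as a pandas DataFrame indexed by athlete name, and the listed running events (whose names are illustrative) are those where a smaller mark is better.

```python
# Hedged sketch: building a "characters" (qualitative) matrix of athlete names,
# one column per event, ordered from best to worst performance.
import pandas as pd

RUNNING_EVENTS = ["100m", "400m", "110m.hurdle", "1500m"]   # lower is better (times)

def characters_matrix(decathlon: pd.DataFrame) -> pd.DataFrame:
    cols = {}
    for event in decathlon.columns:
        ascending = event in RUNNING_EVENTS          # times: smallest value ranks first
        ranked = decathlon[event].sort_values(ascending=ascending)
        cols[event] = list(ranked.index)             # athlete names, best to worst
    return pd.DataFrame(cols)
```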
The comparison between
The objective is to adjust the
From a purely computational point of view, the obtained results are compared with the classical PLSA, which offers results similar to those shown in Figure 1. In this case the PLSA is improperly run, since the data are not multinomially distributed, but the purpose is purely computational. The results are the same, and the differences are due to the random initialization conditions. In Figure 1 it can also be seen how the fit oscillates close to the maximum, and a few more iterations stabilize the result. At this point the numerical approximation error is small and allows the original empirical distribution to be reconstructed; if the sums are saved, the original data matrix can be reconstructed as well (except for the repeated values and the labels of the data matrix).
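A minimal sketch of the reconstruction check described here (names and tolerance are illustrative): the empirical distributions of the data and of the factor product are compared entrywise, and the data matrix is recovered from the product when the total sum has been saved.

```python
# Illustration only: X is the data matrix, W and H the fitted factors.
import numpy as np

def empirical_distributions_match(X, W, H, tol=1e-6):
    P_data = X / X.sum()                   # empirical distribution of the data
    P_model = (W @ H) / (W @ H).sum()      # empirical distribution of the product
    return np.max(np.abs(P_data - P_model)) < tol

def reconstruct_data(W, H, total):
    """Recover the data matrix from the fitted product when the total sum was saved."""
    WH = W @ H
    return WH / WH.sum() * total
```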
For a large enough number of iterations, it can be seen how the diagonal matrix obtained by formula (21), which is the estimate of
A not minor consequence of the general case PLSA formulas is the increase in computational speed, as can be seen in Table 2. A well-known problem for PLSA practitioners is the complexity in terms of time and memory. The general case PLSA equations reduce this drawback to that of an NMF. The convergence of the EM algorithm is slow, and so is that of the KL-based procedure, as a consequence of their intimate connection; however, avoiding the estimation of the object of Eq. (5) simplifies the operations. This has important consequences, since the final results require less time. When only practical PLSA results are of interest, a trick is to run Eqs. (18) and (19) with a large number of model components, since the interest relies only in the matrices
| Time Cycle (ms) | Total Memory (MB)
---|---|---
Classical PLSA | 14,379 | 132.0 |
Generalized PLSA | 1,350 | 106.6 |
Note: Computations done with an Intel 2.30 GHz processor and 8.00 GB RAM.
Speed comparison between the classical and generalized Probabilistic Latent Semantic Analysis (PLSA) algorithms (time for
6. DISCUSSION
The iterative process to obtain the generalized PLSA matrices minimizes the KL divergence between the object and the image matrix. The entries of
Another point to study in more depth is the local basis rotations from a statistical point of view, to ensure the interchangeability of results between both points of view. In the Euclidean space case, this fact has been studied in depth and the consequences are irrelevant, but it is not so clear when talking about probability distributions.
The relation between the general case PLSA and the SVD has been established for the full rank case. The low rank case is used in connection with the problem of dimension reduction. These concepts should also be related for the general case PLSA, to establish the limits and advantages of this similarity, and to reveal more clearly the achieved convergence limit.
From a purely practical point of view, it is necessary to increase the computational and convergence speeds of the algorithm. The faster iterative procedure with the I-divergence is the basis of many of the current algorithms, but it cannot be put in relation with the KL divergence. Necessarily, this leads to establishing the relationship between the KL divergence (or the EM algorithm) and other types of divergences. Therefore, a broader conceptual basis is necessary in this field.
7. CONCLUSIONS
From the obtained results, three conclusions can be established. First, the probabilistic SVD image is asymptotically unique and allows inferential sense to be given to descriptive statistical techniques based on the SVD. Second, an algebraic consequence is the total equivalence between the classical PLSA and the general PLSA model (when certain conditions are satisfied). This shows that the statistical problem, under suitable conditions, is an algebraic one, resting on a reliable conceptual basis. Finally, the general case PLSA equations make no distributional hypothesis and are solely restricted to nonnegative real-valued entries. This leads to estimating, on a quantitative basis, the optimal number of model components and their significance.
CONFLICT OF INTEREST
The authors declare no conflict of interest regarding this manuscript.
AUTHORS' CONTRIBUTIONS
Figuera Pau: Conceptual development and main work. García Bringas Pablo: Professor García Bringas is my thesis director, and his inestimable contribution consists of fundamental questions and objections, raised with infinite patience, to make the main ideas clear.
ACKNOWLEDGMENTS
We would like to express our gratitude to the editor-in-chief Prof. M. Ahsanullah and the anonymous referees, for their patience in accepting relevant changes after the manuscript was submitted. Without this collaboration, it would have been impossible to publish this manuscript.
Footnotes
Also, it can be decomposed as
REFERENCES
Cite this article
Pau Figuera Vinué, Pablo García Bringas. On the Probabilistic Latent Semantic Analysis Generalization as the Singular Value Decomposition Probabilistic Image. Journal of Statistical Theory and Applications, 19(2), 286–296, 2020. https://doi.org/10.2991/jsta.d.200605.001