
Biclustering

Biclustering, block clustering,[1][2] co-clustering, or two-mode clustering[3][4][5] is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. The term was first introduced by Boris Mirkin[6] to name a technique introduced many years earlier,[6] in 1972, by John A. Hartigan.[7]

Given a set of m samples represented by an n-dimensional feature vector, the entire dataset can be represented as m rows in n columns (i.e., an m × n matrix). A biclustering algorithm generates biclusters. A bicluster is a subset of rows which exhibit similar behavior across a subset of columns, or vice versa.
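Concretely, a bicluster is just the submatrix selected by a subset of row indices and a subset of column indices. A minimal sketch (the matrix and index sets are illustrative, not from any particular dataset):

```python
def bicluster(matrix, rows, cols):
    """Return the submatrix of `matrix` induced by the index sets rows x cols."""
    return [[matrix[i][j] for j in cols] for i in rows]

# Hypothetical 3x4 data matrix; rows 0 and 2 behave identically on columns 0 and 2.
data = [
    [1, 9, 2, 8],
    [7, 6, 5, 4],
    [1, 3, 2, 8],
]
sub = bicluster(data, rows=[0, 2], cols=[0, 2])  # [[1, 2], [1, 2]]
```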

Development

Biclustering was originally introduced by John A. Hartigan in 1972.[7] The term "biclustering" was later used and refined by Boris G. Mirkin. The algorithm was not generalized until 2000, when Y. Cheng and George M. Church proposed a biclustering algorithm based on the mean squared residue (MSR) score and applied it to biological gene expression data.[8]

In 2001 and 2003, I. S. Dhillon published two algorithms applying biclustering to documents and words. One version was based on bipartite spectral graph partitioning.[9] The other was based on information theory: Dhillon assumed the loss of mutual information during biclustering was equal to the Kullback–Leibler (KL) distance between P and Q, where P represents the distribution of documents and feature words before biclustering and Q is the distribution after biclustering. The KL-distance measures the difference between two probability distributions: KL = 0 when the two distributions are the same, and it increases as the difference increases.[10] Thus, the aim of the algorithm was to minimize the KL-distance between P and Q. In 2004, Arindam Banerjee used a weighted Bregman distance instead of the KL-distance to design a biclustering algorithm suitable for any kind of matrix, unlike the KL-distance algorithm.[11]
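The KL-distance that the information-theoretic formulation minimizes can be sketched as follows for discrete distributions (the probability vectors here are toy values, not ones derived from a real document collection):

```python
import math

def kl_distance(p, q):
    """Kullback-Leibler distance D(P || Q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

identical = kl_distance([0.5, 0.5], [0.5, 0.5])  # 0.0 when P equals Q
different = kl_distance([0.9, 0.1], [0.5, 0.5])  # positive, growing as P and Q diverge
```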

To cluster more than two types of objects, in 2005, Bekkerman expanded the mutual information in Dhillon's theorem from a single pair into multiple pairs.[12]

Complexity

The complexity of the biclustering problem depends on the exact problem formulation, and particularly on the merit function used to evaluate the quality of a given bicluster. However, the most interesting variants of this problem are NP-complete. Two cases illustrate this. In the simple case, where every element a(i,j) of the binary matrix A is either 0 or 1, a bicluster corresponds to a biclique in the associated bipartite graph, and finding the maximum-size bicluster is equivalent to finding the maximum edge biclique, which is NP-complete. In the complex case, the entries of matrix A are used to compute the quality of a given bicluster and to solve more restricted versions of the problem.[13] This requires either large computational effort or the use of lossy heuristics to short-circuit the calculation.[14]
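The binary case can be made concrete: a candidate bicluster of a 0/1 matrix is a biclique of the corresponding bipartite graph exactly when every selected entry is 1. A small sketch (the matrix is illustrative):

```python
def is_biclique(A, rows, cols):
    """A bicluster (rows x cols) of a binary matrix A is a biclique of the
    associated bipartite graph iff all of its entries are 1."""
    return all(A[i][j] == 1 for i in rows for j in cols)

A = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
# rows {0,1} x cols {0,1} is all ones, hence a biclique; rows {0,2} x cols {0,2} is not.
```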

Types of Biclusters

Bicluster with constant values (a)

When a biclustering algorithm tries to find a constant-value bicluster, it reorders the rows and columns of the matrix to group together similar rows and columns, eventually grouping biclusters with similar values. This method is sufficient when the data is normalized. A perfect constant bicluster is a matrix (I,J) in which all values a(i,j) are equal to a given constant μ. In real data, these entries a(i,j) may be represented in the form n(i,j) + μ, where n(i,j) denotes the noise. In Hartigan's algorithm, the original data matrix is split into a set of biclusters, and variance is used to evaluate constant biclusters; hence, a perfect bicluster may be equivalently defined as a matrix with a variance of zero. To prevent the partitioning of the data matrix into biclusters of only one row and one column, Hartigan assumes that there are, for example, K biclusters within the data matrix. When the data matrix is partitioned into K biclusters, the algorithm ends.
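The zero-variance characterization of a perfect constant bicluster is easy to check numerically (the submatrices below are toy examples):

```python
def variance(sub):
    """Population variance of all entries of a submatrix."""
    values = [v for row in sub for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

perfect = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]  # constant bicluster: variance 0
noisy = [[2.0, 2.1, 1.9], [2.0, 2.2, 1.8]]    # noise n(i,j) makes variance > 0
```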

Bicluster with constant values on rows (b) or columns (c)

Unlike constant-value biclusters, these types of biclusters cannot be evaluated solely based on the variance of their values. To complete the identification, the columns and the rows should be normalized first. There are, however, other algorithms that can find such biclusters without a normalization step, using different approaches.
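One simple normalization, sketched below, is to subtract each row's mean (and analogously each column's mean), which reduces a constant-rows (or constant-columns) bicluster to the constant case, where the variance criterion applies again:

```python
def center_rows(sub):
    """Subtract each row's mean; a constant-rows bicluster becomes all zeros."""
    return [[v - sum(row) / len(row) for v in row] for row in sub]

rows_bicluster = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]
centered = center_rows(rows_bicluster)  # every entry becomes 0.0
```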

Bicluster with coherent values (d, e)

For biclusters with coherent values on both rows and columns, an overall improvement over the algorithms for biclusters with constant values on rows or on columns should be considered. Such an algorithm may include analysis of variance between groups, using co-variance between both rows and columns. In Cheng and Church's algorithm, a bicluster is defined as a subset of rows and columns with almost the same score; this similarity score is used to measure the coherence of rows and columns.


a) Bicluster with constant values
2.0 2.0 2.0 2.0 2.0
2.0 2.0 2.0 2.0 2.0
2.0 2.0 2.0 2.0 2.0
2.0 2.0 2.0 2.0 2.0
2.0 2.0 2.0 2.0 2.0
b) Bicluster with constant values on rows
1.0 1.0 1.0 1.0 1.0
2.0 2.0 2.0 2.0 2.0
3.0 3.0 3.0 3.0 3.0
4.0 4.0 4.0 4.0 4.0
5.0 5.0 5.0 5.0 5.0
c) Bicluster with constant values on columns
1.0 2.0 3.0 4.0 5.0
1.0 2.0 3.0 4.0 5.0
1.0 2.0 3.0 4.0 5.0
1.0 2.0 3.0 4.0 5.0
1.0 2.0 3.0 4.0 5.0
d) Bicluster with coherent values (additive)
1.0 4.0 5.0 0.0 1.5
4.0 7.0 8.0 3.0 4.5
3.0 6.0 7.0 2.0 3.5
5.0 8.0 9.0 4.0 5.5
2.0 5.0 6.0 1.0 2.5
e) Bicluster with coherent values (multiplicative)
1.0 0.5 2.0 0.2 0.8
2.0 1.0 4.0 0.4 1.6
3.0 1.5 6.0 0.6 2.4
4.0 2.0 8.0 0.8 3.2
5.0 2.5 10.0 1.0 4.0
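The coherence of the additive example (d) can be verified with Cheng and Church's mean squared residue (MSR) score:[8] a perfectly additive bicluster, where a(i,j) = μ + r(i) + c(j), has an MSR of zero. A sketch:

```python
def mean_squared_residue(sub):
    """Cheng and Church's MSR: mean of (a_ij - rowmean_i - colmean_j + overall)^2."""
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    return sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    ) / (n_rows * n_cols)

# Additive bicluster (d) from the examples above: each row is a constant
# offset of the first, so the residue vanishes (up to floating-point error).
additive = [
    [1.0, 4.0, 5.0, 0.0, 1.5],
    [4.0, 7.0, 8.0, 3.0, 4.5],
    [3.0, 6.0, 7.0, 2.0, 3.5],
    [5.0, 8.0, 9.0, 4.0, 5.5],
    [2.0, 5.0, 6.0, 1.0, 2.5],
]
```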


The relationship between these cluster models and other types of clustering, such as correlation clustering, is discussed in the survey by Kriegel et al.[15]

Algorithms

There are many biclustering algorithms developed for bioinformatics, including: block clustering, CTWC (Coupled Two-Way Clustering), ITWC (Interrelated Two-Way Clustering), δ-bicluster, δ-pCluster, δ-pattern, FLOC, OPC, Plaid Model, OPSMs (Order-preserving submatrices), Gibbs, SAMBA (Statistical-Algorithmic Method for Bicluster Analysis),[16] Robust Biclustering Algorithm (RoBA), Crossing Minimization,[17] cMonkey,[18] PRMs, DCC, LEB (Localize and Extract Biclusters), QUBIC (QUalitative BIClustering), BCCA (Bi-Correlation Clustering Algorithm), BIMAX, ISA and FABIA (Factor analysis for Bicluster Acquisition),[19] runibic,[20] and the recently proposed hybrid method EBIC (evolutionary-based biclustering),[21] which was shown to detect multiple patterns with very high accuracy. More recently, IMMD-CC[22] was proposed, developed on the iterative complexity reduction concept. IMMD-CC is able to identify co-cluster centroids from the highly sparse transformation obtained by iterative multi-mode discretization.

Biclustering algorithms have also been proposed and used in other application fields under the names co-clustering, bi-dimensional clustering, and subspace clustering.[14]

Given the known importance of discovering local patterns in time-series data, recent proposals have addressed the biclustering problem in the specific case of time-series gene expression data. In this case, the interesting biclusters can be restricted to those with contiguous columns. This restriction leads to a tractable problem and enables the development of efficient exhaustive enumeration algorithms such as CCC-Biclustering[23] and e-CCC-Biclustering.[24] The approximate patterns in these algorithms allow a given number of errors per gene, relative to an expression profile representing the expression pattern in the bicluster. The e-CCC-Biclustering algorithm uses approximate expressions to find and report all maximal CCC-biclusters in a discretized matrix A using efficient string processing techniques.

These algorithms find and report all maximal biclusters with coherent and contiguous columns and perfect (respectively, approximate) expression patterns in time linear (respectively, polynomial) in the size of the time-series gene expression matrix, by manipulating a discretized version of the original expression matrix with efficient string processing techniques based on suffix trees. They have also been applied to related problems, along with an analysis of their computational complexity.

Some recent algorithms, including cMonkey, have attempted to include additional support for biclustering rectangular matrices in the form of other datatypes.

There is an ongoing debate about how to judge the results of these methods, as biclustering allows overlap between clusters and some algorithms allow the exclusion of hard-to-reconcile columns/conditions. Not all of the available algorithms are deterministic, and the analyst must pay attention to the degree to which results represent stable minima. Because this is an unsupervised classification problem, the lack of a gold standard makes it difficult to spot errors in the results. One approach is to utilize multiple biclustering algorithms, with majority or super-majority voting among them deciding the best result. Another way is to analyze the quality of shifting and scaling patterns in biclusters.[25] Biclustering has been used in the domain of text mining (or classification), where it is popularly known as co-clustering.[26] Text corpora are represented in vectorial form as a matrix D whose rows denote the documents and whose columns denote the words in the dictionary. Matrix elements Dij denote the occurrence of word j in document i. Co-clustering algorithms are then applied to discover blocks in D that correspond to a group of documents (rows) characterized by a group of words (columns).
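Building the document-by-word matrix D is straightforward; the block structure that co-clustering looks for is then visible as groups of rows sharing the same non-zero columns. A toy sketch (the two-topic corpus is hypothetical):

```python
docs = [
    "cat dog cat",         # pet-themed documents
    "dog cat",
    "stock market stock",  # finance-themed documents
    "market stock",
]
vocab = sorted({w for doc in docs for w in doc.split()})
# D[i][j] counts occurrences of word j in document i.
D = [[doc.split().count(w) for w in vocab] for doc in docs]
# The first two rows are non-zero only on {cat, dog}, the last two only on
# {market, stock}: two document-word blocks a co-clustering should recover.
```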

Text clustering can address the high-dimensional sparseness problem by clustering texts and words at the same time. When clustering texts, one needs to consider not only the information of the words themselves, but also the information of the word clusters they compose; the feature words are then clustered according to their similarity across the texts. This is called co-clustering. Co-clustering has two advantages. First, clustering the texts based on word clusters greatly reduces the dimensionality of the clustering and makes it more appropriate to measure the distance between texts. Second, it mines more useful information, yielding corresponding information in the text clusters and word clusters. This corresponding information can be used to describe the types of texts and words, and the results of word clustering can also be used for text mining and information retrieval.

Several approaches have been proposed based on the information content of the resulting blocks: matrix-based approaches such as SVD and BVD, and graph-based approaches. Information-theoretic algorithms iteratively assign each row to a cluster of documents and each column to a cluster of words such that the mutual information is maximized. Matrix-based methods focus on decomposing the matrix into blocks such that the error between the original matrix and the matrices regenerated from the decomposition is minimized. Graph-based methods tend to minimize the cuts between the clusters. Given two groups of documents d1 and d2, the number of cuts can be measured as the number of words that occur in documents of both groups d1 and d2.
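The cut measure between two document groups, as described above, reduces to counting the shared vocabulary. A minimal sketch (the document word sets are illustrative):

```python
def cut(group1, group2):
    """Number of words occurring in documents of both groups."""
    words1 = set().union(*group1)
    words2 = set().union(*group2)
    return len(words1 & words2)

d1 = [{"cat", "dog"}, {"dog", "fur"}]
d2 = [{"stock", "market"}, {"dog", "stock"}]
shared = cut(d1, d2)  # only "dog" appears in both groups, so the cut is 1
```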

More recently, Bisson and Hussain[26] proposed a new approach that uses the similarity between words and the similarity between documents to co-cluster the matrix. Their method (known as χ-Sim, for cross similarity) is based on finding document-document similarity and word-word similarity, and then using classical clustering methods such as hierarchical clustering. Instead of explicitly clustering rows and columns alternately, they consider higher-order occurrences of words, inherently taking into account the documents in which they occur. Thus, the similarity between two words is calculated based not only on the documents in which they occur, but also on the documents in which "similar" words occur. The idea here is that two documents about the same topic do not necessarily use the same set of words to describe it, but rather a subset of those words plus other similar words characteristic of that topic. Taking these higher-order similarities into account brings the latent semantic structure of the whole corpus into consideration, with the result of generating a better clustering of the documents and words.

In text databases, for a document collection defined by a document-by-term matrix D (of size m by n, where m is the number of documents and n the number of terms), the cover-coefficient-based clustering methodology[27] yields the same number of clusters both for documents and for terms (words) using a double-stage probability experiment. According to the cover-coefficient concept, the number of clusters can also be roughly estimated as m × n / t, where t is the number of non-zero entries in D. Note that in D each row and each column must contain at least one non-zero element.
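Assuming the rough estimate is m × n / t, as in the cover-coefficient concept, it can be computed directly from the matrix (the example matrix is illustrative):

```python
def estimated_cluster_count(D):
    """Rough cluster-count estimate m*n/t from the cover-coefficient concept,
    where t is the number of non-zero entries of the m-by-n matrix D."""
    m, n = len(D), len(D[0])
    t = sum(1 for row in D for v in row if v != 0)
    return m * n / t

D = [
    [1, 0, 0, 2],  # every row and every column has at least one non-zero entry
    [0, 3, 0, 0],
    [0, 0, 1, 1],
]
estimate = estimated_cluster_count(D)  # 3 * 4 / 5 = 2.4
```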

In contrast to other approaches, FABIA is a multiplicative model that assumes realistic non-Gaussian signal distributions with heavy tails. FABIA utilizes well-understood model selection techniques such as variational approaches and applies the Bayesian framework. The generative framework allows FABIA to determine the information content of each bicluster, separating spurious biclusters from true biclusters.

See also

  • Formal concept analysis
  • Biclique
  • Galois connection

References

  1. ^ G. Govaert; M. Nadif (2008). "Block clustering with bernoulli mixture models: Comparison of different approaches". Computational Statistics and Data Analysis. 52 (6): 3233–3245. doi:10.1016/j.csda.2007.09.007.
  2. ^ R. Balamurugan; A.M. Natarajan; K. Premalatha (2015). "Stellar-Mass Black Hole Optimization for Biclustering Microarray Gene Expression Data". Applied Artificial Intelligence. 29 (4): 353–381. doi:10.1080/08839514.2015.1016391. S2CID 44624424.
  3. ^ G. Govaert; M. Nadif (2013). Co-clustering: models, algorithms and applications. ISTE, Wiley. ISBN 978-1-84821-473-6.
  4. ^ R. Balamurugan; A.M. Natarajan; K. Premalatha (2016). "A Modified Harmony Search Method for Biclustering Microarray Gene Expression Data". International Journal of Data Mining and Bioinformatics. 16 (4): 269–289. doi:10.1504/IJDMB.2016.082205.
  5. ^ Van Mechelen I, Bock HH, De Boeck P (2004). "Two-mode clustering methods:a structured overview". Statistical Methods in Medical Research. 13 (5): 363–94. CiteSeerX 10.1.1.706.4201. doi:10.1191/0962280204sm373ra. PMID 15516031. S2CID 19058237.
  6. ^ a b Mirkin, Boris (1996). Mathematical Classification and Clustering. Kluwer Academic Publishers. ISBN 978-0-7923-4159-8.
  7. ^ a b Hartigan JA (1972). "Direct clustering of a data matrix". Journal of the American Statistical Association. 67 (337): 123–9. doi:10.2307/2284710. JSTOR 2284710.
  8. ^ Cheng Y, Church GM (2000). "Biclustering of expression data". Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB). 8: 93–103. https://www.cs.princeton.edu/courses/archive/fall03/cs597F/Articles/biclustering_of_expression_data.pdf
  9. ^ Dhillon, Inderjit S. (2001). "Co-clustering documents and words using bipartite spectral graph partitioning". Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 269–274. doi:10.1145/502512.502550. ISBN 158113391X. S2CID 11847258.
  10. ^ Dhillon, Inderjit S.; Mallela, Subramanyam; Modha, Dharmendra S. (2003). "Information-theoretic co-clustering". Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 89–98. doi:10.1145/956750.956764. ISBN 1581137370. S2CID 12286784.
  11. ^ Banerjee, Arindam; Dhillon, Inderjit; Ghosh, Joydeep; Merugu, Srujana; Modha, Dharmendra S. (2004). "A generalized maximum entropy approach to bregman co-clustering and matrix approximation". Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 509–514. doi:10.1145/1014052.1014111. ISBN 1581138881. S2CID 2719002.
  12. ^ Bekkerman, Ron; El-Yaniv, Ran; McCallum, Andrew (2005). "Multi-way distributional clustering via pairwise interactions". Proceedings of the 22nd international conference on Machine learning - ICML '05. pp. 41–48. doi:10.1145/1102351.1102357. ISBN 1595931805. S2CID 858524.
  13. ^ Peeters R (2003). "The maximum edge biclique problem is NP-complete". Discrete Applied Mathematics. 131 (3): 651–654. doi:10.1016/S0166-218X(03)00333-0. S2CID 3102766.
  14. ^ a b Madeira SC, Oliveira AL (2004). "Biclustering Algorithms for Biological Data Analysis: A Survey". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 1 (1): 24–45. doi:10.1109/TCBB.2004.2. PMID 17048406. S2CID 206628783.
  15. ^ Kriegel, H.-P.; Kröger, P.; Zimek, A. (March 2009). "Clustering High Dimensional Data: A Survey on Subspace Clustering, Pattern-based Clustering, and Correlation Clustering". ACM Transactions on Knowledge Discovery from Data. 3 (1): 1–58. doi:10.1145/1497577.1497578. S2CID 17363900.
  16. ^ Tanay A, Sharan R, Kupiec M, Shamir R (2004). "Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data". Proceedings of the National Academy of Sciences. 101 (9): 2981–2986. Bibcode:2004PNAS..101.2981T. doi:10.1073/pnas.0308661100. PMC 365731. PMID 14973197.
  17. ^ Abdullah, Ahsan; Hussain, Amir (2006). "A new biclustering technique based on crossing minimization". Neurocomputing. 69 (16–18): 1882–1896. doi:10.1016/j.neucom.2006.02.018.
  18. ^ Reiss DJ, Baliga NS, Bonneau R (2006). "Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks". BMC Bioinformatics. 7: 280–302. doi:10.1186/1471-2105-7-280. PMC 1502140. PMID 16749936.
  19. ^ Hochreiter S, Bodenhofer U, Heusel M, Mayr A, Mitterecker A, Kasim A, Khamiakova T, Van Sanden S, Lin D, Talloen W, Bijnens L, Gohlmann HW, Shkedy Z, Clevert DA (2010). "FABIA: factor analysis for bicluster acquisition". Bioinformatics. 26 (12): 1520–1527. doi:10.1093/bioinformatics/btq227. PMC 2881408. PMID 20418340.
  20. ^ Orzechowski P, Pańszczyk A, Huang X, Moore JH (2018). "runibic: a Bioconductor package for parallel row-based biclustering of gene expression data". Bioinformatics. 34 (24): 4302–4304. doi:10.1093/bioinformatics/bty512. PMC 6289127. PMID 29939213.
  21. ^ Orzechowski P, Sipper M, Huang X, Moore JH (2018). "EBIC: an evolutionary-based parallel biclustering algorithm for pattern discovery". Bioinformatics. 34 (21): 3719–3726. arXiv:1801.03039. doi:10.1093/bioinformatics/bty401. PMC 6198864. PMID 29790909.
  22. ^ Fanaee-T H, Thoresen, M (2020). "Iterative Multi-mode Discretization: Applications to Co-clustering". Discovery Science. Lecture Notes in Computer Science. Vol. 12323. pp. 94–105. doi:10.1007/978-3-030-61527-7_7. hdl:10852/82994. ISBN 978-3-030-61526-0. S2CID 222832035.
  23. ^ Madeira SC, Teixeira MC, Sá-Correia I, Oliveira AL (2010). "Identification of Regulatory Modules in Time Series Gene Expression Data using a Linear Time Biclustering Algorithm". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 1 (7): 153–165. doi:10.1109/TCBB.2008.34. PMID 20150677. S2CID 7369531.
  24. ^ Madeira SC, Oliveira AL (2009). "A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series". Algorithms for Molecular Biology. 4 (8): 8. doi:10.1186/1748-7188-4-8. PMC 2709627. PMID 19497096.
  25. ^ Aguilar-Ruiz JS (2005). "Shifting and scaling patterns from gene expression data". Bioinformatics. 21 (10): 3840–3845. doi:10.1093/bioinformatics/bti641. PMID 16144809.
  26. ^ a b Bisson G.; Hussain F. (2008). "Chi-Sim: A New Similarity Measure for the Co-clustering Task". 2008 Seventh International Conference on Machine Learning and Applications. pp. 211–217. doi:10.1109/ICMLA.2008.103. ISBN 978-0-7695-3495-4. S2CID 15506600.
  27. ^ Can, F.; Ozkarahan, E. A. (1990). "Concepts and effectiveness of the cover coefficient based clustering methodology for text databases" (PDF). ACM Transactions on Database Systems. 15 (4): 483–517. doi:10.1145/99935.99938. hdl:2374.MIA/246. S2CID 14309214.

Others

  • N.K. Verma, S. Bajpai, A. Singh, A. Nagrare, S. Meena, Yan Cui, "A Comparison of Biclustering Algorithms" in International conference on Systems in Medicine and Biology (ICSMB 2010)in IIT Kharagpur India, pp. 90–97, Dec. 16–18.
  • J. Gupta, S. Singh and N.K. Verma "MTBA: MATLAB Toolbox for Biclustering Analysis", IEEE Workshop on Computational Intelligence: Theories, Applications and Future Directions", IIT Kanpur India, pp. 148–152, Jul. 2013.
  • A. Tanay. R. Sharan, and R. Shamir, "Biclustering Algorithms: A Survey", In Handbook of Computational Molecular Biology, Edited by Srinivas Aluru, Chapman (2004)
  • Kluger Y, Basri R, Chang JT, Gerstein MB (2003). "Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions". Genome Research. 13 (4): 703–716. doi:10.1101/gr.648603. PMC 430175. PMID 12671006.
  • Adetayo Kasim, Ziv Shkedy, Sebastian Kaiser, Sepp Hochreiter, Willem Talloen (2016), Applied Biclustering Methods for Big and High-Dimensional Data Using R, Chapman & Hall/CRC Press
  • Orzechowski, P., Sipper, M., Huang, X., & Moore, J. H. (2018). EBIC: an evolutionary-based parallel biclustering algorithm for pattern discovery. Bioinformatics.

External links

  • FABIA: Factor Analysis for Bicluster Acquisition, an R package

biclustering, block, clustering, clustering, mode, clustering, data, mining, technique, which, allows, simultaneous, clustering, rows, columns, matrix, term, first, introduced, boris, mirkin, name, technique, introduced, many, years, earlier, 1972, john, harti. Biclustering block clustering 1 2 Co clustering or two mode clustering 3 4 5 is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix The term was first introduced by Boris Mirkin 6 to name a technique introduced many years earlier 6 in 1972 by John A Hartigan 7 Given a set of m displaystyle m samples represented by an n displaystyle n dimensional feature vector the entire dataset can be represented as m displaystyle m rows in n displaystyle n columns i e an m n displaystyle m times n matrix The Biclustering algorithm generates Biclusters A Bicluster is a subset of rows which exhibit similar behavior across a subset of columns or vice versa Contents 1 Development 2 Complexity 3 Types of Biclusters 4 Algorithms 5 See also 6 References 6 1 Others 7 External linksDevelopment editBiclustering was originally introduced by John A Hartigan in 1972 7 The term Biclustering was then later used and refined by Boris G Mirkin This algorithm was not generalized until 2000 when Y Cheng and George M Church proposed a biclustering algorithm based on the mean squared residue score MSR and applied it to biological gene expression data 8 In 2001 and 2003 I S Dhillon published two algorithms applying biclustering to files and words One version was based on bipartite spectral graph partitioning 9 The other was based on information theory Dhillon assumed the loss of mutual information during biclustering was equal to the Kullback Leibler distance KL distance between P and Q P represents the distribution of files and feature words before Biclustering while Q is the distribution after Biclustering KL distance is for measuring the difference between two random distributions KL 0 when the two 
distributions are the same and KL increases as the difference increases 10 Thus the aim of the algorithm was to find the minimum KL distance between P and Q In 2004 Arindam Banerjee used a weighted Bregman distance instead of KL distance to design a Biclustering algorithm that was suitable for any kind of matrix unlike the KL distance algorithm 11 To cluster more than two types of objects in 2005 Bekkerman expanded the mutual information in Dhillon s theorem from a single pair into multiple pairs 12 Complexity editThe complexity of the Biclustering problem depends on the exact problem formulation and particularly on the merit function used to evaluate the quality of a given Bicluster However the most interesting variants of this problem are NP complete NP complete has two conditions In the simple case that there is an only element a i j either 0 or 1 in the binary matrix A a Bicluster is equal to a biclique in the corresponding bipartite graph The maximum size Bicluster is equivalent to the maximum edge biclique in the bipartite graph In the complex case the element in matrix A is used to compute the quality of a given Bicluster and solve the more restricted version of the problem 13 It requires either large computational effort or the use of lossy heuristics to short circuit the calculation 14 Types of Biclusters editBicluster with constant values a When a Biclustering algorithm tries to find a constant value Bicluster it reorders the rows and columns of the matrix to group together similar rows and columns eventually grouping Biclusters with similar values This method is sufficient when the data is normalized A perfect constant Bicluster is a matrix I J in which all values a i j are equal to a given constant m In tangible data these entries a i j may be represented with the form n i j m where n i j denotes the noise According to Hartigan s algorithm by splitting the original data matrix into a set of Biclusters variance is used to compute constant Biclusters 
Hence a perfect Bicluster may be equivalently defined as a matrix with a variance of zero In order to prevent the partitioning of the data matrix into Biclusters with the only one row and one column Hartigan assumes that there are for example K Biclusters within the data matrix When the data matrix is partitioned into K Biclusters the algorithm ends Bicluster with constant values on rows b or columns c Unlike the constant value Biclusters these types of Biclusters cannot be evaluated solely based on the variance of their values To finish the identification the columns and the rows should be normalized first There are however other algorithms without the normalization step that can find Biclusters which have rows and columns with different approaches Bicluster with coherent values d e For Biclusters with coherent values on rows and columns an overall improvement over the algorithms for Biclusters with constant values on rows or on columns should be considered This algorithm may contain analysis of variance between groups using co variance between both rows and columns In Cheng and Church s theorem a Bicluster is defined as a subset of rows and columns with almost the same score The similarity score is used to measure the coherence of rows and columns a Bicluster with constant values 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 2 0 b Bicluster with constant values on rows 1 0 1 0 1 0 1 0 1 0 2 0 2 0 2 0 2 0 2 0 3 0 3 0 3 0 3 0 3 0 4 0 4 0 4 0 4 0 4 0 5 0 5 0 5 0 5 0 5 0 c Bicluster with constant values on columns 1 0 2 0 3 0 4 0 5 0 1 0 2 0 3 0 4 0 5 0 1 0 2 0 3 0 4 0 5 0 1 0 2 0 3 0 4 0 5 0 1 0 2 0 3 0 4 0 5 0 d Bicluster with coherent values additive 1 0 4 0 5 0 0 0 1 5 4 0 7 0 8 0 3 0 4 5 3 0 6 0 7 0 2 0 3 5 5 0 8 0 9 0 4 0 5 5 2 0 5 0 6 0 1 0 2 5 e Bicluster with coherent values multiplicative 1 0 0 5 2 0 0 2 0 8 2 0 1 0 4 0 0 4 1 6 3 0 1 5 6 0 0 6 2 4 4 0 2 0 8 0 0 8 3 2 5 0 2 5 10 0 1 0 4 0 The relationship 
between these cluster models and other types of clustering such as correlation clustering is discussed in 15 Algorithms editThere are many Biclustering algorithms developed for bioinformatics including block clustering CTWC Coupled Two Way Clustering ITWC Interrelated Two Way Clustering d bicluster d pCluster d pattern FLOC OPC Plaid Model OPSMs Order preserving submatrixes Gibbs SAMBA Statistical Algorithmic Method for Bicluster Analysis 16 Robust Biclustering Algorithm RoBA Crossing Minimization 17 cMonkey 18 PRMs DCC LEB Localize and Extract Biclusters QUBIC QUalitative BIClustering BCCA Bi Correlation Clustering Algorithm BIMAX ISA and FABIA Factor analysis for Bicluster Acquisition 19 runibic 20 and recently proposed hybrid method EBIC evolutionary based Biclustering 21 which was shown to detect multiple patterns with very high accuracy More recently IMMD CC 22 is proposed that is developed based on the iterative complexity reduction concept IMMD CC is able to identify co cluster centroids from highly sparse transformation obtained by iterative multi mode discretization Biclustering algorithms have also been proposed and used in other application fields under the names co clustering bi dimensional clustering and subspace clustering 14 Given the known importance of discovering local patterns in time series data Recent proposals have addressed the Biclustering problem in the specific case of time series gene expression data In this case the interesting Biclusters can be restricted to those with contiguous columns This restriction leads to a tractable problem and enables the development of efficient exhaustive enumeration algorithms such as CCC Biclustering 23 and e CCC Biclustering 24 The approximate patterns in CCC Biclustering algorithms allow a given number of errors per gene relatively to an expression profile representing the expression pattern in the Bicluster The e CCC Biclustering algorithm uses approximate expressions to find and report all maximal CCC 
Bicluster s by a discretized matrix A and efficient string processing techniques These algorithms find and report all maximal Biclusters with coherent and contiguous columns with perfect approximate expression patterns in time linear polynomial which is obtained by manipulating a discretized version of original expression matrix in the size of the time series gene expression matrix using efficient string processing techniques based on suffix trees These algorithms are also applied to solve problems and sketch the analysis of computational complexity Some recent algorithms have attempted to include additional support for Biclustering rectangular matrices in the form of other datatypes including cMonkey There is an ongoing debate about how to judge the results of these methods as Biclustering allows overlap between clusters and some algorithms allow the exclusion of hard to reconcile columns conditions Not all of the available algorithms are deterministic and the analyst must pay attention to the degree to which results represent stable minima Because this is an unsupervised classification problem the lack of a gold standard makes it difficult to spot errors in the results One approach is to utilize multiple Biclustering algorithms with the majority or super majority voting amongst them to decide the best result Another way is to analyze the quality of shifting and scaling patterns in Biclusters 25 Biclustering has been used in the domain of text mining or classification which is popularly known as co clustering 26 Text corpora are represented in a vectoral form as a matrix D whose rows denote the documents and whose columns denote the words in the dictionary Matrix elements Dij denote occurrence of word j in document i Co clustering algorithms are then applied to discover blocks in D that correspond to a group of documents rows characterized by a group of words columns Text clustering can solve the high dimensional sparse problem which means clustering text and 
words at the same time. When clustering text, one must consider not only the information carried by individual words but also the information of the word clusters they compose; the feature words are then clustered according to their similarity across texts. This is called co-clustering. Co-clustering has two advantages. First, clustering the texts based on word clusters greatly reduces the dimensionality of the clustering problem and makes it possible to measure the distance between texts more appropriately. Second, it mines more useful information, yielding corresponding information in the text clusters and the word clusters: this correspondence can be used to describe the type of texts and words at the same time, and the word-clustering result can also be used in text mining and information retrieval.

Several approaches have been proposed based on the information contents of the resulting blocks: matrix-based approaches such as SVD and BVD, and graph-based approaches. Information-theoretic algorithms iteratively assign each row to a cluster of documents and each column to a cluster of words such that the mutual information is maximized. Matrix-based methods focus on decomposing the matrix into blocks such that the error between the original matrix and the matrices regenerated from the decomposition is minimized. Graph-based methods tend to minimize the cuts between the clusters: given two groups of documents d1 and d2, the number of cuts can be measured as the number of words that occur in documents of both groups d1 and d2.

More recently, Bisson and Hussain[26] proposed a new approach that uses the similarity between words and the similarity between documents to co-cluster the matrix. Their method, known as χ-Sim (for cross-similarity), is based on finding document–document similarity and word–word similarity and then applying classical clustering methods such as hierarchical clustering. Instead of explicitly clustering rows and columns alternately, they consider higher-order occurrences of words, inherently taking into account the documents in which they occur. Thus, the similarity between two words is calculated based on the documents in which they occur and also on the documents in which "similar" words occur. The idea is that two documents about the same topic do not necessarily use the same set of words to describe it, but rather a subset of those words together with other, similar words that are characteristic of the topic. Taking higher-order similarities into account brings the latent semantic structure of the whole corpus into play, which yields a better clustering of the documents and words.

In text databases, for a document collection defined by a document-by-term matrix D of size m × n (m: number of documents, n: number of terms), the cover-coefficient-based clustering methodology[27] yields the same number of clusters for both documents and terms (words) using a double-stage probability experiment. According to the cover-coefficient concept, the number of clusters can also be roughly estimated as (m × n)/t, where t is the number of non-zero entries in D. Note that in D each row and each column must contain at least one non-zero element.

In contrast to other approaches, FABIA is a multiplicative model that assumes realistic non-Gaussian signal distributions with heavy tails. FABIA utilizes well-understood model-selection techniques such as variational approaches and applies the Bayesian framework. The generative framework allows FABIA to determine the information content of each Bicluster, separating spurious Biclusters from true Biclusters.

See also

- Formal concept analysis
- Biclique
- Galois connection

References

1. Govaert, G.; Nadif, M. (2008). "Block clustering with Bernoulli mixture models: Comparison of different approaches". Computational Statistics and Data Analysis. 52 (6): 3233–3245. doi:10.1016/j.csda.2007.09.007.
2. Balamurugan, R.; Natarajan, A. M.; Premalatha, K. (2015). "Stellar-Mass Black Hole Optimization for
Biclustering Microarray Gene Expression Data". Applied Artificial Intelligence. 29 (4): 353–381. doi:10.1080/08839514.2015.1016391. S2CID 44624424.
3. Govaert, G.; Nadif, M. (2013). Co-clustering: Models, Algorithms and Applications. ISTE, Wiley. ISBN 978-1-84821-473-6.
4. Balamurugan, R.; Natarajan, A. M.; Premalatha, K. (2016). "A Modified Harmony Search Method for Biclustering Microarray Gene Expression Data". International Journal of Data Mining and Bioinformatics. 16 (4): 269–289. doi:10.1504/IJDMB.2016.082205.
5. Van Mechelen, I.; Bock, H. H.; De Boeck, P. (2004). "Two-mode clustering methods: a structured overview". Statistical Methods in Medical Research. 13 (5): 363–394. CiteSeerX 10.1.1.706.4201. doi:10.1191/0962280204sm373ra. PMID 15516031. S2CID 19058237.
6. Mirkin, Boris (1996). Mathematical Classification and Clustering. Kluwer Academic Publishers. ISBN 978-0-7923-4159-8.
7. Hartigan, J. A. (1972). "Direct clustering of a data matrix". Journal of the American Statistical Association. 67 (337): 123–129. doi:10.2307/2284710. JSTOR 2284710.
8. Cheng, Y.; Church, G. M. (2000). "Biclustering of expression data". Proceedings of ISMB 2000. 8: 93–103.
9. Dhillon, Inderjit S. (2001). "Co-clustering documents and words using bipartite spectral graph partitioning". Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 269–274. doi:10.1145/502512.502550. ISBN 158113391X. S2CID 11847258.
10. Dhillon, Inderjit S.; Mallela, Subramanyam; Modha, Dharmendra S. (2003). "Information-theoretic co-clustering". Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 89–98. doi:10.1145/956750.956764. ISBN 1581137370. S2CID 12286784.
11. Banerjee, Arindam; Dhillon, Inderjit; Ghosh, Joydeep; Merugu, Srujana; Modha, Dharmendra S. (2004). "A generalized maximum entropy approach to Bregman co-clustering and matrix approximation". Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 509–514. doi:10.1145/1014052.1014111. ISBN 1581138881. S2CID 2719002.
12. Bekkerman, Ron; El-Yaniv, Ran; McCallum, Andrew (2005). "Multi-way distributional clustering via pairwise interactions". Proceedings of the 22nd International Conference on Machine Learning (ICML '05). pp. 41–48. doi:10.1145/1102351.1102357. ISBN 1595931805. S2CID 858524.
13. Peeters, R. (2003). "The maximum edge biclique problem is NP-complete". Discrete Applied Mathematics. 131 (3): 651–654. doi:10.1016/S0166-218X(03)00333-0. S2CID 3102766.
14. Madeira, S. C.; Oliveira, A. L. (2004). "Biclustering Algorithms for Biological Data Analysis: A Survey". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 1 (1): 24–45. doi:10.1109/TCBB.2004.2. PMID 17048406. S2CID 206628783.
15. Kriegel, H.-P.; Kröger, P.; Zimek, A. (March 2009). "Clustering High-Dimensional Data: A Survey on Subspace Clustering, Pattern-based Clustering, and Correlation Clustering". ACM Transactions on Knowledge Discovery from Data. 3 (1): 1–58. doi:10.1145/1497577.1497578. S2CID 17363900.
16. Tanay, A.; Sharan, R.; Kupiec, M.; Shamir, R. (2004). "Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data". Proceedings of the National Academy of Sciences. 101 (9): 2981–2986. Bibcode:2004PNAS..101.2981T. doi:10.1073/pnas.0308661100. PMC 365731. PMID 14973197.
17. Abdullah, Ahsan; Hussain, Amir (2006). "A new biclustering technique based on crossing minimization". Neurocomputing. 69 (16–18): 1882–1896. doi:10.1016/j.neucom.2006.02.018.
18. Reiss, D. J.; Baliga, N. S.; Bonneau, R. (2006). "Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks". BMC Bioinformatics. 7: 280–302. doi:10.1186/1471-2105-7-280. PMC 1502140. PMID 16749936.
19. Hochreiter, S.; Bodenhofer, U.; Heusel, M.; Mayr, A.; Mitterecker, A.; Kasim, A.; Khamiakova, T.; Van Sanden, S.; Lin, D.; Talloen, W.; Bijnens, L.; Göhlmann, H. W.; Shkedy, Z.; Clevert, D.-A. (2010). "FABIA: factor analysis for bicluster acquisition". Bioinformatics. 26 (12): 1520–1527. doi:10.1093/bioinformatics/btq227. PMC 2881408. PMID 20418340.
20. Orzechowski, P.; Pańszczyk, A.; Huang, X.; Moore, J. H. (2018). "runibic: a Bioconductor package for parallel row-based biclustering of gene expression data". Bioinformatics. 34 (24): 4302–4304. doi:10.1093/bioinformatics/bty512. PMC 6289127. PMID 29939213.
21. Orzechowski, P.; Sipper, M.; Huang, X.; Moore, J. H. (2018). "EBIC: an evolutionary-based parallel biclustering algorithm for pattern discovery". Bioinformatics. 34 (21): 3719–3726. arXiv:1801.03039. doi:10.1093/bioinformatics/bty401. PMC 6198864. PMID 29790909.
22. Fanaee-T, H.; Thoresen, M. (2020). "Iterative Multi-mode Discretization: Applications to Co-clustering". Discovery Science. Lecture Notes in Computer Science. Vol. 12323. pp. 94–105. doi:10.1007/978-3-030-61527-7_7. hdl:10852/82994. ISBN 978-3-030-61526-0. S2CID 222832035.
23. Madeira, S. C.; Teixeira, M. C.; Sá-Correia, I.; Oliveira, A. L. (2010). "Identification of Regulatory Modules in Time Series Gene Expression Data using a Linear Time Biclustering Algorithm". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 7 (1): 153–165. doi:10.1109/TCBB.2008.34. PMID 20150677. S2CID 7369531.
24. Madeira, S. C.; Oliveira, A. L. (2009). "A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series". Algorithms for Molecular Biology. 4: 8. doi:10.1186/1748-7188-4-8. PMC 2709627. PMID 19497096.
25. Aguilar-Ruiz, J. S. (2005). "Shifting and scaling patterns from gene expression data". Bioinformatics. 21 (10): 3840–3845. doi:10.1093/bioinformatics/bti641. PMID 16144809.
26. Bisson, G.; Hussain, F. (2008). "Chi-Sim: A New Similarity Measure for the Co-clustering Task". Proceedings of the Seventh International Conference on Machine Learning and Applications. pp. 211–217. doi:10.1109/ICMLA.2008.103. ISBN 978-0-7695-3495-4. S2CID 15506600.
27. Can, F.; Ozkarahan, E. A. (1990). "Concepts and effectiveness of the cover coefficient-based clustering methodology for text databases". ACM Transactions on Database Systems. 15 (4): 483–517. doi:10.1145/99935.99938. hdl:2374.MIA/246. S2CID 14309214.

Others

- Verma, N. K.; Bajpai, S.; Singh, A.; Nagrare, A.; Meena, S.; Cui, Yan. "A Comparison of Biclustering Algorithms". International Conference on Systems in Medicine and Biology (ICSMB 2010), IIT Kharagpur, India, pp. 90–97, Dec 16–18.
- Gupta, J.; Singh, S.; Verma, N. K. "MTBA: MATLAB Toolbox for Biclustering Analysis". IEEE Workshop on Computational Intelligence: Theories, Applications and Future Directions, IIT Kanpur, India, pp. 148–152, Jul 2013.
- Tanay, A.; Sharan, R.; Shamir, R. "Biclustering Algorithms: A Survey". In Handbook of Computational Molecular Biology, edited by Srinivas Aluru. Chapman, 2004.
- Kluger, Y.; Basri, R.; Chang, J. T.; Gerstein, M. B. (2003). "Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions". Genome Research. 13 (4): 703–716. doi:10.1101/gr.648603. PMC 430175. PMID 12671006.
- Kasim, Adetayo; Shkedy, Ziv; Kaiser, Sebastian; Hochreiter, Sepp; Talloen, Willem (2016). Applied Biclustering Methods for Big and High-Dimensional Data Using R. Chapman & Hall/CRC Press.
- Orzechowski, P.; Sipper, M.; Huang, X.; Moore, J. H. (2018). "EBIC: an evolutionary-based parallel biclustering algorithm for pattern discovery". Bioinformatics.

External links

- FABIA: Factor Analysis for Bicluster Acquisition, an R package (software)
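Two simple quantities appear in the text-clustering discussion above: the cut count between two document groups (the number of words occurring in documents of both groups) and the cover-coefficient estimate (m × n)/t of the number of clusters. A minimal Python sketch illustrates both; the document-by-term matrix below is hypothetical, made up purely for illustration:

```python
# Hypothetical 4x6 document-by-term matrix D (m = 4 documents, n = 6 terms);
# D[i][j] > 0 means term j occurs in document i. Values are made up.
D = [
    [1, 2, 0, 0, 1, 0],
    [0, 3, 1, 0, 0, 0],
    [0, 0, 2, 1, 0, 1],
    [0, 0, 0, 2, 0, 1],
]

def cover_coefficient_estimate(D):
    """Rough cluster-count estimate (m * n) / t, where t is the number
    of non-zero entries of D (cover-coefficient methodology)."""
    m, n = len(D), len(D[0])
    t = sum(1 for row in D for v in row if v != 0)
    return (m * n) / t

def cut_size(D, group1, group2):
    """Cut measure from the text: the number of words that occur in
    documents of both groups."""
    n = len(D[0])
    def occurs(group, j):
        return any(D[i][j] > 0 for i in group)
    return sum(1 for j in range(n) if occurs(group1, j) and occurs(group2, j))

print(cover_coefficient_estimate(D))  # (4 * 6) / 10 = 2.4
print(cut_size(D, [0, 1], [2, 3]))    # only term 2 is shared -> 1
```

A graph-based co-clustering method would search for a split of documents (and words) that keeps this cut small; the cover-coefficient estimate gives only a rough target for how many clusters to expect.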
