numpy - What does this sparse matrix in scipy mean?
I have an NLP task and I'm using scikit-learn. Reading the tutorials, I found that I have to vectorize the text, and how to use these vectorization models to feed a classification algorithm. Assume that I have some text and I vectorize it as follows:
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ['''Computer science is the scientific and
    practical approach to computation and its applications.'''
    #this opinion
    '''It is the systematic study of the feasibility, structure,
    expression, and mechanization of the methodical
    procedures that underlie the acquisition,
    representation, processing, storage, communication of,
    and access to information, whether such information is encoded
    as bits in a computer memory or transcribed in genes and
    protein structures in a biological cell.'''
    #anotherone
    '''A computer scientist specializes in the theory of
    computation and the design of computational systems''']

    vectorizer = CountVectorizer(analyzer='word')
    x = vectorizer.fit_transform(corpus)

    print(x)
The problem is that I don't understand the meaning of the output. I don't see any relation between the text and the matrix returned by the vectorizer:
    (0, 12)	3
    (0, 33)	1
    (0, 20)	3
    (0, 45)	7
    (0, 34)	1
    (0, 2)	6
    (0, 28)	1
    (0, 4)	1
    (0, 47)	2
    (0, 10)	2
    (0, 22)	1
    (0, 3)	1
    (0, 21)	1
    (0, 42)	1
    (0, 40)	1
    (0, 26)	5
    (0, 16)	1
    (0, 38)	1
    (0, 15)	1
    (0, 23)	1
    (0, 25)	1
    (0, 29)	1
    (0, 44)	1
    (0, 49)	1
    (0, 1)	1
    :	:
    (0, 30)	1
    (0, 37)	1
    (0, 9)	1
    (0, 0)	1
    (0, 19)	2
    (0, 50)	1
    (0, 41)	1
    (0, 14)	1
    (0, 5)	1
    (0, 7)	1
    (0, 18)	4
    (0, 24)	1
    (0, 27)	1
    (0, 48)	1
    (0, 17)	1
    (0, 31)	1
    (0, 39)	1
    (0, 6)	1
    (0, 8)	1
    (0, 35)	1
    (0, 36)	1
    (0, 46)	1
    (0, 13)	1
    (0, 11)	1
    (0, 43)	1
I also don't understand what is happening in the output when I use the toarray() method:

    print(x.toarray())

What does this output mean, and what relation does it have to the corpus?

    [[1 1 6 1 1 1 1 1 1 1 2 1 3 1 1 1 1 1 4 2 3 1 1 1 1 1 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 7 1 2 1 1 1]]
The CountVectorizer produces a document-term matrix. As a simple example, let's take a look at the following simplified code:
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ['''computer hardware''',
              '''computer data and software data''']

    vectorizer = CountVectorizer(analyzer='word')
    x = vectorizer.fit_transform(corpus)

    print(x)
    print(x.toarray())
You have 2 documents, the elements of corpus, and 5 terms, the distinct words. You can count the terms in each document as follows:
          | and  computer  data  hardware  software
    ------+-----------------------------------------
    doc 0 |      1               1
    doc 1 | 1    1         2               1
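The counts in this table can be reproduced by hand. The following is only a rough pure-Python sketch, not scikit-learn's implementation: it assumes tokens are lowercased runs of two or more word characters (which mirrors CountVectorizer's default token pattern) and that columns are the distinct terms in alphabetical order. The names `tokenize`, `terms` and `matrix` are made up for illustration.

```python
import re

# Same two documents as in the simplified example.
corpus = ["computer hardware", "computer data and software data"]

# Assumption: a token is a run of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())

# Columns: the distinct terms across all documents, in alphabetical order.
terms = sorted({tok for doc in corpus for tok in tokenize(doc)})

# Rows: one list of per-term counts for each document.
matrix = [[tokenize(doc).count(term) for term in terms] for doc in corpus]

print(terms)
for row in matrix:
    print(row)
```

Each position in a row lines up with the term at the same position in `terms`, which is how a column index maps back to a word.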
x represents the above matrix in an associative manner, i.e. as a map from (row, col) to the frequency of the term, and x.toarray() shows x as a list of lists. The following is the execution result:
    (1, 0)	1
    (0, 1)	1
    (1, 1)	1
    (1, 2)	2
    (0, 3)	1
    (1, 4)	1
    [[0 1 0 1 0]
     [1 1 2 0 1]]
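To make the relation between the two printouts concrete, here is a small sketch that stores only the non-zero cells in a plain dict keyed by (row, col) and then rebuilds the list of lists from it. This is an illustration of the idea only: scipy's sparse matrices use their own compressed storage rather than a dict, and the names `sparse` and `to_dense` are hypothetical.

```python
# Dense matrix taken from the execution result.
dense = [[0, 1, 0, 1, 0],
         [1, 1, 2, 0, 1]]

# Keep only the non-zero cells, keyed by (row, col).
sparse = {(r, c): v
          for r, row in enumerate(dense)
          for c, v in enumerate(row)
          if v != 0}

def to_dense(cells, n_rows, n_cols):
    """Rebuild the list of lists, filling every missing cell with 0."""
    return [[cells.get((r, c), 0) for c in range(n_cols)]
            for r in range(n_rows)]

for key in sorted(sparse):
    print(key, sparse[key])
print(to_dense(sparse, 2, 5))
```

Only six of the ten cells are stored, which is the point of a sparse format: with a real vocabulary most cells are zero.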
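As noted by @dmcc, the commas omitted between the strings make your corpus have only 1 document: Python concatenates adjacent string literals, so the three texts become a single string. That is why every entry in your output has row index 0.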
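A quick way to check this yourself, using the two short documents from the simplified example (the variable names are just for illustration):

```python
# No comma: Python joins adjacent string literals into one string.
without_commas = ['''computer hardware'''
                  # a comment between the literals does not separate them
                  '''computer data and software data''']

# With a comma the list has one element per document.
with_commas = ['''computer hardware''',
               '''computer data and software data''']

print(len(without_commas))
print(len(with_commas))
print(without_commas[0])
```

The first list has a single element, so a vectorizer fitted on it sees one document and returns a matrix with one row.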