numpy - What does this sparse matrix in scipy mean?


I have an NLP task and I'm using scikit-learn. Reading the tutorials, I found that I have to vectorize the text, and I want to know how to use these vectorization models to feed a classification algorithm. Assume I have some text and I vectorize it as follows:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ['''computer science is the scientific and practical approach to computation and its applications.'''
              #this opinion
              '''it is the systematic study of the feasibility, structure, expression, and mechanization of the methodical procedures that underlie the acquisition, representation, processing, storage, communication of, and access to information, whether such information is encoded as bits in a computer memory or transcribed in genes and protein structures in a biological cell.'''
              #anotherone
              '''a computer scientist specializes in the theory of computation and the design of computational systems''']

    vectorizer = CountVectorizer(analyzer='word')

    x = vectorizer.fit_transform(corpus)

    print(x)

The problem is that I don't understand the meaning of the output, and I don't see the relation between the text and the matrix returned by the vectorizer:

    (0, 12)   3
    (0, 33)   1
    (0, 20)   3
    (0, 45)   7
    (0, 34)   1
    (0, 2)    6
    (0, 28)   1
    (0, 4)    1
    (0, 47)   2
    (0, 10)   2
    (0, 22)   1
    (0, 3)    1
    (0, 21)   1
    (0, 42)   1
    (0, 40)   1
    (0, 26)   5
    (0, 16)   1
    (0, 38)   1
    (0, 15)   1
    (0, 23)   1
    (0, 25)   1
    (0, 29)   1
    (0, 44)   1
    (0, 49)   1
    (0, 1)    1
    :         :
    (0, 30)   1
    (0, 37)   1
    (0, 9)    1
    (0, 0)    1
    (0, 19)   2
    (0, 50)   1
    (0, 41)   1
    (0, 14)   1
    (0, 5)    1
    (0, 7)    1
    (0, 18)   4
    (0, 24)   1
    (0, 27)   1
    (0, 48)   1
    (0, 17)   1
    (0, 31)   1
    (0, 39)   1
    (0, 6)    1
    (0, 8)    1
    (0, 35)   1
    (0, 36)   1
    (0, 46)   1
    (0, 13)   1
    (0, 11)   1
    (0, 43)   1

I also don't understand what is happening with the output when I use the toarray() method:

    print(x.toarray())

What does this output mean, and what relation does it have to the corpus?

    [[1 1 6 1 1 1 1 1 1 1 2 1 3 1 1 1 1 1 4 2 3 1 1 1 1 1 5 1 1 1 1 1 1 1 1 1 1
      1 1 1 1 1 1 1 1 7 1 2 1 1 1]]

The CountVectorizer produces a document-term matrix. For a simple example, let's take a look at the following simplified code:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ['''computer hardware''',
              '''computer data and software data''']

    vectorizer = CountVectorizer(analyzer='word')

    x = vectorizer.fit_transform(corpus)

    print(x)

    print(x.toarray())

You have 2 documents (the elements of corpus) and 5 terms (the distinct words). You can count the terms in each document as follows:

          | and computer data hardware software
          +-------------------------------------
    doc 0 |            1             1
    doc 1 |   1        1    2                 1
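The counting in the table can be reproduced by hand. The following is only a rough sketch, not scikit-learn's actual implementation: it assumes the default word tokenization (lowercase, tokens of two or more word characters) and an alphabetically sorted vocabulary. The helper names `tokenize` and `count_terms` are made up for illustration.

```python
import re
from collections import Counter


def tokenize(doc):
    # Assumption: lowercase the text and keep tokens of 2+ word characters,
    # which is how the default word analyzer is documented to behave.
    return re.findall(r"\b\w\w+\b", doc.lower())


def count_terms(corpus):
    # Tokenize every document once.
    token_lists = [tokenize(doc) for doc in corpus]
    # The vocabulary is the sorted set of all distinct tokens (one column each).
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    rows = []
    for toks in token_lists:
        counts = Counter(toks)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


corpus = ["computer hardware", "computer data and software data"]
vocabulary, rows = count_terms(corpus)
print(vocabulary)  # ['and', 'computer', 'data', 'hardware', 'software']
print(rows)        # [[0, 1, 0, 1, 0], [1, 1, 2, 0, 1]]
```

Each column index corresponds to one vocabulary term, so column 2 is "data" and the 2 in the second row says that "data" occurs twice in doc 1.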

x represents the above matrix in an associative manner, i.e. as a map from (row, col) to the frequency of the term, and x.toarray() shows x as a list of lists. The following is the execution result:

    (1, 0)    1
    (0, 1)    1
    (1, 1)    1
    (1, 2)    2
    (0, 3)    1
    (1, 4)    1
    [[0 1 0 1 0]
     [1 1 2 0 1]]
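To see how the two printed forms relate, here is a small sketch that expands a (row, col) -> count mapping into the dense list of lists. A plain Python dict stands in for the sparse matrix here, and `to_dense` is a hypothetical helper, not part of the scipy API.

```python
def to_dense(entries, n_rows, n_cols):
    # Start from an all-zero matrix: anything not stored in the map is 0.
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (row, col), count in entries.items():
        dense[row][col] = count
    return dense


# The six non-zero entries from the printed sparse output.
entries = {(1, 0): 1, (0, 1): 1, (1, 1): 1, (1, 2): 2, (0, 3): 1, (1, 4): 1}
dense = to_dense(entries, n_rows=2, n_cols=5)
print(dense)  # [[0, 1, 0, 1, 0], [1, 1, 2, 0, 1]]
```

The sparse form stores only the non-zero cells, which is why it is preferred for real corpora where most terms are absent from most documents.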

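As noted by @dmcc, you omitted the commas between the strings in your corpus. Python joins adjacent string literals into a single string, so your corpus contains only 1 document, which is why every row index in your output is 0.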
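The effect of the missing commas can be checked without scikit-learn at all. This is a minimal illustration with shortened, made-up strings:

```python
# Adjacent string literals are concatenated, even across lines and comments.
without_commas = ['''computer hardware'''
                  # a comment between literals does not separate them
                  '''computer data''']

with_commas = ['''computer hardware''',
               '''computer data''']

print(len(without_commas))  # 1
print(without_commas[0])    # computer hardwarecomputer data
print(len(with_commas))     # 2
```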

