Comparing the first columns in two CSV files using Python and printing matches
I have two CSV files that each contain n-grams like this:
    drinks while strutting,4,1.435486010883783160220299732e-8
    since that,6,4.306458032651349480660899195e-8
    state face,3,2.153229016325674740330449597e-8
Each line is a 3-word phrase, followed by a frequency number, followed by a relative frequency number.
I want to write a script that finds the n-grams that appear in both CSV files, divides their relative frequencies, and prints them to a new CSV file. I want to find a match whenever a 3-word phrase matches a 3-word phrase in the other file, and then divide the relative frequency of the phrase in the first CSV file by the relative frequency of the same phrase in the second CSV file. I want to print the phrase and the quotient of the two relative frequencies to a new CSV file.
Below is as far as I've gotten. The script compares lines, but it only finds a match when the entire line (including the frequencies and relative frequencies) matches exactly. I realize that is because I'm finding the intersection between two sets of entire rows, but I have no idea how to do it differently. Please forgive me; I'm new to coding. Anything that gets me a little closer would be a big help.
    import csv
    import io

    alist, blist = [], []

    with open("ngrams.csv", "rb") as filea:
        reader = csv.reader(filea, delimiter=',')
        for row in reader:
            alist.append(row)

    with open("ngramstest.csv", "rb") as fileb:
        reader = csv.reader(fileb, delimiter=',')
        for row in reader:
            blist.append(row)

    first_set = set(map(tuple, alist))
    secnd_set = set(map(tuple, blist))

    matches = set(first_set).intersection(secnd_set)

    c = csv.writer(open("matchedngrams.csv", "a"))
    c.writerow(matches)
    print matches
    print len(matches)
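To see why intersecting whole rows finds so few matches, here is a small sketch in Python 3. The two rows are made up for illustration (they are not taken from the real files): they share a phrase but differ in their counts, so the tuples are unequal even though the first columns agree.

```python
# Hypothetical rows: same phrase, different frequency columns in each file.
row_a = ("state face", "3", "2.0e-8")
row_b = ("state face", "5", "4.0e-8")

# Whole-row comparison: the tuples differ, so the intersection is empty.
whole_row_matches = {row_a} & {row_b}

# Phrase-only comparison: look at the first column alone.
phrase_matches = {row_a[0]} & {row_b[0]}

print(whole_row_matches)  # set()
print(phrase_matches)     # {'state face'}
```

Comparing only the first column is what the question is really after; the remaining work is carrying the frequency columns along with each phrase.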
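Here is a version without dumping res to a new file (that part is tedious). The idea is that the first element of each row is the phrase and the other two are the frequencies. Using a dict instead of a set lets you do the matching and the mapping together.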
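    import csv

    alist, blist = [], []

    # "rb" is the Python 2 csv idiom; on Python 3 use open(path, newline="") instead.
    with open("ngrams.csv", "rb") as filea:
        reader = csv.reader(filea, delimiter=',')
        for row in reader:
            alist.append(row)

    with open("ngramstest.csv", "rb") as fileb:
        reader = csv.reader(fileb, delimiter=',')
        for row in reader:
            blist.append(row)

    # Key each row by its phrase; the value is [frequency, relative frequency].
    f_dict = {e[0]: e[1:] for e in alist}
    s_dict = {e[0]: e[1:] for e in blist}

    # For phrases present in both files, divide the first file's relative
    # frequency by the second file's.
    res = {}
    for k, v in f_dict.items():
        if k in s_dict:
            res[k] = float(v[1]) / float(s_dict[k][1])

    print(res)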
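The answer stops at printing res, so the step the question asks for, writing the phrase and the ratio to a new CSV file, is still missing. Below is a minimal sketch of that step, assuming Python 3 and the three-column layout shown in the question. The function names `read_relative_frequencies` and `write_ratios` are hypothetical, and skipping malformed rows and zero denominators is an assumption of this sketch rather than something the original answer specifies.

```python
import csv


def read_relative_frequencies(path):
    """Map each phrase (column 1) to its relative frequency (column 3)."""
    frequencies = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 3:
                continue  # assumption: silently skip blank or malformed lines
            phrase = row[0].strip()
            frequencies[phrase] = float(row[2])
    return frequencies


def write_ratios(first_path, second_path, out_path):
    """Write one 'phrase,ratio' row per phrase found in both input files."""
    first = read_relative_frequencies(first_path)
    second = read_relative_frequencies(second_path)

    # Ratio = relative frequency in the first file / in the second file.
    # Assumption: phrases whose second-file frequency is 0 are left out.
    ratios = {
        phrase: first[phrase] / second[phrase]
        for phrase in first
        if phrase in second and second[phrase] != 0
    }

    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for phrase, ratio in ratios.items():
            writer.writerow([phrase, ratio])
    return ratios
```

With the file names from the question, the call would be `write_ratios("ngrams.csv", "ngramstest.csv", "matchedngrams.csv")`. Opening the output with mode `"w"` replaces the file on every run; the `"a"` mode used in the question would keep appending to earlier results.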