Comparing the first columns in two CSV files using Python and printing matches

January 15, 2011

I have two CSV files that each contain n-grams like this:

    drinks while strutting,4,1.435486010883783160220299732e-8
    since that,6,4.306458032651349480660899195e-8
    state face,3,2.153229016325674740330449597e-8

Each line is a three-word phrase, followed by a frequency count, followed by a relative frequency.

I want to write a script that finds the n-grams that appear in both CSV files, divides their relative frequencies, and prints the results to a new CSV file. A match should be found whenever a three-word phrase in one file matches a three-word phrase in the other file. For each match, I want to divide the relative frequency of the phrase in the first CSV file by the relative frequency of the same phrase in the second CSV file, then write the phrase and the quotient of the two relative frequencies to the new CSV file.

Below is as far as I've gotten. My script compares whole lines, so it only finds a match when the entire line (including the frequency and the relative frequency) matches exactly. I realize that is because I'm finding the intersection between two entire sets, but I have no idea how to do it differently. Please forgive me; I'm new to coding. Anything that gets me a little closer would be a big help.

    import csv
    import io

    alist, blist = [], []

    # Read every row of both files into lists (Python 2 style file modes).
    with open("ngrams.csv", "rb") as filea:
        reader = csv.reader(filea, delimiter=',')
        for row in reader:
            alist.append(row)
    with open("ngramstest.csv", "rb") as fileb:
        reader = csv.reader(fileb, delimiter=',')
        for row in reader:
            blist.append(row)

    # Each row becomes a tuple of all three columns, so the
    # intersection only keeps rows that are identical in every field.
    first_set = set(map(tuple, alist))
    secnd_set = set(map(tuple, blist))

    matches = set(first_set).intersection(secnd_set)

    c = csv.writer(open("matchedngrams.csv", "a"))
    c.writerow(matches)

    print matches
    print len(matches)
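The behaviour described above can be reproduced in a few lines. This is only a minimal sketch in Python 3: the two rows are made up for illustration (the counts and frequencies are not taken from the real files), and the variable names are hypothetical. Because the tuples differ in their frequency fields, the whole-row sets share nothing, while the phrases on their own do overlap.

```python
# Hypothetical rows: the same phrase with different numbers in each file.
rows_a = [("drinks while strutting", "4", "1.4e-8")]
rows_b = [("drinks while strutting", "9", "2.8e-8")]

# Whole-row comparison: the tuples differ, so the intersection is empty.
whole_row_matches = set(rows_a) & set(rows_b)

# First-column comparison: only the phrase is compared.
phrase_matches = {row[0] for row in rows_a} & {row[0] for row in rows_b}

print(len(whole_row_matches))
print(phrase_matches)
```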

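Here is a version without the step that dumps res into a new file (tedious). The idea is that the first element of each row is the phrase and the other two are the frequencies. Using a dict instead of a set lets you do the matching and the mapping together.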

    import csv
    import io

    alist, blist = [], []

    with open("ngrams.csv", "rb") as filea:
        reader = csv.reader(filea, delimiter=',')
        for row in reader:
            alist.append(row)
    with open("ngramstest.csv", "rb") as fileb:
        reader = csv.reader(fileb, delimiter=',')
        for row in reader:
            blist.append(row)

    # Key each row by its phrase; the value is [frequency, relative frequency].
    f_dict = {e[0]: e[1:] for e in alist}
    s_dict = {e[0]: e[1:] for e in blist}

    # For phrases present in both files, divide the relative frequencies.
    res = {}
    for k, v in f_dict.items():
        if k in s_dict:
            res[k] = float(v[1]) / float(s_dict[k][1])

    print(res)
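The answer above stops at printing res, while the question also asks for the results in a new CSV file. The following is a rough sketch of how that last step might look, written for Python 3 rather than the Python 2 style of the code above. The function names `load_ngrams` and `write_ratios` are hypothetical, and the sketch assumes every useful row has at least three columns with the relative frequency in the third; rows that are too short and phrases whose second-file frequency is zero are skipped, which is a choice made here and not something the original post specifies.

```python
import csv


def load_ngrams(path):
    """Map each phrase (column 0) to its relative frequency (column 2)."""
    with open(path, newline="") as handle:
        return {
            row[0].strip(): float(row[2])
            for row in csv.reader(handle)
            if len(row) >= 3  # assumption: ignore short or blank rows
        }


def write_ratios(path_a, path_b, out_path):
    """Write 'phrase,ratio' rows for phrases found in both files."""
    freq_a = load_ngrams(path_a)
    freq_b = load_ngrams(path_b)

    # Ratio = relative frequency in the first file / in the second file.
    ratios = {
        phrase: freq_a[phrase] / freq_b[phrase]
        for phrase in freq_a
        if phrase in freq_b and freq_b[phrase] != 0
    }

    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for phrase, ratio in sorted(ratios.items()):
            writer.writerow([phrase, ratio])
    return ratios
```

With the file names from the question, the call would be `write_ratios("ngrams.csv", "ngramstest.csv", "matchedngrams.csv")`. Opening the output with mode `"w"` replaces the file on each run; the question's code used `"a"`, which appends instead.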
