Nerd Sentai Dataman (ナード戦隊データマン)

A blog about machine learning, natural language processing, and data science

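cmudict: Using a pronunciation dictionary from nltk

cmudict (the CMU Pronouncing Dictionary) is a pronunciation dictionary that maps English words to sequences of phoneme symbols. Here we use it through nltk to rhyme like a rapper.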

Setup

Download the corpus.

import nltk
nltk.download("cmudict")

A script of about 30 lines

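import random
import sys
from collections import defaultdict

import nltk


def build_data(level=3):
    """Index cmudict by the last `level` phonemes of each pronunciation."""
    out1 = defaultdict(list)  # rhyme key -> words ending in that key
    out2 = {}                 # word -> rhyme key (last pronunciation listed wins)
    for word, phonemes in nltk.corpus.cmudict.entries():
        key = ''.join(phonemes[-level:])
        out1[key].append(word)
        out2[word] = key
    return out1, out2


def rhyme(inp, out1, out2):
    # Words missing from cmudict are returned unchanged instead of raising KeyError.
    if inp not in out2:
        return [inp]
    return out1[out2[inp]]


def rhyming(target, out1, out2):
    # cmudict entries are lowercase, so lowercase each input word before lookup.
    return [random.choice(rhyme(word.lower(), out1, out2)) for word in target.split()]


if __name__ == "__main__":
    level = int(sys.argv[2])
    out1, out2 = build_data(level)
    for i in range(10):
        print(' '.join(rhyming(sys.argv[1], out1, out2)))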
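
To make the role of `level` concrete, here is a minimal sketch that builds the same kind of key from a few hand-written entries shaped like the (word, phoneme list) pairs that `nltk.corpus.cmudict.entries()` yields. The `sample_entries` list and the `rhyme_key` helper are illustrative names, not part of nltk, and the pronunciations are typed in by hand rather than read from the corpus.

```python
# Hand-written entries mimicking the (word, phonemes) pairs in cmudict.
# Phonemes are ARPAbet symbols; the digit on a vowel marks its stress.
sample_entries = [
    ("cat", ["K", "AE1", "T"]),
    ("hat", ["HH", "AE1", "T"]),
    ("heart", ["HH", "AA1", "R", "T"]),
    ("part", ["P", "AA1", "R", "T"]),
]


def rhyme_key(phonemes, level):
    """Join the last `level` phonemes into one string key (hypothetical helper)."""
    return "".join(phonemes[-level:])


for word, phonemes in sample_entries:
    # Print the key for level 2 and level 3 side by side.
    print(word, rhyme_key(phonemes, 2), rhyme_key(phonemes, 3))
```

With level 2, "cat" and "hat" share a key, but with level 3 the key covers their whole pronunciation and they no longer match, while "heart" and "part" still do. A larger `level` therefore gives stricter rhymes and fewer candidates.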

Run

python test2.py "richard stole cute girl's heart" 3

[Output]

denatured stol disrepute berles mahrt
denatured atoll cute berls dartt
lectured toll butte sirles mahrt
blanchard extol dispute earls apart
caricatured tole refute berls part
remanufactured tole nute berles harte
natured stohl pute searle's mart
cultured pistole permute searle's part
sculptured toal compute earles parte
orchard toll impute earls bart
prichard stoll commute searls bossart
pictured toelle compute girls' mahrt
echard pistole mute searls mccart
enraptured toelle commute earls harte
ruptured stoll nute berles' impart
pictured stole bute berles' schardt
sutured toelle commute searles cart
ruptured stole mute girl's tart
captured stol repute berles hardt
indentured stole refute perls ahart
indentured stol permute earles start
manufactured atoll repute searle's tarte
recaptured atoll cute earles hart
richard stol mute girls bossart
blanchard stohl butte searle's restart
caricatured stol pute earl's art
ventured tole compute berles' art
whichard stol mute searls kabart
gestured pistole compute girls hartt
ruptured tole refute sirles descartes
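
The script above always compares a fixed number of trailing phonemes. A different way to define a rhyme, which is not what the script does, is to compare everything from the last primary-stressed vowel to the end of the word. The following is only a sketch of that alternative: `stressed_tail` is a hypothetical helper, and the pronunciations are hand-written in cmudict style.

```python
def stressed_tail(phonemes):
    """Return the phonemes from the last primary-stressed vowel to the end.

    Vowels in cmudict-style data end in a stress digit, and "1" marks
    primary stress. If no primary stress is marked, the whole
    pronunciation is returned.
    """
    for i in range(len(phonemes) - 1, -1, -1):
        if phonemes[i].endswith("1"):
            return tuple(phonemes[i:])
    return tuple(phonemes)


# Hand-written sample pronunciations (illustrative, not read from the corpus).
compute = ["K", "AH0", "M", "P", "Y", "UW1", "T"]
cute = ["K", "Y", "UW1", "T"]
stole = ["S", "T", "OW1", "L"]

print(stressed_tail(compute) == stressed_tail(cute))  # same tail: UW1 T
print(stressed_tail(cute) == stressed_tail(stole))    # different tails
```

With this key there is no `level` to tune: the length of the compared tail follows from where the stress falls in each word.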

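Code 2

Next, we try a version that takes POS tags into account: a word is only replaced by rhyming words that share its coarse POS tag.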

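import random
import sys
from collections import defaultdict

import nltk

# nltk.pos_tag also needs a tagger model, e.g. nltk.download("averaged_perceptron_tagger");
# the exact resource name depends on the NLTK version.


def build_data(level=3):
    """Index cmudict by rhyme key and by the first letter of each word's POS tag."""
    out1 = defaultdict(lambda: defaultdict(list))  # rhyme key -> POS letter -> words
    out2 = {}  # word -> rhyme key
    out3 = {}  # word -> POS letter
    for word, phonemes in nltk.corpus.cmudict.entries():
        key = ''.join(phonemes[-level:])
        # Tag the word in isolation and keep only the first letter of the tag
        # (e.g. "NN" -> "N"). One tagger call per entry makes this step slow.
        pos = nltk.pos_tag([word])[0][1][0]
        out1[key][pos].append(word)
        out2[word] = key
        out3[word] = pos
    return out1, out2, out3


def rhyme(inp, out1, out2, out3):
    # Words missing from cmudict are returned unchanged instead of raising KeyError.
    if inp not in out2:
        return [inp]
    return out1[out2[inp]][out3[inp]]


def rhyming(target, out1, out2, out3):
    # cmudict entries are lowercase, so lowercase each input word before lookup.
    return [random.choice(rhyme(word.lower(), out1, out2, out3)) for word in target.split()]


if __name__ == "__main__":
    level = int(sys.argv[2])
    out1, out2, out3 = build_data(level)
    for i in range(30):
        print(' '.join(rhyming(sys.argv[1], out1, out2, out3)))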

Results

python test3.py "conspiracy theorists are completely crazy" 3
lunacy deloris are corruptly bruzzese
galasie maoris are intermittently coglianese
policy flavorists are acutely dazey
argosy terrorists' are apparently calabrese
pharmacy maoris are nightly cortese
legitimacy demaris are desperately cassese
embassy medearis are gately swayze
agassi madaris are pertinently hazy
legacy humorists are whitely swayze
odyssey amaris are compassionately butulesi
heresy demaris are tightly catanese
privacy decesaris are devoutly chianese
barkocy halamandaris are prudently bolognese
semisecrecy flavorists are indiscriminately lazy
legacy madaris are tacitly catanese
pharmacy medearis are subsequently coglianese
barkocy decesaris are stitely swayze
jealousy motorists are disproportionately buthelezi
fallacy jefferis are disproportionately maisie
argosy terrorists' are hotly maisie
democracy medearis are affectionately coglianese
obstinacy behaviorists are resolutely mazie
literacy counterterrorists are slightly crazy
jealousy futurists are portly maisie
confederacy motorists are conveniently calabrese
degeneracy amaris are lightly colaizzi
bureaucracy jefferis are lastly buthelezi
meritocracy halamandaris are intently maisie
celibacy secularists are importantly cortese
biopharmacy deloris are inaccurately cassese
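
The POS-aware `build_data` keeps only the first letter of each tag, so Penn Treebank tags such as NN, NNS, and NNP all collapse into the single bucket N. The sketch below shows how that nested index separates rhyming words by bucket. It assumes hand-assigned tags in place of a call to `nltk.pos_tag`, and the `sample` rows and `build_index` name are illustrative only.

```python
from collections import defaultdict

# (word, phonemes, Penn Treebank tag) triples. Tags and pronunciations are
# assigned by hand for illustration instead of coming from nltk.
sample = [
    ("crazy", ["K", "R", "EY1", "Z", "IY0"], "JJ"),
    ("lazy", ["L", "EY1", "Z", "IY0"], "JJ"),
    ("hazy", ["HH", "EY1", "Z", "IY0"], "JJ"),
    ("daisy", ["D", "EY1", "Z", "IY0"], "NN"),
]


def build_index(rows, level=3):
    """Group words by rhyme key, then by the first letter of their tag."""
    index = defaultdict(lambda: defaultdict(list))
    for word, phonemes, tag in rows:
        key = "".join(phonemes[-level:])
        index[key][tag[0]].append(word)
    return index


index = build_index(sample)
# All four words share the key "EY1ZIY0", but they land in different buckets.
print(dict(index["EY1ZIY0"]))
```

All four words share the same three-phoneme ending, but "daisy" falls into the N bucket and so would never be offered as a replacement for the adjective "crazy".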

Reference

https://stackoverflow.com/questions/25714531/find-rhyme-using-nltk-in-python