Research ideas, thoughts and reflections: corpus linguistics

Saturday, 26 September 2009

Word frequencies: some fun

Having just written a paper on how the frequency of a word in a document can give us an idea of the semantics of that document, I thought it would be fun to run my paper through word cloud software, to summarise it by frequency... here is the result (courtesy of http://www.wordle.net)
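For anyone curious what sits underneath a word cloud, here is a minimal sketch of the frequency count it is built on. The function name `word_frequencies`, the simple tokenising rule and the sample sentence are all my own assumptions for illustration; this is not how wordle.net itself works.

```python
import re
from collections import Counter

def word_frequencies(text, top_n=5):
    """Return the top_n most frequent words in text, ignoring case."""
    # Assumption: a "word" is any run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

# Invented sample text, purely for illustration.
sample = "the cat sat on the mat and the dog sat too"
top_words = word_frequencies(sample, top_n=2)
print(top_words)
```

A real run would probably also filter out very common words such as "the" before counting, so that the content words stand out.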

Sunday, 8 March 2009

statistics in corpus linguistics - Sandra's discussion - and how it relates to my work

Nice clear article on Sandra's blog here, spelling out exactly how she's using statistics in her research on how word usage in bilingual people is influenced by their native language. She defines quite a few core statistical terms in a way that is easy to understand.
Statistical testing is likely to form one or more of my evaluative tests for creativity, as I will want to measure novelty and variance of the new piece of work, both within itself and compared to other comparable works.
This also links in to a previous post of Sandra's on assessing variability and homogeneity, both between corpora and inside an individual corpus of documents. Where she uses two corpora (essays written by native speakers of English and essays written by native speakers of French), could I be using corpora of pieces from a specific musical genre?
Summary of Sandra's post and how the content relates to my research
So as I understand it, you can use t tests to check whether the difference between what you expect to see (based on what you have in the existing corpus) and what you actually see is statistically significant. I wonder which criteria are best for measuring this (probably more than one) - a topic for a later post I think!
T tests are most effective with a small corpus (fewer than 30 items). They also rely on the assumption that the data is normally distributed and representative of the source population (i.e. in my case, that musical genre). [NB How could I check this, in real terms?]
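To make the t test idea a bit more concrete, here is a rough sketch of a one-sample t statistic: the gap between the sample mean and the expected mean, divided by the standard error of the sample mean. The function name `one_sample_t`, the choice of a one-sample test and the numbers are my own assumptions; the values stand in for some hypothetical per-piece measurement.

```python
import math
import statistics

def one_sample_t(sample, expected_mean):
    """t = (sample mean - expected mean) / (sample sd / sqrt(n))."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    return (mean - expected_mean) / (sd / math.sqrt(n))

# Invented measurements for five new pieces; 4.0 is the value the
# existing corpus would lead us to expect (also invented).
new_pieces = [4.0, 5.0, 6.0, 5.0, 5.0]
t_stat = one_sample_t(new_pieces, expected_mean=4.0)
print(round(t_stat, 3))
```

The statistic would then be compared against a t distribution with n - 1 degrees of freedom (4 here) to decide whether the difference is significant.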
Z scores can be useful for corpora with more than 30 items (still assuming the data in the corpus is normally distributed). A z score represents the number of standard deviations a sample piece is away from the mean of the corpus (in other words, how different one piece is from the standard set by the corpus, in comparison to other pieces also being tested).
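A z score is simple enough to sketch in a few lines. The function name `z_score` and the corpus values are assumptions made up for illustration, and I am treating the corpus as the whole population (hence `pstdev` rather than `stdev`), which is itself an assumption.

```python
import statistics

def z_score(value, corpus_values):
    """Number of standard deviations that value lies from the corpus mean."""
    mean = statistics.mean(corpus_values)
    sd = statistics.pstdev(corpus_values)  # population standard deviation
    return (value - mean) / sd

# Invented per-piece measurements for a small corpus.
corpus = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_score(9.0, corpus)
print(z)
```

In this toy corpus the mean is 5 and the standard deviation is 2, so a piece scoring 9 sits two standard deviations above the mean.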
Taking away the assumption that the data is normally distributed and representative of the larger population (that specific musical genre), chi-squared (χ²) tests come into play. χ² tests take two sets of frequencies and measure how much the two sets differ from each other. So, for example, taking pitch distribution (how many times does e.g. middle C appear in pieces from a specified genre), I could use χ² to measure how statistically different two sets of pieces are from each other on the basis of how notes are used.
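The pitch distribution example could be sketched roughly as below, using the Pearson chi-squared statistic on two aligned lists of counts. The function name `chi_squared` and the counts for three pitches are invented for illustration, and the sketch assumes every pitch occurs at least once across the two sets (otherwise an expected count would be zero).

```python
def chi_squared(counts_a, counts_b):
    """Pearson chi-squared statistic for two aligned lists of counts."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand_total = total_a + total_b
    statistic = 0.0
    for a, b in zip(counts_a, counts_b):
        column_total = a + b
        for observed, row_total in ((a, total_a), (b, total_b)):
            # Count expected if both sets used this pitch at the same rate.
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Invented counts of three pitches in two hypothetical sets of pieces.
genre_a = [10, 20, 30]
genre_b = [30, 20, 10]
chi2 = chi_squared(genre_a, genre_b)
print(chi2)
```

With two sets and three pitches, the statistic has (2 - 1) x (3 - 1) = 2 degrees of freedom, which is what you would need when looking up whether the difference is significant.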
I'm sure I'll return to this discussion when I get a bit more into the statistical testing, and how to establish significant results to measure creativity statistically.

Wednesday, 28 January 2009

blog progress

Well as a result of one of my friends (the lovely Sandra) finding my blog and reading about exactly what I'm interested in, we've come across something that could be quite helpful to my research - image schemas. I'd never heard of them until Sandra mentioned them but from what she says, they're definitely worth further investigation. Sandra, I'm coming over to your office now...!!
