Sunday, 8 March 2009

statistics in corpus linguistics - Sandra's discussion - and how it relates to my work

Nice clear article on Sandra's blog here, spelling out exactly how she's using statistics in her research on how word usage in bilingual people is influenced by their native language. She defines quite a few core statistical terms in a way that is easy to understand.
Statistical testing is likely to form one or more of my evaluative tests for creativity, as I will want to measure novelty and variance of the new piece of work, both within itself and compared to other comparable works.
This also links in to a previous post of Sandra's on assessing variability and homogeneity, both between corpora and inside an individual corpus of documents. Where she uses two corpora (essays written by native speakers of English and essays written by native speakers of French), perhaps I could use corpora of pieces from a specific musical genre?
Summary of Sandra's post and how the content relates to my research
So as I understand it, you can use t tests to measure the difference between what you expect to see (based on what you have in the existing corpus) and what you actually see. I wonder which criteria are best for measuring this (probably more than one) - a topic for a later post I think!
T tests are most useful with a small corpus (fewer than 30 items). They also rely on the assumption that the data is normally distributed and representative of the source population (i.e. in my case, that musical genre). [NB How could I check this, in real terms?]
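As a rough sketch of the t test idea, here is a one-sample t statistic in Python: it compares the mean of a small set of new measurements with the mean expected from the corpus. The function name, the feature (mean MIDI pitch per piece) and all the numbers are made up for illustration, not taken from Sandra's post or from real data.

```python
import math
import statistics


def one_sample_t(sample, expected_mean):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    sample_mean = statistics.mean(sample)
    sample_sd = statistics.stdev(sample)      # sample standard deviation (n - 1)
    if sample_sd == 0:
        raise ValueError("sample has no variation")
    standard_error = sample_sd / math.sqrt(n)
    t = (sample_mean - expected_mean) / standard_error
    return t, n - 1


# Hypothetical data: mean MIDI pitch of five new pieces,
# compared with an assumed corpus mean of 60 (middle C).
new_piece_values = [62.0, 64.0, 66.0, 63.0, 65.0]
corpus_mean = 60.0

t_stat, degrees_of_freedom = one_sample_t(new_piece_values, corpus_mean)
print(f"t = {t_stat:.3f} with {degrees_of_freedom} degrees of freedom")
```

The t value would then be compared with a critical value from a t table for the given degrees of freedom to decide whether the difference is significant.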
Z scores can be useful for corpora of a larger sample size than 30 (still assuming the data in the corpus is normally distributed). A z score represents the number of standard deviations a sample piece is away from the mean of the corpus (in other words, how different one piece is from the standard set by the corpus, in comparison to other pieces also being tested).
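A z score is simple enough to compute by hand. The sketch below assumes each piece has been reduced to a single number (a hypothetical feature value) and uses the sample standard deviation of the corpus; the function name and the numbers are illustrative only.

```python
import statistics


def z_score(value, corpus_values):
    """Number of standard deviations `value` lies from the corpus mean."""
    corpus_mean = statistics.mean(corpus_values)
    corpus_sd = statistics.stdev(corpus_values)  # sample standard deviation
    if corpus_sd == 0:
        raise ValueError("corpus has no variation, z score is undefined")
    return (value - corpus_mean) / corpus_sd


# Hypothetical feature values for five pieces in the corpus,
# and one new piece to compare against them.
corpus_feature = [10.0, 12.0, 14.0, 16.0, 18.0]
new_piece_feature = 20.0

z = z_score(new_piece_feature, corpus_feature)
print(f"z = {z:.3f}")
```

A positive z means the new piece sits above the corpus mean, a negative z below it; the further from zero, the more unusual the piece is relative to the corpus.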
Taking away the assumption that the data is normally distributed, chi-squared (χ²) tests come into play. χ² tests take two sets of frequencies and measure how much these two sets differ from each other. So for example, taking pitch distribution (how many times does e.g. middle C appear in pieces from a specified genre), I could use χ² to measure how statistically different two sets of pieces are from each other on the basis of how notes are used.
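To make the pitch example concrete, here is a minimal sketch of the chi-squared statistic for two sets of note counts, treated as a 2 x k table (two sets of pieces, k pitches). The function name and the counts are invented for illustration; the expected counts are worked out in the usual way as row total times column total divided by the grand total.

```python
def chi_squared_statistic(counts_a, counts_b):
    """Chi-squared statistic and degrees of freedom for two frequency lists."""
    if len(counts_a) != len(counts_b):
        raise ValueError("both sets need a count for every category")
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each set needs at least one observation")
    grand_total = total_a + total_b

    chi2 = 0.0
    categories_used = 0
    for a, b in zip(counts_a, counts_b):
        column_total = a + b
        if column_total == 0:
            continue  # pitch never occurs in either set, so it adds nothing
        categories_used += 1
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column_total / grand_total
            chi2 += (observed - expected) ** 2 / expected

    # (rows - 1) * (columns - 1) with two rows
    return chi2, categories_used - 1


# Hypothetical counts of the pitches C, D and E in two sets of pieces.
set_a_counts = [30, 20, 10]
set_b_counts = [20, 20, 20]

chi2, degrees_of_freedom = chi_squared_statistic(set_a_counts, set_b_counts)
print(f"chi-squared = {chi2:.3f} with {degrees_of_freedom} degrees of freedom")
```

As with the t test, the statistic is compared with a critical value from a chi-squared table for the given degrees of freedom. The usual rule of thumb is that the expected counts should not be too small for the result to be reliable.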
I'm sure I'll return to this discussion when I get a bit more into the statistical testing, and how to establish significant results to measure creativity statistically.


  1. Thanks for flagging that up - very useful

  2. Nice post, Anna, thanks ;-)

    About your 'NB', the distribution of your data can be represented graphically. Depending on the shape of your curve, you can establish whether your data is normally distributed or not.
    I'll try and put a few bullet points together for you on how to do it (probably towards the end of the week).