Friday, July 25, 2014

Three easy steps to your own iCorpus

Several colleagues have asked for a follow-up to the chapter "iCorpus: making corpora meaningful for pre-service teacher education" that Alev Özbilgin and I wrote for Innovations in pre-service education and training for English language teachers, edited by Julian Edge and Steve Mann.  The main requests have been for a simple step-by-step guide to creating an iCorpus and using tools to analyze it.

I'm assuming that you have read our chapter and understand the nature of the iCorpus and how it can be used for self-directed language development.  If you don't have time to read the full chapter, I've added a summary at the end giving a bit of background to the iCorpus: why we saw a need for it, how we developed it, what it is, and why it is useful as a language learning tool.

Step 1:  Building the iCorpus

  1. Create a folder titled "icorpus"
  2. Open any of the writing you want to include in your word processor.
    • Choose the SAVE AS option, and select PLAIN TEXT (sometimes TXT)
    • Make sure you name your files logically.  For example, if this is an academic iCorpus, you could start each file with the course code it was written for.
    • Then direct the SAVE AS option to save into your 'iCorpus' folder.
  3. Continue adding as many of your writings as you want.  
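If you have many files, the copy-and-save routine above can be scripted. Here is a minimal Python sketch; the source folder name and the course-code prefix are illustrative assumptions, not part of any fixed workflow. It gathers plain-text writings into an "icorpus" folder, saved as UTF-8, with a logical naming prefix so related texts sort together:

```python
from pathlib import Path

def build_icorpus(source_dir, corpus_dir="icorpus", prefix=""):
    """Copy plain-text writings into the iCorpus folder.

    Each file is re-saved as UTF-8 .txt, optionally prefixed
    (e.g. with a course code) so related texts appear together.
    """
    corpus = Path(corpus_dir)
    corpus.mkdir(exist_ok=True)
    for src in sorted(Path(source_dir).glob("*.txt")):
        text = src.read_text(encoding="utf-8", errors="replace")
        dest = corpus / f"{prefix}{src.stem}.txt"
        dest.write_text(text, encoding="utf-8")
    return sorted(p.name for p in corpus.glob("*.txt"))

# Hypothetical usage: collect essays from a 'my_writing' folder,
# prefixed with the course code they were written for.
# build_icorpus("my_writing", prefix="ENG101_")
```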

Step 2: Building your reference corpus

  1. In order to analyze your iCorpus, you need to have a 'reference' corpus of writing that represents how you would like to write.
  2. Create a separate folder for your reference corpus, and title it "refcorpus".  
    • You may create different reference corpora, in which case you can name them differently to keep them distinct.
  3. Find articles or writing that represent your target writing.  
    • For an academic reference corpus, this may be articles that you have read in your course, or readings recommended by your instructor.
  4. Depending on the nature of the original files, you need to SAVE AS to get these files into PLAIN TEXT (or TXT), as in building your iCorpus.
    • Word processing programs will have a FILE > SAVE AS option to plain text.
    • For PDFs, there is a FILE > SAVE AS TEXT option.  Note, this only works for PDFs created from electronic texts, not scanned pages as images.
    • For web pages, you will need to highlight and copy the text you want.  Then open a text editor like NOTEPAD, paste the content there, and then SAVE as a TXT file.
  5. As with the iCorpus, it is useful to adopt a logical file-naming protocol, using prefixes so that related reference texts appear together in the output.
  6. There are some 'prêt-à-porter' reference corpora here for download:  academic corpus, graded reader corpus, British spoken English, British written English, TV English.
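For web pages, the highlight-copy-paste routine in step 4 can also be automated. The sketch below uses only Python's standard library; the tag-skipping logic is a deliberate simplification, and real pages may need more cleanup before they are corpus-ready:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    """Reduce an HTML page to plain text, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

The output can then be written to a TXT file in the "refcorpus" folder, exactly as with the NOTEPAD route.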


Step 3: Putting it all together in ANTCONC

  1. Download the ANTCONC program and install it on your computer (note this is free to download and use).
  2. Run the program.
  3. In the main program window, open the iCorpus folder (FILE > OPEN DIR)
  4. All the individual files in your iCorpus will appear in the top left window pane.  It is then easy to interrogate the iCorpus on its own.
    • Click on WORD LIST to see a list of all the words in the iCorpus by frequency.
    • Click on a word in the WORD LIST view, and you will see a KWIC (key word in context) concordance, showing each instance of that word in the iCorpus in context.
    • Click CLUSTERS to see occurrences of groups of words that include that word.
    • Click COLLOCATES to see a list of collocates by frequency.
  5. In order to compare the iCorpus with a reference (target) corpus, you need to load the reference corpus into the program.
    • Go to TOOLS PREFERENCES > KEYWORD LIST and under "Reference Corpus Options" click CHOOSE FILES to choose a corpus as a single file, or ADD DIRECTORY to choose a corpus that consists of individual files in a folder.
    • You will see the reference corpus file(s) appear in the window pane.
    • Then click APPLY.
  6. Now, in the main program window, click on the KEYWORD tab and then click the START button.
    • You will see a list that functions the same as WORD LIST, but the words listed are those that appear more frequently in the iCorpus than the reference corpus.
  7. To see the 'mirror' list of words that occur more frequently in the reference corpus than in the iCorpus, you can 'swap' the two corpora.
    • Go to TOOLS PREFERENCES > KEYWORD LIST and under "Reference Corpus Options" click SWAP REF/MAIN FILES to swap the iCorpus with the reference corpus.
    • You will see the iCorpus file(s) appear in the window pane.
    • Then click APPLY.
  8. Now, in the main program window, click on the KEYWORD tab and then click the START button.
    • You will see a list that functions the same as WORD LIST, but the words listed are those that appear more frequently in the reference corpus than the iCorpus.
  9. You can 'swap' back to the iCorpus at any time using the same procedure.
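The KEYWORD comparison in steps 5-8 can be approximated in plain Python. The sketch below ranks words by a simple smoothed ratio of normalized frequencies; note that ANTCONC itself uses statistical keyness measures such as log-likelihood or chi-squared, so this is only a rough illustration of the idea:

```python
import re
from collections import Counter

def word_counts(text):
    """Lower-cased word frequencies from raw text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def keywords(target_text, reference_text, top=10):
    """Words ranked by how much more frequent they are in the target
    corpus than in the reference corpus (simple normalized ratio,
    with +1 smoothing so unseen reference words do not divide by zero)."""
    t, r = word_counts(target_text), word_counts(reference_text)
    t_total, r_total = sum(t.values()), sum(r.values())
    scores = {}
    for word, n in t.items():
        t_rel = n / t_total
        r_rel = (r.get(word, 0) + 1) / (r_total + 1)
        scores[word] = t_rel / r_rel
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

A word that the writer overuses relative to the reference corpus, like 'therefore' in the screenshots below, would float to the top of this list.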

Background to the iCorpus

In our formative years as teacher trainers, language corpora were restricted to the lofty domains of academia and applied linguistics. It was something that ‘linguists’ did, and was mainly of academic interest with little relevance to the practical realities of teacher education. However, in less than two decades corpus-informed approaches have become commonplace in the language teaching profession (Braun, Kohn, & Mukherjee, 2006; Cobb, 2010; Paquot, 2010; Yoon, 2011). As educators, we have tried to keep abreast of the current trends in language teaching to explore how to incorporate corpora into foreign language instruction (Carter, McCarthy & O'Keeffe, 2007; Pérez-Paredes et al., 2011; Yoon, 2008).

Despite our efforts to acknowledge the paradigm shift towards corpora in language education, we felt that our TEFL programme was similar to the vast majority of teacher education programmes which Farr (2010) asserts take little to no account of the “corpus revolution” – a revolution which McCarthy (2008, p. 573) argues can no longer be sidelined in teacher training. Considering the needs of our pre-service teachers, we wanted to explore the practical relevance and applicability of McCarthy’s argument to integrate the corpus revolution in teacher education.

Why we saw a need for the iCorpus

Obviously, as teacher educators we felt the need to train and equip pre-service teachers to apply data-driven learning in their own teaching practice and to develop skills in their own use of corpora. What is more, we were acutely aware that non-native English speakers make up 80% of the English language teachers in the world (Canagarajah, as cited in Moussu & Llurda, 2008), for whom English language corpora are an invaluable resource, both as references to consult in teaching and as tools for exploring their own knowledge of English.

In our methodology courses, we pay lip-service to the corpus revolution, and include mention of how corpora have been aligned with approaches to language learning, such as the lexical approach. However, like most teacher educators, we had little practical first-hand knowledge of the use of modern technology and contemporary corpora for language development. We understood the fundamentals of what is meant by ‘corpus’, which Hunston (2002) defines as “a collection of naturally occurring examples of language, consisting of anything from a few sentences to a set of written texts or tape recordings which have been collected for linguistic study” (p. 2). According to this definition, a corpus can range from hundreds of billions of words, as in the GOOGLE Books corpus, to just the few short lines of a letter. So, it was clear to us that any institution, trainer, teacher, or even learner can create and exploit bespoke corpora using simple word processing tools and an internet connection. What we sought to resolve was exactly how such an ability could impact the education of pre-service teachers through the use of corpora.

This led us to explore ways that teacher educators ‘use’ corpora. Biber et al. (1998) state that a ‘corpus-informed’ approach gives the opportunity for empirically “analysing the actual patterns of use in natural texts” (p. 4), allowing people to interrogate repositories of authentic texts about grammar, vocabulary, collocates, register and much more, providing an extremely powerful resource for promoting language awareness. Typically, we discovered, this requires some type of software to search and interrogate the texts, which are stored as data, and display the results in output that is meaningful to humans. Software tools that reveal patterns based on selected words or phrases are known as concordancing tools. Other tools may provide a lexical frequency profile based purely on the frequency of words, or according to the occurrence of words from pre-defined lists. Along the way we had to learn a whole new set of jargon, like lemmatization, and abbreviations, like KWIC (key word in context). At this point, clear on the nature of corpora in general and the technology with which they are used, we began to ponder how we as teacher educators should deal with this, and in particular what we could offer our TEFL students in their development both as teachers and as English language learners.
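For readers curious what a concordancer actually does, the core KWIC idea fits in a few lines of Python. This is only a toy illustration of the concept, not how ANTCONC or any real tool is implemented:

```python
import re

def kwic(text, keyword, width=30):
    """Return key-word-in-context lines: each whole-word match of
    `keyword`, centred between `width` characters of left and right
    context, aligned in columns like a concordancer display."""
    lines = []
    for m in re.finditer(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} {m.group(0)} {right:<{width}}")
    return lines
```

Feeding it a paragraph and a search word produces one aligned line per occurrence, which is exactly the kind of display that makes usage patterns jump out.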

There was no doubt in our minds of the significance and value of a corpus-informed approach in teacher education. Corpora can obviously develop language awareness (Tribble and Barlow, 2001). But, what is more, putting a bank of linguistic data at the fingertips of teachers can inform syllabus and materials design, classroom exploitation, and language assessment. The range and variety of corpora that can be used or developed is unlimited. While we discovered many ready-made corpora that we were free to use, such as http://corpus.byu.edu, we also discovered that it is possible for us to create our “own specialised corpora to reflect the kind of language [we] want to investigate” (Hunston, 2002, p. 14). Two examples of such bespoke corpora which are specifically compiled for teaching and learning purposes are learner corpora, which show how learners at different stages attempt to use the target language, and pedagogical corpora, which present the target language that learners will need to master (Willis, 2003, p. 165). As we started to develop our own proficiency in corpus use, we experienced in our own explorations into the English language how a corpus-informed approach exposes the heart of language, which is the essence that binds pre-service teachers and us as trainers in the undertaking of learning and teaching language.

How we developed the iCorpus approach

Although corpus-informed methodology is wide-ranging, we had discovered that the essential elements are not complex to learn or apply. However, from a teacher education perspective, we felt that introducing corpus work in an isolated session or two would not be enough to introduce TEFL students to the sheer scope and value of the available tools. We were mindful of McCarthy (2008), who advocates a paradigm shift in which TEFL professionals are not just aware of corpora, but fully integrate a corpus-informed approach into their own professional practice. He argues that this must be done throughout pre-service training, so that trainee teachers have ample time and opportunity to assimilate the essential components of a corpus-informed approach and build a practical foundation which they can carry into their professional life.

It was in the spirit and essence of McCarthy’s vision that we initiated the iCorpus, to delve into the notion that our pre-service TEFL students, who as non-native English speakers are language learners in their own right, can be empowered to create their own ‘individual corpora’ (iCorpora) and use corpus and text analysis tools not only to guide them in self-directed language development but also to develop a critical awareness of the nature of the language they have chosen to teach.

Most approaches to using corpora feature the teacher or researcher as a central and controlling agent in collecting texts, building either a learner or a pedagogic corpus, and analysing the data. However, we wanted to put the learner at the centre of the process we referred to as the iCorpus. To see exactly what an ‘iCorpus’ is and how it can help develop a deeper awareness of our own use of language, see Chapter 11 in Innovations in pre-service education and training for English language teachers.  The following screenshot captions briefly illustrate the various steps in using ANTCONC with an iCorpus and a reference corpus.


View of concordance of a student iCorpus, showing a KWIC in alphabetical order.

With an academic reference corpus loaded for comparison, this KEYWORD list shows words that appear in the iCorpus relatively more frequently than in the reference corpus.  This highlights a selection of words the writer uses that would be a priority to investigate.  For example, here the word 'therefore' is used more frequently in the iCorpus than one would expect from its frequency of occurrence in the reference corpus.

Clicking on the word 'therefore' in the KEYWORD list, we can see the KWIC extracts of how the writer used the word 'therefore' in their writings.  The file names in the right side panel show the origin.

Using the SWAP feature, the writer can see how the word 'therefore' is used in the reference corpus.  Although there are more entries, the reference corpus is much larger, so relative to the total number of words, 'therefore' is used less in the reference corpus than in the iCorpus.
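The "relative to the total number of words" comparison above is simple arithmetic: normalize each raw count to occurrences per million words before comparing corpora of different sizes. The figures below are hypothetical, chosen only to illustrate the calculation:

```python
def per_million(count, corpus_size):
    """Frequency normalized to occurrences per million words."""
    return count / corpus_size * 1_000_000

# Hypothetical figures: 12 hits of 'therefore' in a 4,000-word
# iCorpus versus 180 hits in a 1,000,000-word reference corpus.
icorpus_rate = per_million(12, 4_000)         # 3000 per million
reference_rate = per_million(180, 1_000_000)  # 180 per million
```

Even though the reference corpus has more raw hits, the normalized rates show the iCorpus overusing the word.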


One resource students were introduced to is Just-the-word, a collocation dictionary based on the British National Corpus.

A click on the collocation heading in the display above yields a complete KWIC from the entire British National Corpus.

Another resource, http://wordandphrase.info by Mark Davies (http://corpus.byu.edu), shows collocations according to frequency in different genres.

References

Anthony, L. (2011). AntConc (Version 3.2.2) [Computer Software]. Tokyo, Japan: Waseda University. Available from http://www.antlab.sci.waseda.ac.jp/
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use. Cambridge: CUP.
BNC Consortium. (2005). The British National Corpus. Available from http://www.natcorp.ox.ac.uk
Braun, S., Kohn, K., & Mukherjee, J. (Eds). (2006). Corpus technology and language pedagogy: New resources, new tools, new methods. (English Corpus Linguistics, Volume 3) Frankfurt/Main: Peter Lang
Carter, R., McCarthy, M., & O'Keeffe, A. (2007). From corpus to classroom: language use and language teaching (Applied Linguistics Non Series) (1 ed.). New York: Cambridge University Press.
Cobb, T. (2010). Learning about language and learners from computer programs. Reading in a Foreign Language, 22/1. pp. 181-200.
Davies, Mark. (2012-) Word and phrase interface to the corpus of contemporary American English: 450 million words, 1990-present. Available online at http://www.wordandphrase.info/.
Farr, F. (2010) How can corpora be used in teacher education? In O’Keefe & McCarthy (eds.) The Routledge Handbook of Corpus Linguistics (pp. 620-632). Abington, England: Routledge Handbooks.
Hancioglu, N., Neufeld, S., & Eldridge, J. (2008). Through the looking glass and into the land of lexico-grammar. English for Specific Purposes, 27/4, pp. 459-479.
Heatly, A., Nation, I., Coxhead, A. (2002). Range. [Computer Software]. Retrieved November 4, 2010, from. http://www.victoria.ac.nz/lals/staff/paul-nation/nation.aspx
Huberman, A. M., & Miles, M. B. (2001). Data management and analysis methods. In Conrad, C. F., Haworth, J. G., & Lattuca, L. R. (eds), Qualitative research in higher education: Expanding perspectives (pp. 553-71). New York, NY: Pearson.
Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: CUP.
Lacey, C. (1977). The socialization of teachers. London: Methuen.
McCarthy, M. (2008). Accessing and interpreting corpus information in the teacher education context. Language Teaching. 41/4, pp. 563-574.
Merriam, S.H. (2001). Qualitative research and case study applications in education. San Francisco: Jossey-Bass Publications.
Moussu, L., & Llurda, E. (2008). Non-native English-speaking English language teachers: History and research. Language Teaching, 41/3. pp. 315-348. doi: 10.1017/S0261444808005028
Paquot, M. (2010). Academic vocabulary in learner writing: from extraction to analysis (Corpus and Discourse). New York: Continuum.
Pérez-Paredes, P., Sánchez-Tornel, M., Calero, J. M. A., & Jiménez, P. A. (2011). Tracking learners' actual uses of corpora: guided vs non-guided corpus consultation. Computer Assisted Language Learning, 24/3. pp. 233-253.
Seidman, I. (1998). Interviewing as qualitative research. NY: Teachers College Press.
Stake, R. (1998). Case studies. In Denzin, N., & Lincoln, Y. (eds.), Strategies of qualitative inquiry (pp. 89-109). Thousand Oaks, CA: Sage.
Tribble C. & Barlow M. (eds.) (2001) Special Issue, Language Learning & Technology 5/3 (September 2001), "Using Corpora in Language Teaching and Learning": http://llt.msu.edu
Tribble, C., & Jones, G. (1997). Concordances in the classroom: A resource book for teachers.  Houston, USA: Athelstan Publications.
Willis, D. (2003). Rules, patterns and words: Grammar and lexis in English language teaching. Cambridge: CUP.
Xue, Guoyi, & Nation, I.S.P. (1984). A University Word List. Language Learning and Communication 3/2: pp. 215-229
Yoon, C. (2011). Concordancing in L2 writing class: An overview of research and issues. Journal of English for Academic Purposes, 10/3. pp 130-139. http://www.sciencedirect.com/science/article/pii/S1475158511000191
Yoon, H. (2008). More than a linguistic reference: the influence of corpus technology on L2 academic writing. Language, Learning & Technology, 12/2. pp 31-48. http://llt.msu.edu/vol12num2/yoon.pdf
