SVALex is a lexicon of receptive vocabulary for Swedish as a second or foreign language (SVA). It reports the normalized frequencies of words (lemmas) across five of the six levels of the CEFR (Common European Framework of Reference for Languages); C2 is excluded. In addition to single words, the list contains multi-word expressions and information on their usage at different levels, which is rarely available in resources of this kind.
The frequencies have been estimated on COCTAILL, a corpus of SVA textbooks. More details on the corpus can be found in:
Elena Volodina, Ildikó Pilán, Stian Rødven Eide and Hannes Heidarsson 2014. "You get what you annotate: a pedagogically annotated corpus of coursebooks for Swedish as a Second Language". Proceedings of the third workshop on NLP for computer-assisted language learning. NEALT Proceedings Series 22 / Linköping Electronic Conference Proceedings 107: 128–144.
An article on the resource in general, including the methods used to compute and normalize the word frequencies, is available:
Thomas François, Elena Volodina, Ildikó Pilán, Anaïs Tack. 2016. SVALex: a CEFR-graded lexical resource for Swedish foreign and second language learners. Proceedings of LREC 2016, Slovenia.
Features
Feature | Description |
---|---|
Receptive lexicon | Includes word frequencies observed in textbook reading activities and simplified readers |
CEFR levels | A1 · A2 · B1 · B2 · C1 |
Lexical entries | lemma (word) · part of speech (tag) |
Computed metrics | level_freq: normalized frequency (per 1 million words) for each level of the CEFR · total_freq: total normalized frequency in the source corpus · nb_doc: document frequency |
Format
The resource is distributed as a .CSV file with tab-separated values, 8 columns (see above) and UTF-8 encoding. It can also be opened in a spreadsheet application such as Excel.
Lemma | POS-tag | A1 | A2 | B1 | B2 | C1 | Total |
---|---|---|---|---|---|---|---|
bil | NN_UTR | 430.2138 | 1234.2078 | 728.9847 | 422.283 | 363.5446 | 618.8567 |
överge | VB | 0 | 0 | 7.3203 | 24.5182 | 39.6516 | 17.2695 |
rättvisa | NN_UTR | 0 | 0 | 3.6601 | 25.6189 | 26.4344 | 13.6602 |
kilo | NN_NEU | 0 | 302.0833 | 145.1229 | 65.0611 | 13.2172 | 89.8907 |
resa | VB | 166.3009 | 375.2582 | 450.3526 | 298.4905 | 330.4297 | 356.362 |
låg | JJ | 0 | 49.315 | 125.922 | 217.3103 | 252.1311 | 156.126 |
så klart | ABM_MWE | 0 | 16.2635 | 81.6019 | 45.5033 | 13.2172 | 38.1738 |
till skillnad från | PPM_MWE | 0 | 0 | 5.3395 | 2.409 | 3.6699 | 5.1839 |
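A file in this format can be read with Python's standard `csv` module. The sketch below assumes that the header row uses the column names shown in the sample table (`Lemma`, `POS-tag`, `A1` … `C1`, `Total`); the header of the distributed file may differ, so adjust the names if needed. The function name `load_svalex` is hypothetical.

```python
import csv
import io

LEVELS = ["A1", "A2", "B1", "B2", "C1"]


def load_svalex(handle):
    """Read tab-separated rows into a dict keyed by (lemma, POS tag)."""
    # QUOTE_NONE keeps any quotation marks inside lemmas as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    lexicon = {}
    for row in reader:
        key = (row["Lemma"], row["POS-tag"])  # assumed column names
        lexicon[key] = {name: float(row[name]) for name in LEVELS + ["Total"]}
    return lexicon


# Two rows copied from the sample table, with an assumed header line.
SAMPLE = (
    "Lemma\tPOS-tag\tA1\tA2\tB1\tB2\tC1\tTotal\n"
    "bil\tNN_UTR\t430.2138\t1234.2078\t728.9847\t422.283\t363.5446\t618.8567\n"
    "så klart\tABM_MWE\t0\t16.2635\t81.6019\t45.5033\t13.2172\t38.1738\n"
)

lexicon = load_svalex(io.StringIO(SAMPLE))
print(lexicon[("så klart", "ABM_MWE")]["B1"])
```

For a file on disk, pass `open(path, encoding="utf-8", newline="")` instead of the in-memory sample. Keying on the lemma together with its tag keeps homographs with different parts of speech apart, and multi-word expressions need no special handling because the fields are separated by tabs rather than spaces.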
Usage
Search: The resource can be used to compare the frequency distribution of multiple words along the CEFR scale. An online query interface is available via the Search tab.
Analyse: The resource can also be used to analyze the complexity of words in a text, in particular to identify which words will be difficult at a given level. An online complexity analyzer is available via the Analyze tab.
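Both uses can also be approximated offline. The Python sketch below uses one simple heuristic, which is an assumption and not necessarily what the online analyzer does: a lemma counts as difficult at a target level if its first non-zero frequency occurs at a higher level, or if it is missing from the lexicon. The frequencies are copied from the sample table; the function names are hypothetical, and the input text is assumed to be lemmatized already.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1"]

# Per-level frequencies from the sample table (keyed by lemma only, for brevity).
FREQS = {
    "bil": [430.2138, 1234.2078, 728.9847, 422.283, 363.5446],
    "överge": [0, 0, 7.3203, 24.5182, 39.6516],
    "kilo": [0, 302.0833, 145.1229, 65.0611, 13.2172],
    "låg": [0, 49.315, 125.922, 217.3103, 252.1311],
}


def first_level(lemma):
    """Lowest CEFR level with a non-zero frequency, or None if unknown."""
    freqs = FREQS.get(lemma)
    if freqs is None:
        return None
    for level, freq in zip(LEVELS, freqs):
        if freq > 0:
            return level
    return None


def difficult_lemmas(lemmas, target_level):
    """Lemmas first attested above target_level, or absent from the lexicon."""
    target = LEVELS.index(target_level)
    hard = []
    for lemma in lemmas:
        level = first_level(lemma)
        if level is None or LEVELS.index(level) > target:
            hard.append(lemma)
    return hard


text = ["bil", "kilo", "överge", "låg"]
print(difficult_lemmas(text, "A1"))  # ['kilo', 'överge', 'låg']
print(difficult_lemmas(text, "A2"))  # ['överge']
```

A frequency threshold above zero, or a comparison against the total frequency, would be equally defensible choices; the first-occurrence rule is used here only because it is the easiest to read off the table.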
Authors
SVALex is the result of a collaboration between two teams:
- The Center for Natural Language Processing (CENTAL) at UCLouvain;
- The Språkbanken research unit of the University of Gothenburg.
Contributors
- Brayan Delmée: logo design
- Dorian Ricci, Baptiste Degryse & Anaïs Tack: prototype design
- Damien De Meyere: website maintenance