SVALex is a lexicon of receptive vocabulary for Swedish as a second/foreign language (SVA) that reports the normalized frequencies of words (lemmas) across five of the six levels of the CEFR (Common European Framework of Reference for Languages), excluding C2. Besides single words, the list also contains multi-word expressions and their usage at the different levels, something rarely found in resources of this kind.

The frequencies were estimated from a corpus of SVA textbooks, COCTAILL. More details on the corpus can be found in:

Elena Volodina, Ildikó Pilán, Stian Rødven Eide and Hannes Heidarsson 2014. "You get what you annotate: a pedagogically annotated corpus of coursebooks for Swedish as a Second Language". Proceedings of the third workshop on NLP for computer-assisted language learning. NEALT Proceedings Series 22 / Linköping Electronic Conference Proceedings 107: 128–144.

An article on the resource in general, including the methods used to compute and normalize the word frequencies, is available:

Thomas François, Elena Volodina, Ildikó Pilán, Anaïs Tack. 2016. SVALex: a CEFR-graded lexical resource for Swedish foreign and second language learners. Proceedings of LREC 2016, Slovenia.

Features

Receptive lexicon
includes word frequencies observed in textbook reading activities and simplified readers

CEFR levels
A1 · A2 · B1 · B2 · C1

Lexical entries
lemma (word)
part of speech (tag)

Computed metrics
level_freq · normalized frequency (per 1 million words) for each level of the CEFR
total_freq · total normalized frequency in the source corpus
nb_doc · document frequency

Format

The format is a .CSV file with tab-separated values and 8 columns (see above), encoded in UTF-8. It can also be opened as a spreadsheet in Excel.

Lemma               POS-tag  A1        A2         B1        B2        C1        Total
bil                 NN_UTR   430.2138  1234.2078  728.9847  422.283   363.5446  618.8567
överge              VB       0         0          7.3203    24.5182   39.6516   17.2695
rättvisa            NN_UTR   0         0          3.6601    25.6189   26.4344   13.6602
kilo                NN_NEU   0         302.0833   145.1229  65.0611   13.2172   89.8907
resa                VB       166.3009  375.2582   450.3526  298.4905  330.4297  356.362
låg                 JJ       0         49.315     125.922   217.3103  252.1311  156.126
så klart            ABM_MWE  0         16.2635    81.6019   45.5033   13.2172   38.1738
till skillnad från  PPM_MWE  0         0          5.3395    2.409     3.6699    5.1839
Example entries from SVALex
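
The layout described above can be read with a standard tab-delimited parser. The following is a minimal sketch, assuming the file starts with the header row shown in the example and uses a single tab between columns. The three sample rows are copied from the example entries; the function name `read_svalex` is hypothetical and not part of the resource.

```python
import csv
import io

LEVELS = ["A1", "A2", "B1", "B2", "C1"]

# Three rows copied from the example entries, joined with tabs.
# Assumption: the real file has this header row and this column order.
SAMPLE = (
    "Lemma\tPOS-tag\tA1\tA2\tB1\tB2\tC1\tTotal\n"
    "bil\tNN_UTR\t430.2138\t1234.2078\t728.9847\t422.283\t363.5446\t618.8567\n"
    "överge\tVB\t0\t0\t7.3203\t24.5182\t39.6516\t17.2695\n"
    "så klart\tABM_MWE\t0\t16.2635\t81.6019\t45.5033\t13.2172\t38.1738\n"
)


def read_svalex(handle):
    """Return a dict keyed by (lemma, POS tag) with frequencies as floats."""
    # Splitting on tabs keeps multi-word lemmas such as "så klart" intact.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    entries = {}
    for row in reader:
        freqs = {level: float(row[level]) for level in LEVELS}
        freqs["Total"] = float(row["Total"])
        entries[(row["Lemma"], row["POS-tag"])] = freqs
    return entries


entries = read_svalex(io.StringIO(SAMPLE))
print(entries[("så klart", "ABM_MWE")]["B1"])
```

To read a downloaded copy instead of the inline sample, a handle such as `open("SVALex.csv", encoding="utf-8", newline="")` could be passed to `read_svalex`; the file name here is a placeholder.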

Usage

Search

The resource can be used to compare the frequency distribution of multiple words along the CEFR scale. An online query interface is available and can be accessed via the Search tab.
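
As a rough offline illustration of this kind of comparison, the sketch below summarizes where each word first appears on the CEFR scale and where its normalized frequency peaks. The frequencies are copied from the example entries; the helper names `first_level` and `peak_level` are hypothetical, and nothing here reflects how the online query interface is implemented.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1"]

# Per-level normalized frequencies copied from the example entries.
FREQS = {
    "bil": [430.2138, 1234.2078, 728.9847, 422.283, 363.5446],
    "överge": [0, 0, 7.3203, 24.5182, 39.6516],
    "kilo": [0, 302.0833, 145.1229, 65.0611, 13.2172],
}


def first_level(freqs):
    """Return the first CEFR level with a non-zero frequency, or None."""
    for level, freq in zip(LEVELS, freqs):
        if freq > 0:
            return level
    return None


def peak_level(freqs):
    """Return the CEFR level at which the frequency is highest."""
    best = max(range(len(LEVELS)), key=lambda i: freqs[i])
    return LEVELS[best]


summary = {word: (first_level(f), peak_level(f)) for word, f in FREQS.items()}
for word, (first, peak) in summary.items():
    print(f"{word}: first seen at {first}, most frequent at {peak}")
```

In this sample, "bil" is already present at A1, whereas "överge" only appears from B1 onward and keeps rising up to C1.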

Analyse

The resource can also be used to analyze the complexity of words in a text, in particular to identify which of the words in a text will be difficult at a given level. An online complexity analyzer is available and can be accessed via the Analyze tab.
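
One simple way to approximate this idea is sketched below. It assumes, purely for illustration, that a lemma counts as difficult at a target level when it has no observed frequency at that level or any level below it. This heuristic, the function name `difficult_lemmas`, and the lookup by lemma alone (ignoring the part-of-speech tag) are assumptions, not the method used by the online analyzer. The input is expected to be lemmatized already.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1"]

# Per-level normalized frequencies copied from the example entries.
FREQS = {
    "bil": [430.2138, 1234.2078, 728.9847, 422.283, 363.5446],
    "resa": [166.3009, 375.2582, 450.3526, 298.4905, 330.4297],
    "överge": [0, 0, 7.3203, 24.5182, 39.6516],
    "rättvisa": [0, 0, 3.6601, 25.6189, 26.4344],
}


def difficult_lemmas(lemmas, target_level, lexicon=FREQS):
    """Return lemmas never observed at or below target_level (assumed heuristic)."""
    cutoff = LEVELS.index(target_level) + 1
    hard = []
    for lemma in lemmas:
        freqs = lexicon.get(lemma)
        if freqs is None:
            # Not in the lexicon: no evidence either way, so it is skipped.
            continue
        if not any(freq > 0 for freq in freqs[:cutoff]):
            hard.append(lemma)
    return hard


text_lemmas = ["bil", "överge", "resa", "rättvisa"]
hard_at_a2 = difficult_lemmas(text_lemmas, "A2")
hard_at_b1 = difficult_lemmas(text_lemmas, "B1")
print(hard_at_a2, hard_at_b1)
```

A threshold on the frequency, rather than a plain non-zero check, would be an equally plausible design; the non-zero check is used here only because it needs no tuning.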

Authors

SVALex is the result of a collaboration between two teams:

Contributors

Brayan Delmée
Logo design

Dorian Ricci, Baptiste Degryse & Anaïs Tack
Prototype design

Damien De Meyere
Website maintenance