SVALex · A CEFR-graded lexical resource for Swedish as a foreign language

SVALex is a lexicon of receptive vocabulary for Swedish as a second/foreign language (SVA) that reports the normalized frequencies of words (lemmas) across five of the six levels of the CEFR (Common European Framework of Reference for Languages); C2 is not covered. In addition to single-word entries, the list contains multi-word expressions with information on their usage at the different levels, which is rarely found in resources of this kind.

The frequencies have been estimated on COCTAILL, a corpus of SVA textbooks. More details on the corpus can be found in:

Elena Volodina, Ildikó Pilán, Stian Rødven Eide and Hannes Heidarsson. 2014. "You get what you annotate: a pedagogically annotated corpus of coursebooks for Swedish as a Second Language". Proceedings of the third workshop on NLP for computer-assisted language learning. NEALT Proceedings Series 22 / Linköping Electronic Conference Proceedings 107: 128–144.

An article on the resource in general, including the methods used to compute and normalize the word frequencies, is available:

Thomas François, Elena Volodina, Ildikó Pilán and Anaïs Tack. 2016. "SVALex: a CEFR-graded lexical resource for Swedish foreign and second language learners". Proceedings of LREC 2016, Slovenia.

Features

	Receptive lexicon: includes word frequencies observed in textbook reading activities and simplified readers
	CEFR levels: A1 · A2 · B1 · B2 · C1
	Lexical entries: lemma (`word`) · part of speech (`tag`)
	Computed metrics:
		`level_freq` · normalized frequency (per 1 million words) for each level of the CEFR
		`total_freq` · total normalized frequency in the source corpus
		`nb_doc` · document frequency

Format

The resource is distributed as a .CSV file with tab-separated values and 8 columns (see above), encoded in UTF-8. It can also be opened as a spreadsheet in Excel.

*Example of some entries from SVALex*
Lemma	POS-tag	A1	A2	B1	B2	C1	Total
bil	NN_UTR	430.2138	1234.2078	728.9847	422.283	363.5446	618.8567
överge	VB	0	0	7.3203	24.5182	39.6516	17.2695
rättvisa	NN_UTR	0	0	3.6601	25.6189	26.4344	13.6602
kilo	NN_NEU	0	302.0833	145.1229	65.0611	13.2172	89.8907
resa	VB	166.3009	375.2582	450.3526	298.4905	330.4297	356.362
låg	JJ	0	49.315	125.922	217.3103	252.1311	156.126
så klart	ABM_MWE	0	16.2635	81.6019	45.5033	13.2172	38.1738
till skillnad från	PPM_MWE	0	0	5.3395	2.409	3.6699	5.1839
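
As a rough illustration of the format described above, the following Python sketch parses tab-separated rows with the standard `csv` module. The rows are copied from the example table. The header names (`Lemma`, `POS-tag`, `Total`) are taken from that table and are an assumption about the distributed file, whose actual column names may differ (for instance `word`, `tag`, `total_freq`). The function name `load_entries` is hypothetical.

```python
import csv
import io

# Rows copied from the example table; the header names are an assumption
# and may not match the header of the distributed file.
SAMPLE = (
    "Lemma\tPOS-tag\tA1\tA2\tB1\tB2\tC1\tTotal\n"
    "bil\tNN_UTR\t430.2138\t1234.2078\t728.9847\t422.283\t363.5446\t618.8567\n"
    "överge\tVB\t0\t0\t7.3203\t24.5182\t39.6516\t17.2695\n"
    "så klart\tABM_MWE\t0\t16.2635\t81.6019\t45.5033\t13.2172\t38.1738\n"
)

LEVELS = ["A1", "A2", "B1", "B2", "C1"]


def load_entries(handle):
    """Read tab-separated rows into a dict keyed by (lemma, POS tag)."""
    reader = csv.DictReader(handle, delimiter="\t")
    entries = {}
    for row in reader:
        entries[(row["Lemma"], row["POS-tag"])] = {
            # Per-level normalized frequencies, converted to floats.
            "levels": {level: float(row[level]) for level in LEVELS},
            "total": float(row["Total"]),
        }
    return entries


entries = load_entries(io.StringIO(SAMPLE))
print(entries[("bil", "NN_UTR")]["levels"]["A2"])
```

To read a real file instead of the inline sample, the same function could be given a handle from `open(path, encoding="utf-8", newline="")`. Keying on both lemma and part of speech keeps homographs with different tags apart.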

Usage

The resource can be used to compare the frequency distribution of multiple words along the CEFR scale. An online query interface is available and can be accessed via the Search tab.
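
Such a comparison can also be sketched offline. The Python example below is a minimal illustration, not the method behind the Search tab: for a few lemmas from the example table it reports the lowest level with a non-zero frequency and the level with the highest frequency. The helper names `first_level` and `peak_level` are hypothetical.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1"]

# Per-level normalized frequencies copied from the example entries.
FREQS = {
    "bil": [430.2138, 1234.2078, 728.9847, 422.283, 363.5446],
    "överge": [0, 0, 7.3203, 24.5182, 39.6516],
    "kilo": [0, 302.0833, 145.1229, 65.0611, 13.2172],
}


def first_level(freqs):
    """Return the lowest CEFR level with a non-zero frequency, or None."""
    for level, value in zip(LEVELS, freqs):
        if value > 0:
            return level
    return None


def peak_level(freqs):
    """Return the CEFR level at which the frequency is highest."""
    best = max(range(len(freqs)), key=lambda i: freqs[i])
    return LEVELS[best]


summary = {lemma: (first_level(f), peak_level(f)) for lemma, f in FREQS.items()}
for lemma, (first, peak) in summary.items():
    print(f"{lemma}: first observed at {first}, most frequent at {peak}")
```

In this sample, `bil` is already present at A1, whereas `överge` only appears from B1 and keeps rising up to C1.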

Analyze

The resource can also be used to analyze the complexity of words in a text, in particular to identify which words in a text are likely to be difficult for learners at a given level. An online complexity analyzer is available and can be accessed via the Analyze tab.
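
One simple way to approximate this idea is sketched below in Python. It is an assumption made for illustration, not the algorithm used by the online analyzer: a lemma counts as difficult when the lowest level at which it has a non-zero frequency lies above the target level. The input is assumed to be already lemmatized, part-of-speech tags are ignored for brevity, and the function name `difficult_lemmas` is hypothetical.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1"]

# Lowest level with a non-zero frequency, derived from the example entries.
FIRST_LEVEL = {
    "bil": "A1",
    "resa": "A1",
    "kilo": "A2",
    "låg": "A2",
    "överge": "B1",
    "rättvisa": "B1",
}


def difficult_lemmas(lemmas, target_level):
    """Split lemmas into those first observed above target_level
    and those that are missing from the lexicon."""
    limit = LEVELS.index(target_level)
    too_hard, unknown = [], []
    for lemma in lemmas:
        level = FIRST_LEVEL.get(lemma)
        if level is None:
            unknown.append(lemma)  # not in the lexicon at all
        elif LEVELS.index(level) > limit:
            too_hard.append(lemma)  # first observed above the target level
    return too_hard, unknown


hard, unknown = difficult_lemmas(
    ["bil", "överge", "kilo", "rättvisa", "cykel"], "A2"
)
print("Above A2:", hard)
print("Not in lexicon:", unknown)
```

Lemmas absent from the lexicon are reported separately, because a missing entry does not by itself say whether a word is easy or hard.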

Authors

SVALex is the result of a collaboration between two teams:

The Center for Natural Language Processing (CENTAL) at UCLouvain;
The Språkbanken research unit of the University of Gothenburg.

Contributors

Brayan Delmée
Logo design

Dorian Ricci, Baptiste Degryse & Anaïs Tack
Prototype design

Damien De Meyere
Website maintenance
