Introduction to SVALex and SweLLex

What is SVALex?

SVALex is a lexicon of receptive vocabulary for Swedish as a second/foreign language (SVA). It reports the normalized frequencies of words (lemmas) across five of the six levels of the CEFR (Common European Framework of Reference for Languages), excluding C2. Besides single words, the list also contains multi-word expressions with information on their usage at different levels, something that is rarely present in resources of this kind.

The frequencies have been estimated on a corpus of SVA textbooks, COCTAILL. More details on the corpus can be found in:

Elena Volodina, Ildikó Pilán, Stian Rødven Eide and Hannes Heidarsson 2014. "You get what you annotate: a pedagogically annotated corpus of coursebooks for Swedish as a Second Language". Proceedings of the third workshop on NLP for computer-assisted language learning. NEALT Proceedings Series 22 / Linköping Electronic Conference Proceedings 107: 128–144.

An article on the resource in general, including the methods for computing and normalizing the word frequencies, is available:

Thomas François, Elena Volodina, Ildikó Pilán, Anaïs Tack. 2016. SVALex: a CEFR-graded lexical resource for Swedish foreign and second language learners. Proceedings of LREC 2016, Slovenia.

What is SweLLex?

SweLLex is a lexicon of productive vocabulary for Swedish as a second/foreign language (SVA). Like its sister resource, SVALex, it reports the normalized frequencies of words (lemmas) across six levels of the CEFR (Common European Framework of Reference for Languages). As in SVALex, it covers both single words and multi-word expressions, with information on their usage at different levels, something that is rarely present in resources of this kind.

The frequencies have been estimated on a corpus of essays written by SVA learners, the SweLL corpus, described in the article:

Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg, Monica Sandell. 2016. SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies. Proceedings of LREC 2016, Slovenia.

More details on the SweLLex resource are provided in the following article:

Elena Volodina, Ildikó Pilán, Lorena Llozhi, Baptiste Degryse, Thomas François. 2016. SweLLex: second language learners' productive vocabulary. To appear in Proceedings of the workshop on NLP4CALL&LA. NEALT Proceedings Series / Linköping Electronic Conference Proceedings

If you are using SweLLex or SVALex, please cite the respective articles on these resources.

What's in SVALex?

For every word in SVALex, you can find its part-of-speech (POS) along with its normalized frequency for each level of the CEFR, and its total normalized frequency in the source corpus. Here are some of the entries from SVALex:

Lemma               POS-tag  A1        A2         B1        B2        C1        Total
bil                 NN_UTR   430.2138  1234.2078  728.9847  422.283   363.5446  618.8567
överge              VB       0         0          7.3203    24.5182   39.6516   17.2695
rättvisa            NN_UTR   0         0          3.6601    25.6189   26.4344   13.6602
kilo                NN_NEU   0         302.0833   145.1229  65.0611   13.2172   89.8907
resa                VB       166.3009  375.2582   450.3526  298.4905  330.4297  356.362
låg                 JJ       0         49.315     125.922   217.3103  252.1311  156.126
så klart            ABM_MWE  0         16.2635    81.6019   45.5033   13.2172   38.1738
till skillnad från  PPM_MWE  0         0          5.3395    2.409     3.6699    5.1839
Example of some entries from SVALex
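
One way to read such entries is to ask at which level a lemma first appears in the textbooks. The sketch below is only an illustration, not part of the resource: it copies a few rows from the SVALex example table, and the helper name first_level is hypothetical.

```python
# CEFR levels covered by the example table, in ascending order.
LEVELS = ["A1", "A2", "B1", "B2", "C1"]

# A few rows copied from the SVALex example table:
# (lemma, POS tag, normalized frequencies for A1..C1).
ENTRIES = [
    ("bil", "NN_UTR", [430.2138, 1234.2078, 728.9847, 422.283, 363.5446]),
    ("överge", "VB", [0, 0, 7.3203, 24.5182, 39.6516]),
    ("kilo", "NN_NEU", [0, 302.0833, 145.1229, 65.0611, 13.2172]),
    ("till skillnad från", "PPM_MWE", [0, 0, 5.3395, 2.409, 3.6699]),
]


def first_level(freqs):
    """Return the lowest level with a non-zero frequency, or None.

    Hypothetical helper written for this illustration.
    """
    for level, freq in zip(LEVELS, freqs):
        if freq > 0:
            return level
    return None


# Map each lemma to the first level at which it is observed.
first_seen = {lemma: first_level(freqs) for lemma, _pos, freqs in ENTRIES}

for lemma, level in first_seen.items():
    print(lemma, level)
```

A zero frequency only means that the lemma was not observed at that level in the source corpus, so the result should be treated as an indication rather than a definitive grading.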

What's in SweLLex?

SVALex and SweLLex have been compiled the same way and contain the same information. For every word in SweLLex, you can find its part-of-speech (POS) along with its normalized frequency for each level of the CEFR, and its total normalized frequency in the source corpus. Here are some of the entries from SweLLex:

Lemma     POS-tag  A1  A2        B1         B2        C1       Total
bil       NN_UTR   0   395.7294  441.4955   38.6082   88.3802  219.5917
rättvisa  NN_UTR   0   0         0          0         0.1441   0.0256
kilo      NN_NEU   0   0         0          0.268     0        0.014
resa      VB       0   134.2754  1297.9139  183.2226  43.6672  363.7923
låg       JJ       0   0         0          37.4576   76.5894  41.0983
Example of some entries from SweLLex

SweLLex contains a number of items marked with asterisks (* or **). These are items that could not be (automatically) matched to any of the entries in our lexicons. They can include misspelled words, nonexistent words, compounds not present in the lexicon, and items with wrong part-of-speech tags.
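
Depending on the task, it may be useful to separate the asterisk-marked items from the rest. The sketch below assumes that the marker is attached to the lemma string itself, at its start or end; this placement is an assumption and should be checked against the downloaded file. The function names and the sample strings are invented for illustration and are not taken from SweLLex.

```python
def is_unmatched(lemma):
    """True if the lemma carries a * or ** marker.

    Assumes the marker is attached to the start or end of the lemma string.
    """
    text = lemma.strip()
    return text.startswith("*") or text.endswith("*")


def split_by_marker(lemmas):
    """Split lemmas into (matched, unmatched) lists, preserving order."""
    matched = [lemma for lemma in lemmas if not is_unmatched(lemma)]
    unmatched = [lemma for lemma in lemmas if is_unmatched(lemma)]
    return matched, unmatched


# Invented sample strings, used only to exercise the functions.
sample = ["bil", "*bli", "resa", "skolla**"]
matched, unmatched = split_by_marker(sample)
print(matched)
print(unmatched)
```

Keeping the unmatched items in a separate list rather than discarding them can be worthwhile, since misspellings and novel compounds are themselves informative about learner language.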

How to get it?

You can download both resources from the "Download the resources" page.

How to use it?

Each resource is distributed as a .CSV file (tab-separated values) with 9 columns (see above), encoded in UTF-8. You can also open it in Excel.
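
A tab-separated, UTF-8 encoded file of this kind can be read with Python's standard csv module. The sketch below is a minimal illustration under stated assumptions: it takes the column names from the header row of the input, and the in-memory sample uses the column names of the example tables above, which may differ from the header of the actual download. The function name read_lexicon is hypothetical.

```python
import csv
import io


def read_lexicon(handle):
    """Read tab-separated rows into dicts keyed by the header row."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# In-memory stand-in for a downloaded file; the header is assumed
# from the example tables and may not match the real file exactly.
SAMPLE = (
    "Lemma\tPOS-tag\tA1\tA2\tB1\tB2\tC1\tTotal\n"
    "bil\tNN_UTR\t430.2138\t1234.2078\t728.9847\t422.283\t363.5446\t618.8567\n"
    "så klart\tABM_MWE\t0\t16.2635\t81.6019\t45.5033\t13.2172\t38.1738\n"
)

rows = read_lexicon(io.StringIO(SAMPLE))

for row in rows:
    # Values are read as strings; convert frequencies to float when needed.
    print(row["Lemma"], row["POS-tag"], float(row["Total"]))
```

For a file on disk, the same function can be given a handle opened with open(path, encoding="utf-8", newline=""). Splitting on tabs rather than spaces keeps multi-word expressions such as "så klart" in a single field.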

What can it be used for?

Both resources can be used for NLP purposes, as well as for pedagogical and language assessment purposes.

Who did it?

SVALex and SweLLex are the result of a collaboration between two teams:


      Download SVALex or SweLLex

      Feel free to download and use both resources for your own research or teaching. If you are using SweLLex or SVALex, please cite the respective articles on these resources:

      Thomas François, Elena Volodina, Ildikó Pilán, Anaïs Tack. 2016. SVALex: a CEFR-graded lexical resource for Swedish foreign and second language learners. Proceedings of LREC 2016, Slovenia.

      Elena Volodina, Ildikó Pilán, Lorena Llozhi, Baptiste Degryse, Thomas François. 2016. SweLLex: second language learners' productive vocabulary. Proceedings of the workshop on NLP4CALL&LA. NEALT Proceedings Series / Linköping Electronic Conference Proceedings

      Number of entries: 15,681 lemmas (SVALex), 6,965 lemmas (SweLLex)
      Tagger used: Korp pipeline
      Includes multiword expressions: Yes
      Download csv: Download SVALex, Download SweLLex