Description
BlogSet-BR is a collection of posts gathered from the Blogspot platform and written by Brazilian users. This resource has three files:
- a compressed CSV containing only the Brazilian posts,
- an XLS file with survey answers, and
- a tar.gz archive with the original JSON.
Download
Compressed CSV with 7.4 million Brazilian posts.
XLS with 4 thousand answers from Brazilian bloggers.
Compressed TAR with 3 million blogs gathered from Blogspot.
Instructions
The main file, blogset-br, can be opened in Pandas with the code below:
import pandas as pd
posts = pd.read_csv('blogset-br.csv.gz', compression='gzip', header=None)
# columns: post.id, .blog.id, .published, .title, .content, .author.id, .author.displayName, .replies.totalItems, tags
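Because the CSV is read with header=None, Pandas labels the columns 0 through 8. The sketch below attaches readable names as the file is loaded. The names in COLUMNS and the helper load_posts are illustrative assumptions based on the column comment above, not identifiers shipped with the dataset.

```python
import pandas as pd

# Assumed names, one per column listed in the comment above.
# The file itself carries no header row, so these are illustrative only.
COLUMNS = [
    "post_id",
    "blog_id",
    "published",
    "title",
    "content",
    "author_id",
    "author_display_name",
    "replies_total_items",
    "tags",
]


def load_posts(path, nrows=None):
    """Read the gzip-compressed CSV and attach the assumed column names.

    nrows limits how many rows are read; None reads the whole file.
    """
    return pd.read_csv(
        path,
        compression="gzip",
        header=None,
        names=COLUMNS,
        nrows=nrows,
    )
```

Calling load_posts('blogset-br.csv.gz', nrows=1000) would read only the first 1,000 rows, which can be a convenient way to inspect the data before loading all 7.4 million posts into memory.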
Citation
Henrique D. P. dos Santos, Vinicius Woloszyn, and Renata Vieira. 2018. BlogSet-BR: A Brazilian Portuguese Blog Corpus. In Proceedings of the 11th edition of the Language Resources and Evaluation Conference, 7-12 May 2018, Miyazaki (Japan).