Cheat Sheet
Extensions
quanteda works well with these companion packages:
• quanteda.textmodels: text scaling and classification models
• readtext: an easy way to read text data
• spacyr: NLP using the spaCy library
• quanteda.corpora: additional text corpora
• stopwords: multilingual stopword lists in R
Extract features (dfm_*; fcm_*)
Create a document-feature matrix (dfm) from a corpus
x <- dfm(data_corpus_inaugural, tolower = TRUE, stem = FALSE,
         remove_punct = TRUE, remove = stopwords("en"))
print(x, max_ndoc = 2, max_nfeat = 4)
## Document-feature matrix of: 58 documents, 9,210 features (92.6% sparse) and 4 docvars.
##                  features
## docs              fellow-citizens senate house representatives
##   1789-Washington               1      1     2               2
##   1793-Washington               0      0     0               0
## [ reached max_ndoc ... 56 more documents, reached max_nfeat ... 9,206 more features ]
General syntax
• corpus_* manage text collections/metadata
• tokens_* create/modify tokenized texts
• dfm_* create/modify doc-feature matrices
• fcm_* work with co-occurrence matrices
• textstat_* calculate text-based statistics
• textmodel_* fit (un-)supervised models
• textplot_* create text-based visualizations
Consistent grammar:
• object() constructor for the object type
• object_verb() inputs & returns object type
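The grammar can be sketched end to end with a toy example (assuming quanteda is installed; the texts are invented). Each constructor builds an object, and the matching `object_verb()` functions transform it:

```r
library(quanteda)

# corpus() constructs the object; *_verb() functions then transform it
corp  <- corpus(c(d1 = "Text analysis in R.", d2 = "Text analysis is fun."))
toks  <- tokens(corp, remove_punct = TRUE)   # tokens_* verbs apply from here
dfmat <- dfm(tokens_tolower(toks))           # dfm_* verbs apply from here
nfeat(dfmat)                                 # 6 unique lowercased features
```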
Extract or add document-level variables
party <- data_corpus_inaugural$Party
x$serial_number <- seq_len(ndoc(x))
docvars(x, "serial_number") <- seq_len(ndoc(x)) # alternative
Bind or subset corpora
corpus(x[1:5]) + corpus(x[7:9])
corpus_subset(x, Year > 1990)
Change units of a corpus
corpus_reshape(x, to = "sentences")
Segment texts on a pattern match
corpus_segment(x, pattern, valuetype, extract_pattern = TRUE)
Take a random sample of corpus texts
corpus_sample(x, size = 10, replace = FALSE)
Utility functions
texts(corpus)              Show texts of a corpus
ndoc(corpus/dfm/tokens)    Count documents
nfeat(dfm/fcm)             Count features
summary(corpus/dfm)        Print summary
head(corpus/dfm)           Return first part
tail(corpus/dfm)           Return last part
Create a dictionary
dictionary(list(negative = c("bad", "awful", "sad"),
                positive = c("good", "wonderful", "happy")))
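A toy run of this dictionary against a small dfm shows how lookup collapses features into dictionary keys (a sketch assuming quanteda's default lookup behavior):

```r
library(quanteda)

dict <- dictionary(list(negative = c("bad", "awful", "sad"),
                        positive = c("good", "wonderful", "happy")))
dfmat <- dfm(tokens("good good bad"))
dfm_lookup(dfmat, dictionary = dict)   # negative = 1, positive = 2
```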
Apply a dictionary
dfm_lookup(x, dictionary = data_dictionary_LSD2015)
Select features
dfm_select(x, pattern = data_dictionary_LSD2015, selection = "keep")
Randomly sample documents or features
dfm_sample(x, what = c("documents", "features"))
Weight or smooth the feature frequencies
dfm_weight(x, scheme = "prop")
dfm_smooth(x, smoothing = 0.5)
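On a toy dfm, scheme = "prop" rescales each document's counts to proportions, while smoothing adds a constant to every cell (a minimal sketch):

```r
library(quanteda)

dfmat <- dfm(tokens(c(d1 = "a a b")))
dfm_weight(dfmat, scheme = "prop")    # a = 2/3, b = 1/3
dfm_smooth(dfmat, smoothing = 0.5)    # a = 2.5, b = 1.5
```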
Sort or group a dfm
dfm_sort(x, margin = c("features", "documents", "both"))
dfm_group(x, groups = "President")
Combine identical dimension elements of a dfm
dfm_compress(x, margin = c("both", "documents", "features"))
Create a feature co-occurrence matrix (fcm)
x <- fcm(data_corpus_inaugural, context = "window", size = 5)
fcm_compress/remove/select/toupper/tolower are also available
Useful additional functions
Locate keywords-in-context
kwic(data_corpus_inaugural, pattern = "america*")
Create a corpus from texts (corpus_*)
Read texts (txt, pdf, csv, doc, docx, json, xml)
my_texts <- readtext::readtext("~/link/to/path/*")
Construct a corpus from a character vector
x <- corpus(data_char_ukimmig2010)
Explore a corpus
summary(data_corpus_inaugural, n = 2)
## Corpus consisting of 58 documents, showing 2 documents:
##
##            Text Types Tokens Sentences Year  President FirstName Party
## 1789-Washington   625   1537        23 1789 Washington    George  none
## 1793-Washington    96    147         4 1793 Washington    George  none
Tokenize a set of texts (tokens_*)
Tokenize texts from a character vector or corpus
x <- tokens("Powerful tool for text analysis.", remove_punct = TRUE)
Convert sequences into compound tokens
myseqs <- phrase(c("text analysis"))
tokens_compound(x, myseqs)
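Compounded sequences become single tokens joined by the default "_" concatenator, as a quick sketch shows:

```r
library(quanteda)

toks <- tokens("Powerful tool for text analysis.", remove_punct = TRUE)
tokens_compound(toks, phrase("text analysis"))
# "text" and "analysis" are replaced by the single token "text_analysis"
```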
Select tokens
tokens_select(x, c("powerful", "text"), selection = "keep")
Create ngrams and skipgrams from tokens
tokens_ngrams(x, n = 1:3)
tokens_skipgrams(x, n = 2, skip = 0:1)
Convert case of tokens or features
tokens_tolower(x)  tokens_toupper(x)  dfm_tolower(x)
Stem tokens or features
tokens_wordstem(x) dfm_wordstem(x)
Calculate text statistics (textstat_*)
Tabulate feature frequencies from a dfm
textstat_frequency(x)
topfeatures(x)
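On a toy dfm, topfeatures() returns a named vector of overall counts, most frequent first (a minimal sketch):

```r
library(quanteda)

dfmat <- dfm(tokens(c(d1 = "b b a", d2 = "b c")))
topfeatures(dfmat)   # b = 3, a = 1, c = 1
```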
Identify and score collocations from a tokenized text
toks <- tokens(c("quanteda is a pkg for quant text analysis", "quant text analysis is a growing field"))
textstat_collocations(toks, size = 3, min_count = 2)
Calculate readability of a corpus
textstat_readability(x, measure = c("Flesch", "FOG"))
Calculate lexical diversity of a dfm
textstat_lexdiv(x, measure = "TTR")
Measure distance or similarity from a dfm
textstat_simil(x, "2017-Trump", method = "cosine",
               margin = c("documents", "features"))
textstat_dist(x, "2017-Trump",
              margin = c("documents", "features"))
Calculate keyness statistics
textstat_keyness(x, target = "2017-Trump")
by Stefan Müller and Kenneth Benoit • smueller@quanteda.org, kbenoit@quanteda.org https://creativecommons.org/licenses/by/4.0/ Learn more at: http://quanteda.io • updated: 05/2020
Fit text models based on a dfm (textmodel_*)
These functions require the quanteda.textmodels package
Correspondence Analysis (CA)
textmodel_ca(x, threads = 2, sparse = TRUE, residual_floor = 0.1)
Naïve Bayes classifier for texts
textmodel_nb(x, y = training_labels, distribution = "multinomial")
SVM classifier for texts
textmodel_svm(x, y = training_labels)
Wordscores text model
refscores <- c(seq(-1.5, 1.5, .75), NA)
textmodel_wordscores(data_dfm_lbgexample, refscores)
Wordfish Poisson scaling model
textmodel_wordfish(dfm(data_corpus_irishbudget2010), dir = c(6, 5))
Textmodel methods: predict(), coef(), summary(), print()
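A minimal train/predict round trip with the Naïve Bayes model shows the methods in use (a sketch assuming quanteda.textmodels is installed; the labels and texts are invented):

```r
library(quanteda)
library(quanteda.textmodels)

dfmat <- dfm(tokens(c(train1 = "good great fine",
                      train2 = "bad awful poor",
                      test1  = "great good good")))
fit <- textmodel_nb(dfmat[1:2, ], y = c("pos", "neg"))
predict(fit, newdata = dfmat[3, ])   # classifies test1; here "pos"
```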
Plot features or models (textplot_*)
Plot features as a wordcloud
data_corpus_inaugural %>%
  corpus_subset(President == "Obama") %>%
  dfm(remove = stopwords("en")) %>%
  textplot_wordcloud()
Plot word keyness
data_corpus_inaugural %>%
  corpus_subset(President %in% c("Obama", "Trump")) %>%
  dfm(groups = "President", remove = stopwords("en")) %>%
  textstat_keyness(target = "Trump") %>%
  textplot_keyness()
Plot Wordfish, Wordscores or CA models (requires the quanteda.textmodels package)
scaling_model %>%
  textplot_scale1d(groups = party, margin = "documents")
[Figures: a wordcloud of Obama's inaugural addresses; a keyness (chi2) plot comparing Trump vs. Obama; a textplot_scale1d plot of document positions grouped by party]
Convert dfm to a non-quanteda format
convert(x, to = c("lda", "tm", "stm", "austin", "topicmodels",
                  "lsa", "matrix", "data.frame"))
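For example, converting a toy dfm to a plain data frame yields one row per document and one column per feature, plus a document-id column (a minimal sketch):

```r
library(quanteda)

dfmat <- dfm(tokens(c(d1 = "a b", d2 = "b c")))
df <- convert(dfmat, to = "data.frame")
# 2 rows (d1, d2); columns: document id + features a, b, c
```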