Title: | Textual Data Analysis Package Used by the TXM Software |
---|---|
Description: | Statistical exploration of textual corpora using several methods from French 'Textometrie' (new name of 'Lexicometrie') and French 'Data Analysis' schools. It includes methods for exploring irregularity of distribution of lexicon features across text sets or parts of texts (Specificity analysis); multi-dimensional exploration (Factorial analysis), etc. Those methods are used in the TXM software. |
Authors: | Sylvain Loiseau, Lise Vaudor, Matthieu Decorde, Serge Heiden |
Maintainer: | Matthieu Decorde <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.6 |
Built: | 2024-10-27 05:34:10 UTC |
Source: | https://github.com/cran/textometry |
Package: | textometry |
Type: | Package |
Version: | 0.1.3 |
Date: | 2014-06-16 |
License: | GPLv3 |
Depends: | R (>= 1.5.0) |
Index:
specificities | Compute Lexical Specificity of subcorpus |
progression | Draw progression graphic |
Sylvain Loiseau, Lise Vaudor, Matthieu Decorde, Serge Heiden
data(robespierre); specificities(robespierre);
A lexical table containing frequencies of adverbs from the BFM (Base de Français Médiéval) database in 5 different domains (literary, historical, didactic, law, religious).
data(bfm)
The format is:
num [1:2, 1:5] 103000 1370887 23429 413441 15345 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:2] "ADV" "other"
..$ : chr [1:5] "literary" "history" "didactic" "juridical" ...
The last row of the table gives the total frequency of words of all other parts of speech in each of these domains.
BFM - Base de Français Médiéval [En ligne]. Lyon : ENS de Lyon, Laboratoire ICAR, 2012, https://bfm.ens-lyon.fr.
Draw the progression graphic of matches of CQL queries in a corpus
progression(positions, names, colors, styles, widths, corpusname, Xmin, T, doCumulative, structurepositions, strutnames, graphtitle, bande)
positions |
Vector containing the corpus positions of CQL query matches. A position is an integer from 0 (beginning of corpus) to N (end of corpus) |
names |
String vector containing the CQL queries |
colors |
Vector containing the line color of each query |
styles |
Vector containing the line style of each query |
widths |
Vector containing the line width of each query |
corpusname |
String: corpus name |
Xmin |
Integer: corpus starting position of abscissa values |
T |
Integer: size of the corpus |
doCumulative |
Boolean: if true draw a cumulative graph, if false draw a density graph |
structurepositions |
Optional vector containing the structure positions of the corpus |
strutnames |
Optional vector containing the structure labels to display |
graphtitle |
String: graph title |
bande |
Float: density window size factor |
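To make the cumulative option more concrete, here is a minimal base R sketch of what a cumulative progression curve represents: the number of matches reached at each match position. It does not call `progression` itself; the helper name `cumulative_progression` and the toy positions are assumptions made for this illustration.

```r
# Hypothetical helper (not part of textometry): cumulative number of matches
# reached at each match position, which is what a cumulative curve displays.
cumulative_progression <- function(positions) {
  sorted <- sort(positions)
  data.frame(position = sorted, count = seq_along(sorted))
}

# Toy match positions for a single query (assumed values)
matches <- c(120, 15, 480, 300)
prog <- cumulative_progression(matches)
print(prog)

# Draw the step curve only in an interactive session
if (interactive()) {
  plot(prog$position, prog$count, type = "s",
       xlab = "Corpus position", ylab = "Cumulative matches")
}
```

A steep stretch of the curve corresponds to a region of the corpus where the query matches are concentrated.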
Matthieu Decorde
A lexical table containing the frequencies of 5 words in 10 different public discourses of the French politician Robespierre (between November 1793 and July 1794).
data(robespierre)
The format is:
num [1:6, 1:10] 464 45 35 30 6 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:6] "de" "peuple" "republique" "ennemi" ...
..$ : chr [1:10] "D1" "D2" "D3" "D4" ...
The last row of the table gives the total frequency of all the other forms in each of these discourses.
Lafon P. (1980) Sur la variabilité de la fréquence des formes dans un corpus, Mots, 1, pp. 127–165.
data(robespierre)
## See graphic in Lafon, 1980 - page 140
t <- colSums(robespierre)["D9"]      # size of the part
T <- sum(robespierre)                # size of the corpus
f <- rowSums(robespierre)["peuple"]  # total frequency of "peuple"
p <- dhyper(1:15, f, T - f, t)
title <- "Probability of each frequency of 'peuple' in the 'D9' discourse from 1 to 15"
plot(p, type = "h", main = title, xlab = "k", ylab = "Prob(k)")
Calculate the specificity (or association, or surprise) score of a word being present f times or more in a sub-corpus of t words, given that it appears a total of F times in a whole corpus of T words.
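As a rough illustration of this definition, the probability of observing f or more occurrences can be computed with base R's `phyper`. This is a minimal sketch under the hypergeometric model, not the package's implementation: the helper name `upper_tail_probability` and the toy numbers are assumptions made for the example.

```r
# Hypothetical helper (not part of textometry): probability that a word with
# total frequency F in a corpus of T words occurs f times or more in a
# sub-corpus of t words, under the hypergeometric model.
upper_tail_probability <- function(f, t, F, T) {
  # phyper(q, m, n, k, lower.tail = FALSE) gives P(X > q), so use q = f - 1
  phyper(f - 1, m = F, n = T - F, k = t, lower.tail = FALSE)
}

# Toy numbers (assumed for illustration): T = 10, F = 3, t = 4, f = 2
p <- upper_tail_probability(f = 2, t = 4, F = 3, T = 10)
print(p)
```

A small probability means the observed frequency is surprisingly high for a part of that size.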
specificities(lexicaltable, types=NULL, parts=NULL)
lexicaltable |
a complete lexical table, i.e. a numeric matrix where each row represents a word and each column a part of the corpus. Each cell gives the frequency of the given word in the corresponding part of the corpus. |
types |
list of rows (words) for which the specificity score must be calculated.
If NULL (the default), the score is calculated for all rows. |
parts |
list of columns (parts) for which the specificity score must be calculated.
If NULL (the default), the score is calculated for all columns. |
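For readers unsure what such a lexical table looks like, the following base R sketch builds one from a toy token stream. The data and variable names are assumptions made for illustration. Whether `specificities` needs exactly this storage mode is not stated here, so treat the conversion to a plain numeric matrix as a cautious choice rather than a requirement.

```r
# Toy token stream with the part each token belongs to (assumed data)
words <- c("peuple", "de", "peuple", "ennemi", "de", "de")
parts <- c("D1", "D1", "D2", "D2", "D2", "D1")

# Cross-tabulate words by parts, then convert the result to a plain numeric
# matrix with words as row names and parts as column names
counts <- table(words, parts)
lexicaltable <- matrix(as.numeric(counts), nrow = nrow(counts),
                       dimnames = list(rownames(counts), colnames(counts)))
print(lexicaltable)
```

Row sums of this matrix give the total frequency F of each word, column sums give the size t of each part, and the grand total gives the corpus size T.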
Returns a matrix of nrow(lexicaltable) * ncol(lexicaltable) (the number of rows and columns may be reduced using types or parts), each cell giving the specificity score.
Matthieu Decorde, Serge Heiden, Sylvain Loiseau, Lise Vaudor
Lafon P. (1980) Sur la variabilité de la fréquence des formes dans un corpus, Mots, 1, pp. 127–165. https://www.persee.fr/doc/mots_0243-6450_1980_num_1_1_1008
specificities.probabilities, specificities.lexicon
data(robespierre)
spe <- specificities(robespierre)
string <- paste("The word %s appears f=%d times in a sub-corpus of t=%d words,",
                " given a total frequency of F=%d in the robespierre corpus made",
                " of T=%d words. The corresponding specificity score is %f",
                sep = "")
print(sprintf(string, 'peuple',
              robespierre['peuple', 'D4'],
              colSums(robespierre)['D4'],
              rowSums(robespierre)['peuple'],
              sum(robespierre),
              spe['peuple', 'D4']))
Display the specificities probability distribution (calls dhyper and specificities.probabilities.vector)
specificities.distribution.plot(x, F, t, T)
x |
observed number of A words |
F |
total number of A |
t |
size of part |
T |
size of corpus |
nothing
Matthieu Decorde, Serge Heiden
Compute specificities association score between a lexicon and a sub-lexicon
specificities.lexicon(lexicon, sublexicon)
lexicon |
a frequency list (named vector) |
sublexicon |
a frequency list (named vector) |
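The two frequency lists can be prepared in base R along these lines. This is only a sketch: the helper `frequency_list` and the toy tokens are hypothetical. It assumes that a numeric vector named by word form is an acceptable input, and that every sub-lexicon entry also appears in the lexicon.

```r
# Toy corpus tokens (assumed data); the sub-corpus is a slice of the corpus,
# so every sub-lexicon entry also exists in the lexicon
corpus_tokens <- c("de", "peuple", "de", "ennemi", "peuple", "de")
sub_tokens <- corpus_tokens[1:3]

# Hypothetical helper: turn tokens into a named numeric frequency vector
frequency_list <- function(tokens) {
  counts <- table(tokens)
  setNames(as.numeric(counts), names(counts))
}

lexicon <- frequency_list(corpus_tokens)
sublexicon <- frequency_list(sub_tokens)

# Sanity check: each sub-frequency is bounded by the corpus frequency
stopifnot(all(sublexicon <= lexicon[names(sublexicon)]))
print(lexicon)
print(sublexicon)
```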
specificities index as a named vector.
specificities for specificities score and references
Compute specificities association score between a lexicon and a sub-lexicon. A new version of the "specificities.lexicon" function
specificities.lexicon.new(lexicon, sublexicon)
lexicon |
a frequency list (named vector) |
sublexicon |
a frequency list (named vector) |
specificities index as a named vector.
specificities for specificities score and references
Utility function computing specificity probabilities for the specificities function.
specificities.probabilities(lexicaltable, types = NULL, parts = NULL)
lexicaltable |
see specificities |
types |
see specificities |
parts |
see specificities |
Returns a matrix of signed specificity probabilities (between -1.0 and 1.0). By convention:
sign |
The sign indicates if the observed frequency is lower (minus) or higher (plus) than the mode of the specificity model |
.Machine$double.xmin limit |
-10.0 and 10.0 values are used to hold the sign when the zero/.Machine$double.xmin boundary line has been crossed (the |
see specificities.
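The sign convention can be illustrated with a short base R sketch. It is not the package's code: the helper `signed_probability` is hypothetical, and two points are assumptions made for the example. The first is that the probability attached to each sign is the corresponding hypergeometric tail. The second is that a frequency equal to the mode receives the minus sign.

```r
# Hypothetical helper (not the package's code): signed tail probability of
# observing frequency f in a part of size t, for a word of total frequency F
# in a corpus of size T.
signed_probability <- function(f, t, F, T) {
  # Mode of the hypergeometric distribution
  mode <- floor((t + 1) * (F + 1) / (T + 2))
  if (f > mode) {
    # Higher than the mode: P(X >= f), reported with a plus sign
    phyper(f - 1, F, T - F, t, lower.tail = FALSE)
  } else {
    # At or below the mode (assumption): P(X <= f), reported with a minus sign
    -phyper(f, F, T - F, t)
  }
}

# Toy numbers (assumed): T = 10, F = 3, t = 4, so the mode is 1
high <- signed_probability(f = 2, t = 4, F = 3, T = 10)
low <- signed_probability(f = 0, t = 4, F = 3, T = 10)
print(c(high = high, low = low))
```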
Calculate specificity probabilities on vectors (calls phyper and phyper_right)
specificities.probabilities.vector(v_f, v_F, T, t)
v_f |
vector of lexicon frequencies |
v_F |
vector of corpus frequencies |
T |
corpus size |
t |
sub-corpus size |
Hypergeometric probabilities. See specificities.lexicon.
Matthieu Decorde, Serge Heiden