Commit bcfe106

Merge branch 'gh-pages' into Issue-3571
Author: anisa-hawes
2 parents 27712ec + d1b586d commit bcfe106

5 files changed, 5 additions and 5 deletions

en/lessons/analyzing-documents-with-tfidf.md

Lines changed: 1 addition & 1 deletion
@@ -346,7 +346,7 @@ The Scikit-Learn `TfidfVectorizer` has several internal settings that can be cha

#### 1. stopwords

- In my code, I used `python stopwords=None` but `python stopwords='english'` is available. This setting will filter out words using a [preselected list](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/_stop_words.py) of high frequency function words such as 'the', 'to', and 'of'. Depending on your settings, many of these terms will have low __tf-idf__ scores regardless because they tend to be found in all documents. For a discussion of some publicly available stop word lists (including Scikit-Learn's), see ["Stop Word Lists in Free Open-source Software Packages"](https://doi.org/10.18653/v1/W18-2502).
+ In my code, I used `python stopwords=None` but `python stopwords='english'` is available. This setting will filter out words using a [preselected list](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/_stop_words.py) of high frequency function words such as 'the', 'to', and 'of'. Depending on your settings, many of these terms will have low __tf-idf__ scores regardless because they tend to be found in all documents. For a discussion of some publicly available stop word lists (including Scikit-Learn's), see ["Stop Word Lists in Free Open-source Software Packages"](https://perma.cc/V4J7-HMWH).

#### 2. min_df, max_df
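The paragraph touched by this hunk says that stop words tend to receive low tf-idf scores even without filtering, because they occur in every document. A minimal sketch of that reasoning follows, using only the standard library. The toy corpus and the `smoothed_idf` helper are hypothetical and not part of the lesson; the formula is the smoothed idf that Scikit-Learn documents as its default. Note also that Scikit-Learn spells the parameter `stop_words`.

```python
import math

# Hypothetical toy corpus (not from the lesson), already split into tokens.
docs = [
    "the senator spoke to the press".split(),
    "the painter moved to paris".split(),
    "the chemist wrote to the academy of sciences".split(),
]


def smoothed_idf(term, documents):
    """Smoothed idf: ln((1 + N) / (1 + df)) + 1.

    N is the number of documents and df is the number of documents
    that contain the term at least once.
    """
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# 'the' appears in every document, so its idf is the minimum possible value.
idf_the = smoothed_idf("the", docs)
# 'paris' appears in only one document, so its idf is higher.
idf_paris = smoothed_idf("paris", docs)

print("idf('the')   =", idf_the)
print("idf('paris') =", idf_paris)
```

Under this formula, a term found in every document always gets an idf of exactly 1, the floor of the scale. That is why filtering with a stop word list often changes the top-ranked terms less than one might expect.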

en/lessons/detecting-text-reuse-with-passim.md

Lines changed: 1 addition & 1 deletion
@@ -905,5 +905,5 @@ MR gratefully acknowledges the financial support of the Swiss National Science F
8. Aleksi Vesanto, Asko Nivala, Heli Rantala, Tapio Salakoski, Hannu Salmi, Filip Ginter. Applying BLAST to Text Reuse Detection in Finnish Newspapers and Journals, 1771-1910. 54–58 In *Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language*. Linköping University Electronic Press, 2017. [Link](https://aclanthology.org/W17-0510.pdf)
9. Hannu Salmi, Heli Rantala, Aleksi Vesanto, Filip Ginter. The long-term reuse of text in the Finnish press, 1771–1920. **2364**, 394–544 In *CEUR Workshop Proceedings*. (2019).
10. Axel J Soto, Abidalrahman Mohammad, Andrew Albert, Aminul Islam, Evangelos Milios, Michael Doyle, Rosane Minghim, Maria Cristina de Oliveira. Similarity-Based Support for Text Reuse in Technical Writing. 97–106 In *Proceedings of the 2015 ACM Symposium on Document Engineering*. ACM, 2015. [Link](http://dx.doi.org/10.1145/2682571.2797068)
- 11. Alexandra Schofield, Laure Thompson, David Mimno. Quantifying the Effects of Text Duplication on Semantic Models. 2737–2747 In *Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing*. Association for Computational Linguistics, 2017. [Link](https://doi.org/10.18653/v1/D17-1290)
+ 11. Alexandra Schofield, Laure Thompson, David Mimno. Quantifying the Effects of Text Duplication on Semantic Models. 2737–2747 In *Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing*. Association for Computational Linguistics, 2017. [https://doi.org/10.18653/v1/D17-1290](https://perma.cc/KSK6-5TXP)
12. Matteo Romanello, Aurélien Berra, Alexandra Trachsel. Rethinking Text Reuse as Digital Classicists. *Digital Humanities conference*, 2014. [Link](https://wiki.digitalclassicist.org/Text_Reuse)

en/lessons/geoparsing-text-with-edinburgh.md

Lines changed: 1 addition & 1 deletion
@@ -384,7 +384,7 @@ Rosa Filgueira, Claire Grover, Vasilios Karaiskos, Beatrice Alex, Sarah Van Eynd

Rosa Filgueira, Claire Grover, Melissa Terras, and Beatrice Alex (2020). Geoparsing the historical Gazetteers of Scotland: accurately computing location in mass digitised texts. In Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora, pages 24–30, Marseille, France. European Language Resources Association.

- Claire Grover and Richard Tobin (2014). A Gazetteer and Georeferencing for Historical English Documents. In Proceedings of LaTeCH 2014 at EACL 2014. Gothenburg, Sweden. [[pdf](https://doi.org/10.3115/v1/W14-0617)]
+ Claire Grover and Richard Tobin (2014). A Gazetteer and Georeferencing for Historical English Documents. In Proceedings of LaTeCH 2014 at EACL 2014. Gothenburg, Sweden. [https://doi.org/10.3115/v1/W14-0617](https://perma.cc/S8XG-8TH3)

Claire Grover, Richard Tobin, Kate Byrne, Matthew Woollard, James Reid, Stuart Dunn, and Julian Ball (2010). Use of the Edinburgh Geoparser for georeferencing digitised historical collections. Philosophical Transactions of the Royal Society A. [[pdf](http://homepages.inf.ed.ac.uk/grover/papers/PTRS-A-2010-Grover-3875-89.pdf)]

en/lessons/interrogating-national-narrative-gpt.md

Lines changed: 1 addition & 1 deletion
@@ -386,7 +386,7 @@ The use of generated text as an analytical tool is relatively novel, as is the a
[^1]: Jeffrey Wu et al., "Language Models Are Unsupervised Multitask Learners," *OpenAI*, (February 2019): 7, [https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf](https://perma.cc/7HCZ-DX87).
[^2]: David Tarditi, Sidd Puri, and Jose Oglesby, "Accelerator: Using data parallelism to program GPUs for general-purpose uses," *Operating Systems Review* 40, (2006): 325-326. [https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2005-184.pdf](https://perma.cc/QDX9-33R6).
[^3]: Shawn Graham, *An Enchantment of Digital Archaeology: Raising the Dead with Agent-Based Models, Archaeogaming, and Artificial Intelligence* (New York: Berghahn Books, 2020), 118.
- [^4]: Emily M. Bender and Alexander Koller, "Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data," (paper presented at Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, July 5 2020): 5187. [https://doi.org/10.18653/v1/2020.acl-main.463](https://doi.org/10.18653/v1/2020.acl-main.463).
+ [^4]: Emily M. Bender and Alexander Koller, "Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data," (paper presented at Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, July 5 2020): 5187. [https://doi.org/10.18653/v1/2020.acl-main.463](https://perma.cc/XH59-96ML).
[^5]: Kari Kraus, "Conjectural Criticism: Computing Past and Future Texts," *Digital Humanities Quarterly* 3, no. 4 (2009). [http://www.digitalhumanities.org/dhq/vol/3/4/000069/000069.html](https://perma.cc/C7D7-H7WY).
[^6]: Alexandra Borchardt, Felix M. Simon, and Diego Bironzo, *Interested but not Engaged: How Europe’s Media Cover Brexit,* (Oxford: Reuters Institute for the Study of Journalism, 2018), 23, [https://reutersinstitute.politics.ox.ac.uk/sites/default/files/2018-06/How%20Europe%27s%20Media%20Cover%20Brexit.pdf](https://perma.cc/8S2H-9ZDV).
[^7]: Satnam Virdee & Brendan McGeever, "Racism, Crisis, Brexit," *Ethnic and Racial Studies* 40, no. 10 (July 2017): 1807, [https://doi.org/10.1080/01419870.2017.1361544](https://doi.org/10.1080/01419870.2017.1361544).

fr/lecons/detecter-la-reutilisation-de-texte-avec-passim.md

Lines changed: 1 addition & 1 deletion
@@ -913,5 +913,5 @@ Matteo Romanello remercie le Fonds national suisse de la recherche scientifique
8. Vesanto, Aleksi, Asko Nivala, Heli Rantala, Tapio Salakoski, Hannu Salmi et Filip Ginter. « Applying BLAST to Text Reuse Detection in Finnish Newspapers and Journals, 1771-1910 ». *Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language* (2017): 54–58. [Lien](https://aclanthology.org/W17-0510.pdf)
9. Salmi, Hannu, Heli Rantala, Aleksi Vesanto et Filip Ginter. « The long-term reuse of text in the Finnish press, 1771–1920 ». *CEUR Workshop Proceedings* 2364 (2019): 394–544.
10. Soto, Axel J, Abidalrahman Mohammad, Andrew Albert, Aminul Islam, Evangelos Milios, Michael Doyle, Rosane Minghim et Maria Cristina de Oliveira. « Similarity-Based Support for Text Reuse in Technical Writing ». *Proceedings of the 2015 ACM Symposium on Document Engineering* (2015): 97–106. [Lien](http://dx.doi.org/10.1145/2682571.2797068)
- 11. Schofield, Alexandra, Laure Thompson et David Mimno. « Quantifying the Effects of Text Duplication on Semantic Models ». *Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing* (2017): 2737–2747. [Lien](https://doi.org/10.18653/v1/D17-1290)
+ 11. Schofield, Alexandra, Laure Thompson et David Mimno. « Quantifying the Effects of Text Duplication on Semantic Models ». *Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing* (2017): 2737–2747. [https://doi.org/10.18653/v1/D17-1290](https://perma.cc/KSK6-5TXP)
12. Romanello, Matteo, Aurélien Berra et Alexandra Trachsel. « Rethinking Text Reuse as Digital Classicists ». *Digital Humanities conference* (2014). [Lien](https://wiki.digitalclassicist.org/Text_Reuse)
