Técnicas big data: análisis de textos a gran escala para la investigación científica y periodística

  1. Carlos Arcila-Calderón 1
  2. Eduar Barbosa-Caro 2
  3. Francisco Cabezuelo Lorenzo 3
  1. 1 Universidad de Salamanca, Salamanca, Spain. ROR https://ror.org/02f40zc51
  2. 2 Universidad del Norte, Barranquilla, Colombia. ROR https://ror.org/031e6xm45
  3. 3 Universidad de Valladolid, Valladolid, Spain. ROR https://ror.org/01fvbaw18

Journal:
El profesional de la información

ISSN: 1386-6710 1699-2407

Year of publication: 2016

Publication title: Datos

Volume: 25

Issue: 4

Pages: 623-631

Type: Article

DOI: 10.3145/EPI.2016.JUL.12 (open access)

Abstract

This paper conceptualizes the term big data and describes its relevance to social research and journalistic practice. We explain large-scale text analysis techniques, such as automated content analysis, data mining, machine learning, topic modeling, and sentiment analysis, that may aid scientific discovery in the social sciences and news production in journalism. We also describe the e-infrastructure required for big data analysis using cloud computing, and we assess the main packages and libraries for information retrieval and analysis available in commercial software and in programming languages such as Python or R.

References

  • Alpaydin, Ethem (2010). Introduction to machine learning. Cambridge/London: The MIT Press. ISBN: 978 0262012430
  • Arora, Sanjeev; Ge, Rong; Halpern, Yoni; Mimno, David; Moitra, Ankur; Sontag, David; Wu, Yichen; Zhu, Michael (2013). “A practical algorithm for topic modeling with provable guarantees”. In: 30th Intl conf on machine learning. pp. 280-288. http://jmlr.org/proceedings/papers/v28/arora13.html
  • Blei, David M. (2012). “Topic modeling and digital Humanities”. Journal of digital humanities, v. 2, n. 1, pp. 8-11. http://journalofdigitalhumanities.org/2-1/topic-modelingand-digital-humanities-by-david-m-blei
  • Blum, Avrim (2003). “Machine learning theory”. In: FOCS 2003 Procs of the 44th Annual IEEE Symposium on foundations of computer science. Washington DC: IEEE Computer Society, pp. 2-4. ISBN: 0 7695 2040 5
  • Cai, Keke; Spangler, Scott; Chen, Ying; Zhang, Li (2010). “Leveraging sentiment analysis for topic detection”. In: IEEE/WIC/ACM International Conference on Web Intelligence and Agent Systems: An International Journal, pp. 265-271. http://www.csce.uark.edu/~sgauch/5013NLP/S13/hw/Chris.pdf http://dx.doi.org/10.1109/WIIAT.2008.188
  • Cambria, Erick; Schuller, Björn; Liu, Bing; Wang, Haixun; Havasi, Catherine (2013). “Knowledge-based approaches to concept-level sentiment analysis”. IEEE intelligent systems, v. 28, n. 2, pp. 12-14. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6547971 http://dx.doi.org/10.1109/MIS.2013.45
  • Cheng, An-Shou; Fleischmann, Kenneth; Wang, Ping; Oard, Douglas (2008). “Advancing social science research by applying computational linguistics”. In: Procs of the American Society for Information Science and Technology, v. 45, n. 1, pp. 1-12. http://www.asis.org/Conferences/AM08/proceedings/posters/55_poster.pdf
  • Dhar, Vasant (2013). “Data science and prediction”. Communications of the ACM, v. 56, n. 12, pp. 64-73. https://archive.nyu.edu/bitstream/2451/31553/2/DharDataScience.pdf http://dx.doi.org/10.1145/2500499
  • Dietterich, Thomas (2003). “Machine learning”. Nature encyclopedia of cognitive science. London: Macmillan. http://eecs.oregonstate.edu/~tgd/publications/nature-ecsmachine-learning.ps.gz
  • Domingos, Pedro (2012). “A few useful things to know about machine learning”. Communications of the ACM, v. 55, n. 10, pp. 78-87. http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf http://dx.doi.org/10.1145/2347736.2347755
  • Feldman, Ronen (2013). “Techniques and applications for sentiment analysis”. Communications of the ACM, v. 56, n. 4, pp. 82-89. http://dx.doi.org/10.1145/2436256.2436274
  • Han, Jiawei; Kamber, Micheline; Pei, Jian (2006). Data mining. Concepts and techniques. San Francisco: Morgan Kaufmann Publishers. ISBN: 978 0123814791 http://goo.gl/5zTYb6
  • Hand, David; Mannila, Heikki; Smyth, Padhraic (2001). Principles of data mining. Cambridge: MIT Press. ISBN: 978 0262082907 ftp://gamma.sbin.org/pub/doc/books/Principles_of_Data_Mining.pdf
  • Harwood, Tracy; Garry, Tony (2003). “An overview of content analysis”. The marketing review, v. 3, pp. 479-498. http://dx.doi.org/10.1362/146934703771910080
  • Kalina, Jan (2013). “Highly robust methods in data mining”. Serbian journal of management, v. 8, n. 1, pp. 9-24. http://www.sjm06.com/SJM%20ISSN1452-4864/8_1_2013_May_1_132/8_1_2013_9-24.pdf http://dx.doi.org/10.5937/sjm8-3226
  • Kechaou, Zied; Ben-Ammar, Mohammed; Alimi, Adel (2013). “A multi-agent based system for sentiment analysis of user-generated content”. International journal on artificial intelligence tools, v. 22, n. 2, pp. 1-28. http://dx.doi.org/10.1142/S0218213013500048
  • Kelleher, John D.; MacNamee, Brian; D’Arcy, Aoife (2015). Fundamentals of machine learning for predictive data analytics: algorithms, worked examples, and case studies. London: MIT Press. ISBN: 978 0262029445
  • Krippendorff, Klaus (2004). Content analysis. An introduction to its methodology. Los Angeles: Sage Publications. ISBN: 978 0761915454
  • Leetaru, Kalev-Hannes (2011). Data mining methods for the content analyst: An introduction to the computational analysis of content. New York: Routledge. ISBN: 978 0415895149
  • Mayer-Schönberger, Viktor; Cukier, Kenneth (2013). Big data. La revolución de los datos masivos. Madrid: Turner. ISBN: 978 8415832102
  • McCallum, Andrew-Kachites (2002). Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu
  • Meena, Arun; Prabhakar, T. V. (2007). “Sentence level sentiment analysis in the presence of conjuncts using linguistic analysis”. In: Amati, Giambattista; Carpineto, Claudio; Romano, Giovanni (eds.). Advances in information retrieval. 29th European conf on IR research (ECIR), April 2-5, 2007, Rome, Italy, pp. 573-580. http://dx.doi.org/10.1007/978-3-540-71496-5_53
  • Mitchell, Tom (1997). Machine learning. New York: McGraw-Hill. ISBN: 978 0070428072 http://personal.disco.unimib.it/Vanneschi/McGrawHill_-_Machine_Learning_-Tom_Mitchell.pdf
  • Murphy, Kevin (2012). Machine learning. A probabilistic perspective. Cambridge/London: The MIT Press. ISBN: 978 0262018029
  • Murphy, Michael; Barton, John (2014). “From a sea of data to actionable insights: Big data and what it means for lawyers”. Intellectual property & technology law journal, v. 26, n. 3, pp. 8-17. http://www.pillsburylaw.com/publications/from-a-sea-ofdata-to-actionable-insights
  • Nunan, Dan; Di-Domenico, Maria-Laura (2013). “Market research and the ethics of big data”. International journal of market research, v. 55, n. 4, pp. 505-520. http://dx.doi.org/10.2501/IJMR-2013-015
  • Pennacchiotti, Marco; Popescu, Ana-Maria (2011). “A machine learning approach to Twitter user classification”. In: Procs of the 5th Intl conf on weblogs and social media. Menlo Park, California: The Association for the Advancement of Artificial Intelligence Press. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/download/2886/3262
  • Téllez-Valero, Alberto; Montes, Manuel; Villaseñor-Pineda, Luis (2009). “Using machine learning for extracting information from natural disaster news reports”. Computación y sistemas, v. 13, n. 1, pp. 33-44. http://www.scielo.org.mx/pdf/cys/v13n1/v13n1a4.pdf
  • Turney, Peter (2002). “Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews”. In: Procs of the 40th Annual meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 417-424. http://www.aclweb.org/anthology/P02-1053.pdf
  • Verbeke, Mathias; Berendt, Bettina; D’Haenens, Leen; Opgenhaffen, Michaël (2014). “When two disciplines meet, data mining for communication science”. In: 64th Annual meeting of International Communication Association (ICA) conf. Seattle, USA. https://lirias.kuleuven.be/handle/123456789/436424
  • Vinodhini, Gopalakrishnan; Chandrasekaran, Ramaswamy M. (2012). “Sentiment analysis and opinion mining: A survey”. International journal of advanced research in computer science and software engineering, v. 2, n. 6, pp. 282-292. http://www.ijarcsse.com/docs/papers/June2012/Volume_2_issue_6/V2I600263.pdf
  • West, Mark (2001). Theory, method, and practice in computer content analysis. Westport, Connecticut: Ablex Publishing. ISBN: 978 1567505030
  • White, Marilyn-Domas; Marsh, Emily (2006). “Content analysis: A flexible methodology”. Library trends, v. 55, n. 1, pp. 22-45. https://www.ideals.illinois.edu/bitstream/handle/2142/3670/whitemarch551.pdf?sequence=2 http://dx.doi.org/10.1353/lib.2006.0053
  • Woody, Alex (2016). “Inside the Panama papers: How cloud analytics made it all possible”. Datanami, 7 April. http://www.datanami.com/2016/04/07/inside-panamapapers-cloud-analytics-made-possible