Text Data Mining of the Tobacco Industry Internal …

Who Uses the Online Tobacco Industry Documents?
Martha Michel (1,2), M.S., Ph.D.; Lisa Bero (1,2), Ph.D.
(1) Graduate Group in Biological and Medical Informatics, UCSF
(2) Center for Tobacco Control Research and Education, UCSF

What are the Tobacco Industry Documents?
- As a result of the Master Settlement Agreement, millions of internal tobacco industry documents were released onto the Internet (legacy.library.ucsf.edu)
- The documents contain memos, scientific reports, faxes, emails, budgets, etc.
- The documents include information about scientific research, manufacturing, marketing, advertising and sales of cigarettes, and more

Example of an Internal Tobacco Industry Document
[Slide image not reproduced]

Document Collections
- Legacy document depository at UCSF
  - 5 million documents
  - About 32 million pages and growing
  - 1.5 terabytes
- Guildford document depository
  - 8 million British American Tobacco documents
  - About 32-40 million pages
  - UCSF has 13,000 documents which have been manually indexed
- Industry websites: PM, Lorillard, B&W, RJR
- Other collections: Tobacco Documents Online, CDC tobacco industry documents

Difficulties of searching the documents
- No OCR available for searching the full text
- Variations in spelling and problems when names suddenly change
- Duplicates
- Vast quantities of information
- No or varied indexing
- Unknown recall and low precision
Malone RE, Balbach ED. Tobacco industry documents: treasure trove or quagmire? Tobacco Control 2000;9(3):334-8.

Prior Studies of Who Searches
- Different types of groups used the paper document depositories (e.g., lawyers, government officials, researchers, tobacco control advocates, people in health-related fields)
- We still don't know who uses the electronic documents or why they search
- We are currently conducting an online survey of the UCSF Legacy website to examine how the existing websites are used and what barriers searchers encounter

Aim 1: Conduct Online Survey
Purpose of survey:
- Who uses the documents (demographics)
- Purposes for which documents are used
- Barriers to searching the documents
- Suggestions for improving the archives

Methods
- Developed and designed survey using Web Surveyor
- Conducted pilot test of survey (N=14)
- Launched surveys in November 2002
- 2 surveys: one on TCA, one on Legacy
  - Tobacco Control Archives (n=50)
  - Legacy Tobacco Control Documents Collection (n=22)

Results from Tobacco Control Archives Survey (n=50)

Who Uses the documents?
[Three chart slides; figures not reproduced]

Text under "Other"
- I would like more structure on how to work the music sight. (4)
- some stuff
- it's okay
- telling about schools
- direct assistance
- links to student led organization programs against tobacco
- more categorization
- I found your site useful. I had to fill out a worksheet from Health class and move along with the website, but there were a few things I could not find. Maybe it was the worksheet, not the site, but overall your site helped me out. Thank you
- love it
- Why do I have to do this survey? It's slowing me down.
- This site is cool I think
- The site is wonderful and very useful. I want to congratulate the authors for the wonderful job.
- I'd like more if smuggling was in better view.

Results from Legacy Survey (n=22)

Text from "Other" response
- full text
- manipulation of saved sets: more bookbag features
- ability to search within retrieved set
- more "popular documents" type of links
- OCR
- It would be great to have a quick way to search only for ads, like Philip Morris offers from its advertising archives search engine
- don't know
- first visit, I don't know yet
- fix the bookbag problem
- quick search box right at top of home page bypassing other list of other bates in a set from which the one comes.
- Nested searching
- Slightly larger font
- Full Boolean search capacity; more than six search term limit; feedback on when user errors are syntactical (as PM gives); not having to toggle back and forth between long and short displays; master ID numbers in the display; OCR capacity. Not only would it be fantastic to be able to search the text of the document, it would be invaluable to be able to cut and paste text from the documents into a word processor.
- A better search engine. You seem to have more documents than Tobacco Documents Online, but when I use the same search terms your search engine tells me it doesn't find any, when TDO finds over 100!

Text from "Other" response
- print them out; download to pc.
- If there are useful documents in a search, I print out the list, then download and print out the useful documents, numbering them with the number on the list. I file them chronologically by theme.
- by date
- by topic and correspondence
- First by theme (subject) and then by organization/corporation, and/or date.
- first visit, I don't know yet
- I e-mail them to my eudora account and search it when I want a citation.
- I look for a doc at your site, then go to or the like, type the bates, and pull the description. Then I type the first bates from master file, and this way I get the set of documents with context. Sometimes, the same document is in a few different sets! Then I get back to you to download it, or to cross reference with other collections (say, TI).
- I'm not sure I understand the difference between this question and the previous one. I use Endnote, if that's the question.
- I wish I had a consistent way. Can you conduct a seminar showing suggestions?
- Prefer to collect paper documents and arrange them in files that mirror the files they originally came from and/or dates and/or events within a date range

Who Uses the documents?
[Four chart slides; figures not reproduced]

Aim 2: Add the British American Tobacco documents to the Flamenco interface

The Tobacco Flamenco
- 13,000 British American Tobacco documents have been "Flamencoized"
- More documents are to be indexed as they arrive from Guildford, England and the industry websites

Tobacco Thesaurus
- The thesaurus terms are controlled terminology, described on the UCSF website by broader terms, narrower terms, and related terms
- We created a hierarchy with 834 terms based on these relations
- The parent-child relationships are 7 levels deep

Conclusions
- Many barriers to searching the documents exist
- Current searches are characterized by unknown recall and low precision, whether using TCA, Legacy or TDO
- There are different searching profiles for the people who use TCA vs. Legacy

Upcoming Goals
- Collect more survey data and write up results
- Create a Flamenco server at home
- Eventually conduct a usability trial of the modified Tobacco Flamenco
- Work on additional BATCo documents
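
The Tobacco Thesaurus slide says that a hierarchy was derived from broader-term and narrower-term relations and that it is 7 levels deep, but the slides do not show how this was computed. The following is only a minimal sketch of one way to turn such relations into a tree and measure its depth. The term pairs and all function names are hypothetical illustrations, not entries or code from the actual UCSF thesaurus, and the sketch assumes the relations contain no cycles.

```python
from collections import defaultdict

# Hypothetical (narrower term, broader term) pairs for illustration only;
# they are NOT taken from the actual UCSF tobacco thesaurus.
BROADER = [
    ("marketing", "business activities"),
    ("advertising", "marketing"),
    ("print advertising", "advertising"),
    ("manufacturing", "business activities"),
]


def build_children(pairs):
    """Map each broader term to the list of its narrower terms."""
    children = defaultdict(list)
    for narrower, broader in pairs:
        children[broader].append(narrower)
    return children


def find_roots(pairs):
    """Return the top-level terms: those that never appear as a narrower term."""
    narrower_terms = {narrower for narrower, _ in pairs}
    broader_terms = {broader for _, broader in pairs}
    return sorted(broader_terms - narrower_terms)


def max_depth(pairs):
    """Count the levels on the longest root-to-leaf path (a lone root is 1).

    Assumes the relations are acyclic; a cycle would make this loop forever.
    """
    children = build_children(pairs)
    deepest = 0
    stack = [(root, 1) for root in find_roots(pairs)]
    while stack:
        term, level = stack.pop()
        deepest = max(deepest, level)
        for child in children.get(term, []):
            stack.append((child, level + 1))
    return deepest


def count_terms(pairs):
    """Count the distinct terms mentioned in any relation."""
    return len({term for pair in pairs for term in pair})


if __name__ == "__main__":
    print("terms:", count_terms(BROADER))
    print("levels:", max_depth(BROADER))
```

Run against a full export of the thesaurus relations, the same two functions would report the term count and the number of levels, which could then be compared with the 834 terms and 7 levels stated on the slide. Related-term links are deliberately left out because they do not define parent-child structure.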