1. Replication redux and Facebook data

Introduction, from Michael Alvarez, co-editor of Political Analysis

Recently I asked Nathaniel Beck to write about his experiences with research replication. His essay, published on 24 August 2014 on the OUPblog, concluded with a brief discussion of a recent experience of his when he tried to obtain replication data from the authors of a study published in PNAS on an experiment run on Facebook regarding social contagion. Since then, the story of Neal's efforts to obtain this replication material has taken a few interesting twists and turns, so I asked Neal to provide an update, because the lessons from his efforts to get the replication data from this PNAS study are useful for the continued discussion of research transparency in the social sciences.

Replication redux, by Nathaniel Beck

When I last wrote about replication for the OUPblog in August (“Research Replication in Social Science”), there was one smallish open question (about my own work) and one biggish question (on whether I would ever see the Kramer et al., “Experimental evidence of massive-scale emotional contagion through social networks”, replication file, which was “in the mail”). The Facebook story is interesting, so I start with that.

After not hearing from Adam Kramer of Facebook, even after contacting PNAS, I persisted with both the editor of PNAS (Inder Verma, who was most kind) and the NAS, through "well connected" friends. (Getting replication data should not depend on knowing NAS members!) I was finally contacted by Adam Kramer, who offered that I could come out to Palo Alto to look at the replication data. Since Facebook did not offer to fly me out, I said no. I was then offered a chance to look at the replication files in the Facebook office four blocks from NYU, so I accepted. Let me stress that all dealings with Adam Kramer were highly cordial, and I assume that the delays were due to Facebook higher-ups who were dealing with the human subjects firestorm related to the Kramer piece.

When I got to the Facebook office I was asked to sign a standard non-disclosure agreement, which I declined to sign. To my surprise this was not a problem, with the only consequence being that a security officer would have had to escort me to the bathroom. I was then put in a room with a secure Facebook notebook with the data and RStudio loaded; Adam Kramer was there to answer questions, and I was also joined by a security person and an external relations person. All were quite pleasant, and the security person and I could even discuss the disastrous season being suffered by Liverpool.

I was given a replication file, a data frame with approximately 700,000 rows (one for each respondent) and 7 columns containing the number of positive and negative words used by each respondent, the total word count for each respondent, percentages based on these numbers, the experimental condition, and a variable flagging respondents omitted in producing the tables. This is exactly the data frame that would have been put in an archive, since it contained all the data needed to replicate the article. I was also given the R code that produced every item in the article. I was allowed to do anything I wanted with that data, and I could copy the results into a file. That file was then checked by Facebook people, and about two weeks later I received the entire file I created. All good, or at least as good as it is going to get.
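
To make that structure concrete, a small simulated stand-in in R might look something like the following; the column names and values are purely illustrative guesses, not the actual Facebook variables.

    # Simulated stand-in for the per-respondent data frame described above;
    # all names and values are hypothetical. The real frame had ~700,000 rows.
    set.seed(1)
    n <- 1000
    total_words <- rpois(n, 200)                    # total words posted per respondent
    pos_words   <- rbinom(n, total_words, 0.05)     # positive words
    neg_words   <- rbinom(n, total_words, 0.03)     # negative words
    fb <- data.frame(
      pos_words, neg_words, total_words,
      pct_pos   = pos_words / total_words,          # percentage positive
      pct_neg   = neg_words / total_words,          # percentage negative
      condition = factor(sample(c("control", "treatment"), n, replace = TRUE)),
      omitted   = FALSE                             # flag for respondents dropped from the tables
    )
    str(fb)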

Intel team inside Facebook data center. Intel Free Press. CC BY 2.0 via Wikimedia Commons.

The data frame I played with was based on aggregating user posts so that each user had one row of data, regardless of the number of posts (and the data frame did not contain anything more than the total number of words posted). I can understand why Facebook did not want to give me the data frame, innocuous as it seemed; those who specialize in re-identifying de-identified private data and reverse engineering code are quite good these days, and I can surely understand Facebook's reluctance to have this raw data out there. And I understand why they could not give me all the actual raw data, which included how feeds were changed and so forth; this is the secret sauce that they would not like reverse engineered.

I got what I wanted. I could see their code, play with density plots to get a sense of the words used, change the number of extreme points dropped, and I could have moved to a negative binomial instead of a Poisson. Satisfied, I left after about an hour; there are only so many things one can do with one experiment on two outcomes. I felt bad that Adam Kramer had to fly to New York, but I guess this is not so horrible. Had the data been more complicated I might have felt that I could not do everything I wanted, and running a replication with 3 other people in a room is not ideal (especially given my typing!).

My belief is that PNAS and the authors could simply have had a different replication footnote. This would have said that the code used (about 5 lines of R, basically a call to a Poisson regression using GLM) is available at a dataverse. In addition, they could have noted that the GLM call used the data frame I described, with the summary statistics for that data frame. Readers could then see what was done, and I can see no reason for such a procedure to bother Facebook (though I do not speak for them). I also note that a clear statement on a dataverse would have obviated the need for some discussion. Since bytes are cheap, the dataverse could also contain whatever policy statement Facebook has on replication data. This (IMHO) is much better than the "contact the authors for replication data" footnote that was published. It is obviously up to individual editors as to whether this is enough to satisfy replication standards, but at least it is better than the status quo.
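
For readers wondering what "about 5 lines of R" might amount to, here is a hedged sketch of that kind of call, run on the hypothetical fb frame sketched above rather than the actual Facebook code. The second model shows the negative binomial alternative mentioned earlier.

    # Sketch of a Poisson regression of negative word counts on experimental
    # condition, with total words as exposure; illustrative only, not the
    # actual Facebook replication code.
    m_pois <- glm(neg_words ~ condition + offset(log(total_words)),
                  family = poisson, data = fb)
    summary(m_pois)

    # The negative binomial alternative, via MASS::glm.nb
    library(MASS)
    m_nb <- glm.nb(neg_words ~ condition + offset(log(total_words)), data = fb)
    summary(m_nb)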

What if I didn't work four blocks from Astor Place? Fortunately I did not have to confront this horror. How many other offices does Facebook have? Would Adam Kramer have flown to Peoria? I batted this around, but I did most of the batting and the Facebook people mostly offered no comment. So someone else will have to test this issue. But for me, the procedure worked. Obviously I am analyzing lots more proprietary data, and (IMHO) this is a good thing. So Facebook, et al., and journal editors and societies have many details to work out. But, based on this one experience, this can be done. So I close this with thanks to Adam Kramer (but do remind him that I have had auto-responders to email for quite a while now).

On the more trivial issue of my own dataverse, I am happy to report that almost everything that was once on a private FTP site is now on my Harvard dataverse. Some of this was already up because of various co-authors who always cared about replication. And for the stuff that was not up, I was lucky to have a co-author like Jonathan Katz, who has many skills I do not possess (and is a bug about RCS and the like, which beats my "I have a few TB and the stuff is probably hidden there somewhere"). So everything is now on the dataverse, except for one data set that we were given for our 1995 APSR piece (and which Katz never had). Interestingly, I checked the original authors' web sites (one no longer exists, one did not go back nearly that far) and failed to make contact with either author. Twenty years is a long time! So everyone should do both themselves and all of us a favor, and build the appropriate dataverse files contemporaneously with the work. Editors will demand this, but even without this coercion, it is just good practice. I was shocked (shocked) at how bad my own practice was.

Heading image: Wikimedia Foundation Servers-8055 24 by Victorgrigas. CC BY-SA 3.0 via Wikimedia Commons.

The post Replication redux and Facebook data appeared first on OUPblog.

2. Gary King: an update on Dataverse

At the American Political Science Association meetings earlier this year, Gary King, Albert J. Weatherhead III University Professor at Harvard University, gave a presentation on Dataverse. Dataverse is an important tool that many researchers use to archive and share their research materials. As many readers of this blog may already know, the journal that I co-edit, Political Analysis, uses Dataverse to archive and disseminate the replication materials for the articles we publish in our journal. I asked Gary to write some remarks about Dataverse, based on his APSA presentation. His remarks are below.

*   *   *   *   *

An update on Dataverse

By Gary King

 
If you're an academic researcher, odds are you're not a professional archivist, and so you probably have more interesting things to do when making data available than following the detailed protocols and procedures established over many years by the archiving community. That, of course, might be OK for any one of us, but it is a terrible loss for all of us. The Dataverse Network Project offers a solution to this problem by eliminating transaction costs and changing the incentives to make data available, giving you substantial web visibility and academic citation credit for your data and scholarship (King, 2007). Dataverse Networks are installed at universities and other institutions around the world (e.g., here is the Dataverse network at Harvard's IQSS), and represent the world's largest collection of social science research data. In recent years, Dataverse has also been adopted by an increasingly diverse array of other fields, and protocols and procedures are being built out to enable numerous fields of science, social science, and the humanities to work together.

With a few minutes of set-up time, you can add your own Dataverse to your homepage with a list of data sets or replication data sets you make available, with whatever levels of permission you want for the broader community, and a vast array of professional services (e.g., here’s my Dataverse on my homepage). People will be able to more easily find your data and homepage, explore your data and scholarship, find connections to other resources, download data in any format, and learn proper ways of citing your work. They will even be able to analyze your data while still on your web site with a vast array of statistical methods through the transparent and automated connection Dataverse has built to Zelig: Everyone’s Statistical Software, and through Zelig to R. The result is that your data will be professionally preserved and easier to access — effectively automating the tasks of professional archiving, including citing, sharing, analyzing, archiving, preserving, distributing, cataloging, translating, disseminating, naming, verifying, and replicating data.
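
As a rough local illustration of that Zelig connection (a hedged sketch using one of R's built-in data sets, not the Dataverse-hosted workflow itself), the estimate, set, and simulate pattern that Zelig exposes looks like this:

    # Minimal Zelig workflow sketched locally; Dataverse automates the
    # equivalent steps against archived data through its web interface.
    library(Zelig)
    z.out <- zelig(Fertility ~ Education + Agriculture,
                   model = "ls", data = swiss)   # estimate a least-squares model
    x.out <- setx(z.out, Education = 10)         # choose covariate values of interest
    s.out <- sim(z.out, x = x.out)               # simulate quantities of interest
    summary(s.out)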

Dataverse Network Diagram, by Institute for Quantitative Social Science. CC-BY-2.0 via Wikimedia Commons.

Dataverse is an active project with new developments in software, protocols, and community connections coming rapidly. A brand new version of the code, written from scratch, will be available in a few months. Through generous grants from the Sloan Foundation, we have been working hard on eliminating other types of transaction costs for capturing data for the research community. These include deep integration with scholarly journals so that it can be trivially easy for an editor to encourage or require data associated with publications to be made available. We presently offer journals three options:

  • Do it yourself. Authors publish data to their own dataverse and put the citation to their data in their final submitted paper. Journals verify compliance by having the copyeditor check for the existence of the citation.
  • Journal verification. Authors submit a draft of the replication data to the journal's Dataverse. The journal reviews it and approves it for release. Finally, the dataset is published with a formal data citation and a link back to the article, as in the illustrative citation after this list. (See, for example, the Political Analysis Dataverse, with replication data back to 1999.)
  • Full automation. Seamless integration between the journal submission system and Dataverse, with an automatic link created between the article and the data. The result is that the process is easy for the journal and the author, and many errors are eliminated.
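
For concreteness, a formal Dataverse-style data citation of the kind referred to above typically takes the following shape; the author, title, and DOI here are placeholders, not a real entry:

    Doe, Jane, 2014, "Replication Data for: Example Article Title",
    https://doi.org/10.7910/DVN/XXXXX, Harvard Dataverse, V1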

Full automation in our third option is where we are heading. Already today, in 400 scholarly journals using Open Journal Systems, the author enters their data as part of submitting the final draft of the accepted paper for publication, and the citation, the permanent links between the data and the article, and formal preservation are all taken care of automatically. We are working on expanding this as an option for all of OJS's 5,000+ journals, and to a wide array of other scholarly journal publishers. The result will be that we capture data with the least effort on anyone's part, at exactly the point where it is easiest and most important to capture.

We are also working on extending Dataverse to cover the higher levels of security that are more prevalent in big data collections and in public health, medicine, and other areas with informative data on human subjects. Yes, you can preserve data and make it available under appropriate protections, even if you have highly confidential, proprietary, or otherwise sensitive data. We are working on other privacy tools as well. We already have an extensive versioning system in Dataverse, but are planning to add support for continuously updated data such as data streamed from sensors, tools for fast online data access, queries, visualization, analysis methods for when data cannot be moved because of size or privacy concerns, and ways to use the huge volume of web analytics to improve Dataverse and Zelig.

This post comes from the talk I gave at the American Political Science Association meetings in August 2014, using these slides. Many thanks to Mike Alvarez for inviting this post.

Featured image: Matrix code computer by Comfreak. CC0 via Pixabay.

The post Gary King: an update on Dataverse appeared first on OUPblog.
