NIMH Data Archive NDA Data Access Webinar
Transcript
David Obenshain: Welcome to the NIMH Data Archive webinar on Data Access. This is a prerecorded webinar, so we are going to go ahead and start by playing the recording.
Ushna Ahmad: Now we will begin the NDA Data Access webinar, aimed primarily at researchers who are looking to build new research datasets using human subjects data from the hundreds of projects housed in NDA. All data are accessible to qualified researchers.
This webinar, NDA Data Access, is the last in our series of webinars. The first webinar is the New Grantee Sharing Orientation, aimed primarily at principal investigators; however, other staff are welcome and encouraged to attend. That training focuses on the first tasks that are expected to be completed early after grant award.
The second webinar is Data Harmonization. It's aimed at staff members who will be working with the actual data to format and submit it. This training dives into the meat of the NDA Data Dictionary; attending it early on will teach you how to format data. We urge you to take advantage of the already defined data structures in the NDA Data Dictionary, which can be used right away as a method for data collection. Using the Data Dictionary has the added benefit that your data are already formatted for submission to NDA.
The third webinar is Data Validation and Submission. It picks up where the Data Harmonization webinar left off, is offered around the two submission cycles each year, January 15th and July 15th, and helps prepare users to understand the process of validation, error handling, submission, and post-submission quality assurance and quality control.
This webinar, Data Access, is the last in our series. It discusses the process of requesting data access and covers various methods to query, package, and download data from NDA. It is offered around the two standard data sharing cycles each year, May 15th and November 15th.
Today, we will go over how to make an account with NDA and request access. We will then go over how to search for data in the NDA Archive, build queries, and the filter cart and package creation process.
Then, we'll focus on the different methods for downloading or accessing the data. Lastly, we'll talk about computational credits.
We hope that you'll walk away with an understanding of how to request permission, build queries, and access data from NDA. We also hope that you learn some tips along the way.
The NDA framework makes data available to the research community as soon as possible without compromising the ability of the original authors to interpret and communicate their findings.
So, this means we make available raw data so that researchers can see what data are coming from other projects as they plan primary or secondary analysis of their own and identify opportunities for collaboration.
So now we've covered the NDA framework and we will now go over the process of making an account with NDA and requesting access to the data. In order to access data, you'll need to follow these three steps. First, you will need to make an account with NDA.
So, under the Get Data section you'll click on the Request Data Access tab.
Sometimes it takes a minute to load.
So, this will ask you to either sign in or request an account.
Here, you should click on the request account button, which will take you to the form that we saw on the slide.
Here, you just need to fill out basic information about yourself and your affiliated institution. You can enter a username and password of your choice. Make sure to fill out all the starred fields, and once you've created an account, you'll receive a confirmation e-mail, and will then be able to login to request access.
Once you have logged in, you should navigate to the Data Permissions tab under your user dashboard to see what data you have access to.
In the second column, you'll see various permission groups and the corresponding descriptions.
Each permission group governs access to one NDA repository, which contains one or more collections of data from research projects. If you already have approved access to a permission group's data, under the Status column you will see the number of days until your access expires. If you don't have access, the box will be blank.
Under actions, you can click on the drop-down to request access.
This will bring up a window and walk you through the steps to gain access to data from that specific permission group.
All your active data requests will be seen at the table at the top.
Requests for access to shared data are typically processed in up to 10 business days.
Those approved will have access to the NDA repository for one year, after which you will need to re-apply even if you are still in the process of completing your research. You should include a progress report in your application.
We expect and encourage you to share your results back with NDA as an NDA Study, which is covered in one of our other webinars, as agreed upon in the Data Use Certification. If your access to the data expires, you are required to delete any NDA data you may have.
So, now that we've walked through the steps, how to make an account and request access to data, we'll talk about how to search for data using NDA Search and Filter types.
If you already know the data you are looking for exists or want to access pre-packaged data, then you can go straight to our New Query tool.
If you're interested in just seeing what data exists in NDA, then you can use our search tool.
To begin your search, you can navigate to the NDA homepage and type in a query in the search box. This search tool allows you to look for data or other information in the NDA by entering search terms and reviewing or filtering results based on the type of information. The search page will display hits and provide some summary information specific to that type.
This is a good tool for seeing what is available based on key terms you might have in mind.
Now I'm going to switch over to the browser again to walk you through an example. So, here's the search tool, and for our example, we're just going to search for fMRI.
In the righthand box, you can see the different search results.
So, Content contains results from across the Data Archive website. NDA Studies represent publications or other results based on data contained in NDA and serve as a link between the publication and the underlying data.
Publications are the actual papers that were published using NDA data and have an associated DOI. Collections each represent one research project, lab, or grant contributing data. Data structures represent a standardized definition of a measure or instrument that researchers use to collect and submit data.
Data elements are the data, value ranges, and descriptions that make up the data structure.
Experiments return different types of experiments related to your search, and what NDA Collection they are part of. Concepts return data based on ontological concepts. These concepts are based on an Autism Spectrum Disorder Phenotype Ontology defined by Alexa T McCray at Harvard University. So, you'll see in parentheses the number of results that are returned for each of these types.
And when you scroll down, it'll show you the top five results. And then you can load more to see the rest of them.
Once you've searched for data, to see what's available at a summary level, it's time to actually start building queries and explore data you have approved access for at an item level.
When ready to query data, you should go back to the NDA homepage and click on the Get Data tab.
There are currently five methods used to build filters to query data.
There's data from labs, data from papers, and the data dictionary. You can also query by concepts or by GUIDs. In addition, you can add filters from experiments or select one of our featured datasets.
Here, data from labs corresponds to collections and data from papers corresponds to studies. Query by GUID lets authenticated users find all data they have access to that is associated with a list of specific GUIDs.
Anyone can visit the NDA website and use the various query tools to browse summary and general information on the data, but those with approved access can view detailed subject level data, as well as build data packages.
When you've selected data that you want to package, you will first add it to your filter cart, which serves as a sort of shopping cart for your data.
If you make multiple selections in a single filter, for example, in the data dictionary, it will apply “or” logic and return all data that matches any selection.
Let's say you have image03 and fmriresults01, which are data structures.
Subject data from both data structures will be added to your filter cart.
Our new query tool also features a Selected Filters workspace, which allows you to select data you want to package and review it before adding it to your filter cart.
If you apply filters in subsequent transactions, for example, a collection, and then a specific data structure, it will use “and” logic and only return data that matches all the selections.
Therefore, if you add multiple filters to the cart, it will reduce the number of subjects available as the filters become more specific to your query. This is why it's important to pay attention to the order in which you build your queries. While one way is not necessarily better than another, it can affect the number of subjects available for packaging.
The filter cart allows you to edit or clear the selected filters as needed.
So now, from our new query filter tool, I'll show you how to make a query and add it to your filter cart. In this example, I'm going to add data from two data structures in the same transaction. So, the Filter cart will use “OR” logic to add subjects from both data structures to the cart.
So, I'm going to select this data structure, and add it to the workspace.
I can also search for data structures. So, here, I'll search for image03.
Now, in my workspace, I can see both the structures that I've added, and now I'm going to submit them to the filter cart in the same transaction.
Depending on the number of subjects I am putting in my filter cart, this could take a couple of moments.
Now, I'm going to add another filter.
And I'm going to filter by phenotypes here, filter by concepts.
So, for this example, I only want to add data to my cart on subjects that are depressed. So, I'm going to go ahead and select depressed. I'm going to add this to my workspace.
At this point, you can add more filters for the concepts, if you want. But, for our example, we're just going to stick to this one, and I'm going to go ahead and add it to my filter cart.
Once it's been added, we should see a reduction in the number of subjects in our filter cart.
Again, this might take a minute or so.
OK, and it looks like it's done. It returns zero subjects, which just means that the filter we added to the cart doesn't match any of the subjects already there.
If this happens, you can simply edit your filter cart and remove the previous filter, which will bring the subjects back into your filter cart.
While the filter cart is updating again, I'm going to switch back and show you how to actually create the package.
After you have added all the data that you want to your filter cart, it's time to create a Package. A package is tied to your user and contains data that you have access to and selected to download and access.
Once you have used one or more query tools to add filters to your cart for download, you'll be able to view and edit filters in the Cart Panel in the upper, right-hand corner of your page.
Clicking Package/Add to Study will take you to a landing page.
This is a page where you can view the data you currently have returned by your query.
The left panel displays all of the collections that contain subject data that satisfy the filters in your cart.
Collections are organized in the left panel by the permission group governing the data, and here you will see whether you have access to that permission group. If you do not, and you want to request access, you can follow the link provided.
The right panel displays a list of all the data structures that your query returned. The data structures contain individual-level subject data from the collections in the left panel. You can check or uncheck collections and structures to include them in or remove them from this particular package.
An individual must be in both a checked collection and a checked structure to be included.
You can also click find all subject data to drop the existing query and replace it with a query by GUID filter of all data for all currently included subjects.
This will help you identify if the subjects have data in other NDA collections. But you will lose the current query output by selecting this option.
Also, don't be confused when you see only collections in the left panel after you've added studies to your filter cart. All studies link directly to one or multiple collections, and individual-level data is always housed in collections, not studies. Studies contain analyzed data and experimental design metadata.
Once your selections are made, you can click Create Package to name and begin creating your package, which you can directly download or push into a database that you can then access virtually in the NDA Cloud.
This database is called a miNDAR and allows you to work with data easily: access raw omics and imaging files, and even submit analyzed data back to NDA, without ever downloading the data and managing it locally.
Make sure to give your package a new, unique name, as old packages can be overwritten.
You'll have the option to include or exclude associated files.
These are associated raw and analyzed data files such as omics, EEG, and images.
Note that including these files in your package will increase the size of your package.
Not including them in your package doesn't mean you won't have access to them for download later. We will go over how to download associated files not included in the package in upcoming steps.
If you suspect that adding associated files will make your package larger than five terabytes, don't select the option to include associated files; NDA has a download limit of five terabytes for all users. If you need to work with a dataset whose associated files exceed five terabytes, you can compute in the cloud, which we will go over shortly.
To view all your packages, you can go back to the user dashboard and select packages.
This will show you the package ID and the package name that you created.
You can also click on the drop-down to see which packages you have that are shared packages.
Now that we've gone over how to create a package, we'll want to access the data inside it.
There are several ways to do this, and the first one we're going to go over is the miNDAR.
So, a miNDAR, or miniature NDAR, is an Oracle database hosted in Amazon Web Services for your data. We sort of touched on this earlier.
When you create a miNDAR, you'll receive an e-mail with instructions on how to connect to your database.
To create a miNDAR, you're going to go back to your dashboard under Packages and on the table under Actions, you're going to hit the drop-down and click on Create miNDAR.
It's going to ask you for a unique password that is specific to your miNDAR. It does not have to be the same one you use to login to NDA.
And, just like I mentioned, it will send you an e-mail confirmation with directions on how to connect to your database.
Once you've connected to your database, you'll be able to view the tables. Each structure in your package will translate to a table in the database, which you will be able to connect to using the provided credentials.
In the S3 links table, you will find the location of all the rich data files, such as omics and imaging files directly where they are stored in Amazon S3.
They'll be listed in the table, even if you package your data to exclude the associated files.
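If you want to script against your miNDAR directly, here is a minimal Python sketch, assuming the cx_Oracle package; the host, port, service name, and username come from your confirmation e-mail, and the values shown are placeholders.

    # Minimal sketch: connect to a miNDAR and list its tables.
    # Assumes cx_Oracle is installed (pip install cx_Oracle) along with
    # the Oracle Instant Client libraries it depends on.
    import cx_Oracle

    # Replace these placeholders with the values from your miNDAR e-mail.
    dsn = cx_Oracle.makedsn("MINDAR_HOST_FROM_EMAIL", 1521,
                            service_name="SERVICE_NAME_FROM_EMAIL")
    conn = cx_Oracle.connect(user="USERNAME_FROM_EMAIL",
                             password="your_miNDAR_password", dsn=dsn)

    # Each data structure in your package becomes one table; the S3 links
    # table is among them.
    cur = conn.cursor()
    cur.execute("SELECT table_name FROM user_tables")
    for (table_name,) in cur:
        print(table_name)
    conn.close()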
We will talk about how to access these files in more detail in the following sections.
So now that we have talked more about accessing your package through a miNDAR, we'll discuss methods to actually begin downloading and streaming the contents of your package.
The easiest way to download your data is by using Download Manager, a .JNLP file that can be downloaded from the Tools tab. Note, you will need to have Java installed to use this tool.
You can also install our Python package, nda-tools, which includes a built-in command-line download tool.
Lastly, you can use the Jupyter Notebook on our GitHub page, which also makes use of NDA tools. We will discuss all three methods in detail.
First, we will go over Download Manager. You can download this file from the NDA homepage under the Tools tab.
Once you have downloaded the JNLP file, you can open it up where it will ask you to login using your NDA username and password.
It will then show you our Data Access terms, which you should read over. Once accepted, you'll be able to view a list of your packages. This tool will allow you to directly download your package to a location of your choice. Please note that, for any package exceeding the size limit of five terabytes, you will need to use an alternative access method. This is why it is important to build specific filters when querying the data.
If you have recently created your package, it may still display the status of Creating Package when you launch the manager. You can update the Download Manager status by using the refresh button.
Browse allows you to select a new download destination. The options at the bottom allow you to stop ongoing downloads, start all selected downloads, and delete or clear packages from the interface. Packages are tied to your account and will persist after you have downloaded them.
To use computational methods to access data in the Cloud, you'll need two things. First, you'll need a list of S3 locations where the data files are located, and second, you'll need temporary AWS tokens.
To obtain the S3 locations, you can look at the S3 links table in the miNDAR, as explained previously. You can also look at the data structure file, such as image03 to find those locations.
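As a rough illustration of the second approach, here is a short Python sketch using pandas; the file name, the skipped description row, and the image_file column are assumptions, so adjust them to match the actual layout of your structure file.

    import pandas as pd

    # NDA data structure files are tab-delimited; this sketch assumes the
    # second row holds element descriptions and skips it.
    df = pd.read_csv("image03.txt", sep="\t", skiprows=[1], low_memory=False)

    # The column holding S3 links varies by structure; image_file is assumed.
    s3_links = df["image_file"].dropna().tolist()
    print(len(s3_links), "S3 locations found")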
Once you have the S3 locations, you'll need to generate temporary AWS tokens. Note that these tokens expire after 24 hours but are free to regenerate. Tokens can be generated in Download Manager by clicking Tools, then Generate AWS Credentials.
You can also find Python and curl scripts on our GitHub page to generate the tokens.
Now, you can use any third-party tool to download your S3 files using the temporary tokens, or you can simply use nda-tools.
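To make that concrete, here is a hedged Python sketch that generates temporary tokens with the nda_aws_token_generator module from our GitHub page and downloads one file with the third-party boto3 library; the module's interface, the token attribute names, and the sample bucket and key are assumptions, so check the scripts in the repository for the exact details.

    import getpass
    import boto3  # third-party AWS SDK: pip install boto3
    from nda_aws_token_generator import NDATokenGenerator  # from the NDAR GitHub page

    # Generate temporary credentials (assumed interface; see the GitHub scripts).
    generator = NDATokenGenerator("https://nda.nih.gov/DataManager/dataManager")
    token = generator.generate_token(input("NDA username: "), getpass.getpass())

    # Build an S3 client from the temporary credentials.
    s3 = boto3.client(
        "s3",
        aws_access_key_id=token.access_key,
        aws_secret_access_key=token.secret_key,
        aws_session_token=token.session,
    )

    # Download one file; the bucket and key here are placeholders parsed from
    # an S3 link such as s3://NDAR_Central_1/submission_123/file.nii.gz.
    s3.download_file("NDAR_Central_1", "submission_123/file.nii.gz", "file.nii.gz")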
nda-tools is an easy-to-install Python package that will allow you to download the data using command-line arguments and will take care of generating and refreshing your temporary tokens. To install, open up a terminal or command prompt. You should verify you have a version of Python and pip installed on your machine.
Then simply run pip install nda-tools to install the latest version of the package.
You can also clone a copy from our GitHub page and install from there.
Once installed, you can run downloadcmd --help to see a list of options. Files are downloaded automatically into a folder called AWS_downloads in your home directory. This can be configured to a directory of your choice.
The downloadcmd tool provides four options for downloading your data files from S3.
You can enter the name of the file directly on the command line, meaning the full S3 path. You can pass in a .txt file containing a list of S3 files to download.
So, if you have a list of S3 files, you can save it in a text file and then pass that in as an argument.
You can pass in a data structure .txt file. So, if, in your package, you had an image03 data structure, which lists a bunch of S3 links, you can pass in that data structure file.
And the tool will download the S3 links from that file. Or you can pass in a package ID, and from there it can download the files that you packaged.
This allows you the flexibility of downloading a specific subset of your data package. Furthermore, NDA tools can be installed in any OS and Python version, so you can work with data directly in the cloud, for example, from your EC2 instance.
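As a quick reference, the four forms look roughly like the commands below; flag names have changed across nda-tools releases, so treat these as a sketch and run downloadcmd --help to confirm the options in your installed version.

    # 1. A full S3 path given directly on the command line
    downloadcmd s3://NDAR_Central_1/submission_123/file.nii.gz

    # 2. A .txt file listing S3 links, one per line
    downloadcmd sample_links.txt -t

    # 3. A data structure file from your package, such as image03
    downloadcmd image03.txt -ds

    # 4. A package ID, to download the files in that package
    downloadcmd 123456 -dp

    # In each case, an optional -d <directory> changes the download location.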
So, now I'm going to switch over to the terminal, and I want to show a small example of how the downloadcmd tool works. Once you have nda-tools properly installed, this command-line tool comes with it. So, it's just downloadcmd, and with the --help option, you can see a list of the different arguments you can pass in.
So, for my example, I have a Sample Links text file, which just contains a couple of S3 links that I want to download.
And so, I'm going to run downloadcmd with the name of the file, and specify that it's a text file, a .txt, with the -t argument. There are also optional flags, such as -d to change the download directory. And I'm going to go ahead and hit Enter.
And then, if I look in my Finder, I can see the files are downloading.
And I just had it set to the default directory. But, again, you could configure that by passing in the -d option to change the directory. So it's pretty simple to use. And, like I said, you can also pass in the S3 link directly, or you can pass in a data structure file or a package ID.
The last method I want to go over is the Jupyter Notebook. The nda-tools Jupyter Notebook is available on our GitHub page. This notebook will allow you to work with nda-tools in a more interactive format. It provides the same four download options as the downloadcmd command-line tool. This code also serves as a starting point for users to write their own Python scripts using nda-tools, allowing them to download data from NDA as part of a larger pipeline.
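For example, here is a minimal sketch of how a larger Python pipeline might wrap the same download step by shelling out to downloadcmd; the links file and output directory are hypothetical, and the flags carry the same version caveat as above.

    import subprocess

    # Hypothetical inputs: a text file of S3 links and a target directory.
    links_file = "sample_links.txt"
    out_dir = "/data/nda_downloads"

    # Run downloadcmd exactly as you would at the terminal.
    result = subprocess.run(
        ["downloadcmd", links_file, "-t", "-d", out_dir],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)
    # ...continue the pipeline here, e.g. preprocess the downloaded files.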
So, once you have Jupyter set up and this file downloaded, you can start the notebook. It also has some directions on how to validate and submit data. If you scroll toward the bottom, under the Download section, you can use it to download data directly.
So, you just have to run each cell.
Here, you can see that we're importing certain classes from NDA tools.
And here I'm going to use a data structure file, which is image03. And I'm going to go ahead and click Submit.
I'm going to run this, and it prints out the S3 files that it's going to download.
Here, you can generate the tokens, and it'll print out the access key, secret key, and session token.
Then you can click on start downloads, and it'll start downloading the files.
So, as it's downloading, I'm going to open my terminal, and it also defaults to the AWS_downloads directory. At least, you can see that's where I had it configured in the script.
But, again, this is something that you can change, if you would like, in your code, and we can see the files are downloaded here and are also listed here as they download.
So, the last topic we're going to cover today is computational credits.
The NDA has a computational credits pilot program, which aims to provide an efficient model for scientists to request and receive approval for computational access into the NDA. Through Amazon Web Services, NDA awards credits that support compute resources and short-term data storage for specific research initiatives. The resources are provided through the NIMH Data Archive, a federal system, and are to be used only in support of the approved research purpose.
To be approved for computational credits, researchers must meet three criteria.
First, they must be approved for access to one of the NDA permission groups.
Second, researchers and all colleagues are required to apply for computational credits. The application must include how the requested resources will enhance research discovery and the mission of the NDA, the request's potential contribution to science, the amount of initial funds needed, the amount of storage required, the services needed, and the number of simultaneous NDA AWS instances required.
And lastly, after the request is approved and computational credits are provided, a required Security and Best Practices webinar will be scheduled. All contributors must attend the webinar to receive accounts and access.
We highly recommend you access your data in the cloud. Our datasets are quite large in size, and oftentimes researchers do not have the storage resources to download and keep the files locally.
Due to network and connection issues, downloading locally is not always efficient.
Performing downloads and streaming files in the cloud is much faster.
We'd like to thank you for attending today's webinar and hope it was helpful.
A recording will be included in our email.
As always, feel free to contact us with any questions. We have a helpdesk that can be reached via e-mail.
There are also resources on our website, such as video tutorials, to walk you through many processes, and the staff are always available and willing to help.
David Obenshain:
Thank you, everybody, for joining us for today's webinar. We have around 20 minutes left.
If anybody has any questions that they'd like to ask, you can ask them via the question interface in GoToWebinar. Otherwise, after the recording is sent, you can email us:
Our e-mail address is NDAHelp@mail.nih.gov
You can also send us questions after this webinar at the same address, and we'll be happy to answer any questions that you have. So, we'll wait around to see if anybody has any questions.
So, somebody had a question about getting the recording over e-mail? That's correct, we will. We'll send out an e-mail to attendees tomorrow with the recording.
We had a question about showing, again, how to add a second layer of filters. If you'll give us a moment, we can go back and replay that portion of the recording. So, give me one minute.
[Recording skips to 39:21]
Recording repeated: “Now, we will begin the NDA Data Access webinar focused primarily for researchers who are looking to build new research datasets using human subjects data from hundreds of projects housed.”
So, to answer: there was a question from an attendee, whose name I'm probably going to mispronounce, so I won't attempt it. The question asked us to go over adding a second layer of filters. I can talk through adding the second layer of filters.
Generally, if you want to add a second layer of filters, you wait until you've added your first filter. I can show that here.
Go to Get Data.
And we want to go ahead and add a filter. So, for example, maybe I want to get a specific sex within a specific age range. So that would be one filter. The first thing you do is add that to your Workspace. When you do that, you'll see a little dialog over here that says you added it.
So, I can add multiple filters here before submitting to the filter cart, or I can submit to the cart now and come back.
So, we're not allowed to add demographic filters first.
So, you need to add another type of filter before you can add that. But the important thing to take away about layering filters is that the filter logic works based on the previous set of filters in the cart. So, if you think about a funnel filtering from a large amount of data down to a small amount of data: the first filter is applied, and then subsequent filters are applied to the results of the initial filters that you've added. So, hopefully, that answers the question that was asked. If that's not the answer you're looking for, you can let us know.
So, I can show an example. We can add this to the workspace. Now, we see we have these two filters. I can't run this filter first, so I'll remove this one.
Right, so there's a question about deleting selected filters. From the Workspace, you can remove individual filters with the Remove button, so you can remove either the type or the class of filter.
So, this is the data dictionary; I can remove that entire thing. If I had more than one data dictionary item included in the filter, I could remove individual items using these links here.
I can also empty my Workspace entirely if I want, so if I wanted to come back here.
I'll show an example where we add more than one data dictionary item at a time. So, here, I've selected two items from the data dictionary category to add to the filter cart. If we then go to our workspace, we'll see that under the data dictionary category I have two items. So these would be OR'd together, because they're being added at the same time.
So, someone described this as a layer of filtering, so these are the same layer. If I want to then go now and add demographics to my filter cart, I can do that.
For some reason, on my browser here, these side panels are not closing.
When you're done with the filters that you want, you click Submit Filter Cart. That will run all your filters and give you a link so you can access the results from the landing page and create a package.
I have not seen any additional questions in the last three minutes.
I still see we have a handful of people on the line. I think we will wait another five minutes. If anyone has any questions, now would be a good time to ask them. Otherwise, we will end the webinar, and you can ask your questions by contacting NDAHelp@mail.nih.gov.
We haven't had any additional questions. So, we're going to go ahead and end the webinar. Again, we'll follow up with an e-mail tomorrow regarding the recording.
If you have any other questions, please contact NDAHelp@mail.nih.gov.
Thank you, everybody.