Using Python
Data in Yoda is not directly accessible: you first have to download it to the machine that runs your analysis software. If you already do your analysis with Python scripts, for example on Snellius or Ada, it can be useful to script the data access and transfer as well.
Python iRODS Client
The Python iRODS Client (PRC) is the default way to access data in iRODS programmatically.
Install
pip install python-irodsclient

Setting up a session to access Yoda
The easiest way to set up a session to Yoda is to use the information in the iRODS environment file.
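For orientation, the sketch below checks an environment file for the four keys that the session code in this section reads. It writes a small example file to a temporary directory first, so it runs without a Yoda account. The host, user, and zone values are made up for illustration, and the helper name `missing_keys` is hypothetical; a real environment file from your Yoda instance will typically contain more settings.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical values for illustration; copy the real ones from your Yoda instance.
example_env = {
    "irods_host": "yoda.example.org",
    "irods_port": 1247,
    "irods_user_name": "researcher@example.org",
    "irods_zone_name": "exampleZone",
}

# The keys that the session setup in this section looks up.
REQUIRED_KEYS = ("irods_host", "irods_port", "irods_user_name", "irods_zone_name")


def missing_keys(env_file):
    """Return the required keys that are absent from the environment file."""
    with open(env_file, "r") as f:
        env = json.load(f)
    return [key for key in REQUIRED_KEYS if key not in env]


with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / "irods_environment.json"

    # A file with all required keys.
    env_file.write_text(json.dumps(example_env, indent=4))
    complete = missing_keys(env_file)

    # A file that only names the host.
    env_file.write_text(json.dumps({"irods_host": "yoda.example.org"}))
    incomplete = missing_keys(env_file)

print(complete)
print(incomplete)
```

Running a check like this before connecting gives a clearer message than a `KeyError` raised halfway through the session setup.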
The code below sets up a session using all the correct settings for Yoda:
import json
from irods.session import iRODSSession
from pathlib import Path
from getpass import getpass
import ssl


def get_irods_environment(irods_environment_file):
    """Read the irods_environment.json file, which contains the environment configuration."""
    print(f"Trying to retrieve connection settings from: {irods_environment_file}")
    try:
        with open(irods_environment_file, "r") as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        # Stop with a non-zero exit status if the file is missing or malformed.
        raise SystemExit(f"Could not open {irods_environment_file}")


def setup_session(ca_file='/etc/ssl/certs/ca-certificates.crt'):
    """Use the iRODS environment file to configure an iRODSSession.

    The user is prompted for the password.
    """
    irods_env = get_irods_environment(f"{Path.home()}/.irods/irods_environment.json")
    password = getpass(f"Enter valid DAP for user {irods_env['irods_user_name']}: ")
    ssl_context = ssl.create_default_context(
        purpose=ssl.Purpose.SERVER_AUTH, cafile=ca_file, capath=None, cadata=None
    )
    ssl_settings = {
        "client_server_negotiation": "request_server_negotiation",
        "client_server_policy": "CS_NEG_REQUIRE",
        "encryption_algorithm": "AES-256-CBC",
        "encryption_key_size": 32,
        "encryption_num_hash_rounds": 16,
        "encryption_salt_size": 8,
        "ssl_context": ssl_context,
    }
    session = iRODSSession(
        host=irods_env["irods_host"],
        port=irods_env["irods_port"],
        user=irods_env["irods_user_name"],
        password=password,
        zone=irods_env["irods_zone_name"],
        authentication_scheme="pam_password",
        **ssl_settings,
    )
    return session


session = setup_session()

# Workload: list the collections directly under /<zone>/home
coll = session.collections.get(f"/{session.zone}/home")
for col in coll.subcollections:
    print(col.name)

More information
You can find more information on using the iRODS client in the README on GitHub.
iBridges
The PRC can be hard to use because it requires prior knowledge of the structure and terminology used in iRODS. For this reason, developers at Utrecht University created iBridges, which makes basic file and metadata manipulation in iRODS easier.
Installation
Installation is again as simple as:
pip install ibridges

Connecting
To connect you will need the iRODS environment file. iBridges expects the file to be in ~/.irods/irods_environment.json, but you can point it to a different location.
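If the environment file does not always live in the default location, a small helper can decide which path to use. The sketch below is one possible approach, not part of iBridges: the function name `find_irods_env_file` is hypothetical, and it assumes that a custom location, if any, is given in the `IRODS_ENVIRONMENT_FILE` environment variable (the variable used by the iRODS command-line tools). Otherwise it falls back to the default path mentioned above.

```python
import os
from pathlib import Path


def find_irods_env_file(environ=None):
    """Return the path of the iRODS environment file to use.

    Assumption: a custom location may be set in IRODS_ENVIRONMENT_FILE;
    if it is not set, the default ~/.irods location is returned.
    """
    environ = os.environ if environ is None else environ
    custom = environ.get("IRODS_ENVIRONMENT_FILE")
    if custom:
        return Path(custom).expanduser()
    return Path.home() / ".irods" / "irods_environment.json"


# Show which file would be used on this machine.
print(find_irods_env_file())
```

The returned path can be passed as irods_env_path when creating the session.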
from ibridges import Session
from pathlib import Path
from getpass import getpass
password = getpass("Enter valid DAP: ")
session = Session(irods_env_path=Path.home() / ".irods" / "irods_environment.json", password=password)

Upload data
You can easily upload your data with the previously created session:
from ibridges import upload
upload(session, "/your/local/path", "/irods/path")

This upload function can upload both directories (collections in iRODS) and files (data objects in iRODS).
Add iRODS metadata
One of the powerful features of iRODS is its ability to store metadata with your data in a consistent manner. Let’s add some metadata to a collection or data object:
from ibridges import IrodsPath
ipath = IrodsPath(session, "/irods/path")
ipath.meta.add("some_key", "some_value", "some_units")

We have used the IrodsPath class here, another central class in the iBridges API. It gives access to the metadata, as shown above, and to many other convenient features, such as getting the size of a collection or data object. A detailed description of these features can be found in another part of the documentation.
Download data
Naturally, we also want to download the data back to our local machine. This is done with the download function:
from ibridges import download
download(session, "/irods/path", "/other/local/path")

Closing the session
When you are done with your session, you should generally close it:
session.close()

More information
More information on using iBridges can be found in the online documentation.
Streaming
With the python-irodsclient, which iBridges is built on, we can open the file inside a data object as a stream and process its content without downloading the data first. This is especially useful if you need to access data stored in large files. It works without any problems for textual data.
from ibridges import IrodsPath
obj_path = IrodsPath(session, "path", "to", "object")
with obj_path.open('r') as stream:
    content = stream.read().decode()

Some Python libraries can read directly from such a stream. Examples are pandas and polars for data files, and whisper for transcription and translation of audio files.
import pandas as pd

with obj_path.open('r') as stream:
    df = pd.read_csv(stream)

print(df)
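For very large files, reading the whole stream at once may not fit in memory. The sketch below processes a stream in fixed-size chunks, here simply counting lines. It assumes the stream behaves like an ordinary binary file object with a `read(size)` method; `io.BytesIO` stands in for the stream returned by `obj_path.open('r')` so that the example runs without a Yoda connection. The function name `count_lines` and the chunk size are illustrative choices.

```python
import io


def count_lines(stream, chunk_size=1024 * 1024):
    """Count newline characters in a binary stream, one chunk at a time."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # An empty read means the end of the stream was reached.
            break
        total += chunk.count(b"\n")
    return total


# io.BytesIO is a stand-in for the stream of a real data object.
fake_stream = io.BytesIO(b"id,value\n1,10\n2,20\n")
n_lines = count_lines(fake_stream, chunk_size=8)
print(n_lines)
```

With a real data object, the call would sit inside the `with obj_path.open('r') as stream:` block shown earlier.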