Feature #1163
openimage format convertor script to be used during or after makestack
Added by Anchi Cheng almost 14 years ago. Updated over 13 years ago.
0%
Description
At the Appion meeting on 2011-01-26, a decision was made regarding issue #1155: makestack will always make an IMAGIC format stack, but will allow the option of making a duplicate stack or individual boxed files in a different format. The duplicate stacks will be made by a converter script. Programs later in the pipeline will check whether a stack of the desired format exists. When it does, they can skip the conversion from IMAGIC format.
We will need to plan out a folder structure that associates stacks of different formats with the same database record of stack run and particle stack, so that future development is easier.
Updated by Anchi Cheng almost 14 years ago
- Status changed from New to Assigned
- Assignee set to Anchi Cheng
Gabe, Neil, Scott,
Need feedback from you guys since you have thought more about directory structure than anyone here.
Here is my plan:
1. We will have an Appion table, tentatively called ApStackFormatData, with extensible fields like this in sinedon appiondata format:
'stack':ApStackData,
'imagic_stack':ApPathData,
'spider_stack':ApPathData,
'spider_singles':ApPathData,
'mrc_stack':ApPathData,
'mrc_singles':ApPathData,
When stackconversion is run, the table is updated to add the path of the new format to the stack reference. Since sinedon does not allow updates, we will have to create a new row, but that is not a problem for queries.
2. Each ApPathData referenced there would hold the log file from running AppionScript, the done-dict, etc., and either the single stack file or a subfolder of singles. The subfolder name will be standardized and not format dependent. If we use proc2d to convert from imagic_stack, the done-dict will just be copied from the imagic_stack path during the conversion so that it carries the same information, and a mechanism will be needed to compare the done-dicts of the formats when the "continue" option is used.
3. When an Appion script needs a stack of a particular format, a function is used to return the path to the required format. The function queries ApStackFormatData. If a stack of the required format is available, it returns the path; if not, it makes the conversion, probably from imagic_stack, which should always be made by makestack2.py or substack.py.
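To make points 1 and 3 concrete, here is a minimal sketch in plain Python. It is not the sinedon API: StackFormatTable, get_stack_path and the convert callback are hypothetical names, and the list of rows only stands in for the proposed ApStackFormatData table. Rows are never updated; each change appends a new row and the newest row for a stack wins.

```python
class StackFormatTable:
    """Hypothetical in-memory stand-in for the proposed ApStackFormatData table."""

    def __init__(self):
        self.rows = []  # append-only, like a table that does not allow updates

    def latest(self, stack_id):
        # The newest row for this stack carries the complete set of known paths.
        for row in reversed(self.rows):
            if row['stack'] == stack_id:
                return row['paths']
        return {}

    def insert(self, stack_id, new_paths):
        # Copy the latest paths, add the new ones, and append a fresh row.
        paths = dict(self.latest(stack_id))
        paths.update(new_paths)
        self.rows.append({'stack': stack_id, 'paths': paths})


def get_stack_path(table, stack_id, fmt, convert):
    """Return the path for fmt, converting from imagic_stack if it is missing.

    convert is a hypothetical callable (source_path, fmt) -> new_path.
    """
    paths = table.latest(stack_id)
    if fmt in paths:
        return paths[fmt]
    new_path = convert(paths['imagic_stack'], fmt)
    table.insert(stack_id, {fmt: new_path})
    return new_path
```

A second request for the same format finds the path in the newest row and skips the conversion.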
Now the question for you: where should each of these point?
We know where imagic_stack would be:
imagic_stack_path = /appion/stacks/stackrun/
From a human point of view, all other conversions should go in a subfolder of that, i.e.,
spider_stack_path = /appion/stacks/stackrun/spider_stack/
From the Appion point of view, it seems more logical for it to be
spider_stack_path = /appion/conv_stacks/convrun/
What was done previously? Any concern that was an issue before? Preference? Any logical flaw you see in my plan?
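For comparison, the two candidate layouts can be written as small path helpers. This is only a sketch: nested_layout and separate_layout are hypothetical names, and the paths are the example paths from this update.

```python
import posixpath


def nested_layout(stackrun_path, fmt):
    """Human-oriented layout: converted stacks live under the original stack run."""
    return posixpath.join(stackrun_path, fmt) + '/'


def separate_layout(appion_root, convrun):
    """Appion-oriented layout: each conversion is its own run under conv_stacks."""
    return posixpath.join(appion_root, 'conv_stacks', convrun) + '/'


print(nested_layout('/appion/stacks/stackrun/', 'spider_stack'))
print(separate_layout('/appion', 'convrun'))
```

With the nested layout the converted path can be derived from the imagic_stack path alone, while the separate layout needs a run name for every conversion.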
Updated by Anchi Cheng over 13 years ago
- Status changed from Assigned to In Code Review
- Assignee changed from Anchi Cheng to Gabriel Lander
In r15481, I've coded a version of the idea from my last update. It works best in FreeAlign refinement, since the current Appion code uses a prepFreeAlign step that allows the python code to check for, and copy or create, the format required by the refinement before it is transferred to the cluster. It will not work with the current xmippRefine.py, because that script is only run on the remote cluster side, which may not have access to the formatted stack files unless a preparation step is added on the local processing host to gather the needed files first. A local poll among Appion users here showed that they don't mind having such a step in other refinement programs, as long as it is one click away from the actual transfer/submission of the refinement job to the cluster, as FreeAlign is now.
Gabe, since you are interested in participating in this, please see if this would work for the other formats that you would like to add. I still use the Imagic format for FreeAlign until I add the conversion to mrc. Assign it back to me after you have reviewed the code.
testStackFormat.py is just ...a test. The linkFormattedStack function will be called from AppionScript directly.
Updated by Gabriel Lander over 13 years ago
Hi Anchi,
Thanks for spearheading this. I like this so far, but I wonder if we need to figure out a better way to specify the file format. "Eman", for example, isn't a file format, but by default we use the Imagic format for Eman. Eman2 would use the "hdf" format, but that's also compatible with Eman1. Frealign could be mrc, imagic, or spider. Then we have Xmipp, which uses Spider format, but only individual spider files, not spider stacks. Having a list of available formats like "img,hdf,mrc,spi,spi(individual)", however, would be non-intuitive, so I don't know if that would work either.
Are we planning on having a 1-to-1 relationship between programs and stack formats? Will these be defined somewhere?
It was, of course, my intention to make this message as confusing as possible. Hopefully I succeeded.
Updated by Anchi Cheng over 13 years ago
Gabe,
I had a debate in my head as I was writing the code as well. In the end I gathered that even though every package claims to accept many file formats, it has a default format, and as we learned from the case of Frealign, the non-default formats may be problematic, and it is an endless task for us to figure out what each package's internal definition is in terms of the conversion. Therefore, I decided to define format as the package's default file format. This way, we can extend to the next package with its default format and not worry about whether it uses the same format as the others. If it so happens that one package uses exactly the same file format as another, such as possibly eman and IMAGIC, we can automatically insert the other into StackFormatData when one is created. For example, our current default stack format is the IMAGIC format used in eman, so makestack.py will have to fill in two fields in the data. If one day we decide to use a different default, we will be able to switch easily.
To answer your question in short: yes, I think the safest is to define a 1-to-1 relationship between package and format. I didn't spell it out at the beginning of apStackFormat.py, but that certainly should be done in the same file if we continue on this path.
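A small sketch of the automatic insertion described above, where one created stack fills the field of every package that shares the same default format. EQUIVALENT_PACKAGES and fields_to_fill are hypothetical names, and the eman/imagic grouping is only the example mentioned in this comment, not a definition taken from apStackFormat.py.

```python
# Assumption for illustration: groups of packages whose default formats are identical.
EQUIVALENT_PACKAGES = [
    {'eman', 'imagic'},
]


def fields_to_fill(package, path):
    """Return the {package_field: path} entries to record for one new stack."""
    fields = {package: path}
    for group in EQUIVALENT_PACKAGES:
        if package in group:
            # The same files serve every package in the group.
            for other in group:
                fields[other] = path
    return fields


print(sorted(fields_to_fill('eman', '/appion/stacks/stackrun/')))
print(sorted(fields_to_fill('xmipp', '/some/xmipp/path/')))
```

Changing the default stack format later would then only mean changing which package field makestack fills first.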
Updated by Anchi Cheng over 13 years ago
More thoughts. The other reason I ended up with package format, not file format, is that we are producing not only stack files of a particular format but often also the organization and naming that make them easier to input in the wrapper scripts. SPIDER actually never said that its files need to be called *.spi. Xmipp probably never said how many particles should be in each folder, or to use multiple folders rather than one.
Updated by Gabriel Lander over 13 years ago
Hi Anchi,
this plan sounds good to me then. I'm on board.
Updated by Neil Voss over 13 years ago
Note: Xmipp stacks need multiple folders because if you have over 10,000 files in a folder it can take several minutes to do a simple ls from the command line. Maybe more modern file systems are better, but the Appion-Xmipp system was designed out of necessity for reasonable speed.
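The multiple-folder idea can be sketched as a function that maps a particle index to a subfolder, so that no folder grows past a fixed size. The folder and file names and the per_folder value below are hypothetical, not the actual Appion-Xmipp naming.

```python
import posixpath


def sharded_path(root, index, per_folder=1000, ext='.spi'):
    """Place particle number index (0-based) in a subfolder of at most per_folder files."""
    folder = index // per_folder
    return posixpath.join(root, 'part%05d' % folder, 'particle%07d%s' % (index, ext))


print(sharded_path('/stack', 0))
print(sharded_path('/stack', 999))
print(sharded_path('/stack', 1000))
```

Any per_folder value well under 10,000 keeps a directory listing away from the slow case described in the note.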
Updated by Anchi Cheng over 13 years ago
- Assignee changed from Gabriel Lander to Anchi Cheng
Yes, Neil, I am using the multiple folder organization you created. I should mention that Christopher and I did a test to make sure that we would not lose too much time by transferring the particle stack as a tar file of the foldered xmipp format, rather than transferring one eman IMAGIC stack file and breaking it up at the remote cluster. It is a bit slower, but we decided that the overhead is compensated when more refinement runs are done on the same stack, because we will be saving the tar file at the formatted stack path.
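A sketch of reusing the tar file across refinement runs, using only the standard tarfile module. get_or_make_tar and the default tar name are hypothetical; the actual code in r15481 may be organized differently.

```python
import os
import tarfile


def get_or_make_tar(formatted_stack_path, tar_name='stack.tar'):
    """Return (tar_path, created). The tar is built once and reused afterwards."""
    tar_path = os.path.join(formatted_stack_path, tar_name)
    if os.path.exists(tar_path):
        return tar_path, False
    with tarfile.open(tar_path, 'w') as tar:
        for name in sorted(os.listdir(formatted_stack_path)):
            if name == tar_name:
                continue  # do not pack the archive into itself
            # Directories are added recursively by default.
            tar.add(os.path.join(formatted_stack_path, name), arcname=name)
    return tar_path, True
```

The first refinement run pays the cost of building the archive; later runs on the same stack only transfer the saved file.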
Updated by Scott Stagg over 13 years ago
I haven't looked at the code, but based on the discussion, this all looks good to me. Since we are moving to having a one to one correspondence between packages and stacks, what if we made mrc the official stack of Appion? That way, we could get rid of batchboxer for boxing the particles, and we could just do it internally with python. I realize that would affect lots of parts of the workflow, but it would free us from some of our EMAN1 dependence. We will probably have to do something like this eventually anyway if and when we migrate to EMAN2. Frealign and IMOD both use mrc stacks, so the Appion format would be compatible with those two packages without conversion. It would be a lot cleaner too.
Updated by Anchi Cheng over 13 years ago
- Status changed from In Code Review to Assigned
I don't want to make this change of default at the moment. Neil wrote a lot of stack statistics tools based on the IMAGIC format used in EMAN1, so it would be a lot to change. Also, we are in the process of revamping the php-mrc tool, which needs to be phased out because php 5.3 (used by CentOS 6, current Fedora, SUSE etc.) no longer allows us to compile it with php and gd. I think it will be better to make the switch when its replacement (named Redux by Jim) is usable on the web pages. The problem with mrc is that there is no standard, so the format mode noted in its header does not guarantee the data structure. We would have to rely on no two packages using the same mode number to define two different data types....
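To illustrate the mode problem, here is a sketch that reads the first four 32-bit integers of an MRC header (nx, ny, nz, mode) with the struct module. It assumes a little-endian file. The mode table follows the commonly used numbering, and flagging mode 0 as ambiguous reflects my understanding that some software reads it as signed bytes and other software as unsigned bytes; the function names are hypothetical.

```python
import struct

# Commonly used mode numbers; individual packages may interpret them differently.
MODE_TYPES = {0: 'int8', 1: 'int16', 2: 'float32', 6: 'uint16'}

# Assumption: mode 0 has been read as signed by some software and unsigned by other.
AMBIGUOUS_MODES = {0}


def read_mrc_mode(header_bytes):
    """Return ((nx, ny, nz), mode) from the start of a little-endian MRC header."""
    nx, ny, nz, mode = struct.unpack('<4i', header_bytes[:16])
    return (nx, ny, nz), mode


def describe_mode(mode):
    """Return (type name or None, whether the mode needs a per-package check)."""
    return MODE_TYPES.get(mode), mode in AMBIGUOUS_MODES


header = struct.pack('<4i', 64, 64, 10, 0) + b'\x00' * 1008
dims, mode = read_mrc_mode(header)
print(dims, mode, describe_mode(mode))
```

A converter would still need to know which package wrote the file before trusting the mode number.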
Updated by Gabriel Lander over 13 years ago
I agree with Scott that we should eventually move away from imagic stacks as the default, but I would like to propose that we switch to hdf5 as the default... I've been using this format a lot recently, and I am very happy with it. I can see why Ludtke decided to use it as the EMAN2 default. It's standardized, but offers a lot of flexibility, including making stacks of 3D models, which is really useful for multi-model refinements or sub-tomographic averaging. It would also be straightforward to write a python function to write to hdf5, in order to replace batchboxer. We should probably have a conference call about this when and if the time comes.