Tutorial 4. Performing queries

In this tutorial, we will query data from a layer. Let’s set up a simple layer object.

[1]:
import numpy as np
from pyxc.core.layer import Layer
from pyxc.core.processor.arrays import column_parser
from pyxc.core.container import Container2D
from pyxc.core.loader import ImageLoader, XYDLoader
from pyxc.transform.homography import Homography

EBSD = np.genfromtxt(
    "./data/SiC_in_NiSA.ctf", dtype=float, skip_header=15, delimiter="\t", names=True
)

layer_ebsd = Layer(
    data=column_parser(EBSD, format_string="dxydddddddd"),
    container=Container2D,
    dataloader=XYDLoader,
    transformer=Homography,
)

You have two ways to query data: by a single coordinate or by multiple coordinates.

The first option provides better flexibility. You can receive correlation results and you can run your own analysis. The second option provides better convenience but is rather limited.

Let’s see!

Single point query

You can query the data with a single coordinate. In addition to the columns held by the container object, the result includes several extra columns:

1. query_index: for internal reference. It is explained later in this tutorial.
2. distance: Euclidean distance between the given coordinate and the matched point.
3. x-coordinates: the query x coordinate.
4. y-coordinates: the query y coordinate.

Also note that there are several x- and y-related columns. Read this carefully:

1. x: distortion-corrected x.
2. y: distortion-corrected y.
3. x_raw: initially supplied x value, before correction.
4. y_raw: initially supplied y value, before correction.
5. x-coordinates: x of the query.
6. y-coordinates: y of the query.

[2]:
layer_ebsd.query(5, 5)
[2]:
array([(1818, 4.9977, 4.9977, 4.9977, 4.9977, 2., 9., 0., 160.46, 48.224, 234.31, 1.1127, 161., 255., 0, 0.00325239, 5, 5)],
      dtype=[('row', '<i4'), ('x', '<f4'), ('y', '<f4'), ('x_raw', '<f4'), ('y_raw', '<f4'), ('Phase', '<f8'), ('Bands', '<f8'), ('Error', '<f8'), ('Euler1', '<f8'), ('Euler2', '<f8'), ('Euler3', '<f8'), ('MAD', '<f8'), ('BC', '<f8'), ('BS', '<f8'), ('query_index', '<i8'), ('distance', '<f8'), ('x-coordinates', '<i8'), ('y-coordinates', '<i8')])
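
The returned object is a NumPy structured array, so each column can be read by field name. The sketch below does not call pyxc: it builds a small stand-in array by hand, using only a subset of the fields and values copied from the output above, so that it runs without the data file.

```python
import numpy as np

# Stand-in for a query result: a subset of the fields printed above,
# with values copied from that output (not produced by pyxc here).
result = np.array(
    [(1818, 4.9977, 4.9977, 161.0, 0, 0.00325239, 5, 5)],
    dtype=[
        ("row", "<i4"), ("x", "<f4"), ("y", "<f4"), ("BC", "<f8"),
        ("query_index", "<i8"), ("distance", "<f8"),
        ("x-coordinates", "<i8"), ("y-coordinates", "<i8"),
    ],
)

# Read a single column by its field name.
distance = result["distance"]

# Select several columns at once by passing a list of field names.
coords = result[["x", "y"]]

print(result.dtype.names)
print(float(distance[0]))
```

Field names that contain a hyphen, such as "x-coordinates", can only be reached with this bracket syntax.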

There are two important options, cutoff and output_number. If the nearest data point is farther from the query coordinate than the cutoff, the query returns no rows. For example,

[3]:
layer_ebsd.query(5, 5, cutoff=0.0001)
/home/docs/checkouts/readthedocs.org/user_builds/pyxc/envs/latest/lib/python3.11/site-packages/pyxc/core/layer.py:317: UserWarning: Couldn't find the matching point. Please ignore rows containing NaN.
  warn("Couldn't find the matching point. Please ignore rows containing NaN.")
[3]:
array([],
      dtype=[('row', '<i4'), ('x', '<f4'), ('y', '<f4'), ('x_raw', '<f4'), ('y_raw', '<f4'), ('Phase', '<f8'), ('Bands', '<f8'), ('Error', '<f8'), ('Euler1', '<f8'), ('Euler2', '<f8'), ('Euler3', '<f8'), ('MAD', '<f8'), ('BC', '<f8'), ('BS', '<f8'), ('query_index', '<i8'), ('distance', '<f8'), ('x-coordinates', '<f8'), ('y-coordinates', '<f8')])

Furthermore, you can get more data points by explicitly specifying the cutoff and output_number parameters.

[4]:
layer_ebsd.query(x=5, y=5, cutoff=5, output_number=5)
[4]:
array([(1818, 4.9977, 4.9977, 4.9977, 4.9977, 2.,  9., 0., 160.46, 48.224, 234.31, 1.1127, 161., 255., 0, 0.00325239, 5, 5),
       (1918, 4.9977, 5.2753, 4.9977, 5.2753, 2., 10., 0., 161.  , 48.286, 234.02, 1.1432, 167., 255., 0, 0.27530963, 5, 5),
       (1819, 5.2753, 4.9977, 5.2753, 4.9977, 2., 10., 0., 161.09, 48.425, 233.9 , 1.1092, 151., 255., 0, 0.27530963, 5, 5),
       (1817, 4.72  , 4.9977, 4.72  , 4.9977, 2., 10., 0., 160.55, 48.13 , 233.82, 1.3118, 159., 255., 0, 0.28000965, 5, 5),
       (1718, 4.9977, 4.72  , 4.9977, 4.72  , 2.,  9., 0., 160.78, 48.349, 234.04, 1.0346, 159., 255., 0, 0.28000965, 5, 5)],
      dtype=[('row', '<i4'), ('x', '<f4'), ('y', '<f4'), ('x_raw', '<f4'), ('y_raw', '<f4'), ('Phase', '<f8'), ('Bands', '<f8'), ('Error', '<f8'), ('Euler1', '<f8'), ('Euler2', '<f8'), ('Euler3', '<f8'), ('MAD', '<f8'), ('BC', '<f8'), ('BS', '<f8'), ('query_index', '<i8'), ('distance', '<f8'), ('x-coordinates', '<i8'), ('y-coordinates', '<i8')])
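
To make the roles of cutoff and output_number concrete, here is a brute-force sketch in plain NumPy. It is not how pyxc performs the search; the helper nearest_within, the regular grid, and the choice to treat the cutoff as inclusive are all assumptions made for illustration.

```python
import numpy as np


def nearest_within(points, x, y, cutoff, output_number=1):
    """Hypothetical helper: indices and distances of up to `output_number`
    points lying within `cutoff` of (x, y), nearest first."""
    d = np.hypot(points[:, 0] - x, points[:, 1] - y)
    order = np.argsort(d, kind="stable")
    # Keep only candidates inside the cutoff, then cap the count.
    order = order[d[order] <= cutoff][:output_number]
    return order, d[order]


# Hypothetical regular grid with 0.25 spacing (not the EBSD data).
axis = np.linspace(0, 10, 41)
gx, gy = np.meshgrid(axis, axis)
points = np.column_stack([gx.ravel(), gy.ravel()])

# A generous cutoff with output_number=5 yields five neighbours.
idx, dist = nearest_within(points, 5.1, 5.1, cutoff=5, output_number=5)
print(points[idx], dist)

# A very small cutoff yields nothing, as in the empty result above.
empty_idx, empty_dist = nearest_within(points, 5.1, 5.1, cutoff=0.0001)
print(len(empty_idx))
```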

Multi point query

Let’s do it more conveniently! You can retrieve data for multiple points at once. If the data is large, execute_queries might take one or two minutes. This is perfectly normal: it is preparing parallel execution.

[5]:
xs = [4.1, 4.2, 4.3]
ys = [4.5, 4.6, 4.7]
layer_ebsd.execute_queries(xs, ys)
Maximum worker:  6
Executing queries: 100%|██████████| 3/3 [00:00<00:00, 13781.94it/s]
[5]:
array([(1615, 4.1647, 4.4424, 4.1647, 4.4424, 2., 10., 0., 161.55, 48.938, 233.88, 0.96  , 172., 255., 0, 0.0866248 , 4.1, 4.5),
       (1715, 4.1647, 4.72  , 4.1647, 4.72  , 2., 10., 0., 161.2 , 49.065, 233.51, 1.0923, 162., 255., 1, 0.12508412, 4.2, 4.6),
       (1715, 4.1647, 4.72  , 4.1647, 4.72  , 2., 10., 0., 161.2 , 49.065, 233.51, 1.0923, 162., 255., 2, 0.13677015, 4.3, 4.7)],
      dtype=[('row', '<i4'), ('x', '<f4'), ('y', '<f4'), ('x_raw', '<f4'), ('y_raw', '<f4'), ('Phase', '<f8'), ('Bands', '<f8'), ('Error', '<f8'), ('Euler1', '<f8'), ('Euler2', '<f8'), ('Euler3', '<f8'), ('MAD', '<f8'), ('BC', '<f8'), ('BS', '<f8'), ('query_index', '<i8'), ('distance', '<f8'), ('x-coordinates', '<f8'), ('y-coordinates', '<f8')])

Use the query_index column to filter out points that were not correlated!

Warning

Read the code below very carefully. There is no guarantee that every point you provide yields a correlation result. If a point is too far from any data point (beyond the cutoff distance), no result is returned for it. You then need to filter out the points that were not matched by using the query_index column.

This is especially useful when you are comparing correlation results with the serialized data.

Let’s assume we have xs, ys, and hardness values. For example, the data below means that the point at (4.1, 4.5) has a hardness of 100 MPa. The 4th point (-10, -10, 150) is deliberately placed where no data exists.

[6]:
xs = np.array([4.1, 4.2, 4.3, -10])
ys = np.array([4.5, 4.6, 4.7, -10])
hd = np.array([100, 200, 110, 150])
result = layer_ebsd.execute_queries(xs, ys)
Maximum worker:  6
Executing queries: 100%|██████████| 4/4 [00:00<00:00, 18808.54it/s]
[7]:
result
[7]:
array([(1615, 4.1647, 4.4424, 4.1647, 4.4424, 2., 10., 0., 161.55, 48.938, 233.88, 0.96  , 172., 255., 0, 0.0866248 , 4.1, 4.5),
       (1715, 4.1647, 4.72  , 4.1647, 4.72  , 2., 10., 0., 161.2 , 49.065, 233.51, 1.0923, 162., 255., 1, 0.12508412, 4.2, 4.6),
       (1715, 4.1647, 4.72  , 4.1647, 4.72  , 2., 10., 0., 161.2 , 49.065, 233.51, 1.0923, 162., 255., 2, 0.13677015, 4.3, 4.7)],
      dtype=[('row', '<i4'), ('x', '<f4'), ('y', '<f4'), ('x_raw', '<f4'), ('y_raw', '<f4'), ('Phase', '<f8'), ('Bands', '<f8'), ('Error', '<f8'), ('Euler1', '<f8'), ('Euler2', '<f8'), ('Euler3', '<f8'), ('MAD', '<f8'), ('BC', '<f8'), ('BS', '<f8'), ('query_index', '<i8'), ('distance', '<f8'), ('x-coordinates', '<f8'), ('y-coordinates', '<f8')])

Now you can see that the provided data has a length of 4, but the returned data only has a length of 3. So it is not directly plottable. In this case, ‘query_index’ plays a significant role. It can be used to filter out failed data points from the initially provided data, like below:

[8]:
xs_refined = xs[result["query_index"]]
ys_refined = ys[result["query_index"]]
hd_refined = hd[result["query_index"]]
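
The same column can also tell you which of the supplied points failed. The sketch below is plain NumPy and does not call pyxc: query_index is a hand-written stand-in for result["query_index"], with the values shown in the output above.

```python
import numpy as np

xs = np.array([4.1, 4.2, 4.3, -10])
ys = np.array([4.5, 4.6, 4.7, -10])

# Stand-in for result["query_index"], copied from the output above.
query_index = np.array([0, 1, 2])

# Mark every input position that produced a result.
matched = np.zeros(len(xs), dtype=bool)
matched[query_index] = True

# Coordinates of the inputs that returned nothing.
failed_points = np.column_stack([xs[~matched], ys[~matched]])
print(failed_points)
```

Listing the failed coordinates is a quick sanity check before plotting, because it shows whether points were dropped on purpose or because the cutoff was too tight.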

Now you can use the query result with your own hardness data, for example:

[9]:
import matplotlib.pyplot as plt

plt.scatter(result["BC"], hd_refined)
[9]:
<matplotlib.collections.PathCollection at 0x7f2bbd72c050>
[Figure: scatter plot of result["BC"] against hd_refined]

However, the multi-point query has one caveat: on its own, it cannot handle an output_number other than 1. If you request more than one result per point without a reducer, you will get an error.

[10]:
xs = [4.1, 4.2, 4.3]
ys = [4.5, 4.6, 4.7]
layer_ebsd.execute_queries(xs, ys, output_number=2)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[10], line 3
      1 xs = [4.1, 4.2, 4.3]
      2 ys = [4.5, 4.6, 4.7]
----> 3 layer_ebsd.execute_queries(xs, ys, output_number=2)

File ~/checkouts/readthedocs.org/user_builds/pyxc/envs/latest/lib/python3.11/site-packages/pyxc/core/layer.py:394, in Layer.execute_queries(self, xs, ys, cutoff, output_number, reducer, **kwargs)
    389     raise ValueError(
    390         "Error: The length of xs must be the same as the length of ys"
    391     )
    393 if output_number != 1 and reducer is None:
--> 394     raise ValueError(
    395         "Error: Output number must be 1 (nearest-neighbour) or reducer should be specified."
    396     )
    398 results = []
    400 with ThreadPoolExecutor(**kwargs) as executor:

ValueError: Error: Output number must be 1 (nearest-neighbour) or reducer should be specified.

You can specify a reducer to handle this situation. A Reducer object is built from a List[Tuple[Callable, List['ColumnNames']]]. Each callable should accept a 1-dimensional NumPy array and return a single value, such as np.std or np.mean.

The Reducer object can also be used for a single-point query. It is useful for statistical analyses of the results.

[11]:
from pyxc.core.processor.reducer import Reducer

reducer_obj = Reducer([(np.mean, ["BS", "Phase"]), (np.std, ["BS", "Phase"])])

Then you can run the query like this. Note that the result has new columns such as “Phase_std”.

[12]:
xs = [4.1, 4.2, 4.3, -10]
ys = [4.5, 4.6, 4.7, -10]
layer_ebsd.execute_queries(xs, ys, output_number=2, reducer=reducer_obj)
/home/docs/checkouts/readthedocs.org/user_builds/pyxc/envs/latest/lib/python3.11/site-packages/pyxc/core/layer.py:317: UserWarning: Couldn't find the matching point. Please ignore rows containing NaN.
  warn("Couldn't find the matching point. Please ignore rows containing NaN.")
Maximum worker:  6
Executing queries: 100%|██████████| 4/4 [00:00<00:00, 11008.67it/s]
Query failed for (-10, -10). Reason: index 0 is out of bounds for axis 0 with size 0

[12]:
array([(2, 0, 4.1, 4.5, 4.02589989, 4.44239998, 255., 2., 0., 0.),
       (2, 1, 4.2, 4.6, 4.16470003, 4.58119965, 255., 2., 0., 0.),
       (2, 2, 4.3, 4.7, 4.30354977, 4.71999979, 255., 2., 0., 0.)],
      dtype=[('count', '<i8'), ('query_index', '<i8'), ('x-coordinates', '<f8'), ('y-coordinates', '<f8'), ('avg_x', '<f8'), ('avg_y', '<f8'), ('Phase_mean', '<f8'), ('BS_mean', '<f8'), ('Phase_std', '<f8'), ('BS_std', '<f8')])
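
As a rough illustration of what a reducer specification describes, the sketch below applies the same list of (callable, column names) pairs to a small structured array in plain NumPy. It is not the pyxc Reducer implementation: apply_spec is a hypothetical helper, the neighbour values are made up, and the "<column>_<function name>" naming is only inferred from column names such as “Phase_std”.

```python
import numpy as np

# Same shape as the specification passed to Reducer above.
spec = [(np.mean, ["BS", "Phase"]), (np.std, ["BS", "Phase"])]

# Hypothetical stand-in for the two nearest neighbours of one query point.
neighbours = np.array(
    [(255.0, 2.0), (255.0, 2.0)],
    dtype=[("BS", "<f8"), ("Phase", "<f8")],
)


def apply_spec(rows, spec):
    """Hypothetical helper: reduce each named column with each callable."""
    out = {"count": len(rows)}
    for func, columns in spec:
        for column in columns:
            # Each callable receives a 1-D array and returns one value.
            out[f"{column}_{func.__name__}"] = float(func(rows[column]))
    return out


summary = apply_spec(neighbours, spec)
print(summary)
```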

Query performance tip

Please use a small cutoff and a small output_number. As the timings below show, reducing cutoff from 10 to 1 makes the query about 9 times faster (7.33 ms versus 807 µs), and lowering output_number reduces the time further.

[13]:
%%timeit
layer_ebsd.query(5, 5, cutoff=10, output_number=1000)
7.33 ms ± 179 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
[14]:
%%timeit
layer_ebsd.query(5, 5, cutoff=1, output_number=1000)
807 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
[15]:
%%timeit
layer_ebsd.query(5, 5, cutoff=1, output_number=10)
584 µs ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
[16]:
%%timeit
layer_ebsd.query(5, 5, cutoff=1, output_number=1)
530 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
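
The relative speed-ups can be worked out directly from the mean timings reported above. The snippet only does arithmetic on those four numbers; it does not re-run the benchmark, and timings on other machines will differ.

```python
# Mean timings from the %%timeit cells above, in microseconds.
timings_us = {
    "cutoff=10, output_number=1000": 7330.0,
    "cutoff=1, output_number=1000": 807.0,
    "cutoff=1, output_number=10": 584.0,
    "cutoff=1, output_number=1": 530.0,
}

# Speed-up of each configuration relative to the slowest one.
baseline = timings_us["cutoff=10, output_number=1000"]
speedups = {name: baseline / t for name, t in timings_us.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

Most of the gain comes from the smaller cutoff. Reducing output_number from 1000 to 1 at the same cutoff brings a further, more modest improvement.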